Research Commons

A space and place for those seeking help with research-related needs.

What is data mining?

"Data mining is the practice of automatically searching large data sets to discover patterns, extracting information from those data sets and transforming it into a simple, understandable structure. The term does not refer to a single method or technique but to a spectrum of approaches that search for patterns and relationships in data. Data mining draws on both database techniques and AI/machine learning mechanisms, and it provides an excellent opportunity for exploring the relationship between retrieval and inference/reasoning, a fundamental issue concerning the nature of data mining."


What is text mining?

"Text mining, also known as text data mining, is the process of deriving high-quality information from text. It is the set of processes required to get valuable structured information from unstructured text documents or resources. This requires both sophisticated linguistic and statistical techniques that can analyze unstructured text formats and techniques that pair each document with actionable metadata, which acts as a sort of anchor for structuring this type of data. Once content has been annotated, it can automatically be classified, routed, summarized, and visualized through link mapping and, most importantly, it becomes easier to search.

Text mining consists of a broad variety of methods and technologies such as:

  • Keyword-based technologies: The input is based on a selection of keywords in the text that are filtered as a series of character strings, rather than as words or “concepts”.
  • Statistical technologies: These are systems based on machine learning. They use a training set of documents as a model to manage and categorize text.
  • Linguistic-based technologies: This method may leverage language processing systems. The output of text analysis allows a shallow understanding of the structure of the text, the grammar and logic employed. (For a better understanding of how this works, this post on text mining and NLP is helpful.)"
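
To make the keyword-based approach concrete, the following is a minimal Python sketch, not taken from the quoted source. It contrasts matching keywords as raw character strings with matching them as whole word tokens. The function names, the sample sentence, and the simple letters-only tokenizer are all assumptions made for illustration.

```python
import re

def substring_hits(text, keywords):
    """Count keywords as raw character strings (case-insensitive)."""
    lowered = text.lower()
    return {kw: lowered.count(kw.lower()) for kw in keywords}

def token_hits(text, keywords):
    """Count keywords as whole word tokens, for comparison."""
    # Assumption: a token is any run of ASCII letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    return {kw: tokens.count(kw.lower()) for kw in keywords}

# Hypothetical sample sentence, not from any real corpus.
sample = "The miner examined mining records; data mining was not mentioned in mines."

print(substring_hits(sample, ["mining", "mine"]))  # {'mining': 2, 'mine': 3}
print(token_hits(sample, ["mining", "mine"]))      # {'mining': 2, 'mine': 0}
```

The string-based count finds "mine" inside "miner", "examined", and "mines", which shows why filtering on character strings is not the same as matching words or concepts.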


From dataset to text corpus

"Because some online databases can now provide trillions of words of textual data, covering huge historical and geographic spans, researchers getting started in text mining may be tempted to collect 'all the data' and expect text mining tools to produce useful results automatically. Unfortunately, data provided at this scale is rarely organized enough to produce meaningful results without substantial modification. Often, a carefully constructed sampling of the data will yield more useful results than 'all the data,' which is likely to suffer from hidden biases and unpredictable gaps in coverage."
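
One possible way to build such a sample is stratified sampling: group the documents by a metadata field and draw the same number from each group, so that heavily covered periods do not swamp sparse ones. The Python sketch below is only an illustration under assumptions. The `decade` field, the toy corpus, and the function name `stratified_sample` are hypothetical, and the quoted guide does not prescribe this method.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(documents, key, per_group, seed=0):
    """Draw up to `per_group` documents from each group defined by `key`."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    groups = defaultdict(list)
    for doc in documents:
        groups[key(doc)].append(doc)
    sample = []
    for group_key in sorted(groups):
        members = groups[group_key]
        k = min(per_group, len(members))  # small groups contribute all they have
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical toy corpus: coverage is heavily skewed toward one decade.
corpus = [{"id": i, "decade": 1990} for i in range(90)]
corpus += [{"id": 90 + i, "decade": 1890} for i in range(10)]

sample = stratified_sample(corpus, key=lambda d: d["decade"], per_group=5)
print(Counter(d["decade"] for d in sample))
```

Drawing five documents per decade gives both periods equal weight, whereas a simple random sample of the full corpus would be dominated by the 1990s documents.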

From Univ. of Pennsylvania Guide