
Research Commons

A space and place for those seeking help with research-related needs.

Before you begin . . .

Every text and data mining project is unique, so timelines vary widely. A successful, time-efficient project begins with planning.

Step one is to work through the outline below:

  • What are the goals of my project?
  • What data sources are available that meet my needs? What form should my results take?
    • The text analysis method you choose will depend on your research question.
  • What funding will this project require?
    • How much money is needed, and what are the potential sources?
  • What skills are needed to carry out this project?

TDM (Text and Data Mining) is frequently a fair use under US copyright law, but some resources may be restricted by license agreements.

  • Be aware that some vendors restrict automatic downloading; some may require purchase of the data.
  • Each user is responsible for ensuring that products are used solely for noncommercial, educational, scholarly, or research purposes. Systematic downloading, distribution of content to unauthorized users, and indefinite retention of substantial portions of information are strictly prohibited.
  • The use of software such as scripts, agents, or robots is generally prohibited and may result in loss of access to these resources for the entire Illinois State University community.
  • Please refer to the Illinois State University Information Technology Policy.

TDM (Text and Data Mining) is a broad term for a developing set of research practices that involve building and processing a corpus: a collection of text that may contain millions or even billions of words. Remember that less is more as you build your corpus (see below); a focused, deliberately selected collection helps you avoid hidden biases and unpredictable gaps in coverage.

A source or database describes data collected without a strong organizing principle.

A corpus describes data collected and organized with specific questions in mind about certain geographic regions, time periods, or social phenomena.

Source: University of Pennsylvania TDM Guide
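To make the distinction concrete, here is a minimal Python sketch that filters a loose pile of text files (a source) down to a corpus organized around a specific time period. The directory name, filename convention, and date range are all hypothetical; adapt them to your own materials.

```python
import re
from pathlib import Path

# Hypothetical layout: plain-text files named like "chicago_tribune_1923-05-01.txt"
SOURCE_DIR = Path("source_texts")
DATE_PATTERN = re.compile(r"(\d{4})-\d{2}-\d{2}")

def build_corpus(start_year: int, end_year: int) -> dict[str, str]:
    """Keep only documents relevant to the research question:
    here, texts published between start_year and end_year."""
    corpus = {}
    for path in sorted(SOURCE_DIR.glob("*.txt")):
        match = DATE_PATTERN.search(path.name)
        if match and start_year <= int(match.group(1)) <= end_year:
            corpus[path.name] = path.read_text(encoding="utf-8")
    return corpus

if __name__ == "__main__":
    corpus = build_corpus(1920, 1929)
    total_words = sum(len(text.split()) for text in corpus.values())
    print(f"{len(corpus)} documents, {total_words} words")
```

The filter condition is where your research question lives: swapping the date test for a keyword, region, or genre test produces a different corpus from the same source.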

  “Data scientists spend 80% of their time cleaning and manipulating data and only 20% of their time actually analyzing it.” IBM Analytics

Data cleansing (also known as data cleaning) is the process of detecting and correcting (or removing) untrustworthy, inaccurate, or outdated records from a data set, archive, table, or database. It helps you identify incomplete, incorrect, inaccurate, or irrelevant parts of the data so that you can replace, modify, or delete them. Data cleaning can be performed interactively with data-wrangling tools or as batch processing through scripting.
There are five data cleansing steps (a scripted sketch follows the list):
  1. Standardize your data
  2. Validate your data
  3. De-duplicate your data
  4. Analyze data quality
  5. Find out if you have a data quality problem
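
As a batch-processing illustration of the first three steps (standardizing, validating, and de-duplicating), here is a minimal sketch using pandas. The column names, sample values, and validation rule are hypothetical, not a prescribed recipe.

```python
import pandas as pd

# Hypothetical data set: article metadata with messy entries.
df = pd.DataFrame({
    "title": ["The Great Flood ", "the great flood", "Harvest Report"],
    "year": ["1923", "1923", "19x7"],
})

# 1. Standardize: trim whitespace and normalize case.
df["title"] = df["title"].str.strip().str.lower()

# 2. Validate: coerce years to numbers; invalid entries become NaN and are dropped.
df["year"] = pd.to_numeric(df["year"], errors="coerce")
df = df.dropna(subset=["year"])

# 3. De-duplicate: remove rows that became identical after standardizing.
df = df.drop_duplicates()

print(df)
```

Steps 4 and 5 are typically exploratory rather than scripted: inspecting summary statistics (for example, df.describe()) and counting missing values (df.isna().sum()) will tell you whether you have a data quality problem worth a deeper pass.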

Software Tools