
Research Commons

A space and place for those seeking help with research-related needs.

Glossary A-B


Access

[1] Mechanisms for obtaining or retrieving data or information from an instrument, sensor network, storage device, or data center.

[2] Rights to download or use data.

Accuracy An important factor in assessing the success of data mining.  When applied to data, accuracy refers to the rate of correct values in the data.  When applied to models, accuracy refers to the degree of fit between the model and the data; this measures how error-free the model's predictions are.  Since accuracy does not include cost information, it is possible for a less accurate model to be more cost-effective.  Also see precision.
Activation function

A function used by a node in a neural net to transform input data from any domain of values into a finite range of values.

The original idea was to approximate the way neurons fire: the activation function took the value 0 until the input became large enough, at which point the value jumped to 1.  The discontinuity of this 0-or-1 function caused mathematical problems, and sigmoid-shaped functions (e.g., the logistic function) are now used.
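As an illustration, here is a minimal Python sketch (standard library only; the function names are illustrative) contrasting the original 0-or-1 activation with the logistic function:

```python
import math

def step(x):
    """The original 0-or-1 activation: fires once the input passes a threshold."""
    return 1.0 if x >= 0 else 0.0

def logistic(x):
    """A sigmoid-shaped replacement: smooth and differentiable,
    mapping any real input into the finite range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# The logistic curve approaches the step function as |x| grows
print(step(-3), logistic(-3))
print(step(3), logistic(3))
```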

Antecedent When an association between two variables is defined, the first item (or left-hand side) is called the antecedent.  For example, in the relationship "When a prospector buys a pick he buys a shovel 14% of the time," "buys a pick" is the antecedent.

API

An Application Program Interface.

When a software system features an API, it provides a means by which programs written outside of the system can interface with the system to perform additional functions.  For example, a data mining software system may have an API which permits user-written programs to perform such tasks as extracting data, performing additional statistical analysis, creating specialized charts, generating a model, or making a prediction from a model.

Archive To place or store data in a data center; typically done for ensuring long-term preservation of the data and to promote discovery and use of the data.  Such a service records, organizes, and stores (digital or physical) items in optimal conditions, with standardized labeling to ensure longevity and continued access.
ASCII A character-encoding scheme based on the English alphabet, used to represent text in computers.

Associations

An association algorithm creates rules that describe how often events have occurred together.  For example, "When prospectors buy picks, they also buy shovels 14% of the time."  Such relationships are typically expressed with a confidence interval.


Attribute

A part of an element that provides additional information about that element.  XML elements can have attributes that further describe them:

<Price currency="Euro">25.43</Price>

In this example, "currency" is an attribute of "Price", and the attribute's value is "Euro".
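The attribute in the example above can also be read programmatically; this is a minimal Python sketch using the standard library's ElementTree module:

```python
import xml.etree.ElementTree as ET

# Parse the Price element from the glossary example
elem = ET.fromstring('<Price currency="Euro">25.43</Price>')
print(elem.tag)                  # element name
print(elem.attrib["currency"])   # attribute value
print(elem.text)                 # element content
```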


Attribution

Acknowledgment of the role that an individual, group, institution, or research sponsor played in support of a research project and the resulting products (e.g., papers and data).



Backpropagation

A training method used to calculate the weights in a neural net from the data.


Backup

A copy (or copies) of digital data to be stored and used as a replacement (or data restoration) in case the main copy is either deleted or corrupted.  A backup service does not provide the same service as an archive; i.e., it does not provide for access by data consumers or individuals other than the data owner or IT support.

Best Practice

Methods or approaches that are typically recognized by a community as being correct or most appropriate for acquiring, managing, analyzing, and sharing data.


Bias

In a neural network, bias refers to the constant terms in the model.  (Note that bias has a different meaning for most data analysts.)

Also see precision.

Binning

A data preparation activity that converts continuous data to discrete data by replacing a value from a continuous range with a bin identifier, where each bin represents a range of values.  For example, age could be converted to bins such as 20 or under, 21-40, 41-65, and over 65.
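The age example above can be sketched in Python; the bin labels follow the entry, and the function name is illustrative:

```python
def bin_age(age):
    """Map a continuous age onto the glossary's example bins."""
    if age <= 20:
        return "20 or under"
    elif age <= 40:
        return "21-40"
    elif age <= 65:
        return "41-65"
    return "over 65"

ages = [15, 33, 50, 70]
print([bin_age(a) for a in ages])
```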

Bootstrapping Training data sets are created by re-sampling with replacement from the original training set, so data records may occur more than once.  In other words, this method treats a sample as if it were the entire population.  Usually, final estimates are obtained by taking the average of the estimates from each of the bootstrap test sets.
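A minimal Python sketch of this procedure, using a small made-up data set (names and values are illustrative):

```python
import random

def bootstrap_means(data, n_sets=1000, seed=42):
    """Re-sample with replacement and collect the mean of each bootstrap set."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_sets):
        sample = [rng.choice(data) for _ in data]   # records may repeat
        means.append(sum(sample) / len(sample))
    return means

data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
estimates = bootstrap_means(data)
# Final estimate: the average of the per-set estimates
print(sum(estimates) / len(estimates))
```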


Glossary C

Canonical Formats

“In information technology, canonicalization is the process of making something [conform] with some specification . . . and is in an approved format.  Canonicalization may sometimes mean generating canonical data from non-canonical data.”[i]  Canonical formats are widely supported and considered to be optimal for long-term preservation.

[i] From the glossary of Social Science Terms:

Clifford Lynch. "Canonicalization: A Fundamental Tool to Facilitate Preservation and Management of Digital Information." D-Lib Magazine, Sept 1999; 5(9).


CART

Classification And Regression Trees.

CART is a method of splitting the independent variables into small groups and fitting a constant function to the small data sets.  In categorical trees, the constant function is one that takes on a finite small set of values (e.g., Y or N, low or medium or high).  In regression trees, the mean value of the response is fit to small connected data sets.

Categorical Data

Categorical data fits into a small number of discrete categories (as opposed to continuous).  Categorical data is either:

  • non-ordered (nominal) such as gender or city,
  • ordered (ordinal) such as high, medium, or low temperatures.

CHAID

An algorithm for fitting categorical trees.  It relies on the chi-squared statistic to split the data into small connected data sets.


Child Element

An XML element that is contained within another element.


Chi-Squared

A statistic that assesses how well a model fits the data.  In data mining, it is most commonly used to find homogeneous subsets for fitting categorical trees as in CHAID.

Classification Refers to the data mining problem of attempting to predict the category of categorical data by building a model based on some predictor variables.
Classification Tree

A decision tree that places categorical variables into classes.

Cleaning (Cleansing)

Refers to a step in preparing data for a data mining activity.  Obvious data errors are detected and corrected (e.g., improbable dates) and missing data is replaced.

Clustering Clustering algorithms find groups of items that are similar.  For example, clustering could be used by an insurance company to group customers according to income, age, types of policies purchased and prior claims experience.  It divides a data set so that records with similar content are in the same group, and groups are as different as possible from each other.  Since the categories are unspecified, this is sometimes referred to as unsupervised learning.

Coding

Creating a program that can be understood and acted upon by a computer.

Confidence Confidence of rule "B given A" is a measure of how much more likely it is that B occurs when A has occurred.  It is expressed as a percentage, with 100% meaning B always occurs if A has occurred.  Statisticians refer to this as the conditional probability of B given A.  When used with association rules, the term confidence is observational rather than predictive.  (Statisticians also use this term in an unrelated way: there are ways to estimate an interval, and the probability that the interval contains the true value of a parameter is called the confidence of the interval.  So a 95% confidence interval for the mean has a probability of 0.95 of covering the true value of the mean.)
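The prospector example can be computed directly.  This Python sketch uses a small hypothetical set of transactions:

```python
# Each transaction is the set of items bought together (hypothetical data).
transactions = [
    {"pick", "shovel"},
    {"pick"},
    {"pick", "shovel", "lantern"},
    {"shovel"},
    {"pick"},
]

def confidence(antecedent, consequent, transactions):
    """P(consequent | antecedent): the fraction of transactions containing
    the antecedent that also contain the consequent."""
    with_a = [t for t in transactions if antecedent in t]
    with_both = [t for t in with_a if consequent in t]
    return len(with_both) / len(with_a)

print(confidence("pick", "shovel", transactions))  # 2 of 4 pick-buyers
```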
Confusion Matrix

A confusion matrix shows the counts of the actual versus the predicted class values.  It shows not only how well the model predicts, but also presents the details needed to see exactly where things may have gone wrong.
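A minimal Python sketch of building such a matrix from hypothetical actual and predicted labels:

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Counts of (actual, predicted) label pairs."""
    counts = Counter(zip(actual, predicted))
    return {(a, p): counts.get((a, p), 0) for a in labels for p in labels}

actual    = ["Y", "Y", "N", "N", "Y", "N"]
predicted = ["Y", "N", "N", "Y", "Y", "N"]
cm = confusion_matrix(actual, predicted, ["Y", "N"])
for (a, p), n in cm.items():
    print(f"actual={a} predicted={p}: {n}")
```

The off-diagonal cells (actual and predicted disagree) show exactly where the model went wrong.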

Companion Metadata

Data documentation that accompanies data.

CMS (Content Management System)

A computer system for enabling multiple users to share, edit, and publish content usually on the web.  A CMS underpins a website enabling many people who have been granted permission to add and edit content. Ex: Drupal and WordPress.


Confidentiality

In the context of data management, confidentiality can be thought of as information privacy.

Privacy protects access to an individual, while confidentiality protects access to information about an individual.  If data are not confidential, personal privacy can be compromised.


Consequent

When an association between two variables is defined, the second item (or right-hand side) is called the consequent.  For example, in the relationship "When a prospector buys a pick, he buys a shovel 14% of the time", "buys a shovel" is the consequent.


Continuous Data

Continuous data can have any value in an interval of real numbers.  That is, the value does not have to be an integer.  Continuous is the opposite of discrete or categorical.

Cross Validation

A method of estimating the accuracy of a classification or regression model.  The data set is divided into several parts, with each part in turn used to test a model fitted to the remaining parts.
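The fold-splitting step can be sketched as follows (the function name is illustrative; fitting and testing the model on each split is omitted):

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k roughly equal test folds."""
    folds = []
    fold_size, remainder = divmod(n, k)
    start = 0
    for i in range(k):
        end = start + fold_size + (1 if i < remainder else 0)
        folds.append(list(range(start, end)))
        start = end
    return folds

# Each fold serves as the test set once; the remaining parts form the training set.
for test_idx in k_fold_indices(10, 3):
    train_idx = [i for i in range(10) if i not in test_idx]
    print("test:", test_idx, "train size:", len(train_idx))
```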


Curation

The act of managing digital items held within an archive over the long term.  It is an active process, involving maintaining, preserving, and adding value to archived items 'throughout their lifecycle'.


Cyberinfrastructure

Infrastructure consisting of systems for computing and data storage, repositories, and computing tools linked by networks, providing more powerful capabilities for discovery and innovation.



Glossary D


Data

Values collected through record keeping or by polling, observing, or measuring, typically organized for analysis or decision making.  More simply, data is facts, transactions, and figures.


Database

An organized, structured set or collection of data, accessed via a (database) management system.  The data can then be queried in a consistent manner.  A database can be classified by the type of content included in it (e.g., bibliographic, statistical, document-text) or by its application area (e.g., biological, geological, etc.).

Data Catalog

From Techopedia:

A data catalog belongs to a database instance and is comprised of metadata containing database object definitions like base tables, views, synonyms, and indexes. . . . A data catalog ensures capabilities that enable any users, from analysts to data scientists or developers, to discover and consume data sources.  It usually provides a crowd-sourcing option.
Data Center A facility that contains computers and data storage devices and that is used for data storage and transmission (e.g., acquiring data from providers and making data available to users).  Data centers frequently provide curation and stewardship services, access to data products, user help desk support and training, and sometimes support data processing activities and other value-added services.
Data Dictionary

Contains structured data names; a repository of data (metadata) defining and describing the data resource.  Includes:

Entities—

  • Define a person, place, or thing about which data can be stored
  • Must be clearly understood before attributes can be named or defined

Attributes (data elements)—

  • Describe the inherent nature of the data
  • NOT the entity that the attribute contains information about
  • NOT the uses of the data (where, when, how, or by whom)
  • NOT the codes and values the codes represent

Above from USGS

See also Thesaurus
Data Documentation

The metadata or information about a data product (e.g., data table, database) that enables one to understand and use the data.  Such information may include the scientific context underlying the data as well as who collected the data, why the data were collected, and where, when, and how the data were collected.

Data Entropy

Normal degradation in information content associated with data and metadata over time.

Data Format Data items can exist in many formats such as text, integer, and floating-point decimal.  Data format refers to the form of the data in the database.
Data Mining

An information extraction activity whose goal is to discover hidden facts contained in databases.  Using a combination of machine learning, statistical analysis, modeling techniques and database technology, data mining finds patterns and subtle relationships in data and infers rules that allow the prediction of future results.  Typical applications include market segmentation, customer profiling, fraud detection, evaluation of retail promotions, and credit risk analysis.

Data Mining Method

Procedures and algorithms designed to analyze the data in the databases.


DBMS

DataBase Management Systems.

Decision Tree

A tree-like way of representing a collection of hierarchical rules that lead to a class or value.

Degree of Fit A measure of how closely the model fits the training data.  A common measure is r-squared.
Dependent Variable

The dependent variables (outputs or responses) of a model are the variables predicted by the equation or rules of the model using the independent variables (inputs or predictors).

Deployment After the model is trained and validated, it is used to analyze new data and make predictions.  This use of the model is called deployment.
Deposit The act of submitting data, as to a repository.
Derived Data Set A new dataset created by using multiple existing datasets and their data elements as sources.  Also refers to a new dataset created by adding newly collected data to a single existing dataset used as a source.
Digital Curation

A newer term: "Digital curation is . . . about maintaining and adding value to a trusted body of digital information for future and current use; specifically, the active management and appraisal of data over the entire life cycle.  Digital curation builds upon the underlying concepts of digital preservation whilst emphasizing opportunities for added value and knowledge through annotation and continuing resource management.  Preservation is a curation activity, although both are concerned with managing digital resources with no significant (or only controlled) changes over time."

Dimension Each attribute of a case or occurrence in the data being mined.  Stored as a field in a flat-file record or a column of a relational database table.
Discover The act of finding new data.

Discrete

A data item that has a finite set of values.  Discrete is the opposite of continuous.

Discriminant Analysis

A statistical method based on maximum likelihood for determining boundaries that separate the data into categories.

Dissemination The act of spreading widely.  Data dissemination refers to making data available from one or multiple sources.
Document Management System

A computer system to enable efficient management of large quantities of documents while they are in active use and editing; usually accessible by many permitted users.  SharePoint is an example.


Digital Object Identifier

DOIs are unique, alphanumeric strings assigned to one digital object.  A DOI is one type of persistent identifier that is permanently assigned to a specific electronic resource, thus enabling the location of the digital object(s) on the internet.

A DOI name takes the form of a character string divided into two parts, a prefix and a suffix, separated by a slash.  DOIs are generated through a registration agency (RA); a common RA is CrossRef.

See also: persistent identifiers


Document Type Definition

A DTD provides a list of the elements, attributes, comments, notes, and entities contained in the document as well as their relationships to one another.


Glossary E - H


An XML element is the central building block of any XML document.

XML is a markup language that is used to store data in a self-explanatory manner. Making the data "self-explanatory" comes about by containing information in elements. If a piece of text is a title then it will be contained within a "title" element.

Example—in the following, book, chapter, title, and intro are elements:

    <book>
      <chapter>
        <title>The Beginning</title>
        <intro>blah blah blah. . .</intro>
      </chapter>
    </book>

Entropy A way to measure variability other than the variance statistic.  Some decision trees split the data into groups based on minimum entropy.
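Shannon entropy over a collection of categorical values can be computed as in this Python sketch (the data values are illustrative):

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a collection of categorical values."""
    counts = Counter(values)
    n = len(values)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

print(entropy(["Y", "Y", "Y", "Y"]))  # no variability
print(entropy(["Y", "N", "Y", "N"]))  # maximum for two classes
```

A split that minimizes the weighted entropy of the resulting groups produces the most homogeneous groups.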
Exploratory Analysis

Looking at data to discover relationships not previously detected.  Exploratory analysis tools typically assist the user in creating tables and graphical displays.

External Data

Data not collected by the organization, such as data available from a reference book, a government source or a proprietary database.



Feed-Forward

A neural net in which the signals only flow in one direction, from the inputs to the outputs.

File Format The specific organization of information in a digital computer file.

File Transfer Protocol

A method to transfer computer files and web pages.

Fuzzy Logic

Fuzzy logic is applied to fuzzy sets where membership in a fuzzy set is a probability, not necessarily 0 or 1.  Non-fuzzy logic manipulates outcomes that are either true or false.  Fuzzy logic needs to be able to manipulate degrees of “maybe” in addition to true and false.


Genetic Algorithms

A computer-based method of generating and testing combinations of possible input parameters to find the optimal output.  It uses processes based on natural evolution concepts such as genetic combination, mutation and natural selection.


Graphical User Interface

From Linux: what is GUI


Header Row

A meaningful name for referencing the content contained in a row or column, as in a spreadsheet.

Hidden Nodes The nodes (see node) in the hidden layers in a neural net.  Unlike input and output nodes, the number of hidden nodes is not predetermined.  The accuracy of the resulting model is affected by the number of hidden nodes.  Since the number of hidden nodes directly affects the number of parameters in the model, a neural net needs a sufficient number of hidden nodes to enable it to properly model the underlying behavior.  On the other hand, a net with too many hidden nodes will overfit the data.  Some neural net products include algorithms that search over a number of alternative neural nets by varying the number of hidden nodes, in the end choosing the model that gets the best results without overfitting.

HyperText Markup Language

What is HTML from w3 schools


Glossary I

Impossible Value

An unreasonable value outside the working range for a parameter, which should be identified in the QA/QC process; for example, an air temperature of 190°C on Earth or a pH of -2 is impossible and should be flagged as such.

Independent Variable

The independent variables (inputs or predictors) of a model are the variables used in the equation or rules of the model to predict the output (dependent) variable.


Induction

A technique that infers generalizations from the information in the data.


Interaction

Two independent variables interact when changes in the value of one change the effect of the other on the dependent variable.

Internal Data Data collected by an organization such as operating and customer data.


Glossary K - L

K-Nearest Neighbor

A classification method that classifies a point by calculating the distances between the point and points in the training data set.  Then it assigns the point to the class that is most common among its k-nearest neighbors (where k is an integer).
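A minimal Python sketch of the method, with a hypothetical two-dimensional training set:

```python
import math
from collections import Counter

def knn_classify(point, training, k=3):
    """training: list of (coords, label) pairs.  Assign the majority label
    among the k nearest training points (Euclidean distance)."""
    dists = sorted(
        (math.dist(point, coords), label) for coords, label in training
    )
    nearest = [label for _, label in dists[:k]]
    return Counter(nearest).most_common(1)[0][0]

training = [((1, 1), "A"), ((2, 1), "A"), ((8, 8), "B"), ((9, 8), "B"), ((1, 2), "A")]
print(knn_classify((2, 2), training, k=3))  # the three nearest points are all class A
```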

Kohonen Feature Map A type of neural network that uses unsupervised learning to find patterns in data.  In data mining it is employed for cluster analysis.


Layer Nodes in a neural net are usually grouped into layers, with each layer described as input, output, or hidden.  There are as many input nodes as there are input (independent) variables and as many output nodes as there are output (dependent) variables.
Leaf A node that is not split further (the terminal grouping) in a classification or decision tree.

Learning

Training models (estimating their parameters) based on existing data.

Least Squares

The most common method of training (estimating) the weights (parameters) of a model by choosing the weights that minimize the sum of the squared deviations of the predicted values of the model from the observed values of the data.
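For simple linear regression the least-squares weights have a closed form; this Python sketch (illustrative data) recovers them:

```python
def least_squares_line(xs, ys):
    """Fit y = a + b*x by minimizing the sum of squared deviations
    (closed-form simple linear regression)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

# Perfectly linear data recovers the true weights exactly
a, b = least_squares_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # intercept 1.0, slope 2.0
```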

Left-Hand Side

When an association between two variables is defined, the first item is called the left-hand side (or antecedent).  For example, in the relationship "when a prospector buys a pick, he buys a shovel 14% of the time", "buys a pick" is the left-hand side.

Live Data Data that is being worked on as part of a research project.  The files with that data will need to be accessed and amended or updated as new data is obtained. 
Logistic Regression (Logistic Discriminant Analysis) A generalization of linear regression.  It is used for predicting a binary variable (with values such as yes/no or 0/1).  An example of its use is modeling the odds that a borrower will default on a loan based on the borrower’s income, debt and age.


Glossary M


Markup

The characters and codes that change a text document into an XML or other markup-language document.  This includes the < and > characters as well as the elements and attributes of a document.


MARS

Multivariate Adaptive Regression Splines.  MARS is a generalization of a decision tree.

Maximum Likelihood

Another training or estimation method.  The maximum likelihood estimate of a parameter is the value of a parameter that maximizes the probability that the data came from the population defined by the parameter.

Mean The arithmetic average value of a collection of numeric data.
Median The value in the middle of a collection of ordered data.  In other words, the value with the same number of items above and below it.
Meta-Analysis An analysis that combines the results of many studies.

Metadata

Data that provides descriptive information (content, context, quality, structure, and accessibility) about a data product and enables others to search for and use the data product.

Metadata Editing Tool

A software tool to input, edit, and view metadata; output from the tool is metadata in a standard Extensible Markup Language (XML) format.

Metadata Format

Standardized structure and consistent content for metadata, usually in machine-readable Extensible Markup Language (XML) that can be represented in other human-readable formats (e.g., HTML, PDF, etc.).  Standards for metadata include DC (Dublin Core), EML (Ecological Metadata Language), FGDC (Federal Geographic Data Committee), ISO 19115, DIF (Directory Interchange Format), and many others.

Metadata Standards

Requirements for metadata documentation that are intended to ensure correct use and interpretation of the data by its owners and users.  Different communities use different sets of metadata standards.  

For a complete list by discipline open this link to DCC (Digital Curation Centre)    Disciplinary Metadata Standards

Missing Data

Data values can be missing because they were not measured, not answered, were unknown or were lost.  Data mining methods vary in the way they treat missing values.  Typically, they ignore the missing values, or omit any record containing missing values, or replace missing values with the mode or mean or infer missing values from existing values.

Missing Value

A value that is not in the data file, because the information / sample was not collected, lost, not analyzed, or an impossible value, etc.  A missing value code indicates that a value is missing and a missing value flag [a categorical parameter] describes the reason that the value was missing.


Mode

The most common value in a data set.  If more than one value occurs the same number of times, the data is multi-modal.
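The mean, median, and mode described in the entries above can be computed with Python's standard statistics module (the data values are illustrative):

```python
import statistics

data = [2, 3, 3, 5, 7, 10]
print(statistics.mean(data))    # arithmetic average
print(statistics.median(data))  # middle value of the ordered data
print(statistics.mode(data))    # most common value
```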


Model

An important function of data mining is the production of a model.  A model can be descriptive or predictive.

Descriptive: helps in understanding underlying processes or behavior.  For example, an association model describes consumer behavior.

Predictive: an equation or set of rules that makes it possible to predict an unseen or unmeasured value (the dependent variable or output) from other, known values (independent variables or input).

The form of the equation or rules is suggested by mining data collected from the process under study.  Some training or estimation technique is used to estimate the parameters of the equation or rules.


MPP

Massively Parallel Processing.

A computer configuration that is able to use hundreds or thousands of CPUs simultaneously.  In MPP each node may be a single CPU or a collection of SMP CPUs.  An MPP collection of SMP nodes is sometimes called an SMP cluster.  Each node has its own copy of the operating system, memory, and disk storage, and there is a data or process exchange mechanism so that each computer can work on a different part of a problem.  Software must be written specifically to take advantage of this architecture.


Glossary N


Nesting

Placing one element inside another.  When two tags are opened, they must be closed in the reverse order.

In HTML, elements can be ‘improperly’ nested:

    <b><i>This text is bold and italic</b></i>

In XML, elements must be “properly nested”:

    <b><i>This text is bold and italic</i></b>

"Properly nested" simply means that since the <i> element is opened inside the <b> element, it must be closed inside the <b> element.

Neural Network

A complex nonlinear modeling technique based on a model of a human neuron.  A neural net is used to predict outputs (dependent variables) from a set of inputs (independent variables) by taking linear combinations of the inputs and then making nonlinear transformation of the linear combinations using an activation function.

It can be shown theoretically that such combinations and transformations can approximate virtually any type of response function.  Thus, neural nets use large numbers of parameters to approximate any model.  Neural nets are often applied to predict future outcomes based on prior experience.  For example, a neural net application could be used to predict who will respond to a direct mailing.

Node A decision point in a classification (i.e., decision) tree.  Also, a point in a neural net that combines input from other nodes and produces an output through application of an activation function.

Noise

The difference between a model and its predictions.  Sometimes data is referred to as noisy when it contains errors such as many missing or incorrect values or when there are extraneous columns.

Non-Applicable Data

Missing values that would be logically impossible (e.g., pregnant males) or are obviously not relevant.


Non-Proprietary

In the public domain; not protected by patent, copyright, or trademark.


Normalize

A collection of numeric data is normalized by subtracting the minimum value from all values and dividing by the range of the data.  This yields data with a similarly shaped histogram but with all values between 0 and 1.  It is useful to do this for all inputs into neural nets and also for inputs into other regression models.  (See standardize.)
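A minimal Python sketch of this min-max normalization (the values are illustrative):

```python
def normalize(values):
    """Rescale so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(normalize([10, 20, 15, 30]))  # [0.0, 0.5, 0.25, 1.0]
```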


Glossary O


OAIS

The Open Archival Information System reference model, an ISO standard (ISO 14721).


OLAP

On-Line Analytical Processing tools give the user the capability to perform multi-dimensional analysis of the data.


Ontology

A framework for interrelated concepts within a domain; an ontology would link the terms "water vapor", "relative humidity", and "H2O vapor pressure", so that a user searching for one would also see the other related terms and their relationships.

Optimization Criterion

A positive function of the difference between predictions and data; estimates are chosen so as to optimize the function or criterion.  Least squares and maximum likelihood are examples.


Open Researcher and Contributor ID

". . . is a persistent digital identifier that distinguishes you from every other researcher and, through integration in key research workflows such as manuscript and grant submission, supports automated linkages between you and your professional activities ensuring that your work is recognized."  This is particularly important if your name is a common one.

You can link your ORCID to other professional information such as Researcher ID (Web of Science), Google Scholar, LinkedIn, Scopus . . .
Outliers Technically, outliers are data items that did not (or are thought not to have) come from the assumed population of data—for example, a non-numeric value when you are expecting only numeric values.  A more casual usage refers to data items that fall outside the boundaries that enclose most other data items in the data set.

Overfitting

A tendency of some modeling techniques to assign importance to random variations in the data by declaring them important patterns.


Overlay Data

Data not collected by the organization, such as data from a proprietary database that is combined with the organization's own data.


Glossary P

Parallel Processing Several computers or CPUs linked together so that each can be computing simultaneously.

Parameter

A variable and measurable factor that determines or characterizes a system.


Pattern

Analysts and statisticians spend much of their time looking for patterns in data.  A pattern can be a relationship between two variables.  Data mining techniques include automatic pattern discovery that makes it possible to detect complicated non-linear relationships in data.  Patterns are not the same as causality.


Portable Document Format

A file format that provides an electronic image of text or text and graphics that looks like a printed document and can be viewed, printed, and electronically transmitted.

Persistent Identifier

Globally unique numeric and / or character strings that reference a digital object.  Persistent identifiers:

  • Can be actionable providing access to the digital resource via a persistent link
  • Are intended to function for the long term
  • Allow datasets to be tracked and cited
  • Encourage access, discovery, and potential reuse of datasets
There are several systems, though the most common is the DOI (see above).

Precision

The precision of an estimate of a parameter in a model is a measure of how variable the estimate would be over other similar data sets.  A very precise estimate would be one that did not vary much over other similar data sets.  Precision does not measure accuracy.  See accuracy.

Accuracy is measured by the average distance over different data sets of the estimate from the real value.  Estimates can be accurate but not precise, or precise but not accurate.  A precise but inaccurate estimate is usually biased, with the bias equal to the average distance from the real value of the parameter.


Predictability

Some data mining vendors use predictability of associations or sequences to mean the same as confidence.


Preserve

Format and document data for long-term storage and potential use.


Prevalence

The measure of how often the collection of items in an association occurs together as a percentage of all the transactions.  For example, "In 2% of the purchases at the hardware store, both a pick and a shovel were bought."


Privacy

In the context of data management, confidentiality can be thought of as information privacy.

Privacy protects access to an individual, while confidentiality protects access to information about an individual.  If data are not confidential, personal privacy can be compromised.

Provenance History of a data file or data set, including collection, transformations, quality control, analyses, or editing.

Pruning

Eliminating lower-level splits or entire sub-trees in a decision tree.  This term is also used to describe algorithms that adjust the topology of a neural net by removing (i.e., pruning) hidden nodes.


Glossary Q-R

Quality Assurance

A set of activities to ensure that the data are generated and compiled in a way that meets the goal of the project.

Quality Control

Testing or other activities designed to identify problems in the data.

Quality Level Flag An indicator within the data file that identifies the level of quality of a particular data point or data set.  Flags can be defined within the metadata.



Range

The range of data is the difference between the maximum value and the minimum value.  Alternatively, range can include the minimum and maximum, as in “The range is from 2 to 8.”


RDBMS

Relational Database Management System.

Regression Tree

A decision tree that predicts values of continuous variables.

Relational Database

A collection of tables [often called relations] with defined relationships to one another.


Repository

A somewhat general term used to refer to a destination designated for data storage.

Helpful sites to locate discipline-specific data repositories:

DataCite Registry of Research Data Repositories


Simmons College Data Repository Listing    Primarily for data


Simmons College Disciplinary Repository Listing   Primarily for texts

Research Data Management

The process of managing research data and the services and policies that support those activities.  Good RDM is a critical element in all disciplines.

Research Information Information (data) ABOUT research, as opposed to data produced as a PRODUCT of research.  Research information can include data describing the people, places, funders, activities, and other entities that form part of the research process.  It might describe who does what research, with whom, where, and funded by whom.

Research Information Management

The activity of managing research information.  The process of keeping information about research current and making sure that those who need access to the data are able to obtain it.

Re-substitution Error

The estimate of error based on the differences between the predicted values of a trained model and the observed values in the training set.


Reuse

Using data for a purpose other than that for which it was collected.

Right-hand Side

When an association between two variables is defined, the second item is called right-hand side (or consequent).  For example in the relationship “When a prospector buys a pick, he buys a shovel 14% of the time,” “buys a shovel” is the right-hand side.
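The “14% of the time” figure in this example is the rule’s confidence: the support of the whole rule divided by the support of the left-hand side.  A small sketch with invented transactions:

```python
# Toy transactions; each is the set of items in one purchase.
transactions = [
    {"pick", "shovel"},
    {"pick"},
    {"pick", "rope"},
    {"shovel"},
]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs):
    """How often the right-hand side occurs, given the left-hand side."""
    return support(lhs | rhs) / support(lhs)

# "pick" appears in 3 of 4 transactions; "pick and shovel" in 1.
print(confidence({"pick"}, {"shovel"}))  # 1/3
```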

R-squared A number between 0 and 1 that measures how well a model fits its training data.  A value of one indicates a perfect fit, while zero implies the model has no predictive ability.  It is computed as the square of the correlation between the predicted and observed values, i.e., the square of their covariance divided by the product of their standard deviations.


Glossary S


Sampling

Creating a subset of data from the whole.  Random sampling attempts to represent the whole by choosing the sample through a random mechanism.
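A simple random sample can be drawn with Python’s standard library; the population here is illustrative:

```python
import random

random.seed(0)
population = list(range(1, 101))        # the "whole": integers 1 to 100
sample = random.sample(population, 10)  # simple random sample, size 10

print(len(sample))                      # 10
print(all(x in population for x in sample))  # every sampled value is real
```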

Scientific workflow

A precise description of scientific procedure, often conceptualized as a series of data ingestion, transformation, and analytical steps.

Scripted program

A program [requiring a command line interface or similar] that performs an action or task.  Scripts can be saved, modified, and re-used as necessary.

Sensitivity analysis

Varying the parameters of a model to assess the change in its output.
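For example, one might vary a single parameter of a toy model and watch how the output moves; the growth model and values below are invented purely for illustration:

```python
def model(x, rate):
    """Toy compound-growth model: value of x after 10 steps."""
    for _ in range(10):
        x *= (1 + rate)
    return x

baseline = model(100.0, rate=0.05)

# Perturb the rate parameter and record the change in output.
for rate in (0.04, 0.05, 0.06):
    out = model(100.0, rate)
    print(rate, round(out, 1), round(out - baseline, 1))
```

A large change in output for a small change in a parameter marks that parameter as one the model is sensitive to.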

Sequence discovery

The same as association, except that the time sequence of events is also considered.  For example, “Twenty percent of the people who buy a VCR buy a camcorder within four months.”

SGML Standard Generalized Markup Language is an international standard that describes the relationship between a document’s content and its structure.

Significance

A probability measure of how strongly the data support a certain result (usually of a statistical test).  If the significance of a result is said to be 0.05, it means that there is only a 0.05 probability that the result could have happened by chance alone.  Very low significance (less than 0.05) is usually taken as evidence that the data mining model should be accepted, since events with very low probability seldom occur.  So if the estimate of a parameter in a model showed a significance of 0.01, that would be evidence that the parameter belongs in the model.


SMP

Symmetric Multi-Processing is a computer configuration where many CPUs share a common operating system, main memory, and disks.  They can work on different parts of a problem at the same time.


Stable

Something that is unlikely to become obsolete or undergo significant change due to changes in version, development, funding, etc.


Standardize

A collection of numeric data is standardized by subtracting a measure of central location (such as the mean or median) and by dividing by some measure of spread (such as the standard deviation, interquartile range, or range).  This yields data with a similarly shaped histogram with values centered around 0.  It is sometimes useful to do this with inputs into neural nets and also inputs into other regression models.  (See normalize)
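A minimal sketch of the mean/standard-deviation version, with toy numbers:

```python
import statistics

data = [12.0, 15.0, 9.0, 18.0, 6.0]   # toy data

mean = statistics.mean(data)  # measure of central location
sd = statistics.stdev(data)   # measure of spread

standardized = [(x - mean) / sd for x in data]

# The result is centred on zero with unit spread.
print(abs(statistics.mean(standardized)) < 1e-9)       # True
print(abs(statistics.stdev(standardized) - 1) < 1e-9)  # True
```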

Stewardship The act of caring for, preserving, or improving over time.
Supervised learning

The collection of techniques where analysis uses a well-defined (known) dependent variable.  All regression and classification techniques are supervised.


Support

The measure of how often the collection of items in an association occurs together as a percentage of all the transactions.  For example, “In 2% of the purchases at the hardware store, both a pick and a shovel were bought.”


Glossary T

Test data A data set independent of the training data set, used to fine-tune the estimates of the model parameters (i.e., weights).
Test error

The estimate of error based on the difference between the predictions of a model on a test data set and the observed values in the test data set when the test data set was not used to train the model.
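To make the distinction with re-substitution error concrete, this sketch fits a one-parameter least-squares model y = b·x on a toy training set, then compares its error on the training data (the re-substitution error) with its error on held-out test data:

```python
import statistics

# Toy data: y is roughly 2x plus noise; the second list is held out.
train = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 8.0)]
test  = [(5, 9.8), (6, 12.3)]

# Least-squares slope for the no-intercept model y = b*x.
b = sum(x * y for x, y in train) / sum(x * x for x, y in train)

def mse(pairs):
    """Mean squared error of the fitted model on a set of (x, y) pairs."""
    return statistics.mean((y - b * x) ** 2 for x, y in pairs)

resubstitution_error = mse(train)  # error on the data the model saw
test_error = mse(test)             # error on unseen data
print(round(b, 3), round(resubstitution_error, 4), round(test_error, 4))
```

On this toy data the test error comes out larger than the re-substitution error, which is the typical pattern: a model usually fits the data it was trained on better than data it has never seen.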


Thesaurus

A thesaurus is a structured list of preferred terms or subjects that indicates relationships between those terms.  Preferred terms are focal points where all information about a concept is collected.  Relationships between preferred terms can be broad, narrow, or related in another way.

A thesaurus also indicates non-preferred terms, which are terms indexers and searchers should not use. A good thesaurus makes clear what a term is meant to cover by providing preferred terms, their relationships with other preferred terms, and non-preferred terms.
Time series

A series of measurements taken at consecutive points in time.  Data mining products which handle time series incorporate time-related operators such as moving average.  (Also see windowing.)
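A moving average of a time series can be sketched in a few lines of Python; the series below is invented:

```python
series = [3.0, 5.0, 7.0, 9.0, 11.0, 13.0]  # toy measurements over time

def moving_average(values, k):
    """Average of each run of k consecutive measurements."""
    return [sum(values[i:i + k]) / k for i in range(len(values) - k + 1)]

print(moving_average(series, 3))  # [5.0, 7.0, 9.0, 11.0]
```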

Time series model

A model that forecasts future values of a time series based on past values.  The model form and training of the model usually take into consideration the correlation between values as a function of their separation in time.

Tool A device or implement [in this instance a piece of software] that is used to carry out an action or series of actions.

Topology

For a neural net, topology refers to the number of layers and the number of nodes in each layer.


Training

Another term for estimating a model’s parameters based on the data set at hand.

Training data

A data set used to estimate or train a model.

Transformation A re-expression of the data such as aggregating it, normalizing it, changing its unit of measure, or taking the logarithm of each data item.


Glossary U-X

Unsupervised learning

This term refers to the collection of techniques where groupings of the data are defined without the use of a dependent variable.  Cluster analysis is an example.



Validation

The process of testing the models with a data set different from the training set.

Variance The most commonly used statistical measure of dispersion.  The first step is to square the deviation of each data item from the average value.  Then the average of the squared deviations is calculated to obtain an overall measure of variability.
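The two steps translate directly into Python (toy data):

```python
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # toy data

mean = sum(data) / len(data)                      # the average value
squared_devs = [(x - mean) ** 2 for x in data]    # square each deviation
variance = sum(squared_devs) / len(squared_devs)  # average squared deviation

print(mean, variance)  # 5.0 4.0
```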

Version

The version of a dataset can refer to either:

  1. The numerical tag attached to a dataset as it is updated
  2. The type of version, for example raw data as collected, processed data, or a subset of anonymized data generated as supplementary data for an article
Version control

The task of keeping a software system consisting of many versions and configurations well organized.

Visualization Visualization tools graphically display data to facilitate better understanding of its meaning.  Graphical capabilities range from simple scatter plots to complex multi-dimensional representations.



Windowing

Used when training a model with time series data.  A window is the period of time used for each training case.  For example, if we have weekly stock price data that covers fifty weeks, and we set the window to five weeks, then the first training case uses weeks one through five and compares its prediction to week six.  The second case uses weeks two through six to predict week seven, and so on.
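The construction described above can be sketched directly; toy prices stand in for the fifty weeks of stock data:

```python
# Fifty weeks of (toy) prices; window = 5 weeks, each case
# predicts the following week.
prices = list(range(100, 150))

window = 5
cases = []
for start in range(len(prices) - window):
    inputs = prices[start:start + window]  # e.g. weeks 1-5
    target = prices[start + window]        # e.g. week 6
    cases.append((inputs, target))

print(len(cases))   # 45 training cases
print(cases[0])     # ([100, 101, 102, 103, 104], 105)
print(cases[1][1])  # 106
```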



XML

A language that renders elements in a document machine-readable, essentially enabling computers to analyze and perform operations on texts.  XML works by tagging particular characters, words, or passages in a text, so that a computer knows that they have certain characteristics, or belong to a certain set, and should therefore be processed in a particular way when so instructed.  Metadata about datasets is also often expressed as XML to ensure that machines can interpret contextual information about data according to a particular standard.

An initiative of the World Wide Web Consortium (W3C), the eXtensible Markup Language is a simple dialect of SGML.  It was designed for ease of implementation and for interoperability with both SGML and HTML.
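As a small illustration, Python’s standard `xml.etree.ElementTree` module can read tagged values out of a metadata record; the record below is made up:

```python
import xml.etree.ElementTree as ET

# A toy metadata record; the tags tell the machine what each value means.
record = """
<dataset>
  <title>Hardware Store Transactions</title>
  <creator>Research Commons</creator>
  <year>2014</year>
</dataset>
"""

root = ET.fromstring(record)
print(root.find("title").text)  # Hardware Store Transactions
print(root.find("year").text)   # 2014
```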