
Research Commons

A space and place for those seeking help with research-related needs.

Overview: what to include in this section

What metadata will be provided to help others identify and discover the data?

Describe the content, formats, and internal relationships of the data in detail

Project Documentation

Dataset Documentation

  • Rationale and context for data collection
  • Variable names and descriptions
  • Data collection methods
  • Explanation of codes and classification schemes used
  • Structure and organization of data files
  • Algorithms used to transform data (may include computer code)
  • Data sources used (data citation)
  • File format and software (including version(s)) used
  • Transformations of the data from its raw form through analysis
  • Information on confidentiality, access, and use conditions

Components for Standards: you'll need information from each of the sections below.

Where your data is stored and how it is kept secure throughout the research process are critical considerations. Funding agencies may require that you retain data for a given period and will likely ask you to explain in your data management plan how you will store and back it up, and how you will manage the security of and access to your data. If you will be working with large data sets (with larger storage and backup needs), you should contact your departmental IT staff.

Be sure to include who will be responsible for ensuring that files are stored and backed-up properly. Funding agencies are increasingly looking for details related to “roles and responsibilities.”

Storage is the act of keeping your data in a secure location that you can access readily. Files in storage should be the working copies that you access and change regularly.

Backup is the practice of keeping additional copies of your data in physical or cloud locations separate from your files in storage. Backup copies are the copies you would turn to in the case of data loss or when you need to recover a previous version of your work.

Storage systems often provide mirroring, in which data is written simultaneously to two drives. This is not the same thing as backup since alterations in the primary files will be mirrored in the second copy.

A good rule for storing and backing up copies of your work is LOCKSS (Lots Of Copies Keep Stuff Safe): keep each copy as physically far apart from the others as possible, so that a single natural disaster, such as a fire or flood in the lab where the research is being performed, cannot destroy them all. A useful way to apply this is the rule of three: keep three copies of your data, keep the two backup copies on different devices or storage media, and keep one backup copy off-site. This might look like the following (a short scripted sketch follows below):

  • One copy in active storage. This is a copy you are regularly accessing and working on during your research. It will likely be on your computer or a lab’s shared network drive.
  • A second copy on a different device on- or off-site, such as an external hard drive in your office or a backup server provided by your IT department.
  • A third copy, preferably off-site. This might be on a cloud application like Box, Google Drive, or another appropriate cloud solution.

Courtesy UW-Madison
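As a concrete illustration of the rule of three above, here is a minimal sketch in Python that copies a working file to a second on-site device and to an off-site, cloud-synced folder. The file paths, drive names, and cloud folder are hypothetical; substitute the locations your lab and IT department actually provide.

```python
import shutil
from pathlib import Path

# Hypothetical locations -- replace with the storage your lab and IT department provide.
WORKING_COPY = Path("~/projects/study/data/survey_results.csv").expanduser()  # copy 1: active storage
ONSITE_BACKUP = Path("/Volumes/lab_backup_drive/study")                       # copy 2: different device, on-site
OFFSITE_BACKUP = Path("~/Box/study_backups").expanduser()                     # copy 3: cloud-synced, off-site

def back_up(working_copy: Path, destinations: list[Path]) -> None:
    """Copy the working file to each backup destination, preserving timestamps."""
    for dest in destinations:
        dest.mkdir(parents=True, exist_ok=True)
        shutil.copy2(working_copy, dest / working_copy.name)
        print(f"Backed up {working_copy.name} to {dest}")

if __name__ == "__main__":
    back_up(WORKING_COPY, [ONSITE_BACKUP, OFFSITE_BACKUP])
```

In practice you would schedule something like this (or use your institution's backup service) rather than run it by hand, but the structure mirrors the rule of three: one working copy and two backups on separate devices, one of them off-site.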

Other Considerations:

If any of the following policies affect the management of your data, you will need to address them in a DMP, as they will affect how you can store and share your data.

  • HIPAA (Health Insurance Portability and Accountability Act)
    • HIPAA’s Privacy Rule sets protections for the privacy of protected health information (PHI) and sets limits on sharing it.
  • FISMA (Federal Information Security Modernization Act of 2014)
    • FISMA works to ensure strong protection of federal information and information systems against cybersecurity threats.
  • FERPA (Family Educational Rights and Privacy Act)
    • FERPA protects the privacy of student education records and applies to all schools that receive funding from the U.S. Department of Education.

Data becomes useful when it has meaning and context associated with it. The most common way to bring context to data is by applying metadata (description and documentation of your data) and through supplementary files, such as a data dictionary. Documenting your data is important for sharing it, so that other researchers can understand how to access, view, and possibly reuse it.

  • Data dictionaries should provide the key information about the data that you will be collecting, and are used to explain what the variable names and values in a dataset really mean. They are most commonly used when working with tabular data or creating a database (see the sketch after this list).
  • README files are documents in plain text (.txt) or markdown (.md) format that are often used to describe software packages, programming scripts, and datasets, and can also be used for research projects. A README should include information about the creators of the files it describes, a list of the files included in the set, relevant funder information, and any associated research outputs, such as articles or presentations. It should also include a citation for the dataset, as well as for any byproducts of the research data that was collected and used. For more information about creating a README, see Cornell’s “Guide to Writing ‘README’ Style Metadata.”
  • A data paper differs from a research paper in that it is “used to present large or expansive data sets, accompanied by metadata which describes the content, context, quality, and structure of the data” (Ecological Society of America). The Ecological Society of America provides a guide on writing a data paper.
  • A codebook provides descriptions and definitions about the variables and values included in a dataset to assist users in interpreting the data for potential replication or reuse. Codebooks provide variable names and a description for what each variable represents, each variable’s type, the format that the values for each variable should be in, and the range of values, if applicable.
  • Metadata is the description and documentation of your data. There are different ways to provide it, depending on the discipline that you are working in and the types and formats of data that you are collecting. The method that you use to describe your data will depend on the project, your team, and the complexity of your data. At a minimum, the documentation for your data should contain the information required to reuse the data being described.

Courtesy UW-Madison
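As a minimal sketch of the kind of data dictionary described above, the Python example below builds a small dictionary for a hypothetical tabular dataset and saves it as a CSV that can travel with the data. The variable names, types, and allowed values are invented for illustration.

```python
import pandas as pd

# Hypothetical variables from a survey dataset -- names, types, and allowed
# values are invented for illustration only.
data_dictionary = pd.DataFrame(
    [
        {"variable": "participant_id", "description": "Unique participant identifier",
         "type": "string", "allowed_values": "P0001-P9999"},
        {"variable": "age", "description": "Age at enrollment, in years",
         "type": "integer", "allowed_values": "18-99"},
        {"variable": "consent", "description": "Consent to data sharing",
         "type": "categorical", "allowed_values": "yes; no"},
    ]
)

# Save alongside the dataset so the documentation and the data travel together.
data_dictionary.to_csv("data_dictionary.csv", index=False)
print(data_dictionary.to_string(index=False))
```

The same table, expanded with coding schemes and value labels, can also serve as the starting point for a codebook.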

“Data scientists spend 80% of their time cleaning and manipulating data and only 20% of their time actually analyzing it.” (IBM Analytics)

Data cleansing (also known as data cleaning) is the process of detecting and correcting (or deleting) untrustworthy, inaccurate, or outdated information in a data set, archive, table, or database. It helps you identify incomplete, incorrect, inaccurate, or irrelevant parts of the data so that you can replace, modify, or delete the bad data. Data cleaning can be performed interactively with data wrangling tools, or as batch processing through scripting.
There are five data cleansing steps (a short scripted sketch follows the list):
  1. Standardize your data
  2. Validate your data
  3. De-duplicate your data
  4. Analyze data quality
  5. Find out if you have a data quality problem
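A minimal sketch of what some of these steps can look like in a script, using pandas on a hypothetical table. The column names, example values, and validation rules are assumptions for illustration, not a prescribed workflow.

```python
import pandas as pd

# Hypothetical raw table -- column names and rules are assumptions for illustration.
df = pd.DataFrame({
    "name": [" Alice ", "BOB", "Bob", None],
    "age": [34, 29, 29, 210],
    "city": ["Madison", "madison ", "Madison", "Chicago"],
})

# 1. Standardize: trim whitespace and normalize case in text columns.
for col in ["name", "city"]:
    df[col] = df[col].str.strip().str.title()

# 2. Validate: flag values outside an expected range (here, plausible ages).
df["age_valid"] = df["age"].between(0, 120)

# 3. De-duplicate: drop rows that repeat the same name and age.
df = df.drop_duplicates(subset=["name", "age"])

# 4-5. Assess data quality: count missing and invalid values to see whether
#      there is a quality problem worth fixing upstream.
print(df)
print("Missing values per column:\n", df.isna().sum())
print("Rows failing age validation:", int((~df["age_valid"]).sum()))
```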

A data catalog is a structured collection of data used by an organization. It is a kind of data library in which data is indexed, well organized, and securely stored. Most data catalog tools record information about the source of the data, how it is used, and the relationships between entities, as well as data lineage, which describes the origin of the data and tracks changes to the data through to its final form.

The catalog informs users about the available data sets and the metadata around a topic, and helps users locate them quickly.
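As a rough sketch of the kind of information a catalog entry typically records, here is a small Python example. The field names and values are hypothetical and do not follow any particular catalog tool's schema.

```python
# A hypothetical catalog entry; field names and values are illustrative only.
catalog_entry = {
    "dataset": "2024_field_survey",
    "description": "Cleaned responses from the 2024 field survey",
    "source": "Lab survey platform export",            # where the data came from
    "owner": "Example research group",
    "usage": "Restricted to approved project members",  # access and use conditions
    "related_entities": ["participants", "sites"],      # relationships between entities
    "lineage": [                                         # origin and changes to final form
        "2024-03-01 raw export",
        "2024-03-05 de-identified",
        "2024-03-10 cleaned and merged with site table",
    ],
}

print(catalog_entry["dataset"], "-", catalog_entry["lineage"][-1])
```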

You can contact your departmental IT staff for help finding a suitable catalog.


Data curation is the active and ongoing management of data through its lifecycle of interest and usefulness to scholarship, science, and education. Data curation enables data discovery and retrieval, maintains data quality, adds value, and provides for re-use over time through activities including authentication, archiving, management, preservation, and representation.

-Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign

A data dictionary is a set of important information (metadata) about the data used within an organization. This information includes names, definitions, and attributes of data elements, along with the owners and creators of data assets. Data dictionary tools provide insight into the meaning and purpose of data elements, and they record aliases, the scope and characteristics of data elements, and the rules for their usage and application. (UW Madison)

The OSF (Open Science Framework) provides a tutorial on how to make a data dictionary for tabular data.