A primer on Research Data Management

Introduction

As the amount of data used by academic researchers and private businesses continues to grow, proper management of that data becomes a critical component of a well-defined data handling strategy. Generally speaking, data management can be defined as the practice of collecting, storing, and using data securely, efficiently, and cost-effectively. The main goal of data management is to help groups and organizations optimize their use of data and comply with policies and regulations related to the use of data. Proper data management allows organizations to make data-driven decisions (Oracle). Data management is important across all parts of a project, including creating, accessing, and updating data, storing data across multiple platforms, providing disaster recovery, ensuring data privacy and security, and archiving and destroying data when required.

In a research setting, data management helps improve research workflows, making them more efficient and reproducible. Research data management can be defined as the process of providing organization, storage, and access to data across all of the stages of a research project (Harvard). A generic research life cycle will have a design and plan phase, a collect and analyze phase, a publish and archive phase, and a data reuse phase, and it is important to develop a data management plan that encompasses the entire research life cycle. Creating a plan for managing data is essential since it allows researchers to find data more easily and to take precautions against data loss, which will make the data easier to analyze and organize.

Characteristics of Good Data Management

Data Documentation

In order to incorporate data management effectively, it is vital to properly document your data. The first step in doing this is to properly cite your data. If you are writing a formal paper, you should generally cite your sources in the appropriate format (MLA, APA, etc.), as we do at the end of this post. You should also incorporate in-text citations, and most importantly, you need to give readers the ability to find the data you used so that they can replicate the analysis. Without that, it is difficult to establish your credibility as an author. If a visualization is made using the data, you can cite the data source directly beneath or beside the image.

Next, it is important to document the variables in your data. In large datasets, it is common to abbreviate some column names to save space. Although this may be acceptable for those specifically working with the data for the original project, it can make it difficult for others to understand what the data means. Thus, when preparing data to be viewed by others, it is important to create a data dictionary so that others know what the data means. For example, for an NFL machine learning competition, the NFL released a large dataset with tracking data. One of the columns is called “a”, and without context, this could be confusing to readers. Thus, at https://www.kaggle.com/competitions/nfl-big-data-bowl-2023/data, they created a data dictionary, explaining what each column represented.
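As a rough illustration of the idea, a data dictionary can be as simple as a two-column table stored next to the data. The sketch below is not how the NFL built theirs; the column descriptions and the helper names write_data_dictionary and undocumented_columns are hypothetical, chosen only to show the shape of the approach.

```python
import csv
import io

# Hypothetical column descriptions for illustration only; consult the
# dataset's own data dictionary for the authoritative definitions.
DATA_DICTIONARY = {
    "x": "Player position along the long axis of the field",
    "s": "Player speed",
    "a": "Player acceleration",
}


def write_data_dictionary(entries, handle):
    """Write column names and descriptions as a two-column CSV."""
    writer = csv.writer(handle)
    writer.writerow(["column", "description"])
    for column, description in entries.items():
        writer.writerow([column, description])


def undocumented_columns(columns, entries):
    """Return the columns that have no entry in the data dictionary."""
    return [c for c in columns if c not in entries]


if __name__ == "__main__":
    buffer = io.StringIO()
    write_data_dictionary(DATA_DICTIONARY, buffer)
    print(buffer.getvalue())
    # Flag any column in the dataset that still needs a description.
    print(undocumented_columns(["x", "s", "a", "o"], DATA_DICTIONARY))
```

Checking the dataset's columns against the dictionary before sharing is a cheap way to catch abbreviations that were never explained.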

After that, once each variable has been documented, it is necessary to include some sort of README file that readers can consult to understand the data. For example, if someone stumbles upon your GitHub repository, which happens to contain a dataset called “tracking-data,” this dataset will likely not mean much to them without further context. Thus, it is important to explain what each dataset represents and what you are trying to do with the data. For instance, at https://github.com/nflverse/nflfastR , the header page (the README) explains how to get started with the data/package, what it does, and links resources that give more information if desired. Furthermore, there is an ‘Issues’ tab for users to go to if they need to contact the contributors to the project.

Data Organization

After data has been properly documented, it is important to organize it as well so that viewers know how to get to the data they want. If there are a large number of files in a dataset, a good organizational strategy is to use folders to organize all of the data. Data with common properties can be placed in the same folders. Another possibility is to put raw data in one folder and current/edited data into another. Once the data has been separated into folders, the folders need to be named appropriately. The names should be concise, but also convey the content in the folder so that the reader or viewer is not confused. For example, a folder name such as “Tracking Data for the NFL 2023 Big Data Bowl” is a very long name, and may be cut off by certain software. Furthermore, it takes the reader a long time to grasp what is in the folder. On the other hand, a folder name such as “trck” can be vague, as it could stand for multiple different words. Something that may work is “tracking_data”: from the README, the reader should know it’s from the 2023 Big Data Bowl, and the title clearly and concisely depicts what the folder contains.
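One way to set up the raw/edited split described above is to create the folders with a short script so the layout is the same every time. This is only a sketch: the folder names “raw” and “processed” and the helper create_layout are assumptions made for the example, not a required convention.

```python
import tempfile
from pathlib import Path

# Hypothetical layout for illustration: raw inputs stay untouched in one
# folder, and edited copies go in another.
FOLDERS = ["tracking_data/raw", "tracking_data/processed"]


def create_layout(root, folders=FOLDERS):
    """Create each folder under root and return the created paths."""
    created = []
    for name in folders:
        path = Path(root) / name
        # parents=True builds intermediate folders; exist_ok=True makes
        # the script safe to run more than once.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


if __name__ == "__main__":
    # A temporary directory keeps the demonstration from touching real data.
    with tempfile.TemporaryDirectory() as root:
        for path in create_layout(root):
            print(path.relative_to(root))
```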

When naming files, keep the names clear and concise, and remember that some programs do not handle spaces properly and may ask you to rename the file. To prevent this, consider using underscores or hyphens to separate words. Next, when using dates in a file name, potentially to separate different versions, use the YYYY-MM-DD format. This ensures that, when placed in alphabetical order, the documents will show up in chronological order. Furthermore, for different versions, ensure that there is version control in place. For example, for the second iteration of tracking data, one could use “tracking_data_v2”.
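These naming rules are easy to apply consistently with a small helper. The sketch below is one possible approach; the function name make_file_name and the exact ordering of description, date, and version are assumptions made for illustration.

```python
import re
from datetime import date


def make_file_name(description, when, version, extension="csv"):
    """Build a file name from a description, a date, and a version number."""
    # Lowercase the description and collapse every run of characters that
    # is not a letter or digit (including spaces) into a single underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", description.lower()).strip("_")
    # date.isoformat() returns the date in YYYY-MM-DD form.
    return f"{slug}_{when.isoformat()}_v{version}.{extension}"


if __name__ == "__main__":
    names = [
        make_file_name("Tracking Data", date(2023, 10, 2), 1),
        make_file_name("Tracking Data", date(2023, 2, 14), 2),
    ]
    # With YYYY-MM-DD dates, alphabetical order is also chronological order.
    print(sorted(names))
```

Because the year comes first and months and days are zero-padded, sorting the names as plain text puts the February file ahead of the October file.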

Data Storage and Archiving

After data has been properly documented and organized, it needs to be uploaded so that others can view it! When considering places to upload data, it is important to think of a few key factors. First, it can often be helpful if the hosting service is one that readers will likely be familiar with. That way, they should already have a general idea of how the data is formatted and how to access different items in the dataset. If a reader is not familiar with the hosting service, then having to learn a new site may prove to be an additional barrier. Next, it is important to consider whether readers can easily access the data from the hosting service. If the dataset is exceptionally large, then using a service like Google Drive or Dropbox may not be optimal, as free accounts on those services have a limited amount of storage. Two common places to store data are GitHub and the Open Science Framework. Both of these services give users a free option to store data and share it with the public, although each places limits on file and project size, so check the current limits before uploading an exceptionally large dataset. Both are widely reputable and are used by many researchers in academic settings.

Resources for Research Data Management

To learn more about research data management at FSU, a great resource can be found at the Research Data Management page on the FSU Libraries website: https://guides.lib.fsu.edu/c.php?g=353199&p=2381998 . Here, the most common questions about research data management are answered, and if you have any further questions, the email addresses for Nick Ruhs and Mila Turner are listed on the page. If you email them, they will be more than happy to respond and set up a meeting if desired. Furthermore, if you would like to contact a librarian live, you can ask through the live chat found at https://www.lib.fsu.edu/help/ask .

Conclusion

After giving an overview of the characteristics of good data management, such as data documentation, organization, storage, and archiving, and the resources for research data management, we want to again emphasize the importance of creating a good research data management plan at the start of a project. Creating a good data management plan will ensure that you are properly tracking your data throughout the research life cycle and are in compliance with the policies and regulations that apply to your data. Overall, data management is an important component of all stages of a research project. Having a proper research data management strategy will allow you to easily search for data, optimize your data usage, and store data, which will in turn allow for easier data analysis.

Works Cited

For a list of references, please use this link: https://bit.ly/3vb8A7K

This blog post was written by Sahil Chugani (Senior STEM Data Fellow) and Reagan Bourne (Senior STEM Data Fellow) from FSU Libraries.