Building Data Sets with FSU’s Digital Library

Data science is all the rage lately. Harvard Business Review even named data scientist the sexiest job of the 21st century. Even though the term is rapidly gaining mind share, many are still confused about what data science actually is. When you cut through the hype, the core of data science is pretty simple: it’s the study of data. What kind of data is being studied, how it is being studied, and what the individual data scientist is looking for all depend on the specific case. Data science is just another field of study using digital methods, putting it firmly under the umbrella of Digital Scholarship.

Now that we have an introductory understanding of what data science is, let’s explore how it works. A data scientist may use any number of tools to analyze the data in question and look for any number of patterns, but before any of that can happen, a data scientist needs to obtain a data set. A data set can contain any kind of data or information, can be in any format, and can come from any source. There are plenty of public data sets available (like Data.gov and the Open Science Data Cloud), and more are being released all the time thanks to efforts like the Open Data Institute.

While it’s definitely interesting to work with publicly available data sets, a huge part of data science is knowing how to find data and transform it into something that can be processed. For an example of how to do this, we’ll take a look at our very own FSU Digital Library (FSUDL). FSUDL contains nearly 20,000 records, making it a rich and interesting source for data sets. In order to investigate the data in FSUDL, we must first obtain it. FSUDL allows users to download their search results as a CSV file. This makes creating new data sets from FSUDL trivially simple; just search for the things you want, and click on the CSV download button (the icon circled in red in the image below). FSUDL will compile all of the data for you into a single file. For instance, say I wanted to make a data set of all the records in FSUDL related to swamps. I could search ‘swamp’ (returning 13 records) and get those records as their own data set:

Once you have a data set in your hands, the real fun begins. Part of what makes data science so cool is that the possibilities are only limited by your imagination; the data is all there, but it’s up to you to discover the patterns in it.
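A reasonable first step is simply loading the downloaded CSV and seeing what is in it. The sketch below uses Python's standard csv module. The column names and sample rows are made up for illustration, since the actual headers in an FSUDL export are not listed here.

```python
import csv
import io


def load_records(csv_file):
    """Read a CSV export into a list of dicts, one dict per record."""
    reader = csv.DictReader(csv_file)
    return list(reader)


# Stand-in for a downloaded file. The column names ("title", "format")
# and the rows are hypothetical, not taken from a real FSUDL export.
sample = io.StringIO(
    "title,format\n"
    "Swamp scene,Image (JPEG)\n"
    "Swamp survey report,Document (PDF)\n"
)

records = load_records(sample)
print(len(records), "records")
print("columns:", list(records[0].keys()))
```

With a real download, the same function could be given an open file instead, for example `open("swamp.csv", newline="", encoding="utf-8")`, where `swamp.csv` is whatever name you saved the export under.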

For an example of what can be done with the data in FSUDL, take a look at the fsudl-data GitHub repository, which contains a data set of every object in FSUDL and a Python script that goes through the CSV file and counts the format of each record. This is an example of processing a data set, and it yields another, smaller data set:

Image (JPEG): 1929
Document (PDF): 900
Image (JPEG2000): 15096
Set of related objects: 69
Video file: 5
E-book: 358
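The script in the repository is not reproduced here, but the idea can be sketched in a few lines. This version assumes the CSV has a header row with a column named `format`; that name is a guess, so check the header of your own file and adjust it. The sample rows are invented.

```python
import csv
import io
from collections import Counter


def count_formats(csv_file, column="format"):
    """Tally how many records share each value in the given column."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)


# Invented sample data; "format" is an assumed column name.
sample = io.StringIO(
    "title,format\n"
    "Record A,Image (JPEG)\n"
    "Record B,Document (PDF)\n"
    "Record C,Image (JPEG)\n"
)

counts = count_formats(sample)
for fmt, n in counts.most_common():
    print(f"{fmt}: {n}")
```

A `Counter` is used because it handles the "have I seen this value before?" bookkeeping for you and can sort the results by frequency with `most_common()`.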

Once you have your data in the format you want, you can go a step further and visualize it to make it easier for other people to understand:

The visualization above was created using vida.io. Click the image for a more interactive version.
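For a quick look without any charting tool, the format counts listed earlier can also be drawn as a rough text bar chart. This is only a sketch in plain Python, not how the vida.io visualization was built; the function name and bar width are arbitrary choices.

```python
# Format counts as reported earlier in this post.
format_counts = {
    "Image (JPEG)": 1929,
    "Document (PDF)": 900,
    "Image (JPEG2000)": 15096,
    "Set of related objects": 69,
    "Video file": 5,
    "E-book": 358,
}


def text_bars(counts, width=40):
    """Return one line per category, with a bar scaled to the largest count."""
    largest = max(counts.values())
    label_width = max(len(label) for label in counts)
    lines = []
    # Largest categories first; every category gets at least one mark.
    for label, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        bar = "#" * max(1, round(n / largest * width))
        lines.append(f"{label:<{label_width}}  {bar} {n}")
    return lines


print("\n".join(text_bars(format_counts)))
```

Even in this crude form, the chart makes it obvious that JPEG2000 images dominate the collection.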

If this all seems interesting and you would like to learn more about data science, there are plenty of resources on the web to choose from, such as the Creative Commons licensed Introduction to Data Science ebook by Jeffrey Stanton of Syracuse University. If you like MOOCs, Coursera has a whole series of classes on data science geared towards beginners. Once you are a seasoned data hacker (or if you already are), get some data from FSUDL and show us what you can do with it!
