A Short Dive Into Data Science

Introduction

Data science is a multidisciplinary field that combines math, statistics, programming, and artificial intelligence to analyze large amounts of data and discover the insights embedded in them. These insights can then be used to help data scientists ask and answer questions about data, such as: what happened, why it happened, what will happen, and what can be done with the results (Amazon Web Services). Data science combines different tools and technologies to create meaning from data and interpret it. With the accelerating volume of data being produced, data science is widely used to process data and discover patterns in it through a variety of statistical techniques.

The term data science was first used in the 1960s to describe a new profession focused on the interpretation of data. However, data science in that era looked very different from the field as it exists today. Starting in the 2000s, companies began to see data as a commodity they could capitalize on. This created a need for data scientists who could access, understand, and communicate insights from large quantities of data. In more recent years, the field of data science has grown tremendously, and it is needed in every industry that generates or relies on data.

Data science has completely reshaped the way that we collect, store, and analyze data, and these new capabilities have transformed several industries and improved people’s lives. For example, in finance, data science helps financial institutions make more accurate predictions and manage risk. In healthcare, it allows researchers to predict and prevent certain diseases. In marketing, it allows companies to serve personalized ads based on the data they hold about consumers. However, as data science becomes more widely used in society, it is important to consider the ethical implications of collecting and analyzing large amounts of data. Subjects such as bias, transparency, and privacy are extremely important and relevant in this field.

One important ethical consideration is bias. Data science algorithms can unintentionally inherit bias from the datasets they are trained on, which can lead to discrimination and inequality toward the groups those datasets represent. Transparency is another important consideration: many individuals do not understand how data science is used to make decisions that affect them, which can fuel concerns about the technology and how it is being used. Finally, data privacy is a critical issue, particularly as the amount of data being collected continues to grow exponentially. It is essential to ensure that individuals’ personal information is protected and that data is collected only in ethical ways. Overall, while data science has had many positive impacts on society, it is important to prioritize ethical considerations within data science so that these advances continue to benefit society.

What skills are needed in data science?

To quote the US Bureau of Labor Statistics, “Data scientists use analytical tools and techniques to extract meaningful insights from data.” However, this is an incredibly broad statement by itself. It is often easier to think about data science in terms of the types of tasks done within the discipline. There are several different articles on what skills data scientists need, often with similar but imprecise wording, yet there is general overlap in the skills that are required or recommended for those considering a career in data science.

Quantitative Skills

Broadly speaking, an understanding of mathematics and/or statistics is important for a data scientist. While a data scientist is not necessarily a mathematician or statistician, data scientists often use similar (if not the same) tools to fit mathematical models that can be used for prediction or to make inferences about what correlations exist within the data.
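To make the idea of fitting a model for prediction or inference concrete, here is a minimal sketch in Python. The study-hours and exam-score numbers are made up purely for illustration, and a real analysis would involve far more data and diagnostics.

```python
# A minimal sketch: fit a simple linear model and check a correlation.
# The "hours studied" vs. "exam score" numbers are invented for illustration.
import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
scores = np.array([52, 55, 61, 64, 70, 74, 79, 85], dtype=float)

# Fit a line (degree-1 polynomial): score ~ slope * hours + intercept
slope, intercept = np.polyfit(hours, scores, 1)

# Pearson correlation coefficient between the two variables
r = np.corrcoef(hours, scores)[0, 1]

# Use the fitted model to predict an unseen value
predicted = slope * 9 + intercept

print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.2f}")
print(f"predicted score for 9 hours of study: {predicted:.1f}")
```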

Additionally, there is a general consensus that data scientists need to know about data cleaning (also called wrangling or scrubbing). Regardless of the terminology used to describe it, knowing how complete and incomplete data can and cannot be used is a critical skill. Understanding how pieces of data interact with each other gives a data scientist a far more efficient sense of what they can and cannot infer from the data, as well as how a computer stores the data they need to analyze. While this is typically listed as a separate skill, it often relies on linear algebra to work.
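As a rough picture of what cleaning can look like in practice, here is a hedged pandas sketch. The column names, the messy values, and the median-imputation choice are all hypothetical; the point is that each decision is explicit.

```python
# A hedged sketch of basic data cleaning with pandas; the columns and rules
# below are hypothetical and would depend on the actual dataset.
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "age":    [25, 31, np.nan, 47, 200],               # 200 is an implausible outlier
    "income": ["50000", "62,000", "58000", None, "71000"],
})

clean = raw.copy()
# Coerce messy strings ("62,000") to numbers; unparseable values become NaN
clean["income"] = pd.to_numeric(
    clean["income"].str.replace(",", "", regex=False), errors="coerce")
# Flag impossible ages as missing rather than silently keeping them
clean.loc[~clean["age"].between(0, 120), "age"] = np.nan
# Document how missing values are handled - here, median imputation
clean = clean.fillna(clean.median(numeric_only=True))

print(clean)
```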

Programming/Coding

It should be noted that someone can be a statistician or a mathematician without knowing how to write a single line of code. In fact, according to the Bureau of Labor Statistics, 13% of mathematicians work in education, as do 9% of statisticians. Quantitative work in this realm is often less reliant on code and more about effectively communicating quantitative concepts. However, a professional data scientist absolutely must know how to program, as it is a de facto requirement of almost any data science position. While it would theoretically be possible to calculate statistical trends by hand, this is impractical given the current reality of big data: constructing even one mathematical model with all available data could take a lifetime at best. Consequently, it is far more efficient to have a computer run all of the minute arithmetic calculations instead of working through everything manually.

The list of programming languages a data scientist might need is long enough to warrant its own blog post (in fact, we have covered some in a previous blog post entry). Given the sprawling number of programming languages available, it is generally more important to understand the fundamentals of programming and pick up specific languages as needed. While we cannot make a recommendation for every situation, Python, R, and SQL are good languages for a strong data science foundation; all else being equal, they tend to be the top three languages for most data science positions. Furthermore, artificial intelligence (and more specifically, machine learning), which is gaining prominence in modern data science, can be implemented in the first two.
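As a small illustration of how these languages complement each other, the sketch below uses SQL (through Python’s built-in sqlite3 module) to pull rows out of a database and pandas to summarize them. The table and column names are invented for the example.

```python
# SQL retrieves the rows; Python/pandas summarizes them.
# The "sales" table and its columns are invented for illustration.
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES ('East', 120.0), ('East', 80.0),
                             ('West', 200.0), ('West', 150.0);
""")

df = pd.read_sql_query("SELECT region, amount FROM sales", conn)
print(df.groupby("region")["amount"].agg(["count", "mean", "sum"]))
conn.close()
```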

Business Acumen and Communication

Surprisingly, business acumen (also known as domain knowledge) is frequently mentioned as an essential skill for data scientists. A more general way of thinking about it is “understanding data in context”. In brief, domain knowledge means knowing what the data is and what it means in the context of your work. It may be easy to think about the purpose of your data analysis when you are responsible for the beginning and end of a project – however, jumping into the middle of a project requires some knowledge of what the data is before you can do anything with it.

Similar to business acumen, data scientists also need to know how to communicate their results so that others may understand the insights gleaned from their work. At first glance, one might assume that this predominantly means explaining predictive outputs, in both verbal and written form, in a way that anyone outside the field can understand. Those skills are important; however, visual communication and literacy matter as well. The ability to make data visualizations that effectively show the trends in the data is another critical skill within data science, as graphs can communicate far more ideas at once than words or numbers alone. Further, data visualizations can bypass the need for an audience to understand more complex quantitative concepts.
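For instance, a labeled, sensibly scaled chart can show a trend at a glance. Here is a minimal matplotlib sketch; the monthly sign-up figures are made up for illustration.

```python
# A minimal matplotlib sketch of a labeled chart; the figures are invented.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
signups = [120, 135, 160, 158, 190, 230]

fig, ax = plt.subplots()
ax.plot(months, signups, marker="o")
ax.set_xlabel("Month")                 # label your axes!
ax.set_ylabel("New sign-ups")
ax.set_title("Monthly sign-ups, first half of the year")
ax.set_ylim(0, 250)                    # start at zero to avoid exaggerating the trend
fig.tight_layout()
plt.show()
```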

Miscellaneous Skills and Addendums

While the above skills are frequently mentioned in many articles describing critical skills for data scientists, some other skills receive notable mentions. For instance, DevOps (a project management method) is considered critical enough to the discipline that an entire article exists on using it as a framework for modern data science work (Saxena et al. 2021). While DevOps is the most commonly mentioned project management method within data science, other aspects of project management are also valuable to have.

Further, being able to work with big data and its infrastructure is critical (Georgia Tech Professional Education, 2023). As more and more data is collected by private and public entities, experience with data files that are not easily navigable is of the utmost importance. Additionally, knowing how to locate the data you need within large data warehouses is important regardless of how much data you need.

As mentioned in the introduction, concerns about bias in data and ethical data use have ramifications for consumer privacy. This is such a notable concern that Harvard Business School’s own page on data science skills mentions ethics as something to keep in mind, given how many dilemmas can arise from data breaches, biased algorithm development, and the use of consumer data without consent, among other issues (HBS Online, 2020).

What industries/fields use DS?

As mentioned in the introduction, data science is a combination of the fields of mathematics, statistics, and computer science. Data scientists generally use programming languages such as R and Python to derive analyses and make visualizations from the data given to them. Data science is often seen as a subset of computer science, which focuses on developing new software to perform certain tasks. This can include making a new gaming application, designing a website, or developing a new programming language. Data scientists, by contrast, generally write software to run computational models or to visualize data, utilizing statistical techniques in the process.

Furthermore, data science and statistics are similar fields with a lot of overlap. For the most part, both data scientists and statisticians aim to build models to analyze data. However, data scientists may place a heavier emphasis on data visualization techniques, while statisticians may use more math-heavy models. For example, a statistician may lean more heavily on models such as linear regression, while a data scientist may focus more on models such as neural networks, which attempt to imitate the human brain to make predictions. That being said, a statistician can almost certainly become a data scientist with a little extra training, and vice versa.
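To make the contrast concrete, the hedged sketch below fits the same toy prediction task once with linear regression and once with a small neural network, using scikit-learn. The synthetic data and model settings are purely illustrative.

```python
# The same toy task fitted with a linear model and a small neural network.
# Synthetic data; the settings are illustrative, not recommendations.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X.ravel() + rng.normal(0, 1.0, size=200)   # a noisy linear trend

linear = LinearRegression().fit(X, y)
neural = MLPRegressor(hidden_layer_sizes=(16, 16), max_iter=5000,
                      random_state=0).fit(X, y)

x_new = np.array([[7.5]])
print("linear regression prediction:", linear.predict(x_new)[0])
print("neural network prediction:   ", neural.predict(x_new)[0])
```

On a dataset this simple the two models give similar answers; the differences between them matter more as the data and the relationships within it grow more complex.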

Data science can be used in almost every business worldwide and has many specific applications; there are too many to list in this blog post. Here, we will briefly describe a few of the more common industries and/or fields that use data science. 

One of the most common uses of data science is in the stock market. When looking at trends in a company’s stock, data scientists at stockbroking firms can analyze past stock data and use it to predict how much that stock will be worth in the future. These predictions can then help determine whether the firm should buy or sell that stock.
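As a deliberately simplified sketch of “use past prices to project forward,” the snippet below smooths a short series of invented closing prices with a moving average and treats the latest smoothed value as a naive next-day estimate. Real quantitative models are far more sophisticated than this.

```python
# A naive "predict tomorrow from the recent past" sketch; prices are invented.
import pandas as pd

prices = pd.Series([101.2, 102.5, 101.8, 103.4, 104.0,
                    105.1, 104.7, 106.3, 107.0, 107.8])

# A 3-day moving average smooths day-to-day noise; its last value is a crude
# one-step-ahead estimate ("tomorrow looks like the recent average").
moving_avg = prices.rolling(window=3).mean()
print(moving_avg.tail(3))
print("naive next-day estimate:", round(moving_avg.iloc[-1], 2))
```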

Another example of data science is predicting future revenue, which is useful in virtually every business. Given past data such as previous prices, how well the economy has been doing as a whole, and many other variables, data scientists can predict how much revenue or profit a company may earn in the future. Using this, companies can decide whether to set a certain price for their product or even whether to continue with the business as a whole.

Data science can also be used at almost every company for various HR applications. HR analytics can help a company predict turnover or recruit the right candidates. For example, if a data scientist can analyze past data to determine that a specific role has higher turnover, the company can use that analysis to modify the role and help decrease turnover. Furthermore, data science can be used to determine which candidates may be the best fit for a job by looking at variables from their applications or resumes. For instance, if employees with strong people skills tend to perform better, analysis can be done to identify which candidates have good people skills, which can then translate into a more productive work environment.
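Framed as code, turnover prediction is essentially a classification problem. The hedged sketch below uses a tiny made-up dataset with two invented features (tenure and an engagement score) and logistic regression; a real HR model would use many more variables and far more careful validation.

```python
# Turnover prediction as a toy classification problem; the data and the two
# features (tenure, engagement score) are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [years_at_company, engagement_score]; label 1 = employee left
X = np.array([[0.5, 2], [1.0, 3], [1.5, 2], [2.0, 4], [3.0, 5],
              [4.0, 4], [5.0, 5], [6.0, 4], [7.0, 5], [8.0, 5]])
y = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])

model = LogisticRegression().fit(X, y)

# Estimated probability that a newer hire with a low engagement score leaves
print(model.predict_proba([[1.0, 2]])[0, 1])
```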

Finally, an application of data science that you’re likely well aware of is targeted advertising, which is used by companies such as Facebook, Google, and Spotify. By analyzing data from your previous searches, likes, listening history, and other sources, these companies can serve you the advertisements most relevant to you. For example, if your neighborhood was hosting a baseball tournament, your neighbors might search for items such as “baseball bats” and “how to hit a baseball” on Google. Using this data, and seeing that you’re nearby, Google may then direct advertisements to you for baseball bats and baseball lessons.

Conclusion

If one wanted to be brief, data science could simply be described as the hybrid discipline of computer science, mathematics, and statistics. More specifically, data science can take large amounts of structured and unstructured data and make predictions about what might happen in the near future. Those large amounts of data can also be used to describe phenomena that are currently happening but would otherwise be invisible to us. Furthermore, those predictions can be used to inform business decisions and public policy, predict future gameplay moves and outcomes, or even help create critical medical interventions. However, the above skills can also cause harm if used without consideration for the data itself. Regardless of where or how data science is applied, all of the above skills are necessary to do work in this field.

While we have discussed the use of data science in industries such as financial markets and HR analytics, its applications go far beyond these two fields alone. In fact, the list of fields that do not use data science at all would probably be shorter than the list of fields that do. For instance, one of the graphic design industry’s oldest professional organizations (the American Institute of Graphic Arts, or AIGA) released a report in 2018 on the data economy and how data science will impact graphic artists (Davis 2018). While graphic design is not an industry traditionally thought of as handling data, graphic designers also produce infographics (Stinson 2021). Furthermore, data mining is a critical skill within data science, with applications ranging from market research (Hall n.d.) to finding unused content within video game files (BBC Newsround 2020). If there is an area you are interested in, you can look up “[field of interest] data science” and probably find out how data science is used in that realm.

At the end of the day, though, understanding data science beyond what we have outlined in this post may require experimenting on your own with each of the bits and pieces that make it work. To quote notable data scientist Claudia Perlich, “Learning how to do data science is like learning to ski. You have to do it.” Understanding more about what data science is may require the very same thing: reading more about it and doing it for yourself.

Works Cited

For a list of references, please use this link: https://bit.ly/3NF30zQ

This blog post was written by William-Elijah Clark (Senior STEM Data Fellow), Sahil Chugani (STEM Data Fellow) and Reagan Bourne (STEM Data Fellow) from FSU Libraries.

Evaluating Data Through a Critical and Ethical Lens

Introduction

Data literacy is the combination of a few unique skill sets: statistical literacy, information literacy, and technical proficiency. It also involves being able to visualize, critically evaluate, determine the accuracy and reliability of, and understand data sets. There are many reasons why it is important to be data literate, especially in recent years with the growth of the internet and social media, and data literacy is crucial to many different industries and research areas. It is important to interpret the data you are collecting to make sure the results are accurate, and to understand that data well enough to create useful visualizations for others.

There are a variety of concepts to keep in mind when critically evaluating data. For example, you need to consider the methods that were used to collect the data and whether those methods are ethical. Furthermore, when evaluating how the data is presented, you need to consider whether that representation or visualization is the most accurate way to portray the data. Another particular topic of concern is bias. There are different points at which biases can be introduced, such as when data is collected, when it is analyzed, and when it is shared with the public. Also, if you are critically evaluating your own data, it is important to check that there are no biases within your own work. In this post we will be discussing the critical evaluation of data through the lens of data collection, data presentation and visualization, and data ethics. 

Data Collection

In the context of data collection, several different collection methods can be used for research. Some of these methodologies, such as focus groups, surveys, and participant interviews, are familiar to the public at large. However, there are other specific data collection processes that many people outside of certain academic disciplines may not be aware of, such as web scraping/text mining, phlebotomy procedures for blood tests, observational behavior recording for time series data, and many more.

Consequently, documenting the data itself is important not only for experimental replication but also for interdisciplinary work. Different fields may use data collection methods that researchers in other fields are unaware of, even across seemingly similar disciplines. For example, accounting and finance may seem similar but can interpret monetary data in drastically different ways: accountants and financial analysts calculate differently when a company reaches net zero (i.e., breaks even) between revenues and costs. Even within the same field of research, transparency about how data is collected is important for peer review, whether for ethics accountability or for identifying methodological flaws in research. An incomplete set of data can make it difficult or impossible to know whether the data was collected in a way that prevents bias, and thus whether the data is accurate and/or precise.
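As one concrete anchor for the break-even example, here is the standard textbook accounting calculation in a short Python sketch. The figures are invented, and other framings of “net zero” (the kind of disciplinary difference the paragraph alludes to) would change the arithmetic.

```python
# The standard accounting break-even calculation; all numbers are invented.
fixed_costs = 50_000.0          # rent, salaries, etc. per period
price_per_unit = 25.0           # revenue per unit sold
variable_cost_per_unit = 15.0   # cost that scales with each unit

# Revenue equals cost when: price * q == fixed_costs + variable_cost * q
break_even_units = fixed_costs / (price_per_unit - variable_cost_per_unit)
print(f"Break-even at {break_even_units:.0f} units per period")
```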

Failing to document data and data collection methods can also create problems when reproducing data or using it for further research, particularly if details such as question types, experiment conditions, and units of measure are not properly recorded. For example, while the hypothetical idea of cold fusion (nuclear fusion performed at room temperature) would be a low-cost energy solution, the original experimental methods and data were not recorded, and the concept is now widely viewed with skepticism as a result. A less extreme case is survey design: the way a survey is constructed can bias responses, so documenting how a survey was written can help explain why a study reached a specific conclusion and makes it possible to test whether changing the questions, or even the question order, would change the results.

Furthermore, data cleaning – the process by which incorrectly formatted data, corrupted data, and so on are reformatted or fixed so that they can be used in analysis – can also introduce statistical bias through choices such as eliminating outliers, accidentally losing a variable, or how you decide to categorize your data. Therefore, documenting how you clean your data is also a critical component of research: explaining which outliers you decided to keep or remove, and why, can help both you and other researchers down the road. It is also important to consider the order in which questions are asked and the way they are worded when conducting surveys. While it might seem counterintuitive at first, question order and wording can affect the percentage of people who respond in a certain way, whether potential participants qualify for research projects, and even the numeric values of the data itself.
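One lightweight way to do that documentation is to record each decision in the cleaning script itself, so anyone rerunning the analysis can see (and question) every choice. The thresholds and columns in this sketch are hypothetical.

```python
# Documenting cleaning decisions in the script itself; thresholds are hypothetical.
import pandas as pd

responses = pd.DataFrame({
    "respondent_id": [1, 2, 3, 4, 5],
    "minutes_on_survey": [12, 9, 0.4, 15, 240],
    "age": [34, 29, 41, 23, 37],
})

# DECISION 1: drop responses completed in under 1 minute (likely not genuine),
# recorded here rather than applied silently in a spreadsheet.
too_fast = responses["minutes_on_survey"] < 1
print(f"Dropping {too_fast.sum()} response(s) completed in under 1 minute")

# DECISION 2: keep the 240-minute outlier - a long pause is plausible and the
# answers themselves may still be valid.
cleaned = responses[~too_fast].reset_index(drop=True)
print(cleaned)
```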

Data Presentation and Visualization

Most people have probably heard the phrase “label your axes” at some point, even before college. It is often mentioned in K-12 education, with the premise that someone will not know what your graph is depicting without labels. While this is indeed correct, labeled axes are only one of many components of data presentation and visualization.

Figure 1: Axes that are labeled!

A good place to start on the ways that data visualizations can best be implemented is The Data Visualisation Catalogue. While the site was originally established with graphic designers in mind, Severino Ribecca himself stated, “I felt it would also be beneficial to…anyone in a field that requires the use of data visualisation” (Ribecca n.d.). As such, almost anyone who uses data has to consider how to visually communicate it to an outside audience, or even to the general public outside the realm of academia. A nifty feature of The Data Visualisation Catalogue is that you can filter the recommended visualization types by the concept you are trying to demonstrate.

One consideration when looking at a data visualization is whether the data is represented in a way that is appropriate for that specific data type. While it might not seem like the presentation should differ between data types, certain visualizations depict certain types of data more accurately and sufficiently than others. For instance, time series data and Geographic Information Systems (GIS) mapping data are distinct data types. While they can be combined in the same graphic (e.g., how has the land in a certain area changed over time?), each has its own issues to consider to avoid creating misleading graphics. Namely, you cannot make a map from time data alone, and a line graph meant to show trends over time would make a poor map.

Furthermore, the scales and units used in a data representation are also important considerations! Using our previous example, the visual scales of a map are different from the visual scales of time series data. For instance, you can get drastically different visualizations if you transform data from a linear scale to a logarithmic scale (i.e., a scale that plots data based on the exponent needed to recover each number). This can be useful when the data you are working with spans such a large range that it is hard to see everything efficiently. For example, a logarithmic scale of time, where millions of years are condensed into smaller, easier-to-conceptualize numbers, leads to graphs where you can see things like different geological eras.
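If you want to see the difference for yourself, here is a small matplotlib sketch that plots the same synthetic, exponentially growing series on a linear axis and on a logarithmic one.

```python
# The same synthetic series on a linear axis and a logarithmic axis.
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(1, 11)
y = 10.0 ** x                          # values span ten orders of magnitude

fig, (ax_linear, ax_log) = plt.subplots(1, 2, figsize=(8, 3))

ax_linear.plot(x, y, marker="o")
ax_linear.set_title("Linear scale: early values vanish")

ax_log.plot(x, y, marker="o")
ax_log.set_yscale("log")               # in effect, plot the exponent
ax_log.set_title("Log scale: every order of magnitude is visible")

for ax in (ax_linear, ax_log):
    ax.set_xlabel("x")
    ax.set_ylabel("y")
fig.tight_layout()
plt.show()
```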

On a more human scale, while logarithmic scales can be used to misrepresent data, a far more common tactic involves a truncated or broken axis on a graph (Figures 2a and 2b): a truncated graph deliberately does not start at zero on the y-axis, while a broken axis subtly skips a large number of units. This tactic shows up in some graphics that news outlets use, whether intentionally or not. Other hallmarks of misrepresented data include plotting two graphs that are not on the same scale, or zooming the scale in to make a trend look far larger than it truly is.


Figures 2a and 2b: Graphical Examples of a graph with a broken axis and a graph with a truncated axis, respectively
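To reproduce the effect in Figures 2a and 2b yourself, the sketch below draws the same two (invented) values on a full axis and on a truncated axis; only the y-axis limits change.

```python
# The same two bars on a full axis and a truncated axis; values are invented.
import matplotlib.pyplot as plt

labels = ["Group A", "Group B"]
values = [98.2, 99.1]

fig, (ax_full, ax_truncated) = plt.subplots(1, 2, figsize=(8, 3))

ax_full.bar(labels, values)
ax_full.set_ylim(0, 110)
ax_full.set_title("Full axis: difference looks modest")

ax_truncated.bar(labels, values)
ax_truncated.set_ylim(98, 99.2)        # axis starts just below the data
ax_truncated.set_title("Truncated axis: difference looks dramatic")

fig.tight_layout()
plt.show()
```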

While there are many examples of distinctly misleading graphs, there are also many graphs that accurately portray the data but use an incompatible or inaccessible color palette. Many palettes used in data visualizations are inaccessible to those with vision impairments such as red-green or blue-yellow color blindness. Using distinct, colorblind-friendly palettes helps make visualizations more accessible, and adding alt-text descriptions of what a graph shows allows screen readers and other tools used by people with low vision or blindness to interpret the visualization. Thus, being hard to see or simply looking aesthetically displeasing does not make a graph misleading; that is an important distinction to make (although the two are not mutually exclusive!).


Figure 3: A “Painbow” Graph
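In contrast to the “painbow” above, choosing an accessible palette can be as simple as switching styles. The sketch below uses matplotlib’s built-in tableau-colorblind10 style and pairs the figure with a plain-text description that could be published as alt text; the shipment numbers are invented.

```python
# A colorblind-friendly palette plus a plain-text description for alt text.
# The regional shipment figures are invented for illustration.
import matplotlib.pyplot as plt

plt.style.use("tableau-colorblind10")   # colorblind-friendly default colors

regions = ["North", "South", "East", "West"]
shipments_2021 = [40, 55, 30, 45]
shipments_2022 = [48, 60, 35, 50]

fig, ax = plt.subplots()
ax.plot(regions, shipments_2021, marker="o", label="2021")
ax.plot(regions, shipments_2022, marker="s", label="2022")
ax.set_ylabel("Shipments (thousands)")
ax.legend()
plt.show()

# A description to publish alongside the image for screen readers:
alt_text = ("Line chart comparing shipments by region in 2021 and 2022; "
            "every region increased, with South highest in both years.")
print(alt_text)
```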

Data ethics

When examining a dataset, it is also important to consider whether there are any biases present that may affect interpretation of the data. Two common categories are cognitive biases and statistical/algorithmic biases. Cognitive biases involve individuals interpreting the results of a study to best fit a specific narrative. This may involve a data producer deleting data that does not fit the conclusion they are trying to prove, or adding inaccurate data in an attempt to strengthen their claims. Furthermore, studies may be designed to collect data that represents only a small subset of a population while claiming to be representative of the entire population.

In contrast to cognitive biases, statistical/algorithmic bias refers to a sample that poorly describes the population; it can be significantly mitigated (if not eliminated) when data collection methods are not statistically biased. This is particularly noticeable in artificial intelligence (AI) algorithms. These algorithms are often trained on unequal datasets, which then leads to skewed results when the algorithms are used for data analysis. Therefore, when examining data output by an algorithm, one should consider whether the algorithm was trained on accurate and balanced datasets. One industry where statistical and algorithmic biases are extremely important to consider is healthcare. For example, many hospitals use artificial intelligence to sort through patient data, which helps doctors determine who needs immediate emergency attention. While such algorithms have many benefits, they have also caused problems. In certain instances, if a patient has pre-existing medical conditions that affect their health, the algorithm may not be able to take that into account. In addition, many algorithms that are commonly used in healthcare systems are racially and gender biased. As mentioned in “Algorithmic Bias in Health Care Exacerbates Social Inequities — How to Prevent It” by Katherine Igoe, “algorithms in health care technology don’t simply reflect back social inequities but may ultimately exacerbate them.” Igoe also notes that certain prediction algorithms used for detecting heart disease were biased in their design. For example, the Framingham Heart Study cardiovascular risk score worked very well for Caucasian patients but not for African American patients, because around 80% of the data used to build the algorithm came from Caucasian patients. Using such an unequal dataset to train an algorithm can lead to unequal care and treatment in medical practice (Igoe). This is just one of many examples of bias arising from algorithm design.
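None of the healthcare systems described above publish their code, so the sketch below is only a toy illustration of the underlying statistical problem: a model trained almost entirely on one (synthetic) group performs noticeably worse on an under-represented group whose data looks different.

```python
# A toy illustration of training on an unequal dataset; all data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_group(n, shift):
    """Synthetic records: one feature, with the outcome threshold shifted per group."""
    x = rng.normal(0, 1, size=(n, 1)) + shift
    y = (x.ravel() > shift).astype(int)      # each group has its own threshold
    return x, y

# Group A dominates the training data (95%); group B is under-represented (5%)
xa_train, ya_train = make_group(950, shift=0.0)
xb_train, yb_train = make_group(50, shift=2.0)
model = LogisticRegression().fit(np.vstack([xa_train, xb_train]),
                                 np.concatenate([ya_train, yb_train]))

# Evaluate separately on fresh data from each group
xa_test, ya_test = make_group(1000, shift=0.0)
xb_test, yb_test = make_group(1000, shift=2.0)
print("accuracy on group A:", accuracy_score(ya_test, model.predict(xa_test)))
print("accuracy on group B:", accuracy_score(yb_test, model.predict(xb_test)))
```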

Companies such as Amazon have also faced major problems related to algorithmic bias. A few years ago, Amazon tried to use an artificial intelligence algorithm to hire new employees. However, it turned out that this algorithm was biased against women: it had been trained on resumes submitted during a period when the number of male applicants was significantly higher than the number of female applicants, which ultimately led the algorithm to favor men over women.

Conclusion

Critical evaluation of data is an extremely important skill set for any student or professional. Knowing how to check the reliability, accuracy, and bias of any dataset is necessary when reading or working with data. Some questions to keep in mind are: Is the collection method clear and documented? Is the visualization appropriate for the dataset and for what the author is trying to represent? Was bias introduced at the collection or visualization stage? It is important to evaluate data to ensure that we are using quality, accurate data to make sound decisions and conclusions.

Works Cited

This blog post was written by William-Elijah Clark (Senior STEM Data Fellow) and Reagan Bourne (STEM Data Fellow) from FSU Libraries.

My Experience Attending the Midwest Data Librarian Symposium

The Midwest Data Librarian Symposium (MDLS) is an annual conference aimed at giving Midwestern data librarians, as well as others across the United States, the chance to network and discuss industry issues and topics related to research data management. This year the event was co-hosted by the University of Cincinnati, The Ohio State University, and Miami University, with virtual participation available through Zoom conference calls and presentations. With free registration for all participants, MDLS focuses on providing low-cost networking and educational opportunities for established professionals and the developing librarians of the future. Relatively new to the world of research data management, I was eager to represent FSU, and the entire state of Florida, as the only participant attending from the state. While I could not travel for the in-person programming, the free registration allowed me to actively engage with the virtual presentations and events over Zoom, like many others.

Whether it was a Zoom scavenger hunt or a presentation on a less-discussed subject, like “Making Infographics More Accessible,” I found that each opportunity to engage taught me something new, along with many things I could bring back and put into practice in my own work. The presentations also left me with a lot to contemplate, opening my eyes to information and concepts I had yet to encounter in my own work, like digital curation and data management for filmmakers and documentaries. For example, in the growing filmmaking industry there are often limited resources, especially for independent filmmakers, to meet the costs of effectively preserving their data. With barriers like large file sizes, time constraints, and the threat of file corruption or data loss, documentaries have a much more indirect path to serving as critical sources of historical and cultural documentation.

The vulnerability of data collected in documentaries further illustrates the broader importance of taking serious measures to securely store raw data, especially given its potential relevance to other research. Additionally, metadata’s usefulness in other research frameworks encapsulates the expansive benefits of open science and universal accessibility. Pressures around academic viability, publishing, and performance can make researchers hesitant to relinquish ownership and control of data, which exemplifies the need to create stronger avenues for motivating the open sharing of data even when it is imperfect or incomplete. Procedurally, sharing-upon-request protocols have been imperfect, to say the least, as the decision to distribute data is left at the mercy of the principal investigator of the original research, who may have internal or external factors that motivate, dissuade, or even obstruct their ability to share the data in a timely or consistent manner.

While a variety of topics were covered during the conference, several presentations focused on the new National Institutes of Health (NIH) Data Management and Sharing (DMS) policy that comes into effect at the beginning of 2023. More specifically, there were discussions about the effects of this new policy on data management and sharing, as well as how to prepare and instruct those who need support navigating these changes at the university level. For one of the main presentations on this topic, the authors conducted semi-structured interviews at their university to survey the research data service needs of their constituents and to collect their perspectives on the new federal regulations being put into place. These interviews produced a number of noteworthy observations. Perhaps the most surprising theme was that many researchers and professors were unaware of or unworried about the policy changes, believing that they would be able to adapt their research practices and proposals when the new year began. Others wondered how strictly the new policies would be enforced, especially given the loose criteria for what might qualify as an exception and the fact that some aspects of proposals are not tied to scoring, leaving researchers with little incentive to adopt practices that promote open science. The additional need to recognize and remove protected health information further underscores the importance of collaboration when it comes to research assurance, protocols, and the proper maintenance and storage of data.

These interviews revealed that many students and faculty across the country are uninformed and/or ill-equipped to handle the transition to the new NIH DMS policy in the coming months. Perhaps an even larger takeaway is that general information literacy is relatively low compared with what students need and are expected to know in order to perform adequately in their fields. Adjustments are necessary to overcome deficiencies in standard coursework, which often operates on the assumption that students arrive at their academic institutions already possessing research skills and a working knowledge of information systems, catalogs, and databases. In most cases, an established base of information literacy is required even to locate, or to know of the existence of, the library resources that address these needs. Libraries, and universities more broadly, must promote their services and resources more widely, while also making them more accessible, in order to address this dilemma effectively. Without additional infrastructure to develop these skills, students face a much larger barrier in overcoming the limitations embedded in the university academic framework. Levels of privilege, in terms of access to both technology and experience, must also be taken into account when organizing this instruction.

As always, each institution has its own needs and priorities and is equipped with different resources for developing the systems needed to give its student body enough support to navigate academic challenges. Conferences typically follow the same shared academic code of free exchange that open science is built on. Just look at the publicly accessible research guides that most universities produce and publish, and you can get a real sense of the collaborative instruction that academic libraries strive to achieve. The symposium amplifies this ideal, allowing different institutions to come together to cooperate and exchange ideas through dialogue among like-minded individuals working toward mutual goals.

Preparing for the Midwest Data Librarian Symposium, my impression was that I would simply be attending lectures where most of the learning would happen. However, in addition to the networking events and opportunities, the interconnectedness and interactive components of the entire conference made attending the symposium a much more balanced exchange of ideas and information. MDLS also hosted a Slack channel to promote ongoing discussion and networking, along with shared notes for each presentation and event that all participants could access and contribute to. In addition, many of the presentations longer than the five-minute rapid-fire “Lightning Talks” featured audience involvement, whether through discussion questions, breakout-room consultations, or Jamboard collaborations to exchange ideas on different subjects. The integration of technology was seamless and improved the overall quality of engagement within the presentations and the symposium as a whole. Attending this symposium gave me the chance to consider and discuss countless ideas to bring into practice in my own work. I am grateful for opportunities like these and for experiences that enrich professionals at all stages of their careers within an academic environment of common interests and goals.

Author Bio: Liam Wirsansky is a second-year MSI student at Florida State University and the STEM Libraries Graduate Assistant at FSU’s Dirac Library. He currently serves as the President and Artistic Director of White Mouse Theatre Productions at FSU and acts as the Director of Research and Development for the Rosenstrasse Foundation. Liam loves the academic outlet that research has provided him as well as the opportunity to educate and assist students in the development of their information literacy skills.

If you have any questions regarding the Midwest Data Librarian Symposium (MDLS), please contact the organizers at mwdatalibsym@gmail.com.


Who Has Access? The New OSTP Memo’s Rippling Effects on Publicly Funded Research

The White House Office of Science and Technology Policy (OSTP) made groundbreaking progress at the end of August when it released a memorandum updating its policy guidance to specify that data and results from taxpayer-supported research must be made immediately available and accessible to the public at no cost. OSTP also directed agencies to update their public access policies and data sharing plans as soon as possible to make publications and the research underlying them publicly accessible, without embargo or cost, and in machine-readable formats to enable their full use and reuse.

So what does this truly mean for students and researchers?

For many students, OSTP and the memoranda it released prior to this latest one (which many are calling the Nelson Memo, as it was issued by Dr. Alondra Nelson, currently the acting director of OSTP) are mostly foreign subjects. What is OSTP, and why does it matter? As a graduate student myself, I was surprised to learn about the strides the agency has taken leading up to this memorandum, and about the historical struggle to achieve an open science framework that works for the masses, one that aims to combat the discrimination and structural inequalities inherent in the funding and publishing disadvantages experienced by researchers from underserved backgrounds and minority groups, as well as early-career researchers.

Like many university students, it is easy to take the access we have to library resources, journals, and repositories for granted, especially when they meet our immediate needs. But looking at the world around us and the integration of advancing technology into everyday life, it is clear we live in a data-driven world, which places a premium on the availability of and access to information. Metadata, or data that describes other data, has become one of the most important concepts in the information field, as it allows researchers to organize the data from their own or other projects in a way that is meaningful and often cross-disciplinary in its application. This means data can have unintended benefits and relevance for other researchers’ work, assuming they are able to access it. With the Nelson Memo, access to publicly funded research has been defined and recognized as a right of the public.

Until now, there have been clear barriers in place to protect the interests of academic journals and publishers, and while some of these will still exist even after all of the federal grant-making agencies release their implementation plans, this advance toward open access establishes a clear standard moving forward. It sets the United States apart as a global leader of change in the field of open science. Prior to the Nelson memorandum’s release, Plan S served as the global standard for open access policy guidance. It mandated that publications produced through research grants be immediately open and fully accessible without being monetized in any form, setting the stage for the standard that OSTP wanted to mirror and build upon.

“cOAlition S”, a consortium of national research agencies and funders from twelve European countries formed around the implementation of Plan S, has come out in support of the newest memorandum and OSTP, calling the guidance “fully aligned with the open access policies of many forward looking universities and research agencies who have implemented Plan S” and acknowledging its alignment with the recent UNESCO Recommendation on Open Science, which was adopted by the General Conference of UNESCO at its 41st session last November. Plan S recognizes that we have the necessary elements and collective ability to produce digital content and public goods that can be shared to help shape the vision of one large connected community making up a single body, rather than smaller disjointed organs that mirror each other because they cannot see what the others do. All of this is to say that the paywalls restricting access to research act as hurdles that deny the very nature of science as a tool to better understand and help humanity as a whole.

Globally, we saw the power of open science at work in combating the COVID-19 pandemic and bringing the scientific community together, as commercial journals and governments were forced to alter their typical subscription-based structures in favor of providing temporary open access to COVID-19 and monkeypox related research data. This supported the development of vaccines and ensured that the public had the most credible, data-driven information to inform their health choices and medical practice. Countries across the globe spend billions of dollars on research and experimental development. The United States is no different, with National Science Foundation (NSF) estimates totaling nearly $667 billion for 2019 alone and continuing to grow each year since. One might expect the government funding the research to retain ownership of the data collected and analyzed; however, under the current copyright structure, publicly funded research is often turned over to commercial journals.

One of the largest concerns raised by the newest memo is how the policy changes will affect the viability of the current subscription model, given the important role journals play in supporting research, such as peer review. Publishers were more circumspect about the changes, expressing some skepticism about how the shift to full open access would be funded. To address this issue, researchers can now use research grants and funds to cover the publication costs associated with the new OSTP policies. On the other side of the argument, students stand to benefit from open access journals through the wider exposure their research will receive as the number of entry points for viewing such articles increases. In addition, libraries across the country are strained by the subscription-based journal model and are not in a position to subscribe to every research journal that exists. FSU Libraries subscribes to many journals and databases to provide access for its students, but an increase in publicly funded and openly published research can only expand the body of research, data, and information that student communities here and at other universities can access. Looking toward the future, this relationship with academic journals and publishers must continue to evolve and change.

Ideally, community-owned and -managed public knowledge infrastructure seems to be the long-term solution, but how do we get there? Creative Commons, a non-profit organization and international network devoted to open access and to broadening the scope of educational and creative works made available for others to build upon and share with legal protections, believes we must work on the progression of “open licensing to ensure open re-use rights”. I believe that if we want to move beyond access and toward improved sharing of the information and data we collect, produce, and use, we must follow these steps and support organizations like Creative Commons or the Subcommittee on Open Science, while continuing to expand who contributes to new knowledge. Most importantly, we must stay informed about the latest policy updates and changes, guiding researchers from different backgrounds and at all levels of experience to success.

Committed to the development of open science, Florida State University Libraries is devoted to the free exchange of and access to information on a global scale for the good of people everywhere. This change in policy not only reinforces our mission but also highlights the need for comprehensive support and resources for the students and research our institution hosts. We are thrilled to continue working alongside our researchers, offering a wide array of services and workshops to help navigate these policy changes as they openly share and provide increased access to their work. We will continue to build on this foundation and explore more ways to champion open science at Florida State University and beyond.

For more information about how the FSU Libraries supports open access, please visit our Research and Publishing web page here.

For more specific details or information on the Nelson Memo, please see the White House OSTP announcement, here.

Author Bio: Liam Wirsansky is a second-year MSI student at Florida State University and the STEM Libraries Graduate Assistant at FSU’s Dirac Library. He currently serves as the President and Artistic Director of White Mouse Theatre Productions at FSU and acts as the Director of Research and Development for the Rosenstrasse Foundation. Liam loves the academic outlet that research has provided him as well as the opportunity to educate and assist students in the development of their information literacy skills.

References

Ambrose, M. (2022, September 1). US moves to make federally funded research free upon publication. Physics Today. Retrieved from https://physicstoday.scitation.org/do/10.1063/PT.6.2.20220901a/full/

Anderson, R. (2022, August 28). A new OSTP memo: Some initial observations and questions. The Scholarly Kitchen. Retrieved from https://scholarlykitchen.sspnet.org/2022/08/29/a-new-ostp-memo-some-initial-observations-and-questions/

Elder, A., & O’Donnell, M. (2022, September 7). New White House OSTP memo requires federally funded research be immediately open. Iowa State University Libraries. Retrieved from https://www.lib.iastate.edu/news/new-white-house-ostp-memo-requires-federally-funded-research-be-immediately-open-%C2%A0

Green, C. (2022, August 30). A big win for Open access: United States mandates all publicly funded research be freely available with no embargo. Creative Commons. Retrieved from https://creativecommons.org/2022/08/26/a-big-win-for-open-access/

Plan S. (2022, August 26). cOAlition S welcomes the updated Open Access policy guidance from the White House Office of Science Technology and Policy. Retrieved from https://www.coalition-s.org/coalition-s-welcomes-the-updated-open-access-policy-guidance-from-the-white-house-office-of-science-technology-and-policy/

SPARC. (2022, August 25). Fact sheet: White House OSTP memo on ensuring free, immediate, and equitable access to federally funded research. Retrieved from https://sparcopen.org/our-work/2022-updated-ostp-policy-guidance/fact-sheet-white-house-ostp-memo-on-ensuring-free-immediate-and-equitable-access-to-federally-funded-research/

Stebbins, M. (2013, February 22). Expanding public access to the results of federally funded research. National Archives and Records Administration. Retrieved from https://obamawhitehouse.archives.gov/blog/2013/02/22/expanding-public-access-results-federally-funded-research

Thurston, A. (2022, September 7). Will new White House open access rules impact researchers? The Brink – Pioneering Research for Boston University. Retrieved from https://www.bu.edu/articles/2022/impact-of-new-white-house-open-access-rules-on-researchers/

UNESCO. (2021, November 24). UNESCO recommendation on Open science. Retrieved from https://en.unesco.org/science-sustainable-future/open-science/recommendation

STEM Data Fellow Spotlight: Diego Bustamante

For Love Data Week 2022, we are highlighting our FSU STEM Libraries Data Fellows! These posts, written by the fellows themselves, tell their stories of how they became interested in data-related work and their experience as a data fellow to this point. Today’s post is contributed by Diego Bustamante.

Prior to my role as a Data Fellow, my idea of what data is was shaped by my previous work with quantitative data collected from laboratory experiments. For example, when I worked as a research assistant, I recorded quantitative data for chemistry experiments, such as mass, temperature, and volume. I then conducted statistical analysis on the data in order to draw conclusions from each experiment. I personally enjoy collecting and analyzing data, especially because it can lead to many scientific and technological advancements!

While searching for jobs in FSU’s NoleNetwork in summer 2021, one job title that immediately caught my attention was “FSU STEM Libraries Data Fellow.” The job description was unique among the jobs offered on campus. As a data fellow, I was offered the opportunity to develop several professional skills: providing data reference, co-hosting programming-language workshops, writing and publishing blog posts, and more. It felt like a great opportunity and a good fit with my previous experience and skills, so I decided to apply. Thankfully, I was selected as one of the inaugural data fellows, leading to a journey of professional and personal development that has thus far surpassed my initial expectations.

One of my first tasks in the program was meeting with different librarians at FSU Libraries. In these meetings I was able to learn about different methods and applications for data analysis in a variety of disciplines. For example, I learned that the Digital Humanities Librarian uses text-mining software to find specific words in books published in the 1800s. She used the data drawn from the software to analyze certain traits of a story, such as counting the number of times a character participates in a particular type of interaction. This experience helped me realize that qualitative datasets can be used to draw the same kinds of conclusions as quantitative data.

Another concept that I have become familiar with while working as a Data Fellow is open data. We discussed this concept during a workshop about the potential benefits of making research data openly accessible to the wider research community. Initially, I was hesitant about the concept of open data, because I saw academic research as a “race” to find a solution to a given problem. However, further discussion of how researchers are compensated for sharing their data made me realize that it is possible to benefit from open data on both a personal and a global level.

Currently, I am still learning about the many different types of data, their definitions, applications, and importance. I am also developing an open-source Canvas module on MATLAB, in which I explain the basics of the math-based programming language in a student-friendly manner. I look forward to sharing more about this work in the future!

How do the Pros do Data Analysis?

By: Diego Bustamante and William-Elijah Clark

INTRODUCTION

As technology continues to evolve, the infrastructure needed to run it grows more and more sophisticated. Processes and tasks carried out by personal computers, smartphones, and appliances are increasingly automated and run with minimal input from the user. This is made possible through code developed in one or more computer programming languages. However, with the increase in the quantity of software and programming applications, the demand for programmers, and the number of languages they are required to learn, has increased. Furthermore, many employers now list skills in data analysis and computer programming as prerequisites on job postings. In this blog post, we will discuss the most in-demand languages in the market and give a brief explanation of each. (Grand Canyon University 2020; Jiidee 2020; Meinke 2020; University of California – Berkeley, n.d.)


What is ‘Big Data’ Anyway?

By: Diego Bustamante and William-Elijah Clark

Maybe you’re on Twitter one day and search ‘#Statistics’ to look up some information for your introductory statistics course. Before you know it, you scroll through and see several tweets that are also tagged ‘#BigData’, and you’re left with more questions than you had when you started your search. Maybe you then search for “big data” on Google, see the definition from Oxford, and are left with even more questions:

  • How large is “extremely large?”
  • What kind of patterns, trends, and interactions are we talking about?
  • What isn’t big data?

Big data as a term has become synonymous with the growth of digital data and the glut of information available to researchers and the public. Furthermore, there is growing interest in both the public and private sectors in using large datasets to provide insight into market trends and to improve decision making. However, the exact definition of big data is sometimes unclear and can vary widely depending on who you ask. Businesses, nonprofit organizations, government agencies, and academic researchers each view big data in a different context and with different goals for its use (University of Wisconsin Data Science, n.d.).

Above: a Google Trends graph that shows the number of searches for the term “Big Data” from 2007 to 2017

In this blog post, we aim to provide clarity and insight into the origins and definitions of big data.  We will also discuss the potential benefits and challenges surrounding big data. In doing so, we will provide some examples linking big data to applications or data that you may interact with on a daily basis.


What is a Census Research Data Center and Why Should You Care?

This semester, FSU became the newest consortial member of Atlanta’s Census Research Data Center. Funded primarily by the College of Social Sciences and the Office of Research, this membership means the Florida State community can now use Census micro-data without paying lab fees, which can range upwards of $15,000 per project. There are currently 18 Census Research Data Centers in the United States, and outside of North Carolina’s Research Triangle, the only one located in the southeastern United States is at the Federal Reserve Bank of Atlanta.

So, what is a Census Research Data Center? The Center for Economic Studies defines Census Research Data Centers (RDCs) as U.S. Census Bureau facilities, staffed by a Census Bureau employee, that meet all physical and computer security requirements for access to restricted-use data. At RDCs, qualified researchers with approved projects receive restricted access to selected non-public Census Bureau data files.

Where do college graduates work? Visualization based on 2012 Census data.

To understand the true value of doing research with non-public data from the RDC, it’s important to note the difference between micro data and macro data, the latter often referred to as aggregate data. When most of us use datasets for research or analysis, we’re looking at summary figures. For example, if you extract Census data for analysis, you’re typically looking at some sort of summary or aggregation for a specific geographic unit. These geographic units range from states, counties, and cities down to much smaller units such as census tracts and block groups. Regardless of the unit of analysis, the data itself is a summarization of individual survey responses from participants in that specific area.
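The difference is easy to see in code. In the hedged pandas sketch below, the first table is one made-up row per respondent (micro data) and the second is the kind of per-tract summary (aggregate data) that public Census tables typically provide.

```python
# Micro data (one invented row per respondent) vs. an aggregate summary table.
import pandas as pd

microdata = pd.DataFrame({
    "tract":  ["A", "A", "A", "B", "B"],
    "age":    [34, 52, 29, 41, 60],
    "income": [42000, 58000, 39000, 75000, 61000],
})

aggregate = (microdata
             .groupby("tract")
             .agg(respondents=("income", "size"),
                  median_income=("income", "median"),
                  mean_age=("age", "mean")))
print(aggregate)
```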

Building Data Sets with FSU’s Digital Library

Data science is all the rage lately; Harvard Business Review even called data scientist the sexiest job of the 21st century. Even though the term is rapidly gaining mindshare, many are still confused about what data science actually is. When you cut through the hype, the core of data science is actually pretty simple: it’s the study of data. What kind of data is being studied, how it is being studied, and what the individual data scientist is looking for all depend on the specific case. Data science is just another field of study using digital methods, putting it firmly under the umbrella of Digital Scholarship.
