Evaluating Data Through a Critical and Ethical Lens

Introduction

Data literacy is the combination of a few unique skill sets: statistical literacy, information literacy, and technical proficiency. It also involves being able to visualize, critically evaluate, determine the accuracy and reliability of, and understand data sets. There are many reasons why it is important to be data literate, especially in recent years with the advent of the internet and social media. Data literacy is also crucial to many different industries and research areas. It is important to interpret the data that you are collecting to make sure that the results are accurate and to be able to understand that data so that you can create useful visualizations for others. 

There are a variety of concepts to keep in mind when critically evaluating data. For example, you need to consider the methods that were used to collect the data and whether those methods are ethical. Furthermore, when evaluating how the data is presented, you need to consider whether that representation or visualization is the most accurate way to portray the data. Another particular topic of concern is bias. There are different points at which biases can be introduced, such as when data is collected, when it is analyzed, and when it is shared with the public. Also, if you are critically evaluating your own data, it is important to check that there are no biases within your own work. In this post we will be discussing the critical evaluation of data through the lens of data collection, data presentation and visualization, and data ethics. 

Data Collection

In the context of data collection, several different collection methods can be used for research. Some of these methodologies, such as focus groups, surveys, and participant interviews, are familiar to the public at large. However, there are other specific data collection processes that many people outside of certain academic disciplines may not be aware of, such as web scraping/text mining, phlebotomy procedures for blood tests, observational behavior recording for time series data, and many more.

Documenting how data was collected therefore matters not only for reproducing experiments, but also for interdisciplinary work. Fields of research can differ in their data collection methods in ways that researchers in other fields may not be aware of, even across seemingly similar disciplines. For example, accounting and finance may seem similar but can have drastically different ways of interpreting monetary data: accountants and financial analysts differ in how they calculate when a company breaks even (i.e., when revenues equal costs). Even within the same field of research, transparency about how data is collected is important for peer review, whether for ethics accountability or for identifying methodological flaws. An incomplete record of the data can make it difficult or impossible to know whether the data was collected in a way that prevents bias, or whether the data is accurate and precise.

Failing to document data and data collection methods can also create problems reproducing or using the data for further research, particularly if things such as question types, experiment conditions, and units of measure are not properly documented. For example, cold fusion (hypothetical nuclear fusion performed at room temperature) would be a low-cost energy solution, but the experimental methods and data behind the original claims were not documented well enough for other researchers to reproduce the results. As a result, the concept of cold fusion is now widely regarded with skepticism. A less extreme case involves surveys: the way that a survey is constructed can bias responses. Documenting how a survey was written can therefore help in evaluating why a research study came to a specific conclusion, as well as in testing whether changing the questions or even the question order would change the results.

Furthermore, data cleaning (the process in which incorrectly formatted data, corrupted data, etc. are reformatted or fixed so that they can be used in analysis) can also introduce statistical bias, for example through eliminating outliers, accidentally losing a variable, or how you decide to categorize your data. Therefore, documenting how you clean your data is also a critical component of research: explaining which outliers you decided to keep or remove, and why, can help you and other researchers down the road. It is also important to consider the order in which questions are asked and the way they are worded when conducting surveys. While it might seem counterintuitive at first, question order and wording can affect the percentage of people who respond in a certain way, whether or not potential participants qualify for research projects, and even the numeric values of the data itself.
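As a rough illustration of documenting a cleaning step, the sketch below removes outliers and records exactly what was removed and why. The function name, the 1.5 × IQR rule, and the sample values are all assumptions chosen for illustration; they are one common convention, not a recommendation for any particular dataset.

```python
from statistics import quantiles


def remove_outliers_iqr(values, k=1.5):
    """Drop values outside the IQR fences and return (kept, log).

    The log records the rule, the fences, and every removed value so the
    cleaning step can be reported alongside the results.
    """
    q1, _, q3 = quantiles(values, n=4)  # quartiles (default "exclusive" method)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if lower <= v <= upper]
    removed = [v for v in values if v < lower or v > upper]
    log = {
        "rule": f"outside {k} x IQR from the quartiles",
        "lower_fence": lower,
        "upper_fence": upper,
        "removed": removed,
        "n_before": len(values),
        "n_after": len(kept),
    }
    return kept, log


# Made-up response times in seconds, for illustration only.
response_times = [12, 14, 15, 15, 16, 18, 19, 95]
kept, log = remove_outliers_iqr(response_times)
print(log)
```

Saving a log like this next to the cleaned data lets another researcher rerun the analysis with the removed values put back in and see whether the conclusion changes.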

Data Presentation and Visualization

Most people have probably heard the phrase “label your axes” at some point, even before college. It is often mentioned in K-12 education, the premise being that someone will not know what your graph(s) are depicting without the labels. While this is indeed correct, labeled axes constitute only one of many different components of data presentation and visualization.

Figure 1: Axes that are labeled!

A good place to start exploring how data visualizations can best be implemented is The Data Visualisation Catalogue. While the site was originally established with graphic designers in mind, Severino Ribecca himself stated, “I felt it would also be beneficial to…anyone in a field that requires the use of data visualisation” (Ribecca n.d.). As such, almost anyone who uses data has to consider how to visually communicate it to an outside audience, or even to the general public outside of academia. A nifty feature of The Data Visualisation Catalogue is that you can filter recommended data visualization types by the concept you are trying to demonstrate.

One consideration when looking at a data visualization is whether the data is represented in a way that is appropriate for that specific data type. While it might not seem like the presentation would differ between data types, certain visualizations depict certain types of data more accurately and completely. For instance, time data and Geographic Information Systems (GIS) mapping data are distinct data types. While they can be combined and represented in the same graphic (e.g., how has the land of a certain area changed over time?), each has its own issues to consider to make sure that you are not creating misleading graphics. Namely, one cannot make a map with time data alone, and a line graph that is meant to show trends over time is a poor tool for displaying a map.

Furthermore, the scales and units that are used in a data representation are also important considerations! Using our previous example, we can note that the visual scales of a map are different from the visual scales of time series data. For instance, you can get drastically different data visualizations if you transform data from a linear scale to a logarithmic scale (i.e., a scale that plots each value according to the exponent to which a base, such as 10, must be raised to produce it). This can be useful when the data you are working with spans such a wide range that it is hard to see everything at once. For example, a logarithmic time scale that condenses millions of years into smaller numbers that are easier to conceptualize leads to graphs where you can see things like different geological eras.
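To make the difference between the two scales concrete, here is a minimal sketch that computes where a value would sit along an axis under each one. The function name and the sample values are made up for illustration.

```python
import math


def axis_position(value, lo, hi, log_scale=False):
    """Return where value falls between lo and hi as a fraction from 0 to 1."""
    if log_scale:
        # On a log scale, positions depend on the exponents, not the raw values.
        value, lo, hi = math.log10(value), math.log10(lo), math.log10(hi)
    return (value - lo) / (hi - lo)


# Made-up values spanning six orders of magnitude.
values = [1_000, 1_000_000, 1_000_000_000]
for v in values:
    linear = axis_position(v, values[0], values[-1])
    logarithmic = axis_position(v, values[0], values[-1], log_scale=True)
    print(f"{v:>13,}  linear={linear:.4f}  log={logarithmic:.4f}")
```

On the linear axis the middle value is squeezed almost onto the smallest one, while on the logarithmic axis it lands halfway along. The data are the same, but the picture is very different, which is why the scale in use should always be stated.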

While a logarithmic scale could be used to misrepresent data, a far more common tactic involves a truncated or broken axis on a graph (Figures 2a and 2b): a truncated axis does not start at zero on the y-axis, and a broken axis subtly skips a large number of units. This tactic appears in some graphics that news outlets use, whether intentionally or not. Other ways of misrepresenting data include plotting two graphs that are not on the same scale, or zooming in on the scale to make a trend look far larger than it truly is.


Figures 2a and 2b: Graphical Examples of a graph with a broken axis and a graph with a truncated axis, respectively
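The effect of a truncated axis can be quantified with a little arithmetic. The sketch below compares how tall two bars are drawn relative to each other depending on where the y-axis starts; the function name and the values are hypothetical.

```python
def drawn_height_ratio(a, b, axis_start=0.0):
    """Ratio of the drawn bar heights for values a and b.

    A bar is drawn from axis_start up to its value, so its height on the
    page is (value - axis_start).
    """
    if axis_start >= min(a, b):
        raise ValueError("axis_start must be below both values")
    return (b - axis_start) / (a - axis_start)


# Made-up values: b is only 4% larger than a.
a, b = 50.0, 52.0
print(drawn_height_ratio(a, b))                 # y-axis starts at 0
print(drawn_height_ratio(a, b, axis_start=48))  # y-axis truncated at 48
```

With the axis starting at zero, the second bar is drawn about 4% taller, matching the data. With the axis truncated at 48, the same bar is drawn twice as tall as the first.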

While there are many examples of distinctly misleading graphs, there are also many graphs that accurately portray the data but use an incompatible or inaccessible color palette. Many color palettes used in data visualizations can be inaccessible to those with color vision deficiencies such as red-green and blue-yellow color blindness. Using color-blind friendly palettes can help to make visualizations more accessible. Furthermore, alt-text descriptions of what the graph is showing enhance the ability of screen readers and other tools used by people with low vision or blindness to interpret the visualization. Being hard to see or aesthetically displeasing does not by itself make a graph misleading, which is an important distinction to make (although the two are not mutually exclusive!).


Figure 3: A “Painbow” Graph
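As one possible way to apply a color-blind friendly palette, the sketch below assigns colors from the Okabe-Ito palette, a widely used eight-color set, to the categories in a chart. The helper function and the category names are assumptions for illustration; the resulting hex codes could be handed to whatever plotting tool you use.

```python
# Okabe-Ito palette: eight colors chosen to stay distinguishable under
# common forms of color blindness.
OKABE_ITO = {
    "black": "#000000",
    "orange": "#E69F00",
    "sky blue": "#56B4E9",
    "bluish green": "#009E73",
    "yellow": "#F0E442",
    "blue": "#0072B2",
    "vermillion": "#D55E00",
    "reddish purple": "#CC79A7",
}


def assign_colors(categories):
    """Map each category to a distinct palette color, in palette order."""
    colors = list(OKABE_ITO.values())
    if len(categories) > len(colors):
        # Reusing colors would make categories indistinguishable.
        raise ValueError("more categories than palette colors")
    return dict(zip(categories, colors))


# Hypothetical category names.
print(assign_colors(["control", "treatment A", "treatment B"]))
```

Refusing to reuse colors is a deliberate choice here: when there are more categories than colors, it is usually better to regroup the data or add labels or patterns than to repeat a color.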

Data ethics

When examining a dataset, it is also important to consider whether there are any biases present that may affect interpretation of the data. Two common categories of biases are cognitive biases and statistical/algorithmic biases.  Cognitive biases involve individuals interpreting the results of a study to best fit a specific narrative. This may involve a data producer deleting data that does not fit the conclusion that they are trying to prove. At the same time, a data producer may also add data that is not accurate in an attempt to strengthen their claims. Furthermore, studies may be designed to collect data that only represents a small subset of a population, while claiming to be representative of the entire population. 

Statistical/algorithmic biases, in turn, describe bias as your sample poorly describing your population. This kind of bias is significantly mitigated (if not outright eliminated) when your data collection methods are not themselves biased. It is particularly noticeable when examining artificial intelligence (AI) algorithms. These algorithms are often trained with unequal datasets, which then leads to skewed results when performing data analysis with them. Therefore, when examining data that is output by an algorithm, one should consider whether the algorithm has been trained with accurate and equal datasets.

An industry where statistical and algorithmic biases are extremely important to consider is the healthcare industry. For example, many hospitals use artificial intelligence to sort through patient data, which helps doctors determine who needs immediate emergency attention. While there are many benefits to such algorithms, there have been issues in the past because of them. In certain instances, if a patient has pre-existing medical conditions that affect their health, the algorithm will not be able to take that into account. In addition, many algorithms that are commonly used in healthcare systems are biased with respect to race and gender. As mentioned in “Algorithmic Bias in Health Care Exacerbates Social Inequities — How to Prevent It” written by Katherine Igoe, “algorithms in health care technology don’t simply reflect back social inequities but may ultimately exacerbate them.” Igoe also mentions that certain prediction algorithms used for detecting heart disease were biased in their design. For example, the “Framingham Heart Study cardiovascular risk score” worked very well for Caucasian patients, but not for African American patients. This is because around 80% of the data collected for this algorithm came from Caucasian patients. Using such an unequal dataset to train the algorithm can lead to unequal care and treatment in medical practice (Igoe). This is just one of many examples of bias due to algorithm design.
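A first, rough check for this kind of imbalance is to compare each group's share of the training data with its share of the population the algorithm will be used on. The sketch below does that with made-up records; the function names, the group labels, the reference shares, and the 10-point tolerance are all assumptions for illustration, not figures from any real study.

```python
from collections import Counter


def group_shares(records, key):
    """Return each group's fraction of the records."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def underrepresented(shares, reference, tolerance=0.10):
    """List groups whose data share trails the reference share by more than tolerance."""
    return sorted(
        group
        for group, ref_share in reference.items()
        if ref_share - shares.get(group, 0.0) > tolerance
    )


# Made-up training records and reference population shares.
training = [{"group": "A"}] * 80 + [{"group": "B"}] * 20
population = {"A": 0.6, "B": 0.4}

shares = group_shares(training, "group")
print(shares)
print(underrepresented(shares, population))
```

A check like this only catches unequal representation. It says nothing about whether the algorithm performs equally well for each group, which would need to be evaluated separately.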

Companies such as Amazon have also faced major problems relating to algorithmic bias. A few years ago, Amazon tried to use an artificial intelligence algorithm to help hire new employees. However, it turned out that this algorithm was biased against women. This is because the algorithm was trained on resumes submitted during a period when the number of male applicants was significantly higher than the number of female applicants. This ultimately caused the algorithm to favor men over women.

Conclusion

Critical evaluation of data is an extremely important skill set for any student or professional to have. Knowing the importance of checking the reliability, accuracy, and the bias in any data set is necessary when reading or working with data. Some questions to keep in mind are: is the collection method clear and documented? Is the data visualization appropriate for the dataset and for what the author is trying to represent? Is the data biased in the collection or visualization stages? It is important to evaluate data to ensure that we are using quality and accurate data to make sound decisions and conclusions. 

Works Cited

This blog post was written by William-Elijah Clark (Senior STEM Data Fellow) and Reagan Bourne (STEM Data Fellow) from FSU Libraries.

Summer Tutoring Opens Today

Join us this summer for help with numerous core chemistry, math, and physics classes.

Our free service does not require appointments! Simply drop in anytime you need assistance and our tutors will be happy to help. All tutoring during summer 2021 will happen online through Zoom, and you can find more information about the service via our Online Tutoring page.

Our summer hours are Monday, Tuesday, and Wednesday from 8pm to 11pm.

For questions or to request additional information, please email lib-tutoring@fsu.edu.

My Experience as a STEM Research Data Services Assistant

By: Paxton Welton

Welcome to the third post in the Get Data Lit! blog series. This post will focus on my experience working as a STEM Research Data Services Assistant with FSU Libraries during the 2020-2021 school year. In this role, I assisted with outreach and education around STEM research data services for students, groups, and organizations at Florida State University.

My name is Paxton Welton and I will be graduating with a bachelor’s degree in Finance this semester. One question that you might have right from the start: why is a finance major working in a STEM-focused role?

When applying for jobs prior to this academic year, I knew I wanted a role that would challenge me and allow me to develop new skills. I believed that being the Research Data Services Assistant would provide the level of challenge and opportunity that I was looking for. By and large, my experience provided me with just that. I faced a major learning curve when I first started this role. While I had a grasp of the basics of data literacy and research data services, I quickly realized I did not know nearly enough to be able to properly speak to student groups about these topics. During the first few weeks of the fall semester, I spent a significant portion of my time getting a stronger understanding of data and everything FSU STEM Libraries had to offer its students with regard to research data. By reading countless articles about data literacy and engaging in weekly discussions with my supervisor Dr. Nick Ruhs, the STEM Data & Research Librarian, I became confident in my working knowledge of these topics.

As the STEM Research Data Services Assistant, one of my main responsibilities was conducting targeted outreach to different student organizations across campus. When I first started this process I reached out specifically to STEM-focused groups. This process involved me initiating conversations via email with registered student organizations (RSOs) to introduce them  to the research data services FSU Libraries offers them.  In several cases, we were invited to meet and/or present synchronously to these groups. This gave us a chance to share more in-depth information about our services and just how valuable they are to students. It also gave students a chance to ask us any questions they may have. Getting the chance to directly interact with students and help them find the right resources to feel more prepared for their future was by far my favorite part of this role.

I also had the opportunity to contribute to data-related events hosted by FSU STEM Libraries. Two examples include Love Data Week in February and the Virtual FSU Libraries Data Services Quest in March. My involvement in these events allowed me to see the entire process of creating programming for students. I was able to sit in on brainstorming meetings, give my input on the marketing materials, and create content for the events.

One of my main focuses throughout this year has been to develop and create the blog series you are reading right now: Get Data Lit! The focus of this blog series was data literacy and its applicability to students’ educational experiences. As such, I had the chance to put into practice the new data literacy skills I learned in this role. I also had the opportunity to connect data literacy to real-world practice and explain the importance of critically evaluating data. Doing so made me realize just how important learning data skills is for my future career and education.

One thing that proved to be a common theme throughout all the work I was doing is that data is powerful and knowing how to work with it is even more powerful. From a career in law to a career in fashion, you are going to be working with data in some form. Learning how to critically evaluate data is going to give you the skills you need to stand out in the future. 

By taking on a job in a discipline that I knew very little about, I was able to challenge myself and make the most out of this past year. From getting to work on student programming events to developing a blog series, I was constantly challenged and learning something new. 

7 LinkedIn Learning Skills to Master This Summer

Hi everyone, this is Courtney again, the STEM Libraries GA, along with Emily McClellan, the STEM Libraries Outreach Associate, to talk about ways we can continue our learning and professional development throughout what promises to be a unique semester. It’s often said that we should try to control how we react to the things we can’t control. While that’s a lot easier said than done, we wanted to share some opportunities that you may find helpful while continuing to learn and grow throughout the summer.  While the world is constantly shifting and changing around us, finding stability can be hard. If you’re looking for a professional goal you can achieve this summer, try a LinkedIn Learning training to keep you grounded and focused as we continue to work from home. 


Introducing our Newest Librarians

Continuing the series, here are two more of our new librarians.

Renaine Julian – Data Research Librarian

Hi folks. My name is Renaine and I’m the Data Research Librarian at FSU. I’m a three-time FSU alum and I couldn’t be happier to be back on campus! Before starting my current position, I worked for the Libraries for about five years as a student worker and, later, as a staff member before heading over to the state-wide library consortium, The Florida Virtual Campus.

The Data Research Librarian is a new position and I’m responsible for creating a new suite of services for students and faculty related to quantitative data as well as the management of research data. That being said, I can help you find data as well as figure out what to do with it once you have your hands on something useful. If you’re creating large datasets for your research, you’ll need a plan for managing that information and, in many cases, making it available to others. I’m working with other folks in the Libraries and around campus to develop data management consulting services to assist you in planning to keep your research intact, findable and usable.

I’m also the subject specialist for Economics, Geography, and Urban and Regional Planning. My research interests include: data management, data visualization, open data, emerging technologies and digital libraries. I work in the Scholars Commons which is located in Strozier’s basement. Please come by and say hello.

Contact Renaine – rjulian at fsu.edu

[Editor’s note – photo coming soon! That’s how new Stacey is!]

Hello! My name is Stacey Mantooth and I am a new addition to the library staff at Dirac Science Library. Before joining Florida State University, I earned my MSLS at the University of North Carolina at Chapel Hill and worked at the EPA Library at Research Triangle Park in North Carolina. While I’ve lived in several states around the Southeast and Midwest, this is my first time living in Florida, and I’m excited to see what Tallahassee has to offer.

As the liaison to the Chemistry and Biochemistry and Earth, Ocean, and Atmospheric Science departments, I help students and faculty with research activities like finding journal articles, writing literature reviews, patent searching, or managing data. I also help make decisions about what materials the Libraries buy or keep for these subjects. In addition to my regular library and liaison work, I’m interested in doing research on the information needs of STEM faculty and students on campus. Studying which information researchers need, knowing how they go about getting it, and understanding how they view the research process could lead to improved University services and greater STEM success.

Contact Stacey – smantooth at fsu.edu