A Short Dive Into Data Science

Introduction

Data science is a multidisciplinary field that combines mathematics, statistics, programming, and artificial intelligence to analyze large amounts of data and discover the insights embedded within them. These insights help data scientists ask and answer questions about data, such as what happened, why it happened, what will happen, and what can be done with the results (Amazon Web Services). Data science draws on a range of tools and technologies to interpret data and create meaning from it. As the volume of data being produced accelerates, data science is widely used to process that data and discover patterns in it through a variety of statistical techniques.

The term data science was first used in the 1960s to describe a new profession focused on the interpretation of data. However, data science during that period was very different from how it is viewed today. Starting in the 2000s, companies began to see data as a commodity they could capitalize on. This created a need for data scientists who could analyze large quantities of data and who had the skills to access, understand, and communicate insights from it. In recent years, the field has grown tremendously, and it is now needed in every industry that generates or relies on data.

Data science has completely reshaped the way we collect, store, and analyze data, and these discoveries have transformed several industries and improved people’s lives. For example, in finance, data science helps financial institutions make more accurate predictions and manage risk. In healthcare, it allows researchers to predict and prevent certain diseases. In marketing, it allows companies to serve personalized ads based on the data they hold about consumers. However, as data science becomes more common in society, it is important to consider the ethical implications of collecting and analyzing large amounts of data. Subjects such as bias, transparency, and privacy are extremely important and relevant in this field.

One important ethical consideration is bias. Data science algorithms can unintentionally inherit bias from the datasets they are trained on, which can lead to discrimination and inequality toward certain groups represented in a dataset. Transparency is also an important ethical consideration. More specifically, some individuals may not understand how data science is used to make decisions that affect them, which can leave them with concerns about the technology and how it is being used. Finally, data privacy is a critical issue, particularly as the amount of data being collected continues to grow exponentially. It is essential to ensure that individuals’ personal information is protected and that data is collected only in ethical ways. Overall, while data science has had many positive impacts on society, it is also important to prioritize ethical considerations within the field so that these advances continue to benefit society.

What skills are needed in data science?

To quote the US Bureau of Labor Statistics, “Data scientists use analytical tools and techniques to extract meaningful insights from data.” By itself, however, this is an incredibly broad statement. It is often easier to think about data science in terms of the types of tasks done within the discipline. There are many articles on what skills data scientists need, often with similar but imprecise wording, yet there is general overlap in the skills that are required or recommended for those considering a career in data science.

Quantitative Skills

Broadly speaking, an understanding of mathematics and/or statistics is important for a data scientist. While a data scientist is not necessarily a mathematician or statistician, data scientists often use similar (if not the same) tools to build mathematical models that can be used for prediction or to make inferences about what correlations exist within the data.

Additionally, there is a general consensus that data scientists need to know about data cleaning (also called wrangling or scrubbing). Regardless of the terminology used to describe it, knowing how incomplete and complete data can and cannot be used is a critical skill. Understanding how pieces of data interact with each other gives a data scientist a far more efficient sense of what they can and cannot infer from the data, as well as how a computer stores the data they need to analyze. While this is typically listed as a separate skill, it often relies on linear algebra to work.
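As a small illustration of the kind of cleaning work described above, the sketch below uses pandas on made-up survey data (all column names and values are hypothetical) to standardize inconsistent labels and handle missing and impossible values:

```python
import pandas as pd
import numpy as np

# Hypothetical survey data with the kinds of problems cleaning addresses:
# a missing value, inconsistent text labels, and an out-of-range entry.
df = pd.DataFrame({
    "age": [29, np.nan, 41, 35, 230],          # 230 is an impossible age
    "state": ["FL", "fl", "Florida", "GA", "FL"],
})

# Standardize inconsistent labels.
df["state"] = df["state"].str.upper().replace({"FLORIDA": "FL"})

# Treat impossible values as missing, then fill missing ages with the median.
df.loc[df["age"] > 120, "age"] = np.nan
df["age"] = df["age"].fillna(df["age"].median())

print(df)
```

Decisions like "fill with the median" versus "drop the row" are exactly the judgment calls that make cleaning a skill rather than a rote step.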

Programming/Coding

It should be noted that someone can be a statistician or a mathematician without knowing how to write a single line of code. In fact, according to the Bureau of Labor Statistics, 13% of mathematicians work in education, as do 9% of statisticians. Quantitative work in that realm is often less reliant on code and more about effectively communicating quantitative concepts. However, a professional data scientist absolutely must know how to program, as it is a de facto requirement of almost any data science position. While it would theoretically be possible to calculate statistical trends by hand, this is impractical given the current reality of big data: constructing even one mathematical model from all available data by hand could take a lifetime at best. It is far more efficient to have a computer run all of the minute arithmetic calculations instead.

The list of programming languages a data scientist might need is long enough to warrant its own blog post (in fact, we have covered some in a previous blog post entry). Given the sprawling number of programming languages available, it is generally more important to understand the fundamentals of programming and pick up specific languages as needed. While we cannot make a recommendation for every situation, Python, R, and SQL are good languages for a strong data science foundation, and they tend to be the top three languages for most data science positions, all else equal. Furthermore, artificial intelligence (and more specifically, machine learning), which is gaining prominence in modern data science, can be implemented in the first two.
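As a rough taste of how these languages divide the work, the sketch below pairs SQL (via Python's built-in sqlite3 module) with Python: SQL aggregates a hypothetical sales table, and Python carries the results forward. The table and figures are invented for illustration:

```python
import sqlite3

# Hypothetical sales table, held in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("East", 100.0), ("East", 250.0), ("West", 175.0)])

# SQL retrieves and aggregates the data...
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# ...and Python carries the analysis forward.
for region, total in rows:
    print(f"{region}: {total}")
```

In practice the database would live on a server rather than in memory, but the division of labor is the same: SQL fetches and summarizes, while Python (or R) models and visualizes.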

Business Acumen and Communication

Surprisingly, business acumen (also known as domain knowledge) is frequently mentioned as an essential skill for data scientists. A more generalized way of thinking about it would be “understanding data in context”. In brief, domain knowledge involves knowing what the data is and what it means in the context of your work. It may be easy to think about what the purpose of your data analysis is when you are responsible for the beginning and end of a project – however, jumping into the middle of a project will require some knowledge of what the data is before you can proceed to do anything with it. 

Similar to business acumen, data scientists also need to know how to communicate their results so that others can understand the insights gleaned from their work. At first glance, one might assume this predominantly means explaining predictive outputs in verbal and written communication that anyone outside the field can understand. Those two skills are important, but visual communication and literacy matter as well. The ability to make data visualizations that effectively show the trends in the data is another critical skill within data science, as graphs can communicate far more ideas at once than words or numbers alone. Further, data visualizations can bypass the need for an audience to understand more complex quantitative concepts.
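As a minimal sketch of the visualization skill described above, the following plots a hypothetical sales trend with matplotlib (the data and labels are invented; any plotting library would do):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

# Hypothetical monthly sales figures: a trend that is easier to see than to read.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 128, 160, 175, 190]

fig, ax = plt.subplots()
ax.plot(months, sales, marker="o")
ax.set_xlabel("Month")
ax.set_ylabel("Units sold")
ax.set_title("Monthly Sales (hypothetical data)")
fig.savefig("sales_trend.png")
```

A stakeholder who would glaze over at a regression table can read an upward-sloping line instantly, which is the point the paragraph above is making.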

Miscellaneous Skills and Addenda

While the skills above are frequently mentioned in articles describing critical skills for data scientists, some other skills earn notable mentions. For instance, DevOps (a project management method) is considered critical enough to the discipline that an entire article exists on using it as a framework for modern data science work (Saxena et al. 2021). While DevOps is the most commonly mentioned project management method within data science, other aspects of project management are also valuable to have.

Further, being able to work with big data and its infrastructure is critical (Georgia Tech Professional Education, 2023). As more and more data is collected by private and public entities, experience with data files that are not easily navigable is of the utmost importance. Additionally, knowing how to locate the data you need in large data warehouses is important regardless of how much data you need.

As mentioned in the introduction, concerns about bias in data and ethical data use have ramifications for consumer privacy. This is such a notable concern that Harvard Business School’s own page on data science skills mentions ethics as something to keep in mind, given how many dilemmas can arise from data breaches, biased algorithm development, and the use of consumer data without consent, among other issues (HBS Online, 2020).

What industries/fields use DS?

As mentioned in the introduction, data science is a combination of the fields of mathematics, statistics, and computer science. Data scientists generally use programming languages such as R and Python to derive analyses and make visualizations from the data given to them. Data science is often seen as a subset of computer science, the broader field focused on developing new software to perform certain tasks, such as building a new gaming application, designing a website, or developing a new programming language. Data scientists generally write software to run computational models or to visualize data, utilizing statistical techniques in the process.

Furthermore, data science and statistics are generally similar fields with a lot of overlap. For the most part, both data scientists and statisticians aim to make models to analyze data. However, data scientists may place a heavier emphasis on data visualization techniques, while statisticians may use more math-heavy models. For example, a statistician may lean more heavily on models such as linear regression, while a data scientist may focus more on models such as neural networks, which attempt to imitate the human brain to make predictions. That being said, a statistician can almost certainly become a data scientist with a little extra training and vice versa. 

Data science can be used in almost every business worldwide and has many specific applications; there are too many to list in this blog post. Here, we will briefly describe a few of the more common industries and/or fields that use data science. 

One of the most common uses of data science is in the stock market. When looking at trends in a company’s stock, data scientists at stockbroking firms can analyze past stock data and use it to predict how much that stock will be worth in the future. These predictions can then help determine whether the firm should buy or sell that stock.

Another example of data science is simply predicting future revenue, which can be used in virtually every business. Given past data such as previous prices, how well the economy has been doing as a whole, and many other variables, data scientists can predict how much revenue or profit a company may earn in the future. Using this, companies can decide whether to set a certain price for their product or even whether to continue the business as a whole.

Data science can also be used at almost every company for various HR applications. HR analytics can help a company predict turnover or recruit the right candidates. For example, if a data scientist analyzes past data and finds that a specific role has higher turnover, the company can use that analysis to restructure the role and help decrease turnover. Furthermore, data science can be used to determine which candidates may be the best fit for a job by looking at variables from their applications or resumes. For instance, if employees with strong people skills tend to perform better, analysis can be done to identify which candidates have strong people skills, which can then translate into a more productive work environment.

Finally, an application of data science that you’re likely well aware of is targeted advertising, used by companies such as Facebook, Google, and Spotify. By analyzing data from your previous searches, likes, listening history, and various other sources, these companies can serve you the advertisements most applicable to you. For example, if your neighborhood were hosting a baseball tournament, your neighbors might search for items such as “baseball bats” and “how to hit a baseball” on Google. Using this data, and seeing that you’re nearby, Google might then direct advertisements to you for baseball bats and baseball lessons.

Conclusion

If one wanted to be brief, data science could simply be described as the hybrid discipline of computer science, mathematics, and statistics. More specifically, data science can take large amounts of organized and unorganized data and make predictions about what might happen in the near future. Those large amounts of data can also serve as a descriptive explanation for phenomena that are currently happening but would otherwise be invisible to us. Furthermore, those predictions can inform business decisions and public policy, predict future gameplay moves and outcomes, or even help create critical medical interventions. However, the skills above can also cause harm if used without consideration for the data itself. Regardless of where or how data science is applied, all of the skills above are necessary to do work in this field.

While we have discussed the use of data science in industries such as financial markets and HR analytics, its applications go far beyond these two fields alone. In fact, the list of fields that do not use data science at all would probably be shorter than the list of fields that do. For instance, one of the graphic design industry’s oldest professional organizations, the American Institute of Graphic Arts (AIGA), released a report in 2018 on the data economy and how data science will impact graphic artists (Davis 2018). While graphic design is not traditionally thought of as an industry that handles data, graphic designers also produce infographics (Stinson 2021). Furthermore, data mining is a critical skill within data science, with applications in anything from market research (Hall n.d.) to finding unused content within video game files (BBC Newsround 2020). If there is an area someone is interested in, they can look up “[field of interest] data science” and probably find how data science is used in that realm.

At the end of the day, though, understanding data science beyond what we have outlined in this post may require experimenting with each of the bits and pieces that make it work on your own. To quote notable data scientist Claudia Perlich, “Learning how to do data science is like learning to ski. You have to do it.” Understanding more about what data science is may require the very same thing: reading more about it and doing it for yourself.

Works Cited

For a list of references, please use this link: https://bit.ly/3NF30zQ

This blog post was written by William-Elijah Clark (Senior STEM Data Fellow), Sahil Chugani (STEM Data Fellow), and Reagan Bourne (STEM Data Fellow) from FSU Libraries.

A Primer on Machine Learning and Artificial Intelligence

Introduction 

Artificial intelligence is a very broad topic that includes machine learning and deep learning. These terms are often used interchangeably under the assumption that they all mean the same thing. However, while the terms are related, specific characteristics differentiate them: deep learning is a subfield of machine learning, which is in turn a subfield of artificial intelligence. Artificial intelligence involves developing computers that can mimic human cognitive functions and carry out specific tasks. Machine learning uses algorithms to recognize patterns and trends in previous data and then applies that information to real-world problems. The overarching goal of artificial intelligence is to allow computers to work independently, without the need for humans to instruct and interact with them. Applications of artificial intelligence and machine learning span essentially every industry; artificial intelligence is widely used in the manufacturing, banking, and healthcare industries, for example. In this blog post, we will go deeper into the definitions of artificial intelligence and machine learning and their practical applications.

What is Artificial Intelligence?

There are many different ways to define artificial intelligence, and over the course of several years the definition has changed drastically. Alan Turing, who is often referred to as the father of modern computer science, created a test known as the Turing Test in an attempt to answer the question “can machines think?” In this test, a human has to differentiate between a computer’s response to a question and another human’s response to the same question (IBM). Furthermore, in “Artificial Intelligence: A Modern Approach”, Stuart Russell and Peter Norvig discuss a human approach versus a rational approach to artificial intelligence. They describe four different goals to pursue when designing artificial intelligence: systems that think like humans, systems that act like humans, systems that think rationally, and systems that act rationally. Each goal has its own advantages and disadvantages, and all of these approaches are used today. An overall definition that fits these different goals is that artificial intelligence allows machines to learn from previous experiences and information and perform human-like tasks (SAS).

Along with the general definition described above, artificial intelligence can also be divided into weak and strong artificial intelligence. Weak artificial intelligence, also known as narrow artificial intelligence, is artificial intelligence that is programmed and trained for one task. Narrow artificial intelligence cannot mimic a human as a whole, only certain aspects, and it has very specific applications. For example, narrow artificial intelligence is used in Amazon Alexa, Google Home, personalized advertisements on social media, recommended songs on Spotify, and many more.

Strong artificial intelligence, also known as artificial general intelligence, focuses on creating a machine that can perform any cognitive task that a human can; in other words, a machine that can mimic a human. Three main abilities are critical to building an artificial general intelligence machine: the ability to generalize knowledge from one area and apply it to a new issue or task, the ability to make predictions based on prior knowledge and experience, and the ability to adapt to changes (Forbes). Notably, artificial general intelligence raises many ethical questions, and it can be argued that making a “strong” artificial intelligence is impossible.

Overall, artificial intelligence can be used to add intelligence to preexisting technologies. It can perform tasks reliably, with much less error than a human, and faster than a human. Artificial intelligence can also adapt through progressive learning. In the future, artificial intelligence may have even more of an impact on our everyday lives, and we can learn so much from it. 

Real-Life Use Cases for Artificial Intelligence

Daily Tech Use

Depending on how much tech you interface with, you may be thinking: “Artificial Intelligence isn’t used for anything I do or use. Why would I need to know where AI is used?” To answer the question quickly…artificial intelligence is currently embedded in a lot of daily tasks that most people (possibly even you!) use.

Whether you’re trying to find something via Google, deciding what to watch on Netflix, or discovering niche music genres on Spotify, all of these sites use AI-driven algorithms to deduce what you’re probably interested in seeing (University of York n.d.). For example, if you’re a STEM major who searches for the phrase “R Programming” often enough, Google will eventually pick up that you are most likely not looking for the history of how the letter R came to exist. Likewise, if you’re a linguistics major looking into how the modern letter R came to exist, you will most likely not get search results related to the R programming language. This is not the only situation where two people will get radically different search results. In fact, Google’s algorithmic presentation of information based on what you typically look for has a name: “filter bubbles”. The term was coined over a decade ago by political activist Eli Pariser, who demonstrated the phenomenon in a 2011 TED Talk with two different people searching for “Egypt” around the same time. While that conversation was predominantly about how filter bubbles affect politics and activism, it should be noted that filter bubbles would not exist without the artificial intelligence behind them. Being aware of how AI algorithms can influence what you see is therefore an important aspect of civic engagement, and the concept may become even more pertinent as newer chatbots present further issues, such as giving false information when asked certain questions. Thus, how AI is implemented matters for everyone.

For a less ominous use of modern AI, consider handwriting recognition software. With written English, a touch-screen interface can be combined with AI, image processing, and computer vision to convert handwriting into searchable, text-compatible notes. This can be extremely useful for transferring text data from one computer to another. While you could take a photo of your notes for someone else to look at, that photo would be of limited use for finding words within the text afterwards: you cannot search for a keyword if the notes are only saved as an image. Further, a computer that can convert handwriting to typed text allows someone to use a search engine without typing. This use of AI even extends beyond the English language: handwriting recognition research spans several languages, including non-Western languages such as simplified Chinese, Arabic, Thai, and more. As a consequence, handwriting recognition AI can bypass the need to type (a skill that is separate from writing and is even less common). Converting handwritten text to computer text formats also feeds into translation AIs; while tools such as Google Translate may not be the most reliable, they can serve in a pinch in situations such as a hospital ER.

AI in Economics and Finance

Economics and finance also embrace technology to carry out their work. For example, technology is particularly relevant to detecting credit card and insurance fraud. There are well-established ways to use mathematics and statistics to determine whether someone’s financial accounts have been compromised. However, the conundrum of modern finance and economics is that transactions happen at far faster speeds than humans can keep up with. An AI algorithm can calculate the probability that a financial transaction was fraudulent far faster than a human could. Therefore, as long as the humans behind the algorithm have given their AI sound formulas to work with, that processing speed is of great assistance in preventing modern-day fraud.
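To make the idea concrete, here is a toy fraud-scoring sketch: a logistic model with hand-picked, purely illustrative weights and features. A real system would fit these weights from historical labeled transactions rather than assume them:

```python
import math

# A toy fraud score: a logistic model with hand-picked (assumed) weights.
# Real systems learn these weights from millions of labeled transactions.
def fraud_probability(amount_usd, hours_since_last_txn, foreign_country):
    # Larger amounts, rapid-fire transactions, and foreign locations
    # all push the score upward in this illustrative model.
    z = -4.0 + 0.002 * amount_usd - 0.1 * hours_since_last_txn + 1.5 * foreign_country
    return 1 / (1 + math.exp(-z))

routine = fraud_probability(40.0, 24.0, 0)       # small local purchase
suspicious = fraud_probability(2500.0, 0.1, 1)   # large, rapid, foreign

print(f"routine: {routine:.3f}, suspicious: {suspicious:.3f}")
```

The speed advantage the paragraph describes comes from the fact that evaluating a formula like this takes microseconds per transaction, so it can run on every swipe in real time.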

Likewise, AI is already a cornerstone of the modern foreign exchange market (also known as FOREX). While the concept of foreign exchange has existed since antiquity, contemporary trading brings additional considerations: modern currencies are traded in significantly larger amounts and at faster speeds than ever before. In fact, modern FOREX is so large and so fast that a human being cannot efficiently or consistently make profits without AI tools, predominantly because the majority of FOREX transactions are carried out by AI bots instead of humans. A study commissioned by JPMorgan in 2020 determined that about 60% of all FOREX transactions were made by AI rather than humans! This is not to say that human involvement in FOREX is non-existent. Rather, the human role of a FOREX trader is no longer about physically placing trades but about examining formulas and writing ever-better code for a FOREX AI bot to operate with. Essentially, AI frees up time for human financiers to make analytical decisions rather than wait on or manually place trades, if they are so inclined. It should be noted that these applications of AI are still new and often come with the risk of sudden price shifts wiping out short-term profits.

AI in Healthcare

Artificial Intelligence also has applications in healthcare. It might be odd to think about how AI would impact something as physical as your own body, but there are already several cases where it can be used. 

For example, AI can be used to detect lethal drug interactions and to design vaccines from scratch. For the former, researchers at Pennsylvania State University used AI to study which prescription drug combinations could cause liver damage. For the latter, in 2019 researchers at Flinders University in Australia developed the first flu vaccine that was completely designed by artificial intelligence; earlier vaccines had been partially designed by AI, paving the way for this first vaccine made entirely that way. Furthermore, AI is used in physical machines developed for medical purposes, namely robot-assisted surgery. While most robotic surgical systems are not 100% AI-driven, the very first instance of a surgical robot performing surgery by itself was back in 2006 (United Press International 2006)! Fully autonomous surgery is not commonplace at the moment, but robot-assisted surgery with human intervention is. Hence, it is worth considering whether medical science should completely automate surgery or instead use AI surgical robots as collaborative machines.

What is Machine Learning?

Machine learning is a subset of AI specializing in taking data and improving the accuracy of predictions made from that data. For example, if the temperature increased by one degree Fahrenheit every day, a machine learning algorithm could use that data to predict that the temperature would keep increasing by one degree per day. This is arguably the simplest form of machine learning, called linear regression (as there is a linear relationship between the number of days and the temperature). However, machine learning encompasses a number of different ideas and models, up to and including systems such as weather forecasts.
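The one-degree-per-day example can be worked out directly with ordinary least squares, using nothing but the standard library (the temperatures below are hypothetical):

```python
# Fit the one-degree-per-day trend described above with ordinary least squares.
days = [1, 2, 3, 4, 5]
temps = [70.0, 71.0, 72.0, 73.0, 74.0]  # rises one degree F per day

n = len(days)
mean_x = sum(days) / n
mean_y = sum(temps) / n

# slope = covariance(x, y) / variance(x); intercept follows from the means
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, temps)) \
        / sum((x - mean_x) ** 2 for x in days)
intercept = mean_y - slope * mean_x

# Extrapolate the fitted line to a future day.
print(f"Predicted temperature on day 10: {intercept + slope * 10:.1f} F")
# prints "Predicted temperature on day 10: 79.0 F"
```

Libraries such as scikit-learn wrap this same calculation (and far more robust variants) behind a `fit`/`predict` interface, but the underlying idea is just this line-fitting.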

Machine learning is used in many ways throughout our everyday lives, such as in Spotify and YouTube recommendations, stock market predictions, and advertisements. With more data becoming readily available every day, the potential applications of ML will only continue to increase. Creative destruction, in economics, is the idea that new and better technology may eliminate some jobs in the short run, but that in the long run productivity will increase, new jobs will be created, and living standards will rise. With AI potentially taking over some jobs, such as customer service roles, and some of those jobs being replaced by jobs building AI tools, creative destruction is taking place and will only continue to do so. Therefore, with ML underpinning a large portion of the Internet today, it is worth obtaining an in-depth understanding of what it does.

Machine learning generally works in one of two ways: supervised and unsupervised learning. With supervised learning, a computer is trained on labeled data and can then use that data to make new predictions. For example, if we wanted to train a computer to recognize a picture of an apple, we would first input a large number of appropriately labeled pictures, some containing apples and some not. The computer would take this data, build a model from it, and predict whether or not a new picture contains an apple. Unsupervised learning, by contrast, is generally used to cluster or group segments of unlabeled data. For example, Spotify could use this type of ML algorithm to group listeners into certain categories. One potential grouping could be hip-hop and rap listeners, enabling Spotify to suggest hip-hop artists to rap listeners and vice versa.
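The listener-grouping idea can be sketched with k-means, a classic unsupervised clustering algorithm. The listening-hours data below are invented, and real recommendation systems are far more elaborate:

```python
# A minimal k-means sketch (unsupervised learning) on hypothetical listener data:
# each point is (hours of hip-hop, hours of classical) listened per week.
listeners = [(9, 1), (8, 2), (10, 0),   # hip-hop-heavy listeners
             (1, 9), (0, 8), (2, 10)]   # classical-heavy listeners

def kmeans(points, centers, steps=10):
    for _ in range(steps):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
                   for cluster in clusters]
    return centers, clusters

centers, clusters = kmeans(listeners, centers=[(9, 1), (1, 9)])
print(centers)
```

Note that nothing here was told which listener is which: the algorithm discovers the two groups purely from the structure of the data, which is exactly what distinguishes unsupervised from supervised learning.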

Figure 1: Supervised vs. Unsupervised Learning (Yan et al. 2018)

One way a computer can refine a model is through iterative, trial-and-error training, in which the model learns to predict outcomes from past examples. Going back to the apple example, the computer could start by making random guesses about which pictures contain apples. The model would then check its guesses against the labeled data; where the guesses were off, the model would adjust. Each pass through the dataset (each time the model goes through the dataset and guesses which pictures have apples) is called an epoch. Eventually, after tens or hundreds of epochs, the model gets better and better. Ideally, a good model would be able to guess which pictures contain apples with close to 100% accuracy.
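The guess-check-adjust loop described above can be sketched with a tiny perceptron-style model on made-up fruit data (two invented numeric features stand in for a real image):

```python
# A toy "guess, check, adjust" training loop over epochs.
# Classify fruit as apple (1) or not (0) from two hypothetical features:
# (redness 0-1, roundness 0-1), using a perceptron-style update rule.
data = [((0.9, 0.9), 1), ((0.8, 1.0), 1),   # apples: red and round
        ((0.2, 0.3), 0), ((0.1, 0.9), 0)]   # not apples

w = [0.0, 0.0]  # feature weights, initially "knowing nothing"
b = 0.0         # bias term
lr = 0.1        # learning rate: how big each adjustment is

for epoch in range(20):             # each full pass over the data is one epoch
    errors = 0
    for (x1, x2), label in data:
        guess = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
        if guess != label:          # wrong guess: nudge the model toward the label
            w[0] += lr * (label - guess) * x1
            w[1] += lr * (label - guess) * x2
            b += lr * (label - guess)
            errors += 1
    if errors == 0:                 # every example classified correctly: stop early
        break

print("learned weights:", w, "bias:", b)
```

Real image classifiers use millions of weights and gradient-based updates rather than this simple rule, but the epoch structure, the error check, and the adjustment step are the same in spirit.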

Use Cases for ML: Sports Analytics

One example of machine learning in the real world is the rushing yards over expectation (RYOE) metric in the NFL (National Football League). To calculate RYOE, developers first model the expected rushing yards on a play given a few factors, such as the speed of defenders and the number of blockers in the area. Then, given the actual rushing yards gained, RYOE is calculated as (actual yards) – (expected yards). Using new data and machine learning models built around this metric, teams can better determine whether rushing yards are the product of the running backs themselves or of offensive linemen and schemes. This also allows quantitative comparisons of the value of passing plays versus running plays, and subsequently of where teams should invest personnel resources. Thus, with new data and machine learning models applied to it, we can make a cohesive argument to finally answer the question: do running backs really matter?
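The RYOE arithmetic itself is simple enough to show directly. In the sketch below, the expected-yards values stand in for a model's output, and the player names and numbers are entirely hypothetical:

```python
from collections import defaultdict

# Rushing Yards Over Expectation, computed exactly as defined above:
# RYOE = actual yards - expected yards. The "expected" values here are
# hypothetical stand-ins for what a trained model would output per carry.
carries = [
    {"runner": "Back A", "expected": 4.2, "actual": 7.0},
    {"runner": "Back A", "expected": 3.8, "actual": 2.0},
    {"runner": "Back B", "expected": 5.1, "actual": 4.0},
]

for carry in carries:
    carry["ryoe"] = carry["actual"] - carry["expected"]

# Average RYOE per runner shows who outgains the model's expectation.
totals = defaultdict(list)
for carry in carries:
    totals[carry["runner"]].append(carry["ryoe"])

for runner, vals in totals.items():
    print(runner, round(sum(vals) / len(vals), 2))
```

The modeling difficulty lies entirely in producing good expected-yards numbers; once those exist, separating a back's contribution from his blocking is a subtraction and an average.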

Another use of machine learning is in sports betting. By analyzing historical data on player ratings, injury history, and various other metrics, betting companies and bettors can train a machine learning model. By plugging in the current values of those metrics, the model can predict, for example, who will win a game and by how many points. Betting companies use such models to set betting lines for games, and if a bettor’s own model disagrees with the line, the bettor may believe their model is better and use it to bet on the game.

Furthermore, machine learning can be used to analyze game-time decisions in sports such as baseball and basketball. By looking at player performance in the past and seeing how they perform compared to other players in specific situations, such as in the rain or sun, teams can utilize machine learning to predict how players will perform in the future. Given this data, they can put their players in the best possible position to succeed.

Conclusion

In essence, Artificial Intelligence and Machine Learning are deeply interrelated concepts, especially since Machine Learning is a subset of the broader AI field itself. Further, both broader AI and more specific Machine Learning techniques have applications ranging from entertainment such as sports and music, to daily living tasks such as handwriting recognition and home assistant devices, to critical infrastructure such as finance and medicine. This leads one to ask where artificial intelligence is not yet implemented. While it can be hard to say when tech experts in academia and the private sector cannot come to a consensus, one thing is absolutely certain: AI and Machine Learning carry at least some importance in everyone's lives in one way or another, whether directly or indirectly.

This also leads to further discussions, such as “is the importance of these technologies overstated or understated?”, as the exact magnitude to which artificial intelligence and machine learning will impact society is still unknown. With the introduction of machine learning chatbots such as ChatGPT, it can be challenging to ascertain how useful they will be in the long run. While ChatGPT can answer questions from “Where was Abraham Lincoln killed?” to “Code a website for me,” it still fails to answer some simple logical questions from time to time. Although the tool has been trained on an astounding three billion words, it’s far from perfect at this time. However, as time goes on, ChatGPT and similar tools will be trained on even more data, computers will become even faster, and the applications and accuracy will only increase – leaving us to wonder if future applications will be indistinguishable from humans. As with our previous example of robotic surgeons, only time will tell whether AI and ML-powered chatbots will require extensive assistance from humans or will be capable of full autonomy in the future. While we cannot answer this question at this time, nor do we encourage a specific stance on artificial intelligence and machine learning… we can say that it is a topic to keep an eye on.

Works Cited

For a list of references, please use this link: http://bit.ly/3GBKGof

This blog post was written by William-Elijah Clark (Senior STEM Data Fellow), Sahil Chugani (STEM Data Fellow) and Reagan Bourne (STEM Data Fellow) from FSU Libraries.

STEM Data Fellow Spotlight: Sahil Chugani

When I was in elementary school, I remember Googling various football statistics, running down to my parents, and telling them, for example, “Ben Roethlisberger had 4,328 passing yards in 2009!” I played football for eight years from elementary school to high school, and I was good at working with numbers. I found that sports analytics was a great combination of the two. In high school, I entered a sports analytics competition, where my project was to determine what would happen if onside kicks in football were replaced with a 4th down and 15, and I absolutely loved it. Now, I’m fascinated with data science as a whole – being able to make a computer do something that we could never imagine doing as humans is an amazing feeling for me.

Since the sports analytics competition, I’ve been doing anything and everything I could related to data science. Some of the research I’m currently working on includes sports team values, Kickstarter data, and sportswashing (for example, Qatar holding the World Cup amidst some controversial political issues). I also had a job this year working for a company called Scouting Heroes, where I logged basic statistics for the FSU football team. (More information on what the data I collected was used for can be found at https://simplebet.io/nfl.html.) I’ve also worked on creating data visualizations based on football data. For example, this past summer I created over 20 graphs that can be found at https://twitter.com/a_graph_a_day.

In one of my classes, one of my (now) coworkers, William-Elijah Clark, posted the opening for the STEM Libraries Data Fellowship in the class’s GroupMe, and I was eager to apply. Something I’m super excited about with this Data Fellowship is translating my skills into real-world experience. Instead of simply creating graphs or finding statistics on my own, I want to have a tangible impact with regard to data. I hope to be able to help students with their needs, or to have my data analysis translate into a decision that affects people. In a way, that would signify that my hard work on data analysis is paying off.

One of the projects that I’m super interested in working on as a Data Fellow is the use of Jupyter Books to assist users in learning more about how to code and analyze data as a whole. By offering interactive code blocks and giving users the opportunity to run code on their own, they may be more willing to learn about the data analysis techniques used. Furthermore, I hope that by implementing sports analytics examples, specifically football, people who are interested in sports may be more willing to learn how to use data analysis techniques with respect to sports.

As a whole, I’m very excited to learn more about data analysis techniques here at FSU Libraries, as well as to apply my skills to tangibly help others at Florida State.

This blog post was written by Sahil Chugani, STEM Data Fellow at FSU Libraries.

STEM Data Fellow Spotlight: Reagan Bourne

Prior to my experience at Florida State University, I took a few research classes in high school. In these classes, I had assignments where I would have to collect and analyze data as part of a research project. These experiences sparked my interest in data science, and from that point forward I always knew that I was interested in data-related research. Furthermore, I have always been interested in a few different subjects, including computer science, biology, and mathematics. I never realized that I would be able to combine my interests before starting this data fellowship.

When I first found this fellowship during the summer of 2022, I felt that I was at an academic crossroads. I was unsure of what I wanted to study and of my career goals. However, I was extremely interested in this opportunity, because it was unlike anything I had ever really known about. I thought that this position would be a great learning opportunity for me, and would hopefully allow me to utilize my data skills and pursue some of my interests. So far, this fellowship has gone above and beyond what I was hoping for.

As I am still at the beginning of my academic career, I had not had the opportunity to gain much experience using my data skills before this fellowship. For this reason, I am so grateful to be participating in it. I have already learned so many different things in my few months here. One of my first assignments was to meet with many of the different librarians at FSU Libraries. I really enjoyed this task, because I liked hearing about the different paths each of them took to find this career. It introduced me to a lot of different projects and areas of expertise in the library that I had never known about, such as the Health Data Sciences Initiative and open science.

Another concept that I have recently learned a lot about is the importance of critically evaluating data. Working on a blog post about this topic has been a great learning experience for me. It has introduced me to so many ideas that I had never known about.  Specifically, I have learned about machine learning algorithms for data science. As a student currently pursuing a computer science degree with a minor in data analytics, this topic was extremely interesting to me, and is something that I am excited to explore further. 

As I take more classes related to my major, I am excited to apply the skills I learn to this fellowship. In the future I hope to teach workshops about Unix, C#, SQL, and more. I am looking forward to continuing my work with the FSU Libraries.

This blog post was written by Reagan Bourne, STEM Data Fellow at FSU Libraries.

FSU Libraries celebrates Love Data Week 2023!

Introduction

Love Data Week is coming back to FSU in 2023! Love Data Week, or LDW, is an international event where individuals and groups are encouraged to host and participate in activities related to any and all data. It occurs during the week of Valentine’s Day, and focuses on helping people learn about best data management practices and methods for interpreting data. LDW was started in 2015 and is headed by the Inter-university Consortium for Political and Social Research (ICPSR) at the University of Michigan. For those looking to learn more about data, or who are interested in statistics, this is an excellent opportunity to ask questions and get started!

Events

Because looking at raw data can sometimes be boring, we’re looking to spice things up this year with two new activities! We’ll be right inside the entrance of Dirac from 12:00-2:00 PM on Thursday and Strozier from 12:00-2:00 PM on Friday! First, we’re going to be doing an Adopt-a-Dataset activity, where participants will be able to “adopt” one of the openly available datasets we have displayed. Your task will then be to determine what conclusions can be drawn from the data, and you’ll receive a Dum-Dum for your work! After that, we’ll have a jar of Smarties at the table, along with a list of numbers from a normal distribution. From there, you’ll have to guess the number of Smarties in the jar, and the person with the closest guess will win them all! In addition to the tabling events, our Research Data Management Librarian, Dr. Nick Ruhs, will be giving a workshop on Data Analysis with Microsoft Excel on Valentine’s Day (February 14) from 3:00-4:30. If you are or will be using Excel for your projects or research and are looking to enhance your skills, this will be a great workshop to attend!

Blog Posts

In addition to the wonderful events that are occurring during Love Data Week, we will be publishing two blog posts introducing the two new Data Fellows at FSU, Reagan Bourne and Sahil Chugani. In those posts, you’ll learn all about what inspired them to become a data fellow and how they became passionate about data analysis and management techniques.

Contact/Resources

For more information about any data questions/concerns you may have, you can either check out https://www.icpsr.umich.edu/web/pages/ or contact Dr. Nick Ruhs, our resident Research Data Management Librarian, at nruhs@fsu.edu. Furthermore, if you ever need any assistance with any data question you may have, you can check out the walk-up hours for our STEM Data Fellows!

This blog post was written by Sahil Chugani (STEM Data Fellow) from FSU Libraries.

Evaluating Data Through a Critical and Ethical Lens

Introduction

Data literacy is the combination of a few unique skill sets: statistical literacy, information literacy, and technical proficiency. It also involves being able to visualize, critically evaluate, determine the accuracy and reliability of, and understand data sets. There are many reasons why it is important to be data literate, especially in recent years with the advent of the internet and social media. Data literacy is also crucial to many different industries and research areas. It is important to interpret the data that you are collecting to make sure that the results are accurate and to be able to understand that data so that you can create useful visualizations for others. 

There are a variety of concepts to keep in mind when critically evaluating data. For example, you need to consider the methods that were used to collect the data and whether those methods are ethical. Furthermore, when evaluating how the data is presented, you need to consider whether that representation or visualization is the most accurate way to portray the data. Another particular topic of concern is bias. There are different points at which biases can be introduced, such as when data is collected, when it is analyzed, and when it is shared with the public. Also, if you are critically evaluating your own data, it is important to check that there are no biases within your own work. In this post we will be discussing the critical evaluation of data through the lens of data collection, data presentation and visualization, and data ethics. 

Data Collection

In the context of data collection, several different collection methods can be used for research. Some of these methodologies, such as focus groups, surveys, and participant interviews, are familiar to the public at large. However, there are other specific data collection processes that many people outside of certain academic disciplines may not be aware of, such as web scraping/text mining, phlebotomy procedures for blood tests, observational behavior recording for time series data, and many more.

Consequently, not only is recording the data itself important for experimental duplication purposes, but it can also be important for interdisciplinary work. Some fields of research may have data collection methods that researchers in other fields are not aware of, even across seemingly similar disciplines. For example, accounting and finance may seem similar but can have drastically different ways of interpreting monetary data: the way accountants and financial analysts calculate when a company is at a net zero (i.e., a break-even point) between revenues and costs is different. Even within the same field of research, transparency about how data is collected is important for peer review – whether for ethics accountability or for identifying methodological flaws within research. An incomplete set of data can make it difficult or impossible to know whether the data was collected in a way that prevents bias, and further whether the data is accurate and/or precise.
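As one concrete illustration of the break-even idea mentioned above (hypothetical numbers): the standard accounting break-even point is fixed costs divided by the contribution margin per unit, whereas a financial analyst might instead solve for the sales level at which a project's net present value is zero, which also accounts for the cost of capital.

```python
# Accounting break-even sketch with invented numbers:
# units needed = fixed costs / (price per unit - variable cost per unit)
fixed_costs = 50_000.0          # annual fixed costs
price_per_unit = 25.0
variable_cost_per_unit = 15.0

break_even_units = fixed_costs / (price_per_unit - variable_cost_per_unit)
print(break_even_units)  # 5000.0 units per year to cover costs
```

Two analysts looking at the same revenue and cost data can thus report different "break-even" figures, which is exactly why documenting the calculation method matters.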

Failing to document data and data collection methods can also create problems when reproducing research or using the data for further work, particularly if things such as question types, experiment conditions, and units of measure are not properly documented. For example, while the hypothetical idea of cold fusion (nuclear fusion performed at room temperature) would be a low-cost energy solution, the experimental methods and data behind the original claims were not recorded, and as a result the concept is now widely viewed with skepticism. A less extreme case where incomplete documentation may cause research problems is survey design: the way a survey is constructed can bias responses. Therefore, documenting how a survey was written can be helpful in evaluating why a research study came to a specific conclusion, as well as in testing whether changing questions, or even question order, would change the results.

Furthermore, data cleaning – the process of reformatting or fixing incorrectly formatted, corrupted, or otherwise unusable data so that it can be analyzed – can also introduce statistical bias through choices such as eliminating outliers, accidentally losing a variable, or how you decide to categorize your data. Therefore, documenting how you clean your data is also a critical component of research – explaining which outliers you decided to keep or remove, and why, can help you and other researchers down the road. It is also important to consider the order in which questions are asked and the way questions are worded when conducting surveys. While it might seem counterintuitive at first, question order and wording can impact the percentage of people that respond a certain way, whether potential participants qualify for research projects, and even the numeric values of the data itself.
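A small illustration (with made-up numbers) of why cleaning choices must be documented: a single rule, such as dropping values more than two standard deviations from the mean, can substantially change the summary statistics other researchers will see.

```python
# One common cleaning rule: drop points beyond 2 standard deviations.
# The data below is invented; 95 might be a data-entry typo... or real.
import statistics

raw = [10, 12, 11, 13, 9, 95]
mean = statistics.mean(raw)
sd = statistics.stdev(raw)
cleaned = [x for x in raw if abs(x - mean) <= 2 * sd]

# The same dataset tells two very different stories depending on
# whether (and how) the outlier rule was applied.
print(statistics.mean(raw))      # 25
print(statistics.mean(cleaned))  # 11
```

Without a note saying "we removed values beyond 2 standard deviations," a reader has no way to know which mean the analysis rests on, or whether the removed point was a legitimate observation.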

Data Presentation and Visualization

Most people have probably heard the phrase “label your axes” at some point, even before college. It is often mentioned in K-12 education, with the premise being that no one will know what your graph(s) are depicting without them. While this is indeed correct, labeled axes constitute only one of many components of data presentation and visualization.

Figure 1: Axes that are labeled!

A good place to start on the ways that data visualizations can best be implemented is The Data Visualisation Catalogue. While the site was originally established with graphic designers in mind, Severino Ribecca himself stated, “I felt it would also be beneficial to…anyone in a field that requires the use of data visualisation.” (Ribecca n.d.) As such, almost anyone who uses data typically has to consider how to visually communicate it to an outside audience, or even to the general public outside the realm of academia. A nifty feature of The Data Visualisation Catalogue is that you can filter the recommended data visualization types by the concept you are trying to demonstrate.

One consideration when looking at a data visualization is whether the data is represented in a way that is appropriate for that specific data type. While it might not seem like the presentation would differ between data types, certain visualizations more accurately and sufficiently depict certain kinds of data. For instance, time-series data and Geographic Information Systems (GIS) mapping data are distinct data types. While they can be combined and represented in the same graphic (e.g., how has the land of a certain area changed over time?), each has its own issues to consider to make sure that you are not creating misleading graphics. Namely, one cannot make a map with time data alone, and a map would be hard to make from a line graph that is meant to show trends over time.

Furthermore, the scales and units used in a data representation are also important considerations! Using our previous example, the visual scales of a map differ from the visual scales of time-series data. For instance, you can get drastically different data visualizations if you transform data from a linear scale to a logarithmic scale (i.e., a scale that plots data based on the exponent that would be needed to get your number back). This can be useful when the data you are working with spans such a large range that it is hard to see everything efficiently. For example, a logarithmic scale of time, where millions of years are condensed into smaller numbers that are easier to conceptualize, leads to graphs where you can see things like different geological eras.
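A quick sketch of that transformation: base-10 logarithms map values spanning six orders of magnitude onto the compact range 0 to 6, which is what makes a log axis readable where a linear one would not be.

```python
# Compressing a huge range of values onto a log scale.
import math

values = [1, 100, 10_000, 1_000_000]
log_values = [math.log10(v) for v in values]

# On a linear axis the first three values would be invisible next to
# the last; on a log axis they are evenly spaced (roughly 0, 2, 4, 6).
print(log_values)
```

The same trick underlies log-scaled chart axes: plotting libraries simply apply this transformation to the axis positions rather than to the data itself.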

On a more human scale, while logarithmic scales can be used to misrepresent data, a far more common tactic for misrepresenting data involves a truncated or broken axis on a graph (Figures 2a and 2b): a truncated graph deliberately does not start at zero on the y-axis, while a broken axis subtly skips a large number of units. This tactic shows up in some graphics that news outlets use, whether intentionally or not. Other hallmarks of misrepresented data include plotting two graphs that are not on the same scale, or zooming a scale in to make a trend look far larger than it truly is.


Figures 2a and 2b: Graphical Examples of a graph with a broken axis and a graph with a truncated axis, respectively
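The distorting effect of a truncated axis can be shown with simple arithmetic (hypothetical bar heights): the visual height of a bar is its value minus wherever the axis starts, so moving the axis start changes the apparent ratio between two bars without changing the data.

```python
# Two bars of 50 and 52 units, drawn on a full axis (starting at 0)
# versus a truncated axis (starting at 49). Numbers are invented.
a, b = 50, 52
axis_start_full, axis_start_truncated = 0, 49

# Apparent height ratio of bar b to bar a under each axis choice.
full_ratio = (b - axis_start_full) / (a - axis_start_full)
truncated_ratio = (b - axis_start_truncated) / (a - axis_start_truncated)

print(full_ratio)       # 1.04 -> b looks 4% taller
print(truncated_ratio)  # 3.0  -> b looks three times taller
```

Same data, wildly different visual impression, which is exactly why axis choices deserve scrutiny when critically evaluating a graph.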

While there are many examples of distinctly misleading graphs, there are also many graphs that accurately portray the data but use an incompatible or inaccessible color palette. Many palettes used in data visualizations are inaccessible to those with vision impairments such as green-red and blue-yellow color blindness. Utilizing distinct colorblind-friendly palettes helps make visualizations more accessible, and adding alt-text descriptions of what a graph shows enhances the ability of screen readers and other tools used by people with low vision or blindness to interpret the visualization. Thus, being hard to see or aesthetically displeasing does not by itself make a graph misleading – an important distinction to make, although the two are not mutually exclusive!


Figure 3: A “Painbow” Graph

Data ethics

When examining a dataset, it is also important to consider whether there are any biases present that may affect interpretation of the data. Two common categories of biases are cognitive biases and statistical/algorithmic biases.  Cognitive biases involve individuals interpreting the results of a study to best fit a specific narrative. This may involve a data producer deleting data that does not fit the conclusion that they are trying to prove. At the same time, a data producer may also add data that is not accurate in an attempt to strengthen their claims. Furthermore, studies may be designed to collect data that only represents a small subset of a population, while claiming to be representative of the entire population. 

In contrast to cognitive biases, statistical/algorithmic bias refers to a sample poorly representing the population it is drawn from. This kind of bias is significantly mitigated (if not outright eliminated) when data collection methods are not statistically biased. It is particularly noticeable when examining artificial intelligence (AI) algorithms. These algorithms are often trained on unequal datasets, which then leads to skewed results when performing data analysis with them. Therefore, when examining data output by an algorithm, one should consider whether the algorithm was trained on accurate and balanced data sets. An industry where statistical and algorithmic biases are extremely important to consider is healthcare. For example, many hospitals use artificial intelligence to sort through patient data, which helps doctors determine who needs immediate emergency attention. While there are many benefits to such algorithms, there have been issues in the past because of them. In certain instances, if a patient has pre-existing medical conditions that affect their health, the algorithm will not be able to take that into account. In addition, many algorithms commonly used in healthcare systems are racially and gender biased. As mentioned in “Algorithmic Bias in Health Care Exacerbates Social Inequities — How to Prevent It” by Katherine Igoe, “algorithms in health care technology don’t simply reflect back social inequities but may ultimately exacerbate them.” Igoe also notes that certain prediction algorithms used for detecting heart disease were biased in their design. For example, the Framingham Heart Study cardiovascular risk score worked very well for Caucasian patients, but not for African American patients, because around 80% of the collected data used for this algorithm came from Caucasian patients. Utilizing such an unequal dataset to train an algorithm can lead to unequal care and treatment in medical practice (Igoe). This is just one of many examples of bias resulting from algorithm design.
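A small sketch (with made-up labels) of how an unequal dataset can hide bias behind a headline accuracy number: on data where one group makes up 95% of the cases, a model that effectively ignores the minority group can still report excellent accuracy.

```python
# Invented example: 95 majority-class cases (label 0) and 5
# minority-class cases (label 1). A "model" that always predicts
# the majority class looks accurate but fails the minority entirely.
labels = [0] * 95 + [1] * 5
predictions = [0] * 100  # always predict the majority class

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
minority_recall = sum(
    p == 1 for p, y in zip(predictions, labels) if y == 1
) / 5

print(accuracy)         # 0.95 -- looks great on paper
print(minority_recall)  # 0.0  -- the minority group is never identified
```

This is why evaluating an algorithm only on overall accuracy, without breaking results down by subgroup, can let exactly the kind of bias described above go unnoticed.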

Companies such as Amazon have also faced significant problems relating to algorithmic bias. A few years ago, Amazon tried to use an artificial intelligence algorithm to hire new employees. However, it turned out that this algorithm was biased against women: it was trained on resumes submitted during a period when male applicants significantly outnumbered female applicants, which ultimately caused the algorithm to favor men over women.

Conclusion

Critical evaluation of data is an extremely important skill for any student or professional to have. Knowing how to check the reliability, accuracy, and bias of any data set is necessary when reading or working with data. Some questions to keep in mind are: Is the collection method clear and documented? Is the data visualization appropriate for the dataset and for what the author is trying to represent? Was bias introduced at the collection or visualization stages? It is important to evaluate data to ensure that we are using quality, accurate data to make sound decisions and conclusions.

Works Cited

This blog post was written by William-Elijah Clark (Senior STEM Data Fellow) and Reagan Bourne (STEM Data Fellow) from FSU Libraries.

STEM Data Fellow Spotlight: Diego Bustamante

For Love Data Week 2022, we are highlighting our FSU STEM Libraries Data Fellows! These posts, written by the fellows themselves, tell their stories of how they became interested in data-related work and their experience as a data fellow to this point. Today’s post is contributed by Diego Bustamante.

Prior to my role as a Data Fellow, my idea of data was defined by my previous work with quantitative data collected from laboratory experiments. For example, when I worked as a Research Assistant, I recorded quantitative data for chemistry experiments, like mass, temperature, volume, etc. I then conducted statistical analysis on the data in order to draw conclusions from each experiment. I personally enjoy collecting and analyzing data, especially because it can lead to many scientific and technological advancements!

While searching for jobs in FSU’s NoleNetwork in summer 2021, one job title that immediately caught my attention was “FSU STEM Libraries Data Fellow.” The job description was unique amongst the other jobs offered on campus. As a data fellow, I was offered the opportunity to develop several professional skills: providing data reference, co-hosting programming language workshops, writing and publishing blog posts, and many more. I felt like it was a great opportunity and a good fit with my previous experience and skills, and so I decided to apply. Thankfully, I was selected as one of the inaugural data fellows, leading to a journey of professional and personal development that has thus far surpassed my initial expectations.

One of my first tasks in the program was meeting with different librarians at FSU Libraries. In these meetings I was able to learn about different methods and applications for data analysis in a variety of disciplines. For example, I learned that the Digital Humanities Librarian uses text-mining software to find specific words in books published in the 1800s. She used the data drawn from the software to analyze certain traits of a story by counting the number of times a character participates in a given type of interaction. This experience helped me realize that qualitative data sets can be used to draw similar conclusions about a study as quantitative data.

Another concept that I have become familiar with while working as a Data Fellow is open data. We discussed this concept during a workshop where we talked about the potential benefits of making research data openly accessible to the wider research community. Initially, I was hesitant regarding the concept of open data, because I saw academic research as a “race” to find a solution to a given problem. However, further discussion of how researchers are compensated for sharing their data made me realize that it is possible to benefit from open data on a personal and global level. 

Currently, I am still learning about the many different types of data, its definitions, applications, and its importance. I am also working on developing an open source Canvas module on MATLAB where I explain the basics of the math based programming language in a student friendly manner. I look forward to sharing more about this work in the future!

STEM Data Fellow Spotlight: William-Elijah Clark

For Love Data Week 2022, we are highlighting our FSU STEM Libraries Data Fellows! These posts, written by the fellows themselves, tell their stories of how they became interested in data-related work and their experience as a data fellow to this point. Today’s post is contributed by William-Elijah Clark.

It’s hard to say exactly when I first got interested in data. After all, my mother was a statistician, so I’ve been surrounded by data since I was in elementary school — from Arkansas Department of Health public health and mortality statistics to Disney World focus groups and market research. Personally, I started liking statistics when I took UCF’s equivalent to QMB 3200 and Econometrics. This experience extended into being a research assistant at UCF, and even into conducting and monitoring surveys at Universal Orlando Resort! Through my Econometrics course and additional professional development opportunities at Universal, I was also able to gain experience with R (although I didn’t learn it to the extent that I would call myself a professional data analyst or a data scientist).

Due to the COVID-19 pandemic and subsequent lockdowns in Orlando back in 2020, I decided to go back to school here at Florida State University for Statistics, especially considering that FSU has a SAS coding certificate! Overall, I came to Florida State University with over two years of professional survey experience between academia and hospitality industry work.

I spent time in 2020 taking calculus courses and statistics electives here at FSU to hone my data analysis skills further. I then saw an opportunity to apply for a FSU Libraries data fellowship beginning in Fall 2021. I decided to apply, as this position would give me the opportunity to utilize some of the skills I obtained from my previous positions and coursework at UCF and FSU, and hopefully develop some new skills to further myself in my goals of becoming a data analyst (and hopefully even an econometrician).

So far in my fellowship here at FSU Libraries, I have had the opportunity to gain some experience with MATLAB and SQL through the Data @ Your Desk workshops at Dirac, as well as some experience writing surveys in Qualtrics (as opposed to just conducting and monitoring surveys). I’ve also had the opportunity to learn more about citation management, library research, and data management. I’ve even been able to explain concepts for MS Excel to a patron via the online “Ask a Data Librarian” feature on the FSU Libraries website. This all said, I’m looking forward to applying some of my previous R coding and statistical analysis skills to some survey data for FSU Libraries this semester.

Love Data Week: Data is for Everyone

By: Dr. Nick Ruhs

INTRODUCTION

It’s once again time for Love Data Week!  LDW is a yearly, international outreach event taking place the week of Valentine’s Day (February 14-18 this year). The week is focused on promoting good data stewardship and best practices around working with and interpreting data. LDW was started in 2015 and is currently celebrated by academic libraries and data organizations around the world. While every institution celebrates in their own way, common activities include data workshops, social media outreach, and more! 

Each year, a theme is chosen around which organizations can theme their Love Data Week activities. For 2022, the theme is “Data is for everyone.” This year, we are shining a light on the “people-side” of data, and on how different folks use and interact with data. Data often means something different to everyone, and how someone interacts with data varies based on their chosen discipline, research project, life experiences, and their own beliefs and values. There are also often inherent biases that exist in data collection, analysis, and interpretation, which can affect one’s own impression of a dataset. Despite these differences, the ability to critically evaluate data and interact with it is a universal skill that is crucial for everyone. 


How do the Pros do Data Analysis?

By: Diego Bustamante and William-Elijah Clark

INTRODUCTION

As technology continues to evolve, the infrastructure needed to run it grows more and more sophisticated. Processes and tasks carried out by personal computers, smartphones, and appliances are increasingly automated and run with minimal input from the user. This is made possible through code developed in one or more computer programming languages. However, with the increase in the quantity of software and programming applications, the demand for programmers and the number of languages they are required to learn have increased. Furthermore, many employers now require skills in data analysis and computer programming as prerequisites for job applications. In this blog post, we will discuss the most in-demand languages in the market and give a brief explanation of each. (Grand Canyon University 2020; Jiidee 2020; Meinke 2020; University of California – Berkeley, n.d.)
