A Short Dive Into Data Science

Introduction

Data science is a multidisciplinary field that combines math, statistics, programming, and artificial intelligence to analyze and discover insights embedded in large amounts of data. These insights can then be used to help data scientists ask and answer questions about data, such as: what happened, why it happened, what will happen, and what can be done with the results (Amazon Web Services). Data science combines different tools and technology to create meaning from and interpret data. With the accelerating volume of data being produced, data science is widely used to process and discover patterns in data through a variety of statistical techniques.

The term data science was first used in the 1960s to describe a new profession focused on the interpretation of data. However, data science during this time period was extremely different from how data science is viewed today. Starting in the 2000s, companies began to see data as a commodity on which they could capitalize. This resulted in a need for data scientists who could analyze large quantities of data and who had the skills to access, understand, and communicate insights from it. In more recent years, the field of data science has grown tremendously, and it is needed in every industry that generates or relies on data.

Data science has reshaped the way that we collect, store, and analyze data, and the resulting discoveries have transformed several industries and improved people’s lives. For example, in finance, data science can help financial institutions make more accurate predictions and manage risk. In the healthcare industry, data science allows researchers to predict and prevent certain diseases. In marketing, data science allows companies to personalize ads based on the data that they have about consumers. However, as data science becomes more commonly used in society, it is important to consider the ethical implications behind the collection and analysis of large amounts of data. Subjects such as bias, transparency, and privacy are extremely important and relevant in this field.

One important ethical consideration is bias. Data science algorithms can unintentionally inherit bias from the datasets that they are trained on. This can lead to discrimination and inequality towards certain groups represented in a dataset. Transparency is also an important ethical consideration. More specifically, some individuals may not understand how data science is used to make decisions that may affect them. This can lead to individuals having concerns about technology and how it is being used. Finally, data privacy is a critical issue, particularly as the amount of data being collected continues to grow rapidly. It is essential to ensure that individuals’ personal information is protected and that data is only collected in ethical ways. Overall, while data science has had many positive impacts on society, it is also important to prioritize ethical considerations within data science to ensure that these advances continue to benefit society.

What skills are needed in data science?

To quote the US Bureau of Labor Statistics, “Data scientists use analytical tools and techniques to extract meaningful insights from data.” However, this is an incredibly broad statement by itself. Oftentimes, it can be easier to think about data science in the context of what types of tasks are done within the discipline. Furthermore, there are several different articles on what skills data scientists need, often with similar but imprecise wording. However, there is some general overlap in the skills that are required or recommended for those considering a career in data science. 

Quantitative Skills

Broadly speaking, an understanding of mathematics and/or statistics is important for a data scientist. While a data scientist is not necessarily a mathematician or statistician, data scientists often use similar (if not the same) tools to calculate mathematical models that can be used for prediction or to make inferences about what correlations exist within the data. 
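As a small illustration of the kind of calculation described above, the sketch below computes a Pearson correlation coefficient, one common way to measure how strongly two variables move together. The function name and the study-hours data are hypothetical and only meant as an example; a working data scientist would more likely reach for an existing statistics library.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least two values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of each pair's deviations from the means.
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one variable never changes")
    return covariation / sqrt(spread_x * spread_y)

# Hypothetical data: hours studied and exam scores for five students.
hours_studied = [1, 2, 3, 4, 5]
exam_scores = [52, 58, 61, 70, 74]

r = pearson_r(hours_studied, exam_scores)
print(f"correlation: {r:.2f}")
```

A value close to 1 suggests that the two variables rise together, a value close to -1 suggests that one falls as the other rises, and a value near 0 suggests little linear relationship. Correlation alone does not show that one variable causes the other.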

Additionally, there is a general consensus that data scientists need to know about data cleaning/wrangling/scrubbing, etc. Regardless of the terminology used to describe it, knowing how incomplete and complete data can and cannot be used is a critical skill. Understanding how pieces of data will interact with each other can give a data scientist a far more efficient understanding of what they can and cannot infer from the data, as well as how a computer would store the data they need to analyze. While this is typically listed as a separate skill, it often relies on linear algebra to work.
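To make the idea of working with incomplete data more concrete, here is a minimal sketch of two common cleaning choices: dropping rows that have missing values, or filling a missing value with the average of the values that were observed. The records, field names, and helper functions are all hypothetical, and which choice is appropriate depends on the data and the question being asked.

```python
from statistics import mean

# Hypothetical survey records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": None, "income": 61000},
    {"id": 3, "age": 29, "income": None},
    {"id": 4, "age": 41, "income": 48000},
]

def drop_incomplete(rows):
    """Keep only the rows in which every field has a value."""
    return [row for row in rows if all(value is not None for value in row.values())]

def impute_mean(rows, field):
    """Return copies of the rows with missing `field` values replaced by the observed mean."""
    observed = [row[field] for row in rows if row[field] is not None]
    fill = mean(observed)
    return [{**row, field: fill if row[field] is None else row[field]} for row in rows]

complete_rows = drop_incomplete(records)   # fewer rows, but no gaps
filled_rows = impute_mean(records, "age")  # same number of rows, "age" gaps filled

print(len(complete_rows), "complete rows out of", len(records))
```

Dropping rows is simple but throws information away, while filling in averages keeps every row at the cost of making the data look less varied than it really is.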

Programming/Coding

It should be noted that someone can be a statistician or a mathematician without knowing how to write a single line of code. In fact, according to the Bureau of Labor Statistics, 13% of mathematicians work in education, as do 9% of statisticians. Oftentimes, quantitative work in this realm is less reliant on code and more about effectively communicating quantitative concepts. However, a professional data scientist absolutely must know how to program, as it is a de facto requirement of almost any data science position. While it would theoretically be possible to calculate statistical trends by hand, this would be impractical when dealing with the current reality of big data. Attempting to construct even one mathematical model with all available data could take a lifetime at best. Consequently, it is far more efficient to have a computer run all of the minute arithmetic calculations instead of working through everything manually.

The list of programming languages a data scientist might need is long enough to warrant its own blog post (in fact, we have covered some in a previous blog post). Due to the sprawling number of programming languages available, it is generally more important to understand the fundamentals of programming and pick up programming languages as needed. While we cannot make a recommendation for every situation, Python, R, and SQL are good languages for a strong data science foundation. These tend to be the top three languages for most data science positions, all else equal. Even further, artificial intelligence (and more specifically, machine learning), which is gaining prominence in modern data science, can be implemented in the first two.

Business Acumen and Communication

Surprisingly, business acumen (also known as domain knowledge) is frequently mentioned as an essential skill for data scientists. A more generalized way of thinking about it would be “understanding data in context”. In brief, domain knowledge involves knowing what the data is and what it means in the context of your work. It may be easy to think about what the purpose of your data analysis is when you are responsible for the beginning and end of a project – however, jumping into the middle of a project will require some knowledge of what the data is before you can proceed to do anything with it. 

Similar to business acumen, data scientists also need to know how to communicate their results so that others may understand the insights gleaned from their work. At first glance, one would assume that this predominantly refers to explaining predictive outputs in both verbal and written communication that anyone outside of the field can understand. Those two skills are important — however, visual communication and literacy can be included as well. The ability to make data visualizations that effectively show what trends are in the data is another critical skill within data science, as graphs can communicate far more ideas at once than words/numbers alone can. Further, data visualizations can bypass the need for an audience to understand more complex quantitative concepts.

Miscellaneous Skills and Addendums

While the above skills are frequently mentioned in many articles describing critical skills for data scientists, there are some notable mentions of other skills. For instance, DevOps (a set of practices that brings software development and IT operations together, often discussed alongside project management) is considered to be critical enough to the discipline itself that an article about using it as a framework for modern data science work exists (Saxena et al. 2021). While DevOps is the approach most commonly mentioned in this context within data science, other aspects of project management are also valuable to have.

Further, being able to work with big data and its infrastructure is critical (Georgia Tech Professional Education, 2023). As more and more data is collected by private and public entities, experience with data files that are not easily navigable is of the utmost importance. Additionally, knowing how to locate the data you need from large data warehouses is important regardless of how much data you need.

As mentioned in the introduction, concerns about bias in data and ethical data use have ramifications for consumer privacy. This is such a notable concern that Harvard Business School’s own page for data science skills mentions ethics as something that should be kept in mind due to how many dilemmas can arise from data breaches, biased algorithm development, and the use of consumer data without consent, among other issues (HBS Online, 2020).

What industries/fields use data science?

As mentioned in the introduction, data science is a combination of the fields of mathematics, statistics, and computer science. Data scientists generally use programming languages such as R and Python to derive analyses and make visualizations from the data given to them. Data science is often seen as a subset of computer science, which focuses on developing new software to perform certain tasks. This can include making a new gaming application, designing a website, or developing a new programming language. Data scientists generally try to make software to execute a computer model or to visualize data, utilizing statistical techniques in the process.

Furthermore, data science and statistics are generally similar fields with a lot of overlap. For the most part, both data scientists and statisticians aim to make models to analyze data. However, data scientists may place a heavier emphasis on data visualization techniques, while statisticians may use more math-heavy models. For example, a statistician may lean more heavily on models such as linear regression, while a data scientist may focus more on models such as neural networks, which are loosely inspired by the human brain and are used to make predictions. That being said, a statistician can almost certainly become a data scientist with a little extra training and vice versa.

Data science can be used in almost every business worldwide and has many specific applications; there are too many to list in this blog post. Here, we will briefly describe a few of the more common industries and/or fields that use data science. 

One of the most common uses of data science is in the stock market. When looking at trends in the stock of a company, data scientists for stockbroking companies can analyze the stock data of the past and use that to predict how much that stock will be worth in the future. These predictions can then be used to help determine whether or not the company should buy more or less of that stock. 

Another example of data science is simply predicting future revenue, which can be used in virtually every business. Given past data such as previous prices, how well the economy has been doing as a whole, and many other variables, data scientists can predict how much revenue/profit a company may receive in the future. Using this, companies can determine whether or not they would want to set a certain price for their product or even continue with the business as a whole. 
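One simple way to picture revenue prediction is to fit a straight line through past revenue figures and extend it one step into the future. The sketch below does this with ordinary least squares. The quarterly numbers and the function name are hypothetical, and a real forecast would account for many more variables (prices, the wider economy, seasonality) than time alone.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for the line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical quarterly revenue, in thousands of dollars.
quarters = [1, 2, 3, 4, 5, 6]
revenue = [100.0, 108.0, 115.0, 121.0, 130.0, 138.0]

slope, intercept = fit_line(quarters, revenue)
forecast_q7 = slope * 7 + intercept  # extend the fitted line to the next quarter

print(f"estimated growth per quarter: {slope:.1f}")
print(f"forecast for quarter 7: {forecast_q7:.1f}")
```

The slope estimates how much revenue changes per quarter on average, and the forecast is only as trustworthy as the assumption that the past trend continues.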

Data science can also be used at almost every company for various HR applications. HR analytics can help a company predict turnover or recruit the right candidates. For example, if a data scientist can analyze past data to determine whether a specific role has higher turnover, the company can use that analysis to change the role and help decrease turnover. Furthermore, data science can be used to determine which candidates may be the best fit for the job by looking at variables from their application or resume. For instance, if employees with great people skills tend to perform better, analysis can be done to determine which candidates have good people skills that can then be translated into a more productive work environment.
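The turnover comparison described above can be sketched as a simple grouped calculation: count how many people in each role left, then divide by how many held that role. The employee records and field names here are hypothetical, and a real analysis would also consider time periods, sample sizes, and the privacy of the people in the data.

```python
from collections import defaultdict

# Hypothetical HR records: one entry per employee, with whether they left the company.
employees = [
    {"role": "support", "left": True},
    {"role": "support", "left": True},
    {"role": "support", "left": False},
    {"role": "support", "left": True},
    {"role": "engineering", "left": False},
    {"role": "engineering", "left": True},
    {"role": "engineering", "left": False},
    {"role": "engineering", "left": False},
]

def turnover_by_role(rows):
    """Return the share of employees who left, for each role."""
    counts = defaultdict(lambda: [0, 0])  # role -> [departures, headcount]
    for row in rows:
        counts[row["role"]][0] += int(row["left"])
        counts[row["role"]][1] += 1
    return {role: departures / headcount for role, (departures, headcount) in counts.items()}

rates = turnover_by_role(employees)
highest_turnover_role = max(rates, key=rates.get)

for role, rate in rates.items():
    print(f"{role}: {rate:.0%} turnover")
```

A role that stands out in a table like this is a starting point for asking why people leave, not an answer in itself.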

Finally, an application of data science that you’re likely well aware of is targeted advertisements. This is utilized by companies such as Facebook, Google, and Spotify. By analyzing data from your previous searches, likes, listening history, and various other sources, these companies can show you the advertisements most applicable to you. For example, if your neighborhood was hosting a baseball tournament, your neighbors might search for items such as “baseball bats” and “how to hit a baseball” on Google. Using this data, and seeing that you’re nearby, Google may then direct advertisements to you for baseball bats and baseball courses.

Conclusion

If one wanted to be brief, data science could be simply described as the hybrid discipline of computer science, mathematics, and statistics. More specifically, data science can be used to take large amounts of organized and unorganized data and make predictions about what might happen in the near-ish future. Those large amounts of data can also be used as a descriptive explanation for phenomena that are currently happening, but would otherwise be invisible to us. Furthermore, those predictions can be used to make business decisions, public policy, predict future gameplay moves/outcomes, or even help create critical medical interventions. However, the above skills can also cause harm if used without consideration for the data itself. Regardless of where or how data science is applied, all of the above skills are necessary to do work in this field.

While we have discussed the use of data science in industries such as financial markets and HR analytics, the applications of data science go far beyond these two fields alone. In fact, the list of fields that do not use data science at all would probably be shorter than the list of fields that do. For instance, one of the graphic design industry’s oldest professional organizations (the American Institute of Graphic Arts, or AIGA) released a report in 2018 on the data economy and how data science will impact graphic artists (Davis 2018). While it is not an industry that is traditionally thought of as handling data, graphic designers also produce infographics (Stinson 2021). Furthermore, data mining is a critical skill within data science with applications in anything from market research (Hall n.d.) to finding unused content within video game files (BBC Newsround 2020). If there is an area someone is interested in, they can look up “[field of interest] data science” and probably find how they can use data science in that realm.

At the end of the day, though, understanding data science beyond what we have outlined in this post may require experimenting on your own with each of the bits and pieces that make it work. To quote notable data scientist Claudia Perlich, “Learning how to do data science is like learning to ski. You have to do it.” Understanding more about what data science is may require the very same thing: reading more about it and doing it for yourself.

Works Cited

For a list of references, please use this link: https://bit.ly/3NF30zQ

This blog post was written by William-Elijah Clark (Senior STEM Data Fellow), Sahil Chugani (STEM Data Fellow), and Reagan Bourne (STEM Data Fellow) from FSU Libraries.
