What is ‘Big Data’ Anyway?


By: Diego Bustamante and William-Elijah Clark

Maybe you’re on Twitter one day and search ‘#Statistics’ to look up some information for your Introductory Statistics course. Before you know it, you scroll through and see several tweets that are also marked with ‘#BigData’, and you’re left with more questions than you had when you started your search. Maybe you try to search for “big data” on Google, see the definition from Oxford, and are then left with even more questions: 

  • How large is “extremely large?”
  • What kind of patterns, trends, and interactions are we talking about?
  • What isn’t big data?

Big data as a term has become synonymous with the growth of digital data and the glut of information available to researchers and the public. Furthermore, there is growing interest from both the public and private sectors in using large datasets to gain insight into market trends and to improve decision making. However, the exact definition of big data is sometimes unclear and can vary widely depending on whom you ask. Businesses, nonprofit organizations, government agencies, and academic researchers each view big data in a different context and with different goals for its use (University of Wisconsin Data Science, n.d.).


Above: a Google Trends graph that shows the number of searches for the term “Big Data” from 2007 to 2017

In this blog post, we aim to provide clarity and insight into the origins and definitions of big data.  We will also discuss the potential benefits and challenges surrounding big data. In doing so, we will provide some examples linking big data to applications or data that you may interact with on a daily basis.

The Origin of the Term “Big Data” 

While today you’re most likely to encounter the term “big data” in social media hashtags and in the news, the concept predates social media as we know it. Although the exact origins are vague, the term “big data” was coined in four separate instances, independent of each other, between the mid-1990s and the early 2000s (Diebold, 2012). John Mashey first used “big data” while working for Silicon Graphics in the mid-1990s. Mashey did not record the term in a formal academic paper, but he gave several graphics industry conference presentations and talks in California from the mid- to late 1990s (Lohr 2013), one of which included a PowerPoint presentation called “Big Data…and the Next Wave of InfraStress” (Mashey 1999). Explaining his choice of words, Mashey stated, “…I wanted the simplest, shortest phrase to convey that the boundaries of computing keep advancing.”

Mashey’s general intent is also reflected in Weiss and Indurkhya’s book, Predictive Data Mining: A Practical Guide, in which they comment that “big data” was starting to emerge as a consequence of more powerful computing technology becoming available at more affordable prices in the 1990s. They also discussed the practical usability of big data: although big data is too big for people to process alone, it could still be useful (Weiss and Indurkhya 1998). In the field of econometrics, Francis X. Diebold made a similar observation regarding the quantity of data (2000), referring to the exponential increase in how much data, and how many kinds of data, are available and possibly even useful. Unlike Mashey, Weiss, and Indurkhya, Diebold proposed a distinction for what constitutes the “big” of “big data”: data is no longer best measured by the number of observations, but by the size of the files that contain the data (Diebold, 2000).

The cover for Predictive Data Mining: A Practical Guide

Building on these early definitions and taking into account the rapidly changing nature and availability of data, researchers have attempted to refine and develop the scope of big data. For example, the University of Wisconsin (n.d.) defines big data as “consisting of extensive datasets that require a scalable architecture for efficient storage, manipulation, and analysis.” A more quantitative definition would theoretically include a specific file size. However, no size threshold at which data becomes “big” has been established, and any such figure is more of a guideline than a rule; the file space for a website server is a good example of the scale involved.

Diebold’s own 2012 paper on the term “Big Data” also gives a secondary description of “Big Data” as a discipline, in addition to the type of data itself. Similarly, the American Statistical Association (n.d.) describes big data as “the collection and analysis of data sets that are complex in terms of the volume and variety, and in some cases the velocity at which they are collected.” In other words, big data (the discipline) is the analysis of big data (the large datasets themselves). From that, it’s no wonder that Big Data is such a complicated term! This leads us to yet another question about big data, which is…

How “Big” is “Big Data”?

As mentioned earlier, the “big” in “big data” doesn’t refer to a specific number of observations or data points. While one could argue that file size determines whether a dataset can be considered “big,” file size alone does not paint a clear picture, and there is no specific threshold to compare against. However, there are some general guidelines for what defines the “Big” of Big Data, often referred to as the “Three V’s” (Diebold, 2012; Laney, 2001; Kitchin and McArdle, 2016):

● Volume: Consisting of enormous quantities of data 

● Velocity: Created in real-time 

● Variety: Being structured, semi-structured, and unstructured 

Kitchin and McArdle (2016) noted that others have expanded on the Three V’s to describe big data. However, the Three V’s are so ubiquitous that they are even referenced on the websites of data analysis software companies such as Oracle (Oracle n.d.). Thus, for our purposes, we will focus on describing these terms. Notably, while Volume is defined in terms of dataset file sizes rather than number of observations and remains vague, the concepts of Velocity and Variety are a bit more straightforward.

Above: A diagram of the three V’s in a radial format (GRC Data Intelligence n.d.; Soubra 2012)

Data Velocity predominantly refers to how often the data is generated. For instance, survey samples collected in the field are an example of low-velocity “small” data, whereas real-time transaction records on large retail sites such as Amazon would count as “big” data. Other examples of high-velocity data include real-time COVID-19 ICU bed availability across the United States, real-time vote counts during election season, and counts of people going in and out of a business, library, or other public place at any given moment. Data Variety, on the other hand, refers to the types of data that are generated. While the data that you would run into in most statistics classes is a tabular dataset, Big Data includes things such as YouTube videos, Instagram photos, cell phone location tracking on a map, audio files, and many other non-numerical inputs.
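As a rough illustration of why Volume is usually discussed in terms of file size rather than number of observations, the short Python sketch below compares two made-up datasets with the same number of records. The record layouts, the 10,000-character payload, and the helper name serialized_size are all assumptions chosen for illustration; none of them come from the sources cited in this post.

```python
import json


def serialized_size(records):
    """Total size in bytes of the records when written as UTF-8 JSON lines."""
    # +1 accounts for the newline that would terminate each line in a file.
    return sum(len(json.dumps(r).encode("utf-8")) + 1 for r in records)


# Hypothetical survey responses: small, tabular records.
survey = [{"id": i, "age": 30, "answer": "yes"} for i in range(1000)]

# Hypothetical media-like records: same number of observations,
# but each one carries a large non-tabular payload.
media = [{"id": i, "payload": "x" * 10_000} for i in range(1000)]

print("survey:", len(survey), "records,", serialized_size(survey), "bytes")
print("media: ", len(media), "records,", serialized_size(media), "bytes")
```

Both datasets have the same number of observations, yet the second takes up far more storage, which is the sense in which counting rows alone can understate how "big" a dataset is.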

What are the Pros and Cons of Big Data?

In brief, there are a lot of benefits to using big data—but there are also costs.

The insights gained from big data can be incredibly useful, not just for businesses but also for individual people. For instance, a Fitbit watch can continuously track assorted health data so that people can make real-time decisions about their exercise habits. Furthermore, a single patient in a hospital can generate up to 1 TB of health data. Big data can also be used to predict agricultural yields for crops and livestock, consumer spending habits, and crime (Heggie 2019). It can even be used to help musical artists break out in the music industry via music streaming algorithms, as with Billie Eilish, Lil Nas X, and Olivia Rodrigo! (McCabe 2019)

Big data also has a wealth of uses in almost every area of business. Some applications in business and finance include more efficient pricing, financial fraud detection, more accurate actuarial calculations for insurance, and even customer experience optimization. Additionally, while big data may pose some technological obstacles, it also offers access to third-party data at faster rates than ever before, which allows for more accurate analytic processes than could be accomplished without big data (Geeks for Geeks 2020).

In contrast to the aforementioned benefits, there are inherent challenges and risks to using big data. Namely, the data can be harmful depending on how it is collected or leveraged. For example, when someone gives their email address to a company, that act alone may not represent a threat to personal or sensitive information. However, if the company suffers a cyber attack, that data may be compromised. The issue is worse for more sensitive data such as phone numbers, home addresses, and social security numbers. Specific types of data misuse include improper profiling, discrimination, collection errors, data breaches or cyber attacks, and political or social manipulation.

These drawbacks have had real-world implications, and several of the previously mentioned kinds of data misuse have already happened. One example of discrimination caused by big data involves Fitbit wearable devices. The watches are useful for collecting health data, but the data collected can be biased due to technical limitations, particularly for people of color. Specifically, these devices often have a hard time measuring heart rates for people with darker skin tones, and the resulting biased data can exacerbate existing racial healthcare disparities (Hailu 2019). An example of a recent data breach is the breach of Amazon’s Twitch game streaming platform, in which troves of sensitive information such as financial details and passwords might have been leaked, potentially causing future problems for users of that platform (Gilbert 2021). These two examples are just the tip of the iceberg. Taken to even further societal extremes, big data can exacerbate inequalities in the US justice system, skew collegiate rankings, cause inefficiencies in the labor market, and deny people financial resources. In brief, big data in the wrong hands can be described as “increas[ing] inequality and threaten[ing] democracy” (O’Neil 2016).

Conclusion

Big data has garnered a lot of attention over the last few years in both academic and non-academic settings for its potential to solve real-world problems. But while there has been much research and discourse on the subject, a concrete definition has yet to be put forward. Based on the work described above, one can think of big data as an explosive growth in the type and amount of data available for consumption (Diebold, 2012). Furthermore, the three V’s of big data provide some handy guidelines for discerning whether data can truly be classified as “big.” All in all, as society continues to generate more and more data, the sheer amount of data that has to be interpreted will only grow. It is likely that, over this time, big data will continue to evolve and new applications will be found for its use.

References

American Statistical Association. n.d. “Statistics and Big Data.” http://amazonaws.com. Accessed October 05, 2021. http://higherlogicdownload.s3.amazonaws.com/AMSTAT/UploadedImages/49ecf7cf-cb26-4c1b-8380-3dea3b7d8a9d/BigDataOnePager.pdf.

Bureau of Labor Statistics. 2018. bls.gov. June. Accessed October 05, 2021. https://www.bls.gov/opub/btn/volume-7/big-data-adds-up.htm.

Diebold, Francis X. 2000. “Big Data Dynamic Factor Models for Macroeconomic Measurement and Forecasting: A Discussion of the Papers by Lucrezia Reichlin and by Mark W. Watson.” Advances in Economics and Econometrics, Theory and Applications, Eighth World Congress, Volume III, August: 115.

—. 2012. “On the Origin(s) and Development of the Term “Big Data”.” Penn Institute for Economic Research, September 21: 8.

Geeks for Geeks. 2020. Geeks for Geeks. September 10. Accessed October 26, 2021. https://www.geeksforgeeks.org/benefits-big-data/.

Gilbert, Ben. 2021. “Business Insider.” businessinsider.com. October 6. Accessed October 07, 2021. https://www.businessinsider.com/major-twitch-hack-breach-leak-source-code-streamer-payments-2021-10.

GRC Data Intelligence. n.d. http://www.grcdi.nl. Accessed 09 30, 2021. http://www.grcdi.nl/dqglossary/3vs%20%28volume%2C%20variety%20and%20velocity%29.html.

Hailu, Ruth. 2019. Stat News. July 24. Accessed October 07, 2021. https://www.statnews.com/2019/07/24/fitbit-accuracy-dark-skin/.

Heggie, Jon. 2019. “National Geographic Science.” National Geographic. January 8. Accessed October 07, 2021. https://www.nationalgeographic.com/science/article/partner-content-big-data-benefits.

Kitchin, Rob, and Gavin McArdle. 2016. “What makes Big Data, Big Data? Exploring the ontological characteristics of 26 datasets.” SAGE Journals 1-2.

Laney, Doug. 2001. “3-D Data Management: Controlling Data Volume, Velocity and Variety.” Meta Group Research Note. Meta Group Research [AKA Gartner], February 6.

Lohr, Steve. 2013. The New York Times: BITS. February 1. Accessed September 28, 2021. https://bits.blogs.nytimes.com/2013/02/01/the-origins-of-big-data-an-etymological-detective-story/.

Mashey, John R. 1999. “Big Data… and the Next Wave of InfraStress, Solutions, Opportunities.” 1999 USENIX Annual Technical Conference (USENIX ATC 99). Monterey, CA: USENIX Association. 46.

McCabe, Allyson. 2019. Wired. December 23. Accessed October 07, 2021. https://www.wired.com/story/big-data-music/.

O’Neil, Cathy. 2016. Weapons of Math Destruction : How Big Data Increases Inequality and Threatens Democracy. New York: Crown Publishing Group.

Oracle. n.d. http://www.oracle.com. Accessed 09 30, 2021. https://www.oracle.com/in/big-data/what-is-big-data/.

Soubra, Diya. 2012. datasciencecentral.com. July 05. Accessed 09 2021, 2021. https://www.datasciencecentral.com/forum/topics/the-3vs-that-define-big-data.

University of Wisconsin Data Science. n.d. datasciencedegree.wisconsin.edu. Accessed October 05, 2021. https://datasciencedegree.wisconsin.edu/data-science/what-is-big-data/.

Weiss, Sholom M, and Nitin Indurkhya. 1998. “Predictive Data Mining: A Practical Guide.” In Predictive Data Mining: A Practical Guide, by Sholom M Weiss and Nitin Indurkhya, 2,14. San Francisco, CA: Morgan Kaufmann Publishers, Inc.
