November 29, 2016

Big Data: Part One - Introduction


By Brielle Huang

Opportunities to leverage big data are almost as immeasurable as big data itself. As more and more success stories of innovative users become available, corporations, institutions, and governments are starting to realize the need to understand big data. The following series of blog posts is meant to educate the reader on the concept of big data, the key players in big data (i.e., the demand and supply sides of big data, as well as mediators), and the diffusion of big data.

Technological changes over the last 20-plus years have changed our notion of what data is. For example, in the past, accountants associated data with transactions (e.g., sales and cash receipts). With the introduction of the Internet, this concept began to change. The new tech companies (e.g., Amazon, Google, and eBay) brought with them new sources and types of data, and the sales transaction was transformed. What used to be, for most transactions, a single data point became a cluster of data including, among other things, the customer's name, credit card, and address, as well as the customer's browsing pattern (web traffic and click-through rate). Later, in the early 2000s, as social network firms like Facebook and Twitter came into existence, the ability to capture information related to a user's online activities created a fundamental change in the type of information companies sought to capture. Instead of only providing binary information, such as whether a user was online at a certain time, Web 2.0 gave companies the ability to capture valuable information about preferences and human interactions. With the continued progression of technology, we are now in the process of supplementing networks of humans with networks of machines - a phenomenon known as the Internet of Things (IoT). The IoT could provide us with even more data about the products that customers buy.

The problem with keeping data is that it costs money to store it. Luckily, as technologies such as cloud computing have improved, storage capacities have increased and the cost of storage has fallen. Paraphrasing Parkinson's First Law in his 1980 speech, I.A. Tjomsland quipped that as storage became cheaper, the ability to retain large amounts of data that could become useful in the future increased. This points to the importance of making sure that captured and stored data are leveraged to create value.

In order to proceed with our analysis of the life of the big data phenomenon, it is necessary to first arrive at a reasonable definition of big data. This first blog post will aim to do just that. Over the years, many experts have offered their own definitions of big data. We will look at the similarities among them in order to come up with our own definition.

Cox and Ellsworth (1997):
“Visualization provides an interesting challenge for computer systems: data sets are generally quite large, taxing the capacities of main memory, local disk, and even remote disk. We call this the problem of big data.”
SAS:
“Big data is a term that describes the large volume of data - both structured and unstructured - that inundates a business on a day-to-day basis.”

In simple terms, both of the above definitions can be understood by referencing tools such as MS Excel and MS Access, both of which have been used extensively in business settings. Big data, then, can be visualized as a data set that pushes past the limits of these tools. For example, a Microsoft Access database has a capacity of only 2 GB, and an Excel worksheet has a maximum of about a million rows (1,048,576, to be exact).
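To make this concrete, here is a minimal sketch in Python that checks whether a data set already pushes past those two ceilings. The file name sales.csv and the helper function are hypothetical; the two limits are the documented Excel row maximum and Access file-size cap mentioned above.

import csv
import os

EXCEL_MAX_ROWS = 1_048_576        # rows per worksheet in modern Excel
ACCESS_MAX_BYTES = 2 * 1024 ** 3  # 2 GB cap on an Access database file

def exceeds_traditional_tools(path):
    """Return True if the file would overwhelm Excel or Access."""
    size_bytes = os.path.getsize(path)
    with open(path, newline="") as f:
        row_count = sum(1 for _ in csv.reader(f))
    return row_count > EXCEL_MAX_ROWS or size_bytes > ACCESS_MAX_BYTES

print(exceeds_traditional_tools("sales.csv"))  # hypothetical file

Anything for which this returns True is, by the Cox and Ellsworth or SAS definitions, a candidate for the "big data" label.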

Gartner:
“Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.”

High volume means that the data sets are much larger than the data sets we are used to - for example, there are almost 2 billion users on Facebook, and the data sets that contain their profiles and information would be enormous. This data would be impossible to organize in a desktop database management system like MS Access. High velocity means that the data is being captured at an incredibly high speed - for example, over the course of a single day, one Facebook user could like upwards of a dozen photos, posts, and news articles. High variety means that the data is not structured and so cannot be easily processed - for example, you would be hard-pressed to capture the information contained in Tweets on an Excel sheet. In general, very large data sets that combine structured and unstructured data are likely to exceed the storage and analysis capabilities of traditional software applications like MS Access and Excel.
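As a toy illustration of the variety problem, consider a tweet-like record. Every field name below is invented for the example; the point is that nested, variable-length data has no natural home in a fixed set of spreadsheet columns.

import json

# A raw record as it might arrive from an API; all fields are invented.
raw = json.dumps({
    "user": {"name": "example_user", "followers": 1200},
    "text": "Loving the new #bigdata series!",
    "hashtags": ["bigdata"],
    "replies": [{"user": "friend_a", "text": "Same here"}],
})

record = json.loads(raw)

# Flattening forces lossy choices: nested replies and variable-length
# hashtag lists do not map cleanly onto a fixed set of columns.
flat_row = {
    "user_name": record["user"]["name"],
    "followers": record["user"]["followers"],
    "text": record["text"],
    "hashtag_count": len(record["hashtags"]),
    "reply_count": len(record["replies"]),
}
print(flat_row)

Notice that the flattened row discards the content of the replies and hashtags; a spreadsheet can hold a summary of the record, but not the record itself.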

From the above definitions, several themes emerge:
  1. Big data indicates a high volume of data
  2. Big data is a continual flow of data, which means it must be processed in a timely manner or the business's systems will become inundated (see the sketch after this list)
  3. The data collected can be structured or unstructured
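To illustrate the second theme, here is a minimal sketch of stream-style processing, where each record is handled as it arrives rather than accumulated first. The event stream is simulated, and process is a stand-in for whatever the business actually does with each event.

import itertools
import time

def event_stream():
    """Simulate an unbounded stream of incoming events."""
    for i in itertools.count():
        yield {"event_id": i, "timestamp": time.time()}

def process(event):
    # Stand-in for real work: enrich, aggregate, or route the event.
    return event["event_id"] % 2 == 0

# Handle each event as it arrives; the full stream is never held in memory.
for event in itertools.islice(event_stream(), 5):
    process(event)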

While big data is one of the emerging technologies with far-reaching implications for businesses, we would like to make sure that our readers realize that it differs from several other technologies. In fact, big data is not necessarily linked to a particular supplier/vendor or a patented technology. Unlike systems such as Enterprise Resource Planning (ERP) or Customer Relationship Management (CRM), big data is not as simple as a system that someone can patent. This is mainly due to the third theme we distilled above: structured and unstructured data must first be organized and processed before they can be analyzed and/or stored. Due to the high volume and complexity of the data, these steps would probably need to be conducted by different players who specialize in each step. The significance of this "unpatentable" phenomenon, as well as the three themes discussed above, will be utilized in our analysis of big data in our next blog post, which will cover the major players in big data - namely the demand, supply, and mediation sides.

Brielle Huang is a third-year Accounting and Financial Management student minoring in Legal Studies at the University of Waterloo. She is working as a Research Assistant under Professor Stratopoulos and researching emerging technologies, with a focus on Big Data. Brielle has completed her first co-op term in Assurance at PwC. Her other interests include creative writing and travelling.

Sources:
- Cox, M., & Ellsworth, D. (1997). Application-controlled demand paging for out-of-core visualization. In Proceedings of the 8th IEEE Visualization '97 Conference (p. 235). IEEE Computer Society Press. Retrieved from http://dl.acm.org/citation.cfm?id=266989.267068
- Press, G. (2013, May 9). A Very Short History Of Big Data. Retrieved October 31, 2016, from http://www.forbes.com/sites/gilpress/2013/05/09/a-very-short-history-of-big-data/