Understanding Big Data Source Types Can Improve Your Project Planning
Big data can be divided into four source types. By understanding those types, you can better organize and scope new big data projects.
- By Stan Pugsley
- January 4, 2022
Many articles have been written about the dimensions or attributes of big data, focusing on the list of Vs (including volume, velocity, variety, and veracity) and sometimes additional attributes (such as value). Much less has been written about the common sources of that data. In this article, we discuss a framework for understanding the types of systems that generate big data, along with some of the common attributes of data from those systems.
Sensors have been embedded in devices, appliances, and machinery for decades, but in recent years those sensors have become connected to the Internet to transmit data to a central repository. Smart home devices -- from thermostats to microwaves to light bulbs -- are now connected via Wi-Fi. In the transportation sector (including ships, cars, airplanes, and trains), sensor data used to be collected in a "black box" for emergency analysis. Today that data can be streamed to a manufacturer's servers for real-time status analysis. Manufacturing and utility sensors stream data to maintenance applications to allow intervention when issues are detected.
Sensor data can be characterized by a high volume of small, standard messages received at a mostly predictable cadence. Variety and veracity are rarely issues because the sensors are owned and managed by the organization: we know the format and source of the messages ahead of time and can store and aggregate them on a planned schedule. Creation of sensor data may be triggered by events (for example, air bag deployment) or by schedule (such as one temperature reading per hour). Because sensor data is so consistently and predictably formatted, it is easy to define a schema and aggregate and analyze the data on a regular basis.
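Because the message schema is known in advance, the aggregation step can be very simple. The sketch below shows hourly averaging of fixed-schema temperature readings; the field names (`sensor_id`, `hour`, `temp_c`) are illustrative assumptions, not a real device format.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sensor messages: each follows a fixed, known schema.
readings = [
    {"sensor_id": "therm-01", "hour": 9, "temp_c": 21.5},
    {"sensor_id": "therm-01", "hour": 9, "temp_c": 22.5},
    {"sensor_id": "therm-01", "hour": 10, "temp_c": 23.0},
]

def hourly_averages(messages):
    """Aggregate fixed-schema readings into one average per sensor per hour."""
    buckets = defaultdict(list)
    for m in messages:
        buckets[(m["sensor_id"], m["hour"])].append(m["temp_c"])
    return {key: mean(vals) for key, vals in buckets.items()}
```

The predictable cadence is what makes this kind of scheduled, schema-first aggregation safe to automate.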
Trackers are a type of sensor that must be added to or enabled on an asset when the owner consents to being monitored. Some trackers are placed on an asset at the time of use, such as the GPS monitoring system of a delivery truck. Others, such as web cookies, are placed on a user's system when that user visits a website and consents to being tracked. Asset tracking, such as Apple's "Find My" service or the NFC trackers on equipment in a warehouse, involves placing a transmitter on an asset and using Internet-connected devices to receive a message whenever the asset comes into close proximity.
The key reason to separate tracker data from sensor data is to emphasize the concept of consent, or opting in, to data collection. Tracker data is standard and predictable, like other sensor data, but may be subject to specific retention and storage requirements, such as those imposed by the GDPR. Unwanted or unapproved tracker data should be purged from your systems regularly.
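A regular purge job of the kind described above can be sketched in a few lines. This is a minimal illustration only: the 90-day window, the `consented` flag, and the record fields are all assumptions, not a statement of any regulation's actual requirements.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=90)  # assumed retention policy window

def purge(records, now):
    """Keep only consented tracker records captured within the retention window."""
    return [
        r for r in records
        if r["consented"] and now - r["captured_at"] <= RETENTION
    ]

# Hypothetical tracker records; field names are illustrative only.
now = datetime(2022, 1, 4)
records = [
    {"user": "a", "consented": True,  "captured_at": now - timedelta(days=10)},
    {"user": "b", "consented": True,  "captured_at": now - timedelta(days=120)},
    {"user": "c", "consented": False, "captured_at": now - timedelta(days=5)},
]
kept = purge(records, now)
```

In practice a job like this would run on a schedule and delete from durable storage rather than filter an in-memory list, but the consent-plus-retention test is the core of it.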
Financial trading, and financial transactions in general, are economically important sources of high-volume, high-velocity data, but they also carry the added overhead of veracity and regulatory oversight. Alternative currencies such as Bitcoin carry an especially heavy computational load to ensure veracity. Trading data may also have high variety if data sets are sourced from many trading platforms, as well as from analyst reports and financial disclosures (such as year-end statements).
Trading data can be overwhelming to any organization. Every investor is looking for a unique insight or advantage in the market, pushing them to seek out new data sources and new approaches to data modeling. When planning a big data project, it is important to separate structured, standardized data that can feed automated data pipelines from unstructured, irregular data that will require significant time and attention to manage.
When dealing with data generated by users of websites or applications, we have two types of data to consider: data within the application (for example, orders, social media posts, and user profiles), and metadata about the users' interactions with the application (such as page views, likes, clicks, and conversions). Interestingly, the first type is generally considered "traditional data," while the second, being much higher in volume, is considered "big data." The activity metadata collected through tools such as Adobe Analytics or Google Analytics can be used to understand the customer experience and to support targeted marketing campaigns. Due to the evolving nature of the collection tools and APIs, variety will always be an issue as you try to keep up with emerging internal applications (such as new ERP systems) and public apps (such as TikTok).
User interaction metadata is often pre-aggregated by a tool such as Google Analytics before it is exported for use by the content producer, changing it from big data into traditional data. Before you request the "fire hose" of base-granularity data, look carefully at your business requirements to understand whether the high-volume big data is truly required. Most organizations find that summarized data is sufficient for their needs, leaving the big-data engineering burden on Adobe, Google, or other data platforms.
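The pre-aggregation described above amounts to collapsing raw click-stream events into summary rows. The sketch below shows the idea with hypothetical event fields; it is not how any particular analytics tool is implemented, only an illustration of why the summarized export is so much smaller than the fire hose.

```python
from collections import Counter

def daily_page_views(events):
    """Collapse raw click-stream events into one count per (date, page) pair --
    the summarized, 'traditional data' form most organizations actually need."""
    return Counter((e["date"], e["page"]) for e in events)

# Hypothetical raw interaction events (field names are illustrative).
events = [
    {"date": "2022-01-04", "page": "/home",    "visitor": "v1"},
    {"date": "2022-01-04", "page": "/home",    "visitor": "v2"},
    {"date": "2022-01-04", "page": "/pricing", "visitor": "v1"},
]
summary = daily_page_views(events)
```

Each raw event disappears into a count, which is exactly the trade-off to weigh against your business requirements before requesting base-granularity data.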
Understanding the common types of big data sources is as important as understanding the dimensions of big data. The four source types -- sensors, trackers, financial trading, and user activity -- are distinct in their security, retention, and schema requirements. New data sources will emerge every year, but these common types will remain applicable for the long term.
Stan Pugsley is an independent data warehouse and analytics consultant based in Salt Lake City, UT. He is also an Assistant Professor of Information Systems at the University of Utah Eccles School of Business. You can reach the author via email.