Big Data Will Create More Accurate Predictions
Although advances in technology have dramatically improved our ability to analyze and mine vast amounts of data, we must remember that the quality of our predictions is directly dependent on the quality of our data.
- By Mike Schiff
- August 11, 2015
Early Analytic Techniques
One of the earliest examples of data analytics was gathering two data points, each consisting of an independent variable (such as number of years of education) and a dependent variable (such as salary), and connecting them with a straight line. Another person's salary could then be predicted by seeing where that person's years of education fell on the line.
As new data points (or observations) were added, methods such as least-squares fitting would be used to derive a new straight line that best fit the observed data points. This technique was known as simple linear regression. When additional independent variables (such as age) were added, analytics evolved into multiple linear regression. Although numerous techniques have evolved to test the validity of the predicted results and derive new prediction algorithms, in many cases their capabilities were constrained by both the available data and the available computing power.
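The least-squares fit described above can be sketched in a few lines of Python. The education and salary figures below are illustrative values invented for the example, not data from any study:

```python
# Least-squares simple linear regression: predict salary from years of
# education. The data values here are illustrative, not real observations.
def fit_line(xs, ys):
    """Return the slope and intercept that minimize squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

years = [10, 12, 14, 16, 18]     # years of education (independent variable)
salary = [30, 38, 45, 55, 62]    # salary in $K (dependent variable)

m, b = fit_line(years, salary)
predicted = m * 15 + b           # predicted salary for 15 years of education
```

With only two data points the line passes through both exactly; as observations accumulate, the least-squares line balances the error across all of them, which is exactly the evolution the paragraph above describes.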
Technology Enhances Analytics Capabilities
Consider, for example, data mining. It wasn't that long ago that most data mining analyses used a relatively small subset of the available data to discover patterns and relationships. Along came parallel and distributed processing, solid state storage, and in-memory databases. Hardware costs decreased. For example, in the late 1960s memory cost approximately $1/byte and disk storage approximately $0.10/byte; today a megabyte of memory can be purchased for less than a penny and a gigabyte of storage for less than a nickel. Now, vast amounts of data can be quickly analyzed. This is truly a case of "cheaper, better, and faster!"
Another factor that has contributed to advances in predictive analytics is new database structures such as the Hadoop distributed file system and other "NoSQL" databases. Although relational databases excel at organizing and processing structured data, non-relational data structures can be more appropriate for the vast amounts of semi-structured and unstructured data that organizations now want to analyze and mine in their attempts to discover new insights.
In addition to data generated by transaction processing systems, today's data sources include sensor data, social media data, and even voice data from call center interactions. Depending on the organization (e.g., industrial, healthcare, government), these insights might relate to customer behavior, medical treatments, or even potential terrorist activities. Our ability to process greater amounts and types of data can only serve to improve the accuracy of predictive analytics: increasing the total number of underlying data points provides a more complete view of the subject at hand.
Data Quality is More Important than Ever
However, as our ability to generate better predictions continues to improve, we must recognize that big data comes with big responsibilities. We must not forget that the accuracy of these predictions is only as good as the accuracy of the underlying data. Garbage-in, garbage-out still applies today and always will. We must continue to take proactive steps to ensure the quality of our data lest our data lakes become polluted data swamps.
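The proactive steps mentioned above can be as simple as screening records before they reach the analysis. The sketch below shows one minimal approach; the field names and plausibility thresholds are assumptions chosen for illustration, not prescriptions:

```python
# A minimal sketch of a proactive data quality gate. Records that would
# pollute the analysis ("garbage in") are rejected with a reason rather
# than silently skewing the predictions. Field names and thresholds here
# are illustrative assumptions.
def screen(records):
    """Split records into (clean, rejected) lists; rejected items carry a reason."""
    clean, rejected = [], []
    for r in records:
        if r.get("salary") is None or r.get("years_education") is None:
            rejected.append((r, "missing value"))
        elif not 0 <= r["years_education"] <= 30:
            rejected.append((r, "implausible years of education"))
        elif r["salary"] < 0:
            rejected.append((r, "negative salary"))
        else:
            clean.append(r)
    return clean, rejected

records = [
    {"years_education": 16, "salary": 55},
    {"years_education": None, "salary": 40},   # missing value
    {"years_education": 99, "salary": 45},     # implausible entry
]
clean, rejected = screen(records)
```

Keeping the rejected records (rather than discarding them) lets the data quality problems themselves be measured and fixed at the source, which is what keeps a data lake from becoming a swamp.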
The ability to cost-effectively analyze vast amounts of data can certainly generate additional insights and better predictions. However, if the big data under analysis is of poor quality, it might instead produce erroneous predictions, lead to counter-productive decisions, and result in big data problems. The quality of the data we analyze is often more important than the quantity.