Archive Big Data for Future Analysis
Can't analyze your data now? We explain why it's worth the time and effort to archive your data anyway.
- By Mike Schiff
- July 10, 2012
It was not that long ago that most of us made videotape recordings of television shows so that we could view them later at a more convenient time. Many of us ended up with a large collection of unviewed tapes that were disposed of when we upgraded from VCRs to DVRs. We simply lacked the time to view or "process" all of the video data we had collected.
A similar situation exists with the data many organizations collect. They are having problems keeping up with the amount of data they generate. Although much of this data originates from human interactions such as Web site navigation, call detail records, or social media, it can also originate from sources such as radio frequency identification devices (RFID tags), video monitoring, or database logs.
The vast volumes of some of this data may have once limited an organization's ability to analyze it. However, the era of "Big Data" and the associated enabling technology advances now make this more feasible, if not necessarily today then certainly tomorrow. Just think back to the days when almost all data warehouses contained highly summarized data or when most data mining efforts used sample, rather than complete, data sets. Consider as well the days when a terabyte-size data warehouse was considered a relative rarity.
Big Data is often characterized by its variety (i.e., structured, semi-structured, unstructured), velocity (e.g., real-time), and volume (large amounts of data), with the goal of gaining insights that were not previously known. However, I believe that the only required characteristic is the ability to analyze a much larger volume of data than an organization previously analyzed. Yes, we live in an unstructured world, and the realization that not all data resides in neat rows and columns has been a major Big Data driver. However, in my opinion, unstructured data, although significantly augmenting the volume of data organizations wish to analyze, does not by itself define Big Data; nor does the need to analyze it in real time.
In addition to commercial applications, which tend to highlight sales and marketing (e.g., cross-selling and customer retention) or fraud detection, there are numerous non-commercial environments where collecting and storing data for later analysis will provide future insights.
The Internal Revenue Service (IRS) is a good example of an organization that collects large amounts of data with no need to process it in real time. Each year when we file our income taxes, the IRS processes them and, if applicable, issues a refund within a few weeks. Many people think that if their Federal income tax return has not been questioned by the IRS within three years of the filing due date, their return is no longer subject to IRS scrutiny.
This is only partially true: the IRS has six years to audit a return that under-reports income by more than 25 percent and "forever" to question a return that was filed with fraudulent entries. Just because the IRS may not have the processing power to analyze and mine every detail today does not mean that it cannot save all the data it collects and examine it in greater detail sometime in the future, when evolving technology may make this more feasible.
Another area where the ability to mine increasingly vast amounts of data is yielding major benefits is in the field of health care. Medical records containing patient and family histories, symptoms, treatments, diets, medications, travel histories, etc., are being collected today to identify possible cures and more effective (and possibly personalized) treatments. Our ability to analyze and mine greater quantities of data will certainly increase in the future and, we hope, lead to major health-care breakthroughs.
Homeland security and criminal investigations can benefit from collecting and saving visual and audio data to identify possible immediate terrorist or criminal activities. This data can also be analyzed "after the fact" to determine who was in contact with whom and identify possible future threats, or to establish proof of criminal activity. As police forces begin to use drones for surveillance and facial recognition software continues to improve, the magnitude of visual data will significantly increase.
The Bottom Line
I would advise organizations to consider archiving and retaining the data they feel would be useful "if only they had the capability to fully analyze it." Technology is rapidly evolving and what cannot be thoroughly processed today may very likely be easily analyzed tomorrow. As storage costs continue to dramatically decrease, I recommend erring in favor of archiving data today rather than throwing it away.