Save Your Data? Maybe, Maybe Not!
Rather than saving every piece of data by default, enterprises should evaluate their data and assess whether keeping it is a potential value or a potential risk.
- By Mike Schiff
- May 20, 2016
Decreasing storage costs, increasing processing power, distributed and cloud-based storage, and the rise of big data analytics have encouraged many organizations to save data that, in the past, they might have purged. In fact, many organizations seem to have developed the attitude that even if they can't think of a compelling reason to retain certain data, they might find a use for it in the future, so why not save it anyway?
Although I have long advocated retaining data by default, even when its future potential seemed slim, the massive volume of data currently being generated by a growing number of sources (think the Internet of Things) has led me to qualify my opinion.
When Retaining Data Might Be Beneficial
Consider, for example, your tax return. Most people know that the IRS has three years to audit your return (the clock starts ticking from the later of the April tax filing deadline or the date you file your return), but if you omit more than 25 percent of your income, the deadline extends to six years. If you file a fraudulent or false return (you didn't claim your dog as a dependent, did you?), there is no deadline.
Even if there is insufficient processing power today to process, catalogue, and cross-reference all the data the IRS collects, advances in technology could make it possible in the future. As hypothetical examples, consider a situation where the IRS matches medical deductions against healthcare provider receipts or home office deductions in a rental apartment against the landlord's receipts.
In these situations, the IRS would certainly be justified (privacy issues aside) in retaining much of the detailed data it is collecting today because it could potentially analyze that data for fraud at a later date.
Healthcare is another example of an industry that may benefit from retaining data that is currently of little use. Collecting massive volumes of data involving symptoms, treatments, patient genetics, test results, output (perhaps continuous) from monitoring devices, and ultimate outcomes will almost certainly yield major benefits once the data can be analyzed. Future analytics might find early predictors, minimize deleterious effects, and even provide customized treatments.
When to Think Twice
Despite continued technical advances, storage costs will never reach zero. The amount of data generated each year is increasing exponentially and you must also consider the associated backup and administrative costs.
Data breaches, along with the resulting remediation costs and loss of confidence in the breached organization, increase the risks associated with saving every single bit (pun intended!) of data. Additional risk factors include privacy concerns, transborder data flow regulations, and country-specific laws and regulations governing what data -- especially consumer-related data -- can be collected and retained.
Furthermore, there are other reasons for not retaining data. Some organizations have a policy of deleting data as soon as it is no longer needed for operational purposes to avoid having to turn this data over to adversaries in the event of a lawsuit. These organizations need to recognize, however, that governmental compliance may require that some data be retained (perhaps in archival storage) long after it is of any true value to the organization.
What's the Bottom Line?
In 2009 I published my Data Axioms in an article I wrote for TDWI's BI This Week. One of these, which I first presented at an industry conference in the mid-1990s, was:
Data in a warehouse is like clothes in a closet; even if we haven't accessed some data in two years, we still tend not to throw it away.
This is still true today. Although I believe that saving historical data potentially provides many benefits, data also ages and can lose value over time.
Consequently, you should continually evaluate your retention policies against current and expected needs. At the very least, periodically review archived data to see if it is now obsolete and should (if not required for compliance purposes) be purged.
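As a rough illustration of such a periodic review, the sketch below flags archived data sets that have gone unused for a while and are not under a compliance hold. Everything in it is an assumption made for the example: the `ArchivedDataset` record, the `retain_until` field, the two-year idle threshold (borrowed from the closet analogy), and the sample data set names are hypothetical, not part of any particular product or policy.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class ArchivedDataset:
    """Hypothetical catalog entry for one archived data set."""
    name: str
    last_accessed: date
    # Date until which compliance rules require retention (None = no requirement).
    retain_until: Optional[date] = None


def purge_candidates(datasets: List[ArchivedDataset],
                     today: date,
                     idle_days: int = 730) -> List[str]:
    """Return names of data sets that are idle and free of compliance holds."""
    cutoff = today - timedelta(days=idle_days)
    return [
        d.name
        for d in datasets
        if d.last_accessed < cutoff
        and (d.retain_until is None or d.retain_until < today)
    ]


# Illustrative archive; the names and dates are made up.
archive = [
    ArchivedDataset("web_logs_2012", date(2013, 1, 15)),
    ArchivedDataset("tax_filings_2011", date(2012, 4, 20),
                    retain_until=date(2019, 4, 15)),
    ArchivedDataset("sales_2015", date(2016, 3, 1)),
]

print(purge_candidates(archive, today=date(2016, 5, 20)))
# prints ['web_logs_2012']
```

A list like this is only a starting point for discussion. Whether a flagged data set is truly obsolete remains a business judgment, and the compliance dates would have to come from whoever tracks the organization's regulatory obligations.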
Michael A. Schiff is founder and principal analyst of MAS Strategies, which specializes in formulating effective data warehousing strategies. With more than four decades of industry experience as a developer, user, consultant, vendor, and industry analyst, Mike is an expert in developing, marketing, and implementing solutions that transform operational data into useful decision-enabling information.
His prior experience as an IT director and systems and programming manager provides him with a thorough understanding of the technical, business, and political issues that must be addressed for any successful implementation. With Bachelor and Master of Science degrees from MIT's Sloan School of Management and as a certified financial planner, Mike can address both the technical and financial aspects of data warehousing and business intelligence.