Making the Most of Legacy Data (Part 2 of 2)
Old data can still provide new insights.
By Jules J. Berman
Analytics isn't just about examining fresh data; it's about incorporating past and present data to explore trends and patterns. In his newest book, Repurposing Legacy Data: Innovative Case Studies, Jules J. Berman uses case studies to illustrate how the value of legacy data -- either your own or from a public source -- can be maximized.
Below we've excerpted and lightly edited the introductory chapter from Berman's book that sets the stage for the numerous ways you can repurpose and gain value from existing data. This is the second of two parts. You can read Part 1 here.
This section of Repurposing Legacy Data: Innovative Case Studies is reprinted by permission of the author and publisher. Copyright 2015 by Elsevier Inc., which reserves all rights. For more information, visit this site.
Chapter 1 (Continued from Part 1)
1.3 Data Worth Preserving
Despite the preponderance of old data, most data scientists concentrate their efforts on newly acquired data or on nonexistent data that may emerge in the unknowable future. Why does old data get so little respect? The reasons are manifold.
1. Much of old data is proprietary and cannot be accessed by anyone other than its owners.
2. The owners of proprietary data, in many cases, are barely aware of the contents, or even the existence, of their own data, and have no understanding of the value of their holdings, to themselves or to others.
3. Old data is typically stored in formats that are inscrutable to young data scientists. The technical expertise required to use the data intelligibly is long forgotten.
4. Much of old data lacks proper annotation. There simply is not sufficient information about the data (e.g., how it was collected and what the data means) to support useful analysis.
5. Much of old data, annotated or not, has not been indexed in any serious way. There is no easy method of searching the contents of old data.
6. Much of old data is poor data, collected without the kinds of quality assurances that would be required to support any useful analysis of its contents.
7. Old data is orphaned data. When data has no guardianship, the tendency is to ignore the data or to greatly underestimate its value.
The sheer messiness of old data is conveyed by the gritty jargon that permeates the field of data repurposing. Anything that requires munging, scraping, and scrubbing can't be too clean.
Data sources are referred to as "old" or "legacy"; neither term calls to mind vitality or robustness. A helpful way of thinking about the subject is to recognize that new data is just updated old data. New data, without old data, cannot be used for the purpose of seeing long-term trends.
It may seem that nobody puts much value on legacy data, that nobody pays for legacy data, and that nobody invests in preserving legacy data. It is not surprising that nobody puts much effort into preserving data that has no societal value. The stalwart data scientist must not be discouraged. We shall see that preserving old data is worth the bother.
1.4 Basic Data Repurposing Tools
[Editor's note: In this section, omitted due to space limitations, the author examines simple text editors, programming skills needed, and data visualization tools.]
1.5 Personal Attributes of Data Repurposers
By far, the most important asset of any data analyst is her brain. A set of personal attributes that includes critical thinking, an inquisitive mind, and the patience to spend hundreds of hours reviewing data is certain to come in handy.
Expertise in analytic algorithms is an overrated skill. Most data analysis projects require the ability to understand the data, and this can often be accomplished with simple data visualization tools. The application of rigorous mathematical and statistical algorithms typically comes at the end of the project, after the key relationships among data objects are discovered. It is important to remember that if your old data is verified, organized, annotated, and preserved, the analytic process can be repeated and improved. In most cases, the first choice of analytic method is not the best choice. No single analytic method is critical when the data analyst has the opportunity to repeat his work applying many different methods, all the while attaining a better understanding of the data and more meaningful computational results.
1.5.1 Data Organization Methods
Everyone who enters the field of data science dreams of using advanced analytic techniques to solve otherwise intractable problems. In reality, much of data science involves collecting, cleaning, organizing, annotating, transforming, and integrating data. For reasons that will become apparent in later chapters, repurposing projects will require more data organization than "fresh data" projects. There are established techniques whereby data is usefully organized, and these techniques should be learned and practiced. Individuals who are willing to spend a considerable portion of their time organizing data will be in the best position to benefit from data repurposing projects.
1.5.2 Ability to Develop a Clear Understanding of the Goals of a Project
It is often remarked that members of an interdisciplinary team find it difficult to communicate with one another. They all seem to be babbling in different languages. How can a biologist understand the thoughts and concerns of a statistician, or vice versa? In many cases, communication barriers that arise in multidisciplinary repurposing projects result from the narrow focus of individual team members, not from their inability to communicate. If everyone on a data repurposing team is pursuing a different set of project goals, it is unlikely that they will be able to communicate effectively with one another.
It is crucial that everyone involved in a data repurposing project come to the same, clear understanding of the project. Failure comes when team members lose track of the overall goal of a project and lack any realistic sense of the steps involved in reaching the goal. When every project member understands the contributions of every other project member, most of the communication problems disappear.
In future chapters, we shall see that the most important and most innovative repurposing projects involve multiple individuals, from diverse disciplines, who use the achievements of their co-workers to advance toward a common goal.