Making the Most of Legacy Data (Part 1 of 2)
Old data can still provide new insights.
By Jules J. Berman
Analytics isn't just about examining fresh data; it's about incorporating past and present data to explore trends and patterns. In his newest book, Repurposing Legacy Data: Innovative Case Studies, Jules J. Berman uses case studies to illustrate how the value of legacy data -- either your own or from a public source -- can be maximized. In his own words:
This chapter explains the goals of data repurposing, namely:
- Employing preexisting data to ask and answer questions that were not contemplated when the data was collected
- Combining preexisting data with other data sets of the same kind, producing an aggregate data set that is more useful than any single component data set
- Reanalyzing the original data set using alternate or improved methods to attain outcomes of greater precision or reliability than the outcomes produced in the original analysis
- Integrating heterogeneous data sets (i.e., data sets with seemingly unrelated types of information) to answer questions or develop concepts that span several different disciplines
- Finding subsets in a population once thought to be homogeneous
- Finding new relationships among data
- Retesting assertions, theories, or conclusions that had been drawn from the original data
- Creating new concepts or metaphors from the data
- Fine-tuning existing data models
- Starting over and remodeling the system.
The rest of the book provides case studies that illustrate how data repurposing accomplishes these goals.
Below we've excerpted and lightly edited the introductory chapter from Berman's book that sets the stage for the numerous ways you can repurpose and gain value from existing data. This is the first of two parts; the second part will appear next week.
This section of Repurposing Legacy Data: Innovative Case Studies is reprinted by permission of the author and publisher. Copyright 2015 by Elsevier Inc., which reserves all rights. For more information, visit this site.
1.1 Why Bother?
This book demonstrates how old data can be used in new ways that were not foreseen by the people who originally collected the data. All data scientists understand that there is much more old data than there is new data, and that the proportion of old data to new data will always be increasing. Two reasons account for this situation: (i) all new data becomes old data with time, and (ii) old data accumulates without limit. If we are currently being flooded by new data, then we are surely being glaciated by old data.
Old data has enormous value, but we must be very smart if we hope to learn what the data is telling us. Data scientists interested in resurrecting old data must be prepared to run a gauntlet of obstacles. At the end of the gauntlet lies a set of unavoidable and important questions:
1. Can I actually use abandoned and ignored data to solve a problem that is worth solving?
2. Can old data be integrated into current information systems (databases, information systems, network resources) and made accessible, along with new data?
3. Will the availability of this old data support unknowable future efforts by other data scientists?
4. Can old data tell us anything useful about the world today and tomorrow?
A credible field of science cannot consist exclusively of promises for a bright future. As a field matures, it must develop its own written history, replete with successes and failures and a thoughtful answer to the question, "Was it worth the bother?" This book examines highlights in the history of data repurposing. In the process, you will encounter the general attitudes, skill sets, and approaches that have been the core, essential ingredients for successful repurposing projects. Specifically, this book provides answers to the following questions:
1. What are the kinds of data that are most suitable for data repurposing projects?
2. What are the fundamental chores of the data scientists who need to understand the content and potential value of old data?
3. How must data scientists prepare old data so that it can be understood and analyzed by future generations of data scientists?
4. How do innovators receive intellectual inspiration from old data?
5. What are the essential analytic techniques that are used in every data repurposing project?
6. What are the professional responsibilities of data scientists who use repurposed data?
The premise of this book is that data repurposing creates value where none was expected. In some of the most successful data repurposing projects, analysts crossed scientific disciplines to search for relationships that were unanticipated when the data was originally collected. We shall discover that the field of data repurposing attracts individuals with a wide range of interests and professional credentials. In the course of the book, we will encounter archeologists, astronomers, epigraphers, ontologists, cosmologists, entrepreneurs, anthropologists, forensic scientists, biomedical researchers, and many more. Anyone who needs to draw upon the general methods of data science will find this book useful.
Although some of the case studies use advanced analytical techniques, most do not. It is surprising, but the majority of innovative data repurposing projects primarily involve simple counts of things. The innovation lies in the questions asked and the ability of the data repurposer to resurrect and organize information contained in old files. Accordingly, the book is written to be accessible to a diverse group of readers. Technical jargon is kept to a minimum, but unavoidable terminology is explained, at length, in an extensive Glossary.
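As a rough illustration of what "simple counts of things" can look like in practice, the following Python sketch tallies the records in a small legacy file by decade. The records, the column names, and the count_by_decade helper are all hypothetical, invented for this illustration; none of them come from the book's case studies.

```python
import csv
import io
from collections import Counter

# Hypothetical legacy records (invented for illustration only).
LEGACY_CSV = """year,category
1961,A
1967,B
1972,A
1975,A
1988,B
"""

def count_by_decade(csv_text):
    """Count records per decade from CSV text that has a 'year' column."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        decade = (int(row["year"]) // 10) * 10
        counts[decade] += 1
    return counts

decade_counts = count_by_decade(LEGACY_CSV)
for decade in sorted(decade_counts):
    print(decade, decade_counts[decade])
```

Nothing here is analytically sophisticated; the work lies in choosing what to count and in getting the old file into a shape where it can be counted at all.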
1.2 What Is Data Repurposing?
Data repurposing involves taking preexisting data and performing any of the following:
1. Using the preexisting data to ask and answer questions that were not contemplated by the people who designed and collected the data
2. Combining preexisting data with additional data, of the same kind, to produce aggregate data that suits a new set of questions that could not have been answered with any one of the component data sources
3. Reanalyzing data to validate assertions, theories, or conclusions drawn from the original studies
4. Reanalyzing the original data set using alternate or improved methods to attain outcomes of greater precision or reliability than the outcomes produced in the original analysis
5. Integrating heterogeneous data sets (i.e., data sets with seemingly unrelated types of information), for the purpose of answering questions, or developing concepts, that span diverse scientific disciplines
6. Finding subsets in a population once thought to be homogeneous
7. Seeking new relationships among data objects
8. Creating, on-the-fly, novel data sets through data file linkages
9. Creating new concepts or ways of thinking about old concepts, based on a re-examination of data
10. Fine-tuning existing data models
11. Starting over and remodeling systems
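Two of the items above, combining data sets of the same kind and creating new data sets through linkage, can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the records, the shared "id" key, and the aggregate and link helpers are hypothetical and were invented for illustration.

```python
# Two hypothetical legacy data sets of the same kind (invented values).
site_a = [{"id": "p1", "age": 34}, {"id": "p2", "age": 51}]
site_b = [{"id": "p3", "age": 47}]

# A hypothetical data set of a different kind that shares the "id" key.
outcomes = {"p1": "recovered", "p3": "relapsed"}

def aggregate(*datasets):
    """Concatenate same-kind data sets, tagging each record with its source index."""
    combined = []
    for source, records in enumerate(datasets):
        for record in records:
            combined.append({**record, "source": source})
    return combined

def link(records, lookup, field):
    """Attach lookup[id] to each record under `field`; None when no match exists."""
    return [{**r, field: lookup.get(r["id"])} for r in records]

linked = link(aggregate(site_a, site_b), outcomes, "outcome")
for row in linked:
    print(row)
```

Keeping the source index on every record is a deliberate choice in this sketch: once data sets are pooled, it is otherwise hard to trace a record back to the collection it came from.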
Most of the listed types of data repurposing efforts are self-explanatory and all of them will be followed by examples throughout this book. Sticklers may object to the inclusion of one of the items on the list, namely, "reanalyzing data to validate assertions, theories, or conclusions drawn from the original studies." It can be argued that a reanalysis of data, for the purposes of validating a study, is an obligatory step in any well-designed project, done in conformance with the original purpose of the data; hence, reanalysis is not a form of data repurposing.
If you believe that data reanalysis is a normal and usual process, you may wish to try a little experiment. Approach any scientist who has published his data analysis results in a journal. Indicate that you would like to reanalyze his results and conclusions, using your own preferred analytic techniques. Ask him to provide you with all of the data he used in his published study. Do not be surprised if your request is met with astonishment, horror, and a quick rebuff.
In general, scientists believe that their data is their personal property. Scientists may choose to share their data with bona fide collaborators, under their own terms, for a restricted purpose. In many cases, data sharing comes at a steep price. A scientist who shares his data may stipulate that any future publications based, in whole or in part, on his data must meet with his approval and must list him among the coauthors.
Because third-party reanalysis is seldom contemplated when scientists are preparing their data, I include it here as a type of data repurposing (i.e., an unexpected way to use old data). Furthermore, data reanalysis often involves considerably more work than the original analysis, because data repurposers must closely analyze the manner in which the data was originally collected, must review the methods by which the data was prepared (e.g., imputing missing values, deleting outliers), and must develop various alternate methods of data analysis. A data reanalysis project whose aim is to validate the original results will often lead to new questions that were not entertained in the original study. Hence, data reanalysis and data repurposing are tightly linked concepts.
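To make the point about data preparation concrete, here is a minimal Python sketch of a reanalysis that runs the same summary statistic through several alternate preparation choices. The measurements are invented, and the specific rules used (median imputation, and trimming values that lie more than three median absolute deviations from the median) are assumptions chosen for illustration, not methods prescribed by the book.

```python
from statistics import mean, median

# Hypothetical measurements; None marks a missing value (invented data).
raw = [4.1, 3.9, None, 4.3, 40.0, 4.0, None, 4.2]

def drop_missing(values):
    """Discard missing entries."""
    return [v for v in values if v is not None]

def impute_median(values):
    """Replace missing entries with the median of the observed values."""
    fill = median(drop_missing(values))
    return [fill if v is None else v for v in values]

def trim_outliers(values, factor=3.0):
    """Keep values within factor * MAD (median absolute deviation) of the median."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return [v for v in values if abs(v - med) <= factor * mad]

# Each entry is one alternate way of preparing the same raw data.
pipelines = {
    "drop missing": drop_missing(raw),
    "drop missing + trim": trim_outliers(drop_missing(raw)),
    "impute median + trim": trim_outliers(impute_median(raw)),
}
for name, values in pipelines.items():
    print(f"{name}: n={len(values)}, mean={mean(values):.3f}")
```

When the summary shifts noticeably from one preparation to the next, as it does here once the single extreme value is trimmed, the reanalyst has a reason to return to how the data was originally collected and cleaned.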