Turning Big Data into Smart Data (Part 1 of 2)
We explain the business and technical challenges that motivate the need for smart data.
By Lee Feigenbaum, Co-founder, Cambridge Semantics
We live and work in the Age of Big Data. The quantity and variety of data that surrounds us and barrages us daily is unlike anything seen previously. In fact, for anyone who works in a profession even tangentially related to information, it's nearly impossible to avoid the hype around big data statistics:
- We create exabytes (or even zettabytes, depending on the source) of new data every day
- Over half of the data ever created (some reports say as much as 90 percent) was created within the last 12 months
- The rate at which we're creating new data doubles every month
Thankfully, most serious conversation has advanced beyond the numbers and now focuses on answering the question, What do we do with all of this data? This article is the first of two exploring the idea of moving from big data to smart data to derive business value from this deluge. It focuses on five key business drivers and technical challenges of working with all of this data. The second article in this pair looks at five best practices for turning big data into smart data to surmount these challenges.
An Increasing Desire for Data
There is an unmistakable trend across society towards data-driven decisions. No one seems to be exempt, as the desire to gather and analyze data can be seen across industries as well as from consumers, governments, and academics. Notable examples include:
- The big business of sports has led the charge. The high-profile book-turned-movie Moneyball conveys the trend in Major League Baseball away from decisions that were based on the intuition or gut of front-office staff members towards the data-driven identification and exploitation of undervalued players by number-crunching teams such as Billy Beane's late '90s Oakland A's.
- The pharma and healthcare industries continue to strive to develop and deliver personalized medicine -- diagnostics, treatments, and therapies tailored to the characteristics and genetic makeup of a particular patient. Personalized medicine demands the integration of a tremendous amount of highly granular and diverse data from the lab to the clinic.
- The data that may drive personalized medicine isn't just found in labs and clinics, either. Individuals are increasingly embracing the concept of the quantified self. We're using our smartphones, watches, and other wearable devices to gather data about ourselves to better understand fitness, nutrition, health, and behavioral tendencies.
- Large banks and federal agencies are increasingly hiring chief data officers (CDOs) and data scientists to foster strategic thinking around the collection, analysis, distribution, and application of data across all functions of an organization.
- Local and national governments are contributing to the deluge with significant movements towards transparent publication of data on websites (such as http://data.gov and http://data.gov.uk). Further efforts to compel the disclosure of additional government data are in process via initiatives such as the United States' DATA Act.
- Clever uses of big data techniques played a pivotal role in the last U.S. presidential election to help identify and target likely and swing voters in key states.
- Large retail chains such as Target and Walmart have been honing their customer data mining strategies for many years and are now long past the point where they are able to identify a pregnancy before even family and friends are aware.
- The approaching Internet of things -- as heralded by new devices such as the Nest Thermostat, Quirky devices, or even the Waze service that uses consumers' GPS-enabled smartphones to crowd-source traffic information -- has such companies as GE and Google making substantial investments based on its potential to both generate and find value in big data.
Given the existence of so much data and the keen interest in leveraging it for competitive advantage, why do so many companies still struggle to weave big data into their critical decision-making processes? Let's look at five common challenges that organizations encounter when embracing big data and that can be overcome by learning to turn big data into smart data.
Challenge #1: We don't know the answers up front; analytics are hit or miss
One of the defining characteristics of operating in the Age of Big Data is that we rarely know the answers we need -- or even the questions we'll want to ask -- up front. In part, this is because much of the value in big data is tied to serendipitous discovery of patterns and relationships previously hidden within large amounts of information. We can't rely on experts to program a new MapReduce job for every new line of analysis, and other big data analytics frameworks have similarly high barriers to entry (in terms of required skill level and time-to-answers) that discourage use for casual data exploration and analysis.
Data scientists who have the in-depth analytics, math, statistics, and programming skills currently required to mine value from big data are in high demand these days, and there are not enough of them to go around.
Challenge #2: Unstructured data is hard to mine
In many applications, big data is synonymous with unstructured data -- data derived from applying text analytics to unstructured text, audio, and video. However, looking across the text analytics landscape reveals a myriad of issues that hinder us from incorporating unstructured data into our day-to-day operational business decisions. These include:
- The need to apply different text analytics tools to different content. The best natural-language processing tools for scientific documents differ from the best tools for analyzing customer feedback, which in turn differ from the best tools for e-discovery across industries. General-purpose text analytics tools tend to have lower accuracy and precision than more specialized solutions.
- The need to apply different text analytics techniques at different times. Extracting business entities from text is a different challenge from analyzing social media sentiment, which is different again from discovering unknown relationships in multi-lingual text.
- The results of text analytics are unpredictable. Mining large quantities of Web pages, e-mail messages, or other documents often reveals entities and relationships previously unknown. Although big data stores give us a convenient way to capture arbitrary data, they don't give us much help in terms of performing further downstream analysis on data that wasn't known up front.
Challenge #3: Data is not collaborative or reusable
In general, data is gathered, stored, and used for a single purpose. For example:
- An investment bank gathers 10-K filings to aid buy-side equity analysts
- A biotech stores clinical results in a data repository to support regulatory reporting to the FDA
- An online retailer uses feeds from manufacturers to populate an inventory database that, in turn, feeds their website content management system
In all these cases, the data is trapped in a stovepipe and can't easily be reused by other users across the business -- many of whom often don't even know of the data's existence. Thus, risk officers can't benefit from exposure information derived from the 10-K analyses; senior scientists at the biotech can't leverage the clinical data repository to forecast the future clinical success of early-stage drugs; and the retailer can't repurpose the inventory database to identify strategic gaps in their product offerings.
Challenge #4: Big data is only part of the story
Although big data exacerbates the challenges involved in instituting a truly data-driven decision-making culture, big data sources on their own don't give the full context necessary for these decisions. Big data must be integrated with traditional enterprise data sources (e.g., transactional and operational databases, data warehouses, or ERP stores), data from cloud-based SaaS applications (e.g., Salesforce.com CRM data), and countless "shadow IT" data sources (including spreadsheets, presentations, and documents scattered across thousands of file shares and SharePoint sites). Such integration is expensive and time-consuming, but without it our ability to incorporate big data into day-to-day business processes is severely hampered.
Challenge #5: Data preparation costs too much
The mentality and tools in the Age of Big Data encourage us to collect as much data as we can get our hands on, but collecting data does not a priori make it useful. For data to have value, it must be prepared to be integrated, distributed, or consumed as part of some business process. Data preparation might mean:
- Discovery: Identifying the right data records within a large data store
- Curation: Evaluating and improving the quality, trustworthiness, or accuracy of the data
- Alignment: Mapping data schemas and individual records to a common model to foster integration and analytics
Generally, data preparation is fully manual and therefore tedious, time-consuming, and error-prone, which often restrains regular use of data in enterprise operations.
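To make the three preparation steps concrete, here is a minimal Python sketch of alignment, curation, and discovery applied to two tiny record sets. Everything in it is an assumption made for illustration: the source names ("crm", "erp"), the field names, the sample rows, and the quality rule (drop records with no customer name) are hypothetical and do not come from any particular product or from the approach described in this series.

```python
# Hypothetical records from two sources that describe the same kind of
# thing (customer revenue) using different schemas.
crm_rows = [
    {"CustName": "Acme Corp", "Region": "east", "Revenue": "1200"},
    {"CustName": "", "Region": "west", "Revenue": "300"},  # missing name
]
erp_rows = [
    {"customer": "Globex", "territory": "EAST", "sales_usd": 950.0},
]

# Alignment: map each source's field names onto one common model.
FIELD_MAPS = {
    "crm": {"CustName": "customer", "Region": "region", "Revenue": "revenue"},
    "erp": {"customer": "customer", "territory": "region", "sales_usd": "revenue"},
}


def align(row, source):
    """Rename a row's fields to the common model for the given source."""
    mapping = FIELD_MAPS[source]
    return {mapping[key]: value for key, value in row.items() if key in mapping}


def curate(record):
    """Normalize values; return None if the record fails a basic quality check."""
    name = str(record.get("customer", "")).strip()
    if not name:
        return None  # assumed rule: a record without a customer name is unusable
    return {
        "customer": name,
        "region": str(record["region"]).lower(),
        "revenue": float(record["revenue"]),
    }


def discover(records, region):
    """Discovery: pick out the records relevant to one question."""
    return [record for record in records if record["region"] == region]


def prepare(sources):
    """Align and curate every row from every source into one list."""
    prepared = []
    for source, rows in sources.items():
        for row in rows:
            cleaned = curate(align(row, source))
            if cleaned is not None:
                prepared.append(cleaned)
    return prepared


prepared = prepare({"crm": crm_rows, "erp": erp_rows})
east = discover(prepared, "east")
```

Even in this toy form, the field mappings and quality rules have to be written by hand for each source, which is the cost this challenge describes.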
Smart Data to the Rescue
Smart data is a new approach to working with big data that addresses these challenges. Smart data is well-described, discoverable data that can be easily managed and manipulated by business users. Smart data carries its lineage and context with it, supports the diversity of data formats and schemas necessary for effectively capturing unstructured data, and maintains rich semantics that allow it to be reused in many business contexts.
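As a rough illustration of what "carries its lineage and context with it" could look like in practice, the following Python sketch bundles a record's values with a description, a source, and a list of the transformations applied to it. The SmartRecord class, its fields, and the sample file name are hypothetical names invented for this example; they are one possible reading of the idea, not a definition of smart data.

```python
from dataclasses import dataclass, field


@dataclass
class SmartRecord:
    """A data record that travels with its own description and history."""

    values: dict
    source: str        # where the data came from (hypothetical file name below)
    description: str   # human-readable context for business users
    lineage: list = field(default_factory=list)  # steps applied so far

    def derive(self, step, new_values):
        """Return a new record, noting the transformation that produced it."""
        return SmartRecord(
            values=new_values,
            source=self.source,
            description=self.description,
            lineage=self.lineage + [step],
        )


raw = SmartRecord(
    values={"revenue": "1200"},
    source="crm_export.csv",
    description="Annual revenue per customer, in USD",
)
typed = raw.derive(
    "parsed revenue as float",
    {"revenue": float(raw.values["revenue"])},
)
```

Because each derived record keeps the description and appends to the lineage rather than overwriting it, a later consumer can see both what the data means and how it was produced without consulting the original author.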
In the second part of this series, we'll look at five ways to get started turning big data into smart data.
Lee Feigenbaum is the co-founder of Cambridge Semantics. You can contact the author at firstname.lastname@example.org.