Q&A: New Perspectives on Big Data Problems
Big data issues -- from proper testing to developing accurate models -- can’t be ignored. The Modeling Agency’s Tony Rathburn takes a look at several important big data issues troubling enterprises today.
- By James E. Powell
- November 19, 2013
[Editor’s Note: Tony Rathburn, senior consultant and training director at The Modeling Agency, will be leading sessions at the TDWI World Conference in Orlando (December 8-13, 2013) on Big Data versus Fat Data and Supporting the Analytics-Driven Organization. We asked Tony about several big data (and fat data) issues he’s watching now.]
TDWI: You make a strong distinction between IT’s perspective and the analyst’s perspective in your conversations about big data. Can you tell us more about why that distinction is so important?
Tony Rathburn: The phenomenon of big data has been with us for a few years now and is continuing to grow rapidly. Most organizations are recognizing the potential of utilizing an analytics perspective to expand and enhance their business potential with effective use of the information content that is available from the growing Internet-based sources.
A great deal of this effort involves shifting our data warehouses away from systems that supported primarily transaction-based processing and reorganizing them to support analytics-driven applications. Unfortunately, many of these efforts are being hampered by attempts to embed analytics capabilities into the information infrastructure.
Analytics is performed to enhance specific decision processes across the organization. The motivations, needs, and perspectives of the business users, and the efforts of the analysts that support them, vary widely in the approaches taken and in the requirements for specific projects.
IT departments would deliver much stronger support to the analytics-driven organization by viewing themselves as the primary supplier of data resources to a broad customer base within the organization rather than attempting to provide both the data resources and the solutions development tools.
Decentralizing the acquisition of analytic tools and implementation of the solutions development process allows the business decision makers the flexibility to customize their analytics projects in the way that is necessary in today’s sophisticated decision environments.
Only the most trivial of projects will rely on one-size-fits-all, off-the-shelf solutions. The development of comprehensive data resource delivery platforms, complete with solution delivery tools, is increasing costs and slowing delivery of new, mission-critical capabilities. In many cases, it has the unintended consequence of also diminishing the capabilities of the data resource platform itself, as the embedded solution tools are not consistent with the needs of the business projects under development.
You regularly make the statement that we need a “sufficient” amount of experience in our data to develop quantitative models, but beyond that sufficient quantity of records, additional records may actually work against us. Why wouldn't more records always help us?
In developing our analytic models, our algorithms use a set of historical data called a “training data set” to build mathematical formulas. These algorithms essentially adjust the weights in our formulas to best fit a line through the known points in the training data. When our analytics projects are successful, these formulas have the ability to interpolate between the known points they saw in the training data and provide reliable estimates in areas that were not specifically represented.
Adding records to the training data essentially brings these known points closer together and provides the potential to achieve additional precision in our estimates. This is especially true in physical systems problems, where there is an underlying order in the system. Physical systems are consistent in their behavior, so we can achieve highly precise estimates if our models are sufficiently sophisticated.
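As a minimal sketch of that idea -- not from the interview, and using made-up numbers -- the Python snippet below fits a straight line through a handful of known training points and then estimates a value at a point the training data never contained:

```python
import numpy as np

# Hypothetical "training data set": the known points the algorithm sees
x_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_train = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Adjust the weights (slope and intercept) to best fit a line through the known points
slope, intercept = np.polyfit(x_train, y_train, deg=1)

# Interpolate: estimate a value at a point that never appeared in the training data
x_new = 2.5
print(f"Estimate at x={x_new}: {slope * x_new + intercept:.2f}")
```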
In most business problems, we are not modeling a physical system. Rather, we are developing a model of human behavior in our business relationships. Human behavior is not consistent. Behavior is choice. We often fall into patterns of behavior, and those patterns are what we are attempting to identify in our analytics projects. However, different groups of people display the behaviors we are interested in at differing rates, and even the same individual will be inconsistent in displaying a behavior.
The result of this inconsistency of behavior is the introduction of “noise” content into our training data. This noise is a reality that our models need to deal with in a live decision environment. To have that capability, it is often useful to develop our models in a way that intentionally reduces precision so that we also reduce the impact of the inconsistencies in the data. We do this by reducing the number of records in our training data.
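The mechanics of that record reduction can be as simple as subsampling. The sketch below is a hypothetical illustration (synthetic noisy data, not Rathburn’s procedure) of fitting on a deliberately reduced subset of the available records:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical behavioral data: an underlying pattern plus inconsistent "choice" noise
x = rng.uniform(0, 10, size=100_000)
y = 3.0 * x + 5.0 + rng.normal(scale=8.0, size=x.size)

# Deliberately train on a reduced subset of the available records
subset = rng.choice(x.size, size=2_000, replace=False)
slope, intercept = np.polyfit(x[subset], y[subset], deg=1)
print(f"Weights from the reduced training set: slope={slope:.2f}, intercept={intercept:.2f}")
```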
In your role as an analyst, you’ve said that you have far more interest in “fat data” than in “big data.” Would you explain the difference and why you believe that fat data is the critical issue?
As I just discussed, a large number of experiences, or records, isn’t necessarily important to me as an analyst. What is important is the number of fields associated with a record. When there are a large number of fields associated with a record, we have “fat data.”
Each of these fields is a potential variable in a model being developed. Each time we add an additional variable to a model, we are creating an additional dimension. These additional dimensions may enhance the performance of our model by providing additional resolution. However, additional dimensions also increase the model’s complexity.
As an analyst, my concern is developing the simplest model possible that delivers the maximum business performance enhancement. Determining which condition attributes to include, from the growing number of available options, is a major challenge for an analyst.
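One simplified way to picture that challenge -- a sketch with synthetic data, not a recommendation from the interview -- is to score each candidate field against the target outcome and keep only a handful, rather than feeding every available dimension into the model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical "fat" data: 200 candidate fields per record, only a few carry signal
n_records, n_fields = 5_000, 200
X = rng.normal(size=(n_records, n_fields))
target = 2.0 * X[:, 3] - 1.5 * X[:, 17] + rng.normal(scale=1.0, size=n_records)

# Score each candidate field by the strength of its correlation with the target
scores = np.array([abs(np.corrcoef(X[:, j], target)[0, 1]) for j in range(n_fields)])

# Keep only a handful of fields: the simplest model that still carries the signal
top_fields = np.argsort(scores)[::-1][:5]
print("Candidate fields to keep:", sorted(top_fields.tolist()))
```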
We are also faced with an array of new types of data. Each of these new data types presents issues with how best to present it to our algorithms for quantitative analysis, as well as a number of new data quality concerns.
We see large volumes of data being captured from social media, geo-spatial data, text mining, and a variety of other sources. Do these new types of data pose any special challenges or opportunities for the analyst?
The biggest challenge coming from these new data types is determining what information content they offer that will enhance our decision processes, and how to convert these fields from their raw formats into a form that can be subjected to the development of our mathematical modeling efforts.
We are also faced with a number of issues related to the consistency of this data. Social media data and free-form text, in particular, carry risks from inconsistent usage of language that can lead us to inaccurate conclusions.
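As a rough illustration of that conversion step -- invented example posts, and far simpler than what a real project would require -- the snippet below turns free-form text into fixed-length numeric rows by building a shared vocabulary and counting word occurrences:

```python
import re
from collections import Counter

# Hypothetical raw social-media posts with inconsistent usage of language
posts = [
    "Loved the new release, great support team!",
    "great release but the support was slow",
    "Support team was GREAT. Loved it.",
]

def to_counts(text: str) -> Counter:
    """Lower-case the text, strip punctuation, and count word occurrences."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

# Build a shared vocabulary, then turn each post into a fixed-length numeric row
vocabulary = sorted({word for post in posts for word in to_counts(post)})
rows = [[to_counts(post)[word] for word in vocabulary] for post in posts]

print(vocabulary)
for row in rows:
    print(row)
```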
Do you see any particular threats to the success of big data projects that are different from traditional and predictive analytics?
The biggest threat that I see to big data projects is the resurgence of an overreliance on quantitative techniques to solve our business problems. Quantitative techniques can offer exceptional sophistication in complex decision environments. However, our algorithms will never understand the context of our business problem, our business objectives, our performance metrics, or our data.
The overreliance on quantitative techniques leads far too many organizations to invest heavily in technologies they don’t understand, and to build capabilities for which they have no clear business justification.
We are being inundated by a massive influx of additional data from a variety of sources. It is essential that organizations have a clear sense of why they are accumulating these new repositories and how they expect to utilize them effectively. The involvement of the business decision makers in specifying the needs and requirements of prospective projects that will utilize these resources continues to be the key factor that separates successful analytics-driven organizations from those that invest heavily in technology centers that yield little return on investment.
What advice would you offer to organizations that are currently ramping up their investments in big data?
The strongest recommendation I would make to organizations launching analytics efforts is that they establish a business-driven orientation to the acquisition of these new technologies. Technology in and of itself will not yield significant enhancement.
Our data is a form of raw materials. Our algorithms and software are tools that are useful for specific tasks. The organizations that will emerge from the current growth surge will be those that have a clear business focus and are goal driven in developing their capabilities. They will build and acquire the skills and resources targeted to achieving specific business objectives, and their success will be evaluated by performance metrics customized to the business’ goals.
Analytics is a business problem, not a technology problem. We use data. We use technology. However, our first priority must continue to be the enhancement of our business performance.