at the TDWI Conference in Orlando December 7-12, 2014.]
TDWI: It's clear you think the big data conversation has become unbalanced and moved away from data and toward data management technologies. Why has that happened?
Marc Demarest: I am not sure I know why we seem to want to drift from conversations about data, as the raw material for decision-making, to other kinds of conversations. I've always assumed that the drift stems from a lack of comfort we have, collectively, with decision-making as a process, a problem we share with our precursor discipline, decision support systems (DSS). DSS is largely about non-technical factors -- about people, organizations, and politics, and about how and why companies do (and don't) model their decision-making processes and let data drive their decision-making processes and outcomes. DSS is sociotechnical, not technical. It is more concerned with the sociological problem than with the technical infrastructure required to support decision-making.
We seem to prefer a periodic change in vocabulary. We used to talk BI, and now we talk analytics, but the things we talk about don't change, and those things are, largely, compiled things. Shiny, programmatic objects. Many of us are technologists, or think we ought to be technologists, and so we drift in the direction of technology more or less naturally. Those of us who aren't technologists by inclination find ourselves being pulled by the technological center of gravity into largely technical discussions, particularly when we hear statements -- often from suppliers -- that this or that technology inherently provides some kind of competitive advantage or uniquely offers the opportunity to discover "new insights" in our company's data.
Those sorts of statements are maddening, and so we rush into the conversation -- and the damage is done. The current penchant for wallowing in big data technologies isn't the first time the discipline has drifted into largely or wholly technical conversations. We did the same thing with multidimensional database management systems and OLAP in the 1990s, and with data warehousing appliances a decade or so ago. We seem, historically and certainly at present, to be eminently distractible.
What about the "three Vs" and other models that have been proposed to understand the complexities of the new data challenges? Have those helped recenter the conversation?
I don't think so. They've mostly served to offer evidence -- in the weakest sense of that term -- for the position that the technologies we've used for the past 20 years or so to build decision-making infrastructure are now, somehow, inadequate. Volume -- your tech can't handle it. Variety -- your tech can't handle it. Velocity -- forget it. We have forgotten, already, that when McKinsey declared the era of Big Data, they offered one concrete element for a definition of the new age: data beyond the capability of conventional data management technologies.
McKinsey is McKinsey -- they are capable of making a market, and they did that. They opened up a hole for a bunch of technology suppliers, who drove various trucks and tanks through that hole, flying the "three Vs" flag and the "discovery" flag and the "data scientist" flag. All of those flags are, in my opinion, false flags. They're technology flags; they point to particular technologies, not to particular kinds of decision-making problems and the ways of wrangling data that those decision-making problems require.
You sound like you're sneaking up to a defense of conventional data warehousing.
I am. No doubt about it. The data warehouse, as the centerpiece of an information distribution environment that provides every decision-maker in the organization with basic decision-making information, is as relevant as it has ever been. There's no doubt in my mind that the data warehouse as the centerpiece of data engineering -- or data integration, as I guess we're supposed to call it now -- is compromised, probably for all time. I've no doubt that the great dream of conventional data warehousing -- that it was possible to pre-engineer 100 percent of the data that all decision-makers might require, at any future point in time -- was a false one.
The data warehouse isn't dead, but it's surrounded. That's as it should be, or at least that's how it should be, for companies that have managed to build a functional, enterprisewide data warehousing environment. They've done a decent job of information distribution, and they need to get on with the process of getting competent at the next level of the maturity model: exploiting data -- big and small, slow and fast -- to make more, better, and more data-leveraged decisions.
Now, we need to be honest, and acknowledge that plenty of companies large and small have not succeeded in making enterprise data warehousing work, regardless of the specific technologies they've chosen. They can't distribute information from the point of capture to the point of use, reliably, repeatedly, and securely. For those companies that haven't managed to build an effective enterprisewide data warehouse, the news that a complete technological renovation is required in order to do "big data" -- whatever that means -- is good news. I don't have to finish what I've started or fix what I've messed up. I can just move on.
Do you really believe that's happening?
I do. My colleagues and I see the pattern, regularly. We're asked by organizations to advise them on big data or analytics and discover during our initial diagnosis that the organization is making decisions today based on hundreds of extracts from several moribund data warehouses, on different technology platforms, feeding thousands of desktop data marts implemented in Excel or Access, which are in turn the sources of yet further extractions.
The "data extraction" problem that in the late 1980s clued in Inmon to the need for a data warehouse has been reproduced using several generations of modern technology. We don't use IND$FILE anymore. Now we use e-mail and file servers, but the informal, ungoverned network of extractions is still the dominant means of information distribution in too many companies. It's the underground economy of data, if you will.
It's an open question -- and an important one -- whether a company can tackle the unique problems and opportunities presented by our increasingly large data management scope (capturing more data of more sorts at higher rates of capture, leveraging more data that originates outside the company, and so forth) when the organization has been unable to install effective baseline information delivery, with appropriate governance and security. My position is: you can't go to Phase 2 -- effective management of so-called "big data" and in particular the unique decision-making problems associated with streaming data -- when your organization has no heritage of organized, systematic information distribution supported by the conventional technologies: conventional relational database management systems and conventional query, reporting, and dashboarding tools.
You've drifted into a discussion of technology.
I have, haven't I, but in the interest of pushing what I'll call the information economy, inside the corporation, into the light. From my perspective, what's always true about the information economy inside any company is: demand will get met, some way or another. People who need to make data-driven decisions will make them and will get data to make them, regardless of the source of that data. That's been true as long as I have worked in the business. As we change over to a generation of knowledge workers that expects to have massive amounts of data to inform every decision they make, the perennial and inexorable demand for data will only become more stark. Plenty of sources, and plenty of inflexible demand -- these are the givens of the environments we work in. The supply chain in between those sources and that inflexible demand is either formal -- a data warehouse, data marts, big data infrastructure, what have you -- or it is informal: an underground economy, self-assembling, undocumented, and ungoverned.
If you inspect one of those underground economies closely, you'll discover that the logistical network that moves data from source to sink, in any given instance, is very brittle. Often the supply chain depends on a single individual outside of the IT organization with a day job who just happens to be competent to extract data from the official locations in which it pools inside or outside the organization.
Changing the technology we use to persist data for decision-making has a profound effect on the abilities of those individuals to serve their portions of the informal information economy. Sometimes, it thwarts their abilities completely. The informal information economy breaks down. Although no one may be aware of the informal economy, everyone will be aware when it stops working -- particularly when the people at the end of those attenuated informal supply chains are customers, channel partners, suppliers, or regulators.
If you formalize that information distribution network and that economy using technology, you can feed new kinds and new volumes of data into the network with predictable downstream effects and appropriate guardrails.
You can formalize that information distribution network with conventional technologies -- with what we used to call BI infrastructure.
You can't formalize that information distribution network with big data technologies. They weren't designed for that use case, and many suppliers of big data technologies have no interest in that use case.
These, it seems to me, are the facts of the matter.
Is this a preview of your keynote?
That last metaphor -- the idea of an information economy, inside the company but extending across its boundaries -- is very much a preview of the keynote. I'll tilt at a few other windmills, but my goal is to get folks back to where the industry began -- to an understanding of the essentials of information demand and supply, rather than a focus on the technical characteristics of the plumbing we might choose to use to connect demand with supply.
Enter the data lake.
A lake is not a reservoir, but yes, to the extent that we're moving from built-environment metaphors such as "data warehouse" to natural-world metaphors such as "data lake," we're also moving to metaphors of flow: of how data moves and how it's transformed as it moves through a designed or ad hoc system.
Marc Demarest is CEO and a principal in Noumenal, Inc., an international management consulting firm based in the Pacific Northwest and the UK that provides a range of management and technical consulting services to high-tech, biotech, nanotech, and greentech firms. Widely known as an early proponent of data marting and tiered enterprise data warehousing models, Demarest is currently writing a book on nontechnical aspects of enterprise data warehousing. You can contact the author at