Q&A: Big Data Warehouses and the Intelligent Enterprise
What are the most important topics BI and data warehouse professionals should be paying attention to?
- By James E. Powell
- January 8, 2013
What do you need to know about the most important topics of BI and DW -- from master data management to big data, platform selections to data virtualization? To learn what topics BI and DW professionals should be watching -- and pick up pointers we haven't heard before -- we spoke to William McKnight, president of McKnight Consulting Group and an experienced, successful, and credentialed information management strategist and practitioner.
Mr. McKnight is a keynote speaker at the upcoming TDWI World Conference in Las Vegas, February 17-22. On February 21st he will discuss Capitalizing on Chaos: Navigating Information Management Possibilities to Build Organizational Value.
BI This Week: You talk a lot about platform selection, but aren't they all so fast these days that it doesn't matter too much which one you choose?
William McKnight: It matters a great deal. Sure, platforms are faster than they used to be, but performance still lags behind demand, especially when workloads are platformed incorrectly. The primary consideration is the platform category that the workload fits. Considering all options is a must because we are dealing with an important -- arguably the most important these days -- asset of the company: information.
Some platforms are genre-benders, and naturally vendors have all bases covered -- until after the sale, when it's "Well, you need to break up the workload and get into our offering in this other category." Go there first. Give your workloads the best chance. Don't assume your implementation will be errorless, either. The right platform can accommodate the inevitable suboptimal development and tuning, but a poor platform category selection leaves you little room for error. We could definitely quibble over the ultimate technology selection, but dividing workloads by their characteristics and assigning each workload to its best platform is a must. Let me repeat that: there is a best platform for each workload. With expertise, you can get there quickly.
We shouldn't be cramming master data management (MDM) functions and big data into the data warehouse, for example. Neither should we be treating all post-operational data access as homogenous out of a single data warehouse with a single tool.
You're quoted around TDWI frequently talking about how various functions are being pulled off the data warehouse. With different analytic stores in the mix now, does one called a data warehouse still need to exist?
Yes, it does. I take that term to mean a post-operational data store whose primary function is feeding other systems and whose secondary function is storing historical data. As for serving every manner of data access, not so much. This store will sit in the architecture alongside analytic stores and, obviously, earlier in the cycle than some of them. There are architected independent data marts, and the data warehouse is not necessarily the sun around which everything orbits. Some of the functions being pulled off the data warehouse are going operational as well. Stream processing and master data management are obvious examples.
Yes, master data management is a great example of what used to be done in our data warehouse. How are those implementations going?
Well, they go very well once they get started. These projects tend to have long justification cycles, usually because MDM can serve an organization in so many different, and valuable, ways. It can be difficult for a shop to conform to another shop's view of MDM, or an analyst's take on MDM, and make it work in their own environment. MDM is really a collection and integration of functions that most shops are already performing -- just not very well, or in an integrated fashion. We're seeing phenomenal value delivered to organizations -- once the projects get going!
What does our audience need to hear about big data that they haven't heard already?
The 10 V's! Just kidding. I think what a TDWI audience needs to understand about big data is that it is not strictly about analytic data -- far from it. Much of the data falling into the big data camp is operational big data. It is being stored in NoSQL stores such as Cassandra, MongoDB, CouchDB, Riak, and the like, and it serves operational functions such as game state, instant offers, and shopping carts. It is very likely -- and we see it already -- that organizations will deploy both Hadoop for analytics and other NoSQL stores for operational purposes.
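To make the operational side of big data concrete, here is a minimal sketch of the kind of workload Mr. McKnight describes: shopping-cart state kept as schemaless documents keyed by session ID. The `DocumentStore` class below is a hypothetical in-memory stand-in, not any vendor's API; real stores such as MongoDB, CouchDB, or Riak offer comparable put/get-by-key semantics.

```python
import json

# Hypothetical in-memory stand-in for an operational NoSQL document store.
# It exists only to illustrate the put/get-by-key access pattern that
# operational workloads (carts, game state, instant offers) rely on.
class DocumentStore:
    def __init__(self):
        self._docs = {}

    def put(self, key, doc):
        # Documents are schemaless JSON -- a natural fit for volatile
        # operational state whose shape evolves frequently.
        self._docs[key] = json.dumps(doc)

    def get(self, key):
        raw = self._docs.get(key)
        return json.loads(raw) if raw is not None else None

# Shopping-cart state keyed by session ID: an operational, not analytic,
# big data workload.
store = DocumentStore()
store.put("session-42", {"items": [{"sku": "A100", "qty": 2}], "total": 39.98})

cart = store.get("session-42")
print(cart["total"])  # 39.98
```

Note the contrast with an analytic workload: access here is by single key at request time, with no scans or aggregations, which is why such data lands in a NoSQL store rather than the warehouse.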
However, the skill set for all of these platforms is data management, and that skill set has been honed over the years by the data warehouse professional in the post-operational world. Data managers rule. They're becoming the real rock stars of the organization -- as long as they see their skills transferring to other platforms and technologies better suited to different workloads than the data warehouse.
So, back to data storage. If Hadoop might, as has been famously predicted, store half of the world's data soon, much of the other half will be in non-Hadoop NoSQL stores.
So many technologies are necessary in an enterprise these days to manage information. It's more, not less. If we're doing this right, what do we need besides the platforms themselves?
One refrain you'll hear from me is to put your shop in the best position to take advantage of information technologies. That means having the ability to quickly test and prototype. It also means having an innovator's mindset and culture. Fail fast if necessary.
Your shop is absolutely going to have to adopt new technologies -- and get them to work together. This leads me to your answer, and that is data virtualization. Data virtualization can help a shop in one of two ways. It can handle the unanticipated query that must be federated because data is dispersed across technology platforms. It can also be architected right into the environment for a daily, hourly, or whenever query cycle when you don't want to permanently integrate data -- or perhaps you do so for a period of time until you can permanently integrate the data.
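The federated-query idea behind data virtualization can be sketched in a few lines: answer one query by joining result sets from two physical sources at query time, without first moving the data into a single store. The sources, the `customer_id` join key, and the `federated_join` helper below are all hypothetical illustrations, not a real virtualization product's API.

```python
# Result set as it might be pulled from the data warehouse
# (names and values are illustrative only).
warehouse_rows = [
    {"customer_id": 1, "lifetime_value": 1200.0},
    {"customer_id": 2, "lifetime_value": 350.0},
]

# Result set as it might be pulled from an operational NoSQL store.
nosql_docs = [
    {"customer_id": 1, "cart_items": 3},
    {"customer_id": 2, "cart_items": 0},
]

def federated_join(left, right, key):
    """Join two result sets in the virtualization layer at query time,
    rather than physically integrating the data beforehand."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]

for row in federated_join(warehouse_rows, nosql_docs, "customer_id"):
    print(row)
```

A virtualization product does this behind a single SQL interface, with pushdown and caching; the point of the sketch is only that the integration happens at query time, which is what makes it useful both for unanticipated queries and as a stopgap until data is permanently integrated.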
Any shop that is treating information as a primary asset will end up with a true heterogeneous environment and will benefit from data virtualization.