Q&A: Use Cases Illustrate Value of Data Lakes
Data lakes are a relatively new concept in the age of big data. In this interview, the second of two parts, Teradata’s Dan Graham and MapR’s Steve Wooledge offer up plenty of examples of how customers are using data lakes.
- By Linda L. Briggs
- February 4, 2016
Wooledge is VP of product marketing for MapR, where he works to identify new market opportunities and increase awareness for MapR technical innovations and solutions for Hadoop. He was previously VP of marketing for Teradata Unified Data Architecture, where he drove Big Data strategy and market awareness across the product line, including Apache Hadoop. Graham leads Teradata’s technical marketing activities. He joined Teradata in 1989, worked for IBM in various capacities, including as an executive for IBM’s Global Business Intelligence Solutions, then rejoined Teradata, where he serves as general manager for enterprise systems. [Part 1 of this interview can be viewed here.]
BI This Week: What are some data lake use cases? What are your customers doing with data lakes?
Wooledge: One example that comes to mind is a joint Teradata and MapR customer, a very large telecommunications company that does B2B services. This company provides cloud computing and Internet services, as well as lots of different data services, to other large enterprises. Historically, they used Teradata for all the report metrics on usage, billing, and so forth. However, what they didn’t have in the data warehouse was information on the B2B portal. They didn’t have information on what clients were looking at and clicking on. Therefore, clickstream data was one source they wanted to examine in order to understand how to better service their customers and how to make the customer Web experience more personalized.
In addition, there was another type of data they wanted to look at. Regarding the services they’re providing, they wanted to know if there were machines within their cloud environment that were spiking in terms of usage, thereby impacting quality of service to their enterprise clients. In a nutshell, there was lots of IP traffic and log information that they wanted to analyze so they could receive alerts about spikes and hot spots in the network.
For those two types of data, they looked to MapR for an inexpensive way to collect all the data -- all those patterns and anomalies, and in the case of clickstream data, all the user behavior and segregation processes and so forth -- and then massage it into some sort of report. That metric or report is moved into the Teradata data warehouse.
In this example, Teradata maintains the front end to the business, really, but a lot of the data science, or exploration of new data types, could now be done in the data lake. That includes some of the ETL processing ... now that they had finer-grained usage statistics on the machines. Lots of that was done in the data lake as well. Now, they could free up analytic cycles on the data warehouse so they could maintain good SLAs for the business.
Graham: Here’s another good example: A vehicle manufacturer (I don’t know which Hadoop they use) had a tremendous amount of sensor data flowing off their vehicles into their system. They were using that data for engineering purposes -- looking for a predictive problem or a circumstance or a root cause for failures. They were also looking to design the next generation of their vehicle.
They had a lot of data coming in. The engineers, being engineers, went out and got themselves a Hadoop system and put the data there. Why? Because these are people who like to tinker with the code -- with the Java. They wanted to do the types of explorations that Steve described. They wanted to look at the data from different angles -- which isn’t really something that’s done easily with structured data in a data warehouse. Using their sensor data, they were looking for root causes, looking for possible vehicle improvements. What they discovered was that ... they could query a Hadoop system and join it to Teradata data. With one use case, they asked, “If we can predict failures on some of the vehicles, how about if we map that out inside the data warehouse to do labor schedules? We have all the information about where the machines are, and what the contract is, and what the warranty is, in the data warehouse.”
They started putting all these things together and saying, “I can join the two sets of data, and I can then adjust my customer support people.” Well, lo and behold, not only did they get this to work, but they did two things we didn’t expect. First, they came up with more use cases as soon as they realized what the Hadoop data could do. ... The unforeseen bonus is that the engineers, who were very techie, and the business users, who were not techie -- all of a sudden, there was a bridge between these two cultures. They started working together in ways that they’d never even known were possible.
Are there common misconceptions about data lakes? Are there myths that you find yourselves having to dispel?
Wooledge: Sure, lots of them. Some people think a data lake is a silver bullet, and you don’t have to worry about data governance and structuring data and some of the traditional data pipelines that you have in a data warehouse. That’s a big one.
Security is another one. I lump security in with data protection and overall liability. A lot of people just don’t put enough planning into making sure that the data lake they put in place has disaster recovery and data availability and good security -- all the things you’d expect from an enterprise system. Just because you have a vendor with a Hadoop distribution doesn’t mean that vendor’s Hadoop distribution has those items. I don’t want to get into products, but you need to hold to the same standards in terms of reliability as well as performance, governance, and security. Those are the things that you need to check the boxes on. You need to really dig under the hood to make sure they are going to apply that to your data lake.
Is governance with data lakes different from governance with a data warehouse?
Graham: It’s different. First of all, you don’t have much of a schema [with a data lake]. You’re dealing with a whole different collection of data. For the most part, it’s files coming into Hadoop. They come in in different formats, of course, but they don’t come in with a centralized “everybody’s agreed to it and we’ve worked through these logical data models and schemas.” That’s not what’s happening.
In fact, that’s a great benefit of Hadoop. The governance process has to face that reality. I think many data organizations have taken the data governance thing too far. They have data priests and they won’t let you touch the data; they position themselves as enforcers rather than supporters. The Hadoop community has reacted by saying, “We don’t want that. We don’t want to be controlled by this small group. We want to be more agile.”
There’s a strong push for some agility and flexibility with the data and with the processes you apply without having a rigid environment.
Regarding data lake myths, the most common myth I hear is that it takes six months to change a data warehouse. What they are really saying is that there is a data warehouse competency center going full force and they don’t want you to change anything. In other words, they’ve gone overboard. The Hadoop community, on the other hand, starts at the polar opposite, with no control. We’re trying to say that they need all this governance, too, but you want to maintain some of the flexibility.
Striking a balance is going to be hard. You want to give people the opportunity to dig around in the dark data. Leave me alone! I don’t know what I’m looking for. I just want to see what’s in here. Give me a week. I don’t want the competency center looking over my shoulder. No, we don’t know what the security is yet because we don’t even know what’s in here until we’ve explored it. That’s very common. You must have some place for innovative, creative thinking.
What about security? Is security different when we talk about data lakes?
Wooledge: It is. When we talk about security with the data warehouse, we talk about having row- and column-level access control to the data. I’m simplifying things somewhat -- but basically, you have one execution layer that more or less controls access to the data. In Hadoop, you have files, you have tables, you might have streams, you have multiple execution engines, you could have a query optimizer as part of Hive -- there are lots of different areas in which access to the data is being allowed.
I’d say that the way to secure data in Hadoop is a little bit patchwork right now. There are efforts underway in the Hadoop community in general to unify security and access control and all those things. It’s still evolving, but an organization that knows how to use the various products and tools that are available to them can maintain fairly strict access control, encryption, auditing, and all those types of things. They are there, but it might not be as out-of-the-box as you would get with more mature relational database technology. It’s different. It’s harder. It’s more flexible, but your security is also more flexible.
Graham: Hadoop is at the beginning of a long journey, and a lot of enterprises have a long journey ahead with Hadoop. It’s still young. You can’t buy your way into security. This takes time. Just look at all these major banks and retailers being attacked by hackers every day. They’ve had years to perfect their security in their relational databases and their mainframes, and they are still getting hammered.
Hadoop is young and there are still many places to work on regarding security. They’re working on it, but it’s not going to happen overnight. It’s not even going to happen in five years.
Wooledge: I don’t know that I agree with that, Dan. I was at a TDWI event recently and MasterCard talked about what they’ve done with Hadoop in terms of applying security and using it in a strategic way. MasterCard, Amex, Visa -- these companies have very strict security policies as well as very strict personal information policies. We have a very large credit card company as a customer of MapR, and they are using Hadoop strategically, and believe me, it is secure. But they have the experts who know how to apply policies using the technology that is there.
What about challenges and drawbacks? As you talk to companies, what are they stumbling over in implementing a data lake?
Wooledge: One of the biggest things is skills. If you’re bringing in a new technology such as a data lake, you have to hire someone with that skill. It’s closing, but there’s still a skill gap.
For one thing, you have about 20 different products that call themselves Hadoop. Right there is a proliferation of new technology, and no one person knows it all. Look at Spark -- it’s really only two years old or less, so finding people who have been able to spend some time on it and are good at it is tough. Having turned it on and used it for a year doesn’t necessarily make you proficient. Good people are just hard to find.
Graham: I think one of the biggest challenges is this expectation that Steve mentioned earlier -- that I can store all this data in the data lake and I don’t have to do any of the hard work. I don’t have to have governance, I don’t have to have a plan, I don’t have to have an expectation of ROI, I don’t have to have a lot of things that are historically expected for any product. If we build it, they will come. Just put all the data in there, and somehow data will pop out the other end.
Given that, do you have advice for people getting started with a data lake? What do you tell customers?
Graham: Start with the definition. Know what you’re talking about. Then go to use cases. Which one of these would be attractive to a business user? As long as we’re following the guidelines of getting business value out, then we have a much higher chance of taking a data use case and proof of concept to the data lake. That isn’t any different than most other projects.
Wooledge: Ideally, as Dan said, you have a business use case and goals in mind. That’s absolutely number one. Many people want to just download and play, so there are a lot of options out there to get you started. There is software available for free and training available for free.
I agree with Dan. It comes back to what is the value of doing this? You’re trying to make life easier for users in some way. Many people assume that cost is the No. 1 reason, but with our customers and their data, the reason they started with Hadoop wasn’t cost. It was new products and new services that can drive new applications. I would say aim high and think about the new kinds of things you can do, not just trying to replace the old with something cheaper.
Graham: Right. Cost is an important driver for considering a data lake, but if it’s your only driver, you’re in the wrong neighborhood.