Q&A: Understanding Data Lakes
Experts from Teradata and MapR discuss what data lakes are and how you can benefit.
- By Linda L. Briggs
- February 3, 2016
Data lakes present many benefits, along with some potential challenges and misunderstandings. In this interview, the first of two parts, we talk with Teradata's Dan Graham and MapR's Steve Wooledge about the basics and benefits of data lakes, including use cases. "We see many customers who start with a data lake in [a basic] way, hoping to uncover insights from new data that they haven't explored before," Wooledge says.
Wooledge is VP of product marketing for MapR, where he works to identify new market opportunities and increase awareness of MapR's technical innovations and solutions for Hadoop. He was previously VP of marketing for Teradata Unified Data Architecture, where he drove big data strategy and market awareness across the product line, including Apache Hadoop. Graham leads Teradata's technical marketing activities. He joined Teradata in 1989, later worked for IBM in various capacities, including as an executive for IBM's Global Business Intelligence Solutions, and then rejoined Teradata, where he now serves as general manager for enterprise systems.
BI This Week: Data lake is a relatively new term in the industry. Dan, how does Teradata define a data lake?
Graham: We shared the formal Teradata definition at the November Webinar with TDWI and Wayne Eckerson [of The Eckerson Group]: "A data lake is a collection of long-term data containers that capture, refine, and explore any form of raw data at scale, enabled by low-cost technologies, from which multiple downstream facilities may draw upon."
At Teradata, we spent a lot of time getting that definition right. It's not product-specific -- as it says, you can use any product you want. The core of the definition is this: data at scale, at low cost, feeding downstream facilities, and being captured for exploration.
Let's drill down on "low-cost technologies." You used a good phrase in the Webinar, Dan: "Hadoop is more than a data lake and a data lake is more than Hadoop." Can you expand on that?
Graham: You can use a lot of different things to build your data lake. Sure, nine out of ten people are grabbing Hadoop first, but people are also building data lakes with Amazon, with Cassandra, with many different things. The point is, you're not limited.
In a way, it's similar to the definition of a data warehouse: it's subject-oriented, it's about integrated data. Most everyone builds it with a relational database, but you don't have to. The parallel is this: you build a data warehouse with a relational database, but a relational database has other jobs. The relational database is not a data warehouse and a data warehouse is not a relational database. It's the same thing with the data lake.
Hadoop is not a data lake. It's the primary data lake tool most people use, but you can use other tools, and there are tools that Hadoop doesn't have that your data lake is crying out for.
Wooledge: Before we talk about the tools the data lake is crying out for, I'd add that one reason many customers are using Hadoop is its flexibility. It has such a general-purpose file system underneath and an ecosystem of general-purpose processing engines on top.
That makes it an excellent way for people to throw a bunch of files of any style or format or data type directly in, without having to think about how they're going to organize it just yet. That's not to say you shouldn't organize it -- even Cassandra requires some sort of structure for the data. You have to transform it in some way to get it loaded.
In a way, Hadoop has become a data insurance policy for people who have lots of data and aren't sure yet if they want to do anything with it. With Hadoop or something like it, they can capture the data in its rawest format, then figure out later whether they want to structure it and analyze it and so forth.
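[Editor's note: as a rough illustration of the "capture now, structure later" approach Wooledge describes, here is a minimal schema-on-read sketch using Spark on Hadoop. The paths, file formats, and field names are hypothetical.]

```python
# A minimal schema-on-read sketch (illustrative paths and field names).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-capture").getOrCreate()

# Step 1: land the files exactly as they arrive -- no schema decisions yet.
# (In practice this is often just a file copy into HDFS or object storage.)
raw = spark.read.text("hdfs:///landing/app_events/2016/02/*.log")
raw.write.mode("append").text("hdfs:///datalake/raw/app_events/")

# Step 2, later, if the data proves worth structuring: apply a schema on read.
# Here we assume the events are JSON lines with a few known fields.
events = spark.read.json("hdfs:///datalake/raw/app_events/")
events.select("user_id", "event_type", "event_ts").show(5)
```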
Graham: You can build a data lake with a Teradata product, and some people are doing that already, but as Steve points out, Hadoop is quite flexible. ... It can do other things. Hadoop can be an operational data store, for example. It doesn't really need to be a data warehouse or a data lake. ...
[Speaking of tools that Hadoop doesn't have,] Hadoop doesn't come with a big boatload of ETL functions, so the data lake has to get that from somewhere. There's manual labor, of course -- you could build your own functions, or you can go out and get [products from] Talend or Informatica or IBM DataStage and use one of them to work on the data inside Hadoop.
Steve, can you expand on Dan's comment that Hadoop is crying out for more tools?
Wooledge: There are open source ecosystem projects such as Pig or Storm or Spark that people are using to transform data that comes into the data lake, but that's primarily custom coding that someone has to do, just like some of the ETL tools out there. ... When you get [into tools for the data lake,] you want libraries of reusable components, tools that Talend or others already provide. A lot of people have those technologies already, and they can apply them to the data lake.
The data lake is a pool. It's a pool of resources that you still might want to run commercial software packages on top of to help create a repeatable data flow. The advantage of these libraries of transformation steps and logic is that your code isn't a one-off custom job that has to be maintained by a single developer.
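[Editor's note: to make the contrast concrete, here is a hedged sketch of the kind of hand-coded Spark transformation Wooledge mentions -- the sort of step a commercial ETL tool would package as a reusable, maintainable component. The table, fields, and paths are illustrative assumptions.]

```python
# A hand-written transformation step in the data lake (illustrative names).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-transform").getOrCreate()

orders = spark.read.parquet("hdfs:///datalake/raw/orders/")

# Standardize a few fields and drop obviously bad records.
cleaned = (
    orders
    .withColumn("order_date", F.to_date("order_ts"))
    .withColumn("country", F.upper(F.trim("country")))
    .filter(F.col("amount") > 0)
)

# Write the refined copy back to the lake for downstream consumers.
cleaned.write.mode("overwrite").parquet("hdfs:///datalake/refined/orders/")
```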
With a data lake, you can preserve this original, untouched data in a relatively inexpensive way. What are some of the benefits of that?
Wooledge: For me, there are two fundamental benefits. First, people are looking for places to offload costs, and a data lake certainly can offer that. Second, they are looking for ways to find new insights from new types of data, types they haven't used in their current analytics process.
First, on the cost side, people are looking at workloads that could be done on a less-expensive platform wherever it makes sense. It might simply be online archiving of data that isn't being queried very frequently, as long as you can still access it via your data warehouse through something like Teradata's QueryGrid. People may want to store some of those tables in the data lake because it can be a cheaper platform.
To Dan's earlier point, it doesn't have to be Hadoop. It could also be a low-cost Teradata appliance, for example. In any case, you're offloading workloads from a high-end, in-memory data warehouse to a lower-cost, commodity, scale-out hardware cluster kind of configuration.
Of course, there are other costs -- ETL is one of them. Maybe we're talking about a one-time batch job in which you're stripping out duplicate addresses or something similar. As long as it doesn't require lots of complex SQL or lookup tables or other things that live in the data warehouse, maybe you can do it in Hadoop.
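[Editor's note: a minimal sketch of the kind of one-time dedup batch job described above, run with Spark in the data lake rather than in the warehouse. The customer file, its columns, and the paths are assumptions for illustration.]

```python
# One-off deduplication in the lake (illustrative file and column names).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedupe-addresses").getOrCreate()

customers = spark.read.csv(
    "hdfs:///datalake/raw/customers/", header=True, inferSchema=True
)

# Keep one row per address; no lookup tables or complex SQL required.
deduped = customers.dropDuplicates(["street", "city", "postal_code"])

deduped.write.mode("overwrite").parquet("hdfs:///datalake/refined/customers/")
```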
Getting back to that second benefit -- people are getting more and more into analytics and discovering new insights from new types of data. In my opinion, that's really the most interesting part of all this. Now you're bringing in that data even though you haven't yet figured out what its value is. We use the term low-business-value-density data for it. It could be clickstream data or data from servers in your datacenter. In any case, you want to explore that information to see if there are interesting patterns that would indicate something you want to begin measuring on a routine basis.
At that point, you're just exploring data, massaging the data, and doing data discovery in something like a data lake. If you find something interesting, you might move the data you're trying to track into your data warehouse. There, you have repeatable, fast, active analytics for users or applications that need to routinely access whatever you've now defined.
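[Editor's note: as one illustration of that discovery step, the sketch below scans raw clickstream events for a pattern -- which page each session ends on -- that might turn out to be worth measuring routinely. Paths and field names are hypothetical.]

```python
# Exploratory pass over raw clickstream data (illustrative fields and paths).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("clickstream-discovery").getOrCreate()

clicks = spark.read.json("hdfs:///datalake/raw/clickstream/")

# Find the last page viewed in each session; a spike on one page (say, a
# checkout form) could be a pattern worth promoting to routine measurement.
last_click = Window.partitionBy("session_id").orderBy(F.col("event_ts").desc())
exit_pages = (
    clicks
    .withColumn("rn", F.row_number().over(last_click))
    .filter("rn = 1")
    .groupBy("page")
    .count()
    .orderBy(F.col("count").desc())
)
exit_pages.show(20)
```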
We see many customers who start with a data lake in that way, hoping to uncover insights from new data that they haven't explored before.
Graham: Here's an example. Let's say that sensor data comes in, maybe two or three or ten terabytes. As a data architect, here are your choices: You can load it into the data warehouse and burn four to eight hours of compute time every night, and then do some data reduction on it. If you work on it all night, you can get it down to a gigabyte of actual value (the signal-to-noise ratio in some of these cases is pretty poor). You burn the heck out of an expensive machine all night -- or you could put it in Hadoop and get the same result. You could distill it and pass a gigabyte to the data warehouse. That's one cost saving -- you don't have to burn out your data warehouse just to do ETL.
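[Editor's note: a hedged sketch of the distillation Graham describes -- aggregate the raw sensor readings in Hadoop and hand only the small summary to the warehouse. The paths, column names, and JDBC connection are assumptions, and the final load assumes the warehouse vendor's JDBC driver is on the Spark classpath.]

```python
# Distill raw sensor data in the lake, then ship the summary to the warehouse.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sensor-distill").getOrCreate()

readings = spark.read.parquet("hdfs:///datalake/raw/sensor_readings/2016-02-02/")

# Terabytes of raw readings reduce to one row per sensor per hour.
summary = (
    readings
    .withColumn("reading_hour", F.date_trunc("hour", "reading_ts"))
    .groupBy("sensor_id", "reading_hour")
    .agg(F.avg("value").alias("avg_value"), F.max("value").alias("max_value"))
)

# Load only the distilled result into the warehouse (hypothetical JDBC target).
summary.write.jdbc(
    url="jdbc:teradata://dw.example.com/DATABASE=analytics",
    table="sensor_hourly_summary",
    mode="append",
)
```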
The other aspect that Steve was getting at is cold data and dark data. With cold data, you're looking at something that has aged. Maybe it's seven years old and we're not looking at it very often. Do we really want to keep it on the big, expensive machine? Instead, we put it on Hadoop and just have remote access to it.
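[Editor's note: an illustrative sketch of that cold-data pattern -- copy rows older than seven years out of the warehouse into cheap Parquet storage on Hadoop, where they stay available for occasional remote access. The table, connection details, and cutoff logic are hypothetical.]

```python
# Offload cold warehouse data to the lake (illustrative table and connection).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cold-offload").getOrCreate()

# Read the warehouse table over JDBC (assumes the vendor's driver is available).
history = spark.read.jdbc(
    url="jdbc:teradata://dw.example.com/DATABASE=analytics",
    table="sales_history",
)

# Rows older than seven years (84 months) go to Hadoop, partitioned by year.
cold = (
    history
    .filter(F.col("sale_date") < F.add_months(F.current_date(), -84))
    .withColumn("sale_year", F.year("sale_date"))
)
cold.write.mode("append").partitionBy("sale_year").parquet(
    "hdfs:///datalake/archive/sales_history/"
)
```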
Dark data is data that people are throwing away, generally because no one has figured out what to do with it yet -- no one has figured out how much value it has. The signal-to-noise ratio is not good with some of the dark data we've seen -- there might be one gold nugget for every half ton of dirt we dig into, but the gold nugget is a good one. With Hadoop, you can take a ton of dirt, sift it, and find the nugget instead of losing the value. You don't have to throw it away.
[Editor's note: The conversation continues in part two.]