Q&A: Quantum Leap in Technology Addresses Data Issues
IBM distinguished engineer Sam Lightstone discusses the challenges and opportunities offered by big data.
- By Linda L. Briggs
- October 15, 2013
“With our latest offerings, IBM has taken a quantum leap forward in making [big] data manageable, successful and analyzable in a scalable fashion.” That’s IBM distinguished engineer Sam Lightstone discussing IBM’s new BLU Acceleration technology with BI This Week. Lightstone is a business intelligence architect for next-generation data analytics, working with IBM’s DB2 for Linux, UNIX, and Windows development team. He is the product architect for BLU Acceleration, IBM’s technology for parallel, vectorized, in-memory columnar analytics, a project he founded in collaboration with colleagues at IBM’s Almaden Research Lab.
An IBM master inventor with over 40 patents granted or pending, Lightstone has been widely published on the topic of self-managing database systems, and is co-author of five books, including a guide to software development professionalism, Making it Big in Software.
BI This Week: What are some of the constraints around big data today that are slowing companies down as they wrestle with large amounts of data?
Sam Lightstone: First, keep in mind as we use that term that there are many different kinds of big data.
That said, one class of problems is that the data is coming from many different locations and in many different formats. Managing those formats, those different streams of data, is complex, as is unifying it into some whole that is understandable -- not just by human beings but by computers, too. It’s an IT challenge, and a key problem especially in larger organizations -- the more disparate the location of the data, the bigger the challenge. So that’s one class of problems.
Another class of problems is the sheer volume of the data. We live in a world that is increasingly interconnected, and data is being generated from more and more locations. It used to be that just institutions had data. Now, everybody with a phone is generating data -- [along with] your laptops and computers; it’s not just backend servers. This incredible explosion of data continues at a phenomenal rate. The world is generating exabytes of data on a daily basis.
So there’s a problem of volume and how to manage it and leverage it. You have all this data -- how do you leverage it in a way to tell you something useful, to convert that data into useful information?
Dealing with data volume and turning it into useful information has multiple aspects as well. For example, there’s the analytic problem of turning data into useful information. There’s also the engineering question -- how can you do that quickly enough for humans? Nobody wants to wait two days to get an answer to a question; we live in a very impatient world.
Those are some key challenges in working with big data.
When you talk about data formats, is part of the challenge dealing with structured versus unstructured versus semi-structured data?
Yes, but it’s even deeper than that. Within any of those realms, there are many, many different formats. Even within structured data, there are many formats, and it gets more complex as you go from structured to semi-structured to completely unstructured. It’s a painful problem. You’re dealing with problems of conflicting data types, mismatched schemas, and mismatched topologies.
What we’ve done at IBM is, I think, very powerful, in the sense that we haven’t begun this quest to tackle big data by asking ourselves, “What are the interesting engineering problems? What are the interesting scientific problems?” There was a time in IBM’s history when we looked at things in that way.
What we do now -- and what we’re very proud of -- is we look at these problems in terms of, “What are the challenges that are important for our customers?” All of our technology really is focused on this theme: What is important for the customer, and what can we do to meet the challenges of our customers and society with all this data? That’s instead of: what is interesting and scientific to us as engineers?
In listening to your customers, do you find big data is a big challenge for them, or are they beginning to see it more as an opportunity -- that is, more a positive than a negative?
It’s definitely a huge opportunity; it’s also still rapidly emerging. We’re really just at the beginning of the curve in the evolution and adoption of the technology. It’s been around for a few years, but it’s still in its early stages -- like your cell phone or even your television. These things take years to evolve and mature -- not only does the technology itself have to become better, but society has to wrap its head around what it wants to use it for.
When we as a society first invented cell phones, nobody talked about a mobile phone as a place to play backgammon or watch movies. It was a phone. Now actually calling someone is just one minor task your cell phone can do.
Would you say we’re at the point where we’re collecting and storing data effectively but not yet using it effectively? Do companies really know what to do with their data, or is the volume simply overwhelming?
Well, I think that IBM, with our latest offerings, has really taken a quantum leap forward in making this data manageable, successful, and analyzable in a scalable fashion -- in a way that performs well enough for human beings and human reactions. That’s behind the IBM phrase, “analytics at the speed of thought.”
The key idea there is -- and I think this is one thing that engineers across the industry don’t necessarily internalize, but we’ve tried to stress with our engineering teams -- it’s not really about being 20 percent faster or 10 percent faster than your competitor. Certainly, there’s a point where it’s too slow, because human beings are not that patient and businesses can’t afford to wait. There’s another point at the other end of the spectrum where it’s so fast that making it any faster really doesn’t buy you anything.
If you think of it in that context, and consider things from the perspective of human beings, it doesn’t really matter to a person once you can get a sub-second answer, for example. It really doesn’t matter if it’s a tenth of a second, a hundredth of a second, or a millionth of a second -- it’s all faster than your perceptions can acknowledge, so it’s all the same. However, it matters tremendously if it’s an hour versus 10 hours versus 100 hours. By focusing on this notion of making it manageable and scalable to the human being, to human perception, we create a technology that is really valuable and consumable by the marketplace.
That might be the first time I’ve heard someone say that something related to data delivery could actually be too fast.
Yes, that’s not to say that sub-second responses are always fast enough, because sometimes they’re not. A classic case in which they’re not is this: You may be able to process a small transaction or a banking transaction in a millisecond, but if there are a million people or 10 million people doing it all at once, it becomes a totally different situation. The scale and the concurrency -- the number of concurrent users that are pounding on the system -- are important considerations.
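To put rough numbers on that point, here is a minimal back-of-the-envelope sketch in Python. It is not anything Lightstone or IBM describes: the function name `drain_time_ms` and the worker counts are hypothetical, and the model assumes every request arrives at the same instant, each takes exactly one millisecond, and a fixed number of workers each handle one request at a time.

```python
def drain_time_ms(requests: int, service_time_ms: int, workers: int) -> int:
    """Milliseconds needed to finish `requests` that all arrive at once.

    Simplified model (an assumption, not a real database): each request
    takes `service_time_ms`, and `workers` requests run in parallel.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    # Work proceeds in waves of `workers` requests; round up the last wave.
    waves = (requests + workers - 1) // workers
    return waves * service_time_ms


if __name__ == "__main__":
    # 10 million simultaneous one-millisecond transactions, as in the interview.
    for workers in (1, 100, 10_000):  # hypothetical degrees of parallelism
        total_ms = drain_time_ms(10_000_000, 1, workers)
        print(f"{workers:>6} workers: {total_ms / 1000:,.0f} seconds")
```

Under these assumptions, a single worker needs 10,000 seconds (close to three hours) to clear the backlog, even though each individual transaction is far faster than a person can perceive.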
How much of a factor is speed for companies that are failing to use analytics in a useful way? Is a lack of speed really holding people back from using analytics?
Oh, yes. I think speed is a huge issue. It’s absolutely huge -- performance remains one of the main attributes of the technologies that we compete on. It’s not the only attribute. The ability to scale with the volume, the ability to handle all the different formats and so on -- to functionally do what customers need us to do -- is and will always remain a fundamental part of what we compete on. However, performance is almost always one of the main attributes, because the analysis of large amounts of data is time-consuming. It’s time-consuming even if you have 50 gigabytes, but if you have a terabyte, or 20 terabytes, or maybe hundreds of terabytes or even petabytes, then it starts getting really costly in terms of computation time. People are people, and human beings just don’t want to wait.