Q&A: Tackling "Big Data" Challenges
How big is big data, and how can you get it under control?
[Editor's Note: This Q&A originally appeared at esj.com.]
As the ongoing explosion of data challenges IT practitioners and organizations of all shapes and sizes, the term "big data" is becoming ubiquitous. Everybody is talking about "big data," but what does it really mean, and does this concept lead businesses to think too broadly about their needs?
For answers, we turned to Don DeLoach, CEO of the open source analytic database provider Infobright. He shares his insights on the myriad of changes happening in the information management landscape, and explains the many different approaches that can be taken to capture, store, analyze, and capitalize on corporate data.
Enterprise Strategies: How big is "big data"? What kind of volumes are we talking about, and how fast is this data growing?
Don DeLoach: Although "big data" has become the new buzzword, there is no agreement on the definition of big. For one company, big could mean 10 terabytes; for another company, big could mean 100 petabytes. Despite all the noise, industry statistics still show that almost 90 percent of data warehouses are under 10 terabytes in size.
However, there is no question that data volume is growing quickly, especially so for "machine-generated data" such as information that comes from online activity or from the huge growth in mobile communications and social media. Valuable data is being generated by a host of devices and machines, including smart phones, PCs, iPads, Web servers, Xboxes, GPS systems, telecom and computer network logs, sensors, and more.
There has been a rash of vendor consolidation in the last several months -- Teradata's purchase of Aster Data, HP's purchase of Vertica, EMC's purchase of Greenplum, among others. What is the impact of this consolidation for end users? What are the benefits and drawbacks to this company activity?
This consolidation is logical, as clearly the value of analytic databases is being recognized on a broader level than ever before in the market. Storage is one issue, but you also need an efficient way to actually extract business value from all this data, so it makes sense for some of the big players and analytic upstarts to join forces.
For end users, there are probably both benefits and drawbacks. On the plus side is the increased investment that can be made to further these technologies. However, there is always a large downside risk that small, nimble companies will be dragged down by large corporate processes and procedures.
The ability to innovate quickly, test new ideas, and stay very close to early customers is the reason so much innovation comes from start-ups rather than from larger, established technology companies.
Are data management professionals up to the challenge of big data, or are they falling behind?
Data professionals will fall behind if they don't change their approach, in terms of both the tools and the processes they are deploying for information analysis. For example, traditional large-scale data warehousing requires extensive configuration and tuning, a massive and expensive hardware footprint, and ongoing resources to store and maintain ever-expanding volumes of information.
It's not uncommon for a large organization to keep an army of database administrators on staff to partition data, create custom indexes, and optimize poorly performing queries. This is simply not sustainable in a world where information requirements change much more rapidly than they used to and IT budgets remain tight. We can see the results in the huge growth of independent application data marts and new types of analytic solutions.
Data professionals need to diversify their solution set and look at new approaches for loading data faster, storing it more compactly, and reducing the cost, resources, and time involved in analyzing and managing it. This requires a willingness to try out new tools and even experiment with different combinations of data solutions.
In light of increasing data volumes, how important is it to match your data management tool(s) to your challenges?
This is very important. There is a reason that businesses use purpose-built tools for certain jobs. You don't want your business solutions to use a standard relational database for everything, just as you wouldn't use a screwdriver when you really need a power drill.
As the information management landscape evolves, careful consideration should be paid to the objectives of an intended business solution and how the underlying computing components need to come together to achieve it.
This is especially true in the database world, where there are good and justifiable reasons for everything from traditional row-based relational databases, to purpose-built columnar stores, to memory-based systems (and the variations that drive complex event processing), to the emerging Hadoop and NoSQL-style key-value stores being used in large SaaS and social media environments.
There's no silver bullet, but understanding and then linking project objectives to the right architecture can mean the difference between a costly failure and an efficient success.
Part of the problem of growing data volumes is the "Rise of the Machines" -- that is, machine-generated data coming from Web logs, call detail records, online gaming, sensor output, and financial transaction data, to name just a few sources. This growing class of data has a distinctive set of characteristics and database/analytic requirements. What's the impact for end users?
The definition of machine-generated data is fluid. Some say it's data generated without any direct human intervention (sensor output, for example) and others say it can include the machine tracking of human activities as well (such as Web logs or call detail records). Even so, some key characteristics apply: new information is added with a high frequency and the volume is extremely large and continuously growing. Businesses need to be able to quickly understand this information, but all too often, extracting useful intelligence can be like finding the proverbial "needle in the haystack."
Traditional databases are well-suited for initially storing machine-generated data, but they are often ill-suited for analyzing it. They simply run out of headroom in terms of volume, query speed, and the disk and processing infrastructure required to support it. The instinctive response to this challenge seems to be to throw more people (database administrators) or money (expanded disk storage subsystems) at the problem, or to scale back by archiving further to reduce the size of the dataset, which really only yields a short-term fix.
Enter the data warehouse, which is seen by many as the only solution to the myriad information management challenges presented by machine-generated data. The problem is, data warehouse projects are generally very costly in terms of people, hardware, software, and maintenance.
What specific features should companies look for to manage the specific challenges introduced by machine-generated data?
Just as there are growing numbers of purpose-built databases for super-large social networks and SaaS environments, databases specifically suited to the management of machine-generated data make sense. In an ideal world, users should have almost immediate access to their data and the flexibility to run new reports, perform ad hoc analytics, or do complex data mining without IT intervention (to tune or manage the database). Taking our "ideal" even further, deep data compression would also be built in, exploiting the characteristics of machine-generated data to achieve 10:1, 20:1, even 50:1 compression, so that the database requires less hardware to run.
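As a rough illustration of why machine-generated data tends to compress well, the sketch below run-length encodes a column with only a few distinct values and long runs. The column contents and function names are hypothetical, and this is not a description of any particular vendor's compression scheme; the toy ratio counts stored entries rather than bytes, so it says nothing about the ratios a real workload would see.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, run_length) pairs back into the original list."""
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical HTTP status column from a Web log: few distinct values, long runs.
status_column = ["200"] * 950 + ["404"] * 40 + ["500"] * 10

encoded = run_length_encode(status_column)

# Ratio of original entries to stored (value, count) pairs.
ratio = len(status_column) / len(encoded)
print(encoded)
print(f"{len(status_column)} entries stored as {len(encoded)} runs")
```

The encoding is lossless: decoding the pairs returns the original column, which is what lets a database answer queries against the compressed form or expand it on demand.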
Affordability is also key. In the past, many small and midsize enterprises were priced out of the traditional data warehouse market, but this is changing with open source, as there are now many open source options available, ranging from the free-to-download to extremely cost-effective commercial open source products that include more features and support. (Infobright offers both.)
What features and technologies are going to get you there? To start, columnar databases (which store data column-by-column versus traditional row-oriented databases) have emerged as a compelling choice for high-volume analytics, and for good reason. As most analytic queries only involve a subset of the columns in a table, a columnar database that retrieves just the data required accelerates queries while reducing disk I/O and computing resources. These types of databases can also enable significant compression and accelerated query processing, which means that users don't need as many servers or as much storage to analyze large volumes of information.
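The difference between the two layouts can be sketched in a few lines. This is a simplified in-memory model under stated assumptions (the table, its column names, and the "values touched" counter are all hypothetical stand-ins for disk I/O), not how any real database engine is implemented.

```python
# Hypothetical Web-log table, stored row by row.
rows = [
    {"ts": 1, "url": "/home", "status": 200, "bytes": 512},
    {"ts": 2, "url": "/cart", "status": 200, "bytes": 2048},
    {"ts": 3, "url": "/home", "status": 404, "bytes": 128},
    {"ts": 4, "url": "/pay", "status": 500, "bytes": 64},
]

# The same table, stored column by column.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def total_bytes_row_store(rows):
    """Sum one field; a row store has to read every field of every row."""
    total = 0
    touched = 0
    for row in rows:
        touched += len(row)  # the whole row is fetched
        total += row["bytes"]
    return total, touched


def total_bytes_column_store(columns):
    """Sum one field; a column store reads only that column."""
    column = columns["bytes"]
    return sum(column), len(column)


row_total, row_touched = total_bytes_row_store(rows)
col_total, col_touched = total_bytes_column_store(columns)
print(row_total, row_touched)
print(col_total, col_touched)
```

Both layouts return the same answer, but the columnar version touches one value per row instead of four. The gap widens as tables get wider, which is typical of analytic schemas.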
Taking these benefits even further are analytic solutions, like Infobright's, that combine column orientation with capabilities that use knowledge about the data itself to intelligently isolate relevant information and return results more quickly.
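One generic way to "use knowledge about the data" is to keep a small summary (minimum and maximum) for each block of a column and consult it before reading the block. The sketch below illustrates that general idea only; it is an assumption made for illustration, not a description of Infobright's implementation, and the block size, data, and function names are hypothetical.

```python
def build_blocks(values, block_size):
    """Split a column into blocks and record each block's min and max."""
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        blocks.append({"min": min(chunk), "max": max(chunk), "values": chunk})
    return blocks


def count_greater_than(blocks, threshold):
    """Count values above threshold, scanning only blocks that might be mixed."""
    matches = 0
    scanned = 0
    for block in blocks:
        if block["max"] <= threshold:
            continue  # no value in this block can match
        if block["min"] > threshold:
            matches += len(block["values"])  # every value matches, no scan needed
            continue
        scanned += 1  # summary is inconclusive, so read the block
        matches += sum(1 for v in block["values"] if v > threshold)
    return matches, scanned


# Hypothetical column whose values arrive roughly in order, as timestamps do.
response_ms = list(range(1000))
blocks = build_blocks(response_ms, block_size=100)

matches, scanned = count_greater_than(blocks, threshold=950)
print(f"{matches} matches, {scanned} of {len(blocks)} blocks scanned")
```

The approach pays off when values cluster within blocks, as they often do for time-ordered machine-generated data; on randomly ordered values most summaries are inconclusive and most blocks still have to be read.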
What products or services does Infobright offer for managing big data?
By providing an instantly downloadable, easy-to-use, open source alternative to traditional database solutions, Infobright is making analysis of machine-generated data possible for companies without armies of database administrators, large budgets, or lots of time. Our analytic database solutions come in two flavors: Infobright Community Edition (ICE), which is free to download, and Infobright Enterprise Edition (IEE), which is available on a license basis and includes service and support. Our self-tuning technology reduces administrative effort (by up to 90 percent), provides strong data compression (from 10:1 to over 40:1), and turbo-charges query performance, offering a fast, simple, and affordable path to high-volume, high-performance analytics.
What new technologies and open source projects (everything from Hadoop and Cassandra to MapReduce and NoSQL) allow specialized information management solutions to co-exist in ways never anticipated before?
There's been a lot of excitement surrounding Hadoop, as well as NoSQL variants like Mongo, Cassandra, Couch, and others, and I think this is good for the data management market. The emergence of new modalities doesn't necessarily replace the existing ones, just as the arrival of airplanes did not result in the retiring of cars or trains. Overall, there is more synergy than overlap. New technologies are creating greater value through integration with existing ones such as relational and columnar databases.
For instance, Infobright has several customers using Hadoop in conjunction with Infobright. Hadoop does not perform the analytics that Infobright does any more than Infobright sweeps petabytes of unstructured data, but the combination of these two technologies is compelling, to say the least. We also have a proof-of-concept in the works linking Infobright with MongoDB.
Returning to our discussion about matching the right tool to the challenge, the reality is that sometimes combining several clever technologies is the key to the best result.