Q&A: Big Data and Database Technologies

When it comes to big data, technologies abound. How do you select the options best for your environment?

Big data affects the storage and database technology choices you make. To help us sort out the options, we turned to Paul Dix, author of the video training series "Working with Big Data LiveLessons" published by Addison-Wesley Professional.

BI This Week: When you speak of big data as it relates to databases, are you simply talking about data volumes or something more?

Paul Dix: Big data as a whole has other meanings, but for storage it's mostly about volume -- although I also think of it as being about how the data is stored. Big data and storage can mean more than just relational storage. You can have columnar storage or unstructured storage like Hadoop. However, even those topics within big data storage mostly concern themselves with volume.

A number of vendors are introducing database technologies that are quite different from the traditional RDBMS approach, such as column-oriented databases. What advantages do these newer technologies offer, and what does that mean for the future of existing row-based databases?

NoSQL technologies like Cassandra or HBase have a clear advantage when it comes to big data. They support clusters of hundreds of servers with total storage measured in terabytes or petabytes. Columnar databases are useful when performing calculations on a single column of data, such as calculating aggregates. They can scan many times more rows in the same amount of time as a row-oriented data store. Many of these technologies also make schema updates to existing tables much easier and faster.
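The aggregation advantage can be illustrated with a toy sketch (plain Python, not any vendor's engine): in a row-oriented layout the aggregate walks every record and extracts one field, while a column-oriented layout stores each column as its own contiguous array, so summing one column touches nothing else. The table and field names here are invented for illustration.

```python
# Row-oriented layout: one dict per record, all fields together.
rows = [
    {"user_id": 1, "country": "US", "spend": 10.0},
    {"user_id": 2, "country": "DE", "spend": 25.0},
    {"user_id": 3, "country": "US", "spend": 7.5},
]

def sum_spend_rows(records):
    # Must visit every record and pull out one field from each.
    return sum(r["spend"] for r in records)

# Column-oriented layout: the same table stored as one list per column.
columns = {
    "user_id": [1, 2, 3],
    "country": ["US", "DE", "US"],
    "spend": [10.0, 25.0, 7.5],
}

def sum_spend_columns(cols):
    # Reads a single contiguous array -- no per-row unpacking.
    return sum(cols["spend"])
```

In a real columnar engine the contiguous column also compresses well and stays cache-friendly, which is where the "many times more rows per second" advantage comes from.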

For existing row-based databases to be competitive in the context of big data, they'll have to add more sophisticated clustering. That means sharding data in tables across multiple servers and performing queries across the cluster in real time. Currently, if you want to do these types of things with traditional row-based databases, you have to build all that logic into the application layer.
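The application-layer sharding described above often boils down to hashing a row key to pick a server. A minimal sketch, with hypothetical server names, might look like this:

```python
import hashlib

# Hypothetical shard servers; a real deployment would load these
# from configuration and handle rebalancing when servers change.
SHARDS = ["db-server-0", "db-server-1", "db-server-2", "db-server-3"]

def shard_for(key: str) -> str:
    """Deterministically map a row key to one shard."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]
```

The hard parts (cross-shard queries, resharding when a server is added, handling a dead shard) are exactly the logic that distributed databases such as Cassandra and HBase take off the application's hands.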

Are these new technologies hard to learn or integrate in an organization's existing infrastructure?

Non-traditional databases can be fairly difficult to integrate into an organization. On one side you have the operations aspect. Sysops or devops in most organizations have years of knowledge and experience administering monolithic relational databases. Distributed databases such as Cassandra and HBase present different challenges when it comes to operations. The recurring tasks that need to be run, the errors encountered, and how to back up the database are all different from traditional RDBMSes.

Then there's the developer side of things. There's definitely a learning curve for developers when picking up NoSQL or columnar databases. They have to think about their data in a new way. Considerations in schema design in a NoSQL database are very different from those in a relational database. The development team has to have a good idea ahead of time of how they'll need to look up and query the data.
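That "design the schema around the query" mindset can be sketched with ordinary Python dicts standing in for a wide-row NoSQL layout: instead of normalized tables joined at read time, data is stored pre-grouped under the key it will be looked up by. The user-events example here is invented for illustration.

```python
# Partition key (user id) -> list of (timestamp, event) rows.
# This mimics a wide-row layout: everything for one key lives together.
events_by_user = {}

def record_event(user_id, timestamp, event):
    events_by_user.setdefault(user_id, []).append((timestamp, event))

def recent_events(user_id, limit=10):
    # The read is a single key lookup -- no join, no table scan.
    rows = sorted(events_by_user.get(user_id, []), reverse=True)
    return rows[:limit]
```

The trade-off is that queries you didn't plan for (say, "all events of one type across all users") have no efficient path, which is why the query patterns have to be known up front.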

These challenges aren't insurmountable; they just need to be planned for accordingly. The open source NoSQL databases aren't drop-in replacements for relational DBMSes, so they require an investment in education, operations, and implementation.

Performance is one thing, but price can be another, especially for budget-strapped IT departments. Open source is one approach to cutting costs. Tell us about open source DB options.

For big data, the three primary players in the open source world are HBase, Cassandra, and Riak. Each has its own strengths and weaknesses. They all have a slightly different take on how data is modeled, so which one makes the most sense to use depends heavily on the specific use case.

All three share a few common design principles though. They're designed to be highly available, fault-tolerant, and horizontally scalable. Single server failures shouldn't take down a running cluster, and you should be able to add more capacity to a cluster seamlessly by just adding more servers.

HBase is a project that came out of Hadoop. It uses the Hadoop Distributed File System (HDFS) as a building block for a data store roughly modeled after Google's BigTable storage system. Cassandra was originally built inside Facebook and is a hybrid key/value store and row database. However, its model can also be used like a columnar database. Cassandra can also be combined with Hadoop to give it MapReduce functionality for large batch processing jobs. Riak is a distributed key/value store with MapReduce capabilities and full-text search.

Databases aren't the only technology affected by big data, of course. There's a variety of other information, from unstructured social media text to streaming data, that makes up big data. Do these data sources pose any additional/different challenges?

Unstructured or semi-structured data is where NoSQL data stores generally have their biggest advantages. With this type of data, you often only have to look it up by a key/value pair, which is a great fit for Riak, HBase, or Cassandra. It's much easier to scale out these data stores than a traditional RDBMS. For streaming data, the challenge lies not only in the data store but in the real-time processing required to handle it. For that you need to start looking at messaging systems.

What different open source messaging systems are available that work well in the context of big data?

Two messaging systems that I've had success with are RabbitMQ and Kafka. They have fairly different designs, so it's best to see which matches up more closely with your needs. RabbitMQ works best when the entire message queue fits in memory. However, it will scale horizontally across a cluster with multiple queues. Kafka also scales well and is a distributed system that works even when the queue is far too large to fit in memory.
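The design difference can be made concrete with a toy model (not Kafka's actual implementation): a Kafka-style log is an append-only sequence that consumers read by offset, so the broker never needs the whole queue in memory at once. A real broker keeps the log on disk and replicates it; a Python list stands in here.

```python
class Log:
    """Toy append-only log: messages are addressed by offset."""

    def __init__(self):
        self.messages = []

    def append(self, msg):
        self.messages.append(msg)
        return len(self.messages) - 1  # offset of the new message

    def read(self, offset, max_count=100):
        # Consumers track their own offset and poll forward from it,
        # so slow consumers don't force the broker to buffer for them.
        return self.messages[offset : offset + max_count]
```

In a RabbitMQ-style broker, by contrast, the broker tracks delivery and acknowledgment per message, which is convenient but is part of why it performs best when the queue fits in memory.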

Another promising new entry in open source messaging systems is NSQ. It's a distributed real-time messaging system built at Bitly that is designed to handle billions of messages per day.

How is big data affecting other IT issues, such as performance? What practices have you found work best in delivering optimal performance for analytics applications in big data environments?

When it comes to performance and big data, it's all about two things: what can you keep in memory and what do you need to pre-compute. Keeping things in memory seems counter-intuitive when talking about big data. However, the goal is to spread your data out over a cluster of machines so you end up with more memory than you'd have in disk space on a single box, so results can be returned without expensive disk access.

Pre-computing means you run calculations as data comes into the system so that when a query is issued, you're just returning a value instead of running over millions or billions of records. The issue with pre-computing is that it doesn't work for ad hoc queries, so you'll have to know ahead of time what to compute or backfill it with a more traditional batch processing MapReduce job.
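A minimal sketch of that pre-computation idea: maintain the aggregate as each record arrives, so a query returns a stored value instead of scanning millions of rows at read time. The page-view counter here is an invented example.

```python
# Running aggregate, updated at write time: page -> view count.
page_views = {}

def ingest(event):
    # Called once per incoming record as it enters the system.
    page = event["page"]
    page_views[page] = page_views.get(page, 0) + 1

def query_views(page):
    # The "query" is now a constant-time lookup, not a scan.
    return page_views.get(page, 0)
```

This is exactly why ad hoc questions are the weak spot: an aggregate nobody thought to maintain at ingest time has to be backfilled with a batch job over the raw records.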

Earlier you mentioned messaging systems and real-time data processing. When would you use real-time infrastructure vs. traditional batch-based big data approaches?

Batch-based methods are great for when you're doing ad hoc querying or processing large amounts of unstructured text after you've collected everything. Real-time infrastructure is useful to process unstructured text into more structured forms that go into databases, pre-processing results for queries that are run frequently, or things that require results returned to a user while they wait. With real-time data processing, you have to know upfront what your queries are and pull out the results as data comes in.
