Q&A: New Tools Mean Big Data Dives Can Yield Fast Results

Users can explore and analyze even massive data sets quickly with new tools and platforms now available.

Analytic tools that can tap into raw, unstructured machine data are becoming increasingly important and valuable, enabling organizations to explore unstructured data on the fly without fixed schemas or specialized skill sets. In this interview, Brett Sheppard, director of big data product marketing at Splunk, explains how new tools and analytic platforms enable even non-technical users to explore massive data sets, completing analysis of unstructured data in Hadoop in hours instead of weeks or months. Based in San Francisco, Sheppard has been a data analyst at both Gartner and the U.S. Department of Defense and is a certified Hadoop system administrator.

BI This Week: Let's start with a fairly basic question. With all this talk of more and more data, what are some of the newer sources for all this raw data? Where's it coming from?

Brett Sheppard: Social, cloud, and sensors all contribute to big data. In social media and e-commerce, every click on a Web site generates some type of clickstream data, as does every interaction on social media and every tweet. There's significant value for organizations in having an ongoing dialog with their customers and prospects and with their communities.

Likewise, every application in the cloud generates sets of data, whether it's from the hardware, the applications, the security systems, the IT operations for data, or apps in the cloud. Finally, we're seeing a wealth of very actionable data generated by sensors, ranging from automobiles and airplanes to building sensors.

For example, Eglin Air Force Base in Florida is reducing their base-wide energy costs by 10 percent or more by taking data from the heating, ventilation, and air conditioning systems, the hardware systems -- basically anything in a building that generates data -- and using that sensor data to find where inefficiencies are, whether it's lights on in the middle of the night or a center that's running air conditioners too much. They are thus able to uncover energy waste and to find the right balance of heating and cooling.
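A minimal sketch of the kind of check described above, written in Python. It is not Eglin's actual method: the readings, building names, threshold, and night-time window are all made-up assumptions used only to show how raw sensor readings can be filtered for after-hours waste.

```python
from datetime import datetime

# Hypothetical sample readings: (timestamp, building, lighting load in kW).
READINGS = [
    ("2014-03-03 02:00", "hangar-1", 14.2),
    ("2014-03-03 02:00", "office-7", 0.4),
    ("2014-03-03 13:00", "hangar-1", 15.0),
    ("2014-03-03 23:30", "office-7", 6.1),
]


def after_hours_waste(readings, threshold_kw=5.0, night_start=22, night_end=5):
    """Return readings taken at night whose load exceeds threshold_kw."""
    flagged = []
    for stamp, building, kw in readings:
        hour = datetime.strptime(stamp, "%Y-%m-%d %H:%M").hour
        # The night window wraps around midnight (e.g. 22:00 to 05:00).
        is_night = hour >= night_start or hour < night_end
        if is_night and kw > threshold_kw:
            flagged.append((stamp, building, kw))
    return flagged


for row in after_hours_waste(READINGS):
    print(row)
```

With these sample values, the hangar at 02:00 and the office at 23:30 are flagged, while the daytime reading and the low night-time reading are not.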

That's just one example. Basically, those three sources of social, cloud, and sensor data are contributing to a significant increase in the volume and variety of data as well as the need to address it in as real-time a manner as possible.

We're also seeing a lot of variability in data formats. For example, we work with Ford Motor Company on data from cars. Ford has proposed an open standard, OpenXC, for automobile makers regarding sensors. It isn't yet shared by the rest of the industry, though, so when we work with Ford we're able to use their open data standard, but the other auto makers haven't standardized on it. Accordingly, there's a lot of variability in working with car sensor data. That's something we see in a lot of industries right now.

How many companies out there are successfully capturing big data yet? How many have actually deployed an HDFS cluster, for example, and populated it with big data?

Well, I would distinguish between big data as a whole and specific technologies. There is a great deal of interest in Hadoop and HDFS, and it's focused on three areas right now. We see a lot of Hadoop use in the federal government, within Internet companies, and in Fortune 500 enterprises. Beyond Hadoop, there are a variety of NoSQL data stores ... and some organizations have big data stored in relational databases.

What's happened is [that] organizations are able now to store these data types so inexpensively ... using a variety of storage methods such as Hadoop that the opportunity costs of throwing the data away are actually more than the cost of storing it.

How many companies are performing meaningful analytics against all that big data?

The challenge is getting actionable insights from that data because the data is in so many different formats. Typically, it's raw or unstructured data that doesn't fit very well in either a relational database or a data intelligence tool, which tends to require an extreme amount of ETL processing, so it can start to look like a Rube Goldberg project. There are all these steps to go from raw data to business insight.

That's what organizations are struggling with, and it's why the failure rate of big data projects is actually quite high. Companies spend six months or more with five or 10 people working on a big data project, then find at the end that they just aren't able to get the actionable insights they wanted from that raw, unstructured big data.

What are some of the challenges for companies of working directly with Hadoop and MapReduce? Why is it so hard to get value from data in Hadoop?

Beyond Splunk-specific offerings, there are three approaches today to extract value from data in Hadoop. All three generate value, but they also have significant disadvantages.

First is MapReduce and Apache Pig, which is the way most organizations get started. You can run searches of data, but it's very slow. Unlike a relational database, where you can have deterministic query response times, Hadoop offers no such guarantee. A job can run indefinitely -- it could take minutes, it could take hours -- and that consumes a lot of resources in the cluster.
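To see why each question becomes a batch job, here is a toy, single-process imitation of the map and reduce phases in Python. It is only an illustration of the programming model, not Hadoop's API; the log format and the function names are assumptions made for this sketch.

```python
from collections import defaultdict

# Made-up log lines in the form "<path> <status code>".
LOG = ["/home 200", "/cart 500", "/home 200", "/login 404"]


def map_phase(lines):
    """Emit a (status, 1) pair for every well-formed record."""
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            yield parts[1], 1


def reduce_phase(pairs):
    """Sum the counts emitted for each key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


counts = reduce_phase(map_phase(LOG))
print(counts)
```

Note that the map phase reads every record to answer one question. A follow-up question about paths instead of status codes would mean another full pass, which on a real cluster is another job of unpredictable duration.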

Because of that, most organizations also try one of the other two options. The first is Apache Hive or SQL on Hadoop. That works very well if you have a narrow set of questions to ask because it requires fixed schemas. If an organization wants to replace an existing ETL framework with something in Hadoop, and it's for, say, static reporting on a small number of data sources, the SQL on Hadoop or Apache Hive approach can work quite well.

Where that approach runs into challenges is with exploratory analytics across an entire Hadoop cluster, where it's impossible to define fixed schemas. A knowledge worker in that organization may want to iterate, ask questions, see the results, and ask follow-up questions. They need to be able to look at all the data that they have access to within their role-based access controls without having to pre-define schemas and be limited to the data returned from those schemas.
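The difference can be sketched in a few lines of Python. The example below keeps events in raw form and extracts fields only when a question is asked, so a follow-up question can use a field nobody declared in advance. The event format, field names, and helper functions are hypothetical, and this is not how any particular product implements it.

```python
import re

# Made-up raw events; note that the fields vary from line to line.
RAW_EVENTS = [
    "2014-03-03T10:15:02 action=purchase user=alice amount=30.00",
    "2014-03-03T10:15:09 action=view user=bob page=/cart",
    "2014-03-03T10:16:41 action=purchase user=bob amount=12.50 coupon=SPRING",
]

PAIR = re.compile(r"(\w+)=(\S+)")


def extract_fields(event):
    """Pull key=value pairs out of a raw event at query time."""
    return dict(PAIR.findall(event))


def search(events, **conditions):
    """Yield field dicts for events whose fields match every condition."""
    for event in events:
        fields = extract_fields(event)
        if all(fields.get(k) == v for k, v in conditions.items()):
            yield fields


# First question: how much revenue came from purchases?
purchases = list(search(RAW_EVENTS, action="purchase"))
total = sum(float(p["amount"]) for p in purchases)

# Follow-up question on a field that no schema declared up front.
with_coupon = [p["user"] for p in purchases if "coupon" in p]
print(total, with_coupon)
```

A fixed schema with columns for action, user, and amount would have answered the first question but silently dropped the coupon field needed for the second.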

Finally, the third approach is to extract data out of Hadoop and into an in-memory store. This could be Tableau Software, or SAP HANA, or a variety of in-memory data stores. That approach works really well if Hadoop is basically doing batch ETL -- where you're taking raw data, you're creating a set of results that can be interpreted in a row-and-column format in a relational database, and you're able to export it out of Hadoop.

Customers come to Splunk when they don't want to move the data out of Hadoop. They essentially want Hadoop to be a data lake where they keep the data at rest. Organizations may have security concerns about moving data around too much or they don't want to have to set up data marts. In those cases, organizations are using Hadoop as the data lake, where they persist that data for many months or years, and use software such as Hunk to ask and answer questions about that data.

Working with big data can also introduce skills challenges, correct? There just aren't enough people around who understand the technologies.

Absolutely. In fact, that's the single biggest limitation today for Hadoop adoption. Hadoop is maturing as a technology for storing data in a variety of formats and from many sources, but what's limiting organizations today is the need for rare, specialized skill sets to do that. ...

There's also a need to mask Hadoop's complexity so that non-specialists can ask questions of the data. At the same time, data scientists who are fluent in the dozen-plus projects and sub-projects in the Hadoop system can focus their skills on advanced statistics and advanced algorithms that really benefit from their knowledge.

Unfortunately, today many data scientists end up wasting their time as "data butlers," where they have colleagues in the line of business or corporate departments who have analytics tasks. Those colleagues don't have the advanced skill sets needed to ask and get answers to questions in Hadoop. Accordingly, data scientists are basically setting up access for their non-specialist colleagues rather than spending their time doing what they are really there for, which is advanced statistics and algorithms that really do require a custom, personalized approach.

You mentioned role-based access controls. Why are they so important?

That's one of the challenges to address with big data. Along with rare, specialized skill sets, role-based access controls are needed to protect non-public information that may be stored in that Hadoop data lake.

That's a weakness of Hadoop, which was designed for shared-use clusters. Organizations such as Yahoo that were storing clickstream data in Hadoop had relatively few security concerns. The data didn't contain non-public information -- it was simply rankings of public Web sites. Accordingly, Hadoop was not designed with the role-based access controls that anyone familiar with relational databases would expect.

For working with big data the way most companies need to, though, it's important to have a technology that can mask some data from some users, offering role-based access to select data in Hadoop.

In fact, that issue is part of what's held the average size of a cluster in Hadoop to 40 nodes rather than, say, 400 or 4,000 nodes. Organizations have to restrict the number of users who have access because in traditional HDFS and MapReduce, once someone has access to the cluster, they can do anything they want within the cluster. They can see all the data, they can delete the data, there is limited auditability, and at best you're able to see what someone has done after they've done it. If you're able to define access by role, you can prevent people from either maliciously or inadvertently accessing data that is beyond their role.
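As a rough illustration of masking data by role, the Python sketch below returns a per-role view of a record, hiding any field the role is not entitled to see. The role names, field lists, and sample record are hypothetical, and real deployments would enforce this in the storage or query layer rather than in application code.

```python
# Hypothetical mapping from role to the fields that role may read.
ROLE_FIELDS = {
    "analyst": {"timestamp", "page", "status"},
    "security": {"timestamp", "page", "status", "user", "ip"},
}

MASK = "***"


def view_for_role(record, role):
    """Return a copy of record with fields outside the role's list masked."""
    allowed = ROLE_FIELDS.get(role, set())  # unknown roles see nothing
    return {k: (v if k in allowed else MASK) for k, v in record.items()}


record = {
    "timestamp": "2014-03-03T10:15:02",
    "page": "/cart",
    "status": "200",
    "user": "alice",
    "ip": "10.0.0.7",
}

print(view_for_role(record, "analyst"))
print(view_for_role(record, "security"))
```

The analyst sees the page and status but not who made the request, while an unknown role gets a fully masked record, which is the safer default.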

Splunk's analytics solution for Hadoop is Hunk, which you introduced last year. Can you talk about what Hunk brings to the big data equation?

We're very excited about Hunk as part of the Splunk portfolio for big data. Splunk has 7,000 customers today who find data in real time with historical context. Many organizations, though, want to persist big data for many months or years. Although Hadoop is a convenient and inexpensive data lake for long-term historical storage, organizations also want to be able to ask and answer questions of that data in Hadoop.

The benefit of Hunk is that an organization can explore, analyze, and visualize data at rest in Hadoop without a need for specialized skill sets. Many organizations that have tested Hunk have found that they've been able to go from the free trial to searching data in Hadoop within an hour or less. That's possible because of the Splunk architecture, which has schema on the fly. There's no need to apply fixed schemas, and there's no need to migrate data out of Hadoop into a fixed store, so the time to value is significantly faster. You can also expose the data through ODBC drivers to existing business intelligence dashboards.
