TDWI Blog

Philip Russom, Ph.D., is senior director of TDWI Research for data management and is a well-known figure in data warehousing, integration, and quality, having published over 550 research reports, magazine articles, opinion columns, and speeches over a 20-year period. Before joining TDWI in 2005, Russom was an industry analyst covering data management at Forrester Research and Giga Information Group. He also ran his own business as an independent industry analyst and consultant, was a contributing editor with leading IT magazines, and was a product manager at database vendors. His Ph.D. is from Yale. You can reach him by email ([email protected]), on Twitter (twitter.com/prussom), and on LinkedIn (linkedin.com/in/philiprussom).


The Three Vs of Big Data Analytics: VARIETY

Blog by Philip Russom
Research Director for Data Management, TDWI

This blog is the second in a series of three about the three Vs of big data analytics, namely data volume, variety, and velocity. You can read the first blog online here.

Data Type Variety as a defining attribute of Big Data

One of the things that makes big data big is that it’s coming from a greater variety of sources than ever before. Many of the newer ones are Web sources (logs, click streams, and social media). Sure, user organizations have been collecting Web data for years. But, for most organizations, it’s been a kind of hoarding. We’ve seen similar untapped big data collected and hoarded, such as RFID data from supply chain apps, text data from call center apps, semi-structured data from various insurance processes, and geospatial data in logistics. What’s changed is that far more users are now analyzing big data, instead of merely hoarding it. And the few organizations that have been analyzing it now do so at a more complex and sophisticated level. A related point is that big data isn’t new, but the effective leverage of it for analytics is. (For more on that point, see my blog: The Intersection of Big Data and Advanced Analytics.)

But my real point for this blog is that the recent tapping of these sources means that so-called structured data (which previously held unchallenged hegemony in analytics) is now joined (both figuratively and literally) by unstructured data (text and human language) and semi-structured data (XML, RSS feeds). There’s also data that’s hard to categorize, as it comes from audio, video, and other devices. Plus, multidimensional data can be drawn from a data warehouse to add historic context to big data. I hope you realize that’s a far more eclectic mix of data types than analytics has ever seen (or any discipline within BI, for that matter). So, with big data, variety is just as big as volume. Plus, variety and volume tend to fuel each other.
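To make that mix concrete, here’s a minimal sketch of the three broad categories sitting side by side. The file contents and field names below are hypothetical, invented purely for illustration:

```python
# A minimal, hypothetical sketch of big data's variety: structured,
# semi-structured, and unstructured inputs handled side by side.
# All contents and field names are invented for illustration.
import csv
import xml.etree.ElementTree as ET

# Structured: a delimited sales record -- the classic warehouse input.
structured_row = next(csv.DictReader(
    ["customer_id,region,amount", "1001,EMEA,250.00"]))

# Semi-structured: an RSS-style XML item -- tags give partial structure.
rss_item = ET.fromstring(
    "<item><title>Outage update</title><pubDate>2011-06-14</pubDate></item>")

# Unstructured: free text from a call-center note -- no schema at all.
call_note = "Customer phoned twice about a late delivery; sounded frustrated."

# Analytics over big data has to correlate all three, e.g. keyed by customer:
record = {
    "customer_id": structured_row["customer_id"],
    "sales_amount": float(structured_row["amount"]),
    "latest_feed_title": rss_item.findtext("title"),
    "note_mentions_delivery": "delivery" in call_note.lower(),
}
print(record)
```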

To further support the point that big data is about variety, let’s look at Hadoop. I managed to find a couple of users who’ve used Hadoop as an analytic database. Both said the same thing: Hadoop’s scalability for big data volumes is impressive. But the real reason they’re working with Hadoop is its ability to manage a very broad range of data types in its file system, plus process analytic queries via MapReduce across numerous eccentric data types.

Stay tuned for the third and final blog in this series, which will be titled: The Three Vs of Big Data Analytics: VELOCITY.

=============================================
NOTE -- Don’t miss TDWI’s Big Data Analytics Survey. Please share your opinions and experiences by taking this online survey.

Posted by Philip Russom, Ph.D. on June 14, 2011


The Three Vs of Big Data Analytics: VOLUME

Blog by Philip Russom
Research Director for Data Management, TDWI

I was recently on a group call along with several other analysts where IBMers spelled out their definition of big data. They structured the definition by explaining big data’s primary attributes, namely data volume, data type variety, and the velocity of streams and other real-time data. I don’t necessarily agree with everything the IBMers said, but I must say that the three Vs of big data – volume, variety, and velocity – constitute a more comprehensive definition than I’ve heard elsewhere. In particular, the three Vs bust the myth that big data is only about data volume. Plus, the term “three Vs” is a catchy mnemonic. So I freely admit that I am shamelessly stealing the concept of the three Vs as a structure for my own definition of big data.

Note that IBMers didn’t consistently link big data with advanced analytics – but I will. This blog focuses on data volume, whereas other upcoming blogs will hit data type variety and data stream velocity.

Data Volume as a defining attribute of Big Data

It’s pretty obvious that data volume is the primary attribute of big data. With that in mind, some people have asked me for a definitive number quantifying the volume, a common question being: “Exactly how many terabytes constitute big data?” In some user interviews I’ve conducted lately, users have said that big data used to start at 3 terabytes, but now the bottom threshold is more like 10 terabytes. In a 2010 TDWI Technology Survey, a third of users surveyed said they will have 10 terabytes within three years. So 3 to 10 terabytes seems a reasonable baseline – for now.

But there’s a catch. Note that my research isn’t about just any big data; it’s about big data collected specifically for analytics. So the numbers quoted above are only for analytic datasets -- not all BI data stores and certainly not every bit and byte in an enterprise.

Here are some comments from the field that add more attributes to big data quantification. I asked one user how many terabytes he’s managing for analytics, and he said: “I don’t know, because I don’t have to worry about storage. IT provides it generously, and I tap it like crazy.” Another user said: “We don’t count terabytes. We count records. My analytic database for quality assurance alone has 3 billion records. There’s another 3 billion in other analytic databases.”

From this we see that big data is a moving target that’s growing, there are different units for quantifying it, and it varies with scope (e.g., analytics vs BI vs whole enterprise). In future blogs in this series, we’ll see that data variety and velocity are just as important as volume when it comes to defining big data. Please stay tuned for those blogs.

So, what do you think, folks? Let me know. Thanks!

======================================================
Don’t miss TDWI’s Big Data Analytics Survey. Please share your opinions and experiences by taking the survey online: http://bit.ly/jxWh9N

Posted by Philip Russom, Ph.D. on June 9, 2011


Big Data Analytics: The View from Kognitio

Blog by Philip Russom
Research Director for Data Management, TDWI

I recently chatted with Paul Groom, the VP of Business Intelligence at Kognitio. Among other things, Paul had some great tips for moving beyond common barriers to analytics with big data. I’d like to share some of those tips with you.

Philip Russom: I’ve encountered several user companies that are hoarding big data – especially log data from Web sites – but they don’t know how to get started with analyzing it. Are you seeing this, too?

Paul Groom: Yes. I call it “data car parking.” Over time, the data car park gets so big that it’s a psychological barrier to taking any kind of action. For some reason, many data warehouse professionals think they have to process the entire data car park all at once – with the usual ETL, data quality, and data modeling techniques – before analytics can commence. That particular mindset is a show-stopper for big data analytics.

Philip Russom: In data warehousing, we’re taught that transforming, cleansing, and modeling data are requirements, because reports require squeaky clean, auditable data. But analytics and big data have different requirements. Right?

Paul Groom: Right. OLAP aside, most analytic methods require large samples of highly detailed data drawn straight from operational sources. That’s because a business analyst is trying to discover unknown business facts in previously untapped data, whereas data warehousing reports on known business facts drawn from well-understood data. Careful data preparation is desirable in data warehousing for reports, but it’s actually a problem for analytics, because data prep strips out the details and granularity that analytics depends on. Oddly enough, when users figure out that they should forego most of the data prep they’re used to in data warehousing, it removes a barrier, and they can proceed to analytics with the big data they’ve been caching away.

Philip Russom: I’ve been talking up the perils of data prep for analytics for about two years now. Even when users get the point, they’re still skeptical about the next step, namely complex analytic queries against non-optimized big data.

Paul Groom: We get a lot of that, too. The skepticism is natural, because data warehouse pros have been using hand-me-down database management systems designed for transaction processing, and these don’t perform well with complex analytic queries. But the newest generation of analytic databases does. Assuming you have one of these, such as Kognitio WX2, then the rules of the analytic game just changed.

Our mantra is: “Trust the database.” A modern analytic database can quickly execute any query you come up with, without need for time-consuming data prep or repetitive tweaking of queries and data models. Once users build confidence in new database performance, it removes another barrier to analytics.
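To put “trust the database” in the simplest possible terms, here’s a sketch of the workflow Paul describes: load raw detail rows as-is, then issue an ad-hoc analytic query directly against them, with no transformation or modeling step in between. SQLite stands in for a real analytic database here, and the table and column names are invented for illustration:

```python
# Hypothetical sketch: load raw detail rows as-is, then query them directly.
# SQLite stands in for an analytic database; the table and columns are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (user_id INTEGER, page TEXT, ms_on_page INTEGER)")

# Raw, untransformed detail straight from an operational source.
raw_rows = [
    (1, "/pricing", 4200),
    (1, "/signup", 900),
    (2, "/pricing", 150),
    (2, "/pricing", 8000),
]
conn.executemany("INSERT INTO clicks VALUES (?, ?, ?)", raw_rows)

# Ad-hoc analytic query against the detail -- no ETL, cleansing, or
# dimensional modeling beforehand; the database does the heavy lifting.
query = """
    SELECT page, COUNT(DISTINCT user_id) AS visitors, AVG(ms_on_page) AS avg_ms
    FROM clicks
    GROUP BY page
    ORDER BY avg_ms DESC
"""
for page, visitors, avg_ms in conn.execute(query):
    print(page, visitors, round(avg_ms, 1))
```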

So, what do you think, folks? Let me know. Thanks!

Posted by Philip Russom, Ph.D. on May 31, 2011


Big Data Analytics: The View from Tableau Software

Blog by Philip Russom
Research Director for Data Management, TDWI

I just got off the phone with Ellie Fields, the director of product marketing at Tableau Software. Ellie has a lot to say about intersections among big data, analytics, and data visualization. So allow me to recount the high spots of the conversation.

Philip Russom: Tableau is often pigeon-holed as a data visualization vendor. But the Tableau users I’ve met are using the tool for analytics. How does Tableau position itself?

Ellie Fields: Our customers use Tableau in different ways. For example, many use us as their primary, enterprise BI platform. Others use us for specific BI applications within a department. Still other customers use Tableau for fast analytics, as a complement to a legacy BI platform. Given the breadth of use, we see ourselves as a multi-purpose BI platform.

Philip Russom: I’ve seen demonstrations of the Tableau tool, so I know that ease-of-use is high. But is it high enough to enable self-service BI?

Ellie Fields: The Tableau tool was designed with self-service in mind for a broad range of BI users. For example, with a few mouse clicks, a user can access a database, identify data structures of interest, and bring data into server memory for reporting or analysis. The user needs to know the basics of enterprise data, but doesn’t need to wait for assistance from IT. With a few more clicks, you can publish your work for colleagues to use. Going back to your question about positioning, we describe this quick and easy method as “rapid fire business intelligence.”

Philip Russom: What’s the relationship between data visualization and big data?

Ellie Fields: As you know, Tableau is strongly visual. In fact, the visual images representing data are an extension of the user interface, in that you grab your mouse and – with simple drag-and-drop methods – you interact directly with the visualization and other visual controls to form queries, reports, and analyses. Analysis is iterative, and iterations need to flow fast. The drag-and-drop environment enables an analyst to work quickly, without losing the train of thought, and even to collaborate with others on live data. So, we’re fast with results – even against big data.

When working with big data, all of our visualizations scale up and down, in that they can represent ten data points from a spreadsheet or ten million rows of big data. And when working with big data, visualization is even more important. It’s how humans explore and consume information to arrive at a conclusion. Analytics without good visualization is hamstrung from the beginning.

Philip Russom: What types of analytic applications have you seen in your customer base recently?

Ellie Fields: Many of our customers practice what we call “exploratory analytics.” This is especially important with big data, where the point is to explore and discover things you didn’t already know. For example, we have a lot of Web companies as customers, and they depend on advertising for revenue. As they explore big data, they’re answering analytic questions like: “How do small ads compare to big ones?” or “Which colors in an ad sell the most?” Yahoo! is a customer, and they analyze online ads by many dimensions, including size, color, location, frequency, Web site locations, revenue, and so on.

High tech manufacturing stands out as a growing area, especially analytics for monitoring product and supply quality. Healthcare, finance, and education companies have also adopted Tableau. One healthcare client analyzes its supply chain to be sure all locations are adequately equipped. Another hospital uses analytics to optimize nurse staffing. And a university client analyzes trends in SAT scores to inform decisions about recruitment, scholarships, and educational curricula.
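As a toy illustration of the kind of exploratory question Ellie mentions (e.g., which ad colors sell the most), the core move is to slice detailed impression records by a dimension and compare an outcome. The data and field names below are invented:

```python
# Toy sketch of exploratory ad analytics: slice impressions by a dimension
# (color) and compare revenue. Data and field names are invented.
from collections import defaultdict

impressions = [
    {"ad_size": "300x250", "color": "blue", "revenue": 0.12},
    {"ad_size": "728x90",  "color": "red",  "revenue": 0.31},
    {"ad_size": "300x250", "color": "red",  "revenue": 0.18},
    {"ad_size": "728x90",  "color": "blue", "revenue": 0.05},
]

revenue_by_color = defaultdict(float)
for imp in impressions:
    revenue_by_color[imp["color"]] += imp["revenue"]

# "Which colors in an ad sell the most?" -- answered by a simple ranking.
for color, revenue in sorted(revenue_by_color.items(), key=lambda kv: -kv[1]):
    print(color, round(revenue, 2))
```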

So, what do you think, folks? Let me know. Thanks!

Note: The next TDWI Solution Summit, September 25-27 in San Diego, will feature case studies focused on the theme of “Deep Analytics for Big Data.”

Posted by Philip Russom, Ph.D. on May 19, 2011


Big Data Analytics: The View from Cloudera

Blog by Philip Russom
Research Director for Data Management, TDWI

I recently had a great phone conversation with Mike Olson, the CEO of Cloudera. Mike has a gift for explaining new and complex technologies and their emerging best practices. Let me share a few of Mike’s insights.

Philip Russom: My understanding is that Cloudera makes a business by distributing open source software, namely MapReduce-based Apache Hadoop. Is that right?

Mike Olson: Well, that’s part of it. Cloudera does a lot more than simply distribute open source Hadoop. We make Hadoop viable for serious enterprise users by also providing technical support, upgrades, administrative tools for Hadoop clusters, professional services, training, and Hadoop certification. Furthermore, our distribution package of Hadoop includes more than Hadoop. So Cloudera collects and develops additional components to strengthen and extend Hadoop.

Philip Russom: So, what is Hadoop?

Mike Olson: Essentially there are two pieces in Hadoop. First, there’s the Hadoop Distributed File System (or HDFS), which can manage big data on clusters of many nodes. Our customers typically start with twenty nodes or so, then quickly grow to fifty or more. Some of our customers have thousands of nodes, managing petabytes of data. A many-node cluster enables big data management, plus other nice benefits like scalability, performance, and high availability. But the ramification is that data is heavily distributed.

That’s where the second piece comes in, namely MapReduce. Thanks to this capability of Hadoop, you can define a data operation--like a query or analysis--and the platform ‘maps’ the operation across all relevant nodes, for distributed processing and data collection. The platform then consolidates and reduces the responses that come back. Due to the distributed processing of MapReduce, analytics against very big data is possible—and with good performance.
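To make the map and reduce steps concrete, here’s a minimal sketch in the spirit of Hadoop Streaming, which lets you write each step as a small script that reads lines from stdin and emits tab-delimited key/value pairs. The web-log layout and field positions below are hypothetical, invented for illustration:

```python
# Minimal sketch of the map and reduce phases, in the spirit of Hadoop
# Streaming. The web-log layout (space-delimited, with the requested URL
# in the seventh field) is hypothetical.
import sys


def mapper(lines):
    """Map step: emit (url, 1) for every hit found in the raw log lines."""
    for line in lines:
        fields = line.split()
        if len(fields) > 6:
            yield fields[6], 1  # hypothetical position of the requested URL


def reducer(pairs):
    """Reduce step: total the counts per URL. Hadoop sorts mapper output
    by key between the phases, so equal keys arrive together."""
    current_url, total = None, 0
    for url, count in pairs:
        if url == current_url:
            total += count
        else:
            if current_url is not None:
                yield current_url, total
            current_url, total = url, count
    if current_url is not None:
        yield current_url, total


if __name__ == "__main__":
    # Stand-alone demo: map, sort by key, then reduce -- the same phases
    # Hadoop runs for you, but distributed across the nodes of a cluster.
    mapped = sorted(mapper(sys.stdin))
    for url, hits in reducer(mapped):
        print(url, hits, sep="\t")
```

In a real deployment, Hadoop would run the map step on the nodes that hold the data, shuffle and sort the intermediate pairs, and then run the reduce step, which is the consolidation Mike describes.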

Philip Russom: What kind of analytics?

Mike Olson: Hadoop excels in discovering patterns in big data, patterns that you didn’t know were there, in data that you probably don’t know very well. That makes Hadoop the opposite of your average data warehouse query against well-understood relational data. Since Hadoop and a traditional data warehouse are complementary, putting them together gives you a very broad range of business intelligence capabilities.

Philip Russom: What data types and data models are your customers managing?

Mike Olson: In Hadoop, you can mix and match data types to your heart’s content. Hadoop will store anything without requiring a data type declaration. Also, Hadoop is amazingly tolerant of messy data. For example, our customers manage any kind of file you can think of in the HDFS, and these can have just about any kind of data model. This also includes human language text and complex data types. So, big data’s not just big. It’s also highly diverse and complicated. And Hadoop excels in handling data of such extreme size, diversity, and complexity for the purposes of analytics.

So, what do you think, folks? Let me know. Thanks!

Posted by Philip Russom, Ph.D. on May 12, 2011


Big Data Analytics: For Many, It’s a Departmental Affair

I’ve recently been interviewing users and business sponsors, asking them about their new practices with advanced analytics, plus the special role of big data. When I ask people to talk about the critical factors that make or break their success, they usually come around to a common issue that needs sorting out: the fact that most analytic applications are departmentally focused (often departmentally owned and funded) and satisfy department requirements, not enterprise ones.

Give me a minute to explain what I’m hearing from users, as well as why big data analytics is increasingly a departmental affair:

Analytic applications are departmental, by nature. Just about any analytic application you think of is focused on tasks, data domains, and business opportunities that are associated with specific departments. For example, customer base segmentation should be owned and executed by marketing and sales departments. The actuarial department does risk analysis. The procurement department does supply and supplier analysis.

Most data warehouse (DW) and business intelligence (BI) infrastructure is not designed for advanced analytics. In most organizations, it is, instead, designed and optimized for reporting, performance management, and online analytic processing (OLAP). This enterprise asset is invaluable for “big picture” reports and analyses that span enterprise-wide processes (especially financial ones). And it’s capable of satisfying most departmental requirements for reporting and OLAP. But, in many organizations, the BI/DW infrastructure cannot (and, due to its owners, will not) satisfy departmental requirements for advanced analytics and big data.

Many departments are deploying their own platforms for big data and analytics. They do this when the department has a strong business need for analytics with big data, plus the budget and management sponsorship to back it up. Just think of the many new vendor tools and platforms that have arisen in recent years. Data warehouse appliances, columnar databases, MapReduce, visual discovery tools, and analytic tools for business users all supply analytic functionality that user organizations are demanding at the department level. And all are built from the bottom up to manage and operate on big data. Obviously, big data analytics can be implemented on older, more traditional databases and tools, as well.

Put it all together, and this user and vendor activity reveals that big data analytics is increasingly a departmental affair, implemented on departmentally owned platforms.

So, what do you think? Does the trend toward departmental big data analytics make sense to you? Let me know. Thanks!

Posted by Philip Russom, Ph.D. on May 10, 2011