Blog by Philip Russom
Research Director for Data Management, TDWI
I just got off the phone with Ellie Fields, the director of product marketing at Tableau Software. Ellie has a lot to say to about intersections among big data, analytics, and data visualization. So allow me to recount the high spots of the conversation.
Philip Russom: Tableau is often pigeon-holed as a data visualization vendor. But the Tableau users I’ve met are using the tool for analytics. How does Tableau position itself?
Ellie Fields: Our customers use Tableau in different ways. For example, many use us as their primary, enterprise BI platform. Others use us for specific BI applications within a department. Still other customers use Tableau for fast analytics, as a complement to a legacy BI platform. Given the breadth of use, we see ourselves as a multi-purpose BI platform.
Philip Russom: I’ve seen demonstrations of the Tableau tool, so I know that ease-of-use is high. But is it high enough to enable self-service BI?
Ellie Fields: The Tableau tool was designed with self-service in mind for a broad range of BI users. For example, with a few mouse clicks, a user can access a database, identify data structures of interest, and bring data into server memory for reporting or analysis. The user needs to know the basics of enterprise data, but doesn’t need to wait for assistance from IT. With a few more clicks, you can publish your work for colleagues to use. Going back to your question about positioning, we describe this quick and easy method as “rapid fire business intelligence.”
Philip Russom: What’s the relationship between data visualization and big data?
Ellie Fields: As you know, Tableau is strongly visual. In fact, the visual images representing data are an extension of the user interface, in that you grab your mouse and – with simple drag-and-drop methods – you interact directly with the visualization and other visual controls to form queries, reports, and analyses. Analysis is iterative, and iterations need to flow fast. The drag-and-drop environment enables an analyst to work quickly, without losing the train of thought, and even to collaborate with others on live data. So, we’re fast with results – even against big data.
When working with big data, all of our visualizations scale up and down, in that they can represent ten data points from a spreadsheet or ten million rows of big data. And when working with big data, visualization is even more important. It’s how humans explore and consume information to arrive at a conclusion. Analytics without good visualization is hamstrung from the beginning.
Philip Russom: What types of analytic applications have you seen in your customer base recently?
Ellie Fields: Many of our customers practice what we call “exploratory analytics.” This is especially important with big data, where the point is to explore and discover things you didn’t already know. For example, we have a lot of Web companies as customers, and they depend on advertizing for revenue. As they explore big data, they’re answering analytic questions like: “How do small ads compare to big ones? Or which colors in an ad sell the most?” Yahoo! is a customer, and they analyze online ads by many dimensions, including size, color, location, frequency, Web site locations, revenue, and so on.
High tech manufacturing stands out as a growing area, especially analytics for monitoring product and supply quality. Healthcare, finance, and education companies have also adopted Tableau. One healthcare client analyzes its supply chain to be sure all locations are equipped adequately. Another hospital uses analytics to optimize nurse staffing. And a university client analyzes trends in SAT scores to enlighten decisions about recruitment, scholarships, and educational curricula.
So, what do you think, folks? Let me know. Thanks!
Note: The next TDWI Solution Summit, September 25-27 in San Diego, will feature case studies focused on the theme of “Deep Analytics for Big Data.”
Posted by Philip Russom, Ph.D. on May 19, 20110 comments
When you’re 100 years old, as IBM is this year, it would be easy to think that you’ve seen it all. What could possibly be new to Big Blue about “big data”? In the view of Robert LeBlanc, SVP of Middleware Software for the IBM Software Group, quite a bit.
The new problem set, defined by business opportunities opening up due to the availability of new sources of information, cannot be solved with traditional data systems alone. Kicking off the IBM Big Data Symposium for industry analysts at the Yorktown Research Center on May 11, LeBlanc itemized a number of challenges, including multi-channel customer sentiment and experience analysis, detection of life-threatening conditions at hospitals in time to intervene, Medicare fraud interdiction before payment, and weather pattern predictions to optimize wind turbine locations. (Note: The next TDWI Solution Summit, September 25-27 in San Diego, will feature case studies focused on the theme of “Deep Analytics for Big Data.”)
“Big data” is both an evolutionary and revolutionary phenomenon. Given that organizations have been working with large data warehouses and other types of files for some time, it should come as no surprise that the sheer quantity of data would continue to grow. Data is a renewable resource; the more applications and systems that use it, the more data that they tend to generate. Data warehouses will continue to be important, but even as the terabytes of structured data pile up, organizations are hunting down unstructured sources to tap their value and discover new competitive advantages.
IBM’s view of what makes big data revolutionary comes down to the convergence of the three “V’s”: volume, velocity, and variety. Volume is the easiest to understand, although IBM speakers at the Symposium described scenarios where so much data was streaming through in real time that storing it was impossible. Huge data volumes plus the velocity with which it is flowing in are opening up opportunities for technology alternatives, including Hadoop, MapReduce, and event stream processing. Variety, the third “V,” adds in the unstructured and complex data sources growing up on the Web, particularly in social media. Some organizations, of course, do store all this data; Eric Baldeschwieler, VP of Hadoop Development at Yahoo!, described their use of the Hadoop Distributed File System (HDFS) to store petabytes of data on nodes through its vast array of clusters. “Hadoop is behind everything we do,” he said.
It was not surprising news, but Baldeschwieler and IBM experts gave a full-throated defense of Apache Hadoop and the importance of having open source software at the foundation of big data programs. IBM did not mention EMC explicitly, but it was clear that the company was responding to EMC’s May 9 announcement of the new Greenplum HD Data Computing Appliance, which offers its own distribution of Apache Hadoop. IBM execs warned of the dangers of “forking,” which is what happened when vendors created their own versions of the UNIX operating system and users had to deal with competing standards. Baldeschwieler and IBM execs did acknowledge, however, that Apache Hadoop is far from a finished product, and in any case is not the solution to all problems.
I came away from the Symposium excited by the future of big data analytics but also aware that there’s a long way to go. “Big data” is not about a single technology, such as Hadoop or MapReduce (for more on Hadoop, see my colleague, Philip Russom’s interview with the CEO of Cloudera here). These technologies are more of a complement to data warehousing rather than replacement for it. Yahoo!’s Baldeschwieler made the point that Yahoo also has data warehouses. As each industry’s requirements become clearer, vendors such as IBM will assemble packages that will bring together the strengths in their existing solutions with new technologies. Then, organizations will have a better understanding of how to compare the vendors’ offerings. We’re not quite there yet.
Posted by David Stodder on May 17, 20110 comments
Blog by Philip Russom, Research Director for Data Management, TDWI
I recently had a great phone conversation with Mike Olson, the CEO of Cloudera. Mike has a gift for explaining new and complex technologies and their emerging best practices. Let me share a few of Mike’s insights.
Philip Russom: My understanding is that Cloudera makes a business by distributing open source software, namely MapReduce-based Apache Hadoop. Is that right?
Mike Olson: Well, that’s part of it. Cloudera does a lot more than simply distribute open source Hadoop. We make Hadoop viable for serious enterprise users by also providing technical support, upgrades, administrative tools for Hadoop clusters, professional services, training, and Hadoop certification. Furthermore, our distribution package of Hadoop includes more than Hadoop. So Cloudera collects and develops additional components to strengthen and extend Hadoop.
Philip Russom: So, what is Hadoop?
Mike Olson: Essentially there are two pieces in Hadoop. First, there’s the Hadoop Distributed File System (or HDFS), which can manage big data on clusters of many nodes. Our customers typically start with twenty nodes or so, then quickly grow to fifty or more. Some of our customers have thousands of nodes, managing petabytes of data. A many-node cluster enables big data management, plus other nice benefits like scalability, performance, and high availability. But the ramification is that data is heavily distributed.
That’s where the second piece comes in, namely MapReduce. Thanks to this capability of Hadoop, you can define a data operation--like a query or analysis--and the platform ‘maps’ the operation across all relevant nodes, for distributed processing and data collection. The platform then consolidates and reduces the responses that come back. Due to the distributed processing of MapReduce, analytics against very big data is possible—and with good performance.
Philip Russom: What kind of analytics?
Mike Olson: Hadoop excels in discovering patterns in big data, patterns that you didn’t know were there, in data that you probably don’t know very well. That makes Hadoop the opposite of your average data warehouse query against well-understood relational data. Since Hadoop and a traditional data warehouse are complementary, putting them together gives you a very broad range of business intelligence capabilities.
Philip Russom: What data types and data models are your customers managing?
Mike Olson: In Hadoop, you can mix and match data types to your heart’s content. Hadoop will store anything without requiring a data type declaration. Also, Hadoop is amazingly tolerant of messy data. For example, our customers manage any kind of file you can think of in the HDFS, and these can have just about any kind of data model. This also includes human language text and complex data types. So, big data’s not just big. It’s also highly diverse and complicated. And Hadoop excels in handling data of such extreme size, diversity, and complexity for the purposes of analytics.
So, what do you think, folks? Let me know. Thanks!
Posted by Philip Russom, Ph.D. on May 12, 20110 comments
I’ve recently been interviewing users and business sponsors, asking them about their new practices with advanced analytics, plus the special role of big data. When I ask people to talk about critical factors that make or break or success, they usually come around to a common issue that needs sorting out. It’s the fact that most analytic applications are departmentally focused (often departmentally owned and funded) and they satisfy department requirements, not enterprise ones.
Give me a minute to explain what I’m hearing from users, as well as why big data analytics is progressively a departmental affair:
Analytic applications are departmental, by nature. Just about any analytic application you think of is focused on tasks, data domains, and business opportunities that are associated with specific departments. For example, customer base segmentation should be owned and executed by marketing and sales departments. The actuarial department does risk analysis. The procurement department does supply and supplier analysis.
Most data warehouse (DW) and business intelligence (BI) infrastructure is not designed for advanced analytics. In most organizations, it is, instead, designed and optimized for reporting, performance management, and online analytic processing (OLAP). This enterprise asset is invaluable for “big picture” reports and analyses that span enterprise-wide processes (especially financial ones). And it’s capable of satisfying most departmental requirements for reporting and OLAP. But, in many organizations, the BI/DW infrastructure cannot (and, due to its owners, will not) satisfy departmental requirements for advanced analytics and big data.
Many departments are deploying their own platforms for big data and analytics. They do this when the department has a strong business need for analytics with big data, plus the budget and management sponsorship to back it up. Just think of the many new vendor tools and platforms that have arisen in recent years. Data warehouse appliances, columnar databases, MapReduce, visual discovery tools, and analytic tools for business users all supply analytic functionality that user organizations are demanding at the department level. And all are built from the bottom up to management and operate on big data. Obviously, big data analytics can be implemented on older, more traditional databases and tools, as well.
Put it all together, and this user and vendor activity reveals that big data analytics is progressively a departmental affair, implemented on departmentally owned platforms.
So, what do you think? Does the trend toward departmental big data analytics make sense to you? Let me know. Thanks!
Posted by Philip Russom, Ph.D. on May 10, 20110 comments
I recently started work on a new TDWI Best Practices Report with the working title: Deep Analytics with Big Data. The report is a tad schizophrenic, in that it’s really about two things – big data and analytics – plus how the two have teamed up to create one of the most profound trends in business intelligence (BI) today. Let me share some of the thinking behind the schizophrenia. Please reply to this blog to tell me whether this makes sense or not.
According to a recent TDWI survey, 38% of organizations surveyed are practicing advanced analytics today. But 85% say they’ll do it within 3 years!
Why the rush to advanced analytics? First, change is rampant in business; we’ve been through multiple “economies” in recent years. And analytics helps us discover what changed plus how we should react. Second, there are still many business opportunities to leverage -- even in the recession -- and more will come as we finally crawl out of the recession. To that end, advanced analytics is the best way to discover new customer segments, identify the best suppliers, associate products of affinity, understand sales seasonality, and so on. For these reasons, TDWI has seen an explosion of user organizations implementing analytics in recent years.
But note that user organizations are implementing specific forms of analytics, particularly what is sometimes call advanced analytics. This is a collection of related techniques and tools, usually including predictive analytics, data mining, statistical analysis, and complex SQL. We might also extend the list to cover data visualization, artificial intelligence, natural language processing, and database methods that support analytics.
All these techniques have been around for years, many of them appearing in the 1990s. The thing that’s different now is that far more user organizations are actually using them. That’s because most of these techniques adapt well to very large, multi-terabyte datasets, with minimal data preparation. And that brings us to big data.
Big data can be defined simply as multi-terabyte datasets. And this make sense, given that corporations, government agencies, and other user organizations are generating and retaining more data than ever before. Soon enough, big data will involve petabytes, not terabytes Yet, big data also involves big complexity, namely many diverse data sources (both internal and external), data types (structured, unstructured, semi-structured), and indexing schemes (relational, multidimensional, no-SQL).
Occasionally, I hear a user complain about the problems of storing and managing big data. Much more often, however, I hear people talk about what an extraordinary opportunity big data is. That’s because, for the kinds of discovery and prediction that most advanced analytic techniques enable, big data is truly a treasure trove of information that merits leverage for business advantage. And that brings us to the intersection mentioned in the title of this blog.
Advanced Analytics and Big Data: Why put them together?
Here are a few reasons:
Big data yields gigantic statistical samples. Most tools designed for data mining or statistical analysis tend to be optimized for large datasets. In fact, the general rule is that the larger the data sample, the more accurate are the statistics and other products of the analysis. Instead of mining and statistical tools, I regularly find users generating or hand-coding complex SQL, which parses big data in search of just the right customer segment, churn profile, or excessive operational cost. The newest generation of data visualization tools and in-database analytic functions likewise operate on big data.
Analytic tools and databases can now handle big data. And they can execute big queries and parses in record time. Recent generations of vendor tools and platforms have raised us onto a new plateau of performance that’s very compelling for applications involving big data.
There’s a lot to learn from messy data, as long as it’s big. Most modern tools and techniques for advanced analytics and big data are very tolerant of raw source data, with its transactional schema, non-standard data, and poor-quality data. That’s a good thing, because discovery and predictive analytics depend on lots of details, even questionable data. For example, analytic applications for fraud detection often depend on outliers and non-standard data as indications of fraud. If you apply ETL and DQ processes to big data, as you do for a data warehouse, you’ll strip out the very nuggets that make big data a treasure trove for advanced analytics.
Big data is a special asset that merits leverage. And that’s the real point of Deep Analytics with Big Data. The new technologies and new best practices are fascinating, even mesmerizing. And there’s a certain macho coolness to working with dozens of terabytes. But don’t do it for the technology. Put Big Data and Advance Analytics together for the new insights they give the business.
So, what do you think? Does the intersection of Big Data and Advance Analytics make sense to you? Let me know. Thanks!
To learn more, register to attend a TDWI Webinar on this topic. “The Intersection of Big Data and Analytics,” May 5, 2011 at noon eastern time. http://bit.ly/eh5YA9
Posted by Philip Russom, Ph.D. on April 25, 20110 comments
A few days ago, I presented a TDWI Webinar based on my newly published TDWI Best Practices report about “Next Generation Data Integration” (NGDI). Almost three hundred people attended the broadcast, and (with such a large turnout) I got a ton of great questions from the audience about data integration (DI).
I’d like to share some of those questions with you (and my responses to Webinar attendees who asked them), as a way of expanding and clarifying the research findings of the report. If you care about DI, this should be interesting for you.
Concerning bulk upload, should we use a batch upload mechanism or Web services?
It depends on the dataset being bulk loaded. You should stick to your old reliable bulk loader for datasets that are very large, too large for a service bus, don’t have an immediate delivery requirement, or demand multiple complex passes (as many multidimensional structures do, when being loaded into a data warehouse). Most services, messages, or events used in a DI context handle time-sensitive data, which is delivered faster over a message or service bus. Also, real-time DI often enables Operational Business Intelligence (OpBI), where data is drawn frequently from ERP, CRM, and other operational applications, then loaded into a warehouse, mart, or other BI data store. OpBI may also use DI to publish improved data back to those applications. Many operational applications (especially SAP) are best extracted from via the application layer, and services and messages usually support such an interface. From these examples, you can see that the old (bulk loaders) and the new (services) intermingle in the newest DI generation.
Do staging tables play an important role in DI?
Yes. The newest generation of DI still relies of older, tried-and-true designs and DI architectures. And these typically have a variety of data landing and data staging areas, including databases (like operational data stores) and tables (whether physically in the data warehouse or external to it). One new spin on this is that 64-bit computing and very large memory spaces in server hardware now enable more effective DI pipes. This is where data is staged and processed in server memory, not landed to disk. This both speeds up DI transformational processing and boosts scalability for large data volumes. For many organizations, NGDI is about adjusting (not abandoning) useful best practices like this to take advantage of newly available platform capabilities.
Is DI architecture and information architecture the same thing?
No, they’re different. Information architecture is usually about the data models and schema within individual enterprise databases, plus data dependencies across multiple ones. DI architecture concerns the design of data flows, plus development standards (like preferred interfaces for specific applications). For DI, hub-and-spoke is the most common architecture, where a vendor’s DI tool or a control server (in home-grown DI solutions) is equivalent to a hub. But point-to-point interfaces still abound in DI jobs, and DI over a bus is subject to whatever the bus requires. My report explains that designing and using just the right DI architecture has become a critical success factor for satisfying next-generation requirements, like scalability, real time, governance, and DI team collaboration.
Where do you see ERP choices within the context of NGDI?
In my world, Operational Business Intelligence (OpBI) has become quite common. OpBI requires much from a DI tool. The DI tool has to support feature-rich interfaces to ERP and other application types. The DI tool must have optimization to draw data fast, frequently, and non-invasively from ERP modules and applications. And the DI tool must understand ERP data structures and function calls to make sense of ERP data, before integrating it elsewhere. OpBI and other real-time business practices wouldn’t be possible without real-time DI. In fact, my report shows that various real-time DI functions are the ones users will increase the use of most over the next three years.
Other common DI practices involving ERP include synchronizing customer data (and other data domains, especially product data) across multiple ERP modules and instances. Synchronizing reference data is a similar practice, one that’s growing quickly. Since some ERPs are almost impermeable, DI is regularly called in to assist with data access for data quality. This kind of coordination between DI and DQ is one of the hallmarks of NGDI.
Do you think certain aspects of traditional EAI are going to be part of NGDI?
Well, first of all, I regularly find some DI functions executed over EAI and similar buses in user organizations that have already made a substantial investment in a robust EAI infrastructure. Firms in financial and insurance industries are typical examples. Second, I think what’s happening in such firms is that DI is simply leveraging more deeply an existing infrastructure, just as other users, applications, and tools are. Third, DI is being driven to EAI, in situations where EAI has better interfaces (especially to packaged applications) or certain time-sensitive data has a real-time requirement (for which EAI messages are easily configured). Even so, there’s still a need for standard data interfaces over the enterprise LAN.
Any metrics around how much operational cost is associated with near real-time data integration vs the traditional batch model?
Ten years ago, real-time DI via EAI was possible, but it usually required the purchase of extra tools. Plus, real-time functions in tools and applications weren’t very robust, so an administrator had to watch and tweak them constantly. These two characteristics drove up the cost. Luckily, a lot of RT functionality is built into today’s applications, databases, and DI tools. Many firms have a robust EAI or service bus infrastructure that DI can tap for real time. For firms that have kept their enterprise software and infrastructure up-to-date, real time DI is quite accessible, reliable, and inexpensive, as compare to the recent past. But that’s with EAI in mind. From a different direction, batch processing has improved, too. It may be preferred in the form of so-called micro-batches for frequent intra-day extract that needn’t be truly RT.
Can you expand on RT event processing, including contexts for applicability?
You probably don’t want to handle just any kind of event via a DI tool. Instead, some kind of “complex event” benefits from DI processing. A complex event is actually multiple events, typically occurring at different times (even different months or years) that need to be correlated. ETL-ish DI can access the many diverse data sources and data models where complex data events may be managed. Today, I almost exclusively find federal intelligence or security agencies doing this, to recognize and quantify security threats. The TSA and Coast Guard come to mind. But it’s just a matter of time before such DI-enabled practices are common with customer events in for-profit corporations.
If you have a question or answer about Next Generation Data Integration (or a reaction to one presented above), please share them by responding to this blog.
Register for and replay the TDWI Webinar these questions came from at
Download a free copy of the TDWI Best Practices Report titled Next Generation Data Integration, at http://tdwi.org/research/list/tdwi-best-practices-reports.aspx
Find tweets about NGDI by searching Twitter.com for the hash tag #NGDI.
Posted by Philip Russom, Ph.D. on April 19, 20110 comments