TDWI Blog

Philip Russom, Ph.D., is senior director of TDWI Research for data management and is a well-known figure in data warehousing, integration, and quality, having published over 550 research reports, magazine articles, opinion columns, and speeches over a 20-year period. Before joining TDWI in 2005, Russom was an industry analyst covering data management at Forrester Research and Giga Information Group. He also ran his own business as an independent industry analyst and consultant, was a contributing editor with leading IT magazines, and was a product manager at database vendors. His Ph.D. is from Yale. You can reach him by email ([email protected]), on Twitter (twitter.com/prussom), and on LinkedIn (linkedin.com/in/philiprussom).


The Intersection of Big Data and Advanced Analytics

I recently started work on a new TDWI Best Practices Report with the working title: Deep Analytics with Big Data. The report is a tad schizophrenic, in that it’s really about two things – big data and analytics – plus how the two have teamed up to create one of the most profound trends in business intelligence (BI) today. Let me share some of the thinking behind the schizophrenia. Please reply to this blog to tell me whether this makes sense or not.

Advanced Analytics

According to a recent TDWI survey, 38% of organizations surveyed are practicing advanced analytics today. But 85% say they’ll do it within 3 years!

Why the rush to advanced analytics? First, change is rampant in business; we’ve been through multiple “economies” in recent years. And analytics helps us discover what changed plus how we should react. Second, there are still many business opportunities to leverage -- even in the recession -- and more will come as we finally crawl out of the recession. To that end, advanced analytics is the best way to discover new customer segments, identify the best suppliers, associate products of affinity, understand sales seasonality, and so on. For these reasons, TDWI has seen an explosion of user organizations implementing analytics in recent years.

But note that user organizations are implementing specific forms of analytics, particularly what is sometimes called advanced analytics. This is a collection of related techniques and tools, usually including predictive analytics, data mining, statistical analysis, and complex SQL. We might also extend the list to cover data visualization, artificial intelligence, natural language processing, and database methods that support analytics.

All these techniques have been around for years, many of them appearing in the 1990s. The thing that’s different now is that far more user organizations are actually using them. That’s because most of these techniques adapt well to very large, multi-terabyte datasets, with minimal data preparation. And that brings us to big data.

Big Data

Big data can be defined simply as multi-terabyte datasets. And this makes sense, given that corporations, government agencies, and other user organizations are generating and retaining more data than ever before. Soon enough, big data will involve petabytes, not terabytes. Yet, big data also involves big complexity, namely many diverse data sources (both internal and external), data types (structured, unstructured, semi-structured), and indexing schemes (relational, multidimensional, no-SQL).

Occasionally, I hear a user complain about the problems of storing and managing big data. Much more often, however, I hear people talk about what an extraordinary opportunity big data is. That’s because, for the kinds of discovery and prediction that most advanced analytic techniques enable, big data is truly a treasure trove of information that merits leverage for business advantage. And that brings us to the intersection mentioned in the title of this blog.

Advanced Analytics and Big Data: Why put them together?

Here are a few reasons:

Big data yields gigantic statistical samples. Most tools designed for data mining or statistical analysis tend to be optimized for large datasets. In fact, the general rule is that the larger the data sample, the more accurate are the statistics and other products of the analysis. Instead of mining and statistical tools, I regularly find users generating or hand-coding complex SQL, which parses big data in search of just the right customer segment, churn profile, or excessive operational cost. The newest generation of data visualization tools and in-database analytic functions likewise operate on big data.

Analytic tools and databases can now handle big data. And they can execute big queries and parses in record time. Recent generations of vendor tools and platforms have raised us onto a new plateau of performance that’s very compelling for applications involving big data.

There’s a lot to learn from messy data, as long as it’s big. Most modern tools and techniques for advanced analytics and big data are very tolerant of raw source data, with its transactional schema, non-standard data, and poor-quality data. That’s a good thing, because discovery and predictive analytics depend on lots of details, even questionable data. For example, analytic applications for fraud detection often depend on outliers and non-standard data as indications of fraud. If you apply ETL and DQ processes to big data, as you do for a data warehouse, you’ll strip out the very nuggets that make big data a treasure trove for advanced analytics.

Big data is a special asset that merits leverage. And that’s the real point of Deep Analytics with Big Data. The new technologies and new best practices are fascinating, even mesmerizing. And there’s a certain macho coolness to working with dozens of terabytes. But don’t do it for the technology. Put Big Data and Advanced Analytics together for the new insights they give the business.
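The point above about messy data can be made concrete with a small sketch. It assumes a hypothetical list of transaction amounts and uses a median-based outlier test (a common statistical technique, not one named in the report). The `flag_outliers` and `cleanse` functions and the sample figures are invented for illustration only.

```python
from statistics import median

def flag_outliers(amounts, threshold=3.5):
    """Return amounts whose modified z-score (median/MAD based) exceeds threshold."""
    med = median(amounts)
    # Median absolute deviation: a spread measure that outliers cannot distort much.
    mad = median(abs(a - med) for a in amounts)
    if mad == 0:
        return []
    return [a for a in amounts if 0.6745 * abs(a - med) / mad > threshold]

def cleanse(amounts, threshold=3.5):
    """A warehouse-style DQ step (hypothetical) that discards the outliers."""
    outliers = set(flag_outliers(amounts, threshold))
    return [a for a in amounts if a not in outliers]

# Hypothetical raw transaction amounts; one value is far from the rest.
transactions = [42.0, 38.5, 45.0, 40.0, 41.5, 39.0, 4800.0, 43.0]

suspects = flag_outliers(transactions)  # what a fraud analyst wants to see
cleaned = cleanse(transactions)         # what a cleansing step would pass along

print("suspects:", suspects)
print("cleaned:", cleaned)
```

The same test that a cleansing routine would use to discard a record is the one a fraud application would use to surface it, which is why the raw data matters.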

So, what do you think? Does the intersection of Big Data and Advanced Analytics make sense to you? Let me know. Thanks!

To learn more, register to attend a TDWI Webinar on this topic. “The Intersection of Big Data and Analytics,” May 5, 2011 at noon eastern time. http://bit.ly/eh5YA9

Posted by Philip Russom, Ph.D. on April 25, 2011


FAQ: Next Generation Data Integration

A few days ago, I presented a TDWI Webinar based on my newly published TDWI Best Practices report about “Next Generation Data Integration” (NGDI). Almost three hundred people attended the broadcast, and (with such a large turnout) I got a ton of great questions from the audience about data integration (DI).

I’d like to share some of those questions with you (and my responses to Webinar attendees who asked them), as a way of expanding and clarifying the research findings of the report. If you care about DI, this should be interesting for you.

Concerning bulk upload, should we use a batch upload mechanism or Web services?

It depends on the dataset being bulk loaded. You should stick to your old reliable bulk loader for datasets that are very large, too large for a service bus, don’t have an immediate delivery requirement, or demand multiple complex passes (as many multidimensional structures do, when being loaded into a data warehouse). Most services, messages, or events used in a DI context handle time-sensitive data, which is delivered faster over a message or service bus. Also, real-time DI often enables Operational Business Intelligence (OpBI), where data is drawn frequently from ERP, CRM, and other operational applications, then loaded into a warehouse, mart, or other BI data store. OpBI may also use DI to publish improved data back to those applications. Many operational applications (especially SAP) are best extracted from via the application layer, and services and messages usually support such an interface. From these examples, you can see that the old (bulk loaders) and the new (services) intermingle in the newest DI generation.
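As a rough sketch, the rule of thumb in this answer could be written as a small routing function. The `Dataset` fields, the `choose_load_path` name, and the 1 GB bus limit are all assumptions made for illustration; real limits depend on your bus and your loader.

```python
from dataclasses import dataclass

# Hypothetical ceiling for what a service bus can reasonably carry.
MAX_BUS_PAYLOAD_GB = 1.0

@dataclass
class Dataset:
    name: str
    size_gb: float
    needs_immediate_delivery: bool
    needs_multipass_load: bool

def choose_load_path(ds: Dataset) -> str:
    """Pick a delivery mechanism using the criteria described in the answer."""
    # Very large datasets and multi-pass loads stay with the bulk loader.
    if ds.size_gb > MAX_BUS_PAYLOAD_GB or ds.needs_multipass_load:
        return "bulk_loader"
    # Small, time-sensitive data travels faster over a message or service bus.
    if ds.needs_immediate_delivery:
        return "service_bus"
    # No urgency: the old reliable bulk loader is good enough.
    return "bulk_loader"

for ds in (
    Dataset("sales_cube", 500.0, False, True),
    Dataset("order_events", 0.01, True, False),
    Dataset("nightly_reference_data", 0.2, False, False),
):
    print(ds.name, "->", choose_load_path(ds))
```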

Do staging tables play an important role in DI?

Yes. The newest generation of DI still relies on older, tried-and-true designs and DI architectures. And these typically have a variety of data landing and data staging areas, including databases (like operational data stores) and tables (whether physically in the data warehouse or external to it). One new spin on this is that 64-bit computing and very large memory spaces in server hardware now enable more effective DI pipes. This is where data is staged and processed in server memory, not landed to disk. This both speeds up DI transformational processing and boosts scalability for large data volumes. For many organizations, NGDI is about adjusting (not abandoning) useful best practices like this to take advantage of newly available platform capabilities.
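To show what an in-memory DI pipe looks like in miniature, here is a sketch that uses Python generators, so each row flows from extract to transform to load without an intermediate staging table on disk. The function names, the row layout, and the list standing in for a warehouse table are hypothetical. A vendor DI tool does this at far larger scale.

```python
def extract(rows):
    """Yield source rows one at a time (stand-in for reading an operational system)."""
    for row in rows:
        yield row

def transform(rows):
    """Standardize each row in memory as it passes through the pipe."""
    for row in rows:
        yield {
            "customer": row["customer"].strip().title(),
            "amount_usd": round(row["amount_cents"] / 100, 2),
        }

def load(rows, target):
    """Append transformed rows to the target and report how many were loaded."""
    count = 0
    for row in rows:
        target.append(row)
        count += 1
    return count

# Hypothetical source data and an in-memory list standing in for a warehouse table.
source = [
    {"customer": "  acme corp ", "amount_cents": 12550},
    {"customer": "globex", "amount_cents": 990},
]
warehouse = []

loaded = load(transform(extract(source)), warehouse)
print(loaded, "rows loaded:", warehouse)
```

Because the stages are chained, no stage holds the full dataset. That is the property that lets memory-resident pipes scale with data volume.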

Are DI architecture and information architecture the same thing?

No, they’re different. Information architecture is usually about the data models and schema within individual enterprise databases, plus data dependencies across multiple ones. DI architecture concerns the design of data flows, plus development standards (like preferred interfaces for specific applications). For DI, hub-and-spoke is the most common architecture, where a vendor’s DI tool or a control server (in home-grown DI solutions) is equivalent to a hub. But point-to-point interfaces still abound in DI jobs, and DI over a bus is subject to whatever the bus requires. My report explains that designing and using just the right DI architecture has become a critical success factor for satisfying next-generation requirements, like scalability, real time, governance, and DI team collaboration.
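One way to see why hub-and-spoke is so common is to count interfaces. The sketch below assumes the worst case, where every source feeds every target. The function names are made up for illustration, and real landscapes are rarely fully connected.

```python
def point_to_point_interfaces(sources: int, targets: int) -> int:
    """Each source needs its own interface to each target it feeds."""
    return sources * targets

def hub_and_spoke_interfaces(sources: int, targets: int) -> int:
    """Each system needs only one interface, to the hub."""
    return sources + targets

# Compare the two designs as the number of systems grows.
for n in (3, 5, 10):
    print(
        f"{n} sources x {n} targets:",
        point_to_point_interfaces(n, n), "point-to-point vs",
        hub_and_spoke_interfaces(n, n), "via hub",
    )
```

Under that assumption, point-to-point interfaces grow with the product of sources and targets, while hub interfaces grow only with their sum.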

Where do you see ERP choices within the context of NGDI?

In my world, Operational Business Intelligence (OpBI) has become quite common. OpBI requires much from a DI tool. The DI tool has to support feature-rich interfaces to ERP and other application types. The DI tool must have optimization to draw data fast, frequently, and non-invasively from ERP modules and applications. And the DI tool must understand ERP data structures and function calls to make sense of ERP data, before integrating it elsewhere. OpBI and other real-time business practices wouldn’t be possible without real-time DI. In fact, my report shows that various real-time DI functions are the ones users will increase the use of most over the next three years.

Other common DI practices involving ERP include synchronizing customer data (and other data domains, especially product data) across multiple ERP modules and instances. Synchronizing reference data is a similar practice, one that’s growing quickly. Since some ERPs are almost impermeable, DI is regularly called in to assist with data access for data quality. This kind of coordination between DI and DQ is one of the hallmarks of NGDI.

Do you think certain aspects of traditional EAI are going to be part of NGDI?

Well, first of all, I regularly find some DI functions executed over EAI and similar buses in user organizations that have already made a substantial investment in a robust EAI infrastructure. Firms in financial and insurance industries are typical examples. Second, I think what’s happening in such firms is that DI is simply leveraging more deeply an existing infrastructure, just as other users, applications, and tools are. Third, DI is being driven to EAI, in situations where EAI has better interfaces (especially to packaged applications) or certain time-sensitive data has a real-time requirement (for which EAI messages are easily configured). Even so, there’s still a need for standard data interfaces over the enterprise LAN.

Any metrics around how much operational cost is associated with near real-time data integration vs the traditional batch model?

Ten years ago, real-time DI via EAI was possible, but it usually required the purchase of extra tools. Plus, real-time functions in tools and applications weren’t very robust, so an administrator had to watch and tweak them constantly. These two characteristics drove up the cost. Luckily, a lot of RT functionality is built into today’s applications, databases, and DI tools. Many firms have a robust EAI or service bus infrastructure that DI can tap for real time. For firms that have kept their enterprise software and infrastructure up-to-date, real-time DI is quite accessible, reliable, and inexpensive, as compared to the recent past. But that’s with EAI in mind. From a different direction, batch processing has improved, too. It may be preferred in the form of so-called micro-batches for frequent intra-day extracts that needn’t be truly RT.
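For readers unfamiliar with micro-batches, here is a minimal sketch of the idea: changed records are grouped into short, fixed time windows and each window is handed to the loader as one small batch. The 15-minute window, the `micro_batches` function, and the sample records are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def micro_batches(records, window_minutes=15):
    """Group (timestamp, payload) records into fixed intra-day windows."""
    window = timedelta(minutes=window_minutes)
    batches = defaultdict(list)
    for ts, payload in records:
        # Floor the timestamp to the start of its window, counted from midnight.
        midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
        window_index = (ts - midnight) // window
        batches[midnight + window_index * window].append(payload)
    # Return windows in chronological order.
    return dict(sorted(batches.items()))

# Hypothetical change records captured from an operational application.
records = [
    (datetime(2011, 4, 19, 9, 2), "order-1001"),
    (datetime(2011, 4, 19, 9, 14), "order-1002"),
    (datetime(2011, 4, 19, 9, 16), "order-1003"),
    (datetime(2011, 4, 19, 9, 44), "order-1004"),
]

for start, payloads in micro_batches(records).items():
    print(start.strftime("%H:%M"), payloads)
```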

Can you expand on RT event processing, including contexts for applicability?

You probably don’t want to handle just any kind of event via a DI tool. Instead, some kind of “complex event” benefits from DI processing. A complex event is actually multiple events, typically occurring at different times (even different months or years) that need to be correlated. ETL-ish DI can access the many diverse data sources and data models where complex data events may be managed. Today, I almost exclusively find federal intelligence or security agencies doing this, to recognize and quantify security threats. The TSA and Coast Guard come to mind. But it’s just a matter of time before such DI-enabled practices are common with customer events in for-profit corporations.
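A toy version of complex event correlation might look like the sketch below. It treats a complex event as a set of simple event types that must all have occurred for the same entity, however far apart in time. The event names, customer identifiers, and the `correlate` function are hypothetical. Production systems would pull these events from many sources and apply richer rules.

```python
from collections import defaultdict
from datetime import date

def correlate(events, required_types):
    """Return entities that have experienced every required event type.

    events: iterable of (entity_id, event_type, date) tuples.
    required_types: set of event types that together form the complex event.
    """
    by_entity = defaultdict(list)
    for entity, event_type, when in events:
        by_entity[entity].append((when, event_type))

    matches = {}
    for entity, items in by_entity.items():
        seen_types = {event_type for _, event_type in items}
        # The complex event fires only when all required types are present.
        if required_types <= seen_types:
            matches[entity] = sorted(items)
    return matches

# Hypothetical simple events, spread over months and years.
events = [
    ("cust-17", "address_change", date(2009, 11, 3)),
    ("cust-42", "large_purchase", date(2010, 2, 14)),
    ("cust-17", "large_purchase", date(2010, 6, 21)),
    ("cust-17", "refund_request", date(2011, 1, 9)),
    ("cust-42", "refund_request", date(2011, 3, 30)),
]

pattern = {"address_change", "large_purchase", "refund_request"}
matches = correlate(events, pattern)
print(matches)
```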

CONCLUSION

If you have a question or answer about Next Generation Data Integration (or a reaction to one presented above), please share it by responding to this blog.

Register for and replay the TDWI Webinar these questions came from at
http://tdwi.org/webcasts/2011/04/next-generation-data-integration.aspx?tc=page0

Download a free copy of the TDWI Best Practices Report titled Next Generation Data Integration, at http://tdwi.org/research/list/tdwi-best-practices-reports.aspx

Find tweets about NGDI by searching Twitter.com for the hash tag #NGDI.

Posted by Philip Russom, Ph.D. on April 19, 2011


Themes and Insights from the TDWI Solution Summit on Master Data, Quality and Governance

This week, we at TDWI produced our third annual Solution Summit on Master Data, Quality, and Governance, again in Savannah, Georgia. Jill Dyché and I moderated the conference, and we lined up a host of great user speakers and vendor panelists. The audience asked dozens of insightful questions, and the event included hundreds of one-to-one meetings among attendees, speakers, and vendor sponsors. The aggregated result was a massive knowledge transfer that highlights most of today’s burning issues in master data management, data quality, and data governance. I’d like to share with you some of the themes and insights that arose at the TDWI Solution Summit.

RECURRING THEMES

As you can see in the title of the conference, it brings together the data disciplines of master data management (MDM), data quality (DQ), and data governance (DG). TDWI has noted that many of its members coordinate these three very closely, sometimes with additional coordination for data integration, business intelligence, data warehousing, and metadata management. Most of the user case study speakers explained how their organizations are handling the coordination, including Cathy Burrows from Royal Bank of Canada (RBC), Becky Briggs from Airlines Reporting Corporation (ARC), and Mark Love and Sara Temlitz from the Veterans Health Administration (VHA).

A number of recurring themes were heard across the presentations and panels. But three stood out prominently: the importance of identity resolution, accurate matching, and de-duplication of redundant records. All three techniques apply to both MDM and DQ. It was obvious from speeches and questions from the audience that most organizations currently have matching and de-duplication in place, but need to complement these with identity resolution in the near future.
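For readers new to these techniques, the sketch below shows the simplest form of matching and de-duplication: records are reduced to a normalized match key, and only the first record per key survives. The field names, the `match_key` rule, and the sample records are invented for illustration. Identity resolution goes well beyond this, and commercial tools use far more sophisticated matching.

```python
import re

def match_key(record):
    """Build a normalized key so trivially different records compare as equal."""
    # Lowercase, drop punctuation, and collapse repeated whitespace in the name.
    name = re.sub(r"[^a-z0-9 ]", "", record["name"].lower())
    name = " ".join(name.split())
    postal = record["postal_code"].replace(" ", "").upper()
    return (name, postal)

def deduplicate(records):
    """Keep the first record seen for each match key."""
    survivors = {}
    for rec in records:
        survivors.setdefault(match_key(rec), rec)
    return list(survivors.values())

# Hypothetical customer records; the first two describe the same person.
customers = [
    {"name": "Jane  O'Neil", "postal_code": "31401"},
    {"name": "jane oneil", "postal_code": "31401"},
    {"name": "John Smith", "postal_code": "31401"},
]

unique_customers = deduplicate(customers)
print(unique_customers)
```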

It would appear that data stewardship plays an important role in MDM--not just DQ. For example, speaker Becky Briggs (ARC) has won TDWI Best Practices awards for her applications of stewardship in data quality and analytics implementations. For many organizations, stewardship is a first step toward data governance. Mark Love from the VHA explained: “Our purpose for stewardship is to create a framework for data-related decision making, collaboration, and governance.” Furthermore, the VHA has expanded the concept of stewardship by hiring Identity Management Stewards who support MDM.

SPEAKER INSIGHTS

We all know that taking DQ functions upstream into the operational applications where data is entered or altered can substantially reduce DQ problems in applications databases and downstream databases (like data warehouses). But did you know that the same applies to MDM? Rick Clements from IBM played an eye-opening video that shows how to embed MDM matching and identity resolution functions in the GUIs of salesforce.com and other operational applications.

David Smith (who’s on the CIO team at Citrix) shared with the audience his method for quantifying vendors and their tools, thereby giving structure and hard facts to the otherwise ad hoc process of selecting a vendor tool. Although the method can apply to many tool types, David explained how to apply it to the selection of an MDM tool.
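The details of David’s method aren’t reproduced here, so the following is not his method. It is a sketch of one common way to quantify a tool selection, a weighted scoring matrix. The criteria, weights, vendor names, and ratings are all made up for illustration.

```python
def weighted_score(ratings, weights):
    """Weighted average of 1-to-5 ratings, using the given criterion weights."""
    total_weight = sum(weights.values())
    return sum(weights[c] * ratings[c] for c in weights) / total_weight

def rank_vendors(all_ratings, weights):
    """Return vendor names ordered from highest to lowest weighted score."""
    return sorted(
        all_ratings,
        key=lambda vendor: weighted_score(all_ratings[vendor], weights),
        reverse=True,
    )

# Hypothetical criteria and weights; higher weight means more important.
weights = {"matching": 5, "integration": 3, "cost": 2}

# Hypothetical ratings on a 1 (poor) to 5 (excellent) scale.
ratings = {
    "Vendor A": {"matching": 4, "integration": 3, "cost": 5},
    "Vendor B": {"matching": 5, "integration": 4, "cost": 2},
}

for vendor in rank_vendors(ratings, weights):
    print(vendor, weighted_score(ratings[vendor], weights))
```

Writing the weights down before rating any vendor is what gives an otherwise ad hoc comparison its structure.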

In discussions of the future of MDM, we heard about how MDM is now available through clouds and software as a service (SaaS). For example, Peter Kulupka from Acxiom described the cutting edge of MDM, where it’s provided as a service via a third-party public cloud. Similarly, Dan Soceanu from DataFlux pointed out that “The private cloud’s where it’s at for most enterprises doing DQ and MDM.”

The TDWI Solution Summit on Master Data, Quality, and Governance concluded with John Biderman of Harvard Pilgrim Health Care, who explained in detail his mature master reference data strategy, which includes a business-friendly, browser-based tool for entering, validating, and studying master data.

To learn more about the event, visit its Web site at: http://events.tdwi.org/events/solution-summit-savannah-2011/home.aspx. You can also read its tweets by searching Twitter for #tdwimdm.

Posted by Philip Russom, Ph.D. on March 14, 2011