Blog by Philip Russom
Research Director for Data Management, TDWI
I’ve just completed a TDWI Best Practices Report titled Next Generation Master Data Management. The goal is to help user organizations understand MDM lifecycle stages so they can better plan and manage them. TDWI will publish the 40-page report in a PDF file on April 2, 2012, and anyone will be able to download it from www.tdwi.org. In the meantime, I’ll provide some “sneak peeks” by blogging excerpts from the report. Here’s the second in a series of three excerpts. If you haven’t already, you should read the
first excerpt before continuing.
Collaborative Processes for MDM
By definition, MDM is a collaborative discipline that requires a lot of communication and coordination among several types of people. This is especially true of entity definitions, because there is rarely one person who knows all the details that would go into a standard definition of a customer or other entity. The situation is compounded when multiple definitions of an entity are required to make reference data “fit for purpose” across multiple IT systems, lines of business, and geographies. For example, sales, customer service, and finance all interact with customers, but have different priorities that should be reflected in a comprehensive entity model. Likewise, technical exigencies of the multiple IT systems sharing data may need addressing in the model. And many entities are complex hierarchies or have dependencies that take several people to sort out, as in a bill of material (for products) or a chart of accounts (for financials).
Once a definition is created from a business viewpoint, further collaboration is needed to gain review and approval before applying the definition to IT systems. At some point, business and technical people come together to decide how best to translate the definition into the technical media through which a definition is expressed. Furthermore, technical people working on disparate systems must collaborate to develop the data standards needed for the exchange and synchronization of reference data across systems. Since applying MDM definitions often requires that changes be made to IT systems, managing those changes demands even more collaboration.
That’s a lot of collaboration! To organize the collaboration, many firms put together an organizational structure where all interested parties can come together and communicate according to a well-defined business process. For this purpose, data governance committees or boards have become popular, although stewardship programs and competency centers may also provide a collaborative process for MDM and other data management disciplines (especially data quality).
================================
ANNOUNCEMENTS
Keep an eye out for part 3 in this MDM blog series, coming March 2. I’ll tweet so you know when that blog is posted.
David Loshin and I will moderate the
TDWI Solution Summit on Master Data, Quality, and Governance, coming up March 4-6, 2012 in Savannah, Georgia.
Please attend the TDWI Webinar where I will present the findings of my TDWI report Next Generation MDM, on April 10, 2012, at noon ET.
Register online for the Webinar.
Posted by Philip Russom, Ph.D. on February 17, 2012
Blog by Philip Russom
Research Director for Data Management, TDWI
I’ve just completed a TDWI Best Practices Report titled Next Generation Master Data Management. The goal is to help user organizations understand MDM lifecycle stages so they can better plan and manage them. TDWI will publish the 40-page report in a PDF file on April 2, 2012, and anyone will be able to download it from www.tdwi.org. In the meantime, I’ll provide some “sneak peeks” by blogging excerpts from the report. Here’s the first in a series of three excerpts.
Defining Master Data Management
To get us all on the same page, let’s start with a basic definition of MDM, then drill into details:
Master data management (MDM) is the practice of defining and maintaining consistent definitions of business entities (e.g., customer or product) and data about them across multiple IT systems and possibly beyond the enterprise to partnering businesses. MDM gets its name from the master and/or reference data through which consensus-driven entity definitions are usually expressed. An MDM solution provides shared and governed access to the uniquely identified entities of master data assets, so those enterprise assets can be applied broadly and consistently across an organization.
That’s a good nutshell definition of what MDM is. However, to explain in detail what MDM does, we need to look at the three core activities of MDM, namely: business goals, collaborative processes, and technical solutions.
Business Goals and MDM
Most organizations have business goals, such as retaining and growing customer accounts, optimizing a supply chain, managing employees, tracking finances accurately, or building and supporting quality products. All these and other data-driven goals are more easily and accurately achieved when supported by master data management. That’s because most business goals focus on a business entity, such as a customer, supplier, employee, financial instrument, or product. Some goals combine two or more entities, as in customer profitability (customers, products, and finances) or product quality (suppliers and products). MDM contributes to these goals by providing processes and solutions for assembling complete, clean, and consistent definitions of these entities and reference data about them. Many business goals span multiple departments, and MDM prepares data about business entities so it can be shared liberally across an enterprise.
Sometimes the business goal is to avoid business problems. As a case in point, consider that one of the most pragmatic applications of MDM is to prevent multiple computer records for a single business entity. For example, multiple departments of a corporation may each have a customer record for the same customer. Similarly, two merging firms end up with multiple records when they have customers in common.
Business problems ensue from redundant customer records. If the records are never synchronized or consolidated, the firm will never understand the complete relationship it has with that customer. Undesirable business outcomes include double billing and unwarranted sales attempts. From the view of a single department, the customer’s commitment seems less than it really is, resulting in inappropriately low discounts or service levels. MDM alleviates these problems by providing collaborative processes and technical solutions that link equivalent records in multiple IT systems, so the redundant records can be synchronized or consolidated. Deduplicating redundant records is a specific use case within a broader business goal of MDM, namely to provide complete and consistent data (especially views of specific business entities) across multiple departments of a larger enterprise, thereby enabling or improving cross-functional business processes.
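To make the record-linking step a bit more concrete, here is a minimal Python sketch. The systems, field names, and matching rule are all hypothetical; real MDM hubs use probabilistic matching, survivorship rules, and cross-reference tables, but the basic idea of linking equivalent records on a normalized key looks something like this:
```python
# Minimal sketch: link equivalent customer records from two systems
# by a normalized match key (name + postal code). Field names are
# hypothetical; production MDM matching is probabilistic and rule-driven.
def match_key(rec):
    name = "".join(ch for ch in rec["name"].lower() if ch.isalnum())
    return (name, rec["postal_code"].strip()[:5])

crm_records = [
    {"id": "CRM-001", "name": "Acme Corp.", "postal_code": "30301"},
    {"id": "CRM-002", "name": "Globex",     "postal_code": "94105"},
]
billing_records = [
    {"id": "BIL-917", "name": "ACME Corp",  "postal_code": "30301-1234"},
]

index = {match_key(r): r for r in crm_records}
for rec in billing_records:
    twin = index.get(match_key(rec))
    if twin:
        # In a real hub, this pair would be written to a cross-reference
        # table so the records can be synchronized or consolidated.
        print("link", twin["id"], "<->", rec["id"])
```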
================================
ANNOUNCEMENTS
Keep an eye out for part 2 and part 3 in this MDM blog series, coming February 17 and March 2, respectively. I’ll tweet so you know when each blog is posted.
David Loshin and I will moderate the TDWI Solution Summit on Master Data, Quality, and Governance, coming up March 4-6, 2012 in Savannah, Georgia. You should attend!
Please attend the TDWI Webinar where I will present the findings of my TDWI report Next Generation MDM on April 10, 2012, at noon ET. Register online for the Webinar.
Posted by Philip Russom, Ph.D. on February 3, 2012
Blog by Philip Russom
Research Director for Data Management, TDWI
I’m currently researching a TDWI Best Practices Report that will redefine master data management (MDM) by describing what its next generation should look like. As part of the research, I’ve been interviewing users on the phone about their MDM programs.
The news so far is a mix of good and bad. I hate saying it, but half of the organizations I’ve talked with are mired in early lifecycle stages of their MDM programs, unable to get over certain humps and mature into the next generation. On the flip side, the other half is well into the next generation; so I know it can be done.
Allow me to list desirable capabilities of MDM’s next generation, and briefly say why these need to replace similar early phase capabilities. The following list (with a great deal more detail) will probably appear in my Next Generation MDM report that TDWI will publish April 2, 2012. After all, the list defines MDM’s next generation. And my goal is to establish a set of rules (or requirements) that can guide users into the next generation.
Multi-domain MDM. Many MDM solutions address only the customer data domain, and they need to move on to other domains, like products, financials, and locations. Single-data-domain MDM is a barrier to having common, consensus-based entity definitions and standard reference data that would allow you to correlate information across multiple domains. (See my blog
The State of Multi-Data-Domain MDM.)
Multi-department, multi-application MDM. MDM for a single application (typically ERP, CRM, or BI) is a safe and effective start. But the point of MDM is to share common definitions across multiple, diverse applications and the departments that depend on them. It’s important to overcome organizational boundaries if MDM is to move from being a local fix to an enterprise infrastructure.
Bidirectional MDM. "Roach Motel MDM," as I call it, is when you extract reference data and study it in a database from which it never emerges (as with many BI/DW systems). One-way MDM is bad whenever you need to improve reference data in a central place, then publish it out to a wide variety of operational applications. (See my article
Roach Motel MDM.)
Real-time MDM. The strongest trend in data management today (and BI/DW, too) is toward real-time operation as a complement to batch. Real-time is critical to identity resolution and the immediate application of recent changes to reference data.
Consolidating multiple, competing MDM solutions. How can you have a single view of the customer, if you have multiple customer-domain MDM solutions? How can you correlate reference data across domains, if the domains are treated in separate MDM solutions? For many organizations, next-gen MDM begins with a consolidation of multiple MDM solutions.
Beyond enterprise data. Despite the obsession with customer data that most MDM solutions suffer, almost none of them today incorporate data about customers from Web sites or social media. If you’re truly serious about MDM as an enabler for CRM, next-gen MDM (and CRM, too) must reach into every customer channel.
Richer modeling. Reference data in the customer domain works fine with flat modeling, involving a simple (but very wide) record per customer. However, other domains make little sense without a richer, hierarchical model, as with a chart of accounts in finance or a bill of material in manufacturing. Metrics and KPIs – so common in BI today – rarely have proper master data in multidimensional models. (See my article
MDM for Performance Management.)
Coordination with other disciplines. To achieve next-gen goals, many organizations need to stop practicing MDM in a vacuum. Instead of MDM as merely a technical fix, it also needs to be aligned with business goals for data. And MDM should be coordinated with related data management disciplines, especially data integration and data quality. A solid data governance program can be an effective medium for such coordination. (See my blog
MDM Can Learn from Data Quality.)
MDM workflow. Development and collaborative efforts in MDM today are mostly ad hoc actions with little or no process. For an MDM program to scale up and grow, it needs workflow functionality that automates the proposal, review, and approval process for newly created or improved reference and master data. Also, a few MDM programs need the kind of workflow enabled by tools for business process management. Vendor tools and dedicated applications for MDM are starting to support such workflows.
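To suggest what that workflow automation boils down to at its simplest, here is a hypothetical Python sketch of a proposal-review-approval cycle for a new reference value. The states, roles, and entity are invented for illustration; vendor MDM and BPM tools add routing, notifications, and audit trails on top of this basic idea:
```python
# Hypothetical sketch of an MDM change-request workflow: a proposed
# reference value moves through review and approval before publication.
ALLOWED = {
    "proposed": {"in_review"},
    "in_review": {"approved", "rejected"},
    "approved": {"published"},
}

class ChangeRequest:
    def __init__(self, entity, value, proposed_by):
        self.entity, self.value = entity, value
        self.state = "proposed"
        self.history = [("proposed", proposed_by)]

    def transition(self, new_state, actor):
        if new_state not in ALLOWED.get(self.state, set()):
            raise ValueError("cannot go from %s to %s" % (self.state, new_state))
        self.state = new_state
        self.history.append((new_state, actor))

req = ChangeRequest("country_code", "XK", proposed_by="data.steward")
req.transition("in_review", actor="governance.board")
req.transition("approved", actor="governance.board")
req.transition("published", actor="mdm.admin")
print(req.state, req.history)
```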
So, what do you think? Do you know of other generational changes that MDM is facing? Let me know.
================================
ANNOUNCEMENTS
Please take the
TDWI MDM Survey for my upcoming report about Next-Generation MDM.
David Loshin and I will moderate the
TDWI Solution Summit on Master Data, Quality, and Governance, coming up March 4-6, 2012 in Savannah, Georgia. You should attend!
Posted by Philip Russom, Ph.D. on November 17, 2011
Blog by Philip Russom
Research Director for Data Management, TDWI
Early this morning, Informatica Corporation announced Informatica HParser, a new product for parsing data in Apache Hadoop environments. Instead of repeating the details of the announcement – which you can read on www.informatica.com, etc. – I’d rather use the announcement as a springboard for my own thoughts about the bigger trends and issues in Big Data Analytics and Hadoop that the announcement fits into. The catch is that there are so many myths and misconceptions (i.e., “mythconceptions”) about Hadoop right now, that I can’t bust them all in a short piece like this blog. So I’ll just present the two leading mythconceptions as background, plus a brief rant for color.
First Mythconception. Hadoop is not one, monolithic thing, so we need to stop talking about it that way. It’s actually an open source software library administered by the Apache Software Foundation. (Some Hadoop products are also available via vendor distributions, but that’s another story.) The Apache Hadoop library includes several products and technologies, including (in BI priority order) the Hadoop Distributed File System (HDFS), MapReduce, Hive, HBase, Pig, ZooKeeper, Flume, Sqoop, Oozie, Hue, and so on. It’s up to you to figure out which combination of Apache Hadoop products to implement for a given application. For applications in business intelligence (BI) and Big Data Analytics, HDFS and MapReduce (perhaps with HBase and Hive) constitute a useful technology stack.
Second Mythconception. Theoretically, HDFS can manage the storage and access of any data type, as long as you can put the data in a file and copy that file into HDFS. As outrageously simplistic as that sounds, it’s largely true, and it’s exactly what brings many users to Apache HDFS in the first place. Yet, HDFS’s admirable tolerance for diverse data doesn’t mean that an Apache Hadoop environment operates equally well with all file and data types. According to users I’ve interviewed, if you expect to get speed, scalability, and development simplicity, you need to work with Hadoop’s preference for record-based data. That’s not as limiting as it sounds, because many types of Big Data handled by HDFS are inherently record-based, as in logs from Web servers and sensors or table dumps of call detail records, customer records, transactions, etc. Furthermore, many sources of traditional enterprise data can be converted to records and copied to HDFS for Big Data Analytics and other applications.
And that brings us to Informatica Corporation’s announcement today of the new Informatica HParser. In a Hadoop environment, it’s MapReduce that actually executes the programmatic logic of an application. In the context of Big Data Analytics, the logic is (today) usually hand-coded data transformations or analytic logic. HParser provides an integrated development environment (IDE) for creating data transformation logic, plus ties into MapReduce to ensure that the logic executes in a fully distributed and parallel fashion. Given Apache Hadoop’s preference for record-based data, use cases cited by Informatica focus on how HParser can convert unstructured data into records and tables, plus flatten overly structured or “complex” data (as in the hierarchies of XML and JSON) into records that are more palatable to HDFS and Apache MapReduce. Record structures aside, Informatica HParser also supports a long list of data standards and document types. And Informatica PowerExchange for Hadoop provides additional functionality.
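I can’t show HParser itself here, but a rough, conceptual Python sketch (not Informatica’s product or approach) can illustrate the general flattening step such tools automate: turning a hierarchical JSON document into flat, delimited records that HDFS and MapReduce jobs digest more easily. The document and field names below are invented for illustration:
```python
import json

# Conceptual illustration only (not Informatica HParser): flatten a
# hierarchical JSON order document into one flat record per line item,
# the kind of record-oriented output that HDFS/MapReduce jobs digest easily.
doc = json.loads("""
{"order_id": "A-100", "customer": {"id": "C-7", "name": "Acme"},
 "lines": [{"sku": "P-1", "qty": 2}, {"sku": "P-2", "qty": 5}]}
""")

records = []
for line in doc["lines"]:
    records.append({
        "order_id": doc["order_id"],
        "customer_id": doc["customer"]["id"],
        "sku": line["sku"],
        "qty": line["qty"],
    })

# Emit tab-delimited records, ready to be copied into HDFS
# (for example, with the standard `hadoop fs -put` command).
for r in records:
    print("\t".join(str(r[k]) for k in ("order_id", "customer_id", "sku", "qty")))
```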
A brief rant. If you’ve been reading my writings on data integration for the last ten years, you know that I consider hand-coded data integration to be non-productive. Hand coding is time-consuming, not very re-usable, hard to update, and inherently feature-poor compared to vendor platforms. Now, we’re faced with Apache MapReduce, which – out of the box – demands huge amounts of hand coding, because it’s a processing engine that manages and provides parallelization for hand-coded routines (whether for analytics, DI, or otherwise). Informatica HParser shows promise for reducing the non-productive hand-coding that open-source environments like Hadoop, MapReduce, and Hive assume.
Conclusion. I feel that the men and women who’ve contributed to open source Hadoop have made an impressive and innovative contribution. And the Apache Software Foundation does a great job enabling the open source community. Thanks to these contributions, Hadoop is successfully used in production, but mostly in large, Internet-based businesses, like Amazon, Comscore, eBay, Google, and LinkedIn. However, for the Hadoop family – and the Big Data Analytics it enables – to become truly useful in a wide range of mainstream organizations across multiple industries, I think that the Hadoop family needs a number of new extensions, improvements, and options for interoperability.
This is why we’re now seeing software vendor companies coming out with various types of support for Apache Hadoop products and technologies. Informatica’s HParser and Informatica PowerExchange for Hadoop are prime examples, and other DI vendors will soon follow suit with similar interfaces and extensions for Hadoop. Some vendors are building administrative tools, which HDFS sorely lacks. And BI and analytic tool vendors are scrambling to sit atop HDFS and MapReduce. Personally, I hope to see more support for Hadoop and soon, because, without it, mainstream user organizations can’t get full value from Hadoop. Hence, they may not adopt it.
So, what do you think? Let me know!
===============================
Do you suffer mythconceptions about Hadoop? If so, TDWI can help you bust them:
• TDWI will soon publish my new Checklist Report on Hadoop, available as a free download on tdwi.org, starting Dec. 13, 2011.
• On Dec. 14, 2011, I’ll broadcast a TDWI Webinar based on that report. Please
register online for the Hadoop Webinar.
Posted by Philip Russom, Ph.D. on November 2, 2011
Blog by Philip Russom
Research Director for Data Management, TDWI
For about a month now, I’ve been interviewing users on the phone, in search of speakers for upcoming TDWI events. I need speakers who can share their organization’s best practices and strategies for data management. As you can imagine, I’ve heard a lot of great tips in these interviews, many of them concerning master data management (MDM).
A tip I’ve heard from people in multiple organizations is that MDM solutions achieve a higher level of success when they adopt some of the techniques and best practices of data quality (DQ). Let me give you some examples of DQ practices applied to MDM.
DQ techniques. For years, I’ve watched data integration solutions incorporate functions that originated with data quality tools, especially data profiling and data monitoring. In a similar trend, I’m now seeing MDM solutions incorporating DQ functions for data standardization, deduplication, augmentation, identification, and verification. After all, master and reference data benefits from these functions, just as any data domain would.
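As a small, hypothetical illustration of what standardization means when applied to master data, here is a Python sketch that normalizes invented name, phone, and country fields before a record enters a master data store. Commercial DQ engines do this with reference dictionaries, parsing grammars, and verification services rather than a few ad hoc rules:
```python
import re

# Illustrative sketch of DQ-style standardization applied to master data.
# Fields and rules are hypothetical; real DQ tools rely on reference
# dictionaries, address validation services, and parsing grammars.
COUNTRY_SYNONYMS = {"usa": "US", "u.s.": "US", "united states": "US"}

def standardize(record):
    rec = dict(record)
    digits = re.sub(r"\D", "", rec.get("phone", ""))          # keep digits only
    rec["phone"] = digits[-10:] if len(digits) >= 10 else digits
    country = rec.get("country", "").strip()
    rec["country"] = COUNTRY_SYNONYMS.get(country.lower(), country.upper())
    rec["name"] = " ".join(rec.get("name", "").split()).title()
    return rec

print(standardize({"name": "  aCME   corp ",
                   "phone": "(404) 555-0100",
                   "country": "U.S."}))
```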
Data stewardship. DQ success usually depends on the processes of data stewardship. A data steward plays a key role in linking data quality work and standards to specific business goals and business applications. The average data steward can identify and prioritize DQ work that will yield a noticeable return for the business. I’m now seeing a similar stewardship approach to prioritizing MDM work.
Collaborative data management. Note that a steward’s priority list is only accurate when developed in conjunction with business managers who know the impact of data’s quality on the business. Likewise, data stewardship can be a process for IT-to-business alignment and collaboration in the context of MDM, not just DQ.
Data governance (DG). I’ve seen a number of organizations take a successful data stewardship program (originally designed to support DQ) and evolve it into a data governance program. You see, a good data stewardship program will establish a process for proposing and authorizing changes to data and applications for the sake of improving data’s quality. A DG board or committee needs a similar process for the data standards and data usage policies it has to create and enforce. In fact, the first policies produced by a DG program usually govern data via quality rules. And a typical “next step” that a DG program takes is to apply said process to data standards and usage policies for MDM.
Change management. DQ and MDM share very similar goals, in that each strives to improve data, whether the data domain is master data, customer data, product data, financial data, etc. Achieving improvement almost always requires changes to data, applications, and how end-users use applications. Therefore, a change management process is key to effecting improvements. DQ has long standing change management processes via stewardship, plus new options for change management via data governance. MDM’s likelihood of effecting positive change is increased when it taps the data-oriented change management processes that evolved from DQ and stewardship.
Conclusion. Frankly, I’m not surprised that MDM solutions are absorbing DQ techniques and best practices. I’ve seen a similar absorption by DI solutions, going on for about ten years now. And I already mentioned how some data governance programs are essentially data stewardship programs, expanded into a data-standards-oriented form of data governance. So, it’s clear to me that a variety of data management disciplines can learn from DQ techniques and stewardship practices. And the discipline going through that cycle right now is MDM. You should follow this trend, if you’re not already.
So, what do you think, folks? Let me know. Thanks!
Posted by Philip Russom, Ph.D. on September 8, 2011
Blog by Philip Russom
Research Director for Data Management, TDWI
Allow me a moment to parachute into the middle of an issue that’s come up a lot this calendar year, namely multi-data-domain master data management (MDM). I assume you are familiar with MDM; if not, spend a few minutes on Wikipedia.
The issue is that most user organizations deploy single-domain MDM solutions. The most popular data domain is customer data, but other common domains for MDM are (in priority order) financials, products, partners, employees, and locations.
Here’s the problem with single-data-domain MDM. It’s a barrier to having common, consensus-based entity definitions and standard reference data that would allow you to correlate information across multiple domains. For example, single-domain MDM is great for creating a single view of customers. But it needs to federate or somehow integrate with MDM for the product-data domain, if you want to extend that view to include (with a high level of accuracy and consistency) products and services that each customer has acquired or considered. Or you might include financial or location data. Some day, you’ll include data from social media. All this is easier and more accurate with multi-data-domain MDM.
The examples probably sound analytic to you, but they’re equally applicable to operations. And multi-data-domain MDM can improve lots of data management functions, like analytics, identity resolution, customer intimacy, data quality, data integration, deduplication, and sharing data across disparate departments and their IT systems.
I wish it weren’t true, but I still see most MDM solutions as focused on the customer data domain -- and that’s all. If MDM addresses other domains -- typically financial or product data -- that’s done in a separate solution, with little or no integration with MDM for customer data. Some user organizations have multiple customer-focused MDM solutions, say one each for marketing analytics, direct marketing, sales pipeline, customer service, and so on. So much for a single view of the customer! These organizations have their hands full consolidating customer-data-domain MDM solutions, and that delays the next step, which is multi-data-domain MDM.
Despite these dire situations, I’ve also encountered user organizations that have successfully extended MDM to span multiple data domains. And some of these spoke at TDWI’s Solution Summit on Master Data in March 2011. For example, Cathy Burrows from Royal Bank of Canada explained how they consolidated multiple MDM solutions to create a single, central, and governed MDM solution that provides a rich, accurate, and even intimate view of each customer. They’re now enriching customer views with reference data about the products these customers have.
As another example, Mark Love of the Veterans Health Administration (VHA) talked about how the VHA started with a form of MDM for patient identity, then branched out into many other domains. To keep the domains straight and to leverage hierarchical relations among domains, the VHA created a “master set of domains.”
I got to thinking about all this because, just yesterday, I was talking about multi-data-domain MDM with Ravi Shankar of Informatica. “Most of our recent MDM deals are multi-domain,” he said. Ravi talked through a list of Informatica customers who have multi-data-domain MDM in production today. I can’t tell you the customer names, but they’re in banking, high-tech manufacturing, food services, and government agencies. All began with one domain, then extended to others. Also, all deployed MDM in combination with their data integration and/or data quality solutions, which shows how MDM is interrelated with other data management disciplines. The list Ravi shared with me gives me confidence that more and more user organizations are succeeding with multi-data-domain MDM – and that’s a good thing.
But the future of multi-data-domain MDM isn’t totally rosy. At TDWI’s Solution Summit on Master Data in March 2011, we also heard from Evan Levy of Baseline Consulting (recently acquired by DataFlux). He said: “Multi-data-domain MDM is technically feasible today. But it makes no sense in terms of sponsorship, funding, or satisfying departmental and application-specific requirements.”
I agree with Evan’s second point wholeheartedly, because a number of users have explained to me over the years that sales and marketing need to own customer-data-domain MDM, even if it’s only applied within their customer-base segmentation, direct marketing, and sales contact applications. Likewise, the supply chain managers want to fund and control product and partner reference data. The financial guys have their own requirements for financial data, and HR has MDM requirements for employee data. All too often, these departments aren’t too keen on sharing.
But I don’t fully agree with Evan’s first point. I think there ARE situations where multi-data-domain MDM makes perfect sense, and I noted those earlier in this blog. In my experience, a common tipping point is often when technical and business people have reached maturity with customer-data MDM, and they realize they can’t get to the next level without consistent and integrated MDM about other domains.
Another way to put it is that the single view of the customer gets broader as it matures, thus demanding information from other domains. Yet another way to think of it is that multi-data-domain MDM often comes in a later life cycle stage, after single-data-domain MDM has proved the concept of MDM, in general. And much of the success of multi-data-domain MDM -- in my opinion -- is not about technology. Success depends on having a corporate culture that demands data sharing in support of cross-functional coordination.
So, folks, what do you think about the state of multi-data-domain MDM? Let me know. Thanks!
(Note that TDWI will repeat (for the fourth year) its Solution Summit on Master Data, Quality, and Governance, coming up March 4-6, 2012 in Savannah, Georgia. Mark your calendar!)
Posted by Philip Russom, Ph.D. on August 24, 2011
Blog by Philip Russom
Research Director for Data Management, TDWI
While researching a new TDWI report on big data analytics, I’ve run across a few BI professionals who are concerned about the seeming lack of data preparation that’s common with some forms of advanced analytics. Allow me a moment to sort this out.
On the one hand, all of us in BI and data warehousing are indoctrinated to believe that the data of an enterprise data warehouse (EDW) (and hence the data that feeds into reports) must be absolutely pristine, integrated and aggregated properly, well-documented, and modeled for optimization. To achieve these data requirements, BI teams work hard on extract, transform, and load (ETL), data quality (DQ), metadata and master data management (MDM), and data modeling. These data preparation best practices make perfect sense for the vast majority of the reports, dashboards, and OLAP-based analyses that are refreshed from data warehouse data. For those products of BI, we want to use only well-understood data that’s brought as close to perfection as possible. And many of these become public documents, where problems with data could be dire for a business.
On the other hand, preparing data for advanced analytics requires very different best practices – especially when big data is involved. The product of advanced analytics is insight, typically an insight about bottom-line costs or customer churn or fraud or risk. These kinds of insights are never made public, and the analytic data they’re typically based on doesn’t have the reuse and publication requirements that data warehouse data has. Therefore, big data for advanced analytics rarely needs the full complement of ETL, data quality, metadata, and modeling we associate with data from an EDW.
In fact, if you bring to bear the full arsenal of data prep practices on analytic datasets, you run the risk of reducing their analytic value. This is ironic, because we usually think of ETL, DQ, and data modeling as adding value to data, not subtracting it. So, how can they harm analytic data?
To answer that question, let’s first take a look at so-called “advanced analytics.” This collection of analytic techniques would be better called “discovery analytics,” because that’s what users do with it. A business analyst or similar user applies techniques like data mining, statistical analysis, complex SQL, MapReduce, and natural language processing to discover facts about the business that no one knew before. For example, you might discover the root cause of the latest form of customer churn. Or you might find a cluster of transactions that indicate a new kind of fraud. Or you could stumble onto an untapped customer segment.
In general, you can’t discover those entities and facts from the overly studied, calculated, modeled, and aggregated data of an EDW. Instead, you need big data, with lots of granular detail, typically in the schema of the source systems it came from. Some forms of analytics actually thrive on questionable data in poor condition. For example, analytic applications for fraud detection may depend on outliers and non-standard data as indications of fraud. And the insights of discovery analytics often focus on narrow slices of the business, like an obscure customer segment, time frame, group of shipments, transaction type, or risky neighborhood. These thin slices can easily disappear in an aggregation pass. Hence, if you apply ETL and DQ processes to big data, as you do for a data warehouse, you run the risk of stripping out the very nuggets that make big data a treasure trove for discovery-oriented advanced analytics. This is why the preparation of data for discovery analytics seems minimal (even slipshod) – often just extracts and table joins – compared to the full range of data prep applied to EDW data.
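To give one deliberately simplistic flavor of discovery analytics over granular data, the Python sketch below flags transaction amounts that are statistical outliers. The numbers and the two-standard-deviation rule are invented; real fraud analytics combine many such signals across far bigger datasets:
```python
import statistics

# Simplistic sketch of discovery analytics on granular transaction data:
# flag amounts that are statistical outliers. Data and threshold are
# hypothetical; real fraud models combine many features and techniques.
amounts = [42.10, 38.75, 41.00, 39.95, 40.50, 12500.00, 43.20, 37.60]

mean = statistics.mean(amounts)
stdev = statistics.stdev(amounts)

# Anything more than two standard deviations from the mean gets flagged.
outliers = [a for a in amounts if abs(a - mean) > 2 * stdev]
print("mean=%.2f stdev=%.2f outliers=%s" % (mean, stdev, outliers))
```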
Does this mean that we can throw out the sacrosanct best practices for ETL, DQ, metadata, MDM, and data modeling? No, of course not. Some organizations will simply need to suspend these for discovery analytics with big data—but only temporarily. Here’s a typical scenario.
After business analysts and other users have discovered what they’re looking for in big data, they need to take the discovery to the BI and DW team, so the results can be “institutionalized” in the EDW. For example, when discovery analytics reveals valuable items – like new forms of churn, customer segments, cost centers, etc. – these need to be represented by data structures in the EDW and reports, so that business people can track them regularly. At that point, the best practices of data preparation come back into play.
So, what do you think, folks? Let me know. Thanks!
Posted by Philip Russom, Ph.D. on July 12, 2011
I recently started work on a new TDWI Best Practices Report with the working title: Deep Analytics with Big Data. The report is a tad schizophrenic, in that it’s really about two things – big data and analytics – plus how the two have teamed up to create one of the most profound trends in business intelligence (BI) today. Let me share some of the thinking behind the schizophrenia. Please reply to this blog to tell me whether this makes sense or not.
Advanced Analytics
According to a recent TDWI survey, 38% of organizations surveyed are practicing advanced analytics today. But 85% say they’ll do it within 3 years!
Why the rush to advanced analytics? First, change is rampant in business; we’ve been through multiple “economies” in recent years. And analytics helps us discover what changed plus how we should react. Second, there are still many business opportunities to leverage -- even in the recession -- and more will come as we finally crawl out of the recession. To that end, advanced analytics is the best way to discover new customer segments, identify the best suppliers, associate products of affinity, understand sales seasonality, and so on. For these reasons, TDWI has seen an explosion of user organizations implementing analytics in recent years.
But note that user organizations are implementing specific forms of analytics, particularly what is sometimes called advanced analytics. This is a collection of related techniques and tools, usually including predictive analytics, data mining, statistical analysis, and complex SQL. We might also extend the list to cover data visualization, artificial intelligence, natural language processing, and database methods that support analytics.
All these techniques have been around for years, many of them appearing in the 1990s. The thing that’s different now is that far more user organizations are actually using them. That’s because most of these techniques adapt well to very large, multi-terabyte datasets, with minimal data preparation. And that brings us to big data.
Big Data
Big data can be defined simply as multi-terabyte datasets. And this makes sense, given that corporations, government agencies, and other user organizations are generating and retaining more data than ever before. Soon enough, big data will involve petabytes, not terabytes. Yet big data also involves big complexity, namely many diverse data sources (both internal and external), data types (structured, unstructured, semi-structured), and indexing schemes (relational, multidimensional, NoSQL).
Occasionally, I hear a user complain about the problems of storing and managing big data. Much more often, however, I hear people talk about what an extraordinary opportunity big data is. That’s because, for the kinds of discovery and prediction that most advanced analytic techniques enable, big data is truly a treasure trove of information that merits leverage for business advantage. And that brings us to the intersection mentioned in the title of this blog.
Advanced Analytics and Big Data: Why put them together?
Here are a few reasons:
Big data yields gigantic statistical samples. Most tools designed for data mining or statistical analysis tend to be optimized for large datasets. In fact, the general rule is that the larger the data sample, the more accurate are the statistics and other products of the analysis. Instead of mining and statistical tools, I regularly find users generating or hand-coding complex SQL, which parses big data in search of just the right customer segment, churn profile, or excessive operational cost. The newest generation of data visualization tools and in-database analytic functions likewise operate on big data.
Analytic tools and databases can now handle big data. And they can execute big queries and parses in record time. Recent generations of vendor tools and platforms have raised us onto a new plateau of performance that’s very compelling for applications involving big data.
There’s a lot to learn from messy data, as long as it’s big. Most modern tools and techniques for advanced analytics and big data are very tolerant of raw source data, with its transactional schema, non-standard data, and poor-quality data. That’s a good thing, because discovery and predictive analytics depend on lots of details, even questionable data. For example, analytic applications for fraud detection often depend on outliers and non-standard data as indications of fraud. If you apply ETL and DQ processes to big data, as you do for a data warehouse, you’ll strip out the very nuggets that make big data a treasure trove for advanced analytics.
Big data is a special asset that merits leverage. And that’s the real point of Deep Analytics with Big Data. The new technologies and new best practices are fascinating, even mesmerizing. And there’s a certain macho coolness to working with dozens of terabytes. But don’t do it for the technology. Put Big Data and Advanced Analytics together for the new insights they give the business.
So, what do you think? Does the intersection of Big Data and Advanced Analytics make sense to you? Let me know. Thanks!
To learn more, register to attend a TDWI Webinar on this topic. “The Intersection of Big Data and Analytics,” May 5, 2011 at noon eastern time. http://bit.ly/eh5YA9
Posted by Philip Russom, Ph.D. on April 25, 2011
A few days ago, I presented a TDWI Webinar based on my newly published TDWI Best Practices report about “Next Generation Data Integration” (NGDI). Almost three hundred people attended the broadcast, and (with such a large turnout) I got a ton of great questions from the audience about data integration (DI).
I’d like to share some of those questions with you (and my responses to Webinar attendees who asked them), as a way of expanding and clarifying the research findings of the report. If you care about DI, this should be interesting for you.
Concerning bulk upload, should we use a batch upload mechanism or Web services?
It depends on the dataset being bulk loaded. You should stick to your old reliable bulk loader for datasets that are very large, too large for a service bus, don’t have an immediate delivery requirement, or demand multiple complex passes (as many multidimensional structures do, when being loaded into a data warehouse). Most services, messages, or events used in a DI context handle time-sensitive data, which is delivered faster over a message or service bus. Also, real-time DI often enables Operational Business Intelligence (OpBI), where data is drawn frequently from ERP, CRM, and other operational applications, then loaded into a warehouse, mart, or other BI data store. OpBI may also use DI to publish improved data back to those applications. Data from many operational applications (especially SAP) is best extracted via the application layer, and services and messages usually support such an interface. From these examples, you can see that the old (bulk loaders) and the new (services) intermingle in the newest DI generation.
Do staging tables play an important role in DI?
Yes. The newest generation of DI still relies on older, tried-and-true designs and DI architectures. And these typically have a variety of data landing and data staging areas, including databases (like operational data stores) and tables (whether physically in the data warehouse or external to it). One new spin on this is that 64-bit computing and very large memory spaces in server hardware now enable more effective DI pipes. This is where data is staged and processed in server memory, not landed to disk. This both speeds up DI transformational processing and boosts scalability for large data volumes. For many organizations, NGDI is about adjusting (not abandoning) useful best practices like this to take advantage of newly available platform capabilities.
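Here is a minimal, hypothetical Python sketch of the DI pipe idea: each stage streams records to the next in memory, so nothing is landed to disk between transformations. Vendor DI engines implement this with parallel, pipelined operators rather than simple generators, but the concept is the same:
```python
# Minimal sketch of an in-memory "DI pipe": each stage streams records to
# the next without landing intermediate results to disk. Real DI engines
# pipeline and parallelize these operators; this only shows the concept.
def extract(rows):
    for row in rows:
        yield row

def cleanse(records):
    for rec in records:
        rec = dict(rec)
        rec["region"] = rec["region"].strip().upper()
        yield rec

def aggregate(records):
    totals = {}
    for rec in records:
        totals[rec["region"]] = totals.get(rec["region"], 0) + rec["amount"]
    return totals

source = [{"region": " east ", "amount": 100},
          {"region": "West",   "amount": 250},
          {"region": "EAST",   "amount": 75}]

print(aggregate(cleanse(extract(source))))  # {'EAST': 175, 'WEST': 250}
```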
Is DI architecture and information architecture the same thing?
No, they’re different. Information architecture is usually about the data models and schema within individual enterprise databases, plus data dependencies across multiple ones. DI architecture concerns the design of data flows, plus development standards (like preferred interfaces for specific applications). For DI, hub-and-spoke is the most common architecture, where a vendor’s DI tool or a control server (in home-grown DI solutions) is equivalent to a hub. But point-to-point interfaces still abound in DI jobs, and DI over a bus is subject to whatever the bus requires. My report explains that designing and using just the right DI architecture has become a critical success factor for satisfying next-generation requirements, like scalability, real time, governance, and DI team collaboration.
Where do you see ERP choices within the context of NGDI?
In my world, Operational Business Intelligence (OpBI) has become quite common. OpBI requires much from a DI tool. The DI tool has to support feature-rich interfaces to ERP and other application types. The DI tool must have optimization to draw data fast, frequently, and non-invasively from ERP modules and applications. And the DI tool must understand ERP data structures and function calls to make sense of ERP data, before integrating it elsewhere. OpBI and other real-time business practices wouldn’t be possible without real-time DI. In fact, my report shows that various real-time DI functions are the ones users will increase the use of most over the next three years.
Other common DI practices involving ERP include synchronizing customer data (and other data domains, especially product data) across multiple ERP modules and instances. Synchronizing reference data is a similar practice, one that’s growing quickly. Since some ERPs are almost impermeable, DI is regularly called in to assist with data access for data quality. This kind of coordination between DI and DQ is one of the hallmarks of NGDI.
Do you think certain aspects of traditional EAI are going to be part of NGDI?
Well, first of all, I regularly find some DI functions executed over EAI and similar buses in user organizations that have already made a substantial investment in a robust EAI infrastructure. Firms in financial and insurance industries are typical examples. Second, I think what’s happening in such firms is that DI is simply leveraging more deeply an existing infrastructure, just as other users, applications, and tools are. Third, DI is being driven to EAI, in situations where EAI has better interfaces (especially to packaged applications) or certain time-sensitive data has a real-time requirement (for which EAI messages are easily configured). Even so, there’s still a need for standard data interfaces over the enterprise LAN.
Any metrics around how much operational cost is associated with near real-time data integration vs the traditional batch model?
Ten years ago, real-time DI via EAI was possible, but it usually required the purchase of extra tools. Plus, real-time functions in tools and applications weren’t very robust, so an administrator had to watch and tweak them constantly. These two characteristics drove up the cost. Luckily, a lot of RT functionality is built into today’s applications, databases, and DI tools. Many firms have a robust EAI or service bus infrastructure that DI can tap for real time. For firms that have kept their enterprise software and infrastructure up-to-date, real-time DI is quite accessible, reliable, and inexpensive, as compared to the recent past. But that’s with EAI in mind. From a different direction, batch processing has improved, too. It may be preferred in the form of so-called micro-batches for frequent intra-day extracts that needn’t be truly RT.
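For readers who haven’t seen a micro-batch in code, here is a hedged Python sketch of an intra-day loop that pulls only rows changed since a saved watermark. The fetch and load functions are hypothetical stand-ins for whatever extract and load mechanisms a DI tool or hand-coded job actually provides:
```python
import time
from datetime import datetime, timedelta

# Hypothetical sketch of a micro-batch loop: every cycle, pull only the
# rows changed since the last watermark and hand them to the load step.
def fetch_changed_rows(since):
    # Placeholder: in practice this would query a change table, CDC log,
    # or last-modified timestamp column in the source system.
    return [{"id": 1, "updated_at": datetime.now()}]

def load_into_warehouse(rows):
    print("loading %d rows" % len(rows))

watermark = datetime.now() - timedelta(minutes=15)
for _ in range(3):            # three illustrative cycles; real jobs run on a scheduler
    rows = fetch_changed_rows(watermark)
    if rows:
        load_into_warehouse(rows)
        watermark = max(r["updated_at"] for r in rows)
    time.sleep(1)             # interval shortened for the sketch
```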
Can you expand on RT event processing, including contexts for applicability?
You probably don’t want to handle just any kind of event via a DI tool. Instead, some kind of “complex event” benefits from DI processing. A complex event is actually multiple events, typically occurring at different times (even different months or years) that need to be correlated. ETL-ish DI can access the many diverse data sources and data models where complex data events may be managed. Today, I almost exclusively find federal intelligence or security agencies doing this, to recognize and quantify security threats. The TSA and Coast Guard come to mind. But it’s just a matter of time before such DI-enabled practices are common with customer events in for-profit corporations.
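The toy Python sketch below shows what correlating a complex event means: several primitive events about the same entity, possibly far apart in time, are grouped and tested against a pattern. The event types and the rule are invented for illustration; real complex event processing engines do this continuously and at much larger scale:
```python
from collections import defaultdict
from datetime import datetime

# Toy sketch of complex-event correlation: group primitive events by entity
# and flag entities whose event sequence matches a (hypothetical) pattern.
events = [
    {"entity": "vessel-42", "type": "port_call",    "ts": datetime(2011, 1, 5)},
    {"entity": "vessel-42", "type": "cargo_change", "ts": datetime(2011, 3, 2)},
    {"entity": "vessel-42", "type": "port_call",    "ts": datetime(2011, 9, 20)},
    {"entity": "vessel-77", "type": "port_call",    "ts": datetime(2011, 2, 1)},
]

by_entity = defaultdict(list)
for ev in sorted(events, key=lambda e: e["ts"]):
    by_entity[ev["entity"]].append(ev["type"])

# Hypothetical rule: a cargo change sandwiched between two port calls,
# even months apart, is worth a closer look.
for entity, seq in by_entity.items():
    for i in range(len(seq) - 2):
        if seq[i:i + 3] == ["port_call", "cargo_change", "port_call"]:
            print("correlated complex event for", entity)
```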
CONCLUSION
If you have a question or answer about Next Generation Data Integration (or a reaction to one presented above), please share them by responding to this blog.
Register for and replay the TDWI Webinar these questions came from at
http://tdwi.org/webcasts/2011/04/next-generation-data-integration.aspx?tc=page0
Download a free copy of the TDWI Best Practices Report titled Next Generation Data Integration, at http://tdwi.org/research/list/tdwi-best-practices-reports.aspx
Find tweets about NGDI by searching Twitter.com for the hash tag #NGDI.
Posted by Philip Russom, Ph.D. on April 19, 2011
I transcended time and space earlier this week when I attended Hadoop World in New York City.
It started Monday evening. After taking a high-speed train from Boston, I emerged from the bowels of Penn Station onto the bright lights and bustling streets of mid-town Manhattan. The pavement was wet from a passing rain and lightning pulsed in the distant sky, framed by the city’s cavernous skyscrapers. I felt like I had entered a Hollywood set for an apocalyptic movie. But that was just the beginning.
Invigorated by the city’s pulsing energy, I decided to walk 15 blocks to my hotel. Halfway there, the winds picked up, the muted lightning roared to life, and rain scoured the streets in endless waves. I ducked under a large hotel canopy just in time to see hail the size of shooter marbles pelt everything in sight. After 15 minutes, the deluge subsided. But by the time I reached my hotel, I was soggy and stunned.
Welcome Aboard!
The next morning, as I listened to the proceedings from Hadoop World, I realized that the prior night’s surreal weather was a fitting prelude to the conference—at least for me. Hadoop World was a confab for programmers—almost 1,000 of them. As a data guy, it felt like I had been transported to a parallel universe where the people looked and acted the same but spoke a completely different language. But what I did understand, I liked.
With Hadoop, it seems that the application community finally discovered data and its potential to make businesses smarter. “Hadoop is a high value analytics engine for today’s businesses,” said Mike Olson, during his kickoff keynote. Mike is CEO and Founder of Cloudera, an open source provider of Hadoop software and services and host of the event. Following Olson on the stage was Tim O’Reilly, founder of O’Reilly Media, a long-time high-tech luminary and open source proponent. He said, "We are the beginning of an amazing world of data-driven applications. It's up to us to shape the world."
It was wonderful to see the developer community discover data in all its glory. To my fellow developers, I say, “Welcome aboard!” We’re all on the same page now.
Fathoming Hadoop
Hadoop is one of the first attempts by the developer community to get their arms around data in a way that conforms to their skills, knowledge, and culture. From a data guy’s perspective, Hadoop is clunky, slow, and woefully immature. But it does have advantages. As a result, it’s already popping up in corporate data environments as a complement to analytical databases. For example, some leading-edge companies are using Hadoop to process and store large volumes of clickstream and sensor data that they then feed into analytical databases for query processing.
So what is Hadoop? It might be easier to say what it's not.
· Hadoop is not a database; it’s a distributed file system (Hadoop Distributed File System or HDFS) that scales linearly across commodity servers. It is also a programming model (MapReduce) that enables developers to build applications in virtually any language they want and run them in parallel across large clusters.
· Hadoop is not a transactional system; it’s a batch-oriented system that runs hand-crafted MapReduce programs. You are not going to run iterative queries in Hadoop.
· Hadoop does not support random data access; it reads and writes all data sequentially, which makes it tortuously slow for tactical updates and queries and mixed workload applications.
Today, Hadoop shines as an infinitely scalable data processing environment for handling huge volumes of data that would be prohibitively expensive to store and analyze in a traditional relational database or even a data warehousing appliance. Hadoop lets companies capture and store all their data—structured, semi-structured, and unstructured—without having to archive or summarize the data. Consequently, some companies, such as Comscore and CBS Interactive, use Hadoop as a massive staging area to capture, store, and prepare large volumes of data for delivery to downstream analytic structures.
The main advantages of Hadoop are:
1. Open Source. The software is free. And free is good compared to spending millions of dollars on a relational database to handle tens of terabytes to petabytes of data (if it can). You can download individual components from the Apache Software Foundation, or purchase a “distribution” from third-party providers, such as Cloudera or IBM. A distribution is a package of Hadoop-related applications that are tested to ensure compatibility and stability and delivered with support and professional services on a subscription basis.
2. Linear Scalability. Hadoop is an MPP system that runs on commodity servers. It scales linearly as you add more servers. It has minimal overhead compared to relational databases so it offers superior scalability.
3. Streaming. Hadoop is a file system that does not require specialized schema or normalization to capture and store data or a special language to access it. Therefore, Hadoop makes it possible to perform (high-speed) reads and writes. In addition, a new application called Flume lets Hadoop consume streaming event data. In other words, it’s easy to get large volumes of data in and out of Hadoop.
4. Unstructured data. Because of its schema-less design, Hadoop and MapReduce work well on any type of data. MapReduce interprets data at run time based on the keys and values defined in the MapReduce program. Thus, a developer can design the program to work against structured, semi-structured, or even unstructured data, such as images or text. (A small word-count sketch after this list shows the key/value idea in code.)
5. Minimal Administration. Hadoop automatically handles node failures, making it easy to administer large clusters of machines and write parallelized programs that run against the cluster.
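Because MapReduce interprets data at run time through the keys and values a program emits, here is a minimal, hypothetical word-count sketch written in the Hadoop Streaming style, where the mapper and reducer are plain Python scripts that read stdin and write tab-separated key/value pairs. It illustrates the programming model only, not production code:
```python
#!/usr/bin/env python
# Minimal word-count sketch in the MapReduce style, written for Hadoop
# Streaming: the mapper and reducer read stdin and write tab-separated
# key/value pairs to stdout. Try it locally as a pipe to see the idea:
#   cat input.txt | python wordcount.py map | sort | python wordcount.py reduce
# Under Hadoop Streaming the same scripts would be passed as -mapper/-reducer.
import sys

def mapper():
    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word.lower())       # emit (word, 1) pairs

def reducer():
    current, count = None, 0
    for line in sys.stdin:                       # input arrives sorted by key
        word, n = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = word, 0
        count += int(n)
    if current is not None:
        print("%s\t%d" % (current, count))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```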
The Future of Hadoop
We are in the early days of Hadoop. There is a tremendous amount of excitement and energy around the initiative. The open source community is innovating quickly and bringing to market new capabilities that make Hadoop more database-like and a better partner in corporate data centers. For example, the community has introduced Hive, a SQL-like language that generates MapReduce programs under the covers and makes Hadoop appear more like a relational engine. It has also released Pig, a dataflow language that makes it easier to create MapReduce transformation logic than writing low-level Java.
Conversely, some BI vendors are adopting elements of Hadoop. For example, database vendors, such as Aster Data and Greenplum, have added support for MapReduce. And many relational database and ETL vendors, such as Pentaho and Talend, have implemented or announced bidirectional interfaces for moving data in and out of Hadoop. In addition, BI vendors, led by DataMeer, are working on JDBC interfaces to Hadoop so users can execute reports and queries against Hadoop from the confines of their favorite BI tool. Expect a slew of announcements this year from the likes of MicroStrategy, SAP BusinessObjects, IBM Cognos, and others supporting Hadoop.
It's clear that we’ve entered the era of big data analytics. And frameworks, such as Hadoop, are helping to advance our ability to generate valuable insights from large volumes of data and new data types. Just as exciting, the developer and data communities are converging to address large-scale data issues. And while our language and approaches may differ, it won’t be long before we all sing the same tune with the same words.
Posted on October 15, 2010