TDWI Blog

Data Management Blog Posts

See the most recent Data Management-related items below.


Dual BI Architectures: The Time Has Come

As a parent, by the time you have your second or third child, you know which battles to fight and which to avoid. It’s time we did the same in business intelligence (BI). For almost two decades we’ve tried to shoehorn both casual users and power users into the same BI architecture. But the two don’t play nicely together. Given advances in technology and the explosion in data volumes and types, it’s time we separate them and create dual BI architectures.

Mapping Architectures

Casual users are executives, managers, and front-line workers who periodically consume information created by others. They monitor daily or weekly reports and occasionally dig deeper to analyze an issue or get details. Generally, a well-designed interactive dashboard or parameterized report backed by a data warehouse with a well-designed dimensional schema is sufficient to meet these information needs. Business users who want to go a step further and build ad hoc views or reports for themselves and peers—whom I call Super Users—are best served with a semantic layer running against the same data warehouse.

Power users, on the other hand, explore data to answer unanticipated questions and issues. No predefined dashboard, report, or semantic layer is sufficient to meet their needs. They need to access data both in the data warehouse and outside of it, beyond the easy reach of most BI tools and predefined metrics and entities. They then need to dump the data into an analytical tool (e.g. Excel, SAS) so they can merge and model the data in novel and unique ways.

For years, we’ve tried to reconcile casual users and power users within the same BI architecture, but it’s a losing cause. Power users generate “runaway” queries that bog down performance in the data warehouse, and they generate hundreds or thousands of reports that overwhelm casual users. As a result, casual users reject self-service BI and revert back to old habits of requesting custom reports from IT or relying on gut feel. Meanwhile, power users exploit BI tools to proliferate spreadmarts and renegade data marts that undermine enterprise information consistency while racking up millions in hidden costs.

Time for a New Analytic Sandbox

Some forward-looking BI teams are now creating a separate analytic architecture to meet the needs of their most extreme power users. And they are relegating their data warehouses and BI tools to handle standard reporting, monitoring, and lightweight analysis.

Compared to a traditional data warehousing environment, an analytic architecture is much more free-form with fewer rules of engagement. Data does not need rigorous cleaning, mapping, or modeling, and hardcore business analysts don’t need semantic guardrails to access the data. In an analytic architecture, the onus is on the business analyst to understand source data, apply appropriate filters, and make sense of the output. Certainly, it is a “buyer beware” environment. As such, there may only be a handful of analysts in your company who are capable of using this architecture. But the insights they generate may make the endeavor well worth the effort and expense.

Types of Analytic Architectures

There are many ways to build an analytic architecture. Below are three approaches. Some BI teams implement one approach; others mix all three.

Physical Sandbox. One type of analytic architecture uses a new analytic platform—a data warehousing appliance, columnar database, or massively parallel processing (MPP) database—to create a separate physical sandbox for hardcore business analysts and analytical modelers. BI teams offload complex queries from the data warehouse to these turbocharged analytical environments, and they enable analysts to upload personal or external data to those systems. This safeguards the data warehouse from runaway queries and liberates business analysts to explore large volumes of heterogeneous data without limit in a centrally managed information environment.

Virtual Sandbox. Another approach is to implement virtual sandboxes inside the data warehouse using workload management utilities. Business analysts can upload their own data to these virtual partitions, mix it with corporate data, and run complex SQL queries with impunity. These virtual sandboxes require delicate handling to keep the two populations (casual and power users) from encroaching on each other’s processing territories. But compared to a physical sandbox, this approach avoids having to replicate and distribute corporate data to a secondary environment that runs on a non-standard platform. (A simple sketch of this approach follows the list.)

Desktop Sandboxes. Other BI teams are more courageous (or desperate) and have decided to give their hardcore analysts powerful, in-memory, desktop databases (e.g., Microsoft PowerPivot, Lyzasoft, QlikTech, Tableau, or Spotfire) into which they can download data sets from the data warehouse and other sources to explore the data at the speed of thought. Analysts get a high degree of local control and fast performance but give up data scalability compared to the other two approaches. The challenge here is preventing analysts from publishing the results of their analyses in an ad hoc manner that undermines information consistency for the enterprise.
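To make the virtual sandbox option concrete, here is a minimal sketch in Python against a generic DB-API connection. The schema, table, and user names are hypothetical, and the workload-management statement is shown only as a commented placeholder because the syntax differs by platform; treat this as an illustration of the pattern, not a recipe for any particular product.

```python
# Sketch: provision a "virtual sandbox" inside the warehouse for a power user.
# Assumes a PEP 249 (DB-API) connection; all object names are hypothetical and
# the placeholder style ("?") varies by driver.
import csv

def provision_sandbox(conn, analyst: str) -> None:
    """Create a private schema the analyst can load personal data into."""
    cur = conn.cursor()
    cur.execute(f"CREATE SCHEMA IF NOT EXISTS sandbox_{analyst}")
    cur.execute(f"GRANT ALL ON SCHEMA sandbox_{analyst} TO {analyst}")
    # Cap the analyst's resource consumption so runaway queries cannot starve
    # the casual-user workload. The exact statement depends on the platform's
    # workload-management utility (resource groups, workload classes, query
    # governors), for example:
    # cur.execute(f"ALTER USER {analyst} SET RESOURCE GROUP power_user_pool")
    conn.commit()

def load_personal_data(conn, analyst: str, csv_path: str) -> None:
    """Upload an analyst-supplied CSV so it can be mixed with corporate data."""
    cur = conn.cursor()
    cur.execute(f"CREATE TABLE IF NOT EXISTS sandbox_{analyst}.external_leads "
                "(lead_id INTEGER, region VARCHAR(50), score REAL)")
    with open(csv_path, newline="") as f:
        rows = [(int(r["lead_id"]), r["region"], float(r["score"]))
                for r in csv.DictReader(f)]
    cur.executemany(f"INSERT INTO sandbox_{analyst}.external_leads VALUES (?, ?, ?)", rows)
    conn.commit()

# The analyst can then join personal data with conformed warehouse tables:
EXPLORATORY_SQL = """
SELECT w.region, w.total_revenue, AVG(s.score) AS avg_lead_score
FROM dw.revenue_by_region AS w
JOIN sandbox_jdoe.external_leads AS s ON s.region = w.region
GROUP BY w.region, w.total_revenue
"""
```

The point of the sketch is the separation of duties: the warehouse team grants space and caps resources, while the analyst works inside the partition without touching conformed data.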

Dual, Not Dueling Architectures

As an industry, it’s time we acknowledge the obvious: our traditional data warehousing architectures are excellent for managing reports and dashboards against standard corporate data, but they are suboptimal for managing ad hoc requests against heterogeneous data. We need dual BI architectures: one geared to casual users that supports standard, interactive reports and dashboards and lightweight analyses; and another tailored to hardcore business analysts that supports complex queries against large volumes of data.

Dual architectures do not mean dueling architectures. The two environments are complementary, not conflicting. Although companies will need to invest additional time, money, and people to manage both environments, the payoff is worth the investment: companies will get higher rates of BI usage among casual users and more game-changing insights from hardcore power users.

Posted on September 30, 2010


The Evolution of Data Federation

Data federation is not a new technique. The notion of virtualizing multiple back-end data sources has been around for a long time, reemerging every decade or so with a new name and mission.

Database Gateways. In the 1980s, database vendors introduced database gateways that transparently query multiple databases on the fly, making it easier for application developers to build transaction applications in a heterogeneous database environment. Oracle and IBM still sell these types of gateways.

VDW. In the 1990s, vendors applied data federation to the nascent field of data warehousing, touting its ability to create “virtual” data warehouses. However, data warehousing purists labeled the VDW technology as “voodoo and witchcraft” and it never caught on, largely because standardizing disparate data from legacy systems was nearly impossible to do without creating a dedicated data store.

EII. By the early 2000s, with more powerful computing resources, data federation was positioned as a general-purpose data integration tool, adopting the moniker “enterprise information integration,” or EII. The three-letter acronym was designed to mirror ETL—extract, transform, and load—which had become the predominant method for integrating data in data warehouses.

Data Services. In addition, the rise of Web services and service-oriented architectures during the past decade gave data federation another opportunity. It got positioned as a data service, abstracting back-end data sources behind a single query interface. It is now being adopted by many companies that are implementing service-oriented architectures.

Data Virtualization. Today, data federation vendors prefer the label of data virtualization, capitalizing on the popularity of hardware virtualization in corporate data centers and the cloud. The term data virtualization reinforces the idea that data federation tools abstract databases and data systems behind a common interface.

Data Integration Toolbox

Over the years, data federation has gained a solid foothold as an important tool in any data integration toolbox. Companies use the technology in a variety of situations that require unified access to data in multiple systems via high-performance distributed queries. This includes data warehousing, reporting, dashboards, mashups, portals, master data management, service-oriented architectures, post-acquisition systems integration, and cloud computing.

One of the most common uses of data federation is augmenting data warehouses with current data from operational systems. In other words, data federation enables companies to “real-time-enable” their data warehouses without rearchitecting them. Another common use case is to support “emergency” applications that need to be deployed quickly and where the organization doesn’t have time or money to build a data warehouse or data mart. Finally, data federation is often used to create operational reports that require data from multiple systems.
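As a rough illustration of the first use case, the sketch below hand-rolls what a federation tool does behind a virtual view: historical rows come from the warehouse, today’s rows come straight from the operational system, and the two are merged at query time. The connections, tables, and columns are hypothetical, and a real federation product would also push down joins and optimize the distributed query.

```python
# Sketch: "real-time-enable" a warehouse query by federating it with the
# operational system. Assumes two DB-API connections; all names are hypothetical.
from datetime import date

def federated_daily_sales(dw_conn, ops_conn, region: str):
    """Union warehouse history with today's operational transactions."""
    today = date.today().isoformat()

    dw_cur = dw_conn.cursor()
    dw_cur.execute(
        "SELECT sale_date, SUM(amount) FROM dw.fact_sales "
        "WHERE region = ? AND sale_date < ? GROUP BY sale_date",
        (region, today))
    history = dw_cur.fetchall()

    ops_cur = ops_conn.cursor()
    ops_cur.execute(
        "SELECT order_date, SUM(order_total) FROM orders "
        "WHERE region = ? AND order_date = ? GROUP BY order_date",
        (region, today))
    current = ops_cur.fetchall()

    # One unified result set for the report or dashboard, no rearchitecting.
    return history + current
```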

A decade ago there were several pureplay data federation vendors, but now the only independent is Composite Software, which is OEM’d by several BI vendors, including IBM Cognos. Other BI vendors support data federation natively, including Oracle (OBIEE) and MicroStrategy. And many data integration vendors, including Informatica and SAP, have added data federation to their data integration portfolios.

Federation versus Integration

Traditionally, the pros and cons of data federation are weighed against those of data integration toolsets, especially when creating data warehouses. The question has always been “Is it better to build virtual data warehouses with federation tools or physical data marts and data warehouses with ETL tools?”

Data federation offers many advantages -- it’s a fast, flexible, low-cost way to integrate diverse data sets in real time. But data integration offers benefits that data federation doesn’t: scalability, complex transformations, and data quality and data cleansing.

But what if you could combine the best of these two worlds and deliver a data integration platform that offered data federation as an integrated module, not a bolt-on product? What if you could get all the advantages of both data federation and data integration in a single toolset?

If you could have your cake and eat it, too, you might be able to apply ETL and data quality transformations to real-time data obtained through federation tools. You wouldn’t have to create two separate semantic models, one for data federation and another for ETL; you could use one model to represent both modalities. Basically, you would have one tool instead of two tools. This would make it easier, quicker, and cheaper to apply both data federation and data integration capabilities to any data challenge you might encounter.

This seamless combination is the goal of some data integration vendors. I recently did a Webcast with Informatica, which shipped a native data federation capability this year that runs on the same platform as its ETL tools. This is certainly a step forward for data integration teams that want a single, multipurpose environment instead of multiple, independent tools, each with its own architecture, metadata, semantic model, and functionality.

Posted by Wayne Eckerson on June 15, 2010


Evolving Your BI Team from a Data Provider to a Solutions Provider

In her presentation on “BI Roadmaps” at TDWI’s BI Executive Summit last month, Jill Dyche explained that BI teams can either serve as “data providers” or “solutions providers.” Data providers focus on delivering data in the form of data warehouses, data marts, cubes, and semantic layers that can be used by BI developers in the business units to create reports and analytic applications. Solutions providers, on the other hand, go one step further by working hand-in-hand with the divisions to develop BI solutions.

I firmly believe that BI teams must evolve into the role of solutions provider if they want to succeed long term. They must interface directly with the business, serving as a strategic partner that advises the business on how to leverage data and BI capabilities to solve business problems and capitalize on business opportunities. Otherwise, they will become isolated and viewed as an IT cost-center whose mission will always be questioned and whose budget will always be on the chopping block.

Data Provisioning by Default. Historically, many BI teams become data providers by default because business units already have reporting and analysis capabilities, which they’ve developed over the years in the absence of corporate support. These business units are loath to turn over responsibility for BI development to a nascent corporate BI group that doesn’t know their business and wants to impose corporate standards for architecture, semantics, and data processing. Given this environment, most corporate BI teams take what they can get and focus on data provisioning, leaving the business units to weave gold out of the data hay they deliver.

Mired Down by Specialization

However, over time, this separation of powers fails to deliver value. The business units lose skilled report developers, and they don’t follow systematic procedures for gathering requirements, managing projects, and developing software solutions. They end up deploying multiple tools, embedding logic into reports, and spawning multiple, inconsistent views of information. Most of all, they don’t recognize the data resources available to them, and they lack the knowledge and skills to translate data into robust solutions using new and emerging BI technologies and techniques, such as OLAP cubes, in-memory visualization, agile methods, dashboards, scorecards, and predictive analytics.

On the flip side, the corporate BI team gets mired down with a project backlog that it can’t seem to shake. Adopting an industrialized assembly-line mindset, it hires specialists to handle every phase of the information factory (e.g., requirements, ETL, cube building, semantic modeling) to improve efficiency, yet it can’t accelerate development easily. Its processes have become too rigid and sequential. When divisions get restless waiting for the BI team to deliver, CFOs and CIOs begin to question their investments and put the BI team’s budget on the chopping block.

Evolving into Solutions Providers

Rethink Everything. To overcome these obstacles, a corporate BI team needs to rethink its mission and the way it’s organized. It needs to actively engage with the business and take some direct responsibility for delivering business solutions. In some cases, it may serve as an advisor to a business unit that has some BI expertise; in others, it may build the entire solution from scratch where no BI expertise exists. By transforming itself from a back-office data provider to a front-office solutions developer, a corporate BI team will add value to the organization and have more fun in the process.

It will also figure out new ways to organize itself to serve the business efficiently. To provide solutions assistance without adding budget, it will break down intra-organizational walls and cross-train specialists to serve on cross-functional project teams that deliver an entire solution from A to Z. Such cross-fertilization will invigorate many developers who will seize the chance to expand their skill sets (although some will quit when forced out of their comfort zones). Most importantly, they will become more productive and before long eliminate the project backlog.

A High Performance BI Team

For example, Blue Cross/Blue Shield of Tennessee has evolved into a BI solutions provider over the course of many years. BI is now housed in an Information Management (IM) organization that reports to the CIO and is separate from the IT organization. The IM group consists of three subgroups: 1) the Data Management group, 2) the Information Delivery group, and 3) the IM Architecture group.

  • The Data Management group is comprised of 1) a data integration team that handles ETL work and data warehouse administration and 2) a database administration team that designs, tunes, and manages IM databases.
  • The Information Delivery group consists of 1) a BI and Performance Management team, which purchases, installs, and manages BI and PM tools and solutions and provides training, and 2) two customer-facing solutions delivery teams that work with business units to build applications. The first is the IM Health Informatics team, which builds clinical analytic applications using reporting, OLAP, and predictive analytics capabilities, and the second is the IM Business Informatics team, which builds analytic applications for other internal departments (i.e. finance, sales, marketing).
  • The IM Architecture group builds and maintains the IM architecture, which consists of the enterprise data warehouse, data marts, and data governance programs, as well as closed loop processing and the integration of structured and unstructured data.

Collaborative Project Teams. Frank Brooks, director of data management and information delivery at BCBS of Tennessee, says that the IM group dynamically allocates resources from each IM team to support business-driven projects. Individuals from the Informatics teams serve as project managers, interfacing directly with the customers. (While Informatics members report to the IM group, many spend most of their time in the departments they serve.) One or more members from each of the other IM teams (data integration, database administration, and BI/PM) are assigned to the project team, where they work collaboratively to build a comprehensive solution for the customer.

In short, the BI team of BCBS of Tennessee has organized itself as a BI solutions provider, consolidating all the functions needed to deliver comprehensive solutions in one group, reporting to one individual who can ensure the various teams collaborate efficiently and effectively to meet and exceed customer requirements. BCBS of Tennessee has won many awards for its BI solutions and will be speaking at this summer’s TDWI BI Executive Summit in San Diego (August 16-18).

The message is clear: if you want to deliver value to your organization and assure yourself a long-term, fulfilling career at your company, then don’t be satisfied with being just a data provider. Make sure you evolve into a solutions provider that is viewed as a strategic partner to the business.

Posted by Wayne Eckerson on March 16, 2010


SAS Kicks It Up A Notch

With “analytics” a hot buzzword and the newest competitive differentiator in Corporate America, SAS is sitting in the catbird’s seat. With $3.1 billion in sales last year, SAS is by far the leader in analytics software and is prepared to leverage its position to compete head-on with the big boys.

It was clear after spending a day with SAS executives in Steamboat Springs, Colorado, that SAS is thinking as big as the towering Rocky Mountains that surrounded our resort hotel. Specifically, SAS is focused on competing with IBM. That’s partly because IBM just acquired its nearest analytics competitor, SPSS, and is now heavily marketing its “Smart Analytics System” initiative, which enables customers to purchase a single integrated analytical solution consisting of hardware, software, tools, consulting services, and support via a single SKU and have it delivered in two weeks. (See “IBM Rediscovers Itself,” August 2009.)

To bulk up against Big Blue and its 4,000 consultants, SAS has teamed up with Accenture, which is also feeling the heat from IBM, but in consulting services. Together, SAS and Accenture will spend $50 million initially to develop new analytics products for six industries. For its part, Accenture sees analytics as a big growth area, and partnering with the top analytics software vendor makes sense. And to beef up its corporate image, SAS is building a new 280,000-square-foot executive briefing center with 690 offices, two auditoriums, and a full café to impress analytics prospects.

SAS has never been known for its partnering ability, but the stakes are high and both SAS and Accenture are motivated by a common, fierce competitor. In the past two years, SAS has honed its partnering skills by investing serious time and money with Teradata, the fruits of which are starting to pay dividends in six-figure sales deals. And now, SAS is aggressively partnering with a slew of upstart database vendors to move analytics into the database, even IBM DB2. This is an interesting surround strategy for SAS: wherever IBM goes, it will find SAS already embedded in the customer’s database of choice. Not bad!

Other highlights from the SAS Analyst Day:

- SAS executives believe social media analysis is a big potential growth area for the company, leveraging its strength in analytics as well as its 2008 acquisition of Teragram, a leader in natural language processing and advanced linguistic technology. This is a natural market for SAS to dominate, although it faces a slew of nimble startups and established players.

- SAS’ data integration products, which are now housed under the DataFlux subsidiary, were the fastest-growing parts of its portfolio, with more than 50 percent growth in some markets. While most of the products are not new, SAS executives say the growth comes from the fact that DataFlux provides vendor-neutral data integration products and so is not tied to the SAS installed base. Plus, companies are mining ever-growing volumes of data and are recognizing the need for clean, integrated data. “More companies view fact-based decision making as a key to success and are coming to recognize that clean, integrated data is critical for delivering insights,” says Jim Davis, senior vice president and chief marketing officer.

- SAS is no longer overlooking the small- and medium-sized business (SMB) market, ramping up hosted offerings for customer intelligence and other applications that smaller companies can access. In 2009, it acquired 228 new SMB customers.

- SAS has an active hosting business for analytics customers who don’t want to provision servers in their own environment. SAS’ on-demand revenues grew 30% and its number of customers increased 43% from 2008 to 2009. SAS is building a huge new hosting center, so expect more on-demand offerings from SAS in the months ahead.

Posted by Wayne Eckerson on March 7, 2010


Sleep Well at Night: Abstract Your Source Systems

It’s odd that our industry has established a best practice for creating a layer of abstraction between business users and the data warehouse (i.e., a semantic layer or business objects), but we have not done the same thing on the back end.

Today, when a database administrator adds, changes, or deletes fields in a source system, it breaks the feeds to the data warehouse. Usually, source system owners don’t notify the data warehousing team of the changes, forcing us to scramble to track down the source of the errors, rerun ETL routines, and patch any residual problems before business users awake in the morning and demand to see their error-free reports.

It’s time we get some sleep at night and create a layer of abstraction that insulates our ETL routines from the vicissitudes of source systems changes. This sounds great, but how?

Insulating Netflix

Eric Colson, whose novel approaches to BI appeared two weeks ago in my blog “Revolutionary BI: When Agile is Not Fast Enough,” has found a simple way to abstract source systems at Netflix. Rather than pulling data directly from source systems, Colson’s BI team pulls data from a file that source systems teams publish to. It’s kind of an old-fashioned publish-and-subscribe messaging system that insulates both sides from changes in the other.

“This has worked wonderfully with the [source systems] teams that are using it so far,” says Colson, who believes this layer of abstraction is critical when source systems change at breakneck speed, like they do at Netflix. “The benefit for the source systems team is that they get to go as fast as they want and don’t have to communicate changes to us. One team migrated a system to the cloud and never even told us! The move was totally transparent.”

On the flip side, the publish-and-subscribe system relieves Colson’s team of having to 1) access source systems, 2) run queries on those systems, 3) know the names and logic governing tables and columns in those systems, and 4) keep up with changes in those systems. The team also gets much better quality data from source systems this way.

Changing Mindsets

However, Colson admits that he might get push back from some source systems teams. “We are asking them to do more work and take responsibility for the quality of data they publish into the file,” says Colson. “But this gives them a lot more flexibility to make changes without having to coordinate with us.” If the source team wants to add a column, it simply appends it to the end of the file. 

This approach is a big mindset change from the way most data warehousing teams interface with source systems teams. The mentality is: “We will fix whatever you give us.” Colson’s technique, on the other hand, forces the source systems teams to design their databases and implement changes with downstream analysis in mind. For example, says Colson, “they will inevitably avoid adding proprietary logic and other weird stuff that would be hard to encapsulate in the file.”
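Here is a minimal sketch of that kind of file hand-off, assuming a simple CSV contract (the file name, columns, and delimiter are my own stand-ins, not Netflix’s actual format). The consumer reads by column name, so a column appended to the end of the file never breaks the warehouse load.

```python
# Sketch of a publish-and-subscribe file hand-off between a source-system team
# (publisher) and the BI team (subscriber). File and column names are hypothetical.
import csv

def publish_snapshot(path: str, rows: list, columns: list) -> None:
    """Source-system side: write the agreed columns, plus any new ones at the end."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=columns, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)

def consume_snapshot(path: str, wanted=("account_id", "plan", "signup_date")):
    """BI side: pull only the contracted columns; extra trailing columns are ignored."""
    with open(path, newline="") as f:
        for record in csv.DictReader(f):
            yield {col: record[col] for col in wanted}

# Example hand-off: the source team appended a column; the load is unaffected.
publish_snapshot(
    "subscriptions_2010-02-08.csv",
    rows=[{"account_id": "A1", "plan": "standard", "signup_date": "2010-02-01",
           "new_experimental_flag": "x"}],
    columns=["account_id", "plan", "signup_date", "new_experimental_flag"])
for row in consume_snapshot("subscriptions_2010-02-08.csv"):
    print(row)   # {'account_id': 'A1', 'plan': 'standard', 'signup_date': '2010-02-01'}
```

The design choice worth noting is that the file, not the source database, is the contract: each side can change its internals as fast as it likes, as long as the published columns keep their names and meanings.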

Time to Deploy

Call me a BI rube, but I’ve always assumed that BI teams by default create such an insulating layer between their ETL tools and source systems. Perhaps for companies that don’t operate at the speed of Netflix, ETL tools offer enough abstraction. But, it seems to me that Colson’s solution is a simple, low-cost way to improve the adaptability and quality of data warehousing environments that everyone can and should implement.

Let me know what you think!

Posted by Wayne Eckerson on February 8, 2010


IBM and Informatica Acquire MDM Capabilities

Special Note: This analysis was written by Philip Russom, Wayne's colleague in TDWI's research department who covers MDM and data integration.

I don’t know about you, but when I read two similar announcements from competing software vendors, delivered at pretty much the same time, I can’t help but compare them. So that’s what I’m thinking about today (February 3, 2010), after hearing that IBM intends to acquire Initiate Systems. This bears strong resemblance to Informatica’s announcement a few days ago (on January 28, 2010) about their completion of the acquisition of Siperian.

If you work in data management or a related field, these are noteworthy announcements coming from noteworthy vendors, and, therefore, worth understanding. So let me help you by making some useful comparisons – and some differentiations that you may not be aware of. I’ll organize my thoughts around some questions I’ve just seen in the blogosphere and my Twitter deck.

1 – Are both acquisitions about MDM?

Well, sort of. Siperian is well-known for its master data management (MDM) solution. It has attained one of the holy grails among MDM solutions, in that it works with multiple data domains (that’s data about customers, products, financials, and other domains).

Initiate, on the other hand, is well-known for its identity resolution hub. Initiate’s identity resolution capabilities are commonly embedded within applications for MDM and customer data integration (CDI). The way that I think of it, Initiate’s hub isn’t for MDM per se, but it can improve MDM when embedded within it.

At this point, I need to cycle back to Siperian and point out that it, too, provides identity resolution capabilities. And I forgot to mention that Initiate also has some MDM capabilities. You could say that Siperian is mostly MDM, but with identity resolution and other capabilities, whereas Initiate is mostly about identity resolution, but with MDM and other capabilities.

2 – So, the two acquisitions are about identity resolution?

Yes, but to varying degrees. For example, IBMers were very clear that their primary interest in Initiate is its ability to very accurately match data references to patients in healthcare and citizens in government. IBM’s campaign for a Smarter Planet has strong subsets focused on healthcare and government, two industries where Initiate has reference clients doing sophisticated things with identity resolution. My impression is that IBMers are hoping Initiate’s identity resolution functionality will help them sell more products and services into these industries.

Returning to Informatica and Siperian, let’s recall that for years now the Siperian hub has been integrated with Informatica PowerCenter (similar to pre-existing integration among IBM and Initiate products). Among other things, this integration enables Siperian’s identity resolution functions to be embedded within the PowerCenter platform under the name Informatica Identification Resolution. Hence, identity resolution was one of the key capabilities paving the path to this acquisition.

3 – What do these acquisitions mean for IBM and Informatica?

As noted, IBM is counting on the Initiate product to help their campaigns for Smarter Healthcare and Smarter Government. Informatica has now filled the largest hole in its otherwise comprehensive product line, filling it with one of the better tools available via acquisition. Both IBM and Informatica are aggressively building out their portfolios of diverse data management tools, driven both by user demand and competitive pressures. Since both have customers with growing demands for more diverse data management tool types, both will have no trouble cross-selling the new tools to their existing customer bases, as well as selling their older products to the newly acquired customer bases.

4 – What do these acquisitions mean for technical users?

In my experience, Informatica and IBM both have rather faithful customers, in that they tend to get most of their data management tools from a primary supplier. Technical users from both customer bases now have more functionality available from a preferred technology supplier.

But these aren’t just any data management functions. The two acquisitions focus the spotlight on two of the hottest functions today, in terms of user organizations adopting them, namely: MDM and identity resolution. More than ever, organizations need trusted data, in support of regulatory reporting, compliance, business intelligence, analytics, operational excellence, and other data-driven requirements. MDM and identity resolution are key enablers for these requirements, so it’s no surprise that two leading vendors have chosen to acquire these at this time.

Posted by Wayne Eckerson on February 3, 2010


Operational BI: To DW or Not?

Operational business intelligence (BI) means many things to many people. But the nub is that it delivers information to decision makers in near real time, usually within seconds, minutes, or hours. The purpose is to empower front-line workers and managers with timely information so they can work proactively to improve performance.

Key Architectural Decision. This sounds easy, but it’s hard to do. Low-latency or operational BI systems have a lot of moving parts, and there is not much time to recover from errors, especially in high-volume environments. The key decision you need to make when architecting a low-latency system is whether to use the data warehouse (DW) or not. The ramifications of this decision are significant.

On one hand, the DW will ensure the quality of low-latency data; but doing so may disrupt existing processes, add undue complexity, and adversely impact performance. On the other hand, creating a stand-alone operational BI system may be simpler and provide tailored functionality and higher performance, but potentially creates redundant copies of data that compromise data consistency and quality.

So take your pick: either add complexity by rearchitecting the DW or undermine data consistency by deploying a separate operational BI system. It’s kind of a Faustian bargain, and neither option is quick or cheap.

Within the DW

If you choose to deliver low-latency data within your existing DW, you have three options:

1. Mini Batch. One option is simply to accelerate ETL jobs by running them more frequently. If your DW supports mixed workloads (e.g. simultaneous queries and updates), this approach allows you to run the DW 24x7. Many start by loading the DW hourly and then move to 15-minute loads, if needed. Of course, your operational systems may not be designed to support continuous data extracts, so this is a consideration. Many companies start with this option since it uses existing processes and tools, but just runs them faster.

2. Change Data Capture. Another option is to apply change data capture (CDC) and replication tools, which extract new records from system logs and move them in real time to an ETL tool, staging table, flat file, or message queue so they can be loaded into the DW. This approach minimizes the impact on both your operational systems and DW since you are only updating records that have changed instead of wiping the slate clean with each load. Some companies combine both mini-batch and CDC to streamline processes even further. (A simple sketch of this incremental pattern follows the list.)

3. Trickle Feed. A final option is to trickle feed records into the DW directly from an enterprise service bus (ESB), if your company has one. Here, the DW subscribes to selected events which flow through a staging area and into the DW. Most ETL vendors sell specialized ESB connectors to trickle feed data or you can program a custom interface. This is the most complex of the three approaches since there is no time to recover from a failure, but it provides the most up-to-date data possible.
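As a rough sketch of the incremental pattern behind options 1 and 2, the code below pulls only rows changed since a watermark and merges them into the warehouse. Table, column, and feed names are hypothetical, and a real CDC tool would read the database log rather than a last_modified column.

```python
# Sketch: watermark-driven incremental load (mini-batch/CDC style). Assumes two
# DB-API connections; all object names are hypothetical.
def incremental_load(ops_conn, dw_conn) -> None:
    dw_cur = dw_conn.cursor()
    dw_cur.execute("SELECT MAX(loaded_through) FROM dw.etl_watermark WHERE feed = 'orders'")
    watermark = dw_cur.fetchone()[0]

    ops_cur = ops_conn.cursor()
    ops_cur.execute(
        "SELECT order_id, customer_id, amount, last_modified "
        "FROM orders WHERE last_modified > ?", (watermark,))
    changes = ops_cur.fetchall()

    for order_id, customer_id, amount, modified in changes:
        # Upsert: update the row if it already exists, otherwise insert it.
        dw_cur.execute("UPDATE dw.fact_orders SET customer_id = ?, amount = ? "
                       "WHERE order_id = ?", (customer_id, amount, order_id))
        if dw_cur.rowcount == 0:
            dw_cur.execute("INSERT INTO dw.fact_orders (order_id, customer_id, amount) "
                           "VALUES (?, ?, ?)", (order_id, customer_id, amount))

    if changes:
        dw_cur.execute("UPDATE dw.etl_watermark SET loaded_through = ? WHERE feed = 'orders'",
                       (max(row[3] for row in changes),))
    dw_conn.commit()
```

Scheduled every 15 minutes, or driven by a replication tool that feeds a staging table, this keeps the warehouse within minutes of the operational systems without wiping the slate clean on each load.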

Working Outside the DW

If your existing DW doesn’t lend itself to operational BI for architectural, political, or philosophical reasons, then you need to consider building or buying a complementary low-latency decision engine. There are three options here: 1) data federation, 2) operational data stores, and 3) event-driven analytic engines.

1. Data federation. Data federation tools query and join data from multiple source systems on the fly. These tools create a virtual data mart that can combine historical and real-time data without the expense of creating a real-time DW infrastructure. Data federation tools are ideal when the number of data sources, volume of data, and complexity of queries are low.

2. ODS. Companies often use operational data stores (ODS) when they want to create operational reports that combine data from multiple systems and don’t want users to query source systems directly. An ODS extracts data from each source system in a timely fashion to create a repository of lightly integrated, current transaction data. To avoid creating redundant ETL routines and duplicate copies of data, many organizations load their DW from the ODS.

3. Event-driven engines. Event-driven analytic engines apply analytics to event data from an ESB as well as static data in a DW and other applications. The engine filters events, applies calculations and rules in memory, and triggers alerts when thresholds have been exceeded. Although tailored to meet high-volume, real-time requirements, these systems can also support general purpose BI applications.
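To show the shape of the third option, here is a toy event-driven rule in Python: events stream in, a rolling window is evaluated in memory, and an alert fires when a threshold is crossed. The event fields, window size, and threshold are invented for illustration; a production engine would subscribe to the ESB and manage many such rules.

```python
# Toy event-driven analytic rule: keep a rolling window in memory and trigger an
# alert when a threshold is exceeded. All field names and values are hypothetical.
from collections import deque
from typing import Callable

class ThresholdRule:
    def __init__(self, window_size: int, threshold: float, alert: Callable[[str], None]):
        self.window = deque(maxlen=window_size)   # last N observed amounts
        self.threshold = threshold
        self.alert = alert

    def on_event(self, event: dict) -> None:
        self.window.append(event["amount"])
        total = sum(self.window)
        if total > self.threshold:
            self.alert(f"Rolling total {total:.2f} exceeded {self.threshold:.2f} "
                       f"for account {event['account_id']}")

# In practice the events would arrive from an ESB subscription rather than a list.
rule = ThresholdRule(window_size=10, threshold=5000.0, alert=print)
for evt in [{"account_id": "A1", "amount": 1200.0}, {"account_id": "A1", "amount": 4100.0}]:
    rule.on_event(evt)
```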

In summary, you can architect operational BI systems in multiple ways. The key decision is whether to support operational BI inside or outside the DW. Operational BI within a DW maintains a single version of truth and ensures high quality data. But not all organizations can afford to rearchitect a DW for low latency data and must look to alternatives.

Posted by Wayne Eckerson on November 6, 2009


MDM Lessons Learned

Master data management (MDM) enables organizations to maintain a single, clean, consistent set of reference data about common business entities (e.g. customers, products, accounts, employees, partners, etc.) that can be used by any individual or application that requires it. In many respects, MDM applies the same principles and techniques that apply to data warehousing—clean, accurate, authoritative data.

Not surprisingly, many data warehousing (DW) professionals have taken the lead in helping their organizations implement MDM solutions. Yet, even grizzled DW veterans pose fundamental questions about how to get started and succeed in this new arena. Here are answers to the seven most common questions:

1. What’s the best place to start with MDM? People want to know whether it’s best to start with customer, product, or account data, or whether the finance, service, or marketing department is most receptive to MDM. The actual starting place is determined by your organization and the amount of pain that different groups or departments feel due to lack of conformed master data. The only surefire advice is to start small and work incrementally to deliver an enterprise solution.

2. How do you fund MDM? Few people have succeeded in funding stand-alone MDM projects, especially if their company has recently funded data warehousing, data quality, and CRM initiatives. Executives invariably ask, “Weren’t those initiatives supposed to address this?” Replying that MDM makes those initiatives more efficient and effective just doesn’t cut it. The best strategy is to bake MDM projects into the infrastructure requirements for new strategic initiatives.

3. How do you architect an MDM solution? The right architecture depends on your existing infrastructure, what you’re trying to accomplish, and the scope and type of reference data you need to manage. A classic MDM hub is essentially a data reconciliation engine that can feed harmonized master data to a range of systems—including the data warehouse. MDM hubs come in all shapes and sizes: on one extreme, a hub serves as the only source of master data for all applications; on the other, it simply maintains keys to equivalent records in every application. Most MDM solutions fall somewhere in the middle. (A simple sketch of the registry-style extreme appears after this list.)

4. What’s the role of the data warehouse in MDM? There is no reason you can’t designate a single application to serve as the master copy. For example, you could designate the data warehouse as the master for customer data or an Oracle Financials application as the master for the chart of accounts. These approaches are attractive because they reuse existing models, data, and infrastructure, but may not be suitable in all situations. For instance, you may want an MDM solution that supports dynamic bidirectional updates of master data in both the hub and operational applications. This requires a dynamic matching engine, a real-time data warehouse, and Web services interfaces to integrate both ends of the transaction.

5. What organizational pitfalls will I encounter? Managing the expectations of business and IT stakeholders is nothing less than a make-or-break proposition. “Change management can derail an MDM project,” says one chief technology officer at a major software manufacturer that implemented a global MDM project. “When you change the data that end users have become accustomed to receiving, it can cause significant angst. You have to anticipate this, implement a transition plan, and prepare the users.” In addition, don’t underestimate the need to educate IT professionals about the need for MDM and the new tools and techniques required to implement it.

6. What technical pitfalls will I encounter? First of all, MDM requires a panoply of tools and technologies, some of which may already exist in your organization. These include database management systems, data integration tools, data matching and quality tools, rules-based systems, reporting tools, scheduling, and workflow management. Buying a packaged solution alleviates the need to integrate these tools. But if you already have the tools that exist in a package, negotiate a steep discount. Early MDM adopters say the biggest challenges are underestimating the time and talent required to define and document MDM requirements, analyze source data, maintain high-performance Web services interfaces, and fine tune matching algorithms to avoid under- or over-matching.

7. How do I manage a successful MDM implementation? To succeed, MDM requires business managers to take responsibility for defining master data and maintaining its integrity. This involves assigning business executives to stewardship roles in which they drive consensus about data definitions and rules and oversee processes for changing, managing, auditing, and certifying master data. Good data governance may or may not involve steering committees and meetings, but it always involves establishing clear policies and processes and holds business people accountable for the results.
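To illustrate the lightweight end of the hub spectrum from question 3, here is a toy registry-style hub in Python: it stores no master copy of the attributes, only a cross-reference from a golden master ID to each application’s local key. The naive exact-match logic and the sample records are invented for illustration; real hubs rely on probabilistic matching engines tuned to avoid under- or over-matching.

```python
# Toy registry-style MDM hub: no master attribute store, just a cross-reference
# of master IDs to each source system's local keys. Matching here is a naive
# exact match on normalized name and postal code; all sample values are made up.
class CustomerRegistry:
    def __init__(self):
        self._by_match_key = {}   # normalized identity -> master_id
        self._xref = {}           # master_id -> {system name: local key}
        self._next_id = 1

    @staticmethod
    def _match_key(name: str, postal_code: str) -> tuple:
        return (name.strip().lower(), postal_code.strip())

    def register(self, system: str, local_key: str, name: str, postal_code: str) -> int:
        """Return the master ID for this record, creating one if it is genuinely new."""
        key = self._match_key(name, postal_code)
        master_id = self._by_match_key.get(key)
        if master_id is None:
            master_id = self._next_id
            self._next_id += 1
            self._by_match_key[key] = master_id
            self._xref[master_id] = {}
        self._xref[master_id][system] = local_key
        return master_id

    def local_keys(self, master_id: int) -> dict:
        """Which record in each application corresponds to this master customer?"""
        return self._xref[master_id]

registry = CustomerRegistry()
m1 = registry.register("crm", "C-1001", "Acme Corp", "02139")
m2 = registry.register("billing", "78-44", " acme corp ", "02139")
assert m1 == m2
print(registry.local_keys(m1))   # {'crm': 'C-1001', 'billing': '78-44'}
```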

MDM is a major undertaking and there is much to learn to be successful. But hopefully the answers to these seven questions will get you moving in the right direction.

Posted by Wayne Eckerson on November 6, 2009


TDWI's BI Executive Summit in San Diego

More than 70 business intelligence directors and sponsors gathered in San Diego this week for TDWI’s semi-annual BI Executive Summit. The executives were enlightened by a mix of case studies (NetApp, Dell, Arizona State University, RBC Wealth Management, and ShareThis) and educational sessions on key areas of interest (BI in the Cloud, change management, cost estimating BI projects, Hadoop, visualization, BI mashups, social intelligence, pervasive BI, and the future of BI).

The attendees also voted on technologies and programs that will be of key interest to them in the next three years. Topping the technology chart were predictive and in-database analytics, dashboards, visualization, the cloud, and operational BI. On the process side, key initiatives will be data quality and governance, business process management, and BI competency centers.

The event’s sponsors also chimed in during sessions and the future of BI panel. Teradata discussed active data warehousing; Information Builders showed how to integrate strategic, tactical, and operational BI; Birst discussed the benefits of cloud-based computing via its customer RBC Wealth Management; Sybase discussed in-database analytics via SQL-based plug-ins from Fuzzy Logix; and IBM discussed innovations in BI, such as search, collaboration, and mashups.

Most attendees valued the networking and collaboration that the event offers during lunch and breaks. In addition, hands-on workshops put the attendees in the driver’s seat. During the Dell case study, attendees were given a real-life scenario of Dell’s data warehousing environment circa 2007 and then in groups had to decide how they would have approached the issues that Dell faced. After considerable discussion and debate, representatives from Dell—who were actively taking notes during the discussion phase—told the audience how they approached the situation. The workshop on change management also asked the attendees to work in small groups to discuss the implications of the FEE (fear, entitlement, earning).

Quotable Quotes from the event:

- “You typically only need 12 metrics to run a company.” Tony Politano, consultant and author of Chief Performance Officer

- “Our dashboards, which are all built on the same platform and enterprise model, provide enterprise, functional, and role based views of information.” Dongyan Wang, NetApp

- “We designed the dashboard so that any user could get any information they want at any level of detail in three clicks.” Dongyan Wang, NetApp

- “A dashboard should provide more than composite views of information; they should allow users to create a personalized mashup via Web services and widgets so the environment caters to their needs.” Laura Edell Gibbons

- “Add a top 10 list to your portal to enhance usage and pay attention to principles of visual design.” John Rome, Arizona State University

- “We inventoried our spreadmarts and found we had about 40,000 around the world and that didn’t even count spreadsheets.” Rob Schmidt, Dell

- “It’s important to pick data governance decision makers at the right level; we had Michael Dell’s direct reports.” Mike Lampa, Dell

- “The initial response of the business units to the decision to shift analysts who spent most of their time creating reports into IT was ‘Go to hell!’ But quickly, they saw that the move would free up budget dollars and headcount for them and they bought into the decision.” Mike Lampa, Dell

- “One lesson learned [during the Dell DW renovation] was losing sight of the importance of delivering value while establishing and enforcing enterprise standards.” James Franklin, Dell

- “Our BI architectures are rooted in outdated assumptions of resource scarcity.” Mark Madsen, Third Nature

- “Because of advances in processing and storage power, we could now ship the data warehouse that I built in 1993 for $1.5 million on a laptop to every individual in the company.” Mark Madsen, Third Nature


Posted by Wayne Eckerson on August 7, 2009


The Scope of Data Governance

I recently reviewed the course materials for a class titled “A Step by Step Guide to Enterprise Data Governance” taught by Mike Ferguson at the TDWI Munich conference in June. Mike did a tremendous job covering the full scope of the data governance topic.

Mike defines enterprise data governance as “the set of processes by which structured and unstructured data assets are formally managed and protected by people and technology to guarantee commonly understood trusted and secure data throughout the enterprise.”

The elements that Mike puts in the data governance bucket are: data definitions and shared business vocabulary; metadata management; data modeling; data quality; data integration; master data management; data security; content management; and taxonomy design and maintenance.

This is a big vision, and certainly elevates the discussion to its proper perspective: that is, data is a business asset and it’s the responsibility of business to oversee and manage this resource. The corollary here is that IT plays a supporting, not supervisory, role in managing the company’s data.

Central to Mike’s vision of enterprise data governance is a Change Control Board, which is the “gatekeeper” for the shared business vocabulary. This board, which is comprised of data stewards from the business, is responsible for approving requests to change, add, or decommission data items. Implicit in this is that the Change Control Board manages data names and definitions, transformation rules, and data quality rules. And these get baked into data models, BI metadata, MDM models, and taxonomies.

Given how fundamental data is to a business (whether it knows it or not), it’s imperative that a senior executive oversee the data governance team that is comprised of senior business managers and stewards. Maria Villar, owner of MCV LLC, writes, “A business data steward is a leadership position…. who understands the importance of data to their part of the business.” (See “Establishing Effective Business Data Stewards” in the spring 2009 edition of the BI Journal.)

Villar says a business data steward “understands the priorities and strategies of the business unit, commands respect within the organization, builds consensus across a varied set of business priorities; influences and drives changes to business processes, enjoys strong support from senior business leaders, can communicate to business and technical teams, and builds a diverse team of technical and business data experts.”

Now that we have the verbiage straight, we have to execute on the vision. And that will keep us busy for years to come!


Posted by Wayne Eckerson on July 14, 2009