Best Practices in MDM: Cross-System Data Analysis
Cross-system data analysis is part of the critical path of any MDM project.
- By Todd Goldman
- April 23, 2008
Cross-system source-data discovery and data mapping are some of the most important, and the most overlooked, steps in a master data management (MDM) implementation. Before you can populate a new master with “trusted” data, you need to analyze the potential source-data systems individually and you must perform cross-system analysis to determine the survivorship rules that will determine which data, under which circumstances, will be used to populate the master. In addition, once the master is populated, you must map the new master to the downstream “consuming” applications -- yet another level of cross-system data analysis. The bottom line is that cross-system data analysis and mapping are part of the critical path for most MDM projects, often consuming over 40 percent of the time and effort in an MDM deployment.
Most companies underestimate the difficulty of this work and tend to rely on the corporate memory of subject matter experts (SMEs). However, most SMEs only know their specific system and don’t know how the data in their system relates to data in other systems. Determining the trusted source of data for a specific attribute becomes a matter of the SMEs’ opinions and, depending on those opinions, may compromise the quality of the data in the master and increase risk to the overall success of the deployment.
Fortunately, there are new best practices in data analysis that automate single-system data profiling, cross-system source-data discovery, and detailed data mapping between systems. This article discusses each type of analysis and the appropriate use of each throughout an MDM project.
Single-System Data Profiling: A Necessary but not Sufficient First Step
When it comes to data analysis, single-system data profiling is what data warehousing practitioners are familiar with. Data profiling software scans the data values within the tables in a single data source and, based on those values, generates statistics about the data in each column, including, but not limited to, data types, value frequency, length, precision, scale, format, cardinality, mean, median, mode, minimum, and maximum.
Profiling will also identify outliers in a given column. For instance, if most of the values in a column are integers from 1 to 100, a profiling tool will flag and report any character values (a, b, c, and so on). Some of the better profiling tools will also identify and validate primary/foreign-key relationships, which are useful for understanding the structure of individual source systems.
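The column statistics described above can be sketched with the Python standard library. This is a minimal illustration of the idea, not a real profiling product, which would scan database tables at scale; the function name and the sample data are assumptions for the example.

```python
# Minimal sketch of single-system column profiling: gather basic
# statistics and flag type outliers, e.g. character values in a
# column that is mostly integers.
from collections import Counter
from statistics import mean, median

def profile_column(values):
    """Return basic profile statistics and any type outliers for one column."""
    numeric = [v for v in values if isinstance(v, (int, float))]
    outliers = [v for v in values if not isinstance(v, (int, float))]
    freq = Counter(values)
    return {
        "cardinality": len(freq),           # number of distinct values
        "min": min(numeric),
        "max": max(numeric),
        "mean": mean(numeric),
        "median": median(numeric),
        "mode": freq.most_common(1)[0][0],  # most frequent value
        "outliers": outliers,               # values that break the dominant type
    }

# Mostly integers from 1 to 100, with stray character values
# that a profiling tool would identify and report
col = [12, 47, 47, 88, 3, "a", 60, "b"]
stats = profile_column(col)
```

A real tool would also infer data types, lengths, and formats per column, but the per-column statistical scan is the core of the technique.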
Profiling is a necessary first step in almost any data analysis. In fact, both the cross-system source-data discovery and the data mapping sections rely on profiling as a first step. However, by itself, profiling only provides information about individual data sources. MDM by definition is about driving consistency across multiple data sources. As a result, after profiling, most of the hard work is still left as an exercise for the user to perform manually. This is why profiling only provides about 5 percent of the analysis you will need for your MDM project and why automating cross-source data discovery and data mapping (both described below) are critical for accelerating an MDM deployment.
Cross-Source Data Discovery: Finding Critical Data Elements
Following the detailed profiling and structural analysis of each source, the business analyst must still determine which critical data elements (CDE) will be used in the master data set. The CDE is a logical element that appears in multiple data sources. Each appearance of a CDE in a data set is called a physical data element (PDE). For example, “Surname” would be a CDE for an identity master, and might be located in multiple databases with a physical column name of Sname, Surname, Name, Column1, etc.
The project team will often know which CDEs to include in the master data set. However, there may be CDEs that are less apparent or unknown. Cross-source data discovery analysis software performs overlap analysis across multiple data sources (up to about 20) simultaneously and is used to confirm known CDEs and identify less-apparent CDEs. Such determinations are based on the following heuristics:
- CDEs usually have high cardinality (a high proportion of unique values within a column), such as account IDs, SSNs, and credit card numbers
- CDEs usually appear in multiple sources and are represented by overlapping attributes across sources
Cross-source data discovery software automatically identifies database or file fields that have high cardinality and high overlap with other data sources. It is then up to the subject matter experts to confirm or eliminate potential CDEs based on the statistical properties revealed by the discovery tool.
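The two heuristics above can be expressed as simple ratio checks. The sketch below is illustrative only; the thresholds, column names, and sample values are assumptions, not taken from any specific discovery product.

```python
# Sketch of the two CDE heuristics: high cardinality within a column,
# and high value overlap between columns in different sources.
def cardinality_ratio(values):
    """Fraction of values in a column that are distinct."""
    return len(set(values)) / len(values)

def overlap_ratio(a, b):
    """Fraction of distinct values in column a that also appear in column b."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa)

# Hypothetical SSN columns from two source systems
crm_ssn = ["111-22-3333", "222-33-4444", "333-44-5555", "444-55-6666"]
erp_ssn = ["222-33-4444", "333-44-5555", "444-55-6666", "555-66-7777"]

# Flag the field as a CDE candidate if both heuristics fire
# (the 0.9 and 0.5 thresholds are arbitrary for this example)
is_cde_candidate = (cardinality_ratio(crm_ssn) > 0.9
                    and overlap_ratio(crm_ssn, erp_ssn) > 0.5)
```

In practice the discovery software runs this kind of comparison across many sources at once and surfaces the candidates; it remains the SMEs' job to confirm or reject them.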
In addition to cross-source profiling information, SMEs will find it useful if cross-source discovery products can interactively preview and compare record values side by side, both within and between data sources.
Cross-Source Data Discovery: Choosing Trusted Sources
Once the team knows which critical data elements should be incorporated into the master data set and where the physical data elements (PDEs) currently reside, the team must consolidate the different element instances into one data element in the master. The team must make one of two decisions:
- Choose one PDE and make it the trusted source
- Select and combine several PDEs from multiple sources (this may require conditional logic to populate the MDM system)
Either choice will require users to determine the PDE’s “trustworthiness.” This decision is subjective, but a “trustworthy” PDE will have the following statistical properties:
- It will be a superset of other PDEs in other sources
- It will have a high overlap percentage with other PDEs in other sources
- It will be “fresher” than PDEs in other sources
Identifying trusted sources for specific PDEs is the foundation for reducing the complexity of resolution rules. Trying to create resolution rules for all sources can overwhelm an MDM deployment with complexity. Software that automates the trusted-source discovery for each PDE locates specific cross-source data conflicts which, in turn, simplifies the creation of the conflict resolution rules.
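The three statistical properties above can be combined into a simple comparative score. This is a hypothetical sketch: the equal weighting, the dictionary layout, and the sample data are all assumptions made for illustration.

```python
# Illustrative scoring of PDE "trustworthiness" from the three
# properties: superset of other sources, high overlap, and freshness.
from datetime import date

def trust_score(pde, others):
    """Score one PDE against the same element's instances in other sources."""
    values = set(pde["values"])
    other_values = set().union(*(set(o["values"]) for o in others))
    is_superset = other_values <= values                 # contains all other values
    overlap = len(values & other_values) / len(other_values)
    is_freshest = all(pde["last_updated"] >= o["last_updated"] for o in others)
    # Equal weights are an arbitrary choice for this sketch
    return is_superset + overlap + is_freshest

# Hypothetical "Surname" PDEs in two source systems
crm = {"values": ["Smith", "Jones", "Lee"], "last_updated": date(2008, 3, 1)}
erp = {"values": ["Smith", "Jones"],        "last_updated": date(2008, 1, 15)}
```

Here the CRM instance is a fresher superset of the ERP instance, so it scores higher and would be the natural trusted source for this element.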
Once the trusted sources have been identified, business analysts can begin programming the MDM system with the business rules that will determine survivorship and govern the data.
Data Mapping: Detailed Analysis Between Systems
Data mapping provides detailed mapping (down to the column level) between two structured data sources. This analysis is used for more detailed comparisons of two sources being considered to populate a master and can be used to automate the creation of case logic for survivorship rules used when populating the master. It is also very commonly used to map downstream consuming applications to the master itself. Performing this task manually is extremely time consuming, labor intensive, and highly prone to error, which is why automation is critical.
Fortunately, data mapping software is now available that analyzes data values between two structured data sources to automatically discover the following column-level business rules and transformations between those data sources:
- Scalar relationships: simple mappings, substrings, constants, and concatenations
- Filters such as equality/inequality, null conditions, and conjunctions
- Complex nested case statements (if-then logic)
- Inner and outer joins
- Aggregations (minimum, maximum, average, sum)
- Reverse pivot rows
- Correlations to be stored in cross-reference tables
- Arithmetic relationships between columns
Once the analysis software has derived and validated these business rules between a downstream application and the master data hub, the information is used to program ETL tools to move and transform data from the master to the downstream applications in a format the downstream application will understand.
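To make the idea of value-based rule discovery concrete, the toy sketch below tests whether a candidate transformation (here, a scalar concatenation of two source columns) reproduces a master column from sample rows. The function, column names, and data are assumptions for illustration; real mapping software searches a large space of such candidate rules automatically.

```python
# Toy illustration of value-based mapping discovery: a candidate
# column-level rule is confirmed when it reproduces the target
# values from the source rows.
def rule_holds(source_rows, target_values, transform):
    """True if transform(row) reproduces the target column row-for-row."""
    return all(transform(row) == t
               for row, t in zip(source_rows, target_values))

# Hypothetical source rows and the master's full-name column
source = [{"first": "Ann", "last": "Lee"},
          {"first": "Bob", "last": "Ray"}]
master_full_name = ["Ann Lee", "Bob Ray"]

# Candidate scalar rule: concatenation of two source columns
concat_rule = lambda r: f'{r["first"]} {r["last"]}'
discovered = rule_holds(source, master_full_name, concat_rule)
```

Confirmed rules of this kind are what get handed to the ETL layer; filters, joins, case logic, and aggregations are discovered the same way, just with larger candidate rule sets.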
The market for data analysis has matured well beyond the simple profiling products developed back in the 20th century. New 21st-century data analysis approaches now provide best practices for cross-source data discovery and data mapping that represent a significant leap forward in the automation of information analysis. The new generation of cross-source analysis tools automates up to 90 percent of the effort in cross-system data analysis used in master data management projects, accelerates the data analysis process by a factor of 10, and removes data analysis and mapping from the critical path of an MDM project.
- - -
Todd Goldman is vice president of marketing with Exeros and has over 15 years of experience as a general manager and marketing executive. Todd has an M.B.A. from the Kellogg Graduate School of Management and a B.S.E.E. from Northwestern University. You can reach the author at firstname.lastname@example.org.