February 4, 2010
ANNOUNCEMENTS
TDWI Members receive a $275 discount on the price of registration for the TDWI World Conference Las Vegas, February 21-26.
Data Migration Information Integrity
By Todd Saunders
Topic: Data Migration

There are a number of options for modifying a DW solution when introducing a new coding scheme. The scheme could reference raw materials in a manufacturing company, inventory items in a distribution company, or disease classification in a healthcare organization. Additionally, the scheme may include changes to both the format of the code and the taxonomy and classification hierarchy in the code set. Although many operational and procedural changes are typically required to accommodate this type of change, we will focus on the handling of the data within the DW environment.

The questions that arise under these circumstances make it clear that moving to a new code set requires more than just modifying the format of the old code fields to handle the new format. Let's first focus on some options for handling the data in the DW:

Option 1: Modify Field Formats and Keep Both Code Sets
Option 2: Modify Field Formats and Update All Old Codes
Option 3: Add an Additional Column for the New Code Set
Option 4: Add Duplicate Tables for All Tables with the Old Code Set
Option 5: Develop an Internal Surrogate Key

With option 5, the new internal key is used in place of the other code set keys and can be referenced using a cross-reference mapping table. Most of these mappings will simply identify a one-to-one relationship between the old and new codes, but this approach also enables efficient management of any one-to-many mappings that appear.
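To make option 5 concrete before weighing the choices, here is a minimal Python sketch (with hypothetical codes, table contents, and field names; a real implementation would live in the warehouse's dimension and mapping tables) of how an internal surrogate key and a cross-reference mapping table can resolve queries against either code set, including a one-to-many case:

```python
# Minimal sketch of Option 5: an internal surrogate key plus a
# cross-reference mapping table. All codes and structures here are
# hypothetical examples, not a specific product's schema.

# Cross-reference mapping table: surrogate key -> (old code, new code).
# Note the one-to-many case: old code "RM-100" maps to two new codes.
code_xref = [
    {"sk": 1, "old_code": "RM-100", "new_code": "MAT.1001"},
    {"sk": 2, "old_code": "RM-100", "new_code": "MAT.1002"},  # one-to-many
    {"sk": 3, "old_code": "RM-200", "new_code": "MAT.2001"},  # one-to-one
]

def surrogate_keys_for_old(old_code):
    """Return every surrogate key associated with an old code."""
    return [row["sk"] for row in code_xref if row["old_code"] == old_code]

def surrogate_key_for_new(new_code):
    """Return the surrogate key for a new code (new codes are unique here)."""
    for row in code_xref:
        if row["new_code"] == new_code:
            return row["sk"]
    return None

# Fact rows reference only the surrogate key, never the raw codes,
# so queries work the same way regardless of which code set a user knows.
fact_rows = [
    {"sk": 1, "qty": 40},
    {"sk": 2, "qty": 25},
    {"sk": 3, "qty": 10},
]

# Query by old code: picks up both new-code variants of RM-100.
rm100_keys = set(surrogate_keys_for_old("RM-100"))
print(sum(r["qty"] for r in fact_rows if r["sk"] in rm100_keys))  # 65

# Query by new code: resolves to exactly one surrogate key.
mat2001_key = surrogate_key_for_new("MAT.2001")
print(sum(r["qty"] for r in fact_rows if r["sk"] == mat2001_key))  # 10
```

Because fact rows carry only the surrogate key, history loaded under the old scheme and data loaded under the new scheme can be queried together through the same mapping.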
The option you choose should be the one that has a reasonable cost and best accommodates the needs of the business in terms of time to implement and ease of use and maintenance. Options 1, 2, and 4 present the most difficulty in dealing with one-to-many mapping issues. Option 1 also introduces the risk of storing two code sets in one column, which can be misleading for anyone not intimately familiar with the code and data structures. Options 3 and 4 make it clearer that two code sets exist in the environment, but the one-to-many mappings may still be difficult to handle. Option 5 offers the most flexibility while introducing the least risk, but it still requires a review of the mappings.

Todd Saunders is the executive vice president of customer solutions with CONNECT: The Knowledge Network, a consulting firm that specializes in data, technical, and organizational solutions for business intelligence. Todd can be reached at 303.730.7171, ext. 236.

The Starting Point of Data Quality
By Deanne Larson

Topics: Data Quality, Baseline Metrics

It is a new year and your resolution list is daunting. Where do you begin? The start of a new year is like the start of your data quality journey: you know you need to make changes, but you are unsure how to start. With data quality, the journey begins by examining the data you have: the data your users trust and the data your executives are using to make decisions.

What Is a Baseline Metric?

A baseline metric captures the measured state of business-critical data as it exists today, so that later measurements can be compared against it.

Establish Priorities for Baseline Metrics

A quick way to assess the business criticality of data is to review the data lineage of your organization's key performance indicators (KPIs). Indicators are made up of metrics, metrics are a collection of measures, and measures are composed of data attributes. The data attributes that align to KPIs are a high priority and could be a starting point. Business criticality is the best way to prioritize data for a quality assessment, as it defines the value of the data.

Profiling Data and Establishing the Baseline

Data profiling tools provide data demographics. Common statistics include record count, minimum and maximum values, mean, mode, percentiles, frequency distribution, and standard deviation. These results are typically available for each attribute you profile. With this demographic information, content discovery transitions to analysis and establishing the baseline. Examples of baseline data quality metrics that can be derived from data profile demographics include percent complete, percent null, percent out of compliance, exception counts, exception values, range values, and out-of-range values.

This is not an exhaustive list; these are just some of the common data quality issues discovered through content analysis. Many of these examples can be converted into percentages, which make good baseline metrics. One of the simplest is percent complete, which provides insight into missing values that can have a huge impact on KPI calculations.
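As a rough illustration (a minimal Python sketch using pandas, with hypothetical attribute names and sample values rather than profiled warehouse data), the following derives a few of these baseline metrics, including percent complete, for a single attribute:

```python
import pandas as pd

# Hypothetical sample of an attribute used in a KPI calculation.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "unit_price": [19.99, None, 4.50, None, 250.00, 12.00, -3.00, 8.75, None, 15.25],
})

total = len(orders)
nulls = orders["unit_price"].isna().sum()

# Simple baseline metrics derived from the profile demographics.
percent_null = 100.0 * nulls / total
percent_complete = 100.0 - percent_null

# Out-of-range check against a hypothetical business rule (price must be > 0).
out_of_range = (orders["unit_price"].dropna() <= 0).sum()
percent_out_of_range = 100.0 * out_of_range / total

print(f"percent complete:    {percent_complete:.0f}%")   # 70%
print(f"percent null:        {percent_null:.0f}%")       # 30%
print(f"out-of-range values: {out_of_range} ({percent_out_of_range:.0f}%)")
```

In practice, a profiling tool generates these demographics for every attribute in scope; each statistic then becomes a baseline to compare against after cleansing and transformation.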
Once baseline metrics are established, root causes can be researched and solutions recommended. Root causes can sometimes be corrected upstream in the source or resolved through cleansing and transformation.

Benefits

Baseline metrics also remove assumptions from the requirements phase of the development cycle. Data is not assumed to be correct; its condition is known from the baseline values, which can be analyzed to determine the impact on the ability to deliver the desired result. For example, if an attribute used to calculate a KPI is 78 percent complete, is it possible for IT to deliver the KPI? Can the business provide logic to populate the missing 22 percent? These issues are best discovered in the requirements phase rather than the development phase of the life cycle.

Finally, baseline data quality metrics provide the ability to demonstrate improvement. Once data transformation and cleansing are complete, you can establish new metrics and measure the change against the original baseline. By establishing quality baselines on business-critical data, value is realized quickly, which is a great way to start the data quality journey.

Deanne Larson, CBIP, PMP, and DM (candidate), is the president of Larson & Associates. She has more than 18 years of BI/DW experience and has spoken nationally and internationally at industry conferences.
Reporting and Analysis Tools from SAP and Third-Party Vendors
Source: Business Intelligence Solutions for SAP (TDWI Best Practices Report, Q4 2007).
Mistake: Overemphasizing Metadata

Many ETL tool selections focus too heavily on the metadata components of the product, even though metadata use is implicit within the tool. Metadata accessibility, external integration, and reporting options are used less frequently outside the ETL environment than most people expect. It is better to focus on the use of metadata within the development and maintenance environment. Can you see the impact of a potential source change across an extract or a set of extracts? Can you trace, in sequence, a value from the target column back through multiple sources? These are the features that are most likely to be used. Features such as end-user access to the repository or the ability to import metadata from multiple tools are generally not as important to the overall environment, unless you are doing an enterprise metadata project as part of your current project. By overemphasizing little-used features, you run the risk of choosing the wrong product.

Source: Ten Mistakes to Avoid When Selecting and Deploying ETL Tools (Q2 2006).
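To illustrate the kind of development-time metadata use described above (a minimal Python sketch over a hypothetical source-to-target mapping structure, not any particular ETL product's repository), the following traces a target column back through its sources and lists what is affected when a source changes:

```python
# Minimal sketch of development-time metadata use in an ETL environment:
# lineage tracing (what sources feed a target?) and impact analysis
# (what targets does a source feed?). The mapping data is hypothetical.

# target column -> list of source columns it is derived from
mappings = {
    "dw.sales_fact.revenue": ["stage.orders.amount", "stage.fx_rates.rate"],
    "stage.orders.amount": ["erp.order_lines.net_amount"],
    "stage.fx_rates.rate": ["treasury.rates.daily_rate"],
    "dw.sales_fact.order_key": ["stage.orders.order_id"],
    "stage.orders.order_id": ["erp.order_lines.order_id"],
}

def lineage(target, depth=0):
    """Trace a target column back, in sequence, through every upstream source."""
    for source in mappings.get(target, []):
        print("  " * depth + f"{target} <- {source}")
        lineage(source, depth + 1)

def impact(source):
    """Return every downstream column affected if this source column changes."""
    direct = {t for t, sources in mappings.items() if source in sources}
    affected = set(direct)
    for target in direct:
        affected |= impact(target)
    return affected

# Lineage: walk dw.sales_fact.revenue back to the original source systems.
lineage("dw.sales_fact.revenue")

# Impact: a format change to erp.order_lines.net_amount touches these columns.
print(impact("erp.order_lines.net_amount"))
```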
WEBINARS
Revving Your Analytics Engine, Part 1: Analysis and Exploration
Revving Your Analytics Engine, Part 2: Prediction and Mining
Profiting from Your Clickstream Data: Enabling More Granular Insight into Online Visitor Behavior
Copyright 2010 TDWI. All rights reserved.