View online: tdwi.org/flashpoint
November 4, 2010
ANNOUNCEMENTS

Submissions for the next Business Intelligence Journal are due December 3. Submission guidelines
Profiling Time-Dependent Data
Arkady Maydanchik
Topics: Data Profiling, Metadata

With the proliferation of efficient tools, data profiling has become one of the most common activities in data management. Unfortunately, many data profiling initiatives do not go beyond basic column profiling: gathering summary counts and statistics with frequency and distribution charts for individual data fields. Historically, data profiling tools were built for column profiling, even though they can be used to do much more.

Although column profiling produces a wealth of valuable metadata, it falls short when dealing with time-dependent data. Consider a simple example: imagine you are dealing with a payroll table that contains historical compensation data by employee paycheck. The table will likely have fields such as an employee ID, a paycheck effective date, a payroll code, and a pay amount.
Column profiling will easily produce the distribution of paycheck effective dates and the frequency chart for payroll codes. We may then learn that the table has more than 10 years of compensation history and 133 distinct payroll codes. We can identify and count the records with missing payroll codes and/or strange negative pay amounts, and might identify payroll code “ZZ,” which is sometimes used in place of a missing value (often referred to as a default value). All of this information is extremely valuable, and most of it can be gathered with queries like the sketch below.
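To make this concrete, here is a minimal sketch of basic column profiling in plain SQL. The table and column names (paycheck, payroll_code, effective_date, pay_amount) are hypothetical, chosen only to match the example above.

  -- Frequency chart for payroll codes: reveals the distinct codes,
  -- the "ZZ" default, and any missing (NULL) values.
  SELECT payroll_code, COUNT(*) AS paycheck_count
  FROM paycheck
  GROUP BY payroll_code
  ORDER BY paycheck_count DESC;

  -- Overall date range of the history, plus counts of suspect records.
  SELECT MIN(effective_date) AS earliest_paycheck,
         MAX(effective_date) AS latest_paycheck,
         SUM(CASE WHEN payroll_code IS NULL THEN 1 ELSE 0 END) AS missing_codes,
         SUM(CASE WHEN pay_amount < 0 THEN 1 ELSE 0 END) AS negative_amounts
  FROM paycheck;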
However, all of this still leaves many questions unanswered, such as:

- How much history exists for each employee and each payroll code?
- Do the paychecks follow any predictable patterns over time?
- Have the meaning and usage of the payroll codes changed over the years?

Unfortunately, none of these questions can be answered by analyzing individual column profiles. A different set of techniques is required: time-dependent data profiling. The objective of time-dependent data profiling is to learn how much history exists for different data categories, whether the data follows any predictable patterns, and whether the data meaning and patterns change over time.

There is a great variety of techniques for time-dependent data profiling, ranging from simple timeline and time-stamp pattern profiling procedures to complex approaches for the analysis of event histories and state-transition models. Some are basic; advanced techniques may involve multidimensional analysis. Although I am not aware of existing data profiling tools that explicitly target time-dependent data, in many cases the desired information can be gathered using column profiling tools and simple SQL queries, as the sketch below illustrates. Time-dependent data profiling requires skill, experience, and creative thinking. The real challenge is to understand what to profile, how to organize the results, and what to look for. As usual, the devil is in the details.
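As an illustration, here is a minimal sketch of time-dependent profiling against the same hypothetical paycheck table. These queries are one possible starting point, not a prescribed method.

  -- Lifespan of each payroll code: codes that appear or disappear
  -- mid-history often signal a change in data meaning over time.
  SELECT payroll_code,
         MIN(effective_date) AS first_used,
         MAX(effective_date) AS last_used,
         COUNT(*) AS paycheck_count
  FROM paycheck
  GROUP BY payroll_code
  ORDER BY first_used;

  -- Paychecks per employee per year: gaps and spikes in the timeline
  -- expose missing history and irregular payment patterns.
  SELECT employee_id,
         EXTRACT(YEAR FROM effective_date) AS pay_year,
         COUNT(*) AS paychecks_in_year
  FROM paycheck
  GROUP BY employee_id, EXTRACT(YEAR FROM effective_date)
  ORDER BY employee_id, pay_year;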
Arkady Maydanchik is a recognized data quality practitioner and educator. He is a frequent speaker at industry conferences and the author of the Data Profiling, Data Quality Assessment, and Ensuring Data Quality in Data Integration online courses available through eLearningCurve.

Getting Started with Data Governance
Dave Wells

Topics: Data Governance
1. Why Govern Data?

Whatever your motivation, start with a pain point. Begin where there is pressure and visibility. Be clear about your goals, and identify a small number that are achievable and where you believe you can overcome the obstacles and challenges.

2. What Data to Govern?

The combination of abundant and often redundant data with a multitude of data subjects raises several questions about data scope. Looking at a single subject (customer, for example), the questions include:

- Which systems and databases that hold customer data fall within governance scope?
- Do user-managed copies of the data, such as spreadsheets, need to be governed?
- Does customer data on portable media, such as a USB drive, fall within scope?
These are not easy questions to answer, and the answers will vary by business. Consider the implications of customer data in a spreadsheet on a portable USB drive. If your business is healthcare and your customer data is patient data, security and compliance issues are significant. If your business is media services, your data may be at lower risk. It may be necessary to govern spreadsheets in one instance and impractical to do so in another. Limit the scope to data that clearly needs governance. If you're getting started with governance, start small: one or a few subjects with a high degree of cross-functional business activity.

In addition, consider for each subject and for each kind of database the level of business interest and participation you can expect. What level of support and sponsorship is realistic? What level of resistance is likely?

3. How Much Governance?

At one end of the spectrum is education and communication about good data management practices. Adoption of the practices and development of good data management habits is mostly voluntary. Some workgroups are early adopters and others lag behind. The level of accountability is low, and data governance maturity develops slowly.

At the opposite end of the spectrum are defined policies with rigid enforcement. Processes, checkpoints, reviews, and audits are the essential elements of governance. The level of accountability is high, and data governance maturity is aggressively pursued.

Data management collaboration lies between the two extremes. Data management is a workgroup activity, but isolated pockets of activity aren't sufficient: you need communication and coordination among workgroups. Sometimes goals are achieved through communication, education, and the development of good habits; other times policy definition and enforcement are required. The level of accountability is commensurate with the level of risk, and the state of data governance maturity is continuously evolving.

Closing Thoughts

Dave Wells is a consultant, teacher, and practitioner in information management with a strong focus on practices to get maximum value from your data: data quality, data integration, and business intelligence. For more on data governance, attend TDWI Data Governance Fundamentals, a new course being taught for the first time at the upcoming TDWI World Conference in Orlando, November 7–12, 2010.
Integrating Data into SAP BW
In summary, most BW instances are fed data once a day on average, and most are fed data from both SAP and non-SAP sources, yet non-SAP data accounts for only 20 percent of BW's content on average.

Source: Business Intelligence Solutions for SAP (TDWI Best Practices Report, Q4 2007).
Mistake: Inadequate Monitoring of Data Interfaces

It is not uncommon for a data warehouse to receive hundreds of batch feeds and countless real-time messages from multiple data sources every month. These ongoing data interfaces usually account for the greatest number of data quality problems. The problems tend to accumulate over time, and there is little opportunity to fix the ever-growing backlog as we strive toward faster data propagation and lower data latency.

Why do well-tested data propagation interfaces falter? The source systems that originate the feeds are subject to frequent structural changes, updates, and upgrades. Testing the effect of these changes on the data feeds to multiple independent downstream databases is a difficult and often impractical step. Lack of regression testing and quality assurance inevitably leads to numerous data problems with the feeds anytime the source system is modified, which is all of the time!

The solution to interface monitoring is to design programs operating between the source and target databases. Such programs are entrusted with the task of analyzing the interface data before it is loaded and processed. Individual data monitors use data quality rules to test data accuracy and integrity; their objective is to identify all potential data errors. Advanced monitors that use complex business rules to compare data across batches and against target databases identify still more problems. Aggregate monitors search for unexpected changes in batch interfaces: they compare various aggregate attribute characteristics (such as counts of attribute values) from batch to batch, and a value outside of the reasonably expected range indicates a potential problem (a minimal sketch of this idea appears below).

Source: Ten Mistakes to Avoid in Data Quality Management (Q4 2007).
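To illustrate the aggregate monitor idea described above, here is a minimal SQL sketch. The staging table batch_feed and its columns (batch_id, status_code) are hypothetical, and the 50 percent threshold is illustrative; in practice, both would be tuned per interface and per attribute.

  -- Compare the record count per status_code in the latest batch
  -- against the average across all earlier batches, and flag any
  -- value whose count swings more than 50% in either direction.
  WITH batch_counts AS (
      SELECT batch_id, status_code, COUNT(*) AS rec_count
      FROM batch_feed
      GROUP BY batch_id, status_code
  ),
  history AS (
      SELECT status_code, AVG(rec_count) AS avg_count
      FROM batch_counts
      WHERE batch_id < (SELECT MAX(batch_id) FROM batch_feed)
      GROUP BY status_code
  )
  SELECT c.status_code, c.rec_count, h.avg_count
  FROM batch_counts c
  JOIN history h ON h.status_code = c.status_code
  WHERE c.batch_id = (SELECT MAX(batch_id) FROM batch_feed)
    AND (c.rec_count > 1.5 * h.avg_count
         OR c.rec_count < 0.5 * h.avg_count);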
EDUCATION & RESEARCH
TDWI World Conference
TDWI BI Executive Summit
WEBINARS
What's Required for Enterprise Business Intelligence Deployments
Unifying the Practices of Data Profiling, Integration, and Quality (dPIQ)
Who, What, and Why? A Functional Model for Data Management
MARKETPLACE
TDWI Solutions Gateway
TDWI White Paper Library
MANAGE YOUR TDWI MEMBERSHIP
Renew your Membership by: [-ENDDATE-]
Renew & FAQ | Edit Your Profile | Contact Us
Copyright 2010. TDWI. All rights reserved.