Raising the Bar for Data Profiling
By Philip Russom, Senior Manager, TDWI Research
TDWI’s position is that detailed profiling of the data in source and
target systems is a prerequisite to successful data management
projects for business intelligence (BI), data warehousing (DW),
data integration (DI), data quality (DQ), master data management
(MDM), customer data integration (CDI), and so on. The problem is
that many technical users scrimp on data profiling by doing it rarely
or shallowly, profiling only known systems or small pieces of them,
and settling for profiles that are not very insightful or actionable.
These poor practices result in project overruns, the exclusion of
important data, incomplete and inaccurate profiles, severe productivity
losses, and low-quality or just-plain-wrong data in reports
and applications. In other words, the negligence of data profiling—whether intentional or a product of minimal resources—has
engendered a body of poor practices that hold data profiling back
from making a more significant contribution to the many data management
practices and related business initiatives of which it is a
Hence, there’s a real need to raise the bar on data profiling so that
data management professionals and their business counterparts
expect more and get more from it. For many organizations, raising
the bar is about augmenting the current narrow definition of data
profiling (generating statistics about data values in a column or
table) to embrace more advanced functions for cross-system data
discovery, regular data monitoring and remediation, and the collaborative
improvement and sharing of profiles. To help organizations
wishing to make such improvements, this article defines the four
practice areas of data profiling, then lists 10 best practices indicative
of a mature, feature-rich implementation.1
Defining Data Profiling and its Four Practice Areas
So we all start on the same page, here’s a definition of data profiling adapted from Wikipedia:
Data profiling is the process of examining the data available in an existing data source (e.g., a database or file) and collecting statistics and information about that data. The purpose of these statistics may be to find out whether existing data can easily be used for other purposes, provide metrics relevant to data quality and standards, assess the risk involved in integrating data for new applications, and assess whether metadata accurately describes the actual values in the source database.
It’s convenient to have a single term and concept that everyone recognizes, like data profiling. But the single term—and the Wikipedia definition—suggest a smaller practice than is actually the case. In fact, data profiling consists of four practice areas, and for many users, raising the bar involves branching out into all four:
Data profiling. This is core functionality, focused on the development of a data inventory with actionable profiles. The greatest need for improvement here is for technical users to reach out and profile more data sources and targets, while ferreting out dependencies across them.
Data discovery. This helps you avoid bad habits, such as profiling only known sources. Through data discovery, you can find nuggets of un-mined data. And it helps you find the best source of data for a given purpose, not merely a source that works. Depending on how functions are organized within tools or solutions, data discovery may enable you to relate data structures distributed across multiple IT systems.
Data monitoring. Though profiling is done before developing a data management solution (for DI, DQ, MDM, etc.), monitoring provides a quick re-profile as data moves through the deployed solution. The point is to assess the current state of a source or target data set and keep a history of data’s quality and structure (and sometimes its usage, too). From this information, a data steward or similar user can see how data’s quality and content have changed since the solution last ran, as well as trends over time. Monitoring is critical to the continuous improvement expected of DQ and DW solutions. It can also spot changes to source data structures, quantify the rate of quality degradation, and classify data usage for data
Collaborative profiling. At the low end, data profiling is collaborative when multiple people on a technical team share and improve a common set of profiles. A better example, however, is when line-of-business people add value to the data inventory and its profiles by providing context about what the data represents and how departments use it. For example, business people can give meaning to columns of data that are poorly defined by metadata or documentation, and they are indispensable at prioritizing data sets and data structures for improvement based on business pain or need. Collaborative data profiling is a critical success factor with cross-functional and cross-system data management practices, such as BI, DW, DI, DQ, CDI, and MDM. As data governance programs deepen their policy-driven practices, they invariably seek automation for data governance from all four data profiling practices, but within the collaborative context of the data governance committee.
Ten Best Practices in Data Profiling
The following best practices raise the bar on data profiling but with realistic, achievable goals. Don’t be surprised if you read about poor practices that you have perpetuated. Even seasoned data management professionals have some room for improvement as they raise the bar on data profiling and related practice areas for data monitoring, data discovery, and collaborative profiling.
Just do it! Technical users regularly voice their excuses for not profiling: “We know which sources to tap and we know their data so well that we needn’t profile them.” “We have a tight deadline, and data profiling is not a deliverable.” Skipping data profiling puts data-intense projects at risk, because project workers don’t know the data and its current state as well as they should. Inevitably, surprises pop up, such that the project concludes over time and over budget. So just do data profiling, as a prerequisite and critical success factor for projects in BI, DW, DI, DQ, MDM, data migrations, and so on.
Profile data thoroughly. Even when data management professionals conduct data profiling, they often do it superficially. A common but fallacious attitude is: “Running some ad hoc queries and SQL scripts tells us all we need to know.” Some folks rush through profiling, only to have to stop and do more later. Be thorough from the start to avoid delays.
Produce more thorough data profiles. Whether hand coded or tool based, most profiling merely generates statistics about data values in one column in one table. This isn’t enough. As you work, thoroughly document data sources, meanings, relationships, mappings, value types, and so on.
Discover and profile new data sources. Don’t just profile known data sources. Look for those you may have forgotten, overlooked, or been unaware of. Assume that new sources are appearing and old ones are changing, else you’ll miss opportunities.
Don’t be content with just any source for a data point. Don’t stop when a source is found, ignoring other potentially better sources. Distinguish similar data sources to ensure you’ve found the best source for a particular purpose (e.g., “customer” will be defined differently in different systems).
Rely on data discovery to find all the potential sources. Seek tool automation that assesses sources you’re exploring and suggests data’s meaning and relations.
Profile data across multiple IT systems. Don’t profile only parts of tables or databases. Statistics describing the occurrence of values and data types in a table column are useful, but not broad enough. Instead, cross-system data discovery can operate on whole tables and databases, plus reach across multiple systems.
Discover cross-system keys, business rules, data mappings, data exceptions, and transformations. Documenting all these dependencies is essential for projects that are inherently cross-system, like MDM, CDI, and DQ. Cross-system data discovery is crucial with undocumented legacy systems, which are typical of database migration projects.
Map data as you discover and profile it. Don’t just inventory data elements. Profiles should also catalog or develop mappings. After all, mappings are a big chunk of the work in developing DI and DQ solutions. Mapping is time-consuming and prone to error when manual. Avail yourself of any available software automation to make mappings faster and more accurate.
Re-profile data as it evolves. Data profiling is not a one-shot deal. Databases are alive, in that they grow, shrink, and evolve. For example, system owners change data structures and data definitions without telling you. And data sources come and go, as they appear, age, and retire.
Thus, you must re-profile data sources periodically to assess their current state and discover changes in sources and targets that require adjustments to solutions for DQ, DI, MDM, BI, etc. This is true whether you are working with source databases you’ve handled before or you are building a target database that demands continuous improvement (as in CDI and DW projects).
Re-profile data daily via data monitoring. Data monitoring differs from profiling and discovery. Data profiling and discovery tend to be pre-development tasks that examine potential data sources and targets. Data monitoring re-profiles data touched daily as part of a deployed solution for DI, DQ, CDI, and MDM, to assure that these achieve predetermined metrics for data’s quality, consistency, and validation. Monitoring also identifies structural changes in data sources and exceptions to be processed. While data profiling and discovery can be done with hand coding or mostly manual methods, data monitoring involves a high level of automation that is best executed via a vendor’s tool.
Data monitoring reveals exceptions that need special handling the same way that deployed routines for DI, DQ, and MDM do. Therefore, data monitoring needs to integrate with other tools where a business or technical user processes exceptions.
Collaborate through data profiles. Poor practices limit collaboration via data profiles. “We’ve always just documented source-data profiles in Microsoft Offices files, which we e-mail to everyone.” Documentation is an all-too-common product of manual profiling. Yet manual documentation is hard to update, share, and apply to multiple projects.
Tool automation enables collaborative profiling. Ideally, all team members (whether business or technical) should collaborate around a single view of source and target data. This requires a tool that can store and share data profiles and other valuable outputs of data discovery and monitoring via an easily accessed repository, data glossary, corporate portal, or an equivalent. You should also make profiles accessible to many users, so that both technical and business people can annotate them. This gives the profiles greater accuracy, richer metadata, and better prioritization. Reuse profiles across multiple projects for greater productivity and consistency.
Support many practices with data profiling, discovery, and monitoring. Let’s be honest. Data management professionals make a living developing and maintaining primary solutions for BI, DW, DI, DQ, and MDM—not data profiling. Even so, primary solutions wouldn’t be possible without secondary, supporting data management practices like data profiling. Equally indispensable supporting practices include metadata management, data modeling, and database administration. In fact, the quality, performance, and maintainability of primary solutions often depend on how well the supporting practices were performed—and that’s probably the best reason for raising the bar on data profiling and other secondary practices.
Data management practices aside, data profiling also contributes to data-driven organizational practices, such as data stewardship and data governance. After all, these depend on developing a data glossary and quality metrics for data, which aren’t possible without data profiling.2
Use a data profiling tool. Mostly manual methods are inhibiting data profiling. This includes ad hoc queries, SQL scripts, and visual perusals of data performed via a hodgepodge of utilities and hand coding. The result is documentation that’s rarely kept up-to-date and nearly impossible to apply directly to primary data management solutions. Furthermore, a mostly manual approach to data profiling is feature poor, in that it lacks functions for discovery, monitoring, and collaboration.
Avoid these limitations by using the data profiling, discovery, and monitoring functions built into tools for DQ, DI, CDI, and MDM. Or augment these tools with a dedicated data profiling tool. Ideally, you want data profiles, data definitions, and mappings to be shareable across the multiple tools you use to create primary data management solutions. Data discovery, profiling, and monitoring need significant software automation to be sustainable and applicable for multiple users, tools, and types of data management projects. So wean yourself off hand-coded and mostly manual methods in preference of tool-based data profiling.
- Move up to the next level of data profiling. Do it more often and thoroughly, and share the results. Extend profiling to embrace related practice areas in data discovery, monitoring, and collaboration.
- Realize that profiling is not a one-shot deal. Data monitoring keeps profiles current and discovers changes. Remediate exceptions, don’t just find and log them.
- Eschew myopia! Discover new data sources, and learn new things about old data.
- Link the four data profiling practices to other practices. After all, data profiling is a secondary practice that supports primary data management practices such as BI, DW, DI, DQ, CDI, and MDM. And profiling supports organizational practices, especially data stewardship and governance.
- Collaborate over data profiles. Solicit input from various technical and business people. Share profiles broadly for consistent data use and reuse.
- Prefer tool functions over mostly manual methods. This way, profiles are directly applicable to project deliverables. And tool automation enables you to raise the bar by profiling often, thoroughly, and collaboratively.
1 This article is based on the TDWI Webinar “Best Practices in Data Profiling and
Cross-System Data Discovery,” available for replay on tdwi.org.
2 For more information about relationships among data management practices, see the TDWI Best Practices Report Unified Data Management, available at tdwi.org.
Philip Russom is the senior manager of research and services at The Data Warehousing Institute (TDWI), where he oversees many of TDWI’s research-oriented publications, services, and events. Prior to joining TDWI in 2005, Russom was an industry analyst covering BI at Forrester Research, Giga Information Group, and Hurwitz Group, as well as a contributing editor with Intelligent Enterprise and DM Review magazines. You can reach him at firstname.lastname@example.org.
This article originally appeared in the issue of .