TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing

Think
- Research & Resources
  - TDWI Playbook | Next Generation Data Science: The AI-Driven Data Science Life Cycle
  - TDWI Data Points | The Data Foundation for AI
  - TDWI Best Practices Report | Data Strategies and Foundations for Modern Data Management
  - TDWI Insight Accelerator | Adopting a Platform Approach for Gaining Insights from Unstructured Data
- Webinars
  - Smarter Marketing in Retail: How AI and Modern Data Foundation Drive Growth July 31, 2025
  - Expert Panel: What's Next in Data Integration: Powering the AI-Driven Enterprise August 25, 2025
  - Expert Panel: Improving Data Quality, Accuracy, and Consistency August 27, 2025
  - Expert Panel: Building an AI-Driven Data Strategy September 15, 2025
- Virtual Summits
  - Virtual Events Keys to Making Your Data AI Ready September 10, 2025
  - Virtual Events Data Quality for BI, Analytics and AI October 22, 2025
  - Virtual Events Modern Data Strategy November 12, 2025
  - Virtual Events What’s Ahead in 2026 for Data & Analytics December 10, 2025
- By Topic
  - By Topic
    
    Explore the Latest AI, Analytics, and Data Research and Training by Topic
  - BI, Analytics, and Data Literacy
  - AI, Data Science, and Machine Learning
  - Data Management and Governance
  - Platforms and Architecture
  - Strategy and Methods
- Speaking of Data Podcast
  
  Current Research Surveys
Train
- In-Person Events
  - Conference TDWI Transform 2025 San Diego August 18, 2025
  - Executive Summit TDWI Modern Data Leader's Summit San Diego: AI in the Enterprise August 18, 2025
  - Conference TDWI Transform 2025 Orlando November 16, 2025
  - Executive Summit TDWI Data & AI Leaders Summit Orlando: Governing Data, Analytics, and AI November 17, 2025
- Virtual Live Seminars
  - Data Governance Week July 30, 2025
  - Platforms & Architecture Week July 30, 2025
  - AI Bootcamp Week July 30, 2025
- Online Learning
- By Topic
  - By Topic
    
    Explore the Latest AI, Analytics, and Data Research and Training by Topic
  - BI, Analytics, and Data Literacy
  - AI, Data Science, and Machine Learning
  - Data Management and Governance
  - Platforms and Architecture
  - Strategy and Methods
- Train Your TeamCustom solutions for training your team
  
  Get CertifiedEarn a professional credential in BI and Analytics, Data Governance, or AI
  
  TDWI MembershipExclusive access to the research, tools, training, and connections
Engage
- Connect
  - Connect and Contribute to Our Vibrant Community of Data Leaders
    
    Subscribe to TDWI Stay up to date on the latest news and events. Sign Up
    
    Become a TDWI Member Gain exclusive access to the research, tools, training, and connections to move your careers, teams, and projects forward. Learn More
    
    Become a Part of the TDWI Research Panel Make a difference in the data and analytics industry and earn incentives by sharing your insights with TDWI. Explore Now
    
    Speak at TDWI Events Share your expertise and build your personal brand as a speaker at a TDWI In-Person or Virtual Event. Submit a Proposal
    
    Become a TDWI Research Fellow Apply to be a member of TDWI’s industry leading research team. Apply Today
    
    Become a Member of the Data & AI Leaders Forum Engage in collaborative discussions, stay ahead of the curve, and stay in the know. Apply Now
    
    Showcase Your Data & AI Solutions Reach and engage with TDWI community through multi-channel marketing programs. Learn More

RESEARCH & RESOURCES

Raising the Bar for Data Profiling

May 6, 2010

By Philip Russom, Senior Manager, TDWI Research

TDWI’s position is that detailed profiling of the data in source and target systems is a prerequisite to successful data management projects for business intelligence (BI), data warehousing (DW), data integration (DI), data quality (DQ), master data management (MDM), customer data integration (CDI), and so on. The problem is that many technical users scrimp on data profiling by doing it rarely or shallowly, profiling only known systems or small pieces of them, and settling for profiles that are not very insightful or actionable.

These poor practices result in project overruns, the exclusion of important data, incomplete and inaccurate profiles, severe productivity losses, and low-quality or just-plain-wrong data in reports and applications. In other words, the negligence of data profiling—whether intentional or a product of minimal resources—has engendered a body of poor practices that hold data profiling back from making a more significant contribution to the many data management practices and related business initiatives of which it is a subset.

Hence, there’s a real need to raise the bar on data profiling so that data management professionals and their business counterparts expect more and get more from it. For many organizations, raising the bar is about augmenting the current narrow definition of data profiling (generating statistics about data values in a column or table) to embrace more advanced functions for cross-system data discovery, regular data monitoring and remediation, and the collaborative improvement and sharing of profiles. To help organizations wishing to make such improvements, this article defines the four practice areas of data profiling, then lists 10 best practices indicative of a mature, feature-rich implementation.¹

Defining Data Profiling and its Four Practice Areas

So we all start on the same page, here’s a definition of data profiling adapted from Wikipedia:

Data profiling is the process of examining the data available in an existing data source (e.g., a database or file) and collecting statistics and information about that data. The purpose of these statistics may be to find out whether existing data can easily be used for other purposes, provide metrics relevant to data quality and standards, assess the risk involved in integrating data for new applications, and assess whether metadata accurately describes the actual values in the source database.

It’s convenient to have a single term and concept that everyone recognizes, like data profiling. But the single term—and the Wikipedia definition—suggest a smaller practice than is actually the case. In fact, data profiling consists of four practice areas, and for many users, raising the bar involves branching out into all four:

Data profiling. This is core functionality, focused on the development of a data inventory with actionable profiles. The greatest need for improvement here is for technical users to reach out and profile more data sources and targets, while ferreting out dependencies across them.

Data discovery. This helps you avoid bad habits, such as profiling only known sources. Through data discovery, you can find nuggets of un-mined data. And it helps you find the best source of data for a given purpose, not merely a source that works. Depending on how functions are organized within tools or solutions, data discovery may enable you to relate data structures distributed across multiple IT systems.

Data monitoring. Though profiling is done before developing a data management solution (for DI, DQ, MDM, etc.), monitoring provides a quick re-profile as data moves through the deployed solution. The point is to assess the current state of a source or target data set and keep a history of data’s quality and structure (and sometimes its usage, too). From this information, a data steward or similar user can see how data’s quality and content have changed since the solution last ran, as well as trends over time. Monitoring is critical to the continuous improvement expected of DQ and DW solutions. It can also spot changes to source data structures, quantify the rate of quality degradation, and classify data usage for data governance purposes.

Collaborative profiling. At the low end, data profiling is collaborative when multiple people on a technical team share and improve a common set of profiles. A better example, however, is when line-of-business people add value to the data inventory and its profiles by providing context about what the data represents and how departments use it. For example, business people can give meaning to columns of data that are poorly defined by metadata or documentation, and they are indispensable at prioritizing data sets and data structures for improvement based on business pain or need. Collaborative data profiling is a critical success factor with cross-functional and cross-system data management practices, such as BI, DW, DI, DQ, CDI, and MDM. As data governance programs deepen their policy-driven practices, they invariably seek automation for data governance from all four data profiling practices, but within the collaborative context of the data governance committee.

Ten Best Practices in Data Profiling

The following best practices raise the bar on data profiling but with realistic, achievable goals. Don’t be surprised if you read about poor practices that you have perpetuated. Even seasoned data management professionals have some room for improvement as they raise the bar on data profiling and related practice areas for data monitoring, data discovery, and collaborative profiling.

Just do it! Technical users regularly voice their excuses for not profiling: “We know which sources to tap and we know their data so well that we needn’t profile them.” “We have a tight deadline, and data profiling is not a deliverable.” Skipping data profiling puts data-intense projects at risk, because project workers don’t know the data and its current state as well as they should. Inevitably, surprises pop up, such that the project concludes over time and over budget. So just do data profiling, as a prerequisite and critical success factor for projects in BI, DW, DI, DQ, MDM, data migrations, and so on.

Profile data thoroughly. Even when data management professionals conduct data profiling, they often do it superficially. A common but fallacious attitude is: “Running some ad hoc queries and SQL scripts tells us all we need to know.” Some folks rush through profiling, only to have to stop and do more later. Be thorough from the start to avoid delays.

Produce more thorough data profiles. Whether hand coded or tool based, most profiling merely generates statistics about data values in one column in one table. This isn’t enough. As you work, thoroughly document data sources, meanings, relationships, mappings, value types, and so on.

Discover and profile new data sources. Don’t just profile known data sources. Look for those you may have forgotten, overlooked, or been unaware of. Assume that new sources are appearing and old ones are changing, else you’ll miss opportunities.

Don’t be content with just any source for a data point. Don’t stop when a source is found, ignoring other potentially better sources. Distinguish similar data sources to ensure you’ve found the best source for a particular purpose (e.g., “customer” will be defined differently in different systems).

Rely on data discovery to find all the potential sources. Seek tool automation that assesses sources you’re exploring and suggests data’s meaning and relations.

Profile data across multiple IT systems. Don’t profile only parts of tables or databases. Statistics describing the occurrence of values and data types in a table column are useful, but not broad enough. Instead, cross-system data discovery can operate on whole tables and databases, plus reach across multiple systems.

Discover cross-system keys, business rules, data mappings, data exceptions, and transformations. Documenting all these dependencies is essential for projects that are inherently cross-system, like MDM, CDI, and DQ. Cross-system data discovery is crucial with undocumented legacy systems, which are typical of database migration projects.

Map data as you discover and profile it. Don’t just inventory data elements. Profiles should also catalog or develop mappings. After all, mappings are a big chunk of the work in developing DI and DQ solutions. Mapping is time-consuming and prone to error when manual. Avail yourself of any available software automation to make mappings faster and more accurate.

Re-profile data as it evolves. Data profiling is not a one-shot deal. Databases are alive, in that they grow, shrink, and evolve. For example, system owners change data structures and data definitions without telling you. And data sources come and go, as they appear, age, and retire.

Thus, you must re-profile data sources periodically to assess their current state and discover changes in sources and targets that require adjustments to solutions for DQ, DI, MDM, BI, etc. This is true whether you are working with source databases you’ve handled before or you are building a target database that demands continuous improvement (as in CDI and DW projects).

Re-profile data daily via data monitoring. Data monitoring differs from profiling and discovery. Data profiling and discovery tend to be pre-development tasks that examine potential data sources and targets. Data monitoring re-profiles data touched daily as part of a deployed solution for DI, DQ, CDI, and MDM, to assure that these achieve predetermined metrics for data’s quality, consistency, and validation. Monitoring also identifies structural changes in data sources and exceptions to be processed. While data profiling and discovery can be done with hand coding or mostly manual methods, data monitoring involves a high level of automation that is best executed via a vendor’s tool.

Data monitoring reveals exceptions that need special handling the same way that deployed routines for DI, DQ, and MDM do. Therefore, data monitoring needs to integrate with other tools where a business or technical user processes exceptions.

Collaborate through data profiles. Poor practices limit collaboration via data profiles. “We’ve always just documented source-data profiles in Microsoft Offices files, which we e-mail to everyone.” Documentation is an all-too-common product of manual profiling. Yet manual documentation is hard to update, share, and apply to multiple projects.

Tool automation enables collaborative profiling. Ideally, all team members (whether business or technical) should collaborate around a single view of source and target data. This requires a tool that can store and share data profiles and other valuable outputs of data discovery and monitoring via an easily accessed repository, data glossary, corporate portal, or an equivalent. You should also make profiles accessible to many users, so that both technical and business people can annotate them. This gives the profiles greater accuracy, richer metadata, and better prioritization. Reuse profiles across multiple projects for greater productivity and consistency.

Support many practices with data profiling, discovery, and monitoring. Let’s be honest. Data management professionals make a living developing and maintaining primary solutions for BI, DW, DI, DQ, and MDM—not data profiling. Even so, primary solutions wouldn’t be possible without secondary, supporting data management practices like data profiling. Equally indispensable supporting practices include metadata management, data modeling, and database administration. In fact, the quality, performance, and maintainability of primary solutions often depend on how well the supporting practices were performed—and that’s probably the best reason for raising the bar on data profiling and other secondary practices.

Data management practices aside, data profiling also contributes to data-driven organizational practices, such as data stewardship and data governance. After all, these depend on developing a data glossary and quality metrics for data, which aren’t possible without data profiling.²

Use a data profiling tool. Mostly manual methods are inhibiting data profiling. This includes ad hoc queries, SQL scripts, and visual perusals of data performed via a hodgepodge of utilities and hand coding. The result is documentation that’s rarely kept up-to-date and nearly impossible to apply directly to primary data management solutions. Furthermore, a mostly manual approach to data profiling is feature poor, in that it lacks functions for discovery, monitoring, and collaboration.

Avoid these limitations by using the data profiling, discovery, and monitoring functions built into tools for DQ, DI, CDI, and MDM. Or augment these tools with a dedicated data profiling tool. Ideally, you want data profiles, data definitions, and mappings to be shareable across the multiple tools you use to create primary data management solutions. Data discovery, profiling, and monitoring need significant software automation to be sustainable and applicable for multiple users, tools, and types of data management projects. So wean yourself off hand-coded and mostly manual methods in preference of tool-based data profiling.

Recommendations

Move up to the next level of data profiling. Do it more often and thoroughly, and share the results. Extend profiling to embrace related practice areas in data discovery, monitoring, and collaboration.
Realize that profiling is not a one-shot deal. Data monitoring keeps profiles current and discovers changes. Remediate exceptions, don’t just find and log them.
Eschew myopia! Discover new data sources, and learn new things about old data.
Link the four data profiling practices to other practices. After all, data profiling is a secondary practice that supports primary data management practices such as BI, DW, DI, DQ, CDI, and MDM. And profiling supports organizational practices, especially data stewardship and governance.
Collaborate over data profiles. Solicit input from various technical and business people. Share profiles broadly for consistent data use and reuse.
Prefer tool functions over mostly manual methods. This way, profiles are directly applicable to project deliverables. And tool automation enables you to raise the bar by profiling often, thoroughly, and collaboratively.

¹ This article is based on the TDWI Webinar “Best Practices in Data Profiling and Cross-System Data Discovery,” available for replay on tdwi.org.
² For more information about relationships among data management practices, see the TDWI Best Practices Report Unified Data Management, available at tdwi.org.

Philip Russom is the senior manager of research and services at TDWI, where he oversees many of TDWI’s research-oriented publications, services, and events. Prior to joining TDWI in 2005, Russom was an industry analyst covering BI at Forrester Research, Giga Information Group, and Hurwitz Group, as well as a contributing editor with Intelligent Enterprise and DM Review magazines. You can reach him at [email protected].

This article originally appeared in the issue of .

TDWI Membership

Get immediate access to training discounts, video library, research, and more.

Find the right level of Membership for you.

Learn More

↑

Research & Resources

Webinars

Virtual Summits

By Topic

In-Person Events

Virtual Live Seminars

Online Learning

By Topic

Connect and Contribute to Our Vibrant Community of Data Leaders