TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing

Think
- Research & Resources
  - TDWI Playbook | Next Generation Data Science: The AI-Driven Data Science Life Cycle
  - TDWI Data Points | The Data Foundation for AI
  - TDWI Best Practices Report | Data Strategies and Foundations for Modern Data Management
  - TDWI Insight Accelerator | Adopting a Platform Approach for Gaining Insights from Unstructured Data
- Webinars
  - Modernize and Govern: Unifying Your Data Strategy July 10, 2025
  - Expert Panel: Best Practices for Modernizing Your Data Environment July 14, 2025
  - Powering Data Science with AI-Driven Tools and Practices July 15, 2025
  - Data Integration for AI: Overcoming Modern Pipeline Challenges July 23, 2025
- Virtual Summits
  - Virtual Events Keys to Making Your Data AI Ready September 10, 2025
  - Virtual Events Data Quality for BI, Analytics and AI October 22, 2025
  - Virtual Events Modern Data Strategy November 12, 2025
  - Virtual Events What’s Ahead in 2026 for Data & Analytics December 10, 2025
- By Topic
  - By Topic
    
    Explore the Latest AI, Analytics, and Data Research and Training by Topic
  - BI, Analytics, and Data Literacy
  - AI, Data Science, and Machine Learning
  - Data Management and Governance
  - Platforms and Architecture
  - Strategy and Methods
- Speaking of Data Podcast
  
  Current Research Surveys
Train
- In-Person Events
  - Conference TDWI Transform 2025 San Diego August 18, 2025
  - Executive Summit TDWI Modern Data Leader's Summit San Diego: AI in the Enterprise August 18, 2025
  - Executive Summit AI Accelerate 2025, Brought to You by AI Boadroom & TDWI August 18, 2025
  - Conference TDWI Transform 2025 Orlando November 16, 2025
- Virtual Live Seminars
  - TDWI Data Governance Principles and Practices: Managing Data as an Asset June 25, 2025
  - Building Your Company’s Data Governance Roadmap June 25, 2025
  - Data Governance: Driving Engagement and Organizational Change June 26, 2025
  - A Framework for Modern Data Governance June 25, 2025
- Online Learning
- By Topic
  - By Topic
    
    Explore the Latest AI, Analytics, and Data Research and Training by Topic
  - BI, Analytics, and Data Literacy
  - AI, Data Science, and Machine Learning
  - Data Management and Governance
  - Platforms and Architecture
  - Strategy and Methods
- Train Your TeamCustom solutions for training your team
  
  Get CertifiedEarn a professional credential in BI and Analytics, Data Governance, or AI
  
  TDWI MembershipExclusive access to the research, tools, training, and connections
Engage
- Connect
  - Connect and Contribute to Our Vibrant Community of Data Leaders
    
    Subscribe to TDWI Stay up to date on the latest news and events. Sign Up
    
    Become a TDWI Member Gain exclusive access to the research, tools, training, and connections to move your careers, teams, and projects forward. Learn More
    
    Become a Part of the TDWI Research Panel Make a difference in the data and analytics industry and earn incentives by sharing your insights with TDWI. Explore Now
    
    Speak at TDWI Events Share your expertise and build your personal brand as a speaker at a TDWI In-Person or Virtual Event. Submit a Proposal
    
    Become a TDWI Research Fellow Apply to be a member of TDWI’s industry leading research team. Apply Today
    
    Become a Member of the Data & AI Leaders Forum Engage in collaborative discussions, stay ahead of the curve, and stay in the know. Apply Now
    
    Showcase Your Data & AI Solutions Reach and engage with TDWI community through multi-channel marketing programs. Learn More

RESEARCH & RESOURCES

The Necessity of Data Profiling: A How-to Guide to Getting Started and Driving Value

Allocating sufficient time and resources to conduct a thorough data profiling assessment will help architects design a better solution and reduce project risk by quickly identifying and addressing potential data issues.

February 3, 2010

By Matt Austin

Data profiling is a critical input task to any database initiative that incorporates source data from external systems. Whether it is a completely new database build or simply an enhancement to an existing system, data profiling is a key analysis step in the overall design. Allocating sufficient time and resources to conduct a thorough data profiling assessment will help architects design a better solution and reduce project risk by quickly identifying and addressing potential data issues.

Best Practices

How should you approach a new data profiling engagement and what can you expect in terms of value-added results?

Data profiling is best scheduled prior to system design, typically occurring during the discovery or analysis phase. The first step -- and also a critical dependency -- is to clearly identify the appropriate person to provide the source data and also serve as the “go to” resource for follow-up questions. Once you receive source data extracts, you’re ready to prepare the data for profiling. As a tip, loading data extracts into a database structure will allow you to freely write SQL to query the data while also having the flexibility to use a profiling tool if needed.

When creating or updating a data profile, start with basic column-level analysis such as:

Distinct count and percent: Analyzing the number of distinct values within each column will help identify possible unique keys within the source data (which I’ll refer to as natural keys). Identification of natural keys is a fundamental requirement for database and ETL architecture, especially when processing inserts and updates. In some cases, this information is obvious based on the source column name or through discussion with source data owners. However, when you do not have this luxury, distinct percent analysis is a simple yet critical tool to identify natural keys.
Zero, blank, and NULL percent: Analyzing each column for missing or unknown data helps you identify potential data issues. This information will help database and ETL architects set up appropriate default values or allow NULLs on the target database columns where an unknown or untouched (i.e.,., NULL) data element is an acceptable business case. This analysis may also spawn exception or maintenance reports for data stewards to address as part of day-to-day system maintenance.
Minimum, maximum, and average string length: Analyzing string lengths of the source data is a valuable step in selecting the most appropriate data types and sizes in the target database. This is especially true in large and highly accessed tables where performance is a top consideration. Reducing the column widths to be just large enough to meet current and future requirements will improve query performance by minimizing table scan time. If the respective field is part of an index, keeping the data types in check will also minimize index size, overhead, and scan times.
Numerical and date range analysis: Gathering information on minimum and maximum numerical and date values is helpful for database architects to identify appropriate data types to balance storage and performance requirements. If your profile shows a numerical field does not require decimal precision, consider using an integer data type because of its relatively small size. Another issue which can easily be identified is converting Oracle dates to SQL Server. Until SQL Server 2008, the earliest possible datetime date was 1/1/1753 which often caused issues in conversions with Oracle systems.

With the basic data profile under your belt, you can conduct more advanced analysis such as:

Key integrity: After your natural keys have been identified, check the overall integrity by applying the zero, blank, and NULL percent analysis to the data set. In addition, checking the related data sets for any orphan keys is extremely important to reduce downstream issues. For example, all customer keys from related transactions (e.g., orders) should exist in the customer base data set; otherwise you risk understating aggregations grouped by customer-level attributes.
Cardinality: Identification of the cardinality (e.g. one-to-one, one-to-many, many-to-many, etc.) between the related data sets is important for database modeling and business intelligence (BI) tool set-up. BI tools especially need this information to issue the proper inner- or outer-join clause to the database. Cardinality considerations are especially apparent for fact and dimension relationships.
Pattern, frequency distributions, and domain analysis: Examination of patterns is useful to check if data fields are formatted correctly. As example, you might validate e-mail address syntax to ensure it conforms to user@domain. This type of analysis can be applied to most columns but is especially practical for fields that are used for outbound communication channels (e.g., phone numbers and address elements). Frequency distributions are typically simple validations such as “customers by state” or “total of sales by product” and help to authenticate the source data before designing the database. Domain analysis is validation of the distribution of values for a given data element. Basic examples of this include validating customer attributes such as gender or birth date, or address attributes such as valid states or provinces within a specified region. Although these steps may not play as critical a role in designing the system, they are very useful for uncovering new and old business rules.

Picking the right techniques depends on the project objectives. If you’re building a new database from scratch, take the time to execute and review outcomes of each of the above bullet points. If you’re simply integrating a new data set into an existing database, select the most applicable tasks that apply to your source data.

All of these steps may be conducted by writing raw SQL. The basic profiling steps can usually be accomplished using a tool designed specifically for data profiling. Many third-party data profiling tools have been introduced into the marketplace over the last several years to help streamline the process. These tools typically allow the user to point to a data source and select the appropriate profile technique(s) to apply. The outputs of these tools vary, but usually a data source summary is produced with the field level profile statistics.

Summary

The first step in adopting a comprehensive data profiling program is to realize the importance and value it ultimately provides. Data profiling should be of interest to all roles within the technical project team. For instance, database architects, engineers, administrators, and support teams need to have a deep understanding of the database system’s core data to make sound technical decisions; business analysts and project managers may also use the conclusions drawn from the data profiling to help steer project direction and set realistic expectations with project stakeholders.

Furthermore, understanding the available data, the missing data, and the required data can help map out future technical strategies and data capture methods. Using this information to improve data capture techniques will improve the source data integrity and may lead to further improvements in overall customer engagement and intelligence.

The data profiling process may seem arduous and less than glamorous at times, but it is an important step which adds value to any database project.

Matt Austin is a senior professional services consultant at Quaero, a CSG Solution. In his role, he is responsible for design and development of client solutions. Matt can be reached at [email protected].

TDWI Membership

Get immediate access to training discounts, video library, research, and more.

Find the right level of Membership for you.

Learn More

↑

Research & Resources

Webinars

Virtual Summits

By Topic

In-Person Events

Virtual Live Seminars

Online Learning

By Topic

Connect and Contribute to Our Vibrant Community of Data Leaders

RESEARCH & RESOURCES

The Necessity of Data Profiling: A How-to Guide to Getting Started and Driving Value

TDWI Membership

Get immediate access to training discounts, video library, research, and more.

TDWI

Engage

Research