The Necessity of Data Profiling: A How-to Guide to Getting Started and Driving Value
Allocating sufficient time and resources to conduct a thorough data profiling assessment will help architects design a better solution and reduce project risk by quickly identifying and addressing potential data issues.
By Matt Austin
Data profiling is a critical input task to any database initiative that incorporates source data from external systems. Whether it is a completely new database build or simply an enhancement to an existing system, data profiling is a key analysis step in the overall design. Allocating sufficient time and resources to conduct a thorough data profiling assessment will help architects design a better solution and reduce project risk by quickly identifying and addressing potential data issues.
How should you approach a new data profiling engagement and what can you expect in terms of value-added results?
Data profiling is best scheduled prior to system design, typically occurring during the discovery or analysis phase. The first step -- and also a critical dependency -- is to clearly identify the appropriate person to provide the source data and also serve as the “go to” resource for follow-up questions. Once you receive source data extracts, you’re ready to prepare the data for profiling. As a tip, loading data extracts into a database structure will allow you to freely write SQL to query the data while also having the flexibility to use a profiling tool if needed.
When creating or updating a data profile, start with basic column-level analysis such as:
- Distinct count and percent: Analyzing the number of distinct values within each column will help identify possible unique keys within the source data (which I’ll refer to as natural keys). Identification of natural keys is a fundamental requirement for database and ETL architecture, especially when processing inserts and updates. In some cases, this information is obvious based on the source column name or through discussion with source data owners. However, when you do not have this luxury, distinct percent analysis is a simple yet critical tool to identify natural keys.
- Zero, blank, and NULL percent: Analyzing each column for missing or unknown data helps you identify potential data issues. This information will help database and ETL architects set up appropriate default values or allow NULLs on the target database columns where an unknown or untouched (i.e.,., NULL) data element is an acceptable business case. This analysis may also spawn exception or maintenance reports for data stewards to address as part of day-to-day system maintenance.
- Minimum, maximum, and average string length: Analyzing string lengths of the source data is a valuable step in selecting the most appropriate data types and sizes in the target database. This is especially true in large and highly accessed tables where performance is a top consideration. Reducing the column widths to be just large enough to meet current and future requirements will improve query performance by minimizing table scan time. If the respective field is part of an index, keeping the data types in check will also minimize index size, overhead, and scan times.
- Numerical and date range analysis: Gathering information on minimum and maximum numerical and date values is helpful for database architects to identify appropriate data types to balance storage and performance requirements. If your profile shows a numerical field does not require decimal precision, consider using an integer data type because of its relatively small size. Another issue which can easily be identified is converting Oracle dates to SQL Server. Until SQL Server 2008, the earliest possible datetime date was 1/1/1753 which often caused issues in conversions with Oracle systems.
With the basic data profile under your belt, you can conduct more advanced analysis such as:
- Key integrity: After your natural keys have been identified, check the overall integrity by applying the zero, blank, and NULL percent analysis to the data set. In addition, checking the related data sets for any orphan keys is extremely important to reduce downstream issues. For example, all customer keys from related transactions (e.g., orders) should exist in the customer base data set; otherwise you risk understating aggregations grouped by customer-level attributes.
- Cardinality: Identification of the cardinality (e.g. one-to-one, one-to-many, many-to-many, etc.) between the related data sets is important for database modeling and business intelligence (BI) tool set-up. BI tools especially need this information to issue the proper inner- or outer-join clause to the database. Cardinality considerations are especially apparent for fact and dimension relationships.
- Pattern, frequency distributions, and domain analysis: Examination of patterns is useful to check if data fields are formatted correctly. As example, you might validate e-mail address syntax to ensure it conforms to user@domain. This type of analysis can be applied to most columns but is especially practical for fields that are used for outbound communication channels (e.g., phone numbers and address elements). Frequency distributions are typically simple validations such as “customers by state” or “total of sales by product” and help to authenticate the source data before designing the database. Domain analysis is validation of the distribution of values for a given data element. Basic examples of this include validating customer attributes such as gender or birth date, or address attributes such as valid states or provinces within a specified region. Although these steps may not play as critical a role in designing the system, they are very useful for uncovering new and old business rules.
Picking the right techniques depends on the project objectives. If you’re building a new database from scratch, take the time to execute and review outcomes of each of the above bullet points. If you’re simply integrating a new data set into an existing database, select the most applicable tasks that apply to your source data.
All of these steps may be conducted by writing raw SQL. The basic profiling steps can usually be accomplished using a tool designed specifically for data profiling. Many third-party data profiling tools have been introduced into the marketplace over the last several years to help streamline the process. These tools typically allow the user to point to a data source and select the appropriate profile technique(s) to apply. The outputs of these tools vary, but usually a data source summary is produced with the field level profile statistics.
The first step in adopting a comprehensive data profiling program is to realize the importance and value it ultimately provides. Data profiling should be of interest to all roles within the technical project team. For instance, database architects, engineers, administrators, and support teams need to have a deep understanding of the database system’s core data to make sound technical decisions; business analysts and project managers may also use the conclusions drawn from the data profiling to help steer project direction and set realistic expectations with project stakeholders.
Furthermore, understanding the available data, the missing data, and the required data can help map out future technical strategies and data capture methods. Using this information to improve data capture techniques will improve the source data integrity and may lead to further improvements in overall customer engagement and intelligence.
The data profiling process may seem arduous and less than glamorous at times, but it is an important step which adds value to any database project.
Matt Austin is a senior professional services consultant at Quaero, a CSG Solution. In his role, he is responsible for design and development of client solutions. Matt can be reached at email@example.com.