Tackling Data Quality Problems in Data Consolidations
Data quality problems arise when an enterprise must consolidate data during a merger or acquisition. We examine a real-world example of how a firm in Iceland confronted the challenge.
By Hinrik Jósafat Atlason
Most of us are genuinely preoccupied with quality in general, whether it’s the quality of our food, education, clothing, or Internet connections. Other aspects also come to mind -- to name just two, consider restorative vacations and a safe work environment. Many of us are also preoccupied with the quality of the data we work with, but we accept its shortcomings because, well, we’re used to it.
This article sheds light on some of the challenges posed by data quality problems, as well as suggested best practices based on our experience solving them.
Accepting the Challenge
The first step in overcoming poor data quality is to admit to the problem. Being dissatisfied with the quality of our data is no judgment upon ourselves or the technology we employ. This problem is far more common than most of us realize and may be due to the extraction of data from different source systems or simply the challenge of consolidating information systems after a merger or an acquisition. Regardless of the cause of poor-quality data, the war on bad data is an ongoing commitment rather than a temporary exorcism. Continuous quality assessment of our data is a process that should come as naturally to us as refraining from buying sour milk, but in order to win this war, the quality of our data has to be measured and understood.
The challenges of improving general data quality are many, and this is particularly true in situations where the scope of the problem is underestimated or perhaps even hidden below the surface. Bad data is often corrected by the data consumers themselves, who may be relying on obsolete applications or manual calculations in spreadsheets to verify the correctness of their data. Reduced productivity usually characterizes such situations.
Unfortunately, far too often projects to improve data quality are not approved until undeniable proof exists that the project will produce higher profits or reduce costs (preferably both). In such cases, it’s helpful to keep in mind that the cost of decisions based on incorrect information is not always quantifiable. The same applies to opportunities lost because we weren’t aware of them until it was too late, or because we had to manually verify data before feeling confident about its soundness.
Our Experience Consolidating Data
At Advania, we have participated in drastic infrastructural changes over the last few years. Those changes include fortifying our market position in the Nordics through mergers, a process that is usually accompanied by the inevitable data quality challenges that arise when consolidating data from a myriad of different information systems.
Consolidating Product and Service Hierarchies
One particular challenge we faced involved different standards for categorizing our Icelandic product and service offerings, depending on the source of the recently acquired information. This made both financial planning and reporting time consuming, not to mention the uncertainty caused by our lack of confidence in the numbers and the opportunities we may never have seen coming.
To remedy this problem, we took a three-phase approach that gave us valuable experience we can use in the future (we’ll surely continue consolidating data from information systems across our branches in the Nordics).
What did we do? To start with, teams of supply managers decided how our products and services should be categorized. The result of this exercise was a mapping between different product and service hierarchies. This mapping was used as the foundation for brand new hierarchies; this was more feasible than reusing parts of the older categories and modifying the rest. Like moving to a new apartment and then finally deciding to throw away that box of old receipts, this was a great opportunity to reconsider the inventory in light of our recent expansion and start anew.
In the second part of our approach, the finance department had the honor of verifying the soundness of the mapping from an accounting point of view, making sure the bottom line was unaffected from a historical perspective.
Finally, once the mapping was approved, it was forwarded to the IT department, which modified the programming accordingly and reprocessed all related historical information. The revised information was then stored in a new data warehouse, which now serves our domestic financial planning and reporting needs for that segment of our operations.
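To make the reprocessing step more concrete, the following is a minimal sketch in Python of how a mapping between old and new hierarchies can be applied to historical transactions. The source systems, category codes, and field names are hypothetical and meant only to illustrate the idea, not our actual model.

# A minimal sketch of reprocessing historical records against a new
# product/service hierarchy. All system names, codes, and fields are
# hypothetical and used only for illustration.

# Mapping from legacy category codes (per source system) to the new,
# consolidated hierarchy agreed on by the supply managers.
CATEGORY_MAPPING = {
    ("legacy_sys_a", "HW-SRV"): "Infrastructure/Servers",
    ("legacy_sys_a", "CONS"):   "Services/Consulting",
    ("legacy_sys_b", "1020"):   "Infrastructure/Servers",
    ("legacy_sys_b", "3300"):   "Services/Managed Operations",
}

def remap_transaction(row: dict) -> dict:
    """Return a copy of a historical transaction tagged with the new category."""
    key = (row["source_system"], row["legacy_category"])
    new_row = dict(row)
    # Unmapped codes are flagged rather than guessed at.
    new_row["category"] = CATEGORY_MAPPING.get(key, "UNMAPPED")
    return new_row

historical_rows = [
    {"source_system": "legacy_sys_a", "legacy_category": "HW-SRV", "amount": 1250.0},
    {"source_system": "legacy_sys_b", "legacy_category": "3300",   "amount": 890.0},
]
reprocessed = [remap_transaction(r) for r in historical_rows]

The important point is that unmapped codes are flagged rather than guessed at, so they can be routed back to the supply managers and the finance department for review before the totals are compared against the original bottom line.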
Consolidating Customer Representations
Another challenge that usually accompanies mergers is related to different customer representations, analogous to the problem I just described regarding our product and service hierarchies. That, however, was never really a problem for us because we use personal and enterprise Social Security numbers quite freely here in Iceland. These unique identifiers already exist in most systems and can be joined on when retrieving information from different systems. Our government even allows us access to data feeds from which up-to-date demographic information can be obtained for any Social Security number.
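To illustrate why such a shared identifier makes consolidation straightforward, here is a minimal sketch of joining customer records from two systems on a national identification number. The systems, field names, and values are hypothetical.

# Minimal sketch: consolidating customer records from two hypothetical systems
# by joining on a shared national identification number. All values are made up.

crm_customers = [
    {"national_id": "1234567890", "name": "Dæmi ehf.", "segment": "SMB"},
]
billing_customers = [
    {"national_id": "1234567890", "billing_email": "invoices@example.is"},
]

# Index one system by the shared identifier, then merge attributes per customer.
billing_by_id = {c["national_id"]: c for c in billing_customers}
consolidated = [
    {**crm, **billing_by_id.get(crm["national_id"], {})}
    for crm in crm_customers
]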
However, in order not to disappoint my readers completely, I’ll admit that getting off that easily might be temporary. Consolidating data from our branches in Iceland was one thing; consolidating it from our branches in Norway, Sweden, and Latvia introduces a completely different set of challenges. Consider the most obvious example -- dealing with different languages, currencies, and conventions for just about everything else. Managing the financial aspects of international projects and internal billing on a corporate level will also be very interesting once we reach that point.
Because of this, our approach was to create a solid framework in which to meet the challenges that lie ahead. Our purpose-built data warehouse was designed with one important aspect in mind -- flexibility. As an IT company, we are quite familiar with how quickly things can change, and we aim to be as operationally agile as possible. Therefore, our data warehouse was designed according to the data vault paradigm, with its inherent ability to adapt. It’s designed to be scalable, robust, and to give us the turnaround time we need to effectively integrate data from our international branches.
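To give a flavor of what that flexibility looks like in practice, here is a minimal sketch, in the spirit of the data vault approach, of the separation between hubs (stable business keys), links (relationships), and satellites (descriptive attributes with their own history). The entity and attribute names are hypothetical and not our actual model.

# A minimal, hypothetical sketch of data vault style structures: hubs hold
# stable business keys, links hold relationships, and satellites hold
# descriptive attributes with their own history. Names are illustrative only.
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class HubCustomer:
    hub_customer_key: str      # hash or surrogate of the business key
    national_id: str           # the stable business key itself
    load_date: datetime
    record_source: str         # e.g., which branch or system the key was first seen in

@dataclass(frozen=True)
class SatCustomerDetails:
    hub_customer_key: str      # points back to the hub
    load_date: datetime        # a new satellite row per change gives built-in history
    name: str
    country: str
    currency: str

@dataclass(frozen=True)
class LinkCustomerProduct:
    link_key: str
    hub_customer_key: str
    hub_product_key: str
    load_date: datetime
    record_source: str

Because descriptive attributes live in satellites, absorbing a new branch, language, or currency typically means adding rows or new satellites rather than restructuring existing tables.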
Our project was extensive and will continue to consume considerable effort for the foreseeable future. In fact, it was extensive enough to allow us the luxury of designing a new data warehouse from scratch, shaped by the challenges that lay before us. However, I realize that such an approach may not be feasible in all situations due to the cost and effort required. The remainder of this article provides practical steps that can be taken to minimize the damage caused by bad data, wherever it may be found.
Reacting to Bad Data
Resolving the quality issue is a goal that senior managers -- on behalf of the consumers -- must set for themselves. They must go into this battle with vigorous determination; to ensure success, they must also account for the inevitable cost this effort demands and continuously track progress using key performance indicators. IT will most likely be able to help kick-start this effort, but the importance of genuine support from the managers of the required resources, be it time or money (or both), must not be underestimated.
The following three-step approach will help you resolve poor data quality.
Step 1. Locate the source of bad data
Once the goal has been defined, the first step is to locate the source of the data that is unsatisfactory. The quality problem may lie in the data’s transfer from one system to another (for example, database fields that are not large enough to contain certain string values, resulting in truncation) or be as elementary as data-entry errors in the front-end application (such as a CRM application that allows customers’ addresses to be typed in instead of being selected from a list of valid entries).
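As a simple illustration of the profiling that helps locate such problems, here is a minimal sketch that flags string values sitting at a field’s declared maximum length, a common symptom of silent truncation. The column widths and rows are hypothetical.

# Minimal sketch: profiling incoming rows for values that fill a column's
# declared width, which often indicates silent truncation during transfer
# between systems. Column widths and rows are hypothetical.

COLUMN_MAX_LENGTHS = {"customer_name": 30, "address": 40}

def suspected_truncations(rows):
    """Yield (row_index, column, value) for values at or above the declared maximum length."""
    for i, row in enumerate(rows):
        for column, max_len in COLUMN_MAX_LENGTHS.items():
            value = row.get(column, "")
            if len(value) >= max_len:
                yield i, column, value

rows = [
    {"customer_name": "Fyrirtæki með mjög langt nafn s", "address": "Borgartún 1"},
]
for hit in suspected_truncations(rows):
    print("possible truncation:", hit)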
It’s also worth mentioning the added value we can give our data by enriching it with additional information about addresses, currency developments, or indexes (to name just a few examples that illustrate the variety of options already available).
Step 2. Prevent distribution to other systems
Modifying front-end applications to eliminate data-entry errors can be both expensive and time consuming. Whether or not that is done, it’s important to anticipate and prevent bad data from spreading through our systems. This applies especially to data warehouses, should they be employed.
We can keep errors from spreading, to a tolerable extent, by defining a set of automated checks in our ETL application that determine whether, for example, an address is legitimate. Regardless of what method we use (automated vs. manual verification, in-house vs. vendor solutions), it’s important that we invest in thorough analysis to make sure that we understand our data and all of its peculiarities. We must also learn which fields we should inspect and determine the legitimate data values in order to correct errors.
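A minimal sketch of such an automated ETL-stage check might look like the following. The field names, reference values, and rules are hypothetical; a real implementation would typically draw on a maintained reference data set or an external validation service.

# Minimal sketch of an ETL-stage validation: each rule inspects one field, and
# rows that fail are quarantined instead of being loaded onward. All field
# names and reference values are hypothetical.

VALID_POSTAL_CODES = {"101", "105", "108", "200", "600"}

def check_postal_code(row):
    return row.get("postal_code") in VALID_POSTAL_CODES

def check_amount(row):
    return isinstance(row.get("amount"), (int, float)) and row["amount"] >= 0

CHECKS = [("postal_code_valid", check_postal_code),
          ("amount_non_negative", check_amount)]

def validate(rows):
    """Split rows into (clean, quarantined) based on the configured checks."""
    clean, quarantined = [], []
    for row in rows:
        failures = [name for name, check in CHECKS if not check(row)]
        (quarantined if failures else clean).append({**row, "failed_checks": failures})
    return clean, quarantined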
Step 3. Monitor your progress
A product of this effort should be the classification of our data, which can lead to a physical separation or simply a flag set on fields indicating their perceived quality. A by-product of this -- which is just as useful, in my opinion -- is the statistical information that can be acquired during this process. It allows us to follow changes over a period of time and also to display a quality index in our most vital reports, effectively telling us how confident we should be in that particular set of data.
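As an example of how such a quality flag can be turned into a simple index for reports, here is a minimal sketch that computes, per load date, the share of rows that passed all checks. The field names follow the hypothetical validation example above.

# Minimal sketch: deriving a simple quality index (share of rows passing all
# checks) per load date, so reports can display how much confidence a given
# data set deserves. Field names are hypothetical.
from collections import defaultdict

def quality_index_by_load_date(rows):
    totals, passed = defaultdict(int), defaultdict(int)
    for row in rows:
        day = row["load_date"]
        totals[day] += 1
        if not row.get("failed_checks"):
            passed[day] += 1
    return {day: passed[day] / totals[day] for day in totals}

rows = [
    {"load_date": "2014-05-01", "failed_checks": []},
    {"load_date": "2014-05-01", "failed_checks": ["postal_code_valid"]},
]
print(quality_index_by_load_date(rows))   # {'2014-05-01': 0.5}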
Last but not least, I’d like to emphasize the importance of keeping the discussion about data quality honest and remind readers to give it the attention it deserves. Focusing on the (financial) pain of having bad data seems to draw the attention of senior management.
- - -
Hinrik Jósafat Atlason is a senior business intelligence consultant at Advania in Reykjavík, Iceland. His responsibilities include the framework and methodology for business intelligence and analytics at Advania. You can contact the author at [email protected].