How Accurate Are Your Data Sources?
We often assume that data obtained from outside sources meets the same quality standards as data from our own operational systems. Unfortunately, this may not be true.
- By Mike Schiff
- August 30, 2016
Data quality is of paramount importance to data warehouse practitioners and their business users. Although most of us have a good idea of the accuracy of the data that originates from within our own organizations, we may not know how accurate the data obtained from external sources really is or what assumptions were made during the collection process.
An Example of Inexact External Data
Consider the physical address associated with an Internet Protocol (IP) address.
An IP address is associated with a particular computer or other device (such as a networked printer) on a computer network. Several commercial and public organizations provide geolocation services that map an IP address to its physical location. The corresponding physical location may have been obtained in a variety of ways, including drive-by trolling for open networks, mobile apps that broadcast geocodes, and records from one of the five non-profit Regional Internet Registries.
These services can provide valuable insights about actual or potential customers who visit vendor websites. From an operational perspective, they can be used to estimate delivery charges, sales tax, or even suggest a nearby retail store where a consumer can purchase or pick up a product. From an analytics perspective, they can help spot purchasing trends and geographic preferences.
Although many people assume that each IP address can be precisely mapped to a specific location, this is not always true. In some situations it can only be mapped to a ZIP code, city, state, or even just a country.
One major commercial vendor claims to be 98 percent accurate at the country level but only 70 percent accurate at the city level in the United States. When the IP address can only be mapped to a country, some vendors may default to coordinates at the geographic center of that country. This can certainly present a problem if a shopper on the East Coast is referred to a "nearby" store in Kansas or if our analytics falsely indicate that a plurality of our customers are located there.
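One way to guard against this is to flag records whose coordinates match a known country-level default before they reach a store locator or a regional sales report. The following Python is a minimal sketch under stated assumptions: the `GeoRecord` fields, the `COUNTRY_DEFAULTS` table, and the coordinates in it are hypothetical and illustrative, not any vendor's actual schema or default point. In practice you would ask your provider which fallback coordinates it uses, or whether it supplies an explicit accuracy field.

```python
from dataclasses import dataclass
from math import isclose

# Hypothetical fallback points, keyed by country code. The US entry is an
# illustrative point in Kansas, not any real vendor's documented default.
COUNTRY_DEFAULTS = {"US": (39.8, -98.6)}


@dataclass
class GeoRecord:
    """One geolocation result; the field names are assumptions."""
    ip: str
    country: str
    lat: float
    lon: float


def is_country_default(rec: GeoRecord, tol: float = 1e-6) -> bool:
    """Return True if the coordinates equal the country's fallback point."""
    default = COUNTRY_DEFAULTS.get(rec.country)
    if default is None:
        return False  # no known fallback for this country, so nothing to flag
    return (isclose(rec.lat, default[0], abs_tol=tol)
            and isclose(rec.lon, default[1], abs_tol=tol))


def split_by_precision(records):
    """Separate records that look precise from likely country-level defaults."""
    usable, suspect = [], []
    for rec in records:
        (suspect if is_country_default(rec) else usable).append(rec)
    return usable, suspect


if __name__ == "__main__":
    # IP addresses come from the ranges reserved for documentation.
    sample = [
        GeoRecord("192.0.2.1", "US", 39.8, -98.6),         # fallback point
        GeoRecord("198.51.100.7", "US", 40.7128, -74.006),  # city-level fix
    ]
    usable, suspect = split_by_precision(sample)
    print(f"{len(usable)} usable, {len(suspect)} suspect")
```

Records in the suspect group are not necessarily wrong, since a real customer could be located at that point, but they should probably be excluded from "nearest store" suggestions and reported separately in geographic analyses.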
Understand Your Data's Limitations
The lesson here is that we should not assume data acquired from outside sources is always precise or even accurate. We need to know the limits of our data sources.
We all know that "garbage in yields garbage out," but we also need to recognize that imprecise data may yield imprecise analyses. Any decision to acquire data from commercial or public data sources should include an investigation of the quality, constraints, timeliness, underlying assumptions, and limitations of the data we are considering.
Although I have used the mapping of IP addresses to physical locations as an example, many of us have used other third-party data providers to augment the data in our data warehouses. Examples include census demographic data and third-party psychographic data.
We owe it to our users to understand the precision and accuracy of data obtained from both internal and external sources and to make them aware of any potential issues that could affect the validity of their results.
Michael A. Schiff is founder and principal analyst of MAS Strategies, which specializes in formulating effective data warehousing strategies. With more than four decades of industry experience as a developer, user, consultant, vendor, and industry analyst, Mike is an expert in developing, marketing, and implementing solutions that transform operational data into useful decision-enabling information.
His prior experience as an IT director and systems and programming manager provides him with a thorough understanding of the technical, business, and political issues that must be addressed for any successful implementation. With Bachelor and Master of Science degrees from MIT's Sloan School of Management and as a certified financial planner, Mike can address both the technical and financial aspects of data warehousing and business intelligence.