Data Warehouse Telemetry: Measuring the Health of Your Systems
Just as medical professionals use a patient’s vitals to measure and manage their overall health and well-being, data professionals can use a series of data warehouse vitals to ensure its overall health.
- By Troy Hiltbrand
- July 31, 2023
When a patient visits a hospital or doctor’s office, one of the first things any medical professional does is measure the person’s vital signs. These metrics are generally accepted as indicators of the patient’s overall health. In a similar way, data engineers monitoring the data warehouse will use a set of vitals in the form of data warehouse telemetry to ensure that the system is performing optimally.
There are three data warehouse vitals that a data team can use to gauge the overall health and effectiveness of their analytics system: system performance, data quality, and source drift and performance.
At the heart of the data warehouse system, there is a pulse. This is the set of measures that indicate the system's performance -- its heartbeat, so to speak. This includes the measurements of system resources, such as disk reads and writes, CPU and memory utilization, and disk usage. These metrics are an indicator of how well the overall system is performing.
It is important to measure to make sure that these metrics do not go too low or too high. When they go too low, it is an indicator that the system has been oversized and resources are being wasted. When they go too high, it is an indicator that the system is undersized and resources are nearing exhaustion. As the resources hit a critical level, overall performance can grind to a halt, freezing processes and negatively impacting the user experience.
When a medical practitioner sees that a patient’s heart rate/pulse is too fast or too slow, they will provide several recommendations, including ongoing monitoring to see if the situation improves or changes to diet or exercise (for example, eliminating unhealthy practices such as too much caffeine, tobacco, or alcohol intake). When the situation warrants, they can also prescribe medication targeted at controlling the issue.
When data practitioners see out-of-threshold system metrics, they can continue to monitor the systems to see if things improve, optimize database structures and indexes, and eliminate harmful processes that could create performance bottlenecks and system deadlocks. If the situation is pervasive, they can procure faster and more efficient hardware, upgrading the overall system to bring it back to a healthy level of performance.
Another metric medical providers use to assess patient health is respiratory rate and blood oxygen saturation. It is critical for the body to intake oxygen, which then gets embedded in the blood and delivered to tissues and cells around the body. As oxygen is delivered to the body, it provides life-sustaining support. In addition to delivering the much-needed oxygen, the bloodstream carries harmful carbon dioxide the body must filter out and remove.
One of the critical components of a data warehouse architecture is an effective extract, transform, and load (ETL) layer. Within this ETL layer, data from a variety of different source systems is brought together, flowing on its way to the data warehouse. As part of this process, there are often impurities in the data. This can include duplicates, missing data, incorrect data types, invalid outliers, and other features that do not meet the standards set forth by your data warehouse governance model.
The purpose of the ETL process is to filter out these harmful errors, correct the data, and inject missing values. This ensures that the data flowing to the data warehouse is as clean as possible and provides accuracy for downstream analytics processing.
Measurements such as the number of defects detected and the number of defects corrected -- along with measurements representing data accuracy, consistency, and completeness -- provide an overall gauge of the data quality. These allow data warehouse teams to assess how healthy the data is that is coming into the data warehouse.
Just as a doctor will often have to put a patient on oxygen if the respiration rate is too low or the blood oxygen saturation levels are outside the normal threshold, a data warehouse team will have to instigate a special data quality project if the level of impurities in the data warehouse exceeds predefined limits. These efforts may include post-load clean-up on the data and optimizing the ETL processes to correct the data inflow to prevent future data quality problems.
Source Drift and Performance
The third metric medical professionals use is blood pressure. This is a measurement of overall cardiovascular health and is often used as a surrogate to assess the heart's strength and the arteries' effectiveness in delivering blood to the body.
In the data warehouse, data is often flowing from multiple operational systems, ranging from relational databases and unstructured NoSQL databases to messages from IoT devices. For the data to flow correctly, it is important that the sources are known and their structure defined. As more systems move to less structured NoSQL databases and IoT infrastructures, the format and structure of the data are likely to change or drift from the known definition. Such changes in upstream sources can block data from flowing downstream or create impediments as the data quality processes struggle to keep up.
Performing schema comparisons between a known model and the current structure of the source systems can pinpoint source drift. Capturing and monitoring the record counts being extracted from the source system and the record counts being entered into the data warehouse can be an indicator of the health of the flow of data through the ETL processes. Finally, measuring the data freshness in the data warehouse can indicate if the data is flowing in a timely manner. Slowing or gaps in this data flow can indicate source drift from the predefined structure or of source system performance issues.
A Final Word
As a living system, the data warehouse ecosystem relies on healthy sources, clean data, and the resources necessary to provide a responsive and reliable analytics environment for your end users. As a data warehouse professional, you must often diagnose problems in this ecosystem. Leveraging a system’s vitals, you can quickly identify and solve problems with a solid system of telemetry just as a medical practitioner uses vitals to identify a patient’s health challenges.
Troy Hiltbrand is the chief information officer at Amare Global where he is responsible for its enterprise systems, data architecture, and IT operations. You can reach the author via email.