Data Reliability Engineering: What You Need to Know to Get Ready
As the importance of data to the enterprise continues to grow, organizations are building a new collection of tools and techniques to ensure they have the highest quality data.
- By Kyle Kirwan
- February 24, 2023
It’s a cliché by now that “data is the new oil,” but the fact that the phrase is overused doesn’t make it any less true. Modern enterprises run on data in much the same way traditional cars run on gasoline. If you put bad fuel (such as watered-down gasoline) into a car, it runs poorly -- if at all. Enterprises using bad data suffer the same fate.
Enterprises need a reliable, scalable process for keeping data clean, for three reasons:
- Mission-critical apps require a lot of clean, reliable data. Everything from inventory management and financial planning to product recommendation and support bots depends heavily on good data. If there’s a data outage, the business will suffer as a result.
- Many of an enterprise’s operations are automated, especially where data is involved. Pipelines must be reliable because human beings are often no longer part of the process. Machine learning models, streaming data, dashboards -- there’s no analyst spot-checking the data (which is a good thing, because analysts have more important work to do).
- Data engineers are in short supply. These highly skilled professionals are hard to find, expensive to hire, and difficult to retain. As a result, data teams are often small, which means enterprises must automate problem detection and resolution as much as possible so these valuable engineers don’t spend all their time firefighting.
Data quality is already an established practice area for data professionals, with its own collection of tools and techniques. So what are organizations doing to ensure that their operations are getting clean “fuel”?
Some data quality engineers are taking lessons from the work software engineers have already done. After all, though software outages still occur, software is far more reliable than it was two decades ago, when MIT Technology Review’s cover story was “Why Software Is So Bad.” Users have come to expect that the software they use throughout the day will simply work. In fact, it’s surprising when it doesn’t. That’s in part because the software industry follows the principles of site reliability engineering (SRE).
Data reliability engineering (DRE) borrows from these principles to improve data quality, ensure data moves with good performance, and guarantee that applications such as AI and analytics, which depend heavily on data, receive clean input.
However, although software engineers have a wide array of mature SRE tools, data tools that would enable DRE are only now starting to come onto the market. As a result, data reliability work usually means spot checks, late-night backfills, and hand-wiring SQL queries into Grafana monitoring. It’s not a repeatable, scalable process.
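To make that toil concrete, here is a minimal sketch of the kind of hand-rolled check described above: a freshness query run by hand against a hypothetical orders table. The table, columns, and six-hour threshold are illustrative assumptions, not details from this article.

```python
# A hand-rolled freshness check -- the sort of one-off script that gets run
# manually or wired into monitoring by hand. Table, columns, and threshold
# are hypothetical.
import sqlite3
from datetime import datetime, timedelta, timezone

# Stand-in for a real warehouse connection; an in-memory table keeps the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, loaded_at TEXT)")
conn.execute(
    "INSERT INTO orders VALUES (1, ?)",
    (datetime.now(timezone.utc).isoformat(),),
)

# The SQL an engineer might otherwise paste into a dashboard panel.
(last_loaded_raw,) = conn.execute("SELECT MAX(loaded_at) FROM orders").fetchone()
last_loaded = datetime.fromisoformat(last_loaded_raw) if last_loaded_raw else None

# If the table hasn't loaded recently, someone gets paged and starts a backfill.
if last_loaded is None or datetime.now(timezone.utc) - last_loaded > timedelta(hours=6):
    print("ALERT: orders is stale")
else:
    print(f"orders last loaded at {last_loaded.isoformat()}")
```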
DRE Tools and Best Practices
The DRE framework for data is still emerging, so tools and best practices are still evolving. However, some of the primary principles include:
- Manage risk. Eventually, systems fail. The DRE team needs to mitigate that risk and prepare for the inevitable day when something goes wrong.
- Monitor pervasively. Teams can’t fix problems they can’t see, so alerting and monitoring are crucial for giving teams the visibility they require.
- Set quantifiable standards. It’s not enough to just say the data is high quality; that’s a subjective claim. Teams need concrete service level indicators, objectives, and agreements (SLIs, SLOs, and SLAs) -- a minimal sketch of putting SLIs and SLOs to work follows this list.
- Remove “toil” wherever possible. All that grunt work that’s required to operate a system? That’s toil, and you want to remove as much of it as possible so the engineering team can focus on improving existing systems and building new ones.
- Automate. Automation is the key to eliminating toil. Skilled engineers are expensive and hard to recruit. Automation can fill the gaps and help your team become far more productive and valuable.
- Control releases. Engineers can’t improve a system without changing it, but every change introduces the risk of breaking it. Data teams can manage that risk with most of the same methods used in SRE and DevOps, such as CI/CD pipelines.
- Work to keep it simple. The more complex a system is, the more likely it is to fail or become unreliable. Complexity can’t be completely eliminated, but a mindset of continually simplifying complex systems where possible can go a long way towards making a system cheaper to maintain and more reliable for everyone.
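As a rough illustration of what quantifiable standards and automation can look like together, the sketch below defines two hypothetical SLOs (a freshness limit and a null-rate limit), measures the corresponding SLIs, and reports any breaches. The events table, metric names, and thresholds are assumptions for the example, not a prescribed implementation.

```python
# A minimal sketch of expressing data quality as SLIs checked against SLOs.
# Table name, columns, and thresholds are hypothetical.
import sqlite3
from datetime import datetime, timezone

# Hypothetical SLOs: the objectives the team agrees to meet.
SLOS = {
    "freshness_hours": 6.0,      # data must be no more than six hours old
    "null_rate_user_id": 0.01,   # at most 1% of rows may be missing user_id
}

def measure_slis(conn: sqlite3.Connection) -> dict:
    """Compute the current service level indicators from the warehouse."""
    (last_loaded,) = conn.execute("SELECT MAX(loaded_at) FROM events").fetchone()
    age_hours = (
        (datetime.now(timezone.utc) - datetime.fromisoformat(last_loaded)).total_seconds() / 3600
        if last_loaded
        else float("inf")
    )
    nulls, total = conn.execute(
        "SELECT SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END), COUNT(*) FROM events"
    ).fetchone()
    return {
        "freshness_hours": age_hours,
        "null_rate_user_id": (nulls or 0) / total if total else 1.0,
    }

def check_slos(slis: dict) -> list[str]:
    """Return a description of every SLO breach; an empty list means the objectives are met."""
    return [
        f"{name}: {slis[name]:.3f} exceeds objective {limit}"
        for name, limit in SLOS.items()
        if slis[name] > limit
    ]

if __name__ == "__main__":
    # Tiny in-memory demo so the sketch runs end to end.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id INTEGER, loaded_at TEXT)")
    conn.execute(
        "INSERT INTO events VALUES (42, ?)",
        (datetime.now(timezone.utc).isoformat(),),
    )
    breaches = check_slos(measure_slis(conn))
    print("All SLOs met" if not breaches else "\n".join(breaches))
```

In practice the same pattern scales out: each SLI becomes a scheduled query, and breaches feed an alerting system rather than a print statement.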
DRE is still a young concept, and professionals from a wide array of organizations are collaborating to define best practices and standard tools so that DRE becomes as effective as DevOps and SRE.
About the Author
Kyle Kirwan is the co-founder and CEO of Bigeye. In his career, Kirwan was one of the first analysts at Uber. There, he launched the company's data catalog, Databook, as well as other tools used by thousands of their internal data users. He then went on to co-found Bigeye, a Sequoia-backed startup that works on data observability. You can reach Kyle on Twitter or LinkedIn.