Getting Data Correct: Balancing Data Control and Business Demands
How do you balance the need for dependable data with the need for decision speed? Take these three steps.
By Charles Caldwell, Director of Solutions Engineering and Principal Solutions Architect, Logi Analytics
As a data warehouse practitioner, I've spent countless hours trying to deliver the ever-elusive "single version of the truth." In other words, I've been sourcing data, addressing quality issues, integrating disparate sources, enforcing key business rules, curating master data, and trying to get it all to happen consistently within a batch update window so the business has the information it needs to operate.
Although the tools and practices have improved and adapted to become more agile over the years, let's face it: adding a new data set to the warehouse doesn't happen fast if you put it through even the basics of these practices. Never mind that the request is likely queued up behind other requests the data warehouse team is handling.
What do you do? If you're a business analyst, you just bypass the data warehouse team and BYOD(ata). Not a problem, right? Sometimes you pull from the warehouse, sometimes the source systems, sometimes from a spreadsheet. Data issues? You fix them yourself. Need data enrichment? You enrich the data yourself. What's the worry?
There's validity to allowing analysts to BYOD. Some business decisions can't wait. For many questions, data that is "close enough" is good enough. The downside here is twofold: analysts often don't have the tools they need to efficiently deal with the data, and very often you wind up with duplicated effort among analysts and inconsistencies between data sets. You can never be sure if one analysis is really comparable to another. Did the number really change, or just the way the data was prepared?
How do you balance the need for dependable data against the need for decision speed? You need to:
- Define what correct means
- Manage the data maturity life cycle
- Enable analysts within the process
What is "Correct"?
First things first. Getting the data "correct" isn't a one-size-fits-all affair. The meaning of "correct" depends on your requirements. Different aspects of "correct" are trade-offs against each other, and very few data sets require hitting all of these aspects at once.
Getting the data correct can imply:
Accuracy: The number reported accurately reflects what happened. Accuracy tends to be a major concern for regulated reporting.
Timeliness: The amount of time that can elapse between an event and the associated data being reported. Timeliness tends to be a major concern when a lag in decision making has a significant effect.
Consistency: The number reported is consistent with source systems, related business systems, and numbers previously reported both in terms of reporting a consistent value and having a consistent definition. Any comparative analysis (year-over-year) demands consistency.
Quality: Quality can be thought of as a subset of accuracy, but it involves specialized concepts. Quality encapsulates such things as getting addresses correct and ensuring that fields follow business rules, data is not duplicated, and records aren't orphaned. If actions are directly affected by data quality, this becomes a higher concern.
Performance: Can I get a response to my question before I forget what my question was? As more people use the system, will performance scale? This is one of the largest efforts in warehousing: modeling and tuning the data to ensure performance.
Security: After working so hard to make data available, you also must make sure it isn't "too available." This ranges from helping users cut through the noise to preventing insider trading and other legal issues.
For any data set, everyone needs to be clear on what "correct" means. If the business doesn't require one of these attributes, don't impose it unnecessarily. It will slow you down in meeting requirements.
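One way to make that agreement explicit is to record, per data set, which aspects of "correct" the business actually asked for. The sketch below is illustrative only: the `Aspect` and `CorrectnessProfile` names and the sample data set are assumptions made for this example, not part of any particular tool.

```python
from dataclasses import dataclass, field
from enum import Enum


class Aspect(Enum):
    """The six aspects of "correct" listed above."""
    ACCURACY = "accuracy"
    TIMELINESS = "timeliness"
    CONSISTENCY = "consistency"
    QUALITY = "quality"
    PERFORMANCE = "performance"
    SECURITY = "security"


@dataclass
class CorrectnessProfile:
    """Which aspects the business requires for one data set (hypothetical structure)."""
    dataset: str
    required: set = field(default_factory=set)

    def not_required(self):
        """Aspects that should not be imposed, sorted by name for stable output."""
        missing = (a for a in Aspect if a not in self.required)
        return sorted(missing, key=lambda a: a.value)


# Hypothetical example: a regulated report that cares about accuracy,
# consistency, and security, but not about sub-second response times.
regulatory = CorrectnessProfile(
    dataset="quarterly_regulatory_report",
    required={Aspect.ACCURACY, Aspect.CONSISTENCY, Aspect.SECURITY},
)

skipped = [a.value for a in regulatory.not_required()]
print(skipped)
```

Writing the profile down, in whatever form, gives the warehouse team and the analysts a shared reference for what not to spend time on.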
Agility and Maturity
When trying to balance dependable data and decision speed, don't try to apply "too much correctness" to a data set when it's not required. Furthermore, don't try to get there all at once.
One of the key lessons I've taken from my agile programmer colleagues is the idea of a "minimum viable product." Basically, the idea is that you don't have to get to the final state all at once. You just need to deliver enough value now to meet the immediate requirement, and then improve the implementation in response to user feedback. What does this mean with data?
When working with new data sets, requirements are volatile. Analysts are still learning what they can do with the data. I take the requirements they have and I try to implement them with the least amount of overall "processing" in the initial pass. I will do as little transformation, data quality work, modeling, tuning, metadata building, and other data tasks as possible. I draw the shortest path from data source to initial view of the data for my end users and deliver it. Direct access to source system data? If it works for the initial requirement, yes.
You will need to help end users understand which data sets are "fully fledged" and which are "still baking," but the agility that results is worth it. As the requirements mature, I improve the implementation of the data set. In such an approach, you initially sacrifice aspects such as scalability or fully conformed dimensions but gain agility in meeting emerging business needs. Yes, you'll need to go back in future iterations to address scalability, conformance, etc., but you'll be more informed by actual usage when you do. You'll also feel better when the analysts see something for the first time and say, "Yeah, that wasn't it."
What do you do about business analysts bringing their own data? After all, we've been fighting off spreadmarts for decades now, right? Well, actually …
Business analysts become more data savvy every day. The warehouse team should focus on delivering the high-value, hard-to-deliver data sets that support analysts, and then get out of their way so they can use the tools and techniques they prefer to conduct their analyses. When needed, they should be able to bring their own data. In fact, they serve as a kind of R&D department for the warehouse team, identifying emerging requirements.
How do you avoid "bad data" from "untrusted sources"? With BYOD, the focus should shift away from the data to who is bringing the data. Business leaders, with some "advise and consent" from the BI team, should identify analysts whose skills, business knowledge, and past performance indicate they can be trusted to combine standard warehouse data with new data sources, advise business leaders on how the data can be used (given concerns such as accuracy), and deal with data at a sophisticated level. These analysts become "certified" and are given the authority (and responsibility) for authoring new data sets. As their data sets mature, we bring them into the warehouse.
What about the "uncertified" analysts? Can they create data? Sure they can, but the organization should be mature enough to recognize that the data set being presented should be treated with a higher-than-normal level of skepticism. It could still be valuable and lead to great insight, but it may need one of the certified analysts to follow up before anyone makes a major decision based on that information.
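To illustrate how an organization might surface that level of skepticism to the people reading a report, here is a minimal sketch. Everything in it is an assumption made for illustration: the `Maturity` labels borrow the "fully fledged" and "still baking" wording used earlier, and the `DataSet` record, the `review_guidance` function, and the analyst names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Maturity(Enum):
    STILL_BAKING = "still baking"
    FULLY_FLEDGED = "fully fledged"


@dataclass(frozen=True)
class DataSet:
    """Hypothetical catalog entry: who authored a data set and how mature it is."""
    name: str
    author: str
    maturity: Maturity


def review_guidance(ds, certified_analysts):
    """Return a short label telling readers how much scrutiny a data set needs."""
    if ds.maturity is Maturity.FULLY_FLEDGED:
        # Matured data sets have been brought into the warehouse.
        return "warehouse: use as-is"
    if ds.author in certified_analysts:
        # Certified analysts are trusted to author new data sets.
        return "certified author: usable, still maturing"
    # Anything else may be valuable but deserves extra skepticism.
    return "uncertified author: certified follow-up before major decisions"


# Hypothetical analysts and data sets.
certified = {"dana"}
churn = DataSet("churn_enriched", author="dana", maturity=Maturity.STILL_BAKING)
promo = DataSet("promo_spreadsheet", author="lee", maturity=Maturity.STILL_BAKING)

print(review_guidance(churn, certified))
print(review_guidance(promo, certified))
```

The point of the sketch is that trust attaches to the author and the maturity of the data set rather than to the data alone.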
Many organizations get stuck in either single-version-of-the-truth fervor or Wild West spreadmart hell, neither of which ultimately serves their needs. The trick is to balance high-quality data services with highly agile BYOD practices.
Charles Caldwell is the director of solutions engineering and principal solutions architect for Logi Analytics. In addition to delivering DW and BI projects, Charles has built high-quality technical teams in the BI space throughout his career, including building the Logi solutions engineering team from scratch. You can contact the author at firstname.lastname@example.org.