Sleep Well at Night: Abstract Your Source Systems
It’s odd that our industry has established a best practice for creating a layer of abstraction between business users and the data warehouse (i.e., a semantic layer or business objects), but we have not done the same thing on the back end.
Today, when a database administrator adds, changes, or deletes fields in a source system, it breaks the feeds to the data warehouse. Usually, source systems owners don’t notify the data warehousing team of the changes, forcing us to scramble to track down the source of the errors, rerun ETL routines, and patch any residual problems before business users awake in the morning and demand to see their error-free reports.
It’s time we get some sleep at night and create a layer of abstraction that insulates our ETL routines from the vicissitudes of source systems changes. This sounds great, but how?
Eric Colson, whose novel approaches to BI appeared two weeks ago in my blog “Revolutionary BI: When Agile is Not Fast Enough,” has found a simple way to abstract source systems at Netflix. Rather than pulling data directly from source systems, Colson’s BI team pulls data from a file that source systems teams publish to. It’s kind of an old-fashioned publish-and-subscribe messaging system that insulates both sides from changes in the other.
“This has worked wonderfully with the [source systems] teams that are using it so far,” says Colson, who believes this layer of abstraction is critical when source systems change at breakneck speed, like they do at Netflix. “The benefit for the source systems team is that they get to go as fast as they want and don’t have to communicate changes to us. One team migrated a system to the cloud and never even told us! The move was totally transparent.”
On the flip side, the publish-and-subscribe system relieves Colson’s team of having to 1) access source systems, 2) run queries on those systems, 3) know the names and logic governing tables and columns in those systems, and 4) keep up with changes in the systems. They also get much better-quality data from source systems this way.
However, Colson admits that he might get pushback from some source systems teams. “We are asking them to do more work and take responsibility for the quality of data they publish into the file,” says Colson. “But this gives them a lot more flexibility to make changes without having to coordinate with us.” If the source team wants to add a column, it simply appends it to the end of the file.
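To make the append-a-column idea concrete, here is a minimal sketch of what the subscribing side might look like. It is not Netflix's implementation: the article does not describe the file format, so the sketch assumes a comma-delimited file with a header row, and the column names (customer_id, plan, signup_date, region) and the function read_published_rows are hypothetical. The point it illustrates is that a load which looks up columns by name keeps working when the source team appends a new column.

```python
import csv
import io

# Hypothetical columns the warehouse load depends on (not from the article).
REQUIRED_COLUMNS = ["customer_id", "plan", "signup_date"]


def read_published_rows(handle, required=REQUIRED_COLUMNS):
    """Yield dicts holding only the required columns from a published file.

    Columns are looked up by header name, so extra columns appended by the
    source team are ignored rather than breaking the load.
    """
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in required if name not in header]
    if missing:
        # Fail loudly if the published contract itself is broken.
        raise ValueError(f"published file is missing columns: {missing}")
    for row in reader:
        yield {name: row[name] for name in required}


# Assumed sample data: the original file, then the same file after the
# source team appends a "region" column to the end.
v1 = "customer_id,plan,signup_date\n42,basic,2010-01-15\n"
v2 = "customer_id,plan,signup_date,region\n42,basic,2010-01-15,US\n"

rows_v1 = list(read_published_rows(io.StringIO(v1)))
rows_v2 = list(read_published_rows(io.StringIO(v2)))
```

Both versions of the file produce the same rows for the warehouse, so the appended column costs the BI team nothing until they choose to start using it. A file that drops a required column raises an error up front instead of loading bad data.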
This approach is a big mindset change from the way most data warehousing teams interface with source systems teams. The prevailing mentality is: “We will fix whatever you give us.” Colson’s technique, on the other hand, forces the source systems teams to design their databases and implement changes with downstream analysis in mind. For example, says Colson, “they will inevitably avoid adding proprietary logic and other weird stuff that would be hard to encapsulate in the file.”
Time to Deploy
Call me a BI rube, but I’ve always assumed that BI teams by default create such an insulating layer between their ETL tools and source systems. Perhaps for companies that don’t operate at the speed of Netflix, ETL tools offer enough abstraction. But it seems to me that Colson’s solution is a simple, low-cost way to improve the adaptability and quality of data warehousing environments that everyone can and should implement.
Let me know what you think!
Posted by Wayne Eckerson on February 8, 2010