Q&A: Agile Data Engineering Using Advanced Data Modeling Concepts
How data architects can solve age-old dilemmas with new techniques.
- By James E. Powell
- July 16, 2013
Traditional data modeling techniques have served us well, but data warehouse requirements are quickly evolving, and data architects must find a new way to keep up -- to be agile. We spoke to Ralph Hughes, chief systems architect at Ceregenics, Inc., about a new concept -- data engineering -- that can help. Ralph will be presenting his ideas in greater detail at the TDWI World Conference in San Diego (August 18-23, 2013). We spoke to him about what data engineering is and how it fits with agile business intelligence.
BI This Week: Ralph, your course at TDWI's World Conference in San Diego is called "Agile Data Engineering." The phrase "data engineering" will be a new term for many attendees. What is your basic definition of this concept?
Ralph Hughes: When following traditional DW/BI development approaches, data architects believe the warehouse's data model should be essentially complete and correct before ETL coding begins, yet it's nearly impossible to get an enterprise data warehouse (EDW) model complete at the start of a project. There's always so much left to discover about an organization’s true business requirements. Moreover, we can't get the data model perfect up-front because those business requirements will change over the many months it takes us to build the ETL for a given project.
Given those realities, it would be far easier if the data architects could instead:
- Model for what they know now
- Adapt the data model across the years and many DW releases as business conditions change
- "Automagically" morph the EDW's existing data to fit into the new data model
Over the past decade, I'd say, the fringe of the DW/BI community has developed some new data modeling techniques that allow us to work in this fashion economically. Building for what's needed now and evolving the warehouse as requirements change -- that's what I call "data engineering."
To put a finer point on this new concept, how is "data engineering" different from the data modeling or data architectural work most of us already practice?
Traditional data architecture and modeling practices strive to deliver a design that can be built once and remain largely static for many years. With that goal, data modelers must add many features to the data schemas intended to handle the rare cases that will break the warehouse design if they should ever occur. Such "future-proofing" is a level of technical perfection that adds much complexity and delay to DW/BI projects.
By contrast, look at the work of other types of engineers outside the software industry, such as mechanical and civil engineers. You quickly realize these physical engineers are not striving for technical perfection per se. Instead, they narrow the performance specifications their creations will support, and then balance technical considerations against crucial constraints such as time, material costs, and construction labor. They work to an economical solution.
Reflecting upon my 30 years of building data warehouses with large systems integrators, I would say considerations such as minimizing time and cost have not been given the weight they deserve. Instead of meeting current business needs, the data architects I've worked with have chosen to future-proof their designs, accepting delay and expense as necessary evils.
If we in the DW/BI profession would incorporate some of the mindset from physical engineering, we'd design for the problem at hand, and adapt that design if and when the problem space changes. The additional challenge for DW software engineers is that we have to roll the existing data forward into each new design, rather than throwing it away. "Data engineering," then, is data architectural techniques adapted to take advantage of the flexible data schemas that some new data modeling paradigms have made possible.
Given that data can now evolve, data modelers can bring cost and schedule constraints back into the mix of priorities. They can focus on providing economical designs that solve current BI requirements without having to invest even a tenth as much in future-proofing.
How does this notion of "data engineering" abet the concept of agile business intelligence?
Agile is all about constantly delivering value to the business customer. Agile data warehousing practitioners repeatedly deploy small improvements to a BI application every two or three months. These many increments build upon one another until soon an enterprise data warehouse emerges from the overall collection of online modules. We have achieved 3X accelerations in delivery speeds following this approach, which was born out of carefully reviewing traditional DW/BI development methods and eliminating all activities that do not directly contribute to a new increment of value to the customer.
Yet, even this super-fast delivery method only addresses the ETL development portion of a DW/BI project. We've still had to wait 12 months or longer for the data architects to provide a data model that was sufficiently future-proofed that it would be reasonably immune to expensive re-engineering costs every time business requirements changed.
Agile data engineering eliminates this wait time through either of two new "hyper data modeling" techniques that permit you to build out a warehouse's database a few objects at a time. These techniques eliminate the re-engineering cost of conversion scripts because they allow already-loaded data to be easily re-cast into a new design whenever the warehouse's underlying database tables must be redesigned.
Typical schema updates can be accomplished by restructuring, at most, a small table or two. Such capabilities eliminate the risks that used to force data architects to insist on near-perfect data designs before they would let a development team start coding ETL. We can now get a project started quickly, adapting the data as we learn about the business problem and as business conditions change. Fast starts and incremental deliveries provide steady value to the customer and delight our EDW business stakeholders. That's agile!
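The re-casting Hughes describes is easiest to picture in a hyper-normalized (Data Vault-style) schema, where a hub table holds only the business key and each group of attributes lives in its own satellite table. The sketch below -- hypothetical table and column names, using only Python's standard-library sqlite3 -- shows how a newly requested attribute arrives as a new satellite, leaving already-loaded data untouched:

```python
import sqlite3

# Hypothetical hyper-normalized (Data Vault-style) schema:
# the hub holds only the business key; attributes live in satellites.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE hub_customer (customer_key INTEGER PRIMARY KEY,
                           business_id  TEXT UNIQUE);
CREATE TABLE sat_customer_name (customer_key INTEGER,
                                name TEXT, load_date TEXT);
""")
db.execute("INSERT INTO hub_customer VALUES (1, 'CUST-001')")
db.execute("INSERT INTO sat_customer_name VALUES (1, 'Acme Ltd', '2013-01-01')")

# Months later the business asks for a credit rating. Instead of an
# ALTER TABLE plus a data-conversion script, we add a new satellite;
# existing hub and satellite rows are never touched.
db.execute("""CREATE TABLE sat_customer_credit (customer_key INTEGER,
              rating TEXT, load_date TEXT)""")
db.execute("INSERT INTO sat_customer_credit VALUES (1, 'AA', '2013-07-01')")

# Queries re-assemble the business entity by joining satellites onto the hub.
row = db.execute("""
    SELECT h.business_id, n.name, c.rating
    FROM hub_customer h
    JOIN sat_customer_name   n ON n.customer_key = h.customer_key
    JOIN sat_customer_credit c ON c.customer_key = h.customer_key
""").fetchone()
print(row)  # -> ('CUST-001', 'Acme Ltd', 'AA')
```

This is only an illustration of the general pattern, not Hughes's specific method: the schema change is purely additive, which is what makes small, incremental warehouse releases cheap.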
Your description of agile data engineering sounds like a very big topic for a single day's class. How are you planning on focusing the presentation, especially for attendees who are not sure they want to adopt any of the practices you'll be explaining?
The focus will be on comparing the four data modeling paradigms DW/BI practitioners must choose from before they design their next warehouse. The four data modeling approaches include the two traditional styles of standard normal forms and conformed dimensional forms, plus the two new styles of hyper-normalized and hyper-generalized forms. We will quickly review the traditional modeling styles, provide a medium-level introduction to the two new styles, and then detail the impact that each style has on the speed and expense of both EDW development and EDW re-engineering.
In particular, the class will revolve around estimating the level-of-effort required to solve four common data model changes that typically require expensive re-programming and/or conversion scripting:
- Promoting attributes to new entities
- Upgrading to a party model for customers, employees, and vendors
- Adding a new slow-change trigger
- Adjusting the granularity of a subject area
By contrasting the re-engineering costs of these types of changes across the four data modeling approaches, we'll equip attendees to return to their work sites ready to demonstrate to their colleagues where the new data modeling styles will save development time and money. They'll be able to provide a well-reasoned estimate of how large those savings will be.