Building for Yesterday's Future
Are we building new information delivery systems for a future that isn't there?
- By Mark Madsen
- September 6, 2011
[Editor's Note: Mark Madsen is a keynote speaker at the TDWI World Conference in Orlando, October 30 - November 4, 2011. He'll be speaking about "The End of the Beginning: Looking Beyond Today's BI."]
"We can't solve problems by using the same kind of thinking we used when we created them." -- Albert Einstein
The challenges faced by organizations building and maintaining business intelligence systems are familiar: slow turnaround on projects, inflexibility, performance, scalability, development backlog, and an inability to integrate more advanced capabilities. These sound like the complaints about transaction processing in the 1990s. The difference is that our problems are largely self-inflicted.
We're building new information delivery systems for a future that isn't there. Our state-of-the-art environments are already becoming obsolete because our view is distorted by the lens of the past, showing us the future as it was years ago. That world of scarce computing resources and limited data is gone.
Computers today are ten thousand times more powerful than they were in 1994. Despite this fact, we're building systems using the same methods and tools we used then. Conventional BI approaches are road maps to the future envisioned when a cell phone was the size of a brick.
Our assumptions about systems and how they should be built are rooted in outdated perceptions. Over time, we become blind to the changes happening around us. Our industry progresses at a steady and predictable pace. The problem with this gradual pace is that we don't notice how much has changed.
We don't perceive the point when the sum of tiny changes becomes substantively different. Discontinuous change always looks obvious in hindsight but never when you're going through it. If you leave a place and come back fifteen years later, you see the total accumulation of minor changes for the massive shift it is.
In any other context, it would be silly to think that nothing would be different after fifteen years. Yet we're designing and building information systems based on assumptions even older than that, assumptions that are no longer valid.
Why We Do What We Do
The assumptions that underlie our methods and designs for analytic systems came from an old world that no longer exists. We need to assess these assumptions and adjust our perspective to the new realities.
Scarcity
Scarcity is the core assumption that affects the design and architecture of data warehouses and BI. It's rooted in the timesharing era, when every program had to compete for limited resources on a single-processor computer. Processor cycles, memory, and storage had to be carefully conserved because they were scarce and expensive.
This imposes design constraints on what is built and how it's built. The batch data extract and load model is an artifact of a time when systems waited to process data so they wouldn't interfere with interactive users. We're still using batch models at a time when almost all transaction systems operate in real time, around the clock.
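As a concrete contrast, here is a minimal Python sketch, with invented helper functions (extract_changed_rows, apply_to_warehouse) standing in for whatever extract and load tooling is actually in place: the batch version waits for a quiet window and moves a day of data at once, while the continuous version trickles changes in as round-the-clock sources produce them.

```python
import time
from datetime import datetime, timedelta

# Invented stand-ins for whatever extract and load tooling is in place.
def extract_changed_rows(since):
    """Return source rows changed after the given timestamp (stubbed here)."""
    return []

def apply_to_warehouse(rows):
    """Merge changed rows into the warehouse tables (stubbed here)."""
    pass

# The batch artifact: wait for a quiet window, then move a whole day at once.
def nightly_batch_load():
    cutoff = datetime.now() - timedelta(days=1)
    apply_to_warehouse(extract_changed_rows(since=cutoff))

# The alternative implied by round-the-clock sources: trickle changes in
# small increments as they appear, instead of deferring work to a window.
def continuous_load(poll_seconds=60):
    watermark = datetime.now()
    while True:
        rows = extract_changed_rows(since=watermark)
        if rows:
            apply_to_warehouse(rows)
        watermark = datetime.now()
        time.sleep(poll_seconds)
```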
Assuming that computing resources are scarce leads to premature optimization decisions. We design logical and physical data models to improve query performance, make queries static, and isolate unpredictable work using query governors or by offloading to data marts.
The avoidance of data redundancy in warehouses was necessary to preserve expensive central storage and I/O. A focus on storage leads to attempts to limit data volume by archiving prematurely, summarizing prematurely, and enforcing rigid normalized data models. It also limits the use and storage of derived data (for example, to support data mining models), pushing that activity to other systems to satisfy the constraints of the data warehouse.
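A small illustration of the trade-off, using invented order-line data: once detail has been summarized away to save storage, new questions and derived attributes can no longer be produced from what the warehouse kept.

```python
# Invented order-line detail, for illustration only.
detail = [
    {"date": "2011-09-01", "product": "A", "customer": "c1", "amount": 120.0},
    {"date": "2011-09-01", "product": "B", "customer": "c2", "amount": 80.0},
    {"date": "2011-09-02", "product": "A", "customer": "c1", "amount": 60.0},
]

# Premature summarization: daily totals save storage, but product and
# customer are gone, so a later question ("which customers buy product A?")
# cannot be answered from what was kept.
daily_totals = {}
for row in detail:
    daily_totals[row["date"]] = daily_totals.get(row["date"], 0.0) + row["amount"]

# Keeping the detail lets the same summary, plus derived data for mining
# (per-customer spend as a model feature), be produced on demand.
spend_by_customer = {}
for row in detail:
    spend_by_customer[row["customer"]] = (
        spend_by_customer.get(row["customer"], 0.0) + row["amount"]
    )

print(daily_totals)       # {'2011-09-01': 200.0, '2011-09-02': 60.0}
print(spend_by_customer)  # {'c1': 180.0, 'c2': 80.0}
```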
With resource constraints disappearing, performance and capacity problems should be in the past. That they are not highlights problems with the software architectures we're using and our design methodologies.
Clean Slate
Most data warehouse and BI methodologies assume that you start with no analysis systems in place. The methodologies were created at a time when information delivery meant reports from OLTP applications.
The reality today is that analytics projects don't start with a clean slate. Reporting and BI applications are common in different parts of the organization. Although this can make data requirements gathering simpler, it also creates a situation where legacy data needs must be supported, and the legacy may not be an old mainframe but a modern SaaS application.
Stability
Most methodologies were designed to guide building the initial warehouse, mart, or BI application. The methodologies were created by consulting organizations that rarely had to take ownership of the system once it was created, so they don't address how the system grows and evolves over time.
By not addressing evolution, the methodologies miss a key characteristic of analytics: it often drives decisions that change business processes. Process change means the business works differently, and new data will be needed. When someone solves a problem, they move on to a new problem. The work is never done because an organization is constantly adapting to changing market conditions.
Once established, a BI program is subject to an inflow of new data and information delivery requests. Dealing with the inflow is a common challenge for managers of business intelligence.
To meet the requests with a limited development team, we prioritize requirements based on ROI or effort. The developers focus on the most important shared needs. This creates a growing queue of small requests that are rarely completed because bigger problems move ahead faster. For the manager, it's like being nibbled to death by guppies instead of being eaten by a shark.
This is the same development backlog problem that IT managers have struggled with for years, only now it's related to analytic systems. The backlog is being compounded by faster decision cycles requiring more real-time data, particularly for customer analysis and interaction.
As information delivery and analytics grow in importance, methods and architecture must evolve to keep up.
The Data Warehouse as a Platform Rather Than an Application
Business intelligence methods and architecture assume that what's being built is a single system to meet all data needs. We still think of analytics as giving reports to users. This ignores what they really want: information in the context of their work process and in support of their goals. Sometimes reports are sufficient; sometimes more is needed.
The interaction model for BI delivery is that a user asks a question and gets an answer. This only works if they know what they are looking for. Higher data volumes, more sophisticated business needs, and high-performance platforms require that BI be extended to include advanced analytics. These answer "why" questions that can't be answered by the simple sums and sorts of BI.
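A toy example of the difference, with invented churn data and scikit-learn used purely for illustration: a sum answers the "what" question, while a small model points at the inputs that drive the outcome, which is the beginning of a "why" answer.

```python
# Illustrative only: invented churn data, scikit-learn chosen for convenience.
from sklearn.tree import DecisionTreeClassifier

# Each row: [support_calls, tenure_months, monthly_spend]; label: churned?
X = [[5, 3, 20], [0, 36, 55], [7, 2, 15], [1, 24, 60], [6, 4, 25], [0, 48, 70]]
y = [1, 0, 1, 0, 1, 0]

# The "what": a simple sum tells us how many customers churned.
print("churn count:", sum(y))

# A hint at the "why": fit a small model and see which inputs drive it.
model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
for name, weight in zip(["support_calls", "tenure_months", "monthly_spend"],
                        model.feature_importances_):
    print(name, round(weight, 2))
```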
The data warehouse has evolved to the point where it needs to provide data infrastructure and support information delivery by other applications, rather than trying to be both the infrastructure and the delivery application itself. Data infrastructure requires a focus on longer planning horizons, stability where it matters, and standardized services. Information delivery requires meeting specific needs and use cases.
Design methods today seldom address the need to separate data infrastructure from delivery applications. Designs focus on data management and fitting the database to the delivery tools. This leads to IT efforts to standardize on one set of user tools for everything, much like Henry Ford tried to limit the color of his cars to black.
The new needs and analysis concepts go against the idea that a data warehouse is a read-only repository with one point of entry. They do not fit with established ideas, tools, and methodologies.
Today, the tight coupling of data, models, and tools via a single SQL-based access layer prevents us from delivering what both business users and application developers need. The data warehouse must be split into data management infrastructure that can meet high-performance storage, processing, and retrieval needs, and an application layer that is decoupled from that infrastructure. This separation of storage and retrieval from delivery and use is a key concept for data warehouse architectures as business and technology move forward.
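One way to picture that separation in code, as a hypothetical sketch rather than a prescribed design: delivery applications depend on a narrow service contract, so the storage layer, schema, and SQL behind it can change without breaking them. All names and calls here (MetricService, fetch_metric, the connection object) are invented for illustration.

```python
from abc import ABC, abstractmethod

class MetricService(ABC):
    """The stable contract the data infrastructure layer exposes."""
    @abstractmethod
    def metric(self, name: str, period: str) -> float:
        ...

class WarehouseMetricService(MetricService):
    """One implementation; schema and SQL stay hidden behind the contract."""
    def __init__(self, connection):
        self._conn = connection  # hypothetical warehouse connection object

    def metric(self, name, period):
        # The query text and physical data model can change freely here
        # without touching any delivery application.
        return self._conn.fetch_metric(name, period)  # hypothetical call

class Dashboard:
    """A delivery application: it knows the service, not the storage."""
    def __init__(self, metrics: MetricService):
        self._metrics = metrics

    def render(self):
        return f"Q3 revenue: {self._metrics.metric('revenue', '2011-Q3')}"
```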