Data Warehousing: Remembering the Basics
The data warehouse concept has continued to evolve over the past two decades and now includes a variety of architectures such as data marts, operational data stores, data warehouse appliances, analytic databases, and federated and virtual implementations.
- By Mike Schiff
- January 28, 2010
The data warehouse concept has continued to
evolve over the past two decades and now includes a variety
of architectures such as data marts, operational data
stores, data warehouse appliances, analytic databases, and
federated and virtual implementations. Its functionality
has also grown to encompass real-time data feeds, master
data management repositories, operational business
intelligence, and the inclusion of both structured and
unstructured data.
As the evolution continues, we must not lose sight of
the early concepts that served as the catalyst for the
growth of the data warehousing industry. In particular,
Bill Inmon's classic definition that "a data warehouse is a
subject oriented, integrated, nonvolatile, time variant
collection of data in support of management's decisions"
(Building the Data Warehouse," John Wiley & Sons,
Inc., 1992).
One of my concerns is that some of us are losing sight
of the fourth characteristic of the definition, the time-
variant collection of data. This is what allows us to make
period-to-period comparisons and enables us to spot trends
that raise issues that may need attention. I am especially
concerned when I hear some vendors deliver the message
that a data warehouse may no longer be needed since their
technology allows organizations to access and consolidate
data from multiple operational systems without having to
first extract the data and store it in another database in
a data warehouse. There are several potential problems
with this approach. Among these: the absence of historical
values and data inconsistency among the systems being
consolidated.
Operational vs. Analytical Systems
In general, operational systems contain current data
values such as inventory on-hand, the availability of a
seat on an airline flight, a customer's outstanding
balance, or which vendors supply a specific manufacturing
part. Although many operational systems will also contain
some historical data, this is not their primary design
purpose. These systems were, by definition, designed to
help operate the organization, not necessarily to help
analyze it. For example, although an order entry system
will contain detailed data about current year sales to a
given customer, it may only contain limited (and likely
highly summarized) data about last year's sales to that
customer, and perhaps no sales data for earlier years.
Reconciling Data Among Multiple
Systems
Additionally, as many operational systems were designed
to serve specific functional areas, they may not all use
the same data definitions and value lists. For example,
the same customer may be represented by different customer
numbers in different operational systems. Many view the
process of cleansing and reconciling data (for example,
unit-of-measure, value list, and format conversions) from
multiple systems before loading it into a data warehouse as
a necessary burden, but I consider it to be a act of
purification. It ensures that those utilizing the data
warehouse are not comparing apples to oranges.
Historically, operational systems, in particular legacy
applications, were designed to meet specific functional
needs and did not necessarily conform to corporate data
standards (assuming data standards even existed when the
systems were developed). Even when organizations made the
investment to develop enterprisewide data standards, they
frequently did not have the luxury of reworking existing
operational systems to conform the these standards. These
organizations could, however, ensure that when data from
these systems was loaded into a data warehouse, the data
was transformed so that data residing in the warehouse
conformed to these standards.
Federation Has Its Place
This is not to say that a federated approach to data
warehousing does not have its place in an overall data
warehouse architecture, especially if the potential
limitations are recognized and understood. One possible
use could be to access operational data in disparate
systems to prototype potential reports and analyses for
"one-time requests" prior to making a decision to
incorporate additional data into a data warehouse.
Furthermore, if an organization has matured to the point
of having established (and more importantly enforcing)
corporate data standards, it may be possible to utilize a
federated approach that includes a data warehouse to access
historical values in conjunction with operational systems
to access the most current values. At the very least, this
approach should be evaluated against near-real-time data
feeds when implementing a data warehouse solution for
analyses that requires both immediate and historic
values.
About the Author
Michael A. Schiff is founder and principal analyst of MAS Strategies, which specializes in formulating effective data warehousing strategies. With more than four decades of industry experience as a developer, user, consultant, vendor, and industry analyst, Mike is an expert in developing, marketing, and implementing solutions that transform operational data into useful decision-enabling information.
His prior experience as an IT director and systems and programming manager provide him with a thorough understanding of the technical, business, and political issues that must be addressed for any successful implementation. With Bachelor and Master of Science degrees from MIT's Sloan School of Management and as a certified financial planner, Mike can address both the technical and financial aspects of data warehousing and business intelligence.