Load First, Model Later -- What Data Warehouses Can Learn from Big Data
One simple change in approach can maximize the ROI of data warehouses.
By Jonas Olsson, CEO and Founder, Graz Sweden AB
Big data has captured the attention of everyone from the common man to the CEO to the data professional. The common man gets excited at big data's potential to solve some of the world's biggest problems, from healthcare to education to the environment. The CEO sees the power to tap into new opportunities for revenue and growth.
The data professional gets excited about big data for a more practical -- but equally critical -- reason, one that offers guidance on how to add flexibility to your existing data environment, and one that could greatly benefit data warehouses.
That is, big data is not just about the "big." In fact, I would argue that the key to big data's appeal is that it allows data to be loaded first and modeled later. This approach, "schema on read," is not new, but because it has never been the traditional method for implementing data warehouses, it certainly feels new.
Traditionally, data warehouses have been designed around the opposite principle, "schema on write," which is typically implemented through extract, transform, and load (ETL) processes. This approach requires a pre-defined data model, implemented as a fixed set of tables. The data being loaded is then mapped to that existing data model. If the data the user wants to load into the warehouse does not match the existing tables, something must change: the ETL process, the table structure, or both. These changes can be expensive and time consuming, especially for businesses that deal with complex data or work in a highly dynamic environment.
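To make the constraint concrete, here is a minimal sketch of schema on write, using Python's built-in sqlite3 module as a stand-in for a warehouse. The "orders" table, its columns, and the load_schema_on_write function are hypothetical names chosen for illustration, not part of any particular product. The point is only that a record with an unplanned field cannot be loaded until the model or the load process changes.

```python
import sqlite3

# Hypothetical pre-defined model: a fixed "orders" table with three columns.
EXPECTED_COLUMNS = ("order_id", "customer", "amount")


def load_schema_on_write(conn, records):
    """Load only records that match the pre-defined table; return the rest."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, "
        "customer TEXT NOT NULL, "
        "amount REAL NOT NULL)"
    )
    rejected = []
    for record in records:
        # The model is enforced at write time: any mismatch blocks the load.
        if set(record) != set(EXPECTED_COLUMNS):
            rejected.append(record)
            continue
        conn.execute(
            "INSERT INTO orders (order_id, customer, amount) VALUES (?, ?, ?)",
            (record["order_id"], record["customer"], record["amount"]),
        )
    conn.commit()
    return rejected


conn = sqlite3.connect(":memory:")
incoming = [
    {"order_id": 1, "customer": "Acme", "amount": 120.0},
    # A source system started sending a new field the model does not know.
    {"order_id": 2, "customer": "Globex", "amount": 75.5, "channel": "web"},
]
rejected = load_schema_on_write(conn, incoming)
loaded = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

In this sketch the second record is held back until someone alters the table or the load logic, which is the kind of change the paragraph above describes as expensive.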
Big data's "schema on read" approach is different -- and much more appealing. You load the data in its raw form and decide later what to extract and how to extract it. Because the data model is applied when the data is read, changes to the model can be addressed in a layer above the physical tables, which gives you much greater flexibility. As an added benefit, you don't need to narrow your selection of data up front, and you can, if your business requires it, support multiple data definitions, effectively applying parallel data models to the same set of physical data.
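A minimal sketch of the idea, again using Python's built-in sqlite3 and json modules as stand-ins: raw records are stored untouched, and each data model is just a mapping applied at read time. The table name raw_events, the functions load_raw and read_with_model, and the two example models are all hypothetical and exist only to illustrate parallel definitions over the same physical data.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# One physical table that stores each record as-is, with no model imposed.
conn.execute("CREATE TABLE raw_events (payload TEXT NOT NULL)")


def load_raw(conn, records):
    """Load first: store every record unchanged as a JSON document."""
    conn.executemany(
        "INSERT INTO raw_events (payload) VALUES (?)",
        [(json.dumps(record),) for record in records],
    )
    conn.commit()


def read_with_model(conn, model):
    """Model later: apply a mapping of field name -> extractor at read time."""
    rows = conn.execute("SELECT payload FROM raw_events ORDER BY rowid")
    result = []
    for (payload,) in rows:
        raw = json.loads(payload)
        result.append({name: extract(raw) for name, extract in model.items()})
    return result


load_raw(conn, [
    {"order_id": 1, "customer": "Acme", "amount": 120.0},
    {"order_id": 2, "customer": "Globex", "amount": 75.5, "channel": "web"},
])

# Two parallel data models over the same stored data (both hypothetical).
finance_model = {
    "order_id": lambda r: r["order_id"],
    "amount": lambda r: r["amount"],
}
marketing_model = {
    "customer": lambda r: r["customer"],
    "channel": lambda r: r.get("channel", "unknown"),
}

finance_view = read_with_model(conn, finance_model)
marketing_view = read_with_model(conn, marketing_model)
```

Here the record carrying the extra "channel" field loads without any change to the table, and adding or revising a model means editing a mapping rather than reloading data. A production warehouse would more likely express the models as views or a semantic layer; the dictionaries of functions are simply the smallest form that shows the principle.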
For businesses that don't have highly complex data or whose business environments are more static, the traditional approach to data warehousing will continue to work well. However, in today's "Internet-is-everywhere" economy, even the most hidebound, old-school industries are increasingly dynamic and complex.
Data warehouses have earned a reputation for being difficult to change and costly to deploy -- not just technologically and financially but also politically -- due to the time and budget resources required from various business units.
We can look at the traditional ETL data warehouse model -- or "schema on write" -- as one of the culprits. ETL has served data professionals well for many years, but with the increase in data environments' complexity, it should come as no surprise that people are looking for other solutions, and that big data is where they are looking for answers.
Nothing inherent in data warehouses prevents data professionals from using schema on read to make them lower in cost, easier to deploy, and more powerful.
The less restrictive schema-on-read data model lets companies collect data more freely without having to go back to the drawing board as data sources change. With it, data warehouse professionals no longer have to know in advance how they will use the data, because that step can be done, and repeatedly redone, later, as business dynamics dictate.
This simple change in approach can maximize the ROI of data warehouses -- already in place and a permanent budget line item in most companies -- making them a better solution than rushing headlong into new and unproven big data technologies and data management philosophies.
Jonas Olsson is the CEO and founder of Graz Sweden AB, a data warehouse software company. You can contact the author at firstname.lastname@example.org.