The BI Implications of Highly Customizable Data
Although storage prices are dropping, you must still consider how your customized data set will work for self-service or ad hoc reporting, especially in a real-time environment.
By Sanjay Bhatia
Modern DBMS systems were designed in an era where someone was in charge of all data structures. A DBA, or sometimes a small committee of them, dictated what could be stored and how it should be structured.
At the time, data storage was unimaginably expensive by today's standards. Imagine a world where oil is $10,000 a barrel, and consider how carefully it might be controlled and utilized. In a world where billion-dollar companies could not even store as much data as you have on your cell phone, relational data structures represented efficient ways to store various types of data such as strings, dates, and numbers in a structured way that enabled quick look-up through indexing.
The idea of storing a new piece of information had to be considered very seriously because adding a column to a table would mean every single record would now take up more space, even if the new field for that record was blank (NULL). The incredible advantages of structured data start to disappear when you have low-density information. Additionally, arbitrarily adding new types of data was an involved process.
Today's knowledge workers are blessed to have access to technological resources that were once unimaginable. Storage capacity and available has exploded -- the scarcity is now in people who have technical skills to manipulate the data. But, the cloud brings with it the idea emerged that businesses could simply launch and deploy applications without any IT resources. Businesses quickly wanted the ability to store whatever information became relevant to the business. Without a committee of DBAs in the way, custom fields flourished and the business world probably won't be going back any time soon.
Lower storage prices mean that the overhead of custom fields is insignificant and you can have as many as you like.
I'm a strong proponent the SQL language even though I think the underlying data architectures of SQL database servers need an overhaul. An progress is coming from next-generation Big Data storage vendors delivering technologies such as Hadoop that support schema-less operations where every bit of data is treated as a custom field. The big challenge, however, is the engineering cost of working with such systems. Even basic aggregate operations require Java programmers, and many companies resort to making a relational copy of the Hadoop data on a nightly basis for their analytical needs.
Next-generation relational databases such as SQL 2012 integrate natively with Hadoop, giving you the best of both worlds. Just as hybrid cars prove the benefits of electric technology while keeping you compatible with gasoline infrastructure, SQL 2012’s hybrid approach offers easy access and provides the performance and flexibility of Hadoop. With such technologies emerging, NoSQL is transitioning from meaning “No SQL” to “Not Only SQL.”
SQL-Based Entity Attribute Value (EVA)
A more traditional variation of a hybrid approach that can be implemented without next-generation data storage is EVA. Essentially, this means creating tables that store the names of virtual entities and attributes. These long, skinny tables store data in a flexible manner and the PIVOT operations provided by the database let you convert them on-the-fly into more traditional data structures. This is the approach many enterprises use with Fusion, a product from Izenda, the company I work for.
The only disadvantage of this approach is somewhat limited scalability. The PIVOT operations create challenges when you have more than a few gigabytes of data.
Custom Field Tables
One of the best solutions to the custom fields challenges if you only have a few dozen fields per object is to create a single table with fields that get used in purpose-specific ways. So each system object gets linked to a single custom fields table that has names such as Field1 and Field2. Use software aliasing to display these in a context-specific way that hides the data names from the user. This approach works well for very large data sets containing billions of records and is very easy to manage and also very reportable.
A Final Word
When considering a data storage methodology, it’s important to consider the reporting and BI ramifications. As storage becomes more affordable, analyzing all that data is the true business cost. It’s important to consider how your data set will work in a self-service or ad hoc reporting environment, especially one that is real time.
Sanjay Bhatia is the founder and CEO of Izenda, which specializes in ad hoc reporting, embedded BI, and data virtualization for SQL data. Prior to founding Izenda, Mr. Bhatia worked in engineering and consulting positions at Radiant Systems, Trilogy, and Microsoft. You can contact the author at firstname.lastname@example.org.