Why All Data Is Not Created Equal
Although relational databases will continue to be used for high-speed online transaction processing, conventional relational databases have clearly exhausted their usefulness for business analytics. What’s ahead for database technology?
By Charlie Silver, CEO, Algebraix Data Corp.
Every enterprise relies on data to make decisions, as evidenced by the fact that the business analytics software market is now over $30 billion annually and growing rapidly. Yet the business analytics systems of nearly every Global 2000 company still rely on outdated relational database technology from the 1970s. Despite the challenges the classic relational data model imposes on scalability, performance, and manageability in the face of modern data volumes and applications, it remains the de facto standard for enterprises worldwide.
Until recently, suppliers of relational databases argued that they could handle all the data needs of any enterprise. That argument began to break down with the advent of XML and the proliferation of XML documents, because the XML format does not fit well into the relational data model.
Today, the preponderance of electronic data -- computer-generated documents containing combinations of audio recordings, graphics, images, numeric data, text, and video recordings, which together constitute more than 85 percent of all enterprise data -- does not fit well into the relational data model. In fact, the Web is now the world's largest heterogeneous database of this so-called "unstructured" data. Think of all the unstructured data accessible via the public Internet and private intranets that cannot be readily analyzed because it is not "structured" in the time-honored relational-data format.
As a result, at least 85 percent of today's business analytics investments analyze less than 15 percent of enterprise data. Although the way we use data is increasingly focused on analytics, relational databases have reached the limit of their usefulness as analytics tools. By some estimates, current growth rates suggest we will produce nearly 10 trillion gigabytes of data within the next three years. That's enough to store nearly 70 trillion hours of Flash video. Moreover, this unstructured data is growing at more than twice the rate of structured data.
The Dirty Little Secret of the Relational Database Industry
The dirty little secret of the relational database industry is that before data can be processed by a conventional relational database, it must be pre-structured into "relational-data tables" made up of rows and columns, much like spreadsheets. Unfortunately, the data stored in relational-data tables is not inherently searchable. Consequently, over the last 40 years, the conventional relational database has gone from an emerging technology to one that is rapidly becoming obsolete.
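To make that pre-structuring requirement concrete, here is a minimal, hypothetical sketch using Python's built-in sqlite3 module; the table and column names are invented for illustration, not taken from any particular system:

```python
import sqlite3

# Before any data can be stored, the relational model demands an
# up-front schema: every column's name and type must be declared.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales (
        order_id   INTEGER PRIMARY KEY,
        customer   TEXT,
        amount     REAL,
        order_date TEXT
    )
""")

# Only data that fits this row-and-column shape can be loaded...
conn.execute(
    "INSERT INTO sales (customer, amount, order_date) VALUES (?, ?, ?)",
    ("Acme Corp", 1250.00, "2011-06-01"),
)

# ...and queries can only ask about columns the schema anticipated.
for row in conn.execute("SELECT customer, amount FROM sales"):
    print(row)
```

Data that does not fit the declared rows and columns -- an audio clip, an image, a free-text memo -- has no natural home in such a table.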
As the amount of enterprise data continues to grow exponentially, so do the sizes and numbers of relational-data tables required to manage it. Moreover, as relational-data tables become larger, queries from analytic applications must scan through an increasing number of rows and columns to find requested data. Today we’ve come to accept that large enterprises need teams of IT professionals to deal with the burden of managing relational-data tables -- creating, loading, and tuning them.
As a consequence of these inherent limitations of conventional relational databases, including the burdensome task of managing the ever-growing complexity and number of tables, enterprises are being forced to hire increasing numbers of database administrators and to rely on a variety of workarounds as well.
As an example, for lack of better alternatives, many enterprises are being forced to embrace Hadoop, an open-source framework inspired by Google's MapReduce. These deployments are brutal, desperate attempts to gain control over mushrooming volumes of unstructured data, or so-called "Big Data." The process is painful and expensive because there is no query language and there are no standards -- everything developed is application-specific. This is a clear indication that conventional relational databases simply can't cope with Big Data and that the industry is desperately looking for new and better tools.
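For readers unfamiliar with what "application-specific" means in practice, the sketch below mimics the map-and-reduce pattern in plain Python for one narrow task, counting words; it is not Hadoop itself, and the input data is invented. Every new question requires new map and reduce code rather than a general query:

```python
from collections import defaultdict

# A toy map/reduce pipeline in the spirit of Hadoop-style processing.
documents = [
    "big data needs new tools",
    "relational tables strain under big data",
]

def map_phase(doc):
    # Emit (key, value) pairs: one (word, 1) per occurrence.
    return [(word, 1) for word in doc.split()]

def reduce_phase(pairs):
    # Sum the values for each key.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

intermediate = [pair for doc in documents for pair in map_phase(doc)]
print(reduce_phase(intermediate))   # e.g. {'big': 2, 'data': 2, ...}
```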
In 1999, Internet pioneer Tim Berners-Lee, together with the World Wide Web Consortium (W3C), unveiled the first Resource Description Framework (RDF) standard -- a radically new data model conceived as a way of facilitating access to any information available via the Web, independent of its format. Since its inception, the U.S. government has been a leader in fostering development of the new tools required to build data-management applications based on the RDF standard. As you can well imagine, the U.S. government faces huge obstacles in coping with its overwhelming amounts of unstructured data.
Look no further than the U.S. intelligence community, which must integrate information feeds from all over the world in a variety of disparate formats. RDF technology is also known as "triplestore" technology because every RDF statement is a triple of exactly three elements: a subject, a predicate, and an object. This emerging technology is still in its infancy, but many believe it will ultimately replace relational database technology.
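As an illustration of the triple model just described, the sketch below represents each statement as a subject-predicate-object tuple in plain Python and answers a simple pattern query. The facts and identifiers are invented, and a real deployment would use an RDF store and SPARQL rather than raw tuples:

```python
# Each fact is a (subject, predicate, object) triple -- the whole model.
triples = {
    ("report:42", "hasAuthor", "analyst:jones"),
    ("report:42", "hasFormat", "audio"),
    ("report:42", "mentions",  "city:kabul"),
    ("image:7",   "hasFormat", "jpeg"),
    ("image:7",   "mentions",  "city:kabul"),
}

def match(s=None, p=None, o=None):
    """Return every triple matching the given pattern (None = wildcard)."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

# "Which resources mention Kabul?" -- no schema had to be designed first.
print([subj for subj, _, _ in match(p="mentions", o="city:kabul")])
```

Because new facts are simply new triples, an audio report and a JPEG image can sit in the same store as numeric data without anyone redesigning a table.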
Relational Data Model vs. Triplestore Data Model
The relational data model was a breakthrough when it was conceived over 40 years ago. It provided a conceptual means to manipulate data (e.g., sales transactions) retrieved from a computer database. When application performance became a problem (which happened frequently), its relational-data tables and queries were restructured, more server hardware was added, or both. Unfortunately, because it was often difficult or impossible to design a single database to handle multiple applications with acceptable performance, this application dependence led to rampant data "silos" with different table and query structures. Because no thought was given to unstructured data when the relational data model was conceived, it was never designed to accommodate such data and, to no one's surprise, has proven highly ineffective for that task.
Unlike the relational data model, which forces all data to be pre-structured into a two-dimensional, row-and-column tabular format, the triplestore data model was designed to embrace any logical data format. In that sense, the triplestore data model was the first universal conceptual data model to be developed. Through the W3C, RDF became a worldwide standard in 1999. Although it has taken 12 years for it to begin to be employed in commercial data-management applications, and few companies currently use it commercially, the U.S. Department of Defense has decreed that, in the future, all of its data-management applications will be implemented using triplestore technology.
The triplestore data model is vastly different from the relational data model with which database professionals have become so familiar. Decades of time and investment have made people slow to accept that relational databases cannot do everything they would like them to do. It sometimes takes, and in this case it has taken, a long time for people to adjust to reality. A typical enterprise is not going to go too far out on a limb for a technology that isn't embraced by major suppliers such as IBM, Oracle, and Microsoft.
Oracle does offer a triplestore database product, but perhaps more promising is IBM's Watson system, an artificial intelligence system capable of answering questions posed in natural language. Few realize that Watson depends on triplestore technology for its predictive analytics capabilities. Enterprises are now beginning to develop commercial applications with triplestore technology. In IBM's case, it evidently believes that vertical applications such as financial analysis, evidence-based medicine, and government intelligence represent huge new business opportunities for its Watson systems -- i.e., for applications requiring triplestore databases.
Historically, those who came up the curve most rapidly on triplestore technology were people from the artificial intelligence (AI) community. The affinity between AI and triplestore technology is that triplestores enable people to make logical assertions and draw inferences from data -- which, of course, is what AI is all about. Triplestore technology gives enterprises a way to manage different types of data and structures, and it enables them to draw inferences from the data. IBM is doing precisely that with its Watson systems, thanks to a layer of predictive mathematics on top of a triplestore database.
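To illustrate the kind of inferencing the AI community values in triple data, here is a small, hypothetical sketch in plain Python with invented facts; production systems use dedicated reasoners, but the principle is the same: new triples are derived from asserted ones by applying a rule, in this case transitivity:

```python
# Asserted facts, as subject-predicate-object triples.
triples = {
    ("kabul",       "locatedIn", "afghanistan"),
    ("afghanistan", "locatedIn", "asia"),
}

def infer_transitive(facts, predicate="locatedIn"):
    """Repeatedly apply: if (a p b) and (b p c), then (a p c)."""
    inferred = set(facts)
    changed = True
    while changed:
        changed = False
        for a, p1, b in list(inferred):
            for b2, p2, c in list(inferred):
                if p1 == p2 == predicate and b == b2:
                    new = (a, predicate, c)
                    if new not in inferred:
                        inferred.add(new)
                        changed = True
    return inferred

# The rule concludes ("kabul", "locatedIn", "asia") even though that
# fact was never stated explicitly -- a logical inference drawn from data.
print(infer_transitive(triples) - triples)
```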
Algebra: The Missing Link
What is still needed to enhance triplestore databases is a technology that can effortlessly manage unstructured data, query large amounts of data at greater speeds, and perform logical inferencing, the process of deriving logical conclusions from premises known or assumed to be true.
Mathematics is the key to unifying data management across different data structures. By using advanced algebra to define and manipulate the relationships between data in disparate formats, the mathematical approach eliminates the time-consuming maintenance and performance problems associated with pre-structuring, importing, cataloging, indexing, and storing data in relational-data tables. Because the mathematical approach is also fully compatible with the relational data model, it enables simultaneous access to both structured and unstructured data and provides commercial enterprises with a non-disruptive path forward.
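The advanced algebra referred to here is proprietary, so the sketch below is only a loose, hypothetical illustration of the general idea rather than the actual technology: once a relational row and an RDF-style triple are both expressed as sets of mathematical tuples, ordinary set operations can query them side by side without reshaping either one. All names and values are invented:

```python
# A relational row, flattened into (row-id, column, value) tuples...
relational_facts = {
    ("order:1", "customer", "Acme Corp"),
    ("order:1", "amount",   "1250.00"),
}

# ...and an "unstructured" document description, already triple-shaped.
document_facts = {
    ("memo:9", "mentions",  "Acme Corp"),
    ("memo:9", "hasFormat", "audio"),
}

# Both collections are now just sets of tuples, so one set expression
# can span them: find every resource connected to "Acme Corp".
everything = relational_facts | document_facts
related = {s for (s, p, o) in everything if o == "Acme Corp"}
print(related)   # {'order:1', 'memo:9'}
```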
Analyzing enormous and rapidly increasing volumes of data from different sources in different formats, in time to make a difference in business operations, is practically impossible using conventional relational databases. Consequently, as data is collected over time, the gap between the amount of data collected and the amount that can be effectively analyzed continues to widen.
Although relational databases will continue to be used for high-speed online transaction processing, conventional relational databases have clearly exhausted their usefulness for business analytics. We’re now on the precipice of a major breakthrough in enterprise data management, and advanced algebra will be what pushes analytics across the Big Data frontier.
Charles Silver is the CEO of Algebraix Data Corp., which provides mathematically based data-management technology across the entire spectrum of computer data-management applications. He has more than 25 years of experience as a successful entrepreneur and can be contacted at [email protected].