Can Big Data and SQL Get Along?
Is the news of SQL's demise greatly exaggerated or should we be preparing for a post-SQL world?
By Alex Kodat, Senior Product Architect, Rocket Software
Anyone who works with databases or data warehouses understands the preeminent role of SQL in structuring database requests. However, recent press reports about NoSQL databases and non-relational big data sources raise questions about SQL's preeminence in databases . Is the news of SQL's demise greatly exaggerated, or should we be preparing for a post-SQL world?
In addressing this question, we must first consider what people mean when they say "SQL." Of course, they mean Structured Query Language, but they also usually associate SQL with the relational data model, a view of data as collections of tables (with rows and columns) and connections between the tables defined through relationships between columns in the tables -- relationships that are used to join the data in the tables during processing. In the relational model, the result of the join is usually an ephemeral table comprised of rows and columns. SQL is simply a standard language for manipulating data in relational tables.
This has proved to be a very powerful model over the years, and it's easy to see why. Most data with sufficient structure to be processed by computers (leaving aside AI applications) can be logically mapped into the relational model. If the data is to be algorithmically processed, there must be an explicit or implicit schema that defines the relationships between the data elements; the schema can be used to map data to a relational model. By providing a standard language to access relational data, SQL makes it possible for applications to access data in different databases with little or no database-specific code. This has been especially true for business intelligence (BI) and business analytics (BA) tools, most of which can access data from a wide variety of back-end databases.
The relational model often works well on a logical level but can be problematic on a physical level. For data that is largely accessed hierarchically, joining the parent rows to the child rows can be expensive. Picture an oil well row with hundreds of core samples or a customer row with hundreds of transactions. Such hierarchical data has created a cottage industry of hierarchical and multi-value databases that have been around as long -- or longer than -- relational databases, and this cottage industry has morphed into the next great thing: NoSQL databases.
Why NoSQL? By making the physical data explicitly hierarchical, NoSQL ensures that data usually retrieved at the same time will all be found close together. In old-school, non-relational mainframe databases, this meant that the data was often on the same physical block and the difference between doing five I/Os to display a screen versus 20 was critical. In the newer NoSQL databases, the difference in I/O counts is less significant than the fact that all the data associated with a request can be found on a single machine in a highly parallel, multi-machine cluster. For ultra-high load systems (think big, commercial Web sites), sufficient scalability can only be achieved with a high degree of parallelism, which is accomplished by sharding (partitioning of data among many physical machines, each with its own set of disks). It relies on the fact that most common requests can be satisfied by a single machine in the cluster.
It would be possible to accomplish sharding with a relational database, but it's actually much more challenging because the relational model organizes data by table. This makes it quite challenging to ensure that all the table data you might need for common requests is on the same machine. By simply putting all the data you are likely to need together in a single record or document, the ability to access the data from a single machine can be guaranteed.
However, why throw out the SQL baby with the physical relational bathwater with NoSQL databases? The answer is threefold.
- First, the overhead costs and time of translating relational SQL requests into physical hierarchical requests can be viewed as pointless overhead if the true underlying model is hierarchical.
- Second, by obscuring the underlying hierarchical structure, you do no favors for programmers writing transactional systems because they have to mentally translate the SQL requests to what's happening under the covers to know whether what they're doing is sensible and efficient. Inevitably, mistakes will be made. By making the DML for NoSQL databases match the underlying data, things are much clearer for the programmers.
- Finally, most modern applications are written with object-oriented programming languages that handle hierarchical data quite naturally. Forcing data from one hierarchical data structure to another through a tabular pipe adds pointless complexity to applications.
Does this mean that SQL is going away? Not at all: BI/BA tools will continue to drive SQL and relational views of data. Ultimately, these tools present most results on a two-dimensional screen that maps very naturally to the classic relational table view. It is possible to represent three-dimensional data on a flat screen, but the approaches tend to be difficult to generalize -- and beyond three dimensions, humans have a difficult time visualizing anything. Because there is no standard way of representing arbitrary hierarchical structures tabularly, tabular data will continue to be the representation form of choice for BI/BA tools.
In theory (and maybe even some day in practice), BI/BA tools could accept hierarchical data and transform it internally to their preferred tabular form, yet all this really accomplishes is to reduce transferred data volume, a reduction that can be accomplished by generic compression. It remains to be seen whether BI/BA tools will begin to do their own automatic tabularization of hierarchical data or depend on servers to do so. Some tools already allow users to manually specify mapping of non-relational data representations to relational tables.
Given that BI/BA tools will likely be working with tabular representations of data for a long time to come, and given that the question of which language is to be used for accessing tabular data is pretty much a settled issue, it's likely that BI/BA tools will be using SQL for the foreseeable future. This means that NoSQL databases will either have to provide SQL access to their data for BI/BA tools or depend on front ends to accomplish that. Indeed, this is already happening: the growth of Cassandra, MongoDB, HBase, or Hadoop, JDBC, and ODBC drivers is proof that SQL access to non-relational sources is proliferating. The growing realization, even among the most ardent NoSQL enthusiasts, that SQL access is critical (even to NoSQL databases) has led to the coining of the backronym NOSQL, meaning not "No SQL" but "Not Only SQL.
Alex Kodat is a senior product architect with Rocket Software. He has a long history with NoSQL databases, having spent many years as a developer and architect for Model 204, one of the original mainframe NoSQL databases. You can contact the author at [email protected]