NoSQL and Hadoop: Document-Based versus Relational Databases
NoSQL and Hadoop have overlapping capabilities, but they are not competitors. We examine seven features that differentiate a NoSQL document database from a relational database.
- By Sachin Sinha, Mehul Shah
- January 10, 2017
If the term "big data" has been bandied about your organization as something that should be further explored, then you've likely also heard about Hadoop and NoSQL. Both technologies are closely associated with big data, and they overlap in parts of their architecture and functionality. For example, both are well suited to managing large and rapidly growing data sets, and both handle a variety of data formats by following the schema-on-read design paradigm.
Both can leverage commodity hardware and scale horizontally, also referred to as scaling out. Contrast this with scaling up, in which you upgrade your existing servers with more powerful hardware, as is typical with traditional relational databases. With regard to data formats, both technologies can manage structured, semistructured, and unstructured data.
With these overlapping capabilities, it might seem that NoSQL and Hadoop are direct competitors, right? Not exactly. Although each technology is great for handling big data, they are intended for different types of workloads, and the simplest way to distinguish them is by the workloads they handle best. Hadoop is good for analytics and historical-archive use cases, whereas NoSQL shines in operational workloads, where it complements relational databases.
NoSQL databases started their journey as key-value stores; document/JSON and graph databases joined them later. Although the simplicity of key-value stores made them popular, people increasingly asked for more when it came to storing complex, hierarchical data of the kind typically represented in JSON or XML. That gave rise to the document-oriented database.
Today, document-oriented databases are one of the main categories of NoSQL databases. The central concept of a document-oriented database is a hierarchical document. Although each document-oriented database differs in its internal implementation, in general they all assume data is encoded in JSON or a JSON-like format. These databases offer several advantages over relational databases, notably schema flexibility, high availability, and data distribution across the nodes of a cluster.
Below are the key features that differentiate a NoSQL document database from a relational database.
High availability: Document databases achieve high availability by distributing data horizontally across the nodes of a cluster, which generally lets them offer better availability SLAs than their relational counterparts.
Consistency: Document databases usually lean toward relaxed consistency models, so reads may lag behind the most recent writes. It's a classic CAP (consistency, availability, and partition tolerance) theorem tradeoff in which you get higher availability and horizontal scaling in return for looser consistency. Most document databases provide either strong or weak consistency. An exception is the cloud-based Azure DocumentDB, which provides four consistency models: strong, bounded staleness, session, and eventual. This gives application builders more choices.
Partitioning/sharding: Data in document databases is typically partitioned across nodes by a partition key, using a hash- or range-based approach. This allows for storing and managing large data volumes at scale. It also spreads read and write load across the cluster, allowing for much higher throughput than a single-server relational database can usually deliver.
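To make the hash-based approach concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any particular product assigns partitions: the function names `shard_for` and `partition`, the `customer_id` field, and the choice of MD5 as the stable hash are all hypothetical.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a partition key to a shard index using a stable hash.

    hashlib is used instead of the built-in hash() because the built-in
    string hash is randomized per process and would not be repeatable.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition(docs, key_field, num_shards):
    """Distribute documents into num_shards lists by hashing one field."""
    shards = [[] for _ in range(num_shards)]
    for doc in docs:
        shards[shard_for(str(doc[key_field]), num_shards)].append(doc)
    return shards


# Hypothetical order documents keyed by customer.
orders = [{"customer_id": f"c{i}", "total": 10 * i} for i in range(20)]
shards = partition(orders, "customer_id", num_shards=4)
```

Because the shard is derived from the key alone, every read or write for a given customer goes to the same shard. Note that a plain modulo scheme like this one moves most keys when the shard count changes; real systems generally use more elaborate schemes to limit that reshuffling.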
Data model flexibility: This is one of the hallmarks of document databases: they allow developers to model the data much as it appears in their application's objects, which greatly reduces the object-relational impedance mismatch. In addition, schema-on-read enables faster development, reducing overall time to market. In an ever-changing business landscape, it becomes even more important that the data model can evolve easily while application development time stays reasonable.
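A small Python sketch may help show what schema-on-read means in practice. The sample documents, the field names, and the `read_customer` helper are all hypothetical; the point is only that documents with different shapes can be stored side by side, and the reader decides how to interpret missing fields.

```python
import json

# Two documents in the same collection with different shapes.
# Nothing forced them to share a schema when they were written.
raw_documents = [
    '{"id": 1, "name": "Ada", "address": {"city": "London"}}',
    '{"id": 2, "name": "Grace", "emails": ["grace@example.com"]}',
]


def read_customer(raw: str) -> dict:
    """Apply the schema at read time, defaulting any missing fields."""
    doc = json.loads(raw)
    return {
        "id": doc["id"],
        "name": doc["name"],
        "city": doc.get("address", {}).get("city"),  # None if absent
        "emails": doc.get("emails", []),              # empty list if absent
    }


customers = [read_customer(raw) for raw in raw_documents]
```

Adding a new field later requires no migration of existing documents; only the reading code needs to know how to handle documents that lack it.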
Querying capability: Document databases provide multiple ways to query the data, ranging from simple REST operations such as GET/PUT/POST/DELETE to SQL-like queries. Some (such as MongoDB) allow secondary indexes. Azure DocumentDB indexes all the properties of a document by default and provides a rich, SQL-like query language that supports many familiar ANSI SQL operations, whereas MongoDB has a rich ecosystem of developer tools that aid faster delivery and easy adoption.
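As a rough illustration of what a secondary index buys you, the following Python sketch builds an in-memory lookup from a field value to document IDs. It is a conceptual model only and does not reflect the internals of MongoDB or any other product; the `build_index` helper and the sample documents are hypothetical.

```python
from collections import defaultdict

# Documents stored by primary key (the document ID).
documents = {
    "d1": {"name": "Ada", "city": "London"},
    "d2": {"name": "Grace", "city": "New York"},
    "d3": {"name": "Alan", "city": "London"},
}


def build_index(docs, field):
    """Map each value of `field` to the set of document IDs that hold it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        if field in doc:  # documents without the field are simply skipped
            index[doc[field]].add(doc_id)
    return index


city_index = build_index(documents, "city")

# Look up by city without scanning every document.
london_ids = sorted(city_index["London"])
```

Without the index, answering "which customers are in London?" means examining every document; with it, the lookup goes straight to the matching IDs, at the cost of keeping the index up to date on every write.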
Transaction support: Document databases are usually weak in this area and follow the BASE model (basically available, soft state, eventual consistency) rather than offering full ACID guarantees. Many of them provide transaction support within a collection only. Generally they don't provide transactions across collections. If they do, it's at the cost of higher latency for reads and writes.
Elastic scale: Cloud-based document databases provide elastic scale to meet the growing demands of the application. Both storage and computing resources can be scaled to provide additional capacity. Both AWS DynamoDB and Azure DocumentDB provide elastic scale that can be controlled programmatically, so applications can scale out during peak hours and scale back to the regular capacity when the extra resources are no longer needed.
Some of the big-name offerings currently on the market can be broadly classified into two high-level categories:
On-premises: Cassandra, MongoDB, CouchDB
Public cloud: AWS DynamoDB, Azure DocumentDB, and Google Cloud Datastore
It should be noted that the on-premises products can also run in the cloud, mostly on infrastructure-as-a-service. The native cloud-based offerings such as AWS DynamoDB and Azure DocumentDB operate in platform-as-a-service mode, with very low maintenance overhead and a lower total cost of ownership (TCO).
A Final Word
We have just scratched the surface of document-based databases to help readers understand how they differ from relational databases and where they have the advantage. Many new cloud-based and mobile-based applications are adopting a polyglot persistence strategy, in which an application uses more than one kind of data store. An application can use a relational database alongside a document-based database to handle these new scenarios. We want to call out the advantages of native cloud-based document databases such as Azure DocumentDB and AWS DynamoDB because, in our view, they have a much lower TCO than on-premises counterparts such as MongoDB and Cassandra.
Whether on-premises or cloud only, document-based NoSQL provides features, functionality, and flexibility that were unavailable a few years ago. We believe this is the new frontier in big data for operational workloads, one that is bound to expand many times over in the coming years.
Sachin Sinha is director of big data analytics at ThrivON. In this role, Mr. Sinha is responsible for design of innovative architectures, development of methodologies, and delivery of solutions in big data, analytics, and data warehousing that help clients realize maximum value from their data assets. For over 15 years, Mr. Sinha has designed, architected, and delivered big data, data warehousing, and business analytics solutions. Specializing in data engineering and architecture, Mr. Sinha's domestic and international consulting portfolio includes a broad array of organizations in the healthcare, financial services, insurance, pharmaceutical, and energy domains. You can contact the author at [email protected].
Mehul Shah is a senior solutions architect for Microsoft. He engages and consults with business and technical leaders of large enterprise customers and partners on cloud strategy and architecture, digital transformation, data and analytics strategy, architecture, design, and data program/project planning. He has over 15 years of experience in information management and has successfully led, managed, and executed enterprise information management projects for commercial and government organizations. He earned an MBA in marketing and analytics and an MS in computer science from the University of Maryland. You can contact the author at [email protected] or visit his blog at mehulshah008.blogspot.com.