NoSQL Hashing: How Couchbase Does It
How do NoSQL databases distribute data in order to provide low-cost elastic scalability?
- By William McKnight
- August 9, 2016
In my last Upside article, I discussed a function of NoSQL databases that made them unique and offered options that relational databases cannot. That function was tunable consistency, and I used Cassandra as the example.
Today, I will continue that theme by taking a look at how the NoSQL database Couchbase distributes data across its nodes that allows NoSQL databases to provide low-cost elastic scalability.
A Flexible Document Database
Couchbase is a NoSQL database in the document category, although it started life as a key-value database and can still be used that way. It features JSON storage -- XML and other data types are also possible.
JSON can make every document (the NoSQL term for equivalent relational record) unique and fixes the sparse data problem common to relational databases, whereby missing or irrelevant fields are padded as nulls or with dummy values. It also easily allows documents to be "interleaved" in a meaningful fashion, as when an order document is followed by all of its line-item documents.
The primary purpose of a document database is to enable read and write operations that meet high performance requirements without the need for an identifier or key in order to access the data, which is needed in key-value database queries. Indexes are possible in Couchbase in B-Tree and other styles.
Many of the founders came from Memcached. Memcached is itself a NoSQL database, but it is mostly known today as the in-RAM memory store in widespread use on the Internet. Couchbase inserts will first go into Memcached and later in the background write to the disk asynchronously, decoupled completely from your action.
Clusters and Buckets
Couchbase provides a vast amount of metrics on its cluster, which, of course, must be populated. Like other NoSQL databases, Couchbase has an algorithm for randomizing the distribution.
Documents live in data buckets, which live on the cluster. The buckets are logical partitions of the cluster. There are 1024 virtual buckets (vBuckets) on a cluster, regardless of size. If there are 512 nodes in the cluster, there are 2 vBuckets per node. If there are 128 nodes, there are 8 vBuckets per node, and so on.
In Couchbase, every document has a Document ID -- a UTF-8 string up to 250 bytes. Using the CRC32 hash algorithm, the Document ID is resolved to one of the 1024 virtual buckets, which dictates which section of which node the document is written to. Document IDs always get hashed to the same virtual bucket, regardless of location. For example, "William" will map to virtual bucket 592 everywhere.
Virtual buckets do not have a fixed physical server location, nor are they modifiable. The mapping between the virtual buckets and the physical server is called the Cluster Map. Ultimately, with good hashing, each virtual bucket will contain 1/1024 of the data set.
Hashing across a scale-out cluster is another of the many points of differentiation between SQL and NoSQL databases. Hopefully, seeing how one leading NoSQL database ¬¬-- Couchbase -- manages the mapping gives you a stronger sense of how you might use NoSQL databases.
About the Author
McKnight Consulting Group is led by William McKnight. He serves as strategist, lead enterprise information architect, and program manager for sites worldwide utilizing the disciplines of data warehousing, master data management, business intelligence, and big data. Many of his clients have gone public with their success stories. McKnight has published hundreds of articles and white papers and given hundreds of international keynotes and public seminars. His teams’ implementations from both IT and consultant positions have won awards for best practices. William is a former IT VP of a Fortune 50 company and a former engineer of DB2 at IBM, and holds an MBA. He is author of the book Information Management: Strategies for Gaining a Competitive Advantage with Data.