3 Ways NoSQL Can Be Part of Your Data Architecture
As big data becomes more commonplace, it is important for data management teams to be able to understand where NoSQL fits into their architectures.
- By Troy Hiltbrand
- August 5, 2022
In the past two decades, we have seen the term big data burst onto the scene. At one time in the mid-2010s, the term was everywhere, including in publications, presentations, and classrooms. Today, that flurry of excitement around the term has started to abate. It is not that the concepts of volume, velocity, and variety have gone away. It is just that now, these concepts are a de facto standard in data architectures. What used to be considered big data is now just considered data.
This means that today, data architectures must be a combination of different technologies, including data lakes, clouds, real-time streaming, and unstructured data containers, to support these different facets of functionality. Relying solely on the relational database as the core of the data architecture is a thing of the past.
As data architectures become more complex and multifaceted, business users want to be shielded from them. They look to the data management team to take care of the complexity that exists in the back end of this data jungle and provide data in a consistent, clean, and easy-to-access format that they can use for their everyday analysis.
One of the technologies becoming central to modern-day data architectures is NoSQL. NoSQL is a family of technologies that were specially designed to handle the volume, velocity, and variety of data needed to run your business. Within this NoSQL family, there are four different database types: wide column, key-value store, document store, and graph. Each is different in its architecture and strengths, but they share the commonality that each was built to fill a gap that existed in the relational database model.
One of the challenges with these technologies is that they are not always as end-user-friendly for performing analytics as were the relational database and spreadsheet interfaces of the past.
The challenge for data management groups is to incorporate the power of NoSQL in the right places in the data architecture while keeping the user experience clean and easy to use. Data management teams looking for where to leverage NoSQL can focus on three areas: data ingestion, data lakes, and automated insights.
Data Ingestion
Depending on your organization, your data management group might or might not have influence over which data stores are selected as part of the application development process. With options such as document-store databases and key-value databases, many application developers find these NoSQL databases are a better fit for their needs than the traditional relational database.
Developers often liken a key-value data store to the dictionary data structure found in many programming languages for storing groups of objects. In this data structure, each object is referenced by a unique key. A key-value data store functions the same way, but its distributed architecture takes it beyond what an in-process dictionary can do: it is massively scalable and durable across sessions and systems. It allows developers to expand systems beyond the limits of a single computer to a distributed cloud environment.
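To make the analogy concrete, here is a minimal sketch comparing an in-process dictionary to a key-value store, using Redis purely as one example; the connection details and key names are assumptions, not a recommendation of any particular product.

```python
import json

import redis  # Redis is used here only as one example of a key-value database

# In-process dictionary: each object is referenced by a unique key
sessions = {}
sessions["user:1001"] = {"name": "Ada", "cart_items": 3}

# Key-value store: the same key/object model, but distributed and durable
# (host, port, and key names are illustrative assumptions)
kv = redis.Redis(host="localhost", port=6379)
kv.set("user:1001", json.dumps({"name": "Ada", "cart_items": 3}))

# The value survives process restarts and can be read by other systems
restored = json.loads(kv.get("user:1001"))
```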
Similarly, they often look at document-store databases as more aligned with the object-oriented approaches that they use within their code. Being able to serialize and deserialize their objects into JSON or XML can appear to be a fast-track approach to getting their application code developed quickly. These databases also allow the developers to easily scale out their applications and harness the power of distributed computing.
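As a rough sketch of that serialize-and-store pattern, here is what it can look like against a document store, using MongoDB as one example; the database, collection, and field names are illustrative assumptions.

```python
from pymongo import MongoClient

# Connection string and names below are illustrative
client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

# An application object serialized directly as a JSON-like document,
# with no table mapping or schema migration required
order = {
    "order_id": "A-1001",
    "customer": {"name": "Ada", "tier": "gold"},
    "lines": [{"sku": "X1", "qty": 2}, {"sku": "Y9", "qty": 1}],
}
orders.insert_one(order)

# Deserialize straight back into the nested structure the code expects
doc = orders.find_one({"order_id": "A-1001"})
```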
As a data management team, you have the responsibility of ingesting all these varied sources and pulling them together into a unified view for your analytics users. This means that your ETL processes need to be able to effectively ingest data from these NoSQL databases, clean and augment it, and standardize it so it can be integrated and presented back to your business users and used to answer questions.
To address this, your data management team needs to be very familiar with the architectures associated with NoSQL databases and with the methods available to ingest from them. This requires that they expand their skills to include multiple query languages, depending on the environment. Some of these are similar to traditional SQL and some are more aligned with other programming or scripting languages.
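As an illustrative sketch of that ingest-and-standardize step, the snippet below flattens nested documents into a tabular shape that can be cleaned and loaded into the warehouse. It assumes documents shaped like the hypothetical order shown earlier; the connection details and field names are assumptions.

```python
import pandas as pd
from pymongo import MongoClient

# Pull documents from the NoSQL source (connection and names are illustrative)
orders = MongoClient("mongodb://localhost:27017")["shop"]["orders"]
docs = list(orders.find({}, {"_id": 0}))

# Flatten nested, varying documents into the consistent tabular form analytics users expect
flat = pd.json_normalize(
    docs,
    record_path="lines",                      # one row per order line
    meta=["order_id", ["customer", "name"]],  # carry parent fields onto each row
)

# 'flat' can now be typed, cleaned, and integrated like any relational extract
```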
In addition, it is critical for your team to understand the mechanisms available for change data capture. NoSQL databases are often described as schema-less or schema-on-read, meaning that the database does not enforce schema adherence. This allows the schema to vary from record to record. It becomes the responsibility of the ETL process to track and manage the schema and to identify ways to isolate incrementally changed records to avoid full data refreshes.
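When the source database does not offer a native change feed, one common fallback is a watermark-based incremental pull. The sketch below assumes each document carries an updated_at field and a known set of expected fields; both are assumptions made for illustration.

```python
from datetime import datetime, timezone

from pymongo import MongoClient

orders = MongoClient("mongodb://localhost:27017")["shop"]["orders"]

# Watermark from the previous ETL run; a real pipeline would persist this value
last_run = datetime(2022, 8, 1, tzinfo=timezone.utc)

# Pull only records changed since the last run instead of doing a full refresh
changed = orders.find({"updated_at": {"$gt": last_run}})

EXPECTED_FIELDS = {"_id", "order_id", "customer", "lines", "updated_at"}

for doc in changed:
    # Because the schema can vary from record to record, flag any fields not seen before
    drift = set(doc) - EXPECTED_FIELDS
    if drift:
        print(f"Schema drift in {doc.get('order_id')}: {sorted(drift)}")
```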
Data Lakes
Modern data architectures are pushing closer to the concept of ELT, where all data from a variety of sources gets loaded as quickly as possible into a data lake and then processed as needed. This means that you need a data store that has more versatility than a traditional structured relational database.
Using a NoSQL database as your data lake allows you to avoid being constrained by the structure of the incoming data. You can store everything at a high volume with the understanding that you will process it and apply the structure only when there is a business case. Key-value databases let you store all objects regardless of the format of the object as long as you have a unique key.
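A minimal sketch of that landing pattern is shown below, using an S3-style object store, which follows the same unique-key model; the bucket name and key layout are assumptions for illustration.

```python
import json
from datetime import datetime, timezone

import boto3  # an S3-style object store serves as the key-value landing layer in this sketch

s3 = boto3.client("s3")

def land_raw(source: str, entity: str, payload: bytes) -> str:
    """Store an incoming payload as-is; structure is applied later, only when needed."""
    # The key encodes provenance and arrival time so nothing is overwritten or lost
    key = f"{source}/{entity}/{datetime.now(timezone.utc):%Y/%m/%d/%H%M%S%f}.raw"
    s3.put_object(Bucket="my-data-lake", Key=key, Body=payload)  # bucket name is illustrative
    return key

# Anything can be landed regardless of format: JSON events, CSV extracts, log lines
land_raw("crm", "contacts", json.dumps({"id": 42, "email": "ada@example.com"}).encode())
```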
The data lake is not an ideal destination for business users with questions, but because it holds a wide variety of data from different systems, it is a treasure trove that can be processed and added to your data warehouse or mined by your data science groups to identify previously undiscovered patterns and insights.
Automated Insights
Data management teams are often expected to do more than just move data from one location to another. They are also responsible for augmenting the data along the data pipeline, either on their own or in concert with the data science and data engineering teams. In the end, users want the data that they need to answer real business questions. If the data from the source systems does not give them those answers, data enrichment in the pipeline can deliver data that will. A wide column database or graph database can be optimized for specific types of advanced analytics that deliver those needed enhancements to the data set.
Data sets with sparsely populated attributes can be very effectively managed in a wide column database. Problems such as text analysis -- where attributes are represented by unique word counts in the text -- can be efficiently stored and processed in this type of database. From these data sets, measures and metrics can be calculated and stored as pre-calculated insights in the data warehouse. From there, users can have easy access to this data for their analytics.
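As a rough sketch of that sparse representation, the snippet below turns each document into a row whose columns are the distinct words it contains; the sample texts are made up, and the mapping to a specific wide column product is left out.

```python
import re
from collections import Counter

documents = {
    "doc-1": "the quick brown fox jumps over the lazy dog",
    "doc-2": "data pipelines move data from source to warehouse",
}

# One row per document; one column per distinct word, holding its count.
# Only words that actually appear are stored, which is exactly the sparse
# shape a wide column database handles well.
rows = {
    doc_id: Counter(re.findall(r"[a-z]+", text.lower()))
    for doc_id, text in documents.items()
}

# rows["doc-1"]["the"] == 2; columns absent from a row are simply not stored
```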
Data sets that represent networks of concepts, ideas, or objects can be effectively managed in a graph database. Problems such as social network analysis can generate measures and metrics associated with the importance of specific nodes in the graph or the strength of relationships. These can then be stored in the data warehouse for future reporting and analysis.
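A minimal sketch of that kind of node-importance calculation is below, using the networkx library in place of a graph database's built-in analytics; the edge list is invented for illustration.

```python
import networkx as nx

# A small social graph: each edge represents an interaction between two users (made-up data)
G = nx.Graph()
G.add_edges_from([("ada", "grace"), ("ada", "linus"), ("grace", "linus"), ("grace", "alan")])

# Node importance via PageRank; a graph database would expose comparable algorithms natively
importance = nx.pagerank(G)

# Relationship strength via shared neighbors (Jaccard coefficient) for a pair of nodes
strength = list(nx.jaccard_coefficient(G, [("ada", "grace")]))

# These derived measures can then be loaded into the warehouse for reporting and analysis
```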
Moving data from a data lake or an operational source into one of these NoSQL databases can allow for automated processes to discover insights that would be virtually impossible in a relational database. Enriching the base data set with these insights can greatly enhance your users’ ability to perform their work.
Final Word
With the increased complexity of data architectures to support a new level of variety, volume, and velocity of data, data management teams need to understand where NoSQL fits. They need to understand how to balance the diverse nature of these technologies with the need to continue providing easy-to-consume, post-processed data. Data ingestion, data lake creation and management, and automated insights are three of the highest-value areas where your team can focus.