Balancing Static and Dynamic Data Models for NoSQL Data Systems
To get the most out of a NoSQL database, you must understand the best way to balance the advantages of static and dynamic data models.
- By David Loshin
- September 26, 2016
In a relational database management system (RDBMS), the structure of the data to be represented must be defined before the input is loaded into the tables. In most, if not all, cases, this requires significant forethought in data modeling. Business process analysis helps the data modeler identify the conceptual entities, their attributes, and their relationships. The model is defined (and iteratively refined) before the application is developed.
Although analyzing the business process to create a robust data model is a worthwhile investment, a major drawback of this approach is that it becomes difficult to adjust the model after the data is loaded. In essence, the model is static.
You might be able to execute simple changes to the model (such as adding a new table that is related using an existing foreign key), but adding a new column or modifying the data type of an attribute may require a drastic step. You may need to dump the existing database, instantiate the new model, and modify the data integration routines to ensure that the added attribute values are properly loaded.
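To make the dump-and-reload step concrete, here is a minimal sketch using Python's built-in sqlite3 module. The customer table, its columns, and the sample row are hypothetical. SQLite is used only because it has no statement for changing a column's data type, so a type change there follows the copy-and-reload pattern; other relational systems handle such changes differently.

```python
import sqlite3

# Hypothetical starting model: zip code stored as an INTEGER.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, zip INTEGER)"
)
conn.execute("INSERT INTO customer (id, name, zip) VALUES (1, 'John Smith', 2134)")

# Changing zip to TEXT means building the new model, copying the
# existing rows into it, and swapping it in for the old table.
conn.executescript(
    """
    CREATE TABLE customer_new (id INTEGER PRIMARY KEY, name TEXT, zip TEXT);
    INSERT INTO customer_new (id, name, zip)
        SELECT id, name, printf('%05d', zip) FROM customer;
    DROP TABLE customer;
    ALTER TABLE customer_new RENAME TO customer;
    """
)

rows = conn.execute("SELECT id, name, zip FROM customer").fetchall()
print(rows)
```

Even in this tiny case, every existing row has to be rewritten, and any load routine that inserted integer zip codes would need to be updated to match.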
NoSQL Allows Dynamic Modeling
Freedom from this static modeling approach is one of the appeals of NoSQL databases. Most provide greater flexibility in the structure of data representations by using "tagged" data attributes.
For example, consider a customer relationship management database in which customer data is being populated from a set of data sources. One data source provides birthdate information, and another provides data about magazine subscriptions, but not every customer is represented in both sources.
If, for customer John Smith, there is a record in the birthdate data source, then we can add a new BirthDate attribute and value to John Smith's record. If there is no record of any magazine subscriptions, then there is no need to even have any SubscribesTo attributes in the record, let alone a field named SubscribesTo that remains unpopulated.
In other words, the NoSQL approach allows for a completely dynamic data model.
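The John Smith example can be sketched with plain Python dictionaries standing in for tagged records. This is an illustration only: the source contents, the second customer, and the merge_attribute helper are hypothetical, and a real NoSQL store would supply its own API for the same idea.

```python
# Each customer record is a set of tagged attributes (name -> value).
customers = {"John Smith": {"Name": "John Smith"}}

# Hypothetical data sources; not every customer appears in both.
birthdate_source = {"John Smith": "1975-04-12"}
subscription_source = {"Mary Jones": ["Example Monthly"]}


def merge_attribute(records, source, attribute):
    """Add `attribute` only to the records that the source covers."""
    for name, value in source.items():
        record = records.setdefault(name, {"Name": name})
        record[attribute] = value


merge_attribute(customers, birthdate_source, "BirthDate")
merge_attribute(customers, subscription_source, "SubscribesTo")

print(customers["John Smith"])
print(customers["Mary Jones"])
```

John Smith's record gains a BirthDate attribute and never receives a SubscribesTo key, while the second record has the opposite shape. Neither record carries an empty placeholder field.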
Examples of a Data Model Definition
You can define a NoSQL data model ahead of time (like the RDBMS approach), as data is acquired and inserted into the database (based on inferences made at data acquisition), or through a combination of both approaches.
For example, a NoSQL data model can be defined in relation to attributes specified in a structured input file, such as labeling the attributes using column headers provided in a CSV (comma-separated values) file.
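As a rough sketch of the CSV case, the following uses Python's standard csv module to label attributes from the header row and to omit any attribute whose value is empty. The file contents and the records_from_csv function are hypothetical.

```python
import csv
import io

# Hypothetical input file: the header row supplies the attribute names.
csv_text = (
    "Name,BirthDate,SubscribesTo\n"
    "John Smith,1975-04-12,\n"
    "Mary Jones,,Example Monthly\n"
)


def records_from_csv(text):
    """Build one tagged record per row, skipping empty values."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {attr: value for attr, value in row.items() if value}
        for row in reader
    ]


records = records_from_csv(csv_text)
for record in records:
    print(record)
```

The model here is whatever the header row says it is, so a file with an additional column would produce records with an additional attribute and no code change.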
Alternatively, a text analytics application can scan unstructured streaming traffic reports and extract locations, types of traffic events (such as a car crash), times, severity of the hazard or slowdown, etc. It can then create new traffic log records that are dynamically attributed based on the data gleaned from the analyzed report.
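A real text analytics application would be far more sophisticated, but a toy sketch shows the shape of the result. The keyword lists, the regular expressions, the attribute names, and the sample reports below are all assumptions made for illustration, not part of any particular product.

```python
import re

# Hypothetical vocabularies; a real system would use richer extraction.
EVENT_TYPES = ("car crash", "road closure", "stalled vehicle")
SEVERITIES = ("minor", "moderate", "severe")


def extract_traffic_record(report):
    """Return a record holding only the attributes found in the report."""
    text = report.lower()
    record = {}
    for event in EVENT_TYPES:
        if event in text:
            record["EventType"] = event
            break
    for severity in SEVERITIES:
        if re.search(rf"\b{severity}\b", text):
            record["Severity"] = severity
            break
    time_match = re.search(r"\b(\d{1,2}:\d{2})\b", text)
    if time_match:
        record["Time"] = time_match.group(1)
    # Assumes locations appear as capitalized words following "on".
    location_match = re.search(r"\bon ([A-Z][\w-]*(?: [A-Z][\w-]*)*)", report)
    if location_match:
        record["Location"] = location_match.group(1)
    return record


full = extract_traffic_record(
    "Severe car crash on Main Street at 17:45, two lanes blocked."
)
sparse = extract_traffic_record("Stalled vehicle reported near exit 12.")
print(full)
print(sparse)
```

The first report yields a record with four attributes and the second yields a record with only an event type, so each log record is attributed according to what its report actually contained.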
The challenge is to determine the best way to balance the use of a static model with the ability to evolve a model dynamically. No matter what, the application developer must have some understanding of the business process. Even in our traffic example, foreknowledge of a traffic event's possible characteristics guides the creation of new log records.
Agile Approach Can Refine the Data Model
All of this suggests that when using a NoSQL data environment, you should adopt an agile approach to data modeling that leverages cycles of definition and implementation.
Begin with a basic assessment of the business process to identify the key entities and their relationships. Next, develop the application using the predefined model and see whether there is any variation in the data sources that might warrant augmenting the model. Adjust the model by modifying the data integration routines to dynamically add attributes or relationships that were not addressed in the original model.
This cycle can be repeated as the existing data sources are consumed as well as when new data sources are identified for integration. The result will be a continuously refined data model that provides increasing precision without demanding a complete database refresh for every change.
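One pass of this cycle might be sketched as follows. The refine_model function, the sample batch, and especially the min_count threshold (an attribute must appear in at least that many records before it is adopted) are assumptions introduced for illustration; the right adoption rule depends on the business process.

```python
from collections import Counter


def refine_model(model_attributes, incoming_records, min_count=2):
    """Return the augmented model and the set of newly adopted attributes."""
    unseen = Counter(
        attr
        for record in incoming_records
        for attr in record
        if attr not in model_attributes
    )
    # Adopt only attributes seen often enough to look like real variation.
    additions = {attr for attr, count in unseen.items() if count >= min_count}
    return model_attributes | additions, additions


# Predefined model from the initial business process assessment.
model = {"Name", "BirthDate"}

# Hypothetical batch from a newly integrated data source.
batch = [
    {"Name": "John Smith", "SubscribesTo": "Example Monthly"},
    {"Name": "Mary Jones", "SubscribesTo": "Example Weekly", "ShoeSize": 9},
    {"Name": "Ann Lee", "BirthDate": "1980-01-30"},
]

model, added = refine_model(model, batch)
print(sorted(model))
print(sorted(added))
```

Running the same function on each new batch or source repeats the cycle: the model grows where the data shows consistent variation, and one-off attributes can be reviewed before they are adopted.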
David Loshin is a recognized thought leader in the areas of data quality and governance, master data management, and business intelligence. David has written prolifically about BI best practices through the expert channel at BeyeNETWORK and in numerous books on BI and data quality. His valuable MDM insights can be found in his book, Master Data Management, which has been endorsed by data management industry leaders.