Data Models Will Be Beautiful Again
Data modeling has fallen out of favor with the rise of big data, but the context provided by modeling will be critical to successful AI applications and algorithm-based decision making.
- By Barry Devlin
- November 22, 2016
Data modeling has been having a tough time as the IT industry has pivoted to big data. With big data's external sourcing and so-called unstructured form, data models seemed less relevant. In the new NoSQL world, data modelers struggled to apply tools and techniques grounded in a relational database mindset.
Schema-on-read is the antithesis of data models' diligent analysis and structuring performed long before the first field was committed to disk. Even in the relational world, speed and agility of delivery have undercut the role of the traditional data modeler.
The Value of Modeling
This hiatus in modeling thought and application has been most unfortunate. What is data modeling other than a search for and exploration of meaning in information? In an environment where data is coming into enterprises from more diverse and ill-defined sources, in ever-greater volumes, and at higher speed, understanding its meaning and deciphering its structure is of the utmost importance.
Data scientists complain that they spend 80 percent of their time preparing data for analysis. Their plight has led to the emergence of data wrangling and structure discovery tools. However, despite their value in "munging" data, most lack the theoretical foundation that entity-relationship (ER) data modeling has provided as far back as Peter Chen's seminal 1976 paper, "The Entity-Relationship Model -- Toward a Unified View of Data."
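To make the idea of structure discovery concrete, here is a minimal sketch of what such a tool does at its simplest: scanning semi-structured records and inferring a crude schema. The function name and the sample records are my own invention for illustration; real wrangling tools go much further (type coercion, nested structures, provenance).

```python
# Toy structure discovery: infer a crude schema from semi-structured records.
# Illustrative sketch only -- not the API of any actual wrangling tool.

def infer_schema(records):
    """Map each field name to the set of value types observed for it."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema

records = [
    {"id": 1, "name": "Alice", "age": 34},
    {"id": 2, "name": "Bob"},                 # missing field
    {"id": "3", "name": "Carol", "age": 29},  # inconsistent type
]

print(infer_schema(records))
# {'id': {'int', 'str'}, 'name': {'str'}, 'age': {'int'}}
```

Note what the output cannot tell you: that "id" should uniquely identify a customer, or that "age" means years. That semantic layer is exactly what a data model supplies and mechanical discovery does not.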
Data is a representation of the real world, and it is from that real world that an understanding of all data must emerge.
A Multilevel Visual Approach
How that can happen is the subject of Thomas Frisendal's new book Graph Data Modeling for NoSQL and SQL. He starts from the psychological premise that data modeling is the exploration and discovery of meaning and structure -- and thus requires a largely visual approach -- and notes that traditional models look too much like relational tables. They are engineering artifacts closer to physical database design than to the maps of meaning a business person could understand and use creatively.
Frisendal is reiterating an older concept: a multilevel architecture for data modeling. Although I discussed five levels in Data Warehouse -- from Architecture to Implementation in 1997, he opts for a simpler three-level approach -- conceptual, logical, and physical -- and proposes that new representations are needed at the top two levels.
These representations are based on directed graphs: concept models offer new foundations for discussing meaning with the business, and property graphs expose identity and uniqueness of elements at the logical level. This is vital work that precedes and drives physical database design irrespective of the type of database or store, SQL or NoSQL, that will be implemented.
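A minimal sketch may help readers picture what a logical-level property graph holds: nodes with a label, an enforced identity, and arbitrary properties, plus typed, directed edges. The class and the sample data are hypothetical illustrations, not the notation from Frisendal's book or any particular graph database's API.

```python
# Minimal directed property graph: nodes carry a label and properties;
# edges carry a type and a direction. Node identity is enforced as unique,
# reflecting the identity/uniqueness concerns of the logical level.

class PropertyGraph:
    def __init__(self):
        self.nodes = {}   # node_id -> (label, properties)
        self.edges = []   # (source_id, edge_type, target_id)

    def add_node(self, node_id, label, **properties):
        if node_id in self.nodes:
            raise ValueError(f"duplicate identity: {node_id}")
        self.nodes[node_id] = (label, properties)

    def add_edge(self, source, edge_type, target):
        self.edges.append((source, edge_type, target))

g = PropertyGraph()
g.add_node("c1", "Customer", name="Acme Corp")
g.add_node("o1", "Order", total=99.50)
g.add_edge("c1", "PLACED", "o1")
```

The same structure can later be mapped onto relational tables, a document store, or a native graph database, which is the sense in which this level of design is implementation-agnostic.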
It might be argued that graph modeling favors popular NoSQL implementations, just as ER representations mirror relational tables. However, there is much to be said for the visual simplicity of these new representations as a basis for business engagement and high-level, implementation-agnostic design. Data scientists, in particular, would be well advised to examine the approach for their data preparation and structuring needs.
Algorithms Need Data Modeling
Today the information world is undergoing another radical transformation -- applying algorithms and artificial intelligence (AI) to decision making. A renaissance in data modeling is sorely needed.
Algorithms and AI approach the meaning of data by developing models based on data content, largely independent of human interpretation or knowledge of business usage. Although they are driving speed and agility in the automation and augmentation of decision making, these approaches may also create barriers between the information, its context, and real business meaning.
An early indication of what may be in store for business-oriented decision support lies in how one of the early success stories of algorithm-driven analysis turned sour. Google Flu Trends was introduced in 2008. It was based on the premise that the number of searches on Google for a set of phrases related to influenza was correlated with the onset of flu symptoms among the searching population. In its first two years of operation, this correlation provided accurate trending predictions of the regional spread of flu some two weeks earlier than traditional reporting by doctors to the Centers for Disease Control and Prevention.
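The premise can be illustrated with a toy calculation: measure the linear correlation between search volume and later case counts. All numbers below are invented for illustration; this is not Google's method or data, merely a demonstration of how a strong correlation can exist with no model of why.

```python
# Toy illustration of the Flu Trends premise: flu-phrase search volume
# correlating with reported cases two weeks later. Numbers are invented.

import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (statistics.pstdev(xs) * statistics.pstdev(ys) * len(xs))

searches = [120, 150, 310, 480, 460, 300]   # weekly flu-phrase search counts
cases    = [ 10,  14,  33,  52,  49,  30]   # flu cases reported two weeks later

r = pearson(searches, cases)
print(round(r, 3))  # close to 1.0: strong correlation, but no model of *why*
```

A correlation this strong is seductive, yet it says nothing about search behavior itself -- media coverage of flu, for example, can inflate searches without any change in infections, which is precisely the kind of drift that undermined the predictions.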
This seemed to validate the view of Google research director Peter Norvig, who that same year claimed that "All models are wrong, and increasingly you can succeed without them." However, within a few years its predictions were becoming questionable because of the lack of a valid data model describing people's search behavior.
A new 2015 tool from a team of Harvard statisticians has now incorporated some of this context from other data sources, although without the data modeling formalism discussed above. The lesson is clear: understanding context up front is vital in the development of data-driven applications.
The lessons of data modeling must be incorporated into new designs as early as possible to avoid overly simplistic or downright faulty models of reality.
Dr. Barry Devlin defined the first data warehouse architecture in 1985 and is among the world’s foremost authorities on BI, big data, and beyond. His 2013 book, Business unIntelligence, offers a new architecture for modern information use and management.