Graph Databases for Analytics (Part 1 of 4): What’s So Great about Graphs?
What makes now the right time to learn about graph databases? When the connections between data elements are as important as the elements themselves, you need a new way to handle your data.
- By David Loshin
- July 25, 2016
Your enterprise probably collects and processes an increasing amount of data today. If you want to implement advanced analytics on this data, you might need an innovative alternative for data representation. A graph database is built on a data model that focuses on the relationships between entities.
The relational database management system (RDBMS) has been the core of most types of transaction processing, operational, and reporting applications over the past three decades. From a reporting and analytics perspective, the relational model remains well suited to applications involving the characteristics and activities of entities, such as operational reports about sales performance, manufacturing productivity, or call center effectiveness.
However, the table structure of the RDBMS limits your ability to execute more sophisticated types of analyses. In a relational schema, connections are subsidiary to the entities, effectively buried within primary/foreign key relationships, so the model can’t optimally capture all the valuable information associated with those connections.
Consider a relationship between two people interacting via Facebook. There are characteristics of the relationship that are not necessarily attributes of either individual, such as the nature of their relationship (personal or professional), its duration, where they met, or how frequently they correspond. Similarly, consider the attributes of the links in a supply chain: the driving distance between facilities, how often deliveries are made, or the preferred trucking vendor for each leg of the journey.
These limitations have become more acute as the range of data sets extends beyond traditional structured data models. It now encompasses unstructured data, data streams continuously fed by Internet-connected devices and sensors, and human-generated content from Internet communities.
An Alternative Approach
Graph databases provide an alternative approach to data representation that not only captures information about entities and their attributes but also elevates relationships among the entities to be first-class objects.
Graphs consist of a collection of vertices (also called nodes or points) that represent the modeled entities; vertices are connected by edges (also called links, connections, or relationships) that capture the way that two entities are related.
Figure 1: An example graph.
The example in Figure 1 shows a directed graph consisting of vertices (employees, manufacturers, and retail channels) and edges (indicating that an employee worked for a particular manufacturer or indicating that a manufacturer supplies products to a retail channel).
This model can represent attributes of each entity (such as the manufacturer’s address) as well as attributes of each relationship, such as the dates of employment associated with each “employed-by” edge in the graph.
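As a minimal sketch of this idea, the Python below models a fragment of a graph like the one described for Figure 1, with attributes attached to both vertices and edges. The class names (`Vertex`, `Edge`), the entity names, the address, and the dates are all hypothetical values chosen for illustration; they are not taken from the figure or from any particular graph database product.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Vertex:
    """An entity in the graph, with its own attributes."""
    name: str
    kind: str  # e.g. "employee", "manufacturer", "retail_channel"
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    """A directed relationship that carries attributes of its own."""
    source: Vertex
    target: Vertex
    label: str  # e.g. "employed-by", "supplies"
    properties: dict = field(default_factory=dict)


# Hypothetical sample data (not from the article's figure).
alice = Vertex("Alice", "employee")
acme = Vertex("Acme Manufacturing", "manufacturer", {"address": "1 Example Way"})
shop = Vertex("Example Retail", "retail_channel")

edges = [
    # The employment dates belong to the relationship, not to Alice or Acme.
    Edge(alice, acme, "employed-by",
         {"start": date(2010, 1, 4), "end": date(2014, 6, 30)}),
    Edge(acme, shop, "supplies"),
]

# Read an attribute of the relationship itself.
employment = edges[0]
days_employed = (employment.properties["end"] - employment.properties["start"]).days
```

The point of the sketch is that the dates of employment live on the “employed-by” edge, so they can be read or filtered without touching either endpoint’s attributes.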
A graph’s structure best supports analyses related to networks and connectivity. Examples include analyzing the performance of mobile telecommunications networks or identifying logistics bottlenecks in a supply chain. You can use graphs to identify employees who work well as teams (thereby improving recruitment and retention) or to leverage familiarity among customer cohorts to develop recommendation engines.
Each of these examples leverages the different types of relationships among the entities to infer actionable knowledge leading to profitable decisions. Because the graph database model allows you to query based on the attributes of both the entities and their relationships, a data scientist can use it to develop sophisticated patterns for predictive models that can be integrated into real-time and streaming applications.
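To illustrate what querying on both kinds of attributes might look like, here is a small sketch in plain Python rather than in any real graph query language. The tuple layout, the helper `find_edges`, and all of the sample data are assumptions made for this example.

```python
# Vertices: name -> attributes of the entity (hypothetical sample data).
vertices = {
    "Alice": {"kind": "employee", "role": "engineer"},
    "Bob": {"kind": "employee", "role": "analyst"},
    "Carol": {"kind": "employee", "role": "engineer"},
    "Acme": {"kind": "manufacturer", "region": "west"},
    "Globex": {"kind": "manufacturer", "region": "east"},
}

# Edges: (source, label, target, attributes of the relationship).
edges = [
    ("Alice", "employed-by", "Acme", {"years": 6}),
    ("Bob", "employed-by", "Acme", {"years": 2}),
    ("Carol", "employed-by", "Globex", {"years": 8}),
]


def find_edges(label, source_test, target_test, edge_test):
    """Return edges whose endpoints AND relationship attributes all match."""
    return [
        (src, dst, attrs)
        for src, lbl, dst, attrs in edges
        if lbl == label
        and source_test(vertices[src])
        and target_test(vertices[dst])
        and edge_test(attrs)
    ]


# Engineers employed by a western manufacturer for at least five years.
matches = find_edges(
    "employed-by",
    source_test=lambda v: v.get("role") == "engineer",
    target_test=lambda v: v.get("region") == "west",
    edge_test=lambda e: e.get("years", 0) >= 5,
)
print(matches)  # [('Alice', 'Acme', {'years': 6})]
```

A production graph database would express the same pattern in its own query language and use indexes rather than a linear scan; the sketch only shows that one filter can combine entity attributes with relationship attributes.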
In the coming articles in this series, we will explore the features that make a business problem suited to graph analytics, discuss the basic algorithms for graph analysis, and consider the challenges for managing graph performance.
David Loshin is a recognized thought leader in the areas of data quality and governance, master data management, and business intelligence. David is a prolific author regarding BI best practices via the expert channel at BeyeNETWORK and numerous books on BI and data quality. His valuable MDM insights can be found in his book, Master Data Management, which has been endorsed by data management industry leaders.