Graph Databases from a Data Integration Perspective
Data virtualization enables you to get all the value out of your graph database. Here's how.
By Paul Moxon, Senior Director of Product Management, Denodo Technologies
At an increase of 500 percent, the relative growth in popularity of graph databases over the last couple of years appears to be double that of other "hot" categories of NoSQL databases such as document or key-value stores. Note that RDF triple stores, although similar to graph databases in that they store "linked data," are considered to be a separate category. If RDF stores were included, the relative growth of graph databases would be even greater.
I'm a firm believer in the relevance of graph databases and their potential for advanced analytics in scenarios that model highly interconnected entities, where other NoSQL alternatives and, of course, relational databases fall short. I'm thinking of crucial use cases such as running complex root-cause or impact analysis in multilayer network topologies in telecommunications, or building effective recommendation engines on rich product taxonomies.
These examples are far more valuable than, say, someone finding friends of friends in social networks (although the people at Facebook might disagree). However, my interest here is less in the types of problems graph databases are good at solving than in examining graph databases from a data integration point of view: how do they fit in today's data ecosystem, how easily can they be integrated into your current BI architecture, and, in this context, how can data virtualization leverage the value in graph databases?
If you've experienced the power of graph databases in your projects, you have probably also experienced the pain of exposing graph data to standard BI tools. Equally frustrating is watching users who could benefit from the incredibly rich linked data in your graph struggle to use it beyond its initial purpose.
Graph databases typically use their own query languages (e.g., Cypher), whose syntax differs from the more familiar SQL. This means that users trying to extract data from a graph database need to learn a new query language focused on navigating the links between data items rather than querying structured tables.
Because of these challenges, graph databases often end up becoming data silos only accessible to graph-savvy IT developers. Data virtualization solutions help overcome this by virtualizing the graph, or what could be described as applying effective schema-on-read. Data virtualization provides a level of abstraction on top of the graph database and hides the details of the specific implementation.
The BI tools, consumer applications, and even processes do not need to know whether the graph needs to be queried via SPARQL, Cypher, or MQL, or whether it needs to be traversed using TinkerPop's Gremlin or an HTTP REST API. Data virtualization technology enables agile integration of the graph data with the rest of the organization's enterprise data assets -- whether they are internal (enterprise data warehouse, transactional databases, etc.) or external (cloud app data, public Web, and so on), structured or unstructured -- to effectively realize the notion of polyglot persistence in both informational and operational scenarios. It also promotes repurposing of graph data as well as reuse of access and integration logic in an incremental and explicit way.
What follows are a few scenarios where data virtualization can add value, or should I say, where data virtualization enables you to get all the value out of your graph database.
Agile BI on graph data (plus other enterprise data): Data virtualization enables easy and on-demand delivery of graph data to standard BI and analytics tools. The data virtualization platform makes the data from the graph database look like relational data (tables, columns, etc.) to the BI and analytics tools. Without data virtualization, this "data preparation" step is typically a time-consuming, manual process that needs to be repeated for each data set. The graph data can also be integrated and enriched in real time with other enterprise data sources or even external or cloud sources. This saves you and your team from having to kick off an IT "data prep and blending" project every time the business has a reporting or analytics requirement that involves graph data. It also eliminates unneeded data replication, leaving fewer ungoverned and out-of-sync copies of portions of your graph.
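As a rough illustration of what "looking like relational data" can mean, the Python sketch below projects a tiny property graph into tables and then answers a BI-style question with plain SQL, joined against a non-graph source. Everything in it is hypothetical (the device names, the customer list, the table layout, the function names), and it copies the data into an in-memory SQLite database purely for demonstration; a data virtualization platform would instead translate queries to the underlying sources on demand.

```python
import sqlite3

# Hypothetical property graph: labeled nodes plus typed edges.
graph_nodes = {
    1: {"label": "Router", "name": "core-1"},
    2: {"label": "Switch", "name": "edge-7"},
    3: {"label": "Switch", "name": "edge-9"},
}
graph_edges = [(1, "FEEDS", 2), (1, "FEEDS", 3)]

# Hypothetical non-graph source standing in for a CRM or warehouse table.
customers = [("Acme Corp", "edge-7"), ("Globex", "edge-9"), ("Initech", "edge-9")]


def build_relational_view(nodes, edges, customer_rows):
    """Expose the graph as tables so that plain SQL can query it."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE device (id INTEGER, label TEXT, name TEXT)")
    db.execute("CREATE TABLE link (src INTEGER, rel TEXT, dst INTEGER)")
    db.execute("CREATE TABLE customer (name TEXT, device_name TEXT)")
    db.executemany(
        "INSERT INTO device VALUES (?, ?, ?)",
        [(node_id, n["label"], n["name"]) for node_id, n in nodes.items()],
    )
    db.executemany("INSERT INTO link VALUES (?, ?, ?)", edges)
    db.executemany("INSERT INTO customer VALUES (?, ?)", customer_rows)
    return db


def customers_per_upstream_device(db):
    """BI-style aggregate joining graph edges with the customer table."""
    sql = """
        SELECT up.name, COUNT(c.name)
        FROM link l
        JOIN device up ON up.id = l.src
        JOIN device down ON down.id = l.dst
        JOIN customer c ON c.device_name = down.name
        GROUP BY up.name
        ORDER BY up.name
    """
    return db.execute(sql).fetchall()


db = build_relational_view(graph_nodes, graph_edges, customers)
print(customers_per_upstream_device(db))
```

The point of the sketch is that the consumer only ever sees tables and SQL; the fact that the device and link rows originated in a graph is hidden behind the view.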
Fine-grained, integrated security over graph data: Fine-grained data access control is not one of the strongest points of existing graph databases, although this will change as graph databases mature. Until then, the lack of security can hinder graph database adoption. Fortunately, data virtualization can overlay a rich role-based access control model on top of your graph data, aligning its security capabilities with those of relational databases: access control can be type-based (by category), resource-based (by individual node), property-based, and so on.
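To make the three granularities concrete, here is a minimal Python sketch of a role policy applied as a filter in front of graph data. The `Policy` class, the `secure_view` function, and the sample nodes are all hypothetical and do not reflect any particular product's security model; they only show how type-, node-, and property-level rules can be combined in a layer that sits above the graph store.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    """Hypothetical role policy covering three levels of granularity."""
    allowed_labels: set = field(default_factory=set)     # type-based
    denied_node_ids: set = field(default_factory=set)    # node-based
    hidden_properties: set = field(default_factory=set)  # property-based


def secure_view(nodes, policy):
    """Return only the nodes and properties the policy lets a role see."""
    visible = {}
    for node_id, node in nodes.items():
        if node["label"] not in policy.allowed_labels:
            continue  # whole node type is off limits for this role
        if node_id in policy.denied_node_ids:
            continue  # this individual node is off limits
        visible[node_id] = {
            key: value
            for key, value in node.items()
            if key not in policy.hidden_properties
        }
    return visible


# Hypothetical sample data and role.
secured_nodes = {
    1: {"label": "Customer", "name": "Acme", "credit_limit": 50000},
    2: {"label": "Customer", "name": "Globex", "credit_limit": 20000},
    3: {"label": "Employee", "name": "J. Doe", "salary": 90000},
}
analyst = Policy(
    allowed_labels={"Customer"},
    denied_node_ids={2},
    hidden_properties={"credit_limit"},
)
print(secure_view(secured_nodes, analyst))
```

Because the filter lives in the abstraction layer rather than in the graph database, the same policy could in principle be enforced regardless of which query language the underlying store speaks.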
Unified view of multiple graphs: Graph data is a reality, whether it's your own data residing in your own data center (such as an RDF triple store, Neo4j, MarkLogic, or Titan DB) or external data coming from Google's Knowledge Graph, Facebook's Graph, or any public SPARQL endpoint. Thanks to its extended relational model, data virtualization can offer an elegant solution to this specific integration case: a single, unified virtual view of graph data that spans multiple vendors and/or graph paradigms.
Graph-driven RESTful endpoint: Leading data virtualization platforms also enable the provision of integrated "virtualized" data through a REST interface. They offer a navigational interface that can be driven by graph data using an existing graph database as the master source of data enriched with other enterprise NoGraph (as in Not only Graph) data sources. Using this RESTful "Linked Data Services" interface, users can browse both graph and non-graph data, traverse the relationships between data entities, and search for the exact data that they need.
This linked data services interface provides the "browse, navigate, and query" operations typically associated with graph data to all data from any data source without needing to understand or be aware of the location, format, technology, or protocol used by each and every data store. Imagine a telecom service provider being able to perform root-cause and impact analysis for a network outage (as suggested earlier) and then navigating to customer data for customers affected by the outage (typically held in a CRM system and not a part of the graph data). This opens possibilities such as notifying customers about the outage and anticipated times for service restoration (e.g., through text messages which aren't impacted by the outage). Needless to say, this would be a big win for customer service!
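The telecom scenario can be sketched in a few lines of Python: traverse the network graph to find everything downstream of a failed device, then hop to a separate, non-graph customer source to find out whom to notify. The topology, the CRM records, and the function names are hypothetical, and both sources are plain in-memory dictionaries here; the sketch only illustrates navigating from graph data to non-graph data, not how any specific platform implements it.

```python
from collections import deque

# Hypothetical network topology: each device maps to the devices it feeds.
topology = {
    "core-1": ["agg-1", "agg-2"],
    "agg-1": ["edge-7"],
    "agg-2": ["edge-9"],
    "edge-7": [],
    "edge-9": [],
}

# Hypothetical CRM records keyed by the edge device serving each customer.
crm = {
    "edge-7": [{"customer": "Acme Corp", "mobile": "555-0100"}],
    "edge-9": [{"customer": "Globex", "mobile": "555-0101"}],
}


def impacted_devices(topology, failed):
    """Breadth-first traversal: the failed device plus everything downstream."""
    seen = {failed}
    queue = deque([failed])
    while queue:
        device = queue.popleft()
        for child in topology.get(device, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def customers_to_notify(topology, crm, failed):
    """Navigate from impacted graph nodes to the matching CRM records."""
    devices = impacted_devices(topology, failed)
    return sorted(
        record["customer"]
        for device in devices
        for record in crm.get(device, [])
    )


print(customers_to_notify(topology, crm, "agg-2"))
```

In a virtualized setup, the traversal step would be delegated to the graph database and the lookup step to the CRM system, with the consumer seeing one navigable interface over both.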
These are just a few selected scenarios, but data virtualization can also help in other instances, such as loading data into a graph database in either batch or transactional modes or in prototyping data graphs.
In a TechRadar report (Enterprise DBMS, Q1 2014), Forrester forecast that by 2017 over 25 percent of enterprises would be using graph databases. Although it's obviously not easy to measure where we are in terms of adoption today, it is even more difficult to know how much of this adoption is in production environments. However, if popularity -- as measured by presence in social networks and professional profiles, by how frequently the topic is discussed in industry and technical circles, or even by the number of job offers seeking the skill set -- is any indication, then graph databases are attracting and demanding a lot of attention.
Paul Moxon is senior director of product management at Denodo Technologies, a firm focused on data virtualization.