Graph Databases for Analytics (Part 4 of 4): Addressing Performance Challenges
Once you put graph databases into practice, you may find that performance does not scale as data volumes grow. We explain how you can improve the performance of your graph applications.
- By David Loshin
- July 28, 2016
Graph databases provide a straightforward model for creating and storing the data that represents different types of networks. We reviewed graph structure in the previous article.
There are a number of graph databases and analytics engines well suited to business challenges at moderate data volumes. However, as interest in the graph approach expands, performance may not keep pace as data volumes increase.
As the size and breadth of the graph grows, queries and graph algorithms will run more slowly due to the following issues:
- Rapid acceleration of graph data volumes: The data sources that feed graph analytics applications continuously stream data, and the number of these sources is constantly growing. Quality graph analytics tools must be able to accommodate rapidly accelerating data volumes, provide an efficient means of representing and storing the data, and ensure reasonable response times for querying and analytics.
- Increased memory requirements: Most graph algorithms examine every edge and node in the network, so good performance requires efficiency in traversing the edges that link entity nodes together. As graphs incorporate more entities and more relationships, they occupy more memory and require more disk space.
- Latency due to data non-locality: Unlike the tabular layout of relational database systems, the data layout of a graph is unpredictable. Linked nodes may not be physically co-located in memory or within the same persistent storage area, which increases the time it takes to access the data. As data access latency increases, the performance of searches and analytics deteriorates.
- Bounds of in-memory computing: One way to mitigate the impact of data non-locality is to load the entire graph into memory. However, even with falling memory costs, there will be some point at which the data volume exceeds what can be held in memory. At that point, the application will start thrashing data in and out of memory, seriously degrading performance.
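The first two issues above follow directly from how graph traversals work. A minimal sketch in Python makes the point: a breadth-first traversal over an adjacency list touches every reachable edge once, so its cost -- and the number of potentially non-local memory accesses -- grows with the size of the graph. (The graph and node names here are invented for illustration; no particular product's API is implied.)

```python
from collections import deque

def bfs(adjacency, start):
    """Breadth-first traversal over an adjacency-list graph.

    Every reachable edge is followed once, so run time -- and the number
    of potentially non-local memory accesses -- grows with the edge count.
    """
    visited = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        # Each hop may land on a node stored far away in memory or on disk,
        # which is the data non-locality problem described above.
        for neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return order

# A tiny example network: each key lists the nodes it links to.
graph = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": [],
}
print(bfs(graph, "a"))  # -> ['a', 'b', 'c', 'd']
```

Because the visit order is driven by the graph's link structure rather than by storage layout, prefetching and caching help far less here than they do for sequential table scans.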
In essence, all of these performance challenges are associated with delays due to data access latency. To ensure reasonable, scalable performance, you must reduce data latency.
The first way to reduce latency is to expand memory. Although that may temporarily ease the memory bottleneck, it will lead to a processing bottleneck if the application is executing on a single processor.
That leads to the next approach -- employ multiple processing units. This is the method used by the graph processing systems layered on top of the Hadoop ecosystem. By adding processing units with massive amounts of memory, the graph data structure can be distributed across the memory associated with those processing units, and the queries and analyses performed on the graph can be executed in parallel.
One project that uses this approach is Apache Giraph, based on the Pregel model developed at Google. Another is Spark GraphX, a library for processing and analyzing graphs layered on top of the Spark parallel environment. Companies such as Neo4j have also adapted their products to load graphs into memory and parallelize operations.
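The Pregel-style systems mentioned above share a "think like a vertex" programming model: computation proceeds in supersteps, with each vertex exchanging messages with its neighbors. The following is a toy, single-process sketch of that model -- not Giraph's or GraphX's actual API -- computing connected components by having every vertex repeatedly adopt the smallest label it receives:

```python
def pregel_min_label(edges, vertices):
    """Toy, single-process sketch of the vertex-centric model used by
    Pregel-style systems such as Giraph.

    Each superstep, every vertex sends its current label to its neighbors
    and adopts the smallest label it receives. In a real engine the
    vertices are partitioned across workers and supersteps run in parallel.
    """
    label = {v: v for v in vertices}      # start: every vertex is its own component
    neighbors = {v: set() for v in vertices}
    for u, w in edges:                    # build undirected adjacency
        neighbors[u].add(w)
        neighbors[w].add(u)

    changed = True
    while changed:                        # one loop iteration == one superstep
        changed = False
        messages = {v: [] for v in vertices}
        for v in vertices:                # "send" phase (per-vertex, parallelizable)
            for w in neighbors[v]:
                messages[w].append(label[v])
        for v in vertices:                # "compute" phase
            if messages[v]:
                best = min(messages[v] + [label[v]])
                if best != label[v]:
                    label[v] = best
                    changed = True
    return label

# Two components: {1, 2, 3} and {4, 5}.
comp = pregel_min_label(edges=[(1, 2), (2, 3), (4, 5)], vertices=[1, 2, 3, 4, 5])
print(comp)  # -> {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
```

The appeal of this model for scaling is that the per-vertex compute step has no shared state beyond the messages, so a cluster can partition the vertex set across machines and run each superstep in parallel.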
Even with a parallel computing environment, individual compute nodes can still fall victim to memory hierarchy performance issues -- delays in moving data from main memory through the different cache levels to the CPU where the processing occurs.
Because graph traversals are unpredictable, data is constantly cycled through the cache, and those latencies begin to add up. Companies such as Blazegraph have attempted to sidestep this by deploying their solution on a parallel cluster of graphics processing units (GPUs). Originally designed for graphics rendering, the GPU architecture employs a large number of smaller cores with rapid, high-bandwidth access to a large shared memory bank.
Obviously, ensuring high performance for the rapidly expanding world of graph processing will be tightly coupled to developments in parallel and distributed computing. It will be interesting to see how these different approaches will enable more organizations to adopt graph databases as a valuable alternative platform for analytics.
David Loshin is a recognized thought leader in the areas of data quality and governance, master data management, and business intelligence. David is a prolific author regarding BI best practices via the expert channel at BeyeNETWORK and numerous books on BI and data quality. His valuable MDM insights can be found in his book, Master Data Management, which has been endorsed by data management industry leaders.