Optimization of Traversal Queries on Distributed Graph Stores
Abstract
In this era of Big Data Analytics, much of the semi-structured data has inherent interconnectivity
between representative entities. These are increasingly being modeled as property graphs
because of the semantic advantages of the associations/edges formed between entities/vertices
and their attributes that helps with their analysis. These graphs scale to billions of edges
thereby requiring distributed frameworks. Giraph, GoFFish, Blogel are some of the distributed
frameworks for graph analytics. These are generally used for long-running analytical queries
that may take minutes or hours, while short term interactive querying requires low-latency
O(seconds) responses. In particular, traversal queries over a property graph is used to retrieve
paths or neighborhoods from distributed property graph stores rapidly, for interactive analysis.
Horton+, Titan, and GoDB are some of the distributed frameworks for traversal queries.
Efficient execution of traversal queries presents a challenge because of the unstructured and
irregular access patterns of property graphs. This requires strategies to reduce the number
of vertices and edges to visit and evaluated for a predicate speci fied in the query, while also
reducing the network overheads for executing across parts of the graph present on different
machines in a distributed setup. GoDB proposed a heuristics-based cost model that uses graph
statistics to estimate the number of elements to be visited and the network overhead for each
plan and selects the optimal plan. However, the large sizes of the in-memory Java objects used
to hold the graph entities leads to the use of a large number of distributed machines, which in
turn increases the network overhead. We leverage recent work on compressed data structures
like Succinct to model the subgraph data in GoDBX to sharply reduce the memory usage and
hence network latency across machines, while minimizing the increase in computational latency.
We extend GoDB's cost model to operate on both compressed and uncompressed forms of real
world property graphs, and o er detailed performance experiments to evaluate the scalability
of GoDBX and GoDB, relative to Titan, for graphs with up to billions of edges