Optimization of Traversal Queries on Distributed Graph Stores

Sharma, Abhilash

View/Open

Thesis full text (3.141Mb)

Author

Sharma, Abhilash

Metadata

Show full item record

Abstract

In this era of Big Data Analytics, much of the semi-structured data has inherent interconnectivity between representative entities. These are increasingly being modeled as property graphs because of the semantic advantages of the associations/edges formed between entities/vertices and their attributes that helps with their analysis. These graphs scale to billions of edges thereby requiring distributed frameworks. Giraph, GoFFish, Blogel are some of the distributed frameworks for graph analytics. These are generally used for long-running analytical queries that may take minutes or hours, while short term interactive querying requires low-latency O(seconds) responses. In particular, traversal queries over a property graph is used to retrieve paths or neighborhoods from distributed property graph stores rapidly, for interactive analysis. Horton+, Titan, and GoDB are some of the distributed frameworks for traversal queries. Efficient execution of traversal queries presents a challenge because of the unstructured and irregular access patterns of property graphs. This requires strategies to reduce the number of vertices and edges to visit and evaluated for a predicate speci fied in the query, while also reducing the network overheads for executing across parts of the graph present on different machines in a distributed setup. GoDB proposed a heuristics-based cost model that uses graph statistics to estimate the number of elements to be visited and the network overhead for each plan and selects the optimal plan. However, the large sizes of the in-memory Java objects used to hold the graph entities leads to the use of a large number of distributed machines, which in turn increases the network overhead. We leverage recent work on compressed data structures like Succinct to model the subgraph data in GoDBX to sharply reduce the memory usage and hence network latency across machines, while minimizing the increase in computational latency. We extend GoDB's cost model to operate on both compressed and uncompressed forms of real world property graphs, and o er detailed performance experiments to evaluate the scalability of GoDBX and GoDB, relative to Titan, for graphs with up to billions of edges

URI

https://etd.iisc.ac.in/handle/2005/5397

Collections

Department of Computational and Data Sciences (CDS) [116]