Search-Optimized Disk Layouts For Suffix-Tree Genomic Indexes
Abstract
Over the last decade, biological sequence repositories have been growing at an exponential rate. Sophisticated indexing techniques are required to facilitate efficient searching through these humongous genetic repositories. A particularly attractive index structure for such sequence processing is the classical suffix-tree, a vertically compressed trie structure built over the set of all suffixes of a sequence. Its attractiveness stems from its linearity properties -- suffix-tree construction times are linear in the size of the indexed sequences, while search times are linear in the size of the query strings.
In practice, however, the promise of suffix-trees is not realized for extremely long sequences, such as the human genome, that run into the billions of characters. This is because suffix-trees, which are typically an order of magnitude larger than the indexed sequence, necessarily have to be disk-resident for such elongated sequences, and their traditional construction and traversal algorithms result in random disk accesses. We investigate, in this thesis, post-construction techniques for disk-based suffix-tree storage optimization, with the objective of maximizing disk-reference locality during query processing. We begin by focusing on the layout reorganization in which the node-to-block assignments and sequence of blocks are reworked. Our proposed algorithm is based on combining the breadth-first layout approach advocated in the recent literature with probabilistic techniques for minimizing the physical distance between successive block accesses, based on an analysis of node traversal patterns. In our next step, we consider techniques for reducing the space overheads incurred by suffix-trees. In particular, we propose an embedding strategy whereby leaf nodes can be completely represented within their parent internal nodes, without requiring any space extension of the parent node's structure.
To quantitatively evaluate the benefits of our reorganized and restructured layouts, we have conducted extensive experiments on complete human genome sequences, with complex and computationally expensive user queries that involve finding the maximal common substring matches of the query strings.
We show, for the first time, that the layout reorganization approach can be scaled to entire genomes, including the human genome. In the layout reorganization, with careful choice of node-to-block assignment condition and optimized sequence of blocks, search-time improvements ranging from 25% to 75% can be achieved with respect to the construction layouts on such genomes. While the layout reorganization does take considerable time, it is a one-time process whereas searches will be repeatedly invoked on this index. The internalization of leaf nodes results in a 25% reduction in the suffix-tree space occupancy. More importantly, when applied to the construction layout, it provides search-time improvements ranging from 25% to 85%, and in conjunction with the reorganized layout, searches are speeded up by 50% to 90%. Overall, our study and experimental results indicate that through careful choice of node implementations and layouts, the disk access locality of suffix-trees can be improved to the extent that upto an order-of-magnitude improvements in search-times may result relative to the classical implementations.