Search-Optimized Disk Layouts For Suffix-Tree Genomic Indexes

Bhavsar, Rajul D

dc.contributor.advisor	Haritsa, Jayant
dc.contributor.author	Bhavsar, Rajul D
dc.date.accessioned	2013-07-18T09:51:55Z
dc.date.accessioned	2018-07-31T05:09:04Z
dc.date.available	2013-07-18T09:51:55Z
dc.date.available	2018-07-31T05:09:04Z
dc.date.issued	2013-07-18
dc.date.submitted	2011
dc.identifier.uri	https://etd.iisc.ac.in/handle/2005/2124
dc.identifier.abstract	https://etd.iisc.ac.in/static/etd/abstracts/2728/G24897-Abs.pdf	en_US
dc.description.abstract	Over the last decade, biological sequence repositories have been growing at an exponential rate. Sophisticated indexing techniques are required to facilitate efficient searching through these humongous genetic repositories. A particularly attractive index structure for such sequence processing is the classical suffix-tree, a vertically compressed trie structure built over the set of all suffixes of a sequence. Its attractiveness stems from its linearity properties -- suffix-tree construction times are linear in the size of the indexed sequences, while search times are linear in the size of the query strings. In practice, however, the promise of suffix-trees is not realized for extremely long sequences, such as the human genome, that run into the billions of characters. This is because suffix-trees, which are typically an order of magnitude larger than the indexed sequence, necessarily have to be disk-resident for such elongated sequences, and their traditional construction and traversal algorithms result in random disk accesses. We investigate, in this thesis, post-construction techniques for disk-based suffix-tree storage optimization, with the objective of maximizing disk-reference locality during query processing. We begin by focusing on the layout reorganization in which the node-to-block assignments and sequence of blocks are reworked. Our proposed algorithm is based on combining the breadth-first layout approach advocated in the recent literature with probabilistic techniques for minimizing the physical distance between successive block accesses, based on an analysis of node traversal patterns. In our next step, we consider techniques for reducing the space overheads incurred by suffix-trees. In particular, we propose an embedding strategy whereby leaf nodes can be completely represented within their parent internal nodes, without requiring any space extension of the parent node's structure. To quantitatively evaluate the benefits of our reorganized and restructured layouts, we have conducted extensive experiments on complete human genome sequences, with complex and computationally expensive user queries that involve finding the maximal common substring matches of the query strings. We show, for the first time, that the layout reorganization approach can be scaled to entire genomes, including the human genome. In the layout reorganization, with careful choice of node-to-block assignment condition and optimized sequence of blocks, search-time improvements ranging from 25% to 75% can be achieved with respect to the construction layouts on such genomes. While the layout reorganization does take considerable time, it is a one-time process whereas searches will be repeatedly invoked on this index. The internalization of leaf nodes results in a 25% reduction in the suffix-tree space occupancy. More importantly, when applied to the construction layout, it provides search-time improvements ranging from 25% to 85%, and in conjunction with the reorganized layout, searches are speeded up by 50% to 90%. Overall, our study and experimental results indicate that through careful choice of node implementations and layouts, the disk access locality of suffix-trees can be improved to the extent that upto an order-of-magnitude improvements in search-times may result relative to the classical implementations.	en_US
dc.language.iso	en_US	en_US
dc.relation.ispartofseries	G24897	en_US
dc.subject	Suffix-Tree	en_US
dc.subject	Suffix-tree Indexing	en_US
dc.subject	Disk-based Suffix-trees	en_US
dc.subject	Suffix-tree Internal Nodes	en_US
dc.subject	Suffix Trees - Storage Optimization	en_US
dc.subject	Suffix-tree Layouts	en_US
dc.subject	Suffix-Tree Nodes	en_US
dc.subject	Disk-resident Suffix-trees	en_US
dc.subject.classification	Computer Science	en_US
dc.title	Search-Optimized Disk Layouts For Suffix-Tree Genomic Indexes	en_US
dc.type	Thesis	en_US
dc.degree.name	MSc Engg	en_US
dc.degree.level	Masters	en_US
dc.degree.discipline	Faculty of Engineering	en_US

Files in this item

Name:: G24897.pdf
Size:: 1.218Mb
Format:: PDF

View/Open

This item appears in the following Collection(s)

Supercomputer Education and Research Centre (SERC) [116]

Show simple item record