Sequence Alignment to Cyclic Pangenome Graphs

Rajput, Jyotshna

dc.contributor.advisor	Jain, Chirag
dc.contributor.author	Rajput, Jyotshna
dc.date.accessioned	2024-07-17T05:13:54Z
dc.date.available	2024-07-17T05:13:54Z
dc.date.submitted	2024
dc.identifier.uri	https://etd.iisc.ac.in/handle/2005/6562
dc.description.abstract	The growing availability of genome sequences for many species, including humans, has created an opportunity to use multiple reference genomes in bioinformatics analyses and to improve the accuracy of genome resequencing workflows. Graph-based data structures are well suited to representing multiple closely related reference genomes compactly. Pangenome graphs are directed graphs whose vertices are labeled with strings, and the individual reference genomes are represented as paths in the graph. Aligning sequences (reads) to pangenome graphs is fundamental to pangenome-based genome resequencing. The sequence-to-graph alignment problem seeks a walk in the graph that spells a sequence with minimum edit distance from the input sequence. Unfortunately, the known exact algorithms for this problem do not scale. Among the known heuristics, co-linear chaining is a common technique for quickly aligning reads to a graph. However, the known chaining algorithms are restricted to directed acyclic graphs (DAGs) and do not generalize trivially to cyclic graphs. Addressing this limitation is important because pangenome graphs often contain cycles due to inversions, duplications, or copy number mutations within the reference genomes. This thesis presents the first practical formulation and algorithm for co-linear chaining on cyclic pangenome graphs. Our work builds upon the known chaining algorithms for DAGs. We propose a novel iterative algorithm to handle cycles and provide a rigorous proof of its correctness and runtime complexity. We also exploit the domain-specific small-width property of pangenome graphs. These optimizations enable our algorithm to scale to large human pangenome graphs. We implemented the algorithm in C++ and named the implementation PanAligner (https://github.com/at-cg/PanAligner). PanAligner is an end-to-end long-read aligner for pangenome graphs.
We evaluated its speed and accuracy by aligning simulated long reads to a cyclic human pangenome graph comprising 95 haplotypes. PanAligner achieved higher read mapping accuracy than existing methods.	en_US
dc.description.sponsorship	SERB Start-up Research Grant (SRG), National Supercomputing Mission (NSM)	en_US
dc.language.iso	en_US	en_US
dc.relation.ispartofseries	;ET00572
dc.rights	I grant Indian Institute of Science the right to archive and to make available my thesis or dissertation in whole or in part in all forms of media, now or hereafter known. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.	en_US
dc.subject	Sequence alignment	en_US
dc.subject	variation graph	en_US
dc.subject	genome sequencing	en_US
dc.subject	co-linear chaining	en_US
dc.subject	path cover	en_US
dc.subject.classification	Research Subject Categories::INTERDISCIPLINARY RESEARCH AREAS	en_US
dc.subject.classification	Research Subject Categories::TECHNOLOGY::Information technology::Computer science	en_US
dc.title	Sequence Alignment to Cyclic Pangenome Graphs	en_US
dc.type	Thesis	en_US
dc.degree.name	MTech (Res)	en_US
dc.degree.level	Masters	en_US
dc.degree.grantor	Indian Institute of Science	en_US
dc.degree.discipline	Engineering	en_US
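
The sequence-to-graph alignment problem stated in the abstract (find a walk whose spelled sequence has minimum edit distance from the read) can be illustrated with a small exact solver. The sketch below is not PanAligner and not the chaining algorithm of the thesis; it is a plain Dijkstra search over (graph position, read position) states, which is one standard way to solve the problem exactly on graphs with cycles. The function names, the toy graph, and the convention that a walk may start and end at any character of any vertex label are assumptions made for illustration. As the abstract notes, exact approaches of this kind do not scale to large pangenome graphs.

```python
import heapq

def expand_to_char_graph(labels, edges):
    """Split each string-labeled vertex into a chain of one-character nodes.

    labels: dict vertex -> non-empty string; edges: list of (u, v) pairs.
    Returns (chars, succ): node characters and successor lists.
    """
    chars, succ, first, last = [], [], {}, {}
    for v, s in labels.items():
        first[v] = len(chars)
        for k, c in enumerate(s):
            chars.append(c)
            succ.append([])
            if k > 0:
                succ[-2].append(len(chars) - 1)  # link previous char to this one
        last[v] = len(chars) - 1
    for u, v in edges:
        succ[last[u]].append(first[v])
    return chars, succ

def align_to_graph(labels, edges, query):
    """Minimum edit distance between `query` and any walk in the graph.

    Assumption for this sketch: the walk may begin and end at any
    character position, and cycles may be traversed repeatedly.
    """
    chars, succ = expand_to_char_graph(labels, edges)
    n, m = len(chars), len(query)
    START = n                      # virtual state: the walk has not begun yet
    inf = float("inf")
    dist = {(START, 0): 0}
    heap = [(0, START, 0)]
    while heap:
        d, u, i = heapq.heappop(heap)
        if d > dist.get((u, i), inf):
            continue               # stale heap entry
        if i == m:
            return d               # first settled state with the read consumed
        nexts = range(n) if u == START else succ[u]
        moves = [(u, i + 1, 1)]    # read character left unmatched
        for v in nexts:
            moves.append((v, i, 1))                                # graph character unmatched
            moves.append((v, i + 1, 0 if chars[v] == query[i] else 1))  # match or substitution
        for v, j, w in moves:
            if d + w < dist.get((v, j), inf):
                dist[(v, j)] = d + w
                heapq.heappush(heap, (d + w, v, j))
    return m

# Toy example: two vertices "AC" and "GT"; the back edge b -> a creates a cycle.
labels = {"a": "AC", "b": "GT"}
cyclic_edges = [("a", "b"), ("b", "a")]
acyclic_edges = [("a", "b")]
print(align_to_graph(labels, cyclic_edges, "ACGTACGT"))
print(align_to_graph(labels, acyclic_edges, "ACGTACGT"))
```

On the cyclic toy graph the read "ACGTACGT" is spelled exactly by going around the cycle twice, whereas with the back edge removed the best walk spells only "ACGT" and the remaining four read characters count as edits. This is the kind of difference that makes handling cycles matter for reads spanning duplications.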

Files in this item

Name: Final_Thesis___Jyotshna (3).pdf
Size: 19.00Mb
Format: PDF
Description: Thesis full text

This item appears in the following Collection(s)

Department of Computational and Data Sciences (CDS)