Sequence Alignment to Cyclic Pangenome Graphs
Abstract
The growing availability of genome sequences of several species, including humans, has created the opportunity to utilize multiple reference genomes for bioinformatics analyses and improve the accuracy of genome resequencing workflows. Graph-based data structures are suitable for compactly representing multiple closely related reference genomes. Pangenome graphs use a directed graph format, where vertices are labeled with strings, and the individual reference genomes are represented as paths in the graph. Aligning sequences (reads) to pangenome graphs is fundamental for pangenome-based genome resequencing.
The sequence-to-graph alignment problem seeks a walk in the graph that spells a sequence with minimum edit distance from the input sequence. Unfortunately, the exact algorithms known for solving this problem are not scalable. Among the known heuristics, co-linear chaining is a common technique for quickly aligning reads to a graph. However, the known chaining algorithms are restricted to directed acyclic graphs (DAGs) and are not trivially generalizable to cyclic graphs. Addressing this limitation is important because pangenome graphs often contain cycles due to inversions, duplications, or copy number mutations within the reference genomes.
This thesis presents the first practical formulation and algorithm for co-linear chaining on cyclic pangenome graphs. Our work builds upon the known chaining algorithms for DAGs. We propose a novel iterative algorithm to handle cycles and provide a rigorous proof of correctness and runtime complexity. We also use the domain-specific small-width property of pangenome graphs. The proposed optimizations enable our algorithm to scale to large human pangenome graphs. We implemented the algorithm in C++ and referred to it as PanAligner (https://github.com/at-cg/PanAligner). PanAligner is an end-to-end long-read aligner for pangenome graphs. We evaluated its speed and accuracy by aligning simulated long reads to a cyclic human pangenome graph comprising 95 haplotypes. We achieved superior read mapping accuracy compared to existing methods.