An error correction algorithm for long-read sequencing

Barak, Parvesh

View/Open

Thesis full text (3.455Mb)

Author

Barak, Parvesh

Metadata

Show full item record

Abstract

Long-read sequencing technologies have transformed genomics by generating longer and suffi- ciently accurate DNA sequences, offering advantages in analysing highly repetitive and complex regions of a genome. A long-read sequencing machine can produce terabases of data in one day, with individual reads ranging from 10,000 to 100,000 bases in length. However, these instru- ments also introduce errors during sequencing. The average sequencing error rate ranges from 0.5% − 5%, depending on the choice of instrument and sequencing protocol. These errors must be corrected to enable accurate genome reconstruction. In diploid organisms such as humans, two copies of each chromosome are inherited, one from each parent. These copies, referred to as haplotypes, are nearly identical and differ at only about 0.1% of the genomic positions. These differences must be preserved for downstream genomics applications. To address this challenge, several ‘haplotype-aware’ error correction methods have been developed, which aim to correct errors within reads while preserving the haplotype-specific differences. However, existing solu- tions are based on either ad-hoc heuristics or deep learning approaches that require substantial computational resources and lack formal guarantees. This thesis presents the first rigorous formulation for haplotype-aware error correction of long reads. Our approach builds on the minimum error correction framework originally developed for reference-based haplotype phasing, where the read clustering procedure is guided by the alignment of reads to a known reference genome. We extend this framework to the more challenging de novo setting, where no reference genome is available. We prove that the resulting formulation is NP-hard. To make our exact algorithm scale to large datasets, we introduce practical heuristics. Experiments using PacBio HiFi sequencing datasets from human and plant genomes show that our approach achieves accuracy comparable to state-of-the-art methods. A software implementation of our algorithm is available at https://github.com/at-cg/HALE.

URI

https://etd.iisc.ac.in/handle/2005/7092

Collections

Department of Computational and Data Sciences (CDS) [112]