An error correction algorithm for long-read sequencing

Barak, Parvesh

dc.contributor.advisor	Jain, Chirag
dc.contributor.author	Barak, Parvesh
dc.date.accessioned	2025-09-26T07:19:56Z
dc.date.available	2025-09-26T07:19:56Z
dc.date.submitted	2025
dc.identifier.uri	https://etd.iisc.ac.in/handle/2005/7092
dc.description.abstract	Long-read sequencing technologies have transformed genomics by generating longer and suffi- ciently accurate DNA sequences, offering advantages in analysing highly repetitive and complex regions of a genome. A long-read sequencing machine can produce terabases of data in one day, with individual reads ranging from 10,000 to 100,000 bases in length. However, these instru- ments also introduce errors during sequencing. The average sequencing error rate ranges from 0.5% − 5%, depending on the choice of instrument and sequencing protocol. These errors must be corrected to enable accurate genome reconstruction. In diploid organisms such as humans, two copies of each chromosome are inherited, one from each parent. These copies, referred to as haplotypes, are nearly identical and differ at only about 0.1% of the genomic positions. These differences must be preserved for downstream genomics applications. To address this challenge, several ‘haplotype-aware’ error correction methods have been developed, which aim to correct errors within reads while preserving the haplotype-specific differences. However, existing solu- tions are based on either ad-hoc heuristics or deep learning approaches that require substantial computational resources and lack formal guarantees. This thesis presents the first rigorous formulation for haplotype-aware error correction of long reads. Our approach builds on the minimum error correction framework originally developed for reference-based haplotype phasing, where the read clustering procedure is guided by the alignment of reads to a known reference genome. We extend this framework to the more challenging de novo setting, where no reference genome is available. We prove that the resulting formulation is NP-hard. To make our exact algorithm scale to large datasets, we introduce practical heuristics. Experiments using PacBio HiFi sequencing datasets from human and plant genomes show that our approach achieves accuracy comparable to state-of-the-art methods. A software implementation of our algorithm is available at https://github.com/at-cg/HALE.	en_US
dc.language.iso	en_US	en_US
dc.relation.ispartofseries	;ET01086
dc.rights	I grant Indian Institute of Science the right to archive and to make available my thesis or dissertation in whole or in part in all forms of media, now hereafter known. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation	en_US
dc.subject	Error Correction	en_US
dc.subject	Long read sequencing	en_US
dc.subject	Combinatorial algorithms	en_US
dc.subject	Genome assembly	en_US
dc.subject	Phasing	en_US
dc.subject	Clustering	en_US
dc.subject	overlap graph	en_US
dc.subject	Combinatorial Algorithms	en_US
dc.subject	genomics	en_US
dc.subject	sequencing error rate	en_US
dc.subject	Halotype	en_US
dc.subject.classification	Research Subject Categories::TECHNOLOGY::Information technology::Computer science	en_US
dc.title	An error correction algorithm for long-read sequencing	en_US
dc.type	Thesis	en_US
dc.degree.name	MTech (Res)	en_US
dc.degree.level	Masters	en_US
dc.degree.grantor	Indian Institute of Science	en_US
dc.degree.discipline	Engineering	en_US

Files in this item

Name:: MTech_R__Thesis-2.pdf
Size:: 3.455Mb
Format:: PDF
Description:: Thesis full text

View/Open

This item appears in the following Collection(s)

Department of Computational and Data Sciences (CDS) [116]

Show simple item record