Show simple item record

dc.contributor.advisorJain, Chirag
dc.contributor.authorBarak, Parvesh
dc.date.accessioned2025-09-26T07:19:56Z
dc.date.available2025-09-26T07:19:56Z
dc.date.submitted2025
dc.identifier.urihttps://etd.iisc.ac.in/handle/2005/7092
dc.description.abstractLong-read sequencing technologies have transformed genomics by generating longer and suffi- ciently accurate DNA sequences, offering advantages in analysing highly repetitive and complex regions of a genome. A long-read sequencing machine can produce terabases of data in one day, with individual reads ranging from 10,000 to 100,000 bases in length. However, these instru- ments also introduce errors during sequencing. The average sequencing error rate ranges from 0.5% − 5%, depending on the choice of instrument and sequencing protocol. These errors must be corrected to enable accurate genome reconstruction. In diploid organisms such as humans, two copies of each chromosome are inherited, one from each parent. These copies, referred to as haplotypes, are nearly identical and differ at only about 0.1% of the genomic positions. These differences must be preserved for downstream genomics applications. To address this challenge, several ‘haplotype-aware’ error correction methods have been developed, which aim to correct errors within reads while preserving the haplotype-specific differences. However, existing solu- tions are based on either ad-hoc heuristics or deep learning approaches that require substantial computational resources and lack formal guarantees. This thesis presents the first rigorous formulation for haplotype-aware error correction of long reads. Our approach builds on the minimum error correction framework originally developed for reference-based haplotype phasing, where the read clustering procedure is guided by the alignment of reads to a known reference genome. We extend this framework to the more challenging de novo setting, where no reference genome is available. We prove that the resulting formulation is NP-hard. To make our exact algorithm scale to large datasets, we introduce practical heuristics. Experiments using PacBio HiFi sequencing datasets from human and plant genomes show that our approach achieves accuracy comparable to state-of-the-art methods. A software implementation of our algorithm is available at https://github.com/at-cg/HALE.en_US
dc.language.isoen_USen_US
dc.relation.ispartofseries;ET01086
dc.rightsI grant Indian Institute of Science the right to archive and to make available my thesis or dissertation in whole or in part in all forms of media, now hereafter known. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertationen_US
dc.subjectError Correctionen_US
dc.subjectLong read sequencingen_US
dc.subjectCombinatorial algorithmsen_US
dc.subjectGenome assemblyen_US
dc.subjectPhasingen_US
dc.subjectClusteringen_US
dc.subjectoverlap graphen_US
dc.subjectCombinatorial Algorithmsen_US
dc.subjectgenomicsen_US
dc.subjectsequencing error rateen_US
dc.subjectHalotypeen_US
dc.subject.classificationResearch Subject Categories::TECHNOLOGY::Information technology::Computer scienceen_US
dc.titleAn error correction algorithm for long-read sequencingen_US
dc.typeThesisen_US
dc.degree.nameMTech (Res)en_US
dc.degree.levelMastersen_US
dc.degree.grantorIndian Institute of Scienceen_US
dc.degree.disciplineEngineeringen_US


Files in this item

Thumbnail

This item appears in the following Collection(s)

Show simple item record