An error correction algorithm for long-read sequencing
Abstract
Long-read sequencing technologies have transformed genomics by generating longer and suffi-
ciently accurate DNA sequences, offering advantages in analysing highly repetitive and complex
regions of a genome. A long-read sequencing machine can produce terabases of data in one day,
with individual reads ranging from 10,000 to 100,000 bases in length. However, these instru-
ments also introduce errors during sequencing. The average sequencing error rate ranges from
0.5% − 5%, depending on the choice of instrument and sequencing protocol. These errors must
be corrected to enable accurate genome reconstruction. In diploid organisms such as humans,
two copies of each chromosome are inherited, one from each parent. These copies, referred to as
haplotypes, are nearly identical and differ at only about 0.1% of the genomic positions. These
differences must be preserved for downstream genomics applications. To address this challenge,
several ‘haplotype-aware’ error correction methods have been developed, which aim to correct
errors within reads while preserving the haplotype-specific differences. However, existing solu-
tions are based on either ad-hoc heuristics or deep learning approaches that require substantial
computational resources and lack formal guarantees.
This thesis presents the first rigorous formulation for haplotype-aware error correction of long
reads. Our approach builds on the minimum error correction framework originally developed
for reference-based haplotype phasing, where the read clustering procedure is guided by the
alignment of reads to a known reference genome. We extend this framework to the more
challenging de novo setting, where no reference genome is available. We prove that the resulting
formulation is NP-hard. To make our exact algorithm scale to large datasets, we introduce
practical heuristics. Experiments using PacBio HiFi sequencing datasets from human and plant
genomes show that our approach achieves accuracy comparable to state-of-the-art methods. A
software implementation of our algorithm is available at https://github.com/at-cg/HALE.