Coding Schemes for Secure and Reliable DNA-Based Data Storage
Abstract
In DNA-based data storage, the desired information is encoded into the quaternary
sequence of synthetic DNA molecules, called oligos. We look at secure communication
via information-containing oligos in the usual three-party setting, where Alice and Bob
are legitimate communicators and Eve is an eavesdropper. Our scheme for secure DNAbased
communication is two-fold: First, we store secret data in synthetic DNA molecules
in a properly designed oligo pool of suitably high dilution. The oligo pool comprises
information-containing oligos mixed with background genomic DNA cleaved into strands
of the same length as the useful oligos. Designing oligos for storing the secret data
involves the design of a library of primer pairs and a code book specific to this set
of primers, that satisfy constraints on homopolymer run length, GC content, primerprimer
dissimilarity, and primer-sequence dissimilarity. The differential knowledge of
the designed primer pairs allows Bob to retrieve most of the information-containing
DNA molecules after carrying out sufficient rounds of PCR amplification on the diluted
oligo pool. In contrast, Eve is not able to access the stored information owing to her
lack of any prior knowledge of the primer pairs.
In order to improve upon the security of this scheme, we next develop an index-based
secrecy coding scheme for the resulting wiretap system, where Bob observes a substantially
lower number of erasures in the main channel, as compared to Eve’s channel, which
suffers a large number of erasures (loss of DNA molecules). We show that under the
conditions of a noise-free setting, proper library preparation, and proper measurement of
oligos represented in smaller fractions, this coding scheme achieves the secrecy capacity
of a DNA storage wiretap channel model with strong secrecy.
An error-correcting code for reliable storage of DNA-based data must adhere to the
constraints of GC content and homopolymer run length. In this thesis, we also explore
the constraint of GC balance in the context of insertion and deletion error-correcting
codes for DNA storage systems. We derive an upper bound on the cardinality of single
indel-correcting GC-balanced quaternary codes. This allows us to deduce the minimal
number of redundancy bits required for GC-balancing in such codes. We also develop
a GC-balanced coding scheme for the correction of a single burst of 2 insertion or 2
deletion errors.