Group Testing: A Probably Approximately Correct Analysis and Recovery Algorithms for pooled RT-qPCR
Abstract
The goal of group testing, also called pool testing, is to successfully identify a set of k defectives from a population of n items using only m (< n) group tests. In each group test, a subset of the n items is tested together as dictated by a pooling protocol.
The outcome of a group test is negative if and only if none of the defective items participate in that group test; it is positive otherwise. When k << n, group testing can help significantly reduce the number of tests needed. A decoding algorithm then estimates the defective item set using the group test outcomes and the knowledge of the pooling protocol.
From its first introduction by Dorfman in 1943, group testing has been adopted in various applications like infectious disease detection, multiple-access protocols, cognitive radio, and product testing, to name a few. The primary motivating factors for its widespread success include improving the test reliability and reducing the testing cost. On the other hand, the theoretical study involving the design of pooling protocols, development of decoding algorithms, deriving sufficiency (achievability), and converse bounds on the number of tests under various scenarios can be found in a rich literature spanning over half a century. Recently, group testing has gained a renewed interest in the eyes of both practitioners and theorists alike due to the outbreak of COrona-VIrus Disease (Covid-19) pandemic.
In the first part of this thesis, we focus on deriving sufficiency bounds on the number of tests for Boolean non-adaptive group testing algorithms, namely, Combinatorial Orthogonal Matching Pursuit (COMP) and Definite Defectives (DD) with random pooling. The term non-adaptive means that all the group tests are conducted in a single stage, possibly in parallel. The term random pooling refers to the fact that the strategy used to determine which item participates in which group test is determined by a probability distribution.
We view the group testing problem through the lens of a function learning problem and formulate it in a probably approximately correct (PAC) analysis framework. This enables us to characterize our sufficiency bounds by a confidence parameter and an approximation error tolerance parameter. In practical settings with finite resources, one is often interested in ensuring that the probability of the error incurred by a function learned using a finite number of randomly drawn samples exceeding a threshold remains below a small number, which we call the confidence parameter. Also, approximate defective set recovery is sufficient in many applications, wherein the number of errors that can be tolerated is quantified by the approximation error tolerance parameter. Our resulting sufficiency bounds provide a finer perspective of the random-pooling-based group testing algorithms by separately accounting for the randomness in the pooling protocol and the defective set identification errors.
In the second part of this thesis, we focus on developing recovery algorithms for Covid-19 infected sample detection with pooled Reverse Transcriptase (quantitative) Polymerase Chain Reaction (RT-qPCR) assay. The quantitative output from the RT-qPCR test is called the cycle threshold (CT), a quantity inversely related to the amount of the viral load in the sample. For a healthy sample, the CT returned is, in theory, infinity. Existing recovery algorithms suffer from challenges related to the non-linear nature of the RT-qPCR model, the existence of infinities in the feasible set of the optimization problem, and are sensitive to the unknown PCR efficiency factor, q. We develop two iterative algorithms to address the above-mentioned gaps: 1) alternating direction method of multipliers CT (ADMM-CT) and 2) block coordinate descent CT (BCD-CT). At the heart of these algorithms lie gradient descent (GD-CT) and iterative mirrored hard thresholding (IMHT-CT) algorithms for individual sample CT estimation and a projected gradient descent (PGD) method for estimating q. Lastly, we present empirical results demonstrating the advantage of using quantitative measurements in non-adaptive pool testing in terms of the testing rate and, hence, the cost on publicly available Covid-19 data on the number of tests conducted and also compile the best rates achievable for a given prevalence rate.
In summary, we address these two aspects of group testing in this thesis: 1) theoretical analysis of Boolean non-adaptive group testing algorithms and 2) developing recovery algorithms to detect Covid-19 using pooled RT-qPCR. The key takeaways are as follows:
1) Unlike the traditional PAC framework, our formulation allows for deriving bounds under zero error or exact recovery scenario. Further, we can choose the data distribution from which the samples are drawn for function learning based on our knowledge of the hypothesis space from which the target function is learned.
2) Our bounds consider both the randomness in the test matrix and the approximation error probability. This makes existing bounds a special case of our PAC bounds. In addition, we characterize a lower bound on the cumulative distribution of the approximation errors. In deriving these bounds, we characterize the expected stopping time and the tail probability for the subset coupon collector problem (SCCP).
3) Pooled RT-qPCR based Covid-19 detection algorithms that jointly estimate both individual sample CTs and q are robust to model uncertainties.
4) Finally, we collate best optimal designs given a prevalence rate to help the group testing practitioners make the best use of the theoretical results available on group testing.