Counting unique molecular identifiers in sequencing using a multitype branching process with immigration
Serik Sagitov, Anders Ståhlberg
Abstract
Detection of extremely rare variant alleles, such as tumour DNA, within a complex mixture of DNA molecules is experimentally challenging due to sequencing errors. Barcoding of target DNA molecules in library construction for next-generation sequencing provides a way to identify and bioinformatically remove polymerase-induced errors. During the barcoding procedure, which involves t consecutive PCR cycles, the DNA molecules become barcoded by unique molecular identifiers (UMIs). Different library construction protocols utilise different values of t. The effects of a larger t and of imperfect PCR amplification are poorly described. This paper proposes a branching process with growing immigration as a model describing the random outcome of t cycles of PCR barcoding. Our model discriminates between five different amplification rates r_1, r_2, r_3, r_4, r for different types of molecules associated with the PCR barcoding procedure. We study this model by focussing on C_t, the number of clusters of molecules sharing the same UMI, as well as C_t(m), the number of UMI clusters of size m. Our main finding is a remarkable asymptotic pattern valid for moderately large t. It turns out that E(C_t(m))/E(C_t) ≈ 2^{-m} for m = 1, 2, …, regardless of the underlying parameters (r_1, r_2, r_3, r_4, r). Knowledge of the quantities C_t and C_t(m) as functions of the experimental parameters t and (r_1, r_2, r_3, r_4, r) will help users draw better-founded conclusions from the outcomes of different sequencing protocols.
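To make the asymptotic pattern concrete, the following minimal sketch tabulates the approximate expected share 2^{-m} of UMI clusters of each size m. It illustrates only the stated ratio, not the branching process itself; the function name `cluster_size_shares` and the cut-off `max_m` are hypothetical and do not come from the paper.

```python
from fractions import Fraction


def cluster_size_shares(max_m):
    """Return {m: 2^-m} for m = 1..max_m.

    2^-m is the approximate ratio E(C_t(m)) / E(C_t) stated in the
    abstract for moderately large t. The function name and the
    truncation at max_m are illustrative assumptions.
    """
    return {m: Fraction(1, 2 ** m) for m in range(1, max_m + 1)}


shares = cluster_size_shares(5)
for m, share in shares.items():
    # Exact fraction alongside its decimal value.
    print(f"m={m}: {share} ({float(share):.5f})")

# Total share of clusters with size at most max_m.
covered = sum(shares.values())
print(f"sizes 1..5 cover {covered} of all clusters")
```

Under this pattern roughly half of all UMI clusters would be singletons and a quarter would have size two. Because the shares 2^{-m} sum to 1 over m = 1, 2, …, they form a geometric distribution with mean cluster size 2.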