FastDup: a scalable duplicate marking tool using speculation-and-test mechanism

2025-05-09

Zhonghai Zhang, Yewen Li, Ke Meng, Chunming Zhang, Guangming Tan

Abstract

Duplicate marking is a critical preprocessing step in gene sequence analysis that flags redundant reads arising from polymerase chain reaction (PCR) amplification and sequencing artifacts. Although Picard MarkDuplicates is widely recognized as the gold-standard tool, its single-threaded implementation and reliance on global sorting incur significant computational and resource overhead, limiting its efficiency on large-scale datasets. Here, we introduce FastDup, a high-performance, scalable solution that follows a speculation-and-test mechanism. FastDup achieves up to a 20x throughput speedup while guaranteeing output 100% identical to that of Picard MarkDuplicates. FastDup is a C++ program available from GitHub (https://github.com/zzhofict/FastDup.git) under the MIT license.
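To make the task concrete, the following is a minimal sketch of what duplicate marking means in general. It is not FastDup's speculation-and-test algorithm, which the abstract does not detail, and it is not Picard's implementation. It assumes a simplified, hypothetical `Read` record, single-end reads only, and a single numeric quality score per read (for example, a sum of base qualities). Reads that share the same reference, 5' position, and strand are treated as one duplicate set; the highest-scoring read is kept and the rest are flagged.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical, simplified read record (an assumption for illustration);
// real tools read these fields from SAM/BAM alignments.
struct Read {
    std::string chrom;   // reference sequence name
    std::int64_t pos5;   // 5' alignment position
    bool reverse;        // true if aligned to the reverse strand
    int score;           // quality score used to pick the representative
    bool duplicate;      // output flag set by markDuplicates
};

// Flags every read except the best-scoring one in each
// (chrom, pos5, strand) group. Ties keep the first read seen.
void markDuplicates(std::vector<Read>& reads) {
    using Key = std::tuple<std::string, std::int64_t, bool>;
    std::map<Key, std::size_t> best;  // group key -> index of best read so far

    for (std::size_t i = 0; i < reads.size(); ++i) {
        Read& r = reads[i];
        r.duplicate = false;
        Key key = std::make_tuple(r.chrom, r.pos5, r.reverse);

        auto it = best.find(key);
        if (it == best.end()) {
            best.emplace(key, i);                 // first read of its group
        } else if (r.score > reads[it->second].score) {
            reads[it->second].duplicate = true;   // demote previous best
            it->second = i;
        } else {
            r.duplicate = true;                   // current best stays
        }
    }
}
```

Calling `markDuplicates` on a vector of reads sets the `duplicate` field in place. Production tools additionally handle read pairs, clipping, and optical duplicates, all of which this sketch omits.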