External Benchmarking of Lung Ultrasound Models for Pneumothorax-Related Signs: A Manifest-Based Multi-Source Study
Takehiro Ishikawa
Abstract
Background and Aims: Reproducible external benchmarks for pneumothorax-related lung ultrasound (LUS) AI are scarce, and binary lung-sliding classification may obscure clinically important signs. We therefore developed a manifest-based external benchmark and used it to test both cross-domain generalization and task validity.

Methods: We curated 280 clips from 190 publicly accessible LUS source videos and released a reconstruction manifest containing URLs, timestamps, crop coordinates, labels, and probe shape. Labels were normal lung sliding, absent lung sliding, lung point, and lung pulse. A previously published single-site binary classifier was evaluated on this benchmark; challenge-state analysis examined lung point and lung pulse using the predicted probability of absent sliding, P(absent).

Results: The single-site comparator achieved ROC-AUC 0.9625 in-domain but 0.7050 on the heterogeneous external benchmark; restricting external evaluation to linear clips still yielded ROC-AUC 0.7212. In challenge-state analysis, mean P(absent) ranked absent (0.504) > lung point (0.313) > normal (0.186) > lung pulse (0.143). Lung pulse differed from absent clips (p=0.000470) but not from normal clips (p=0.813), indicating that the binary model treated pulse as normal-like despite absent sliding. Lung point differed from both absent (p=0.000468) and normal (p=0.000026), supporting its interpretation as an intermediate ambiguity state rather than a clean binary class.

Conclusion: A manifest-based, multi-source benchmark can support reproducible external evaluation without redistributing source videos. Binary lung-sliding classification is an incomplete proxy for pneumothorax reasoning because it obscures blind-spot and ambiguity states such as lung pulse and lung point.
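As an illustration of the evaluation described above, the following is a minimal Python sketch, not the released manifest schema or the study's code. The abstract names the manifest contents (URLs, timestamps, crop coordinates, labels, probe shape) but not their field names or layout, so `ManifestEntry`, its fields, the crop convention, and the label strings are all hypothetical. The sketch computes ROC-AUC as the probability that a randomly chosen absent-sliding clip receives a higher P(absent) than a randomly chosen normal clip, and groups mean P(absent) by label as in the challenge-state analysis. How lung point and lung pulse clips enter the binary ROC-AUC in the study is not stated here, so the sketch leaves that choice to the caller.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ManifestEntry:
    """One clip in a reconstruction manifest (hypothetical field names)."""
    url: str                          # public source video location
    start_s: float                    # clip start time in seconds (assumed unit)
    end_s: float                      # clip end time in seconds (assumed unit)
    crop: tuple[int, int, int, int]   # assumed convention: x, y, width, height
    label: str                        # e.g. "normal", "absent", "lung_point", "lung_pulse"
    probe_shape: str                  # e.g. "linear"


def roc_auc(scores_pos, scores_neg):
    """ROC-AUC as P(positive score > negative score), ties counted as 0.5."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def mean_p_absent_by_label(entries, p_absent):
    """Group model outputs by manifest label and average them."""
    groups = {}
    for entry, p in zip(entries, p_absent):
        groups.setdefault(entry.label, []).append(p)
    return {label: mean(values) for label, values in groups.items()}
```

The pairwise form of ROC-AUC is quadratic in the number of clips, which is acceptable at the scale of a few hundred clips and keeps the sketch free of external dependencies. Restricting the input to entries with `probe_shape == "linear"` before scoring would mirror the linear-only analysis.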