
When Agents Disagree: The Selection Bottleneck in Multi-Agent LLM Pipelines

2026-03-20

Artem Maryanskyy


Abstract

The literature on multi-agent LLM pipelines offers contradictory evidence on whether team diversity improves output quality: heterogeneous Mixture-of-Agents (MoA) teams outperform single models, yet homogeneous Self-MoA teams consistently win under synthesis-based aggregation. We propose a resolution by identifying the selection bottleneck, a crossover threshold in aggregation quality that determines whether diversity helps or hurts. We derive this threshold in closed form as s^* (Proposition 1); it separates the regime where diversity helps from the regime where it hurts. In a targeted experiment spanning 42 tasks across 7 categories (N = 210), a diverse team with judge-based selection achieves a win rate of 0.810 against a single-model baseline, while a homogeneous team scores 0.512, which is near chance (Glass's Δ = 2.07). Judge-based selection outperforms MoA-style synthesis by Δ_WR = +0.631, and the judge panel prefers the synthesis approach over the baseline in zero of 42 tasks. A decoupled evaluation with independent judges confirms all directional findings (Spearman ρ = 0.90). Exploratory evidence suggests that including a weaker model improves performance while reducing cost (p < 10^-4, not pre-registered). Our results suggest that selector quality may be a more impactful design lever than generator diversity in single-round generate-then-select pipelines.
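For readers unfamiliar with the two summary statistics quoted in the abstract, the following is a minimal sketch of how a win rate and Glass's Δ are conventionally computed. It is not the paper's evaluation code. The function names, the tie-handling convention (a tie counted as 0.5), the choice of which group serves as the control, and the toy numbers are all assumptions made for illustration.

```python
from statistics import mean, stdev


def win_rate(outcomes):
    """Mean of pairwise outcomes against a baseline.

    Each outcome is 1.0 for a win, 0.0 for a loss, and 0.5 for a tie
    (the tie convention is an assumption, not taken from the paper).
    """
    return mean(outcomes)


def glass_delta(treatment, control):
    """Glass's delta: the difference in group means divided by the
    sample standard deviation of the control group only."""
    sd_control = stdev(control)  # sample SD (n - 1 denominator)
    if sd_control == 0:
        raise ValueError("control group has zero variance")
    return (mean(treatment) - mean(control)) / sd_control


# Hypothetical per-task scores, invented for illustration only.
diverse_scores = [2.0, 4.0, 6.0]
homogeneous_scores = [1.0, 2.0, 3.0]  # treated as the control group here

print(win_rate([1.0, 0.0, 1.0, 0.5]))
print(glass_delta(diverse_scores, homogeneous_scores))
```

Unlike Cohen's d, Glass's Δ scales by the control group's spread alone, which is the usual choice when the two groups are not expected to share a common variance.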
