SOTAVerified

Contextual Counterfactual Credit Assignment for Multi-Agent Reinforcement Learning in LLM Collaboration

2026-03-06Code Available0· sign in to hype

Yanjun Chen, Yirong Sun, Hanlin Wang, Xinming Zhang, Xiaoyu Shen, Wenjie Li, Wei Zhang

Code Available — Be the first to reproduce this paper.

Reproduce

Code

Abstract

Cooperative multi-agent reinforcement learning (MARL) systems powered by large language models (LLMs) are frequently optimized via sparse terminal-only feedback. This shared signal entangles upstream decisions, obstructing accurate decision-level credit assignment. To address this trajectory-level diffusion, we introduce Contextual Counterfactual Credit Assignment (C3). Instead of distributing rewards across an entire episode, C3 isolates the causal impact of individual messages by freezing the exact transcript-derived context, evaluating context-matched alternatives via fixed-continuation replay, and applying a leave-one-out (LOO) baseline. This localized intervention extracts unbiased, low-variance marginal advantages for standard policy-gradient optimization. Evaluated across five mathematical and coding benchmarks under matched budgets, C3 improves terminal performance over established baselines. Mechanistic diagnostics further show that these gains are accompanied by higher credit fidelity, lower contextual variance, and stronger inter-agent causal dependence. Our code is available at https://github.com/EIT-EAST-Lab/C3.

Reproductions