Navigating the Thicket: Why DeepSeek-V4 Trains Specialists Instead of One Model
DeepSeek-V4 replaced multi-domain RL with something counterintuitive: train ten-plus domain specialists independently, then merge them through on-policy distillation. Three recent papers explain why this works. The base model already contains the experts. Post-training is just the map.