Navigating the Thicket: Why DeepSeek-V4 Trains Specialists Instead of One Model

David Colmenares
Tags: ml-synthesis, deepseek, post-training, neural-thickets, rlvr, self-distillation, weight-space

The latest DeepSeek-V4 numbers are impressive, with eval scores only 3-6 months behind frontier closed models at a fraction of the cost. Most of the discussion has focused on the hybrid CSA/HCA attention architecture, but I think the true innovation is in the post-training pipeline. DeepSeek-V3.2 used a mixed reinforcement learning stage where multiple domains were optimized simultaneously. V4 replaced this entirely. They now train ten-plus domain specialists independently via GRPO with domain-specific rewards, each with its own length penalties and context windows, and then consolidate everything into a single model through multi-teacher on-policy distillation (OPD) using full-vocabulary logit matching.

This is a counterintuitive production decision. The naive expectation is that a single RL run across all domains should be more sample-efficient than N separate runs plus a consolidation step. You're duplicating the base model N times, running independent training loops, and then spending additional compute to merge them. At the scale DeepSeek operates (a 1.6T parameter MoE with 49B activated parameters, pretrained on 32T+ tokens), this is not a decision anyone makes casually. When a lab commits to this architecture over the simpler mixed-RL alternative they already had working, it tells you something important about the optimization landscape these models live in.

The reason for this decision is that modern post-training doesn't teach LLMs new capabilities. It navigates a pretrained model's weight-space geometry to reach domain-specific experts that already exist within a dense neighborhood of the base weights. DeepSeek-V4's specialist-then-distill pipeline is the first major production system designed around this insight, even if the DeepSeek team wouldn't necessarily frame it in these terms. Three recent papers, when read together, provide the theoretical and empirical foundation for why it works. Neural Thickets maps the geometry and shows that the expert neighborhood is far denser than anyone expected. Sparse but Critical shows how little actually changes during RL, with performance hinging on fewer than 4% of token decisions. And Simple Self-Distillation demonstrates that you don't even need RL to navigate the same landscape.

1. The Geometry

Neural Thickets (Gan and Isola, MIT, arXiv:2603.12228) reports a result that sounds like it shouldn't work. Take a large pretrained model, add random Gaussian noise to the weights, and evaluate on a downstream task. At 7B+ parameters, roughly 60% of these random perturbations improve task performance. These aren't small perturbations carefully tuned by gradient descent; they're fully random. The paper has a fantastic visual explanation. While small models sit on isolated peaks in the accuracy landscape where any perturbation makes things worse, large models inhabit broad valleys full of expert modes. Solution density scales monotonically with model size, going from near zero at 0.5B to around 60% at 32B on tasks like GSM8K and Countdown.
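To make the experiment concrete, here's a minimal sketch of the perturbation test as I understand it from the paper's description. `evaluate` is a stand-in for any task-accuracy function, and the noise scale `sigma` is a placeholder; the paper controls perturbation magnitude more carefully than a single global constant.

```python
import copy
import torch

def perturb(model, sigma=1e-3):
    """Return a copy of the model with i.i.d. Gaussian noise on every weight."""
    noisy = copy.deepcopy(model)
    with torch.no_grad():
        for p in noisy.parameters():
            p.add_(sigma * torch.randn_like(p))
    return noisy

def solution_density(model, evaluate, n_samples=100, sigma=1e-3):
    """Fraction of random perturbations that beat the unperturbed model,
    i.e. the quantity that reaches ~60% at 7B+ in the paper."""
    baseline = evaluate(model)
    wins = sum(evaluate(perturb(model, sigma)) > baseline
               for _ in range(n_samples))
    return wins / n_samples
```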

The paper's RandOpt method exploits this geometry through brute force. Sample N random perturbations, evaluate each one, pick the top K performers, and ensemble them via majority vote. It's competitive with PPO, GRPO, and evolutionary strategies for post-training at scale, despite being almost absurdly simple. But RandOpt isn't the point. The point is what its success reveals about the landscape. For random guessing to work as a post-training strategy, good solutions must be dense under the sampling distribution. The fact that it works tells us the thicket is real.
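In sketch form, RandOpt is only a few lines, reusing `perturb` from the snippet above. `evaluate` scores a candidate on held-out data and `answer` extracts one final answer per prompt; both are stand-in names, not the paper's API.

```python
from collections import Counter

def randopt(model, evaluate, answer, prompts, n=64, k=8, sigma=1e-3):
    # Sample N random perturbations and keep the top-K by task score.
    candidates = [perturb(model, sigma) for _ in range(n)]
    experts = sorted(candidates, key=evaluate, reverse=True)[:k]
    # Ensemble the top-K experts by majority vote, per prompt.
    predictions = []
    for prompt in prompts:
        votes = Counter(answer(expert, prompt) for expert in experts)
        predictions.append(votes.most_common(1)[0][0])
    return predictions
```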

But these perturbation-found experts are specialists, not generalists. The paper introduces a spectral discordance metric that measures how correlated the rankings of perturbations are across different tasks. If perturbation #47 is great at math and also great at coding, that's low discordance (generalists). What they actually find is that perturbation-found experts are anti-correlated across tasks, with anti-correlation increasing with scale. A perturbation that improves coding performance will typically hurt math or writing. You're not finding generally better models, but ones that traded capability in one area for capability in another.
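I won't reproduce the paper's spectral discordance metric itself, but the underlying check is simple: rank the same set of perturbations by their scores on two tasks and measure how the rankings agree. A sketch, using plain Spearman correlation as a simpler stand-in for the paper's metric:

```python
from scipy.stats import spearmanr

def cross_task_agreement(scores_task_a, scores_task_b):
    """Spearman correlation between two tasks' rankings of the same
    perturbations. Near +1: the same perturbations help both tasks
    (generalists). Negative: gains on one task trade off against the
    other (specialists), which is what the paper finds at scale."""
    rho, _ = spearmanr(scores_task_a, scores_task_b)
    return rho
```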

This is exactly why DeepSeek's decision to train specialists independently makes geometric sense. If experts are anti-correlated across tasks, then trying to reach multiple experts simultaneously through mixed multi-domain RL creates competing gradients. The math reward signal pulls weights in one direction while the coding reward pulls in another, and the resulting compromise may not correspond to any expert the thicket actually contains. Training specialists independently avoids this entirely. Each RL run navigates to a single expert without interference.

The shared pretrained initialization is what makes OPD consolidation possible afterward. Fort et al. (2019) showed that SGD with different random seeds converges to distinct modes in weight space, with high loss barriers between them. But Model Soups (Wortsman et al. 2022) demonstrated that you can average weights across runs that share a pretrained checkpoint without hitting those barriers, because the shared initialization constrains all solutions to the same basin. DeepSeek-V4's OPD operates in exactly this regime. All ten-plus specialists started from the same pretrained weights, so they live in the same basin and can be meaningfully combined through logit-level alignment rather than naive weight averaging.
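For contrast, here's what the naive alternative looks like: the Model Soups recipe in its simplest uniform form. V4 consolidates at the logit level instead, but the soup sketch shows what the shared-basin property buys you, since averaging is only meaningful when the specialists never left the pretrained basin.

```python
import torch

def uniform_soup(state_dicts):
    """Uniform weight average over specialists that share a pretrained init.
    Only sensible within one basin; across basins you land on a loss barrier."""
    return {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }
```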

2. RLVR Changes Almost Nothing (and That's the Point)

If post-training navigates between nearby experts in the same weight-space neighborhood, then RL fine-tuning should produce small, sparse changes to the model's behavior rather than global rewrites. The Qwen Pilot Team tested exactly this in "Sparse but Critical" (arXiv:2603.22446, ICLR 2026), and the results are the most direct empirical validation of the thickets picture I've seen.

RLVR induces highly sparse distributional shifts at the token level. The vast majority of token positions, over 80% under DAPO and up to 98% under SimpleRL, show near-zero Jensen-Shannon divergence between the base and RL models. The implication is surprising! A model that jumps from 8% to 25% on AIME 2024 after RL training is producing nearly identical token distributions to the base model at 96-98% of positions. The improvement is concentrated in a tiny fraction of decisions. And this sparsity is not a generic property of fine-tuning. When the Qwen team ran the same analysis on supervised fine-tuning, SFT produced substantially denser and more globally distributed shifts. RLVR is fundamentally more surgical, which is consistent with it navigating locally within a thicket rather than rewriting the model's behavior from scratch.
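The measurement itself is straightforward to reproduce on any base/RL pair. A sketch, with the near-zero threshold `eps` as a placeholder for whatever cutoff the Qwen team used:

```python
import torch
import torch.nn.functional as F

def js_divergence(p_logits, q_logits):
    """Per-position Jensen-Shannon divergence between two next-token
    distributions. Both logits tensors are [seq_len, vocab]."""
    p, q = F.softmax(p_logits, dim=-1), F.softmax(q_logits, dim=-1)
    m = 0.5 * (p + q)
    log_m = m.clamp_min(1e-12).log()
    kl_pm = (p * (p.clamp_min(1e-12).log() - log_m)).sum(-1)
    kl_qm = (q * (q.clamp_min(1e-12).log() - log_m)).sum(-1)
    return 0.5 * (kl_pm + kl_qm)

def shift_sparsity(base_logits, rl_logits, eps=1e-3):
    """Fraction of positions with near-zero divergence (the 80-98% figure)."""
    return (js_divergence(base_logits, rl_logits) < eps).float().mean().item()
```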

The cross-sampling experiments are the key result. The team built a framework that selectively swaps token choices between the base and RL models at high-divergence positions. In forward cross-sampling, they generated primarily under the base policy while injecting RL-sampled tokens at positions where the two models disagree most. Injecting fewer than 40 RL-sampled tokens (under 4% of the sequence) into a base-model generation was sufficient to recover full RL-level performance on AIME 2024, going from 8% to over 25%. In the reverse direction, swapping roughly 30 base-model tokens back into an RL generation collapsed performance completely, from 25% back to 8%. For the stronger DAPO method, roughly 7.8% of tokens recovers performance from 8% to over 44%, and the mixed policy can sometimes outperform the standalone RL policy.
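Here's the shape of forward cross-sampling under my reading of the setup. `base_logits_fn` and `rl_logits_fn` are stand-ins that return next-token logits for a prefix, `js_divergence` is the function from the previous snippet, and the threshold `tau` stands in for however the paper selects its highest-divergence positions:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def forward_cross_sample(base_logits_fn, rl_logits_fn, prefix,
                         max_new_tokens, tau):
    tokens, n_injected = list(prefix), 0
    for _ in range(max_new_tokens):
        base_logits = base_logits_fn(tokens)   # [vocab]
        rl_logits = rl_logits_fn(tokens)       # [vocab]
        jsd = js_divergence(base_logits[None, :], rl_logits[None, :])[0]
        if jsd > tau:
            # Critical fork: inject the RL policy's choice.
            next_token = rl_logits.argmax().item()
            n_injected += 1
        else:
            # Low divergence: sample from the base policy as usual.
            next_token = torch.multinomial(F.softmax(base_logits, -1), 1).item()
        tokens.append(next_token)
    return tokens, n_injected
```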

An important subtlety is that which tokens are high-divergence is context-dependent, not determined by token identity alone. The same token type can be sampled from both low-divergence and high-divergence distributions depending on the surrounding context. RLVR is learning when to steer, not just what to steer toward. The fine-grained analysis reveals the mechanism. At high-divergence positions, over 80% of the RL model's new top-1 token was already in the base model's top-3 candidates. RLVR rarely promotes tokens below 0.01 base probability. The mechanism is probability reallocation within existing support, not vocabulary expansion. The base and RL models share the same candidate set at almost every token position. They differ only in which candidate they prefer at a sparse set of critical branching points.
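The containment statistic is equally easy to compute. A sketch, with tensors as in the previous snippets and `high_div_mask` marking the positions flagged as high-divergence:

```python
def top3_containment(base_logits, rl_logits, high_div_mask):
    """At flagged positions, how often the RL model's top-1 token was already
    among the base model's top-3 candidates (>80% in the paper)."""
    rl_top1 = rl_logits.argmax(dim=-1)                # [seq_len]
    base_top3 = base_logits.topk(3, dim=-1).indices   # [seq_len, 3]
    contained = (base_top3 == rl_top1[:, None]).any(dim=-1)
    return contained[high_div_mask].float().mean().item()
```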

This connects directly to why DeepSeek's full-vocabulary OPD outperforms simpler alternatives. Prior work typically simplified the KL distillation loss to a token-level estimate using only the sampled token, essentially replacing the full distributional signal with a per-token advantage estimate plugged into a standard RL framework. DeepSeek's technical report is explicit about the tradeoff. This approach is resource-efficient, but leads to high variance in gradient estimation and often causes training instability. Their solution is to preserve the complete logit distribution when computing reverse KL between student and teacher, which requires significant engineering. The Sparse but Critical results explain why this engineering investment pays off. Logit-level distillation captures exactly the right information to redistribute probability mass among existing candidates. A specialist that learned to slightly prefer a different token ordering at 2-4% of positions would lose most of its value if distillation only preserved the top-1 choice at each position.
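To see what "full-vocabulary" buys, it helps to put the two losses side by side. This is my formulation of the contrast the report describes, not DeepSeek's code, and the routing of prompts across the ten-plus teachers is elided:

```python
import torch
import torch.nn.functional as F

def full_vocab_reverse_kl(student_logits, teacher_logits):
    """Reverse KL(student || teacher) summed over the entire vocabulary at
    each position. Logits are [seq_len, vocab]. Exact and low variance, but
    requires materializing full logits for both student and teacher."""
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    return (s_logp.exp() * (s_logp - t_logp)).sum(-1).mean()

def sampled_token_estimate(student_logits, teacher_logits, sampled_ids):
    """The cheaper estimator prior work plugged into RL frameworks: only the
    on-policy sampled token's log-ratio contributes. A valid Monte Carlo
    estimate of the reverse KL under student sampling, but high variance."""
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    idx = sampled_ids.unsqueeze(-1)                   # [seq_len, 1]
    return (s_logp.gather(-1, idx) - t_logp.gather(-1, idx)).mean()
```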

3. The Alternative Route: Simple Self-Distillation

You don't even need RL to navigate the thicket. Apple's Simple Self-Distillation paper (SSD, arXiv:2604.01193) shows that sampling outputs from a model at elevated temperature with top-k/top-p truncation, then fine-tuning the model on those samples with standard cross-entropy loss, produces substantial improvements. No filtering for correctness, no external reward signal, no verifier, no teacher model. Qwen3-30B-Instruct went from 42.4% to 55.3% pass@1 on LiveCodeBench v6, with the largest gains concentrating on harder problems (+15.3 on hard, +14.2 on medium, +6.5 on easy).

The method is comically simple, and the SSD authors' theoretical analysis helps explain why it works. When the training targets come from a temperature-shifted, truncated version of the model's own distribution, standard SFT implicitly optimizes a combination of three things: compressing the distribution's support (killing the garbage tail via truncation), reshaping the surviving candidates to be more uniformly weighted, and staying close to the base model's overall preferences. The temperature and truncation parameters aren't just sampling knobs. They define a specific direction to move through weight space, and SFT provides the gradient signal to move there.
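A minimal sketch of the SSD data-generation step under my reading of the paper; the training step is then ordinary next-token cross-entropy on the sampled sequences. `next_logits_fn` is a stand-in, and the hyperparameter values are illustrative, not the paper's:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ssd_sample(next_logits_fn, prefix, max_new_tokens,
               temperature=1.2, top_p=0.95):
    """Sample from a temperature-shifted, nucleus-truncated version of the
    model's own distribution. Fine-tuning on these samples with plain SFT is
    the whole method: no filtering, no verifier, no external teacher."""
    tokens = list(prefix)
    for _ in range(max_new_tokens):
        probs = F.softmax(next_logits_fn(tokens) / temperature, dim=-1)
        # Top-p truncation: keep the smallest set of tokens covering mass top_p.
        sorted_p, sorted_idx = probs.sort(descending=True)
        keep = sorted_p.cumsum(-1) - sorted_p < top_p
        truncated = torch.zeros_like(probs)
        truncated[sorted_idx[keep]] = sorted_p[keep]
        tokens.append(torch.multinomial(truncated / truncated.sum(), 1).item())
    return tokens
```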

The deliberately degraded version of this experiment made me question whether I ever understood how SFT works. At temperature 2.0 with no truncation, 62% of the resulting training data was gibberish with no extractable code. The model still improved. The thicket intuition is what makes this sensible. If roughly 60% of random perturbations at scale already improve task performance on a given domain, the bar for useful training signal is remarkably low. What matters is the direction of the weight update, not the surface quality of the individual samples. You're choosing which expert to navigate toward rather than sampling the thicket blindly.

SSD also produces specialists, not generalists, and the effect is scale-dependent. Table 5 of the paper shows that coding improves while math stays flat, and smaller models actually regress on AIME after SSD training. The 30B models hold steady on math while gaining substantially on coding, but the 4B models show clearer tradeoffs. This is exactly the anti-correlation that Neural Thickets' spectral discordance predicts. It also explains why DeepSeek trains specialists independently rather than hoping a single training run will improve everything at once. At any finite model size, navigating toward a coding expert means navigating away from the math expert. Larger models have denser thickets where expert neighborhoods overlap more, which is why SSD shows bigger absolute gains on bigger models (Qwen 30B gets +12.9 percentage points versus Qwen 4B getting +7.5), but even at 30B the tradeoff is there.

Taken together, SSD, RLVR, and DeepSeek's specialist pipeline represent an interesting progression of navigation strategies for the same underlying geometry. Neural Thickets' RandOpt is brute-force sampling: generate random perturbations and ensemble whatever you find. SSD uses the model's own output distribution to choose a specific direction, navigating with intent rather than randomness. RLVR uses reward signal to steer at a sparse set of critical token-level forks, and the Sparse but Critical results show it's remarkably precise about where it intervenes. DeepSeek runs independent navigations to each expert and merges the destinations through distillation, which is the most principled of the lot because it respects the anti-correlation structure that Neural Thickets reveals. Each step in this progression is more efficient than the last, but they're all moving through the same landscape, and the fact that all of them work is the strongest evidence that the landscape is real.

4. What This Means

The pretrained model is the product. Post-training is curation of what's already there. This reframing has real consequences for how we should think about the field.

Investment in pretraining quality compounds in a way that wasn't obvious before. Richer base models create denser thickets with more accessible experts. Neural Thickets' scaling result (solution density increases monotonically with model size) and SSD's scaling pattern (larger models show larger gains from self-distillation) are measuring the same underlying phenomenon. A better pretrained model doesn't just perform better out of the box. It gives post-training methods more experts to find and easier navigation to reach them. This is also consistent with the Rajani et al. "scalpel vs. hammer" framing. GRPO amplifies existing capabilities while SFT replaces them. The Qwen team's SFT comparison validates this empirically. SFT rewrites the model's behavior globally, while RLVR edits surgically at a sparse set of decision points. The scalpel works because the base model already has the right answer in its top-3 candidates at those critical positions. It just needs a nudge to prefer the right one.

For practitioners, there's a provocative implication. If RLVR only changes 1-4% of token decisions, and those new choices were already in the base model's top-3 candidates, there may be much cheaper ways to achieve similar effects. SSD is already one example. The Qwen team also explored divergence-weighted variants of the advantage signal in GRPO, weighting RL updates by the Jensen-Shannon divergence at each token position, and found that this diagnostic intervention can yield improvements over baselines. If we know that the signal is concentrated at high-divergence positions, we can potentially focus compute there rather than spreading it uniformly across the entire sequence.
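In sketch form, the divergence-weighted idea is just a per-token reweighting of the advantage before the policy-gradient update. The normalization here is my illustrative choice, not the Qwen team's exact formulation:

```python
def divergence_weighted_advantages(advantages, jsd, eps=1e-6):
    """Scale each token's advantage by its mean-normalized JS divergence,
    concentrating the update on the sparse high-divergence positions.
    advantages, jsd: [seq_len] tensors."""
    weights = jsd / (jsd.mean() + eps)
    return advantages * weights
```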

There's a broader story about scale here too. Everything in this post has a scale threshold. Neural Thickets shows that solution density goes from near zero at 0.5B to 60% at 32B. SSD gains are bigger on bigger models. The cross-sampling results show that stronger RL methods (DAPO vs. SimpleRL) require more intervention tokens but still operate on a small fraction of the sequence. All of it is consistent with a picture where the loss landscape undergoes a geometric transition as models get large enough, from isolated peaks separated by barriers to a dense thicket of accessible experts. DeepSeek-V4's specialist-then-distill pipeline is a bet that this transition is real and that the right architecture respects it rather than fighting against it.

DeepSeek-V4's architecture is the first production system built around the geometry of neural thickets, whether or not that's how they'd describe it internally. The papers published in the last few months provide the theoretical and empirical foundation for why the specialist-then-distill pipeline works. The base model already contains the experts. Post-training is just the map.


This synthesis draws on the DeepSeek-V4 technical report (April 2026), "Sparse but Critical" from the Qwen Pilot Team (arXiv:2603.22446, ICLR 2026), Neural Thickets from Gan and Isola at MIT (arXiv:2603.12228), Apple's SSD paper (arXiv:2604.01193), Fort et al.'s loss landscape work (2019), and Model Soups (Wortsman et al. 2022). I'd love to hear from anyone who's experimented with specialist-then-merge pipelines or SSD-style self-distillation in their own post-training work.