Blog

Navigating the Thicket: Why DeepSeek-V4 Trains Specialists Instead of One Model

April 25, 2026

DeepSeek-V4 replaced multi-domain RL with something counterintuitive: train ten-plus domain specialists independently, then merge them through on-policy distillation. Three recent papers explain why this works. The base model already contains the experts. Post-training is just the map.

Attention Beats Energy Gradients

April 23, 2026

A controlled comparison of URM-style transformer recurrence and EBT-style MCMC refinement in a shared hidden space on ARC-AGI. At matched compute, each transformer pass is a decisively better refinement step than each energy-gradient step. First-order trajectory ranking fails in five distinct ways. Second-order MCMC produces a correct energy landscape whose gradients don't improve decoding.

Built on Randomness: Why the Optimizer Is the Least Important Part of Deep Learning

April 19, 2026

Train the same model twice with different random seeds. Both hit 90% accuracy, but they disagree on 10% of the test set. This isn't noise. It's a window into the deep structure of neural networks: the geometry of the loss landscape, the lottery tickets hiding in your initialization, and the distinct modes that make ensembles work.

The Dark Factory Harness: Turning Autonomous Hill-Climbing into Autonomous Research

April 8, 2026

Autonomous ML experiment loops like autoresearch nail the primitives: edit, train, evaluate, keep or discard. But after 20 experiments, the agent is doing a random walk through code space. The fix isn't a better model; it's a better environment. Five principles for turning autonomous hill-climbing into autonomous research.