
On Geometric Understanding and Learned Priors in Feed-forward 3D Reconstruction Models

2026-03-17

Jelena Bratulić, Sudhanshu Mittal, Thomas Brox, Christian Rupprecht


Abstract

Feed-forward 3D reconstruction models such as DUSt3R, VGGT, and Depth Anything 3 (DA3) are transformer-based foundation models that infer camera geometry and dense scene structure in a single forward pass. Trained at scale in a supervised fashion, they raise a central question: do these models build on geometric principles akin to traditional multi-view pipelines, or do they primarily rely on priors learned from the large-scale training setup? To study this, we perform a systematic analysis of their internal representations across three real-world datasets and a controlled synthetic dataset. We quantify geometric understanding by probing intermediate features, analyzing attention patterns to identify correspondence-matching behavior, and performing targeted interventions at the attention level. We find that epipolar geometry emerges within the intermediate layers of all three models and is causally linked to correspondence patterns in attention heads. Further, we assess the role of learned priors by applying challenging input-level perturbations, such as occlusions, scene ambiguities, and varying camera configurations, and by comparing the models against classical multi-stage reconstruction pipelines.
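The epipolar geometry the abstract refers to is the constraint that a correspondence (x, x') between two views satisfies x'ᵀFx = 0 for the fundamental matrix F, i.e. x' lies on the epipolar line Fx. A minimal sketch of this check, using a toy F for rectified stereo rather than anything from the paper's actual probing setup:

```python
import numpy as np

def epipolar_line(F, x):
    """Epipolar line l' = F @ x in the second image for a homogeneous
    point x in the first image, normalized so that point-line distances
    come out in pixels."""
    l = F @ x
    return l / np.linalg.norm(l[:2])

def point_line_distance(x2, line):
    """Perpendicular distance of a homogeneous point to a normalized line."""
    return abs(line @ x2)

# Toy fundamental matrix for a pure horizontal translation (rectified
# stereo): corresponding points share a row, so epipolar lines are
# horizontal. This F is illustrative, not estimated from real data.
F = np.array([[0.0, 0.0,  0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0,  0.0]])

x1 = np.array([10.0, 5.0, 1.0])  # point in view 1
x2 = np.array([14.0, 5.0, 1.0])  # same row in view 2

line = epipolar_line(F, x1)
print(point_line_distance(x2, line))  # 0.0: x2 lies on the epipolar line of x1
```

In the spirit of the analysis described above, the same distance could be measured between an attention head's peak-response location and the epipolar line of the query token, to test whether attention concentrates along epipolar geometry; that probing procedure is hypothetical here.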
