SPACE-CLIP: Spatial Perception via Adaptive CLIP Embeddings for Monocular Depth Estimation
Taewan Cho, Taeryang Kim, Andrew Jaeyong Choi
Abstract
Contrastive Language-Image Pre-training (CLIP) provides strong semantic representations, but it is not designed for dense geometric prediction. Most CLIP-based monocular depth methods still rely on text prompts and image-text matching, which adds indirection and inference overhead. We propose SPACE-CLIP, a decoder-only framework that predicts depth directly from a frozen CLIP vision encoder and fully bypasses the text encoder. Its decoder fuses scene-level context from FiLM-conditioned semantic features with fine spatial cues from shallow layers. Under the TFI-FB constraint (text-free inference and frozen vision backbone), SPACE-CLIP achieves AbsRel 0.0901 on KITTI and 0.1042 on NYU Depth V2. These results, together with ablations, show that hierarchical fusion of semantic and structural cues is effective while preserving modularity for embodied AI systems such as vision-language-action (VLA) models. We also observe stable training behavior across both datasets with the same frozen-backbone setting, which supports reproducible deployment in integration-constrained pipelines. Our model is available at https://github.com/taewan2002/space-clip