SPACE-CLIP: Spatial Perception via Adaptive CLIP Embeddings for Monocular Depth Estimation
Taewan Cho, Taeryang Kim, Andrew Jaeyong Choi
Abstract
Contrastive Language-Image Pre-training (CLIP) provides strong semantic representations, but it is not designed for dense geometric prediction. Most CLIP-based monocular depth methods still rely on text prompts and image-text matching, which adds indirection and inference overhead. We propose SPACE-CLIP, a decoder-only framework that predicts depth directly from a frozen CLIP vision encoder and fully bypasses the text encoder. Its decoder fuses scene-level context from FiLM-conditioned semantic features with fine spatial cues from shallow layers. Under the TFI-FB constraint (text-free inference and frozen vision backbone), SPACE-CLIP achieves AbsRel 0.0901 on KITTI and 0.1042 on NYU Depth V2. These results, together with ablations, show that hierarchical fusion of semantic and structural cues is effective while preserving modularity for embodied AI systems such as vision-language-action (VLA) models. We also observe stable training behavior across both datasets with the same frozen-backbone setting, which supports reproducible deployment in integration-constrained pipelines. Our model is available at https://github.com/taewan2002/space-clip