When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models | May 16, 2024 | In-Context Learning Question Answering | Code Available | 75
Trajectory Prediction Meets Large Language Models: A Survey | Jun 3, 2025 | Language Modeling Language Modelling | Code Available | 55
GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models | Jan 2, 2025 | Scene Understanding text annotation | Code Available | 45
Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving | Oct 29, 2024 | Autonomous Driving Scene Understanding | Code Available | 45
Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models | Feb 12, 2024 | Hallucination Object Localization | Code Available | 45
SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM | Dec 4, 2023 | Camera Pose Estimation Novel View Synthesis | Code Available | 45
OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model | Mar 30, 2025 | Autonomous Driving Decision Making | Code Available | 45
Distill Any Depth: Distillation Creates a Stronger Monocular Depth Estimator | Feb 26, 2025 | Depth Estimation Diversity | Code Available | 45
Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation | Dec 4, 2023 | Depth Estimation GPU | Code Available | 45
HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation | Jan 24, 2025 | Autonomous Driving Language Modeling | Code Available | 35
CrossOver: 3D Scene Cross-Modal Alignment | Feb 20, 2025 | cross-modal alignment Object | Code Available | 35
GARField: Group Anything with Radiance Fields | Jan 17, 2024 | Scene Understanding | Code Available | 35
4D Panoptic Scene Graph Generation | May 16, 2024 | 4D Panoptic Segmentation Graph Generation | Code Available | 35
iDisc: Internal Discretization for Monocular Depth Estimation | Apr 13, 2023 | Autonomous Driving Depth Estimation | Code Available | 35
DeepInteraction++: Multi-Modality Interaction for Autonomous Driving | Aug 9, 2024 | 3D Object Detection Autonomous Driving | Code Available | 35
Generalized Robot 3D Vision-Language Model with Fast Rendering and Pre-Training Vision-Language Alignment | Dec 1, 2023 | Contrastive Learning Few-Shot Learning | Code Available | 35
DFormerv2: Geometry Self-Attention for RGBD Semantic Segmentation | Apr 7, 2025 | 3D geometry RGBD Semantic Segmentation | Code Available | 35
Embodied Understanding of Driving Scenarios | Mar 7, 2024 | Autonomous Driving Language Modeling | Code Available | 35
Swin3D++: Effective Multi-Source Pretraining for 3D Indoor Scene Understanding | Feb 22, 2024 | Diversity Scene Understanding | Code Available | 35
SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving | Mar 16, 2023 | 3D Object Detection Autonomous Driving | Code Available | 35
SGS-SLAM: Semantic Gaussian Splatting For Neural Dense SLAM | Feb 5, 2024 | 3D Semantic Segmentation Camera Pose Estimation | Code Available | 35
EPRecon: An Efficient Framework for Real-Time Panoptic 3D Reconstruction from Monocular Video | Sep 3, 2024 | 3D Reconstruction Scene Understanding | Code Available | 35
SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining | Mar 23, 2025 | 3DGS Benchmarking | Code Available | 35
Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation | Apr 5, 2024 | Decoder Mamba | Code Available | 35
Multi-Modal Data-Efficient 3D Scene Understanding for Autonomous Driving | May 8, 2024 | Autonomous Driving LIDAR Semantic Segmentation | Code Available | 35
AudioBench: A Universal Benchmark for Audio Large Language Models | Jun 23, 2024 | Audio Scene Understanding Instruction Following | Code Available | 35
MoAI: Mixture of All Intelligence for Large Language and Vision Models | Mar 12, 2024 | All Mixture-of-Experts | Code Available | 35
Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models | Nov 11, 2023 | Image Captioning MMR total | Code Available | 35
STORM: Spatio-Temporal Reconstruction Model for Large-Scale Outdoor Scenes | Dec 31, 2024 | Dynamic Reconstruction Scene Flow Estimation | Code Available | 35
Inst3D-LMM: Instance-Aware 3D Scene Understanding with Multi-modal Instruction Tuning | Mar 1, 2025 | Scene Understanding | Code Available | 25
InvPT: Inverted Pyramid Multi-task Transformer for Dense Scene Understanding | Mar 15, 2022 | Boundary Detection Human Parsing | Code Available | 25
Hier-SLAM: Scaling-up Semantics in SLAM with a Hierarchically Categorical Gaussian Splatting | Sep 19, 2024 | Scene Understanding Semantic Segmentation | Code Available | 25
Hypersim: A Photorealistic Synthetic Dataset for Holistic Indoor Scene Understanding | Nov 4, 2020 | Multi-Task Learning Scene Understanding | Code Available | 25
InvPT++: Inverted Pyramid Multi-Task Transformer for Visual Scene Understanding | Jun 8, 2023 | Decoder Multi-Task Learning | Code Available | 25
Grounded 3D-LLM with Referent Tokens | May 16, 2024 | Dense Captioning Diversity | Code Available | 25
GroupViT: Semantic Segmentation Emerges from Text Supervision | Feb 22, 2022 | Object Detection Scene Understanding | Code Available | 25
GaussianPretrain: A Simple Unified 3D Gaussian Representation for Visual Pre-training in Autonomous Driving | Nov 19, 2024 | 3D Object Detection Autonomous Driving | Code Available | 25
BinsFormer: Revisiting Adaptive Bins for Monocular Depth Estimation | Apr 3, 2022 | Decoder Depth Estimation | Code Available | 25
GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks | Nov 28, 2024 | Benchmarking Object Counting | Code Available | 25
HAKE: A Knowledge Engine Foundation for Human Activity Understanding | Feb 14, 2022 | Action Recognition Human-Object Interaction Detection | Code Available | 25
IRef-VLA: A Benchmark for Interactive Referential Grounding with Imperfect Language in 3D Scenes | Mar 20, 2025 | Scene Understanding Spatial Reasoning | Code Available | 25
FusionVision: A comprehensive approach of 3D object reconstruction and segmentation from RGB-D cameras using YOLO and fast segment anything | Feb 29, 2024 | 3D Object Reconstruction Instance Segmentation | Code Available | 25
An Egocentric Vision-Language Model based Portable Real-time Smart Assistant | Mar 6, 2025 | Language Modeling Language Modelling | Code Available | 25
An End-to-End Robust Point Cloud Semantic Segmentation Network with Single-Step Conditional Diffusion Models | Nov 25, 2024 | Denoising Scene Understanding | Code Available | 25
Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning | Dec 16, 2024 | Hallucination Robot Manipulation | Code Available | 25
Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion | Jul 8, 2025 | 3D geometry Domain Generalization | Code Available | 25
3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding | Dec 24, 2024 | Natural Language Understanding Scene Understanding | Code Available | 25
ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding | Oct 17, 2024 | 3D Semantic Segmentation Image Generation | Code Available | 25
GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis | Jan 30, 2023 | Image Generation Scene Understanding | Code Available | 25
EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding | Dec 5, 2024 | Prediction Scene Understanding | Code Available | 25