SOTAVerified

Scene Understanding

Scene understanding involves interpreting the visual information of a scene, including objects, their spatial relationships, and the overall layout. It goes beyond simple object recognition by considering the context and how objects relate to each other and the environment.

Papers

Showing 101–150 of 1723 papers

| Title | Status | Hype |
|---|---|---|
| CLIP goes 3D: Leveraging Prompt Tuning for Language Grounded 3D Recognition | Code | 2 |
| GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis | Code | 2 |
| Diffusion-based Generation, Optimization, and Planning in 3D Scenes | Code | 2 |
| Panoptic Lifting for 3D Scene Understanding with Neural Fields | Code | 2 |
| PLA: Language-Driven Open-Vocabulary 3D Scene Understanding | Code | 2 |
| OpenScene: 3D Scene Understanding with Open Vocabularies | Code | 2 |
| Safety-Enhanced Autonomous Driving Using Interpretable Sensor Fusion Transformer | Code | 2 |
| Panoptic Scene Graph Generation | Code | 2 |
| BinsFormer: Revisiting Adaptive Bins for Monocular Depth Estimation | Code | 2 |
| InvPT: Inverted Pyramid Multi-task Transformer for Dense Scene Understanding | Code | 2 |
| CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers | Code | 2 |
| GroupViT: Semantic Segmentation Emerges from Text Supervision | Code | 2 |
| HAKE: A Knowledge Engine Foundation for Human Activity Understanding | Code | 2 |
| Panoptic nuScenes: A Large-Scale Benchmark for LiDAR Panoptic Segmentation and Tracking | Code | 2 |
| Hypersim: A Photorealistic Synthetic Dataset for Holistic Indoor Scene Understanding | Code | 2 |
| Multi-Task Learning as Multi-Objective Optimization | Code | 2 |
| Learning to Tune Like an Expert: Interpretable and Scene-Aware Navigation via MLLM Reasoning and CVAE-Based Adaptation | Code | 1 |
| SurgTPGS: Semantic 3D Surgical Scene Understanding with Text Promptable Gaussian Splatting | Code | 1 |
| ReME: A Data-Centric Framework for Training-Free Open-Vocabulary Segmentation | Code | 1 |
| DIP: Unsupervised Dense In-Context Post-training of Visual Representations | Code | 1 |
| STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving | Code | 1 |
| OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis | Code | 1 |
| PhysGaia: A Physics-Aware Dataset of Multi-Body Interactions for Dynamic Novel View Synthesis | Code | 1 |
| CoNav: Collaborative Cross-Modal Reasoning for Embodied Navigation | Code | 1 |
| StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation | Code | 1 |
| Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving | Code | 1 |
| Hearing and Seeing Through CLIP: A Framework for Self-Supervised Sound Source Localization | Code | 1 |
| LLM-Empowered Embodied Agent for Memory-Augmented Task Planning in Household Robotics | Code | 1 |
| Training-Free Hierarchical Scene Understanding for Gaussian Splatting with Superpoint Graphs | Code | 1 |
| DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency | Code | 1 |
| SoccerNet-v3D: Leveraging Sports Broadcast Replays for 3D Scene Understanding | Code | 1 |
| Masked Scene Modeling: Narrowing the Gap Between Supervised and Self-Supervised Learning in 3D Scene Understanding | Code | 1 |
| CamContextI2V: Context-aware Controllable Video Generation | Code | 1 |
| F-ViTA: Foundation Model Guided Visible to Thermal Translation | Code | 1 |
| Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision | Code | 1 |
| WikiVideo: Article Generation from Multiple Videos | Code | 1 |
| Boosting Omnidirectional Stereo Matching with a Pre-trained Depth Foundation Model | Code | 1 |
| Mitigating Trade-off: Stream and Query-guided Aggregation for Efficient and Effective 3D Occupancy Prediction | Code | 1 |
| The Coralscapes Dataset: Semantic Scene Understanding in Coral Reefs | Code | 1 |
| Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding | Code | 1 |
| NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models | Code | 1 |
| Logic-RAG: Augmenting Large Multimodal Models with Visual-Spatial Knowledge for Road Scene Understanding | Code | 1 |
| A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning | Code | 1 |
| VLScene: Vision-Language Guidance Distillation for Camera-Based 3D Semantic Scene Completion | Code | 1 |
| Occlusion-aware Non-Rigid Point Cloud Registration via Unsupervised Neural Deformation Correntropy | Code | 1 |
| Event-aided Semantic Scene Completion | Code | 1 |
| EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery | Code | 1 |
| A Survey of World Models for Autonomous Driving | Code | 1 |
| 3UR-LLM: An End-to-End Multimodal Large Language Model for 3D Scene Understanding | Code | 1 |
| All-Day Multi-Camera Multi-Target Tracking | Code | 1 |
Page 3 of 35

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | ACRV Baseline | OMQ | 0.44 | | Unverified |
| 2 | Team VGAI (TCS Research) | OMQ | 0.37 | | Unverified |
| 3 | Demo_semantic_SLAM | OMQ | 0.11 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | CPN(ResNet-101) | Mean IoU | 46.3 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | ACRV Baseline | OMQ | 0.35 | | Unverified |