SOTAVerified

Scene Understanding

Scene understanding involves interpreting the visual information of a scene, including objects, their spatial relationships, and the overall layout. It goes beyond simple object recognition by considering the context and how objects relate to each other and the environment.

Papers

Showing 401–450 of 1723 papers

Title | Status | Hype
A2-FPN for Semantic Segmentation of Fine-Resolution Remotely Sensed Images | Code | 1
M3D-RPN: Monocular 3D Region Proposal Network for Object Detection | Code | 1
MassMIND: Massachusetts Maritime INfrared Dataset | Code | 1
Masked Scene Modeling: Narrowing the Gap Between Supervised and Self-Supervised Learning in 3D Scene Understanding | Code | 1
Panoramic Panoptic Segmentation: Insights Into Surrounding Parsing for Mobile Agents via Unsupervised Contrastive Learning | Code | 1
PC-BEV: An Efficient Polar-Cartesian BEV Fusion Framework for LiDAR Semantic Segmentation | Code | 1
PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection | Code | 1
Class-Incremental Domain Adaptation with Smoothing and Calibration for Surgical Report Generation | Code | 1
STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving | Code | 1
Distilled Semantics for Comprehensive Scene Understanding from Videos | Code | 1
Event-aided Semantic Scene Completion | Code | 1
Microsoft COCO: Common Objects in Context | Code | 1
Estimating and Exploiting the Aleatoric Uncertainty in Surface Normal Estimation | Code | 1
SurgTPGS: Semantic 3D Surgical Scene Understanding with Text Promptable Gaussian Splatting | Code | 1
Estimating Generic 3D Room Structures from 2D Annotations | Code | 1
Boosting Omnidirectional Stereo Matching with a Pre-trained Depth Foundation Model | Code | 1
Bootstraping Clustering of Gaussians for View-consistent 3D Scene Understanding | Code | 1
Event-based Motion Segmentation with Spatio-Temporal Graph Cuts | Code | 1
PanopticNDT: Efficient and Robust Panoptic Mapping | Code | 1
A Versatile and Efficient Reinforcement Learning Framework for Autonomous Driving | Code | 1
EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery | Code | 1
0-MMS: Zero-Shot Multi-Motion Segmentation With A Monocular Event Camera | Code | 1
A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning | Code | 1
DPF: Learning Dense Prediction Fields with Weak Supervision | Code | 1
Enhancing Scene Graph Generation with Hierarchical Relationships and Commonsense Knowledge | Code | 1
MonoDistill: Learning Spatial Features for Monocular 3D Object Detection | Code | 1
PanoOcc: Unified Occupancy Representation for Camera-based 3D Panoptic Segmentation | Code | 1
MSeg: A Composite Dataset for Multi-domain Semantic Segmentation | Code | 1
Explainable Object-induced Action Decision for Autonomous Vehicles | Code | 1
TextSLAM: Visual SLAM with Planar Text Features | Code | 1
OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge | Code | 1
MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-based Visual Question Answering | Code | 1
Multi3DRefer: Grounding Text Description to Multiple 3D Objects | Code | 1
DTCLMapper: Dual Temporal Consistent Learning for Vectorized HD Map Construction | Code | 1
Panoptic 3D Scene Reconstruction From a Single RGB Image | Code | 1
Dual-Hybrid Attention Network for Specular Highlight Removal | Code | 1
Multimodal Dataset for Localization, Mapping and Crop Monitoring in Citrus Tree Farms | Code | 1
Egocentric Scene Understanding via Multimodal Spatial Rectifier | Code | 1
Cityscapes-Panoptic-Parts and PASCAL-Panoptic-Parts datasets for Scene Understanding | Code | 1
Efficient Multi-Task RGB-D Scene Analysis for Indoor Environments | Code | 1
Dynamic Graph Message Passing Networks for Visual Recognition | Code | 1
Bridging the Domain Gap: Self-Supervised 3D Scene Understanding with Foundation Models | Code | 1
Exploiting Edge-Oriented Reasoning for 3D Point-based Scene Graph Analysis | Code | 1
Dynamic Scene Understanding through Object-Centric Voxelization and Neural Rendering | Code | 1
Multi-Scale Attention for Audio Question Answering | Code | 1
Multi-stage Factorized Spatio-Temporal Representation for RGB-D Action and Gesture Recognition | Code | 1
P2T: Pyramid Pooling Transformer for Scene Understanding | Code | 1
ARKitScenes: A Diverse Real-World Dataset For 3D Indoor Scene Understanding Using Mobile RGB-D Data | Code | 1
ECLAIR: A High-Fidelity Aerial LiDAR Dataset for Semantic Segmentation | Code | 1
3UR-LLM: An End-to-End Multimodal Large Language Model for 3D Scene Understanding | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | ACRV Baseline | OMQ | 0.44 | - | Unverified
2 | Team VGAI (TCS Research) | OMQ | 0.37 | - | Unverified
3 | Demo_semantic_SLAM | OMQ | 0.11 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | CPN(ResNet-101) | Mean IoU | 46.3 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | ACRV Baseline | OMQ | 0.35 | - | Unverified