SOTAVerified

Scene Understanding

Scene understanding is the task of interpreting the visual content of a scene: the objects present, their spatial relationships, and the overall layout. It goes beyond object recognition by accounting for context, that is, how objects relate to one another and to their environment.

Papers

Showing 101–150 of 1723 papers

Title | Status | Hype
On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving | Code | 2
TrackOcc: Camera-based 4D Panoptic Occupancy Tracking | Code | 2
InvPT: Inverted Pyramid Multi-task Transformer for Dense Scene Understanding | Code | 2
AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving | Code | 2
Hier-SLAM: Scaling-up Semantics in SLAM with a Hierarchically Categorical Gaussian Splatting | Code | 2
Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning | Code | 2
InvPT++: Inverted Pyramid Multi-Task Transformer for Visual Scene Understanding | Code | 2
HAKE: A Knowledge Engine Foundation for Human Activity Understanding | Code | 2
Chameleon: Fast-slow Neuro-symbolic Lane Topology Extraction | Code | 2
CLIP goes 3D: Leveraging Prompt Tuning for Language Grounded 3D Recognition | Code | 2
ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding | Code | 2
Grounded 3D-LLM with Referent Tokens | Code | 2
IRef-VLA: A Benchmark for Interactive Referential Grounding with Imperfect Language in 3D Scenes | Code | 2
RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding | Code | 2
Tackling View-Dependent Semantics in 3D Language Gaussian Splatting | Code | 2
Inst3D-LMM: Instance-Aware 3D Scene Understanding with Multi-modal Instruction Tuning | Code | 2
Generating Visual Spatial Description via Holistic 3D Scene Understanding | Code | 1
General Geometry-aware Weakly Supervised 3D Object Detection | Code | 1
GFF: Gated Fully Fusion for Semantic Segmentation | Code | 1
3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding | Code | 1
A Review of Panoptic Segmentation for Mobile Mapping Point Clouds | Code | 1
Advances in Deep Concealed Scene Understanding | Code | 1
F-ViTA: Foundation Model Guided Visible to Thermal Translation | Code | 1
Global Aggregation then Local Distribution in Fully Convolutional Networks | Code | 1
FPS-Net: A Convolutional Fusion Network for Large-Scale LiDAR Point Cloud Segmentation | Code | 1
FocusFlow: Boosting Key-Points Optical Flow Estimation for Autonomous Driving | Code | 1
FreDSNet: Joint Monocular Depth and Semantic Segmentation with Fast Fourier Convolutions | Code | 1
Few-Shot Object Detection and Viewpoint Estimation for Objects in the Wild | Code | 1
Arabic Scene Text Recognition in the Deep Learning Era: Analysis on A Novel Dataset | Code | 1
FloodNet: A High Resolution Aerial Imagery Dataset for Post Flood Scene Understanding | Code | 1
From General to Specific: Informative Scene Graph Generation via Balance Adjustment | Code | 1
Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving | Code | 1
Expressive Scene Graph Generation Using Commonsense Knowledge Infusion for Visual Understanding and Reasoning | Code | 1
A2-FPN for Semantic Segmentation of Fine-Resolution Remotely Sensed Images | Code | 1
Exploiting Edge-Oriented Reasoning for 3D Point-based Scene Graph Analysis | Code | 1
AVSegFormer: Audio-Visual Segmentation with Transformer | Code | 1
Exploring Data-Efficient 3D Scene Understanding with Contrastive Scene Contexts | Code | 1
From Multi-View to Hollow-3D: Hallucinated Hollow-3D R-CNN for 3D Object Detection | Code | 1
Global-Reasoned Multi-Task Learning Model for Surgical Scene Understanding | Code | 1
Estimating and Exploiting the Aleatoric Uncertainty in Surface Normal Estimation | Code | 1
Auto-Panoptic: Cooperative Multi-Component Architecture Search for Panoptic Segmentation | Code | 1
Estimating Generic 3D Room Structures from 2D Annotations | Code | 1
Automatic Extrinsic Calibration Method for LiDAR and Camera Sensor Setups | Code | 1
OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge | Code | 1
Event-aided Semantic Scene Completion | Code | 1
A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning | Code | 1
EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery | Code | 1
3DRM: Pair-wise relation module for 3D object detection | Code | 1
Enhancing Scene Graph Generation with Hierarchical Relationships and Commonsense Knowledge | Code | 1
Event-based Motion Segmentation with Spatio-Temporal Graph Cuts | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | ACRV Baseline | OMQ | 0.44 | | Unverified
2 | Team VGAI (TCS Research) | OMQ | 0.37 | | Unverified
3 | Demo_semantic_SLAM | OMQ | 0.11 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | CPN(ResNet-101) | Mean IoU | 46.3 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | ACRV Baseline | OMQ | 0.35 | | Unverified