
3D Question Answering (3D-QA)

A 3D-QA task requires models to answer a question given full information about a 3D scene. Here, models use 3D spatial information such as RGB-D scans or point cloud data. We also require models to specify the 3D bounding boxes of the objects involved in answering the question. This prevents models from answering questions by relying on textual priors from the training questions without examining the scene. However, unlike 3D dense captioning, we do not require models to target a single described object per question, because multiple objects can be needed to answer certain questions. For example, the question "What color are the chairs around the table?" involves multiple objects. It is still answerable as long as the chairs around the unique table in the scene share the same color. In such scenarios, we require models to answer the question by addressing multiple 3D bounding boxes.
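To make the task format concrete, the sketch below shows one way a prediction carrying an answer plus multiple grounded 3D bounding boxes could be represented, together with the axis-aligned 3D IoU commonly used to score box grounding. All names (`BBox3D`, `QAPrediction`, `iou_3d`) are illustrative assumptions, not part of any official benchmark toolkit.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BBox3D:
    # Hypothetical axis-aligned box: center (x, y, z) and size (dx, dy, dz).
    center: Tuple[float, float, float]
    size: Tuple[float, float, float]


@dataclass
class QAPrediction:
    question: str
    answer: str
    # Unlike 3D dense captioning, a single prediction may ground several
    # objects, e.g. every chair around the table.
    boxes: List[BBox3D]


def iou_3d(a: BBox3D, b: BBox3D) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    inter = 1.0
    for i in range(3):
        lo = max(a.center[i] - a.size[i] / 2, b.center[i] - b.size[i] / 2)
        hi = min(a.center[i] + a.size[i] / 2, b.center[i] + b.size[i] / 2)
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    vol_a = a.size[0] * a.size[1] * a.size[2]
    vol_b = b.size[0] * b.size[1] * b.size[2]
    return inter / (vol_a + vol_b - inter)


pred = QAPrediction(
    question="What color are the chairs around the table?",
    answer="brown",
    boxes=[
        BBox3D(center=(1.2, 0.5, 0.4), size=(0.5, 0.5, 0.9)),
        BBox3D(center=(2.0, 0.5, 0.4), size=(0.5, 0.5, 0.9)),
    ],
)
```

A grounded answer would typically count as correct only if the textual answer matches and each predicted box exceeds some IoU threshold (e.g. 0.25 or 0.5) with a ground-truth box; the exact protocol varies by benchmark.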

Papers

Showing 22 of 22 papers

Title | Status | Hype
Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis | Code | 1
DSPNet: Dual-vision Scene Perception for Robust 3D Question Answering | Code | 1
Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding | Code | 0
Video Instruction Tuning With Synthetic Data | | 0
LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness | | 0
Multi-modal Situated Reasoning in 3D Scenes | Code | 2
LLaVA-OneVision: Easy Visual Task Transfer | Code | 0
Unifying 3D Vision-Language Understanding via Promptable Queries | | 0
Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning | | 0
ShapeLLM: Universal 3D Object Understanding for Embodied Interaction | Code | 3
Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA | Code | 1
Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers | Code | 2
Towards Learning a Generalist Model for Embodied Navigation | Code | 2
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | Code | 2
An Embodied Generalist Agent in 3D World | Code | 2
Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following | Code | 2
PointLLM: Empowering Large Language Models to Understand Point Clouds | Code | 2
3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment | Code | 2
3D-LLM: Injecting the 3D World into Large Language Models | Code | 3
Visual Instruction Tuning | Code | 6
ScanQA: 3D Question Answering for Spatial Scene Understanding | Code | 1
Scan2Cap: Context-aware Dense Captioning in RGB-D Scans | | 0

No leaderboard results yet.