SOTAVerified

Dense Captioning

Papers

Showing 1-50 of 69 papers

Title | Status | Hype
STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving | Code | 1
Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs | - | 0
TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action | Code | 1
3D Spatial Understanding in MLLMs: Disambiguation and Evaluation | - | 0
PerLA: Perceptive 3D Language Assistant | Code | 1
3D Scene Graph Guided Vision-Language Pre-training | - | 0
ComiCap: A VLMs pipeline for dense captioning of Comic Panels | Code | 1
Hint-AD: Holistically Aligned Interpretability in End-to-End Autonomous Driving | - | 0
xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations | - | 0
See It All: Contextualized Late Aggregation for 3D Dense Captioning | - | 0
Bi-directional Contextual Attention for 3D Dense Captioning | - | 0
PaveCap: The First Multimodal Framework for Comprehensive Pavement Condition Assessment with Dense Captioning and PCI Estimation | Code | 0
Complete 3d relationships extraction modality alignment network for 3d dense captioning | - | 0
Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions | - | 0
3D Vision and Language Pretraining with Large-Scale Synthetic Data | Code | 1
Details Make a Difference: Object State-Sensitive Neurorobotic Task Planning | Code | 0
Grounded 3D-LLM with Referent Tokens | Code | 2
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | Code | 4
Rethinking 3D Dense Caption and Visual Grounding in A Unified Framework through Prompt-based Localization | Code | 0
DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection | - | 0
TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes | Code | 2
Entity6K: A Large Open-Domain Evaluation Dataset for Real-World Entity Recognition | - | 0
FlexCap: Describe Anything in Images in Controllable Detail | - | 0
Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning | - | 0
A Comprehensive Survey of 3D Dense Captioning: Localizing and Describing Objects in 3D Scenes | - | 0
ControlCap: Controllable Region-level Captioning | Code | 2
LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding Reasoning and Planning | Code | 3
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | Code | 2
LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning | Code | 2
Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning | Code | 1
3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment | Code | 2
3D-LLM: Injecting the 3D World into Large Language Models | Code | 3
Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner | Code | 1
IIITD-20K: Dense captioning for Text-Image ReID | Code | 0
CapDet: Unifying Dense Captioning and Open-World Detection Pretraining | - | 0
End-to-End 3D Dense Captioning with Vote2Cap-DETR | Code | 1
Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training | Code | 1
UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding | - | 0
GRiT: A Generative Region-to-text Transformer for Object Understanding | Code | 2
Contextual Modeling for 3D Dense Captioning on Point Clouds | - | 0
SAVCHOI: Detecting Suspicious Activities using Dense Video Captioning with Human Object Interactions | - | 0
CapOnImage: Context-driven Dense-Captioning on Image | - | 0
Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds | Code | 1
Semantic-Aware Pretraining for Dense Video Captioning | - | 0
MORE: Multi-Order RElation Mining for Dense Captioning in 3D Scenes | Code | 1
X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning | Code | 1
Describing image focused in cognitive and visual details for visually impaired people: An approach to generating inclusive paragraphs | - | 0
3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds | - | 0
D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding | - | 0
Integrating Visuospatial, Linguistic, and Commonsense Structure into Story Visualization | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | ControlCap | mAP | 18.2 | - | Unverified
2 | GRiT (ViT-B) | mAP | 15.5 | - | Unverified
3 | CAG-Net | mAP | 10.5 | - | Unverified
4 | FCLN | mAP | 5.4 | - | Unverified