| STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving | Jun 6, 2025 | Autonomous Driving, Autonomous Vehicles | Code Available | 1 |
| Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs | Jun 5, 2025 | cross-modal alignment, Dense Captioning | Unverified | 0 |
| TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action | May 2, 2025 | Dense Captioning, Highlight Detection | Code Available | 1 |
| 3D Spatial Understanding in MLLMs: Disambiguation and Evaluation | Dec 9, 2024 | 3D dense captioning, 3D visual grounding | Unverified | 0 |
| PerLA: Perceptive 3D Language Assistant | Nov 29, 2024 | Dense Captioning, Graph Neural Network | Code Available | 1 |
| 3D Scene Graph Guided Vision-Language Pre-training | Nov 27, 2024 | 3D dense captioning, 3D visual grounding | Unverified | 0 |
| ComiCap: A VLMs pipeline for dense captioning of Comic Panels | Sep 24, 2024 | Attribute, Dense Captioning | Code Available | 1 |
| Hint-AD: Holistically Aligned Interpretability in End-to-End Autonomous Driving | Sep 10, 2024 | 3D dense captioning, Autonomous Driving | Unverified | 0 |
| xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations | Aug 22, 2024 | Dense Captioning, Motion Estimation | Unverified | 0 |
| See It All: Contextualized Late Aggregation for 3D Dense Captioning | Aug 14, 2024 | 3D dense captioning, All | Unverified | 0 |
| Bi-directional Contextual Attention for 3D Dense Captioning | Aug 13, 2024 | 3D dense captioning, Attribute | Unverified | 0 |
| PaveCap: The First Multimodal Framework for Comprehensive Pavement Condition Assessment with Dense Captioning and PCI Estimation | Aug 7, 2024 | Decoder, Dense Captioning | Code Available | 0 |
| Complete 3d relationships extraction modality alignment network for 3d dense captioning | Aug 1, 2024 | 3D dense captioning, 3D Object Detection | Unverified | 0 |
| Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions | Jul 9, 2024 | Dense Captioning, object-detection | Unverified | 0 |
| 3D Vision and Language Pretraining with Large-Scale Synthetic Data | Jul 8, 2024 | Dense Captioning, Diversity | Code Available | 1 |
| Details Make a Difference: Object State-Sensitive Neurorobotic Task Planning | Jun 14, 2024 | Dense Captioning, Object | Code Available | 0 |
| Grounded 3D-LLM with Referent Tokens | May 16, 2024 | Dense Captioning, Diversity | Code Available | 2 |
| PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | Apr 25, 2024 | Dense Captioning, MVBench | Code Available | 4 |
| Rethinking 3D Dense Caption and Visual Grounding in A Unified Framework through Prompt-based Localization | Apr 17, 2024 | 3D dense captioning, 3D visual grounding | Code Available | 0 |
| DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection | Apr 14, 2024 | Dense Captioning, Language Modelling | Unverified | 0 |
| TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes | Mar 28, 2024 | 3D dense captioning, Dense Captioning | Code Available | 2 |
| Entity6K: A Large Open-Domain Evaluation Dataset for Real-World Entity Recognition | Mar 19, 2024 | Dense Captioning, Image Captioning | Unverified | 0 |
| FlexCap: Describe Anything in Images in Controllable Detail | Mar 18, 2024 | Attribute, Dense Captioning | Unverified | 0 |
| Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning | Mar 18, 2024 | 3D Question Answering (3D-QA), Dense Captioning | Unverified | 0 |
| A Comprehensive Survey of 3D Dense Captioning: Localizing and Describing Objects in 3D Scenes | Mar 12, 2024 | 3D dense captioning, Dense Captioning | Unverified | 0 |
| ControlCap: Controllable Region-level Captioning | Jan 31, 2024 | Dense Captioning | Code Available | 2 |
| LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding Reasoning and Planning | Jan 1, 2024 | 3D dense captioning, Dense Captioning | Code Available | 3 |
| TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | Dec 4, 2023 | Dense Captioning, Highlight Detection | Code Available | 2 |
| LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning | Nov 30, 2023 | 3D dense captioning, Dense Captioning | Code Available | 2 |
| Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning | Sep 6, 2023 | 3D dense captioning, Caption Generation | Code Available | 1 |
| 3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment | Aug 8, 2023 | 3D Question Answering (3D-QA), Dense Captioning | Code Available | 2 |
| 3D-LLM: Injecting the 3D World into Large Language Models | Jul 24, 2023 | 3D Object Captioning, 3D Question Answering (3D-QA) | Code Available | 3 |
| Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner | May 19, 2023 | Dense Captioning, Image Captioning | Code Available | 1 |
| IIITD-20K: Dense captioning for Text-Image ReID | May 8, 2023 | Dense Captioning | Code Available | 0 |
| CapDet: Unifying Dense Captioning and Open-World Detection Pretraining | Mar 4, 2023 | Dense Captioning | Unverified | 0 |
| End-to-End 3D Dense Captioning with Vote2Cap-DETR | Jan 6, 2023 | 3D dense captioning, Decoder | Code Available | 1 |
| Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training | Jan 1, 2023 | 3D dense captioning, 3D visual grounding | Code Available | 1 |
| UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding | Dec 1, 2022 | 3D dense captioning, 3D visual grounding | Unverified | 0 |
| GRiT: A Generative Region-to-text Transformer for Object Understanding | Dec 1, 2022 | Decoder, Dense Captioning | Code Available | 2 |
| Contextual Modeling for 3D Dense Captioning on Point Clouds | Oct 8, 2022 | 3D dense captioning, Dense Captioning | Unverified | 0 |
| SAVCHOI: Detecting Suspicious Activities using Dense Video Captioning with Human Object Interactions | Jul 24, 2022 | Dense Captioning, Dense Video Captioning | Unverified | 0 |
| CapOnImage: Context-driven Dense-Captioning on Image | Apr 27, 2022 | Dense Captioning, Diversity | Unverified | 0 |
| Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds | Apr 22, 2022 | 3D dense captioning, 3D Object Detection | Code Available | 1 |
| Semantic-Aware Pretraining for Dense Video Captioning | Apr 13, 2022 | Dense Captioning, Dense Video Captioning | Unverified | 0 |
| MORE: Multi-Order RElation Mining for Dense Captioning in 3D Scenes | Mar 10, 2022 | 3D dense captioning, Dense Captioning | Code Available | 1 |
| X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning | Mar 2, 2022 | 3D dense captioning, Dense Captioning | Code Available | 1 |
| Describing image focused in cognitive and visual details for visually impaired people: An approach to generating inclusive paragraphs | Feb 10, 2022 | Dense Captioning, Image Captioning | Unverified | 0 |
| 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds | Jan 1, 2022 | 3D dense captioning, Attribute | Unverified | 0 |
| D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding | Dec 2, 2021 | 3D dense captioning, 3D visual grounding | Unverified | 0 |
| Integrating Visuospatial, Linguistic, and Commonsense Structure into Story Visualization | Nov 1, 2021 | Dense Captioning, Image Generation | Code Available | 1 |