| PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | Apr 25, 2024 | Dense Captioning, MVBench | Code Available | 4 |
| 3D-LLM: Injecting the 3D World into Large Language Models | Jul 24, 2023 | 3D Object Captioning, 3D Question Answering (3D-QA) | Code Available | 3 |
| LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding Reasoning and Planning | Jan 1, 2024 | 3D dense captioning, Dense Captioning | Code Available | 3 |
| TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | Dec 4, 2023 | Dense Captioning, Highlight Detection | Code Available | 2 |
| 3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment | Aug 8, 2023 | 3D Question Answering (3D-QA), Dense Captioning | Code Available | 2 |
| LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning | Nov 30, 2023 | 3D dense captioning, Dense Captioning | Code Available | 2 |
| GRiT: A Generative Region-to-text Transformer for Object Understanding | Dec 1, 2022 | Decoder, Dense Captioning | Code Available | 2 |
| Grounded 3D-LLM with Referent Tokens | May 16, 2024 | Dense Captioning, Diversity | Code Available | 2 |
| ControlCap: Controllable Region-level Captioning | Jan 31, 2024 | Dense Captioning | Code Available | 2 |
| TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes | Mar 28, 2024 | 3D dense captioning, Dense Captioning | Code Available | 2 |