| Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning | Aug 22, 2023 | Caption Generation, Large Language Model | Code Available | 2 |
| Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions | Aug 8, 2023 | Caption Generation, Image Captioning | Code Available | 2 |
| Fine-grained Image Captioning with CLIP Reward | May 26, 2022 | Caption Generation, Descriptive | Code Available | 2 |
| VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation | May 29, 2025 | Caption Generation, Language Modeling | Code Available | 1 |
| LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts | May 20, 2025 | Caption Generation, Retrieval | Code Available | 1 |
| VideoMultiAgents: A Multi-Agent Framework for Video Question Answering | Apr 25, 2025 | Caption Generation, EgoSchema | Code Available | 1 |
| Will Pre-Training Ever End? A First Step Toward Next-Generation Foundation MLLMs via Self-Improving Systematic Cognition | Mar 16, 2025 | Caption Generation, Image Captioning | Code Available | 1 |
| Large-scale Pre-training for Grounded Video Caption Generation | Mar 13, 2025 | Caption Generation | Code Available | 1 |
| Croc: Pretraining Large Multimodal Models with Cross-Modal Comprehension | Oct 18, 2024 | Caption Generation | Code Available | 1 |
| MotionBank: A Large-scale Video Motion Benchmark with Disentangled Rule-based Annotations | Oct 17, 2024 | Caption Generation, Motion Generation | Code Available | 1 |