| GNN-ViTCap: GNN-Enhanced Multiple Instance Learning with Vision Transformers for Whole Slide Image Classification and Captioning | Jul 9, 2025 | Caption Generation, Clustering | Unverified | 0 |
| DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World | Jun 30, 2025 | Caption Generation, Object | Code Available | 2 |
| SonicVerse: Multi-Task Learning for Music Feature-Informed Captioning | Jun 18, 2025 | Caption Generation, Descriptive | Code Available | 2 |
| EditInspector: A Benchmark for Evaluation of Text-Guided Image Edits | Jun 11, 2025 | Artifact Detection, Caption Generation | Unverified | 0 |
| Attention-based transformer models for image captioning across languages: An in-depth survey and evaluation | Jun 3, 2025 | Caption Generation, Image Captioning | Unverified | 0 |
| FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion | Jun 1, 2025 | Audio captioning, Caption Generation | Code Available | 2 |
| VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation | May 29, 2025 | Caption Generation, Language Modeling | Code Available | 1 |
| NEXT: Multi-Grained Mixture of Experts via Text-Modulation for Multi-Modal Object Re-ID | May 26, 2025 | Attribute, Caption Generation | Unverified | 0 |
| GC-KBVQA: A New Four-Stage Framework for Enhancing Knowledge Based Visual Question Answering Performance | May 25, 2025 | Caption Generation, Question Answering | Unverified | 0 |
| Temporal Object Captioning for Street Scene Videos from LiDAR Tracks | May 22, 2025 | Caption Generation, Video Captioning | Unverified | 0 |
| Vision-Language Modeling Meets Remote Sensing: Models, Datasets and Perspectives | May 20, 2025 | Caption Generation, Contrastive Learning | Unverified | 0 |
| LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts | May 20, 2025 | Caption Generation, Retrieval | Code Available | 1 |
| VideoMultiAgents: A Multi-Agent Framework for Video Question Answering | Apr 25, 2025 | Caption Generation, EgoSchema | Code Available | 1 |
| TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation | Apr 24, 2025 | Caption Generation, Dense Video Captioning | Unverified | 0 |
| Low-hallucination Synthetic Captions for Large-Scale Vision-Language Model Pre-training | Apr 17, 2025 | Caption Generation, Hallucination | Unverified | 0 |
| 3D CoCa: Contrastive Learners are 3D Captioners | Apr 13, 2025 | 3D dense captioning, Caption Generation | Code Available | 0 |
| Group-based Distinctive Image Captioning with Memory Difference Encoding and Attention | Apr 3, 2025 | Caption Generation, Contrastive Learning | Unverified | 0 |
| Identifying Multi-modal Knowledge Neurons in Pretrained Transformers via Two-stage Filtering | Mar 29, 2025 | Caption Generation, knowledge editing | Unverified | 0 |
| LaPIG: Cross-Modal Generation of Paired Thermal and Visible Facial Images | Mar 20, 2025 | Caption Generation, Diversity | Unverified | 0 |
| Will Pre-Training Ever End? A First Step Toward Next-Generation Foundation MLLMs via Self-Improving Systematic Cognition | Mar 16, 2025 | Caption Generation, Image Captioning | Code Available | 1 |
| Large-scale Pre-training for Grounded Video Caption Generation | Mar 13, 2025 | Caption Generation | Code Available | 1 |
| IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification | Mar 13, 2025 | Caption Generation | Unverified | 0 |
| Integrating Frequency-Domain Representations with Low-Rank Adaptation in Vision-Language Models | Mar 8, 2025 | Caption Generation, Question Answering | Unverified | 0 |
| Fine-Grained Video Captioning through Scene Graph Consolidation | Feb 23, 2025 | Caption Generation, Image Captioning | Unverified | 0 |
| LongCaptioning: Unlocking the Power of Long Caption Generation in Large Multimodal Models | Feb 21, 2025 | Caption Generation, Video Captioning | Unverified | 0 |