| GNN-ViTCap: GNN-Enhanced Multiple Instance Learning with Vision Transformers for Whole Slide Image Classification and Captioning | Jul 9, 2025 | Caption Generation, Clustering | Unverified | 0 |
| EditInspector: A Benchmark for Evaluation of Text-Guided Image Edits | Jun 11, 2025 | Artifact Detection, Caption Generation | Unverified | 0 |
| Attention-based transformer models for image captioning across languages: An in-depth survey and evaluation | Jun 3, 2025 | Caption Generation, Image Captioning | Unverified | 0 |
| NEXT: Multi-Grained Mixture of Experts via Text-Modulation for Multi-Modal Object Re-ID | May 26, 2025 | Attribute, Caption Generation | Unverified | 0 |
| GC-KBVQA: A New Four-Stage Framework for Enhancing Knowledge Based Visual Question Answering Performance | May 25, 2025 | Caption Generation, Question Answering | Unverified | 0 |
| Temporal Object Captioning for Street Scene Videos from LiDAR Tracks | May 22, 2025 | Caption Generation, Video Captioning | Unverified | 0 |
| Vision-Language Modeling Meets Remote Sensing: Models, Datasets and Perspectives | May 20, 2025 | Caption Generation, Contrastive Learning | Unverified | 0 |
| TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation | Apr 24, 2025 | Caption Generation, Dense Video Captioning | Unverified | 0 |
| Low-hallucination Synthetic Captions for Large-Scale Vision-Language Model Pre-training | Apr 17, 2025 | Caption Generation, Hallucination | Unverified | 0 |
| 3D CoCa: Contrastive Learners are 3D Captioners | Apr 13, 2025 | 3D dense captioning, Caption Generation | Code Available | 0 |