| Large-scale Pre-training for Grounded Video Caption Generation | Mar 13, 2025 | Caption Generation | CodeCode Available | 1 |
| LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts | May 20, 2025 | Caption GenerationRetrieval | CodeCode Available | 1 |
| Improving Image Captioning with Better Use of Captions | Jun 21, 2020 | Caption GenerationImage Captioning | CodeCode Available | 1 |
| SoccerNet-Echoes: A Soccer Game Audio Commentary Dataset | May 12, 2024 | Action SpottingAutomatic Speech Recognition | CodeCode Available | 1 |
| Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks | Oct 30, 2017 | 3D Action RecognitionAction Recognition | CodeCode Available | 1 |
| HCQA @ Ego4D EgoSchema Challenge 2024 | Jun 22, 2024 | Caption Generation | CodeCode Available | 1 |
| Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning | Jul 16, 2024 | Caption Generationcross-modal alignment | CodeCode Available | 1 |
| Injecting Semantic Concepts into End-to-End Image Captioning | Dec 9, 2021 | Caption GenerationImage Captioning | CodeCode Available | 1 |
| MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response | Sep 15, 2023 | Caption GenerationLanguage Modelling | CodeCode Available | 1 |
| Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data | Oct 2, 2024 | Audio ClassificationCaption Generation | CodeCode Available | 1 |
| Denoising Large-Scale Image Captioning from Alt-text Data using Content Selection Models | Sep 17, 2021 | Caption GenerationDenoising | —Unverified | 0 |
| Deep Verifier Networks: Verification of Deep Discriminative Models with Deep Generative Models | Nov 18, 2019 | Anomaly DetectionAutonomous Driving | —Unverified | 0 |
| End-to-End Video Captioning | Apr 4, 2019 | Action RecognitionCaption Generation | —Unverified | 0 |
| Deep Learning Approaches on Image Captioning: A Review | Jan 31, 2022 | Caption GenerationDeep Learning | —Unverified | 0 |
| VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools | Oct 16, 2023 | Caption GenerationDescriptive | —Unverified | 0 |
| Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and Visual Question Generation | Jan 18, 2024 | Caption GenerationLanguage Modeling | —Unverified | 0 |
| Bi-directional Contextual Attention for 3D Dense Captioning | Aug 13, 2024 | 3D dense captioningAttribute | —Unverified | 0 |
| Deep Bayesian Natural Language Processing | Jul 1, 2019 | Caption GenerationClustering | —Unverified | 0 |
| An encoder-decoder based framework for hindi image caption generation | Jul 9, 2021 | Caption GenerationDecoder | —Unverified | 0 |
| DECap: Towards Generalized Explicit Caption Editing via Diffusion Mechanism | Nov 25, 2023 | Caption GenerationDenoising | —Unverified | 0 |
| D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding | Dec 2, 2021 | 3D dense captioning3D visual grounding | —Unverified | 0 |
| BEV-TSR: Text-Scene Retrieval in BEV Space for Autonomous Driving | Jan 2, 2024 | Autonomous DrivingCaption Generation | —Unverified | 0 |
| GNNFormer: A Graph-based Framework for Cytopathology Report Generation | Mar 17, 2023 | Caption GenerationGraph Neural Network | —Unverified | 0 |
| Cross-modal Coherence Modeling for Caption Generation | Jul 1, 2020 | Caption Generationcontrollable image captioning | —Unverified | 0 |
| Cross-Lingual Image Caption Generation | Aug 1, 2016 | Caption GenerationDependency Parsing | —Unverified | 0 |
| CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving | Aug 19, 2024 | Autonomous DrivingCaption Generation | —Unverified | 0 |
| Benchmarking Multimodal Models for Ukrainian Language Understanding Across Academic and Cultural Domains | Nov 22, 2024 | BenchmarkingCaption Generation | —Unverified | 0 |
| A Deep Neural Framework for Image Caption Generation Using GRU-Based Attention Mechanism | Mar 3, 2022 | Caption GenerationDecoder | —Unverified | 0 |
| Explainable Image Captioning using CNN- CNN architecture and Hierarchical Attention | Jun 28, 2024 | Caption GenerationDecoder | —Unverified | 0 |
| Analysis of Convolutional Decoder for Image Caption Generation | Mar 8, 2021 | Caption GenerationData Augmentation | —Unverified | 0 |
| Controlled Caption Generation for Images Through Adversarial Attacks | Jul 7, 2021 | Caption GenerationImage Captioning | —Unverified | 0 |
| Evaluation of Automatic Video Captioning Using Direct Assessment | Oct 29, 2017 | Caption GenerationMachine Translation | —Unverified | 0 |
| 3G structure for image caption generation | Apr 21, 2019 | Caption GenerationSentence | —Unverified | 0 |
| Geo-Aware Image Caption Generation | Dec 1, 2020 | Caption GenerationImage Captioning | —Unverified | 0 |
| Geometry-Entangled Visual Semantic Transformer for Image Captioning | Sep 29, 2021 | Caption GenerationImage Captioning | —Unverified | 0 |
| GNN-ViTCap: GNN-Enhanced Multiple Instance Learning with Vision Transformers for Whole Slide Image Classification and Captioning | Jul 9, 2025 | Caption GenerationClustering | —Unverified | 0 |
| GEM-VPC: A dual Graph-Enhanced Multimodal integration for Video Paragraph Captioning | Oct 12, 2024 | Caption GenerationDecoder | —Unverified | 0 |
| Entity-aware Image Caption Generation | Apr 21, 2018 | Caption GenerationImage Captioning | —Unverified | 0 |
| Enhancing Image Captioning with Neural Models | Dec 1, 2023 | Caption GenerationImage Captioning | —Unverified | 0 |
| Generating captions without looking beyond objects | Oct 12, 2016 | Caption GenerationImage Captioning | —Unverified | 0 |
| Enhancing Image Caption Generation Using Reinforcement Learning with Human Feedback | Mar 11, 2024 | Caption Generationreinforcement-learning | —Unverified | 0 |
| A Comparative Study of Pre-trained CNNs and GRU-Based Attention for Image Caption Generation | Oct 11, 2023 | Caption GenerationDecoder | —Unverified | 0 |
| Error Causal inference for Multi-Fusion models | Jun 1, 2021 | Caption GenerationCausal Inference | —Unverified | 0 |
| GC-KBVQA: A New Four-Stage Framework for Enhancing Knowledge Based Visual Question Answering Performance | May 25, 2025 | Caption GenerationQuestion Answering | —Unverified | 0 |
| Enhancing Chest X-ray Classification through Knowledge Injection in Cross-Modality Learning | Feb 19, 2025 | Caption GenerationClassification | —Unverified | 0 |
| Everything is a Video: Unifying Modalities through Next-Frame Prediction | Nov 15, 2024 | Caption GenerationCross-Modal Retrieval | —Unverified | 0 |
| Examining the Effects of Language-and-Vision Data Augmentation for Generation of Descriptions of Human Faces | Jun 1, 2022 | Caption GenerationData Augmentation | —Unverified | 0 |
| Cortico-cerebellar networks as decoupled neural interfaces | Jan 1, 2021 | Caption Generation | —Unverified | 0 |
| End to End Recognition System for Recognizing Offline Unconstrained Vietnamese Handwriting | May 14, 2019 | Caption GenerationDecoder | —Unverified | 0 |
| Fusion Models for Improved Visual Captioning | Oct 28, 2020 | Automatic Speech RecognitionAutomatic Speech Recognition (ASR) | —Unverified | 0 |