| LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models | Jan 31, 2025 | Caption Generation, Language Modeling | Code Available | 4 | 5 |
| PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | Nov 4, 2024 | Caption Generation, Multiple-choice | Code Available | 2 | 5 |
| Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions | Aug 8, 2023 | Caption Generation, Image Captioning | Code Available | 2 | 5 |
| SynthVLM: High-Efficiency and High-Quality Synthetic Data for Vision Language Models | Jul 30, 2024 | Caption Generation, Question Answering | Code Available | 2 | 5 |
| MeaCap: Memory-Augmented Zero-shot Image Captioning | Mar 6, 2024 | Caption Generation, Image Captioning | Code Available | 2 | 5 |
| Fine-grained Image Captioning with CLIP Reward | May 26, 2022 | Caption Generation, Descriptive | Code Available | 2 | 5 |
| Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training | Oct 9, 2024 | Caption Generation, Contrastive Learning | Code Available | 2 | 5 |
| Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning | Aug 22, 2023 | Caption Generation, Large Language Model | Code Available | 2 | 5 |
| DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World | Jun 30, 2025 | Caption Generation, Object | Code Available | 2 | 5 |
| SonicVerse: Multi-Task Learning for Music Feature-Informed Captioning | Jun 18, 2025 | Caption Generation, Descriptive | Code Available | 2 | 5 |
| FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion | Jun 1, 2025 | Audio captioning, Caption Generation | Code Available | 2 | 5 |
| Segment and Caption Anything | Dec 1, 2023 | Caption Generation, object-detection | Code Available | 2 | 5 |
| AudioSetCaps: An Enriched Audio-Caption Dataset using Automated Generation Pipeline with Large Audio and Language Models | Nov 28, 2024 | Audio captioning, Audio to Text Retrieval | Code Available | 2 | 5 |
| Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data | Oct 2, 2024 | Audio Classification, Caption Generation | Code Available | 1 | 5 |
| Frame- and Segment-Level Features and Candidate Pool Evaluation for Video Caption Generation | Aug 17, 2016 | Caption Generation, Decoder | Code Available | 1 | 5 |
| Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks | Oct 30, 2017 | 3D Action Recognition, Action Recognition | Code Available | 1 | 5 |
| SoccerNet-Echoes: A Soccer Game Audio Commentary Dataset | May 12, 2024 | Action Spotting, Automatic Speech Recognition | Code Available | 1 | 5 |
| Show, Attend and Tell: Neural Image Caption Generation with Visual Attention | Feb 10, 2015 | Caption Generation, Image Captioning | Code Available | 1 | 5 |
| Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds | Apr 22, 2022 | 3D dense captioning, 3D Object Detection | Code Available | 1 | 5 |
| TAP: Text-Aware Pre-training for Text-VQA and Text-Caption | Dec 8, 2020 | Caption Generation, Language Modeling | Code Available | 1 | 5 |
| Controllable Video Captioning with an Exemplar Sentence | Dec 2, 2021 | Caption Generation, Decoder | Code Available | 1 | 5 |
| NeuSyRE: Neuro-Symbolic Visual Understanding and Reasoning Framework based on Scene Graph Enrichment | Nov 5, 2023 | Caption Generation, Common Sense Reasoning | Code Available | 1 | 5 |
| BCAmirs at SemEval-2024 Task 4: Beyond Words: A Multimodal and Multilingual Exploration of Persuasion in Memes | Apr 3, 2024 | Caption Generation, Hierarchical Multi-label Classification | Code Available | 1 | 5 |
| Self-supervised Cross-view Representation Reconstruction for Change Captioning | Sep 28, 2023 | Caption Generation, Hallucination | Code Available | 1 | 5 |
| RECAP: Retrieval-Augmented Audio Captioning | Sep 18, 2023 | AudioCaps, Audio captioning | Code Available | 1 | 5 |
| Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation | Jan 2, 2023 | Caption Generation, Instance Segmentation | Code Available | 1 | 5 |
| MotionBank: A Large-scale Video Motion Benchmark with Disentangled Rule-based Annotations | Oct 17, 2024 | Caption Generation, Motion Generation | Code Available | 1 | 5 |
| EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning | Oct 14, 2022 | Caption Generation, Knowledge Distillation | Code Available | 1 | 5 |
| GL-RG: Global-Local Representation Granularity for Video Captioning | May 22, 2022 | Caption Generation, Descriptive | Code Available | 1 | 5 |
| SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning | Nov 25, 2021 | Caption Generation, Question Answering | Code Available | 1 | 5 |
| End-to-End Dense Video Captioning with Parallel Decoding | Aug 17, 2021 | Caption Generation, Dense Video Captioning | Code Available | 1 | 5 |
| Rethinking Surgical Captioning: End-to-End Window-Based MLP Transformer Using Patches | Jun 30, 2022 | Caption Generation, Video Captioning | Code Available | 1 | 5 |
| Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning | Jul 16, 2024 | Caption Generation, cross-modal alignment | Code Available | 1 | 5 |
| Injecting Semantic Concepts into End-to-End Image Captioning | Dec 9, 2021 | Caption Generation, Image Captioning | Code Available | 1 | 5 |
| Improving Image Captioning with Better Use of Captions | Jun 21, 2020 | Caption Generation, Image Captioning | Code Available | 1 | 5 |
| Large-scale Pre-training for Grounded Video Caption Generation | Mar 13, 2025 | Caption Generation | Code Available | 1 | 5 |
| LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts | May 20, 2025 | Caption Generation, Retrieval | Code Available | 1 | 5 |
| HCQA @ Ego4D EgoSchema Challenge 2024 | Jun 22, 2024 | Caption Generation | Code Available | 1 | 5 |
| Deep Reinforcement Learning For Sequence to Sequence Models | May 24, 2018 | Abstractive Text Summarization, Caption Generation | Code Available | 1 | 5 |
| Human-like Controllable Image Captioning with Verb-specific Semantic Roles | Mar 22, 2021 | Caption Generation, controllable image captioning | Code Available | 1 | 5 |
| Microsoft COCO Captions: Data Collection and Evaluation Server | Apr 1, 2015 | Caption Generation | Code Available | 1 | 5 |
| Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts | Feb 17, 2021 | Caption Generation, Diversity | Code Available | 1 | 5 |
| Connecting What to Say With Where to Look by Modeling Human Attention Traces | May 12, 2021 | Caption Generation, Image Captioning | Code Available | 1 | 5 |
| MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response | Sep 15, 2023 | Caption Generation, Language Modelling | Code Available | 1 | 5 |
| COSMic: A Coherence-Aware Generation Metric for Image Descriptions | Sep 11, 2021 | Caption Generation, Image Captioning | Code Available | 1 | 5 |
| Croc: Pretraining Large Multimodal Models with Cross-Modal Comprehension | Oct 18, 2024 | Caption Generation | Code Available | 1 | 5 |
| Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network | Dec 13, 2020 | Caption Generation, Decoder | Code Available | 1 | 5 |
| Say As You Wish: Fine-grained Control of Image Caption Generation with Abstract Scene Graphs | Mar 1, 2020 | Attribute, Caption Generation | Code Available | 1 | 5 |
| Belief Revision based Caption Re-ranker with Visual Semantic Information | Sep 16, 2022 | Caption Generation, Image Captioning | Code Available | 1 | 5 |
| Team RUC_AIM3 Technical Report at ActivityNet 2021: Entities Object Localization | Jun 11, 2021 | Caption Generation, Object | Code Available | 1 | 5 |