| Enhancing Chest X-ray Classification through Knowledge Injection in Cross-Modality Learning | Feb 19, 2025 | Caption GenerationClassification | —Unverified | 0 |
| FE-LWS: Refined Image-Text Representations via Decoder Stacking and Fused Encodings for Remote Sensing Image Captioning | Feb 13, 2025 | Caption GenerationDecoder | —Unverified | 0 |
| Expertized Caption Auto-Enhancement for Video-Text Retrieval | Feb 5, 2025 | Caption GenerationRetrieval | CodeCode Available | 0 |
| Do Large Multimodal Models Solve Caption Generation for Scientific Figures? Lessons Learned from SCICAP Challenge 2023 | Jan 31, 2025 | ArticlesCaption Generation | —Unverified | 0 |
| LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models | Jan 31, 2025 | Caption GenerationLanguage Modeling | CodeCode Available | 4 |
| MAMS: Model-Agnostic Module Selection Framework for Video Captioning | Jan 30, 2025 | Caption GenerationVideo Captioning | —Unverified | 0 |
| Measuring and Mitigating Hallucinations in Vision-Language Dataset Generation for Remote Sensing | Jan 24, 2025 | Caption GenerationDataset Generation | —Unverified | 0 |
| Understanding How Paper Writers Use AI-Generated Captions in Figure Caption Writing | Jan 10, 2025 | Caption Generation | —Unverified | 0 |
| Multi-LLM Collaborative Caption Generation in Scientific Documents | Jan 5, 2025 | Caption GenerationImage to text | CodeCode Available | 0 |
| Time Series Language Model for Descriptive Caption Generation | Jan 3, 2025 | Caption GenerationDenoising | —Unverified | 0 |
| Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning | Dec 31, 2024 | Caption GenerationDecoder | —Unverified | 0 |
| Multimodal Preference Data Synthetic Alignment with Reward Model | Dec 23, 2024 | 2kCaption Generation | CodeCode Available | 0 |
| Learning from Massive Human Videos for Universal Humanoid Pose Control | Dec 18, 2024 | Caption GenerationHumanoid Control | —Unverified | 0 |
| From Simple to Professional: A Combinatorial Controllable Image Captioning Agent | Dec 15, 2024 | Caption Generationcontrollable image captioning | CodeCode Available | 0 |
| DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding | Dec 2, 2024 | Caption GenerationDomain Generalization | —Unverified | 0 |
| AudioSetCaps: An Enriched Audio-Caption Dataset using Automated Generation Pipeline with Large Audio and Language Models | Nov 28, 2024 | Audio captioningAudio to Text Retrieval | CodeCode Available | 2 |
| Benchmarking Multimodal Models for Ukrainian Language Understanding Across Academic and Cultural Domains | Nov 22, 2024 | BenchmarkingCaption Generation | —Unverified | 0 |
| Everything is a Video: Unifying Modalities through Next-Frame Prediction | Nov 15, 2024 | Caption GenerationCross-Modal Retrieval | —Unverified | 0 |
| Grounded Video Caption Generation | Nov 12, 2024 | Caption GenerationImage Captioning | —Unverified | 0 |
| PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | Nov 4, 2024 | Caption GenerationMultiple-choice | CodeCode Available | 2 |
| Croc: Pretraining Large Multimodal Models with Cross-Modal Comprehension | Oct 18, 2024 | Caption Generation | CodeCode Available | 1 |
| MotionBank: A Large-scale Video Motion Benchmark with Disentangled Rule-based Annotations | Oct 17, 2024 | Caption GenerationMotion Generation | CodeCode Available | 1 |
| SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs | Oct 12, 2024 | AudioCapsAudio captioning | CodeCode Available | 0 |
| GEM-VPC: A dual Graph-Enhanced Multimodal integration for Video Paragraph Captioning | Oct 12, 2024 | Caption GenerationDecoder | —Unverified | 0 |
| Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training | Oct 9, 2024 | Caption GenerationContrastive Learning | CodeCode Available | 2 |