| AudioSetCaps: An Enriched Audio-Caption Dataset using Automated Generation Pipeline with Large Audio and Language Models | Nov 28, 2024 | Audio captioning, Audio to Text Retrieval | Code Available | 2 |
| Benchmarking Multimodal Models for Ukrainian Language Understanding Across Academic and Cultural Domains | Nov 22, 2024 | Benchmarking, Caption Generation | Unverified | 0 |
| Everything is a Video: Unifying Modalities through Next-Frame Prediction | Nov 15, 2024 | Caption Generation, Cross-Modal Retrieval | Unverified | 0 |
| Grounded Video Caption Generation | Nov 12, 2024 | Caption Generation, Image Captioning | Unverified | 0 |
| PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | Nov 4, 2024 | Caption Generation, Multiple-choice | Code Available | 2 |
| Croc: Pretraining Large Multimodal Models with Cross-Modal Comprehension | Oct 18, 2024 | Caption Generation | Code Available | 1 |
| MotionBank: A Large-scale Video Motion Benchmark with Disentangled Rule-based Annotations | Oct 17, 2024 | Caption Generation, Motion Generation | Code Available | 1 |
| SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs | Oct 12, 2024 | AudioCaps, Audio captioning | Code Available | 0 |
| GEM-VPC: A dual Graph-Enhanced Multimodal integration for Video Paragraph Captioning | Oct 12, 2024 | Caption Generation, Decoder | Unverified | 0 |
| Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training | Oct 9, 2024 | Caption Generation, Contrastive Learning | Code Available | 2 |