CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer | Aug 12, 2024 | Text-to-Video Generation, Video Alignment | Code available
11 | VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | Jun 11, 2024 | Multiple-choice, Question Answering | Code available
5 | ShareGPT4Video: Improving Video Understanding and Generation with Better Captions | Jun 6, 2024 | Video Captioning, Video Generation | Code available
5 | Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers | Feb 29, 2024 | Retrieval, Text Retrieval | Code available
4 | Tarsier: Recipes for Training and Evaluating Large Video Description Models | Jun 30, 2024 | Video Captioning, Video Description | Code available
4 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | Feb 1, 2023 | Action Classification, Image Classification | Code available
4 | Video ReCap: Recursive Captioning of Hour-Long Videos | Feb 20, 2024 | EgoSchema, Video Captioning | Code available
3 | Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding | Feb 9, 2025 | Image Captioning, Image-text Retrieval | Code available
3 | VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | Jun 13, 2024 | Dense Video Captioning, MVBench | Code available
3 | Vision-Language Pre-training: Basics, Recent Advances, and Future Trends | Oct 17, 2022 | Few-Shot Learning, Image Captioning | Code available
3 | MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | Apr 8, 2024 | GPU, Multiple-choice | Code available
3 | CausalVLR: A Toolbox and Benchmark for Visual-Linguistic Causal Reasoning | Jun 30, 2023 | Causal Inference, Medical Report Generation | Code available
3 | GiT: Towards Generalist Vision Transformer through Universal Language Interface | Mar 14, 2024 | Language Modeling, Language Modelling | Code available
3 | SoccerNet-Caption: Dense Video Captioning for Soccer Broadcasts Commentaries | Apr 10, 2023 | Dense Video Captioning, Video Captioning | Code available
2 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | May 29, 2023 | Audio captioning, Audio-Visual Captioning | Code available
2 | Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval | Apr 11, 2024 | Decoder, Dense Video Captioning | Code available
2 | Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? | Dec 31, 2022 | Data Augmentation, Retrieval | Code available
2 | Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | Jun 7, 2023 | Cross-Modal Retrieval, Language Modelling | Code available
2 | VTimeLLM: Empower LLM to Grasp Video Moments | Nov 30, 2023 | Dense Video Captioning, Temporal Relation Extraction | Code available
2 | Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models | Oct 4, 2024 | Dense Video Captioning, Sentence | Code available
2 | video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models | Jun 18, 2025 | Audio captioning, Large Language Model | Code available
2 | Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions | Apr 9, 2023 | Video Captioning | Code available
2 | GIT: A Generative Image-to-text Transformer for Vision and Language | May 27, 2022 | Decoder, Image Captioning | Code available
2 | Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning | Feb 27, 2023 | Dense Video Captioning, Language Modeling | Code available
2 | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | Apr 17, 2023 | Audio captioning, Audio-Video Question Answering (AVQA) | Code available
2 | VidChapters-7M: Video Chapters at Scale | Sep 25, 2023 | Dense Video Captioning, Navigate | Code available
2 | Vript: A Video Is Worth Thousands of Words | Jun 10, 2024 | Video Captioning, Video Understanding | Code available
2 | SkyScript-100M: 1,000,000,000 Pairs of Scripts and Shooting Scripts for Short Drama | Aug 18, 2024 | Script Generation, Video Captioning | Code available
2 | Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs | Jun 9, 2022 | Image Captioning, Image Classification | Code available
2 | OmniVid: A Generative Framework for Universal Video Understanding | Mar 26, 2024 | Action Recognition, Decoder | Code available
2 | Movie101v2: Improved Movie Narration Benchmark | Apr 20, 2024 | Video Captioning | Code available
2 | Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting | Apr 7, 2025 | Boundary Detection, Object | Code available
2 | LVD-2M: A Long-take Video Dataset with Temporally Dense Captions | Oct 14, 2024 | Video Captioning, Video Generation | Code available
2 | TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning | Apr 14, 2024 | Dense Video Captioning, Descriptive | Code available
2 | VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding | May 22, 2024 | Dense Video Captioning, Highlight Detection | Code available
2 | Large Scale Holistic Video Understanding | Apr 25, 2019 | Action Classification, Action Recognition | Code available
1 | A Reinforcement Learning Based Encoder-Decoder Framework for Learning Stock Trading Rules | Jan 8, 2021 | Decoder, Deep Reinforcement Learning | Code available
1 | HowToCaption: Prompting LLMs to Transform Video Annotations at Scale | Oct 7, 2023 | Automatic Speech Recognition, Video Captioning | Code available
1 | COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark | Aug 5, 2024 | Dense Video Captioning, Diversity | Code available
1 | The MSR-Video to Text Dataset with Clean Annotations | Feb 12, 2021 | Sentence, Video Captioning | Code available
1 | Hierarchical Video-Moment Retrieval and Step-Captioning | Mar 29, 2023 | Information Retrieval, Moment Retrieval | Code available
1 | IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning | Sep 26, 2024 | Image Captioning, Retrieval | Code available
1 | HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training | May 1, 2020 | Language Modeling, Language Modelling | Code available
1 | G-VEval: A Versatile Metric for Evaluating Image and Video Captions Using GPT-4o | Dec 18, 2024 | Image Captioning, Video Captioning | Code available
1 | HiCM^2: Hierarchical Compact Memory Modeling for Dense Video Captioning | Dec 19, 2024 | Dense Video Captioning, Video Captioning | Code available
1 | A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer | May 17, 2020 | Dense Video Captioning, Temporal Action Proposal Generation | Code available
1 | GOAL: A Challenging Knowledge-grounded Video Captioning Benchmark for Real-time Soccer Commentary Generation | Mar 26, 2023 | Video Captioning | Code available
1 | A Comprehensive Review of the Video-to-Text Problem | Mar 27, 2021 | Question Answering, Retrieval | Code available
1 | An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | Sep 4, 2022 | Fill Mask, Optical Flow Estimation | Code available
1 | Hierarchical Modular Network for Video Captioning | Nov 24, 2021 | Representation Learning, Sentence | Code available
1