CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer | Aug 12, 2024 | Text-to-Video Generation Video Alignment | Code Available: 11
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | Jun 11, 2024 | Multiple-choice Question Answering | Code Available: 5
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions | Jun 6, 2024 | Video Captioning Video Generation | Code Available: 5
Tarsier: Recipes for Training and Evaluating Large Video Description Models | Jun 30, 2024 | Video Captioning Video Description | Code Available: 4
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers | Feb 29, 2024 | Retrieval Text Retrieval | Code Available: 4
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | Feb 1, 2023 | Action Classification Image Classification | Code Available: 4
Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding | Feb 9, 2025 | Image Captioning Image-text Retrieval | Code Available: 3
VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | Jun 13, 2024 | Dense Video Captioning MVBench | Code Available: 3
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | Apr 8, 2024 | GPU Multiple-choice | Code Available: 3
GiT: Towards Generalist Vision Transformer through Universal Language Interface | Mar 14, 2024 | Language Modeling Language Modelling | Code Available: 3
Video ReCap: Recursive Captioning of Hour-Long Videos | Feb 20, 2024 | EgoSchema Video Captioning | Code Available: 3
CausalVLR: A Toolbox and Benchmark for Visual-Linguistic Causal Reasoning | Jun 30, 2023 | Causal Inference Medical Report Generation | Code Available: 3
Vision-Language Pre-training: Basics, Recent Advances, and Future Trends | Oct 17, 2022 | Few-Shot Learning Image Captioning | Code Available: 3
video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models | Jun 18, 2025 | Audio captioning Large Language Model | Code Available: 2
Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting | Apr 7, 2025 | Boundary Detection Object | Code Available: 2
LVD-2M: A Long-take Video Dataset with Temporally Dense Captions | Oct 14, 2024 | Video Captioning Video Generation | Code Available: 2
Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models | Oct 4, 2024 | Dense Video Captioning Sentence | Code Available: 2
SkyScript-100M: 1,000,000,000 Pairs of Scripts and Shooting Scripts for Short Drama | Aug 18, 2024 | Script Generation Video Captioning | Code Available: 2
Vript: A Video Is Worth Thousands of Words | Jun 10, 2024 | Video Captioning Video Understanding | Code Available: 2
VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding | May 22, 2024 | Dense Video Captioning Highlight Detection | Code Available: 2
Movie101v2: Improved Movie Narration Benchmark | Apr 20, 2024 | Video Captioning | Code Available: 2
TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning | Apr 14, 2024 | Dense Video Captioning Descriptive | Code Available: 2
Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval | Apr 11, 2024 | Decoder Dense Video Captioning | Code Available: 2
OmniVid: A Generative Framework for Universal Video Understanding | Mar 26, 2024 | Action Recognition Decoder | Code Available: 2
VTimeLLM: Empower LLM to Grasp Video Moments | Nov 30, 2023 | Dense Video Captioning Temporal Relation Extraction | Code Available: 2
VidChapters-7M: Video Chapters at Scale | Sep 25, 2023 | Dense Video Captioning Navigate | Code Available: 2
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | Jun 7, 2023 | Cross-Modal Retrieval Language Modelling | Code Available: 2
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | May 29, 2023 | Audio captioning Audio-Visual Captioning | Code Available: 2
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | Apr 17, 2023 | Audio captioning Audio-Video Question Answering (AVQA) | Code Available: 2
SoccerNet-Caption: Dense Video Captioning for Soccer Broadcasts Commentaries | Apr 10, 2023 | Dense Video Captioning Video Captioning | Code Available: 2
Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions | Apr 9, 2023 | Video Captioning | Code Available: 2
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning | Feb 27, 2023 | Dense Video Captioning Language Modeling | Code Available: 2
Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? | Dec 31, 2022 | Data Augmentation Retrieval | Code Available: 2
Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs | Jun 9, 2022 | Image Captioning Image Classification | Code Available: 2
GIT: A Generative Image-to-text Transformer for Vision and Language | May 27, 2022 | Decoder Image Captioning | Code Available: 2
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks | Jul 15, 2025 | Video Captioning Video Understanding | Code Available: 1
VidCapBench: A Comprehensive Benchmark of Video Captioning for Controllable Text-to-Video Generation | Feb 18, 2025 | Text-to-Video Generation Video Captioning | Code Available: 1
VidChain: Chain-of-Tasks with Metric-based Direct Preference Optimization for Dense Video Captioning | Jan 12, 2025 | Dense Video Captioning Video Captioning | Code Available: 1
HiCM^2: Hierarchical Compact Memory Modeling for Dense Video Captioning | Dec 19, 2024 | Dense Video Captioning Video Captioning | Code Available: 1
G-VEval: A Versatile Metric for Evaluating Image and Video Captions Using GPT-4o | Dec 18, 2024 | Image Captioning Video Captioning | Code Available: 1
VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format | Nov 27, 2024 | Dense Video Captioning Grounded Video Question Answering | Code Available: 1
IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning | Sep 26, 2024 | Image Captioning Retrieval | Code Available: 1
COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark | Aug 5, 2024 | Dense Video Captioning Diversity | Code Available: 1
Learning Video Context as Interleaved Multimodal Sequences | Jul 31, 2024 | Language Modeling Language Modelling | Code Available: 1
AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding | Jun 19, 2024 | Question Answering Spatial Reasoning | Code Available: 1
Shotluck Holmes: A Family of Efficient Small-Scale Large Language Vision Models For Video Captioning and Summarization | May 31, 2024 | Sentence Video Captioning | Code Available: 1
Narrative Action Evaluation with Prompt-Guided Multimodal Interaction | Apr 22, 2024 | Action Quality Assessment multimodal interaction | Code Available: 1
Enhancing Traffic Safety with Parallel Dense Video Captioning for End-to-End Event Analysis | Apr 12, 2024 | Dense Video Captioning Transfer Learning | Code Available: 1
LVCHAT: Facilitating Long Video Comprehension | Feb 19, 2024 | Video Captioning | Code Available: 1
Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data | Jan 16, 2024 | Image Generation Text to Image Generation | Code Available: 1