CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer Aug 12, 2024 Text-to-Video Generation Video Alignment
Code Code Available 11VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs Jun 11, 2024 Multiple-choice Question Answering
Code Code Available 5ShareGPT4Video: Improving Video Understanding and Generation with Better Captions Jun 6, 2024 Video Captioning Video Generation
Code Code Available 5Tarsier: Recipes for Training and Evaluating Large Video Description Models Jun 30, 2024 Video Captioning Video Description
Code Code Available 4Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers Feb 29, 2024 Retrieval Text Retrieval
Code Code Available 4mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video Feb 1, 2023 Action Classification Image Classification
Code Code Available 4Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding Feb 9, 2025 Image Captioning Image-text Retrieval
Code Code Available 3VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding Jun 13, 2024 Dense Video Captioning MVBench
Code Code Available 3MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding Apr 8, 2024 GPU Multiple-choice
Code Code Available 3GiT: Towards Generalist Vision Transformer through Universal Language Interface Mar 14, 2024 Language Modeling Language Modelling
Code Code Available 3Video ReCap: Recursive Captioning of Hour-Long Videos Feb 20, 2024 EgoSchema Video Captioning
Code Code Available 3CausalVLR: A Toolbox and Benchmark for Visual-Linguistic Causal Reasoning Jun 30, 2023 Causal Inference Medical Report Generation
Code Code Available 3Vision-Language Pre-training: Basics, Recent Advances, and Future Trends Oct 17, 2022 Few-Shot Learning Image Captioning
Code Code Available 3video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models Jun 18, 2025 Audio captioning Large Language Model
Code Code Available 2Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting Apr 7, 2025 Boundary Detection Object
Code Code Available 2LVD-2M: A Long-take Video Dataset with Temporally Dense Captions Oct 14, 2024 Video Captioning Video Generation
Code Code Available 2Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models Oct 4, 2024 Dense Video Captioning Sentence
Code Code Available 2SkyScript-100M: 1,000,000,000 Pairs of Scripts and Shooting Scripts for Short Drama Aug 18, 2024 Script Generation Video Captioning
Code Code Available 2Vript: A Video Is Worth Thousands of Words Jun 10, 2024 Video Captioning Video Understanding
Code Code Available 2VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding May 22, 2024 Dense Video Captioning Highlight Detection
Code Code Available 2Movie101v2: Improved Movie Narration Benchmark Apr 20, 2024 Video Captioning
Code Code Available 2TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning Apr 14, 2024 Dense Video Captioning Descriptive
Code Code Available 2Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval Apr 11, 2024 Decoder Dense Video Captioning
Code Code Available 2OmniVid: A Generative Framework for Universal Video Understanding Mar 26, 2024 Action Recognition Decoder
Code Code Available 2VTimeLLM: Empower LLM to Grasp Video Moments Nov 30, 2023 Dense Video Captioning Temporal Relation Extraction
Code Code Available 2