InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | Mar 22, 2024 | Action Classification, Action Recognition | Code available (7)
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | Feb 1, 2023 | Action Classification, Image Classification | Code available (4)
InternVideo: General Video Foundation Models via Generative and Discriminative Learning | Dec 6, 2022 | Action Classification, Action Recognition | Code available (4)
VideoRoPE: What Makes for Good Video Rotary Position Embedding? | Feb 7, 2025 | Hallucination, Position | Code available (3)
Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension | Nov 20, 2024 | GPU, MME | Code available (3)
Composed Multi-modal Retrieval: A Survey of Approaches and Applications | Mar 3, 2025 | Cross-Modal Retrieval, Data Augmentation | Code available (2)
Gramian Multimodal Representation Learning and Alignment | Dec 16, 2024 | Contrastive Learning, Representation Learning | Code available (2)
Explore the Limits of Omni-modal Pretraining at Scale | Jun 13, 2024 | Language Modeling, Language Modelling | Code available (2)
Composed Video Retrieval via Enriched Context and Discriminative Embeddings | Mar 25, 2024 | Composed Video Retrieval (CoVR), Retrieval | Code available (2)
EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World | Mar 24, 2024 | Action Anticipation, Action Quality Assessment | Code available (2)
vid-TLDR: Training Free Token merging for Light-weight Video Transformer | Mar 20, 2024 | Action Recognition, Computational Efficiency | Code available (2)
Multi-granularity Correspondence Learning from Long-term Noisy Videos | Jan 30, 2024 | Action Segmentation, Long Video Retrieval (Background Removed) | Code available (2)
Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation | Jul 13, 2023 | Retrieval, Video Generation | Code available (2)
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | May 29, 2023 | Audio captioning, Audio-Visual Captioning | Code available (2)
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | Apr 17, 2023 | Audio captioning, Audio-Video Question Answering (AVQA) | Code available (2)
Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? | Dec 31, 2022 | Data Augmentation, Retrieval | Code available (2)
X^2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | Nov 22, 2022 | All, Cross-Modal Retrieval | Code available (2)
Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning | Oct 12, 2022 | Contrastive Learning, Form | Code available (2)
CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment | Sep 14, 2022 | Retrieval, Text Retrieval | Code available (2)
Revealing Single Frame Bias for Video-and-Language Learning | Jun 7, 2022 | Action Recognition, Fine-grained Action Recognition | Code available (2)
All in One: Exploring Unified Video-Language Pre-training | Mar 14, 2022 | All, Language Modelling | Code available (2)
LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts | May 20, 2025 | Caption Generation, Retrieval | Code available (1)
Video-GPT via Next Clip Diffusion | May 18, 2025 | Denoising, Image Animation | Code available (1)
StableFusion: Continual Video Retrieval via Frame Adaptation | Mar 13, 2025 | Continual Learning, Mixture-of-Experts | Code available (1)
Text Proxy: Decomposing Retrieval from a 1-to-N Relationship into N 1-to-1 Relationships for Text-Video Retrieval | Oct 9, 2024 | Retrieval, Text Retrieval | Code available (1)
TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval | Sep 2, 2024 | GPU, Retrieval | Code available (1)
T2VIndexer: A Generative Video Indexer for Efficient Text-Video Retrieval | Aug 21, 2024 | Retrieval, Video Retrieval | Code available (1)
MUSE: Mamba is Efficient Multi-scale Learner for Text-video Retrieval | Aug 20, 2024 | Mamba, Natural Language Queries | Code available (1)
EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval | Jul 23, 2024 | Re-Ranking, Retrieval | Code available (1)
Referring Atomic Video Action Recognition | Jul 2, 2024 | Action Localization, Action Recognition | Code available (1)
GMMFormer v2: An Uncertainty-aware Framework for Partially Relevant Video Retrieval | May 22, 2024 | Partially Relevant Video Retrieval, Retrieval | Code available (1)
Text-Video Retrieval with Global-Local Semantic Consistent Learning | May 21, 2024 | Concept Alignment, Retrieval | Code available (1)
DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval | Jan 19, 2024 | Retrieval, Video Retrieval | Code available (1)
Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning | Jan 1, 2024 | Representation Learning, Retrieval | Code available (1)
Holistic Features are almost Sufficient for Text-to-Video Retrieval | Jan 1, 2024 | Retrieval, text similarity | Code available (1)
Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos | Dec 16, 2023 | Video Captioning, video narration captioning | Code available (1)
Let All be Whitened: Multi-teacher Distillation for Efficient Visual Retrieval | Dec 15, 2023 | All, Image Retrieval | Code available (1)
RTQ: Rethinking Video-language Understanding Based on Image-text Model | Dec 1, 2023 | Video Captioning, Video Question Answering | Code available (1)
Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning | Nov 27, 2023 | Action Classification, Action Recognition | Code available (1)
VideoCon: Robust Video-Language Alignment via Contrast Captions | Nov 15, 2023 | Language Modeling, Language Modelling | Code available (1)
TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding | Oct 29, 2023 | Form, Language Modelling | Code available (1)
Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data | Oct 8, 2023 | Action Recognition, Continual Learning | Code available (1)
GMMFormer: Gaussian-Mixture-Model Based Transformer for Efficient Partially Relevant Video Retrieval | Oct 8, 2023 | Partially Relevant Video Retrieval, Retrieval | Code available (1)
HowToCaption: Prompting LLMs to Transform Video Annotations at Scale | Oct 7, 2023 | Automatic Speech Recognition, Video Captioning | Code available (1)
Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval | Sep 29, 2023 | Cross-Modal Retrieval, Image-text matching | Code available (1)
Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning | Sep 20, 2023 | Contrastive Learning, Retrieval | Code available (1)
Unified Coarse-to-Fine Alignment for Video-Text Retrieval | Sep 18, 2023 | Retrieval, Text Retrieval | Code available (1)
In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval | Sep 16, 2023 | Retrieval, Style Transfer | Code available (1)
CoVR-2: Automatic Data Construction for Composed Video Retrieval | Aug 28, 2023 | Composed Image Retrieval (CoIR), Composed Video Retrieval (CoVR) | Code available (1)
Simple Baselines for Interactive Video Retrieval with Questions and Answers | Aug 21, 2023 | Question Answering, Retrieval | Code available (1)