InternVideo2: Scaling Foundation Models for Multimodal Video Understanding (Mar 22, 2024). Action Classification, Action Recognition. Code available (7).
InternVideo: General Video Foundation Models via Generative and Discriminative Learning (Dec 6, 2022). Action Classification, Action Recognition. Code available (4).
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video (Feb 1, 2023). Action Classification, Image Classification. Code available (4).
Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension (Nov 20, 2024). GPU, MME. Code available (3).
VideoRoPE: What Makes for Good Video Rotary Position Embedding? (Feb 7, 2025). Hallucination, Position. Code available (3).
Explore the Limits of Omni-modal Pretraining at Scale (Jun 13, 2024). Language Modeling, Language Modelling. Code available (2).
X^2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks (Nov 22, 2022). All, Cross-Modal Retrieval. Code available (2).
Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation (Jul 13, 2023). Retrieval, Video Generation. Code available (2).
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset (Apr 17, 2023). Audio captioning, Audio-Video Question Answering (AVQA). Code available (2).
vid-TLDR: Training Free Token merging for Light-weight Video Transformer (Mar 20, 2024). Action Recognition, Computational Efficiency. Code available (2).
Multi-granularity Correspondence Learning from Long-term Noisy Videos (Jan 30, 2024). Action Segmentation, Long Video Retrieval (Background Removed). Code available (2).
Gramian Multimodal Representation Learning and Alignment (Dec 16, 2024). Contrastive Learning, Representation Learning. Code available (2).
Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning (Oct 12, 2022). Contrastive Learning, Form. Code available (2).
Revealing Single Frame Bias for Video-and-Language Learning (Jun 7, 2022). Action Recognition, Fine-grained Action Recognition. Code available (2).
Composed Multi-modal Retrieval: A Survey of Approaches and Applications (Mar 3, 2025). Cross-Modal Retrieval, Data Augmentation. Code available (2).
All in One: Exploring Unified Video-Language Pre-training (Mar 14, 2022). All, Language Modelling. Code available (2).
Composed Video Retrieval via Enriched Context and Discriminative Embeddings (Mar 25, 2024). Composed Video Retrieval (CoVR), Retrieval. Code available (2).
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset (May 29, 2023). Audio captioning, Audio-Visual Captioning. Code available (2).
EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World (Mar 24, 2024). Action Anticipation, Action Quality Assessment. Code available (2).
Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? (Dec 31, 2022). Data Augmentation, Retrieval. Code available (2).
CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment (Sep 14, 2022). Retrieval, Text Retrieval. Code available (2).
ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound (Apr 6, 2022). Retrieval, Text to Video Retrieval. Code available (1).
Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning (Sep 20, 2023). Contrastive Learning, Retrieval. Code available (1).
EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval (Jul 23, 2024). Re-Ranking, Retrieval. Code available (1).
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model (Mar 17, 2023). Retrieval, Video Retrieval. Code available (1).
An overview on the evaluated video retrieval tasks at TRECVID 2022 (Jun 22, 2023). Ad-hoc video search, Retrieval. Code available (1).
A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension (May 5, 2023). Reading Comprehension, Retrieval. Code available (1).
Dual Learning with Dynamic Knowledge Distillation for Partially Relevant Video Retrieval (Jan 1, 2023). Knowledge Distillation, Language Modelling. Code available (1).
Align and Prompt: Video-and-Language Pre-training with Entity Prompts (Dec 17, 2021). cross-modal alignment, Entity Alignment. Code available (1).
AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant (Nov 30, 2021). Question Answering, Retrieval. Code available (1).
Disentangled Representation Learning for Text-Video Retrieval (Mar 14, 2022). Representation Learning, Retrieval. Code available (1).
Dense-Captioning Events in Videos (May 2, 2017). Dense Captioning, Retrieval. Code available (1).
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval (Apr 18, 2021). Retrieval, Text Retrieval. Code available (1).
A CLIP-Hitchhiker's Guide to Long Video Retrieval (May 17, 2022). Retrieval, Video Retrieval. Code available (1).
DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval (Jan 19, 2024). Retrieval, Video Retrieval. Code available (1).
DnS: Distill-and-Select for Efficient and Accurate Video Indexing and Retrieval (Jun 24, 2021). Computational Efficiency, Knowledge Distillation. Code available (1).
CenterCLIP: Token Clustering for Efficient Text-Video Retrieval (May 2, 2022). Clustering, Retrieval. Code available (1).
A Feature-space Multimodal Data Augmentation Technique for Text-video Retrieval (Aug 3, 2022). Data Augmentation, Retrieval. Code available (1).
DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization (Jun 1, 2021). Question Answering, Retrieval. Code available (1).
Bridging Video-text Retrieval with Multiple Choice Questions (Jan 13, 2022). Action Recognition, Linear evaluation. Code available (1).
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling (Sep 4, 2022). Fill Mask, Optical Flow Estimation. Code available (1).
Cross-Modal Adapter for Text-Video Retrieval (Nov 17, 2022). parameter-efficient fine-tuning, Retrieval. Code available (1).
Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data (Oct 8, 2023). Action Recognition, Continual Learning. Code available (1).
C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval (Oct 7, 2022). Knowledge Distillation, Retrieval. Code available (1).
CoVR-2: Automatic Data Construction for Composed Video Retrieval (Aug 28, 2023). Composed Image Retrieval (CoIR), Composed Video Retrieval (CoVR). Code available (1).
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model (Jun 15, 2023). Form, model. Code available (1).
Cross-Architecture Self-supervised Video Representation Learning (May 26, 2022). Action Recognition, Contrastive Learning. Code available (1).
Cross Modal Retrieval with Querybank Normalisation (Dec 23, 2021). Cross-Modal Retrieval, Metric Learning. Code available (1).
Clover: Towards A Unified Video-Language Alignment and Fusion Model (Jul 16, 2022). Language Modeling, Language Modelling. Code available (1).
Text Proxy: Decomposing Retrieval from a 1-to-N Relationship into N 1-to-1 Relationships for Text-Video Retrieval (Oct 9, 2024). Retrieval, Text Retrieval. Code available (1).