InternVideo2: Scaling Foundation Models for Multimodal Video Understanding (Mar 22, 2024). Action Classification; Action Recognition. Code Available 75.
InternVideo: General Video Foundation Models via Generative and Discriminative Learning (Dec 6, 2022). Action Classification; Action Recognition. Code Available 45.
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video (Feb 1, 2023). Action Classification; Image Classification. Code Available 45.
Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension (Nov 20, 2024). GPU; MME. Code Available 35.
VideoRoPE: What Makes for Good Video Rotary Position Embedding? (Feb 7, 2025). Hallucination; Position. Code Available 35.
CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment (Sep 14, 2022). Retrieval; Text Retrieval. Code Available 25.
vid-TLDR: Training Free Token merging for Light-weight Video Transformer (Mar 20, 2024). Action Recognition; Computational Efficiency. Code Available 25.
Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation (Jul 13, 2023). Retrieval; Video Generation. Code Available 25.
Revealing Single Frame Bias for Video-and-Language Learning (Jun 7, 2022). Action Recognition; Fine-grained Action Recognition. Code Available 25.
EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World (Mar 24, 2024). Action Anticipation; Action Quality Assessment. Code Available 25.
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset (May 29, 2023). Audio captioning; Audio-Visual Captioning. Code Available 25.
Explore the Limits of Omni-modal Pretraining at Scale (Jun 13, 2024). Language Modeling; Language Modelling. Code Available 25.
Composed Video Retrieval via Enriched Context and Discriminative Embeddings (Mar 25, 2024). Composed Video Retrieval (CoVR); Retrieval. Code Available 25.
Gramian Multimodal Representation Learning and Alignment (Dec 16, 2024). Contrastive Learning; Representation Learning. Code Available 25.
Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning (Oct 12, 2022). Contrastive Learning; Form. Code Available 25.
All in One: Exploring Unified Video-Language Pre-training (Mar 14, 2022). All; Language Modelling. Code Available 25.
Multi-granularity Correspondence Learning from Long-term Noisy Videos (Jan 30, 2024). Action Segmentation; Long Video Retrieval (Background Removed). Code Available 25.
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset (Apr 17, 2023). Audio captioning; Audio-Video Question Answering (AVQA). Code Available 25.
X^2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks (Nov 22, 2022). All; Cross-Modal Retrieval. Code Available 25.
Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? (Dec 31, 2022). Data Augmentation; Retrieval. Code Available 25.
Composed Multi-modal Retrieval: A Survey of Approaches and Applications (Mar 3, 2025). Cross-Modal Retrieval; Data Augmentation. Code Available 25.
Cross-Modal Adapter for Text-Video Retrieval (Nov 17, 2022). parameter-efficient fine-tuning; Retrieval. Code Available 15.
ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound (Apr 6, 2022). Retrieval; Text to Video Retrieval. Code Available 15.
EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval (Jul 23, 2024). Re-Ranking; Retrieval. Code Available 15.
CoVR-2: Automatic Data Construction for Composed Video Retrieval (Aug 28, 2023). Composed Image Retrieval (CoIR); Composed Video Retrieval (CoVR). Code Available 15.
An overview on the evaluated video retrieval tasks at TRECVID 2022 (Jun 22, 2023). Ad-hoc video search; Retrieval. Code Available 15.
A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension (May 5, 2023). Reading Comprehension; Retrieval. Code Available 15.
Cross Modal Retrieval with Querybank Normalisation (Dec 23, 2021). Cross-Modal Retrieval; Metric Learning. Code Available 15.
Align and Prompt: Video-and-Language Pre-training with Entity Prompts (Dec 17, 2021). cross-modal alignment; Entity Alignment. Code Available 15.
AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant (Nov 30, 2021). Question Answering; Retrieval. Code Available 15.
Dual Learning with Dynamic Knowledge Distillation for Partially Relevant Video Retrieval (Jan 1, 2023). Knowledge Distillation; Language Modelling. Code Available 15.
Cross-Architecture Self-supervised Video Representation Learning (May 26, 2022). Action Recognition; Contrastive Learning. Code Available 15.
A CLIP-Hitchhiker's Guide to Long Video Retrieval (May 17, 2022). Retrieval; Video Retrieval. Code Available 15.
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model (Jun 15, 2023). Form; model. Code Available 15.
DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization (Jun 1, 2021). Question Answering; Retrieval. Code Available 15.
Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning (Sep 20, 2023). Contrastive Learning; Retrieval. Code Available 15.
CenterCLIP: Token Clustering for Efficient Text-Video Retrieval (May 2, 2022). Clustering; Retrieval. Code Available 15.
A Feature-space Multimodal Data Augmentation Technique for Text-video Retrieval (Aug 3, 2022). Data Augmentation; Retrieval. Code Available 15.
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model (Mar 17, 2023). Retrieval; Video Retrieval. Code Available 15.
Bridging Video-text Retrieval with Multiple Choice Questions (Jan 13, 2022). Action Recognition; Linear evaluation. Code Available 15.
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling (Sep 4, 2022). Fill Mask; Optical Flow Estimation. Code Available 15.
CONQUER: Contextual Query-aware Ranking for Video Corpus Moment Retrieval (Sep 21, 2021). Corpus Video Moment Retrieval; Moment Retrieval. Code Available 15.
Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data (Oct 8, 2023). Action Recognition; Continual Learning. Code Available 15.
C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval (Oct 7, 2022). Knowledge Distillation; Retrieval. Code Available 15.
Temporal Context Aggregation for Video Retrieval with Contrastive Learning (Aug 4, 2020). Contrastive Learning; Representation Learning. Code Available 15.
Condensed Movies: Story Based Retrieval with Contextual Embeddings (May 8, 2020). Retrieval; Text to Video Retrieval. Code Available 15.
Dense-Captioning Events in Videos (May 2, 2017). Dense Captioning; Retrieval. Code Available 15.
Contrastive Masked Autoencoders for Self-Supervised Video Hashing (Nov 21, 2022). Decoder; Retrieval. Code Available 15.
COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning (Nov 1, 2020). Cross-Modal Retrieval; Representation Learning. Code Available 15.
DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval (Jan 19, 2024). Retrieval; Video Retrieval. Code Available 15.