| X^2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | Nov 22, 2022 | All, Cross-Modal Retrieval | Code Available | 2 | 5 |
| Revealing Single Frame Bias for Video-and-Language Learning | Jun 7, 2022 | Action Recognition, Fine-grained Action Recognition | Code Available | 2 | 5 |
| X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval | Mar 28, 2022 | Retrieval, Text to Video Retrieval | Code Available | 1 | 5 |
| An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | Sep 4, 2022 | Fill Mask, Optical Flow Estimation | Code Available | 1 | 5 |
| Bridging Video-text Retrieval with Multiple Choice Questions | Jan 13, 2022 | Action Recognition, Linear evaluation | Code Available | 1 | 5 |
| Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data | Oct 8, 2023 | Action Recognition, Continual Learning | Code Available | 1 | 5 |
| CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | Apr 18, 2021 | Retrieval, Text Retrieval | Code Available | 1 | 5 |
| Clover: Towards A Unified Video-Language Alignment and Fusion Model | Jul 16, 2022 | Language Modeling, Language Modelling | Code Available | 1 | 5 |
| Condensed Movies: Story Based Retrieval with Contextual Embeddings | May 8, 2020 | Retrieval, Text to Video Retrieval | Code Available | 1 | 5 |
| DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization | Jun 1, 2021 | Question Answering, Retrieval | Code Available | 1 | 5 |
| Dual Learning with Dynamic Knowledge Distillation for Partially Relevant Video Retrieval | Jan 1, 2023 | Knowledge Distillation, Language Modelling | Code Available | 1 | 5 |
| ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound | Apr 6, 2022 | Retrieval, Text to Video Retrieval | Code Available | 1 | 5 |
| End-to-End Learning of Visual Representations from Uncurated Instructional Videos | Dec 13, 2019 | Action Localization, Action Recognition | Code Available | 1 | 5 |
| Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | Apr 1, 2021 | Retrieval, Text Retrieval | Code Available | 1 | 5 |
| GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval | Apr 1, 2022 | Boundary Captioning, Boundary Grounding | Code Available | 1 | 5 |
| Holistic Features are almost Sufficient for Text-to-Video Retrieval | Jan 1, 2024 | Retrieval, text similarity | Code Available | 1 | 5 |
| HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips | Jun 7, 2019 | Action Localization, Long Video Retrieval (Background Removed) | Code Available | 1 | 5 |
| LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling | Jun 14, 2022 | Decoder, Language Modeling | Code Available | 1 | 5 |
| Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling | Feb 11, 2021 | Question Answering, Retrieval | Code Available | 1 | 5 |
| Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval | Dec 3, 2021 | Ad-hoc video search, feature selection | Code Available | 1 | 5 |
| MDMMT: Multidomain Multimodal Transformer for Video Retrieval | Mar 19, 2021 | Retrieval, Text to Video Retrieval | Code Available | 1 | 5 |
| MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | Mar 23, 2023 | Auxiliary Learning, Multimodal Sentiment Analysis | Code Available | 1 | 5 |
| MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval | Apr 26, 2022 | Action Recognition, Retrieval | Code Available | 1 | 5 |
| Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos | Apr 26, 2021 | Action Localization, Clustering | Code Available | 1 | 5 |
| Partially Relevant Video Retrieval | Aug 26, 2022 | Moment Retrieval, Multiple Instance Learning | Code Available | 1 | 5 |
| Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval | Sep 29, 2023 | Cross-Modal Retrieval, Image-text matching | Code Available | 1 | 5 |
| Reading-strategy Inspired Visual Representation Learning for Text-to-Video Retrieval | Jan 23, 2022 | Representation Learning, Retrieval | Code Available | 1 | 5 |
| Revisiting the "Video" in Video-Language Understanding | Jun 3, 2022 | Benchmarking, Question Answering | Code Available | 1 | 5 |
| Revitalize Region Feature for Democratizing Video-Language Pre-training of Retrieval | Mar 15, 2022 | Question Answering, Retrieval | Code Available | 1 | 5 |
| StableFusion: Continual Video Retrieval via Frame Adaptation | Mar 13, 2025 | Continual Learning, Mixture-of-Experts | Code Available | 1 | 5 |
| The End-of-End-to-End: A Video Understanding Pentathlon Challenge (2020) | Aug 3, 2020 | Natural Language Queries, Retrieval | Code Available | 1 | 5 |
| Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning | Jan 1, 2024 | Representation Learning, Retrieval | Code Available | 1 | 5 |
| Unified Coarse-to-Fine Alignment for Video-Text Retrieval | Sep 18, 2023 | Retrieval, Text Retrieval | Code Available | 1 | 5 |
| VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation | Jun 8, 2021 | Multi-Task Learning, Question Answering | Code Available | 1 | 5 |
| VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text | Apr 22, 2021 | Action Classification, Action Recognition | Code Available | 1 | 5 |
| VideoCon: Robust Video-Language Alignment via Contrast Captions | Nov 15, 2023 | Language Modeling, Language Modelling | Code Available | 1 | 5 |
| VindLU: A Recipe for Effective Video-and-Language Pretraining | Dec 9, 2022 | Question Answering, Retrieval | Code Available | 1 | 5 |
| VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling | Nov 24, 2021 | Question Answering, Retrieval | Code Available | 1 | 5 |
| Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions | Nov 19, 2021 | Retrieval, Super-Resolution | Code Available | 1 | 5 |
| Towards Efficient Partially Relevant Video Retrieval with Active Moment Discovering | Apr 15, 2025 | Partially Relevant Video Retrieval, Retrieval | Code Available | 0 | 5 |
| RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval | Jun 26, 2022 | Mixture-of-Experts, Retrieval | Code Available | 0 | 5 |
| Robustness Analysis of Video-Language Models Against Visual and Language Perturbations | Jul 5, 2022 | Language Modeling, Language Modelling | Code Available | 0 | 5 |
| Learning to Retrieve Videos by Asking Questions | May 11, 2022 | AI Agent, Retrieval | Code Available | 0 | 5 |
| Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning | Mar 6, 2020 | Density Estimation, Noise Estimation | Code Available | 0 | 5 |
| ContextIQ: A Multimodal Expert-Based Video Retrieval System for Contextual Advertising | Oct 29, 2024 | Retrieval, Text to Video Retrieval | Code Available | 0 | 5 |
| MSVD-Indonesian: A Benchmark for Multimodal Video-Text Tasks in Indonesian | Jun 20, 2023 | Cross-Lingual Transfer, Retrieval | Code Available | 0 | 5 |
| FitCLIP: Refining Large-Scale Pretrained Image-Text Models for Zero-Shot Video Understanding Tasks | Mar 24, 2022 | Action Recognition, Retrieval | Code Available | 0 | 5 |
| Semantic Role Aware Correlation Transformer for Text to Video Retrieval | Jun 26, 2022 | Retrieval, Text to Video Retrieval | Code Available | 0 | 5 |
| TC-MGC: Text-Conditioned Multi-Grained Contrastive Learning for Text-Video Retrieval | Apr 7, 2025 | Contrastive Learning, Retrieval | Code Available | 0 | 5 |
| Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval | Nov 21, 2022 | All, Retrieval | Code Available | 0 | 5 |