| Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval | Jun 11, 2025 | Retrieval, Text to Video Retrieval | Unverified | 0 |
| Leveraging Auxiliary Information in Text-to-Video Retrieval: A Review | May 29, 2025 | Retrieval, Text to Video Retrieval | Unverified | 0 |
| Towards Efficient Partially Relevant Video Retrieval with Active Moment Discovering | Apr 15, 2025 | Partially Relevant Video Retrieval, Retrieval | Code Available | 0 |
| TC-MGC: Text-Conditioned Multi-Grained Contrastive Learning for Text-Video Retrieval | Apr 7, 2025 | Contrastive Learning, Retrieval | Code Available | 0 |
| Video-ColBERT: Contextualized Late Interaction for Text-to-Video Retrieval | Mar 24, 2025 | Retrieval, Text to Video Retrieval | Unverified | 0 |
| StableFusion: Continual Video Retrieval via Frame Adaptation | Mar 13, 2025 | Continual Learning, Mixture-of-Experts | Code Available | 1 |
| CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval | Dec 31, 2024 | Retrieval, Text Retrieval | Unverified | 0 |
| ContextIQ: A Multimodal Expert-Based Video Retrieval System for Contextual Advertising | Oct 29, 2024 | Retrieval, Text to Video Retrieval | Code Available | 0 |
| EA-VTR: Event-Aware Video-Text Retrieval | Jul 10, 2024 | Action Recognition, Contrastive Learning | Unverified | 0 |
| Multi-Granularity and Multi-modal Feature Interaction Approach for Text Video Retrieval | Jun 21, 2024 | Retrieval, Sentence | Unverified | 0 |
| Sakuga-42M Dataset: Scaling Up Cartoon Research | May 13, 2024 | Mamba, Text to Video Retrieval | Unverified | 0 |
| Learning text-to-video retrieval from image captioning | Apr 26, 2024 | Image Captioning, Image Retrieval | Unverified | 0 |
| Distilling Vision-Language Models on Millions of Videos | Jan 11, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Holistic Features are almost Sufficient for Text-to-Video Retrieval | Jan 1, 2024 | Retrieval, text similarity | Code Available | 1 |
| Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning | Jan 1, 2024 | Representation Learning, Retrieval | Code Available | 1 |
| Leveraging Generative Language Models for Weakly Supervised Sentence Component Analysis in Video-Language Joint Learning | Dec 10, 2023 | Language Modeling, Language Modelling | Unverified | 0 |
| E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer | Nov 28, 2023 | Language Modeling, Language Modelling | Unverified | 0 |
| VideoCon: Robust Video-Language Alignment via Contrast Captions | Nov 15, 2023 | Language Modeling, Language Modelling | Code Available | 1 |
| An Empirical Study of Frame Selection for Text-to-Video Retrieval | Nov 1, 2023 | Retrieval, Text to Video Retrieval | Unverified | 0 |
| Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data | Oct 8, 2023 | Action Recognition, Continual Learning | Code Available | 1 |
| Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval | Sep 29, 2023 | Cross-Modal Retrieval, Image-text matching | Code Available | 1 |
| Unified Coarse-to-Fine Alignment for Video-Text Retrieval | Sep 18, 2023 | Retrieval, Text Retrieval | Code Available | 1 |
| TeachCLIP: Multi-Grained Teaching for Efficient Text-to-Video Retrieval | Aug 2, 2023 | Retrieval, text similarity | Unverified | 0 |
| Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment | Jul 24, 2023 | Retrieval, Text to Video Retrieval | Unverified | 0 |
| MSVD-Indonesian: A Benchmark for Multimodal Video-Text Tasks in Indonesian | Jun 20, 2023 | Cross-Lingual Transfer, Retrieval | Code Available | 0 |
| MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | Mar 23, 2023 | Auxiliary Learning, Multimodal Sentiment Analysis | Code Available | 1 |
| Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer | Feb 4, 2023 | Computational Efficiency, Question Answering | Code Available | 0 |
| Temporal Perceiving Video-Language Pre-training | Jan 18, 2023 | Action Localization, Contrastive Learning | Unverified | 0 |
| Learning Trajectory-Word Alignments for Video-Language Tasks | Jan 5, 2023 | Question Answering, Retrieval | Unverified | 0 |
| Dual Learning with Dynamic Knowledge Distillation for Partially Relevant Video Retrieval | Jan 1, 2023 | Knowledge Distillation, Language Modelling | Code Available | 1 |
| VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | Dec 9, 2022 | Question Answering, Retrieval | Unverified | 0 |
| VindLU: A Recipe for Effective Video-and-Language Pretraining | Dec 9, 2022 | Question Answering, Retrieval | Code Available | 1 |
| X^2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | Nov 22, 2022 | All, Cross-Modal Retrieval | Code Available | 2 |
| Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval | Nov 21, 2022 | All, Retrieval | Code Available | 0 |
| SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training | Nov 21, 2022 | cross-modal alignment, GPU | Unverified | 0 |
| Fighting FIRe with FIRE: Assessing the Validity of Text-to-Video Retrieval Benchmarks | Oct 10, 2022 | Retrieval, Text to Video Retrieval | Unverified | 0 |
| An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | Sep 4, 2022 | Fill Mask, Optical Flow Estimation | Code Available | 1 |
| Partially Relevant Video Retrieval | Aug 26, 2022 | Moment Retrieval, Multiple Instance Learning | Code Available | 1 |
| Clover: Towards A Unified Video-Language Alignment and Fusion Model | Jul 16, 2022 | Language Modeling, Language Modelling | Code Available | 1 |
| Robustness Analysis of Video-Language Models Against Visual and Language Perturbations | Jul 5, 2022 | Language Modeling, Language Modelling | Code Available | 0 |
| RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval | Jun 26, 2022 | Mixture-of-Experts, Retrieval | Code Available | 0 |
| Semantic Role Aware Correlation Transformer for Text to Video Retrieval | Jun 26, 2022 | Retrieval, Text to Video Retrieval | Code Available | 0 |
| LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling | Jun 14, 2022 | Decoder, Language Modeling | Code Available | 1 |
| Revealing Single Frame Bias for Video-and-Language Learning | Jun 7, 2022 | Action Recognition, Fine-grained Action Recognition | Code Available | 2 |
| Revisiting the "Video" in Video-Language Understanding | Jun 3, 2022 | Benchmarking, Question Answering | Code Available | 1 |
| Learning to Retrieve Videos by Asking Questions | May 11, 2022 | AI Agent, Retrieval | Code Available | 0 |
| MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval | Apr 26, 2022 | Action Recognition, Retrieval | Code Available | 1 |
| COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval | Apr 15, 2022 | Contrastive Learning, Cross-Modal Retrieval | Unverified | 0 |
| ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound | Apr 6, 2022 | Retrieval, Text to Video Retrieval | Code Available | 1 |
| GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval | Apr 1, 2022 | Boundary Captioning, Boundary Grounding | Code Available | 1 |