| Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal Video-Text Retrieval | Dec 26, 2024 | Image-text Retrieval, Information Retrieval | Code Available | 0 | 5 |
| CAREL: Instruction-guided reinforcement learning with cross-modal auxiliary objectives | Nov 29, 2024 | reinforcement-learning, Reinforcement Learning | Code Available | 0 | 5 |
| CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning | Mar 22, 2023 | Contrastive Learning, Retrieval | Code Available | 0 | 5 |
| Diving Deep into the Motion Representation of Video-Text Models | Jun 7, 2024 | Retrieval, Text Retrieval | Code Available | 0 | 5 |
| Expertized Caption Auto-Enhancement for Video-Text Retrieval | Feb 5, 2025 | Caption Generation, Retrieval | Code Available | 0 | 5 |
| Harvest Video Foundation Models via Efficient Post-Pretraining | Oct 30, 2023 | Question Answering, Text Retrieval | Code Available | 0 | 5 |
| Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval | Jun 11, 2018 | Image-text Retrieval, Retrieval | Code Available | 0 | 5 |
| Rudder: A Cross Lingual Video and Text Retrieval Dataset | Mar 9, 2021 | Natural Language Queries, Retrieval | Code Available | 0 | 5 |
| TaCA: Upgrading Your Visual Foundation Model with Task-agnostic Compatible Adapter | Jun 22, 2023 | Question Answering, Retrieval | Code Available | 0 | 5 |
| Video-Text Retrieval by Supervised Sparse Multi-Grained Learning | Feb 19, 2023 | Representation Learning, Retrieval | Code Available | 0 | 5 |
| OmniVL:One Foundation Model for Image-Language and Video-Language Tasks | Sep 15, 2022 | Action Classification, Action Recognition | Unverified | 0 | 0 |
| Mask to reconstruct: Cooperative Semantics Completion for Video-text Retrieval | May 13, 2023 | Retrieval, Text Retrieval | Unverified | 0 | 0 |
| Masked Contrastive Pre-Training for Efficient Video-Text Retrieval | Dec 2, 2022 | Image-text Retrieval, Retrieval | Unverified | 0 | 0 |
| LV-MAE: Learning Long Video Representations through Masked-Embedding Autoencoders | Apr 4, 2025 | Self-Supervised Learning, Text Retrieval | Unverified | 0 | 0 |
| Leveraging Generative Language Models for Weakly Supervised Sentence Component Analysis in Video-Language Joint Learning | Dec 10, 2023 | Language Modeling, Language Modelling | Unverified | 0 | 0 |
| Rethinking Noisy Video-Text Retrieval via Relation-aware Alignment | Jan 1, 2025 | Relation, Retrieval | Unverified | 0 | 0 |
| RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning | May 11, 2024 | Image-text matching, Retrieval | Unverified | 0 | 0 |
| Retrieving and Highlighting Action with Spatiotemporal Reference | May 19, 2020 | Action Recognition, Cross-Modal Retrieval | Unverified | 0 | 0 |
| Learning with Noisy Correspondence | Apr 13, 2024 | Cross-Modal Retrieval, Cross-modal retrieval with noisy correspondence | Unverified | 0 | 0 |
| Learning Context-Adapted Video-Text Retrieval by Attending to User Comments | Sep 29, 2021 | Retrieval, Text Retrieval | Unverified | 0 | 0 |
| Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval | Apr 3, 2025 | Information Retrieval, Representation Learning | Unverified | 0 | 0 |
| Beyond Coarse-Grained Matching in Video-Text Retrieval | Oct 16, 2024 | Retrieval, Text Retrieval | Unverified | 0 | 0 |
| LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval | Jul 11, 2022 | Representation Learning, Retrieval | Unverified | 0 | 0 |
| Stacked Convolutional Deep Encoding Network for Video-Text Retrieval | Apr 10, 2020 | Language Modeling, Language Modelling | Unverified | 0 | 0 |
| HiVLP: Hierarchical Interactive Video-Language Pre-Training | Jan 1, 2023 | Retrieval, Self-Supervised Learning | Unverified | 0 | 0 |
| Synopses of Movie Narratives: a Video-Language Dataset for Story Understanding | Jan 16, 2022 | Retrieval, Text Retrieval | Unverified | 0 | 0 |
| Synopses of Movie Narratives: a Video-Language Dataset for Story Understanding | Mar 11, 2022 | Retrieval, Text Retrieval | Unverified | 0 | 0 |
| An Empirical Study of Excitation and Aggregation Design Adaptions in CLIP4Clip for Video-Text Retrieval | May 25, 2024 | Retrieval, Text Retrieval | Unverified | 0 | 0 |
| Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text Retrieval | Jan 30, 2023 | Language Modeling, Language Modelling | Unverified | 0 | 0 |
| HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval | Mar 28, 2021 | Retrieval, Text Retrieval | Unverified | 0 | 0 |
| HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model | Jun 1, 2024 | Action Recognition, Activity Recognition | Unverified | 0 | 0 |
| HaVTR: Improving Video-Text Retrieval Through Augmentation Using Large Foundation Models | Apr 7, 2024 | Hallucination, Representation Learning | Unverified | 0 | 0 |
| Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval | Sep 27, 2022 | Cross-Modal Retrieval, Retrieval | Unverified | 0 | 0 |
| Generalizing Multimodal Pre-training into Multilingual via Language Acquisition | May 29, 2022 | Language Acquisition, Retrieval | Unverified | 0 | 0 |
| TokenFlow: Rethinking Fine-grained Cross-modal Alignment in Vision-Language Retrieval | Sep 28, 2022 | cross-modal alignment, Retrieval | Unverified | 0 | 0 |
| Towards Understanding Camera Motions in Any Video | Apr 21, 2025 | Question Answering, Text Retrieval | Unverified | 0 | 0 |
| Uncertainty-Aware Alignment Network for Cross-Domain Video-Text Retrieval | Sep 21, 2023 | Domain Adaptation, Retrieval | Unverified | 0 | 0 |
| Uncertainty-aware sign language video retrieval with probability distribution modeling | May 30, 2024 | Retrieval, Sign Language Retrieval | Unverified | 0 | 0 |
| CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval | Dec 31, 2024 | Retrieval, Text Retrieval | Unverified | 0 | 0 |
| Exploiting Visual Semantic Reasoning for Video-Text Retrieval | Jun 16, 2020 | Retrieval, Text Retrieval | Unverified | 0 | 0 |
| Unified Loss of Pair Similarity Optimization for Vision-Language Retrieval | Sep 28, 2022 | Contrastive Learning, Retrieval | Unverified | 0 | 0 |
| Unifying Latent and Lexicon Representations for Effective Video-Text Retrieval | Feb 26, 2024 | Retrieval, Text Retrieval | Unverified | 0 | 0 |
| EA-VTR: Event-Aware Video-Text Retrieval | Jul 10, 2024 | Action Recognition, Contrastive Learning | Unverified | 0 | 0 |
| Dual Alignment Unsupervised Domain Adaptation for Video-Text Retrieval | Jan 1, 2023 | Domain Adaptation, Retrieval | Unverified | 0 | 0 |
| V^2Dial: Unification of Video and Visual Dialog via Multimodal Experts | Mar 3, 2025 | Contrastive Learning, Text Retrieval | Unverified | 0 | 0 |
| V^2Dial: Unification of Video and Visual Dialog via Multimodal Experts | Jan 1, 2025 | Contrastive Learning, Text Retrieval | Unverified | 0 | 0 |
| Video Editing for Video Retrieval | Feb 4, 2024 | Retrieval, Text Retrieval | Unverified | 0 | 0 |
| Deep Semantic Multimodal Hashing Network for Scalable Image-Text and Video-Text Retrievals | Jan 9, 2019 | Cross-Modal Retrieval, Deep Hashing | Unverified | 0 | 0 |
| Deep Learning for Video-Text Retrieval: a Review | Feb 24, 2023 | Deep Learning, Retrieval | Unverified | 0 | 0 |