Video Retrieval

Given a text query and a pool of candidate videos, the goal of video retrieval is to select the video that matches the query. Systems typically return the candidates as a ranked list, which is scored with document retrieval metrics such as Recall@K (reported as R@1, R@5, and R@10 in the benchmark results below).
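As a rough illustration of how a ranked list is scored, the sketch below computes Recall@K for a toy set of queries. It assumes each query has exactly one correct video and that a query-to-video similarity matrix is already available; the names `rank_videos`, `recall_at_k`, `similarity`, and `ground_truth` are hypothetical and do not come from any paper or benchmark listed here.

```python
from typing import List, Sequence


def rank_videos(scores: Sequence[float]) -> List[int]:
    """Return candidate indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def recall_at_k(similarity: Sequence[Sequence[float]],
                ground_truth: Sequence[int], k: int) -> float:
    """Fraction of queries whose correct video appears in the top-k results.

    similarity[q][v] is an assumed query-to-video score (higher is better);
    ground_truth[q] is the index of the single correct video for query q.
    """
    hits = 0
    for scores, target in zip(similarity, ground_truth):
        if target in rank_videos(scores)[:k]:
            hits += 1
    return hits / len(ground_truth)


# Toy data: 3 text queries scored against a pool of 4 videos.
similarity = [
    [0.9, 0.1, 0.3, 0.2],  # query 0: correct video 0 is ranked 1st
    [0.2, 0.4, 0.8, 0.1],  # query 1: correct video 1 is ranked 2nd
    [0.5, 0.6, 0.1, 0.7],  # query 2: correct video 2 is ranked 4th
]
ground_truth = [0, 1, 2]

r1 = recall_at_k(similarity, ground_truth, k=1)  # one of three queries hits
r2 = recall_at_k(similarity, ground_truth, k=2)  # two of three queries hit
```

Leaderboards usually report this value as a percentage, so a fraction from `recall_at_k` would be multiplied by 100 before comparison. Real benchmarks may differ in details such as multiple captions per video, which this sketch does not handle.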

Papers

Showing 251–300 of 486 papers

Title	Date	Tasks	Status
Generative Ghost: Investigating Ranking Bias Hidden in AI-Generated Videos	Feb 11, 2025	Contrastive Learning, Image Retrieval	Unverified
Generative Semantic Communication: Architectures, Technologies, and Applications	Dec 11, 2024	Retrieval, Semantic Communication	Unverified
Bridging Information Asymmetry in Text-video Retrieval: A Data-centric Approach	Aug 14, 2024	Cross-Modal Retrieval, Language Modeling	Unverified
Grounding Physical Concepts of Objects and Events Through Dynamic Visual Reasoning	Mar 30, 2021	counterfactual, Object	Unverified
Grounding Physical Object and Event Concepts Through Dynamic Visual Reasoning	Jan 1, 2021	counterfactual, Object	Unverified
Hierarchical Banzhaf Interaction for General Video-Language Representation Learning	Dec 30, 2024	Contrastive Learning, Question Answering	Unverified
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training	Dec 30, 2022	cross-modal alignment, TGIF-Action	Unverified
HiVLP: Hierarchical Interactive Video-Language Pre-Training	Jan 1, 2023	Retrieval, Self-Supervised Learning	Unverified
HORUS: Multimodal Large Language Models Framework for Video Retrieval at VBS 2025	Jan 1, 2025	Image Retrieval, Retrieval	Unverified
Human Action Recognition and Prediction: A Survey	Jun 28, 2018	Action Recognition, Autonomous Driving	Unverified
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations	Apr 7, 2022	Contrastive Learning, Denoising	Unverified
Improving Video Retrieval by Adaptive Margin	Mar 9, 2023	Retrieval, Video Retrieval	Unverified
MuMUR : Multilingual Multimodal Universal Retrieval	Aug 24, 2022	Image Retrieval, Machine Translation	Unverified
Induce, Edit, Retrieve: Language Grounded Multimodal Schema for Instructional Video Retrieval	Nov 17, 2021	Retrieval, Video Retrieval	Unverified
Interactive Video Retrieval with Dialog	May 7, 2019	Retrieval, Video Retrieval	Unverified
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation	Jul 13, 2023	Action Recognition, Contrastive Learning	Unverified
Key Frame Extraction with Attention Based Deep Neural Networks	Jun 21, 2023	Video Retrieval, Video Summarization	Unverified
KPCA Spatio-temporal trajectory point cloud classifier for recognizing human actions in a CBVR system	Mar 26, 2014	Action Recognition, Retrieval	Unverified
Large-Scale Query-by-Image Video Retrieval Using Bloom Filters	Jul 12, 2016	Retrieval, Video Retrieval	Unverified
Large Scale Video Representation Learning via Relational Graph Clustering	Jun 1, 2020	Clustering, Graph Clustering	Unverified
Vision-Language Models Learn Super Images for Efficient Partially Relevant Video Retrieval	Dec 1, 2023	Image Retrieval, Partially Relevant Video Retrieval	Unverified
LASER: A Neuro-Symbolic Framework for Learning Spatial-Temporal Scene Graphs with Weak Supervision	Apr 15, 2023	Language Modeling, Language Modelling	Unverified
LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval	Jul 11, 2022	Representation Learning, Retrieval	Unverified
Learning and Recognizing Human Action from Skeleton Movement with Deep Residual Neural Networks	Mar 21, 2018	Action Recognition, Deep Learning	Unverified
Learning Audio-Video Modalities from Image Captions	Apr 1, 2022	Image Captioning, Retrieval	Unverified
Learning Joint Representations of Videos and Sentences with Web Image Search	Aug 8, 2016	Image Retrieval, Natural Language Queries	Unverified
Learning Language-Visual Embedding for Movie Understanding with Natural-Language	Sep 26, 2016	Multiple-choice, Retrieval	Unverified
Learning Locally-Adaptive Decision Functions for Person Verification	Jun 1, 2013	Face Verification, Metric Learning	Unverified
Learning Segment Similarity and Alignment in Large-Scale Content Based Video Retrieval	Sep 20, 2023	Retrieval, Video Retrieval	Unverified
Learning text-to-video retrieval from image captioning	Apr 26, 2024	Image Captioning, Image Retrieval	Unverified
Learning to Generate Long-term Future Narrations Describing Activities of Daily Living	Mar 3, 2025	Action Anticipation, Decision Making	Unverified
Learning Trajectory-Word Alignments for Video-Language Tasks	Jan 5, 2023	Question Answering, Retrieval	Unverified
Learning World Models for Interactive Video Generation	May 28, 2025	In-Context Learning, Retrieval	Unverified
Leveraging Auxiliary Information in Text-to-Video Retrieval: A Review	May 29, 2025	Retrieval, Text to Video Retrieval	Unverified
Leveraging Generative Language Models for Weakly Supervised Sentence Component Analysis in Video-Language Joint Learning	Dec 10, 2023	Language Modeling, Language Modelling	Unverified
Leveraging Modality Tags for Enhanced Cross-Modal Video Retrieval	Apr 2, 2025	cross-modal alignment, Retrieval	Unverified
LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling	Oct 21, 2022	Language Modeling, Language Modelling	Unverified
Live Laparoscopic Video Retrieval with Compressed Uncertainty	Mar 8, 2022	Retrieval, Video Retrieval	Unverified
LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning	Mar 4, 2025	Contrastive Learning, Image-text Retrieval	Unverified
Long-VMNet: Accelerating Long-Form Video Understanding via Fixed Memory	Mar 17, 2025	Form, GPU	Unverified
Lost Your Style? Navigating with Semantic-Level Approach for Text-to-Outfit Retrieval	Nov 3, 2023	Recommendation Systems, Retrieval	Unverified
MAGMaR Shared Task System Description: Video Retrieval with OmniEmbed	Jun 11, 2025	Retrieval, Video Retrieval	Unverified
MarineVRS: Marine Video Retrieval System with Explainability via Semantic Understanding	Jun 7, 2023	Retrieval, Sentence	Unverified
Masked Contrastive Pre-Training for Efficient Video-Text Retrieval	Dec 2, 2022	Image-text Retrieval, Retrieval	Unverified
Masking Modalities for Cross-modal Video Retrieval	Nov 1, 2021	Retrieval, Video Retrieval	Unverified
Mask to reconstruct: Cooperative Semantics Completion for Video-text Retrieval	May 13, 2023	Retrieval, Text Retrieval	Unverified
MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization	Mar 14, 2022	Retrieval, Text to Video Retrieval	Unverified
MERLIN: Multimodal Embedding Refinement via LLM-based Iterative Navigation for Text-Video Retrieval-Rerank Pipeline	Jul 17, 2024	Question Answering, Retrieval	Unverified
Modality-Balanced Embedding for Video Retrieval	Apr 18, 2022	Retrieval, Text Matching	Unverified
Motion Sensitive Contrastive Learning for Self-supervised Video Representation	Aug 12, 2022	Contrastive Learning, Representation Learning	Unverified


Datasets: MSR-VTT-1kA, DiDeMo, MSR-VTT, LSMDC, ActivityNet, MSVD, YouCook2, FIVR-200K, VATEX, QuerYD, SSv2-label retrieval, SSv2-template retrieval

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	OmniVec	text-to-video R@10	89.4	—	Unverified
2	CLIP4Clip	text-to-video R@10	81.6	—	Unverified
3	OmniVec (pretrained)	text-to-video R@10	78.6	—	Unverified
4	HunYuan_tvr (huge)	text-to-video R@1	62.9	—	Unverified
5	CLIP-ViP	text-to-video R@1	57.7	—	Unverified
6	PIDRo	text-to-video R@1	55.9	—	Unverified
7	DMAE (ViT-B/16)	text-to-video R@1	55.5	—	Unverified
8	HunYuan_tvr	text-to-video R@1	55	—	Unverified
9	MuLTI	text-to-video R@1	54.7	—	Unverified
10	EERCF	text-to-video R@1	54.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Aurora (ours, r=64)	text-to-video R@5	77.4	—	Unverified
2	InternVideo2-6B	text-to-video R@1	74.2	—	Unverified
3	vid-TLDR (UMT-L)	text-to-video R@1	72.3	—	Unverified
4	VAST	text-to-video R@1	72	—	Unverified
5	COSA	text-to-video R@1	70.5	—	Unverified
6	UMT-L (ViT-L/16)	text-to-video R@1	70.4	—	Unverified
7	GRAM	text-to-video R@1	67.3	—	Unverified
8	VALOR	text-to-video R@1	61.5	—	Unverified
9	TESTA (ViT-B/16)	text-to-video R@1	61.2	—	Unverified
10	VindLU	text-to-video R@1	61.2	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GRAM	text-to-video R@1	64	—	Unverified
2	VAST	text-to-video R@1	63.9	—	Unverified
3	InternVideo2-6B	text-to-video R@1	62.8	—	Unverified
4	VALOR	text-to-video R@1	59.9	—	Unverified
5	UMT-L (ViT-L/16)	text-to-video R@1	58.8	—	Unverified
6	vid-TLDR (UMT-L)	text-to-video R@1	58.1	—	Unverified
7	COSA	text-to-video R@1	57.9	—	Unverified
8	InternVideo2-6B	text-to-video R@1	55.9	—	Unverified
9	InternVideo	text-to-video R@1	55.2	—	Unverified
10	VLAB	text-to-video R@1	55.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	EMCL-Net (Ours)++ LSMDC Rohrbach et al. (2015)	text-to-video R@10	53.7	—	Unverified
2	InternVideo2-6B	text-to-video R@1	46.4	—	Unverified
3	vid-TLDR (UMT-L)	text-to-video R@1	43.1	—	Unverified
4	UMT-L (ViT-L/16)	text-to-video R@1	43	—	Unverified
5	HunYuan_tvr (huge)	text-to-video R@1	40.4	—	Unverified
6	COSA	text-to-video R@1	39.4	—	Unverified
7	mPLUG-2	text-to-video R@1	34.4	—	Unverified
8	VALOR	text-to-video R@1	34.2	—	Unverified
9	InternVideo	text-to-video R@1	34	—	Unverified
10	InternVideo2-6B	text-to-video R@1	33.8	—	Unverified