Video Retrieval

The objective of video retrieval is as follows: given a text query and a pool of candidate videos, select the video that corresponds to the query. Systems typically return the candidates as a ranked list, which is scored with document retrieval metrics such as Recall@K (R@K).
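As an illustration of how a ranked candidate list might be scored, here is a minimal sketch of text-to-video Recall@K, the R@1 / R@5 / R@10 metric reported in the benchmark tables on this page. It assumes a square similarity matrix in which row i is a text query, column j is a candidate video, and the correct video for query i sits at column i. The function name `recall_at_k` and the toy matrix `toy_sim` are hypothetical and not taken from any listed paper; individual benchmarks may define their evaluation protocol differently.

```python
import numpy as np


def recall_at_k(sim, k):
    """Percentage of queries whose correct video is ranked in the top k.

    Assumption: sim[i, j] is the similarity between text query i and
    video j, and the ground-truth video for query i is video i.
    """
    sim = np.asarray(sim, dtype=float)
    # Similarity of each query to its own (correct) video.
    correct = np.diag(sim)
    # Zero-based rank: how many videos score strictly higher than the correct one.
    ranks = (sim > correct[:, None]).sum(axis=1)
    return 100.0 * float(np.mean(ranks < k))


# Toy example: 3 text queries against 3 candidate videos (made-up scores).
toy_sim = [
    [0.9, 0.1, 0.2],  # correct video ranked first
    [0.8, 0.3, 0.5],  # correct video ranked third
    [0.1, 0.7, 0.6],  # correct video ranked second
]

for k in (1, 2, 3):
    print(f"R@{k}: {recall_at_k(toy_sim, k):.1f}")
```

Ties are resolved in favor of the correct video here because only strictly higher scores count against it; a stricter evaluation could count ties as misses.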

Papers


Showing 301–325 of 486 papers

Title	Date	Tasks	Status
Efficient video indexing for monitoring disease activity and progression in the upper gastrointestinal tract	May 10, 2019	Image Retrieval, Retrieval	Unverified
Ego-Surfing: Person Localization in First-Person Videos Using Ego-Motion Signatures	Jun 15, 2016	Clustering, Retrieval	Unverified
Empowering Agentic Video Analytics Systems with Video Language Models	May 1, 2025	Knowledge Graphs, RAG	Unverified
Encode the Unseen: Predictive Video Hashing for Scalable Mid-Stream Retrieval	Sep 30, 2020	Retrieval, Video Retrieval	Unverified
End-to-end Concept Word Detection for Video Captioning, Retrieval, and Question Answering	Oct 10, 2016	Language Modeling, Language Modelling	Unverified
End-to-end Generative Pretraining for Multimodal Video Captioning	Jan 20, 2022	Action Classification, Decoder	Unverified
Enhanced Multimodal Representation Learning with Cross-modal KD	Jun 13, 2023	Contrastive Learning, Emotion Classification	Unverified
Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models	Apr 29, 2024	Image Retrieval, Language Modeling	Unverified
Event-aware Video Corpus Moment Retrieval	Feb 21, 2024	Contrastive Learning, Moment Retrieval	Unverified
Event Extraction in Video Transcripts	Oct 1, 2022	Articles, Event Extraction	Unverified
E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer	Nov 28, 2023	Language Modeling, Language Modelling	Unverified
ExpertAF: Expert Actionable Feedback from Video	Aug 1, 2024	Language Modeling, Language Modelling	Unverified
Exploiting Visual Semantic Reasoning for Video-Text Retrieval	Jun 16, 2020	Retrieval, Text Retrieval	Unverified
Exploring Relations in Untrimmed Videos for Self-Supervised Learning	Aug 6, 2020	Action Recognition, Change Detection	Unverified
Face Video Retrieval With Image Query via Hashing Across Euclidean Space and Riemannian Manifold	Jun 1, 2015	Retrieval, Video Retrieval	Unverified
Fighting FIRe with FIRE: Assessing the Validity of Text-to-Video Retrieval Benchmarks	Oct 10, 2022	Retrieval, Text to Video Retrieval	Unverified
Find and Focus: Retrieve and Localize Video Events with Natural Language Queries	Sep 1, 2018	Diversity, Natural Language Queries	Unverified
Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings	Aug 9, 2019	Cross-Modal Retrieval, POS	Unverified
Fine-Grained Instance-Level Sketch-Based Video Retrieval	Feb 21, 2020	Cross-Modal Retrieval, Image Retrieval	Unverified
Fine-grained Text-Video Retrieval with Frozen Image Encoders	Jul 14, 2023	Decoder, Retrieval	Unverified
CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval	Dec 31, 2024	Retrieval, Text Retrieval	Unverified
FMM-X3D: FPGA-based modeling and mapping of X3D for Human Action Recognition	May 29, 2023	Action Recognition, Autonomous Vehicles	Unverified
fpgaHART: A toolflow for throughput-oriented acceleration of 3D CNNs for HAR onto FPGAs	May 31, 2023	Action Recognition, Autonomous Vehicles	Unverified
Free-Form Multi-Modal Multimedia Retrieval (4MR)	Mar 29, 2023	Form, Management	Unverified
Generalizable Multi-linear Attention Network	Dec 1, 2021	Multimodal Sentiment Analysis, Retrieval	Unverified


Datasets: MSR-VTT-1kA, DiDeMo, MSR-VTT, LSMDC, ActivityNet, MSVD, YouCook2, FIVR-200K, VATEX, QuerYD, SSv2-label retrieval, SSv2-template retrieval

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	OmniVec	text-to-video R@10	89.4	—	Unverified
2	CLIP4Clip	text-to-video R@10	81.6	—	Unverified
3	OmniVec (pretrained)	text-to-video R@10	78.6	—	Unverified
4	HunYuan_tvr (huge)	text-to-video R@1	62.9	—	Unverified
5	CLIP-ViP	text-to-video R@1	57.7	—	Unverified
6	PIDRo	text-to-video R@1	55.9	—	Unverified
7	DMAE (ViT-B/16)	text-to-video R@1	55.5	—	Unverified
8	HunYuan_tvr	text-to-video R@1	55	—	Unverified
9	MuLTI	text-to-video R@1	54.7	—	Unverified
10	EERCF	text-to-video R@1	54.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Aurora (ours, r=64)	text-to-video R@5	77.4	—	Unverified
2	InternVideo2-6B	text-to-video R@1	74.2	—	Unverified
3	vid-TLDR (UMT-L)	text-to-video R@1	72.3	—	Unverified
4	VAST	text-to-video R@1	72	—	Unverified
5	COSA	text-to-video R@1	70.5	—	Unverified
6	UMT-L (ViT-L/16)	text-to-video R@1	70.4	—	Unverified
7	GRAM	text-to-video R@1	67.3	—	Unverified
8	VALOR	text-to-video R@1	61.5	—	Unverified
9	TESTA (ViT-B/16)	text-to-video R@1	61.2	—	Unverified
10	VindLU	text-to-video R@1	61.2	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GRAM	text-to-video R@1	64	—	Unverified
2	VAST	text-to-video R@1	63.9	—	Unverified
3	InternVideo2-6B	text-to-video R@1	62.8	—	Unverified
4	VALOR	text-to-video R@1	59.9	—	Unverified
5	UMT-L (ViT-L/16)	text-to-video R@1	58.8	—	Unverified
6	vid-TLDR (UMT-L)	text-to-video R@1	58.1	—	Unverified
7	COSA	text-to-video R@1	57.9	—	Unverified
8	InternVideo2-6B	text-to-video R@1	55.9	—	Unverified
9	InternVideo	text-to-video R@1	55.2	—	Unverified
10	VLAB	text-to-video R@1	55.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	EMCL-Net (Ours)++ LSMDC Rohrbach et al. (2015)	text-to-video R@10	53.7	—	Unverified
2	InternVideo2-6B	text-to-video R@1	46.4	—	Unverified
3	vid-TLDR (UMT-L)	text-to-video R@1	43.1	—	Unverified
4	UMT-L (ViT-L/16)	text-to-video R@1	43	—	Unverified
5	HunYuan_tvr (huge)	text-to-video R@1	40.4	—	Unverified
6	COSA	text-to-video R@1	39.4	—	Unverified
7	mPLUG-2	text-to-video R@1	34.4	—	Unverified
8	VALOR	text-to-video R@1	34.2	—	Unverified
9	InternVideo	text-to-video R@1	34	—	Unverified
10	InternVideo2-6B	text-to-video R@1	33.8	—	Unverified