Video Retrieval

The objective of video retrieval is as follows: given a text query and a pool of candidate videos, select the video that corresponds to the query. Typically, the candidates are returned as a ranked list and scored with document retrieval metrics such as Recall@K (R@K), the percentage of queries whose correct video appears among the top K results.
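As a rough illustration of how a ranked list can be scored, the sketch below computes Recall@K from a query-by-video score matrix. It is a minimal sketch, not any particular benchmark's evaluation code: the function name `recall_at_k`, the assumption that video i is the single correct match for query i, and the toy scores are all hypothetical.

```python
def recall_at_k(similarity, k):
    """Percentage of queries whose correct video is ranked in the top k.

    Assumptions (illustrative only): similarity[i][j] is the score of
    video j for query i, and video i is the one correct match for query i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank: how many videos score strictly higher than
        # the correct one. Ties are resolved in the query's favour.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy scores for 3 queries and 3 videos (made-up numbers).
scores = [
    [0.9, 0.1, 0.3],  # query 0: correct video is ranked 1st
    [0.8, 0.5, 0.2],  # query 1: correct video is ranked 2nd
    [0.7, 0.6, 0.4],  # query 2: correct video is ranked 3rd
]

for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(scores, k):.1f}")
```

Because R@K can only grow as K increases, scores reported at different K (for example R@1 versus R@10) are not directly comparable.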

Papers

Showing 201–250 of 486 papers

Title	Date	Tasks	Status
Clarification of Video Retrieval Query Results by the Automated Insertion of Supporting Shots	Feb 19, 2021	Retrieval, Video Editing	Unverified
Classroom Video Assessment and Retrieval via Multiple Instance Learning	Mar 25, 2014	Multiple Instance Learning, Retrieval	Unverified
CLIP2TV: Align, Match and Distill for Video-Text Retrieval	Nov 10, 2021	Representation Learning, Retrieval	Unverified
CLOP: Video-and-Language Pre-Training with Knowledge Regularizations	Nov 7, 2022	Contrastive Learning, Retrieval	Unverified
CMAWRNet: Multiple Adverse Weather Removal via a Unified Quaternion Neural Architecture	May 3, 2025	Autonomous Driving, Benchmarking	Unverified
CNN Retrieval based Unsupervised Metric Learning for Near-Duplicated Video Retrieval	May 30, 2021	Metric Learning, Re-Ranking	Unverified
Coarse to Fine: Video Retrieval before Moment Localization	Oct 14, 2021	Moment Retrieval, Retrieval	Unverified
CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing	Jan 22, 2024	AudioCaps, Audio-Visual Synchronization	Unverified
Colo-SCRL: Self-Supervised Contrastive Representation Learning for Colonoscopic Video Retrieval	Mar 28, 2023	Action Recognition, Contrastive Learning	Unverified
Contrastive Video-Language Learning with Fine-grained Frame Sampling	Oct 10, 2022	Question Answering, Representation Learning	Unverified
Controllable Augmentations for Video Representation Learning	Mar 30, 2022	Action Recognition, Contrastive Learning	Unverified
COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval	Apr 15, 2022	Contrastive Learning, Cross-Modal Retrieval	Unverified
CREATE: A Benchmark for Chinese Short Video Retrieval and Title Generation	Nov 16, 2021	Retrieval, Video Captioning	Unverified
CREATE: A Benchmark for Chinese Short Video Retrieval and Title Generation	Mar 31, 2022	Retrieval, Video Captioning	Unverified
CUPID: Adaptive Curation of Pre-training Data for Video-and-Language Representation Learning	Apr 1, 2021	Question Answering, Representation Learning	Unverified
Deep Heterogeneous Hashing for Face Video Retrieval	Nov 4, 2019	Retrieval, Video Retrieval	Unverified
Deep Learning Based Semantic Video Indexing and Retrieval	Jan 28, 2016	Deep Learning, Retrieval	Unverified
De-Hashing: Server-Side Context-Aware Feature Reconstruction for Mobile Visual Search	Jun 29, 2016	Retrieval, Video Retrieval	Unverified
Detours for Navigating Instructional Videos	Jan 3, 2024	16k, Question Answering	Unverified
Discrete Wavelet Transform and Gradient Difference based approach for text localization in videos	Feb 24, 2015	Retrieval, Text Detection	Unverified
Distilling Vision-Language Models on Millions of Videos	Jan 11, 2024	Language Modeling, Language Modelling	Unverified
Domain Adaptation in Multi-View Embedding for Cross-Modal Video Retrieval	Oct 25, 2021	Domain Adaptation, Retrieval	Unverified
Dual-Stream Knowledge-Preserving Hashing for Unsupervised Video Retrieval	Oct 12, 2023	Retrieval, Semantic Retrieval	Unverified
EA-VTR: Event-Aware Video-Text Retrieval	Jul 10, 2024	Action Recognition, Contrastive Learning	Unverified
Efficient Action Detection in Untrimmed Videos via Multi-Task Learning	Dec 22, 2016	Action Detection, Action Localization	Unverified
Efficient video indexing for monitoring disease activity and progression in the upper gastrointestinal tract	May 10, 2019	Image Retrieval, Retrieval	Unverified
Ego-Surfing: Person Localization in First-Person Videos Using Ego-Motion Signatures	Jun 15, 2016	Clustering, Retrieval	Unverified
Empowering Agentic Video Analytics Systems with Video Language Models	May 1, 2025	Knowledge Graphs, RAG	Unverified
Encode the Unseen: Predictive Video Hashing for Scalable Mid-Stream Retrieval	Sep 30, 2020	Retrieval, Video Retrieval	Unverified
End-to-end Concept Word Detection for Video Captioning, Retrieval, and Question Answering	Oct 10, 2016	Language Modeling, Language Modelling	Unverified
End-to-end Generative Pretraining for Multimodal Video Captioning	Jan 20, 2022	Action Classification, Decoder	Unverified
Enhanced Multimodal Representation Learning with Cross-modal KD	Jun 13, 2023	Contrastive Learning, Emotion Classification	Unverified
Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models	Apr 29, 2024	Image Retrieval, Language Modeling	Unverified
Event-aware Video Corpus Moment Retrieval	Feb 21, 2024	Contrastive Learning, Moment Retrieval	Unverified
Event Extraction in Video Transcripts	Oct 1, 2022	Articles, Event Extraction	Unverified
E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer	Nov 28, 2023	Language Modeling, Language Modelling	Unverified
ExpertAF: Expert Actionable Feedback from Video	Aug 1, 2024	Language Modeling, Language Modelling	Unverified
Exploiting Visual Semantic Reasoning for Video-Text Retrieval	Jun 16, 2020	Retrieval, Text Retrieval	Unverified
Exploring Relations in Untrimmed Videos for Self-Supervised Learning	Aug 6, 2020	Action Recognition, Change Detection	Unverified
Face Video Retrieval With Image Query via Hashing Across Euclidean Space and Riemannian Manifold	Jun 1, 2015	Retrieval, Video Retrieval	Unverified
Fighting FIRe with FIRE: Assessing the Validity of Text-to-Video Retrieval Benchmarks	Oct 10, 2022	Retrieval, Text to Video Retrieval	Unverified
Find and Focus: Retrieve and Localize Video Events with Natural Language Queries	Sep 1, 2018	Diversity, Natural Language Queries	Unverified
Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings	Aug 9, 2019	Cross-Modal Retrieval, POS	Unverified
Fine-Grained Instance-Level Sketch-Based Video Retrieval	Feb 21, 2020	Cross-Modal Retrieval, Image Retrieval	Unverified
Fine-grained Text-Video Retrieval with Frozen Image Encoders	Jul 14, 2023	Decoder, Retrieval	Unverified
CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval	Dec 31, 2024	Retrieval, Text Retrieval	Unverified
FMM-X3D: FPGA-based modeling and mapping of X3D for Human Action Recognition	May 29, 2023	Action Recognition, Autonomous Vehicles	Unverified
fpgaHART: A toolflow for throughput-oriented acceleration of 3D CNNs for HAR onto FPGAs	May 31, 2023	Action Recognition, Autonomous Vehicles	Unverified
Free-Form Multi-Modal Multimedia Retrieval (4MR)	Mar 29, 2023	Form, Management	Unverified
Generalizable Multi-linear Attention Network	Dec 1, 2021	Multimodal Sentiment Analysis, Retrieval	Unverified

Datasets: MSR-VTT-1kA, DiDeMo, MSR-VTT, LSMDC, ActivityNet, MSVD, YouCook2, FIVR-200K, VATEX, QuerYD, SSv2-label retrieval, SSv2-template retrieval

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	OmniVec	text-to-video R@10	89.4	—	Unverified
2	CLIP4Clip	text-to-video R@10	81.6	—	Unverified
3	OmniVec (pretrained)	text-to-video R@10	78.6	—	Unverified
4	HunYuan_tvr (huge)	text-to-video R@1	62.9	—	Unverified
5	CLIP-ViP	text-to-video R@1	57.7	—	Unverified
6	PIDRo	text-to-video R@1	55.9	—	Unverified
7	DMAE (ViT-B/16)	text-to-video R@1	55.5	—	Unverified
8	HunYuan_tvr	text-to-video R@1	55	—	Unverified
9	MuLTI	text-to-video R@1	54.7	—	Unverified
10	EERCF	text-to-video R@1	54.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Aurora (ours, r=64)	text-to-video R@5	77.4	—	Unverified
2	InternVideo2-6B	text-to-video R@1	74.2	—	Unverified
3	vid-TLDR (UMT-L)	text-to-video R@1	72.3	—	Unverified
4	VAST	text-to-video R@1	72	—	Unverified
5	COSA	text-to-video R@1	70.5	—	Unverified
6	UMT-L (ViT-L/16)	text-to-video R@1	70.4	—	Unverified
7	GRAM	text-to-video R@1	67.3	—	Unverified
8	VALOR	text-to-video R@1	61.5	—	Unverified
9	TESTA (ViT-B/16)	text-to-video R@1	61.2	—	Unverified
10	VindLU	text-to-video R@1	61.2	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GRAM	text-to-video R@1	64	—	Unverified
2	VAST	text-to-video R@1	63.9	—	Unverified
3	InternVideo2-6B	text-to-video R@1	62.8	—	Unverified
4	VALOR	text-to-video R@1	59.9	—	Unverified
5	UMT-L (ViT-L/16)	text-to-video R@1	58.8	—	Unverified
6	vid-TLDR (UMT-L)	text-to-video R@1	58.1	—	Unverified
7	COSA	text-to-video R@1	57.9	—	Unverified
8	InternVideo2-6B	text-to-video R@1	55.9	—	Unverified
9	InternVideo	text-to-video R@1	55.2	—	Unverified
10	VLAB	text-to-video R@1	55.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	EMCL-Net (Ours)++ LSMDC Rohrbach et al. (2015)	text-to-video R@10	53.7	—	Unverified
2	InternVideo2-6B	text-to-video R@1	46.4	—	Unverified
3	vid-TLDR (UMT-L)	text-to-video R@1	43.1	—	Unverified
4	UMT-L (ViT-L/16)	text-to-video R@1	43	—	Unverified
5	HunYuan_tvr (huge)	text-to-video R@1	40.4	—	Unverified
6	COSA	text-to-video R@1	39.4	—	Unverified
7	mPLUG-2	text-to-video R@1	34.4	—	Unverified
8	VALOR	text-to-video R@1	34.2	—	Unverified
9	InternVideo	text-to-video R@1	34	—	Unverified
10	InternVideo2-6B	text-to-video R@1	33.8	—	Unverified