Video Retrieval

Given a text query and a pool of candidate videos, the goal of video retrieval is to select the video that matches the query. Systems typically return a ranked list of candidates, which is scored with document retrieval metrics such as recall at K (R@K).
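As a rough illustration of how a ranked candidate list can be scored, the sketch below computes recall at K (the R@1, R@5, and R@10 figures reported in the benchmark tables) and median rank from a toy query-by-video similarity matrix. The matrix values, the function names, the single-ground-truth-per-query setup, and the tie-breaking rule are all assumptions made for this example; they are not taken from any listed paper.

```python
def rank_of_match(scores, target):
    """Return the 1-based rank of the ground-truth video for one query.

    `scores` holds one similarity score per candidate video and `target`
    is the index of the correct video. Ties are resolved in favor of the
    target, which is an assumption of this sketch.
    """
    target_score = scores[target]
    return 1 + sum(1 for s in scores if s > target_score)


def recall_at_k(ranks, k):
    """Percentage of queries whose correct video appears in the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def median_rank(ranks):
    """Median position of the correct video across all queries."""
    ordered = sorted(ranks)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return float(ordered[mid])
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical similarity matrix: row i is query i, column j is video j.
# Assumption: the correct video for query i is video i.
similarity = [
    [0.9, 0.1, 0.3, 0.2],
    [0.4, 0.5, 0.8, 0.1],
    [0.2, 0.7, 0.6, 0.9],
    [0.1, 0.2, 0.3, 0.8],
]

ranks = [rank_of_match(row, i) for i, row in enumerate(similarity)]

print("ranks:", ranks)
print("R@1:", recall_at_k(ranks, 1))
print("R@2:", recall_at_k(ranks, 2))
print("median rank:", median_rank(ranks))
```

Higher R@K is better, while a lower median rank is better. Published evaluations may use different tie handling or allow several relevant videos per query, so this should be read as a simplified view of the metric.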

Papers

Showing 301–350 of 486 papers

Title	Date	Tasks	Status
Domain Adaptation in Multi-View Embedding for Cross-Modal Video Retrieval	Oct 25, 2021	Domain Adaptation, Retrieval	Unverified
Dual-Stream Knowledge-Preserving Hashing for Unsupervised Video Retrieval	Oct 12, 2023	Retrieval, Semantic Retrieval	Unverified
EA-VTR: Event-Aware Video-Text Retrieval	Jul 10, 2024	Action Recognition, Contrastive Learning	Unverified
Efficient Action Detection in Untrimmed Videos via Multi-Task Learning	Dec 22, 2016	Action Detection, Action Localization	Unverified
Efficient video indexing for monitoring disease activity and progression in the upper gastrointestinal tract	May 10, 2019	Image Retrieval, Retrieval	Unverified
Ego-Surfing: Person Localization in First-Person Videos Using Ego-Motion Signatures	Jun 15, 2016	Clustering, Retrieval	Unverified
Empowering Agentic Video Analytics Systems with Video Language Models	May 1, 2025	Knowledge Graphs, RAG	Unverified
Encode the Unseen: Predictive Video Hashing for Scalable Mid-Stream Retrieval	Sep 30, 2020	Retrieval, Video Retrieval	Unverified
End-to-end Concept Word Detection for Video Captioning, Retrieval, and Question Answering	Oct 10, 2016	Language Modeling, Language Modelling	Unverified
End-to-end Generative Pretraining for Multimodal Video Captioning	Jan 20, 2022	Action Classification, Decoder	Unverified
Enhanced Multimodal Representation Learning with Cross-modal KD	Jun 13, 2023	Contrastive Learning, Emotion Classification	Unverified
Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models	Apr 29, 2024	Image Retrieval, Language Modeling	Unverified
Event-aware Video Corpus Moment Retrieval	Feb 21, 2024	Contrastive Learning, Moment Retrieval	Unverified
Event Extraction in Video Transcripts	Oct 1, 2022	Articles, Event Extraction	Unverified
E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer	Nov 28, 2023	Language Modeling, Language Modelling	Unverified
ExpertAF: Expert Actionable Feedback from Video	Aug 1, 2024	Language Modeling, Language Modelling	Unverified
Exploiting Visual Semantic Reasoning for Video-Text Retrieval	Jun 16, 2020	Retrieval, Text Retrieval	Unverified
Exploring Relations in Untrimmed Videos for Self-Supervised Learning	Aug 6, 2020	Action Recognition, Change Detection	Unverified
Face Video Retrieval With Image Query via Hashing Across Euclidean Space and Riemannian Manifold	Jun 1, 2015	Retrieval, Video Retrieval	Unverified
Fighting FIRe with FIRE: Assessing the Validity of Text-to-Video Retrieval Benchmarks	Oct 10, 2022	Retrieval, Text to Video Retrieval	Unverified
Find and Focus: Retrieve and Localize Video Events with Natural Language Queries	Sep 1, 2018	Diversity, Natural Language Queries	Unverified
Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings	Aug 9, 2019	Cross-Modal Retrieval, POS	Unverified
Fine-Grained Instance-Level Sketch-Based Video Retrieval	Feb 21, 2020	Cross-Modal Retrieval, Image Retrieval	Unverified
Fine-grained Text-Video Retrieval with Frozen Image Encoders	Jul 14, 2023	Decoder, Retrieval	Unverified
CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval	Dec 31, 2024	Retrieval, Text Retrieval	Unverified
FMM-X3D: FPGA-based modeling and mapping of X3D for Human Action Recognition	May 29, 2023	Action Recognition, Autonomous Vehicles	Unverified
fpgaHART: A toolflow for throughput-oriented acceleration of 3D CNNs for HAR onto FPGAs	May 31, 2023	Action Recognition, Autonomous Vehicles	Unverified
Free-Form Multi-Modal Multimedia Retrieval (4MR)	Mar 29, 2023	Form, Management	Unverified
Generalizable Multi-linear Attention Network	Dec 1, 2021	Multimodal Sentiment Analysis, Retrieval	Unverified
Generative Ghost: Investigating Ranking Bias Hidden in AI-Generated Videos	Feb 11, 2025	Contrastive Learning, Image Retrieval	Unverified
Generative Semantic Communication: Architectures, Technologies, and Applications	Dec 11, 2024	Retrieval, Semantic Communication	Unverified
Bridging Information Asymmetry in Text-video Retrieval: A Data-centric Approach	Aug 14, 2024	Cross-Modal Retrieval, Language Modeling	Unverified
Grounding Physical Concepts of Objects and Events Through Dynamic Visual Reasoning	Mar 30, 2021	counterfactual, Object	Unverified
Grounding Physical Object and Event Concepts Through Dynamic Visual Reasoning	Jan 1, 2021	counterfactual, Object	Unverified
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training	Dec 30, 2022	cross-modal alignment, TGIF-Action	Unverified
HiVLP: Hierarchical Interactive Video-Language Pre-Training	Jan 1, 2023	Retrieval, Self-Supervised Learning	Unverified
HORUS: Multimodal Large Language Models Framework for Video Retrieval at VBS 2025	Jan 1, 2025	Image Retrieval, Retrieval	Unverified
Human Action Recognition and Prediction: A Survey	Jun 28, 2018	Action Recognition, Autonomous Driving	Unverified
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations	Apr 7, 2022	Contrastive Learning, Denoising	Unverified
Improving Video Retrieval by Adaptive Margin	Mar 9, 2023	Retrieval, Video Retrieval	Unverified
MuMUR: Multilingual Multimodal Universal Retrieval	Aug 24, 2022	Image Retrieval, Machine Translation	Unverified
Induce, Edit, Retrieve: Language Grounded Multimodal Schema for Instructional Video Retrieval	Nov 17, 2021	Retrieval, Video Retrieval	Unverified
Interactive Video Retrieval with Dialog	May 7, 2019	Retrieval, Video Retrieval	Unverified
Key Frame Extraction with Attention Based Deep Neural Networks	Jun 21, 2023	Video Retrieval, Video Summarization	Unverified
KPCA Spatio-temporal trajectory point cloud classifier for recognizing human actions in a CBVR system	Mar 26, 2014	Action Recognition, Retrieval	Unverified
Large-Scale Query-by-Image Video Retrieval Using Bloom Filters	Jul 12, 2016	Retrieval, Video Retrieval	Unverified
Large Scale Video Representation Learning via Relational Graph Clustering	Jun 1, 2020	Clustering, Graph Clustering	Unverified
Vision-Language Models Learn Super Images for Efficient Partially Relevant Video Retrieval	Dec 1, 2023	Image Retrieval, Partially Relevant Video Retrieval	Unverified
LASER: A Neuro-Symbolic Framework for Learning Spatial-Temporal Scene Graphs with Weak Supervision	Apr 15, 2023	Language Modeling, Language Modelling	Unverified
LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval	Jul 11, 2022	Representation Learning, Retrieval	Unverified

Datasets: MSR-VTT-1kA, DiDeMo, MSR-VTT, LSMDC, ActivityNet, MSVD, YouCook2, FIVR-200K, VATEX, QuerYD, SSv2-label retrieval, SSv2-template retrieval

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	OmniVec	text-to-video R@10	89.4	—	Unverified
2	CLIP4Clip	text-to-video R@10	81.6	—	Unverified
3	OmniVec (pretrained)	text-to-video R@10	78.6	—	Unverified
4	HunYuan_tvr (huge)	text-to-video R@1	62.9	—	Unverified
5	CLIP-ViP	text-to-video R@1	57.7	—	Unverified
6	PIDRo	text-to-video R@1	55.9	—	Unverified
7	DMAE (ViT-B/16)	text-to-video R@1	55.5	—	Unverified
8	HunYuan_tvr	text-to-video R@1	55	—	Unverified
9	MuLTI	text-to-video R@1	54.7	—	Unverified
10	EERCF	text-to-video R@1	54.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Aurora (ours, r=64)	text-to-video R@5	77.4	—	Unverified
2	InternVideo2-6B	text-to-video R@1	74.2	—	Unverified
3	vid-TLDR (UMT-L)	text-to-video R@1	72.3	—	Unverified
4	VAST	text-to-video R@1	72	—	Unverified
5	COSA	text-to-video R@1	70.5	—	Unverified
6	UMT-L (ViT-L/16)	text-to-video R@1	70.4	—	Unverified
7	GRAM	text-to-video R@1	67.3	—	Unverified
8	VALOR	text-to-video R@1	61.5	—	Unverified
9	TESTA (ViT-B/16)	text-to-video R@1	61.2	—	Unverified
10	VindLU	text-to-video R@1	61.2	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GRAM	text-to-video R@1	64	—	Unverified
2	VAST	text-to-video R@1	63.9	—	Unverified
3	InternVideo2-6B	text-to-video R@1	62.8	—	Unverified
4	VALOR	text-to-video R@1	59.9	—	Unverified
5	UMT-L (ViT-L/16)	text-to-video R@1	58.8	—	Unverified
6	vid-TLDR (UMT-L)	text-to-video R@1	58.1	—	Unverified
7	COSA	text-to-video R@1	57.9	—	Unverified
8	InternVideo2-6B	text-to-video R@1	55.9	—	Unverified
9	InternVideo	text-to-video R@1	55.2	—	Unverified
10	VLAB	text-to-video R@1	55.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	EMCL-Net (Ours)++ LSMDC Rohrbach et al. (2015)	text-to-video R@10	53.7	—	Unverified
2	InternVideo2-6B	text-to-video R@1	46.4	—	Unverified
3	vid-TLDR (UMT-L)	text-to-video R@1	43.1	—	Unverified
4	UMT-L (ViT-L/16)	text-to-video R@1	43	—	Unverified
5	HunYuan_tvr (huge)	text-to-video R@1	40.4	—	Unverified
6	COSA	text-to-video R@1	39.4	—	Unverified
7	mPLUG-2	text-to-video R@1	34.4	—	Unverified
8	VALOR	text-to-video R@1	34.2	—	Unverified
9	InternVideo	text-to-video R@1	34	—	Unverified
10	InternVideo2-6B	text-to-video R@1	33.8	—	Unverified