Video Retrieval

The objective of video retrieval is as follows: given a text query and a pool of candidate videos, select the video which corresponds to the text query. Typically, the videos are returned as a ranked list of candidates and scored via document retrieval metrics.

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 226–250 of 486 papers

Title	Date	Tasks	Status	Score
Hashing with Mutual Information	Mar 2, 2018	Image RetrievalRetrieval	CodeCode Available	5
A Challenge to Build Neuro-Symbolic Video Agents	May 20, 2025	Scene ClassificationVideo Retrieval	CodeCode Available	5
TokenBinder: Text-Video Retrieval with One-to-Many Alignment Paradigm	Sep 30, 2024	RetrievalVideo Retrieval	CodeCode Available	5
LAMV: Learning to Align and Match Videos With Kernelized Temporal Layers	Jun 1, 2018	Copy DetectionRetrieval	CodeCode Available	5
MAMA: Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning	Jul 4, 2024	Language ModelingLanguage Modelling	CodeCode Available	5
Language-Conditioned Change-point Detection to Identify Sub-Tasks in Robotics Domains	Sep 1, 2023	Change Point DetectionInstruction Following	CodeCode Available	5
Towards Efficient Partially Relevant Video Retrieval with Active Moment Discovering	Apr 15, 2025	Partially Relevant Video RetrievalRetrieval	CodeCode Available	5
Vision Transformer Based Video Hashing Retrieval for Tracing the Source of Fake Videos	Dec 15, 2021	RetrievalTriplet	CodeCode Available	5
FIVR: Fine-grained Incident Video Retrieval	Sep 11, 2018	BenchmarkingRetrieval	CodeCode Available	5
You were saying? - Spoken Language in the V3C Dataset	Dec 15, 2022	RetrievalVideo Retrieval	CodeCode Available	5
AMIL: Adversarial Multi Instance Learning for Human Pose Estimation	Mar 18, 2020	Multiple Instance LearningPose Estimation	CodeCode Available	5
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language	Apr 1, 2022	DiversityImage Captioning	CodeCode Available	5
Zorro: the masked multimodal transformer	Jan 23, 2023	Audio TaggingMultimodal Deep Learning	CodeCode Available	5
MSVD-Indonesian: A Benchmark for Multimodal Video-Text Tasks in Indonesian	Jun 20, 2023	Cross-Lingual TransferRetrieval	CodeCode Available	5
Inter-intra Variant Dual Representations forSelf-supervised Video Recognition	Jul 2, 2021	Contrastive LearningRepresentation Learning	CodeCode Available	5
A Joint Sequence Fusion Model for Video Question Answering and Retrieval	Aug 7, 2018	DecoderMultiple-choice	CodeCode Available	5
Robustness Analysis of Video-Language Models Against Visual and Language Perturbations	Jul 5, 2022	Language ModelingLanguage Modelling	CodeCode Available	5
Learning to Locate Visual Answer in Video Corpus Using Question	Oct 11, 2022	Contrastive LearningLanguage Modelling	CodeCode Available	5
Learning to Retrieve Videos by Asking Questions	May 11, 2022	AI AgentRetrieval	CodeCode Available	5
Dialogue-to-Video Retrieval	Mar 23, 2023	Recommendation SystemsRetrieval	CodeCode Available	5
Efficient Cross-Modal Video Retrieval with Meta-Optimized Frames	Oct 16, 2022	Bilevel OptimizationRetrieval	CodeCode Available	5
Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer	Feb 4, 2023	Computational EfficiencyQuestion Answering	CodeCode Available	5
Circulant temporal encoding for video retrieval and temporal alignment	Jun 8, 2015	RetrievalVideo Retrieval	CodeCode Available	5
ECO: Efficient Convolutional Network for Online Video Understanding	Apr 24, 2018	Action ClassificationAction Recognition	CodeCode Available	5
ICSVR: Investigating Compositional and Syntactic Understanding in Video Retrieval Models	Jun 28, 2023	RetrievalVideo Retrieval	CodeCode Available	5

Show:10 25 50

← PrevPage 10 of 20Next →

All datasets MSR-VTT-1kA DiDeMo MSR-VTT LSMDC ActivityNet MSVD YouCook2 FIVR-200K VATEX QuerYD SSv2-label retrieval SSv2-template retrieval

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	OmniVec	text-to-video R@10	89.4	—	Unverified
2	CLIP4Clip	text-to-video R@10	81.6	—	Unverified
3	OmniVec (pretrained)	text-to-video R@10	78.6	—	Unverified
4	HunYuan_tvr (huge)	text-to-video R@1	62.9	—	Unverified
5	CLIP-ViP	text-to-video R@1	57.7	—	Unverified
6	PIDRo	text-to-video R@1	55.9	—	Unverified
7	DMAE (ViT-B/16)	text-to-video R@1	55.5	—	Unverified
8	HunYuan_tvr	text-to-video R@1	55	—	Unverified
9	MuLTI	text-to-video R@1	54.7	—	Unverified
10	EERCF	text-to-video R@1	54.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Aurora (ours, r=64)	text-to-video R@5	77.4	—	Unverified
2	InternVideo2-6B	text-to-video R@1	74.2	—	Unverified
3	vid-TLDR (UMT-L)	text-to-video R@1	72.3	—	Unverified
4	VAST	text-to-video R@1	72	—	Unverified
5	COSA	text-to-video R@1	70.5	—	Unverified
6	UMT-L (ViT-L/16)	text-to-video R@1	70.4	—	Unverified
7	GRAM	text-to-video R@1	67.3	—	Unverified
8	VALOR	text-to-video R@1	61.5	—	Unverified
9	TESTA (ViT-B/16)	text-to-video R@1	61.2	—	Unverified
10	VindLU	text-to-video R@1	61.2	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GRAM	text-to-video R@1	64	—	Unverified
2	VAST	text-to-video R@1	63.9	—	Unverified
3	InternVideo2-6B	text-to-video R@1	62.8	—	Unverified
4	VALOR	text-to-video R@1	59.9	—	Unverified
5	UMT-L (ViT-L/16)	text-to-video R@1	58.8	—	Unverified
6	vid-TLDR (UMT-L)	text-to-video R@1	58.1	—	Unverified
7	COSA	text-to-video R@1	57.9	—	Unverified
8	InternVideo2-6B	text-to-video R@1	55.9	—	Unverified
9	InternVideo	text-to-video R@1	55.2	—	Unverified
10	VLAB	text-to-video R@1	55.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	EMCL-Net (Ours)++ LSMDC Rohrbach et al. (2015)	text-to-video R@10	53.7	—	Unverified
2	InternVideo2-6B	text-to-video R@1	46.4	—	Unverified
3	vid-TLDR (UMT-L)	text-to-video R@1	43.1	—	Unverified
4	UMT-L (ViT-L/16)	text-to-video R@1	43	—	Unverified
5	HunYuan_tvr (huge)	text-to-video R@1	40.4	—	Unverified
6	COSA	text-to-video R@1	39.4	—	Unverified
7	mPLUG-2	text-to-video R@1	34.4	—	Unverified
8	VALOR	text-to-video R@1	34.2	—	Unverified
9	InternVideo	text-to-video R@1	34	—	Unverified
10	InternVideo2-6B	text-to-video R@1	33.8	—	Unverified