Video Retrieval

The objective of video retrieval is as follows: given a text query and a pool of candidate videos, select the video which corresponds to the text query. Typically, the videos are returned as a ranked list of candidates and scored via document retrieval metrics.

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 151–200 of 486 papers

Title	Date	Tasks	Status	Hype	Score
COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning	Nov 1, 2020	Cross-Modal RetrievalRepresentation Learning	CodeCode Available	1	5
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling	Jun 14, 2022	DecoderLanguage Modeling	CodeCode Available	1	5
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval	Apr 1, 2021	RetrievalText Retrieval	CodeCode Available	1	5
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model	Jun 15, 2023	Formmodel	CodeCode Available	1	5
Generalized Few-Shot Video Classification with Video Retrieval and Feature Generation	Jul 9, 2020	Few-Shot Image ClassificationFew-Shot Learning	CodeCode Available	1	5
AVLnet: Learning Audio-Visual Language Representations from Instructional Videos	Jun 16, 2020	Automatic Speech RecognitionAutomatic Speech Recognition (ASR)	CodeCode Available	1	5
CoVR-2: Automatic Data Construction for Composed Video Retrieval	Aug 28, 2023	Composed Image Retrieval (CoIR)Composed Video Retrieval (CoVR)	CodeCode Available	1	5
Multi-modal Transformer for Video Retrieval	Jul 21, 2020	Natural Language QueriesRetrieval	CodeCode Available	1	5
GMMFormer: Gaussian-Mixture-Model Based Transformer for Efficient Partially Relevant Video Retrieval	Oct 8, 2023	Partially Relevant Video RetrievalRetrieval	CodeCode Available	1	5
GMMFormer v2: An Uncertainty-aware Framework for Partially Relevant Video Retrieval	May 22, 2024	Partially Relevant Video RetrievalRetrieval	CodeCode Available	1	5
MUSE: Mamba is Efficient Multi-scale Learner for Text-video Retrieval	Aug 20, 2024	MambaNatural Language Queries	CodeCode Available	1	5
Multi-Query Video Retrieval	Jan 10, 2022	RetrievalVideo Retrieval	CodeCode Available	1	5
TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval	Sep 2, 2024	GPURetrieval	CodeCode Available	1	5
Cross-Architecture Self-supervised Video Representation Learning	May 26, 2022	Action RecognitionContrastive Learning	CodeCode Available	1	5
TransRank: Self-supervised Video Representation Learning via Ranking-based Transformation Recognition	May 4, 2022	Action RecognitionRepresentation Learning	CodeCode Available	1	5
Normalized Contrastive Learning for Text-Video Retrieval	Nov 30, 2022	Contrastive LearningCross-Modal Retrieval	CodeCode Available	1	5
Cross Modal Retrieval with Querybank Normalisation	Dec 23, 2021	Cross-Modal RetrievalMetric Learning	CodeCode Available	1	5
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training	May 1, 2020	Language ModelingLanguage Modelling	CodeCode Available	1	5
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling	Sep 4, 2022	Fill MaskOptical Flow Estimation	CodeCode Available	1	5
Hierarchical Video-Moment Retrieval and Step-Captioning	Mar 29, 2023	Information RetrievalMoment Retrieval	CodeCode Available	1	5
DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization	Jun 1, 2021	Question AnsweringRetrieval	CodeCode Available	1	5
Bridging Video-text Retrieval with Multiple Choice Questions	Jan 13, 2022	Action RecognitionLinear evaluation	CodeCode Available	1	5
Holistic Features are almost Sufficient for Text-to-Video Retrieval	Jan 1, 2024	Retrievaltext similarity	CodeCode Available	1	5
Text Proxy: Decomposing Retrieval from a 1-to-N Relationship into N 1-to-1 Relationships for Text-Video Retrieval	Oct 9, 2024	RetrievalText Retrieval	CodeCode Available	1	5
VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling	Nov 24, 2021	Question AnsweringRetrieval	CodeCode Available	1	5
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips	Jun 7, 2019	Action LocalizationLong Video Retrieval (Background Removed)	CodeCode Available	1	5
HowToCaption: Prompting LLMs to Transform Video Annotations at Scale	Oct 7, 2023	Automatic Speech RecognitionVideo Captioning	CodeCode Available	1	5
Enhancing Subsequent Video Retrieval via Vision-Language Models (VLMs)	Mar 21, 2025	Representation LearningRetrieval	CodeCode Available	0	5
T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval	Apr 20, 2021	RetrievalVideo Retrieval	CodeCode Available	0	5
Talking Face Generation by Adversarially Disentangled Audio-Visual Representation	Jul 20, 2018	Face GenerationLip Reading	CodeCode Available	0	5
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language	Apr 1, 2022	DiversityImage Captioning	CodeCode Available	0	5
Semantic Role Aware Correlation Transformer for Text to Video Retrieval	Jun 26, 2022	RetrievalText to Video Retrieval	CodeCode Available	0	5
Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval	Nov 21, 2022	AllRetrieval	CodeCode Available	0	5
Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer	Feb 4, 2023	Computational EfficiencyQuestion Answering	CodeCode Available	0	5
Efficient Cross-Modal Video Retrieval with Meta-Optimized Frames	Oct 16, 2022	Bilevel OptimizationRetrieval	CodeCode Available	0	5
Accommodating Audio Modality in CLIP for Multimodal Processing	Mar 12, 2023	AudioCapsContrastive Learning	CodeCode Available	0	5
Video Logo Retrieval based on local Features	Aug 11, 2018	Image RetrievalRetrieval	CodeCode Available	0	5
ECO: Efficient Convolutional Network for Online Video Understanding	Apr 24, 2018	Action ClassificationAction Recognition	CodeCode Available	0	5
A Joint Sequence Fusion Model for Video Question Answering and Retrieval	Aug 7, 2018	DecoderMultiple-choice	CodeCode Available	0	5
Self-supervised Video Representation Learning with Cascade Positive Retrieval	Jan 20, 2022	Action RecognitionContrastive Learning	CodeCode Available	0	5
Dual Encoding for Zero-Example Video Retrieval	Sep 17, 2018	Ad-hoc video searchRetrieval	CodeCode Available	0	5
Self-supervised Video Representation Learning by Context and Motion Decoupling	Apr 2, 2021	Action RecognitionCPU	CodeCode Available	0	5
SEA: Sentence Encoder Assembly for Video Retrieval by Textual Queries	Nov 24, 2020	Ad-hoc video searchManagement	CodeCode Available	0	5
SEDS: Semantically Enhanced Dual-Stream Encoder for Sign Language Retrieval	Jul 23, 2024	RetrievalSign Language Retrieval	CodeCode Available	0	5
LAMV: Learning to Align and Match Videos With Kernelized Temporal Layers	Jun 1, 2018	Copy DetectionRetrieval	CodeCode Available	0	5
Discriminative Residual Analysis for Image Set Classification with Posture and Age Variations	Aug 23, 2020	General ClassificationRetrieval	CodeCode Available	0	5
Circulant temporal encoding for video retrieval and temporal alignment	Jun 8, 2015	RetrievalVideo Retrieval	CodeCode Available	0	5
Central Similarity Quantization for Efficient Image and Video Retrieval	Aug 1, 2019	QuantizationRetrieval	CodeCode Available	0	5
Rudder: A Cross Lingual Video and Text Retrieval Dataset	Mar 9, 2021	Natural Language QueriesRetrieval	CodeCode Available	0	5
Joint Searching and Grounding: Multi-Granularity Video Content Retrieval	Oct 23, 2023	Contrastive LearningRetrieval	CodeCode Available	0	5

Show:10 25 50

← PrevPage 4 of 10Next →

All datasets MSR-VTT-1kA DiDeMo MSR-VTT LSMDC ActivityNet MSVD YouCook2 FIVR-200K VATEX QuerYD SSv2-label retrieval SSv2-template retrieval

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	OmniVec	text-to-video R@10	89.4	—	Unverified
2	CLIP4Clip	text-to-video R@10	81.6	—	Unverified
3	OmniVec (pretrained)	text-to-video R@10	78.6	—	Unverified
4	HunYuan_tvr (huge)	text-to-video R@1	62.9	—	Unverified
5	CLIP-ViP	text-to-video R@1	57.7	—	Unverified
6	PIDRo	text-to-video R@1	55.9	—	Unverified
7	DMAE (ViT-B/16)	text-to-video R@1	55.5	—	Unverified
8	HunYuan_tvr	text-to-video R@1	55	—	Unverified
9	MuLTI	text-to-video R@1	54.7	—	Unverified
10	EERCF	text-to-video R@1	54.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Aurora (ours, r=64)	text-to-video R@5	77.4	—	Unverified
2	InternVideo2-6B	text-to-video R@1	74.2	—	Unverified
3	vid-TLDR (UMT-L)	text-to-video R@1	72.3	—	Unverified
4	VAST	text-to-video R@1	72	—	Unverified
5	COSA	text-to-video R@1	70.5	—	Unverified
6	UMT-L (ViT-L/16)	text-to-video R@1	70.4	—	Unverified
7	GRAM	text-to-video R@1	67.3	—	Unverified
8	VALOR	text-to-video R@1	61.5	—	Unverified
9	TESTA (ViT-B/16)	text-to-video R@1	61.2	—	Unverified
10	VindLU	text-to-video R@1	61.2	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GRAM	text-to-video R@1	64	—	Unverified
2	VAST	text-to-video R@1	63.9	—	Unverified
3	InternVideo2-6B	text-to-video R@1	62.8	—	Unverified
4	VALOR	text-to-video R@1	59.9	—	Unverified
5	UMT-L (ViT-L/16)	text-to-video R@1	58.8	—	Unverified
6	vid-TLDR (UMT-L)	text-to-video R@1	58.1	—	Unverified
7	COSA	text-to-video R@1	57.9	—	Unverified
8	InternVideo2-6B	text-to-video R@1	55.9	—	Unverified
9	InternVideo	text-to-video R@1	55.2	—	Unverified
10	VLAB	text-to-video R@1	55.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	EMCL-Net (Ours)++ LSMDC Rohrbach et al. (2015)	text-to-video R@10	53.7	—	Unverified
2	InternVideo2-6B	text-to-video R@1	46.4	—	Unverified
3	vid-TLDR (UMT-L)	text-to-video R@1	43.1	—	Unverified
4	UMT-L (ViT-L/16)	text-to-video R@1	43	—	Unverified
5	HunYuan_tvr (huge)	text-to-video R@1	40.4	—	Unverified
6	COSA	text-to-video R@1	39.4	—	Unverified
7	mPLUG-2	text-to-video R@1	34.4	—	Unverified
8	VALOR	text-to-video R@1	34.2	—	Unverified
9	InternVideo	text-to-video R@1	34	—	Unverified
10	InternVideo2-6B	text-to-video R@1	33.8	—	Unverified