Video Retrieval

The objective of video retrieval is as follows: given a text query and a pool of candidate videos, select the video that matches the query. Systems typically return the candidates as a ranked list, which is scored with standard document retrieval metrics such as recall at rank K (R@K).
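
As a rough illustration of how a ranked list is scored, the sketch below computes recall at rank K (the R@1, R@5, and R@10 figures in the benchmark tables on this page) from a query-by-video score matrix. It assumes that query i has exactly one correct video, video i, and that ties are resolved in favor of the correct video; the function names `rank_of_match` and `recall_at_k` and the toy scores are hypothetical, not taken from any listed paper.

```python
from statistics import median


def rank_of_match(scores, match_index):
    """Return the 1-based rank of the correct video for one query.

    scores[j] is the similarity between the query and video j.
    Assumption: ties are broken in favor of the correct video, so the
    rank is one plus the number of videos with a strictly higher score.
    """
    target = scores[match_index]
    return 1 + sum(1 for s in scores if s > target)


def recall_at_k(ranks, k):
    """Percentage of queries whose correct video is ranked in the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


# Toy similarity matrix (made-up numbers): row i is query i, column j is video j.
# Assumption: the correct video for query i is video i.
similarity = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.5],
    [0.1, 0.7, 0.4],
]

ranks = [rank_of_match(row, i) for i, row in enumerate(similarity)]

print("ranks:", ranks)  # [1, 3, 2]
print("R@1:", round(recall_at_k(ranks, 1), 1))
print("R@3:", round(recall_at_k(ranks, 3), 1))
print("median rank:", median(ranks))
```

Published evaluation scripts may differ in how they handle ties or multiple captions per video, so numbers from this sketch are not directly comparable to the leaderboard values.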

Papers

Showing 51–100 of 486 papers

Title	Date	Tasks	Status	Hype	Score
An overview on the evaluated video retrieval tasks at TRECVID 2022	Jun 22, 2023	Ad-hoc video search, Retrieval	Code Available	1	5
Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures	Jul 27, 2023	Automatic Speech Recognition, Contrastive Learning	Code Available	1	5
CLIP2Video: Mastering Video-Text Retrieval via Image CLIP	Jun 21, 2021	Language Modeling, Language Modelling	Code Available	1	5
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval	Apr 18, 2021	Retrieval, Text Retrieval	Code Available	1	5
Florence: A New Foundation Model for Computer Vision	Nov 22, 2021	Action Classification, Action Recognition	Code Available	1	5
A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension	May 5, 2023	Reading Comprehension, Retrieval	Code Available	1	5
Clover: Towards A Unified Video-Language Alignment and Fusion Model	Jul 16, 2022	Language Modeling, Language Modelling	Code Available	1	5
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training	May 1, 2020	Language Modeling, Language Modelling	Code Available	1	5
Align and Prompt: Video-and-Language Pre-training with Entity Prompts	Dec 17, 2021	cross-modal alignment, Entity Alignment	Code Available	1	5
Hierarchical Video-Moment Retrieval and Step-Captioning	Mar 29, 2023	Information Retrieval, Moment Retrieval	Code Available	1	5
AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant	Nov 30, 2021	Question Answering, Retrieval	Code Available	1	5
CoCa: Contrastive Captioners are Image-Text Foundation Models	May 4, 2022	Action Classification, Decoder	Code Available	1	5
Learning video retrieval models with relevance-aware online mining	Mar 16, 2022	Multi-Instance Retrieval, Retrieval	Code Available	1	5
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling	Feb 11, 2021	Question Answering, Retrieval	Code Available	1	5
Marine Video Kit: A New Marine Video Dataset for Content-based Analysis and Retrieval	Sep 23, 2022	Retrieval, Video Retrieval	Code Available	1	5
AVLnet: Learning Audio-Visual Language Representations from Instructional Videos	Jun 16, 2020	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Code Available	1	5
COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning	Nov 1, 2020	Cross-Modal Retrieval, Representation Learning	Code Available	1	5
In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval	Sep 16, 2023	Retrieval, Style Transfer	Code Available	1	5
Dual Learning with Dynamic Knowledge Distillation for Partially Relevant Video Retrieval	Jan 1, 2023	Knowledge Distillation, Language Modelling	Code Available	1	5
ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound	Apr 6, 2022	Retrieval, Text to Video Retrieval	Code Available	1	5
Contrastive Masked Autoencoders for Self-Supervised Video Hashing	Nov 21, 2022	Decoder, Retrieval	Code Available	1	5
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions	Nov 19, 2021	Retrieval, Super-Resolution	Code Available	1	5
EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval	Jul 23, 2024	Re-Ranking, Retrieval	Code Available	1	5
Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning	Sep 20, 2023	Contrastive Learning, Retrieval	Code Available	1	5
Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss	Sep 9, 2021	Mixture-of-Experts, Retrieval	Code Available	1	5
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model	Jun 15, 2023	Formmodel	Code Available	1	5
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling	Jun 14, 2022	Decoder, Language Modeling	Code Available	1	5
CoVR-2: Automatic Data Construction for Composed Video Retrieval	Aug 28, 2023	Composed Image Retrieval (CoIR), Composed Video Retrieval (CoVR)	Code Available	1	5
Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval	Dec 8, 2021	Action Localization, Retrieval	Code Available	1	5
End-to-End Learning of Visual Representations from Uncurated Instructional Videos	Dec 13, 2019	Action Localization, Action Recognition	Code Available	1	5
Cross-Architecture Self-supervised Video Representation Learning	May 26, 2022	Action Recognition, Contrastive Learning	Code Available	1	5
Cross-Modal Adapter for Text-Video Retrieval	Nov 17, 2022	parameter-efficient fine-tuning, Retrieval	Code Available	1	5
Cross Modal Retrieval with Querybank Normalisation	Dec 23, 2021	Cross-Modal Retrieval, Metric Learning	Code Available	1	5
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling	Sep 4, 2022	Fill Mask, Optical Flow Estimation	Code Available	1	5
DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization	Jun 1, 2021	Question Answering, Retrieval	Code Available	1	5
Text Proxy: Decomposing Retrieval from a 1-to-N Relationship into N 1-to-1 Relationships for Text-Video Retrieval	Oct 9, 2024	Retrieval, Text Retrieval	Code Available	1	5
Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data	Oct 8, 2023	Action Recognition, Continual Learning	Code Available	1	5
A Feature-space Multimodal Data Augmentation Technique for Text-video Retrieval	Aug 3, 2022	Data Augmentation, Retrieval	Code Available	1	5
C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval	Oct 7, 2022	Knowledge Distillation, Retrieval	Code Available	1	5
DnS: Distill-and-Select for Efficient and Accurate Video Indexing and Retrieval	Jun 24, 2021	Computational Efficiency, Knowledge Distillation	Code Available	1	5
Audio-based Near-Duplicate Video Retrieval with Audio Similarity Learning	Oct 17, 2020	Retrieval, Transfer Learning	Code Available	1	5
Dual Encoding for Video Retrieval by Text	Sep 10, 2020	Ad-hoc video search, Retrieval	Code Available	1	5
Temporal Context Aggregation for Video Retrieval with Contrastive Learning	Aug 4, 2020	Contrastive Learning, Representation Learning	Code Available	1	5
DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval	Jan 19, 2024	Retrieval, Video Retrieval	Code Available	1	5
CONQUER: Contextual Query-aware Ranking for Video Corpus Moment Retrieval	Sep 21, 2021	Corpus Video Moment Retrieval, Moment Retrieval	Code Available	1	5
Generalized Few-Shot Video Classification with Video Retrieval and Feature Generation	Jul 9, 2020	Few-Shot Image Classification, Few-Shot Learning	Code Available	1	5
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model	Mar 17, 2023	Retrieval, Video Retrieval	Code Available	1	5
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval	Apr 1, 2021	Retrieval, Text Retrieval	Code Available	1	5
Hysia: Serving DNN-Based Video-to-Retail Applications in Cloud	Jun 9, 2020	GPU, Video Retrieval	Code Available	1	5
Condensed Movies: Story Based Retrieval with Contextual Embeddings	May 8, 2020	Retrieval, Text to Video Retrieval	Code Available	1	5

Datasets: MSR-VTT-1kA, DiDeMo, MSR-VTT, LSMDC, ActivityNet, MSVD, YouCook2, FIVR-200K, VATEX, QuerYD, SSv2-label retrieval, SSv2-template retrieval

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	OmniVec	text-to-video R@10	89.4	—	Unverified
2	CLIP4Clip	text-to-video R@10	81.6	—	Unverified
3	OmniVec (pretrained)	text-to-video R@10	78.6	—	Unverified
4	HunYuan_tvr (huge)	text-to-video R@1	62.9	—	Unverified
5	CLIP-ViP	text-to-video R@1	57.7	—	Unverified
6	PIDRo	text-to-video R@1	55.9	—	Unverified
7	DMAE (ViT-B/16)	text-to-video R@1	55.5	—	Unverified
8	HunYuan_tvr	text-to-video R@1	55	—	Unverified
9	MuLTI	text-to-video R@1	54.7	—	Unverified
10	EERCF	text-to-video R@1	54.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Aurora (r=64)	text-to-video R@5	77.4	—	Unverified
2	InternVideo2-6B	text-to-video R@1	74.2	—	Unverified
3	vid-TLDR (UMT-L)	text-to-video R@1	72.3	—	Unverified
4	VAST	text-to-video R@1	72	—	Unverified
5	COSA	text-to-video R@1	70.5	—	Unverified
6	UMT-L (ViT-L/16)	text-to-video R@1	70.4	—	Unverified
7	GRAM	text-to-video R@1	67.3	—	Unverified
8	VALOR	text-to-video R@1	61.5	—	Unverified
9	TESTA (ViT-B/16)	text-to-video R@1	61.2	—	Unverified
10	VindLU	text-to-video R@1	61.2	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GRAM	text-to-video R@1	64	—	Unverified
2	VAST	text-to-video R@1	63.9	—	Unverified
3	InternVideo2-6B	text-to-video R@1	62.8	—	Unverified
4	VALOR	text-to-video R@1	59.9	—	Unverified
5	UMT-L (ViT-L/16)	text-to-video R@1	58.8	—	Unverified
6	vid-TLDR (UMT-L)	text-to-video R@1	58.1	—	Unverified
7	COSA	text-to-video R@1	57.9	—	Unverified
8	InternVideo2-6B	text-to-video R@1	55.9	—	Unverified
9	InternVideo	text-to-video R@1	55.2	—	Unverified
10	VLAB	text-to-video R@1	55.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	EMCL-Net++	text-to-video R@10	53.7	—	Unverified
2	InternVideo2-6B	text-to-video R@1	46.4	—	Unverified
3	vid-TLDR (UMT-L)	text-to-video R@1	43.1	—	Unverified
4	UMT-L (ViT-L/16)	text-to-video R@1	43	—	Unverified
5	HunYuan_tvr (huge)	text-to-video R@1	40.4	—	Unverified
6	COSA	text-to-video R@1	39.4	—	Unverified
7	mPLUG-2	text-to-video R@1	34.4	—	Unverified
8	VALOR	text-to-video R@1	34.2	—	Unverified
9	InternVideo	text-to-video R@1	34	—	Unverified
10	InternVideo2-6B	text-to-video R@1	33.8	—	Unverified