Video Retrieval

Video retrieval takes a text query and a pool of candidate videos and selects the video that corresponds to the query. Typically, the candidates are returned as a ranked list and scored with document retrieval metrics such as Recall@K (R@K).
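As an illustration of how a ranked list can be scored, the sketch below computes Recall@K, the "R@K" metric family reported in the benchmark tables on this page. It is a minimal example rather than any benchmark's official evaluation script. It assumes each query has exactly one correct video, that higher scores mean better matches, and that ties count against the correct video. The names `rank_of_target`, `recall_at_k`, `score_matrix` and `targets` are hypothetical, and the scores are toy values.

```python
def rank_of_target(scores, target_index):
    """Return the 1-based rank of the correct video among all candidates.

    scores: one similarity score per candidate video (higher = better match).
    Assumption: ties are counted against the target (pessimistic ranking).
    """
    target_score = scores[target_index]
    better_or_equal = sum(
        1 for i, s in enumerate(scores) if i != target_index and s >= target_score
    )
    return better_or_equal + 1


def recall_at_k(score_matrix, targets, k):
    """Percentage of queries whose correct video is ranked within the top k."""
    hits = sum(
        1
        for scores, target in zip(score_matrix, targets)
        if rank_of_target(scores, target) <= k
    )
    return 100.0 * hits / len(targets)


# Toy example: 3 text queries scored against 4 candidate videos.
score_matrix = [
    [0.9, 0.1, 0.3, 0.2],  # query 0: correct video 0 is ranked 1st
    [0.4, 0.5, 0.7, 0.1],  # query 1: correct video 1 is ranked 2nd
    [0.6, 0.2, 0.1, 0.8],  # query 2: correct video 2 is ranked 4th
]
targets = [0, 1, 2]  # index of the correct video for each query

r_at_1 = recall_at_k(score_matrix, targets, k=1)  # 1 of 3 queries hit
r_at_2 = recall_at_k(score_matrix, targets, k=2)  # 2 of 3 queries hit
```

Because R@K can only stay the same or grow as K increases, an R@10 figure is not directly comparable to an R@1 figure, even for the same model and dataset.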

Papers

Title	Date	Tasks	Status	Hype
E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer	Nov 28, 2023	Language Modeling, Language Modelling	Unverified	0
Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning	Nov 27, 2023	Action Classification, Action Recognition	Code Available	1
VideoCon: Robust Video-Language Alignment via Contrast Captions	Nov 15, 2023	Language Modeling, Language Modelling	Code Available	1
Sinkhorn Transformations for Single-Query Postprocessing in Text-Video Retrieval	Nov 14, 2023	Retrieval, Video Retrieval	Unverified	0
Lost Your Style? Navigating with Semantic-Level Approach for Text-to-Outfit Retrieval	Nov 3, 2023	Recommendation Systems, Retrieval	Unverified	0
An Empirical Study of Frame Selection for Text-to-Video Retrieval	Nov 1, 2023	Retrieval, Text to Video Retrieval	Unverified	0
CHAIN: Exploring Global-Local Spatio-Temporal Information for Improved Self-Supervised Video Hashing	Oct 29, 2023	Contrastive Learning, Retrieval	Unverified	0
TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding	Oct 29, 2023	Form, Language Modelling	Code Available	1
Joint Searching and Grounding: Multi-Granularity Video Content Retrieval	Oct 23, 2023	Contrastive Learning, Retrieval	Code Available	0
Videoprompter: an ensemble of foundational models for zero-shot video understanding	Oct 23, 2023	Action Recognition, Descriptive	Unverified	0
Dual-Stream Knowledge-Preserving Hashing for Unsupervised Video Retrieval	Oct 12, 2023	Retrieval, Semantic Retrieval	Unverified	0
Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data	Oct 8, 2023	Action Recognition, Continual Learning	Code Available	1
GMMFormer: Gaussian-Mixture-Model Based Transformer for Efficient Partially Relevant Video Retrieval	Oct 8, 2023	Partially Relevant Video Retrieval, Retrieval	Code Available	1
Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks	Oct 7, 2023	Action Recognition, Multiple-choice	Unverified	0
HowToCaption: Prompting LLMs to Transform Video Annotations at Scale	Oct 7, 2023	Automatic Speech Recognition, Video Captioning	Code Available	1
Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval	Sep 29, 2023	Cross-Modal Retrieval, Image-text matching	Code Available	1
Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning	Sep 20, 2023	Contrastive Learning, Retrieval	Code Available	1
Learning Segment Similarity and Alignment in Large-Scale Content Based Video Retrieval	Sep 20, 2023	Retrieval, Video Retrieval	Unverified	0
Unified Coarse-to-Fine Alignment for Video-Text Retrieval	Sep 18, 2023	Retrieval, Text Retrieval	Code Available	1
Towards Debiasing Frame Length Bias in Text-Video Retrieval via Causal Intervention	Sep 17, 2023	Action Recognition, Graph Generation	Unverified	0
In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval	Sep 16, 2023	Retrieval, Style Transfer	Code Available	1
Differentiable Resolution Compression and Alignment for Efficient Video Classification and Retrieval	Sep 15, 2023	Retrieval, Video Classification	Code Available	0
Language-Conditioned Change-point Detection to Identify Sub-Tasks in Robotics Domains	Sep 1, 2023	Change Point Detection, Instruction Following	Code Available	0
CoVR-2: Automatic Data Construction for Composed Video Retrieval	Aug 28, 2023	Composed Image Retrieval (CoIR), Composed Video Retrieval (CoVR)	Code Available	1
Simple Baselines for Interactive Video Retrieval with Questions and Answers	Aug 21, 2023	Question Answering, Retrieval	Code Available	1
Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval	Aug 15, 2023	Retrieval, Video Captioning	Code Available	1
TeachCLIP: Multi-Grained Teaching for Efficient Text-to-Video Retrieval	Aug 2, 2023	Retrieval, text similarity	Unverified	0
Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures	Jul 27, 2023	Automatic Speech Recognition, Contrastive Learning	Code Available	1
Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment	Jul 24, 2023	Retrieval, Text to Video Retrieval	Unverified	0
Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model	Jul 24, 2023	Anomaly Detection, Retrieval	Code Available	1
Fine-grained Text-Video Retrieval with Frozen Image Encoders	Jul 14, 2023	Decoder, Retrieval	Unverified	0
Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation	Jul 13, 2023	Retrieval, Video Generation	Code Available	2
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation	Jul 13, 2023	Action Recognition, Contrastive Learning	Unverified	0
MultiVENT: Multilingual Videos of Events with Aligned Natural Text	Jul 6, 2023	Information Retrieval, Retrieval	Unverified	0
ICSVR: Investigating Compositional and Syntactic Understanding in Video Retrieval Models	Jun 28, 2023	Retrieval, Video Retrieval	Code Available	0
An overview on the evaluated video retrieval tasks at TRECVID 2022	Jun 22, 2023	Ad-hoc video search, Retrieval	Code Available	1
Key Frame Extraction with Attention Based Deep Neural Networks	Jun 21, 2023	Video Retrieval, Video Summarization	Unverified	0
MSVD-Indonesian: A Benchmark for Multimodal Video-Text Tasks in Indonesian	Jun 20, 2023	Cross-Lingual Transfer, Retrieval	Code Available	0
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model	Jun 15, 2023	Form, model	Code Available	1
Enhanced Multimodal Representation Learning with Cross-modal KD	Jun 13, 2023	Contrastive Learning, Emotion Classification	Unverified	0
MarineVRS: Marine Video Retrieval System with Explainability via Semantic Understanding	Jun 7, 2023	Retrieval, Sentence	Unverified	0
An Overview of Challenges in Egocentric Text-Video Retrieval	Jun 7, 2023	Retrieval, Video Retrieval	Unverified	0
fpgaHART: A toolflow for throughput-oriented acceleration of 3D CNNs for HAR onto FPGAs	May 31, 2023	Action Recognition, Autonomous Vehicles	Unverified	0
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset	May 29, 2023	Audio captioning, Audio-Visual Captioning	Code Available	2
FMM-X3D: FPGA-based modeling and mapping of X3D for Human Action Recognition	May 29, 2023	Action Recognition, Autonomous Vehicles	Unverified	0
VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending	May 22, 2023	Question Answering, Retrieval	Unverified	0
Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set Alignment	May 20, 2023	Retrieval, Video Retrieval	Code Available	1
Mask to reconstruct: Cooperative Semantics Completion for Video-text Retrieval	May 13, 2023	Retrieval, Text Retrieval	Unverified	0
A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension	May 5, 2023	Reading Comprehension, Retrieval	Code Available	1
A Review of Deep Learning for Video Captioning	Apr 22, 2023	Deep Learning, Dense Video Captioning	Unverified	0

Datasets: MSR-VTT-1kA, DiDeMo, MSR-VTT, LSMDC, ActivityNet, MSVD, YouCook2, FIVR-200K, VATEX, QuerYD, SSv2-label retrieval, SSv2-template retrieval

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	OmniVec	text-to-video R@10	89.4	—	Unverified
2	CLIP4Clip	text-to-video R@10	81.6	—	Unverified
3	OmniVec (pretrained)	text-to-video R@10	78.6	—	Unverified
4	HunYuan_tvr (huge)	text-to-video R@1	62.9	—	Unverified
5	CLIP-ViP	text-to-video R@1	57.7	—	Unverified
6	PIDRo	text-to-video R@1	55.9	—	Unverified
7	DMAE (ViT-B/16)	text-to-video R@1	55.5	—	Unverified
8	HunYuan_tvr	text-to-video R@1	55	—	Unverified
9	MuLTI	text-to-video R@1	54.7	—	Unverified
10	EERCF	text-to-video R@1	54.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Aurora (ours, r=64)	text-to-video R@5	77.4	—	Unverified
2	InternVideo2-6B	text-to-video R@1	74.2	—	Unverified
3	vid-TLDR (UMT-L)	text-to-video R@1	72.3	—	Unverified
4	VAST	text-to-video R@1	72	—	Unverified
5	COSA	text-to-video R@1	70.5	—	Unverified
6	UMT-L (ViT-L/16)	text-to-video R@1	70.4	—	Unverified
7	GRAM	text-to-video R@1	67.3	—	Unverified
8	VALOR	text-to-video R@1	61.5	—	Unverified
9	TESTA (ViT-B/16)	text-to-video R@1	61.2	—	Unverified
10	VindLU	text-to-video R@1	61.2	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GRAM	text-to-video R@1	64	—	Unverified
2	VAST	text-to-video R@1	63.9	—	Unverified
3	InternVideo2-6B	text-to-video R@1	62.8	—	Unverified
4	VALOR	text-to-video R@1	59.9	—	Unverified
5	UMT-L (ViT-L/16)	text-to-video R@1	58.8	—	Unverified
6	vid-TLDR (UMT-L)	text-to-video R@1	58.1	—	Unverified
7	COSA	text-to-video R@1	57.9	—	Unverified
8	InternVideo2-6B	text-to-video R@1	55.9	—	Unverified
9	InternVideo	text-to-video R@1	55.2	—	Unverified
10	VLAB	text-to-video R@1	55.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	EMCL-Net (Ours)++ LSMDC Rohrbach et al. (2015)	text-to-video R@10	53.7	—	Unverified
2	InternVideo2-6B	text-to-video R@1	46.4	—	Unverified
3	vid-TLDR (UMT-L)	text-to-video R@1	43.1	—	Unverified
4	UMT-L (ViT-L/16)	text-to-video R@1	43	—	Unverified
5	HunYuan_tvr (huge)	text-to-video R@1	40.4	—	Unverified
6	COSA	text-to-video R@1	39.4	—	Unverified
7	mPLUG-2	text-to-video R@1	34.4	—	Unverified
8	VALOR	text-to-video R@1	34.2	—	Unverified
9	InternVideo	text-to-video R@1	34	—	Unverified
10	InternVideo2-6B	text-to-video R@1	33.8	—	Unverified