Video Retrieval

The objective of video retrieval is as follows: given a text query and a pool of candidate videos, select the video which corresponds to the text query. Typically, the videos are returned as a ranked list of candidates and scored via document retrieval metrics.

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 201–250 of 486 papers

Title	Date	Tasks	Status
Generative Ghost: Investigating Ranking Bias Hidden in AI-Generated Videos	Feb 11, 2025	Contrastive LearningImage Retrieval	—Unverified
HORUS: Multimodal Large Language Models Framework for Video Retrieval at VBS 2025	Jan 1, 2025	Image RetrievalRetrieval	—Unverified
CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval	Dec 31, 2024	RetrievalText Retrieval	—Unverified
Hierarchical Banzhaf Interaction for General Video-Language Representation Learning	Dec 30, 2024	Contrastive LearningQuestion Answering	—Unverified
PolySmart @ TRECVid 2024 Medical Video Question Answering	Dec 20, 2024	Question AnsweringRetrieval	—Unverified
Query-centric Audio-Visual Cognition Network for Moment Retrieval, Segmentation and Step-Captioning	Dec 18, 2024	Moment RetrievalMulti-Task Learning	—Unverified
Generative Semantic Communication: Architectures, Technologies, and Applications	Dec 11, 2024	RetrievalSemantic Communication	—Unverified
Multimodal Contextualized Support for Enhancing Video Retrieval System	Dec 10, 2024	object-detectionObject Detection	—Unverified
ContextIQ: A Multimodal Expert-Based Video Retrieval System for Contextual Advertising	Oct 29, 2024	RetrievalText to Video Retrieval	CodeCode Available
Generating Signed Language Instructions in Large-Scale Dialogue Systems	Oct 17, 2024	RetrievalText Generation	CodeCode Available
MultiVENT 2.0: A Massive Multilingual Benchmark for Event-Centric Video Retrieval	Oct 15, 2024	DescriptiveRetrieval	—Unverified
VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models	Oct 1, 2024	Hallucinationtext similarity	—Unverified
TokenBinder: Text-Video Retrieval with One-to-Many Alignment Paradigm	Sep 30, 2024	RetrievalVideo Retrieval	CodeCode Available
Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding	Sep 29, 2024	DiversityQuestion Answering	—Unverified
Unfolding Videos Dynamics via Taylor Expansion	Sep 4, 2024	Action DetectionAction Recognition	—Unverified
Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets	Sep 2, 2024	Video AlignmentVideo Editing	—Unverified
NAVERO: Unlocking Fine-Grained Semantics for Video-Language Compositionality	Aug 18, 2024	RetrievalText Retrieval	—Unverified
Bridging Information Asymmetry in Text-video Retrieval: A Data-centric Approach	Aug 14, 2024	Cross-Modal RetrievalLanguage Modeling	—Unverified
Latent-INR: A Flexible Framework for Implicit Representations of Videos with Discriminative Semantics	Aug 5, 2024	RetrievalVideo Retrieval	—Unverified
Neural Graph Matching for Video Retrieval in Large-Scale Video-driven E-commerce	Aug 1, 2024	Graph MatchingRetrieval	—Unverified
ExpertAF: Expert Actionable Feedback from Video	Aug 1, 2024	Language ModelingLanguage Modelling	—Unverified
SEDS: Semantically Enhanced Dual-Stream Encoder for Sign Language Retrieval	Jul 23, 2024	RetrievalSign Language Retrieval	CodeCode Available
Not All Pairs are Equal: Hierarchical Learning for Average-Precision-Oriented Video Retrieval	Jul 22, 2024	AllRetrieval	—Unverified
MERLIN: Multimodal Embedding Refinement via LLM-based Iterative Navigation for Text-Video Retrieval-Rerank Pipeline	Jul 17, 2024	Question AnsweringRetrieval	—Unverified
EA-VTR: Event-Aware Video-Text Retrieval	Jul 10, 2024	Action RecognitionContrastive Learning	—Unverified
MAMA: Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning	Jul 4, 2024	Language ModelingLanguage Modelling	CodeCode Available
ACE: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling	Jun 25, 2024	Cross-Modal RetrievalNatural Language Queries	—Unverified
Multi-Granularity and Multi-modal Feature Interaction Approach for Text Video Retrieval	Jun 21, 2024	RetrievalSentence	—Unverified
Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset	Jun 19, 2024	Language ModelingLanguage Modelling	—Unverified
RNNs, CNNs and Transformers in Human Action Recognition: A Survey and a Hybrid Model	Jun 2, 2024	Action RecognitionTemporal Action Localization	—Unverified
Uncertainty-aware sign language video retrieval with probability distribution modeling	May 30, 2024	RetrievalSign Language Retrieval	—Unverified
RAP: Efficient Text-Video Retrieval with Sparse-and-Correlated Adapter	May 29, 2024	Natural Language Queriesparameter-efficient fine-tuning	—Unverified
Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models	Apr 29, 2024	Image RetrievalLanguage Modeling	—Unverified
Learning text-to-video retrieval from image captioning	Apr 26, 2024	Image CaptioningImage Retrieval	—Unverified
SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval	Apr 22, 2024	RetrievalVideo Retrieval	—Unverified
ProTA: Probabilistic Token Aggregation for Text-Video Retrieval	Apr 18, 2024	DiversityRetrieval	—Unverified
Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval	Mar 26, 2024	Multimodal ReasoningRetrieval	—Unverified
Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement	Feb 21, 2024	Moment RetrievalRetrieval	CodeCode Available
Event-aware Video Corpus Moment Retrieval	Feb 21, 2024	Contrastive LearningMoment Retrieval	—Unverified
Video Editing for Video Retrieval	Feb 4, 2024	RetrievalText Retrieval	—Unverified
CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing	Jan 22, 2024	AudioCapsAudio-Visual Synchronization	—Unverified
Distilling Vision-Language Models on Millions of Videos	Jan 11, 2024	Language ModelingLanguage Modelling	—Unverified
Text-Video Retrieval via Variational Multi-Modal Hypergraph Networks	Jan 6, 2024	RetrievalVariational Inference	—Unverified
Detours for Navigating Instructional Videos	Jan 3, 2024	16kQuestion Answering	—Unverified
No More Shortcuts: Realizing the Potential of Temporal Self-Supervision	Dec 20, 2023	Action ClassificationAttribute	—Unverified
WAVER: Writing-style Agnostic Text-Video Retrieval via Distilling Vision-Language Models Through Open-Vocabulary Knowledge	Dec 15, 2023	Information RetrievalKnowledge Distillation	CodeCode Available
Leveraging Generative Language Models for Weakly Supervised Sentence Component Analysis in Video-Language Joint Learning	Dec 10, 2023	Language ModelingLanguage Modelling	—Unverified
Vision-Language Models Learn Super Images for Efficient Partially Relevant Video Retrieval	Dec 1, 2023	Image RetrievalPartially Relevant Video Retrieval	—Unverified
A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval	Nov 30, 2023	BenchmarkingRetrieval	—Unverified
Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding	Nov 30, 2023	FormVideo Retrieval	—Unverified

Show:10 25 50

← PrevPage 5 of 10Next →

All datasets MSR-VTT-1kA DiDeMo MSR-VTT LSMDC ActivityNet MSVD YouCook2 FIVR-200K VATEX QuerYD SSv2-label retrieval SSv2-template retrieval

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	OmniVec	text-to-video R@10	89.4	—	Unverified
2	CLIP4Clip	text-to-video R@10	81.6	—	Unverified
3	OmniVec (pretrained)	text-to-video R@10	78.6	—	Unverified
4	HunYuan_tvr (huge)	text-to-video R@1	62.9	—	Unverified
5	CLIP-ViP	text-to-video R@1	57.7	—	Unverified
6	PIDRo	text-to-video R@1	55.9	—	Unverified
7	DMAE (ViT-B/16)	text-to-video R@1	55.5	—	Unverified
8	HunYuan_tvr	text-to-video R@1	55	—	Unverified
9	MuLTI	text-to-video R@1	54.7	—	Unverified
10	EERCF	text-to-video R@1	54.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Aurora (ours, r=64)	text-to-video R@5	77.4	—	Unverified
2	InternVideo2-6B	text-to-video R@1	74.2	—	Unverified
3	vid-TLDR (UMT-L)	text-to-video R@1	72.3	—	Unverified
4	VAST	text-to-video R@1	72	—	Unverified
5	COSA	text-to-video R@1	70.5	—	Unverified
6	UMT-L (ViT-L/16)	text-to-video R@1	70.4	—	Unverified
7	GRAM	text-to-video R@1	67.3	—	Unverified
8	VALOR	text-to-video R@1	61.5	—	Unverified
9	TESTA (ViT-B/16)	text-to-video R@1	61.2	—	Unverified
10	VindLU	text-to-video R@1	61.2	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GRAM	text-to-video R@1	64	—	Unverified
2	VAST	text-to-video R@1	63.9	—	Unverified
3	InternVideo2-6B	text-to-video R@1	62.8	—	Unverified
4	VALOR	text-to-video R@1	59.9	—	Unverified
5	UMT-L (ViT-L/16)	text-to-video R@1	58.8	—	Unverified
6	vid-TLDR (UMT-L)	text-to-video R@1	58.1	—	Unverified
7	COSA	text-to-video R@1	57.9	—	Unverified
8	InternVideo2-6B	text-to-video R@1	55.9	—	Unverified
9	InternVideo	text-to-video R@1	55.2	—	Unverified
10	VLAB	text-to-video R@1	55.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	EMCL-Net (Ours)++ LSMDC Rohrbach et al. (2015)	text-to-video R@10	53.7	—	Unverified
2	InternVideo2-6B	text-to-video R@1	46.4	—	Unverified
3	vid-TLDR (UMT-L)	text-to-video R@1	43.1	—	Unverified
4	UMT-L (ViT-L/16)	text-to-video R@1	43	—	Unverified
5	HunYuan_tvr (huge)	text-to-video R@1	40.4	—	Unverified
6	COSA	text-to-video R@1	39.4	—	Unverified
7	mPLUG-2	text-to-video R@1	34.4	—	Unverified
8	VALOR	text-to-video R@1	34.2	—	Unverified
9	InternVideo	text-to-video R@1	34	—	Unverified
10	InternVideo2-6B	text-to-video R@1	33.8	—	Unverified