Video Retrieval

The objective of video retrieval is as follows: given a text query and a pool of candidate videos, select the video which corresponds to the text query. Typically, the videos are returned as a ranked list of candidates and scored via document retrieval metrics.

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 151–200 of 486 papers

Title	Date	Tasks	Status	Hype
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset	Apr 17, 2023	Audio captioningAudio-Video Question Answering (AVQA)	CodeCode Available	2
Robust Cross-Modal Knowledge Distillation for Unconstrained Videos	Apr 16, 2023	Action RecognitionAudio Tagging	CodeCode Available	1
LASER: A Neuro-Symbolic Framework for Learning Spatial-Temporal Scene Graphs with Weak Supervision	Apr 15, 2023	Language ModelingLanguage Modelling	—Unverified	0
Self-Supervised Video Similarity Learning	Apr 6, 2023	ISVRRetrieval	CodeCode Available	1
Perfect Match in Video Retrieval	Mar 29, 2023	RetrievalVideo Retrieval	—Unverified	0
Free-Form Multi-Modal Multimedia Retrieval (4MR)	Mar 29, 2023	FormManagement	—Unverified	0
Hierarchical Video-Moment Retrieval and Step-Captioning	Mar 29, 2023	Information RetrievalMoment Retrieval	CodeCode Available	1
Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding	Mar 28, 2023	Action LocalizationAction Recognition	—Unverified	0
Unmasked Teacher: Towards Training-Efficient Video Foundation Models	Mar 28, 2023	Action ClassificationAction Recognition	CodeCode Available	0
Colo-SCRL: Self-Supervised Contrastive Representation Learning for Colonoscopic Video Retrieval	Mar 28, 2023	Action RecognitionContrastive Learning	—Unverified	0
Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning	Mar 25, 2023	Contrastive LearningQuestion Answering	CodeCode Available	1
Aligning Step-by-Step Instructional Diagrams to Video Demonstrations	Mar 24, 2023	Contrastive LearningImage Retrieval	CodeCode Available	0
Dialogue-to-Video Retrieval	Mar 23, 2023	Recommendation SystemsRetrieval	CodeCode Available	0
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models	Mar 23, 2023	Auxiliary LearningMultimodal Sentiment Analysis	CodeCode Available	1
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model	Mar 17, 2023	RetrievalVideo Retrieval	CodeCode Available	1
VVS: Video-to-Video Retrieval with Irrelevant Frame Suppression	Mar 15, 2023	RetrievalVideo Retrieval	CodeCode Available	1
Accommodating Audio Modality in CLIP for Multimodal Processing	Mar 12, 2023	AudioCapsContrastive Learning	CodeCode Available	0
MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling	Mar 10, 2023	Multi-Label ClassificationMUlTI-LABEL-ClASSIFICATION	—Unverified	0
Improving Video Retrieval by Adaptive Margin	Mar 9, 2023	RetrievalVideo Retrieval	—Unverified	0
STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training	Feb 20, 2023	Language ModellingObject	—Unverified	0
Video-Text Retrieval by Supervised Sparse Multi-Grained Learning	Feb 19, 2023	Representation LearningRetrieval	CodeCode Available	0
Is Multimodal Vision Supervision Beneficial to Language?	Feb 10, 2023	Image RetrievalNatural Language Understanding	CodeCode Available	0
Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer	Feb 4, 2023	Computational EfficiencyQuestion Answering	CodeCode Available	0
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video	Feb 1, 2023	Action ClassificationImage Classification	CodeCode Available	4
Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring	Jan 26, 2023	Representation LearningRetrieval	CodeCode Available	1
Zorro: the masked multimodal transformer	Jan 23, 2023	Audio TaggingMultimodal Deep Learning	CodeCode Available	0
Temporal Perceiving Video-Language Pre-training	Jan 18, 2023	Action LocalizationContrastive Learning	—Unverified	0
UATVR: Uncertainty-Adaptive Text-Video Retrieval	Jan 16, 2023	RetrievalSemantic correspondence	CodeCode Available	1
Learning Trajectory-Word Alignments for Video-Language Tasks	Jan 5, 2023	Question AnsweringRetrieval	—Unverified	0
PIDRo: Parallel Isomeric Attention with Dynamic Routing for Text-Video Retrieval	Jan 1, 2023	Representation LearningRetrieval	—Unverified	0
HiVLP: Hierarchical Interactive Video-Language Pre-Training	Jan 1, 2023	RetrievalSelf-Supervised Learning	—Unverified	0
Exploring Temporal Concurrency for Video-Language Representation Learning	Jan 1, 2023	Dynamic Time WarpingMetric Learning	CodeCode Available	0
Dual Learning with Dynamic Knowledge Distillation for Partially Relevant Video Retrieval	Jan 1, 2023	Knowledge DistillationLanguage Modelling	CodeCode Available	1
Progressive Spatio-Temporal Prototype Matching for Text-Video Retrieval	Jan 1, 2023	DiversityObject	CodeCode Available	1
Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?	Dec 31, 2022	Data AugmentationRetrieval	CodeCode Available	2
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training	Dec 30, 2022	cross-modal alignmentTGIF-Action	—Unverified	0
TempCLR: Temporal Alignment Representation with Contrastive Learning	Dec 28, 2022	Action RecognitionContrastive Learning	CodeCode Available	1
You were saying? - Spoken Language in the V3C Dataset	Dec 15, 2022	RetrievalVideo Retrieval	CodeCode Available	0
Contextual Explainable Video Representation: Human Perception-based Understanding	Dec 12, 2022	Action DetectionAction Recognition	CodeCode Available	0
VindLU: A Recipe for Effective Video-and-Language Pretraining	Dec 9, 2022	Question AnsweringRetrieval	CodeCode Available	1
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners	Dec 9, 2022	Question AnsweringRetrieval	—Unverified	0
InternVideo: General Video Foundation Models via Generative and Discriminative Learning	Dec 6, 2022	Action ClassificationAction Recognition	CodeCode Available	4
Masked Contrastive Pre-Training for Efficient Video-Text Retrieval	Dec 2, 2022	Image-text RetrievalRetrieval	—Unverified	0
Normalized Contrastive Learning for Text-Video Retrieval	Nov 30, 2022	Contrastive LearningCross-Modal Retrieval	CodeCode Available	1
Renmin University of China at TRECVID 2022: Improving Video Search by Feature Fusion and Negation Understanding	Nov 28, 2022	Ad-hoc video searchNegation	—Unverified	0
VoP: Text-Video Co-operative Prompt Tuning for Cross-Modal Retrieval	Nov 23, 2022	Cross-Modal RetrievalRetrieval	CodeCode Available	1
TransVCL: Attention-enhanced Video Copy Localization Network with Flexible Supervision	Nov 23, 2022	RetrievalVideo Retrieval	CodeCode Available	1
X^2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks	Nov 22, 2022	AllCross-Modal Retrieval	CodeCode Available	2
Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval	Nov 21, 2022	AllRetrieval	CodeCode Available	0
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations	Nov 21, 2022	Contrastive LearningRepresentation Learning	CodeCode Available	1

Show:10 25 50

← PrevPage 4 of 10Next →

All datasets MSR-VTT-1kA DiDeMo MSR-VTT LSMDC ActivityNet MSVD YouCook2 FIVR-200K VATEX QuerYD SSv2-label retrieval SSv2-template retrieval

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	OmniVec	text-to-video R@10	89.4	—	Unverified
2	CLIP4Clip	text-to-video R@10	81.6	—	Unverified
3	OmniVec (pretrained)	text-to-video R@10	78.6	—	Unverified
4	HunYuan_tvr (huge)	text-to-video R@1	62.9	—	Unverified
5	CLIP-ViP	text-to-video R@1	57.7	—	Unverified
6	PIDRo	text-to-video R@1	55.9	—	Unverified
7	DMAE (ViT-B/16)	text-to-video R@1	55.5	—	Unverified
8	HunYuan_tvr	text-to-video R@1	55	—	Unverified
9	MuLTI	text-to-video R@1	54.7	—	Unverified
10	EERCF	text-to-video R@1	54.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Aurora (ours, r=64)	text-to-video R@5	77.4	—	Unverified
2	InternVideo2-6B	text-to-video R@1	74.2	—	Unverified
3	vid-TLDR (UMT-L)	text-to-video R@1	72.3	—	Unverified
4	VAST	text-to-video R@1	72	—	Unverified
5	COSA	text-to-video R@1	70.5	—	Unverified
6	UMT-L (ViT-L/16)	text-to-video R@1	70.4	—	Unverified
7	GRAM	text-to-video R@1	67.3	—	Unverified
8	VALOR	text-to-video R@1	61.5	—	Unverified
9	TESTA (ViT-B/16)	text-to-video R@1	61.2	—	Unverified
10	VindLU	text-to-video R@1	61.2	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GRAM	text-to-video R@1	64	—	Unverified
2	VAST	text-to-video R@1	63.9	—	Unverified
3	InternVideo2-6B	text-to-video R@1	62.8	—	Unverified
4	VALOR	text-to-video R@1	59.9	—	Unverified
5	UMT-L (ViT-L/16)	text-to-video R@1	58.8	—	Unverified
6	vid-TLDR (UMT-L)	text-to-video R@1	58.1	—	Unverified
7	COSA	text-to-video R@1	57.9	—	Unverified
8	InternVideo2-6B	text-to-video R@1	55.9	—	Unverified
9	InternVideo	text-to-video R@1	55.2	—	Unverified
10	VLAB	text-to-video R@1	55.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	EMCL-Net (Ours)++ LSMDC Rohrbach et al. (2015)	text-to-video R@10	53.7	—	Unverified
2	InternVideo2-6B	text-to-video R@1	46.4	—	Unverified
3	vid-TLDR (UMT-L)	text-to-video R@1	43.1	—	Unverified
4	UMT-L (ViT-L/16)	text-to-video R@1	43	—	Unverified
5	HunYuan_tvr (huge)	text-to-video R@1	40.4	—	Unverified
6	COSA	text-to-video R@1	39.4	—	Unverified
7	mPLUG-2	text-to-video R@1	34.4	—	Unverified
8	VALOR	text-to-video R@1	34.2	—	Unverified
9	InternVideo	text-to-video R@1	34	—	Unverified
10	InternVideo2-6B	text-to-video R@1	33.8	—	Unverified