SOTAVerified

Video Retrieval

Given a text query and a pool of candidate videos, video retrieval aims to select the video that corresponds to the query. The videos are typically returned as a ranked list of candidates and scored with document retrieval metrics such as Recall@K (R@K).
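As a rough illustration of how a ranked list can be scored, the sketch below computes Recall@K, the R@1, R@5, and R@10 figures reported in the benchmark tables on this page. It assumes a hypothetical query-by-video score matrix in which the correct video for query i is video i, and it counts ties in favor of the correct video. The function name `recall_at_k` and the toy scores are illustrative only and are not taken from any listed paper.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int) -> float:
    """Fraction of text queries whose matching video ranks in the top k.

    Assumption: similarity[i][j] scores query i against video j, and the
    correct video for query i is video i (one ground-truth video per query).
    """
    hits = 0
    for i, scores in enumerate(similarity):
        # Zero-based rank of the correct video: the number of videos scored
        # strictly higher, so ties count in favor of the correct video.
        rank = sum(1 for s in scores if s > scores[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)


# Toy scores for 3 queries x 3 videos (made-up numbers, not benchmark data).
scores = [
    [0.9, 0.1, 0.3],  # query 0: correct video ranked 1st
    [0.8, 0.5, 0.2],  # query 1: correct video ranked 2nd
    [0.4, 0.6, 0.1],  # query 2: correct video ranked 3rd
]
for k in (1, 2, 3):
    print(f"R@{k} = {100 * recall_at_k(scores, k):.1f}")
```

Leaderboards usually report the value as a percentage, which is why the example multiplies by 100 before printing.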

Papers

Showing 201-250 of 486 papers

Title | Status | Hype
Self-supervised Video Representation Learning with Cascade Positive Retrieval | Code | 0
MSVD-Indonesian: A Benchmark for Multimodal Video-Text Tasks in Indonesian | Code | 0
Dual Encoding for Zero-Example Video Retrieval | Code | 0
FIVR: Fine-grained Incident Video Retrieval | Code | 0
Self-supervised Video Representation Learning by Context and Motion Decoupling | Code | 0
Is Multimodal Vision Supervision Beneficial to Language? | Code | 0
TokenBinder: Text-Video Retrieval with One-to-Many Alignment Paradigm | Code | 0
Joint Searching and Grounding: Multi-Granularity Video Content Retrieval | Code | 0
WAVER: Writing-style Agnostic Text-Video Retrieval via Distilling Vision-Language Models Through Open-Vocabulary Knowledge | Code | 0
Win-Fail Action Recognition | Code | 0
Towards Efficient Partially Relevant Video Retrieval with Active Moment Discovering | Code | 0
Deep Hashing with Category Mask for Fast Video Retrieval | Code | 0
Semantic Role Aware Correlation Transformer for Text to Video Retrieval | Code | 0
GOCA: Guided Online Cluster Assignment for Self-Supervised Video Representation Learning | Code | 0
Object Priors for Classifying and Localizing Unseen Actions | Code | 0
Contextual Explainable Video Representation: Human Perception-based Understanding | Code | 0
LAMV: Learning to Align and Match Videos With Kernelized Temporal Layers | Code | 0
Graph Based Temporal Aggregation for Video Retrieval | Code | 0
MAMA: Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning | Code | 0
Discriminative Residual Analysis for Image Set Classification with Posture and Age Variations | Code | 0
Language-Conditioned Change-point Detection to Identify Sub-Tasks in Robotics Domains | Code | 0
Person Search in Videos with One Portrait Through Visual and Temporal Links | Code | 0
AMIL: Adversarial Multi Instance Learning for Human Pose Estimation | Code | 0
Efficient Cross-Modal Video Retrieval with Meta-Optimized Frames | Code | 0
Hashing with Mutual Information | Code | 0
Exploring the Temporal Cues to Enhance Video Retrieval on Standardized CDVA | Code | 0
You were saying? - Spoken Language in the V3C Dataset | Code | 0
Hierarchical Banzhaf Interaction for General Video-Language Representation Learning | Code | 0
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language | Code | 0
Learning to Locate Visual Answer in Video Corpus Using Question | Code | 0
Accommodating Audio Modality in CLIP for Multimodal Processing | Code | 0
Video-Text Retrieval by Supervised Sparse Multi-Grained Learning | Code | 0
Exploring Temporal Concurrency for Video-Language Representation Learning | Code | 0
Learning to Retrieve Videos by Asking Questions | Code | 0
Zorro: the masked multimodal transformer | Code | 0
Inter-intra Variant Dual Representations for Self-supervised Video Recognition | Code | 0
Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer | Code | 0
Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks | Code | 0
Central Similarity Quantization for Efficient Image and Video Retrieval | Code | 0
Aligning Step-by-Step Instructional Diagrams to Video Demonstrations | Code | 0
Unmasked Teacher: Towards Training-Efficient Video Foundation Models | Code | 0
A Challenge to Build Neuro-Symbolic Video Agents | Code | 0
Vision Transformer Based Video Hashing Retrieval for Tracing the Source of Fake Videos | Code | 0
Learning from Video and Text via Large-Scale Discriminative Clustering | Code | 0
ICSVR: Investigating Compositional and Syntactic Understanding in Video Retrieval Models | Code | 0
Exploiting Semantic Role Contextualized Video Features for Multi-Instance Text-Video Retrieval EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022 | Code | 0
Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval | Code | 0
T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval | Code | 0
Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement | Code | 0
Talking Face Generation by Adversarially Disentangled Audio-Visual Representation | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | OmniVec | text-to-video R@10 | 89.4 | | Unverified
2 | CLIP4Clip | text-to-video R@10 | 81.6 | | Unverified
3 | OmniVec (pretrained) | text-to-video R@10 | 78.6 | | Unverified
4 | HunYuan_tvr (huge) | text-to-video R@1 | 62.9 | | Unverified
5 | CLIP-ViP | text-to-video R@1 | 57.7 | | Unverified
6 | PIDRo | text-to-video R@1 | 55.9 | | Unverified
7 | DMAE (ViT-B/16) | text-to-video R@1 | 55.5 | | Unverified
8 | HunYuan_tvr | text-to-video R@1 | 55 | | Unverified
9 | MuLTI | text-to-video R@1 | 54.7 | | Unverified
10 | EERCF | text-to-video R@1 | 54.1 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Aurora (ours, r=64) | text-to-video R@5 | 77.4 | | Unverified
2 | InternVideo2-6B | text-to-video R@1 | 74.2 | | Unverified
3 | vid-TLDR (UMT-L) | text-to-video R@1 | 72.3 | | Unverified
4 | VAST | text-to-video R@1 | 72 | | Unverified
5 | COSA | text-to-video R@1 | 70.5 | | Unverified
6 | UMT-L (ViT-L/16) | text-to-video R@1 | 70.4 | | Unverified
7 | GRAM | text-to-video R@1 | 67.3 | | Unverified
8 | VALOR | text-to-video R@1 | 61.5 | | Unverified
9 | TESTA (ViT-B/16) | text-to-video R@1 | 61.2 | | Unverified
10 | VindLU | text-to-video R@1 | 61.2 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GRAM | text-to-video R@1 | 64 | | Unverified
2 | VAST | text-to-video R@1 | 63.9 | | Unverified
3 | InternVideo2-6B | text-to-video R@1 | 62.8 | | Unverified
4 | VALOR | text-to-video R@1 | 59.9 | | Unverified
5 | UMT-L (ViT-L/16) | text-to-video R@1 | 58.8 | | Unverified
6 | vid-TLDR (UMT-L) | text-to-video R@1 | 58.1 | | Unverified
7 | COSA | text-to-video R@1 | 57.9 | | Unverified
8 | InternVideo2-6B | text-to-video R@1 | 55.9 | | Unverified
9 | InternVideo | text-to-video R@1 | 55.2 | | Unverified
10 | VLAB | text-to-video R@1 | 55.1 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | EMCL-Net (Ours)++ LSMDC Rohrbach et al. (2015) | text-to-video R@10 | 53.7 | | Unverified
2 | InternVideo2-6B | text-to-video R@1 | 46.4 | | Unverified
3 | vid-TLDR (UMT-L) | text-to-video R@1 | 43.1 | | Unverified
4 | UMT-L (ViT-L/16) | text-to-video R@1 | 43 | | Unverified
5 | HunYuan_tvr (huge) | text-to-video R@1 | 40.4 | | Unverified
6 | COSA | text-to-video R@1 | 39.4 | | Unverified
7 | mPLUG-2 | text-to-video R@1 | 34.4 | | Unverified
8 | VALOR | text-to-video R@1 | 34.2 | | Unverified
9 | InternVideo | text-to-video R@1 | 34 | | Unverified
10 | InternVideo2-6B | text-to-video R@1 | 33.8 | | Unverified