SOTAVerified

Video Retrieval

The objective of video retrieval is as follows: given a text query and a pool of candidate videos, select the video that corresponds to the text query. Typically, the candidates are returned as a ranked list and scored with document retrieval metrics such as recall at K (R@K).
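As a rough illustration of how a ranked list is turned into a score, the sketch below computes recall at K from a query-by-video similarity matrix. It is a minimal sketch under stated assumptions, not the evaluation code of any listed paper: the function names `rank_of_correct` and `recall_at_k` are hypothetical, each query is assumed to have exactly one correct video (the one with the same index), and ties are resolved in favour of the correct video.

```python
from typing import Sequence


def rank_of_correct(scores: Sequence[float], correct_index: int) -> int:
    """Return the 1-based rank of the correct video under descending score order."""
    target = scores[correct_index]
    # Count candidates that score strictly higher than the correct video.
    # Assumption: ties are resolved in favour of the correct video.
    return 1 + sum(1 for s in scores if s > target)


def recall_at_k(similarity: Sequence[Sequence[float]], k: int) -> float:
    """Percentage of queries whose correct video appears in the top k.

    Assumption: similarity[i][j] scores query i against video j, and
    video i is the single correct match for query i.
    """
    hits = sum(
        1 for i, row in enumerate(similarity) if rank_of_correct(row, i) <= k
    )
    return 100.0 * hits / len(similarity)


# Toy similarity matrix (made-up numbers, three queries and three videos).
similarity = [
    [0.9, 0.1, 0.3],  # query 0: correct video is ranked 1st
    [0.8, 0.5, 0.2],  # query 1: correct video is ranked 2nd
    [0.4, 0.6, 0.1],  # query 2: correct video is ranked 3rd
]

for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(similarity, k):.1f}")
```

On this toy matrix only the first query ranks its video first, so R@1 is one in three queries, and R@K can only grow as K increases. Published benchmarks may handle ties and multiple captions per video differently.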

Papers

Showing 151-200 of 486 papers

Title | Status | Hype
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | Code | 2
Robust Cross-Modal Knowledge Distillation for Unconstrained Videos | Code | 1
LASER: A Neuro-Symbolic Framework for Learning Spatial-Temporal Scene Graphs with Weak Supervision | - | 0
Self-Supervised Video Similarity Learning | Code | 1
Perfect Match in Video Retrieval | - | 0
Free-Form Multi-Modal Multimedia Retrieval (4MR) | - | 0
Hierarchical Video-Moment Retrieval and Step-Captioning | Code | 1
Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding | - | 0
Unmasked Teacher: Towards Training-Efficient Video Foundation Models | Code | 0
Colo-SCRL: Self-Supervised Contrastive Representation Learning for Colonoscopic Video Retrieval | - | 0
Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning | Code | 1
Aligning Step-by-Step Instructional Diagrams to Video Demonstrations | Code | 0
Dialogue-to-Video Retrieval | Code | 0
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | Code | 1
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model | Code | 1
VVS: Video-to-Video Retrieval with Irrelevant Frame Suppression | Code | 1
Accommodating Audio Modality in CLIP for Multimodal Processing | Code | 0
MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling | - | 0
Improving Video Retrieval by Adaptive Margin | - | 0
STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training | - | 0
Video-Text Retrieval by Supervised Sparse Multi-Grained Learning | Code | 0
Is Multimodal Vision Supervision Beneficial to Language? | Code | 0
Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer | Code | 0
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | Code | 4
Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring | Code | 1
Zorro: the masked multimodal transformer | Code | 0
Temporal Perceiving Video-Language Pre-training | - | 0
UATVR: Uncertainty-Adaptive Text-Video Retrieval | Code | 1
Learning Trajectory-Word Alignments for Video-Language Tasks | - | 0
PIDRo: Parallel Isomeric Attention with Dynamic Routing for Text-Video Retrieval | - | 0
HiVLP: Hierarchical Interactive Video-Language Pre-Training | - | 0
Exploring Temporal Concurrency for Video-Language Representation Learning | Code | 0
Dual Learning with Dynamic Knowledge Distillation for Partially Relevant Video Retrieval | Code | 1
Progressive Spatio-Temporal Prototype Matching for Text-Video Retrieval | Code | 1
Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? | Code | 2
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | - | 0
TempCLR: Temporal Alignment Representation with Contrastive Learning | Code | 1
You were saying? - Spoken Language in the V3C Dataset | Code | 0
Contextual Explainable Video Representation: Human Perception-based Understanding | Code | 0
VindLU: A Recipe for Effective Video-and-Language Pretraining | Code | 1
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | - | 0
InternVideo: General Video Foundation Models via Generative and Discriminative Learning | Code | 4
Masked Contrastive Pre-Training for Efficient Video-Text Retrieval | - | 0
Normalized Contrastive Learning for Text-Video Retrieval | Code | 1
Renmin University of China at TRECVID 2022: Improving Video Search by Feature Fusion and Negation Understanding | - | 0
VoP: Text-Video Co-operative Prompt Tuning for Cross-Modal Retrieval | Code | 1
TransVCL: Attention-enhanced Video Copy Localization Network with Flexible Supervision | Code | 1
X^2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | Code | 2
Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval | Code | 0
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | OmniVec | text-to-video R@10 | 89.4 | - | Unverified
2 | CLIP4Clip | text-to-video R@10 | 81.6 | - | Unverified
3 | OmniVec (pretrained) | text-to-video R@10 | 78.6 | - | Unverified
4 | HunYuan_tvr (huge) | text-to-video R@1 | 62.9 | - | Unverified
5 | CLIP-ViP | text-to-video R@1 | 57.7 | - | Unverified
6 | PIDRo | text-to-video R@1 | 55.9 | - | Unverified
7 | DMAE (ViT-B/16) | text-to-video R@1 | 55.5 | - | Unverified
8 | HunYuan_tvr | text-to-video R@1 | 55 | - | Unverified
9 | MuLTI | text-to-video R@1 | 54.7 | - | Unverified
10 | STAN | text-to-video R@1 | 54.1 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Aurora (ours, r=64) | text-to-video R@5 | 77.4 | - | Unverified
2 | InternVideo2-6B | text-to-video R@1 | 74.2 | - | Unverified
3 | vid-TLDR (UMT-L) | text-to-video R@1 | 72.3 | - | Unverified
4 | VAST | text-to-video R@1 | 72 | - | Unverified
5 | COSA | text-to-video R@1 | 70.5 | - | Unverified
6 | UMT-L (ViT-L/16) | text-to-video R@1 | 70.4 | - | Unverified
7 | GRAM | text-to-video R@1 | 67.3 | - | Unverified
8 | VALOR | text-to-video R@1 | 61.5 | - | Unverified
9 | TESTA (ViT-B/16) | text-to-video R@1 | 61.2 | - | Unverified
10 | VindLU | text-to-video R@1 | 61.2 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GRAM | text-to-video R@1 | 64 | - | Unverified
2 | VAST | text-to-video R@1 | 63.9 | - | Unverified
3 | InternVideo2-6B | text-to-video R@1 | 62.8 | - | Unverified
4 | VALOR | text-to-video R@1 | 59.9 | - | Unverified
5 | UMT-L (ViT-L/16) | text-to-video R@1 | 58.8 | - | Unverified
6 | vid-TLDR (UMT-L) | text-to-video R@1 | 58.1 | - | Unverified
7 | COSA | text-to-video R@1 | 57.9 | - | Unverified
8 | InternVideo2-6B | text-to-video R@1 | 55.9 | - | Unverified
9 | InternVideo | text-to-video R@1 | 55.2 | - | Unverified
10 | VLAB | text-to-video R@1 | 55.1 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | EMCL-Net (Ours)++ LSMDC Rohrbach et al. (2015) | text-to-video R@10 | 53.7 | - | Unverified
2 | InternVideo2-6B | text-to-video R@1 | 46.4 | - | Unverified
3 | vid-TLDR (UMT-L) | text-to-video R@1 | 43.1 | - | Unverified
4 | UMT-L (ViT-L/16) | text-to-video R@1 | 43 | - | Unverified
5 | HunYuan_tvr (huge) | text-to-video R@1 | 40.4 | - | Unverified
6 | COSA | text-to-video R@1 | 39.4 | - | Unverified
7 | mPLUG-2 | text-to-video R@1 | 34.4 | - | Unverified
8 | VALOR | text-to-video R@1 | 34.2 | - | Unverified
9 | InternVideo | text-to-video R@1 | 34 | - | Unverified
10 | InternVideo2-6B | text-to-video R@1 | 33.8 | - | Unverified