SOTAVerified

Video Retrieval

The objective of video retrieval is as follows: given a text query and a pool of candidate videos, select the video that corresponds to the text query. Typically, the candidates are returned as a ranked list and scored with document retrieval metrics such as Recall@K (R@K).
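As a rough illustration of how a ranked list is scored, the sketch below computes Recall@K from a text-to-video similarity matrix. It assumes one ground-truth video per query, with query i matching video i. The function names and the toy scores are hypothetical and not taken from any listed paper or benchmark.

```python
from typing import Dict, List, Sequence


def rank_of_correct(scores: Sequence[float], correct_index: int) -> int:
    """Return the 1-based rank of the correct video for one text query."""
    target = scores[correct_index]
    # Count candidates that score strictly higher than the correct video.
    # Ties are resolved in favor of the correct video (an optimistic choice).
    return 1 + sum(1 for s in scores if s > target)


def recall_at_k(
    similarity: Sequence[Sequence[float]],
    ks: Sequence[int] = (1, 5, 10),
) -> Dict[int, float]:
    """Recall@K in percent, assuming query i matches video i."""
    ranks: List[int] = [
        rank_of_correct(row, i) for i, row in enumerate(similarity)
    ]
    # R@K is the share of queries whose correct video is in the top K.
    return {k: 100.0 * sum(r <= k for r in ranks) / len(ranks) for k in ks}


# Toy 3x3 similarity matrix (made-up numbers): rows are text queries,
# columns are candidate videos.
similarity = [
    [0.9, 0.1, 0.3],  # query 0: correct video is ranked 1st
    [0.8, 0.5, 0.2],  # query 1: correct video is ranked 2nd
    [0.2, 0.4, 0.7],  # query 2: correct video is ranked 1st
]
results = recall_at_k(similarity, ks=(1, 2))
print(results)
```

In this toy case two of the three queries rank their correct video first, so R@1 is about 66.7 and R@2 is 100. Real evaluations would build the similarity matrix from model embeddings over the full test set.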

Papers

Showing 151-200 of 486 papers

Title | Status | Hype
COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning | Code | 1
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling | Code | 1
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | Code | 1
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | Code | 1
Generalized Few-Shot Video Classification with Video Retrieval and Feature Generation | Code | 1
AVLnet: Learning Audio-Visual Language Representations from Instructional Videos | Code | 1
CoVR-2: Automatic Data Construction for Composed Video Retrieval | Code | 1
Multi-modal Transformer for Video Retrieval | Code | 1
GMMFormer: Gaussian-Mixture-Model Based Transformer for Efficient Partially Relevant Video Retrieval | Code | 1
GMMFormer v2: An Uncertainty-aware Framework for Partially Relevant Video Retrieval | Code | 1
MUSE: Mamba is Efficient Multi-scale Learner for Text-video Retrieval | Code | 1
Multi-Query Video Retrieval | Code | 1
TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval | Code | 1
Cross-Architecture Self-supervised Video Representation Learning | Code | 1
TransRank: Self-supervised Video Representation Learning via Ranking-based Transformation Recognition | Code | 1
Normalized Contrastive Learning for Text-Video Retrieval | Code | 1
Cross Modal Retrieval with Querybank Normalisation | Code | 1
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training | Code | 1
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | Code | 1
Hierarchical Video-Moment Retrieval and Step-Captioning | Code | 1
DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization | Code | 1
Bridging Video-text Retrieval with Multiple Choice Questions | Code | 1
Holistic Features are almost Sufficient for Text-to-Video Retrieval | Code | 1
Text Proxy: Decomposing Retrieval from a 1-to-N Relationship into N 1-to-1 Relationships for Text-Video Retrieval | Code | 1
VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling | Code | 1
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips | Code | 1
HowToCaption: Prompting LLMs to Transform Video Annotations at Scale | Code | 1
Enhancing Subsequent Video Retrieval via Vision-Language Models (VLMs) | Code | 0
T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval | Code | 0
Talking Face Generation by Adversarially Disentangled Audio-Visual Representation | Code | 0
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language | Code | 0
Semantic Role Aware Correlation Transformer for Text to Video Retrieval | Code | 0
Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval | Code | 0
Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer | Code | 0
Efficient Cross-Modal Video Retrieval with Meta-Optimized Frames | Code | 0
Accommodating Audio Modality in CLIP for Multimodal Processing | Code | 0
Video Logo Retrieval based on local Features | Code | 0
ECO: Efficient Convolutional Network for Online Video Understanding | Code | 0
A Joint Sequence Fusion Model for Video Question Answering and Retrieval | Code | 0
Self-supervised Video Representation Learning with Cascade Positive Retrieval | Code | 0
Dual Encoding for Zero-Example Video Retrieval | Code | 0
Self-supervised Video Representation Learning by Context and Motion Decoupling | Code | 0
SEA: Sentence Encoder Assembly for Video Retrieval by Textual Queries | Code | 0
SEDS: Semantically Enhanced Dual-Stream Encoder for Sign Language Retrieval | Code | 0
LAMV: Learning to Align and Match Videos With Kernelized Temporal Layers | Code | 0
Discriminative Residual Analysis for Image Set Classification with Posture and Age Variations | Code | 0
Circulant temporal encoding for video retrieval and temporal alignment | Code | 0
Central Similarity Quantization for Efficient Image and Video Retrieval | Code | 0
Rudder: A Cross Lingual Video and Text Retrieval Dataset | Code | 0
Joint Searching and Grounding: Multi-Granularity Video Content Retrieval | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | OmniVec | text-to-video R@10 | 89.4 | | Unverified
2 | CLIP4Clip | text-to-video R@10 | 81.6 | | Unverified
3 | OmniVec (pretrained) | text-to-video R@10 | 78.6 | | Unverified
4 | HunYuan_tvr (huge) | text-to-video R@1 | 62.9 | | Unverified
5 | CLIP-ViP | text-to-video R@1 | 57.7 | | Unverified
6 | PIDRo | text-to-video R@1 | 55.9 | | Unverified
7 | DMAE (ViT-B/16) | text-to-video R@1 | 55.5 | | Unverified
8 | HunYuan_tvr | text-to-video R@1 | 55 | | Unverified
9 | MuLTI | text-to-video R@1 | 54.7 | | Unverified
10 | EERCF | text-to-video R@1 | 54.1 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Aurora (ours, r=64) | text-to-video R@5 | 77.4 | | Unverified
2 | InternVideo2-6B | text-to-video R@1 | 74.2 | | Unverified
3 | vid-TLDR (UMT-L) | text-to-video R@1 | 72.3 | | Unverified
4 | VAST | text-to-video R@1 | 72 | | Unverified
5 | COSA | text-to-video R@1 | 70.5 | | Unverified
6 | UMT-L (ViT-L/16) | text-to-video R@1 | 70.4 | | Unverified
7 | GRAM | text-to-video R@1 | 67.3 | | Unverified
8 | VALOR | text-to-video R@1 | 61.5 | | Unverified
9 | TESTA (ViT-B/16) | text-to-video R@1 | 61.2 | | Unverified
10 | VindLU | text-to-video R@1 | 61.2 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GRAM | text-to-video R@1 | 64 | | Unverified
2 | VAST | text-to-video R@1 | 63.9 | | Unverified
3 | InternVideo2-6B | text-to-video R@1 | 62.8 | | Unverified
4 | VALOR | text-to-video R@1 | 59.9 | | Unverified
5 | UMT-L (ViT-L/16) | text-to-video R@1 | 58.8 | | Unverified
6 | vid-TLDR (UMT-L) | text-to-video R@1 | 58.1 | | Unverified
7 | COSA | text-to-video R@1 | 57.9 | | Unverified
8 | InternVideo2-6B | text-to-video R@1 | 55.9 | | Unverified
9 | InternVideo | text-to-video R@1 | 55.2 | | Unverified
10 | VLAB | text-to-video R@1 | 55.1 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | EMCL-Net (Ours)++ LSMDC Rohrbach et al. (2015) | text-to-video R@10 | 53.7 | | Unverified
2 | InternVideo2-6B | text-to-video R@1 | 46.4 | | Unverified
3 | vid-TLDR (UMT-L) | text-to-video R@1 | 43.1 | | Unverified
4 | UMT-L (ViT-L/16) | text-to-video R@1 | 43 | | Unverified
5 | HunYuan_tvr (huge) | text-to-video R@1 | 40.4 | | Unverified
6 | COSA | text-to-video R@1 | 39.4 | | Unverified
7 | mPLUG-2 | text-to-video R@1 | 34.4 | | Unverified
8 | VALOR | text-to-video R@1 | 34.2 | | Unverified
9 | InternVideo | text-to-video R@1 | 34 | | Unverified
10 | InternVideo2-6B | text-to-video R@1 | 33.8 | | Unverified