TGIF-Frame

Papers

Showing 1–15 of 15 papers

Title | Status | Hype
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | Code | 4
All in One: Exploring Unified Video-Language Pre-training | Code | 2
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | Code | 2
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | Code | 2
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | Code | 1
Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | Code | 1
Self-Adaptive Sampling for Efficient Video Question-Answering on Image–Text Models | Code | 1
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | Code | 1
Clover: Towards A Unified Video-Language Alignment and Fusion Model | Code | 1
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | Code | 1
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | Code | 1
VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | — | 0
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | — | 0
MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling | — | 0
Lightweight Recurrent Cross-modal Encoder for Video Question Answering | Code | 0
No leaderboard results yet.