SOTAVerified

Zero-Shot Video Question Answering

This page presents zero-shot question answering results on the TGIF-QA dataset for LLM-powered video conversational models.
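The metric typically reported for this task is answer accuracy: the model receives a video and a question it was never fine-tuned on, and its free-form answer is scored against the ground truth. The sketch below is a minimal illustration of that scoring loop, assuming a hypothetical `predict(video, question)` callable and simple case-insensitive exact matching; real leaderboard entries often use an LLM-based judge instead.

```python
def zero_shot_accuracy(predict, samples):
    """Fraction of samples whose predicted answer matches the ground truth.

    `predict` is any callable (video, question) -> answer string
    (a stand-in for a video conversational model; hypothetical interface).
    `samples` is an iterable of (video, question, answer) triples.
    Matching here is case-insensitive exact match, a deliberately
    simple criterion for illustration.
    """
    correct = 0
    total = 0
    for video, question, answer in samples:
        total += 1
        if predict(video, question).strip().lower() == answer.strip().lower():
            correct += 1
    return correct / total if total else 0.0


# Toy usage with a stub "model" that always answers "cat".
samples = [
    ("vid1.gif", "What animal is shown?", "cat"),
    ("vid2.gif", "What is the animal doing?", "jumping"),
]
acc = zero_shot_accuracy(lambda video, question: "cat", samples)
print(acc)  # one of two answers matches -> 0.5
```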

Papers

Showing 51–85 of 85 papers

| Title | Status | Hype |
| --- | --- | --- |
| Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | Code | 4 |
| A Simple LLM Framework for Long-Range Video Question-Answering | Code | 1 |
| Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos | Code | 1 |
| Vista-LLaMA: Reliable Video Narrator via Equal Distance to Visual Tokens | — | 0 |
| VILA: On Pre-training for Visual Language Models | Code | 4 |
| TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | Code | 2 |
| Zero-Shot Video Question Answering with Procedural Programs | — | 0 |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | Code | 2 |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | Code | 2 |
| Vamos: Versatile Action Models for Video Understanding | Code | 0 |
| Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | Code | 4 |
| Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | Code | 2 |
| Mistral 7B | Code | 6 |
| BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | Code | 1 |
| Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts | Code | 1 |
| EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Code | 1 |
| OmniDataComposer: A Unified Data Structure for Multimodal Data Fusion and Infinite Data Generation | Code | 1 |
| MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | Code | 2 |
| Valley: Video Assistant with Large Language model Enhanced abilitY | Code | 2 |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Code | 3 |
| Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | Code | 4 |
| Self-Chained Image-Language Model for Video Localization and Question Answering | Code | 1 |
| VideoChat: Chat-Centric Video Understanding | Code | 4 |
| LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | Code | 5 |
| mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | Code | 4 |
| Verbs in Action: Improving verb understanding in video-language models | Code | 0 |
| ViperGPT: Visual Inference via Python Execution for Reasoning | Code | 3 |
| VIPeR: Provably Efficient Algorithm for Offline RL with Neural Function Approximation | Code | 0 |
| InternVideo: General Video Foundation Models via Generative and Discriminative Learning | Code | 4 |
| 0/1 Deep Neural Networks via Block Coordinate Descent | — | 0 |
| Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | Code | 1 |
| Flamingo: a Visual Language Model for Few-Shot Learning | Code | 4 |
| MVB: A Large-Scale Dataset for Baggage Re-Identification and Merged Siamese Networks | Code | 0 |
| ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | Code | 0 |
| TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | Code | 0 |
Page 2 of 2

No leaderboard results yet.