Zero-Shot Video Question Answer

This task present the results of Zeroshot Question Answer results on TGIF-QA dataset for LLM powered Video Conversational Models.

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 1–50 of 85 papers

Title	Date	Tasks	Status	Hype	Score
MiniCPM-V: A GPT-4V Level MLLM on Your Phone	Aug 3, 2024	HallucinationMultiple-choice	CodeCode Available	12	5
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution	Sep 18, 2024	Natural Language Visual Grounding	CodeCode Available	11	5
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	Mar 22, 2024	Action ClassificationAction Recognition	CodeCode Available	7	5
Qwen2.5-Omni Technical Report	Mar 26, 2025	Automatic Speech Recognition (ASR)GSM8K	CodeCode Available	7	5
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models	Jul 10, 2024	Video Question AnsweringZero-Shot Video Question Answer	CodeCode Available	7	5
Mistral 7B	Oct 10, 2023	answerability predictionArithmetic Reasoning	CodeCode Available	6	5
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model	Apr 28, 2023	Instruction Followingmodel	CodeCode Available	5	5
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs	Jun 11, 2024	Multiple-choiceQuestion Answering	CodeCode Available	5	5
Flamingo: a Visual Language Model for Few-Shot Learning	Apr 29, 2022	Few-Shot LearningGenerative Visual Question Answering	CodeCode Available	4	5
VILA: On Pre-training for Visual Language Models	Dec 12, 2023	In-Context LearningLanguage Modelling	CodeCode Available	4	5
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens	Apr 4, 2024	Language ModelingLanguage Modelling	CodeCode Available	4	5
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization	Feb 5, 2024	Science Question AnsweringText-to-Video Generation	CodeCode Available	4	5
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality	Apr 27, 2023	Visual Question Answering (VQA)Zero-Shot Video Question Answer	CodeCode Available	4	5
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding	Jun 5, 2023	Language ModelingLanguage Modelling	CodeCode Available	4	5
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection	Nov 16, 2023	Language ModelingLanguage Modelling	CodeCode Available	4	5
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning	Apr 25, 2024	Dense CaptioningMVBench	CodeCode Available	4	5
Tarsier: Recipes for Training and Evaluating Large Video Description Models	Jun 30, 2024	Video CaptioningVideo Description	CodeCode Available	4	5
VideoChat: Chat-Centric Video Understanding	May 10, 2023	Question AnsweringVideo-based Generative Performance Benchmarking	CodeCode Available	4	5
InternVideo: General Video Foundation Models via Generative and Discriminative Learning	Dec 6, 2022	Action ClassificationAction Recognition	CodeCode Available	4	5
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token	Jan 7, 2025	GPUVisual Question Answering (VQA)	CodeCode Available	4	5
Long Context Transfer from Language to Vision	Jun 24, 2024	Language ModelingLanguage Modelling	CodeCode Available	4	5
Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension	Nov 20, 2024	GPUMME	CodeCode Available	3	5
VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding	Jun 13, 2024	Dense Video CaptioningMVBench	CodeCode Available	3	5
Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams	Jun 12, 2024	cross-modal alignmentLanguage Modelling	CodeCode Available	3	5
VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning	Mar 17, 2025	Grounded Video Question AnsweringQuestion Answering	CodeCode Available	3	5
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding	Oct 22, 2024	Token ReductionVideo Question Answering	CodeCode Available	3	5
ViperGPT: Visual Inference via Python Execution for Reasoning	Mar 14, 2023	Code GenerationVideo Question Answering	CodeCode Available	3	5
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models	Jul 22, 2024	Language Modeling	CodeCode Available	3	5
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models	Jun 8, 2023	Question AnsweringVCGBench-Diverse	CodeCode Available	3	5
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context	Mar 8, 2024	1 Image, 2*2 StitchingCode Generation	CodeCode Available	3	5
Video ReCap: Recursive Captioning of Hour-Long Videos	Feb 20, 2024	EgoSchemaVideo Captioning	CodeCode Available	3	5
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding	Nov 14, 2023	Image-based Generative Performance BenchmarkingLanguage Modeling	CodeCode Available	2	5
VideoAgent: Long-form Video Understanding with Large Language Model as Agent	Mar 15, 2024	EgoSchemaForm	CodeCode Available	2	5
Elysium: Exploring Object-level Perception in Videos via MLLM	Mar 25, 2024	ObjectObject Tracking	CodeCode Available	2	5
vid-TLDR: Training Free Token merging for Light-weight Video Transformer	Mar 20, 2024	Action RecognitionComputational Efficiency	CodeCode Available	2	5
LinVT: Empower Your Image-level Large Language Model to Understand Videos	Dec 6, 2024	Language ModelingLanguage Modelling	CodeCode Available	2	5
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models	Nov 28, 2023	Image CaptioningQuestion Answering	CodeCode Available	2	5
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos	May 29, 2024	EgoSchemaMME	CodeCode Available	2	5
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding	Jul 31, 2023	Multiple-choiceQuestion Answering	CodeCode Available	2	5
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	Nov 28, 2023	3D Question Answering (3D-QA)Diagnostic	CodeCode Available	2	5
Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs	Jun 13, 2024	BenchmarkingQuestion Answering	CodeCode Available	2	5
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance	Nov 4, 2024	Caption GenerationMultiple-choice	CodeCode Available	2	5
An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM	Mar 27, 2024	Language ModelingLanguage Modelling	CodeCode Available	2	5
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding	Dec 4, 2023	Dense CaptioningHighlight Detection	CodeCode Available	2	5
TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning	Oct 25, 2024	EgoSchemaHallucination	CodeCode Available	2	5
Understanding Long Videos with Multimodal Language Models	Mar 25, 2024	Action RecognitionFine-grained Action Recognition	CodeCode Available	2	5
Valley: Video Assistant with Large Language model Enhanced abilitY	Jun 12, 2023	Action RecognitionInstruction Following	CodeCode Available	2	5
CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios	Mar 7, 2024	Audio-visual Question AnsweringAudio-Visual Question Answering (AVQA)	CodeCode Available	2	5
Language Repository for Long Video Understanding	Mar 21, 2024	EgoSchemaQuestion Answering	CodeCode Available	1	5
Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos	Dec 16, 2023	Video Captioningvideo narration captioning	CodeCode Available	1	5

Show:10 25 50

← PrevPage 1 of 2Next →

No leaderboard results yet.