
Zero-Shot Video Question Answer

This page presents zero-shot question answering results on the TGIF-QA dataset for LLM-powered video conversational models.

Papers

Showing 1–50 of 85 papers

Title | Status | Hype
VideoMultiAgents: A Multi-Agent Framework for Video Question Answering | Code | 1
Qwen2.5-Omni Technical Report | Code | 7
Agentic Keyframe Search for Video Question Answering | Code | 1
VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning | Code | 3
BIMBA: Selective-Scan Compression for Long-Range Video Question Answering | Code | 1
ENTER: Event Based Interpretable Reasoning for VideoQA | | 0
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token | Code | 4
VidCtx: Context-aware Video Question Answering with Image Models | Code | 0
LinVT: Empower Your Image-level Large Language Model to Understand Videos | Code | 2
Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension | Code | 3
TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models | Code | 1
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | Code | 2
GPT-4o System Card | | 0
TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning | Code | 2
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding | Code | 3
Video Instruction Tuning With Synthetic Data | | 0
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | Code | 11
Question-Answering Dense Video Events | Code | 0
LLaVA-OneVision: Easy Visual Task Transfer | Code | 0
MiniCPM-V: A GPT-4V Level MLLM on Your Phone | Code | 12
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | Code | 3
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models | Code | 7
Tarsier: Recipes for Training and Evaluating Large Video Description Models | Code | 4
Long Context Transfer from Language to Vision | Code | 4
Long Story Short: Story-level Video Understanding from 20K Short Films | | 0
GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding | | 0
Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA | Code | 1
VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | Code | 3
Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs | Code | 2
Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams | Code | 3
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | Code | 5
DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs | | 0
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos | Code | 2
Streaming Long Video Understanding with Large Language Models | | 0
CinePile: A Long Video Question Answering Dataset and Benchmark | | 0
PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | Code | 4
MoReVQA: Exploring Modular Reasoning Models for Video Question Answering | | 0
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens | Code | 4
TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering | Code | 1
An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | Code | 2
Understanding Long Videos with Multimodal Language Models | Code | 2
Elysium: Exploring Object-level Perception in Videos via MLLM | Code | 2
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | Code | 7
Language Repository for Long Video Understanding | Code | 1
vid-TLDR: Training Free Token Merging for Light-weight Video Transformer | Code | 2
VideoAgent: Long-form Video Understanding with Large Language Model as Agent | Code | 2
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | Code | 3
CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios | Code | 2
Video ReCap: Recursive Captioning of Hour-Long Videos | Code | 3
Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering | Code | 0
Page 1 of 2

No leaderboard results yet.