SOTAVerified

Zero-Shot Video Question Answer

This task presents zero-shot question-answering results on the TGIF-QA dataset for LLM-powered video conversational models.

Papers

Showing 26-50 of 85 papers

| Title | Status | Hype |
| --- | --- | --- |
| GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding | | 0 |
| Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA | Code | 1 |
| VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | Code | 3 |
| Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs | Code | 2 |
| Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams | Code | 3 |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | Code | 5 |
| DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs | | 0 |
| VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos | Code | 2 |
| Streaming Long Video Understanding with Large Language Models | | 0 |
| CinePile: A Long Video Question Answering Dataset and Benchmark | | 0 |
| PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | Code | 4 |
| MoReVQA: Exploring Modular Reasoning Models for Video Question Answering | | 0 |
| MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens | Code | 4 |
| TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering | Code | 1 |
| An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | Code | 2 |
| Understanding Long Videos with Multimodal Language Models | Code | 2 |
| Elysium: Exploring Object-level Perception in Videos via MLLM | Code | 2 |
| InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | Code | 7 |
| Language Repository for Long Video Understanding | Code | 1 |
| vid-TLDR: Training Free Token merging for Light-weight Video Transformer | Code | 2 |
| VideoAgent: Long-form Video Understanding with Large Language Model as Agent | Code | 2 |
| Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | Code | 3 |
| CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios | Code | 2 |
| Video ReCap: Recursive Captioning of Hour-Long Videos | Code | 3 |
| Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering | Code | 0 |

No leaderboard results yet.