SOTAVerified

Zero-Shot Video Question Answer

This task presents zero-shot question answering results on the TGIF-QA dataset for LLM-powered video conversational models.

Papers

Showing 26-50 of 85 papers

Title | Status | Hype
VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | Code | 3
Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams | Code | 3
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | Code | 3
Video ReCap: Recursive Captioning of Hour-Long Videos | Code | 3
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Code | 3
ViperGPT: Visual Inference via Python Execution for Reasoning | Code | 3
LinVT: Empower Your Image-level Large Language Model to Understand Videos | Code | 2
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | Code | 2
TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning | Code | 2
Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs | Code | 2
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos | Code | 2
An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | Code | 2
Understanding Long Videos with Multimodal Language Models | Code | 2
Elysium: Exploring Object-level Perception in Videos via MLLM | Code | 2
vid-TLDR: Training Free Token merging for Light-weight Video Transformer | Code | 2
VideoAgent: Long-form Video Understanding with Large Language Model as Agent | Code | 2
CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios | Code | 2
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | Code | 2
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | Code | 2
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | Code | 2
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | Code | 2
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | Code | 2
Valley: Video Assistant with Large Language model Enhanced abilitY | Code | 2
VideoMultiAgents: A Multi-Agent Framework for Video Question Answering | Code | 1
Agentic Keyframe Search for Video Question Answering | Code | 1
Page 2 of 4

No leaderboard results yet.