| BIMBA: Selective-Scan Compression for Long-Range Video Question Answering | Mar 12, 2025 | Video Question Answering, Zero-Shot Video Question Answer | Code Available | 1 |
| TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models | Nov 17, 2024 | MVBench, Video-based Generative Performance Benchmarking | Code Available | 1 |
| Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA | Jun 13, 2024 | All, EgoSchema | Code Available | 1 |
| TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering | Apr 1, 2024 | Question Answering, Video Question Answering | Code Available | 1 |
| Language Repository for Long Video Understanding | Mar 21, 2024 | EgoSchema, Question Answering | Code Available | 1 |
| A Simple LLM Framework for Long-Range Video Question-Answering | Dec 28, 2023 | EgoSchema, Language Modelling | Code Available | 1 |
| Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos | Dec 16, 2023 | Video Captioning, video narration captioning | Code Available | 1 |
| BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | Sep 27, 2023 | GPU, Video-based Generative Performance Benchmarking | Code Available | 1 |
| Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts | Sep 27, 2023 | Few-shot Video Question Answering, Prompt Learning | Code Available | 1 |
| EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Aug 17, 2023 | Diagnostic, EgoSchema | Code Available | 1 |
| OmniDataComposer: A Unified Data Structure for Multimodal Data Fusion and Infinite Data Generation | Aug 8, 2023 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Code Available | 1 |
| Self-Chained Image-Language Model for Video Localization and Question Answering | May 11, 2023 | Language Modeling, Language Modelling | Code Available | 1 |
| Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | Jun 16, 2022 | Fill Mask, Language Modeling | Code Available | 1 |
| ENTER: Event Based Interpretable Reasoning for VideoQA | Jan 24, 2025 | Code Generation, EgoSchema | Unverified | 0 |
| VidCtx: Context-aware Video Question Answering with Image Models | Dec 23, 2024 | Large Language Model, Question Answering | Code Available | 0 |
| GPT-4o System Card | Oct 25, 2024 | Multiple-choice, Spatial Reasoning | Unverified | 0 |
| Video Instruction Tuning With Synthetic Data | Oct 3, 2024 | 3D Question Answering (3D-QA) | Unverified | 0 |
| Question-Answering Dense Video Events | Sep 6, 2024 | Benchmarking, Question Answering | Code Available | 0 |
| LLaVA-OneVision: Easy Visual Task Transfer | Aug 6, 2024 | 3D Question Answering (3D-QA) | Code Available | 0 |
| GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding | Jun 14, 2024 | Activity Recognition, MMR total | Unverified | 0 |
| Long Story Short: Story-level Video Understanding from 20K Short Films | Jun 14, 2024 | Multiple Choice Question Answering (MCQA), Open-Ended Question Answering | Unverified | 0 |
| DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs | Jun 6, 2024 | Language Modelling, Large Language Model | Unverified | 0 |
| Streaming Long Video Understanding with Large Language Models | May 25, 2024 | Question Answering, Video Understanding | Unverified | 0 |
| CinePile: A Long Video Question Answering Dataset and Benchmark | May 14, 2024 | Form, Human-Object Interaction Detection | Unverified | 0 |
| MoReVQA: Exploring Modular Reasoning Models for Video Question Answering | Apr 9, 2024 | EgoSchema, Multiple-choice | Unverified | 0 |
| Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering | Feb 16, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| Vista-LLaMA: Reliable Video Narrator via Equal Distance to Visual Tokens | Dec 12, 2023 | Hallucination, Position | Unverified | 0 |
| Zero-Shot Video Question Answering with Procedural Programs | Dec 1, 2023 | Code Generation, Language Modeling | Unverified | 0 |
| Vamos: Versatile Action Models for Video Understanding | Nov 22, 2023 | EgoSchema, Hard Attention | Code Available | 0 |
| Verbs in Action: Improving verb understanding in video-language models | Apr 13, 2023 | Contrastive Learning, Question Answering | Code Available | 0 |
| VIPeR: Provably Efficient Algorithm for Offline RL with Neural Function Approximation | Feb 24, 2023 | Computational Efficiency, Offline RL | Code Available | 0 |
| 0/1 Deep Neural Networks via Block Coordinate Descent | Jun 19, 2022 | 10-shot image generation | Unverified | 0 |
| MVB: A Large-Scale Dataset for Baggage Re-Identification and Merged Siamese Networks | Jul 26, 2019 | Zero-Shot Video Question Answer | Code Available | 0 |
| ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | Jun 6, 2019 | Question Answering, Video Question Answering | Code Available | 0 |
| TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | Apr 14, 2017 | Question Answering, Visual Question Answering | Code Available | 0 |