| VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI | Oct 15, 2024 | Question AnsweringVideo Question Answering | CodeCode Available | 2 | 5 |
| CausalVQA: A Physically Grounded Causal Reasoning Benchmark for Video Models | Jun 11, 2025 | counterfactualDescriptive | CodeCode Available | 2 | 5 |
| Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models | Dec 24, 2024 | Question AnsweringVideo Question Answering | CodeCode Available | 2 | 5 |
| video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models | Jun 18, 2025 | Audio captioningLarge Language Model | CodeCode Available | 2 | 5 |
| Elysium: Exploring Object-level Perception in Videos via MLLM | Mar 25, 2024 | ObjectObject Tracking | CodeCode Available | 2 | 5 |
| Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning | Oct 12, 2022 | Contrastive LearningForm | CodeCode Available | 2 | 5 |
| All in One: Exploring Unified Video-Language Pre-training | Mar 14, 2022 | AllLanguage Modelling | CodeCode Available | 2 | 5 |
| vid-TLDR: Training Free Token merging for Light-weight Video Transformer | Mar 20, 2024 | Action RecognitionComputational Efficiency | CodeCode Available | 2 | 5 |
| VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection | Nov 22, 2024 | Question AnsweringVideo Question Answering | CodeCode Available | 2 | 5 |
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | May 29, 2023 | Audio captioningAudio-Visual Captioning | CodeCode Available | 2 | 5 |
| Video in 10 Bits: Few-Bit VideoQA for Efficiency and Privacy | Oct 15, 2022 | Feature CompressionQuestion Answering | CodeCode Available | 2 | 5 |
| TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | Dec 4, 2023 | Dense CaptioningHighlight Detection | CodeCode Available | 2 | 5 |
| TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning | Oct 25, 2024 | EgoSchemaHallucination | CodeCode Available | 2 | 5 |
| Task Me Anything | Jun 17, 2024 | 2kAttribute | CodeCode Available | 2 | 5 |
| VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | Apr 17, 2023 | Audio captioningAudio-Video Question Answering (AVQA) | CodeCode Available | 2 | 5 |
| FreeVA: Offline MLLM as Training-Free Video Assistant | May 13, 2024 | FairnessQuestion Answering | CodeCode Available | 2 | 5 |
| Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward | Apr 1, 2024 | Instruction FollowingLanguage Modeling | CodeCode Available | 2 | 5 |
| Streaming Video Question-Answering with In-context Video KV-Cache Retrieval | Mar 1, 2025 | GPUQuestion Answering | CodeCode Available | 2 | 5 |
| Revealing Single Frame Bias for Video-and-Language Learning | Jun 7, 2022 | Action RecognitionFine-grained Action Recognition | CodeCode Available | 2 | 5 |
| LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs | Jun 27, 2025 | Question AnsweringVideo Question Answering | CodeCode Available | 2 | 5 |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | Nov 28, 2023 | Image CaptioningQuestion Answering | CodeCode Available | 2 | 5 |
| ST-LLM: Large Language Models Are Effective Temporal Learners | Mar 30, 2024 | MVBenchReading Comprehension | CodeCode Available | 2 | 5 |
| Perception Test: A Diagnostic Benchmark for Multimodal Video Models | May 23, 2023 | DiagnosticGrounded Video Question Answering | CodeCode Available | 2 | 5 |
| LingoQA: Visual Question Answering for Autonomous Driving | Dec 21, 2023 | Autonomous DrivingDecision Making | CodeCode Available | 2 | 5 |
| LinVT: Empower Your Image-level Large Language Model to Understand Videos | Dec 6, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 2 | 5 |
| Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs | Oct 14, 2024 | Computational EfficiencyQuestion Answering | CodeCode Available | 2 | 5 |
| Perception Test: A Diagnostic Benchmark for Multimodal Models | Oct 19, 2022 | DiagnosticMultiple-choice | CodeCode Available | 2 | 5 |
| PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | Nov 4, 2024 | Caption GenerationMultiple-choice | CodeCode Available | 2 | 5 |
| OmniVid: A Generative Framework for Universal Video Understanding | Mar 26, 2024 | Action RecognitionDecoder | CodeCode Available | 2 | 5 |
| CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion | Feb 8, 2024 | Computational EfficiencyMultimodal Reasoning | CodeCode Available | 2 | 5 |
| Online Video Understanding: OVBench and VideoChat-Online | Dec 31, 2024 | Autonomous DrivingQuestion Answering | CodeCode Available | 2 | 5 |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | Nov 28, 2023 | 3D Question Answering (3D-QA)Diagnostic | CodeCode Available | 2 | 5 |
| LITA: Language Instructed Temporal-Localization Assistant | Mar 27, 2024 | Instruction FollowingTemporal Localization | CodeCode Available | 2 | 5 |
| COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | Jun 15, 2023 | Formmodel | CodeCode Available | 1 | 5 |
| Contrastive Video Question Answering via Video Graph Transformer | Feb 27, 2023 | Contrastive LearningQuestion Answering | CodeCode Available | 1 | 5 |
| A Simple LLM Framework for Long-Range Video Question-Answering | Dec 28, 2023 | EgoSchemaLanguage Modelling | CodeCode Available | 1 | 5 |
| Large Language Models are Temporal and Causal Reasoners for Video Question Answering | Oct 24, 2023 | Natural Language UnderstandingQuestion Answering | CodeCode Available | 1 | 5 |
| LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling | Jun 14, 2022 | DecoderLanguage Modeling | CodeCode Available | 1 | 5 |
| Connecting Vision and Language with Video Localized Narratives | Feb 22, 2023 | Question AnsweringVideo Narrative Grounding | CodeCode Available | 1 | 5 |
| Compositional 4D Dynamic Scenes Understanding with Physics Priors for Video Question Answering | Jun 2, 2024 | counterfactualCounterfactual Reasoning | CodeCode Available | 1 | 5 |
| Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners | May 22, 2022 | AttributeAutomatic Speech Recognition | CodeCode Available | 1 | 5 |
| Just Ask: Learning to Answer Questions from Millions of Narrated Videos | Dec 1, 2020 | Question AnsweringQuestion Generation | CodeCode Available | 1 | 5 |
| AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model | Sep 27, 2023 | Language ModelingLanguage Modelling | CodeCode Available | 1 | 5 |
| Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions | Jul 17, 2020 | Question AnsweringVideo Question Answering | CodeCode Available | 1 | 5 |
| Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling | Oct 8, 2022 | Language ModelingLanguage Modelling | CodeCode Available | 1 | 5 |
| Invariant Grounding for Video Question Answering | Jun 6, 2022 | Question AnsweringVideo Question Answering | CodeCode Available | 1 | 5 |
| MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering | Dec 19, 2022 | FormQuestion Answering | CodeCode Available | 1 | 5 |
| Clover: Towards A Unified Video-Language Alignment and Fusion Model | Jul 16, 2022 | Language ModelingLanguage Modelling | CodeCode Available | 1 | 5 |
| Learning Video Context as Interleaved Multimodal Sequences | Jul 31, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 1 | 5 |
| -Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation | Jan 31, 2025 | Question AnsweringVideo Question Answering | CodeCode Available | 1 | 5 |