| VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary | Mar 12, 2025 | EgoSchemaRetrieval | CodeCode Available | 4 |
| Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition | Dec 12, 2024 | EgoSchema | CodeCode Available | 3 |
| Flash-VStream: Efficient Real-Time Understanding for Long Video Streams | Jun 30, 2025 | cross-modal alignmentEgoSchema | CodeCode Available | 3 |
| Video ReCap: Recursive Captioning of Hour-Long Videos | Feb 20, 2024 | EgoSchemaVideo Captioning | CodeCode Available | 3 |
| VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos | May 29, 2024 | EgoSchemaMME | CodeCode Available | 2 |
| VideoAgent: Long-form Video Understanding with Large Language Model as Agent | Mar 15, 2024 | EgoSchemaForm | CodeCode Available | 2 |
| LLaVAction: evaluating and training multi-modal large language models for action recognition | Mar 24, 2025 | Action RecognitionAction Understanding | CodeCode Available | 2 |
| Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model | Mar 27, 2025 | EgoSchemaLanguage Modeling | CodeCode Available | 2 |
| TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning | Oct 25, 2024 | EgoSchemaHallucination | CodeCode Available | 2 |
| EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Aug 17, 2023 | DiagnosticEgoSchema | CodeCode Available | 1 |