| Flash-VStream: Efficient Real-Time Understanding for Long Video Streams | Jun 30, 2025 | cross-modal alignmentEgoSchema | CodeCode Available | 3 |
| EgoVLM: Policy Optimization for Egocentric Video Understanding | Jun 3, 2025 | EgoSchemaQuestion Answering | CodeCode Available | 0 |
| Four Eyes Are Better Than Two: Harnessing the Collaborative Potential of Large Models via Differentiated Thinking and Complementary Ensembles | May 22, 2025 | EgoSchemaFew-Shot Learning | —Unverified | 0 |
| RAVU: Retrieval Augmented Video Understanding with Compositional Reasoning over Graph | May 6, 2025 | EgoSchemaRetrieval | —Unverified | 0 |
| VideoMultiAgents: A Multi-Agent Framework for Video Question Answering | Apr 25, 2025 | Caption GenerationEgoSchema | CodeCode Available | 1 |
| Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model | Mar 27, 2025 | EgoSchemaLanguage Modeling | CodeCode Available | 2 |
| LLaVAction: evaluating and training multi-modal large language models for action recognition | Mar 24, 2025 | Action RecognitionAction Understanding | CodeCode Available | 2 |
| Agentic Keyframe Search for Video Question Answering | Mar 20, 2025 | EgoSchemaQuestion Answering | CodeCode Available | 1 |
| Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing | Mar 13, 2025 | EgoSchemaForm | CodeCode Available | 0 |
| VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary | Mar 12, 2025 | EgoSchemaRetrieval | CodeCode Available | 4 |