| VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | Apr 17, 2023 | Audio captioning, Audio-Video Question Answering (AVQA) | Code Available | 2 |
| X^2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | Nov 22, 2022 | All, Cross-Modal Retrieval | Code Available | 2 |
| Perception Test: A Diagnostic Benchmark for Multimodal Models | Oct 19, 2022 | Diagnostic, Multiple-choice | Code Available | 2 |
| Video in 10 Bits: Few-Bit VideoQA for Efficiency and Privacy | Oct 15, 2022 | Feature Compression, Question Answering | Code Available | 2 |
| Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning | Oct 12, 2022 | Contrastive Learning, Form | Code Available | 2 |
| Revealing Single Frame Bias for Video-and-Language Learning | Jun 7, 2022 | Action Recognition, Fine-grained Action Recognition | Code Available | 2 |
| All in One: Exploring Unified Video-Language Pre-training | Mar 14, 2022 | All, Language Modelling | Code Available | 2 |
| Is Space-Time Attention All You Need for Video Understanding? | Feb 9, 2021 | Action Classification, Action Recognition | Code Available | 2 |
| Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder | Jun 28, 2025 | Image Segmentation, Large Language Model | Code Available | 1 |
| VideoMultiAgents: A Multi-Agent Framework for Video Question Answering | Apr 25, 2025 | Caption Generation, EgoSchema | Code Available | 1 |
| EgoToM: Benchmarking Theory of Mind Reasoning from Egocentric Videos | Mar 28, 2025 | Benchmarking, Question Answering | Code Available | 1 |
| Agentic Keyframe Search for Video Question Answering | Mar 20, 2025 | EgoSchema, Question Answering | Code Available | 1 |
| Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos | Mar 17, 2025 | Benchmarking, Question Answering | Code Available | 1 |
| BIMBA: Selective-Scan Compression for Long-Range Video Question Answering | Mar 12, 2025 | Video Question Answering, Zero-Shot Video Question Answer | Code Available | 1 |
| EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering | Feb 11, 2025 | Question Answering, Video Question Answering | Code Available | 1 |
| TUMTraffic-VideoQA: A Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes | Feb 4, 2025 | Autonomous Driving, Multiple-choice | Code Available | 1 |
| ∞-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation | Jan 31, 2025 | Question Answering, Video Question Answering | Code Available | 1 |
| MECD+: Unlocking Event-Level Causal Graph Discovery for Video Reasoning | Jan 13, 2025 | Causal Discovery, Causal Inference | Code Available | 1 |
| ECBench: Can Multi-modal Foundation Models Understand the Egocentric World? A Holistic Embodied Cognition Benchmark | Jan 9, 2025 | Fairness, Hallucination | Code Available | 1 |
| VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format | Nov 27, 2024 | Dense Video Captioning, Grounded Video Question Answering | Code Available | 1 |
| TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models | Oct 14, 2024 | 2k, Benchmarking | Code Available | 1 |
| Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering | Oct 12, 2024 | Answer Generation, Blocking | Code Available | 1 |
| Empowering Large Language Model for Continual Video Question Answering with Collaborative Prompting | Oct 1, 2024 | Continual Learning, Language Modeling | Code Available | 1 |
| VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs | Sep 30, 2024 | EgoSchema, Language Modelling | Code Available | 1 |
| Scene-Text Grounding for Text-Based Video Question Answering | Sep 22, 2024 | 2k, Contrastive Learning | Code Available | 1 |