| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | Nov 28, 2023 | 3D Question Answering (3D-QA), Diagnostic | Code Available | 2 |
| Perception Test: A Diagnostic Benchmark for Multimodal Models | Oct 19, 2022 | Diagnostic, Multiple-choice | Code Available | 2 |
| Streaming Video Question-Answering with In-context Video KV-Cache Retrieval | Mar 1, 2025 | GPU, Question Answering | Code Available | 2 |
| TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | Dec 4, 2023 | Dense Captioning, Highlight Detection | Code Available | 2 |
| LingoQA: Visual Question Answering for Autonomous Driving | Dec 21, 2023 | Autonomous Driving, Decision Making | Code Available | 2 |
| ISR-DPO: Aligning Large Multimodal Models for Videos by Iterative Self-Retrospective DPO | Jun 17, 2024 | Language Modelling, Question Answering | Code Available | 2 |
| VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | Apr 17, 2023 | Audio captioning, Audio-Video Question Answering (AVQA) | Code Available | 2 |
| LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs | Jun 27, 2025 | Question Answering, Video Question Answering | Code Available | 2 |
| COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | Jun 15, 2023 | Form, model | Code Available | 1 |
| Contrastive Video Question Answering via Video Graph Transformer | Feb 27, 2023 | Contrastive Learning, Question Answering | Code Available | 1 |
| A Simple LLM Framework for Long-Range Video Question-Answering | Dec 28, 2023 | EgoSchema, Language Modelling | Code Available | 1 |
| Connecting Vision and Language with Video Localized Narratives | Feb 22, 2023 | Question Answering, Video Narrative Grounding | Code Available | 1 |
| Compositional 4D Dynamic Scenes Understanding with Physics Priors for Video Question Answering | Jun 2, 2024 | counterfactual, Counterfactual Reasoning | Code Available | 1 |
| MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | Mar 23, 2023 | Auxiliary Learning, Multimodal Sentiment Analysis | Code Available | 1 |
| AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model | Sep 27, 2023 | Language Modeling, Language Modelling | Code Available | 1 |
| MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering | Dec 19, 2022 | Form, Question Answering | Code Available | 1 |
| Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge | Feb 25, 2024 | Computational Efficiency, Language Modelling | Code Available | 1 |
| Clover: Towards A Unified Video-Language Alignment and Fusion Model | Jul 16, 2022 | Language Modeling, Language Modelling | Code Available | 1 |
| Learning Video Context as Interleaved Multimodal Sequences | Jul 31, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| Location-aware Graph Convolutional Networks for Video Question Answering | Aug 7, 2020 | Action Recognition, graph construction | Code Available | 1 |
| An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | Sep 4, 2022 | Fill Mask, Optical Flow Estimation | Code Available | 1 |
| MECD+: Unlocking Event-Level Causal Graph Discovery for Video Reasoning | Jan 13, 2025 | Causal Discovery, Causal Inference | Code Available | 1 |
| Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering | Oct 12, 2024 | Answer Generation, Blocking | Code Available | 1 |
| Learning Situation Hyper-Graphs for Video Question Answering | Apr 18, 2023 | Decoder, Question Answering | Code Available | 1 |
| Learning to Answer Visual Questions from Web Videos | May 10, 2022 | Dataset Generation, Question Answering | Code Available | 1 |