| BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | Sep 27, 2023 | GPU, Video-based Generative Performance Benchmarking | Code Available | 1 |
| Clover: Towards A Unified Video-Language Alignment and Fusion Model | Jul 16, 2022 | Language Modeling, Language Modelling | Code Available | 1 |
| MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering | Dec 19, 2022 | Form, Question Answering | Code Available | 1 |
| Equivariant and Invariant Grounding for Video Question Answering | Jul 26, 2022 | Question Answering, Video Question Answering | Code Available | 1 |
| Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | Nov 21, 2022 | Contrastive Learning, Representation Learning | Code Available | 1 |
| Learning Video Context as Interleaved Multimodal Sequences | Jul 31, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| MECD+: Unlocking Event-Level Causal Graph Discovery for Video Reasoning | Jan 13, 2025 | Causal Discovery, Causal Inference | Code Available | 1 |
| NExT-QA:Next Phase of Question-Answering to Explaining Temporal Actions | May 18, 2021 | Question Answering, Video Question Answering | Code Available | 1 |
| Hierarchical Conditional Relation Networks for Video Question Answering | Feb 25, 2020 | Audio-Visual Question Answering (AVQA), Question Answering | Code Available | 1 |
| HawkEye: Training Video-Text LLMs for Grounding Text in Videos | Mar 15, 2024 | Video Grounding, Video Question Answering | Code Available | 1 |
| An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | Sep 4, 2022 | Fill Mask, Optical Flow Estimation | Code Available | 1 |
| Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models | Oct 9, 2023 | Question Answering, Video Question Answering | Code Available | 1 |
| Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | Aug 18, 2023 | Multiple-choice, Question Answering | Code Available | 1 |
| Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge | Feb 25, 2024 | Computational Efficiency, Language Modelling | Code Available | 1 |
| MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | Mar 23, 2023 | Auxiliary Learning, Multimodal Sentiment Analysis | Code Available | 1 |
| RTQ: Rethinking Video-language Understanding Based on Image-text Model | Dec 1, 2023 | Video Captioning, Video Question Answering | Code Available | 1 |
| Location-aware Graph Convolutional Networks for Video Question Answering | Aug 7, 2020 | Action Recognition, graph construction | Code Available | 1 |
| Encoding and Controlling Global Semantics for Long-form Video Question Answering | May 30, 2024 | Form, Question Answering | Code Available | 1 |
| Empowering Large Language Model for Continual Video Question Answering with Collaborative Prompting | Oct 1, 2024 | Continual Learning, Language Modeling | Code Available | 1 |
| CausalChaos! Dataset for Comprehensive Causal Action Question Answering Over Longer Causal Chains Grounded in Dynamic Visual Scenes | Apr 1, 2024 | Causal Discovery, Causal Discovery in Video Reasoning | Code Available | 1 |
| Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos | Aug 26, 2024 | Form, Language Modelling | Code Available | 1 |
| EgoToM: Benchmarking Theory of Mind Reasoning from Egocentric Videos | Mar 28, 2025 | Benchmarking, Question Answering | Code Available | 1 |
| EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering | Feb 11, 2025 | Question Answering, Video Question Answering | Code Available | 1 |
| From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering | May 30, 2022 | counterfactual, Descriptive | Code Available | 1 |
| EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Aug 17, 2023 | Diagnostic, EgoSchema | Code Available | 1 |