| Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | Nov 21, 2022 | Contrastive LearningRepresentation Learning | CodeCode Available | 1 | 5 |
| TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding | Oct 29, 2023 | FormLanguage Modelling | CodeCode Available | 1 | 5 |
| Clover: Towards A Unified Video-Language Alignment and Fusion Model | Jul 16, 2022 | Language ModelingLanguage Modelling | CodeCode Available | 1 | 5 |
| MECD+: Unlocking Event-Level Causal Graph Discovery for Video Reasoning | Jan 13, 2025 | Causal DiscoveryCausal Inference | CodeCode Available | 1 | 5 |
| Equivariant and Invariant Grounding for Video Question Answering | Jul 26, 2022 | Question AnsweringVideo Question Answering | CodeCode Available | 1 | 5 |
| Learning Video Context as Interleaved Multimodal Sequences | Jul 31, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 1 | 5 |
| Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge | Feb 25, 2024 | Computational EfficiencyLanguage Modelling | CodeCode Available | 1 | 5 |
| MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | Mar 23, 2023 | Auxiliary LearningMultimodal Sentiment Analysis | CodeCode Available | 1 | 5 |
| MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering | Dec 19, 2022 | FormQuestion Answering | CodeCode Available | 1 | 5 |
| An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | Sep 4, 2022 | Fill MaskOptical Flow Estimation | CodeCode Available | 1 | 5 |
| SViTT: Temporal Learning of Sparse Video-Text Transformers | Apr 18, 2023 | Question AnsweringRetrieval | CodeCode Available | 1 | 5 |
| VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation | Jun 8, 2021 | Multi-Task LearningQuestion Answering | CodeCode Available | 1 | 5 |
| Scene-Text Grounding for Text-Based Video Question Answering | Sep 22, 2024 | 2kContrastive Learning | CodeCode Available | 1 | 5 |
| Encoding and Controlling Global Semantics for Long-form Video Question Answering | May 30, 2024 | FormQuestion Answering | CodeCode Available | 1 | 5 |
| Empowering Large Language Model for Continual Video Question Answering with Collaborative Prompting | Oct 1, 2024 | Continual LearningLanguage Modeling | CodeCode Available | 1 | 5 |
| Self-Chained Image-Language Model for Video Localization and Question Answering | May 11, 2023 | Language ModelingLanguage Modelling | CodeCode Available | 1 | 5 |
| Connecting Vision and Language with Video Localized Narratives | Feb 22, 2023 | Question AnsweringVideo Narrative Grounding | CodeCode Available | 1 | 5 |
| NExT-QA:Next Phase of Question-Answering to Explaining Temporal Actions | May 18, 2021 | Question AnsweringVideo Question Answering | CodeCode Available | 1 | 5 |
| CausalChaos! Dataset for Comprehensive Causal Action Question Answering Over Longer Causal Chains Grounded in Dynamic Visual Scenes | Apr 1, 2024 | Causal DiscoveryCausal Discovery in Video Reasoning | CodeCode Available | 1 | 5 |
| RTQ: Rethinking Video-language Understanding Based on Image-text Model | Dec 1, 2023 | Video CaptioningVideo Question Answering | CodeCode Available | 1 | 5 |
| EgoToM: Benchmarking Theory of Mind Reasoning from Egocentric Videos | Mar 28, 2025 | BenchmarkingQuestion Answering | CodeCode Available | 1 | 5 |
| EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering | Feb 11, 2025 | Question AnsweringVideo Question Answering | CodeCode Available | 1 | 5 |
| Contrastive Video Question Answering via Video Graph Transformer | Feb 27, 2023 | Contrastive LearningQuestion Answering | CodeCode Available | 1 | 5 |
| From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering | May 30, 2022 | counterfactualDescriptive | CodeCode Available | 1 | 5 |
| FunQA: Towards Surprising Video Comprehension | Jun 26, 2023 | Question AnsweringText Generation | CodeCode Available | 1 | 5 |
| Self-Adaptive Sampling for Efficient Video Question-Answering on Image--Text Models | Jul 9, 2023 | Question AnsweringTGIF-Frame | CodeCode Available | 1 | 5 |
| Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos | Dec 16, 2023 | Video Captioningvideo narration captioning | CodeCode Available | 1 | 5 |
| EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Aug 17, 2023 | DiagnosticEgoSchema | CodeCode Available | 1 | 5 |
| Can I Trust Your Answer? Visually Grounded Video Question Answering | Sep 4, 2023 | Grounded Video Question AnsweringQuestion Answering | CodeCode Available | 1 | 5 |
| A Comprehensive Review of the Video-to-Text Problem | Mar 27, 2021 | Question AnsweringRetrieval | CodeCode Available | 1 | 5 |
| ECBench: Can Multi-modal Foundation Models Understand the Egocentric World? A Holistic Embodied Cognition Benchmark | Jan 9, 2025 | FairnessHallucination | CodeCode Available | 1 | 5 |
| Revisiting the "Video" in Video-Language Understanding | Jun 3, 2022 | BenchmarkingQuestion Answering | CodeCode Available | 1 | 5 |
| AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding | Jun 19, 2024 | Question AnsweringSpatial Reasoning | CodeCode Available | 1 | 5 |
| DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering | Jul 10, 2021 | Graph AttentionQuestion Answering | CodeCode Available | 1 | 5 |
| BIMBA: Selective-Scan Compression for Long-Range Video Question Answering | Mar 12, 2025 | Video Question AnsweringZero-Shot Video Question Answer | CodeCode Available | 1 | 5 |
| Location-aware Graph Convolutional Networks for Video Question Answering | Aug 7, 2020 | Action Recognitiongraph construction | CodeCode Available | 1 | 5 |
| Revitalize Region Feature for Democratizing Video-Language Pre-training of Retrieval | Mar 15, 2022 | Question AnsweringRetrieval | CodeCode Available | 1 | 5 |
| LifeQA: A Real-life Dataset for Video Question Answering | May 1, 2020 | Multiple-choiceQuestion Answering | CodeCode Available | 1 | 5 |
| Discovering Spatio-Temporal Rationales for Video Question Answering | Jul 22, 2023 | Question AnsweringVideo Question Answering | CodeCode Available | 1 | 5 |
| Referring Atomic Video Action Recognition | Jul 2, 2024 | Action LocalizationAction Recognition | CodeCode Available | 1 | 5 |
| Learning to Answer Visual Questions from Web Videos | May 10, 2022 | Dataset GenerationQuestion Answering | CodeCode Available | 1 | 5 |
| AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question Answering | Nov 25, 2023 | Question AnsweringVideo Question Answering | CodeCode Available | 1 | 5 |
| Learning Situation Hyper-Graphs for Video Question Answering | Apr 18, 2023 | DecoderQuestion Answering | CodeCode Available | 1 | 5 |
| Large Language Models are Temporal and Causal Reasoners for Video Question Answering | Oct 24, 2023 | Natural Language UnderstandingQuestion Answering | CodeCode Available | 1 | 5 |
| Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder | Jun 28, 2025 | Image SegmentationLarge Language Model | CodeCode Available | 1 | 5 |
| LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling | Jun 14, 2022 | DecoderLanguage Modeling | CodeCode Available | 1 | 5 |
| Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners | May 22, 2022 | AttributeAutomatic Speech Recognition | CodeCode Available | 1 | 5 |
| Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling | Oct 8, 2022 | Language ModelingLanguage Modelling | CodeCode Available | 1 | 5 |
| Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA | May 13, 2020 | Image CaptioningMulti-Label Classification | CodeCode Available | 1 | 5 |
| DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization | Jun 1, 2021 | Question AnsweringRetrieval | CodeCode Available | 1 | 5 |