| VindLU: A Recipe for Effective Video-and-Language Pretraining | Dec 9, 2022 | Question Answering, Retrieval | Code Available | 1 |
| Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | Nov 21, 2022 | Contrastive Learning, Representation Learning | Code Available | 1 |
| Visual Commonsense-aware Representation Network for Video Captioning | Nov 17, 2022 | Caption Generation, Question Answering | Code Available | 1 |
| Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling | Oct 8, 2022 | Language Modeling, Language Modelling | Code Available | 1 |
| An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | Sep 4, 2022 | Fill Mask, Optical Flow Estimation | Code Available | 1 |
| Equivariant and Invariant Grounding for Video Question Answering | Jul 26, 2022 | Question Answering, Video Question Answering | Code Available | 1 |
| Clover: Towards A Unified Video-Language Alignment and Fusion Model | Jul 16, 2022 | Language Modeling, Language Modelling | Code Available | 1 |
| Video Graph Transformer for Video Question Answering | Jul 12, 2022 | Question Answering, Relation | Code Available | 1 |
| Video Dialog as Conversation about Objects Living in Space-Time | Jul 8, 2022 | Object, Relational Reasoning | Code Available | 1 |
| Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | Jun 16, 2022 | Fill Mask, Language Modeling | Code Available | 1 |
| LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling | Jun 14, 2022 | Decoder, Language Modeling | Code Available | 1 |
| Invariant Grounding for Video Question Answering | Jun 6, 2022 | Question Answering, Video Question Answering | Code Available | 1 |
| Revisiting the "Video" in Video-Language Understanding | Jun 3, 2022 | Benchmarking, Question Answering | Code Available | 1 |
| From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering | May 30, 2022 | counterfactual, Descriptive | Code Available | 1 |
| Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners | May 22, 2022 | Attribute, Automatic Speech Recognition | Code Available | 1 |
| Learning to Answer Visual Questions from Web Videos | May 10, 2022 | Dataset Generation, Question Answering | Code Available | 1 |
| Revitalize Region Feature for Democratizing Video-Language Pre-training of Retrieval | Mar 15, 2022 | Question Answering, Retrieval | Code Available | 1 |
| Video Question Answering: Datasets, Algorithms and Challenges | Mar 2, 2022 | Question Answering, Video Question Answering | Code Available | 1 |
| Video as Conditional Graph Hierarchy for Multi-Granular Question Answering | Dec 12, 2021 | Question Answering, Video Question Answering | Code Available | 1 |
| AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant | Nov 30, 2021 | Question Answering, Retrieval | Code Available | 1 |
| SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning | Nov 25, 2021 | Caption Generation, Question Answering | Code Available | 1 |
| VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling | Nov 24, 2021 | Question Answering, Retrieval | Code Available | 1 |
| Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering | Sep 10, 2021 | multimodal interaction, Natural Language Understanding | Code Available | 1 |
| DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering | Jul 10, 2021 | Graph Attention, Question Answering | Code Available | 1 |
| NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions | Jun 19, 2021 | Question Answering, Video Question Answering | Code Available | 1 |
| VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation | Jun 8, 2021 | Multi-Task Learning, Question Answering | Code Available | 1 |
| DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization | Jun 1, 2021 | Question Answering, Retrieval | Code Available | 1 |
| NExT-QA:Next Phase of Question-Answering to Explaining Temporal Actions | May 18, 2021 | Question Answering, Video Question Answering | Code Available | 1 |
| SUTD-TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning over Traffic Events | Mar 29, 2021 | Autonomous Vehicles, Benchmarking | Code Available | 1 |
| A Comprehensive Review of the Video-to-Text Problem | Mar 27, 2021 | Question Answering, Retrieval | Code Available | 1 |
| On the hidden treasure of dialog in video question answering | Mar 26, 2021 | Question Answering, Video Question Answering | Code Available | 1 |
| Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling | Feb 11, 2021 | Question Answering, Retrieval | Code Available | 1 |
| CRAFT: A Benchmark for Causal Reasoning About Forces and inTeractions | Dec 8, 2020 | counterfactual, Descriptive | Code Available | 1 |
| Just Ask: Learning to Answer Questions from Millions of Narrated Videos | Dec 1, 2020 | Question Answering, Question Generation | Code Available | 1 |
| Location-aware Graph Convolutional Networks for Video Question Answering | Aug 7, 2020 | Action Recognition, graph construction | Code Available | 1 |
| Visual Relation Grounding in Videos | Jul 17, 2020 | Question Answering, Relation | Code Available | 1 |
| Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions | Jul 17, 2020 | Question Answering, Video Question Answering | Code Available | 1 |
| Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA | May 13, 2020 | Image Captioning, Multi-Label Classification | Code Available | 1 |
| LifeQA: A Real-life Dataset for Video Question Answering | May 1, 2020 | Multiple-choice, Question Answering | Code Available | 1 |
| HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training | May 1, 2020 | Language Modeling, Language Modelling | Code Available | 1 |
| Hierarchical Conditional Relation Networks for Video Question Answering | Feb 25, 2020 | Audio-Visual Question Answering (AVQA), Question Answering | Code Available | 1 |
| How Far Can Off-the-Shelf Multimodal Large Language Models Go in Online Episodic Memory Question Answering? | Jun 19, 2025 | Multiple-choice, Question Answering | Unverified | 0 |
| CogStream: Context-guided Streaming Video Question Answering | Jun 12, 2025 | Question Answering, Video Question Answering | Unverified | 0 |
| Looking Beyond Visible Cues: Implicit Video Question Answering via Dual-Clue Reasoning | Jun 9, 2025 | Future prediction, Question Answering | Code Available | 0 |
| EgoVLM: Policy Optimization for Egocentric Video Understanding | Jun 3, 2025 | EgoSchema, Question Answering | Code Available | 0 |
| VUDG: A Dataset for Video Understanding Domain Generalization | May 30, 2025 | Domain Generalization, Multiple-choice | Unverified | 0 |
| Grid-LOGAT: Grid Based Local and Global Area Transcription for Video Question Answering | May 30, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos | May 29, 2025 | Question Answering, Video Generation | Code Available | 0 |
| LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval | May 21, 2025 | Autonomous Driving, Question Answering | Unverified | 0 |
| ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation | May 21, 2025 | Decision Making, Language Modeling | Code Available | 0 |