| From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering | May 30, 2022 | counterfactual, Descriptive | Code Available | 1 |
| Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners | May 22, 2022 | Attribute, Automatic Speech Recognition | Code Available | 1 |
| Modeling Semantic Composition with Syntactic Hypergraph for Video Question Answering | May 13, 2022 | Question Answering, Semantic Composition | Unverified | 0 |
| Learning to Answer Visual Questions from Web Videos | May 10, 2022 | Dataset Generation, Question Answering | Code Available | 1 |
| Multilevel Hierarchical Network with Multiscale Sampling for Video Question Answering | May 9, 2022 | multimodal interaction, Question Answering | Code Available | 0 |
| Overview of the MedVidQA 2022 Shared Task on Medical Video Question-Answering | May 1, 2022 | Question Answering, Video Classification | Unverified | 0 |
| Video Language Co-Attention with Multimodal Fast-Learning Feature Fusion for VideoQA | May 1, 2022 | Question Answering, Video Question Answering | Unverified | 0 |
| Flamingo: a Visual Language Model for Few-Shot Learning | Apr 29, 2022 | Few-Shot Learning, Generative Visual Question Answering | Code Available | 4 |
| Rethinking Multi-Modal Alignment in Video Question Answering from Feature and Sample Perspectives | Apr 25, 2022 | Question Answering, Video Question Answering | Unverified | 0 |
| Measuring Compositional Consistency for Video Question Answering | Apr 14, 2022 | Question Answering, Video Question Answering | Unverified | 0 |
| Revitalize Region Feature for Democratizing Video-Language Pre-training of Retrieval | Mar 15, 2022 | Question Answering, Retrieval | Code Available | 1 |
| All in One: Exploring Unified Video-Language Pre-training | Mar 14, 2022 | All, Language Modelling | Code Available | 2 |
| Video Question Answering: Datasets, Algorithms and Challenges | Mar 2, 2022 | Question Answering, Video Question Answering | Code Available | 1 |
| (2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering | Feb 18, 2022 | Question Answering, Spatio-temporal Scene Graphs | Unverified | 0 |
| NEWSKVQA: Knowledge-Aware News Video Question Answering | Feb 8, 2022 | Common Sense Reasoning, Management | Unverified | 0 |
| CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising | Dec 14, 2021 | Cross-Modal Retrieval, Decoder | Unverified | 0 |
| Video as Conditional Graph Hierarchy for Multi-Granular Question Answering | Dec 12, 2021 | Question Answering, Video Question Answering | Code Available | 1 |
| Learning from Inside: Self-driven Siamese Sampling and Reasoning for Video Question Answering | Dec 1, 2021 | Multimodal Reasoning, Question Answering | Unverified | 0 |
| AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant | Nov 30, 2021 | Question Answering, Retrieval | Code Available | 1 |
| LiVLR: A Lightweight Visual-Linguistic Reasoning Framework for Video Question Answering | Nov 29, 2021 | Diversity, Question Answering | Unverified | 0 |
| SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning | Nov 25, 2021 | Caption Generation, Question Answering | Code Available | 1 |
| VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling | Nov 24, 2021 | Question Answering, Retrieval | Code Available | 1 |
| Fill-in-the-Blank: A Challenging Video Understanding Evaluation Framework | Nov 16, 2021 | Multiple-choice, Question Answering | Unverified | 0 |
| CRAFT: A Benchmark for Causal Reasoning About Forces and inTeractions | Nov 16, 2021 | counterfactual, Descriptive | Unverified | 0 |
| Transferring Domain-Agnostic Knowledge in Video Question Answering | Oct 26, 2021 | Question Answering, Transfer Learning | Unverified | 0 |
| Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering | Sep 10, 2021 | multimodal interaction, Natural Language Understanding | Code Available | 1 |
| The Multi-Modal Video Reasoning and Analyzing Competition | Aug 18, 2021 | Action Recognition, Person Re-Identification | Unverified | 0 |
| Mounting Video Metadata on Transformer-based Language Model for Open-ended Video Question Answering | Aug 11, 2021 | Language Modeling, Language Modelling | Unverified | 0 |
| Multi-Scale Progressive Attention Network for Video Question Answering | Aug 1, 2021 | Question Answering, Relational Reasoning | Unverified | 0 |
| CogME: A Cognition-Inspired Multi-Dimensional Evaluation Metric for Story Understanding | Jul 21, 2021 | Question Answering, Sentence | Unverified | 0 |
| DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering | Jul 10, 2021 | Graph Attention, Question Answering | Code Available | 1 |
| iReason: Multimodal Commonsense Reasoning using Videos and Natural Language with Interpretability | Jun 25, 2021 | Bias Detection, Question Answering | Unverified | 0 |
| Hierarchical Object-oriented Spatio-Temporal Reasoning for Video Question Answering | Jun 25, 2021 | Object, Question Answering | Unverified | 0 |
| NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions | Jun 19, 2021 | Question Answering, Video Question Answering | Code Available | 1 |
| Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering | Jun 19, 2021 | AI Agent, Question Answering | Code Available | 0 |
| VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation | Jun 8, 2021 | Multi-Task Learning, Question Answering | Code Available | 1 |
| Learning to Rehearse in Long Sequence Memorization | Jun 2, 2021 | Memorization, Question Answering | Unverified | 0 |
| DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization | Jun 1, 2021 | Question Answering, Retrieval | Code Available | 1 |
| NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions | May 18, 2021 | Question Answering, Video Question Answering | Code Available | 1 |
| Relation-aware Hierarchical Attention Framework for Video Question Answering | May 13, 2021 | Question Answering, Relation | Code Available | 0 |
| Bridge to Answer: Structure-aware Graph Interaction Network for Video Question Answering | Apr 29, 2021 | Question Answering, Video Question Answering | Unverified | 0 |
| Object-Centric Representation Learning for Video Question Answering | Apr 12, 2021 | Object, Question Answering | Unverified | 0 |
| FIBER: Fill-in-the-Blanks as a Challenging Video Understanding Evaluation Framework | Apr 9, 2021 | Language Modelling, Multiple-choice | Code Available | 0 |
| Video Question Answering with Phrases via Semantic Roles | Apr 8, 2021 | Question Answering, Video Question Answering | Unverified | 0 |
| CUPID: Adaptive Curation of Pre-training Data for Video-and-Language Representation Learning | Apr 1, 2021 | Question Answering, Representation Learning | Unverified | 0 |
| AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning | Mar 30, 2021 | Question Answering, Video Question Answering | Unverified | 0 |
| SUTD-TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning over Traffic Events | Mar 29, 2021 | Autonomous Vehicles, Benchmarking | Code Available | 1 |
| A Comprehensive Review of the Video-to-Text Problem | Mar 27, 2021 | Question Answering, Retrieval | Code Available | 1 |
| On the hidden treasure of dialog in video question answering | Mar 26, 2021 | Question Answering, Video Question Answering | Code Available | 1 |
| Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling | Feb 11, 2021 | Question Answering, Retrieval | Code Available | 1 |