| Video Question Answering Using CLIP-Guided Visual-Text Attention | Mar 6, 2023 | General Knowledge, Question Answering | Unverified | 0 |
| STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training | Feb 20, 2023 | Language Modelling, Object | Unverified | 0 |
| Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer | Feb 4, 2023 | Computational Efficiency, Question Answering | Code Available | 0 |
| Semi-Parametric Video-Grounded Text Generation | Jan 27, 2023 | Language Modeling, Language Modelling | Unverified | 0 |
| Temporal Perceiving Video-Language Pre-training | Jan 18, 2023 | Action Localization, Contrastive Learning | Unverified | 0 |
| Learning Trajectory-Word Alignments for Video-Language Tasks | Jan 5, 2023 | Question Answering, Retrieval | Unverified | 0 |
| Discovering the Real Association: Multimodal Causal Reasoning in Video Question Answering | Jan 1, 2023 | Question Answering, Video Question Answering | Unverified | 0 |
| Knowledge Proxy Intervention for Deconfounded Video Question Answering | Jan 1, 2023 | Question Answering, Video Question Answering | Unverified | 0 |
| Exploring Temporal Concurrency for Video-Language Representation Learning | Jan 1, 2023 | Dynamic Time Warping, Metric Learning | Code Available | 0 |
| HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | Dec 30, 2022 | cross-modal alignment, TGIF-Action | Unverified | 0 |
| VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | Dec 9, 2022 | Question Answering, Retrieval | Unverified | 0 |
| SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training | Nov 21, 2022 | cross-modal alignment, GPU | Unverified | 0 |
| Watching the News: Towards VideoQA Models that can Read | Nov 10, 2022 | Question Answering, Video Question Answering | Unverified | 0 |
| CRIPP-VQA: Counterfactual Reasoning about Implicit Physical Properties via Video Question Answering | Nov 7, 2022 | Add - PO, Add - PQ | Code Available | 0 |
| LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling | Oct 21, 2022 | Language Modeling, Language Modelling | Unverified | 0 |
| Composing Ensembles of Pre-trained Models via Iterative Consensus | Oct 20, 2022 | Arithmetic Reasoning, Image Generation | Unverified | 0 |
| Dense but Efficient VideoQA for Intricate Compositional Reasoning | Oct 19, 2022 | Question Answering, Video Question Answering | Unverified | 0 |
| Contrastive Video-Language Learning with Fine-grained Frame Sampling | Oct 10, 2022 | Question Answering, Representation Learning | Unverified | 0 |
| Locate before Answering: Answer Guided Question Localization for Video Question Answering | Oct 5, 2022 | Question Answering, Video Question Answering | Unverified | 0 |
| Extending Compositional Attention Networks for Social Reasoning in Videos | Oct 3, 2022 | Question Answering, Video Question Answering | Code Available | 0 |
| In-the-Wild Video Question Answering | Oct 1, 2022 | Evidence Selection, Question Answering | Unverified | 0 |
| LAVIS: A Library for Language-Vision Intelligence | Sep 15, 2022 | Benchmarking, Image Captioning | Unverified | 0 |
| OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | Sep 15, 2022 | Action Classification, Action Recognition | Unverified | 0 |
| WildQA: In-the-Wild Video Question Answering | Sep 14, 2022 | Evidence Selection, Question Answering | Unverified | 0 |
| Frame-Subtitle Self-Supervision for Multi-Modal Video Question Answering | Sep 8, 2022 | Question Answering, Video Question Answering | Unverified | 0 |
| Video Question Answering with Iterative Video-Text Co-Tokenization | Aug 1, 2022 | Question Answering, Video Question Answering | Unverified | 0 |
| Dynamic Multistep Reasoning based on Video Scene Graph for Video Question Answering | Jul 1, 2022 | Question Answering, Video Question Answering | Unverified | 0 |
| 0/1 Deep Neural Networks via Block Coordinate Descent | Jun 19, 2022 | 10-shot image generation | Unverified | 0 |
| Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval | Jun 5, 2022 | Retrieval, Sentence | Code Available | 0 |
| Structured Two-stream Attention Network for Video Question Answering | Jun 2, 2022 | Question Answering, Video Question Answering | Unverified | 0 |
| Modality Alignment between Deep Representations for Effective Video-and-Language Learning | Jun 1, 2022 | Question Answering, Video Captioning | Unverified | 0 |
| Modeling Semantic Composition with Syntactic Hypergraph for Video Question Answering | May 13, 2022 | Question Answering, Semantic Composition | Unverified | 0 |
| Multilevel Hierarchical Network with Multiscale Sampling for Video Question Answering | May 9, 2022 | multimodal interaction, Question Answering | Code Available | 0 |
| Video Language Co-Attention with Multimodal Fast-Learning Feature Fusion for VideoQA | May 1, 2022 | Question Answering, Video Question Answering | Unverified | 0 |
| Overview of the MedVidQA 2022 Shared Task on Medical Video Question-Answering | May 1, 2022 | Question Answering, Video Classification | Unverified | 0 |
| Rethinking Multi-Modal Alignment in Video Question Answering from Feature and Sample Perspectives | Apr 25, 2022 | Question Answering, Video Question Answering | Unverified | 0 |
| Measuring Compositional Consistency for Video Question Answering | Apr 14, 2022 | Question Answering, Video Question Answering | Unverified | 0 |
| (2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering | Feb 18, 2022 | Question Answering, Spatio-temporal Scene Graphs | Unverified | 0 |
| NEWSKVQA: Knowledge-Aware News Video Question Answering | Feb 8, 2022 | Common Sense Reasoning, Management | Unverified | 0 |
| CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising | Dec 14, 2021 | Cross-Modal Retrieval, Decoder | Unverified | 0 |
| Learning from Inside: Self-driven Siamese Sampling and Reasoning for Video Question Answering | Dec 1, 2021 | Multimodal Reasoning, Question Answering | Unverified | 0 |
| LiVLR: A Lightweight Visual-Linguistic Reasoning Framework for Video Question Answering | Nov 29, 2021 | Diversity, Question Answering | Unverified | 0 |
| Fill-in-the-Blank: A Challenging Video Understanding Evaluation Framework | Nov 16, 2021 | Multiple-choice, Question Answering | Unverified | 0 |
| CRAFT: A Benchmark for Causal Reasoning About Forces and inTeractions | Nov 16, 2021 | counterfactual, Descriptive | Unverified | 0 |
| Transferring Domain-Agnostic Knowledge in Video Question Answering | Oct 26, 2021 | Question Answering, Transfer Learning | Unverified | 0 |
| The Multi-Modal Video Reasoning and Analyzing Competition | Aug 18, 2021 | Action Recognition, Person Re-Identification | Unverified | 0 |
| Mounting Video Metadata on Transformer-based Language Model for Open-ended Video Question Answering | Aug 11, 2021 | Language Modeling, Language Modelling | Unverified | 0 |
| Multi-Scale Progressive Attention Network for Video Question Answering | Aug 1, 2021 | Question Answering, Relational Reasoning | Unverified | 0 |
| CogME: A Cognition-Inspired Multi-Dimensional Evaluation Metric for Story Understanding | Jul 21, 2021 | Question Answering, Sentence | Unverified | 0 |
| Hierarchical Object-oriented Spatio-Temporal Reasoning for Video Question Answering | Jun 25, 2021 | Object, Question Answering | Unverified | 0 |