| VindLU: A Recipe for Effective Video-and-Language Pretraining | Dec 9, 2022 | Question AnsweringRetrieval | CodeCode Available | 1 |
| Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | Nov 21, 2022 | Contrastive LearningRepresentation Learning | CodeCode Available | 1 |
| Visual Commonsense-aware Representation Network for Video Captioning | Nov 17, 2022 | Caption GenerationQuestion Answering | CodeCode Available | 1 |
| Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling | Oct 8, 2022 | Language ModelingLanguage Modelling | CodeCode Available | 1 |
| An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | Sep 4, 2022 | Fill MaskOptical Flow Estimation | CodeCode Available | 1 |
| Equivariant and Invariant Grounding for Video Question Answering | Jul 26, 2022 | Question AnsweringVideo Question Answering | CodeCode Available | 1 |
| Clover: Towards A Unified Video-Language Alignment and Fusion Model | Jul 16, 2022 | Language ModelingLanguage Modelling | CodeCode Available | 1 |
| Video Graph Transformer for Video Question Answering | Jul 12, 2022 | Question AnsweringRelation | CodeCode Available | 1 |
| Video Dialog as Conversation about Objects Living in Space-Time | Jul 8, 2022 | ObjectRelational Reasoning | CodeCode Available | 1 |
| Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | Jun 16, 2022 | Fill MaskLanguage Modeling | CodeCode Available | 1 |
| LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling | Jun 14, 2022 | DecoderLanguage Modeling | CodeCode Available | 1 |
| Invariant Grounding for Video Question Answering | Jun 6, 2022 | Question AnsweringVideo Question Answering | CodeCode Available | 1 |
| Revisiting the "Video" in Video-Language Understanding | Jun 3, 2022 | BenchmarkingQuestion Answering | CodeCode Available | 1 |
| From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering | May 30, 2022 | counterfactualDescriptive | CodeCode Available | 1 |
| Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners | May 22, 2022 | AttributeAutomatic Speech Recognition | CodeCode Available | 1 |
| Learning to Answer Visual Questions from Web Videos | May 10, 2022 | Dataset GenerationQuestion Answering | CodeCode Available | 1 |
| Revitalize Region Feature for Democratizing Video-Language Pre-training of Retrieval | Mar 15, 2022 | Question AnsweringRetrieval | CodeCode Available | 1 |
| Video Question Answering: Datasets, Algorithms and Challenges | Mar 2, 2022 | Question AnsweringVideo Question Answering | CodeCode Available | 1 |
| Video as Conditional Graph Hierarchy for Multi-Granular Question Answering | Dec 12, 2021 | Question AnsweringVideo Question Answering | CodeCode Available | 1 |
| AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant | Nov 30, 2021 | Question AnsweringRetrieval | CodeCode Available | 1 |
| SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning | Nov 25, 2021 | Caption GenerationQuestion Answering | CodeCode Available | 1 |
| VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling | Nov 24, 2021 | Question AnsweringRetrieval | CodeCode Available | 1 |
| Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering | Sep 10, 2021 | multimodal interactionNatural Language Understanding | CodeCode Available | 1 |
| DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering | Jul 10, 2021 | Graph AttentionQuestion Answering | CodeCode Available | 1 |
| NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions | Jun 19, 2021 | Question AnsweringVideo Question Answering | CodeCode Available | 1 |