| MMTF: Multi-Modal Temporal Fusion for Commonsense Video Question Answering | Oct 6, 2023 | counterfactual, Question Answering | Unverified | 0 |
| ATM: Action Temporality Modeling for Video Question Answering | Sep 5, 2023 | Contrastive Learning, Optical Flow Estimation | Unverified | 0 |
| Understanding Video Scenes through Text: Insights from Text-based Video Question Answering | Sep 4, 2023 | Domain Adaptation, Question Answering | Unverified | 0 |
| Distraction-free Embeddings for Robust VQA | Aug 31, 2023 | Question Answering, Video Question Answering | Unverified | 0 |
| Redundancy-aware Transformer for Video Question Answering | Aug 7, 2023 | Question Answering, Video Question Answering | Unverified | 0 |
| Keyword-Aware Relative Spatio-Temporal Graph Networks for Video Question Answering | Jul 25, 2023 | graph construction, Question Answering | Unverified | 0 |
| Traffic-Domain Video Question Answering with Automatic Captioning | Jul 18, 2023 | Question Answering, Video Question Answering | Unverified | 0 |
| Reading Between the Lanes: Text VideoQA on the Road | Jul 8, 2023 | Question Answering, Scene Text Recognition | Code Available | 0 |
| Read, Look or Listen? What's Needed for Solving a Multimodal Dataset | Jul 6, 2023 | Question Answering, Speaker Identification | Unverified | 0 |
| Lightweight Recurrent Cross-modal Encoder for Video Question Answering | Jun 30, 2023 | Action Recognition, Question Answering | Code Available | 0 |
| Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models | Jun 15, 2023 | cross-modal alignment, Domain Generalization | Unverified | 0 |
| Diversifying Joint Vision-Language Tokenization Learning | Jun 6, 2023 | Question Answering, Representation Learning | Unverified | 0 |
| VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | May 22, 2023 | Question Answering, Retrieval | Unverified | 0 |
| TG-VQA: Ternary Game of Video Question Answering | May 17, 2023 | Contrastive Learning, Question Answering | Unverified | 0 |
| Is a Video worth n × n Images? A Highly Efficient Approach to Transformer-based Video Question Answering | May 16, 2023 | Question Answering, Video Question Answering | Unverified | 0 |
| Semantic-aware Dynamic Retrospective-Prospective Reasoning for Event-level Video Question Answering | May 14, 2023 | Question Answering, Semantic Role Labeling | Unverified | 0 |
| ANetQA: A Large-scale Benchmark for Fine-grained Compositional Reasoning over Untrimmed Videos | May 4, 2023 | Question Answering, Spatio-temporal Scene Graphs | Code Available | 0 |
| VideoOFA: Two-Stage Pre-Training for Video-to-Text Generation | May 4, 2023 | Decoder, Question Answering | Unverified | 0 |
| A Review of Deep Learning for Video Captioning | Apr 22, 2023 | Deep Learning, Dense Video Captioning | Unverified | 0 |
| Verbs in Action: Improving verb understanding in video-language models | Apr 13, 2023 | Contrastive Learning, Question Answering | Code Available | 0 |
| Language Models are Causal Knowledge Extractors for Zero-shot Video Question Answering | Apr 7, 2023 | Question Answering, Question Generation | Unverified | 0 |
| MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks | Mar 29, 2023 | Cross-Modal Retrieval, Decoder | Code Available | 0 |
| Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding | Mar 28, 2023 | Action Localization, Action Recognition | Unverified | 0 |
| Unmasked Teacher: Towards Training-Efficient Video Foundation Models | Mar 28, 2023 | Action Classification, Action Recognition | Code Available | 0 |
| MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling | Mar 10, 2023 | Multi-Label Classification | Unverified | 0 |