| Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning | Mar 6, 2020 | Density Estimation, Noise Estimation | Code Available | 0 | 5 |
| Video Question Answering on Screencast Tutorials | Aug 2, 2020 | Question Answering, Video Question Answering | Unverified | 0 | 0 |
| Open-Ended Long-Form Video Question Answering via Hierarchical Convolutional Self-Attention Networks | Jun 28, 2019 | Answer Generation, Decoder | Unverified | 0 | 0 |
| Video Question Answering Using CLIP-Guided Visual-Text Attention | Mar 6, 2023 | General Knowledge, Question Answering | Unverified | 0 | 0 |
| CRAFT: A Benchmark for Causal Reasoning About Forces and inTeractions | Nov 16, 2021 | counterfactual, Descriptive | Unverified | 0 | 0 |
| Overview of the MedVidQA 2022 Shared Task on Medical Video Question-Answering | May 1, 2022 | Question Answering, Video Classification | Unverified | 0 | 0 |
| Overview of the NLPCC 2025 Shared Task 4: Multi-modal, Multilingual, and Multi-hop Medical Instructional Video Question Answering Challenge | May 11, 2025 | Multimodal Reasoning, Question Answering | Unverified | 0 | 0 |
| Overview of TREC 2024 Medical Video Question Answering (MedVidQA) Track | Dec 15, 2024 | Image Captioning, Medical Question Answering | Unverified | 0 | 0 |
| Video Question Answering Using Language-Guided Deep Compressed-Domain Video Feature | Jan 1, 2021 | Question Answering, Video Compression | Unverified | 0 | 0 |
| Parameter-free Video Segmentation for Vision and Language Understanding | Mar 3, 2025 | Question Answering, Video Question Answering | Unverified | 0 | 0 |
| Video Question Answering via Attribute-Augmented Attention Network Learning | Jul 20, 2017 | Attribute, Information Retrieval | Unverified | 0 | 0 |
| Pegasus-v1 Technical Report | Apr 23, 2024 | Language Modeling, Language Modelling | Unverified | 0 | 0 |
| Perceive, Query & Reason: Enhancing Video QA with Question-Guided Temporal Queries | Dec 26, 2024 | Question Answering, Video Question Answering | Unverified | 0 | 0 |
| Contrastive Video-Language Learning with Fine-grained Frame Sampling | Oct 10, 2022 | Question Answering, Representation Learning | Unverified | 0 | 0 |
| Perception Test 2023: A Summary of the First Challenge And Outcome | Dec 20, 2023 | Benchmarking, Grounded Video Question Answering | Unverified | 0 | 0 |
| Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA Benchmark | Nov 29, 2024 | Benchmarking, Grounded Video Question Answering | Unverified | 0 | 0 |
| Continuous Perception Benchmark | Aug 15, 2024 | Question Answering, Video Question Answering | Unverified | 0 | 0 |
| Composing Ensembles of Pre-trained Models via Iterative Consensus | Oct 20, 2022 | Arithmetic Reasoning, Image Generation | Unverified | 0 | 0 |
| Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning | Jan 9, 2025 | Benchmarking, Question Answering | Unverified | 0 | 0 |
| PolySmart @ TRECVid 2024 Medical Video Question Answering | Dec 20, 2024 | Question Answering, Retrieval | Unverified | 0 | 0 |
| Poze: Sports Technique Feedback under Data Constraints | Nov 8, 2024 | Pose Estimation, Question Answering | Unverified | 0 | 0 |
| CogStream: Context-guided Streaming Video Question Answering | Jun 12, 2025 | Question Answering, Video Question Answering | Unverified | 0 | 0 |
| Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering | Oct 12, 2024 | Question Answering, Video Question Answering | Unverified | 0 | 0 |
| QTG-VQA: Question-Type-Guided Architectural for VideoQA Systems | Sep 14, 2024 | Question Answering, Video Question Answering | Unverified | 0 | 0 |
| CogME: A Cognition-Inspired Multi-Dimensional Evaluation Metric for Story Understanding | Jul 21, 2021 | Question Answering, Sentence | Unverified | 0 | 0 |
| Ranking Distillation for Open-Ended Video Question Answering with Insufficient Labels | Mar 21, 2024 | Multi-Label Classification, MUlTI-LABEL-ClASSIFICATION | Unverified | 0 | 0 |
| Read, Look or Listen? What's Needed for Solving a Multimodal Dataset | Jul 6, 2023 | Question Answering, Speaker Identification | Unverified | 0 | 0 |
| ReasVQA: Advancing VideoQA with Imperfect Reasoning Process | Jan 23, 2025 | Multi-Task Learning, Question Answering | Unverified | 0 | 0 |
| Recent Advances in Video Question Answering: A Review of Datasets and Methods | Jan 15, 2021 | Information Retrieval, Machine Translation | Unverified | 0 | 0 |
| Redundancy-aware Transformer for Video Question Answering | Aug 7, 2023 | Question Answering, Video Question Answering | Unverified | 0 | 0 |
| Video Question Answering with Iterative Video-Text Co-Tokenization | Aug 1, 2022 | Question Answering, Video Question Answering | Unverified | 0 | 0 |
| Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models | Apr 18, 2024 | GSM8K, MMLU | Unverified | 0 | 0 |
| CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising | Dec 14, 2021 | Cross-Modal Retrieval, Decoder | Unverified | 0 | 0 |
| Rethinking Multi-Modal Alignment in Video Question Answering from Feature and Sample Perspectives | Apr 25, 2022 | Question Answering, Video Question Answering | Unverified | 0 | 0 |
| Retrieval-based Video Language Model for Efficient Long Video Question Answering | Dec 8, 2023 | Language Modeling, Language Modelling | Unverified | 0 | 0 |
| Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models | Jun 15, 2023 | cross-modal alignment, Domain Generalization | Unverified | 0 | 0 |
| Co-attentional Transformers for Story-Based Video Understanding | Oct 27, 2020 | Question Answering, Video Question Answering | Unverified | 0 | 0 |
| Video Question Answering with Phrases via Semantic Roles | Apr 8, 2021 | Question Answering, Video Question Answering | Unverified | 0 | 0 |
| Video Question Generation via Cross-Modal Self-Attention Networks Learning | Jul 5, 2019 | Diversity, Question Answering | Unverified | 0 | 0 |
| AdaCM^2: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction | Nov 19, 2024 | GPU, Question Answering | Unverified | 0 | 0 |
| Sample then Identify: A General Framework for Risk Control and Assessment in Multimodal Large Language Models | Oct 10, 2024 | Conformal Prediction, Language Modeling | Unverified | 0 | 0 |
| Zero-Shot Long-Form Video Understanding through Screenplay | Jun 25, 2024 | Form, Question Answering | Unverified | 0 | 0 |
| VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | Dec 9, 2022 | Question Answering, Retrieval | Unverified | 0 | 0 |
| SEAL: Semantic Attention Learning for Long Video Representation | Dec 2, 2024 | Diversity, Question Answering | Unverified | 0 | 0 |
| Seed1.5-VL Technical Report | May 11, 2025 | Mixture-of-Experts, Multimodal Reasoning | Unverified | 0 | 0 |
| Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization | Apr 16, 2025 | Hallucination, Question Answering | Unverified | 0 | 0 |
| Self-ReS: Self-Reflection in Large Vision-Language Models for Long Video Understanding | Mar 26, 2025 | GPU, Question Answering | Unverified | 0 | 0 |
| Self-supervised pre-training and contrastive representation learning for multiple-choice video QA | Sep 17, 2020 | Auxiliary Learning, Contrastive Learning | Unverified | 0 | 0 |
| Semantic-aware Dynamic Retrospective-Prospective Reasoning for Event-level Video Question Answering | May 14, 2023 | Question Answering, Semantic Role Labeling | Unverified | 0 | 0 |
| Semi-Parametric Video-Grounded Text Generation | Jan 27, 2023 | Language Modeling, Language Modelling | Unverified | 0 | 0 |