| What Gives the Answer Away? Question Answering Bias Analysis on Video QA Datasets | Jul 7, 2020 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| Slot-VLM: SlowFast Slots for Video-Language Modeling | Feb 20, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training | Nov 21, 2022 | cross-modal alignment, GPU | —Unverified | 0 | 0 |
| Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding | Mar 28, 2023 | Action Localization, Action Recognition | —Unverified | 0 | 0 |
| CinePile: A Long Video Question Answering Dataset and Benchmark | May 14, 2024 | Form, Human-Object Interaction Detection | —Unverified | 0 | 0 |
| WildQA: In-the-Wild Video Question Answering | Sep 14, 2022 | Evidence Selection, Question Answering | —Unverified | 0 | 0 |
| Characterizing Video Question Answering with Sparsified Inputs | Nov 27, 2023 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training | Feb 20, 2023 | Language Modelling, Object | —Unverified | 0 | 0 |
| Causal Understanding For Video Question Answering | Jul 23, 2024 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| Structured Two-stream Attention Network for Video Question Answering | Jun 2, 2022 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| SurveillanceVQA-589K: A Benchmark for Comprehensive Surveillance Video-Language Understanding with Large Models | May 19, 2025 | Causal Inference, Decision Making | —Unverified | 0 | 0 |
| Actions and Objects Pathways for Domain Adaptation in Video Question Answering | Nov 29, 2024 | Domain Adaptation, Domain Generalization | —Unverified | 0 | 0 |
| Capabilities of Gemini Models in Medicine | Apr 29, 2024 | In-Context Learning, MedQA | —Unverified | 0 | 0 |
| Bridge to Answer: Structure-aware Graph Interaction Network for Video Question Answering | Apr 29, 2021 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| BDIQA: A New Dataset for Video Question Answering to Explore Cognitive Reasoning through Theory of Mind | Feb 12, 2024 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| ActionAtlas: A VideoQA Benchmark for Domain-specialized Action Recognition | Oct 8, 2024 | Action Recognition, Multiple-choice | —Unverified | 0 | 0 |
| VISTA-LLAMA: Reducing Hallucination in Video Language Models via Equal Distance to Visual Tokens | Jan 1, 2024 | Hallucination, Position | —Unverified | 0 | 0 |
| Temporal Perceiving Video-Language Pre-training | Jan 18, 2023 | Action Localization, Contrastive Learning | —Unverified | 0 | 0 |
| 0/1 Deep Neural Networks via Block Coordinate Descent | Jun 19, 2022 | 10-shot image generation | —Unverified | 0 | 0 |
| Vista-LLaMA: Reliable Video Narrator via Equal Distance to Visual Tokens | Dec 12, 2023 | Hallucination, Position | —Unverified | 0 | 0 |
| Text-Conditioned Resampler For Long Form Video Understanding | Dec 19, 2023 | EgoSchema, Form | —Unverified | 0 | 0 |
| TG-VQA: Ternary Game of Video Question Answering | May 17, 2023 | Contrastive Learning, Question Answering | —Unverified | 0 | 0 |
| The Forgettable-Watcher Model for Video Question Answering | May 3, 2017 | model, Question Answering | —Unverified | 0 | 0 |
| The Multi-Modal Video Reasoning and Analyzing Competition | Aug 18, 2021 | Action Recognition, Person Re-Identification | —Unverified | 0 | 0 |
| The Solution for the ICCV 2023 Perception Test Challenge 2023 -- Task 6 -- Grounded videoQA | Jul 2, 2024 | Grounded Video Question Answering, Object Tracking | —Unverified | 0 | 0 |
| Backpropagation-Free Multi-modal On-Device Model Adaptation via Cloud-Device Collaboration | May 21, 2024 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| TimeLogic: A Temporal Logic Benchmark for Video QA | Jan 13, 2025 | 2k, Action Segmentation | —Unverified | 0 | 0 |
| Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training | Jul 5, 2020 | Decoder, Question Answering | —Unverified | 0 | 0 |
| TIME: Temporal-sensitive Multi-dimensional Instruction Tuning and Benchmarking for Video-LLMs | Mar 13, 2025 | Benchmarking, Question Answering | —Unverified | 0 | 0 |
| Top-down Activity Representation Learning for Video Question Answering | Sep 12, 2024 | Question Answering, Representation Learning | —Unverified | 0 | 0 |
| Towards Fine-Grained Video Question Answering | Mar 10, 2025 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| Towards Understanding Camera Motions in Any Video | Apr 21, 2025 | Question Answering, Text Retrieval | —Unverified | 0 | 0 |
| Traffic-Domain Video Question Answering with Automatic Captioning | Jul 18, 2023 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| Transferring Domain-Agnostic Knowledge in Video Question Answering | Oct 26, 2021 | Question Answering, Transfer Learning | —Unverified | 0 | 0 |
| Gaining Extra Supervision via Multi-task learning for Multi-Modal Video Question Answering | May 28, 2019 | Inductive Bias, Metric Learning | —Unverified | 0 | 0 |
| Trying Bilinear Pooling in Video-QA | Dec 18, 2020 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| Uncertainty-Guided Self-Questioning and Answering for Video-Language Alignment | Sep 17, 2024 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| Uncovering Temporal Context for Video Question and Answering | Nov 15, 2015 | Decoder, Multiple-choice | —Unverified | 0 | 0 |
| Understanding Complexity in VideoQA via Visual Program Generation | May 19, 2025 | Code Generation, Question Answering | —Unverified | 0 | 0 |
| Understanding Video Scenes through Text: Insights from Text-based Video Question Answering | Sep 4, 2023 | Domain Adaptation, Question Answering | —Unverified | 0 | 0 |
| Unlocking Video-LLM via Agent-of-Thoughts Distillation | Dec 2, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs | Oct 21, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding | Dec 4, 2023 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| ATM: Action Temporality Modeling for Video Question Answering | Sep 5, 2023 | Contrastive Learning, Optical Flow Estimation | —Unverified | 0 | 0 |
| VDMA: Video Question Answering with Dynamically Generated Multi-Agents | Jul 4, 2024 | EgoSchema, Question Answering | —Unverified | 0 | 0 |
| Assessing Modality Bias in Video Question Answering Benchmarks with Multimodal Large Language Models | Aug 22, 2024 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| Flexible Frame Selection for Efficient Video Reasoning | Jan 1, 2025 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| Foundation Models and Adaptive Feature Selection: A Synergistic Approach to Video Question Answering | Dec 12, 2024 | feature selection, Language Modeling | —Unverified | 0 | 0 |
| Frame-Subtitle Self-Supervision for Multi-Modal Video Question Answering | Sep 8, 2022 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| Frame-Voyager: Learning to Query Frames for Video Large Language Models | Oct 4, 2024 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |