| Self-ReS: Self-Reflection in Large Vision-Language Models for Long Video Understanding | Mar 26, 2025 | GPU, Question Answering | —Unverified | 0 |
| Self-supervised pre-training and contrastive representation learning for multiple-choice video QA | Sep 17, 2020 | Auxiliary Learning, Contrastive Learning | —Unverified | 0 |
| Semantic-aware Dynamic Retrospective-Prospective Reasoning for Event-level Video Question Answering | May 14, 2023 | Question Answering, Semantic Role Labeling | —Unverified | 0 |
| Semi-Parametric Video-Grounded Text Generation | Jan 27, 2023 | Language Modeling, Language Modelling | —Unverified | 0 |
| What Gives the Answer Away? Question Answering Bias Analysis on Video QA Datasets | Jul 7, 2020 | Multiple-choice, Question Answering | —Unverified | 0 |
| Slot-VLM: SlowFast Slots for Video-Language Modeling | Feb 20, 2024 | Language Modeling, Language Modelling | —Unverified | 0 |
| SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training | Nov 21, 2022 | cross-modal alignment, GPU | —Unverified | 0 |
| Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding | Mar 28, 2023 | Action Localization, Action Recognition | —Unverified | 0 |
| CinePile: A Long Video Question Answering Dataset and Benchmark | May 14, 2024 | Form, Human-Object Interaction Detection | —Unverified | 0 |
| WildQA: In-the-Wild Video Question Answering | Sep 14, 2022 | Evidence Selection, Question Answering | —Unverified | 0 |
| Characterizing Video Question Answering with Sparsified Inputs | Nov 27, 2023 | Question Answering, Video Question Answering | —Unverified | 0 |
| STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training | Feb 20, 2023 | Language Modelling, Object | —Unverified | 0 |
| Causal Understanding For Video Question Answering | Jul 23, 2024 | Question Answering, Video Question Answering | —Unverified | 0 |
| Structured Two-stream Attention Network for Video Question Answering | Jun 2, 2022 | Question Answering, Video Question Answering | —Unverified | 0 |
| SurveillanceVQA-589K: A Benchmark for Comprehensive Surveillance Video-Language Understanding with Large Models | May 19, 2025 | Causal Inference, Decision Making | —Unverified | 0 |
| Actions and Objects Pathways for Domain Adaptation in Video Question Answering | Nov 29, 2024 | Domain Adaptation, Domain Generalization | —Unverified | 0 |
| Capabilities of Gemini Models in Medicine | Apr 29, 2024 | In-Context Learning, MedQA | —Unverified | 0 |
| Bridge to Answer: Structure-aware Graph Interaction Network for Video Question Answering | Apr 29, 2021 | Question Answering, Video Question Answering | —Unverified | 0 |
| BDIQA: A New Dataset for Video Question Answering to Explore Cognitive Reasoning through Theory of Mind | Feb 12, 2024 | Question Answering, Video Question Answering | —Unverified | 0 |
| ActionAtlas: A VideoQA Benchmark for Domain-specialized Action Recognition | Oct 8, 2024 | Action Recognition, Multiple-choice | —Unverified | 0 |
| VISTA-LLAMA: Reducing Hallucination in Video Language Models via Equal Distance to Visual Tokens | Jan 1, 2024 | Hallucination, Position | —Unverified | 0 |
| Temporal Perceiving Video-Language Pre-training | Jan 18, 2023 | Action Localization, Contrastive Learning | —Unverified | 0 |
| 0/1 Deep Neural Networks via Block Coordinate Descent | Jun 19, 2022 | 10-shot image generation | —Unverified | 0 |
| Vista-LLaMA: Reliable Video Narrator via Equal Distance to Visual Tokens | Dec 12, 2023 | Hallucination, Position | —Unverified | 0 |
| Text-Conditioned Resampler For Long Form Video Understanding | Dec 19, 2023 | EgoSchema, Form | —Unverified | 0 |