| Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning | Mar 6, 2020 | Density Estimation, Noise Estimation | Code Available | 0 | 5 |
| FriendsQA: A New Large-Scale Deep Video Understanding Dataset with Fine-grained Topic Categorization for Story Videos | Dec 22, 2024 | Language Modelling, Large Language Model | Code Available | 0 | 5 |
| Looking Beyond Visible Cues: Implicit Video Question Answering via Dual-Clue Reasoning | Jun 9, 2025 | Future prediction, Question Answering | Code Available | 0 | 5 |
| Lightweight Recurrent Cross-modal Encoder for Video Question Answering | Jun 30, 2023 | Action Recognition, Question Answering | Code Available | 0 | 5 |
| FIBER: Fill-in-the-Blanks as a Challenging Video Understanding Evaluation Framework | Apr 9, 2021 | Language Modelling, Multiple-choice | Code Available | 0 | 5 |
| Relation-aware Hierarchical Attention Framework for Video Question Answering | May 13, 2021 | Question Answering, Relation | Code Available | 0 | 5 |
| A Joint Sequence Fusion Model for Video Question Answering and Retrieval | Aug 7, 2018 | Decoder, Multiple-choice | Code Available | 0 | 5 |
| ActBERT: Learning Global-Local Video-Text Representations | Nov 14, 2020 | Action Segmentation, Question Answering | Code Available | 0 | 5 |
| Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer | Feb 4, 2023 | Computational Efficiency, Question Answering | Code Available | 0 | 5 |
| Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval | Jun 5, 2022 | Retrieval, Sentence | Code Available | 0 | 5 |
| STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering | Jan 8, 2024 | Question Answering, Video Question Answering | Code Available | 0 | 5 |
| A Better Way to Attend: Attention with Trees for Video Question Answering | Sep 5, 2019 | Question Answering, Video Question Answering | Code Available | 0 | 5 |
| OmniNet: A unified architecture for multi-modal multi-task learning | Jul 17, 2019 | Image Captioning, Multi-Task Learning | Code Available | 0 | 5 |
| Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs | Apr 11, 2024 | Descriptive, Hallucination | Code Available | 0 | 5 |
| MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks | Mar 29, 2023 | Cross-Modal Retrieval, Decoder | Code Available | 0 | 5 |
| Hallucination Mitigation Prompts Long-term Video Understanding | Jun 17, 2024 | Answer Generation, Hallucination | Code Available | 0 | 5 |
| Visual Choice of Plausible Alternatives: An Evaluation of Image-based Commonsense Causal Reasoning | May 1, 2018 | Commonsense Causal Reasoning, Image Captioning | Code Available | 0 | 5 |
| Temporally-Grounded Language Generation: A Benchmark for Real-Time Vision-Language Models | May 16, 2025 | Image Captioning, Question Answering | Code Available | 0 | 5 |
| LLaVA-OneVision: Easy Visual Task Transfer | Aug 6, 2024 | 3D Question Answering (3D-QA) | Code Available | 0 | 5 |
| VideoQA-SC: Adaptive Semantic Communication for Video Question Answering | May 17, 2024 | Question Answering, Semantic Communication | Unverified | 0 | 0 |
| CUPID: Adaptive Curation of Pre-training Data for Video-and-Language Representation Learning | Apr 1, 2021 | Question Answering, Representation Learning | Unverified | 0 | 0 |
| Object-Centric Representation Learning for Video Question Answering | Apr 12, 2021 | Object, Question Answering | Unverified | 0 | 0 |
| Watching the News: Towards VideoQA Models that can Read | Nov 10, 2022 | Question Answering, Video Question Answering | Unverified | 0 | 0 |
| Cross-Modal Reasoning with Event Correlation for Video Question Answering | Dec 20, 2023 | Question Answering, Video Question Answering | Unverified | 0 | 0 |
| OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | Sep 15, 2022 | Action Classification, Action Recognition | Unverified | 0 | 0 |