| VideoCon: Robust Video-Language Alignment via Contrast Captions | Nov 15, 2023 | Language Modeling, Language Modelling | Code Available | 1 | 5 |
| Large Language Models are Temporal and Causal Reasoners for Video Question Answering | Oct 24, 2023 | Natural Language Understanding, Question Answering | Code Available | 1 | 5 |
| LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling | Jun 14, 2022 | Decoder, Language Modeling | Code Available | 1 | 5 |
| AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant | Nov 30, 2021 | Question Answering, Retrieval | Code Available | 1 | 5 |
| Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling | Oct 8, 2022 | Language Modeling, Language Modelling | Code Available | 1 | 5 |
| Grounded Question-Answering in Long Egocentric Videos | Dec 11, 2023 | Video Grounding, Video Question Answering | Code Available | 1 | 5 |
| NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions | Jun 19, 2021 | Question Answering, Video Question Answering | Code Available | 1 | 5 |
| Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos | Aug 26, 2024 | Form, Language Modelling | Code Available | 1 | 5 |
| Learning to Answer Visual Questions from Web Videos | May 10, 2022 | Dataset Generation, Question Answering | Code Available | 1 | 5 |
| Encoding and Controlling Global Semantics for Long-form Video Question Answering | May 30, 2024 | Form, Question Answering | Code Available | 1 | 5 |
| Referring Atomic Video Action Recognition | Jul 2, 2024 | Action Localization, Action Recognition | Code Available | 1 | 5 |
| Scene-Text Grounding for Text-Based Video Question Answering | Sep 22, 2024 | 2k, Contrastive Learning | Code Available | 1 | 5 |
| Glance and Focus: Memory Prompting for Multi-Event Video Question Answering | Jan 3, 2024 | Action Detection, Human-Object Interaction Detection | Code Available | 1 | 5 |
| CRAFT: A Benchmark for Causal Reasoning About Forces and inTeractions | Dec 8, 2020 | counterfactual, Descriptive | Code Available | 1 | 5 |
| Location-aware Graph Convolutional Networks for Video Question Answering | Aug 7, 2020 | Action Recognition, graph construction | Code Available | 1 | 5 |
| Video Dialog as Conversation about Objects Living in Space-Time | Jul 8, 2022 | Object, Relational Reasoning | Code Available | 1 | 5 |
| FriendsQA: A New Large-Scale Deep Video Understanding Dataset with Fine-grained Topic Categorization for Story Videos | Dec 22, 2024 | Language Modelling, Large Language Model | Code Available | 0 | 5 |
| Vamos: Versatile Action Models for Video Understanding | Nov 22, 2023 | EgoSchema, Hard Attention | Code Available | 0 | 5 |
| Verbs in Action: Improving verb understanding in video-language models | Apr 13, 2023 | Contrastive Learning, Question Answering | Code Available | 0 | 5 |
| ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos | Nov 2, 2023 | counterfactual, Counterfactual Reasoning | Code Available | 0 | 5 |
| TVQA: Localized, Compositional Video Question Answering | Sep 5, 2018 | Video Question Answering | Code Available | 0 | 5 |
| TVQA+: Spatio-Temporal Grounding for Video Question Answering | Apr 25, 2019 | Question Answering, Video Question Answering | Code Available | 0 | 5 |
| FIBER: Fill-in-the-Blanks as a Challenging Video Understanding Evaluation Framework | Apr 9, 2021 | Language Modelling, Multiple-choice | Code Available | 0 | 5 |
| TutorialVQA: Question Answering Dataset for Tutorial Videos | Dec 2, 2019 | Question Answering, Video Question Answering | Code Available | 0 | 5 |
| Unmasked Teacher: Towards Training-Efficient Video Foundation Models | Mar 28, 2023 | Action Classification, Action Recognition | Code Available | 0 | 5 |