| TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering | Apr 1, 2024 | Question Answering, Video Question Answering | Code Available | 1 | 5 |
| Hierarchical Conditional Relation Networks for Video Question Answering | Feb 25, 2020 | Audio-Visual Question Answering (AVQA), Question Answering | Code Available | 1 | 5 |
| DAM: Dynamic Adapter Merging for Continual Video QA Learning | Mar 13, 2024 | Continual Learning, image-classification | Code Available | 1 | 5 |
| IntentQA: Context-aware Video Intent Reasoning | Jan 1, 2023 | Contrastive Learning, Video Question Answering | Code Available | 1 | 5 |
| HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training | May 1, 2020 | Language Modeling, Language Modelling | Code Available | 1 | 5 |
| Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer | Aug 16, 2023 | Decoder, Question Answering | Code Available | 1 | 5 |
| HCQA @ Ego4D EgoSchema Challenge 2024 | Jun 22, 2024 | Caption Generation | Code Available | 1 | 5 |
| HawkEye: Training Video-Text LLMs for Grounding Text in Videos | Mar 15, 2024 | Video Grounding, Video Question Answering | Code Available | 1 | 5 |
| Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling | Feb 11, 2021 | Question Answering, Retrieval | Code Available | 1 | 5 |
| LifeQA: A Real-life Dataset for Video Question Answering | May 1, 2020 | Multiple-choice, Question Answering | Code Available | 1 | 5 |
| Agentic Keyframe Search for Video Question Answering | Mar 20, 2025 | EgoSchema, Question Answering | Code Available | 1 | 5 |
| TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models | Oct 14, 2024 | 2k, Benchmarking | Code Available | 1 | 5 |
| AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant | Nov 30, 2021 | Question Answering, Retrieval | Code Available | 1 | 5 |
| Grounded Question-Answering in Long Egocentric Videos | Dec 11, 2023 | Video Grounding, Video Question Answering | Code Available | 1 | 5 |
| Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling | Oct 8, 2022 | Language Modeling, Language Modelling | Code Available | 1 | 5 |
| Just Ask: Learning to Answer Questions from Millions of Narrated Videos | Dec 1, 2020 | Question Answering, Question Generation | Code Available | 1 | 5 |
| Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos | Aug 26, 2024 | Form, Language Modelling | Code Available | 1 | 5 |
| Learning Situation Hyper-Graphs for Video Question Answering | Apr 18, 2023 | Decoder, Question Answering | Code Available | 1 | 5 |
| MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | Mar 23, 2023 | Auxiliary Learning, Multimodal Sentiment Analysis | Code Available | 1 | 5 |
| Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions | Jul 17, 2020 | Question Answering, Video Question Answering | Code Available | 1 | 5 |
| Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners | May 22, 2022 | Attribute, Automatic Speech Recognition | Code Available | 1 | 5 |
| Large Language Models are Temporal and Causal Reasoners for Video Question Answering | Oct 24, 2023 | Natural Language Understanding, Question Answering | Code Available | 1 | 5 |
| EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Aug 17, 2023 | Diagnostic, EgoSchema | Code Available | 1 | 5 |
| NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions | May 18, 2021 | Question Answering, Video Question Answering | Code Available | 1 | 5 |
| LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling | Jun 14, 2022 | Decoder, Language Modeling | Code Available | 1 | 5 |