| Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder | Jun 28, 2025 | Image SegmentationLarge Language Model | CodeCode Available | 1 | 5 |
| Location-aware Graph Convolutional Networks for Video Question Answering | Aug 7, 2020 | Action Recognitiongraph construction | CodeCode Available | 1 | 5 |
| DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization | Jun 1, 2021 | Question AnsweringRetrieval | CodeCode Available | 1 | 5 |
| TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models | Oct 14, 2024 | 2kBenchmarking | CodeCode Available | 1 | 5 |
| Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA | Jun 13, 2024 | AllEgoSchema | CodeCode Available | 1 | 5 |
| Hierarchical Conditional Relation Networks for Video Question Answering | Feb 25, 2020 | Audio-Visual Question Answering (AVQA)Question Answering | CodeCode Available | 1 | 5 |
| SViTT: Temporal Learning of Sparse Video-Text Transformers | Apr 18, 2023 | Question AnsweringRetrieval | CodeCode Available | 1 | 5 |
| SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning | Nov 25, 2021 | Caption GenerationQuestion Answering | CodeCode Available | 1 | 5 |
| DAM: Dynamic Adapter Merging for Continual Video QA Learning | Mar 13, 2024 | Continual Learningimage-classification | CodeCode Available | 1 | 5 |
| HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training | May 1, 2020 | Language ModelingLanguage Modelling | CodeCode Available | 1 | 5 |
| Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling | Feb 11, 2021 | Question AnsweringRetrieval | CodeCode Available | 1 | 5 |
| HCQA @ Ego4D EgoSchema Challenge 2024 | Jun 22, 2024 | Caption Generation | CodeCode Available | 1 | 5 |
| DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering | Jul 10, 2021 | Graph AttentionQuestion Answering | CodeCode Available | 1 | 5 |
| HawkEye: Training Video-Text LLMs for Grounding Text in Videos | Mar 15, 2024 | Video GroundingVideo Question Answering | CodeCode Available | 1 | 5 |
| Learning to Answer Visual Questions from Web Videos | May 10, 2022 | Dataset GenerationQuestion Answering | CodeCode Available | 1 | 5 |
| Just Ask: Learning to Answer Questions from Millions of Narrated Videos | Dec 1, 2020 | Question AnsweringQuestion Generation | CodeCode Available | 1 | 5 |
| LifeQA: A Real-life Dataset for Video Question Answering | May 1, 2020 | Multiple-choiceQuestion Answering | CodeCode Available | 1 | 5 |
| Agentic Keyframe Search for Video Question Answering | Mar 20, 2025 | EgoSchemaQuestion Answering | CodeCode Available | 1 | 5 |
| AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant | Nov 30, 2021 | Question AnsweringRetrieval | CodeCode Available | 1 | 5 |
| Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions | Jul 17, 2020 | Question AnsweringVideo Question Answering | CodeCode Available | 1 | 5 |
| Grounded Question-Answering in Long Egocentric Videos | Dec 11, 2023 | Video GroundingVideo Question Answering | CodeCode Available | 1 | 5 |
| Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos | Mar 17, 2025 | BenchmarkingQuestion Answering | CodeCode Available | 1 | 5 |
| Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos | Aug 26, 2024 | FormLanguage Modelling | CodeCode Available | 1 | 5 |
| Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling | Oct 8, 2022 | Language ModelingLanguage Modelling | CodeCode Available | 1 | 5 |
| Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners | May 22, 2022 | AttributeAutomatic Speech Recognition | CodeCode Available | 1 | 5 |