| AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant | Nov 30, 2021 | Question Answering, Retrieval | Code Available | 1 |
| Large Language Models are Temporal and Causal Reasoners for Video Question Answering | Oct 24, 2023 | Natural Language Understanding, Question Answering | Code Available | 1 |
| LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling | Jun 14, 2022 | Decoder, Language Modeling | Code Available | 1 |
| CausalChaos! Dataset for Comprehensive Causal Action Question Answering Over Longer Causal Chains Grounded in Dynamic Visual Scenes | Apr 1, 2024 | Causal Discovery, Causal Discovery in Video Reasoning | Code Available | 1 |
| Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling | Oct 8, 2022 | Language Modeling, Language Modelling | Code Available | 1 |
| Grounded Question-Answering in Long Egocentric Videos | Dec 11, 2023 | Video Grounding, Video Question Answering | Code Available | 1 |
| Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos | Aug 26, 2024 | Form, Language Modelling | Code Available | 1 |
| MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering | Dec 19, 2022 | Form, Question Answering | Code Available | 1 |
| Learning to Answer Visual Questions from Web Videos | May 10, 2022 | Dataset Generation, Question Answering | Code Available | 1 |
| Encoding and Controlling Global Semantics for Long-form Video Question Answering | May 30, 2024 | Form, Question Answering | Code Available | 1 |
| Glance and Focus: Memory Prompting for Multi-Event Video Question Answering | Jan 3, 2024 | Action Detection, Human-Object Interaction Detection | Code Available | 1 |
| CRAFT: A Benchmark for Causal Reasoning About Forces and inTeractions | Dec 8, 2020 | counterfactual, Descriptive | Code Available | 1 |
| BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | Sep 27, 2023 | GPU, Video-based Generative Performance Benchmarking | Code Available | 1 |
| Self-Chained Image-Language Model for Video Localization and Question Answering | May 11, 2023 | Language Modeling, Language Modelling | Code Available | 1 |
| Video Dialog as Conversation about Objects Living in Space-Time | Jul 8, 2022 | Object, Relational Reasoning | Code Available | 1 |
| Video-Language Alignment via Spatio-Temporal Graph Transformer | Jul 16, 2024 | Contrastive Learning, Question Answering | Code Available | 1 |
| Advancing Egocentric Video Question Answering with Multimodal Large Language Models | Apr 6, 2025 | Object Recognition, Question Answering | Unverified | 0 |
| Contrastive Video-Language Learning with Fine-grained Frame Sampling | Oct 10, 2022 | Question Answering, Representation Learning | Unverified | 0 |
| Frame-Voyager: Learning to Query Frames for Video Large Language Models | Oct 4, 2024 | Question Answering, Video Question Answering | Unverified | 0 |
| Frame-Subtitle Self-Supervision for Multi-Modal Video Question Answering | Sep 8, 2022 | Question Answering, Video Question Answering | Unverified | 0 |
| Continuous Perception Benchmark | Aug 15, 2024 | Question Answering, Video Question Answering | Unverified | 0 |
| Foundation Models and Adaptive Feature Selection: A Synergistic Approach to Video Question Answering | Dec 12, 2024 | feature selection, Language Modeling | Unverified | 0 |
| Flexible Frame Selection for Efficient Video Reasoning | Jan 1, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| A Review of Deep Learning for Video Captioning | Apr 22, 2023 | Deep Learning, Dense Video Captioning | Unverified | 0 |
| FLAASH: Flow-Attention Adaptive Semantic Hierarchical Fusion for Multi-Modal Tobacco Content Analysis | Oct 25, 2024 | Question Answering, Video Question Answering | Unverified | 0 |