| Understanding Complexity in VideoQA via Visual Program Generation | May 19, 2025 | Code Generation, Question Answering | Unverified | 0 |
| SurveillanceVQA-589K: A Benchmark for Comprehensive Surveillance Video-Language Understanding with Large Models | May 19, 2025 | Causal Inference, Decision Making | Unverified | 0 |
| Temporally-Grounded Language Generation: A Benchmark for Real-Time Vision-Language Models | May 16, 2025 | Image Captioning, Question Answering | Code Available | 0 |
| Overview of the NLPCC 2025 Shared Task 4: Multi-modal, Multilingual, and Multi-hop Medical Instructional Video Question Answering Challenge | May 11, 2025 | Multimodal Reasoning, Question Answering | Unverified | 0 |
| Seed1.5-VL Technical Report | May 11, 2025 | Mixture-of-Experts, Multimodal Reasoning | Unverified | 0 |
| Towards Understanding Camera Motions in Any Video | Apr 21, 2025 | Question Answering, Text Retrieval | Unverified | 0 |
| Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization | Apr 16, 2025 | Hallucination, Question Answering | Unverified | 0 |
| How Can Objects Help Video-Language Understanding? | Apr 10, 2025 | Image Captioning, Object | Unverified | 0 |
| Advancing Egocentric Video Question Answering with Multimodal Large Language Models | Apr 6, 2025 | Object Recognition, Question Answering | Unverified | 0 |
| Leveraging Static Relationships for Intra-Type and Inter-Type Message Passing in Video Question Answering | Apr 3, 2025 | Question Answering, Video Question Answering | Unverified | 0 |
| Leveraging LLMs with Iterative Loop Structure for Enhanced Social Intelligence in Video Question Answering | Mar 27, 2025 | Emotion Recognition, Question Answering | Unverified | 0 |
| Self-ReS: Self-Reflection in Large Vision-Language Models for Long Video Understanding | Mar 26, 2025 | GPU, Question Answering | Unverified | 0 |
| Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding | Mar 17, 2025 | Attribute, MME | Unverified | 0 |
| VITED: Video Temporal Evidence Distillation | Mar 17, 2025 | Question Answering, Video Question Answering | Unverified | 0 |
| TIME: Temporal-sensitive Multi-dimensional Instruction Tuning and Benchmarking for Video-LLMs | Mar 13, 2025 | Benchmarking, Question Answering | Unverified | 0 |
| Everything Can Be Described in Words: A Simple Unified Multi-Modal Framework with Semantic and Temporal Alignment | Mar 12, 2025 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| Towards Fine-Grained Video Question Answering | Mar 10, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Parameter-free Video Segmentation for Vision and Language Understanding | Mar 3, 2025 | Question Answering, Video Question Answering | Unverified | 0 |
| M-LLM Based Video Frame Selection for Efficient Video Understanding | Feb 27, 2025 | EgoSchema, Language Modeling | Unverified | 0 |
| Multi-Modal Retrieval Augmentation for Open-Ended and Knowledge-Intensive Video Question Answering | Feb 17, 2025 | Multiple-choice, Question Answering | Unverified | 0 |
| ENTER: Event Based Interpretable Reasoning for VideoQA | Jan 24, 2025 | Code Generation, EgoSchema | Unverified | 0 |
| ReasVQA: Advancing VideoQA with Imperfect Reasoning Process | Jan 23, 2025 | Multi-Task Learning, Question Answering | Unverified | 0 |
| Admitting Ignorance Helps the Video Question Answering Models to Answer | Jan 15, 2025 | Question Answering, Video Question Answering | Unverified | 0 |
| TimeLogic: A Temporal Logic Benchmark for Video QA | Jan 13, 2025 | 2k, Action Segmentation | Unverified | 0 |
| Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning | Jan 9, 2025 | Benchmarking, Question Answering | Unverified | 0 |
| AdaCM^2: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction | Jan 1, 2025 | GPU, Question Answering | Unverified | 0 |
| Efficient Motion-Aware Video MLLM | Jan 1, 2025 | Question Answering, Video Question Answering | Unverified | 0 |
| Flexible Frame Selection for Efficient Video Reasoning | Jan 1, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation | Jan 1, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Hierarchical Banzhaf Interaction for General Video-Language Representation Learning | Dec 30, 2024 | Contrastive Learning, Question Answering | Code Available | 0 |
| Perceive, Query & Reason: Enhancing Video QA with Question-Guided Temporal Queries | Dec 26, 2024 | Question Answering, Video Question Answering | Unverified | 0 |
| VidCtx: Context-aware Video Question Answering with Image Models | Dec 23, 2024 | Large Language Model, Question Answering | Code Available | 0 |
| FriendsQA: A New Large-Scale Deep Video Understanding Dataset with Fine-grained Topic Categorization for Story Videos | Dec 22, 2024 | Language Modelling, Large Language Model | Code Available | 0 |
| PolySmart @ TRECVid 2024 Medical Video Question Answering | Dec 20, 2024 | Question Answering, Retrieval | Unverified | 0 |
| Overview of TREC 2024 Medical Video Question Answering (MedVidQA) Track | Dec 15, 2024 | Image Captioning, Medical Question Answering | Unverified | 0 |
| IQViC: In-context, Question Adaptive Vision Compressor for Long-term Video Understanding LMMs | Dec 13, 2024 | Question Answering, Video Question Answering | Unverified | 0 |
| Foundation Models and Adaptive Feature Selection: A Synergistic Approach to Video Question Answering | Dec 12, 2024 | feature selection, Language Modeling | Unverified | 0 |
| Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | Dec 6, 2024 | document understanding, Hallucination | Code Available | 0 |
| SEAL: Semantic Attention Learning for Long Video Representation | Dec 2, 2024 | Diversity, Question Answering | Unverified | 0 |
| Unlocking Video-LLM via Agent-of-Thoughts Distillation | Dec 2, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Eyes on the Road: State-of-the-Art Video Question Answering Models Assessment for Traffic Monitoring Tasks | Dec 2, 2024 | Multi-Object Tracking, Object Tracking | Code Available | 0 |
| Actions and Objects Pathways for Domain Adaptation in Video Question Answering | Nov 29, 2024 | Domain Adaptation, Domain Generalization | Unverified | 0 |
| Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA Benchmark | Nov 29, 2024 | Benchmarking, Grounded Video Question Answering | Unverified | 0 |
| HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation | Nov 27, 2024 | Graph Generation, Question Answering | Unverified | 0 |
| VideoOrion: Tokenizing Object Dynamics in Videos | Nov 25, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| AdaCM^2: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction | Nov 19, 2024 | GPU, Question Answering | Unverified | 0 |
| EVQAScore: Efficient Video Question Answering Data Evaluation | Nov 11, 2024 | Keyword Extraction, Question Answering | Unverified | 0 |
| Poze: Sports Technique Feedback under Data Constraints | Nov 8, 2024 | Pose Estimation, Question Answering | Unverified | 0 |
| FLAASH: Flow-Attention Adaptive Semantic Hierarchical Fusion for Multi-Modal Tobacco Content Analysis | Oct 25, 2024 | Question Answering, Video Question Answering | Unverified | 0 |
| GPT-4o System Card | Oct 25, 2024 | Multiple-choice, Spatial Reasoning | Unverified | 0 |