| BEARCUBS: A benchmark for computer-using web agents | Mar 10, 2025 | Video Understanding | Unverified | 0 |
| ALLVB: All-in-One Long Video Understanding Benchmark | Mar 10, 2025 | All, Video Understanding | Unverified | 0 |
| Towards Fine-Grained Video Question Answering | Mar 10, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| TimeLoc: A Unified End-to-End Framework for Precise Timestamp Localization in Long Videos | Mar 9, 2025 | Action Localization, Boundary Detection | Code Available | 1 |
| Unified Reward Model for Multimodal Understanding and Generation | Mar 7, 2025 | Image Generation, model | Code Available | 4 |
| EgoLife: Towards Egocentric Life Assistant | Mar 5, 2025 | Question Answering, Video Understanding | Code Available | 3 |
| Towards Visual Discrimination and Reasoning of Real-World Physical Dynamics: Physics-Grounded Anomaly Detection | Mar 5, 2025 | Anomaly Detection, Object | Unverified | 0 |
| Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning | Mar 2, 2025 | Large Language Model, Multi-Instance Retrieval | Code Available | 1 |
| HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models | Feb 28, 2025 | Action Understanding, Text-to-Video Generation | Unverified | 0 |
| PreMind: Multi-Agent Video Understanding for Advanced Indexing of Presentation-style Videos | Feb 28, 2025 | Question Answering, Video Understanding | Unverified | 0 |
| OpenTAD: A Unified Framework and Comprehensive Study of Temporal Action Detection | Feb 27, 2025 | Action Detection, Benchmarking | Code Available | 3 |
| M-LLM Based Video Frame Selection for Efficient Video Understanding | Feb 27, 2025 | EgoSchema, Language Modeling | Unverified | 0 |
| InternVQA: Advancing Compressed Video Quality Assessment with Distilling Large Foundation Model | Feb 26, 2025 | Video Quality Assessment, Video Understanding | Unverified | 0 |
| An Analysis of Data Transformation Effects on Segment Anything 2 | Feb 25, 2025 | Semantic Segmentation, Video Object Segmentation | Unverified | 0 |
| Task Graph Maximum Likelihood Estimation for Procedural Activity Understanding in Egocentric Videos | Feb 25, 2025 | Graph Learning, Mistake Detection | Code Available | 1 |
| Fine-Grained Video Captioning through Scene Graph Consolidation | Feb 23, 2025 | Caption Generation, Image Captioning | Unverified | 0 |
| LongCaptioning: Unlocking the Power of Long Caption Generation in Large Multimodal Models | Feb 21, 2025 | Caption Generation, Video Captioning | Unverified | 0 |
| AVD2: Accident Video Diffusion for Accident Video Description | Feb 20, 2025 | Autonomous Driving, Scene Understanding | Unverified | 0 |
| MomentSeeker: A Task-Oriented Benchmark For Long-Video Moment Retrieval | Feb 18, 2025 | Action Recognition, Moment Retrieval | Unverified | 0 |
| iMOVE: Instance-Motion-Aware Video Understanding | Feb 17, 2025 | Computational Efficiency, Video Understanding | Unverified | 0 |
| VRoPE: Rotary Position Embedding for Video Large Language Models | Feb 17, 2025 | Position, Video Understanding | Code Available | 1 |
| video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model | Feb 17, 2025 | Language Modeling, Language Modelling | Code Available | 1 |
| Semantics-aware Test-time Adaptation for 3D Human Pose Estimation | Feb 15, 2025 | 3D human pose and shape estimation, 3D Human Pose Estimation | Unverified | 0 |
| SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding | Feb 15, 2025 | Question Answering, Streaming video understanding | Code Available | 2 |
| Optimizing GPT for Video Understanding: Zero-Shot Performance and Prompt Engineering | Feb 13, 2025 | Classification, Prompt Engineering | Unverified | 0 |
| Enhancing Video Understanding: Deep Neural Networks for Spatiotemporal Analysis | Feb 11, 2025 | Action Recognition, Video Description | Unverified | 0 |
| A Survey on Mamba Architecture for Vision Applications | Feb 11, 2025 | Mamba, object-detection | Unverified | 0 |
| CoS: Chain-of-Shot Prompting for Long Video Understanding | Feb 10, 2025 | Video Understanding | Unverified | 0 |
| A Survey on Video Analytics in Cloud-Edge-Terminal Collaborative Systems | Feb 10, 2025 | Autonomous Driving, Edge-computing | Unverified | 0 |
| VideoRoPE: What Makes for Good Video Rotary Position Embedding? | Feb 7, 2025 | Hallucination, Position | Code Available | 3 |
| Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy | Feb 7, 2025 | 4k, General Knowledge | Code Available | 3 |
| WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs | Feb 6, 2025 | Video Understanding | Unverified | 0 |
| MaxInfo: A Training-Free Key-Frame Selection Method Using Maximum Volume for Enhanced Video Understanding | Feb 5, 2025 | Diversity, EgoSchema | Unverified | 0 |
| A Decade of Action Quality Assessment: Largest Systematic Survey of Trends, Challenges, and Future Directions | Feb 5, 2025 | Action Quality Assessment, Survey | Unverified | 0 |
| TUMTraffic-VideoQA: A Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes | Feb 4, 2025 | Autonomous Driving, Multiple-choice | Code Available | 1 |
| Hier-EgoPack: Hierarchical Egocentric Video Understanding with Diverse Task Perspectives | Feb 4, 2025 | Video Understanding | Code Available | 1 |
| LV-XAttn: Distributed Cross-Attention for Long Visual Inputs in Multimodal Large Language Models | Feb 4, 2025 | GPU, Video Understanding | Unverified | 0 |
| VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos | Feb 3, 2025 | Knowledge Graphs, RAG | Code Available | 7 |
| AIN: The Arabic INclusive Large Multimodal Model | Jan 31, 2025 | document understanding, model | Code Available | 2 |
| ∞-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation | Jan 31, 2025 | Question Answering, Video Question Answering | Code Available | 1 |
| Exploring the Role of Explicit Temporal Modeling in Multimodal Large Language Models for Video Understanding | Jan 28, 2025 | Decoder, Video Understanding | Unverified | 0 |
| Understanding Long Videos via LLM-Powered Entity Relation Graphs | Jan 27, 2025 | EgoSchema, Large Language Model | Unverified | 0 |
| TinyLLaVA-Video: A Simple Framework of Small-scale Large Multimodal Models for Video Understanding | Jan 26, 2025 | Video Understanding | Code Available | 2 |
| HumanOmni: A Large Vision-Speech Language Model for Human-Centric Video Understanding | Jan 25, 2025 | Action Understanding, Emotion Recognition | Unverified | 0 |
| Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge | Jan 23, 2025 | Scheduling, Streaming video understanding | Code Available | 2 |
| Temporal Preference Optimization for Long-Form Video Understanding | Jan 23, 2025 | Form, MME | Unverified | 0 |
| VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding | Jan 22, 2025 | Philosophy, Video Question Answering | Code Available | 5 |
| InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling | Jan 21, 2025 | Object Tracking, Referring Expression Segmentation | Code Available | 0 |
| InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model | Jan 21, 2025 | Instruction Following, Mathematical Reasoning | Code Available | 0 |
| MMVU: Measuring Expert-Level Multi-Discipline Video Understanding | Jan 21, 2025 | Video Understanding | Code Available | 2 |