| Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation | Mar 25, 2025 | Hallucination, Hallucination Evaluation | Code Available | 1 |
| PAVE: Patching and Adapting Video Large Language Models | Mar 25, 2025 | Audio-visual Question Answering, Multi-Task Learning | Code Available | 1 |
| ACVUBench: Audio-Centric Video Understanding Benchmark | Mar 25, 2025 | Video Understanding | Code Available | 0 |
| CRCL: Causal Representation Consistency Learning for Anomaly Detection in Surveillance Videos | Mar 24, 2025 | Anomaly Detection, Anomaly Detection In Surveillance Videos | Unverified | 0 |
| SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding | Mar 24, 2025 | Form, Video Understanding | Unverified | 0 |
| Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding | Mar 24, 2025 | 8k, GPU | Unverified | 0 |
| Breaking the Encoder Barrier for Seamless Video-Language Understanding | Mar 24, 2025 | Decoder, Language Modeling | Unverified | 0 |
| Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks | Mar 24, 2025 | Common Sense Reasoning, Prediction | Unverified | 0 |
| MammAlps: A multi-view video behavior monitoring dataset of wild mammals in the Swiss Alps | Mar 23, 2025 | Scene Segmentation, Video Understanding | Code Available | 1 |
| V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction | Mar 22, 2025 | Benchmarking, Video Understanding | Code Available | 1 |
| 4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding | Mar 22, 2025 | Benchmarking, Object | Code Available | 0 |
| Collaborative Temporal Consistency Learning for Point-supervised Natural Language Video Localization | Mar 22, 2025 | Saliency Detection, Sentence | Unverified | 0 |
| Temporal Action Detection Model Compression by Progressive Block Drop | Mar 21, 2025 | Action Detection, Autonomous Driving | Unverified | 0 |
| PVChat: Personalized Video Chat with One-Shot Learning | Mar 21, 2025 | One-Shot Learning, Question Answering | Unverified | 0 |
| What can Off-the-Shelves Large Multi-Modal Models do for Dynamic Scene Graph Generation? | Mar 20, 2025 | Decoder, Graph Generation | Unverified | 0 |
| Agentic Keyframe Search for Video Question Answering | Mar 20, 2025 | EgoSchema, Question Answering | Code Available | 1 |
| Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models | Mar 20, 2025 | Multiple-choice, Video Understanding | Code Available | 1 |
| XAttention: Block Sparse Attention with Antidiagonal Scoring | Mar 20, 2025 | Video Generation, Video Understanding | Code Available | 3 |
| DocVideoQA: Towards Comprehensive Understanding of Document-Centric Videos through Question Answering | Mar 20, 2025 | Contrastive Learning, Question Answering | Unverified | 0 |
| MASH-VLM: Mitigating Action-Scene Hallucination in Video-LLMs through Disentangled Spatial-Temporal Representations | Mar 20, 2025 | Hallucination, Video Understanding | Unverified | 0 |
| STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding | Mar 20, 2025 | Video Understanding, Zero-shot Generalization | Code Available | 1 |
| FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding | Mar 19, 2025 | Benchmarking, Multiple-choice | Unverified | 0 |
| SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability | Mar 18, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Improving LLM Video Understanding with 16 Frames Per Second | Mar 18, 2025 | MME, Video MME | Unverified | 0 |
| Impossible Videos | Mar 18, 2025 | counterfactual, Video Generation | Unverified | 0 |
| ViSpeak: Visual Instruction Feedback in Streaming Videos | Mar 17, 2025 | Streaming video understanding, Video Understanding | Code Available | 2 |
| VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning | Mar 17, 2025 | Grounded Video Question Answering, Question Answering | Code Available | 3 |
| Long-VMNet: Accelerating Long-Form Video Understanding via Fixed Memory | Mar 17, 2025 | Form, GPU | Unverified | 0 |
| Towards Scalable Modeling of Compressed Videos for Efficient Action Recognition | Mar 17, 2025 | Action Recognition, Video Recognition | Unverified | 0 |
| Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding | Mar 17, 2025 | Attribute, MME | Unverified | 0 |
| Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma? | Mar 16, 2025 | Language Modeling, Language Modelling | Code Available | 1 |
| AdaReTaKe: Adaptive Redundancy Reduction to Perceive Longer for Video-language Understanding | Mar 16, 2025 | Video Understanding | Code Available | 2 |
| Watch and Learn: Leveraging Expert Knowledge and Language for Surgical Video Understanding | Mar 14, 2025 | Denoising, Dense Video Captioning | Unverified | 0 |
| LLaVA-MLB: Mitigating and Leveraging Attention Bias for Training-Free Video LLMs | Mar 14, 2025 | Video Understanding | Unverified | 0 |
| Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers | Mar 14, 2025 | GPU, Mamba | Unverified | 0 |
| V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning | Mar 14, 2025 | Benchmarking, Relational Reasoning | Unverified | 0 |
| TIME: Temporal-sensitive Multi-dimensional Instruction Tuning and Benchmarking for Video-LLMs | Mar 13, 2025 | Benchmarking, Question Answering | Unverified | 0 |
| Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing | Mar 13, 2025 | EgoSchema, Form | Code Available | 0 |
| LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents | Mar 13, 2025 | Computational Efficiency, Optical Character Recognition (OCR) | Unverified | 0 |
| Reasoning is All You Need for Video Generalization: A Counterfactual Benchmark with Sub-question Evaluation | Mar 12, 2025 | All, counterfactual | Unverified | 0 |
| On the Limitations of Vision-Language Models in Understanding Image Transforms | Mar 12, 2025 | Question Answering, Video Generation | Unverified | 0 |
| Measure Twice, Cut Once: Grasping Video Structures and Event Semantics with LLMs for Video Temporal Localization | Mar 12, 2025 | Temporal Localization, Video Understanding | Unverified | 0 |
| FaVChat: Unlocking Fine-Grained Facial Video Understanding with Multimodal Large Language Models | Mar 12, 2025 | Mixture-of-Experts, Question Answering | Unverified | 0 |
| Everything Can Be Described in Words: A Simple Unified Multi-Modal Framework with Semantic and Temporal Alignment | Mar 12, 2025 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers | Mar 12, 2025 | GPU, Streaming video understanding | Unverified | 0 |
| VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary | Mar 12, 2025 | EgoSchema, Retrieval | Code Available | 4 |
| Exo2Ego: Exocentric Knowledge Guided MLLM for Egocentric Video Understanding | Mar 12, 2025 | Instruction Following, Video Understanding | Unverified | 0 |
| Generative Frame Sampler for Long Video Understanding | Mar 12, 2025 | Video Understanding | Unverified | 0 |
| Memory-enhanced Retrieval Augmentation for Long Video Understanding | Mar 12, 2025 | RAG, Retrieval | Unverified | 0 |
| QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension | Mar 11, 2025 | AutoML, Decoder | Code Available | 2 |