| VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment | Jun 16, 2024 | Action Understanding, Benchmarking | —Unverified | 0 | 0 |
| VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks | Jun 10, 2025 | Multiple-choice, Open-Ended Question Answering | —Unverified | 0 | 0 |
| VEU-Bench: Towards Comprehensive Understanding of Video Editing | Jan 1, 2025 | Video Editing, Video Understanding | —Unverified | 0 | 0 |
| ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning | May 21, 2025 | Pseudo Label, Reinforcement Learning (RL) | —Unverified | 0 | 0 |
| ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models | Nov 16, 2024 | Hallucination, Video Generation | —Unverified | 0 | 0 |
| ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation | Dec 12, 2024 | Phrase Grounding, Question Answering | —Unverified | 0 | 0 |
| VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models | Oct 15, 2024 | Video Understanding | —Unverified | 0 | 0 |
| Video4MRI: An Empirical Study on Brain Magnetic Resonance Image Analytics with CNN-based Video Classification Frameworks | Feb 24, 2023 | Classification, Data Augmentation | —Unverified | 0 | 0 |
| VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding | Mar 18, 2024 | EgoSchema, Video Understanding | —Unverified | 0 | 0 |
| VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation | Nov 20, 2024 | Chatbot, Multiple-choice | —Unverified | 0 | 0 |
| Perceptron Synthesis Network: Rethinking the Action Scale Variances in Videos | Jul 22, 2020 | Action Recognition, Temporal Action Localization | —Unverified | 0 | 0 |
| Video Domain Incremental Learning for Human Action Recognition in Home Environments | Dec 22, 2024 | Action Recognition, class-incremental learning | —Unverified | 0 | 0 |
| Video Event Reasoning and Prediction by Fusing World Knowledge from LLMs with Vision Foundation Models | Jul 8, 2025 | Future prediction, Large Language Model | —Unverified | 0 | 0 |
| VideoExpert: Augmented LLM for Temporal-Sensitive Video Understanding | Apr 10, 2025 | Instruction Following, Video Understanding | —Unverified | 0 | 0 |
| VideoGLUE: Video General Understanding Evaluation of Foundation Models | Jul 6, 2023 | Action Recognition, Temporal Localization | —Unverified | 0 | 0 |
| Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding | Dec 31, 2023 | Spatio-Temporal Video Grounding, Video Grounding | —Unverified | 0 | 0 |
| VideoGrounding-DINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding | Jan 1, 2024 | Spatio-Temporal Video Grounding, Video Grounding | —Unverified | 0 | 0 |
| VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models | Jun 24, 2024 | Hallucination, Video Understanding | —Unverified | 0 | 0 |
| VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding | Jul 17, 2025 | Video Grounding, Video Understanding | —Unverified | 0 | 0 |
| Video Language Model Pretraining with Spatio-temporal Masking | Jan 1, 2025 | Decoder, Language Modeling | —Unverified | 0 | 0 |
| VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges | Sep 2, 2024 | GPU, MVBench | —Unverified | 0 | 0 |
| VideoLLM Benchmarks and Evaluation: A Survey | May 3, 2025 | Survey, Video Understanding | —Unverified | 0 | 0 |
| VideoMCC: a New Benchmark for Video Comprehension | Jun 23, 2016 | Multiple-choice, Video Description | —Unverified | 0 | 0 |
| Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition | May 7, 2024 | Large Language Model, Multimodal Large Language Model | —Unverified | 0 | 0 |
| VideoPrism: A Foundational Visual Encoder for Video Understanding | Feb 20, 2024 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| Videoprompter: an ensemble of foundational models for zero-shot video understanding | Oct 23, 2023 | Action Recognition, Descriptive | —Unverified | 0 | 0 |
| Video Quality Assessment for Online Processing: From Spatial to Temporal Sampling | Jan 13, 2025 | Video Quality Assessment, Video Understanding | —Unverified | 0 | 0 |
| Video RWKV:Video Action Recognition Based RWKV | Nov 8, 2024 | Action Recognition, Representation Learning | —Unverified | 0 | 0 |
| VideoSAVi: Self-Aligned Video Language Models without Human Supervision | Dec 1, 2024 | EgoSchema, MVBench | —Unverified | 0 | 0 |
| VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers | Mar 12, 2025 | GPU, Streaming video understanding | —Unverified | 0 | 0 |
| Video Swin Transformers for Egocentric Video Understanding @ Ego4D Challenges 2022 | Jul 22, 2022 | Object, Object State Change Classification | —Unverified | 0 | 0 |
| Video Time: Properties, Encoders and Evaluation | Jul 18, 2018 | Video Understanding | —Unverified | 0 | 0 |
| Video Token Merging for Long-form Video Understanding | Oct 31, 2024 | Form, Video Classification | —Unverified | 0 | 0 |
| Video Understanding as Machine Translation | Jun 12, 2020 | Machine Translation, Metric Learning | —Unverified | 0 | 0 |
| Video Watermarking: Safeguarding Your Video from (Unauthorized) Annotations by Video-based LLMs | Jul 2, 2024 | Video Understanding | —Unverified | 0 | 0 |
| Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding | Mar 24, 2025 | 8k, GPU | —Unverified | 0 | 0 |
| VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding | Dec 4, 2024 | Hallucination, Instruction Following | —Unverified | 0 | 0 |
| VidLPRO: A Video-Language Pre-training Framework for Robotic and Laparoscopic Surgery | Sep 7, 2024 | Computational Efficiency, Contrastive Learning | —Unverified | 0 | 0 |
| ViFi-ReID: A Two-Stream Vision-WiFi Multimodal Approach for Person Re-identification | Oct 13, 2024 | Contrastive Learning, Person Re-Identification | —Unverified | 0 | 0 |
| VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation | Dec 1, 2024 | Instruction Following, Video Understanding | —Unverified | 0 | 0 |
| Visual Context Window Extension: A New Perspective for Long Video Understanding | Sep 30, 2024 | Video Understanding | —Unverified | 0 | 0 |
| Visual Subtitle Feature Enhanced Video Outline Generation | Aug 24, 2022 | Articles, Headline Generation | —Unverified | 0 | 0 |
| VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding | May 20, 2021 | Action Segmentation, Language Modeling | —Unverified | 0 | 0 |
| VRDFormer: End-to-End Video Visual Relation Detection With Transformers | Jan 1, 2022 | Object, Relation | —Unverified | 0 | 0 |
| V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning | Mar 14, 2025 | Benchmarking, Relational Reasoning | —Unverified | 0 | 0 |
| VUDG: A Dataset for Video Understanding Domain Generalization | May 30, 2025 | Domain Generalization, Multiple-choice | —Unverified | 0 | 0 |
| Wasserstein Dependency Measure for Representation Learning | Mar 28, 2019 | Object Recognition, reinforcement-learning | —Unverified | 0 | 0 |
| Watch and Learn: Leveraging Expert Knowledge and Language for Surgical Video Understanding | Mar 14, 2025 | Denoising, Dense Video Captioning | —Unverified | 0 | 0 |
| Weakly Supervised Multiclass Video Segmentation | Jun 1, 2014 | Segmentation, Semantic Similarity | —Unverified | 0 | 0 |
| Weakly Supervised Temporal Action Localization via Dual-Prior Collaborative Learning Guided by Multimodal Large Language Models | Jan 1, 2025 | Action Localization, Temporal Action Localization | —Unverified | 0 | 0 |