| OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding | Apr 15, 2025 | Semantic Segmentation, Video Generation | Unverified | 0 |
| OmniV-Med: Scaling Medical Vision-Language Model for Universal Visual Understanding | Apr 20, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Only Time Can Tell: Discovering Temporal Data for Temporal Modeling | Jul 19, 2019 | Benchmarking, Motion Estimation | Unverified | 0 |
| On the Limitations of Vision-Language Models in Understanding Image Transforms | Mar 12, 2025 | Question Answering, Video Generation | Unverified | 0 |
| Open-Set Video-based Facial Expression Recognition with Human Expression-sensitive Prompting | Apr 26, 2024 | Facial Expression Recognition, Multi-Task Learning | Unverified | 0 |
| Open Vocabulary Multi-Label Video Classification | Jul 12, 2024 | Action Classification, Classification | Unverified | 0 |
| Open-Vocabulary Spatio-Temporal Action Detection | May 17, 2024 | Action Detection, Fine-Grained Action Detection | Unverified | 0 |
| Optimizing GPT for Video Understanding: Zero-Shot Performance and Prompt Engineering | Feb 13, 2025 | Classification, Prompt Engineering | Unverified | 0 |
| Overview of Tencent Multi-modal Ads Video Understanding Challenge | Sep 16, 2021 | Multi-Label Classification | Unverified | 0 |
| Overview of the MedVidQA 2022 Shared Task on Medical Video Question-Answering | May 1, 2022 | Question Answering, Video Classification | Unverified | 0 |
| Threading Keyframe with Narratives: MLLMs as Strong Long Video Comprehenders | May 30, 2025 | Video Understanding | Unverified | 0 |
| Time Blindness: Why Video-Language Models Can't See What Humans Can? | May 30, 2025 | Temporal Sequences, Video Understanding | Unverified | 0 |
| TimeSearch: Hierarchical Video Search with Spotlight and Reflection for Human-like Long Video Understanding | Apr 2, 2025 | Video Understanding | Unverified | 0 |
| TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation | Apr 24, 2025 | Caption Generation, Dense Video Captioning | Unverified | 0 |
| TIME: Temporal-sensitive Multi-dimensional Instruction Tuning and Benchmarking for Video-LLMs | Mar 13, 2025 | Benchmarking, Question Answering | Unverified | 0 |
| Toward a Human-Level Video Understanding Intelligence | Oct 8, 2021 | AI Agent, Video Understanding | Unverified | 0 |
| Towards Child-Inclusive Clinical Video Understanding for Autism Spectrum Disorder | Sep 20, 2024 | Activity Recognition, Diagnostic | Unverified | 0 |
| Towards Efficient and Robust Moment Retrieval System: A Unified Framework for Multi-Granularity Models and Temporal Reranking | Apr 11, 2025 | Moment Retrieval, Question Answering | Unverified | 0 |
| Towards Fine-Grained Video Question Answering | Mar 10, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset | Jun 19, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Towards Long Video Understanding via Fine-detailed Video Story Generation | Dec 9, 2024 | Story Generation, Video Understanding | Unverified | 0 |
| Towards Scalable Modeling of Compressed Videos for Efficient Action Recognition | Mar 17, 2025 | Action Recognition, Video Recognition | Unverified | 0 |
| Towards Training Stronger Video Vision Transformers for EPIC-KITCHENS-100 Action Recognition | Jun 9, 2021 | Action Recognition, Point Cloud Classification | Unverified | 0 |
| Towards Visual Discrimination and Reasoning of Real-World Physical Dynamics: Physics-Grounded Anomaly Detection | Mar 5, 2025 | Anomaly Detection, Object | Unverified | 0 |
| Transformed ROIs for Capturing Visual Transformations in Videos | Jun 6, 2021 | Action Recognition, Video Understanding | Unverified | 0 |
| Transition Is a Process: Pair-to-Video Change Detection Networks for Very High Resolution Remote Sensing Images | Dec 7, 2022 | Building change detection for remote sensing images, Change Detection | Unverified | 0 |
| TVBench: Redesigning Video-Language Evaluation | Oct 10, 2024 | Multiple-choice, Open-Ended Question Answering | Unverified | 0 |
| TV-TREES: Multimodal Entailment Trees for Neuro-Symbolic Video Reasoning | Feb 29, 2024 | Question Answering, Video Understanding | Unverified | 0 |
| Two Causally Related Needles in a Video Haystack | May 26, 2025 | Video Understanding, Visual Grounding | Unverified | 0 |
| Two-Stream Transformer Architecture for Long Video Understanding | Aug 2, 2022 | Action Recognition, GPU | Unverified | 0 |
| UBoCo: Unsupervised Boundary Contrastive Learning for Generic Event Boundary Detection | Nov 29, 2021 | Boundary Detection, Contrastive Learning | Unverified | 0 |
| UBoCo: Unsupervised Boundary Contrastive Learning for Generic Event Boundary Detection | Jan 1, 2022 | Boundary Detection, Contrastive Learning | Unverified | 0 |
| Towards Surveillance Video-and-Language Understanding: New Dataset, Baselines, and Challenges | Sep 25, 2023 | Anomaly Detection, Dense Video Captioning | Unverified | 0 |
| Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks | Mar 24, 2025 | Common Sense Reasoning, Prediction | Unverified | 0 |
| Understanding Action Sequences based on Video Captioning for Learning-from-Observation | Dec 9, 2020 | Video Captioning, Video Understanding | Unverified | 0 |
| Understanding Long Videos via LLM-Powered Entity Relation Graphs | Jan 27, 2025 | EgoSchema, Large Language Model | Unverified | 0 |
| Unidentified Video Objects: A Benchmark for Dense, Open-World Segmentation | Apr 10, 2021 | Object, Object Detection | Unverified | 0 |
| UniDual: A Unified Model for Image and Video Understanding | Jun 10, 2019 | Multi-Task Learning, Video Understanding | Unverified | 0 |
| Unified Graph Structured Models for Video Understanding | Mar 29, 2021 | Action Detection, Graph Classification | Unverified | 0 |
| Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action | Jan 1, 2024 | Image Generation, Instruction Following | Unverified | 0 |
| UniFormerV2: Unlocking the Potential of Image ViTs for Video Understanding | Jan 1, 2023 | Video Understanding | Unverified | 0 |
| Universal Visuo-Tactile Video Understanding for Embodied Interaction | May 28, 2025 | Friction, Large Language Model | Unverified | 0 |
| Unsupervised Motion Representation Enhanced Network for Action Recognition | Mar 5, 2021 | Action Recognition, Optical Flow Estimation | Unverified | 0 |
| Unsupervised Object Discovery and Tracking in Video Collections | May 14, 2015 | Object, Object Discovery | Unverified | 0 |
| Unsupervised Video Understanding by Reconciliation of Posture Similarities | Aug 3, 2017 | Action Classification, Retrieval | Unverified | 0 |
| Human Gaze Guided Attention for Surgical Activity Recognition | Mar 9, 2022 | Activity Recognition, Video Understanding | Unverified | 0 |
| Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers | Mar 14, 2025 | GPU, Mamba | Unverified | 0 |
| VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding | Dec 4, 2023 | Language Modeling, Language Modelling | Unverified | 0 |
| VCA: Video Curious Agent for Long Video Understanding | Dec 12, 2024 | Video Understanding | Unverified | 0 |
| Vehicle Detection and Classification without Residual Calculation: Accelerating HEVC Image Decoding with Random Perturbation Injection | May 14, 2023 | Image Reconstruction, vehicle detection | Unverified | 0 |