| O2NA: An Object-Oriented Non-Autoregressive Approach for Controllable Video Captioning | Aug 5, 2021 | Attribute, Caption Generation | —Unverified | 0 | 0 |
| Object Dynamics Distillation for Scene Decomposition and Representation | Sep 29, 2021 | Object, Predict Future Video Frames | —Unverified | 0 | 0 |
| Occluded Video Instance Segmentation: Dataset and ICCV 2021 Challenge | Nov 15, 2021 | Instance Segmentation, Object Recognition | —Unverified | 0 | 0 |
| OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding | Jul 6, 2024 | Video Understanding | —Unverified | 0 | 0 |
| OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts | Mar 29, 2025 | Streaming video understanding, Video Understanding | —Unverified | 0 | 0 |
| Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks | Jan 14, 2025 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| OmniTrack: Real-time detection and tracking of objects, text and logos in video | Oct 14, 2019 | GPU, object-detection | —Unverified | 0 | 0 |
| OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding | Apr 15, 2025 | Semantic Segmentation, Video Generation | —Unverified | 0 | 0 |
| OmniV-Med: Scaling Medical Vision-Language Model for Universal Visual Understanding | Apr 20, 2025 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| Only Time Can Tell: Discovering Temporal Data for Temporal Modeling | Jul 19, 2019 | Benchmarking, Motion Estimation | —Unverified | 0 | 0 |
| On the Limitations of Vision-Language Models in Understanding Image Transforms | Mar 12, 2025 | Question Answering, Video Generation | —Unverified | 0 | 0 |
| Open-Set Video-based Facial Expression Recognition with Human Expression-sensitive Prompting | Apr 26, 2024 | Facial Expression Recognition, Multi-Task Learning | —Unverified | 0 | 0 |
| Open Vocabulary Multi-Label Video Classification | Jul 12, 2024 | Action Classification, Classification | —Unverified | 0 | 0 |
| Open-Vocabulary Spatio-Temporal Action Detection | May 17, 2024 | Action Detection, Fine-Grained Action Detection | —Unverified | 0 | 0 |
| Optimizing GPT for Video Understanding: Zero-Shot Performance and Prompt Engineering | Feb 13, 2025 | Classification, Prompt Engineering | —Unverified | 0 | 0 |
| Overview of Tencent Multi-modal Ads Video Understanding Challenge | Sep 16, 2021 | Multi-Label Classification, MUlTI-LABEL-ClASSIFICATION | —Unverified | 0 | 0 |
| Overview of the MedVidQA 2022 Shared Task on Medical Video Question-Answering | May 1, 2022 | Question Answering, Video Classification | —Unverified | 0 | 0 |
| Overview of TREC 2024 Medical Video Question Answering (MedVidQA) Track | Dec 15, 2024 | Image Captioning, Medical Question Answering | —Unverified | 0 | 0 |
| OV-HHIR: Open Vocabulary Human Interaction Recognition Using Cross-modal Integration of Large Language Models | Dec 31, 2024 | Activity Recognition, Human Interaction Recognition | —Unverified | 0 | 0 |
| OW-VISCapTor: Abstractors for Open-World Video Instance Segmentation and Captioning | Apr 4, 2024 | Descriptive, Diversity | —Unverified | 0 | 0 |
| PcmNet: Position-Sensitive Context Modeling Network for Temporal Action Localization | Mar 9, 2021 | Action Localization, Boundary Detection | —Unverified | 0 | 0 |
| Perceive, Query & Reason: Enhancing Video QA with Question-Guided Temporal Queries | Dec 26, 2024 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA Benchmark | Nov 29, 2024 | Benchmarking, Grounded Video Question Answering | —Unverified | 0 | 0 |
| Personalized Video Summarization by Multimodal Video Understanding | Nov 5, 2024 | Unsupervised Video Summarization, Video Summarization | —Unverified | 0 | 0 |
| Person Count Localization in Videos From Noisy Foreground and Detections | Jun 1, 2015 | Foreground Segmentation, Human Detection | —Unverified | 0 | 0 |
| PEVLM: Parallel Encoding for Vision-Language Models | Jun 24, 2025 | Autonomous Driving, Video Understanding | —Unverified | 0 | 0 |
| PIDRo: Parallel Isomeric Attention with Dynamic Routing for Text-Video Retrieval | Jan 1, 2023 | Representation Learning, Retrieval | —Unverified | 0 | 0 |
| Principles of Visual Tokens for Efficient Video Understanding | Nov 20, 2024 | Video Understanding | —Unverified | 0 | 0 |
| ProBio: A Protocol-guided Multimodal Dataset for Molecular Biology Lab | Nov 1, 2023 | Action Recognition, Video Understanding | —Unverified | 0 | 0 |
| Progress-Aware Video Frame Captioning | Dec 3, 2024 | Image Captioning, Video Captioning | —Unverified | 0 | 0 |
| Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering | Oct 12, 2024 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| PromptonomyViT: Multi-Task Prompt Learning Improves Video Transformers using Synthetic Scene Data | Dec 8, 2022 | Action Recognition, Prompt Learning | —Unverified | 0 | 0 |
| Prototypes are Balanced Units for Efficient and Effective Partially Relevant Video Retrieval | Apr 17, 2025 | Partially Relevant Video Retrieval, Retrieval | —Unverified | 0 | 0 |
| PVChat: Personalized Video Chat with One-Shot Learning | Mar 21, 2025 | One-Shot Learning, Question Answering | —Unverified | 0 | 0 |
| PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models | Dec 12, 2024 | Video Understanding | —Unverified | 0 | 0 |
| PVUW 2025 Challenge Report: Advances in Pixel-level Understanding of Complex Videos in the Wild | Apr 15, 2025 | Segmentation, Semantic Segmentation | —Unverified | 0 | 0 |
| PYSKL: a toolbox for skeleton-based video understanding | Apr 2, 2022 | Skeleton Based Action Recognition, Video Understanding | —Unverified | 0 | 0 |
| Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs | Sep 30, 2024 | Benchmarking, Multiple-choice | —Unverified | 0 | 0 |
| Q-Bench-Video: Benchmark the Video Quality Understanding of LMMs | Jan 1, 2025 | Multiple-choice, Video Generation | —Unverified | 0 | 0 |
| Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs | Jun 27, 2025 | MME, Video MME | —Unverified | 0 | 0 |
| Query-aware Long Video Localization and Relation Discrimination for Deep Video Understanding | Oct 19, 2023 | Relation, Video Understanding | —Unverified | 0 | 0 |
| Query-Conditioned Three-Player Adversarial Network for Video Summarization | Jul 17, 2018 | Generative Adversarial Network, Video Summarization | —Unverified | 0 | 0 |
| Question Answering is a Format; When is it Useful? | Sep 25, 2019 | Machine Translation, Question Answering | —Unverified | 0 | 0 |
| R^2-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding | Apr 2, 2024 | Highlight Detection, Moment Retrieval | —Unverified | 0 | 0 |
| R^2-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding | Mar 31, 2024 | Highlight Detection, Moment Retrieval | —Unverified | 0 | 0 |
| Random Temporal Skipping for Multirate Video Analysis | Oct 30, 2018 | Action Recognition, Optical Flow Estimation | —Unverified | 0 | 0 |
| RAVU: Retrieval Augmented Video Understanding with Compositional Reasoning over Graph | May 6, 2025 | EgoSchema, Retrieval | —Unverified | 0 | 0 |
| ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding | Jun 2, 2025 | Action Recognition, Video Understanding | —Unverified | 0 | 0 |
| Real-Time Video Highlights for Yahoo Esports | Nov 27, 2016 | CPU, Dota 2 | —Unverified | 0 | 0 |
| Reasoning is All You Need for Video Generalization: A Counterfactual Benchmark with Sub-question Evaluation | Mar 12, 2025 | All, counterfactual | —Unverified | 0 | 0 |