| O2NA: An Object-Oriented Non-Autoregressive Approach for Controllable Video Captioning | Aug 5, 2021 | Attribute, Caption Generation | —Unverified | 0 | 0 |
| Object Dynamics Distillation for Scene Decomposition and Representation | Sep 29, 2021 | Object, Predict Future Video Frames | —Unverified | 0 | 0 |
| Occluded Video Instance Segmentation: Dataset and ICCV 2021 Challenge | Nov 15, 2021 | Instance Segmentation, Object Recognition | —Unverified | 0 | 0 |
| OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding | Jul 6, 2024 | Video Understanding | —Unverified | 0 | 0 |
| OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts | Mar 29, 2025 | Streaming video understanding, Video Understanding | —Unverified | 0 | 0 |
| Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks | Jan 14, 2025 | Language Modeling | —Unverified | 0 | 0 |
| OmniTrack: Real-time detection and tracking of objects, text and logos in video | Oct 14, 2019 | GPU, object-detection | —Unverified | 0 | 0 |
| OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding | Apr 15, 2025 | Semantic Segmentation, Video Generation | —Unverified | 0 | 0 |
| OmniV-Med: Scaling Medical Vision-Language Model for Universal Visual Understanding | Apr 20, 2025 | Language Modeling | —Unverified | 0 | 0 |
| Only Time Can Tell: Discovering Temporal Data for Temporal Modeling | Jul 19, 2019 | Benchmarking, Motion Estimation | —Unverified | 0 | 0 |
| On the Limitations of Vision-Language Models in Understanding Image Transforms | Mar 12, 2025 | Question Answering, Video Generation | —Unverified | 0 | 0 |
| Open-Set Video-based Facial Expression Recognition with Human Expression-sensitive Prompting | Apr 26, 2024 | Facial Expression Recognition, Multi-Task Learning | —Unverified | 0 | 0 |
| Open Vocabulary Multi-Label Video Classification | Jul 12, 2024 | Action Classification, Classification | —Unverified | 0 | 0 |
| Open-Vocabulary Spatio-Temporal Action Detection | May 17, 2024 | Action Detection, Fine-Grained Action Detection | —Unverified | 0 | 0 |
| Optimizing GPT for Video Understanding: Zero-Shot Performance and Prompt Engineering | Feb 13, 2025 | Classification, Prompt Engineering | —Unverified | 0 | 0 |
| Overview of Tencent Multi-modal Ads Video Understanding Challenge | Sep 16, 2021 | Multi-Label Classification | —Unverified | 0 | 0 |
| Overview of the MedVidQA 2022 Shared Task on Medical Video Question-Answering | May 1, 2022 | Question Answering, Video Classification | —Unverified | 0 | 0 |
| Overview of TREC 2024 Medical Video Question Answering (MedVidQA) Track | Dec 15, 2024 | Image Captioning, Medical Question Answering | —Unverified | 0 | 0 |
| OV-HHIR: Open Vocabulary Human Interaction Recognition Using Cross-modal Integration of Large Language Models | Dec 31, 2024 | Activity Recognition, Human Interaction Recognition | —Unverified | 0 | 0 |
| OW-VISCapTor: Abstractors for Open-World Video Instance Segmentation and Captioning | Apr 4, 2024 | Descriptive, Diversity | —Unverified | 0 | 0 |
| PcmNet: Position-Sensitive Context Modeling Network for Temporal Action Localization | Mar 9, 2021 | Action Localization, Boundary Detection | —Unverified | 0 | 0 |
| Perceive, Query & Reason: Enhancing Video QA with Question-Guided Temporal Queries | Dec 26, 2024 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA Benchmark | Nov 29, 2024 | Benchmarking, Grounded Video Question Answering | —Unverified | 0 | 0 |
| Personalized Video Summarization by Multimodal Video Understanding | Nov 5, 2024 | Unsupervised Video Summarization, Video Summarization | —Unverified | 0 | 0 |
| Person Count Localization in Videos From Noisy Foreground and Detections | Jun 1, 2015 | Foreground Segmentation, Human Detection | —Unverified | 0 | 0 |