| Can Temporal Information Help with Contrastive Self-Supervised Learning? | Nov 25, 2020 | Data Augmentation, Representation Learning | —Unverified | 0 |
| Can't Fool Me: Adversarially Robust Transformer for Video Understanding | Oct 26, 2021 | image-classification, Image Classification | —Unverified | 0 |
| CATER: A diagnostic dataset for Compositional Actions & TEmporal Reasoning | May 1, 2020 | Diagnostic, Object | —Unverified | 0 |
| Causal Reasoning Meets Visual Representation Learning: A Prospective Study | Apr 26, 2022 | Benchmarking, Out-of-Distribution Generalization | —Unverified | 0 |
| CAVALRY-V: A Large-Scale Generator Framework for Adversarial Attacks on Video MLLMs | Jul 1, 2025 | Text Generation, Video Understanding | —Unverified | 0 |
| CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding | Dec 16, 2024 | Hallucination, Multiple-choice | —Unverified | 0 |
| Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis | May 14, 2024 | 4k, GPU | —Unverified | 0 |
| Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos | Apr 25, 2018 | General Classification, Video Classification | —Unverified | 0 |
| ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System | Apr 27, 2023 | Video Understanding | —Unverified | 0 |
| Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI | Jul 14, 2025 | Large Language Model, Multimodal Large Language Model | —Unverified | 0 |
| CinePile: A Long Video Question Answering Dataset and Benchmark | May 14, 2024 | Form, Human-Object Interaction Detection | —Unverified | 0 |
| Clapper: Compact Learning and Video Representation in VLMs | May 21, 2025 | Video Understanding | —Unverified | 0 |
| ClawCraneNet: Leveraging Object-level Relation for Text-based Video Segmentation | Mar 19, 2021 | Object, Referring Expression Segmentation | —Unverified | 0 |
| CLIP4Caption: CLIP for Video Caption | Oct 13, 2021 | Decoder, Sentence | —Unverified | 0 |
| Co-attentional Transformers for Story-Based Video Understanding | Oct 27, 2020 | Question Answering, Video Question Answering | —Unverified | 0 |
| COEF-VQ: Cost-Efficient Video Quality Understanding through a Cascaded Multimodal LLM Framework | Dec 11, 2024 | GPU, Language Modeling | —Unverified | 0 |
| CogME: A Cognition-Inspired Multi-Dimensional Evaluation Metric for Story Understanding | Jul 21, 2021 | Question Answering, Sentence | —Unverified | 0 |
| Collaborative Temporal Consistency Learning for Point-supervised Natural Language Video Localization | Mar 22, 2025 | Saliency Detection, Sentence | —Unverified | 0 |
| How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs | May 6, 2024 | Autonomous Vehicles, Video Understanding | —Unverified | 0 |
| Comprehensive Video Understanding: Video summarization with content-based video recommender design | Oct 30, 2019 | Action Recognition, Data Augmentation | —Unverified | 0 |
| Compressed Vision for Efficient Video Understanding | Oct 6, 2022 | Video Compression, Video Understanding | —Unverified | 0 |
| Concept Graph Neural Networks for Surgical Video Understanding | Feb 27, 2022 | Video Understanding | —Unverified | 0 |
| Constructing Hierarchical Q&A Datasets for Video Story Understanding | Apr 1, 2019 | Video Understanding | —Unverified | 0 |
| ContextDet: Temporal Action Detection with Adaptive Context Aggregation | Oct 20, 2024 | Action Detection, Video Understanding | —Unverified | 0 |
| Context Modulated Dynamic Networks for Actor and Action Video Segmentation with Language Queries | Apr 3, 2020 | Referring Expression Segmentation, Video Segmentation | —Unverified | 0 |
| Contrastive Language-Action Pre-training for Temporal Localization | Apr 26, 2022 | Action Localization, Contrastive Learning | —Unverified | 0 |
| Contrastive Language Video Time Pre-training | Jun 4, 2024 | Action Recognition, Contrastive Learning | —Unverified | 0 |
| CoS: Chain-of-Shot Prompting for Long Video Understanding | Feb 10, 2025 | Video Understanding | —Unverified | 0 |
| CRCL: Causal Representation Consistency Learning for Anomaly Detection in Surveillance Videos | Mar 24, 2025 | Anomaly Detection, Anomaly Detection In Surveillance Videos | —Unverified | 0 |
| Cross-Attentional Audio-Visual Fusion for Weakly-Supervised Action Localization | Jan 1, 2021 | Action Localization, Video Understanding | —Unverified | 0 |
| Cross-Class Relevance Learning for Temporal Concept Localization | Nov 19, 2019 | Feature Engineering, Video Understanding | —Unverified | 0 |
| CrossVideo: Self-supervised Cross-modal Contrastive Learning for Point Cloud Video Understanding | Jan 17, 2024 | Contrastive Learning, point cloud video understanding | —Unverified | 0 |
| CTM: Collaborative Temporal Modeling for Action Recognition | Feb 8, 2020 | Action Recognition, Video Understanding | —Unverified | 0 |
| Cultivating DNN Diversity for Large Scale Video Labelling | Jul 13, 2017 | Diversity, Video Understanding | —Unverified | 0 |
| Cut-Based Graph Learning Networks to Discover Compositional Structure of Sequential Video Data | Jan 17, 2020 | Graph Learning, Video Understanding | —Unverified | 0 |
| Cutup and Detect: Human Fall Detection on Cutup Untrimmed Videos Using a Large Foundational Video Understanding Model | Jan 29, 2024 | Action Detection, Action Localization | —Unverified | 0 |
| Cycle-Contrast for Self-Supervised Video Representation Learning | Oct 28, 2020 | Action Recognition, Contrastive Learning | —Unverified | 0 |
| DANTE-AD: Dual-Vision Attention Network for Long-Term Audio Description | Mar 31, 2025 | Video Description, Video Understanding | —Unverified | 0 |
| Deep learning for action spotting in association football videos | Oct 2, 2024 | Action Spotting, Benchmarking | —Unverified | 0 |
| Deep Spatio-Temporal Random Fields for Efficient Video Segmentation | Jul 3, 2018 | Instance Segmentation, Semantic Segmentation | —Unverified | 0 |
| Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding | May 23, 2025 | Form, Question Answering | —Unverified | 0 |
| DenseImage Network: Video Spatial-Temporal Evolution Encoding and Understanding | May 19, 2018 | Action Recognition In Videos, Gesture Recognition | —Unverified | 0 |
| Detection and Localization of Robotic Tools in Robot-Assisted Surgery Videos Using Deep Neural Networks for Region Proposal and Detection | Jul 29, 2020 | object-detection, Object Detection | —Unverified | 0 |
| Development of a MultiModal Annotation Framework and Dataset for Deep Video Understanding | Jun 1, 2022 | Knowledge Graphs, Video Understanding | —Unverified | 0 |
| Discerning Generic Event Boundaries in Long-Form Wild Videos | Jun 18, 2021 | Boundary Detection, Form | —Unverified | 0 |
| Discrete neural representations for explainable anomaly detection | Dec 10, 2021 | Anomaly Detection, Object | —Unverified | 0 |
| Disentangle and denoise: Tackling context misalignment for video moment retrieval | Aug 14, 2024 | Denoising, Disentanglement | —Unverified | 0 |
| Distantly Supervised Semantic Text Detection and Recognition for Broadcast Sports Videos Understanding | Oct 31, 2021 | Action Recognition, Text Detection | —Unverified | 0 |
| DLM-VMTL:A Double Layer Mapper for heterogeneous data video Multi-task prompt learning | Aug 29, 2024 | Multi-Task Learning, Prompt Learning | —Unverified | 0 |
| DMC-Net: Generating Discriminative Motion Cues for Fast Compressed Video Action Recognition | Jan 11, 2019 | Action Classification, Action Recognition | —Unverified | 0 |