| C^3: Compositional Counterfactual Contrastive Learning for Video-grounded Dialogues | Jun 16, 2021 | Contrastive Learning, counterfactual | —Unverified | 0 | 0 |
| CA^2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition | Mar 30, 2025 | Action Classification, Action Recognition | —Unverified | 0 | 0 |
| CAG-QIL: Context-Aware Actionness Grouping via Q Imitation Learning for Online Temporal Action Localization | Jan 1, 2021 | Action Localization, Imitation Learning | —Unverified | 0 | 0 |
| Camera Calibration and Player Localization in SoccerNet-v2 and Investigation of their Representations for Action Spotting | Apr 19, 2021 | Action Spotting, Camera Calibration | —Unverified | 0 | 0 |
| Can CLIP Count Stars? An Empirical Study on Quantity Bias in CLIP | Sep 23, 2024 | Image Generation, Question Answering | —Unverified | 0 | 0 |
| FIOVA: A Multi-Annotator Benchmark for Human-Aligned Video Captioning | Oct 20, 2024 | Diagnostic, Video Captioning | —Unverified | 0 | 0 |
| Can MLLMs Guide Weakly-Supervised Temporal Action Localization Tasks? | Nov 13, 2024 | Action Localization, Temporal Action Localization | —Unverified | 0 | 0 |
| Can Temporal Information Help with Contrastive Self-Supervised Learning? | Nov 25, 2020 | Data Augmentation, Representation Learning | —Unverified | 0 | 0 |
| Can't Fool Me: Adversarially Robust Transformer for Video Understanding | Oct 26, 2021 | image-classification, Image Classification | —Unverified | 0 | 0 |
| CATER: A diagnostic dataset for Compositional Actions & TEmporal Reasoning | May 1, 2020 | Diagnostic, Object | —Unverified | 0 | 0 |
| Causal Reasoning Meets Visual Representation Learning: A Prospective Study | Apr 26, 2022 | Benchmarking, Out-of-Distribution Generalization | —Unverified | 0 | 0 |
| CAVALRY-V: A Large-Scale Generator Framework for Adversarial Attacks on Video MLLMs | Jul 1, 2025 | Text Generation, Video Understanding | —Unverified | 0 | 0 |
| CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding | Dec 16, 2024 | Hallucination, Multiple-choice | —Unverified | 0 | 0 |
| Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis | May 14, 2024 | 4k, GPU | —Unverified | 0 | 0 |
| Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos | Apr 25, 2018 | General Classification, Video Classification | —Unverified | 0 | 0 |
| ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System | Apr 27, 2023 | Video Understanding | —Unverified | 0 | 0 |
| Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI | Jul 14, 2025 | Large Language Model, Multimodal Large Language Model | —Unverified | 0 | 0 |
| CinePile: A Long Video Question Answering Dataset and Benchmark | May 14, 2024 | Form, Human-Object Interaction Detection | —Unverified | 0 | 0 |
| Clapper: Compact Learning and Video Representation in VLMs | May 21, 2025 | Video Understanding | —Unverified | 0 | 0 |
| ClawCraneNet: Leveraging Object-level Relation for Text-based Video Segmentation | Mar 19, 2021 | Object, Referring Expression Segmentation | —Unverified | 0 | 0 |
| CLIP4Caption: CLIP for Video Caption | Oct 13, 2021 | Decoder, Sentence | —Unverified | 0 | 0 |
| Co-attentional Transformers for Story-Based Video Understanding | Oct 27, 2020 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| COEF-VQ: Cost-Efficient Video Quality Understanding through a Cascaded Multimodal LLM Framework | Dec 11, 2024 | GPU, Language Modeling | —Unverified | 0 | 0 |
| CogME: A Cognition-Inspired Multi-Dimensional Evaluation Metric for Story Understanding | Jul 21, 2021 | Question Answering, Sentence | —Unverified | 0 | 0 |
| Collaborative Temporal Consistency Learning for Point-supervised Natural Language Video Localization | Mar 22, 2025 | Saliency Detection, Sentence | —Unverified | 0 | 0 |
| How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs | May 6, 2024 | Autonomous Vehicles, Video Understanding | —Unverified | 0 | 0 |
| Comprehensive Video Understanding: Video summarization with content-based video recommender design | Oct 30, 2019 | Action Recognition, Data Augmentation | —Unverified | 0 | 0 |
| Compressed Vision for Efficient Video Understanding | Oct 6, 2022 | Video Compression, Video Understanding | —Unverified | 0 | 0 |
| Concept Graph Neural Networks for Surgical Video Understanding | Feb 27, 2022 | Video Understanding | —Unverified | 0 | 0 |
| Constructing Hierarchical Q&A Datasets for Video Story Understanding | Apr 1, 2019 | Video Understanding | —Unverified | 0 | 0 |
| ContextDet: Temporal Action Detection with Adaptive Context Aggregation | Oct 20, 2024 | Action Detection, Video Understanding | —Unverified | 0 | 0 |
| Context Modulated Dynamic Networks for Actor and Action Video Segmentation with Language Queries | Apr 3, 2020 | Referring Expression Segmentation, Video Segmentation | —Unverified | 0 | 0 |
| Contrastive Language-Action Pre-training for Temporal Localization | Apr 26, 2022 | Action Localization, Contrastive Learning | —Unverified | 0 | 0 |
| Contrastive Language Video Time Pre-training | Jun 4, 2024 | Action Recognition, Contrastive Learning | —Unverified | 0 | 0 |
| CoS: Chain-of-Shot Prompting for Long Video Understanding | Feb 10, 2025 | Video Understanding | —Unverified | 0 | 0 |
| CRCL: Causal Representation Consistency Learning for Anomaly Detection in Surveillance Videos | Mar 24, 2025 | Anomaly Detection, Anomaly Detection In Surveillance Videos | —Unverified | 0 | 0 |
| Cross-Attentional Audio-Visual Fusion for Weakly-Supervised Action Localization | Jan 1, 2021 | Action Localization, Video Understanding | —Unverified | 0 | 0 |
| Cross-Class Relevance Learning for Temporal Concept Localization | Nov 19, 2019 | Feature Engineering, Video Understanding | —Unverified | 0 | 0 |
| CrossVideo: Self-supervised Cross-modal Contrastive Learning for Point Cloud Video Understanding | Jan 17, 2024 | Contrastive Learning, point cloud video understanding | —Unverified | 0 | 0 |
| CTM: Collaborative Temporal Modeling for Action Recognition | Feb 8, 2020 | Action Recognition, Video Understanding | —Unverified | 0 | 0 |
| Cultivating DNN Diversity for Large Scale Video Labelling | Jul 13, 2017 | Diversity, Video Understanding | —Unverified | 0 | 0 |
| Cut-Based Graph Learning Networks to Discover Compositional Structure of Sequential Video Data | Jan 17, 2020 | Graph Learning, Video Understanding | —Unverified | 0 | 0 |
| Cutup and Detect: Human Fall Detection on Cutup Untrimmed Videos Using a Large Foundational Video Understanding Model | Jan 29, 2024 | Action Detection, Action Localization | —Unverified | 0 | 0 |
| Cycle-Contrast for Self-Supervised Video Representation Learning | Oct 28, 2020 | Action Recognition, Contrastive Learning | —Unverified | 0 | 0 |
| DANTE-AD: Dual-Vision Attention Network for Long-Term Audio Description | Mar 31, 2025 | Video Description, Video Understanding | —Unverified | 0 | 0 |
| Deep learning for action spotting in association football videos | Oct 2, 2024 | Action Spotting, Benchmarking | —Unverified | 0 | 0 |
| Deep Spatio-Temporal Random Fields for Efficient Video Segmentation | Jul 3, 2018 | Instance Segmentation, Semantic Segmentation | —Unverified | 0 | 0 |
| Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding | May 23, 2025 | Form, Question Answering | —Unverified | 0 | 0 |
| DenseImage Network: Video Spatial-Temporal Evolution Encoding and Understanding | May 19, 2018 | Action Recognition In Videos, Gesture Recognition | —Unverified | 0 | 0 |
| Detection and Localization of Robotic Tools in Robot-Assisted Surgery Videos Using Deep Neural Networks for Region Proposal and Detection | Jul 29, 2020 | object-detection, Object Detection | —Unverified | 0 | 0 |