| Recurring the Transformer for Video Action Recognition | Jan 1, 2022 | Action Recognition, GPU | —Unverified | 0 | 0 |
| Relational Space-Time Query in Long-Form Videos | Jan 1, 2023 | Form, Video Understanding | —Unverified | 0 | 0 |
| Residual Frames with Efficient Pseudo-3D CNN for Human Action Recognition | Aug 3, 2020 | Action Recognition, Optical Flow Estimation | —Unverified | 0 | 0 |
| ResNetVLLM -- Multi-modal Vision LLM for the Video Understanding Task | Apr 20, 2025 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| Rethinking Image-to-Video Adaptation: An Object-centric Perspective | Jul 9, 2024 | Action Recognition, Object | —Unverified | 0 | 0 |
| Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data | Jul 18, 2024 | Language Modelling, Large Language Model | —Unverified | 0 | 0 |
| Retrieval-based Video Language Model for Efficient Long Video Question Answering | Dec 8, 2023 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning | May 11, 2024 | Image-text matching, Retrieval | —Unverified | 0 | 0 |
| Revealing Occlusions with 4D Neural Fields | Apr 22, 2022 | Video Understanding | —Unverified | 0 | 0 |
| Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding | Sep 20, 2023 | Action Localization, Form | —Unverified | 0 | 0 |
| ReWind: Understanding Long Videos with Instructed Learnable Memory | Nov 23, 2024 | Large Language Model, Question Answering | —Unverified | 0 | 0 |
| SA-NET.v2: Real-time vehicle detection from oblique UAV images with use of uncertainty estimation in deep meta-learning | Aug 4, 2022 | Meta-Learning, Semantic Segmentation | —Unverified | 0 | 0 |
| SAVEn-Vid: Synergistic Audio-Visual Integration for Enhanced Understanding in Long Video Context | Nov 25, 2024 | Large Language Model, MME | —Unverified | 0 | 0 |
| Scene-centric Joint Parsing of Cross-view Videos | Sep 16, 2017 | Video Understanding | —Unverified | 0 | 0 |
| Scene Detection Policies and Keyframe Extraction Strategies for Large-Scale Video Analysis | May 31, 2025 | Scene Segmentation, Segmentation | —Unverified | 0 | 0 |
| SceneRAG: Scene-level Retrieval-Augmented Generation for Video Understanding | Jun 9, 2025 | RAG, Retrieval | —Unverified | 0 | 0 |
| MM-SEAL: A Large-scale Video Dataset of Multi-person Multi-grained Spatio-temporally Action Localization | Apr 6, 2022 | Action Localization, Action Recognition | —Unverified | 0 | 0 |
| SEAL: Semantic Attention Learning for Long Video Representation | Dec 2, 2024 | Diversity, Question Answering | —Unverified | 0 | 0 |
| Search-Map-Search: A Frame Selection Paradigm for Action Recognition | Apr 20, 2023 | Action Recognition, Heuristic Search | —Unverified | 0 | 0 |
| Seed1.5-VL Technical Report | May 11, 2025 | Mixture-of-Experts, Multimodal Reasoning | —Unverified | 0 | 0 |
| Selective Structured State-Spaces for Long-Form Video Understanding | Mar 25, 2023 | Contrastive Learning, Form | —Unverified | 0 | 0 |
| Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization | Apr 16, 2025 | Hallucination, Question Answering | —Unverified | 0 | 0 |
| Self-ReS: Self-Reflection in Large Vision-Language Models for Long Video Understanding | Mar 26, 2025 | GPU, Question Answering | —Unverified | 0 | 0 |
| Self-supervised Motion Representation via Scattering Local Motion Cues | Aug 1, 2020 | Action Recognition, Optical Flow Estimation | —Unverified | 0 | 0 |
| Self-Supervised Object Detection from Egocentric Videos | Jan 1, 2023 | Class-agnostic Object Detection, Object | —Unverified | 0 | 0 |
| Self-Supervised Spatiotemporal Feature Learning via Video Rotation Prediction | Nov 28, 2018 | Action Recognition, Prediction | —Unverified | 0 | 0 |
| Self-supervised video pretraining yields robust and more human-aligned visual representations | Oct 12, 2022 | Contrastive Learning, object-detection | —Unverified | 0 | 0 |
| Semantics-aware Test-time Adaptation for 3D Human Pose Estimation | Feb 15, 2025 | 3D human pose and shape estimation, 3D Human Pose Estimation | —Unverified | 0 | 0 |
| Semantic Segmentation on VSPW Dataset through Masked Video Consistency | Jun 7, 2024 | Semantic Segmentation, Video Understanding | —Unverified | 0 | 0 |
| Semi-Parametric Video-Grounded Text Generation | Jan 27, 2023 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| SF2T: Self-supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding | Apr 10, 2025 | Video Understanding | —Unverified | 0 | 0 |
| ShotVL: Human-Centric Highlight Frame Retrieval via Language Queries | Dec 17, 2024 | Human Detection, image-classification | —Unverified | 0 | 0 |
| SkillFormer: Unified Multi-View Video Understanding for Proficiency Estimation | May 13, 2025 | Computational Efficiency, Video Understanding | —Unverified | 0 | 0 |
| Skimming and Scanning for Untrimmed Video Action Recognition | Apr 21, 2021 | Action Recognition, Temporal Action Localization | —Unverified | 0 | 0 |
| Slicing Convolutional Neural Network for Crowd Video Understanding | Jun 1, 2016 | Attribute, Video Understanding | —Unverified | 0 | 0 |
| Slot-VLM: SlowFast Slots for Video-Language Modeling | Feb 20, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding | Mar 24, 2025 | Form, Video Understanding | —Unverified | 0 | 0 |
| SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability | Mar 18, 2025 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding | Nov 30, 2023 | Form, Video Retrieval | —Unverified | 0 | 0 |
| Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs | May 25, 2025 | Video Understanding | —Unverified | 0 | 0 |
| Spatio-Temporal Context for Action Detection | Jun 29, 2021 | Action Detection, Video Understanding | —Unverified | 0 | 0 |
| Spatio-Temporal Crop Aggregation for Video Representation Learning | Nov 30, 2022 | Action Classification, Dimensionality Reduction | —Unverified | 0 | 0 |
| Spatio-Temporal Vector of Locally Max Pooled Features for Action Recognition in Videos | Jul 1, 2017 | Action Recognition, Action Recognition In Videos | —Unverified | 0 | 0 |
| Spatio-Temporal Video Representation Learning for AI Based Video Playback Style Prediction | Oct 3, 2021 | Action Recognition, Representation Learning | —Unverified | 0 | 0 |
| Speeding Up Action Recognition Using Dynamic Accumulation of Residuals in Compressed Domain | Sep 29, 2022 | Action Recognition, Video Understanding | —Unverified | 0 | 0 |
| Spherical World-Locking for Audio-Visual Localization in Egocentric Videos | Aug 9, 2024 | Active Speaker Localization, Decoder | —Unverified | 0 | 0 |
| Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions | May 10, 2021 | Contrastive Learning, Retrieval | —Unverified | 0 | 0 |
| SPOT! Revisiting Video-Language Models for Event Understanding | Nov 21, 2023 | Attribute, Video Understanding | —Unverified | 0 | 0 |
| Stacked Temporal Attention: Improving First-person Action Recognition by Emphasizing Discriminative Clips | Dec 2, 2021 | Action Recognition, Video Understanding | —Unverified | 0 | 0 |
| STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training | Nov 29, 2024 | Question Answering, Video Understanding | —Unverified | 0 | 0 |