| A Simple and Efficient Pipeline to Build an End-to-End Spatial-Temporal Action Detector | Jun 7, 2022 | Action Classification, Action Detection | Code Available | 1 |
| AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions | May 23, 2017 | Action Detection | Code Available | 1 |
| MH-DETR: Video Moment and Highlight Detection with Cross-modal Transformer | Apr 29, 2023 | Decoder, Highlight Detection | Code Available | 1 |
| MMAD: Multi-label Micro-Action Detection in Videos | Jul 7, 2024 | Action Analysis, Action Detection | Code Available | 1 |
| AutoVideo: An Automated Video Action Recognition System | Aug 9, 2021 | Action Recognition, AutoML | Code Available | 1 |
| Masked Spatio-Temporal Structure Prediction for Self-supervised Learning on Point Cloud Videos | Aug 18, 2023 | point cloud video understanding, Self-Supervised Learning | Code Available | 1 |
| Action Scene Graphs for Long-Form Understanding of Egocentric Videos | Dec 6, 2023 | Action Anticipation, Form | Code Available | 1 |
| Mamba4D: Efficient 4D Point Cloud Video Understanding with Disentangled Spatial-Temporal State Space Models | Jan 1, 2025 | Action Recognition, Action Segmentation | Code Available | 1 |
| MammAlps: A multi-view video behavior monitoring dataset of wild mammals in the Swiss Alps | Mar 23, 2025 | Scene Segmentation, Video Understanding | Code Available | 1 |
| MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding | Jul 8, 2025 | Autonomous Driving, Video Understanding | Code Available | 1 |
| CyberV: Cybernetics for Test-time Scaling in Video Understanding | Jun 9, 2025 | Video Understanding | Code Available | 1 |
| Learning Video Context as Interleaved Multimodal Sequences | Jul 31, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| M^3-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation | Jun 15, 2025 | Object, Semantic Segmentation | Code Available | 1 |
| Agentic Keyframe Search for Video Question Answering | Mar 20, 2025 | EgoSchema, Question Answering | Code Available | 1 |
| Long Movie Clip Classification with State-Space Video Models | Apr 4, 2022 | Classification, Decoder | Code Available | 1 |
| MECD+: Unlocking Event-Level Causal Graph Discovery for Video Reasoning | Jan 13, 2025 | Causal Discovery, Causal Inference | Code Available | 1 |
| Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning | Mar 2, 2025 | Large Language Model, Multi-Instance Retrieval | Code Available | 1 |
| Crossover Learning for Fast Online Video Instance Segmentation | Apr 13, 2021 | Instance Segmentation, Semantic Segmentation | Code Available | 1 |
| Learning Transferable Spatiotemporal Representations from Natural Script Knowledge | Sep 30, 2022 | Descriptive, Representation Learning | Code Available | 1 |
| Learning Temporally Latent Causal Processes from General Temporal Data | Sep 29, 2021 | Causal Discovery, Disentanglement | Code Available | 1 |
| Learning Temporally Causal Latent Processes from General Temporal Data | Oct 11, 2021 | Causal Discovery, Representation Learning | Code Available | 1 |
| Learning the Predictability of the Future | Jun 19, 2021 | Representation Learning, Self-Supervised Action Recognition | Code Available | 1 |
| LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts | May 20, 2025 | Caption Generation, Retrieval | Code Available | 1 |
| Leveraging triplet loss for unsupervised action segmentation | Apr 13, 2023 | Action Segmentation, Clustering | Code Available | 1 |
| Learning Salient Boundary Feature for Anchor-free Temporal Action Localization | Mar 24, 2021 | Action Localization, Temporal Action Localization | Code Available | 1 |