| InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | Mar 22, 2024 | Action Classification, Action Recognition | Code Available | 7 |
| Scaling Up Your Kernels: Large Kernel Design in ConvNets towards Universal Representations | Oct 10, 2024 | Time Series Forecasting, Video Recognition | Code Available | 5 |
| Expanding Language-Image Pretrained Models for General Video Recognition | Aug 4, 2022 | Action Classification, Action Recognition | Code Available | 3 |
| Uni-AdaFocus: Spatial-temporal Dynamic Computation for Video Recognition | Dec 15, 2024 | Computational Efficiency, Video Recognition | Code Available | 2 |
| DeMamba: AI-Generated Video Detection on Million-Scale GenVideo Benchmark | May 30, 2024 | DeepFake Detection, Mamba | Code Available | 2 |
| Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation | Mar 18, 2024 | Mixture-of-Experts, parameter-efficient fine-tuning | Code Available | 2 |
| Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models | Dec 31, 2022 | Action Classification, Action Recognition | Code Available | 2 |
| Revisiting Classifier: Transferring Vision-Language Models for Video Recognition | Jul 4, 2022 | Action Classification, Action Recognition | Code Available | 2 |
| AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition | May 26, 2022 | Action Recognition, Video Recognition | Code Available | 2 |
| TSM: Temporal Shift Module for Efficient and Scalable Video Understanding on Edge Device | Sep 27, 2021 | Video Recognition, Video Understanding | Code Available | 2 |
| Video Swin Transformer | Jun 24, 2021 | Action Classification, Action Recognition | Code Available | 2 |
| Would Mega-scale Datasets Further Enhance Spatiotemporal 3D CNNs? | Apr 10, 2020 | General Classification, Open-Ended Question Answering | Code Available | 2 |
| X3D: Expanding Architectures for Efficient Video Recognition | Apr 9, 2020 | Action Classification, feature selection | Code Available | 2 |
| Omni-sourced Webly-supervised Learning for Video Recognition | Mar 29, 2020 | Action Classification, Action Recognition | Code Available | 2 |
| BASKET: A Large-Scale Video Dataset for Fine-Grained Skill Estimation | Mar 26, 2025 | Video Recognition | Code Available | 1 |
| PAVE: Patching and Adapting Video Large Language Models | Mar 25, 2025 | Audio-visual Question Answering, Multi-Task Learning | Code Available | 1 |
| OmniCLIP: Adapting CLIP for Video Recognition with Spatial-Temporal Omni-Scale Feature Learning | Aug 12, 2024 | Video Recognition, Zero-Shot Learning | Code Available | 1 |
| VideoMamba: Spatio-Temporal Selective State Space Model | Jul 11, 2024 | Mamba, model | Code Available | 1 |
| No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding | May 14, 2024 | Action Detection, GPU | Code Available | 1 |
| VG4D: Vision-Language Model Goes 4D Video Recognition | Apr 17, 2024 | Action Recognition, Autonomous Driving | Code Available | 1 |
| Video Recognition in Portrait Mode | Dec 21, 2023 | Data Augmentation, Video Recognition | Code Available | 1 |
| Adapting Short-Term Transformers for Action Detection in Untrimmed Videos | Dec 4, 2023 | Action Detection, Video Recognition | Code Available | 1 |
| OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition | Nov 30, 2023 | Descriptive, Language Modelling | Code Available | 1 |
| DEVIAS: Learning Disentangled Video Representations of Action and Scene | Nov 30, 2023 | Action Recognition, Decoder | Code Available | 1 |
| Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data | Oct 8, 2023 | Action Recognition, Continual Learning | Code Available | 1 |
| ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video | Oct 2, 2023 | Action Classification, Action Recognition | Code Available | 1 |
| Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning | Sep 14, 2023 | Transfer Learning, Video Recognition | Code Available | 1 |
| Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers | Aug 25, 2023 | Action Recognition, Object Detection | Code Available | 1 |
| Audio-Visual Class-Incremental Learning | Aug 21, 2023 | class-incremental learning, Class Incremental Learning | Code Available | 1 |
| Helping Hands: An Object-Aware Ego-Centric Video Recognition Model | Aug 15, 2023 | Decoder, Object | Code Available | 1 |
| Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation | Aug 8, 2023 | Video Recognition | Code Available | 1 |
| What Can Simple Arithmetic Operations Do for Temporal Modeling? | Jul 18, 2023 | Action Classification, Action Recognition | Code Available | 1 |
| Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition | Jul 13, 2023 | Action Recognition, Temporal Action Localization | Code Available | 1 |
| Implicit Temporal Modeling with Learnable Alignment for Video Recognition | Apr 20, 2023 | Action Classification, Action Recognition | Code Available | 1 |
| Frame Flexible Network | Mar 26, 2023 | Video Recognition | Code Available | 1 |
| The effectiveness of MAE pre-pretraining for billion-scale pretraining | Mar 23, 2023 | Action Classification, Action Recognition | Code Available | 1 |
| MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge | Mar 15, 2023 | Action Recognition, Few-Shot action recognition | Code Available | 1 |
| Making Vision Transformers Efficient from A Token Sparsification View | Mar 15, 2023 | Efficient ViTs, image-classification | Code Available | 1 |
| Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization | Feb 1, 2023 | Action Recognition, Continual Learning | Code Available | 1 |
| Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring | Jan 26, 2023 | Representation Learning, Retrieval | Code Available | 1 |
| Efficient Movie Scene Detection using State-Space Transformers | Dec 29, 2022 | GPU, Scene Segmentation | Code Available | 1 |
| VLG: General Video Recognition with Web Textual Knowledge | Dec 3, 2022 | Video Recognition | Code Available | 1 |
| SVFormer: Semi-supervised Video Transformer for Action Recognition | Nov 23, 2022 | Action Recognition, image-classification | Code Available | 1 |
| Look More but Care Less in Video Recognition | Nov 18, 2022 | Action Recognition, Video Recognition | Code Available | 1 |
| Cluster and Aggregate: Face Recognition with Large Probe Set | Oct 19, 2022 | Face Recognition, Face Verification | Code Available | 1 |
| Towards a Unified View on Visual Parameter-Efficient Transfer Learning | Oct 3, 2022 | Action Recognition, Image Classification | Code Available | 1 |
| AdaFocusV3: On Unified Spatial-temporal Dynamic Video Recognition | Sep 27, 2022 | Video Recognition | Code Available | 1 |
| Rethinking Resolution in the Context of Efficient Video Recognition | Sep 26, 2022 | Knowledge Distillation, Video Recognition | Code Available | 1 |
| Real-time Online Video Detection with Temporal Smoothing Transformers | Sep 19, 2022 | Action Anticipation, Action Detection | Code Available | 1 |
| Frozen CLIP Models are Efficient Video Learners | Aug 6, 2022 | Action Classification, Decoder | Code Available | 1 |