SOTAVerified

Action Classification

Papers

Showing 1–50 of 457 papers

Title | Status | Hype
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | Code | 7
VideoMamba: State Space Model for Efficient Video Understanding | Code | 5
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | Code | 4
InternVideo: General Video Foundation Models via Generative and Discriminative Learning | Code | 4
Towards Universal Soccer Video Understanding | Code | 3
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | Code | 3
Expanding Language-Image Pretrained Models for General Video Recognition | Code | 3
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | Code | 3
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking | Code | 2
AIM: Adapting Image Models for Efficient Video Action Recognition | Code | 2
Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models | Code | 2
Learning Video Representations from Large Language Models | Code | 2
MARLIN: Masked Autoencoder for facial video Representation LearnINg | Code | 2
UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer | Code | 2
Revisiting Classifier: Transferring Vision-Language Models for Video Recognition | Code | 2
Omnivore: A Single Model for Many Visual Modalities | Code | 2
Video Swin Transformer | Code | 2
Is Space-Time Attention All You Need for Video Understanding? | Code | 2
X3D: Expanding Architectures for Efficient Video Recognition | Code | 2
Omni-sourced Webly-supervised Learning for Video Recognition | Code | 2
Temporal Segment Networks for Action Recognition in Videos | Code | 2
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition | Code | 2
Make Your Training Flexible: Towards Deployment-Efficient Video Models | Code | 1
Temporal Action Localization with Cross Layer Task Decoupling and Refinement | Code | 1
KNN-MMD: Cross Domain Wireless Sensing via Local Distribution Alignment | Code | 1
Autoregressive Adaptive Hypergraph Transformer for Skeleton-based Activity Recognition | Code | 1
CrossFi: A Cross Domain Wi-Fi Sensing Framework Based on Siamese Network | Code | 1
Probabilistic Vision-Language Representation for Weakly Supervised Temporal Action Localization | Code | 1
EPAM-Net: An Efficient Pose-driven Attention-guided Multimodal Network for Video Action Recognition | Code | 1
EgoExo-Fitness: Towards Egocentric and Exocentric Full-Body Action Understanding | Code | 1
Finding the Missing Data: A BERT-inspired Approach Against Package Loss in Wireless Sensing | Code | 1
Open-Vocabulary Video Relation Extraction | Code | 1
CAST: Cross-Attention in Space and Time for Video Action Recognition | Code | 1
Just Add π! Pose Induced Video Transformers for Understanding Activities of Daily Living | Code | 1
Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning | Code | 1
ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video | Code | 1
ALIP: Adaptive Language-Image Pre-training with Synthetic Caption | Code | 1
Actor-agnostic Multi-label Action Recognition with Multi-modal Query | Code | 1
What Can Simple Arithmetic Operations Do for Temporal Modeling? | Code | 1
Seeing the Pose in the Pixels: Learning Pose-Aware Representations in Vision Transformers | Code | 1
AssemblyHands: Towards Egocentric Activity Understanding via 3D Hand Pose Estimation | Code | 1
Implicit Temporal Modeling with Learnable Alignment for Video Recognition | Code | 1
The effectiveness of MAE pre-pretraining for billion-scale pretraining | Code | 1
Dual-path Adaptation from Image to Video Transformers | Code | 1
HierVL: Learning Hierarchical Video-Language Embeddings | Code | 1
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | Code | 1
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning | Code | 1
Post-Processing Temporal Action Detection | Code | 1
XKD: Cross-modal Knowledge Distillation with Domain Alignment for Video Representation Learning | Code | 1
AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders | Code | 1

Leaderboard

No leaderboard results yet.