SOTAVerified

Action Recognition In Videos

Action Recognition in Videos is a task in computer vision and pattern recognition where the goal is to identify and categorize human actions performed in a video sequence. The task involves analyzing the spatiotemporal dynamics of the actions and mapping them to a predefined set of action classes, such as running, jumping, or swimming.
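As a rough illustration of "mapping to a predefined set of action classes", the sketch below pools per-frame class scores over time and picks the highest-scoring class. It is a toy example, not the method of any paper listed here: the `classify_clip` helper, the three-class label set, and the score values are all assumptions made for illustration, and a real system would obtain the per-frame scores from a trained video model.

```python
from typing import List, Sequence

# Hypothetical label set, borrowed from the examples in the task description.
ACTION_CLASSES: List[str] = ["running", "jumping", "swimming"]


def classify_clip(frame_scores: Sequence[Sequence[float]],
                  classes: Sequence[str] = ACTION_CLASSES) -> str:
    """Average per-frame class scores over time and return the top class."""
    if not frame_scores:
        raise ValueError("clip has no frames")
    num_frames = len(frame_scores)
    # Temporal average pooling: one mean score per class.
    pooled = [sum(frame[c] for frame in frame_scores) / num_frames
              for c in range(len(classes))]
    # Index of the class with the highest pooled score.
    best = max(range(len(classes)), key=lambda c: pooled[c])
    return classes[best]


# Toy scores for a 3-frame clip (made-up numbers, not model output).
toy_clip = [
    [0.2, 0.7, 0.1],
    [0.3, 0.6, 0.1],
    [0.6, 0.3, 0.1],
]
print(classify_clip(toy_clip))  # prints "jumping"
```

Averaging over frames is the simplest way to aggregate temporal evidence; the papers listed below replace it with learned spatiotemporal models such as 3D convolutions, recurrent networks, or transformers.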

Papers

Showing 26-50 of 124 papers

| Title | Status | Hype |
|---|---|---|
| Region-based Non-local Operation for Video Classification | Code | 1 |
| Skeleton-based Action Recognition via Spatial and Temporal Transformer Networks | Code | 1 |
| DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition | Code | 1 |
| CAST: Cross-Attention in Space and Time for Video Action Recognition | Code | 1 |
| Self-supervised Video Transformer | Code | 1 |
| Multimodal Fusion via Teacher-Student Network for Indoor Action Recognition | Code | 1 |
| Spatiotemporal Residual Networks for Video Action Recognition | Code | 1 |
| Dual-path Adaptation from Image to Video Transformers | Code | 1 |
| A Dense-Sparse Complementary Network for Human Action Recognition based on RGB and Skeleton Modalities | Code | 1 |
| YouTube-8M: A Large-Scale Video Classification Benchmark | Code | 1 |
| Logsig-RNN: a novel network for robust and efficient skeleton-based action recognition | Code | 1 |
| EPAM-Net: An Efficient Pose-driven Attention-guided Multimodal Network for Video Action Recognition | Code | 1 |
| Body-Hand Modality Expertized Networks with Cross-attention for Fine-grained Skeleton Action Recognition | Code | 0 |
| ActNetFormer: Transformer-ResNet Hybrid Method for Semi-Supervised Action Recognition in Videos | Code | 0 |
| RPAN: An End-to-End Recurrent Pose-Attention Network for Action Recognition in Videos | Code | 0 |
| Self-Supervised MultiModal Versatile Networks | Code | 0 |
| Learning Latent Sub-events in Activity Videos Using Temporal Attention Filters | Code | 0 |
| Robust Real-Time Violence Detection in Video Using CNN And LSTM | Code | 0 |
| Pose And Joint-Aware Action Recognition | Code | 0 |
| R-C3D: Region Convolutional 3D Network for Temporal Activity Detection | Code | 0 |
| Optical Flow Guided Feature: A Fast and Robust Motion Representation for Video Action Recognition | Code | 0 |
| Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles | Code | 0 |
| Out-of-Distribution Detection for Generalized Zero-Shot Action Recognition | Code | 0 |
| HaltingVT: Adaptive Token Halting Transformer for Efficient Video Recognition | Code | 0 |
| Gating Revisited: Deep Multi-layer RNNs That Can Be Trained | Code | 0 |

Benchmark Results

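| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | CPNet Res34, 5 CP | Val | 96.7 | | Unverified |
| 2 | STM (Resnet-50, 16 frames) | Val | 96.7 | | Unverified |
| 3 | MFNet | Val | 96.68 | | Unverified |
| 4 | DIN | Val | 95.31 | | Unverified |
| 5 | MultiScale TRN | Val | 95.31 | | Unverified |
| 6 | convSTAR | Val | 92.7 | | Unverified |
| 7 | 3D-SqueezeNet | Val | 90.77 | | Unverified |
| 8 | 3D-ShuffleNetV2 0.25x | Val | 86.91 | | Unverified |
| 9 | 3D-MobileNetV2 0.2x | Val | 86.43 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | MMNet | X-Sub | 97.4 | | Unverified |
| 2 | DSCNet (RGB + Pose) | X-Sub | 97.4 | | Unverified |
| 3 | EPAM-Net | X-Sub | 96.2 | | Unverified |
| 4 | DVANet (RGB only) | X-Sub | 95.8 | | Unverified |
| 5 | TSMF | X-Sub | 95.8 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | STM (ImageNet+Kinetics pretrain) | 3-fold Accuracy | 96.2 | | Unverified |
| 2 | 3D-SqueezeNet | 3-fold Accuracy | 74.94 | | Unverified |
| 3 | 3D-ShuffleNetV2 0.25x | 3-fold Accuracy | 56.52 | | Unverified |
| 4 | 3D-MobileNetV2 0.2x | 3-fold Accuracy | 55.56 | | Unverified |
| 5 | Baseline UCF101 | 3-fold Accuracy | 43.9 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | STM (16 frames, ImageNet pretraining) | Top-1 Accuracy | 64.2 | | Unverified |
| 2 | CPNet Res34, 5 CP | Top-1 Accuracy | 57.65 | | Unverified |
| 3 | 2-Stream TRN | Top-1 Accuracy | 55.52 | | Unverified |
| 4 | DIN | Top-1 Accuracy | 34.11 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Florence | Top-1 Accuracy | 86.5 | | Unverified |
| 2 | ActionCLIP (ViT-B/16) | Top-1 Accuracy | 83.8 | | Unverified |
| 3 | Frozen Backbone, SwinV2-G-ext22K (Video-Swin) | Top-1 Accuracy | 81.7 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | YOWO+LFB* | mAP (Val) | 20.2 | | Unverified |
| 2 | VideoMAE V2 | mAP (Val) | 18.24 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | ITANet | Top-1 Accuracy (5-Way-1-Shot) | 49.2 | | Unverified |
| 2 | OTAM[3]++ | Top-1 Accuracy (5-Way-1-Shot) | 42.8 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | ITANet | Top-1 Accuracy (5-Way-1-Shot) | 39.8 | | Unverified |
| 2 | CMN[35] | Top-1 Accuracy (5-Way-1-Shot) | 36.2 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | G-Blend | Video hit@1 | 74.8 | | Unverified |
| 2 | LSTM + Pretrained on YT-8M | Video hit@1 | 65.7 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Single-stream R-C3D (two-way buffer) | mAP@0.1 | 54.5 | | Unverified |
| 2 | Single-stream R-C3D (one-way buffer) | mAP@0.1 | 51.6 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | LSTM + Pretrained on YT-8M | mAP | 75.6 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | YOWO+LFB* | mAP (Val) | 19.2 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | STM (ImageNet+Kinetics pretrain) | Average accuracy of 3 splits | 72.2 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Florence | Top-1 Accuracy | 87.8 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | G-Blend | Clip Hit@1 | 49.7 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | 2D-3D-Softargmax (RGB only) | Accuracy (CS) | 85.5 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | STM (16 frames, ImageNet pretraining) | Top 1 Accuracy | 50.7 | | Unverified |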
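
This page does not define how its metrics are computed, so the following is only a generic sketch of two that appear in the results: top-1 accuracy, and an average over several train/test splits. The function names and the toy labels are assumptions for illustration and are not taken from any benchmark's evaluation code.

```python
from typing import Sequence


def top1_accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Percentage of clips whose single predicted label matches the true label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)


def mean_over_splits(split_accuracies: Sequence[float]) -> float:
    """Unweighted mean of per-split accuracies (e.g. three splits)."""
    if not split_accuracies:
        raise ValueError("no splits given")
    return sum(split_accuracies) / len(split_accuracies)


# Toy labels (made up for illustration, not benchmark data).
preds = ["running", "jumping", "swimming", "running"]
truth = ["running", "jumping", "running", "running"]
print(top1_accuracy(preds, truth))            # prints 75.0
print(mean_over_splits([75.0, 50.0, 100.0]))  # prints 75.0
```

Claimed numbers from different papers may rely on different frame sampling, crops, or pretraining data, which is one reason each entry carries a separate verification status.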