SOTAVerified

Action Recognition In Videos

Action Recognition in Videos is a task in computer vision and pattern recognition where the goal is to identify and categorize human actions performed in a video sequence. The task involves analyzing the spatiotemporal dynamics of the actions and mapping them to a predefined set of action classes, such as running, jumping, or swimming.
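As a rough illustration of the mapping described above, the sketch below scores a clip against a fixed label set and returns the best-scoring class. Everything in it is hypothetical: the `classify_clip` and `temporal_average` helpers, the per-frame feature vectors, and the hand-set weights stand in for a trained spatiotemporal model and are not taken from any paper listed on this page. Temporal average pooling is used only because it is the simplest way to aggregate frames; it discards frame order, which real models generally try to exploit.

```python
from typing import Dict, List, Sequence

# Label set borrowed from the examples in the task description.
ACTION_CLASSES = ["running", "jumping", "swimming"]


def temporal_average(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average per-frame feature vectors over time (simplest temporal pooling)."""
    if not frames:
        raise ValueError("clip has no frames")
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


def classify_clip(
    frames: Sequence[Sequence[float]],
    weights: Dict[str, Sequence[float]],
) -> str:
    """Pool the clip over time, score each class linearly, return the top class."""
    pooled = temporal_average(frames)
    scores = {
        label: sum(w * x for w, x in zip(class_weights, pooled))
        for label, class_weights in weights.items()
    }
    return max(scores, key=scores.get)


# Hypothetical weights: one feature dimension per class, purely for illustration.
TOY_WEIGHTS = {
    "running": [1.0, 0.0, 0.0],
    "jumping": [0.0, 1.0, 0.0],
    "swimming": [0.0, 0.0, 1.0],
}

# A toy clip of three frames, each already reduced to a 3-dimensional feature.
clip = [[0.1, 0.9, 0.0], [0.2, 0.7, 0.1], [0.0, 0.8, 0.2]]
predicted = classify_clip(clip, TOY_WEIGHTS)
```

In practice the per-frame features and the class weights would come from a learned network; only the final step, picking the highest-scoring label from a predefined set, carries over unchanged.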

Papers

Showing 1-50 of 124 papers

Title | Status | Hype
A new face swap method for image and video domains: a technical report | Code | 3
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking | Code | 2
Learning Spatiotemporal Features with 3D Convolutional Networks | Code | 2
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition | Code | 2
Temporal Segment Networks for Action Recognition in Videos | Code | 2
MMNet: A Model-Based Multimodal Network for Human Action Recognition in RGB-D Videos | Code | 1
IntegralAction: Pose-driven Feature Integration for Robust Human Action Recognition in Videos | Code | 1
Self-supervised Video Transformer | Code | 1
Unsupervised Learning of Video Representations via Dense Trajectory Clustering | Code | 1
TDN: Temporal Difference Networks for Efficient Action Recognition | Code | 1
Learning Implicit Temporal Alignment for Few-shot Video Classification | Code | 1
Tensor Representations for Action Recognition | Code | 1
DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition | Code | 1
Florence: A New Foundation Model for Computer Vision | Code | 1
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text | Code | 1
Simba: Mamba augmented U-ShiftGCN for Skeletal Action Recognition in Videos | Code | 1
Self-supervised Video Representation Learning with Cross-Stream Prototypical Contrasting | Code | 1
SlowFast Networks for Video Recognition | Code | 1
Actor-agnostic Multi-label Action Recognition with Multi-modal Query | Code | 1
Spatiotemporal Residual Networks for Video Action Recognition | Code | 1
TEA: Temporal Excitation and Aggregation for Action Recognition | Code | 1
Towards Good Practices for Very Deep Two-Stream ConvNets | Code | 1
Region-based Non-local Operation for Video Classification | Code | 1
YouTube-8M: A Large-Scale Video Classification Benchmark | Code | 1
ActionCLIP: A New Paradigm for Video Action Recognition | Code | 1
CAST: Cross-Attention in Space and Time for Video Action Recognition | Code | 1
Multimodal Fusion via Teacher-Student Network for Indoor Action Recognition | Code | 1
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning | Code | 1
Skeleton-based Action Recognition via Spatial and Temporal Transformer Networks | Code | 1
Logsig-RNN: a novel network for robust and efficient skeleton-based action recognition | Code | 1
Space-time Mixing Attention for Video Transformer | Code | 1
Busy-Quiet Video Disentangling for Video Classification | Code | 1
Dual-path Adaptation from Image to Video Transformers | Code | 1
A Dense-Sparse Complementary Network for Human Action Recognition based on RGB and Skeleton Modalities | Code | 1
Multi-Temporal Convolutions for Human Action Recognition in Videos | Code | 1
Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework | Code | 1
EPAM-Net: An Efficient Pose-driven Attention-guided Multimodal Network for Video Action Recognition | Code | 1
Discriminative Video Representation Learning Using Support Vector Classifiers | - | 0
Discriminative convolutional Fisher vector network for action recognition | - | 0
AdaScan: Adaptive Scan Pooling in Deep Convolutional Neural Networks for Human Action Recognition in Videos | - | 0
Developing Motion Code Embedding for Action Recognition in Videos | - | 0
Bag of Visual Words and Fusion Methods for Action Recognition: Comprehensive Study and Good Practice | - | 0
Per-Sample Kernel Adaptation for Visual Recognition and Grouping | - | 0
DenseImage Network: Video Spatial-Temporal Evolution Encoding and Understanding | - | 0
Deep Learning Approaches for Human Action Recognition in Video Data | - | 0
Deep Image-to-Video Adaptation and Fusion Networks for Action Recognition | - | 0
Coupled Recurrent Network (CRN) | - | 0
Learning Compact Recurrent Neural Networks with Block-Term Tensor Decomposition | - | 0
An Information-rich Sampling Technique over Spatio-Temporal CNN for Classification of Human Actions in Videos | - | 0
NAS-TC: Neural Architecture Search on Temporal Convolutions for Complex Action Recognition | - | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | CPNet Res34, 5 CP | Val | 96.7 | - | Unverified
2 | STM (Resnet-50, 16 frames) | Val | 96.7 | - | Unverified
3 | MFNet | Val | 96.68 | - | Unverified
4 | DIN | Val | 95.31 | - | Unverified
5 | MultiScale TRN | Val | 95.31 | - | Unverified
6 | convSTAR | Val | 92.7 | - | Unverified
7 | 3D-SqueezeNet | Val | 90.77 | - | Unverified
8 | 3D-ShuffleNetV2 0.25x | Val | 86.91 | - | Unverified
9 | 3D-MobileNetV2 0.2x | Val | 86.43 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | DSCNet (RGB + Pose) | X-Sub | 97.4 | - | Unverified
2 | MMNet | X-Sub | 97.4 | - | Unverified
3 | EPAM-Net | X-Sub | 96.2 | - | Unverified
4 | DVANet (RGB only) | X-Sub | 95.8 | - | Unverified
5 | TSMF | X-Sub | 95.8 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | STM (ImageNet+Kinetics pretrain) | 3-fold Accuracy | 96.2 | - | Unverified
2 | 3D-SqueezeNet | 3-fold Accuracy | 74.94 | - | Unverified
3 | 3D-ShuffleNetV2 0.25x | 3-fold Accuracy | 56.52 | - | Unverified
4 | 3D-MobileNetV2 0.2x | 3-fold Accuracy | 55.56 | - | Unverified
5 | Baseline UCF101 | 3-fold Accuracy | 43.9 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | STM (16 frames, ImageNet pretraining) | Top-1 Accuracy | 64.2 | - | Unverified
2 | CPNet Res34, 5 CP | Top-1 Accuracy | 57.65 | - | Unverified
3 | 2-Stream TRN | Top-1 Accuracy | 55.52 | - | Unverified
4 | DIN | Top-1 Accuracy | 34.11 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Florence | Top-1 Accuracy | 86.5 | - | Unverified
2 | ActionCLIP (ViT-B/16) | Top-1 Accuracy | 83.8 | - | Unverified
3 | Frozen Backbone, SwinV2-G-ext22K (Video-Swin) | Top-1 Accuracy | 81.7 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | YOWO+LFB* | mAP (Val) | 20.2 | - | Unverified
2 | VideoMAE V2 | mAP (Val) | 18.24 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | ITANet | Top-1 Accuracy (5-Way-1-Shot) | 49.2 | - | Unverified
2 | OTAM[3]++ | Top-1 Accuracy (5-Way-1-Shot) | 42.8 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | ITANet | Top-1 Accuracy (5-Way-1-Shot) | 39.8 | - | Unverified
2 | CMN[35] | Top-1 Accuracy (5-Way-1-Shot) | 36.2 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | G-Blend | Video hit@1 | 74.8 | - | Unverified
2 | LSTM + Pretrained on YT-8M | Video hit@1 | 65.7 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Single-stream R-C3D (two-way buffer) | mAP@0.1 | 54.5 | - | Unverified
2 | Single-stream R-C3D (one-way buffer) | mAP@0.1 | 51.6 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | LSTM + Pretrained on YT-8M | mAP | 75.6 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | YOWO+LFB* | mAP (Val) | 19.2 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | STM (ImageNet+Kinetics pretrain) | Average accuracy of 3 splits | 72.2 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Florence | Top-1 Accuracy | 87.8 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | G-Blend | Clip Hit@1 | 49.7 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | 2D-3D-Softargmax (RGB only) | Accuracy (CS) | 85.5 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | STM (16 frames, ImageNet pretraining) | Top-1 Accuracy | 50.7 | - | Unverified
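
Several of the tables above report Top-1 Accuracy, and two report an accuracy averaged over three folds or splits. The sketch below shows how such numbers are commonly computed, assuming per-clip class scores and integer labels. The function names `top1_accuracy` and `mean_over_splits` are hypothetical, and the exact evaluation protocol behind each claimed result (clip sampling, multi-crop testing, and so on) is not stated on this page, so this should be read as a generic definition and not as the procedure used for any entry.

```python
from typing import Sequence


def top1_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Percentage of clips whose highest-scoring class index equals the label."""
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and the same length")
    correct = 0
    for row, label in zip(scores, labels):
        # Index of the largest score in this clip's row.
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return 100.0 * correct / len(labels)


def mean_over_splits(per_split_accuracy: Sequence[float]) -> float:
    """Plain average of per-split accuracies (an assumed reading of '3-fold Accuracy')."""
    if not per_split_accuracy:
        raise ValueError("need at least one split")
    return sum(per_split_accuracy) / len(per_split_accuracy)


# Toy data: four clips, two classes; these are not real benchmark outputs.
toy_scores = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
toy_labels = [1, 0, 0, 0]
toy_accuracy = top1_accuracy(toy_scores, toy_labels)
```

Here the third clip is misclassified, so three of the four toy predictions match their labels.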