SOTAVerified

Action Detection

Action Detection aims to find both where and when an action occurs within a video clip and to classify which action is taking place. Results are typically given in the form of action tubelets, which are action bounding boxes linked across time in the video. The task is related to temporal localization, which seeks to identify the start and end frames of an action, and to action recognition, which seeks only to classify which action is taking place and typically assumes a trimmed video.
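The tubelet representation described above can be sketched as a greedy linking step that chains per-frame detections into tubes by spatial overlap. This is a minimal illustration, not the method of any paper listed on this page. The box format `(x1, y1, x2, y2)`, the function names `box_iou` and `link_tubelets`, and the default overlap threshold of 0.5 are all assumptions.

```python
from typing import List, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_tubelets(
    frames: List[List[Box]], iou_thresh: float = 0.5
) -> List[List[Tuple[int, Box]]]:
    """Hypothetical greedy linker: returns tubelets as lists of (frame_index, box).

    Each active tube is extended with the unclaimed box in the next frame that
    has the highest IoU with the tube's last box, provided that IoU is at least
    iou_thresh; otherwise the tube ends. Boxes left unclaimed start new tubes.
    """
    finished: List[List[Tuple[int, Box]]] = []
    active: List[List[Tuple[int, Box]]] = []
    for t, boxes in enumerate(frames):
        unclaimed = list(range(len(boxes)))
        still_active = []
        for tube in active:
            last_box = tube[-1][1]
            best, best_iou = None, iou_thresh
            for i in unclaimed:
                iou = box_iou(last_box, boxes[i])
                if iou >= best_iou:
                    best, best_iou = i, iou
            if best is None:
                finished.append(tube)  # no match in this frame: close the tube
            else:
                tube.append((t, boxes[best]))
                unclaimed.remove(best)
                still_active.append(tube)
        for i in unclaimed:
            still_active.append([(t, boxes[i])])  # start a new tube
        active = still_active
    return finished + active


if __name__ == "__main__":
    frames = [
        [(0.0, 0.0, 10.0, 10.0)],
        [(1.0, 1.0, 11.0, 11.0), (50.0, 50.0, 60.0, 60.0)],
        [(2.0, 2.0, 12.0, 12.0)],
    ]
    for tube in link_tubelets(frames):
        print([t for t, _ in tube])
```

With three frames where one box drifts slightly and a second, distant box appears only in the middle frame, this yields one three-frame tube and one single-frame tube. Greedy matching is used here only for brevity; published pipelines may instead use score-aware linking, such as dynamic programming over detection scores.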

Papers

Showing 151-200 of 817 papers

Title | Status | Hype
Multi-Input Multi-Output Target-Speaker Voice Activity Detection For Unified, Flexible, and Robust Audio-Visual Speaker Diarization | - | 0
Single-Microphone Speaker Separation and Voice Activity Detection in Noisy and Reverberant Environments | - | 0
Glance and Focus: Memory Prompting for Multi-Event Video Question Answering | Code | 1
Low-power Continuous Remote Behavioral Localization with Event Cameras | - | 0
Self-supervised Pretraining for Robust Personalized Voice Activity Detection in Adverse Conditions | - | 0
SADA: Semantic adversarial unsupervised domain adaptation for Temporal Action Localization | Code | 0
Generative Model-based Feature Knowledge Distillation for Action Recognition | Code | 1
Advanced Image Segmentation Techniques for Neural Activity Detection via C-fos Immediate Early Gene Expression | - | 0
Semi-supervised Active Learning for Video Action Detection | Code | 0
Spatiotemporal Event Graphs for Dynamic Scene Understanding | - | 0
Low-power, Continuous Remote Behavioral Localization with Event Cameras | - | 0
Towards More Practical Group Activity Detection: A New Benchmark and Model | - | 0
Adapting Short-Term Transformers for Action Detection in Untrimmed Videos | Code | 1
SPIRE-SIES: A Spontaneous Indian English Speech Corpus | - | 0
End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames | Code | 2
Centre Stage: Centricity-based Audio-Visual Temporal Action Detection | Code | 0
ADM-Loc: Actionness Distribution Modeling for Point-supervised Temporal Action Localization | - | 0
Introducing SSBD+ Dataset with a Convolutional Pipeline for detecting Self-Stimulatory Behaviours in Children using raw videos | - | 0
Combatting Human Trafficking in the Cyberspace: A Natural Language Processing-Based Methodology to Analyze the Language in Online Advertisements | - | 0
ZEETAD: Adapting Pretrained Vision-Language Model for Zero-Shot End-to-End Temporal Action Detection | - | 0
A Hybrid Graph Network for Complex Activity Detection in Video | - | 0
ChimpACT: A Longitudinal Dataset for Understanding Chimpanzee Behaviors | Code | 1
Prompt-driven Target Speech Diarization | - | 0
Device Detection and Channel Estimation in MTC with Correlated Activity Pattern | - | 0
POTLoc: Pseudo-Label Oriented Transformer for Point-Supervised Temporal Action Localization | - | 0
Enhancing Illicit Activity Detection using XAI: A Multimodal Graph-LLM Framework | - | 0
Property-Aware Multi-Speaker Data Simulation: A Probabilistic Modelling Technique for Synthetic Data Generation | - | 0
Hierarchical MTC User Activity Detection and Channel Estimation with Unknown Spatial Covariance | - | 0
End-to-end Online Speaker Diarization with Target Speaker Tracking | - | 0
VSANet: Real-time Speech Enhancement Based on Voice Activity Detection and Causal Spatial Attention | - | 0
Boundary Discretization and Reliable Classification Network for Temporal Action Detection | Code | 0
ACT-Net: Anchor-context Action Detection in Surgery Videos | - | 0
A Grammatical Compositional Model for Video Action Detection | - | 0
PP-MeT: a Real-world Personalized Prompt based Meeting Transcription System | - | 0
M³3D: Learning 3D priors using Multi-Modal Masked Autoencoders for 2D image and video understanding | - | 0
ENIGMA-51: Towards a Fine-Grained Understanding of Human-Object Interactions in Industrial Scenarios | Code | 0
The Impact of Silence on Speech Anti-Spoofing | - | 0
SkeleTR: Towards Skeleton-based Action Recognition in the Wild | - | 0
JOADAA: joint online action detection and action anticipation | - | 0
Effective Abnormal Activity Detection on Multivariate Time Series Healthcare Data | - | 0
Temporal Action Localization with Enhanced Instant Discriminability | Code | 2
In-Ear-Voice: Towards Milli-Watt Audio Enhancement With Bone-Conduction Microphones for In-Ear Sensing Platforms | - | 0
COMEDIAN: Self-Supervised Learning and Knowledge Distillation for Action Spotting using Transformers | Code | 1
Self-Feedback DETR for Temporal Action Detection | - | 0
Progression-Guided Temporal Action Detection in Videos | Code | 0
The DKU-MSXF Diarization System for the VoxCeleb Speaker Recognition Challenge 2023 | - | 0
Memory-and-Anticipation Transformer for Online Action Understanding | Code | 1
Integrating Emotion Recognition with Speech Recognition and Speaker Diarisation for Conversations | Code | 0
PAT: Position-Aware Transformer for Dense Multi-Label Action Detection | - | 0
A Survey on Deep Learning-based Spatio-temporal Action Detection | - | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | STAR/L | Frame-mAP 0.5 | 90.3 | - | Unverified
2 | SiA | Frame-mAP 0.5 | 88.5 | - | Unverified
3 | YOWO + LFB | Frame-mAP 0.5 | 87.3 | - | Unverified
4 | HIT | Frame-mAP 0.5 | 84.8 | - | Unverified
5 | HISAN (ResNet-101 + FPN) | Video-mAP 0.2 | 82.3 | - | Unverified
6 | YOWO | Frame-mAP 0.5 | 80.4 | - | Unverified
7 | Two-in-one Two Stream | Video-mAP 0.2 | 78.48 | - | Unverified
8 | MOC | Frame-mAP 0.5 | 77.8 | - | Unverified
9 | Faster-RCNN + two-stream I3D conv | Frame-mAP 0.5 | 76.3 | - | Unverified
10 | Two-in-one | Video-mAP 0.2 | 75.48 | - | Unverified
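The Metric column mixes frame-level and video-level scores. Presumably, as is conventional, the number after the metric name is the IoU threshold at which a detection counts as correct: Frame-mAP 0.5 scores individual per-frame boxes whose IoU with a ground-truth box is at least 0.5, while Video-mAP 0.2 scores whole tubes whose spatio-temporal IoU with a ground-truth tube is at least 0.2. The sketch below assumes a common definition of tube IoU (temporal IoU of the two frame spans multiplied by the mean per-frame box IoU over shared frames); the dict-based tube representation and the names `temporal_iou` and `tube_iou` are hypothetical, not taken from any listed paper. The same temporal IoU is the overlap measure typically used when evaluating temporal localization.

```python
from typing import Dict, Tuple

# Assumed box format: (x1, y1, x2, y2); a tube maps frame index -> box.
Box = Tuple[float, float, float, float]
Tube = Dict[int, Box]


def box_iou(a: Box, b: Box) -> float:
    """Spatial intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def temporal_iou(a: Tube, b: Tube) -> float:
    """IoU of the tubes' inclusive [first, last] frame intervals (tubes must be non-empty)."""
    a0, a1 = min(a), max(a)
    b0, b1 = min(b), max(b)
    inter = max(0, min(a1, b1) - max(a0, b0) + 1)
    union = (a1 - a0 + 1) + (b1 - b0 + 1) - inter
    return inter / union


def tube_iou(a: Tube, b: Tube) -> float:
    """Temporal IoU times the mean spatial IoU over frames present in both tubes."""
    shared = sorted(set(a) & set(b))
    if not shared:
        return 0.0
    mean_spatial = sum(box_iou(a[t], b[t]) for t in shared) / len(shared)
    return temporal_iou(a, b) * mean_spatial


if __name__ == "__main__":
    gt = {t: (0.0, 0.0, 10.0, 10.0) for t in range(10)}       # frames 0..9
    pred = {t: (0.0, 0.0, 10.0, 10.0) for t in range(5, 15)}  # frames 5..14
    score = tube_iou(gt, pred)
    print(round(score, 3), score >= 0.2, score >= 0.5)
```

For example, two tubes with identical boxes where one covers frames 0 to 9 and the other frames 5 to 14 have a temporal IoU of 5/15, so their tube IoU is about 0.33: a match at a Video-mAP 0.2 threshold but not at 0.5.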

# | Model | Metric | Claimed | Verified | Status
1 | SiA | Frame-mAP 0.5 | 88.5 | - | Unverified
2 | HISAN (ResNet-101 + FPN) | Video-mAP 0.2 | 87.59 | - | Unverified
3 | HIT | Frame-mAP 0.5 | 83.8 | - | Unverified
4 | HISAN (VGG-16) | Frame-mAP 0.5 | 76.72 | - | Unverified
5 | DTS | Video-mAP 0.2 | 76.1 | - | Unverified
6 | YOWO + LFB | Frame-mAP 0.5 | 75.7 | - | Unverified
7 | Two-in-one Two Stream | Video-mAP 0.5 | 74.74 | - | Unverified
8 | YOWO | Frame-mAP 0.5 | 74.4 | - | Unverified
9 | MOC | Frame-mAP 0.5 | 74 | - | Unverified
10 | Faster-RCNN + two-stream I3D conv | Frame-mAP 0.5 | 73.3 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | TTM | mAP | 28.79 | - | Unverified
2 | CTRN | mAP | 27.8 | - | Unverified
3 | Coarse-Fine Networks (w/ self-supervised detection pretraining) | mAP | 26.95 | - | Unverified
4 | UniMD+Sync. (RGB+Flow) | mAP | 26.53 | - | Unverified
5 | PDAN (RGB+Flow) | mAP | 26.5 | - | Unverified
6 | PAT | mAP | 26.5 | - | Unverified
7 | MS-TCT (RGB only) | mAP | 25.4 | - | Unverified
8 | 3D ResNet-50 + super-events pretrained on AViD | mAP | 25.2 | - | Unverified
9 | Coarse-Fine Networks | mAP | 25.1 | - | Unverified
10 | MLAD (RGB + Flow) | mAP | 23.7 | - | Unverified
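
# | Model | Metric | Claimed | Verified | Status
1 | MLAD | mAP | 51.5 | - | Unverified
2 | CTRN | mAP | 51.2 | - | Unverified
3 | PDAN | mAP | 47.6 | - | Unverified
4 | TGM | mAP | 46.4 | - | Unverified
5 | MS-TCT (RGB only) | mAP | 43.1 | - | Unverified
6 | I3D + our super-event | mAP | 36.4 | - | Unverified
7 | Two-stream + LSTM | mAP | 28.1 | - | Unverified
8 | Two-stream | mAP | 27.6 | - | Unverified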

# | Model | Metric | Claimed | Verified | Status
1 | Two-in-one Two Stream | Video-mAP 0.5 | 96.52 | - | Unverified
2 | DTS | Video-mAP 0.2 | 94.3 | - | Unverified
3 | Two-in-one | Video-mAP 0.5 | 92.74 | - | Unverified
4 | T-CNN | Frame-mAP 0.5 | 86.7 | - | Unverified
5 | MR-TS R-CNN | Frame-mAP 0.5 | 84.52 | - | Unverified
6 | TS R-CNN | Frame-mAP 0.5 | 82.3 | - | Unverified
7 | Action Tubes | Frame-mAP 0.5 | 68.1 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | MAT (Ours) Trans | mAP | 71.6 | - | Unverified
2 | TadML-two stream | mAP | 59.7 | - | Unverified
3 | MAT (ours) | mAP | 58.2 | - | Unverified
4 | TadML-rgb | mAP | 53.46 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | HIT | Frame-mAP 0.5 | 33.3 | - | Unverified
2 | SiA | Frame-mAP 0.5 | 28.8 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | MS-TCT | Frame-mAP | 33.7 | - | Unverified
2 | PDAN | Frame-mAP | 32.7 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | STCNN | IoU | 0.14 | - | Unverified
2 | Two Stream Network | IoU | 0.07 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | STCNN-V2 (Vote decision) | IoU | 0.52 | - | Unverified
2 | RGB and PRGB | IoU | 0.35 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | PAT | mAP | 44.6 | - | Unverified