SOTAVerified

Action Detection

Action Detection aims to find both where and when an action occurs within a video clip and to classify which action is taking place. Results are typically given as action tubelets, which are action bounding boxes linked across time in the video. This is related to temporal localization, which seeks to identify the start and end frames of an action, and to action recognition, which seeks only to classify which action is taking place and typically assumes a trimmed video.
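As a rough illustration of the "where and when" in this definition, the sketch below represents a tubelet as a mapping from frame index to bounding box and scores the overlap between a predicted and a ground-truth tubelet. The names `Box`, `Tubelet`, `box_iou`, `temporal_iou`, and `tubelet_iou` are hypothetical, and the overlap definition (temporal IoU multiplied by the mean spatial IoU over shared frames) is one common convention assumed here, not necessarily the one used by any benchmark on this page.

```python
from typing import Dict, Tuple

# Assumed representations (hypothetical, for illustration only):
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
Tubelet = Dict[int, Box]                 # frame index -> box in that frame


def box_iou(a: Box, b: Box) -> float:
    """Spatial intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def temporal_iou(a: Tubelet, b: Tubelet) -> float:
    """Overlap of the two frame sets: the 'when' part of the comparison."""
    frames_a, frames_b = set(a), set(b)
    union = frames_a | frames_b
    return len(frames_a & frames_b) / len(union) if union else 0.0


def tubelet_iou(a: Tubelet, b: Tubelet) -> float:
    """Assumed convention: temporal IoU times mean spatial IoU on shared frames."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    mean_spatial = sum(box_iou(a[f], b[f]) for f in shared) / len(shared)
    return temporal_iou(a, b) * mean_spatial


# A prediction covering frames 0-3 and a ground truth covering frames 2-5,
# with identical boxes wherever both are present.
pred = {f: (0.0, 0.0, 10.0, 10.0) for f in range(0, 4)}
gt = {f: (0.0, 0.0, 10.0, 10.0) for f in range(2, 6)}
overlap = tubelet_iou(pred, gt)  # 2 shared frames out of 6 in total
```

Under this convention, a detection would count as correct when its class matches and the overlap exceeds a chosen threshold. Temporal localization alone corresponds to the `temporal_iou` part, with no boxes involved.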

Papers

Showing 1-50 of 817 papers

Title | Status | Hype
Moshi: a speech-text foundation model for real-time dialogue | Code | 9
OpenTAD: A Unified Framework and Comprehensive Study of Temporal Action Detection | Code | 3
Harnessing Temporal Causality for Advanced Temporal Action Detection | Code | 3
Efficient Video Action Detection with Token Dropout and Context Refinement | Code | 3
pyannote.audio: neural building blocks for speaker diarization | Code | 3
YOWOv3: An Efficient and Generalized Framework for Human Action Detection and Recognition | Code | 2
TIM: A Time Interval Machine for Audio-Visual Action Recognition | Code | 2
UniMD: Towards Unifying Moment Retrieval and Temporal Action Detection | Code | 2
End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames | Code | 2
Temporal Action Localization with Enhanced Instant Discriminability | Code | 2
Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation | Code | 2
TriDet: Temporal Action Detection with Relative Boundary Modeling | Code | 2
YOWOv2: A Stronger yet Efficient Multi-level Detection Framework for Real-time Spatio-temporal Action Detection | Code | 2
Structured Attention Composition for Temporal Action Localization | Code | 2
Colar: Effective and Efficient Online Action Detection by Consulting Exemplars | Code | 2
audino: A Modern Annotation Tool for Audio and Speech | Code | 2
Temporal Action Detection with Structured Segment Networks | Code | 2
Speaker Diarization with Overlapping Community Detection Using Graph Attention Networks and Label Propagation Algorithm | Code | 1
DiGIT: Multi-Dilated Gated Encoder and Central-Adjacent Region Integrated Decoder for Temporal Action Detection Transformer | Code | 1
Context-Enhanced Memory-Refined Transformer for Online Action Detection | Code | 1
VANPY: Voice Analysis Framework | Code | 1
Preventing Rogue Agents Improves Multi-Agent Collaboration | Code | 1
Training-Free Zero-Shot Temporal Action Detection with Vision-Language Models | Code | 1
MS-Temba : Multi-Scale Temporal Mamba for Efficient Temporal Action Detection | Code | 1
WiFi CSI Based Temporal Activity Detection via Dual Pyramid Network | Code | 1
USDRL: Unified Skeleton-Based Dense Representation Learning with Multi-Grained Feature Decorrelation | Code | 1
Exploiting VLM Localizability and Semantics for Open Vocabulary Action Detection | Code | 1
Towards Student Actions in Classroom Scenes: New Dataset and Baseline | Code | 1
ActionSwitch: Class-agnostic Detection of Simultaneous Actions in Streaming Videos | Code | 1
MMAD: Multi-label Micro-Action Detection in Videos | Code | 1
DyFADet: Dynamic Feature Aggregation for Temporal Action Detection | Code | 1
InaGVAD : a Challenging French TV and Radio Corpus Annotated for Speech Activity Detection and Speaker Gender Segmentation | Code | 1
No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding | Code | 1
TE-TAD: Towards Full End-to-End Temporal Action Detection via Time-Aligned Coordinate Expression | Code | 1
Benchmarking the Robustness of Temporal Action Detection Models Against Temporal Corruptions | Code | 1
Online speaker diarization of meetings guided by speech separation | Code | 1
Glance and Focus: Memory Prompting for Multi-Event Video Question Answering | Code | 1
Generative Model-based Feature Knowledge Distillation for Action Recognition | Code | 1
Adapting Short-Term Transformers for Action Detection in Untrimmed Videos | Code | 1
ChimpACT: A Longitudinal Dataset for Understanding Chimpanzee Behaviors | Code | 1
COMEDIAN: Self-Supervised Learning and Knowledge Distillation for Action Spotting using Transformers | Code | 1
Memory-and-Anticipation Transformer for Online Action Understanding | Code | 1
ivrit.ai: A Comprehensive Dataset of Hebrew Speech for AI Research and Development | Code | 1
Multi-Granularity Hand Action Detection | Code | 1
E2E-LOAD: End-to-End Long-form Online Action Detection | Code | 1
WEAR: An Outdoor Sports Dataset for Wearable and Egocentric Activity Recognition | Code | 1
Interaction-Aware Prompting for Zero-Shot Spatio-Temporal Action Detection | Code | 1
DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion | Code | 1
TS-SEP: Joint Diarization and Separation Conditioned on Estimated Speaker Embeddings | Code | 1
MiniROAD: Minimal RNN Framework for Online Action Detection | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | STAR/L | Frame-mAP 0.5 | 90.3 | | Unverified
2 | SiA | Frame-mAP 0.5 | 88.5 | | Unverified
3 | YOWO + LFB | Frame-mAP 0.5 | 87.3 | | Unverified
4 | HIT | Frame-mAP 0.5 | 84.8 | | Unverified
5 | HISAN (ResNet-101 + FPN) | Video-mAP 0.2 | 82.3 | | Unverified
6 | YOWO | Frame-mAP 0.5 | 80.4 | | Unverified
7 | Two-in-one Two Stream | Video-mAP 0.2 | 78.48 | | Unverified
8 | MOC | Frame-mAP 0.5 | 77.8 | | Unverified
9 | Faster-RCNN + two-stream I3D conv | Frame-mAP 0.5 | 76.3 | | Unverified
10 | Two-in-one | Video-mAP 0.2 | 75.48 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | SiA | Frame-mAP 0.5 | 88.5 | | Unverified
2 | HISAN (ResNet-101 + FPN) | Video-mAP 0.2 | 87.59 | | Unverified
3 | HIT | Frame-mAP 0.5 | 83.8 | | Unverified
4 | HISAN (VGG-16) | Frame-mAP 0.5 | 76.72 | | Unverified
5 | DTS | Video-mAP 0.2 | 76.1 | | Unverified
6 | YOWO + LFB | Frame-mAP 0.5 | 75.7 | | Unverified
7 | Two-in-one Two Stream | Video-mAP 0.5 | 74.74 | | Unverified
8 | YOWO | Frame-mAP 0.5 | 74.4 | | Unverified
9 | MOC | Frame-mAP 0.5 | 74 | | Unverified
10 | Faster-RCNN + two-stream I3D conv | Frame-mAP 0.5 | 73.3 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | TTM | mAP | 28.79 | | Unverified
2 | CTRN | mAP | 27.8 | | Unverified
3 | Coarse-Fine Networks (w/ self-supervised detection pretraining) | mAP | 26.95 | | Unverified
4 | UniMD+Sync. (RGB+Flow) | mAP | 26.53 | | Unverified
5 | PDAN (RGB+Flow) | mAP | 26.5 | | Unverified
6 | PAT | mAP | 26.5 | | Unverified
7 | MS-TCT (RGB only) | mAP | 25.4 | | Unverified
8 | 3D ResNet-50 + super-events pretrained on AViD | mAP | 25.2 | | Unverified
9 | Coarse-Fine Networks | mAP | 25.1 | | Unverified
10 | MLAD (RGB + Flow) | mAP | 23.7 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | MLAD | mAP | 51.5 | | Unverified
2 | CTRN | mAP | 51.2 | | Unverified
3 | PDAN | mAP | 47.6 | | Unverified
4 | TGM | mAP | 46.4 | | Unverified
5 | MS-TCT (RGB only) | mAP | 43.1 | | Unverified
6 | I3D + our super-event | mAP | 36.4 | | Unverified
7 | Two-stream + LSTM | mAP | 28.1 | | Unverified
8 | Two-stream | mAP | 27.6 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Two-in-one Two Stream | Video-mAP 0.5 | 96.52 | | Unverified
2 | DTS | Video-mAP 0.2 | 94.3 | | Unverified
3 | Two-in-one | Video-mAP 0.5 | 92.74 | | Unverified
4 | T-CNN | Frame-mAP 0.5 | 86.7 | | Unverified
5 | MR-TS R-CNN | Frame-mAP 0.5 | 84.52 | | Unverified
6 | TS R-CNN | Frame-mAP 0.5 | 82.3 | | Unverified
7 | Action Tubes | Frame-mAP 0.5 | 68.1 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | MAT (Ours) Trans | mAP | 71.6 | | Unverified
2 | TadML-two stream | mAP | 59.7 | | Unverified
3 | MAT (ours) | mAP | 58.2 | | Unverified
4 | TadML-rgb | mAP | 53.46 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | HIT | Frame-mAP 0.5 | 33.3 | | Unverified
2 | SiA | Frame-mAP 0.5 | 28.8 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | MS-TCT | Frame-mAP | 33.7 | | Unverified
2 | PDAN | Frame-mAP | 32.7 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | STCNN | IoU | 0.14 | | Unverified
2 | Two Stream Network | IoU | 0.07 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | STCNN-V2 (Vote decision) | IoU | 0.52 | | Unverified
2 | RGB and PRGB | IoU | 0.35 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | PAT | mAP | 44.6 | | Unverified