Audio Classification
Audio classification is a machine learning task in which audio signals are assigned to predefined classes or categories. The goal is to enable machines to automatically recognize and distinguish between types of audio, such as music, speech, and environmental sounds.
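As a rough illustration of the task, the sketch below separates two kinds of synthetic tone using a single hand-crafted feature (zero-crossing rate) and a nearest-centroid rule. Everything in it is an assumption made for illustration: the synthetic signals, the feature choice, and the helper names `zero_crossing_rate`, `make_tone`, `fit_centroids`, and `classify` are hypothetical. The models in the tables on this page are far larger learned systems, and this toy is not representative of them.

```python
import math
import random


def zero_crossing_rate(signal):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(signal, signal[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(signal) - 1)


def make_tone(freq_hz, rng, sample_rate=8000, duration_s=0.25, noise=0.01):
    """Synthetic sine tone with a little Gaussian noise (illustrative data only)."""
    n = int(sample_rate * duration_s)
    return [
        math.sin(2 * math.pi * freq_hz * t / sample_rate) + rng.gauss(0, noise)
        for t in range(n)
    ]


def fit_centroids(examples):
    """examples: list of (signal, label). Returns {label: mean feature value}."""
    sums, counts = {}, {}
    for signal, label in examples:
        feature = zero_crossing_rate(signal)
        sums[label] = sums.get(label, 0.0) + feature
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def classify(signal, centroids):
    """Assign the label whose centroid is closest in feature space."""
    feature = zero_crossing_rate(signal)
    return min(centroids, key=lambda label: abs(centroids[label] - feature))


rng = random.Random(42)  # fixed seed so the sketch is reproducible
train = [(make_tone(rng.uniform(150, 300), rng), "low") for _ in range(5)]
train += [(make_tone(rng.uniform(1500, 2500), rng), "high") for _ in range(5)]
centroids = fit_centroids(train)

prediction_low = classify(make_tone(200, rng), centroids)
prediction_high = classify(make_tone(1800, rng), centroids)
print(prediction_low, prediction_high)
```

The same pattern (extract features, fit a model on labelled clips, predict a label for a new clip) underlies real systems, which typically replace the single feature with spectrogram inputs and the centroid rule with a neural network.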
Datasets: AudioSet, ESC-50, ICBHI Respiratory Sound Database, VGGSound, SHD, FSD50K, Balanced Audio Set, Speech Commands, SSC, BirdCLEF 2021, DCASE, EPIC-KITCHENS-100
Benchmark Results
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | OmniVec2 | Test mAP | 0.56 | — | Unverified |
| 2 | OmniVec | Test mAP | 0.55 | — | Unverified |
| 3 | EquiAV | Test mAP | 0.55 | — | Unverified |
| 4 | MAViL (Audio-Visual, single) | Test mAP | 0.53 | — | Unverified |
| 5 | Audiovisual Masked Autoencoder (Audiovisual, Single) | Test mAP | 0.52 | — | Unverified |
| 6 | CAV-MAE (Audio-Visual) | Test mAP | 0.51 | — | Unverified |
| 7 | BEATs (Audio-only, Ensemble) | Test mAP | 0.51 | — | Unverified |
| 8 | UAVM (Audio + Video) | Test mAP | 0.5 | — | Unverified |
| 9 | SSLAM (Audio-Only, Single) | Test mAP | 0.5 | — | Unverified |
| 10 | mn40_as (Ensemble) | Test mAP | 0.5 | — | Unverified |
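
Several tables on this page report mean average precision (listed as Test mAP, mAP, or Mean AP), sometimes as a fraction and sometimes as a percentage. The page does not state the exact evaluation protocol behind each number, so the following is only a minimal sketch of the usual multi-label definition: compute average precision per class from ranked scores, then average over classes. The function names and the toy labels and scores are hypothetical.

```python
def average_precision(labels, scores):
    """AP for one class: mean of the precision at each rank where a positive occurs."""
    # Rank clips by descending score; ties keep their input order.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    # Assumption: a class with no positives contributes 0.0;
    # some evaluation toolkits skip such classes instead.
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(label_matrix, score_matrix):
    """Rows are clips, columns are classes; returns the mean of per-class APs."""
    num_classes = len(label_matrix[0])
    per_class = []
    for c in range(num_classes):
        labels = [row[c] for row in label_matrix]
        scores = [row[c] for row in score_matrix]
        per_class.append(average_precision(labels, scores))
    return sum(per_class) / num_classes


# Toy data: 4 clips, 2 classes, multi-label ground truth.
toy_labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
toy_scores = [[0.9, 0.2], [0.8, 0.7], [0.4, 0.6], [0.1, 0.3]]
toy_map = mean_average_precision(toy_labels, toy_scores)
print(round(toy_map, 4))
```

In the toy data, class 0 has an AP of 5/6 (a negative clip is ranked above the second positive) and class 1 has an AP of 1, so the mean is 11/12.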
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | OmniVec2 | Top-1 Accuracy | 99.1 | — | Unverified |
| 2 | InternVideo2 | Top-1 Accuracy | 98.6 | — | Unverified |
| 3 | M2D2 AS+ | Top-1 Accuracy | 98.5 | — | Unverified |
| 4 | OmniVec | Top-1 Accuracy | 98.4 | — | Unverified |
| 5 | BEATs | Top-1 Accuracy | 98.1 | — | Unverified |
| 6 | mn40_as | Top-1 Accuracy | 97.45 | — | Unverified |
| 7 | M2D-CLAP/0.7 | Top-1 Accuracy | 97.4 | — | Unverified |
| 8 | DyMN-L | Top-1 Accuracy | 97.4 | — | Unverified |
| 9 | M2D-AS/0.7 | Top-1 Accuracy | 97.2 | — | Unverified |
| 10 | HTS-AT | Top-1 Accuracy | 97 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | ADD | ICBHI Score | 65.53 | — | Unverified |
| 2 | BEATs (PAFA) | ICBHI Score | 64.84 | — | Unverified |
| 3 | BTS | ICBHI Score | 63.54 | — | Unverified |
| 4 | BEATs (CE) | ICBHI Score | 63.49 | — | Unverified |
| 5 | M2D-X/0.7 (η=0.3) | ICBHI Score | 63.29 | — | Unverified |
| 6 | CycleGuardian | ICBHI Score | 63.26 | — | Unverified |
| 7 | M2D/0.7 (e=0.3) | ICBHI Score | 62.73 | — | Unverified |
| 8 | Audio-CLAP | ICBHI Score | 62.56 | — | Unverified |
| 9 | AST (Patch-Mix CL) | ICBHI Score | 62.37 | — | Unverified |
| 10 | AFT on Mixed-500 | ICBHI Score | 61.79 | — | Unverified |
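
The ICBHI Score in the table above is not defined on this page. As commonly described for the ICBHI respiratory sound benchmark, it is the average of sensitivity (correctly classified abnormal cycles over all abnormal cycles) and specificity (correctly classified normal cycles over all normal cycles). The sketch below assumes that definition; the function name, label strings, and toy predictions are hypothetical.

```python
def icbhi_score(true_labels, predicted_labels, normal_label="normal"):
    """Average of sensitivity and specificity, in percent (assumed definition)."""
    pairs = list(zip(true_labels, predicted_labels))
    normal_total = sum(1 for t, _ in pairs if t == normal_label)
    abnormal_total = len(pairs) - normal_total

    # Specificity: normal cycles predicted as normal.
    normal_correct = sum(1 for t, p in pairs if t == normal_label and p == t)
    # Sensitivity: abnormal cycles predicted as their exact abnormal class.
    abnormal_correct = sum(1 for t, p in pairs if t != normal_label and p == t)

    specificity = normal_correct / normal_total
    sensitivity = abnormal_correct / abnormal_total
    return 100.0 * (sensitivity + specificity) / 2


# Toy example with four classes: normal, crackle, wheeze, both.
toy_true = ["normal", "normal", "normal", "normal",
            "crackle", "wheeze", "both", "crackle"]
toy_pred = ["normal", "normal", "normal", "crackle",
            "crackle", "normal", "both", "wheeze"]
toy_icbhi = icbhi_score(toy_true, toy_pred)
print(toy_icbhi)
```

Here specificity is 3/4 and sensitivity is 2/4, giving a score of 62.5. Because the two terms are averaged, a model cannot score well by predicting "normal" for every cycle.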
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | MBT (AV) | Top 5 Accuracy | 85.6 | — | Unverified |
| 2 | Mirasol3B | Top 1 Accuracy | 69.8 | — | Unverified |
| 3 | CA2ST(B/16) | Top 1 Accuracy | 68.3 | — | Unverified |
| 4 | ONE-PEACE (Audio-Visual) | Top 1 Accuracy | 68.2 | — | Unverified |
| 5 | CAVA(B/16) | Top 1 Accuracy | 68.2 | — | Unverified |
| 6 | EquiAV | Top 1 Accuracy | 67.1 | — | Unverified |
| 7 | MAViL | Top 1 Accuracy | 67.1 | — | Unverified |
| 8 | MMT (Audio-Visual) | Top 1 Accuracy | 66.2 | — | Unverified |
| 9 | CAV-MAE (Audio-Visual) | Top 1 Accuracy | 65.9 | — | Unverified |
| 10 | UAVM (Audio + Video) | Top 1 Accuracy | 65.8 | — | Unverified |
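
The table above mixes Top 5 Accuracy and Top 1 Accuracy entries, which are not directly comparable: a prediction counts as correct under top-k accuracy if the true class is anywhere among the k highest-scoring classes, so top-5 is never lower than top-1 for the same model. A minimal sketch of that definition follows; the function name and the toy scores are hypothetical.

```python
def top_k_accuracy(true_indices, score_rows, k):
    """Percentage of clips whose true class is among the k highest-scoring classes."""
    hits = 0
    for truth, scores in zip(true_indices, score_rows):
        # Class indices sorted by descending score; keep the first k.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(true_indices)


# Toy data: 3 clips, 4 classes, one true class index per clip.
toy_scores = [
    [0.1, 0.6, 0.2, 0.1],
    [0.5, 0.1, 0.3, 0.1],
    [0.3, 0.1, 0.1, 0.5],
]
toy_truth = [1, 2, 2]

toy_top1 = top_k_accuracy(toy_truth, toy_scores, k=1)
toy_top2 = top_k_accuracy(toy_truth, toy_scores, k=2)
print(round(toy_top1, 2), round(toy_top2, 2))
```

In the toy data only the first clip is correct at k=1, while the second clip becomes correct at k=2, so the accuracy rises from one third to two thirds.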
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Event-SSM | Percentage correct | 95.9 | — | Unverified |
| 2 | SNN with Dilated Convolution with Learnable Spacings | Percentage correct | 95.1 | — | Unverified |
| 3 | SNN featuring learnable axonal delays with adaptively delay caps | Percentage correct | 92.45 | — | Unverified |
| 4 | CNN | Percentage correct | 92.4 | — | Unverified |
| 5 | SNN with spatio-temporal filters and attention | Percentage correct | 92.4 | — | Unverified |
| 6 | SNN with temporal-wise attention | Percentage correct | 91.1 | — | Unverified |
| 7 | SNN | Percentage correct | 87 | — | Unverified |
| 8 | Recurrent convolutional SNN | Percentage correct | 83.5 | — | Unverified |
| 9 | Recurrent SNN | Percentage correct | 83.2 | — | Unverified |
| 10 | Sparse Spiking Gradient Descent | Percentage correct | 77.5 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | ONE-PEACE | mAP | 69.7 | — | Unverified |
| 2 | MN | mAP | 65.6 | — | Unverified |
| 3 | PaSST-S | mAP | 65.55 | — | Unverified |
| 4 | DyMN-L | mAP | 65.5 | — | Unverified |
| 5 | PaSST-N-S | mAP | 64.2 | — | Unverified |
| 6 | LHGNN | mAP | 59 | — | Unverified |
| 7 | PSLA | mAP | 56.71 | — | Unverified |
| 8 | MATPAC (SSL Model) | mAP | 55.2 | — | Unverified |
| 9 | Temporal Knowledge Distillation for On-device Audio Classification | mAP | 54.8 | — | Unverified |
| 10 | Large 6-Layer Transformer with Pooling | mAP | 53.7 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | EquiAV | Mean AP | 42.4 | — | Unverified |
| 2 | SSLAM | Mean AP | 40.9 | — | Unverified |
| 3 | EAT | Mean AP | 40.3 | — | Unverified |
| 4 | BEATs | Mean AP | 38.9 | — | Unverified |
| 5 | Base (ours) | Mean AP | 37.4 | — | Unverified |
| 6 | SSAST-PATCH | Mean AP | 31 | — | Unverified |
| 7 | SSAST-FRAME | Mean AP | 29.2 | — | Unverified |
| 8 | Conformer | Mean AP | 27.6 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | PDC | Accuracy | 97.8 | — | Unverified |
| 2 | ASM-RH | Accuracy | 96.51 | — | Unverified |
| 3 | EfficientLEAF | Accuracy | 95.2 | — | Unverified |
| 4 | melspect | Accuracy | 95.1 | — | Unverified |
| 5 | LEAF | Accuracy | 95.1 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Event-SSM | Accuracy | 88.4 | — | Unverified |
| 2 | SNN with Dilated Convolution with Learnable Spacings | Accuracy | 80.69 | — | Unverified |
| 3 | RadLIF | Accuracy | 77.4 | — | Unverified |
| 4 | SpikGRU | Accuracy | 77 | — | Unverified |
| 5 | Adaptive SRNN | Accuracy | 74.2 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | EfficientLEAF (8s) | Accuracy | 72.2 | — | Unverified |
| 2 | EfficientLEAF | Accuracy | 42.9 | — | Unverified |
| 3 | LEAF | Accuracy | 42.3 | — | Unverified |
| 4 | melspect | Accuracy | 39.9 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | CrissCross (AudioSet) | Top-1 Accuracy | 97 | — | Unverified |
| 2 | CrissCross (Kinetics-400) | Top-1 Accuracy | 96 | — | Unverified |
| 3 | XDC | Top-1 Accuracy | 95 | — | Unverified |
| 4 | CrissCross (Kinetics-Sound) | Top-1 Accuracy | 93 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Audiovisual Masked Autoencoder (Audiovisual, Single) | Top-1 Action | 46 | — | Unverified |
| 2 | Audiovisual Masked Autoencoder (Video-only, Single) | Top-1 Action | 45.8 | — | Unverified |
| 3 | Audiovisual Masked Autoencoder (Audio-only, Single) | Top-1 Action | 19.7 | — | Unverified |
| 4 | PlayItBackX3 | Top-1 Action | 15.9 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | M2D-AS/0.7 | Mean AP | 48.5 | — | Unverified |
| 2 | LHGNN | Mean AP | 46.6 | — | Unverified |
| 3 | VAB-Encodec (Ours) | Mean AP | 38.7 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | EfficientLEAF | Accuracy | 60.2 | — | Unverified |
| 2 | melspect | Accuracy | 58.8 | — | Unverified |
| 3 | LEAF | Accuracy | 50.2 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | AUCO ResNet | AUC | 0.82 | — | Unverified |
| 2 | DenseNet 201 | AUC | 0.6 | — | Unverified |
| 3 | Inception ResNet V2 | AUC | 0.6 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Mirasol3B | Accuracy | 78.2 | — | Unverified |
| 2 | CA2ST(B/16) | Accuracy | 61 | — | Unverified |
| 3 | CAVA(B/16) | Accuracy | 60.3 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | ASM-RH-A | Top-1 Accuracy | 75.4 | — | Unverified |
| 2 | ERANN-0-4 | Top-1 Accuracy | 74.8 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Qwen-Audio | Accuracy | 92.89 | — | Unverified |
| 2 | VocalSound Baseline | Accuracy | 90.5 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | XGBoost (330) | Accuracy (10-fold) | 99.3 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | animal2vec | AP | 0.91 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Audio | Accuracy (%) | 64.5 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | CDIL | FruitFlies | 97.09 | — | Unverified |