SOTAVerified

Audio Classification

Audio Classification is a machine learning task that assigns audio signals to predefined classes or categories. The goal is to enable machines to automatically recognize and distinguish between different types of audio, such as music, speech, and environmental sounds.
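
As a rough illustration of the task described above, the following toy sketch classifies synthetic tones by a single hand-crafted feature. Everything in it is an assumption made for illustration: the sample rate, the `low_pitch` / `high_pitch` labels, the helper names, and the nearest-centroid rule. The systems listed on this page use learned representations instead (for example spectrogram transformers), so this is not any paper's method.

```python
import numpy as np

SAMPLE_RATE = 16_000  # Hz; an assumed value for this toy example


def make_tone(freq_hz, rng, seconds=0.5):
    """Synthesize a noisy sine tone that stands in for a real recording."""
    t = np.arange(int(SAMPLE_RATE * seconds)) / SAMPLE_RATE
    return np.sin(2 * np.pi * freq_hz * t) + 0.05 * rng.standard_normal(t.size)


def power_centroid(signal):
    """Return the power-weighted mean frequency of the signal in Hz."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / SAMPLE_RATE)
    return float((freqs * power).sum() / power.sum())


def fit_centroids(signals, labels):
    """Average the feature per class; the result is a nearest-centroid model."""
    per_class = {}
    for signal, label in zip(signals, labels):
        per_class.setdefault(label, []).append(power_centroid(signal))
    return {label: float(np.mean(values)) for label, values in per_class.items()}


def predict(model, signal):
    """Pick the class whose average feature is closest to this signal's feature."""
    feature = power_centroid(signal)
    return min(model, key=lambda label: abs(model[label] - feature))


# Hypothetical two-class training set built from synthetic tones.
rng = np.random.default_rng(0)
train_signals = [make_tone(f, rng) for f in (200, 300, 3000, 4000)]
train_labels = ["low_pitch", "low_pitch", "high_pitch", "high_pitch"]
model = fit_centroids(train_signals, train_labels)

print(predict(model, make_tone(250, rng)))
print(predict(model, make_tone(3500, rng)))
```

The same shape (feature extraction, a fitted model, a prediction per clip) carries over to real pipelines, where the feature step is typically a log-mel spectrogram and the model is a neural network.
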

Papers

Showing 1 to 50 of 361 papers

Title | Status | Hype
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | Code | 7
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | Code | 4
Leveraging tropical reef, bird and unrelated sounds for superior transfer learning in marine bioacoustics | Code | 3
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models | Code | 3
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | Code | 3
CMKD: CNN/Transformer-Based Cross-Model Knowledge Distillation for Audio Classification | Code | 3
EAT: Self-Supervised Pre-Training with Efficient Audio Transformer | Code | 3
Global birdsong embeddings enable superior transfer learning for bioacoustic classification | Code | 2
HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection | Code | 2
Oceanship: A Large-Scale Dataset for Underwater Audio Target Recognition | Code | 2
SSAST: Self-Supervised Audio Spectrogram Transformer | Code | 2
Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge Distillation | Code | 2
Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models | Code | 2
Benchmarking Representations for Speech, Music, and Acoustic Events | Code | 2
BirdSet: A Large-Scale Dataset for Audio Classification in Avian Bioacoustics | Code | 2
Dynamic Convolutional Neural Networks as Efficient Pre-trained Audio Models | Code | 2
Leveraging Pre-Trained Autoencoders for Interpretable Prototype Learning of Music Audio | Code | 2
AST: Audio Spectrogram Transformer | Code | 2
SSAMBA: Self-Supervised Audio Representation Learning with Mamba State Space Model | Code | 2
Contrastive Audio-Visual Masked Autoencoder | Code | 2
Audio Mamba: Bidirectional State Space Model for Audio Representation Learning | Code | 2
Federated Self-Training for Semi-Supervised Audio Recognition | Code | 1
Exploiting Foundation Models and Speech Enhancement for Parkinson's Disease Detection from Speech in Real-World Operative Conditions | Code | 1
Few-shot Class-incremental Audio Classification Using Stochastic Classifier | Code | 1
Efficient Training of Audio Transformers with Patchout | Code | 1
Adaptive Differential Denoising for Respiratory Sounds Classification | Code | 1
Environmental Sound Classification on the Edge: A Pipeline for Deep Acoustic Networks on Extremely Resource-Constrained Devices | Code | 1
EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning | Code | 1
animal2vec and MeerKAT: A self-supervised transformer for rare-event raw audio input and a large-scale reference dataset for bioacoustics | Code | 1
ElasticAST: An Audio Spectrogram Transformer for All Length and Resolutions | Code | 1
Effective Audio Classification Network Based on Paired Inverse Pyramid Structure and Dense MLP Block | Code | 1
AUCO ResNet: an end-to-end network for Covid-19 pre-screening from cough and breath | Code | 1
EfficientLEAF: A Faster LEarnable Audio Frontend of Questionable Use | Code | 1
End-to-End Audio Strikes Back: Boosting Augmentations Towards An Efficient Audio Classification Network | Code | 1
Fluctuation-driven initialization for spiking neural network training | Code | 1
DCASE 2021 Task 3: Spectrotemporally-aligned Features for Polyphonic Sound Event Localization and Detection | Code | 1
CRNNs for Urban Sound Tagging with spatiotemporal context | Code | 1
Device-Robust Acoustic Scene Classification via Impulse Response Augmentation | Code | 1
A surrogate gradient spiking baseline for speech command recognition | Code | 1
Acoustic Prompt Tuning: Empowering Large Language Models with Audition Capabilities | Code | 1
Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer | Code | 1
Adversarial Fine-tuning using Generated Respiratory Sound to Address Class Imbalance | Code | 1
CNN Architectures for Large-Scale Audio Classification | Code | 1
ATST: Audio Representation Learning with Teacher-Student Transformer | Code | 1
CycleGuardian: A Framework for Automatic RespiratorySound classification Based on Improved Deep clustering and Contrastive Learning | Code | 1
DASS: Distilled Audio State Space Models Are Stronger and More Duration-Scalable Learners | Code | 1
Continual Transformers: Redundancy-Free Attention for Online Inference | Code | 1
Audio Tagging on an Embedded Hardware Platform | Code | 1
DTF-AT: Decoupled Time-Frequency Audio Transformer for Event Classification | Code | 1
CLARA: Multilingual Contrastive Learning for Audio Representation Acquisition | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | OmniVec2 | Test mAP | 0.56 | | Unverified
2 | OmniVec | Test mAP | 0.55 | | Unverified
3 | EquiAV | Test mAP | 0.55 | | Unverified
4 | MAViL (Audio-Visual, single) | Test mAP | 0.53 | | Unverified
5 | Audiovisual Masked Autoencoder (Audiovisual, Single) | Test mAP | 0.52 | | Unverified
6 | CAV-MAE (Audio-Visual) | Test mAP | 0.51 | | Unverified
7 | BEATs (Audio-only, Ensemble) | Test mAP | 0.51 | | Unverified
8 | UAVM (Audio + Video) | Test mAP | 0.5 | | Unverified
9 | SSLAM (Audio-Only, Single) | Test mAP | 0.5 | | Unverified
10 | mn40_as (Ensemble) | Test mAP | 0.5 | | Unverified
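
The mAP figures in these tables are mean average precision scores. The page does not say how each paper computed them, so the following is only a minimal sketch of one common definition for multi-label tagging: average precision per class, taken as the mean of the precision values at each positive item in the score ranking, then macro-averaged over classes. The function names are hypothetical, and tie handling may differ from library implementations.

```python
import numpy as np


def average_precision(y_true, y_score):
    """AP for one class: mean precision at each positive, ranked by score."""
    y_true = np.asarray(y_true, dtype=bool)
    order = np.argsort(-np.asarray(y_score, dtype=float), kind="stable")
    hits = y_true[order]
    if not hits.any():
        return float("nan")  # AP is undefined for a class with no positives
    precision_at_k = np.cumsum(hits) / np.arange(1, hits.size + 1)
    return float(precision_at_k[hits].mean())


def mean_average_precision(y_true, y_score):
    """Macro-average AP over class columns, skipping undefined classes."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    aps = [
        average_precision(y_true[:, c], y_score[:, c])
        for c in range(y_true.shape[1])
    ]
    return float(np.nanmean(aps))


# Four clips, two classes; rows are clips, columns are classes.
labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
scores = [[0.9, 0.2], [0.8, 0.7], [0.7, 0.6], [0.1, 0.3]]
print(round(mean_average_precision(labels, scores), 4))  # 0.9167
```

In the example, class 0 has AP (1/1 + 2/3) / 2 and class 1 has AP 1.0, which gives a mean of 11/12.
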

# | Model | Metric | Claimed | Verified | Status
1 | OmniVec2 | Top-1 Accuracy | 99.1 | | Unverified
2 | InternVideo2 | Top-1 Accuracy | 98.6 | | Unverified
3 | M2D2 AS+ | Top-1 Accuracy | 98.5 | | Unverified
4 | OmniVec | Top-1 Accuracy | 98.4 | | Unverified
5 | BEATs | Top-1 Accuracy | 98.1 | | Unverified
6 | mn40_as | Top-1 Accuracy | 97.45 | | Unverified
7 | DyMN-L | Top-1 Accuracy | 97.4 | | Unverified
8 | M2D-CLAP/0.7 | Top-1 Accuracy | 97.4 | | Unverified
9 | M2D-AS/0.7 | Top-1 Accuracy | 97.2 | | Unverified
10 | HTS-AT | Top-1 Accuracy | 97 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | ADD | ICBHI Score | 65.53 | | Unverified
2 | BEATs (PAFA) | ICBHI Score | 64.84 | | Unverified
3 | BTS | ICBHI Score | 63.54 | | Unverified
4 | BEATs (CE) | ICBHI Score | 63.49 | | Unverified
5 | M2D-X/0.7 (η=0.3) | ICBHI Score | 63.29 | | Unverified
6 | CycleGuardian | ICBHI Score | 63.26 | | Unverified
7 | M2D/0.7 (e=0.3) | ICBHI Score | 62.73 | | Unverified
8 | Audio-CLAP | ICBHI Score | 62.56 | | Unverified
9 | AST (Patch-Mix CL) | ICBHI Score | 62.37 | | Unverified
10 | AFT on Mixed-500 | ICBHI Score | 61.79 | | Unverified
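
The page does not define the ICBHI Score. As commonly used for the ICBHI 2017 respiratory sound benchmark, it is the average of specificity (the fraction of normal cycles predicted as normal) and sensitivity (the fraction of abnormal cycles predicted as their exact abnormal class). The sketch below assumes that definition; the function name, the label strings, and the percentage scaling are illustrative choices, not taken from any listed paper.

```python
def icbhi_score(y_true, y_pred, normal_label="normal"):
    """Average of specificity and sensitivity, returned as a percentage."""
    normal_total = normal_hit = abnormal_total = abnormal_hit = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == normal_label:
            normal_total += 1
            normal_hit += int(pred == truth)
        else:
            abnormal_total += 1
            # An abnormal cycle counts only if the exact class is predicted.
            abnormal_hit += int(pred == truth)
    if normal_total == 0 or abnormal_total == 0:
        raise ValueError("need at least one normal and one abnormal sample")
    specificity = normal_hit / normal_total
    sensitivity = abnormal_hit / abnormal_total
    return 100.0 * (specificity + sensitivity) / 2.0


# Hypothetical labels for eight breathing cycles.
truth = ["normal", "normal", "normal", "normal",
         "crackle", "crackle", "wheeze", "both"]
preds = ["normal", "normal", "normal", "crackle",
         "crackle", "wheeze", "wheeze", "normal"]
print(icbhi_score(truth, preds))  # 62.5
```

Here specificity is 3/4 and sensitivity is 2/4, so the score is 62.5. Because the two terms are weighted equally, a model cannot score well by predicting the majority normal class alone.
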

# | Model | Metric | Claimed | Verified | Status
1 | MBT (AV) | Top 5 Accuracy | 85.6 | | Unverified
2 | Mirasol3B | Top 1 Accuracy | 69.8 | | Unverified
3 | CA2ST(B/16) | Top 1 Accuracy | 68.3 | | Unverified
4 | ONE-PEACE (Audio-Visual) | Top 1 Accuracy | 68.2 | | Unverified
5 | CAVA(B/16) | Top 1 Accuracy | 68.2 | | Unverified
6 | EquiAV | Top 1 Accuracy | 67.1 | | Unverified
7 | MAViL | Top 1 Accuracy | 67.1 | | Unverified
8 | MMT (Audio-Visual) | Top 1 Accuracy | 66.2 | | Unverified
9 | CAV-MAE (Audio-Visual) | Top 1 Accuracy | 65.9 | | Unverified
10 | UAVM (Audio + Video) | Top 1 Accuracy | 65.8 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Event-SSM | Percentage correct | 95.9 | | Unverified
2 | SNN with Dilated Convolution with Learnable Spacings | Percentage correct | 95.1 | | Unverified
3 | SNN featuring learnable axonal delays with adaptively delay caps | Percentage correct | 92.45 | | Unverified
4 | SNN with spatio-temporal filters and attention | Percentage correct | 92.4 | | Unverified
5 | CNN | Percentage correct | 92.4 | | Unverified
6 | SNN with temporal-wise attention | Percentage correct | 91.1 | | Unverified
7 | SNN | Percentage correct | 87 | | Unverified
8 | Recurrent convolutional SNN | Percentage correct | 83.5 | | Unverified
9 | Recurrent SNN | Percentage correct | 83.2 | | Unverified
10 | Sparse Spiking Gradient Descent | Percentage correct | 77.5 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | ONE-PEACE | mAP | 69.7 | | Unverified
2 | MN | mAP | 65.6 | | Unverified
3 | PaSST-S | mAP | 65.55 | | Unverified
4 | DyMN-L | mAP | 65.5 | | Unverified
5 | PaSST-N-S | mAP | 64.2 | | Unverified
6 | LHGNN | Mean AP | 59 | | Unverified
7 | PSLA | mAP | 56.71 | | Unverified
8 | MATPAC (SSL Model) | mAP | 55.2 | | Unverified
9 | Temporal Knowledge Distillation for On-device Audio Classification | mAP | 54.8 | | Unverified
10 | Large 6-Layer Transformer with Pooling | mAP | 53.7 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | EquiAV | Mean AP | 42.4 | | Unverified
2 | SSLAM | Mean AP | 40.9 | | Unverified
3 | EAT | Mean AP | 40.3 | | Unverified
4 | BEATs | Mean AP | 38.9 | | Unverified
5 | Base (ours) | Mean AP | 37.4 | | Unverified
6 | SSAST-PATCH | Mean AP | 31 | | Unverified
7 | SSAST-FRAME | Mean AP | 29.2 | | Unverified
8 | Conformer | Mean AP | 27.6 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | PDC | Accuracy | 97.8 | | Unverified
2 | ASM-RH | Accuracy | 96.51 | | Unverified
3 | EfficientLEAF | Accuracy | 95.2 | | Unverified
4 | LEAF | Accuracy | 95.1 | | Unverified
5 | melspect | Accuracy | 95.1 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Event-SSM | Accuracy | 88.4 | | Unverified
2 | SNN with Dilated Convolution with Learnable Spacings | Accuracy | 80.69 | | Unverified
3 | RadLIF | Accuracy | 77.4 | | Unverified
4 | SpikGRU | Accuracy | 77 | | Unverified
5 | Adaptive SRNN | Accuracy | 74.2 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | EfficientLEAF (8s) | Accuracy | 72.2 | | Unverified
2 | EfficientLEAF | Accuracy | 42.9 | | Unverified
3 | LEAF | Accuracy | 42.3 | | Unverified
4 | melspect | Accuracy | 39.9 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | CrissCross (AudioSet) | Top-1 Accuracy | 97 | | Unverified
2 | CrissCross (Kinetics-400) | Top-1 Accuracy | 96 | | Unverified
3 | XDC | Top-1 Accuracy | 95 | | Unverified
4 | CrissCross (Kinetics-Sound) | Top-1 Accuracy | 93 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Audiovisual Masked Autoencoder (Audiovisual, Single) | Top-1 Action | 46 | | Unverified
2 | Audiovisual Masked Autoencoder (Video-only, Single) | Top-1 Action | 45.8 | | Unverified
3 | Audiovisual Masked Autoencoder (Audio-only, Single) | Top-1 Action | 19.7 | | Unverified
4 | PlayItBackX3 | Top-1 Action | 15.9 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | M2D-AS/0.7 | Mean AP | 48.5 | | Unverified
2 | LHGNN | Mean AP | 46.6 | | Unverified
3 | VAB-Encodec (Ours) | Mean AP | 38.7 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | EfficientLEAF | Accuracy | 60.2 | | Unverified
2 | melspect | Accuracy | 58.8 | | Unverified
3 | LEAF | Accuracy | 50.2 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | AUCO ResNet | AUC | 0.82 | | Unverified
2 | DenseNet 201 | AUC | 0.6 | | Unverified
3 | Inception ResNet V2 | AUC | 0.6 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Mirasol3B | Accuracy | 78.2 | | Unverified
2 | CA2ST(B/16) | Accuracy | 61 | | Unverified
3 | CAVA(B/16) | Accuracy | 60.3 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | ASM-RH-A | Top-1 Accuracy | 75.4 | | Unverified
2 | ERANN-0-4 | Top-1 Accuracy | 74.8 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Qwen-Audio | Accuracy | 92.89 | | Unverified
2 | VocalSound Baseline | Accuracy | 90.5 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | XGBoost (330) | Accuracy (10-fold) | 99.3 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | animal2vec | AP | 0.91 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Audio | Accuracy (%) | 64.5 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | CDIL | FruitFlies | 97.09 | | Unverified