Self-supervised Audio Teacher-Student Transformer for Both Clip-level and Frame-level Tasks
Xian Li, Nian Shao, Xiaofei Li
Code
- github.com/audio-westlakeu/audiossl (official, in paper; PyTorch; ★ 136)
- github.com/Audio-WestlakeU/ATST-SED (PyTorch; ★ 161)
Abstract
Self-supervised learning (SSL) has emerged as a popular approach for learning audio representations. One goal of audio self-supervised pre-training is to transfer knowledge to downstream audio tasks, which generally include clip-level and frame-level tasks. While frame-level tasks are important for fine-grained acoustic scene/event understanding, prior studies primarily evaluate on clip-level downstream tasks. To tackle both clip-level and frame-level tasks, this paper proposes the Audio Teacher-Student Transformer (ATST), with a clip-level version (named ATST-Clip) and a frame-level version (named ATST-Frame), responsible for learning clip-level and frame-level representations, respectively. Both methods use a Transformer encoder and a teacher-student training scheme. We have carefully designed the view creation strategy for ATST-Clip and ATST-Frame. Specifically, ATST-Clip uses segment-wise data augmentations, and ATST-Frame integrates frame-wise data augmentations and masking. Experimental results show that our ATST-Frame model obtains state-of-the-art (SOTA) performance on most of the clip-level and frame-level downstream tasks. In particular, it outperforms other models by a large margin on the frame-level sound event detection task. In addition, the performance can be further improved by combining the two models through knowledge distillation. Our code is available online.
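The abstract names a teacher-student training scheme but does not say how the teacher is updated. As a minimal sketch, assuming the teacher tracks the student through an exponential moving average (a common choice in self-supervised teacher-student methods, not confirmed by the text above), the update could look like the following. The function name `ema_update` and the `decay` value are hypothetical and chosen only for illustration.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher weights as decay * teacher + (1 - decay) * student.

    Assumption: weights are flat lists of floats; a real model would apply
    the same rule to every parameter tensor.
    """
    if len(teacher) != len(student):
        raise ValueError("teacher and student must have the same number of weights")
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


# Toy example: the teacher moves a small step toward the student.
teacher_weights = [0.0, 1.0]
student_weights = [1.0, 3.0]
updated = ema_update(teacher_weights, student_weights, decay=0.9)
print([round(w, 3) for w in updated])  # [0.1, 1.2]
```

With a decay close to 1, the teacher changes slowly and acts as a smoothed copy of the student, which is the usual motivation for this kind of update.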
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| AudioSet | ATST-C2F(Single) | Test mAP | 0.5 | — | Unverified |
| AudioSet | ATST-Frame | Test mAP | 0.48 | — | Unverified |
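The table reports test mAP (mean average precision) but does not say how it was computed. The sketch below shows one common definition for multi-label tagging: average precision is computed per class from ranked scores, then averaged over classes. The function names and the toy data are hypothetical, and tied scores are broken by input order here, which an evaluation toolkit may handle differently.

```python
def average_precision(labels, scores):
    """Average precision for one class: mean of precision at each positive's rank."""
    # Rank clips from highest to lowest score (ties keep input order).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("class has no positive examples")
    return precision_sum / hits


def mean_average_precision(label_rows, score_rows):
    """Mean of per-class average precision; rows are clips, columns are classes."""
    n_classes = len(label_rows[0])
    per_class = [
        average_precision(
            [row[c] for row in label_rows],
            [row[c] for row in score_rows],
        )
        for c in range(n_classes)
    ]
    return sum(per_class) / n_classes


# Toy data: 3 clips, 2 classes (not taken from the paper).
toy_labels = [[1, 0], [0, 1], [1, 0]]
toy_scores = [[0.9, 0.1], [0.8, 0.9], [0.7, 0.2]]
toy_map = mean_average_precision(toy_labels, toy_scores)
print(round(toy_map, 4))  # 0.9167
```

In the toy data, class 0 has positives at ranks 1 and 3, giving an average precision of (1/1 + 2/3) / 2, while class 1 ranks its single positive first and scores 1.0.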