Transformer-Based Approaches for Automatic Music Transcription
Christos Zonios
Abstract
Automatic Music Transcription (AMT) is the process of extracting information from audio into some form of music notation. For polyphonic music this is a very hard problem for computers to solve: it requires significant prior knowledge and understanding of the language of music, and the audio is subject to a multitude of variations in frequency content depending on factors such as instrument materials, tuning, player performance, and recording equipment. Transformers are models that can be trained in a self-supervised manner, using self-attention to learn contextual representations from unlabeled data; they have recently shown great promise, surpassing state-of-the-art (SOTA) performance in various Speech Recognition (SR), Natural Language Processing (NLP), and Computer Vision tasks. In this work, we examine transformer-based approaches for performing AMT on piano recordings by learning audio and music language representations. Specifically, we investigate the popular SR model wav2vec2 for audio representation learning, and the NLP model BERT for Music Language Modelling (MusicLM). We propose a new pre-training approach for MusicLM transformers based on an appropriately defined transcription error correction task. In addition, three novel models for AMT are proposed and studied that integrate wav2vec2 and BERT transformers at various stages. We conclude that a wav2vec2 encoder pre-trained on speech audio is not able to surpass SOTA models that use mel-scale spectrograms and convolutional network encoders without significant conditioning on music audio. We show that a BERT transformer pre-trained on natural language has transfer learning potential for MusicLM. We also examine the robustness of such a transformer for performing MusicLM, and find that we achieve interesting results both when performing Masked MusicLM and when replacing Recurrent Neural Networks with pre-trained transformers in SOTA models for AMT.
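To make the Masked MusicLM objective mentioned above concrete, the following is a minimal sketch (not the thesis's actual implementation) of BERT-style token masking applied to a sequence of MIDI pitch tokens. The vocabulary layout, masking probabilities, and token IDs here are illustrative assumptions, following the standard 80/10/10 masking scheme from BERT pre-training:

```python
import random

# Hypothetical toy vocabulary: 128 MIDI pitches plus two special tokens.
PAD, MASK = 128, 129
VOCAB_SIZE = 130

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """BERT-style masking for Masked MusicLM pre-training (sketch).

    Returns (inputs, labels). Labels are -100 at unmasked positions so
    that a cross-entropy loss ignores them (the usual BERT convention);
    at masked positions the label is the original token to predict.
    """
    rng = random.Random(seed)
    inputs, labels = list(tokens), [-100] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK                 # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.randrange(128)   # 10%: random pitch token
            # remaining 10%: keep the original token unchanged
    return inputs, labels

# Example: a C-major arpeggio encoded as MIDI pitch tokens.
notes = [60, 64, 67, 72, 67, 64, 60, 64, 67, 72]
inputs, labels = mask_tokens(notes)
```

A model trained on `(inputs, labels)` pairs of this form learns to reconstruct masked notes from their musical context, which is the sense in which a pre-trained BERT can be repurposed for music language modelling.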