SOTAVerified

Improving Text-To-Audio Models with Synthetic Captions

2024-06-18

Zhifeng Kong, Sang-gil Lee, Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, Rafael Valle, Soujanya Poria, Bryan Catanzaro

Code Available

Abstract

Obtaining high-quality training data, especially captions, is an open challenge for text-to-audio models. Although prior methods have leveraged text-only language models to augment and improve captions, such methods have limitations in scale and in coherence between audio and captions. In this work, we propose an audio captioning pipeline that uses an audio language model to synthesize accurate and diverse captions for audio at scale. We leverage this pipeline to produce a dataset of synthetic captions for AudioSet, named AF-AudioSet, and then evaluate the benefit of pre-training text-to-audio models on these synthetic captions. Through systematic evaluations on AudioCaps and MusicCaps, we find that leveraging our pipeline and synthetic captions leads to significant improvements in audio generation quality, achieving a new state-of-the-art.
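The captioning pipeline described in the abstract can be sketched at a high level: an audio language model proposes candidate captions for each clip, and only well-aligned audio-caption pairs are kept. The function names below (`audio_lm_caption`, `audio_text_similarity`) are hypothetical stand-ins, stubbed here for illustration; a real pipeline would call an actual audio language model and an audio-text alignment scorer such as a CLAP-style model.

```python
# Hedged sketch of a synthetic-captioning pipeline, under the assumption
# that captions are generated per clip and filtered by an alignment score.

def audio_lm_caption(audio_clip: str, n: int = 3) -> list[str]:
    # Placeholder: a real pipeline would prompt an audio language model.
    return [f"caption {i} for {audio_clip}" for i in range(n)]

def audio_text_similarity(audio_clip: str, caption: str) -> float:
    # Placeholder: a real pipeline would score audio-caption alignment
    # with an audio-text embedding model.
    return 0.5

def build_synthetic_captions(clips: list[str], threshold: float = 0.4) -> list[dict]:
    """Generate candidate captions per clip and keep the best-aligned one
    if it clears the similarity threshold."""
    dataset = []
    for clip in clips:
        candidates = audio_lm_caption(clip)
        scored = [(c, audio_text_similarity(clip, c)) for c in candidates]
        best, score = max(scored, key=lambda pair: pair[1])
        if score >= threshold:
            dataset.append({"audio": clip, "caption": best, "score": score})
    return dataset
```

The resulting filtered pairs would then serve as pre-training data for a text-to-audio model, as the paper does with AF-AudioSet.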

Tasks

Benchmark Results

| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| AudioCaps | Tango-AF&AC-FT-AC | FAD | 2.54 | — | Unverified |

Reproductions