SOTAVerified

Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation

2024-10-27Unverified0· sign in to hype

Maohao Shen, Shun Zhang, JiLong Wu, Zhiping Xiu, Ehab AlBadawy, Yiting Lu, Mike Seltzer, Qing He

Unverified — Be the first to reproduce this paper.

Reproduce

Abstract

Large language models (LLMs) have revolutionized natural language processing (NLP) with impressive performance across various text-based tasks. However, the extension of text-dominant LLMs to with speech generation tasks remains under-explored. In this work, we introduce a text-to-speech (TTS) system powered by a fine-tuned Llama model, named TTS-Llama, that achieves state-of-the-art speech synthesis performance. Building on TTS-Llama, we further propose MoLE-Llama, a text-and-speech multimodal LLM developed through purely late-fusion parameter-efficient fine-tuning (PEFT) and a mixture-of-expert architecture. Extensive empirical results demonstrate MoLE-Llama's competitive performance on both text-only question-answering (QA) and TTS tasks, mitigating catastrophic forgetting issue in either modality. Finally, we further explore MoLE-Llama in text-in-speech-out QA tasks, demonstrating its great potential as a multimodal dialog system capable of speech generation.

Tasks

Reproductions