Time and Tokens: Benchmarking End-to-End Speech Dysfluency Detection

2024-09-20

Xuanru Zhou, Jiachen Lian, Cheol Jun Cho, Jingwen Liu, Zongli Ye, Jinming Zhang, Brittany Morin, David Baquirin, Jet Vonk, Zoe Ezzes, Zachary Miller, Maria Luisa Gorno Tempini, Gopala Anumanchipalli

Abstract

Speech dysfluency modeling is a task to detect dysfluencies in speech, such as repetition, block, insertion, replacement, and deletion. Most recent advancements treat this problem as a time-based object detection problem. In this work, we revisit this problem from a new perspective: tokenizing dysfluencies and modeling the detection problem as a token-based automatic speech recognition (ASR) problem. We propose rule-based speech and text dysfluency simulators and develop VCTK-token, and then develop a Whisper-like seq2seq architecture to build a new benchmark with decent performance. We also systematically compare our proposed token-based methods with time-based methods, and propose a unified benchmark to facilitate future research endeavors. We open-source these resources for the broader scientific community. The project page is available at https://rorizzz.github.io/
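To make the token-based framing concrete, here is a minimal sketch of what a rule-based text dysfluency simulator might look like. It is an illustration only: the token names (`[REP]`, `[INS]`, `[DEL]`), the function names, and the placement of tokens relative to the affected word are all assumptions made for this example, not the vocabulary or rules used in the paper or in VCTK-token.

```python
# Hypothetical dysfluency tokens; the paper's actual token set may differ.
REP, INS, DEL = "[REP]", "[INS]", "[DEL]"


def inject_repetition(words, index):
    """Repeat the word at `index`, marking the repeat with a [REP] token."""
    return words[: index + 1] + [REP, words[index]] + words[index + 1 :]


def inject_insertion(words, index, filler="uh"):
    """Insert a filler word before `index`, marked with an [INS] token."""
    return words[:index] + [INS, filler] + words[index:]


def inject_deletion(words, index):
    """Drop the word at `index`, leaving a [DEL] token in its place."""
    return words[:index] + [DEL] + words[index + 1 :]


clean = "please call stella".split()
print(" ".join(inject_repetition(clean, 1)))
print(" ".join(inject_insertion(clean, 1)))
print(" ".join(inject_deletion(clean, 2)))
```

Under this framing, a seq2seq model would be trained to emit the tokenized transcript from dysfluent audio, so detection reduces to reading off which special tokens appear and where. A time-based detector would instead predict start and end timestamps for each dysfluent region.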

Tasks

Automatic Speech Recognition (ASR), Benchmarking, Object Detection, Speech Recognition
