Voice Activity Projection: Self-supervised Learning of Turn-taking Events

2022-05-19

Erik Ekstedt, Gabriel Skantze

Code

github.com/erikekstedt/vap_turn_taking (official, in paper, PyTorch, ★ 25)
github.com/erikekstedt/conv_ssl (official, in paper, PyTorch, ★ 14)
github.com/ErikEkstedt/VoiceActivityProjection (PyTorch, ★ 96)

Abstract

The modeling of turn-taking in dialog can be viewed as the modeling of the dynamics of voice activity of the interlocutors. We extend prior work and define the predictive task of Voice Activity Projection, a general, self-supervised objective, as a way to train turn-taking models without the need for labeled data. We highlight a theoretical weakness in prior approaches, arguing for the need to model the dependency of voice activity events in the projection window. We propose four zero-shot tasks, related to the prediction of upcoming turn-shifts and backchannels, and show that the proposed model outperforms prior work.
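
To make the idea of a projection window concrete, here is a minimal sketch of how self-supervised labels could be derived from voice activity alone. It is an illustration under stated assumptions, not the authors' implementation: the function name `projection_labels`, the bin sizes, and the "active for at least half the bin" rule are all hypothetical choices that this page does not specify. The future activity of both speakers is split into bins, and all bins are packed into one joint label, so a model trained on these labels has to predict the bins together rather than independently.

```python
from typing import List, Sequence


def projection_labels(
    va: Sequence[Sequence[int]],
    bin_frames: Sequence[int] = (2, 4, 6, 8),  # assumed bin sizes, in frames
) -> List[int]:
    """Map each frame to one joint label describing future voice activity.

    va[s][t] is 1 if speaker s is speaking at frame t, else 0.
    """
    n_frames = len(va[0])
    horizon = sum(bin_frames)
    labels = []
    # Only frames that have a complete future window receive a label.
    for t in range(n_frames - horizon):
        label = 0
        for speaker_va in va:
            start = t + 1
            for size in bin_frames:
                window = speaker_va[start:start + size]
                # Assumed rule: a bin is active if speech fills at least half of it.
                bit = 1 if 2 * sum(window) >= size else 0
                # Pack every bin of every speaker into a single integer.
                label = (label << 1) | bit
                start += size
        labels.append(label)
    return labels


# Two speakers, four frames, two one-frame bins per speaker.
example = projection_labels([[0, 1, 1, 0], [0, 0, 0, 1]], bin_frames=(1, 1))
print(example)  # [12, 9]
```

With two speakers and four bins each, the joint label takes one of 2^8 = 256 values, so the training target is a single class per frame and no manual annotation is required.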

Tasks

Self-Supervised Learning
