Translating speech with just images

2024-06-11

Dan Oneata, Herman Kamper

Code

github.com/danoneata/strim
Official, in paper, PyTorch

Abstract

Visually grounded speech models link speech to images. We extend this connection by linking images to text via an existing image captioning system, and as a result gain the ability to map speech audio directly to text. This approach can be used for speech translation with just images by having the audio in a different language from the generated captions. We investigate such a system on a real low-resource language, Yorùbá, and propose a Yorùbá-to-English speech translation model that leverages pretrained components so that it can learn in the low-resource regime. To limit overfitting, we find that it is essential to use a decoding scheme that produces diverse image captions for training. Results show that the predicted translations capture the main semantics of the spoken audio, albeit in a simpler and shorter form.
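
The data flow described in the abstract can be illustrated with a minimal sketch: each spoken utterance is paired with an image, a captioner produces several different captions for that image, and every (utterance, caption) pair becomes a training example for the speech-to-text model. This is only an illustration under stated assumptions, not the authors' implementation: `build_training_pairs`, `toy_caption_sampler`, and the caption templates are hypothetical names and data, and the toy sampler stands in for a real image captioning system with a diverse decoding scheme.

```python
import random
from typing import Callable, Dict, List, Tuple


def build_training_pairs(
    speech_to_image: Dict[str, str],
    caption_sampler: Callable[[str, int], List[str]],
    captions_per_image: int = 3,
) -> List[Tuple[str, str]]:
    """Pair every utterance with several captions of its associated image.

    speech_to_image maps an utterance id to the id of the image it describes.
    caption_sampler(image_id, n) is assumed to return n distinct captions.
    """
    pairs = []
    for audio_id, image_id in speech_to_image.items():
        for caption in caption_sampler(image_id, captions_per_image):
            pairs.append((audio_id, caption))
    return pairs


def toy_caption_sampler(image_id: str, n: int) -> List[str]:
    # Hypothetical stand-in for a captioning model: picks n distinct
    # captions from a fixed list, seeded by the image id.
    templates = [
        "a dog runs on the grass",
        "a brown dog is playing outside",
        "an animal running in a field",
        "a dog in a park",
    ]
    rng = random.Random(image_id)
    return rng.sample(templates, n)


if __name__ == "__main__":
    # Hypothetical utterance and image ids.
    pairs = build_training_pairs(
        {"utt1": "imgA", "utt2": "imgB"}, toy_caption_sampler, captions_per_image=2
    )
    for audio_id, caption in pairs:
        print(audio_id, "->", caption)
```

Using several distinct captions per image means the same utterance is seen with varied target texts, which is the kind of diversity the abstract reports as essential for limiting overfitting.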

Tasks

Image Captioning, Translation
