BabyHuBERT: Multilingual Self-Supervised Learning for Segmenting Speakers in Child-Centered Long-Form Recordings

2026-03-05

Théo Charlot, Tarek Kunze, Maxime Poli, Alejandrina Cristia, Emmanuel Dupoux, Marvin Lavechin


Abstract

Child-centered daylong recordings are essential for studying early language development, but existing speech models trained on clean adult data perform poorly due to acoustic and linguistic differences. We introduce BabyHuBERT, a self-supervised speech model trained on 13,000 hours of multilingual child-centered recordings spanning 40+ languages. Evaluated on voice type classification -- distinguishing target children from female adults, male adults, and other children, a key preprocessing step for analyzing naturalistic language experiences -- BabyHuBERT-VTC achieves F1-scores from 52.1% to 74.4% across six corpora, consistently outperforming W2V2-LL4300 (English daylongs) and HuBERT (clean adult speech). Notable gains include 13.2 and 15.9 absolute F1 points over HuBERT on Vanuatu and Solomon Islands, demonstrating effectiveness on underrepresented languages. We share code and model to support researchers working with child-centered recordings across diverse linguistic contexts.
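As a rough illustration of the evaluation described in the abstract, the sketch below computes per-class F1-scores for the four voice types with scikit-learn. The class names, the frame-level framing, and the toy labels are hypothetical placeholders, not taken from the paper or its released code; the actual task allows overlapping speakers, which this simplified multi-class example ignores.

```python
# Illustrative sketch (assumed setup, not the paper's evaluation code):
# per-class F1-scores for voice type classification on frame-level labels.
from sklearn.metrics import f1_score

# Hypothetical names for the four voice types in the abstract.
CLASSES = ["target_child", "other_child", "female_adult", "male_adult"]

def per_class_f1(y_true, y_pred):
    """Return one F1-score per voice type for frame-level predictions."""
    scores = f1_score(y_true, y_pred, labels=CLASSES, average=None, zero_division=0)
    return dict(zip(CLASSES, scores))

# Toy frame-level annotations (purely illustrative values).
y_true = ["target_child", "female_adult", "male_adult", "target_child", "other_child"]
y_pred = ["target_child", "female_adult", "target_child", "target_child", "other_child"]
print(per_class_f1(y_true, y_pred))
```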
