SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization

2024-06-18 · Code Available

Young Jin Ahn, Jungwoo Park, Sangha Park, Jonghyun Choi, Kee-Eung Kim

Abstract

Visual Speech Recognition (VSR) stands at the intersection of computer vision and speech recognition, aiming to interpret spoken content from visual cues. A prominent challenge in VSR is the presence of homophenes: visually similar lip gestures that represent different phonemes. Prior approaches have sought to distinguish fine-grained visemes by aligning visual and auditory semantics, but often fell short of full synchronization. To address this, we present SyncVSR, an end-to-end learning framework that leverages quantized audio for frame-level crossmodal supervision. By integrating a projection layer that synchronizes visual representation with acoustic data, our encoder learns to generate discrete audio tokens from a video sequence in a non-autoregressive manner. SyncVSR shows versatility across tasks, languages, and modalities at the cost of a forward pass. Our empirical evaluations show that it not only achieves state-of-the-art results but also reduces data usage by up to ninefold.
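
The frame-level supervision described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each video frame is paired with a fixed number of quantized audio tokens, that the projection layer is a single linear map, and that the objective is a mean cross-entropy. The names `audio_token_loss`, `tokens_per_frame`, and `vocab_size` are hypothetical, and pure-Python lists stand in for tensors.

```python
import math


def linear(x, weight, bias):
    """Apply y = W x + b to one feature vector stored as a plain list."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def audio_token_loss(frame_features, weight, bias, audio_tokens,
                     tokens_per_frame, vocab_size):
    """Mean cross-entropy between predicted and target audio tokens.

    frame_features: T visual feature vectors, one per video frame.
    weight, bias:   hypothetical projection layer with
                    tokens_per_frame * vocab_size output units.
    audio_tokens:   T * tokens_per_frame target ids from an audio quantizer.
    """
    total, count = 0.0, 0
    for t, feature in enumerate(frame_features):
        # Every frame is projected independently of the other predictions,
        # which is what makes the decoding non-autoregressive.
        logits = linear(feature, weight, bias)
        for k in range(tokens_per_frame):
            chunk = logits[k * vocab_size:(k + 1) * vocab_size]
            target = audio_tokens[t * tokens_per_frame + k]
            total -= log_softmax(chunk)[target]
            count += 1
    return total / count


if __name__ == "__main__":
    T, D, K, V = 3, 4, 2, 5                    # frames, feature dim, tokens per frame, vocab
    features = [[0.1 * (t + d) for d in range(D)] for t in range(T)]
    weight = [[0.0] * D for _ in range(K * V)]  # untrained projection: all zeros
    bias = [0.0] * (K * V)
    targets = [0, 1, 2, 3, 4, 0]                # T * K hypothetical token ids
    # With zero weights the prediction is uniform, so the loss equals ln(V).
    print(audio_token_loss(features, weight, bias, targets, K, V))
```

In a training setup of this kind, the auxiliary loss would presumably be added to the main recognition loss with a weighting factor; the abstract does not state the exact combination.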

Benchmark Results

Dataset                | Model                   | Metric                | Claimed | Verified | Status
CAS-VSR-W1k (LRW-1000) | SyncVSR (Word Boundary) | Top-1 Accuracy        | 58.2    |          | Unverified
Lip Reading in the Wild | SyncVSR                | Top-1 Accuracy        | 93.2    |          | Unverified
Lip Reading in the Wild | SyncVSR (Word Boundary) | Top-1 Accuracy       | 95      |          | Unverified
LRS2                   | SyncVSR                 | Word Error Rate (WER) | 28.9    |          | Unverified
LRS2                   | SyncVSR                 | Word Error Rate (WER) | 16.5    |          | Unverified
LRS3-TED               | SyncVSR                 | Word Error Rate (WER) | 21.5    |          | Unverified
LRS3-TED               | SyncVSR                 | Word Error Rate (WER) | 31.2    |          | Unverified
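
For readers unfamiliar with the Word Error Rate metric used for LRS2 and LRS3-TED, the sketch below shows the standard definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` is hypothetical, and whitespace tokenization is an assumption; the benchmark's exact text normalization is not given here. The values in the table are presumably percentages, so the returned fraction would be multiplied by 100 for comparison.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Lower WER is better, whereas higher Top-1 Accuracy is better, so the two metric families in the table should not be compared directly.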