CLIP-SLA: Parameter-Efficient CLIP Adaptation for Continuous Sign Language Recognition
Sarah Alyami, Hamzah Luqman
Code
- github.com/snalyami/CLIP-SLA (official)
Abstract
Continuous sign language recognition (CSLR) focuses on interpreting and transcribing sequences of sign language gestures in videos. In this work, we propose CLIP sign language adaptation (CLIP-SLA), a novel CSLR framework that adapts the powerful pre-trained visual encoder of the CLIP model to sign language tasks through parameter-efficient fine-tuning (PEFT). We introduce two variants, SLA-Adapter and SLA-LoRA, which integrate PEFT modules into the CLIP visual encoder, enabling fine-tuning with a minimal number of trainable parameters. The effectiveness of the proposed framework is validated on four datasets: Phoenix2014, Phoenix2014-T, CSL-Daily, and Isharah-500, where both CLIP-SLA variants outperform several SOTA models with fewer trainable parameters. Extensive ablation studies emphasize the effectiveness and flexibility of the proposed methods with different vision-language models for CSLR. These findings showcase the potential of adapting large-scale pre-trained models for scalable and efficient CSLR, paving the way for future advancements in sign language understanding.
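To make the PEFT idea concrete, the sketch below shows one common way a LoRA module can be injected into a frozen linear projection, such as an attention projection inside a CLIP-style vision transformer. This is an illustrative sketch only, not the authors' released implementation; the layer sizes, rank, and scaling are assumed values chosen for demonstration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer augmented with a trainable low-rank update: W x + (B A) x."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pre-trained weights
        # Low-rank factors: A is small-random, B starts at zero so the
        # adapted layer initially matches the pre-trained one exactly.
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus scaled low-rank correction
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

# Toy stand-in for one 768-dim projection inside a CLIP visual encoder block
proj = nn.Linear(768, 768)
adapted = LoRALinear(proj, rank=4)

trainable = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
total = sum(p.numel() for p in adapted.parameters())
print(trainable, total)  # the LoRA factors are a small fraction of the layer
```

With rank 4, only the two low-rank factors (2 × 4 × 768 = 6,144 parameters) are trained, versus roughly 590K in the frozen layer, which is the kind of parameter saving the paper's "minimal trainable parameters" claim refers to.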
Benchmark Results
| Dataset | Model | Metric (lower is better) | Claimed | Verified | Status |
|---|---|---|---|---|---|
| CSL-Daily | SLA-LoRA | Word Error Rate (WER) | 25.8 | — | Unverified |
| RWTH-PHOENIX-Weather 2014 | SLA-Adapter | Word Error Rate (WER) | 18.8 | — | Unverified |
| RWTH-PHOENIX-Weather 2014 T | SLA-LoRA | Word Error Rate (WER) | 19.4 | — | Unverified |