SOTAVerified

The Surprising Effectiveness of Test-Time Training for Few-Shot Learning

2024-11-11Code Available3· sign in to hype

Ekin Akyürek, Mehul Damani, Adam Zweiger, Linlu Qiu, Han Guo, Jyothish Pari, Yoon Kim, Jacob Andreas

Code Available — Be the first to reproduce this paper.

Reproduce

Code

Abstract

Language models (LMs) have shown impressive performance on tasks within their training distribution, but often struggle with structurally novel tasks even when given a small number of in-context task examples. We investigate the effectiveness of test-time training (TTT) -- temporarily updating model parameters during inference using a loss derived from input data -- as a mechanism for improving LMs' reasoning and few-shot learning capabilities. On the Abstraction and Reasoning Corpus (ARC), performing TTT with in-context examples yields up to 6 higher accuracy compared to fine-tuned baselines -- reaching 53.0\% on the public validation set with an 8B-parameter LM and 61.9\% when ensembled with program-synthesis methods, matching average human performance. On BIG-Bench Hard (BBH), TTT on in-context examples surpasses standard few-shot prompting in the 10-shot setting by 7.3 percentage points (50.5\% to 57.8\%). Our findings highlight the limitations of in-context learning for novel tasks and demonstrate the potential of test-time training to enhance language model adaptability.

Tasks

Reproductions