xVal: A Continuous Numerical Tokenization for Scientific Language Models

2023-10-04

Siavash Golkar, Mariel Pettee, Michael Eickenberg, Alberto Bietti, Miles Cranmer, Geraud Krawezik, Francois Lanusse, Michael McCabe, Ruben Ohana, Liam Parker, Bruno Régaldo-Saint Blancard, Tiberiu Tesileanu, Kyunghyun Cho, Shirley Ho

Code

github.com/PolymathicAI/xVal (official, in paper; PyTorch; ★ 149)
github.com/lucidrains/iTransformer (PyTorch; ★ 0)

Abstract

Due in part to their discontinuous and discrete default encodings for numbers, Large Language Models (LLMs) have not yet been commonly used to process numerically-dense scientific datasets. Rendering datasets as text, however, could help aggregate diverse and multi-modal scientific data into a single training corpus, thereby potentially facilitating the development of foundation models for science. In this work, we introduce xVal, a strategy for continuously tokenizing numbers within language models that results in a more appropriate inductive bias for scientific applications. By training specially-modified language models from scratch on a variety of scientific datasets formatted as text, we find that xVal generally outperforms other common numerical tokenization strategies on metrics including out-of-distribution generalization and computational efficiency.
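
The abstract does not spell out the mechanism, so the following is only a minimal sketch of one way a continuous number encoding can work. It assumes every numeric literal is replaced by a single placeholder token whose embedding vector is multiplied by the number's value. The names `NUM_TOKEN`, `extract_numbers`, `embed`, and the toy embedding table are hypothetical and are not taken from the official repository.

```python
import re

NUM_TOKEN = "[NUM]"  # hypothetical placeholder token name
NUMBER_RE = re.compile(r"[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?")


def extract_numbers(text):
    """Replace each numeric literal with NUM_TOKEN; return masked text and values."""
    values = [float(m.group()) for m in NUMBER_RE.finditer(text)]
    return NUMBER_RE.sub(NUM_TOKEN, text), values


def embed(tokens, values, table):
    """Look up each token's vector; scale the NUM_TOKEN vector by its value."""
    nums = iter(values)
    vectors = []
    for tok in tokens:
        # Ordinary tokens keep their embedding; number tokens are scaled.
        scale = next(nums) if tok == NUM_TOKEN else 1.0
        vectors.append([scale * x for x in table[tok]])
    return vectors


# Toy 2-dimensional embedding table (assumption, for illustration only).
table = {"mass": [1.0, 0.0], "radius": [0.0, 1.0], NUM_TOKEN: [0.5, 0.5]}

masked, values = extract_numbers("mass 2.5 radius -0.5")
tokens = masked.split()
vectors = embed(tokens, values, table)
```

In this sketch, nearby values yield nearby embeddings because the number enters as a continuous scale factor rather than as a sequence of digit tokens. A real model would typically also normalize values to a bounded range and use a dedicated output head to predict numbers; both are omitted here.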

Tasks

Computational Efficiency, Inductive Bias, Out-of-Distribution Generalization
