SOTAVerified

GLM-130B: An Open Bilingual Pre-trained Model

2022-10-05 · Code Available

Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Peng Zhang, Yuxiao Dong, Jie Tang


Abstract

We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters. It is an attempt to open-source a 100B-scale model at least as good as GPT-3 (davinci) and to unveil how models of such a scale can be successfully pre-trained. Over the course of this effort, we face numerous unexpected technical and engineering challenges, particularly with loss spikes and divergence. In this paper, we introduce the training process of GLM-130B, including its design choices, training strategies for both efficiency and stability, and engineering efforts. The resulting GLM-130B model significantly outperforms GPT-3 175B (davinci) on a wide range of popular English benchmarks, an advantage not observed in OPT-175B and BLOOM-176B. It also consistently and significantly outperforms ERNIE TITAN 3.0 260B, the largest Chinese language model, across related benchmarks. Finally, we leverage a unique scaling property of GLM-130B to reach INT4 quantization without post-training and with almost no performance loss, making it the first among 100B-scale models and, more importantly, allowing effective inference on 4×RTX 3090 (24G) or 8×RTX 2080 Ti (11G) GPUs, the most affordable GPUs required for using 100B-scale models. The GLM-130B model weights are publicly accessible, and its code, training logs, related toolkit, and lessons learned are open-sourced at https://github.com/THUDM/GLM-130B/.
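The INT4 claim can be made concrete: at 4 bits per weight, 130B parameters occupy roughly 65 GB, which fits across 4×24 GB cards. Below is a minimal sketch of symmetric absmax weight quantization to INT4; the group size, function names, and the use of NumPy are illustrative assumptions for this page, not GLM-130B's exact recipe.

```python
import numpy as np

def quantize_int4_absmax(w, group_size=128):
    # Symmetric absmax quantization to 4-bit integers in [-7, 7],
    # with one scale per group of `group_size` weights.
    # (Grouping choice is an assumption for illustration.)
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights from INT4 codes and scales.
    return q.astype(np.float32) * scale

w = np.random.randn(4, 128).astype(np.float32).reshape(-1)
q, s = quantize_int4_absmax(w)
w_hat = dequantize(q, s).reshape(-1)
```

With one float scale per 128 weights, the storage cost is about 4.25 bits per parameter; the rounding error of each weight is bounded by half its group's scale.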

Benchmark Results

| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| BIG-bench-lite | GLM-130B (3-shot) | Accuracy | 15.11 | | Unverified |
| BIG-bench-lite | GLM-130B (1-shot) | Accuracy | 14.91 | | Unverified |
| BIG-bench-lite | GLM-130B (0-shot) | Accuracy | 13.31 | | Unverified |
| CLUE (AFQMC) | GLM-130B | Accuracy | 71.2 | | Unverified |
| CLUE (AFQMC) | ERNIE 3.0 Titan-260B | Accuracy | 69 | | Unverified |
| CLUE (C3) | ERNIE 3.0 Titan-260B | Accuracy | 54.9 | | Unverified |
| CLUE (C3) | GLM-130B | Accuracy | 77.5 | | Unverified |
| CLUE (CMNLI) | GLM-130B | Accuracy | 77 | | Unverified |
| CLUE (CMNLI) | ERNIE 3.0 Titan-260B | Accuracy | 51.7 | | Unverified |
| CLUE (CMRC2018) | GLM-130B | Accuracy | 55.7 | | Unverified |
| CLUE (CMRC2018) | ERNIE 3.0 Titan-260B | Accuracy | 16.6 | | Unverified |
| CLUE (DRCD) | ERNIE 3.0 Titan-260B | Accuracy | 29.5 | | Unverified |
| CLUE (DRCD) | GLM-130B | Accuracy | 77.1 | | Unverified |
| CLUE (OCNLI_50K) | ERNIE 3.0 Titan-260B | Accuracy | 44.6 | | Unverified |
| CLUE (OCNLI_50K) | GLM-130B | Accuracy | 74.7 | | Unverified |
| CLUE (WSC1.1) | ERNIE 3.0 Titan-260B | Accuracy | 81.1 | | Unverified |
| CLUE (WSC1.1) | GLM-130B | Accuracy | 83.9 | | Unverified |
| FewCLUE (BUSTM) | ERNIE 3.0 Titan-260B | Accuracy | 64.4 | | Unverified |
| FewCLUE (BUSTM) | GLM-130B | Accuracy | 77.5 | | Unverified |
| FewCLUE (CHID-FC) | ERNIE 3.0 Titan-260B | Accuracy | 87.1 | | Unverified |
| FewCLUE (CHID-FC) | GLM-130B | Accuracy | 90.1 | | Unverified |
| FewCLUE (CLUEWSC-FC) | GLM-130B | Accuracy | 77.4 | | Unverified |
| FewCLUE (CLUEWSC-FC) | ERNIE 3.0 Titan-260B | Accuracy | 53.5 | | Unverified |
| FewCLUE (EPRSTMT) | ERNIE 3.0 Titan-260B | Accuracy | 88.8 | | Unverified |
| FewCLUE (EPRSTMT) | GLM-130B | Accuracy | 92.5 | | Unverified |
| FewCLUE (OCNLI-FC) | ERNIE 3.0 Titan-260B | Accuracy | 53.8 | | Unverified |
| FewCLUE (OCNLI-FC) | GLM-130B | Accuracy | 73.8 | | Unverified |
| LAMBADA | GLM-130B (bidirectional attention) | Accuracy | 80.2 | | Unverified |
| The Pile | GPT-3 | Bits per byte | 0.74 | | Unverified |
| The Pile | Jurassic-1 | Bits per byte | 0.65 | | Unverified |
| The Pile | GLM-130B | Bits per byte | 0.63 | | Unverified |
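The Pile rows report bits per byte (BPB), a tokenizer-independent language-modeling metric: the model's total negative log-likelihood converted to bits, divided by the byte length of the text. A minimal sketch of that conversion, with made-up numbers (the function name and inputs are illustrative, not from the paper):

```python
import math

def bits_per_byte(mean_nll_nats, n_tokens, n_bytes):
    # Convert mean per-token negative log-likelihood (in nats)
    # into bits per UTF-8 byte: total nats -> total bits -> per byte.
    return (mean_nll_nats * n_tokens) / (math.log(2) * n_bytes)

# Hypothetical numbers for illustration only (not from the paper):
bpb = bits_per_byte(mean_nll_nats=1.9, n_tokens=1_000, n_bytes=4_300)
```

Because the byte count is fixed regardless of tokenizer, BPB lets models with different vocabularies (GPT-3, Jurassic-1, GLM-130B) be compared directly; lower is better.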

Reproductions