SOTAVerified

GLM-130B: An Open Bilingual Pre-trained Model

2022-10-05 · Code Available

Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Peng Zhang, Yuxiao Dong, Jie Tang


Abstract

We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters. It is an attempt to open-source a 100B-scale model at least as good as GPT-3 (davinci) and to unveil how models of such a scale can be successfully pre-trained. Over the course of this effort, we face numerous unexpected technical and engineering challenges, particularly with loss spikes and divergence. In this paper, we introduce the training process of GLM-130B, including its design choices, training strategies for both efficiency and stability, and engineering efforts. The resulting GLM-130B model significantly outperforms GPT-3 175B (davinci) on a wide range of popular English benchmarks, an advantage not observed in OPT-175B and BLOOM-176B. It also consistently and significantly outperforms ERNIE TITAN 3.0 260B, the largest Chinese language model, across related benchmarks. Finally, we leverage a unique scaling property of GLM-130B to reach INT4 quantization without post-training and with almost no performance loss, making it the first among 100B-scale models and, more importantly, allowing effective inference on 4×RTX 3090 (24G) or 8×RTX 2080 Ti (11G) GPUs, the most affordable GPUs required for using 100B-scale models. The GLM-130B model weights are publicly accessible, and its code, training logs, related toolkit, and lessons learned are open-sourced at https://github.com/THUDM/GLM-130B/.
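The INT4 claim can be made concrete: at 4 bits per weight, 130B parameters occupy roughly 65 GB, which fits across 4×24 GB cards. Below is a minimal sketch of symmetric absmax weight quantization to INT4; the group size, function names, and the use of NumPy are illustrative assumptions for this page, not GLM-130B's exact recipe.

```python
import numpy as np

def quantize_int4_absmax(w, group_size=128):
    # Symmetric absmax quantization to 4-bit integers in [-7, 7],
    # with one scale per group of `group_size` weights.
    # (Grouping choice is an assumption for illustration.)
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights from INT4 codes and scales.
    return q.astype(np.float32) * scale

w = np.random.randn(4, 128).astype(np.float32).reshape(-1)
q, s = quantize_int4_absmax(w)
w_hat = dequantize(q, s).reshape(-1)
```

With one float scale per 128 weights, the storage cost is about 4.25 bits per parameter; the rounding error of each weight is bounded by half its group's scale.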

Benchmark Results

| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| BIG-bench-lite | GLM-130B (3-shot) | Accuracy | 15.11 | | Unverified |
| BIG-bench-lite | GLM-130B (1-shot) | Accuracy | 14.91 | | Unverified |
| BIG-bench-lite | GLM-130B (0-shot) | Accuracy | 13.31 | | Unverified |
| CLUE (AFQMC) | GLM-130B | Accuracy | 71.2 | | Unverified |
| CLUE (AFQMC) | ERNIE 3.0 Titan-260B | Accuracy | 69 | | Unverified |
| CLUE (C3) | ERNIE 3.0 Titan-260B | Accuracy | 54.9 | | Unverified |
| CLUE (C3) | GLM-130B | Accuracy | 77.5 | | Unverified |
| CLUE (CMNLI) | GLM-130B | Accuracy | 77 | | Unverified |
| CLUE (CMNLI) | ERNIE 3.0 Titan-260B | Accuracy | 51.7 | | Unverified |
| CLUE (CMRC2018) | GLM-130B | Accuracy | 55.7 | | Unverified |
| CLUE (CMRC2018) | ERNIE 3.0 Titan-260B | Accuracy | 16.6 | | Unverified |
| CLUE (DRCD) | ERNIE 3.0 Titan-260B | Accuracy | 29.5 | | Unverified |
| CLUE (DRCD) | GLM-130B | Accuracy | 77.1 | | Unverified |
| CLUE (OCNLI_50K) | ERNIE 3.0 Titan-260B | Accuracy | 44.6 | | Unverified |
| CLUE (OCNLI_50K) | GLM-130B | Accuracy | 74.7 | | Unverified |
| CLUE (WSC1.1) | ERNIE 3.0 Titan-260B | Accuracy | 81.1 | | Unverified |
| CLUE (WSC1.1) | GLM-130B | Accuracy | 83.9 | | Unverified |
| FewCLUE (BUSTM) | ERNIE 3.0 Titan-260B | Accuracy | 64.4 | | Unverified |
| FewCLUE (BUSTM) | GLM-130B | Accuracy | 77.5 | | Unverified |
| FewCLUE (CHID-FC) | ERNIE 3.0 Titan-260B | Accuracy | 87.1 | | Unverified |
| FewCLUE (CHID-FC) | GLM-130B | Accuracy | 90.1 | | Unverified |
| FewCLUE (CLUEWSC-FC) | GLM-130B | Accuracy | 77.4 | | Unverified |
| FewCLUE (CLUEWSC-FC) | ERNIE 3.0 Titan-260B | Accuracy | 53.5 | | Unverified |
| FewCLUE (EPRSTMT) | ERNIE 3.0 Titan-260B | Accuracy | 88.8 | | Unverified |
| FewCLUE (EPRSTMT) | GLM-130B | Accuracy | 92.5 | | Unverified |
| FewCLUE (OCNLI-FC) | ERNIE 3.0 Titan-260B | Accuracy | 53.8 | | Unverified |
| FewCLUE (OCNLI-FC) | GLM-130B | Accuracy | 73.8 | | Unverified |
| LAMBADA | GLM-130B (bidirectional attention) | Accuracy | 80.2 | | Unverified |
| The Pile | GPT-3 | Bits per byte | 0.74 | | Unverified |
| The Pile | Jurassic-1 | Bits per byte | 0.65 | | Unverified |
| The Pile | GLM-130B | Bits per byte | 0.63 | | Unverified |
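The Pile rows report bits per byte (BPB), a tokenizer-independent language-modeling metric: the model's total negative log-likelihood converted to bits, divided by the byte length of the text. A minimal sketch of that conversion, with made-up numbers (the function name and inputs are illustrative, not from the paper):

```python
import math

def bits_per_byte(mean_nll_nats, n_tokens, n_bytes):
    # Convert mean per-token negative log-likelihood (in nats)
    # into bits per UTF-8 byte: total nats -> total bits -> per byte.
    return (mean_nll_nats * n_tokens) / (math.log(2) * n_bytes)

# Hypothetical numbers for illustration only (not from the paper):
bpb = bits_per_byte(mean_nll_nats=1.9, n_tokens=1_000, n_bytes=4_300)
```

Because the byte count is fixed regardless of tokenizer, BPB lets models with different vocabularies (GPT-3, Jurassic-1, GLM-130B) be compared directly; lower is better.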

Reproductions