BERT, mBERT, or BiBERT? A Study on Contextualized Embeddings for Neural Machine Translation

2021-09-09 · EMNLP 2021

Haoran Xu, Benjamin Van Durme, Kenton Murray

Code

github.com/fe1ixxu/BiBERT (official, in paper; PyTorch; ★ 32)
github.com/he1ght/BiBERT_CE (PyTorch; ★ 0)

Abstract

The success of bidirectional encoders using masked language models, such as BERT, on numerous natural language processing tasks has prompted researchers to attempt to incorporate these pre-trained models into neural machine translation (NMT) systems. However, proposed methods for incorporating pre-trained models are non-trivial and mainly focus on BERT, leaving the impact that other pre-trained models may have on translation performance uncompared. In this paper, we demonstrate that simply using the output (contextualized embeddings) of a tailored and suitable bilingual pre-trained language model (dubbed BiBERT) as the input of the NMT encoder achieves state-of-the-art translation performance. Moreover, we also propose a stochastic layer selection approach and the concept of a dual-directional translation model to ensure the sufficient utilization of contextualized embeddings. Without using back translation, our best models achieve BLEU scores of 30.45 for En->De and 38.61 for De->En on the IWSLT'14 dataset, and 31.26 for En->De and 34.94 for De->En on the WMT'14 dataset, which exceed all published numbers.
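
The abstract names a stochastic layer selection approach without describing its mechanics. The following is a minimal sketch of one plausible reading, not the authors' implementation: it assumes that during training one of the last few layers of the pre-trained model is picked at random, and that at inference the same layers are averaged. The function name `select_layer_output`, the parameter `num_candidates`, and the nested-list representation of layer outputs are all hypothetical and chosen only for illustration.

```python
import random
from typing import List, Optional

Vector = List[float]
LayerOutput = List[Vector]  # one embedding vector per token


def select_layer_output(
    layer_outputs: List[LayerOutput],
    num_candidates: int,
    training: bool,
    rng: Optional[random.Random] = None,
) -> LayerOutput:
    """Choose contextualized embeddings from the last `num_candidates` layers.

    Assumption (not stated in the abstract): sample one layer while training,
    average the candidate layers at inference.
    """
    if not 1 <= num_candidates <= len(layer_outputs):
        raise ValueError("num_candidates must be between 1 and the number of layers")

    candidates = layer_outputs[-num_candidates:]

    if training:
        # Training: return the output of one randomly chosen candidate layer.
        chooser = rng if rng is not None else random
        return chooser.choice(candidates)

    # Inference: element-wise mean over the candidate layers.
    num_tokens = len(candidates[0])
    dim = len(candidates[0][0])
    return [
        [sum(layer[t][d] for layer in candidates) / num_candidates for d in range(dim)]
        for t in range(num_tokens)
    ]
```

Under this reading, the returned vectors would be passed to the NMT encoder as its input, as the abstract describes, so the translation model sees several layers of the pre-trained model over the course of training rather than only the final one.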

Tasks

de-en, Language Modeling, Language Modelling, Machine Translation, NMT, Translation

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
IWSLT2014 German-English	BiBERT	BLEU score	38.61	—	Unverified
WMT2014 English-German	BiBERT	BLEU score	31.26	—	Unverified
WMT2014 German-English	BiBERT	BLEU score	34.94	—	Unverified
