
Cross-Domain Language Modeling: An Empirical Investigation

2021-12-01 · ALTA 2021

Vincent Nguyen, Sarvnaz Karimi, Maciej Rybinski, Zhenchang Xing


Abstract

Transformer encoder models exhibit strong performance in single-domain applications. In cross-domain settings, however, a shared sub-word vocabulary produces sub-word overlap across domains, which is problematic when the overlapping sub-words carry no shared semantics between domains. We hypothesize that alleviating this overlap allows for more effective modeling of multi-domain tasks; we consider the biomedical and general domains in this paper. We present a study on reducing sub-word overlap by scaling the vocabulary size of a Transformer encoder model while pretraining on multiple domains. We observe a significant increase in downstream performance in the general-biomedical cross-domain setting resulting from the reduction in sub-word overlap.
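The central quantity in the study is the degree of sub-word overlap between domains. As a rough illustration only (not the paper's measurement procedure), the sketch below tokenizes a general-domain and a biomedical sample with an off-the-shelf sub-word tokenizer and reports the Jaccard overlap of the resulting sub-word types; the tokenizer name and sample sentences are assumptions chosen for illustration.

```python
# Minimal sketch (not the authors' code): estimating sub-word type overlap
# between a general-domain and a biomedical text sample using a pretrained
# sub-word tokenizer. The sample sentences are illustrative placeholders.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

general_text = "The committee approved the budget after a long debate."
biomedical_text = "Erythropoietin administration increased haemoglobin concentration."

# Tokenize each sample into sub-words and collect the distinct sub-word types.
general_subwords = set(tokenizer.tokenize(general_text))
biomedical_subwords = set(tokenizer.tokenize(biomedical_text))

# Jaccard overlap of sub-word types: shared types / all observed types.
shared = general_subwords & biomedical_subwords
union = general_subwords | biomedical_subwords
overlap = len(shared) / len(union) if union else 0.0

print(f"shared sub-word types: {sorted(shared)}")
print(f"sub-word overlap (Jaccard): {overlap:.3f}")
```

In practice such an estimate would be computed over large corpora from each domain rather than single sentences; the point of the sketch is only to make the notion of cross-domain sub-word overlap concrete.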
