Optimizing Word Segmentation for Downstream Task

2020-11-01 · Findings of the Association for Computational Linguistics · Code Available

Tatsuya Hiraoka, Sho Takase, Kei Uchiumi, Atsushi Keyaki, Naoaki Okazaki


Code

github.com/tatHi/optok
Official · In paper · PyTorch · ★ 11

Abstract

In traditional NLP, a sentence is tokenized as a preprocessing step, so the tokenization is independent of the target downstream task. To address this issue, we propose a novel method that searches for a tokenization appropriate for the downstream task. Our proposed method, optimizing tokenization (OpTok), is trained to assign a high probability to such an appropriate tokenization based on the downstream task loss. OpTok can be used for any downstream task that uses a vector representation of a sentence, such as text classification. Experimental results demonstrate that OpTok improves the performance of sentiment analysis and textual entailment. In addition, we introduce OpTok into BERT, a state-of-the-art contextualized embedding model, and report a positive effect.
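The core idea in the abstract, scoring candidate tokenizations so that the sentence vector (and hence the downstream loss) depends on those scores, can be illustrated with a small forward-pass sketch. This is not the authors' implementation (see the official repository for that). It assumes a per-token log-score table, exhaustive enumeration of candidates, and mean-pooled embeddings, all of which are hypothetical choices made for illustration, as are every name and number below.

```python
import math

def segmentations(text, vocab, max_len=6):
    """Enumerate every split of `text` into tokens that appear in `vocab`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, min(len(text), max_len) + 1):
        head = text[:end]
        if head in vocab:
            for tail in segmentations(text[end:], vocab, max_len):
                results.append([head] + tail)
    return results

def tokenization_probs(candidates, token_logp):
    """Softmax over candidates of their summed per-token log-scores."""
    scores = [sum(token_logp[t] for t in cand) for cand in candidates]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sentence_vector(candidates, probs, embed):
    """Probability-weighted sum of mean-pooled token embeddings."""
    dim = len(next(iter(embed.values())))
    vec = [0.0] * dim
    for cand, p in zip(candidates, probs):
        for d in range(dim):
            pooled = sum(embed[t][d] for t in cand) / len(cand)
            vec[d] += p * pooled
    return vec

# Toy, hypothetical scores and embeddings (not taken from the paper).
token_logp = {"un": -1.0, "able": -1.0, "u": -3.0, "n": -3.0, "unable": -2.5}
embed = {
    "un": [1.0, 0.0],
    "able": [0.0, 1.0],
    "u": [0.5, 0.5],
    "n": [0.5, 0.5],
    "unable": [1.0, 1.0],
}

candidates = segmentations("unable", token_logp)
probs = tokenization_probs(candidates, token_logp)
vec = sentence_vector(candidates, probs, embed)

for cand, p in zip(candidates, probs):
    print(cand, round(p, 3))
print("sentence vector:", vec)
```

Because the sentence vector is a weighted sum over candidates, a downstream loss computed from it would be differentiable with respect to the token scores in an autograd framework. That dependence is what would let training shift probability toward tokenizations that help the task.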

Tasks

Natural Language Inference, Segmentation, Sentence, Sentiment Analysis, Text Classification
