Can Data Diversity Enhance Learning Generalization?

2022-10-01 · COLING 2022

Yu Yu, Shahram Khadivi, Jia Xu


Abstract

This paper introduces DAAC, our Diversity Advanced Actor-Critic reinforcement learning (A2C) framework, to improve the generalization and accuracy of Natural Language Processing (NLP) models. We show that diversifying the training samples alleviates overfitting and improves model generalization and accuracy. We quantify the diversity of a set of samples using max dispersion, convex hull volume, and graph entropy, all computed over sentence embeddings in a high-dimensional metric space. We also introduce A2C to select such a diversified training subset efficiently. Our experiments show an accuracy increase of up to +23.8 (38.0% relative) in sentiment analysis, a perplexity decrease of up to 44.7 (37.9% relative) in language modeling, and consistent improvements in named entity recognition across various domains. In particular, our method outperforms both domain adaptation and domain generalization baselines without using any target-domain knowledge.
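To make the idea of scoring and selecting a diverse subset concrete, here is a minimal sketch. It is not the authors' implementation: the abstract does not define its metrics precisely, so the sketch assumes "dispersion" means the smallest pairwise Euclidean distance within a set, uses random vectors as a stand-in for sentence embeddings, and replaces the A2C selector with a simple greedy farthest-point heuristic. All function names are hypothetical.

```python
import numpy as np

def pairwise_distances(X):
    # Euclidean distance between every pair of rows in X.
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def min_pairwise_dispersion(X):
    # Assumed diversity score: the smallest distance between two
    # distinct points. Larger values mean the set is more spread out.
    d = pairwise_distances(X)
    upper = np.triu_indices(len(X), k=1)
    return d[upper].min()

def greedy_diverse_subset(X, k, start=0):
    # Farthest-point greedy heuristic (a stand-in for the A2C selector):
    # repeatedly add the point farthest from the already selected set.
    d = pairwise_distances(X)
    selected = [start]
    dist_to_set = d[start].copy()
    while len(selected) < k:
        nxt = int(np.argmax(dist_to_set))
        selected.append(nxt)
        dist_to_set = np.minimum(dist_to_set, d[nxt])
    return selected

# Random vectors standing in for sentence embeddings (an assumption).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 16))

subset = greedy_diverse_subset(embeddings, k=10)
random_subset = rng.choice(200, size=10, replace=False)

print("greedy dispersion:", min_pairwise_dispersion(embeddings[subset]))
print("random dispersion:", min_pairwise_dispersion(embeddings[random_subset]))
```

The greedy heuristic is deterministic and cheap, which makes it a convenient baseline to compare a learned selector against. Convex hull volume and graph entropy would need their own definitions and are left out here.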

Tasks

Diversity, Domain Adaptation, Language Modeling, Named Entity Recognition (NER), Reinforcement Learning (RL), Sentence, Sentence Embeddings, Sentiment Analysis
