A Novel Efficient and Effective Preprocessing Strategy for Text Classification

2021-11-16 · ACL ARR November 2021

Anonymous

Abstract

Text classification is an essential task in natural language processing. Preprocessing, which determines the representation of text features, is one of the key steps in a text classification architecture. This paper proposes a novel, efficient, and effective preprocessing strategy comprising three methods for text classification, using the OMP algorithm to complete the classification. The main idea of our new preprocessing strategy is to combine regular filtering and/or stopword removal with tokenization and lowercase conversion, which can effectively reduce the feature dimension and improve the quality of the text feature matrix to some extent. Simulation tests on the 20Newsgroups dataset show that, compared with the existing state-of-the-art method, our best new method reduces the number of features by 19.85%, 34.35%, 26.25%, and 38.67% and increases the speed of text classification by 17.38%, 25.64%, 23.76%, and 33.38%, with similar classification accuracy, on the religion, computer, science, and sport data, respectively.
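
To make the combination described in the abstract concrete, the following is a minimal Python sketch of one possible pipeline: lowercase conversion, a regular-expression filter that doubles as the tokenizer, and optional stopword removal. The abstract does not specify the filtering rules, the stopword list, or the three method variants, so the pattern `[a-z]+`, the tiny `STOPWORDS` set, and the function names `preprocess` and `baseline_tokens` are illustrative assumptions rather than the authors' implementation.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption for illustration;
# the paper's actual list is not given in the abstract).
STOPWORDS = frozenset(
    {"a", "an", "the", "is", "are", "of", "to", "and", "in", "on", "for", "with"}
)

# Assumed filtering rule: keep only runs of ASCII letters, which drops
# digits and punctuation and splits the text into tokens in one pass.
TOKEN_PATTERN = re.compile(r"[a-z]+")


def preprocess(text, remove_stopwords=True):
    """Lowercase, filter and tokenize with a regex, optionally drop stopwords."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    if remove_stopwords:
        tokens = [tok for tok in tokens if tok not in STOPWORDS]
    return tokens


def baseline_tokens(text):
    """Comparison point: lowercase plus whitespace tokenization only."""
    return text.lower().split()


if __name__ == "__main__":
    sample = "The team WON 3 games in 2021!"
    print(baseline_tokens(sample))  # keeps digits, punctuation, and stopwords
    print(preprocess(sample))       # filtered tokens only
```

On the sample sentence, the whitespace baseline keeps seven tokens, including `3` and `2021!`, while the filtered pipeline keeps three. This illustrates how such filtering can shrink the feature dimension, though it says nothing about the specific reductions reported above.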
