Non-Zipfian Distribution of Stopwords and Subset Selection Models

2026-03-05

Wentian Li, Oscar Fontanelli


Abstract

Stopwords are words that contribute little information about the content or meaning of a text. Most stopwords are function words, but they can also be common verbs, adjectives, and adverbs. In contrast to the well-known Zipf's law for the rank-frequency plot of all words, the rank-frequency plot of stopwords is best fitted by the Beta Rank Function (BRF). The rank-frequency plot of non-stopwords also deviates from Zipf's law, but it is fitted better by log-token-count as a quadratic function of log-rank than by the BRF. Based on the observed ranks of stopwords in the full word list, we propose a stopword (subset) selection model in which the probability of a word being selected, as a function of its rank r, is a decreasing Hill's function, 1/(1+(r/r_mid)^γ), whereas the probability of not being selected is the standard Hill's function, 1/(1+(r_mid/r)^γ). We validate this selection probability model by direct estimation from an independent collection of texts. We also show analytically that this model leads to a BRF rank-frequency distribution for stopwords when the original full word list follows Zipf's law, and that it explains the quadratic fitting function for the non-stopwords.
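The selection model described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. The parameter values (r_mid = 100, γ = 1.5, a vocabulary of 10,000 words), the Zipf exponent of 1, and the function names are all assumptions chosen only for illustration. The sketch draws a random subset of a Zipfian word list using the decreasing Hill's function and re-ranks the selected and unselected words separately.

```python
import numpy as np

def p_select(r, r_mid, gamma):
    """Decreasing Hill's function: probability that the word at rank r is selected."""
    r = np.asarray(r, dtype=float)
    return 1.0 / (1.0 + (r / r_mid) ** gamma)

def p_not_select(r, r_mid, gamma):
    """Standard Hill's function: probability that the word at rank r is not selected."""
    r = np.asarray(r, dtype=float)
    return 1.0 / (1.0 + (r_mid / r) ** gamma)

# Hypothetical parameters, not taken from the paper.
n_words = 10_000
r_mid = 100.0
gamma = 1.5

ranks = np.arange(1, n_words + 1)
freq = 1.0 / ranks  # assumed Zipf's law with exponent 1 for the full word list

# Select each word independently with the rank-dependent probability.
rng = np.random.default_rng(0)
selected = rng.random(n_words) < p_select(ranks, r_mid, gamma)

# Frequencies are already in descending order, so re-ranking within each
# subset is just a fresh 1..n index over the kept entries.
stop_freq = freq[selected]
stop_rank = np.arange(1, stop_freq.size + 1)
nonstop_freq = freq[~selected]
nonstop_rank = np.arange(1, nonstop_freq.size + 1)
```

Plotting stop_freq against stop_rank and nonstop_freq against nonstop_rank on log-log axes gives a quick visual check of how each subset departs from the straight line expected under Zipf's law. Note that the two Hill's functions sum to one at every rank, so each word is either selected or not.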
