BanglaBook: A Large-scale Bangla Dataset for Sentiment Analysis from Book Reviews
Mohsinul Kabir, Obayed Bin Mahfuz, Syed Rifat Raiyan, Hasan Mahmud, Md Kamrul Hasan
Code
- github.com/mohsinulkabir14/banglabook (official implementation, ★ 18)
Abstract
The analysis of consumer sentiment, as expressed through reviews, can provide a wealth of insight regarding the quality of a product. While sentiment analysis has been widely explored in many popular languages, relatively little attention has been given to the Bangla language, mostly due to a lack of relevant data and cross-domain adaptability. To address this limitation, we present BanglaBook, a large-scale dataset of Bangla book reviews consisting of 158,065 samples classified into three broad categories: positive, negative, and neutral. We provide a detailed statistical analysis of the dataset and employ a range of machine learning models to establish baselines, including SVM, LSTM, and Bangla-BERT. Our findings demonstrate a substantial performance advantage of pre-trained models over models that rely on manually crafted features, emphasizing the necessity for additional training resources in this domain. Additionally, we conduct an in-depth error analysis by examining sentiment unigrams, which may provide insight into common classification errors in under-resourced languages like Bangla. Our code and data are publicly available at https://github.com/mohsinulkabir14/BanglaBook.
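The abstract describes labeling the 158,065 reviews into three broad sentiment categories. A minimal sketch of one plausible way such labels could be derived from star ratings is shown below; the exact rating-to-label mapping here is an assumption for illustration, not necessarily the authors' procedure.

```python
# Hedged sketch: mapping a 1-5 star rating to the three sentiment
# classes used in BanglaBook. The threshold choices are assumptions.
def rating_to_sentiment(rating: int) -> str:
    if rating >= 4:
        return "positive"
    if rating == 3:
        return "neutral"
    return "negative"

print([rating_to_sentiment(r) for r in [5, 3, 1]])
# → ['positive', 'neutral', 'negative']
```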
Tasks
- Sentiment Analysis
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| BanglaBook | Bangla-BERT (large) | Weighted Average F1-score | 0.93 | — | Unverified |
| BanglaBook | Random Forest (word 2-gram + word 3-gram) | Weighted Average F1-score | 0.91 | — | Unverified |
| BanglaBook | Bangla-BERT (base-uncased) | Weighted Average F1-score | 0.91 | — | Unverified |
| BanglaBook | SVM (word 2-gram + word 3-gram) | Weighted Average F1-score | 0.91 | — | Unverified |
| BanglaBook | Random Forest (word 1-gram) | Weighted Average F1-score | 0.90 | — | Unverified |
| BanglaBook | Logistic Regression (char 2-gram + char 3-gram) | Weighted Average F1-score | 0.90 | — | Unverified |
| BanglaBook | Logistic Regression (word 2-gram + word 3-gram) | Weighted Average F1-score | 0.90 | — | Unverified |
| BanglaBook | XGBoost (char 2-gram + char 3-gram) | Weighted Average F1-score | 0.87 | — | Unverified |
| BanglaBook | Multinomial NB (word 2-gram + word 3-gram) | Weighted Average F1-score | 0.87 | — | Unverified |
| BanglaBook | XGBoost (word 2-gram + word 3-gram) | Weighted Average F1-score | 0.87 | — | Unverified |
| BanglaBook | Multinomial NB (BoW) | Weighted Average F1-score | 0.86 | — | Unverified |
| BanglaBook | SVM (word 1-gram) | Weighted Average F1-score | 0.85 | — | Unverified |
| BanglaBook | LSTM (GloVe) | Weighted Average F1-score | 0.10 | — | Unverified |
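Several rows in the table pair a classical classifier with word n-gram features, e.g. "SVM (word 2-gram + word 3-gram)". The sketch below shows what such a baseline might look like with scikit-learn; the TF-IDF weighting, toy data, and model settings are illustrative assumptions, not the authors' exact configuration.

```python
# Illustrative sketch of a word 2-gram + 3-gram SVM baseline, in the
# spirit of the table above. Not the authors' exact setup.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny English stand-in reviews; the real dataset has 158,065
# Bangla samples labeled positive / negative / neutral.
texts = [
    "a wonderful book truly loved it",
    "terrible plot waste of time",
    "it was okay nothing special overall",
    "really great writing and a wonderful story",
    "boring and terrible would not recommend",
    "an average read nothing special here",
]
labels = ["positive", "negative", "neutral",
          "positive", "negative", "neutral"]

# Word-level 2- and 3-grams feeding a linear SVM.
model = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="word", ngram_range=(2, 3))),
    ("svm", LinearSVC()),
])
model.fit(texts, labels)
print(model.predict(["truly loved it a wonderful story"]))
```

On the real data one would train on the published splits and report the weighted-average F1 shown in the table; with only bigram/trigram features, coverage of short reviews can be sparse, which is one reason the paper also evaluates unigram and character n-gram variants.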