Enhancing Assamese NLP Capabilities: Introducing a Centralized Dataset Repository

2024-10-15

S. Tamang, D. J. Bora

Code

github.com/indian-nlp/assamese-dataset
Official repository, 4 stars

Abstract

This paper introduces a centralized, open-source dataset repository designed to advance natural language processing (NLP) and neural machine translation (NMT) for Assamese, a low-resource language. The repository, available on GitHub, supports tasks such as sentiment analysis, named entity recognition, and machine translation by providing both pre-training and fine-tuning corpora. We review existing datasets, highlight the need for standardized resources in Assamese NLP, and discuss potential applications in AI-driven research, such as LLMs, OCR, and chatbots. Although the repository is promising, challenges such as data scarcity and linguistic diversity remain. The repository aims to foster collaboration and innovation, promoting Assamese language research in the digital age.

Tasks

Diversity, Machine Translation, Named Entity Recognition, NMT, Optical Character Recognition (OCR), Sentiment Analysis, Translation
