Lemmatization

Lemmatization is the process of determining the base or dictionary form (lemma) of a given surface form. Normalizing words to their base forms is especially important for morphologically rich languages, where it supports applications such as search engines and linguistic studies. The main difficulties are handling previously unseen words at inference time and disambiguating surface forms that, depending on context, can be inflected variants of several different base forms.

Source: Universal Lemmatizer: A Sequence to Sequence Model for Lemmatizing Universal Dependencies Treebanks
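
The two difficulties named above can be illustrated with a minimal sketch. It is not the method of the cited paper or of any listed system: the lexicon, the suffix rules, and the `lemmatize` function are hypothetical, and a coarse part-of-speech tag stands in for the sentence context that real lemmatizers use. Ambiguous forms are resolved by looking up the form together with its tag, and unseen words fall back to simple suffix stripping.

```python
# Hypothetical toy lexicon: (lowercased surface form, coarse POS tag) -> lemma.
# The POS tag stands in for context when a form has several possible lemmas.
LEXICON = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("leaves", "VERB"): "leave",
    ("leaves", "NOUN"): "leaf",
    ("mice", "NOUN"): "mouse",
}

# Hypothetical fallback rules for unseen words: (POS, suffix, replacement).
# Rules are tried in order, so longer suffixes come first.
SUFFIX_RULES = [
    ("NOUN", "ies", "y"),
    ("NOUN", "s", ""),
    ("VERB", "ing", ""),
    ("VERB", "ed", ""),
]


def lemmatize(form: str, pos: str) -> str:
    """Return a lowercased lemma for a surface form given its POS tag."""
    word = form.lower()

    # 1. Known form: the (form, POS) pair disambiguates between lemmas.
    if (word, pos) in LEXICON:
        return LEXICON[(word, pos)]

    # 2. Unseen form: apply the first matching suffix rule, keeping
    #    at least two characters of the stem.
    for rule_pos, suffix, replacement in SUFFIX_RULES:
        if pos == rule_pos and word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: len(word) - len(suffix)] + replacement

    # 3. No rule applies: return the form unchanged.
    return word


print(lemmatize("saw", "VERB"))        # lexicon lookup, verb reading
print(lemmatize("saw", "NOUN"))        # lexicon lookup, noun reading
print(lemmatize("treebanks", "NOUN"))  # unseen word, suffix fallback
```

Suffix stripping of this kind breaks down quickly on irregular or richly inflected forms, which is one reason many of the systems listed below learn lemmatization from annotated data instead.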

Papers

Showing 1–50 of 351 papers

Title	Date	Tasks	Status	Hype	Score
Open-Source Web Service with Morphological Dictionary-Supplemented Deep Learning for Morphosyntactic Analysis of Czech	Jun 18, 2024	Deep Learning, Dependency Parsing	Code Available	3	5
Top2Vec: Distributed Representations of Topics	Aug 19, 2020	Lemmatization, Semantic Similarity	Code Available	2	5
DadmaTools: Natural Language Processing Toolkit for Persian Language	Jul 1, 2022	Chunking, Constituency Parsing	Code Available	2	5
ELIT: Emory Language and Information Toolkit	Sep 8, 2021	AMR Parsing, Constituency Parsing	Code Available	1	5
ParsiPy: NLP Toolkit for Historical Persian Texts in Python	Mar 22, 2025	Lemmatization, Part-Of-Speech Tagging	Code Available	1	5
Sentence Embedding Models for Ancient Greek Using Multilingual Knowledge Distillation	Aug 24, 2023	Authorship Attribution, Knowledge Distillation	Code Available	1	5
Lemmatization of Historical Old Literary Finnish Texts in Modern Orthography	Jul 7, 2021	Lemmatization	Code Available	1	5
HuSpaCy: an industrial-strength Hungarian natural language processing toolkit	Jan 6, 2022	Dependency Parsing, Lemmatization	Code Available	1	5
A State-of-the-Art Morphosyntactic Parser and Lemmatizer for Ancient Greek	Oct 15, 2024	Lemmatization	Code Available	1	5
Stanza: A Python Natural Language Processing Toolkit for Many Human Languages	Mar 16, 2020	Coreference Resolution, Dependency Parsing	Code Available	1	5
Opera Graeca Adnotata: Building a 34M+ Token Multilayer Corpus for Ancient Greek	Mar 31, 2024	Lemmatization, Sentence	Code Available	1	5
Exploring Large Language Models for Classical Philology	May 23, 2023	Benchmarking, Decoder	Code Available	1	5
Neural Morphology Dataset and Models for Multiple Languages, from the Large to the Endangered	May 26, 2021	Lemmatization, Morphological Analysis	Code Available	1	5
Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing	Jan 9, 2021	Dependency Parsing, Language Modeling	Code Available	1	5
Hybrid lemmatization in HuSpaCy	Jun 13, 2023	Lemmatization	Code Available	1	5
TopicModel4J: A Java Package for Topic Models	Oct 28, 2020	Lemmatization, Topic Models	Code Available	1	5
Advancing Hungarian Text Processing with HuSpaCy: Efficient and Accurate NLP Pipelines	Aug 24, 2023	All, Boundary Detection	Code Available	1	5
KLPT – Kurdish Language Processing Toolkit	Nov 1, 2020	Diversity, Lemmatization	Code Available	1	5
One Model is All You Need: ByT5-Sanskrit, a Unified Model for Sanskrit NLP Tasks	Sep 20, 2024	All, Dependency Parsing	Code Available	1	5
Resource-Size matters: Improving Neural Named Entity Recognition with Optimized Large Corpora	Jul 26, 2018	Lemmatization, named-entity-recognition	Code Available	0	5
Neural Transition-based String Transduction for Limited-Resource Setting in Morphology	Aug 1, 2018	Lemmatization, Machine Translation	Code Available	0	5
Revisiting NMT for Normalization of Early English Letters	Jun 1, 2019	Lemmatization, Machine Translation	Code Available	0	5
Morpheus: A Neural Network for Jointly Learning Contextual Lemmatization and Morphological Tagging	Aug 1, 2019	Decoder, Lemmatization	Code Available	0	5
Morphological Tagging and Lemmatization of Albanian: A Manually Annotated Corpus and Neural Models	Dec 2, 2019	Lemmatization, Morphological Tagging	Code Available	0	5
BanLemma: A Word Formation Dependent Rule and Dictionary Based Bangla Lemmatizer	Nov 6, 2023	Lemmatization, Sentence	Code Available	0	5
Morphological parsing of low‑resource languages	May 29, 2019	Lemmatization, Morphological Analysis	Code Available	0	5
An Automated Text Categorization Framework based on Hyperparameter Optimization	Apr 6, 2017	Authorship Attribution, General Classification	Code Available	0	5
NLP-Cube: End-to-End Raw Text Processing With Neural Networks	Oct 1, 2018	Lemmatization, Sentence	Code Available	0	5
LemmaTag: Jointly Tagging and Lemmatizing for Morphologically Rich Languages with BRNNs	Oct 1, 2018	Lemmatization, Machine Translation	Code Available	0	5
Lexicon and Rule-based Word Lemmatization Approach for the Somali Language	Aug 3, 2023	Articles, Information Retrieval	Code Available	0	5
Joint Learning of POS and Dependencies for Multilingual Universal Dependency Parsing	Oct 1, 2018	Dependency Parsing, Lemmatization	Code Available	0	5
Analyzing Pre-processing Settings for Urdu Single-document Extractive Summarization	May 1, 2016	Extractive Summarization, Lemmatization	Code Available	0	5
Improving Lemmatization of Non-Standard Languages with Joint Learning	Mar 16, 2019	Decoder, Language Modeling	Code Available	0	5
Grammatical gender associations outweigh topical gender bias in crosslinguistic word embeddings	May 18, 2020	Cultural Vocal Bursts Intensity Prediction, Lemmatization	Code Available	0	5
A Study of fastText Word Embedding Effects in Document Classification in Bangla Language	Jul 30, 2020	Classification, Document Classification	Code Available	0	5
From Text to Lexicon: Bridging the Gap between Word Embeddings and Lexical Resources	Aug 1, 2018	Coreference Resolution, Lemmatization	Code Available	0	5
Knowledge Authoring with Factual English	Aug 5, 2022	Lemmatization, Part-Of-Speech Tagging	Code Available	0	5
Enhancing Sequence-to-Sequence Neural Lemmatization with External Resources	Jan 28, 2021	Data Augmentation, Decoder	Code Available	0	5
DBTagger: Multi-Task Learning for Keyword Mapping in NLIDBs Using Bi-Directional Recurrent Neural Networks	Jan 11, 2021	Lemmatization, Multi-Task Learning	Code Available	0	5
Transformers on Multilingual Clause-Level Morphology	Nov 3, 2022	Data Augmentation, Language Modelling	Code Available	0	5
Training Data Augmentation for Context-Sensitive Neural Lemmatization Using Inflection Tables and Raw Text	Apr 2, 2019	Data Augmentation, LEMMA	Code Available	0	5
Heidelberg-Boston @ SIGTYP 2024 Shared Task: Enhancing Low-Resource Language Analysis With Character-Aware Hierarchical Transformers	May 30, 2024	Lemmatization, Morphological Tagging	Code Available	0	5
Development of a Hindi Lemmatizer	May 24, 2013	Lemmatization, Machine Translation	Code Available	0	5
Imitation Learning for Neural Morphological String Transduction	Aug 31, 2018	Imitation Learning, Lemmatization	Code Available	0	5
Integrated Sequence Tagging for Medieval Latin Using Deep Representation Learning	Mar 4, 2016	LEMMA, Lemmatization	Code Available	0	5
IUCM at SemEval-2018 Task 11: Similar-Topic Texts as a Comprehension Knowledge Source	Jun 1, 2018	Clustering, Lemmatization	Code Available	0	5
Evaluating Shortest Edit Script Methods for Contextual Lemmatization	Mar 25, 2024	LEMMA, Lemmatization	Code Available	0	5
Cross-Lingual Lemmatization and Morphology Tagging with Two-Stage Multilingual BERT Fine-Tuning	Aug 1, 2019	Lemmatization, Morphological Analysis	Code Available	0	5
CMU-01 at the SIGMORPHON 2019 Shared Task on Crosslinguality and Context in Morphology	Jul 23, 2019	LEMMA, Lemmatization	Code Available	0	5
Cross-lingual Named Entity Corpus for Slavic Languages	Mar 30, 2024	LEMMA, Lemmatization	Code Available	0	5

No leaderboard results yet.