Stemming vs Lemmatization in NLP: Main Differences


Stemming?

Stemming and lemmatization used to be the most confusing terms for me, but once our lecturer broke them down it became a lot easier for me to explain them to somebody else. Let me try to break it down… In stemming, a word may be mapped to an incorrect or non-existent root word, whereas in lemmatization a word is always mapped to a correct lemma. Lemmatization, however, is computationally more difficult and hence slower than stemming. The first documented algorithm for stemming the English language was proposed by Martin Porter in 1980 and is, as a result, often known as the Porter stemming algorithm. The Porter stemming algorithm is still the most widely used method for stemming English; other languages usually have different algorithms that are better suited to them. With stemming, the word “similarity” is reduced to the stem “simil” and “computing” is reduced to the stem “comput.” Such reductions are useful in search and information retrieval (IR) systems, because a document containing the word “computing” is relevant to a query containing the word “compute.”

Based on my understanding, stemming is the process of reducing inflected words to their stem, base, or root form, generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if that stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms, as a kind of query expansion. A stemming algorithm reduces the words “chocolates,” “chocolatey,” and “choco” to the root word “chocolate,” and “retrieval,” “retrieved,” and “retrieves” to the common stem “retrieve.” Stemming is usually a crude process that chops off the ends of words. A different process, called lemmatization, uses vocabulary and morphological analysis: it aims to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
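
As a quick sanity check of those examples, here is a minimal sketch using NLTK's Porter stemmer. Note that a real stemmer's output can differ from the idealized forms above; Porter, for instance, typically yields “retriev” rather than the dictionary word “retrieve.”

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["chocolates", "chocolatey", "choco", "retrieval", "retrieved", "retrieves"]:
    print(word, "|", stemmer.stem(word))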

Stemming Algorithms

How do they work, you may ask? Well, stemming algorithms work by cutting off the end or the beginning of a word, taking into account a list of common prefixes and suffixes that can be found in it. The main goal of stemming is to cut down the words of a text, reducing grammatically varied word forms to a very simple base form, so that words with similar meanings can be treated as the same root word. The algorithm uses affix strippers, which remove common affixes from the words. Some commonly used stemming algorithms are the Porter, Lancaster, Lovins, and Paice/Husk stemming algorithms. Since it focuses on just a single word at a time, the prefix stripper and suffix stripper in such an algorithm can overcut the word, and stripping usually slows down the processing needed to get the root forms of the words. For a modern search engine, it is better to look for a stemmer that helps in creating an index with actual language forms than for one that merely speeds up querying.
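
NLTK ships implementations of several of these (its Lancaster stemmer is an implementation of the Paice/Husk algorithm; Lovins is not included), so a minimal sketch to compare how aggressively they cut a word is:

from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

# The stemmers often disagree on how much of the word to strip
for stemmer in (PorterStemmer(), LancasterStemmer(), SnowballStemmer("english")):
    print(type(stemmer).__name__, "|", stemmer.stem("generously"))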

Stemming and lemmatization, in the field of natural language processing, are pre-processing or normalization techniques, often performed as a step before processing the actual information, either to reduce the dimensionality of the data or to convert each word into its base or root form, which ensures that different forms of the same word will be considered the same. Stemming and lemmatization are the procedures by which we arrive at the root forms of words. What really sets the two apart is that lemmatization comes with an intelligent morphological analysis and derives the contextually correct base forms of words, while stemming simply cuts off the ends of words, often incorrectly, and in most cases with no actual understanding of the context. The point of a stem is simply to group related word forms under one form, even if that stem is a knowingly and purposefully invented non-word.

To elaborate further, below are some code snippets:

Input:

import nltk   # NLTK provides the stemmers used below
import spacy  # spaCy is used for lemmatization later on


from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

words = ["eating", "eats", "eat", "ate", "adjustable", "rafting", "ability", "meeting"]

# Print each word next to its Porter stem
for word in words:
    print(word, "|", stemmer.stem(word))

Output:

eating | eat
eats | eat
eat | eat
ate | ate
adjustable | adjust
rafting | raft
ability | abil
meeting | meet

Definition and Fundamentals of Lemmatization:

Lemmatization, on the other hand, is a technique that reduces inflected words properly to the base or dictionary form of a word, which is generally referred to as a lemma. In other words, lemmatization ensures that the root word belongs to the language and is a valid word, which means that a meaningful interpretation can still be made after the transformation. The purpose of lemmatization is to accurately and completely identify the base forms of words, which are also known as root words. These final root words have dictionary meanings and are referred to as lemmas in specialized areas. Furthermore, lemmatization identifies word forms uniquely: words are generally lemmatized based on their part of speech, using specific rules and exception lists. Although lemmatization is a complex task, it is an essential process in natural language processing.
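
To make the part-of-speech dependence concrete, here is a minimal sketch using NLTK's WordNetLemmatizer; the pos argument tells it whether to treat the word as a noun, verb, adjective, or adverb:

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # the WordNet data is needed once

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("rocks"))             # rock (noun is the default POS)
print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("better", pos="a"))   # good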

In natural language processing, steps such as removing punctuation marks, splitting a text on spaces or line breaks, and converting all characters to lowercase are applied to prepare a text for further processing, so that the text is expressed in a form suitable for that processing. Text cleanup is one of the most important processes in natural language processing, and another important step executed during cleanup is the conversion of words into their simplest form. Unlike stemming, lemmatization converts words into their simplest form while ensuring that the transformed words are still meaningful.
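
A minimal sketch of that kind of cleanup in plain Python (illustrative only; real pipelines usually rely on a proper tokenizer):

import string

text = "Stemming, Lemmatization & CLEANUP!"
# Lowercase, strip punctuation, then split on whitespace
cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
print(cleaned.split())  # ['stemming', 'lemmatization', 'cleanup']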

Algorithms for Lemmatization

NLEn is a free-software Python package that provides several natural language processing tools through an interface to the Natural Language Toolkit (nltk). NLEn simplifies usage by hiding several implementation details involved in processing large amounts of text data, and it has been designed to integrate closely with data manipulation and analysis tools such as pandas (and, on the R side, dplyr). The lemmatization process is a basic task for several natural language processing applications and requires dealing with the morphological analysis of words and their meaning. The aim of lemmatization is to reduce a word to its base form, called a lemma, which conveys the general meaning of the set of words associated with the same stem.

Lemmatization of words is a common foundation of many natural language processing tasks such as part-of-speech tagging, named entity recognition, and syntactic parsing, and many applications call for lemmatizers with greater coverage and efficiency. In this article I describe some commonly used lemmatization tools. The WordNet-based lemmatization module implicitly uses the WordNet noun, verb, adjective, and adverb rules for the English language; by default, it applies the appropriate rule based on the word's part-of-speech value in the input. A morphological (or structural) lemmatizer interface, such as the one around the morpha tool, provides access to the full range of morpha's functionality.
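
As a sketch of that idea (my own example using NLTK, assuming its standard data packages can be downloaded), here is how a word's part-of-speech tag can drive the choice of WordNet rule:

from nltk import download, pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

for pkg in ("punkt", "averaged_perceptron_tagger", "wordnet"):
    download(pkg, quiet=True)

def to_wordnet_pos(tag):
    # Map Penn Treebank tags (JJ, VBD, RB, NN, ...) to WordNet POS constants
    if tag.startswith("J"):
        return wordnet.ADJ
    if tag.startswith("V"):
        return wordnet.VERB
    if tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
for word, tag in pos_tag(word_tokenize("The striped bats were hanging on their feet")):
    print(word, "|", lemmatizer.lemmatize(word, to_wordnet_pos(tag)))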

So below I also added a simple example using code snippets:

Input:

nlp = spacy.load("en_core_web_sm")  # load spaCy's small English pipeline

doc = nlp("eating eats eat ate adjustable rafting ability meeting")

# Each token carries its lemma as text in token.lemma_
for token in doc:
    print(token, "|", token.lemma_)

Output:

eating | eat
eats | eat
eat | eat
ate | eat
adjustable | adjustable
rafting | raft
ability | ability
meeting | meeting

Input:

nlp = spacy.load("en_core_web_sm")

doc = nlp("eating eats eat ate adjustable rafting ability meeting")

# token.lemma_ is the lemma text; token.lemma is its hash value
for token in doc:
    print(token, "|", token.lemma_, "|", token.lemma)

Output:

eating | eat | 9837207709914848172
eats | eat | 9837207709914848172
eat | eat | 9837207709914848172
ate | eat | 9837207709914848172
adjustable | adjustable | 6033511944150694480
rafting | raft | 7154368781129989833
ability | ability | 11565809527369121409
meeting | meeting | 14798207169164081740

The only difference between this snippet and the one above is the extra token.lemma column without the underscore: lemma_ gives the lemma as text, while lemma gives the hash value that spaCy stores under the hood for that lemma.
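
As a small sketch, that hash can be mapped back to the lemma text through the pipeline's string store:

# Look up the lemma text behind the hash in spaCy's StringStore
print(doc[0].lemma)                     # 9837207709914848172
print(nlp.vocab.strings[doc[0].lemma])  # eat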

Input:

doc = nlp("Mando talked for 3 hours although talking isn't his thing he became talkative")

for token in doc:
    print(token, "|", token.lemma_, "|", token.lemma)

Output:

Mando | Mando | 7837215228004622142
talked | talk | 13939146775466599234
for | for | 16037325823156266367
3 | 3 | 602994839685422785
hours | hour | 9748623380567160636
although | although | 343236316598008647
talking | talk | 13939146775466599234
is | be | 10382539506755952630
n't | not | 447765159362469301
his | his | 2661093235354845946
thing | thing | 2473243759842082748
he | he | 1655312771067108281
became | become | 12558846041070486771
talkative | talkative | 13364764166055324990


The code below lists the components of the loaded pipeline that one gets to work with, like the tagger, parser, attribute_ruler, lemmatizer, and ner (Named Entity Recognition).

Input:

nlp.pipe_names  # pipeline component names

Output:

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
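
Since every component in this list runs on each text, a handy trick (a sketch, not part of the original snippets) is to load the pipeline with the components you do not need disabled, which speeds up lemmatization-only workloads:

# Load the pipeline without NER and the parser; the tagger,
# attribute_ruler, and lemmatizer are enough to produce lemmas
nlp_light = spacy.load("en_core_web_sm", disable=["ner", "parser"])

print([token.lemma_ for token in nlp_light("eating eats ate")])  # ['eat', 'eat', 'eat']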

Below is another typical example I used, combining lemmatization with indexing into the tokens of a Doc.

Input:

doc = nlp("No, brothers, I am really exhausted. It was just one of those days.")

for token in doc:
    print(token, "|", token.lemma_)

Output:

No | no
, | ,
brothers | brother
, | ,
I | I
am | be
really | really
exhausted | exhausted
. | .
It | it
was | be
just | just
one | one
of | of
those | those
days | day
. | .

The third token in the sentence above is “brothers”, so when I pick it out by indexing into doc at position 2 (indexing starts at 0) and look at its root word, it gives me the word “brother”:

Input:

doc[2]

Output:

brothers

Input:

doc[2].lemma_

Output:

brother

spaCy also lets you customize lemmas: below I use the attribute_ruler pipeline component to force the lemma “Brother” for the tokens “brothers” and “brah”.

Input:

ar = nlp.get_pipe("attribute_ruler")

# Map the token patterns "brothers" and "brah" to the custom lemma "Brother"
ar.add([[{"TEXT": "brothers"}], [{"TEXT": "brah"}]], {"LEMMA": "Brother"})
doc = nlp("No, brothers, brah I am really exhausted. It was just one of those days.")

for token in doc:
    print(token, "|", token.lemma_)

Output:

No | no
, | ,
brothers | Brother
, | ,
brah | Brother
I | I
am | be
really | really
exhausted | exhausted
. | .
It | it
was | be
just | just
one | one
of | of
those | those
days | day
. | .

Input:

doc[0]

Output:

No

Input:

doc[0].lemma_

Output:

'no'

If you have read this far, please leave a comment on what you think about my article; your opinion matters a lot. Thanks!

