In this Python NLP article, we are going to learn about stemming and lemmatization with NLTK. First we will talk about the concepts, and after that we will work through some examples.
Understanding Tokenization in NLTK
Before we dive into Stemming and Lemmatization, let’s briefly touch upon the concept of Tokenization. Tokenization involves breaking down text into smaller units, typically words or sentences, to facilitate further analysis. NLTK (Natural Language Toolkit) provides robust functionalities for tokenizing text effortlessly in Python.
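As a minimal sketch of what tokenization looks like, the snippet below uses NLTK's regex-based wordpunct_tokenize. It is chosen here only because it needs no extra data download; word_tokenize is the more common choice but requires additional tokenizer data. The sample sentence is made up for illustration.

```python
from nltk.tokenize import wordpunct_tokenize

# Illustrative sentence (not from the original article).
text = "She was eating while he drinks tea."

# wordpunct_tokenize splits the text into runs of word characters and
# runs of punctuation using a regular expression, so no corpus is needed.
tokens = wordpunct_tokenize(text)
print(tokens)
```

Each token can then be passed to a stemmer or lemmatizer one at a time.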
What is Stemming in NLP?
Stemming is the process of removing affixes from a word so that only the stem remains. In literal terms, it is like cutting the branches of a tree down to its stem: with stemming we cut a word or token down to its stem or base form. For example, the word eat has variations like eating, eaten, and eats. Stemming is commonly used by search engines for indexing words: instead of storing all forms of a word, a search engine can store only the stems. NLTK provides several stemmers, for example PorterStemmer, LancasterStemmer, and SnowballStemmer.
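The search-engine idea can be sketched roughly as follows. This is only a toy illustration, not how a real search engine is built: the documents, the build_stem_index helper, and the search helper are all hypothetical names introduced here, and the index simply maps each stem to the documents that contain some form of that word.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

stemmer = PorterStemmer()

# Toy document collection (hypothetical data for illustration).
documents = {
    1: "He eats an apple",
    2: "They were eating lunch",
    3: "She drinks water",
}


def build_stem_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in wordpunct_tokenize(text.lower()):
            index[stemmer.stem(token)].add(doc_id)
    return index


def search(index, query):
    """Look up a single query word by its stem."""
    return sorted(index.get(stemmer.stem(query.lower()), set()))


index = build_stem_index(documents)
print(search(index, "eat"))       # documents with eats or eating
print(search(index, "drinking"))  # documents with drinks
```

Because the query and the documents are reduced to the same stems, a search for one form of a word also finds the other forms.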
Let’s start with PorterStemmer, which is a common default choice for stemming.
    from nltk.stem import PorterStemmer

    # create a PorterStemmer object
    porter_stemmer = PorterStemmer()

    print(porter_stemmer.stem('drinking'))
In the above code we stem the word drinking, and the result is drink.
    drink
Now let’s use LancasterStemmer. It works just like PorterStemmer, but it is known to be more aggressive.
    from nltk.stem import LancasterStemmer

    lan_stemmer = LancasterStemmer()

    print(lan_stemmer.stem('drinks'))
This is the result.
    drink
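To see how the two stemmers differ in practice, it can help to run them side by side. The snippet below is a small sketch under a few assumptions: the word list is chosen only for illustration, compare_stemmers is a hypothetical helper, and the exact stems you see may vary with the NLTK version, so no expected output is shown.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Sample words chosen for illustration only.
words = ["drinking", "drinks", "maximum", "presumably", "organization"]


def compare_stemmers(word_list):
    """Return (word, porter_stem, lancaster_stem) tuples."""
    return [(w, porter.stem(w), lancaster.stem(w)) for w in word_list]


for word, p_stem, l_stem in compare_stemmers(words):
    print(f"{word:<14} porter={p_stem:<12} lancaster={l_stem}")
```

When a stemmer is described as more aggressive, it usually means it produces shorter stems, which merges more word forms together but can also merge unrelated words.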
Another stemmer is SnowballStemmer. The best thing about it is that it supports many languages: 15 in the list printed below, which also contains a 'porter' entry for the original Porter algorithm.
You can check the languages that are available for SnowballStemmer.
    from nltk.stem import SnowballStemmer

    print(SnowballStemmer.languages)
Run the code and you will see this result.
    ('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')
Now let’s stem a word using SnowballStemmer. Note that when creating the SnowballStemmer we need to specify the language that we want to use.
    from nltk.stem import SnowballStemmer

    snow_stemmer = SnowballStemmer('english')

    print(snow_stemmer.stem('eating'))
Run the code and this is the result.
    eat
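Since the language is just a constructor argument, switching languages is straightforward. The following is a minimal sketch, assuming a hypothetical helper named stem_in_language that checks the requested language against SnowballStemmer.languages before stemming. The Spanish word is only an illustrative input, and its output is not shown because it is not taken from the article.

```python
from nltk.stem import SnowballStemmer


def stem_in_language(word, language="english"):
    """Stem `word` with the Snowball stemmer for `language`.

    Hypothetical helper: raises ValueError for unsupported languages.
    """
    if language not in SnowballStemmer.languages:
        raise ValueError(f"Unsupported language: {language!r}")
    return SnowballStemmer(language).stem(word)


print(stem_in_language("eating"))                # English stemmer
print(stem_in_language("corriendo", "spanish"))  # Spanish stemmer
```

Creating a new stemmer on every call keeps the sketch short. In real code you would probably create one stemmer per language and reuse it.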
What is Lemmatization in NLP?
Lemmatization is similar to stemming, but it brings context to the words: it links words with similar meaning to one base word, the lemma. Unlike stemming, you are always left with a valid word that means the same thing. However, the word you end up with can look completely different from the original, as with better and good.
Here is an example of lemmatization.
    from nltk.stem import WordNetLemmatizer

    # requires the WordNet data: nltk.download('wordnet')
    lemmatizer = WordNetLemmatizer()

    print(lemmatizer.lemmatize('eating'))            # default pos is noun
    print(lemmatizer.lemmatize('eating', pos='v'))   # treat as a verb
    print(lemmatizer.lemmatize("better", pos='a'))   # treat as an adjective
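If you run the code, this will be the result. Without a pos argument the lemmatizer treats the word as a noun, which is why the first call returns eating unchanged.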
    eating
    eat
    good
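Because the pos argument changes the result, a pipeline usually needs some way to turn part-of-speech tags into the letters the lemmatizer expects. The sketch below assumes the tags are Penn Treebank tags (the kind returned by nltk.pos_tag) and introduces two hypothetical helpers, penn_to_wordnet_pos and lemmatize_tagged. It is one possible mapping, not the only one.

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, which is also the lemmatizer's default


def lemmatize_tagged(tagged_tokens, lemmatizer):
    """Lemmatize (word, tag) pairs.

    `lemmatizer` is any object exposing lemmatize(word, pos=...),
    for example a WordNetLemmatizer instance.
    """
    return [
        lemmatizer.lemmatize(word, pos=penn_to_wordnet_pos(tag))
        for word, tag in tagged_tokens
    ]


print(penn_to_wordnet_pos("VBG"))  # v
print(penn_to_wordnet_pos("JJR"))  # a
```

With the WordNet and tagger data installed, a call such as lemmatize_tagged(nltk.pos_tag(tokens), WordNetLemmatizer()) would tie the pieces together.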
Let’s see the difference between stemming and lemmatization.
    from nltk.stem import WordNetLemmatizer, PorterStemmer

    lemmatizer = WordNetLemmatizer()
    stemmer = PorterStemmer()

    print(stemmer.stem('believes'))
    print(lemmatizer.lemmatize('believes'))
If you run the code, this will be the result. In this example PorterStemmer just strips the ending and leaves believ, which is not a valid word, while the lemmatizer returns the valid word belief.
    believ
    belief