In this Python NLP article we are going to learn about stopwords removal in NLTK,
and we are going to build some examples of removing stopwords with NLTK.
You can also learn about Stemming and Lemmatization in Python NLP.
What are Stopwords?
Stopwords are words that generally do not contribute much to the meaning of a sentence,
for example ("the", "a", "an", "in"). A search engine is typically programmed to ignore them,
both when indexing entries for searching and when retrieving them as the result of a search
query, and they are commonly filtered out for the purposes of information retrieval and
natural language processing. NLTK comes with a pre-built list of stopwords for more than
20 languages.
OK, now let's first check which languages are available in the stopwords corpus.
from nltk.corpus import stopwords

print(stopwords.fileids())
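Note: if you have not downloaded the NLTK data yet, the code above will raise a LookupError. You can install the stopwords corpus once with nltk.download(); it is also worth grabbing the punkt tokenizer models, since we use word_tokenize later in this article (depending on your NLTK version, the tokenizer package name may differ).

import nltk

nltk.download('stopwords')  # the stopword lists
nltk.download('punkt')      # tokenizer models used by word_tokenize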
If you run the code, these are the available languages for stopwords.
['arabic', 'azerbaijani', 'danish', 'dutch', 'english', 'finnish', 'french',
 'german', 'greek', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali',
 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish',
 'swedish', 'tajik', 'turkish']
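Any of these language names can be passed to stopwords.words(). As a quick sketch, here is how you would load the French list (the variable name french_stop_words is just illustrative):

from nltk.corpus import stopwords

# the same call works for any language listed above
french_stop_words = stopwords.words('french')
print(len(french_stop_words))
print(french_stop_words[:10])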
Let’s check the stopwords list.
from nltk.corpus import stopwords

# load the stopword list for the chosen language
stop_words = stopwords.words('english')
print(stop_words)
Run the code and you can see the NLTK stopwords that are available for the English language.
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're",
 "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he',
 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's",
 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what',
 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is',
 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having',
 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or',
 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about',
 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above',
 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under',
 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why',
 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some',
 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very',
 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now',
 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn',
 "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn',
 "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't",
 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn',
 "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn',
 "wouldn't"]
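Because this is just a regular Python list, you can extend it with your own domain-specific stopwords. A minimal sketch (the extra words here are only examples, not part of NLTK's list):

from nltk.corpus import stopwords

# start from the built-in English list and add your own words
custom_stop_words = set(stopwords.words('english'))
custom_stop_words.update(['would', 'could', 'also'])
print(len(custom_stop_words))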
So now let's remove the stopwords from a piece of text. In this example we are going to
extract the non-stopwords; for our sample sentence those are ['first', 'example', 'nltk'].
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

mytext = "this is my first example in nltk for you"

stop_words = set(stopwords.words('english'))
words = word_tokenize(mytext)

filtered_words = []
for word in words:
    if word not in stop_words:
        filtered_words.append(word)

print(filtered_words)
Now if you run the code, this is the result.
['first', 'example', 'nltk']
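One thing to keep in mind: the NLTK stopword list is all lowercase, so a capitalized token like "This" would slip past the filter above. A common fix is to lowercase each token before the membership test, sketched here with a list comprehension that reuses the words and stop_words variables from the example above:

# lowercase each token before checking it against the stopword set,
# so "This" is filtered out just like "this"
filtered_words = [word for word in words if word.lower() not in stop_words]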
You can also see the frequency distribution of the words in a sentence using this code;
we have also plotted the most used words.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

mytext = "this is my first example in nltk for you"

stop_words = set(stopwords.words('english'))
words = word_tokenize(mytext)

filtered_words = []
for word in words:
    if word not in stop_words:
        filtered_words.append(word)

print(filtered_words)

freq_dist = FreqDist(filtered_words)
print(freq_dist.most_common())
freq_dist.plot()
If you run the code, this is the result. You can see that every word is used one time in the text.
[('first', 1), ('example', 1), ('nltk', 1)]
And this is the plot: freq_dist.plot() opens a matplotlib window showing each filtered word on the x-axis and its count on the y-axis.
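On a sentence this short every word appears once, but with longer texts you usually only want the top results. Both most_common() and plot() accept a limit on the number of samples:

# limit the output to the 10 most frequent words
print(freq_dist.most_common(10))
freq_dist.plot(10)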