In this Python NLP article we are going to learn about part-of-speech default tagging. Default tagging provides a baseline for part-of-speech tagging and is performed with the DefaultTagger class, which simply assigns the same part-of-speech tag to every token. DefaultTagger takes the tag as its single argument; for example, NN is the tag for a singular noun.
```python
from nltk.tag import DefaultTagger

tagger = DefaultTagger('NN')
print(tagger.tag(['Hello', 'World']))
print(tagger.tag(['Good', 'Morning']))
```
Every tagger has a tag() method that takes a list of tokens as its argument. If you run the code, this is the result.
```
[('Hello', 'NN'), ('World', 'NN')]
[('Good', 'NN'), ('Morning', 'NN')]
```
You can also untag a tagged sentence using the untag() function.
```python
from nltk.tag import untag

print(untag([('Hello', 'NN'), ('World', 'NN')]))
```
This is the result.
```
['Hello', 'World']
```
Every tagger also has a way to measure its accuracy. For this we are going to use the Brown Corpus. The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. It contains text from 500 sources, categorized by genre, such as news and editorial.
```python
from nltk.corpus import brown
from nltk.tag import DefaultTagger

brown_tagged_sents = brown.tagged_sents(categories='news')
default_tagger = DefaultTagger('NN')
print(default_tagger.evaluate(brown_tagged_sents))
```
Run the code and you can see that we get a poor result: the accuracy is about 13 percent.
```
0.13089484257215028
```
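The score is just the fraction of tokens whose predicted tag matches the gold-standard tag, so 13 percent is roughly how often NN is the correct tag in the news category. A minimal sketch of that computation, on a hypothetical hand-made gold sentence:

```python
from nltk.tag import DefaultTagger

# a hand-made gold-standard sentence (hypothetical, for illustration)
gold = [('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')]

tagger = DefaultTagger('NN')
predicted = tagger.tag([word for word, _ in gold])

# accuracy = fraction of tokens whose predicted tag matches the gold tag
correct = sum(1 for p, g in zip(predicted, gold) if p[1] == g[1])
acc = correct / len(gold)
print(acc)  # only 'dog' is really an NN, so 1 of 3 is correct
```

This is the same measurement the tagger's built-in accuracy method performs over the whole corpus.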
There are other taggers you can use, for example the UnigramTagger. A unigram generally refers to a single token, so a unigram tagger uses only a single word as its context for determining the part-of-speech tag.
```python
from nltk.tag import UnigramTagger
from nltk.corpus import treebank

train_sents = treebank.tagged_sents()[:2000]
tagger = UnigramTagger(train_sents)
print(treebank.sents()[0])
print(tagger.tag(treebank.sents()[0]))
```
Here we used the first 2000 tagged sentences of the treebank corpus as the training set to initialize the UnigramTagger class. If you run the code, this is the result.
```
['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]
```
Now let’s check the accuracy.
```python
test_sents = treebank.tagged_sents()[2000:]
print("Accuracy : ", tagger.evaluate(test_sents))
```
Checking against the held-out sentences, we receive about 82 percent accuracy.
```
Accuracy :  0.8289714062852803
```
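Part of the missing accuracy comes from words the unigram tagger never saw during training: on its own it has no fallback and tags them as None. A minimal sketch with a tiny hypothetical training set:

```python
from nltk.tag import UnigramTagger

# tiny hypothetical training set, just for illustration
train_sents = [[('the', 'DT'), ('dog', 'NN')]]

tagger = UnigramTagger(train_sents)
result = tagger.tag(['the', 'aardvark'])
print(result)  # the unseen word gets no tag at all
```

This gap is exactly what the backoff mechanism below is designed to fill.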
Now let's talk about backoff tagging, one of the core features of SequentialBackoffTagger. Using backoff tagging we can chain taggers together: if one tagger does not know how to tag a word, it passes the word to the next backoff tagger; if that one cannot tag the word either, it can pass it on to yet another backoff tagger, and so on. This is how backoff tagging works.
```python
from nltk.tag import UnigramTagger, DefaultTagger
from nltk.corpus import treebank

train_sents = treebank.tagged_sents()[:2000]
test_sents = treebank.tagged_sents()[2000:]
tagger1 = DefaultTagger('NN')
tagger2 = UnigramTagger(train_sents, backoff=tagger1)
print("Backoff Accuracy : ", tagger2.evaluate(test_sents))
```
After running the code you can see that the accuracy improves to about 85 percent.
```
Backoff Accuracy :  0.8547119075476733
```
Along with UnigramTagger, there are two more taggers we can use: BigramTagger and TrigramTagger. A bigram tagger looks at the previous tag as part of its context, and a trigram tagger looks at the previous two tags.
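They fit naturally into a backoff chain: the trigram tagger tries first, falls back to the bigram tagger, then to the unigram tagger, and finally to the default tagger. A minimal sketch using a tiny hypothetical training set in place of a real corpus:

```python
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger, TrigramTagger

# tiny hypothetical training set, just for illustration
train_sents = [
    [('the', 'DT'), ('dog', 'NN'), ('runs', 'VBZ')],
    [('the', 'DT'), ('cat', 'NN'), ('sleeps', 'VBZ')],
]

# chain the taggers: trigram -> bigram -> unigram -> default
t0 = DefaultTagger('NN')
t1 = UnigramTagger(train_sents, backoff=t0)
t2 = BigramTagger(train_sents, backoff=t1)
t3 = TrigramTagger(train_sents, backoff=t2)

result = t3.tag(['the', 'bird', 'runs'])
print(result)
```

The unseen word 'bird' falls all the way through the chain to the default NN tag, while 'the' and 'runs' are tagged from the n-gram models. In practice you would train on a real tagged corpus such as treebank, as in the examples above.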