In this Python NLP article we are going to learn about part-of-speech default tagging. Default tagging provides a baseline for part-of-speech tagging and is performed with the DefaultTagger class, which simply assigns the same part-of-speech tag to every token. DefaultTagger takes the tag as its single argument; for example, NN is the tag for a singular noun.
```python
from nltk.tag import DefaultTagger

tagger = DefaultTagger('NN')
print(tagger.tag(['Hello', 'World']))
print(tagger.tag(['Good', 'Morning']))
```
Every tagger has a tag() method that takes a list of tokens as its argument. If you run the code, this is the result.
```
[('Hello', 'NN'), ('World', 'NN')]
[('Good', 'NN'), ('Morning', 'NN')]
```
You can also untag a tagged sentence using the untag() function.
```python
from nltk.tag import untag

print(untag([('Hello', 'NN'), ('World', 'NN')]))
```
This is the result.
```
['Hello', 'World']
```
Every tagger also has a way to measure its accuracy. For this we are going to use the Brown Corpus. The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. It contains text from 500 sources, categorized by genre, such as news and editorial.
```python
from nltk.corpus import brown
from nltk.tag import DefaultTagger

brown_tagged_sents = brown.tagged_sents(categories='news')
default_tagger = DefaultTagger('NN')
print(default_tagger.evaluate(brown_tagged_sents))
```
Run the code and you can see that we get a poor result: the accuracy is about 13 percent.
```
0.13089484257215028
```
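The score is just the fraction of tokens whose predicted tag matches the gold-standard tag, so 13 percent is roughly how often NN is the correct tag in the news category. A minimal sketch of that computation, on a hypothetical hand-made gold sentence:

```python
from nltk.tag import DefaultTagger

# a hand-made gold-standard sentence (hypothetical, for illustration)
gold = [('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')]

tagger = DefaultTagger('NN')
predicted = tagger.tag([word for word, _ in gold])

# accuracy = fraction of tokens whose predicted tag matches the gold tag
correct = sum(1 for p, g in zip(predicted, gold) if p[1] == g[1])
acc = correct / len(gold)
print(acc)  # only 'dog' is really an NN, so 1 of 3 is correct
```

This is the same measurement the tagger's built-in accuracy method performs over the whole corpus.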
There are other taggers you can use, for example the UnigramTagger. A unigram generally refers to a single token, so a unigram tagger uses only a single word as its context for determining the part-of-speech tag.
```python
from nltk.tag import UnigramTagger
from nltk.corpus import treebank

train_sents = treebank.tagged_sents()[:2000]
tagger = UnigramTagger(train_sents)
print(treebank.sents()[0])
print(tagger.tag(treebank.sents()[0]))
```
Here we used the first 2000 tagged sentences of the treebank corpus as the training set to initialize the UnigramTagger class. If you run the code, this is the result.
```
['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]
```
Now let’s check the accuracy.
```python
test_sents = treebank.tagged_sents()[2000:]
print("Accuracy : ", tagger.evaluate(test_sents))
```
Checking against the held-out sentences, we receive about 82 percent accuracy.
```
Accuracy :  0.8289714062852803
```
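Part of the missing accuracy comes from words the unigram tagger never saw during training: on its own it has no fallback and tags them as None. A minimal sketch with a tiny hypothetical training set:

```python
from nltk.tag import UnigramTagger

# tiny hypothetical training set, just for illustration
train_sents = [[('the', 'DT'), ('dog', 'NN')]]

tagger = UnigramTagger(train_sents)
result = tagger.tag(['the', 'aardvark'])
print(result)  # the unseen word gets no tag at all
```

This gap is exactly what the backoff mechanism below is designed to fill.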
Now let's talk about backoff tagging, one of the core features of SequentialBackoffTagger. Using backoff tagging we can chain taggers together: if one tagger does not know how to tag a word, it passes the word to the next backoff tagger; if that one cannot tag the word either, it can pass it on to yet another backoff tagger, and so on. This is how backoff tagging works.
```python
from nltk.tag import UnigramTagger, DefaultTagger
from nltk.corpus import treebank

train_sents = treebank.tagged_sents()[:2000]
test_sents = treebank.tagged_sents()[2000:]
tagger1 = DefaultTagger('NN')
tagger2 = UnigramTagger(train_sents, backoff=tagger1)
print("Backoff Accuracy : ", tagger2.evaluate(test_sents))
```
After running the code you can see that the accuracy improves to about 85 percent.
```
Backoff Accuracy :  0.8547119075476733
```
Along with UnigramTagger, there are two more taggers we can use: BigramTagger and TrigramTagger. A bigram tagger looks at the previous tag as part of its context, and a trigram tagger looks at the previous two tags.
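They fit naturally into a backoff chain: the trigram tagger tries first, falls back to the bigram tagger, then to the unigram tagger, and finally to the default tagger. A minimal sketch using a tiny hypothetical training set in place of a real corpus:

```python
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger, TrigramTagger

# tiny hypothetical training set, just for illustration
train_sents = [
    [('the', 'DT'), ('dog', 'NN'), ('runs', 'VBZ')],
    [('the', 'DT'), ('cat', 'NN'), ('sleeps', 'VBZ')],
]

# chain the taggers: trigram -> bigram -> unigram -> default
t0 = DefaultTagger('NN')
t1 = UnigramTagger(train_sents, backoff=t0)
t2 = BigramTagger(train_sents, backoff=t1)
t3 = TrigramTagger(train_sents, backoff=t2)

result = t3.tag(['the', 'bird', 'runs'])
print(result)
```

The unseen word 'bird' falls all the way through the chain to the default NN tag, while 'the' and 'runs' are tagged from the n-gram models. In practice you would train on a real tagged corpus such as treebank, as in the examples above.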