Python Natural Language Processing – NLP Tokenization

In this Python Natural Language Processing article we are going to talk about NLP Tokenization.

So first let's define tokenization, and after that we are going to create our examples.

Read Introduction to Natural Language Processing in Python

What is NLP Tokenization?

Tokenization is the process of splitting text into smaller parts, and each of these smaller parts is called a token. It is one of the most important steps in natural language processing. There are two levels of tokenization: sentence level tokenization and word level tokenization.

1: Sentence Tokenization 

Using sentence tokenization we can split a text into sentences. This is done with the sent_tokenize() function. The sent_tokenize() function uses an instance of PunktSentenceTokenizer, and it is pretrained, so it doesn't require training text and can tokenize straightaway.
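For example, a minimal sketch could look like this (the sample text is just an illustration, and downloading the punkt data is assumed to be needed only once):

import nltk
from nltk.tokenize import sent_tokenize

# download the pretrained Punkt sentence model (only needed the first time)
nltk.download('punkt')

# a sample text, just for illustration
text = "Hello Mr. Smith, how are you today? The weather is great. Python is awesome."

# split the text into a list of sentences
sentences = sent_tokenize(text)
print(sentences)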

If you run the code this will be the result; you can see that the text is split into separate sentences.

We also have another tokenizer called PunktSentenceTokenizer. When we have huge chunks of data, it is good to use it.

What is PunktSentenceTokenizer?

This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words. In other words, it is an unsupervised, trainable model, and it can be trained on unlabeled data.
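A rough sketch of how this might look (the sample text is just an example):

from nltk.tokenize import PunktSentenceTokenizer

text = "Hello everyone. Welcome to Codeloop. This is a sample text for sentence tokenization."

# create the tokenizer; you can also pass your own raw text to train it on unlabeled data
tokenizer = PunktSentenceTokenizer()
sentences = tokenizer.tokenize(text)
print(sentences)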

If you run the code this will be the result.

You can also tokenize sentences from other languages by loading a different pickle file than the English one. In this example we are going to tokenize a Spanish text.
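Something along these lines (the Spanish sentences are just sample input, and the punkt data package is assumed to be downloaded):

import nltk.data

# load the pretrained Spanish Punkt model that ships with the punkt data package
spanish_tokenizer = nltk.data.load('tokenizers/punkt/spanish.pickle')

text = "Hola amigo. Estoy bien. Nos vemos mañana."
print(spanish_tokenizer.tokenize(text))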

If you run the code this will be the result.

2: Word Tokenization 

We can do word tokenization using the word_tokenize() function. The word_tokenize() function uses an instance of an NLTK tokenizer called TreebankWordTokenizer.

But the simplest tokenizer is the one that comes with Python itself: the split() method of the Python string. This is the most basic tokenizer, and it uses whitespace as the delimiter.
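For example, a tiny sketch like this (the sentence is only a sample):

text = "Natural language processing is fun to learn"

# split() breaks the string on whitespace
tokens = text.split()
print(tokens)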

This is the result for the code; you can see that our sentence is split into separate words.

Now let's use word_tokenize() from NLTK. This is the most commonly used tokenizer; basically we can say that it is the default one.
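A minimal sketch of that (the sample sentence is just an illustration):

from nltk.tokenize import word_tokenize

text = "Hello Mr. Smith, how are you doing today?"

# word_tokenize splits words and punctuation marks into separate tokens
tokens = word_tokenize(text)
print(tokens)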

This will be the result.

Regular Expression Tokenizer 

A RegexpTokenizer splits a string into substrings using a regular expression. Most of the other tokenizers can be derived from this tokenizer, and you can also build a very specific tokenizer by using a different pattern.

We have used \w+ as the regular expression, which means we want all the words and digits from the string, while the other symbols are used as splitters.

In the second part we specify \d+ as the regex, so the result will contain only the digits from the string.
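A short sketch covering both patterns (the input string is just an example):

from nltk.tokenize import RegexpTokenizer

text = "There are 365 days in a year and 12 months."

# \w+ keeps words and digits, everything else acts as a separator
word_tokenizer = RegexpTokenizer(r'\w+')
print(word_tokenizer.tokenize(text))

# \d+ keeps only the digit sequences
digit_tokenizer = RegexpTokenizer(r'\d+')
print(digit_tokenizer.tokenize(text))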

If you run the code this is the result.