In this Python Natural Language Processing article we are going to talk about NLP Tokenization.
First let's define tokenization, and after that we are going to create our examples.
Read Introduction to Natural Language Processing in Python
What is NLP Tokenization?
Tokenization is the process of splitting text into smaller parts, and each of these smaller parts is
called a token. It is one of the most important steps in natural language processing. There are
two levels of tokenization: sentence level tokenization and word level tokenization.
1: Sentence Tokenization
Using sentence tokenization we can split a text into sentences. This is done with the
sent_tokenize() function. The sent_tokenize() function uses an instance of
PunktSentenceTokenizer, which is pretrained, so it doesn't require
training text and can tokenize straight away.
from nltk.tokenize import sent_tokenize

mytext = "Hello friends. welcome to codeloop.org. like the article"
print(sent_tokenize(mytext))
If you run the code this will be the result. You can see that the text is split
into separate sentences.
['Hello friends.', 'welcome to codeloop.org.', 'like the article']
We also have another tokenizer called PunktSentenceTokenizer. It is a good choice when we
have huge chunks of data.
What is PunktSentenceTokenizer?
This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to
build a model for abbreviation words. In other words, it is an unsupervised, trainable model and
can be trained on unlabeled data; a short training sketch follows the example below.
import nltk

# Loading PunktSentenceTokenizer with the English pickle
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

text = " Hello friends . How are you . welcome to codeloop.org "
print(tokenizer.tokenize(text))
If you run the code, this will be the result.
[' Hello friends .', 'How are you .', 'welcome to codeloop.org']
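Because Punkt is unsupervised, you can also train your own sentence tokenizer on plain, unlabeled text. Below is a minimal sketch; the short training string is only a made-up placeholder, in practice you would pass a large corpus of your own.

from nltk.tokenize import PunktSentenceTokenizer

# Any large block of unlabeled text can serve as training data.
# This short string is just a placeholder for illustration.
train_text = "Dr. Smith visited codeloop.org. He liked the articles. Mr. Jones agreed."

# Passing text to the constructor runs the unsupervised Punkt training on it
custom_tokenizer = PunktSentenceTokenizer(train_text)

print(custom_tokenizer.tokenize("Dr. Smith wrote a new post. It was about tokenization."))

With real data the trained model learns abbreviations such as "Dr." from the corpus itself, which is exactly what "trained on unlabeled data" means here.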
You can also tokenize sentences in other languages by loading a pickle file other
than English. So in this example we are going to tokenize a Spanish text.
import nltk.data

spanish_tokenizer = nltk.data.load('tokenizers/punkt/spanish.pickle')

mytext = 'Hola amigos . Cómo estás . Por favor suscribete a mi canal'
print(spanish_tokenizer.tokenize(mytext))
If you run the code, this will be the result.
['Hola amigos .', 'Cómo estás .', 'Por favor suscribete a mi canal']
2: Word Tokenization
We can do word tokenization using the word_tokenize() function. The word_tokenize() function uses
an instance of an NLTK tokenizer called TreebankWordTokenizer.
Before that, let's look at the simplest tokenizer, which comes with Python itself: the split() method of Python strings.
This is the most basic tokenizer and it uses whitespace as the delimiter.
mytext = "Hello World ! @ Welcome to Python Natural Language Processing Course 4 you"
print(mytext.split())
This is the result of the code. You can see that our sentence is split into separate words.
['Hello', 'World', '!', '@', 'Welcome', 'to', 'Python', 'Natural', 'Language', 'Processing', 'Course', '4', 'you']
Now let's use word_tokenize() from NLTK. This is the most commonly used tokenizer;
basically we can say that it is the default one.
from nltk.tokenize import word_tokenize

print(word_tokenize(mytext))
This will be the result.
['Hello', 'World', '!', '@', 'Welcome', 'to', 'Python', 'Natural', 'Language', 'Processing', 'Course', '4', 'you']
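If you prefer, you can also call the underlying tokenizer class directly. Here is a minimal sketch using the TreebankWordTokenizer class mentioned above; it produces the same style of word tokens as word_tokenize().

from nltk.tokenize import TreebankWordTokenizer

mytext = "Hello World ! @ Welcome to Python Natural Language Processing Course 4 you"

# Create the Treebank-style tokenizer and tokenize the text word by word
treebank_tokenizer = TreebankWordTokenizer()
print(treebank_tokenizer.tokenize(mytext))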
Regular Expression Tokenizer
A RegexpTokenizer splits a string into substrings using a regular expression.
Most of the other tokenizers can be derived from this tokenizer, and you can also build a
very specific tokenizer by using a different pattern.
from nltk.tokenize import regexp_tokenize

mytext = "Hello World ! @ Welcome to Python Natural Language Processing Course 4 you"

print(regexp_tokenize(mytext, pattern=r'\w+'))
print(regexp_tokenize(mytext, pattern=r'\d+'))
We have used \w+ as the regular expression, which means we want all the words and digits
from the string, while the other symbols act as splitters.
In the second call we specify \d+ as the regex, so the result contains only the digits from the string.
If you run the code, this is the result.
['Hello', 'World', 'Welcome', 'to', 'Python', 'Natural', 'Language', 'Processing', 'Course', '4', 'you']
['4']
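The same patterns can also be used through the RegexpTokenizer class itself. As a small sketch, the gaps=True flag tells the tokenizer that the pattern describes the separators rather than the tokens, which gives you a whitespace tokenizer.

from nltk.tokenize import RegexpTokenizer

mytext = "Hello World ! @ Welcome to Python Natural Language Processing Course 4 you"

# Match words and digits, equivalent to regexp_tokenize(mytext, pattern=r'\w+')
word_tokenizer = RegexpTokenizer(r'\w+')
print(word_tokenizer.tokenize(mytext))

# With gaps=True the pattern marks the separators, so this splits on whitespace
whitespace_tokenizer = RegexpTokenizer(r'\s+', gaps=True)
print(whitespace_tokenizer.tokenize(mytext))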