In this Python Natural Language Processing article we are going to talk about NLP Tokenization.
First let's define tokenization, and after that we are going to create our examples.
Read Introduction to Natural Language Processing in Python
What is NLP Tokenization?
Tokenization is the process of splitting text into smaller parts, and each of these smaller parts is
called a token. It is one of the most important steps in natural language processing. There are
two levels of tokenization: sentence level tokenization and word level tokenization.
1: Sentence Tokenization
Using sentence tokenization we can split a text into sentences. This is done with the
sent_tokenize() function. The sent_tokenize() function uses an instance of
PunktSentenceTokenizer, which is pretrained, so it doesn't require
training text and can tokenize straight away.
from nltk.tokenize import sent_tokenize

mytext = "Hello friends. welcome to codeloop.org. like the article"
print(sent_tokenize(mytext))
If you run the code this will be the result. You can see that the text is split
into separate sentences.
['Hello friends.', 'welcome to codeloop.org.', 'like the article']
We also have another tokenizer called PunktSentenceTokenizer. It is a good choice when we
have huge chunks of data.
What is PunktSentenceTokenizer?
This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to
build a model for abbreviation words. In other words, it is an unsupervised, trainable model and
can be trained on unlabeled data; a short training sketch follows the example below.
import nltk

# Loading PunktSentenceTokenizer with the English pickle
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

text = " Hello friends . How are you . welcome to codeloop.org "
print(tokenizer.tokenize(text))
If you run the code, this will be the result.
[' Hello friends .', 'How are you .', 'welcome to codeloop.org']
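Because Punkt is unsupervised, you can also train your own sentence tokenizer on plain, unlabeled text. Below is a minimal sketch; the short training string is only a made-up placeholder, in practice you would pass a large corpus of your own.

from nltk.tokenize import PunktSentenceTokenizer

# Any large block of unlabeled text can serve as training data.
# This short string is just a placeholder for illustration.
train_text = "Dr. Smith visited codeloop.org. He liked the articles. Mr. Jones agreed."

# Passing text to the constructor runs the unsupervised Punkt training on it
custom_tokenizer = PunktSentenceTokenizer(train_text)

print(custom_tokenizer.tokenize("Dr. Smith wrote a new post. It was about tokenization."))

With real data the trained model learns abbreviations such as "Dr." from the corpus itself, which is exactly what "trained on unlabeled data" means here.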
You can also tokenize sentences in other languages by loading a pickle file other
than English. So in this example we are going to tokenize a Spanish text.
import nltk.data

spanish_tokenizer = nltk.data.load('tokenizers/punkt/spanish.pickle')

mytext = 'Hola amigos . Cómo estás . Por favor suscribete a mi canal'
print(spanish_tokenizer.tokenize(mytext))
If you run the code, this will be the result.
['Hola amigos .', 'Cómo estás .', 'Por favor suscribete a mi canal']
2: Word Tokenization
We can do word tokenization using the word_tokenize() function. The word_tokenize() function uses
an instance of an NLTK tokenizer called TreebankWordTokenizer.
Before that, let's look at the simplest tokenizer, which comes with Python itself: the split() method of Python strings.
This is the most basic tokenizer and it uses whitespace as the delimiter.
mytext = "Hello World ! @ Welcome to Python Natural Language Processing Course 4 you"
print(mytext.split())
This is the result of the code. You can see that our sentence is split into separate words.
['Hello', 'World', '!', '@', 'Welcome', 'to', 'Python', 'Natural', 'Language', 'Processing', 'Course', '4', 'you']
Now let's use word_tokenize() from NLTK. This is the most commonly used tokenizer;
basically we can say that it is the default one.
from nltk.tokenize import word_tokenize

print(word_tokenize(mytext))
This will be the result.
['Hello', 'World', '!', '@', 'Welcome', 'to', 'Python', 'Natural', 'Language', 'Processing', 'Course', '4', 'you']
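If you prefer, you can also call the underlying tokenizer class directly. Here is a minimal sketch using the TreebankWordTokenizer class mentioned above; it produces the same style of word tokens as word_tokenize().

from nltk.tokenize import TreebankWordTokenizer

mytext = "Hello World ! @ Welcome to Python Natural Language Processing Course 4 you"

# Create the Treebank-style tokenizer and tokenize the text word by word
treebank_tokenizer = TreebankWordTokenizer()
print(treebank_tokenizer.tokenize(mytext))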
Regular Expression Tokenizer
A RegexpTokenizer splits a string into substrings using a regular expression.
Most of the other tokenizers can be derived from this tokenizer, and you can also build a
very specific tokenizer by using a different pattern.
from nltk.tokenize import regexp_tokenize

mytext = "Hello World ! @ Welcome to Python Natural Language Processing Course 4 you"

print(regexp_tokenize(mytext, pattern=r'\w+'))
print(regexp_tokenize(mytext, pattern=r'\d+'))
We have used \w+ as the regular expression, which means we want all the words and digits
from the string, while the other symbols act as splitters.
In the second call we specify \d+ as the regex, so the result contains only the digits from the string.
If you run the code, this is the result.
['Hello', 'World', 'Welcome', 'to', 'Python', 'Natural', 'Language', 'Processing', 'Course', '4', 'you']
['4']
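The same patterns can also be used through the RegexpTokenizer class itself. As a small sketch, the gaps=True flag tells the tokenizer that the pattern describes the separators rather than the tokens, which gives you a whitespace tokenizer.

from nltk.tokenize import RegexpTokenizer

mytext = "Hello World ! @ Welcome to Python Natural Language Processing Course 4 you"

# Match words and digits, equivalent to regexp_tokenize(mytext, pattern=r'\w+')
word_tokenizer = RegexpTokenizer(r'\w+')
print(word_tokenizer.tokenize(mytext))

# With gaps=True the pattern marks the separators, so this splits on whitespace
whitespace_tokenizer = RegexpTokenizer(r'\s+', gaps=True)
print(whitespace_tokenizer.tokenize(mytext))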