DevRescue

Python nltk Tokenize Example

by Khaleel O.
December 27, 2021
in Python

In this tutorial we will use the Python nltk library to tokenize an example string of text. By “tokenize” we mean breaking a string up into a list of substrings (tokens), such as words and punctuation marks. We will be using Python 3.8.10. Let’s go! ⚡⚡✨
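
To make the idea concrete before bringing in nltk, here is a minimal sketch that uses only the standard library. It is not how nltk tokenizes; the helper name naive_tokenize and its regular expression are assumptions made purely for illustration. It shows why splitting on whitespace alone is usually not enough: punctuation stays glued to the words.

```python
import re


def naive_tokenize(text):
    """Rough sketch: return word tokens and single punctuation tokens."""
    # \w+(?:-\w+)* keeps hyphenated words such as "billion-fold" together;
    # [^\w\s] matches any single character that is neither a word
    # character nor whitespace, i.e. a punctuation mark.
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)


sample = "I came. I saw. I conquered."

whitespace_tokens = sample.split()      # punctuation stays attached
regex_tokens = naive_tokenize(sample)   # punctuation becomes its own token

print(whitespace_tokens)
# ['I', 'came.', 'I', 'saw.', 'I', 'conquered.']
print(regex_tokens)
# ['I', 'came', '.', 'I', 'saw', '.', 'I', 'conquered', '.']
```

A hand-written pattern like this breaks down quickly on contractions, abbreviations and other edge cases, which is the main reason to reach for a dedicated tokenizer.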

The library name nltk is short for Natural Language Toolkit. The nltk library is a popular platform for Natural Language Processing, or NLP, in Python. NLP is the computational processing of spoken and written human languages. This processing allows computers to extract meaning from text and to generate language. NLP sits at the intersection of linguistics, computer science and artificial intelligence.

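If you don’t already have the library, you will need to install it, along with the required models and datasets, using the following 2 commands. At the time of writing, the nltk library supports Python versions 3.6 to 3.9; newer releases may support a different range, so check the installation guide if you are on another version: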

Command 1:

pip install nltk

Command 2:

python -m nltk.downloader popular

Command 1 installs the nltk library and Command 2 installs the most commonly used subset of nltk data, which you’ll need to run the code in this tutorial. You can find the full installation guide in the official nltk documentation.
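
If you want to confirm the installation before moving on, a quick check along these lines may help. This is only a sketch: the helper name nltk_is_installed is made up for this example, and it only checks that the package can be found, not that the data from Command 2 was downloaded.

```python
import importlib.util
import sys


def nltk_is_installed():
    """Return True if this interpreter can find the nltk package."""
    # find_spec returns None when a top-level package is not installed
    return importlib.util.find_spec("nltk") is not None


print("Python version:", sys.version.split()[0])

if nltk_is_installed():
    import nltk
    print("nltk version:", nltk.__version__)
else:
    print("nltk not found; run 'pip install nltk' first")
```

Running it with the same interpreter you plan to use for the tutorial guards against the common case of pip installing into a different Python than the one executing the script.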

Once you have successfully installed the library, we can write our code:

import nltk
from nltk.tokenize import word_tokenize

t = f"""
    Artificial intelligence will reach human levels by around 2029.
    Follow that out further to, say, 2045, and we will have multiplied 
    the intelligence – the human biological machine intelligence of our 
    civilization – a billion-fold.
    - Ray Kurzweil, American inventor and futurist.
    """

print(word_tokenize(t))

#Output:
#['Artificial', 'intelligence', 'will', 'reach', 'human', 'levels', 
# 'by', 'around', '2029', '.', 'Follow', 'that', 'out', 'further', 'to', 
# ',', 'say', ',', '2045', ',', 'and', 'we', 'will', 'have', 'multiplied', 
# 'the', 'intelligence', '–', 'the', 'human', 'biological', 'machine', 'intelligence', 
# 'of', 'our', 'civilization', '–', 'a', 'billion-fold', '.', '-', 'Ray', 'Kurzweil', ',', 
# 'American', 'inventor', 'and', 'futurist', '.']

Let’s explain what’s happening here:

  1. We import the nltk library and, from nltk.tokenize, the word_tokenize function that we will use to tokenize the string.
  2. We define our text t. The triple quotes declare a multiline string literal, so the line breaks in the source are kept as part of the string. The f prefix makes it an f-string, which is harmless but not required here because the text contains no placeholders. In this case our string literal is a famous quote about Artificial Intelligence 😊
  3. We call the word_tokenize function that was imported earlier and supply our text t as an argument. This is the function that actually tokenizes the text. It accepts 3 parameters and returns the tokens as a list of strings. The parameters are:
    • text, the string to be tokenized
    • language, a string that defaults to ‘english’ and names the model used by the nltk sentence tokenizer
    • preserve_line, a flag that defaults to False. When it is False, the text is first split into sentences and each sentence is then split into words; when it is True, the sentence-splitting step is skipped.
  4. The print call prints the list returned by word_tokenize(t). In this case we see that the string is broken up into words and punctuation, as seen in the output.

Simple, right? Once the statement executes we will see a list of tokens printed to the console. The nltk library is quite powerful. Note from the above example that the tokenizer separates punctuation from words: ‘2029’ and the ‘.’ that follows it are returned as two tokens, while the hyphenated ‘billion-fold’ stays together as one.
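
The preserve_line parameter described above is easiest to understand by trying it. The short sketch below calls word_tokenize with preserve_line=True on a three-sentence string. Because the sentence-splitting step is skipped, only the period at the very end of the string should come back as a separate token, while the periods in the middle stay attached to their words.

```python
from nltk.tokenize import word_tokenize

t = "I came. I saw. I conquered."

# preserve_line=True treats the whole string as one line, so the text
# is NOT split into sentences before it is split into words.
tokens = word_tokenize(t, preserve_line=True)

print(tokens)
```

Compare this with the default call word_tokenize(t), which splits the text into sentences first and therefore treats each sentence-final period as its own token.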

The nltk library also allows us to tokenize by the sentences in our text, as opposed to just the words and punctuation. Here’s our code:

import nltk
from nltk.tokenize import sent_tokenize

t = "I came. I saw. I conquered."

print(sent_tokenize(t))

#Output:
#['I came.', 'I saw.', 'I conquered.']

Let’s explain what’s happening here:

  1. We import the package nltk and the sent_tokenize function that we will use to split our text into sentences.
  2. We define our text t.
  3. The sent_tokenize function is our sentence tokenizer. It accepts two parameters and returns the text split into a list of sentences. The parameters are the text, which is t in this case, and the language, which is the name of the model used by the nltk tokenizer and defaults to ‘english’.
  4. The print call prints the list returned by sent_tokenize(t). The list will contain three sentences.

Once the statement executes we will see a list of 3 sentences printed to the console. Note again that the tokenizer does more than split on every period: it relies on a model pre-trained on English text to decide where sentences end.
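
To see why sentence splitting needs more than a simple rule, consider the naive approach sketched below with the standard library only. The function naive_sent_split is a made-up helper for this illustration and is not part of nltk. It works on our example text but breaks on an abbreviation.

```python
import re


def naive_sent_split(text):
    """Rough sketch: split after '.', '!' or '?' when whitespace follows."""
    # The lookbehind keeps the punctuation mark with the sentence
    # it ends; only the whitespace after it is consumed by the split.
    return re.split(r"(?<=[.!?])\s+", text.strip())


simple = naive_sent_split("I came. I saw. I conquered.")
tricky = naive_sent_split("Dr. Smith arrived. He sat down.")

print(simple)
# ['I came.', 'I saw.', 'I conquered.']
print(tricky)
# ['Dr.', 'Smith arrived.', 'He sat down.']
```

The second result wrongly treats ‘Dr.’ as a sentence of its own. Handling abbreviations like this is the kind of case the nltk sentence tokenizer is designed for.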

In our final example we will read text from a file on disk and call both tokenizers on the text. Here is our code:

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
import io


# The with block closes the file automatically once the text has been read
with io.open("quote.txt", mode="r", encoding="utf-8") as f:
    t = f.read()

print(word_tokenize(t))
print(sent_tokenize(t))

#Output:
#['“', 'We', 'have', 'seen', 'AI', 'providing', 'conversation', 'and', 'comfort', 
# 'to', 'the', 'lonely', ';', 'we', 'have', 'also', 'seen', 'AI', 'engaging', 'in', 'racial', 
# 'discrimination', '.', 'Yet', 'the', 'biggest', 'harm', 'that', 'AI', 'is', 'likely', 'to', 'do', 
# 'to', 'individuals', 'in', 'the', 'short', 'term', 'is', 'job', 'displacement', ',', 'as', 'the', 
# 'amount', 'of', 'work', 'we', 'can', 'automate', 'with', 'AI', 'is', 'vastly', 'larger', 'than', 'before', 
# '.', 'As', 'leaders', ',', 'it', 'is', 'incumbent', 'on', 'all', 'of', 'us', 'to', 'make', 'sure', 'we', 'are', 
# 'building', 'a', 'world', 'in', 'which', 'every', 'individual', 'has', 'an', 'opportunity', 'to', 'thrive.', '”', 
# 'Andrew', 'Ng', ',', 'Co-founder', 'and', 'lead', 'of', 'Google', 'Brain']

#Output:
#['“We have seen AI providing conversation and comfort to the lonely; we have also seen AI engaging in racial discrimination.', 
# 'Yet the biggest harm that AI is likely to do to individuals in the short term is job displacement, as the amount of work we can automate with AI is vastly larger than before.', 
# 'As leaders, it is incumbent on all of us to make sure we are building a world in which every individual has an opportunity to thrive.”\nAndrew Ng, Co-founder and lead of Google Brain']

Let’s explain what is happening here:

  1. We must prepare a file called quote.txt which will contain some text. This file should exist in the same folder as our Python script on the local disk (more precisely, in the folder the script is run from, since the relative path is resolved against the current working directory). For this example we are using the Windows 10 OS.
  2. We import the required nltk library and functions.
  3. We import the io library so that we can read from the file on disk.
  4. The io.open() function accepts the name of our file, the mode and the encoding as parameters and returns a file object. In Python 3, io.open is an alias for the built-in open() function.
  5. We read the entire contents of the file using the read() method. Because the file was opened in text mode with utf-8 encoding, read() returns a decoded string, which we store in t.
  6. We call the word_tokenize function to tokenize the string t read from the file by words and punctuation.
  7. We call the sent_tokenize function to tokenize the string t read from the file by sentence.

Once the script runs, we will see two lists printed to the console: a list of words and a list of sentences. This is exactly what we expect to see. Note that sent_tokenize keeps a newline character that falls inside a sentence, as the \n in the last sentence shows, while word_tokenize discards whitespace, including newlines.
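
If you do not have a quote.txt yet, the sketch below creates one in the current working directory and reads it back. The quote text is reconstructed from the output shown above, so the exact whitespace of the original file is an assumption; the variable names quote_text and quote_path are likewise only illustrative.

```python
from pathlib import Path

# Text reconstructed from the tutorial output; note the curly quotes
# and the newline before the attribution line.
quote_text = (
    "“We have seen AI providing conversation and comfort to the lonely; "
    "we have also seen AI engaging in racial discrimination. "
    "Yet the biggest harm that AI is likely to do to individuals in the "
    "short term is job displacement, as the amount of work we can automate "
    "with AI is vastly larger than before. "
    "As leaders, it is incumbent on all of us to make sure we are building "
    "a world in which every individual has an opportunity to thrive.”\n"
    "Andrew Ng, Co-founder and lead of Google Brain"
)

quote_path = Path("quote.txt")  # created in the current working directory
quote_path.write_text(quote_text, encoding="utf-8")

# Read the file back in text mode; the with block closes it automatically.
with open(quote_path, mode="r", encoding="utf-8") as f:
    t = f.read()

print(t == quote_text)
```

Writing and reading with an explicit utf-8 encoding matters here because the curly quotes are not plain ASCII characters, and the platform default encoding may differ between systems.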

Following is our full code:

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
import io

t = f"""
    Artificial intelligence will reach human levels by around 2029.
    Follow that out further to, say, 2045, and we will have multiplied 
    the intelligence – the human biological machine intelligence of our 
    civilization – a billion-fold.
    - Ray Kurzweil, American inventor and futurist.
    """

print(word_tokenize(t))

t = "I came. I saw. I conquered."

print(sent_tokenize(t))

# The with block closes the file automatically once the text has been read
with io.open("quote.txt", mode="r", encoding="utf-8") as f:
    t = f.read()

print(word_tokenize(t))
print(sent_tokenize(t))

Thanks for reading our tutorial. Now you can use the Python nltk library to tokenize an example string. Happy Coding! 👌👌👌

Tags: machine learning, nlp
