⏭️ Skipped - page is already crawled
| Filter | Status | Condition | Details |
|---|---|---|---|
| HTTP status | PASS | download_http_code = 200 | HTTP 200 |
| Age cutoff | PASS | download_stamp > now() - 6 MONTH | 0.1 months ago |
| History drop | PASS | isNull(history_drop_reason) | No drop reason |
| Spam/ban | PASS | fh_dont_index != 1 AND ml_spam_score = 0 | ml_spam_score=0 |
| Canonical | PASS | meta_canonical IS NULL OR = '' OR = src_unparsed | Not set |
| Property | Value |
|---|---|
| URL | https://realpython.com/nltk-nlp-python/ |
| Last Crawled | 2026-04-16 08:11:38 (2 days ago) |
| First Indexed | 2021-04-22 08:54:15 (4 years ago) |
| HTTP Status Code | 200 |
| Meta Title | Natural Language Processing With Python's NLTK Package – Real Python |
| Meta Description | In this beginner-friendly tutorial, you'll take your first steps with Natural Language Processing (NLP) and Python's Natural Language Toolkit (NLTK). You'll learn how to process unstructured data in order to be able to analyze it and draw conclusions from it. |
| Meta Canonical | null |
| Boilerpipe Text | (plain-text extraction of the article body; duplicates the Markdown field below) |
| Markdown |
# Natural Language Processing With Python's NLTK Package
by [Joanna Jablonski](https://realpython.com/nltk-nlp-python/#author)
Reading time estimate: 43 minutes
[basics](https://realpython.com/tutorials/basics/) [data-science](https://realpython.com/tutorials/data-science/)
Table of Contents
- [Getting Started With Pythonâs NLTK](https://realpython.com/nltk-nlp-python/#getting-started-with-pythons-nltk)
- [Tokenizing](https://realpython.com/nltk-nlp-python/#tokenizing)
- [Filtering Stop Words](https://realpython.com/nltk-nlp-python/#filtering-stop-words)
- [Stemming](https://realpython.com/nltk-nlp-python/#stemming)
- [Tagging Parts of Speech](https://realpython.com/nltk-nlp-python/#tagging-parts-of-speech)
- [Lemmatizing](https://realpython.com/nltk-nlp-python/#lemmatizing)
- [Chunking](https://realpython.com/nltk-nlp-python/#chunking)
- [Chinking](https://realpython.com/nltk-nlp-python/#chinking)
- [Using Named Entity Recognition (NER)](https://realpython.com/nltk-nlp-python/#using-named-entity-recognition-ner)
- [Getting Text to Analyze](https://realpython.com/nltk-nlp-python/#getting-text-to-analyze)
- [Using a Concordance](https://realpython.com/nltk-nlp-python/#using-a-concordance)
- [Making a Dispersion Plot](https://realpython.com/nltk-nlp-python/#making-a-dispersion-plot)
- [Making a Frequency Distribution](https://realpython.com/nltk-nlp-python/#making-a-frequency-distribution)
- [Finding Collocations](https://realpython.com/nltk-nlp-python/#finding-collocations)
- [Conclusion](https://realpython.com/nltk-nlp-python/#conclusion)
[Natural language processing](https://en.wikipedia.org/wiki/Natural_language_processing) (NLP) is a field that focuses on making natural human language usable by computer programs. **NLTK**, or [Natural Language Toolkit](https://www.nltk.org/), is a Python package that you can use for NLP.
A lot of the data that you could be analyzing is [unstructured data](https://en.wikipedia.org/wiki/Unstructured_data) and contains human-readable text. Before you can analyze that data programmatically, you first need to preprocess it. In this tutorial, you'll take your first look at the kinds of **text preprocessing** tasks you can do with NLTK so that you'll be ready to apply them in future projects. You'll also see how to do some basic **text analysis** and create **visualizations**.
If you're familiar with the [basics of using Python](https://realpython.com/products/python-basics-book/) and would like to get your feet wet with some NLP, then you've come to the right place.
**By the end of this tutorial, you'll know how to:**
- **Find text** to analyze
- **Preprocess** your text for analysis
- **Analyze** your text
- Create **visualizations** based on your analysis
Let's get Pythoning!
## Getting Started With Python's NLTK
The first thing you need to do is make sure that you have Python installed. For this tutorial, you'll be using Python 3.9. If you don't yet have Python installed, then check out [Python 3 Installation & Setup Guide](https://realpython.com/installing-python/) to get started.
Once you have that dealt with, your next step is to [install NLTK](https://www.nltk.org/install.html) with [`pip`](https://realpython.com/what-is-pip/). It's a best practice to install it in a virtual environment. To learn more about virtual environments, check out [Python Virtual Environments: A Primer](https://realpython.com/python-virtual-environments-a-primer/).
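If you want to follow that advice, here's a minimal sketch of creating and activating a virtual environment first (assuming a Unix-like shell; on Windows you'd run `venv\Scripts\activate` instead):
Shell
```
$ python -m venv venv
$ source venv/bin/activate
```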
For this tutorial, you'll be installing version 3.5:
Shell
```
$ python -m pip install nltk==3.5
```
In order to create visualizations for [named entity recognition](https://realpython.com/nltk-nlp-python/#using-named-entity-recognition-ner), you'll also need to install [NumPy](https://realpython.com/numpy-tutorial/) and [Matplotlib](https://realpython.com/python-matplotlib-guide/):
Shell
```
$ python -m pip install numpy matplotlib
```
If you'd like to know more about how `pip` works, then you can check out [What Is Pip? A Guide for New Pythonistas](https://realpython.com/what-is-pip/). You can also take a look at the official page on [installing NLTK data](https://www.nltk.org/data).
## Tokenizing
By **tokenizing**, you can conveniently split up text by word or by sentence. This will allow you to work with smaller pieces of text that are still relatively coherent and meaningful even outside of the context of the rest of the text. It's your first step in turning unstructured data into structured data, which is easier to analyze.
When you're analyzing text, you'll be tokenizing by word and tokenizing by sentence. Here's what both types of tokenization bring to the table:
- **Tokenizing by word:** Words are like the atoms of natural language. They're the smallest unit of meaning that still makes sense on its own. Tokenizing your text by word allows you to identify words that come up particularly often. For example, if you were analyzing a group of job ads, then you might find that the word "Python" comes up often. That could suggest high demand for Python knowledge, but you'd need to look deeper to know more.
- **Tokenizing by sentence:** When you tokenize by sentence, you can analyze how those words relate to one another and see more context. Are there a lot of negative words around the word "Python" because the hiring manager doesn't like Python? Are there more terms from the domain of [herpetology](https://en.wikipedia.org/wiki/Herpetology) than the domain of software development, suggesting that you may be dealing with an entirely different kind of [python](https://en.wikipedia.org/wiki/Pythonidae) than you were expecting?
Here's how to [import](https://realpython.com/absolute-vs-relative-python-imports/) the relevant parts of NLTK so you can tokenize by word and by sentence:
Python
```
>>> import nltk
>>> nltk.download("punkt")
>>> from nltk.tokenize import sent_tokenize, word_tokenize
```
Now that you've imported what you need, you can create a [string](https://realpython.com/python-strings/) to tokenize. Here's a quote from [*Dune*](https://en.wikipedia.org/wiki/Dune_\(novel\)) that you can use:
Python
```
>>> example_string = (
...     "Muad'Dib learned rapidly because his first training was in how to learn. "
...     "And the first lesson of all was the basic trust that he could learn. "
...     "It's shocking to find how many people do not believe they can learn, "
...     "and how many more believe learning to be difficult."
... )
```
You can use `sent_tokenize()` to split up `example_string` into sentences:
Python
```
>>> sent_tokenize(example_string)
["Muad'Dib learned rapidly because his first training was in how to learn.",
 'And the first lesson of all was the basic trust that he could learn.',
 "It's shocking to find how many people do not believe they can learn, and how many more believe learning to be difficult."]
```
Tokenizing `example_string` by sentence gives you a [list](https://realpython.com/python-lists-tuples/) of three strings that are sentences:
1. `"Muad'Dib learned rapidly because his first training was in how to learn."`
2. `'And the first lesson of all was the basic trust that he could learn.'`
3. `"It's shocking to find how many people do not believe they can learn, and how many more believe learning to be difficult."`
Now try tokenizing `example_string` by word:
Python
```
>>> word_tokenize(example_string)
["Muad'Dib", 'learned', 'rapidly', 'because', 'his', 'first', 'training',
 'was', 'in', 'how', 'to', 'learn', '.', 'And', 'the', 'first', 'lesson',
 'of', 'all', 'was', 'the', 'basic', 'trust', 'that', 'he', 'could',
 'learn', '.', 'It', "'s", 'shocking', 'to', 'find', 'how', 'many',
 'people', 'do', 'not', 'believe', 'they', 'can', 'learn', ',', 'and',
 'how', 'many', 'more', 'believe', 'learning', 'to', 'be', 'difficult', '.']
```
You got a list of strings that NLTK considers to be words, such as:
- `"Muad'Dib"`
- `'training'`
- `'how'`
But the following strings were also considered to be words:
- `"'s"`
- `','`
- `'.'`
See how `"It's"` was split at the apostrophe to give you `'It'` and `"'s"`, but `"Muad'Dib"` was left whole? This happened because NLTK knows that `'It'` and `"'s"` (a contraction of "is") are two distinct words, so it counted them separately. But `"Muad'Dib"` isn't an accepted contraction like `"It's"`, so it wasn't read as two separate words and was left intact.
## Filtering Stop Words
**Stop words** are words that you want to ignore, so you filter them out of your text when you're processing it. Very common words like `'in'`, `'is'`, and `'an'` are often used as stop words since they don't add a lot of meaning to a text in and of themselves.
Here's how to import the relevant parts of NLTK in order to filter out stop words:
Python
```
>>> import nltk
>>> nltk.download("stopwords")
>>> from nltk.corpus import stopwords
>>> from nltk.tokenize import word_tokenize
```
Here's a [quote from Worf](https://www.youtube.com/watch?v=ri5S4Hcq0nY) that you can filter:
Python
```
>>> worf_quote = "Sir, I protest. I am not a merry man!"
```
Now tokenize `worf_quote` by word and store the resulting list in `words_in_quote`:
Python
```
>>> words_in_quote = word_tokenize(worf_quote)
>>> words_in_quote
['Sir', ',', 'I', 'protest', '.', 'I', 'am', 'not', 'a', 'merry', 'man', '!']
```
You have a list of the words in `worf_quote`, so the next step is to create a [set](https://realpython.com/python-sets/) of stop words to filter `words_in_quote`. For this example, you'll need to focus on stop words in `"english"`:
Python
```
>>> stop_words = set(stopwords.words("english"))
```
Next, create an empty list to hold the words that make it past the filter:
Python
```
>>> filtered_list = []
```
You created an empty list, `filtered_list`, to hold all the words in `words_in_quote` that aren't stop words. Now you can use `stop_words` to filter `words_in_quote`:
Python
```
>>> for word in words_in_quote:
...     if word.casefold() not in stop_words:
...         filtered_list.append(word)
...
```
You iterated over `words_in_quote` with a [`for` loop](https://realpython.com/python-for-loop/) and added all the words that weren't stop words to `filtered_list`. You used [`.casefold()`](https://docs.python.org/3/library/stdtypes.html#str.casefold) on `word` so you could ignore whether the letters in `word` were uppercase or lowercase. This is worth doing because `stopwords.words('english')` includes only lowercase versions of stop words.
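To see why that call matters, here's a quick check, a sketch reusing `stop_words` from above. The capitalized pronoun only matches the all-lowercase stop word list once it's casefolded:
Python
```
>>> "I" in stop_words
False
>>> "I".casefold() in stop_words
True
```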
Alternatively, you could use a [list comprehension](https://realpython.com/list-comprehension-python/) to make a list of all the words in your text that aren't stop words:
Python
```
>>> filtered_list = [
...     word for word in words_in_quote if word.casefold() not in stop_words
... ]
```
When you use a list comprehension, you don't create an empty list and then add items to the end of it. Instead, you define the list and its contents at the same time. Using a list comprehension is often seen as more [Pythonic](https://realpython.com/learning-paths/writing-pythonic-code/).
Take a look at the words that ended up in `filtered_list`:
Python
```
>>> filtered_list
['Sir', ',', 'protest', '.', 'merry', 'man', '!']
```
You filtered out a few words like `'am'` and `'a'`, but you also filtered out `'not'`, which does affect the overall meaning of the sentence. (Worf won't be happy about this.)
Words like `'I'` and `'not'` may seem too important to filter out, and depending on what kind of analysis you want to do, they can be. Here's why:
- **`'I'`** is a pronoun, which is a context word rather than a content word:
- **Content words** give you information about the topics covered in the text or the sentiment that the author has about those topics.
- **Context words** give you information about writing style. You can observe patterns in how authors use context words in order to quantify their writing style. Once you've quantified their writing style, you can analyze a text written by an unknown author to see how closely it follows a particular writing style so you can try to identify who the author is.
- **`'not'`** is [technically an adverb](https://www.merriam-webster.com/dictionary/not) but has still been included in [NLTK's list of stop words for English](https://www.nltk.org/nltk_data/). If you want to edit the list of stop words to exclude `'not'` or make other changes, then you can [download it](https://www.nltk.org/nltk_data/).
So, `'I'` and `'not'` can be important parts of a sentence, but it depends on what you're trying to learn from that sentence.
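If you decide that `'not'` is too important to throw away, one option is to build a custom stop word set. Here's a minimal sketch, assuming you want to keep `'not'` and reuse `words_in_quote` from above; the variable name is illustrative:
Python
```
>>> custom_stop_words = set(stopwords.words("english")) - {"not"}
>>> [word for word in words_in_quote if word.casefold() not in custom_stop_words]
['Sir', ',', 'protest', '.', 'not', 'merry', 'man', '!']
```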
## Stemming
**Stemming** is a text processing task in which you reduce words to their [root](https://simple.wikipedia.org/wiki/Root_\(linguistics\)), which is the core part of a word. For example, the words "helping" and "helper" share the root "help." Stemming allows you to zero in on the basic meaning of a word rather than all the details of how it's being used. NLTK has [more than one stemmer](http://www.nltk.org/howto/stem.html), but you'll be using the [Porter stemmer](https://www.nltk.org/_modules/nltk/stem/porter.html).
Here's how to import the relevant parts of NLTK in order to start stemming:
Python
```
>>> from nltk.stem import PorterStemmer
>>> from nltk.tokenize import word_tokenize
```
Now that you're done importing, you can create a stemmer with `PorterStemmer()`:
Python
```
>>> stemmer = PorterStemmer()
```
The next step is for you to create a string to stem. Here's one you can use:
Python
```
>>> string_for_stemming = """
... The crew of the USS Discovery discovered many discoveries.
... Discovering is what explorers do."""
```
Before you can stem the words in that string, you need to separate all the words in it:
Python
```
>>> words = word_tokenize(string_for_stemming)
```
Now that you have a list of all the tokenized words from the string, take a look at what's in `words`:
Python
```
>>> words
['The', 'crew', 'of', 'the', 'USS', 'Discovery', 'discovered', 'many',
 'discoveries', '.', 'Discovering', 'is', 'what', 'explorers', 'do', '.']
```
Create a list of the stemmed versions of the words in `words` by using `stemmer.stem()` in a list comprehension:
Python
```
>>> stemmed_words = [stemmer.stem(word) for word in words]
```
Take a look at what's in `stemmed_words`:
Python
```
>>> stemmed_words
['the', 'crew', 'of', 'the', 'uss', 'discoveri', 'discov', 'mani',
 'discoveri', '.', 'discov', 'is', 'what', 'explor', 'do', '.']
```
Here's what happened to all the words that started with `'discov'` or `'Discov'`:
| Original word | Stemmed version |
|---|---|
| `'Discovery'` | `'discoveri'` |
| `'discovered'` | `'discov'` |
| `'discoveries'` | `'discoveri'` |
| `'Discovering'` | `'discov'` |
Those results look a little inconsistent. Why would `'Discovery'` give you `'discoveri'` when `'Discovering'` gives you `'discov'`?
Understemming and overstemming are two ways stemming can go wrong:
1. **Understemming** happens when two related words should be reduced to the same stem but aren't. This is a [false negative](https://en.wikipedia.org/wiki/False_positives_and_false_negatives#False_negative_error).
2. **Overstemming** happens when two unrelated words are reduced to the same stem even though they shouldn't be. This is a [false positive](https://en.wikipedia.org/wiki/False_positives_and_false_negatives#False_positive_error).
The [Porter stemming algorithm](https://tartarus.org/martin/PorterStemmer/) dates from 1979, so it's a little on the older side. The **Snowball stemmer**, which is also called **Porter2**, is an improvement on the original and is also available through NLTK, so you can use that one in your own projects. It's also worth noting that the purpose of the Porter stemmer is not to produce complete words but to find variant forms of a word.
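Swapping in the Snowball stemmer takes only a different import. Here's a minimal sketch; note that `SnowballStemmer` requires a language name:
Python
```
>>> from nltk.stem.snowball import SnowballStemmer
>>> snowball_stemmer = SnowballStemmer("english")
>>> snowball_stemmer.stem("Discovering")
'discov'
```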
Fortunately, you have some other ways to reduce words to their core meaning, such as lemmatizing, which you'll see later in this tutorial. But first, we need to cover parts of speech.
## Tagging Parts of Speech
**Part of speech** is a grammatical term that deals with the roles words play when you use them together in sentences. Tagging parts of speech, or **POS tagging**, is the task of labeling the words in your text according to their part of speech.
In English, there are eight parts of speech:
| Part of speech | Role | Examples |
|---|---|---|
| Noun | Is a person, place, or thing | mountain, bagel, Poland |
| Pronoun | Replaces a noun | you, she, we |
| Adjective | Gives information about what a noun is like | efficient, windy, colorful |
| Verb | Is an action or a state of being | learn, is, go |
| Adverb | Gives information about a verb, an adjective, or another adverb | efficiently, always, very |
| Preposition | Gives information about how a noun or pronoun is connected to another word | from, about, at |
| Conjunction | Connects two other words or phrases | so, because, and |
| Interjection | Is an exclamation | yay, ow, wow |
Some sources also include the category **articles** (like "a" or "the") in the list of parts of speech, but other sources consider them to be adjectives. NLTK uses the word **determiner** to refer to articles.
Here's how to import the relevant parts of NLTK in order to tag parts of speech:
Python
```
>>> import nltk
>>> from nltk.tokenize import word_tokenize
>>> nltk.download("averaged_perceptron_tagger")
```
Now create some text to tag. You can use this [Carl Sagan quote](https://www.youtube.com/watch?v=5_vVGPy4-rc):
Python
```
>>> sagan_quote = """
... If you wish to make an apple pie from scratch,
... you must first invent the universe."""
```
Use `word_tokenize` to separate the words in that string and store them in a list:
Python
```
>>> words_in_sagan_quote = word_tokenize(sagan_quote)
```
Now call `nltk.pos_tag()` on your new list of words:
Python
```
>>> nltk.pos_tag(words_in_sagan_quote)
[('If', 'IN'), ('you', 'PRP'), ('wish', 'VBP'), ('to', 'TO'), ('make', 'VB'),
 ('an', 'DT'), ('apple', 'NN'), ('pie', 'NN'), ('from', 'IN'), ('scratch', 'NN'),
 (',', ','), ('you', 'PRP'), ('must', 'MD'), ('first', 'VB'), ('invent', 'VB'),
 ('the', 'DT'), ('universe', 'NN'), ('.', '.')]
```
All the words in the quote are now in a separate [tuple](https://realpython.com/python-tuple/), with a tag that represents their part of speech. But what do the tags mean? Here's how to get a list of tags and their meanings:
Python
```
>>> nltk.help.upenn_tagset()
```
That's a lot to take in, but fortunately there are some patterns to help you remember what's what.
Here's a summary that you can use to get started with NLTK's POS tags:
| Tags that start with | Deal with |
|---|---|
| `JJ` | Adjectives |
| `NN` | Nouns |
| `RB` | Adverbs |
| `PRP` | Pronouns |
| `VB` | Verbs |
Now that you know what the POS tags mean, you can see that your tagging was fairly successful:
- `'pie'` was tagged `NN` because it's a singular noun.
- `'you'` was tagged `PRP` because it's a personal pronoun.
- `'invent'` was tagged `VB` because it's the base form of a verb.
But how would NLTK handle tagging the parts of speech in a text that is basically gibberish? [Jabberwocky](https://www.poetryfoundation.org/poems/42916/jabberwocky) is a [nonsense poem](https://en.wikipedia.org/wiki/Nonsense_verse) that doesn't technically mean much but is still written in a way that can convey some kind of meaning to English speakers.
Make a string to hold an excerpt from this poem:
Python
```
>>> jabberwocky_excerpt = """
... 'Twas brillig, and the slithy toves did gyre and gimble in the wabe:
... all mimsy were the borogoves, and the mome raths outgrabe."""
```
Use `word_tokenize` to separate the words in the excerpt and store them in a list:
Python
```
>>> words_in_excerpt = word_tokenize(jabberwocky_excerpt)
```
Call `nltk.pos_tag()` on your new list of words:
Python
```
>>> nltk.pos_tag(words_in_excerpt)
[("'T", 'NN'), ('was', 'VBD'), ('brillig', 'VBN'), (',', ','), ('and', 'CC'),
 ('the', 'DT'), ('slithy', 'JJ'), ('toves', 'NNS'), ...]
```
Accepted English words like `'and'` and `'the'` were correctly tagged as a conjunction and a determiner, respectively. The gibberish word `'slithy'` was tagged as an adjective, which is what a human English speaker would probably assume from the context of the poem as well. Way to go, NLTK!
## Lemmatizing
Now that you're up to speed on parts of speech, you can circle back to lemmatizing. Like stemming, **lemmatizing** reduces words to their core meaning, but it will give you a complete English word that makes sense on its own instead of just a fragment of a word like `'discoveri'`.
**Note:** A **lemma** is a word that represents a whole group of words, and that group of words is called a **lexeme**.
For example, if you were to look up [the word "blending" in a dictionary](https://www.merriam-webster.com/dictionary/blending), then you'd need to look at the entry for "blend," but you would find "blending" listed in that entry.
In this example, "blend" is the **lemma**, and "blending" is part of the **lexeme**. So when you lemmatize a word, you are reducing it to its lemma.
Here's how to import the relevant parts of NLTK in order to start lemmatizing:
Python
```
>>> from nltk.stem import WordNetLemmatizer
```
Create a lemmatizer to use:
Python
```
>>> lemmatizer = WordNetLemmatizer()
```
Let's start with lemmatizing a plural noun:
Python
```
>>> lemmatizer.lemmatize("scarves")
'scarf'
```
`"scarves"` gave you `'scarf'`, so thatâs already a bit more sophisticated than what you would have gotten with the Porter stemmer, which is `'scarv'`. Next, create a string with more than one word to lemmatize:
Python
```
>>> string_for_lemmatizing = "The friends of DeSoto love scarves."
```
Now tokenize that string by word:
Python
```
>>> words = word_tokenize(string_for_lemmatizing)
```
Here's your list of words:
Python
```
>>> words
['The', 'friends', 'of', 'DeSoto', 'love', 'scarves', '.']
```
Create a list containing all the words in `words` after they've been lemmatized:
Python
```
>>> lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
```
Here's the list you got:
Python
```
>>> lemmatized_words
['The', 'friend', 'of', 'DeSoto', 'love', 'scarf', '.']
```
That looks right. The plurals `'friends'` and `'scarves'` became the singulars `'friend'` and `'scarf'`.
But what would happen if you lemmatized a word that looked very different from its lemma? Try lemmatizing `"worst"`:
Python
```
>>> lemmatizer.lemmatize("worst")
'worst'
```
You got the result `'worst'` because `lemmatizer.lemmatize()` assumed that [`"worst"` was a noun](https://www.merriam-webster.com/dictionary/worst). You can make it clear that you want `"worst"` to be an adjective:
Python
```
>>> lemmatizer.lemmatize("worst", pos="a")
'bad'
```
The default parameter for `pos` is `'n'` for noun, but you made sure that `"worst"` was treated as an adjective by adding the parameter `pos="a"`. As a result, you got `'bad'`, which looks very different from your original word and is nothing like what you'd get if you were stemming. This is because `"worst"` is the [superlative](https://www.merriam-webster.com/dictionary/superlative) form of the adjective `'bad'`, and lemmatizing reduces superlatives as well as [comparatives](https://www.merriam-webster.com/dictionary/comparative) to their lemmas.
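The same applies to a comparative. As a quick check with the same lemmatizer:
Python
```
>>> lemmatizer.lemmatize("better", pos="a")
'good'
```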
Now that you know how to use NLTK to tag parts of speech, you can try tagging your words before lemmatizing them to avoid mixing up [homographs](https://en.wikipedia.org/wiki/Homograph), or words that are spelled the same but have different meanings and can be different parts of speech.
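One common way to wire the two together is to map NLTK's Penn Treebank tags onto the POS constants that the lemmatizer accepts. The sketch below isn't part of NLTK itself, and `penn_to_wordnet()` is an illustrative helper name; it reuses `lemmatizer` and `words` from above:
Python
```
>>> from nltk.corpus import wordnet
>>> def penn_to_wordnet(tag):
...     """Map a Penn Treebank tag to a WordNet POS constant (noun by default)."""
...     if tag.startswith("J"):
...         return wordnet.ADJ
...     elif tag.startswith("V"):
...         return wordnet.VERB
...     elif tag.startswith("R"):
...         return wordnet.ADV
...     return wordnet.NOUN
...
>>> [lemmatizer.lemmatize(word, penn_to_wordnet(tag)) for word, tag in nltk.pos_tag(words)]
['The', 'friend', 'of', 'DeSoto', 'love', 'scarf', '.']
```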
## Chunking
While tokenizing allows you to identify words and sentences, **chunking** allows you to identify **phrases**.
**Note:** A **phrase** is a word or group of words that works as a single unit to perform a grammatical function. **Noun phrases** are built around a noun.
Here are some examples:
- "A planet"
- "A tilting planet"
- "A swiftly tilting planet"
Chunking makes use of POS tags to group words and apply chunk tags to those groups. Chunks don't overlap, so one instance of a word can be in only one chunk at a time.
Here's how to import the relevant parts of NLTK in order to chunk:
Python
```
>>> import nltk
>>> from nltk.tokenize import word_tokenize
```
Before you can chunk, you need to make sure that the parts of speech in your text are tagged, so create a string for POS tagging. You can use this quote from [*The Lord of the Rings*](https://en.wikipedia.org/wiki/The_Lord_of_the_Rings):
Python
```
>>> lotr_quote = "It's a dangerous business, Frodo, going out your door."
```
Now tokenize that string by word:
Python
```
>>> words_in_lotr_quote = word_tokenize(lotr_quote)
>>> words_in_lotr_quote
['It', "'s", 'a', 'dangerous', 'business', ',', 'Frodo', ',', 'going',
 'out', 'your', 'door', '.']
```
Now you've got a list of all of the words in `lotr_quote`.
The next step is to tag those words by part of speech:
Python
```
>>> lotr_pos_tags = nltk.pos_tag(words_in_lotr_quote)
>>> lotr_pos_tags
[('It', 'PRP'), ("'s", 'VBZ'), ('a', 'DT'), ('dangerous', 'JJ'),
 ('business', 'NN'), (',', ','), ('Frodo', 'NNP'), (',', ','),
 ('going', 'VBG'), ('out', 'RP'), ('your', 'PRP$'), ('door', 'NN'), ('.', '.')]
```
You've got a list of tuples of all the words in the quote, along with their POS tag. In order to chunk, you first need to define a chunk grammar.
**Note:** A **chunk grammar** is a combination of rules on how sentences should be chunked. It often uses [regular expressions](https://realpython.com/regex-python/), or **regexes**.
For this tutorial, you don't need to know how regular expressions work, but they will definitely [come in handy](https://xkcd.com/208/) for you in the future if you want to process text.
Create a chunk grammar with one regular expression rule:
Python
```
>>> grammar = "NP: {<DT>?<JJ>*<NN>}"
```
`NP` stands for noun phrase. You can learn more about **noun phrase chunking** in [Chapter 7](https://www.nltk.org/book/ch07.html#noun-phrase-chunking) of *Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit*.
According to the rule you created, your chunks:
1. Start with an optional (`?`) determiner (`<DT>`)
2. Can have any number (`*`) of adjectives (`<JJ>`)
3. End with a noun (`<NN>`)
Create a **chunk parser** with this grammar:
Python
```
>>> chunk_parser = nltk.RegexpParser(grammar)
```
Now try it out with your quote:
Python
```
>>> tree = chunk_parser.parse(lotr_pos_tags)
```
Here's how you can see a visual representation of this tree:
Python
```
>>> tree.draw()
```
This is what the visual representation looks like:
[Chunk tree for *The Lord of the Rings* quote](https://files.realpython.com/media/lotr_tree.7588d4620a3d.jpg)
You got two noun phrases:
1. **`'a dangerous business'`** has a determiner, an adjective, and a noun.
2. **`'door'`** has just a noun.
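If you'd rather inspect the result programmatically than in a pop-up window, you can also walk the tree yourself. Here's a small sketch that uses the `tree` you just built:
Python
```
>>> for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
...     # Each chunk's leaves are (word, tag) tuples, so join just the words
...     print(" ".join(word for word, tag in subtree.leaves()))
...
a dangerous business
door
```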
Now that you know about chunking, it's time to look at chinking.
## Chinking
Chinking is used together with chunking, but while chunking is used to include a pattern, **chinking** is used to exclude a pattern.
Let's reuse the quote you used in the section on chunking. You already have a list of tuples containing each of the words in the quote along with its part of speech tag:
Python
```
>>> lotr_pos_tags
[('It', 'PRP'), ("'s", 'VBZ'), ('a', 'DT'), ('dangerous', 'JJ'),
('business', 'NN'), (',', ','), ('Frodo', 'NNP'), (',', ','),
('going', 'VBG'), ('out', 'RP'), ('your', 'PRP$'), ('door', 'NN'),
('.', '.')]
```
The next step is to create a grammar to determine what you want to include and exclude in your chunks. This time, you're going to have more than one rule, so the grammar will span more than one line, and you'll wrap it in triple quotes (`"""`):
Python
```
>>> grammar = """
... Chunk: {<.*>+}
...        }<JJ>{"""
```
The first rule of your grammar is `{<.*>+}`. This rule has curly braces that face inward (`{}`) because it's used to determine what patterns you want to include in your chunks. In this case, you want to include everything: `<.*>+`.
The second rule of your grammar is `}<JJ>{`. This rule has curly braces that face outward (`}{`) because it's used to determine what patterns you want to exclude from your chunks. In this case, you want to exclude adjectives: `<JJ>`.
Create a chunk parser with this grammar:
Python
```
>>> chunk_parser = nltk.RegexpParser(grammar)
```
Now chunk your sentence with the chink you specified:
Python
```
>>> tree = chunk_parser.parse(lotr_pos_tags)
```
You get this tree as a result:
Python
```
>>> tree
Tree('S', [Tree('Chunk', [('It', 'PRP'), ("'s", 'VBZ'), ('a', 'DT')]),
('dangerous', 'JJ'), Tree('Chunk', [('business', 'NN'), (',', ','),
('Frodo', 'NNP'), (',', ','), ('going', 'VBG'), ('out', 'RP'),
('your', 'PRP$'), ('door', 'NN'), ('.', '.')])])
```
In this case, `('dangerous', 'JJ')` was excluded from the chunks because it's an adjective (`JJ`). But that will be easier to see if you get a graphic representation again:
Python
```
>>> tree.draw()
```
You get this visual representation of the `tree`:
[Chunk tree with the adjective chinked out](https://files.realpython.com/media/chinking.f731be30df2b.jpg)
Here, you've excluded the adjective `'dangerous'` from your chunks and are left with two chunks containing everything else. The first chunk has all the text that appeared before the adjective that was excluded. The second chunk contains everything after the adjective that was excluded.
Now that you know how to exclude patterns from your chunks, it's time to look into named entity recognition (NER).
## Using Named Entity Recognition (NER)
**Named entities** are noun phrases that refer to specific locations, people, organizations, and so on. With **named entity recognition**, you can find the named entities in your texts and also determine what kind of named entity they are.
Here's the list of named entity types from the [NLTK book](https://www.nltk.org/book/ch07.html#sec-ner):
| NE type | Examples |
|---|---|
| ORGANIZATION | Georgia-Pacific Corp., WHO |
| PERSON | Eddy Bonte, President Obama |
| LOCATION | Murray River, Mount Everest |
| DATE | June, 2008-06-29 |
| TIME | two fifty a m, 1:30 p.m. |
| MONEY | 175 million Canadian dollars, GBP 10.40 |
| PERCENT | twenty pct, 18.75 % |
| FACILITY | Washington Monument, Stonehenge |
| GPE | South East Asia, Midlothian |
You can use `nltk.ne_chunk()` to recognize named entities. Let's use `lotr_pos_tags` again to test it out:
Python
```
>>> tree = nltk.ne_chunk(lotr_pos_tags)
```
Now take a look at the visual representation:
Python
```
>>> tree.draw()
```
Here's what you get:
[NER tree with Frodo tagged as PERSON](https://files.realpython.com/media/frodo_person.56bb306284f6.jpg)
See how `Frodo` has been tagged as a `PERSON`? You also have the option to use the parameter `binary=True` if you just want to know what the named entities are but not what kind of named entity they are:
Python
```
>>> tree = nltk.ne_chunk(lotr_pos_tags, binary=True)
>>> tree.draw()
```
Now all you see is that `Frodo` is an `NE`:
[NER tree with Frodo tagged as NE](https://files.realpython.com/media/frodo_ne.83c55460f7f6.jpg)
That's how you can identify named entities! But you can take this one step further and extract named entities directly from your text. Create a string from which to extract named entities. You can use this quote from [*The War of the Worlds*](https://en.wikipedia.org/wiki/The_War_of_the_Worlds):
Python
```
>>> quote = """
... Men like Schiaparelli watched the red planet—it is odd, by-the-bye, that
... for countless centuries Mars has been the star of war—but failed to
... interpret the fluctuating appearances of the markings they mapped so well.
... All that time the Martian must have been getting ready.
...
... During the opposition of 1894 a great light was seen on the illuminated
... part of the disk, first at the Lick Observatory, then by Perrotin of Nice,
... and then by other observers. English readers heard of it first in the
... issue of Nature dated August 2."""
```
Now create a function to extract named entities:
Python
```
>>> def extract_ne(quote):
...     words = word_tokenize(quote)
...     tags = nltk.pos_tag(words)
...     tree = nltk.ne_chunk(tags, binary=True)
...     return set(
...         " ".join(i[0] for i in t)
...         for t in tree
...         if hasattr(t, "label") and t.label() == "NE"
...     )
```
With this function, you gather all named entities, with no repeats. In order to do that, you tokenize by word, apply part of speech tags to those words, and then extract named entities based on those tags. Because you included `binary=True`, the named entities you'll get won't be labeled more specifically. You'll just know that they're named entities.
Take a look at the information you extracted:
Python
```
>>> extract_ne(quote)
{'Lick Observatory', 'Mars', 'Nature', 'Perrotin', 'Schiaparelli'}
```
You missed the city of Nice, possibly because NLTK interpreted it as a regular English adjective, but you still got the following:
- **An institution:** `'Lick Observatory'`
- **A planet:** `'Mars'`
- **A publication:** `'Nature'`
- **People:** `'Perrotin'`, `'Schiaparelli'`
That's some pretty decent variety!
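If you want the entity types along with the entities, a variation on `extract_ne()` that drops `binary=True` could look like the sketch below. The `extract_typed_ne()` name and the set of `(label, entity)` pairs it returns are illustrative choices, not part of NLTK:
Python
```
>>> def extract_typed_ne(quote):
...     tree = nltk.ne_chunk(nltk.pos_tag(word_tokenize(quote)))
...     # Subtrees are labeled chunks such as PERSON or GPE, while plain
...     # tuples are words that aren't part of any named entity.
...     return {
...         (t.label(), " ".join(word for word, tag in t))
...         for t in tree
...         if hasattr(t, "label")
...     }
```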
## Getting Text to Analyze
Now that you've done some text processing tasks with small example texts, you're ready to analyze a bunch of texts at once. A group of texts is called a **corpus**. NLTK provides several **corpora** covering everything from novels hosted by [Project Gutenberg](https://www.gutenberg.org/) to inaugural speeches by presidents of the United States.
In order to analyze texts in NLTK, you first need to import them. This requires `nltk.download("book")`, which is a pretty big download:
Python
```
>>> nltk.download("book")
>>> from nltk.book import *
```
You now have access to a few linear texts (such as *Sense and Sensibility* and *Monty Python and the Holy Grail*) as well as a few groups of texts (such as a chat corpus and a personals corpus). Human nature is fascinating, so let's see what we can find out by taking a closer look at the personals corpus!
This corpus is a collection of [personals ads](https://en.wikipedia.org/wiki/Personal_advertisement), which were an early version of online dating. If you wanted to meet someone, then you could place an ad in a newspaper and wait for other readers to respond to you.
If you'd like to learn how to get other texts to analyze, then you can check out [Chapter 3](https://www.nltk.org/book/ch03.html) of *Natural Language Processing with Python—Analyzing Text with the Natural Language Toolkit*.
## Using a Concordance
When you use a **concordance**, you can see each time a word is used, along with its immediate context. This can give you a peek into how a word is being used at the sentence level and what words are used with it.
Let's see what these good people looking for love have to say! The personals corpus is called `text8`, so we're going to call `.concordance()` on it with the parameter `"man"`:
Python
```
>>> text8.concordance("man")
```
Interestingly, the last three of those fourteen matches have to do with seeking an honest man, specifically:
1. `SEEKING HONEST MAN`
2. `Seeks 35 - 45 , honest man with good SOH & similar interests`
3. `genuine , caring , honest and normal man for fship , poss rship`
Let's see if there's a similar pattern with the word `"woman"`:
Python
```
>>> text8.concordance("woman")
```
The issue of honesty came up in the first match only:
Shell
```
Seeking an honest , caring woman , slim or med . build
```
Dipping into a corpus with a concordance won't give you the full picture, but it can still be interesting to take a peek and see if anything stands out.
## Making a Dispersion Plot
You can use a **dispersion plot** to see how much a particular word appears and where it appears. So far, we've looked for `"man"` and `"woman"`, but it would be interesting to see how much those words are used compared to their synonyms:
Python
```
>>> text8.dispersion_plot(
...     ["woman", "lady", "girl", "gal", "man", "gentleman", "boy", "guy"]
... )
```
Here's the dispersion plot you get:
[Dispersion plot of synonyms in the personals corpus](https://files.realpython.com/media/dispersion-plot.609e2e61b885.png)
Each vertical blue line represents one instance of a word. Each horizontal row of blue lines represents the corpus as a whole. This plot shows that:
- `"lady"` was used a lot more than `"woman"` or `"girl"`. There were no instances of `"gal"`.
- `"man"` and `"guy"` were used a similar number of times and were more common than `"gentleman"` or `"boy"`.
You use a dispersion plot when you want to see where words show up in a text or corpus. If you're analyzing a single text, this can help you see which words show up near each other. If you're analyzing a corpus of texts that is organized chronologically, it can help you see which words were being used more or less over a period of time.
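For example, the book collection you downloaded earlier includes the inaugural address corpus as `text4`, which is ordered chronologically. The NLTK book's classic example plots a few politically loaded words across two centuries of speeches:
Python
```
>>> text4.dispersion_plot(
...     ["citizens", "democracy", "freedom", "duties", "America"]
... )
```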
Staying on the theme of romance, see what you can find out by making a dispersion plot for *Sense and Sensibility*, which is `text2`. Jane Austen novels talk a lot about people's homes, so make a dispersion plot with the names of a few homes:
Python
```
>>> text2.dispersion_plot(["Allenham", "Whitwell", "Cleveland", "Combe"])
```
Here's the plot you get:
[Dispersion plot of home names in Sense and Sensibility](https://files.realpython.com/media/homes-dispersion-plot.c89fcb3954ec.png)
Apparently Allenham is mentioned a lot in the first third of the novel and then doesn't come up much again. Cleveland, on the other hand, barely comes up in the first two thirds but shows up a fair bit in the last third. This distribution reflects changes in the relationship between [Marianne](https://en.wikipedia.org/wiki/Marianne_Dashwood) and [Willoughby](https://en.wikipedia.org/wiki/John_Willoughby):
- **Allenham** is the home of Willoughby's benefactress and comes up a lot when Marianne is first interested in him.
- **Cleveland** is a home that Marianne stays at after she goes to see Willoughby in London and things go wrong.
Dispersion plots are just one type of visualization you can make for textual data. The next one you'll take a look at is frequency distributions.
[Remove ads](https://realpython.com/account/join/)
## Making a Frequency Distribution
With a **frequency distribution**, you can check which words show up most frequently in your text. You'll need to get started with an `import`:
Python
```
>>> from nltk import FreqDist
```
[`FreqDist`](https://github.com/nltk/nltk/blob/1805fe870635afb7ef16d4ff5373e1c3d97c9107/nltk/probability.py#L61) is a subclass of `collections.Counter`. Here's how to create a frequency distribution of the entire corpus of personals ads:
Python
```
>>> frequency_distribution = FreqDist(text8)
>>> print(frequency_distribution)
<FreqDist with 1108 samples and 4867 outcomes>
```
Since `1108` samples and `4867` outcomes is a lot of information, start by narrowing that down. Here's how to see the `20` most common words in the corpus:
Python
```
>>> frequency_distribution.most_common(20)
```
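Because `FreqDist` subclasses `collections.Counter`, you can also query it like a dictionary of counts. A quick sketch, assuming the `frequency_distribution` you just built:
Python
```
>>> count = frequency_distribution["lady"]  # dict-style lookup, like any Counter
>>> count / frequency_distribution.N() == frequency_distribution.freq("lady")
True
```
Here, `.N()` is the total number of word tokens counted, and `.freq()` is just a sample's count divided by that total.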
You have a lot of stop words in your frequency distribution, but you can remove them just as you did [earlier](https://realpython.com/nltk-nlp-python/#filtering-stop-words). Create a list of all of the words in `text8` that aren't stop words:
Python
```
>>> meaningful_words = [
...     word for word in text8 if word.casefold() not in stop_words
... ]
```
Now that you have a list of all of the words in your corpus that aren't stop words, make a frequency distribution:
Python
```
>>> frequency_distribution = FreqDist(meaningful_words)
```
Take a look at the `20` most common words:
Python
```
>>> frequency_distribution.most_common(20)
```
You can turn this list into a graph:
Python
```
>>> frequency_distribution.plot(20, cumulative=True)
```
Here's the graph you get:
[Cumulative frequency plot of the 20 most common meaningful words](https://files.realpython.com/media/freq-dist.1812fe36b438.png)
Some of the most common words are:
- `'lady'`
- `'seeks'`
- `'ship'`
- `'relationship'`
- `'fun'`
- `'slim'`
- `'build'`
- `'smoker'`
- `'50'`
- `'non'`
- `'movies'`
- `'good'`
- `'honest'`
From what you've already learned about the people writing these personals ads, they did seem interested in honesty and used the word `'lady'` a lot. In addition, `'slim'` and `'build'` both show up the same number of times. You saw `slim` and `build` used near each other when you were learning about [concordances](https://realpython.com/nltk-nlp-python/#using-a-concordance), so maybe those two words are commonly used together in this corpus. That brings us to collocations!
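Since `frequency_distribution` is still in scope, you can verify that equal-count claim directly with its `Counter`-style lookups:
Python
```
>>> frequency_distribution["slim"] == frequency_distribution["build"]
True
```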
## Finding Collocations
A **collocation** is a sequence of words that shows up often. If you're interested in common collocations in English, then you can check out [*The BBI Dictionary of English Word Combinations*](https://realpython.com/asins/902723261X/). It's a handy reference you can use to help you make sure your writing is [idiomatic](https://en.wikipedia.org/wiki/English-language_idioms). Here are some examples of collocations that use the word "tree":
- Syntax tree
- Family tree
- Decision tree
To see pairs of words that come up often in your corpus, you need to call `.collocations()` on it:
Python
```
>>> text8.collocations()
would like; medium build; social drinker; quiet nights; non smoker;
long term; age open; Would like; easy going; financially secure; fun
times; similar interests; Age open; weekends away; poss rship; well
presented; never married; single mum; permanent relationship; slim
build
```
`slim build` did show up, as did `medium build` and several other word combinations. No long walks on the beach though!
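Under the hood, `.collocations()` ranks bigrams with an association measure. If you ever want more control over that ranking, NLTK exposes the machinery directly. Here's a minimal sketch using `BigramCollocationFinder`; the frequency cutoff of `3`, the likelihood-ratio measure, and the variable names are illustrative choices, not something this corpus requires:
Python
```
>>> from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
>>> measures = BigramAssocMeasures()
>>> finder = BigramCollocationFinder.from_words(text8)
>>> finder.apply_freq_filter(3)  # ignore bigrams that appear fewer than three times
>>> top_ten = finder.nbest(measures.likelihood_ratio, 10)  # ten highest-scoring bigrams
```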
But what would happen if you looked for collocations after lemmatizing the words in your corpus? Would you find some word combinations that you missed the first time around because they came up in slightly varied versions?
If you followed the instructions [earlier](https://realpython.com/nltk-nlp-python/#lemmatizing), then you'll already have a `lemmatizer`, but you can't call `.collocations()` on just any [data type](https://realpython.com/python-data-types/), so you're going to need to do some prep work. Start by creating a list of the lemmatized versions of all the words in `text8`:
Python
```
>>> lemmatized_words = [lemmatizer.lemmatize(word) for word in text8]
```
But in order for you to be able to do the linguistic processing tasks you've seen so far, you need to make an [NLTK text](https://www.nltk.org/api/nltk.html#nltk.text.Text) with this list:
Python
```
>>> new_text = nltk.Text(lemmatized_words)
```
Here's how to see the collocations in your `new_text`:
Python
```
>>> new_text.collocations()
medium build; social drinker; non smoker; long term; would like; age
open; easy going; financially secure; Would like; quiet night; Age
open; well presented; never married; single mum; permanent
relationship; slim build; year old; similar interest; fun time; photo
pls
```
Compared to your previous list of collocations, this new one is missing a few:
- `weekends away`
- `poss rship`
The idea of `quiet nights` still shows up in the lemmatized version, `quiet night`. Your latest search for collocations also brought up a few new ones:
- **`year old`** suggests that users often mention ages.
- **`photo pls`** suggests that users often request one or more photos.
That's how you can find common word combinations to see what people are talking about and how they're talking about it!
## Conclusion
Congratulations on taking your first steps with **NLP**! A whole new world of unstructured data is now open for you to explore. Now that you've covered the basics of text analytics tasks, you can get out there and find some texts to analyze and see what you can learn about the texts themselves as well as the people who wrote them and the topics they're about.
**Now you know how to:**
- **Find text** to analyze
- **Preprocess** your text for analysis
- **Analyze** your text
- Create **visualizations** based on your analysis
For your next step, you can use NLTK to analyze a text to see whether the sentiments expressed in it are positive or negative. To learn more about sentiment analysis, check out [Sentiment Analysis: First Steps With Python's NLTK Library](https://realpython.com/python-nltk-sentiment-analysis/). If you'd like to dive deeper into the nuts and bolts of **NLTK**, then you can work your way through [*Natural Language Processing with Python—Analyzing Text with the Natural Language Toolkit*](https://www.nltk.org/book/).
Now get out there and find yourself some text to analyze!