⚠️ Skipped - page is already crawled
| Filter | Status | Condition | Details |
|---|---|---|---|
| HTTP status | PASS | download_http_code = 200 | HTTP 200 |
| Age cutoff | PASS | download_stamp > now() - 6 MONTH | 0.1 months ago |
| History drop | PASS | isNull(history_drop_reason) | No drop reason |
| Spam/ban | PASS | fh_dont_index != 1 AND ml_spam_score = 0 | ml_spam_score=0 |
| Canonical | PASS | meta_canonical IS NULL OR = '' OR = src_unparsed | Not set |
| Property | Value |
|---|---|
| URL | https://builtin.com/machine-learning/nlp-machine-learning |
| Last Crawled | 2026-04-03 19:47:38 (3 days ago) |
| First Indexed | 2021-11-10 20:31:36 (4 years ago) |
| HTTP Status Code | 200 |
| Meta Title | NLP Machine Learning: Build an NLP Classifier \| Built In |
| Meta Description | Try your hand at NLP with this machine learning tutorial. |
| Meta Canonical | null |
| Readable Markdown | # A Step-by-Step NLP Machine Learning Classifier Tutorial
Natural language processing influences your life every day. Here's a tutorial to help you try it out for yourself.
Written by [Badreesh Shetty](https://builtin.com/authors/badreesh-shetty); updated by [Hal Koss](https://builtin.com/authors/hal-koss), Jul 19, 2022
Natural Language Processing (NLP) is a subfield of machine learning that makes it possible for computers to understand, analyze, manipulate and generate human language. You encounter NLP machine learning in your everyday life — from spam detection, to autocorrect, to your digital assistant ("Hey, Siri?"). You may even encounter NLP without realizing it. In this article, I'll show you how to develop your own NLP projects with the Natural Language Toolkit (NLTK), but before we dive into the tutorial, let's look at some everyday examples of NLP.
## Examples of NLP Machine Learning
- Email spam filters
- Auto-correct
- Predictive text
- Speech recognition
- Information retrieval
- Information extraction
- Machine translation
- Text simplification
- Sentiment analysis
- Text summarization
- Query response
- Natural language generation
More From Our Experts: [Artificial Intelligence vs. Machine Learning vs. Deep Learning](https://builtin.com/artificial-intelligence/ai-vs-machine-learning)
## Get Started With NLP
NLTK is a popular open-source suite of Python libraries. Rather than building all of your NLP tools from scratch, NLTK provides implementations of common NLP tasks so you can jump right in. In this tutorial, I'll show you how to perform basic NLP tasks and use a machine learning classifier to predict whether an SMS is spam (a harmful, malicious or unwanted message) or ham (something you might actually want to read). You can find all the code below in this [GitHub repo](https://github.com/BadreeshShetty/Natural-Language-Processing-NLP-for-Machine-Learning).
First things first, you'll want to [install NLTK](https://pypi.python.org/pypi/nltk).
Type `!pip install nltk` in a Jupyter Notebook. If it doesn't work in cmd, type `conda install -c conda-forge nltk`. You shouldn't have to do much troubleshooting beyond that.
### Importing NLTK Library
```
import nltk
nltk.download()  # opens the NLTK downloader for corpora and models
```
This code gives us the NLTK downloader application, which is helpful in all the NLP tasks that follow.

In the downloader, I've already installed the Stopwords Corpus on my system, which helps remove redundant words. You'll be able to install whatever packages will be most useful to your project.
## Prepare Your Data for NLP
### Reading in Text Data
Our data comes to us in a structured or unstructured format. A structured format has a well-defined pattern: Excel and Google Sheets, for example, hold structured data. Unstructured data, by contrast, has no discernible pattern (e.g., images, audio files, social media posts). In between these two types we find semi-structured formats, and language is a great example of semi-structured data.

Access raw code [here](https://gist.github.com/BadreeshShetty/24d98bfa84dbe8d0589ba824028700d2).
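The original shows this step as a screenshot; here is a minimal sketch of the same idea. The file name is an assumption (the classifier is built on a tab-separated SMS spam/ham corpus; see the repo for the exact file):
```
# Read the raw corpus as one string (file name is illustrative).
with open("SMSSpamCollection.tsv") as f:
    raw_data = f.read()

# Labels and messages are interleaved with tabs and newlines,
# which is hard to scan by eye.
print(raw_data[:500])
```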
As we can see from the code above, when we read semi-structured data, it's hard for a computer (and a human!) to interpret. We can use Pandas to help us understand our data.

Access raw code [here](https://gist.github.com/BadreeshShetty/8755fad754100ce47555341feab4ac6c).
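A sketch of the Pandas version, under the same file-name assumption; the column names `label` and `body_text` match the columns the later code indexes into:
```
import pandas as pd

data = pd.read_csv("SMSSpamCollection.tsv", sep="\t", header=None,
                   names=["label", "body_text"])
print(data.head())
```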
With the help of Pandas we can now see and interpret our semi-structured data more clearly.
## How to Clean Your Data
Cleaning up your text data is necessary to highlight the attributes you want your machine learning system to pick up on. Cleaning (or pre-processing) the data typically consists of the following five steps.
## How to Clean Your Data for NLP
1. Remove punctuation
2. Tokenize
3. Remove stop words
4. Stem
5. Lemmatize
### 1\. Remove Punctuation
Punctuation can provide grammatical context to a sentence, which supports human understanding. But for our vectorizer, which counts the number of words and not the context, punctuation does not add value. So we need to remove all special characters. For example, "How are you?" becomes: How are you
Here's how to do it:

In `body_text_clean`, you can see we've removed all punctuation: "I've" becomes "Ive", and "WILL!!" becomes "WILL".
### 2\. Tokenize
Tokenizing separates text into units such as sentences or words. In other words, this function gives structure to previously unstructured text. For example: "Plata o Plomo" becomes "Plata", "o", "Plomo".

Access raw code [here](https://gist.github.com/BadreeshShetty/0861309de75358fd788235ff99a73aa5).
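A sketch of the tokenizing step, splitting on non-word characters with `re` (the helper name is an assumption):
```
import re

def tokenize(text):
    # Split on one or more non-word characters and lowercase the result.
    return re.split(r"\W+", text.lower())

data['body_text_tokenized'] = data['body_text_clean'].apply(tokenize)
```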
In `body_text_tokenized`, we've generated all the words as tokens.
### 3\. Remove Stop Words
Stop words are common words that will likely appear in any text. They don't tell us much about our data, so we remove them. Again, these are words that are great for human understanding but will confuse your machine learning program. For example: "silver or lead is fine for me" becomes "silver", "lead", "fine".

Access raw code [here](https://gist.github.com/BadreeshShetty/cc86f564e5b368c1ec9a6d19ecd682ff).
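A sketch using NLTK's English stop-word list (run `nltk.download('stopwords')` once first):
```
import nltk

stopwords = nltk.corpus.stopwords.words('english')

def remove_stopwords(tokens):
    # Keep only tokens that are not in the stop-word list.
    return [word for word in tokens if word not in stopwords]

data['body_text_nostop'] = data['body_text_tokenized'].apply(remove_stopwords)
```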
In `body_text_nostop`, we remove all unnecessary words like "been," "for," and "the."
### 4\. Stem
Stemming helps reduce a word to its stem form. It often makes sense to treat related words in the same way. Stemming removes suffixes like "ing," "ly" and "s" with a simple rule-based approach. It reduces the corpus of words, but often the actual words are lost, in a sense. For example: "Entitling" or "Entitled" become "Entitl."
*Note: Some search engines treat words with the same stem as synonyms.*

Access raw code [here](https://gist.github.com/BadreeshShetty/3bd45483ebc032dc768caaef9ea77c1b).
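A sketch using NLTK's Porter stemmer, the classic rule-based stemmer:
```
import nltk

ps = nltk.PorterStemmer()

def stemming(tokens):
    # Reduce each token to its stem with Porter's suffix-stripping rules.
    return [ps.stem(word) for word in tokens]

data['body_text_stemmed'] = data['body_text_nostop'].apply(stemming)
```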
In `body_text_stemmed`, words like "entry" and "goes" are stemmed to "entri" and "goe," even though they don't mean anything in English.
### 5\. Lemmatize
Lemmatizing derives the root form ("lemma") of a word. This practice is more robust than stemming because it uses a dictionary-based approach (i.e., a morphological analysis) to find the root word. For example, "Entitling" or "Entitled" become "Entitle."
In short, stemming is typically faster, as it simply chops off the end of the word without understanding the word's context. Lemmatizing is slower but more accurate because it performs an informed analysis with the word's context in mind.

Access raw code [here](https://gist.github.com/BadreeshShetty/d421a71ab1ce7c25f04ae651171d0aa0).
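A sketch with NLTK's WordNet lemmatizer (run `nltk.download('wordnet')` once first; the output column name is an assumption):
```
import nltk

wn = nltk.WordNetLemmatizer()

def lemmatizing(tokens):
    # Look up each token's dictionary form (lemma) in WordNet.
    return [wn.lemmatize(word) for word in tokens]

data['body_text_lemmatized'] = data['body_text_nostop'].apply(lemmatizing)
```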
Comparing the lemmatized output with the stemmed output, we can see words like "chances" are lemmatized to "chance" but stemmed to "chanc."
Want More Data Science Tutorials? [Need to Automate Your Data Analysis? Here's How.](https://builtin.com/data-science/automate-data-analysis)
## Vectorize Data
Vectorizing is the process of encoding text as integers to create feature vectors so that machine learning algorithms can understand language.
## Methods of Vectorizing Data for NLP
1. Bag-of-Words
2. N-Grams
3. TF-IDF
### 1\. Bag-Of-Words
Bag-of-Words (BoW), implemented here with CountVectorizer, describes which words occur in the text data and how often: in the simplest binary variant, a word scores one if present in a document and zero if absent. The model therefore represents each text document as a row of word counts in a document-term matrix.
```
from sklearn.feature_extraction.text import CountVectorizer

# clean_text is the combined cleaning function (punctuation removal,
# tokenizing, stop-word removal) built from the steps above.
count_vect = CountVectorizer(analyzer=clean_text)
X_counts = count_vect.fit_transform(data['body_text'])
print(X_counts.shape)
# In scikit-learn >= 1.2, use get_feature_names_out() instead.
print(count_vect.get_feature_names())
```
We apply BoW to the `body_text` so the count of each word is stored in the document matrix. (Check the [repo](https://github.com/BadreeshShetty/Natural-Language-Processing-NLP-for-Machine-Learning)).
### 2\. N-Grams
N-grams are simply all combinations of adjacent words or letters of length `n` that we find in our source text. N-grams with `n=1` are called unigrams, `n=2` are bigrams, and so on.

Access raw code [here](https://gist.github.com/BadreeshShetty/0dc2305ac012a5018aa39879489617be).
Unigrams usually don't contain much information as compared to bigrams or trigrams. The basic principle behind N-grams is that they capture which letter or word is likely to follow a given word. The longer the N-gram (higher `n`), the more context you have to work with.
```
from sklearn.feature_extraction.text import CountVectorizer

# Intended to apply a bigram-only vectorizer via ngram_range=(2,2).
# Caveat: scikit-learn ignores ngram_range when analyzer is a callable,
# so for true bigrams, use analyzer='word' over pre-cleaned text instead.
ngram_vect = CountVectorizer(ngram_range=(2,2), analyzer=clean_text)
X_counts = ngram_vect.fit_transform(data['body_text'])
print(X_counts.shape)
# In scikit-learn >= 1.2, use get_feature_names_out() instead.
print(ngram_vect.get_feature_names())
```
We've applied N-grams to the `body_text`, so the count of each group of words in a sentence is stored in the document matrix. (Check the [repo](https://github.com/BadreeshShetty/Natural-Language-Processing-NLP-for-Machine-Learning)).
### 3\. TF-IDF
TF-IDF computes the relative frequency with which a word appears in a document compared to its frequency across all documents. It's more useful than term frequency for identifying key words in each document (high frequency in that document, low frequency in other documents).
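For reference, here is the classic formulation for a term t in document d over a corpus of N documents (a sketch of the standard definition; scikit-learn's TfidfVectorizer uses a smoothed variant, so its exact values differ slightly):
```
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log\frac{N}{\mathrm{df}(t)}
```
Here tf(t, d) is the count of t in d, and df(t) is the number of documents that contain t.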
*Note: We use TF-IDF for search engine scoring, text summarization and document clustering. Check my article on [recommender systems](https://builtin.com/data-science/recommender-systems) to learn more about TF-IDF.*
```
from sklearn.feature_extraction.text import TfidfVectorizer

# Same clean_text analyzer as above, but with TF-IDF weighting.
tfidf_vect = TfidfVectorizer(analyzer=clean_text)
X_tfidf = tfidf_vect.fit_transform(data['body_text'])
print(X_tfidf.shape)
# In scikit-learn >= 1.2, use get_feature_names_out() instead.
print(tfidf_vect.get_feature_names())
```
We've applied TF-IDF to the `body_text`, so the relative weight of each word in each message is stored in the document matrix. (Check the [repo](https://github.com/BadreeshShetty/Natural-Language-Processing-NLP-for-Machine-Learning)).
*Note: Vectorizers output sparse matrices in which most entries are zero. In the interest of efficient storage, a sparse matrix stores only the locations and values of its non-zero elements.*
How to Make the Most of Your Graphs: [7 Ways to Tell Powerful Stories With Your Data Visualization](https://builtin.com/data-science/data-visualization)
## Feature Engineering
### Feature Creation
Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. Because feature engineering requires domain knowledge, features can be tough to create, but they're certainly worth your time.

Access raw code [here](https://gist.github.com/BadreeshShetty/fdd7706e4c7b1553b00b2d7cb00120de).
- `body_len` is the length of a message body in characters, excluding whitespace.
- `punct%` is the percentage of characters in a message body that are punctuation marks.
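A minimal sketch of both features, assuming the `data` DataFrame from earlier:
```
import string

def count_punct(text):
    # Percentage of non-space characters that are punctuation.
    count = sum(1 for ch in text if ch in string.punctuation)
    total = len(text) - text.count(" ")
    return round(count / total, 3) * 100 if total else 0

data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(" "))
data['punct%'] = data['body_text'].apply(count_punct)
```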
### Is Your Feature Worthwhile?

Access raw code [here](https://gist.github.com/BadreeshShetty/af2fc15ae9db9498df17f814c0ac806b).
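The screenshots plot overlaid histograms of `body_len` for spam vs. ham; a sketch of that plot, assuming the `label` column from the reading-in sketch (bin edges are illustrative):
```
import numpy as np
import matplotlib.pyplot as plt

bins = np.linspace(0, 200, 40)
plt.hist(data[data['label'] == 'spam']['body_len'], bins, alpha=0.5, label='spam')
plt.hist(data[data['label'] == 'ham']['body_len'], bins, alpha=0.5, label='ham')
plt.legend(loc='upper right')
plt.show()
```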
We can see clearly that spam messages tend to have a higher word count than ham, so `body_len` is a good distinguishing feature.
Now let's look at `punct%`.

Access raw code [here](https://gist.github.com/BadreeshShetty/af2fc15ae9db9498df17f814c0ac806b).
Spam has a higher percentage of punctuation than ham, but not by much. This is surprising, given that spam emails often contain a lot of punctuation marks. Nevertheless, given the apparent difference, we can still call this a useful feature.
Need to Optimize Your Hardware? We Have a Tutorial for That. [Create a Linux Virtual Machine on Your Computer](https://builtin.com/data-science/linux-vm)
## Building Machine Learning Classifiers
### Model Selection
We use an ensemble method of machine learning. Using multiple models in concert produces more robust results than a single model alone (e.g., a support vector machine or Naive Bayes), and ensemble methods are the first choice for many Kaggle competitions. We construct [random forest algorithms](https://builtin.com/data-science/random-forest-algorithm) (i.e., ensembles of random decision trees) and aggregate the predictions of the individual trees into the final prediction. This approach can be used for classification as well as regression problems and follows a random bagging strategy.
- **Grid search:** exhaustively searches over all parameter combinations in a given grid to determine the best model.
- **Cross-validation:** divides a data set into k subsets and repeats the training k times, using a different subset as the test set in each iteration.

Access raw code [here](https://gist.github.com/BadreeshShetty/31cc189f321ae859ec9740f89048b143).
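A sketch of the grid search over the TF-IDF matrix and labels from above. The values `n_estimators=150` and `max_depth=90` come from the results discussed below; the other grid values are illustrative (the gists show the author's exact grid):
```
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param = {'n_estimators': [10, 150, 300],
         'max_depth': [30, 60, 90, None]}

gs = GridSearchCV(RandomForestClassifier(), param, cv=5, n_jobs=-1)
gs_fit = gs.fit(X_tfidf, data['label'])
results = pd.DataFrame(gs_fit.cv_results_)
print(results.sort_values('mean_test_score', ascending=False).head())
```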
The `mean_test_score` for `n_estimators=150` and the corresponding `max_depth` gives the best result. Here, `n_estimators` is the number of trees in the forest (the group of decision trees) and `max_depth` is the maximum number of levels in each decision tree.

Access raw code [here](https://gist.github.com/BadreeshShetty/8a30f5ff4e5890e52d35a044d53e1882).
Similarly, the `mean_test_score` for `n_estimators=150` and `max_depth=90` gives the best result.
### Future Improvements
You could also use gradient boosting (e.g., XGBoost) for classification. Gradient boosting will take a while because it takes an iterative approach, combining weak learners into strong learners by focusing on the mistakes of prior iterations. In short, compared to random forest's parallel, randomized bagging of independent trees, gradient boosting follows a sequential approach.
More From Badreesh Shetty: [An In-Depth Guide to How Recommender Systems Work](https://builtin.com/data-science/recommender-systems)
## Our NLP Machine Learning Classifier
We combine all the above-discussed sections to build a Spam-Ham Classifier.

Random forest provides 97.7 percent accuracy, and we obtain a high F1-score from the model. The confusion matrix tells us that we correctly predicted 965 hams and 123 spams. We incorrectly identified zero hams as spam, while 26 spams were incorrectly predicted as ham. This margin of error is justifiable, given that mistaking spam for ham is preferable to an SMS spam filter losing important hams.
Spam filters are just one example of NLP you encounter every day. Here are others that influence your life each day (and some you may want to try out!). Hopefully this tutorial will help you try more of these out for yourself.
- **Email spam filters** — your "junk" folder
- **Auto-correct** — text messages, word processors
- **Predictive text** — search engines, text messages
- **Speech recognition** — digital assistants like Siri, Alexa
- **Information retrieval** — Google finds relevant and similar results
- **Information extraction** — Gmail suggests events from emails to add to your calendar
- **Machine translation** — Google Translate converts text from one language to another
- **Text simplification** — [Rewordify](https://rewordify.com/) simplifies the meaning of sentences
- **Sentiment analysis** — [Hater News](https://haternews.herokuapp.com/) gives us the sentiment of the user
- **Text summarization** — Reddit's [autotldr](https://www.reddit.com/r/autotldr/) gives a summary of a submission
- **Query response** — IBM Watson's answers to [a question](https://blog.ted.com/how-did-supercomputer-watson-beat-jeopardy-champion-ken-jennings-experts-discuss/)
- **Natural language generation** — generation of text from image or [video data](https://www.bbc.com/news/technology-34204052) |
| Shard | 169 (laksa) |
| Root Hash | 7607033694470393769 |
| Unparsed URL | com,builtin!/machine-learning/nlp-machine-learning s443 |