🕷️ Crawler Inspector

URL Lookup

Direct Parameter Lookup

Raw Queries and Responses

1. Shard Calculation

Query:
Response:
Calculated Shard: 169 (from laksa187)
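The inspector doesn't show its shard function. As a rough illustration only (the hash scheme, shard count and function name below are assumptions, not the tool's actual implementation), shard assignment typically hashes the URL and takes the result modulo the shard count:

```python
import hashlib

# Hypothetical sketch of URL-to-shard assignment; the real hash
# function and shard count used by this crawler are not shown above.
def shard_for(url: str, num_shards: int = 512) -> int:
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

print(shard_for("https://builtin.com/machine-learning/nlp-machine-learning"))
```

The key property is determinism: the same URL always maps to the same shard, so lookups and writes for a URL always land on the same host.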

2. Crawled Status Check

Query:
Response:

3. Robots.txt Check

Query:
Response:

4. Spam/Ban Check

Query:
Response:

5. Seen Status Check

ℹ️ Skipped - page is already crawled

📄 INDEXABLE · ✅ CRAWLED (3 days ago) · 🤖 ROBOTS ALLOWED

Page Info Filters

| Filter | Status | Condition | Details |
| --- | --- | --- | --- |
| HTTP status | PASS | `download_http_code = 200` | HTTP 200 |
| Age cutoff | PASS | `download_stamp > now() - 6 MONTH` | 0.1 months ago |
| History drop | PASS | `isNull(history_drop_reason)` | No drop reason |
| Spam/ban | PASS | `fh_dont_index != 1 AND ml_spam_score = 0` | ml_spam_score=0 |
| Canonical | PASS | `meta_canonical IS NULL OR = '' OR = src_unparsed` | Not set |
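The five filters above amount to one boolean predicate over the page record. A minimal sketch, assuming the field names shown in the condition column (the function and the sample record are illustrative, not the inspector's actual code):

```python
from datetime import datetime, timedelta

# Illustrative only: evaluates the five page-info filters from the table.
def is_indexable(page: dict) -> bool:
    six_months_ago = datetime.now() - timedelta(days=183)  # ~6 MONTH cutoff
    return (
        page["download_http_code"] == 200
        and page["download_stamp"] > six_months_ago
        and page["history_drop_reason"] is None
        and page["fh_dont_index"] != 1
        and page["ml_spam_score"] == 0
        and page["meta_canonical"] in (None, "", page["src_unparsed"])
    )

page = {
    "download_http_code": 200,
    "download_stamp": datetime.now() - timedelta(days=3),
    "history_drop_reason": None,
    "fh_dont_index": 0,
    "ml_spam_score": 0,
    "meta_canonical": None,
    "src_unparsed": "https://builtin.com/machine-learning/nlp-machine-learning",
}
print(is_indexable(page))  # → True (all five filters pass)
```

A page fails indexability as soon as any one condition fails, which matches the per-filter PASS/FAIL column above.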

Page Details

| Property | Value |
| --- | --- |
| URL | https://builtin.com/machine-learning/nlp-machine-learning |
| Last Crawled | 2026-04-03 19:47:38 (3 days ago) |
| First Indexed | 2021-11-10 20:31:36 (4 years ago) |
| HTTP Status Code | 200 |
| Meta Title | NLP Machine Learning: Build an NLP Classifier \| Built In |
| Meta Description | Try your hand at NLP with this machine learning tutorial. |
| Meta Canonical | null |
Markdown
# A Step-by-Step NLP Machine Learning Classifier Tutorial

Natural language processing influences your life every day. Here’s a tutorial to help you try it out for yourself.
Written by [Badreesh Shetty](https://builtin.com/authors/badreesh-shetty) \| Updated by [Hal Koss](https://builtin.com/authors/hal-koss) \| Jul 19, 2022. Image: Shutterstock / Built In

Natural Language Processing (NLP) is a subfield of machine learning that makes it possible for computers to understand, analyze, manipulate and generate human language. You encounter NLP machine learning in your everyday life — from spam detection, to autocorrect, to your digital assistant (“Hey, Siri?”). You may even encounter NLP and not realize it. In this article, I’ll show you how to develop your own NLP projects with the Natural Language Toolkit (NLTK), but before we dive into the tutorial, let’s look at some everyday examples of NLP.

## Examples of NLP Machine Learning

- Email spam filters
- Auto-correct
- Predictive text
- Speech recognition
- Information retrieval
- Information extraction
- Machine translation
- Text simplification
- Sentiment analysis
- Text summarization
- Query response
- Natural language generation

More From Our Experts: [Artificial Intelligence vs. Machine Learning vs. Deep Learning](https://builtin.com/artificial-intelligence/ai-vs-machine-learning)

## Get Started With NLP

NLTK is a popular open-source suite of Python libraries. Rather than building all of your NLP tools from scratch, NLTK provides all common NLP tasks so you can jump right in. In this tutorial, I’ll show you how to perform basic NLP tasks and use a machine learning classifier to predict whether an SMS is spam (a harmful, malicious or unwanted message) or ham (something you might actually want to read).
You can find all the code below in this [Github Repo](https://github.com/BadreeshShetty/Natural-Language-Processing-NLP-for-Machine-Learning). First things first, you’ll want to [install NLTK](https://pypi.python.org/pypi/nltk). Type `!pip install nltk` in a Jupyter Notebook. If it doesn’t work in cmd, type `conda install -c conda-forge nltk`. You shouldn’t have to do much troubleshooting beyond that.

### Importing NLTK Library

```
import nltk
nltk.download()
```

This code opens the NLTK downloader application, which is helpful for all NLP tasks. ![nlp machine learning downloader application window screenshot](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/1_nlp%20machine%20learning.png) As you can see, I’ve already installed the Stopwords Corpus on my system, which helps remove redundant words. You’ll be able to install whatever packages will be most useful to your project.

## Prepare Your Data for NLP

### Reading In-text Data

Our data comes to us in a structured or unstructured format. A structured format has a well-defined pattern: Excel and Google Sheets files, for example, are structured data. Unstructured data, by contrast, has no discernible pattern (e.g. images, audio files, social media posts). Between these two types sits semi-structured data, and language is a great example of it. ![nlp machine learning](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/2_nlp%20machine%20learning.png) Access raw code [here](https://gist.github.com/BadreeshShetty/24d98bfa84dbe8d0589ba824028700d2). As we can see from the code above, when we read semi-structured data it’s hard for a computer (and a human!) to interpret. We can use Pandas to help us understand our data.
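As a concrete illustration of reading this kind of semi-structured data with Pandas (the column names and the three sample messages below are stand-ins for the tutorial's tab-separated SMS dataset, not the real file):

```python
import pandas as pd
from io import StringIO

# Tiny stand-in for the tab-separated SMS dataset: one label column
# ("ham"/"spam") and one message-body column. Names are assumptions.
raw = StringIO(
    "ham\tI'll call you later tonight\n"
    "spam\tWINNER!! Claim your FREE prize now!!!\n"
    "ham\tAre we still on for lunch?\n"
)
data = pd.read_csv(raw, sep="\t", header=None, names=["label", "body_text"])
print(data.head())
print(data["label"].value_counts())
```

With the data in a DataFrame, each row is one message and each column one attribute, which is what the cleaning and vectorizing steps below expect.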
![nlp machine learning](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/3_nlp%20machine%20learning.png) Access raw code [here](https://gist.github.com/BadreeshShetty/8755fad754100ce47555341feab4ac6c). With the help of Pandas, we can now see and interpret our semi-structured data more clearly.

## How to Clean Your Data

Cleaning up your text data is necessary to highlight the attributes you want your machine learning system to pick up on. Cleaning (or pre-processing) the data typically consists of five steps:

1. Remove punctuation
2. Tokenize
3. Remove stop words
4. Stem
5. Lemmatize

### 1\. Remove Punctuation

Punctuation can provide grammatical context to a sentence, which supports human understanding. But for our vectorizer, which counts the number of words and not the context, punctuation adds no value, so we remove all special characters. For example, “How are you?” becomes: How are you. Here’s how to do it: ![nlp machine learning](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/4_nlp%20machine%20learning_0.png) In `body_text_clean`, you can see we’ve removed all punctuation: I’ve becomes Ive and WILL!! becomes WILL.

### 2\. Tokenize

Tokenizing separates text into units such as sentences or words. In other words, this function gives structure to previously unstructured text. For example: Plata o Plomo becomes ‘Plata’,’o’,’Plomo’. ![nlp machine learning](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/5_nlp%20machine%20learning.png) Access raw code [here](https://gist.github.com/BadreeshShetty/0861309de75358fd788235ff99a73aa5). In `body_text_tokenized`, we’ve generated all the words as tokens.

### 3\. Remove Stop Words

Stop words are common words that will likely appear in any text. They don’t tell us much about our data, so we remove them.
Again, these are words that are great for human understanding but will confuse your machine learning program. For example: silver or lead is fine for me becomes silver, lead, fine. ![nlp machine learning](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/6_nlp%20machine%20learning.png) Access raw code [here](https://gist.github.com/BadreeshShetty/cc86f564e5b368c1ec9a6d19ecd682ff). In `body_text_nostop`, we remove all unnecessary words like “been,” “for,” and “the.”

### 4\. Stem

Stemming reduces a word to its stem form. It often makes sense to treat related words in the same way. Stemming removes suffixes like “ing,” “ly” and “s” using a simple rule-based approach. It shrinks the corpus of words, but the stems it produces are often not real words. For example: “Entitling” or “Entitled” become “Entitl.” *Note: Some search engines treat words with the same stem as synonyms.* ![nlp machine learning](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/7_nlp%20machine%20learning.png) Access raw code [here](https://gist.github.com/BadreeshShetty/3bd45483ebc032dc768caaef9ea77c1b). In `body_text_stemmed`, words like entry and goes are stemmed to entri and goe, even though the stems mean nothing in English.

### 5\. Lemmatize

Lemmatizing derives the root form (“lemma”) of a word. This practice is more robust than stemming because it uses a dictionary-based approach (i.e., a morphological analysis) to find the root word. For example, “Entitling” or “Entitled” become “Entitle.” In short, stemming is typically faster because it simply chops off the end of a word without understanding its context, while lemmatizing is slower but more accurate because it performs an informed analysis with the word’s context in mind.
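The cleaning steps above can be sketched end to end in plain Python. This is a simplified stand-in for the notebook's NLTK-based pipeline: the stop-word set is a tiny illustrative subset of NLTK's English stopword corpus, and the stemming/lemmatizing steps are omitted because they need NLTK's downloaded models:

```python
import re
import string

# Tiny illustrative subset of a stop-word list (assumption, not NLTK's corpus).
STOPWORDS = {"is", "or", "for", "me", "the", "a", "an", "and", "to"}

def remove_punctuation(text):
    # Step 1: drop every special character.
    return "".join(ch for ch in text if ch not in string.punctuation)

def tokenize(text):
    # Step 2: split lowercased text into word tokens.
    return re.split(r"\W+", text.lower())

def remove_stopwords(tokens):
    # Step 3: keep only informative tokens.
    return [t for t in tokens if t and t not in STOPWORDS]

def clean_text(text):
    return remove_stopwords(tokenize(remove_punctuation(text)))

print(clean_text("Silver or lead is fine for me!"))  # → ['silver', 'lead', 'fine']
```

The output matches the article's stop-word example: "silver or lead is fine for me" reduces to silver, lead, fine.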
![nlp machine learning](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/8_nlp%20machine%20learning.png) Access raw code [here](https://gist.github.com/BadreeshShetty/d421a71ab1ce7c25f04ae651171d0aa0). Comparing the two outputs, we can see that a word like “chances” is lemmatized to “chance” but stemmed to “chanc.”

Want More Data Science Tutorials? [Need to Automate Your Data Analysis? Here’s How.](https://builtin.com/data-science/automate-data-analysis)

## Vectorize Data

Vectorizing is the process of encoding text as integers to create feature vectors so that machine learning algorithms can understand language.

## Methods of Vectorizing Data for NLP

1. Bag-of-Words
2. N-Grams
3. TF-IDF

### 1\. Bag-Of-Words

Bag-of-Words (BoW), implemented by CountVectorizer, describes the occurrence of words within the text data: in its simplest binary form it records a one if a word is present in a document and a zero if it is absent, while CountVectorizer records how many times each word appears. The model therefore creates a document-term count matrix for the text corpus.

```
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(analyzer=clean_text)
X_counts = count_vect.fit_transform(data['body_text'])
print(X_counts.shape)
print(count_vect.get_feature_names())
```

We apply BoW to the `body_text` so the count of each word is stored in the document matrix. (Check the [repo](https://github.com/BadreeshShetty/Natural-Language-Processing-NLP-for-Machine-Learning).)

### 2\. N-Grams

N-grams are simply all combinations of adjacent words or letters of length `n` that we find in our source text. N-grams with `n=1` are called unigrams, `n=2` bigrams, and so on. ![nlp machine learning](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/9_nlp%20machine%20learning.png) Access raw code [here](https://gist.github.com/BadreeshShetty/0dc2305ac012a5018aa39879489617be). Unigrams usually don’t contain much information compared to bigrams or trigrams.
The basic principle behind n-grams is that they capture which letter or word is likely to follow a given word. The longer the n-gram (the higher the `n`), the more context you have to work with.

```
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(2,2) extracts bigrams only
ngram_vect = CountVectorizer(ngram_range=(2,2), analyzer=clean_text)
X_counts = ngram_vect.fit_transform(data['body_text'])
print(X_counts.shape)
print(ngram_vect.get_feature_names())
```

We’ve applied the bigram vectorizer to the `body_text`, so the count of each group of words in a sentence is stored in the document matrix. (Check the [repo](https://github.com/BadreeshShetty/Natural-Language-Processing-NLP-for-Machine-Learning).)

### 3\. TF-IDF

TF-IDF computes the relative frequency with which a word appears in a document compared to its frequency across all documents. It’s more useful than raw term frequency for identifying the key words in each document (high frequency in that document, low frequency in other documents). *Note: We use TF-IDF for search engine scoring, text summarization and document clustering. Check my article on [recommender systems](https://builtin.com/data-science/recommender-systems) to learn more about TF-IDF.*

```
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(analyzer=clean_text)
X_tfidf = tfidf_vect.fit_transform(data['body_text'])
print(X_tfidf.shape)
print(tfidf_vect.get_feature_names())
```

We’ve applied TF-IDF to the `body_text`, so the relative weight of each word in each document is stored in the document matrix. (Check the [repo](https://github.com/BadreeshShetty/Natural-Language-Processing-NLP-for-Machine-Learning).) *Note: Vectorizers output sparse matrices in which most entries are zero. To store them efficiently, a sparse matrix keeps only the locations and values of the non-zero elements.*

How to Make the Most of Your Graphs: [7 Ways to Tell Powerful Stories With Your Data Visualization](https://builtin.com/data-science/data-visualization)

## Feature Engineering

### Feature Creation

Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. Because feature engineering requires domain knowledge, features can be tough to create, but they’re certainly worth your time. ![nlp machine learning](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/10_nlp%20machine%20learning.png) Access raw code [here](https://gist.github.com/BadreeshShetty/fdd7706e4c7b1553b00b2d7cb00120de).

- `body_len` shows the length of a message body in characters, excluding whitespace.
- `punct%` shows the percentage of punctuation marks in a message body.

### Is Your Feature Worthwhile?

![nlp machine learning](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/11_nlp%20machine%20learning.png) Access raw code [here](https://gist.github.com/BadreeshShetty/af2fc15ae9db9498df17f814c0ac806b). We can see clearly that spam messages have a high word count compared to ham, so `body_len` is a good distinguishing feature. Now let’s look at `punct%`. ![nlp machine learning](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/12_nlp%20machine%20learning.png) Access raw code [here](https://gist.github.com/BadreeshShetty/af2fc15ae9db9498df17f814c0ac806b). Spam has a higher percentage of punctuation than ham, though by less than you might expect given how much punctuation spam tends to contain. Nevertheless, given the apparent difference, we can still call this a useful feature.

Need to Optimize Your Hardware?
We Have a Tutorial for That. [Create a Linux Virtual Machine on Your Computer](https://builtin.com/data-science/linux-vm)

## Building Machine Learning Classifiers

### Model Selection

We use an ensemble method of machine learning: by using multiple models in concert, their combination produces more robust results than a single model alone (e.g. a support vector machine or Naive Bayes). Ensemble methods are the first choice for many Kaggle competitions. We construct [random forest algorithms](https://builtin.com/data-science/random-forest-algorithm) (i.e. multiple random decision trees) and aggregate the predictions of the trees for the final prediction. This process can be used for classification as well as regression problems and follows a random bagging strategy.

- **Grid-search:** Exhaustively searches over all parameter combinations in a given grid to determine the best model.
- **Cross-validation:** Divides a data set into k subsets, repeating training and evaluation k times and using a different subset as the test set in each iteration.

![nlp machine learning](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/13_nlp%20machine%20learning.png) Access raw code [here](https://gist.github.com/BadreeshShetty/31cc189f321ae859ec9740f89048b143). The `mean_test_score` for `n_estimators=150` and `max_depth` gives the best result. Here, `n_estimators` is the number of trees in the forest and `max_depth` is the maximum number of levels in each decision tree. ![nlp machine learning](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/14_nlp%20machine%20learning.png) Access raw code [here](https://gist.github.com/BadreeshShetty/8a30f5ff4e5890e52d35a044d53e1882). Similarly, the `mean_test_score` for `n_estimators=150` and `max_depth=90` gives the best result.

### Future Improvements

You could use GradientBoosting or XGBoost for classification. GradientBoosting will take a while because it takes an iterative approach, combining weak learners into strong learners by focusing on the mistakes of prior iterations. In short, compared to random forest, GradientBoosting follows a sequential approach rather than random parallel bagging.

More From Badreesh Shetty: [An In-Depth Guide to How Recommender Systems Work](https://builtin.com/data-science/recommender-systems)

## Our NLP Machine Learning Classifier

We combine all the sections discussed above to build a spam-ham classifier. ![nlp machine learning](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/15_nlp%20machine%20learning.png) Random forest provides 97.7 percent accuracy, and we obtain a high F1-score from the model. The confusion matrix tells us that we correctly predicted 965 hams and 123 spams. We incorrectly identified zero hams as spams, while 26 spams were incorrectly predicted as hams. This margin of error is justifiable given that letting some spam through as ham is preferable to potentially losing important hams to an SMS spam filter.

Spam filters are just one example of the NLP you encounter every day. Here are others that influence your life each day (and some you may want to try out!). Hopefully this tutorial will help you try more of these out for yourself.
- **Email spam filters** — your “junk” folder
- **Auto-correct** — text messages, word processors
- **Predictive text** — search engines, text messages
- **Speech recognition** — digital assistants like Siri, Alexa
- **Information retrieval** — Google finds relevant and similar results
- **Information extraction** — Gmail suggests events from emails to add on your calendar
- **Machine translation** — Google Translate translates language from one language to another
- **Text simplification** — [Rewordify](https://rewordify.com/) simplifies the meaning of sentences
- **Sentiment analysis** — [Hater News](https://haternews.herokuapp.com/) gives us the sentiment of the user
- **Text summarization** — Reddit’s [autotldr](https://www.reddit.com/r/autotldr/) gives a summary of a submission
- **Query response** — IBM Watson’s answers to [a question](https://blog.ted.com/how-did-supercomputer-watson-beat-jeopardy-champion-ken-jennings-experts-discuss/)
- **Natural language generation** — generation of text from image or [video data](https://www.bbc.com/news/technology-34204052)
Readable Markdown
Natural Language Processing (NLP) is a subfield of machine learning that makes it possible for computers to understand, analyze, manipulate and generate human language. You encounter NLP machine learning in your everyday life — from spam detection, to autocorrect, to your digital assistant (“Hey, Siri?”). You may even encounter NLP and not realize it. In this article, I’ll show you how to develop your own NLP projects with the Natural Language Toolkit (NLTK), but before we dive into the tutorial, let’s look at some everyday examples of NLP.

## Examples of NLP Machine Learning

- Email spam filters
- Auto-correct
- Predictive text
- Speech recognition
- Information retrieval
- Information extraction
- Machine translation
- Text simplification
- Sentiment analysis
- Text summarization
- Query response
- Natural language generation

More From Our Experts: [Artificial Intelligence vs. Machine Learning vs. Deep Learning](https://builtin.com/artificial-intelligence/ai-vs-machine-learning)

## Get Started With NLP

NLTK is a popular open-source suite of Python libraries. Rather than making you build all of your NLP tools from scratch, NLTK provides all common NLP tasks so you can jump right in. In this tutorial, I’ll show you how to perform basic NLP tasks and use a machine learning classifier to predict whether an SMS is spam (a harmful, malicious or unwanted message) or ham (something you might actually want to read). You can find all the code below in this [GitHub repo](https://github.com/BadreeshShetty/Natural-Language-Processing-NLP-for-Machine-Learning).

First things first, you’ll want to [install NLTK](https://pypi.python.org/pypi/nltk). Type `!pip install nltk` in a Jupyter Notebook. If it doesn’t work in cmd, type `conda install -c conda-forge nltk`. You shouldn’t have to do much troubleshooting beyond that.

### Importing the NLTK Library

```
import nltk
nltk.download()
```

This code opens the NLTK downloader application, which is helpful for all NLP tasks.
![nlp machine learning downloader application window screenshot](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/1_nlp%20machine%20learning.png)

As you can see, I’ve already installed the Stopwords Corpus in my system, which helps remove redundant words. You’ll be able to install whatever packages are most useful to your project.

## Prepare Your Data for NLP

### Reading In-Text Data

Our data comes to us in a structured or unstructured format. A structured format has a well-defined pattern: Excel and Google Sheets files, for example, are structured data. Unstructured data, by contrast, has no discernible pattern (e.g. images, audio files, social media posts). In between these two data types we find semi-structured formats, and language is a great example of semi-structured data.

![nlp machine learning](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/2_nlp%20machine%20learning.png)

Access raw code [here](https://gist.github.com/BadreeshShetty/24d98bfa84dbe8d0589ba824028700d2).

As we can see from the code above, when we read semi-structured data, it’s hard for a computer (and a human!) to interpret. We can use Pandas to help us understand our data.

![nlp machine learning](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/3_nlp%20machine%20learning.png)

Access raw code [here](https://gist.github.com/BadreeshShetty/8755fad754100ce47555341feab4ac6c).

With the help of Pandas, we can now see and interpret our semi-structured data more clearly.

## How to Clean Your Data

Cleaning up your text data is necessary to highlight the attributes you want your machine learning system to pick up on. Cleaning (or pre-processing) the data typically consists of five steps.

## How to Clean Your Data for NLP

1. Remove punctuation
2. Tokenize
3. Remove stop words
4. Stem
5. Lemmatize

### 1\. Remove Punctuation

Punctuation can provide grammatical context to a sentence, which supports human understanding. But for our vectorizer, which counts the number of words and not the context, punctuation adds no value, so we remove all special characters. For example, “How are you?” becomes: How are you

Here’s how to do it:

![nlp machine learning](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/4_nlp%20machine%20learning_0.png)

In `body_text_clean`, you can see we’ve removed all punctuation: “I’ve” becomes “Ive” and “WILL!!” becomes “WILL.”

### 2\. Tokenize

Tokenizing separates text into units such as sentences or words. In other words, this function gives structure to previously unstructured text. For example: “Plata o Plomo” becomes ‘Plata’, ’o’, ’Plomo’.

![nlp machine learning](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/5_nlp%20machine%20learning.png)

Access raw code [here](https://gist.github.com/BadreeshShetty/0861309de75358fd788235ff99a73aa5).

In `body_text_tokenized`, we’ve generated all the words as tokens.

### 3\. Remove Stop Words

Stop words are common words that will likely appear in any text. They don’t tell us much about our data, so we remove them. Again, these are words that are great for human understanding but will confuse your machine learning program. For example: “silver or lead is fine for me” becomes silver, lead, fine.

![nlp machine learning](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/6_nlp%20machine%20learning.png)

Access raw code [here](https://gist.github.com/BadreeshShetty/cc86f564e5b368c1ec9a6d19ecd682ff).

In `body_text_nostop`, we remove all unnecessary words like “been,” “for,” and “the.”

### 4\. Stem

Stemming reduces a word to its stem form. It often makes sense to treat related words in the same way. Stemming removes suffixes like “ing,” “ly” and “s” with a simple rule-based approach.
Stemming reduces the corpus of words, but the actual words are often lost, in a sense. For example, “Entitling” or “Entitled” become “Entitl.”

*Note: Some search engines treat words with the same stem as synonyms.*

![nlp machine learning](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/7_nlp%20machine%20learning.png)

Access raw code [here](https://gist.github.com/BadreeshShetty/3bd45483ebc032dc768caaef9ea77c1b).

In `body_text_stemmed`, words like “entry” and “goes” are stemmed to “entri” and “goe,” even though they mean nothing in English.

### 5\. Lemmatize

Lemmatizing derives the root form (“lemma”) of a word. This practice is more robust than stemming because it uses a dictionary-based approach (i.e., a morphological analysis) to find the root word. For example, “Entitling” or “Entitled” become “Entitle.”

In short, stemming is typically faster because it simply chops off the end of a word without understanding the word’s context. Lemmatizing is slower but more accurate because it performs an informed analysis with the word’s context in mind.

![nlp machine learning](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/8_nlp%20machine%20learning.png)

Access raw code [here](https://gist.github.com/BadreeshShetty/d421a71ab1ce7c25f04ae651171d0aa0).

Here we can see words like “chances” are lemmatized to “chance” but stemmed to “chanc.”

Want More Data Science Tutorials? [Need to Automate Your Data Analysis? Here’s How.](https://builtin.com/data-science/automate-data-analysis)

## Vectorize Data

Vectorizing is the process of encoding text as integers to create feature vectors so that machine learning algorithms can understand language.

## Methods of Vectorizing Data for NLP

1. Bag-of-Words
2. N-Grams
3. TF-IDF

### 1\. Bag-of-Words

Bag-of-Words (BoW), or CountVectorizer, describes the presence of words within the text data.
This process gives a result of one if the word is present in the sentence and zero if it is absent. The model therefore produces a document-term matrix of word counts for each text document.

```
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(analyzer=clean_text)
X_counts = count_vect.fit_transform(data['body_text'])
print(X_counts.shape)
# On scikit-learn >= 1.2, use count_vect.get_feature_names_out()
print(count_vect.get_feature_names())
```

We apply BoW to the `body_text` so the count of each word is stored in the document matrix. (Check the [repo](https://github.com/BadreeshShetty/Natural-Language-Processing-NLP-for-Machine-Learning).)

### 2\. N-Grams

N-grams are simply all combinations of adjacent words or letters of length `n` that we find in our source text. N-grams with `n=1` are called unigrams, `n=2` bigrams, and so on.

![nlp machine learning](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/9_nlp%20machine%20learning.png)

Access raw code [here](https://gist.github.com/BadreeshShetty/0dc2305ac012a5018aa39879489617be).

Unigrams usually don’t contain much information compared to bigrams or trigrams. The basic principle behind n-grams is that they capture which letter or word is likely to follow a given word. The longer the n-gram (the higher `n`), the more context you have to work with.

```
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(2,2) applies only the bigram vectorizer
ngram_vect = CountVectorizer(ngram_range=(2,2), analyzer=clean_text)
X_counts = ngram_vect.fit_transform(data['body_text'])
print(X_counts.shape)
print(ngram_vect.get_feature_names())
```

We’ve applied n-grams to the `body_text`, so the count of each group of words in a sentence is stored in the document matrix. (Check the [repo](https://github.com/BadreeshShetty/Natural-Language-Processing-NLP-for-Machine-Learning).)

### 3\. TF-IDF

TF-IDF computes the relative frequency with which a word appears in a document compared to its frequency across all documents.
It’s more useful than term frequency for identifying key words in each document (high frequency in that document, low frequency in other documents).

*Note: We use TF-IDF for search engine scoring, text summarization and document clustering. Check my article on [recommender systems](https://builtin.com/data-science/recommender-systems) to learn more about TF-IDF.*

```
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(analyzer=clean_text)
X_tfidf = tfidf_vect.fit_transform(data['body_text'])
print(X_tfidf.shape)
print(tfidf_vect.get_feature_names())
```

We’ve applied TF-IDF to the `body_text`, so the relative count of each word in the sentences is stored in the document matrix. (Check the [repo](https://github.com/BadreeshShetty/Natural-Language-Processing-NLP-for-Machine-Learning).)

*Note: Vectorizers output sparse matrices, in which most entries are zero. In the interest of efficient storage, a sparse matrix stores only the locations and values of its non-zero elements.*

How to Make the Most of Your Graphs: [7 Ways to Tell Powerful Stories With Your Data Visualization](https://builtin.com/data-science/data-visualization)

## Feature Engineering

### Feature Creation

Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. Because feature engineering requires domain knowledge, features can be tough to create, but they’re certainly worth your time.

![nlp machine learning](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/10_nlp%20machine%20learning.png)

Access raw code [here](https://gist.github.com/BadreeshShetty/fdd7706e4c7b1553b00b2d7cb00120de).

- `body_len` shows the length of a message body, excluding whitespace.
- `punct%` shows the percentage of punctuation marks in a message body.

### Is Your Feature Worthwhile?
![nlp machine learning](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/11_nlp%20machine%20learning.png)

Access raw code [here](https://gist.github.com/BadreeshShetty/af2fc15ae9db9498df17f814c0ac806b).

We can see clearly that spams have a high number of words compared to hams, so `body_len` is a good distinguishing feature. Now let’s look at `punct%`.

![nlp machine learning](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/12_nlp%20machine%20learning.png)

Access raw code [here](https://gist.github.com/BadreeshShetty/af2fc15ae9db9498df17f814c0ac806b).

Spam has a higher percentage of punctuation, but not far above ham. This is surprising, given that spam emails often contain a lot of punctuation marks. Nevertheless, given the apparent difference, we can still call this a useful feature.

Need to Optimize Your Hardware? We Have a Tutorial for That. [Create a Linux Virtual Machine on Your Computer](https://builtin.com/data-science/linux-vm)

## Building Machine Learning Classifiers

### Model Selection

We use an ensemble method of machine learning. By using multiple models in concert, their combination produces more robust results than a single model (e.g. a support vector machine or Naive Bayes). Ensemble methods are the first choice for many Kaggle competitions. We construct [random forest algorithms](https://builtin.com/data-science/random-forest-algorithm) (i.e. multiple random decision trees) and aggregate each tree’s prediction for the final prediction. This process can be used for classification as well as regression problems and follows a random bagging strategy.

- **Grid search:** This method exhaustively searches over all parameter combinations in a given grid to determine the best model.
- **Cross-validation:** This method divides a data set into k subsets and repeats the procedure k times, using a different subset as the test set in each iteration.
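Grid search and cross-validation come combined in scikit-learn’s `GridSearchCV`. Here’s a minimal sketch on synthetic data; the parameter grid is illustrative, and in the actual notebooks the features come from the vectorizing and feature-engineering steps above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the TF-IDF feature matrix and spam/ham labels.
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

param_grid = {
    "n_estimators": [10, 150],    # number of trees in the forest
    "max_depth": [None, 30, 90],  # max levels per decision tree
}

# cv=5: split the data into k=5 subsets, each serving once as the test fold.
gs = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
gs.fit(X, y)
print(gs.best_params_, gs.best_score_)
```

`gs.cv_results_` holds the `mean_test_score` table examined below.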
![nlp machine learning](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/13_nlp%20machine%20learning.png)

Access raw code [here](https://gist.github.com/BadreeshShetty/31cc189f321ae859ec9740f89048b143).

The `mean_test_score` for `n_estimators=150` and `max_depth` gives the best result. Here, `n_estimators` is the number of trees in the forest (a group of decision trees) and `max_depth` is the maximum number of levels in each decision tree.

![nlp machine learning](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/14_nlp%20machine%20learning.png)

Access raw code [here](https://gist.github.com/BadreeshShetty/8a30f5ff4e5890e52d35a044d53e1882).

Similarly, the `mean_test_score` for `n_estimators=150` and `max_depth=90` gives the best result.

### Future Improvements

You could also use GradientBoosting or XGBoost for classifying. GradientBoosting will take a while because it takes an iterative approach, combining weak learners into strong learners by focusing on the mistakes of prior iterations. In short, compared to random forest, GradientBoosting follows a sequential approach rather than a random, parallel one.

More From Badreesh Shetty: [An In-Depth Guide to How Recommender Systems Work](https://builtin.com/data-science/recommender-systems)

## Our NLP Machine Learning Classifier

We combine all the above-discussed sections to build a spam-ham classifier.

![nlp machine learning](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/15_nlp%20machine%20learning.png)

Random forest provides 97.7 percent accuracy, and we obtain a high F1-score from the model. The confusion matrix tells us that we correctly predicted 965 hams and 123 spams. We incorrectly identified zero hams as spams, and 26 spams were incorrectly predicted as hams. This margin of error is justifiable given that letting a few spams through as hams is preferable to an SMS spam filter potentially losing important hams.
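As a self-contained illustration of the pipeline’s overall shape (not the article’s actual notebook), here is a sketch that feeds TF-IDF features into a random forest and prints a confusion matrix. The hard-coded messages are stand-ins for the real SMS dataset in the linked repo:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Toy stand-in for the SMS corpus: 4 spam and 4 ham messages,
# repeated so the train/test split has enough rows.
texts = [
    "WINNER!! Claim your free prize now!!!",
    "Free entry to win cash, text WIN now",
    "URGENT! Your account has been selected for a reward",
    "Congratulations, you won a free holiday, call now",
    "Are we still meeting for lunch today?",
    "Can you pick up milk on the way home",
    "See you at the gym at 6",
    "Thanks for the notes from class",
] * 10
labels = (["spam"] * 4 + ["ham"] * 4) * 10

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels)

# Vectorize, then fit the classifier on the training split only.
vect = TfidfVectorizer()
clf = RandomForestClassifier(n_estimators=150, random_state=42)
clf.fit(vect.fit_transform(X_train), y_train)

preds = clf.predict(vect.transform(X_test))
print(confusion_matrix(y_test, preds, labels=["ham", "spam"]))
```

Because the toy messages repeat, the model scores perfectly here; on the real dataset you would see the kind of ham/spam trade-off discussed above.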
Spam filters are just one example of NLP you encounter every day. Here are others that influence your life each day (and some you may want to try out!). Hopefully this tutorial will help you try more of these out for yourself.

- **Email spam filters** — your “junk” folder
- **Auto-correct** — text messages, word processors
- **Predictive text** — search engines, text messages
- **Speech recognition** — digital assistants like Siri and Alexa
- **Information retrieval** — Google finds relevant and similar results
- **Information extraction** — Gmail suggests events from your emails to add to your calendar
- **Machine translation** — Google Translate translates text from one language to another
- **Text simplification** — [Rewordify](https://rewordify.com/) simplifies the meaning of sentences
- **Sentiment analysis** — [Hater News](https://haternews.herokuapp.com/) gives us the sentiment of the user
- **Text summarization** — Reddit’s [autotldr](https://www.reddit.com/r/autotldr/) gives a summary of a submission
- **Query response** — IBM Watson answers [a question](https://blog.ted.com/how-did-supercomputer-watson-beat-jeopardy-champion-ken-jennings-experts-discuss/)
- **Natural language generation** — generating text from image or [video data](https://www.bbc.com/news/technology-34204052)