ℹ️ Skipped - page is already crawled
| Filter | Status | Condition | Details |
|---|---|---|---|
| HTTP status | PASS | download_http_code = 200 | HTTP 200 |
| Age cutoff | PASS | download_stamp > now() - 6 MONTH | 0.1 months ago |
| History drop | PASS | isNull(history_drop_reason) | No drop reason |
| Spam/ban | PASS | fh_dont_index != 1 AND ml_spam_score = 0 | ml_spam_score=0 |
| Canonical | PASS | meta_canonical IS NULL OR = '' OR = src_unparsed | Not set |
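The skip decision in the table above can be sketched as a chain of predicates. The field names and the `should_skip_recrawl` helper below are assumptions made for illustration; they mirror the table's columns, not the crawler's actual schema.

```python
# Hypothetical sketch of the skip-filter chain shown above.
# Field names (download_http_code, history_drop_reason, ...) are taken
# from the table and are assumptions about the real schema.

def should_skip_recrawl(page: dict, age_cutoff_months: float = 6.0) -> bool:
    """Return True when every filter passes, i.e. the page is skipped."""
    checks = [
        page.get("download_http_code") == 200,                      # HTTP status
        page.get("age_months", float("inf")) < age_cutoff_months,   # age cutoff
        page.get("history_drop_reason") is None,                    # history drop
        page.get("fh_dont_index") != 1 and page.get("ml_spam_score") == 0,  # spam/ban
        page.get("meta_canonical") in (None, "", page.get("src_unparsed")),  # canonical
    ]
    return all(checks)

page = {
    "download_http_code": 200,
    "age_months": 0.1,
    "history_drop_reason": None,
    "fh_dont_index": 0,
    "ml_spam_score": 0,
    "meta_canonical": None,
    "src_unparsed": "https://www.geeksforgeeks.org/nlp/building-language-models-in-nlp/",
}
print(should_skip_recrawl(page))  # True -> "Skipped - page is already crawled"
```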
| Property | Value |
|---|---|
| URL | https://www.geeksforgeeks.org/nlp/building-language-models-in-nlp/ |
| Last Crawled | 2026-04-10 21:45:55 (1 day ago) |
| First Indexed | 2025-06-16 17:02:23 (9 months ago) |
| HTTP Status Code | 200 |
| Meta Title | Building Language Models in NLP - GeeksforGeeks |
| Meta Description | Your All-in-One Learning Portal: GeeksforGeeks is a comprehensive educational platform that empowers learners across domains-spanning computer science and programming, school education, upskilling, commerce, software tools, competitive exams, and more., Your All-in-One Learning Portal. It contains well written, well thought and well explained computer science and programming articles, quizzes and practice/competitive programming/company interview Questions. |
| Meta Canonical | null |
| Boilerpipe Text | Last Updated : 23 Jul, 2025
Building language models is a fundamental task in natural language processing (NLP) that involves creating computational models capable of predicting the next word in a sequence of words. These models are essential for various NLP applications, such as machine translation, speech recognition, and text generation.
In this article, we will build a language model for NLP using an LSTM network.
What is a Language Model?
A language model is a statistical model that is used to predict the probability of a sequence of words.
It learns the structure and patterns of a language from a given text corpus and can be used to generate new text that is similar to the original text.
Language models are a fundamental component of many natural language processing (
NLP
) tasks, such as machine translation, speech recognition, and text generation.
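"Predicting the probability of a sequence of words" can be made concrete with the chain rule, P(w1, ..., wn) = P(w1) · P(w2 | w1) · ... · P(wn | w1, ..., wn-1). A minimal bigram sketch of this idea (pure Python; the toy corpus is invented for illustration):

```python
from collections import Counter

# Toy corpus; a real language model would train on a large text collection.
corpus = "i am doing well i am fine".split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def bigram_prob(prev: str, word: str) -> float:
    """Maximum-likelihood estimate P(word | prev) = count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

# Chain-rule probability of "i am doing", approximated with bigrams:
p = bigram_prob("i", "am") * bigram_prob("am", "doing")
print(p)  # P(am|i) = 2/2, P(doing|am) = 1/2 -> 0.5
```

The LSTM model built below learns the same conditional distribution P(next word | history), just with a neural network instead of counts.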
Steps to Build a Language Model in NLP
Here are the steps we will follow to build a language model in NLP.
Step 1: Importing Necessary Libraries
First, we import all the libraries required to build our model.
import tensorflow as tf
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.models import Sequential
Step 2: Generate Sample Data
First, we take some sample text data.
text_data = "Hello, how are you? I am doing well. Thank you for asking."
Step 3: Preprocessing the Data
Preprocessing involves tokenizing the input text, creating input sequences, and padding the sequences so they are all the same length.
# Tokenize the text
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts([text_data])
total_words = len(tokenizer.word_index) + 1
# Create input sequences and labels
input_sequences = []
for line in text_data.split('.'):
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)
# Pad sequences for equal length
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = tf.keras.preprocessing.sequence.pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre')
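The n-gram construction and pre-padding above can be traced without TensorFlow. The tiny word index below is hand-built to mimic what `Tokenizer.fit_on_texts` would produce (Keras assigns ids by frequency, so the real ids may differ):

```python
# Pure-Python trace of Step 3 (no TensorFlow), to show the shapes involved.
# The word index here is hand-built for illustration.
word_index = {"hello": 1, "how": 2, "are": 3, "you": 4}

sentence = ["hello", "how", "are", "you"]
tokens = [word_index[w] for w in sentence]          # [1, 2, 3, 4]

# Every prefix of length >= 2 becomes a training sequence.
input_sequences = [tokens[: i + 1] for i in range(1, len(tokens))]
# [[1, 2], [1, 2, 3], [1, 2, 3, 4]]

# 'pre' padding left-pads each sequence with zeros to the max length.
max_len = max(len(s) for s in input_sequences)
padded = [[0] * (max_len - len(s)) + s for s in input_sequences]
print(padded)  # [[0, 0, 1, 2], [0, 1, 2, 3], [1, 2, 3, 4]]
```

Pre-padding (rather than post-padding) keeps the most recent words next to the label position, which is what the LSTM reads last.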
Step 4: One-Hot Encoding
The input sequences are split into predictors (xs) and labels (ys). The labels are converted to one-hot encoding.
# Create predictors and label
xs, labels = input_sequences[:,:-1],input_sequences[:,-1]
# Convert labels to one-hot encoding
ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)
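One-hot encoding maps each label id to a vector with a single 1. A stdlib-only sketch of what `to_categorical` does for one label:

```python
def to_one_hot(label: int, num_classes: int) -> list[int]:
    """Return a vector of length num_classes with a 1 at position `label`."""
    vec = [0] * num_classes
    vec[label] = 1
    return vec

# With a vocabulary of 5 ids (0 reserved for padding), label 3 becomes:
print(to_one_hot(3, 5))  # [0, 0, 0, 1, 0]
```

This matches the softmax output layer below: the network predicts one probability per vocabulary id, and the one-hot label marks which id is correct.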
Step 5: Defining and Compiling the Model
This code defines and compiles a simple LSTM-based language model using Keras.
# Define the model
model = Sequential()
model.add(Embedding(total_words, 64, input_length=max_sequence_len-1))
model.add(LSTM(100))
model.add(Dense(total_words, activation='softmax'))
# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# Fit the model
history = model.fit(xs, ys, epochs=100, verbose=1)
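The model's size can be checked by hand. Assuming `total_words = 12` for the sample sentence (11 unique words plus 1, since Keras reserves index 0 for padding), the per-layer parameter counts work out as follows (pure arithmetic, no TensorFlow needed):

```python
# Parameter arithmetic for the model above.
# total_words = 12 is an assumption based on the sample sentence having
# 11 unique words; Keras reserves id 0 for padding.
total_words, embed_dim, lstm_units = 12, 64, 100

embedding_params = total_words * embed_dim  # one 64-d vector per word id
# LSTM: 4 gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
dense_params = lstm_units * total_words + total_words  # weights + biases

print(embedding_params, lstm_params, dense_params)  # 768 66000 1212
```

Even this toy model has ~68k parameters, almost all in the LSTM, which is why it easily memorizes a one-sentence corpus over 100 epochs.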
Step 6: Generating Text
This `generate_text` function takes a `seed_text` as input and generates `next_words` words using the provided `model` and `max_sequence_len`.
def generate_text(seed_text, next_words, model, max_sequence_len):
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = tf.keras.preprocessing.sequence.pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        predicted_probs = model.predict(token_list, verbose=0)[0]
        predicted_index = tf.argmax(predicted_probs, axis=-1).numpy()
        output_word = ""
        for word, index in tokenizer.word_index.items():
            if index == predicted_index:
                output_word = word
                break
        seed_text += " " + output_word
    return seed_text
# Generate text
print(generate_text("how", 5, model, max_sequence_len))
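The generation loop above is greedy decoding: at each step, feed the running text back in and append the argmax word. A model-free sketch of the same control flow, using a stubbed next-word table (the probabilities are invented for illustration; the article's version calls `model.predict` instead):

```python
# Greedy decoding with a stubbed next-word distribution (invented for
# illustration; the real version queries the trained LSTM).
next_word_probs = {
    "how": {"are": 0.9, "is": 0.1},
    "are": {"you": 0.8, "we": 0.2},
    "you": {"i": 0.7, "ok": 0.3},
}

def generate_greedy(seed: str, steps: int) -> str:
    words = seed.split()
    for _ in range(steps):
        dist = next_word_probs.get(words[-1], {})
        if not dist:
            break  # no known continuation for this word
        words.append(max(dist, key=dist.get))  # argmax, like tf.argmax above
    return " ".join(words)

print(generate_greedy("how", 3))  # "how are you i"
```

Greedy decoding always picks the single most likely word, so it is deterministic; sampling from the distribution instead would yield more varied text.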
Below is the complete implementation:
import tensorflow as tf
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.models import Sequential

# Sample text data
text_data = "Hello, how are you? I am doing well. Thank you for asking."

# Tokenize the text
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts([text_data])
total_words = len(tokenizer.word_index) + 1

# Create input sequences and labels
input_sequences = []
for line in text_data.split('.'):
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

# Pad sequences for equal length
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = tf.keras.preprocessing.sequence.pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre')

# Create predictors and label
xs, labels = input_sequences[:, :-1], input_sequences[:, -1]

# Convert labels to one-hot encoding
ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)

# Define the model
model = Sequential()
model.add(Embedding(total_words, 64, input_length=max_sequence_len-1))
model.add(LSTM(100))
model.add(Dense(total_words, activation='softmax'))

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Fit the model
history = model.fit(xs, ys, epochs=100, verbose=1)

def generate_text(seed_text, next_words, model, max_sequence_len):
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = tf.keras.preprocessing.sequence.pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        predicted_probs = model.predict(token_list, verbose=0)[0]
        predicted_index = tf.argmax(predicted_probs, axis=-1).numpy()
        output_word = ""
        for word, index in tokenizer.word_index.items():
            if index == predicted_index:
                output_word = word
                break
        seed_text += " " + output_word
    return seed_text

# Generate text
print(generate_text("how", 5, model, max_sequence_len))
Output:
how are you i am doing
In summary, building language models for natural language processing (NLP) involves several stages: tokenization, sequence creation, model construction, training, and text generation. Tokenization transforms textual data into numerical representations, while sequence creation generates input-output pairs for model training. The model typically comprises layers such as Embedding and LSTM, followed by a Dense layer for predictions. Training fits the model to the input sequences and their labels, while text generation uses the trained model to produce new text from a provided seed text. Overall, language models are vital for NLP tasks such as text generation, machine translation, and sentiment analysis. |
| Markdown |
# Building Language Models in NLP
Last Updated : 23 Jul, 2025
Building language models is a fundamental task in natural language processing (NLP) that involves creating computational models capable of predicting the next word in a sequence of words. These models are essential for various NLP applications, such as machine translation, speech recognition, and text generation.
**In this article, we will build a language model for NLP using an LSTM network.**
## What is a Language Model?
- A language model is a statistical model that is used to predict the probability of a sequence of words.
- It learns the structure and patterns of a language from a given text corpus and can be used to generate new text that is similar to the original text.
- Language models are a fundamental component of many natural language processing ([NLP](https://www.geeksforgeeks.org/nlp/natural-language-processing-overview/)) tasks, such as machine translation, speech recognition, and text generation.
## Steps to Build a Language Model in NLP
Here are the steps we will follow to build a language model in NLP.
### Step 1: Importing Necessary Libraries
First, we import all the libraries required to build our model.
```
import tensorflow as tf
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.models import Sequential
```
### Step 2: Generate Sample Data
First, we take some sample text data.
```
text_data = "Hello, how are you? I am doing well. Thank you for asking."
```
### Step 3: Preprocessing the Data
Preprocessing involves tokenizing the input text, creating input sequences, and padding the sequences so they are all the same length.
```
# Tokenize the text
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts([text_data])
total_words = len(tokenizer.word_index) + 1
# Create input sequences and labels
input_sequences = []
for line in text_data.split('.'):
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)
# Pad sequences for equal length
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = tf.keras.preprocessing.sequence.pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre')
```
### Step 4: One-Hot Encoding
The input sequences are split into predictors (xs) and labels (ys). The labels are converted to one-hot encoding.
```
# Create predictors and label
xs, labels = input_sequences[:,:-1],input_sequences[:,-1]
# Convert labels to one-hot encoding
ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)
```
### Step 5: Defining and Compiling the Model
This code defines and compiles a simple [LSTM](https://www.geeksforgeeks.org/machine-learning/understanding-of-lstm-networks/)-based language model using Keras.
```
# Define the model
model = Sequential()
model.add(Embedding(total_words, 64, input_length=max_sequence_len-1))
model.add(LSTM(100))
model.add(Dense(total_words, activation='softmax'))
# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# Fit the model
history = model.fit(xs, ys, epochs=100, verbose=1)
```
### Step 6: Generating Text
This `generate_text` function takes a `seed_text` as input and generates `next_words` words using the provided `model` and `max_sequence_len`.
```
def generate_text(seed_text, next_words, model, max_sequence_len):
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = tf.keras.preprocessing.sequence.pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        predicted_probs = model.predict(token_list, verbose=0)[0]
        predicted_index = tf.argmax(predicted_probs, axis=-1).numpy()
        output_word = ""
        for word, index in tokenizer.word_index.items():
            if index == predicted_index:
                output_word = word
                break
        seed_text += " " + output_word
    return seed_text
# Generate text
print(generate_text("how", 5, model, max_sequence_len))
```
#### **Below is the complete implementation:**
```
import tensorflow as tf
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.models import Sequential

# Sample text data
text_data = "Hello, how are you? I am doing well. Thank you for asking."

# Tokenize the text
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts([text_data])
total_words = len(tokenizer.word_index) + 1

# Create input sequences and labels
input_sequences = []
for line in text_data.split('.'):
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

# Pad sequences for equal length
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = tf.keras.preprocessing.sequence.pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre')

# Create predictors and label
xs, labels = input_sequences[:, :-1], input_sequences[:, -1]

# Convert labels to one-hot encoding
ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)

# Define the model
model = Sequential()
model.add(Embedding(total_words, 64, input_length=max_sequence_len-1))
model.add(LSTM(100))
model.add(Dense(total_words, activation='softmax'))

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Fit the model
history = model.fit(xs, ys, epochs=100, verbose=1)

def generate_text(seed_text, next_words, model, max_sequence_len):
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = tf.keras.preprocessing.sequence.pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        predicted_probs = model.predict(token_list, verbose=0)[0]
        predicted_index = tf.argmax(predicted_probs, axis=-1).numpy()
        output_word = ""
        for word, index in tokenizer.word_index.items():
            if index == predicted_index:
                output_word = word
                break
        seed_text += " " + output_word
    return seed_text

# Generate text
print(generate_text("how", 5, model, max_sequence_len))
```
**Output:**
```
how are you i am doing
```
In summary, building language models for natural language processing (NLP) involves several stages: tokenization, sequence creation, model construction, training, and text generation. Tokenization transforms textual data into numerical representations, while sequence creation generates input-output pairs for model training. The model typically comprises layers such as Embedding and LSTM, followed by a Dense layer for predictions. Training fits the model to the input sequences and their labels, while text generation uses the trained model to produce new text from a provided seed text. Overall, language models are vital for NLP tasks such as text generation, machine translation, and sentiment analysis.
Author: [gs7701ri2f](https://www.geeksforgeeks.org/user/gs7701ri2f/) |
| Readable Markdown | Last Updated : 23 Jul, 2025
Building language models is a fundamental task in natural language processing (NLP) that involves creating computational models capable of predicting the next word in a sequence of words. These models are essential for various NLP applications, such as machine translation, speech recognition, and text generation.
**In this article, we will build a language model for NLP using an LSTM network.**
## What is a Language Model?
- A language model is a statistical model that is used to predict the probability of a sequence of words.
- It learns the structure and patterns of a language from a given text corpus and can be used to generate new text that is similar to the original text.
- Language models are a fundamental component of many natural language processing ([NLP](https://www.geeksforgeeks.org/nlp/natural-language-processing-overview/)) tasks, such as machine translation, speech recognition, and text generation.
## Steps to Build a Language Model in NLP
Here are the steps we will follow to build a language model in NLP.
### Step 1: Importing Necessary Libraries
First, we import all the libraries required to build our model.
```
import tensorflow as tf
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.models import Sequential
```
### Step 2: Generate Sample Data
First, we take some sample text data.
```
text_data = "Hello, how are you? I am doing well. Thank you for asking."
```
### Step 3: Preprocessing the Data
Preprocessing involves tokenizing the input text, creating input sequences, and padding the sequences so they are all the same length.
```
# Tokenize the text
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts([text_data])
total_words = len(tokenizer.word_index) + 1
# Create input sequences and labels
input_sequences = []
for line in text_data.split('.'):
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)
# Pad sequences for equal length
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = tf.keras.preprocessing.sequence.pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre')
```
### Step 4: One-Hot Encoding
The input sequences are split into predictors (xs) and labels (ys). The labels are converted to one-hot encoding.
```
# Create predictors and label
xs, labels = input_sequences[:,:-1],input_sequences[:,-1]
# Convert labels to one-hot encoding
ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)
```
### Step 5: Defining and Compiling the Model
This code defines and compiles a simple [LSTM](https://www.geeksforgeeks.org/machine-learning/understanding-of-lstm-networks/)-based language model using Keras.
```
# Define the model
model = Sequential()
model.add(Embedding(total_words, 64, input_length=max_sequence_len-1))
model.add(LSTM(100))
model.add(Dense(total_words, activation='softmax'))
# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# Fit the model
history = model.fit(xs, ys, epochs=100, verbose=1)
```
### Step 6: Generating Text
This `generate_text` function takes a `seed_text` as input and generates `next_words` words using the provided `model` and `max_sequence_len`.
```
def generate_text(seed_text, next_words, model, max_sequence_len):
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = tf.keras.preprocessing.sequence.pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        predicted_probs = model.predict(token_list, verbose=0)[0]
        predicted_index = tf.argmax(predicted_probs, axis=-1).numpy()
        output_word = ""
        for word, index in tokenizer.word_index.items():
            if index == predicted_index:
                output_word = word
                break
        seed_text += " " + output_word
    return seed_text
# Generate text
print(generate_text("how", 5, model, max_sequence_len))
```
#### **Below is the complete implementation:**
```
import tensorflow as tf
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.models import Sequential

# Sample text data
text_data = "Hello, how are you? I am doing well. Thank you for asking."

# Tokenize the text
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts([text_data])
total_words = len(tokenizer.word_index) + 1

# Create input sequences and labels
input_sequences = []
for line in text_data.split('.'):
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

# Pad sequences for equal length
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = tf.keras.preprocessing.sequence.pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre')

# Create predictors and label
xs, labels = input_sequences[:, :-1], input_sequences[:, -1]

# Convert labels to one-hot encoding
ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)

# Define the model
model = Sequential()
model.add(Embedding(total_words, 64, input_length=max_sequence_len-1))
model.add(LSTM(100))
model.add(Dense(total_words, activation='softmax'))

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Fit the model
history = model.fit(xs, ys, epochs=100, verbose=1)

def generate_text(seed_text, next_words, model, max_sequence_len):
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = tf.keras.preprocessing.sequence.pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        predicted_probs = model.predict(token_list, verbose=0)[0]
        predicted_index = tf.argmax(predicted_probs, axis=-1).numpy()
        output_word = ""
        for word, index in tokenizer.word_index.items():
            if index == predicted_index:
                output_word = word
                break
        seed_text += " " + output_word
    return seed_text

# Generate text
print(generate_text("how", 5, model, max_sequence_len))
```
**Output:**
```
how are you i am doing
```
In summary, building language models for natural language processing (NLP) involves several stages: tokenization, sequence creation, model construction, training, and text generation. Tokenization transforms textual data into numerical representations, while sequence creation generates input-output pairs for model training. The model typically comprises layers such as Embedding and LSTM, followed by a Dense layer for predictions. Training fits the model to the input sequences and their labels, while text generation uses the trained model to produce new text from a provided seed text. Overall, language models are vital for NLP tasks such as text generation, machine translation, and sentiment analysis. |
| Shard | 103 (laksa) |
| Root Hash | 12046344915360636903 |
| Unparsed URL | org,geeksforgeeks!www,/nlp/building-language-models-in-nlp/ s443 |