🕷️ Crawler Inspector

URL Lookup

Direct Parameter Lookup

Raw Queries and Responses

1. Shard Calculation

Query:
Response:
Calculated Shard: 103 (from laksa116)
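Shard assignment of this kind is typically a hash of the URL taken modulo the shard count. The exact hash function, shard count, and host naming scheme here are internal details not shown in the tool output, so the sketch below is purely hypothetical:

```python
import hashlib

def shard_for_url(url: str, num_shards: int = 128) -> int:
    """Hash the URL and take the result modulo the shard count.
    Both the MD5 hash and the 128-shard default are illustrative guesses,
    not the crawler's actual scheme."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

print(shard_for_url("https://www.johnsnowlabs.com/the-ultimate-guide-to-building-your-own-ner-model-with-python/"))
```

Any scheme of this shape is deterministic (the same URL always maps to the same shard) and spreads URLs roughly evenly across shards.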

2. Crawled Status Check

Query:
Response:

3. Robots.txt Check

Query:
Response:

4. Spam/Ban Check

Query:
Response:

5. Seen Status Check

ℹ️ Skipped - page is already crawled

📄 INDEXABLE
✅ CRAWLED (5 days ago)
🤖 ROBOTS ALLOWED

Page Info Filters

| Filter | Status | Condition | Details |
|---|---|---|---|
| HTTP status | PASS | download_http_code = 200 | HTTP 200 |
| Age cutoff | PASS | download_stamp > now() - 6 MONTH | 0.2 months ago |
| History drop | PASS | isNull(history_drop_reason) | No drop reason |
| Spam/ban | PASS | fh_dont_index != 1 AND ml_spam_score = 0 | ml_spam_score=0 |
| Canonical | PASS | meta_canonical IS NULL OR = '' OR = src_unparsed | Not set |
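The filter table reads as a conjunction: a page is indexable only when every row passes. A minimal sketch of that check, where the field names follow the table but the record-as-dict layout, the ~182-day reading of `6 MONTH`, and the explicit `now` argument are assumptions:

```python
from datetime import datetime, timedelta

def is_indexable(page: dict, now: datetime) -> bool:
    """All five Page Info filters must pass for the page to be indexable."""
    http_ok = page["download_http_code"] == 200                  # HTTP status
    fresh = page["download_stamp"] > now - timedelta(days=182)   # Age cutoff (~6 months)
    no_drop = page["history_drop_reason"] is None                # History drop
    not_spam = page["fh_dont_index"] != 1 and page["ml_spam_score"] == 0   # Spam/ban
    canonical_ok = page["meta_canonical"] in (None, "", page["src_unparsed"])  # Canonical
    return http_ok and fresh and no_drop and not_spam and canonical_ok

# Record mirroring the page shown in this inspector output
page = {
    "download_http_code": 200,
    "download_stamp": datetime(2026, 4, 1, 13, 45, 36),
    "history_drop_reason": None,
    "fh_dont_index": 0,
    "ml_spam_score": 0,
    "meta_canonical": None,
    "src_unparsed": "https://www.johnsnowlabs.com/the-ultimate-guide-to-building-your-own-ner-model-with-python/",
}
print(is_indexable(page, now=datetime(2026, 4, 6)))
```

Flipping any single field (e.g. a nonzero `ml_spam_score`, or a `download_stamp` older than six months) makes the whole check fail.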

Page Details

| Property | Value |
|---|---|
| URL | https://www.johnsnowlabs.com/the-ultimate-guide-to-building-your-own-ner-model-with-python/ |
| Last Crawled | 2026-04-01 13:45:36 (5 days ago) |
| First Indexed | 2023-04-25 13:56:12 (2 years ago) |
| HTTP Status Code | 200 |
| Meta Title | The Ultimate Guide to Building Your Own NER Model with Python - John Snow Labs |
| Meta Description | Training a NER model from scratch with Python. The Ultimate Guide to Building Your Own NER Model with Python. |
| Meta Canonical | null |
Boilerpipe Text
Training a NER model from scratch with Python

Named Entity Recognition (NER) is a Natural Language Processing (NLP) technique for identifying and extracting named entities from text. Named entities are words or phrases that refer to specific entities such as people, organizations, locations, dates, times, and other items that have a specific name or title. NER is used in many fields of NLP, with practical applications in information extraction, sentiment analysis, chatbots, question answering systems, and more. NER models are crucial components of NLP that enable machines to understand and process unstructured text data more efficiently and accurately, and they can help automate tasks that would otherwise require human effort.

NER involves analyzing text to identify these named entities and classify them into predefined categories. This can be done using various techniques, such as rule-based approaches, machine learning algorithms, or deep learning models; among these, deep learning models are the most successful on NER tasks. There are more than 1,700 NER models in the John Snow Labs Models Hub, but using Spark NLP it is also possible to train your own deep learning model that extracts entities from text with very high accuracy.

The purpose of model training is to teach a model to make accurate predictions on new, unseen data by learning from labeled (annotated) data. The training process involves feeding the model labeled examples and adjusting its parameters to minimize the difference between its predicted outputs and the true outputs in the training data.
The trained model can then be used to make predictions on new, unseen data. In other words, the model training process involves providing the NerDLApproach (the Spark NLP annotator for neural-network-based NER models) with a set of annotated data, called the training set, that includes text documents along with labels for the named entities present in the text. The training set is typically created by human annotators who label the named entities with predefined categories.

In this post, we will discuss three concepts, namely CoNLL file preparation, TFNerDLGraphBuilder, and NerDLApproach, in order to understand the fundamentals of NER model training in Spark NLP:

- CoNLL (Conference on Computational Natural Language Learning) is a standard format used for annotating and sharing annotated language data. CoNLL files are commonly used in named entity recognition.
- TFNerDLGraphBuilder is a Spark NLP annotator used to build the TensorFlow (TF) graph for training and inference of a custom deep learning NER model.
- NerDLApproach is a powerful Spark NLP annotator for building and training NER models using deep learning techniques. It supports different embedding strategies and hyperparameters, and is highly customizable to meet the specific needs of different NER tasks.

In this post, you will learn how to use these Spark NLP annotators to train deep learning models for the named entity recognition task. Let us start with a short Spark NLP introduction and then discuss the details of NER model training, with some solid results.

Introduction to Spark NLP

Spark NLP is an open-source library maintained by John Snow Labs. It is built on top of Apache Spark and Spark ML and provides simple, performant & accurate NLP annotations for machine learning pipelines that can scale easily in a distributed environment.
Since its first release in July 2017, Spark NLP has grown into a full NLP tool, providing:

- A single unified solution for all your NLP needs (for Medicine, Banking and Finance, Legal)
- Transfer learning and implementations of the latest and greatest SOTA algorithms and models in NLP research
- The most widely used NLP library in industry (5 years in a row)
- The most scalable, accurate and fastest library in NLP history

Spark NLP comes with 17,800+ pretrained pipelines and models in more than 250 languages. It supports most NLP tasks and provides modules that can be used seamlessly in a cluster.

Spark NLP processes data using Pipelines, structures that contain all the steps to be run on the input data. Each step contains an annotator that performs a specific task such as tokenization, normalization, or dependency parsing. An annotator in Spark NLP is a component that performs a specific NLP task on a text document and adds annotations to it: it takes input annotation(s) and produces an output document with additional metadata, which can be used for further processing or analysis. For example, a named entity recognizer annotator might identify and tag entities such as people, organizations, and locations in a text document, while a sentiment analysis annotator might classify the sentiment of the text as positive, negative, or neutral.

Setup

To install Spark NLP in Python, simply use your favorite package manager (conda, pip, etc.). For example:

```bash
pip install spark-nlp
pip install pyspark
```

For other installation options for different environments and machines, please check the official documentation.
Then, simply import the library and start a Spark session:

```python
import sparknlp

# Start Spark Session
spark = sparknlp.start()
```

CoNLL File Preparation

CoNLL (Conference on Computational Natural Language Learning) is a format for representing annotated data in NLP. The CoNLL format consists of columns, with each row representing a token and its associated features. To prepare data in the CoNLL format, the raw text is first annotated with the relevant labels (e.g., named entities or part-of-speech tags). This annotated data is then converted to the CoNLL format by representing each token and its associated features as a separate row in the file. The resulting CoNLL file can then be used to train and evaluate machine learning models for the relevant NLP task. (The original post shows a sample sentence and its CoNLL representation here.)

We will use train and test datasets from the John Snow Labs GitHub, so first let us download them:

```python
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/conll2003/eng.train
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/conll2003/eng.testa
```

Now, import the first 5,000 texts of the training dataset as a CoNLL file:
```python
from sparknlp.training import CoNLL

training_data = CoNLL().readDataset(spark, './eng.train').limit(5000)

# Observe the first 3 rows of the DataFrame
training_data.show(3)
```

Let's explode the training data to count the entities in IOB format (short for inside, outside, beginning):

```python
import pyspark.sql.functions as F

training_data.select(
    F.explode(F.arrays_zip(training_data.token.result,
                           training_data.label.result)).alias("cols")) \
    .select(F.expr("cols['0']").alias("token"),
            F.expr("cols['1']").alias("ground_truth")) \
    .groupBy('ground_truth').count() \
    .orderBy('count', ascending=False) \
    .show(100, truncate=False)
```

Now that we have the training dataframe, we can move to the next stage, namely graph generation.
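For reference, each token row in a CoNLL 2003 file carries four space-separated columns: the token, its part-of-speech tag, its syntactic chunk tag, and its NER tag in IOB format (`B-` begins an entity, `I-` continues one, `O` marks tokens outside any entity). The ground-truth count above can be mimicked in plain Python; the short sample below shows typical CoNLL 2003 rows, not actual eng.train contents:

```python
from collections import Counter

SAMPLE = """\
EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
"""

def count_ner_tags(conll_text: str) -> Counter:
    """Count the NER tags (4th column) per token row, as
    groupBy('ground_truth').count() does on the full DataFrame."""
    counts = Counter()
    for line in conll_text.splitlines():
        parts = line.split()
        # Skip blank lines and -DOCSTART- document markers
        if len(parts) == 4 and parts[0] != "-DOCSTART-":
            counts[parts[3]] += 1
    return counts

print(count_ner_tags(SAMPLE))
```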
There is a detailed notebook in the John Snow Labs GitHub repo about CoNLL preparation. Please check the notebook to understand the details of the process.

TFNerDLGraphBuilder

Graphs are data structures that contain a set of TensorFlow (TF) operation objects, which represent units of computation, and TF tensor objects, which represent the units of data that flow between operations. They are defined in a TF Graph context. Since these graphs are data structures, they can be saved, run, and restored without the original Python code. Graphs are extremely useful and let TF run fast, run in parallel, and run efficiently on multiple devices.

TFNerDLGraphBuilder is a Spark NLP annotator used to build the TF graph for training and inference of a custom deep learning NER model. It constructs the graph using TF APIs to define the model's layers, inputs, and outputs, and it also defines the optimization algorithm, loss function, and evaluation metrics to use during training. The resulting graph can be used to train a custom NER model on a large corpus of text data and then to extract named entities from new text.

First, we need to install TensorFlow and TensorFlow Addons:

```bash
pip install -q tensorflow==2.7.0
pip install -q tensorflow-addons
```

Then create directories for the log and graph files:

```python
!mkdir ner_logs
!mkdir ner_graphs
graph_folder = "./ner_graphs"
```

Finally, define the TFNerDLGraphBuilder annotator with its parameters.
```python
from sparknlp.annotator import TFNerDLGraphBuilder

graph_builder = TFNerDLGraphBuilder()\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setLabelColumn("label")\
    .setGraphFile("auto")\
    .setGraphFolder(graph_folder)\
    .setHiddenUnitsNumber(20)
```

The graph will be stored in the defined folder and loaded by the NerDLApproach annotator.

NerDLApproach

NerDLApproach is an annotator within Spark NLP that implements a deep learning approach for NER model training. It allows users to train custom NER models on large text corpora, using pre-trained word embeddings, character embeddings, and contextual embeddings such as BERT (Bidirectional Encoder Representations from Transformers) or ELMo (Embeddings from Language Models).

The NerDLApproach annotator expects DOCUMENT, TOKEN, and WORD_EMBEDDINGS annotations as input and produces NAMED_ENTITY as output. Thus, the pipeline requires previous steps that generate those input annotations. The next step is to get the word embeddings through BERT, using the Spark NLP annotator BertEmbeddings():

```python
# Import the required modules and classes
from sparknlp.base import DocumentAssembler, Pipeline
from sparknlp.annotator import (
    Tokenizer,
    SentenceDetector,
    BertEmbeddings
)

# Step 1: Transforms raw texts to `document` annotation
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Step 2: Getting the sentences
sentence = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Step 3: Tokenization
tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Step 4: Bert Embeddings
embeddings = BertEmbeddings.pretrained()\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")
```

We already created the graph with TFNerDLGraphBuilder and saved it in graph_folder; the next step is running the NerDLApproach annotator, the main module responsible for training the NER model:

```python
from sparknlp.annotator import NerDLApproach

# Model training
nerTagger = NerDLApproach()\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setLabelColumn("label")\
    .setOutputCol("ner")\
    .setMaxEpochs(7)\
    .setLr(0.003)\
    .setBatchSize(32)\
    .setRandomSeed(0)\
    .setVerbose(1)\
    .setValidationSplit(0.2)\
    .setEvaluationLogExtended(True)\
    .setEnableOutputLogs(True)\
    .setIncludeConfidence(True)\
    .setGraphFolder(graph_folder)\
    .setOutputLogsPath('ner_logs')

# Define the pipeline
ner_pipeline = Pipeline(stages=[embeddings, graph_builder, nerTagger])
```

The next step is fitting the training dataset to train the model:

```python
ner_model = ner_pipeline.fit(training_data)
```

(The original post shows the training metrics from the first and last epochs here.) By comparing the metrics, you can observe the improvement in the trained model's accuracy. NerDLApproach has many parameters, and by fine-tuning them it is possible to achieve very high accuracy values. Please check this notebook for different options in NerDL training.

Getting Predictions from the Trained Model

Now that we have trained the model, we can test its efficiency on the test dataset. First, convert the CoNLL file to a Spark dataframe:

```python
test_data = CoNLL().readDataset(spark, './eng.testa').limit(1000)
```

Let's get predictions by transforming the test dataframe:

```python
predictions = ner_model.transform(test_data)
```

Now, we will explode the results to get a nice dataframe of the tokens, ground truths, and the labels predicted by the model we just trained:

```python
predictions.select(
    F.explode(F.arrays_zip(predictions.token.result,
                           predictions.label.result,
                           predictions.ner.result)).alias("cols")) \
    .select(F.expr("cols['0']").alias("token"),
            F.expr("cols['1']").alias("ground_truth"),
            F.expr("cols['2']").alias("prediction")) \
    .show(30, truncate=False)
```

You can see that the model was very successful in predicting the named entities. It is also possible to save the model and load it back using the NerDLModel annotator in a pipeline. Please check the post about Python Named Entity Recognition (NER), which gives details about the NerDLModel annotator.

Highlight Entities

The ability to quickly visualize the entities generated with Spark NLP is very useful for speeding up development and for understanding the obtained results. Spark NLP Display is an open-source Python library for visualizing the annotations generated with Spark NLP. Its NerVisualizer highlights the extracted named entities and displays their labels as decorations on top of the analyzed text. The colors assigned to the predicted labels can be configured to fit the particular needs of the application. The figure below shows the visualization of named entities recognized from a sample text: the entities are extracted, labelled (as PERSON, DATE, ORG, LOC, etc.) and displayed on the original text. Please check the post named "Visualizing Named Entities with Spark NLP", which gives details about NerVisualizer.

(Figure: extracted named entities, displayed by the NerVisualizer.)

For additional information, please consult the following references. Documentation: CoNLL Datasets, TF Graphs, NerDLApproach.
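With ground truths and predictions side by side, the model's quality can also be checked outside Spark. Below is a simplified token-level precision/recall/F1 sketch over (ground_truth, prediction) tag pairs; note that official CoNLL scoring is entity-span based, so this token-level variant, and the sample pairs, are only illustrative:

```python
def token_prf(pairs):
    """Token-level precision/recall/F1 over (ground_truth, prediction)
    IOB tag pairs, treating any non-'O' tag as an entity token.
    Simplified: a wrong entity label counts only as a false positive."""
    tp = fp = fn = 0
    for truth, pred in pairs:
        if pred != "O" and pred == truth:
            tp += 1          # predicted entity tag matches the truth
        elif pred != "O":
            fp += 1          # predicted an entity tag that is wrong
        elif truth != "O":
            fn += 1          # missed an entity token
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

pairs = [("B-ORG", "B-ORG"), ("O", "O"), ("B-PER", "O"), ("O", "B-LOC")]
print(token_prf(pairs))  # → (0.5, 0.5, 0.5)
```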
Python Doc: CoNLL Datasets, TFNerDLGraphBuilder, NerDLApproach. Scala Doc: CoNLL Datasets, NerDLApproach. For extended examples of usage, see the notebooks for CoNLL File Preparation, Graph Generation, and NerDL Training.

Conclusion

In this article, we walked you through training a NER model with BERT embeddings. Named entity recognition is a crucial task in NLP that involves identifying and extracting entities such as people, places, organizations, dates, and other named entities from unstructured text data. A well-trained NER model helps extract useful information from unstructured text with high accuracy.

NER deep learning model training in Spark NLP provides an efficient and scalable way to build accurate NER models for various natural language processing tasks. Spark NLP also provides a variety of pre-trained models, including deep learning models like BERT, RoBERTa, and DistilBERT, which can be used to classify entities in text. These models can be fine-tuned on specific datasets to improve the accuracy of NER classification.

Read also related articles on the topic: Named Entity Recognition; BERT Named Entity Recognition in NLP.
[![](https://www.johnsnowlabs.com/wp-content/uploads/2024/04/ava_1.png) ![](https://www.johnsnowlabs.com/wp-content/uploads/2024/04/ava_1.png)](https://www.nlpsummit.org/rag-on-fhir-using-fhir-with-generative-ai-to-make-healthcare-less-opaque/) - [Company](https://www.johnsnowlabs.com/our-story/) - - - - [![](https://www.johnsnowlabs.com/wp-content/uploads/2022/04/story_i.svg) ![](https://www.johnsnowlabs.com/wp-content/uploads/2022/04/story_i.svg)*Our Story*](https://www.johnsnowlabs.com/our-story/) - [![](https://www.johnsnowlabs.com/wp-content/uploads/2022/04/careers_i.svg) ![](https://www.johnsnowlabs.com/wp-content/uploads/2022/04/careers_i.svg)*Careers*](https://www.johnsnowlabs.com/careers/) - [![](https://www.johnsnowlabs.com/wp-content/uploads/2022/04/prp_i.svg) ![](https://www.johnsnowlabs.com/wp-content/uploads/2022/04/prp_i.svg)*Press*](https://www.johnsnowlabs.com/press/) - [![](https://www.johnsnowlabs.com/wp-content/uploads/2022/04/awards_i.svg) ![](https://www.johnsnowlabs.com/wp-content/uploads/2022/04/awards_i.svg)*Awards*](https://www.johnsnowlabs.com/awards/) - [![](https://www.johnsnowlabs.com/wp-content/uploads/2022/04/social_i.svg) ![](https://www.johnsnowlabs.com/wp-content/uploads/2022/04/social_i.svg)*Social Impact*](https://www.johnsnowlabs.com/social-impact/) - [![](https://www.johnsnowlabs.com/wp-content/uploads/2022/04/annoncements_i.svg) ![](https://www.johnsnowlabs.com/wp-content/uploads/2022/04/annoncements_i.svg)*Announcements*](https://www.johnsnowlabs.com/announcements/) - - Announcement [See all](https://www.johnsnowlabs.com/announcements/) - - [![](https://www.johnsnowlabs.com/wp-content/uploads/2026/03/Gemini_Generated_Image_n4u7ren4u7ren4u7-120x80.png)John Snow Labs Wins Real World Evidence Catalyst Challenge at PHUSE US Conn…](https://www.johnsnowlabs.com/john-snow-labs-wins-real-world-evidence-catalyst-challenge-at-phuse-us-connect-2026/) - [![](https://www.johnsnowlabs.com/wp-content/uploads/2026/03/Summit-120x80.jpg)John 
Snow Labs to Spotlight Regulatory-Grade Healthcare AI and Governance a…](https://www.johnsnowlabs.com/john-snow-labs-to-spotlight-regulatory-grade-healthcare-ai-and-governance-at-the-2026-applied-healthcare-ai-summit/) - [![](https://www.johnsnowlabs.com/wp-content/uploads/2026/02/pacific_bg-120x80.webp)John Snow Labs Earns Pacific AI Governance Certification, Raising the Bar f…](https://www.johnsnowlabs.com/john-snow-labs-earns-pacific-ai-governance-certification-raising-the-bar-for-responsible-ai-in-healthcare/) - [Sign In](https://www.johnsnowlabs.com/the-ultimate-guide-to-building-your-own-ner-model-with-python/) - [Install Software](https://www.johnsnowlabs.com/install/) - [Schedule a Call](https://www.johnsnowlabs.com/schedule-a-demo/) - [Install Software](https://www.johnsnowlabs.com/install/) - [Schedule a Call](https://www.johnsnowlabs.com/schedule-a-demo/) - [User profile![user image](https://www.johnsnowlabs.com/wp-content/uploads/2020/05/user-icon.png) ![user image](https://www.johnsnowlabs.com/wp-content/uploads/2020/05/user-icon.png)](https://www.johnsnowlabs.com/account/) - [Install Software for Free](https://www.johnsnowlabs.com/install/) - [Schedule a Call](https://www.johnsnowlabs.com/schedule-a-demo/) [0](https://www.johnsnowlabs.com/cart/) was successfully added to your cart. 
# The Ultimate Guide to Building Your Own NER Model with Python

25.04.2023

[Gursev Pirge](https://www.johnsnowlabs.com/author/gursev-pirge/), Researcher and Data Scientist

*Training a NER model from scratch with Python*

![Picture illustrates how finance NLP works.](https://www.johnsnowlabs.com/wp-content/uploads/2024/10/img_1-4.webp)

*Named Entity Recognition is a Natural Language Processing technique that involves identifying and extracting entities from a text, such as people, organizations, locations, dates, and other types of named entities. NER is used in many fields of NLP, and using Spark NLP, it is possible to train deep learning models that extract entities from text with very high accuracy.*

[Named Entity Recognition](https://www.johnsnowlabs.com/an-overview-of-named-entity-recognition-in-natural-language-processing/) (NER) is a [Natural Language Processing (NLP)](https://www.johnsnowlabs.com/introduction-to-natural-language-processing/ "Know more about NLP") technique used to identify and extract named entities from text. Named entities are words or phrases that refer to specific entities such as people, organizations, locations, dates, times, and other things that have a specific name or title. NER has many practical applications in various fields, such as information extraction, sentiment analysis, [chatbots](https://www.johnsnowlabs.com/medical-chatbot/ "Medical Chatbot"), question answering systems, and more.

NER models are a crucial part of NLP, enabling machines to understand and process unstructured text data more efficiently and accurately. They have many practical applications in various fields and can help automate tasks that would otherwise require human effort.
NER involves analyzing text to identify and classify these named entities into predefined categories. This can be done using various techniques, such as rule-based approaches, machine learning algorithms, or deep learning models. Although there are alternatives, deep learning models are particularly successful at NER tasks. There are more than 1,700 [NER models](https://nlp.johnsnowlabs.com/models?task=Named+Entity+Recognition) in the John Snow Labs Models Hub, but it is also possible to train your own deep learning model using Spark NLP.

The purpose of model training is to teach a model to make accurate predictions on new, unseen data by learning from [labeled **annotated data**](https://www.johnsnowlabs.com/top-6-text-annotation-tools/ "Know more about NLP labeling tool"). The training process involves feeding the model labeled examples and adjusting its parameters to minimize the difference between its predicted outputs and the true outputs in the training data. The trained model can then be used to make predictions on new, unseen data.

In other words, the model training process involves providing `NerDLApproach` (the Spark NLP annotator for neural-network-based NER models) with a set of annotated data, called the training set, that includes text documents along with labels for the named entities present in the text. The training set is typically created by human annotators who label the named entities with predefined categories.

In this post, we will discuss three concepts, namely CoNLL file preparation, `TFNerDLGraphBuilder`, and `NerDLApproach`, in order to understand the fundamentals of NER model training in Spark NLP.

**CoNLL** (Conference on Computational Natural Language Learning) is a standard format used for annotating and sharing annotated language data. CoNLL files are commonly used in named entity recognition.
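Before diving into those three pieces, the training objective described above (adjusting parameters to minimize the gap between predicted and true outputs) can be illustrated with a toy, framework-free example. The one-parameter linear "model" and the numbers below are invented purely for illustration and have nothing to do with Spark NLP's internals:

```python
# Toy illustration (not Spark NLP internals): "training" means nudging a
# parameter w so that the predictions w * x move closer to the true labels y.
def train_toy_model(examples, lr=0.1, epochs=50):
    """Fit y ~ w * x by gradient descent on the squared error."""
    w = 0.0
    for _ in range(epochs):
        for x, y in examples:
            error = w * x - y      # predicted output minus true output
            w -= lr * error * x    # adjust the parameter to shrink the error
    return w

# Labeled examples where the underlying rule is y = 2 * x
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = train_toy_model(examples)
print(round(w, 2))  # w converges towards 2.0
```

A real NER model adjusts millions of parameters against token-level labels rather than one weight against numbers, but the feedback loop is the same.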
`TFNerDLGraphBuilder` is a Spark NLP annotator used to build the TensorFlow (TF) graph for training and inference of a custom deep learning NER model.

`NerDLApproach` in Spark NLP is a powerful annotator for building and training NER models using deep learning techniques. It supports different embedding strategies and hyperparameters, and is highly customizable to meet the specific needs of different NER tasks.

In this post, you will learn how to use these Spark NLP annotators to train deep learning models for the named entity recognition task. Let us start with a short Spark NLP introduction and then discuss the details of NER model training with some solid results.

## Introduction to Spark NLP

Spark NLP is an open-source library maintained by [John Snow Labs](https://www.johnsnowlabs.com/). It is built on top of Apache Spark and Spark ML and provides simple, performant and accurate NLP annotations for machine learning pipelines that can scale easily in a distributed environment.

Since its first release in July 2017, Spark NLP has grown into a full NLP tool, providing:

- A single unified solution for all your NLP needs (for [Medicine](https://www.johnsnowlabs.com/healthcare-nlp/), [Banking and Finance](https://www.johnsnowlabs.com/finance-nlp/), [Legal](https://www.johnsnowlabs.com/legal-nlp/))
- Transfer learning and implementations of the latest **SOTA** algorithms and models in NLP research
- The most widely used NLP library in industry (5 years in a row)
- The most scalable, accurate and fastest library in NLP history

Spark NLP comes with 17,800+ pretrained pipelines and models in more than 250 languages. It supports most NLP tasks and provides modules that can be used seamlessly in a cluster.
Spark NLP processes the data using `Pipeline`s, structures that contain all the steps to be run on the input data:

![Structure of NLP process for finance and banking.](https://www.johnsnowlabs.com/wp-content/uploads/2024/10/img_2.webp)

Spark NLP pipelines

Each step contains an [annotator](https://nlp.johnsnowlabs.com/docs/en/concepts#annotators) that performs a specific task such as tokenization, normalization, or dependency parsing. Each annotator takes input [annotations](https://nlp.johnsnowlabs.com/docs/en/concepts#annotation) and outputs new annotations.

An **annotator** in Spark NLP is a component that performs a specific NLP task on a text document and adds annotations to it. It takes an input text document and produces an output document with additional metadata, which can be used for further processing or analysis. For example, a named entity recognizer annotator might identify and tag entities such as people, organizations, and locations in a text document, while a sentiment analysis annotator might classify the sentiment of the text as positive, negative, or neutral.

## Setup

To install Spark NLP in Python, simply use your favorite package manager (conda, pip, etc.). For example:

```
pip install spark-nlp
pip install pyspark
```

For other installation options for different environments and machines, please check the [official documentation](https://nlp.johnsnowlabs.com/docs/en/install).
Then, simply import the library and start a Spark session:

```
import sparknlp

# Start Spark Session
spark = sparknlp.start()
```

## CoNLL File Preparation

CoNLL (Conference on Computational Natural Language Learning) is a format for representing annotated data in NLP. A CoNLL file consists of columns, with each row representing a token and its associated features.

To prepare data in the CoNLL format, the raw text is first annotated with the relevant labels (e.g., named entities or part-of-speech tags). The annotated data is then converted to the CoNLL format by writing each token and its associated features as a separate row. The resulting CoNLL file can be used to train and evaluate machine learning models for the relevant NLP task.

Here is a sample sentence:

![](https://www.johnsnowlabs.com/wp-content/uploads/2023/04/1_MQ4baSLwHgFIrNJQvp73fw.webp)

CoNLL representation of the sentence:

![](https://www.johnsnowlabs.com/wp-content/uploads/2023/04/1_uAGoY5O75mJYKPpprb1Lbw.webp)

We will use the train and test datasets from the John Snow Labs GitHub repository, so first let us download them:

```
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/conll2003/eng.train
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/conll2003/eng.testa
```

Now, import the first 5,000 rows of the training dataset as a CoNLL file:

```
from sparknlp.training import CoNLL

training_data = CoNLL().readDataset(spark, './eng.train').limit(5000)

# Observe the first 3 rows of the Dataframe
training_data.show(3)
```

![](https://www.johnsnowlabs.com/wp-content/uploads/2023/04/1_759Hmxfjop0J6UfuFfXsRg.webp)

Let’s explode the training data to count all the entity labels, which use the IOB format (short for inside, outside, beginning):

```
import pyspark.sql.functions as F

training_data.select(F.explode(F.arrays_zip(training_data.token.result,
                                            training_data.label.result)).alias("cols")) \
    .select(F.expr("cols['0']").alias("token"),
            F.expr("cols['1']").alias("ground_truth")) \
    .groupBy('ground_truth').count() \
    .orderBy('count', ascending=False) \
    .show(100, truncate=False)
```
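The IOB counts come straight from the last column of the CoNLL file. To make the format concrete without Spark, here is a plain-Python sketch that parses a made-up snippet mirroring the CoNLL 2003 column layout (token, POS tag, chunk tag, NER label) and tallies its IOB labels:

```python
# Plain-Python illustration of the CoNLL 2003 layout: one token per line,
# whitespace-separated columns, the NER label (in IOB format) last.
# The two-sentence snippet below is invented for illustration.
from collections import Counter

conll_snippet = """\
EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
. . O O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""

def iob_counts(conll_text):
    """Count the IOB ground-truth labels, skipping blank sentence separators."""
    labels = [line.split()[-1]
              for line in conll_text.splitlines()
              if line.strip()]
    return Counter(labels)

print(iob_counts(conll_snippet))
# O is normally the most frequent label, just as in the Spark NLP count
```

Blank lines separate sentences, which is why the parser skips them rather than treating them as tokens.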
![](https://www.johnsnowlabs.com/wp-content/uploads/2023/04/1_rQAYIFTJm3ucfkL-gPhRPQ.webp)

Now that we have the training dataframe, we can move to the next stage: graph generation. There is a [**detailed notebook**](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.3.prepare_CoNLL_from_annotations_for_NER.ipynb) in the John Snow Labs GitHub repo about CoNLL preparation; please check it for the details of the process.

## TFNerDLGraphBuilder

Graphs are data structures that contain a set of TensorFlow (TF) operation objects, which represent units of computation, and TF tensor objects, which represent the units of data that flow between operations. They are defined in a TF `Graph` context. Because graphs are data structures, they can be saved, run, and restored without the original Python code. Graphs let TF run **fast**, **in parallel**, and efficiently **on multiple devices**.

`TFNerDLGraphBuilder` is a Spark NLP annotator used to build the TF graph for training and inference of a custom deep learning NER model. It constructs the graph using TF APIs to define the model’s layers, inputs, and outputs, as well as the optimization algorithm, loss function, and evaluation metrics to use during training.
The resulting graph can be used to train a custom NER model on a large corpus of text data and then extract named entities from new text.

First, we need to install TensorFlow and TensorFlow Addons:

```
pip install -q tensorflow==2.7.0
pip install -q tensorflow-addons
```

Then create directories for the log and graph files:

```
!mkdir ner_logs
!mkdir ner_graphs

graph_folder = "./ner_graphs"
```

Finally, define the `TFNerDLGraphBuilder` annotator with its parameters:

```
from sparknlp.annotator import TFNerDLGraphBuilder

graph_builder = TFNerDLGraphBuilder()\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label")\
    .setGraphFile("auto")\
    .setGraphFolder(graph_folder)\
    .setHiddenUnitsNumber(20)
```

The graph will be stored in the defined folder and loaded by the `NerDLApproach` annotator.

## NerDLApproach

`NerDLApproach` is an annotator within Spark NLP that implements a deep learning approach to NER model training.
`NerDLApproach` allows users to train custom NER models on large text corpora using pre-trained word embeddings, character embeddings, and contextual embeddings such as BERT (Bidirectional Encoder Representations from Transformers) or ELMo (Embeddings from Language Models).

The `NerDLApproach` annotator expects `DOCUMENT`, `TOKEN` and `WORD_EMBEDDINGS` annotations as input and produces `NAMED_ENTITY` annotations as output. Thus, the pipeline requires earlier stages to generate those input annotations.

The next step is to get the word embeddings through BERT, using the Spark NLP annotator `BertEmbeddings()`:

```
# Import the required modules and classes
from sparknlp.base import DocumentAssembler, Pipeline
from sparknlp.annotator import (
    Tokenizer,
    SentenceDetector,
    BertEmbeddings
)

# Step 1: Transforms raw texts to `document` annotation
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Step 2: Getting the sentences
sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

# Step 3: Tokenization
tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

# Step 4: BERT embeddings
embeddings = BertEmbeddings.pretrained()\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")
```

We already created the graph builder with `TFNerDLGraphBuilder`, pointing it at `graph_folder`; the next step is running the `NerDLApproach` annotator, the main module responsible for training the NER model.
```
from sparknlp.annotator import NerDLApproach

# Model training
nerTagger = NerDLApproach()\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setLabelColumn("label")\
    .setOutputCol("ner")\
    .setMaxEpochs(7)\
    .setLr(0.003)\
    .setBatchSize(32)\
    .setRandomSeed(0)\
    .setVerbose(1)\
    .setValidationSplit(0.2)\
    .setEvaluationLogExtended(True) \
    .setEnableOutputLogs(True)\
    .setIncludeConfidence(True)\
    .setGraphFolder(graph_folder)\
    .setOutputLogsPath('ner_logs')

# Define the pipeline
ner_pipeline = Pipeline(stages=[embeddings, graph_builder, nerTagger])
```

The next step is fitting the training dataset to train the model:

```
ner_model = ner_pipeline.fit(training_data)
```

Here are the metrics from the first epoch:

![](https://www.johnsnowlabs.com/wp-content/uploads/2023/04/1_ff787X9evZG9DjWOyinfOA.webp)

And the last epoch:

![](https://www.johnsnowlabs.com/wp-content/uploads/2023/04/1_wkMcxihkRuesO4gZO0DjDQ.webp)

By checking the metrics, you can observe the improvement in the trained model’s accuracy. `NerDLApproach` has many parameters, and with fine-tuning it is possible to achieve very high accuracy values. Please check this [**notebook**](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/4.NERDL_Training.ipynb) for different options in NER DL training.

## Getting Predictions from the Trained Model

Now that we have trained the model, we can test its performance on the test dataset. First, convert the CoNLL file to a Spark dataframe:

```
test_data = CoNLL().readDataset(spark, './eng.testa').limit(1000)
```

Let’s get predictions by transforming the test dataframe:

```
predictions = ner_model.transform(test_data)
```

Now, we will explode the results to get a dataframe of the tokens, the ground truths, and the labels predicted by the model we just trained.
```
predictions.select(F.explode(F.arrays_zip(predictions.token.result,
                                          predictions.label.result,
                                          predictions.ner.result)).alias("cols")) \
    .select(F.expr("cols['0']").alias("token"),
            F.expr("cols['1']").alias("ground_truth"),
            F.expr("cols['2']").alias("prediction")) \
    .show(30, truncate=False)
```

![](https://www.johnsnowlabs.com/wp-content/uploads/2023/04/1_KWZ5o-_ELPAo0aPDuI3pXg.webp)

You can see that the model was very successful in predicting the named entities.

It is also possible to **save the model** and load it back using the `NerDLModel` annotator in a pipeline. Please check the post about [Python Named Entity Recognition](https://www.johnsnowlabs.com/named-entity-recognition-ner-with-python-at-scale/) (NER), which gives details about the `NerDLModel` annotator.

## Highlight Entities

The ability to quickly visualize the entities generated using Spark NLP is very useful for speeding up the development process as well as for understanding the obtained results. [**Spark NLP Display**](https://nlp.johnsnowlabs.com/docs/en/display) is an [open-source Python NLP library](https://nlp.johnsnowlabs.com/) for visualizing the annotations generated with Spark NLP.
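Visualization works on entity chunks rather than individual token tags: inside Spark NLP, the `NerConverter` annotator merges token-level IOB tags into such chunks. The idea behind that merge can be sketched in plain Python (an illustrative sketch only, not the library's implementation):

```python
# Illustrative sketch (not NerConverter's actual code): merge token-level
# IOB tags into (entity_text, entity_type) chunks ready for highlighting.
def iob_to_chunks(tokens, tags):
    chunks, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):                  # a new entity begins here
            if current:
                chunks.append(current)
            current = ([token], tag[2:])
        elif tag.startswith("I-") and current and tag[2:] == current[1]:
            current[0].append(token)              # continue the open entity
        else:                                     # "O" or a stray I- tag
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return [(" ".join(words), label) for words, label in chunks]

tokens = ["John", "Snow", "visited", "London", "in", "1854", "."]
tags   = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-DATE", "O"]
print(iob_to_chunks(tokens, tags))
# [('John Snow', 'PER'), ('London', 'LOC'), ('1854', 'DATE')]
```

Each chunk's text and label is exactly what gets rendered as a colored span on top of the original text.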
The `NerVisualizer` annotator highlights the extracted named entities and displays their labels as decorations on top of the analyzed text. The colors assigned to the predicted labels can be configured to fit the particular needs of the application.

The figure below shows the visualization of the named entities recognized from a sample text. The entities are extracted, labelled (as PERSON, DATE, ORG, LOC, etc.) and displayed on the original text. Please check the post named “Visualizing Named Entities with Spark NLP”, which gives details about `NerVisualizer`.

![](https://www.johnsnowlabs.com/wp-content/uploads/2023/04/1_R2dfQdEhaTggMpA7lxZ6VA.webp)

Extracted named entities, displayed by `NerVisualizer`

For additional information, please consult the following references:

- Documentation: [CoNLL Datasets](https://nlp.johnsnowlabs.com/docs/en/training#conll-dataset), [TF Graphs](https://nlp.johnsnowlabs.com/docs/en/training#tensorflow-graphs), [NerDLApproach](https://nlp.johnsnowlabs.com/docs/en/annotators#nerdl)
- Python Doc: [CoNLL Datasets](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/training/conll/index.html#sparknlp.training.conll.CoNLL), [TFNerDLGraphBuilder](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/tf_ner_dl_graph_builder/index.html#sparknlp.annotator.tf_ner_dl_graph_builder.TFNerDLGraphBuilder), [NerDLApproach](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/ner/ner_dl/index.html#sparknlp.annotator.ner.ner_dl.NerDLApproach)
- Scala Doc: [CoNLL Datasets](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/training/CoNLL.html), [NerDLApproach](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/ner/dl/NerDLApproach)
- For extended examples of usage, see the notebooks for [CoNLL File Preparation](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.3.prepare_CoNLL_from_annotations_for_NER.ipynb), [Graph Generation](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/4.1_NerDL_Graph.ipynb) and [NerDL Training](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/4.NERDL_Training.ipynb). ## Conclusion In this article, we walked you through training an NER model by BERT embeddings. Named entity recognition is a crucial task in NLP that involves identifying and extracting entities such as people, places, organizations, dates, and other types of named entities from unstructured text data. A well-trained NER model helps to extract useful information from unstructured text data with high accuracy. NER deep learning model training in Spark NLP provides an efficient and scalable way to build accurate NER models for various natural language processing tasks. Spark NLP also provides a variety of pre-trained models, including deep learning models like BERT, RoBERTa, and DistilBERT, which can be used to classify entities in the text. These models can be fine-tuned on specific datasets to improve the accuracy of the NER classification. Read also related articles on the topic: [Named Entity Recognition BERT](https://www.johnsnowlabs.com/named-entity-recognition-ner-with-bert-in-spark-nlp/) [Named Entity Recognition in NLP](https://www.johnsnowlabs.com/an-overview-of-named-entity-recognition-in-natural-language-processing/) How useful was this post? 
[Gursev Pirge](https://www.johnsnowlabs.com/author/gursev-pirge/), Researcher and Data Scientist
*Training a NER model from scratch with Python*

![Picture illustrates how finance NLP works.](https://www.johnsnowlabs.com/wp-content/uploads/2024/10/img_1-4.webp)

*Named Entity Recognition is a Natural Language Processing technique that involves identifying and extracting entities from a text, such as people, organizations, locations, dates, and other types of named entities. NER is used in many fields of NLP, and using Spark NLP, it is possible to train deep learning models that extract entities from text with very high accuracy.*

[Named Entity Recognition](https://www.johnsnowlabs.com/an-overview-of-named-entity-recognition-in-natural-language-processing/) (NER) is a [Natural Language Processing (NLP)](https://www.johnsnowlabs.com/introduction-to-natural-language-processing/ "Know more about NLP") technique used to identify and extract named entities from text. Named entities are words or phrases that refer to specific entities such as people, organizations, locations, dates, and times. NER has many practical applications in various fields, such as information extraction, sentiment analysis, [chatbots](https://www.johnsnowlabs.com/medical-chatbot/ "Medical Chatbot"), question answering systems, and more.

NER models are a crucial part of NLP, enabling machines to understand and process unstructured text data more efficiently and accurately, and they can help automate tasks that would otherwise require human effort.

NER involves analyzing text to identify and classify these named entities into predefined categories. This can be done using various techniques, such as rule-based approaches, machine learning algorithms, or deep learning models. Although there are other alternatives, deep learning models are particularly successful at NER tasks.
There are more than 1,700 [NER models](https://nlp.johnsnowlabs.com/models?task=Named+Entity+Recognition) in the John Snow Labs Models Hub, but it is also possible to train your own deep learning model using Spark NLP. The purpose of model training is to teach a model to make accurate predictions on new, unseen data by learning from [labeled **annotated data**](https://www.johnsnowlabs.com/top-6-text-annotation-tools/ "Know more about NLP labeling tool"). The training process involves feeding the model labeled examples and adjusting its parameters to minimize the difference between its predicted outputs and the true outputs in the training data. The trained model can then be used to make predictions on new, unseen data.

In other words, the training process involves providing `NerDLApproach` (the Spark NLP annotator for neural-network-based NER models) with a set of annotated data, called the training set, that includes text documents along with labels for the named entities present in the text. The training set is typically created by human annotators who label the named entities in the text with predefined categories.

In this post, we will discuss three concepts, namely CoNLL file preparation, `TFNerDLGraphBuilder` and `NerDLApproach`, in order to understand the fundamentals of NER model training in Spark NLP.

- **CoNLL** (Conference on Computational Natural Language Learning) is a standard format used for annotating and sharing annotated language data. CoNLL files are commonly used in named entity recognition.
- `TFNerDLGraphBuilder` is a Spark NLP annotator used to build the TensorFlow (TF) graph for training and inference of a custom deep learning NER model.
- `NerDLApproach` is a powerful Spark NLP annotator for building and training NER models using deep learning techniques. It supports different embedding strategies and hyperparameters, and is highly customizable to meet the specific needs of different NER tasks.
In this post, you will learn how to use certain Spark NLP annotators to train deep learning models for the named entity recognition task. Let us start with a short Spark NLP introduction and then discuss the details of NER model training with some solid results.

## Introduction to Spark NLP

Spark NLP is an open-source library maintained by [John Snow Labs](https://www.johnsnowlabs.com/). It is built on top of Apache Spark and Spark ML and provides simple, performant and accurate NLP annotations for machine learning pipelines that can scale easily in a distributed environment. Since its first release in July 2017, Spark NLP has grown into a full NLP tool, providing:

- A single unified solution for all your NLP needs (for [Medicine](https://www.johnsnowlabs.com/healthcare-nlp/), [Banking and Finance](https://www.johnsnowlabs.com/finance-nlp/), [Legal](https://www.johnsnowlabs.com/legal-nlp/))
- Transfer learning and implementations of the latest **SOTA** algorithms and models in NLP research
- The most widely used NLP library in industry (5 years in a row)
- The most scalable, accurate and fastest library in NLP history

Spark NLP comes with 17,800+ pretrained pipelines and models in more than 250 languages. It supports most NLP tasks and provides modules that can be used seamlessly in a cluster.

Spark NLP processes the data using `Pipelines`, a structure that contains all the steps to be run on the input data:

![Structure of NLP process for finance and banking.](https://www.johnsnowlabs.com/wp-content/uploads/2024/10/img_2.webp)

Spark NLP pipelines

Each step contains an [annotator](https://nlp.johnsnowlabs.com/docs/en/concepts#annotators) that performs a specific task such as tokenization, normalization, and dependency parsing. Each annotator takes input [annotations](https://nlp.johnsnowlabs.com/docs/en/concepts#annotation) and outputs new annotations.
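As a toy illustration of the pipeline idea (invented sample code, not Spark NLP's actual classes), each stage consumes the annotations produced so far and adds new ones:

```python
# Toy illustration of the pipeline/annotator concept: each stage reads the
# annotations accumulated in `state` and adds its own. This is a hypothetical
# sketch, not Spark NLP's real API.
def document_assembler(state):
    state["document"] = state["text"].strip()   # raw text -> document annotation
    return state

def tokenizer(state):
    state["token"] = state["document"].split()  # document -> token annotations
    return state

def pipeline(stages, text):
    state = {"text": text}
    for stage in stages:                        # run each annotator in order
        state = stage(state)
    return state

result = pipeline([document_assembler, tokenizer], " John lives in Berlin ")
print(result["token"])   # ['John', 'lives', 'in', 'Berlin']
```

The real pipeline works the same way: each annotator declares which annotation columns it needs as input and which new column it produces.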
An **annotator** in Spark NLP is a component that performs a specific NLP task on a text document and adds annotations to it. An annotator takes an input text document and produces an output document with additional metadata, which can be used for further processing or analysis. For example, a named entity recognizer annotator might identify and tag entities such as people, organizations, and locations in a text document, while a sentiment analysis annotator might classify the sentiment of the text as positive, negative, or neutral.

## Setup

To install Spark NLP in Python, simply use your favorite package manager (conda, pip, etc.). For example:

```
pip install spark-nlp
pip install pyspark
```

For other installation options for different environments and machines, please check the [official documentation](https://nlp.johnsnowlabs.com/docs/en/install).

Then, simply import the library and start a Spark session:

```
import sparknlp

# Start Spark Session
spark = sparknlp.start()
```

## CoNLL File Preparation

CoNLL (Conference on Computational Natural Language Learning) is a format for representing annotated data in NLP. The CoNLL format consists of columns, with each row representing a token and its associated features. To prepare data in the CoNLL format, the raw text is first annotated with the relevant labels (e.g., named entities or part-of-speech tags). This annotated data is then converted to the CoNLL format by representing each token and its associated features as a separate row in the CoNLL file.
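To make the column layout concrete, here is a minimal sketch in plain Python (no Spark required). The tokens and labels are invented sample data in the four-column CoNLL-2003 style — token, POS tag, chunk tag, NER label — with blank lines separating sentences:

```python
# Hypothetical two-sentence sample in the CoNLL-2003 column layout:
# "token POS chunk NER-label", one token per line, blank line between sentences.
sample = """\
Harry NNP B-NP B-PER
Kane NNP I-NP I-PER
visited VBD B-VP O
Germany NNP B-NP B-LOC
. . O O

He PRP B-NP O
plays VBZ B-VP O
football NN B-NP O
. . O O
"""

def parse_conll(text):
    """Split CoNLL text into sentences of (token, ner_label) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():                 # blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split()
        current.append((cols[0], cols[-1]))  # keep token and NER label
    if current:
        sentences.append(current)
    return sentences

sentences = parse_conll(sample)
print(len(sentences))    # 2
print(sentences[0][0])   # ('Harry', 'B-PER')
```

This is exactly the structure that `CoNLL().readDataset()` parses for you, producing a Spark dataframe with document, sentence, token, POS and label columns.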
The resulting CoNLL file can then be used to train and evaluate machine learning models for the relevant NLP task. Here is a sample sentence:

![](https://www.johnsnowlabs.com/wp-content/uploads/2023/04/1_MQ4baSLwHgFIrNJQvp73fw.webp)

CoNLL representation of the sentence:

![](https://www.johnsnowlabs.com/wp-content/uploads/2023/04/1_uAGoY5O75mJYKPpprb1Lbw.webp)

We will use the train and test datasets from the John Snow Labs GitHub, so first let us download them:

```
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/conll2003/eng.train
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/conll2003/eng.testa
```

Now, import the first 5,000 texts of the training dataset as a CoNLL file:

```
from sparknlp.training import CoNLL

training_data = CoNLL().readDataset(spark, './eng.train').limit(5000)

# Observe the first 3 rows of the Dataframe
training_data.show(3)
```
![](https://www.johnsnowlabs.com/wp-content/uploads/2023/04/1_759Hmxfjop0J6UfuFfXsRg.webp)

Let’s explode the training data to count the entity labels, which are in IOB format (short for inside, outside, beginning):

```
import pyspark.sql.functions as F

training_data.select(F.explode(F.arrays_zip(training_data.token.result, training_data.label.result)).alias("cols")) \
             .select(F.expr("cols['0']").alias("token"),
                     F.expr("cols['1']").alias("ground_truth")) \
             .groupBy('ground_truth').count().orderBy('count', ascending=False).show(100, truncate=False)
```

![](https://www.johnsnowlabs.com/wp-content/uploads/2023/04/1_rQAYIFTJm3ucfkL-gPhRPQ.webp)

Now that we have the training dataframe, we can get to the next stage, namely graph generation. There is a [**detailed notebook**](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.3.prepare_CoNLL_from_annotations_for_NER.ipynb) in the John Snow Labs Github repo about CoNLL preparation. Please check the notebook to understand the details of the process.
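The same label counting can be sketched in plain Python with `collections.Counter`; the label list below is invented sample data standing in for the `ground_truth` column:

```python
from collections import Counter

# Hypothetical IOB ground-truth labels for two short sentences.
labels = ["B-PER", "I-PER", "O", "B-LOC", "O",
          "O", "B-ORG", "I-ORG", "O", "O"]

# Count each label, most frequent first -- the plain-Python analogue of the
# groupBy('ground_truth').count().orderBy('count') query above.
counts = Counter(labels)
for label, n in counts.most_common():
    print(f"{label}\t{n}")
```

On a real corpus, `O` (tokens outside any entity) dominates the counts, as in this toy sample.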
## TFNerDLGraphBuilder

Graphs are data structures that contain a set of TensorFlow (TF) operation objects, which represent units of computation, and TF tensor objects, which represent the units of data that flow between operations. They are defined in a TF Graph context. Since these graphs are data structures, they can be saved, run, and restored without the original Python code. Graphs are extremely useful and let TF run **fast**, **in parallel**, and efficiently **on multiple devices**.

`TFNerDLGraphBuilder` is a Spark NLP annotator that builds the TF graph for training and inference of a custom deep learning NER model. It constructs the graph using TF APIs to define the model’s layers, inputs, and outputs, along with the optimization algorithm, loss function, and evaluation metrics to use during training. The resulting graph can be used to train a custom NER model on a large corpus of text data, and the trained model can then extract named entities from new text.

First, we need to install TensorFlow and TensorFlow Addons:

```
pip install -q tensorflow==2.7.0
pip install -q tensorflow-addons
```

Then create directories for log and graph files:

```
!mkdir ner_logs
!mkdir ner_graphs

graph_folder = "./ner_graphs"
```

Finally, define the `TFNerDLGraphBuilder` annotator with its parameters.
```
from sparknlp.annotator import TFNerDLGraphBuilder

graph_builder = TFNerDLGraphBuilder()\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setLabelColumn("label")\
    .setGraphFile("auto")\
    .setGraphFolder(graph_folder)\
    .setHiddenUnitsNumber(20)
```

The graph will be stored in the defined folder and loaded by the `NerDLApproach` annotator.

## NerDLApproach

`NerDLApproach` is an annotator within Spark NLP that implements a deep learning approach for NER model training. It allows users to train custom NER models on large text corpora, using pre-trained word embeddings, character embeddings, and contextual embeddings, such as BERT (Bidirectional Encoder Representations from Transformers) or ELMo (Embeddings from Language Models).

The `NerDLApproach` annotator expects `DOCUMENT`, `TOKEN` and `WORD_EMBEDDINGS` as input, and provides `NAMED_ENTITY` as output. Thus, the pipeline requires previous steps that generate the annotations used as input to our annotator. The next step is to get the word embeddings through BERT.
We will use the Spark NLP annotator called `BertEmbeddings()`.

```
# Import the required modules and classes
from sparknlp.base import DocumentAssembler, Pipeline
from sparknlp.annotator import (
    Tokenizer,
    SentenceDetector,
    BertEmbeddings
)

# Step 1: Transforms raw texts to `document` annotation
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Step 2: Getting the sentences
sentence = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Step 3: Tokenization
tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Step 4: Bert Embeddings
embeddings = BertEmbeddings.pretrained()\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")
```

We already created the graph by using `TFNerDLGraphBuilder` and saved it in `graph_folder`; the next step is to run the `NerDLApproach` annotator, the main module responsible for training the NER model.

```
from sparknlp.annotator import NerDLApproach

# Model training
nerTagger = NerDLApproach()\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setLabelColumn("label")\
    .setOutputCol("ner")\
    .setMaxEpochs(7)\
    .setLr(0.003)\
    .setBatchSize(32)\
    .setRandomSeed(0)\
    .setVerbose(1)\
    .setValidationSplit(0.2)\
    .setEvaluationLogExtended(True)\
    .setEnableOutputLogs(True)\
    .setIncludeConfidence(True)\
    .setGraphFolder(graph_folder)\
    .setOutputLogsPath('ner_logs')

# Define the pipeline
ner_pipeline = Pipeline(stages=[embeddings, graph_builder, nerTagger])
```

The next step is fitting the training dataset to train the model:

```
ner_model = ner_pipeline.fit(training_data)
```

Here are the metrics from the first epoch:

![](https://www.johnsnowlabs.com/wp-content/uploads/2023/04/1_ff787X9evZG9DjWOyinfOA.webp)

And the last epoch:

![](https://www.johnsnowlabs.com/wp-content/uploads/2023/04/1_wkMcxihkRuesO4gZO0DjDQ.webp)

By checking the metrics, you can observe the improvement in the trained model’s accuracy. `NerDLApproach` has many parameters, and by fine-tuning them it is possible to achieve very high accuracy values. Please check this [**notebook**](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/4.NERDL_Training.ipynb) for different options in NER DL training.

## Getting Predictions from the Trained Model

Now that we have trained the model, we can test its efficiency on the test dataset.
First, convert the CoNLL file to a Spark dataframe:

```
test_data = CoNLL().readDataset(spark, './eng.testa').limit(1000)
```

Let’s get predictions by transforming the test dataframe:

```
predictions = ner_model.transform(test_data)
```

Now, we will explode the results to get a clean dataframe of the tokens, the ground truths and the labels predicted by the model we just trained.

```
predictions.select(F.explode(F.arrays_zip(predictions.token.result, predictions.label.result, predictions.ner.result)).alias("cols")) \
           .select(F.expr("cols['0']").alias("token"),
                   F.expr("cols['1']").alias("ground_truth"),
                   F.expr("cols['2']").alias("prediction")).show(30, truncate=False)
```

![](https://www.johnsnowlabs.com/wp-content/uploads/2023/04/1_KWZ5o-_ELPAo0aPDuI3pXg.webp)

You can see that the model was very successful in predicting the named entities.
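To go beyond eyeballing the table, you can compare the two columns directly and group the predicted IOB tags back into entity spans. The sketch below uses invented tag lists standing in for values collected from the `predictions` dataframe:

```python
# Hypothetical tokens, ground-truth and predicted IOB tags, standing in for
# the columns of the exploded `predictions` dataframe above.
tokens       = ["John", "Smith", "visited", "Google", "in", "May", "."]
ground_truth = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-DATE", "O"]
prediction   = ["B-PER", "I-PER", "O", "B-ORG", "O", "O",      "O"]

# Token-level accuracy: fraction of tokens whose predicted tag is correct.
accuracy = sum(g == p for g, p in zip(ground_truth, prediction)) / len(tokens)
print(f"token accuracy: {accuracy:.3f}")   # 6 of 7 tags match -> 0.857

def iob_to_spans(tokens, tags):
    """Group IOB tags into (entity_text, label) spans."""
    spans, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):               # a new entity begins
            if current:
                spans.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:  # continue the open entity
            current.append(tok)
        else:                                   # "O" closes any open entity
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans

print(iob_to_spans(tokens, prediction))
# [('John Smith', 'PER'), ('Google', 'ORG')] -- the model missed the DATE
```

In practice, entity-level precision, recall and F1 are computed over exactly such spans, which is what the `NerDLApproach` evaluation logs report per label.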
It is also possible to **save the model** and later load it back using the `NerDLModel` annotator in a pipeline. Please check the post about [Python Named Entity Recognition](https://www.johnsnowlabs.com/named-entity-recognition-ner-with-python-at-scale/) (NER), which gives details about the `NerDLModel` annotator.

## Highlight Entities

The ability to quickly visualize the entities generated using Spark NLP is a very useful feature for speeding up the development process as well as for understanding the obtained results. [**Spark NLP Display**](https://nlp.johnsnowlabs.com/docs/en/display) is an [open-source Python NLP library](https://nlp.johnsnowlabs.com/) for visualizing the annotations generated with Spark NLP.

The `NerVisualizer` annotator highlights the extracted named entities and displays their labels as decorations on top of the analyzed text. The colors assigned to the predicted labels can be configured to fit the particular needs of the application. The figure below shows the visualization of the named entities recognized in a sample text. The entities are extracted, labeled (as PERSON, DATE, ORG, LOC, etc.) and displayed on the original text. Please check the post named “Visualizing Named Entities with Spark NLP”, which gives details about `NerVisualizer`.

![](https://www.johnsnowlabs.com/wp-content/uploads/2023/04/1_R2dfQdEhaTggMpA7lxZ6VA.webp)

Extracted named entities, displayed by the Ner Visualizer

For additional information, please consult the following references.

- Documentation: [CoNLL Datasets](https://nlp.johnsnowlabs.com/docs/en/training#conll-dataset), [TF Graphs](https://nlp.johnsnowlabs.com/docs/en/training#tensorflow-graphs), [NerDLApproach](https://nlp.johnsnowlabs.com/docs/en/annotators#nerdl).
- Python Doc: [CoNLL Datasets](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/training/conll/index.html#sparknlp.training.conll.CoNLL), [TFNerDLGraphBuilder](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/tf_ner_dl_graph_builder/index.html#sparknlp.annotator.tf_ner_dl_graph_builder.TFNerDLGraphBuilder), [NerDLApproach](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/ner/ner_dl/index.html#sparknlp.annotator.ner.ner_dl.NerDLApproach).
- Scala Doc: [CoNLL Datasets](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/training/CoNLL.html), [NerDLApproach](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/ner/dl/NerDLApproach).
- For extended examples of usage, see the notebooks for [CoNLL File Preparation](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.3.prepare_CoNLL_from_annotations_for_NER.ipynb), [Graph Generation](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/4.1_NerDL_Graph.ipynb) and [NerDL Training](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/4.NERDL_Training.ipynb).

## Conclusion

In this article, we walked through training an NER model with BERT embeddings. Named entity recognition is a crucial NLP task that involves identifying and extracting entities such as people, places, organizations, dates, and other types of named entities from unstructured text data. A well-trained NER model helps extract useful information from unstructured text with high accuracy, and NER deep learning model training in Spark NLP provides an efficient and scalable way to build accurate NER models for various natural language processing tasks.
Spark NLP also provides a variety of pre-trained models, including deep learning models like BERT, RoBERTa, and DistilBERT, which can be used to classify entities in text. These models can be fine-tuned on specific datasets to improve the accuracy of NER classification.

Read also related articles on the topic:

[Named Entity Recognition BERT](https://www.johnsnowlabs.com/named-entity-recognition-ner-with-bert-in-spark-nlp/)

[Named Entity Recognition in NLP](https://www.johnsnowlabs.com/an-overview-of-named-entity-recognition-in-natural-language-processing/)

Try The Generative AI Lab - No-Code Platform For Model Tuning & Validation

[See in action](https://www.johnsnowlabs.com/nlp-lab/)
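As a closing aside: the entity highlighting discussed earlier operates on entity *chunks* rather than raw token tags, so token-level BIO predictions must first be grouped into spans (this is what Spark NLP's `NerConverter` does inside a pipeline). A minimal pure-Python sketch of that grouping, using made-up tokens and tags for illustration:

```python
# Group token-level BIO tags into (entity_text, label) chunks -- a plain-Python
# sketch of the grouping that Spark NLP's NerConverter performs in a pipeline.
def bio_to_chunks(tokens, tags):
    chunks, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):                 # a new entity starts here
            if current:
                chunks.append(current)
            current = ([token], tag[2:])
        elif tag.startswith("I-") and current and tag[2:] == current[1]:
            current[0].append(token)             # continue the open entity
        else:                                    # O tag or inconsistent I- tag
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return [(" ".join(words), label) for words, label in chunks]

# Made-up example input, not model output.
tokens = ["John", "Snow", "Labs", "is", "based", "in", "Delaware"]
tags   = ["B-ORG", "I-ORG", "I-ORG", "O", "O", "O", "B-LOC"]
print(bio_to_chunks(tokens, tags))  # [('John Snow Labs', 'ORG'), ('Delaware', 'LOC')]
```

These (text, label) chunks are exactly the units a visualizer highlights and decorates on the original text.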