🕷️ Crawler Inspector

URL Lookup

Direct Parameter Lookup

Raw Queries and Responses

1. Shard Calculation

Query:

Response:

Calculated Shard: 146 (from laksa187)

2. Crawled Status Check

Query:

curl -X POST \
  'http://laksa146.int.ahrefs:8124/' \
  -H 'Content-Type: text/plain' \
  -H 'X-ClickHouse-Database: crawler3' \
  -H 'Authorization: Basic YXBpOg==' \
  -d 'SELECT getAhrefsURLFromUnparsed(src_unparsed) AS found_url, ifNull(toUnixTimestamp(download_stamp), 0) AS crawl_time, ifNull(toUnixTimestamp(props_url_first_seen), 0) AS first_indexed_time, download_http_code AS http_code, src_unparsed AS src_unparsed, src_root_hash AS src_root_hash, history_drop_reason AS history_drop_reason, meta_title AS meta_title, meta_descriptions AS meta_descriptions, attrs_boilerpipe_text AS attrs_boilerpipe_text, attrs_markdown AS attrs_markdown, attrs_readable_markdown AS attrs_readable_markdown, meta_canonical AS meta_canonical FROM crawler3.page_info_local FINAL PREWHERE (src_root_hash, src_unparsed) IN ((getAhrefsRootHashFromUnparsed(getAhrefsUnparsedNoserviceFromURL(\'https://ubiai.tools/build-nlp-project-from-zero-to-hero-5-model-training-ubiai/\')), getAhrefsUnparsedNoserviceFromURL(\'https://ubiai.tools/build-nlp-project-from-zero-to-hero-5-model-training-ubiai/\'))) FORMAT JSONEachRow'

Response:

{"found_url":"https:\/\/ubiai.tools\/build-nlp-project-from-zero-to-hero-5-model-training-ubiai\/","crawl_time":1769270181,"first_indexed_time":1697243514,"http_code":200,"src_unparsed":"tools,ubiai!\/build-nlp-project-from-zero-to-hero-5-model-training-ubiai\/ s443","src_root_hash":"14153978550210686946","history_drop_reason":null,"meta_title":"Build NLP Project From Zero To Hero (5): Model Training","meta_descriptions":["Model Training to Build An NLP Project - Build An NLP Project From Zero To Hero (5): Model Training - UBIAI"],"attrs_boilerpipe_text":"Input Data Format\nModel Evaluation\nTraining a CRF Model\nWorkflow\nI am using Google Colaboratory as the working environment and Google Drive as where to store our data and our models.\n \nNow, let us transform our data into useful features by collecting details about every token and its adjacent neighbors. You notice that I am commenting out part of speech tagging. You can include it if you have a Part Of Speech Model which can be a part of the feature engineering pipeline.\n# Utils functions to extract features\r\ndef word2features(sent, i):\r\n    word = sent[i][0]\r\n    #postag = sent[i][1]\r\n\r\n    features = {\r\n        'bias': 1.0,\r\n        'word.lower()': word.lower(),\r\n        'word[-3:]': word[-3:],\r\n        'word[-2:]': word[-2:],\r\n        'word.isupper()': word.isupper(),\r\n        'word.istitle()': word.istitle(),\r\n        'word.isdigit()': word.isdigit(),\r\n        # 'postag': postag,\r\n        # 'postag[:2]': postag[:2],\r\n    }\r\n\r\n    if i > 0:\r\n        word1 = sent[i-1][0]\r\n        #postag1 = sent[i-1][1]\r\n        features.update({\r\n            '-1:word.lower()': word1.lower(),\r\n            '-1:word.istitle()': word1.istitle(),\r\n            '-1:word.isupper()': word1.isupper(),\r\n            # '-1:postag': postag1,\r\n            # '-1:postag[:2]': postag1[:2],\r\n        })\r\n    else:\r\n        features['BOS'] = True\r\n\r\n    if i\nF1-score for CRF model\nThat is pretty good! It performed better than the Spacy NER model (81%). Remember that our labels are not balanced, if we adjust this problem, we will be having an excellent model.\nThe CRF model was the defacto solution for various NLP tasks like Part of Speech Tagging and Named Entity Recognition before the Deep Learning Era. It is still efficient as you have seen right now!\nModel Usage\nLet us use it for unseen examples:\n#convert raw sentences into list of tuples (token and empty)\r\ndef sents2tuples(sents):\r\n      res = []\r\n      for sent in sents:\r\n        tokens = word_tokenize(sent)\r\n        res.append([(token,'') for token in tokens])\r\n      return res\r\n\r\n#with sent2tuples, preprocessing will work just fine with new text\r\ndef preprocess( texts):\r\n      texts = [res for res in sents2tuples(texts)]\r\n      X = [sent2features(s) for s in texts]\r\n      return X\r\n\r\nsamples = [\"Facebook has a price target of $ 20 for this quarter\",\r\n         \"$ AAPL is gaining a new momentum\"]\r\n\r\n\r\nprocessed = preprocess(samples)\r\n\r\npred = crf.predict(processed)\r\nfor i in range(len(samples)):\r\n  sentence = samples[i].split()\r\n  for j in range(len(sentence)):\r\n    print(sentence[j],'-->',pred[i][j])\r\n  print()\nThe output of CRF\nNot bad for our two examples. However, the model will struggle with variations of ‘$20’ for ‘$ 20’ and ‘$AAPL’ for ‘$ AAPL’. It will not label them correctly. This can be mitigated by more effective tokenization and feature engineering. We can generate variations of the same instances (new training examples by varying the spacing) and let the model learn them. This is called Data Augmentation.\nLastly, don’t forget to save your model and test it again!\nimport pickle\nfilename\n=\n'crf_model.sav'\npickle.dump(crf,\nopen\n(\nfilename\n,\n'wb'\n))\n\n      loaded_model = pickle.\nload\n(\nopen\n(\nfilename\n,\n'rb'\n))\n\n      loaded_model.predict(processed)\nTraining a Spacy NER Transformer-based Model\nTransformers are considered state of the art for NLP tasks. As usual, we will try to understanding the intuition behind it without going too much in details.\nYou heard buzzwords like Google BERT (Bidirectional Encoder Representations from Transformer) and Open AI GPT3 (Generative Pre-trained Transformer 3\n)\n, highly sophisticated models that can understand natural languages and generate well-structured sentences. It all goes back to the Transformers as you can see from their names.\nThe main objective is to understand, how the tokens in a given document are interconnected with each other.\nFacebook has a price target of $ 20 for this quarter. Analysts put it to ‘Hold’.\nWhen we read this hypothetical tweet, our minds memorizes the word ‘Facebook’, and remembers that the terms ‘has’ and ‘price target’ are related to it. It can also deduce that ‘it’ is also related to the initial term. By looking for every word that is connected to it, a Transformer model uses the same concept to synthesize effectively the semantic relationships between words which other models struggle to do.\nModel Architecture\nLet us see Transformers through word embeddings:\nWe have a document with 16 tokens (the tweet example)\nSelect the first Token ‘Facebook’ X\nConvert the 16 tokens to Word Embeddings (each token encoding depends on the rest of the tokens and their weights.)\nReflect similarities between Encoded Tokens and the Selected Token by using Dot Product (then normalization to not inflate the weights)\nThe Dot product and the Normalization will produce new weights which will be used again with the 16 Encoded Tokens to compute the representation Y of our token X.\nAnd you repeat the process for the rest of Tokens. It is noted that the weights mentioned here are different in concept than those of Neural Networks Weights.\nKeywords that relate to Transformers are Values (the Encoded Tokens at the phase of computing Y), the Query (The Selected Token), and the Keys (Encoded Tokens as an output of the Word Embeddings).\nValues, Queries and Keys\nThis is a naïve simple Transformer architecture. You can introduce a Feed-Forward Network for the Values.\nIt is a loose explanation but it is enough to get you started. I recommend reading this \narticle\n by \nOleg Borisov\n.\nWorkflow\nIt is noted that training any Spacy-based Model follows the same workflow.\nAs usual, I have used Google Colab and Google Drive.\nUse GPU for your runtime and check it:\n!\nnvcc\n--version\n#nvcc\n:\nNVIDIA\n(\nR\n)\nCuda\ncompiler\ndriver\nCopyright\n(\nc\n) 2005\n-2020\n#NVIDIA\nCorporation\nBuilt\non\nMon_Oct_12_20\n:09\n:46_PDT_2020\nCuda\n#compilation\ntools\n,\nrelease\n11\n.1\n,\nV11\n.1\n.105\nBuild\n#cuda_11\n.1\n.TC455_06\n.29190527_0\nInstall these dependencies, we will need cuda and spacy transformers.\n!pip\ninstall\n-U pip setuptools wheel\n\n      !pip\ninstall\n-U spacy\n\n      !pip\ninstall\n-U spacy[cuda111,transformers]\n#cuda version 111\nMake sure that cuda (parallel computing platform by Nvidia) and cupy (python library like numpy but used for GPU-accelerated computing) has the same versions or closer. As of this moment, Colab current cupy version is 9.4 and Cuda version is 11.1. Configuring GPU for training your Models locally can cause a lot of headaches, so be careful about the matching of versions between the library and the platform.\nCheck PyTorch and Cuda availability:\nimport torch\n\n      torch\n.cuda\n.is_available\n()\nMake a folder for your project and change the current working directory to that folder:\n!\nmkdir\ntrf_ner\ncd\ntrf_ner\nThese commands will get the train and test datasets and convert them from IOB to JSON format and then spacy binary format:\n!cp \/content\/drive\/MyDrive\/Public\/stock-market-analysis-split\/stock_test_IOB.tsv .\/stock_test_IOB.tsv\n\n      !cp \/content\/drive\/MyDrive\/Public\/stock-market-analysis-split\/stock_train_IOB.tsv .\/stock_train_IOB.tsv\n\n      !python -m spacy\nconvert\n.\/stock_train_IOB.tsv .\/ -\nt\njson -\nn\n1\n-c iob\n\n      !python -m spacy\nconvert\n.\/stock_test_IOB.tsv .\/ -\nt\njson -\nn\n1\n-c iob\n\n      !python -m spacy\nconvert\n.\/stock_train_IOB.json .\/ -\nt\nspacy\n\n      !python -m spacy\nconvert\n.\/stock_test_IOB.json .\/ -\nt\nspacy\nGo to the \nSpacy Config File Widget\n and generate your proper config file. Make sure to select NER, GPU Transformer, and efficiency. Upload the config file to Google Drive, and alter parameters according to your need. In fact, I needed to modify the ‘total_steps’ in [training.optimizer.learn_rate] from 20000 to 10000 because of\n a sudden drop in the performance of the model\n while training, it went from 74% F-score to 0 and remained that way for the rest of epochs.\nInitialize your project using the config file:\n!python -m spacy init\nfill-config\n\/content\/drive\/MyDrive\/Public\/stock-market-analysis-split\/base_config.cfg .\/config.cfg\nWe can debug our data:\n!python -m spacy debug\ndata\n.\/config.cfg\nIf there are major issues that will prevent you from training your model, Spacy will inform you. Here, we are told that there is a low number of training examples (350) and that there are labels with very low cardinality like PERSON. They are warnings and not errors and so we can proceed.\nDebugging your data with spacy\nTrain the model:\n!python -m spacy train -g\n0\n.\/config\n.cfg\n--output .\/\nIf you see that the pipeline has initialized then everything is working correctly:\nTraining spacy NER transformer model\nYou certainly noticed that I have aborted the training after just three epochs. There is a story behind this. I actually trained the model for 3 hours and it reached an F-Score of around 87%, however, Google Colab cut out the GPU support because I exceeded the allowed quota. So be careful with Cloud resources, or your day-work will be lost. I had to redo the training the next day.\nOur Best Ner Trf Model has an F-Score of 85.71%. Not bad.\nThe last thing, save your model after training!\ncp -r\n\/content\/\ntrf_ner\n\/content\/\ndrive\n\/MyDrive\/\nPublic\n\/stock-market-analysis-split\/\ntrf_ner\nUsage\nWe can import the model like any other Spacy model:\nimport spacy\n\n      ner = spacy.\nload\n('\/\ncontent\n\/drive\/MyDrive\/Public\/stock-market-analysis-\nsplit\n\/trf_ner\/model-best')\nLet us try it on some examples:\nsamples = [\"Facebook has a price target of $ 20 for this quarter\",\n         \"$ AAPL is gaining a new momentum\"]\n\n\nfor doc in ner.pipe(samples):\n  for ent in doc.ents:\n      print(ent.label_, ent.text)\n  print()\n \nOutput of the transformer model\nIt works!\nConclusion\nWe were able to get decent starting models from a small dataset (350 training examples, 50 test examples) and we know exactly why they are not performing better (unbalanced dataset, Twitter tweets have a specific style of writing different). Our journey is far from over. There are a lot of things to consider, like Model Fine-Tuning, Data Augmentation, Model Monitoring, and Model Deployment. We are not done yet!\nThis article was longer than usual but I believe it can serve as a future guide for those who want to get quickly into training their own NLP models.\nIf you are curious, you can request a demo contacting admin@100.21.53.251 or \nTwitter\n.\nHappy learning and see you in the next article!","attrs_markdown":"![](https:\/\/www.facebook.com\/tr?id=950528140575647&ev=PageView&noscript=1)\n\n![ubiai deep learning](https:\/\/ubiai.tools\/wp-content\/uploads\/2023\/01\/LogoPalceHolder.svg)\n\n[Login](https:\/\/app.ubiai.tools\/login)\n\n[Get a Demo](https:\/\/apiv2.ubiai.tools\/widget\/booking\/rQdEzYzc0Mo3sA6EAeJt)\n\n![ubiai deep learning](https:\/\/ubiai.tools\/wp-content\/uploads\/2023\/01\/LogoPalceHolder.svg)\n\n# Build An NLP Project From Zero To Hero (5): Model Training\n###### Jan 18, 2022\nTraining an ML model is without a doubt the most interesting part for every data scientist and for every machine learning enthusiast. Model training refers simply to the model learning from its input data to generalize over a given phenomenon. With every training iteration, the model adjusts its weights to be able to make correct predictions as much as possible using a training algorithm like gradient descent.\n\nThere are a lot of details that concern this phase: selecting the model, verifying the integrity of the input data, evaluating the model, training, and saving it. We will get to every detail and then we will show how we apply each one.\n## Model Selection\nIn general, since we have observed and have prepared our data through preprocessing and labeling, we should have a good idea of what model we will be using.\n\nUsually, there will be a list of models to choose from, and this will make the project far more complicated than it should be. A good intuition is to try the *simplest model* for your task and then proceed to improve the architecture of the existing model or to choose a more complex model that is compatible with the same task. However, the simplest model can fail from the start.\n\nSo, you can identify certain characteristics and properties that will help you reduce the candidate model list.\n\nA very good example of this is from the Google Developers Guide of [Text Classification](https:\/\/developers.google.com\/machine-learning\/guides\/text-classification\/step-2-5). Through a lot of experimentation and testing, they identified a metric *S\/W* or the number of samples\/number of words per sample ratio. This metric will indicate wether you should choose **n-gram models** like logistic regression and support vector machines or **sequence models** like CNN or RNN for the text classification task. In practice, this is difficult to achieve on your own as you need to do a lot of experimentation and testing. This is why you should research industry-standard models.\n\nThere are also the characteristics of the data itself that would help map to the correct model: [If your data has a large number of features but a significantly lower number of observations](https:\/\/medium.com\/axum-labs\/logistic-regression-vs-support-vector-machines-svm-c335610a3d16#:~:text=When%20To%20Use%20Logistic%20Regression%20vs%20Support%20Vector%20Machine), a support vector machine will perform better than logistic regression.\n\nFor Named Entity Recognition, do you want to train a new model from scratch? Or use a pre-trained model and probably build upon it? For example, you can use a Spacy pre-trained NER model but it might not suit your need as in the case of this project. Or, you can train a Spacy Model from scratch using your own dataset, giving new vocabulary and labels for the model.\n\nThis [article](https:\/\/medium.com\/@b.terryjack\/nlp-pretrained-named-entity-recognition-7caa5cd28d7b#:~:text=There%20are%20a%20good%20range,)%20API%20(e.g.%20GATE).) presents a great overview of pre-trained NER models, ranging from rule-based models like in NLTK to probabilistic models like Stanford Core NLP and Deep Learning Models like Flair.\n\nIn the [Data Preprocessing Phase](https:\/\/medium.com\/ubiai-nlp\/build-an-nlp-project-from-zero-to-hero-3-data-preprocessing-9a7ef729d05?source=your_stories_page----------------------------------------), we have used the [Spacy pre-trained NER model](https:\/\/spacy.io\/universe\/project\/video-spacys-ner-model) during the pre-annotation of the dataset. The model uses a sophisticated word embedding strategy using subword features and “Bloom” embeddings, a deep convolutional neural network with residual connections.\n\nBesides, we have trained a spacy-based model in the last article, using the Model Assisted Labeling feature within the [UBIAI tool](https:\/\/ubiai.tools\/). The performance was not too bad as a start. We can download it and use it like any other spacy model by clicking the Download Button in the Action Column in the Models Tab of our current project:\n\n![Build An NLP Project From Zero To Hero (Model Training)](https:\/\/i0.wp.com\/ubiai.tools\/wp-content\/uploads\/2023\/02\/1_GEEjhQt_PgTXIgr3_bQw3g.png?fit=750%2C300&ssl=1)\n\nUBIAI Model Training Dashboard\n\nFor this article, we will feature the workflow for training two models: a **Probabilistic** Model, **CRF** or **Continuous Random Fields**, and a **Deep Learning** Model, **Spacy NER** model with **Transformers**.\n\nBut before that, we need to talk about the Input Data Format and Model Evaluation.\n## Input Data Format\nTo assure that your model will actually work, you must identify clearly the format of its Input Data. This is necessary for both prediction and training. There exist many suitable formats for the NER task and among them:\n\n1. **IOB format**: short for inside, outside, the beginning is **a common tagging format for tagging tokens in a chunking task** in NLP. Every document is separated into tokens. Each token will take a row and in front of every token, you will find its label. The Null label or ‘O’ is necessary in this case to mark unlabeled tokens. Since the labeling is practically word by word, there is an additional technique to label multiple token terms, I-notation (Inside of a labeled term) and B-notation (Beginning of a labeled term). Documents are separated between each other by a special separator (in our case ‘-DOCSTART- -X- O O’).\n2. **JSON format**: In this format, your dataset is a list of JSON objects. Each object represents a document and a list of its annotations. An annotation is a labeled term represented by a dictionary containing its text, its label, its starting position, and its ending index in the text in the document string.\n\n![Build An NLP Project From Zero To Hero (Model Training)](https:\/\/i0.wp.com\/ubiai.tools\/wp-content\/uploads\/2023\/02\/1_5ekkr-TrYZ71sg8Sia-agA-1.png?fit=612%2C428&ssl=1)\n\n![Build An NLP Project From Zero To Hero (Model Training)](https:\/\/i0.wp.com\/ubiai.tools\/wp-content\/uploads\/2023\/02\/1_5HlLAFBOSpnsPZBCycvWOQ.png?fit=1151%2C723&ssl=1)\n\nIOB left, Spacy JSON right\n\nIf you recall, we have already used the JSON format with Spacy previously in the pre-annotation and the Data Labeling Process.\n\nIn the UBIAI tool, you just need to open the Project list menu and click on the Download Button in the Actions’ Column.\n\nThere is an important point to talk about, splitting your dataset into a training set, development (or validation set), and a test set. We can omit the dev set to simplify things as we are in a learning project. A ratio of 80\/20 is good for our small dataset.\n## Model Evaluation\nSince it is a classification task, we might begin with **Accuracy**. Accuracy is good if your dataset is balanced (every label has the same number of instances as everyone else). This is not our case.\n\nTraditionally, these three metrics are considered for the NER task:\n\n- **Precision**: Determines if your model predicts a real incorrect label as correct. In other words, your model predicts that ‘Google’ is a PERSON name while it is not correct in reality. The higher this metric is, the lesser your model makes this mistake.\n- **Recall**: Determines if your model predicts a real correct label as incorrect. For example, your model does not predict ‘Google’ as COMPANY even though it is in reality. The higher this metric is, the lesser your model will miss correct instances.\n- **F1-Score**: an overall indicator of the performance of the classifier that takes into account both Precision and Recall.\n\nYou noticed that I explained these metrics in terms of intuition. To delve more theoretically, begin by checking out this [article](https:\/\/medium.com\/analytics-vidhya\/confusion-matrix-accuracy-precision-recall-f1-score-ade299cf63cd#:~:text=Recall%20is%20also%20known%20as,high) by [Harikrishnan](https:\/\/medium.com\/@harikrishnannb).\n\nIn the next two parts, I am supposing that you have a basic understanding of how a Machine Learning model train and what constitute a Model architecture. If you want to delve deeper into this topic, I recommend the [Coursera Deep Learning Specialization](https:\/\/www.coursera.org\/specializations\/deep-learning) by [Andrew Ng](https:\/\/twitter.com\/andrewyng?lang=en).\n## Training a CRF Model\nWe have a sequence of tokens in every training example. These tokens are usually words if you decided to tokenize at the word level. We have talked about tokenization extensively in this episode.\n\nTo predict the nature of a word (is it a PERSON, a COMPANY, etc), we cannot ignore the sequential nature of our data (tweets, sentences) as it is a significant loss of information. We have to select a model that can infer from previous positions for the prediction of the current position. Named Entity Recognition is of sequential nature after all.\n\nFor example, using the IOB format and given the tweet “Facebook has a target price of \\$10”, the labeling might be “Facebook (B-COMPANY) has (O) a (O) target (O) target (B-MONEY\\_LABEL) price (I-MONEY\\_LABEL) of (O) \\$10 (MONEY)”. So to predict the word ‘price’ as an Inside MONEY\\_LABEL, we need to know features of the previous word ‘target’. Knowing that it has the label Beginning MONEY\\_LABEL, its features would serve very well in making a correct prediction for ‘price’.\n\n1. **Model Architecture :**\n2. CRF or Continuous Random Fields builds upon this intuition by building feature functions that take into account the sentence and arbitrary labels throughout it. As a simple example, let us consider this feature function input\n3. Sentence **s**\n4. The position **i** of a **word** in the sentence\n5. The label **l(i)** of the current word\n6. The label **l(i−1)** of the previous word\n7. This feature function outputs a real-valued number which is usually binary. It is called **linear-chain CRF.** We then **\\*\\*assign each feature function a** weight **that is to be learned by the model. Lastly, we transform these functions into** probabilities\\*\\* by summing over every sentence for every feature function and by subsequent exponentiation and normalization.\n\n![Build An NLP Project From Zero To Hero (Model Training)](https:\/\/i0.wp.com\/ubiai.tools\/wp-content\/uploads\/2023\/02\/1_lj1ypNPml-NxaccEBWGIg.png?fit=1043%2C279&ssl=1)\n\nGeneric CRF Model\n\nI hope that you have an intuition of how CRF works. For more details, check out these two great articles by [Analyticsvidhya](https:\/\/www.analyticsvidhya.com\/blog\/2018\/08\/nlp-guide-conditional-random-fields-text-classification\/#h2_5) and [Edwin Chen](https:\/\/blog.echen.me\/2012\/01\/03\/introduction-to-conditional-random-fields\/).\n## Workflow\nI am using Google Colaboratory as the working environment and Google Drive as where to store our data and our models.\n\n```\n\n```\n``\n\n``\n\n``\n\n``\n\n``\n```\n\n\t\tNow, let us transform our data into useful features by collecting details about every token and its adjacent neighbors. You notice that I am commenting out part of speech tagging. You can include it if you have a Part Of Speech Model which can be a part of the feature engineering pipeline.\n```\n\n``\n\n``\n\n``\n\n``\n\n``","attrs_readable_markdown":null,"meta_canonical":null}

3. Robots.txt Check

Query:

Response:

4. Spam/Ban Check

Query:

Response:

5. Seen Status Check

ℹ️ Skipped - page is already crawled

📄

INDEXABLE

✅

CRAWLED

2 months ago

🤖

ROBOTS ALLOWED

Page Info Filters

Filter	Status	Condition	Details
HTTP status	PASS	`download_http_code = 200`	HTTP 200
Age cutoff	PASS	`download_stamp > now() - 6 MONTH`	2.9 months ago
History drop	PASS	`isNull(history_drop_reason)`	No drop reason
Spam/ban	PASS	`fh_dont_index != 1 AND ml_spam_score = 0`	ml_spam_score=0
Canonical	PASS	`meta_canonical IS NULL OR = '' OR = src_unparsed`	Not set

Page Details

Property	Value
URL	https://ubiai.tools/build-nlp-project-from-zero-to-hero-5-model-training-ubiai/
Last Crawled	2026-01-24 15:56:21 (2 months ago)
First Indexed	2023-10-14 00:31:54 (2 years ago)
HTTP Status Code	200
Meta Title	Build NLP Project From Zero To Hero (5): Model Training
Meta Description	Model Training to Build An NLP Project - Build An NLP Project From Zero To Hero (5): Model Training - UBIAI
Meta Canonical	null
Boilerpipe Text	Input Data Format Model Evaluation Training a CRF Model Workflow I am using Google Colaboratory as the working environment and Google Drive as where to store our data and our models. Now, let us transform our data into useful features by collecting details about every token and its adjacent neighbors. You notice that I am commenting out part of speech tagging. You can include it if you have a Part Of Speech Model which can be a part of the feature engineering pipeline. # Utils functions to extract features def word2features(sent, i): word = sent[i][0] #postag = sent[i][1] features = { 'bias': 1.0, 'word.lower()': word.lower(), 'word[-3:]': word[-3:], 'word[-2:]': word[-2:], 'word.isupper()': word.isupper(), 'word.istitle()': word.istitle(), 'word.isdigit()': word.isdigit(), # 'postag': postag, # 'postag[:2]': postag[:2], } if i > 0: word1 = sent[i-1][0] #postag1 = sent[i-1][1] features.update({ '-1:word.lower()': word1.lower(), '-1:word.istitle()': word1.istitle(), '-1:word.isupper()': word1.isupper(), # '-1:postag': postag1, # '-1:postag[:2]': postag1[:2], }) else: features['BOS'] = True if i F1-score for CRF model That is pretty good! It performed better than the Spacy NER model (81%). Remember that our labels are not balanced, if we adjust this problem, we will be having an excellent model. The CRF model was the defacto solution for various NLP tasks like Part of Speech Tagging and Named Entity Recognition before the Deep Learning Era. It is still efficient as you have seen right now! Model Usage Let us use it for unseen examples: #convert raw sentences into list of tuples (token and empty) def sents2tuples(sents): res = [] for sent in sents: tokens = word_tokenize(sent) res.append([(token,'') for token in tokens]) return res #with sent2tuples, preprocessing will work just fine with new text def preprocess( texts): texts = [res for res in sents2tuples(texts)] X = [sent2features(s) for s in texts] return X samples = ["Facebook has a price target of $ 20 for this quarter", "$ AAPL is gaining a new momentum"] processed = preprocess(samples) pred = crf.predict(processed) for i in range(len(samples)): sentence = samples[i].split() for j in range(len(sentence)): print(sentence[j],'-->',pred[i][j]) print() The output of CRF Not bad for our two examples. However, the model will struggle with variations of ‘$20’ for ‘$ 20’ and ‘$AAPL’ for ‘$ AAPL’. It will not label them correctly. This can be mitigated by more effective tokenization and feature engineering. We can generate variations of the same instances (new training examples by varying the spacing) and let the model learn them. This is called Data Augmentation. Lastly, don’t forget to save your model and test it again! import pickle filename = 'crf_model.sav' pickle.dump(crf, open ( filename , 'wb' )) loaded_model = pickle. load ( open ( filename , 'rb' )) loaded_model.predict(processed) Training a Spacy NER Transformer-based Model Transformers are considered state of the art for NLP tasks. As usual, we will try to understanding the intuition behind it without going too much in details. You heard buzzwords like Google BERT (Bidirectional Encoder Representations from Transformer) and Open AI GPT3 (Generative Pre-trained Transformer 3 ) , highly sophisticated models that can understand natural languages and generate well-structured sentences. It all goes back to the Transformers as you can see from their names. The main objective is to understand, how the tokens in a given document are interconnected with each other. Facebook has a price target of $ 20 for this quarter. Analysts put it to ‘Hold’. When we read this hypothetical tweet, our minds memorizes the word ‘Facebook’, and remembers that the terms ‘has’ and ‘price target’ are related to it. It can also deduce that ‘it’ is also related to the initial term. By looking for every word that is connected to it, a Transformer model uses the same concept to synthesize effectively the semantic relationships between words which other models struggle to do. Model Architecture Let us see Transformers through word embeddings: We have a document with 16 tokens (the tweet example) Select the first Token ‘Facebook’ X Convert the 16 tokens to Word Embeddings (each token encoding depends on the rest of the tokens and their weights.) Reflect similarities between Encoded Tokens and the Selected Token by using Dot Product (then normalization to not inflate the weights) The Dot product and the Normalization will produce new weights which will be used again with the 16 Encoded Tokens to compute the representation Y of our token X. And you repeat the process for the rest of Tokens. It is noted that the weights mentioned here are different in concept than those of Neural Networks Weights. Keywords that relate to Transformers are Values (the Encoded Tokens at the phase of computing Y), the Query (The Selected Token), and the Keys (Encoded Tokens as an output of the Word Embeddings). Values, Queries and Keys This is a naïve simple Transformer architecture. You can introduce a Feed-Forward Network for the Values. It is a loose explanation but it is enough to get you started. I recommend reading this article by Oleg Borisov . Workflow It is noted that training any Spacy-based Model follows the same workflow. As usual, I have used Google Colab and Google Drive. Use GPU for your runtime and check it: ! nvcc --version #nvcc : NVIDIA ( R ) Cuda compiler driver Copyright ( c ) 2005 -2020 #NVIDIA Corporation Built on Mon_Oct_12_20 :09 :46_PDT_2020 Cuda #compilation tools , release 11 .1 , V11 .1 .105 Build #cuda_11 .1 .TC455_06 .29190527_0 Install these dependencies, we will need cuda and spacy transformers. !pip install -U pip setuptools wheel !pip install -U spacy !pip install -U spacy[cuda111,transformers] #cuda version 111 Make sure that cuda (parallel computing platform by Nvidia) and cupy (python library like numpy but used for GPU-accelerated computing) has the same versions or closer. As of this moment, Colab current cupy version is 9.4 and Cuda version is 11.1. Configuring GPU for training your Models locally can cause a lot of headaches, so be careful about the matching of versions between the library and the platform. Check PyTorch and Cuda availability: import torch torch .cuda .is_available () Make a folder for your project and change the current working directory to that folder: ! mkdir trf_ner cd trf_ner These commands will get the train and test datasets and convert them from IOB to JSON format and then spacy binary format: !cp /content/drive/MyDrive/Public/stock-market-analysis-split/stock_test_IOB.tsv ./stock_test_IOB.tsv !cp /content/drive/MyDrive/Public/stock-market-analysis-split/stock_train_IOB.tsv ./stock_train_IOB.tsv !python -m spacy convert ./stock_train_IOB.tsv ./ - t json - n 1 -c iob !python -m spacy convert ./stock_test_IOB.tsv ./ - t json - n 1 -c iob !python -m spacy convert ./stock_train_IOB.json ./ - t spacy !python -m spacy convert ./stock_test_IOB.json ./ - t spacy Go to the Spacy Config File Widget and generate your proper config file. Make sure to select NER, GPU Transformer, and efficiency. Upload the config file to Google Drive, and alter parameters according to your need. In fact, I needed to modify the ‘total_steps’ in [training.optimizer.learn_rate] from 20000 to 10000 because of a sudden drop in the performance of the model while training, it went from 74% F-score to 0 and remained that way for the rest of epochs. Initialize your project using the config file: !python -m spacy init fill-config /content/drive/MyDrive/Public/stock-market-analysis-split/base_config.cfg ./config.cfg We can debug our data: !python -m spacy debug data ./config.cfg If there are major issues that will prevent you from training your model, Spacy will inform you. Here, we are told that there is a low number of training examples (350) and that there are labels with very low cardinality like PERSON. They are warnings and not errors and so we can proceed. Debugging your data with spacy Train the model: !python -m spacy train -g 0 ./config .cfg --output ./ If you see that the pipeline has initialized then everything is working correctly: Training spacy NER transformer model You certainly noticed that I have aborted the training after just three epochs. There is a story behind this. I actually trained the model for 3 hours and it reached an F-Score of around 87%, however, Google Colab cut out the GPU support because I exceeded the allowed quota. So be careful with Cloud resources, or your day-work will be lost. I had to redo the training the next day. Our Best Ner Trf Model has an F-Score of 85.71%. Not bad. The last thing, save your model after training! cp -r /content/ trf_ner /content/ drive /MyDrive/ Public /stock-market-analysis-split/ trf_ner Usage We can import the model like any other Spacy model: import spacy ner = spacy. load ('/ content /drive/MyDrive/Public/stock-market-analysis- split /trf_ner/model-best') Let us try it on some examples: samples = ["Facebook has a price target of $ 20 for this quarter", "$ AAPL is gaining a new momentum"] for doc in ner.pipe(samples): for ent in doc.ents: print(ent.label_, ent.text) print() Output of the transformer model It works! Conclusion We were able to get decent starting models from a small dataset (350 training examples, 50 test examples) and we know exactly why they are not performing better (unbalanced dataset, Twitter tweets have a specific style of writing different). Our journey is far from over. There are a lot of things to consider, like Model Fine-Tuning, Data Augmentation, Model Monitoring, and Model Deployment. We are not done yet! This article was longer than usual but I believe it can serve as a future guide for those who want to get quickly into training their own NLP models. If you are curious, you can request a demo contacting admin@100.21.53.251 or Twitter . Happy learning and see you in the next article!
Markdown	![](https://www.facebook.com/tr?id=950528140575647&ev=PageView&noscript=1) ![ubiai deep learning](https://ubiai.tools/wp-content/uploads/2023/01/LogoPalceHolder.svg) [Login](https://app.ubiai.tools/login) [Get a Demo](https://apiv2.ubiai.tools/widget/booking/rQdEzYzc0Mo3sA6EAeJt) ![ubiai deep learning](https://ubiai.tools/wp-content/uploads/2023/01/LogoPalceHolder.svg) # Build An NLP Project From Zero To Hero (5): Model Training ###### Jan 18, 2022 Training an ML model is without a doubt the most interesting part for every data scientist and for every machine learning enthusiast. Model training refers simply to the model learning from its input data to generalize over a given phenomenon. With every training iteration, the model adjusts its weights to be able to make correct predictions as much as possible using a training algorithm like gradient descent. There are a lot of details that concern this phase: selecting the model, verifying the integrity of the input data, evaluating the model, training, and saving it. We will get to every detail and then we will show how we apply each one. ## Model Selection In general, since we have observed and have prepared our data through preprocessing and labeling, we should have a good idea of what model we will be using. Usually, there will be a list of models to choose from, and this will make the project far more complicated than it should be. A good intuition is to try the simplest model for your task and then proceed to improve the architecture of the existing model or to choose a more complex model that is compatible with the same task. However, the simplest model can fail from the start. So, you can identify certain characteristics and properties that will help you reduce the candidate model list. A very good example of this is from the Google Developers Guide of [Text Classification](https://developers.google.com/machine-learning/guides/text-classification/step-2-5). Through a lot of experimentation and testing, they identified a metric S/W or the number of samples/number of words per sample ratio. This metric will indicate wether you should choose n-gram models like logistic regression and support vector machines or sequence models like CNN or RNN for the text classification task. In practice, this is difficult to achieve on your own as you need to do a lot of experimentation and testing. This is why you should research industry-standard models. There are also the characteristics of the data itself that would help map to the correct model: [If your data has a large number of features but a significantly lower number of observations](https://medium.com/axum-labs/logistic-regression-vs-support-vector-machines-svm-c335610a3d16#:~:text=When%20To%20Use%20Logistic%20Regression%20vs%20Support%20Vector%20Machine), a support vector machine will perform better than logistic regression. For Named Entity Recognition, do you want to train a new model from scratch? Or use a pre-trained model and probably build upon it? For example, you can use a Spacy pre-trained NER model but it might not suit your need as in the case of this project. Or, you can train a Spacy Model from scratch using your own dataset, giving new vocabulary and labels for the model. This [article](https://medium.com/@b.terryjack/nlp-pretrained-named-entity-recognition-7caa5cd28d7b#:~:text=There%20are%20a%20good%20range,)%20API%20(e.g.%20GATE).) presents a great overview of pre-trained NER models, ranging from rule-based models like in NLTK to probabilistic models like Stanford Core NLP and Deep Learning Models like Flair. In the [Data Preprocessing Phase](https://medium.com/ubiai-nlp/build-an-nlp-project-from-zero-to-hero-3-data-preprocessing-9a7ef729d05?source=your_stories_page----------------------------------------), we have used the [Spacy pre-trained NER model](https://spacy.io/universe/project/video-spacys-ner-model) during the pre-annotation of the dataset. The model uses a sophisticated word embedding strategy using subword features and “Bloom” embeddings, a deep convolutional neural network with residual connections. Besides, we have trained a spacy-based model in the last article, using the Model Assisted Labeling feature within the [UBIAI tool](https://ubiai.tools/). The performance was not too bad as a start. We can download it and use it like any other spacy model by clicking the Download Button in the Action Column in the Models Tab of our current project: ![Build An NLP Project From Zero To Hero (Model Training)](https://i0.wp.com/ubiai.tools/wp-content/uploads/2023/02/1_GEEjhQt_PgTXIgr3_bQw3g.png?fit=750%2C300&ssl=1) UBIAI Model Training Dashboard For this article, we will feature the workflow for training two models: a Probabilistic Model, CRF or Continuous Random Fields, and a Deep Learning Model, Spacy NER model with Transformers. But before that, we need to talk about the Input Data Format and Model Evaluation. ## Input Data Format To assure that your model will actually work, you must identify clearly the format of its Input Data. This is necessary for both prediction and training. There exist many suitable formats for the NER task and among them: 1. IOB format: short for inside, outside, the beginning is a common tagging format for tagging tokens in a chunking task in NLP. Every document is separated into tokens. Each token will take a row and in front of every token, you will find its label. The Null label or ‘O’ is necessary in this case to mark unlabeled tokens. Since the labeling is practically word by word, there is an additional technique to label multiple token terms, I-notation (Inside of a labeled term) and B-notation (Beginning of a labeled term). Documents are separated between each other by a special separator (in our case ‘-DOCSTART- -X- O O’). 2. JSON format: In this format, your dataset is a list of JSON objects. Each object represents a document and a list of its annotations. An annotation is a labeled term represented by a dictionary containing its text, its label, its starting position, and its ending index in the text in the document string. ![Build An NLP Project From Zero To Hero (Model Training)](https://i0.wp.com/ubiai.tools/wp-content/uploads/2023/02/1_5ekkr-TrYZ71sg8Sia-agA-1.png?fit=612%2C428&ssl=1) ![Build An NLP Project From Zero To Hero (Model Training)](https://i0.wp.com/ubiai.tools/wp-content/uploads/2023/02/1_5HlLAFBOSpnsPZBCycvWOQ.png?fit=1151%2C723&ssl=1) IOB left, Spacy JSON right If you recall, we have already used the JSON format with Spacy previously in the pre-annotation and the Data Labeling Process. In the UBIAI tool, you just need to open the Project list menu and click on the Download Button in the Actions’ Column. There is an important point to talk about, splitting your dataset into a training set, development (or validation set), and a test set. We can omit the dev set to simplify things as we are in a learning project. A ratio of 80/20 is good for our small dataset. ## Model Evaluation Since it is a classification task, we might begin with Accuracy. Accuracy is good if your dataset is balanced (every label has the same number of instances as everyone else). This is not our case. Traditionally, these three metrics are considered for the NER task: - Precision: Determines if your model predicts a real incorrect label as correct. In other words, your model predicts that ‘Google’ is a PERSON name while it is not correct in reality. The higher this metric is, the lesser your model makes this mistake. - Recall: Determines if your model predicts a real correct label as incorrect. For example, your model does not predict ‘Google’ as COMPANY even though it is in reality. The higher this metric is, the lesser your model will miss correct instances. - F1-Score: an overall indicator of the performance of the classifier that takes into account both Precision and Recall. You noticed that I explained these metrics in terms of intuition. To delve more theoretically, begin by checking out this [article](https://medium.com/analytics-vidhya/confusion-matrix-accuracy-precision-recall-f1-score-ade299cf63cd#:~:text=Recall%20is%20also%20known%20as,high) by [Harikrishnan](https://medium.com/@harikrishnannb). In the next two parts, I am supposing that you have a basic understanding of how a Machine Learning model train and what constitute a Model architecture. If you want to delve deeper into this topic, I recommend the [Coursera Deep Learning Specialization](https://www.coursera.org/specializations/deep-learning) by [Andrew Ng](https://twitter.com/andrewyng?lang=en). ## Training a CRF Model We have a sequence of tokens in every training example. These tokens are usually words if you decided to tokenize at the word level. We have talked about tokenization extensively in this episode. To predict the nature of a word (is it a PERSON, a COMPANY, etc), we cannot ignore the sequential nature of our data (tweets, sentences) as it is a significant loss of information. We have to select a model that can infer from previous positions for the prediction of the current position. Named Entity Recognition is of sequential nature after all. For example, using the IOB format and given the tweet “Facebook has a target price of \$10”, the labeling might be “Facebook (B-COMPANY) has (O) a (O) target (O) target (B-MONEY\_LABEL) price (I-MONEY\_LABEL) of (O) \$10 (MONEY)”. So to predict the word ‘price’ as an Inside MONEY\_LABEL, we need to know features of the previous word ‘target’. Knowing that it has the label Beginning MONEY\_LABEL, its features would serve very well in making a correct prediction for ‘price’. 1. Model Architecture : 2. CRF or Continuous Random Fields builds upon this intuition by building feature functions that take into account the sentence and arbitrary labels throughout it. As a simple example, let us consider this feature function input 3. Sentence s 4. The position i of a word in the sentence 5. The label l(i) of the current word 6. The label l(i−1) of the previous word 7. This feature function outputs a real-valued number which is usually binary. It is called linear-chain CRF. We then *\\assign each feature function a* weight that is to be learned by the model. Lastly, we transform these functions into probabilities\\ by summing over every sentence for every feature function and by subsequent exponentiation and normalization. ![Build An NLP Project From Zero To Hero (Model Training)](https://i0.wp.com/ubiai.tools/wp-content/uploads/2023/02/1_lj1ypNPml-NxaccEBWGIg.png?fit=1043%2C279&ssl=1) Generic CRF Model I hope that you have an intuition of how CRF works. For more details, check out these two great articles by [Analyticsvidhya](https://www.analyticsvidhya.com/blog/2018/08/nlp-guide-conditional-random-fields-text-classification/#h2_5) and [Edwin Chen](https://blog.echen.me/2012/01/03/introduction-to-conditional-random-fields/). ## Workflow I am using Google Colaboratory as the working environment and Google Drive as where to store our data and our models. ``` ``` `` `` `` `` `` ``` Now, let us transform our data into useful features by collecting details about every token and its adjacent neighbors. You notice that I am commenting out part of speech tagging. You can include it if you have a Part Of Speech Model which can be a part of the feature engineering pipeline. ``` `` `` `` `` ``
Readable Markdown	null
Shard	146 (laksa)
Root Hash	14153978550210686946
Unparsed URL	tools,ubiai!/build-nlp-project-from-zero-to-hero-5-model-training-ubiai/ s443