🕷️ Crawler Inspector

URL Lookup

Direct Parameter Lookup

Raw Queries and Responses

1. Shard Calculation

Query:

Response:

Calculated Shard: 115 (from laksa055)

2. Crawled Status Check

Query:

curl -X POST \
  'http://laksa115.int.ahrefs:8124/' \
  -H 'Content-Type: text/plain' \
  -H 'X-ClickHouse-Database: crawler3' \
  -H 'Authorization: Basic YXBpOg==' \
  -d $'SELECT getAhrefsURLFromUnparsed(src_unparsed) AS found_url, ifNull(toUnixTimestamp(download_stamp), 0) AS crawl_time, ifNull(toUnixTimestamp(props_url_first_seen), 0) AS first_indexed_time, download_http_code AS http_code, src_unparsed AS src_unparsed, src_root_hash AS src_root_hash, history_drop_reason AS history_drop_reason, meta_title AS meta_title, meta_descriptions AS meta_descriptions, attrs_boilerpipe_text AS attrs_boilerpipe_text, attrs_markdown AS attrs_markdown, attrs_readable_markdown AS attrs_readable_markdown, meta_canonical AS meta_canonical, ml_categories_json AS ml_categories_json, ml_types_json AS ml_types_json, ml_intent_types_json AS ml_intent_types_json, meta_language AS meta_language, attrs_author AS attrs_author, ifNull(toUnixTimestamp(attrs_publish_time), 0) AS attrs_publish_time, ifNull(toUnixTimestamp(attrs_original_publish_time), 0) AS attrs_original_publish_time, ifNull(attrs_is_republished, 0) AS attrs_is_republished, ifNull(attrs_nr_words, 0) AS attrs_nr_words, ifNull(attrs_boilerpipe_nr_words, 0) AS attrs_boilerpipe_nr_words, ifNull(body_ext_links_number, 0) AS body_ext_links_number, ifNull(body_int_links_number, 0) AS body_int_links_number, ifNull(meta_nofollow, 0) AS meta_nofollow, ifNull(meta_noarchive, 0) AS meta_noarchive, ifNull(props_was_rendered, 0) AS props_was_rendered, ifNull(src_redirect, \'\') AS src_redirect, ifNull(download_time_msec, 0) AS download_time_msec, ifNull(download_ttfb_msec, 0) AS download_ttfb_msec, ifNull(download_size, 0) AS download_size FROM crawler3.page_info_local FINAL PREWHERE (src_root_hash, src_unparsed) IN ((getAhrefsRootHashFromUnparsed(getAhrefsUnparsedNoserviceFromURL(\'https://conferences.oreilly.com/strata/strata-ca-2019/public/schedule/detail/72931.html\')), getAhrefsUnparsedNoserviceFromURL(\'https://conferences.oreilly.com/strata/strata-ca-2019/public/schedule/detail/72931.html\'))) FORMAT JSONEachRow'

Response:

{"found_url":"https:\/\/conferences.oreilly.com\/strata\/strata-ca-2019\/public\/schedule\/detail\/72931.html","crawl_time":1775510873,"first_indexed_time":1598020843,"http_code":200,"src_unparsed":"com,oreilly!conferences,\/strata\/strata-ca-2019\/public\/schedule\/detail\/72931.html s443","src_root_hash":"3309313061461398115","history_drop_reason":null,"meta_title":"NLP from scratch: Solving the cold start problem for natural language processing: Big data conference & machine learning training | Strata Data","meta_descriptions":["How do you train a machine learning model with no training data? Michael Johnson and Norris Heintzelman share their journey implementing multiple solutions to bootstrapping training data in the NLP domain, covering topics including weak supervision, building an active learning framework, and annotation adjudication for named-entity recognition."],"attrs_boilerpipe_text":"Description Unstructured data in the form of documents, web pages, and social media interactions is an ever-growing, ever-more valuable data source for addressing present business problems, from exploring brand sentiment to identifying sensitive information in internal documents. Unfortunately, the classification and annotation algorithms behind solving these problems often require significant amounts of labeled training data to produce desired accuracy. Michael Johnson and Norris Heintzelman share several techniques they’ve implemented to build classification and  NER  models from scratch. 
They lead a tour through this space as it applies to  NLP  and demonstrate their approach and architecture for the following techniques: Weak supervision for news documents: Using rules base classification alongside deep learning system for text classification Active learning and human in the loop: Explaining how breakthroughs in transfer learning for  NLP  have impacted their active learning framework for building an  LSTM -based relevance model Creative training sets: Identifying and cleaning already-labeled datasets, training classifier on “only” positive examples NER  adjudication: Combining knowledge from several annotation sources that leverages the strengths of each source For each of these topics, Michael and Norris outline the theoretical foundation, the implementation architecture, and tools used and discuss the problems they encountered—so you can avoid making the same mistakes.","attrs_markdown":"☰\n\n- [Schedule](https:\/\/conferences.oreilly.com\/strata\/strata-ca-2019\/public\/schedule\/full.html)\n- [Training](https:\/\/conferences.oreilly.com\/strata\/strata-ca-2019\/public\/schedule\/stype\/1334.html)\n- [Speakers](https:\/\/conferences.oreilly.com\/strata\/strata-ca-2019\/public\/schedule\/speakers.html)\n- [Sponsors](https:\/\/conferences.oreilly.com\/strata\/strata-ca-2019\/public\/content\/sponsors.html)\n- [Events](https:\/\/conferences.oreilly.com\/strata\/strata-ca-2019\/public\/schedule\/stype\/1335.html)\n- [About](https:\/\/conferences.oreilly.com\/strata\/strata-ca-2019\/public\/content\/about.html)\n- [Account](https:\/\/conferences.oreilly.com\/strata\/strata-ca-2019\/user\/account.html)\n\nSan Francisco[London](https:\/\/conferences.oreilly.com\/strata\/strata-eu)[New York](https:\/\/conferences.oreilly.com\/strata\/strata-ny)\n\n[![Strata Data 
Conference](https:\/\/conferences.oreilly.com\/strata\/strata-ca-2019\/cdn.oreillystatic.com\/en\/assets\/1\/eventseries\/23\/strataconf_svg_logo_rev.svg)](https:\/\/conferences.oreilly.com\/strata\/strata-ca-2019.html)\n\nPresented By  \n O’Reilly + Cloudera\n\nMake Data Work\n\nMarch 25-28, 2019  \n San Francisco, CA\n\nPlease log in\n\n[![Add to your personal schedule](https:\/\/conferences.oreilly.com\/images\/personal-schedule-icon.png)Add to Your Schedule](https:\/\/conferences.oreilly.com\/strata\/strata-ca-2019\/public\/schedule\/add\/72931%3Ftype=Proposal.html \"Add to your personal schedule\")\n\n# NLP from scratch: Solving the cold start problem for natural language processing\n[Michael Johnson](https:\/\/conferences.oreilly.com\/strata\/strata-ca-2019\/public\/schedule\/speaker\/308246.html) (Lockheed Martin), [Norris Heintzelman](https:\/\/conferences.oreilly.com\/strata\/strata-ca-2019\/public\/schedule\/speaker\/332388.html) (Lockheed Martin)\n\n11:50am–12:30pm Wednesday, March 27, 2019\n\n[Data Science, Machine Learning & AI](https:\/\/conferences.oreilly.com\/strata\/strata-ca-2019\/public\/schedule\/topic\/2843.html)  \n  Location: 2010\n\nSecondary topics: [Text and Language processing and analysis](https:\/\/conferences.oreilly.com\/strata\/strata-ca-2019\/public\/schedule\/stopic\/2990.html)\n\nAverage rating: ![\\*](https:\/\/conferences.oreilly.com\/images\/rating-on.gif)![\\*](https:\/\/conferences.oreilly.com\/images\/rating-on.gif)![\\*](https:\/\/conferences.oreilly.com\/images\/rating-on.gif)![\\*](https:\/\/conferences.oreilly.com\/images\/rating-on.gif)![.](https:\/\/conferences.oreilly.com\/images\/rating-off.gif)\n\n(4.60, 15 ratings)\n\n![](https:\/\/conferences.oreilly.com\/strata\/strata-ca-2019\/cdn.oreillystatic.com\/en\/assets\/1\/eventprovider\/1\/test_slide_icon.jpg) [Download slides 
(PPTX)](https:\/\/conferences.oreilly.com\/strata\/strata-ca-2019\/cdn.oreillystatic.com\/en\/assets\/1\/event\/290\/NLP%20from%20scratch_%20Solving%20the%20cold%20start%20problem%20for%20natural%20language%20processing%20Presentation.pptx)\n\n## Who is this presentation for?\n- Machine learning practitioners, data scientists, and business analysts\n## Level\nIntermediate\n\n## Prerequisite knowledge\n- Familiarity with machine learning and NLP (useful but not required)\n## What you'll learn\n- Explore cutting-edge techniques for building machine learning models from scratch in the NLP domain\n- Understand how to frame a business problem as an NLP problem\n- Learn strategies for getting a solution off the ground\n- Discover mistakes made in the process and how to avoid them\n\n## Description\nUnstructured data in the form of documents, web pages, and social media interactions is an ever-growing, ever-more valuable data source for addressing present business problems, from exploring brand sentiment to identifying sensitive information in internal documents. Unfortunately, the classification and annotation algorithms behind solving these problems often require significant amounts of labeled training data to produce desired accuracy.\n\nMichael Johnson and Norris Heintzelman share several techniques they’ve implemented to build classification and NER models from scratch. 
They lead a tour through this space as it applies to NLP and demonstrate their approach and architecture for the following techniques:\n\n- Weak supervision for news documents: Using rules base classification alongside deep learning system for text classification\n- Active learning and human in the loop: Explaining how breakthroughs in transfer learning for NLP have impacted their active learning framework for building an LSTM\\-based relevance model\n- Creative training sets: Identifying and cleaning already-labeled datasets, training classifier on “only” positive examples\n- NER adjudication: Combining knowledge from several annotation sources that leverages the strengths of each source\n\nFor each of these topics, Michael and Norris outline the theoretical foundation, the implementation architecture, and tools used and discuss the problems they encountered—so you can avoid making the same mistakes.\n\n![Photo of Michael Johnson](https:\/\/conferences.oreilly.com\/strata\/strata-ca-2019\/cdn.oreillystatic.com\/en\/assets\/1\/eventprovider\/1\/_@user_308246.jpg)\n\n## Michael Johnson\n#### Lockheed Martin\nMichael Johnson is a senior data scientist at Lockheed Martin. He has done data science and analytics in fields including manufacturing optimization, semiconductor reliability, and human resources-focused time series forecasting and simulation. He has recently been focused on how to apply cutting-edge deep learning algorithms to NLP domains.\n\n![Photo of Norris Heintzelman](https:\/\/conferences.oreilly.com\/strata\/strata-ca-2019\/cdn.oreillystatic.com\/en\/assets\/1\/eventprovider\/1\/_@user_332388.jpg)\n\n## Norris Heintzelman\n#### Lockheed Martin\nNorris Heintzelman is a senior research and data scientist with 19 years’ real-world experience converting data into knowledge—that is, 19 years’ experience in many areas of natural language processing, knowledge systems, cleaning and normalizing messy data, and rigorous accuracy measurement. 
Norris has published several papers in the fields of health informatics and general knowledge management. She has worked for Lockheed Martin for a very long time, in multiple business areas, from public sector contracts to advanced R\\&D to internal business process support. An alumna of both Temple University and the University of Pennsylvania, she lives in Wilmington, Delaware, with her husband, two daughters, and two cats. She likes to eat and talk about food.\n\n- [Website](https:\/\/www.lockheedmartin.com\/)\n\nPresented by\n\n- [![Cloudera](https:\/\/cdn.oreillystatic.com\/conferences\/images\/logos\/cloudera-1150.png)](http:\/\/www.cloudera.com\/)\n- [![O'Reilly](https:\/\/cdn.oreillystatic.com\/conferences\/images\/logos\/oreilly_logo_box150.png)](http:\/\/www.oreilly.com\/)\n\nStrategic Sponsors\n\n- [![Dataiku](https:\/\/cdn.oreillystatic.com\/conferences\/images\/logos\/dataiku-2.png)](http:\/\/www.dataiku.com\/)\n- [![Google Cloud](https:\/\/cdn.oreillystatic.com\/conferences\/images\/logos\/google-cloud150.png)](<http:\/\/cloud.google.com >)\n- [![IBM](https:\/\/cdn.oreillystatic.com\/conferences\/images\/logos\/ibm150.gif)](http:\/\/www.ibm.com\/)\n\nZettabyte Sponsor\n\n- [![Oracle Cloud Infrastructure](https:\/\/cdn.oreillystatic.com\/conferences\/images\/logos\/oracle-cloud.png)](https:\/\/cloud.oracle.com\/home)\n\nContributing Sponsors\n\n- [![Kyligence](https:\/\/cdn.oreillystatic.com\/conferences\/images\/logos\/kyligence-3.png)](http:\/\/www.kyligence.io\/)\n- [![MemSQL](https:\/\/cdn.oreillystatic.com\/conferences\/images\/logos\/memsql-4.png)](http:\/\/www.memsql.com\/)\n- [![MinIO](https:\/\/cdn.oreillystatic.com\/conferences\/images\/logos\/minio-4.png)](http:\/\/min.io\/)\n\nExabyte Sponsors\n\n- [![Amazon Web Services](https:\/\/cdn.oreillystatic.com\/conferences\/images\/logos\/amazon-web-services.png)](http:\/\/aws.amazon.com\/big-data%20)\n- [![erwin 
Inc](https:\/\/cdn.oreillystatic.com\/conferences\/images\/logos\/erwin.png)](http:\/\/www.erwin.com\/)\n- [![SAS](https:\/\/cdn.oreillystatic.com\/conferences\/images\/logos\/sas_3.png)](http:\/\/www.sas.com\/)\n- [![Talend](https:\/\/cdn.oreillystatic.com\/conferences\/images\/logos\/talend-4.png)](http:\/\/www.talend.com\/)\n- [![ThoughtTrace](https:\/\/cdn.oreillystatic.com\/conferences\/images\/logos\/thoughttrace-1.png)](http:\/\/www.thoughttrace.com\/)\n\nImpact Sponsors\n\n- [![Impetus](https:\/\/cdn.oreillystatic.com\/conferences\/images\/logos\/impetus-4.png)](http:\/\/bigdata.impetus.com\/)\n- [![Kyvos Insights](https:\/\/cdn.oreillystatic.com\/conferences\/images\/logos\/kyvos-2.png)](http:\/\/www.kyvosinsights.com\/)\n- [![Robin Systems](https:\/\/cdn.oreillystatic.com\/conferences\/images\/logos\/robin.png)](https:\/\/robin.io\/)\n- [![Striim](https:\/\/cdn.oreillystatic.com\/conferences\/images\/logos\/striim-2.png)](http:\/\/www.striim.com\/)\n- [![Syncsort](https:\/\/cdn.oreillystatic.com\/conferences\/images\/logos\/syncsort-3.png)](http:\/\/www.syncsort.com\/)\n\nSupporting Sponsor\n\n- [![Deloitte Consulting ](https:\/\/cdn.oreillystatic.com\/conferences\/images\/logos\/deloitte-1.png)](https:\/\/www2.deloitte.com\/)\n\n### Sponsorship Opportunities\nFor exhibition and sponsorship opportunities, email [strataconf@oreilly.com](mailto:strataconf@oreilly.com)\n\n### Partner Opportunities\nFor information on trade opportunities with O'Reilly conferences, email [partners@oreilly.com](mailto:partners@oreilly.com)\n\n### Contact Us\nView a complete list of [Strata Data Conference contacts](https:\/\/conferences.oreilly.com\/strata-ca-2019\/public\/content\/contact)\n\n[![Kyligence](https:\/\/conferences.oreilly.com\/strata\/strata-ca-2019\/cdn.oreillystatic.com\/oreilly\/images\/transparent-1px.png)](http:\/\/kyligence.io\/) 
[![MemSQL](https:\/\/conferences.oreilly.com\/strata\/strata-ca-2019\/cdn.oreillystatic.com\/oreilly\/images\/transparent-1px.png)](https:\/\/www.memsql.com\/?utm_source=strata&utm_medium=pp&utm_campaign=2019-strata-data-sf-banner) [![SAS](https:\/\/conferences.oreilly.com\/strata\/strata-ca-2019\/cdn.oreillystatic.com\/oreilly\/images\/transparent-1px.png)](http:\/\/www.sas.com\/platform) [![Impetus](https:\/\/conferences.oreilly.com\/strata\/strata-ca-2019\/cdn.oreillystatic.com\/oreilly\/images\/transparent-1px.png)](http:\/\/bigdata.impetus.com\/) [![Syncsort](https:\/\/conferences.oreilly.com\/strata\/strata-ca-2019\/cdn.oreillystatic.com\/oreilly\/images\/transparent-1px.png)](https:\/\/www.syncsort.com\/) [![IBM](https:\/\/conferences.oreilly.com\/strata\/strata-ca-2019\/cdn.oreillystatic.com\/oreilly\/images\/transparent-1px.png)](http:\/\/www.ibm.com\/) [![Dataiku](https:\/\/conferences.oreilly.com\/strata\/strata-ca-2019\/cdn.oreillystatic.com\/oreilly\/images\/transparent-1px.png)](http:\/\/www.dataiku.com\/) [![MinIO](https:\/\/conferences.oreilly.com\/strata\/strata-ca-2019\/cdn.oreillystatic.com\/oreilly\/images\/transparent-1px.png)](https:\/\/min.io\/) [![ThoughtTrace](https:\/\/conferences.oreilly.com\/strata\/strata-ca-2019\/cdn.oreillystatic.com\/oreilly\/images\/transparent-1px.png)](https:\/\/www.thoughttrace.com\/) [![Oracle Cloud Infrastructure](https:\/\/conferences.oreilly.com\/strata\/strata-ca-2019\/cdn.oreillystatic.com\/oreilly\/images\/transparent-1px.png)](https:\/\/cloud.oracle.com\/try\/big-data) [![Google Cloud](https:\/\/conferences.oreilly.com\/strata\/strata-ca-2019\/cdn.oreillystatic.com\/oreilly\/images\/transparent-1px.png)](http:\/\/cloud.google.com\/)\n\n- **Information**\n- [About](https:\/\/conferences.oreilly.com\/strata-ca-2019\/public\/content\/about)\n- [Resources](https:\/\/conferences.oreilly.com\/strata\/strata-ca-2019\/public\/content\/resources.html)\n- 
[Diversity](http:\/\/oreilly.com\/conferences\/diversity.html)\n- [Code of Conduct](http:\/\/oreilly.com\/conferences\/code-of-conduct.html)\n- [Privacy Policy](http:\/\/oreilly.com\/oreilly\/privacy.html)\n- [Contact Us](https:\/\/conferences.oreilly.com\/strata\/strata-ca-2019\/public\/content\/contact.html)\n\n- **More O'Reilly Events**\n- [Artificial Intelligence](https:\/\/conferences.oreilly.com\/artificial-intelligence)\n- [Open Source](https:\/\/conferences.oreilly.com\/oscon)\n- [Software Architecture](https:\/\/conferences.oreilly.com\/software-architecture)\n- [TensorFlow World](https:\/\/conferences.oreilly.com\/tensorflow)\n- [Velocity](https:\/\/conferences.oreilly.com\/velocity\/)\n\n- **More O'Reilly Sites**\n- [O'Reilly online learning](https:\/\/www.oreilly.com\/online-learning\/individuals.html)\n- [O'Reilly Conferences](https:\/\/conferences.oreilly.com\/)\n- [oreilly.com](http:\/\/oreilly.com\/)\n- [O'Reilly Video Training](https:\/\/www.oreilly.com\/search\/?query=*&formats=video)\n\n- [![Twitter](https:\/\/conferences.oreilly.com\/strata\/strata-ca-2019\/cdn.oreillystatic.com\/en\/assets\/1\/event\/127\/solid2015_social_twitter.svg)Twitter](https:\/\/twitter.com\/strataconf)\n- [![Facebook](https:\/\/conferences.oreilly.com\/strata\/strata-ca-2019\/cdn.oreillystatic.com\/en\/assets\/1\/event\/127\/solid2015_social_facebook.svg)Facebook](https:\/\/www.facebook.com\/OReilly)\n\n- [![LinkedIn](https:\/\/conferences.oreilly.com\/strata\/strata-ca-2019\/cdn.oreillystatic.com\/en\/assets\/1\/event\/127\/solid2015_social_linkedin.svg)LinkedIn](https:\/\/www.linkedin.com\/company\/8459)\n- [![YouTube](https:\/\/conferences.oreilly.com\/strata\/strata-ca-2019\/cdn.oreillystatic.com\/en\/assets\/1\/event\/127\/solid2015_social_youtube.svg)YouTube](https:\/\/www.youtube.com\/user\/OreillyMedia)\n\n©2019, O'Reilly Media, Inc. 
• (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • [confreg@oreilly.com](mailto:confreg@oreilly.com)","attrs_readable_markdown":null,"meta_canonical":null,"ml_categories_json":"","ml_types_json":"","ml_intent_types_json":"","meta_language":"en","attrs_author":"Michael Johnson","attrs_publish_time":0,"attrs_original_publish_time":1598020843,"attrs_is_republished":0,"attrs_nr_words":"675","attrs_boilerpipe_nr_words":"203","body_ext_links_number":39,"body_int_links_number":36,"meta_nofollow":0,"meta_noarchive":0,"props_was_rendered":0,"src_redirect":"","download_time_msec":92,"download_ttfb_msec":91,"download_size":16121}
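
The response above is a single `FORMAT JSONEachRow` row. Its time fields are Unix epoch seconds, and `0` stands in for NULL because the query wraps them in `ifNull(..., 0)`. A minimal Python sketch of decoding those fields is shown below. The helper name `epoch_to_utc` is illustrative and not part of the inspector, the row is trimmed to a few fields copied from the response, and UTC is assumed as the display time zone.

```python
import json
from datetime import datetime, timezone

# Trimmed copy of the JSONEachRow response above (only a few fields kept).
raw_row = (
    '{"crawl_time":1775510873,"first_indexed_time":1598020843,'
    '"attrs_publish_time":0,"http_code":200,"attrs_nr_words":"675"}'
)


def epoch_to_utc(seconds):
    """Format epoch seconds as a UTC string; 0 is the query's ifNull() placeholder."""
    if seconds == 0:
        return "not set"
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


row = json.loads(raw_row)
crawled = epoch_to_utc(row["crawl_time"])
first_indexed = epoch_to_utc(row["first_indexed_time"])
published = epoch_to_utc(row["attrs_publish_time"])
# Word counts arrive as JSON strings in this response, so convert explicitly.
word_count = int(row["attrs_nr_words"])

print(crawled, first_indexed, published, word_count)
```

Under the UTC assumption, the decoded values line up with the Last Crawled, First Indexed, and Publish Time rows in the Page Details table.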

3. Robots.txt Check

Query:

Response:

4. Spam/Ban Check

Query:

Response:

5. Seen Status Check

ℹ️ Skipped - page is already crawled

📄 INDEXABLE

✅ CRAWLED (15 days ago)

🤖 ROBOTS ALLOWED

Page Info Filters

Filter	Status	Condition	Details
HTTP status	PASS	`download_http_code = 200`	HTTP 200
Age cutoff	PASS	`download_stamp > now() - 6 MONTH`	0.5 months ago
History drop	PASS	`isNull(history_drop_reason)`	No drop reason
Spam/ban	PASS	`fh_dont_index != 1 AND ml_spam_score = 0`	ml_spam_score=0
Canonical	PASS	`meta_canonical IS NULL OR = '' OR = src_unparsed`	Not set
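
The five filters in the table can be read as one predicate over a `page_info` row. The sketch below restates them in Python under several assumptions: `fh_dont_index` and `ml_spam_score` do not appear in the response shown earlier, so the sample values here come from the Details column (`ml_spam_score=0`) or are assumed (`fh_dont_index=0`); six months is approximated as 183 days; and `page_info_filters` is a hypothetical name, not the inspector's own code.

```python
from datetime import datetime, timedelta, timezone

SIX_MONTHS = timedelta(days=183)  # assumption: approximates ClickHouse's 6 MONTH


def page_info_filters(row, now):
    """Return {filter name: passed?} mirroring the conditions in the table."""
    crawled_at = datetime.fromtimestamp(row["crawl_time"], tz=timezone.utc)
    canonical = row.get("meta_canonical")
    return {
        "HTTP status": row["http_code"] == 200,
        "Age cutoff": crawled_at > now - SIX_MONTHS,
        "History drop": row.get("history_drop_reason") is None,
        # SQL NULL semantics differ; this sketch treats a missing flag as not set.
        "Spam/ban": row.get("fh_dont_index") != 1 and row.get("ml_spam_score") == 0,
        "Canonical": canonical in (None, "") or canonical == row["src_unparsed"],
    }


sample_row = {
    "crawl_time": 1775510873,
    "http_code": 200,
    "history_drop_reason": None,
    "meta_canonical": None,
    "src_unparsed": "com,oreilly!conferences,/strata/strata-ca-2019/public/schedule/detail/72931.html s443",
    "fh_dont_index": 0,  # assumed value, not present in the response
    "ml_spam_score": 0,  # taken from the Details column
}
# Evaluate as of 15 days after the crawl, matching the "15 days ago" badge.
now = datetime.fromtimestamp(1775510873, tz=timezone.utc) + timedelta(days=15)
results = page_info_filters(sample_row, now)
indexable = all(results.values())
print(results, indexable)
```

A page counts as indexable in this sketch only when every filter passes, which is how the PASS column and the INDEXABLE badge appear to relate.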

Page Details

Property	Value
URL	https://conferences.oreilly.com/strata/strata-ca-2019/public/schedule/detail/72931.html
Last Crawled	2026-04-06 21:27:53 (15 days ago)
First Indexed	2020-08-21 14:40:43 (5 years ago)
HTTP Status Code	200
Content
Meta Title	NLP from scratch: Solving the cold start problem for natural language processing: Big data conference & machine learning training | Strata Data
Meta Description	How do you train a machine learning model with no training data? Michael Johnson and Norris Heintzelman share their journey implementing multiple solutions to bootstrapping training data in the NLP domain, covering topics including weak supervision, building an active learning framework, and annotation adjudication for named-entity recognition.
Meta Canonical	null
Boilerpipe Text	Description Unstructured data in the form of documents, web pages, and social media interactions is an ever-growing, ever-more valuable data source for addressing present business problems, from exploring brand sentiment to identifying sensitive information in internal documents. Unfortunately, the classification and annotation algorithms behind solving these problems often require significant amounts of labeled training data to produce desired accuracy. Michael Johnson and Norris Heintzelman share several techniques they’ve implemented to build classification and NER models from scratch. They lead a tour through this space as it applies to NLP and demonstrate their approach and architecture for the following techniques: Weak supervision for news documents: Using rules base classification alongside deep learning system for text classification Active learning and human in the loop: Explaining how breakthroughs in transfer learning for NLP have impacted their active learning framework for building an LSTM -based relevance model Creative training sets: Identifying and cleaning already-labeled datasets, training classifier on “only” positive examples NER adjudication: Combining knowledge from several annotation sources that leverages the strengths of each source For each of these topics, Michael and Norris outline the theoretical foundation, the implementation architecture, and tools used and discuss the problems they encountered—so you can avoid making the same mistakes.
Markdown	☰ - [Schedule](https://conferences.oreilly.com/strata/strata-ca-2019/public/schedule/full.html) - [Training](https://conferences.oreilly.com/strata/strata-ca-2019/public/schedule/stype/1334.html) - [Speakers](https://conferences.oreilly.com/strata/strata-ca-2019/public/schedule/speakers.html) - [Sponsors](https://conferences.oreilly.com/strata/strata-ca-2019/public/content/sponsors.html) - [Events](https://conferences.oreilly.com/strata/strata-ca-2019/public/schedule/stype/1335.html) - [About](https://conferences.oreilly.com/strata/strata-ca-2019/public/content/about.html) - [Account](https://conferences.oreilly.com/strata/strata-ca-2019/user/account.html) San Francisco[London](https://conferences.oreilly.com/strata/strata-eu)[New York](https://conferences.oreilly.com/strata/strata-ny) [![Strata Data Conference](https://conferences.oreilly.com/strata/strata-ca-2019/cdn.oreillystatic.com/en/assets/1/eventseries/23/strataconf_svg_logo_rev.svg)](https://conferences.oreilly.com/strata/strata-ca-2019.html) Presented By O’Reilly + Cloudera Make Data Work March 25-28, 2019 San Francisco, CA Please log in [![Add to your personal schedule](https://conferences.oreilly.com/images/personal-schedule-icon.png)Add to Your Schedule](https://conferences.oreilly.com/strata/strata-ca-2019/public/schedule/add/72931%3Ftype=Proposal.html "Add to your personal schedule") # NLP from scratch: Solving the cold start problem for natural language processing [Michael Johnson](https://conferences.oreilly.com/strata/strata-ca-2019/public/schedule/speaker/308246.html) (Lockheed Martin), [Norris Heintzelman](https://conferences.oreilly.com/strata/strata-ca-2019/public/schedule/speaker/332388.html) (Lockheed Martin) 11:50am–12:30pm Wednesday, March 27, 2019 [Data Science, Machine Learning & AI](https://conferences.oreilly.com/strata/strata-ca-2019/public/schedule/topic/2843.html) Location: 2010 Secondary topics: [Text and Language processing and 
analysis](https://conferences.oreilly.com/strata/strata-ca-2019/public/schedule/stopic/2990.html) Average rating: ![\](https://conferences.oreilly.com/images/rating-on.gif)![\](https://conferences.oreilly.com/images/rating-on.gif)![\](https://conferences.oreilly.com/images/rating-on.gif)![\](https://conferences.oreilly.com/images/rating-on.gif)![.](https://conferences.oreilly.com/images/rating-off.gif) (4.60, 15 ratings) ![](https://conferences.oreilly.com/strata/strata-ca-2019/cdn.oreillystatic.com/en/assets/1/eventprovider/1/test_slide_icon.jpg) [Download slides (PPTX)](https://conferences.oreilly.com/strata/strata-ca-2019/cdn.oreillystatic.com/en/assets/1/event/290/NLP%20from%20scratch_%20Solving%20the%20cold%20start%20problem%20for%20natural%20language%20processing%20Presentation.pptx) ## Who is this presentation for? - Machine learning practitioners, data scientists, and business analysts ## Level Intermediate ## Prerequisite knowledge - Familiarity with machine learning and NLP (useful but not required) ## What you'll learn - Explore cutting-edge techniques for building machine learning models from scratch in the NLP domain - Understand how to frame a business problem as an NLP problem - Learn strategies for getting a solution off the ground - Discover mistakes made in the process and how to avoid them ## Description Unstructured data in the form of documents, web pages, and social media interactions is an ever-growing, ever-more valuable data source for addressing present business problems, from exploring brand sentiment to identifying sensitive information in internal documents. Unfortunately, the classification and annotation algorithms behind solving these problems often require significant amounts of labeled training data to produce desired accuracy. Michael Johnson and Norris Heintzelman share several techniques they’ve implemented to build classification and NER models from scratch. 
They lead a tour through this space as it applies to NLP and demonstrate their approach and architecture for the following techniques: - Weak supervision for news documents: Using rules base classification alongside deep learning system for text classification - Active learning and human in the loop: Explaining how breakthroughs in transfer learning for NLP have impacted their active learning framework for building an LSTM\-based relevance model - Creative training sets: Identifying and cleaning already-labeled datasets, training classifier on “only” positive examples - NER adjudication: Combining knowledge from several annotation sources that leverages the strengths of each source For each of these topics, Michael and Norris outline the theoretical foundation, the implementation architecture, and tools used and discuss the problems they encountered—so you can avoid making the same mistakes. ![Photo of Michael Johnson](https://conferences.oreilly.com/strata/strata-ca-2019/cdn.oreillystatic.com/en/assets/1/eventprovider/1/_@user_308246.jpg) ## Michael Johnson #### Lockheed Martin Michael Johnson is a senior data scientist at Lockheed Martin. He has done data science and analytics in fields including manufacturing optimization, semiconductor reliability, and human resources-focused time series forecasting and simulation. He has recently been focused on how to apply cutting-edge deep learning algorithms to NLP domains. ![Photo of Norris Heintzelman](https://conferences.oreilly.com/strata/strata-ca-2019/cdn.oreillystatic.com/en/assets/1/eventprovider/1/_@user_332388.jpg) ## Norris Heintzelman #### Lockheed Martin Norris Heintzelman is a senior research and data scientist with 19 years’ real-world experience converting data into knowledge—that is, 19 years’ experience in many areas of natural language processing, knowledge systems, cleaning and normalizing messy data, and rigorous accuracy measurement. 
Norris has published several papers in the fields of health informatics and general knowledge management. She has worked for Lockheed Martin for a very long time, in multiple business areas, from public sector contracts to advanced R\&D to internal business process support. An alumna of both Temple University and the University of Pennsylvania, she lives in Wilmington, Delaware, with her husband, two daughters, and two cats. She likes to eat and talk about food. - [Website](https://www.lockheedmartin.com/) Presented by - [![Cloudera](https://cdn.oreillystatic.com/conferences/images/logos/cloudera-1150.png)](http://www.cloudera.com/) - [![O'Reilly](https://cdn.oreillystatic.com/conferences/images/logos/oreilly_logo_box150.png)](http://www.oreilly.com/) Strategic Sponsors - [![Dataiku](https://cdn.oreillystatic.com/conferences/images/logos/dataiku-2.png)](http://www.dataiku.com/) - [![Google Cloud](https://cdn.oreillystatic.com/conferences/images/logos/google-cloud150.png)](<http://cloud.google.com >) - [![IBM](https://cdn.oreillystatic.com/conferences/images/logos/ibm150.gif)](http://www.ibm.com/) Zettabyte Sponsor - [![Oracle Cloud Infrastructure](https://cdn.oreillystatic.com/conferences/images/logos/oracle-cloud.png)](https://cloud.oracle.com/home) Contributing Sponsors - [![Kyligence](https://cdn.oreillystatic.com/conferences/images/logos/kyligence-3.png)](http://www.kyligence.io/) - [![MemSQL](https://cdn.oreillystatic.com/conferences/images/logos/memsql-4.png)](http://www.memsql.com/) - [![MinIO](https://cdn.oreillystatic.com/conferences/images/logos/minio-4.png)](http://min.io/) Exabyte Sponsors - [![Amazon Web Services](https://cdn.oreillystatic.com/conferences/images/logos/amazon-web-services.png)](http://aws.amazon.com/big-data%20) - [![erwin Inc](https://cdn.oreillystatic.com/conferences/images/logos/erwin.png)](http://www.erwin.com/) - [![SAS](https://cdn.oreillystatic.com/conferences/images/logos/sas_3.png)](http://www.sas.com/) - 
[![Talend](https://cdn.oreillystatic.com/conferences/images/logos/talend-4.png)](http://www.talend.com/) - [![ThoughtTrace](https://cdn.oreillystatic.com/conferences/images/logos/thoughttrace-1.png)](http://www.thoughttrace.com/) Impact Sponsors - [![Impetus](https://cdn.oreillystatic.com/conferences/images/logos/impetus-4.png)](http://bigdata.impetus.com/) - [![Kyvos Insights](https://cdn.oreillystatic.com/conferences/images/logos/kyvos-2.png)](http://www.kyvosinsights.com/) - [![Robin Systems](https://cdn.oreillystatic.com/conferences/images/logos/robin.png)](https://robin.io/) - [![Striim](https://cdn.oreillystatic.com/conferences/images/logos/striim-2.png)](http://www.striim.com/) - [![Syncsort](https://cdn.oreillystatic.com/conferences/images/logos/syncsort-3.png)](http://www.syncsort.com/) Supporting Sponsor - [![Deloitte Consulting ](https://cdn.oreillystatic.com/conferences/images/logos/deloitte-1.png)](https://www2.deloitte.com/) ### Sponsorship Opportunities For exhibition and sponsorship opportunities, email [strataconf@oreilly.com](mailto:strataconf@oreilly.com) ### Partner Opportunities For information on trade opportunities with O'Reilly conferences, email [partners@oreilly.com](mailto:partners@oreilly.com) ### Contact Us View a complete list of [Strata Data Conference contacts](https://conferences.oreilly.com/strata-ca-2019/public/content/contact) [![Kyligence](https://conferences.oreilly.com/strata/strata-ca-2019/cdn.oreillystatic.com/oreilly/images/transparent-1px.png)](http://kyligence.io/) [![MemSQL](https://conferences.oreilly.com/strata/strata-ca-2019/cdn.oreillystatic.com/oreilly/images/transparent-1px.png)](https://www.memsql.com/?utm_source=strata&utm_medium=pp&utm_campaign=2019-strata-data-sf-banner) [![SAS](https://conferences.oreilly.com/strata/strata-ca-2019/cdn.oreillystatic.com/oreilly/images/transparent-1px.png)](http://www.sas.com/platform) 
[![Impetus](https://conferences.oreilly.com/strata/strata-ca-2019/cdn.oreillystatic.com/oreilly/images/transparent-1px.png)](http://bigdata.impetus.com/) [![Syncsort](https://conferences.oreilly.com/strata/strata-ca-2019/cdn.oreillystatic.com/oreilly/images/transparent-1px.png)](https://www.syncsort.com/) [![IBM](https://conferences.oreilly.com/strata/strata-ca-2019/cdn.oreillystatic.com/oreilly/images/transparent-1px.png)](http://www.ibm.com/) [![Dataiku](https://conferences.oreilly.com/strata/strata-ca-2019/cdn.oreillystatic.com/oreilly/images/transparent-1px.png)](http://www.dataiku.com/) [![MinIO](https://conferences.oreilly.com/strata/strata-ca-2019/cdn.oreillystatic.com/oreilly/images/transparent-1px.png)](https://min.io/) [![ThoughtTrace](https://conferences.oreilly.com/strata/strata-ca-2019/cdn.oreillystatic.com/oreilly/images/transparent-1px.png)](https://www.thoughttrace.com/) [![Oracle Cloud Infrastructure](https://conferences.oreilly.com/strata/strata-ca-2019/cdn.oreillystatic.com/oreilly/images/transparent-1px.png)](https://cloud.oracle.com/try/big-data) [![Google Cloud](https://conferences.oreilly.com/strata/strata-ca-2019/cdn.oreillystatic.com/oreilly/images/transparent-1px.png)](http://cloud.google.com/) - Information - [About](https://conferences.oreilly.com/strata-ca-2019/public/content/about) - [Resources](https://conferences.oreilly.com/strata/strata-ca-2019/public/content/resources.html) - [Diversity](http://oreilly.com/conferences/diversity.html) - [Code of Conduct](http://oreilly.com/conferences/code-of-conduct.html) - [Privacy Policy](http://oreilly.com/oreilly/privacy.html) - [Contact Us](https://conferences.oreilly.com/strata/strata-ca-2019/public/content/contact.html) - More O'Reilly Events - [Artificial Intelligence](https://conferences.oreilly.com/artificial-intelligence) - [Open Source](https://conferences.oreilly.com/oscon) - [Software Architecture](https://conferences.oreilly.com/software-architecture) - [TensorFlow 
World](https://conferences.oreilly.com/tensorflow) - [Velocity](https://conferences.oreilly.com/velocity/) - More O'Reilly Sites - [O'Reilly online learning](https://www.oreilly.com/online-learning/individuals.html) - [O'Reilly Conferences](https://conferences.oreilly.com/) - [oreilly.com](http://oreilly.com/) - [O'Reilly Video Training](https://www.oreilly.com/search/?query=*&formats=video) - [![Twitter](https://conferences.oreilly.com/strata/strata-ca-2019/cdn.oreillystatic.com/en/assets/1/event/127/solid2015_social_twitter.svg)Twitter](https://twitter.com/strataconf) - [![Facebook](https://conferences.oreilly.com/strata/strata-ca-2019/cdn.oreillystatic.com/en/assets/1/event/127/solid2015_social_facebook.svg)Facebook](https://www.facebook.com/OReilly) - [![LinkedIn](https://conferences.oreilly.com/strata/strata-ca-2019/cdn.oreillystatic.com/en/assets/1/event/127/solid2015_social_linkedin.svg)LinkedIn](https://www.linkedin.com/company/8459) - [![YouTube](https://conferences.oreilly.com/strata/strata-ca-2019/cdn.oreillystatic.com/en/assets/1/event/127/solid2015_social_youtube.svg)YouTube](https://www.youtube.com/user/OreillyMedia) ©2019, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • [confreg@oreilly.com](mailto:confreg@oreilly.com)
Readable Markdown	null
ML Classification
ML Categories	null
ML Page Types	null
ML Intent Types	null
Content Metadata
Language	en
Author	Michael Johnson
Publish Time	not set
Original Publish Time	2020-08-21 14:40:43 (5 years ago)
Republished	No
Word Count (Total)	675
Word Count (Content)	203
Links
External Links	39
Internal Links	36
Technical SEO
Meta Nofollow	No
Meta Noarchive	No
JS Rendered	No
Redirect Target	null
Performance
Download Time (ms)	92
TTFB (ms)	91
Download Size (bytes)	16,121
Shard	115 (laksa)
Root Hash	3309313061461398115
Unparsed URL	com,oreilly!conferences,/strata/strata-ca-2019/public/schedule/detail/72931.html s443
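
The Unparsed URL value can be mapped back to the URL row by hand. The sketch below does so in Python, but the layout it relies on is inferred from this single example only and is not a documented format: root-domain labels reversed and comma-separated, `!` before the subdomain, a comma before the path, and a trailing `s443` read as HTTPS on port 443. The function `url_from_unparsed` is hypothetical and is not the `getAhrefsURLFromUnparsed` function used in the query.

```python
def url_from_unparsed(unparsed):
    """Rebuild a URL from the 'unparsed' form, as inferred from one example."""
    # The last space-separated token is the scheme/port marker (e.g. "s443").
    location, marker = unparsed.rsplit(" ", 1)
    # Everything before the first "/" is the host part; the rest is the path.
    host_part, _, path = location.partition("/")
    host_part = host_part.rstrip(",")
    root, _, sub = host_part.partition("!")
    # Assumption: root labels are stored reversed ("com,oreilly" -> "oreilly.com").
    root_labels = root.split(",")[::-1]
    # Assumption: subdomain labels, if present, are stored reversed as well.
    sub_labels = sub.split(",")[::-1] if sub else []
    host = ".".join(sub_labels + root_labels)
    # Assumption: a marker starting with "s" means HTTPS; anything else is HTTP.
    scheme = "https" if marker.startswith("s") else "http"
    return f"{scheme}://{host}/{path}"


unparsed = "com,oreilly!conferences,/strata/strata-ca-2019/public/schedule/detail/72931.html s443"
rebuilt = url_from_unparsed(unparsed)
print(rebuilt)
```

Hosts with several subdomain labels, non-default ports, or query strings are not covered by the example and may not follow these assumptions.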