ℹ️ Skipped - page is already crawled
| Filter | Status | Condition | Details |
|---|---|---|---|
| HTTP status | PASS | download_http_code = 200 | HTTP 200 |
| Age cutoff | PASS | download_stamp > now() - 6 MONTH | 0.5 months ago |
| History drop | PASS | isNull(history_drop_reason) | No drop reason |
| Spam/ban | PASS | fh_dont_index != 1 AND ml_spam_score = 0 | ml_spam_score=0 |
| Canonical | PASS | meta_canonical IS NULL OR = '' OR = src_unparsed | Not set |
| Property | Value |
|---|---|
| URL | https://conferences.oreilly.com/strata/strata-ca-2019/public/schedule/detail/72931.html |
| Last Crawled | 2026-04-06 21:27:53 (15 days ago) |
| First Indexed | 2020-08-21 14:40:43 (5 years ago) |
| HTTP Status Code | 200 |
| Content | |
| Meta Title | NLP from scratch: Solving the cold start problem for natural language processing: Big data conference & machine learning training \| Strata Data |
| Meta Description | How do you train a machine learning model with no training data? Michael Johnson and Norris Heintzelman share their journey implementing multiple solutions to bootstrapping training data in the NLP domain, covering topics including weak supervision, building an active learning framework, and annotation adjudication for named-entity recognition. |
| Meta Canonical | null |
| Boilerpipe Text | Description: Unstructured data in the form of documents, web pages, and social media interactions is an ever-growing, ever-more valuable data source for addressing present business problems, from exploring brand sentiment to identifying sensitive information in internal documents. Unfortunately, the classification and annotation algorithms behind solving these problems often require significant amounts of labeled training data to produce desired accuracy. Michael Johnson and Norris Heintzelman share several techniques they’ve implemented to build classification and NER models from scratch. They lead a tour through this space as it applies to NLP and demonstrate their approach and architecture for the following techniques: Weak supervision for news documents: Using rule-based classification alongside a deep learning system for text classification; Active learning and human in the loop: Explaining how breakthroughs in transfer learning for NLP have impacted their active learning framework for building an LSTM-based relevance model; Creative training sets: Identifying and cleaning already-labeled datasets, training a classifier on “only” positive examples; NER adjudication: Combining knowledge from several annotation sources in a way that leverages the strengths of each. For each of these topics, Michael and Norris outline the theoretical foundation, the implementation architecture, and the tools used, and discuss the problems they encountered so you can avoid making the same mistakes. |
| Markdown |
Presented By
O’Reilly + Cloudera
Make Data Work
March 25-28, 2019
San Francisco, CA
# NLP from scratch: Solving the cold start problem for natural language processing
[Michael Johnson](https://conferences.oreilly.com/strata/strata-ca-2019/public/schedule/speaker/308246.html) (Lockheed Martin), [Norris Heintzelman](https://conferences.oreilly.com/strata/strata-ca-2019/public/schedule/speaker/332388.html) (Lockheed Martin)
11:50am–12:30pm Wednesday, March 27, 2019
[Data Science, Machine Learning & AI](https://conferences.oreilly.com/strata/strata-ca-2019/public/schedule/topic/2843.html)
Location: 2010
Secondary topics: [Text and Language processing and analysis](https://conferences.oreilly.com/strata/strata-ca-2019/public/schedule/stopic/2990.html)
Average rating: 4.60 (15 ratings)
[Download slides (PPTX)](https://conferences.oreilly.com/strata/strata-ca-2019/cdn.oreillystatic.com/en/assets/1/event/290/NLP%20from%20scratch_%20Solving%20the%20cold%20start%20problem%20for%20natural%20language%20processing%20Presentation.pptx)
## Who is this presentation for?
- Machine learning practitioners, data scientists, and business analysts
## Level
Intermediate
## Prerequisite knowledge
- Familiarity with machine learning and NLP (useful but not required)
## What you'll learn
- Explore cutting-edge techniques for building machine learning models from scratch in the NLP domain
- Understand how to frame a business problem as an NLP problem
- Learn strategies for getting a solution off the ground
- Discover mistakes made in the process and how to avoid them
## Description
Unstructured data in the form of documents, web pages, and social media interactions is an ever-growing, ever-more valuable data source for addressing present business problems, from exploring brand sentiment to identifying sensitive information in internal documents. Unfortunately, the classification and annotation algorithms behind solving these problems often require significant amounts of labeled training data to produce desired accuracy.
Michael Johnson and Norris Heintzelman share several techniques they’ve implemented to build classification and NER models from scratch. They lead a tour through this space as it applies to NLP and demonstrate their approach and architecture for the following techniques:
- Weak supervision for news documents: Using rule-based classification alongside a deep learning system for text classification
- Active learning and human in the loop: Explaining how breakthroughs in transfer learning for NLP have impacted their active learning framework for building an LSTM-based relevance model
- Creative training sets: Identifying and cleaning already-labeled datasets, training a classifier on “only” positive examples
- NER adjudication: Combining knowledge from several annotation sources in a way that leverages the strengths of each
For each of these topics, Michael and Norris outline the theoretical foundation, the implementation architecture, and the tools used, and discuss the problems they encountered so you can avoid making the same mistakes.
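
As a rough illustration of the weak supervision idea named in the list above, the sketch below labels unlabeled documents with a few keyword rules and a majority vote. It is not the speakers' system: the finance-vs-other task, the rule names, and the keywords are all hypothetical, and a real pipeline would likely use the resulting noisy labels to train a downstream classifier.

```python
from collections import Counter

ABSTAIN = None  # a rule returns None when it has no opinion on a document

# Hypothetical keyword rules for a made-up task:
# label 1 = "finance news", label 0 = "not finance news".
def rule_mentions_markets(text):
    lowered = text.lower()
    return 1 if "stock" in lowered or "shares" in lowered else ABSTAIN

def rule_mentions_sports(text):
    lowered = text.lower()
    return 0 if "match" in lowered or "league" in lowered else ABSTAIN

def rule_mentions_earnings(text):
    return 1 if "earnings" in text.lower() else ABSTAIN

RULES = [rule_mentions_markets, rule_mentions_sports, rule_mentions_earnings]

def weak_label(text, rules=RULES):
    """Majority vote over the rules that fire; None if none fire or the vote ties."""
    votes = [vote for vote in (rule(text) for rule in rules) if vote is not ABSTAIN]
    if not votes:
        return None
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # conflicting rules with equal support: leave unlabeled
    return ranked[0][0]

def build_weak_training_set(documents, rules=RULES):
    """Keep only the documents that received a weak label."""
    labeled = []
    for doc in documents:
        label = weak_label(doc, rules)
        if label is not None:
            labeled.append((doc, label))
    return labeled
```

Documents on which the rules abstain or disagree are left unlabeled here; those are natural candidates for the human-in-the-loop review that the active learning item describes.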

## Michael Johnson
#### Lockheed Martin
Michael Johnson is a senior data scientist at Lockheed Martin. He has done data science and analytics in fields including manufacturing optimization, semiconductor reliability, and human resources-focused time series forecasting and simulation. He has recently been focused on how to apply cutting-edge deep learning algorithms to NLP domains.

## Norris Heintzelman
#### Lockheed Martin
Norris Heintzelman is a senior research and data scientist with 19 years’ real-world experience converting data into knowledge; that is, 19 years’ experience in many areas of natural language processing, knowledge systems, cleaning and normalizing messy data, and rigorous accuracy measurement. Norris has published several papers in the fields of health informatics and general knowledge management. She has worked for Lockheed Martin for a very long time, in multiple business areas, from public sector contracts to advanced R&D to internal business process support. An alumna of both Temple University and the University of Pennsylvania, she lives in Wilmington, Delaware, with her husband, two daughters, and two cats. She likes to eat and talk about food.
- [Website](https://www.lockheedmartin.com/)
©2019, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • [confreg@oreilly.com](mailto:confreg@oreilly.com) |
| Readable Markdown | null |
| ML Classification | |
| ML Categories | null |
| ML Page Types | null |
| ML Intent Types | null |
| Content Metadata | |
| Language | en |
| Author | Michael Johnson |
| Publish Time | not set |
| Original Publish Time | 2020-08-21 14:40:43 (5 years ago) |
| Republished | No |
| Word Count (Total) | 675 |
| Word Count (Content) | 203 |
| Links | |
| External Links | 39 |
| Internal Links | 36 |
| Technical SEO | |
| Meta Nofollow | No |
| Meta Noarchive | No |
| JS Rendered | No |
| Redirect Target | null |
| Performance | |
| Download Time (ms) | 92 |
| TTFB (ms) | 91 |
| Download Size (bytes) | 16,121 |
| Shard | 115 (laksa) |
| Root Hash | 3309313061461398115 |
| Unparsed URL | com,oreilly!conferences,/strata/strata-ca-2019/public/schedule/detail/72931.html s443 |