Spark NLP Training-Public-Oct 2020

This document provides an agenda for a two-day Spark NLP training for data scientists. Day 1 will cover introductions, NLP basics, pretrained pipelines and models in Spark NLP, and text preprocessing with Spark NLP notebooks. Day 2 will cover named entity recognition, both part 1 and 2, and text classification with Spark NLP. The document also provides setup instructions and links for the training.

Spark NLP

for Data Scientists

October 13-14, 2020

Veysel Kocaman
Lead Data Scientist
Welcome
Day-1
60 min - Intro to John Snow Labs and Spark NLP
       - Intro to NLP and Spark NLP Basics (Colab)
10 min Break
50 min - Text preprocessing with Spark NLP Notebook
10 min Break
50 min - Pretrained Models in Spark NLP
Day-2
50 min - Keyword Extraction (YAKE)
       - Named Entity Recognition (NER) (part-1)
10 min Break
60 min - Named Entity Recognition (NER) (part-2)
10 min Break
50 min - Text Classification with Spark NLP
Setup
RUNNING CODE:
https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public
[How to set up Google Colab]

BOOKMARK:
https://nlp.johnsnowlabs.com/models
https://nlp.johnsnowlabs.com/docs/en/quickstart
spark-nlp.slack.com
Part - I
❖ Introducing JSL and Spark NLP
❖ Natural Language Processing (NLP) Basics
❖ Spark NLP Pretrained Pipelines
❖ Text Preprocessing with Spark NLP
❖ Spark NLP Pretrained Models
Introducing Spark NLP
Daily ~10K, Monthly ~250K
● Spark NLP is an open-source natural language processing library, built on top of Apache Spark and Spark ML (initial release: Oct 2017).
○ A single unified solution for all your NLP needs
○ Takes advantage of transfer learning and implements the latest SOTA algorithms and models in NLP research
○ The most widely used NLP library in industry (3 yrs in a row)
○ Delivers a mission-critical, enterprise-grade NLP library (used by multiple Fortune 500 companies)
○ Full-time development team (26 new releases in 2018, 30 new releases in 2019)

https://medium.com/spark-nlp/introduction-to-spark-nlp-foundations-and-basic-components-part-i-c83b7629ed59
Spark NLP
● 73 total
● two weeks
3 years
● unified NLP/NLU
● community Slack GitHub

NLP Feature            Spark NLP  spaCy  NLTK  CoreNLP  Hugging Face
Tokenization           Yes        Yes    Yes   Yes      Yes
Sentence segmentation  Yes        Yes    Yes   Yes      No
Stemming               Yes        Yes    Yes   Yes      No
Lemmatization          Yes        Yes    Yes   Yes      No
POS tagging            Yes        Yes    Yes   Yes      No
Entity recognition     Yes        Yes    Yes   Yes      Yes
Dep parser             Yes        Yes    Yes   Yes      No
Text matcher           Yes        Yes    No    No       No
Date matcher           Yes        No     No    No       No
Sentiment detector     Yes        No     Yes   Yes      Yes
Text classification    Yes        Yes    Yes   No       Yes
Spell checker          Yes        No     No    No       No
Language detector      Yes        No     No    No       No
Keyword extraction     Yes        No     No    No       No
Pretrained models      Yes        Yes    Yes   Yes      Yes
Trainable models       Yes        Yes    Yes   Yes      Yes
Spark NLP in Industry
Which of the following AI tools do you use? Which NLP libraries does your organization use?

NLP Industry Survey by Gradient Flow, an independent data science research & insights company, September 2020
Spark NLP: Apache License 2.0
● BERT
● ELMO
● ALBERT
● XLNet
● Universal Sentence Encoder
● BERT
● Keyword extraction
● Language Detection
● Multi-class
● Multi-label
● Sentiment Analysis
● Named entity recognition
● Part-of-speech
● Dependency
● Sentiment
● Spell Checker
● Embeddings
● TensorFlow
● +250 pre-trained models
● +90 pre-trained pipelines
Spark NLP Modules (Enterprise and Public)
Introducing Spark NLP
Pipeline of annotators
Introducing Spark NLP
Spark is like a locomotive racing a
bicycle. The bike will win if the load
is light, it is quicker to accelerate
and more agile, but with a heavy
load the locomotive might take a
while to get up to speed, but it’s
going to be faster in the end.
Faster inference
LightPipelines are Spark ML pipelines converted into a single-machine,
multithreaded task, becoming more than 10x faster for smaller
amounts of data (small is relative, but 50k sentences is roughly a
good maximum).
NLP Basics

● The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related
forms of a word to a common base form for normalization purposes.

● Lemmatization always returns real words, stemming doesn’t.
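
The difference between the two can be sketched in a few lines of plain Python. This is only a toy illustration: the suffix list and the lemma table below are hypothetical stand-ins, not the rules or dictionaries used by Spark NLP's annotators.

```python
# Toy illustration only: real stemmers (e.g. Porter) and lemmatizers use
# far richer rules and dictionaries than the hypothetical ones below.

SUFFIXES = ("ies", "ing", "ed", "es", "s")  # assumed, deliberately crude

# Hypothetical lookup table mapping inflected forms to dictionary forms.
LEMMA_TABLE = {
    "studies": "study",
    "studying": "study",
    "better": "good",
    "was": "be",
}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix; the result need not be a real word."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        # Only strip when a stem of at least three characters remains.
        if lowered.endswith(suffix) and len(lowered) > len(suffix) + 2:
            return lowered[: -len(suffix)]
    return lowered


def toy_lemmatize(word: str) -> str:
    """Look the word up in the table; fall back to the lowercased word."""
    lowered = word.lower()
    return LEMMA_TABLE.get(lowered, lowered)


for w in ["studies", "studying", "better", "was"]:
    print(w, "-> stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

Here "studies" stems to the non-word "stud" but lemmatizes to "study", which is the contrast the second bullet describes.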


NLP Basics

● For tasks like text classification, where the text is to be classified into different
categories, stopwords are removed or excluded from the given text so that
more focus can be given to those words which define the meaning of the text.
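
As a minimal sketch of stopword removal, assuming whitespace tokenization and a tiny hand-picked stopword set (both are simplifications chosen for illustration, not the list shipped with any library):

```python
from typing import List

# Assumed, deliberately tiny stopword list for illustration.
STOPWORDS = {"the", "is", "a", "an", "of", "and", "to", "in", "this"}


def remove_stopwords(text: str) -> List[str]:
    """Lowercase, split on whitespace, and drop tokens found in STOPWORDS."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOPWORDS]


print(remove_stopwords("This is a review of the new phone"))
```

The remaining tokens ("review", "new", "phone") are the ones a classifier would typically rely on.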
Spell Checking & Correction
Context Spell Checker
N-gram Tokenization

● A kind of tokenizer that splits words or sentences into several tokens
● Each token has a certain number of characters
● The number of characters depends on the type of n-gram tokenizer
● Unigram, bigram, trigram, etc.
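
A character-level reading of the bullets above can be sketched as follows. The function name `char_ngrams` is hypothetical and the code is plain Python, not a Spark NLP API.

```python
from typing import List


def char_ngrams(word: str, n: int) -> List[str]:
    """Return every contiguous substring of length n, left to right."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # A word shorter than n produces an empty list.
    return [word[i:i + n] for i in range(len(word) - n + 1)]


print(char_ngrams("spark", 1))  # unigrams
print(char_ngrams("spark", 2))  # bigrams
print(char_ngrams("spark", 3))  # trigrams
```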
A POS tag (or part-of-speech tag) is a special label assigned to each token (word) in a text corpus to
indicate the part of speech and often also other grammatical categories such as tense, number
(plural/singular), case etc.
Word & Sentence Embeddings

● Words that are used in similar contexts will be given similar representations. That is, words that are used
in similar ways will be placed close together within the high-dimensional semantic space–these points
will cluster together, and their distance to each other will be low.
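
One common way to measure that distance is cosine distance. The sketch below uses made-up 3-dimensional vectors purely for illustration; real embeddings have hundreds of dimensions and come from a trained model.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; near 0 for vectors pointing the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical toy vectors, not taken from any real embedding model.
VECTORS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.82, 0.15],
    "banana": [0.1, 0.05, 0.9],
}

print("king-queen :", cosine_distance(VECTORS["king"], VECTORS["queen"]))
print("king-banana:", cosine_distance(VECTORS["king"], VECTORS["banana"]))
```

With these toy values the "king"/"queen" distance is much smaller than the "king"/"banana" distance, mirroring the clustering behaviour described above.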
Word & Sentence Embeddings

● Deep-Learning-based natural language processing systems.

● They encode words and sentences 📜 in fixed-length dense vectors 📐 to drastically improve the processing of textual
data.
● Based on The Distributional Hypothesis: Words that occur in the same contexts tend to have similar meanings.
Word & Sentence Embeddings

Embedding                     Dimensions
GloVe                         100, 200, 300
ELMo                          512, 1024
BERT                          768
Universal Sentence Encoder    512
ALBERT                        768, 1024, 2048, 4096
XLNet                         768, 1024
ELECTRA                       768
BERT Sentence Embeddings      768

● ELMo and BERT-family embeddings are context-aware.
Text Classification with Word & Sentence Embeddings

Embeddings: GloVe (100, 200, 300), ELMo (512, 1024), BERT (768), Universal Sentence Encoder (512), ALBERT (768, 1024, 2048, 4096), XLNet (768, 1024), ELECTRA (768), BERT Sentence Embeddings (768)

Spark NLP: ClassifierDL, SentimentDL, MultiClassifierDL
Word & Sentence Embeddings
albert_base = https://tfhub.dev/google/albert_base/3 | 768-embed-dim, 12-layer, 12-heads, 12M parameters
albert_large = https://tfhub.dev/google/albert_large/3 | 1024-embed-dim, 24-layer, 16-heads, 18M parameters
albert_xlarge = https://tfhub.dev/google/albert_xlarge/3 | 2048-embed-dim, 24-layer, 32-heads, 60M parameters
albert_xxlarge = https://tfhub.dev/google/albert_xxlarge/3 | 4096-embed-dim, 12-layer, 64-heads, 235M parameters
XLNet-Large = https://storage.googleapis.com/xlnet/released_models/cased_L-24_H-1024_A-16.zip | 24-layer, 1024-hidden, 16-heads
XLNet-Base = https://storage.googleapis.com/xlnet/released_models/cased_L-12_H-768_A-12.zip | 12-layer, 768-hidden, 12-heads

BERT is a bidirectional transformer pre-trained on a large amount of unlabeled text to learn a language representation that can then be fine-tuned for specific machine learning tasks. While BERT outperformed the NLP state of the art on several challenging tasks, its performance improvement could be attributed to the bidirectional transformer, the novel pre-training tasks of Masked Language Model and Next Sentence Prediction, along with a lot of data and Google's compute power.

XLNet is a large bidirectional transformer that uses an improved training methodology, more data, and more computational power to achieve better prediction metrics than BERT on 20 language tasks.

To improve the training, XLNet introduces permutation language modeling, where all tokens are predicted but in random order. This is in contrast to BERT's masked language model, where only the masked (15%) tokens are predicted.

ALBERT is a language model from Google that achieved state-of-the-art results on three popular benchmark tests for natural language understanding (NLU): GLUE, RACE, and SQuAD 2.0. ALBERT is a "lite" version of Google's 2018 NLU pretraining method BERT. Researchers introduced two parameter-reduction techniques in ALBERT to lower memory consumption and increase training speed.

https://nlp.johnsnowlabs.com/api/
Coding ...
1. Spark NLP Basics

2. Text Preprocessing with Spark NLP

3. Spark NLP Pretrained Models

(click on Colab icon or open in a new tab)

https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/1.SparkNLP_Basics.ipynb

https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/2.Text_Preprocessing_with_SparkNLP(Annotators_Transformers).ipynb

https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/3.SparkNLP_Pretrained_Models.ipynb
Part - II

❖ Named Entity Recognition (NER) in Spark NLP


NER-DL in Spark NLP
The best NER score in production: 93.3% on the test set (Bert, NerDLApproach)
NER-DL in Spark NLP
NER Systems
1. Classical Approaches (rule based)
2. ML Approaches
- Multi-class classification
- Conditional Random Field (CRF)
3. DL Approaches
- Bidirectional LSTM-CRF
- Bidirectional LSTM-CNNs
- Bidirectional LSTM-CNNs-CRF
- Pre-trained language models (Bert, Elmo)

4. Hybrid Approaches (DL + ML)


NER-DL in Spark NLP

Char-CNN-BiLSTM + CRF
F1: ["One", "sentence", "with", "5", "WORDS"]
F2: [titlecase, lowercase, lowercase, numeric, uppercase]
F3: [Card.Digit, Noun, Preposition, Card.Digit, Noun plural]
F4: [[O,n,e],[s,e,n,t,e,n,c,e],[w,i,t,h],[5],[W,O,R,D,S]]
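
The casing feature (F2) and the character-level feature (F4) can be derived from the tokens (F1) with standard string methods. This is a rough sketch in plain Python: the helper name `casing` and the "mixed" fallback label are assumptions, and it does not reproduce how NerDLApproach builds its inputs internally.

```python
from typing import List


def casing(token: str) -> str:
    """Map a token to a coarse casing label (labels follow the slide)."""
    if token.isdigit():
        return "numeric"
    if token.isupper():
        return "uppercase"
    if token.istitle():
        return "titlecase"
    if token.islower():
        return "lowercase"
    return "mixed"  # assumed fallback label, not shown on the slide


tokens: List[str] = ["One", "sentence", "with", "5", "WORDS"]  # F1

casing_features = [casing(t) for t in tokens]  # F2
char_features = [list(t) for t in tokens]      # F4

print(casing_features)
print(char_features)
```

F3 (part-of-speech tags) is left out because it needs a trained tagger rather than a string rule.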
NER-DL in Spark NLP

Char-CNN-BiLSTM
Clinical Named Entity Recognition

Posology NER

Anatomy NER

Clinical NER

PHI NER
NER-DL in Spark NLP
CoNLL2003 format, BIO schema

John Smith ⇒ PERSON
New York ⇒ LOCATION

* Each line contains four fields: the word, its part-of-speech tag, its chunk tag and its named entity tag.
* CoNLL: Conference on Computational Natural Language Learning
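
To make the four-field layout and the BIO tags concrete, here is a small sketch that parses a made-up sample and groups B-/I- tags into entity spans. The sample sentence, its POS and chunk tags, and the short labels PER and LOC (standing in for PERSON and LOCATION) are assumptions for illustration; the helper names are hypothetical.

```python
from typing import List, Tuple

# Made-up sample in "word POS chunk NER" layout, one token per line.
SAMPLE = """\
John NNP B-NP B-PER
Smith NNP I-NP I-PER
lives VBZ B-VP O
in IN B-PP O
New NNP B-NP B-LOC
York NNP I-NP I-LOC
"""

Row = Tuple[str, str, str, str]


def parse_conll(text: str) -> List[Row]:
    """Split each non-empty line into (word, pos, chunk, ner)."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        word, pos, chunk, ner = line.split()
        rows.append((word, pos, chunk, ner))
    return rows


def extract_entities(rows: List[Row]) -> List[Tuple[str, str]]:
    """Group B-X followed by I-X tags into (entity text, label) pairs."""
    entities = []
    words: List[str] = []
    label = None
    for word, _pos, _chunk, tag in rows:
        if tag.startswith("B-"):
            if words:
                entities.append((" ".join(words), label))
            words, label = [word], tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            words.append(word)
        else:
            # "O" (or an I- tag that does not continue the open entity)
            # closes whatever entity is currently open.
            if words:
                entities.append((" ".join(words), label))
            words, label = [], None
    if words:
        entities.append((" ".join(words), label))
    return entities


print(extract_entities(parse_conll(SAMPLE)))
```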


NER-DL in Spark NLP

Char-CNN process, e.g. on the word “HEALTH”
NER-DL in Spark NLP
Char-CNN-BiLSTM
NER-DL in Spark NLP
Classification
Coding ...

Open 4. NERDL Training notebook in Colab

(click on Colab icon or open in a new tab)

https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/4.NERDL_Training.ipynb
Part - III

❖ Text Classification with Classifier DL in Spark NLP
SentimentDL, ClassifierDL, and MultiClassifierDL




Classifier DL TensorFlow Architecture
Coding ...

Open 5. Text Classification with ClassifierDL notebook in Colab

(click on Colab icon or open in a new tab)

https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/5.Text_Classification_with_ClassifierDL.ipynb
Spark NLP Resources
Spark NLP Official page
Spark NLP Workshop Repo
JSL Youtube channel
JSL Blogs
Introduction to Spark NLP: Foundations and Basic Components (Part-I)
Introduction to Spark NLP: Installation and Getting Started (Part-II)
Named Entity Recognition with Bert in Spark NLP
Text Classification in Spark NLP with Bert and Universal Sentence Encoders
Spark NLP 101 : Document Assembler
Spark NLP 101: LightPipeline
https://www.oreilly.com/radar/one-simple-chart-who-is-interested-in-spark-nlp/
https://blog.dominodatalab.com/comparing-the-functionality-of-open-source-natural-language-processing-libraries/
https://databricks.com/blog/2017/10/19/introducing-natural-language-processing-library-apache-spark.html
https://databricks.com/fr/session/apache-spark-nlp-extending-spark-ml-to-deliver-fast-scalable-unified-natural-language-processing
https://medium.com/@saif1988/spark-nlp-walkthrough-powered-by-tensorflow-9965538663fd
https://www.kdnuggets.com/2019/06/spark-nlp-getting-started-with-worlds-most-widely-used-nlp-library-enterprise.html
https://www.forbes.com/sites/forbestechcouncil/2019/09/17/winning-in-health-care-ai-with-small-data/#1b2fc2555664
https://medium.com/hackernoon/mueller-report-for-nerds-spark-meets-nlp-with-tensorflow-and-bert-part-1-32490a8f8f12
https://www.analyticsindiamag.com/5-reasons-why-spark-nlp-is-the-most-widely-used-library-in-enterprises/
https://www.oreilly.com/ideas/comparing-production-grade-nlp-libraries-training-spark-nlp-and-spacy-pipelines
https://www.oreilly.com/ideas/comparing-production-grade-nlp-libraries-accuracy-performance-and-scalability
https://www.infoworld.com/article/3031690/analytics/why-you-should-use-spark-for-machine-learning.html
Slides:

https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/Spark%20NLP%20Training-Public-April%202020.pdf

Colab:

https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/1.SparkNLP_Basics.ipynb

https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/2.Text_Preprocessing_with_SparkNLP_Annotators_Transformers.ipynb

https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/3.SparkNLP_Pretrained_Models.ipynb

https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/4.NERDL_Training.ipynb

https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/5.Text_Classification_with_ClassifierDL.ipynb
