Spark NLP Training-Public-Oct 2020
Veysel Kocaman
Lead Data Scientist
[email protected]
Welcome

Day-1
- Intro to John Snow Labs and Spark NLP (60 min)
- Intro to NLP and Spark NLP Basics (colab)
- Break (10 min)
- Text preprocessing with Spark NLP Notebook (50 min)
- Break (10 min)

Day-2
- Named Entity Recognition (NER) (part-2) (60 min)
- Break (10 min)
BOOKMARK:
https://nlp.johnsnowlabs.com/models
https://nlp.johnsnowlabs.com/docs/en/quickstart
spark-nlp.slack.com
Part - I
❖ Introducing JSL and Spark NLP
❖ Natural Language Processing (NLP) Basics
❖ Spark NLP Pretrained Pipelines
❖ Text Preprocessing with Spark NLP
❖ Spark NLP Pretrained Models
Introducing Spark NLP
● Spark NLP is an open-source natural language processing library, built on top of Apache Spark and Spark ML (initial release: Oct 2017).
  ○ A single unified solution for all your NLP needs
● Downloads: ~10K daily, ~250K monthly
https://medium.com/spark-nlp/introduction-to-spark-nlp-foundations-and-basic-components-part-i-c83b7629ed59
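A minimal quick-start sketch in Python (assuming pyspark and spark-nlp are installed, e.g. pip install spark-nlp pyspark):

    import sparknlp
    from sparknlp.pretrained import PretrainedPipeline

    # Start a Spark session with the Spark NLP jars loaded
    spark = sparknlp.start()

    # Download a pretrained pipeline (document assembler, tokenizer, POS, NER, ...)
    pipeline = PretrainedPipeline("explain_document_dl", lang="en")

    # annotate() on a plain string returns a dict of annotator outputs
    result = pipeline.annotate("Spark NLP is built on top of Apache Spark.")
    print(result["token"])
    print(result["ner"])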
NLP Feature           | Spark NLP | spaCy | NLTK | CoreNLP | Hugging Face
Tokenization          | Yes       | Yes   | Yes  | Yes     | Yes
Sentence segmentation | Yes       | Yes   | Yes  | Yes     | No
Stemming              | Yes       | Yes   | Yes  | Yes     | No
Lemmatization         | Yes       | Yes   | Yes  | Yes     | No
POS tagging           | Yes       | Yes   | Yes  | Yes     | No

● 73 releases in total
● a new release every two weeks
● The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related
forms of a word to a common base form for normalization purposes.
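A sketch of the two approaches in Spark NLP (lemma_antbnc is one example pretrained English lemmatizer from the Models Hub):

    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import Tokenizer, Stemmer, LemmatizerModel

    documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
    tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

    # Rule-based stemmer: chops suffixes ("studies" -> "studi")
    stemmer = Stemmer().setInputCols(["token"]).setOutputCol("stem")

    # Pretrained lemmatizer: maps tokens to their dictionary form ("studies" -> "study")
    lemmatizer = LemmatizerModel.pretrained("lemma_antbnc") \
        .setInputCols(["token"]).setOutputCol("lemma")

    pipeline = Pipeline(stages=[documentAssembler, tokenizer, stemmer, lemmatizer])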
● For tasks like text classification, where the text must be assigned to one of several categories, stopwords are removed from the text so that more weight is given to the words that actually define its meaning.
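A sketch of stopword removal with Spark NLP's StopWordsCleaner, appended after the tokenizer stages above:

    from sparknlp.annotator import StopWordsCleaner

    # Drops function words ("the", "is", "of", ...) from the token stream,
    # leaving the content words that carry the meaning of the text
    stop_words_cleaner = StopWordsCleaner() \
        .setInputCols(["token"]) \
        .setOutputCol("cleanTokens") \
        .setCaseSensitive(False)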
Spell Checking & Correction
Context Spell Checker
N-gram Tokenization
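A sketch combining both annotators (spellcheck_dl is the pretrained English context spell checker on the Models Hub; NGramGenerator here produces bigrams from the corrected tokens):

    from sparknlp.annotator import ContextSpellCheckerModel, NGramGenerator

    # Neural, context-aware spell correction over the token stream
    spell_checker = ContextSpellCheckerModel.pretrained("spellcheck_dl") \
        .setInputCols(["token"]) \
        .setOutputCol("checked")

    # Build n-grams (here bigrams) from the corrected tokens
    ngrams = NGramGenerator() \
        .setInputCols(["checked"]) \
        .setOutputCol("bigrams") \
        .setN(2)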
● Words that are used in similar contexts will be given similar representations. That is, words used in similar ways will be placed close together within the high-dimensional semantic space; these points cluster together, and the distance between them is low.
Word & Sentence Embeddings
● They encode words and sentences 📜 in fixed-length dense vectors 📐 to drastically improve the processing of textual
data.
● Based on The Distributional Hypothesis: Words that occur in the same contexts tend to have similar meanings.
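For instance, a sketch loading pretrained word and sentence embeddings in Spark NLP (glove_100d and tfhub_use are example model names from the Models Hub):

    from sparknlp.annotator import WordEmbeddingsModel, UniversalSentenceEncoder

    # 100-dimensional GloVe word vectors: similar words get nearby vectors
    word_embeddings = WordEmbeddingsModel.pretrained("glove_100d") \
        .setInputCols(["document", "token"]) \
        .setOutputCol("embeddings")

    # Universal Sentence Encoder: one fixed-length vector per document
    sentence_embeddings = UniversalSentenceEncoder.pretrained("tfhub_use") \
        .setInputCols(["document"]) \
        .setOutputCol("sentence_embeddings")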
Word & Sentence Embeddings
● In Spark NLP, these embeddings (e.g. BERT) feed the trainable deep-learning classifiers: ClassifierDL, SentimentDL, and MultiClassifierDL.
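A minimal training sketch, assuming a DataFrame trainDataset with text and label columns and reusing the document assembler and sentence-encoder stages sketched above:

    from pyspark.ml import Pipeline
    from sparknlp.annotator import ClassifierDLApproach

    # Trainable multi-class text classifier on top of sentence embeddings
    classifier = ClassifierDLApproach() \
        .setInputCols(["sentence_embeddings"]) \
        .setOutputCol("class") \
        .setLabelColumn("label") \
        .setMaxEpochs(10)

    clf_pipeline = Pipeline(stages=[documentAssembler, sentence_embeddings, classifier])
    clf_model = clf_pipeline.fit(trainDataset)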
BERT is a bi-directional transformer for pre-training over a lot of unlabeled textual data to learn a language representation that can be fine-tuned for specific machine learning tasks. While BERT outperformed the NLP state-of-the-art on several challenging tasks, its performance improvement can be attributed to the bidirectional transformer, the novel pre-training tasks of Masked Language Model and Next Sentence Prediction, and a lot of data and Google's compute power.

XLNet is a large bidirectional transformer that uses an improved training methodology, larger data and more computational power to achieve better-than-BERT prediction metrics on 20 language tasks. To improve training, XLNet introduces permutation language modeling, where all tokens are predicted but in random order. This is in contrast to BERT's masked language model, where only the masked (15%) tokens are predicted.

ALBERT is a "lite" version of Google's 2018 NLU pretraining method BERT; it achieved state-of-the-art results on three popular benchmark tests for natural language understanding (NLU): GLUE, RACE, and SQuAD 2.0. Researchers introduced two parameter-reduction techniques in ALBERT to lower memory consumption and increase training speed.

Pretrained checkpoints:
albert_base = https://tfhub.dev/google/albert_base/3 | 768-embed-dim, 12-layer, 12-heads, 12M parameters
albert_large = https://tfhub.dev/google/albert_large/3 | 1024-embed-dim, 24-layer, 16-heads, 18M parameters
albert_xlarge = https://tfhub.dev/google/albert_xlarge/3 | 2048-embed-dim, 24-layer, 32-heads, 60M parameters
albert_xxlarge = https://tfhub.dev/google/albert_xxlarge/3 | 4096-embed-dim, 12-layer, 64-heads, 235M parameters
XLNet-Large = https://storage.googleapis.com/xlnet/released_models/cased_L-24_H-1024_A-16.zip | 24-layer, 1024-hidden, 16-heads
XLNet-Base = https://storage.googleapis.com/xlnet/released_models/cased_L-12_H-768_A-12.zip | 12-layer, 768-hidden, 12-heads
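Both model families are exposed as annotators in Spark NLP; a sketch (albert_base_uncased and xlnet_base_cased are example model names from the Models Hub):

    from sparknlp.annotator import AlbertEmbeddings, XlnetEmbeddings

    albert = AlbertEmbeddings.pretrained("albert_base_uncased") \
        .setInputCols(["document", "token"]) \
        .setOutputCol("albert_embeddings")

    xlnet = XlnetEmbeddings.pretrained("xlnet_base_cased") \
        .setInputCols(["document", "token"]) \
        .setOutputCol("xlnet_embeddings")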
https://nlp.johnsnowlabs.com/api/
Coding ...
1. Spark NLP Basics
https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/1.SparkNLP_Basics.ipynb
2. Text Preprocessing with Spark NLP (Annotators & Transformers)
https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/2.Text_Preprocessing_with_SparkNLP(Annotators_Transformers).ipynb
3. Spark NLP Pretrained Models
https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/3.SparkNLP_Pretrained_Models.ipynb
Spark NLP
for Data Scientists
Veysel Kocaman
Sr. Data Scientist
[email protected]
Part - II
NER-DL in Spark NLP
(BERT embeddings feeding NerDLApproach)
NER Systems
1. Classical Approaches (rule based)
2. ML Approaches
- Multi-class classification
- Conditional Random Field (CRF)
3. DL Approaches
- Bidirectional LSTM-CRF
- Bidirectional LSTM-CNNs
- Bidirectional LSTM-CNNs-CRF
- Pre-trained language models (BERT, ELMo)
Char-CNN-BiLSTM (the architecture behind Spark NLP's NerDLApproach)
Clinical Named Entity Recognition
Posology NER
Anatomy NER
Clinical NER
PHI NER
NER-DL in Spark NLP
CoNLL-2003 format, BIO tagging schema
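A minimal training sketch (assuming a local CoNLL-2003 file such as eng.train; the full notebook is linked below):

    import sparknlp
    from pyspark.ml import Pipeline
    from sparknlp.training import CoNLL
    from sparknlp.annotator import WordEmbeddingsModel, NerDLApproach

    spark = sparknlp.start()

    # Read a CoNLL-2003 file (BIO tags) into a ready-to-train DataFrame
    training_data = CoNLL().readDataset(spark, "eng.train")

    embeddings = WordEmbeddingsModel.pretrained("glove_100d") \
        .setInputCols(["sentence", "token"]) \
        .setOutputCol("embeddings")

    # Char-CNN + BiLSTM + CRF tagger trained on the labeled column
    ner_tagger = NerDLApproach() \
        .setInputCols(["sentence", "token", "embeddings"]) \
        .setLabelColumn("label") \
        .setOutputCol("ner") \
        .setMaxEpochs(10) \
        .setLr(0.001)

    ner_model = Pipeline(stages=[embeddings, ner_tagger]).fit(training_data)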
https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/4.NERDL_Training.ipynb
Part - III
https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/4.NERDL_Training.ipynb
Spark NLP Resources
Spark NLP Official page
Spark NLP Workshop Repo
JSL Youtube channel
JSL Blogs
Introduction to Spark NLP: Foundations and Basic Components (Part-I)
Introduction to Spark NLP: Installation and Getting Started (Part-II)
Named Entity Recognition with Bert in Spark NLP
Text Classification in Spark NLP with Bert and Universal Sentence Encoders
Spark NLP 101 : Document Assembler
Spark NLP 101: LightPipeline
https://www.oreilly.com/radar/one-simple-chart-who-is-interested-in-spark-nlp/
https://blog.dominodatalab.com/comparing-the-functionality-of-open-source-natural-language-processing-libraries/
https://databricks.com/blog/2017/10/19/introducing-natural-language-processing-library-apache-spark.html
https://databricks.com/fr/session/apache-spark-nlp-extending-spark-ml-to-deliver-fast-scalable-unified-natural-language-processing
https://medium.com/@saif1988/spark-nlp-walkthrough-powered-by-tensorflow-9965538663fd
https://www.kdnuggets.com/2019/06/spark-nlp-getting-started-with-worlds-most-widely-used-nlp-library-enterprise.html
https://www.forbes.com/sites/forbestechcouncil/2019/09/17/winning-in-health-care-ai-with-small-data/#1b2fc2555664
https://medium.com/hackernoon/mueller-report-for-nerds-spark-meets-nlp-with-tensorflow-and-bert-part-1-32490a8f8f12
https://www.analyticsindiamag.com/5-reasons-why-spark-nlp-is-the-most-widely-used-library-in-enterprises/
https://www.oreilly.com/ideas/comparing-production-grade-nlp-libraries-training-spark-nlp-and-spacy-pipelines
https://www.oreilly.com/ideas/comparing-production-grade-nlp-libraries-accuracy-performance-and-scalability
https://www.infoworld.com/article/3031690/analytics/why-you-should-use-spark-for-machine-learning.html
Slides:
https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/Spark%20NLP%20Training-Public-April%202020.pdf
Colab:
https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/1.SparkNLP_Basics.ipynb
https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/2.Text_Preprocessing_with_SparkNLP_Annotators_Transformers.ipynb
https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/3.SparkNLP_Pretrained_Models.ipynb
https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/4.NERDL_Training.ipynb
https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/5.Text_Classification_with_ClassifierDL.ipynb