
Spark NLP Training-Public-April 2020

Spark NLP is an open-source natural language processing library built on Apache Spark. It provides pretrained pipelines and models for tasks like text preprocessing, named entity recognition and text classification. The document discusses Spark NLP's advantages over other NLP libraries and introduces key NLP concepts. It also provides links to tutorials demonstrating Spark NLP's capabilities for named entity recognition using deep learning models and text classification using ClassifierDL.

Spark NLP

for Data Scientists

April 22, 2020

Veysel Kocaman
Sr. Data Scientist
[email protected]
Setup
RUNNING CODE:
https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public
[How to set up Google Colab]

BOOKMARK:
nlp.johnsnowlabs.com/docs/en/concepts
spark-nlp.slack.com
Part - I
❖ Introducing JSL and Spark NLP
❖ Natural Language Processing (NLP) Basics
❖ Spark NLP Pretrained Pipelines
❖ Text Preprocessing with Spark NLP
❖ Spark NLP Pretrained Models
Introducing Spark NLP
● Natural Language Toolkit (NLTK): The complete toolkit for all NLP techniques.
● TextBlob: Easy to use NLP tools API, built on top of NLTK and Pattern.
● SpaCy: Industrial strength NLP with Python and Cython.
● Gensim: Topic Modelling for Humans
● Stanford Core NLP: NLP services and packages by Stanford NLP Group.
● Fasttext: NLP library by Facebook’s AI Research (FAIR) lab
● ...

● Spark NLP is an open-source natural language processing library, built on top of Apache Spark and Spark ML. (initial release: Oct 2017)
○ A single unified solution for all your NLP needs
○ Take advantage of transfer learning and implementing the latest and greatest SOTA algorithms and models in NLP research
○ Lack of any NLP library that’s fully supported by Spark
○ Delivering a mission-critical, enterprise grade NLP library (used by multiple Fortune 500)
○ Full-time development team (26 new releases in 2018. 30 new releases in 2019.)

https://medium.com/spark-nlp/introduction-to-spark-nlp-foundations-and-basic-components-part-i-c83b7629ed59
Introducing Spark NLP

Available in Python, R, Scala and Java

"AI Adoption in the Enterprise", February 2019
Most widely used ML frameworks and tools: survey of 1,300 practitioners by O’Reilly
BUILT ON THE SHOULDERS OF SPARK ML
● Reusing the Spark ML Pipeline
○ Unified NLP & ML pipelines
○ End-to-end execution planning
○ Serializable
○ Distributable
● Reusing NLP Functionality
○ TF-IDF calculation
○ String distance calculation
○ Topic modeling
○ Distributed ML algorithms
Introducing Spark NLP
Pipeline of annotators
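
To make the "pipeline of annotators" idea concrete, here is a minimal plain-Python sketch. It is not the Spark NLP API: the stage names, the dictionary-based row and `run_pipeline` are all hypothetical, chosen only to show how each annotator reads the columns produced by earlier stages and adds one of its own.

```python
# Toy illustration of a pipeline of annotators in plain Python.
# NOT the Spark NLP API; every name below is hypothetical.
import re
from typing import Callable, Dict, List

Row = Dict[str, object]
Stage = Callable[[Row], Row]

def sentence_detector(row: Row) -> Row:
    # Split on whitespace that follows '.', '!' or '?'.
    text = str(row["text"]).strip()
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    return {**row, "sentence": sentences}

def tokenizer(row: Row) -> Row:
    # Words and punctuation marks become separate tokens, per sentence.
    tokens = [re.findall(r"\w+|[^\w\s]", s) for s in row["sentence"]]
    return {**row, "token": tokens}

def normalizer(row: Row) -> Row:
    # Lowercase every token and drop the ones that are not alphanumeric.
    cleaned = [[t.lower() for t in sent if t.isalnum()] for sent in row["token"]]
    return {**row, "normalized": cleaned}

def run_pipeline(stages: List[Stage], row: Row) -> Row:
    # Each stage adds a new column and leaves the earlier ones intact.
    for stage in stages:
        row = stage(row)
    return row

result = run_pipeline(
    [sentence_detector, tokenizer, normalizer],
    {"text": "Spark NLP runs on Spark. It is open source!"},
)
print(result["normalized"])
```

Because every stage only adds a column, the order of the stages matters: the tokenizer needs the `sentence` column, and the normalizer needs `token`.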
NLP Basics

● The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related
forms of a word to a common base form for normalization purposes.

● Lemmatization always returns real words, stemming doesn’t.
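
The difference can be sketched with a deliberately tiny example. The suffix list and the lemma dictionary below are made-up assumptions for illustration; real stemmers (such as Porter) and real lemmatizers use far richer rules and vocabularies.

```python
# Toy contrast between stemming and lemmatization.
# SUFFIXES and LEMMAS are illustrative assumptions, not a real algorithm.
SUFFIXES = ("ies", "ing", "ed", "es", "s")

LEMMAS = {"studies": "study", "running": "run", "was": "be", "better": "good"}

def stem(word: str) -> str:
    # Strip the first matching suffix, keeping at least three characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def lemmatize(word: str) -> str:
    # Dictionary lookup: the result is always a real word (or the input itself).
    return LEMMAS.get(word, word)

for w in ("studies", "running", "was", "better"):
    print(w, "stem:", stem(w), "lemma:", lemmatize(w))
```

The stemmer produces fragments such as `stud` and `runn`, while the lemmatizer maps the same inputs to the dictionary forms `study` and `run`.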


NLP Basics

● For tasks like text classification, where the text is to be classified into different
categories, stopwords are removed or excluded from the given text so that
more focus can be given to those words which define the meaning of the text.
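
As a rough sketch of stopword removal, the snippet below filters a whitespace-tokenized sentence against a small hand-written stopword set. The set is an assumption for illustration; production stopword lists are much longer and language-specific.

```python
# Minimal stopword filter; STOPWORDS is a small illustrative set.
STOPWORDS = {"the", "is", "a", "an", "of", "and", "to", "in", "it", "this"}

def remove_stopwords(text: str) -> list:
    # Lowercase, split on whitespace, and keep only non-stopword tokens.
    return [token for token in text.lower().split() if token not in STOPWORDS]

print(remove_stopwords("This movie is a masterpiece of suspense"))
```

Only `movie`, `masterpiece` and `suspense` survive, which are the words a classifier would most plausibly rely on.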
Spell Checking & Correction
A POS tag (or part-of-speech tag) is a special label assigned to each token (word) in a text corpus to
indicate the part of speech and often also other grammatical categories such as tense, number
(plural/singular), case etc.
Coding ...

Open 1. Spark NLP Basics and 2. Text Preprocessing with SparkNLP notebooks in Colab

(click on Colab icon or open in a new tab)

https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/1.SparkNLP_Basics.ipynb

https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/2.Text_Preprocessing_with_SparkNLP(Annotators_Transformers).ipynb
Word & Sentence Embeddings

● Deep-Learning-based natural language processing systems.

● They encode words and sentences 📜 in fixed-length dense vectors 📐 to drastically improve the processing of textual
data.
● Based on The Distributional Hypothesis: Words that occur in the same contexts tend to have similar meanings.
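
The Distributional Hypothesis can be illustrated with a toy corpus. The sketch below builds sparse co-occurrence counts rather than the learned dense vectors that GloVe, ELMO or BERT produce, so it is only an analogy; the corpus and function names are assumptions made for this example.

```python
# Toy demonstration of the Distributional Hypothesis with co-occurrence counts.
# Real embeddings are learned dense vectors; this only shows the intuition.
from collections import Counter
from math import sqrt

CORPUS = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases mice",
    "the dog chases cats",
    "the car needs fuel",
]

def context_vectors(sentences):
    # For every word, count the other words that share a sentence with it.
    vectors = {}
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            context = words[:i] + words[i + 1:]
            vectors.setdefault(word, Counter()).update(context)
    return vectors

def cosine(u, v):
    # Cosine similarity between two sparse count vectors.
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

vectors = context_vectors(CORPUS)
print("cat vs dog:", cosine(vectors["cat"], vectors["dog"]))
print("cat vs car:", cosine(vectors["cat"], vectors["car"]))
```

`cat` and `dog` appear next to the same verbs, so their vectors are closer to each other than either is to `car`.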
Word & Sentence Embeddings

Glove: (100, 200, 300)
ELMO: (512, 1024)
BERT: (768d)
Universal Sentence Encoders: (512)
Coding ...

Open 3. Spark NLP Pretrained Models notebook in Colab

(click on Colab icon or open in a new tab)

https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/3.SparkNLP_Pretrained_Models.ipynb
Part - II

❖ Named Entity Recognition (NER) in Spark NLP


Named Entity Recognition (NER)
NER Benchmarks
● Spark NLP 2.4.x obtained the best performing
academic peer-reviewed results
● State-of-the-art Deep Learning algorithms
● Achieve high accuracy within a few minutes
● Achieve high accuracy with a few lines of code
● Blazing fast training
● Use CPU or GPU
● Easy to choose Word Embeddings
● Pre-trained GloVe models
● Pre-trained BERT models from TF Hub
● Pre-trained ELMO models from TF Hub
NER-DL in Spark NLP
CoNLL2003 format, BIO schema

John Smith ⇒ PERSON
New York ⇒ LOCATION

* Each line contains four fields: the word, its part-of-speech tag, its chunk tag and its named entity tag.

* CoNLL: Conference on Computational Natural Language Learning
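
A small sketch of reading this format and grouping BIO tags into entities follows. The sample sentence is invented for illustration, it uses the short `PER` and `LOC` labels of the CoNLL-2003 data rather than the spelled-out names on the slide, and the helper names are hypothetical (this is not the Spark NLP CoNLL reader).

```python
# Parse CoNLL-2003 style lines (word, POS, chunk, NER) and group BIO tags.
# SAMPLE is an invented sentence; helper names are hypothetical.
SAMPLE = """\
John NNP B-NP B-PER
Smith NNP I-NP I-PER
lives VBZ B-VP O
in IN B-PP O
New NNP B-NP B-LOC
York NNP I-NP I-LOC
. . O O
"""

def parse_conll(text):
    # One token per non-empty line, four whitespace-separated fields each.
    return [tuple(line.split()) for line in text.splitlines() if line.strip()]

def bio_to_entities(rows):
    entities, words, label = [], [], None
    for word, _pos, _chunk, tag in rows:
        # I-X continues the open entity only when the label matches.
        if tag.startswith("I-") and label == tag[2:]:
            words.append(word)
            continue
        # Anything else closes the entity that was open, if any.
        if words:
            entities.append((" ".join(words), label))
        if tag == "O":
            words, label = [], None
        else:
            # B-X, or a stray I-X, starts a new entity.
            words, label = [word], tag[2:]
    if words:
        entities.append((" ".join(words), label))
    return entities

rows = parse_conll(SAMPLE)
print(bio_to_entities(rows))
```

`B-` marks the first token of an entity, `I-` a continuation, and `O` a token outside any entity, which is how `John` and `Smith` end up merged into one `PER` span.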


NER-DL in Spark NLP
Annotated Data in CoNLL → CoNLL Reader → Word Embeddings (Glove, Bert, Elmo) → NerDLApproach (Char-CNN-BiLSTM)

You can also train your own Word Embeddings in Gensim and load in Spark NLP.
Coding ...

Open 4. NERDL Training notebook in Colab

(click on Colab icon or open in a new tab)

https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/4.NERDL_Training.ipynb
Part - III

❖ Text Classification with Classifier DL in Spark NLP

Text Classification with Classifier DL in Spark NLP
Classifier DL Tensorflow Architecture
Coding ...

Open 5. Text Classification with ClassifierDL notebook in Colab

(click on Colab icon or open in a new tab)

https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/5.Text_Classification_with_ClassifierDL.ipynb
Spark NLP Resources
Spark NLP Official page
Spark NLP Workshop Repo
JSL Youtube channel
JSL Blogs
Introduction to Spark NLP: Foundations and Basic Components (Part-I)
Introduction to: Spark NLP: Installation and Getting Started (Part-II)
Spark NLP 101 : Document Assembler
Spark NLP 101: LightPipeline
https://www.oreilly.com/radar/one-simple-chart-who-is-interested-in-spark-nlp/
https://blog.dominodatalab.com/comparing-the-functionality-of-open-source-natural-language-processing-libraries/
https://databricks.com/blog/2017/10/19/introducing-natural-language-processing-library-apache-spark.html
https://databricks.com/fr/session/apache-spark-nlp-extending-spark-ml-to-deliver-fast-scalable-unified-natural-language-processing
https://medium.com/@saif1988/spark-nlp-walkthrough-powered-by-tensorflow-9965538663fd
https://www.kdnuggets.com/2019/06/spark-nlp-getting-started-with-worlds-most-widely-used-nlp-library-enterprise.html
https://www.forbes.com/sites/forbestechcouncil/2019/09/17/winning-in-health-care-ai-with-small-data/#1b2fc2555664
https://medium.com/hackernoon/mueller-report-for-nerds-spark-meets-nlp-with-tensorflow-and-bert-part-1-32490a8f8f12
https://www.analyticsindiamag.com/5-reasons-why-spark-nlp-is-the-most-widely-used-library-in-enterprises/
https://www.oreilly.com/ideas/comparing-production-grade-nlp-libraries-training-spark-nlp-and-spacy-pipelines
https://www.oreilly.com/ideas/comparing-production-grade-nlp-libraries-accuracy-performance-and-scalability
https://www.infoworld.com/article/3031690/analytics/why-you-should-use-spark-for-machine-learning.html
Slides:

https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/Spark%20NLP%20Training-Public-April%202020.pdf

Colab:

https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/1.SparkNLP_Basics.ipynb

https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/2.Text_Preprocessing_with_SparkNLP_Annotators_Transformers.ipynb

https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/3.SparkNLP_Pretrained_Models.ipynb

https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/4.NERDL_Training.ipynb

https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/5.Text_Classification_with_ClassifierDL.ipynb
Spark NLP for Data Scientists: April 22, 2020
Spark NLP for Healthcare Data Scientist: May 13, 2020

Veysel Kocaman
Sr. Data Scientist
[email protected]
Welcome !
Spark NLP for Data Scientists

Overview and key concepts in Spark NLP


Text preprocessing and cleaning
Named Entity Recognition (NER) and training your own NER
Text Classification and model inference

Spark NLP for Healthcare Data Scientists

Cleaning medical text: Normalization, stop-words, clinical POS, spell checking


Common medical NLP use cases
Clinical named entity recognition
Assertion status detection
Medical Entity Resolution (ICD, RxNorm, SNOMED)
Healthcare Data De-identification
Optical Character Recognition (OCR)