Spark NLP Training-Public-April 2020

Spark NLP is an open-source natural language processing library built on Apache Spark. It provides pretrained pipelines and models for tasks like text preprocessing, named entity recognition and text classification. The document discusses Spark NLP's advantages over other NLP libraries and introduces key NLP concepts. It also provides links to tutorials demonstrating Spark NLP's capabilities for named entity recognition using deep learning models and text classification using ClassifierDL.

Uploaded by

Xuân Vinh Nguyễn

Spark NLP

for Data Scientists

April 22, 2020

Veysel Kocaman
Sr. Data Scientist
[email protected]
Setup
RUNNING CODE:
https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public
[How to set up Google Colab]

BOOKMARK:
nlp.johnsnowlabs.com/docs/en/concepts
spark-nlp.slack.com
Part - I
❖ Introducing JSL and Spark NLP
❖ Natural Language Processing (NLP) Basics
❖ Spark NLP Pretrained Pipelines
❖ Text Preprocessing with Spark NLP
❖ Spark NLP Pretrained Models
Introducing Spark NLP
Other NLP libraries:
● Natural Language Toolkit (NLTK): The complete toolkit for all NLP techniques.
● TextBlob: Easy to use NLP tools API, built on top of NLTK and Pattern.
● SpaCy: Industrial strength NLP with Python and Cython.
● Gensim: Topic Modelling for Humans
● Stanford Core NLP: NLP services and packages by Stanford NLP Group.
● Fasttext: NLP library by Facebook’s AI Research (FAIR) lab
● ...

● Spark NLP is an open-source natural language processing library, built on top of Apache Spark and Spark ML. (initial release: Oct 2017)
○ A single unified solution for all your NLP needs
○ Take advantage of transfer learning and implementing the latest and greatest SOTA algorithms and models in NLP research
○ Lack of any NLP library that’s fully supported by Spark
○ Delivering a mission-critical, enterprise grade NLP library (used by multiple Fortune 500)
○ Full-time development team (26 new releases in 2018. 30 new releases in 2019.)
https://medium.com/spark-nlp/introduction-to-spark-nlp-foundations-and-basic-components-part-i-c83b7629ed59
Introducing Spark NLP

Available in Python, R, Scala and Java
Most widely used ML frameworks and tools: “AI Adoption in the Enterprise”, February 2019 survey of 1,300 practitioners by O’Reilly
BUILT ON THE SHOULDERS OF SPARK ML
● Reusing the Spark ML Pipeline
○ Unified NLP & ML pipelines
○ End-to-end execution planning
○ Serializable
○ Distributable
● Reusing NLP Functionality
○ TF-IDF calculation
○ String distance calculation
○ Topic modeling
○ Distributed ML algorithms
Introducing Spark NLP
Pipeline of annotators
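The pipeline-of-annotators idea can be sketched in plain Python. This is not the Spark NLP API: the `Annotator` type, `run_pipeline` and the three stage functions are hypothetical names, used only to show how each stage reads the columns produced by earlier stages and adds one of its own.

```python
import re
from typing import Callable, Dict, List

# A "row" maps column names to values, loosely mimicking a DataFrame row.
Row = Dict[str, object]
# Hypothetical annotator type: takes a row, returns it with one more column.
Annotator = Callable[[Row], Row]


def document_assembler(row: Row) -> Row:
    # Wrap the raw text into a "document" column.
    return {**row, "document": str(row["text"]).strip()}


def sentence_detector(row: Row) -> Row:
    # Naive split on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", str(row["document"]))
    return {**row, "sentence": [s for s in sentences if s]}


def tokenizer(row: Row) -> Row:
    # Tokenize each sentence into words and punctuation marks.
    tokens = [re.findall(r"\w+|[^\w\s]", s) for s in row["sentence"]]
    return {**row, "token": tokens}


def run_pipeline(stages: List[Annotator], row: Row) -> Row:
    # Apply the stages in order; each one sees the columns added before it.
    for stage in stages:
        row = stage(row)
    return row


result = run_pipeline(
    [document_assembler, sentence_detector, tokenizer],
    {"text": "Spark NLP is built on Spark ML. It uses pipelines."},
)
print(result["sentence"])
print(result["token"])
```

The order of the stages matters: the tokenizer needs the `sentence` column, which only exists after the sentence detector has run.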
NLP Basics

● The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form for normalization purposes.

● Lemmatization always returns real words, stemming doesn’t.
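A toy illustration of the difference, assuming a crude suffix-stripping rule for stemming and a tiny hand-written dictionary for lemmatization. Both `stem` and `lemmatize` are hypothetical helpers written for this sketch; they are not the algorithms used by any particular library.

```python
# Toy suffix-stripping stemmer: chops common endings without checking
# whether what remains is a real word.
def stem(word: str) -> str:
    for suffix in ("ies", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Toy lemmatizer: looks the word up in a tiny, hand-written dictionary
# that maps inflected forms to their dictionary form.
LEMMAS = {"studies": "study", "studying": "study", "ponies": "pony", "was": "be"}


def lemmatize(word: str) -> str:
    return LEMMAS.get(word, word)


for w in ["studies", "studying", "ponies"]:
    print(w, "->", stem(w), "|", lemmatize(w))
```

Here the stemmer turns "ponies" into "pon", which is not a word, while the dictionary lookup returns "pony".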

NLP Basics

● For tasks like text classification, where the text is to be classified into different
categories, stopwords are removed or excluded from the given text so that
more focus can be given to those words which define the meaning of the text.
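A minimal sketch of stopword removal, assuming a small hand-picked stopword list and a simple regex tokenizer. The `STOPWORDS` set and `remove_stopwords` helper are made up for this example.

```python
import re

# Small, hand-picked stopword list for illustration only; real pipelines
# use much longer, language-specific lists.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "this", "to", "was"}


def remove_stopwords(text: str) -> list:
    # Lowercase, split into word tokens, then drop the stopwords.
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


print(remove_stopwords("This is one of the best movies in the world"))
```

What remains ("one", "best", "movies", "world") carries most of the signal a classifier would use.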
Spell Checking & Correction
A POS tag (or part-of-speech tag) is a special label assigned to each token (word) in a text corpus to
indicate the part of speech and often also other grammatical categories such as tense, number
(plural/singular), case etc.
Coding ...

Open 1. Spark NLP Basics and 2. Text Preprocessing with SparkNLP notebooks in Colab

(click on Colab icon or open in a new tab)

https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/1.SparkNLP_Basics.ipynb

https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/2.Text_Preprocessing_with_SparkNLP(Annotators_Transformers).ipynb
Word & Sentence Embeddings

● Deep-Learning-based natural language processing systems.

● They encode words and sentences 📜 in fixed-length dense vectors 📐 to drastically improve the processing of textual
data.
● Based on The Distributional Hypothesis: Words that occur in the same contexts tend to have similar meanings.
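To make "similar meaning, similar vector" concrete, here is a small sketch that compares dense vectors with cosine similarity. The 4-dimensional vectors are invented for illustration; they are not taken from GloVe, ELMO, BERT or any other pretrained model.

```python
import math

# Made-up 4-dimensional vectors for illustration; real embeddings are
# learned from data and have hundreds of dimensions.
EMBEDDINGS = {
    "king": [0.8, 0.6, 0.1, 0.0],
    "queen": [0.7, 0.7, 0.1, 0.1],
    "apple": [0.0, 0.1, 0.9, 0.4],
}


def cosine_similarity(u, v):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity(EMBEDDINGS["king"], EMBEDDINGS["queen"]))
print(cosine_similarity(EMBEDDINGS["king"], EMBEDDINGS["apple"]))
```

With these toy values, "king" scores much closer to "queen" than to "apple", which is the behaviour one hopes to get from embeddings trained on real text.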
Word & Sentence Embeddings

● Glove (100, 200, 300)
● ELMO (512, 1024)
● BERT (768d)
● Universal Sentence Encoders (512)
Coding ...

Open 3. Spark NLP Pretrained Models notebooks in Colab

(click on Colab icon or open in a new tab)

https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/3.SparkNLP_Pretrained_Models.ipynb
Part - II

❖ Named Entity Recognition (NER) in Spark NLP

Named Entity Recognition (NER)
NER Benchmarks
● Spark NLP 2.4.x obtained the best performing
academic peer-reviewed results
● State-of-the-art Deep Learning algorithms
● Achieve high accuracy within a few minutes
● Achieve high accuracy with a few lines of code
● Blazing fast training
● Use CPU or GPU
● Easy to choose Word Embeddings
○ Pre-trained GloVe models
○ Pre-trained BERT models from TF Hub
○ Pre-trained ELMO models from TF Hub
NER-DL in Spark NLP
CoNLL2003 format BIO schema

John Smith ⇒ PERSON

New York ⇒ LOCATION
* Each line contains four fields: the word, its part-of-speech tag, its chunk
tag and its named entity tag.

* CoNLL: Conference on Computational Natural Language Learning
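A minimal sketch of reading four-field CoNLL lines and turning BIO tags back into entities. The sample sentence is made up, and it uses the short labels PER and LOC where the slide writes PERSON and LOCATION. `parse_conll` and `bio_to_entities` are hypothetical helpers, not the Spark NLP CoNLL reader.

```python
SAMPLE = """\
John NNP B-NP B-PER
Smith NNP I-NP I-PER
lives VBZ B-VP O
in IN B-PP O
New NNP B-NP B-LOC
York NNP I-NP I-LOC
. . O O
"""


def parse_conll(text):
    # Each non-empty line holds: word, POS tag, chunk tag, named entity tag.
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("-DOCSTART-"):
            continue
        word, pos, chunk, ner = line.split()
        rows.append((word, pos, chunk, ner))
    return rows


def bio_to_entities(rows):
    # B-X opens an entity of type X, I-X continues it, O is outside any entity.
    entities, words, label = [], [], None
    for word, _pos, _chunk, tag in rows:
        if tag.startswith("I-") and label == tag[2:]:
            words.append(word)
            continue
        # Any other tag closes the entity that is currently open, if any.
        if words:
            entities.append((" ".join(words), label))
        if tag == "O":
            words, label = [], None
        else:
            # A "B-" tag (or a stray "I-" tag) starts a new entity.
            words, label = [word], tag[2:]
    if words:
        entities.append((" ".join(words), label))
    return entities


print(bio_to_entities(parse_conll(SAMPLE)))
```

The B- and I- prefixes are what allow multi-word names such as "John Smith" and "New York" to be recovered as single entities.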

NER-DL in Spark NLP
Annotated Data in CoNLL ⇒ CoNLL Reader ⇒ Word Embeddings (Glove, Bert, Elmo) ⇒ NerDLApproach (Char-CNN-BiLSTM)
You can also train your own Word Embeddings
in Gensim and load in Spark NLP.
Coding ...

Open 4. NERDL Training notebook in Colab

(click on Colab icon or open in a new tab)

https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/4.NERDL_Training.ipynb
Part - III

❖ Text Classification with Classifier DL in Spark NLP
Text Classification with Classifier DL in Spark NLP
Classifier DL Tensorflow Architecture
Coding ...

Open 5. Text Classification with ClassifierDL notebook in Colab

(click on Colab icon or open in a new tab)

https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/5.Text_Classification_with_ClassifierDL.ipynb
Spark NLP Resources
Spark NLP Official page
Spark NLP Workshop Repo
JSL Youtube channel
JSL Blogs
Introduction to Spark NLP: Foundations and Basic Components (Part-I)
Introduction to: Spark NLP: Installation and Getting Started (Part-II)
Spark NLP 101 : Document Assembler
Spark NLP 101: LightPipeline
https://www.oreilly.com/radar/one-simple-chart-who-is-interested-in-spark-nlp/
https://blog.dominodatalab.com/comparing-the-functionality-of-open-source-natural-language-processing-libraries/
https://databricks.com/blog/2017/10/19/introducing-natural-language-processing-library-apache-spark.html
https://databricks.com/fr/session/apache-spark-nlp-extending-spark-ml-to-deliver-fast-scalable-unified-natural-language-processing
https://medium.com/@saif1988/spark-nlp-walkthrough-powered-by-tensorflow-9965538663fd
https://www.kdnuggets.com/2019/06/spark-nlp-getting-started-with-worlds-most-widely-used-nlp-library-enterprise.html
https://www.forbes.com/sites/forbestechcouncil/2019/09/17/winning-in-health-care-ai-with-small-data/#1b2fc2555664
https://medium.com/hackernoon/mueller-report-for-nerds-spark-meets-nlp-with-tensorflow-and-bert-part-1-32490a8f8f12
https://www.analyticsindiamag.com/5-reasons-why-spark-nlp-is-the-most-widely-used-library-in-enterprises/
https://www.oreilly.com/ideas/comparing-production-grade-nlp-libraries-training-spark-nlp-and-spacy-pipelines
https://www.oreilly.com/ideas/comparing-production-grade-nlp-libraries-accuracy-performance-and-scalability
https://www.infoworld.com/article/3031690/analytics/why-you-should-use-spark-for-machine-learning.html
Slides:

https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/Spark%20NLP%20Training-Public-April%202020.pdf

Colab:

https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/1.SparkNLP_Basics.ipynb

https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/2.Text_Preprocessing_with_SparkNLP_Annotators_Transformers.ipynb

https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/3.SparkNLP_Pretrained_Models.ipynb

https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/4.NERDL_Training.ipynb

https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/5.Text_Classification_with_ClassifierDL.ipynb
Spark NLP for Data Scientists: April 22, 2020
Spark NLP for Healthcare Data Scientist: May 13, 2020

Welcome !
Spark NLP for Data Scientists
● Overview and key concepts in Spark NLP
● Text preprocessing and cleaning
● Named Entity Recognition (NER) and training your own NER
● Text Classification and model inference

Spark NLP for Healthcare Data Scientists
● Cleaning medical text: Normalization, stop-words, clinical POS, spell checking
● Common medical NLP use cases
● Clinical named entity recognition
● Assertion status detection
● Medical Entity Resolution (ICD, RxNorm, SNOMED)
● Healthcare Data De-identification
● Optical Character Recognition (OCR)
