Spark NLP Training-Public-April 2020
Veysel Kocaman
Sr. Data Scientist
Setup
RUNNING CODE:
https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public
[How to set up Google Colab]
BOOKMARK:
nlp.johnsnowlabs.com/docs/en/concepts
spark-nlp.slack.com
Part - I
❖ Introducing JSL and Spark NLP
❖ Natural Language Processing (NLP) Basics
❖ Spark NLP Pretrained Pipelines
❖ Text Preprocessing with Spark NLP
❖ Spark NLP Pretrained Models
Introducing Spark NLP
● Spark NLP is an open-source natural language processing library, built on top of Apache Spark and Spark ML. (initial release: Oct 2017)
○ A single unified solution for all your NLP needs
○ Takes advantage of transfer learning and implements the latest and greatest SOTA algorithms and models in NLP research
○ Addresses the lack of any NLP library that's fully supported by Spark
○ Delivers a mission-critical, enterprise-grade NLP library (used by multiple Fortune 500 companies)
Available in Python, R, Scala and Java
Other open-source NLP libraries:
● Natural Language Toolkit (NLTK): The complete toolkit for all NLP techniques.
● TextBlob: Easy to use NLP tools API, built on top of NLTK and Pattern.
● SpaCy: Industrial strength NLP with Python and Cython.
● Gensim: Topic Modelling for Humans
● Stanford Core NLP: NLP services and packages by Stanford NLP Group.
● Fasttext: NLP library by Facebook's AI Research (FAIR) lab
[Chart: Most widely used ML frameworks and tools. Source: "AI Adoption in the Enterprise", February 2019 survey of 1,300 practitioners by O'Reilly]
BUILT ON THE SHOULDERS OF SPARK ML
● Reusing the Spark ML Pipeline
● Unified NLP & ML pipelines
● End-to-end execution planning
● Serializable
● Distributable
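As a rough illustration of the pipeline idea in the list above, the sketch below chains stages that each read one column and write another, so later stages can reuse what earlier ones produced. It is plain Python, not the Spark ML or Spark NLP API: the class names `Stage` and `Pipeline` and the column names are hypothetical stand-ins chosen for this example.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


class Stage:
    """Toy stand-in for a pipeline stage: reads one column, writes another."""

    def __init__(self, input_col: str, output_col: str, fn: Callable):
        self.input_col = input_col
        self.output_col = output_col
        self.fn = fn

    def transform(self, row: Row) -> Row:
        # Copy the row so columns from earlier stages stay available.
        out = dict(row)
        out[self.output_col] = self.fn(row[self.input_col])
        return out


class Pipeline:
    """Runs stages in order; each stage sees the columns added before it."""

    def __init__(self, stages: List[Stage]):
        self.stages = stages

    def transform(self, rows: List[Row]) -> List[Row]:
        for stage in self.stages:
            rows = [stage.transform(r) for r in rows]
        return rows


# Hypothetical three-stage pipeline: tokenize, normalize, count.
pipeline = Pipeline([
    Stage("text", "token", str.split),
    Stage("token", "normalized",
          lambda toks: [t.lower().strip(".,!?") for t in toks]),
    Stage("normalized", "length", len),
])

result = pipeline.transform([{"text": "Spark NLP extends Spark ML."}])
print(result[0]["normalized"])
print(result[0]["length"])
```

Because every stage only adds a column, NLP steps and downstream ML steps can sit in the same chain, which is the point of the "Unified NLP & ML pipelines" bullet.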
● The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related
forms of a word to a common base form for normalization purposes.
● For tasks like text classification, where the text is to be classified into different
categories, stopwords are removed or excluded from the given text so that
more focus can be given to those words which define the meaning of the text.
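The two preprocessing ideas above can be sketched with a toy example. This is a minimal illustration in plain Python, not Spark NLP's annotators: the stopword set, the suffix list, and the lemma dictionary are all hypothetical and far smaller than anything used in practice.

```python
# Hypothetical, deliberately tiny resources for illustration only.
STOPWORDS = {"the", "a", "an", "is", "are", "were", "of", "in", "on", "and", "to"}
LEMMAS = {"mice": "mouse", "running": "run", "studies": "study"}
SUFFIXES = ("ing", "ies", "es", "s", "ed")


def remove_stopwords(tokens):
    """Keep only tokens that are not in the stopword set."""
    return [t for t in tokens if t not in STOPWORDS]


def stem(word):
    """Crude stemmer: strip the first matching suffix, no vocabulary check."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatize(word):
    """Dictionary lemmatizer: map to a known base form, else leave unchanged."""
    return LEMMAS.get(word, word)


tokens = "the mice were running in the studies".split()
content = remove_stopwords(tokens)
stems = [stem(t) for t in content]
lemmas = [lemmatize(t) for t in content]

print(content)  # ['mice', 'running', 'studies']
print(stems)    # ['mice', 'runn', 'stud']
print(lemmas)   # ['mouse', 'run', 'study']
```

The output shows the usual trade-off: the rule-based stemmer is cheap but can produce non-words such as `runn`, while the lemmatizer returns real base forms but only for words its dictionary covers.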
Spell Checking & Correction
A POS tag (or part-of-speech tag) is a special label assigned to each token (word) in a text corpus to indicate the part of speech and often also other grammatical categories such as tense, number (plural/singular), and case.
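To make the definition above concrete, here is a toy tagger that assigns Penn Treebank style tags, where `NNS` marks a plural noun and `VBD` a past-tense verb. It is only a lookup with suffix fallbacks, written for this example; the lexicon and the function name `tag` are hypothetical, and real taggers (including pretrained ones) are statistical models rather than rule tables.

```python
# Hypothetical mini-lexicon; tags follow the Penn Treebank convention.
LEXICON = {
    "the": "DT",      # determiner
    "a": "DT",
    "dog": "NN",      # noun, singular
    "dogs": "NNS",    # noun, plural
    "barked": "VBD",  # verb, past tense
    "loudly": "RB",   # adverb
}


def tag(token):
    """Look the word up; otherwise guess from its suffix."""
    word = token.lower()
    if word in LEXICON:
        return LEXICON[word]
    if word.endswith("ed"):
        return "VBD"
    if word.endswith("ly"):
        return "RB"
    if word.endswith("s"):
        return "NNS"
    return "NN"


tagged = [(t, tag(t)) for t in "The dogs barked loudly".split()]
unknown = [(t, tag(t)) for t in "cats jumped".split()]

print(tagged)
print(unknown)
```

Note how the tag carries more than the word class: `dogs` is tagged `NNS` rather than `NN`, so number is encoded in the label itself.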
Coding ...
https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/1.SparkNLP_Basics.ipynb
https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/2.Text_Preprocessing_with_SparkNLP(Annotators_Transformers).ipynb
Word & Sentence Embeddings
● They encode words and sentences 📜 in fixed-length dense vectors 📐 to drastically improve the processing of textual data.
● Based on The Distributional Hypothesis: Words that occur in the same contexts tend to have similar meanings.
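The distributional hypothesis can be checked on a toy corpus: count which words appear near each target word, then compare the resulting context vectors with cosine similarity. This is a sketch under simplifying assumptions. The corpus is made up, the vectors are sparse counts rather than the dense learned vectors of GloVe, BERT or ELMo, and the helper names `context_vectors` and `cosine` are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Made-up corpus: "cat" and "dog" share contexts, "car" does not.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the ball",
    "the car needs fuel",
]


def context_vectors(sentences, window=2):
    """Count, for each word, the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


vectors = context_vectors(corpus)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_car = cosine(vectors["cat"], vectors["car"])

print(round(sim_cat_dog, 3), round(sim_cat_car, 3))
```

On this corpus "cat" ends up closer to "dog" than to "car", purely because the first two share contexts such as "drinks" and "chases". Learned embeddings compress the same kind of signal into fixed-length dense vectors.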
https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/3.SparkNLP_Pretrained_Models.ipynb
Part - II
CoNLL Reader → Word Embeddings (GloVe, BERT, ELMo) → NerDLApproach (Char-CNN-BiLSTM)
You can also train your own word embeddings in Gensim and load them into Spark NLP.
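The first step of the training flow above reads CoNLL-formatted data. The sketch below parses the CoNLL-2003 layout (one token per line with POS, chunk and NER columns, blank lines between sentences, `-DOCSTART-` lines between documents) in plain Python. It is not Spark NLP's CoNLL reader: the function `read_conll` and the sample text are hypothetical and exist only to show what the format carries.

```python
# Made-up sample in CoNLL-2003 layout: token POS chunk NER-label.
SAMPLE = """-DOCSTART- -X- -X- O

John NNP B-NP B-PER
Smith NNP I-NP I-PER
lives VBZ B-VP O
in IN B-PP O
Berlin NNP B-NP B-LOC
. . O O

He PRP B-NP O
works VBZ B-VP O
. . O O
"""


def read_conll(text):
    """Return a list of sentences, each a list of (token, pos, ner) tuples."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and document markers both end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        token, pos, _chunk, ner = line.split()
        current.append((token, pos, ner))
    if current:
        sentences.append(current)
    return sentences


sentences = read_conll(SAMPLE)
entity_tokens = [(tok, ner) for sent in sentences for tok, _, ner in sent if ner != "O"]

print(len(sentences))
print(entity_tokens)
```

The `B-` and `I-` prefixes mark the beginning and inside of an entity span, which is what lets a token-level tagger recover multi-word entities such as "John Smith".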
Coding ...
https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/4.NERDL_Training.ipynb
Part - III
https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/4.NERDL_Training.ipynb
Spark NLP Resources
Spark NLP Official page
Spark NLP Workshop Repo
JSL Youtube channel
JSL Blogs
Introduction to Spark NLP: Foundations and Basic Components (Part-I)
Introduction to: Spark NLP: Installation and Getting Started (Part-II)
Spark NLP 101 : Document Assembler
Spark NLP 101: LightPipeline
https://www.oreilly.com/radar/one-simple-chart-who-is-interested-in-spark-nlp/
https://blog.dominodatalab.com/comparing-the-functionality-of-open-source-natural-language-processing-libraries/
https://databricks.com/blog/2017/10/19/introducing-natural-language-processing-library-apache-spark.html
https://databricks.com/fr/session/apache-spark-nlp-extending-spark-ml-to-deliver-fast-scalable-unified-natural-language-processing
https://medium.com/@saif1988/spark-nlp-walkthrough-powered-by-tensorflow-9965538663fd
https://www.kdnuggets.com/2019/06/spark-nlp-getting-started-with-worlds-most-widely-used-nlp-library-enterprise.html
https://www.forbes.com/sites/forbestechcouncil/2019/09/17/winning-in-health-care-ai-with-small-data/#1b2fc2555664
https://medium.com/hackernoon/mueller-report-for-nerds-spark-meets-nlp-with-tensorflow-and-bert-part-1-32490a8f8f12
https://www.analyticsindiamag.com/5-reasons-why-spark-nlp-is-the-most-widely-used-library-in-enterprises/
https://www.oreilly.com/ideas/comparing-production-grade-nlp-libraries-training-spark-nlp-and-spacy-pipelines
https://www.oreilly.com/ideas/comparing-production-grade-nlp-libraries-accuracy-performance-and-scalability
https://www.infoworld.com/article/3031690/analytics/why-you-should-use-spark-for-machine-learning.html
Slides:
https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/Spark%20NLP%20Training-Public-April%202020.pdf
Colab:
https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/1.SparkNLP_Basics.ipynb
https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/2.Text_Preprocessing_with_SparkNLP_Annotators_Transformers.ipynb
https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/3.SparkNLP_Pretrained_Models.ipynb
https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/4.NERDL_Training.ipynb
https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/5.Text_Classification_with_ClassifierDL.ipynb
Spark NLP for Data Scientists
Spark NLP for Healthcare Data Scientists
Welcome!
Spark NLP for Data Scientists