Learning Apache Spark with Python
Wenqiang Feng
December 05, 2021

CONTENTS

1 Preface
  1.1 About
  1.2 Motivation for this tutorial
  1.3 Copyright notice and license info
  1.4 Acknowledgement
  1.5 Feedback and suggestions

2 Why Spark with Python?
  2.1 Why Spark?
  2.2 Why Spark with Python (PySpark)?

3 Configure Running Platform
  3.1 Run on Databricks Community Cloud
  3.2 Configure Spark on Mac and Ubuntu
  3.3 Configure Spark on Windows
  3.4 PySpark With Text Editor or IDE
  3.5 PySparkling Water: Spark + H2O
  3.6 Set up Spark on Cloud
  3.7 PySpark on Colaboratory
  3.8 Demo Code in this Section

4 An Introduction to Apache Spark
  4.1 Core Concepts
  4.2 Spark Components
  4.3 Architecture
  4.4 How Spark Works?

5 Programming with RDDs
  5.1 Create RDD
  5.2 Spark Operations
  5.3 rdd.DataFrame vs pd.DataFrame

6 Statistics and Linear Algebra Preliminaries
  6.1 Notations
  6.2 Linear Algebra Preliminaries
  6.3 Measurement Formula
  6.4 Confusion Matrix
  6.5 Statistical Tests

7 Data Exploration
  7.1 Univariate Analysis
  7.2 Multivariate Analysis

8 Data Manipulation: Features
  8.1 Feature Extraction
  8.2 Feature Transform
  8.3 Feature Selection
  8.4 Unbalanced data: Undersampling

9 Regression
  9.1 Linear Regression
  9.2 Generalized linear regression
  9.3 Decision tree Regression
  9.4 Random Forest Regression
  9.5 Gradient-boosted tree regression

10 Regularization
  10.1 Ordinary least squares regression
  10.2 Ridge regression
  10.3 Least Absolute Shrinkage and Selection Operator (LASSO)
  10.4 Elastic net

11 Classification
  11.1 Binomial logistic regression
  11.2 Multinomial logistic regression
  11.3 Decision tree Classification
  11.4 Random forest Classification
  11.5 Gradient-boosted tree Classification
  11.6 XGBoost: Gradient-boosted tree Classification
  11.7 Naive Bayes Classification

12 Clustering
  12.1 K-Means Model

13 RFM Analysis
  13.1 RFM Analysis Methodology
  13.2 Demo
  13.3 Extension

14 Text Mining
  14.1 Text Collection
  14.2 Text Preprocessing
  14.3 Text Classification
  14.4 Sentiment analysis
  14.5 N-grams and Correlations
  14.6 Topic Model: Latent Dirichlet Allocation

15 Social Network Analysis
  15.1 Introduction
  15.2 Co-occurrence Network
  15.3 Appendix: matrix multiplication in PySpark
  15.4 Correlation Network

16 ALS: Stock Portfolio Recommendations
  16.1 Recommender systems
  16.2 Alternating Least Squares
  16.3 Demo

17 Monte Carlo Simulation
  17.1 Simulating Casino Win
  17.2 Simulating a Random Walk

18 Markov Chain Monte Carlo
  18.1 Metropolis algorithm
  18.2 A Toy Example of Metropolis
  18.3 Demos

19 Neural Network
  19.1 Feedforward Neural Network

20 Automation for Cloudera Distribution Hadoop
  20.1 Automation Pipeline
  20.2 Data Clean and Manipulation Automation
  20.3 ML Pipeline Automation
  20.4 Save and Load PipelineModel
  20.5 Ingest Results Back into Hadoop

21 Wrap PySpark Package
  21.1 Package Wrapper
  21.2 Package Publishing on PyPI

22 PySpark Data Audit Library
  22.1 Install with pip
  22.2 Install from Repo
  22.3 Uninstall
  22.4 Test
  22.5 Auditing on Big Dataset

23 Zeppelin to jupyter notebook
  23.1 How to Install
  23.2 Converting Demos

24 My Cheat Sheet
25 JDBC Connection
  25.1 JDBC Driver
  25.2 JDBC read
  25.3 JDBC write
  25.4 JDBC temp_view

26 Databricks Tips
  26.1 Display samples
  26.2 Auto files download
  26.3 Working with AWS S3
  26.4 delta format
  26.5 mlflow

27 PySpark API
  27.1 Stat API
  27.2 Regression API
  27.3 Classification API
  27.4 Clustering API
  27.5 Recommendation API
  27.6 Pipeline API
  27.7 Tuning API
  27.8 Evaluation API
28 Main Reference
Bibliography
Python Module Index
Index
Welcome to my Learning Apache Spark with Python note! In this note, you will learn a wide array of concepts about PySpark in Data Mining, Text Mining, Machine Learning and Deep Learning. The PDF version can be downloaded from HERE.
CHAPTER ONE

PREFACE
1.1 About
1.1.1 About this note
This is a shared repository for Learning Apache Spark Notes. The PDF version can be downloaded from HERE. The first version was posted on GitHub in ChenFeng [Feng2017]. This shared repository mainly contains the self-learning and self-teaching notes from Wenqiang during his IMA Data Science Fellowship. The reader is referred to the repository https://github.com/runawayhorse001/LearningApacheSpark for more details about the dataset and the .ipynb files.

In this repository, I try to use detailed demo code and examples to show how to use each main function. If you find your work wasn't cited in this note, please feel free to let me know.

Although I am by no means a data mining programming and Big Data expert, I decided that it would be useful for me to share what I learned about PySpark programming in the form of easy tutorials with detailed examples. I hope those tutorials will be a valuable tool for your studies.

The tutorials assume that the reader has preliminary knowledge of programming and Linux. This document is generated automatically by using Sphinx.
1.1.2 About the author
- Wenqiang Feng
  - Director of Data Science and PhD in Mathematics
  - University of Tennessee at Knoxville
  - Email: [email protected]

- Biography

Wenqiang Feng is the Director of Data Science at American Express (AMEX). Prior to his time at AMEX, Dr. Feng was a Sr. Data Scientist in the Machine Learning Lab at H&R Block. Before joining Block, Dr. Feng was a Data Scientist at the Applied Analytics Group, DST (now SS&C). Dr. Feng's responsibilities include providing clients with access to cutting-edge skills and technologies, including Big Data analytic solutions, advanced analytic and data enhancement techniques, and modeling.
Dr. Feng has deep analytic expertise in data mining, analytic systems, machine learning algorithms, business intelligence, and applying Big Data tools to strategically solve industry problems in a cross-functional business. Before joining DST, Dr. Feng was an IMA Data Science Fellow at the Institute for Mathematics and its Applications (IMA) at the University of Minnesota. While there, he helped startup companies make marketing decisions based on deep predictive analytics.

Dr. Feng graduated from the University of Tennessee, Knoxville, with a Ph.D. in Computational Mathematics and a Master's degree in Statistics. He also holds a Master's degree in Computational Mathematics from Missouri University of Science and Technology (MST) and a Master's degree in Applied Mathematics from the University of Science and Technology of China (USTC).
- Declaration

The work of Wenqiang Feng was supported by the IMA while he was working at the IMA. However, any opinion, finding, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the IMA, UTK, DST, H&R Block or AMEX.
1.2 Motivation for this tutorial
I was motivated by the IMA Data Science Fellowship project to learn PySpark. After that I was impressed and attracted by PySpark, and I found that:

1. It is no exaggeration to say that Spark is the most powerful Big Data tool.
2. However, I still found that learning Spark was a difficult process. I had to Google it and identify which answers were true, and it was hard to find detailed examples from which I could easily learn the full process in one file.
3. Good sources are expensive for a graduate student.
1.3 Copyright notice and license info
This Learning Apache Spark with Python PDF file is supposed to be a free and living document, which is why its source is available online at https://runawayhorse001.github.io/LearningApacheSpark/pyspark.pdf. But this document is licensed according to both the MIT License and the Creative Commons Attribution-NonCommercial 2.0 Generic (CC BY-NC 2.0) License.

When you plan to use, copy, modify, merge, publish, distribute or sublicense, please see the terms of those licenses for more details and give the corresponding credits to the author.
1.4 Acknowledgement
Here I would like to thank Ming Chen, Jian Sun and Zhongbo Li at the University of Tennessee at Knoxville for the valuable discussions, and the generous anonymous authors for providing detailed solutions and source code on the internet. Without their help, this repository would not have been possible. Wenqiang also would like to thank the Institute for Mathematics and Its Applications (IMA) at the University of Minnesota, Twin Cities, for support during his IMA Data Scientist Fellow visit, and TAN THIAM HUAT and Mark Rabins for finding the typos.

A special thank you goes to Dr. Haiping Lu, Lecturer in Machine Learning at the Department of Computer Science, University of Sheffield, for recommending and heavily using my tutorial in his teaching class and for his valuable suggestions.
1.5 Feedback and suggestions
Your comments and suggestions are highly appreciated. I am more than happy to receive corrections, suggestions or feedback through email ([email protected]) for improvements.
CHAPTER TWO

WHY SPARK WITH PYTHON?
Chinese proverb

Sharpening the knife longer can make it easier to hack the firewood — old Chinese proverb

I want to answer this question in the following two parts:
2.1 Why Spark?

I think the following four main reasons from the Apache Spark™ official website are good enough to convince you to use Spark.

1. Speed

Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.

Apache Spark has an advanced DAG execution engine that supports acyclic data flow and in-memory computing.
Fig. 1: Logistic regression in Hadoop and Spark
2. Ease of Use

Write applications quickly in Java, Scala, Python, R.

Spark offers over 80 high-level operators that make it easy to build parallel apps, and you can use them interactively from the Scala, Python and R shells (see the short sketch after this list).
3. Generality

Combine SQL, streaming, and complex analytics.

Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.
Fig. 2: The Spark stack (Spark SQL, Spark Streaming, MLlib, GraphX on top of Apache Spark Core and the standalone scheduler)
4. Runs Everywhere

Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.
Fig. 3: The Spark platform
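To make the "Ease of Use" point above concrete, here is a minimal PySpark sketch (not from the original text; the input path is a placeholder) that chains a few of those high-level operators into a word count:

# Minimal word-count sketch using Spark's high-level operators.
# "path/to/some.txt" is a placeholder; replace it with a real file.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ease-of-use-sketch").getOrCreate()

lines = spark.sparkContext.textFile("path/to/some.txt")
counts = (lines.flatMap(lambda line: line.split())   # split each line into words
               .map(lambda word: (word, 1))          # pair every word with a count of 1
               .reduceByKey(lambda a, b: a + b))     # sum the counts per word
print(counts.take(5))                                # action: triggers the computation

spark.stop()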
2.2 Why Spark with Python (PySpark)?

Whether you like it or not, Python has become one of the most popular programming languages.
Fig. 4: KDnuggets Analytics/Data Science Software Poll, top tools share 2015-2017, from KDnuggets (Python, R, SQL, RapidMiner, Excel, Spark, Anaconda, TensorFlow, scikit-learn, Tableau, KNIME)
CHAPTER THREE

CONFIGURE RUNNING PLATFORM
Chinese proverb

Good tools are prerequisite to the successful execution of a job. — old Chinese proverb

A good programming platform can save you lots of trouble and time. Herein I will only present how to install my favorite programming platform, and only show the easiest way that I know to set it up on a Linux system. If you want to install it on another operating system, you can Google it. In this section, you may learn how to set up PySpark on the corresponding programming platform and package.
3.1 Run on Databricks Community Cloud

If you don't have any experience with the Linux or Unix operating system, I would love to recommend using Spark on Databricks Community Cloud, since you do not need to set up Spark yourself and it is totally free for the Community Edition. Please follow the steps listed below.

1. Sign up for an account at: https://community.cloud.databricks.com/login.html
2. Sign in with your account; then you can create your cluster (machine), table (dataset) and notebook (code).
3. Create your cluster where your code will run
4. Import your dataset
4 Chapter 3. Configure Running PlatformLearning Apache Spark with Python
'@ cesterbe-outs oe
€ > © 0 [6 sue | s/communtydontscem n t= gore
fons 4 teokmarts tjeb () Freanvaton) ()TheFORRANP, 0 Forrnutrs: a Uingthedia GBH @ Atachmets E
+ ova ear awa
or Data Import
€ > © Os use seyejunmaycntaacen Tes gore
hoe te tsinsls ejb 6 Fesgn iti B heRSMAR Efe ir & Wyte daa 6 Aan ep tny
Table Details
3.1, Run on Databricks Community Cloud 5Learning Apache Spark with Python
Note: You need to save the path which appears at Uploaded to DBES: /File-
Store/tables/0Srmhuqv1489687378010/. Since we will use this path to load the dataset.
5. Create your notebook
(Screenshot: a Databricks notebook, "Linear Regression with PySpark on Databricks", Author: Wenqiang Feng, with cells that set up the SparkSession and load the dataset.)
After finishing the above 5 steps, you are ready to run your Spark code on Databricks Community Cloud. I will run all the following demos on Databricks Community Cloud. Hopefully, when you run the demo code, you will get the following results:

+---+-----+-----+---------+-----+
|_c0|   TV|Radio|Newspaper|Sales|
+---+-----+-----+---------+-----+
|  1|230.1| 37.8|     69.2| 22.1|
|  2| 44.5| 39.3|     45.1| 10.4|
|  3| 17.2| 45.9|     69.3|  9.3|
|  4|151.5| 41.3|     58.5| 18.5|
|  5|180.8| 10.8|     58.4| 12.9|
+---+-----+-----+---------+-----+
only showing top 5 rows

root
 |-- _c0: integer (nullable = true)
 |-- TV: double (nullable = true)
 |-- Radio: double (nullable = true)
 |-- Newspaper: double (nullable = true)
 |-- Sales: double (nullable = true)
3.2 Configure Spark on Mac and Ubuntu

3.2.1 Installing Prerequisites

I will strongly recommend that you install Anaconda, since it contains most of the prerequisites and supports multiple operating systems.

1. Install Python

Go to Ubuntu Software Center and follow these steps:

a. Open Ubuntu Software Center
b. Search for python
c. And click Install

Or open your terminal and use the following commands:
sudo apt-get install build-essential checkinstall
sudo apt-get install python
sudo easy_install pip
sudo pip install ipython
3.2.2 Install Java

Java is used by many other software packages, so it is quite possible that you have already installed it. You can check by running java -version in Command Prompt.
Otherwise, you can follow the steps in How do I install Java for my Mac? to install Java on Mac, and use the following commands in Command Prompt to install it on Ubuntu:
sudo apt-add-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
3.2.3 Install Java SE Runtime Environment

I installed the Oracle Java JDK.

Warning: The Java and Java SE Runtime Environment installation steps are very important, since Spark is written in Scala and runs on the Java Virtual Machine (JVM).
You can check whether Java is available and find its version by running java -version in Command Prompt. If your Java is installed successfully, you will get results similar to the following:
java version "1.8.0_131"
va(IM) SE Runtime Env:
spot (TM) 64-Bi
onment (build 1.8.0_131-bi1)
Server VM (build 25.131-b11, mi
3.2.4 Install Apache Spark

Actually, the pre-built version does not need installation; you can use it as soon as you unpack it.

a. Download: you can get the pre-built Apache Spark™ from Download Apache Spark™.

b. Unpack: unpack the Apache Spark™ to the path where you want to install Spark.

c. Test: test the prerequisites: change directory to spark-#.#.#-bin-hadoop#.#/bin and run

./pyspark
Python 2.7.13 |Anaconda 4.4.0 (x86_64)| (default, Dec 20 2016, 23:05:08)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException

Using Python version 2.7.13 (default, Dec 20 2016 23:05:08)
SparkSession available as 'spark'.
3.2.5 Configure the Spark

a. Mac operating system: open your bash_profile in Terminal

vim ~/.bash_profile

And add the following lines to your bash_profile (remember to change the path):

export SPARK_HOME=your_spark_installation_path
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

At last, remember to source your bash_profile:

source ~/.bash_profile

b. Ubuntu operating system: open your bashrc in Terminal

vim ~/.bashrc

And add the following lines to your bashrc (remember to change the path):

export SPARK_HOME=your_spark_installation_path
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"

At last, remember to source your bashrc:

source ~/.bashrc
3.3 Configure Spark on Windows

Installing open source software on Windows is always a nightmare for me. Thanks to Deelesh Mandloi, you can follow the detailed procedures in the blog post Getting Started with PySpark on Windows to install Apache Spark™ on your Windows operating system.
3.4 PySpark With Text Editor or IDE

3.4.1 PySpark With Jupyter Notebook

After you finish the setup steps above in Configure Spark on Mac and Ubuntu, you should be good to write and run your PySpark code in a Jupyter notebook.
3.4.2 PySpark With PyCharm

After you finish the setup steps above in Configure Spark on Mac and Ubuntu, you should be good to add PySpark to your PyCharm project.

1. Create a new PyCharm project
2. Go to Project Structure

Option 1: File -> Settings -> Project: -> Project Structure

Option 2: PyCharm -> Preferences -> Project: -> Project Structure
3. Add Content Root: all files from $SPARK_HOME/python/lib
4. Run your script

3.4.3 PySpark With Apache Zeppelin

After you finish the setup steps above in Configure Spark on Mac and Ubuntu, you should be good to write and run your PySpark code in Apache Zeppelin.
3.4.4 PySpark With Sublime Text

After you finish the setup steps above in Configure Spark on Mac and Ubuntu, you should be good to use Sublime Text to write your PySpark code and run it as normal Python code in Terminal:

python test_pyspark.py

Then you should get the output results in your terminal.
3.4.5 PySpark With Eclipse

If you want to run PySpark code in Eclipse, you need to add the paths for the External Libraries to your current project as follows:

1. Open the properties of your project
2. Add the paths for the External Libraries

(Screenshot: the PyDev - PYTHONPATH properties page. The final PYTHONPATH used for a launch is composed of the paths defined here joined with the paths defined by the selected interpreter; /opt/spark/python is added under External Libraries as a source folder.)

And then you should be good to run your code in Eclipse with PyDev.
3.5 PySparkling Water: Spark + H2O

1. Download Sparkling Water from: https://s3.amazonaws.com/h2o-release/sparkling-water/rel-2.4/5/index.html

2. Test PySparkling

unzip sparkling-water-2.4.5.zip
cd ~/sparkling-water-2.4.5/bin
./pysparkling
If you have a correct setup for PySpark, then you will get the following results:

Using Spark defined in SPARK_HOME=/Users/dt216661/spark environmental variable
Python 3.7.1 (default, Dec 14 2018, 13:28:58)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException

Using Python version 3.7.1 (default, Dec 14 2018 13:28:58)
SparkSession available as 'spark'.
3. Set up pysparkling with Jupyter notebook

Add the following alias to your bashrc (Linux systems) or bash_profile (Mac system):

alias sparkling='PYSPARK_DRIVER_PYTHON="jupyter" PYSPARK_DRIVER_PYTHON_OPTS="notebook" ~/sparkling-water-2.4.5/bin/pysparkling'

4. Open pysparkling in terminal

sparkling
3.6 Set up Spark on Cloud

Following the setup steps in Configure Spark on Mac and Ubuntu, you can set up your own cluster on the cloud, for example on AWS or Google Cloud. Actually, those clouds have their own Big Data tools, and you can run them directly without any setup, just like on Databricks Community Cloud. If you want more details, please feel free to contact me.
3.7 PySpark on Colaboratory

Colaboratory is a free Jupyter notebook environment that requires no setup and runs entirely in the cloud.

3.7.1 Installation

!pip install pyspark

3.7.2 Testing

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
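If the SparkSession above was created successfully, a quick sanity check (not part of the original text) is:

spark.range(5).show()

which should print a small one-column DataFrame with the values 0 through 4.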
3.8 Demo Code in this Section
‘The Jupyter notebook can be download from installation on colab,
+ Python Source code
#8 set up SparkSes
from pyspark.sql import SpazkSessior
spark ~ Sparksession \
TET TERT AEST
3.7. PySpark on Colaboratory
31Learning Apache Spark with Python
saved from previous page)
builder \
ppName ("
onfigt
Spark SQl basic example") \
©.config.option",
getorcr
ae om.databricks. spark
options (header= \
nferschema="true').\
le/
o";header=True)
af. show (5)
af-printSchema ()
CHAPTER FOUR

AN INTRODUCTION TO APACHE SPARK
Chinese proverb

Know yourself and know your enemy, and you will never be defeated. — idiom, from Sunzi's Art of War

4.1 Core Concepts

Most of the following content comes from [Kirillov2016], so the copyright belongs to Anton Kirillov. I will refer you to Apache Spark core concepts, architecture and internals for more details.

Before diving deep into how Apache Spark works, let's understand the jargon of Apache Spark (a small sketch after this list maps these terms onto code):
- Job: a piece of code which reads some input from HDFS or local storage, performs some computation on the data and writes some output data.
- Stages: jobs are divided into stages. Stages are classified as Map or Reduce stages (it is easier to understand if you have worked on Hadoop and want to correlate). Stages are divided based on computational boundaries; all computations (operators) cannot be done in a single stage, so the work happens over many stages.
- Tasks: each stage has some tasks, one task per partition. One task is executed on one partition of data on one executor (machine).
- DAG: DAG stands for Directed Acyclic Graph; in the present context it is a DAG of operators.
- Executor: the process responsible for executing a task.
- Master: the machine on which the Driver program runs.
- Slave: the machine on which the Executor program runs.
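The following small sketch (not from the original text; the data is made up) maps the jargon above onto code: the collect() call submits a job, the reduceByKey() shuffle splits that job into two stages, and each stage runs one task per partition on the executors.

# Sketch: jargon mapped onto code (illustrative only)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jargon-sketch").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(100), 4)           # 4 partitions -> 4 tasks per stage
pairs = rdd.map(lambda x: (x % 2, x))         # narrow operator: stays in the same stage
sums = pairs.reduceByKey(lambda a, b: a + b)  # wide operator: shuffle boundary -> new stage
print(sums.collect())                         # action: submits one job whose DAG has 2 stages

spark.stop()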
4.2 Spark Components

1. Spark Driver
   - separate process to execute user applications
   - creates SparkContext to schedule jobs execution and negotiate with the cluster manager

2. Executors
   - run tasks scheduled by the driver
   - store computation results in memory, on disk or off-heap
   - interact with storage systems

3. Cluster Manager
   - Mesos
   - YARN
   - Spark Standalone
The Spark Driver contains more components responsible for the translation of user code into actual jobs executed on the cluster:

(Figure: the user program builds an operator DAG over RDDs; the DAGScheduler in the driver splits the graph into stages of tasks and submits each stage as it becomes ready; the TaskScheduler launches the tasks via the cluster manager and retries failed or straggling tasks; executors run the tasks in threads and store and serve blocks.)
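As a rough illustration (not from the original text; the master URL and memory setting are assumptions), the driver side of this picture is what you create when you build a SparkSession: it owns the SparkContext, and the configuration tells the cluster manager what resources the executors should get.

# Sketch: the driver creates a SparkContext and asks the cluster manager for resources
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("components-sketch")
         .master("local[2]")                      # assumed: local mode standing in for a cluster manager
         .config("spark.executor.memory", "1g")   # assumed: resources requested for executors
         .getOrCreate())

sc = spark.sparkContext    # the driver-side context that schedules jobs
print(sc.master, sc.defaultParallelism)

spark.stop()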
CHAPTER FIVE

PROGRAMMING WITH RDDS
Chinese proverb

If you only know yourself, but not your opponent, you may win or may lose. If you know neither yourself nor your enemy, you will always endanger yourself. — idiom, from Sunzi's Art of War

RDD stands for Resilient Distributed Dataset. An RDD in Spark is simply an immutable distributed collection of objects. Each RDD is split into multiple partitions (a similar pattern with smaller sets), which may be computed on different nodes of the cluster.
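A tiny sketch (not in the original text) makes the partitioning visible:

# Sketch: an RDD is split into partitions that can live on different nodes
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitions-sketch").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3, 4, 5, 6], numSlices=3)  # ask for 3 partitions
print(rdd.getNumPartitions())    # 3
print(rdd.glom().collect())      # e.g. [[1, 2], [3, 4], [5, 6]], one list per partition

spark.stop()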
5.1 Create RDD

Usually, there are two popular ways to create RDDs: loading an external dataset, or distributing a collection of objects. The following examples show some of the simplest ways to create RDDs by using the parallelize() function, which takes an already existing collection in your program and passes it to the SparkContext.

1. By using the parallelize() function

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark create RDD example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

df = spark.sparkContext \
          .parallelize([(1, 2, 3, 'a b c'),
                        (4, 5, 6, 'd e f'),
                        (7, 8, 9, 'g h i')]) \
          .toDF(['col1', 'col2', 'col3', 'col4'])

Then you will get the RDD data:

df.show()

+----+----+----+-----+
|col1|col2|col3| col4|
+----+----+----+-----+
|   1|   2|   3|a b c|
|   4|   5|   6|d e f|
|   7|   8|   9|g h i|
+----+----+----+-----+
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark create RDD example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

myData = spark.sparkContext.parallelize([(1, 2), (3, 4), (5, 6), (7, 8), (9, 10)])

Then you will get the RDD data:

myData.collect()

[(1, 2), (3, 4), (5, 6), (7, 8), (9, 10)]
2. By using the createDataFrame() function

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark create RDD example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

Employee = spark.createDataFrame([
                    ('1', 'Joe',   '70000', '1'),
                    ('2', 'Henry', '80000', '2'),
                    ('3', 'Sam',   '60000', '2'),
                    ('4', 'Max',   '90000', '1')],
                    ['Id', 'Name', 'Salary', 'DepartmentId'])

Then you will get the RDD data:

Employee.show()

+---+-----+------+------------+
| Id| Name|Salary|DepartmentId|
+---+-----+------+------------+
|  1|  Joe| 70000|           1|
|  2|Henry| 80000|           2|
|  3|  Sam| 60000|           2|
|  4|  Max| 90000|           1|
+---+-----+------+------------+
3. By using read and load functions

a. Read dataset from .csv file

## set up SparkSession
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark create RDD example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

df = spark.read.format('com.databricks.spark.csv') \
          .options(header='true', inferschema='true') \
          .load("/home/feng/spark/Code/data/Advertising.csv", header=True)

df.show(5)
df.printSchema()

Then you will get the RDD data:

+---+-----+-----+---------+-----+
|_c0|   TV|Radio|Newspaper|Sales|
+---+-----+-----+---------+-----+
|  1|230.1| 37.8|     69.2| 22.1|
|  2| 44.5| 39.3|     45.1| 10.4|
|  3| 17.2| 45.9|     69.3|  9.3|
|  4|151.5| 41.3|     58.5| 18.5|
|  5|180.8| 10.8|     58.4| 12.9|
+---+-----+-----+---------+-----+
only showing top 5 rows

root
 |-- _c0: integer (nullable = true)
 |-- TV: double (nullable = true)
 |-- Radio: double (nullable = true)
 |-- Newspaper: double (nullable = true)
 |-- Sales: double (nullable = true)
Once created, RDDs offer two types of operations: transformations and actions.
b. Read dataset from database

## set up SparkSession
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark create RDD example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

## User information
user = 'your_username'
pw = 'your_password'

## Database information
table_name = 'table_name'
url = 'jdbc:postgresql://##.###.###.##:5432/dataset?user=' + user + '&password=' + pw
properties = {'driver': 'org.postgresql.Driver', 'password': pw, 'user': user}

df = spark.read.jdbc(url=url, table=table_name, properties=properties)

df.show(5)
df.printSchema()

Then you will get the RDD data:

+---+-----+-----+---------+-----+
|_c0|   TV|Radio|Newspaper|Sales|
+---+-----+-----+---------+-----+
|  1|230.1| 37.8|     69.2| 22.1|
|  2| 44.5| 39.3|     45.1| 10.4|
|  3| 17.2| 45.9|     69.3|  9.3|
|  4|151.5| 41.3|     58.5| 18.5|
|  5|180.8| 10.8|     58.4| 12.9|
+---+-----+-----+---------+-----+
only showing top 5 rows

root
 |-- _c0: integer (nullable = true)
 |-- TV: double (nullable = true)
 |-- Radio: double (nullable = true)
 |-- Newspaper: double (nullable = true)
 |-- Sales: double (nullable = true)

Note: Reading tables from a database needs the proper driver for the corresponding database. For example, the above demo needs org.postgresql.Driver, and you need to download it and put it in the jars folder of your Spark installation path. I downloaded postgresql-42.1.1.jar from the official website and put it in the jars folder.
c. Read dataset from HDFS

from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext('local', 'example')
hc = HiveContext(sc)
tf1 = sc.textFile("hdfs://cdhstltest/user/data/demo.CSV")
print(tf1.first())

spf = hc.sql("SELECT * FROM spf LIMIT 100")
print(spf.show(5))
5.2 Spark Operations

Warning: All the figures below are from Jeffrey Thompson. The interested reader is referred to pyspark pictures.

There are two main types of Spark operations: Transformations and Actions [Karau2015].

(Figure: Spark operations are divided into transformations and actions.)

Note: Some people define three types of operations: Transformations, Actions and Shuffles.
5.2.1 Spark Transformations

Transformations construct a new RDD from a previous one. For example, one common transformation is filtering data that matches a predicate.
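A short sketch (assumed data, not from the original text) shows that a transformation such as filter() only builds a new RDD; nothing is computed until an action is called:

# Sketch: transformations are lazy
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transformations-sketch").getOrCreate()

nums = spark.sparkContext.parallelize(range(10))
evens = nums.filter(lambda x: x % 2 == 0)   # transformation: returns a new RDD
squares = evens.map(lambda x: x * x)        # another transformation, still nothing has run
# no output yet -- see the action example in the next subsection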
(Figure: Essential Core & Intermediate Spark Operations and Essential Core & Intermediate PairRDD Operations, a catalogue of transformations grouped into general, math/statistical, set theory/relational, and data structure operations.)
5.2.2 Spark Actions

Actions, on the other hand, compute a result based on an RDD, and either return it to the driver program or save it to an external storage system (e.g., HDFS).
(Figure: a catalogue of Spark actions.)
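To continue the sketch from the previous subsection (still assumed data, not from the original text), actions are what finally trigger the computation and either return a value to the driver or write to storage:

# Sketch: actions trigger computation
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("actions-sketch").getOrCreate()

nums = spark.sparkContext.parallelize(range(10))
evens = nums.filter(lambda x: x % 2 == 0)

print(evens.count())     # action: returns 5 to the driver
print(evens.collect())   # action: returns [0, 2, 4, 6, 8]
evens.saveAsTextFile("/tmp/evens_demo")   # action: writes to external storage (path is an assumption)

spark.stop()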
5.3 rdd.DataFrame vs pd.DataFrame

5.3.1 Create DataFrame

1. From List

my_list = [['a', 1, 2], ['b', 2, 3], ['c', 3, 4]]
col_name = ['A', 'B', 'C']

# pandas
pd.DataFrame(my_list, columns=col_name)
# PySpark
spark.createDataFrame(my_list, col_name).show()

Comparison:

# pandas
   A  B  C
0  a  1  2
1  b  2  3
2  c  3  4

# PySpark
+---+---+---+
|  A|  B|  C|
+---+---+---+
|  a|  1|  2|
|  b|  2|  3|
|  c|  3|  4|
+---+---+---+

Attention: Pay attention to the parameter columns= in pd.DataFrame: without it, the second positional argument is treated as the row index, so the names in col_name become row labels instead of column names.

# pandas, with column names
pd.DataFrame(my_list, columns=col_name)
# pandas, col_name becomes the row index
pd.DataFrame(my_list, col_name)

Comparison:

   A  B  C
0  a  1  2
1  b  2  3
2  c  3  4

   0  1  2
A  a  1  2
B  b  2  3
C  c  3  4
2. From Dict

d = {'A': [0, 1, 0],
     'B': [1, 0, 1],
     'C': [1, 0, 0]}

# pandas
pd.DataFrame(d)
# PySpark: transpose the dict of lists into rows first
spark.createDataFrame(np.array(list(d.values())).T.tolist(), list(d.keys())).show()

Comparison:

# pandas
   A  B  C
0  0  1  1
1  1  0  0
2  0  1  0

# PySpark
+---+---+---+
|  A|  B|  C|
+---+---+---+
|  0|  1|  1|
|  1|  0|  0|
|  0|  1|  0|
+---+---+---+
5.3.2 Load DataFrame

1. From DataBase

Most of the time, you need to share your code with your colleagues or release your code for Code Review or Quality Assurance (QA). You will definitely not want to have your User Information in the code. So you can save it in login.txt:

runawayhorse001
PythonTips

and use the following code to import your User Information:

try:
    login = pd.read_csv(r'login.txt', header=None)
    user = login[0][0]
    pw = login[0][1]
    print('User information is ready!')
except:
    print('Login information is not available!!!')

# Database information
host = '##.###.###.##'
db_name = 'db_name'
table_name = 'table_name'

Comparison:

# pandas
conn = psycopg2.connect(host=host, database=db_name, user=user, password=pw)
cur = conn.cursor()

sql = """
      select *
      from {table_name}
      """.format(table_name=table_name)
dp = pd.read_sql(sql, conn)

# PySpark
url = 'jdbc:postgresql://' + host + ':5432/' + db_name + '?user=' + user + '&password=' + pw
properties = {'driver': 'org.postgresql.Driver', 'password': pw, 'user': user}
ds = spark.read.jdbc(url=url, table=table_name, properties=properties)

Attention: Reading tables from a database with PySpark needs the proper driver for the corresponding database. For example, the above demo needs org.postgresql.Driver, and you need to download it and put it in the jars folder of your Spark installation path. I downloaded postgresql-42.1.1.jar from the official website and put it in the jars folder.
2. From .csv

# pandas
dp = pd.read_csv('Advertising.csv')
# PySpark
ds = spark.read.csv(path='Advertising.csv',
                    header=True,
                    inferSchema=True)

3. From .json

Data from: http://api.luftdaten.info/static/v1/data.json

# pandas
dp = pd.read_json("data/data.json")
# PySpark
ds = spark.read.json('data/data.json')
Python Code:

# pandas
dp[['id', 'timestamp']].head(4)
# PySpark
ds[['id', 'timestamp']].show(4)

Comparison:

# pandas
           id            timestamp
0  2994551481  2019-02-28 17:23:52
1  2994551482  2019-02-28 17:23:52
2  2994551483  2019-02-28 17:23:52
3  2994551484  2019-02-28 17:23:52

# PySpark
+----------+-------------------+
|        id|          timestamp|
+----------+-------------------+
|2994551481|2019-02-28 17:23:52|
|2994551482|2019-02-28 17:23:52|
|2994551483|2019-02-28 17:23:52|
|2994551484|2019-02-28 17:23:52|
+----------+-------------------+
only showing top 4 rows
5.3.3 First n Rows

Python Code:

# pandas
dp.head(4)
# PySpark
ds.show(4)

Comparison:

# pandas
      TV  Radio  Newspaper  Sales
0  230.1   37.8       69.2   22.1
1   44.5   39.3       45.1   10.4
2   17.2   45.9       69.3    9.3
3  151.5   41.3       58.5   18.5

# PySpark
+-----+-----+---------+-----+
|   TV|Radio|Newspaper|Sales|
+-----+-----+---------+-----+
|230.1| 37.8|     69.2| 22.1|
| 44.5| 39.3|     45.1| 10.4|
| 17.2| 45.9|     69.3|  9.3|
|151.5| 41.3|     58.5| 18.5|
+-----+-----+---------+-----+
only showing top 4 rows
5.3.4 Column Names

Python Code:

# pandas
dp.columns
# PySpark
ds.columns

5.3.5 Data Types

Python Code:

# pandas
dp.dtypes
# PySpark
ds.dtypes

Comparison:

# pandas
TV           float64
Radio        float64
Newspaper    float64
Sales        float64
dtype: object

# PySpark
[('TV', 'double'), ('Radio', 'double'), ('Newspaper', 'double'), ('Sales', 'double')]
5.3.6 Fill Null

my_list = [['male', 1, None], ['female', 2, 3], ['male', 3, 4]]

dp = pd.DataFrame(my_list, columns=['A', 'B', 'C'])
ds = spark.createDataFrame(my_list, ['A', 'B', 'C'])

Python Code:

# pandas
dp.head()
# PySpark
ds.show()

Comparison:

# pandas
        A  B    C
0    male  1  NaN
1  female  2  3.0
2    male  3  4.0

# PySpark
+------+---+----+
|     A|  B|   C|
+------+---+----+
|  male|  1|null|
|female|  2|   3|
|  male|  3|   4|
+------+---+----+

Python Code:

# pandas
dp.fillna(-99)
# PySpark
ds.fillna(-99).show()

Comparison:

# pandas
        A  B     C
0    male  1 -99.0
1  female  2   3.0
2    male  3   4.0

# PySpark
+------+---+---+
|     A|  B|  C|
+------+---+---+
|  male|  1|-99|
|female|  2|  3|
|  male|  3|  4|
+------+---+---+
5.3.7 Replace Values

Python Code:

# pandas: replace in place
dp.A.replace(['male', 'female'], [1, 0], inplace=True)
dp
# PySpark: replace values in the string column A
ds.na.replace(['male', 'female'], ['1', '0']).show()

Comparison:

# pandas
   A  B    C
0  1  1  NaN
1  0  2  3.0
2  1  3  4.0

# PySpark
+---+---+----+
|  A|  B|   C|
+---+---+----+
|  1|  1|null|
|  0|  2|   3|
|  1|  3|   4|
+---+---+----+
5.3.8 Rename Columns

1. Rename all columns

Python Code:

# pandas
dp.columns = ['a', 'b', 'c', 'd']
dp.head(4)
# PySpark
ds.toDF('a', 'b', 'c', 'd').show(4)

Comparison:

# pandas
       a     b     c     d
0  230.1  37.8  69.2  22.1
1   44.5  39.3  45.1  10.4
2   17.2  45.9  69.3   9.3
3  151.5  41.3  58.5  18.5

# PySpark
+-----+----+----+----+
|    a|   b|   c|   d|
+-----+----+----+----+
|230.1|37.8|69.2|22.1|
| 44.5|39.3|45.1|10.4|
| 17.2|45.9|69.3| 9.3|
|151.5|41.3|58.5|18.5|
+-----+----+----+----+
only showing top 4 rows
2. Rename one or more columns

mapping = {'Newspaper': 'C', 'Sales': 'D'}

Python Code:

# pandas
dp.rename(columns=mapping).head(4)
# PySpark
new_names = [mapping.get(col, col) for col in ds.columns]
ds.toDF(*new_names).show(4)

Comparison:

# pandas
      TV  Radio     C     D
0  230.1   37.8  69.2  22.1
1   44.5   39.3  45.1  10.4
2   17.2   45.9  69.3   9.3
3  151.5   41.3  58.5  18.5

# PySpark
+-----+-----+----+----+
|   TV|Radio|   C|   D|
+-----+-----+----+----+
|230.1| 37.8|69.2|22.1|
| 44.5| 39.3|45.1|10.4|
| 17.2| 45.9|69.3| 9.3|
|151.5| 41.3|58.5|18.5|
+-----+-----+----+----+
only showing top 4 rows

Note: You can also use withColumnRenamed to rename one column in PySpark.

Python Code:

ds.withColumnRenamed('Newspaper', 'Paper').show(4)

+-----+-----+-----+-----+
|   TV|Radio|Paper|Sales|
+-----+-----+-----+-----+
|230.1| 37.8| 69.2| 22.1|
| 44.5| 39.3| 45.1| 10.4|
| 17.2| 45.9| 69.3|  9.3|
|151.5| 41.3| 58.5| 18.5|
+-----+-----+-----+-----+
only showing top 4 rows
5.3.9 Drop Columns

drop_name = ['Newspaper', 'Sales']

Python Code:

# pandas
dp.drop(drop_name, axis=1).head(4)
# PySpark
ds.drop(*drop_name).show(4)

Comparison:

# pandas
      TV  Radio
0  230.1   37.8
1   44.5   39.3
2   17.2   45.9
3  151.5   41.3

# PySpark
+-----+-----+
|   TV|Radio|
+-----+-----+
|230.1| 37.8|
| 44.5| 39.3|
| 17.2| 45.9|
|151.5| 41.3|
+-----+-----+
only showing top 4 rows
5.3.10 Filter

dp = pd.read_csv('Advertising.csv')
ds = spark.read.csv(path='Advertising.csv',
                    header=True,
                    inferSchema=True)

Python Code:

# pandas
dp[dp.Newspaper < 20].head(4)
# PySpark
ds[ds.Newspaper < 20].show(4)

Comparison:

# pandas
       TV  Radio  Newspaper  Sales
7   120.2   19.6       11.6   13.2
8     8.6    2.1        1.0    4.8
11  214.7   24.0        4.0   17.4
13   97.5    7.6        7.2    9.7

# PySpark
+-----+-----+---------+-----+
|   TV|Radio|Newspaper|Sales|
+-----+-----+---------+-----+
|120.2| 19.6|     11.6| 13.2|
|  8.6|  2.1|      1.0|  4.8|
|214.7| 24.0|      4.0| 17.4|
| 97.5|  7.6|      7.2|  9.7|
+-----+-----+---------+-----+
only showing top 4 rows

Python Code:

# pandas
dp[(dp.Newspaper < 20) & (dp.TV > 100)].head(4)
# PySpark
ds[(ds.Newspaper < 20) & (ds.TV > 100)].show(4)

Comparison:

# pandas
       TV  Radio  Newspaper  Sales
7   120.2   19.6       11.6   13.2
11  214.7   24.0        4.0   17.4
19  147.3   23.9       19.1   14.6
25  262.9    3.5       19.5   12.0

# PySpark
+-----+-----+---------+-----+
|   TV|Radio|Newspaper|Sales|
+-----+-----+---------+-----+
|120.2| 19.6|     11.6| 13.2|
|214.7| 24.0|      4.0| 17.4|
|147.3| 23.9|     19.1| 14.6|
|262.9|  3.5|     19.5| 12.0|
+-----+-----+---------+-----+
only showing top 4 rows
5.3.11 With New Column

Python Code:

import numpy as np
import pyspark.sql.functions as F

# pandas
dp['tv_norm'] = dp.TV / sum(dp.TV)
dp.head(4)
# PySpark
ds.withColumn('tv_norm', ds.TV / ds.groupBy().agg(F.sum("TV")).collect()[0][0]).show(4)

Comparison:

# pandas
      TV  Radio  Newspaper  Sales   tv_norm
0  230.1   37.8       69.2   22.1  0.007824
1   44.5   39.3       45.1   10.4  0.001513
2   17.2   45.9       69.3    9.3  0.000585
3  151.5   41.3       58.5   18.5  0.005152

# PySpark
+-----+-----+---------+-----+--------------------+
|   TV|Radio|Newspaper|Sales|             tv_norm|
+-----+-----+---------+-----+--------------------+
|230.1| 37.8|     69.2| 22.1|0.007824268493802813|
| 44.5| 39.3|     45.1| 10.4|0.001513167962643...|
| 17.2| 45.9|     69.3|  9.3|5.848649200061207E-4|
|151.5| 41.3|     58.5| 18.5|0.005151571824472517|
+-----+-----+---------+-----+--------------------+
only showing top 4 rows

Python Code:

# pandas
dp['cond'] = dp.apply(lambda c: 1 if ((c.TV > 100) & (c.Radio < 40)) else 2 if c.Sales > 10 else 3, axis=1)
dp.head(4)
# PySpark
ds.withColumn('cond', F.when((ds.TV > 100) & (ds.Radio < 40), 1)\
                       .when(ds.Sales > 10, 2)\
                       .otherwise(3)).show(4)

Comparison:

# pandas
      TV  Radio  Newspaper  Sales  cond
0  230.1   37.8       69.2   22.1     1
1   44.5   39.3       45.1   10.4     2
2   17.2   45.9       69.3    9.3     3
3  151.5   41.3       58.5   18.5     2

# PySpark
+-----+-----+---------+-----+----+
|   TV|Radio|Newspaper|Sales|cond|
+-----+-----+---------+-----+----+
|230.1| 37.8|     69.2| 22.1|   1|
| 44.5| 39.3|     45.1| 10.4|   2|
| 17.2| 45.9|     69.3|  9.3|   3|
|151.5| 41.3|     58.5| 18.5|   2|
+-----+-----+---------+-----+----+
only showing top 4 rows

Python Code:

# pandas
dp['log_tv'] = np.log(dp.TV)
dp.head(4)
# PySpark
ds.withColumn('log_tv', F.log(ds.TV)).show(4)

Comparison:

# pandas
      TV  Radio  Newspaper  Sales    log_tv
0  230.1   37.8       69.2   22.1  5.438514
1   44.5   39.3       45.1   10.4  3.795489
2   17.2   45.9       69.3    9.3  2.844909
3  151.5   41.3       58.5   18.5  5.020586

# PySpark
+-----+-----+---------+-----+------------------+
|   TV|Radio|Newspaper|Sales|            log_tv|
+-----+-----+---------+-----+------------------+
|230.1| 37.8|     69.2| 22.1|  5.43851399700132|
| 44.5| 39.3|     45.1| 10.4|3.7954891891721947|
| 17.2| 45.9|     69.3|  9.3|2.8449093838194073|
|151.5| 41.3|     58.5| 18.5| 5.020585624949423|
+-----+-----+---------+-----+------------------+
only showing top 4 rows