PySpark
Architecture
A Spark program consists of a driver application and worker programs
Spark Context
The SparkContext holds the connection to the cluster and the configuration needed to run Spark code.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName('spark-app').setMaster('local[*]')
sc = SparkContext.getOrCreate(conf=conf)
sc
Out[1]:
SparkContext (Spark UI)
  Version: v2.2.1
  Master:  local[*]
  AppName: spark-app
Resilient Distributed Dataset
A partitioned collection of objects spread across a cluster, stored in memory or on disk.
3 ways of creating an RDD
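The code cells for this slide did not survive the export; a minimal sketch of three common ways, assuming the Titanic CSV sits at data/titanic.csv (the path used later in the deck):

# 1. Parallelize an in-memory collection
nums = sc.parallelize(range(16), 4)

# 2. Load a text file from local disk (or HDFS, S3, ...)
titanic = sc.textFile('data/titanic.csv')

# 3. Transform an existing RDD into a new one
evens = nums.filter(lambda x: x % 2 == 0)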
In [5]: titanic.take(3)
Out[5]: ['PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked',
'1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S',
'2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C']
Working with RDDs
Let's create an RDD from a list of numbers and play with it.
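The creation cell is missing from this export; given the later outputs (16 elements, 4 partitions, a ParallelCollectionRDD in the lineage), it was presumably something like:

rdd = sc.parallelize(range(16), 4)  # the integers 0..15, split into 4 partitions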
Remember!
An RDD is immutable.
In [9]: print(rdd) # prints only info on RDD, no evaluation
b'(4) PythonRDD[10] at RDD at PythonRDD.scala:48 []\n | PythonRDD[8] at RDD at PythonRDD.scala:48 []\n | ParallelCollectionRDD[7] at parallelize at PythonRDD.scala:489 []'
Spark operations
They come in two types: transformations and actions.
Transformations
Transformations shape your dataset
Filter
Return a new RDD containing only the elements that satisfy a predicate.
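No example cell survived for this slide; a minimal sketch, assuming rdd = sc.parallelize(range(16), 4) as above:

rdd.filter(lambda x: x % 2 == 0).collect()
# [0, 2, 4, 6, 8, 10, 12, 14]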
Map
Return a new RDD by applying a function to each element of this RDD.
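The input cell is missing from this export; a call consistent with the output below, assuming the same rdd of 0..15, would be:

rdd.map(lambda x: x * 2).collect()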
Out[13]: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30]
FlatMap
Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.
Example: return a row [1, 2, 3] for each element of the rdd variable, then flatten the result.
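The input cell is missing; a call consistent with the output below (note the take(6)) could be:

rdd.flatMap(lambda x: [1, 2, 3]).take(6)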
Out[14]: [1, 2, 3, 1, 2, 3]
Distinct
Return a new RDD containing the distinct elements in this RDD.
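The input cell is missing; a call consistent with the output below could be:

rdd.map(lambda x: x % 2).distinct().collect()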
Out[15]: [0, 1]
Actions
Actions trigger the execution of the job and its associated transformations.
Collect / take
Return a list that contains all of the elements in this RDD.
Note this method should only be used if the resulting array is expected to be small, as all the data is loaded
into the driver’s memory
In [16]: rdd.take(5)
Out[16]: [0, 1, 2, 3, 4]
Count
Return the number of elements in this RDD.
In [17]: rdd.count()
Out[17]: 16
Reduce
Reduces the elements of this RDD using the specified commutative and associative binary operator.
Currently reduces partitions locally.
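The input cell is missing; a call consistent with the output below (0 + 1 + ... + 15 = 120) would be:

rdd.reduce(lambda x, y: x + y)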
Out[18]: 120
Key-value transformations
Key/value RDDs are commonly used to perform aggregations, and often we will do some initial ETL
(extract, transform, and load) to get our data into a key/value format.
Key/value RDDs expose new operations (e.g., counting up reviews for each product, grouping
together data with the same key, and grouping together two different RDDs).
ReduceByKey
Merge the values for each key using an associative and commutative reduce function.
In [19]: rdd = sc.parallelize([('a', 1), ('b', 0), ('b', 2), ('a', 5)], 4)
rdd.reduceByKey(lambda x,y: x + y).collect()
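The output cell is not shown; summing the values per key gives [('a', 6), ('b', 2)] (element order may vary across partitions).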
Join
Return an RDD containing all pairs of elements with matching keys in self and other.
In [20]: countLetter = sc.parallelize([('a', 1), ('b', 6), ('c', 2), ('a', 5)], 4)
defLetter = sc.parallelize([('a', 'vowel'), ('b', 'consonant'), ('c', 'consonant'), ('d', 'consonant')], 4)
countLetter.join(defLetter).map(lambda x: (x[1][1], x[1][0])).reduceByKey(lambda x,y: x + y).collect()
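The output cell is not shown; the join keeps keys a, b and c, the map swaps each pair to (definition, count), and the final reduceByKey yields [('vowel', 6), ('consonant', 8)] (order may vary).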
Wordcount!
In [21]: rdd = sc.textFile('data/lorem.txt')
rdd.flatMap(lambda row: [(r, 1) for r in row.split(' ')]).reduceByKey(lambda x,y: x + y).take(6)
RDD conclusion
Resilient Distributed Datasets (RDDs) are distributed collections of immutable JVM objects that allow you to perform calculations very quickly; they are the backbone of Apache Spark.
In [22]: sc.stop()
Combine SQL, streaming, and complex analytics.
Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and
Spark Streaming. You can combine these libraries seamlessly in the same application.
SparkSQL
This chapter introduces Spark SQL, Spark’s interface for working with structured and semistructured data.
SparkSession
The entry point to programming Spark with the Dataset and DataFrame API.
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf().setAppName('spark-app').setMaster('local[*]')
spark = SparkSession.builder.config(conf=conf).getOrCreate()
spark
Out[23]:
SparkSession - in-memory
SparkContext (Spark UI)
  Version: v2.2.1
  Master:  local[*]
  AppName: spark-app
Dataframes
Under the hood, a DataFrame is an RDD composed of Row objects with additional schema information about the types in each column. Row objects are just wrappers around arrays of basic types.
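The cell that builds the titanic DataFrame is not in this export; a sketch consistent with the CSV path shown later in the Catalyst plan (data/titanic.csv) would be:

titanic = spark.read.csv('data/titanic.csv', header=True, inferSchema=True)
titanic.show(8)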
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass| Name| Sex| Age|SibSp|Parch| Ticket| Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
| 1| 0| 3|Braund, Mr. Owen ...| male|22.0| 1| 0| A/5 21171| 7.25| null| S|
| 2| 1| 1|Cumings, Mrs. Joh...|female|38.0| 1| 0| PC 17599|71.2833| C85| C|
| 3| 1| 3|Heikkinen, Miss. ...|female|26.0| 0| 0|STON/O2. 3101282| 7.925| null| S|
| 4| 1| 1|Futrelle, Mrs. Ja...|female|35.0| 1| 0| 113803| 53.1| C123| S|
| 5| 0| 3|Allen, Mr. Willia...| male|35.0| 0| 0| 373450| 8.05| null| S|
| 6| 0| 3| Moran, Mr. James| male|null| 0| 0| 330877| 8.4583| null| Q|
| 7| 0| 1|McCarthy, Mr. Tim...| male|54.0| 0| 0| 17463|51.8625| E46| S|
| 8| 0| 3|Palsson, Master. ...| male| 2.0| 3| 1| 349909| 21.075| null| S|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
only showing top 8 rows
Two ways of interacting
Domain-specific language for structured data manipulation:
In [25]: titanic.filter(titanic.Sex == 'male').select(['Name', 'Sex', 'Survived']).show(3)
+--------------------+----+--------+
| Name| Sex|Survived|
+--------------------+----+--------+
|Braund, Mr. Owen ...|male| 0|
|Allen, Mr. Willia...|male| 0|
| Moran, Mr. James|male| 0|
+--------------------+----+--------+
only showing top 3 rows
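For the SQL form below to work, the DataFrame must first be registered as a view; that cell is not in the export, but it is presumably something like:

titanic.createOrReplaceTempView('titanic')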
In [26]: spark.sql('SELECT Name, Sex, Survived FROM titanic WHERE Sex = "male"').show(3)
+--------------------+----+--------+
| Name| Sex|Survived|
+--------------------+----+--------+
|Braund, Mr. Owen ...|male| 0|
|Allen, Mr. Willia...|male| 0|
| Moran, Mr. James|male| 0|
+--------------------+----+--------+
only showing top 3 rows
Unified data source interaction
Spark provides a unified interface for reading and saving data, implemented for multiple data storage formats: json, parquet, jdbc, orc, libsvm, csv, text.
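The read cell is missing; judging from the schema below it loads a JSON dataset, e.g. (the file name here is a hypothetical placeholder):

df = spark.read.json('data/ransomware.json')  # hypothetical path
df.printSchema()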
root
|-- comment: string (nullable = true)
|-- decryptor: string (nullable = true)
|-- encryptionAlgorithm: string (nullable = true)
|-- extensionPattern: string (nullable = true)
|-- extensions: string (nullable = true)
|-- iocs: string (nullable = true)
|-- microsoftDetectionName: string (nullable = true)
|-- microsoftInfo: string (nullable = true)
|-- name: array (nullable = true)
| |-- element: string (containsNull = true)
|-- ransomNoteFilenames: string (nullable = true)
|-- resources: array (nullable = true)
| |-- element: string (containsNull = true)
|-- sandbox: string (nullable = true)
|-- screenshots: string (nullable = true)
|-- snort: string (nullable = true)
Catalyst optimization
Catalyst is an extensible query optimizer used internally by SparkSQL for planning and defining the execution of SparkSQL queries.
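The plan below was presumably produced with DataFrame.explain(); a query consistent with it would be:

titanic.filter(titanic.Sex == 'male').select('Name', 'Sex').explain()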
== Physical Plan ==
*Project [Name#15, Sex#16]
+- *Filter (isnotnull(Sex#16) && (Sex#16 = male))
+- *FileScan csv [Name#15,Sex#16] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/C:/workspaceperso/pyspark-interactive-lecture/notebooks/data/titanic.csv], PartitionFilters: [], PushedFilters: [IsNotNull(Sex), EqualTo(Sex,male)], ReadSchema: struct<Name:string,Sex:string>
Machine Learning
MLlib is Spark's machine learning (ML) library. It has an RDD-based API (now in maintenance mode) and a DataFrame-based API.
The DataFrame API brings Spark data sources, SQL/DataFrame queries, Tungsten and Catalyst optimizations, and uniform APIs across languages.
ML Pipelines are a set of high-level APIs on top of DataFrames that help users create and tune practical machine learning pipelines.
Transformers
A Transformer implements a method transform(), which converts one DataFrame into another
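The cell is not shown; the SexIndex column below looks like the output of a fitted StringIndexer (an Estimator whose fitted model is a Transformer), e.g.:

from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol='Sex', outputCol='SexIndex')
indexer.fit(titanic).transform(titanic).show(8)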
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+--------+
|PassengerId|Survived|Pclass| Name| Sex| Age|SibSp|Parch| Ticket| Fare|Cabin|Embarked|SexIndex|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+--------+
| 1| 0| 3|Braund, Mr. Owen ...| male|22.0| 1| 0| A/5 21171| 7.25| null| S| 0.0|
| 2| 1| 1|Cumings, Mrs. Joh...|female|38.0| 1| 0| PC 17599|71.2833| C85| C| 1.0|
| 3| 1| 3|Heikkinen, Miss. ...|female|26.0| 0| 0|STON/O2. 3101282| 7.925| null| S| 1.0|
| 4| 1| 1|Futrelle, Mrs. Ja...|female|35.0| 1| 0| 113803| 53.1| C123| S| 1.0|
| 5| 0| 3|Allen, Mr. Willia...| male|35.0| 0| 0| 373450| 8.05| null| S| 0.0|
| 6| 0| 3| Moran, Mr. James| male|null| 0| 0| 330877| 8.4583| null| Q| 0.0|
| 7| 0| 1|McCarthy, Mr. Tim...| male|54.0| 0| 0| 17463|51.8625| E46| S| 0.0|
| 8| 0| 3|Palsson, Master. ...| male| 2.0| 3| 1| 349909| 21.075| null| S| 0.0|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+--------+
only showing top 8 rows
Estimators
An Estimator implements a method fit(), which accepts a DataFrame and produces a Model, which is a Transformer.
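The training cell is missing; a sketch consistent with the output below, assuming a 'features' vector column has already been assembled (e.g. with VectorAssembler from columns such as Pclass, SexIndex, Age, Fare):

from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(labelCol='Survived', featuresCol='features')
model = lr.fit(prepared)  # 'prepared' is a hypothetical DataFrame holding the features column
model.transform(prepared).select('Survived', 'prediction', 'probability').show(8)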
+--------+----------+--------------------+
|Survived|prediction| probability|
+--------+----------+--------------------+
| 0| 0.0|[0.94189369125263...|
| 1| 1.0|[0.21883383407637...|
| 1| 1.0|[0.46619780756453...|
| 1| 1.0|[0.02089552238805...|
| 0| 0.0|[0.87832770415448...|
| 0| 0.0|[0.84656818503583...|
| 0| 0.0|[0.66412205718598...|
| 0| 0.0|[0.86223713039307...|
+--------+----------+--------------------+
only showing top 8 rows
Pipelines
MLlib standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline, or workflow.
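The pipeline cell is not shown; a sketch with hypothetical stage names (an indexer, a feature assembler and the logistic regression above):

from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[indexer, assembler, lr])  # hypothetical stages
pipeline_model = pipeline.fit(titanic)
pipeline_model.transform(titanic).select('Survived', 'prediction', 'probability').show(8)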
+--------+----------+--------------------+
|Survived|prediction| probability|
+--------+----------+--------------------+
| 0| 0.0|[0.94189369125263...|
| 1| 1.0|[0.21883383407637...|
| 1| 1.0|[0.46619780756453...|
| 1| 1.0|[0.02089552238805...|
| 0| 0.0|[0.87832770415448...|
| 0| 0.0|[0.84656818503583...|
| 0| 0.0|[0.66412205718598...|
| 0| 0.0|[0.86223713039307...|
+--------+----------+--------------------+
only showing top 8 rows
Spark Streaming
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant
stream processing of live data streams.
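The cell that actually starts the stream is not in this export (the notebook drives it with a netcat client and a launchSparkStreaming helper, see below); a minimal DStream sketch of the classic socket word count, assuming a netcat server listening on localhost:9999:

from pyspark.streaming import StreamingContext

ssc = StreamingContext(spark.sparkContext, batchDuration=1)  # 1-second micro-batches
lines = ssc.socketTextStream('localhost', 9999)
counts = lines.flatMap(lambda l: l.split(' ')) \
              .map(lambda w: (w, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()
# ssc.start(); ssc.awaitTermination()  # uncomment to actually run the stream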
In [32]: # Prepare a netcat client before calling launchSparkStreaming
import nclib
In [34]: #nc.close()
GraphX
To support graph computation, GraphX extends the Spark RDD by introducing a new Graph abstraction: a
directed multigraph with properties attached to each vertex and edge.
Going further
Spark packages
Unified engine
Spark's main contribution is to enable previously disparate cluster workloads to be composed. In the following example, we build a logistic regression model on the Titanic dataset, save it to disk, and push it to Spark Streaming for real-time inference.
In [35]: spark.stop()
Conclusion