Pyspark

The document provides an overview of Apache Spark, detailing its architecture, including the roles of the driver application and worker nodes, as well as the concept of Resilient Distributed Datasets (RDDs). It explains how to create RDDs, perform transformations and actions, and introduces Spark SQL and DataFrames for structured data manipulation. Additionally, it highlights the integration of Spark with machine learning through MLlib and the use of the Catalyst optimizer for query planning.


Welcome to Apache Spark

1.1
Architecture
A Spark program consists of a driver application and worker programs.

Worker nodes run on different machines in a cluster, or in local threads.


Data is distributed among workers.

2.1
Spark Context
The SparkContext holds all of the information about the cluster that is needed to run Spark code.

In [1]: from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName('spark-app').setMaster('local[*]')
sc = SparkContext.getOrCreate(conf=conf)

sc

Out[1]:
SparkContext

Spark UI

Version
v2.2.1
Master
local[*]
AppName
spark-app

3.1
Resilient Distributed Dataset
A partitioned collection of objects spread across a cluster, stored in memory or on disk.

4.1
3 ways of creating an RDD

by parallelizing an existing collection


In [2]: array = range(10)
array

Out[2]: range(0, 10)

In [3]: rdd = sc.parallelize(array)


rdd

Out[3]: PythonRDD[1] at RDD at PythonRDD.scala:48

4.2
3 ways of creating an RDD

from files in a storage system


In [4]: titanic = sc.textFile('data/titanic.csv')
titanic

Out[4]: data/titanic.csv MapPartitionsRDD[3] at textFile at <unknown>:0

In [5]: titanic.take(3)

Out[5]: ['PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked',
'1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S',
'2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C']

4.3
3 ways of creating an RDD

by transforming another RDD


In [6]: rdd.map(lambda number: number * 2)

Out[6]: PythonRDD[5] at RDD at PythonRDD.scala:48

In [7]: rdd.map(lambda number: number * 2).collect()

Out[7]: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

4.4
Working with RDDs
Let's create an RDD from a list of numbers, and play with it.

In [8]: rdd = sc.parallelize(range(16), 4)


rdd.cache()

Out[8]: PythonRDD[8] at RDD at PythonRDD.scala:48

5.1
Remember!
An RDD is immutable
In [9]: print(rdd) # prints only info on RDD, no evaluation

PythonRDD[8] at RDD at PythonRDD.scala:48

An RDD is evaluated lazily


In [10]: print(rdd.map(lambda x: x*2)) # a transformation: returns a new RDD, nothing is computed yet

PythonRDD[9] at RDD at PythonRDD.scala:48

Only tracks its lineage so it can reconstruct itself


In [11]: print(rdd.map(lambda num: num + 1).toDebugString()) # check RDD lineage

b'(4) PythonRDD[10] at RDD at PythonRDD.scala:48 []\n | PythonRDD[8] at RDD at PythonRDD.scala:48 []\n | ParallelCollectionRD
D[7] at parallelize at PythonRDD.scala:489 []'

6.1
Spark operations
They come in two types: transformations and actions.

Transformations are lazy (not computed immediately)


Only an action on an RDD will trigger the execution of all the preceding transformations in its lineage.
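As a quick illustration (a small sketch added here, reusing the rdd defined above): chaining transformations returns new RDD objects immediately, and only the closing action forces Spark to evaluate the chain.

# transformations are only recorded, nothing runs yet
doubled_evens = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * 2)
# the action triggers evaluation of the whole chain
doubled_evens.count()  # -> 8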

7.1
Transformations
Transformations shape your dataset

8.1
Filter
Return a new RDD containing only the elements that satisfy a predicate.

Ex : return only even numbers.

In [12]: rdd.filter(lambda x: x % 2 == 0).collect()

Out[12]: [0, 2, 4, 6, 8, 10, 12, 14]

8.2
Map
Return a new RDD by applying a function to each element of this RDD.

Ex : multiply all numbers by 2.

In [13]: rdd.map(lambda x: x * 2).collect()

Out[13]: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30]

8.3
FlatMap
Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.

Ex : emit the list [1, 2, 3] for each element of the rdd variable, then flatten the results into a single RDD.

In [14]: rdd.flatMap(lambda num: [1, 2, 3]).take(6)

Out[14]: [1, 2, 3, 1, 2, 3]

8.4
Distinct
Return a new RDD containing the distinct elements in this RDD.

In [15]: rdd.map(lambda num: 0 if num % 2 == 0 else 1).distinct().collect()

Out[15]: [0, 1]

8.5
Actions
Actions execute the task and associated transformations

9.1
Collect / take
Return a list that contains all of the elements in this RDD.

Note: this method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.

In [16]: rdd.take(5)

Out[16]: [0, 1, 2, 3, 4]

9.2
Count
Return the number of elements in this RDD.

In [17]: rdd.count()

Out[17]: 16

9.3
Reduce
Reduces the elements of this RDD using the specified commutative and associative binary operator.
Currently reduces partitions locally.

Ex : sum all the numbers in the RDD.

In [18]: rdd.reduce(lambda x,y: x + y)

Out[18]: 120

9.4
Key-value transformations
Key/value RDDs are commonly used to perform aggregations, and often we will do some initial ETL
(extract, transform, and load) to get our data into a key/value format.

Key/value RDDs expose new operations (e.g., counting up reviews for each product, grouping
together data with the same key, and grouping together two different RDDs).
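For example (an illustrative sketch, not part of the original slides), a plain RDD can be mapped into (key, value) pairs before applying the pair operations shown next:

# turn a plain RDD of words into (word, 1) pairs, ready for aggregation
words = sc.parallelize(['spark', 'rdd', 'spark', 'sql'])
pairs = words.map(lambda w: (w, 1))
pairs.reduceByKey(lambda x, y: x + y).collect()  # e.g. [('spark', 2), ('rdd', 1), ('sql', 1)]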

10.1
ReduceByKey
Merge the values for each key using an associative and commutative reduce function.

Ex : Add all numbers associated to each key.

In [19]: rdd = sc.parallelize([('a', 1), ('b', 0), ('b', 2), ('a', 5)], 4)
rdd.reduceByKey(lambda x,y: x + y).collect()

Out[19]: [('b', 2), ('a', 6)]

10.2
Join
Return an RDD containing all pairs of elements with matching keys in self and other.

Ex : Add all numbers associated to vowels and consonants.

In [20]: countLetter = sc.parallelize([('a', 1), ('b', 6), ('c', 2), ('a', 5)], 4)
defLetter = sc.parallelize([('a', 'vowel'), ('b', 'consonant'), ('c', 'consonant'), ('d', 'consonant')], 4)
countLetter.join(defLetter).map(lambda x: (x[1][1], x[1][0])).reduceByKey(lambda x,y: x + y).collect()

Out[20]: [('consonant', 8), ('vowel', 6)]

10.3
Wordcount!
In [21]: rdd = sc.textFile('data/lorem.txt')
rdd.flatMap(lambda row: [(r, 1) for r in row.split(' ')]).reduceByKey(lambda x,y: x + y).take(6)

Out[21]: [('Lorem', 1),


('ipsum', 3),
('consectetur', 2),
('elit.', 2),
('est', 4),
('mattis', 5)]

11.1
RDD conclusion
Resilient Distributed Datasets (RDDs) are a distributed collection of immutable JVM objects that allow you to perform calculations very quickly, and they are the backbone of Apache Spark.

12.1
In [22]: sc.stop()

13.1
Combine SQL, streaming, and complex analytics.
Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and
Spark Streaming. You can combine these libraries seamlessly in the same application.

14.1
SparkSQL
This chapter introduces Spark SQL, Spark’s interface for working with structured and semistructured data.

15.1
SparkSession
The entry point to programming Spark with the Dataset and DataFrame API.

In [23]: from pyspark import SparkConf


from pyspark.sql import SparkSession

conf = SparkConf().setAppName('spark-app').setMaster('local[*]')
spark = SparkSession.builder.config(conf=conf).getOrCreate()
spark

Out[23]:
SparkSession - in-memory

SparkContext

Spark UI

Version
v2.2.1
Master
local[*]
AppName
spark-app

16.1
Dataframes
Under the hood, a DataFrame is an RDD composed of Row objects with additional schema information about the types in each column. Row objects are just wrappers around arrays of basic types.

In [24]: titanic = spark.read.option('header', 'true').option('inferSchema', 'true').csv('data/titanic.csv')


titanic.createOrReplaceTempView('titanic')
titanic.show(8)

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass| Name| Sex| Age|SibSp|Parch| Ticket| Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
| 1| 0| 3|Braund, Mr. Owen ...| male|22.0| 1| 0| A/5 21171| 7.25| null| S|
| 2| 1| 1|Cumings, Mrs. Joh...|female|38.0| 1| 0| PC 17599|71.2833| C85| C|
| 3| 1| 3|Heikkinen, Miss. ...|female|26.0| 0| 0|STON/O2. 3101282| 7.925| null| S|
| 4| 1| 1|Futrelle, Mrs. Ja...|female|35.0| 1| 0| 113803| 53.1| C123| S|
| 5| 0| 3|Allen, Mr. Willia...| male|35.0| 0| 0| 373450| 8.05| null| S|
| 6| 0| 3| Moran, Mr. James| male|null| 0| 0| 330877| 8.4583| null| Q|
| 7| 0| 1|McCarthy, Mr. Tim...| male|54.0| 0| 0| 17463|51.8625| E46| S|
| 8| 0| 3|Palsson, Master. ...| male| 2.0| 3| 1| 349909| 21.075| null| S|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
only showing top 8 rows
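The schema Spark inferred for these Row objects can be inspected directly (a small illustrative addition, output omitted here):

# show the column names and inferred types behind each Row
titanic.printSchema()
# a single Row behaves like a lightweight record
titanic.first()['Name']  # -> 'Braund, Mr. Owen Harris'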

16.2
Two ways of interacting
Domain-specific language for structured data manipulation
In [25]: titanic.filter(titanic.Sex == 'male').select(['Name', 'Sex', 'Survived']).show(3)

+--------------------+----+--------+
| Name| Sex|Survived|
+--------------------+----+--------+
|Braund, Mr. Owen ...|male| 0|
|Allen, Mr. Willia...|male| 0|
| Moran, Mr. James|male| 0|
+--------------------+----+--------+
only showing top 3 rows

The sql function on SparkSession runs SQL queries programmatically against registered temporary views

In [26]: spark.sql('SELECT Name, Sex, Survived FROM titanic WHERE Sex = "male"').show(3)

+--------------------+----+--------+
| Name| Sex|Survived|
+--------------------+----+--------+
|Braund, Mr. Owen ...|male| 0|
|Allen, Mr. Willia...|male| 0|
| Moran, Mr. James|male| 0|
+--------------------+----+--------+
only showing top 3 rows

16.3
Unified data source interaction
Spark provides a single interface for reading and saving data, implemented for multiple storage formats: json, parquet, jdbc, orc, libsvm, csv, text.

In [27]: ransomware = spark.read.json('data/ransomware.json')


ransomware.printSchema()

root
|-- comment: string (nullable = true)
|-- decryptor: string (nullable = true)
|-- encryptionAlgorithm: string (nullable = true)
|-- extensionPattern: string (nullable = true)
|-- extensions: string (nullable = true)
|-- iocs: string (nullable = true)
|-- microsoftDetectionName: string (nullable = true)
|-- microsoftInfo: string (nullable = true)
|-- name: array (nullable = true)
| |-- element: string (containsNull = true)
|-- ransomNoteFilenames: string (nullable = true)
|-- resources: array (nullable = true)
| |-- element: string (containsNull = true)
|-- sandbox: string (nullable = true)
|-- screenshots: string (nullable = true)
|-- snort: string (nullable = true)
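The same unified interface covers saving; for instance (an illustrative sketch, the output path is hypothetical), the DataFrame can be written back out in a different format and read again with the matching reader:

# write the same data out as parquet (path is illustrative)
ransomware.write.mode('overwrite').parquet('data/ransomware.parquet')
# read it back with the parquet reader
spark.read.parquet('data/ransomware.parquet').count()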

16.4
Catalyst optimization
Catalyst is an extensible query optimizer used internally by SparkSQL for planning and defining the execution of SparkSQL queries.

In [28]: titanic[titanic['Sex'] == 'male'].select(['Name', 'Sex']).explain()

== Physical Plan ==
*Project [Name#15, Sex#16]
+- *Filter (isnotnull(Sex#16) && (Sex#16 = male))
+- *FileScan csv [Name#15,Sex#16] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/C:/workspaceperso/pyspark-i
nteractive-lecture/notebooks/data/titanic.csv], PartitionFilters: [], PushedFilters: [IsNotNull(Sex), EqualTo(Sex,male)], ReadS
chema: struct<Name:string,Sex:string>

16.5
Machine Learning
MLlib is Spark's machine learning (ML) library. It has an RDD-based API in maintenance mode and a DataFrame-based API.

The DataFrame-based API builds on Spark data sources, SQL/DataFrame queries, Tungsten and Catalyst optimizations, and uniform APIs across languages.
ML Pipelines are a set of high-level APIs on top of DataFrames that help users create and tune practical machine learning pipelines.

17.1
Transformers
A Transformer implements a method transform(), which converts one DataFrame into another.

In [29]: from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol="Sex", outputCol="SexIndex")


titanic_indexed = indexer.fit(titanic).transform(titanic)
titanic_indexed.show(8)

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+--------+
|PassengerId|Survived|Pclass| Name| Sex| Age|SibSp|Parch| Ticket| Fare|Cabin|Embarked|SexIndex|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+--------+
| 1| 0| 3|Braund, Mr. Owen ...| male|22.0| 1| 0| A/5 21171| 7.25| null| S| 0.0|
| 2| 1| 1|Cumings, Mrs. Joh...|female|38.0| 1| 0| PC 17599|71.2833| C85| C| 1.0|
| 3| 1| 3|Heikkinen, Miss. ...|female|26.0| 0| 0|STON/O2. 3101282| 7.925| null| S| 1.0|
| 4| 1| 1|Futrelle, Mrs. Ja...|female|35.0| 1| 0| 113803| 53.1| C123| S| 1.0|
| 5| 0| 3|Allen, Mr. Willia...| male|35.0| 0| 0| 373450| 8.05| null| S| 0.0|
| 6| 0| 3| Moran, Mr. James| male|null| 0| 0| 330877| 8.4583| null| Q| 0.0|
| 7| 0| 1|McCarthy, Mr. Tim...| male|54.0| 0| 0| 17463|51.8625| E46| S| 0.0|
| 8| 0| 3|Palsson, Master. ...| male| 2.0| 3| 1| 349909| 21.075| null| S| 0.0|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+--------+
only showing top 8 rows

17.2
Estimators
An Estimator implements a method fit(), which accepts a DataFrame and produces a Model, which is a Transformer.

In [30]: from pyspark.ml.classification import RandomForestClassifier


from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=["SexIndex", "Fare"], outputCol="features")


titanic_train = assembler.transform(titanic_indexed)

rf = RandomForestClassifier(labelCol="Survived", featuresCol="features", numTrees=10)


model = rf.fit(titanic_train)
model.transform(titanic_train).select(["Survived", "prediction", "probability"]).show(8)

+--------+----------+--------------------+
|Survived|prediction| probability|
+--------+----------+--------------------+
| 0| 0.0|[0.94189369125263...|
| 1| 1.0|[0.21883383407637...|
| 1| 1.0|[0.46619780756453...|
| 1| 1.0|[0.02089552238805...|
| 0| 0.0|[0.87832770415448...|
| 0| 0.0|[0.84656818503583...|
| 0| 0.0|[0.66412205718598...|
| 0| 0.0|[0.86223713039307...|
+--------+----------+--------------------+
only showing top 8 rows

17.3
Pipelines
MLlib standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline, or workflow.

In [31]: from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[indexer, assembler, rf])


model = pipeline.fit(titanic)
model.transform(titanic).select(["Survived", "prediction", "probability"]).show(8)

+--------+----------+--------------------+
|Survived|prediction| probability|
+--------+----------+--------------------+
| 0| 0.0|[0.94189369125263...|
| 1| 1.0|[0.21883383407637...|
| 1| 1.0|[0.46619780756453...|
| 1| 1.0|[0.02089552238805...|
| 0| 0.0|[0.87832770415448...|
| 0| 0.0|[0.84656818503583...|
| 0| 0.0|[0.66412205718598...|
| 0| 0.0|[0.86223713039307...|
+--------+----------+--------------------+
only showing top 8 rows

17.4
Spark Streaming
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant
stream processing of live data streams.

18.1
In [32]: # Prepare a netcat client before launching Spark Streaming
import nclib

#nc = nclib.Netcat(listen=('localhost', 9999), verbose=True)

In [33]: #for i in range(1000):


#nc.send_line(b'hello world')

In [34]: #nc.close()
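For reference, a minimal DStream word count over the socket prepared above might look like the sketch below (illustrative only, not part of the original run; it assumes the netcat source is listening on localhost:9999):

from pyspark.streaming import StreamingContext

# one StreamingContext per SparkContext, here with a 1-second batch interval
ssc = StreamingContext(spark.sparkContext, 1)
lines = ssc.socketTextStream('localhost', 9999)
counts = (lines.flatMap(lambda line: line.split(' '))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda x, y: x + y))
counts.pprint()
# ssc.start(); ssc.awaitTermination()  # uncomment to actually start the stream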

18.2
GraphX
To support graph computation, GraphX extends the Spark RDD by introducing a new Graph abstraction: a
directed multigraph with properties attached to each vertex and edge.

NB: there is no active development of Python bindings for GraphX. Take a look at GraphFrames for graph computation on DataFrames, which is the unofficial DataFrame-based counterpart to GraphX.
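A minimal GraphFrames sketch (assumes the external graphframes package is available, e.g. via --packages; the toy vertices and edges are made up for illustration):

from graphframes import GraphFrame

# vertices need an 'id' column, edges need 'src' and 'dst' columns
vertices = spark.createDataFrame([('a',), ('b',), ('c',)], ['id'])
edges = spark.createDataFrame([('a', 'b'), ('b', 'c')], ['src', 'dst'])
g = GraphFrame(vertices, edges)
g.inDegrees.show()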

19.1
Going further

20.1
Spark packages

20.2
Unified engine
Spark's main contribution is to enable previously disparate cluster workloads to be composed. In the following example, we build a logistic model on the titanic dataset, save it to disk and push it to Spark Streaming for real-time inference.
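A rough sketch of the model-persistence step (illustrative only: the logistic pipeline and the output path are assumptions, reusing the indexer and assembler stages defined earlier):

from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.classification import LogisticRegression

# fit a logistic pipeline on the titanic DataFrame and persist it
lr = LogisticRegression(labelCol='Survived', featuresCol='features')
lr_model = Pipeline(stages=[indexer, assembler, lr]).fit(titanic)
lr_model.write().overwrite().save('models/titanic-lr')
# reload it later, e.g. inside a streaming job, for real-time inference
same_model = PipelineModel.load('models/titanic-lr')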

20.3
In [35]: spark.stop()

21.1
Conclusion

22.1
