Big Data Computing Spark Basics and RDD: Ke Yi
A Brief History
Why is Map/Reduce bad?
• Programming model too restricted
Many specialized systems on top of Hadoop
What is Spark?
Fast and Expressive Cluster Computing
Engine Compatible with Apache Hadoop
Up to 10× faster on disk, 100× in memory; 2-5× less code
Efficient:
• General execution graphs
• In-memory storage
Usable:
• Rich APIs in Java, Scala, Python
• Interactive shell
Spark’s (Short) History
Spark Popularity
Use Memory Instead of Disk
Tech Trend: Cost of Memory
In-Memory Data Sharing
Spark and MapReduce: Differences
Spark Programming
Resilient Distributed Datasets (RDDs)
lines = spark.textFile("hdfs://…")
errors = lines.filter(_.startsWith("Error"))
errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()
Example: If a partition of errors is lost, Spark rebuilds it by applying the filter on only the corresponding partition of lines.
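For readers more familiar with the Python API, here is a minimal PySpark sketch of the same log-mining pipeline; the HDFS path is a hypothetical placeholder and the field index just mirrors the Scala example above.

# PySpark sketch of the log-mining example; the path is a hypothetical placeholder.
lines = sc.textFile("hdfs://namenode/logs.txt")
errors = lines.filter(lambda line: line.startswith("Error"))
hdfsErrors = errors.filter(lambda line: "HDFS" in line)
fields = hdfsErrors.map(lambda line: line.split('\t')[3])   # 4th tab-separated field
fields.collect()   # action: computation happens only here; results are returned to the driver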
lines = sc.textFile("...", 4)            # read the file into an RDD with 4 partitions
comments = lines.filter(isComment)
print lines.count(), comments.count()    # each action re-reads the file from scratch
Caching RDDs
lines = sc.textFile("...", 4)
lines.cache()                            # mark lines to be kept in memory once computed
comments = lines.filter(isComment)
print lines.count(), comments.count()    # the second count reuses the cached data instead of re-reading the file
RDD Persistence
>>> textFile.take(5)
RDD actions and transformations can be used for more complex computations. Let’s say we
want to find the line with the most words:
>>> textFile.map(lambda line: len(line.split())).reduce(lambda a, b: a if (a > b) else b)
22
For multiple commands, you can write them in a .py file and execute it using execfile(). But you will need to add print statements to see the output.
Lazy transformations
Notice that the map transformation returns immediately, while the count() action is what really triggers the work.
By default, each transformed RDD may be recomputed each time you run an
action on it. However, you may also persist an RDD in memory using
the persist (or cache) method, in which case Spark will keep the elements around
on the cluster for much faster access the next time you query it. There is also
support for persisting RDDs on disk, or replicated across multiple nodes.
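For example, a minimal sketch of persisting an RDD at an explicit storage level; MEMORY_AND_DISK is just one illustrative choice of level.

from pyspark import StorageLevel

lines = sc.textFile("README.md")
lineLengths = lines.map(lambda s: len(s))

# Keep the computed partitions in memory, spilling to disk if they do not fit.
lineLengths.persist(StorageLevel.MEMORY_AND_DISK)

lineLengths.count()   # first action computes and persists the partitions
lineLengths.sum()     # later actions reuse the persisted partitions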
RDD Basics
lines = sc.textFile("README.md")
lineLengths = lines.map(lambda s: len(s))
totalLength = lineLengths.reduce(lambda a, b: a + b)
The first line defines a base RDD from an external file. This dataset is not
loaded in memory or otherwise acted on: lines is merely a pointer to the
file. The second line defines lineLengths as the result of
a map transformation. Again, lineLengths is not immediately computed,
due to laziness. Finally, we run reduce, which is an action. At this point
Spark breaks the computation into tasks to run on separate machines, and
each machine runs both its part of the map and a local reduction,
returning only its answer to the driver program.
Self-Contained Applications
Example: http://www.cse.ust.hk/msbd5003/SimpleApp.py
This program just counts the number of lines containing ‘a’ and
the number containing ‘b’ in a text file. We can run this
application using the bin/spark-submit script:
$ ./bin/spark-submit SimpleApp.py
...
Lines with a: 62, Lines with b: 30
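The linked file is the authoritative version; a sketch of what such an application might look like is below (the file name and input path here are illustrative).

# SimpleApp.py -- a sketch; the actual lab file is at the URL above.
from pyspark import SparkContext

sc = SparkContext(appName="SimpleApp")
logData = sc.textFile("README.md").cache()   # hypothetical input text file

numAs = logData.filter(lambda line: 'a' in line).count()
numBs = logData.filter(lambda line: 'b' in line).count()

print("Lines with a: %i, Lines with b: %i" % (numAs, numBs))
sc.stop()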
• To install Spark in standalone mode, you simply place a compiled version of Spark on each node of the cluster.
• You can start a standalone master server by executing:
./sbin/start-master.sh
• Start one or more slaves by executing
./sbin/start-slave.sh <master-spark-url>
• Finally, remember to shut down the master and workers using the following
scripts when they are not needed:
sbin/stop-master.sh
sbin/stop-slave.sh
• Note: Only one master/worker can run on the same machine, but a machine can
be both a master and a worker.
• Now you can submit your application to the cluster. Example:
./bin/spark-submit --master spark://hostname:7077 pi.py 20
• You can monitor the cluster’s status on the master at port 8080.
• Job status can be monitored at driver at port 4040, 4041, …
Where code runs
Most Python code runs in the driver, except for the code passed to transformations. Transformations run at the executors; actions run at the executors and the driver.
Example: Let’s say you want to combine two RDDs: a, b.
Recall that rdd.collect() returns a list, and in Python you can combine two lists with +.
A naïve implementation would be:
>>> a = RDDa.collect()
>>> b = RDDb.collect()
>>> RDDc = sc.parallelize(a+b)
Where does this code run?
In the first two lines, all the distributed data of RDDa and RDDb is sent to the driver. What if RDDa and/or RDDb is very large? The driver could run out of memory, and it also takes a long time to send the data to the driver.
In the third line, all the data is sent from the driver back to the executors.
The correct way:
>>> RDDc = RDDa.union(RDDb)
This runs completely at executors.
The Jupyter Notebook
• Problem:
– Input: an array A of n numbers (unordered), and k
– Output: the k-th smallest number (counting from 0)
• Algorithm (see the Python sketch after the steps):
1. x = A[0]
2. partition A into A[0..mid-1] < A[mid] = x < A[mid+1..n-1]
3. if mid = k then return x
4. if k < mid then A = A[0..mid-1]
   if k > mid then A = A[mid+1..n-1], k = k - mid - 1
5. go to step 1
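A plain-Python sketch of these steps (the function name is illustrative; the equal group in the partition generalizes step 4 to duplicate values):

def kth_smallest(A, k):
    """Return the k-th smallest element of A (counting from 0)."""
    while True:
        x = A[0]                              # step 1: pick a pivot
        smaller = [a for a in A if a < x]     # step 2: three-way partition
        equal   = [a for a in A if a == x]
        larger  = [a for a in A if a > x]
        mid = len(smaller)
        if mid <= k < mid + len(equal):       # step 3
            return x
        if k < mid:                           # step 4: keep only the relevant side
            A = smaller
        else:
            A = larger
            k = k - mid - len(equal)
        # step 5: loop again

print(kth_smallest([7, 2, 9, 4, 4, 1, 8], 3))   # prints 4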
Why didn’t it work?
Examples
Lab: PMI Computation
• PMI (pointwise mutual information) is a measure of association used in information theory and statistics.
• Given a list of pairs (x, y), compute PMI(x, y) = log( p(x, y) / (p(x) p(y)) ), where
  p(x) = probability of x,
  p(y) = probability of y,
  p(x, y) = joint probability of x and y.
• Example:
  x   y   p(x, y)
  0   0   0.1
  0   1   0.7
  1   0   0.15
  1   1   0.05
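A rough PySpark sketch of one way to compute PMI from an RDD of (x, y) pairs; the toy input and variable names are illustrative, not the official lab solution.

import math

# pairs: an RDD of (x, y) tuples; a toy example here -- the lab input would come from a file
pairs = sc.parallelize([(0, 0), (0, 1), (0, 1), (1, 0), (1, 1), (0, 1)])
n = float(pairs.count())

# empirical probabilities p(x), p(y), p(x, y) from the counts
px = pairs.map(lambda p: (p[0], 1)).reduceByKey(lambda a, b: a + b) \
          .mapValues(lambda c: c / n).collectAsMap()
py = pairs.map(lambda p: (p[1], 1)).reduceByKey(lambda a, b: a + b) \
          .mapValues(lambda c: c / n).collectAsMap()
pxy = pairs.map(lambda p: (p, 1)).reduceByKey(lambda a, b: a + b) \
           .mapValues(lambda c: c / n)

# PMI(x, y) = log( p(x, y) / (p(x) p(y)) )
pmi = pxy.map(lambda kv: (kv[0], math.log(kv[1] / (px[kv[0][0]] * py[kv[0][1]]))))
print(pmi.collect())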
Lab: k-Means Clustering
The algorithm:
1. Choose k points from the input points randomly. These
points represent initial group centroids.
2. Assign each point to the closest centroid.
3. When all points have been assigned, recalculate the
positions of the k centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move.
This produces a separation of the objects into groups
from which the metric to be minimized can be
calculated.
See example at http://shabal.in/visuals/kmeans/6.html
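A compact PySpark sketch of these steps, modelled loosely on the standard Spark k-means example; the random toy input, the value of k, and the convergence threshold are all illustrative assumptions.

import numpy as np

def closest(p, centroids):
    # index of the centroid nearest to point p (squared Euclidean distance)
    return min(range(len(centroids)), key=lambda i: float(np.sum((p - centroids[i]) ** 2)))

# Toy input: an RDD of 2-D points; the real lab would load these from a file.
points = sc.parallelize([np.array(p) for p in np.random.rand(1000, 2)]).cache()
k = 3
centroids = points.takeSample(False, k)              # step 1: k random initial centroids

while True:
    # step 2: assign each point to its closest centroid
    assigned = points.map(lambda p: (closest(p, centroids), (p, 1)))
    # step 3: new centroid = mean of the points assigned to it
    sums = assigned.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    newCentroids = sums.mapValues(lambda s: s[0] / s[1]).collectAsMap()
    shift = sum(float(np.sum((centroids[i] - c) ** 2)) for i, c in newCentroids.items())
    centroids = [newCentroids.get(i, centroids[i]) for i in range(k)]
    if shift < 1e-9:                                  # step 4: stop when centroids no longer move
        break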
Lab: PageRank
• Algorithm:
– Initialize all PR’s to 1
– Iteratively compute PR(u) ← 0.15 × 1 + 0.85 × Σ_{v→u} PR(v) / outdegree(v)
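A PySpark sketch of this iteration, patterned after the standard Spark PageRank example; the toy graph and the fixed iteration count are illustrative assumptions.

# links: (page, list of outgoing neighbours); a toy graph for illustration
links = sc.parallelize([('a', ['b', 'c']), ('b', ['c']), ('c', ['a'])]).cache()
ranks = links.mapValues(lambda neighbours: 1.0)      # initialize all PR's to 1

for _ in range(10):                                   # a fixed number of iterations, for simplicity
    # each page v sends PR(v) / outdegree(v) to every neighbour u
    contribs = links.join(ranks).flatMap(
        lambda pd: [(u, pd[1][1] / len(pd[1][0])) for u in pd[1][0]])
    # PR(u) = 0.15 * 1 + 0.85 * (sum of contributions received by u)
    ranks = contribs.reduceByKey(lambda a, b: a + b).mapValues(lambda s: 0.15 + 0.85 * s)

print(ranks.collect())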