0% found this document useful (0 votes)

83 views36 pages

Apache Spark: CS240A Winter 2016. T Yang

This document provides an overview of Apache Spark, including: - Spark is a cluster computing framework based on the MapReduce model. It uses Resilient Distributed Datasets (RDDs) to distribute data across a cluster. - RDDs allow parallel operations like map, filter, reduce, and join to be performed efficiently. Transformations create new RDDs from existing ones while actions return values to the driver program. - Spark supports programming in Python, Scala, and Java. It can run programs interactively or as standalone applications on a cluster.

Uploaded by

omegapoint077609

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPT, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

83 views36 pages

Apache Spark: CS240A Winter 2016. T Yang

Uploaded by

omegapoint077609

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPT, PDF, TXT or read online on Scribd

You are on page 1/ 36

Apache Spark

CS240A
Winter 2016. T Yang

Some of them are based on P. Wendell’s Spark slides

Parallel Processing using Spark+Hadoop
• Hadoop: Distributed file system that connects machines.
• Mapreduce: parallel programming style built on a Hadoop
cluster
• Spark: Berkeley design of Mapreduce programming
• Given a file treated as a big list
 A file may be divided into multiple parts (splits).
• Each record (line) is processed by a Map function,
 produces a set of intermediate key/value pairs.
• Reduce: combine a set of values for the same key
>>> words = 'The quick brown fox jumps over the lazy dog'.split()

Python Examples and List Comprehension

>>> lst = [3, 1, 4, 1, 5] for i in [5, 4, 3, 2, 1] :

>>>
>>>
lst.append(2)
len(lst)
print i
5
>>> lst.sort()
print 'Blastoff!'
>>> lst.insert(4,"Hello")
>>> [1]+ [2]  [1,2]
>>> lst[0] ->3
>>>M = [x for x in S if x % 2 == 0]
Python tuples
>>> S = [x**2 for x in range(10)]
>>> num=(1, 2, 3, 4) [0,1,4,9,16,…,81]
>>> num +(5)  (1,2,3,4, 5)

>>> words =‘hello lazy dog'.split()
>>> stuff = [(w.upper(), len(w)] for w in words]
 [ (‘HELLO’, 5) (‘LAZY’, 4) , (‘DOG’, 4)]

>>>numset=set([1, 2, 3, 2])
Duplicated entries are deleted
>>>numset=frozenset([1, 2,3])
Such a set cannot be modified
Python map/reduce
a = [1, 2, 3]
b = [4, 5, 6, 7]
c = [8, 9, 1, 2, 3]
f= lambda x: len(x)
L = map(f, [a, b, c])
[3, 4, 5]

g=lambda x,y: x+y

reduce(g, [47,11,42,13])
113
Mapreduce programming with SPAK: key
concept
Write programs in terms of operations on implicitly
distributed datasets (RDD)
RDD
RDD: Resilient Distributed RDD
Datasets
RDD
•Like a big list: RDD
 Collections of objects spread
across a cluster, stored in RAM or Operations
on Disk •Transformations
•Built through parallel (e.g. map, filter,
groupBy)
transformations
•Make sure input/output
•Automatically rebuilt on failure match
MapReduce vs Spark

RDD
RDD
RDD
RDD

Spark operates on RDD

Map and reduce
tasks operate on key-value
pairs
Language Support

Python Standalone Programs

lines
lines =
= sc.textFile(...)
sc.textFile(...) •Python, Scala, & Java
lines.filter(lambda
lines.filter(lambda s:
s: “ERROR”
“ERROR” in
in s).count()
s).count()

Interactive Shells
Scala •Python & Scala
val
val lines
lines =
= sc.textFile(...)
sc.textFile(...)
lines.filter(x
lines.filter(x =>
=> x.contains(“ERROR”)).count()
x.contains(“ERROR”)).count()
Performance
•Java & Scala are faster
Java due to static typing
JavaRDD<String>
JavaRDD<String> lines
lines.filter(new
lines =
= sc.textFile(...);
sc.textFile(...); •…but Python is often fine
lines.filter(new Function<String, Boolean>()
Function<String, Boolean>() {
{
Boolean call(String s) {
Boolean call(String s) {
return
return s.contains(“error”);
s.contains(“error”);
}
}
}).count();
}).count();
Spark Context and Creating RDDs

#Start with sc – SparkContext as

Main entry point to Spark functionality

# Turn a Python collection into an RDD

>sc.parallelize([1, 2, 3])

# Load text file from local FS, HDFS, or S3

>sc.textFile(“file.txt”)
>sc.textFile(“directory/*.txt”)
>sc.textFile(“hdfs://namenode:9000/path/file”)
Spark Architecture
Spark Architecture
Basic Transformations

> nums = sc.parallelize([1, 2, 3])

# Pass each element through a function

> squares = nums.map(lambda x: x*x) // {1, 4, 9}

# Keep elements passing a predicate

> even = squares.filter(lambda x: x % 2 == 0) // {4}

#read a text file and count number of lines

containing error
lines = sc.textFile(“file.log”)
lines.filter(lambda s: “ERROR” in s).count()
Basic Actions

> nums = sc.parallelize([1, 2, 3])

# Retrieve RDD contents as a local collection

> nums.collect() # => [1, 2, 3]
# Return first K elements
> nums.take(2) # => [1, 2]
# Count number of elements
> nums.count() # => 3
# Merge elements with an associative function
> nums.reduce(lambda x, y: x + y) # => 6
# Write elements to a text file
> nums.saveAsTextFile(“hdfs://file.txt”)
Working with Key-Value Pairs
Spark’s “distributed reduce” transformations
operate on RDDs of key-value pairs
Python: pair = (a, b)
pair[0] # => a
pair[1] # => b

Scala: val pair = (a, b)

pair._1 // => a
pair._2 // => b

Java: Tuple2 pair = new Tuple2(a, b);

pair._1 // => a
pair._2 // => b
Some Key-Value Operations

> pets = sc.parallelize(

[(“cat”, 1), (“dog”, 1), (“cat”, 2)])
> pets.reduceByKey(lambda x, y: x + y)
# => {(cat, 3), (dog, 1)}
> pets.groupByKey() # => {(cat, [1, 2]), (dog, [1])}
> pets.sortByKey() # => {(cat, 1), (cat, 2), (dog, 1)}

also automatically implements

reduceByKey
combiners on the map side
Example: Word Count
> lines = sc.textFile(“hamlet.txt”)
> counts = lines.flatMap(lambda line: line.split(“ ”))
.map(lambda word: (word, 1))
.reduceByKey(lambda x, y: x + y)

“to” (to, 1)
(be, 1)(be,1) (be,2)
“to be or” “be” (be, 1)
(not, 1) (not, 1)
“or” (or, 1)

“not” (not, 1) (or, 1)

(or, 1)
“not to be” “to” (to, 1) (to, 2)
(to, 1)(to,1)
“be” (be, 1)
Other Key-Value Operations
> visits = sc.parallelize([ (“index.html”, “1.2.3.4”),
(“about.html”, “3.4.5.6”),
(“index.html”, “1.3.3.1”) ])

> pageNames = sc.parallelize([ (“index.html”, “Home”),

(“about.html”, “About”) ])

> visits.join(pageNames)
# (“index.html”, (“1.2.3.4”, “Home”))
# (“index.html”, (“1.3.3.1”, “Home”))
# (“about.html”, (“3.4.5.6”, “About”))

> visits.cogroup(pageNames)
# (“index.html”, ([“1.2.3.4”, “1.3.3.1”], [“Home”]))
# (“about.html”, ([“3.4.5.6”], [“About”]))
Under The Hood: DAG Scheduler

• General task A: B:
graphs
• Automatically F:
Stage 1 groupBy
pipelines
functions C: D: E:
• Data locality
aware join
• Partitioning
aware Stage 2 map filter Stage 3
to avoid shuffles
= RDD = cached partition
Setting the Level of Parallelism

All the pair RDD operations take an optional second

parameter for number of tasks

> words.reduceByKey(lambda x, y: x + y, 5)
> words.groupByKey(5)
> visits.join(pageViews, 5)
More RDD Operators

• map • reduce sample

• filter • count take
• groupBy • fold first
• sort • reduceByKey partitionBy
• union • groupByKey mapWith
• join • cogroup pipe
• leftOuterJoin • cross save ...
• rightOuterJoin • zip
Interactive Shell

• The Fastest Way to

Learn Spark
• Available in Python
and Scala
• Runs as an
application on an
existing Spark
Cluster…
• OR Can run locally
… or a Standalone Application

import sys
from pyspark import SparkContext

if __name__ == "__main__":
sc = SparkContext( “local”, “WordCount”, sys.argv[0],
None)
lines = sc.textFile(sys.argv[1])

counts = lines.flatMap(lambda s: s.split(“ ”)) \

.map(lambda word: (word, 1)) \
.reduceByKey(lambda x, y: x + y)

counts.saveAsTextFile(sys.argv[2])
Create a SparkContext

import org.apache.spark.SparkContext
Scala

import org.apache.spark.SparkContext._

val sc = new SparkContext(“url”, “name”, “sparkHome”, Seq(“app.jar”))

Spark install List of JARs

Cluster URL, or App
import org.apache.spark.api.java.JavaSparkContext;
path on
path on with app code
local / local[N] name
Java

cluster (to ship)

JavaSparkContext sc = new JavaSparkContext(
“masterUrl”, “name”, “sparkHome”, new String[] {“app.jar”}));
Python

from pyspark import SparkContext

sc = SparkContext(“masterUrl”, “name”, “sparkHome”, [“library.py”]))

Administrative GUIs

http://<Standalone Master>:8080
(by default)
EXAMPLE APPLICATION:
PAGERANK
Google PageRank

Give pages ranks

(scores) based on
links to them
•Links from many
pages  high rank
•Link from a high-rank
page  high rank

Image: en.wikipedia.org/wiki/File:PageRank-hi-res-2.png
PageRank (one definition)

 Model page reputation on the web

n
PR (ti )
PR ( x )  (1  d )  d 
i 1 C (t i )

 i=1,n lists all parents of page x.

 PR(x) is the page rank of each page.
0.4 0.2
 C(t) is the out-degree of t.
0.2
 d is a damping factor . 0.2
0.2
0.4 0.4
Computing PageRank Iteratively

 Effects at each iteration is local. i+1th iteration depends only on

ith iteration
 At iteration i, PageRank for individual nodes can be computed
independently
PageRank using MapReduce

Map: distribute PageRank “credit” to link targets