Apache Spark

Chapter 6 discusses batch processing with Apache Spark, highlighting its advantages over traditional MapReduce, particularly in handling iterative jobs and reducing disk I/O through in-memory processing. It introduces key concepts such as Resilient Distributed Datasets (RDDs) and DataFrames, emphasizing their fault tolerance and efficiency in data manipulation. The chapter also covers operations, transformations, and actions in Spark programming, showcasing how to work with data in a more streamlined manner.


Chapter 6

Batch processing - part 2


Apache Spark
A unified analytics engine for large-scale data processing
MapReduce: Iterative jobs
• Iterative jobs involve a lot of disk I/O for each repetition

• → Disk I/O is very slow!

[Figure: storage and network hierarchy: CPU to RAM ~10 GB/s; SSD ~600 MB/s, ~0.1 ms random access, ~$0.35 per GB; disk ~100 MB/s, 3-12 ms random access, ~$0.025 per GB; network ~1 Gb/s (125 MB/s) to nodes in the same rack, ~0.1 Gb/s to nodes in another rack]

RAM is the new disk

A unified analytics engine for large-scale
data processing
• Better support for
• Iterative algorithms
• Interactive data mining
• Fault tolerance, data locality, scalability
• Hide complexity: users do not have to code the distributed mechanics themselves

Memory instead of disk

[Figure: iterative MapReduce jobs read from and write to HDFS between steps, while Spark keeps the intermediate data in memory]

Spark and MapReduce differences

                   Apache Hadoop MR       Apache Spark
Storage            Disk only              In-memory or on disk
Operations         Map and Reduce         Many transformations and actions, including Map and Reduce
Execution model    Batch                  Batch, iterative, streaming
Languages          Java                   Scala, Java, Python and R

Apache Spark vs Apache Hadoop

https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
Resilient Distributed Dataset (RDD)
• RDDs are fault-tolerant, parallel data structures that
let users explicitly persist intermediate results in
memory, control their partitioning to optimize data
placement, and manipulate them using a rich set of
operators.
• Coarse-grained transformations vs. fine-grained updates:
• transformations (e.g., map, filter, and join) apply the same operation to many data items at once.
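To make "coarse-grained" concrete, a minimal PySpark sketch (the collection and variable names are invented for this illustration):

# Minimal sketch: each transformation applies one operation to every item of the dataset
nums = sc.parallelize([1, 2, 3, 4, 5])          # distributed collection
doubled = nums.map(lambda x: x * 2)             # map: same function on every item
big = doubled.filter(lambda x: x > 4)           # filter: same predicate on every item
print(big.collect())                            # [6, 8, 10]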

More partitions = more parallelism

[Figure: an RDD (here 25 items) is split into partitions; each executor (Ex) running on a worker (W) holds one or more partitions and processes them in parallel]
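A minimal sketch of controlling the number of partitions (the data and partition count are chosen only for illustration):

# Minimal sketch: explicitly choose the number of partitions when creating an RDD
rdd = sc.parallelize(range(25), 5)     # ask for 5 partitions
print(rdd.getNumPartitions())          # 5
print(rdd.glom().collect())            # one list of items per partition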
[Figure: logLinesRDD, an RDD with 4 partitions, each holding log lines such as (Error, ts, msg1), (Warn, ts, msg2), (Info, ts, msg8), ...]

A base RDD can be created 2 ways:

- Parallelize a collection
- Read data from an external source (S3, C*, HDFS, etc)
Parallelize

• Take an existing in-memory collection and pass it to SparkContext's parallelize method
• Not generally used outside of prototyping and testing since it requires the entire dataset in memory on one machine

// Parallelize in Scala
val wordsRDD = sc.parallelize(List("fish", "cats", "dogs"))

# Parallelize in Python
wordsRDD = sc.parallelize(["fish", "cats", "dogs"])

// Parallelize in Java
JavaRDD<String> wordsRDD = sc.parallelize(Arrays.asList("fish", "cats", "dogs"));
Read from Text File

• There are other methods to read data from HDFS, C*, S3, HBase, etc.

// Read a local txt file in Scala
val linesRDD = sc.textFile("/path/to/README.md")

# Read a local txt file in Python
linesRDD = sc.textFile("/path/to/README.md")

// Read a local txt file in Java
JavaRDD<String> lines = sc.textFile("/path/to/README.md");
Operations on Distributed Data
• Two types of operations: transformations and actions
• Transformations are lazy (not computed immediately)
• Transformations are executed when an action is run
• Persist (cache) distributed data in memory or disk
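To make the laziness concrete, a minimal PySpark sketch (the log file path and filter predicate are hypothetical):

# Minimal sketch: transformations are lazy, actions trigger execution
logLinesRDD = sc.textFile("/path/to/log.txt")                  # nothing is read yet
errorsRDD = logLinesRDD.filter(lambda line: "Error" in line)   # still nothing computed
errorsRDD.persist()                                            # mark the result for caching
print(errorsRDD.count())                                       # action: the file is read and filtered now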
Transformation: Filter

[Figure: logLinesRDD (input/base RDD) contains Error, Warn and Info log lines spread over 4 partitions; applying .filter( λ ) keeps only the Error lines in each partition, producing errorsRDD]


Action: Collect

[Figure: errorsRDD is reduced to 2 partitions with .coalesce( 2 ), producing cleanedRDD; .collect( ) then brings all remaining Error lines back to the Driver]
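A minimal PySpark sketch of the same pipeline (the file path and message format are assumptions for illustration):

# Minimal sketch: coalesce reduces the number of partitions, collect returns results to the driver
errorsRDD = sc.textFile("/path/to/log.txt").filter(lambda line: line.startswith("Error"))
cleanedRDD = errorsRDD.coalesce(2)   # shrink to 2 partitions
results = cleanedRDD.collect()       # action: list of error lines on the driver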
DAG execution

[Figure: calling .collect( ) on the driver triggers execution of the whole DAG]

Logical plan:

logLinesRDD
.filter( λ )
errorsRDD
.coalesce( 2 )
cleanedRDD
.collect( )
Driver

Physical plan:

[Figure: the logical chain logLinesRDD -> errorsRDD -> cleanedRDD is compiled into per-partition tasks whose results are returned to the Driver]
DAG

[Figure: cleanedRDD (derived from logLinesRDD via errorsRDD) is reused by several operations: .saveAsTextFile( ) writes its error lines out, while .filter( λ ) keeps only the msg1 errors to form errorMsg1RDD, on which .count( ) and .collect( ) are then called]
Cache

[Figure: same DAG as before, but errorsRDD.cache( ) keeps errorsRDD in memory, so the downstream .saveAsTextFile( ), .filter( λ ), .count( ) and .collect( ) calls do not recompute it from logLinesRDD each time]
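A minimal sketch of this caching pattern (names reused from the figure; the output path is hypothetical):

# Minimal sketch: cache an RDD that several actions will reuse
errorsRDD = logLinesRDD.filter(lambda line: line.startswith("Error"))
errorsRDD.cache()                                  # keep it in memory after the first computation
errorsRDD.saveAsTextFile("/path/to/errors_out")    # action 1: computes and caches errorsRDD
msg1RDD = errorsRDD.filter(lambda line: "msg1" in line)
print(msg1RDD.count())                             # action 2: reads errorsRDD from the cache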
Partition >>> Task >>> Partition

[Figure: logLinesRDD (a HadoopRDD) has 4 partitions; .filter( λ ) launches one task per partition (Task-1 ... Task-4), each producing the corresponding partition of errorsRDD (a filteredRDD)]
RDD Lineage
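The lineage of an RDD can be inspected from code; a minimal PySpark sketch (errorsRDD reused from the earlier examples; the exact output format varies between Spark versions):

# Minimal sketch: print the lineage (chain of parent RDDs) that Spark would replay after a failure
print(errorsRDD.toDebugString())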

Resilient Distributed Dataset (RDD)
• Initial RDD on disks (HDFS, etc)
• Intermediate RDD on RAM
• Fault recovery based on lineage
• RDD operations are distributed

DataFrame
• A primary abstraction in Spark 2.0
• Immutable once constructed
• Track lineage information to efficiently re-compute lost data
• Enable operations on collection of elements in parallel
• To construct DataFrame
• By parallelizing existing Python collections (lists)
• By transforming an existing Spark or pandas DataFrame
• From files in HDFS or other storage system
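A minimal sketch of these construction routes (the file path and column names are made up for illustration):

# Minimal sketch: three ways to obtain a DataFrame
df1 = sqlContext.createDataFrame([('Alice', 1), ('Bob', 2)], ['name', 'age'])  # from a Python list
df2 = df1.select('name')                                                       # from an existing DataFrame
df3 = sqlContext.read.text('/path/to/file.txt')                                # from a file in HDFS or other storage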

Using DataFrame
>>> data = [('Alice', 1), ('Bob', 2), ('Bob', 2)]
>>> df1 = sqlContext.createDataFrame(data, ['name', 'age'])
>>> df1.collect()
[Row(name=u'Alice', age=1),
 Row(name=u'Bob', age=2),
 Row(name=u'Bob', age=2)]

Transformations
• Create a new DataFrame from an existing one
• Use lazy evaluation
• Nothing executes: Spark saves the recipe for transforming the source

Transformation        Description
select(*cols)         Selects columns from this DataFrame
drop(col)             Returns a new DataFrame that drops the specified column
filter(func)          Returns a new DataFrame formed by selecting those rows of the source on which func returns true
where(func)           where is an alias for filter
distinct()            Returns a new DataFrame that contains the distinct rows of the source DataFrame
sort(*cols, **kw)     Returns a new DataFrame sorted by the specified columns and in the sort order specified by kw
Using Transformations
>>> data = [('Alice', 1), ('Bob', 2), ('Bob', 2)]
>>> df1 = sqlContext.createDataFrame(data, ['name', 'age'])
>>> df2 = df1.distinct()
>>> df2.collect()
[Row(name=u'Alice', age=1), Row(name=u'Bob', age=2)]
>>> df3 = df2.sort('age', ascending=False)
>>> df3.collect()
[Row(name=u'Bob', age=2), Row(name=u'Alice', age=1)]

Actions
• Cause Spark to execute recipe to transform source
• Mechanisms for getting results out of Spark

Action               Description
show(n, truncate)    Prints the first n rows of this DataFrame
take(n)              Returns the first n rows as a list of Row
collect()            Returns all the records as a list of Row (*)
count()              Returns the number of rows in this DataFrame
describe(*cols)      Exploratory Data Analysis function that computes statistics (count, mean, stddev, min, max) for numeric columns

Using Actions
>>> data = [('Alice', 1), ('Bob', 2)]
>>> df = sqlContext.createDataFrame(data, ['name', 'age'])
>>> df.collect()
[Row(name=u'Alice', age=1), Row(name=u'Bob', age=2)]
>>> df.count()
2
>>> df.show()
+-----+---+
| name|age|
+-----+---+
|Alice|  1|
|  Bob|  2|
+-----+---+

Caching
>>> linesDF = sqlContext.read.text('…')
>>> linesDF.cache()
>>> commentsDF = linesDF.filter(isComment)
>>> print linesDF.count(), commentsDF.count()
>>> commentsDF.cache()

Spark Programming Routine
• Create DataFrames from external data or
createDataFrame from a collection in driver program
• Lazily transform them into new DataFrames
• cache() some DataFrames for reuse
• Perform actions to execute parallel computation and
produce results
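A minimal end-to-end sketch of this routine (the file path, filter condition and column names are assumptions):

# Minimal sketch of the routine: create, transform lazily, cache, then act
df = sqlContext.read.json('/path/to/events.json')       # create a DataFrame from external data
adults = df.filter(df.age >= 18).select('name', 'age')  # lazy transformations
adults.cache()                                           # keep the result around for reuse
print(adults.count())                                    # action: triggers the parallel computation
adults.show(5)                                           # another action reusing the cached data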

DataFrames versus RDDs
• For new users familiar with data frames in other
programming languages, this API should make them
feel at home
• For existing Spark users, the API will make Spark
easier to program than using RDDs
• For both sets of users, DataFrames will improve
performance through intelligent optimizations and
code-generation
Write Less Code: Input & Output

Unified interface to reading/writing data in a variety of formats.

val df = sqlContext.
  read.
  format("json").
  option("samplingRatio", "0.1").
  load("/Users/spark/data/stuff.json")

df.write.
  format("parquet").
  mode("append").
  partitionBy("year").
  saveAsTable("faster-stuff")

• read and write create new builders for doing I/O
• Builder methods specify: format, partitioning, handling of existing data
• load(…), save(…), or saveAsTable(…) finish the I/O specification
Data Sources supported by DataFrames

[Figure: logos of built-in data sources (e.g. JDBC, { JSON }) and external data source packages, and more …]
Write Less Code: High-Level Operations
• Solve common problems concisely with DataFrame functions:
• selecting columns and filtering
• joining different data sources
• aggregation (count, sum, average, etc.)
• plotting results (e.g., with Pandas)
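A brief hedged sketch of these operations in PySpark (the people and departments DataFrames, their columns, and the join key are invented for the example; the plot line additionally needs Pandas and matplotlib):

# Minimal sketch: select/filter, join, and aggregate with DataFrame functions
from pyspark.sql import functions as F

adults = people.filter(people.age >= 18).select('name', 'dept', 'salary')  # filter + select
joined = adults.join(departments, on='dept')                               # join another data source
stats = joined.groupBy('dept').agg(F.count('*'), F.avg('salary'))          # aggregation
stats.toPandas().plot(kind='bar')                                          # plot results via Pandas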
Write Less Code: Compute an Average

Using Hadoop MapReduce (Java):

private IntWritable one = new IntWritable(1);
private IntWritable output = new IntWritable();

protected void map(LongWritable key,
                   Text value,
                   Context context) {
  String[] fields = value.split("\t");
  output.set(Integer.parseInt(fields[1]));
  context.write(one, output);
}

----------------------------------------------------------------------------------

IntWritable one = new IntWritable(1)
DoubleWritable average = new DoubleWritable();

protected void reduce(IntWritable key,
                      Iterable<IntWritable> values,
                      Context context) {
  int sum = 0;
  int count = 0;
  for (IntWritable value: values) {
    sum += value.get();
    count++;
  }
  average.set(sum / (double) count);
  context.write(key, average);
}

Using Spark (Scala):

rdd = sc.textFile(...).map(_.split(" "))
rdd.map { x => (x(0), (x(1).toFloat, 1)) }.
  reduceByKey { case ((num1, count1), (num2, count2)) =>
    (num1 + num2, count1 + count2)
  }.
  map { case (key, (num, count)) => (key, num / count) }.
  collect()

Using Spark (Python):

rdd = sc.textFile(...).map(lambda s: s.split())
rdd.map(lambda x: (x[0], (float(x[1]), 1))).\
  reduceByKey(lambda t1, t2: (t1[0] + t2[0], t1[1] + t2[1])).\
  map(lambda t: (t[0], t[1][0] / t[1][1])).\
  collect()
Write Less Code: Compute an Average

Using RDDs:

rdd = sc.textFile(...).map(_.split(" "))
rdd.map { x => (x(0), (x(1).toFloat, 1)) }.
  reduceByKey { case ((num1, count1), (num2, count2)) =>
    (num1 + num2, count1 + count2)
  }.
  map { case (key, (num, count)) => (key, num / count) }.
  collect()

Using DataFrames:

import org.apache.spark.sql.functions._

val df = rdd.map(a => (a(0), a(1))).toDF("key", "value")

df.groupBy("key")
  .agg(avg("value"))
  .collect()

Full API docs: Scala, Java, Python, R
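For comparison, a hedged PySpark sketch of the same DataFrame computation (column names follow the Scala example; rdd is assumed to yield (key, value) string pairs):

# Minimal PySpark sketch of the same average
from pyspark.sql import functions as F

df = rdd.map(lambda a: (a[0], float(a[1]))).toDF(['key', 'value'])
df.groupBy('key').agg(F.avg('value')).collect()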
Architecture
• A master-worker type architecture
• A driver or master node
• Worker nodes

• The master sends work to the workers and instructs them to pull data either from memory or from disk (or from another source like S3 or HDFS)

Architecture(2)
• A Spark program first creates a SparkContext object
• SparkContext tells Spark how and where to access a cluster
• The master parameter for a SparkContext determines which
type and size of cluster to use

Master parameter     Description
local                Run Spark locally with one worker thread (no parallelism)
local[K]             Run Spark locally with K worker threads (ideally set to the number of cores)
spark://HOST:PORT    Connect to a Spark standalone cluster
mesos://HOST:PORT    Connect to a Mesos cluster
yarn                 Connect to a YARN cluster
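A minimal sketch of creating a SparkContext with a master parameter (the app name is chosen arbitrarily):

# Minimal sketch: a SparkContext that runs locally with 4 worker threads
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[4]").setAppName("demo-app")
sc = SparkContext(conf=conf)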

Lifetime of a Job in Spark

Demo
References
• Zaharia, Matei, et al. "Resilient Distributed Datasets: A fault-tolerant abstraction for in-memory cluster computing." 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), 2012.
• Armbrust, Michael, et al. "Spark SQL: Relational data processing in Spark." Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, 2015.
• Zaharia, Matei, et al. "Discretized Streams: Fault-tolerant streaming computation at scale." Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, 2013.
• Chambers, Bill, and Matei Zaharia. Spark: The Definitive Guide: Big data processing made simple. O'Reilly Media, Inc., 2018.
Thank you for your attention!
