7 Apache Spark
[Figure: cluster hardware bandwidths: CPUs 10 GB/s; network between nodes 1 Gb/s or 125 MB/s; nodes in another rack 0.1 Gb/s]
A unified analytics engine for large-scale data processing
• Better support for
  • Iterative algorithms
  • Interactive data mining
• Fault tolerance, data locality, scalability
• Hide complexity: help users avoid hand-coding the distributed execution mechanism
Memory instead of disk
Spark and MapReduce differences

             Apache Hadoop MR   Apache Spark
Storage      Disk only          In-memory or on disk
Operations   Map and Reduce     Many transformations and actions, including Map and Reduce
Apache Spark vs Apache Hadoop
https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
Resilient Distributed Dataset (RDD)
• RDDs are fault-tolerant, parallel data structures that
let users explicitly persist intermediate results in
memory, control their partitioning to optimize data
placement, and manipulate them using a rich set of
operators.
• Coarse-grained transformations vs. fine-grained updates
  • e.g., map, filter, and join, which apply the same operation to many data items at once (see the sketch below)
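A minimal sketch (not from the slides) of coarse-grained transformations in PySpark; the sample data, the "Error" prefix convention, and the codes RDD are illustrative assumptions:

# Each transformation applies one operation to every element of the dataset.
logs = sc.parallelize(["Error: disk full", "Info: ok", "Error: timeout"])
codes = sc.parallelize([("Error", 500), ("Info", 200)])

levels = logs.map(lambda line: (line.split(":")[0], line))   # map: key each line by its level
errors = levels.filter(lambda kv: kv[0] == "Error")          # filter: keep only error records
joined = errors.join(codes)                                  # join: combine two RDDs by key
print(joined.collect())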
[Figure: an RDD with 4 partitions spread across worker executors; more partitions = more parallelism]

Ways to create an RDD:
- Parallelize a collection
- Read data from an external source (S3, C*, HDFS, etc.)
Parallelize
• Take an existing in-memory collection and pass it to SparkContext's parallelize method
• Not generally used outside of prototyping and testing since it requires the entire dataset in memory on one machine

// Parallelize in Scala
val wordsRDD = sc.parallelize(List("fish", "cats", "dogs"))

# Parallelize in Python
wordsRDD = sc.parallelize(["fish", "cats", "dogs"])

// Parallelize in Java
JavaRDD<String> wordsRDD = sc.parallelize(Arrays.asList("fish", "cats", "dogs"));
Read from Text File
[Figure: log lines are read from a text file, .filter(λ) keeps the error records, and .collect() returns the results to the Driver]

DAG execution
[Figure: calling the .collect() action makes the Driver build and execute the DAG]
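A minimal sketch (illustrative, not from the slides) of this log-filtering example; the file path and the "Error" prefix are assumptions:

# Read from a text file, filter, and collect the results on the driver.
logLinesRDD = sc.textFile("hdfs:///logs/app.log")                      # illustrative path
errorsRDD = logLinesRDD.filter(lambda line: line.startswith("Error"))  # keep error records
results = errorsRDD.collect()                                          # action: runs the job, returns to the driver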
Logical
[Figure: logLinesRDD → .filter(λ) → errorsRDD → .coalesce(2) → cleanedRDD → .collect() → Driver]

Physical
[Figure: the Driver computes the physical DAG and schedules tasks for each partition]

RDD Lineage
[Figure: logLinesRDD (HadoopRDD) → .filter(λ) → errorsRDD (filteredRDD), run as Task-1 … Task-4, one per partition; .cache() keeps errorsRDD in memory for reuse]
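A minimal sketch (illustrative) of the pipeline in the figures above; if a partition of errorsRDD is lost, Spark recomputes it from logLinesRDD using this lineage:

logLinesRDD = sc.textFile("hdfs:///logs/app.log")                      # illustrative path, as above
errorsRDD = logLinesRDD.filter(lambda line: line.startswith("Error"))
errorsRDD.cache()                       # persist in memory for reuse across actions
cleanedRDD = errorsRDD.coalesce(2)      # shrink to 2 partitions
results = cleanedRDD.collect()          # action: builds and executes the DAG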
Resilient Distributed Dataset (RDD)
• Initial RDDs on disk (HDFS, etc.)
• Intermediate RDDs in RAM
• Fault recovery based on lineage
• RDD operations are distributed
DataFrame
• A primary abstraction in Spark 2.0
• Immutable once constructed
• Track lineage information to efficiently re-compute lost data
• Enable operations on collections of elements in parallel
• To construct a DataFrame (see the sketch after this list):
  • By parallelizing existing Python collections (lists)
  • By transforming an existing Spark or pandas DataFrame
  • From files in HDFS or another storage system
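A minimal sketch (illustrative; the pandas frame and file path are assumptions) of the construction paths listed above:

import pandas as pd

# From a Python collection
df_a = sqlContext.createDataFrame([('Alice', 1), ('Bob', 2)], ['name', 'age'])

# From an existing pandas DataFrame
pdf = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [1, 2]})
df_b = sqlContext.createDataFrame(pdf)

# From files in HDFS or another storage system (illustrative path)
df_c = sqlContext.read.json("hdfs:///data/people.json")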
Using DataFrame
>>> data = [('Alice', 1), ('Bob', 2), ('Bob', 2)]
>>> df1 = sqlContext.createDataFrame(data, ['name', 'age'])
>>> df1.collect()
[Row(name=u'Alice', age=1),
 Row(name=u'Bob', age=2),
 Row(name=u'Bob', age=2)]
Transformations
• Create a new DataFrame from an existing one
• Use lazy evaluation
  • Nothing executes; Spark saves a recipe for transforming the source

Transformation      Description
select(*cols)       Selects columns from this DataFrame
drop(col)           Returns a new DataFrame that drops the specified column
sort(*cols, **kw)   Returns a new DataFrame sorted by the specified columns, in the sort order given by kw
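A minimal sketch (not from the slides) of select and drop, reusing df1 from the earlier example; these are lazy transformations, so nothing executes yet:

namesDF = df1.select('name')   # keep only the name column
noAgeDF = df1.drop('age')      # return a new DataFrame without the age column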
Using Transformations
>>> data = [('Alice', 1), ('Bob', 2), ('Bob', 2)]
>>> df1 = sqlContext.createDataFrame(data, ['name', 'age'])
>>> df2 = df1.distinct()
>>> df2.collect()
[Row(name=u'Alice', age=1), Row(name=u'Bob', age=2)]
>>> df3 = df2.sort("age", ascending=False)
>>> df3.collect()
[Row(name=u'Bob', age=2), Row(name=u'Alice', age=1)]
Actions
• Cause Spark to execute the recipe to transform the source
• Mechanisms for getting results out of Spark

Action              Description
show(n, truncate)   Prints the first n rows of this DataFrame
take(n)             Returns the first n rows as a list of Row
collect()           Returns all the records as a list of Row (*)
count()             Returns the number of rows in this DataFrame
Using Actions
>>> data = [('Alice', 1), ('Bob', 2)]
>>> df = sqlContext.createDataFrame(data, ['name', 'age'])
>>> df.collect()
[Row(name=u'Alice', age=1), Row(name=u'Bob', age=2)]
>>> df.count()
2
>>> df.show()
+-----+---+
| name|age|
+-----+---+
|Alice|  1|
|  Bob|  2|
+-----+---+
Caching
>>> linesDF = sqlContext.read.text('…')
>>> linesDF.cache()
>>> commentsDF = linesDF.filter(isComment)
>>> print(linesDF.count(), commentsDF.count())
>>> commentsDF.cache()
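isComment is not defined on the slide; a plausible definition (an assumption, using the single value column that read.text produces) would be:

from pyspark.sql.functions import col

# Assumed helper: treat lines starting with '#' as comments.
isComment = col('value').startswith('#')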
Spark Programming Routine
• Create DataFrames from external data, or with createDataFrame from a collection in the driver program
• Lazily transform them into new DataFrames
• cache() some DataFrames for reuse
• Perform actions to execute parallel computation and produce results (see the sketch below)
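A minimal end-to-end sketch (illustrative; the file path and the level, host, and msg columns are assumptions) following this routine:

# 1. Create a DataFrame from external data (illustrative path)
logsDF = sqlContext.read.json("hdfs:///data/logs.json")

# 2. Lazily transform it
errorsDF = logsDF.filter(logsDF.level == 'ERROR').select('host', 'msg')

# 3. Cache it for reuse
errorsDF.cache()

# 4. Perform actions to run the parallel computation and produce results
print(errorsDF.count())
errorsDF.show(10)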
DataFrames versus RDDs
• For new users familiar with data frames in other
programming languages, this API should make them
feel at home
• For existing Spark users, the API will make Spark
easier to program than using RDDs
• For both sets of users, DataFrames will improve
performance through intelligent optimizations and
code-generation
Write Less Code: Input & Output

Unified interface to reading/writing data in a variety of formats. The read and write functions create new builders for doing I/O; builder methods specify the format, partitioning, and handling of existing data.

val df = sqlContext.
  read.
  format("json").
  option("samplingRatio", "0.1").
  load("/Users/spark/data/stuff.json")

df.write.
  format("parquet").
  mode("append").
  partitionBy("year").
  saveAsTable("faster-stuff")
Data Sources supported by DataFrames
[Figure: built-in and external data sources, e.g., JDBC, { JSON }, and more …]
Write Less Code: High-Level Operations
• Solve common problems concisely with DataFrame
functions:
• selecting columns and filtering
• joining different data sources
• aggregation (count, sum, average, etc.)
• plotting results (e.g., with Pandas)
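A minimal sketch (illustrative; the files, columns, and join key are assumptions) of joining two data sources and aggregating with DataFrame functions:

# Join a users dataset with an events dataset, then aggregate per user.
usersDF = sqlContext.read.json("hdfs:///data/users.json")     # assumed columns: id, name
eventsDF = sqlContext.read.json("hdfs:///data/events.json")   # assumed columns: user_id, amount

joinedDF = eventsDF.join(usersDF, eventsDF.user_id == usersDF.id)
summaryDF = joinedDF.groupBy('name').sum('amount')
summaryDF.show()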
Write Less Code: Compute an Average
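The code for this slide did not survive extraction; a minimal sketch (an assumption about the intended comparison) of computing a per-key average, first with the RDD API and then with the DataFrame API:

data = [('a', 1), ('b', 3), ('a', 5)]

# RDD API: carry (sum, count) per key, then divide
rdd = sc.parallelize(data)
avgByKey = (rdd.map(lambda kv: (kv[0], (kv[1], 1)))
               .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
               .map(lambda kv: (kv[0], kv[1][0] / float(kv[1][1]))))
print(avgByKey.collect())

# DataFrame API: a single aggregation call
df = sqlContext.createDataFrame(data, ['key', 'value'])
df.groupBy('key').avg('value').show()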
Architecture (2)
• A Spark program first creates a SparkContext object
• SparkContext tells Spark how and where to access a cluster
• The master parameter for a SparkContext determines which
type and size of cluster to use
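A minimal sketch (illustrative; the app name and the 'local[4]' master are just examples) of creating a SparkContext:

from pyspark import SparkConf, SparkContext

# The master parameter determines which cluster to use, e.g. 'local[4]' for
# four local worker threads or 'spark://host:7077' for a standalone cluster.
conf = SparkConf().setAppName('demo').setMaster('local[4]')
sc = SparkContext(conf=conf)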
Lifetime of a Job in Spark
Demo
References
• Zaharia, Matei, et al. "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing." 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12). 2012.
• Armbrust, Michael, et al. "Spark SQL: Relational data processing in Spark." Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. 2015.
• Zaharia, Matei, et al. "Discretized streams: Fault-tolerant streaming computation at scale." Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. 2013.
• Chambers, Bill, and Matei Zaharia. Spark: The Definitive Guide: Big Data Processing Made Simple. O'Reilly Media, Inc., 2018.
Thank you for your attention!!!