Spark (Introduction, RDD)
Apache Spark
Unauthorized copying, distribution and exhibition of this presentation is
punishable under law
Copyright @2016 V2 Maestros, All rights reserved.
Spark Overview
Online Reference
• http://spark.apache.org
Spark Eco-system
Spark Framework
RDD
Spark Architecture
[Architecture diagram: the Driver Program with its Spark Context runs on the Master Node and works with the Cluster Manager; each Worker Node runs an Executor that holds Tasks and a Cache]
Driver Program
• The main executable program from where Spark operations are
performed
• Runs in the master node of a cluster
• Controls and co-ordinates all operations
• The Driver program is the “main” class.
• Executes parallel operations on a cluster
• Defines RDDs
• Each driver program execution is a “Job”
SparkContext
• Driver accesses Spark functionality through a SparkContext object.
• Represents a connection to the computing cluster
• Used to build RDDs.
• Partitions RDDs and distributes them on the cluster
• Works with the cluster manager
• Manages executors running on Worker nodes
• Splits jobs as parallel “tasks” and executes them on worker nodes
• Collects results and presents them to the Driver Program
Spark modes
• Batch mode
• A program is scheduled for execution through the scheduler
• Runs fully at periodic intervals and processes data
• Interactive mode
• An interactive shell is used by the user to execute Spark commands one-by-one.
• Shell acts as the Driver program and provides SparkContext
• Can run tasks on a cluster
• Streaming mode
• An always running program continuously processes data as it arrives
Spark scalability
• Single JVM
• Runs on a single box (Linux or Windows)
• All components (Driver, executors) run within the same JVM
• Managed Cluster
• Can scale from 2 to thousands of nodes
• Can use any cluster manager for managing nodes
• Data is distributed and processed on all nodes
Loading and Storing Data
Creating RDDs
• RDDs can be created from a number of sources
• Text Files
• JSON
• Parallelize() on local collections
• Java Collections
• Python Lists
• R Data Frames
• Sequence files
• RDBMS – load into local collections first and create RDD
• Very large data – create HDFS files outside of Spark and then create
RDDs from them
Storing RDDs
• Spark provides simple functions to persist RDDs to a variety of data
sinks
• Text Files
• JSON
• Sequence Files
• Collections
• For optimization, use language-specific libraries for persistence rather than Spark utilities.
• saveAsTextFile()
• RDBMS – move to local collections and then store.
Lazy evaluation
• Lazy evaluation means Spark will not load or transform data unless an
action is performed
• Load file into RDD
• Filter the RDD
• Count no. of elements (only now loading and filtering happens)
• Helps internally optimize operations and resource usage
• Watch out during troubleshooting – errors found while executing
actions might be related to earlier transformations
Transformations
Overview
• Perform operation on one RDD and create a new RDD
• Operate on one element at a time
• Lazy evaluation
• Can be distributed across multiple nodes based on the partitions they
act upon
Map
newRdd=rdd.map(function)
• Works similar to the Map Reduce “Map”
• Act upon each element and perform some operation
• Element level computation or transformation
• The result RDD has the same number of elements as the original RDD
• Result can be of different type
• Can pass functions to this operation to perform complex tasks
• Use Cases
• Data Standardization – First Name, Last Name
• Data type conversion
• Element level computations – compute tax
• Add new attributes – Grades based on test scores
flatMap
newRdd=rdd.flatMap(function)
• Works the same way as map
• Can return more elements than the original RDD
• Use to break up elements in the original RDD and create a new RDD
• Split strings in the original RDD
• Extract child elements from a nested JSON string
Filter
newRdd=rdd.filter(function)
• Filter an RDD to select elements that match a condition
• The result RDD is smaller than the original RDD
• A function can be passed as the condition to perform complex filtering
• The function returns true/false for each element
Set Operations
• Set operations are performed on two RDDs
• Union – Return a new dataset that contains the union of the elements
in the source dataset and the argument.
• unionRDD=firstRDD.union(secondRDD)
• Intersection - Return a new RDD that contains the intersection of
elements in the source dataset and the argument.
• intersectionRDD=firstRDD.intersection(secondRDD)
Actions
Introduction to actions
• Act on an RDD and produce a result (not an RDD)
• Lazy evaluation – Spark does not act until it sees an action
• Simple actions
• collect – return all elements in the RDD as an array. Use to trigger
execution or print values
• count – count the number of elements in the RDD
• first – returns the first element in the RDD
• take(n) – returns the first n elements
reduce
• Perform an operation across all elements of an RDD
• sum, count etc.
• The operation is a function that takes as input two values.
• The function is called for every element in the RDD
Given inputRDD = [ a, b, c, d, e ] and the function func(x,y), reduce computes:
func( func( func( func(a,b), c), d), e)
Reduce Example
vals = [3,5,2,4,1]
sum(x,y) { return x + y }
sum( sum( sum( sum(3,5), 2), 4), 1) = 15
Pair RDD
Pair RDDs
• Pair RDDs are a special type of RDDs that can store key value pairs.
• Can be created through regular map operations
• All transformations for regular RDDs available for Pair RDDs
• Spark supports a set of special functions to handle Pair RDD
operations
• mapValues : transform each value without changing the key
• flatMapValues : generate multiple values with the same key
Advanced Spark
Broadcast variables
• A read-only variable that is shared by all nodes
• Used for lookup tables or similar functions
• Spark optimizes distribution and storage for better performance.
Accumulators
• A shared variable across nodes that can be updated by each node
• Helps compute items not done through reduce operations
• Spark optimizes distribution and takes care of race conditions
Partitioning
• By default all RDDs are partitioned
• spark.default.parallelism parameter
• Default is the total no. of cores available across the entire cluster
• Should configure for large clusters
• Can be specified during RDD creation explicitly
• Derived RDDs take the same number of partitions as the source.
Persistence
• By default, Spark loads an RDD whenever it is required and drops it once the action is over
• It will load and re-compute the RDD chain each time a different operation is performed
• Persistence allows the intermediate RDD to be persisted so it need
not have to be recomputed.
• persist() can persist the RDD in memory, disk, shared or in other third
party sinks
• cache() provides the default persist() – in memory
Spark SQL
Overview
• A library built on Spark Core that supports SQL-like data and operations
• Makes it easy for traditional RDBMS developers to transition to big data
• Works with “structured” data that has a schema
• Seamlessly mix SQL queries with Spark programs.
• Supports JDBC
• Helps mix and match different RDBMS and NoSQL data sources
Spark Session
• All functionality for Spark SQL accessed through a Spark Session
• Data Frames are created through Spark Session
• Provides a standard interface to work across different data sources
• Can register Data Frames as temp table and then run SQL queries on
them
DataFrame
• A distributed collection of data organized as rows and columns
• Has a schema – column names, data types
• Built upon RDD, Spark optimizes better since it knows the schema
• Can be created from and persisted to a variety of sources
• CSV
• Database tables
• Hive / NoSQL tables
• JSON
• RDD
Temp Tables