
Apache Spark
Copyright © 2016 V2 Maestros. All rights reserved.

Spark Overview


What is Apache Spark?


• A fast, general engine for large-scale data processing
• An open-source cluster computing framework
• An end-to-end analytics platform
• Developed to overcome the limitations of Hadoop MapReduce
• Runs anywhere from a single desktop to a huge cluster
• Supports iterative, interactive, and stream processing
• Supports multiple languages: Scala, Python, R, Java
• Major companies like Amazon, eBay, and Yahoo use Spark


When to use Spark?


• Data integration and ETL
• Interactive analytics
• High-performance batch computation
• Machine learning and advanced analytics
• Real-time stream processing
• Example applications:
• Credit card fraud detection
• Network intrusion detection
• Advertisement targeting


Typical Spark workflow


• Load data from a source
• HDFS, NoSQL, S3, real-time sources
• Transform the data
• Filter, clean, join, enhance
• Store the processed data
• Memory, HDFS, NoSQL
• Interactive analytics
• Shells, Spark SQL, third-party tools
• Machine learning
• Actions (a minimal end-to-end sketch follows this list)
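A minimal PySpark sketch of this workflow, assuming a local Spark install; the file names ("sales.csv", "sales_clean") and app name are hypothetical:

from pyspark import SparkContext

sc = SparkContext("local", "workflow-demo")

raw = sc.textFile("sales.csv")                  # load from a source
clean = raw.filter(lambda line: line.strip())   # transform: drop blank lines
clean.cache()                                   # store processed data in memory
print(clean.count())                            # action triggers the pipeline
clean.saveAsTextFile("sales_clean")             # persist the results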


Online Reference
• http://spark.apache.org


Spark Eco-system


Spark Framework

• Programming tools: Scala, Python, R, Java
• Library: Spark SQL, MLlib, GraphX, Streaming
• Engine: Spark Core
• Management: YARN, Mesos, Spark Scheduler
• Storage: Local, HDFS, S3, RDBMS, NoSQL

RDD


Resilient Distributed Datasets (RDD)


• Spark is built around RDDs: you create, transform, analyze, and store RDDs in a Spark program
• A dataset contains a collection of elements of any type
• Strings, lines, rows, objects, collections
• The dataset can be partitioned and distributed across multiple nodes
• RDDs are immutable: once created, they can't be changed
• They can be cached and persisted
• Transformations act on RDDs to create a new RDD
• Actions analyze RDDs to produce a result
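A minimal PySpark sketch of the create-transform-act cycle (app name is illustrative):

from pyspark import SparkContext

sc = SparkContext("local", "rdd-demo")

rdd = sc.parallelize([1, 2, 3, 4])    # create an RDD from a local list
doubled = rdd.map(lambda x: x * 2)    # transformation: returns a new RDD
print(doubled.collect())              # action: [2, 4, 6, 8]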


Spark Architecture


[Architecture diagram: the Driver Program with its SparkContext runs on the Master Node; through a Cluster Manager it coordinates Worker Nodes, each running an Executor with tasks and a cache.]


Driver Program
• The main executable program from which Spark operations are performed
• Runs on the master node of a cluster
• Controls and coordinates all operations
• The driver program is the "main" class
• Executes parallel operations on the cluster
• Defines RDDs
• Each driver program execution is a "job"


SparkContext
• The driver accesses Spark functionality through a SparkContext object
• Represents a connection to the computing cluster
• Used to build RDDs
• Partitions RDDs and distributes them across the cluster
• Works with the cluster manager
• Manages executors running on worker nodes
• Splits jobs into parallel "tasks" and executes them on worker nodes
• Collects results and presents them to the driver program
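A sketch of creating a SparkContext in PySpark; "local[2]" (two local worker threads) and the app name are illustrative, not a real cluster URL:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("context-demo").setMaster("local[2]")
sc = SparkContext(conf=conf)

rdd = sc.parallelize(range(10), 2)   # the context builds and partitions RDDs
print(rdd.getNumPartitions())        # 2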


Spark modes
• Batch mode
• A program is scheduled for execution through the scheduler
• Runs fully at periodic intervals and processes data
• Interactive mode
• An interactive shell is used to execute Spark commands one by one
• The shell acts as the driver program and provides the SparkContext
• Can run tasks on a cluster
• Streaming mode
• An always-running program continuously processes data as it arrives


Spark scalability
• Single JVM
• Runs on a single box (Linux or Windows)
• All components (driver, executors) run within the same JVM
• Managed cluster
• Can scale from two to thousands of nodes
• Can use any supported cluster manager for managing nodes
• Data is distributed and processed on all nodes

Loading and Storing Data


Creating RDDs
• RDDs can be created from a number of sources
• Text files
• JSON
• parallelize() on local collections
• Java collections
• Python lists
• R data frames
• Sequence files
• RDBMS: load into local collections first, then create the RDD
• Very large data: create HDFS files outside of Spark, then create RDDs from them
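A PySpark sketch of a few of these sources; the file paths are hypothetical, and the JSON example assumes one record per line:

from pyspark import SparkContext
import json

sc = SparkContext("local", "create-demo")

numbers = sc.parallelize([1, 2, 3, 4, 5])         # from a local collection
lines = sc.textFile("data/input.txt")             # from a text file, one element per line
records = sc.textFile("data/records.json").map(json.loads)   # parse each JSON line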


Storing RDDs
• Spark provides simple functions to persist RDDs to a variety of data sinks
• Text files
• JSON
• Sequence files
• Collections
• saveAsTextFile()
• For best performance, use language-specific libraries for persistence rather than the Spark utilities
• RDBMS: move the data to local collections, then store
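A minimal PySpark sketch; "output" is a hypothetical path:

from pyspark import SparkContext

sc = SparkContext("local", "store-demo")
rdd = sc.parallelize(["a", "b", "c"])

# Writes an output directory with one part file per partition
# (fails if the directory already exists)
rdd.saveAsTextFile("output")

# For an RDBMS sink: bring the data back to a local collection first
local_copy = rdd.collect()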


Lazy evaluation
• Lazy evaluation means Spark will not load or transform data until an action is performed
• Load a file into an RDD
• Filter the RDD
• Count the number of elements (only now do the loading and filtering happen)
• Helps Spark internally optimize operations and resource usage
• Watch out during troubleshooting: errors raised while executing actions may actually stem from earlier transformations
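The same three steps as a PySpark sketch ("data/input.txt" is a hypothetical path):

from pyspark import SparkContext

sc = SparkContext("local", "lazy-demo")

lines = sc.textFile("data/input.txt")            # nothing is read yet
errors = lines.filter(lambda l: "ERROR" in l)    # still nothing
print(errors.count())   # the action: only now is the file loaded and filtered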


Transformations


Overview
• Perform an operation on one RDD and create a new RDD
• Operate on one element at a time
• Lazily evaluated
• Can be distributed across multiple nodes based on the partitions they act upon


Map
newRdd = rdd.map(function)
• Works like the MapReduce "map"
• Acts upon each element and performs some operation
• Element-level computation or transformation
• The result RDD has the same number of elements as the original RDD
• The result can be of a different type
• Functions can be passed to this operation to perform complex tasks
• Use cases
• Data standardization: first name, last name
• Data type conversion
• Element-level computations: compute tax
• Adding new attributes: grades based on test scores
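A PySpark sketch of the name-standardization use case (data is illustrative):

from pyspark import SparkContext

sc = SparkContext("local", "map-demo")

names = sc.parallelize(["alice SMITH", "BOB jones"])
standardized = names.map(lambda n: n.title())   # one output element per input element
print(standardized.collect())                   # ['Alice Smith', 'Bob Jones']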


flatMap
newRdd = rdd.flatMap(function)
• Works the same way as map
• Can return more elements than the original RDD
• Use it to break up elements of the original RDD and create a new one
• Split strings in the original RDD
• Extract child elements from a nested JSON string
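A PySpark sketch of the string-splitting case:

from pyspark import SparkContext

sc = SparkContext("local", "flatmap-demo")

lines = sc.parallelize(["hello world", "hi there"])
words = lines.flatMap(lambda line: line.split(" "))   # each line yields several words
print(words.collect())   # ['hello', 'world', 'hi', 'there']: more elements than the input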


Filter
newRdd = rdd.filter(function)
• Filters an RDD to select the elements that match a condition
• The result RDD is usually smaller than the original RDD
• A function can be passed as the condition to perform complex filtering
• The function returns true/false for each element
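A PySpark sketch using a named function as the condition:

from pyspark import SparkContext

sc = SparkContext("local", "filter-demo")

nums = sc.parallelize([1, 2, 3, 4, 5, 6])

def is_even(x):
    return x % 2 == 0   # the condition returns True/False per element

print(nums.filter(is_even).collect())   # [2, 4, 6]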


Set Operations
• Set operations are performed on two RDDs
• Union: returns a new RDD that contains the union of the elements in the source RDD and the argument
• unionRDD = firstRDD.union(secondRDD)
• Intersection: returns a new RDD that contains the intersection of the elements in the source RDD and the argument
• intersectionRDD = firstRDD.intersection(secondRDD)
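A PySpark sketch; note that the RDD union keeps duplicates:

from pyspark import SparkContext

sc = SparkContext("local", "set-demo")

first = sc.parallelize([1, 2, 3, 4])
second = sc.parallelize([3, 4, 5, 6])

print(first.union(second).collect())          # [1, 2, 3, 4, 3, 4, 5, 6]: duplicates kept
print(first.intersection(second).collect())   # [3, 4] (order may vary)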

Actions


Introduction to actions
• Act on an RDD and produce a result (not an RDD)
• Lazy evaluation: Spark does not act until it sees an action
• Simple actions
• collect: returns all elements in the RDD as an array; use it to trigger execution or print values
• count: counts the number of elements in the RDD
• first: returns the first element in the RDD
• take(n): returns the first n elements
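The four simple actions in a PySpark sketch:

from pyspark import SparkContext

sc = SparkContext("local", "actions-demo")
rdd = sc.parallelize([10, 20, 30, 40])

print(rdd.collect())   # [10, 20, 30, 40]: all elements as a list
print(rdd.count())     # 4
print(rdd.first())     # 10
print(rdd.take(2))     # [10, 20]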


reduce
• Performs an operation across all elements of an RDD
• sum, count, etc.
• The operation is a function that takes two values as input; it should be commutative and associative so Spark can apply it in parallel
• The function is called pairwise across the RDD's elements:
inputRDD = [a, b, c, d, e] with function func(x, y) evaluates as
func(func(func(func(a, b), c), d), e)


Reduce Example
vals = [3, 5, 2, 4, 1]
sum(x, y) = x + y
sum(sum(sum(sum(3, 5), 2), 4), 1) = 15
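The same example in PySpark; reduce is an action, so it runs immediately:

from pyspark import SparkContext

sc = SparkContext("local", "reduce-demo")

vals = sc.parallelize([3, 5, 2, 4, 1])
total = vals.reduce(lambda x, y: x + y)   # applied pairwise across all elements
print(total)   # 15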


Pair RDD


Pair RDDs
• Pair RDDs are a special type of RDD that stores key-value pairs
• Can be created through regular map operations
• All transformations for regular RDDs are available for Pair RDDs
• Spark also supports a set of special functions for Pair RDD operations
• mapValues: transform each value without changing the key
• flatMapValues: generate multiple values with the same key
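A PySpark sketch; the pair data is illustrative:

from pyspark import SparkContext

sc = SparkContext("local", "pair-demo")

# Build a pair RDD with a regular map
pairs = sc.parallelize([("a", "x,y"), ("b", "z")])

print(pairs.mapValues(str.upper).collect())    # [('a', 'X,Y'), ('b', 'Z')]: keys unchanged
print(pairs.flatMapValues(lambda v: v.split(",")).collect())
# [('a', 'x'), ('a', 'y'), ('b', 'z')]: same key, multiple values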


Pair RDD Actions


• countByKey: produces a count for each key in the RDD
• groupByKey: groups all values for each key; aggregations like sum or average can then be applied
• reduceByKey: performs a reduce, but per key
• aggregateByKey: performs an aggregate, per key
• join: joins multiple RDDs on a common key
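A PySpark sketch of countByKey, reduceByKey, and join on illustrative data:

from pyspark import SparkContext

sc = SparkContext("local", "pair-actions-demo")

sales = sc.parallelize([("NY", 10), ("CA", 20), ("NY", 5)])

print(dict(sales.countByKey()))                         # {'NY': 2, 'CA': 1}
print(sales.reduceByKey(lambda x, y: x + y).collect())  # [('NY', 15), ('CA', 20)]

regions = sc.parallelize([("NY", "east"), ("CA", "west")])
print(sales.join(regions).collect())   # pairs matched on key, e.g. ('NY', (10, 'east'))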

Advanced Spark


Local Variables in Spark


• Spark makes copies of your code (one per partition) and executes them
• Any variable you create in the base programming language is local to each copy
• Each partition gets its own duplicate copy of local variables
• Each copy is acted upon independently


Broadcast variables
• A read-only variable that is shared by all nodes
• Used for lookup tables and similar shared data
• Spark optimizes its distribution and storage for better performance
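A PySpark sketch of a broadcast lookup table (data is illustrative):

from pyspark import SparkContext

sc = SparkContext("local", "broadcast-demo")

# Read-only lookup table shipped once to every node
codes = sc.broadcast({"NY": "New York", "CA": "California"})

rdd = sc.parallelize(["NY", "CA", "NY"])
print(rdd.map(lambda c: codes.value[c]).collect())   # ['New York', 'California', 'New York']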


Accumulators
• A shared variable across nodes that can be updated by each node
• Helps compute values, such as counters, that are awkward to express as reduce operations
• Spark optimizes distribution and takes care of race conditions
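A PySpark sketch counting bad records with an accumulator (data is illustrative):

from pyspark import SparkContext

sc = SparkContext("local", "accumulator-demo")

bad_records = sc.accumulator(0)   # shared counter, updated by tasks

def check(line):
    if not line.strip():
        bad_records.add(1)

sc.parallelize(["a", "", "b", ""]).foreach(check)   # the action runs the updates
print(bad_records.value)   # 2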


Partitioning
• By default, all RDDs are partitioned
• Controlled by the spark.default.parallelism parameter
• The default is the total number of cores available across the entire cluster
• Should be configured explicitly for large clusters
• Can be specified explicitly during RDD creation
• Derived RDDs take the same number of partitions as their source
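A PySpark sketch; "local[4]" (four local cores) is illustrative:

from pyspark import SparkContext

sc = SparkContext("local[4]", "partition-demo")

default = sc.parallelize(range(100))        # partition count from the configuration
explicit = sc.parallelize(range(100), 10)   # partition count set at creation

print(default.getNumPartitions())                      # 4 with local[4]
print(explicit.getNumPartitions())                     # 10
print(explicit.map(lambda x: x).getNumPartitions())    # derived RDD: also 10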


Persistence
• By default, Spark loads an RDD whenever it is required and drops it once the action is over
• It will reload and recompute the RDD chain each time a different operation is performed
• Persistence allows an intermediate RDD to be kept around so it does not have to be recomputed
• persist() can keep the RDD in memory, on disk, shared, or in other third-party sinks
• cache() is the default persist(): in memory
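A PySpark sketch; the second action reuses the cached RDD instead of recomputing the chain:

from pyspark import SparkContext, StorageLevel

sc = SparkContext("local", "persist-demo")

squares = sc.parallelize(range(1, 1000)).map(lambda x: x * x)
squares.cache()   # same as persist() with the default in-memory level
# squares.persist(StorageLevel.MEMORY_AND_DISK)   # alternative: spill to disk if needed

print(squares.count())   # first action computes and caches the RDD
print(squares.count())   # second action reuses the cached copy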

Spark SQL


Overview
• A library built on Spark Core that supports SQL-like data and operations
• Makes it easy for traditional RDBMS developers to transition to big data
• Works with "structured" data that has a schema
• Seamlessly mixes SQL queries with Spark programs
• Supports JDBC
• Helps mix and match different RDBMS and NoSQL data sources


Spark Session
• All Spark SQL functionality is accessed through a SparkSession
• DataFrames are created through the SparkSession
• Provides a standard interface for working across different data sources
• Can register DataFrames as temp tables and then run SQL queries on them
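A minimal PySpark sketch of creating a session and a DataFrame (app name and data are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

# DataFrames are created through the session
df = spark.createDataFrame([("Alice", 30), ("Bob", 25)], ["name", "age"])
df.show()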


DataFrame
• A distributed collection of data organized as rows and columns
• Has a schema: column names and data types
• Built on top of RDDs; Spark can optimize better because it knows the schema
• Can be created from, and persisted to, a variety of sources
• CSV
• Database tables
• Hive / NoSQL tables
• JSON
• RDDs
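A PySpark sketch of creating a DataFrame from a CSV source; "people.csv" is a hypothetical path with a header row:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-demo").getOrCreate()

df = spark.read.csv("people.csv", header=True, inferSchema=True)

df.printSchema()   # column names and inferred data types
df.show(5)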


Operations supported by Data Frames


• filter: filter data based on a condition
• join: join two DataFrames on a common column
• groupBy: group a DataFrame by specific column values
• agg: compute aggregates like sum and average
• Operations can be chained together
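A PySpark sketch chaining filter, groupBy, and agg on illustrative data:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ops-demo").getOrCreate()
df = spark.createDataFrame(
    [("Alice", "HR", 50000), ("Bob", "IT", 60000), ("Carol", "IT", 70000)],
    ["name", "dept", "salary"])

df.filter(df.salary > 55000) \
  .groupBy("dept") \
  .agg(F.avg("salary").alias("avg_salary")) \
  .show()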

Temp Tables


Temp tables / Views


• Provide SQL-table-like operations
• A "wrapper" around a DataFrame
• Execute ANSI SQL queries against temp tables
• Very simple, yet very powerful
• A query on a temp table generates another DataFrame
• createOrReplaceTempView: registers the DataFrame as a table within the SQL session
• The SparkSession provides the ability to execute SQL
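A PySpark sketch of the full loop: register a view, query it, and get a new DataFrame back (data is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("view-demo").getOrCreate()
df = spark.createDataFrame([("Alice", 30), ("Bob", 25)], ["name", "age"])

df.createOrReplaceTempView("people")                            # wrap the DataFrame as a table
adults = spark.sql("SELECT name FROM people WHERE age >= 30")   # the query yields another DataFrame
adults.show()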
