Hadoop Spark
Gianluca Quercini
Spark programming.
MongoDB practice.
Hadoop technologies.
Scaling.
Class material
Available online
https://tinyurl.com/p7jb5wra
Evaluation
Contact
Email: [email protected]
Email: [email protected]
Apache Spark
Main features
Speed. Run computations in memory (Hadoop relies on disks).
General-purpose. Different workloads in the same system.
Batch applications, iterative algorithms.
Interactive queries, streaming applications.
Accessibility. Python, Scala, Java, SQL and R; rich built-in libraries.
Integration. With other Big Data tools, such as Hadoop.
Spark components
[Figure: the Spark stack. Structured Streaming, Spark SQL, MLlib and GraphX sit on top of Spark Core, which runs on a cluster manager (Standalone Scheduler, YARN, Mesos or Kubernetes).]
Spark components
Spark core
Scheduling, distributing, and monitoring applications.
Data structures for manipulating data (RDDs, DataFrames).
Spark SQL
Spark’s package for working with (semi-)structured data.
Data querying with SQL and HQL (Hive Query Language).
Many sources of data: JSON, XML, Parquet...
Spark components
Structured streaming
Processing of live streams of data (e.g., real-time event logs)
Similar API to batch processing.
MLlib
Machine learning algorithms (e.g., classification, regression, clustering).
All methods designed to scale out across a cluster.
Spark components
GraphX
Manipulation of graph data.
Library with common graph algorithms (e.g., PageRank)
Cluster managers
Control how tasks are distributed across a cluster.
Spark provides its own standalone cluster manager.
Spark can also use other cluster managers.
Using Spark
Interactive mode
Using a command-line interface (CLI) or shell.
Python and Scala shells.
SparkSQL shell.
SparkR shell.
Spark is used in production by companies such as Amazon, Groupon, TripAdvisor and Yahoo!.
Spark application
Spark application: set of independent processes called executors.
Executors run computations and store the data for the application.
Executors are coordinated by the driver.
[Figure: Spark architecture. The Spark driver (holding the SparkContext) communicates with the cluster manager, which launches executors on the worker nodes; each executor has a cache and runs tasks.]
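In the PySpark shell a SparkContext named sc is already available. A minimal sketch of how a standalone driver might create one (the local master and the application name are illustrative assumptions):

from pyspark import SparkContext

# The driver creates the SparkContext, which connects to the cluster manager;
# "local[2]" runs Spark locally with 2 threads instead of on a real cluster.
sc = SparkContext(master="local[2]", appName="spark-course-example")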
Spark programming
[Figure: the opening text of Moby-Dick ("Call me Ishmael. Some years ago--never mind how long precisely--...") split into two partitions, Partition 0 and Partition 1, of an RDD.]
Creating an RDD
sc.parallelize([1, 5, 3, 2, 6, 7])
sc.textFile("hdfs://sar01:9000/data/sample_text.txt")
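A small sketch combining both ways of creating an RDD (the HDFS path is the one above; the number of partitions is illustrative):

# From a Python collection, explicitly split into 3 partitions.
numbers = sc.parallelize([1, 5, 3, 2, 6, 7], 3)
print(numbers.getNumPartitions())   # 3

# From a text file: one RDD item per line of the file.
text = sc.textFile("hdfs://sar01:9000/data/sample_text.txt")
print(text.count())                 # number of lines in the file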
RDD transformations
[Figure: an RDD split across several partitions.]
map(lambda x: x*x) applied to each partition:
Partition 0: 2 ; 5 ; 6 ; 7 ; 8 ; 11 ; 13 → 4 ; 25 ; 36 ; 49 ; 64 ; 121 ; 169
Partition 1: 4 ; 5 ; 2 ; 3 ; 4 ; 5 ; 8 → 16 ; 25 ; 4 ; 9 ; 16 ; 25 ; 64
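A minimal sketch of map, using the values of Partition 1 in the figure:

rdd = sc.parallelize([4, 5, 2, 3, 4, 5, 8])
squares = rdd.map(lambda x: x * x)   # applied element by element, partition by partition
print(squares.collect())             # [16, 25, 4, 9, 16, 25, 64]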
[Figure: splitting lines of text into words. The input lines (Lorem ipsum dolor sit amet ; consectetur adipiscing elit ; sed do eiusmod tempor incididunt ; ut labore et dolore magna aliqua ; Ut enim ad minim veniam ; quis nostrud exercitation ullamco laboris) are split on spaces: map yields one list of words per line (e.g., [Lorem, ipsum, dolor, sit, amet]), whereas flatMap flattens the lists into a single sequence of words (Lorem ; ipsum ; dolor ; sit ; amet ; ...).]
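A minimal sketch of the difference between map and flatMap, using the first two lines of the figure:

lines = sc.parallelize(["Lorem ipsum dolor sit amet",
                        "consectetur adipiscing elit"])

# map: one list of words per input line.
print(lines.map(lambda line: line.split(" ")).collect())
# [['Lorem', 'ipsum', 'dolor', 'sit', 'amet'], ['consectetur', 'adipiscing', 'elit']]

# flatMap: the per-line lists are flattened into a single sequence of words.
print(lines.flatMap(lambda line: line.split(" ")).collect())
# ['Lorem', 'ipsum', 'dolor', 'sit', 'amet', 'consectetur', 'adipiscing', 'elit']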
filter(lambda x: x>3) applied to each partition:
Partition 0: 2 ; 5 ; 6 ; 7 ; 8 ; 11 ; 13 → 5 ; 6 ; 7 ; 8 ; 11 ; 13
Partition 1: 4 ; 5 ; 2 ; 3 ; 4 ; 5 ; 8 → 4 ; 5 ; 4 ; 5 ; 8
Partition 2: 1 ; 4 ; 3 ; 2 ; 4 ; 5 ; 6 → 4 ; 4 ; 5 ; 6
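The same transformation as a minimal sketch, using the values of Partition 2 in the figure:

rdd = sc.parallelize([1, 4, 3, 2, 4, 5, 6])
kept = rdd.filter(lambda x: x > 3)   # keeps only the elements satisfying the predicate
print(kept.collect())                # [4, 4, 5, 6]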
union() takes in two RDDs and returns a new RDD containing the
items of the first and second RDD with repetitions.
[Figure: union of RDD 1 (partitions 3 ; 4 | 1 ; 5 | 4 ; 2 | 4 ; 5) and RDD 2 (partitions 10 | 12 ; 13 | 2 ; 4 | 3 ; 6): the resulting RDD contains the partitions of both input RDDs, repetitions included.]
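A minimal sketch with the two RDDs of the figure:

rdd1 = sc.parallelize([3, 4, 1, 5, 4, 2, 4, 5])
rdd2 = sc.parallelize([10, 12, 13, 2, 4, 3, 6])
print(rdd1.union(rdd2).collect())
# [3, 4, 1, 5, 4, 2, 4, 5, 10, 12, 13, 2, 4, 3, 6]  (repetitions are kept)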
distinct() returns a new RDD containing the distinct elements of the input RDD. It is a wide transformation: equal elements sitting in different partitions must be shuffled to the same output partition.
[Figure: distinct() applied to RDDs whose partitions contain duplicate values; after the shuffle, each distinct value appears in exactly one output partition.]
An element with key K is assigned to the output partition p = hashCode(K) mod n, where n is the number of partitions.
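A minimal sketch of distinct (the input values are illustrative):

rdd = sc.parallelize([4, 5, 4, 23, 12, 1, 4, 23, 11, 2], 3)
print(sorted(rdd.distinct().collect()))   # [1, 2, 4, 5, 11, 12, 23]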
intersection() takes in two RDDs and returns a new RDD containing only the elements that appear in both (without repetitions).
[Figure: intersection of RDD 1 (partitions 3 ; 4 | 1 ; 5 | 4 ; 2 | 4 ; 5) and RDD 2 (partitions 10 | 12 ; 13 | 2 ; 4 | 3 ; 6) yields 2 ; 3 ; 4.]
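A minimal sketch with the RDDs of the figure:

rdd1 = sc.parallelize([3, 4, 1, 5, 4, 2, 4, 5])
rdd2 = sc.parallelize([10, 12, 13, 2, 4, 3, 6])
print(sorted(rdd1.intersection(rdd2).collect()))   # [2, 3, 4]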
[Figure: narrow transformations (each output partition depends on a single input partition, e.g. map, filter) vs. wide transformations (each output partition may depend on several input partitions and requires a shuffle, e.g. distinct, reduceByKey).]
RDD actions
Actions trigger the execution of the transformations and return a result to the driver. If the result is a list of values, all values are sent to the driver.
reduce(lambda x, y: x+y) aggregates the elements of the RDD: each partition is reduced locally (e.g., partition 2 ; 4 ; 5 gives 2 + 4 + 5 = 11) and the partial results are then combined.
collect() returns all the elements to the driver (e.g., partition 2 ; 4 ; 5 contributes [2, 4, 5]).
count(): each partition reports its size (Partition 0: 3 ; 4 → 2, Partition 1: 1 ; 5 → 2, Partition 2: 4 ; 2 → 2, Partition 3: 2 ; 4 ; 5 → 3) and the driver sums them: 2 + 2 + 2 + 3 = 9.
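A minimal sketch of the three actions on the RDD of the count() figure:

rdd = sc.parallelize([3, 4, 1, 5, 4, 2, 2, 4, 5], 4)
print(rdd.reduce(lambda x, y: x + y))   # 30, the sum of all elements
print(rdd.collect())                    # [3, 4, 1, 5, 4, 2, 2, 4, 5]
print(rdd.count())                      # 9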
Note that a reduce call is incorrect when the return type of the reduce function (e.g., a list) differs from the type of the input RDD elements (e.g., strings): the function passed to reduce must be of type (V, V) → V.
Key-value RDDs
Key-value RDDs (a.k.a. pair RDDs) are RDDs where each item is a pair (k, v), k being the key and v being the value.
Key-value RDDs support all the transformations and actions that can
be applied on regular RDDs.
reduceByKey takes in an RDD of (K, V) pairs and a function f and returns a new RDD of (K, V) pairs where the values for each key are aggregated using f, which must be of type (V, V) → V.
[Figure: reduceByKey(lambda x, y: x+y) applied to an RDD of (animal, count) pairs spread over several partitions, e.g. ('cat', 2) ; ('owl', 3) ; ('cow', 1) ; ('tiger', 1) in Partition 0: the values of each key are summed, yielding ('cat', 7), ('owl', 7), ...]
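A minimal sketch (the pairs are illustrative, chosen so that the totals match the figure):

pairs = sc.parallelize([('cat', 2), ('owl', 3), ('cow', 1), ('tiger', 1),
                        ('cat', 5), ('owl', 4)])
totals = pairs.reduceByKey(lambda x, y: x + y)   # sums the values of each key
print(sorted(totals.collect()))
# [('cat', 7), ('cow', 1), ('owl', 7), ('tiger', 1)]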
groupByKey() takes in an RDD of (K, V) pairs and returns a new RDD of (K, Iterable(V)) pairs where all the values associated with the same key are grouped together (e.g., ('cat', [2, 2, 3])).
mapValues takes in an RDD of (K, V) pairs and a function f and returns a new RDD where f is applied to each value V (keys are not modified).
mapValues(lambda x: len(x)) applied to the grouped pairs:
('cat', [2, 2, 3]) ; ('owl', [3, 4]) ; ('cow', [1]) ; ('tiger', [1]) → ('cat', 3) ; ('owl', 2) ; ('cow', 1) ; ('tiger', 1)
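A minimal sketch chaining groupByKey and mapValues, with pairs matching the figure:

pairs = sc.parallelize([('cat', 2), ('cat', 2), ('cat', 3), ('owl', 3),
                        ('owl', 4), ('cow', 1), ('tiger', 1)])
grouped = pairs.groupByKey()                    # ('cat', <values 2, 2, 3>), ('owl', <3, 4>), ...
counts = grouped.mapValues(lambda x: len(x))    # number of values per key
print(sorted(counts.collect()))
# [('cat', 3), ('cow', 1), ('owl', 2), ('tiger', 1)]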
def word_count(input_file):
text = sc.textFile(input_file)
return text.flatMap(lambda line: line.split(" "))\
.map(lambda word: (word, 1))\
.reduceByKey(lambda x, y: x+y)
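A usage sketch (the HDFS path is the one used earlier; takeOrdered is just one way to inspect the result):

counts = word_count("hdfs://sar01:9000/data/sample_text.txt")
# Nothing has been computed yet: transformations are lazy and the
# action below triggers the whole job.
print(counts.takeOrdered(10, key=lambda pair: -pair[1]))   # the 10 most frequent words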
RDD lineage
[Figure: lineage of the word-count application, starting from the HadoopRDD created by sc.textFile(input_file) in the SparkContext (sc), and its grouping into stages.]
Sequences of narrow transformations are pipelined into a single stage. Wide transformations always trigger a new stage.
[Figure: example lineage graphs built from sc.textFile(input_file), sc.parallelize() and filter(), split into Stage 1, Stage 2 and Stage 3 at the wide transformations.]
The DAG scheduler submits the stages to the task scheduler, creating as many tasks as there are partitions in the RDD. Tasks are executed in parallel.
[Figure: in the word-count lineage, sc.textFile(input_file) and the narrow transformations form Stage 1; the wide transformation reduceByKey(lambda x, y: x+y) starts Stage 2.]
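To see the lineage and where the stage boundaries fall, one can print the RDD's debug string; a small sketch reusing word_count (toDebugString returns bytes in PySpark):

counts = word_count("hdfs://sar01:9000/data/sample_text.txt")
# The indentation in the output changes at shuffle boundaries,
# i.e. where a new stage starts (reduceByKey here).
print(counts.toDebugString().decode())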
What to do when a partition is lost?
Lost partitions can be recomputed thanks to the lineage graph.
[Figure: a lineage graph with map, union, groupBy and join transformations, which Spark replays to recompute only the lost partitions.]
Lazy evaluation
Transformations (here, filter) are lazy: Spark only records them, and the computation runs when an action (here, count) is called.
lines = sc.textFile("./data/logfile.txt")
exceptions = lines.filter(lambda line : "exception" in line)
nb_lines = exceptions.count()
print("Number of exception lines ", nb_lines)
Example
lines = sc.textFile("./data/logfile.txt")
exceptions = lines.filter(lambda line : "exception" in line)
nb_lines = exceptions.count()
exceptions.collect()
Because RDDs are not cached by default, the second action (collect) recomputes the exceptions RDD from scratch: the log file is read and filtered again.
from pyspark import StorageLevel

lines = sc.textFile("./data/logfile.txt")
exceptions = lines.filter(lambda line : "exception" in line)
exceptions.persist(StorageLevel.MEMORY_AND_DISK)
nb_lines = exceptions.count()
exceptions.collect()
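persist only marks the RDD: the first action (count) materializes the cache, which collect then reuses. A small follow-up sketch with the cache() shorthand (the "error" filter is an illustrative variant):

# cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
errors = lines.filter(lambda line: "error" in line).cache()
print(errors.count())   # first action: computes and caches the RDD
print(errors.count())   # second action: served from the cache
errors.unpersist()      # free the cached partitions when no longer needed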
References