Spark Running Notes
===================
Apache Spark is a
=================
general purpose
in-memory
compute engine

Compute Engine
==============
Spark is a replacement/alternative to MapReduce.
In-Memory
=========
HDFS -> Spark -> V1 -> V2 -> V3 -> V4 -> V5 -> HDFS
MapReduce writes every intermediate result back to HDFS and reads it again
for the next step, while Spark keeps the intermediate results (V1 to V5) in
memory and only touches HDFS for the initial read and the final write.
Spark is said to be 10 to 100 times faster than MapReduce.
General Purpose
===============
Earlier you needed a separate tool for every job: Hive for querying, Mahout
for machine learning, Sqoop for data ingestion, and MapReduce itself was
bound to only map and reduce.
With Spark you learn just one style of writing the code, and all the things
like cleaning, querying, machine learning and data ingestion can happen with
that one style, built from operations like the ones below (see the sketch
after this list):
filter
map
reduce
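A minimal sketch of that one style, assuming a tiny local list stands in for
real data (the values and variable names are illustrative): cleaning with
filter, reshaping with map, aggregating with reduce.

val nums = sc.parallelize(List(1, 2, 3, 4, 5, 6))

val cleaned = nums.filter(x => x % 2 == 0)        // keep only even numbers ("cleaning")
val squared = cleaned.map(x => x * x)             // reshape each record
val total   = squared.reduce((a, b) => a + b)     // aggregate down to one value

println(total)   // 4 + 16 + 36 = 56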
Spark Session - 2
==================
List -> a regular collection
An RDD is nothing but an in-memory, distributed collection.
1. Transformation
2. Action
Whenever you call a transformation, an entry is added to the execution plan;
nothing runs until an action is called.
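A small sketch of the transformation/action split, assuming a local List is
turned into an RDD (the data is illustrative):

val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5))    // a List becomes a distributed collection

val rdd2 = rdd1.map(x => x * 10)                  // transformation: only the plan grows
val rdd3 = rdd2.filter(x => x > 20)               // transformation: still nothing has run

rdd3.collect()                                    // action: now the plan executes
// Array(30, 40, 50)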
RDDs are:
Distributed
In-memory
RDD1
| MAP
|
RDD2
| FILTER
|
RDD3
If RDD3 is lost, Spark can rebuild it from the parent RDD using the lineage
graph, and it will quickly apply the filter transformation on RDD2 again.
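A minimal sketch of such a lineage, assuming the same map-then-filter chain;
toDebugString just prints the lineage Spark is already tracking:

val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5))
val rdd2 = rdd1.map(x => x * 10)        // RDD1 --MAP--> RDD2
val rdd3 = rdd2.filter(x => x > 20)     // RDD2 --FILTER--> RDD3

// If a partition of rdd3 is lost, Spark re-runs the filter on rdd2
// (and, if needed, the map on rdd1) using this lineage.
println(rdd3.toDebugString)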
Immutable
==========
why immutable?
rdd1.print(line1)
rdd1.print(line1)
Consider that after applying the filter we are just interested in 5 records.
But consider the fact that Spark is lazy: it builds the plan first and only
does the work the action actually asks for.
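A sketch tying the two ideas together (the data is illustrative): the filter
gives back a new RDD instead of changing rdd1, and nothing actually runs
until an action asks for the records.

val rdd1 = sc.parallelize(List("spark", "is", "very", "interesting", "spark"))

// Immutable: filter does not modify rdd1, it returns a brand new RDD.
// Lazy: this line only adds a step to the plan, nothing runs yet.
val rdd2 = rdd1.filter(word => word.startsWith("s"))

rdd1.collect()   // Array(spark, is, very, interesting, spark)  -- unchanged
rdd2.take(5)     // the action triggers the work, and we only ask for 5 records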
=======
spark-shell (scala)
pyspark (python)
sc is nothing but the SparkContext; both shells create it for you at startup.
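For example, inside spark-shell (a sketch; the exact output differs by version):

scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@...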
Array(spark, is, very, interesting, spark, is, in, memory, compute, engine)

spark          (spark,1)
is             (is,1)
very           (very,1)
interesting    (interesting,1)
spark          (spark,1)
is
in
In a map, if we have n inputs then we will definitely have n outputs.
val rdd3 = rdd2.map(x => (x,1))
(spark,4)
(is,1)
(very,1)
(interesting,2)
rdd4.collect()
spark with scala code
=======================
val rdd1 = sc.textFile("/user/cloudera/sparkinput/file1")
rdd4.collect()
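Putting the pieces from above together, a minimal word-count sketch (the
flatMap and reduceByKey steps are implied by the earlier outputs; the path
is the one used above):

val rdd1 = sc.textFile("/user/cloudera/sparkinput/file1")

val rdd2 = rdd1.flatMap(line => line.split(" "))   // one line in, many words out
val rdd3 = rdd2.map(word => (word, 1))             // each word becomes (word, 1)
val rdd4 = rdd3.reduceByKey((x, y) => x + y)       // add up the 1s per word

rdd4.collect()
// or write the result back: rdd4.saveAsTextFile("<hdfs path>")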
pyspark
=========
rdd1 = sc.textFile("file:///home/cloudera/file1")
rdd4.collect()
rdd4.saveAsTextFile("<hdfs path>")
input            -> map
44,8602,37.19       44,37.19
35,5368,65.89       35,65.89
2,3391,40.64        2,40.64

output after aggregating by key over the full file:
(44,94)
(35,165)
(2,40)

map
collect
x = "44,8602,37.19"
x.split(",")

Here the 1st element can be treated as the key and the second element as the
value.
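A sketch of that flow, assuming lines like 44,8602,37.19 and reading the
final numbers above as per-key totals (the path is illustrative):

val orders = sc.textFile("/user/cloudera/sparkinput/file3")

// Split each line; keep the 1st field as the key and the 3rd as the value
val pairs = orders.map { line =>
  val fields = line.split(",")
  (fields(0), fields(2).toFloat)        // "44,8602,37.19" -> (44, 37.19)
}

val totals = pairs.reduceByKey((x, y) => x + y)    // add the values per key
totals.collect()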
Spark practical - 5
====================
3
3
1
2
1
3
2
4
(3,1)
(3,1)
(1,1)
(2,1)
(1,1)
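A sketch of this practical, assuming the numbers sit one per line in a file
(the path is illustrative): each number becomes (n, 1) and reduceByKey adds
the 1s up.

val nums = sc.textFile("/user/cloudera/sparkinput/numbers.txt")

val pairs  = nums.map(n => (n, 1))                 // 3 -> (3,1), 1 -> (1,1), ...
val counts = pairs.reduceByKey((x, y) => x + y)    // (3,3), (1,2), (2,2), (4,1)

counts.collect()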
Spark practical - 6
====================
input            output (average per key)
33,100
33,200     ->    33,200
33,300

42,200
42,400
42,500
42,700
Step 1 - map the raw record to (key, value), taking the 3rd field as the key
and the 4th field as the value:
input  0,Will,33,385
output (33,385)

Step 2 - map each value to (value, 1), so a count travels along with the sum:
// input
// (33,100)
// (33,200)
// (33,300)
here x._1 is 33 and x._2 is 100
// output
// (33,(100,1))
// (33,(200,1))
// (33,(300,1))

Step 3 - reduceByKey: for two values x and y of the same key, add the sums
(x._1 + y._1) and add the counts (x._2 + y._2):
// input
// (33,(100,1))
// (33,(200,1))
// (33,(300,1))
// output
// (33,(600,3))
// (34,(800,4))

Step 4 - map each (sum, count) pair to the average, sum/count:
// input
// (33,(600,3))
// (34,(800,4))
// output
(33,200)
(34,200)
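Putting the four steps together, a minimal sketch in Scala, assuming records
like 0,Will,33,385 (the path is illustrative; the field positions come from
the sample record above):

val rdd1 = sc.textFile("/user/cloudera/sparkinput/file2")

// Step 1: (key, value) from the raw record, e.g. (33, 385.0)
val rdd2 = rdd1.map { line =>
  val fields = line.split(",")
  (fields(2), fields(3).toDouble)
}

// Step 2: (key, (value, 1)) so a count travels along with the sum
val rdd3 = rdd2.map(x => (x._1, (x._2, 1)))

// Step 3: add the values and add the counts per key -> (key, (sum, count))
val rdd4 = rdd3.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))

// Step 4: (key, sum / count), i.e. the average per key
val rdd5 = rdd4.map(x => (x._1, x._2._1 / x._2._2))

rdd5.collect()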
Spark practical - 7
===================
1,11