Cloudera Spark
Chapter 3

© Copyright 2014 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
Course Chapters

Course Introduction
- Introduction

Introduction to Spark
- Why Spark?
- Spark Basics (current chapter)
- Working With RDDs

Distributed Data Processing with Spark
- The Hadoop Distributed File System (HDFS)
- Running Spark on a Cluster
- Parallel Programming with Spark
- Caching and Persistence
- Writing Spark Applications
- Spark Streaming

Solving Business Problems with Spark
- Common Patterns in Spark Programming
- Improving Spark Performance
- Spark, Hadoop, and the Enterprise Data Center

Course Conclusion
- Conclusion
Spark Basics

In this chapter you will learn
- How to start the Spark Shell
- About the SparkContext
- Key concepts of Resilient Distributed Datasets (RDDs)
  - What are they?
  - How do you create them?
  - What operations can you perform with them?
- How Spark uses the principles of functional programming
- About the Hands-On Exercises for the course
Chapter Topics

Spark Basics (Introduction to Spark)

- What is Apache Spark?
- Using the Spark Shell
- RDDs (Resilient Distributed Datasets)
- Functional Programming in Spark
- Conclusion
- Hands-On Exercises
What is Apache Spark?

- Apache Spark is a fast and general engine for large-scale data processing
- Written in Scala
  - Functional programming language that runs in a JVM
- Spark Shell
  - Interactive, for learning or data exploration
  - Python or Scala
- Spark Applications
  - For large-scale data processing
  - Python, Scala, or Java
Chapter Topics

Spark Basics (Introduction to Spark)

- What is Apache Spark?
- Using the Spark Shell
- RDDs (Resilient Distributed Datasets)
- Functional Programming in Spark
- Conclusion
- Hands-On Exercises
Spark Shell

- The Spark Shell provides interactive data exploration (REPL)
- Writing Spark applications without the shell will be covered later

Python Shell: pyspark

$ pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.0.0
      /_/

Using Python version 2.6.6 (r266:84292, Jan 22 2014 09:42:36)
SparkContext available as sc.
>>>

Scala Shell: spark-shell

$ spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.0.0
      /_/

Using Scala version 2.10.3 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51)
Created spark context..
Spark context available as sc.
scala>

REPL: Read/Evaluate/Print Loop
Spark Context

- Every Spark application requires a Spark Context
  - The main entry point to the Spark API
- Spark Shell provides a preconfigured Spark Context called sc

scala> sc.appName
res0: String = Spark shell
Chapter Topics

Spark Basics (Introduction to Spark)

- What is Apache Spark?
- Using the Spark Shell
- RDDs (Resilient Distributed Datasets)
- Functional Programming in Spark
- Conclusion
- Hands-On Exercise: Getting Started with RDDs
RDD (Resilient Distributed Dataset)

- RDD (Resilient Distributed Dataset)
  - Resilient: if data in memory is lost, it can be recreated
  - Distributed: stored in memory across the cluster
  - Dataset: initial data can come from a file or be created programmatically
- RDDs are the fundamental unit of data in Spark
- Most Spark programming consists of performing operations on RDDs
Creating an RDD

- Three ways to create an RDD
  - From a file or set of files
  - From data in memory
  - From another RDD
File-Based RDDs

- For file-based RDDs, use SparkContext.textFile
  - Accepts a single file, a wildcard list of files, or a comma-separated list of files
  - Examples:
    - sc.textFile("myfile.txt")
    - sc.textFile("mydata/*.log")
    - sc.textFile("myfile1.txt,myfile2.txt")
  - Each line in the file(s) is a separate record in the RDD
- Files are referenced by absolute or relative URI
  - Absolute URI: file:/home/training/myfile.txt
  - Relative URI (uses default file system): myfile.txt
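To make the line-per-record behavior concrete, here is a plain-Python sketch (not the Spark API, and no cluster involved): like textFile, it treats each line of a file as one record.

```python
import tempfile

# Plain-Python analogue of textFile's line-per-record behavior:
# read a file and return a list with one record per line.
def lines_as_records(path):
    with open(path) as f:
        return [line.rstrip("\n") for line in f]

# Write a small sample file, then load it as records.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("I've never seen a purple cow.\nI never hope to see one;\n")
    path = f.name

records = lines_as_records(path)
print(len(records))  # one record per line
```

In Spark the result would be a distributed RDD of strings rather than a local list, but the record boundaries are the same.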
Example: A File-Based RDD

File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

> mydata = sc.textFile("purplecow.txt")
RDD Operations

- Two types of RDD operations
  - Actions: return values
  - Transformations: define a new RDD based on the current one(s)
- Pop quiz: which type of operation is count()?
RDD Operations: Actions

- Some common actions
  - count(): return the number of elements
  - take(n): return an array of the first n elements
  - collect(): return an array of all elements
  - saveAsTextFile(file): save to text file(s)
RDD Operations: Transformations

- Transformations create a new RDD from an existing one
- RDDs are immutable
  - Data in an RDD is never changed
  - Transform in sequence to modify the data as needed
- Some common transformations
  - map(function): creates a new RDD by performing a function on each record in the base RDD
  - filter(function): creates a new RDD by including or excluding each record in the base RDD according to a boolean function
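The semantics of map and filter can be sketched in plain Python (this is an illustration, not the Spark API): map transforms every record, filter keeps records for which a boolean function returns True.

```python
# Plain-Python analogue of the map and filter transformations.
lines = [
    "I've never seen a purple cow.",
    "I never hope to see one;",
    "But I can tell you, anyhow,",
    "I'd rather see than be one.",
]

upper = [line.upper() for line in lines]                 # like map()
starts_with_i = [l for l in upper if l.startswith("I")]  # like filter()

print(len(starts_with_i))  # 3 of the 4 lines start with 'I'
```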
Example: map and filter Transformations

I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.
Lazy Execution (1)

- Data in RDDs is not processed until an action is performed

File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

>

Lazy Execution (2)

> mydata = sc.textFile("purplecow.txt")

RDD: mydata (defined, not yet materialized)

Lazy Execution (3)

> mydata_uc = mydata.map(lambda line: line.upper())

RDD: mydata_uc (defined, not yet materialized)

Lazy Execution (4)

> mydata_filt = mydata_uc.filter(lambda line: line.startswith('I'))

RDD: mydata_filt (defined, not yet materialized)
Lazy Execution (5)

- Data in RDDs is not processed until an action is performed

File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

> mydata = sc.textFile("purplecow.txt")
> mydata_uc = mydata.map(lambda line:
    line.upper())
> mydata_filt = \
    mydata_uc.filter(lambda line: \
    line.startswith('I'))
> mydata_filt.count()
3

RDD: mydata
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

RDD: mydata_uc
I'VE NEVER SEEN A PURPLE COW.
I NEVER HOPE TO SEE ONE;
BUT I CAN TELL YOU, ANYHOW,
I'D RATHER SEE THAN BE ONE.

RDD: mydata_filt
I'VE NEVER SEEN A PURPLE COW.
I NEVER HOPE TO SEE ONE;
I'D RATHER SEE THAN BE ONE.
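Lazy execution can be sketched in plain Python with generators (an analogy only, not how Spark is implemented): a generator expression builds a recipe but does no work until something consumes it, just as transformations do nothing until an action runs.

```python
# Records which lines actually get processed, so laziness is observable.
processed = []

def to_upper(line):
    processed.append(line)
    return line.upper()

lines = ["I've never seen a purple cow.", "I never hope to see one;"]

# Like a transformation: defines the computation, processes nothing yet.
lazy_upper = (to_upper(l) for l in lines)
assert processed == []

# Like an action: consuming the generator drives the actual work.
count = sum(1 for _ in lazy_upper)
assert count == 2 and len(processed) == 2
```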
Chaining Transformations

- Transformations may be chained together

> sc.textFile("purplecow.txt") \
    .map(lambda line: line.upper()) \
    .filter(lambda line: line.startswith('I')) \
    .count()
3

is exactly equivalent to

> mydata = sc.textFile("purplecow.txt")
> mydata_uc = mydata.map(lambda line: line.upper())
> mydata_filt = mydata_uc.filter(lambda line: line.startswith('I'))
> mydata_filt.count()
3
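Why the two forms are equivalent: each transformation simply returns a new RDD object, so the intermediate variables are optional. The MiniRDD class below is a made-up illustration of that pattern (it is not Spark's API and runs locally on a list).

```python
# A minimal RDD-like wrapper: each call returns a new wrapper,
# so calls can be chained or assigned to intermediate variables.
class MiniRDD:
    def __init__(self, records):
        self.records = list(records)

    def map(self, fn):
        return MiniRDD(fn(r) for r in self.records)

    def filter(self, fn):
        return MiniRDD(r for r in self.records if fn(r))

    def count(self):
        return len(self.records)

lines = ["I've never seen a purple cow.", "I never hope to see one;",
         "But I can tell you, anyhow,", "I'd rather see than be one."]

# Chained form:
chained = MiniRDD(lines).map(str.upper) \
                        .filter(lambda l: l.startswith("I")).count()

# Step-by-step form:
mydata = MiniRDD(lines)
mydata_uc = mydata.map(str.upper)
mydata_filt = mydata_uc.filter(lambda l: l.startswith("I"))

assert chained == mydata_filt.count() == 3
```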
Chapter Topics

Spark Basics (Introduction to Spark)

- What is Apache Spark?
- Using the Spark Shell
- RDDs (Resilient Distributed Datasets)
- Functional Programming in Spark
- Conclusion
- Hands-On Exercises
Functional Programming in Spark

- Spark depends heavily on the concepts of functional programming
  - Functions are the fundamental unit of programming
  - Functions have input and output only
    - No state or side effects
- Key concepts
  - Passing functions as input to other functions
  - Anonymous functions
Passing Functions as Parameters

- Many RDD operations take functions as parameters
- Pseudocode for the RDD map operation
  - Applies function fn to each record in the RDD

RDD {
  map(fn(x)) {
    foreach record in rdd
      emit fn(record)
  }
}
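The pseudocode above can be written out in plain Python to show what "taking a function as a parameter" means: the map operation receives fn as an argument and emits fn(record) for every record.

```python
# Plain-Python version of the map pseudocode above.
def rdd_map(records, fn):
    result = []
    for record in records:
        result.append(fn(record))  # emit fn(record)
    return result

def square(x):
    return x * x

print(rdd_map([1, 2, 3], square))  # [1, 4, 9]
```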
Example: Passing Named Functions

- Python:

> def toUpper(s):
      return s.upper()
> mydata.map(toUpper).take(2)

- Scala:

> def toUpper(s: String) = { s.toUpperCase }
> mydata.map(toUpper).take(2)
Anonymous Functions

- Functions defined in-line without an identifier
  - Best for short, one-off functions
- Supported in many programming languages
  - Python: lambda x: ...
  - Scala: x => ...
  - Java 8: x -> ...
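In Python, a lambda is simply a function value without a name, so a named function and an equivalent lambda are interchangeable when passed as a parameter:

```python
# A named function and an anonymous (lambda) function behave identically
# when passed to a higher-order function such as map.
def to_upper(s):
    return s.upper()

words = ["purple", "cow"]

named = list(map(to_upper, words))
anonymous = list(map(lambda s: s.upper(), words))

assert named == anonymous == ["PURPLE", "COW"]
```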
Example: Passing Anonymous Functions

- Python:

> mydata.map(lambda line: line.upper()).take(2)

- Scala:

> mydata.map(line => line.toUpperCase()).take(2)

OR

> mydata.map(_.toUpperCase()).take(2)

Scala allows anonymous parameters using underscore (_)
Example: Java

Java 7:

...
JavaRDD<String> lines = sc.textFile("file");
JavaRDD<String> lines_uc = lines.map(
  new MapFunction<String, String>() {
    public String call(String line) {
      return line.toUpperCase();
    }
  });
...

Java 8:

...
JavaRDD<String> lines = sc.textFile("file");
JavaRDD<String> lines_uc = lines.map(
  line -> line.toUpperCase());
...
Chapter Topics

Spark Basics (Introduction to Spark)

- What is Apache Spark?
- Using the Spark Shell
- RDDs (Resilient Distributed Datasets)
- Functional Programming in Spark
- Conclusion
- Hands-On Exercises
Key Points

- Spark can be used interactively via the Spark Shell
  - Python or Scala
  - Writing non-interactive Spark applications will be covered later
- RDDs (Resilient Distributed Datasets) are a key concept in Spark
- RDD operations
  - Transformations create a new RDD based on an existing one
  - Actions return a value from an RDD
- Lazy execution
  - Transformations are not executed until required by an action
- Spark uses functional programming
  - Passing functions as parameters
  - Anonymous functions in supported languages (Python and Scala)
Chapter Topics

Spark Basics (Introduction to Spark)

- What is Apache Spark?
- Using the Spark Shell
- RDDs (Resilient Distributed Datasets)
- Functional Programming in Spark
- Conclusion
- Hands-On Exercises
Introduction to Exercises: Getting Started

- Instructions are in the Hands-On Exercise Manual
- Start with
  - General Notes
  - Setting Up
Introduction to Exercises: Pick Your Language

- Your choice: Python or Scala
  - For most exercises in this course, you may choose to work with either Python or Scala
  - Exception: Spark Streaming is currently available only in Scala
  - Course examples are mostly presented in Python
- Solution and example files
  - .pyspark: Python shell commands
  - .scalaspark: Scala shell commands
  - .py: complete Python Spark applications
  - .scala: complete Scala Spark applications
Introduction to Exercises: Classroom Virtual Machine

- Your virtual machine
  - Log in as user training (password training)
  - Pre-installed and configured with
    - Spark and CDH (Cloudera's Distribution, including Apache Hadoop)
    - Various tools including Emacs, IntelliJ, and Maven
- Training materials: ~/training_materials/sparkdev folder on the VM
  - data: sample datasets used in exercises
  - examples: all the example code in this course
  - solutions: solutions for Scala Shell and Python exercises
  - stubs: starter code required in some exercises
Introduction to Exercises: The Data

- Most exercises are based around a hypothetical company: Loudacre Mobile
  - A cellular telephone company
- Loudacre Mobile Customer Support has many sources of data they need to process, transform, and analyze
  - Customer account data
  - Web server logs from Loudacre's customer support website
  - New device activation records
  - Customer support Knowledge Base articles
  - Information about models of supported devices
Hands-On Exercises

- Now, please do the following three Hands-On Exercises
  1. Viewing the Spark Documentation
     - Familiarize yourself with the Spark documentation; you will refer to this documentation frequently during the course
  2. Using the Spark Shell
     - Follow the instructions for either the Python or Scala shell
  3. Getting Started with RDDs
     - Use either the Python or Scala Spark Shell to explore the Loudacre weblogs
- Please refer to the Hands-On Exercise Manual
Working With RDDs
Chapter 4
Course Chapters

Course Introduction
- Introduction

Introduction to Spark
- What is Apache Spark?
- Spark Basics
- Working With RDDs (current chapter)

Distributed Data Processing with Spark
- The Hadoop Distributed File System (HDFS)
- Running Spark on a Cluster
- Parallel Programming with Spark
- Caching and Persistence
- Writing Spark Applications
- Spark Streaming

Solving Business Problems with Spark
- Common Patterns in Spark Programming
- Improving Spark Performance
- Spark, Hadoop, and the Enterprise Data Center

Course Conclusion
- Conclusion
Working With RDDs

In this chapter you will learn
- How RDDs are created
- Additional RDD operations
- Special operations available on RDDs of key-value pairs
- How MapReduce algorithms are implemented in Spark
Chapter Topics

Working With RDDs (Spark Development)

- A Closer Look at RDDs
- Key-Value Pair RDDs
- MapReduce
- Other Pair RDD Operations
- Conclusion
- Hands-On Exercise: Working with Pair RDDs
RDDs

- RDDs can hold any type of element
  - Primitive types: integers, characters, booleans, etc.
  - Sequence types: strings, lists, arrays, tuples, dicts, etc. (including nested data types)
  - Scala/Java objects (if serializable)
  - Mixed types
- Some types of RDDs have additional functionality
  - Pair RDDs
    - RDDs consisting of key-value pairs
  - Double RDDs
    - RDDs consisting of numeric data
Creating RDDs from Collections

- You can create RDDs from collections instead of files
  - sc.parallelize(collection)

> randomnumlist = \
    [random.uniform(0,10) for _ in xrange(10000)]
> randomrdd = sc.parallelize(randomnumlist)
> print "Mean: %f" % randomrdd.mean()

- Useful when
  - Testing
  - Generating data programmatically
  - Integrating
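As a plain-Python check of what the example above computes (no Spark required), the same collection can be built programmatically and its mean taken locally; in Spark, sc.parallelize would distribute the list and randomrdd.mean() would do this across the cluster.

```python
import random

# Build a collection programmatically, then compute its mean locally.
# For uniform(0, 10) the mean converges toward 5.0.
randomnumlist = [random.uniform(0, 10) for _ in range(10000)]
mean = sum(randomnumlist) / len(randomnumlist)
print("Mean: %f" % mean)
```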
Some Other General RDD Operations

- Transformations
  - flatMap: maps one element in the base RDD to multiple elements
  - distinct: filter out duplicates
  - union: add all elements of two RDDs into a single new RDD
- Other RDD operations
  - first: return the first element of the RDD
  - foreach: apply a function to each element in an RDD
  - top(n): return the largest n elements using natural ordering
- Sampling operations
  - sample(percent): create a new RDD with a sampling of elements
  - takeSample(percent): return an array of sampled elements
- Double RDD operations
  - Statistical functions, e.g., mean, sum, variance, stdev
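Three of the transformations above have simple plain-Python analogues (illustrations only, not the Spark API), which make the one-to-many nature of flatMap clear:

```python
# Plain-Python analogues of flatMap, distinct, and union.
lines = ["the cat sat", "the mat"]

# flatMap: one input record produces multiple output records
words = [w for line in lines for w in line.split()]
assert words == ["the", "cat", "sat", "the", "mat"]

# distinct: filter out duplicates (Spark does not guarantee order)
assert sorted(set(words)) == ["cat", "mat", "sat", "the"]

# union: all elements of two datasets in a single new one
assert ["a", "b"] + ["b", "c"] == ["a", "b", "b", "c"]
```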
Example: flatMap and distinct

Python:
> sc.textFile(file) \
    .flatMap(lambda line: line.split()) \
    .distinct()

Scala:
> sc.textFile(file)
    .flatMap(line => line.split("\\W"))
    .distinct()

Input:
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,

Result (first few distinct words):
I've
never
seen
a
…
Chapter Topics

Working With RDDs (Spark Development)

- A Closer Look at RDDs
- Key-Value Pair RDDs
- MapReduce
- Other Pair RDD Operations
- Conclusion
- Hands-On Exercise: Working with Pair RDDs
Pair RDDs

- Pair RDDs are a special form of RDD
  - Each element must be a key-value pair (a two-element tuple)
  - Keys and values can be any type
- Why?
  - Use with MapReduce algorithms
  - Many additional functions are available for common data processing needs
    - e.g., sorting, joining, grouping, counting, etc.

Pair RDD:
(key1,value1)
(key2,value2)
(key3,value3)
…
Creating Pair RDDs

- The first step in most workflows is to get the data into key/value form
  - What should the RDD be keyed on?
  - What is the value?
- Commonly used functions to create Pair RDDs
  - map
  - flatMap / flatMapValues
  - keyBy
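Of the three, keyBy is the simplest: it computes a key from each record and keeps the whole record as the value. A plain-Python analogue (not the Spark API) looks like this:

```python
# Plain-Python analogue of keyBy: (computed key, whole record) pairs.
def key_by(records, fn):
    return [(fn(r), r) for r in records]

users = ["user001\tFred Flintstone", "user090\tBugs Bunny"]
pairs = key_by(users, lambda line: line.split("\t")[0])

assert pairs[0] == ("user001", "user001\tFred Flintstone")
```

Note that unlike map, keyBy does not discard anything: the value is the original record, key included.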
Example: A Simple Pair RDD

- Example: Create a Pair RDD from a tab-separated file

Input:
user001	Fred Flintstone
user090	Bugs Bunny
user111	Harry Potter
…

Pair RDD:
(user001,Fred Flintstone)
(user090,Bugs Bunny)
(user111,Harry Potter)
…
Example: Keying Web Logs by User ID

Python:
> sc.textFile(logfile) \
    .keyBy(lambda line: line.split(' ')[2])

Scala:
> sc.textFile(logfile) \
    .keyBy(line => line.split(' ')(2))

User ID is the third field:
56.38.234.188 - 99788 "GET /KBDOC-00157.html HTTP/1.0" …
56.38.234.188 - 99788 "GET /theme.css HTTP/1.0" …
203.146.17.59 - 25254 "GET /KBDOC-00230.html HTTP/1.0" …
…
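The key function above just splits each log line on spaces and takes the third field (index 2). Checked in plain Python against one of the sample lines:

```python
# The same expression the keyBy lambda uses, applied to one log line.
line = '56.38.234.188 - 99788 "GET /KBDOC-00157.html HTTP/1.0"'
user_id = line.split(' ')[2]
assert user_id == "99788"
```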
Question 1: Pairs with Complex Values

- How would you do this?
  - Input: a list of postal codes with latitude and longitude
  - Output: postal code (key) and lat/long pair (value)

Input:
00210 43.005895 -71.013202
00211 43.005895 -71.013202
00212 43.005895 -71.013202
00213 43.005895 -71.013202
00214 43.005895 -71.013202
…

Desired Pair RDD:
(00210,(43.005895,-71.013202))
(00211,(43.005895,-71.013202))
(00212,(43.005895,-71.013202))
(00213,(43.005895,-71.013202))
…
Answer 1: Pairs with Complex Values

> sc.textFile(file) \
    .map(lambda line: line.split()) \
    .map(lambda fields: (fields[0],(fields[1],fields[2])))

Input:
00210 43.005895 -71.013202
00211 43.005895 -71.013202
00212 43.005895 -71.013202
00213 43.005895 -71.013202
00214 43.005895 -71.013202
…

Result:
(00210,(43.005895,-71.013202))
(00211,(43.005895,-71.013202))
(00212,(43.005895,-71.013202))
(00213,(43.005895,-71.013202))
…
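The two map steps above can be traced in plain Python on a single input line: the first split produces the fields, the second builds a (key, (lat, lon)) pair with a nested tuple as the value.

```python
# The same two map steps, applied locally to one input line.
line = "00210 43.005895 -71.013202"
fields = line.split()                       # first map: line -> fields
pair = (fields[0], (fields[1], fields[2]))  # second map: fields -> pair
assert pair == ("00210", ("43.005895", "-71.013202"))
```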
Question 2: Mapping Single Rows to Multiple Pairs (1)

- How would you do this?
  - Input: order numbers with a list of SKUs in the order
  - Output: order (key) and sku (value)

Input Data:
00001	sku010:sku933:sku022
00002	sku912:sku331
00003	sku888:sku022:sku010:sku594
00004	sku411

Pair RDD:
(00001,sku010)
(00001,sku933)
(00001,sku022)
(00002,sku912)
(00002,sku331)
(00003,sku888)
…
Question 2: Mapping Single Rows to Multiple Pairs (2)

- Hint: map alone won't work; it produces one pair per line, not one pair per SKU

00001	sku010:sku933:sku022
00002	sku912:sku331
00003	sku888:sku022:sku010:sku594
00004	sku411

(00001,(sku010,sku933,sku022))
(00002,(sku912,sku331))
(00003,(sku888,sku022,sku010,sku594))
(00004,(sku411))
Answer 2: Mapping Single Rows to Multiple Pairs (1)

> sc.textFile(file)

00001	sku010:sku933:sku022
00002	sku912:sku331
00003	sku888:sku022:sku010:sku594
00004	sku411
Answer 2: Mapping Single Rows to Multiple Pairs (2)

> sc.textFile(file) \
    .map(lambda line: line.split('\t'))

Input:
00001	sku010:sku933:sku022
00002	sku912:sku331
00003	sku888:sku022:sku010:sku594
00004	sku411

Result:
[00001,sku010:sku933:sku022]
[00002,sku912:sku331]
[00003,sku888:sku022:sku010:sku594]
[00004,sku411]
Answer 2: Mapping Single Rows to Multiple Pairs (3)

> sc.textFile(file) \
    .map(lambda line: line.split('\t')) \
    .map(lambda fields: (fields[0],fields[1]))

Input:
00001	sku010:sku933:sku022
00002	sku912:sku331
00003	sku888:sku022:sku010:sku594
00004	sku411

Result:
(00001,sku010:sku933:sku022)
(00002,sku912:sku331)
(00003,sku888:sku022:sku010:sku594)
(00004,sku411)
Answer 2: Mapping Single Rows to Multiple Pairs (4)

> sc.textFile(file) \
    .map(lambda line: line.split('\t')) \
    .map(lambda fields: (fields[0],fields[1])) \
    .flatMapValues(lambda skus: skus.split(':'))

Input:
00001	sku010:sku933:sku022
00002	sku912:sku331
00003	sku888:sku022:sku010:sku594
00004	sku411

Result:
(00001,sku010)
(00001,sku933)
(00001,sku022)
(00002,sku912)
(00002,sku331)
(00003,sku888)
…
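The flatMapValues step is what turns each order into several pairs: the function splits the value into several values, and the key is repeated for each one. A plain-Python analogue (not the Spark API):

```python
# Plain-Python analogue of flatMapValues: the key is repeated for
# every value the function produces from the original value.
def flat_map_values(pairs, fn):
    return [(k, v) for k, value in pairs for v in fn(value)]

orders = [("00001", "sku010:sku933:sku022"), ("00002", "sku912:sku331")]
result = flat_map_values(orders, lambda skus: skus.split(":"))

assert result == [("00001", "sku010"), ("00001", "sku933"),
                  ("00001", "sku022"), ("00002", "sku912"),
                  ("00002", "sku331")]
```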
Chapter Topics

Working With RDDs (Spark Development)

- A Closer Look at RDDs
- Key-Value Pair RDDs
- MapReduce
- Other Pair RDD Operations
- Conclusion
- Hands-On Exercise: Working with Pair RDDs
MapReduce

- MapReduce is a common programming model
  - Easily applicable to distributed processing of large data sets
- Hadoop MapReduce is the major implementation
  - Somewhat limited
    - Each job has one Map phase, one Reduce phase
    - Job output is saved to files
- Spark implements MapReduce with much greater flexibility
  - Map and Reduce functions can be interspersed
  - Results stored in memory
  - Operations can easily be chained
MapReduce in Spark

- MapReduce in Spark works on Pair RDDs
- Map phase
  - Operates on one record at a time
  - "Maps" each record to one or more new records
  - map and flatMap
- Reduce phase
  - Works on Map output
  - Consolidates multiple records
  - reduceByKey
MapReduce Example: Word Count

Input Data:
the cat sat on the mat
the aardvark sat on the sofa

Result:
aardvark 1
cat 1
mat 1
on 2
sat 2
sofa 1
the 4
Example: Word Count

> counts = sc.textFile(file) \
    .flatMap(lambda line: line.split()) \
    .map(lambda word: (word,1)) \
    .reduceByKey(lambda v1,v2: v1+v2)

ReduceByKey

- reduceByKey combines the values for each key using the supplied function, two values at a time
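The same flatMap / map / reduceByKey logic can be run in plain Python on a single machine (no cluster, a dict standing in for the Pair RDD), producing exactly the result table above:

```python
# Word count in plain Python: flatMap the lines into words, map each
# word to a count of 1, reduce by key by summing the counts.
lines = ["the cat sat on the mat", "the aardvark sat on the sofa"]

counts = {}
for line in lines:              # flatMap: lines -> words
    for word in line.split():   # map: word -> (word, 1)
        counts[word] = counts.get(word, 0) + 1  # reduceByKey: sum

assert counts == {"the": 4, "cat": 1, "sat": 2, "on": 2,
                  "mat": 1, "aardvark": 1, "sofa": 1}
```

What a cluster adds is not different logic but partitioning: each node counts its own slice of the data, and reduceByKey merges the per-node counts.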
Word Count Recap (the Scala Version)

> val counts = sc.textFile(file).
    flatMap(line => line.split("\\W")).
    map(word => (word,1)).
    reduceByKey((v1,v2) => v1+v2)

OR

> val counts = sc.textFile(file).
    flatMap(_.split("\\W")).
    map((_,1)).
    reduceByKey(_+_)
Why Do We Care About Counting Words?

- Word count is challenging over massive amounts of data
  - Using a single compute node would be too time-consuming
  - Number of unique words could exceed available memory
- Statistics are often simple aggregate functions
  - Distributive in nature
  - e.g., max, min, sum, count
- MapReduce breaks complex tasks down into smaller elements which can be executed in parallel
- Many common tasks are very similar to word count
  - e.g., log file analysis