
Spark"Basics"

Chapter"3"

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 3"1$
Course"Chapters"
!! IntroducDon" Course"IntroducDon"
!! Why"Spark?"
!! Spark$Basics$ Introduc.on$to$Spark$
!! Working"With"RDDs"

!! The"Hadoop"Distributed"File"System"(HDFS)"
!! Running"Spark"on"a"Cluster"
Distributed"Data"Processing""
!! Parallel"Programming"with"Spark"
with"Spark"
!! Caching"and"Persistence"
!! WriDng"Spark"ApplicaDons"

!! Spark"Streaming"
!! Common"Pa=erns"in"Spark"Programming" Solving"Business"Problems""
!! Improving"Spark"Performance" with"Spark"
!! Spark,"Hadoop,"and"the"Enterprise"Data"Center"

!! Conclusion" Course"Conclusion"

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 3"2$
Spark"Basics"

In$this$chapter$you$will$learn$
! How$to$start$the$Spark$Shell$
! About$the$SparkContext$
! Key$Concepts$of$Resilient$Distributed$Datasets$(RDDs)$
– What"are"they?"
– How"do"you"create"them?"
– What"operaDons"can"you"perform"with"them?"
! How$Spark$uses$the$principles$of$func.onal$programming$
! About$the$Hands"On$Exercises$for$the$course$

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 3"3$
Chapter"Topics"

Spark$Basics$ Introduc.on$to$Spark$

!! What$is$Apache$Spark?$
!! Using"the"Spark"Shell"
!! RDDs"(Resilient"Distributed"Datasets)"
!! FuncDonal"Programming"in"Spark"
!! Conclusion"
!! HandsUOn"Exercises"

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 3"4$
What"is"Apache"Spark?"

! Apache$Spark$is$a$fast$and$general$engine$for$large"scale$
data$processing$
! WriNen$in$Scala$
– FuncDonal"programming"language"that"runs"in"a"JVM"
! Spark$Shell$
– InteracDve"–"for"learning"or"data"exploraDon"
– Python"or"Scala"
! Spark$Applica.ons$
– For"large"scale"data"processing"
– Python,"Scala,"or"Java"

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 3"5$
Chapter"Topics"

Spark$Basics$ Introduc.on$to$Spark$

!! What"is"Apache"Spark?""
!! Using$the$Spark$Shell$
!! RDDs"(Resilient"Distributed"Datasets)"
!! FuncDonal"Programming"in"Spark"
!! Conclusion"
!! HandsUOn"Exercises"

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 3"6$
Spark"Shell"

! The$Spark$Shell$provides$interac.ve$data$explora.on$(REPL)
! Wri.ng$Spark$applica.ons$without$the$shell$will$be$covered$later$

Python"Shell:"pyspark Scala"Shell:"spark-shell
$ pyspark $ spark-shell

Welcome to
Welcome to
____ __
/ __/__ ___ _____/ /__ ____ __
_\ \/ _ \/ _ `/ __/ '_/ / __/__ ___ _____/ /__
/__ / .__/\_,_/_/ /_/\_\ version 1.0.0 _\ \/ _ \/ _ `/ __/ '_/
/_/ /___/ .__/\_,_/_/ /_/\_\ version 1.0.0
/_/
Using Python version 2.6.6 (r266:84292, Jan
22 2014 09:42:36)
SparkContext available as sc. Using Scala version 2.10.3 (Java HotSpot(TM)
64-Bit Server VM, Java 1.7.0_51)
>>> Created spark context..
Spark context available as sc.

scala>

REPL:"Read/Evaluate/Print"Loop"

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 3"7$
Spark"Context"

! Every$Spark$applica.on$requires$a$Spark$Context$
– The"main"entry"point"to"the"Spark"API"
! Spark$Shell$provides$a$preconfigured$Spark$Context$called$sc

Using Python version 2.6.6 (r266:84292, Jan 22 2014 09:42:36)


Spark context available as sc.
Python"
>>> sc.appName
u'PySparkShell'

Using Scala version 2.10.3 (Java HotSpot(TM) 64-Bit Server VM,


Java 1.7.0_51)
Created spark context..
Scala" Spark context available as sc.

scala> sc.appName
res0: String = Spark shell

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 3"8$
Chapter"Topics"

Spark$Basics$ Introduc.on$to$Spark$

!! What"is"Apache"Spark?""
!! Using"the"Spark"Shell"
!! RDDs$(Resilient$Distributed$Datasets)$
!! FuncDonal"Programming"With"Spark"
!! Conclusion"
!! HandsUOn"Exercise:"Ge`ng"Started"with"RDDs"

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 3"9$
RDD"(Resilient"Distributed"Dataset)"

! RDD$(Resilient$Distributed$Dataset)$
– Resilient"–"if"data"in"memory"is"lost,"it"can"be"recreated"
– Distributed"–"stored"in"memory"across"the"cluster"
– Dataset"–"iniDal"data"can"come"from"a"file"or"be"created"
programmaDcally"
! RDDs$are$the$fundamental$unit$of$data$in$Spark$
! Most$Spark$programming$consists$of$performing$opera.ons$on$RDDs$
"

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 3"10$
CreaDng"an"RDD"

! Three$ways$to$create$an$RDD$
– From"a"file"or"set"of"files"
– From"data"in"memory"
– From"another"RDD"

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 3"11$
FileUBased"RDDs"

! For$file"based$RDDS,$use$SparkContext.textFile$$
– Accepts"a"single"file,"a"wildcard"list"of"files,"or"a"commaUseparated"list"of"
files"
– Examples"
– sc.textFile("myfile.txt")
– sc.textFile("mydata/*.log")
– sc.textFile("myfile1.txt,myfile2.txt")
– Each"line"in"the"file(s)"is"a"separate"record"in"the"RDD"
!  Files$are$referenced$by$absolute$or$rela.ve$URI$
– Absolute"URI:"file:/home/training/myfile.txt
– RelaDve"URI"(uses"default"file"system):"myfile.txt

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 3"12$
Example:"A"FileUbased"RDD"

File:"purplecow.txt"

>  mydata = sc.textFile("purplecow.txt") I've never seen a purple cow.


I never hope to see one;

But I can tell you, anyhow,
14/01/29 06:20:37 INFO storage.MemoryStore: I'd rather see than be one.
Block broadcast_0 stored as values to
memory (estimated size 151.4 KB, free 296.8
MB)

>  mydata.count() RDD:"mydata"


I've never seen a purple cow.

I never hope to see one;
14/01/29 06:27:37 INFO spark.SparkContext: Job
finished: take at <stdin>:1, took But I can tell you, anyhow,
0.160482078 s I'd rather see than be one.
4

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 3"13$
RDD"OperaDons"

! Two$types$of$RDD$opera.ons$
RDD"

value&
– AcDons"–"return"values"

Base"RDD" New"RDD"
– TransformaDons"–"define"a"new"
RDD"based"on"the"current"one(s)"
$
! Pop$quiz:$
– Which"type"of"operaDon"is"
count()?"

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 3"14$
RDD"OperaDons:"AcDons"

! Some$common$ac.ons$ RDD"
– count()"–""return"the"number"of"elements"
value&
– take(n)"–"return"an"array"of"the"first"n"
elements"
– collect()–"return"an"array"of"all"elements"
– saveAsTextFile(file)$–"save"to"text"file(s)

>  mydata = >  val mydata =


sc.textFile("purplecow.txt") sc.textFile("purplecow.txt")

>  mydata.count() >  mydata.count()


4 4

>  for line in mydata.take(2): >  for (line <- mydata.take(2))


print line println(line)
I've never seen a purple cow. I've never seen a purple cow.
I never hope to see one; I never hope to see one;

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 3"15$
RDD"OperaDons:"TransformaDons"

! Transforma.ons$create$a$new$RDD$from$
Base"RDD" New"RDD"
an$exis.ng$one$
! RDDs$are$immutable$
– Data"in"an"RDD"is"never"changed"
– Transform"in"sequence"to"modify"the"
data"as"needed""
! Some$common$transforma.ons$
– map(function)"–"creates"a"new"RDD"by"performing"a"funcDon"on"
each"record"in"the"base"RDD"
– filter(function)"–"creates"a"new"RDD"by"including"or"
excluding"each"record"in"the"base"RDD"according"to"a"boolean"
funcDon"

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 3"16$
Example:"map"and"filter"TransformaDons"
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

map(lambda line: line.upper()) map(line => line.toUpperCase)

I'VE NEVER SEEN A PURPLE COW.


I NEVER HOPE TO SEE ONE;
BUT I CAN TELL YOU, ANYHOW,
I'D RATHER SEE THAN BE ONE.

filter(lambda line: line.startswith('I')) filter(line => line.startsWith('I'))

I'VE NEVER SEEN A PURPLE COW.


I NEVER HOPE TO SEE ONE;
I'D RATHER SEE THAN BE ONE.

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 3"17$
Lazy"ExecuDon"(1)"
File:"purplecow.txt"
! Data$in$RDDs$is$not$processed$un.l$ I've never seen a purple cow.
an$ac#on&is$performed$ I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

> 

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 3"18$
Lazy"ExecuDon"(2)"
File:"purplecow.txt"
! Data$in$RDDs$is$not$processed$un.l$ I've never seen a purple cow.
an$ac#on&is$performed$ I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

RDD:"mydata"

>  mydata = sc.textFile("purplecow.txt")

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 3"19$
Lazy"ExecuDon"(3)"
File:"purplecow.txt"
! Data$in$RDDs$is$not$processed$un.l$ I've never seen a purple cow.
an$ac#on&is$performed$ I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

RDD:"mydata"

>  mydata = sc.textFile("purplecow.txt")


>  mydata_uc = mydata.map(lambda line:
line.upper())

RDD:"mydata_uc"

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 3"20$
Lazy"ExecuDon"(4)"
File:"purplecow.txt"
! Data$in$RDDs$is$not$processed$un.l$ I've never seen a purple cow.
an$ac#on&is$performed$ I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

RDD:"mydata"

>  mydata = sc.textFile("purplecow.txt")


>  mydata_uc = mydata.map(lambda line:
line.upper())
>  mydata_filt = \
RDD:"mydata_uc"
mydata_uc.filter(lambda line: \
line.startswith('I'))

RDD:"mydata_filt"

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 3"21$
Lazy"ExecuDon"(5)"
File:"purplecow.txt"
! Data$in$RDDs$is$not$processed$un.l$ I've never seen a purple cow.
an$ac#on&is$performed$ I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

RDD:"mydata"
I've never seen a purple cow.
>  mydata = sc.textFile("purplecow.txt") I never hope to see one;
>  mydata_uc = mydata.map(lambda line: But I can tell you, anyhow,
line.upper()) I'd rather see than be one.
>  mydata_filt = \
RDD:"mydata_uc"
mydata_uc.filter(lambda line: \ I'VE NEVER SEEN A PURPLE COW.
line.startswith('I')) I NEVER HOPE TO SEE ONE;
>  mydata_filt.count() BUT I CAN TELL YOU, ANYHOW,
3 I'D RATHER SEE THAN BE ONE.

RDD:"mydata_filt"
I'VE NEVER SEEN A PURPLE COW.
I NEVER HOPE TO SEE ONE;
I'D RATHER SEE THAN BE ONE.

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 3"22$
Chaining"TransformaDons"

! Transforma.ons$may$be$chained$together$

>  mydata = sc.textFile("purplecow.txt")


>  mydata_uc = mydata.map(lambda line: line.upper())
>  mydata_filt = mydata_uc.filter(lambda line: line.startswith('I'))
>  mydata_filt.count()
3

is"exactly"equivalent"to"

>  sc.textFile("purplecow.txt").map(lambda line: line.upper()) \


.filter(lambda line: line.startswith('I')).count()
3

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 3"23$
Chapter"Topics"

Spark$Basics$ Introduc.on$to$Spark$

!! What"is"Apache"Spark?""
!! Using"the"Spark"Shell"
!! RDDs"(Resilient"Distributed"Datasets)"
!! Func.onal$Programming$in$Spark$
!! Conclusion"
!! HandsUOn"Exercises"

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 3"24$
FuncDonal"Programming"in"Spark"

! Spark$depends$heavily$on$the$concepts$of$func#onal&programming&
– FuncDons"are"the"fundamental"unit"of"programming"
– FuncDons"have"input"and"output"only"
– No"state"or"side"effects"
! Key$concepts$
– Passing"funcDons"as"input"to"other"funcDons"
– Anonymous"funcDons"

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 3"25$
Passing"FuncDons"as"Parameters"

! Many$RDD$opera.ons$take$func.ons$as$parameters$
! Pseudocode$for$the$RDD$map$opera.on$
– Applies"funcDon"fn"to"each"record"in"the"RDD"

RDD {
map(fn(x)) {
foreach record in rdd
emit fn(record)
}
}

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 3"26$
Example:"Passing"Named"FuncDons"

! Python$

>  def toUpper(s):


return s.upper()
>  mydata = sc.textFile("purplecow.txt")
>  mydata.map(toUpper).take(2)

! Scala$

>  def toUpper(s: String): String =


{ s.toUpperCase }
>  val mydata = sc.textFile("purplecow.txt")
>  mydata.map(toUpper).take(2)

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 3"27$
Anonymous"FuncDons"

! Func.ons$defined$in"line$without$an$iden.fier$
– Best"for"short,"oneUoff"funcDons"
! Supported$in$many$programming$languages$
– Python:"lambda x: ...
– Scala:"x => ...
– Java"8:"x -> ...

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 3"28$
Example:"Passing"Anonymous"FuncDons"

!  Python:$

>  mydata.map(lambda line: line.upper()).take(2)

!  Scala:$

>  mydata.map(line => line.toUpperCase()).take(2)

OR"

>  mydata.map(_.toUpperCase()).take(2)

Scala"allows"anonymous"parameters"
using"underscore"(_)"

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 3"29$
Example:"Java""

...
JavaRDD<String> lines = sc.textFile("file");
JavaRDD<String> lines_uc = lines.map(
new MapFunction<String, String>() {
Java"7" public String call(String line) {
return line.toUpperCase();
}
}
...

...
Java"8" JavaRDD<String> lines = sc.textFile("file");
JavaRDD<String> lines_uc = lines.map(
line -> line.toUpperCase());
...

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 3"30$
Chapter"Topics"

Spark$Basics$ Introduc.on$to$Spark$

!! What"is"Apache"Spark?""
!! Using"the"Spark"Shell"
!! RDDs"(Resilient"Distributed"Datasets)"
!! FuncDonal"Programming"With"Spark"
!! Conclusion$
!! HandsUOn"Exercises"

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 3"31$
Key"Points"

! Spark$can$be$used$interac.vely$via$the$Spark$Shell$
– Python"or"Scala"
– WriDng"nonUinteracDve"Spark"applicaDons"will"be"covered"later"
! RDDs$(Resilient$Distributed$Datasets)$are$a$key$concept$in$Spark$
! RDD$Opera.ons$
– TransformaDons"create"a"new"RDD"based"on"an"exisDng"one"
– AcDons"return"a"value"from"an"RDD"
! Lazy$Execu.on$
– TransformaDons"are"not"executed"unDl"required"by"an"acDon"
! Spark$uses$func.onal$programming$
– Passing"funcDons"as"parameters"
– Anonymous"funcDons"in"supported"languages"(Python"and"Scala)"

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 3"32$
Chapter"Topics"

Spark$Basics$ Introduc.on$to$Spark$

!! What"is"Apache"Spark?""
!! Using"the"Spark"Shell"
!! RDDs"(Resilient"Distributed"Datasets)"
!! FuncDonal"Programming"With"Spark"
!! Conclusion"
!! Hands"On$Exercises$

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 3"33$
IntroducDon"to"Exercises:"Ge`ng"Started"

! Instruc.ons$are$in$the$Hands"On$Exercise$Manual$
! Start$with$$
– General"Notes"
– Se`ng"Up"

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 3"34$
IntroducDon"to"Exercises:"Pick"Your"Language"

! Your$choice:$Python$or$Scala$
– For"most"exercises"in"this"course,"you"may"choose"to"work"with"either"
Python"or"Scala"
– ExcepDon:"Spark"Streaming"is"currently"available"only"in"Scala"
– Course"examples"are"mostly"presented"in"Python"
! Solu.on$and$example$files$
– .pyspark"–"Python"shell"commands"
– .scalaspark"–"Scala"shell"commands"
– .py"–"complete"Python"Spark"applicaDons"
– .scala"–"complete"Scala"Spark"applicaDons"

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 3"35$
IntroducDon"to"Exercises:"Classroom"Virtual"Machine"

! Your$virtual$machine$
– Log"in"as"user"training"(password"training)"
– PreUinstalled"and"configured"with"
– Spark"and"CDH"(Cloudera’s"DistribuDon,"including"Apache"Hadoop)"
– Various"tools"including"Emacs,"IntelliJ,"and"Maven"
! Training$materials:$~/training_materials/sparkdev$folder$on$
the$VM$
– data"–"sample"datasets"uses"in"exercises""
– examples"–"all"the"example"code"in"this"course"
– solutions"–"soluDons"for"Scala"Shell"and"Python"exercises"
– stubs"–"starter"code"required"in"some"exercises"

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 3"36$
IntroducDon"to"Exercises:"The"Data"

! Most$exercises$are$based$around$a$hypothe.cal$company:$Loudacre$
Mobile$
– A"cellular"telephone"company"
! Loudacre$Mobile$Customer$Support$has$many$sources$of$data$they$need$
to$process,$transform$and$analyze$
– Customer"account"data""
– Web"server"logs"from"Loudacre’s"customer"support"website"
– New"device"acDvaDon"records"
– Customer"support"Knowledge"Base"arDcles"
– InformaDon"about"models"of"supported"devices"

L udacre
mobile
o

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 3"37$
HandsUOn"Exercises"

! Now,$please$do$the$following$three$Hands"On$Exercises$
1.  Viewing&the&Spark&Documenta8on&
– Familiarize"yourself"with"the"Spark"documentaDon;"you"will"refer"to"
this"documentaDon"frequently"during"the"course"
2.  Using&the&Spark&Shell&
– Follow"the"instrucDons"for"either"the"Python"or"Scala"shell"
3.  Ge>ng&Started&with&RDDs&
– Use"either"the"Python"or"Scala"Spark"Shell"to"explore"the"Loudacre"
weblogs"
! Please$refer$to$the$Hands"On$Exercise$Manual$

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 3"38$
Working"With"RDDs"
Chapter"4"

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 4"1$
Course"Chapters"
!! IntroducDon" Course"IntroducDon"
!! What"is"Apache"Spark?"
!! Spark"Basics" Introduc.on$to$Spark$
!! Working$With$RDDs$

!! The"Hadoop"Distributed"File"System"(HDFS)"
!! Running"Spark"on"a"Cluster"
Distributed"Data"Processing""
!! Parallel"Programming"with"Spark"
with"Spark"
!! Caching"and"Persistence"
!! WriDng"Spark"ApplicaDons"

!! Spark"Streaming"
!! Common"Pa=erns"in"Spark"Programming" Solving"Business"Problems""
!! Improving"Spark"Performance" with"Spark"
!! Spark,"Hadoop,"and"the"Enterprise"Data"Center"

!! Conclusion" Course"Conclusion"

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 4"2$
Working"With"RDDs"

In$this$chapter$you$will$learn$
! How$RDDs$are$created$$
! Addi.onal$RDD$opera.ons$
! Special$opera.ons$available$on$RDDs$of$key"value$pairs$
! How$MapReduce$algorithms$are$implemented$in$Spark$

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 4"3$
Chapter"Topics"

Working$With$RDDs$ Spark$Development$

!! A$Closer$Look$at$RDDs$
!! KeyTValue"Pair"RDDs"
!! MapReduce"
!! Other"Pair"RDD"OperaDons"
!! Conclusion"
!! HandsTOn"Exercise:"Working"with"Pair"RDDs"

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 4"4$
RDDs"

! RDDs$can$hold$any$type$of$element$
– PrimiDve"types:"integers,"characters,"booleans,"etc."
– Sequence"types:"strings,"lists,"arrays,"tuples,"dicts,"etc."(including"nested"
data"types)"
– Scala/Java"Objects"(if"serializable)"
– Mixed"types"
! Some$types$of$RDDs$have$addi.onal$func.onality$
– Pair"RDDs"
– RDDs"consisDng"of"KeyTValue"pairs"
– Double"RDDs"
– RDDs"consisDng"of"numeric"data"

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 4"5$
CreaDng"RDDs"From"CollecDons"

! You$can$create$RDDs$from$collec.ons$instead$of$files$
– sc.parallelize(collection)

> randomnumlist = \
[random.uniform(0,10) for _ in xrange(10000)]
> randomrdd = sc.parallelize(randomnumlist)
> print "Mean: %f" % randomrdd.mean()

!  Useful$when$
– TesDng"
– GeneraDng"data"programmaDcally"
– IntegraDng"

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 4"6$
Some"Other"General"RDD"OperaDons"

! Transforma.ons$
– flatMap"–"maps"one"element"in"the"base"RDD"to"mulDple"elements
– distinct"–"filter"out"duplicates"
– union"–"add"all"elements"of"two"RDDs"into"a"single"new"RDD"
! Other$RDD$opera.ons$
– first"–"return"the"first"element"of"the"RDD
– foreach"–"apply"a"funcDon"to"each"element"in"an"RDD"
– top(n)$–"return"the"largest"n"elements"using"natural"ordering"
! Sampling$opera.ons$
– sample(percent)$–"create"a"new"RDD"with"a"sampling"of"elements""
– takeSample(percent)$–"return"an"array"of"sampled"elements"
! Double$RDD$opera.ons$
– StaDsDcal"funcDons,"e.g.,"mean,"sum,"variance,"stdev

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 4"7$
Example:"flatMap"and"distinct

> sc.textFile(file) \
Python" .flatMap(lambda line: line.split()) \
.distinct()

> sc.textFile(file)
Scala" .flatMap(line => line.split("\\W"))
.distinct()

I’ve I’ve
never never
I've never seen a purple cow.
seen seen
I never hope to see one;
But I can tell you, anyhow,
a a

I'd rather see than be one. purple purple


cow cow
I hope
never …
hope

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 4"8$
Chapter"Topics"

Working$With$RDDs$ Spark$Development$

!! A"Closer"Look"at"RDDs"
!! Key"Value$Pair$RDDs$
!! MapReduce"
!! Other"Pair"RDD"OperaDons"
!! Conclusion"
!! HandsTOn"Exercise:"Working"with"Pair"RDDs"

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 4"9$
Pair"RDDs"

! Pair$RDDs$are$a$special$form$of$RDD$ Pair"RDD"
– Each"element"must"be"a"keyTvalue"pair"(a"" (key1,value1)
twoTelement"tuple)" (key2,value2)
– Keys"and"values"can"be"any"type" (key3,value3)
! Why?$ …
– Use"with"MapReduce"algorithms""
– Many"addiDonal"funcDons"are"available"for"
common"data"processing"needs"
– e.g.,"sorDng,"joining,"grouping,"counDng,"etc."

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 4"10$
CreaDng"Pair"RDDs"

! The$first$step$in$most$workflows$is$to$get$the$data$into$key/value$form$
– What"should"the"RDD"should"be"keyed"on?"
– What"is"the"value?"
! Commonly$used$func.ons$to$create$Pair$RDDs$
– map
– flatMap$/$flatMapValues
– keyBy

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 4"11$
Example:"A"Simple"Pair"RDD"

! Example:$Create$a$Pair$RDD$from$a$tab"separated$file$

Python" > users = sc.textFile(file) \


.map(lambda line: line.split('\t')) \
.map(lambda fields: (fields[0],fields[1]))

> val users = sc.textFile(file) \


Scala" .map(line => line.split('\t')) \
.map(fields => (fields(0),fields(1)))

(user001,Fred Flintstone)
user001 Fred Flintstone
(user090,Bugs Bunny)
user090 Bugs Bunny
user111 Harry Potter (user111,Harry Potter)
… …

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 4"12$
Example:"Keying"Web"Logs"by"User"ID"

> sc.textFile(logfile) \
.keyBy(lambda line: line.split(' ')[2])

> sc.textFile(logfile) \
.keyBy(line => line.split(' ')(2))
User"ID"
56.38.234.188 – 99788 "GET /KBDOC-00157.html HTTP/1.0" …
56.38.234.188 – 99788 "GET /theme.css HTTP/1.0" …
203.146.17.59 – 25254 "GET /KBDOC-00230.html HTTP/1.0" …

(99788,56.38.234.188 – 99788 "GET /KBDOC-00157.html…)


(99788,56.38.234.188 – 99788 "GET /theme.css…)
(25254,203.146.17.59 – 25254 "GET /KBDOC-00230.html…)

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 4"13$
QuesDon"1:"Pairs"With"Complex"Values

! How$would$you$do$this?$
– Input:"a"list"of"postal"codes"with"laDtude"and"longitude"
– Output:"postal"code"(key)"and"lat/long"pair"(value)"

(00210,(43.005895,-71.013202))
00210 43.005895 -71.013202
00211 43.005895 -71.013202 (00211,(43.005895,-71.013202))
00212 43.005895 -71.013202
?" (00212,(43.005895,-71.013202))
00213 43.005895 -71.013202 (00213,(43.005895,-71.013202))
00214 43.005895 -71.013202

…$

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 4"14$
Answer"1:"Pairs"With"Complex"Values"

> sc.textFile(file) \
.map(lambda line: line.split()) \
.map(lambda fields: (fields[0],(fields[1],fields[2])))

(00210,(43.005895,-71.013202))
00210 43.005895 -71.013202
00211 43.005895 -71.013202 (00211,(43.005895,-71.013202))
00212 43.005895 -71.013202 (00212,(43.005895,-71.013202))
00213 43.005895 -71.013202 (00213,(43.005895,-71.013202))
00214 43.005895 -71.013202

…$

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 4"15$
QuesDon"2:"Mapping"Single"Rows"to"MulDple"Pairs"(1)"

! How$would$you$do$this?$
– Input:"order"numbers"with"a"list"of"SKUs"in"the"order"
– Output:"order"(key)"and"sku"(value)"

Input"Data" Pair"RDD"
00001 sku010:sku933:sku022 (00001,sku010)
00002 sku912:sku331 (00001,sku933)
00003 sku888:sku022:sku010:sku594
00004 sku411
?" (00001,sku022)
(00002,sku912)
" (00002,sku331)
(00003,sku888)

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 4"16$
QuesDon"2:"Mapping"Single"Rows"to"MulDple"Pairs"(2)"

! Hint:$map$alone$won’t$work$

00001 sku010:sku933:sku022
00002 sku912:sku331
00003 sku888:sku022:sku010:sku594
00004 sku411

"

(00001,(sku010,sku933,sku022))
(00002,(sku912,sku331))
(00003,(sku888,sku022,sku010,sku594))
(00004,(sku411))

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 4"17$
Answer"2:"Mapping"Single"Rows"to"MulDple"Pairs"(1)"

> sc.textFile(file)

00001 sku010:sku933:sku022
00002 sku912:sku331
00003 sku888:sku022:sku010:sku594
00004 sku411

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 4"18$
Answer"2:"Mapping"Single"Rows"to"MulDple"Pairs"(2)"

> sc.textFile(file) \
.map(lambda line: line.split('\t'))

00001 sku010:sku933:sku022
00002 sku912:sku331
00003 sku888:sku022:sku010:sku594
[00001,sku010:sku933:sku022]
00004 sku411
[00002,sku912:sku331]
[00003,sku888:sku022:sku010:sku594]
[00004,sku411]

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 4"19$
Answer"2:"Mapping"Single"Rows"to"MulDple"Pairs"(3)"

> sc.textFile(file) \
.map(lambda line: line.split('\t')) \
.map(lambda fields: (fields[0],fields[1]))

00001 sku010:sku933:sku022
00002 sku912:sku331
00003 sku888:sku022:sku010:sku594
[00001,sku010:sku933:sku022]
00004 sku411
[00002,sku912:sku331]
[00003,sku888:sku022:sku010:sku594]
(00001,sku010:sku933:sku022)
[00004,sku411]
(00002,sku912:sku331)
(00003,sku888:sku022:sku010:sku594)
(00004,sku411)

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 4"20$
Answer"2:"Mapping"Single"Rows"to"MulDple"Pairs"(4)"

> sc.textFile(file) \
.map(lambda line: line.split('\t')) \
.map(lambda fields: (fields[0],fields[1]))
.flatMapValues(lambda skus: skus.split(':'))

00001 sku010:sku933:sku022
00002 sku912:sku331 (00001,sku010)
00003 sku888:sku022:sku010:sku594 (00001,sku933)
[00001,sku010:sku933:sku022]
00004 sku411 (00001,sku022)
[00002,sku912:sku331]
[00003,sku888:sku022:sku010:sku594] (00002,sku912)
(00001,sku010:sku933:sku022)
[00004,sku411] (00002,sku331)
(00002,sku912:sku331)
(00003,sku888)
(00003,sku888:sku022:sku010:sku594)

(00004,sku411)

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 4"21$
Chapter"Topics"

Working$With$RDDs$ Introduc.on$to$Spark$

!! A"Closer"Look"at"RDDs"
!! KeyTValue"Pair"RDDs"
!! MapReduce$
!! Other"Pair"RDD"OperaDons"
!! Conclusion"
!! HandsTOn"Exercise:"Working"with"Pair"RDDs"

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 4"22$
MapReduce"

! MapReduce$is$a$common$programming$model$
– Easily"applicable"to"distributed"processing"of"large"data"sets"
! Hadoop$MapReduce$is$the$major$implementa.on$$
– Somewhat"limited"
– Each"job"has"one"Map"phase,"one"Reduce"phase""
– Job"output"is"saved"to"files"
! Spark$implements$MapReduce$with$much$greater$flexibility$
– Map"and"Reduce"funcDons"can"be"interspersed"
– Results"stored"in"memory"
– OperaDons"can"easily"be"chained"

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 4"23$
MapReduce"in"Spark"

! MapReduce$in$Spark$works$on$Pair$RDDs$
! Map$phase$
– Operates"on"one"record"at"a"Dme"
– “Maps”"each"record"to"one"or"more"new"records"
– map"and"flatMap
! Reduce$phase$
– Works"on"Map"output"
– Consolidates"mulDple"records"
– reduceByKey

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 4"24$
MapReduce"Example:"Word"Count"
Result"
aardvark 1
Input"Data"
cat 1
the cat sat on the mat mat 1
the aardvark sat on the sofa ?" on 2

$ sat 2
sofa 1
the 4

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 4"25$
Example:"Word"Count"(1)"

> counts = sc.textFile(file)

the cat sat on the


mat
the aardvark sat on
the sofa

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 4"26$
Example:"Word"Count"(2)"

> counts = sc.textFile(file) \


.flatMap(lambda line: line.split())

the cat sat on the the


mat
cat
the aardvark sat on
sat
the sofa
on
the
mat
the
aardvark

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 4"27$
Example:"Word"Count"(3)"

> counts = sc.textFile(file) \


.flatMap(lambda line: line.split()) \
.map(lambda word: (word,1)) KeyT
Value"
Pairs"

the cat sat on the the (the, 1)


mat
cat (cat, 1)
the aardvark sat on
sat (sat, 1)
the sofa
on (on, 1)
the (the, 1)
mat (mat, 1)
the (the, 1)
aardvark (aardvark, 1)
… …

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 4"28$
Example:"Word"Count"(4)"

> counts = sc.textFile(file) \


.flatMap(lambda line: line.split()) \
.map(lambda word: (word,1)) \
.reduceByKey(lambda v1,v2: v1+v2)

the cat sat on the the (the, 1) (aardvark, 1)


mat (cat, 1)
cat (cat, 1)
the aardvark sat on (mat, 1)
sat (sat, 1)
the sofa
on (on, 1) (on, 2)
the (the, 1) (sat, 2)
mat (mat, 1) (sofa, 1)
the (the, 1) (the, 4)
aardvark (aardvark, 1)
… …

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 4"29$
ReduceByKey"

! ReduceByKey$func.ons$must$be$ > counts = sc.textFile(file) \


– Binary"–"combines"values" .flatMap(lambda line: line.split()) \
.map(lambda word: (word,1)) \
from"two"keys" .reduceByKey(lambda v1,v2: v1+v2)
– CommutaDve"–"x+y"="y+x"
– AssociaDve"–"(x+y)+z"="x+(y+z)"
(the,1)
(cat,1)
(the,2)
(sat,1) (aardvark, 1)
(on,1) (cat, 1)
(the,1) (the,3) (mat, 1)
(mat,1) (on, 2)
(the,1) (sat, 2)
(aardvark,1) (sofa, 1)
(the,4)
(sat,1) (the, 4)
(on,1)
(the,1)

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 4"30$
Word"Count"Recap"(the"Scala"Version)"

> val counts = sc.textFile(file) \


.flatMap(line => line.split("\\W")) \
.map(word => (word,1)) \
.reduceByKey((v1,v2) => v1+v2)

OR"

> val counts = sc.textFile(file) \


.flatMap(_.split("\\W")) \
.map(_,1)) \
.reduceByKey(_+_)

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 4"31$
Why"Do"We"Care"About"CounDng"Words?"

! Word$count$is$challenging$over$massive$amounts$of$data$
– Using"a"single"compute"node"would"be"too"DmeTconsuming"
– Number"of"unique"words"could"exceed"available"memory"
! Sta.s.cs$are$o`en$simple$aggregate$func.ons$
– DistribuDve"in"nature"
– e.g.,"max,"min,"sum,"count"
! MapReduce$breaks$complex$tasks$down$into$smaller$elements$which$can$
be$executed$in$parallel$
! Many$common$tasks$are$very$similar$to$word$count$
– e.g.,"log"file"analysis"

©"Copyright"2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent." 4"32$
