Cloudera Spark
Chapter 3

© Copyright 2014 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
Course Chapters

Course Introduction
- Introduction

Introduction to Spark
- Why Spark?
- Spark Basics (current chapter)
- Working With RDDs

Distributed Data Processing with Spark
- The Hadoop Distributed File System (HDFS)
- Running Spark on a Cluster
- Parallel Programming with Spark
- Caching and Persistence
- Writing Spark Applications
- Spark Streaming

Solving Business Problems with Spark
- Common Patterns in Spark Programming
- Improving Spark Performance
- Spark, Hadoop, and the Enterprise Data Center

Course Conclusion
- Conclusion
Spark Basics

In this chapter you will learn
- How to start the Spark Shell
- About the SparkContext
- Key concepts of Resilient Distributed Datasets (RDDs)
  - What are they?
  - How do you create them?
  - What operations can you perform with them?
- How Spark uses the principles of functional programming
- About the Hands-On Exercises for the course
Chapter Topics

Spark Basics (Introduction to Spark)

- What is Apache Spark?
- Using the Spark Shell
- RDDs (Resilient Distributed Datasets)
- Functional Programming in Spark
- Conclusion
- Hands-On Exercises
What is Apache Spark?

- Apache Spark is a fast and general engine for large-scale data processing
- Written in Scala
  - Functional programming language that runs in a JVM
- Spark Shell
  - Interactive, for learning or data exploration
  - Python or Scala
- Spark Applications
  - For large-scale data processing
  - Python, Scala, or Java
Chapter Topics

Spark Basics (Introduction to Spark)

- What is Apache Spark?
- Using the Spark Shell
- RDDs (Resilient Distributed Datasets)
- Functional Programming in Spark
- Conclusion
- Hands-On Exercises
Spark Shell

- The Spark Shell provides interactive data exploration (REPL)
- Writing Spark applications without the shell will be covered later

Python Shell: pyspark

$ pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.0.0
      /_/

Using Python version 2.6.6 (r266:84292, Jan 22 2014 09:42:36)
SparkContext available as sc.
>>>

Scala Shell: spark-shell

$ spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.0.0
      /_/

Using Scala version 2.10.3 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51)
Created spark context..
Spark context available as sc.
scala>

REPL: Read/Evaluate/Print Loop
Spark Context

- Every Spark application requires a Spark Context
  - The main entry point to the Spark API
- Spark Shell provides a preconfigured Spark Context called sc

scala> sc.appName
res0: String = Spark shell
Chapter Topics

Spark Basics (Introduction to Spark)

- What is Apache Spark?
- Using the Spark Shell
- RDDs (Resilient Distributed Datasets)
- Functional Programming in Spark
- Conclusion
- Hands-On Exercise: Getting Started with RDDs
RDD (Resilient Distributed Dataset)

- RDD (Resilient Distributed Dataset)
  - Resilient: if data in memory is lost, it can be recreated
  - Distributed: stored in memory across the cluster
  - Dataset: initial data can come from a file or be created programmatically
- RDDs are the fundamental unit of data in Spark
- Most Spark programming consists of performing operations on RDDs
Creating an RDD

- Three ways to create an RDD
  - From a file or set of files
  - From data in memory
  - From another RDD
File-Based RDDs

- For file-based RDDs, use SparkContext.textFile
  - Accepts a single file, a wildcard list of files, or a comma-separated list of files
  - Examples:
    - sc.textFile("myfile.txt")
    - sc.textFile("mydata/*.log")
    - sc.textFile("myfile1.txt,myfile2.txt")
  - Each line in the file(s) is a separate record in the RDD
- Files are referenced by absolute or relative URI
  - Absolute URI: file:/home/training/myfile.txt
  - Relative URI (uses default file system): myfile.txt
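To make the line-per-record behavior concrete, here is a plain-Python sketch (not the Spark API, and no cluster involved): like textFile, it treats each line of a file as one record.

```python
import tempfile

# Plain-Python analogue of textFile's line-per-record behavior:
# read a file and return a list with one record per line.
def lines_as_records(path):
    with open(path) as f:
        return [line.rstrip("\n") for line in f]

# Write a small sample file, then load it as records.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("I've never seen a purple cow.\nI never hope to see one;\n")
    path = f.name

records = lines_as_records(path)
print(len(records))  # one record per line
```

In Spark the result would be a distributed RDD of strings rather than a local list, but the record boundaries are the same.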
Example: A File-Based RDD

File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

> mydata = sc.textFile("purplecow.txt")
RDD Operations

- Two types of RDD operations
  - Actions: return values
  - Transformations: define a new RDD based on the current one(s)
- Pop quiz: which type of operation is count()?
RDD Operations: Actions

- Some common actions
  - count(): return the number of elements
  - take(n): return an array of the first n elements
  - collect(): return an array of all elements
  - saveAsTextFile(file): save to text file(s)
RDD Operations: Transformations

- Transformations create a new RDD from an existing one
- RDDs are immutable
  - Data in an RDD is never changed
  - Transform in sequence to modify the data as needed
- Some common transformations
  - map(function): creates a new RDD by performing a function on each record in the base RDD
  - filter(function): creates a new RDD by including or excluding each record in the base RDD according to a boolean function
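The semantics of map and filter can be sketched in plain Python (this is an illustration, not the Spark API): map transforms every record, filter keeps records for which a boolean function returns True.

```python
# Plain-Python analogue of the map and filter transformations.
lines = [
    "I've never seen a purple cow.",
    "I never hope to see one;",
    "But I can tell you, anyhow,",
    "I'd rather see than be one.",
]

upper = [line.upper() for line in lines]                 # like map()
starts_with_i = [l for l in upper if l.startswith("I")]  # like filter()

print(len(starts_with_i))  # 3 of the 4 lines start with 'I'
```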
Example: map and filter Transformations

I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.
Lazy Execution (1)

- Data in RDDs is not processed until an action is performed

File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

>

Lazy Execution (2)

> mydata = sc.textFile("purplecow.txt")

RDD: mydata (defined, not yet materialized)

Lazy Execution (3)

> mydata_uc = mydata.map(lambda line: line.upper())

RDD: mydata_uc (defined, not yet materialized)

Lazy Execution (4)

> mydata_filt = mydata_uc.filter(lambda line: line.startswith('I'))

RDD: mydata_filt (defined, not yet materialized)
Lazy Execution (5)

- Data in RDDs is not processed until an action is performed

File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

> mydata = sc.textFile("purplecow.txt")
> mydata_uc = mydata.map(lambda line:
    line.upper())
> mydata_filt = \
    mydata_uc.filter(lambda line: \
    line.startswith('I'))
> mydata_filt.count()
3

RDD: mydata
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

RDD: mydata_uc
I'VE NEVER SEEN A PURPLE COW.
I NEVER HOPE TO SEE ONE;
BUT I CAN TELL YOU, ANYHOW,
I'D RATHER SEE THAN BE ONE.

RDD: mydata_filt
I'VE NEVER SEEN A PURPLE COW.
I NEVER HOPE TO SEE ONE;
I'D RATHER SEE THAN BE ONE.
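Lazy execution can be sketched in plain Python with generators (an analogy only, not how Spark is implemented): a generator expression builds a recipe but does no work until something consumes it, just as transformations do nothing until an action runs.

```python
# Records which lines actually get processed, so laziness is observable.
processed = []

def to_upper(line):
    processed.append(line)
    return line.upper()

lines = ["I've never seen a purple cow.", "I never hope to see one;"]

# Like a transformation: defines the computation, processes nothing yet.
lazy_upper = (to_upper(l) for l in lines)
assert processed == []

# Like an action: consuming the generator drives the actual work.
count = sum(1 for _ in lazy_upper)
assert count == 2 and len(processed) == 2
```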
Chaining Transformations

- Transformations may be chained together

> sc.textFile("purplecow.txt") \
    .map(lambda line: line.upper()) \
    .filter(lambda line: line.startswith('I')) \
    .count()
3

is exactly equivalent to

> mydata = sc.textFile("purplecow.txt")
> mydata_uc = mydata.map(lambda line: line.upper())
> mydata_filt = mydata_uc.filter(lambda line: line.startswith('I'))
> mydata_filt.count()
3
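Why the two forms are equivalent: each transformation simply returns a new RDD object, so the intermediate variables are optional. The MiniRDD class below is a made-up illustration of that pattern (it is not Spark's API and runs locally on a list).

```python
# A minimal RDD-like wrapper: each call returns a new wrapper,
# so calls can be chained or assigned to intermediate variables.
class MiniRDD:
    def __init__(self, records):
        self.records = list(records)

    def map(self, fn):
        return MiniRDD(fn(r) for r in self.records)

    def filter(self, fn):
        return MiniRDD(r for r in self.records if fn(r))

    def count(self):
        return len(self.records)

lines = ["I've never seen a purple cow.", "I never hope to see one;",
         "But I can tell you, anyhow,", "I'd rather see than be one."]

# Chained form:
chained = MiniRDD(lines).map(str.upper) \
                        .filter(lambda l: l.startswith("I")).count()

# Step-by-step form:
mydata = MiniRDD(lines)
mydata_uc = mydata.map(str.upper)
mydata_filt = mydata_uc.filter(lambda l: l.startswith("I"))

assert chained == mydata_filt.count() == 3
```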
Chapter Topics

Spark Basics (Introduction to Spark)

- What is Apache Spark?
- Using the Spark Shell
- RDDs (Resilient Distributed Datasets)
- Functional Programming in Spark
- Conclusion
- Hands-On Exercises
Functional Programming in Spark

- Spark depends heavily on the concepts of functional programming
  - Functions are the fundamental unit of programming
  - Functions have input and output only
    - No state or side effects
- Key concepts
  - Passing functions as input to other functions
  - Anonymous functions
Passing Functions as Parameters

- Many RDD operations take functions as parameters
- Pseudocode for the RDD map operation
  - Applies function fn to each record in the RDD

RDD {
  map(fn(x)) {
    foreach record in rdd
      emit fn(record)
  }
}
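The pseudocode above can be written out in plain Python to show what "taking a function as a parameter" means: the map operation receives fn as an argument and emits fn(record) for every record.

```python
# Plain-Python version of the map pseudocode above.
def rdd_map(records, fn):
    result = []
    for record in records:
        result.append(fn(record))  # emit fn(record)
    return result

def square(x):
    return x * x

print(rdd_map([1, 2, 3], square))  # [1, 4, 9]
```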
Example: Passing Named Functions

- Python:

> def toUpper(s):
      return s.upper()
> mydata.map(toUpper).take(2)

- Scala:

> def toUpper(s: String) = { s.toUpperCase }
> mydata.map(toUpper).take(2)
Anonymous Functions

- Functions defined in-line without an identifier
  - Best for short, one-off functions
- Supported in many programming languages
  - Python: lambda x: ...
  - Scala: x => ...
  - Java 8: x -> ...
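In Python, a lambda is simply a function value without a name, so a named function and an equivalent lambda are interchangeable when passed as a parameter:

```python
# A named function and an anonymous (lambda) function behave identically
# when passed to a higher-order function such as map.
def to_upper(s):
    return s.upper()

words = ["purple", "cow"]

named = list(map(to_upper, words))
anonymous = list(map(lambda s: s.upper(), words))

assert named == anonymous == ["PURPLE", "COW"]
```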
Example: Passing Anonymous Functions

- Python:

> mydata.map(lambda line: line.upper()).take(2)

- Scala:

> mydata.map(line => line.toUpperCase()).take(2)

OR

> mydata.map(_.toUpperCase()).take(2)

Scala allows anonymous parameters using underscore (_)
Example: Java

Java 7:

...
JavaRDD<String> lines = sc.textFile("file");
JavaRDD<String> lines_uc = lines.map(
  new MapFunction<String, String>() {
    public String call(String line) {
      return line.toUpperCase();
    }
  });
...

Java 8:

...
JavaRDD<String> lines = sc.textFile("file");
JavaRDD<String> lines_uc = lines.map(
  line -> line.toUpperCase());
...
Chapter Topics

Spark Basics (Introduction to Spark)

- What is Apache Spark?
- Using the Spark Shell
- RDDs (Resilient Distributed Datasets)
- Functional Programming in Spark
- Conclusion
- Hands-On Exercises
Key Points

- Spark can be used interactively via the Spark Shell
  - Python or Scala
  - Writing non-interactive Spark applications will be covered later
- RDDs (Resilient Distributed Datasets) are a key concept in Spark
- RDD operations
  - Transformations create a new RDD based on an existing one
  - Actions return a value from an RDD
- Lazy execution
  - Transformations are not executed until required by an action
- Spark uses functional programming
  - Passing functions as parameters
  - Anonymous functions in supported languages (Python and Scala)
Chapter Topics

Spark Basics (Introduction to Spark)

- What is Apache Spark?
- Using the Spark Shell
- RDDs (Resilient Distributed Datasets)
- Functional Programming in Spark
- Conclusion
- Hands-On Exercises
Introduction to Exercises: Getting Started

- Instructions are in the Hands-On Exercise Manual
- Start with
  - General Notes
  - Setting Up
Introduction to Exercises: Pick Your Language

- Your choice: Python or Scala
  - For most exercises in this course, you may choose to work with either Python or Scala
  - Exception: Spark Streaming is currently available only in Scala
  - Course examples are mostly presented in Python
- Solution and example files
  - .pyspark: Python shell commands
  - .scalaspark: Scala shell commands
  - .py: complete Python Spark applications
  - .scala: complete Scala Spark applications
Introduction to Exercises: Classroom Virtual Machine

- Your virtual machine
  - Log in as user training (password training)
  - Pre-installed and configured with
    - Spark and CDH (Cloudera's Distribution, including Apache Hadoop)
    - Various tools including Emacs, IntelliJ, and Maven
- Training materials: ~/training_materials/sparkdev folder on the VM
  - data: sample datasets used in exercises
  - examples: all the example code in this course
  - solutions: solutions for Scala Shell and Python exercises
  - stubs: starter code required in some exercises
Introduction to Exercises: The Data

- Most exercises are based around a hypothetical company: Loudacre Mobile
  - A cellular telephone company
- Loudacre Mobile Customer Support has many sources of data they need to process, transform, and analyze
  - Customer account data
  - Web server logs from Loudacre's customer support website
  - New device activation records
  - Customer support Knowledge Base articles
  - Information about models of supported devices
Hands-On Exercises

- Now, please do the following three Hands-On Exercises
  1. Viewing the Spark Documentation
     - Familiarize yourself with the Spark documentation; you will refer to this documentation frequently during the course
  2. Using the Spark Shell
     - Follow the instructions for either the Python or Scala shell
  3. Getting Started with RDDs
     - Use either the Python or Scala Spark Shell to explore the Loudacre weblogs
- Please refer to the Hands-On Exercise Manual
Working With RDDs
Chapter 4
Course Chapters

Course Introduction
- Introduction

Introduction to Spark
- What is Apache Spark?
- Spark Basics
- Working With RDDs (current chapter)

Distributed Data Processing with Spark
- The Hadoop Distributed File System (HDFS)
- Running Spark on a Cluster
- Parallel Programming with Spark
- Caching and Persistence
- Writing Spark Applications
- Spark Streaming

Solving Business Problems with Spark
- Common Patterns in Spark Programming
- Improving Spark Performance
- Spark, Hadoop, and the Enterprise Data Center

Course Conclusion
- Conclusion
Working With RDDs

In this chapter you will learn
- How RDDs are created
- Additional RDD operations
- Special operations available on RDDs of key-value pairs
- How MapReduce algorithms are implemented in Spark
Chapter Topics

Working With RDDs (Spark Development)

- A Closer Look at RDDs
- Key-Value Pair RDDs
- MapReduce
- Other Pair RDD Operations
- Conclusion
- Hands-On Exercise: Working with Pair RDDs
RDDs

- RDDs can hold any type of element
  - Primitive types: integers, characters, booleans, etc.
  - Sequence types: strings, lists, arrays, tuples, dicts, etc. (including nested data types)
  - Scala/Java objects (if serializable)
  - Mixed types
- Some types of RDDs have additional functionality
  - Pair RDDs
    - RDDs consisting of key-value pairs
  - Double RDDs
    - RDDs consisting of numeric data
Creating RDDs from Collections

- You can create RDDs from collections instead of files
  - sc.parallelize(collection)

> randomnumlist = \
    [random.uniform(0,10) for _ in xrange(10000)]
> randomrdd = sc.parallelize(randomnumlist)
> print "Mean: %f" % randomrdd.mean()

- Useful when
  - Testing
  - Generating data programmatically
  - Integrating
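As a plain-Python check of what the example above computes (no Spark required), the same collection can be built programmatically and its mean taken locally; in Spark, sc.parallelize would distribute the list and randomrdd.mean() would do this across the cluster.

```python
import random

# Build a collection programmatically, then compute its mean locally.
# For uniform(0, 10) the mean converges toward 5.0.
randomnumlist = [random.uniform(0, 10) for _ in range(10000)]
mean = sum(randomnumlist) / len(randomnumlist)
print("Mean: %f" % mean)
```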
Some Other General RDD Operations

- Transformations
  - flatMap: maps one element in the base RDD to multiple elements
  - distinct: filter out duplicates
  - union: add all elements of two RDDs into a single new RDD
- Other RDD operations
  - first: return the first element of the RDD
  - foreach: apply a function to each element in an RDD
  - top(n): return the largest n elements using natural ordering
- Sampling operations
  - sample(percent): create a new RDD with a sampling of elements
  - takeSample(percent): return an array of sampled elements
- Double RDD operations
  - Statistical functions, e.g., mean, sum, variance, stdev
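Three of the transformations above have simple plain-Python analogues (illustrations only, not the Spark API), which make the one-to-many nature of flatMap clear:

```python
# Plain-Python analogues of flatMap, distinct, and union.
lines = ["the cat sat", "the mat"]

# flatMap: one input record produces multiple output records
words = [w for line in lines for w in line.split()]
assert words == ["the", "cat", "sat", "the", "mat"]

# distinct: filter out duplicates (Spark does not guarantee order)
assert sorted(set(words)) == ["cat", "mat", "sat", "the"]

# union: all elements of two datasets in a single new one
assert ["a", "b"] + ["b", "c"] == ["a", "b", "b", "c"]
```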
Example: flatMap and distinct

Python:
> sc.textFile(file) \
    .flatMap(lambda line: line.split()) \
    .distinct()

Scala:
> sc.textFile(file)
    .flatMap(line => line.split("\\W"))
    .distinct()

Input:
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,

Result (first few distinct words):
I've
never
seen
a
…
Chapter Topics

Working With RDDs (Spark Development)

- A Closer Look at RDDs
- Key-Value Pair RDDs
- MapReduce
- Other Pair RDD Operations
- Conclusion
- Hands-On Exercise: Working with Pair RDDs
Pair RDDs

- Pair RDDs are a special form of RDD
  - Each element must be a key-value pair (a two-element tuple)
  - Keys and values can be any type
- Why?
  - Use with MapReduce algorithms
  - Many additional functions are available for common data processing needs
    - e.g., sorting, joining, grouping, counting, etc.

Pair RDD:
(key1,value1)
(key2,value2)
(key3,value3)
…
Creating Pair RDDs

- The first step in most workflows is to get the data into key/value form
  - What should the RDD be keyed on?
  - What is the value?
- Commonly used functions to create Pair RDDs
  - map
  - flatMap / flatMapValues
  - keyBy
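Of the three, keyBy is the simplest: it computes a key from each record and keeps the whole record as the value. A plain-Python analogue (not the Spark API) looks like this:

```python
# Plain-Python analogue of keyBy: (computed key, whole record) pairs.
def key_by(records, fn):
    return [(fn(r), r) for r in records]

users = ["user001\tFred Flintstone", "user090\tBugs Bunny"]
pairs = key_by(users, lambda line: line.split("\t")[0])

assert pairs[0] == ("user001", "user001\tFred Flintstone")
```

Note that unlike map, keyBy does not discard anything: the value is the original record, key included.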
Example: A Simple Pair RDD

- Example: Create a Pair RDD from a tab-separated file

Input:
user001	Fred Flintstone
user090	Bugs Bunny
user111	Harry Potter
…

Pair RDD:
(user001,Fred Flintstone)
(user090,Bugs Bunny)
(user111,Harry Potter)
…
Example: Keying Web Logs by User ID

Python:
> sc.textFile(logfile) \
    .keyBy(lambda line: line.split(' ')[2])

Scala:
> sc.textFile(logfile) \
    .keyBy(line => line.split(' ')(2))

User ID is the third field:
56.38.234.188 - 99788 "GET /KBDOC-00157.html HTTP/1.0" …
56.38.234.188 - 99788 "GET /theme.css HTTP/1.0" …
203.146.17.59 - 25254 "GET /KBDOC-00230.html HTTP/1.0" …
…
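The key function above just splits each log line on spaces and takes the third field (index 2). Checked in plain Python against one of the sample lines:

```python
# The same expression the keyBy lambda uses, applied to one log line.
line = '56.38.234.188 - 99788 "GET /KBDOC-00157.html HTTP/1.0"'
user_id = line.split(' ')[2]
assert user_id == "99788"
```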
Question 1: Pairs with Complex Values

- How would you do this?
  - Input: a list of postal codes with latitude and longitude
  - Output: postal code (key) and lat/long pair (value)

Input:
00210 43.005895 -71.013202
00211 43.005895 -71.013202
00212 43.005895 -71.013202
00213 43.005895 -71.013202
00214 43.005895 -71.013202
…

Desired Pair RDD:
(00210,(43.005895,-71.013202))
(00211,(43.005895,-71.013202))
(00212,(43.005895,-71.013202))
(00213,(43.005895,-71.013202))
…
Answer 1: Pairs with Complex Values

> sc.textFile(file) \
    .map(lambda line: line.split()) \
    .map(lambda fields: (fields[0],(fields[1],fields[2])))

Input:
00210 43.005895 -71.013202
00211 43.005895 -71.013202
00212 43.005895 -71.013202
00213 43.005895 -71.013202
00214 43.005895 -71.013202
…

Result:
(00210,(43.005895,-71.013202))
(00211,(43.005895,-71.013202))
(00212,(43.005895,-71.013202))
(00213,(43.005895,-71.013202))
…
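The two map steps above can be traced in plain Python on a single input line: the first split produces the fields, the second builds a (key, (lat, lon)) pair with a nested tuple as the value.

```python
# The same two map steps, applied locally to one input line.
line = "00210 43.005895 -71.013202"
fields = line.split()                       # first map: line -> fields
pair = (fields[0], (fields[1], fields[2]))  # second map: fields -> pair
assert pair == ("00210", ("43.005895", "-71.013202"))
```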
Question 2: Mapping Single Rows to Multiple Pairs (1)

- How would you do this?
  - Input: order numbers with a list of SKUs in the order
  - Output: order (key) and sku (value)

Input Data:
00001	sku010:sku933:sku022
00002	sku912:sku331
00003	sku888:sku022:sku010:sku594
00004	sku411

Pair RDD:
(00001,sku010)
(00001,sku933)
(00001,sku022)
(00002,sku912)
(00002,sku331)
(00003,sku888)
…
Question 2: Mapping Single Rows to Multiple Pairs (2)

- Hint: map alone won't work; it produces one pair per line, not one pair per SKU

00001	sku010:sku933:sku022
00002	sku912:sku331
00003	sku888:sku022:sku010:sku594
00004	sku411

(00001,(sku010,sku933,sku022))
(00002,(sku912,sku331))
(00003,(sku888,sku022,sku010,sku594))
(00004,(sku411))
Answer 2: Mapping Single Rows to Multiple Pairs (1)

> sc.textFile(file)

00001	sku010:sku933:sku022
00002	sku912:sku331
00003	sku888:sku022:sku010:sku594
00004	sku411
Answer 2: Mapping Single Rows to Multiple Pairs (2)

> sc.textFile(file) \
    .map(lambda line: line.split('\t'))

Input:
00001	sku010:sku933:sku022
00002	sku912:sku331
00003	sku888:sku022:sku010:sku594
00004	sku411

Result:
[00001,sku010:sku933:sku022]
[00002,sku912:sku331]
[00003,sku888:sku022:sku010:sku594]
[00004,sku411]
Answer 2: Mapping Single Rows to Multiple Pairs (3)

> sc.textFile(file) \
    .map(lambda line: line.split('\t')) \
    .map(lambda fields: (fields[0],fields[1]))

Input:
00001	sku010:sku933:sku022
00002	sku912:sku331
00003	sku888:sku022:sku010:sku594
00004	sku411

Result:
(00001,sku010:sku933:sku022)
(00002,sku912:sku331)
(00003,sku888:sku022:sku010:sku594)
(00004,sku411)
Answer 2: Mapping Single Rows to Multiple Pairs (4)

> sc.textFile(file) \
    .map(lambda line: line.split('\t')) \
    .map(lambda fields: (fields[0],fields[1])) \
    .flatMapValues(lambda skus: skus.split(':'))

Input:
00001	sku010:sku933:sku022
00002	sku912:sku331
00003	sku888:sku022:sku010:sku594
00004	sku411

Result:
(00001,sku010)
(00001,sku933)
(00001,sku022)
(00002,sku912)
(00002,sku331)
(00003,sku888)
…
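The flatMapValues step is what turns each order into several pairs: the function splits the value into several values, and the key is repeated for each one. A plain-Python analogue (not the Spark API):

```python
# Plain-Python analogue of flatMapValues: the key is repeated for
# every value the function produces from the original value.
def flat_map_values(pairs, fn):
    return [(k, v) for k, value in pairs for v in fn(value)]

orders = [("00001", "sku010:sku933:sku022"), ("00002", "sku912:sku331")]
result = flat_map_values(orders, lambda skus: skus.split(":"))

assert result == [("00001", "sku010"), ("00001", "sku933"),
                  ("00001", "sku022"), ("00002", "sku912"),
                  ("00002", "sku331")]
```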
Chapter Topics

Working With RDDs (Spark Development)

- A Closer Look at RDDs
- Key-Value Pair RDDs
- MapReduce
- Other Pair RDD Operations
- Conclusion
- Hands-On Exercise: Working with Pair RDDs
MapReduce

- MapReduce is a common programming model
  - Easily applicable to distributed processing of large data sets
- Hadoop MapReduce is the major implementation
  - Somewhat limited
    - Each job has one Map phase, one Reduce phase
    - Job output is saved to files
- Spark implements MapReduce with much greater flexibility
  - Map and Reduce functions can be interspersed
  - Results stored in memory
  - Operations can easily be chained
MapReduce in Spark

- MapReduce in Spark works on Pair RDDs
- Map phase
  - Operates on one record at a time
  - "Maps" each record to one or more new records
  - map and flatMap
- Reduce phase
  - Works on Map output
  - Consolidates multiple records
  - reduceByKey
MapReduce Example: Word Count

Input Data:
the cat sat on the mat
the aardvark sat on the sofa

Result:
aardvark 1
cat 1
mat 1
on 2
sat 2
sofa 1
the 4
Example: Word Count

> counts = sc.textFile(file) \
    .flatMap(lambda line: line.split()) \
    .map(lambda word: (word,1)) \
    .reduceByKey(lambda v1,v2: v1+v2)

ReduceByKey

- reduceByKey combines the values for each key using the supplied function, two values at a time
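The same flatMap / map / reduceByKey logic can be run in plain Python on a single machine (no cluster, a dict standing in for the Pair RDD), producing exactly the result table above:

```python
# Word count in plain Python: flatMap the lines into words, map each
# word to a count of 1, reduce by key by summing the counts.
lines = ["the cat sat on the mat", "the aardvark sat on the sofa"]

counts = {}
for line in lines:              # flatMap: lines -> words
    for word in line.split():   # map: word -> (word, 1)
        counts[word] = counts.get(word, 0) + 1  # reduceByKey: sum

assert counts == {"the": 4, "cat": 1, "sat": 2, "on": 2,
                  "mat": 1, "aardvark": 1, "sofa": 1}
```

What a cluster adds is not different logic but partitioning: each node counts its own slice of the data, and reduceByKey merges the per-node counts.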
Word Count Recap (the Scala Version)

> val counts = sc.textFile(file).
    flatMap(line => line.split("\\W")).
    map(word => (word,1)).
    reduceByKey((v1,v2) => v1+v2)

OR

> val counts = sc.textFile(file).
    flatMap(_.split("\\W")).
    map((_,1)).
    reduceByKey(_+_)
Why Do We Care About Counting Words?

- Word count is challenging over massive amounts of data
  - Using a single compute node would be too time-consuming
  - Number of unique words could exceed available memory
- Statistics are often simple aggregate functions
  - Distributive in nature
  - e.g., max, min, sum, count
- MapReduce breaks complex tasks down into smaller elements which can be executed in parallel
- Many common tasks are very similar to word count
  - e.g., log file analysis