Cloudera Spark Developer Training


Cloudera Developer Training
for Apache Spark

201409

Introduction
Chapter 1
Course Chapters

Course Introduction:
!! Introduction

Introduction to Spark:
!! Why Spark?
!! Spark Basics
!! Working With RDDs

Distributed Data Processing with Spark:
!! The Hadoop Distributed File System (HDFS)
!! Running Spark on a Cluster
!! Parallel Programming with Spark
!! Caching and Persistence
!! Writing Spark Applications

Solving Business Problems with Spark:
!! Spark Streaming
!! Common Patterns in Spark Programming
!! Improving Spark Performance
!! Spark, Hadoop, and the Enterprise Data Center

Course Conclusion:
!! Conclusion

Chapter Topics

Introduction (Course Introduction)

!! About This Course
!! About Cloudera
!! Course Logistics

Course Objectives (1)

During this course, you will learn
! What Apache Spark is, what problems it solves, and why you would want to use it
! The basic programming concepts of Spark: operations on Resilient Distributed Datasets (RDDs)
! How Spark works to distribute processing of big data across a cluster
! How Spark interacts with other components of a big data system: data storage and cluster resource management
! How to take advantage of key Spark features such as caching and shared variables to improve performance
! How to use Spark, either interactively using a Spark Shell or by writing your own Spark applications

Course Objectives (2)

! How to use Spark Streaming to process a live data stream in real time
! How Spark integrates with other parts of the Hadoop ecosystem to provide enterprise-level data processing

Chapter Topics

Introduction (Course Introduction)

!! About This Course
!! About Cloudera
!! Course Logistics

About Cloudera (1)

! The leader in Apache Spark and Hadoop-based software and services
! Founded by leading experts on Big Data processing from Facebook, Yahoo, Google, and Oracle
! Provides support, consulting, training, and certification
! Staff includes committers and contributors to virtually all Hadoop and Spark projects
! Many authors of industry-standard books on Apache Hadoop projects
  – Tom White, Lars George, Kathleen Ting, etc.

About Cloudera (2)

! Customers include:
  – Allstate, AOL Advertising, Box, CBS Interactive, eBay, Experian, Groupon, National Cancer Institute, Orbitz, Social Security Administration, Trend Micro, Trulia, US Army, …
! Cloudera public training:
  – Cloudera Developer Training for Apache Spark
  – Cloudera Developer Training for Apache Hadoop
  – Designing and Building Big Data Applications
  – Cloudera Administrator Training for Apache Hadoop
  – Cloudera Data Analyst Training: Using Pig, Hive, and Impala with Hadoop
  – Cloudera Training for Apache HBase
  – Introduction to Data Science: Building Recommender Systems
  – Cloudera Essentials for Apache Hadoop
! Onsite and custom training is also available

CDH

! CDH
  – 100% open source, enterprise-ready distribution of Hadoop and related projects
  – The most complete, tested, and widely-deployed distribution of Hadoop
  – Integrates all key Spark and Hadoop ecosystem projects

Cloudera Express

! Cloudera Express
  – Free download
! The best way to get started with Spark and Hadoop
! Includes CDH
! Includes Cloudera Manager
  – End-to-end administration
  – Deploy, manage, and monitor your cluster

Cloudera Enterprise

! Cloudera Enterprise
  – Subscription product including CDH and Cloudera Manager
! Includes support
! Includes extra Cloudera Manager features
  – Configuration history and rollbacks
  – Rolling updates
  – LDAP integration
  – SNMP support
  – Automated disaster recovery
  – Etc.

Chapter Topics

Introduction (Course Introduction)

!! About This Course
!! About Cloudera
!! Course Logistics

Logistics

! Course start and end times
! Lunch
! Breaks
! Restrooms
! Can I come in early/stay late?
! Access to the course materials

Introductions

! About your instructor
! About you
  – Experience with Spark or Hadoop?
  – Experience as a developer?
  – What programming languages do you usually use?
  – What programming language will you use in this course?
  – Expectations from the course?

Why Spark?
Chapter 2
Why Spark?

In this chapter you will learn
! What problems exist with traditional large-scale computing systems
! How Spark addresses those issues
! Some typical big data questions Spark can be used to answer

Chapter Topics

Why Spark? (Introduction to Spark)

!! Problems with Traditional Large-scale Systems
!! Spark!
!! Conclusion

Traditional Large-Scale Computation

! Traditionally, computation has been processor-bound
  – Relatively small amounts of data
  – Lots of complex processing
! The early solution: bigger computers
  – Faster processor, more memory
  – But even this couldn't keep up

Distributed Systems

! The better solution: more computers
  – Distributed systems: use multiple machines for a single job

"In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, we didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers."
                                        – Grace Hopper

Distributed Systems: Challenges

! Challenges with distributed systems
  – Programming complexity
  – Keeping data and processes in sync
  – Finite bandwidth
  – Partial failures

Distributed Systems: The Data Bottleneck (1)

! Traditionally, data is stored in a central location
! Data is copied to processors at runtime
! Fine for limited amounts of data

Distributed Systems: The Data Bottleneck (2)

! Modern systems have much more data
  – terabytes+ a day
  – petabytes+ total
! We need a new approach…

Big Data Processing

! Hadoop introduced a radical new approach based on two key concepts
  – Distribute the data when it is stored
  – Run computation where the data is
! Spark takes this new approach to the next level
  – Data is distributed in memory

Chapter Topics

Why Spark? (Introduction to Spark)

!! Problems with Traditional Large-scale Systems
!! Spark!
!! Conclusion

Introducing Apache Spark

! Apache Spark is a fast, general engine for large-scale data processing on a cluster
! Originally developed at AMPLab at UC Berkeley
  – Started as a research project in 2009
! Open source Apache project
  – Committers from Cloudera, Yahoo, Databricks, UC Berkeley, Intel, Groupon, …
  – One of the most active and fastest-growing Apache projects
  – Cloudera provides enterprise-level support for Spark

Distributed Processing with the Spark Framework

[Diagram: the Spark software stack]
  – API: Spark
  – Cluster Computing: Spark Standalone, YARN, Mesos
  – Storage: HDFS (Hadoop Distributed File System)

Advantages of Spark

! High-level programming framework
  – Programmers can focus on logic, not plumbing
! Cluster computing
  – Application processes are distributed across a cluster of worker nodes
  – Managed by a single "master"
  – Scalable and fault tolerant
! Distributed storage
  – Data is distributed when it is stored
  – Replicated for efficiency and fault tolerance
  – "Bring the computation to the data"
! Data in memory
  – Configurable caching for efficient iteration

Scalability

! Increasing load results in a graceful decline in performance
  – Not failure of the system
! Adding nodes adds capacity proportionally

[Chart: Capacity vs. Number of Nodes]

Fault Tolerance

! Node failure is inevitable
! What happens?
  – System continues to function
  – Master re-assigns tasks to a different node
  – Data replication = no loss of data
  – Nodes which recover rejoin the cluster automatically

Who Uses Spark?

! Yahoo!
  – Personalization and ad analytics
! Conviva
  – Real-time video stream optimization
! Technicolor
  – Real-time analytics for telco clients
! Ooyala
  – Cross-device personalized video experience
! Plus…
  – Intel, Groupon, TrendMicro, Autodesk, Nokia, Shopify, ClearStory, Technicolor, and many more…

Common Spark Use Cases

! Extract/Transform/Load (ETL)
! Text mining
! Index building
! Graph creation and analysis
! Pattern recognition
! Collaborative filtering
! Prediction models
! Sentiment analysis
! Risk assessment

! What do these workloads have in common? The nature of the data…
  – Volume
  – Velocity
  – Variety

Benefits of Spark

! Previously impossible or impractical analysis
! Lower cost
! Less time
! Greater flexibility
! Near-linear scalability

Spark v. Hadoop MapReduce

! Spark takes the concepts of MapReduce to the next level
  – Higher-level API = faster, easier development
  – Low latency = near real-time processing
  – In-memory data storage = up to 100x performance improvement
    [Chart: Logistic Regression]

Word count in Spark (Python):

    sc.textFile(file) \
        .flatMap(lambda s: s.split()) \
        .map(lambda w: (w,1)) \
        .reduceByKey(lambda v1,v2: v1+v2) \
        .saveAsTextFile(output)

The same job in Hadoop MapReduce (Java):

    public class WordCount {
      public static void main(String[] args) throws Exception {
        Job job = new Job();
        job.setJarByClass(WordCount.class);
        job.setJobName("Word Count");
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        boolean success = job.waitForCompletion(true);
        System.exit(success ? 0 : 1);
      }
    }

    public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String line = value.toString();
        for (String word : line.split("\\W+")) {
          if (word.length() > 0)
            context.write(new Text(word), new IntWritable(1));
        }
      }
    }

    public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int wordCount = 0;
        for (IntWritable value : values) {
          wordCount += value.get();
        }
        context.write(key, new IntWritable(wordCount));
      }
    }

Chapter Topics

Why Spark? (Introduction to Spark)

!! Problems with Traditional Large-scale Systems
!! Spark!
!! Conclusion

Key Points

! Traditional large-scale computing involved complex processing on small amounts of data
! Exponential growth in data drove development of distributed computing
! Distributed computing is difficult!
! Spark addresses big data distributed computing challenges
  – Bring the computation to the data
  – Fault tolerance
  – Scalability
  – Hides the 'plumbing' so developers can focus on the data
  – Caches data in memory

Spark Basics
Chapter 3
Spark Basics

In this chapter you will learn
! How to start the Spark Shell
! About the SparkContext
! Key concepts of Resilient Distributed Datasets (RDDs)
  – What are they?
  – How do you create them?
  – What operations can you perform with them?
! How Spark uses the principles of functional programming
! About the Hands-On Exercises for the course

Chapter Topics

Spark Basics (Introduction to Spark)

!! What is Apache Spark?
!! Using the Spark Shell
!! RDDs (Resilient Distributed Datasets)
!! Functional Programming in Spark
!! Conclusion
!! Hands-On Exercises

What is Apache Spark?

! Apache Spark is a fast and general engine for large-scale data processing
! Written in Scala
  – Functional programming language that runs in a JVM
! Spark Shell
  – Interactive – for learning or data exploration
  – Python or Scala
! Spark Applications
  – For large-scale data processing
  – Python, Scala, or Java

Chapter Topics

Spark Basics (Introduction to Spark)

!! What is Apache Spark?
!! Using the Spark Shell
!! RDDs (Resilient Distributed Datasets)
!! Functional Programming in Spark
!! Conclusion
!! Hands-On Exercises

Spark Shell

! The Spark Shell provides interactive data exploration (REPL)
! Writing standalone Spark applications will be covered later

Python Shell: pyspark

    $ pyspark
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version 1.0.0
          /_/

    Using Python version 2.6.6 (r266:84292, Jan 22 2014 09:42:36)
    SparkContext available as sc.
    >>>

Scala Shell: spark-shell

    $ spark-shell
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 1.0.0
          /_/

    Using Scala version 2.10.3 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51)
    Created spark context..
    Spark context available as sc.

    scala>

REPL: Read/Evaluate/Print Loop

Spark Context

! Every Spark application requires a Spark Context
  – The main entry point to the Spark API
! Spark Shell provides a preconfigured Spark Context called sc

Python:

    Using Python version 2.6.6 (r266:84292, Jan 22 2014 09:42:36)
    Spark context available as sc.
    >>> sc.appName
    u'PySparkShell'

Scala:

    Using Scala version 2.10.3 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51)
    Created spark context..
    Spark context available as sc.

    scala> sc.appName
    res0: String = Spark shell

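Outside the shell there is no preconfigured sc; an application constructs its own context. A minimal Python sketch (writing applications is covered in a later chapter; the application name here is arbitrary):

    # Minimal standalone PySpark application (sketch)
    # In the Spark Shell, sc already exists; in an application you create it yourself
    from pyspark import SparkContext

    sc = SparkContext(appName="MyFirstApp")
    print sc.appName    # prints: MyFirstApp (Python 2 syntax, as used in this course)
    sc.stop()           # release the context when the application is done
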
Chapter Topics

Spark Basics (Introduction to Spark)

!! What is Apache Spark?
!! Using the Spark Shell
!! RDDs (Resilient Distributed Datasets)
!! Functional Programming in Spark
!! Conclusion
!! Hands-On Exercise: Getting Started with RDDs

RDD (Resilient Distributed Dataset)

! RDD (Resilient Distributed Dataset)
  – Resilient – if data in memory is lost, it can be recreated
  – Distributed – stored in memory across the cluster
  – Dataset – initial data can come from a file or be created programmatically
! RDDs are the fundamental unit of data in Spark
! Most Spark programming consists of performing operations on RDDs

Creating an RDD

! Three ways to create an RDD (see the sketch below)
  – From a file or set of files
  – From data in memory
  – From another RDD

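A minimal sketch of all three routes in Python (assumes the shell's sc and the purplecow.txt file used later in this chapter; sc.parallelize is covered in the next chapter):

    from_file   = sc.textFile("purplecow.txt")              # 1. from a file
    from_memory = sc.parallelize(["a", "b", "c"])           # 2. from data in memory
    derived     = from_file.map(lambda line: line.upper())  # 3. from another RDD
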
File-Based RDDs

! For file-based RDDs, use SparkContext.textFile
  – Accepts a single file, a wildcard list of files, or a comma-separated list of files
  – Examples
    – sc.textFile("myfile.txt")
    – sc.textFile("mydata/*.log")
    – sc.textFile("myfile1.txt,myfile2.txt")
  – Each line in the file(s) is a separate record in the RDD
! Files are referenced by absolute or relative URI
  – Absolute URI: file:/home/training/myfile.txt
  – Relative URI (uses default file system): myfile.txt

Example: A File-based RDD

File: purplecow.txt
    I've never seen a purple cow.
    I never hope to see one;
    But I can tell you, anyhow,
    I'd rather see than be one.

    > mydata = sc.textFile("purplecow.txt")
    14/01/29 06:20:37 INFO storage.MemoryStore: Block broadcast_0 stored as
    values to memory (estimated size 151.4 KB, free 296.8 MB)

    > mydata.count()
    14/01/29 06:27:37 INFO spark.SparkContext: Job finished: take at
    <stdin>:1, took 0.160482078 s
    4

RDD: mydata (one record per line of the file)
    I've never seen a purple cow.
    I never hope to see one;
    But I can tell you, anyhow,
    I'd rather see than be one.

RDD Operations

! Two types of RDD operations
  – Actions – return values (RDD → value)
  – Transformations – define a new RDD based on the current one(s) (Base RDD → New RDD)

! Quiz:
  – Which type of operation is count()?

RDD Operations: Actions

! Some common actions
  – count() – return the number of elements
  – take(n) – return an array of the first n elements
  – collect() – return an array of all elements
  – saveAsTextFile(filename) – save to text file(s)

Python:

    > mydata = sc.textFile("purplecow.txt")
    > mydata.count()
    4
    > for line in mydata.take(2): print line
    I've never seen a purple cow.
    I never hope to see one;

Scala:

    > val mydata = sc.textFile("purplecow.txt")
    > mydata.count()
    4
    > for (line <- mydata.take(2)) println(line)
    I've never seen a purple cow.
    I never hope to see one;

RDD Operations: Transformations

! Transformations create a new RDD from an existing one
! RDDs are immutable
  – Data in an RDD is never changed
  – Transform in sequence to modify the data as needed
! Some common transformations
  – map(function) – creates a new RDD by performing a function on each record in the base RDD
  – filter(function) – creates a new RDD by including or excluding each record in the base RDD according to a boolean function

Example: map and filter Transformations

Input:
    I've never seen a purple cow.
    I never hope to see one;
    But I can tell you, anyhow,
    I'd rather see than be one.

Python: map(lambda line: line.upper())
Scala:  map(line => line.toUpperCase)

    I'VE NEVER SEEN A PURPLE COW.
    I NEVER HOPE TO SEE ONE;
    BUT I CAN TELL YOU, ANYHOW,
    I'D RATHER SEE THAN BE ONE.

Python: filter(lambda line: line.startswith('I'))
Scala:  filter(line => line.startsWith("I"))

    I'VE NEVER SEEN A PURPLE COW.
    I NEVER HOPE TO SEE ONE;
    I'D RATHER SEE THAN BE ONE.

Lazy Execution (1)

File: purplecow.txt
    I've never seen a purple cow.
    I never hope to see one;
    But I can tell you, anyhow,
    I'd rather see than be one.

! RDDs are not always immediately materialized
  – Spark logs the lineage of transformations used to build datasets
! Data in RDDs is not processed until an action is performed

    >

Lazy Execution (2)

! Data in RDDs is not processed until an action is performed
  – An RDD is materialized in memory upon the first action that uses it

    > mydata = sc.textFile("purplecow.txt")

RDD: mydata (defined, but not yet materialized)

Lazy Execution (3)

! Data in RDDs is not processed until an action is performed
  – An RDD is materialized in memory upon the first action that uses it

    > mydata = sc.textFile("purplecow.txt")
    > mydata_uc = mydata.map(lambda line: line.upper())

RDDs: mydata, mydata_uc (defined, but not yet materialized)

Lazy Execution (4)

! Data in RDDs is not processed until an action is performed
  – An RDD is materialized in memory upon the first action that uses it

    > mydata = sc.textFile("purplecow.txt")
    > mydata_uc = mydata.map(lambda line: line.upper())
    > mydata_filt = \
          mydata_uc.filter(lambda line: \
          line.startswith('I'))

RDDs: mydata, mydata_uc, mydata_filt (defined, but not yet materialized)

Lazy Execution (5)

! Data in RDDs is not processed until an action is performed
  – An RDD is materialized in memory upon the first action that uses it

    > mydata = sc.textFile("purplecow.txt")
    > mydata_uc = mydata.map(lambda line: line.upper())
    > mydata_filt = \
          mydata_uc.filter(lambda line: \
          line.startswith('I'))
    > mydata_filt.count()
    3

RDD: mydata
    I've never seen a purple cow.
    I never hope to see one;
    But I can tell you, anyhow,
    I'd rather see than be one.

RDD: mydata_uc
    I'VE NEVER SEEN A PURPLE COW.
    I NEVER HOPE TO SEE ONE;
    BUT I CAN TELL YOU, ANYHOW,
    I'D RATHER SEE THAN BE ONE.

RDD: mydata_filt
    I'VE NEVER SEEN A PURPLE COW.
    I NEVER HOPE TO SEE ONE;
    I'D RATHER SEE THAN BE ONE.

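You can inspect the logged lineage yourself with toDebugString (a sketch; the exact output format varies by Spark version):

    # Nothing below triggers a job -- these lines only record lineage
    mydata      = sc.textFile("purplecow.txt")
    mydata_filt = mydata.map(lambda line: line.upper()) \
                        .filter(lambda line: line.startswith('I'))
    print mydata_filt.toDebugString()   # shows the chain of RDDs back to the file
    mydata_filt.count()                 # only now is the file read and processed
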
Chaining Transformations

! Transformations may be chained together

    > mydata = sc.textFile("purplecow.txt")
    > mydata_uc = mydata.map(lambda line: line.upper())
    > mydata_filt = mydata_uc.filter(lambda line: line.startswith('I'))
    > mydata_filt.count()
    3

is exactly equivalent to

    > sc.textFile("purplecow.txt").map(lambda line: line.upper()) \
          .filter(lambda line: line.startswith('I')).count()
    3

Chapter Topics

Spark Basics (Introduction to Spark)

!! What is Apache Spark?
!! Using the Spark Shell
!! RDDs (Resilient Distributed Datasets)
!! Functional Programming in Spark
!! Conclusion
!! Hands-On Exercises

Functional Programming in Spark

! Spark depends heavily on the concepts of functional programming
  – Functions are the fundamental unit of programming
  – Functions have input and output only
  – No state or side effects
! Key concepts
  – Passing functions as input to other functions
  – Anonymous functions

Passing Functions as Parameters

! Many RDD operations take functions as parameters
! Pseudocode for the RDD map operation
  – Applies function fn to each record in the RDD

    RDD {
        map(fn(x)) {
            foreach record in rdd
                emit fn(record)
        }
    }

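The same idea in plain Python, for intuition only (a sketch, not Spark's actual implementation):

    # Plain-Python analogue of the pseudocode above
    def my_map(fn, records):
        result = []
        for record in records:         # "foreach record in rdd"
            result.append(fn(record))  # "emit fn(record)"
        return result

    print my_map(lambda x: x * 2, [1, 2, 3])   # [2, 4, 6]
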
Example: Passing Named Functions

! Python

    > def toUpper(s):
          return s.upper()
    > mydata = sc.textFile("purplecow.txt")
    > mydata.map(toUpper).take(2)

! Scala

    > def toUpper(s: String): String = { s.toUpperCase }
    > val mydata = sc.textFile("purplecow.txt")
    > mydata.map(toUpper).take(2)

Anonymous Functions

! Functions defined in-line without an identifier
  – Best for short, one-off functions
! Supported in many programming languages
  – Python: lambda x: ...
  – Scala: x => ...
  – Java 8: x -> ...

Example: Passing Anonymous Functions

! Python:

    > mydata.map(lambda line: line.upper()).take(2)

! Scala:

    > mydata.map(line => line.toUpperCase()).take(2)

OR

    > mydata.map(_.toUpperCase()).take(2)

Scala allows anonymous parameters using underscore (_)

Example: Java

Java 7 (anonymous inner class):

    ...
    JavaRDD<String> lines = sc.textFile("file");
    JavaRDD<String> lines_uc = lines.map(
        new Function<String, String>() {
            public String call(String line) {
                return line.toUpperCase();
            }
        });
    ...

Java 8 (lambda):

    ...
    JavaRDD<String> lines = sc.textFile("file");
    JavaRDD<String> lines_uc = lines.map(
        line -> line.toUpperCase());
    ...

Chapter Topics

Spark Basics (Introduction to Spark)

!! What is Apache Spark?
!! Using the Spark Shell
!! RDDs (Resilient Distributed Datasets)
!! Functional Programming in Spark
!! Conclusion
!! Hands-On Exercises

Key Points

! Spark can be used interactively via the Spark Shell
  – Python or Scala
  – Writing non-interactive Spark applications will be covered later
! RDDs (Resilient Distributed Datasets) are a key concept in Spark
! RDD operations
  – Transformations create a new RDD based on an existing one
  – Actions return a value from an RDD
! Lazy execution
  – Transformations are not executed until required by an action
! Spark uses functional programming
  – Passing functions as parameters
  – Anonymous functions in supported languages (Python and Scala)

Chapter Topics

Spark Basics (Introduction to Spark)

!! What is Apache Spark?
!! Using the Spark Shell
!! RDDs (Resilient Distributed Datasets)
!! Functional Programming in Spark
!! Conclusion
!! Hands-On Exercises

Introduction to Exercises: Getting Started

! Instructions are in the Hands-On Exercise Manual
! Start with
  – General Notes
  – Setting Up

Introduction to Exercises: Pick Your Language

! Your choice: Python or Scala
  – For most exercises in this course, you may choose to work with either Python or Scala
  – Exception: Spark Streaming material is currently presented only in Scala
  – Course examples are mostly presented in Python
! Solution and example files
  – .pyspark – Python shell commands
  – .scalaspark – Scala shell commands
  – .py – complete Python Spark applications
  – .scala – complete Scala Spark applications

Introduction to Exercises: Classroom Virtual Machine

! Your virtual machine
  – Log in as user training (password training)
  – Pre-installed and configured with
    – Spark and CDH
    – Various tools including Emacs, IntelliJ, and Maven
! Training materials: ~/training_materials/sparkdev folder on the VM
  – data – sample datasets used in exercises
  – examples – all the example code in this course
  – solutions – solutions for Scala Shell and Python exercises
  – stubs – starter code required in some exercises

Introduction to Exercises: The Data

! Most exercises are based around a hypothetical company: Loudacre Mobile
  – A cellular telephone company
! Loudacre Mobile Customer Support has many sources of data they need to process, transform, and analyze
  – Customer account data
  – Web server logs from Loudacre's customer support website
  – New device activation records
  – Customer support Knowledge Base articles
  – Information about models of supported devices

Hands-On Exercises

! Now, please do the following three Hands-On Exercises
  1. Viewing the Spark Documentation
     – Familiarize yourself with the Spark documentation; you will refer to this documentation frequently during the course
  2. Using the Spark Shell
     – Follow the instructions for either the Python or Scala shell
  3. Getting Started with RDDs
     – Use either the Python or Scala Spark Shell to explore the Loudacre weblogs
! Please refer to the Hands-On Exercise Manual

Working With RDDs
Chapter 4
Working With RDDs

In this chapter you will learn
! How RDDs are created
! Additional RDD operations
! Special operations available on RDDs of key-value pairs
! How MapReduce algorithms are implemented in Spark

Chapter Topics

Working With RDDs (Introduction to Spark)

!! A Closer Look at RDDs
!! Key-Value Pair RDDs
!! MapReduce
!! Other Pair RDD Operations
!! Conclusion
!! Hands-On Exercise: Working with Pair RDDs

RDDs

! RDDs can hold any type of element
  – Primitive types: integers, characters, booleans, etc.
  – Sequence types: strings, lists, arrays, tuples, dicts, etc. (including nested data types)
  – Scala/Java objects (if serializable)
  – Mixed types
! Some types of RDDs have additional functionality
  – Pair RDDs
    – RDDs consisting of key-value pairs
  – Double RDDs
    – RDDs consisting of numeric data

Creating RDDs From Collections

! You can create RDDs from collections instead of files
  – sc.parallelize(collection)

    > import random
    > randomnumlist = \
          [random.uniform(0,10) for _ in xrange(10000)]
    > randomrdd = sc.parallelize(randomnumlist)
    > print "Mean: %f" % randomrdd.mean()

! Useful when
  – Testing
  – Generating data programmatically
  – Integrating

Some Other General RDD Operations

! Transformations
  – flatMap – maps one element in the base RDD to multiple elements
  – distinct – filter out duplicates
  – union – add all elements of two RDDs into a single new RDD
! Other RDD operations
  – first – return the first element of the RDD
  – foreach – apply a function to each element in an RDD
  – top(n) – return the largest n elements using natural ordering
! Sampling operations
  – takeSample(withReplacement, num) – return an array of num sampled elements
! Double RDD operations
  – Statistical functions, e.g., mean, sum, variance, stdev
    (a combined sketch of several of these follows below)

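A combined sketch of several of the operations above (assumes the shell's sc; takeSample output varies from run to run):

    nums = sc.parallelize([1, 2, 2, 3, 3, 3])
    more = sc.parallelize([4, 5])
    both = nums.union(more)              # transformation: combine two RDDs
    print both.distinct().count()        # 5 distinct values: 1, 2, 3, 4, 5
    print both.first()                   # first element: 1
    print both.top(2)                    # two largest elements: [5, 4]
    print both.takeSample(False, 3)      # 3 elements sampled without replacement
    print both.mean(), both.stdev()      # Double RDD statistics
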
Example: flatMap and distinct

Python:

    > sc.textFile(file) \
          .flatMap(lambda line: line.split()) \
          .distinct()

Scala:

    > sc.textFile(file).
          flatMap(line => line.split("\\W")).
          distinct()

Input:
    I've never seen a purple cow.
    I never hope to see one;
    But I can tell you, anyhow,
    I'd rather see than be one.

After flatMap (one word per record, duplicates kept):
    I've, never, seen, a, purple, cow, I, never, hope, …

After distinct (duplicates removed):
    I've, never, seen, a, purple, cow, I, hope, …

Chapter Topics

Working With RDDs (Introduction to Spark)

!! A Closer Look at RDDs
!! Key-Value Pair RDDs
!! MapReduce
!! Other Pair RDD Operations
!! Conclusion
!! Hands-On Exercise: Working with Pair RDDs

Pair RDDs

! Pair RDDs are a special form of RDD
  – Each element must be a key-value pair (a two-element tuple)
  – Keys and values can be any type
! Why?
  – Use with MapReduce algorithms
  – Many additional functions are available for common data processing needs
    – e.g., sorting, joining, grouping, counting, etc.

Pair RDD:
    (key1,value1)
    (key2,value2)
    (key3,value3)
    …

Creating Pair RDDs

! The first step in most workflows is to get the data into key/value form
  – What should the RDD be keyed on?
  – What is the value?
! Commonly used functions to create Pair RDDs
  – map
  – flatMap / flatMapValues
  – keyBy

Example: A Simple Pair RDD

! Example: Create a Pair RDD from a tab-separated file

Python:

    > users = sc.textFile(file) \
          .map(lambda line: line.split('\t')) \
          .map(lambda fields: (fields[0],fields[1]))

Scala:

    > val users = sc.textFile(file).
          map(line => line.split('\t')).
          map(fields => (fields(0),fields(1)))

Input:
    user001    Fred Flintstone
    user090    Bugs Bunny
    user111    Harry Potter
    …

Result:
    (user001,Fred Flintstone)
    (user090,Bugs Bunny)
    (user111,Harry Potter)
    …

Example: Keying Web Logs by User ID

Python:

    > sc.textFile(logfile) \
          .keyBy(lambda line: line.split(' ')[2])

Scala:

    > sc.textFile(logfile).
          keyBy(line => line.split(' ')(2))

Input (the third space-separated field is the User ID):
    56.38.234.188 – 99788 "GET /KBDOC-00157.html HTTP/1.0" …
    56.38.234.188 – 99788 "GET /theme.css HTTP/1.0" …
    203.146.17.59 – 25254 "GET /KBDOC-00230.html HTTP/1.0" …

Result:
    (99788,56.38.234.188 – 99788 "GET /KBDOC-00157.html…)
    (99788,56.38.234.188 – 99788 "GET /theme.css…)
    (25254,203.146.17.59 – 25254 "GET /KBDOC-00230.html…)

Question 1: Pairs With Complex Values

! How would you do this?
  – Input: a list of postal codes with latitude and longitude
  – Output: postal code (key) and lat/long pair (value)

Input:
    00210 43.005895 -71.013202
    00211 43.005895 -71.013202
    00212 43.005895 -71.013202
    00213 43.005895 -71.013202
    00214 43.005895 -71.013202
    …

Desired result:
    (00210,(43.005895,-71.013202))
    (00211,(43.005895,-71.013202))
    (00212,(43.005895,-71.013202))
    (00213,(43.005895,-71.013202))
    …

Answer 1: Pairs With Complex Values

    > sc.textFile(file) \
          .map(lambda line: line.split()) \
          .map(lambda fields: (fields[0],(fields[1],fields[2])))

Input:
    00210 43.005895 -71.013202
    00211 43.005895 -71.013202
    00212 43.005895 -71.013202
    00213 43.005895 -71.013202
    00214 43.005895 -71.013202
    …

Result:
    (00210,(43.005895,-71.013202))
    (00211,(43.005895,-71.013202))
    (00212,(43.005895,-71.013202))
    (00213,(43.005895,-71.013202))
    …

Question 2: Mapping Single Rows to Multiple Pairs (1)

! How would you do this?
  – Input: order numbers with a list of SKUs in the order
  – Output: order (key) and sku (value)

Input Data:
    00001    sku010:sku933:sku022
    00002    sku912:sku331
    00003    sku888:sku022:sku010:sku594
    00004    sku411

Pair RDD:
    (00001,sku010)
    (00001,sku933)
    (00001,sku022)
    (00002,sku912)
    (00002,sku331)
    (00003,sku888)
    …

Question 2: Mapping Single Rows to Multiple Pairs (2)

! Hint: map alone won't work – it produces exactly one output record per input record, so the best it can do is one pair per order, with the whole SKU list as the value:

    00001    sku010:sku933:sku022
    00002    sku912:sku331
    00003    sku888:sku022:sku010:sku594
    00004    sku411

    (00001,(sku010,sku933,sku022))
    (00002,(sku912,sku331))
    (00003,(sku888,sku022,sku010,sku594))
    (00004,(sku411))

Answer 2: Mapping Single Rows to Multiple Pairs (1)

    > sc.textFile(file)

    00001    sku010:sku933:sku022
    00002    sku912:sku331
    00003    sku888:sku022:sku010:sku594
    00004    sku411

Answer 2: Mapping Single Rows to Multiple Pairs (2)

    > sc.textFile(file) \
          .map(lambda line: line.split('\t'))

    [00001,sku010:sku933:sku022]
    [00002,sku912:sku331]
    [00003,sku888:sku022:sku010:sku594]
    [00004,sku411]

Answer 2: Mapping Single Rows to Multiple Pairs (3)

    > sc.textFile(file) \
          .map(lambda line: line.split('\t')) \
          .map(lambda fields: (fields[0],fields[1]))

    (00001,sku010:sku933:sku022)
    (00002,sku912:sku331)
    (00003,sku888:sku022:sku010:sku594)
    (00004,sku411)

Answer 2: Mapping Single Rows to Multiple Pairs (4)

    > sc.textFile(file) \
          .map(lambda line: line.split('\t')) \
          .map(lambda fields: (fields[0],fields[1])) \
          .flatMapValues(lambda skus: skus.split(':'))

    (00001,sku010)
    (00001,sku933)
    (00001,sku022)
    (00002,sku912)
    (00002,sku331)
    (00003,sku888)
    …
    (00004,sku411)

Chapter Topics

Working With RDDs (Introduction to Spark)

!! A Closer Look at RDDs
!! Key-Value Pair RDDs
!! MapReduce
!! Other Pair RDD Operations
!! Conclusion
!! Hands-On Exercise: Working with Pair RDDs

MapReduce

! MapReduce is a common programming model
  – Easily applicable to distributed processing of large data sets
! Hadoop MapReduce is the best-known implementation
  – Somewhat limited
    – Each job has one Map phase, one Reduce phase
    – Job output is saved to files
! Spark implements MapReduce with much greater flexibility
  – Map and Reduce functions can be interspersed (see the sketch below)
  – Results are stored in memory
  – Operations can easily be chained

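For instance, one Spark job can chain a reduce phase into further map and sort phases with no intermediate files, which a single Hadoop MapReduce job cannot do. A sketch with hypothetical (store, amount) sales pairs:

    sales  = sc.parallelize([("a", 1.0), ("b", 2.0), ("a", 3.0), ("b", 4.0)])
    result = sales.reduceByKey(lambda v1, v2: v1 + v2) \
                  .map(lambda (store, total): (total, store)) \
                  .sortByKey(ascending=False)   # a second phase, no job boundary
    print result.collect()   # [(6.0, 'b'), (4.0, 'a')]
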
MapReduce in Spark

! MapReduce in Spark works on Pair RDDs
! Map phase
  – Operates on one record at a time
  – "Maps" each record to one or more new records
  – map and flatMap
! Reduce phase
  – Works on Map output
  – Consolidates multiple records
  – reduceByKey

MapReduce Example: Word Count

Input Data:
    the cat sat on the mat
    the aardvark sat on the sofa

Result:
    aardvark 1
    cat 1
    mat 1
    on 2
    sat 2
    sofa 1
    the 4

Example: Word Count (1)

    > counts = sc.textFile(file)

Input:
    the cat sat on the mat
    the aardvark sat on the sofa

Example: Word Count (2)

    > counts = sc.textFile(file) \
          .flatMap(lambda line: line.split())

After flatMap (one word per record):
    the
    cat
    sat
    on
    the
    mat
    the
    aardvark
    …

Example: Word Count (3)

    > counts = sc.textFile(file) \
          .flatMap(lambda line: line.split()) \
          .map(lambda word: (word,1))

After map (key-value pairs):
    (the, 1)
    (cat, 1)
    (sat, 1)
    (on, 1)
    (the, 1)
    (mat, 1)
    (the, 1)
    (aardvark, 1)
    …

Example: Word Count (4)

    > counts = sc.textFile(file) \
          .flatMap(lambda line: line.split()) \
          .map(lambda word: (word,1)) \
          .reduceByKey(lambda v1,v2: v1+v2)

After reduceByKey:
    (aardvark, 1)
    (cat, 1)
    (mat, 1)
    (on, 2)
    (sat, 2)
    (sofa, 1)
    (the, 4)

ReduceByKey

! reduceByKey functions must be
  – Binary – combines two values at a time
  – Commutative – x+y = y+x
  – Associative – (x+y)+z = x+(y+z)

    > counts = sc.textFile(file) \
          .flatMap(lambda line: line.split()) \
          .map(lambda word: (word,1)) \
          .reduceByKey(lambda v1,v2: v1+v2)

Values for the same key are combined pairwise, in no guaranteed order, e.g. for "the":
    (the,1) + (the,1) = (the,2)
    (the,2) + (the,1) = (the,3)
    (the,3) + (the,1) = (the,4)

Final result:
    (aardvark, 1)
    (cat, 1)
    (mat, 1)
    (on, 2)
    (sat, 2)
    (sofa, 1)
    (the, 4)

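One consequence: a non-associative function such as averaging cannot be passed to reduceByKey directly, because avg(avg(a,b),c) is not avg(a,b,c). The usual workaround is to reduce over (sum, count) pairs; a sketch:

    data  = sc.parallelize([("a", 1.0), ("a", 2.0), ("a", 6.0)])
    sums  = data.mapValues(lambda v: (v, 1)) \
                .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
    means = sums.mapValues(lambda (total, count): total / count)
    print means.collect()   # [('a', 3.0)]
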
Word Count Recap (the Scala Version)

    > val counts = sc.textFile(file).
          flatMap(line => line.split("\\W")).
          map(word => (word,1)).
          reduceByKey((v1,v2) => v1+v2)

OR

    > val counts = sc.textFile(file).
          flatMap(_.split("\\W")).
          map((_,1)).
          reduceByKey(_+_)

Why Do We Care About Counting Words?

! Word count is challenging over massive amounts of data
  – Using a single compute node would be too time-consuming
  – Number of unique words could exceed available memory
! Statistics are often simple aggregate functions
  – Distributive in nature
  – e.g., max, min, sum, count
! MapReduce breaks complex tasks down into smaller elements which can be executed in parallel
! Many common tasks are very similar to word count
  – e.g., log file analysis (see the sketch below)

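For example, counting requests per user in the Loudacre web logs (user ID in the third space-separated field, as shown earlier) is word count with a different key; a sketch:

    hits = sc.textFile(logfile) \
             .map(lambda line: (line.split(' ')[2], 1)) \
             .reduceByKey(lambda v1, v2: v1 + v2)
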
Chapter Topics

Working With RDDs (Introduction to Spark)

!! Key-Value Pair RDDs
!! MapReduce
!! Other Pair RDD Operations
!! Conclusion
!! Hands-On Exercise: Working with Pair RDDs

Pair RDD Operations

! In addition to map and reduce functions, Spark has several operations specific to Pair RDDs
! Examples (a quick sketch follows below)
  – countByKey – return a map with the count of occurrences of each key
  – groupByKey – group all the values for each key in an RDD
  – sortByKey – sort in ascending or descending order
  – join – return an RDD containing all pairs with matching keys from two RDDs

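A quick sketch of countByKey and sortByKey on the (order,sku) pairs from earlier (join is shown on the following pages):

    orders = sc.parallelize([("00001", "sku010"), ("00001", "sku933"),
                             ("00002", "sku912"), ("00003", "sku888")])
    print orders.countByKey()      # action: a map of counts, e.g. 00001 -> 2
    print orders.sortByKey(ascending=False).collect()   # keys in descending order
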
Example: Pair RDD Operations

Starting with the (order,sku) pairs from the previous example:

sortByKey(ascending=False):
    (00004,sku411)
    (00003,sku888)
    (00003,sku022)
    (00003,sku010)
    (00003,sku594)
    (00002,sku912)
    (00002,sku331)
    (00001,sku010)
    (00001,sku933)
    (00001,sku022)

groupByKey():
    (00002,[sku912,sku331])
    (00001,[sku010,sku933,sku022])
    (00003,[sku888,sku022,sku010,sku594])
    (00004,[sku411])

Example: Joining by Key

    > movies = moviegross.join(movieyear)

RDD: moviegross
    (Casablanca,$3.7M)
    (Star Wars,$775M)
    (Annie Hall,$38M)
    (Argo,$232M)
    …

RDD: movieyear
    (Casablanca,1942)
    (Star Wars,1977)
    (Annie Hall,1977)
    (Argo,2012)
    …

Result:
    (Casablanca,($3.7M,1942))
    (Star Wars,($775M,1977))
    (Annie Hall,($38M,1977))
    (Argo,($232M,2012))

Using Join

! A common programming pattern
  1. Map separate datasets into key-value Pair RDDs
  2. Join by key
  3. Map joined data into the desired format
  4. Save, display, or continue processing…

Example: Join Web Log With Knowledge Base Articles (1)

weblogs (fields: IP address – User ID – requested file …):
    56.38.234.188 – 99788 "GET /KBDOC-00157.html HTTP/1.0" …
    56.38.234.188 – 99788 "GET /theme.css HTTP/1.0" …
    203.146.17.59 – 25254 "GET /KBDOC-00230.html HTTP/1.0" …
    221.78.60.155 – 45402 "GET /titanic_4000_sales.html HTTP/1.0" …
    65.187.255.81 – 14242 "GET /KBDOC-00107.html HTTP/1.0" …

joined with

kblist (Article ID : Article Title):
    KBDOC-00157:Ronin Novelty Note 3 - Back up files
    KBDOC-00230:Sorrento F33L - Transfer Contacts
    KBDOC-00050:Titanic 1000 - Transfer Contacts
    KBDOC-00107:MeeToo 5.0 - Transfer Contacts
    KBDOC-00300:iFruit 5A – overheats

Example: Join Web Log With Knowledge Base Articles (2)

! Steps
  1. Map separate datasets into key-value Pair RDDs
     a. Map web log requests to (docid,userid)
     b. Map KB Doc index to (docid,title)
  2. Join by key: docid
  3. Map joined data into the desired format: (userid,title)
  4. Further processing: group titles by User ID

Step 1a: Map Web Log Requests to (docid,userid)

    > import re
    > def getRequestDoc(s):
          return re.search(r'KBDOC-[0-9]*',s).group()

    > kbreqs = sc.textFile(logfile) \
          .filter(lambda line: 'KBDOC-' in line) \
          .map(lambda line: (getRequestDoc(line),line.split(' ')[2])) \
          .distinct()

Input:
    56.38.234.188 – 99788 "GET /KBDOC-00157.html HTTP/1.0" …
    56.38.234.188 – 99788 "GET /theme.css HTTP/1.0" …
    203.146.17.59 – 25254 "GET /KBDOC-00230.html HTTP/1.0" …
    221.78.60.155 – 45402 "GET /titanic_4000_sales.html HTTP/1.0" …
    65.187.255.81 – 14242 "GET /KBDOC-00107.html HTTP/1.0" …
    …

kbreqs:
    (KBDOC-00157,99788)
    (KBDOC-00230,25254)
    (KBDOC-00107,14242)

Step 1b: Map KB Index to (docid,title)

    > kblist = sc.textFile(kblistfile) \
          .map(lambda line: line.split(':')) \
          .map(lambda fields: (fields[0],fields[1]))

Input:
    KBDOC-00157:Ronin Novelty Note 3 - Back up files
    KBDOC-00230:Sorrento F33L - Transfer Contacts
    KBDOC-00050:Titanic 1000 - Transfer Contacts
    KBDOC-00107:MeeToo 5.0 - Transfer Contacts
    KBDOC-00206:iFruit 5A – overheats

kblist:
    (KBDOC-00157,Ronin Novelty Note 3 - Back up files)
    (KBDOC-00230,Sorrento F33L - Transfer Contacts)
    (KBDOC-00050,Titanic 1000 - Transfer Contacts)
    (KBDOC-00107,MeeToo 5.0 - Transfer Contacts)

Step 2: Join by Key docid

    > titlereqs = kbreqs.join(kblist)

kbreqs:
    (KBDOC-00157,99788)
    (KBDOC-00230,25254)
    (KBDOC-00107,14242)
    …

kblist:
    (KBDOC-00157,Ronin Novelty Note 3 - Back up files)
    (KBDOC-00230,Sorrento F33L - Transfer Contacts)
    (KBDOC-00050,Titanic 1000 - Transfer Contacts)
    (KBDOC-00107,MeeToo 5.0 - Transfer Contacts)

Result:
    (KBDOC-00157,(99788,Ronin Novelty Note 3 - Back up files))
    (KBDOC-00230,(25254,Sorrento F33L - Transfer Contacts))
    (KBDOC-00107,(14242,MeeToo 5.0 - Transfer Contacts))

Step 3: Map Result to Desired Format (userid,title)

    > titlereqs = kbreqs.join(kblist) \
          .map(lambda (docid,(userid,title)): (userid,title))

Input:
    (KBDOC-00157,(99788,Ronin Novelty Note 3 - Back up files))
    (KBDOC-00230,(25254,Sorrento F33L - Transfer Contacts))
    (KBDOC-00107,(14242,MeeToo 5.0 - Transfer Contacts))

Result:
    (99788,Ronin Novelty Note 3 - Back up files)
    (25254,Sorrento F33L - Transfer Contacts)
    (14242,MeeToo 5.0 - Transfer Contacts)

Step 4: Continue Processing – Group Titles by User ID

    > titlereqs = kbreqs.join(kblist) \
          .map(lambda (docid,(userid,title)): (userid,title)) \
          .groupByKey()

Input:
    (99788,Ronin Novelty Note 3 - Back up files)
    (25254,Sorrento F33L - Transfer Contacts)
    (14242,MeeToo 5.0 - Transfer Contacts)

Result:
    (99788,[Ronin Novelty Note 3 - Back up files,
            Ronin S3 - overheating])
    (25254,[Sorrento F33L - Transfer Contacts])
    (14242,[MeeToo 5.0 - Transfer Contacts,
            MeeToo 5.1 - Back up files,
            iFruit 1 - Back up files,
            MeeToo 3.1 - Transfer Contacts])

Example Output

    > for (userid,titles) in titlereqs.take(10):
          print 'user id: ',userid
          for title in titles: print '\t',title

    user id:  99788
        Ronin Novelty Note 3 - Back up files
        Ronin S3 - overheating
    user id:  25254
        Sorrento F33L - Transfer Contacts
    user id:  14242
        MeeToo 5.0 - Transfer Contacts
        MeeToo 5.1 - Back up files
        iFruit 1 - Back up files
        MeeToo 3.1 - Transfer Contacts
    …

Aside: Anonymous Function Parameters

! Python and Scala pattern matching can help improve code readability

Python:

    > map(lambda (docid,(userid,title)): (userid,title))

Scala:

    > map(pair => (pair._2._1,pair._2._2))

OR

    > map{case (docid,(userid,title)) => (userid,title)}

Input:
    (KBDOC-00157,(99788,…title…))
    (KBDOC-00230,(25254,…title…))
    (KBDOC-00107,(14242,…title…))
    …

Result:
    (99788,…title…)
    (25254,…title…)
    (14242,…title…)
    …

Other Pair Operations

! Some other pair operations (a short sketch follows below)
  – keys – return an RDD of just the keys, without the values
  – values – return an RDD of just the values, without the keys
  – lookup(key) – return the value(s) for a key
  – leftOuterJoin, rightOuterJoin – join, including keys defined only in the left or right RDD respectively
  – mapValues, flatMapValues – execute a function on just the values, keeping the key the same
! See the PairRDDFunctions class Scaladoc for a full list

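A short sketch of keys, lookup, and leftOuterJoin, reusing the movie pairs from earlier:

    gross = sc.parallelize([("Casablanca", "$3.7M"), ("Argo", "$232M")])
    years = sc.parallelize([("Casablanca", 1942)])
    print gross.keys().collect()    # ['Casablanca', 'Argo']
    print gross.lookup("Argo")      # ['$232M']
    print gross.leftOuterJoin(years).collect()
    # [('Casablanca', ('$3.7M', 1942)), ('Argo', ('$232M', None))]
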
Chapter Topics

Working With RDDs (Introduction to Spark)

!! Key-Value Pair RDDs
!! MapReduce
!! Other Pair RDD Operations
!! Conclusion
!! Hands-On Exercise: Working with Pair RDDs

Key(Points(

! Pair%RDDs%are%a%special%form%of%RDD%consis.ng%of%Key#Value%pairs%(tuples)%
! Spark%provides%several%opera.ons%for%working%with%Pair%RDDs%
! MapReduce%is%a%generic%programming%model%for%distributed%processing%
– Spark(implements(MapReduce(with(Pair(RDDs(
– Hadoop(MapReduce(and(other(implementaFons(are(limited(to(a(single(
Map(and(Reduce(phase(per(job(
– Spark(allows(flexible(chaining(of(map(and(reduce(operaFons(
– Spark(provides(operaFons(to(easily(perform(common(MapReduce(
algorithms(like(joining,(sorFng,(and(grouping(

©(Copyright(201082015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriDen(consent.( 04#49%
Chapter(Topics(

Working%With%RDDs% Introduc.on%to%Spark%

!! Key8Value(Pair(RDDs(
!! MapReduce(
!! Other(Pair(RDD(OperaFons(
!! Conclusion(
!! Hands#On%Exercise:%Working%with%Pair%RDDs%

©(Copyright(201082015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriDen(consent.( 04#50%
Hands8On(Exercise:(Working(with(Pair(RDDs(

! Hands#On%Exercise:%Working(with(Pair(RDDs(
– ConFnue(exploring(web(server(log(files(using(key8value(Pair(RDDs(
– Join(log(data(with(user(account(data(
! Please%refer%to%the%Hands#On%Exercise%Manual%

©(Copyright(201082015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriDen(consent.( 04#51%
The$Hadoop$Distributed$File$System$
(HDFS)$
Chapter$5$
Course$Chapters$
!! IntroducIon$ Course$IntroducIon$
!! What$is$Apache$Spark?$
!! Spark$Basics$ IntroducIon$to$Spark$
!! Working$With$RDDs$

!! The%Hadoop%Distributed%File%System%(HDFS)%
!! Running$Spark$on$a$Cluster$
Distributed%Data%Processing%%
!! Parallel$Programming$with$Spark$
with%Spark%
!! Caching$and$Persistence$
!! WriIng$Spark$ApplicaIons$

!! Spark$Streaming$
!! Common$PaFerns$in$Spark$Programming$ Solving$Business$Problems$$
!! Improving$Spark$Performance$ with$Spark$
!! Spark,$Hadoop,$and$the$Enterprise$Data$Center$

!! Conclusion$ Course$Conclusion$

©$Copyright$2010?2015$Cloudera.$All$rights$reserved.$Not$to$be$reproduced$without$prior$wriFen$consent.$ 05#2%
The$Hadoop$Distributed$File$System$

In%this%chapter%you%will%learn%
! How%HDFS%supports%Big%Data%processing%by%distribuEng%data%storage%
across%a%cluster%
! How%to%save%and%retrieve%data%from%HDFS%using%both%command%line%tools%
and%the%Spark%API%

©$Copyright$2010?2015$Cloudera.$All$rights$reserved.$Not$to$be$reproduced$without$prior$wriFen$consent.$ 05#3%
Chapter$Topics$

Distributed%Data%Processing%%
The%Hadoop%Distributed%File%System%
with%Spark%

!! Why%HDFS?%
!! HDFS$Architecture$
!! Using$HDFS$
!! Conclusion$
!! Hands?On$Exercise:$Using$HDFS$

©$Copyright$2010?2015$Cloudera.$All$rights$reserved.$Not$to$be$reproduced$without$prior$wriFen$consent.$ 05#4%
Distributed$Processing$with$the$Spark$Framework$

API$

Spark$

Cluster$CompuIng$ Storage$
•  Spark$Standalone$
•  YARN$ HDFS$
•  Mesos$

©$Copyright$2010?2015$Cloudera.$All$rights$reserved.$Not$to$be$reproduced$without$prior$wriFen$consent.$ 05#5%
Big$Data$Processing$with$Spark$

! Three%key%concepts%
– Distribute$data$when$the$data$is$stored$–$HDFS$$
– Run$computaIon$where$the$data$is$–$HDFS$and$Spark$
– Cache$data$in$memory$–$Spark$$

©$Copyright$2010?2015$Cloudera.$All$rights$reserved.$Not$to$be$reproduced$without$prior$wriFen$consent.$ 05#6%
Chapter$Topics$

Distributed%Data%Processing%%
The%Hadoop%Distributed%File%System%
with%Spark%

!! Why$HDFS?$
!! HDFS%Architecture%
!! Using$HDFS$
!! Conclusion$
!! Hands?On$Exercise:$Using$HDFS$

©$Copyright$2010?2015$Cloudera.$All$rights$reserved.$Not$to$be$reproduced$without$prior$wriFen$consent.$ 05#7%
HDFS$Basic$Concepts$(1)$

! HDFS%is%a%filesystem%wriPen%in%Java%
– Based$on$Google’s$GFS$
! Sits%on%top%of%a%naEve%filesystem%
– Such$as$ext3,$ext4,$or$xfs$
! Provides%redundant%storage%for%massive%amounts%of%data%
– Using$readily?available,$industry?standard$computers$

HDFS%

NaIve$OS$filesystem$

Disk$Storage$

©$Copyright$2010?2015$Cloudera.$All$rights$reserved.$Not$to$be$reproduced$without$prior$wriFen$consent.$ 05#8%
HDFS$Basic$Concepts$(2)$

! HDFS%performs%best%with%a%‘modest’%number%of%large%files%
– Millions,$rather$than$billions,$of$files$
– Each$file$typically$100MB$or$more$
! Files%in%HDFS%are%‘write%once’%
– No$random$writes$to$files$are$allowed$
! HDFS%is%opEmized%for%large,%streaming%reads%of%files%
– Rather$than$random$reads$

©$Copyright$2010?2015$Cloudera.$All$rights$reserved.$Not$to$be$reproduced$without$prior$wriFen$consent.$ 05#9%
How$Files$Are$Stored$

! Data%files%are%split%into%128MB%blocks%which%are%distributed%at%load%Eme%
! Each%block%is%replicated%on%mulEple%data%nodes%(default%3x)%
! NameNode%stores%metadata%
[Diagram] A very large data file is split into blocks (Block 1,
Block 2, Block 3); each block is stored on three of the DataNodes,
while the NameNode holds the metadata: information about files and
blocks.

©$Copyright$2010?2015$Cloudera.$All$rights$reserved.$Not$to$be$reproduced$without$prior$wriFen$consent.$ 05#10%
Example:$Storing$and$Retrieving$Files$(1)$

[Diagram] Two local files, /logs/031512.log and /logs/042313.log,
are to be stored in an HDFS cluster of five nodes (A-E).
©$Copyright$2010?2015$Cloudera.$All$rights$reserved.$Not$to$be$reproduced$without$prior$wriFen$consent.$ 05#11%
Example:$Storing$and$Retrieving$Files$(2)$

NameNode metadata:
/logs/031512.log: B1,B2,B3      B1: A,B,D   B2: B,D,E   B3: A,B,C
/logs/042313.log: B4,B5         B4: A,B,E   B5: C,E,D

[Diagram] The blocks are placed accordingly on the DataNodes:
Node A: 1,3,4   Node B: 1,2,3,4   Node C: 3,5
Node D: 1,2,5   Node E: 2,4,5

©$Copyright$2010?2015$Cloudera.$All$rights$reserved.$Not$to$be$reproduced$without$prior$wriFen$consent.$ 05#12%
Example:$Storing$and$Retrieving$Files$(3)$

[Diagram] A client asks the NameNode for /logs/042313.log; the
NameNode consults its metadata and replies with the file's block
list, B4,B5, and the nodes that hold each block.

©$Copyright$2010?2015$Cloudera.$All$rights$reserved.$Not$to$be$reproduced$without$prior$wriFen$consent.$ 05#13%
Example:$Storing$and$Retrieving$Files$(4)$

[Diagram] The client then retrieves blocks B4 and B5 directly from
the DataNodes that store them.

©$Copyright$2010?2015$Cloudera.$All$rights$reserved.$Not$to$be$reproduced$without$prior$wriFen$consent.$ 05#14%
HDFS$NameNode$Availability$

! The%NameNode%daemon%must%be%running%at%all%Emes%
– If$the$NameNode$stops,$the$cluster$becomes$inaccessible$

! HDFS is typically set up for High Availability
– Two NameNodes: Active and Standby

! Small clusters may use 'Classic mode'
– One NameNode
– One "helper" node called the Secondary NameNode
– Bookkeeping, not backup

©$Copyright$2010?2015$Cloudera.$All$rights$reserved.$Not$to$be$reproduced$without$prior$wriFen$consent.$ 05#15%
Chapter$Topics$

Distributed%Data%Processing%%
The%Hadoop%Distributed%File%System%
with%Spark%

!! Why$HDFS?$$
!! HDFS$Architecture$
!! Using%HDFS%
!! Conclusion$
!! Hands?On$Exercise:$Using$HDFS$

©$Copyright$2010?2015$Cloudera.$All$rights$reserved.$Not$to$be$reproduced$without$prior$wriFen$consent.$ 05#16%
OpIons$for$Accessing$HDFS$ $$
! From the command line
– FsShell: hdfs dfs

[Diagram] The client puts files into, and gets files from, the
HDFS cluster.

! In Spark
– By URI, e.g.
  hdfs://host:port/file…

! Other programs
– Java API
– Used by Hadoop MapReduce, Impala, Hue, Sqoop, Flume, etc.
– RESTful interface

©$Copyright$2010?2015$Cloudera.$All$rights$reserved.$Not$to$be$reproduced$without$prior$wriFen$consent.$ 05#17%
hdfs dfs%Examples$(1)$

! Copy%file%foo.txt%from%local%disk%to%the%user’s%directory%in%HDFS%

$ hdfs dfs -put foo.txt foo.txt

– This$will$copy$the$file$to$/user/username/foo.txt
! Get%a%directory%lisEng%of%the%user’s%home%directory%in%HDFS%

$ hdfs dfs -ls

! Get%a%directory%lisEng%of%the%HDFS%root%directory%

$ hdfs dfs -ls /

©$Copyright$2010?2015$Cloudera.$All$rights$reserved.$Not$to$be$reproduced$without$prior$wriFen$consent.$ 05#18%
hdfs dfs%Examples$(2)$

! Display%the%contents%of%the%HDFS%file%/user/fred/bar.txt%%

$ hdfs dfs -cat /user/fred/bar.txt

! Copy%that%file%to%the%local%disk,%named%as%baz.txt

$ hdfs dfs -get /user/fred/bar.txt baz.txt

! Create%a%directory%called%input%under%the%user’s%home%directory%

$ hdfs dfs -mkdir input

Note:$copyFromLocal$is$a$synonym$for$put;$copyToLocal$is$a$synonym$for$get$$

©$Copyright$2010?2015$Cloudera.$All$rights$reserved.$Not$to$be$reproduced$without$prior$wriFen$consent.$ 05#19%
hdfs dfs%Examples$(3)$

! Delete%the%directory%input_old%and%all%its%contents%

$ hdfs dfs -rm -r input_old

©$Copyright$2010?2015$Cloudera.$All$rights$reserved.$Not$to$be$reproduced$without$prior$wriFen$consent.$ 05#20%
Example:$HDFS$in$Spark$

! Specify%HDFS%files%in%Spark%by%URI%
– hdfs://hdfs-host[:port]/path
– Default$port$is$8020$

> mydata = sc.textFile \


("hdfs://hdfs-host:port/user/training/purplecow.txt")

> mydata.map(lambda s: s.upper()).\


saveAsTextFile \
("hdfs://hdfs-host:port/user/training/purplecowuc")

©$Copyright$2010?2015$Cloudera.$All$rights$reserved.$Not$to$be$reproduced$without$prior$wriFen$consent.$ 05#21%
Using$HDFS$By$Default$

! If%Hadoop%configuraEon%files%are%on%Spark’s%classpath,%Spark%will%use%HDFS%
by%default%
– e.g.$/etc/hadoop/conf
! Paths%are%relaEve%to%the%user’s%home%HDFS%directory%

> mydata = sc.textFile("purplecow.txt")

hdfs://hdfs-host:port/user/training/purplecow.txt$

©$Copyright$2010?2015$Cloudera.$All$rights$reserved.$Not$to$be$reproduced$without$prior$wriFen$consent.$ 05#22%
Chapter$Topics$

Distributed%Data%Processing%%
The%Hadoop%Distributed%File%System%
with%Spark%

!! Why$HDFS?$$
!! HDFS$Architecture$
!! Using$HDFS$
!! Conclusion%
!! Hands?On$Exercise:$Using$HDFS$

©$Copyright$2010?2015$Cloudera.$All$rights$reserved.$Not$to$be$reproduced$without$prior$wriFen$consent.$ 05#23%
Key$Points$

! HDFS%provides%a%key%component%of%big%data%processing%
– Distribute$data$when$it$is$stored,$so$that$computaIon$can$be$run$where$
the$data$is$
! How%HDFS%works%
– Files$are$divided$into$blocks$
– Blocks$are$replicated$across$nodes$
! Command%line%access%to%HDFS%
– FsShell:$hdfs dfs
– Sub?commands:$-get,$-put,$-ls,$-cat,$etc.$
! Spark%access%to%HDFS%
– sc.textFile$and$rdd.saveAsTextFile$methods$$
– e.g.,$hdfs://host:port/path/to/file

©$Copyright$2010?2015$Cloudera.$All$rights$reserved.$Not$to$be$reproduced$without$prior$wriFen$consent.$ 05#24%
Chapter$Topics$

Distributed%Data%Processing%%
The%Hadoop%Distributed%File%System%
with%Spark%

!! Why$HDFS?$$
!! HDFS$Architecture$
!! Using$HDFS$
!! Conclusion$
!! Hands#On%Exercise:%Using%HDFS%

©$Copyright$2010?2015$Cloudera.$All$rights$reserved.$Not$to$be$reproduced$without$prior$wriFen$consent.$ 05#25%
Hands?On$Exercise:$Using$HDFS$

! Hands#On%Exercise:%Using&HDFS&
– Begin$to$get$acquainted$with$the$Hadoop$Distributed$File$System$$
– Read$and$write$files$using$hdfs dfs%on$the$command$line,$and$from$
the$Spark$Shell$
! Please%refer%to%the%Hands#On%Exercise%Manual%

©$Copyright$2010?2015$Cloudera.$All$rights$reserved.$Not$to$be$reproduced$without$prior$wriFen$consent.$ 05#26%
Running&Spark&on&a&Cluster&
Chapter&6&
Course&Chapters&
!! IntroducEon& Course&IntroducEon&
!! What&is&Apache&Spark?&
!! Spark&Basics& IntroducEon&to&Spark&
!! Working&With&RDDs&

!! The&Hadoop&Distributed&File&System&(HDFS)&
!! Running%Spark%on%a%Cluster%
Distributed%Data%Processing%%
!! Parallel&Programming&with&Spark&
with%Spark%
!! Caching&and&Persistence&
!! WriEng&Spark&ApplicaEons&

!! Spark&Streaming&
!! Common&PaCerns&in&Spark&Programming& Solving&Business&Problems&&
!! Improving&Spark&Performance& with&Spark&
!! Spark,&Hadoop,&and&the&Enterprise&Data&Center&

!! Conclusion& Course&Conclusion&

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriCen&consent.& 06#2%
Running&Spark&on&a&Cluster&

In%this%chapter%you%will%learn%
! Spark%clustering%concepts%and%terminology%
! Spark%deployment%opAons%
! How%to%run%a%Spark%applicaAon%on%a%Spark%Standalone%cluster%

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriCen&consent.& 06#3%
Chapter&Topics&

Distributed%Data%Processing%%
Running%Spark%on%a%Cluster%
with%Spark%

!! Overview%
!! A&Spark&Standalone&Cluster&
!! The&Spark&Standalone&Web&UI&
!! Spark&Deployment&OpEons&
!! Conclusion&
!! Hands9On&Exercise:&Running&the&Spark&Shell&on&a&Cluster&

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriCen&consent.& 06#4%
Spark&Cluster&OpEons&

! Spark%can%run%
– Locally&
– No&distributed&processing&
– Locally&with&mulEple&worker&threads&
– On&a&cluster&
– Spark&Standalone&
– Apache&Hadoop&YARN&(Yet&Another&Resource&NegoEator)&
– Apache&Mesos&

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriCen&consent.& 06#5%
Why&Run&on&a&Cluster?&

! Run%Spark%on%a%cluster%to%get%the%advantages%of%distributed%processing%
– Ability&to&process&large&amounts&of&data&efficiently&
– Fault&tolerance&and&scalability&&
! Local%mode%is%useful%for%development%and%tesAng%
! ProducAon%use%is%almost%always%on%a%cluster%

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriCen&consent.& 06#6%
Distributed&Processing&with&the&Spark&Framework&

API&

Spark&

Cluster&CompuEng& Storage&
•  Spark&Standalone&
•  YARN& HDFS&
•  Mesos&

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriCen&consent.& 06#7%
Spark&Cluster&Terminology&

! A%cluster%is%a%group%of%computers%working%together%
– Usually&runs&HDFS&in&addiEon&to&Spark&Standalone,&YARN,&or&Mesos&
! A%node%is%an%individual%computer%in%the%cluster%
– Master&nodes&manage&distribuEon&of&work&and&data&to&worker&nodes&
! A%daemon%is%a%program%running%on%a%node%
– Each&performs&different&funcEons&in&the&cluster&

[Diagram] A cluster manager master node and an HDFS master node
coordinate a set of worker nodes.

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriCen&consent.& 06#8%
The&Spark&Driver&Program&

! A%Spark%Driver%
– The&“main”&program&
– Either&the&Spark&Shell&or&a&Spark&applicaEon&
– Creates&a&Spark&Context&configured&for&the&cluster&
– Communicates&with&Cluster&Manager&to&distribute&tasks&to&executors&
[Diagram] The driver program (containing the Spark Context)
communicates with the cluster manager on the master node, which
distributes tasks to the executors on the worker nodes.

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriCen&consent.& 06#9%
StarEng&the&Spark&Shell&on&a&Cluster&

! Set%the%Spark%Shell%master%to%
– url&–&the&URL&of&the&cluster&manager&
– local[*]%–&run&with&as&many&threads&as&cores&(default)&
– local[n] – run locally with n worker threads
– local&–&run&locally&without&distributed&processing&
! This%configures%the%SparkContext.master%property%

Python& $ MASTER=spark://masternode:7077 pyspark

Scala& $ spark-shell --master spark://masternode:7077

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriCen&consent.& 06#10%
Chapter&Topics&

Distributed%Data%Processing%%
Running%Spark%on%a%Cluster%
with%Spark%

!! Overview&
!! A%Spark%Standalone%Cluster%
!! The&Spark&Standalone&Web&UI&
!! Spark&Deployment&OpEons&
!! Conclusion&
!! Hands9On&Exercise:&Running&the&Spark&Shell&on&a&Cluster&

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriCen&consent.& 06#11%
Spark&Standalone&Daemons&

! Spark%Standalone%daemons%
– Spark&Master&
– One&per&cluster&
– Manages&applicaEons,&distributes&individual&tasks&to&Spark&Workers&
– Spark&Worker&
– One&per&worker&node&
– Starts&and&monitors&Executors&for&applicaEons&
[Diagram] The Spark Master daemon runs on the cluster master node;
a Spark Worker daemon runs on each worker node.

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriCen&consent.& 06#12%
Running&Spark&on&a&Standalone&Cluster&(1)&
[Diagram] Each worker (slave) node runs a SparkWorker daemon
alongside an HDFS DataNode; the master node runs the Spark Master,
and the HDFS master node runs the NameNode. A client machine
connects to the cluster.

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriCen&consent.& 06#13%
Running&Spark&on&a&Standalone&Cluster&(2)&
$ hdfs dfs -put mydata

[Diagram] The client stores the file mydata in HDFS; its blocks
(Block 1, Block 2) are distributed across the DataNodes on the
worker nodes.

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriCen&consent.& 06#14%
Running&Spark&on&a&Standalone&Cluster&(3)&
[Diagram] The driver program (with its Spark Context) starts on the
client and connects to the Spark Master.

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriCen&consent.& 06#15%
Running&Spark&on&a&Standalone&Cluster&(4)&
[Diagram] The Spark Master directs each SparkWorker to start an
executor for the application on its node.

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriCen&consent.& 06#16%
Running&Spark&on&a&Standalone&Cluster&(5)&
[Diagram] The driver distributes tasks to the executors; tasks
process the HDFS blocks (Block 1, Block 2) stored on their nodes.

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriCen&consent.& 06#17%
Chapter&Topics&

Distributed%Data%Processing%%
Running%Spark%on%a%Cluster%
with%Spark%

!! Cluster&Overview&
!! A&Spark&Standalone&Cluster&
!! The%Spark%Standalone%Web%UI%
!! Spark&Deployment&OpEons&
!! Conclusion&
!! Hands9On&Exercise:&Running&the&Spark&Shell&on&a&Cluster&

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriCen&consent.& 06#18%
Spark&Standalone&Web&UI&

! Spark%Standalone%clusters%offer%a%Web%UI%to%monitor%the%cluster%
– http://masternode:uiport
– e.g.,&in&our&class&environment,&http://localhost:18080

[Screenshot callouts] The UI shows the master URL, the worker
nodes, and the applications.

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriCen&consent.& 06#19%
Spark&Standalone&Web&UI:&ApplicaEon&Overview&

[Screenshot callouts] The page links to the Spark Application UI
and lists the executors for this application.

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriCen&consent.& 06#20%
Spark&Standalone&Web&UI:&Worker&Detail&

[Screenshot callouts] The page shows log files and all executors on
this node.

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriCen&consent.& 06#21%
Chapter&Topics&

Distributed%Data%Processing%%
Running%Spark%on%a%Cluster%
with%Spark%

!! Overview&
!! A&Spark&Standalone&Cluster&
!! The&Spark&Standalone&Web&UI&
!! Spark%Deployment%OpAons%
!! Conclusion&
!! Hands9On&Exercise:&Running&the&Spark&Shell&on&a&Cluster&

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriCen&consent.& 06#22%
Supported&Cluster&Resource&Managers&

! Spark%Standalone%
– Included&with&Spark&
– Easy&to&install&and&run&
– Limited&configurability&and&scalability&
– Useful&for&tesEng,&development,&or&small&systems&
! Hadoop%YARN%
– Included&in&CDH&
– Most&common&for&producEon&sites&
– Allows&sharing&cluster&resources&with&other&applicaEons&(MapReduce,&
Impala,&etc.)&
! Apache%Mesos%
– First&plaeorm&supported&by&Spark&
– Now&used&less&ofen&

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriCen&consent.& 06#23%
Client&Mode&and&Cluster&Mode&

! By%default,%the%driver%program%runs%outside%the%cluster%
– Called&“client”&deploy&mode&
– Most&common&
– Required&for&interacEve&use&(e.g.,&the&Spark&Shell)&
! It%is%also%possible%to%run%the%driver%program%on%a%worker%node%in%the%
cluster
– Called "cluster" deploy mode

[Diagram] The client submits the application to the cluster manager
on the master node; the driver program then runs on one of the
worker nodes, alongside the executors.
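A minimal sketch of choosing the deploy mode with spark-submit
(myapp.py, MyApp, and myapp.jar are hypothetical names; cluster
deploy mode is not available for every combination of language and
cluster manager):

$ spark-submit --master spark://masternode:7077 \
    --deploy-mode client myapp.py    # driver runs on the client (default)

$ spark-submit --master spark://masternode:7077 \
    --deploy-mode cluster \
    --class MyApp myapp.jar          # driver runs on a worker node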

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriCen&consent.& 06#24%
Installing&a&Spark&Cluster&(1)&

! ProducAon%cluster%installaAon%is%usually%performed%by%a%system%
administrator%
– Out&of&the&scope&of&this&course&
! Developers%should%understand%how%the%components%of%a%cluster%work%
together%
! Developers%oXen%test%first%locally,%then%on%a%small%test%cluster%

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriCen&consent.& 06#25%
Installing&a&Spark&Cluster&(2)&

! Difficult:%
– Download&and&install&Spark&and&HDFS&directly&from&Apache&

! Easier:%CDH%
– Cloudera’s&DistribuEon,&including&Apache&Hadoop&
– Includes&HDFS,&Spark&API,&Spark&Standalone,&and&YARN&
– Includes&many&patches,&backports,&bug&fixes&
&
! Easiest:%Cloudera%Manager%
– Wizard9based&UI&to&install,&configure,&and&manage&a&cluster&
– Included&with&Cloudera&Express&(free)&or&Cloudera&Enterprise&
– Supports&Spark&deployment&as&Standalone&or&YARN&

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriCen&consent.& 06#26%
Sejng&Up&a&Spark&Standalone&Cluster&on&EC2&

! Spark%includes%support%to%easily%set%up%and%manage%a%Spark%Standalone%
cluster%on%Amazon%Web%Services%EC2%
– Create&your&own&AWS&account&
– Use&the&spark-ec2&script&to&
– Start,&pause,&and&stop&a&cluster&
– Launch&an&applicaEon&on&the&cluster&
– Specify&regions,&spot&pricing,&Spark&version,&and&other&opEons&
– Use&distributed&files&stored&on&Amazon&S3&(Simple&Storage&Service)&
– s3://path/to/file
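For example, a sketch of typical spark-ec2 usage; the key pair,
identity file, and cluster name are hypothetical, and AWS
credentials must be set in the environment:

$ export AWS_ACCESS_KEY_ID=…
$ export AWS_SECRET_ACCESS_KEY=…
$ ./spark-ec2 -k mykeypair -i mykeypair.pem -s 3 launch mycluster
$ ./spark-ec2 destroy mycluster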

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriCen&consent.& 06#27%
Chapter&Topics&

Distributed%Data%Processing%%
Running%Spark%on%a%Cluster%
with%Spark%

!! Overview&
!! A&Spark&Standalone&Cluster&
!! The&Spark&Standalone&Web&UI&
!! Spark&Deployment&OpEons&
!! Conclusion%
!! Hands9On&Exercise:&Running&the&Spark&Shell&on&a&Standalone&Cluster&

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriCen&consent.& 06#28%
Key&Points&

! Spark%is%designed%to%run%on%a%cluster%
– Spark&includes&a&basic&cluster&management&plaeorm&called&Spark&
Standalone&
– Can&also&run&on&Hadoop&YARN&and&Mesos&
! The%master%distributes%tasks%to%individual%workers%in%the%cluster%
– Tasks&run&in&executors*–&JVMs&running&on&worker&nodes&
! Spark%clusters%work%closely%with%HDFS%
– Tasks&are&assigned&to&workers&where&the&data&is&physically&stored&when&
possible&
! Spark%Standalone%provides%a%UI%for%monitoring%the%cluster%
– YARN&has&its&own&UI&

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriCen&consent.& 06#29%
Chapter&Topics&

Distributed%Data%Processing%%
Running%Spark%on%a%Cluster%
with%Spark%

!! Overview&
!! A&Spark&Standalone&Cluster&
!! The&Spark&Standalone&Web&UI&
!! Spark&Deployment&OpEons&
!! Conclusion&
!! Hands#On%Exercise:%Running%the%Spark%Shell%on%a%Cluster%

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriCen&consent.& 06#30%
Hands9On&Exercise:&Running&Spark&on&a&Cluster&

! Hands#On%Exercise:%Running&Spark&on&a&Cluster&
– Start&the&Spark&Standalone&daemons&(Spark&Master&and&Spark&Worker)&
on&your&local&machine&(a&simulated&Spark&Standalone&cluster)&
– Run&the&Spark&Shell&on&the&cluster&
– View&the&Spark&Standalone&UI&
! Please%refer%to%the%Hands#On%Exercise%Manual%

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriCen&consent.& 06#31%
Parallel&Programming&with&Spark&
Chapter&7&
Course&Chapters&
!! IntroducFon& Course&IntroducFon&
!! What&is&Apache&Spark?&
!! Spark&Basics& IntroducFon&to&Spark&
!! Working&With&RDDs&

!! The&Hadoop&Distributed&File&System&(HDFS)&
!! Running&Spark&on&a&Cluster&
Distributed%Data%Processing%%
!! Parallel%Programming%with%Spark%
with%Spark%
!! Caching&and&Persistence&
!! WriFng&Spark&ApplicaFons&

!! Spark&Streaming&
!! Common&PaDerns&in&Spark&Programming& Solving&Business&Problems&&
!! Improving&Spark&Performance& with&Spark&
!! Spark,&Hadoop,&and&the&Enterprise&Data&Center&

!! Conclusion& Course&Conclusion&

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#2%
Parallel&Programming&with&Spark&

In%this%chapter%you%will%learn%
! How%RDDs%are%distributed%across%a%cluster%
! How%Spark%executes%RDD%operaBons%in%parallel%

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#3%
Chapter&Topics&

Parallel%Programming%with%Spark% Distributed%Data%Processing%%
with%Spark%

!! RDD%ParBBons%%
!! ParFFoning&of&File9based&RDDs&
!! HDFS&and&Data&Locality&
!! Hands9On&Exercise:&Working&With&ParFFons&
!! ExecuFng&Parallel&OperaFons&
!! Stages&and&Tasks&
!! Conclusion&
!! Hands9On&Exercise:&Viewing&Stages&and&Tasks&in&the&Spark&ApplicaFon&UI&
&

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#4%
Spark&Cluster&Review&
[Diagram] Review: a client, a cluster master node, and an HDFS
master node; executors on the worker (slave) nodes run tasks.

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#5%
RDDs&on&a&Cluster&

! Resilient Distributed Datasets
– Data is partitioned across worker nodes
! Partitioning is done automatically by Spark
– Optionally, you can control how many partitions are created

[Diagram] The three partitions of RDD 1 (rdd_1_0, rdd_1_1,
rdd_1_2) are held by three different executors.

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#6%
Chapter&Topics&

Parallel%Programming%with%Spark% Distributed%Data%Processing%%
with%Spark%

!! RDD&ParFFons&
!! ParBBoning%of%File#based%RDDs%%
!! HDFS&and&Data&Locality&
!! Hands9On&Exercise:&Working&With&ParFFons&
!! ExecuFng&Parallel&OperaFons&
!! Stages&and&Tasks&
!! Conclusion&
!! Hands9On&Exercise:&Viewing&Stages&and&Tasks&in&the&Spark&ApplicaFon&UI&
&

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#7%
File&ParFFoning:&Single&Files&

! Partitions from single files
– Partitions based on size
– You can optionally specify a minimum number of partitions
  textFile(file, minPartitions)
– Default is 2
– More partitions = more parallelization

[Diagram] sc.textFile("myfile",3) splits myfile into three
partitions, one per executor.
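The resulting partition count can be checked with getNumPartitions
(available on RDDs in newer Spark releases); a minimal sketch,
where myfile is a hypothetical input file:

> mydata = sc.textFile("myfile", 3)
> mydata.getNumPartitions()
3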

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#8%
File&ParFFoning:&MulFple&Files&
!  sc.textFile("mydir/*")
– Each file becomes (at least) one partition
– File-based operations can be done per-partition, for example
parsing XML
!  sc.wholeTextFiles("mydir")
– For many small files
– Creates a key-value PairRDD
– key = file name
– value = file contents
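A minimal sketch of wholeTextFiles, assuming mydir is a hypothetical
directory of small files:

> files = sc.wholeTextFiles("mydir")    # pair RDD of (filename, contents)
> sizes = files.mapValues(lambda contents: len(contents))
> sizes.take(2)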

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#9%
OperaFng&on&ParFFons&

! Most%RDD%operaBons%work%on%each%element%of%an%RDD%
! A%few%work%on%each%par00on*
– foreachPartition&–&call&a&funcFon&for&each&parFFon&
– mapPartitions&–&create&a&new&RDD&by&execuFng&a&funcFon&on&each&
parFFon&in&the&current&RDD&
– mapPartitionsWithIndex&–&same&as&mapPartitions&but&
includes&index&of&the&RDD&
! FuncBons%for%parBBon%operaBons%take%iterators%
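For example, a minimal foreachPartition sketch in which a database
connection (the connect_to_db helper is hypothetical) is opened once
per partition rather than once per element:

> def save_partition(partiter):
      conn = connect_to_db()     # hypothetical helper
      for line in partiter:
          conn.insert(line)      # hypothetical method
      conn.close()

> sc.textFile("weblogs/*").foreachPartition(save_partition)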

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#10%
Example:&Count&JPGs&Requests&per&File&

Python:
> def countJpgs(index,partIter):
      jpgcount = 0
      for line in partIter:
          if "jpg" in line: jpgcount += 1
      yield (index,jpgcount)

> jpgcounts = sc.textFile("weblogs/*") \
      .mapPartitionsWithIndex(countJpgs)

Scala:
> def countJpgs(index: Int, partIter: Iterator[String]): Iterator[(Int,Int)] = {
      var jpgcount = 0
      for (line <- partIter)
          if (line.contains("jpg")) jpgcount += 1
      Iterator((index,jpgcount))
  }
> var jpgcounts = sc.textFile("weblogs/*").
      mapPartitionsWithIndex(countJpgs)

Note: works with small files that each fit in a single partition

jpgcounts:
(0,237)
(1,132)
(2,188)
(3,193)
…

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#11%
Chapter&Topics&

Distributed%Data%Processing%%
Parallel%Programming%with%Spark%
with%Spark%

!! RDD&ParFFons&
!! ParFFoning&of&File9based&RDDs&&
!! HDFS%and%Data%Locality%
!! Hands9On&Exercise:&Working&With&ParFFons&
!! ExecuFng&Parallel&OperaFons&
!! Stages&and&Tasks&
!! Conclusion&
!! Hands9On&Exercise:&Viewing&Stages&and&Tasks&in&the&Spark&ApplicaFon&UI&
&

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#12%
HDFS&and&Data&Locality&(1)&

$ hdfs dfs -put mydata

[Diagram] The client stores mydata in HDFS; its blocks (Block 1,
Block 2, Block 3) are distributed across the cluster, each block on
a node that also runs a Spark executor.

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#13%
HDFS&and&Data&Locality&(2)&

sc.textFile("hdfs://…mydata…").collect()

By default, Spark partitions file-based RDDs by block: each block
loads into a single partition.

[Diagram] The driver program on the client creates an RDD with one
partition per HDFS block.

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#14%
HDFS&and&Data&Locality&(3)&

sc.textFile("hdfs://…mydata…").collect()

An action triggers execution: tasks on executors load data from
blocks into partitions.

[Diagram] Each executor runs a task that reads its local HDFS block
into an RDD partition.

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#15%
HDFS&and&Data&Locality&(4)&

sc.textFile("hdfs://…mydata…").collect()

Data is distributed across executors until an action returns a
value to the driver.

[Diagram] The loaded partitions remain in executor memory; the
collect() action returns results to the driver program on the
client.

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#16%
Chapter&Topics&

Distributed%Data%Processing%%
Parallel%Programming%with%Spark%
with%Spark%

!! RDD&ParFFons&
!! ParFFoning&of&File9based&RDDs&&
!! HDFS&and&Data&Locality&
!! Hands#On%Exercise:%Working%With%ParBBons%
!! ExecuFng&Parallel&OperaFons&
!! Stages&and&Tasks&
!! Conclusion&
!! Hands9On&Exercise:&Viewing&Stages&and&Tasks&in&the&Spark&ApplicaFon&UI&

&
©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#17%
Hands9On&Exercise:&Working&With&ParFFons&

! Hands#On%Exercise:%Working*With*Par00ons*
– Parse&mulFple&small&XML&files&containing&device&acFvaFon&records&
– Use&provided&XML&parsing&funcFons&in&exercise&stubs&
– Find&the&most&common&device&models&in&the&dataset&
! Please%refer%to%the%Hands#On%Exercise%Manual%

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#18%
Chapter&Topics&

Distributed%Data%Processing%%
Parallel%Programming%with%Spark%
with%Spark%

!! RDD&ParFFons&
!! ParFFoning&of&File9based&RDDs&&
!! HDFS&and&Data&Locality&
!! Hands9On&Exercise:&Working&With&ParFFons&
!! ExecuBng%Parallel%OperaBons%
!! Stages&and&Tasks&
!! Conclusion&
!! Hands9On&Exercise:&Viewing&Stages&and&Tasks&in&the&Spark&ApplicaFon&UI&

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#19%
Parallel&OperaFons&on&ParFFons&

! RDD%operaBons%are%executed%in%parallel%on%each%parBBon%
– When&possible,&tasks&execute&on&the&worker&nodes&where&the&data&is&in&
memory&&
! Some%operaBons%preserve%parBBoning%
– e.g.,&map,&flatMap,&filter
! Some%operaBons%reparBBon%
– e.g.,&reduce,&sort,&group
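A sketch using getNumPartitions again, assuming words is a
hypothetical pair RDD with four partitions:

> words.getNumPartitions()                                        # 4
> words.filter(lambda (k,v): v > 1).getNumPartitions()            # still 4
> words.reduceByKey(lambda v1,v2: v1 + v2, 8).getNumPartitions()  # 8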

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#20%
Example:&Average&Word&Length&by&LeDer&(1)&

> avglens = sc.textFile(file)

RDD&

HDFS:
mydata

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#21%
Example:&Average&Word&Length&by&LeDer&(2)&

> avglens = sc.textFile(file) \


.flatMap(lambda line: line.split())

RDD& RDD&

HDFS:
mydata

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#22%
Example:&Average&Word&Length&by&LeDer&(3)&

> avglens = sc.textFile(file) \


.flatMap(lambda line: line.split()) \
.map(lambda word: (word[0],len(word)))

RDD& RDD& RDD&

HDFS:
mydata

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#23%
Example:&Average&Word&Length&by&LeDer&(4)&

> avglens = sc.textFile(file) \


.flatMap(lambda line: line.split()) \
.map(lambda word: (word[0],len(word))) \
.groupByKey()

RDD& RDD& RDD&


RDD&

HDFS:
mydata

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#24%
Example:&Average&Word&Length&by&LeDer&(5)&

> avglens = sc.textFile(file) \


.flatMap(lambda line: line.split()) \
.map(lambda word: (word[0],len(word))) \
.groupByKey() \
.map(lambda (k, values): \
(k, sum(values)/len(values)))

RDD& RDD& RDD&


RDD& RDD&

HDFS:
mydata

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#25%
Chapter&Topics&

Distributed%Data%Processing%%
Parallel%Programming%with%Spark%
with%Spark%

!! RDD&ParFFons&
!! ParFFoning&of&File9based&RDDs&&
!! HDFS&and&Data&Locality&
!! Hands9On&Exercise:&Working&With&ParFFons&
!! ExecuFng&Parallel&OperaFons&
!! Stages%and%Tasks%
!! Conclusion&
!! Hands9On&Exercise:&Viewing&Stages&and&Tasks&in&the&Spark&ApplicaFon&UI&

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#26%
Stages&

! OperaBons%that%can%run%on%the%same%parBBon%are%executed%in%stages*
! Tasks%within%a%stage%are%pipelined%together%
! Developers%should%be%aware%of%stages%to%improve%performance%

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#27%
Spark&ExecuFon:&Stages&(1)&

> avglens = sc.textFile(file) \


.flatMap(lambda line: line.split()) \
.map(lambda word: (word[0],len(word))) \
.groupByKey() \
.map(lambda (k, values): \
(k, sum(values)/len(values)))
> avglens.count()

Stage&1& Stage&2&
RDD& RDD& RDD&
RDD& RDD&

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#28%
Spark&ExecuFon:&Stages&(2)&

> avglens = sc.textFile(file) \


.flatMap(lambda line: line.split()) \
.map(lambda word: (word[0],len(word))) \
.groupByKey() \
.map(lambda (k, values): \
(k, sum(values)/len(values)))
> avglens.count()

Stage&1& Stage&2&

Task&1&
Task&4&
Task&2&
Task&5&
Task&3&

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#29%
Spark&ExecuFon:&Stages&(3)&

> avglens = sc.textFile(file) \


.flatMap(lambda line: line.split()) \
.map(lambda word: (word[0],len(word))) \
.groupByKey() \
.map(lambda (k, values): \
(k, sum(values)/len(values)))
> avglens.count()

Stage&1& Stage&2&

Task&1&
Task&4&
Task&2&
Task&5&
Task&3&

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#30%
Spark&ExecuFon:&Stages&(4)&

> avglens = sc.textFile(file) \


.flatMap(lambda line: line.split()) \
.map(lambda word: (word[0],len(word))) \
.groupByKey() \
.map(lambda (k, values): \
(k, sum(values)/len(values)))
> avglens.count()

Stage&1& Stage&2&

Task&1& Task&4&

Task&2& Task&5&

Task&3&

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#31%
Summary&of&Spark&Terminology&

! Job%–%a&set&of&tasks&executed&as&a&result&of&an&ac$on*
! Stage%–%a&set&of&tasks&in&a&job&that&can&be&executed&in&parallel&
! Task%–%an&individual&unit&of&work&sent&to&one&executor&

[Diagram] A job consists of two stages; each stage is a set of
tasks running in parallel over the RDD's partitions.

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#32%
How&Spark&Calculates&Stages&

! Spark%constructs%a%DAG%(Directed%Acyclic%Graph)%of%RDD%dependencies%
! Narrow%operaBons%
– Only&one&child&depends&on&the&RDD&
– No&shuffle&required&between&nodes&
– Can&be&collapsed&into&a&single&stage&
– e.g.,&map,&filter,&union
! Wide%operaBons%
– MulFple&children&depend&on&the&RDD&
– Defines&a&new&stage&
– e.g.,&reduceByKey,&join,&groupByKey
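One way to inspect the resulting DAG is the RDD's toDebugString
method; a sketch using the course's weblogs dataset:

> rdd = sc.textFile("weblogs/*") \
      .map(lambda line: (line.split()[0],1)) \
      .reduceByKey(lambda v1,v2: v1 + v2)
> print rdd.toDebugString()

The printed lineage shows the shuffle introduced by reduceByKey,
which is where the new stage begins.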

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#33%
Controlling&the&Level&of&Parallelism&

! “Wide”%operaBons%(e.g.,%reduceByKey)%parBBon%result%RDDs%
– More&parFFons&=&more&parallel&tasks&
– Cluster&will&be&under9uFlized&if&there&are&too&few&parFFons&
! You%can%control%how%many%parBBons%
– Configure&with&the&spark.default.parallelism&property&

spark.default.parallelism 10

– OpFonal&numPartitions%parameter&in&funcFon&call&

> words.reduceByKey(lambda v1, v2: v1 + v2, 15)

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#34%
Spark&ExecuFon:&Task&Scheduling&(1)&

[Diagram] Stage 1 (Tasks 1-3) and Stage 2 (Tasks 4-5) are queued;
three executors are available, each on a node holding one HDFS
block.

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#35%
Spark&ExecuFon:&Task&Scheduling&(2)&

[Diagram] The Stage 1 tasks (1-3) are scheduled, one per executor,
on the nodes holding the corresponding blocks; Stage 2's tasks wait.

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#36%
Spark&ExecuFon:&Task&Scheduling&(3)&

[Diagram] Stage 1 completes; only Stage 2 (Tasks 4-5) remains to be
scheduled.

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#37%
Spark&ExecuFon:&Task&Scheduling&(4)&

[Diagram] The Stage 2 tasks (4 and 5) run on two of the executors,
completing the job.

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#38%
Viewing&Stages&in&the&Spark&ApplicaFon&UI&

! You%can%view%the%execuBon%stages%in%the%Spark%ApplicaBon%UI%

[Screenshot callouts] Stages are identified by the last operation
in the stage; the number of tasks equals the number of partitions;
the UI also shows the data shuffled between stages.

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#39%
Chapter&Topics&

Distributed%Data%Processing%%
Parallel%Programming%with%Spark%
with%Spark%

!! RDD&ParFFons&
!! ParFFoning&of&File9based&RDDs&&
!! HDFS&and&Data&Locality&
!! Hands9On&Exercise:&Working&With&ParFFons&
!! ExecuFng&Parallel&OperaFons&
!! Stages&and&Tasks&
!! Conclusion%
!! Hands9On&Exercise:&Viewing&Stages&and&Tasks&in&the&Spark&ApplicaFon&UI&

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#40%
Key&Points&

! RDDs%are%stored%in%the%memory%of%Spark%executor%JVMs%
! Data%is%split%into%parBBons%–%each%parBBon%in%a%separate%executor%
! RDD%operaBons%are%executed%on%parBBons%in%parallel%
! OperaBons%that%depend%on%the%same%parBBon%are%pipelined%together%in%
stages%
– e.g.,&map,&filter
! OperaBons%that%depend%on%mulBple%parBBons%are%executed%in%separate%
stages%
– e.g.,&join,&reduceByKey

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#41%
Chapter&Topics&

Distributed%Data%Processing%%
Parallel%Programming%with%Spark%
with%Spark%

!! RDD&ParFFons&
!! ParFFoning&of&File9based&RDDs&&
!! HDFS&and&Data&Locality&
!! Hands9On&Exercise:&Working&With&ParFFons&
!! ExecuFng&Parallel&OperaFons&
!! Stages&and&Tasks&
!! Conclusion&
!! Hands#On%Exercise:%Viewing%Stages%and%Tasks%in%the%Spark%ApplicaBon%UI%

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#42%
Hands9On&Exercise:&Viewing&Stages&and&Tasks&in&the&Spark&
ApplicaFon&UI&
! Hands#On%Exercise:%Viewing*Stages*and*Tasks*in*the*Spark*Applica0on*UI*
– Use&the&Spark&ApplicaFon&UI&to&view&how&stages&and&tasks&are&executed&
in&a&job&
! Please%refer%to%the%Hands#On%Exercise%Manual%

©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriDen&consent.& 07#43%
Caching(and(Persistence(
Chapter(8(
Course(Chapters(
!! IntroducCon( Course(IntroducCon(
!! What(is(Apache(Spark?(
!! Spark(Basics( IntroducCon(to(Spark(
!! Working(With(RDDs(

!! The(Hadoop(Distributed(File(System((HDFS)(
!! Running(Spark(on(a(Cluster(
Distributed%Data%Processing%%
!! Parallel(Programming(with(Spark(
with%Spark%
!! Caching%and%Persistence%
!! WriCng(Spark(ApplicaCons(

!! Spark(Streaming(
!! Common(PaAerns(in(Spark(Programming( Solving(Business(Problems((
!! Improving(Spark(Performance( with(Spark(
!! Spark,(Hadoop,(and(the(Enterprise(Data(Center(

!! Conclusion( Course(Conclusion(

©(Copyright(201072015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriAen(consent.( 08#2%
Caching(and(Persistence(

In%this%chapter%you%will%learn%
! How%Spark%uses%an%RDD’s%lineage%in%operaBons%
! How%to%persist%RDDs%to%improve%performance%

©(Copyright(201072015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriAen(consent.( 08#3%
Chapter(Topics(

Distributed%Data%Processing%%
Caching%and%Persistence%
with%Spark%

!! RDD%Lineage%
!! Caching(Overview(
!! Distributed(Persistence(
!! Conclusion(
!! Hands7On(Exercises(

©(Copyright(201072015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriAen(consent.( 08#4%
Lineage(Example((1)(
File:(purplecow.txt(
! Each%transforma)on%operaBon% I've never seen a purple cow.
creates%a%new%child%RDD% I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

©(Copyright(201072015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriAen(consent.( 08#5%
Lineage(Example((2)(
File:(purplecow.txt(
! Each%transforma)on%operaBon% I've never seen a purple cow.
creates%a%new%child%RDD% I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

MappedRDD[1]((mydata)(

>  mydata = sc.textFile("purplecow.txt")

©(Copyright(201072015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriAen(consent.( 08#6%
Lineage(Example((3)(
File:(purplecow.txt(
! Each%transforma)on%operaBon% I've never seen a purple cow.
creates%a%new%child%RDD% I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

MappedRDD[1]((mydata)(

>  mydata = sc.textFile("purplecow.txt")


>  myrdd = mydata.map(lambda s: s.upper())\
.filter(lambda s:s.startswith('I'))

MappedRDD[2](

FilteredRDD[3]:((myrdd)(

©(Copyright(201072015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriAen(consent.( 08#7%
Lineage(Example((4)(
File:(purplecow.txt(
! Spark%keeps%track%of%the%parent%RDD% I've never seen a purple cow.
for%each%new%RDD% I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.
! Child%RDDs%depend1on1their%parents%
MappedRDD[1]((mydata)(

>  mydata = sc.textFile("purplecow.txt")


>  myrdd = mydata.map(lambda s: s.upper())\
.filter(lambda s:s.startswith('I'))

MappedRDD[2](

FilteredRDD[3]:((myrdd)(

©(Copyright(201072015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriAen(consent.( 08#8%
Lineage(Example((5)(
File:(purplecow.txt(
! Ac)on%operaBons%execute%the% I've never seen a purple cow.
parent%transformaBons% I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

MappedRDD[1]((mydata)(
I've never seen a purple cow.
>  mydata = sc.textFile("purplecow.txt")
I never hope to see one;
>  myrdd = mydata.map(lambda s: s.upper())\
But I can tell you, anyhow,
.filter(lambda s:s.startswith('I'))
>  myrdd.count() I'd rather see than be one.

3 MappedRDD[2](
I'VE NEVER SEEN A PURPLE COW.
I NEVER HOPE TO SEE ONE;
BUT I CAN TELL YOU, ANYHOW,
I'D RATHER SEE THAN BE ONE.

FilteredRDD[3]:((myrdd)(
I'VE NEVER SEEN A PURPLE COW.
I NEVER HOPE TO SEE ONE;
I'D RATHER SEE THAN BE ONE.

©(Copyright(201072015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriAen(consent.( 08#9%
Lineage(Example((6)(
File:(purplecow.txt(
! Each%acBon%re#executes%the%lineage% I've never seen a purple cow.
transformaBons%starBng%with%the% I never hope to see one;
But I can tell you, anyhow,
base% I'd rather see than be one.

– By(default( MappedRDD[1]((mydata)(

>  mydata = sc.textFile("purplecow.txt")


>  myrdd = mydata.map(lambda s: s.upper())\
.filter(lambda s:s.startswith('I'))
>  myrdd.count()
3 MappedRDD[2](
>  myrdd.count()

FilteredRDD[3]:((myrdd)(

©(Copyright(201072015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriAen(consent.( 08#10%
Lineage(Example((7)(
File:(purplecow.txt(
! Each%acBon%re#executes%the%lineage% I've never seen a purple cow.
transformaBons%starBng%with%the% I never hope to see one;
But I can tell you, anyhow,
base% I'd rather see than be one.

– By(default( MappedRDD[1]((mydata)(
I've never seen a purple cow.
>  mydata = sc.textFile("purplecow.txt") I never hope to see one;
>  myrdd = mydata.map(lambda s: s.upper())\ But I can tell you, anyhow,
.filter(lambda s:s.startswith('I')) I'd rather see than be one.
>  myrdd.count()
3 MappedRDD[2](
I'VE NEVER SEEN A PURPLE COW.
>  myrdd.count()
3 I NEVER HOPE TO SEE ONE;
BUT I CAN TELL YOU, ANYHOW,
I'D RATHER SEE THAN BE ONE.

FilteredRDD[3]:((myrdd)(
I'VE NEVER SEEN A PURPLE COW.
I NEVER HOPE TO SEE ONE;
I'D RATHER SEE THAN BE ONE.

©(Copyright(201072015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriAen(consent.( 08#11%
Chapter(Topics(

Distributed%Data%Processing%%
Caching%and%Persistence%
with%Spark%

!! RDD(Lineage(
!! Caching%Overview%
!! Distributed(Persistence(
!! Conclusion(
!! Hands7On(Exercises(

©(Copyright(201072015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriAen(consent.( 08#12%
Caching(
File:(purplecow.txt(
! Caching%an%RDD%saves%the%data%in% I've never seen a purple cow.
memory% I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

©(Copyright(201072015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriAen(consent.( 08#13%
Caching(
File:(purplecow.txt(
! Caching%an%RDD%saves%the%data%in% I've never seen a purple cow.
memory% I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

RDD[1]((mydata)(
>  mydata = sc.textFile("purplecow.txt")
>  myrdd = mydata.map(lambda s:
s.upper())

RDD[2]((myrdd)(

©(Copyright(201072015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriAen(consent.( 08#14%
Caching(
File:(purplecow.txt(
! Caching%an%RDD%saves%the%data%in% I've never seen a purple cow.
memory% I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

RDD[1]((mydata)(
>  mydata = sc.textFile("purplecow.txt")
>  myrdd = mydata.map(lambda s:
s.upper())
>  myrdd.cache()
RDD[2]((myrdd)(

©(Copyright(201072015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriAen(consent.( 08#15%
Caching(
File:(purplecow.txt(
! Caching%an%RDD%saves%the%data%in% I've never seen a purple cow.
memory% I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

RDD[1]((mydata)(
>  mydata = sc.textFile("purplecow.txt")
>  myrdd = mydata.map(lambda s:
s.upper())
>  myrdd.cache()
>  myrdd2 = myrdd.filter(lambda \ RDD[2]((myrdd)(
s:s.startswith('I'))

RDD[3]((myrdd2)(

©(Copyright(201072015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriAen(consent.( 08#16%
Caching(
File:(purplecow.txt(
! Caching%an%RDD%saves%the%data%in% I've never seen a purple cow.
memory% I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

RDD[1]((mydata)(
>  mydata = sc.textFile("purplecow.txt") I've never seen a purple cow.

>  myrdd = mydata.map(lambda s: I never hope to see one;


But I can tell you, anyhow,
s.upper())
I'd rather see than be one.
>  myrdd.cache()
>  myrdd2 = myrdd.filter(lambda \ RDD[2]((myrdd)(
I'VE NEVER SEEN A PURPLE COW.
s:s.startswith('I'))
I NEVER HOPE TO SEE ONE;
>  myrdd2.count()
BUT I CAN TELL YOU, ANYHOW,
3
I'D RATHER SEE THAN BE ONE.

RDD[3]((myrdd2)(
I'VE NEVER SEEN A PURPLE COW.
I NEVER HOPE TO SEE ONE;
I'D RATHER SEE THAN BE ONE.

©(Copyright(201072015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriAen(consent.( 08#17%
Caching(
File:(purplecow.txt(
! Subsequent%operaBons%use%saved% I've never seen a purple cow.
data% I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

RDD[1]((mydata)(
>  mydata = sc.textFile("purplecow.txt")
>  myrdd = mydata.map(lambda s:
s.upper())
>  myrdd.cache()
>  myrdd2 = myrdd.filter(lambda \ RDD[2]((myrdd)(
I'VE NEVER SEEN A PURPLE COW.
s:s.startswith('I'))
I NEVER HOPE TO SEE ONE;
>  myrdd2.count()
BUT I CAN TELL YOU, ANYHOW,
3
I'D RATHER SEE THAN BE ONE.
>  myrdd2.count()
RDD[3]((myrdd2)(

©(Copyright(201072015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriAen(consent.( 08#18%
Caching(
File:(purplecow.txt(
! Subsequent%operaBons%use%saved% I've never seen a purple cow.
data% I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

RDD[1]((mydata)(
>  mydata = sc.textFile("purplecow.txt")
>  myrdd = mydata.map(lambda s:
s.upper())
>  myrdd.cache()
>  myrdd2 = myrdd.filter(lambda \ RDD[2]((myrdd)(
I'VE NEVER SEEN A PURPLE COW.
s:s.startswith('I'))
I NEVER HOPE TO SEE ONE;
>  myrdd2.count()
BUT I CAN TELL YOU, ANYHOW,
3
I'D RATHER SEE THAN BE ONE.
>  myrdd2.count()
3 RDD[3]((myrdd2)(
I'VE NEVER SEEN A PURPLE COW.
I NEVER HOPE TO SEE ONE;
I'D RATHER SEE THAN BE ONE.

©(Copyright(201072015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriAen(consent.( 08#19%
Caching(

! Caching%is%a%suggesBon%to%Spark%
– If(not(enough(memory(is(available,(transformaCons(will(be(re7executed(
when(needed(

©(Copyright(201072015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriAen(consent.( 08#20%
Chapter(Topics(

Distributed%Data%Processing%%
Caching%and%Persistence%
with%Spark%

!! RDD(Lineage(
!! Caching(Overview(
!! Distributed%Persistence(
!! Conclusion(
!! Hands7On(Exercises(

©(Copyright(201072015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriAen(consent.( 08#21%
Caching(and(Fault7Tolerance(

! RDD%=%Resilient1Distributed%Dataset%
– Resiliency(is(a(product(of(tracking(lineage(
– RDDs(can(always(be(recomputed(from(their(base(if(needed(

©(Copyright(201072015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriAen(consent.( 08#22%
Distributed(Cache(

! RDD%parBBons%are%distributed%across%a%cluster%
! Cached%parBBons%are%stored%in%memory%in%Executor%JVMs%
[Diagram] Cached partitions rdd_1_0 and rdd_1_1 are held in the
memory of two executor JVMs.

©(Copyright(201072015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriAen(consent.( 08#23%
RDD(Fault7Tolerance((1)(

! What%happens%if%a%cached%parBBon%becomes%unavailable?%

[Diagram] The executor holding cached partition rdd_1_1 is lost;
rdd_1_0 remains cached on another executor.

©(Copyright(201072015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriAen(consent.( 08#24%
RDD(Fault7Tolerance((2)(

! The%SparkMaster%starts%a%new%task%to%recompute%the%parBBon%on%a%
different%node%%

[Diagram] A new task on a different executor recomputes and
re-caches partition rdd_1_1.

©(Copyright(201072015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriAen(consent.( 08#25%
Persistence(Levels((1)(

! The%cache%method%stores%data%in%memory%only%
! The%persist%method%offers%other%opBons%called%Storage%Levels%
! Storage%locaBon%–%where%is%the%data%stored?%
– MEMORY_ONLY((default)(–(same(as(cache(
– MEMORY_AND_DISK(–(Store(parCCons(on(disk(if(they(do(not(fit(in(
memory((
– Called(spilling(
– DISK_ONLY(–(Store(all(parCCons(on(disk(
! ReplicaBon%–%store%parBBons%on%two%nodes%
– MEMORY_ONLY_2,(MEMORY_AND_DISK_2,(etc.(

©(Copyright(201072015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriAen(consent.( 08#26%
Persistence(Levels((2)(

! SerializaBon%–%you%can%choose%to%serialize%the%data%in%memory%
– MEMORY_ONLY_SER(and(MEMORY_AND_DISK_SER
– Much(more(space(efficient(
– Less(Cme(efficient(
– Choose(a(fast(serializaCon(library((covered(later)(

>  from pyspark import StorageLevel


Python(
>  myrdd.persist(StorageLevel.DISK_ONLY)

>  import org.apache.spark.storage.StorageLevel


Scala(
>  myrdd.persist(StorageLevel.DISK_ONLY)

©(Copyright(201072015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriAen(consent.( 08#27%
Changing(Persistence(OpCons(

! To%stop%persisBng%and%remove%from%memory%and%disk%
– rdd.unpersist()
! To%change%an%RDD%to%a%different%persistence%level%
– Unpersist(first(
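A minimal sketch of switching levels:

> from pyspark import StorageLevel
> myrdd.persist(StorageLevel.MEMORY_ONLY)
> myrdd.unpersist()                             # remove from memory and disk
> myrdd.persist(StorageLevel.MEMORY_AND_DISK)   # re-persist at the new level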

©(Copyright(201072015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriAen(consent.( 08#28%
Distributed(Disk(Persistence((1)(

! Disk#persisted%parBBons%are%stored%in%local%files%

[Diagram] Persisted partitions (rdd_0, rdd_1) are written to local
files (e.g., part1) on their worker nodes.

©(Copyright(201072015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriAen(consent.( 08#29%
Distributed(Disk(Persistence((2)(

! Data%on%disk%will%be%used%to%recreate%the%parBBon%if%possible%
– Will(be(recomputed(if(the(data(is(unavailable((
– e.g.,(the(node(is(down(
[Diagram] After a failure, the local file part1 is read to recreate
partition rdd_1 on another executor where possible.

©(Copyright(201072015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriAen(consent.( 08#30%
ReplicaCon(

! Persistence%replicaBon%makes%recomputaBon%less%likely%to%be%necessary%%

[Diagram] With a replicated storage level, the persisted partition
file (part1) is stored on two different worker nodes.

©(Copyright(201072015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriAen(consent.( 08#31%
When(and(Where(to(Cache(

! When%should%you%cache%a%dataset?%
– When(a(dataset(is(likely(to(be(re7used(
– e.g.,(iteraCve(algorithms,(machine(learning(
! How%to%choose%a%persistence%level%
– Memory(only(–(when(possible,(best(performance(
– Save(space(by(saving(as(serialized(objects(in(memory(if(necessary(
– Disk(–(choose(when(recomputaCon(is(more(expensive(than(disk(read(
– e.g.,(expensive(funcCons(or(filtering(large(datasets(
– ReplicaCon(–(choose(when(recomputaCon(is(more(expensive(than(
memory(
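For instance, a minimal sketch of caching a dataset that an
iterative job re-reads (the filter logic here is purely
hypothetical):

> data = sc.textFile("mydata")
> data.cache()
> for i in xrange(10):
      # each pass after the first reads the cached partitions,
      # not the file
      n = data.filter(lambda line: str(i) in line).count()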

©(Copyright(201072015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriAen(consent.( 08#32%
CheckpoinCng((1)(

! Maintaining%RDD%lineage%provides%resilience%but%can%also%cause%problems%
when%the%lineage%gets%very%long%
– e.g., iterative algorithms, streaming
! Recovery can be very expensive
! Potential stack overflow

myrdd = …initial value….
for x in xrange(100):
    myrdd = myrdd.transform(…)
myrdd.saveAsTextFile()

[Diagram] The lineage grows with each pass (Iter1, Iter2, Iter3,
Iter4, … Iter100), every iteration depending on all of the previous
ones.

©(Copyright(201072015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriAen(consent.( 08#33%
CheckpoinCng((2)(

! CheckpoinBng%saves%the%data%to%HDFS%%
– Provides(fault7tolerant(storage(across(nodes((
! Lineage is not saved
! Must be checkpointed before any actions on the RDD

sc.setCheckpointDir(directory)
myrdd = …initial value….
for x in xrange(100):
    myrdd = myrdd.transform(…)
    if x % 3 == 0:
        myrdd.checkpoint()
        myrdd.count()
myrdd.saveAsTextFile()

[Diagram] Checkpointed data is written to HDFS; later iterations
(Iter3, Iter4, … Iter100) build on the checkpoint rather than on
the full lineage.

©(Copyright(201072015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriAen(consent.( 08#34%
Chapter(Topics(

Distributed%Data%Processing%%
Caching%and%Persistence%
with%Spark%

!! RDD(Lineage(
!! Caching(Overview(
!! Distributed(Persistence(
!! Conclusion%
!! Hands7On(Exercises(

©(Copyright(201072015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriAen(consent.( 08#35%
Key Points

! Spark keeps track of each RDD’s lineage
– Provides fault tolerance
! By default, every RDD operation executes the entire lineage
! If an RDD will be used multiple times, persist it to avoid re-computation
! Persistence options
– Caching (memory only) – will re-compute what doesn’t fit in memory
– Disk – will spill to local disk what doesn’t fit in memory
– Replication – will save cached data on multiple nodes in case a node
goes down, for job recovery without recomputation
– Serialization – in-memory caching can be serialized to save memory (but
at the cost of performance)
– Checkpointing – saves to HDFS, removes lineage

©(Copyright(201072015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriAen(consent.( 08#36%
Chapter(Topics(

Distributed%Data%Processing%%
Caching%and%Persistence%
with%Spark%

!! RDD(Lineage(
!! Caching(Overview(
!! Distributed(Persistence(
!! Conclusion(
!! Hands#On%Exercises%

©(Copyright(201072015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriAen(consent.( 08#37%
Hands7On(Exercises(

! Hands-On Exercise: Caching RDDs
– Compare performance with a cached and uncached RDD
– Use the Spark Application UI to see how an RDD is cached
! Hands-On Exercise: Checkpointing RDDs
– View the lineage of an iterative RDD
– Increase iterations until a stack overflow error occurs
– Checkpoint the RDD to avoid long lineage issues
! Please%refer%to%the%Hands#On%Exercise%Manual%

©(Copyright(201072015(Cloudera.(All(rights(reserved.(Not(to(be(reproduced(without(prior(wriAen(consent.( 08#38%
Wri$ng'Spark'Applica$ons'
Chapter'9'
Course'Chapters'
!! Introduc$on' Course'Introduc$on'
!! Why Spark?
!! Spark'Basics' Introduc$on'to'Spark'
!! Working'With'RDDs'

!! The'Hadoop'Distributed'File'System'(HDFS)'
!! Running'Spark'on'a'Cluster'
Distributed%Data%Processing%%
!! Parallel'Programming'with'Spark'
with%Spark%
!! Caching'and'Persistence'
!! Wri;ng%Spark%Applica;ons%

!! Spark'Streaming'
!! Common'PaDerns'in'Spark'Programming' Solving'Business'Problems''
!! Improving'Spark'Performance' with'Spark'
!! Spark,'Hadoop,'and'the'Enterprise'Data'Center'

!! Conclusion' Course'Conclusion'

©'Copyright'2010;2015'Cloudera.'All'rights'reserved.'Not'to'be'reproduced'without'prior'wriDen'consent.' 09#2%
Wri$ng'a'Spark'Applica$on'

In%this%chapter%you%will%learn%
! How%to%write,%build,%configure,%and%run%Spark%applica;ons%

©'Copyright'2010;2015'Cloudera.'All'rights'reserved.'Not'to'be'reproduced'without'prior'wriDen'consent.' 09#3%
Chapter'Topics'

Distributed%Data%Processing%
Wri;ng%a%Spark%Applica;on%
with%Spark%

!! Spark%Applica;ons%vs.%Spark%Shell%
!! Crea$ng'the'SparkContext'
!! Building'a'Spark'Applica$on'(Scala'and'Java)'
!! Running'a'Spark'Applica$on'
!! Hands;On'Exercise:'Wri$ng'and'Running'a'Spark'Applica$on'
!! Configuring'Spark'Proper$es'
!! Logging'
!! Conclusion'
!! Hands;On'Exercise:'Se[ng'Log'Levels'

©'Copyright'2010;2015'Cloudera.'All'rights'reserved.'Not'to'be'reproduced'without'prior'wriDen'consent.' 09#4%
Spark'Shell'vs.'Spark'Applica$ons'

! The%Spark%Shell%allows%interac;ve%explora;on%and%manipula;on%of%data%
– REPL'using'Python'or'Scala'
! Spark%applica;ons%run%as%independent%programs%
– Python,'Scala,'or'Java'
– e.g.,'ETL'processing,'Streaming,'and'so'on'

©'Copyright'2010;2015'Cloudera.'All'rights'reserved.'Not'to'be'reproduced'without'prior'wriDen'consent.' 09#5%
Chapter'Topics'

Distributed%Data%Processing%
Wri;ng%a%Spark%Applica;on%
with%Spark%

!! Spark'Applica$ons'vs.'Spark'Shell'
!! Crea;ng%the%SparkContext%
!! Building'a'Spark'Applica$on'(Scala'and'Java)'
!! Running'a'Spark'Applica$on'
!! Hands;On'Exercise:'Wri$ng'and'Running'a'Spark'Applica$on'
!! Configuring'Spark'Proper$es'
!! Logging'
!! Conclusion'
!! Hands;On'Exercise:'Se[ng'Log'Levels'

©'Copyright'2010;2015'Cloudera.'All'rights'reserved.'Not'to'be'reproduced'without'prior'wriDen'consent.' 09#6%
The'SparkContext'

! Every%Spark%program%needs%a%SparkContext%
– The'interac$ve'shell'creates'one'for'you'
– You'create'your'own'in'a'Spark'applica$on'
– Named'sc'by'conven$on'

©'Copyright'2010;2015'Cloudera.'All'rights'reserved.'Not'to'be'reproduced'without'prior'wriDen'consent.' 09#7%
Python'Example:'WordCount'

import sys
from pyspark import SparkContext

if __name__ == "__main__":
if len(sys.argv) < 2:
print >> sys.stderr, "Usage: WordCount <file>"
exit(-1)

sc = SparkContext()

counts = sc.textFile(sys.argv[1]) \
.flatMap(lambda line: line.split()) \
.map(lambda word: (word,1)) \
.reduceByKey(lambda v1,v2: v1+v2)

for pair in counts.take(5): print pair

©'Copyright'2010;2015'Cloudera.'All'rights'reserved.'Not'to'be'reproduced'without'prior'wriDen'consent.' 09#8%
Scala'Example:'WordCount'

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object WordCount {
def main(args: Array[String]) {
if (args.length < 1) {
System.err.println("Usage: WordCount <file>")
System.exit(1)
}

val sc = new SparkContext()

val counts = sc.textFile(args(0)).


flatMap(line => line.split("\\W")).
map(word => (word,1)).
reduceByKey(_ + _)

counts.take(5).foreach(println)
}
}

©'Copyright'2010;2015'Cloudera.'All'rights'reserved.'Not'to'be'reproduced'without'prior'wriDen'consent.' 09#9%
Chapter'Topics'

Distributed%Data%Processing%
Wri;ng%Spark%Applica;ons%
with%Spark%

!! Spark'Applica$ons'vs.'Spark'Shell'
!! Crea$ng'the'SparkContext'
!! Building%a%Spark%Applica;on%(Scala%and%Java)'
!! Running'a'Spark'Applica$on'
!! Hands;On'Exercise:'Wri$ng'and'Running'a'Spark'Applica$on'
!! Configuring'Spark'Proper$es'
!! Logging'
!! Conclusion'
!! Hands;On'Exercise:'Se[ng'Log'Levels'

©'Copyright'2010;2015'Cloudera.'All'rights'reserved.'Not'to'be'reproduced'without'prior'wriDen'consent.' 09#10%
Building'a'Spark'Applica$on:'Scala'or'Java'

! Scala%or%Java%Spark%applica;ons%must%be%compiled%and%assembled%into%JAR%
files%
– JAR'file'will'be'passed'to'worker'nodes'
! Most%developers%use%Apache%Maven%to%build%their%applica;ons%
– For'specific'se[ng'recommenda$ons,'see''
http://spark.apache.org/docs/latest/building-
with-maven.html
! Build%details%will%differ%depending%on%
– Version'of'Hadoop'(HDFS)'
– Deployment'pla^orm'(Spark'Standalone,'YARN,'Mesos)'
! Consider%using%an%IDE%
– IntelliJ'appears'to'be'the'most'popular'among'Spark'developers'
%

©'Copyright'2010;2015'Cloudera.'All'rights'reserved.'Not'to'be'reproduced'without'prior'wriDen'consent.' 09#11%
Chapter'Topics'

Distributed%Data%Processing%
Wri;ng%Spark%Applica;ons%
with%Spark%

!! Spark'Applica$ons'vs.'Spark'Shell'
!! Crea$ng'the'SparkContext'
!! Building'a'Spark'Applica$on'(Scala'and'Java)'
!! Running%a%Spark%Applica;on'
!! Hands;On'Exercise:'Wri$ng'and'Running'a'Spark'Applica$on'
!! Configuring'Spark'Proper$es'
!! Logging'
!! Conclusion'
!! Hands;On'Exercise:'Se[ng'Log'Levels'

©'Copyright'2010;2015'Cloudera.'All'rights'reserved.'Not'to'be'reproduced'without'prior'wriDen'consent.' 09#12%
Running'a'Spark'Applica$on'(1)'

! The%easiest%way%to%run%a%Spark%Applica;on%is%using%the%spark-submit
script%
Python' $ spark-submit WordCount.py fileURL

Scala/ $ spark-submit --class WordCount \


Java' MyJarFile.jar fileURL

©'Copyright'2010;2015'Cloudera.'All'rights'reserved.'Not'to'be'reproduced'without'prior'wriDen'consent.' 09#13%
Running'a'Spark'Applica$on'(2)'

! Some key spark-submit options
--help – explain available options
--master – equivalent to MASTER environment variable for Spark Shell
– local[*] – run locally with as many threads as cores (default)
– local[n] – run locally with n threads
– local – run locally with a single thread
– master URL, e.g., spark://masternode:7077
--deploy-mode – either client or cluster
--name – application name to display in the UI (default is the Scala/Java
class or Python program name)
--jars – additional JAR files (Scala and Java only)
--py-files – additional Python files (Python only)
--driver-java-options – parameters to pass to the driver JVM
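For example, a hypothetical submission combining several of these options (the master host, class, JAR, and input path are placeholders drawn from the exercises):

$ spark-submit --master spark://masternode:7077 \
    --name "Count JPGs" \
    --class stubs.CountJPGs \
    target/countjpgs-1.0.jar weblogs/*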

©'Copyright'2010;2015'Cloudera.'All'rights'reserved.'Not'to'be'reproduced'without'prior'wriDen'consent.' 09#14%
Chapter'Topics'

Distributed%Data%Processing%
Wri;ng%Spark%Applica;ons%
with%Spark%

!! Spark'Applica$ons'vs.'Spark'Shell'
!! Crea$ng'the'SparkContext'
!! Building'a'Spark'Applica$on'(Scala'and'Java)'
!! Running'a'Spark'Applica$on'
!! Hands#On%Exercise:%Wri;ng%and%Running%a%Spark%Applica;on%
!! Configuring'Spark'Proper$es'
!! Logging'
!! Conclusion'
!! Hands;On'Exercise:'Se[ng'Log'Levels'

©'Copyright'2010;2015'Cloudera.'All'rights'reserved.'Not'to'be'reproduced'without'prior'wriDen'consent.' 09#15%
Building and Running Scala Applications in the
Hands-On Exercises

! Basic Maven projects are provided in the exercises/projects
directory with two packages
– stubs – starter Scala file, do exercises here
– solution – final exercise solution

$ mvn package

$ spark-submit \
    --class stubs.CountJPGs \
    target/countjpgs-1.0.jar \
    weblogs/*

Project Directory Structure
+countjpgs
  -pom.xml
  +src
    +main
      +scala
        +solution
          -CountJPGs.scala
        +stubs
          -CountJPGs.scala
  +target
    -countjpgs-1.0.jar

©'Copyright'2010;2015'Cloudera.'All'rights'reserved.'Not'to'be'reproduced'without'prior'wriDen'consent.' 09#16%
Hands;On'Exercise:'Wri$ng'and'Running'a'Spark'Applica$on'

! Hands#On%Exercise:%Wri$ng'and'Running'a'Spark'Applica$on'
– Write'and'run'a'Spark'applica$on'to'count'JPG'requests'in'a'web'server'
log'
! Please%refer%to%the%Hands#On%Exercise%Manual%

©'Copyright'2010;2015'Cloudera.'All'rights'reserved.'Not'to'be'reproduced'without'prior'wriDen'consent.' 09#17%
Chapter'Topics'

Distributed%Data%Processing%
Wri;ng%Spark%Applica;ons%
with%Spark%

!! Spark'Applica$ons'vs.'Spark'Shell'
!! Crea$ng'the'SparkContext'
!! Building'a'Spark'Applica$on'(Scala'and'Java)'
!! Running'a'Spark'Applica$on'
!! Hands;On'Exercise:'Wri$ng'and'Running'a'Spark'Applica$on'
!! Configuring%Spark%Proper;es%
!! Logging'
!! Conclusion'
!! Hands;On'Exercise:'Se[ng'Log'Levels'

©'Copyright'2010;2015'Cloudera.'All'rights'reserved.'Not'to'be'reproduced'without'prior'wriDen'consent.' 09#18%
Spark'Applica$on'Configura$on'

! Spark%provides%numerous%proper;es%for%configuring%your%applica;on%
! Some%example%proper;es%
– spark.master'
– spark.app.name'
– spark.local.dir'–'where'to'store'local'files'such'as'shuffle'output'
(default'/tmp)'
– spark.ui.port'–'port'to'run'the'Spark'Applica$on'UI'(default'
4040)'
– spark.executor.memory'–'how'much'memory'to'allocate'to'each'
Executor'(default'512m)'
! Most%are%more%interes;ng%to%system%administrators%than%developers%

©'Copyright'2010;2015'Cloudera.'All'rights'reserved.'Not'to'be'reproduced'without'prior'wriDen'consent.' 09#19%
Spark'Applica$on'Configura$on'

! Spark%Applica;ons%can%be%configured%
– Via'the'command'line'when'the'program'is'run'
– Programma$cally,'using'the'API'

©'Copyright'2010;2015'Cloudera.'All'rights'reserved.'Not'to'be'reproduced'without'prior'wriDen'consent.' 09#20%
Run;$me'Configura$on'Op$ons'

!  spark-submit script%
– e.g.,'spark-submit --master spark://masternode:7077'
! Proper;es%file%
– Tab;'or'space;separated'list'of'proper$es'and'values'
– Load'with'spark-submit --properties-file filename
– Example:'
spark.master spark://masternode:7077
spark.local.dir /tmp
% spark.ui.port 4444

! Site%defaults%proper;es%file%
– $SPARK_HOME/conf/spark-defaults.conf
– Template'file'provided'

©'Copyright'2010;2015'Cloudera.'All'rights'reserved.'Not'to'be'reproduced'without'prior'wriDen'consent.' 09#21%
Se[ng'Configura$on'Proper$es'Programma$cally'

! Spark%configura;on%se\ngs%are%part%of%the%SparkContext%
! Configure%using%a%SparkConf%object%
! Some%example%func;ons%
– setAppName(name)
– setMaster(master)
– set(property-name, value)
!  set%func;ons%return%a%SparkConf%object%to%support%chaining%

©'Copyright'2010;2015'Cloudera.'All'rights'reserved.'Not'to'be'reproduced'without'prior'wriDen'consent.' 09#22%
SparkConf'Example'(Python)'

import sys
from pyspark import SparkContext
from pyspark import SparkConf

if __name__ == "__main__":
if len(sys.argv) < 2:
print >> sys.stderr, "Usage: WordCount <file>"
exit(-1)

sconf = SparkConf() \
.setAppName("Word Count") \
.set("spark.ui.port","4141")
sc = SparkContext(conf=sconf)

counts = sc.textFile(sys.argv[1]) \
.flatMap(lambda line: line.split()) \
.map(lambda w: (w,1)) \
.reduceByKey(lambda v1,v2: v1+v2)

for pair in counts.take(5): print pair

©'Copyright'2010;2015'Cloudera.'All'rights'reserved.'Not'to'be'reproduced'without'prior'wriDen'consent.' 09#23%
SparkConf'Example'(Scala)'

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object WordCount {
def main(args: Array[String]) {
if (args.length < 1) {
System.err.println("Usage: WordCount <file>")
System.exit(1)
}

val sconf = new SparkConf().


setAppName("Word Count").
set("spark.ui.port","4141")
val sc = new SparkContext(sconf)

val counts = sc.textFile(args(0)).


flatMap(line => line.split("\\W")).
map(word => (word,1)).
reduceByKey(_ + _)
counts.take(5).foreach(println)
}
}

©'Copyright'2010;2015'Cloudera.'All'rights'reserved.'Not'to'be'reproduced'without'prior'wriDen'consent.' 09#24%
Viewing'Spark'Proper$es'

! You%%can%view%the%Spark%
property%se\ng%in%the%
Spark%Applica;on%UI%

©'Copyright'2010;2015'Cloudera.'All'rights'reserved.'Not'to'be'reproduced'without'prior'wriDen'consent.' 09#25%
Chapter'Topics'

Distributed%Data%Processing%
Wri;ng%Spark%Applica;ons%
with%Spark%

!! Spark'Applica$ons'vs.'Spark'Shell'
!! Crea$ng'the'SparkContext'
!! Building'a'Spark'Applica$on'(Scala'and'Java)'
!! Running'a'Spark'Applica$on'
!! Hands;On'Exercise:'Wri$ng'and'Running'a'Spark'Applica$on'
!! Configuring'Spark'Proper$es'
!! Logging%
!! Conclusion'
!! Hands;On'Exercise:'Se[ng'Log'Levels'

©'Copyright'2010;2015'Cloudera.'All'rights'reserved.'Not'to'be'reproduced'without'prior'wriDen'consent.' 09#26%
Spark'Logging'

! Spark%uses%Apache%Log4j%for%logging%
– Allows'for'controlling'logging'at'run$me'using'a'proper$es'file'
– Enable'or'disable'logging,'set'logging'levels,'select'output'
des$na$on'
– For'more'info'see'http://logging.apache.org/log4j/1.2/
! Log4j%provides%several%logging%levels%
– Fatal'
– Error'
– Warn'
– Info'
– Debug'
– Trace'
– Off'

©'Copyright'2010;2015'Cloudera.'All'rights'reserved.'Not'to'be'reproduced'without'prior'wriDen'consent.' 09#27%
Spark'Log'Files'

! Log%file%loca;ons%depend%on%your%cluster%management%pla`orm%
! Spark%Standalone%defaults:%
– Spark'daemons:'/var/log/spark'
– Individual'tasks:'$SPARK_HOME/work'on'each'worker'node'

'

©'Copyright'2010;2015'Cloudera.'All'rights'reserved.'Not'to'be'reproduced'without'prior'wriDen'consent.' 09#28%
Spark'Worker'UI'–'Log'File'Access'

©'Copyright'2010;2015'Cloudera.'All'rights'reserved.'Not'to'be'reproduced'without'prior'wriDen'consent.' 09#29%
Configuring'Spark'Logging'(1)'

! Logging%levels%can%be%set%for%the%cluster,%for%individual%applica;ons,%or%
even%for%specific%components%or%subsystems%
! Default%for%machine:%$SPARK_HOME/conf/log4j.properties
– Start'by'copying'log4j.properties.template

log4j.proper$es.template'
# Set everything to be logged to the console
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err

©'Copyright'2010;2015'Cloudera.'All'rights'reserved.'Not'to'be'reproduced'without'prior'wriDen'consent.' 09#30%
Configuring'Spark'Logging'(2)'

! Spark%will%use%the%first%log4j.properties%file%it%finds%in%the%Java%
classpath%
! Spark%Shell%will%read%log4j.properties%from%the%current%directory%
– Copy'log4j.properties'to'the'working'directory'and'edit'

…my#working#directory/log4j.proper$es'
# Set everything to be logged to the console
log4j.rootCategory=DEBUG, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
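Levels can also be narrowed to specific components or subsystems with standard Log4j logger entries; for example (the chosen packages and levels here are illustrative, not a course requirement):

# quiet most Spark internals, but debug the storage subsystem
log4j.logger.org.apache.spark=WARN
log4j.logger.org.apache.spark.storage=DEBUG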

©'Copyright'2010;2015'Cloudera.'All'rights'reserved.'Not'to'be'reproduced'without'prior'wriDen'consent.' 09#31%
Chapter'Topics'

Distributed%Data%Processing%
Wri;ng%Spark%Applica;ons%
with%Spark%

!! Spark'Applica$ons'vs.'Spark'Shell'
!! Crea$ng'the'SparkContext'
!! Building'a'Spark'Applica$on'(Scala'and'Java)'
!! Running'a'Spark'Applica$on'
!! Hands;On'Exercise:'Wri$ng'and'Running'a'Spark'Applica$on'
!! Configuring'Spark'Proper$es'
!! Logging'
!! Conclusion%
!! Hands;On'Exercise:'Se[ng'Log'Levels'

©'Copyright'2010;2015'Cloudera.'All'rights'reserved.'Not'to'be'reproduced'without'prior'wriDen'consent.' 09#32%
Key'Points'

! Use%the%Spark%Shell%applica;on%for%interac;ve%data%explora;on%
! Write%a%Spark%applica;on%to%run%independently%
! Spark%applica;ons%require%a%Spark%Context%object%
! Spark%applica;ons%are%run%using%the%spark-submit script%
! Spark%configura;on%parameters%can%be%set%at%run;me%using%the%%
spark-submit%script%or%programma;cally%using%a%SparkConf%object%
! Spark%uses%log4j%for%logging%
– Configure'using'a'log4j.properties'file'

©'Copyright'2010;2015'Cloudera.'All'rights'reserved.'Not'to'be'reproduced'without'prior'wriDen'consent.' 09#33%
Chapter'Topics'

Distributed%Data%Processing%
Wri;ng%Spark%Applica;ons%
with%Spark%

!! Spark'Applica$ons'vs.'Spark'Shell'
!! Crea$ng'the'SparkContext'
!! Building'a'Spark'Applica$on'(Scala'and'Java)'
!! Running'a'Spark'Applica$on'
!! Hands;On'Exercise:'Wri$ng'and'Running'a'Spark'Applica$on'
!! Configuring'Spark'Proper$es'
!! Logging'
!! Conclusion'
!! Hands#On%Exercise:%Se\ng%Log%Levels%

©'Copyright'2010;2015'Cloudera.'All'rights'reserved.'Not'to'be'reproduced'without'prior'wriDen'consent.' 09#34%
Hands;On'Exercise:'Configuring'Spark'Applica$ons'

! Hands#On%Exercise:%Configuring%Spark%Applica;ons%
– Set'proper$es'using'spark-submit
– Set'proper$es'in'a'proper$es'file'
– Set'proper$es'programma$cally'using'SparkConf
– Change'the'logging'levels'in'a'log4j.properties'file'
! Please%refer%to%the%Hands#On%Exercise%Manual%

©'Copyright'2010;2015'Cloudera.'All'rights'reserved.'Not'to'be'reproduced'without'prior'wriDen'consent.' 09#35%
Spark&Streaming&
Chapter&10&
Course&Chapters&
!! IntroducDon& Course&IntroducDon&
!! Why&Spark?&
!! Spark&Basics& IntroducDon&to&Spark&
!! Working&With&RDDs&

!! The&Hadoop&Distributed&File&System&(HDFS)&
!! Running&Spark&on&a&Cluster&
Distributed&Data&Processing&&
!! Parallel&Programming&with&Spark&
with&Spark&
!! Caching&and&Persistence&
!! WriDng&Spark&ApplicaDons&

!! Spark%Streaming%
!! Common&PaBerns&in&Spark&Programming& Solving%Business%Problems%%
!! Improving&Spark&Performance& with%Spark%
!! Spark,&Hadoop,&and&the&Enterprise&Data&Center&

!! Conclusion& Course&Conclusion&

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#2%
Spark&Streaming&

In%this%chapter%you%will%learn%
! What%Spark%Streaming%is,%and%why%it%is%valuable%
! How%to%use%Spark%Streaming%
! How%to%work%with%Sliding%Window%operaCons%

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#3%
Chapter&Topics&

Solving%Business%Problems%%
Spark%Streaming%
with%Spark%

!! Spark%Streaming%Overview&
!! Example:&Streaming&Request&Count&
!! DStreams&
!! Hands5On&Exercise:&Exploring&Spark&Streaming&
!! State&OperaDons&
!! Sliding&Window&OperaDons&
!! Developing&Spark&Streaming&ApplicaDons&
!! Conclusion&
!! Hands5On&Exercise:&WriDng&a&Spark&Streaming&ApplicaDon&

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#4%
What&is&Spark&Streaming?&

! Spark%Streaming%provides%real#Cme%processing%of%stream%data%
! An%extension%of%core%Spark%
! Supports%Scala%and%Java%
– Most&recent&version&of&Spark&also&supports&Python&

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#5%
Why&Spark&Streaming?&

! Many%big#data%applicaCons%need%to%process%large%data%streams%in%real%
Cme%
– Website&monitoring&
– Fraud&detecDon&
– Ad&moneDzaDon&
– Etc.&

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#6%
Spark&Streaming&Features&

! Second#scale%latencies%
! Scalability%and%efficient%fault%tolerance%
! “Once%and%only%once”%processing%
! Integrates%batch%and%real#Cme%processing%
! Easy%to%develop%
– Uses&Spark’s&high&level&API&

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#7%
Spark&Streaming&Overview&

! Divide%up%data%stream%into%batches%of%n%seconds%%
! Process%each%batch%in%Spark%as%an%RDD%
! Return%results%of%RDD%operaCons%in%batches%
Live&Data&Stream&
…1001101001000111000011100010…&
Spark&Streaming&

DStream&–&RDDs&(batches&of&&
n&seconds)&

Spark&

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#8%
Chapter&Topics&

Solving%Business%Problems%%
Spark%Streaming%
with%Spark%

!! Spark&Streaming&Overview&
!! Example:%Streaming%Request%Count%
!! DStreams&
!! Hands5On&Exercise:&Exploring&Spark&Streaming&
!! State&OperaDons&
!! Sliding&Window&OperaDons&
!! Developing&Spark&Streaming&ApplicaDons&
!! Conclusion&
!! Hands5On&Exercise:&WriDng&a&Spark&Streaming&ApplicaDon&

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#9%
Streaming&Example:&Streaming&Request&Count&

object StreamingRequestCount {

def main(args: Array[String]) {

val ssc = new StreamingContext(new SparkConf(),Seconds(2))


val mystream = ssc.socketTextStream(hostname, port)
val userreqs = mystream.
map(line => (line.split(" ")(2),1)).
reduceByKey((x,y) => x+y)

userreqs.print()

ssc.start()
ssc.awaitTermination()
}
}

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#10%
Streaming&Example:&Configuring&StreamingContext&

object StreamingRequestCount {

def main(args: Array[String]) {

val ssc = new StreamingContext(new SparkConf(),Seconds(2))


val mystream = ssc.socketTextStream(hostname, port)
    val userreqs = mystream.
      map(line => (line.split(" ")(2),1)).
      reduceByKey((x,y) => x+y)

    userreqs.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

!! A StreamingContext is the main entry point for Spark Streaming apps
!! Equivalent to SparkContext in core Spark
!! Configured with the same parameters as a SparkContext, plus the batch
duration – an instance of Milliseconds, Seconds, or Minutes
!! Named ssc by convention

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#11%
Streaming&Example:&CreaDng&a&DStream&

object StreamingRequestCount {

def main(args: Array[String]) {

val ssc = new StreamingContext(new SparkConf(),Seconds(2))


val logs = ssc.socketTextStream(hostname, port)
    val userreqs = logs.
      map(line => (line.split(" ")(2),1)).
      reduceByKey((x,y) => x+y)

    userreqs.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

!! Get a DStream (“Discretized Stream”) from a streaming data source,
e.g., text from a socket

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#12%
Streaming&Example:&DStream&TransformaDons&

object StreamingRequestCount {

def main(args: Array[String]) {

val ssc = new StreamingContext(new SparkConf(),Seconds(2))


val logs = ssc.socketTextStream(hostname, port)
    val userreqs = logs.
      map(line => (line.split(" ")(2),1)).
      reduceByKey((x,y) => x+y)

    userreqs.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

!! DStream operations are applied to each batch RDD in the stream
!  Similar to RDD operations – filter, map, reduce, join, etc.

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#13%
Streaming&Example:&DStream&Result&Output&

object StreamingRequestCount {

def main(args: Array[String]) {

val ssc = new StreamingContext(new SparkConf(),Seconds(2))


val logs = ssc.socketTextStream(hostname, port)
val userreqs = logs.
map(line => (line.split(" ")(2),1)).
reduceByKey((x,y) => x+y)

    userreqs.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

!! Print out the first 10 elements of each RDD

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#14%
Streaming&Example:&StarDng&the&Streams&

object StreamingRequestCount {

def main(args: Array[String]) {

val ssc = new StreamingContext(new SparkConf(),Seconds(2))


val logs = ssc.socketTextStream(hostname, port)
    val userreqs = logs.
      map(line => (line.split(" ")(2),1)).
      reduceByKey((x,y) => x+y)

    userreqs.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

!  start: starts the execution of all DStreams
!  awaitTermination: waits for all background threads to complete
before ending the main thread

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#15%
Streaming&Example:&Streaming&Request&Count&(Recap)&

object StreamingRequestCount {

def main(args: Array[String]) {

val ssc = new StreamingContext(new SparkConf(),Seconds(2))


val logs= ssc.socketTextStream(hostname, port)
val userreqs = logs.
map(line => (line.split(" ")(2),1)).
reduceByKey((x,y) => x+y)

userreqs.print()

ssc.start()
ssc.awaitTermination()
}
}

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#16%
Streaming&Example&Output&

-------------------------------------------
Time: 1401219545000 ms              Starts 2 seconds after ssc.start
-------------------------------------------
(23713,2)
(53,2)
(24433,2)
(127,2)
(93,2)
...

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#17%
Streaming&Example&Output&

-------------------------------------------
Time: 1401219545000 ms
-------------------------------------------
(23713,2)
(53,2)
(24433,2)
(127,2)
(93,2)
...
-------------------------------------------         2 seconds later…
Time: 1401219547000 ms
-------------------------------------------
(42400,2)
(24996,2)
(97464,2)
(161,2)
(6011,2)

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#18%
Streaming&Example&Output&

-------------------------------------------
Time: 1401219545000 ms
-------------------------------------------
(23713,2)
(53,2)
(24433,2)
(127,2)
(93,2)
...
-------------------------------------------
Time: 1401219547000 ms
-------------------------------------------
(42400,2)
(24996,2)
(97464,2)
(161,2)
(6011,2)

-------------------------------------------         2 seconds later…
Time: 1401219549000 ms
-------------------------------------------
(44390,2)
(48712,2)
(165,2)
(465,2)
(120,2)

Continues until termination…

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#19%
Chapter&Topics&

Solving%Business%Problems%%
Spark%Streaming%
with%Spark%

!! Spark&Streaming&Overview&
!! Example:&Streaming&Request&Count&
!! DStreams%
!! Hands5On&Exercise:&Exploring&Spark&Streaming&
!! State&OperaDons&
!! Sliding&Window&OperaDons&
!! Developing&Spark&Streaming&ApplicaDons&
!! Conclusion&
!! Hands5On&Exercise:&WriDng&a&Spark&Streaming&ApplicaDon&
&
©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#20%
DStreams&

! A%DStream%is%a%sequence%of%RDDs%represenCng%a%data%stream%
– “DiscreDzed&Stream”&
Time&

Live&Data& data…data…data…data…data…data…data…data…

t=0& t=1& t=2& t=3&

RDD&@&t=1& RDD&@&t=2& RDD&@&t=3&


data… data… data…
data… data… data…
DStream&
data… data… data…
data… data… data…

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#21%
DStream Data Sources

! DStreams are defined for a given input stream (e.g., a Unix socket)
– Created by the StreamingContext
ssc.socketTextStream(hostname, port)
– Similar to how RDDs are created by the SparkContext
! Out-of-the-box data sources
– Network
– Sockets
– Other network sources, e.g., Flume, Akka Actors, Kafka, ZeroMQ,
Twitter
– Files
– Monitors an HDFS directory for new content
©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#22%
DStream&OperaDons&

! DStream%operaCons%are%applied%to%every%RDD%in%the%stream%
– Executed&once&per&dura+on&
! Two%types%of%DStream%operaCons%
– TransformaDons&
– Create&a&new&DStream&from&an&exisDng&one&
– Output&operaDons&
– Write&data&(for&example,&to&a&file&system,&database,&or&console)&
•  Similar&to&RDD&ac+ons'

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#23%
DStream&TransformaDons&(1)&

! Many%RDD%transformaCons%are%also%available%on%DStreams%
– Regular&transformaDons&such&as&map,&flatMap,&filter
– Pair&transformaDons&such&as&reduceByKey,&groupByKey,&join
!  What%if%you%want%to%do%something%else?%%
– transform(function)
– Creates&a&new&DStream&by&execuDng&func+on&on&RDDs&in&the&
current&DStream&
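For example, a minimal Python sketch, assuming a DStream of key/value pairs named pairs:

# sort each batch RDD by key, producing a new DStream
sortedPairs = pairs.transform(lambda rdd: rdd.sortByKey())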

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#24%
DStream&TransformaDons&(2)&

data… data… data…

logs& data… data… data…


data… data… data…
… … …

userreqs = logs.map(line =>


(line.split(" ")(2),1))
(user002,1) (user011,1) (user012,1)
(user011,1) (user823,1) (user011,1)
userreqs&
(user991,1) (user012,1) (user552,1)
… … …

reqcounts = userreqs.
reduceByKey((x,y) => x+y)
(user002,5) (user710,9) (user002,1)
(user033,1) (user022,4) (user808,8)
reqcounts& (user912,2) (user001,4) (user018,2)
… … …

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#25%
DStream&Output&OperaDons&

! Console%output%
– print&–&prints&out&the&first&10&elements&of&each&RDD&
! File%output%
– saveAsTextFiles&–&save&data&as&text&
– saveAsObjectFiles&–&save&as&serialized&object&files&
! ExecuCng%other%funcCons%
– foreachRDD(function)%–&performs&a&funcDon&on&each&RDD&in&the&
DStream&
– FuncDon&input&parameters&
– RDD&–&the&RDD&on&which&to&perform&the&funcDon&
– Time&–&opDonal,&the&Dme&stamp&of&the&RDD&
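A minimal Python sketch of foreachRDD (illustrative only; in the Python API the supplied function may take either the RDD alone or a (time, rdd) pair; mystream is assumed to be an existing DStream):

def printBatchSize(time, rdd):
    # print each batch RDD's size along with its time stamp
    print "Batch at %s contained %d records" % (time, rdd.count())

mystream.foreachRDD(printBatchSize)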

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#26%
Saving&DStream&Results&as&Files&

val userreqs = logs.


map(line => (line.split(" ")(2),1)).
reduceByKey((x,y) => x+y)
userreqs.print()
userreqs.saveAsTextFiles("…/outdir/reqcounts")

(user002,5) (user710,9) (user002,1)


(user033,1) (user022,4) (user808,8)
(user912,2) (user001,4) (user018,2)
… … …

reqcounts-timestamp1/   reqcounts-timestamp2/   reqcounts-timestamp3/
  part-00000…             part-00000…             part-00000…
  (user002,5)             (user710,9)             (user002,1)
  (user033,1)             (user022,4)             (user808,8)
  (user912,2)             (user001,4)             (user018,2)
  …                       …                       …

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#27%
Example:&Find&Top&Users&(1)&


val userreqs = logs.
map(line => (line.split(" ")(2),1)).
reduceByKey((x,y) => x+y)
userreqs.saveAsTextFiles(path)

val sortedreqs = userreqs.


map(pair => pair.swap).
transform(rdd => rdd.sortByKey(false))

sortedreqs.foreachRDD((rdd,time) => {
  println("Top users @ " + time)
  rdd.take(5).foreach(
    pair => printf("User: %s (%s)\n",pair._2, pair._1))
  }
)

Transform each RDD: swap userID/count, sort by count

ssc.start()
ssc.awaitTermination()

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#28%
Example:&Find&Top&Users&(2)&


val userreqs = logs.
map(line => (line.split(" ")(2),1)).
reduceByKey((x,y) => x+y)
userreqs.saveAsTextFiles(path)

val sortedreqs = userreqs.


  map(pair => pair.swap).
  transform(rdd => rdd.sortByKey(false))

sortedreqs.foreachRDD((rdd,time) => {
  println("Top users @ " + time)
  rdd.take(5).foreach(
    pair => printf("User: %s (%s)\n",pair._2, pair._1))
  }
)

Print out the top 5 users as “User: userID (count)”

ssc.start()
ssc.awaitTermination()

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#29%
Example:&Find&Top&Users&–&Output&

Top users @ 1401219545000 ms        t = 0 (2 seconds after ssc.start)

User: 16261 (8)
User: 22232 (7)
User: 66652 (4)
User: 21205 (2)
User: 24358 (2)

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#30%
Example:&Find&Top&Users&–&Output&

Top users @ 1401219545000 ms


User: 16261 (8)
User: 22232 (7)
User: 66652 (4)
User: 21205 (2)
User: 24358 (2)
Top users @ 1401219547000 ms        t = 1 (2 seconds later)
User: 53667 (4)
User: 35600 (4)
User: 62 (2)
User: 165 (2)
User: 40 (2)

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#31%
Example:&Find&Top&Users&–&Output&

Top users @ 1401219545000 ms


User: 16261 (8)
User: 22232 (7)
User: 66652 (4)
User: 21205 (2)
User: 24358 (2)
Top users @ 1401219547000 ms
User: 53667 (4)
User: 35600 (4)
User: 62 (2)
User: 165 (2)
User: 40 (2)
Top users @ 1401219549000 ms        t = 2 (2 seconds later)
User: 31 (12)
User: 6734 (10)
User: 14986 (10)
User: 72760 (2)
User: 65335 (2)
Top users @ 1401219551000 ms

Continues until termination…

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#32%
Using&Spark&Streaming&with&Spark&Shell&

! Spark%Streaming%is%designed%for%batch%applicaCons,%not%interacCve%use%
! Spark%Shell%can%be%used%for%limited%tesCng%
– Adding&operaDons&acer&the&Streaming&Context&has&been&started&is&
unsupported&
– Stopping&and&restarDng&the&Streaming&Context&is&unsupported&
&
&& $ spark-shell --master local[2]

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#33%
Chapter&Topics&

Solving%Business%Problems%%
Spark%Streaming%
with%Spark%

!! Spark&Streaming&Overview&
!! Example:&Streaming&Request&Count&
!! DStreams&
!! Hands#On%Exercise:%Exploring%Spark%Streaming%
!! State&OperaDons&
!! Sliding&Window&OperaDons&
!! Developing&Spark&Streaming&ApplicaDons&
!! Conclusion&
!! Hands5On&Exercise:&WriDng&a&Spark&Streaming&ApplicaDon&

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#34%
Hands5On&Exercise:&Exploring&Spark&Streaming&

! Hands#On%Exercise:%Exploring*Spark*Streaming*
– Explore&Spark&Streaming&using&the&Scala&Spark&Shell&
– Count&words,&use&netcat&to&simulate&a&data&stream&
! Please%refer%to%the%Hands#On%Exercise%Manual%

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#35%
Chapter&Topics&

Solving%Business%Problems%%
Spark%Streaming%
with%Spark%

!! Spark&Streaming&Overview&
!! Example:&Streaming&Request&Count&
!! DStreams&
!! Hands5On&Exercise:&Exploring&Spark&Streaming&
!! State%OperaCons%
!! Sliding&Window&OperaDons&
!! Developing&Spark&Streaming&ApplicaDons&
!! Conclusion&
!! Hands5On&Exercise:&WriDng&a&Spark&Streaming&ApplicaDon&

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#36%
State&DStreams&(1)&

! Use%the%updateStateByKey%funcCon%to%create%a%state%DStream%
! Example:%Total%request%count%by%User%ID%

t&=&1&
(user001,5)
Requests& (user102,1)
(user009,2)

Total&& (user001,5)
Requests& (user102,1)
(State)& (user009,2)

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#37%
State&DStreams&(2)&

! Use%the%updateStateByKey%funcCon%to%create%a%state%DStream%
! Example:%Total%request%count%by%User%ID%

t&=&1& t&=&2& t&=&3&


(user001,5) (user001,4)
Requests& (user102,1) (user012,2)
(user009,2) (user921,5)

Total&& (user001,5) (user001,9)


Requests& (user102,1) (user102,1)
(State)& (user009,2) (user009,2)
(user012,2)
(user921,5)

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#38%
State&DStreams&(3)&

! Use%the%updateStateByKey%funcCon%to%create%a%state%DStream%
! Example:%Total%request%count%by%User%ID%

t&=&1& t&=&2& t&=&3&


(user001,5) (user001,4) (user102,7)
Requests& (user102,1) (user012,2) (user012,3)
(user009,2) (user921,5) (user660,4)

Total&& (user001,5) (user001,9) (user001,9)


Requests& (user102,1) (user102,1) (user102,8)
(State)& (user009,2) (user009,2) (user009,2)
(user012,2) (user012,5)
(user921,5) (user921,5)
(user660,4)

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#39%
Example:&Total&User&Request&Count&(1)&


val userreqs = logs.
  map(line => (line.split(" ")(2),1)).
  reduceByKey((x,y) => x+y)

ssc.checkpoint("checkpoints")
val totalUserreqs = userreqs.updateStateByKey(updateCount)
totalUserreqs.print()

ssc.start()
ssc.awaitTermination()
…

Set the checkpoint directory to enable checkpointing. Required to prevent
infinite lineages.

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#40%
Example:&Total&User&Request&Count&(2)&


val userreqs = logs.
  map(line => (line.split(" ")(2),1)).
  reduceByKey((x,y) => x+y)

ssc.checkpoint("checkpoints")
val totalUserreqs = userreqs.updateStateByKey(updateCount)
totalUserreqs.print()

ssc.start()
ssc.awaitTermination()
…

Compute a state DStream based on the previous states, updated with the
values from the current batch of request counts (updateCount is shown on
the next slide)

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#41%
Example:&Total&User&Request&Count&–&Update&FuncDon&(1)&

// newCounts: the new values; state: the current state (or None)
def updateCount = (newCounts: Seq[Int], state: Option[Int]) => {
  val newCount = newCounts.foldLeft(0)(_ + _)
  val previousCount = state.getOrElse(0)
  Some(newCount + previousCount)   // the new state
}

Given an existing state for a key (user), and new values (counts), return a
new state (sum of current state and new counts)
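The same update function as a Python sketch (the Python API passes the new values and the prior state directly; the state is None for a key not seen before, and the return value becomes the new state):

def updateCount(newCounts, state):
    return sum(newCounts) + (state or 0)

totalUserreqs = userreqs.updateStateByKey(updateCount)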

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#42%
Example:&Total&User&Request&Count&–&Update&FuncDon&(2)&

! Example%at%t=2%
user001:  updateCount([4], Some(5))  →  9
user012:  updateCount([2], None)     →  2
user921:  updateCount([5], None)     →  5
t&=&1& t&=&2&
(user001,5) (user001,4)
Requests& (user102,1) (user012,2)
(user009,2) (user921,5)

Total&& (user001,5) (user001,9)


Requests& (user102,1) (user102,1)
(State)& (user009,2) (user009,2)
(user012,2)
(user921,5)

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#43%
Example:&Maintaining&State&–&Output&&

-------------------------------------------
Time: 1401219545000 ms                       t = 1
-------------------------------------------
(user001,5)
(user102,1)
(user009,2)
-------------------------------------------
Time: 1401219547000 ms                       t = 2
-------------------------------------------
(user001,9)
(user102,1)
(user009,2)
(user012,2)
(user921,5)
-------------------------------------------
Time: 1401219549000 ms                       t = 3
-------------------------------------------
(user001,9)
(user102,8)
(user009,2)
(user012,5)
(user921,5)
(user660,4)
-------------------------------------------
Time: 1401219551000 ms
-------------------------------------------

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#44%
Chapter&Topics&

Solving%Business%Problems%%
Spark%Streaming%
with%Spark%

!! Spark&Streaming&Overview&
!! Example:&Streaming&Request&Count&
!! DStreams&
!! Hands5On&Exercise:&Exploring&Spark&Streaming&
!! State&OperaDons&
!! Sliding%Window%OperaCons%
!! Developing&Spark&Streaming&ApplicaDons&
!! Conclusion&
!! Hands5On&Exercise:&WriDng&a&Spark&Streaming&ApplicaDon&

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#45%
Sliding&Window&OperaDons&(1)&

! Regular%DStream%operaCons%execute%for%each%RDD%based%on%SSC%duraCon%
! “Window”%operaCons%span%RDDs%over%a%given%duraCon%
– e.g.,&reduceByKeyAndWindow,&countByWindow

Window&DuraDon&

Regular&
DStream&

reduceByKeyAndWindow(
fn,window-duration)

Window&
DStream&

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#46%
Sliding&Window&OperaDons&(2)&

! By%default%window%operaCons%will%execute%with%an%“interval”%the%same%as%
the%SSC%duraCon%
– i.e.,&for&2&minute&batch&duraDon,&window&will&“slide”&every&2&minutes&

Window&DuraDon&
Regular&
DStream&
(batch&size&=&&
Minutes(2))&

reduceByKeyAndWindow(fn,
Minutes(12))

Window&
DStream&

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#47%
Sliding&Window&OperaDons&(3)&

! You%can%specify%a%different%slide%duraCon%(must%be%a%mulCple%of%the%SSC%
duraCon)%

Window&DuraDon&
Regular&
DStream&
(batch&size&=&&
Minutes(2))&

reduceByKeyAndWindow(fn,
Minutes(12), Minutes(4))

Window&
DStream&

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#48%
Example:&Count&and&Sort&User&Requests&by&Window&(1)&


val ssc = new StreamingContext(new SparkConf(), Seconds(2))
val logs = ssc.socketTextStream(hostname, port)

val reqcountsByWindow = logs.
map(line => (line.split(' ')(2),1)).
reduceByKeyAndWindow((x: Int, y: Int) => x+y,
Minutes(5),Seconds(30))

val topreqsByWindow = reqcountsByWindow.
  map(pair => pair.swap).
  transform(rdd => rdd.sortByKey(false))
topreqsByWindow.map(pair => pair.swap).print()

Every 30 seconds, count requests by user over the last 5 minutes

ssc.start()
ssc.awaitTermination()

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#49%
Example:&Count&and&Sort&User&Requests&by&Window&(2)&


val ssc = new StreamingContext(new SparkConf(), Seconds(2))
val logs = ssc.socketTextStream(hostname, port)

val reqcountsByWindow = logs.
  map(line => (line.split(' ')(2),1)).
  reduceByKeyAndWindow((x: Int, y: Int) => x+y,
    Minutes(5),Seconds(30))

val topreqsByWindow = reqcountsByWindow.
  map(pair => pair.swap).
  transform(rdd => rdd.sortByKey(false))
topreqsByWindow.map(pair => pair.swap).print()

Sort and print the top users for every RDD (every 30 seconds)

ssc.start()
ssc.awaitTermination()
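The same windowed count as a Python sketch (in the Python API the window and slide durations are given in seconds, and None is passed where no inverse-reduce function is supplied; logs is assumed to be an existing DStream):

reqcountsByWindow = logs \
    .map(lambda line: (line.split(' ')[2], 1)) \
    .reduceByKeyAndWindow(lambda x, y: x + y, None, 300, 30)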

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#50%
Chapter&Topics&

Solving%Business%Problems%%
Spark%Streaming%
with%Spark%

!! Spark&Streaming&Overview&
!! Example:&Streaming&Request&Count&
!! DStreams&
!! Hands5On&Exercise:&Exploring&Spark&Streaming&
!! State&OperaDons&
!! Sliding&Window&OperaDons&
!! Developing%Spark%Streaming%ApplicaCons%
!! Conclusion&
!! Hands5On&Exercise:&WriDng&a&Spark&Streaming&ApplicaDon&

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#51%
Special&ConsideraDons&for&Streaming&ApplicaDons&

! Spark%Streaming%applicaCons%are%by%definiCon%long#running%
– Require&some&different&approaches&than&typical&Spark&applicaDons&
! Metadata%accumulates%over%Cme%
– Use&checkpoinDng&to&trim&RDD&lineage&data&
– Required&to&use&windowed&and&state&operaDons&
– Enable by setting the checkpoint directory:
ssc.checkpoint(directory)
! Monitoring%
– The&StreamingListener&API&lets&you&collect&staDsDcs&

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#52%
Spark&Fault&Tolerance&(1)&

! Network%data%is%received%on%a%worker%node%
– Receiver&distributes&data&(RDDs)&to&the&cluster&as&parDDons&
! Spark%Streaming%persists%windowed%RDDs%by%default%(replicaCon%=%2)%
Client&
Executor& rdd_0_1&
Receiver&

Network&
Driver& Executor& Data&Source&
rdd_0_1&
Program&
rdd_0_0&

Executor&
rdd_0_0&

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#53%
Spark&Fault&Tolerance&(2)&

! If%the%receiver%fails,%Spark%will%restart%it%on%a%different%Executor%
– PotenDal&for&brief&loss&of&incoming&data&

Executor&
Receiver&

Network&
Driver& Executor& Data&Source&
rdd_0_1&
Program& Receiver&
rdd_0_0&

Executor&
rdd_0_0&

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#54%
Building&and&Running&Spark&Streaming&ApplicaDons&

! Building%Spark%Streaming%ApplicaCons%
– Link&with&the&main&Spark&Streaming&library&(included&with&Spark)&
– Link with additional Spark Streaming libraries if necessary, e.g., Kafka,
Flume, Twitter
! Running%Spark%Streaming%ApplicaCons%
– Use&at&least&two&threads&if&running&locally&

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#55%
The&Spark&Streaming&ApplicaDon&UI&

! The%Streaming%tab%
in%the%Spark%App%%
UI%provides%basic%%
metrics%about%the%%
applicaCon%

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#56%
Chapter&Topics&

Solving%Business%Problems%%
Spark%Streaming%
with%Spark%

!! Spark&Streaming&Overview&
!! Example:&Streaming&Request&Count&
!! DStreams&
!! Hands5On&Exercise:&Exploring&Spark&Streaming&
!! State&OperaDons&
!! Sliding&Window&OperaDons&
!! Developing&Spark&Streaming&ApplicaDons&
!! Conclusion%
!! Hands5On&Exercise:&WriDng&a&Spark&Streaming&ApplicaDon&

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#57%
Key&Points&

! Spark%Streaming%is%an%add#on%to%core%Spark%to%process%real#Cme%streaming%
data%
! DStreams%are%“discreCzed%streams”%of%streaming%data,%batched%into%RDDs%
by%Cme%intervals%%
– OperaDons&applied&to&DStreams&are&applied&to&each&RDD&
– TransformaDons&produce&new&DStreams&by&applying&a&funcDon&to&each&
RDD&in&the&base&DStream&
! You%can%update%state%based%on%prior%state%
– e.g.,&Total&requests&by&user&
! You%can%perform%operaCons%on%“windows”%of%data%
– e.g.,&Number&of&logins&in&the&last&hour&

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#58%
Chapter&Topics&

Solving%Business%Problems%%
Spark%Streaming%
with%Spark%

!! Spark&Streaming&Overview&
!! Example:&Streaming&Request&Count&
!! DStreams&
!! Hands5On&Exercise:&Exploring&Spark&Streaming&
!! State&OperaDons&
!! Sliding&Window&OperaDons&
!! Developing&Spark&Streaming&ApplicaDons&
!! Conclusion&
!! Hands#On%Exercise:%WriCng%a%Spark%Streaming%ApplicaCon%

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#59%
Hands5On&Exercise:&WriDng&a&Spark&Streaming&ApplicaDon&

! Hands#On%Exercise:%Wri2ng*a*Spark*Streaming*Applica2on*
– Write&a&Spark&Streaming&applicaDon&to&process&web&logs&using&a&Python&
script&to&simulate&a&data&stream&
! Please%refer%to%the%Hands#On%Exercise%Manual%

©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#60%
Common%Pa(erns%in%Spark%
Programming%
Chapter%11%
Course%Chapters%
!! IntroducEon% Course%IntroducEon%
!! Why%Spark?%
!! Spark%Basics% IntroducEon%to%Spark%
!! Working%With%RDDs%

!! The%Hadoop%Distributed%File%System%(HDFS)%
!! Running%Spark%on%a%Cluster%
Distributed%Data%Processing%%
!! Parallel%Programming%with%Spark%
with%Spark%
!! Caching%and%Persistence%
!! WriEng%Spark%ApplicaEons%

!! Spark%Streaming%
!! Common$Pa;erns$in$Spark$Programming$ Solving$Business$Problems$$
!! Improving%Spark%Performance% with$Spark$
!! Spark,%Hadoop,%and%the%Enterprise%Data%Center%

!! Conclusion% Course%Conclusion%

©%Copyright%201082015%Cloudera.%All%rights%reserved.%Not%to%be%reproduced%without%prior%wri(en%consent.% 11"2$
Common%Spark%Algorithms%

In$this$chapter$you$will$learn$
! What$kinds$of$processing$and$analysis$Spark$is$best$at$
! How$to$implement$an$iteraDve$algorithm$in$Spark$
! How$GraphX$and$MLlib$work$with$Spark$$

©%Copyright%201082015%Cloudera.%All%rights%reserved.%Not%to%be%reproduced%without%prior%wri(en%consent.% 11"3$
Chapter%Topics%

Common$Programming$Pa;erns$in$ Solving$Business$Problems$$
Spark$ with$Spark$

!! Common$Spark$Use$Cases%
!! IteraEve%Algorithms%in%Spark%
!! Graph%Processing%and%Analysis%%%
!! Machine%Learning%
!! Example:%k8means%
!! Conclusion%
!! Hands8On%Exercise:%IteraEve%Processing%in%Spark%

©%Copyright%201082015%Cloudera.%All%rights%reserved.%Not%to%be%reproduced%without%prior%wri(en%consent.% 11"4$
Common%Spark%Use%Cases%(1)%

! Spark$is$especially$useful$when$working$with$any$combinaDon$of:$
– Large%amounts%of%data%
– Distributed%storage%
– Intensive%computaEons%
– Distributed%compuEng%
– IteraEve%algorithms%
– In8memory%processing%and%pipelining%

©%Copyright%201082015%Cloudera.%All%rights%reserved.%Not%to%be%reproduced%without%prior%wri(en%consent.% 11"5$
Common%Spark%Use%Cases%(2)%

! Examples$
– Risk%analysis%
– “How%likely%is%this%borrower%to%pay%back%a%loan?”%
– RecommendaEons%
– “Which%products%will%this%customer%enjoy?”%
– PredicEons%
– “How%can%we%prevent%service%outages%instead%of%simply%reacEng%to%
them?”%
– ClassificaEon%
– “How%can%we%tell%which%email%is%spam%and%which%is%legiEmate?”%

©%Copyright%201082015%Cloudera.%All%rights%reserved.%Not%to%be%reproduced%without%prior%wri(en%consent.% 11"6$
Spark%Examples%

! Spark$includes$many$example$programs$that$demonstrate$some$common$
Spark$programming$pa;erns$and$algorithms$
– k8means%
– LogisEc%regression%
– Calculate%pi%
– AlternaEng%least%squares%(ALS)%
– Querying%Apache%web%logs%
– Processing%Twi(er%feeds%
! Scala$and$Java$Examples$
– $SPARK_HOME/examples/
! Python$examples$
– $SPARK_HOME/python/examples

©%Copyright%201082015%Cloudera.%All%rights%reserved.%Not%to%be%reproduced%without%prior%wri(en%consent.% 11"7$
Chapter%Topics%

Common$Programming$Pa;erns$in$ Solving$Business$Problems$$
Spark$ with$Spark$

!! Common%Spark%Use%Cases%
!! IteraDve$Algorithms$in$Spark%
!! Graph%Processing%and%Analysis%%%
!! Machine%Learning%
!! Example:%k8means%
!! Conclusion%
!! Hands8On%Exercise:%IteraEve%Processing%in%Spark%
%

©%Copyright%201082015%Cloudera.%All%rights%reserved.%Not%to%be%reproduced%without%prior%wri(en%consent.% 11"8$
Example:%PageRank%

! PageRank$gives$web$pages$a$ranking$score$based$on$links$from$other$
pages$
– Higher%scores%given%for%more%links,%and%links%from%other%high%ranking%
pages%
! Why$do$we$care?$
– PageRank%is%a%classic%example%of%big%data%analysis%(like%WordCount)%
– Lots%of%data%–%needs%an%algorithm%that%is%distributable%and%scalable%
– Iterative – the more iterations, the better the answer

©%Copyright%201082015%Cloudera.%All%rights%reserved.%Not%to%be%reproduced%without%prior%wri(en%consent.% 11"9$
PageRank%Algorithm%(1)%

1.  Start$each$page$with$a$rank$of$1.0$

Page%1%
1.0%

Page%2% Page%3%
1.0% 1.0%
Page%4%
1.0%

©%Copyright%201082015%Cloudera.%All%rights%reserved.%Not%to%be%reproduced%without%prior%wri(en%consent.% 11"10$
PageRank%Algorithm%(2)%

1.  Start$each$page$with$a$rank$of$1.0$
2.  On$each$iteraDon:$
1.  each%page%contributes%to%its%neighbors%its%own%rank%divided%by%the%
number%of%its%neighbors:%contribp%=%rankp%/%neighborsp%

Page%1%
1.0%

Page%2% Page%3%
".%5%

1.0% 1.0%
Page%4%
1.0%

©%Copyright%201082015%Cloudera.%All%rights%reserved.%Not%to%be%reproduced%without%prior%wri(en%consent.% 11"11$
PageRank%Algorithm%(3)%

1.  Start$each$page$with$a$rank$of$1.0$
2.  On$each$iteraDon:$
1.  each%page%contributes%to%its%neighbors%its%own%rank%divided%by%the%
number%of%its%neighbors:%contribp%=%rankp%/%neighborsp%
2.  Set%each%page’s%new%rank%based%on%the%sum%of%its%neighbors%
contribuEon:%%new8rank%=%Σcontribs%*%.85%+%.15%

Page%1% IteraEon%1%
1.85%

Page%2% Page%3%
".%5%

0.58% 1.0%
Page%4%
0.58%

©%Copyright%201082015%Cloudera.%All%rights%reserved.%Not%to%be%reproduced%without%prior%wri(en%consent.% 11"12$
PageRank%Algorithm%(4)%

1.  Start$each$page$with$a$rank$of$1.0$
2.  On$each$iteraDon:$
1.  each%page%contributes%to%its%neighbors%its%own%rank%divided%by%the%
number%of%its%neighbors:%contribp%=%rankp%/%neighborsp%
2.  Set%each%page’s%new%rank%based%on%the%sum%of%its%neighbors%
contribuEon:%%new8rank%=%Σcontribs%*%.85%+%.15%
3.  Each$iteraDon$incrementally$improves$the$page$ranking$

Page%1% IteraEon%2%
1.31%

Page%2% Page%3%
" . %29%

0.39% 1.7%
Page%4%
0.57%

©%Copyright%201082015%Cloudera.%All%rights%reserved.%Not%to%be%reproduced%without%prior%wri(en%consent.% 11"13$
PageRank%Algorithm%(5)%

1.  Start$each$page$with$a$rank$of$1.0$
2.  On$each$iteraDon:$
1.  each%page%contributes%to%its%neighbors%its%own%rank%divided%by%the%
number%of%its%neighbors:%contribp%=%rankp%/%neighborsp%
2.  Set%each%page’s%new%rank%based%on%the%sum%of%its%neighbors%
contribuEon:%%new8rank%=%Σcontribs%*%.85%+%.15%
3.  Each$iteraDon$incrementally$improves$the$page$ranking$

Page%1% IteraDon$10$
1.43% (Final)%
Page%2% Page%3%
" . %37%

0.46% 1.38%
Page%4%
0.73%

©%Copyright%201082015%Cloudera.%All%rights%reserved.%Not%to%be%reproduced%without%prior%wri(en%consent.% 11"14$
PageRank%in%Spark:%Neighbor%ContribuEon%FuncEon%

def computeContribs(neighbors, rank):


for neighbor in neighbors: yield(neighbor, rank/len(neighbors))

neighbors: [page1,page2]      →   (page1,.5)
rank: 1.0                         (page2,.5)

Page%1%

Page%2% Page%3%
".%5%

Page%4%
1.0%

©%Copyright%201082015%Cloudera.%All%rights%reserved.%Not%to%be%reproduced%without%prior%wri(en%consent.% 11"15$
PageRank%in%Spark:%Example%Data%

Data Format:
source-page destination-page
…

page1 page3
page2 page1
page4 page1
page3 page1
page4 page2
page3 page4

Page%1%

Page%2% Page%3%

Page%4%

©%Copyright%201082015%Cloudera.%All%rights%reserved.%Not%to%be%reproduced%without%prior%wri(en%consent.% 11"16$
PageRank%in%Spark:%Pairs%of%Page%Links%

def computeContribs(neighbors, rank):…

links = sc.textFile(file)\
    .map(lambda line: line.split())\
    .map(lambda pages: (pages[0],pages[1]))\
    .distinct()

page1 page3
page2 page1
page4 page1
page3 page1
page4 page2
page3 page4

(page1,page3)
(page2,page1)
(page4,page1)
(page3,page1)
(page4,page2)
(page3,page4)

©%Copyright%201082015%Cloudera.%All%rights%reserved.%Not%to%be%reproduced%without%prior%wri(en%consent.% 11"17$
PageRank%in%Spark:%Page%Links%Grouped%by%Source%Page%

page1 page3
page2 page1
def computeContribs(neighbors, rank):… page4 page1
page3 page1
page4 page2
links = sc.textFile(file)\ page3 page4
.map(lambda line: line.split())\
.map(lambda pages: (pages[0],pages[1]))\
.distinct()\ (page1,page3)
.groupByKey()
(page2,page1)
(page4,page1)
(page3,page1)
(page4,page2)
(page3,page4)

links%
(page4, [page2,page1])
(page2, [page1])
(page3, [page1,page4])
(page1, [page3])

©%Copyright%201082015%Cloudera.%All%rights%reserved.%Not%to%be%reproduced%without%prior%wri(en%consent.% 11"18$
PageRank%in%Spark:%Caching%the%Link%Pair%RDD%

def computeContribs(neighbors, rank):…

links = sc.textFile(file)\
    .map(lambda line: line.split())\
    .map(lambda pages: (pages[0],pages[1]))\
    .distinct()\
    .groupByKey()\
    .cache()

links
(page4, [page2,page1])
(page2, [page1])
(page3, [page1,page4])
(page1, [page3])

©%Copyright%201082015%Cloudera.%All%rights%reserved.%Not%to%be%reproduced%without%prior%wri(en%consent.% 11"19$
PageRank%in%Spark:%Set%IniEal%Ranks%

def computeContribs(neighbors, rank):…

links = sc.textFile(file)\
    .map(lambda line: line.split())\
    .map(lambda pages: (pages[0],pages[1]))\
    .distinct()\
    .groupByKey()\
    .cache()

ranks = links.map(lambda (page,neighbors): (page,1.0))

links
(page4, [page2,page1])
(page2, [page1])
(page3, [page1,page4])
(page1, [page3])

ranks
(page4, 1.0)
(page2, 1.0)
(page3, 1.0)
(page1, 1.0)

©%Copyright%201082015%Cloudera.%All%rights%reserved.%Not%to%be%reproduced%without%prior%wri(en%consent.% 11"20$
PageRank%in%Spark:%First%IteraEon%(1)%

def computeContribs(neighbors, rank):…

links = …
ranks = …

for x in xrange(10):
    contribs=links\
        .join(ranks)

links                    ranks
(page4, [page2,page1])   (page4, 1.0)
(page2, [page1])         (page2, 1.0)
(page3, [page1,page4])   (page3, 1.0)
(page1, [page3])         (page1, 1.0)

links.join(ranks):
(page4, ([page2,page1], 1.0))
(page2, ([page1], 1.0))
(page3, ([page1,page4], 1.0))
(page1, ([page3], 1.0))

©%Copyright%201082015%Cloudera.%All%rights%reserved.%Not%to%be%reproduced%without%prior%wri(en%consent.% 11"21$
PageRank in Spark: First Iteration (2)

def computeContribs(neighbors, rank): …
links = …
ranks = …

for x in xrange(10):
    contribs = links\
        .join(ranks)\
        .flatMap(lambda (page,(neighbors,rank)): \
            computeContribs(neighbors,rank))

contribs:
(page2,0.5)
(page1,0.5)
(page1,1.0)
(page1,0.5)
(page4,0.5)
(page3,1.0)
PageRank in Spark: First Iteration (3)

def computeContribs(neighbors, rank): …
links = …
ranks = …

for x in xrange(10):
    contribs = links\
        .join(ranks)\
        .flatMap(lambda (page,(neighbors,rank)): \
            computeContribs(neighbors,rank))
    ranks = contribs\
        .reduceByKey(lambda v1,v2: v1+v2)

Summed contributions:
(page4,0.5)
(page2,0.5)
(page3,1.0)
(page1,2.0)
PageRank in Spark: First Iteration (4)

def computeContribs(neighbors, rank): …
links = …
ranks = …

for x in xrange(10):
    contribs = links\
        .join(ranks)\
        .flatMap(lambda (page,(neighbors,rank)): \
            computeContribs(neighbors,rank))
    ranks = contribs\
        .reduceByKey(lambda v1,v2: v1+v2)\
        .map(lambda (page,contrib): \
            (page,contrib * 0.85 + 0.15))

ranks:
(page4,0.58)
(page2,0.58)
(page3,1.0)
(page1,1.85)
PageRank in Spark: Second Iteration

def computeContribs(neighbors, rank): …
links = …
ranks = …

for x in xrange(10):
    contribs = links\
        .join(ranks)\
        .flatMap(lambda (page,(neighbors,rank)): \
            computeContribs(neighbors,rank))
    ranks = contribs\
        .reduceByKey(lambda v1,v2: v1+v2)\
        .map(lambda (page,contrib): \
            (page,contrib * 0.85 + 0.15))

for rank in ranks.collect(): print rank

ranks after the second iteration:
(page4,0.57)
(page2,0.21)
(page3,1.0)
(page1,0.77)
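Putting the preceding slides together: a minimal, self-contained sketch of the complete PageRank driver, assuming Python 2 / PySpark 1.x (tuple-unpacking lambdas and xrange) and that file names a text file of source/destination page pairs.

def computeContribs(neighbors, rank):
    # Each neighbor receives an equal share of this page's current rank
    for neighbor in neighbors:
        yield (neighbor, rank / len(neighbors))

links = sc.textFile(file) \
    .map(lambda line: line.split()) \
    .map(lambda pages: (pages[0], pages[1])) \
    .distinct() \
    .groupByKey() \
    .cache()                      # reused every iteration, so cache it

ranks = links.map(lambda (page, neighbors): (page, 1.0))

for x in xrange(10):
    contribs = links.join(ranks) \
        .flatMap(lambda (page, (neighbors, rank)):
                 computeContribs(neighbors, rank))
    ranks = contribs.reduceByKey(lambda v1, v2: v1 + v2) \
        .map(lambda (page, contrib): (page, contrib * 0.85 + 0.15))

for rank in ranks.collect(): print rank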
Chapter Topics

Common Programming Patterns in Spark (Solving Business Problems with Spark)

!! Common Spark Use Cases
!! Iterative Algorithms in Spark
!! Graph Processing and Analysis
!! Machine Learning
!! Example: k-means
!! Conclusion
Graph Analytics

! Many data analytics problems work with "data parallel" algorithms
  – Records can be processed independently of each other
  – Very well suited to parallelizing
! Some problems focus on the relationships between the individual data items. For example:
  – Social networks
  – Web page hyperlinks
  – Roadmaps
! These relationships can be represented by graphs
  – Requires "graph parallel" algorithms
Graph Analysis Challenges at Scale

! Graph Creation
  – Extracting relationship information from a data source
  – For example, extracting links from web pages
! Graph Representation
  – e.g., adjacency lists in a table
! Graph Analysis
  – Inherently iterative, hard to parallelize
  – This is the focus of specialized libraries like Pregel and GraphLab
! Post-analysis processing
  – e.g., incorporating product recommendations into a retail site
Graph Analysis in Spark

! Spark is very well suited to graph parallel algorithms
! GraphX
  – UC Berkeley AMPLab project on top of Spark
  – Unifies optimized graph computation with Spark's fast data parallelism and interactive abilities
  – Supersedes its predecessor, Bagel (Pregel on Spark)
Chapter Topics

Common Spark Algorithms (Solving Business Problems with Spark)

!! Common Spark Use Cases
!! Iterative Algorithms in Spark
!! Graph Processing and Analysis
!! Machine Learning
!! Example: k-means
!! Conclusion
!! Hands-On Exercise: Iterative Processing in Spark
Machine Learning

! Most programs tell computers exactly what to do
  – Database transactions and queries
  – Controllers
    – Phone systems, manufacturing processes, transport, weaponry, etc.
  – Media delivery
  – Simple search
  – Social systems
    – Chat, blogs, email, etc.
! An alternative technique is to have computers learn what to do
! Machine Learning refers to programs that leverage collected data to drive future program behavior
! This represents another major opportunity to gain value from data
The 'Three Cs'

! Machine Learning is an active area of research and new applications
! There are three well-established categories of techniques for exploiting data
  – Collaborative filtering (recommendations)
  – Clustering
  – Classification
Collaborative Filtering

! Collaborative Filtering is a technique for recommendations
! Example application: given people who each like certain books, learn to suggest what someone may like in the future based on what they already like
! Helps users navigate data by expanding to topics that have affinity with their established interests
! Collaborative Filtering algorithms are agnostic to the different types of data items involved
  – Useful in many different domains
Clustering

! Clustering algorithms discover structure in collections of data
  – Where no formal structure previously existed
! They discover what clusters, or groupings, naturally occur in data
! Examples
  – Finding related news articles
  – Computer vision (groups of pixels that cohere into objects)
Classification

! The previous two techniques are considered 'unsupervised' learning
  – The algorithm discovers groups or recommendations itself
! Classification is a form of 'supervised' learning
! A classification system takes a set of data records with known labels
  – Learns how to label new records based on that information
! Examples
  – Given a set of emails identified as spam/not spam, label new emails as spam/not spam
  – Given images of tumors identified as benign or malignant, classify new images
Machine Learning Challenges

! Highly computation intensive and iterative
! Many traditional numerical processing systems do not scale to very large datasets
  – e.g., MATLAB
MLlib: Machine Learning on Spark

! MLlib is part of Apache Spark
! Includes many common ML functions
  – ALS (alternating least squares)
  – k-means
  – Logistic Regression
  – Linear Regression
  – Gradient Descent
! Still a 'work in progress'
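For instance, MLlib's k-means can be called in a few lines from PySpark. A minimal sketch, assuming the Spark 1.x pyspark.mllib API and a hypothetical whitespace-delimited text file of numeric features:

from pyspark.mllib.clustering import KMeans

# Parse a (hypothetical) text file of whitespace-delimited numeric features
data = sc.textFile("features.txt") \
    .map(lambda line: [float(x) for x in line.split()])

# Train a k-means model with k=5 clusters
model = KMeans.train(data, 5, maxIterations=10)
print model.clusterCenters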
Chapter Topics

Common Programming Patterns in Spark (Solving Business Problems with Spark)

!! Common Spark Use Cases
!! Iterative Algorithms in Spark
!! Graph Processing and Analysis
!! Machine Learning
!! Example: k-means
!! Conclusion
!! Hands-On Exercise: Iterative Processing in Spark
k-means Clustering

! k-means Clustering
  – A common iterative algorithm used in graph analysis and machine learning
  – You will implement a simplified version in the Hands-On Exercises
Clustering (1)

Clustering (2)

Goal: Find "clusters" of data points
Example: k-means Clustering (1)

1.  Choose K random points as starting centers

Example: k-means Clustering (2)

1.  Choose K random points as starting centers
2.  Find all points closest to each center

Example: k-means Clustering (3)

1.  Choose K random points as starting centers
2.  Find all points closest to each center
3.  Find the center (mean) of each cluster

Example: k-means Clustering (4)

1.  Choose K random points as starting centers
2.  Find all points closest to each center
3.  Find the center (mean) of each cluster
4.  If the centers changed, iterate again
Example: k-means Clustering (9)

1.  Choose K random points as starting centers
2.  Find all points closest to each center
3.  Find the center (mean) of each cluster
4.  If the centers changed, iterate again
…
5.  Done!
Example: Approximate k-means Clustering

1.  Choose K random points as starting centers
2.  Find all points closest to each center
3.  Find the center (mean) of each cluster
4.  If the centers changed by more than c, iterate again
…
5.  Close enough!
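A minimal sketch of this approximate algorithm in PySpark, in the spirit of the Hands-On Exercise. It assumes 2-D points, Python 2 syntax, and hypothetical helpers closestPoint (index of the nearest center) and distance; it is not the exercise solution itself.

# Hypothetical helpers, assumed defined elsewhere:
#   closestPoint(p, centers) -> index of the center nearest to p
#   distance(p1, p2)         -> Euclidean distance between two points

K = 5
convergeDist = 0.1            # the 'c' threshold from the slide

points = sc.textFile("points.txt") \
    .map(lambda line: map(float, line.split())).cache()

centers = points.takeSample(False, K, 42)   # step 1: K random starting centers
tempDist = float('inf')

while tempDist > convergeDist:
    # Step 2: assign each point to its closest center
    closest = points.map(lambda p: (closestPoint(p, centers), (p, 1)))
    # Step 3: compute the mean of each cluster (element-wise sums, then divide)
    sums = closest.reduceByKey(
        lambda (p1, n1), (p2, n2): ([p1[0]+p2[0], p1[1]+p2[1]], n1+n2))
    newCenters = sums.map(
        lambda (i, (p, n)): (i, [p[0]/n, p[1]/n])).collectAsMap()
    # Step 4: how far did the centers move in total?
    tempDist = sum(distance(centers[i], newCenters[i]) for i in newCenters)
    for i in newCenters:
        centers[i] = newCenters[i]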
Chapter Topics

Common Programming Patterns in Spark (Solving Business Problems with Spark)

!! Common Spark Use Cases
!! Iterative Algorithms in Spark
!! Graph Processing and Analysis
!! Machine Learning
!! Example: k-means
!! Conclusion
!! Hands-On Exercise: Iterative Processing in Spark
Key Points

! Spark is especially suited to big data problems that require iteration
  – In-memory caching makes this very efficient
! Common in many types of analysis
  – e.g., common algorithms such as PageRank and k-means
! Spark includes specialized libraries to implement many common functions
  – GraphX
  – MLlib
! GraphX
  – Highly efficient graph analysis (similar to Pregel et al.) and graph construction, representation, and post-processing
! MLlib
  – Efficient, scalable functions for machine learning (e.g., logistic regression, k-means)
Chapter Topics

Common Programming Patterns in Spark (Solving Business Problems with Spark)

!! Common Spark Use Cases
!! Iterative Algorithms in Spark
!! Graph Processing and Analysis
!! Machine Learning
!! Example: k-means
!! Conclusion
!! Hands-On Exercise: Iterative Processing in Spark
Hands-On Exercise

! Hands-On Exercise: Iterative Processing in Spark
  – Implement k-means in Spark in order to identify clustered location data points from Loudacre device status logs
  – Find the geographic centers of device activity
! Please refer to the Hands-On Exercise Manual
Improving Spark Performance
Chapter 12

Course Chapters

Course Introduction
  !! Introduction
Introduction to Spark
  !! What is Apache Spark?
  !! Spark Basics
  !! Working With RDDs
Distributed Data Processing with Spark
  !! The Hadoop Distributed File System (HDFS)
  !! Running Spark on a Cluster
  !! Parallel Programming with Spark
  !! Caching and Persistence
  !! Writing Spark Applications
Solving Business Problems with Spark
  !! Spark Streaming
  !! Common Spark Algorithms
  !! Improving Spark Performance
  !! Spark, Hadoop, and the Enterprise Data Center
Course Conclusion
  !! Conclusion
Improving Spark Performance

In this chapter you will learn
! How to improve the performance of Spark programs using shared variables
! Some common performance issues and how to find and address them
Chapter Topics

Improving Performance (Solving Business Problems with Spark)

!! Shared Variables: Broadcast Variables
!! Hands-On Exercise: Using Broadcast Variables
!! Shared Variables: Accumulators
!! Hands-On Exercise: Using Accumulators
!! Common Performance Issues
!! Diagnosing Performance Problems
!! Conclusion
Broadcast Variables

! Broadcast variables are set by the driver and retrieved by the workers
! They are read-only after they have been set
! The first read of a broadcast variable retrieves and stores its value on the node

[Diagram: the driver program sets myVariable; each executor retrieves a copy]
Example: Match User IDs with Requested Page Titles

227.35.151.122 - 184 [16/Sep/2013:00:03:51 +0100] "GET /KBDOC-00183.html HTTP/1.0" 200 …
146.218.191.254 - 133 [16/Sep/2013:00:03:48 +0100] "GET /KBDOC-00188.html HTTP/1.0" 200 …
176.96.251.224 - 12379 [16/Sep/2013:00:02:29 +0100] "GET /KBDOC-00054.html HTTP/1.0" 16011 …
…

logs:                   pages:                    pagelogs:
(184, KBDOC-00183)      (KBDOC-00001, title1)     (184, title183)
(133, KBDOC-00188)      (KBDOC-00002, title2)     (133, title188)
(12379, KBDOC-00054)    (KBDOC-00003, title3)     (12379, title54)
…                       …                         …

KBDOC-00001: MeeToo 4.1 - Back up files
KBDOC-00002: Sorrento F24L - Change the phone ringtone and notification sound
KBDOC-00003: Sorrento F41L - overheating
…
Example: Join a Web Server Log with Page Titles

logs = sc.textFile(logfile).map(fn)
pages = sc.textFile(pagefile).map(fn)
pagelogs = logs.join(pages)

[Diagram: the logs and pages RDDs are joined (shuffled) to produce pagelogs]
Example: Pass a Small Table as a Parameter

logs = sc.textFile(logfile).map(fn)
pages = dict(map(fn,open(pagefile)))
pagelogs = logs.map(lambda (userid,pageid):
                    (userid,pages[pageid]))

[Diagram: the driver ships a copy of the pages dict with every task]
Example: Broadcast a Small Table

logs = sc.textFile(logfile).map(…)
pages = dict(map(fn,open(pagefile)))
pagesbc = sc.broadcast(pages)
pagelogs = logs.map(lambda (userid,pageid):
                    (userid,pagesbc.value[pageid]))

[Diagram: the pages dict is broadcast once to each worker instead of with every task]
Broadcast Variables

! Why use broadcast variables?
  – Use them to minimize the transfer of data over the network, which is usually the biggest bottleneck
  – Spark broadcast variables are distributed to worker nodes using a very efficient peer-to-peer algorithm
Chapter Topics

Improving Performance (Solving Business Problems with Spark)

!! Shared Variables: Broadcast Variables
!! Hands-On Exercise: Using Broadcast Variables
!! Shared Variables: Accumulators
!! Hands-On Exercise: Using Accumulators
!! Common Performance Issues
!! Diagnosing Performance Problems
!! Conclusion
Hands-On Exercise: Using Broadcast Variables

! Hands-On Exercise: Using Broadcast Variables
  – Filter web server logs for requests from selected devices
  – Use a broadcast variable for the list of target device models to filter
! Please refer to the Hands-On Exercise Manual
Chapter Topics

Improving Performance (Solving Business Problems with Spark)

!! Shared Variables: Broadcast Variables
!! Hands-On Exercise: Using Broadcast Variables
!! Shared Variables: Accumulators
!! Hands-On Exercise: Using Accumulators
!! Common Performance Issues
!! Diagnosing Performance Problems
!! Conclusion
Accumulators

! Accumulators are shared variables
  – Worker nodes can add to the value
  – Only the driver application can access the value

[Diagram: executors add (+) to myAccumulator; the driver program sets it and reads .value]
Accumulator Example: Average Word Length

! Example: Calculate the average length of all the words in a dataset

def addTotals(word, words, letters):
    words += 1                  # increment the word-count accumulator
    letters += len(word)        # add this word's length to the letter count

totalWords = sc.accumulator(0)
totalLetters = sc.accumulator(0.0)

words = sc.textFile(myfile) \
    .flatMap(lambda line: line.split())

words.foreach(lambda word: \
    addTotals(word, totalWords, totalLetters))

print "Average word length: ", \
    totalLetters.value / totalWords.value
More About Accumulators

! Accumulator values will be reported to the driver only once per task
  – If tasks must be rerun due to failure, Spark will correctly add only for the task which succeeds
! Only the driver can access the value
  – Updates are only sent to the master, not to all workers
  – Code will throw an exception if you use .value on worker nodes
! Supports the compound assignment operator, +=
! Can use integers or doubles
  – sc.accumulator(0)
  – sc.accumulator(0.0)
! Can customize to support any data type (see the sketch below)
  – Extend the AccumulatorParam class
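A minimal sketch of a custom accumulator in PySpark, extending AccumulatorParam as described above; the vector type here is only an illustration:

from pyspark.accumulators import AccumulatorParam

class VectorAccumulatorParam(AccumulatorParam):
    def zero(self, initialValue):
        # A 'zero' vector of the same size as the initial value
        return [0.0] * len(initialValue)
    def addInPlace(self, v1, v2):
        # Element-wise addition of two partial results
        for i in xrange(len(v1)):
            v1[i] += v2[i]
        return v1

vecAccum = sc.accumulator([0.0, 0.0, 0.0], VectorAccumulatorParam())
sc.parallelize([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]) \
  .foreach(lambda v: vecAccum.add(v))
print vecAccum.value        # [5.0, 7.0, 9.0]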
Chapter Topics

Improving Performance (Solving Business Problems with Spark)

!! Shared Variables: Broadcast Variables
!! Hands-On Exercise: Using Broadcast Variables
!! Shared Variables: Accumulators
!! Hands-On Exercise: Using Accumulators
!! Common Performance Issues
!! Diagnosing Performance Problems
!! Conclusion
Hands-On Exercise: Using Accumulators

! Hands-On Exercise: Using Accumulators
  – Use accumulator variables to count the number of requests for different types of files in a set of web server logs
! Please refer to the Hands-On Exercise Manual
Chapter Topics

Improving Performance (Solving Business Problems with Spark)

!! Shared Variables: Broadcast Variables
!! Hands-On Exercise: Using Broadcast Variables
!! Shared Variables: Accumulators
!! Hands-On Exercise: Using Accumulators
!! Common Performance Issues
!! Diagnosing Performance Problems
!! Conclusion
Performance Issue: Serialization

! Serialization affects
  – Network bandwidth
  – Memory (save memory by serializing)
! The default method of serialization in Spark is basic Java serialization
  – Simple but slow
Using Kryo Serialization

! Use Kryo serialization for Scala and Java
  – To enable, set spark.serializer = spark.KryoSerializer
! To enable Kryo for your custom classes
  – Create a KryoRegistrator class and set
    spark.kryo.registrator=MyRegistrator
  – Register your classes with Kryo

class MyRegistrator extends spark.KryoRegistrator {
  def registerClasses(kryo: Kryo) {
    kryo.register(classOf[MyClass1])
    kryo.register(classOf[MyClass2])
  }
}
Performance Issue: Small Partitions

! Problem: filter() can result in partitions with small amounts of data
  – Results in many small tasks

sc.textFile(file) \
    .filter(lambda s: s.startswith('I')) \
    .map(lambda s: \
        (s.split()[0],(s.split()[1],s.split()[2])))

[Diagram: after filtering, some partitions of the RDD are nearly empty]
Solution: Repartition/Coalesce

! Solution: repartition(n)
  – This is the same as coalesce(n, shuffle=true) (see the sketch below)

sc.textFile(file) \
    .filter(lambda s: s.startswith('I')) \
    .repartition(3) \
    .map(lambda s: \
        (s.split()[0],(s.split()[1],s.split()[2])))

[Diagram: the filtered RDD is rebalanced into three evenly sized partitions]
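The distinction matters in practice; a brief sketch of when to prefer each, assuming a hypothetical myrdd:

# coalesce(n) merges existing partitions without a full shuffle -
# cheap, but can leave partitions unbalanced
fewer = myrdd.coalesce(3)

# repartition(n) is coalesce(n, shuffle=True): a full shuffle that
# rebalances data evenly across the new partitions
rebalanced = myrdd.repartition(3)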
Performance Issue: Passing Too Much Data in Functions

! Problem: Passing large amounts of data to parallel functions results in poor performance

hashmap = some_massive_hash_map()

myrdd.map(lambda x: hashmap[x]).countByValue()
Performance Issue: Passing Too Much Data in Functions

! Solution:
  – If the data is relatively small, use a broadcast variable

hashmap = some_massive_hash_map()
bhashmap = sc.broadcast(hashmap)

myrdd.map(lambda x: bhashmap.value[x]).countByValue()

  – If the data is very large, parallelize it into an RDD

hashmap = some_massive_hash_map()
hashmaprdd = sc.parallelize(hashmap.items())

myrdd.join(hashmaprdd).countByValue()
Chapter Topics

Improving Performance (Solving Business Problems with Spark)

!! Shared Variables: Broadcast Variables
!! Hands-On Exercise: Using Broadcast Variables
!! Shared Variables: Accumulators
!! Hands-On Exercise: Using Accumulators
!! Common Performance Issues
!! Diagnosing Performance Problems
!! Conclusion
Diagnosing Performance Issues (1)

! The Spark Application UI provides useful metrics to find performance problems

[Screenshot: the Stage Details page of the Spark Application UI]
Diagnosing Performance Issues (2)

! Where to look for performance issues
  – Scheduling and launching tasks
  – Task execution
  – Shuffling
  – Collecting data
Scheduling and Launching Issues

! Scheduling and launching taking too long?
  – Are you passing too much data to tasks?
    – myrdd.map(lambda x: HugeLookupTable(x))
  – Use a broadcast variable or an RDD
Task Execution Issues (1)

! Task execution taking too long?
  – Are there tasks with a very high per-record overhead?
    – e.g., mydata.map(dbLookup)
    – Each lookup call opens a connection to the DB, reads, and closes
  – Try mapPartitions (see the sketch below)
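A minimal sketch of the mapPartitions approach, assuming hypothetical db_connect() and conn.lookup() helpers; the point is that the connection is opened once per partition rather than once per record:

def lookupPartition(records):
    conn = db_connect()            # one connection per partition (hypothetical helper)
    for record in records:
        yield conn.lookup(record)  # reuse the connection for every record
    conn.close()

results = mydata.mapPartitions(lookupPartition)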
Task Execution Issues (2)

! Are a few tasks taking much more time than others?
  – Repartition, partition on a different key, or write a custom partitioner

[Screenshot: task durations should be fairly even; outliers such as empty partitions due to filtering stand out]
Shuffle Issues

! Writing shuffle results taking too long?
  – Make sure you have enough memory for the buffer cache
  – Make sure spark.local.dir is a local disk, ideally dedicated

[Screenshot: shuffle output saves to disk if too big for the buffer cache – look for big write times]
Collecting Data to the Driver

! Are results taking too long?
  – Beware of returning large amounts of data to the driver, for example with collect()
  – Process data on the workers, not the driver
  – Save large results to HDFS (see the sketch below)
  – Watch for disproportionate result serialization times
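A minimal sketch of the idea, assuming a hypothetical result RDD and HDFS output path:

# Anti-pattern: pulls the whole dataset back into the driver's memory
# for row in result.collect(): print row

# Better: write large results out in parallel from the workers
result.saveAsTextFile("hdfs://namenode/user/training/output")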
Performance Analysis and Monitoring

! Spark supports integration with other performance tools
  – Configurable metrics system built on the Coda Hale Metrics Library
  – Metrics can be
    – Saved to files
    – Output to the console
    – Viewed in the JMX console
    – Sent to reporting tools like Graphite or Ganglia
Chapter Topics

Improving Performance (Solving Business Problems with Spark)

!! Shared Variables: Broadcast Variables
!! Hands-On Exercise: Using Broadcast Variables
!! Shared Variables: Accumulators
!! Hands-On Exercise: Using Accumulators
!! Common Performance Issues
!! Diagnosing Performance Problems
!! Conclusion
Key Points

! Network bandwidth is often the major bottleneck
! For best performance, minimize data shuffling between workers
! Broadcast variables allow you to copy data to each worker once
  – Use instead of an RDD for small datasets
! Accumulators allow workers to update a shared variable locally
! Use Kryo serialization instead of default Scala/Java serialization to speed up network copy of data and save memory
! Repartition to avoid unbalanced or very small partitions across nodes
Spark, Hadoop, and the Enterprise Data Center
Chapter 13

Course Chapters

Course Introduction
  !! Introduction
Introduction to Spark
  !! Why Spark?
  !! Spark Basics
  !! Working With RDDs
Distributed Data Processing with Spark
  !! The Hadoop Distributed File System (HDFS)
  !! Running Spark on a Cluster
  !! Parallel Programming with Spark
  !! Caching and Persistence
  !! Writing Spark Applications
Solving Business Problems with Spark
  !! Spark Streaming
  !! Common Patterns in Spark Programming
  !! Improving Spark Performance
  !! Spark, Hadoop, and the Enterprise Data Center
Course Conclusion
  !! Conclusion
Spark and the Enterprise Data Center

In this chapter you will learn
! How Spark and Hadoop work together to provide enterprise-level data processing and analysis
! How to integrate Spark and Hadoop into an existing enterprise data center
Chapter Topics

Spark, Hadoop and the Enterprise Data Center (Solving Business Problems with Spark)

!! The Spark Hadoop Overview
!! Spark and MapReduce
!! Spark and the Hadoop Ecosystem
!! Putting It All Together: Integrating the Enterprise Data Center
!! Conclusion
!! Hands-On Exercise: Importing RDBMS Data Into Spark
The Spark Stack

! In addition to the core Spark engine, there are an ever-growing number of related projects
! Sometimes called the Berkeley Data Analytics Stack (BDAS)

[Diagram: Spark Streaming, Shark (SQL), MLlib (Machine Learning), GraphX (Graph Processing), and SparkR (Statistics), all running on Spark Core]
Spark and Hadoop (1)

! Spark was created to complement, not replace, Hadoop

[Diagram: the Spark stack (Spark Streaming, Shark, MLlib, GraphX, SparkR on Spark Core) alongside Hadoop components (Cloudera Impala, Hive, HBase, Search, MapReduce, …), all on HDFS and YARN]
Spark and Hadoop (2)

! Spark uses HDFS
  – Can use any Hadoop data source
  – Uses Hadoop InputFormats and OutputFormats
  – This means it can manipulate e.g., Avro files and SequenceFiles
! Spark runs on YARN
  – Can run on the same cluster with MapReduce jobs, Impala, etc.
! Spark works with the Hadoop ecosystem
  – Flume
  – Sqoop
  – HBase
  – …
Example: Yahoo

! Example use case: Yahoo is a major user of Hadoop
  – Uses Hadoop for personalization, collaborative filtering, ad analytics…
! MapReduce couldn't keep up
  – Highly iterative machine learning algorithms
! Moved iterative processing to Spark

[Diagram: before – MapReduce batch processing on YARN over HDFS and HBase; after – MapReduce batch processing plus Spark iterative processing on the same YARN/HDFS/HBase cluster]
Chapter Topics

Spark, Hadoop and the Enterprise Data Center (Solving Business Problems with Spark)

!! The Spark Hadoop Overview
!! Spark and MapReduce
!! Spark and the Hadoop Ecosystem
!! Putting It All Together: Integrating the Enterprise Data Center
!! Conclusion
!! Hands-On Exercise: Importing RDBMS Data Into Spark
Spark vs. Hadoop MapReduce

! Hadoop MapReduce
  – Widely used, huge investment already made
  – Supports and supported by many complementary tools
  – Mature, stable, well-tested technology
  – Skilled developers available
! Spark
  – Flexible
  – Elegant
  – Fast
  – Changing rapidly
Sharing Data Between Spark and MapReduce Jobs

! Apache Avro is a binary file format for saving datasets
! Hadoop SequenceFiles are similar; used by many existing Hadoop data centers
! Both are supported by Spark

[Diagram: Spark and MapReduce jobs both reading and writing (key,value) records in HDFS]
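A minimal sketch of the SequenceFile round trip from PySpark, assuming Spark 1.1 or later (when SequenceFile support arrived in the Python API) and a hypothetical HDFS path:

# Write (key, value) pairs where MapReduce jobs can read them...
pairs = sc.parallelize([("page1", 3), ("page2", 1)])
pairs.saveAsSequenceFile("shared/pagecounts")

# ...and read pairs that a MapReduce job produced
counts = sc.sequenceFile("shared/pagecounts")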
Chapter Topics

Spark, Hadoop and the Enterprise Data Center (Solving Business Problems with Spark)

!! The Spark Hadoop Overview
!! Spark and MapReduce
!! Spark and the Hadoop Ecosystem
!! Putting It All Together: Integrating the Enterprise Data Center
!! Conclusion
!! Hands-On Exercise: Importing RDBMS Data Into Spark
The Hadoop Ecosystem

! In addition to HDFS and MapReduce, the Hadoop ecosystem includes many additional components
! Some that may be of particular interest to Spark developers
  – Data Storage: HBase
  – Data Analysis: Hive and Impala
  – Data Integration: Flume and Sqoop
Data Storage: HBase – The Hadoop Database

! HBase: database layered on top of HDFS
  – Provides interactive access to data
! Stores massive amounts of data
  – Petabytes+
! High throughput
  – Thousands of writes per second (per node)
! Handles sparse data well
  – No wasted space for a row with empty columns
! Limited access model
  – Optimized for lookup of a row by key rather than full queries
  – No transactions: single row operations only
Data Analysis: Hive

! What is Hive?
  – Open source Apache project
  – Built on Hadoop MapReduce
  – HiveQL: An SQL-like interface to Hadoop

SELECT * FROM purchases WHERE price > 10000 ORDER BY storeid

! Very active work is currently ongoing to port Hive's execution engine to Spark
  – Will be able to use either MapReduce or Spark to execute queries
Data Analysis: Impala

! High-performance SQL engine for vast amounts of data
  – Similar query language to HiveQL
  – 10 to 50+ times faster than Hive or MapReduce
! Impala runs on Hadoop clusters
  – Data stored in HDFS
  – Dedicated SQL engine; does not depend on Spark, MapReduce, or Hive
! Developed by Cloudera
  – 100% open source, released under the Apache software license
Data Integration: Flume (1)

! What is Flume?
  – A service to move large amounts of data in real time
  – Example: storing log files in HDFS
! Flume is
  – Distributed
  – Reliable and available
  – Horizontally scalable
  – Extensible
! Spark Streaming is integrated with Flume
Data Integration: Flume (2)

•  Collect data as it is produced
   •  Files, syslogs, stdout, or a custom source
•  Process in place
   •  e.g., encrypt, compress
•  Pre-process data before storing
   •  e.g., transform, scrub, enrich
•  Write in parallel
   •  Scalable throughput
•  Store in any format
   •  Text, compressed, binary, or custom sink

[Diagram: a fan-in topology of Flume agents feeding HDFS and Spark Streaming]
Data Integration: Sqoop – SQL to Hadoop (1)

! Typical scenario: data stored in an RDBMS is needed in a Spark application
  – Lookup tables
  – Legacy data
! Possible to read directly from an RDBMS in your Spark application
  – Can lead to the equivalent of a distributed denial of service (DDoS) attack on your RDBMS
  – In practice – don't do it!
! Better idea: use Sqoop to import the data into HDFS beforehand

[Diagram: RDBMS → sqoop → HDFS]
Data Integration: Sqoop – SQL to Hadoop (2)

! Sqoop: open source tool originally written at Cloudera
  – Now a top-level Apache Software Foundation project
! Imports tables from an RDBMS into HDFS
  – Just one table, all tables, or portions of a table
  – Uses MapReduce to actually import the data
! Uses a JDBC interface
  – Works with virtually any JDBC-compatible database
! Imports data to HDFS as delimited text files or SequenceFiles
  – Default is comma-delimited text files
! Can be used for incremental data imports
  – First import retrieves all rows in a table
  – Subsequent imports retrieve just rows created since the last import
Custom Sqoop Connectors

! Cloudera has partnered with other organizations to create custom Sqoop connectors
  – Use a database's native protocols rather than JDBC
  – Provides much faster performance
! Current systems supported by custom connectors include:
  – Netezza
  – Teradata
  – Oracle Database (connector developed with Quest Software)
! Others are in development
! Custom connectors are not open source, but are free
  – Available from the Cloudera Web site
Sqoop: Basic Syntax

! Standard syntax:

$ sqoop tool-name [tool-options]

! Tools include:
  import
  import-all-tables
  list-tables
! Options include:
  --connect
  --username
  --password
Sqoop: Example

! Example: import a table called employees from a database called personnel in a MySQL RDBMS

$ sqoop import --username fred --password derf \
    --connect jdbc:mysql://database.example.com/personnel \
    --table employees

! Example: as above, but only records with an ID greater than 1000

$ sqoop import --username fred --password derf \
    --connect jdbc:mysql://database.example.com/personnel \
    --table employees \
    --where "id > 1000"
Chapter Topics

Spark, Hadoop and the Enterprise Data Center (Solving Business Problems with Spark)

!! The Spark Hadoop Overview
!! Spark and MapReduce
!! Spark and the Hadoop Ecosystem
!! Putting It All Together: Integrating the Enterprise Data Center
!! Conclusion
!! Hands-On Exercise: Importing RDBMS Data Into Spark
Typical RDBMS Scenario

! Typical scenario:
  – Interactive RDBMS serves queries from a web site
  – Data is extracted and loaded into a data warehouse for processing and archiving

[Diagram: web server logs, orders, and site content flow into an OLTP RDBMS; an Extract/Transform/Load pipeline feeds an OLAP Enterprise Data Warehouse, which serves Business Intelligence tools and an archive]

OLTP: Online Transaction Processing
OLAP: Online Analytical Processing
OLAP Database Limitations

! All dimensions must be prematerialized
  – Re-materialization can be very time consuming
! Daily data load-in times can increase
  – Typically this leads to some data being discarded
Using Spark and Hadoop to Augment Existing Databases

! With Spark and Hadoop you can store and process all your data
  – The 'Enterprise Data Hub'
! Reserve EDW space for high value data

[Diagram: web server logs, orders, and site content flow from the OLTP RDBMS into HDFS/HBase, where Spark and Hadoop handle ETL and recommendations; high-value data continues on to the OLAP Enterprise Data Warehouse and BI tools]
Benefits of Spark and Hadoop Over RDBMSs

! Processing power scales with data storage
  – As you add more nodes for storage, you get more processing power 'for free'
! Views do not need prematerialization
  – Ad-hoc full or partial dataset queries are possible
! Total query size can be multiple petabytes
Traditional High-Performance File Servers

! Enterprise data is often held on large file servers, such as products from
  – NetApp
  – EMC
! Advantages
  – Fast random access
  – Many concurrent clients
! Disadvantages
  – High cost per terabyte of storage
File Servers and HDFS

! Choice of storage depends on the expected access patterns
  – Sequentially read, append-only data: HDFS
  – Random access: file server
! HDFS can crunch sequential data faster
! Offloading data to HDFS leaves more room on file servers for 'interactive' data
! Use the right tool for the job!
Chapter Topics

Spark, Hadoop and the Enterprise Data Center (Solving Business Problems with Spark)

!! The Spark Hadoop Overview
!! Spark and MapReduce
!! Spark and the Hadoop Ecosystem
!! Putting It All Together: Integrating the Enterprise Data Center
!! Conclusion
!! Hands-On Exercise: Importing RDBMS Data Into Spark
Key Points

! Spark complements Hadoop MapReduce
! Spark works with other Hadoop ecosystem projects
  – HBase – the Hadoop NoSQL database
  – Hive – SQL-like access to Hadoop data
  – Impala – high-speed SQL query engine
  – Flume – real-time data import
  – Sqoop – RDBMS to (and from) HDFS
! Spark and Hadoop together can help you make your data center faster and cheaper
  – Offload ETL processing
  – Use all your data
Chapter Topics

Spark, Hadoop and the Enterprise Data Center (Solving Business Problems with Spark)

!! The Spark Hadoop Overview
!! Spark and MapReduce
!! Spark and the Hadoop Ecosystem
!! Putting It All Together: Integrating the Enterprise Data Center
!! Conclusion
!! Hands-On Exercise: Importing RDBMS Data Into Spark
Hands-On Exercise: Importing RDBMS Data Into Spark

! Hands-On Exercise: Importing RDBMS Data Into Spark
  – Import movies and movie ratings from MySQL to HDFS and load them into Spark RDDs
  – Calculate and save average movie ratings
! Please refer to the Hands-On Exercise Manual
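In the spirit of that exercise, a minimal sketch of loading a Sqoop import into an RDD and averaging ratings. The HDFS path and the (movieid,rating) column layout are hypothetical; Sqoop's default comma-delimited text output is assumed.

# Load Sqoop's comma-delimited output and compute the average rating per movie
ratings = sc.textFile("movierating") \
    .map(lambda line: line.split(',')) \
    .map(lambda fields: (fields[0], (float(fields[1]), 1)))

avgs = ratings \
    .reduceByKey(lambda (s1, n1), (s2, n2): (s1 + s2, n1 + n2)) \
    .map(lambda (movie, (total, count)): (movie, total / count))

avgs.saveAsTextFile("avgratings")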
Conclusion
Chapter 14

Course Chapters

Course Introduction
  !! Introduction
Introduction to Spark
  !! What is Apache Spark?
  !! Spark Basics
  !! Working With RDDs
Distributed Data Processing with Spark
  !! The Hadoop Distributed File System (HDFS)
  !! Running Spark on a Cluster
  !! Parallel Programming with Spark
  !! Caching and Persistence
  !! Writing Spark Applications
Solving Business Problems with Spark
  !! Spark Streaming
  !! Common Patterns in Spark Programming
  !! Improving Spark Performance
  !! Spark, Hadoop, and the Enterprise Data Center
Course Conclusion
  !! Conclusion
Conclusion (1)

During this course, you have learned
! What Apache Spark is, what problems it solves, and why you would want to use it
! The basic programming concepts of Spark: operations on Resilient Distributed Datasets (RDDs)
! How Spark works to distribute processing of big data across a cluster
! How Spark interacts with other components of a big data system: data storage and cluster resource management
! How to take advantage of key Spark features such as caching and shared variables to improve performance
! How to use Spark – either interactively using a Spark Shell or by writing your own Spark applications
Conclusion (2)

! How to use Spark Streaming to process a live data stream in real time
! How Spark integrates with other parts of the Hadoop Ecosystem to provide enterprise-level data processing
Next Steps

! Cloudera offers a number of other training courses, including:
  – Cloudera Hadoop Essentials
  – Cloudera Administrator Training for Apache Hadoop
  – Cloudera Developer Training for Apache Hadoop
  – Designing and Building Big Data Applications
  – Cloudera Data Analyst Training: Using Pig, Hive, and Impala with Hadoop
  – Cloudera Training for Apache HBase
  – Introduction to Data Science: Building Recommender Systems
  – Custom courses
! Cloudera also provides consultancy and troubleshooting services
  – Please ask your instructor for more information
Class Evaluation

! Please take a few minutes to complete the class evaluation
  – Your instructor will show you how to access the online form
Thank You!

! Thank you for attending this course
! If you have any further questions or comments, please feel free to contact us
  – Full contact details are on our Web site at
    http://www.cloudera.com/
