Cloudera Developer Training for Apache Spark
201409

Introduction
Chapter 1
Course Chapters

- Introduction (Course Introduction)
- Why Spark?
- Spark Basics (Introduction to Spark)
- Working With RDDs
- The Hadoop Distributed File System (HDFS)
- Running Spark on a Cluster
- Parallel Programming with Spark (Distributed Data Processing with Spark)
- Caching and Persistence
- Writing Spark Applications
- Spark Streaming
- Common Patterns in Spark Programming (Solving Business Problems with Spark)
- Improving Spark Performance
- Spark, Hadoop, and the Enterprise Data Center
- Conclusion (Course Conclusion)

© Copyright 2010-2015 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
Chapter Topics

Introduction (Course Introduction)

- About This Course
- About Cloudera
- Course Logistics
Course Objectives (1)

During this course, you will learn

- What Apache Spark is, what problems it solves, and why you would want to use it
- The basic programming concepts of Spark: operations on Resilient Distributed Datasets (RDDs)
- How Spark works to distribute processing of big data across a cluster
- How Spark interacts with other components of a big data system: data storage and cluster resource management
- How to take advantage of key Spark features such as caching and shared variables to improve performance
- How to use Spark, either interactively using a Spark Shell or by writing your own Spark applications
Course Objectives (2)

- How to use Spark Streaming to process a live data stream in real time
- How Spark integrates with other parts of the Hadoop ecosystem to provide enterprise-level data processing
Chapter Topics

Introduction (Course Introduction)

- About This Course
- About Cloudera
- Course Logistics
About Cloudera (1)

- The leader in Apache Spark- and Hadoop-based software and services
- Founded by leading experts on big data processing from Facebook, Yahoo, Google, and Oracle
- Provides support, consulting, training, and certification
- Staff includes committers and contributors to virtually all Hadoop and Spark projects
- Many authors of industry-standard books on Apache Hadoop projects
  - Tom White, Lars George, Kathleen Ting, etc.
About Cloudera (2)

- Customers include:
  - Allstate, AOL Advertising, Box, CBS Interactive, eBay, Experian, Groupon, National Cancer Institute, Orbitz, Social Security Administration, Trend Micro, Trulia, US Army, …
- Cloudera public training:
  - Cloudera Developer Training for Apache Spark
  - Cloudera Developer Training for Apache Hadoop
  - Designing and Building Big Data Applications
  - Cloudera Administrator Training for Apache Hadoop
  - Cloudera Data Analyst Training: Using Pig, Hive, and Impala with Hadoop
  - Cloudera Training for Apache HBase
  - Introduction to Data Science: Building Recommender Systems
  - Cloudera Essentials for Apache Hadoop
- Onsite and custom training is also available
CDH

- CDH
  - 100% open source, enterprise-ready distribution of Hadoop and related projects
  - The most complete, tested, and widely deployed distribution of Hadoop
  - Integrates all key Spark and Hadoop ecosystem projects
Cloudera Express

- Cloudera Express
  - Free download
- The best way to get started with Spark and Hadoop
- Includes CDH
- Includes Cloudera Manager
  - End-to-end administration
  - Deploy, manage, and monitor your cluster
Cloudera Enterprise

- Cloudera Enterprise
  - Subscription product including CDH and Cloudera Manager
- Includes support
- Includes extra Cloudera Manager features
  - Configuration history and rollbacks
  - Rolling updates
  - LDAP integration
  - SNMP support
  - Automated disaster recovery
  - Etc.
Chapter Topics

Introduction (Course Introduction)

- About This Course
- About Cloudera
- Course Logistics
Logistics

- Course start and end times
- Lunch
- Breaks
- Restrooms
- Can I come in early/stay late?
- Access to the course materials
Introductions

- About your instructor
- About you
  - Experience with Spark or Hadoop?
  - Experience as a developer?
  - What programming languages do you usually use?
  - What programming language will you use in this course?
  - Expectations from the course?
Why Spark?
Chapter 2
Why Spark?

In this chapter you will learn

- What problems exist with traditional large-scale computing systems
- How Spark addresses those issues
- Some typical big data questions Spark can be used to answer
Chapter Topics

Why Spark? (Introduction to Spark)

- Problems with Traditional Large-scale Systems
- Spark!
- Conclusion
Traditional Large-Scale Computation

- Traditionally, computation has been processor-bound
  - Relatively small amounts of data
  - Lots of complex processing
- The early solution: bigger computers
  - Faster processor, more memory
  - But even this couldn't keep up
Distributed Systems

- The better solution: more computers
  - Distributed systems: use multiple machines for a single job

"In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, we didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers."
  – Grace Hopper
Distributed Systems: Challenges

- Challenges with distributed systems
  - Programming complexity
  - Keeping data and processes in sync
  - Finite bandwidth
  - Partial failures
Distributed Systems: The Data Bottleneck (1)

- Traditionally, data is stored in a central location
- Data is copied to processors at runtime
- Fine for limited amounts of data
Distributed Systems: The Data Bottleneck (2)

- Modern systems have much more data
  - Terabytes+ a day
  - Petabytes+ total
- We need a new approach…
Big Data Processing

- Hadoop introduced a radical new approach based on two key concepts
  - Distribute the data when it is stored
  - Run computation where the data is
- Spark takes this new approach to the next level
  - Data is distributed in memory
Chapter Topics

Why Spark? (Introduction to Spark)

- Problems with Traditional Large-scale Systems
- Spark!
- Conclusion
Introducing Apache Spark

- Apache Spark is a fast, general engine for large-scale data processing on a cluster
- Originally developed at AMPLab at UC Berkeley
  - Started as a research project in 2009
- Open source Apache project
  - Committers from Cloudera, Yahoo, Databricks, UC Berkeley, Intel, Groupon, …
  - One of the most active and fastest-growing Apache projects
  - Cloudera provides enterprise-level support for Spark
Distributed Processing with the Spark Framework

[Diagram: the Spark API sits on top of a cluster computing layer (Spark Standalone, YARN, or Mesos) and a storage layer (HDFS, the Hadoop Distributed File System).]
Advantages of Spark

- High-level programming framework
  - Programmers can focus on logic, not plumbing
- Cluster computing
  - Application processes are distributed across a cluster of worker nodes
  - Managed by a single "master"
  - Scalable and fault tolerant
- Distributed storage
  - Data is distributed when it is stored
  - Replicated for efficiency and fault tolerance
  - "Bring the computation to the data"
- Data in memory
  - Configurable caching for efficient iteration
Scalability

- Increasing load results in a graceful decline in performance
  - Not failure of the system
- Adding nodes adds capacity proportionally

[Chart: capacity grows linearly with the number of nodes.]
Fault Tolerance

- Node failure is inevitable
- What happens?
  - System continues to function
  - Master reassigns tasks to a different node
  - Data replication means no loss of data
  - Nodes which recover rejoin the cluster automatically
Who Uses Spark?

- Yahoo!
  - Personalization and ad analytics
- Conviva
  - Real-time video stream optimization
- Technicolor
  - Real-time analytics for telco clients
- Ooyala
  - Cross-device personalized video experience
- Plus…
  - Intel, Groupon, TrendMicro, Autodesk, Nokia, Shopify, ClearStory, Technicolor, and many more…
Common Spark Use Cases

- Extract/Transform/Load (ETL)
- Text mining
- Index building
- Graph creation and analysis
- Pattern recognition
- Collaborative filtering
- Prediction models
- Sentiment analysis
- Risk assessment

- What do these workloads have in common? The nature of the data…
  - Volume
  - Velocity
  - Variety
Benefits of Spark

- Previously impossible or impractical analysis
- Lower cost
- Less time
- Greater flexibility
- Near-linear scalability
Spark v. Hadoop MapReduce

- Spark takes the concepts of MapReduce to the next level
  - In-memory data storage = up to 100x performance improvement

Word count in Spark (the final reduceByKey step is completed here; the slide's snippet was truncated):

    sc.textFile(file) \
        .flatMap(lambda s: s.split()) \
        .map(lambda w: (w,1)) \
        .reduceByKey(lambda v1, v2: v1 + v2)

The equivalent driver code in Hadoop MapReduce (mapper and reducer classes not shown):

    public static void main(String[] args) throws Exception {
        Job job = new Job();
        job.setJarByClass(WordCount.class);
        job.setJobName("Word Count");
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        boolean success = job.waitForCompletion(true);
        System.exit(success ? 0 : 1);
    }
Chapter Topics

Why Spark? (Introduction to Spark)

- Problems with Traditional Large-scale Systems
- Spark!
- Conclusion
Key Points

- Traditional large-scale computing involved complex processing on small amounts of data
- Exponential growth in data drove development of distributed computing
- Distributed computing is difficult!
- Spark addresses big data distributed computing challenges
  - Bring the computation to the data
  - Fault tolerance
  - Scalability
  - Hides the 'plumbing' so developers can focus on the data
  - Caches data in memory
Spark Basics
Chapter 3
Spark Basics

In this chapter you will learn

- How to start the Spark Shell
- About the SparkContext
- Key concepts of Resilient Distributed Datasets (RDDs)
  - What are they?
  - How do you create them?
  - What operations can you perform with them?
- How Spark uses the principles of functional programming
- About the Hands-On Exercises for the course
Chapter Topics

Spark Basics (Introduction to Spark)

- What is Apache Spark?
- Using the Spark Shell
- RDDs (Resilient Distributed Datasets)
- Functional Programming in Spark
- Conclusion
- Hands-On Exercises
What is Apache Spark?

- Apache Spark is a fast and general engine for large-scale data processing
- Written in Scala
  - A functional programming language that runs in a JVM
- Spark Shell
  - Interactive, for learning or data exploration
  - Python or Scala
- Spark applications
  - For large-scale data processing
  - Python, Scala, or Java
Chapter Topics

Spark Basics (Introduction to Spark)

- What is Apache Spark?
- Using the Spark Shell
- RDDs (Resilient Distributed Datasets)
- Functional Programming in Spark
- Conclusion
- Hands-On Exercises
Spark Shell

- The Spark Shell provides interactive data exploration (REPL: Read/Evaluate/Print Loop)
- Writing standalone Spark applications will be covered later

Python Shell: pyspark

    $ pyspark
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version 1.0.0
          /_/
    Using Python version 2.6.6 (r266:84292, Jan 22 2014 09:42:36)
    SparkContext available as sc.
    >>>

Scala Shell: spark-shell

    $ spark-shell
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 1.0.0
          /_/
    Using Scala version 2.10.3 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51)
    Created spark context..
    Spark context available as sc.
    scala>
Spark Context

- Every Spark application requires a SparkContext
  - The main entry point to the Spark API
- The Spark Shell provides a preconfigured SparkContext called sc

    scala> sc.appName
    res0: String = Spark shell
Chapter Topics

Spark Basics (Introduction to Spark)

- What is Apache Spark?
- Using the Spark Shell
- RDDs (Resilient Distributed Datasets)
- Functional Programming With Spark
- Conclusion
- Hands-On Exercise: Getting Started with RDDs
RDD (Resilient Distributed Dataset)

- RDD (Resilient Distributed Dataset)
  - Resilient: if data in memory is lost, it can be recreated
  - Distributed: stored in memory across the cluster
  - Dataset: initial data can come from a file or be created programmatically
- RDDs are the fundamental unit of data in Spark
- Most Spark programming consists of performing operations on RDDs
Creating an RDD

- Three ways to create an RDD
  - From a file or set of files
  - From data in memory
  - From another RDD
File-Based RDDs

- For file-based RDDs, use SparkContext.textFile
  - Accepts a single file, a wildcard list of files, or a comma-separated list of files
  - Examples:
    - sc.textFile("myfile.txt")
    - sc.textFile("mydata/*.log")
    - sc.textFile("myfile1.txt,myfile2.txt")
  - Each line in the file(s) is a separate record in the RDD
- Files are referenced by absolute or relative URI
  - Absolute URI: file:/home/training/myfile.txt
  - Relative URI (uses default file system): myfile.txt
Example: A File-Based RDD

File: purplecow.txt
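The slide's code did not survive extraction, but the behavior it illustrates can be sketched in plain Python (an analogue of the semantics, not the Spark API itself; the poem contents come from the Lazy Execution slides later in this chapter):

```python
# Plain-Python sketch of sc.textFile("purplecow.txt"):
# each line of the file becomes one record in the resulting dataset.
import os
import tempfile

poem = ("I've never seen a purple cow.\n"
        "I never hope to see one;\n"
        "But I can tell you, anyhow,\n"
        "I'd rather see than be one.\n")

path = os.path.join(tempfile.mkdtemp(), "purplecow.txt")
with open(path, "w") as f:
    f.write(poem)

# Analogous to sc.textFile(path).collect(): a list of line records
with open(path) as f:
    records = f.read().splitlines()

print(records[0])   # the first record is the first line of the file
```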
RDD Operations

- Two types of RDD operations
  - Actions – return values
  - Transformations – define a new RDD based on the current one(s)

[Diagram: an action turns an RDD into a value; a transformation turns a base RDD into a new RDD.]

- Quiz: which type of operation is count()?
RDD Operations: Actions

- Some common actions
  - count() – return the number of elements
  - take(n) – return an array of the first n elements
  - collect() – return an array of all elements
  - saveAsTextFile(filename) – save to text file(s)

Python:
    > mydata = sc.textFile("purplecow.txt")

Scala:
    > val mydata = sc.textFile("purplecow.txt")
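To make the return values concrete, here is a plain-Python analogue of these actions (illustration only, with a list standing in for the RDD; the poem data is the course's running example):

```python
# A list standing in for an RDD of lines; each action has a list analogue.
mydata = ["I've never seen a purple cow.",
          "I never hope to see one;",
          "But I can tell you, anyhow,",
          "I'd rather see than be one."]

n = len(mydata)           # like mydata.count()   -> number of records
first_two = mydata[:2]    # like mydata.take(2)   -> the first 2 records
all_recs = list(mydata)   # like mydata.collect() -> every record

print(n, first_two[1])
```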
RDD Operations: Transformations

- Transformations create a new RDD from an existing one
- RDDs are immutable
  - Data in an RDD is never changed
  - Transform in sequence to modify the data as needed
- Some common transformations
  - map(function) – creates a new RDD by performing a function on each record in the base RDD
  - filter(function) – creates a new RDD by including or excluding each record in the base RDD according to a boolean function
Example: map and filter Transformations

File contents:
    I've never seen a purple cow.
    I never hope to see one;
    But I can tell you, anyhow,
    I'd rather see than be one.
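The transformations this example illustrates can be sketched in plain Python, since list comprehensions have the same per-record semantics as map and filter (illustration only, not the Spark API):

```python
lines = ["I've never seen a purple cow.",
         "I never hope to see one;",
         "But I can tell you, anyhow,",
         "I'd rather see than be one."]

# like mydata.map(lambda line: line.upper())
upper = [line.upper() for line in lines]

# like mydata_uc.filter(lambda line: line.startswith('I'))
filtered = [line for line in upper if line.startswith("I")]

print(len(filtered))   # 3 of the 4 lines begin with 'I'
```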
Lazy Execution (1)

- RDDs are not always immediately materialized
  - Spark logs the lineage of transformations used to build datasets
- Data in RDDs is not processed until an action is performed

File: purplecow.txt
    I've never seen a purple cow.
    I never hope to see one;
    But I can tell you, anyhow,
    I'd rather see than be one.
Lazy Execution (2)

- Data in RDDs is not processed until an action is performed
  - The RDD is materialized in memory upon the first action that uses it

File: purplecow.txt
    I've never seen a purple cow.
    I never hope to see one;
    But I can tell you, anyhow,
    I'd rather see than be one.

    > mydata = sc.textFile("purplecow.txt")
    > mydata_uc = mydata.map(lambda line:
        line.upper())
    > mydata_filt = \
        mydata_uc.filter(lambda line: \
          line.startswith('I'))
    > mydata_filt.count()
    3

RDD: mydata
    I've never seen a purple cow.
    I never hope to see one;
    But I can tell you, anyhow,
    I'd rather see than be one.

RDD: mydata_uc
    I'VE NEVER SEEN A PURPLE COW.
    I NEVER HOPE TO SEE ONE;
    BUT I CAN TELL YOU, ANYHOW,
    I'D RATHER SEE THAN BE ONE.

RDD: mydata_filt
    I'VE NEVER SEEN A PURPLE COW.
    I NEVER HOPE TO SEE ONE;
    I'D RATHER SEE THAN BE ONE.
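Lazy evaluation can be demonstrated with plain-Python generators, which defer work the same way transformations do (a sketch of the concept, not the Spark implementation):

```python
processed = []   # records which lines have actually been processed

def to_upper(line):
    processed.append(line)      # side effect, only to observe execution
    return line.upper()

lines = ["I've never seen a purple cow.", "I never hope to see one;"]

# "Transformation": building the generator runs no processing yet
mydata_uc = (to_upper(line) for line in lines)
assert processed == []          # nothing materialized so far

# "Action": consuming the generator triggers the processing
result = list(mydata_uc)
print(len(processed))           # now both lines have been processed
```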
Chaining Transformations

- Transformations may be chained together (the code below is reconstructed from the running example on the preceding slides)

    > mydata = sc.textFile("purplecow.txt")
    > mydata_uc = mydata.map(lambda line: line.upper())
    > mydata_filt = mydata_uc.filter(lambda line: line.startswith('I'))
    > mydata_filt.count()
    3

is exactly equivalent to

    > sc.textFile("purplecow.txt") \
        .map(lambda line: line.upper()) \
        .filter(lambda line: line.startswith('I')) \
        .count()
    3
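The equivalence of stepwise and chained transformations can be checked with a plain-Python analogue (illustration only; Python's built-in map and filter stand in for the RDD methods):

```python
lines = ["I've never seen a purple cow.",
         "I never hope to see one;",
         "But I can tell you, anyhow,",
         "I'd rather see than be one."]

# Step by step, naming each intermediate result
upper = map(lambda line: line.upper(), lines)
filtered = filter(lambda line: line.startswith("I"), upper)
stepwise_count = len(list(filtered))

# Chained into a single expression: same result
chained_count = len([line for line in
                     map(lambda l: l.upper(), lines)
                     if line.startswith("I")])

print(stepwise_count, chained_count)
```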
Chapter Topics

Spark Basics (Introduction to Spark)

- What is Apache Spark?
- Using the Spark Shell
- RDDs (Resilient Distributed Datasets)
- Functional Programming in Spark
- Conclusion
- Hands-On Exercises
Functional Programming in Spark

- Spark depends heavily on the concepts of functional programming
  - Functions are the fundamental unit of programming
  - Functions have input and output only
  - No state or side effects
- Key concepts
  - Passing functions as input to other functions
  - Anonymous functions
Passing Functions as Parameters

- Many RDD operations take functions as parameters
- Pseudocode for the RDD map operation
  - Applies function fn to each record in the RDD

    RDD {
        map(fn(x)) {
            foreach record in rdd
                emit fn(record)
        }
    }
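The pseudocode above can be rendered directly in Python; a hypothetical rdd_map helper shows the shape of the operation:

```python
def rdd_map(records, fn):
    """Apply fn to each record, collecting the emitted results
    (a plain-Python rendering of the map pseudocode above)."""
    result = []
    for record in records:
        result.append(fn(record))   # "emit fn(record)"
    return result

print(rdd_map(["a", "b"], lambda s: s.upper()))
```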
Example: Passing Named Functions

- Python
- Scala
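In Python, passing a named function would look like the following (a reconstruction consistent with the course's running example; Python's built-in map stands in for the RDD method, and the function name is illustrative):

```python
def to_upper(line):
    """Named function to be passed as a parameter."""
    return line.upper()

lines = ["I've never seen a purple cow.", "I never hope to see one;"]

# With an RDD this would be: mydata.map(to_upper).take(2)
result = list(map(to_upper, lines))[:2]
print(result)
```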
Anonymous Functions

- Functions defined in-line without an identifier
  - Best for short, one-off functions
- Supported in many programming languages
  - Python: lambda x: ...
  - Scala: x => ...
  - Java 8: x -> ...
Example: Passing Anonymous Functions

Python (reconstructed to match the course's running example):

    > mydata.map(lambda line: line.upper()).take(2)

Scala:

    > mydata.map(line => line.toUpperCase()).take(2)

OR

    > mydata.map(_.toUpperCase()).take(2)

Scala allows anonymous parameters using underscore (_)
Example: Java

Java 7:

    ...
    JavaRDD<String> lines = sc.textFile("file");
    JavaRDD<String> lines_uc = lines.map(
        new MapFunction<String, String>() {
            public String call(String line) {
                return line.toUpperCase();
            }
        });
    ...

Java 8:

    ...
    JavaRDD<String> lines = sc.textFile("file");
    JavaRDD<String> lines_uc = lines.map(
        line -> line.toUpperCase());
    ...
Chapter Topics

Spark Basics (Introduction to Spark)

- What is Apache Spark?
- Using the Spark Shell
- RDDs (Resilient Distributed Datasets)
- Functional Programming With Spark
- Conclusion
- Hands-On Exercises
Key Points

- Spark can be used interactively via the Spark Shell
  - Python or Scala
  - Writing non-interactive Spark applications will be covered later
- RDDs (Resilient Distributed Datasets) are a key concept in Spark
- RDD operations
  - Transformations create a new RDD based on an existing one
  - Actions return a value from an RDD
- Lazy execution
  - Transformations are not executed until required by an action
- Spark uses functional programming
  - Passing functions as parameters
  - Anonymous functions in supported languages (Python and Scala)
Chapter Topics

Spark Basics (Introduction to Spark)

- What is Apache Spark?
- Using the Spark Shell
- RDDs (Resilient Distributed Datasets)
- Functional Programming With Spark
- Conclusion
- Hands-On Exercises
Introduction to Exercises: Getting Started

- Instructions are in the Hands-On Exercise Manual
- Start with
  - General Notes
  - Setting Up
Introduction to Exercises: Pick Your Language

- Your choice: Python or Scala
  - For most exercises in this course, you may choose to work with either Python or Scala
  - Exception: Spark Streaming material is currently presented only in Scala
  - Course examples are mostly presented in Python
- Solution and example files
  - .pyspark – Python shell commands
  - .scalaspark – Scala shell commands
  - .py – complete Python Spark applications
  - .scala – complete Scala Spark applications
Introduction to Exercises: Classroom Virtual Machine

- Your virtual machine
  - Log in as user training (password training)
  - Pre-installed and configured with
    - Spark and CDH
    - Various tools including Emacs, IntelliJ, and Maven
- Training materials: the ~/training_materials/sparkdev folder on the VM
  - data – sample datasets used in exercises
  - examples – all the example code in this course
  - solutions – solutions for Scala Shell and Python exercises
  - stubs – starter code required in some exercises
Introduction to Exercises: The Data

- Most exercises are based around a hypothetical company: Loudacre Mobile
  - A cellular telephone company
- Loudacre Mobile Customer Support has many sources of data they need to process, transform, and analyze
  - Customer account data
  - Web server logs from Loudacre's customer support website
  - New device activation records
  - Customer support Knowledge Base articles
  - Information about models of supported devices
Hands-On Exercises

- Now, please do the following three Hands-On Exercises
  1. Viewing the Spark Documentation
     - Familiarize yourself with the Spark documentation; you will refer to this documentation frequently during the course
  2. Using the Spark Shell
     - Follow the instructions for either the Python or Scala shell
  3. Getting Started with RDDs
     - Use either the Python or Scala Spark Shell to explore the Loudacre weblogs
- Please refer to the Hands-On Exercise Manual
Working With RDDs
Chapter 4
Working With RDDs

In this chapter you will learn

- How RDDs are created
- Additional RDD operations
- Special operations available on RDDs of key-value pairs
- How MapReduce algorithms are implemented in Spark
Chapter Topics

Working With RDDs (Introduction to Spark)

- A Closer Look at RDDs
- Key-Value Pair RDDs
- MapReduce
- Other Pair RDD Operations
- Conclusion
- Hands-On Exercise: Working with Pair RDDs
RDDs

- RDDs can hold any type of element
  - Primitive types: integers, characters, booleans, etc.
  - Sequence types: strings, lists, arrays, tuples, dicts, etc. (including nested data types)
  - Scala/Java objects (if serializable)
  - Mixed types
- Some types of RDDs have additional functionality
  - Pair RDDs
    - RDDs consisting of key-value pairs
  - Double RDDs
    - RDDs consisting of numeric data
Creating RDDs From Collections

- You can create RDDs from collections instead of files
  - sc.parallelize(collection)

    > import random
    > randomnumlist = \
        [random.uniform(0,10) for _ in xrange(10000)]
    > randomrdd = sc.parallelize(randomnumlist)
    > print "Mean: %f" % randomrdd.mean()

- Useful when
  - Testing
  - Generating data programmatically
  - Integrating
Some Other General RDD Operations
! Transformations
– flatMap – maps one element in the base RDD to multiple elements
– distinct – filter out duplicates
– union – add all elements of two RDDs into a single new RDD
! Other RDD operations
– first – return the first element of the RDD
– foreach – apply a function to each element in an RDD
– top(n) – return the largest n elements using natural ordering
! Sampling operations
– takeSample(withReplacement, num) – return an array of num sampled elements
! Double RDD operations
– Statistical functions, e.g., mean, sum, variance, stdev
Example: flatMap and distinct
Python:
> sc.textFile(file) \
      .flatMap(lambda line: line.split()) \
      .distinct()
Scala:
> sc.textFile(file).
      flatMap(line => line.split("\\W")).
      distinct()
Input:
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
…
Result:
I've
never
seen
a
…
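The flatMap-then-distinct pipeline above can be sketched in plain Python (a local sketch of the semantics only, not the Spark API; the input lines are the first two lines of the slide's poem):

```python
# Plain-Python sketch of what flatMap + distinct produce.
# flatMap: apply a function that returns a list to each element, then flatten.
# distinct: drop duplicate elements (Spark does not guarantee order).

lines = [
    "I've never seen a purple cow.",
    "I never hope to see one;",
]

# flatMap(lambda line: line.split())
words = [w for line in lines for w in line.split()]

# distinct()
unique_words = set(words)

# "never" appears twice in words but only once in unique_words
```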
Chapter Topics
Working With RDDs (Introduction to Spark)
!! A Closer Look at RDDs
!! Key-Value Pair RDDs
!! MapReduce
!! Other Pair RDD Operations
!! Conclusion
!! Hands-On Exercise: Working with Pair RDDs
Pair RDDs
! Pair RDDs are a special form of RDD
– Each element must be a key-value pair (a two-element tuple)
– Keys and values can be any type
! Why?
– Use with MapReduce algorithms
– Many additional functions are available for common data processing needs
– e.g., sorting, joining, grouping, counting, etc.
Pair RDD:
(key1,value1)
(key2,value2)
(key3,value3)
…
Creating Pair RDDs
! The first step in most workflows is to get the data into key/value form
– What should the RDD be keyed on?
– What is the value?
! Commonly used functions to create Pair RDDs
– map
– flatMap / flatMapValues
– keyBy
Example: A Simple Pair RDD
! Example: Create a Pair RDD from a tab-separated file
Input:
user001 Fred Flintstone
user090 Bugs Bunny
user111 Harry Potter
…
Result:
(user001,Fred Flintstone)
(user090,Bugs Bunny)
(user111,Harry Potter)
…
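The transformation this slide illustrates (its code is not shown here) can be sketched locally; assuming the file is tab-separated with the user ID first, each line maps to a (key, value) tuple:

```python
# Plain-Python sketch: turn tab-separated lines into key-value pairs,
# as a Spark map(lambda line: tuple(line.split('\t', 1))) would.
lines = [
    "user001\tFred Flintstone",
    "user090\tBugs Bunny",
    "user111\tHarry Potter",
]

# split on the first tab only, so names containing tabs stay intact
pairs = [tuple(line.split("\t", 1)) for line in lines]
```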
Example: Keying Web Logs by User ID
Python:
> sc.textFile(logfile) \
      .keyBy(lambda line: line.split(' ')[2])
Scala:
> sc.textFile(logfile).
      keyBy(line => line.split(' ')(2))
Input (the third field is the User ID):
56.38.234.188 – 99788 "GET /KBDOC-00157.html HTTP/1.0" …
56.38.234.188 – 99788 "GET /theme.css HTTP/1.0" …
203.146.17.59 – 25254 "GET /KBDOC-00230.html HTTP/1.0" …
…
Question 1: Pairs With Complex Values
! How would you do this?
– Input: a list of postal codes with latitude and longitude
– Output: postal code (key) and lat/long pair (value)
Input:
00210 43.005895 -71.013202
00211 43.005895 -71.013202
00212 43.005895 -71.013202
00213 43.005895 -71.013202
00214 43.005895 -71.013202
…
Result:
(00210,(43.005895,-71.013202))
(00211,(43.005895,-71.013202))
(00212,(43.005895,-71.013202))
(00213,(43.005895,-71.013202))
…
Answer 1: Pairs With Complex Values
> sc.textFile(file) \
      .map(lambda line: line.split()) \
      .map(lambda fields: (fields[0],(fields[1],fields[2])))
Input:
00210 43.005895 -71.013202
00211 43.005895 -71.013202
00212 43.005895 -71.013202
00213 43.005895 -71.013202
00214 43.005895 -71.013202
…
Result:
(00210,(43.005895,-71.013202))
(00211,(43.005895,-71.013202))
(00212,(43.005895,-71.013202))
(00213,(43.005895,-71.013202))
…
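The two-step map above can be checked in plain Python (a local sketch of the semantics, not Spark itself):

```python
# Plain-Python sketch of Answer 1: split each line into fields,
# then build (zip, (lat, lon)) pairs.
lines = [
    "00210 43.005895 -71.013202",
    "00211 43.005895 -71.013202",
]

fields = [line.split() for line in lines]          # first map
pairs = [(f[0], (f[1], f[2])) for f in fields]     # second map
```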
Question 2: Mapping Single Rows to Multiple Pairs (1)
! How would you do this?
– Input: order numbers with a list of SKUs in the order
– Output: order (key) and sku (value)
Input Data:
00001 sku010:sku933:sku022
00002 sku912:sku331
00003 sku888:sku022:sku010:sku594
00004 sku411
Pair RDD:
(00001,sku010)
(00001,sku933)
(00001,sku022)
(00002,sku912)
(00002,sku331)
(00003,sku888)
…
Question 2: Mapping Single Rows to Multiple Pairs (2)
! Hint: map alone won't work
Input:
00001 sku010:sku933:sku022
00002 sku912:sku331
00003 sku888:sku022:sku010:sku594
00004 sku411
map would produce only a single pair per line:
(00001,(sku010,sku933,sku022))
(00002,(sku912,sku331))
(00003,(sku888,sku022,sku010,sku594))
(00004,(sku411))
Answer 2: Mapping Single Rows to Multiple Pairs (1)
> sc.textFile(file)
Input:
00001 sku010:sku933:sku022
00002 sku912:sku331
00003 sku888:sku022:sku010:sku594
00004 sku411
Answer 2: Mapping Single Rows to Multiple Pairs (2)
> sc.textFile(file) \
      .map(lambda line: line.split('\t'))
Input:
00001 sku010:sku933:sku022
00002 sku912:sku331
00003 sku888:sku022:sku010:sku594
00004 sku411
Result:
[00001,sku010:sku933:sku022]
[00002,sku912:sku331]
[00003,sku888:sku022:sku010:sku594]
[00004,sku411]
Answer 2: Mapping Single Rows to Multiple Pairs (3)
> sc.textFile(file) \
      .map(lambda line: line.split('\t')) \
      .map(lambda fields: (fields[0],fields[1]))
Input:
00001 sku010:sku933:sku022
00002 sku912:sku331
00003 sku888:sku022:sku010:sku594
00004 sku411
Result:
(00001,sku010:sku933:sku022)
(00002,sku912:sku331)
(00003,sku888:sku022:sku010:sku594)
(00004,sku411)
Answer 2: Mapping Single Rows to Multiple Pairs (4)
> sc.textFile(file) \
      .map(lambda line: line.split('\t')) \
      .map(lambda fields: (fields[0],fields[1])) \
      .flatMapValues(lambda skus: skus.split(':'))
Input:
00001 sku010:sku933:sku022
00002 sku912:sku331
00003 sku888:sku022:sku010:sku594
00004 sku411
Result:
(00001,sku010)
(00001,sku933)
(00001,sku022)
(00002,sku912)
(00002,sku331)
(00003,sku888)
…
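The flatMapValues step can be sketched in plain Python (a local sketch of the semantics only): for each (key, value) pair the function turns the value into a list, and each list element becomes its own (key, element) pair.

```python
# Plain-Python sketch of flatMapValues(lambda skus: skus.split(':')).
pairs = [
    ("00001", "sku010:sku933:sku022"),
    ("00002", "sku912:sku331"),
]

# the key is repeated once per element of the split-up value
flat = [(key, sku) for key, skus in pairs for sku in skus.split(":")]
```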
Chapter Topics
Working With RDDs (Introduction to Spark)
!! A Closer Look at RDDs
!! Key-Value Pair RDDs
!! MapReduce
!! Other Pair RDD Operations
!! Conclusion
!! Hands-On Exercise: Working with Pair RDDs
MapReduce
! MapReduce is a common programming model
– Easily applicable to distributed processing of large data sets
! Hadoop MapReduce is the best-known implementation
– Somewhat limited
– Each job has one Map phase, one Reduce phase
– Job output is saved to files
! Spark implements MapReduce with much greater flexibility
– Map and Reduce functions can be interspersed
– Results are stored in memory
– Operations can easily be chained
MapReduce in Spark
! MapReduce in Spark works on Pair RDDs
! Map phase
– Operates on one record at a time
– "Maps" each record to one or more new records
– map and flatMap
! Reduce phase
– Works on Map output
– Consolidates multiple records
– reduceByKey
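The Reduce phase can be sketched in plain Python (a local sketch of reduceByKey's semantics, not the Spark API): values sharing a key are combined pairwise with a two-argument function.

```python
# Plain-Python sketch of reduceByKey(lambda v1, v2: v1 + v2).
from functools import reduce
from collections import defaultdict

pairs = [("the", 1), ("cat", 1), ("the", 1), ("sat", 1), ("the", 1)]

# group the values for each key...
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)

# ...then repeatedly apply the reduce function within each group
counts = {key: reduce(lambda v1, v2: v1 + v2, values)
          for key, values in grouped.items()}
```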
MapReduce Example: Word Count
Input Data:
the cat sat on the mat
the aardvark sat on the sofa
Result:
aardvark 1
cat 1
mat 1
on 2
sat 2
sofa 1
the 4
Example: Word Count
> counts = sc.textFile(file) \
      .flatMap(lambda line: line.split()) \
      .map(lambda word: (word,1)) \
      .reduceByKey(lambda v1,v2: v1+v2)
ReduceByKey
! reduceByKey(function) combines the values for each key
– The function takes two values and returns one, e.g., lambda v1,v2: v1+v2
– All pairs with the same key are reduced to a single pair
Word Count Recap (the Scala Version)
> val counts = sc.textFile(file).
      flatMap(line => line.split("\\W")).
      map(word => (word,1)).
      reduceByKey((v1,v2) => v1+v2)
OR
> val counts = sc.textFile(file).
      flatMap(_.split("\\W")).
      map(word => (word,1)).
      reduceByKey(_+_)
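The whole pipeline can be run locally in plain Python on the slide's input (a sketch of the semantics using the stdlib Counter in place of map + reduceByKey, not Spark itself):

```python
# Plain-Python run of the word count pipeline on the example input.
from collections import Counter

lines = ["the cat sat on the mat",
         "the aardvark sat on the sofa"]

words = [w for line in lines for w in line.split()]   # flatMap
counts = Counter(words)                               # map + reduceByKey
```

The result matches the Result table shown earlier: the 4, sat 2, on 2, and the remaining words once each.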
Why Do We Care About Counting Words?
! Word count is challenging over massive amounts of data
– Using a single compute node would be too time-consuming
– Number of unique words could exceed available memory
! Statistics are often simple aggregate functions
– Distributive in nature
– e.g., max, min, sum, count
! MapReduce breaks complex tasks down into smaller elements which can be executed in parallel
! Many common tasks are very similar to word count
– e.g., log file analysis
Chapter Topics
Working With RDDs (Introduction to Spark)
!! Key-Value Pair RDDs
!! MapReduce
!! Other Pair RDD Operations
!! Conclusion
!! Hands-On Exercise: Working with Pair RDDs
Pair RDD Operations
! In addition to map and reduce functions, Spark has several operations specific to Pair RDDs
! Examples
– countByKey – return a map with the count of occurrences of each key
– groupByKey – group all the values for each key in an RDD
– sortByKey – sort in ascending or descending order
– join – return an RDD containing all pairs with matching keys from two RDDs
Example: Pair RDD Operations
Starting with the order/SKU pairs:
(00001,sku010)
(00001,sku933)
(00001,sku022)
(00002,sku912)
(00002,sku331)
(00003,sku888)
…
sortByKey(ascending=False):
(00004,sku411)
(00003,sku888)
(00003,sku022)
(00003,sku010)
(00003,sku594)
(00002,sku912)
(00002,sku331)
…
groupByKey:
(00002,[sku912,sku331])
(00001,[sku010,sku933,sku022])
(00003,[sku888,sku022,sku010,sku594])
(00004,[sku411])
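Both operations above can be sketched in plain Python (a local sketch of the semantics, not the Spark API):

```python
# Plain-Python sketch of sortByKey and groupByKey on order/SKU pairs.
from collections import defaultdict

pairs = [("00001", "sku010"), ("00001", "sku933"), ("00002", "sku912"),
         ("00002", "sku331"), ("00003", "sku888")]

# sortByKey(ascending=False): order the pairs by key, descending
by_key_desc = sorted(pairs, key=lambda p: p[0], reverse=True)

# groupByKey: collect all values that share a key into one list
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)
```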
Example: Joining by Key
RDD: moviegross
(Casablanca,$3.7M)
(Star Wars,$775M)
(Annie Hall,$38M)
(Argo,$232M)
…
RDD: movieyear
(Casablanca,1942)
(Star Wars,1977)
(Annie Hall,1977)
(Argo,2012)
…
Joined by key:
(Casablanca,($3.7M,1942))
(Star Wars,($775M,1977))
(Annie Hall,($38M,1977))
(Argo,($232M,2012))
…
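The join above can be sketched in plain Python (a simplified local sketch assuming each key appears at most once per side; Spark's join emits one output pair per matching combination of values):

```python
# Plain-Python sketch of join: for each key present in BOTH datasets,
# emit (key, (left_value, right_value)).
moviegross = [("Casablanca", "$3.7M"), ("Star Wars", "$775M")]
movieyear = [("Casablanca", 1942), ("Star Wars", 1977), ("Argo", 2012)]

right = dict(movieyear)
joined = [(title, (gross, right[title]))
          for title, gross in moviegross
          if title in right]   # keys missing on either side are dropped
```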
Using Join
! A common programming pattern
1. Map separate datasets into key-value Pair RDDs
2. Join by key
3. Map joined data into the desired format
4. Save, display, or continue processing…
Example: Join Web Log With Knowledge Base Articles (1)
weblogs (the third field is the User ID; the requested file follows "GET"):
56.38.234.188 – 99788 "GET /KBDOC-00157.html HTTP/1.0" …
56.38.234.188 – 99788 "GET /theme.css HTTP/1.0" …
203.146.17.59 – 25254 "GET /KBDOC-00230.html HTTP/1.0" …
221.78.60.155 – 45402 "GET /titanic_4000_sales.html HTTP/1.0" …
65.187.255.81 – 14242 "GET /KBDOC-00107.html HTTP/1.0" …
…
join with
kblist (Article ID and Article Title, separated by a colon):
KBDOC-00157:Ronin Novelty Note 3 - Back up files
KBDOC-00230:Sorrento F33L - Transfer Contacts
KBDOC-00050:Titanic 1000 - Transfer Contacts
KBDOC-00107:MeeToo 5.0 - Transfer Contacts
KBDOC-00300:iFruit 5A – overheats
…
Example: Join Web Log With Knowledge Base Articles (2)
! Steps
1. Map separate datasets into key-value Pair RDDs
a. Map web log requests to (docid,userid)
b. Map KB Doc index to (docid,title)
2. Join by key: docid
3. Map joined data into the desired format: (userid,title)
4. Further processing: group titles by User ID
Step 1a: Map Web Log Requests to (docid,userid)
> import re
> def getRequestDoc(s):
      return re.search(r'KBDOC-[0-9]*',s).group()
> kbreqs = sc.textFile(logfile) \
      .filter(lambda line: 'KBDOC-' in line) \
      .map(lambda line: (getRequestDoc(line),line.split(' ')[2]))
Step 1b: Map KB Index to (docid,title)
> kblist = sc.textFile(file) \
      .map(lambda line: line.split(':')) \
      .map(lambda fields: (fields[0],fields[1]))
Result:
(KBDOC-00157,Ronin Novelty Note 3 - Back up files)
(KBDOC-00230,Sorrento F33L - Transfer Contacts)
(KBDOC-00050,Titanic 1000 - Transfer Contacts)
(KBDOC-00107,MeeToo 5.0 - Transfer Contacts)
…
Step 2: Join By Key docid
> kbreqs.join(kblist)
kbreqs:
(KBDOC-00157,99788)
(KBDOC-00230,25254)
(KBDOC-00107,14242)
…
kblist:
(KBDOC-00157,Ronin Novelty Note 3 - Back up files)
(KBDOC-00230,Sorrento F33L - Transfer Contacts)
(KBDOC-00050,Titanic 1000 - Transfer Contacts)
(KBDOC-00107,MeeToo 5.0 - Transfer Contacts)
…
Step 3: Map Result to Desired Format (userid,title)
Step 4: Continue Processing – Group Titles by User ID
Example Output
Aside: Anonymous Function Parameters
! Python and Scala pattern matching can help improve code readability
Input:
(KBDOC-00157,(99788,…title…))
(KBDOC-00230,(25254,…title…))
(KBDOC-00107,(14242,…title…))
…
Result:
(99788,…title…)
(25254,…title…)
(14242,…title…)
…
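The readability point above can be sketched in plain Python (a local sketch, not Spark; the value-extraction is done two equivalent ways on a hypothetical joined pair):

```python
# Plain-Python sketch: extracting the value from (docid, (userid, title))
# pairs by index versus by spelling out the structure.
pairs = [("KBDOC-00157", ("99788", "Ronin Novelty Note 3 - Back up files"))]

# By index: terse, but the reader must remember what p[1] holds
by_index = [p[1] for p in pairs]

# By unpacking: the shape of each element is explicit
# (Python 2 allowed this directly in a lambda: lambda (docid,(userid,title)): ...)
by_unpacking = [(userid, title) for docid, (userid, title) in pairs]
```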
Other Pair Operations
! Some other pair operations
– keys – return an RDD of just the keys, without the values
– values – return an RDD of just the values, without keys
– lookup(key) – return the value(s) for a key
– leftOuterJoin, rightOuterJoin – join, including keys defined only in the left or right RDDs respectively
– mapValues, flatMapValues – execute a function on just the values, keeping the key the same
! See the PairRDDFunctions class Scaladoc for a full list
Chapter Topics
Working With RDDs (Introduction to Spark)
!! Key-Value Pair RDDs
!! MapReduce
!! Other Pair RDD Operations
!! Conclusion
!! Hands-On Exercise: Working with Pair RDDs
Key Points
! Pair RDDs are a special form of RDD consisting of key-value pairs (tuples)
! Spark provides several operations for working with Pair RDDs
! MapReduce is a generic programming model for distributed processing
– Spark implements MapReduce with Pair RDDs
– Hadoop MapReduce and other implementations are limited to a single Map and Reduce phase per job
– Spark allows flexible chaining of map and reduce operations
– Spark provides operations to easily perform common MapReduce algorithms like joining, sorting, and grouping
Chapter Topics
Working With RDDs (Introduction to Spark)
!! Key-Value Pair RDDs
!! MapReduce
!! Other Pair RDD Operations
!! Conclusion
!! Hands-On Exercise: Working with Pair RDDs
Hands-On Exercise: Working with Pair RDDs
! Hands-On Exercise: Working with Pair RDDs
– Continue exploring web server log files using key-value Pair RDDs
– Join log data with user account data
! Please refer to the Hands-On Exercise Manual
The Hadoop Distributed File System (HDFS)
Chapter 5
The Hadoop Distributed File System
In this chapter you will learn
! How HDFS supports Big Data processing by distributing data storage across a cluster
! How to save and retrieve data from HDFS using both command line tools and the Spark API
Chapter Topics
The Hadoop Distributed File System (Distributed Data Processing with Spark)
!! Why HDFS?
!! HDFS Architecture
!! Using HDFS
!! Conclusion
!! Hands-On Exercise: Using HDFS
Distributed Processing with the Spark Framework
! API – Spark
! Cluster Computing – Spark Standalone, YARN, or Mesos
! Storage – HDFS
Big Data Processing with Spark
! Three key concepts
– Distribute data when the data is stored – HDFS
– Run computation where the data is – HDFS and Spark
– Cache data in memory – Spark
Chapter Topics
The Hadoop Distributed File System (Distributed Data Processing with Spark)
!! Why HDFS?
!! HDFS Architecture
!! Using HDFS
!! Conclusion
!! Hands-On Exercise: Using HDFS
HDFS Basic Concepts (1)
! HDFS is a filesystem written in Java
– Based on Google's GFS
! Sits on top of a native filesystem
– Such as ext3, ext4, or xfs
! Provides redundant storage for massive amounts of data
– Using readily-available, industry-standard computers
(Diagram: HDFS layered on top of the native OS filesystem, which sits on disk storage)
HDFS Basic Concepts (2)
! HDFS performs best with a 'modest' number of large files
– Millions, rather than billions, of files
– Each file typically 100MB or more
! Files in HDFS are 'write once'
– No random writes to files are allowed
! HDFS is optimized for large, streaming reads of files
– Rather than random reads
How Files Are Stored
! Data files are split into 128MB blocks which are distributed at load time
! Each block is replicated on multiple data nodes (default 3x)
! NameNode stores metadata
(Diagram: a very large data file is split into Blocks 1-3; each block is stored on three different data nodes, while the NameNode holds metadata about files and blocks)
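The storage arithmetic implied by these bullets is easy to check; with 128MB blocks and 3x replication, a file occupies ceil(size / 128MB) blocks and three times its own size in raw cluster storage (a sketch; the last block of a file is usually only partially filled but still counts as one block):

```python
# How many blocks, and how much raw cluster storage, a file needs in HDFS.
import math

BLOCK_SIZE = 128 * 1024 * 1024   # 128MB blocks
REPLICATION = 3                  # default replication factor

def hdfs_blocks(file_size_bytes):
    """Number of HDFS blocks a file of this size occupies."""
    return math.ceil(file_size_bytes / BLOCK_SIZE)

def raw_storage(file_size_bytes):
    """Raw bytes consumed across the cluster (every byte stored 3x)."""
    return file_size_bytes * REPLICATION

one_gb = 1024 ** 3   # a 1GB file: 8 blocks, 3GB of raw storage
```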
Example: Storing and Retrieving Files
(Diagram sequence: local files /logs/031512.log and /logs/042313.log are stored in an HDFS cluster of five nodes, A through E; each file is split into numbered blocks, and each block is replicated on three of the nodes. To retrieve /logs/042313.log, a client asks the NameNode which nodes hold the file's blocks, then reads the blocks from those data nodes.)
HDFS NameNode Availability
! The NameNode daemon must be running at all times
– If the NameNode stops, the cluster becomes inaccessible
! HDFS is typically set up for High Availability
– Two NameNodes: Active and Standby
! Small clusters may use 'Classic mode'
– One NameNode
– One "helper" node called the Secondary NameNode
– Bookkeeping, not backup
Chapter Topics
The Hadoop Distributed File System (Distributed Data Processing with Spark)
!! Why HDFS?
!! HDFS Architecture
!! Using HDFS
!! Conclusion
!! Hands-On Exercise: Using HDFS
Options for Accessing HDFS
! From the command line
– FsShell: hdfs dfs
! In Spark
– By URI, e.g. hdfs://host:port/file…
! Other programs
– Java API
– Used by Hadoop MapReduce, Impala, Hue, Sqoop, Flume, etc.
– RESTful interface
(Diagram: a client uses put and get to move files to and from the HDFS cluster)
hdfs dfs Examples (1)
! Copy file foo.txt from local disk to the user's directory in HDFS
$ hdfs dfs -put foo.txt
– This will copy the file to /user/username/foo.txt
! Get a directory listing of the user's home directory in HDFS
$ hdfs dfs -ls
! Get a directory listing of the HDFS root directory
$ hdfs dfs -ls /
hdfs dfs Examples (2)
! Display the contents of the HDFS file /user/fred/bar.txt
$ hdfs dfs -cat /user/fred/bar.txt
! Copy that file to the local disk, named as baz.txt
$ hdfs dfs -get /user/fred/bar.txt baz.txt
! Create a directory called input under the user's home directory
$ hdfs dfs -mkdir input
Note: copyFromLocal is a synonym for put; copyToLocal is a synonym for get
hdfs dfs Examples (3)
! Delete the directory input_old and all its contents
$ hdfs dfs -rm -r input_old
Example: HDFS in Spark
! Specify HDFS files in Spark by URI
– hdfs://hdfs-host[:port]/path
– Default port is 8020
> sc.textFile("hdfs://hdfs-host:8020/user/training/purplecow.txt")
Using HDFS By Default
! If Hadoop configuration files are on Spark's classpath, Spark will use HDFS by default
– e.g. /etc/hadoop/conf
! Paths are relative to the user's home HDFS directory
– e.g., purplecow.txt resolves to hdfs://hdfs-host:port/user/training/purplecow.txt
Chapter Topics
The Hadoop Distributed File System (Distributed Data Processing with Spark)
!! Why HDFS?
!! HDFS Architecture
!! Using HDFS
!! Conclusion
!! Hands-On Exercise: Using HDFS
Key Points
! HDFS provides a key component of big data processing
– Distribute data when it is stored, so that computation can be run where the data is
! How HDFS works
– Files are divided into blocks
– Blocks are replicated across nodes
! Command line access to HDFS
– FsShell: hdfs dfs
– Sub-commands: -get, -put, -ls, -cat, etc.
! Spark access to HDFS
– sc.textFile and rdd.saveAsTextFile methods
– e.g., hdfs://host:port/path/to/file
Chapter Topics
The Hadoop Distributed File System (Distributed Data Processing with Spark)
!! Why HDFS?
!! HDFS Architecture
!! Using HDFS
!! Conclusion
!! Hands-On Exercise: Using HDFS
Hands-On Exercise: Using HDFS
! Hands-On Exercise: Using HDFS
– Begin to get acquainted with the Hadoop Distributed File System
– Read and write files using hdfs dfs on the command line, and from the Spark Shell
! Please refer to the Hands-On Exercise Manual
Running Spark on a Cluster
Chapter 6
Running Spark on a Cluster
In this chapter you will learn
! Spark clustering concepts and terminology
! Spark deployment options
! How to run a Spark application on a Spark Standalone cluster
Chapter Topics
Running Spark on a Cluster (Distributed Data Processing with Spark)
!! Overview
!! A Spark Standalone Cluster
!! The Spark Standalone Web UI
!! Spark Deployment Options
!! Conclusion
!! Hands-On Exercise: Running the Spark Shell on a Cluster
Spark Cluster Options
! Spark can run
– Locally
– No distributed processing
– Locally with multiple worker threads
– On a cluster
– Spark Standalone
– Apache Hadoop YARN (Yet Another Resource Negotiator)
– Apache Mesos
Why Run on a Cluster?
! Run Spark on a cluster to get the advantages of distributed processing
– Ability to process large amounts of data efficiently
– Fault tolerance and scalability
! Local mode is useful for development and testing
! Production use is almost always on a cluster
Spark Cluster Terminology
! A cluster is a group of computers working together
– Usually runs HDFS in addition to Spark Standalone, YARN, or Mesos
! A node is an individual computer in the cluster
– Master nodes manage distribution of work and data to worker nodes
! A daemon is a program running on a node
– Each performs different functions in the cluster
(Diagram: the HDFS and Cluster Manager master nodes coordinating a set of worker nodes)
The Spark Driver Program
! A Spark Driver
– The "main" program
– Either the Spark Shell or a Spark application
– Creates a Spark Context configured for the cluster
– Communicates with the Cluster Manager to distribute tasks to executors
(Diagram: the driver program's Spark Context talks to the Cluster Manager on the master node, which assigns the driver's tasks to Executors running on the worker nodes)
Starting the Spark Shell on a Cluster
! Set the Spark Shell master to
– url – the URL of the cluster manager
– local[*] – run with as many threads as cores (default)
– local[n] – run locally with n worker threads
– local – run locally without distributed processing
$ pyspark --master spark://masternode:7077
$ spark-shell --master spark://masternode:7077
! This configures the SparkContext.master property
Chapter Topics
Running Spark on a Cluster (Distributed Data Processing with Spark)
!! Overview
!! A Spark Standalone Cluster
!! The Spark Standalone Web UI
!! Spark Deployment Options
!! Conclusion
!! Hands-On Exercise: Running the Spark Shell on a Cluster
Spark Standalone Daemons
! Spark Standalone daemons
– Spark Master
– One per cluster
– Manages applications, distributes individual tasks to Spark Workers
– Spark Worker
– One per worker node
– Starts and monitors Executors for applications
(Diagram: one Spark Master daemon on the cluster master node; one Spark Worker daemon on each worker node)
Running&Spark&on&a&Standalone&Cluster&(1)&
Worker&(Slave)&Nodes&
Client&
SparkWorker& DataNode&
SparkWorker& DataNode&
HDFS&Master&&
Master&Node& Node&
Spark& Name&
Master& Node&
SparkWorker& DataNode&
SparkWorker& DataNode&
©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriCen&consent.& 06#13%
Running&Spark&on&a&Standalone&Cluster&(2)&
Worker&(Slave)&Nodes&
Client& $ hdfs dfs –put mydata
SparkWorker& DataNode&
HDFS:
mydata
SparkWorker& DataNode&
HDFS&Master&&
Master&Node& Node&
Block&1&
Spark& Name&
Master& Node&
SparkWorker& DataNode&
Block&2&
SparkWorker& DataNode&
©&Copyright&201092015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriCen&consent.& 06#14%
Running Spark on a Standalone Cluster (3)
[Diagram: the driver program starts on the client and creates a SparkContext, which connects to the Spark Master]
Running Spark on a Standalone Cluster (4)
[Diagram: the Spark Master has the Spark Workers start an Executor for the application on each worker node]
Running Spark on a Standalone Cluster (5)
[Diagram: the SparkContext sends tasks to the Executors; tasks run next to the HDFS blocks (Block 1, Block 2) they process]
Chapter Topics
Running Spark on a Cluster (Distributed Data Processing with Spark)
[Current section: The Spark Standalone Web UI]
Spark Standalone Web UI
! Spark Standalone clusters offer a Web UI to monitor the cluster
– http://masternode:uiport
– e.g., in our class environment, http://localhost:18080
[Screenshot: the cluster page showing the Master URL, the Worker Nodes, and the running Applications]
Spark Standalone Web UI: Application Overview
[Screenshot: the application page, with a link to the Spark Application UI and the list of Executors for this application]
Spark Standalone Web UI: Worker Detail
[Screenshot: the worker page, with log files and all executors on this node]
Chapter Topics
Running Spark on a Cluster (Distributed Data Processing with Spark)
[Current section: Spark Deployment Options]
Supported Cluster Resource Managers
! Spark Standalone
– Included with Spark
– Easy to install and run
– Limited configurability and scalability
– Useful for testing, development, or small systems
! Hadoop YARN
– Included in CDH
– Most common for production sites
– Allows sharing cluster resources with other applications (MapReduce, Impala, etc.)
! Apache Mesos
– First platform supported by Spark
– Now used less often
Client Mode and Cluster Mode
! By default, the driver program runs outside the cluster
– Called "client" deploy mode
– Most common
– Required for interactive use (e.g., the Spark Shell)
! It is also possible to run the driver program on a worker node in the cluster
– Called "cluster" deploy mode
[Diagram: in cluster deploy mode, the submitted driver program runs on one of the worker nodes, alongside the Executors, coordinated by the Cluster Manager on the master node]
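The deploy mode is chosen when the application is submitted. A command-line sketch (the host name and application file are placeholders, not values from this course):

```shell
# Client mode (default): the driver runs on the submitting machine
spark-submit --master spark://masternode:7077 --deploy-mode client myapp.py

# Cluster mode: the driver runs on a worker node inside the cluster
spark-submit --master spark://masternode:7077 --deploy-mode cluster myapp.py
```

Cluster mode keeps the driver close to the executors, which helps when the submitting machine is far from the cluster, but it cannot be used for interactive shells.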
Installing a Spark Cluster (1)
! Production cluster installation is usually performed by a system administrator
– Out of the scope of this course
! Developers should understand how the components of a cluster work together
! Developers often test first locally, then on a small test cluster
Installing a Spark Cluster (2)
! Difficult:
– Download and install Spark and HDFS directly from Apache
! Easier: CDH
– Cloudera's Distribution, including Apache Hadoop
– Includes HDFS, Spark API, Spark Standalone, and YARN
– Includes many patches, backports, bug fixes
! Easiest: Cloudera Manager
– Wizard-based UI to install, configure, and manage a cluster
– Included with Cloudera Express (free) or Cloudera Enterprise
– Supports Spark deployment as Standalone or YARN
Setting Up a Spark Standalone Cluster on EC2
! Spark includes support to easily set up and manage a Spark Standalone cluster on Amazon Web Services EC2
– Create your own AWS account
– Use the spark-ec2 script to
– Start, pause, and stop a cluster
– Launch an application on the cluster
– Specify regions, spot pricing, Spark version, and other options
– Use distributed files stored on Amazon S3 (Simple Storage Service)
– s3://path/to/file
Chapter Topics
Running Spark on a Cluster (Distributed Data Processing with Spark)
[Current section: Conclusion]
Key Points
! Spark is designed to run on a cluster
– Spark includes a basic cluster management platform called Spark Standalone
– Can also run on Hadoop YARN and Mesos
! The master distributes tasks to individual workers in the cluster
– Tasks run in executors – JVMs running on worker nodes
! Spark clusters work closely with HDFS
– Tasks are assigned to workers where the data is physically stored when possible
! Spark Standalone provides a UI for monitoring the cluster
– YARN has its own UI
Chapter Topics
Running Spark on a Cluster (Distributed Data Processing with Spark)
[Current section: Hands-On Exercise: Running the Spark Shell on a Cluster]
Hands-On Exercise: Running Spark on a Cluster
! Hands-On Exercise: Running Spark on a Cluster
– Start the Spark Standalone daemons (Spark Master and Spark Worker) on your local machine (a simulated Spark Standalone cluster)
– Run the Spark Shell on the cluster
– View the Spark Standalone UI
! Please refer to the Hands-On Exercise Manual
Parallel Programming with Spark
Chapter 7
Course Chapters
[Course outline slide; current chapter: Parallel Programming with Spark, in the Distributed Data Processing with Spark unit]
Parallel Programming with Spark
In this chapter you will learn
! How RDDs are distributed across a cluster
! How Spark executes RDD operations in parallel
Chapter Topics
Parallel Programming with Spark (Distributed Data Processing with Spark)
!! RDD Partitions
!! Partitioning of File-based RDDs
!! HDFS and Data Locality
!! Hands-On Exercise: Working With Partitions
!! Executing Parallel Operations
!! Stages and Tasks
!! Conclusion
!! Hands-On Exercise: Viewing Stages and Tasks in the Spark Application UI
Spark Cluster Review
[Diagram: a client connected to the cluster master node; Executors on the worker (slave) nodes run tasks; the HDFS master node sits alongside]
RDDs on a Cluster
! Resilient Distributed Datasets
– Data is partitioned across worker nodes
! Partitioning is done automatically by Spark
– Optionally, you can control how many partitions are created
[Diagram: RDD 1 split into partitions rdd_1_0, rdd_1_1, and rdd_1_2, each held by a different Executor]
Chapter Topics
Parallel Programming with Spark (Distributed Data Processing with Spark)
[Current section: Partitioning of File-based RDDs]
File Partitioning: Single Files

sc.textFile("myfile", 3)

! Partitions from single files
– Partitions based on size
– You can optionally specify a minimum number of partitions
– textFile(file, minPartitions)
– Default is 2
– More partitions = more parallelization
[Diagram: the file myfile split into three partitions, one per Executor]
File Partitioning: Multiple Files
! sc.textFile("mydir/*")
– Each file becomes (at least) one partition
– File-based operations can be done per-partition, for example parsing XML
! sc.wholeTextFiles("mydir")
– For many small files
– Creates a key-value PairRDD
– key = file name
– value = file contents
[Diagram: with textFile, file1 and file2 each load into their own partition; with wholeTextFiles, many small files load into partitions of (filename, contents) pairs]
Operating on Partitions
! Most RDD operations work on each element of an RDD
! A few work on each partition
– foreachPartition – call a function for each partition
– mapPartitions – create a new RDD by executing a function on each partition in the current RDD
– mapPartitionsWithIndex – same as mapPartitions, but the function also receives the index of the partition
! Functions for partition operations take iterators
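Why per-partition functions receive iterators can be sketched in plain Python (a toy model, not Spark code): the function runs once per partition rather than once per element, so any expensive setup happens once per partition. Here partitions are modeled as plain lists:

```python
def map_partitions(partitions, f):
    """Toy model of mapPartitions: f takes an iterator over one
    partition's elements and yields the new partition's elements."""
    return [list(f(iter(part))) for part in partitions]

def parse_with_setup(it):
    # Per-partition setup (e.g., building a parser) happens once here,
    # then every element in the partition is processed with it.
    prefix = "parsed:"
    for line in it:
        yield prefix + line

result = map_partitions([["a", "b"], ["c"]], parse_with_setup)
```

With an element-wise map, the setup would run once per element instead of once per partition.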
Example: Count JPG Requests per File
[Code example slide; the code was not recovered in this transcript]
Chapter Topics
Parallel Programming with Spark (Distributed Data Processing with Spark)
[Current section: HDFS and Data Locality]
HDFS and Data Locality (1)
[Diagram: the file mydata is stored in HDFS as Block 1, Block 2, and Block 3, each on a different worker node that also hosts an Executor; the Spark Master runs on the master node]
HDFS and Data Locality (2)

sc.textFile("hdfs://…mydata…").collect()

By default, Spark partitions file-based RDDs by block. Each block loads into a single partition.
[Diagram: the driver program on the client defines an RDD with one partition per HDFS block of mydata]
HDFS and Data Locality (3)

sc.textFile("hdfs://…mydata…").collect()

An action triggers execution: tasks on executors load data from blocks into partitions.
[Diagram: one task per Executor reads its local HDFS block (Block 1, Block 2, Block 3) into an RDD partition]
HDFS and Data Locality (4)

sc.textFile("hdfs://…mydata…").collect()

Data is distributed across executors until an action returns a value to the driver.
[Diagram: the partitions remain in Executor memory; the action returns its result to the driver]
Chapter Topics
Parallel Programming with Spark (Distributed Data Processing with Spark)
[Current section: Hands-On Exercise: Working With Partitions]
Hands-On Exercise: Working With Partitions
! Hands-On Exercise: Working With Partitions
– Parse multiple small XML files containing device activation records
– Use provided XML parsing functions in exercise stubs
– Find the most common device models in the dataset
! Please refer to the Hands-On Exercise Manual
Chapter Topics
Parallel Programming with Spark (Distributed Data Processing with Spark)
[Current section: Executing Parallel Operations]
Parallel Operations on Partitions
! RDD operations are executed in parallel on each partition
– When possible, tasks execute on the worker nodes where the data is in memory
! Some operations preserve partitioning
– e.g., map, flatMap, filter
! Some operations repartition
– e.g., reduce, sort, group
Example: Average Word Length by Letter (1)–(5)
[Diagram sequence: the HDFS file mydata is loaded into RDD partitions and transformed step by step across the executors]
Chapter Topics
Parallel Programming with Spark (Distributed Data Processing with Spark)
[Current section: Stages and Tasks]
Stages
! Operations that can run on the same partition are executed in stages
! Tasks within a stage are pipelined together
! Developers should be aware of stages to improve performance
Spark Execution: Stages (1)–(4)
[Diagram sequence: a chain of five RDDs is divided into Stage 1 and Stage 2; Stage 1 runs Tasks 1–3 (one per partition) and Stage 2 runs Tasks 4–5]
Summary of Spark Terminology
! Job – a set of tasks executed as a result of an action
! Stage – a set of tasks in a job that can be executed in parallel
! Task – an individual unit of work sent to one executor
How Spark Calculates Stages
! Spark constructs a DAG (Directed Acyclic Graph) of RDD dependencies
! Narrow operations
– Only one child depends on the RDD
– No shuffle required between nodes
– Can be collapsed into a single stage
– e.g., map, filter, union
! Wide operations
– Multiple children depend on the RDD
– Defines a new stage
– e.g., reduceByKey, join, groupByKey
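The stage-boundary rule above can be sketched in plain Python (a conceptual model, not Spark's scheduler): walking a linear chain of operations, narrow operations collapse into the current stage, and each wide operation opens a new one.

```python
# Operation names taken from the examples above; the sets are illustrative.
NARROW = {"map", "filter", "union"}
WIDE = {"reduceByKey", "join", "groupByKey"}

def split_into_stages(ops):
    """Group a linear chain of operations into stages: narrow ops join
    the current stage, wide ops start a new stage (toy model)."""
    stages = [[]]
    for op in ops:
        if op in WIDE and stages[-1]:
            stages.append([])
        stages[-1].append(op)
    return stages
```

For example, map → filter → reduceByKey → map yields two stages, with the shuffle sitting at the boundary before reduceByKey.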
Controlling the Level of Parallelism
! "Wide" operations (e.g., reduceByKey) partition result RDDs
– More partitions = more parallel tasks
– Cluster will be under-utilized if there are too few partitions
! You can control how many partitions
– Configure with the spark.default.parallelism property
  spark.default.parallelism 10
– Optional numPartitions parameter in function call
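What the partition count controls can be modeled in plain Python (an illustration of hash partitioning, not Spark's implementation): each key is assigned to one of numPartitions buckets, so more partitions means more independent reduce tasks that can run in parallel.

```python
def hash_partition(pairs, num_partitions):
    """Assign each (key, value) pair to a bucket by key hash --
    a toy model of how a wide operation partitions its result."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[hash(key) % num_partitions].append((key, value))
    return buckets
```

Note that all pairs with the same key land in the same bucket, which is what lets each reduce task work on its partition independently.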
Spark Execution: Task Scheduling (1)–(4)
[Diagram sequence: the Stage 1 tasks (Tasks 1–3) are sent to the Executors holding HDFS Blocks 1–3; when Stage 1 completes, the Stage 2 tasks (Tasks 4–5) are scheduled on the Executors]
Viewing Stages in the Spark Application UI
! You can view the execution stages in the Spark Application UI
[Screenshot: the Stages page; stages are identified by the last operation, the number of tasks equals the number of partitions, and the data shuffled between stages is shown]
Chapter Topics
Parallel Programming with Spark (Distributed Data Processing with Spark)
[Current section: Conclusion]
Key Points
! RDDs are stored in the memory of Spark executor JVMs
! Data is split into partitions – each partition in a separate executor
! RDD operations are executed on partitions in parallel
! Operations that depend on the same partition are pipelined together in stages
– e.g., map, filter
! Operations that depend on multiple partitions are executed in separate stages
– e.g., join, reduceByKey
Chapter Topics
Parallel Programming with Spark (Distributed Data Processing with Spark)
[Current section: Hands-On Exercise: Viewing Stages and Tasks in the Spark Application UI]
Hands-On Exercise: Viewing Stages and Tasks in the Spark Application UI
! Hands-On Exercise: Viewing Stages and Tasks in the Spark Application UI
– Use the Spark Application UI to view how stages and tasks are executed in a job
! Please refer to the Hands-On Exercise Manual
Caching and Persistence
Chapter 8
Course Chapters
[Course outline slide; current chapter: Caching and Persistence, in the Distributed Data Processing with Spark unit]
Caching and Persistence
In this chapter you will learn
! How Spark uses an RDD's lineage in operations
! How to persist RDDs to improve performance
Chapter Topics
Caching and Persistence (Distributed Data Processing with Spark)
!! RDD Lineage
!! Caching Overview
!! Distributed Persistence
!! Conclusion
!! Hands-On Exercises
Lineage Example (1)–(4)
! Each transformation operation creates a new child RDD
! Spark keeps track of the parent RDD for each new RDD
! Child RDDs depend on their parents
[Diagram sequence: the file purplecow.txt ("I've never seen a purple cow. / I never hope to see one; / But I can tell you, anyhow, / I'd rather see than be one.") is read into MappedRDD[1] (mydata), which is transformed into MappedRDD[2], which is filtered into FilteredRDD[3] (myrdd)]
Lineage Example (5)
! Action operations execute the parent transformations

> mydata = sc.textFile("purplecow.txt")
> myrdd = mydata.map(lambda s: s.upper())\
          .filter(lambda s: s.startswith('I'))
> myrdd.count()
3

[Diagram: MappedRDD[1] (mydata) holds the four lines of purplecow.txt; MappedRDD[2] holds the uppercased lines; FilteredRDD[3] (myrdd) holds the three uppercased lines starting with "I"]
Lineage Example (6)
! Each action re-executes the lineage transformations starting with the base
– By default
[Diagram: the lineage purplecow.txt → MappedRDD[1] (mydata) → … → FilteredRDD[3] (myrdd)]
Lineage Example (7)
! Each action re-executes the lineage transformations starting with the base
– By default

> mydata = sc.textFile("purplecow.txt")
> myrdd = mydata.map(lambda s: s.upper())\
          .filter(lambda s: s.startswith('I'))
> myrdd.count()
3
> myrdd.count()
3

[Diagram: the second count() walks the same lineage again, from purplecow.txt through MappedRDD[1], MappedRDD[2], and FilteredRDD[3]]
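The re-execution behavior above can be modeled in a few lines of plain Python (a toy model of lazy lineage, not Spark internals): transformations only record how to compute their data, and each action walks the chain again from the base.

```python
class ToyRDD:
    """Toy model of an RDD: stores a recipe, not data."""
    def __init__(self, compute):
        self.compute = compute          # zero-arg function producing the data
    def map(self, f):
        return ToyRDD(lambda: [f(x) for x in self.compute()])
    def filter(self, p):
        return ToyRDD(lambda: [x for x in self.compute() if p(x)])
    def count(self):                    # action: executes the whole lineage
        return len(self.compute())

lines = ["I've never seen a purple cow.", "I never hope to see one;",
         "But I can tell you, anyhow,", "I'd rather see than be one."]
mydata = ToyRDD(lambda: list(lines))
myrdd = mydata.map(str.upper).filter(lambda s: s.startswith('I'))
```

Each call to count() re-runs the map and filter from the base, which is exactly why caching (next section) pays off for RDDs used more than once.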
Chapter Topics
Caching and Persistence (Distributed Data Processing with Spark)
[Current section: Caching Overview]
Caching
! Caching an RDD saves the data in memory

> mydata = sc.textFile("purplecow.txt")
> myrdd = mydata.map(lambda s: s.upper())
> myrdd.cache()
> myrdd2 = myrdd.filter(lambda s: s.startswith('I'))
> myrdd2.count()
3

[Diagram sequence: the lineage purplecow.txt → RDD[1] (mydata) → RDD[2] (myrdd) → RDD[3] (myrdd2); after the count, RDD[2]'s uppercased lines are held in memory and RDD[3] holds the three lines starting with "I"]
Caching
! Subsequent operations use saved data

> myrdd2.count()
3
> myrdd2.count()
3

[Diagram: the second count() starts from the cached RDD[2] (myrdd) data instead of re-reading purplecow.txt and re-running the map]
Caching
! Caching is a suggestion to Spark
– If not enough memory is available, transformations will be re-executed when needed
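A sketch of this behavior in plain Python (a toy model, not Spark): a cached result is reused while it is available, and silently recomputed from its recipe when it is not.

```python
class CachedRecipe:
    """Toy model of RDD caching: keep the computed data if possible,
    fall back to recomputing from the recipe otherwise."""
    def __init__(self, compute):
        self.compute = compute
        self.cache_requested = False
        self.cached = None
    def cache(self):
        self.cache_requested = True     # a request, not a guarantee
    def get(self):
        if self.cached is not None:
            return self.cached          # served from memory
        data = self.compute()
        if self.cache_requested:
            self.cached = data          # kept only while "memory" allows
        return data
    def evict(self):
        self.cached = None              # e.g., memory pressure

calls = []
recipe = CachedRecipe(lambda: calls.append(1) or [1, 2, 3])
recipe.cache()
```

After an eviction, the next get() transparently recomputes the data, mirroring how Spark falls back to the lineage when a cached partition is dropped.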
Chapter Topics
Caching and Persistence (Distributed Data Processing with Spark)
[Current section: Distributed Persistence]
Caching and Fault-Tolerance
! RDD = Resilient Distributed Dataset
– Resiliency is a product of tracking lineage
– RDDs can always be recomputed from their base if needed
Distributed Cache
! RDD partitions are distributed across a cluster
! Cached partitions are stored in memory in Executor JVMs
[Diagram: cached partitions rdd_1_0 and rdd_1_1 held in the memory of Executors on different worker nodes]
RDD Fault-Tolerance (1)
! What happens if a cached partition becomes unavailable?
[Diagram: the Executor holding cached partition rdd_1_1 is lost; the partition's state is unknown ("?")]
RDD Fault-Tolerance (2)
! The Spark Master starts a new task to recompute the partition on a different node

(diagram: the Spark Master sends a task to another Executor, which recomputes and caches rdd_1_1)
Persistence Levels (1)
! The cache method stores data in memory only
! The persist method offers other options called Storage Levels
! Storage location – where is the data stored?
  – MEMORY_ONLY (default) – same as cache
  – MEMORY_AND_DISK – store partitions on disk if they do not fit in memory
    – Called spilling
  – DISK_ONLY – store all partitions on disk
! Replication – store partitions on two nodes
  – MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.
Persistence Levels (2)
! Serialization – you can choose to serialize the data in memory
  – MEMORY_ONLY_SER and MEMORY_AND_DISK_SER
    – Much more space-efficient
    – Less time-efficient
  – Choose a fast serialization library (covered later)
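The space/time tradeoff behind the _SER levels can be sketched in plain Python with the standard pickle module (an analogy for MEMORY_ONLY vs. MEMORY_ONLY_SER, not the Spark API): serialized bytes are far more compact than live objects, but must be deserialized before each use.

```python
import pickle
import sys

# 10,000 (word, count) pairs, standing in for the records of one cached partition.
records = [("word%d" % i, i) for i in range(10000)]

# Rough in-memory footprint of the live objects (list + tuples + their contents).
live_bytes = sys.getsizeof(records) + sum(
    sys.getsizeof(t) + sys.getsizeof(t[0]) + sys.getsizeof(t[1]) for t in records)

# Footprint once serialized, as a _SER storage level would hold it.
serialized = pickle.dumps(records)

assert len(serialized) < live_bytes          # much more space-efficient
assert pickle.loads(serialized) == records   # but costs CPU time to deserialize
```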
Changing Persistence Options
! To stop persisting and remove from memory and disk
  – rdd.unpersist()
! To change an RDD to a different persistence level
  – Unpersist first
Distributed Disk Persistence (1)
! Disk-persisted partitions are stored in local files

(diagram: partitions rdd_0 and rdd_1 live in Executor memory; rdd_1 is also spilled to a local file on its worker node)
Distributed Disk Persistence (2)
! Data on disk will be used to recreate the partition if possible
  – Will be recomputed if the data is unavailable
    – e.g., the node is down

(diagram: a lost in-memory partition is reloaded from its local file on disk, or recomputed if the node holding the file is down)
Replication
! Persistence replication makes recomputation less likely to be necessary

(diagram: the persisted partition file is stored on two worker nodes, so losing one copy does not force recomputation)
When and Where to Cache
! When should you cache a dataset?
  – When a dataset is likely to be re-used
    – e.g., iterative algorithms, machine learning
! How to choose a persistence level
  – Memory only – best performance, when possible
  – Save space by saving as serialized objects in memory if necessary
  – Disk – choose when recomputation is more expensive than disk read
    – e.g., expensive functions or filtering large datasets
  – Replication – choose when recomputation is more expensive than memory
Checkpointing (1)
! Maintaining RDD lineage provides resilience but can also cause problems when the lineage gets very long
  – e.g., iterative algorithms, streaming
! Recovery can be very expensive
! Potential stack overflow

myrdd = …initial value….
for x in xrange(100):
    myrdd = myrdd.transform(…)
myrdd.saveAsTextFile()

(diagram: each iteration appends another step to the lineage – Iter1, Iter2, Iter3, … Iter100 – each carrying the data of all prior steps)
Checkpointing (2)
! Checkpointing saves the data to HDFS
  – Provides fault-tolerant storage across nodes
! Lineage is not saved
! Must be checkpointed before any actions on the RDD

sc.setCheckpointDir(directory)
…
if x % 3 == 0:
    myrdd.checkpoint()
    myrdd.count()
myrdd.saveAsTextFile()

(diagram: the checkpoint materializes the data to HDFS, so only the iterations after the checkpoint remain in the lineage)
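Why long lineage is costly, and what checkpointing buys, can be shown with a plain-Python analogy (not the Spark API): each transformation wraps the previous computation in another deferred call, so every evaluation re-walks the whole chain unless an intermediate result is materialized.

```python
def transform(prev):
    # Each "transformation" only records how to compute from the previous
    # step, like one link of an RDD's lineage; nothing runs yet.
    return lambda: [x + 1 for x in prev()]

rdd = lambda: [0, 1, 2]       # base data
for _ in range(100):
    rdd = transform(rdd)      # the lineage is now 100 steps long

# Every evaluation walks the full 100-step chain (100 nested calls).
assert rdd() == [100, 101, 102]

# "Checkpointing": materialize the current data and drop the chain behind it.
snapshot = rdd()
rdd = lambda: snapshot        # lineage length is now effectively zero
assert rdd() == [100, 101, 102]
```

With thousands of iterations instead of 100, the nested evaluation is what exhausts the stack; periodically materializing (checkpointing) keeps the chain short.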
Chapter Topics
Distributed Data Processing with Spark: Caching and Persistence
!! RDD Lineage
!! Caching Overview
!! Distributed Persistence
!! Conclusion
!! Hands-On Exercises
Key Points
! Spark keeps track of each RDD's lineage
  – Provides fault tolerance
! By default, every RDD operation executes the entire lineage
! If an RDD will be used multiple times, persist it to avoid re-computation
! Persistence options
  – Caching (memory only) – will re-compute what doesn't fit in memory
  – Disk – will spill to local disk what doesn't fit in memory
  – Replication – will save cached data on multiple nodes in case a node goes down, for job recovery without recomputation
  – Serialization – in-memory caching can be serialized to save memory (but at the cost of performance)
  – Checkpointing – saves to HDFS, removes lineage
Chapter Topics
Distributed Data Processing with Spark: Caching and Persistence
!! RDD Lineage
!! Caching Overview
!! Distributed Persistence
!! Conclusion
!! Hands-On Exercises
Hands-On Exercises
! Hands-On Exercise: Caching RDDs
  – Compare performance with a cached and uncached RDD
  – Use the Spark Application UI to see how an RDD is cached
! Hands-On Exercise: Checkpointing RDDs
  – View the lineage of an iterative RDD
  – Increase iteration until a stack overflow error occurs
  – Checkpoint the RDD to avoid long lineage issues
! Please refer to the Hands-On Exercise Manual
Writing Spark Applications
Chapter 9

Course Chapters
Course Introduction:
  !! Introduction
Introduction to Spark:
  !! What is Apache Spark?
  !! Spark Basics
  !! Working With RDDs
Distributed Data Processing with Spark:
  !! The Hadoop Distributed File System (HDFS)
  !! Running Spark on a Cluster
  !! Parallel Programming with Spark
  !! Caching and Persistence
  !! Writing Spark Applications
Solving Business Problems with Spark:
  !! Spark Streaming
  !! Common Patterns in Spark Programming
  !! Improving Spark Performance
  !! Spark, Hadoop, and the Enterprise Data Center
Course Conclusion:
  !! Conclusion
Writing a Spark Application
In this chapter you will learn
! How to write, build, configure, and run Spark applications
Chapter Topics
Distributed Data Processing with Spark: Writing a Spark Application
!! Spark Applications vs. Spark Shell
!! Creating the SparkContext
!! Building a Spark Application (Scala and Java)
!! Running a Spark Application
!! Hands-On Exercise: Writing and Running a Spark Application
!! Configuring Spark Properties
!! Logging
!! Conclusion
!! Hands-On Exercise: Setting Log Levels
Spark Shell vs. Spark Applications
! The Spark Shell allows interactive exploration and manipulation of data
  – REPL using Python or Scala
! Spark applications run as independent programs
  – Python, Scala, or Java
  – e.g., ETL processing, Streaming, and so on
Chapter Topics
Distributed Data Processing with Spark: Writing a Spark Application
!! Spark Applications vs. Spark Shell
!! Creating the SparkContext
!! Building a Spark Application (Scala and Java)
!! Running a Spark Application
!! Hands-On Exercise: Writing and Running a Spark Application
!! Configuring Spark Properties
!! Logging
!! Conclusion
!! Hands-On Exercise: Setting Log Levels
The SparkContext
! Every Spark program needs a SparkContext
  – The interactive shell creates one for you
  – You create your own in a Spark application
  – Named sc by convention
Python Example: WordCount

import sys
from pyspark import SparkContext

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print >> sys.stderr, "Usage: WordCount <file>"
        exit(-1)

    sc = SparkContext()
    counts = sc.textFile(sys.argv[1]) \
        .flatMap(lambda line: line.split()) \
        .map(lambda word: (word,1)) \
        .reduceByKey(lambda v1,v2: v1+v2)
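The flatMap/map/reduceByKey chain can be traced step by step in plain Python (a stand-in for the RDD operations, not the pyspark API):

```python
lines = ["the cat in the hat", "the cat"]

# flatMap: one line produces many words (the results are flattened).
words = [w for line in lines for w in line.split()]

# map: each word becomes a (word, 1) pair.
pairs = [(w, 1) for w in words]

# reduceByKey: the values for each distinct key are summed.
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n

assert counts == {"the": 3, "cat": 2, "in": 1, "hat": 1}
```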
Scala Example: WordCount

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object WordCount {
  def main(args: Array[String]) {
    if (args.length < 1) {
      System.err.println("Usage: WordCount <file>")
      System.exit(1)
    }

    val sc = new SparkContext()
    val counts = sc.textFile(args(0)).
      flatMap(line => line.split(" ")).
      map(word => (word,1)).
      reduceByKey((v1,v2) => v1+v2)

    counts.take(5).foreach(println)
  }
}
Chapter Topics
Distributed Data Processing with Spark: Writing Spark Applications
!! Spark Applications vs. Spark Shell
!! Creating the SparkContext
!! Building a Spark Application (Scala and Java)
!! Running a Spark Application
!! Hands-On Exercise: Writing and Running a Spark Application
!! Configuring Spark Properties
!! Logging
!! Conclusion
!! Hands-On Exercise: Setting Log Levels
Building a Spark Application: Scala or Java
! Scala or Java Spark applications must be compiled and assembled into JAR files
  – The JAR file will be passed to worker nodes
! Most developers use Apache Maven to build their applications
  – For specific setting recommendations, see http://spark.apache.org/docs/latest/building-with-maven.html
! Build details will differ depending on
  – Version of Hadoop (HDFS)
  – Deployment platform (Spark Standalone, YARN, Mesos)
! Consider using an IDE
  – IntelliJ appears to be the most popular among Spark developers
Chapter Topics
Distributed Data Processing with Spark: Writing Spark Applications
!! Spark Applications vs. Spark Shell
!! Creating the SparkContext
!! Building a Spark Application (Scala and Java)
!! Running a Spark Application
!! Hands-On Exercise: Writing and Running a Spark Application
!! Configuring Spark Properties
!! Logging
!! Conclusion
!! Hands-On Exercise: Setting Log Levels
Running a Spark Application (1)
! The easiest way to run a Spark application is using the spark-submit script

Python:  $ spark-submit WordCount.py fileURL
Running a Spark Application (2)
! Some key spark-submit options
  --help – explain available options
  --master – equivalent to the MASTER environment variable for the Spark Shell
    – local[*] – run locally with as many threads as cores (default)
    – local[n] – run locally with n threads
    – local – run locally with a single thread
    – master URL, e.g., spark://masternode:7077
  --deploy-mode – either client or cluster
  --name – application name to display in the UI (default is the Scala/Java class or Python program name)
  --jars – additional JAR files (Scala and Java only)
  --py-files – additional Python files (Python only)
  --driver-java-options – parameters to pass to the driver JVM
Chapter Topics
Distributed Data Processing with Spark: Writing Spark Applications
!! Spark Applications vs. Spark Shell
!! Creating the SparkContext
!! Building a Spark Application (Scala and Java)
!! Running a Spark Application
!! Hands-On Exercise: Writing and Running a Spark Application
!! Configuring Spark Properties
!! Logging
!! Conclusion
!! Hands-On Exercise: Setting Log Levels
Building and Running Scala Applications in the Hands-On Exercises
! Basic Maven projects are provided in the exercises/projects directory with two packages
  – stubs – starter Scala file, do exercises here
  – solution – final exercise solution

$ mvn package

$ spark-submit \
    --class stubs.CountJPGs \
    target/countjpgs-1.0.jar \
    weblogs/*

Project Directory Structure:
+countjpgs
  -pom.xml
  +src
    +main
      +scala
        +solution
          -CountJPGs.scala
        +stubs
          -CountJPGs.scala
  +target
    -countjpgs-1.0.jar
Hands-On Exercise: Writing and Running a Spark Application
! Hands-On Exercise: Writing and Running a Spark Application
  – Write and run a Spark application to count JPG requests in a web server log
! Please refer to the Hands-On Exercise Manual
Chapter Topics
Distributed Data Processing with Spark: Writing Spark Applications
!! Spark Applications vs. Spark Shell
!! Creating the SparkContext
!! Building a Spark Application (Scala and Java)
!! Running a Spark Application
!! Hands-On Exercise: Writing and Running a Spark Application
!! Configuring Spark Properties
!! Logging
!! Conclusion
!! Hands-On Exercise: Setting Log Levels
Spark Application Configuration
! Spark provides numerous properties for configuring your application
! Some example properties
  – spark.master
  – spark.app.name
  – spark.local.dir – where to store local files such as shuffle output (default /tmp)
  – spark.ui.port – port to run the Spark Application UI (default 4040)
  – spark.executor.memory – how much memory to allocate to each Executor (default 512m)
! Most are more interesting to system administrators than developers
Spark Application Configuration
! Spark applications can be configured
  – Via the command line when the program is run
  – Programmatically, using the API
Run-time Configuration Options
! spark-submit script
  – e.g., spark-submit --master spark://masternode:7077
! Properties file
  – Tab- or space-separated list of properties and values
  – Load with spark-submit --properties-file filename
  – Example:
      spark.master          spark://masternode:7077
      spark.local.dir       /tmp
      spark.ui.port         4444
! Site defaults properties file
  – $SPARK_HOME/conf/spark-defaults.conf
  – Template file provided
Setting Configuration Properties Programmatically
! Spark configuration settings are part of the SparkContext
! Configure using a SparkConf object
! Some example functions
  – setAppName(name)
  – setMaster(master)
  – set(property-name, value)
! set functions return a SparkConf object to support chaining
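The chaining style works because each setter hands back the configuration object itself. A minimal plain-Python sketch of that pattern (an illustration of the design, not the pyspark SparkConf class):

```python
class Conf:
    """Toy configuration object using the same return-self chaining pattern."""
    def __init__(self):
        self.props = {}

    def set(self, name, value):
        self.props[name] = value
        return self          # returning self is what enables chaining

    def setAppName(self, name):
        return self.set("spark.app.name", name)

# Calls chain because every setter returns the same object.
conf = Conf().setAppName("Word Count").set("spark.ui.port", "4141")
assert conf.props == {"spark.app.name": "Word Count",
                      "spark.ui.port": "4141"}
```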
SparkConf Example (Python)

import sys
from pyspark import SparkContext
from pyspark import SparkConf

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print >> sys.stderr, "Usage: WordCount <file>"
        exit(-1)

    sconf = SparkConf() \
        .setAppName("Word Count") \
        .set("spark.ui.port","4141")
    sc = SparkContext(conf=sconf)
    counts = sc.textFile(sys.argv[1]) \
        .flatMap(lambda line: line.split()) \
        .map(lambda w: (w,1)) \
        .reduceByKey(lambda v1,v2: v1+v2)
SparkConf Example (Scala)

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object WordCount {
  def main(args: Array[String]) {
    if (args.length < 1) {
      System.err.println("Usage: WordCount <file>")
      System.exit(1)
    }

    val sconf = new SparkConf().
      setAppName("Word Count").
      set("spark.ui.port","4141")
    val sc = new SparkContext(sconf)
  }
}
Viewing Spark Properties
! You can view the Spark property settings in the Spark Application UI

(screenshot of the Spark Application UI Environment page not reproduced)
Chapter Topics
Distributed Data Processing with Spark: Writing Spark Applications
!! Spark Applications vs. Spark Shell
!! Creating the SparkContext
!! Building a Spark Application (Scala and Java)
!! Running a Spark Application
!! Hands-On Exercise: Writing and Running a Spark Application
!! Configuring Spark Properties
!! Logging
!! Conclusion
!! Hands-On Exercise: Setting Log Levels
Spark Logging
! Spark uses Apache Log4j for logging
  – Allows for controlling logging at runtime using a properties file
    – Enable or disable logging, set logging levels, select output destination
  – For more info see http://logging.apache.org/log4j/1.2/
! Log4j provides several logging levels
  – Fatal
  – Error
  – Warn
  – Info
  – Debug
  – Trace
  – Off
Spark Log Files
! Log file locations depend on your cluster management platform
! Spark Standalone defaults:
  – Spark daemons: /var/log/spark
  – Individual tasks: $SPARK_HOME/work on each worker node
Spark Worker UI – Log File Access

(screenshot of the Spark Worker UI log file links not reproduced)
Configuring Spark Logging (1)
! Logging levels can be set for the cluster, for individual applications, or even for specific components or subsystems
! Default for the machine: $SPARK_HOME/conf/log4j.properties
  – Start by copying log4j.properties.template

log4j.properties.template:
# Set everything to be logged to the console
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
…
Configuring Spark Logging (2)
! Spark will use the first log4j.properties file it finds in the Java classpath
! The Spark Shell will read log4j.properties from the current directory
  – Copy log4j.properties to the working directory and edit

…my-working-directory/log4j.properties:
# Set everything to be logged to the console
log4j.rootCategory=DEBUG, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
…
Chapter Topics
Distributed Data Processing with Spark: Writing Spark Applications
!! Spark Applications vs. Spark Shell
!! Creating the SparkContext
!! Building a Spark Application (Scala and Java)
!! Running a Spark Application
!! Hands-On Exercise: Writing and Running a Spark Application
!! Configuring Spark Properties
!! Logging
!! Conclusion
!! Hands-On Exercise: Setting Log Levels
Key Points
! Use the Spark Shell application for interactive data exploration
! Write a Spark application to run independently
! Spark applications require a SparkContext object
! Spark applications are run using the spark-submit script
! Spark configuration parameters can be set at runtime using the spark-submit script or programmatically using a SparkConf object
! Spark uses Log4j for logging
  – Configure using a log4j.properties file
Chapter Topics
Distributed Data Processing with Spark: Writing Spark Applications
!! Spark Applications vs. Spark Shell
!! Creating the SparkContext
!! Building a Spark Application (Scala and Java)
!! Running a Spark Application
!! Hands-On Exercise: Writing and Running a Spark Application
!! Configuring Spark Properties
!! Logging
!! Conclusion
!! Hands-On Exercise: Setting Log Levels
Hands-On Exercise: Configuring Spark Applications
! Hands-On Exercise: Configuring Spark Applications
  – Set properties using spark-submit
  – Set properties in a properties file
  – Set properties programmatically using SparkConf
  – Change the logging levels in a log4j.properties file
! Please refer to the Hands-On Exercise Manual
Spark Streaming
Chapter 10

Course Chapters
Course Introduction:
  !! Introduction
Introduction to Spark:
  !! Why Spark?
  !! Spark Basics
  !! Working With RDDs
Distributed Data Processing with Spark:
  !! The Hadoop Distributed File System (HDFS)
  !! Running Spark on a Cluster
  !! Parallel Programming with Spark
  !! Caching and Persistence
  !! Writing Spark Applications
Solving Business Problems with Spark:
  !! Spark Streaming
  !! Common Patterns in Spark Programming
  !! Improving Spark Performance
  !! Spark, Hadoop, and the Enterprise Data Center
Course Conclusion:
  !! Conclusion
Spark Streaming
In this chapter you will learn
! What Spark Streaming is, and why it is valuable
! How to use Spark Streaming
! How to work with Sliding Window operations
Chapter Topics
Solving Business Problems with Spark: Spark Streaming
!! Spark Streaming Overview
!! Example: Streaming Request Count
!! DStreams
!! Hands-On Exercise: Exploring Spark Streaming
!! State Operations
!! Sliding Window Operations
!! Developing Spark Streaming Applications
!! Conclusion
!! Hands-On Exercise: Writing a Spark Streaming Application
What is Spark Streaming?
! Spark Streaming provides real-time processing of stream data
! An extension of core Spark
! Supports Scala and Java
  – The most recent version of Spark also supports Python
Why Spark Streaming?
! Many big-data applications need to process large data streams in real time
  – Website monitoring
  – Fraud detection
  – Ad monetization
  – Etc.
Spark Streaming Features
! Second-scale latencies
! Scalability and efficient fault tolerance
! "Once and only once" processing
! Integrates batch and real-time processing
! Easy to develop
  – Uses Spark's high-level API
Spark Streaming Overview
! Divide up the data stream into batches of n seconds
! Process each batch in Spark as an RDD
! Return results of RDD operations in batches

(diagram: a live data stream enters Spark Streaming, which produces a DStream – RDDs in batches of n seconds – that are processed by Spark)
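The batching step can be sketched in plain Python (an illustration of the idea, not the Spark Streaming API): timestamped events are grouped into fixed n-second windows, and each group would become one RDD in the DStream.

```python
from collections import defaultdict

# (timestamp-in-seconds, event) pairs standing in for a live stream.
events = [(0.5, "a"), (1.9, "b"), (2.1, "c"), (3.0, "d"), (4.5, "e")]

batch_seconds = 2.0
batches = defaultdict(list)
for t, value in events:
    # Integer division by the batch length assigns each event to a window.
    batches[int(t // batch_seconds)].append(value)

# Each batch would become one RDD in the DStream.
assert dict(batches) == {0: ["a", "b"], 1: ["c", "d"], 2: ["e"]}
```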
Chapter Topics
Solving Business Problems with Spark: Spark Streaming
!! Spark Streaming Overview
!! Example: Streaming Request Count
!! DStreams
!! Hands-On Exercise: Exploring Spark Streaming
!! State Operations
!! Sliding Window Operations
!! Developing Spark Streaming Applications
!! Conclusion
!! Hands-On Exercise: Writing a Spark Streaming Application
Streaming Example: Streaming Request Count

This example is built up in stages – configuring the StreamingContext, creating a DStream, applying DStream transformations, outputting results, and starting the streams. The complete program (recap):

object StreamingRequestCount {
  def main(args: Array[String]) {
    val ssc = new StreamingContext(new SparkConf(), Seconds(2))
    val logs = ssc.socketTextStream(hostname, port)

    val userreqs = logs.
      map(line => (line.split(" ")(2),1)).
      reduceByKey((x,y) => x+y)

    userreqs.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Notes from the build-up:
! DStream operations such as map and reduceByKey are applied to each batch RDD in the stream – similar to RDD operations (filter, map, reduce, join, etc.)
! userreqs.print() prints out the first 10 elements of each RDD
! Results can also be saved with userreqs.saveAsTextFiles("…/outdir/reqcounts")
! ssc.awaitTermination() waits for background threads to complete before ending the main thread
Streaming&Example&Output&
-------------------------------------------
Time: 1401219545000 ms
Starts&2&seconds&
------------------------------------------- acer&ssc.start
(23713,2)
(53,2)
(24433,2)
(127,2)
(93,2)
...
©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#17%
Streaming&Example&Output&
-------------------------------------------
Time: 1401219545000 ms
-------------------------------------------
(23713,2)
(53,2)
(24433,2)
(127,2)
(93,2)
...
------------------------------------------- 2&seconds&later…
Time: 1401219547000 ms
-------------------------------------------
(42400,2)
(24996,2)
(97464,2)
(161,2)
(6011,2)
…
©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#18%
Streaming&Example&Output&
-------------------------------------------
Time: 1401219545000 ms
-------------------------------------------
(23713,2)
(53,2)
(24433,2)
(127,2)
(93,2)
...
-------------------------------------------
Time: 1401219547000 ms
-------------------------------------------
(42400,2)
(24996,2)
(97464,2)
(161,2)
(6011,2)
…
------------------------------------------- 2&seconds&later…
Time: 1401219549000 ms
-------------------------------------------
(44390,2)
(48712,2)
(165,2)
(465,2) ConDnues&unDl&
(120,2)
…
terminaDon…
©&Copyright&201052015&Cloudera.&All&rights&reserved.&Not&to&be&reproduced&without&prior&wriBen&consent.& 10#19%
Chapter Topics
Solving Business Problems with Spark: Spark Streaming
!! Spark Streaming Overview
!! Example: Streaming Request Count
!! DStreams
!! Hands-On Exercise: Exploring Spark Streaming
!! State Operations
!! Sliding Window Operations
!! Developing Spark Streaming Applications
!! Conclusion
!! Hands-On Exercise: Writing a Spark Streaming Application
DStreams
! A DStream is a sequence of RDDs representing a data stream
  – "Discretized Stream"

(diagram: live data – data…data…data… – divided along the time axis into a sequence of RDDs)
DStream Data Sources
! DStreams are defined for a given input stream (e.g., a Unix socket)
  – Created by the StreamingContext
      ssc.socketTextStream(hostname, port)
  – Similar to how RDDs are created by the SparkContext
! Out-of-the-box data sources
  – Network
    – Sockets
    – Other network sources, e.g., Flume, Akka Actors, Kafka, ZeroMQ, Twitter
  – Files
    – Monitors an HDFS directory for new content
DStream Operations
! DStream operations are applied to every RDD in the stream
  – Executed once per duration
! Two types of DStream operations
  – Transformations
    – Create a new DStream from an existing one
  – Output operations
    – Write data (for example, to a file system, database, or console)
    – Similar to RDD actions
DStream Transformations (1)
! Many RDD transformations are also available on DStreams
  – Regular transformations such as map, flatMap, filter
  – Pair transformations such as reduceByKey, groupByKey, join
! What if you want to do something else?
  – transform(function)
    – Creates a new DStream by executing function on RDDs in the current DStream
DStream Transformations (2)

reqcounts = userreqs.reduceByKey((x,y) => x+y)

reqcounts (one result RDD per batch):
  batch 1: (user002,5) (user033,1) (user912,2) …
  batch 2: (user710,9) (user022,4) (user001,4) …
  batch 3: (user002,1) (user808,8) (user018,2) …
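Applying a reduceByKey-style operation to each batch independently can be sketched in plain Python (an analogy, not the Spark API); note that counts never cross batch boundaries:

```python
# Batches of (user, 1) request pairs, standing in for the batch RDDs.
batches = [
    [("user002", 1), ("user002", 1), ("user033", 1)],
    [("user710", 1), ("user710", 1)],
]

def reduce_by_key(pairs):
    """Sum the values per key, like reduceByKey within one batch RDD."""
    counts = {}
    for key, value in pairs:
        counts[key] = counts.get(key, 0) + value
    return counts

# The transformation runs once per batch in the stream.
reqcounts = [reduce_by_key(batch) for batch in batches]
assert reqcounts == [{"user002": 2, "user033": 1}, {"user710": 2}]
```

Carrying counts across batches requires the state operations mentioned in the chapter topics, not a plain per-batch transformation.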
DStream Output Operations
! Console output
  – print – prints out the first 10 elements of each RDD
! File output
  – saveAsTextFiles – save data as text
  – saveAsObjectFiles – save as serialized object files
! Executing other functions
  – foreachRDD(function) – performs a function on each RDD in the DStream
    – Function input parameters
      – RDD – the RDD on which to perform the function
      – Time – optional, the time stamp of the RDD
Saving DStream Results as Files

(diagram: saveAsTextFiles output, one directory of files per batch, not reproduced)
Example: Find Top Users (1)

…
val userreqs = logs.
    map(line => (line.split(" ")(2),1)).
    reduceByKey((x,y) => x+y)

userreqs.saveAsTextFiles(path)

// Transform each RDD: swap userID/count, sort by count
val sortedreqs = userreqs.
    map(pair => pair.swap).
    transform(rdd => rdd.sortByKey(false))

sortedreqs.foreachRDD((rdd,time) => {
    println("Top users @ " + time)
    rdd.take(5).foreach(
      pair => printf("User: %s (%s)\n",pair._2, pair._1))
  }
)

ssc.start()
ssc.awaitTermination()
…
Example: Find Top Users – Output

(Console output screenshots not reproduced.)
Using Spark Streaming with the Spark Shell

! Spark Streaming is designed for batch applications, not interactive use
! The Spark Shell can be used for limited testing
  – Adding operations after the Streaming Context has been started is unsupported
  – Stopping and restarting the Streaming Context is unsupported

$ spark-shell --master local[2]
Chapter Topics

Spark Streaming (Solving Business Problems with Spark)

!! Spark Streaming Overview
!! Example: Streaming Request Count
!! DStreams
!! Hands-On Exercise: Exploring Spark Streaming
!! State Operations
!! Sliding Window Operations
!! Developing Spark Streaming Applications
!! Conclusion
!! Hands-On Exercise: Writing a Spark Streaming Application
Hands-On Exercise: Exploring Spark Streaming

! Hands-On Exercise: Exploring Spark Streaming
  – Explore Spark Streaming using the Scala Spark Shell
  – Count words, using netcat to simulate a data stream
! Please refer to the Hands-On Exercise Manual
Chapter Topics

Spark Streaming (Solving Business Problems with Spark)

!! Spark Streaming Overview
!! Example: Streaming Request Count
!! DStreams
!! Hands-On Exercise: Exploring Spark Streaming
!! State Operations
!! Sliding Window Operations
!! Developing Spark Streaming Applications
!! Conclusion
!! Hands-On Exercise: Writing a Spark Streaming Application
State DStreams (1)

! Use the updateStateByKey function to create a state DStream
! Example: total request count by user ID

t = 1
Requests:          (user001,5)
                   (user102,1)
                   (user009,2)

Total Requests     (user001,5)
(State):           (user102,1)
                   (user009,2)
Example: Total User Request Count (1)

…
val userreqs = logs.
    map(line => (line.split(" ")(2),1)).
    reduceByKey((x,y) => x+y)
…
ssc.checkpoint("checkpoints")
val totalUserreqs = userreqs.updateStateByKey(updateCount)
totalUserreqs.print()

ssc.start()
ssc.awaitTermination()
…

Set the checkpoint directory to enable checkpointing. This is required to prevent infinite lineages.
Example: Total User Request Count (2)

…
val userreqs = logs.
    map(line => (line.split(" ")(2),1)).
    reduceByKey((x,y) => x+y)
…
ssc.checkpoint("checkpoints")
val totalUserreqs = userreqs.updateStateByKey(updateCount)
totalUserreqs.print()

ssc.start()
ssc.awaitTermination()
…

Compute a state DStream based on the previous states, updated with the values from the current batch of request counts. (The updateCount function is shown on the next slide.)
Example: Total User Request Count – Update Function (1)

(Code not reproduced; the function's parameters are the new values and the current state, or None.)

Given an existing state for a key (user) and new values (counts), return a new state: the sum of the current state and the new counts.
Example: Total User Request Count – Update Function (2)

! Example at t=2

user001:  updateCount([4], Some(5))  →  9
user012:  updateCount([2], None)     →  2
user921:  updateCount([5], None)     →  5

            t = 1          t = 2
Requests:   (user001,5)    (user001,4)
            (user102,1)    (user012,2)
            (user009,2)    (user921,5)
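A possible implementation of the updateCount function passed to updateStateByKey, sketched in Python (the slides use Scala, where the current state is an Option; here None plays that role). new_values is the list of counts for a key in the current batch; current_state is the previous total, or None for a new key.

```python
def updateCount(new_values, current_state):
    # Sum this batch's counts and add the prior state (0 if the key is new)
    return sum(new_values) + (current_state or 0)

print(updateCount([4], 5))     # user001 at t=2 -> 9
print(updateCount([2], None))  # user012 at t=2 -> 2
```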
Example: Maintaining State – Output

-------------------------------------------
Time: 1401219545000 ms                        (t = 1)
-------------------------------------------
(user001,5)
(user102,1)
(user009,2)
-------------------------------------------
Time: 1401219547000 ms                        (t = 2)
-------------------------------------------
(user001,9)
(user102,1)
(user009,2)
(user012,2)
(user921,5)
-------------------------------------------
Time: 1401219549000 ms                        (t = 3)
-------------------------------------------
(user001,9)
(user102,8)
(user009,2)
(user012,5)
(user921,5)
(user660,4)
-------------------------------------------
…
Chapter Topics

Spark Streaming (Solving Business Problems with Spark)

!! Spark Streaming Overview
!! Example: Streaming Request Count
!! DStreams
!! Hands-On Exercise: Exploring Spark Streaming
!! State Operations
!! Sliding Window Operations
!! Developing Spark Streaming Applications
!! Conclusion
!! Hands-On Exercise: Writing a Spark Streaming Application
Sliding Window Operations (1)

! Regular DStream operations execute once for each RDD, based on the SSC batch duration
! “Window” operations span RDDs over a given duration
  – e.g., reduceByKeyAndWindow, countByWindow

(Diagram: a regular DStream is turned into a window DStream by
reduceByKeyAndWindow(fn, window-duration);
each window RDD covers the regular RDDs within the window duration.)
Sliding Window Operations (2)

! By default, window operations execute at an “interval” equal to the SSC batch duration
  – i.e., with a 2-minute batch duration, the window “slides” every 2 minutes

(Diagram: a regular DStream with batch size Minutes(2) is turned into a window
DStream by reduceByKeyAndWindow(fn, Minutes(12)); each window covers 12 minutes.)
Sliding Window Operations (3)

! You can specify a different slide duration (it must be a multiple of the SSC batch duration)

(Diagram: a regular DStream with batch size Minutes(2) is turned into a window
DStream by reduceByKeyAndWindow(fn, Minutes(12), Minutes(4)); each 12-minute
window slides every 4 minutes.)
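The window arithmetic above can be simulated without Spark. In this made-up sketch, each batch is just a request count for a 2-minute interval; a 12-minute window is the last 6 batches, and a 4-minute slide emits a result every 2 batches.

```python
BATCH_MIN, WINDOW_MIN, SLIDE_MIN = 2, 12, 4
window_size = WINDOW_MIN // BATCH_MIN   # 6 batches per window
slide = SLIDE_MIN // BATCH_MIN          # emit a window every 2 batches

batches = [3, 1, 4, 1, 5, 9, 2, 6]      # request counts per 2-minute batch

# Sum each window; early windows cover fewer batches while the stream fills
windows = [
    sum(batches[max(0, end - window_size):end])
    for end in range(slide, len(batches) + 1, slide)
]
print(windows)  # [4, 9, 23, 27]
```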
Example: Count and Sort User Requests by Window (1)

…
val ssc = new StreamingContext(new SparkConf(), Seconds(2))
val logs = ssc.socketTextStream(hostname, port)
…
val reqcountsByWindow = logs.
    map(line => (line.split(' ')(2),1)).
    reduceByKeyAndWindow((x: Int, y: Int) => x+y,
        Minutes(5), Seconds(30))

val topreqsByWindow = reqcountsByWindow.
    map(pair => pair.swap).
    transform(rdd => rdd.sortByKey(false))
topreqsByWindow.map(pair => pair.swap).print()

ssc.start()
ssc.awaitTermination()
…

Every 30 seconds, count requests by user over the last 5 minutes.
Example: Count and Sort User Requests by Window (2)

…
val ssc = new StreamingContext(new SparkConf(), Seconds(2))
val logs = ssc.socketTextStream(hostname, port)
…
val reqcountsByWindow = logs.
    map(line => (line.split(' ')(2),1)).
    reduceByKeyAndWindow((x: Int, y: Int) => x+y,
        Minutes(5), Seconds(30))

val topreqsByWindow = reqcountsByWindow.
    map(pair => pair.swap).
    transform(rdd => rdd.sortByKey(false))
topreqsByWindow.map(pair => pair.swap).print()

ssc.start()
ssc.awaitTermination()
…

Sort and print the top users for every RDD (every 30 seconds).
Chapter Topics

Spark Streaming (Solving Business Problems with Spark)

!! Spark Streaming Overview
!! Example: Streaming Request Count
!! DStreams
!! Hands-On Exercise: Exploring Spark Streaming
!! State Operations
!! Sliding Window Operations
!! Developing Spark Streaming Applications
!! Conclusion
!! Hands-On Exercise: Writing a Spark Streaming Application
Special Considerations for Streaming Applications

! Spark Streaming applications are by definition long-running
  – They require some different approaches than typical Spark applications
! Metadata accumulates over time
  – Use checkpointing to trim RDD lineage data
  – Checkpointing is required in order to use windowed and state operations
  – Enable it by setting the checkpoint directory:
        ssc.checkpoint(directory)
! Monitoring
  – The StreamingListener API lets you collect statistics
Spark Fault Tolerance (1)

! Network data is received on a worker node
  – The receiver distributes the data (as RDD partitions) to the cluster
! Spark Streaming persists windowed RDDs by default (replication = 2)

(Diagram: the driver program runs on the client; a receiver in one executor reads
from the network data source and distributes partitions rdd_0_0 and rdd_0_1, each
replicated across two executors.)
Spark Fault Tolerance (2)

! If the receiver fails, Spark will restart it on a different executor
  – There is potential for brief loss of incoming data

(Diagram: a new receiver starts on another executor and resumes reading from the
network data source.)
Building and Running Spark Streaming Applications

! Building Spark Streaming applications
  – Link with the main Spark Streaming library (included with Spark)
  – Link with additional Spark Streaming libraries if necessary, e.g., Kafka, Flume, Twitter
! Running Spark Streaming applications
  – Use at least two threads if running locally
The Spark Streaming Application UI

! The Streaming tab in the Spark Application UI provides basic metrics about the application

(Screenshot not reproduced.)
Chapter Topics

Spark Streaming (Solving Business Problems with Spark)

!! Spark Streaming Overview
!! Example: Streaming Request Count
!! DStreams
!! Hands-On Exercise: Exploring Spark Streaming
!! State Operations
!! Sliding Window Operations
!! Developing Spark Streaming Applications
!! Conclusion
!! Hands-On Exercise: Writing a Spark Streaming Application
Key Points

! Spark Streaming is an add-on to core Spark for processing real-time streaming data
! DStreams are “discretized streams” of streaming data, batched into RDDs by time interval
  – Operations applied to a DStream are applied to each of its RDDs
  – Transformations produce new DStreams by applying a function to each RDD in the base DStream
! You can update state based on prior state
  – e.g., total requests by user
! You can perform operations on “windows” of data
  – e.g., number of logins in the last hour
Chapter Topics

Spark Streaming (Solving Business Problems with Spark)

!! Spark Streaming Overview
!! Example: Streaming Request Count
!! DStreams
!! Hands-On Exercise: Exploring Spark Streaming
!! State Operations
!! Sliding Window Operations
!! Developing Spark Streaming Applications
!! Conclusion
!! Hands-On Exercise: Writing a Spark Streaming Application
Hands-On Exercise: Writing a Spark Streaming Application

! Hands-On Exercise: Writing a Spark Streaming Application
  – Write a Spark Streaming application to process web logs, using a Python script to simulate a data stream
! Please refer to the Hands-On Exercise Manual
Common Patterns in Spark Programming

Chapter 11
Course Chapters

!! Introduction                                      Course Introduction
!! Why Spark?
!! Spark Basics                                      Introduction to Spark
!! Working With RDDs
!! The Hadoop Distributed File System (HDFS)
!! Running Spark on a Cluster
!! Parallel Programming with Spark                   Distributed Data Processing with Spark
!! Caching and Persistence
!! Writing Spark Applications
!! Spark Streaming
!! Common Patterns in Spark Programming              Solving Business Problems with Spark
!! Improving Spark Performance
!! Spark, Hadoop, and the Enterprise Data Center
!! Conclusion                                        Course Conclusion
Common Spark Algorithms

In this chapter you will learn
! What kinds of processing and analysis Spark is best at
! How to implement an iterative algorithm in Spark
! How GraphX and MLlib work with Spark
Chapter Topics

Common Programming Patterns in Spark (Solving Business Problems with Spark)

!! Common Spark Use Cases
!! Iterative Algorithms in Spark
!! Graph Processing and Analysis
!! Machine Learning
!! Example: k-means
!! Conclusion
!! Hands-On Exercise: Iterative Processing in Spark
Common Spark Use Cases (1)

! Spark is especially useful when working with any combination of:
  – Large amounts of data
  – Distributed storage
  – Intensive computations
  – Distributed computing
  – Iterative algorithms
  – In-memory processing and pipelining
Common Spark Use Cases (2)

! Examples
  – Risk analysis
    – “How likely is this borrower to pay back a loan?”
  – Recommendations
    – “Which products will this customer enjoy?”
  – Predictions
    – “How can we prevent service outages instead of simply reacting to them?”
  – Classification
    – “How can we tell which email is spam and which is legitimate?”
Spark Examples

! Spark includes many example programs that demonstrate common Spark programming patterns and algorithms
  – k-means
  – Logistic regression
  – Calculating pi
  – Alternating least squares (ALS)
  – Querying Apache web logs
  – Processing Twitter feeds
! Scala and Java examples
  – $SPARK_HOME/examples/
! Python examples
  – $SPARK_HOME/python/examples
Chapter Topics

Common Programming Patterns in Spark (Solving Business Problems with Spark)

!! Common Spark Use Cases
!! Iterative Algorithms in Spark
!! Graph Processing and Analysis
!! Machine Learning
!! Example: k-means
!! Conclusion
!! Hands-On Exercise: Iterative Processing in Spark
Example: PageRank

! PageRank gives web pages a ranking score based on links from other pages
  – Higher scores are given for more links, and for links from other high-ranking pages
! Why do we care?
  – PageRank is a classic example of big data analysis (like WordCount)
  – Lots of data – it needs an algorithm that is distributable and scalable
  – Iterative – the more iterations, the better the answer
PageRank Algorithm (1)

1. Start each page with a rank of 1.0

(Diagram: Page 1, Page 2, Page 3, and Page 4 all start with rank 1.0.)
PageRank Algorithm (2)

1. Start each page with a rank of 1.0
2. On each iteration:
   a. Each page contributes to its neighbors its own rank divided by the number of its neighbors: contrib(p) = rank(p) / neighbors(p)

(Diagram: a page with rank 1.0 and two neighbors contributes .5 to each.)
PageRank Algorithm (3)

1. Start each page with a rank of 1.0
2. On each iteration:
   a. Each page contributes to its neighbors its own rank divided by the number of its neighbors: contrib(p) = rank(p) / neighbors(p)
   b. Set each page’s new rank based on the sum of its neighbors’ contributions: new-rank = Σ(contribs) × .85 + .15

(Diagram, iteration 1: Page 1 = 1.85, Page 2 = 0.58, Page 3 = 1.0, Page 4 = 0.58.)
PageRank Algorithm (4)

1. Start each page with a rank of 1.0
2. On each iteration:
   a. Each page contributes to its neighbors its own rank divided by the number of its neighbors: contrib(p) = rank(p) / neighbors(p)
   b. Set each page’s new rank based on the sum of its neighbors’ contributions: new-rank = Σ(contribs) × .85 + .15
3. Each iteration incrementally improves the page ranking

(Diagram, iteration 2: Page 1 = 1.31, Page 2 = 0.39, Page 3 = 1.7, Page 4 = 0.57.)
PageRank Algorithm (5)

1. Start each page with a rank of 1.0
2. On each iteration:
   a. Each page contributes to its neighbors its own rank divided by the number of its neighbors: contrib(p) = rank(p) / neighbors(p)
   b. Set each page’s new rank based on the sum of its neighbors’ contributions: new-rank = Σ(contribs) × .85 + .15
3. Each iteration incrementally improves the page ranking

(Diagram, iteration 10, final: Page 1 = 1.43, Page 2 = 0.46, Page 3 = 1.38, Page 4 = 0.73.)
PageRank in Spark: Neighbor Contribution Function

neighbors: [page1, page2]        (page1, .5)
rank: 1.0                        (page2, .5)

(Diagram: a page with rank 1.0 splits its rank evenly between its two neighbor pages.)
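The slides elide the body of computeContribs; a plausible implementation matching the diagram above would be the following. Each neighbor receives the page's rank divided evenly by the number of neighbors.

```python
def computeContribs(neighbors, rank):
    # Yield one (neighbor, contribution) pair per outgoing link
    for neighbor in neighbors:
        yield (neighbor, rank / len(neighbors))

print(list(computeContribs(["page1", "page2"], 1.0)))
# [('page1', 0.5), ('page2', 0.5)]
```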
PageRank in Spark: Example Data

(Diagram: Pages 1 through 4, with hyperlinks between them.)
PageRank in Spark: Pairs of Page Links

Input file:
page1 page3
page2 page1
page4 page1
page3 page1
page4 page2
page3 page4

def computeContribs(neighbors, rank): …

links = sc.textFile(file)\
    .map(lambda line: line.split())\
    .map(lambda pages: (pages[0],pages[1]))\
    .distinct()

Distinct pairs:
(page1,page3)
(page2,page1)
(page4,page1)
(page3,page1)
(page4,page2)
(page3,page4)
PageRank in Spark: Page Links Grouped by Source Page

def computeContribs(neighbors, rank): …

links = sc.textFile(file)\
    .map(lambda line: line.split())\
    .map(lambda pages: (pages[0],pages[1]))\
    .distinct()\
    .groupByKey()

links:
(page4, [page2,page1])
(page2, [page1])
(page3, [page1,page4])
(page1, [page3])
PageRank in Spark: Caching the Link Pair RDD

def computeContribs(neighbors, rank): …

links = sc.textFile(file)\
    .map(lambda line: line.split())\
    .map(lambda pages: (pages[0],pages[1]))\
    .distinct()\
    .groupByKey()\
    .cache()

links:
(page4, [page2,page1])
(page2, [page1])
(page3, [page1,page4])
(page1, [page3])
PageRank in Spark: Set Initial Ranks

def computeContribs(neighbors, rank): …

links = sc.textFile(file)\
    .map(lambda line: line.split())\
    .map(lambda pages: (pages[0],pages[1]))\
    .distinct()\
    .groupByKey()\
    .cache()

links:                          ranks:
(page4, [page2,page1])          (page4, 1.0)
(page2, [page1])                …
(page3, [page1,page4])
(page1, [page3])
PageRank in Spark: First Iteration (1)

def computeContribs(neighbors, rank): …
links = …
ranks = …

for x in xrange(10):
    contribs=links\
        .join(ranks)\

links:                          ranks:
(page4, [page2,page1])          (page4, 1.0)
(page2, [page1])                (page2, 1.0)
(page3, [page1,page4])          (page3, 1.0)
(page1, [page3])                (page1, 1.0)

Joined: (page4, ([page2,page1], 1.0)) …
PageRank in Spark: First Iteration (2)

for x in xrange(10):
    contribs=links\
        .join(ranks)\
        .flatMap(lambda (page,(neighbors,rank)): \
            computeContribs(neighbors,rank))

contribs:
(page2,0.5)
(page1,0.5)
(page1,1.0)
(page1,0.5)
(page4,0.5)
(page3,1.0)
PageRank in Spark: First Iteration (3)

for x in xrange(10):
    contribs=links\
        .join(ranks)\
        .flatMap(lambda (page,(neighbors,rank)): \
            computeContribs(neighbors,rank))
    ranks=contribs\
        .reduceByKey(lambda v1,v2: v1+v2)

contribs:             reduced:
(page2,0.5)           (page4,0.5)
(page1,0.5)           (page2,0.5)
(page1,1.0)           (page3,1.0)
(page1,0.5)           (page1,2.0)
(page4,0.5)
(page3,1.0)
PageRank in Spark: First Iteration (4)

for x in xrange(10):
    contribs=links\
        .join(ranks)\
        .flatMap(lambda (page,(neighbors,rank)): \
            computeContribs(neighbors,rank))
    ranks=contribs\
        .reduceByKey(lambda v1,v2: v1+v2)\
        .map(lambda (page,contrib): \
            (page,contrib * 0.85 + 0.15))

ranks:
(page4,.58)
(page2,.58)
(page3,1.0)
(page1,1.85)
PageRank in Spark: Second Iteration

def computeContribs(neighbors, rank): …
links = …
ranks = …

for x in xrange(10):
    contribs=links\
        .join(ranks)\
        .flatMap(lambda (page,(neighbors,rank)): \
            computeContribs(neighbors,rank))
    ranks=contribs\
        .reduceByKey(lambda v1,v2: v1+v2)\
        .map(lambda (page,contrib): \
            (page,contrib * 0.85 + 0.15))

for rank in ranks.collect(): print rank

ranks after the second iteration:
(page4,0.57)
(page2,0.39)
(page3,1.7)
(page1,1.31)
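The same 10-iteration PageRank can be checked without Spark on the four-page example graph from these slides. This is a plain-Python sketch (Python 3, whereas the slide code is Python 2) of exactly the computation the Spark version distributes; the graph below is the slides' link table.

```python
from collections import defaultdict

# Adjacency lists from the slides' example data
links = {
    "page1": ["page3"],
    "page2": ["page1"],
    "page3": ["page1", "page4"],
    "page4": ["page1", "page2"],
}
ranks = {page: 1.0 for page in links}

for _ in range(10):
    # Each page sends rank/len(neighbors) to each neighbor (computeContribs)
    contribs = defaultdict(float)
    for page, neighbors in links.items():
        for neighbor in neighbors:
            contribs[neighbor] += ranks[page] / len(neighbors)
    # reduceByKey + damping: new-rank = sum(contribs) * 0.85 + 0.15
    ranks = {page: contribs[page] * 0.85 + 0.15 for page in links}

for page, rank in sorted(ranks.items()):
    print(page, round(rank, 2))
# page1 1.43 / page2 0.46 / page3 1.38 / page4 0.73
```

After 10 iterations the ranks match the final diagram in the PageRank Algorithm slides to two decimal places.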
Chapter Topics

Common Programming Patterns in Spark (Solving Business Problems with Spark)

!! Common Spark Use Cases
!! Iterative Algorithms in Spark
!! Graph Processing and Analysis
!! Machine Learning
!! Example: k-means
!! Conclusion
Graph Analytics

! Many data analytics problems work with “data parallel” algorithms
  – Records can be processed independently of each other
  – Very well suited to parallelizing
! Some problems focus on the relationships between the individual data items. For example:
  – Social networks
  – Web page hyperlinks
  – Road maps
! These relationships can be represented by graphs
  – This requires “graph parallel” algorithms
Graph Analysis Challenges at Scale

! Graph creation
  – Extracting relationship information from a data source
  – For example, extracting links from web pages
! Graph representation
  – e.g., adjacency lists in a table
! Graph analysis
  – Inherently iterative, hard to parallelize
  – This is the focus of specialized libraries like Pregel and GraphLab
! Post-analysis processing
  – e.g., incorporating product recommendations into a retail site
Graph Analysis in Spark

! Spark is very well suited to graph parallel algorithms
! GraphX
  – A UC Berkeley AMPLab project on top of Spark
  – Unifies optimized graph computation with Spark’s fast data parallelism and interactive abilities
  – Supersedes its predecessor Bagel (Pregel on Spark)
Chapter Topics

Common Spark Algorithms (Solving Business Problems with Spark)

!! Common Spark Use Cases
!! Iterative Algorithms in Spark
!! Graph Processing and Analysis
!! Machine Learning
!! Example: k-means
!! Conclusion
!! Hands-On Exercise: Iterative Processing in Spark
Machine Learning

! Most programs tell computers exactly what to do
  – Database transactions and queries
  – Controllers
    – Phone systems, manufacturing processes, transport, weaponry, etc.
  – Media delivery
  – Simple search
  – Social systems
    – Chat, blogs, email, etc.
! An alternative technique is to have computers learn what to do
! Machine learning refers to programs that leverage collected data to drive future program behavior
! This represents another major opportunity to gain value from data
The ‘Three Cs’

! Machine learning is an active area of research and new applications
! There are three well-established categories of techniques for exploiting data
  – Collaborative filtering (recommendations)
  – Clustering
  – Classification
Collaborative Filtering

! Collaborative filtering is a technique for recommendations
! Example application: given people who each like certain books, learn to suggest what someone may like in the future based on what they already like
! Helps users navigate data by expanding to topics that have affinity with their established interests
! Collaborative filtering algorithms are agnostic to the different types of data items involved
  – Useful in many different domains
Clustering

! Clustering algorithms discover structure in collections of data
  – Where no formal structure previously existed
! They discover what clusters, or groupings, naturally occur in data
! Examples
  – Finding related news articles
  – Computer vision (groups of pixels that cohere into objects)
Classification

! The previous two techniques are considered ‘unsupervised’ learning
  – The algorithm discovers groups or recommendations itself
! Classification is a form of ‘supervised’ learning
! A classification system takes a set of data records with known labels
  – It learns how to label new records based on that information
! Examples
  – Given a set of emails identified as spam/not spam, label new emails as spam/not spam
  – Given images of tumors identified as benign or malignant, classify new images
Machine Learning Challenges

! Highly computation-intensive and iterative
! Many traditional numerical processing systems do not scale to very large datasets
  – e.g., MATLAB
MLlib: Machine Learning on Spark

! MLlib is part of Apache Spark
! It includes many common ML functions
  – ALS (alternating least squares)
  – k-means
  – Logistic regression
  – Linear regression
  – Gradient descent
! Still a ‘work in progress’
Chapter Topics

Common Programming Patterns in Spark (Solving Business Problems with Spark)

!! Common Spark Use Cases
!! Iterative Algorithms in Spark
!! Graph Processing and Analysis
!! Machine Learning
!! Example: k-means
!! Conclusion
!! Hands-On Exercise: Iterative Processing in Spark
k-means Clustering

! k-means clustering
  – A common iterative algorithm used in graph analysis and machine learning
  – You will implement a simplified version in the Hands-On Exercises
Clustering (1)

(Scatter plot of data points not reproduced.)

Clustering (2)

Goal: find “clusters” of data points

(Scatter plot with clusters highlighted not reproduced.)
Example: k-means Clustering (1)

1. Choose K random points as starting centers
Example: k-means Clustering (2)

1. Choose K random points as starting centers
2. Find all points closest to each center
Example: k-means Clustering (3)

1. Choose K random points as starting centers
2. Find all points closest to each center
3. Find the center (mean) of each cluster
Example: k-means Clustering (4)

1. Choose K random points as starting centers
2. Find all points closest to each center
3. Find the center (mean) of each cluster
4. If the centers changed, iterate again
Example:%k8means%Clustering%(5)%
1. Choose%K%random%points%as%
starEng%centers%
2. Find%all%points%closest%to%each%
center%
3. Find%the%center%(mean)%of%each%
cluster%
4. If%the%centers%changed,%iterate%
again%
©%Copyright%201082015%Cloudera.%All%rights%reserved.%Not%to%be%reproduced%without%prior%wri(en%consent.% 11"46$
Example:%k8means%Clustering%(6)%
1. Choose%K%random%points%as%
starEng%centers%
2. Find%all%points%closest%to%each%
center%
3. Find%the%center%(mean)%of%each%
cluster%
4. If%the%centers%changed,%iterate%
again%
©%Copyright%201082015%Cloudera.%All%rights%reserved.%Not%to%be%reproduced%without%prior%wri(en%consent.% 11"47$
Example:%k8means%Clustering%(7)%
1. Choose%K%random%points%as%
starEng%centers%
2. Find%all%points%closest%to%each%
center%
3. Find%the%center%(mean)%of%each%
cluster%
4. If%the%centers%changed,%iterate%
again%
©%Copyright%201082015%Cloudera.%All%rights%reserved.%Not%to%be%reproduced%without%prior%wri(en%consent.% 11"48$
Example:%k8means%Clustering%(8)%
1. Choose%K%random%points%as%
starEng%centers%
2. Find%all%points%closest%to%each%
center%
3. Find%the%center%(mean)%of%each%
cluster%
4. If%the%centers%changed,%iterate%
again%
©%Copyright%201082015%Cloudera.%All%rights%reserved.%Not%to%be%reproduced%without%prior%wri(en%consent.% 11"49$
Example:%k8means%Clustering%(9)%
1. Choose%K%random%points%as%
starEng%centers%
2. Find%all%points%closest%to%each%
center%
3. Find%the%center%(mean)%of%each%
cluster%
4. If%the%centers%changed,%iterate%
again%
…%
5. Done!%
©%Copyright%201082015%Cloudera.%All%rights%reserved.%Not%to%be%reproduced%without%prior%wri(en%consent.% 11"50$
Example: Approximate k-means Clustering
1. Choose K random points as starting centers
2. Find all points closest to each center
3. Find the center (mean) of each cluster
4. If the centers changed by more than c, iterate again
…
5. Close enough!
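The loop above can be sketched in plain Python, outside Spark, to show the per-iteration logic (a minimal sketch; `closest_center`, the threshold `c`, and the sample points are illustrative, and the Hands-On Exercise builds the distributed version with RDDs):

```python
import math

def closest_center(point, centers):
    # Index of the center nearest to this point (Euclidean distance)
    return min(range(len(centers)),
               key=lambda i: math.dist(point, centers[i]))

def kmeans(points, centers, c=0.0001):
    while True:
        # Step 2: assign each point to its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            clusters[closest_center(p, centers)].append(p)
        # Step 3: new center = mean of each (non-empty) cluster
        new_centers = [
            tuple(sum(v) / len(cl) for v in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)]
        # Step 4: iterate until the centers move less than c
        moved = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if moved <= c:
            return centers
```

In the Spark version, step 2 becomes a `map` over the points RDD and step 3 a `reduceByKey` per center, so each iteration is one distributed pass over the data.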
Chapter Topics
Common Programming Patterns in Spark
Solving Business Problems with Spark
!! Common Spark Use Cases
!! Iterative Algorithms in Spark
!! Graph Processing and Analysis
!! Machine Learning
!! Example: k-means
!! Conclusion
!! Hands-On Exercise: Iterative Processing in Spark
Key Points
! Spark is especially suited to big data problems that require iteration
– In-memory caching makes this very efficient
! Common in many types of analysis
– e.g., algorithms such as PageRank and k-means
! Spark includes specialized libraries that implement many common functions
– GraphX
– MLlib
! GraphX
– Highly efficient graph analysis (similar to Pregel et al.) plus graph construction, representation, and post-processing
! MLlib
– Efficient, scalable functions for machine learning (e.g., logistic regression, k-means)
Chapter Topics
Common Programming Patterns in Spark
Solving Business Problems with Spark
!! Common Spark Use Cases
!! Iterative Algorithms in Spark
!! Graph Processing and Analysis
!! Machine Learning
!! Example: k-means
!! Conclusion
!! Hands-On Exercise: Iterative Processing in Spark
Hands-On Exercise
! Hands-On Exercise: Iterative Processing in Spark
– Implement k-means in Spark in order to identify clustered location data points from Loudacre device status logs
– Find the geographic centers of device activity
! Please refer to the Hands-On Exercise Manual
Improving Spark Performance
Chapter 12

Course Chapters
!! Introduction (Course Introduction)
!! What is Apache Spark?
!! Spark Basics (Introduction to Spark)
!! Working With RDDs
!! The Hadoop Distributed File System (HDFS)
!! Running Spark on a Cluster (Distributed Data Processing with Spark)
!! Parallel Programming with Spark
!! Caching and Persistence
!! Writing Spark Applications
!! Spark Streaming
!! Common Spark Algorithms (Solving Business Problems with Spark)
!! Improving Spark Performance
!! Spark, Hadoop, and the Enterprise Data Center
!! Conclusion (Course Conclusion)
Improving Spark Performance
In this chapter you will learn
! How to improve the performance of Spark programs using shared variables
! Some common performance issues and how to find and address them
Chapter Topics
Improving Performance
Solving Business Problems with Spark
!! Shared Variables: Broadcast Variables
!! Hands-On Exercise: Using Broadcast Variables
!! Shared Variables: Accumulators
!! Hands-On Exercise: Using Accumulators
!! Common Performance Issues
!! Diagnosing Performance Problems
!! Conclusion
Broadcast Variables
! Broadcast variables are set by the driver and retrieved by the workers
! They are read-only after they have been set
! The first read of a broadcast variable retrieves and stores its value on the node
(Diagram: the driver program sets myVariable, which each executor retrieves via the Spark Master)
Example: Match User IDs with Requested Page Titles
227.35.151.122 - 184 [16/Sep/2013:00:03:51 +0100] "GET /KBDOC-00183.html HTTP/1.0" 200 …
146.218.191.254 - 133 [16/Sep/2013:00:03:48 +0100] "GET /KBDOC-00188.html HTTP/1.0" 200 …
176.96.251.224 - 12379 [16/Sep/2013:00:02:29 +0100] "GET /KBDOC-00054.html HTTP/1.0" …
…
KBDOC-00001:MeeToo 4.1 - Back up files
KBDOC-00002:Sorrento F24L - Change the phone ringtone and notification sound
KBDOC-00003:Sorrento F41L - Overheating
…
Example: Join a Web Server Log with Page Titles
logs = sc.textFile(logfile).map(fn)
pages = sc.textFile(pagefile).map(fn)
pagelogs = logs.join(pages)
(Diagram: the logs and pages RDDs are joined into pagelogs)
Example: Pass a Small Table as a Parameter
logs = sc.textFile(logfile).map(fn)
pages = dict(map(fn, open(pagefile)))
pagelogs = logs.map(lambda (userid, pageid):
                    (userid, pages[pageid]))
(Diagram: the driver ships a copy of the pages dict to every task)
Example: Broadcast a Small Table
logs = sc.textFile(logfile).map(…)
pages = dict(map(fn, open(pagefile)))
pagesbc = sc.broadcast(pages)
pagelogs = logs.map(lambda (userid, pageid):
                    (userid, pagesbc.value[pageid]))
(Diagram: the broadcast value is shipped to each worker only once, not per task)
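The replace-the-join-with-a-lookup idea can be checked outside Spark; in this sketch the log records and page-title table are tiny in-memory stand-ins (made-up data), and the map step performs one dict lookup per record instead of a shuffle-based join:

```python
# Small page-title table, as the broadcast dict would hold it
pages = {"KBDOC-00183": "MeeToo 4.1 - Back up files",
         "KBDOC-00188": "Sorrento F24L - Change the ringtone"}

# (userid, pageid) pairs, as parsed from the web server log
logs = [("184", "KBDOC-00183"), ("133", "KBDOC-00188")]

# The map-side "join": one dict lookup per record, no shuffle
pagelogs = [(userid, pages[pageid]) for (userid, pageid) in logs]
```

In Spark the same lookup runs inside `logs.map(...)` on the workers; broadcasting `pages` first means each worker fetches the table once rather than once per task.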
Broadcast Variables
! Why use broadcast variables?
– To minimize transfer of data over the network, which is usually the biggest bottleneck
– Spark broadcast variables are distributed to worker nodes using a very efficient peer-to-peer algorithm
Chapter Topics
Improving Performance
Solving Business Problems with Spark
!! Shared Variables: Broadcast Variables
!! Hands-On Exercise: Using Broadcast Variables
!! Shared Variables: Accumulators
!! Hands-On Exercise: Using Accumulators
!! Common Performance Issues
!! Diagnosing Performance Problems
!! Conclusion
Hands-On Exercise: Using Broadcast Variables
! Hands-On Exercise: Using Broadcast Variables
– Filter web server logs for requests from selected devices
– Use a broadcast variable for the list of target device models to filter on
! Please refer to the Hands-On Exercise Manual
Chapter Topics
Improving Performance
Solving Business Problems with Spark
!! Shared Variables: Broadcast Variables
!! Hands-On Exercise: Using Broadcast Variables
!! Shared Variables: Accumulators
!! Hands-On Exercise: Using Accumulators
!! Common Performance Issues
!! Diagnosing Performance Problems
!! Conclusion
Accumulators
! Accumulators are shared variables
– Worker nodes can add to the value
– Only the driver application can access the value
(Diagram: the driver program sets and reads myAccumulator; each executor adds to it)
Accumulator Example: Average Word Length
! Example: Calculate the average length of all the words in a dataset
def addTotals(word, words, letters):
    words += 1
    letters += len(word)

totalWords = sc.accumulator(0)
totalLetters = sc.accumulator(0.0)

words = sc.textFile(myfile) \
    .flatMap(lambda line: line.split())
words.foreach(lambda word:
    addTotals(word, totalWords, totalLetters))
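Outside Spark, the two running totals amount to the following (a plain-Python sketch of what the accumulators compute; the sample text is made up):

```python
text = ["the quick brown fox", "jumps over the lazy dog"]

total_words = 0      # what totalWords accumulates
total_letters = 0.0  # what totalLetters accumulates
for line in text:
    for word in line.split():
        total_words += 1
        total_letters += len(word)

# The driver reads the final values and computes the average
avg_length = total_letters / total_words
```

In the Spark version each task keeps its own partial sums, and Spark merges them into the driver-side accumulator values once per successful task.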
More About Accumulators
! Accumulator values will be reported to the driver only once per task
– If tasks must be rerun due to failure, Spark will correctly add only for the task which succeeds
! Only the driver can access the value
– Updates are only sent to the master, not to all workers
– Code will throw an exception if you use .value on worker nodes
! Supports the compound assignment operator, +=
! Can use integers or doubles
– sc.accumulator(0)
– sc.accumulator(0.0)
! Can customize to support any data type
– Extend the AccumulatorParam class
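A custom accumulator type must provide a zero value and a merge operation. The sketch below shows the two methods PySpark's AccumulatorParam interface expects, written as a standalone class and exercised without a SparkContext; the element-wise vector type is an illustrative choice:

```python
class VectorAccumulatorParam(object):
    """Accumulate fixed-length lists element-wise."""

    def zero(self, value):
        # An identity value with the same shape as the initial value
        return [0.0] * len(value)

    def addInPlace(self, v1, v2):
        # Merge two partial results; Spark calls this per task result
        return [a + b for a, b in zip(v1, v2)]

# With Spark, this would be used as (sketch):
#   acc = sc.accumulator([0.0, 0.0], VectorAccumulatorParam())
```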
Chapter Topics
Improving Performance
Solving Business Problems with Spark
!! Shared Variables: Broadcast Variables
!! Hands-On Exercise: Using Broadcast Variables
!! Shared Variables: Accumulators
!! Hands-On Exercise: Using Accumulators
!! Common Performance Issues
!! Diagnosing Performance Problems
!! Conclusion
Hands-On Exercise: Using Accumulators
! Hands-On Exercise: Using Accumulators
– Use accumulator variables to count the number of requests for different types of files in a set of web server logs
! Please refer to the Hands-On Exercise Manual
Chapter Topics
Improving Performance
Solving Business Problems with Spark
!! Shared Variables: Broadcast Variables
!! Hands-On Exercise: Using Broadcast Variables
!! Shared Variables: Accumulators
!! Hands-On Exercise: Using Accumulators
!! Common Performance Issues
!! Diagnosing Performance Problems
!! Conclusion
Performance Issue: Serialization
! Serialization affects
– Network bandwidth
– Memory (save memory by serializing)
! The default method of serialization in Spark is basic Java serialization
– Simple but slow
Using Kryo Serialization
! Use Kryo serialization for Scala and Java
– To enable, set spark.serializer = spark.KryoSerializer
! To enable Kryo for your custom classes
– Create a KryoRegistrator class and set
  spark.kryo.registrator=MyRegistrator
– Register your classes with Kryo
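In spark-defaults.conf, the two settings above might look as follows (a sketch: com.example.MyRegistrator is a placeholder class name, and in Spark 1.x the serializer is given by its fully qualified class name):

```
spark.serializer        org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator  com.example.MyRegistrator
```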
Performance Issue: Small Partitions
! Problem: filter() can result in partitions with small amounts of data
– Results in many small tasks
sc.textFile(file) \
    .filter(lambda s: s.startswith('I')) \
    .map(lambda s:
        (s.split()[0], (s.split()[1], s.split()[2])))
Solution: Repartition/Coalesce
! Solution: repartition(n)
– This is the same as coalesce(n, shuffle=True)
sc.textFile(file) \
    .filter(lambda s: s.startswith('I')) \
    .repartition(3) \
    .map(lambda s:
        (s.split()[0], (s.split()[1], s.split()[2])))
Performance Issue: Passing Too Much Data in Functions
! Problem: Passing large amounts of data to parallel functions results in poor performance
hashmap = some_massive_hash_map()
…
myrdd.map(lambda x: hashmap[x]).countByValue()
Performance Issue: Passing Too Much Data in Functions
! Solution:
– If the data is relatively small, use a broadcast variable
hashmap = some_massive_hash_map()
bhashmap = sc.broadcast(hashmap)
…
myrdd.map(lambda x: bhashmap.value[x]).countByValue()
– If the data is very large, parallelize it into an RDD
hashmap = some_massive_hash_map()
hashmaprdd = sc.parallelize(hashmap.items())
…
myrdd.join(hashmaprdd).countByValue()
Chapter Topics
Improving Performance
Solving Business Problems with Spark
!! Shared Variables: Broadcast Variables
!! Hands-On Exercise: Using Broadcast Variables
!! Shared Variables: Accumulators
!! Hands-On Exercise: Using Accumulators
!! Common Performance Issues
!! Diagnosing Performance Problems
!! Conclusion
Diagnosing Performance Issues (1)
! The Spark Application UI provides useful metrics to find performance problems
(Screenshot: stage details page of the Spark Application UI)
Diagnosing Performance Issues (2)
! Where to look for performance issues
– Scheduling and launching tasks
– Task execution
– Shuffling
– Collecting data
Scheduling and Launching Issues
! Scheduling and launching taking too long?
– Are you passing too much data to tasks?
– myrdd.map(lambda x: HugeLookupTable(x))
– Use a broadcast variable or an RDD
Task Execution Issues (1)
! Task execution taking too long?
– Are there tasks with a very high per-record overhead?
– e.g., mydata.map(dbLookup)
– Each lookup call opens a connection to the DB, reads, and closes it
– Try mapPartitions
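The per-record versus per-partition difference can be sketched without Spark. Here FakeConnection stands in for an expensive database client (everything in this block is illustrative), and connection_count shows how many connections each approach opens for one partition of records:

```python
connection_count = 0

class FakeConnection(object):
    """Stand-in for an expensive DB connection."""
    def __init__(self):
        global connection_count
        connection_count += 1
    def lookup(self, record):
        return record.upper()
    def close(self):
        pass

def db_lookup(record):
    # map() style: one connection per record
    conn = FakeConnection()
    try:
        return conn.lookup(record)
    finally:
        conn.close()

def db_lookup_partition(records):
    # mapPartitions() style: one connection per partition
    conn = FakeConnection()
    try:
        return [conn.lookup(r) for r in records]
    finally:
        conn.close()

partition = ["a", "b", "c"]
per_record = [db_lookup(r) for r in partition]   # opens 3 connections
connection_count = 0
per_partition = db_lookup_partition(partition)   # opens 1 connection
```

With mapPartitions, the setup cost is paid once per partition instead of once per record, which is the fix for high per-record overhead.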
Task Execution Issues (2)
! Are a few tasks taking much more time than others?
– Repartition, partition on a different key, or write a custom partitioner
(Chart: task durations should be fairly even; the example shows empty partitions due to filtering)
Shuffle Issues
! Writing shuffle results taking too long?
– Make sure you have enough memory for the buffer cache
– Make sure spark.local.dir is a local disk, ideally dedicated
(Chart: shuffle output is saved to disk if too big for the buffer cache; look for big write times)
Collecting Data to the Driver
! Are results taking too long?
– Beware of returning large amounts of data to the driver, for example with collect()
– Process data on the workers, not the driver
– Save large results to HDFS
(Chart: watch for disproportionate result serialization times)
Performance Analysis and Monitoring
! Spark supports integration with other performance tools
– Configurable metrics system built on the Coda Hale Metrics Library
– Metrics can be
– Saved to files
– Output to the console
– Viewed in the JMX console
– Sent to reporting tools like Graphite or Ganglia
Chapter Topics
Improving Performance
Solving Business Problems with Spark
!! Shared Variables: Broadcast Variables
!! Hands-On Exercise: Using Broadcast Variables
!! Shared Variables: Accumulators
!! Hands-On Exercise: Using Accumulators
!! Common Performance Issues
!! Diagnosing Performance Problems
!! Conclusion
Key Points
! Network bandwidth is often the major bottleneck
! For best performance, minimize data shuffling between workers
! Broadcast variables allow you to copy data to each worker once
– Use instead of an RDD for small datasets
! Accumulators allow workers to update a shared variable locally
! Use Kryo serialization instead of default Scala/Java serialization to speed up network copying of data and save memory
! Repartition to avoid unbalanced or very small partitions across nodes
Spark, Hadoop, and the Enterprise Data Center
Chapter 13

Course Chapters
!! Introduction (Course Introduction)
!! Why Spark?
!! Spark Basics (Introduction to Spark)
!! Working With RDDs
!! The Hadoop Distributed File System (HDFS)
!! Running Spark on a Cluster (Distributed Data Processing with Spark)
!! Parallel Programming with Spark
!! Caching and Persistence
!! Writing Spark Applications
!! Spark Streaming
!! Common Patterns in Spark Programming (Solving Business Problems with Spark)
!! Improving Spark Performance
!! Spark, Hadoop, and the Enterprise Data Center
!! Conclusion (Course Conclusion)
Spark and the Enterprise Data Center
In this chapter you will learn
! How Spark and Hadoop work together to provide enterprise-level data processing and analysis
! How to integrate Spark and Hadoop into an existing enterprise data center
Chapter Topics
Spark, Hadoop, and the Enterprise Data Center
Solving Business Problems with Spark
!! The Spark Hadoop Overview
!! Spark and MapReduce
!! Spark and the Hadoop Ecosystem
!! Putting It All Together: Integrating the Enterprise Data Center
!! Conclusion
!! Hands-On Exercise: Importing RDBMS Data Into Spark
The Spark Stack
! In addition to the core Spark engine, there are an ever-growing number of related projects
! Sometimes called the Berkeley Data Analytics Stack (BDAS)
(Diagram: related projects layered on top of Spark Core)
Spark and Hadoop (1)
! Spark was created to complement, not replace, Hadoop
(Diagram: Spark Streaming, Shark (SQL), machine learning, graph processing, and statistics libraries run on Spark Core; Hive, Impala, Cloudera Search, and HBase sit alongside MapReduce; everything runs on HDFS and YARN)
Spark and Hadoop (2)
! Spark uses HDFS
– Can use any Hadoop data source
– Uses Hadoop InputFormats and OutputFormats
– This means it can manipulate, e.g., Avro files and SequenceFiles
! Spark runs on YARN
– Can run on the same cluster with MapReduce jobs, Impala, etc.
! Spark works with the Hadoop ecosystem
– Flume
– Sqoop
– HBase
– …
Example: Yahoo
! Example use case: Yahoo is a major user of Hadoop
– Uses Hadoop for personalization, collaborative filtering, ad analytics, …
! MapReduce couldn't keep up
– Highly iterative machine learning algorithms
! Moved iterative processing to Spark
(Diagram: before, MapReduce handled both batch and iterative processing on YARN; after, MapReduce handles batch processing and Spark handles iterative processing)
Chapter Topics
Spark, Hadoop, and the Enterprise Data Center
Solving Business Problems with Spark
!! The Spark Hadoop Overview
!! Spark and MapReduce
!! Spark and the Hadoop Ecosystem
!! Putting It All Together: Integrating the Enterprise Data Center
!! Conclusion
!! Hands-On Exercise: Importing RDBMS Data Into Spark
Spark vs. Hadoop MapReduce
! Hadoop MapReduce
– Widely used, huge investment already made
– Supports and is supported by many complementary tools
– Mature, stable, well-tested technology
– Skilled developers available
! Spark
– Flexible
– Elegant
– Fast
– Changing rapidly
Sharing Data Between Spark and MapReduce Jobs
! Apache Avro is a binary file format for saving datasets
! Hadoop SequenceFiles are similar; used by many existing Hadoop data centers
! Both are supported by Spark
(Diagram: Spark and MapReduce exchange (key, value) data through HDFS)
Chapter Topics
Spark, Hadoop, and the Enterprise Data Center
Solving Business Problems with Spark
!! The Spark Hadoop Overview
!! Spark and MapReduce
!! Spark and the Hadoop Ecosystem
!! Putting It All Together: Integrating the Enterprise Data Center
!! Conclusion
!! Hands-On Exercise: Importing RDBMS Data Into Spark
The Hadoop Ecosystem
! In addition to HDFS and MapReduce, the Hadoop Ecosystem includes many additional components
! Some that may be of particular interest to Spark developers
– Data Storage: HBase
– Data Analysis: Hive and Impala
– Data Integration: Flume and Sqoop
Data Storage: HBase - The Hadoop Database
! HBase: a database layered on top of HDFS
– Provides interactive access to data
! Stores massive amounts of data
– Petabytes+
! High throughput
– Thousands of writes per second (per node)
! Handles sparse data well
– No wasted space for a row with empty columns
! Limited access model
– Optimized for lookup of a row by key rather than full queries
– No transactions: single-row operations only
Data Analysis: Hive
! What is Hive?
– Open source Apache project
– Built on Hadoop MapReduce
– HiveQL: an SQL-like interface to Hadoop
! Very active work is currently ongoing to port Hive's execution engine to Spark
– Will be able to use either MapReduce or Spark to execute queries
Data Analysis: Impala
! High-performance SQL engine for vast amounts of data
– Similar query language to HiveQL
– 10 to 50+ times faster than Hive or MapReduce
! Impala runs on Hadoop clusters
– Data stored in HDFS
– Dedicated SQL engine; does not depend on Spark, MapReduce, or Hive
! Developed by Cloudera
– 100% open source, released under the Apache software license
Data Integration: Flume (1)
! What is Flume?
– A service to move large amounts of data in real time
– Example: storing log files in HDFS
! Flume is
– Distributed
– Reliable and available
– Horizontally scalable
– Extensible
! Spark Streaming is integrated with Flume
Data Integration: Flume (2)
• Collect data as it is produced
• Files, syslogs, stdout, or a custom source
• Process in place
• e.g., encrypt, compress
• Write in parallel
• Scalable throughput
• Store in any format
• Text, compressed, binary, or a custom sink
(Diagram: agents collect and process data in place, then write in parallel to HDFS and Spark Streaming)
Data Integration: Sqoop - SQL to Hadoop (1)
! Typical scenario: data stored in an RDBMS is needed in a Spark application
– Lookup tables
– Legacy data
! It is possible to read directly from an RDBMS in your Spark application
– Can lead to the equivalent of a distributed denial of service (DDoS) attack on your RDBMS
– In practice: don't do it!
! Better idea: use Sqoop to import the data into HDFS beforehand
Data Integration: Sqoop - SQL to Hadoop (2)
! Sqoop: an open source tool originally written at Cloudera
– Now a top-level Apache Software Foundation project
! Imports tables from an RDBMS into HDFS
– Just one table, all tables, or portions of a table
– Uses MapReduce to actually import the data
! Uses a JDBC interface
– Works with virtually any JDBC-compatible database
! Imports data to HDFS as delimited text files or SequenceFiles
– Default is comma-delimited text files
! Can be used for incremental data imports
– First import retrieves all rows in a table
– Subsequent imports retrieve just rows created since the last import
Custom Sqoop Connectors
! Cloudera has partnered with other organizations to create custom Sqoop connectors
– Use a database's native protocols rather than JDBC
– Provide much faster performance
! Current systems supported by custom connectors include:
– Netezza
– Teradata
– Oracle Database (connector developed with Quest Software)
! Others are in development
! Custom connectors are not open source, but are free
– Available from the Cloudera Web site
Sqoop: Basic Syntax
! Standard syntax:
sqoop tool-name [tool-options]
! Tools include:
import
import-all-tables
list-tables
! Options include:
--connect
--username
--password
Sqoop: Example
! Example: import a table called employees from a database called personnel in a MySQL RDBMS
! Example: as above, but only records with an ID greater than 1000
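The commands for these two examples might look as follows (a sketch using the tools and options listed under Sqoop: Basic Syntax; the host dbhost, the training credentials, and the id column name are illustrative assumptions):

```
$ sqoop import --table employees \
    --connect jdbc:mysql://dbhost/personnel \
    --username training --password training

$ sqoop import --table employees \
    --connect jdbc:mysql://dbhost/personnel \
    --username training --password training \
    --where "id > 1000"
```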
Chapter Topics
Spark, Hadoop, and the Enterprise Data Center
Solving Business Problems with Spark
!! The Spark Hadoop Overview
!! Spark and MapReduce
!! Spark and the Hadoop Ecosystem
!! Putting It All Together: Integrating the Enterprise Data Center
!! Conclusion
!! Hands-On Exercise: Importing RDBMS Data Into Spark
Typical RDBMS Scenario
! Typical scenario:
– An interactive RDBMS serves queries from a web site
– Data is extracted and loaded into a data warehouse for processing and archiving
(Diagram: web server logs, orders, and site content flow from the RDBMS into an enterprise data warehouse, which feeds business intelligence tools and an archive)
OLTP: Online Transaction Processing
OLAP: Online Analytical Processing
OLAP Database Limitations
! All dimensions must be prematerialized
– Re-materialization can be very time consuming
! Daily data load-in times can increase
– Typically this leads to some data being discarded
Using Spark and Hadoop to Augment Existing Databases
! With Spark and Hadoop you can store and process all your data
– The 'Enterprise Data Hub'
! Reserve EDW space for high-value data
(Diagram: recommendations and site content flow between the RDBMS, Spark and Hadoop, and BI tools)
Benefits of Spark and Hadoop Over RDBMSs
! Processing power scales with data storage
– As you add more nodes for storage, you get more processing power 'for free'
! Views do not need prematerialization
– Ad hoc full or partial dataset queries are possible
! Total query size can be multiple petabytes
Traditional High-Performance File Servers
! Enterprise data is often held on large file servers, such as products from
– NetApp
– EMC
! Advantages
– Fast random access
– Many concurrent clients
! Disadvantages
– High cost per terabyte of storage
File Servers and HDFS
! The choice of storage depends on the expected access patterns
– Sequentially read, append-only data: HDFS
– Random access: file server
! HDFS can crunch sequential data faster
! Offloading data to HDFS leaves more room on file servers for 'interactive' data
! Use the right tool for the job!
Chapter Topics
Spark, Hadoop, and the Enterprise Data Center
Solving Business Problems with Spark
!! The Spark Hadoop Overview
!! Spark and MapReduce
!! Spark and the Hadoop Ecosystem
!! Putting It All Together: Integrating the Enterprise Data Center
!! Conclusion
!! Hands-On Exercise: Importing RDBMS Data Into Spark
Key Points
! Spark complements Hadoop MapReduce
! Spark works with other Hadoop Ecosystem projects
– HBase: the Hadoop NoSQL database
– Hive: SQL-like access to Hadoop data
– Impala: high-speed SQL query engine
– Flume: real-time data import
– Sqoop: RDBMS to (and from) HDFS
! Spark and Hadoop together can help you make your data center faster and cheaper
– Offload ETL processing
– Use all your data
Chapter Topics
Spark, Hadoop, and the Enterprise Data Center    Solving Business Problems with Spark
!! The Spark Hadoop Overview
!! Spark and MapReduce
!! Spark and the Hadoop Ecosystem
!! Putting It All Together: Integrating the Enterprise Data Center
!! Conclusion
!! Hands-On Exercise: Importing RDBMS Data Into Spark
Hands-On Exercise: Importing RDBMS Data Into Spark
! Hands-On Exercise: Importing RDBMS Data Into Spark
  – Import movies and movie ratings from MySQL to HDFS and load them into Spark RDDs
  – Calculate and save average movie ratings
! Please refer to the Hands-On Exercise Manual
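The average-rating calculation in this exercise follows Spark's usual pair-RDD pattern: map each rating to a (movieID, (rating, 1)) pair, combine pairs with the same key by summing ratings and counts, then divide. A minimal plain-Python sketch of the same aggregation follows; the sample data is invented for illustration, and in the exercise the equivalent steps would run on RDDs with map, reduceByKey, and mapValues.

```python
# Plain-Python sketch of the per-key average pattern used in the exercise.
# Each record is (movie_id, rating); the sample data is made up.
ratings = [(101, 4.0), (102, 5.0), (101, 3.0), (101, 5.0), (102, 4.0)]

# "map" step: turn each rating into a (key, (sum, count)) pair
pairs = [(movie_id, (rating, 1)) for movie_id, rating in ratings]

# "reduceByKey" step: combine pairs with the same key by adding sums and counts
totals = {}
for movie_id, (r, c) in pairs:
    s, n = totals.get(movie_id, (0.0, 0))
    totals[movie_id] = (s + r, n + c)

# "mapValues" step: divide sum by count to get the average per movie
averages = {movie_id: s / n for movie_id, (s, n) in totals.items()}
print(averages)  # {101: 4.0, 102: 4.5}
```

Carrying the count alongside the sum is the key design point: averages cannot be combined pairwise directly, but (sum, count) pairs can, which is what makes the reduceByKey step correct on a distributed dataset.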
Conclusion
Chapter 14
Course Chapters
!! Introduction    Course Introduction
!! What is Apache Spark?
!! Spark Basics    Introduction to Spark
!! Working With RDDs
!! The Hadoop Distributed File System (HDFS)
!! Running Spark on a Cluster
!! Parallel Programming with Spark    Distributed Data Processing with Spark
!! Caching and Persistence
!! Writing Spark Applications
!! Spark Streaming
!! Common Patterns in Spark Programming    Solving Business Problems with Spark
!! Improving Spark Performance
!! Spark, Hadoop, and the Enterprise Data Center
!! Conclusion    Course Conclusion
Conclusion (1)
During this course, you have learned
! What Apache Spark is, what problems it solves, and why you would want to use it
! The basic programming concepts of Spark: operations on Resilient Distributed Datasets (RDDs)
! How Spark works to distribute processing of big data across a cluster
! How Spark interacts with other components of a big data system: data storage and cluster resource management
! How to take advantage of key Spark features such as caching and shared variables to improve performance
! How to use Spark – either interactively using a Spark Shell or by writing your own Spark Applications
Conclusion (2)
! How to use Spark Streaming to process a live data stream in real time
! How Spark integrates with other parts of the Hadoop Ecosystem to provide Enterprise-level data processing
Next Steps
! Cloudera offers a number of other training courses, including:
  – Cloudera Hadoop Essentials
  – Cloudera Administrator Training for Apache Hadoop
  – Cloudera Developer Training for Apache Hadoop
  – Designing and Building Big Data Applications
  – Cloudera Data Analyst Training: Using Pig, Hive, and Impala with Hadoop
  – Cloudera Training for Apache HBase
  – Introduction to Data Science: Building Recommender Systems
  – Custom courses
! Cloudera also provides consultancy and troubleshooting services
  – Please ask your instructor for more information
Class Evaluation
! Please take a few minutes to complete the class evaluation
  – Your instructor will show you how to access the online form
Thank You!
! Thank you for attending this course
! If you have any further questions or comments, please feel free to contact us
  – Full contact details are on our Web site at http://www.cloudera.com/