Chapter 4: Spark

BIG DATA TRAINING
Intro to Spark
Thoại Nam

www.cce.hcmut.edu.vn   01/2021   hpcc.hcmut.edu.vn


Content  

• Shared/Distributed memory
• MapReduce drawbacks
• Spark

2 HPC  Lab  &  CCE  -­‐  HCMUT  Big  Data  2021  


Multiprocessor

• Consists of many fully programmable processors, each capable of executing its own program
• Shared address space architecture
• Classified into 2 types:
  o Uniform Memory Access (UMA) multiprocessors
  o Non-Uniform Memory Access (NUMA) multiprocessors


Memory hierarchy

• Most programs have a high degree of locality in their accesses
  o spatial locality: accessing things nearby previous accesses
  o temporal locality: reusing an item that was previously accessed
• The memory hierarchy tries to exploit locality to improve average memory access time

  Level                                   Speed    Size
  Processor registers / on-chip cache     ~1 ns    KB
  Second-level cache (SRAM)               ~10 ns   MB
  Main memory (DRAM)                      ~100 ns  GB
  Secondary storage (disk)                ~10 ms   TB
  Tertiary storage (disk/tape, "cloud")   ~10 s    PB


Traditional Network Programming

• Message-passing between nodes (MPI, RPC, etc.)
• Really hard to do at scale:
  o How to split the problem across nodes?
    - Important to consider network and data locality
  o How to deal with failures?
    - If a typical server fails every 3 years, a 10,000-node cluster sees 10 faults/day!


Data-Parallel Models

• Restrict the programming interface so that the system can do more automatically

"Here's an operation, run it on all of the data"
  o I don't care where it runs (you schedule that)
  o In fact, feel free to run it twice on different nodes


MapReduce Programming Model

• MapReduce turned out to be an incredibly useful and widely-deployed framework for processing large amounts of data. However, its design forces programs to comply with its computation model, which is:
  o Map: create <key, value> pairs
  o Shuffle: combine common keys together and partition them to reduce workers
  o Reduce: process each unique key and all of its associated values
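The three-phase model above can be mimicked in a few lines of plain Python; this is a single-machine sketch for intuition only, not the Hadoop API, and the documents are made-up data:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a <key, value> pair for every word
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group the values of each common key together
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: process each unique key and all of its associated values
    return {key: sum(values) for key, values in groups.items()}

docs = ["spark beats mapreduce", "mapreduce shuffles to disk"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
# counts["mapreduce"] == 2
```

In a real cluster the shuffle also partitions keys across reduce workers; here everything lives in one dictionary.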


MapReduce drawbacks

• Many applications had to run MapReduce over multiple passes to process their data
• All intermediate data had to be stored back in the file system (GFS at Google, HDFS elsewhere), which tended to be slow, since stored data was not just written to disks but also replicated
• The next MapReduce phase could not start until the previous MapReduce job completed fully
• MapReduce was also designed to read its data from a distributed file system (GFS/HDFS). In many cases, however, data resides within an SQL database or is streaming in (e.g., activity logs, remote monitoring).


MapReduce programmability

• Most real applications require multiple MR steps
  o Google indexing pipeline: 21 steps
  o Analytics queries (e.g. count clicks & top K): 2-5 steps
  o Iterative algorithms (e.g. PageRank): 10s of steps
• Multi-step jobs create spaghetti code
  o 21 MR steps -> 21 mapper and reducer classes


Problems with MapReduce

• MapReduce use cases showed two major limitations:
  (1) Difficulty of programming directly in MR
  (2) Performance bottlenecks
• In short, MapReduce doesn't compose well for large-scale applications
• Therefore, people built high-level frameworks and specialized systems.


Specialized Systems


Spark


Spark: A Brief History


Spark Summary

• Highly flexible and general-purpose way of dealing with big data processing needs
• Does not impose a rigid computation model, and supports a variety of input types
• Deals with text files, graph data, database queries, and streaming sources, and is not confined to a two-stage processing model
• Programmers can develop arbitrarily complex, multi-step data pipelines arranged in an arbitrary directed acyclic graph (DAG) pattern
• Programming in Spark involves defining a sequence of transformations and actions
• Spark supports map and reduce operations, so it can implement traditional MapReduce jobs, but it also supports SQL queries, graph processing, and machine learning
• Stores its intermediate results in memory, providing dramatically higher performance.


Spark ecosystem

• Spark SQL (SQL queries)
• Spark Streaming (stream processing)
• MLlib (machine learning)
• GraphX (graph processing)

Spark Core API (structured & unstructured data), with bindings for Scala, Python, Java, and R

Compute Engine (memory management, task scheduling, fault recovery, interaction with cluster management)

Cluster Resource Manager

Distributed Storage


Spark


Programmability

WordCount in 3 lines of Spark vs. WordCount in 50+ lines of Java MR
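The slide's code images are lost in this export. In PySpark, the three-line WordCount typically looks like the commented sketch below (assuming a SparkContext `sc` and a hypothetical input file); the stdlib code underneath mimics the same pipeline on one machine:

```python
# PySpark sketch (requires a running SparkContext, names illustrative):
#   counts = (sc.textFile("input.txt")
#               .flatMap(lambda line: line.split())
#               .map(lambda w: (w, 1))
#               .reduceByKey(lambda a, b: a + b))
# Single-machine mimic using only the standard library:
from collections import Counter

lines = ["to be or not", "to be"]
words = [w for line in lines for w in line.split()]  # flatMap
counts = Counter(words)                              # map + reduceByKey
# counts["to"] == 2, counts["be"] == 2
```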
Performance
Time to sort 100 TB

2013 record (Hadoop): 2100 machines, 72 minutes
2014 record (Spark): 207 machines, 23 minutes


RDD: Core Abstraction

Write programs in terms of distributed datasets and operations on them

Resilient Distributed Datasets
• Collections of objects spread across a cluster, stored in RAM or on disk
• Built through parallel transformations
• Automatically rebuilt on failure

Operations
• Transformations (e.g. map, filter, groupBy)
• Actions (e.g. count, collect, save)


RDD

Resilient Distributed Datasets are the primary abstraction in Spark: a fault-tolerant collection of elements that can be operated on in parallel

Two types:
• parallelized collections: take an existing single-node collection and parallelize it
• Hadoop datasets: files on HDFS or other Hadoop-compatible storage


RDD: Core Abstraction

• An application that uses Spark identifies data sources and the operations on that data. The main application, called the driver program, is linked with the Spark API, which creates a SparkContext (the heart of the Spark system, coordinating all processing activity). This SparkContext in the driver program connects to a Spark cluster manager. The cluster manager is responsible for allocating worker nodes, launching executors on them, and keeping track of their status
• Each worker node runs one or more executors. An executor is a process that runs an instance of a Java Virtual Machine (JVM)
• When each executor is launched by the manager, it establishes a connection back to the driver program
• The executor runs tasks on behalf of a specific SparkContext (application) and keeps related data in memory or disk storage
• A task is a transformation or action; the executor remains running for the duration of the driver program.


RDD: Core Abstraction

• A task is a transformation or action. The executor remains running for the duration of the application, which provides a performance advantage over the MapReduce approach, since new tasks can be started very quickly
• The executor also maintains a cache, which stores frequently-used data in memory instead of having to store it in a disk-based file as the MapReduce framework does
• The driver goes through the user's program, which consists of actions and transformations on data, and converts it into a series of tasks. The driver then sends tasks to the executors that registered with it
• A task is application code that runs in the executor on a Java Virtual Machine (JVM) and can be written in languages such as Scala, Java, Python, Clojure, and R. It is transmitted as a jar file to an executor, which then runs it.
RDD

• Data in Spark is a collection of Resilient Distributed Datasets (RDDs). This is often a huge collection of items. Think of an individual RDD as a table in a database or a structured file.
• Input data is organized into RDDs, which will often be partitioned across many computers. RDDs can be created in three ways:

(1) They can be present as any file stored in HDFS or any other storage system supported in Hadoop. This includes Amazon S3 (a key-value store, similar in design to Dynamo), HBase (Hadoop's version of Bigtable), and Cassandra (a NoSQL eventually-consistent database). This data is created by other services, such as event streams, text logs, or a database. For instance, the results of a specific query can be treated as an RDD. A list of files in a specific directory can also be an RDD.


RDD

(2) RDDs can be streaming sources using the Spark Streaming extension. This could be a stream of events from remote sensors, for example. For fault tolerance, a sliding window is used, where the contents of the stream are buffered in memory for a predefined time interval.


RDD

(3) An RDD can be the output of a transformation function. This allows one task to create data that can be consumed by another task, and is the way tasks pass data around.

For example, one task can filter out unwanted data and generate a set of key-value pairs, writing them to an RDD.

This RDD will be cached in memory (overflowing to disk if needed) and will be read by a task that reads the output of the task that created the key/value data.


RDD properties

• They are immutable. Their contents cannot be changed: a task can read from an RDD and create a new RDD, but it cannot modify an RDD. The framework automatically garbage-collects unneeded intermediate RDDs.
• They are typed. An RDD has some kind of structure within it, such as a key-value pair or a set of fields. Tasks need to be able to parse RDD streams.
• They are ordered. An RDD contains a set of elements that can be sorted. In the case of key-value lists, the elements are sorted by key. The sorting function can be defined by the programmer, but sorting enables one to implement things like Reduce operations.
• They are partitioned. Parts of an RDD may be sent to different servers. The default partitioning function is to send a row of data to the server corresponding to hash(key) mod server count.
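The default partitioning rule described above (hash(key) mod server count) can be sketched in plain Python; this is an illustration of the idea, not Spark's actual partitioner:

```python
def partition(records, server_count):
    """Assign each (key, value) record to a server by hashing its key."""
    servers = [[] for _ in range(server_count)]
    for key, value in records:
        servers[hash(key) % server_count].append((key, value))
    return servers

parts = partition([(1, "a"), (2, "c"), (4, "b")], 3)
# Records with the same key always land on the same server; for small ints
# CPython's hash(n) == n, so keys 1 and 4 both map to server 1 here.
```

The property that matters is deterministic placement: every record with a given key goes to the same partition, which is what makes grouping and reducing by key possible.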


RDD operations

• Spark allows two types of operations on RDDs: transformations and actions
  o Transformations read an RDD and return a new RDD. Example transformations are map, filter, groupByKey, and reduceByKey. Transformations are evaluated lazily, which means they are computed only when some task wants their data (the RDD that they generate). At that point, the driver schedules them for execution
  o Actions are operations that evaluate and return a value. When an action is requested on an RDD object, the necessary transformations are computed and the result is returned. Actions tend to be the things that generate the final output needed by a program. Example actions are reduce, taking samples, and writing to a file
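The lazy-transformation / eager-action split can be mimicked with Python generators; this sketch only illustrates the evaluation order, not Spark itself:

```python
# A generator pipeline: building it computes nothing, like a transformation;
# consuming it (sum below) triggers the whole chain, like an action.
log = []

def numbers():
    for n in range(5):
        log.append(n)  # record that this element was actually computed
        yield n

evens = (n for n in numbers() if n % 2 == 0)  # "transformation": still lazy
assert log == []                              # nothing has run yet
total = sum(evens)                            # "action": pulls data through
assert total == 6 and log == [0, 1, 2, 3, 4]
```

Spark works the same way: chaining filter and map just records a plan, and only count(), collect(), etc. cause the driver to schedule real work.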


Spark Essentials: Transformations

• groupByKey([numTasks]): when called on a dataset of (K, V) pairs, returns a dataset of (K, Seq[V]) pairs
• reduceByKey(func, [numTasks]): when called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function
• sortByKey([ascending], [numTasks]): when called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument
• join(otherDataset, [numTasks]): when called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key
• cogroup(otherDataset, [numTasks]): when called on datasets of type (K, V) and (K, W), returns a dataset of (K, Seq[V], Seq[W]) tuples; also called groupWith
• cartesian(otherDataset): when called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements)
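Two of these transformations can be mimicked on a single machine for intuition (plain Python, not the Spark API; the sample data is made up):

```python
from itertools import product

def reduce_by_key(pairs, func):
    # Aggregate all values sharing a key with the given reduce function
    out = {}
    for k, v in pairs:
        out[k] = func(out[k], v) if k in out else v
    return sorted(out.items())

def cartesian(xs, ys):
    # All (x, y) pairs across the two datasets
    return list(product(xs, ys))

summed = reduce_by_key([("a", 1), ("b", 2), ("a", 3)], lambda x, y: x + y)
# summed == [("a", 4), ("b", 2)]
crossed = cartesian([1, 2], ["x"])
# crossed == [(1, "x"), (2, "x")]
```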


Spark Essentials: Actions

• reduce(func): aggregate the elements of the dataset using a function func (which takes two arguments and returns one); func should be commutative and associative so that it can be computed correctly in parallel
• collect(): return all the elements of the dataset as an array at the driver program; usually useful after a filter or other operation that returns a sufficiently small subset of the data
• count(): return the number of elements in the dataset
• first(): return the first element of the dataset; similar to take(1)
• take(n): return an array with the first n elements of the dataset; currently not executed in parallel, instead the driver program computes all the elements
• takeSample(withReplacement, fraction, seed): return an array with a random sample of num elements of the dataset, with or without replacement, using the given random number generator seed
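Three of these actions have direct single-machine analogues in the standard library (illustrative mimics, not the Spark API):

```python
from functools import reduce as fold

data = [3, 1, 4, 1, 5]

total = fold(lambda a, b: a + b, data)  # reduce(func): commutative & associative
head = data[0]                          # first(), i.e. take(1)[0]
prefix = data[:2]                       # take(n): computed at the driver
# total == 14, head == 3, prefix == [3, 1]
```

Commutativity and associativity matter because a cluster reduces partitions independently and then merges the partial results in arbitrary order.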


Data storage

Spark does not care how data is stored. The appropriate RDD connector determines how to read data.

For example, RDDs can be the result of a query in a Cassandra database, and new RDDs can be written to Cassandra tables.

Alternatively, RDDs can be read from HDFS files or written to an HBase table.


Fault tolerance

• For each RDD, the driver tracks the sequence of transformations used to create it
• That means every RDD knows which tasks were needed to create it. If any RDD is lost (e.g., because a task that created it died), the driver can ask the task that generated it to recreate it
• The driver maintains the entire dependency graph, so this recreation may end up being a chain of transformation tasks going back to the original data.


Working with RDDs

textFile = sc.textFile("SomeFile.txt")

Transformations build new RDDs from existing RDDs:

linesWithSpark = textFile.filter(lambda line: "Spark" in line)

Actions compute a value from an RDD:

linesWithSpark.count()
74

linesWithSpark.first()
# Apache Spark


Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()

Execution: the driver turns the first count() (an action) into tasks and sends them to the workers. Each worker reads its HDFS block, computes the filtered and mapped messages, caches them in memory, and returns its results to the driver. When the second count() runs, each worker processes its data directly from the cache, with no HDFS reads.

Cache your data => faster results
Full-text search of Wikipedia:
• 60 GB on 20 EC2 machines
• 0.5 sec from memory vs. 20 s on-disk
Language Support

Python
  lines = sc.textFile(...)
  lines.filter(lambda s: "ERROR" in s).count()

Scala
  val lines = sc.textFile(...)
  lines.filter(x => x.contains("ERROR")).count()

Java
  JavaRDD<String> lines = sc.textFile(...);
  lines.filter(new Function<String, Boolean>() {
    Boolean call(String s) {
      return s.contains("error");
    }
  }).count();

• Standalone programs: Python, Scala & Java
• Interactive shells: Python & Scala
• Performance: Java & Scala are faster due to static typing, but Python is often fine


Expressive API

map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...


Fault Recovery

RDDs track lineage information that can be used to efficiently reconstruct lost partitions

Ex: messages = textFile(...).filter(_.startsWith("ERROR"))
                            .map(_.split('\t')(2))

HDFS File --filter(func = _.contains(...))--> Filtered RDD --map(func = _.split(...))--> Mapped RDD
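The lineage mechanism can be sketched in a few lines of Python. This toy class (a hypothetical illustration, not Spark's implementation) records the source data and the chain of transformations, so a lost result can always be recomputed from scratch, mirroring the example above:

```python
class LineageRDD:
    """Toy dataset that remembers its source and transformation chain."""
    def __init__(self, source, transforms=()):
        self.source = source          # original input data
        self.transforms = transforms  # ordered (kind, fn) transformation records

    def map(self, fn):
        return LineageRDD(self.source, self.transforms + (("map", fn),))

    def filter(self, fn):
        return LineageRDD(self.source, self.transforms + (("filter", fn),))

    def compute(self):
        # Replaying the lineage rebuilds the data, e.g. after a lost partition
        data = self.source
        for kind, fn in self.transforms:
            data = [fn(x) for x in data] if kind == "map" else [x for x in data if fn(x)]
        return data

logs = ["ERROR\tdisk\tfull", "INFO\tok", "ERROR\tnet\tdown"]
messages = (LineageRDD(logs)
            .filter(lambda s: s.startswith("ERROR"))
            .map(lambda s: s.split("\t")[2]))
# messages.compute() == ["full", "down"]
```

Because each derived dataset is immutable and records only how it was built, losing it costs recomputation time rather than data.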


Fault Recovery Results

(Chart) Iteration time (s) for iterations 1-10 of a job, with a failure during one iteration: normal iterations take about 56-59 s, while the iterations around the failure take 119 s and 81 s as lost partitions are recomputed from lineage.
