02 HadoopIntroEcosystem
201509
Course Chapters
Distributed Data Processing with Spark
10 Spark Basics
11 Working with RDDs in Spark
12 Aggregating Data with Pair RDDs
13 Writing and Deploying Spark Applications
14 Parallel Processing in Spark
15 Spark RDD Persistence
16 Common Patterns in Spark Data Processing
17 Spark SQL and DataFrames
© Copyright 2010-2015 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera.
Introduction to Hadoop and the Hadoop Ecosystem
Chapter Topics
Traditional Large-Scale Computation
Distributed Systems
[Diagram labels: Database; Hadoop Cluster]
Challenges with Distributed Systems
§ The solution?
– Hadoop!
Chapter Topics
What is Apache Hadoop?
[Diagram labels: Workload Management; Data Storage; Data Integration]
Common Hadoop Use Cases
§ What do these workloads have in common? Nature of the data…
– Volume
– Velocity
– Variety
Distributed Systems: The Data Bottleneck (1)
Distributed Systems: The Data Bottleneck (2)
Big Data Processing with Hadoop
Core Hadoop
[Diagram labels: A Hadoop Cluster; Processing; Spark; MapReduce; YARN; HDFS]
Big Data Processing
[Diagram labels: Data Analysis; Data Sources; Data Storage; Data Processing and Exploration; Spark; Impala; Search; Hadoop Distributed File System (HDFS); Hadoop MapReduce; Hive; HBase; Pig]
Chapter Topics
Data Ingest and Storage
[Diagram label: HBase]
Data Storage
Data Ingest Tools (1)
§ HDFS
– Direct file transfer
§ Apache Sqoop
– High-speed import to HDFS from a relational database (and vice versa)
– Supports many data storage systems
– e.g. Netezza, Mongo, MySQL, Teradata, Oracle
– Covered later in this course
Data Ingest Tools (2)
§ Apache Flume
– Distributed service for ingesting streaming data
– Ideally suited for event data from multiple systems
– For example, log files
– Covered later in this course
§ Apache Kafka
– A high-throughput, scalable messaging system
– Distributed, reliable publish-subscribe system
– Integrates with Flume and Spark Streaming
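The publish-subscribe idea named above can be sketched in a few lines of plain Python. This is a toy, in-memory illustration only and not the Kafka API: the class `ToyBroker`, the topic name `weblogs`, and the subscriber names are all hypothetical. It assumes a model in which publishers append messages to a topic and each subscriber reads from its own position, so one subscriber's progress does not affect another's.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a message broker (hypothetical, not the Kafka API)."""

    def __init__(self):
        self._log = defaultdict(list)      # topic -> append-only list of messages
        self._offsets = defaultdict(int)   # (topic, subscriber) -> next index to read

    def publish(self, topic, message):
        """Append a message to the end of the topic."""
        self._log[topic].append(message)

    def poll(self, topic, subscriber):
        """Return the messages this subscriber has not yet read on the topic."""
        start = self._offsets[(topic, subscriber)]
        messages = self._log[topic][start:]
        self._offsets[(topic, subscriber)] = len(self._log[topic])
        return messages


broker = ToyBroker()
broker.publish("weblogs", "GET /index.html")
broker.publish("weblogs", "GET /cart")
first_a = broker.poll("weblogs", "subscriber_a")   # reads both messages so far
broker.publish("weblogs", "POST /checkout")
second_a = broker.poll("weblogs", "subscriber_a")  # reads only the new message
all_b = broker.poll("weblogs", "subscriber_b")     # independent reader sees everything
print(first_a, second_a, all_b)
```

A real deployment adds what this sketch leaves out, including distribution across machines, durability, and fault tolerance.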
Chapter Topics
Apache Spark: An Engine For Large-scale Data Processing
Hadoop MapReduce: The Original Hadoop Processing Engine
Apache Pig: Scripting for MapReduce
§ Apache Pig builds on Hadoop to offer high-level data processing
– This is an alternative to writing low-level MapReduce code
– Pig is especially good at joining and transforming data
§ The Pig interpreter runs on the client machine
– Turns Pig Latin scripts into MapReduce or Spark jobs
– Submits those jobs to a Hadoop cluster
– Covered in Cloudera Data Analyst Training
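-- Load customers and orders from HDFS (no field types declared)
people = LOAD '/user/training/customers' AS (cust_id, name);
orders = LOAD '/user/training/orders' AS (ord_id, cust_id, cost);
-- Group the orders by customer, then total the cost within each group
groups = GROUP orders BY cust_id;
totals = FOREACH groups GENERATE group, SUM(orders.cost) AS t;
-- Join each customer's total with the matching customer record and print it
result = JOIN totals BY group, people BY cust_id;
DUMP result;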
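To make the data flow of the Pig Latin script above concrete, here is a minimal plain-Python sketch of the same three steps (group, sum, join) on a few in-memory rows. The sample rows and the function names `total_cost_by_customer` and `join_totals_with_people` are hypothetical and stand in for the HDFS inputs. The sketch shows what the script computes, not how Pig executes it on a cluster.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for the two HDFS inputs.
people = [(1, "Alice"), (2, "Bob")]                         # (cust_id, name)
orders = [(100, 1, 25.0), (101, 2, 10.0), (102, 1, 5.0)]    # (ord_id, cust_id, cost)


def total_cost_by_customer(orders):
    """Mirror GROUP orders BY cust_id followed by SUM(orders.cost) per group."""
    totals = defaultdict(float)
    for _ord_id, cust_id, cost in orders:
        totals[cust_id] += cost
    return dict(totals)


def join_totals_with_people(totals, people):
    """Mirror the inner JOIN of totals (by group) with people (by cust_id)."""
    return [(cust_id, totals[cust_id], cust_id, name)
            for cust_id, name in people
            if cust_id in totals]


totals = total_cost_by_customer(orders)
result = join_totals_with_people(totals, people)
for row in result:
    print(row)
```

Each output tuple holds the fields of the totals relation followed by the fields of the people relation, so the customer ID appears twice, once from each side of the join.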
Chapter Topics
Cloudera Impala: High Performance SQL
Apache Hive: SQL on MapReduce
Cloudera Search: A Platform For Data Exploration
Chapter Topics
Hue: The UI for Hadoop
Apache Oozie: Workflow Management
§ Oozie
– Workflow engine for Hadoop jobs
– Defines dependencies between jobs
§ The Oozie server submits the jobs to the Hadoop cluster in the correct sequence
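Submitting jobs "in the correct sequence" amounts to ordering them so that every job runs after the jobs it depends on. The sketch below illustrates that idea with Python's standard-library `graphlib` (Python 3.9+). It is not Oozie's API or workflow format, and the job names and the `submission_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each job maps to the set of jobs it depends on.
workflow = {
    "ingest_logs": set(),
    "clean_logs": {"ingest_logs"},
    "load_customers": set(),
    "join_data": {"clean_logs", "load_customers"},
    "report": {"join_data"},
}


def submission_order(dependencies):
    """Return one job order in which every job follows all of its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = submission_order(workflow)
print(order)
```

Several valid orders may exist (for example, `load_customers` may come before or after `ingest_logs`); the only guarantee is that dependencies come first. `TopologicalSorter` raises `CycleError` if the dependencies contain a cycle.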
Apache Sentry: Hadoop Security
Chapter Topics
Introduction to the Homework Labs
Scenario Explanation (1)
[Logo: Loudacre mobile]
Scenario Explanation (2)
Introduction to Homework Labs: Getting Started
Introduction to Homework Labs: Classroom Virtual Machine
Chapter Topics
Essential Points
Bibliography
The following offer more information on topics discussed in this chapter
§ Hadoop: The Definitive Guide (published by O'Reilly)
– http://tiny.cloudera.com/hadooptdg
§ Cloudera Essentials for Apache Hadoop – free online training
– http://tiny.cloudera.com/esscourse