Cloudera Hadoop Introduction PDF
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
The Motivation for Hadoop
Traditional Large-Scale Computation
The Data Explosion
[Figure: data created in 2011 (chart not recoverable)]
Current Solutions
[Figure: chart with a time axis from 2005 to 2015 (content not recoverable)]
Why Use Hadoop?
Hadoop handles any data type, in any quantity
- Structured, unstructured
- Schema, no schema
- High volume, low volume
- All kinds of analytic applications
Hadoop grows with your business
- Proven at petabyte scale
- Capacity and performance grow simultaneously
- Leverages commodity hardware to mitigate costs
Hadoop is 100% Apache licensed and open source
- No vendor lock-in
- Community development
- Rich ecosystem of related projects
Hadoop helps you derive the complete value of all your data
- Drives revenue by extracting value from data that was previously out of reach
- Controls costs by storing data more affordably than any other platform
The Origins of Hadoop
- Open source web crawler project created by Doug Cutting
- Publishes MapReduce and GFS Paper
- Open Source MapReduce and HDFS project created by Doug Cutting
- Runs 4,000-node Hadoop cluster
- Hadoop wins Terabyte sort benchmark
- Releases CDH and Cloudera Enterprise
Core Hadoop: HDFS
[Figure: numbered file blocks replicated across cluster nodes]
HDFS breaks incoming files into blocks and stores them redundantly across the cluster.
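The block-and-replica idea described above can be sketched in a few lines of Python. This is an illustration only, not HDFS code: the function names, the node names, the tiny block size, and the round-robin placement are all assumptions made for the example (real HDFS uses much larger blocks and its own placement policy).

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut a byte string into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, nodes: list, replication: int) -> dict:
    """Assign every block to `replication` distinct nodes (simple round-robin)."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        start = block_id % len(nodes)
        placement[block_id] = [
            nodes[(start + k) % len(nodes)] for k in range(replication)
        ]
    return placement


def readable_blocks(placement: dict, failed_node: str) -> set:
    """Return the blocks that still have at least one replica after a node fails."""
    return {
        block_id
        for block_id, holders in placement.items()
        if any(node != failed_node for node in holders)
    }


# Hypothetical five-node cluster and a 10-byte "file" cut into 4-byte blocks.
nodes = ["node1", "node2", "node3", "node4", "node5"]
blocks = split_into_blocks(b"0123456789", block_size=4)
placement = place_blocks(len(blocks), nodes, replication=3)
```

Because each block lives on three different nodes in this sketch, losing any single node leaves every block readable, which is the point of storing data redundantly.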
Core Hadoop: MapReduce
[Figure: MapReduce tasks running across the cluster nodes that hold the numbered blocks]
The MapReduce framework processes large jobs in parallel across many nodes and combines the results.
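To make "process in parallel, then combine" concrete, here is a minimal single-process simulation of the map, shuffle, and reduce steps using a word count. It does not use the Hadoop API; the function names and the sample input are assumptions chosen for illustration, and on a real cluster each chunk would be mapped on a different node.

```python
from collections import defaultdict


def map_phase(chunk: str) -> list:
    """Emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs) -> dict:
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list) -> tuple:
    """Combine all values for one key into a single result."""
    return key, sum(values)


def run_job(chunks: list) -> dict:
    """Map each chunk independently, then shuffle and reduce the results."""
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Each string stands in for a block of input processed by a separate node.
counts = run_job(["Hadoop stores data", "Hadoop processes data"])
```

The map step never looks at more than one chunk, which is what lets the work be spread across many nodes; only the reduce step sees values from every chunk.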
Hadoop and Databases
You need
Typical Datacenter Architecture
[Figure: datacenter architecture diagram; surviving label: Business intelligence apps]
Adding Hadoop To The Mix
[Figure: datacenter architecture with Hadoop added; surviving labels: New data, Interactive database, Hadoop, Oracle, SAP..., Recommendations, etc...]
Why Cloudera?
Cloudera is
Experienced and Proven Across Hundreds of Deployments
The Only Vendor With a Complete Solution
Cloudera University
- Equipping the Big Data workforce
- 12,000+ trained
Partner Ecosystem
- 250+ partners across hardware, software, platforms and services
Professional Services
- Use case discovery, pilots, process & team development
Solving Problems with Hadoop
Eight Common Hadoop-able Problems
1. Modeling True Risk
Challenge:
- How much risk exposure does an organization really have with each customer?
- Multiple sources of data and across multiple lines of business
Solution with Hadoop:
- Source and aggregate disparate data sources to build data picture
  - e.g. credit card records, call recordings, chat sessions, emails, banking activity
- Structure and analyze
  - Sentiment analysis, graph creation, pattern recognition
Typical Industry:
- Financial Services (banks, insurance companies)
2. Customer Churn Analysis
Challenge:
- Why is an organization really losing customers?
- Data on these factors comes from different sources
Solution with Hadoop:
- Rapidly build behavioral model from disparate data sources
- Structure and analyze with Hadoop
  - Traversing
  - Graph creation
  - Pattern recognition
Typical Industry:
- Telecommunications, Financial Services
3. Recommendation Engine/Ad Targeting
Challenge:
- Using user data to predict which products to recommend
Solution with Hadoop:
- Batch processing framework
  - Allow execution in parallel over large datasets
- Collaborative filtering
  - Collecting taste information from many users
  - Utilizing information to predict what similar users like
Typical Industry:
- Ecommerce, Manufacturing, Retail
- Advertising
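The collaborative filtering idea on this slide (collect taste information from many users, then predict what similar users like) can be sketched as follows. This is a small in-memory illustration under assumed names and made-up ratings, using cosine similarity as one common choice of similarity measure; it is not how any particular Hadoop library implements recommendations.

```python
from math import sqrt


def cosine_similarity(a: dict, b: dict) -> float:
    """Similarity of two users' rating dicts, based on the items both rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[item] * b[item] for item in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(target: str, ratings: dict, top_n: int = 3) -> list:
    """Rank items the target has not rated by similarity-weighted ratings."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == target:
            continue
        sim = cosine_similarity(ratings[target], other_ratings)
        if sim <= 0:
            continue  # ignore users with no overlap in taste
        for item, rating in other_ratings.items():
            if item in ratings[target]:
                continue  # only recommend items the target has not rated
            scores[item] = scores.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    predicted = {item: scores[item] / weights[item] for item in scores}
    return sorted(predicted, key=predicted.get, reverse=True)[:top_n]


# Hypothetical taste data: user -> {product: rating}.
ratings = {
    "ann": {"tent": 5, "stove": 4},
    "bob": {"tent": 5, "stove": 4, "lantern": 5},
    "cy": {"novel": 5, "lamp": 4},
}
suggestions = recommend("ann", ratings)
```

Here "bob" shares ann's tastes, so his extra item is suggested, while "cy" has nothing in common with ann and is ignored. At scale, the pairwise similarity computation is the part that benefits from batch execution in parallel.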
4. Point of Sale Transaction Analysis
Challenge:
- Analyzing Point of Sale (PoS) data to target promotions and manage operations
- Sources are complex and data volumes grow across chains of stores and other sources
Solution with Hadoop:
- Batch processing framework
  - Allow execution in parallel over large datasets
- Pattern recognition
  - Optimizing over multiple data sources
  - Utilizing information to predict demand
Typical Industry:
- Retail
5. Analyzing Network Data to Predict Failure
Challenge:
- Analyzing real-time data series from a network of sensors
- Calculating average frequency over time is extremely tedious because of the need to analyze terabytes
Solution with Hadoop:
- Take the computation to the data
  - Expand from simple scans to more complex data mining
- Better understand how the network reacts to fluctuations
  - Discrete anomalies may, in fact, be interconnected
- Identify leading indicators of component failure
Typical Industry:
- Utilities, Telecommunications, Data Centers
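One way to picture "take the computation to the data" for sensor averages: each node reduces its own readings to small (sum, count) partials, and only those partials are merged. The sketch below is a plain-Python illustration; the record layout (sensor id, timestamp in seconds, value), the hourly window, and the function names are assumptions, not part of the original slide.

```python
from collections import defaultdict

SECONDS_PER_HOUR = 3600


def local_partials(readings) -> dict:
    """Summarize one node's readings as (sum, count) per (sensor, hour)."""
    partials = defaultdict(lambda: [0.0, 0])
    for sensor_id, timestamp, value in readings:
        key = (sensor_id, timestamp // SECONDS_PER_HOUR)
        partials[key][0] += value
        partials[key][1] += 1
    return partials


def merge_partials(all_partials) -> dict:
    """Combine per-node partials and return the average per (sensor, hour)."""
    merged = defaultdict(lambda: [0.0, 0])
    for partials in all_partials:
        for key, (total, count) in partials.items():
            merged[key][0] += total
            merged[key][1] += count
    return {key: total / count for key, (total, count) in merged.items()}


# Hypothetical readings stored on two different nodes.
node_a = [("s1", 0, 10.0), ("s1", 1800, 20.0)]
node_b = [("s1", 3000, 30.0), ("s1", 3600, 50.0)]
averages = merge_partials([local_partials(node_a), local_partials(node_b)])
```

Sums and counts are shipped instead of raw readings because averages of averages would be wrong when nodes hold different numbers of readings.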
6. Threat Analysis/Trade Surveillance
Challenge:
- Detecting threats in the form of fraudulent activity or attacks
  - Large data volumes involved
  - Like looking for a needle in a haystack
Solution with Hadoop:
- Parallel processing over huge datasets
- Pattern recognition to identify anomalies, i.e., threats
Typical Industry:
- Security, Financial Services
- General: spam fighting, click fraud
7. Search Quality
Challenge:
- Providing real-time meaningful search results
Solution with Hadoop:
- Analyzing search attempts in conjunction with structured data
- Pattern recognition
  - Browsing pattern of users performing searches in different categories
Typical Industry:
- Web, Ecommerce
8. Data Sandbox
Challenge:
- Data Deluge
  - Don't know what to do with the data or what analysis to run
Solution with Hadoop:
- Dump all this data into an HDFS cluster
- Use Hadoop to start trying out different analyses on the data
- See patterns to derive value from data
Typical Industry:
- Common across all industries
Orbitz: Major Online Travel Booking Service
Challenge:
- Orbitz performs millions of searches and transactions daily, which leads to hundreds of gigabytes of log data every day
  - Not all of that data has value (i.e., it is logged for historic reasons)
  - Much is quite valuable
  - Want to capture even more data
Solution with Hadoop:
- Hadoop provides Orbitz with efficient, economical, scalable, and reliable storage and processing of these large amounts of data
  - Hadoop places no constraints on how data is processed
Before Hadoop
[Figure: architecture diagram; surviving label: Data Warehouse]
After Hadoop
- Hadoop was deployed late 2009/early 2010 to begin collecting this non-transactional data
  - Orbitz has been using CDH for that entire period with great success.
- Much of this non-transactional data is contained in Web analytics logs
What Now?
Major National Bank
Background
- 100M customers
- Relational data: 2.5B records/month
  - Card transactions, home loans, auto loans, etc.
- Data volume growing by hundreds of TB/year
- Needs to incorporate non-relational data as well
  - Web clicks, check images, voice data
Uses Hadoop to
- Identify credit risk, fraud
- Proactively manage capital
Financial Regulatory Body
Leading North American Retailer
Digital Media Company
Leader in Real-Time Advertising Technology
Netflix
Before Hadoop
- Nightly processing of logs
- Imported into a database
- Analysis/BI
As data volume grew, it took more than 24 hours to process and load a day's worth of logs
Today, an hourly Hadoop job processes logs for quicker availability of the data for analysis/BI
Currently ingesting approximately 1TB of data per day
Hadoop as Cheap Storage
Yahoo
- Before Hadoop: $1 million for 10TB storage
- With Hadoop: $1 million for 1 PB of storage
Other Large Company
- Before Hadoop: $5 million to store data in Oracle
- With Hadoop: $240K to store the data in HDFS
Facebook
- Hadoop as unified storage
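The figures on this slide can be turned into per-terabyte costs with a little arithmetic. The snippet below only restates the slide's numbers; the one assumption is the decimal conversion of 1 PB = 1,000 TB, and the variable names are chosen for the example.

```python
TB_PER_PB = 1000  # assumption: decimal units (1 PB = 1,000 TB)


def cost_per_tb(total_cost_usd: float, capacity_tb: float) -> float:
    """Return the storage cost in US dollars per terabyte."""
    return total_cost_usd / capacity_tb


# Yahoo: the same $1 million budget before and after Hadoop.
yahoo_before = cost_per_tb(1_000_000, 10)             # 10 TB
yahoo_after = cost_per_tb(1_000_000, 1 * TB_PER_PB)   # 1 PB
yahoo_ratio = yahoo_before / yahoo_after

# Other large company: same data, two different storage bills.
other_ratio = 5_000_000 / 240_000

print(f"Yahoo: ${yahoo_before:,.0f}/TB before, ${yahoo_after:,.0f}/TB with Hadoop")
print(f"Yahoo cost ratio: {yahoo_ratio:.0f}x; other company: {other_ratio:.1f}x")
```

Under that unit assumption, the Yahoo figures work out to $100,000 per TB before and $1,000 per TB with Hadoop, a factor of 100, while the second example is a reduction of roughly 21 times.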
Hadoop Jobs
The Roles People Play
- System Administrators
- Developers
- Analysts
- Data Stewards
System Administrators
Required skills:
- Strong Linux administration skills
- Networking knowledge
- Understanding of hardware
Job responsibilities:
- Install, configure and upgrade Hadoop software
- Manage hardware components
- Monitor the cluster
- Integrate with other systems (e.g., Flume and Sqoop)
Developers
Required skills:
- Strong Java or scripting capabilities
- Understanding of MapReduce and algorithms
Job responsibilities:
- Write, package and deploy MapReduce programs
- Optimize MapReduce jobs and Hive/Pig programs
Data Analyst/Business Analyst
Required skills:
- SQL
- Understanding data analytics/data mining
Job responsibilities:
- Extract intelligence from the data
- Write Hive and/or Pig programs
Data Steward
Required skills:
- Data modeling and ETL
- Scripting skills
Job responsibilities:
- Cataloging the data (analogous to a librarian for books)
- Manage data lifecycle, retention
- Data quality control with SLAs
Combining Roles
Finding The Right People
Cloudera's Academic Partnership Program
Cloudera's Academic Partnerships: Overview
Cloudera's Academic Partnerships: Goals
Cloudera's Academic Partnerships: Financial Overview
- Cloudera does not currently charge Academic Partners for usage of the training materials
  - This is a program designed solely to facilitate students' learning of an emerging technology
  - Our reward is helping the industry grow, and ideally the exposure to Cloudera is a positive one which will be remembered when the students we service today are making decisions for their business tomorrow
- Instructors who are delivering the Cloudera courses are eligible for a 50% discount on commercial training courses delivered by Cloudera
  - We want to make sure the folks leading the classes have the skillset to help their students be successful
- Normally we provide universities with courses focused on the roles of Hadoop Developer or Administrator
Ian Wrigley
[email protected]