Introduction to Hadoop
Mark Fei, Cloudera
Strata + Hadoop World 2012, New York City, October 23, 2012
Current: Senior Instructor at Cloudera
Past: Professional Services Education, VMware; Senior Member of Technical Staff, Hill Associates; Sales Engineer, Nortel Networks; Systems Programmer at a large bank; banking applications software developer
What's Ahead?
Open source Apache project
Harnesses the power of commodity servers
Distributed and fault-tolerant
HDFS (storage) and MapReduce (processing)
A large ecosystem
Vendor integration
BI / Analytics
ETL
Database
Hardware
About Cloudera
Cloudera is the commercial Hadoop company
Founded by leading experts on Hadoop from Facebook, Google, Oracle and Yahoo
Provides consulting and training services for Hadoop users
Staff includes several committers to Hadoop projects
Cloudera Software
A single, easy-to-install package from the Apache Hadoop core repository
Includes a stable version of Hadoop, plus critical bug fixes and solid new features from the development version
100% open source: Apache Hadoop, Apache Hive, Apache Pig, Apache HBase, Apache ZooKeeper, Apache Flume, Apache Hue, Apache Oozie, Apache Sqoop, Apache Mahout
Components
A Coherent Platform
Components of the CDH Stack
File System Mount: FUSE-DFS
Storage
UI Framework: Hue
SDK: Hue SDK
Workflow: Apache Oozie
Scheduling: Apache Oozie
Metadata: Apache Hive
Coordination: Apache ZooKeeper
Cloudera Enterprise
Big data storage, processing and analytics platform based on CDH
End-to-end deployment, management, and operation of CDH
Provides sophisticated cluster monitoring tools not present in the free version
A team of experts on call to help you meet your Service Level Agreements (SLAs)
Production support
Cloudera University
Cloudera Developer Training for Apache Hadoop
Cloudera Administrator Training for Apache Hadoop
Cloudera Training for Apache HBase
Cloudera Training for Apache Hive and Pig
Cloudera Essentials for Apache Hadoop
More courses coming, including customized on-site private classes
Cloudera Certified Developer for Apache Hadoop (CCDH)
Cloudera Certified Administrator for Apache Hadoop (CCAH)
Cloudera Certified Specialist in Apache HBase (CCSHB)
Industry-recognized Certifications
Professional Services
Use Case Discovery
New Hadoop Deployment
Proof of Concept
Production Pilot
Process and Team Development
Hadoop Deployment Certification
Notably, the Google File System and MapReduce papers
Early adoption by Yahoo, Facebook and others
2002: Nutch spun off from Lucene
2003: Google publishes GFS paper
2004: Google publishes MapReduce paper
2005: Nutch rewritten for MapReduce
Velocity
Processes are increasingly automated
Systems are increasingly interconnected
People are increasingly interacting online
Volume
Twitter processes 340 million messages
Facebook stores 2.7 billion comments and Likes
Google processes about 24 petabytes of data
More than 200 million e-mail messages are sent
Foursquare processes more than 2,000 check-ins
Variety
Science: medical imaging, sensor data, genome sequencing, weather data, satellite feeds, etc.
Industry: financial, pharmaceutical, manufacturing, insurance, online, energy, retail data
Legacy: sales data, customer behavior, product databases, accounting data, etc.
System Data: log files, health & status feeds, activity streams, network messages, web analytics, intrusion detection, spam filters
Formats: XML, CSV, EDI, log files, objects, SQL, text, JSON, binary, etc.
Batch processing
Parallel execution
Spread data over a cluster of servers and take the computation to the data
Previously impossible/impractical to do this analysis
Analysis conducted at lower cost
Analysis conducted in less time
Greater flexibility
Linear scalability
Text mining, index building, graph creation and analysis, pattern recognition
1. Modeling true risk
2. Customer churn analysis
3. Recommendation engine
4. PoS transaction analysis
5. Analyzing network data to predict failure
6. Threat analysis
7. Search quality
8. Data sandbox
Solution with Hadoop: Source and aggregate disparate data sources to build a data picture, e.g. credit card records, call recordings, chat sessions, emails, banking activity
Sentiment analysis, graph creation, pattern recognition
Typical Industry: Financial Services (banks, insurance companies)
Solution with Hadoop: Rapidly build a behavioral model from disparate data sources; structure and analyze with Hadoop
Typical Industry:
Collaborative filtering
Collecting taste information from many users
Utilizing information to predict what similar users like
Typical Industry: Ecommerce, Manufacturing, Retail, Advertising
Sources are complex and data volumes grow across chains of stores and other sources
Allow execution in parallel over large datasets
Optimizing over multiple data sources
Utilizing information to predict demand
Typical Industry: Retail
Pattern recognition
Typical Industry:
Calculating average frequency over time is extremely tedious because of the need to analyze terabytes
Expand from simple scans to more complex data mining
Discrete anomalies may, in fact, be interconnected
Solution with Hadoop: Parallel processing over huge datasets; pattern recognition to identify anomalies, i.e., threats
Typical Industry:
7. Search Quality
Challenge: Providing real-time, meaningful search results
Solution with Hadoop: Analyzing search attempts in conjunction with structured data; pattern recognition
Typical Industry:
8. Data Sandbox
Challenge: Data deluge
Solution with Hadoop: Dump all this data into an HDFS cluster; use Hadoop to start trying out different analyses on the data; see patterns to derive value from data
Typical Industry:
We're generating more data than ever before
Fortunately, the size and cost of storage have kept pace
[Chart: disk capacity (GB) over time, with the cost per GB falling from $157 to $1.05 to $0.05]
Disk performance has also increased in the last 15 years
Unfortunately, transfer rates haven't kept pace with capacity
Year    Capacity (GB)
1997    2.1
2004    200
2012    3,000
[Diagram: compute repeatedly pulls data from a central storage system over a fast network; the storage system becomes the bottleneck]
Data locality: Bring the computation to the data
Reduces I/O and boosts performance
Disk seeks are expensive
Solution: Read lots of data at once to amortize the cost
Introducing HDFS
Scalable storage influenced by Google's file system paper
HDFS is optimized for Hadoop
Values high throughput much more than low latency
It's a user-space Java process
Primarily accessed via command-line utilities and the Java API
Hierarchical UNIX-style paths (e.g. /foo/bar/myfile.txt)
File ownership and permissions
No current working directory (CWD)
Cannot modify files once written
HDFS follows a master-slave architecture
There are two essential daemons in HDFS
Master: NameNode
Responsible for namespace and metadata
Namespace: file hierarchy
Metadata: ownership, permissions, block locations, etc.
Slave: DataNode
Responsible for storing actual data blocks
HDFS Blocks
When a file is added to HDFS, it's split into blocks
This is a similar concept to native filesystems
HDFS uses a much larger block size (64 MB) for performance
Example: a 150 MB input file becomes Block #1 (64 MB), Block #2 (64 MB), ...
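As a rough illustration in plain Python (a toy calculation, not HDFS code; it just applies the 150 MB file size and 64 MB block size from the example above):

file_size_mb = 150          # size of the input file from the example
block_size_mb = 64          # HDFS block size from the slide

blocks = []
offset = 0
while offset < file_size_mb:
    # each block holds at most block_size_mb; the final block may be smaller
    blocks.append(min(block_size_mb, file_size_mb - offset))
    offset += block_size_mb

print(blocks)               # [64, 64, 22] -- the 150 MB file becomes three blocks

Note that the last block is only 22 MB; HDFS does not pad blocks out to the full block size.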
HDFS Replication
Those blocks are then replicated across machines
The first block might be replicated to A, C and D
[Diagram: blocks #1, #2 and #3 each replicated to three of the DataNodes A through E]
HDFS Reliability
Even when a node fails, two copies of the block remain
These will be re-replicated to other nodes automatically
[Diagram: the failed node held blocks #1 and #3, but copies of both blocks are still available on other nodes]
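A toy sketch of the re-replication idea in plain Python (not HDFS internals; the node names match the diagram, but the exact block placement and the loop below are made up for illustration):

# block number -> set of nodes currently holding a replica
replicas = {1: {'A', 'C', 'D'}, 2: {'B', 'C', 'E'}, 3: {'A', 'B', 'E'}}
replication_factor = 3

live_nodes = {'B', 'C', 'D', 'E'}           # node A has failed
for block, nodes in replicas.items():
    nodes.discard('A')                      # replicas on the failed node are lost
    while len(nodes) < replication_factor:  # copy the block to other live nodes
        nodes.add((live_nodes - nodes).pop())

print(replicas)                             # every block is back to three replicas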
Like HDFS, MapReduce has a master-slave architecture
There are two daemons in classical MapReduce
Master: JobTracker
Responsible for dividing, scheduling and monitoring work
Slave: TaskTracker
Responsible for actual processing
One function (Map) processes data
That output is ultimately input to another function (Reduce)
Each piece is simple, but can be powerful when combined
grep -E 'INFO|WARN|ERROR' app.log | cut -f3 | sort | uniq -c
(produces a count for each of ERROR, INFO and WARN)
[Diagram: the stages of the pipeline labeled as Map and intermediate processing steps]
MapReduce History
A style of processing data you could implement in any language
Many languages have functions named map and reduce
These functions have largely the same purpose in Hadoop
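The analogy can be seen with ordinary Python built-ins (a toy example unrelated to Hadoop itself; the sample lines are made up):

from functools import reduce

lines = ["INFO Blah blah", "WARN Hmmm...", "INFO More blather"]

# "map" step: turn each line into a (log level, 1) pair
pairs = map(lambda line: (line.split()[0], 1), lines)

# "reduce" step: fold the pairs into a per-level count
def add_pair(counts, pair):
    key, value = pair
    counts[key] = counts.get(key, 0) + value
    return counts

print(reduce(add_pair, pairs, {}))   # {'INFO': 2, 'WARN': 1}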
MapReduce Benefits
No file I/O
No networking code
No synchronization
A record consists of a key and a corresponding value
But it's possible to use nearly any language with Hadoop Streaming
I'll show the log event counter using MapReduce in Python
Job Input
2012-09-06 22:16:49.392 CDT INFO "Blah blah"
2012-09-06 22:16:49.394 CDT WARN "Hmmm..."
2012-09-06 22:16:49.395 CDT INFO "More blather"
2012-09-06 22:16:49.397 CDT WARN "Hey there"
2012-09-06 22:16:49.398 CDT INFO "Spewing data"
2012-09-06 22:16:49.399 CDT ERROR "Oh boy!"
#!/usr/bin/env python
import sys

levels = ['TRACE', 'DEBUG', 'INFO', 'WARN', 'ERROR', 'FATAL']

for line in sys.stdin:
    fields = line.split()
    for field in fields:
        field = field.strip().upper()
        if field in levels:
            print "%s\t1" % field
Define the list of log levels
Split every line (record) we receive on standard input into fields, normalized by case
If a field matches a log level, print it (and a 1)
The Reducer receives a key and all values for that key
Keys are always passed to reducers in sorted order
Although it's not obvious here, values are unordered
ERROR 1
INFO 1
INFO 1
INFO 1
INFO 1
WARN 1
WARN 1
The Reducer first extracts the key and value it was passed
#!/usr/bin/env python
import sys

previous_key = ''
sum = 0

for line in sys.stdin:
    key, value = line.split()
    value = int(value)
    # continued on next slide
If key unchanged, increment the count
If key changed, print sum for previous key
Re-init loop variables
Print sum for final key
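The continued code isn't reproduced here; a minimal sketch of what it might look like, following the description above (it reuses the previous_key, sum, key and value variables from the previous slide):

    # body of the for loop, continued
    if key == previous_key:
        sum = sum + value                          # key unchanged: increment the count
    else:
        if previous_key != '':
            print "%s\t%d" % (previous_key, sum)   # key changed: print sum for previous key
        previous_key = key                         # re-init loop variables
        sum = value

# after the loop finishes, print the sum for the final key
if previous_key != '':
    print "%s\t%d" % (previous_key, sum)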
Map output:
INFO 1
INFO 1
WARN 1
INFO 1
WARN 1
INFO 1
ERROR 1
Reduce input:
ERROR 1
INFO 1
INFO 1
INFO 1
INFO 1
WARN 1
WARN 1
Reduce output:
ERROR 1
INFO 4
WARN 2
An InputSplit usually corresponds to a single HDFS block
Each of these serves as input to a single Map task
Input for entire job (192 MB): three 64 MB splits, processed by Mapper #1, Mapper #2 and Mapper #3
Output of all Mappers is partitioned, merged, and sorted
(No code required: Hadoop does this automatically)
[Diagram: the (log level, 1) pairs emitted by Mapper #1 through Mapper #N, before partitioning and sorting]
All values for a given key are then collapsed into a list
The key and all its values are fed to reducers as input
[Diagram: Reducer #1 receives all the (ERROR, 1) pairs, Reducer #2 receives the INFO and WARN groups; each reducer emits one count per key]
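Conceptually, the grouping the framework performs before calling the Reducer can be pictured with ordinary Python (a toy sketch, not the actual Hadoop implementation; the sample pairs just mirror the running example):

from itertools import groupby

# sorted map output: (key, value) pairs, already ordered by key
pairs = [('ERROR', 1), ('INFO', 1), ('INFO', 1), ('INFO', 1),
         ('INFO', 1), ('WARN', 1), ('WARN', 1)]

for key, group in groupby(pairs, key=lambda kv: kv[0]):
    values = [v for _, v in group]          # all values for this key
    print("%s\t%d" % (key, sum(values)))    # what a reducer emits for the key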
Some help you integrate Hadoop with other systems
Others help you analyze your data
Still others, like Oozie, help you use Hadoop more effectively
Also like Hadoop, they have funny names
All of these are part of Cloudera's CDH distribution
Retrieve all tables, a single table, or a portion of a table to store in HDFS
Can also export data from HDFS back to the database
[Diagram: data flowing between a relational database and the Hadoop cluster]
SELECT customers.id, customers.name, SUM(orders.cost)
FROM customers
INNER JOIN orders ON (customers.id = orders.customer_id)
WHERE customers.zipcode = '63105'
GROUP BY customers.id, customers.name;
It turns this into MapReduce jobs that run on your cluster
Reduces development time
Makes Hadoop more accessible to non-engineers
It has a high-level language (Pig Latin) for data analysis
Scripts yield MapReduce jobs that run on your cluster
NoSQL database built on HDFS
Low-latency and high-performance for reads and writes
Extremely scalable
The most widely used distribution of Hadoop
A stable, proven and supported environment you can count on
Includes other ecosystem components, such as Hive, Pig, Sqoop, Flume and many more
All of these are integrated and work well together
It's completely free
Apache licensed, so it's 100% open source too
Hadoop is a good fit when:
You need to process non-relational (unstructured) data
You are processing large amounts of data
You can run your jobs in batch mode
Hadoop is not a good fit when:
You're processing small amounts of data
Your algorithms require communication among nodes
You need low latency or transactions
And know how to integrate it with other systems
System Administrators
Required skills:
Strong Linux administration skills
Networking knowledge
Understanding of hardware
Job responsibilities:
Install, configure and upgrade Hadoop software
Manage hardware components
Monitor the cluster
Integrate with other systems (e.g., Flume and Sqoop)
Developers
Required skills:
Strong Java or scripting capabilities
Understanding of MapReduce and algorithms
Job responsibilities:
Write, package and deploy MapReduce programs
Optimize MapReduce jobs and Hive/Pig programs
Required skills:
SQL
Understanding of data analytics/data mining
Job responsibilities:
Extract intelligence from the data
Write Hive and/or Pig programs
Data Steward
Required skills:
Data modeling and ETL
Scripting skills
Job responsibilities:
Cataloging the data (analogous to a librarian for books)
Manage data lifecycle and retention
Data quality control with SLAs
Combining Roles
Required skills:
Data modeling and ETL
Scripting skills
Strong Linux administration skills
Job responsibilities:
Manage data lifecycle and retention
Data quality control with SLAs
Install, configure and upgrade Hadoop software
Manage hardware components
Monitor the cluster
Integrate with other systems (e.g., Flume and Sqoop)
Conclusion