An Introduction To Hadoop Presentation PDF

Mark Fei from Cloudera gives an introduction to Apache Hadoop. He discusses what Hadoop is, how it works, and the large ecosystem of companies and tools that use Hadoop. He then describes Cloudera's offerings including their Distribution of Apache Hadoop (CDH), Cloudera Manager for managing Hadoop clusters, Cloudera Enterprise for production support, and Cloudera University for training and certifications.


An Introduction to Hadoop
Mark Fei, Cloudera
Strata + Hadoop World 2012, New York City, October 23, 2012

Who Am I? Mark Fei


Cloudera
Durango, Colorado

Current:
- Senior Instructor at Cloudera

Past:
- Professional Services Education, VMware
- Senior Member of Technical Staff, Hill Associates
- Sales Engineer, Nortel Networks
- Systems Programmer, large bank
- Banking applications software developer

What's Ahead?

A solid introduction to Apache Hadoop
- What it is
- Why it's relevant
- How it works
- The ecosystem

No prior experience needed. Feel free to ask questions.

What is Apache Hadoop?

Scalable data storage and processing
- Open source Apache project
- Harnesses the power of commodity servers
- Distributed and fault-tolerant

Core Hadoop consists of two main parts
- HDFS (storage)
- MapReduce (processing)

A large ecosystem

Who uses Hadoop?

Vendor integration

BI / Analytics

ETL

Database

OS / Cloud / System Mgmt.

Hardware

About Cloudera

Cloudera is the commercial Hadoop company
- Founded by leading experts on Hadoop from Facebook, Google, Oracle and Yahoo
- Provides consulting and training services for Hadoop users
- Staff includes several committers to Hadoop projects

Cloudera Software

Cloudera's Distribution including Apache Hadoop (CDH)
- A single, easy-to-install package from the Apache Hadoop core repository
- Includes a stable version of Hadoop, plus critical bug fixes and solid new features from the development version
- 100% open source

Components:
- Apache Hadoop
- Apache Hive
- Apache Pig
- Apache HBase
- Apache ZooKeeper
- Apache Flume, Hue, Apache Oozie, Apache Sqoop, Apache Mahout

Components of the CDH Stack: A Coherent Platform
- File System Mount: FUSE-DFS
- UI Framework: Hue
- SDK: Hue SDK
- Workflow: Apache Oozie
- Scheduling: Apache Oozie
- Metadata: Apache Hive
- Data Integration: Apache Flume, Apache Sqoop
- Languages / Compilers: Apache Pig, Apache Hive, Apache Mahout
- Storage and Computation: HDFS, MapReduce
- Fast Read/Write Access: Apache HBase
- Coordination: Apache ZooKeeper

Cloudera Manager, Free Edition

End-to-end deployment and management of your CDH cluster
- "Zero to Hadoop" in 15 minutes
- Supports up to 50 nodes
- Free (but not open source)

Cloudera Enterprise

Cloudera Enterprise combines:
- Cloudera's Distribution including Apache Hadoop (CDH)
  - Big data storage, processing and analytics platform
- Cloudera Manager (full version)
  - End-to-end deployment, management, and operation of CDH
  - Provides sophisticated cluster monitoring tools not present in the free version
- Production support
  - A team of experts on call to help you meet your Service Level Agreements (SLAs)

Cloudera University

Training for the entire Hadoop stack
- Cloudera Developer Training for Apache Hadoop
- Cloudera Administrator Training for Apache Hadoop
- Cloudera Training for Apache HBase
- Cloudera Training for Apache Hive and Pig
- Cloudera Essentials for Apache Hadoop
- More courses coming

Public and private classes offered
- Including customized on-site private classes

Industry-recognized certifications
- Cloudera Certified Developer for Apache Hadoop (CCDH)
- Cloudera Certified Administrator for Apache Hadoop (CCAH)
- Cloudera Certified Specialist in Apache HBase (CCSHB)

Professional Services

Solutions Architects provide guidance and hands-on expertise
- Use Case Discovery
- New Hadoop Deployment
- Proof of Concept
- Production Pilot
- Process and Team Development
- Hadoop Deployment Certification

How Did Apache Hadoop Originate?


Heavily influenced by Google's architecture
- Notably, the Google File System and MapReduce papers

Other Web companies quickly saw the benefits
- Early adoption by Yahoo, Facebook and others

Timeline:
- 2002: Nutch spun off from Lucene
- 2003: Google publishes the GFS paper
- 2004: Google publishes the MapReduce paper
- 2005: Nutch rewritten for MapReduce

Why Do We Have So Much Data?

And what are we supposed to do with it?

Velocity

Why we're generating data faster than ever
- Processes are increasingly automated
- Systems are increasingly interconnected
- People are increasingly interacting online

Variety

What types of data are we producing?
- Application logs
- Text messages
- Social network connections
- Tweets
- Photos

Not all of this maps cleanly to the relational model

Volume

The result of this is that every single day...
- Twitter processes 340 million messages
- Facebook stores 2.7 billion comments and Likes
- Google processes about 24 petabytes of data

And every single minute...
- More than 200 million e-mail messages are sent
- Foursquare processes more than 2,000 check-ins


Where Does Data Come From?

Science
- Medical imaging, sensor data, genome sequencing, weather data, satellite feeds, etc.

Industry
- Financial, pharmaceutical, manufacturing, insurance, online, energy, retail data

Legacy
- Sales data, customer behavior, product databases, accounting data, etc.

System Data
- Log files, health & status feeds, activity streams, network messages, Web analytics, intrusion detection, spam filters

Analyzing Data: The Challenges


Huge volumes of data
Mixed sources result in many different formats
- XML
- CSV
- EDI
- Log files
- Objects
- SQL
- Text
- JSON
- Binary
- Etc.

What is Common Across Hadoop-able Problems?

Nature of the data
- Complex data
- Multiple data sources
- Lots of it

Nature of the analysis
- Batch processing
- Parallel execution
- Spread data over a cluster of servers and take the computation to the data

Benefits of Analyzing With Hadoop
- Previously impossible/impractical to do this analysis
- Analysis conducted at lower cost
- Analysis conducted in less time
- Greater flexibility
- Linear scalability

What Analysis is Possible With Hadoop?


- Text mining
- Index building
- Graph creation and analysis
- Pattern recognition
- Collaborative filtering
- Prediction models
- Sentiment analysis
- Risk assessment

Eight Common Hadoop-able Problems


1. Modeling true risk
2. Customer churn analysis
3. Recommendation engine
4. PoS transaction analysis
5. Analyzing network data to predict failure
6. Threat analysis
7. Search quality
8. Data sandbox

1. Modeling True Risk


Challenge: How much risk exposure does an organization really have with each customer?
- Multiple sources of data, across multiple lines of business

Solution with Hadoop:
- Source and aggregate disparate data sources to build a data picture (e.g., credit card records, call recordings, chat sessions, emails, banking activity)
- Structure and analyze: sentiment analysis, graph creation, pattern recognition

Typical industry: Financial Services (banks, insurance companies)

2. Customer Churn Analysis


Challenge: Why is an organization really losing customers?
- Data on these factors comes from different sources

Solution with Hadoop:
- Rapidly build a behavioral model from disparate data sources
- Structure and analyze with Hadoop: traversing, graph creation, pattern recognition

Typical industry: Telecommunications, Financial Services

3. Recommendation Engine/Ad Targeting

Challenge: Using user data to predict which products to recommend

Solution with Hadoop:
- Batch processing framework: allows execution in parallel over large datasets
- Collaborative filtering: collecting taste information from many users, then utilizing that information to predict what similar users will like

Typical industry: E-commerce, Manufacturing, Retail, Advertising

4. Point of Sale Transaction Analysis

Challenge: Analyzing Point of Sale (PoS) data to target promotions and manage operations
- Sources are complex and data volumes grow across chains of stores and other sources

Solution with Hadoop:
- Batch processing framework: allows execution in parallel over large datasets
- Pattern recognition: optimizing over multiple data sources, utilizing information to predict demand

Typical industry: Retail

5. Analyzing Network Data to Predict Failure


Challenge: Analyzing real-time data series from a network of sensors
- Calculating average frequency over time is extremely tedious because of the need to analyze terabytes of data

Solution with Hadoop:
- Take the computation to the data
- Expand from simple scans to more complex data mining: better understand how the network reacts to fluctuations
- Discrete anomalies may, in fact, be interconnected: identify leading indicators of component failure

Typical industry: Utilities, Telecommunications, Data Centers

6. Threat Analysis/Trade Surveillance


Challenge: Detecting threats in the form of fraudulent activity or attacks
- Large data volumes involved
- Like looking for a needle in a haystack

Solution with Hadoop:
- Parallel processing over huge datasets
- Pattern recognition to identify anomalies, i.e., threats

Typical industry: Security, Financial Services; in general, spam fighting and click fraud

7. Search Quality
Challenge: Providing real-time, meaningful search results

Solution with Hadoop:
- Analyzing search attempts in conjunction with structured data
- Pattern recognition: browsing patterns of users performing searches in different categories

Typical industry: Web, E-commerce

8. Data Sandbox
Challenge: Data deluge
- Don't know what to do with the data or what analysis to run

Solution with Hadoop:
- Dump all this data into an HDFS cluster
- Use Hadoop to start trying out different analyses on the data
- See patterns to derive value from the data

Typical industry: Common across all industries

Hadoop: How does it work?

Moore's Law... and not

Disk Capacity and Price


We're generating more data than ever before
Fortunately, the size and cost of storage has kept pace
- Capacity has increased while price has decreased

  Year    Capacity (GB)    Cost per GB (USD)
  1997          2.1             $157
  2004        200               $1.05
  2012      3,000               $0.05

Disk Capacity and Performance


Disk performance has also increased in the last 15 years
Unfortunately, transfer rates haven't kept pace with capacity

  Year    Capacity (GB)    Transfer Rate (MB/s)    Disk Read Time
  1997          2.1              16.6              126 seconds
  2004        200                56.5              59 minutes
  2012      3,000               210                3 hours, 58 minutes
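The read-time column follows directly from the other two: divide capacity by the sequential transfer rate. A quick sketch of that arithmetic in Python (illustrative only, using the figures quoted above):

# Back-of-the-envelope: time to read a full disk sequentially,
# using the capacities and transfer rates quoted in the table.
disks = [
    (1997, 2.1, 16.6),      # (year, capacity in GB, transfer rate in MB/s)
    (2004, 200.0, 56.5),
    (2012, 3000.0, 210.0),
]

for year, capacity_gb, rate_mb_per_s in disks:
    # GB -> MB, then divide by MB/s to get seconds
    seconds = capacity_gb * 1000.0 / rate_mb_per_s
    print("%d: ~%.0f seconds (~%.1f hours)" % (year, seconds, seconds / 3600.0))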

Architecture of a Typical HPC System
- Compute nodes and a separate storage system, connected by a fast network
- Step 1: Copy input data from the storage system to the compute nodes
- Step 2: Process the data on the compute nodes
- Step 3: Copy output data back to the storage system

You Don't Just Need Speed

The problem is that we have way more data than code

$ du -ks code/
1,083
$ du -ks data/
854,632,947,314

You Need Speed At Scale


(Diagram: compute nodes and a storage system; at scale, the network between them becomes the bottleneck)

HDFS: HADOOP DISTRIBUTED FILESYSTEM


Because 10,000 hard disks are better than one

Collocated Storage and Processing

Solution: store and process data on the same nodes
- Data locality: bring the computation to the data
- Reduces I/O and boosts performance

The "slave" nodes handle both storage and processing

Hard Disk Latency


Disk seeks are expensive
- Solution: read lots of data at once to amortize the cost
(Diagram: the current location of the disk head vs. where the data you need is stored)
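To see why reading large chunks amortizes the seek cost, here is a rough calculation. The ~10 ms seek time is an assumed figure for a typical spinning disk (it is not from the slides); the 210 MB/s transfer rate is the 2012 value from the earlier table.

# Seek cost as a fraction of total read time, for a small vs. a large read.
seek_time_s = 0.010          # assumed ~10 ms average seek (typical HDD, not from the slides)
transfer_mb_per_s = 210.0    # 2012 transfer rate from the table above

for read_mb in (0.004, 64.0):            # a 4 KB read vs. a 64 MB, HDFS-block-sized read
    transfer_s = read_mb / transfer_mb_per_s
    total_s = seek_time_s + transfer_s
    print("%8.3f MB read: seek is %.0f%% of total time"
          % (read_mb, 100.0 * seek_time_s / total_s))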

Introducing HDFS

Hadoop Distributed File System

- Scalable storage influenced by Google's file system paper
- HDFS is optimized for Hadoop
  - It's not a general-purpose filesystem
  - Values high throughput much more than low latency
- It's a user-space Java process
  - Primarily accessed via command-line utilities and the Java API


HDFS is (Mostly) UNIX-like

In many ways, HDFS is similar to a UNIX filesystem
- Hierarchical UNIX-style paths (e.g., /foo/bar/myfile.txt)
- File ownership and permissions

There are also some major deviations from UNIX
- No current working directory (CWD)
- Cannot modify files once written


HDFS High-Level Architecture


HDFS follows a master-slave architecture
There are two essential daemons in HDFS

Master: NameNode
- Responsible for the namespace and metadata
  - Namespace: the file hierarchy
  - Metadata: ownership, permissions, block locations, etc.

Slave: DataNode
- Responsible for storing the actual data blocks

Anatomy of a Small Hadoop Cluster



The diagram shows the HDFS-related daemons on a small cluster

Each "slave" node will run - DataNode daemon

The "master" node will run - NameNode daemon

HDFS Blocks

When a file is added to HDFS, it's split into blocks
- This is a similar concept to native filesystems
- HDFS uses a much larger block size (64 MB), for performance

Example: a 150 MB input file is split into Block #1 (64 MB), Block #2 (64 MB) and Block #3 (the remaining 22 MB)
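A tiny sketch of the same block arithmetic (the 64 MB block size and the 150 MB example file come from the slide above; the helper function itself is purely illustrative):

# Illustrative only: how a file's size maps onto fixed-size HDFS blocks.
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the size of each block a file of the given size would occupy."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        blocks.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return blocks

print(split_into_blocks(150))   # [64, 64, 22] -- matches the three blocks above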

HDFS Replication

Those blocks are then replicated across machines (nodes A through E in the diagram)
- The first block might be replicated to A, C and D
- The next block might be replicated to B, D and E
- The last block might be replicated to A, C and E
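A toy illustration of the idea of spreading replicas across distinct nodes. This is only a sketch: real HDFS placement is rack-aware and decided by the NameNode, and the node names A through E are simply the ones used in the diagram.

import random

def place_replicas(nodes, replication=3):
    """Toy placement: pick 'replication' distinct nodes for one block."""
    return random.sample(nodes, replication)

nodes = ['A', 'B', 'C', 'D', 'E']
for block in ('Block #1', 'Block #2', 'Block #3'):
    print("%s -> %s" % (block, place_replicas(nodes)))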

HDFS Reliability

Replication helps to achieve reliability
- Even when a node fails, two copies of each block remain
- These will be re-replicated to other nodes automatically

(Diagram: the failed node held blocks #1 and #3; both blocks are still available on other nodes)

DATA PROCESSING WITH MAPREDUCE


It not only works, it's functional

MapReduce High-Level Architecture


Like HDFS, MapReduce has a master-slave architecture
There are two daemons in classical MapReduce

Master: JobTracker
- Responsible for dividing, scheduling and monitoring work

Slave: TaskTracker
- Responsible for the actual processing

Anatomy of a Small Hadoop Cluster



The diagram shows both MapReduce and HDFS daemons

Each "slave" node will run - DataNode daemon - TaskTracker daemon

The "master" node will run - NameNode daemon - JobTracker daemon

Gentle Introduction to MapReduce

MapReduce is conceptually like a UNIX pipeline
- One function (Map) processes data
- That output is ultimately input to another function (Reduce)
- Each piece is simple, but can be powerful when combined

$ egrep 'INFO|WARN|ERROR' app.log | cut -f3 | sort | uniq -c
    941 ERROR
  78264 INFO
   4312 WARN

The Map Function

Operates on each record individually
- Typical uses include filtering, parsing, or transforming input

$ egrep 'INFO|WARN|ERROR' app.log | cut -f3 | sort | uniq -c
(in the pipeline analogy, the egrep and cut stages play the role of the Map)

Intermediate Processing

The Map function's output is grouped and sorted
- This is the automatic "sort and shuffle" process in Hadoop

$ egrep 'INFO|WARN|ERROR' app.log | cut -f3 | sort | uniq -c
(in the pipeline analogy, the sort stage plays the role of the sort and shuffle)

The Reduce Function

Operates on all records in a group
- Often used for sum, average or other aggregate functions

$ egrep 'INFO|WARN|ERROR' app.log | cut -f3 | sort | uniq -c
(in the pipeline analogy, the uniq -c stage plays the role of the Reduce)

MapReduce History

MapReduce is not a language, it's a programming model
- A style of processing data that you could implement in any language

MapReduce has its roots in functional programming
- Many languages have functions named map and reduce
- These functions have largely the same purpose in Hadoop

Popularized for large-scale data processing by Google

MapReduce Benefits

Complex details are abstracted away from the developer
- No file I/O
- No networking code
- No synchronization

It's scalable because you process one record at a time
- A record consists of a key and a corresponding value
- We often care about only one of these

MapReduce Example in Python

MapReduce code for Hadoop is typically written in Java
- But it's possible to use nearly any language with Hadoop Streaming
- I'll show the log event counter using MapReduce in Python

It's very helpful to see the data as well as the code

Job Input

Each mapper gets a chunk of the job's input data to process
- This chunk is called an InputSplit
- In most cases, this corresponds to a block in HDFS

2012-09-06 22:16:49.391 CDT INFO "This can wait"
2012-09-06 22:16:49.392 CDT INFO "Blah blah"
2012-09-06 22:16:49.394 CDT WARN "Hmmm..."
2012-09-06 22:16:49.395 CDT INFO "More blather"
2012-09-06 22:16:49.397 CDT WARN "Hey there"
2012-09-06 22:16:49.398 CDT INFO "Spewing data"
2012-09-06 22:16:49.399 CDT ERROR "Oh boy!"

Python Code for Map Function

Our map function will parse the event type
- And then output that event (key) and a literal 1 (value)

#!/usr/bin/env python
import sys

# Define the list of log levels
levels = ['TRACE', 'DEBUG', 'INFO', 'WARN', 'ERROR', 'FATAL']

# Split every line (record) we receive on standard input into fields,
# normalized to upper case
for line in sys.stdin:
    fields = line.split()
    for field in fields:
        field = field.strip().upper()
        # If this field matches a log level, print it (and a 1)
        if field in levels:
            print "%s\t1" % field

Output of Map Function

The map function produces key/value pairs as output

INFO    1
INFO    1
WARN    1
INFO    1
WARN    1
INFO    1
ERROR   1

Input to Reduce Function

The Reducer receives a key and all values for that key
- Keys are always passed to reducers in sorted order
- Although it's not obvious here, values are unordered

ERROR   1
INFO    1
INFO    1
INFO    1
INFO    1
WARN    1
WARN    1

Python Code for Reduce Function

The Reducer first extracts the key and value it was passed, then simply adds up the values for each key

#!/usr/bin/env python
import sys

# Initialize the loop variables
previous_key = ''
sum = 0

for line in sys.stdin:
    # Extract the key and value passed via standard input
    key, value = line.split()
    value = int(value)

    if key == previous_key:
        # Key unchanged: add to the running count
        sum = sum + value
    else:
        # Key changed: print the sum for the previous key
        if previous_key != '':
            print '%s\t%i' % (previous_key, sum)
        # Re-initialize the loop variables
        previous_key = key
        sum = value

# Print the sum for the final key
print '%s\t%i' % (previous_key, sum)

Output of Reduce Function

The output of this Reduce function is a sum for each level

ERROR   1
INFO    4
WARN    2

Recap of Data Flow



Map input:
  2012-09-06 22:16:49.391 CDT INFO "This can wait"
  2012-09-06 22:16:49.392 CDT INFO "Blah blah"
  2012-09-06 22:16:49.394 CDT WARN "Hmmm..."
  2012-09-06 22:16:49.395 CDT INFO "More blather"
  2012-09-06 22:16:49.397 CDT WARN "Hey there"
  2012-09-06 22:16:49.398 CDT INFO "Spewing data"
  2012-09-06 22:16:49.399 CDT ERROR "Oh boy!"

Map output:
  INFO 1, INFO 1, WARN 1, INFO 1, WARN 1, INFO 1, ERROR 1

Reduce input (after the sort and shuffle):
  ERROR 1, INFO 1, INFO 1, INFO 1, INFO 1, WARN 1, WARN 1

Reduce output:
  ERROR 1, INFO 4, WARN 2
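For readers following along without a cluster, here is a small, self-contained sketch that reproduces the same data flow in plain Python. It is purely illustrative: Hadoop distributes these steps across many machines, whereas this script runs them in a single process.

from itertools import groupby

log_lines = [
    '2012-09-06 22:16:49.391 CDT INFO "This can wait"',
    '2012-09-06 22:16:49.392 CDT INFO "Blah blah"',
    '2012-09-06 22:16:49.394 CDT WARN "Hmmm..."',
    '2012-09-06 22:16:49.395 CDT INFO "More blather"',
    '2012-09-06 22:16:49.397 CDT WARN "Hey there"',
    '2012-09-06 22:16:49.398 CDT INFO "Spewing data"',
    '2012-09-06 22:16:49.399 CDT ERROR "Oh boy!"',
]

levels = ['TRACE', 'DEBUG', 'INFO', 'WARN', 'ERROR', 'FATAL']

# Map: emit (level, 1) for every field that matches a log level
mapped = [(field, 1) for line in log_lines
          for field in line.split() if field in levels]

# Shuffle and sort: bring all pairs with the same key together
mapped.sort()

# Reduce: sum the values for each key
for key, group in groupby(mapped, key=lambda pair: pair[0]):
    print("%s\t%d" % (key, sum(value for _, value in group)))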

Input Splits Feed the Map Tasks

Input for the entire job is subdivided into InputSplits
- An InputSplit usually corresponds to a single HDFS block
- Each of these serves as input to a single Map task

(Diagram: 192 MB of job input is divided into three 64 MB InputSplits, feeding Mapper #1, Mapper #2 and Mapper #3)

Mappers Feed the Shuffle and Sort

Output of all Mappers is partitioned, merged, and sorted
- No code required; Hadoop does this automatically

(Diagram: each mapper emits unsorted key/value pairs such as INFO 1, WARN 1 and ERROR 1; after the shuffle and sort, all the ERROR pairs are grouped together, followed by all the INFO pairs, then all the WARN pairs)

Shuffle and Sort Feeds the Reducers

All values for a given key are then collapsed into a list
- The key and all its values are fed to the reducers as input

(Diagram: ERROR with values [1, 1, 1] and INFO with values [1, 1, 1, 1, 1, 1, 1, 1] go to Reducer #1; WARN with values [1, 1, 1, 1] goes to Reducer #2)

Each Reducer Has an Output File

These are stored in HDFS below your output directory
- Use hadoop fs -getmerge to combine them into a local copy

(Diagram: one reducer's output file contains INFO 8; the other's contains ERROR 3 and WARN 4)

Apache Hadoop Ecosystem: Overview


"Core Hadoop" consists of HDFS and MapReduce


- These are the kernel of a much broader platform

Hadoop has many related projects
- Most are open source Apache projects, like Hadoop
- Some help you integrate Hadoop with other systems
- Others help you analyze your data
- Still others, like Oozie, help you use Hadoop more effectively
- Also like Hadoop, they have funny names
- All of these are part of Cloudera's CDH distribution

Ecosystem: Apache Flume



Flume collects streaming data from many sources and moves it into the cluster
- Sources include log files, program output, syslog, custom sources, and many more

Ecosystem: Apache Sqoop

Sqoop moves data between relational databases and the Hadoop cluster
- Integrates with any JDBC-compatible database
- Retrieve all tables, a single table, or a portion of a table to store in HDFS
- Can also export data from HDFS back to the database

Ecosystem: Apache Hive

Hive allows you to do SQL-like queries on data in HDFS

SELECT customers.id, customers.name, SUM(orders.cost)
FROM customers
JOIN orders ON (customers.id = orders.customer_id)
WHERE customers.zipcode = '63105'
GROUP BY customers.id, customers.name;

- It turns this into MapReduce jobs that run on your cluster
- Reduces development time
- Makes Hadoop more accessible to non-engineers

Ecosystem: Apache Pig

Apache Pig has a similar purpose to Hive
- It has a high-level language (Pig Latin) for data analysis
- Scripts yield MapReduce jobs that run on your cluster

But Pig's approach is much different from Hive's

Ecosystem: Apache HBase


- NoSQL database built on HDFS
- Low-latency, high-performance reads and writes
- Extremely scalable
  - Tables can have billions of rows and potentially millions of columns

You Should Be Using CDH

Cloudera's Distribution including Apache Hadoop (CDH)
- The most widely used distribution of Hadoop
- A stable, proven and supported environment you can count on

Combines Hadoop with many important ecosystem tools
- Such as Hive, Pig, Sqoop, Flume and many more
- All of these are integrated and work well together

How much does it cost?
- It's completely free
- Apache licensed, so it's 100% open source too


When is Hadoop (Not) a Good Choice

Hadoop may be a great choice when
- You need to process non-relational (unstructured) data
- You are processing large amounts of data
- You can run your jobs in batch mode
- And you know how to integrate it with other systems

Hadoop may not be a great choice when
- You're processing small amounts of data
- Your algorithms require communication among the nodes
- You need low latency or transactions

As always, use the best tool for the job

Managing The Elephant In The Room - Roles


- System Administrators
- Developers
- Analysts
- Data Stewards

System Administrators

Required skills:
- Strong Linux administration skills
- Networking knowledge
- Understanding of hardware

Job responsibilities:
- Install, configure and upgrade Hadoop software
- Manage hardware components
- Monitor the cluster
- Integrate with other systems (e.g., Flume and Sqoop)

Developers

Required skills:
- Strong Java or scripting capabilities
- Understanding of MapReduce and algorithms

Job responsibilities:
- Write, package and deploy MapReduce programs
- Optimize MapReduce jobs and Hive/Pig programs

Data Analyst/Business Analyst

Required skills:
- SQL
- Understanding of data analytics/data mining

Job responsibilities:
- Extract intelligence from the data
- Write Hive and/or Pig programs

Data Steward

Required skills:
- Data modeling and ETL
- Scripting skills

Job responsibilities:
- Cataloging the data (analogous to a librarian for books)
- Manage data lifecycle and retention
- Data quality control with SLAs

Combining Roles

System Administrator + Steward: analogous to a DBA

Required skills:
- Data modeling and ETL
- Scripting skills
- Strong Linux administration skills

Job responsibilities:
- Manage data lifecycle and retention
- Data quality control with SLAs
- Install, configure and upgrade Hadoop software
- Manage hardware components
- Monitor the cluster
- Integrate with other systems (e.g., Flume and Sqoop)

Conclusion

Thanks for your time! Questions?
