Week 02

Big Data Analytics: MapReduce and Apache Hadoop
Dr. Jawad Khokhar
Scalable Computing
What is a Distributed File System?
• A file system is responsible for organizing the long-term storage of information in a computer.
• When many storage computers are connected through a network, we call it a distributed file system.
• A distributed file system distributes data across multiple file servers or multiple locations.
• Data sets, or parts of a data set, can be replicated across the nodes of
a distributed file system.
Distributed File System (DFS)

The computing nodes are clustered in racks connected to each other via a fast network.

[Figure: data blocks distributed across the nodes of several racks.]

High Concurrency:
• Since the data already resides on these nodes, analysis of parts of the data can be done in a data-parallel fashion by moving the computation to the nodes that hold the data. This improves system performance and enables data parallelism.
• Cluster computing: the computing nodes are clustered in racks connected to each other via a fast network.
• Computing in one or more of these clusters across a local area network or the Internet is called distributed computing.

Data Replicated across racks

[Figure: each data block replicated on nodes in multiple racks.]

Scalability
• Data replication also helps with scaling access to this data by many users.
• With highly parallelized replication, each reader can be assigned its own node from which to access and analyze the data.
Data Replication provides Fault Tolerance

[Figure: replicated blocks remain available on other racks when a node or rack fails.]

Replication provides:
• High Availability
• Reliability
Benefits of DFS
• Data scalability
• Data partitioning
• Fault tolerance
• Data replication
• High concurrency
Data Parallelism
• In data parallelism, many jobs that share nothing can work on different data sets or on different parts of a data set.
• Large volumes and varieties of big data can be analyzed using this mode of parallelism, achieving scalability, performance and cost reduction.
Commodity Cluster
• Commodity clusters are affordable, less specialized parallel computers with an average number of computing nodes.
• They are not as powerful as traditional parallel computers and are often built out of less specialized nodes.
• The service-oriented computing community on the Internet has pushed for computing to be done on commodity clusters as distributed computations, in turn reducing the cost of computing over the Internet.
• Commodity clusters are a cost-effective way of achieving data-parallel scalability for big data applications.
Common Failures in Cluster computing
• A node, or an entire rack can fail at any given time.
• The connectivity of a rack to the network can stop, or
• the connections between individual nodes can break.
• It is not practical to restart everything every time a failure happens.
• The ability to recover from such failures is called Fault-tolerance.
• For Fault-tolerance of such systems, we can:
1. Have Redundant data storage, and
2. Restart failed individual parallel jobs.

Programming Models for Big Data
• A programming model is a set of abstract runtime libraries and programming languages that form a model of computation.
• The enabling infrastructure for big data analysis is the distributed file system.
• A programming model for big data enables the programmability of the operations within distributed file systems.
• It allows writing computer programs that work efficiently on top of distributed file systems with big data, and makes it easy to cope with all the potential issues.
Requirements for Big Data Programming Models
• Support big data operations
• Split volumes of data: partitioning and placement of data in and out of computer memory, along with a model to synchronize the data sets later on.
• Fast access: access to the data should be fast.
• Distribution of computations to nodes: allow fast distribution of computations to the nodes within a rack (potentially the data nodes we moved the computation to), and schedule many parallel tasks at once.
• Handle fault tolerance:
• Reliability: enable reliable computing and fault tolerance from failures.
• Recover files when needed.
• Scalability:
• Enable adding more resources (e.g., racks) when needed.
• This is also called scaling out.
• Optimized for different data types: document, table, graph, key-value, stream, multimedia.
MapReduce
MapReduce is a programming model and an associated implementation for processing large data sets.
• Challenges:
• How to distribute computation?
• Distributed/parallel programming is hard
• MapReduce addresses both of the above:
• Google's computational/data manipulation model
• An elegant way to work with big data

Single Node Architecture

[Figure: a single machine with CPU, memory and disk; "classical" data analytics and machine learning/statistics run entirely on this one node.]
Motivation: Google Example
• 20+ billion web pages x 20KB = 400+ TB
• 1 computer reads 30-35 MB/sec from disk
• ~4 months to read the web
• ~1,000 hard drives to store the web
• Takes even more to do something useful
with the data!
• Today, a standard architecture for such problems has emerged: the commodity cluster.
• Cluster of commodity Linux nodes
• Commodity network (ethernet) to connect them
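As a rough sanity check of these figures, here is a small back-of-the-envelope calculation in Python (a sketch using the slide's approximate numbers, not measured values):

pages = 20e9          # 20+ billion web pages
page_size = 20e3      # ~20 KB per page
read_rate = 30e6      # ~30 MB/sec sequential read from a single disk

total_bytes = pages * page_size               # ~4e14 bytes = ~400 TB
seconds = total_bytes / read_rate             # time for one computer to read it all
print(f"total ≈ {total_bytes / 1e12:.0f} TB")              # ≈ 400 TB
print(f"read time ≈ {seconds / (86400 * 30):.1f} months")  # ≈ 4-5 months at 30-35 MB/s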

Cluster Architecture

[Figure: racks of commodity nodes (each with CPU, memory and disk) connected by a switch per rack; rack switches are connected by a backbone switch.]
• 1 Gbps bandwidth between any pair of nodes in a rack
• 2-10 Gbps backbone between racks
• Each rack contains 16-64 nodes
• In 2011 it was estimated that Google had 1M machines (http://bit.ly/Shh0RO)


J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Large-scale Computing
• Large-scale computing for data analytics problems on commodity
hardware (commodity clusters)
• Challenges:
• How do you distribute computation?
• How can we make it easy to write distributed programs?
• Machines fail:
• One server may stay up 3 years (1,000 days)
• If you have 1,000 servers, expect to lose one per day
• People estimated Google had ~1M machines in 2011
• 1,000 machines fail every day!

Idea and Solution
• Issue: Copying data over a network takes time
• Idea:
• Bring computation close to the data
• Store files multiple times for reliability
• Map-reduce addresses these problems
• Google’s computational/data manipulation model
• Elegant way to work with big data
• Storage Infrastructure – File system
• Google: GFS. Hadoop: HDFS
• Programming model
• Map-Reduce

Storage Infrastructure
• Problem:
• If nodes fail, how to store data persistently?
• Answer:
• Distributed File System:
• Provides global file namespace
• Google GFS; Hadoop HDFS;
• Typical usage pattern
• Huge files (100s of GB to TB)
• Data is rarely updated in place
• Reads and appends are common

Distributed File System
• Chunk servers
• File is split into contiguous chunks
• Typically each chunk is 16-64MB
• Each chunk replicated (usually 2x or 3x)
• Try to keep replicas in different racks
• Master node
• a.k.a. Name Node in Hadoop’s HDFS
• Stores metadata about where files are stored
• Might be replicated
• Client library for file access
• Talks to master to find chunk servers
• Connects directly to chunk servers to access data

Distributed File System
• Reliable distributed file system
• Data kept in “chunks” spread across machines
• Each chunk replicated on different machines
• Seamless recovery from disk or machine failure

[Figure: chunks C0-C5 and D0-D1 replicated across chunk servers 1..N.]

Bring computation directly to the data!
Chunk servers also serve as compute servers.
Programming Model: MapReduce

Sample task:
• We have a huge text document
• Count the number of times each distinct word appears in the file
• Sample application:
• Analyze web server logs to find popular URLs

Task: Word Count
Case 1:
• File too large for memory, but all <word, count> pairs fit in memory
• Use a hashtable
Case 2:
• Even the <word, count> pairs do not fit in memory
• Count occurrences of words:
• words(doc.txt) | sort | uniq -c
• where words takes a file and outputs the words in it, one per line
• Case 2 captures the essence of MapReduce
• Great thing is that it is naturally parallelizable

MapReduce: Overview
• Sequentially read a lot of data
• Map:
• Extract something you care about from each record (Key)
• Group by key: Sort and Shuffle
• Reduce:
• Aggregate, summarize, filter or transform
• Write the result
Outline stays the same, Map and Reduce
change to fit the problem
MapReduce: The Map Step

[Figure: each input key-value pair (k, v) is passed to a map task, which emits a set of intermediate key-value pairs.]
MapReduce: The Reduce Step

[Figure: intermediate key-value pairs are grouped by key into (k, [v1, v2, ...]) groups; each group is passed to a reduce task, which emits the output key-value pairs.]
Example: WordCount using MapReduce

Step 0: The files (File 1 ... File N) are stored in a DFS.

Step 1: Map on each node. Map generates key-value pairs.
Node 1 input: "My apple is red and my rose is blue...."
  my, my → (my, 1), (my, 1)
  apple → (apple, 1)
  is, is → (is, 1), (is, 1)
  red → (red, 1)
  and → (and, 1)
  rose → (rose, 1)
  blue → (blue, 1)
Node 2 input: "You are the apple of my eye...."
  You → (You, 1)
  are → (are, 1)
  the → (the, 1)
  apple → (apple, 1)
  of → (of, 1)
  my → (my, 1)
  eye → (eye, 1)

Step 2: Sort and Shuffle. Pairs with the same key are moved to the same node, e.g.
  (apple, 1), (apple, 1)
  (is, 1), (is, 1)
  (red, 1)
  (rose, 1)
  (You, 1)

Step 3: Reduce. Add the values for the same key.
  (You, 1) → (You, 1)
  (apple, 1), (apple, 1) → (apple, 2)
  (my, 1), (my, 1), (my, 1) → (my, 3)
  (red, 1) → (red, 1)
  (rose, 1) → (rose, 1)
Word Count Using MapReduce
map(key, value):
// key: document name; value: text of the document
for each word w in value:
emit(w, 1)

reduce(key, values):
// key: a word; values: an iterator over counts
result = 0
for each count v in values:
result += v
emit(key, result)
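This pseudocode can be exercised locally. Below is a minimal, self-contained Python sketch that simulates the three phases (map, shuffle/group by key, reduce) in memory; it illustrates the model only and is not Hadoop's actual API.

from collections import defaultdict

def map_fn(doc_name, text):
    # key: document name; value: text of the document
    for word in text.split():
        yield word.lower().strip(".,"), 1

def reduce_fn(word, counts):
    # key: a word; values: an iterable of counts
    yield word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: apply map_fn to every (key, value) input pair
    intermediate = [kv for k, v in inputs for kv in map_fn(k, v)]
    # Shuffle / group-by-key phase
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Reduce phase: one reduce call per unique key
    return [out for k, vs in groups.items() for out in reduce_fn(k, vs)]

docs = [("doc1", "My apple is red and my rose is blue."),
        ("doc2", "You are the apple of my eye.")]
print(sorted(run_mapreduce(docs, map_fn, reduce_fn)))
# includes ('apple', 2), ('my', 3), ('is', 2), ...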

[Figure: Map → Shuffle and Sort → Reduce.]
More Specifically
• Input: a set of key-value pairs
• Programmer specifies two methods:
• Map(k, v) → <k', v'>*
• Takes a key-value pair and outputs a set of key-value pairs
• E.g., key is the filename, value is a single line in the file
• There is one Map call for every (k, v) pair
• Reduce(k', <v'>*) → <k', v''>*
• All values v' with the same key k' are reduced together and processed in v' order
• There is one Reduce function call per unique key k'
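As an illustration only (these aliases are mine, not part of any Hadoop API), the two signatures can be written as Python type aliases:

from typing import Callable, Iterable, TypeVar

K = TypeVar("K"); V = TypeVar("V")
K2 = TypeVar("K2"); V2 = TypeVar("V2"); V3 = TypeVar("V3")

# Map(k, v) -> a set of intermediate (k', v') pairs; called once per input pair.
MapFn = Callable[[K, V], Iterable[tuple[K2, V2]]]

# Reduce(k', [v', ...]) -> a set of output (k', v'') pairs; called once per unique key.
ReduceFn = Callable[[K2, Iterable[V2]], Iterable[tuple[K2, V3]]]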

Map-Reduce: A diagram

[Figure: a big document flows through three phases.]
• MAP: read the input and produce a set of key-value pairs.
• Group by key: collect all pairs with the same key (hash merge, shuffle, sort, partition).
• Reduce: collect all values belonging to each key and output the result.
Map-Reduce: In Parallel

[Figure: several map nodes feed several reduce nodes; all phases are distributed, with many tasks doing the work.]
Map-Reduce Pattern
Map-Reduce: Environment

[Figure: Input 0..2 → Map 0..2 → Shuffle → Reduce 0..1 → Out 0..1.]

The Map-Reduce environment takes care of:
• Partitioning the input data
• Scheduling the program's execution across a set of machines
• Performing the group-by-key step
• Handling machine failures
• Managing required inter-machine communication
Data Flow
• Input and final output are stored on a distributed file system (DFS):
• The scheduler tries to schedule map tasks "close" to the physical storage location of the input data
• Intermediate results are stored on the local FS of the Map and Reduce workers
• Output is often the input to another MapReduce task

Coordination: Master
• Master node takes care of coordination:
• Task status: (idle, in-progress, completed)
• Idle tasks get scheduled as workers become available
• When a map task completes, it sends the master the location and sizes of its R
intermediate files, one for each reducer
• Master pushes this info to reducers

• Master pings workers periodically to detect failures

Dealing with Failures
• Map worker failure
• Map tasks completed or in-progress at worker are reset to idle since output of
the mapper is stored on the local FS
• Reduce workers are notified when task is rescheduled on another worker
• Reduce worker failure
• Only in-progress tasks are reset to idle since output of the reducer is stored
on the DFS
• Reduce task is restarted
• Master failure
• MapReduce task is aborted and client is notified

How many Map and Reduce jobs?
• M map tasks, R reduce tasks
• Rule of thumb:
• Make M much larger than the number of nodes in the
cluster
• One DFS chunk per map is common
• Improves dynamic load balancing and speeds up
recovery from worker failures
• Usually R is smaller than M
• Because output is spread across R files. If there are too
many Reduce tasks the number of intermediate files
explodes.

Combiners
• Often a Map task will produce many pairs of the form (k,v1), (k,v2), …
for the same key k (e.g. the word “the”)
• E.g., popular words in the word count example
• Can save network time by
pre-aggregating values in
the mapper:
• combine(k, list(v1)) → v2
• Combiner is usually same
as the reduce function

Combiners
• Back to our word counting example:
• Combiner combines the values of all keys of a single
mapper (single machine):

• Much less data needs to be copied and shuffled!
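A small Python sketch of the idea (a local simulation, not Hadoop's Combiner class): the combiner pre-aggregates each mapper's output before anything is sent over the network.

from collections import Counter

def map_with_combiner(text):
    # The mapper conceptually emits one (word, 1) per word; the combiner
    # sums those pairs locally, on the mapper's machine, before the shuffle.
    local_counts = Counter(text.split())      # combine step: sum values per key
    return list(local_counts.items())         # e.g. ('the', 57) instead of 57 separate pairs

# Only one pair per distinct word leaves each mapper, so far less data is shuffled.
print(map_with_combiner("the cat sat on the mat the end"))
# [('the', 3), ('cat', 1), ('sat', 1), ('on', 1), ('mat', 1), ('end', 1)]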

Combiners – Caution!
• Works only if reduce function is commutative and associative
• Sum is commutative and associative
• a + b = b + a and (a + b) + c = a + (b + c)
Combiners – Caution!
• Works only if reduce function is commutative and associative
• Average is NOT commutative and associative

• Combining two partial averages as (avg1 + avg2) / 2 does not give the true average.
Combiners – Caution!
• Works only if reduce function is commutative and associative
• Average is NOT commutative and associative, but can be computed if both
the sum and count are returned by the map function (Combiner Trick)

• Combining (sum1, count1) and (sum2, count2) gives the exact average: (sum1 + sum2) / (count1 + count2)
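A minimal sketch of the (sum, count) trick; the function names are illustrative, not from any library.

def map_partition(values):
    # Each mapper emits a partial (sum, count) instead of a partial average.
    return (sum(values), len(values))

def reduce_average(partials):
    # Partial (sum, count) pairs can be combined in any order: the result is exact.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

node1 = [1, 2, 3]          # average 2.0
node2 = [10, 20, 30, 40]   # average 25.0
print(reduce_average([map_partition(node1), map_partition(node2)]))  # 15.142857...
# Naively averaging the averages would give (2.0 + 25.0) / 2 = 13.5, which is wrong.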

Combiners – Caution!
• Works only if reduce function is commutative and associative
• Median is NOT commutative and associative
• There is no way to split the median computation this way
Partition Function
• Want to control how keys get partitioned
• Inputs to map tasks are created by contiguous splits of
input file
• Ensure that records with the same intermediate key end up
at the same worker
• System uses a default partition function:
• hash(key) mod R

• Sometimes it is useful to override the hash function:
• E.g., hash(hostname(URL)) mod R ensures URLs from a host end up in the same output file
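A plain-Python sketch of the default and custom partition functions (Hadoop's real Partitioner is a Java class; this only illustrates the hashing idea, and the example URLs are made up):

from urllib.parse import urlparse

R = 4  # number of reduce tasks

def default_partition(key):
    # System default: hash(key) mod R
    return hash(key) % R

def host_partition(url):
    # Custom: hash(hostname(URL)) mod R, so all URLs from one host
    # land in the same reduce task (and therefore the same output file).
    return hash(urlparse(url).netloc) % R

urls = ["http://example.com/a", "http://example.com/b", "http://other.org/x"]
print([host_partition(u) for u in urls])  # the first two always match within a run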

Problems Suited for Map-Reduce
Example: Host size
• Suppose we have a large web corpus
• Look at the metadata file
• Lines of the form: (URL, size, date, …)
• For each host, find the total number of bytes
• That is, the sum of the page sizes for all URLs from that particular host
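A hedged map/reduce sketch for this task, assuming metadata lines of the form "URL, size, date, ..." as stated above (the sample lines are invented):

from collections import defaultdict
from urllib.parse import urlparse

def map_fn(line):
    # Metadata line of the form: URL, size, date, ...
    url, size, *rest = line.split(",")
    yield urlparse(url.strip()).netloc, int(size)   # key = host, value = page size

def reduce_fn(host, sizes):
    yield host, sum(sizes)                          # total bytes for that host

lines = ["http://a.com/x, 1200, 2024-01-01",
         "http://a.com/y, 800, 2024-01-02",
         "http://b.org/z, 500, 2024-01-03"]
pairs = [kv for line in lines for kv in map_fn(line)]

groups = defaultdict(list)          # stand-in for the shuffle
for host, size in pairs:
    groups[host].append(size)
print({h: next(reduce_fn(h, s))[1] for h, s in groups.items()})  # {'a.com': 2000, 'b.org': 500}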

Example: Language Model
• Statistical machine translation:
• Need to count number of times every 5-word sequence occurs in a large
corpus of documents

• Very easy with MapReduce:


• Map:
• Extract (5-word sequence, count) from document
• Reduce:
• Combine the counts
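A short sketch of the map side for 5-word sequences (the reduce side is the same counting reducer as in word count); the function names are illustrative:

def map_fn(doc_id, text):
    words = text.split()
    # Emit every overlapping 5-word sequence with a count of 1.
    for i in range(len(words) - 4):
        yield tuple(words[i:i + 5]), 1

def reduce_fn(seq, counts):
    yield seq, sum(counts)   # combine the counts per 5-word sequence

print(list(map_fn("d1", "to be or not to be that is")))
# 4 overlapping 5-grams, each with count 1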

Example: Relational-Algebra Operations in database queries (SQL)
Background
• A relation is a table with column headers called attributes.
• Rows of the relation are called tuples.
• The set of attributes of a relation is called its schema.
• R(A1, A2, . . . , An) -> The relation name is R and its attributes are A1, A2, . . . , An

Example relations (attributes are the column headers, rows are the tuples):

table1(a, b, c):        table2(d, b, f):
a1 b1 c1                d1 b1 f1
a1 b2 c2                d2 b2 f2
a1 b3 c3                d3 b3 f3
a4 b4 c4                d4 b4 f4

Relational operations
• Selection: Apply a condition C to each tuple in the relation and produce as output only those tuples that satisfy C. Denoted as σ_C(R).
• Example SQL: select * from table1 where a = 'a1';
• Projection: For some subset S of the attributes of the relation, produce from each tuple only the components for the attributes in S. Denoted π_S(R).
• Example SQL: select a,b from table1;
• Union, Intersection, and Difference
• Example SQL: select a,b,c from table1 union select d,b,f from table2;
• Natural Join: Given two relations, compare each pair of tuples, one from each relation. If the tuples agree on all the attributes that are common to the two schemas, then produce a tuple that has components for each of the attributes in either schema. The natural join of relations R and S is denoted R ⋈ S.
• Example SQL (inner join): select * from table1 R inner join table2 S on R.b = S.b;
• Grouping and Aggregation: Given a relation R, partition its tuples according to their values in one set of attributes G, called the grouping attributes. Then, for each group, aggregate the values in certain other attributes. The normally permitted aggregations are SUM, COUNT, AVG, MIN, and MAX.
• Example SQL: select a,count(*) from table1 group by a;
Example: Join By Map-Reduce
• Compute the natural join R(A,B) ⋈ S(B,C)
• R and S are each stored in files
• Tuples are pairs (a,b) or (b,c)

R(A, B):     S(B, C):     R ⋈ S (A, C columns shown):
a1 b1        b2 c1        a3 c1
a2 b1        b2 c2        a3 c2
a3 b2        b3 c3        a4 c3
a4 b3
Example: Join By Map-Reduce
• Use a hash function h from B-values to 1...k
• A Map process turns:
• Each input tuple R(a,b) into key-value pair (b,(a,R))
• Each input tuple S(b,c) into (b,(c,S))

• Map processes send each key-value pair with key b to Reduce process h(b)
• Hadoop does this automatically; just tell it what k is.
• Each Reduce process matches all the pairs (b, (a, R)) with all (b, (c, S)) and outputs (a, b, c).
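A minimal in-memory Python sketch of this join, using tags 'R' and 'S' to mark which relation each tuple came from (in Hadoop the routing to Reduce process h(b) happens automatically):

from collections import defaultdict

R = [("a1", "b1"), ("a2", "b1"), ("a3", "b2"), ("a4", "b3")]   # tuples (a, b)
S = [("b2", "c1"), ("b2", "c2"), ("b3", "c3")]                 # tuples (b, c)

# Map: key every tuple by its B-value and tag it with its relation.
mapped = [(b, ("R", a)) for a, b in R] + [(b, ("S", c)) for b, c in S]

# Shuffle: all pairs with the same b reach the same reduce process h(b).
groups = defaultdict(list)
for b, tagged in mapped:
    groups[b].append(tagged)

# Reduce: match every (b, (a, R)) with every (b, (c, S)) and output (a, b, c).
joined = [(a, b, c)
          for b, vals in groups.items()
          for tag_a, a in vals if tag_a == "R"
          for tag_c, c in vals if tag_c == "S"]
print(sorted(joined))   # [('a3', 'b2', 'c1'), ('a3', 'b2', 'c2'), ('a4', 'b3', 'c3')]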

Example: Batch Gradient Descent

Batch Gradient Descent update rule:
  θ_j := θ_j − α · (1/400) · Σ_{i=1..400} (h_θ(x^(i)) − y^(i)) · x_j^(i)

• Assume m = 400 training examples; normally m would be more like 400,000,000.
• If m is large, computing this sum on a single machine is really expensive.

Split the training set into different subsets (4 subsets for 4 nodes [Mappers]):

Node 1: uses (x^(1), y^(1)), ..., (x^(100), y^(100))
  temp_j^(1) := Σ_{i=1..100} (h_θ(x^(i)) − y^(i)) · x_j^(i)

Node 2: uses (x^(101), y^(101)), ..., (x^(200), y^(200))
  temp_j^(2) := Σ_{i=101..200} (h_θ(x^(i)) − y^(i)) · x_j^(i)

Node 3: uses (x^(201), y^(201)), ..., (x^(300), y^(300))
  temp_j^(3) := Σ_{i=201..300} (h_θ(x^(i)) − y^(i)) · x_j^(i)

Node 4: uses (x^(301), y^(301)), ..., (x^(400), y^(400))
  temp_j^(4) := Σ_{i=301..400} (h_θ(x^(i)) − y^(i)) · x_j^(i)

Send the partial sums to a Reducer, which puts them back together and updates θ:
  θ_j := θ_j − α · (1/400) · (temp_j^(1) + temp_j^(2) + temp_j^(3) + temp_j^(4))
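A hedged sketch of one such distributed gradient step for linear regression, with the four mappers simulated as array slices (NumPy is used for brevity; the data here is synthetic):

import numpy as np

np.random.seed(0)
m, n = 400, 3
X = np.random.randn(m, n)                     # 400 training examples, 3 features
y = X @ np.array([1.0, -2.0, 0.5])            # synthetic targets
theta, alpha = np.zeros(n), 0.1

def mapper(X_part, y_part, theta):
    # temp_j = sum over this node's examples of (h_theta(x) - y) * x_j
    errors = X_part @ theta - y_part
    return X_part.T @ errors                  # vector of partial sums, one entry per j

# Four mappers, each over 100 examples.
partials = [mapper(X[i:i + 100], y[i:i + 100], theta) for i in range(0, m, 100)]

# Reducer: add the partial sums and take one gradient step.
theta = theta - alpha * (1.0 / m) * sum(partials)
print(theta)   # identical to the single-machine batch gradient step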
Other Examples
• Matrix-Vector Multiplication
• Matrix Multiplication
Implementations
• Google
• Uses Google File System (GFS) for stable storage
• Not available outside Google
• Hadoop
• An open-source implementation in Java
• Uses HDFS for stable storage
• Download: http://lucene.apache.org/hadoop/

Problems not suited for MapReduce
• MapReduce is great for:
• Problems that require sequential data access
• Large batch jobs (not interactive, real-time)
• MapReduce is inefficient for problems where random (or irregular)
access to data required:
• Graphs
• Interdependent data
• Comparisons of many pairs of items
Cloud Computing
• Ability to rent computing by the hour
• Additional services, e.g., persistent storage
• Amazon's "Elastic Compute Cloud" (EC2)
• S3 (stable storage)
• Elastic MapReduce (EMR)
Cost in MapReduce Jobs
• Computation Cost of mappers and the reducers
• System cost is principally sorting key-value pairs by key and merging them at
Reduce tasks
• Communication Cost of shipping key-value pairs from mappers to
reducers
• Map tasks are executed where their input data resides, so no communication is required.
Hands on practice environments
• Cloudera VM
• https://downloads.cloudera.com/demo_vm/virtualbox/cloudera-quickstart-
vm-5.4.2-0-virtualbox.zip
• HDFS, SQL, MongoDB, Spark
• Google CoLab/Jupyter Notebooks
• Spark, Python
Cloudera VM for hands-on: Installation Instructions
Please use the following instructions to download and install the Cloudera Quickstart VM with VirtualBox before proceeding
to the Getting Started with the Cloudera VM Environment video. The screenshots are from a Mac but the instructions should
be the same for Windows. Please see the discussion boards if you have any issues.
1. Install VirtualBox. Go to https://www.virtualbox.org/wiki/Downloads to download and install VirtualBox for your
computer.
2. Download the Cloudera VM. Download the Cloudera VM
from https://downloads.cloudera.com/demo_vm/virtualbox/cloudera-quickstart-vm-5.4.2-0-virtualbox.zip. The VM is over
4GB, so will take some time to download.
3. Unzip the Cloudera VM:
On Mac: Double click cloudera-quickstart-vm-5.4.2-0-virtualbox.zip
On Windows: Right-click cloudera-quickstart-vm-5.4.2-0-virtualbox.zip and select “Extract All…”
4. Start VirtualBox.
5. Begin importing. Import the VM by going to File -> Import Appliance
6. Click the Folder icon.
7. Select the cloudera-quickstart-vm-5.4.2-0-virtualbox.ovf from the Folder where you unzipped the VirtualBox
VM and click Open.
8. Click Continue to proceed.
9. Click Import.
10. The virtual machine image will
be imported. This can take several
minutes.
11. Launch Cloudera VM. When the importing is finished, the quickstart-vm-5.4.2-0 VM will appear on the left
in the VirtualBox window. Select it and click the Start button to launch the VM.
12. Cloudera VM booting. It will take several minutes for the Virtual Machine to start. The booting
process takes a long time since many Hadoop tools are started
13. The Cloudera VM desktop. Once the booting process is complete, the desktop will appear with a browser.
Apache Hadoop:
A Framework to manage and process Big Data
Apache Hadoop
• Hadoop is a framework that allows us to store and process large data sets in a parallel and distributed manner
• Created in 2005 by Yahoo
• Written in Java
• Implements the MapReduce Big Data Programming model
• More Big Data frameworks have since been released: now there are over 100!
The Hadoop Ecosystem

[Figure: the Hadoop ecosystem stack, including HDFS, YARN, MapReduce, Hive, Pig, Giraph, Storm, Spark, Flink, HBase, Cassandra, MongoDB and Zookeeper.]
The Hadoop Ecosystem

[Figure: the ecosystem layers. Higher levels (Hive, Pig, Giraph, Spark, Storm, Flink, MapReduce, HBase, Cassandra, MongoDB): interactivity. Lower levels (Zookeeper, YARN, HDFS): storage and scheduling.]
The Hadoop Ecosystem
• Distributed file system (HDFS) as the foundation
• Scalable storage
• Fault tolerance
The Hadoop Ecosystem
• YARN: Yet Another Resource Negotiator
• Used for flexible scheduling and resource management
• YARN schedules jobs on more than 40,000 servers at Yahoo!
The Hadoop Ecosystem
• MapReduce: implementation of the MapReduce programming model
• Map → apply(), Reduce → summarize()
• Google used to use MapReduce to index the web
The Hadoop Ecosystem
• Higher-level programming models
• Pig = dataflow scripting (created at Yahoo)
• Hive = SQL-like queries (created at Facebook)
The Hadoop Ecosystem
• Specialized models for graph processing
• Giraph is used by Facebook to analyze social graphs
The Hadoop Ecosystem
• Real-time and in-memory processing (Spark, Storm, Flink)
• 100x faster for some tasks
The Hadoop Ecosystem
• NoSQL databases (HBase, Cassandra, MongoDB)
• Key-value stores
• HBase is used by Facebook's messaging platform
The Hadoop Ecosystem
• Zookeeper for management:
• Synchronization
• Configuration
• High availability
• All these tools are open source
• Large community for support
• We can download them separately or as part of pre-built virtual machine images that contain all the necessary tools
The Hadoop Distributed File System (HDFS):
A Storage System for Big Data
HDFS is the foundation of the Hadoop ecosystem
• HDFS provides scalable and reliable storage
• Scalability: the property of Hadoop to handle large data sets by adding more resources to the system as and when needed. This is also called scaling out. If you run out of space, you can simply add more nodes to increase the space.
• Reliability: the ability to cope with hardware failures, making the system fault tolerant
• Fault tolerance: the ability to continue operating without interruption when one or more nodes fail
• Can handle a variety of file types
HDFS allows storing massively large data sets
• Up to 200 petabytes, 4,500 servers, 1 billion files and blocks!

HDFS splits files across nodes for parallel access
• HDFS achieves scalability by partitioning or splitting large files across multiple computers.
• The default chunk size (the size of each piece of a file) is 128 MB, but we can configure this to any size.

What happens if a node fails? Data may get lost.
• Solution: replicate duplicate copies on different nodes. Replication ensures fault tolerance.
• Replication Factor: the number of times the Hadoop framework replicates each and every data block.
• The default replication factor is 3.
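A small worked example of these defaults (illustrative numbers; assumes the 128 MB block size and replication factor 3 above):

import math

block_size_mb = 128          # default HDFS block (chunk) size
replication = 3              # default replication factor
file_size_mb = 1024          # a 1 GB file, as an example

blocks = math.ceil(file_size_mb / block_size_mb)        # 8 blocks
stored_copies = blocks * replication                    # 24 block replicas in the cluster
raw_storage_mb = file_size_mb * replication             # ~3 GB of raw disk used
print(blocks, stored_copies, raw_storage_mb)            # 8 24 3072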
Two key components of HDFS
• NameNode (also called the master node) for metadata
• It tracks files, manages the file system and holds the metadata of all of the stored data.
• Contains the details of the number of blocks, the locations of the DataNodes that the data is stored in, where the replicas are stored, and other details.
• The NameNode has direct contact with the client.
• Usually one per cluster.
• Secondary NameNode: takes care of the checkpoints of the file system metadata held in the NameNode.
• DataNode (also called the slave node) for block storage
• A DataNode stores data as blocks.
• Listens to the NameNode for block creation, deletion and replication.
• Replication provides data locality and fault tolerance.
• Usually one per machine.

Data locality is the practice of moving the computation close to where the actual data resides on the node, instead of moving large data to the computation.
YARN:
The Resource Manager for Hadoop
YARN: Yet Another Resource Negotiator
• YARN provides flexible resource management for the Hadoop cluster. It interacts with applications and schedules resources for their use.
• Allows batch processing, stream processing, graph processing and interactive processing.
• YARN enables running applications other than MapReduce over HDFS, extending Hadoop to support multiple frameworks such as Giraph, Spark and Flink.
Hadoop evolved over time

[Figure: Hadoop 1.0 ran Hive, Pig and other tools directly on MapReduce over HDFS; Hadoop 2.0 adds YARN between HDFS and the processing frameworks (MapReduce, Giraph, Spark, Storm, Flink and others).]

YARN led to more applications, and the list keeps growing.
Benefits of YARN (Hadoop 2.0)
• It lets you run many distributed applications over the same Hadoop
cluster.
• YARN reduces the need to move data around and supports higher
resource utilization resulting in lower costs.
• It's a scalable platform that has enabled growth of several
applications over the HDFS, enriching the Hadoop ecosystem.
When to reconsider Hadoop?
When to use Hadoop?
• Future anticipated data growth
• Long term availability of data
• Many platforms over single data store
• High Volume
• High Variety
When Hadoop may not be the most suitable option
• Small Datasets
• Task Level Parallelism
• Advanced Algorithms
• Replacement to your infrastructure – Databases are still needed
• Random Data Access
Hands on Exercise
• Copy data into the Hadoop Distributed File System (HDFS)
• Run the WordCount program
Copy data into the Hadoop Distributed File System (HDFS)
Download the Shakespeare text. Enter the following link in the
browser: http://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespear
e.txt. Save the file as words.txt
• Open a terminal shell: Open a terminal shell by clicking on the square black box on
the top left of the screen.
• Run cd Downloads to change to the Downloads directory.
• Run ls to see that words.txt was saved.
• Copy file to HDFS: Run hadoop fs -copyFromLocal words.txt to copy
the text file to HDFS.
• Verify file was copied to HDFS: Run hadoop fs -ls to verify the file was
copied to HDFS.
• Copy a file within HDFS: Run hadoop fs -cp words.txt words2.txt to
make a copy of words.txt called words2.txt
• Copy a file from HDFS: Run hadoop fs -copyToLocal words2.txt . to
copy words2.txt to the local directory.
• Delete a file in HDFS: Run hadoop fs -rm words2.txt
Run the WordCount program
• See example MapReduce programs. Hadoop comes with several example MapReduce applications.
• List the programs by running hadoop jar /usr/jars/hadoop-examples.jar.
• We are interested in running wordcount.
Run the WordCount program
• Verify words.txt file exists. Run hadoop fs -ls
• See WordCount command line arguments. We can learn how to run WordCount by examining its command-line
arguments. Run hadoop jar /usr/jars/hadoop-examples.jar wordcount.

• Run WordCount: hadoop jar /usr/jars/hadoop-examples.jar wordcount words.txt out. As WordCount executes, Hadoop prints the progress in terms of Map and Reduce. When the WordCount job is complete, both will say 100%.

• See the WordCount output directory. Run hadoop fs -ls

• Look inside the output directory. The directory created by WordCount contains several files. Look inside the directory by running hadoop fs -ls out

• Copy the WordCount results to the local file system. Copy part-r-00000 to the local file system by running hadoop fs -copyToLocal out/part-r-00000 local.txt
• View the WordCount results. View the contents of the results: more local.txt
Further Reading
• Chapter 2: Mining of Massive Datasets, by Anand Rajaraman and Jeffrey David Ullman. http://www.mmds.org/
• Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. http://labs.google.com/papers/mapreduce.html
• Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung: The Google File System. http://labs.google.com/papers/gfs.html
• 10 Comprehensive Hadoop, Spark and Map-Reduce Articles: https://www.kaggle.com/getting-started/128600
Resources
• Hadoop Wiki
• Introduction
• http://wiki.apache.org/lucene-hadoop/
• Getting Started
• http://wiki.apache.org/lucene-hadoop/GettingStartedWithHadoop
• Map/Reduce Overview
• http://wiki.apache.org/lucene-hadoop/HadoopMapReduce
• http://wiki.apache.org/lucene-hadoop/HadoopMapRedClasses
• Eclipse Environment
• http://wiki.apache.org/lucene-hadoop/EclipseEnvironment
• Javadoc
• http://lucene.apache.org/hadoop/docs/api/

Resources
• Releases from Apache download mirrors
• http://www.apache.org/dyn/closer.cgi/lucene/hadoop/
• Nightly builds of source
• http://people.apache.org/dist/lucene/hadoop/nightly/
• Source code from subversion
• http://lucene.apache.org/hadoop/version_control.html

