Week 02

Big Data Analytics: MapReduce and Apache Hadoop
Dr. Jawad Khokhar
Scalable Computing
What is a Distributed File System?
• A file system is responsible for organizing the long-term storage of information in a computer.
• When many storage computers are connected through a network, we call it a distributed file system.
• A distributed file system distributes data across multiple file servers or multiple locations.
• Data sets, or parts of a data set, can be replicated across the nodes of
a distributed file system.
Distributed File System (DFS)

The computing nodes are clustered in racks connected to each other via a fast network.

[Figure: data blocks distributed across the nodes of several racks.]

High Concurrency:
• Since the data already resides on these nodes, analysis of parts of the data can be done in a data-parallel fashion by moving the computation to the nodes that hold the data. This improves system performance and enables data parallelism.
• Cluster computing: the computing nodes are clustered in racks connected to each other via a fast network.
• Computing in one or more of these clusters across a local area network or the Internet is called distributed computing.

Data Replicated across racks

[Figure: each data block replicated on nodes in multiple racks.]

Scalability
• Data replication also helps with scaling access to this data by many users.
• With highly parallelized replication, each reader can be assigned its own node from which to access and analyze the data.
Data Replication provides Fault Tolerance

[Figure: replicated blocks remain available on other racks when a node or rack fails.]

Replication provides:
• High Availability
• Reliability
Benefits of DFS
• Data scalability
• Data partitioning
• Fault tolerance
• Data replication
• High concurrency
Data Parallelism
• In data parallelism, many jobs that share nothing can work on different data sets or on different parts of a data set.
• Large volumes and varieties of big data can be analyzed using this mode of parallelism, achieving scalability, performance and cost reduction.
Commodity Cluster
• Commodity clusters are affordable, less specialized parallel computers with an average number of computing nodes.
• They are not as powerful as traditional parallel computers and are often built out of less specialized nodes.
• The service-oriented computing community on the Internet has pushed for computing to be done on commodity clusters as distributed computations, in turn reducing the cost of computing over the Internet.
• Commodity clusters are a cost-effective way of achieving data-parallel scalability for big data applications.
Common Failures in Cluster computing
• A node, or an entire rack can fail at any given time.
• The connectivity of a rack to the network can stop, or
• the connections between individual nodes can break.
• It is not practical to restart everything every time a failure happens.
• The ability to recover from such failures is called Fault-tolerance.
• For Fault-tolerance of such systems, we can:
1. Have Redundant data storage, and
2. Restart failed individual parallel jobs.

Programming Models for Big Data
• A programming model is a set of abstract runtime libraries and programming languages that form a model of computation.
• The enabling infrastructure for big data analysis is the distributed file system.
• A programming model for big data enables the programmability of the operations within distributed file systems.
• It allows writing computer programs that work efficiently on top of distributed file systems with big data, and makes it easy to cope with all the potential issues.
Requirements for Big Data Programming Models
• Support big data operations
• Split volumes of data: partitioning and placement of data in and out of computer memory, along with a model to synchronize the data sets later on.
• Fast access: access to the data should be fast.
• Distribution of computations to nodes: allow fast distribution of computations to the nodes within a rack (potentially the data nodes we moved the computation to), and schedule many parallel tasks at once.
• Handle fault tolerance:
• Reliability: enable reliable computing and fault tolerance from failures.
• Recover files when needed.
• Scalability:
• Enable adding more resources (e.g., racks) when needed.
• This is also called scaling out.
• Optimized for different data types: document, table, graph, key-value, stream, multimedia.
MapReduce
MapReduce is a programming model and an associated implementation for processing large data sets.
• Challenges:
• How to distribute computation?
• Distributed/parallel programming is hard
• MapReduce addresses both of the above:
• Google's computational/data manipulation model
• An elegant way to work with big data

Single Node Architecture

[Figure: a single machine with CPU, memory and disk; "classical" data analytics and machine learning/statistics run entirely on this one node.]
Motivation: Google Example
• 20+ billion web pages x 20KB = 400+ TB
• 1 computer reads 30-35 MB/sec from disk
• ~4 months to read the web
• ~1,000 hard drives to store the web
• Takes even more to do something useful
with the data!
• Today, a standard architecture for such problems has emerged: the commodity cluster.
• Cluster of commodity Linux nodes
• Commodity network (ethernet) to connect them
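As a rough sanity check of these figures, here is a small back-of-the-envelope calculation in Python (a sketch using the slide's approximate numbers, not measured values):

pages = 20e9          # 20+ billion web pages
page_size = 20e3      # ~20 KB per page
read_rate = 30e6      # ~30 MB/sec sequential read from a single disk

total_bytes = pages * page_size               # ~4e14 bytes = ~400 TB
seconds = total_bytes / read_rate             # time for one computer to read it all
print(f"total ≈ {total_bytes / 1e12:.0f} TB")              # ≈ 400 TB
print(f"read time ≈ {seconds / (86400 * 30):.1f} months")  # ≈ 4-5 months at 30-35 MB/s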

Cluster Architecture

[Figure: racks of commodity nodes (each with CPU, memory and disk) connected by a switch per rack; rack switches are connected by a backbone switch.]
• 1 Gbps bandwidth between any pair of nodes in a rack
• 2-10 Gbps backbone between racks
• Each rack contains 16-64 nodes
• In 2011 it was estimated that Google had 1M machines (http://bit.ly/Shh0RO)


J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Large-scale Computing
• Large-scale computing for data analytics problems on commodity
hardware (commodity clusters)
• Challenges:
• How do you distribute computation?
• How can we make it easy to write distributed programs?
• Machines fail:
• One server may stay up 3 years (1,000 days)
• If you have 1,000 servers, expect to lose one per day
• People estimated Google had ~1M machines in 2011
• 1,000 machines fail every day!

Idea and Solution
• Issue: Copying data over a network takes time
• Idea:
• Bring computation close to the data
• Store files multiple times for reliability
• Map-reduce addresses these problems
• Google’s computational/data manipulation model
• Elegant way to work with big data
• Storage Infrastructure – File system
• Google: GFS. Hadoop: HDFS
• Programming model
• Map-Reduce

Storage Infrastructure
• Problem:
• If nodes fail, how to store data persistently?
• Answer:
• Distributed File System:
• Provides global file namespace
• Google GFS; Hadoop HDFS;
• Typical usage pattern
• Huge files (100s of GB to TB)
• Data is rarely updated in place
• Reads and appends are common

Distributed File System
• Chunk servers
• File is split into contiguous chunks
• Typically each chunk is 16-64MB
• Each chunk replicated (usually 2x or 3x)
• Try to keep replicas in different racks
• Master node
• a.k.a. Name Node in Hadoop’s HDFS
• Stores metadata about where files are stored
• Might be replicated
• Client library for file access
• Talks to master to find chunk servers
• Connects directly to chunk servers to access data

Distributed File System
• Reliable distributed file system
• Data kept in “chunks” spread across machines
• Each chunk replicated on different machines
• Seamless recovery from disk or machine failure

[Figure: chunks C0-C5 and D0-D1 replicated across chunk servers 1..N.]

Bring computation directly to the data!
Chunk servers also serve as compute servers.
Programming Model: MapReduce

Sample task:
• We have a huge text document
• Count the number of times each distinct word appears in the file
• Sample application:
• Analyze web server logs to find popular URLs

Task: Word Count
Case 1:
• File too large for memory, but all <word, count> pairs fit in memory
• Use a hashtable
Case 2:
• Even the <word, count> pairs do not fit in memory
• Count occurrences of words:
• words(doc.txt) | sort | uniq -c
• where words takes a file and outputs the words in it, one per line
• Case 2 captures the essence of MapReduce
• Great thing is that it is naturally parallelizable

MapReduce: Overview
• Sequentially read a lot of data
• Map:
• Extract something you care about from each record (Key)
• Group by key: Sort and Shuffle
• Reduce:
• Aggregate, summarize, filter or transform
• Write the result
Outline stays the same, Map and Reduce
change to fit the problem
MapReduce: The Map Step

[Figure: each input key-value pair (k, v) is passed to a map task, which emits a set of intermediate key-value pairs.]
MapReduce: The Reduce Step

[Figure: intermediate key-value pairs are grouped by key into (k, [v1, v2, ...]) groups; each group is passed to a reduce task, which emits the output key-value pairs.]
Example: WordCount using MapReduce

Step 0: The files (File 1 ... File N) are stored in a DFS.

Step 1: Map on each node. Map generates key-value pairs.
Node 1 input: "My apple is red and my rose is blue...."
  my, my → (my, 1), (my, 1)
  apple → (apple, 1)
  is, is → (is, 1), (is, 1)
  red → (red, 1)
  and → (and, 1)
  rose → (rose, 1)
  blue → (blue, 1)
Node 2 input: "You are the apple of my eye...."
  You → (You, 1)
  are → (are, 1)
  the → (the, 1)
  apple → (apple, 1)
  of → (of, 1)
  my → (my, 1)
  eye → (eye, 1)

Step 2: Sort and Shuffle. Pairs with the same key are moved to the same node, e.g.
  (apple, 1), (apple, 1)
  (is, 1), (is, 1)
  (red, 1)
  (rose, 1)
  (You, 1)

Step 3: Reduce. Add the values for the same key.
  (You, 1) → (You, 1)
  (apple, 1), (apple, 1) → (apple, 2)
  (my, 1), (my, 1), (my, 1) → (my, 3)
  (red, 1) → (red, 1)
  (rose, 1) → (rose, 1)
Word Count Using MapReduce
map(key, value):
// key: document name; value: text of the document
for each word w in value:
emit(w, 1)

reduce(key, values):
// key: a word; values: an iterator over counts
result = 0
for each count v in values:
result += v
emit(key, result)
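This pseudocode can be exercised locally. Below is a minimal, self-contained Python sketch that simulates the three phases (map, shuffle/group by key, reduce) in memory; it illustrates the model only and is not Hadoop's actual API.

from collections import defaultdict

def map_fn(doc_name, text):
    # key: document name; value: text of the document
    for word in text.split():
        yield word.lower().strip(".,"), 1

def reduce_fn(word, counts):
    # key: a word; values: an iterable of counts
    yield word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: apply map_fn to every (key, value) input pair
    intermediate = [kv for k, v in inputs for kv in map_fn(k, v)]
    # Shuffle / group-by-key phase
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Reduce phase: one reduce call per unique key
    return [out for k, vs in groups.items() for out in reduce_fn(k, vs)]

docs = [("doc1", "My apple is red and my rose is blue."),
        ("doc2", "You are the apple of my eye.")]
print(sorted(run_mapreduce(docs, map_fn, reduce_fn)))
# includes ('apple', 2), ('my', 3), ('is', 2), ...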

[Figure: Map → Shuffle and Sort → Reduce.]
More Specifically
• Input: a set of key-value pairs
• Programmer specifies two methods:
• Map(k, v) → <k', v'>*
• Takes a key-value pair and outputs a set of key-value pairs
• E.g., key is the filename, value is a single line in the file
• There is one Map call for every (k, v) pair
• Reduce(k', <v'>*) → <k', v''>*
• All values v' with the same key k' are reduced together and processed in v' order
• There is one Reduce function call per unique key k'
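As an illustration only (these aliases are mine, not part of any Hadoop API), the two signatures can be written as Python type aliases:

from typing import Callable, Iterable, TypeVar

K = TypeVar("K"); V = TypeVar("V")
K2 = TypeVar("K2"); V2 = TypeVar("V2"); V3 = TypeVar("V3")

# Map(k, v) -> a set of intermediate (k', v') pairs; called once per input pair.
MapFn = Callable[[K, V], Iterable[tuple[K2, V2]]]

# Reduce(k', [v', ...]) -> a set of output (k', v'') pairs; called once per unique key.
ReduceFn = Callable[[K2, Iterable[V2]], Iterable[tuple[K2, V3]]]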

Map-Reduce: A diagram

[Figure: a big document flows through three phases.]
• MAP: read the input and produce a set of key-value pairs.
• Group by key: collect all pairs with the same key (hash merge, shuffle, sort, partition).
• Reduce: collect all values belonging to each key and output the result.
Map-Reduce: In Parallel

[Figure: several map nodes feed several reduce nodes; all phases are distributed, with many tasks doing the work.]
Map-Reduce Pattern
Map-Reduce: Environment

[Figure: Input 0..2 → Map 0..2 → Shuffle → Reduce 0..1 → Out 0..1.]

The Map-Reduce environment takes care of:
• Partitioning the input data
• Scheduling the program's execution across a set of machines
• Performing the group-by-key step
• Handling machine failures
• Managing required inter-machine communication
Data Flow
• Input and final output are stored on a distributed file system (DFS):
• The scheduler tries to schedule map tasks "close" to the physical storage location of the input data
• Intermediate results are stored on the local FS of the Map and Reduce workers
• Output is often the input to another MapReduce task

Coordination: Master
• Master node takes care of coordination:
• Task status: (idle, in-progress, completed)
• Idle tasks get scheduled as workers become available
• When a map task completes, it sends the master the location and sizes of its R
intermediate files, one for each reducer
• Master pushes this info to reducers

• Master pings workers periodically to detect failures

Dealing with Failures
• Map worker failure
• Map tasks completed or in-progress at worker are reset to idle since output of
the mapper is stored on the local FS
• Reduce workers are notified when task is rescheduled on another worker
• Reduce worker failure
• Only in-progress tasks are reset to idle since output of the reducer is stored
on the DFS
• Reduce task is restarted
• Master failure
• MapReduce task is aborted and client is notified

How many Map and Reduce jobs?
• M map tasks, R reduce tasks
• Rule of thumb:
• Make M much larger than the number of nodes in the
cluster
• One DFS chunk per map is common
• Improves dynamic load balancing and speeds up
recovery from worker failures
• Usually R is smaller than M
• Because output is spread across R files. If there are too
many Reduce tasks the number of intermediate files
explodes.

Combiners
• Often a Map task will produce many pairs of the form (k,v1), (k,v2), …
for the same key k (e.g. the word “the”)
• E.g., popular words in the word count example
• Can save network time by
pre-aggregating values in
the mapper:
• combine(k, list(v1)) → v2
• Combiner is usually same
as the reduce function

Combiners
• Back to our word counting example:
• Combiner combines the values of all keys of a single
mapper (single machine):

• Much less data needs to be copied and shuffled!
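A small Python sketch of the idea (a local simulation, not Hadoop's Combiner class): the combiner pre-aggregates each mapper's output before anything is sent over the network.

from collections import Counter

def map_with_combiner(text):
    # The mapper conceptually emits one (word, 1) per word; the combiner
    # sums those pairs locally, on the mapper's machine, before the shuffle.
    local_counts = Counter(text.split())      # combine step: sum values per key
    return list(local_counts.items())         # e.g. ('the', 57) instead of 57 separate pairs

# Only one pair per distinct word leaves each mapper, so far less data is shuffled.
print(map_with_combiner("the cat sat on the mat the end"))
# [('the', 3), ('cat', 1), ('sat', 1), ('on', 1), ('mat', 1), ('end', 1)]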

Combiners – Caution!
• Works only if reduce function is commutative and associative
• Sum is commutative and associative
• a + b = b + a and (a + b) + c = a + (b + c)
Combiners – Caution!
• Works only if reduce function is commutative and associative
• Average is NOT commutative and associative

• Combining two partial averages as (avg1 + avg2) / 2 does not give the true average.
Combiners – Caution!
• Works only if reduce function is commutative and associative
• Average is NOT commutative and associative, but can be computed if both
the sum and count are returned by the map function (Combiner Trick)

• Combining (sum1, count1) and (sum2, count2) gives the exact average: (sum1 + sum2) / (count1 + count2)
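A minimal sketch of the (sum, count) trick; the function names are illustrative, not from any library.

def map_partition(values):
    # Each mapper emits a partial (sum, count) instead of a partial average.
    return (sum(values), len(values))

def reduce_average(partials):
    # Partial (sum, count) pairs can be combined in any order: the result is exact.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

node1 = [1, 2, 3]          # average 2.0
node2 = [10, 20, 30, 40]   # average 25.0
print(reduce_average([map_partition(node1), map_partition(node2)]))  # 15.142857...
# Naively averaging the averages would give (2.0 + 25.0) / 2 = 13.5, which is wrong.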

Combiners – Caution!
• Works only if reduce function is commutative and associative
• Median is NOT commutative and associative
• There is no way to split the median computation this way
Partition Function
• Want to control how keys get partitioned
• Inputs to map tasks are created by contiguous splits of
input file
• Ensure that records with the same intermediate key end up
at the same worker
• System uses a default partition function:
• hash(key) mod R

• Sometimes it is useful to override the hash function:
• E.g., hash(hostname(URL)) mod R ensures URLs from a host end up in the same output file
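A plain-Python sketch of the default and custom partition functions (Hadoop's real Partitioner is a Java class; this only illustrates the hashing idea, and the example URLs are made up):

from urllib.parse import urlparse

R = 4  # number of reduce tasks

def default_partition(key):
    # System default: hash(key) mod R
    return hash(key) % R

def host_partition(url):
    # Custom: hash(hostname(URL)) mod R, so all URLs from one host
    # land in the same reduce task (and therefore the same output file).
    return hash(urlparse(url).netloc) % R

urls = ["http://example.com/a", "http://example.com/b", "http://other.org/x"]
print([host_partition(u) for u in urls])  # the first two always match within a run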

Problems Suited for Map-Reduce
Example: Host size
• Suppose we have a large web corpus
• Look at the metadata file
• Lines of the form: (URL, size, date, …)
• For each host, find the total number of bytes
• That is, the sum of the page sizes for all URLs from that particular host
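A hedged map/reduce sketch for this task, assuming metadata lines of the form "URL, size, date, ..." as stated above (the sample lines are invented):

from collections import defaultdict
from urllib.parse import urlparse

def map_fn(line):
    # Metadata line of the form: URL, size, date, ...
    url, size, *rest = line.split(",")
    yield urlparse(url.strip()).netloc, int(size)   # key = host, value = page size

def reduce_fn(host, sizes):
    yield host, sum(sizes)                          # total bytes for that host

lines = ["http://a.com/x, 1200, 2024-01-01",
         "http://a.com/y, 800, 2024-01-02",
         "http://b.org/z, 500, 2024-01-03"]
pairs = [kv for line in lines for kv in map_fn(line)]

groups = defaultdict(list)          # stand-in for the shuffle
for host, size in pairs:
    groups[host].append(size)
print({h: next(reduce_fn(h, s))[1] for h, s in groups.items()})  # {'a.com': 2000, 'b.org': 500}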

Example: Language Model
• Statistical machine translation:
• Need to count number of times every 5-word sequence occurs in a large
corpus of documents

• Very easy with MapReduce:


• Map:
• Extract (5-word sequence, count) from document
• Reduce:
• Combine the counts
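A short sketch of the map side for 5-word sequences (the reduce side is the same counting reducer as in word count); the function names are illustrative:

def map_fn(doc_id, text):
    words = text.split()
    # Emit every overlapping 5-word sequence with a count of 1.
    for i in range(len(words) - 4):
        yield tuple(words[i:i + 5]), 1

def reduce_fn(seq, counts):
    yield seq, sum(counts)   # combine the counts per 5-word sequence

print(list(map_fn("d1", "to be or not to be that is")))
# 4 overlapping 5-grams, each with count 1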

Example: Relational-Algebra Operations in database queries (SQL)
Background
• A relation is a table with column headers called attributes.
• Rows of the relation are called tuples.
• The set of attributes of a relation is called its schema.
• R(A1, A2, . . . , An) -> The relation name is R and its attributes are A1, A2, . . . , An

Example relations (attributes are the column headers, rows are the tuples):

table1(a, b, c):        table2(d, b, f):
a1 b1 c1                d1 b1 f1
a1 b2 c2                d2 b2 f2
a1 b3 c3                d3 b3 f3
a4 b4 c4                d4 b4 f4

Relational operations
• Selection: Apply a condition C to each tuple in the relation and produce as output only those tuples that satisfy C. Denoted as σ_C(R).
• Example SQL: select * from table1 where a = 'a1';
• Projection: For some subset S of the attributes of the relation, produce from each tuple only the components for the attributes in S. Denoted π_S(R).
• Example SQL: select a,b from table1;
• Union, Intersection, and Difference
• Example SQL: select a,b,c from table1 union select d,b,f from table2;
• Natural Join: Given two relations, compare each pair of tuples, one from each relation. If the tuples agree on all the attributes that are common to the two schemas, then produce a tuple that has components for each of the attributes in either schema. The natural join of relations R and S is denoted R ⋈ S.
• Example SQL (inner join): select * from table1 R inner join table2 S on R.b = S.b;
• Grouping and Aggregation: Given a relation R, partition its tuples according to their values in one set of attributes G, called the grouping attributes. Then, for each group, aggregate the values in certain other attributes. The normally permitted aggregations are SUM, COUNT, AVG, MIN, and MAX.
• Example SQL: select a,count(*) from table1 group by a;
Example: Join By Map-Reduce
• Compute the natural join R(A,B) ⋈ S(B,C)
• R and S are each stored in files
• Tuples are pairs (a,b) or (b,c)

R(A, B):     S(B, C):     R ⋈ S (A, C columns shown):
a1 b1        b2 c1        a3 c1
a2 b1        b2 c2        a3 c2
a3 b2        b3 c3        a4 c3
a4 b3
Example: Join By Map-Reduce
• Use a hash function h from B-values to 1...k
• A Map process turns:
• Each input tuple R(a,b) into key-value pair (b,(a,R))
• Each input tuple S(b,c) into (b,(c,S))

• Map processes send each key-value pair with key b to Reduce process h(b)
• Hadoop does this automatically; just tell it what k is.
• Each Reduce process matches all the pairs (b, (a, R)) with all (b, (c, S)) and outputs (a, b, c).
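A minimal in-memory Python sketch of this join, using tags 'R' and 'S' to mark which relation each tuple came from (in Hadoop the routing to Reduce process h(b) happens automatically):

from collections import defaultdict

R = [("a1", "b1"), ("a2", "b1"), ("a3", "b2"), ("a4", "b3")]   # tuples (a, b)
S = [("b2", "c1"), ("b2", "c2"), ("b3", "c3")]                 # tuples (b, c)

# Map: key every tuple by its B-value and tag it with its relation.
mapped = [(b, ("R", a)) for a, b in R] + [(b, ("S", c)) for b, c in S]

# Shuffle: all pairs with the same b reach the same reduce process h(b).
groups = defaultdict(list)
for b, tagged in mapped:
    groups[b].append(tagged)

# Reduce: match every (b, (a, R)) with every (b, (c, S)) and output (a, b, c).
joined = [(a, b, c)
          for b, vals in groups.items()
          for tag_a, a in vals if tag_a == "R"
          for tag_c, c in vals if tag_c == "S"]
print(sorted(joined))   # [('a3', 'b2', 'c1'), ('a3', 'b2', 'c2'), ('a4', 'b3', 'c3')]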

Example: Batch Gradient Descent

Batch Gradient Descent update rule:
  θ_j := θ_j − α · (1/400) · Σ_{i=1..400} (h_θ(x^(i)) − y^(i)) · x_j^(i)

• Assume m = 400 training examples; normally m would be more like 400,000,000.
• If m is large, computing this sum on a single machine is really expensive.

Split the training set into different subsets (4 subsets for 4 nodes [Mappers]):

Node 1: uses (x^(1), y^(1)), ..., (x^(100), y^(100))
  temp_j^(1) := Σ_{i=1..100} (h_θ(x^(i)) − y^(i)) · x_j^(i)

Node 2: uses (x^(101), y^(101)), ..., (x^(200), y^(200))
  temp_j^(2) := Σ_{i=101..200} (h_θ(x^(i)) − y^(i)) · x_j^(i)

Node 3: uses (x^(201), y^(201)), ..., (x^(300), y^(300))
  temp_j^(3) := Σ_{i=201..300} (h_θ(x^(i)) − y^(i)) · x_j^(i)

Node 4: uses (x^(301), y^(301)), ..., (x^(400), y^(400))
  temp_j^(4) := Σ_{i=301..400} (h_θ(x^(i)) − y^(i)) · x_j^(i)

Send the partial sums to a Reducer, which puts them back together and updates θ:
  θ_j := θ_j − α · (1/400) · (temp_j^(1) + temp_j^(2) + temp_j^(3) + temp_j^(4))
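A hedged sketch of one such distributed gradient step for linear regression, with the four mappers simulated as array slices (NumPy is used for brevity; the data here is synthetic):

import numpy as np

np.random.seed(0)
m, n = 400, 3
X = np.random.randn(m, n)                     # 400 training examples, 3 features
y = X @ np.array([1.0, -2.0, 0.5])            # synthetic targets
theta, alpha = np.zeros(n), 0.1

def mapper(X_part, y_part, theta):
    # temp_j = sum over this node's examples of (h_theta(x) - y) * x_j
    errors = X_part @ theta - y_part
    return X_part.T @ errors                  # vector of partial sums, one entry per j

# Four mappers, each over 100 examples.
partials = [mapper(X[i:i + 100], y[i:i + 100], theta) for i in range(0, m, 100)]

# Reducer: add the partial sums and take one gradient step.
theta = theta - alpha * (1.0 / m) * sum(partials)
print(theta)   # identical to the single-machine batch gradient step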
Other Examples
• Matrix-Vector Multiplication
• Matrix Multiplication
Implementations
• Google
• Uses Google File System (GFS) for stable storage
• Not available outside Google
• Hadoop
• An open-source implementation in Java
• Uses HDFS for stable storage
• Download: http://lucene.apache.org/hadoop/

Problems not suited for MapReduce
• MapReduce is great for:
• Problems that require sequential data access
• Large batch jobs (not interactive, real-time)
• MapReduce is inefficient for problems where random (or irregular)
access to data required:
• Graphs
• Interdependent data
• Comparisons of many pairs of items
Cloud Computing
• Ability to rent computing by the hour
• Additional services, e.g., persistent storage
• Amazon's "Elastic Compute Cloud" (EC2)
• S3 (stable storage)
• Elastic MapReduce (EMR)
Cost in MapReduce Jobs
• Computation Cost of mappers and the reducers
• System cost is principally sorting key-value pairs by key and merging them at
Reduce tasks
• Communication Cost of shipping key-value pairs from mappers to
reducers
• Map tasks are executed where their input data resides, so no communication is required.
Hands on practice environments
• Cloudera VM
• https://downloads.cloudera.com/demo_vm/virtualbox/cloudera-quickstart-
vm-5.4.2-0-virtualbox.zip
• HDFS, SQL, MongoDB, Spark
• Google CoLab/Jupyter Notebooks
• Spark, Python
Cloudera VM for hands-on: Installation Instructions
Please use the following instructions to download and install the Cloudera Quickstart VM with VirtualBox before proceeding
to the Getting Started with the Cloudera VM Environment video. The screenshots are from a Mac but the instructions should
be the same for Windows. Please see the discussion boards if you have any issues.
1. Install VirtualBox. Go to https://www.virtualbox.org/wiki/Downloads to download and install VirtualBox for your
computer.
2. Download the Cloudera VM. Download the Cloudera VM
from https://downloads.cloudera.com/demo_vm/virtualbox/cloudera-quickstart-vm-5.4.2-0-virtualbox.zip. The VM is over
4GB, so will take some time to download.
3. Unzip the Cloudera VM:
On Mac: Double click cloudera-quickstart-vm-5.4.2-0-virtualbox.zip
On Windows: Right-click cloudera-quickstart-vm-5.4.2-0-virtualbox.zip and select “Extract All…”
4. Start VirtualBox.
5. Begin importing. Import the VM by going to File -> Import Appliance
6. Click the Folder icon.
7. Select the cloudera-quickstart-vm-5.4.2-0-virtualbox.ovf from the Folder where you unzipped the VirtualBox
VM and click Open.
8. Click Continue to proceed.
9. Click Import.
10. The virtual machine image will
be imported. This can take several
minutes.
11. Launch Cloudera VM. When the importing is finished, the quickstart-vm-5.4.2-0 VM will appear on the left
in the VirtualBox window. Select it and click the Start button to launch the VM.
12. Cloudera VM booting. It will take several minutes for the Virtual Machine to start. The booting
process takes a long time since many Hadoop tools are started
13. The Cloudera VM desktop. Once the booting process is complete, the desktop will appear with a browser.
Apache Hadoop:
A Framework to manage and process Big Data
Apache Hadoop
• Hadoop is a framework that allows us to store and process large data sets in a parallel and distributed manner
• Created in 2005 by Yahoo
• Written in Java
• Implements the MapReduce Big Data Programming model
• More Big Data frameworks have since been released: now there are over 100!
The Hadoop Ecosystem

[Figure: the Hadoop ecosystem stack, including HDFS, YARN, MapReduce, Hive, Pig, Giraph, Storm, Spark, Flink, HBase, Cassandra, MongoDB and Zookeeper.]
The Hadoop Ecosystem

[Figure: the ecosystem layers. Higher levels (Hive, Pig, Giraph, Spark, Storm, Flink, MapReduce, HBase, Cassandra, MongoDB): interactivity. Lower levels (Zookeeper, YARN, HDFS): storage and scheduling.]
The Hadoop Ecosystem
• Distributed file system (HDFS) as the foundation
• Scalable storage
• Fault tolerance
The Hadoop Ecosystem
• YARN: Yet Another Resource Negotiator
• Used for flexible scheduling and resource management
• YARN schedules jobs on more than 40,000 servers at Yahoo!
The Hadoop Ecosystem
• MapReduce: implementation of the MapReduce programming model
• Map → apply(), Reduce → summarize()
• Google used to use MapReduce to index the web
The Hadoop Ecosystem
• Higher-level programming models
• Pig = dataflow scripting (created at Yahoo)
• Hive = SQL-like queries (created at Facebook)
The Hadoop Ecosystem
• Specialized models for graph processing
• Giraph is used by Facebook to analyze social graphs
The Hadoop Ecosystem
• Real-time and in-memory processing (Spark, Storm, Flink)
• 100x faster for some tasks
The Hadoop Ecosystem
• NoSQL databases (HBase, Cassandra, MongoDB)
• Key-value stores
• HBase is used by Facebook's messaging platform
The Hadoop Ecosystem
• Zookeeper for management:
• Synchronization
• Configuration
• High availability
• All these tools are open source
• Large community for support
• We can download them separately or as part of pre-built virtual machine images that contain all the necessary tools
The Hadoop Distributed File System (HDFS):
A Storage System for Big Data
HDFS is the foundation of the Hadoop ecosystem
• HDFS provides scalable and reliable storage
• Scalability: the property of Hadoop to handle large data sets by adding more resources to the system as and when needed. This is also called scaling out. If you run out of space, you can simply add more nodes to increase the space.
• Reliability: the ability to cope with hardware failures, making the system fault tolerant
• Fault tolerance: the ability to continue operating without interruption when one or more nodes fail
• Can handle a variety of file types
HDFS allows storing massively large data sets
• Up to 200 petabytes, 4,500 servers, 1 billion files and blocks!

HDFS splits files across nodes for parallel access
• HDFS achieves scalability by partitioning or splitting large files across multiple computers.
• The default chunk size (the size of each piece of a file) is 128 MB, but we can configure this to any size.

What happens if a node fails? Data may get lost.
• Solution: replicate duplicate copies on different nodes. Replication ensures fault tolerance.
• Replication Factor: the number of times the Hadoop framework replicates each and every data block.
• The default replication factor is 3.
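A small worked example of these defaults (illustrative numbers; assumes the 128 MB block size and replication factor 3 above):

import math

block_size_mb = 128          # default HDFS block (chunk) size
replication = 3              # default replication factor
file_size_mb = 1024          # a 1 GB file, as an example

blocks = math.ceil(file_size_mb / block_size_mb)        # 8 blocks
stored_copies = blocks * replication                    # 24 block replicas in the cluster
raw_storage_mb = file_size_mb * replication             # ~3 GB of raw disk used
print(blocks, stored_copies, raw_storage_mb)            # 8 24 3072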
Two key components of HDFS
• NameNode (also called the master node) for metadata
• It tracks files, manages the file system and holds the metadata of all of the stored data.
• Contains the details of the number of blocks, the locations of the DataNodes that the data is stored in, where the replicas are stored, and other details.
• The NameNode has direct contact with the client.
• Usually one per cluster.
• Secondary NameNode: takes care of the checkpoints of the file system metadata held in the NameNode.
• DataNode (also called the slave node) for block storage
• A DataNode stores data as blocks.
• Listens to the NameNode for block creation, deletion and replication.
• Replication provides data locality and fault tolerance.
• Usually one per machine.

Data locality is the practice of moving the computation close to where the actual data resides on the node, instead of moving large data to the computation.
YARN:
The Resource Manager for Hadoop
YARN: Yet Another Resource Negotiator
• YARN provides flexible resource management for the Hadoop cluster. It interacts with applications and schedules resources for their use.
• Allows batch processing, stream processing, graph processing and interactive processing.
• YARN enables running applications other than MapReduce over HDFS, extending Hadoop to support multiple frameworks such as Giraph, Spark and Flink.
Hadoop evolved over time

[Figure: Hadoop 1.0 ran Hive, Pig and other tools directly on MapReduce over HDFS; Hadoop 2.0 adds YARN between HDFS and the processing frameworks (MapReduce, Giraph, Spark, Storm, Flink and others).]

YARN led to more applications, and the list keeps growing.
Benefits of YARN (Hadoop 2.0)
• It lets you run many distributed applications over the same Hadoop
cluster.
• YARN reduces the need to move data around and supports higher
resource utilization resulting in lower costs.
• It's a scalable platform that has enabled growth of several
applications over the HDFS, enriching the Hadoop ecosystem.
When to reconsider Hadoop?
When to use Hadoop?
• Future anticipated data growth
• Long term availability of data
• Many platforms over single data store
• High Volume
• High Variety
When Hadoop may not be the most suitable option
• Small Datasets
• Task Level Parallelism
• Advanced Algorithms
• Replacement to your infrastructure – Databases are still needed
• Random Data Access
Hands on Exercise
• Copy data into the Hadoop Distributed File System (HDFS)
• Run the WordCount program
Copy data into the Hadoop Distributed File System (HDFS)
Download the Shakespeare text. Enter the following link in the
browser: http://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespear
e.txt. Save the file as words.txt
• Open a terminal shell: Open a terminal shell by clicking on the square black box on
the top left of the screen.
• Run cd Downloads to change to the Downloads directory.
• Run ls to see that words.txt was saved.
• Copy file to HDFS: Run hadoop fs -copyFromLocal words.txt to copy
the text file to HDFS.
• Verify file was copied to HDFS: Run hadoop fs -ls to verify the file was
copied to HDFS.
• Copy a file within HDFS: Run hadoop fs -cp words.txt words2.txt to
make a copy of words.txt called words2.txt
• Copy a file from HDFS: Run hadoop fs -copyToLocal words2.txt . to
copy words2.txt to the local directory.
• Delete a file in HDFS: Run hadoop fs -rm words2.txt
Run the WordCount program
• See example MapReduce programs. Hadoop comes with several example MapReduce applications.
• List the programs by running hadoop jar /usr/jars/hadoop-examples.jar.
• We are interested in running wordcount.
Run the WordCount program
• Verify words.txt file exists. Run hadoop fs -ls
• See WordCount command line arguments. We can learn how to run WordCount by examining its command-line
arguments. Run hadoop jar /usr/jars/hadoop-examples.jar wordcount.

• Run WordCount: hadoop jar /usr/jars/hadoop-examples.jar wordcount words.txt out. As WordCount executes, Hadoop prints the progress in terms of Map and Reduce. When the WordCount job is complete, both will say 100%.

• See the WordCount output directory. Run hadoop fs -ls

• Look inside the output directory. The directory created by WordCount contains several files. Look inside the directory by running hadoop fs -ls out

• Copy the WordCount results to the local file system. Copy part-r-00000 to the local file system by running hadoop fs -copyToLocal out/part-r-00000 local.txt
• View the WordCount results. View the contents of the results: more local.txt
Further Reading
• Chapter 2: Mining of Massive Datasets, by Anand Rajaraman and Jeffrey David Ullman. http://www.mmds.org/
• Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. http://labs.google.com/papers/mapreduce.html
• Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung: The Google File System. http://labs.google.com/papers/gfs.html
• 10 Comprehensive Hadoop, Spark and Map-Reduce Articles: https://www.kaggle.com/getting-started/128600
Resources
• Hadoop Wiki
• Introduction
• http://wiki.apache.org/lucene-hadoop/
• Getting Started
• http://wiki.apache.org/lucene-hadoop/GettingStartedWithHadoop
• Map/Reduce Overview
• http://wiki.apache.org/lucene-hadoop/HadoopMapReduce
• http://wiki.apache.org/lucene-hadoop/HadoopMapRedClasses
• Eclipse Environment
• http://wiki.apache.org/lucene-hadoop/EclipseEnvironment
• Javadoc
• http://lucene.apache.org/hadoop/docs/api/

Resources
• Releases from Apache download mirrors
• http://www.apache.org/dyn/closer.cgi/lucene/hadoop/
• Nightly builds of source
• http://people.apache.org/dist/lucene/hadoop/nightly/
• Source code from subversion
• http://lucene.apache.org/hadoop/version_control.html

