• “Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models”
• Hadoop → an ideal solution for analyzing and gaining insights from big data.
Ø The de facto big-data processing platform
Ø Storage: Hadoop Distributed File System (HDFS)
Ø Computation: MapReduce (MR)
• HDFS and MR distribute data among the nodes of a cluster and process it in parallel (a word-count sketch follows below).
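Since MapReduce is the computation model used throughout, here is a minimal word-count job against the Hadoop MapReduce Java API; it mirrors the standard Apache tutorial example, with input and output paths taken from the command line:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Mapper: emits (word, 1) for every word in its input split
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }
    // Reducer: sums the counts emitted for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            result.set(sum);
            context.write(key, result);
        }
    }
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}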
Realizing the benefit: reading 1 TB of data (taking 1 TB ≈ 1,000,000 MB)
• 1 machine, 4 I/O channels, each channel at 100 MB/s → 400 MB/s aggregate, so reading 1 TB takes about 2,500 s (~42 minutes)
• 10 machines, 4 I/O channels each at 100 MB/s → 4,000 MB/s aggregate, so the same 1 TB is read in about 250 s (~4.2 minutes)
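The arithmetic can be checked directly (a minimal sketch; the throughput figures are the slide's assumptions):

public class ReadTimeEstimate {
    public static void main(String[] args) {
        double terabyteMB = 1_000_000.0;  // 1 TB expressed in MB (decimal units)
        double channelMBps = 100.0;       // per-channel throughput from the slide
        int channels = 4;                 // I/O channels per machine
        for (int machines : new int[] {1, 10}) {
            double aggregate = machines * channels * channelMBps; // MB/s
            double seconds = terabyteMB / aggregate;
            System.out.printf("%2d machine(s): %.0f MB/s -> %.0f s (~%.1f min)%n",
                    machines, aggregate, seconds, seconds / 60);
        }
    }
}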
In 2002, Doug Cutting and Mike Cafarella started the Apache Nutch project, aiming to build a web search engine that could crawl and index websites.
In 2003, Google released a paper on the Google File System (GFS), an architecture for storing large datasets in a distributed environment.
In 2004, Nutch's developers built an open-source implementation, the Nutch Distributed File System (NDFS).
In 2004, Google introduced MapReduce to process large datasets in parallel.
In 2006, the distributed storage and processing code of Nutch was moved into an independent subproject called "Hadoop".
In 2006, Doug Cutting joined Yahoo to scale the Hadoop project to clusters of thousands of nodes.
In 2007, Yahoo started running Hadoop on a 1000-node cluster.
In 2008, Hadoop confirmed its success by becoming a top-level project at Apache.
In 2008, Hadoop beat supercomputers to become the fastest system to sort an entire terabyte of data.
In November 2008, Google reported that its MapReduce implementation sorted 1 terabyte in 68 seconds.
In April 2009, a team at Yahoo used Hadoop to sort 1 terabyte in 62 seconds, beating Google's MapReduce implementation.
In December 2011, Apache released Hadoop version 1.0.
In May 2012, the Hadoop 2.0.0-alpha version was released.
In December 2017, release 3.0.0 became available; the 3.3 line (3.3.4) followed in August 2022.
Hadoop Characteristics
HDFS Architecture
HDFS is a block-structured file system where each file is divided into blocks of a pre-determined size and stored across a cluster of one or several machines.
v Moving computation is cheaper than moving data.
Name Node:
v Master daemon: maintains and manages the Data Nodes.
v Records the metadata of all the files stored in the cluster, e.g. location of data, size of files, permissions, etc. (see the sketch after this list).
v Regularly receives a Heartbeat and block report from the live Data Nodes.
v Responsible for maintaining the replication factor.
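The metadata the NameNode serves can be inspected from a client through the standard FileSystem API; a minimal sketch, assuming a NameNode at hdfs://namenode:9000 and a hypothetical file path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);
        FileStatus status = fs.getFileStatus(new Path("/user/demo/Example.txt")); // hypothetical file
        // The NameNode answers this call from its in-memory metadata:
        // which blocks make up the file and which DataNodes hold each replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
        }
        fs.close();
    }
}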
HDFS Architecture
Data Node:
v Slave daemon.
v The actual data is stored on the Data Nodes.
v Runs on inexpensive commodity hardware.
v Data Nodes serve read and write requests from the clients.
v Sends a heartbeat to the Name Node periodically to report overall health; the default frequency is 3 seconds (see the configuration sketch below).
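The heartbeat period is governed by the dfs.heartbeat.interval property (in seconds); a minimal sketch of setting it programmatically, though on a real cluster it would normally be set in hdfs-site.xml on each DataNode:

import org.apache.hadoop.conf.Configuration;

public class HeartbeatConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // dfs.heartbeat.interval: DataNode-to-NameNode heartbeat period in seconds (default 3)
        conf.setLong("dfs.heartbeat.interval", 3L);
        System.out.println("heartbeat interval = "
                + conf.getLong("dfs.heartbeat.interval", 3L) + " s");
    }
}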
SECONDARY NAME NODE
• Copies the FsImage and Transaction Log from the NameNode to a temporary directory
• Merges the FsImage and Transaction Log into a new FsImage in the temporary directory
• Uploads the new FsImage to the NameNode
– The Transaction Log on the NameNode is then purged
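How often this checkpoint runs is controlled by two standard properties; a minimal sketch using Hadoop's default values:

import org.apache.hadoop.conf.Configuration;

public class CheckpointConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Checkpoint at least once per hour...
        conf.setLong("dfs.namenode.checkpoint.period", 3600L);   // seconds
        // ...or sooner, once this many transactions accumulate in the log
        conf.setLong("dfs.namenode.checkpoint.txns", 1_000_000L);
    }
}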
Blocks & Replicas
• Blocks are the smallest continuous location on your hard drive where data is stored; an HDFS file is split into blocks.
• The default size of each block is 128 MB in Apache Hadoop 2.x (64 MB in Apache Hadoop 1.x), and it is configurable.
• Example: Example.txt (514 MB) is split into four 128 MB blocks plus one 2 MB block.
• HDFS provides a reliable way to store huge data in a distributed environment: blocks are replicated to provide fault tolerance.
• The default replication factor is 3.
• The NameNode collects block reports and detects over- and under-replicated blocks.
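Both the block size and the replication factor are ordinary configuration properties (dfs.blocksize and dfs.replication); a minimal sketch, with a hypothetical file path for the per-file case:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024); // 128 MB block size
        conf.setInt("dfs.replication", 3);                 // replication factor of 3
        FileSystem fs = FileSystem.get(conf);
        // Replication can also be changed for an existing file (path is hypothetical):
        fs.setReplication(new Path("/user/demo/Example.txt"), (short) 3);
        fs.close();
    }
}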
Block Placement
• One replica on the local node, a second replica on a node in a remote rack, a third replica on a different node in that same remote rack; additional replicas are placed randomly.
• Data placement is exposed so that computation can be migrated to the data.
HDFS Read Architecture:
v The client reaches out to the NameNode asking for block metadata.
v The NameNode returns the list of DataNodes where each block (Block A and Block B) is stored.
v The client then connects to the DataNodes where the blocks are stored.
v The client starts reading data in parallel from the DataNodes (Block A from DataNode 1 and Block B from DataNode 3).
v Once the client has all the required file blocks, it combines them to form the file.
v While serving a client's read request, HDFS selects the replica closest to the client, which reduces read latency and bandwidth consumption.
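In the Java API all of these steps happen behind a single open() call; a minimal read sketch, assuming a NameNode at hdfs://namenode:9000 and a hypothetical file path:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);
        // open() asks the NameNode for block metadata, then streams the
        // block contents from the DataNodes holding the closest replicas.
        try (FSDataInputStream in = fs.open(new Path("/user/demo/Example.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}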
MUTATION ORDER AND LEASES
• A mutation is an operation that changes the contents or metadata of a chunk, such as an append or write operation.
• Each mutation is performed at all replicas.
• Leases are used to maintain a consistent mutation order.
• The master grants a chunk lease to one replica (the primary).
• The primary picks the serial order for all mutations to the chunk.
• All replicas follow this order (consistency); a toy sketch of the idea follows below.
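A toy sketch of the ordering idea, not the actual GFS implementation: the primary stamps each mutation with a serial number, and every replica applies mutations strictly in that order; the Primary and Replica classes here are purely illustrative.

import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class LeaseDemo {
    public static void main(String[] args) {
        List<Replica> replicas = List.of(new Replica(), new Replica(), new Replica());
        Primary primary = new Primary(replicas); // holds the chunk lease
        primary.mutate("append 'a'");
        primary.mutate("write offset=0 'b'");
    }
}

// The lease holder: the single place where mutation order is decided.
class Primary {
    private final AtomicLong nextSerial = new AtomicLong(0);
    private final List<Replica> replicas;
    Primary(List<Replica> replicas) { this.replicas = replicas; }
    void mutate(String op) {
        long serial = nextSerial.getAndIncrement();     // primary picks the order
        for (Replica r : replicas) r.apply(serial, op); // all replicas follow it
    }
}

// Each replica applies mutations strictly in serial order, so all replicas
// end up with identical contents.
class Replica {
    private long lastApplied = -1;
    synchronized void apply(long serial, String op) {
        if (serial != lastApplied + 1)
            throw new IllegalStateException("out-of-order mutation");
        lastApplied = serial;
        System.out.println("applied #" + serial + ": " + op);
    }
}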
DATA CORRECTNESS
• Checksums are used to validate data
– CRC32 is used
• File creation
– The client computes a checksum per 512 bytes
– The DataNode stores the checksums
• File access
– The client retrieves the data and checksum from the DataNode
– If validation fails, the client tries other replicas
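A minimal illustration of per-chunk CRC32 validation in plain Java, following the slide's 512-byte granularity; this sketches the idea rather than HDFS's internal checksum code:

import java.util.zip.CRC32;

public class ChunkChecksums {
    static final int CHUNK = 512; // bytes per checksum, as on the slide

    // Compute one CRC32 value per 512-byte chunk of the data.
    static long[] checksums(byte[] data) {
        int n = (data.length + CHUNK - 1) / CHUNK;
        long[] sums = new long[n];
        for (int i = 0; i < n; i++) {
            CRC32 crc = new CRC32();
            int from = i * CHUNK;
            int len = Math.min(CHUNK, data.length - from);
            crc.update(data, from, len);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1300];    // 3 chunks: 512 + 512 + 276 bytes
        long[] stored = checksums(data); // what the DataNode would store
        data[600] ^= 1;                  // simulate corruption in chunk 1
        long[] now = checksums(data);    // what a reader recomputes
        for (int i = 0; i < stored.length; i++)
            if (stored[i] != now[i])
                System.out.println("chunk " + i + " corrupted -> try another replica");
    }
}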
• Guarantees
• Checkpoints for incremental writes
• Checksums for records/chunks
• Unique IDs for records
• Stale replicas detected by version number