• “Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models”
• Hadoop → an ideal solution for analyzing and gaining insights from big data.
Ø The de facto big-data processing platform
Ø Storage: Hadoop Distributed File System (HDFS)
Ø Computation: MapReduce (MR)
• HDFS and MR distribute data among the nodes of a cluster and process it in parallel (a word-count sketch follows below).
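Since MapReduce is the computation model used throughout, here is a minimal word-count job against the Hadoop MapReduce Java API; it mirrors the standard Apache tutorial example, with input and output paths taken from the command line:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Mapper: emits (word, 1) for every word in its input split
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }
    // Reducer: sums the counts emitted for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            result.set(sum);
            context.write(key, result);
        }
    }
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}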
Realizing the benefit: reading 1 TB of data (taking 1 TB ≈ 1,000,000 MB)
• 1 machine, 4 I/O channels, each channel at 100 MB/s → 400 MB/s aggregate, so reading 1 TB takes about 2,500 s (~42 minutes)
• 10 machines, 4 I/O channels each at 100 MB/s → 4,000 MB/s aggregate, so the same 1 TB is read in about 250 s (~4.2 minutes)
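The arithmetic can be checked directly (a minimal sketch; the throughput figures are the slide's assumptions):

public class ReadTimeEstimate {
    public static void main(String[] args) {
        double terabyteMB = 1_000_000.0;  // 1 TB expressed in MB (decimal units)
        double channelMBps = 100.0;       // per-channel throughput from the slide
        int channels = 4;                 // I/O channels per machine
        for (int machines : new int[] {1, 10}) {
            double aggregate = machines * channels * channelMBps; // MB/s
            double seconds = terabyteMB / aggregate;
            System.out.printf("%2d machine(s): %.0f MB/s -> %.0f s (~%.1f min)%n",
                    machines, aggregate, seconds, seconds / 60);
        }
    }
}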
In 2002, Doug Cutting and Mike Cafarella started the Apache Nutch project, aiming to build a web search engine that could crawl and index websites.
In 2003, Google released a paper on the Google File System (GFS), an architecture for storing large datasets in a distributed environment.
In 2004, Nutch's developers built an open-source implementation, the Nutch Distributed File System (NDFS).
In 2004, Google introduced MapReduce to process large datasets in parallel.
In 2006, the distributed storage and processing code of Nutch was moved into an independent subproject called "Hadoop".
In 2006, Doug Cutting joined Yahoo to scale the Hadoop project to clusters of thousands of nodes.
In 2007, Yahoo started running Hadoop on a 1000-node cluster.
In 2008, Hadoop confirmed its success by becoming a top-level project at Apache.
In 2008, Hadoop beat supercomputers to become the fastest system to sort an entire terabyte of data.
In November 2008, Google reported that its MapReduce implementation sorted 1 terabyte in 68 seconds.
In April 2009, a team at Yahoo used Hadoop to sort 1 terabyte in 62 seconds, beating Google's MapReduce implementation.
In December 2011, Apache released Hadoop version 1.0.
In May 2012, the Hadoop 2.0.0-alpha version was released.
In December 2017, release 3.0.0 became available; the 3.3 line (3.3.4) followed in August 2022.
Hadoop Characteristics
HDFS Architecture
HDFS is a block-structured file system where each file is divided into blocks of a pre-determined size and stored across a cluster of one or several machines.
v Moving computation is cheaper than moving data.
Name Node:
v Master daemon: maintains and manages the Data Nodes.
v Records the metadata of all the files stored in the cluster, e.g. location of data, size of files, permissions, etc. (see the sketch after this list).
v Regularly receives a Heartbeat and block report from the live Data Nodes.
v Responsible for maintaining the replication factor.
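The metadata the NameNode serves can be inspected from a client through the standard FileSystem API; a minimal sketch, assuming a NameNode at hdfs://namenode:9000 and a hypothetical file path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);
        FileStatus status = fs.getFileStatus(new Path("/user/demo/Example.txt")); // hypothetical file
        // The NameNode answers this call from its in-memory metadata:
        // which blocks make up the file and which DataNodes hold each replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
        }
        fs.close();
    }
}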
HDFS Architecture
Data Node:
v Slave daemon.
v The actual data is stored on the Data Nodes.
v Runs on inexpensive commodity hardware.
v Data Nodes serve read and write requests from the clients.
v Sends a heartbeat to the Name Node periodically to report overall health; the default frequency is 3 seconds (see the configuration sketch below).
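The heartbeat period is governed by the dfs.heartbeat.interval property (in seconds); a minimal sketch of setting it programmatically, though on a real cluster it would normally be set in hdfs-site.xml on each DataNode:

import org.apache.hadoop.conf.Configuration;

public class HeartbeatConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // dfs.heartbeat.interval: DataNode-to-NameNode heartbeat period in seconds (default 3)
        conf.setLong("dfs.heartbeat.interval", 3L);
        System.out.println("heartbeat interval = "
                + conf.getLong("dfs.heartbeat.interval", 3L) + " s");
    }
}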
SECONDARY NAME NODE
• Copies the FsImage and Transaction Log from the NameNode to a temporary directory
• Merges the FsImage and Transaction Log into a new FsImage in the temporary directory
• Uploads the new FsImage to the NameNode
– The Transaction Log on the NameNode is then purged
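How often this checkpoint runs is controlled by two standard properties; a minimal sketch using Hadoop's default values:

import org.apache.hadoop.conf.Configuration;

public class CheckpointConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Checkpoint at least once per hour...
        conf.setLong("dfs.namenode.checkpoint.period", 3600L);   // seconds
        // ...or sooner, once this many transactions accumulate in the log
        conf.setLong("dfs.namenode.checkpoint.txns", 1_000_000L);
    }
}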
Blocks & Replicas
• Blocks are the smallest continuous location on your hard drive where data is stored; an HDFS file is split into blocks.
• The default size of each block is 128 MB in Apache Hadoop 2.x (64 MB in Apache Hadoop 1.x), and it is configurable.
• Example: Example.txt (514 MB) is split into four 128 MB blocks plus one 2 MB block.
• HDFS provides a reliable way to store huge data in a distributed environment: blocks are replicated to provide fault tolerance.
• The default replication factor is 3.
• The NameNode collects block reports and detects over- and under-replicated blocks.
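Both the block size and the replication factor are ordinary configuration properties (dfs.blocksize and dfs.replication); a minimal sketch, with a hypothetical file path for the per-file case:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024); // 128 MB block size
        conf.setInt("dfs.replication", 3);                 // replication factor of 3
        FileSystem fs = FileSystem.get(conf);
        // Replication can also be changed for an existing file (path is hypothetical):
        fs.setReplication(new Path("/user/demo/Example.txt"), (short) 3);
        fs.close();
    }
}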
Block Placement
• One replica on the local node, a second replica on a node in a remote rack, a third replica on a different node in that same remote rack; additional replicas are placed randomly.
• Data placement is exposed so that computation can be migrated to the data.
HDFS Read Architecture:
v The client reaches out to the NameNode asking for block metadata.
v The NameNode returns the list of DataNodes where each block (Block A and Block B) is stored.
v The client then connects to the DataNodes where the blocks are stored.
v The client starts reading data in parallel from the DataNodes (Block A from DataNode 1 and Block B from DataNode 3).
v Once the client has all the required file blocks, it combines them to form the file.
v While serving a client's read request, HDFS selects the replica closest to the client, which reduces read latency and bandwidth consumption.
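In the Java API all of these steps happen behind a single open() call; a minimal read sketch, assuming a NameNode at hdfs://namenode:9000 and a hypothetical file path:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);
        // open() asks the NameNode for block metadata, then streams the
        // block contents from the DataNodes holding the closest replicas.
        try (FSDataInputStream in = fs.open(new Path("/user/demo/Example.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}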
MUTATION ORDER AND LEASES
• A mutation is an operation that changes the contents or metadata of a chunk, such as an append or write operation.
• Each mutation is performed at all replicas.
• Leases are used to maintain a consistent mutation order.
• The master grants a chunk lease to one replica (the primary).
• The primary picks the serial order for all mutations to the chunk.
• All replicas follow this order (consistency); a toy sketch of the idea follows below.
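A toy sketch of the ordering idea, not the actual GFS implementation: the primary stamps each mutation with a serial number, and every replica applies mutations strictly in that order; the Primary and Replica classes here are purely illustrative.

import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class LeaseDemo {
    public static void main(String[] args) {
        List<Replica> replicas = List.of(new Replica(), new Replica(), new Replica());
        Primary primary = new Primary(replicas); // holds the chunk lease
        primary.mutate("append 'a'");
        primary.mutate("write offset=0 'b'");
    }
}

// The lease holder: the single place where mutation order is decided.
class Primary {
    private final AtomicLong nextSerial = new AtomicLong(0);
    private final List<Replica> replicas;
    Primary(List<Replica> replicas) { this.replicas = replicas; }
    void mutate(String op) {
        long serial = nextSerial.getAndIncrement();     // primary picks the order
        for (Replica r : replicas) r.apply(serial, op); // all replicas follow it
    }
}

// Each replica applies mutations strictly in serial order, so all replicas
// end up with identical contents.
class Replica {
    private long lastApplied = -1;
    synchronized void apply(long serial, String op) {
        if (serial != lastApplied + 1)
            throw new IllegalStateException("out-of-order mutation");
        lastApplied = serial;
        System.out.println("applied #" + serial + ": " + op);
    }
}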
DATA CORRECTNESS
• Checksums are used to validate data
– CRC32 is used
• File creation
– The client computes a checksum per 512 bytes
– The DataNode stores the checksums
• File access
– The client retrieves the data and checksum from the DataNode
– If validation fails, the client tries other replicas
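A minimal illustration of per-chunk CRC32 validation in plain Java, following the slide's 512-byte granularity; this sketches the idea rather than HDFS's internal checksum code:

import java.util.zip.CRC32;

public class ChunkChecksums {
    static final int CHUNK = 512; // bytes per checksum, as on the slide

    // Compute one CRC32 value per 512-byte chunk of the data.
    static long[] checksums(byte[] data) {
        int n = (data.length + CHUNK - 1) / CHUNK;
        long[] sums = new long[n];
        for (int i = 0; i < n; i++) {
            CRC32 crc = new CRC32();
            int from = i * CHUNK;
            int len = Math.min(CHUNK, data.length - from);
            crc.update(data, from, len);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1300];    // 3 chunks: 512 + 512 + 276 bytes
        long[] stored = checksums(data); // what the DataNode would store
        data[600] ^= 1;                  // simulate corruption in chunk 1
        long[] now = checksums(data);    // what a reader recomputes
        for (int i = 0; i < stored.length; i++)
            if (stored[i] != now[i])
                System.out.println("chunk " + i + " corrupted -> try another replica");
    }
}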
• Guarantees
• Checkpoints for incremental writes
• Checksums for records/chunks
• Unique IDs for records
• Stale replicas detected by version number