Lecture 4 - Hadoop HDFS
DSCI 551
Wensheng Wu
1
Hadoop
• A large-scale distributed & parallel batch-
processing infrastructure
• Large-scale:
– Handle a large amount of data and computation
• Distributed:
– Distribute data & computation over multiple machines
• Batch processing
– Process a series of jobs without human intervention
2
2-10 Gbps backbone between racks
3
In 2011 it was guesstimated that Google had 1M machines, http://bit.ly/Shh0RO
(Figure source: Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu)
4
select cid, title -- clause
from Course
where semester = 'Fa23'

Unit of data partitioning in different systems:
mysql – partition
mongodb – shard
hdfs – block
5
History
• 1st version released by Yahoo! in 2006
– named after a toy elephant belonging to Doug Cutting's son
6
Roadmap
• Hadoop architecture
– HDFS
– MapReduce
7
Key components
• HDFS (Hadoop distributed file system)
– Distributed data storage with high reliability
• MapReduce
– A parallel, distributed computational paradigm
– With a simplified programming model
8
HDFS
• Data are distributed among multiple data nodes
– Data nodes may be added on demand for more
storage space
Figure: file bar (256MB) is split into two 128MB blocks
(block 3 and block 5), each spanning 32K disk blocks of 4KB,
replicated across data nodes A, B, and C, and stored as local
files, e.g., /usr/john/blk_5_1.csv, /usr/mary/blk_3_1.csv
10
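The splitting above can be sketched in a few lines. This is a toy illustration (not HDFS code), using the slide's numbers: a 256MB file yields two 128MB HDFS blocks, and each HDFS block covers 32K disk blocks of 4KB.

```python
HDFS_BLOCK = 128 * 1024 * 1024   # 128 MB HDFS block size
DISK_BLOCK = 4 * 1024            # 4 KB disk block size

def split_into_blocks(file_size):
    """Return (offset, length) pairs for each HDFS block of a file."""
    return [(off, min(HDFS_BLOCK, file_size - off))
            for off in range(0, file_size, HDFS_BLOCK)]

blocks = split_into_blocks(256 * 1024 * 1024)   # file "bar" from the slide
print(len(blocks))                  # 2 blocks of 128 MB each
print(HDFS_BLOCK // DISK_BLOCK)     # each spans 32768 (32K) disk blocks
```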
HDFS has …
• A single NameNode, storing meta data:
– A hierarchy of directories and files (name space)
– Attributes of directories and files (in inodes), e.g.,
permission, access/modification times, etc.
– Mapping of files to blocks on data nodes
• A number of DataNodes:
– Storing contents/blocks of files
11
Compute nodes
• Data nodes are compute nodes too
• Advantage:
– Allows scheduling computation close to the data
12
HDFS also has …
• A SecondaryNameNode
– Maintaining checkpoints/images of NameNode
– For recovery
– not a failover node
• In a single-machine setup
– all nodes correspond to the same machine
13
Metadata in NameNode
• NameNode has an inode for each file and dir
14
Mapping information in NameNode
• E.g., file /user/aaron/foo consists of blocks 1,
2, and 4
15
Block size
• HDFS: 128 MB (version 2 & above)
– Much larger than disk block size (4KB)
– A: HDFS block = 128MB; B: disk block = 4KB
– One HDFS block spans 128MB/4KB = 32K disk blocks
– A 1GB file needs: A: 1GB/128MB = 8 HDFS blocks vs.
B: 1GB/4KB = 2^30/2^12 = 2^18 = 256K disk blocks
• Why larger size in HDFS?
– Reduce metadata required per file
– Fast streaming reads (since a larger amount of data
is laid out sequentially on disk)
16
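The arithmetic on this slide can be checked directly; a 1GB file needs only 8 entries of block metadata at 128MB per block, versus 256K entries at 4KB per block:

```python
GB = 2**30
HDFS_BLOCK = 128 * 2**20    # 128 MB
DISK_BLOCK = 4 * 2**10      # 4 KB

hdfs_blocks = GB // HDFS_BLOCK   # blocks for a 1GB file in HDFS
disk_blocks = GB // DISK_BLOCK   # 4KB blocks for the same file

print(hdfs_blocks)               # 8
print(disk_blocks)               # 262144 = 256K
print(disk_blocks // hdfs_blocks)  # 32768: metadata reduced 32K-fold
```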
HDFS
• HDFS exposes the concept of blocks to clients
17
Client and Namenode communication
• Source code (version 2.8.1)
– Definition of protocol
• ClientNamenodeProtocol.proto
• <hadoop-src-dir>\hadoop-hdfs-project\hadoop-hdfs-
client\src\main\proto
– Implementation
• ClientProtocol.java
• <hadoop-src-dir>\hadoop-hdfs-project\hadoop-hdfs-
client\src\main\java\org\apache\hadoop\hdfs\protocol
18
Key operations
• Reading:
– getBlockLocations()
• Writing
– create()
– append()
– addBlock()
19
getBlockLocations
Before reading, the client first obtains the locations of the file's blocks
20
getBlockLocations
• Input:
– File name
– Offset (to start reading)
– Length (how much data to be read)
• Output:
– Located blocks (data nodes + offsets)
21
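A toy sketch of what the NameNode does to answer getBlockLocations (names and the in-memory block map are hypothetical, not the real HDFS implementation): given a file, an offset, and a length, return the blocks overlapping that byte range together with the data nodes holding replicas.

```python
BLOCK_SIZE = 128 * 2**20   # 128 MB

# NameNode's view: file -> ordered list of (block_id, data nodes with replicas)
block_map = {
    "/user/aaron/foo": [(1, ["dnA", "dnB"]),
                        (2, ["dnB", "dnC"]),
                        (4, ["dnA", "dnC"])],
}

def get_block_locations(path, offset, length):
    """Return blocks overlapping [offset, offset+length): each with its
    block id, its offset in the entire file, and its replica locations."""
    located = []
    for i, (block_id, nodes) in enumerate(block_map[path]):
        block_off = i * BLOCK_SIZE
        if block_off < offset + length and offset < block_off + BLOCK_SIZE:
            located.append({"block": block_id,
                            "offset": block_off,
                            "locations": nodes})
    return located

# Reading the first 200MB touches only the first two blocks.
locs = get_block_locations("/user/aaron/foo", 0, 200 * 2**20)
```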
22
../java/…hdfs/protocol/LocatedBlocks.java
(A located block records: the block, the offset of this block
in the entire file, and the data nodes with replicas of the block)
23
Create/append a file
24
Creating a file
• Needs to specify:
– Path to the file to be created, e.g., /foo/bar
– Permission mask
– Client name
– Flag on whether to overwrite (the entire file!) if it
already exists
– How many replicas
– Block size
25
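The parameters listed above can be collected into a sketch of a create request (field names and defaults are illustrative, not the real HDFS API; 3 replicas and 128MB blocks are typical Hadoop 2.x defaults):

```python
from dataclasses import dataclass

@dataclass
class CreateRequest:
    src: str                        # path of file to create, e.g. /foo/bar
    masked: int = 0o644             # permission mask
    client_name: str = "client-1"   # identifies the writer
    overwrite: bool = False         # replace the ENTIRE file if it exists
    replication: int = 3            # number of replicas per block
    block_size: int = 128 * 2**20   # 128 MB

req = CreateRequest(src="/foo/bar")
```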
A hierarchy of files and directories
26
Allocating new blocks for writing
Asking NameNode to allocate a new block
+ data nodes holding its replicas
27
28
Client and Datanode communication
• Source code (version 2.8.1)
– Definition of protocol
• datatransfer.proto
• Located at: <hadoop-src-dir>\hadoop-hdfs-
project\hadoop-hdfs-client\src\main\proto
– Implementation
• DataTransferProtocol.java
• <hadoop-src-dir>\hadoop-hdfs-project\hadoop-hdfs-
client\src\main\java\org\apache\hadoop\hdfs\protocol
\datatransfer
29
Operations
• readBlock()
• writeBlock()
30
Reading a file
1. Client first contacts the NameNode, which
informs the client of the closest DataNodes
storing the blocks of the file
– This is done by making which RPC call?
31
datatransfer.proto
Block, offset, length
32
DataTransferProtocol.java
33
Writing a file
• Blocks are written one at a time
– In a pipelined fashion through the data nodes:
client → A with targets [B, C]; A → B with targets [C];
B → C with targets []
35
(writeBlock arguments: block to be written + rest of data nodes)
36
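The pipeline above can be simulated in a few lines. This is a toy sketch (not HDFS code): the client sends the block to the first data node along with the list of remaining targets; each node stores its local replica, then forwards the data and the shortened target list downstream.

```python
stored = {}   # data node name -> block data it holds

def write_block(node, downstream, data):
    """Store a replica on `node`, then forward to the rest of the pipeline."""
    stored[node] = data
    if downstream:
        write_block(downstream[0], downstream[1:], data)

# Client contacts A with targets [B, C]; A forwards to B; B forwards to C.
write_block("A", ["B", "C"], b"block X")
```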
Data pipelining
• Consider a block X to be written to DataNode
A, B, and C (replication factor = 3)
38
Data pipelining for writing a block
(Figure: control messages set up the pipeline; data packets are
streamed from a queue (p1, …); acknowledgment messages flow back)
39
Acknowledgement
• Client does not wait for the acknowledgement of
the previous packet before sending the next one
• Advantage?
40
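A toy sketch of this non-blocking scheme (illustrative, not HDFS code): packets are sent back-to-back and tracked in an outstanding queue; acks retire them in sending order later. The advantage is that the pipeline stays full instead of paying one round trip per packet.

```python
from collections import deque

outstanding = deque()   # packets sent but not yet acknowledged
peak = 0                # most packets in flight at once

def send(packet):
    global peak
    outstanding.append(packet)          # sent immediately; no wait for ack
    peak = max(peak, len(outstanding))

def on_ack(packet):
    assert outstanding[0] == packet     # acks come back in sending order
    outstanding.popleft()

for p in ["p1", "p2", "p3"]:   # all three sent before any ack arrives
    send(p)
for p in ["p1", "p2", "p3"]:   # acks then retire them in order
    on_ack(p)
```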
Roadmap
• Hadoop architecture
– HDFS
– MapReduce
41
Hadoop & HDFS installation
• Refer to the installation note posted on course
web site on how to install Hadoop and setup
HDFS
42
Working with hdfs
• Setting up home directory in hdfs
– hdfs dfs -mkdir /user
– hdfs dfs -mkdir /user/ec2-user
(ec2-user is user name of your EC2 account)
44
Working with hdfs
• Copy data from hdfs
– hdfs dfs -get /user/ec2-user/input input1
– If local directory input1 does not exist, it is created
– If it already exists, input is copied into it as a subdirectory
45
Working with hdfs
• Remove files
– hdfs dfs -rm /user/ec2-user/input/core-site.xml
– hdfs dfs -rm /user/ec2-user/input/*
• Remove directory
– hdfs dfs -rmdir /user/ec2-user/input
– Directory "input" needs to be empty first
46
Where is hdfs located?
• /tmp/hadoop-ec2-user/dfs/
47
References
• K. Shvachko, H. Kuang, S. Radia, and R. Chansler,
"The Hadoop Distributed File System," in Proc. IEEE 26th
Symposium on Mass Storage Systems and Technologies (MSST),
2010, pp. 1-10.
48