Lecture 4 - Hadoop HDFS

Hadoop & HDFS

DSCI 551
Wensheng Wu

1
Hadoop
• A large-scale, distributed & parallel batch-processing infrastructure

• Large-scale:
– Handle a large amount of data and computation
• Distributed:
– Distribute data & computation over multiple machines
• Batch processing
– Process a series of jobs without human intervention
2
[Figure: typical cluster architecture. Each rack contains 16-64 nodes (each with CPU, memory, and disk); a rack switch provides 1 Gbps between any pair of nodes in a rack, and a 2-10 Gbps backbone connects the racks.]
3
In 2011 it was guesstimated that Google had 1M machines, http://bit.ly/Shh0RO
(Source: Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu, 1/7/2014)

4
select cid, title    -- find cid and title of Courses offered in Fa23
from Course
where semester = 'Fa23'

create table mktbl   -- then put data into the table

The same partitioning idea goes by different names:
– MySQL: partition
– MongoDB: shard
– HDFS: block
5
History
• 1st version released by Yahoo! in 2006
– named after a toy elephant (belonging to the son of creator Doug Cutting)

• Originated from Google's work
– GFS: Google File System (2003)
– MapReduce (2004)

6
Roadmap
• Hadoop architecture
– HDFS
– MapReduce

• Installing Hadoop & HDFS

7
Key components
• HDFS (Hadoop distributed file system)
– Distributed data storage with high reliability

• MapReduce
– A parallel, distributed computational paradigm
– With a simplified programming model

8
HDFS
• Data are distributed among multiple data nodes
– Data nodes may be added on demand for more
storage space

• Data are replicated to cope with node failure
– Typical replication factor: 2 or 3 (the HDFS default is 3)

• Read requests can go to any replica/copy
– Removing the bottleneck of a single file server
9
HDFS architecture

[Figure: a 256MB file bar is split into two HDFS blocks of 128MB each (blocks 3 and 5); each 128MB block spans 32K disk blocks of 4KB. Replicas are spread over data nodes A, B, and C; e.g., /usr/john/blk_5_1.csv, /usr/mary/blk_3_1.csv]
10
HDFS has …
• A single NameNode, storing metadata:
– A hierarchy of directories and files (name space)
– Attributes of directories and files (in inodes), e.g.,
permission, access/modification times, etc.
– Mapping of files to blocks on data nodes

• A number of DataNodes:
– Storing contents/blocks of files

11
Compute nodes
• Data nodes are compute nodes too

• Advantage:
– Allows scheduling computation close to the data

12
HDFS also has …
• A SecondaryNameNode
– Maintaining checkpoints/images of NameNode
– For recovery
– not a failover node

• In a single-machine setup
– all nodes correspond to the same machine

13
Metadata in NameNode
• NameNode has an inode for each file and dir

• Records attributes of each file/dir, such as
– Permission
– Access time
– Modification time

• Also records the mapping of files to blocks

14
Mapping information in NameNode
• E.g., file /user/aaron/foo consists of blocks 1,
2, and 4

• Block 1 is stored on data nodes 1 and 3


• Block 2 is stored on data nodes 1 and 2
• …

15
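A minimal conceptual sketch of this two-level mapping (plain Java collections, for illustration only; the actual NameNode keeps this in its own inode and block-map structures):

import java.util.List;
import java.util.Map;

public class BlockMapSketch {
    public static void main(String[] args) {
        // file -> ordered list of block IDs (as on the previous slide)
        Map<String, List<Integer>> fileToBlocks =
            Map.of("/user/aaron/foo", List.of(1, 2, 4));

        // block ID -> data nodes holding a replica of that block
        Map<Integer, List<String>> blockToNodes = Map.of(
            1, List.of("datanode1", "datanode3"),
            2, List.of("datanode1", "datanode2"));

        // To read /user/aaron/foo: look up its blocks, then their locations
        for (int blk : fileToBlocks.get("/user/aaron/foo")) {
            System.out.println("block " + blk + " -> "
                + blockToNodes.getOrDefault(blk, List.of("?")));
        }
    }
}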
Block size
• HDFS: 128 MB (version 2 & above)
– Much larger than the disk block size (4KB)
– One HDFS block = 128MB/4KB = 32K disk blocks
– For a 1GB file: 1GB/128MB = 8 HDFS blocks, vs. 1GB/4KB = 2^30/2^12 = 2^18 = 256K disk blocks
• Why larger size in HDFS?
– Reduces the metadata required per file
– Fast streaming reads (since a larger amount of data is sequentially laid out on disk)
16
HDFS
• HDFS exposes the concept of blocks to the client

• Reading and writing are done in two phases


– Phase 1: client asks NameNode for block locations
• By calling (sending request) getBlockLocations(), if reading
• Or calling addBlock() for allocating new blocks (one at a
time), if writing (need to call create()/append() first)
– Phase 2: client talks to DataNode for data transfer
• Reading blocks via readBlock() or writing blocks via
writeBlock()

17
Client and Namenode communication
• Source code (version 2.8.1)
– Definition of protocol
• ClientNamenodeProtocol.proto
• <hadoop-src-dir>\hadoop-hdfs-project\hadoop-hdfs-
client\src\main\proto
– Implementation
• ClientProtocol.java
• <hadoop-src-dir>\hadoop-hdfs-project\hadoop-hdfs-
client\src\main\java\org\apache\hadoop\hdfs\protocol

18
Key operations
• Reading:
– getBlockLocations()

• Writing
– create()
– append()
– addBlock()

19
getBlockLocations
Before reading, the client needs to first obtain the locations of the blocks

20
getBlockLocations
• Input:
– File name
– Offset (to start reading)
– Length (how much data to be read)

• Output:
– Located blocks (data nodes + offsets)

21
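For reference, the corresponding declaration in ClientProtocol.java (version 2.8.1) looks essentially like this (abridged excerpt; annotations and Javadoc omitted):

// org.apache.hadoop.hdfs.protocol.ClientProtocol (abridged excerpt)
import java.io.IOException;

public interface ClientProtocol {
    // src:    file name (full path)
    // offset: where in the file to start reading
    // length: how much data to read
    // Returns the located blocks: each block, its offset in the file,
    // and the data nodes holding replicas of it
    LocatedBlocks getBlockLocations(String src, long offset, long length)
        throws IOException;
}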
[Screenshot: the getBlockLocations() definition in the source code]

22
../java/…hdfs/protocol/LocatedBlocks.java

[Screenshot: each located block records the block itself, the offset of this block in the entire file, and the data nodes with replicas of the block]
23
Create/append a file

Calling create() or append() opens the file for writing (create/append).
24
Creating a file
• Needs to specify:
– Path to the file to be created, e.g., /foo/bar
– Permission mask
– Client name
– Flag on whether to overwrite (entire file!) if
already exists
– How many replicas
– Block size

25
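The matching create() declaration in ClientProtocol.java (2.8.1), abridged to the parameters listed above (the real method takes a few further arguments, e.g., supported crypto protocol versions):

// org.apache.hadoop.hdfs.protocol.ClientProtocol (abridged excerpt)
HdfsFileStatus create(
    String src,                        // path to the file, e.g., /foo/bar
    FsPermission masked,               // permission mask
    String clientName,                 // client name
    EnumSetWritable<CreateFlag> flag,  // e.g., CREATE, OVERWRITE (overwrites the entire file!)
    boolean createParent,              // create missing parent directories?
    short replication,                 // how many replicas
    long blockSize)                    // block size
    throws IOException;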
[Figure: the name space is a hierarchy of files and directories; creating a new file adds an entry under its parent directory]
26
Allocating new blocks for writing
The client asks the NameNode (via addBlock()) to allocate a new block, plus the data nodes that will hold its replicas

27
28
Client and Datanode communication
• Source code (version 2.8.1)
– Definition of protocol
• datatransfer.proto
• Located at: <hadoop-src-dir>\hadoop-hdfs-
project\hadoop-hdfs-client\src\main\proto
– Implementation
• DataTransferProtocol.java
• <hadoop-src-dir>\hadoop-hdfs-project\hadoop-hdfs-
client\src\main\java\org\apache\hadoop\hdfs\protocol
\datatransfer

29
Operations
• readBlock()

• writeBlock()

• copyBlock() – for load balancing

• replaceBlock() – for load balancing
– Moves a block from one DataNode to another

30
Reading a file
1. Client first contacts the NameNode, which informs the client of the closest DataNodes storing blocks of the file
– This is done by making which RPC call?

2. Client contacts the DataNodes directly for reading the blocks
– Calling readBlock()

31
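From an application's point of view, both phases are hidden behind the public FileSystem API. A minimal read sketch (standard Hadoop client API; the path is made up for illustration):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();  // reads core-site.xml etc.
        FileSystem fs = FileSystem.get(conf);      // connects to the NameNode

        // open() triggers getBlockLocations() under the hood; read() then
        // streams the blocks from the DataNodes (readBlock()).
        try (FSDataInputStream in = fs.open(new Path("/user/aaron/foo"))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) > 0) {
                System.out.write(buf, 0, n);
            }
            System.out.flush();
        }
    }
}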
datatransfer.proto

[Screenshot: the readBlock request message carries the block, offset, and length]

32
DataTransferProtocol.java

[Screenshot: the readBlock() method likewise takes the block, offset, and length]
33
Writing a file
• Blocks are written one at a time
– In a pipelined fashion through the data nodes

• For each block:
– Client asks NameNode to select DataNodes for holding its replicas (using which RPC call?)
• e.g., DataNodes 1 and 3 for the first block of /user/aaron/foo
– It then forms a pipeline to send the block
34
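Again, the public API hides the details. A minimal write sketch (standard Hadoop client API; path and contents made up for illustration). create() allocates the file at the NameNode; the client library then calls addBlock() per block and pipelines the data to the DataNodes:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // create() issues the NameNode create() RPC; written bytes are
        // packetized and pipelined through the DataNodes (writeBlock()).
        try (FSDataOutputStream out = fs.create(new Path("/user/aaron/foo"))) {
            out.write("hello, hdfs\n".getBytes(StandardCharsets.UTF_8));
        }  // close() flushes the remaining packets and completes the file
    }
}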
Writing a file

[Figure: pipelined write. The client sends the block to DataNode A together with the remaining targets [B, C]; A forwards it to B with targets [C]; B forwards it to C with an empty target list []]
35
[Screenshot: the writeBlock request identifies the block to be written, the current data node in the pipeline, and the rest of the data nodes (the remaining targets)]
36
Data pipelining
• Consider a block X to be written to DataNodes A, B, and C (replication factor = 3)

1. X is broken down into packets (typically 64KB/packet)
– 128MB/64KB = 2^27/2^16 = 2^11 = 2048 packets per block
2. Client sends each packet to DataNode A
3. A forwards it to B, and B forwards it to C
37
Acknowledgement
• Client maintains an ack (acknowledgment) queue

• A packet is removed from the ack queue once it has been received by all data nodes

• When all packets have been written, the client notifies the NameNode
– NameNode will update the metadata for the file
– Reflecting that a new block has been added to the file
38
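A conceptual sketch of the ack queue (hypothetical helper class, not Hadoop's actual DataStreamer code): packets are enqueued when sent and dequeued only when the last DataNode in the pipeline has acknowledged them:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class AckQueueSketch {
    // packets sent but not yet acknowledged by the whole pipeline
    private final BlockingQueue<Integer> ackQueue = new LinkedBlockingQueue<>();

    // sender thread: send the packet, remember it, and keep going (no waiting)
    void onPacketSent(int seqno) throws InterruptedException {
        ackQueue.put(seqno);
    }

    // responder thread: an ack arrived from the last DataNode in the pipeline
    void onAckReceived(int seqno) throws InterruptedException {
        int head = ackQueue.take();  // acks arrive in packet order
        if (head != seqno) {
            throw new IllegalStateException("out-of-order ack: " + seqno);
        }
        // packet seqno is now on all replicas and can be forgotten
    }
}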
Data pipelining for writing a block

[Figure: control messages set up the pipeline; data packets flow down the pipeline while acknowledgment messages flow back; the client holds unacknowledged packets in its ack queue (queue: p1, …)]
39
Acknowledgement
• Client does not wait for the acknowledgement of the previous packet before sending the next one

• Is this synchronous or asynchronous?
– Asynchronous (cf. Ajax calls in web programming)

• Advantage?
40
Roadmap
• Hadoop architecture
– HDFS
– MapReduce

• Installing Hadoop & HDFS

41
Hadoop & HDFS installation
• Refer to the installation note posted on the course web site on how to install Hadoop and set up HDFS

42
Working with hdfs
• Setting up home directory in hdfs
– hdfs dfs -mkdir /user
– hdfs dfs -mkdir /user/ec2-user
(ec2-user is the user name of your EC2 account)

• Create a directory "input" under home


– hdfs dfs -mkdir /user/ec2-user/input
– Or simply:
– hdfs dfs -mkdir input
This will automatically create the "input" directory under /user/ec2-user
43
Working with hdfs
• Copy data from local file system
– hdfs dfs -put etc/hadoop/*.xml /user/ec2-user/input
– Ignore a warning if you see one like: "WARN hdfs.DataStreamer: Caught exception…"

• List the content of directory


– hdfs dfs -ls /user/ec2-user/input

44
Working with hdfs
• Copy data from hdfs
– hdfs dfs -get /user/ec2-user/input input1
– If local directory input1 does not exist, it will be created
– If it already exists, a copy of input is placed inside it

• Examine the content of file in hdfs


– hdfs dfs -cat /user/ec2-user/input/core-site.xml

45
Working with hdfs
• Remove files
– hdfs dfs -rm /user/ec2-user/input/core-site.xml
– hdfs dfs -rm /user/ec2-user/input/*

• Remove directory
– hdfs dfs -rmdir /user/ec2-user/input
– Directory "input" needs to be empty first

46
Where is hdfs located?
• /tmp/hadoop-ec2-user/dfs/

47
References
• K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop Distributed File System," in Proc. IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), 2010, pp. 1-10.

• HDFS File System Shell Guide:
– https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html

48
