BDS Session 5
Janardhanan PS
Professor
[email protected]
Topics for today
2
Hadoop - Data and Compute layers
• A data storage layer
• A Distributed File System - HDFS
• A data processing layer
• MapReduce programming
3
Hadoop 2 - Architecture
• Master-slave architecture for overall compute and data management
• Slaves implement peer-to-peer communication
Note: The YARN Resource Manager also uses application-level App Master
processes on slave nodes for application-specific resource management
4
What changed from Hadoop 1 to Hadoop 2
• Hadoop 1:
MapReduce was coupled with resource management
• Hadoop 2 brought in YARN as a separate resource management
capability, so MapReduce is now only about data processing.
• Hadoop 1: the single Master node running the NameNode is a Single Point of Failure (SPOF)
• Hadoop 2 introduced active-passive and other HA
configurations, besides secondary NameNodes
• Hadoop 1: only MapReduce programs could be run
• In Hadoop 2, non-MapReduce programs can be run by YARN
on slave nodes (since resource management is decoupled from
MapReduce), and there is also support for non-HDFS storage,
e.g. Amazon S3.
5
Hadoop Distributions
6
Topics for today
7
HDFS Features (1)
• A DFS stores data over multiple nodes in a cluster and allows multi-user access
✓ Gives the user the feeling that the data is on a single machine
✓ HDFS is a Java-based DFS that sits on top of the native FS
✓ Enables storage of very large files across the nodes of a Hadoop cluster
✓ Data is split into large blocks: 128 MB (default)
• Scale through parallel data processing
✓ 1 node with 1 TB of data and an aggregate IO bandwidth of 400 MB/s across 4 IO channels needs about 43 min to read the full terabyte
✓ 10 nodes, each holding a 100 GB partition of that 1 TB, can read the data in parallel in about 4.3 min
8
HDFS Features (2)
9
HDFS Features (3)
• Variety and Volume of Data: huge data, i.e. terabytes and petabytes, and
different kinds of data - structured, semi-structured or unstructured.
10
HDFS Features (4)
• Data Locality: move the processing unit to the data rather than the data to
the processing unit. The computation is brought to the DataNodes where the
data resides; you are not moving the data, you are bringing the program or
processing part to the data.
11
HDFS Architecture - Master node
12
HDFS Architecture - Slave nodes
• Multiple slave nodes with one
DataNode per slave
• Serves block read/write requests from clients
• Serves create/delete/replicate requests from the
NameNode
• DataNodes interact with each other for pipelined
block writes and replication.
13
Functions of a NameNode
• Maintains namespace in HDFS with 2 files
• FsImage: contains the mapping of files to blocks, the directory hierarchy, and file properties/permissions
• EditLog: Transaction log of changes to metadata in FsImage
• Does not store any data - only meta-data about files
• Runs on Master node while DataNodes run on Slave nodes
• HA can be configured (discussed later)
• Records each change that takes place to the meta-data. e.g. if a file is deleted in HDFS, the NameNode
will immediately record this in the EditLog.
• Receives periodic Heartbeat and a block report from all the DataNodes in the cluster to ensure that the
DataNodes are live.
• Ensures the replication factor is maintained across DataNode failures
• In case of a DataNode failure, the NameNode chooses new DataNodes for the new
replicas, balances disk usage and manages the communication traffic to the DataNodes
14
Where are fsimage and edit logs ?
15
Namenode - What happens on start-up
1. Enters into safe mode
✓ Checks the status of the DataNodes on the slaves
• Does not allow any block replication in this mode
• Gets heartbeats and block reports from the DataNodes
• Checks that a configurable majority of blocks meets the minimum replication factor
✓ Updates meta-data (this is also done at checkpoint time)
• Reads FsImage and EditLog from disk into memory
• Applies all transactions from the EditLog to the in-memory version of FsImage
• Flushes out new version of FsImage on disk
• Keeps latest FsImage in memory for client requests
• Truncates the old EditLog once its changes have been applied to the new FsImage
2. Exits safe mode
3. Continues with further replications needed and client requests
16
Functions of a DataNode (1)
17
Functions of a DataNode (2)
• DataNode continuously sends a heartbeat to the
NameNode (default: every 3 seconds)
✓ To ensure connectivity with the NameNode
• If no heartbeat message is received from a DataNode, the
NameNode re-replicates that DataNode's blocks within the
cluster and removes the DataNode from its meta-data
records
• DataNodes also send a BlockReport on start-up and
periodically, containing the list of blocks they store (see the
configuration sketch below)
18
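The heartbeat and block report intervals are governed by HDFS configuration properties, normally set in hdfs-site.xml on the cluster nodes. A minimal sketch, only to illustrate the property names and their defaults via the Hadoop Configuration API (the values shown are the standard defaults):

import org.apache.hadoop.conf.Configuration;

public class HeartbeatSettingsSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Heartbeat interval from DataNode to NameNode, in seconds (default 3)
        conf.setIfUnset("dfs.heartbeat.interval", "3");
        // Full block report interval, in milliseconds (default 21600000 = 6 hours)
        conf.setIfUnset("dfs.blockreport.intervalMsec", "21600000");
        System.out.println("heartbeat interval = " + conf.get("dfs.heartbeat.interval") + " s");
        System.out.println("block report interval = " + conf.get("dfs.blockreport.intervalMsec") + " ms");
    }
}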
Topics for today
19
Hadoop 2: Introduction of Secondary NameNode
20
HA configuration of NameNode
• An Active-Passive configuration can also be set up
with a standby NameNode
• Can use a Quorum Journal Manager (QJM) with
JournalNodes, or NFS, to maintain shared state
• DataNodes send heartbeats and block updates to both
NameNodes
• Writes to the JournalNodes happen only via the Active
NameNode - this avoids the "split brain" scenario under
network partitions
• The Standby NameNode reads from the JournalNodes to stay
up to date on state, and also receives the latest updates from
the DataNodes
• A ZooKeeper session (with an ensemble of 3 or 5 nodes) may
be used for failure detection and election of a new Active
NameNode (see the client configuration sketch below)
21
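A minimal sketch of how an HDFS client can be pointed at an HA nameservice, assuming an illustrative nameservice called mycluster with NameNodes nn1 and nn2 on hypothetical hosts (in practice these properties live in hdfs-site.xml / core-site.xml rather than in code):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HaClientConfigSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Logical nameservice instead of a single NameNode host
        conf.set("fs.defaultFS", "hdfs://mycluster");
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");
        // Proxy provider that fails over between nn1 and nn2
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        FileSystem fs = FileSystem.get(conf);   // resolves to whichever NameNode is active
        System.out.println(fs.getUri());
        fs.close();
    }
}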
Other robustness mechanisms
22
Communication
23
Topics for today
24
Blocks in HDFS
• HDFS stores each file as blocks, which are scattered throughout the Apache Hadoop cluster.
• The default size of each block is 128 MB in Apache Hadoop 2.x (64 MB in Apache Hadoop
1.x), and it can be configured as per your requirement.
• A file need not occupy an exact multiple of the configured block size (128 MB, 256 MB,
etc.); the last block uses only as much space as the remaining data requires.
• HDFS is meant for TB/PB-sized files, and a small block size would create too much
meta-data on the NameNode.
• Much larger blocks would further reduce the per-block "indexing" overhead, but would
impact load balancing across nodes, etc. (see the sketch below for listing a file's blocks).
25
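A minimal Java sketch that lists the blocks of a file and the DataNodes holding each replica, assuming the example file path used later in these slides exists on a cluster reachable at localhost:9000:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocksSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/javareadwriteexample/read_write_hdfs_example.txt");
        FileStatus status = fs.getFileStatus(path);
        // One BlockLocation per block of the file
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        System.out.println("Block size: " + status.getBlockSize() + " bytes");
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}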
How to see blocks of a file in HDFS - fsck
26
HDFS Blocksize Vs Input Split
• e.g. $HADOOP_HOME/data/dfs/data/hadoop-${user.name}/current
28
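For the block size vs. input split contrast, a minimal, illustrative MapReduce job setup (not from the slides) showing that the split size fed to map tasks can be tuned independently of the 128 MB storage block size; the input path and the size values are assumptions:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitSizeSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-size-demo");
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/javareadwriteexample"));
        // Blocks are a storage concept (128 MB by default); splits are a processing
        // concept - each split is handed to one map task. The split size can be tuned:
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB minimum
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);  // 256 MB maximum
        System.out.println("Configured job: " + job.getJobName());
    }
}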
Replica Placement Strategy - with Rack awareness
29
Why rack awareness ?
30
Topics for today
31
HDFS data writes
The following protocol is followed whenever data is written into
HDFS:
• The HDFS client contacts the NameNode with a write request for the two blocks,
say, Block A and Block B.
• The NameNode grants permission and returns to the client the IP addresses of the
DataNodes to which the blocks should be copied
• The selection of DataNodes is randomized, but factors in availability, replication
factor (RF), and rack awareness
• For 3 copies, 3 distinct DataNodes are needed, if possible, for each block.
◦ For Block A, list A = {DN1, DN4, DN6}
◦ For Block B, list B = {DN3, DN7, DN9}
• Each block is copied to three different DataNodes to keep the
replication factor consistent throughout the cluster.
• Now the whole data copy process will happen in three stages:
1. Set up of Pipeline
2. Data streaming and replication
3. Shutdown of Pipeline (Acknowledgement stage)
32
HDFS Write: Step 1. Setup pipeline
33
HDFS Write: Step 1. Setup pipeline for a block
34
HDFS Write: Step 2. Data streaming
35
HDFS Write: Step 3. Shutdown pipeline / ack
Acknowledgements flow back through the pipeline, from DN 6 to DN 4 and then to DN 1,
confirming that the block has been written successfully.
36
Multi-block writes
37
Sample write code
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteFileToHDFS {
    public static void main(String[] args) throws IOException {
        WriteFileToHDFS.writeFileToHDFS();
    }
    public static void writeFileToHDFS() throws IOException {
        // Point the client at the NameNode
        Configuration configuration = new Configuration();
        configuration.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fileSystem = FileSystem.get(configuration);
        String fileName = "read_write_hdfs_example.txt";
        Path hdfsWritePath = new Path("/javareadwriteexample/" + fileName);
        // create() returns an FSDataOutputStream; true = overwrite if the file exists
        FSDataOutputStream fsDataOutputStream = fileSystem.create(hdfsWritePath, true);
        BufferedWriter bufferedWriter = new BufferedWriter(
                new OutputStreamWriter(fsDataOutputStream, StandardCharsets.UTF_8));
        bufferedWriter.write("Java API to write data in HDFS");
        bufferedWriter.newLine();
        bufferedWriter.close();   // flushes remaining packets to the DataNode pipeline
        fileSystem.close();
    }
}
38
HDFS Create / Write - Call sequence in code
39
HDFS Create / Write - Call sequence in code
3) The first DataNode stores the packet and forwards it to
the second DataNode, which then transfers it to the
third DataNode.
4) FSDataOutputStream also manages an "ack queue" of
packets that are waiting for acknowledgement by the
DataNodes.
5) A packet is removed from the queue only when it has been
acknowledged by all the DataNodes in the pipeline.
6) When the client finishes writing to the file, it calls
close() on the stream.
7) This flushes all the remaining packets to the DataNode
pipeline and waits for the acknowledgements, before
contacting the NameNode to signal that the file write is
complete.
40
HDFS Read
1. Client contacts the NameNode asking for the block metadata
of a file
2. NameNode returns the list of DataNodes where each block is
stored
3. Client connects to the DataNodes where the blocks are stored
4. The client starts reading data in parallel from the DNs (e.g.
Block A from DN1, Block B from DN3)
5. Once the client gets all the required file blocks, it combines
these blocks to form the file.
41
Sample read code
import java.io.IOException;
import org.apache.commons.io.IOUtils;   // Apache Commons IO provides IOUtils.toString
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFileFromHDFS {
    public static void main(String[] args) throws IOException {
        ReadFileFromHDFS.readFileFromHDFS();
    }
    public static void readFileFromHDFS() throws IOException {
        // Point the client at the NameNode
        Configuration configuration = new Configuration();
        configuration.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fileSystem = FileSystem.get(configuration);
        String fileName = "read_write_hdfs_example.txt";
        Path hdfsReadPath = new Path("/javareadwriteexample/" + fileName);
        // open() returns an FSDataInputStream - classical input stream usage from here on
        FSDataInputStream inputStream = fileSystem.open(hdfsReadPath);
        String out = IOUtils.toString(inputStream, "UTF-8");
        System.out.println(out);
        inputStream.close();
        fileSystem.close();
    }
}
42
HDFS Read - Call sequence in code
1) Client opens the file that it wishes to read by calling open() on
FileSystem
   1) FileSystem communicates with the NameNode to get the locations of the
   data blocks.
   2) NameNode returns the addresses of the DataNodes on which the blocks
   are stored.
   3) FileSystem returns an FSDataInputStream to the client to read from the
   file.
2) Client then calls read() on the stream, which has the addresses of the
DataNodes for the first few blocks of the file, and connects to the closest
DataNode for the first block in the file
   1) Client calls read() repeatedly to stream the data from the DataNode
   2) When the end of a block is reached, the stream closes the connection with
   that DataNode.
   3) The stream repeats these steps to find the best DataNode for the
   next blocks.
3) When the client completes the reading of the file, it calls close() on the
stream to close the connection.
43
Read optimizations
44
Security
45
Topics for Session 5
46
File formats
47
File formats - text based
48
File formats - sequence files
• A flat file consisting of binary key/value pairs
• Extensively used in MapReduce as input/output formats as well as internal temporary
outputs of maps
• 3 types
1. Uncompressed key/value records.
2. Record compressed key/value records - only 'values' are compressed here.
3. Block compressed key/value records - both keys and values are collected in
configurable 'blocks' and compressed. Sync markers added for random access and
splitting.
• Header includes information about :
• key, value class
• whether compression is enabled and whether at block level
• compressor codec used
ref: https://cwiki.apache.org/confluence/display/HADOOP2/SequenceFile
49
Example: Storing images in HDFS
• Images are written into a SequenceFile as key/value pairs, maybe attach some meta-data
• Files are stored as binary format (e.g. BytesWritable values); one can iterate over the files, as well as use meta-data to retrieve specific images

public class ImageToSeq {
    // imports from org.apache.hadoop.conf, org.apache.hadoop.fs and org.apache.hadoop.io omitted for brevity
    public static void main(String[] args) throws Exception {
        Configuration confHadoop = new Configuration();
        confHadoop.addResource(new Path("/hadoop/projects/hadoop-1.0.4/conf/hdfs-site.xml"));
        FileSystem fs = FileSystem.get(confHadoop);
        Path inPath = new Path("/mapin/1.png");
        Path outPath = new Path("/mapin/images.seq");   // illustrative output path for the SequenceFile
        Text key = new Text();
        BytesWritable value = new BytesWritable();
        SequenceFile.Writer writer = null;
        try {
            FSDataInputStream in = fs.open(inPath);
            byte[] buffer = new byte[in.available()];
            in.read(buffer);
            writer = SequenceFile.createWriter(fs, confHadoop, outPath, key.getClass(), value.getClass());
            writer.append(new Text(inPath.getName()), new BytesWritable(buffer));   // image stored as a binary value
        } catch (Exception e) {
            System.out.println("Exception MESSAGES = " + e.getMessage());
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}
Ref: https://stackoverflow.com/questions/16546040/store-images-videos-into-hadoop-hdfs
50
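As a follow-up, a minimal sketch (not part of the original example) of reading the stored images back from the SequenceFile; the path /mapin/images.seq matches the illustrative output path used above:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SeqToImage {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);
        Path seqPath = new Path("/mapin/images.seq");   // illustrative path, matches the writer sketch
        SequenceFile.Reader reader = null;
        try {
            reader = new SequenceFile.Reader(fs, seqPath, conf);
            Text key = new Text();
            BytesWritable value = new BytesWritable();
            // Iterate over all key/value records in the SequenceFile
            while (reader.next(key, value)) {
                System.out.println(key + " : " + value.getLength() + " bytes");
            }
        } finally {
            IOUtils.closeStream(reader);
        }
        fs.close();
    }
}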
File formats - Optimized Row Columnar (ORC) *
Improves performance when Hive is reading, writing, and processing data
(fragment of an example table definition: state string, zip int)
52
Topics for today
55
Summary
56
Next Session:
Distributed Programming