
DSECL ZG 522: Big Data Systems

Session 5: Hadoop architecture and filesystem

Janardhanan PS
Professor
[email protected]
Topics for today

• Hadoop architecture overview


✓ Components
✓ Hadoop 1 vs Hadoop 2
• HDFS
✓ Architecture
✓ Robustness
✓ Blocks and replication strategy
✓ Read and write operations
✓ File formats
✓ Commands

2
Hadoop - Data and Compute layers
• A data storage layer
✓ A Distributed File System - HDFS
• A data processing layer
✓ MapReduce programming

3
Hadoop 2 - Architecture
• Master-slave architecture for overall compute and data management
• Slaves implement peer-to-peer communication
• Master node components:
✓ NameNode: HDFS namespace / meta-data manager
✓ YARN ResourceManager: cluster level resource manager (to be covered in a coming session)
• Slave node components:
✓ DataNode: HDFS node level data manager
✓ YARN NodeManager: node level resource manager
✓ MapReduce tasks

Note: the YARN ResourceManager also uses application level App Master processes on slave nodes for application specific resource management
4
What changed from Hadoop 1 to Hadoop 2

• Hadoop 1: MapReduce was coupled with resource management
• Hadoop 2 brought in YARN as a dedicated resource management capability; MapReduce is now only about data processing
• Hadoop 1: a single Master node with the NameNode is a single point of failure (SPOF)
• Hadoop 2 introduced active-passive and other HA configurations besides the Secondary NameNode
• Hadoop 1: only MapReduce programs
• In Hadoop 2, non-MapReduce programs can be run by YARN on slave nodes (since resource management is decoupled from MapReduce), along with support for non-HDFS storage, e.g. Amazon S3

5
Hadoop Distributions

• Open source Apache project
• Core components:
✓ Hadoop Common
✓ Hadoop Distributed File System (HDFS)
✓ Hadoop YARN
✓ Hadoop MapReduce
• Commercial / cloud distributions are also available, e.g. Amazon Web Services Elastic MapReduce (EMR)
6
Topics for today

• Hadoop architecture overview


✓ Components
✓ Hadoop 1 vs Hadoop 2
• HDFS
✓ Architecture
✓ Robustness
✓ Blocks and replication strategy
✓ Read and write operations
✓ File formats
✓ Commands

7
HDFS Features (1)

• A DFS stores data over multiple nodes in a cluster and allows multi-user access
✓ Gives the user the impression that the data is on a single machine
✓ HDFS is a Java-based DFS that sits on top of the native FS
✓ Enables storage of very large files across the nodes of a Hadoop cluster
✓ Data is split into large blocks: 128 MB (default)
• Scale through parallel data processing
✓ 1 node reading 1 TB at an aggregate I/O bandwidth of 400 MB/s (across 4 I/O channels) takes about 43 min
✓ 10 nodes, each holding a partition of that 1 TB, can read the data in parallel in about 4.3 min

8
HDFS Features (2)

• Fault tolerance through replication
✓ Default replication factor = 3 for every block (Hadoop 2 has some optimisations)
✓ So 1 GB of data can actually take 3 GB of storage
• Consistency
✓ Write-once, read-many-times workload
✓ Files can be appended or truncated, but not updated at an arbitrary offset

9
HDFS Features (3)

• Cost: Typically deployed using commodity hardware for low TCO, so adding more nodes is cost-effective

• Variety and Volume of Data: Handles huge data, i.e. terabytes and petabytes, and different kinds of data: structured, semi-structured, or unstructured.

10
HDFS Features (4)

• Data Integrity: HDFS nodes constantly verify checksums to preserve data integrity. On error, new copies are created and the corrupt copies are deleted.

• Data Locality: Move the processing to the data rather than the data to the processing. The computation is brought to the DataNodes where the data already resides, so the program travels to the data instead of the data travelling across the network.

11
HDFS Architecture - Master node

• Master-slave architecture within an HDFS cluster
• One master node running the NameNode
• Maintains the namespace: filename to blocks and their replica mappings
• Serves as arbitrator and does not handle the actual data flow
• HDFS client applications interact with the NameNode for metadata

12
HDFS Architecture - Slave nodes
• Multiple slave nodes with one
DataNode per slave
• Serves block R/W from Clients
• Serves Create/Delete/Replicate
requests from NameNode
• DataNodes interact with each other
for pipeline reads and writes.

13
Functions of a NameNode
• Maintains namespace in HDFS with 2 files
• FsImage: Contains mapping of blocks to file, hierarchy, file properties / permissions
• EditLog: Transaction log of changes to metadata in FsImage
• Does not store any data - only meta-data about files
• Runs on Master node while DataNodes run on Slave nodes
• HA can be configured (discussed later)
• Records each change that takes place to the meta-data. e.g. if a file is deleted in HDFS, the NameNode
will immediately record this in the EditLog.
• Receives periodic Heartbeat and a block report from all the DataNodes in the cluster to ensure that the
DataNodes are live.
• Ensures the replication factor is maintained across DataNode failures
• In case of a DataNode failure, the NameNode chooses new DataNodes for new
replicas, balances disk usage, and manages the communication traffic to the DataNodes

14
Where are fsimage and edit logs ?
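The FsImage and EditLog live under the directory (or directories) configured for NameNode metadata, typically the dfs.namenode.name.dir property in hdfs-site.xml (dfs.name.dir in older releases), e.g. a current/ sub-directory containing files named like fsimage_<txid> and edits_<start>-<end>. A minimal sketch of inspecting them offline with the image and edits viewers (file names are illustrative):
hdfs oiv -p XML -i fsimage_0000000000000000042 -o fsimage.xml
hdfs oev -p XML -i edits_0000000000000000001-0000000000000000042 -o edits.xml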

15
NameNode - What happens on start-up
1. Enters safe mode
✓ Checks the status of DataNodes on the slaves
• Does not allow any block replication in this mode
• Gets heartbeats and block reports from the DataNodes
• Checks that a configurable majority of blocks meet the minimum replication factor
✓ Updates meta-data (this is also done at checkpoint time)
• Reads FsImage and EditLog from disk into memory
• Applies all transactions from the EditLog to the in-memory version of FsImage
• Flushes out new version of FsImage on disk
• Keeps latest FsImage in memory for client requests
• Truncates the old EditLog as its changes are applied on the new FsImage
2. Exits safe mode
3. Continues with further replications needed and client requests
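As a side note, an administrator can inspect or toggle safe mode with the dfsadmin tool:
hdfs dfsadmin -safemode get
hdfs dfsadmin -safemode enter
hdfs dfsadmin -safemode leave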

16
Functions of a DataNode (1)

• Each slave in the cluster runs a DataNode
• DataNodes store the actual data blocks and serve R/W requests from HDFS clients; blocks are kept as regular files on the native file system, e.g. ext2 or ext3
• Default block size for ext2 and ext3 is 4096 bytes
• During pipeline reads and writes, DataNodes communicate with each other (we will discuss what a pipeline is shortly)
• No additional HA mechanism is needed for DataNodes because blocks are replicated anyway

(Figure: a NameNode above three DataNodes, each storing blocks on its local FS, with a DN-to-DN pipeline for data transfer)

17
Functions of a DataNode (2)
• The DataNode continuously sends a heartbeat to the NameNode (default every 3 sec)
✓ To confirm connectivity with the NameNode
• If no heartbeat arrives from a DataNode, the NameNode re-replicates that DataNode's blocks within the cluster and removes the DataNode from its meta-data records
• DataNodes also send a BlockReport on start-up and periodically, containing the list of blocks they hold
• A DataNode applies heuristics to subdivide block files into directories based on the limits of the local FS, but has no knowledge of HDFS level files

(Figure: the NameNode receives heartbeats from the DataNodes; when one DataNode stops sending heartbeats, its blocks are replicated onto the remaining DataNodes)

18
Topics for today

• Hadoop architecture overview


✓ Components
✓ Hadoop 1 vs Hadoop 2
• HDFS
✓ Architecture
✓ Robustness
✓ Blocks and replication strategy
✓ Read and write operations
✓ File formats
✓ Commands

19
Hadoop 2: Introduction of Secondary NameNode

• In the case of failure of the NameNode:
✓ The Secondary NameNode can be configured manually to bring up the cluster
✓ But it does not record the real-time changes that happen to the HDFS metadata
• The Secondary NameNode periodically reads the file system metadata held by the NameNode (a snapshot) and writes it to its own local file system.
• It is responsible for combining the EditLog with the FsImage from the NameNode.
• It downloads the EditLog from the NameNode at regular intervals and applies it to the FsImage.
• Hence, the Secondary NameNode performs regular checkpoints in HDFS; it is therefore also called the CheckpointNode.
• The checkpointed FsImage is copied back to the NameNode, which uses it the next time it starts up (e.g. when recovering from a failure).

20
HA configuration of NameNode
• An Active-Passive configuration can also be set up with a standby NameNode
• A Quorum Journal Manager (QJM) over JournalNodes (3 or 5) or NFS can be used to maintain shared state
• DataNodes send heartbeats and block updates to both NameNodes
• Writes to the JournalNodes happen only via the Active NameNode, which avoids the "split brain" scenario during network partitions
• The Standby reads from the JournalNodes to keep up to date on state, and also receives the latest updates from the DataNodes
• A Zookeeper session may be used for failure detection and election of a new Active

(Figure: Active and Passive NameNodes sharing state via JournalNodes / NFS and coordinated by Zookeeper, with clients and DataNodes below)
21
Other robustness mechanisms

• Types of failures: DataNode failures, NameNode failures, and network partitions
• Heartbeats from DataNodes to the NameNode handle DataNode failures
✓ When a DataNode's heartbeat times out (about 10 min by default), the NameNode updates its state and starts pointing clients to other replicas
✓ The timeout is kept high to avoid replication storms, but can be set lower, especially if clients want to read recent data and avoid stale replicas
• Cluster rebalancing by keeping track of the replication factor per block and node usage
• Checksums are kept for blocks written to DataNodes to detect data corruption caused by node or link faults and software bugs
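A sketch of the configuration knobs behind these intervals (property names from a standard Apache hdfs-site.xml; defaults can vary by version):
• dfs.heartbeat.interval - seconds between DataNode heartbeats (default 3)
• dfs.namenode.heartbeat.recheck-interval - milliseconds between NameNode liveness re-checks (default 300000); the effective dead-node timeout is roughly 2 x recheck-interval + 10 x heartbeat interval ≈ 10.5 minutes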

22
Communication

• TCP/IP at network level


• RPC abstraction on HDFS specific protocols
• Clients talk to HDFS using Client protocol
• DataNode and NameNode talk using DataNode protocol
• RPC is always initiated by DataNode to NameNode and not vice-versa
• For better fault tolerance and NameNode state maintenance

23
Topics for today

• Hadoop architecture overview


✓ Components
✓ Hadoop 1 vs Hadoop 2
• HDFS
✓ Architecture
✓ Robustness
✓ Blocks and replication strategy
✓ Read and write operations
✓ File formats
✓ Commands

24
Blocks in HDFS

• HDFS stores each file as blocks which are scattered throughout the Apache Hadoop cluster.

• The default size of each block is 128 MB in Apache Hadoop 2.x (64 MB in Apache Hadoop
1.x) which you can configure as per your requirement.

• It is not necessary that in HDFS, each file is stored in exact multiple of the configured block
size (128 MB, 256 MB etc.).

• A file of size 514 MB can have 4 x 128MB and 1 x 2MB blocks

• Why a large block size of 128 MB?

• HDFS is used for TB/PB-scale files; a small block size would create too much meta-data on the NameNode

• Much larger blocks would further reduce the "indexing" at block level, but could hurt load balancing and parallelism across nodes
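The block size can also be overridden per file at write time (a sketch; the file and directory names are illustrative):
hadoop fs -D dfs.blocksize=268435456 -put bigfile.dat /data/
Here 268435456 bytes = 256 MB applies only to this file, while the cluster default stays at 128 MB.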
25
How to see blocks of a file in HDFS - fsck
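A minimal example (reusing the sample path from the command slides later; output details vary by version):
hdfs fsck /jps/wc/sample01.txt -files -blocks -locations
This lists each block of the file, its size, and the DataNodes (add -racks to also show the racks) that hold its replicas.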

26
HDFS Block size vs Input Split

• A block is the physical unit of storage in HDFS (default 128 MB); an input split is the logical chunk of data processed by one map task, and by default one split maps to one block
• The split size can be tuned, e.g. by setting the mapred.min.split.size parameter in mapred-site.xml
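The relationship can be sketched with the split-size rule used by the MapReduce input format (illustrative Java; variable names are simplified, not the exact Hadoop source):
// effective split size = max(minSplitSize, min(maxSplitSize, blockSize))
long splitSize = Math.max(minSplitSize, Math.min(maxSplitSize, blockSize));
So with default settings a split equals one HDFS block; raising mapred.min.split.size above the block size makes each map task process more than one block.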


27
HDFS on local FS
• Find / configure the root of the HDFS data directories in hdfs-site.xml via the dfs.data.dir property

• e.g. $HADOOP_HOME/data/dfs/data/hadoop-${user.name}/current

• If you want to see the files in the local FS that store the blocks of HDFS:
✓ cd to the HDFS data root specified in dfs.data.dir
✓ go into the sub-directory with the name you got from the fsck command
✓ navigate into further sub-directories to find the block files

• All of this mapping from HDFS files to blocks (local FS files) on DataNodes is stored on the NameNode

28
Replica Placement Strategy - with Rack awareness

• The first replica is placed on the same node as the client
• The second replica is placed on a node on a different rack
• The third replica is placed on the same rack as the second, but on a different node
• Putting every replica on a different rack would make writes expensive
• For more than 3 replicas, nodes are picked randomly for the 4th replica onwards, without exceeding the upper limit of (replicas - 1) / racks + 2 replicas per rack
• Total replicas <= #DataNodes, with no 2 replicas on the same DataNode
• Once the replica locations are set, the write pipeline is built
• This strategy shows good reliability in practice
• The NameNode collects block reports from the DataNodes to balance blocks across nodes and control over/under-replication of blocks
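As a small worked example of the per-rack ceiling (my own arithmetic, not from the slide): with 10 replicas spread over 3 racks, at most (10 - 1) / 3 + 2 = 5 replicas may land on any single rack (integer division).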

29
Why rack awareness ?

• To improve network performance: Communication between nodes on different racks goes through a switch, and you will generally find greater network bandwidth between machines in the same rack than between machines on different racks. Rack awareness therefore reduces write traffic between racks, giving better write performance, while reads still gain from using the bandwidth of multiple racks.

• To prevent loss of data: Even if an entire rack fails because of a switch or power failure, a replica survives on another rack. As the saying goes, never put all your eggs in the same basket.

30
Topics for today

• Hadoop architecture overview


✓ Components
✓ Hadoop 1 vs Hadoop 2
• HDFS
✓ Architecture
✓ Robustness
✓ Blocks and replication strategy
✓ Read and write operations
✓ File formats
✓ Commands

31
HDFS data writes

The following protocol is followed whenever data is written into HDFS:
• The HDFS client contacts the NameNode with a write request for the file's blocks, say Block A and Block B.
• The NameNode grants permission and returns the IP addresses of the DataNodes that will hold the copies of each block.
• The selection of DataNodes is randomized, but factors in availability, the replication factor, and rack awareness.
• For 3 copies, 3 distinct DataNodes are chosen, where possible, for each block:
◦ For Block A, list A = {DN1, DN4, DN6}
◦ For Block B, list B = {DN3, DN7, DN9}
• Each block is copied to three different DataNodes to keep the replication factor consistent throughout the cluster.
• The whole data copy process then happens in three stages:
1. Set up of the pipeline
2. Data streaming and replication
3. Shutdown of the pipeline (acknowledgement stage)

32
HDFS Write: Step 1. Setup pipeline

The client creates a pipeline for each of the blocks by connecting the individual DataNodes in the respective list for that block. Consider Block A: the list of DataNodes provided by the NameNode is DN1, DN4, DN6.

33
HDFS Write: Step 1. Setup pipeline for a block

1. Client chooses the first DataNode (DN1) and will establish


a TCP/IP connection.
2. Client informs DN1 to be ready to receive the block.
3. Provides IPs of next two DNs (4, 6) to DN1 for replication.
4. The DN1 connects to DN4 and informs it to be ready and
gives IP of DN6. DN4 asks DN6 to be ready for data.
5. Ack of readiness follows the reverse sequence, i.e. from
the DN6 to DN4 to DN1.
6. At last DN1 will inform the client that all the DNs are
ready and a pipeline will be formed between the client,
DataNode 1, 4 and 6.
7. Now pipeline set up is complete and the client will finally
begin the data copy or streaming process.

34
HDFS Write: Step 2. Data streaming

Client pushes the data into the pipeline.

1. Once the block has been written to


DataNode 1 by the client, DataNode 1 will
connect to DataNode 4.
2. Then, DataNode 1 will push the block in
the pipeline and data will be copied to
DataNode 4.
3. Again, DataNode 4 will connect to
DataNode 6 and will copy the last replica
of the block.

35
HDFS Write: Step 3. Shutdown pipeline / ack

The block is now copied to all the DataNodes. The client and the NameNode need to be updated, and the client needs to close the pipeline and end the TCP sessions.

1. Acknowledgement happens in the reverse sequence, i.e. from DN6 to DN4 and then to DN1.

2. DN1 pushes the three acknowledgements (including its own) back through the pipeline to the client.

3. The client informs the NameNode that the data has been written successfully.

4. The NameNode updates its metadata.

5. The client shuts down the pipeline.

36
Multi-block writes

• The client will copy Block A and


Block B to the first
DataNode simultaneously.
• Parallel pipelines for each block
• Pipeline process for a block is
same as discussed.
• E.g. 1A, 2A, 3A, … and 1B, 2B,
3B, … work in parallel

37
Sample write code
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteFileToHDFS {
    public static void main(String[] args) throws IOException {
        WriteFileToHDFS.writeFileToHDFS();
    }

    public static void writeFileToHDFS() throws IOException {
        // Point the client at the NameNode via fs.defaultFS
        Configuration configuration = new Configuration();
        configuration.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fileSystem = FileSystem.get(configuration);
        String fileName = "read_write_hdfs_example.txt";
        Path hdfsWritePath = new Path("/javareadwriteexample/" + fileName);
        // create() returns an FSDataOutputStream; true = overwrite if the file exists
        FSDataOutputStream fsDataOutputStream = fileSystem.create(hdfsWritePath, true);
        BufferedWriter bufferedWriter = new BufferedWriter(
                new OutputStreamWriter(fsDataOutputStream, StandardCharsets.UTF_8));
        bufferedWriter.write("Java API to write data in HDFS");
        bufferedWriter.newLine();
        bufferedWriter.close();
        fileSystem.close();
    }
}
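To try this against a running cluster, a typical (illustrative) compile-and-run sequence is:
javac -cp $(hadoop classpath) WriteFileToHDFS.java
java -cp $(hadoop classpath):. WriteFileToHDFS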
38
HDFS Create / Write - Call sequence in code

1) Client calls create() on FileSystem to create a file


1) RPC call to NameNode happens through FileSystem to create new
file.
2) NameNode performs checks to create a new file. Initially, NameNode
creates a file without associating any data blocks to the file.
3) The FileSystem.create() returns an FSDataOutputStream to client to
perform write.
2) Client creates a BufferedWriter using FSDataOutputStream to write data to
a pipeline
1) The data is split into packets by FSDataOutputStream, which are written to an internal data queue.
2) The DataStreamer consumes the data queue.
3) The DataStreamer requests the NameNode to allocate new blocks by selecting a list of suitable DataNodes to store the replicas. This list of DataNodes forms the pipeline.
4) The DataStreamer streams packets to the first DataNode in the pipeline.

39
HDFS Create / Write - Call sequence in code
3) The first DataNode stores the packet and forwards it to the second DataNode, which in turn forwards it to the third DataNode.
4) FSDataOutputStream also manages an "ack queue" of packets that are waiting for acknowledgement by the DataNodes.
5) A packet is removed from the ack queue only once it has been acknowledged by all the DataNodes in the pipeline.
6) When the client finishes writing to the file, it calls close() on the stream.
7) This flushes all remaining packets to the DataNode pipeline and waits for the relevant acknowledgements, before contacting the NameNode to signal that the write of the file is complete.

40
HDFS Read
1. The client contacts the NameNode asking for the block metadata of a file
2. The NameNode returns the list of DataNodes where each block is stored
3. The client connects to the DataNodes where the blocks are stored
4. The client starts reading data in parallel from the DataNodes (e.g. Block A from DN1, Block B from DN3)
5. Once the client has all the required file blocks, it combines them to form the file

How are blocks chosen by the NameNode?

While serving a client's read request, HDFS selects the replica that is closest to the client. This reduces the read latency and the bandwidth consumption. Therefore, the replica residing on the same rack as the reader node is selected, if possible.

41
Sample read code
import java.io.IOException;
import org.apache.commons.io.IOUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFileFromHDFS {
    public static void main(String[] args) throws IOException {
        ReadFileFromHDFS.readFileFromHDFS();
    }

    public static void readFileFromHDFS() throws IOException {
        // Point the client at the NameNode via fs.defaultFS
        Configuration configuration = new Configuration();
        configuration.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fileSystem = FileSystem.get(configuration);
        String fileName = "read_write_hdfs_example.txt";
        Path hdfsReadPath = new Path("/javareadwriteexample/" + fileName);
        // open() returns an FSDataInputStream - classical input stream usage
        FSDataInputStream inputStream = fileSystem.open(hdfsReadPath);
        String out = IOUtils.toString(inputStream, "UTF-8");  // commons-io helper
        System.out.println(out);
        inputStream.close();
        fileSystem.close();
    }
}
42
HDFS Read - Call sequence in code
1) Client opens the file that it wishes to read from by calling open() on
FileSystem
1) FileSystem communicates with NameNode to get location of
data blocks.
2) NameNode returns the addresses of DataNodes on which blocks
are stored.
3) FileSystem returns FSDataInputStream to client to read from
file.
2) The client then calls read() on the stream, which holds the addresses of the DataNodes for the first few blocks of the file, and connects to the closest DataNode for the first block of the file
1) The client calls read() repeatedly to stream the data from the DataNode
2) When the end of the block is reached, the stream closes the connection with that DataNode
3) The stream repeats these steps to find the best DataNode for the next blocks
3) When the client completes the reading of file, it calls close() on the
stream to close the connection.

43
Read optimizations

• Short-circuit reads: a client co-located with the data can read the block file directly from the local file system, bypassing the DataNode (shared memory under /dev/shm is used), for better performance
• A client can ask for certain HDFS paths to be cached
✓ The NameNode asks the DataNodes to cache the corresponding blocks and send a cache block report along with the heartbeat
• Clients can be scheduled for data locality so that they can make better use of local reads and caches
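A minimal sketch of the caching workflow from the command line (the pool name and path are illustrative):
hdfs cacheadmin -addPool demoPool
hdfs cacheadmin -addDirective -path /javareadwriteexample -pool demoPool
hdfs cacheadmin -listDirectives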

44
Security

• POSIX style file permissions
• Choice of transparent encryption/decryption
✓ HDFS encryption sits between database-level encryption and native file-system-level encryption (Database > HDFS > Native FS in the stack)
• Hierarchical encryption zones can be created so that any file within a zone (an HDFS path) uses the same key
• The Hadoop Key Management Server (KMS) does the following:
✓ Provides access to stored Encryption Zone Keys
✓ Creates encrypted data encryption keys (EDEKs) for storage by the NameNode
✓ Decrypts EDEKs for use by clients
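A minimal sketch of setting up an encryption zone (the key name and path are illustrative):
hadoop key create mykey
hdfs crypto -createZone -keyName mykey -path /secure
hdfs crypto -listZones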

45
Topics for Session 5

• Hadoop architecture overview


✓ Components
✓ Hadoop 1 vs Hadoop 2
• HDFS
✓ Architecture
✓ Robustness
✓ Blocks and replication strategy
✓ Read and write operations
✓ File formats
✓ Commands

46
File formats

• Common formats:
✓ Text (JSON, CSV, ...)
✓ Avro
✓ Parquet
✓ RC
✓ ORC
✓ Sequence
• Many ways to evaluate formats:
✓ Write performance
✓ Read performance
✓ Partial read performance
✓ Block compression
✓ Columnar support
✓ Schema change

47
File formats - text based

• Text-based (JSON, CSV, ...)
✓ Easily splittable when uncompressed
✓ Compressed text files can't be split, which forces large Map tasks because the entire file must go to a single Map task
✓ Simplest way to start putting in structured / semi-structured data

48
File formats - sequence files
• A flat file consisting of binary key/value pairs
• Extensively used in MapReduce as input/output formats as well as internal temporary
outputs of maps
• 3 types
1. Uncompressed key/value records.
2. Record compressed key/value records - only 'values' are compressed here.
3. Block compressed key/value records - both keys and values are collected in
configurable 'blocks' and compressed. Sync markers added for random access and
splitting.
• Header includes information about :
• key, value class
• whether compression is enabled and whether at block level
• compressor codec used

ref: https://cwiki.apache.org/confluence/display/HADOOP2/SequenceFile 49
Example: Storing images in HDFS
• We want to store a bunch of images as key-value pairs, maybe attach some meta-data as well
• Files are stored in a binary format, e.g. a sequence file
• We can apply some image processing on the files, as well as use meta-data to retrieve specific images

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ImageToSeq {
    public static void main(String args[]) throws Exception {
        Configuration confHadoop = new Configuration();
        // Load the cluster configuration explicitly (Hadoop 1.0.4 layout)
        confHadoop.addResource(new Path("/hadoop/projects/hadoop-1.0.4/conf/core-site.xml"));
        confHadoop.addResource(new Path("/hadoop/projects/hadoop-1.0.4/conf/hdfs-site.xml"));
        FileSystem fs = FileSystem.get(confHadoop);
        Path inPath = new Path("/mapin/1.png");
        Path outPath = new Path("/mapin/11.png");
        FSDataInputStream in = null;
        Text key = new Text();
        BytesWritable value = new BytesWritable();
        SequenceFile.Writer writer = null;
        try {
            // Read the image bytes and append them as a (filename, bytes) record
            in = fs.open(inPath);
            byte buffer[] = new byte[in.available()];
            in.read(buffer);
            writer = SequenceFile.createWriter(fs, confHadoop, outPath, key.getClass(), value.getClass());
            writer.append(new Text(inPath.getName()), new BytesWritable(buffer));
        } catch (Exception e) {
            System.out.println("Exception MESSAGES = " + e.getMessage());
        } finally {
            IOUtils.closeStream(in);
            IOUtils.closeStream(writer);
            System.out.println("last line of the code....!!!!!!!!!!");
        }
    }
}

Ref: https://stackoverflow.com/questions/16546040/store-images-videos-into-hadoop-hdfs 50
File formats - Optimized Row Columnar (ORC) *
Improves performance when Hive is reading, writing, and processing data

create table Addresses (
  name string,
  street string,
  city string,
  state string,
  zip int
) stored as orc tblproperties ("orc.compress"="ZLIB");

• Index data is used to skip rows; it includes min/max values for each column and their row positions
• Row data is used for table scans

* more in Hive session    ref: https://cwiki.apache.org/confluence/display/hive/languagemanual+orc 51


File formats - Parquet
• Columnar format
• Pros
✓ Good for compression because data in a column tends to be similar
• Support block level compression as well as file level
✓ Good query performance when query is for specific columns
✓ Compared to ORC - more flexible to add columns
✓ Good for Hive and Spark workloads if working on specific columns at a time
• Cons
✓ Expensive write due to columnar
• So if use case is more about reading entire rows, then not a good choice
• Anyway - Hadoop systems are more about write once and read many times - so read
performance is paramount
• Note: Avro is another option for workloads with full row scans because it is row major storage

52
Topics for today

• Hadoop architecture overview


✓ Components
✓ Hadoop 1 vs Hadoop 2
• HDFS
✓ Architecture
✓ Robustness
✓ Blocks and replication strategy
✓ Read and write operations
✓ File formats
✓ Commands
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html
53
Basic command reference
• list files in the path of the file system
✓ hadoop fs -ls <path> 
• alters the permissions of a file where <arg> is the binary argument e.g. 777
✓ hadoop fs -chmod <arg> <file-or-dir> 
• change the owner of a file
✓ hadoop fs -chown <owner>:<group> <file-or-dir> 
• make a directory on the file system
✓ hadoop fs -mkdir <path> 
• copy a file from the local storage onto file system
✓ hadoop fs -put <local-origin> <destination> 
• copy a file to the local storage from the file system
✓ hadoop fs -get <origin> <local-destination> 
• similar to the put command but the source is restricted to a local file reference
✓ hadoop fs -copyFromLocal <local-origin> <destination> 
• similar to the get command but the destination is restricted to a local file reference
✓ hadoop fs -copyToLocal <origin> <local-destination> 
• create an empty file on the file system
✓ hadoop fs -touchz <path> 
• copy files to stdout
✓ hadoop fs -cat <file> 
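A short example session tying these commands together (paths and file names are illustrative):
hadoop fs -mkdir /jps/wc
hadoop fs -put sample01.txt /jps/wc
hadoop fs -ls /jps/wc
hadoop fs -cat /jps/wc/sample01.txt
hadoop fs -get /jps/wc/sample01.txt ./sample01_copy.txt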
54
More HDFS commands - config and usage
Get configuration data in general, about name nodes, about any specific attribute
• hdfs getconf 
• return various configuration settings in effect
• hdfs getconf -namenodes 
• returns namenodes in the cluster
• hdfs getconf -confkey <a.value>
• return the value of a particular setting (e.g. dfs.replication)
• hdfs dfsadmin -report 
• find out how much disk space is used, free, under-replicated, etc.
• hadoop fs -setrep 2 /jps/wc/sample01.txt
• Set replication factor of 2 to sample01.txt file in HDFS

55
Summary

• High level architecture of Hadoop 2 and differences with earlier Hadoop 1


• HDFS
✓ Architecture
✓ Read and write flows
✓ File formats
✓ Commands commonly used
✓ Additional reading about HDFS:
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html

56
Next Session:
Distributed Programming
