Hadoop Working
Be highly fault-tolerant
HDFS handles large files by dividing them into blocks, replicating those
blocks, and storing them on different cluster nodes. This is what makes it
highly fault-tolerant and reliable.
Read a file
Figure: a client reading data from HDFS (from Hadoop: The Definitive Guide).
To read a file from HDFS, the client opens the file it wishes to read,
and the Distributed Filesystem communicates with the NameNode for
the metadata. The NameNode responds with the number of blocks,
their locations, and their details. The client then calls read() on the
stream returned by the Distributed Filesystem, which connects to the
first (closest) datanode for the first block in the file. When a block ends,
DFSInputStream closes the connection to that datanode and then finds
the best datanode for the next block. This happens transparently to the
client, which, from its point of view, is just reading a continuous
stream. When the client has finished reading, it calls close() on the
FSDataInputStream.
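As a minimal sketch of this read path using the standard Hadoop Java API (FileSystem, FSDataInputStream, IOUtils), the snippet below prints a file to standard output; the NameNode URI and the file path are placeholder assumptions, not values from this article.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; adjust to your cluster.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        // open() asks the NameNode for block locations and returns an FSDataInputStream.
        FSDataInputStream in = fs.open(new Path("/Hadoop_File/file1.txt"));
        try {
            // The stream fetches each block from the closest DataNode transparently.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            in.close();   // close() releases the connection to the DataNode
            fs.close();
        }
    }
}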
Write a file
The first data node copies the block to the second data node, which in
turn copies it to the third data node. Once the replicas of the block have
been created, the acknowledgment is sent back.
Let's now walk through the full HDFS data-writing pipeline end to end.
i) The HDFS client sends a create request through the Distributed File
System API.
ii) The Distributed File System makes an RPC call to the name node to
create a new file in the namespace of the file system. The name node
performs several checks to ensure that the file does not already exist and
that the client has permission to create it. Only when these checks pass
does the name node record the new file; otherwise, file creation fails and
an IOException is thrown back to the client.
iii) As the client writes data, the stream splits it into packets and streams
them to the first data node in the pipeline; each data node forwards the
packets to the next one in the pipeline.
iv) A packet is deleted from the ack queue only when it has been
acknowledged by the data nodes in the pipeline. Once the required number
of replicas has been made (3 by default), the data node sends the
acknowledgment. Similarly, all the blocks are stored and replicated on the
various data nodes, with the data blocks copied in parallel.
v) When the client has finished writing data, it calls close() on the stream.
vi) This action flushes all remaining packets to the data node pipeline and
waits for acknowledgments to signal that the file is complete before
contacting the name node.
Figure: summary of the HDFS data-writing operation.
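A minimal sketch of this write path with the same Hadoop Java API follows; the NameNode URI and the target path are placeholder assumptions for illustration.

import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; adjust to your cluster.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        // create() triggers the RPC to the NameNode that records the new file,
        // then returns a stream that writes packets through the DataNode pipeline.
        FSDataOutputStream out = fs.create(new Path("/Hadoop_File/new_file.txt"));
        try {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        } finally {
            // close() flushes the remaining packets, waits for acknowledgments,
            // and tells the NameNode that the file is complete.
            out.close();
            fs.close();
        }
    }
}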
i) The client opens the file it wants to read by calling open() on the File
System object, which for HDFS is an instance of the Distributed File System.
ii) The Distributed File System uses RPC to call the name node to determine
the block locations for the first few blocks in the file.
iii) Data is streamed back to the client from the data node, which allows the
client to call read() on the stream repeatedly. When a block ends,
DFSInputStream closes the connection to that data node and then finds the
best data node for the next block.
Figure: summary of the HDFS data-reading operation.
What happens if part of the pipeline, a data node process, fails? Hadoop has
an advanced feature to manage this situation (HDFS is fault-tolerant). If a
data node fails while data is being written to it, the following steps are
taken, all of which are transparent to the client writing the data.
The current block on the successful data nodes is given a new identity,
which is communicated to the name node so that, if the failed data node
recovers later, the partial block on the failed data node can be removed.
The failed data node is removed from the pipeline, and the remainder of the
block's data is written to the two successful data nodes in the pipeline.
Rack awareness
A rack is a collection of a few dozen DataNodes physically connected
through a single network switch. If that switch stops functioning, the
entire rack becomes unavailable. To address this, rack awareness ensures that
block replicas are stored on different racks.
For instance, if a block has a replication factor of 3, the rack awareness
algorithm will store the first replica on the local rack, the second replica on
another DataNode within the same rack, and the third replica on a completely
different rack.
Through rack awareness, the closest node is chosen based on rack
information. The NameNode uses the rack awareness algorithm to store and
access replicas in a way that boosts fault tolerance and minimizes latency.
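Rack awareness relies on a topology mapping that the NameNode consults; one common way to supply it is the net.topology.script.file.name property, which points to an administrator-provided script that maps a host or IP to a rack id such as /rack1. The short Java sketch below only illustrates the property; the script path is a placeholder assumption, and on a real cluster this setting would normally live in core-site.xml.

import org.apache.hadoop.conf.Configuration;

public class RackAwarenessConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Placeholder path to a topology script that echoes a rack id per host.
        conf.set("net.topology.script.file.name", "/etc/hadoop/conf/topology.sh");
        System.out.println(conf.get("net.topology.script.file.name"));
    }
}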
Example of HDFS in action
Imagine there exists a file containing the Social Security Numbers of every
citizen in the United States. The SSNs for people with the last name
beginning with A are stored on server 1, SSNs for last names beginning with
B are stored on server 2, etc. Essentially, Hadoop would store fragments of
this SSN database across a server cluster. For the entire file to be
reconstructed, the client would require blocks from all servers within the
cluster.
To ensure that this data’s availability is not compromised in case of server
failure, HDFS replicates blocks onto two additional servers as a rule. The
redundancy can be increased or decreased based on the replication factor of
individual files or an entire environment. For instance, a Hadoop cluster
earmarked for development typically would not require redundancy.
Besides high availability, redundancy has another advantage: it enables
Hadoop clusters to break work into smaller pieces and execute those tasks
across cluster servers for enhanced scalability. The third benefit of HDFS
redundancy is data locality, which is critical for the efficient processing
of large data sets.
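To illustrate how the replication factor can be tuned per file, here is a small sketch using the Java API's setReplication() call; the path and factor are placeholder assumptions (the cluster-wide default normally comes from dfs.replication in hdfs-site.xml).

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        // Ask HDFS to keep 3 copies of this file's blocks (the usual default);
        // a development cluster might lower this to 1 to save space.
        boolean requested = fs.setReplication(new Path("/Hadoop_File/file1.txt"), (short) 3);
        System.out.println("Replication change requested: " + requested);
        fs.close();
    }
}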
Hadoop MapReduce – Data Flow
MapReduce is a term that covers a Map phase and a Reduce phase. The Map
phase is used for transformation, while the Reduce phase is used for
aggregation-style operations. The terms Map and Reduce are derived from
functional programming languages such as Lisp and Scala. A MapReduce
program has 3 main components: the Driver code, the Mapper (for
transformation), and the Reducer (for aggregation).
Let's take an example where you have a 10 TB file to process on Hadoop. The
10 TB of data is first distributed across multiple nodes on Hadoop with HDFS.
Now we have to process it, and for that we have the MapReduce framework. To
process this data with MapReduce, we write Driver code, which is called the
Job. If we are using the Java programming language for processing the data on
HDFS, then we need to initiate this Driver class with a Job object. Suppose
you have a car, which is your framework; the start button used to start the
car plays the same role as the Driver code in the MapReduce framework. We
need to initiate the Driver code to utilize the advantages of the MapReduce
framework.
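As a hedged sketch of what such Driver code can look like, the snippet below configures a word-count Job with the standard Hadoop Java API; the class names WordCountMapper and WordCountReducer (defined in the sketch after the Reducer section) and the input/output paths are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The Job object is the "start button": it describes the whole MapReduce job.
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);     // transformation
        job.setReducerClass(WordCountReducer.class);   // aggregation
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output HDFS paths are placeholders for this sketch.
        FileInputFormat.addInputPath(job, new Path("/input"));
        FileOutputFormat.setOutputPath(job, new Path("/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}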
The framework also provides predefined Mapper and Reducer classes, which
developers extend and modify according to the organization's requirements.
The Mapper is the first code that interacts with the input dataset. Suppose
we have 100 data blocks of the dataset we are analyzing; in that case, there
will be 100 Mapper programs or processes running in parallel on the machines
(nodes), each producing its own output, known as intermediate output, which
is stored on the local disk, not on HDFS. The output of the Mapper acts as
input for the Reducer, which performs sorting and aggregation operations on
the data and produces the final output.
Brief Working Of Reducer
The Reducer is the second part of the MapReduce programming model. The
Mapper produces its output in the form of key-value pairs, which serve as
input for the Reducer. But before these intermediate key-value pairs are
passed to the Reducer, a shuffle-and-sort step groups and orders them by
key. The output generated by the Reducer is the final output, which is then
stored on HDFS (Hadoop Distributed File System). The Reducer mainly performs
computation operations such as addition, filtering, and aggregation.
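A minimal Mapper and Reducer pair matching the Driver sketch above might look like the following; the class names and the word-count tokenization logic are illustrative assumptions.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Transformation: one Mapper runs per input split (data block) and emits
// (word, 1) pairs as intermediate output, which is stored on the local disk.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Aggregation: after shuffle and sort, the Reducer receives each word with all
// of its counts and writes the summed total to HDFS as the final output.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}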
Steps of Data Flow:
The input data is stored on HDFS in blocks; one Mapper runs per block and
writes its intermediate key-value output to the local disk; the shuffle-and-sort
step groups the intermediate pairs by key; the Reducer aggregates each group;
and the final output is written back to HDFS.
Most of us are familiar with the term rack. A rack is a physical collection of nodes in
our Hadoop cluster (maybe 30 to 40). A large Hadoop cluster consists of many racks.
With the help of this rack information, the Namenode chooses the closest Datanode to
achieve maximum performance while performing reads and writes, which reduces
network traffic. A rack can have multiple data nodes storing the file blocks and their
replicas. Hadoop itself is smart enough to automatically write a particular file block to
2 different data nodes in a rack. If you want to store that block of data in more than
2 racks, you can do that, and since this feature is configurable you can change it
manually.
Figure: example of racks in a cluster.
As we all know, a large Hadoop cluster contains multiple racks, and each rack contains
lots of data nodes. Communication between Datanodes on the same rack is much faster
than communication between data nodes on 2 different racks. The name node can find
the closest data node for faster performance; for that, the Name node holds the ids of
all the racks present in the Hadoop cluster. This concept of choosing the closest data
node to serve a request is Rack Awareness. Let's understand this with an example.
In the above image, we have 3 different racks in our Hadoop cluster, and each rack
contains 4 Datanodes. Now suppose you have 3 file blocks (Block 1, Block 2,
Block 3) that you want to put on these data nodes. As we all know, Hadoop has a
feature for making replicas of the file blocks to provide high availability and
fault tolerance. By default, the replication factor is 3, and Hadoop places the
replicas of blocks across the racks in such a way that we achieve good network
bandwidth. For that, Hadoop has some rack awareness policies:
There should not be more than 1 replica on the same Datanode.
More than 2 replicas of a single block are not allowed on the same rack.
The number of racks used inside a Hadoop cluster must be smaller than the
number of replicas.
Now let's continue with our above example. In the diagram, we can easily see
that we have Block 1 on the first Datanode of Rack 1 and 2 replicas of Block 1
on Datanodes 5 and 6, which sums up to 3. Similarly, we also have a replica
distribution of the 2 other blocks across different racks that follows the
above policies.
Hadoop – Schedulers
These schedulers are essentially algorithms that we use to schedule tasks
in a Hadoop cluster when we receive requests from different clients.
A job queue is nothing but the collection of various tasks that we have
received from our various clients. The tasks sit in the queue, and we
need to schedule them on the basis of our requirements.
1. FIFO Scheduler
As the name suggests, FIFO means First In, First Out: the tasks or applications
that come first are served first. This is the default scheduler in Hadoop. The
tasks are placed in a queue and performed in their submission order. In this
method, once a job is scheduled, no intervention is allowed, so sometimes a
high-priority process has to wait for a long time, since the priority of a task
does not matter here.
Advantage:
No need for configuration
First come, first served
Simple to execute
Disadvantage:
The priority of a task doesn't matter, so high-priority jobs need to wait
Not suitable for a shared cluster
2. Capacity Scheduler
In the Capacity Scheduler we have multiple job queues for scheduling our tasks.
The Capacity Scheduler allows multiple tenants to share a large Hadoop cluster.
For each job queue, we provide some slots or cluster resources for performing
job operations, so each job queue has its own slots to perform its tasks. If
only one queue has tasks to perform, its tasks can also use the free slots of
the other queues; when a new task arrives for one of those other queues, the
borrowed slots are handed back so that the queue can run its own jobs. The
Capacity Scheduler also provides a level of abstraction for seeing which tenant
is utilizing more cluster resources or slots, so that a single user or
application does not take a disproportionate or unnecessary number of slots in
the cluster. The Capacity Scheduler mainly contains 3 types of queues, root,
parent, and leaf, which are used to represent the cluster, an organization or
subgroup, and application submission respectively.
Advantage:
Best for working with Multiple clients or priority jobs in a Hadoop cluster
Maximizes throughput in the Hadoop cluster
Disadvantage:
More complex
Not easy to configure for everyone
3. Fair Scheduler
The Fair Scheduler is very similar to the Capacity Scheduler. The priority of
the job is taken into consideration. With the help of the Fair Scheduler, YARN
applications can share the resources of a large Hadoop cluster, and these
resources are maintained dynamically, so there is no need for prior capacity
planning. The resources are distributed in such a manner that all applications
within a cluster get, on average, an equal share of resources over time. The
Fair Scheduler makes scheduling decisions on the basis of memory; we can
configure it to work with CPU as well.
As mentioned, it is similar to the Capacity Scheduler, but the major thing to
notice is that with the Fair Scheduler, whenever a high-priority job arrives in
the same queue, it is processed in parallel by taking over some portion of the
already dedicated slots.
Advantages:
Resources assigned to each application depend upon its priority.
It can limit the concurrently running tasks in a particular pool or queue.
Disadvantage: Configuration is required.
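Which of these schedulers YARN uses is controlled by the yarn.resourcemanager.scheduler.class property, normally set in yarn-site.xml; the Java sketch below only illustrates the property and the scheduler class names and is not how a real cluster would usually be configured.

import org.apache.hadoop.conf.Configuration;

public class SchedulerConfigExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Capacity Scheduler (the default in recent Hadoop releases):
        conf.set("yarn.resourcemanager.scheduler.class",
                 "org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler");
        // Fair Scheduler alternative:
        //   org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
        // FIFO Scheduler alternative:
        //   org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler
        System.out.println(conf.get("yarn.resourcemanager.scheduler.class"));
    }
}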
Hadoop – Different Modes of Operation
As we all know, Hadoop is an open-source framework that is mainly used for
storage and for maintaining and analyzing large amounts of data or datasets on
clusters of commodity hardware; it is, in effect, a data management tool.
Hadoop also has a scale-out storage property, which means that we can scale
the number of nodes up or down as per our future requirements, which is a
really useful feature.
1. Standalone Mode
In Standalone Mode, none of the daemons run, i.e. Namenode, Datanode,
Secondary Namenode, Job Tracker, and Task Tracker. We use the Job Tracker and
Task Tracker for processing in Hadoop 1; in Hadoop 2 we use the Resource
Manager and Node Manager instead. Standalone Mode also means that we install
Hadoop on only a single system. By default, Hadoop is configured to run in
this Standalone Mode, which we can also call Local Mode. We mainly use Hadoop
in this mode for learning, testing, and debugging.
Hadoop runs fastest in this mode among all 3 modes. HDFS (Hadoop distributed
file system), one of the major components of Hadoop used for storage, is not
utilized in this mode; the local file system is used instead, much like the
file systems available for Windows, i.e. NTFS (New Technology File System) and
FAT32 (File Allocation Table, which uses 32-bit entries in its allocation
table). When Hadoop works in this mode there is no need to configure the files
hdfs-site.xml, mapred-site.xml, and core-site.xml for the Hadoop environment.
In this mode, all of your processes run in a single JVM (Java Virtual Machine),
and this mode can only be used for small development purposes.
2. Pseudo-distributed Mode
In Pseudo-distributed Mode we also use only a single node, but the main thing
is that the cluster is simulated, which means that all the processes inside the
cluster run independently of each other. All the daemons, i.e. Namenode,
Datanode, Secondary Namenode, Resource Manager, Node Manager, etc., run as
separate processes in separate JVMs (Java Virtual Machines), or we can say they
run as different Java processes; that is why it is called Pseudo-distributed.
One thing we should remember is that, since we are using only a single-node
setup, all the Master and Slave processes are handled by one system. The
Namenode and Resource Manager act as Masters, while the Datanode and Node
Manager act as Slaves. The Secondary Namenode is also a Master process; its
purpose is to keep periodic checkpoints of the Namenode's metadata. In this
mode:
Hadoop is used both for development and for debugging purposes.
HDFS (Hadoop Distributed File System) is utilized for managing the input and
output processes.
We need to change the configuration files mapred-site.xml, core-site.xml, and
hdfs-site.xml to set up the environment.
3. Fully Distributed Mode
This is the most important mode, in which multiple nodes are used: a few of
them run the Master daemons, Namenode and Resource Manager, and the rest run
the Slave daemons, DataNode and Node Manager. Here Hadoop runs on a cluster of
machines or nodes, and the data being used is distributed across the different
nodes. This is the production mode of Hadoop; let's clarify this mode in more
physical terms.
In the single-node modes, once you download Hadoop as a tar or zip file, you
install it on one system and run all the processes there. Here, in Fully
Distributed Mode, we instead extract this tar or zip file on each of the nodes
in the Hadoop cluster and then use a particular node for a particular process.
Once you distribute the processes among the nodes, you define which nodes work
as masters and which of them work as slaves.
Hadoop – Cluster, Properties and its Types
1. Single Node Hadoop Cluster: In a Single Node Hadoop Cluster, as the name
suggests, the cluster consists of only a single node, which means all our
Hadoop daemons, i.e. Name Node, Data Node, Secondary Name Node, Resource
Manager, and Node Manager, run on the same system or machine. It also means
that all of our processes are handled by a single JVM (Java Virtual Machine)
process instance.
2. Multiple Node Hadoop Cluster: In a multiple-node Hadoop cluster, as the
name suggests, the cluster contains multiple nodes. In this kind of cluster
setup, our Hadoop daemons run on different nodes within the same cluster. In
general, in a multiple-node Hadoop cluster setup we try to use our higher-end
processing nodes for the Masters, i.e. the Name Node and Resource Manager, and
we use cheaper systems for the slave daemons, i.e. the Node Manager and Data
Node.
1. File Permission
HDFS files and directories have an owner, a group, and permissions for the
owner, the group, and all other users, much like Linux. For a file, r is
required to read it and w is required to write or append to it (the x
permission is ignored for files). Similarly, we have permissions for a
directory in our HDFS, where r is used to list the contents of a directory, w
is used for creating or deleting a directory, and x is used to access a child
of the directory. Here, too, we don't have setUID and setGID for HDFS.
How Can You Change This HDFS File's Permission?
The -chmod command, which stands for change mode, is used to change the
permissions of files in our HDFS. First, list the directories available in our
HDFS and have a look at the permissions assigned to each of these directories.
You can list the directories in your HDFS root with the command below.
hdfs dfs -ls /
Here, / represents the root directory of your HDFS.
In the listing above, you can see that file1.txt has read and write permission
for the owner user only, so I am adding write permission for the group and
others as well.
Pre-requisite:
You should be familiar with the use of the -chmod command in Linux, i.e. how
to use its switches to set permissions for users. To add write permission for
the group and others, use the command below.
hdfs dfs -chmod go+w /Hadoop_File/file1.txt
Here, go stands for group and others, w means write, and the + sign shows that
I am adding write permission for the group and others. Then list the file
again to check whether it worked.
hdfs dfs -ls /Hadoop_File
And we are done with it. Similarly, you can change the permissions of any file
or directory available in our HDFS (Hadoop Distributed File System), and you
can change permissions as per your requirement for any user. You can also
change the group or owner of a directory with -chgrp and -chown respectively.
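The same kind of permission change can also be made programmatically through the Java API's setPermission() call; the path and the exact permission bits below are assumptions mirroring the chmod example above.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

public class HdfsPermissionExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        Path file = new Path("/Hadoop_File/file1.txt");
        // Roughly the result of "go+w" on a rw------- file: rw- for owner, group and others.
        FsPermission perm = new FsPermission(
                FsAction.READ_WRITE,   // owner
                FsAction.READ_WRITE,   // group
                FsAction.READ_WRITE);  // others
        fs.setPermission(file, perm);
        System.out.println(fs.getFileStatus(file).getPermission());
        fs.close();
    }
}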
2. ACL (Access Control List)
ACLs provide a more flexible way to assign permissions in a file system. An
ACL is a list of access permissions for a file or a directory. We need ACLs
when we have made a separate user for a Hadoop single-node cluster setup, or
have a multi-node cluster setup where various nodes are present, and we want
to change permissions for other users.
This is because, if you want to change permissions for a different named user,
you cannot do it with the -chmod command. For example, for a single-node
Hadoop cluster, your main user is root and you have created a separate user
for the Hadoop setup, named, let's say, hadoop. Now if you want to change the
permissions of the root user for files that are present in your HDFS, you
cannot do it with the -chmod command. This is where the ACL (Access Control
List) comes into the picture. With an ACL you can set permissions for a
specific named user or named group.
In order to enable ACLs in HDFS, you need to add the property below to the
hdfs-site.xml file.
<property>
<name>dfs.namenode.acls.enabled</name>
<value>true</value>
</property>
Note: Don't forget to restart all the daemons; otherwise the changes made to
hdfs-site.xml will not take effect.
You can check the entries in your access control list (ACL) with the -getfacl
command for a directory, as shown below.
hdfs dfs -getfacl /Hadoop_File
You can see that we have 3 different entries in our ACL. Suppose you want to
change the permission of your root user for any HDFS directory; you can do it
with the command below.
Syntax:
hdfs dfs -setfacl -m user:user_name:r-x /Hadoop_File
You can change the permission for any user by adding it to the ACL for that
directory. Below are some examples of changing the permissions of different
named users for an HDFS file or directory.
hdfs dfs -setfacl -m user:root:r-x /Hadoop_File
Another example, for raj user:
hdfs dfs -setfacl -m user:raj:r-x /Hadoop_File
Here r-x denotes read and execute permission only on the HDFS directory for
the root and raj users.
In my case, I don't have any other user, so I am changing the permission for
my only user, i.e. dikshant:
hdfs dfs -setfacl -m user:dikshant:rwx /Hadoop_File
Then list the ACL with -getfacl command to see the changes.
hdfs dfs -getfacl /Hadoop_File
Here, you can see that the dikshant user's permission on the directory is now
rwx.
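The same ACL manipulation is also available through the Java API via modifyAclEntries() and getAclStatus(); the sketch below mirrors the dikshant example above, with the path and user name as placeholder assumptions.

import java.net.URI;
import java.util.Collections;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.AclEntry;
import org.apache.hadoop.fs.permission.AclEntryScope;
import org.apache.hadoop.fs.permission.AclEntryType;
import org.apache.hadoop.fs.permission.FsAction;

public class HdfsAclExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        Path dir = new Path("/Hadoop_File");
        // Equivalent of: hdfs dfs -setfacl -m user:dikshant:rwx /Hadoop_File
        List<AclEntry> aclSpec = Collections.singletonList(
                new AclEntry.Builder()
                        .setScope(AclEntryScope.ACCESS)
                        .setType(AclEntryType.USER)
                        .setName("dikshant")
                        .setPermission(FsAction.ALL)
                        .build());
        fs.modifyAclEntries(dir, aclSpec);
        // Equivalent of: hdfs dfs -getfacl /Hadoop_File
        System.out.println(fs.getAclStatus(dir));
        fs.close();
    }
}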