Hadoop Working

As data grows significantly, storing large amounts of information across a network of machines becomes a necessity. This creates the need for a reliable system, called a distributed filesystem, to control how data is stored and retrieved. However, many challenges emerge when implementing such an infrastructure, for instance handling hardware failure without losing data.

In this article, we'll focus on Hadoop's distributed filesystem, HDFS: its design, its architecture, and the data flow.

The design of HDFS


The Hadoop Distributed File System (HDFS) is a distributed file
system designed to:

Run on commodity hardware.

Hadoop is designed to run on affordable devices commonly available from multiple vendors. The nature of the hardware gives the cluster extreme scalability: failed units can be replaced easily and cheaply. Although the probability of hardware failure is higher in a large cluster, HDFS continues to work without noticeable disruption.

Be highly fault-tolerant
HDFS handles large files by dividing them into blocks, replicating them, and storing them on different cluster nodes; hence its ability to be highly fault-tolerant and reliable.

Handle very large datasets

HDFS is designed to store very large datasets, in the range of gigabytes, terabytes, or even petabytes.

Stream data and provide high throughput

The write-once model, which assumes data never changes after it is written, simplifies replication. Thanks to this model and to independent parallel processing, data throughput is high.

The architecture of HDFS


HDFS has a master/slave architecture. It consists of:

 NameNode: Known as Master Node. It manages the filesystem namespace and executes operations like opening, closing, and renaming files and directories. It maintains the filesystem tree and the metadata (number of data blocks*, replicas, etc.) for all the files and directories in the tree. The NameNode also maintains and manages the slave nodes.
The files associated with the metadata are:

 FsImage: A persistent checkpoint of the filesystem metadata.

 EditLogs: It contains all the recent modifications made to the file system with respect to the most recent FsImage.

 DataNode: Known as Slave Node. It performs read and write operations as per the requests of the client or the NameNode, and it reports back to the namenode periodically with lists of blocks that it is storing.

 Secondary NameNode: Usually runs on a separate physical machine. Its role is to merge the FsImage and EditLogs from the NameNode periodically, which prevents the edit log from becoming too large. It also stores a copy of the merged FsImage in persistent storage, which can be used in the case of NameNode failure.

From Hadoop version 0.21.0 onward, two new types of namenode were introduced: a backup node, which maintains an up-to-date copy of the namespace in memory by receiving a stream of edits from the namenode, and a checkpoint node, which creates checkpoints of the namespace and effectively replaces the secondary namenode.

*Block: A disk has a block size, which is the minimum amount of data that it can read or write. Files in HDFS are broken into block-sized chunks, which are stored as independent units. The default size of a block in HDFS is 128 MB (Hadoop 2.x) and 64 MB (Hadoop 1.x).
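For instance, a 1 GB file with a 128 MB block size occupies 8 blocks (the last block of a file may be smaller than the block size). As a minimal sketch, assuming the Java FileSystem API and placeholder NameNode URI and file path, the block size and resulting block count of a file can be inspected like this:

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockCount {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

        // Hypothetical file; replace with a path that exists in your cluster.
        FileStatus status = fs.getFileStatus(new Path("/data/sample.log"));

        long blockSize = status.getBlockSize();              // e.g. 128 MB on Hadoop 2.x
        long length = status.getLen();                       // file length in bytes
        long blocks = (length + blockSize - 1) / blockSize;  // block-sized chunks, rounded up

        System.out.println(length + " bytes -> " + blocks + " blocks of " + blockSize + " bytes");
    }
}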

The Data Flow

Read a file

(Figure: a client reading data from HDFS, from Hadoop: The Definitive Guide.)

To read a file from HDFS, the client opens the file it wishes to read, and the DistributedFileSystem communicates with the NameNode to get the metadata. The NameNode responds with the number of blocks, their locations, and other details. The client then calls read() on the stream returned by the DistributedFileSystem, which connects to the first (closest) datanode holding the first block in the file. When a block ends, DFSInputStream closes the connection to that datanode and finds the best datanode for the next block. This happens transparently to the client, which, from its point of view, is just reading a continuous stream. When the client has finished reading, it calls close() on the FSDataInputStream.
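The read flow above can be sketched with the Java FileSystem API (a minimal sketch; the NameNode URI hdfs://localhost:9000 and the file path are placeholder values):

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

        InputStream in = null;
        try {
            // open() returns an FSDataInputStream; block locations come from the NameNode.
            in = fs.open(new Path("/user/hadoop/input.txt"));
            // Copy the stream to stdout; under the hood DFSInputStream reads block by block
            // from the closest available DataNodes.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in); // equivalent to calling close() on the stream
        }
    }
}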

Write a file

(Figure: a client writing data to HDFS, from Hadoop: The Definitive Guide.)

When a client wants to write a file to HDFS, it calls create() on the DistributedFileSystem, which communicates with the namenode to create a new file in the filesystem's namespace, with no blocks associated with it. If the file doesn't already exist and the client has the right permissions, the NameNode creates the file; otherwise, file creation fails and the client is thrown an IOException. The DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to. As the client writes data, DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, whose responsibility it is to ask the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas. The list of datanodes forms a pipeline. The DataStreamer streams the packets to the first datanode in the pipeline, which stores each packet and forwards it to the second datanode in the pipeline, and so on. DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by datanodes, called the ack queue. A packet is removed from the ack queue only when it has been acknowledged by all the datanodes in the pipeline. When the client has finished writing data, it calls close() on the stream. This action flushes all the remaining packets to the datanode pipeline and waits for acknowledgments before contacting the namenode to signal that the file is complete. The namenode already knows which blocks the file is made up of (via the DataStreamer asking for block allocations), so it only has to wait for blocks to be minimally replicated before returning successfully.
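The write flow can be sketched the same way (again a minimal sketch with placeholder URI and path; error handling is kept to a minimum):

import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

        // create() asks the NameNode to add the file to the namespace (no blocks yet)
        // and returns an FSDataOutputStream backed by a DFSOutputStream.
        FSDataOutputStream out = fs.create(new Path("/user/hadoop/output.txt"));
        try {
            // Written bytes are split into packets and pushed through the DataNode pipeline.
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        } finally {
            // close() flushes the remaining packets, waits for the acks, and then tells
            // the NameNode that the file is complete.
            out.close();
        }
    }
}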

To conclude, HDFS is a reliable, distributed filesystem that stores large files across big data clusters. It was designed to be scalable, highly available, and fault-tolerant. Its architecture manages distributed storage across a network of machines and maintains replicas in case of hardware failure.


To write a file in HDFS, a client first needs to communicate with the master, i.e. the namenode. The namenode provides the addresses of the datanodes (slaves) on which the client will write the data. The client writes the data directly to those datanodes, and the datanodes build a pipeline for the write.

The first datanode copies the block to the second datanode, which in turn copies it to the third datanode. Once the replicas of the block are created, an acknowledgment is sent back.

HDFS Data Write Pipeline Workflow

Let's now walk through the full HDFS data write pipeline end to end.

(i) The HDFS client sends a create request through the Distributed File System API.

(ii) The Distributed File System makes an RPC call to the namenode to create a new file in the filesystem's namespace.

The namenode performs several checks to ensure that the file does not already exist and that the client has permission to create it. Only when these checks pass does the namenode record the new file; otherwise, file creation fails and an IOException is thrown at the client.

(iii) The Distributed File System returns an FSDataOutputStream for the client to start writing data to. As the client writes data, DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue.

(iv) The list of datanodes forms a pipeline; here we can assume that the replication factor is three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first datanode in the pipeline, which stores each packet and forwards it to the second datanode; similarly, the second datanode stores the packet and forwards it to the third (and last) datanode in the pipeline.

(v) A packet is removed from the ack queue only when it has been acknowledged by all the datanodes in the pipeline. A datanode sends its acknowledgment once the required replicas are created (3 by default). In the same way, all the blocks are stored and replicated on the various datanodes, with the data blocks copied in parallel.

(vi) When the client has finished writing data, it calls close() on the stream.

(vii) This action flushes all remaining packets to the datanode pipeline and waits for acknowledgments before contacting the namenode to signal that the file is complete.

(Diagram: summary of the HDFS data write operation.)

To read a file from HDFS, a client needs to communicate with the namenode (master), because the namenode is the core of the Hadoop cluster (it stores all the metadata, i.e. data about the data). The namenode checks whether the client has the necessary privileges and, if it does, provides the addresses of the slaves where the file is stored. The client can then communicate directly with the respective datanodes to read the data blocks.

HDFS Read File Workflow in Hadoop

Let's now understand the complete HDFS data read operation end to end. The read process is distributed: the client reads the data from the datanodes in parallel. The read cycle is explained step by step below.

(i) The client opens the file it wants to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem.

(ii) The DistributedFileSystem uses RPC to call the namenode to determine the locations of the first few blocks in the file.

(iii) The DistributedFileSystem returns an FSDataInputStream to the client, from which it can read data. FSDataInputStream wraps a DFSInputStream, which handles the datanode and namenode I/O. The client calls read() on the stream. The DFSInputStream, which has stored the datanode addresses, connects to the nearest datanode holding the first block in the file.

(iv) Data is streamed back to the client from the datanode, which enables the client to call read() repeatedly on the stream. When the block ends, DFSInputStream closes the connection to that datanode and then finds the best datanode for the next block.

(v) If DFSInputStream encounters an error while communicating with a datanode, it tries the next closest datanode for that block. It also remembers datanodes that have failed, so that it does not needlessly retry them for later blocks. DFSInputStream also verifies checksums for the data transferred to it from the datanode. If a corrupt block is detected, this is reported to the namenode before DFSInputStream attempts to read a replica of the block from another datanode.

(vi) When the client has finished reading the data, it calls close() on the stream.

(Diagram: summary of the HDFS data read operation.)

HDFS Fault Tolerance in Hadoop

Suppose a datanode in the pipeline fails while data is being written to it. Hadoop has a mechanism to manage this situation (HDFS is fault-tolerant): the following steps are taken, and they are transparent to the client writing the data.

The current block on the good datanodes is given a new identity, which is communicated to the namenode, so that if the failed datanode recovers later, the partial block on it will be deleted.

The failed datanode is removed from the pipeline, and the remainder of the block's data is written to the two good datanodes in the pipeline.

Rack awareness
A rack is a collection of a few dozen DataNodes physically connected through a single network switch. If that switch stops functioning, the entire rack becomes unavailable. To address this, rack awareness ensures that block replicas are stored on different racks.
For instance, if a block has a replication factor of 3, the default placement policy stores the first replica on the local node, the second replica on a DataNode in a different rack, and the third replica on a different DataNode in that same remote rack.
Through rack awareness, the closest node is chosen based on rack
information. The NameNode uses the rack awareness algorithm to store and
access replicas in a way that boosts fault tolerance and minimizes latency.
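One way to see where the replicas of a file's blocks ended up, including their rack (topology) paths, is the block-location API. This is a minimal sketch; the NameNode URI and file path are placeholders:

import java.io.IOException;
import java.net.URI;
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/sample.log"));

        // One BlockLocation per block, listing the DataNodes that hold its replicas.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("hosts: " + Arrays.toString(block.getHosts())
                    + "  racks: " + Arrays.toString(block.getTopologyPaths()));
        }
    }
}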
Example of HDFS in action
Imagine there exists a file containing the Social Security Numbers of every
citizen in the United States. The SSNs for people with the last name
beginning with A are stored on server 1, SSNs for last names beginning with
B are stored on server 2, etc. Essentially, Hadoop would store fragments of
this SSN database across a server cluster. For the entire file to be
reconstructed, the client would require blocks from all servers within the
cluster.
To ensure that this data’s availability is not compromised in case of server
failure, HDFS replicates blocks onto two additional servers as a rule. The
redundancy can be increased or decreased based on the replication factor of
individual files or an entire environment. For instance, a Hadoop cluster
earmarked for development typically would not require redundancy.
Besides high availability, redundancy has another advantage: it enables
Hadoop clusters to fractionate work into tinier pieces and execute those tasks
across cluster servers for enhanced scalability. The third benefit of HDFS
redundancy is the advantage of data locality, which is critical for the
streamlined execution of large data sets.
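As a minimal sketch of adjusting redundancy (dfs.replication is the standard cluster-wide default, normally set in hdfs-site.xml; the NameNode URI and path below are placeholders), the replication factor can also be changed per file through the Java API:

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Default replication for files created with this configuration.
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        // Lower the replication of one existing file, e.g. on a development cluster.
        fs.setReplication(new Path("/data/sample.log"), (short) 1);
    }
}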
Hadoop MapReduce – Data Flow


MapReduce is a processing framework used to process data across a large number of machines. Hadoop uses MapReduce to process the data distributed in a Hadoop cluster. MapReduce is unlike regular processing frameworks such as Hibernate, the JDK, or .NET; those frameworks are designed for traditional systems where the data is stored in a single location, such as a network file system or an Oracle database. When we process big data, the data is located on multiple commodity machines with the help of HDFS.
When data is stored on multiple nodes, we need a processing framework that can copy the program to the locations where the data is present, i.e. to all the machines holding the data. This is where MapReduce comes into the picture for processing data on Hadoop over a distributed system. Moving the massive volume of data across switches would create heavy cross-switch network traffic, so MapReduce comes with a feature called data locality: the ability to move the computation closer to where the data actually resides on the machines. Since Hadoop is designed to work on commodity hardware, it uses MapReduce because it is widely accepted and provides an easy way to process data over multiple nodes. MapReduce is not the only framework for parallel processing; nowadays Spark is also a popular framework for distributed computing, and HAMA and MPI are other distributed processing frameworks.

Let’s Understand Data-Flow in Map-Reduce

MapReduce consists of a Map phase and a Reduce phase. The Mapper is used for transformation, while the Reducer is used for aggregation-style operations. The terminology of Map and Reduce comes from functional programming languages such as Lisp and Scala. A MapReduce program has 3 main components: the Driver code, the Mapper (for transformation), and the Reducer (for aggregation).
Let's take an example where you have a 10 TB file to process on Hadoop. The 10 TB of data is first distributed across multiple nodes with HDFS. To process it we use the MapReduce framework, starting with the Driver code, which is called the Job. If we are using the Java programming language to process the data on HDFS, we initiate this Driver class with a Job object. If you think of your framework as a car, the Driver code is the start button: we need to run the Driver code to take advantage of the MapReduce framework, as sketched below.
The framework also provides Mapper and Reducer classes, which are predefined and extended by developers according to the organization's requirements.
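Here is a minimal driver sketch for a word-count style job. The class names WordCountDriver, WordCountMapper, and WordCountReducer are illustrative (the Mapper and Reducer are sketched in the next sections), and the input and output paths come from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");   // the Driver code configures a Job

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);       // transformation
        job.setReducerClass(WordCountReducer.class);     // aggregation
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS

        System.exit(job.waitForCompletion(true) ? 0 : 1);        // submit and wait
    }
}

The job would typically be packaged into a jar and submitted with something like hadoop jar wordcount.jar WordCountDriver /input /output (the jar name and paths here are placeholders).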

Brief Working of Mapper

The Mapper is the first code that interacts with the input dataset. If the dataset we are analyzing has 100 data blocks, then 100 Mapper programs or processes run in parallel on the machines (nodes) and produce their own output, known as the intermediate output, which is stored on local disk rather than on HDFS. The output of the Mappers acts as input for the Reducer, which performs sorting and aggregation on the data and produces the final output. A minimal Mapper sketch follows.
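A minimal word-count Mapper sketch (the classic example, shown here only to illustrate the transformation step):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input line is split into words; emit (word, 1) as intermediate output.
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}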
Brief Working Of Reducer
The Reducer is the second part of the MapReduce programming model. The Mapper produces its output in the form of key-value pairs, which work as input for the Reducer. Before these intermediate key-value pairs reach the Reducer, they are shuffled and sorted according to their keys. The output generated by the Reducer is the final output, which is stored on HDFS (the Hadoop Distributed File System). The Reducer mainly performs computations such as addition, filtering, and aggregation, as in the sketch below.
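And a matching minimal Reducer sketch:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // The framework has already shuffled and sorted the pairs by key, so all counts
        // for one word arrive together; add them up and emit the total.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}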
Steps of Data-Flow:

 One input split is processed at a time. The Mapper is overridden by the developer according to the business logic, and these Mappers run in parallel on all the machines in the cluster.
 The intermediate output generated by the Mappers is stored on local disk and shuffled to the Reducers.
 Once the Mappers finish their task, the output is sorted, merged, and provided to the Reducer.
 The Reducer performs reducing tasks such as aggregation and other compositional operations, and the final output is stored on HDFS in a part-r-00000 file (created by default).
Hadoop – Rack and Rack Awareness



Most of us are familiar with the term rack. A rack is a physical collection of nodes in our Hadoop cluster (maybe 30 to 40). A large Hadoop cluster consists of many racks. With the help of this rack information, the Namenode chooses the closest Datanode to achieve maximum performance while reading or writing, which reduces network traffic. A rack can have multiple datanodes storing file blocks and their replicas. By default, Hadoop automatically spreads the replicas of a file block across 2 different racks; if you want to spread a block over more than 2 racks, you can do that, as this behaviour is configurable and can be changed manually.

(Figure: example of racks in a cluster.)

As we all know, a large Hadoop cluster contains multiple racks, and each rack holds many datanodes. Communication between datanodes on the same rack is much faster than communication between datanodes on 2 different racks. The namenode can find the closest datanode for faster performance; for that, the namenode holds the ids of all the racks present in the Hadoop cluster. This concept of choosing the closest datanode to serve a request is rack awareness. Let's understand this with an example.

In the example image, we have 3 different racks in our Hadoop cluster, and each rack contains 4 datanodes. Now suppose you have 3 file blocks (Block 1, Block 2, Block 3) that you want to put on these datanodes. As we all know, Hadoop replicates file blocks to provide high availability and fault tolerance. By default the replication factor is 3, and Hadoop places the replicas of the blocks across the racks in such a way that good network bandwidth is achieved. For that, Hadoop has some rack awareness policies:
 There should not be more than 1 replica on the same datanode.
 More than 2 replicas of a single block are not allowed on the same rack.
 The number of racks used inside a Hadoop cluster must be smaller than the number of replicas.
Now let's continue with our example. In the diagram, we can easily see that Block 1 is on the first datanode of Rack 1 and that 2 replicas of Block 1 are on datanodes 5 and 6 of another rack, which sums to 3. Similarly, the replicas of the 2 other blocks are distributed across different racks, following the above policies.

Benefits of Implementing Rack Awareness in our Hadoop Cluster:


 With the rack awareness policies, we store the data on different racks, so there is no way to lose our data if a single rack fails.
 Rack awareness helps to maximize network bandwidth, because data block transfers stay within a rack where possible.
 It also improves cluster performance and provides high data availability.
(Figure: HDFS rack awareness example.)
Hadoop – Schedulers and Types of Schedulers
In Hadoop, we can receive multiple jobs from different clients to perform. In a typical Hadoop cluster, the MapReduce framework runs multiple tasks in parallel so that large datasets are processed at a fast rate. In Hadoop 1, this MapReduce framework was also responsible for scheduling and monitoring the tasks submitted by different clients; that method of scheduling jobs was used prior to Hadoop 2.
In Hadoop 2, we have YARN (Yet Another Resource Negotiator). In YARN there are separate daemons for job scheduling, monitoring, and resource management: the Application Master, the Node Manager, and the Resource Manager, respectively.
Here, the Resource Manager is the master daemon responsible for tracking and providing the resources required by any application within the cluster, and the Node Manager is the slave daemon that monitors and keeps track of the resources used by an application and sends feedback to the Resource Manager.
The Scheduler and the Applications Manager are the 2 major components of the Resource Manager. The Scheduler in YARN is dedicated purely to scheduling jobs; it cannot track the status of an application. The Scheduler schedules jobs on the basis of the resources they require.

These schedulers are essentially algorithms that we use to schedule tasks in a Hadoop cluster when we receive requests from different clients.
A job queue is simply the collection of tasks that we have received from our various clients. The tasks sit in the queue, and we need to schedule them on the basis of our requirements.

1. FIFO Scheduler

As the name suggests, FIFO means First In, First Out: the task or application that comes first is served first. This is the default scheduler used in Hadoop. The tasks are placed in a queue and performed in their submission order. In this method, once a job is scheduled, no intervention is allowed, so a high-priority job may have to wait a long time, since the priority of a task does not matter here.
Advantages:
 No need for configuration
 First come, first served
 Simple to execute
Disadvantages:
 Priority of a task doesn't matter, so high-priority jobs need to wait
 Not suitable for a shared cluster
2. Capacity Scheduler

In the Capacity Scheduler we have multiple job queues for scheduling our tasks. The Capacity Scheduler allows multiple tenants to share a large Hadoop cluster. For each job queue we provide some slots or cluster resources for performing its jobs, and each job queue has its own slots to perform its tasks. If only one queue has tasks to perform, its tasks can also use the slots of other queues while those are free; when a new task later enters one of those other queues, the borrowed slots are handed back so that the queue can run its own jobs. The Capacity Scheduler also provides a level of visibility into which tenant is using more cluster resources or slots, so that a single user or application does not take a disproportionate or unnecessary share of slots in the cluster. The Capacity Scheduler mainly contains 3 types of queue: root, parent, and leaf, which represent the cluster, an organization or subgroup, and the point of application submission, respectively.
Advantages:
 Best for working with multiple clients or priority jobs in a Hadoop cluster
 Maximizes throughput in the Hadoop cluster
Disadvantages:
 More complex
 Not easy for everyone to configure
3. Fair Scheduler

The Fair Scheduler is very similar to the Capacity Scheduler, and the priority of a job is taken into consideration. With the Fair Scheduler, YARN applications can share the resources of a large Hadoop cluster, and these resources are allocated dynamically, so no prior capacity planning is needed. The resources are distributed in such a manner that all applications within the cluster get a roughly equal share of time. The Fair Scheduler makes scheduling decisions on the basis of memory, and it can be configured to work with CPU as well.
As mentioned, it is similar to the Capacity Scheduler, but the major thing to notice is that with the Fair Scheduler, whenever a high-priority job arrives in the same queue, it is processed in parallel by taking over a portion of the already allocated slots.
Advantages:
 Resources assigned to each application depend on its priority.
 It can limit the number of concurrently running tasks in a particular pool or queue.
Disadvantages: Configuration is required.
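Which scheduler YARN uses is controlled by the yarn.resourcemanager.scheduler.class property, normally set in yarn-site.xml on the ResourceManager. As a minimal sketch (the property and the scheduler class names below are standard YARN values, but check them against your Hadoop version), the choice can be expressed on a Configuration object like this:

import org.apache.hadoop.conf.Configuration;

public class SchedulerChoiceSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Select the Fair Scheduler. The Capacity Scheduler would be
        // org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler,
        // and the FIFO scheduler would be
        // org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.
        conf.set("yarn.resourcemanager.scheduler.class",
                "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler");
        System.out.println(conf.get("yarn.resourcemanager.scheduler.class"));
    }
}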
Hadoop – Different Modes of Operation
As we all know, Hadoop is an open-source framework mainly used for storing, maintaining, and analyzing large amounts of data or datasets on clusters of commodity hardware; in other words, it is a data management tool. Hadoop also has a scale-out storage property, which means that we can scale the number of nodes up or down according to our future requirements, which is a really useful feature.

Hadoop mainly works in 3 different modes:


1. Standalone Mode
2. Pseudo-distributed Mode
3. Fully-Distributed Mode

1. Standalone Mode

In Standalone Mode, none of the daemons run, i.e. no Namenode, Datanode, Secondary Namenode, Job Tracker, or Task Tracker. (We use the Job Tracker and Task Tracker for processing in Hadoop 1; for Hadoop 2 we use the Resource Manager and Node Manager.) Standalone Mode also means that Hadoop is installed on only a single system. By default, Hadoop is set up to run in this Standalone Mode, also called Local Mode. We mainly use Hadoop in this mode for learning, testing, and debugging.
Hadoop runs fastest in this mode of the 3. Note that HDFS (Hadoop Distributed File System), one of the major components of Hadoop used for storage, is not utilized in this mode; the local file system is used instead, comparable to the file systems available for Windows, i.e. NTFS (New Technology File System) and FAT32 (the 32-bit File Allocation Table). When Hadoop works in this mode there is no need to configure the files hdfs-site.xml, mapred-site.xml, and core-site.xml for the Hadoop environment. In this mode, all of your processes run in a single JVM (Java Virtual Machine), and this mode can only be used for small development purposes.

2. Pseudo Distributed Mode (Single Node Cluster)

In Pseudo-Distributed Mode we also use only a single node, but the main thing is that the cluster is simulated, which means that all the processes inside the cluster run independently of each other. All the daemons, i.e. Namenode, Datanode, Secondary Namenode, Resource Manager, Node Manager, etc., run as separate processes in separate JVMs (Java Virtual Machines), or we can say as different Java processes; that is why it is called pseudo-distributed.
One thing we should remember is that, as we are using only a single-node setup, all the master and slave processes are handled by the single system. The Namenode and Resource Manager are used as masters, and the Datanode and Node Manager are used as slaves. The Secondary Namenode is also used as a master; its purpose is simply to keep periodic (by default hourly) checkpoints of the Namenode metadata. In this mode:
 Hadoop is used both for development and for debugging purposes.
 Our HDFS (Hadoop Distributed File System) is utilized for managing the input and output processes.
 We need to change the configuration files mapred-site.xml, core-site.xml, and hdfs-site.xml to set up the environment.

3. Fully Distributed Mode (Multi-Node Cluster)

This is the most important mode, in which multiple nodes are used: a few of them run the master daemons, i.e. the Namenode and Resource Manager, and the rest run the slave daemons, i.e. the DataNode and Node Manager. Here Hadoop runs on a cluster of machines or nodes, and the data is distributed across the different nodes. This is the production mode of Hadoop; let's clarify or understand this mode in a better way in physical terms.
In the other modes, once you download Hadoop as a tar or zip file, you install it on one system and run all the processes on that single system. Here, in fully distributed mode, we extract this tar or zip file on each of the nodes in the Hadoop cluster and then use a particular node for a particular process. Once you distribute the processes among the nodes, you define which nodes work as masters and which of them work as slaves.
Hadoop – Cluster, Properties and its Types

A cluster is a collection of something; a simple computer cluster is a group of computers connected to each other through a LAN (Local Area Network). The nodes in a cluster share data and work on the same tasks, and they are coordinated well enough to work as a single unit.
Similarly, a Hadoop cluster is a collection of commodity hardware (devices that are inexpensive and amply available). These hardware components work together as a single unit. In a Hadoop cluster there are lots of nodes (computers and servers), divided into masters and slaves: the Namenode and Resource Manager work as masters, and the Datanode and Node Manager work as slaves. The purpose of the master nodes is to guide the slave nodes in a single Hadoop cluster. We design Hadoop clusters for storing, analyzing, and understanding data, and for finding the facts hidden behind data or datasets that contain crucial information. The Hadoop cluster stores different types of data and processes them:
 Structured data: data which is well structured, as in MySQL.
 Semi-structured data: data which has structure but no fixed data type, like XML or JSON (JavaScript Object Notation).
 Unstructured data: data that doesn't have any structure, like audio or video.
(Figure: Hadoop cluster schema.)
Hadoop Cluster Properties

1. Scalability: Hadoop clusters are very capable of scaling the number of nodes, i.e. servers or commodity hardware, up and down. Let's see with an example what this scalability property actually means. Suppose an organization wants to analyze and maintain around 5 PB of data for the upcoming 2 months, so it uses 10 nodes (servers) in its Hadoop cluster to hold all of this data. If, during that period, the organization receives an extra 2 PB of data, it has to upgrade the number of servers in its Hadoop cluster from 10 to, say, 12 in order to accommodate it. This process of scaling the number of servers in the Hadoop cluster up or down is called scalability.
2. Flexibility: This is one of the important properties that a Hadoop cluster possesses. According to this property, a Hadoop cluster is very flexible: it can handle any type of data irrespective of its type and structure. With the help of this property, Hadoop can process any type of data from online web platforms.
3. Speed: Hadoop clusters work at very high speed, because the data is distributed across the cluster and because of the data-mapping capability, i.e. the MapReduce architecture, which works on the master-slave principle.
4. No data loss: There is no risk of losing data from any node in a Hadoop cluster, because Hadoop clusters can replicate the data to other nodes. So in case of failure of any node, no data is lost, as a backup of that data is kept.
5. Economical: Hadoop clusters are very cost-efficient, as they use distributed storage in their clusters, i.e. the data is distributed among all the nodes. So to increase storage we only need to add one more piece of commodity hardware, which is not very costly.

Types of Hadoop clusters

1. Single Node Hadoop Cluster


2. Multiple Node Hadoop Cluster

1. Single Node Hadoop Cluster: In a single node Hadoop cluster, as the name suggests, the cluster consists of only a single node, which means all our Hadoop daemons, i.e. Name Node, Data Node, Secondary Name Node, Resource Manager, and Node Manager, run on the same system or the same machine. It also means that all of our processes are handled by only a single JVM (Java Virtual Machine) process instance.
2. Multiple Node Hadoop Cluster: A multiple node Hadoop cluster, as the name suggests, contains multiple nodes. In this kind of cluster setup, our Hadoop daemons are spread across different nodes in the same cluster. In general, in a multiple node Hadoop cluster setup we try to use our higher-capacity nodes for the masters, i.e. the Name Node and Resource Manager, and the cheaper systems for the slave daemons, i.e. the Node Manager and Data Node.

Hadoop – File Permission and ACL (Access Control List)

A Hadoop cluster enforces security on many layers. The level of protection depends upon the organization's requirements. In this article, we are going to learn about Hadoop's first level of security. It contains mainly two components, both of which are part of the default installation:
1. File Permission
2. ACL(Access Control List)

1. File Permission

HDFS (the Hadoop Distributed File System) implements a POSIX (Portable Operating System Interface)-like file permission model. It is similar to the file permission model in Linux, where every file and directory has permissions for the Owner/user, the Group, and Others:

Owner/user   Group   Others
rwx          rwx     rwx

Similarly, the HDFS file system implements a set of permissions for Owner, Group, and Others. In Linux we use rwx for a specific user's permissions, where r is read, w is write or append, and x is execute. In HDFS, for a file, we have r for reading and w for writing and appending, but x (execute) has no meaning, because in HDFS all files are supposed to be data files and there is no concept of executing a file. Since we don't have an executable concept in HDFS, we also don't have setUID or setGID for HDFS.

Similarly, we can have permissions on a directory in HDFS, where r is used to list the contents of the directory, w is used for creating or deleting entries inside it, and x is used to access the children of the directory. Here also we don't have setUID or setGID for HDFS.
How Can You Change an HDFS File's Permissions?

-chmod, which stands for change mode, is the command used for changing the permissions of files in our HDFS. First, list the directories available in our HDFS and have a look at the permissions assigned to each directory. You can list the directories in your HDFS root with the command below.
hdfs dfs -ls /
Here, / represents the root directory of your HDFS.
Let me first list the files present in my Hadoop_File directory.
hdfs dfs -ls /Hadoop_File

In the listing above you can see that for file1.txt only the owner user has read and write permission. So I am adding write permission for the group and others as well.
Prerequisite: you have to be familiar with the use of the -chmod command in Linux, i.e. how to use its switches to set permissions for users. To add write permission for the group and others, use the command below.
hdfs dfs -chmod go+w /Hadoop_File/file1.txt
Here, go stands for group and others, w means write, and the + sign shows that we are adding write permission for group and others. Then list the file again to check whether it worked.
hdfs dfs -ls /Hadoop_File

And we are done. In the same way, you can change the permissions of any file or directory available in our HDFS (Hadoop Distributed File System), and for any user as your requirements demand. You can also change the group or owner of a directory with -chgrp and -chown respectively.

2. ACL(Access Control List)

An ACL provides a more flexible way to assign permissions on a file system. It is a list of access permissions for a file or a directory. We need ACLs when, for example, you have made a separate user for your Hadoop single-node cluster setup, or you have a multi-node cluster setup where various nodes are present, and you want to set permissions for other users.
If you want to set permissions for different named users, you cannot do it with the -chmod command. For example, on a single-node Hadoop cluster your main user might be root and you might have created a separate user for the Hadoop setup named, let's say, hadoop. Now if you want to grant the root user specific permissions on files that are present in your HDFS, you cannot do it with the -chmod command. This is where the ACL (Access Control List) comes into the picture: with an ACL you can set permissions for a specific named user or named group.
In order to enable ACLs in HDFS, you need to add the property below to the hdfs-site.xml file.

<property>

<name>dfs.namenode.acls.enabled</name>

<value>true</value>

</property>

Note: don't forget to restart all the daemons, otherwise the changes made to hdfs-site.xml will not take effect.
You can check the entries in your access control list (ACL) for a directory with the -getfacl command, as shown below.
hdfs dfs -getfacl /Hadoop_File
You can see that we have 3 different entries in our ACL. Suppose you want to change the permissions of your root user for an HDFS directory; you can do it with the command below.
Syntax:
hdfs dfs -setfacl -m user:user_name:r-x /Hadoop_File
You can change the permissions of any user by adding an entry for it to the ACL of that directory. Below are some examples of changing the permissions of different named users for an HDFS file or directory.
hdfs dfs -setfacl -m user:root:r-x /Hadoop_File
Another example, for the raj user:
hdfs dfs -setfacl -m user:raj:r-x /Hadoop_File
Here r-x denotes read and execute permission only on the HDFS directory for the root and raj users.
In my case, I don't have any other user, so I am changing the permissions of my only user, i.e. dikshant:
hdfs dfs -setfacl -m user:dikshant:rwx /Hadoop_File
Then list the ACL with the -getfacl command to see the changes.
hdfs dfs -getfacl /Hadoop_File

Here, you can see another entry in the ACL of this directory, user:dikshant:rwx, reflecting the new permissions of the dikshant user. Similarly, if you have multiple users, you can change their permissions for any HDFS directory. As another example, the permissions of the dikshant user can be changed back to r-x mode:
hdfs dfs -setfacl -m user:dikshant:r-x /Hadoop_File
Here, you can see that the dikshant user's permissions have changed from rwx to r-x.
