HDFS - Rack Awareness

Rack awareness in Hadoop places replicas of data blocks across different racks to improve data reliability, availability, and performance. The NameNode maintains the rack IDs of each data node to choose nearby nodes on the same or different racks for read/write requests. The replica placement policy aims to store no more than one replica on a node and no more than two replicas on the same rack to reduce network traffic while ensuring fault tolerance if an entire rack fails.


Rack Awareness in Hadoop is the concept of choosing closer DataNodes based on rack information. By default, a Hadoop installation assumes that all the nodes belong to the same rack.

Its aim is to reduce network traffic while reading and writing HDFS files in large Hadoop clusters.

The NameNode chooses DataNodes that are on the same rack or a nearby rack to serve read/write requests from the client node. The HDFS NameNode obtains this rack information by maintaining the rack ID of each DataNode.
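The rack IDs have to come from somewhere. As a minimal sketch, Hadoop can be pointed (through the net.topology.script.file.name property) at an executable that receives host addresses as arguments and prints one rack path per address. The addresses and rack names in RACK_BY_HOST below are hypothetical; unknown hosts fall back to /default-rack, which mirrors the single-rack default described above.

```python
#!/usr/bin/env python3
"""Illustrative rack-mapping script; the host table below is hypothetical."""
import sys

DEFAULT_RACK = "/default-rack"

# Hypothetical mapping from DataNode address to rack ID.
RACK_BY_HOST = {
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.21": "/dc1/rack2",
}


def resolve_rack(host):
    """Return the rack ID for a host, or the default rack if it is unknown."""
    return RACK_BY_HOST.get(host, DEFAULT_RACK)


def main(argv):
    # Print one rack path per address passed on the command line.
    print(" ".join(resolve_rack(host) for host in argv))


if __name__ == "__main__":
    main(sys.argv[1:])
```

Run with two addresses, one known and one unknown, the script would print the mapped rack followed by /default-rack.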
Why Rack Awareness?

The main purpose of Rack Awareness is to:

•Improve data reliability and data availability.

•Improve cluster performance.

•Prevent data loss if an entire rack fails.

•Improve network bandwidth utilization.

•Keep the bulk flow in-rack when possible.



Hadoop keeps multiple copies of all data that is present in HDFS. If Hadoop is aware of the rack topology, copies of the data can be kept in different racks. That way, if an entire rack suffers a failure for some reason, the data can be retrieved from a different rack.

Replication of data blocks across multiple racks in HDFS via rack awareness is done using a policy called the Replica Placement Policy.

The policy states that “No more than one replica is placed on one node. And no more than 2 replicas are placed on the same rack.”
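The two limits in the policy can be checked mechanically. The following is a small sketch, not Hadoop code: it represents each replica as a (rack, node) pair, and the rack and node names are made up for illustration.

```python
from collections import Counter


def satisfies_placement_policy(replicas):
    """Check a list of (rack, node) pairs against the two limits quoted above."""
    per_node = Counter(replicas)
    per_rack = Counter(rack for rack, _node in replicas)
    one_per_node = all(count <= 1 for count in per_node.values())
    two_per_rack = all(count <= 2 for count in per_rack.values())
    return one_per_node and two_per_rack


# Hypothetical placements for a block with three replicas.
ok = [("rack1", "node1"), ("rack2", "node4"), ("rack2", "node5")]
same_node_twice = [("rack1", "node1"), ("rack1", "node1"), ("rack2", "node4")]
three_on_one_rack = [("rack1", "node1"), ("rack1", "node2"), ("rack1", "node3")]

print(satisfies_placement_policy(ok))                 # True
print(satisfies_placement_policy(same_node_twice))    # False
print(satisfies_placement_policy(three_on_one_rack))  # False
```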
Replica placement via Rack Awareness in Hadoop

The main purpose of the replica placement policy under rack awareness is to improve data reliability, availability, and network bandwidth utilization.

A simple policy is to place replicas on unique racks. This prevents loss of data when an entire rack fails and allows the use of bandwidth from multiple racks when reading a file.

On multiple-rack clusters, block replication follows the policy below:

You should not place more than one replica on one node. You should also not place more than two replicas on the same rack. This has the constraint that the number of racks used for block replication should always be less than the total number of block replicas.
For example:

When the Hadoop framework creates a new block, it places the first replica on the local node, the second on a node in a different rack, and the third on a different node in the same rack as the second.

When re-replicating a block, if the number of existing replicas is one, place the second on a different rack.

When the number of existing replicas is two and both replicas are in the same rack, place the third one on a different rack.
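The placement steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Hadoop's implementation: it assumes the third replica goes to a different node in the same remote rack as the second (as in Hadoop's documented default policy), it picks the first eligible rack and node instead of choosing randomly, and it assumes the remote rack has at least two nodes. The function names and the cluster layout are hypothetical.

```python
def place_new_block(writer, nodes_by_rack):
    """Pick three (rack, node) targets for a newly created block.

    writer is the (rack, node) pair of the node creating the block.
    nodes_by_rack maps each rack ID to a list of its node names.
    """
    local_rack, local_node = writer
    first = (local_rack, local_node)

    # Second replica: a node on a rack other than the writer's.
    remote_rack = next(r for r in sorted(nodes_by_rack) if r != local_rack)
    remote_nodes = nodes_by_rack[remote_rack]
    second = (remote_rack, remote_nodes[0])

    # Third replica: a different node on the same remote rack (assumption).
    third = (remote_rack, remote_nodes[1])
    return [first, second, third]


def choose_rereplication_rack(existing_racks, all_racks):
    """Return a rack for the next replica, or None if no rule above applies."""
    if len(existing_racks) == 1:
        # One replica exists: the second goes on a different rack.
        return next(r for r in all_racks if r != existing_racks[0])
    if len(existing_racks) == 2 and existing_racks[0] == existing_racks[1]:
        # Both replicas share a rack: the third goes on a different rack.
        return next(r for r in all_racks if r != existing_racks[0])
    return None


# Hypothetical two-rack cluster.
cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_new_block(("rack1", "n1"), cluster))
# [('rack1', 'n1'), ('rack2', 'n3'), ('rack2', 'n4')]
print(choose_rereplication_rack(["rack2", "rack2"], ["rack1", "rack2"]))  # rack1
```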
How does Hadoop decide where to store the replicas of created blocks?
What is a rack?

A rack is nothing but a collection of 30-40 DataNodes or machines in a Hadoop cluster located in a single data center or location. The DataNodes in a rack are connected to the NameNode through a traditional network design via a network switch. A large Hadoop cluster will have multiple racks.
What is rack awareness in Hadoop HDFS?

The process of making Hadoop aware of which machine is part of which rack, and how these racks are connected to each other within the Hadoop cluster, is what defines rack awareness. In a Hadoop cluster, the NameNode keeps the rack IDs of all the DataNodes. The NameNode chooses the closest DataNode while storing the data blocks, using the rack information. In simple terms, having the knowledge of how different DataNodes are distributed across the racks, or knowing the cluster topology in the Hadoop cluster, is called rack awareness in Hadoop. Rack awareness is important as it ensures data reliability and helps to recover data in case of a rack failure.
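One way to make "closest" concrete is a tree distance over topology paths: count the steps from each node up to their nearest common ancestor. The sketch below assumes paths of the form /datacenter/rack/node; that layout, the names, and both helper functions are illustrative assumptions and not taken from the slides.

```python
def network_distance(path_a, path_b):
    """Count hops from each node up to their nearest common ancestor."""
    a = path_a.strip("/").split("/")
    b = path_b.strip("/").split("/")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)


def closest_node(client, candidates):
    """Return the candidate with the smallest distance to the client."""
    return min(candidates, key=lambda path: network_distance(client, path))


client = "/dc1/rack1/node1"
print(network_distance(client, "/dc1/rack1/node1"))  # 0: same node
print(network_distance(client, "/dc1/rack1/node2"))  # 2: same rack
print(network_distance(client, "/dc1/rack2/node3"))  # 4: different rack
print(network_distance(client, "/dc2/rack9/node7"))  # 6: different data center
```

With this measure, a replica on the client's own rack always ranks ahead of one on another rack.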
Rack Awareness Example
The default replication factor is 3, and it can also be configured.

At the time of the creation of a new block: The first replica is stored on the closest local node. The second is stored on a different rack altogether. The third replica is stored on the same rack as the second, but on a different node.

At the time of re-replicating a block: If the number of existing replicas is one, the second replica is stored on a different rack. If the number of existing replicas is two and both are on the same rack, the third replica is stored on a different rack.

A simple way of storing data block replicas is to place each one on a separate rack; however, this could increase the latency of read/write operations.

So the replication policy is designed to reduce the network bandwidth used when reading the data, as the replicas are placed on only 2 unique racks, while at the same time ensuring fault tolerance.
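The cost of spreading replicas over more racks can be illustrated by counting how many transfers cross a rack boundary when a block is forwarded from replica to replica. This is a rough sketch that assumes replicas are written in a node-to-node pipeline in the order listed; the rack names and the helper function are hypothetical.

```python
def inter_rack_hops(replica_racks):
    """Count transfers that cross a rack boundary along a replica pipeline."""
    return sum(1 for a, b in zip(replica_racks, replica_racks[1:]) if a != b)


# Racks of the three replicas, in pipeline order.
one_rack_per_replica = ["rack1", "rack2", "rack3"]
two_unique_racks = ["rack1", "rack2", "rack2"]

print(inter_rack_hops(one_rack_per_replica))  # 2
print(inter_rack_hops(two_unique_racks))      # 1
```

Both layouts keep a copy available if any single rack fails, but the two-rack layout sends the block across the inter-rack network only once.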
Advantages of implementing Rack Awareness in Hadoop
•Rack awareness in Hadoop helps optimize replica placement, thus ensuring high reliability and fault tolerance.
•Rack awareness ensures that read/write requests are directed to replicas on the same rack or the closest rack. This maximizes the reading speed and minimizes the writing cost.
•Rack awareness makes the best use of network bandwidth by keeping block transfers within the rack. Data access is arranged to minimize network travel and so reduce network overhead.
•Rack awareness allows tasks to be assigned to nodes closer to the data in the network topology.
•MapReduce jobs can also benefit from rack awareness. By knowing where the data required by a map task is located, the framework can run the map task on that particular machine itself, thereby saving a lot of bandwidth and time.
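The locality preference described in the last two points can be sketched as a ranking: a node that holds the data is best, a node on the same rack as the data is next, and any other node is the fallback. The function names, the three numeric levels, and the cluster layout below are assumptions made for illustration, not a scheduler API.

```python
def locality_level(candidate, replica_locations):
    """Rank a (rack, node) candidate relative to where a block's replicas live."""
    if candidate in replica_locations:
        return 0  # node-local: the data is on this machine
    if candidate[0] in {rack for rack, _node in replica_locations}:
        return 1  # rack-local: the data is on another node in the same rack
    return 2  # off-rack: the data must cross racks


def pick_node_for_map_task(free_nodes, replica_locations):
    """Choose the free node with the best (lowest) locality level."""
    return min(free_nodes, key=lambda node: locality_level(node, replica_locations))


# Hypothetical replica locations for one input block.
replicas = [("rack1", "n1"), ("rack2", "n4"), ("rack2", "n5")]

print(pick_node_for_map_task([("rack3", "n7"), ("rack2", "n6"), ("rack1", "n1")], replicas))
# ('rack1', 'n1')
print(pick_node_for_map_task([("rack3", "n7"), ("rack2", "n6")], replicas))
# ('rack2', 'n6')
```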
Hadoop Arch – Rack Awareness Algorithm
Advantages of Rack Awareness in Hadoop

Let's now discuss some advantages of Rack Awareness in Hadoop HDFS:
•Provides higher bandwidth and lower latency – This policy maximizes network bandwidth by transferring blocks within a rack rather than between racks. YARN is able to optimize MapReduce job performance by assigning tasks to nodes that are closer to their data in terms of network topology.
•Minimizes the writing cost and maximizes read speed – The rack awareness policy directs read/write requests to replicas that are in the same rack. This minimizes writing cost and maximizes reading speed.

•Provides data protection against rack failure – The NameNode assigns the 2nd and 3rd replicas of a block to nodes in a different rack from the first replica. Thus, it provides data protection even against rack failure. However, this is possible only if Hadoop was configured with knowledge of its rack configuration.
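The effect of a rack failure on a block can be checked with a small sketch. It compares a rack-aware placement with one where every node is reported on a single default rack; the rack and node names are hypothetical.

```python
def surviving_replicas(replicas, failed_rack):
    """Return the (rack, node) replicas that are not on the failed rack."""
    return [(rack, node) for rack, node in replicas if rack != failed_rack]


# First replica on rack1; 2nd and 3rd replicas on a different rack.
rack_aware = [("rack1", "n1"), ("rack2", "n4"), ("rack2", "n5")]
# No rack configuration: every node appears to be on the same rack.
single_rack = [("/default-rack", "n1"), ("/default-rack", "n2"), ("/default-rack", "n3")]

print(len(surviving_replicas(rack_aware, "rack1")))           # 2
print(len(surviving_replicas(rack_aware, "rack2")))           # 1
print(len(surviving_replicas(single_rack, "/default-rack")))  # 0
```

At least one copy survives any single rack failure in the rack-aware layout, while the single-rack layout loses all three.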
Rack Awareness in Hadoop is the concept of choosing a nearby DataNode (closest to the client that raised the read/write request), thereby reducing network traffic. Hadoop supports the configuration of rack awareness to ensure the placement of one replica of the data block on a different rack. The performance aspect of rack awareness is that, although copies of the data are spread across racks, they span no more than two racks, which keeps bandwidth utilization and latency low. This makes write operations faster while at the same time providing fault tolerance. It also provides data availability if there is a partition within the cluster or in the event of a network switch failure.
