Hadoop Notes

Hadoop is a framework designed for processing and storing large amounts of data across multiple computers efficiently. It consists of key components like HDFS for storage, MapReduce for data processing, and YARN for resource management. The architecture follows a master-slave model, allowing for scalable and fault-tolerant data handling.

Uploaded by

gafafop879
Copyright
© © All Rights Reserved

Hadoop

What is Hadoop?
Hadoop is a big data framework that helps store and process huge amounts of data across multiple computers. It is open-source and built to handle data that is too large for a single computer to process efficiently.
Hadoop Components:
Imagine we have a huge library with millions of books (data). One day we need to count how many times the word "Hadoop" appears in all the books. Doing this alone would take forever. So, we divide the work among 100 friends. Each friend counts the word in a few books and, in the end, they combine the results to get the total count quickly and efficiently.
Hadoop works in a similar way for big data. It divides data and tasks among multiple computers, making processing faster and storage more reliable.
Hadoop has four main parts that work together like a team.

Hadoop Distributed File System (HDFS): The Storage System
HDFS is like a big bookshelf that stores a huge amount of data. But instead of keeping all the books (data) in one place, it splits them into small pieces and stores them in different sections (computers).
- Stores big data efficiently.
- Data is duplicated, so if one computer fails, another has a copy.
- Fast access, because data is spread across multiple computers.
MapReduce: The Processing System of Hadoop
Now that our books (data) are stored in different sections (computers), how do we count the word efficiently? Instead of one computer reading all the books, MapReduce sends the task to multiple computers and collects the results.
- Map phase: Each computer counts the word in the books assigned to it.
- Reduce phase: All the results are combined to get the final count.
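The library analogy can be sketched in a few lines of Python. This is only an illustration of the idea, not Hadoop code: the `shelves` data and the function names are made up for the example, and everything runs on one machine.

```python
def map_phase(books, word):
    """One 'friend' (computer) counts the word in the books assigned to it."""
    return sum(book.lower().split().count(word.lower()) for book in books)

def reduce_phase(partial_counts):
    """Combine the per-computer counts into the final total."""
    return sum(partial_counts)

# Hypothetical library, already split into one shelf per computer.
shelves = [
    ["Hadoop stores data", "big data with Hadoop"],
    ["no match here"],
    ["hadoop hadoop hadoop"],
]

partials = [map_phase(shelf, "Hadoop") for shelf in shelves]  # one count per shelf
total = reduce_phase(partials)
print(partials, total)
```

In a real cluster each call to `map_phase` would run on a different machine, next to the data it reads.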
Yet Another Resource Negotiator (YARN):
Imagine you are the manager of a team. You assign work, see which resources (tables, books, pens) are available, and check progress.
YARN does the same for Hadoop. It decides:
- which computer should work on which task,
- how much memory and CPU each task needs,
- which task should run first.

Hadoop Common:
This is the toolbox that helps all the other parts of Hadoop work smoothly. It provides the necessary files and utilities needed for Hadoop to work.
Other tools in the Hadoop ecosystem:
Additional tools make working with Hadoop even easier.
- Hive: Converts SQL-like queries into MapReduce jobs.
- Pig: A scripting language for easier data transformation.
- HBase: A NoSQL database for real-time big data access.
- Spark: A faster processing engine that works with Hadoop.
In short:
1. HDFS (Hadoop Distributed File System): Stores large files across multiple computers.
2. MapReduce: A processing model that breaks tasks into smaller parts and runs them in parallel.
3. YARN (Yet Another Resource Negotiator): Manages resources and schedules tasks.
4. Hive, Pig, Spark: Tools that make it easier to process data in Hadoop.
Hadoop Architecture:
[Diagram: a master node (Resource Manager, NameNode) connected to several slave nodes (DataNode)]
Hadoop follows a master-slave architecture designed to handle large-scale data storage and processing efficiently. It consists of three primary layers:
1) HDFS (Hadoop Distributed File System) - Storage layer
2) MapReduce - Processing layer
3) YARN (Yet Another Resource Negotiator) - Resource management layer

HDFS - Storage Layer:
HDFS is responsible for storing massive datasets across multiple machines. It splits large files into smaller chunks (called blocks) and stores them across multiple machines to improve fault tolerance and speed.
Components of HDFS:
1. NameNode (Master) - The File Manager
- The brain of HDFS, responsible for keeping track of where data is stored.
- Stores metadata (file names, block locations, replication info, etc.).
- Does not store actual data, only the information about where files are stored.
- If a DataNode fails, the NameNode reroutes the request to another replica.
2. DataNodes (Slaves) - The Workers
- Store actual data in the form of blocks.
- Regularly send heartbeats to the NameNode to report health status.
- If a DataNode fails, Hadoop automatically copies its data from another DataNode.
3. Block Replication
- Each block is replicated multiple times across different DataNodes to ensure fault tolerance.
- The default replication factor is 3 (each block is stored on three different machines).
MapReduce - Data Processing Layer:
MapReduce is the processing framework of Hadoop that allows parallel data processing across multiple nodes.
Two main phases of MapReduce:
1) Map Phase
- Splits a big task into smaller sub-tasks and assigns them to different nodes.
- Each node processes its assigned data independently.
2) Reduce Phase
- Collects and combines the results from all nodes into the final output.
YARN (Yet Another Resource Negotiator) - Resource Management Layer:
YARN manages and allocates resources like CPU and memory across different tasks. It acts as a traffic controller, ensuring that computing power is distributed efficiently.
Components of YARN:
1) Resource Manager (Master) - The Decision Maker
- Assigns CPU, memory and tasks to different nodes.
- Keeps track of which tasks are running on which nodes.
- Ensures efficient resource utilization.
2) Node Manager (Slave) - The Executor
- Runs the actual processing tasks assigned by the Resource Manager.
- Monitors CPU and memory usage of the task.
- Reports back to the Resource Manager.
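The division of labour between the Resource Manager and the Node Managers can be pictured with a toy first-fit allocator. This is a simplified sketch, not YARN's actual scheduling logic (real YARN schedulers are considerably more elaborate). The `Node` class, the node sizes and the task list are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A worker machine with some free memory and CPU cores."""
    name: str
    free_mem_gb: int
    free_cpus: int
    tasks: list = field(default_factory=list)

def assign(tasks, nodes):
    """Toy Resource Manager: put each task on the first node that has room."""
    placement = {}
    for task_name, mem_gb, cpus in tasks:
        for node in nodes:
            if node.free_mem_gb >= mem_gb and node.free_cpus >= cpus:
                node.free_mem_gb -= mem_gb
                node.free_cpus -= cpus
                node.tasks.append(task_name)
                placement[task_name] = node.name
                break
        else:
            placement[task_name] = None  # no node has room, so the task waits
    return placement

nodes = [Node("node1", free_mem_gb=8, free_cpus=4), Node("node2", free_mem_gb=4, free_cpus=2)]
tasks = [("map-1", 4, 2), ("map-2", 4, 2), ("map-3", 4, 2), ("reduce-1", 2, 1)]
placement = assign(tasks, nodes)
print(placement)
```

Once both nodes are full, `reduce-1` stays unplaced until resources are released, which is the kind of bookkeeping the Resource Manager does for the whole cluster.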
How Hadoop works: step-by-step execution
Let's say we have a 10 GB file to be processed. Here's how it is processed:
- The file is divided into blocks (e.g. 128 MB each) and stored across multiple DataNodes in HDFS.
- The NameNode keeps track of the block locations.
- MapReduce assigns tasks to the DataNodes to process the data in parallel.
- YARN ensures each node gets enough CPU and memory.
- Once processing is complete, the result is stored in HDFS for users to access.
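The arithmetic behind the first step is easy to check. The sketch below assumes the 128 MB block size from the example and the default replication factor of 3 mentioned earlier in these notes. The function name is made up.

```python
import math

def block_count(file_size_mb, block_size_mb=128):
    """Number of HDFS blocks needed; the last block may be only partly full."""
    return math.ceil(file_size_mb / block_size_mb)

file_size_mb = 10 * 1024            # the 10 GB file from the example
blocks = block_count(file_size_mb)
raw_storage_mb = file_size_mb * 3   # each block is stored 3 times
print(blocks, raw_storage_mb)
```

So the 10 GB file becomes 80 blocks, and with three replicas it occupies about 30 GB of raw disk across the cluster.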
Hadoop Clusters:
A cluster is a set of interconnected computers that work together as a single unit. Similarly, a Hadoop cluster consists of multiple commodity hardware machines (low-cost and widely available devices) working together.
A Hadoop cluster has two kinds of nodes:
- Master nodes (NameNode & Resource Manager): manage and control the system.
- Slave nodes (DataNode & Node Manager): store and process data.
1. Single Node Hadoop Cluster: In a Single Node cluster, as the name suggests, the cluster has only a single node. This means all our Hadoop daemons, i.e. NameNode, DataNode, Secondary NameNode, Resource Manager and Node Manager, run on the same machine. It also means that all of our processes are handled on that one machine. In Hadoop's local (standalone) mode they run in a single JVM (Java Virtual Machine) process instance.
2. Multi Node Hadoop Cluster: A Multi Node cluster, as the name suggests, contains multiple nodes. In this kind of cluster setup, our Hadoop daemons are spread over different nodes in the same cluster. In a multi-node Hadoop cluster setup, we try to use the high-processing nodes for the master daemons (NameNode and Resource Manager) and the cheaper systems for the slave daemons (Node Manager and DataNode).
Suppose we have 100 GB of data. Processing 100 GB of data in Hadoop and Spark involves different approaches due to the underlying architectures and processing models of these frameworks.

Processing 100 GB of Data in Hadoop:
Hadoop is a distributed storage and processing framework that uses the MapReduce programming model. It is designed for batch processing and is optimized for handling large-scale data across a distributed cluster.
Processing in Hadoop:

1) Step 1: Set up the Hadoop cluster
- Install and configure Hadoop (HDFS & YARN) on a cluster of machines.
- Ensure that the cluster has sufficient storage and resources to handle 100 GB of data.
2) Step 2: Store data in HDFS
- Upload the 100 GB dataset to the Hadoop Distributed File System (HDFS).
- Use the hdfs dfs -put command to upload the data.
3) Step 3: Write the MapReduce job
- Develop a MapReduce program in Java or Python (using Hadoop Streaming).
- A MapReduce job consists of two main functions:
  Map function: processes input data and emits key-value pairs.
  Reduce function: aggregates the key-value pairs to produce the final output.
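A minimal word-count sketch of the two functions in Python is shown below. It is written as ordinary functions so it can run on one machine. In an actual Hadoop Streaming job the mapper and reducer would normally be separate scripts that read lines from standard input and write tab-separated key-value lines to standard output. The sample `lines` are made up.

```python
from itertools import groupby

def mapper(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce: sum the values for each key. Pairs must arrive sorted by key."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

lines = ["Hadoop stores data", "Hadoop processes data"]
mapped = sorted(mapper(lines))   # sorting stands in for Hadoop's shuffle and sort
counts = dict(reducer(mapped))
print(counts)
```

The reducer relies on its input being grouped by key, which is exactly what the framework guarantees between the two phases.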
4) Step 4: Submit the job to the cluster
- Package the MapReduce program into a JAR file (for Java) or prepare the scripts (for Hadoop Streaming).
- Submit the job to the cluster using the hadoop jar command.
    hadoop jar <jar-file> <main-class> /hdfs/path/to/input
5) Step 5: Monitor the job
- Use the Hadoop Resource Manager UI to monitor the progress of the job.
- Check logs for errors and issues.
6) Step 6: Retrieve the output
- Once the job completes, the output will be stored in HDFS.
- Use the hdfs dfs -get command to download the output to the local file system.
    hdfs dfs -get /hdfs/path/to/output /local/path/to/destination
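If these steps are scripted, the three commands can be assembled as argument lists and handed to `subprocess`. The sketch below only builds and prints the commands, so it runs without a cluster. The file names, class name and paths are placeholders, not values from these notes.

```python
import shlex

def hdfs_put(local_path, hdfs_path):
    """Step 2: upload a local file to HDFS."""
    return ["hdfs", "dfs", "-put", local_path, hdfs_path]

def hadoop_jar(jar_file, main_class, *args):
    """Step 4: submit a packaged MapReduce job."""
    return ["hadoop", "jar", jar_file, main_class, *args]

def hdfs_get(hdfs_path, local_path):
    """Step 6: download the job output from HDFS."""
    return ["hdfs", "dfs", "-get", hdfs_path, local_path]

# Placeholder names, for illustration only.
steps = [
    hdfs_put("data.csv", "/input"),
    hadoop_jar("wordcount.jar", "WordCount", "/input", "/output"),
    hdfs_get("/output", "./output"),
]
for cmd in steps:
    # On a real cluster: subprocess.run(cmd, check=True)
    print(shlex.join(cmd))
```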
Processing 100 GB of Data in Spark:
Spark is an in-memory distributed computing framework that provides faster processing compared to Hadoop MapReduce. It supports batch processing, real-time streaming and ML (machine learning).
Processing in Spark:
1) Step 1: Set up the Spark cluster
- Install and configure Apache Spark on a cluster (standalone, YARN or Mesos).
- Ensure the cluster has sufficient memory and CPU resources to handle 100 GB.
2) Step 2: Load the data
- Spark can read data from HDFS, the local file system or other storage systems.
- Use the spark.read API to load the 100 GB dataset into an RDD or DataFrame.
    data = spark.read.format("csv").option("header", "true").load("/hdfs/path/to/data")
3) Step 3: Perform transformations and actions
- Use Spark's high-level APIs (RDD, DataFrame or Dataset) to process the data.
- Apply transformations (e.g. map, filter, groupBy) and actions (e.g. count, reduce, save).
    result = data.filter(data["column"] > 100).groupBy("category").count()
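To see what the filter, groupBy and count chain computes, here is the same logic on a small in-memory list in plain Python. This is not Spark code and does not run in parallel. The `rows` are invented sample data using the column names from the line above.

```python
from collections import Counter

# Invented sample rows with the columns used in the Spark example.
rows = [
    {"category": "a", "column": 150},
    {"category": "a", "column": 50},
    {"category": "b", "column": 300},
    {"category": "a", "column": 101},
]

# filter: keep rows where column > 100
filtered = [row for row in rows if row["column"] > 100]

# groupBy("category").count(): number of remaining rows per category
result = Counter(row["category"] for row in filtered)
print(dict(result))
```

Spark performs the same steps, but on partitions of the dataset spread across the cluster.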

4) Step 4: Optimize performance
- Use caching/persistence to store intermediate results in memory.
    data.cache()
- Adjust the number of partitions to optimize parallelism.
    data = data.repartition(200)
5) Step 5: Execute the job
- Submit the Spark job to the cluster using spark-submit.
    spark-submit --master yarn --deploy-mode cluster --num-executors 10 --executor-memory 3G /path/to/spark-script
6) Step 6: Monitor the job
- Use the Spark UI to monitor the progress of the job.
- Check logs for any performance bottlenecks.
7) Step 7: Save the output
- Save the processed data to HDFS, local storage or another destination.
    result.write.format("csv").save("/hdfs/path/to/output")


Q. What is Hadoop? What are its main components?
Hadoop is an open-source framework for distributed storage and processing of large datasets.
Its components are:
a) HDFS: for distributed storage
b) MapReduce: for distributed processing
c) YARN: for resource management
Q. What is the difference between NameNode and DataNode?
NameNode: It manages the file system namespace, metadata and access control. It does not store actual data.
DataNode: It stores the actual data blocks and serves read/write requests from clients.
Q. What is a block in HDFS? What is the default block size?
A block is the smallest unit of data storage in HDFS. The default block size is 128 MB (configurable).

Q. How does Hadoop ensure fault tolerance?
- Data replication: HDFS replicates each block (default replication factor is 3) across multiple DataNodes.
- NameNode High Availability: ensures the NameNode does not become a single point of failure.
- Task retries: MapReduce retries failed tasks on other nodes.
Q. What is fault tolerance?
Fault tolerance refers to the ability of a system to continue functioning properly even when one or more of its components fail. In the context of Hadoop, fault tolerance ensures that the system can handle hardware failures, network issues, etc. without losing data or interrupting ongoing processes.
Hadoop is designed to handle large-scale data processing across distributed systems, where failures are expected rather than exceptional. Fault tolerance mechanisms ensure data reliability and system availability even in the face of such failures.
Q. What is data replication?
Data replication refers to the process of storing multiple copies of the same data across different locations or nodes in a distributed system. The primary goal of data replication is to ensure reliability, availability and fault tolerance.
- If one copy of the data becomes unavailable due to hardware failure, the system can still access the data from another copy.
- The replication factor determines how many copies of each block are stored across the cluster.
- The default replication factor in Hadoop is 3, which means each block is stored on three different DataNodes.
Q. Explain data replication in HDFS with an example.
Suppose we have a 300 MB file to store in HDFS with the default block size of 128 MB and a replication factor of 3.
So, the file is split into 3 blocks:
Block A: 128 MB
Block B: 128 MB
Block C: 44 MB
Each block is replicated 3 times and stored on different DataNodes:
Block A: stored on DataNodes 1, 2, 3
Block B: stored on DataNodes 4, 5, 6
Block C: stored on DataNodes 7, 8, 9
If DataNode 1 fails, Block A can still be accessed from DataNodes 2 and 3.
If an entire rack fails, replicas on other racks ensure data availability.
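The numbers in this example can be reproduced with a short Python sketch. The block placement is copied from the example above. The function names are made up, and real HDFS chooses replica locations itself (taking racks into account) rather than from a fixed table.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Sizes of the blocks a file is cut into; only the last one may be smaller."""
    sizes = []
    remaining = file_size_mb
    while remaining > 0:
        sizes.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return sizes

sizes = split_into_blocks(300)

# Replica placement taken from the example: block -> DataNodes holding a copy.
placement = {"A": {1, 2, 3}, "B": {4, 5, 6}, "C": {7, 8, 9}}

def readable_from(block, failed_nodes):
    """DataNodes that can still serve the block after the given nodes fail."""
    return sorted(placement[block] - set(failed_nodes))

print(sizes)
print(readable_from("A", failed_nodes=[1]))
```

A block only becomes unreadable when every DataNode holding one of its replicas has failed.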
Hadoop 2.x introduced YARN for resource management, enabling support for multiple processing frameworks. Hadoop 1.x only supports MapReduce.

Hadoop 1.x has no official support for Microsoft Windows, but Hadoop 2.x has official support for running Hadoop in Windows environments.

Hadoop runs on Linux-based OS (Ubuntu). We can use it on Windows using:
- a VM (Virtual Machine) running Linux
- a Docker container with a Linux base image

Q. What is a heartbeat in HDFS?
Ans: DataNodes send heartbeats to the NameNode every 3 seconds (configurable). If heartbeats go missing, the NameNode marks the node as dead and re-replicates its blocks from other nodes.

Q. Explain the process of MapReduce.
- Input splits: the data is divided into splits (e.g. 128 MB).
- Map phase: processes each split into <key, value> pairs.
- Shuffle & Sort: groups the intermediate data by key.
- Reduce phase: aggregates the values for each key; the output is saved to HDFS.
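The four stages can be traced on a tiny dataset in Python. This is a single-machine sketch of the data flow, not Hadoop code. The `id,amount` records and the function names are invented for the illustration.

```python
from collections import defaultdict

def map_phase(split):
    """Turn each 'id,amount' record into an (id, amount) pair."""
    for record in split:
        key, amount = record.split(",")
        yield key, float(amount)

def shuffle_and_sort(pairs):
    """Group the intermediate values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(grouped):
    """Aggregate the values of each key (here: sum the amounts)."""
    return {key: sum(values) for key, values in grouped}

# Two input splits, as if the file had been cut into two pieces.
splits = [["1,10.0", "2,5.0"], ["1,2.5", "3,4.0"]]
pairs = [pair for split in splits for pair in map_phase(split)]
totals = reduce_phase(shuffle_and_sort(pairs))
print(totals)
```

Key `1` appears in both splits, and it is the shuffle step that brings its two values together before the reduce.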
Causes of DataNode failure:
Disk crash, memory failure, etc. If a DataNode does not send heartbeats within a defined interval, it is marked as dead.
How HDFS handles it:
- If a DataNode fails, HDFS serves the data from another replica.
- If a DataNode does not send heartbeats to the NameNode within a defined interval, it is marked as dead.
- After that, the NameNode starts re-replicating the blocks stored on that node to maintain the desired replication factor.
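The dead-node check can be sketched as a small Python class. The 3-second heartbeat interval comes from these notes. The 30-second timeout, the class name and the node names are assumptions chosen to keep the demo short, not HDFS's actual settings.

```python
HEARTBEAT_INTERVAL_S = 3   # heartbeat period stated in the notes
DEAD_AFTER_S = 30          # hypothetical timeout, chosen only for this demo

class NameNodeMonitor:
    """Remembers when each DataNode last sent a heartbeat."""

    def __init__(self, dead_after_s=DEAD_AFTER_S):
        self.dead_after_s = dead_after_s
        self.last_seen = {}

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def dead_nodes(self, now):
        """Nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.dead_after_s
        )

monitor = NameNodeMonitor()
for t in range(0, 61, HEARTBEAT_INTERVAL_S):
    monitor.heartbeat("dn1", t)      # dn1 stays healthy for the whole minute
    if t <= 12:
        monitor.heartbeat("dn2", t)  # dn2 stops sending after t = 12 s

dead = monitor.dead_nodes(now=60)
print(dead)
```

Once a node appears in `dead_nodes`, the NameNode would schedule new copies of its blocks on other DataNodes.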
Handling DataNode failure and dead DataNodes:
1) Check DataNode status.
    hdfs dfsadmin -report
2) Restart the DataNode.
    sudo service hadoop-hdfs-datanode restart
3) Remove (decommission) dead DataNodes: list <datanode-hostname> in the cluster's exclude file, then run
    hdfs dfsadmin -refreshNodes
4) List under-replicated data and force block replication.
    hdfs fsck <path> -files -blocks -locations
    hdfs dfs -setrep -w 3 <path-to-file>
5) Check logs for the root cause.
    tail -f /var/log/hadoop-hdfs/hadoop-hdfs-datanode.log
Q. What is data skewness?
Data skewness in Hadoop refers to the uneven distribution of data across partitions, reducers or nodes in a cluster, leading to performance degradation and inefficient resource utilization. It commonly occurs in MapReduce, Hive, Spark and other distributed processing frameworks.
Q. How to handle data skew?
a) Salting: introduce randomness to distribute data more evenly.
    SELECT CONCAT(id, '_', RAND()) AS salted_id, value FROM table;
b) Skew join optimization (Hive): use MAPJOIN (broadcast join) if one table is small.
    SET hive.auto.convert.join = true;
    SET hive.optimize.skewjoin = true;
    SET mapreduce.job.reduces = <number>;    (increase the number of reducers)
c) Use bucketing instead of hashing: bucket the data in Hive based on the skewed columns.
    CREATE TABLE sales_bucket (id INT, amount FLOAT)
    CLUSTERED BY (id) INTO 10 BUCKETS;
d) Dynamic partitioning: instead of relying on pre-defined partitions, enable dynamic partitioning.
    SET hive.exec.dynamic.partition = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;

Hive vs HBase:
Hive: SQL-like language (HQL) for batch processing. Sits on HDFS. High latency.
HBase: NoSQL columnar DB for real-time access. Sits on HDFS. Low latency.
