Q. Explain three Vs of Big Data.
- Hadoop is an open-source big data storage and processing software framework. Hadoop stores and processes big data in a distributed fashion on large clusters of commodity hardware, providing massive data storage and fast processing.
- As shown in Fig. 2.1.1, a Hadoop cluster is a set of commodity machines networked together in one location, i.e. the cloud. These machines are then used for data storage and processing. Individual client users can submit their jobs to the cluster. These clients may be present at locations remote from the Hadoop cluster.
- Hadoop runs applications on systems with thousands of nodes involving huge storage capabilities. As Hadoop uses a distributed file system, data transfer rates among nodes are very fast.
- As there are thousands of machines in a cluster, users get uninterrupted service, and node failure is not a big issue in Hadoop even if a large number of nodes become inoperative.
- Hadoop uses distributed storage and transfers code to data. This code is tiny and also consumes less memory. The code executes with the data there itself. Thus the time to fetch data and return results is saved, as the data is locally available, and interprocess communication time is saved, which makes processing faster.
- The redundancy of data is an important feature of Hadoop, due to which node failures are easily handled.
2.1.1 Hadoop - Features

1. Low cost
Hadoop is an open-source framework and is free. It uses commodity hardware to store and process huge data. Hence it is not very costly.

2. High computing power
Hadoop uses a distributed computing model. Due to this, a task can be distributed amongst different nodes and processed quickly. Clusters have thousands of nodes, which gives high computing capability to Hadoop.

3. Scalability
Nodes can be easily added and removed. Failed nodes can be easily detected. For all these activities very little administration is required.

4. Huge and flexible storage
Massive data storage is available due to thousands of nodes in the cluster. It supports both structured and unstructured data. No preprocessing is required on data before storing it.

5. Fault tolerance and data protection
If any node fails, the tasks in hand are automatically redirected to other nodes. Multiple copies of all data are stored automatically. Due to this, even if any node fails, that data is available on some other nodes also.

2.1.2 Hadoop and Traditional RDBMS

Sr. No. | Hadoop | RDBMS
1. | Hadoop stores both structured and unstructured data. | RDBMS stores data in a structural way.
2. | SQL can be implemented on top of Hadoop as the execution engine. | SQL (Structured Query Language) is used.
3. | Scaling out is not that expensive, as machines can be added or removed with ease and little administration. | Scaling up (upgradation) is very expensive.

2.2 Hadoop System Principles

1. Scaling out
In a traditional RDBMS it is quite difficult to add more hardware and software resources, i.e. scale up. In Hadoop this can be done easily, i.e. scale out.

2. Transfer code to data
In an RDBMS, generally data is moved to the code and results are stored back. As the data is moving, there is always a security threat. In Hadoop, small code is moved to the data and executed there itself. Thus the data stays local, and Hadoop co-locates processing and storage.

3. Fault tolerance
Hadoop is designed to cope with node failures. As a large number of machines are involved, node failure is a very common problem.

4. Abstraction of complexities
Hadoop provides proper interfaces between components for proper working, so the user can concentrate on the data and the operations.

5. Data protection and consistency
Hadoop handles system-level challenges as it supports data consistency.

2.3 Hadoop Physical Architecture

Q. Explain Physical architecture of Hadoop.

Running Hadoop means running a set of resident programs. These resident programs are also known as daemons. These daemons may be running on the same server or on different servers in the network. All these daemons have some specific functionality assigned to them: the NameNode, the Secondary NameNode and the DataNodes. Let us see these daemons.
- Using this address, the client communicates directly with the DataNode.
- For replication of data, a DataNode may communicate with other DataNodes.
- A DataNode continually informs the NameNode of local changes and updates.
- To create, move or delete blocks, the DataNode receives instructions from the NameNode.

Fig. 2.4.2 : HDFS architecture (NameNode holding metadata such as name and replicas, e.g. /home/foo/data, 3; DataNodes holding the blocks)

Q. What is MapReduce? Explain how MapReduce works.

- MapReduce is a software framework. In MapReduce, an application is broken down into a number of small parts. These small parts are also called fragments or blocks. These blocks can then be run on any node in the cluster.
- Data processing is done by MapReduce. MapReduce scales and runs an application across different cluster machines.

The general dataflow has the following steps:

1. Read input
2. Function mapping
It converts file data to smaller, intermediate <key, value> pairs.
3. Partition, compare and sort
Partition function : With the given key and number of reducers, it finds the correct reducer.
Compare function : Map intermediate outputs are sorted according to this compare function.
4. Function reducing
5. Write output

Fig. 2.4.3 : The general MapReduce dataflow

To understand how it works, let us see one example.

File 1 : "Hello Sachin Hello Sumit"
File 2 : "Goodnight Sachin Goodnight Sumit"

Task : Count occurrences of each word across different files.
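The partition and compare functions described above can be sketched in Python. This is an illustrative analogue, not Hadoop's actual Java implementation: the `crc32`-based `partition` below stands in for Hadoop's default hash partitioner, and the two-reducer setup is an assumed example.

```python
import zlib

def partition(key, num_reducers):
    """Partition function: map a key to one of num_reducers reducers.
    A stable hash (crc32) stands in for Hadoop's default hash partitioner."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def sort_intermediate(pairs):
    """Compare function: sort intermediate <key, value> pairs by key,
    so each reducer sees its keys in order."""
    return sorted(pairs, key=lambda kv: kv[0])

# Route the intermediate pairs of a word-count job to 2 reducers.
pairs = [("Hello", 1), ("Sachin", 1), ("Hello", 1), ("Sumit", 1)]
num_reducers = 2
buckets = {r: [] for r in range(num_reducers)}
for key, value in sort_intermediate(pairs):
    buckets[partition(key, num_reducers)].append((key, value))
```

Because the partition function depends only on the key, all pairs with the same key always land on the same reducer, which is what makes the final per-key aggregation correct.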
(i) Map

Map1 : <Hello, 1> <Sachin, 1> <Hello, 1> <Sumit, 1>
Map2 : <Goodnight, 1> <Sachin, 1> <Goodnight, 1> <Sumit, 1>

(ii) Combine

Combine Map1 : <Hello, 2> <Sachin, 1> <Sumit, 1>
Combine Map2 : <Goodnight, 2> <Sachin, 1> <Sumit, 1>

(iii) Reduce

<Sachin, 2> <Sumit, 2> <Goodnight, 2> <Hello, 2>

- Sequential access is time consuming.
- The amount of unstructured data is much more than the structured information stored in rows and columns.
- Big data actually comes from complex, unstructured formats: everything from web sites, social media and email to videos, presentations, etc.
- The pioneer in this field is Google, which designed scalable frameworks like MapReduce.
- Apache open source started an initiative by the name Hadoop. It is a framework that allows for the distributed processing of such large data sets across clusters of machines.
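The Map, Combine and Reduce phases of the word-count example can be simulated in plain Python. This is a sketch of the dataflow only; the function names are illustrative and this is not the Hadoop MapReduce API.

```python
from collections import Counter

def map_phase(text):
    """Map: emit an intermediate <word, 1> pair for every word in the file."""
    return [(word, 1) for word in text.split()]

def combine_phase(pairs):
    """Combine: pre-aggregate one mapper's own pairs to cut network traffic."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())

def reduce_phase(*combined_outputs):
    """Reduce: merge the combined outputs of all mappers into final counts."""
    totals = Counter()
    for output in combined_outputs:
        for word, n in output:
            totals[word] += n
    return dict(totals)

file1 = "Hello Sachin Hello Sumit"
file2 = "Goodnight Sachin Goodnight Sumit"

map1, map2 = map_phase(file1), map_phase(file2)                # (i) Map
combine1, combine2 = combine_phase(map1), combine_phase(map2)  # (ii) Combine
result = reduce_phase(combine1, combine2)                      # (iii) Reduce
# result == {"Hello": 2, "Sachin": 2, "Sumit": 2, "Goodnight": 2}
```

Note how the combiner already collapses <Hello, 1> <Hello, 1> into <Hello, 2> inside Map1, so the reducer only has to add the pre-aggregated counts from the two mappers.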
Q. State limitations of Hadoop.
Q. What are the limitations of Hadoop?

2.5 Hadoop Ecosystem

Q. Give Hadoop Ecosystem and briefly explain its components.
Q. Explain Hadoop Ecosystem with core components. (MU : May 17, Dec. 18, 4 Marks)
Q. What do you mean by the Hadoop Ecosystem? Describe any three components of a typical Hadoop Ecosystem.
Q. Explain Hadoop Ecosystem.

- Hadoop MapReduce is a programming model and software for writing applications which can process vast amounts of data in parallel on large clusters of computers.
- HDFS is the primary storage system. It creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.
- Other Hadoop-related projects are Chukwa, Hive, HBase, Mahout, Sqoop and ZooKeeper.
- ZooKeeper allows distributed processes to coordinate with each other using a shared hierarchical namespace.

Fig. 2.5.2 : Hadoop ecosystem (ETL tools, BI reporting and RDBMS on top; Pig (data flow), Hive (SQL) and Sqoop; MapReduce (job scheduling / execution system); HDFS (Hadoop distributed file system) at the bottom)
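The HDFS behaviour described above, splitting files into blocks and distributing replicas on compute nodes, can be illustrated with a toy placement routine. The 64 MB block size and replication factor of 3 are traditional HDFS defaults used here as assumptions, and the round-robin placement is a simplification of HDFS's real rack-aware policy.

```python
def split_into_blocks(file_size, block_size=64 * 1024 * 1024):
    """Split a file of file_size bytes into fixed-size block ids."""
    num_blocks = (file_size + block_size - 1) // block_size  # round up
    return list(range(num_blocks))

def place_replicas(blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes, round-robin.
    (Real HDFS is rack-aware; this is only a sketch.)"""
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [datanodes[(i + r) % len(datanodes)]
                            for r in range(replication)]
    return placement

blocks = split_into_blocks(200 * 1024 * 1024)   # a 200 MB file -> 4 blocks
placement = place_replicas(blocks, ["dn1", "dn2", "dn3", "dn4"])
```

Because every block lives on three different DataNodes, the loss of any single node leaves at least two readable copies of each block, which is the property the text calls data protection.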
2.6 ZooKeeper

1. ZooKeeper is a distributed, open-source coordination service for distributed applications, used by Hadoop.
2. This system is a simple set of primitives that distributed applications can build upon to implement high-level services for synchronization, configuration maintenance, and groups and naming.

2.7 HBase

- HBase is a distributed column-oriented database.
- HBase is a Hadoop application built on top of HDFS.
- HBase is suitable for huge datasets where real-time read/write random access is required.
- HBase is not a relational database, hence it does not support SQL.
- It is an open-source project and is horizontally scalable.
- Cassandra, CouchDB, Dynamo and MongoDB are some other databases similar to HBase.
- Data can be entered in HDFS either directly or through HBase.

Comparison of HDFS and HBase:

Sr. No. | HDFS | HBase
1. | HDFS is a distributed file system suitable for storing large files. | HBase is a database built on top of the HDFS.
2. | HDFS does not support fast individual record lookups. | HBase provides fast lookups for larger tables.
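ZooKeeper's shared hierarchical namespace, described in Section 2.6 above, is made of "znodes" addressed by slash-separated paths, much like a file system. The class below is a toy in-memory stand-in for that idea, not the real ZooKeeper client API (a real application would use a client library such as Apache Curator or kazoo).

```python
class ToyZNodeTree:
    """In-memory stand-in for ZooKeeper's hierarchical znode namespace."""

    def __init__(self):
        self.nodes = {"/": b""}   # path -> data, root exists from the start

    def create(self, path, data=b""):
        """Create a znode; its parent must already exist, as in ZooKeeper."""
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError("parent znode does not exist: " + parent)
        self.nodes[path] = data

    def get(self, path):
        """Read the data stored at a znode."""
        return self.nodes[path]

    def children(self, path):
        """List direct children of a znode (one path segment deeper)."""
        prefix = path.rstrip("/") + "/"
        return [p for p in self.nodes
                if p.startswith(prefix)
                and p[len(prefix):]
                and "/" not in p[len(prefix):]]

# Two processes could coordinate by agreeing on paths under /app.
tree = ToyZNodeTree()
tree.create("/app")
tree.create("/app/config", b"replication=3")
tree.create("/app/workers")
```

Synchronization and naming services are then built on this primitive: processes watch agreed-upon paths and react when another process creates or changes a znode.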
Fig. 2.7.1 : HBase database architecture (client with Java client API and external APIs such as Thrift, Avro and REST; Region Servers with a Write-Ahead Log (WAL), Regions and MemStores; the Hadoop file system API over HDFS; ZooKeeper)

2.7.2 Comparison of RDBMS and HBase

Sr. No. | RDBMS | HBase

Region splitting and assignment:

1. Pre splitting
Regions are created first and split points are assigned at the time of table creation. The initial set of region split points is to be chosen very carefully, otherwise the load distribution will be heterogeneous, which may hamper the cluster's performance.

2. Auto splitting
This is the default option. It splits a region when one of its stores crosses the maximum configured value.

3. Region assignment upon load balancing
When there are no regions in transition, the cluster load is balanced by a load balancer by moving regions around. Thus it redistributes the regions on the cluster. It is configured via hbase.balancer.period. The default value is 300000 (5 minutes).

2.7.6 HBase Data Model

- The Data Model in HBase is made of different logical components such as Tables, Rows, Column Families, Columns, Cells and Versions.
- It can handle semi-structured data that may vary in terms of data type, size and columns. Thus partitioning and distributing data across the cluster is easier.

Row Key | Movies (Column Family: Screen, Movie Name, Ticket) | Shows (Column Family: Time, Day)
01 | Harry Potter 1, 200 | 6.00, Saturday

Fig. 2.7.2 : HBase data model
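The logical model above (row key, column families, columns, cells) maps naturally onto nested dictionaries. The sketch below encodes the Fig. 2.7.2 example purely to illustrate the model; it is not the HBase client API, and the `get_cell`/`put_cell` helpers are hypothetical names.

```python
# table -> row key -> column family -> column -> cell value
# (real HBase cells also carry a timestamp per version; one version shown here)
theatre = {
    "01": {
        "Movies": {"Movie Name": "Harry Potter 1", "Ticket": "200"},
        "Shows":  {"Time": "6.00", "Day": "Saturday"},
    },
}

def get_cell(table, row_key, family, column):
    """Random read access by (row key, column family, column)."""
    return table[row_key][family][column]

def put_cell(table, row_key, family, column, value):
    """Writes go to a sparse, schema-less row: a new column needs no ALTER."""
    table.setdefault(row_key, {}).setdefault(family, {})[column] = value

# Adding the Screen column to row 01 requires no schema change.
put_cell(theatre, "01", "Movies", "Screen", "1")
```

This sparseness is what the text means by handling semi-structured data: two rows of the same table may have entirely different columns within a family.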
2.8.1 Architecture of Hive

Fig. 2.8.1 : Hive architecture

1. User interface
2. Meta store
3. HiveQL process engine
4. Execution engine
5. HDFS or HBASE

Fig. 2.8.2 : Hive and Hadoop communication
1. Execute Query
The Command Line or Web UI sends the query to the JDBC or ODBC driver to execute.

2. Get Plan
With the help of the query compiler, the driver checks the syntax and the requirement of the query.

3. Get Metadata
The compiler sends a metadata request to the Meta store.

4. Send Metadata
The Meta store sends the metadata as a response to the compiler.

5. Send Plan
The compiler checks the requirement and resends the plan to the driver. The parsing and compiling of the query is complete.

6. Execute Plan
The driver sends the execute plan to the execution engine.

7. Execute Job
Internally, the execution of the job is a MapReduce job submitted to Hadoop.

8. Fetch Result
The execution engine receives the results from the Data nodes.

9. Send Results
The execution engine sends those resultant values to the driver.

10. Send Results
The driver sends the results to the Hive interfaces.

Buckets
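The round trip above (interface to driver, driver to compiler and execution engine, and results back again) can be modelled with plain functions. The bodies below are illustrative stubs under assumed data, not Hive internals; the function names only mirror the step names in the text.

```python
def compile_query(query):
    """Get Plan (step 2): check syntax and produce a trivial 'plan'."""
    if not query.strip().upper().startswith("SELECT"):
        raise ValueError("syntax error: only SELECT is sketched here")
    return {"operation": "scan", "query": query}

def execute_plan(plan, datanodes):
    """Execute Plan / Fetch Result (steps 6-8): the execution engine
    gathers rows from each Data node (here, plain lists of tuples)."""
    return [row for node in datanodes for row in node]

def run_query(query, datanodes):
    """Execute Query (step 1) through Send Results (steps 9-10)."""
    plan = compile_query(query)             # driver asks compiler for a plan
    results = execute_plan(plan, datanodes) # engine runs it on the cluster
    return results                          # engine -> driver -> interface

# Assumed toy 'Data nodes', each holding part of a word_counts table.
datanodes = [[("Sachin", 2)], [("Sumit", 2)]]
rows = run_query("SELECT * FROM word_counts", datanodes)
```

The point of the sketch is the direction of the calls: the interface never touches the data nodes directly; everything flows through the driver and the execution engine, exactly as in the numbered steps.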