
Q.3  Give all the characteristics of Big Data.
Q.4  Explain the three Vs of Big Data.
Q.5  Explain the various types of big data in detail.
Q.6  Why use Big Data over the traditional business approach?
Q.7  Compare the traditional approach and the big data approach.
Q.8  Explain the various needs of Big Data.
Q.9  Explain the various tools used in Big Data.
Q.10 Write a short note on :
     (a) Types of Big Data
     (b) Traditional vs Big Data business approach

2  Introduction to Hadoop

2.1 Hadoop

Q. What is Hadoop? How are Big Data and Hadoop linked?
Q. Write a short note on Hadoop.

Hadoop is an open-source big data storage and processing software framework. Hadoop stores and processes big data in a distributed fashion on large clusters of commodity hardware. Massive data storage and faster processing are the two important aspects of Hadoop.

Fig. 2.1.1 : Hadoop cluster

As shown in Fig. 2.1.1, a Hadoop cluster is a set of commodity machines networked together in one location, i.e. the cloud.

These cloud machines are then used for data storage and processing. From individual clients, users can submit their jobs to the cluster. These clients may be present at locations remote from the Hadoop cluster.

Hadoop runs applications on systems with thousands of nodes involving huge storage capabilities. As a distributed file system is used by Hadoop, data transfer rates among the nodes are very fast.

As there are thousands of machines in a cluster, the user gets uninterrupted service, and node failure is not a big issue in Hadoop even if a large number of nodes become inoperative.

Hadoop uses distributed storage and transfers code to data. This code is tiny and consumes less memory.

This code executes with the data there itself. Thus the time to fetch the data and store the results back is saved, as the data is locally available; interprocess communication time is saved, which makes processing faster.

The redundancy of data is an important feature of Hadoop, due to which node failures are easily handled.
As the Hadoop framework itself takes care of data distribution and task assignment, the user can concentrate on the data and the operations on that data.

2.1.1 Hadoop - Features

1. Low cost
Hadoop is an open-source framework, hence it is free. It uses commodity hardware to store and process huge data, hence it is not much costly.

2. High computing power
Hadoop uses a distributed computing model. Due to this, a task can be distributed amongst different nodes and can be processed quickly. Clusters may have thousands of nodes, which gives high computing capability to Hadoop.

3. Scalability
Nodes can be easily added and removed. Failed nodes can be easily detected. For all these activities very little administration is required.

4. Huge and flexible storage
Massive data storage is available due to the thousands of nodes in the cluster. It supports both structured and unstructured data. No preprocessing is required on data before storing it.

5. Fault tolerance and data protection
If any node fails, the tasks in hand are automatically redirected to other nodes. Multiple copies of all data are automatically stored. Due to this, even if any node fails, that data is available on some other nodes also.

2.1.2 Hadoop and Traditional RDBMS

Sr. No. | Hadoop | RDBMS
1. | Hadoop stores both structured and unstructured data. | RDBMS stores data in a structural way.
2. | SQL can be implemented on top of Hadoop as the execution engine. | SQL (Structured Query Language) is used.
3. | Scaling out is not that expensive, as machines can be added or removed with ease and little administration. | Scaling up (upgradation) is very expensive.
4. | The basic data unit is key/value pairs. | The basic data unit is relational tables.
5. | With MapReduce we can use scripts and code to tell the actual steps in processing the data. | With SQL we can state the expected result and the database engine derives it.
6. | Hadoop is designed for offline processing and analysis of large-scale data. | RDBMS is designed for online transactions.

2.2 Hadoop System Principles

1. Scaling out
In a traditional RDBMS it is quite difficult to add more hardware and software resources, i.e. to scale up. In Hadoop this can be done easily, i.e. scaling out.

2. Transfer code to data
In an RDBMS, generally the data is moved to the code and the results are stored back. As data is moving, there is always a security threat. In Hadoop, small code is moved to the data and executed there itself; thus the data stays local. Thus Hadoop co-locates processing and storage.

3. Fault tolerance
Hadoop is designed to cope with node failures. As a large number of machines are there, a node failure is a very common problem.

4. Abstraction of complexities
Hadoop provides proper interfaces between components for proper working.

5. Data protection and consistency
Hadoop handles system-level challenges as it supports data consistency.

2.3 Hadoop Physical Architecture

Q. Explain the physical architecture of Hadoop.

Running Hadoop means running a set of resident programs. These resident programs are also known as daemons.

These daemons may be running on the same server or on different servers in the network.

All these daemons have some specific functionality assigned to them. Let us see these daemons.

Fig. 2.3.1 : Hadoop cluster topology (a NameNode with Secondary NameNode and JobTracker, and DataNodes each running a TaskTracker)

NameNode

1. The NameNode is known as the master of HDFS.
2. The DataNode is known as the slave of HDFS.
3. The NameNode has a JobTracker which keeps track of the files distributed to DataNodes.
DataNode

1. The DataNode is known as the slave of HDFS.
2. The DataNode takes client block addresses from the NameNode.
3. Using this address, the client communicates directly with the DataNode.
4. For replication of data, a DataNode may communicate with other DataNodes.
5. The DataNode continually informs the NameNode of local change updates.
6. To create, move or delete blocks on the local disk, the DataNode receives instructions from the NameNode.

Secondary NameNode (SNN)

1. The SNN monitors the state of the cluster HDFS.
2. Each cluster has one SNN.
3. The SNN resides on its own machine; on that server, no other DataNode or TaskTracker daemon can run.
4. The SNN takes snapshots of the HDFS metadata at intervals by communicating constantly with the NameNode.

JobTracker

1. The JobTracker determines the files to process, node assignments for the different tasks, task monitoring, etc.
2. Only one JobTracker daemon per Hadoop cluster is allowed.
3. The JobTracker runs on a server as a master node of the cluster.

TaskTracker

1. Individual tasks assigned by the JobTracker are executed by the TaskTracker.
2. There is a single TaskTracker per slave node.
3. A TaskTracker may handle multiple tasks in parallel by using multiple JVMs.
4. The TaskTracker constantly communicates with the JobTracker. If the TaskTracker fails to respond to the JobTracker within a specified amount of time, it is assumed that the TaskTracker has crashed, and the corresponding tasks are rescheduled to other nodes in the cluster.

Fig. 2.3.2 : JobTracker and TaskTracker interaction

2.4 Hadoop Core Components

Q. Explain the components of core Hadoop.

Hadoop has two core components :
o HDFS (Hadoop distributed file system)
o MapReduce (processing)

Fig. 2.4.1 : Hadoop core components

2.4.1 HDFS (Hadoop Distributed File System)

Q. Describe the structure of HDFS in a hadoop ecosystem using a diagram. (MU : Dec. 17, 5 Marks)

HDFS is a file system for Hadoop. It runs on clusters on commodity hardware.

HDFS has the following important characteristics :
o Highly fault-tolerant
o High throughput
o Supports applications with massive data sets
o Streaming access to file system data
o Can be built out of commodity hardware

HDFS Architecture

For distributed storage and distributed computation, Hadoop uses a master/slave architecture. The distributed storage system in Hadoop is called the Hadoop Distributed File System, or HDFS. In HDFS a file is chopped into 64 MB chunks, known as blocks, and then stored.

As previously discussed, an HDFS cluster has a Master (NameNode) and Slave (DataNode) architecture. The NameNode manages the namespace of the filesystem. In this namespace, the information regarding the file system tree, the metadata for all the files and directories in that tree, etc. is stored. For this it creates two files, the namespace image and the edit log, and keeps the information in them on a consistent basis.
A client interacts with HDFS by communicating with the NameNode and the DataNodes. The user does not know about the assignment of the NameNode and DataNodes for the functioning, i.e. which NameNode and DataNodes are assigned or will be assigned.

1. NameNode

- The NameNode is known as the master of HDFS.
- The DataNode is known as the slave of HDFS.
- The NameNode has a JobTracker which keeps track of the files distributed to DataNodes.
- The NameNode directs the DataNodes regarding the low-level I/O tasks.
- The NameNode is the only single-point-of-failure component.

2. DataNode

- The DataNode is known as the slave of HDFS.
- The DataNode takes client block addresses from the NameNode.
- Using this address, the client communicates directly with the DataNode.
- For replication of data, a DataNode may communicate with other DataNodes.
- The DataNode continually informs the NameNode of local change updates.
- To create, move or delete blocks on the local disk, the DataNode receives instructions from the NameNode, which holds the metadata (name, replicas, ...).

Fig. 2.4.2 : HDFS architecture
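From the Java side, this client interaction can be sketched with the Hadoop FileSystem API. The following is a minimal illustration, not code from this book; the NameNode URI and the file path are placeholder values.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; on a real cluster this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);

        // Write : the client asks the NameNode where the blocks should go,
        // then streams the bytes directly to the DataNodes.
        Path file = new Path("/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
        }

        // Read : the NameNode again supplies the block addresses and the
        // actual bytes are fetched from the DataNodes holding the blocks.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }

        fs.close();
    }
}
```

Note that the client never addresses a DataNode explicitly; the FileSystem object obtains the block locations from the NameNode behind the scenes, exactly as described above.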

2.4.2 MapReduce

Q. What is MapReduce? Explain how MapReduce works.

MapReduce is a software framework. In MapReduce, an application is broken down into a number of small parts. These small parts are also called fragments or blocks. These blocks can then be run on any node in the cluster.

Data processing is done by MapReduce. MapReduce scales and runs an application on different cluster machines. The configuration changes required for scaling and running these applications are done by MapReduce itself. There are two primitives used for data processing by MapReduce, known as mappers and reducers.

Mapping and reducing are the two important phases for executing an application program. In the mapping phase, MapReduce takes the input data, filters that input data and then transforms each data element in the mapper.

In the reducing phase, the reducer processes all the outputs from the mapper, aggregates them and then provides a final result.

MapReduce uses lists and key/value pairs for processing of data.

MapReduce core functions

1. Read input
Divides the input into small parts/blocks. These blocks then get assigned to a Map function.

2. Function mapping
It converts the file data to smaller, intermediate <key, value> pairs.

3. Partition, compare and sort
Partition function : With the given key and number of reducers, it finds the correct reducer.
Compare function : Map intermediate outputs are sorted according to this compare function.

4. Function reducing
Intermediate values are reduced to smaller solutions and given to the output.

5. Write output
Gives the file output.

Fig. 2.4.3 : The general MapReduce dataflow

To understand how MapReduce works, let us see one example.

File 1 : "Hello Sachin Hello Sumit"
File 2 : "Goodnight Sachin Goodnight Sumit"

Count the occurrences of each word across the different files.

Three operations will be there, as follows.

(i) Map

Map1 :
< Hello, 1 >
< Sachin, 1 >
< Hello, 1 >
< Sumit, 1 >

Map2 :
< Goodnight, 1 >
< Sachin, 1 >
< Goodnight, 1 >
< Sumit, 1 >

(ii) Combine

Combine Map1 :
< Sachin, 1 >
< Sumit, 1 >
< Hello, 2 >

Combine Map2 :
< Sachin, 1 >
< Sumit, 1 >
< Goodnight, 2 >

(iii) Reduce

< Sachin, 2 >
< Sumit, 2 >
< Goodnight, 2 >
< Hello, 2 >
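This map/combine/reduce flow is what the well-known Hadoop WordCount program implements. Below is a compact version of that standard example in Java (using the org.apache.hadoop.mapreduce API); the input and output paths are supplied as command-line arguments, and the reducer class doubles as the combiner for step (ii).

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase : emit <word, 1> for every word, as in step (i).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase : sum the counts per word, as in step (iii).
    // The same class serves as the combiner of step (ii).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // per-mapper aggregation (Combine)
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. directory holding File 1 and File 2
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```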

2.4.3 Hadoop - Limitations

Q. State the limitations of Hadoop.
Q. What are the limitations of Hadoop?

- Hadoop can perform only batch processing, and data is accessed sequentially.
- Sequential access is time consuming.
- So a new technique is needed to get rid of this problem.

2.5 Hadoop - Ecosystem

Q. Give the Hadoop Ecosystem and briefly explain its components.
Q. Explain the Hadoop Ecosystem with core components. (MU : May 17, Dec. 18, 4 Marks)
Q. What do you mean by the Hadoop Ecosystem? Describe any three components of a typical Hadoop Ecosystem.
Q. Explain the Hadoop Ecosystem.

- The data in today's world is growing rapidly in size as well as scale, and shows no signs of slowing down.
- Statistics show that every year the amount of data generated is more than in the previous years.
- The amount of unstructured data is much more than the structured information stored in rows and columns.
- Big Data actually comes from complex, unstructured formats : everything from web sites, social media and email to videos, presentations, etc.
- The pioneer in this field of data is Google, which designed scalable frameworks like MapReduce and the Google File System.
- Apache open source has started an initiative by the name Hadoop. It is a framework that allows for the distributed processing of such large data sets across clusters of machines.

Fig. 2.5.1 : Hadoop ecosystem (ETL tools, BI reporting and RDBMS over Pig (data flow), Hive (SQL) and Sqoop; MapReduce (job scheduling / execution system); HBase (column DB); HDFS (Hadoop distributed file system))

Ecosystem

Apache Hadoop has 2 core projects :
o Hadoop MapReduce
o Hadoop Distributed File System (HDFS)

Hadoop MapReduce is a programming model and software for writing applications which can process vast amounts of data in parallel on large clusters of computers.

HDFS is the primary storage system; it creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.

Other Hadoop-related projects are Chukwa, Hive, HBase, Mahout, Sqoop and ZooKeeper.

Fig. 2.5.2 : Apache Hadoop ecosystem

2.6 ZooKeeper

1. ZooKeeper is a distributed, open-source coordination service for distributed applications, used by Hadoop.
2. This system is a simple set of primitives that distributed applications can build upon to implement high-level services for synchronization, configuration maintenance, and groups and naming.
3. Such coordination services are prone to errors like race conditions and deadlock; the main goal behind ZooKeeper is to relieve distributed applications of building these services from scratch.
4. ZooKeeper allows distributed processes to coordinate with each other using a shared hierarchical namespace organized like a standard file system.
5. The namespace is made up of data registers called znodes, and these are similar to files and directories.
6. ZooKeeper data is kept in-memory, which means it can achieve high throughput and low latency.

Fig. 2.6.1 : A ZooKeeper ensemble, with clients creating and deleting znodes through the servers
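As an illustration of znodes, the short Java sketch below creates, reads and deletes one data register using the ZooKeeper client API. The ensemble address and the znode path are illustrative placeholders, not values from this book.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Placeholder ensemble address; 3-second session timeout.
        ZooKeeper zk = new ZooKeeper("zkhost:2181", 3000,
                event -> connected.countDown());
        connected.await();  // wait for the connection event

        // Create a znode in the hierarchical namespace, much like creating a file.
        String path = zk.create("/demo", "config-value".getBytes(StandardCharsets.UTF_8),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Read it back; data is served from memory, hence the low latency.
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data, StandardCharsets.UTF_8));

        zk.delete(path, -1);  // version -1 matches any version
        zk.close();
    }
}
```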

2.7 HBase

- HBase is a distributed, column-oriented database.
- HBase is a Hadoop application built on top of HDFS.
- HBase is suitable for huge datasets where real-time read/write random access is required.
- HBase is not a relational database; hence it does not support SQL.
- It is an open-source project and is horizontally scalable.
- Cassandra, CouchDB, Dynamo and MongoDB are some other databases similar to HBase.
- Data can be entered into HDFS either directly or through HBase.
- Consistent reads and writes and automatic failure support are provided.
- It can be easily integrated with Java.
- Data is replicated across the cluster, which is useful when some node fails.

2.7.1 Comparison of HDFS and HBase

Sr. No. | HDFS | HBase
1. | HDFS is a distributed file system suitable for storing large files. | HBase is a database built on top of HDFS.
2. | HDFS does not support fast individual record lookups. | HBase provides fast lookups for larger tables.
3. | It provides high-latency batch processing. | Low-latency random access.


2.7.2 Comparison of RDBMS and HBase

Sr. No. | RDBMS | HBase
1. | RDBMS uses a schema. Data is stored according to that schema. | HBase is schema-less. Only column families are defined.
2. | Scaling is difficult. | Horizontally scalable.
3. | RDBMS is transactional. | There are no transactions in HBase.
4. | It has normalized data. | It has de-normalized data.
5. | It is good for structured data. | It is good for semi-structured as well as structured data.
6. | It is a row-oriented database. | It is a column-oriented database.
7. | It is suitable for Online Transaction Processing (OLTP). | It is suitable for Online Analytical Processing (OLAP).

2.7.3 HBase Architecture

- The Master performs administration, cluster management, region management, load balancing and failure handling.
- The RegionServer hosts and manages regions, and handles region splitting, read/write request handling, client communication, etc.
- A RegionServer contains a Write Ahead Log (WAL) and may have multiple regions. A region is made up of a MemStore and HFiles in which the data is stored.
- ZooKeeper is required to manage all the services.

Fig. 2.7.1 : HBase database architecture (native client APIs and external APIs (Thrift, Avro, REST) over the Master and RegionServers, which sit on the Hadoop file system API, ZooKeeper and HDFS)

2.7.4 Region Splitting Methods

1. Pre-splitting
Regions are created first, and the split points are assigned at the time of table creation. The initial set of region split points has to be chosen very carefully, otherwise the load distribution will be heterogeneous, which may hamper the cluster's performance.

2. Auto splitting
This is the default option. It splits a region when one of its stores crosses the maximum configured value.

3. Manual splitting
Split those regions which are not uniformly loaded.

2.7.5 Region Assignment and Load Balancing

These are standard procedures, and this behaviour cannot be changed further.

On startup

1. On startup, the AssignmentManager is invoked by the Master.
2. From META, the information about the existing region assignments is taken by the AssignmentManager.
3. If the RegionServer is still online, then the assignment is kept as it is.
4. If the RegionServer is not online, then the LoadBalancerFactory is invoked for the region assignment. The DefaultLoadBalancer will randomly assign the region to a RegionServer.
5. META is updated with this new RegionServer assignment. The region starts serving upon being opened by the RegionServer.

When a region server fails

1. Regions become unavailable when any RegionServer fails.
2. The Master finds out which RegionServer has failed.
3. The region assignments done by that RegionServer then become invalid, and the same process is followed for the new region assignment as that of startup.

Region assignment upon load balancing

When there are no regions in transition, the cluster load is balanced by a load balancer by moving regions around, thus redistributing the regions on the cluster. It is configured via hbase.balancer.period. The default value is 300000 (5 minutes).
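As a hedged sketch of how pre-splitting (Section 2.7.4) is requested through the HBase Java client API, the example below creates a table with explicit split keys. The table name, column family and split points are illustrative, and hbase.balancer.period would normally be set in hbase-site.xml rather than in code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Balancer period from Section 2.7.5 (default 300000 ms = 5 minutes);
        // set here only for illustration.
        conf.setInt("hbase.balancer.period", 300000);

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {

            // Pre-splitting : supply the split points at table-creation time,
            // so the initial regions are [start,"g"), ["g","n") and ["n",end).
            byte[][] splitKeys = { Bytes.toBytes("g"), Bytes.toBytes("n") };

            admin.createTable(
                TableDescriptorBuilder.newBuilder(TableName.valueOf("demo"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
                    .build(),
                splitKeys);
        }
    }
}
```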
2.7.6 HBase Data Model

The data model in HBase is made of different logical components such as Tables, Rows, Column Families, Columns, Cells and Versions.

It can handle semi-structured data that may be varied in terms of data type, size and columns. Thus partitioning and distributing the data across the cluster is easier.

Row Key | Movies (Screen, Movie Name, Ticket) | Shows (Time, Day)
01 | Harry Potter 1, 200 | 6.00, Saturday
02 | Harry Potter 2, 250 | 3.00, Sunday

Fig. 2.7.2 : HBase data model
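To make the data model concrete, the sketch below writes and reads the first row of Fig. 2.7.2 through the HBase Java client. It assumes a table named "shows" with the column families Movies and Shows has already been created; all names are illustrative.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DataModelDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("shows"))) {

            // Row key "01" : cells are grouped under the two column families.
            Put row = new Put(Bytes.toBytes("01"));
            row.addColumn(Bytes.toBytes("Movies"), Bytes.toBytes("MovieName"),
                          Bytes.toBytes("Harry Potter 1"));
            row.addColumn(Bytes.toBytes("Movies"), Bytes.toBytes("Ticket"), Bytes.toBytes("200"));
            row.addColumn(Bytes.toBytes("Shows"), Bytes.toBytes("Time"), Bytes.toBytes("6.00"));
            row.addColumn(Bytes.toBytes("Shows"), Bytes.toBytes("Day"), Bytes.toBytes("Saturday"));
            table.put(row);

            // Random-access read by row key : the fast lookup that plain HDFS lacks.
            Result result = table.get(new Get(Bytes.toBytes("01")));
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("Shows"), Bytes.toBytes("Day"))));  // Saturday
        }
    }
}
```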
The logical components of the HBase data model are :

1. Tables
2. Rows
3. Column Families
4. Columns
5. Cells

2.8 HIVE

2.8.1 Architecture of HIVE

Fig. 2.8.1 : Hive architecture

1. User interface
2. Meta store
3. HiveQL process engine
4. Execution engine
5. HDFS or HBASE

2.8.2 Working of HIVE

Fig. 2.8.2 : Hive and Hadoop communication
1. Execute Query :
The Command Line or Web UI sends the query to the JDBC or ODBC driver to execute.

2. Get Plan :
With the help of the query compiler, the driver checks the syntax and the requirement of the query (whether the query plan is correct).

3. Get Metadata :
The compiler sends a metadata request to the Metastore.

4. Send Metadata :
The Metastore sends the required metadata as a response to the compiler.

5. Send Plan :
The compiler checks the requirement and resends the plan to the driver. Thus the parsing and compiling of the query is complete.

6. Execute Plan :
The driver sends the execute plan to the execution engine.

7. Execute Job :
The execution engine sends the job to the JobTracker. The JobTracker assigns it to the TaskTracker.

7.1 Metadata Operations :
The execution engine can execute metadata operations with the Metastore.

8. Fetch Result :
The execution engine receives the results from the Data nodes.

9. Send Results :
The execution engine sends those resultant values to the driver.

10. Send Results :
The driver sends the result to the Hive interfaces.
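All ten steps are triggered by a single client call. A minimal sketch using the HiveServer2 JDBC driver is shown below; the connection URL, credentials and the table name are placeholders chosen for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        // Requires the hive-jdbc driver (org.apache.hive.jdbc.HiveDriver) on the classpath.
        // Placeholder HiveServer2 URL; submitting the query is step 1 (Execute Query).
        String url = "jdbc:hive2://hiveserver:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "user", "");
             Statement stmt = conn.createStatement();
             // Steps 2-7 (plan, metadata, MapReduce job) run inside Hive.
             ResultSet rs = stmt.executeQuery(
                     "SELECT word, COUNT(*) FROM docs GROUP BY word")) {
            // Steps 8-10 : the results are fetched and returned to this interface.
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```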

2.8.3 HIVE Data Models

The Hive data models contain the following components :

- Databases
- Tables
- Partitions
- Buckets or clusters

Partitions

Each table can be divided into smaller parts, called partitions, based on the values of the partition columns, so that only the relevant part of the data is scanned for a query.

Buckets

The data in each partition may in turn be divided into buckets, based on the hash of a column of the table.
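A sketch of how partitions and buckets are declared in HiveQL, issued here over the same JDBC interface as above; the table and column names are illustrative, not from this book.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveDataModelDemo {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:hive2://hiveserver:10000/default";  // placeholder URL
        try (Connection conn = DriverManager.getConnection(url, "user", "");
             Statement stmt = conn.createStatement()) {
            // Partitions : one directory per country value under the table.
            // Buckets : rows hashed on user_id into 4 files per partition.
            stmt.execute(
                "CREATE TABLE users (user_id INT, name STRING) " +
                "PARTITIONED BY (country STRING) " +
                "CLUSTERED BY (user_id) INTO 4 BUCKETS " +
                "STORED AS ORC");
        }
    }
}
```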
