BIG DATA AND HADOOP - Unit II
UNIT II
• Need of Hadoop
• Data Center vs Hadoop
• Overview of Hadoop Daemons
• Hadoop Cluster and Racks
• Learning Linux required for Hadoop
• Hadoop ecosystem tools overview
• Big data Hadoop opportunities
Introduction
Hadoop is a tool through which we can access data and process that data. It is best thought of as a mechanism for performing operations on data, in particular on large data sets spread across many machines.
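Hadoop Components & Daemons
• There are 2 layers in Hadoop –
• HDFS layer (storage)
• MapReduce layer (processing)
• There are 5 daemons (daemons are processes that run in the background) which run on Hadoop in these 2 layers -
a) NameNode – It runs on the master node.
b) DataNode – It runs on the slave nodes.
c) JobTracker – It runs on the master node and schedules MapReduce jobs. In Hadoop 2 and later, YARN's ResourceManager takes over this role.
d) TaskTracker – It runs on the slave nodes and executes MapReduce tasks. In Hadoop 2 and later, YARN's NodeManager takes over this role.
e) Secondary NameNode – It runs on a different system (other than the master and slave nodes) and periodically checkpoints the NameNode's metadata. It is often described as a backup for the NameNode, but it is not a standby that takes over automatically when the NameNode fails.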
Architecture
Functions of NameNode:-
1. Manages the DataNodes.
2. Records the metadata of all the files stored in the cluster.
3. Receives heartbeats from the DataNodes to confirm that they are live.
Functions of DataNodes:-
1. The actual data is stored on them.
2. They perform the low-level read and write requests from the file system's clients.
Figure: one NameNode and the DataNodes it manages (DataNode-1, DataNode-2, DataNode-3, ..., DataNode-N).
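The division of work between the NameNode and the DataNodes can be sketched in a few lines of Python. This is a toy model, not Hadoop code: the class `ToyNameNode`, its method names, and the 30-second timeout are all assumptions made for illustration. It shows the two kinds of bookkeeping listed above: which DataNodes have sent a heartbeat recently, and which DataNode holds each block of a file.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNameNode:
    """Toy model of NameNode bookkeeping (hypothetical, not the Hadoop API)."""

    heartbeat_timeout: float = 30.0  # assumed value, chosen only for this example
    last_heartbeat: dict = field(default_factory=dict)  # DataNode id -> time of last heartbeat
    file_blocks: dict = field(default_factory=dict)     # file path -> [(block id, DataNode id)]

    def receive_heartbeat(self, datanode_id: str, now: float) -> None:
        # A heartbeat only tells the NameNode "I am still alive".
        self.last_heartbeat[datanode_id] = now

    def record_file(self, path: str, placements: list) -> None:
        # Metadata only: the NameNode stores where blocks live, not the data.
        self.file_blocks[path] = list(placements)

    def live_datanodes(self, now: float) -> list:
        # A DataNode counts as live if its last heartbeat is recent enough.
        return sorted(
            node for node, seen in self.last_heartbeat.items()
            if now - seen <= self.heartbeat_timeout
        )

    def locate(self, path: str, now: float) -> list:
        # Return only the block locations that sit on live DataNodes.
        live = set(self.live_datanodes(now))
        return [(blk, node) for blk, node in self.file_blocks.get(path, []) if node in live]


nn = ToyNameNode()
nn.receive_heartbeat("datanode-1", now=0.0)
nn.receive_heartbeat("datanode-2", now=0.0)
nn.record_file("/logs/a.txt", [("blk_1", "datanode-1"), ("blk_2", "datanode-2")])
nn.receive_heartbeat("datanode-1", now=40.0)  # datanode-2 stays silent
print(nn.live_datanodes(now=45.0))
print(nn.locate("/logs/a.txt", now=45.0))
```

In the run above, only datanode-1 is still considered live at time 45, so only the block stored on it is reported as reachable.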
Hadoop Cluster and Racks
A rack is a physical collection of nodes in a Hadoop cluster (perhaps 30 to 40). A large Hadoop cluster consists of many racks. Using this rack information, the NameNode chooses the closest DataNode when serving read/write requests, which improves performance and reduces network traffic.
A Hadoop cluster contains multiple racks, and each rack holds many DataNodes. Communication between DataNodes on the same rack is much faster than communication between DataNodes on two different racks.
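The idea of choosing the closest DataNode can be sketched as follows. This is an illustrative model, not Hadoop's implementation: the function names and the `(rack, node)` tuples are hypothetical, and the distance values 0, 2, and 4 (same node, same rack, different rack) follow a convention commonly used when describing Hadoop's network topology, taken here as an assumption.

```python
def network_distance(a, b):
    """Distance between two locations given as (rack, node) tuples.

    Assumed convention: 0 = same node, 2 = same rack, 4 = different racks.
    """
    if a == b:
        return 0
    if a[0] == b[0]:
        return 2
    return 4


def closest_replica(client, replicas):
    """Pick the replica with the smallest distance to the client.

    On ties, min() keeps the first replica in the list.
    """
    return min(replicas, key=lambda replica: network_distance(client, replica))


client = ("rack1", "node3")
replicas = [("rack2", "node7"), ("rack1", "node5"), ("rack3", "node1")]
print(closest_replica(client, replicas))  # the replica on rack1 is chosen
```

The same-rack replica wins because reading from it avoids traffic between racks, which is the slower path described above.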
Learning Linux required for Hadoop
1) Command for uploading a file to HDFS
• hadoop fs -put
This command is used to upload a file from the local file system to HDFS. Multiple files can be uploaded with this command by separating the file names with a space.
2) Command for downloading a file from HDFS
• hadoop fs -get
This command is used to download a file from HDFS to the local file system. Multiple files can be downloaded with this command by separating the file names with a space.
3) Command for viewing the contents of a file
• hadoop fs -cat
4) Command for moving files from source to destination
• hadoop fs -mv
5) Command for removing a file or directory in HDFS
• hadoop fs -rm
Note - hadoop fs -rm on its own removes files. To remove a directory together with its contents, use hadoop fs -rm -r; hadoop fs -rmdir removes a directory only if it is empty.
6) Command for copying files from the local file system to HDFS
• hadoop fs -copyFromLocal
7) Command to display the length (size) of a file
• hadoop fs -du
8) Command to view the contents of a directory
• hadoop fs -ls
9) Command to create a directory in HDFS
• hadoop fs -mkdir
10) Command to display the beginning of a file (the first kilobyte; available in recent Hadoop releases)
• hadoop fs -head
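To show how the commands above look with arguments, here is a small Python sketch that only builds and prints the command lines; it does not run Hadoop. The helper `hdfs_command` and every path in the examples (such as `/user/student/data`) are hypothetical and chosen only for illustration.

```python
import shlex

# The file system actions covered in the list above.
ALLOWED_ACTIONS = {
    "put", "get", "cat", "mv", "rm",
    "copyFromLocal", "du", "ls", "mkdir", "head",
}


def hdfs_command(action, *paths):
    """Return the argument list for `hadoop fs -<action> <paths...>`."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    return ["hadoop", "fs", f"-{action}", *paths]


# Hypothetical paths, for illustration only.
examples = [
    hdfs_command("mkdir", "/user/student/data"),
    hdfs_command("put", "sales.csv", "users.csv", "/user/student/data"),
    hdfs_command("ls", "/user/student/data"),
    hdfs_command("cat", "/user/student/data/sales.csv"),
    hdfs_command("get", "/user/student/data/sales.csv", "sales_copy.csv"),
    hdfs_command("rm", "/user/student/data/users.csv"),
]

for command in examples:
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(command))
```

On a machine where Hadoop is installed, each list could be passed to `subprocess.run(command, check=True)` to execute it. Note that `-put` takes the local source files first and the HDFS destination last, while `-get` takes the HDFS source first and the local destination last.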
Hadoop ecosystem tools overview
Figure: Hadoop ecosystem diagram showing HDFS, MapReduce, Custom MR, Pig, Hive, Impala, HBase, Mahout, Spark, Storm, Flume & Sqoop, Oozie, Zookeeper, and Ambari.
1. Flume & Sqoop (data ingestion applications):- Flume is used for log collection; Sqoop is used to move data from SQL databases into Hadoop
2. Pig:- Data processing/analysis programming language
3. Hive:- Interface with SQL-like functionality
4. Mahout:- A machine learning library/application that provides machine learning algorithms
5. Custom MR:- MapReduce programs written in Java etc.
Disadvantages of MapReduce
a. Very slow
b. Batch processing only
Alternatives to MapReduce
6. Impala:- SQL-like interface similar to Hive, but it does not use MapReduce; it has its own mechanism to access the data and the cluster
7. HBase:- A NoSQL database, i.e. data is stored as key-value pairs
8. Spark & Storm:- Process real-time (streaming) data
9. Zookeeper:- Used for management (coordination between the services in the cluster)
10. Oozie:- Scheduler for Hadoop jobs
11. Ambari:- Web-based GUI for provisioning, managing, and monitoring
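To make the "Custom MR" entry and the batch nature of MapReduce more concrete, the following is a minimal word-count sketch of the map, shuffle, and reduce phases in plain Python. It runs on a single machine and does not use the Hadoop API; the function names are hypothetical. It only illustrates the shape of the computation: every input line is mapped before any result is produced, which is why MapReduce suits batch work rather than real-time processing.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # All lines are mapped first, then grouped, then reduced (batch style).
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["big data", "Big Hadoop data"]))
```

In a real cluster the map and reduce functions would run in parallel on many nodes, with the framework handling the shuffle between them.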