BIG DATA AND HADOOP - Unit II

Hadoop is a software framework designed for distributed processing of large datasets, featuring HDFS for file storage and MapReduce for data processing. It offers advantages such as fault tolerance, cost-effective storage, and the ability to handle both structured and unstructured data. The document also outlines the architecture of Hadoop, its components, and various ecosystem tools that enhance its functionality.
Big Data & Hadoop

UNIT II
• Need of Hadoop
• Data Center vs Hadoop
• Overview of Hadoop Daemons
• Hadoop Cluster and Racks
• Learning Linux Required for Hadoop
• Hadoop Ecosystem Tools Overview
• Big Data Hadoop Opportunities
Introduction

Hadoop is a software framework that is optimized for the distributed processing of very large datasets. Its two main features are the Hadoop Distributed File System (HDFS), which handles storing files, and MapReduce, which processes the stored information.
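
To make the MapReduce idea concrete, here is a minimal single-process sketch of a word count in plain Python. It does not use Hadoop at all; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and in a real cluster the map and reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Hadoop would distribute the map calls across nodes; this sketch
    # simply runs them one after another in a single process.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

The map step looks at one line at a time and the reduce step at one key at a time, which is what lets a framework spread the work over many machines.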
Advantages of using Hadoop:
1. It stores both structured and unstructured data as is.
2. It is fault tolerant: the failure of any node is recovered from automatically.
3. It processes complex data easily and quickly.
4. It works in a distributed manner, meaning that multiple tasks execute in parallel at the same time.
5. It offers a cost-effective data storage solution.
6. Data is stored reliably on a cluster of machines despite individual machine failures.
Data Center vs Hadoop
A data center is where the data for a particular site is physically stored. Whenever you fire a query or type Facebook.com, the request goes to the data center, and the response packets are then delivered to your system.

Hadoop is a tool or process through which we can access that data and process it. It is, rather, a mechanism for performing operations on data.
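Hadoop Components & Daemons
• There are 2 layers in Hadoop:
• HDFS layer
• MapReduce layer
• There are 5 daemons (daemons are processes that run in the background) that run in these 2 layers:
a) NameNode: runs on the master node.
b) DataNode: runs on the slave nodes.
c) JobTracker: runs on the master node of the MapReduce layer (Hadoop 1.x). Under YARN (Hadoop 2.x and later) its duties are split between the ResourceManager and per-application ApplicationMasters.
d) TaskTracker: runs on the slave nodes of the MapReduce layer (Hadoop 1.x). Under YARN it is replaced by the NodeManager.
e) Secondary NameNode: periodically merges the NameNode's edit log into the file system image (checkpointing) and usually runs on a separate machine from the NameNode. Despite its name, it is not a standby that takes over if the NameNode fails.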
Architecture
Functions of the NameNode:
1. Manages the DataNodes.
2. Records the metadata of all the files stored in the cluster.
3. Receives a heartbeat from each DataNode to confirm that the DataNodes are alive.

Functions of the DataNodes:
1. The actual data is stored on them.
2. They serve the low-level read and write requests from the file system's clients.

[Figure: a NameNode and DataNode-1, DataNode-2, DataNode-3, ..., DataNode-N]
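
The NameNode's bookkeeping can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not Hadoop's actual implementation: the class name NameNodeSketch, its methods, and the timeout value are hypothetical, and time is passed in as a plain number so the behaviour is easy to follow.

```python
class NameNodeSketch:
    """Toy model of a NameNode: it tracks heartbeats and file metadata."""

    def __init__(self, timeout_seconds):
        # Hypothetical rule: a DataNode counts as dead once its last
        # heartbeat is older than timeout_seconds.
        self.timeout_seconds = timeout_seconds
        self.last_heartbeat = {}   # DataNode id -> time of last heartbeat
        self.file_locations = {}   # file name -> DataNode ids holding it

    def receive_heartbeat(self, datanode_id, now):
        """Record that a DataNode reported in at time `now`."""
        self.last_heartbeat[datanode_id] = now

    def record_file(self, filename, datanode_ids):
        """Store metadata only; the data itself lives on the DataNodes."""
        self.file_locations[filename] = list(datanode_ids)

    def live_datanodes(self, now):
        """Return the ids of DataNodes whose heartbeat is recent enough."""
        return sorted(
            dn for dn, seen in self.last_heartbeat.items()
            if now - seen <= self.timeout_seconds
        )

    def live_replicas(self, filename, now):
        """Return the DataNodes that hold the file and are still alive."""
        live = set(self.live_datanodes(now))
        return [dn for dn in self.file_locations.get(filename, []) if dn in live]


if __name__ == "__main__":
    nn = NameNodeSketch(timeout_seconds=10)
    nn.receive_heartbeat("dn1", now=0)
    nn.receive_heartbeat("dn2", now=8)
    nn.record_file("logs.txt", ["dn1", "dn2"])
    print(nn.live_replicas("logs.txt", now=12))
```

Note that the sketch never stores file contents in the NameNode, matching the split of duties described above: metadata on the NameNode, actual data on the DataNodes.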
Hadoop Cluster and Racks
A rack is a physical collection of nodes in a Hadoop cluster (maybe 30 to 40). A large Hadoop cluster consists of many racks. Using this rack information, the NameNode chooses the closest DataNode for read and write operations, which maximizes performance and reduces network traffic.
A Hadoop cluster contains multiple racks, and each rack holds many DataNodes. Communication between DataNodes on the same rack is considerably faster than communication between DataNodes on two different racks.
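
The rack-based choice described above can be sketched as follows. This is an illustrative model, not Hadoop's code: the function names are hypothetical, nodes are represented as (rack, node) tuples, and the distance values 0, 2, and 4 are assumptions chosen only so that "same node" beats "same rack", which beats "different rack".

```python
def network_distance(a, b):
    """Distance between two (rack, node) locations; smaller means closer."""
    rack_a, node_a = a
    rack_b, node_b = b
    if rack_a == rack_b and node_a == node_b:
        return 0  # same machine
    if rack_a == rack_b:
        return 2  # different machines on the same rack
    return 4      # machines on different racks


def closest_datanode(client, replicas):
    """Pick the replica location with the smallest distance to the client."""
    return min(replicas, key=lambda replica: network_distance(client, replica))


if __name__ == "__main__":
    client = ("rack1", "node1")
    replicas = [("rack2", "node7"), ("rack1", "node3"), ("rack3", "node2")]
    print(closest_datanode(client, replicas))
```

Here the replica on rack1 is chosen because it shares a rack with the client, so the read avoids crossing between racks.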
Learning Linux Required for Hadoop
1) Command for uploading a file to HDFS
• hadoop fs -put
This command uploads a file from the local file system to HDFS. Multiple files can be uploaded by separating the file names with a space.
2) Command for downloading a file from HDFS
• hadoop fs -get
This command downloads a file from HDFS to the local file system. Multiple files can be downloaded by separating the file names with a space.
3) Command for viewing the contents of a file
• hadoop fs -cat
4) Command for moving files from source to destination
• hadoop fs -mv
5) Command for removing a file or directory in HDFS
• hadoop fs -rm
Note: plain -rm removes files only. To remove a directory and its contents, use -rm -r; to remove an empty directory, use -rmdir.
6) Command for copying files from the local file system to HDFS
• hadoop fs -copyFromLocal
7) Command to display the size of a file
• hadoop fs -du
8) Command to view the contents of a directory
• hadoop fs -ls
9) Command to create a directory in HDFS
• hadoop fs -mkdir
10) Command to display the beginning of a file (the first kilobyte)
• hadoop fs -head
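
As a rough sketch of how these commands are put together, the Python helper below builds the argument list for a few of them without running anything. The helper names (hadoop_fs, upload, download) and all file paths are hypothetical; on a machine with Hadoop installed, each list could be passed to subprocess.run to actually execute it.

```python
import shlex


def hadoop_fs(operation, *paths):
    """Build the argument list for one `hadoop fs` command (not executed)."""
    return ["hadoop", "fs", f"-{operation}", *paths]


def upload(local_paths, hdfs_dir):
    """-put: one or more local files, followed by the HDFS destination."""
    return hadoop_fs("put", *local_paths, hdfs_dir)


def download(hdfs_path, local_dir):
    """-get: the HDFS source, followed by the local destination."""
    return hadoop_fs("get", hdfs_path, local_dir)


if __name__ == "__main__":
    # All paths here are made-up examples.
    commands = [
        hadoop_fs("mkdir", "/user/demo"),
        upload(["a.txt", "b.txt"], "/user/demo"),
        hadoop_fs("ls", "/user/demo"),
        hadoop_fs("cat", "/user/demo/a.txt"),
        download("/user/demo/a.txt", "./downloads"),
    ]
    for command in commands:
        print(shlex.join(command))
```

The direction of the arguments is the thing to remember: -put goes from local to HDFS, while -get goes from HDFS to local.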
Hadoop Ecosystem Tools Overview

[Figure: Hadoop ecosystem tools overview, showing HDFS, MapReduce, Flume & Sqoop, Pig, Hive, Mahout, Custom MR, Impala, HBase, Spark, Storm, Zookeeper, Oozie, and Ambari]
1. Flume & Sqoop (data ingestion applications): Flume is used for log collection; Sqoop moves data from SQL databases into Hadoop.
2. Pig: data processing and analysis through a programming language.
3. Hive: interface with SQL-like functionality.
4. Mahout: a machine learning library that provides machine learning algorithms.
5. Custom MR: custom MapReduce programs written in Java etc.
• Disadvantages of MapReduce
a. Very slow
b. Batch processing only

Alternatives to MapReduce
6. Impala: SQL-like interface similar to Hive, but it does not use MapReduce; it has its own mechanism for accessing the data and the cluster.
7. HBase: a NoSQL database in which data is stored as key-value pairs.
8. Spark & Storm: process real-time (streaming) data.
9. Zookeeper: used for management and coordination of the cluster's services.
10. Oozie: workflow scheduler.
11. Ambari: web-based GUI for provisioning, managing, and monitoring Hadoop clusters.
