Hadoop Notes
Lesson 1 Notes
Introduction
Hi! Welcome to Fundamentals of Hadoop and MapReduce. My name's Sarah Sproehnle, and I'm the Vice President of Educational Services at Cloudera, a company which helps develop, support, and manage Hadoop.
And I'm Ian Wrigley, Cloudera's Senior Curriculum Manager. Between us, Sarah and I have been responsible for bringing Hadoop training to over 20,000 people, and we're excited to reach a much bigger audience here at Udacity. During this course we're going to discuss what big data is, what Hadoop is, why it's useful, and how to write MapReduce code.
By the end of the course, you'll be able to describe the kinds of problems Hadoop addresses, and you'll have written MapReduce programs to efficiently analyze very large Web server log files. In fact, you'll have had hands-on experience running a Hadoop job by the end of lesson two.
So, let's start. In this lesson, we're going to define 'big data', the sort of problems it introduces, and how to address those problems.
Sources of Data
Organizations have been generating data since way back, but as time goes on, more and more data is being generated. IBM estimates that as much as 90% of the data in the world today has been created in the last two years alone.
Just as a simple example, think about your cell phone. Whenever it's turned on, it's connecting to cell towers to get reception. As you move around, it will connect to different towers, and at different signal strengths depending on how far away from them you are. All of that connection data is collected by the phone company, and it's logged.
Copyright 2014 Udacity, Inc. All Rights Reserved.
They can use it to find dead spots in their coverage, to work out which towers are the busiest and need increased capacity... they can even trace you if you make an emergency call but don't give your exact location. That's an enormous amount of data right there.
Another example is when you visit a Web site like Amazon or Netflix. Everything you do there is logged: what pages you viewed, what products you looked at, how long you spent on each page... even things like what Web browser you were using and what sort of computer you were connecting from. Again, huge amounts of data.
And that's just in the corporate world. In medicine, for example, each X-ray creates huge amounts of potentially incredibly valuable information, and comparing large numbers of them can help us to detect similarities in tumors.
This increase in the amount of data we're generating opens up huge possibilities. But it comes with problems too. We have to store all that data, and we have to be able to process it in a sensible amount of time.
Let's start with a quick question. Which of these would you consider to be big data? You are not going to be graded on this answer, but give it your best guess.
[] order details for a purchase at a store
[] all orders across hundreds of branches nationwide
[] information about a person's stock portfolio
[] all stock transactions made on the New York Stock Exchange during the year
Answer:
For most people, the answers are going to be 2 and 4. A list of purchases at a single store is almost certainly small enough to be easily handled by a traditional relational database system or even just a spreadsheet. Orders from hundreds of stores nationwide, though, could start to overwhelm traditional systems. Likewise, information about a single person's stock portfolio is a small and easily managed chunk of data. But data on trades across the entire NYSE for a year will run into tens or hundreds of terabytes, and that's where traditional systems really do start to struggle.
Quiz: Challenges
But big data is about more than just the size of the data. What additional problems can you see in this field?
[] most data is worthless and it's hard to find the useful parts
[] it's hard to gather data
[] data is created very fast
[] data from different sources is in different formats
Answer:
The challenges with big data include that it is created very fast and that it comes from different sources, which can use a variety of formats. In my experience, most data is not worthless; it actually has a lot of value.
Volume
The price to store data has dropped incredibly over the last 60 years. In 1980, the cost per gigabyte was several hundred thousand dollars. In 2013, it's well under 10 cents.
Although it's worth saying that if you actually want to store the data reliably, you're going to end up paying rather more than that: probably several dollars per gigabyte, maybe even more.
That's particularly the case with more traditional data storage devices such as storage area networks, or SANs, which can be extremely expensive. The high cost of reliable storage puts a cap on the amount of data companies can practically store. At some point, they'd say, "OK, it's too expensive to store all that data that I'm not doing anything with. Let's just store the critical stuff: my actual sales, for example, rather than all that stuff about how long people spent on each page of my Web site." But it turns out, as we'll see, that the data they're currently throwing away can be incredibly useful. What we need is a cheaper way to store it reliably.
And of course storing the data is only one part of the equation: you also need to be able to read and process it efficiently. Storing a terabyte of data on a SAN isn't so hard, but streaming the data from the SAN across the network to some central processor can take a long time, and processing it can be extremely slow.
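To get a rough feel for why streaming data to a central processor is slow, the back-of-envelope calculation below estimates how long it takes to move one terabyte over a single link, compared with reading it in parallel from many machines. The throughput figure (100 MB/s, roughly what a 1 Gbit/s link might sustain) and the machine count are illustrative assumptions, not numbers from the course.

```python
# Back-of-envelope estimate; every figure below is an illustrative assumption.

TERABYTE = 10**12  # bytes (decimal terabyte)


def transfer_seconds(size_bytes, bytes_per_second, parallel_readers=1):
    """Time to read size_bytes when the work is split evenly across readers."""
    if bytes_per_second <= 0 or parallel_readers <= 0:
        raise ValueError("throughput and reader count must be positive")
    return size_bytes / (bytes_per_second * parallel_readers)


assumed_throughput = 100 * 10**6  # 100 MB/s per link or disk (assumed)

# Everything funnelled through one link to a central processor.
single = transfer_seconds(TERABYTE, assumed_throughput)

# The same data read by 100 machines, each handling its own share.
spread = transfer_seconds(TERABYTE, assumed_throughput, parallel_readers=100)

print(f"one link: {single / 3600:.1f} hours")
print(f"100 machines in parallel: {spread / 60:.1f} minutes")
```

Under these assumptions the single link needs hours while the parallel read needs minutes, which is the gap that motivates spreading both storage and processing across many machines.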
Quiz: Volume
Which of the following data do you think is worth storing and analyzing?
[] transactions (financial, government related)
[] logs (records of activity, location)
[] business data (product catalogs, prices, customers)
[] user data (images, documents, video)
[] sensor data (temperature, pollution)
[] medical data (X-rays, brain activity records)
[] social (email, Twitter, etc.)
Answer:
And the answer is that all of these can provide useful information. But in order to store it, you'll need a way to scale your storage capacity up to massive volume. Hadoop, which stores data in a distributed way across multiple machines, does that. You'll see just how in the next lesson.
Variety
The second V is data variety. For a long time, people have used databases to store and process their data: either smaller databases like MySQL, or big data warehouses based on software from companies like Oracle and IBM. But for a data warehouse to effectively process information, all that information has to fit nicely into a predefined set of tables. The problem is that these days, lots of the data you want to store is what we tend to call unstructured data, or semi-structured data. Sarah can give us some examples.
By unstructured, we mean the data arrives in lots of different formats. For example, a bank might have a list of your credit card and account transactions, but they may also have scans of your checks, records of your interactions with customer service representatives on the Web and over the phone, perhaps even recordings of those phone calls. All of that data is in a variety of different formats, and it can be hard to store and reconcile it all using traditional systems.
And this also ties back to volume. You want to store that data in its original format so you're not throwing any information away. That way you can then process the data later in different ways you might not even have thought of originally.
For instance, if we just transcribe call center conversations into text, we have what people said to our customer service representatives. But if we have the actual recordings, then later on we might develop software which can interpret the tone of voice the customer uses, and that might lead to a very different interpretation of the data. And the nice thing about Hadoop is that it doesn't care what format the data comes in. Unlike a traditional database, you can just store the data in its raw format, and manipulate and reformat it later.
Sometimes the most unlikely data can be extremely useful and lead to savings due to better planning. For example, a conventional system for coordinating logistics might send the closest truck to the warehouse to pick up the package. However, it might be that the closest truck is not the best solution: perhaps there are traffic jams, or the most direct route is on small roads that would take longer to drive. Maybe the truck doesn't have enough free space for the new load. So what kind of data would be helpful in making a better plan that could save money and time for the company?
[] Current GPS location from all trucks
[] Current itineraries for all trucks
[] Current traffic speed in related areas as reported by services such as Waze
[] Current load of trucks by volume and weight
[] Fuel efficiency of the different vehicles
Answer:
And again all of these answers are correct. You can save a lot of money, and time, by making better decisions, driven by more varied data. The world we live in is extremely complex, and there are a lot of variables to consider that you can tweak to get large benefits.
Velocity
Velocity, the third V, is about the speed at which the data arrives, ready to be processed. We need to be able to accept and store that data even when it's coming in at a rate of terabytes or more a day, which is often the case. If we can't store it as it arrives, we'll end up discarding some of it, and that's what we absolutely want to avoid.
Think about an e-commerce Web site. If we know what products you've looked at in the past, we could recommend similar products the next time you visit our site. If you spent five minutes looking at a particular item, we could maybe send you an email informing you when that item is on sale. If we know that you typically browse our site using a first-generation iPad, we could suggest the latest model.
This is a huge difference from what we would do before, when we only stored records of actual purchases. If we can store and process all of our Web server log files, along with the purchase data that's in our traditional data warehouse, we can give the customer a much better shopping experience, which should directly translate into bigger profits.
Yet another example is a movie site like Netflix. Based on what they know about your viewing habits, they can recommend movies to you. As you can see here, because of what Ian's rated highly before, the movie on the left is recommended for him, and they can even predict what rating he'll give the movie.
So there are plenty of things we can do with big data. But first we have to solve a couple of problems. We need to be able to store the data in a cost-effective way, and we need to be able to process it efficiently. And it turns out that these are not easy problems to solve when we're talking about massive amounts of data. Fortunately, though, some extremely smart people at Google were working on them in the late 1990s and released the results of their work as research papers in 2003 and 2004. Let's see what Doug Cutting, one of the founders of Hadoop, has to say.
So, let me tell you how Hadoop came to be. About ten years ago, in around 2003, I was working on an open source web search engine called Nutch, and we knew it needed to be something very scalable, because the Web was, you know, billions of pages: terabytes, petabytes of data that we needed to be able to process. We set about doing the best job we could, and it was tough. We got things up and running on four or five machines, not very well, and around that time Google published some papers about how they were doing things internally. They published a paper about their distributed file system, GFS, and about their processing framework, MapReduce. So my partner in this project at the time, Mike Cafarella, and I set about trying to reimplement these in open source, so that more people could use them than just folks at Google. It took us a couple of years, and we had Nutch up and running on, instead of four or five machines, 20 to 40 machines. It wasn't perfect, it wasn't totally reliable, but it worked. And we realized that to get it to the point where it scaled to thousands of machines, and was as bulletproof as it needed to be, would take more than just the two of us, working part-time.
Around that time, Yahoo approached me and said they were interested in investing in this. So I went to work for Yahoo in January of 2006. The first thing we did there was take the parts of Nutch that were a distributed computing platform and put them into a separate project, a new project christened Hadoop. Over the next couple of years, with Yahoo's help and the help of others, we took Hadoop and really got it to the point where it did scale to petabytes, running on thousands of processors, and doing so quite reliably.
It spread to lots of companies, mostly in the Internet sector, and became quite a success. After that, we started to see a bunch of other projects grow up around it. And Hadoop's grown to be the kernel of what is pretty much an operating system for big data. We've got tools that allow you to more easily do MapReduce programming, so you can develop using SQL or a data flow language called Pig. And we've also got the beginnings of higher-level tools. We've got interactive SQL with Impala. We've got Search. And so we're really seeing this develop into a general-purpose platform for data processing that scales much better and is much more flexible than anything else that's out there.
That's the story of the genesis of Hadoop: it's based on work done by the folks at Google, and it's grown from small beginnings to the point now where hundreds of people contribute to the project, and where it's being used by thousands and thousands of companies worldwide. The Hadoop logo is actually a little yellow elephant, but do you know where the name came from? There's a funny story attached to that. Here's Doug again.
So the name Hadoop comes from my son's toy elephant. When he was about two, a friend gave him a little stuffed elephant which he played with incessantly. And we overheard him calling it something, this strange word that he invented, and he said Hadoop. So I immediately wrote it down, because I was in the software business, and we're always looking for good names. And this one came with a mascot, even. And a few years later, when I needed a project name, I pulled it out. Now, I wrote it down as HADOOP, and figured that everyone would say Hadoop. Now it turns out everyone says Hadoop instead, but I persist in saying Hadoop. Now my son, of course, is 13, and expects royalties for the name. He wants more credit. He also accuses me of stealing the toy. At some point, he was using it in some kind of rocket ship experiment, and I had to rescue it. And now it lives in my sock drawer, for safety.
Hadoop Cluster
The core Hadoop project consists of a way to store data, known as the Hadoop Distributed File System, or HDFS, and a way to process the data, called MapReduce. The key concept is that we split the data up and store it across a collection of machines, known as a cluster. Then, when we want to process the data, we process it where it's actually stored. Rather than retrieving the data from a central server, it's already on the cluster, and we can process it in place.
You can add more machines to the cluster (make the cluster bigger) as the amount of data you're storing grows and, indeed, many people start with just a few machines and add more as they're needed. The machines in the cluster don't need to be particularly high-end: although most clusters are built using rack-mount servers, they are typically mid-range servers rather than top-of-the-range equipment.
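The idea of splitting data across a cluster and processing it in place can be sketched with a toy simulation. Everything here is an assumption made for illustration: the node names, the one-line block size, and the round-robin placement are invented, and real HDFS block placement and MapReduce scheduling work differently.

```python
# Toy simulation of "store the data across a cluster, process it where it lives".
# Node names, block size, and placement policy are made up for illustration.

from collections import Counter


def split_into_blocks(lines, block_size):
    """Cut the input into fixed-size blocks of lines."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def place_blocks(blocks, nodes):
    """Assign blocks to nodes round-robin (a stand-in for real block placement)."""
    placement = {node: [] for node in nodes}
    for index, block in enumerate(blocks):
        placement[nodes[index % len(nodes)]].append(block)
    return placement


def count_words_locally(blocks):
    """Work each node does on its own blocks, without moving the raw data."""
    counts = Counter()
    for block in blocks:
        for line in block:
            counts.update(line.split())
    return counts


lines = ["big data", "hadoop stores data", "hadoop processes data", "data data"]
nodes = ["node1", "node2", "node3"]  # hypothetical machine names

placement = place_blocks(split_into_blocks(lines, block_size=1), nodes)

# Only the small per-node summaries are combined, not the raw lines.
partial_counts = [count_words_locally(blocks) for blocks in placement.values()]
total = sum(partial_counts, Counter())

print(total["data"], total["hadoop"])
```

Adding a machine in this sketch just means adding a name to `nodes`; the blocks are then spread over more workers, which loosely mirrors growing a cluster as the data grows.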
Hadoop Ecosystem
Core Hadoop consists of HDFS and MapReduce.
But since the project was first started, an awful lot of other software has grown up around it. And that's what we call the Hadoop Ecosystem. Some of the software is intended to make it easy to load data into the Hadoop cluster, while lots of it is designed to make Hadoop easier to use. For example, as you'll see in the next lesson, writing MapReduce code isn't completely simple. You need to know a programming language like Java, or Python, or Ruby, or Perl. But there are lots of folks out there who aren't programmers but who can write SQL queries to access data in a traditional relational database like SQL Server. And of course a lot of business intelligence tools also want to hook into Hadoop.
For that reason, other open source projects have been created to make it easier for people to query their data without knowing how to code. Two key ones are Hive and Pig. Instead of having to write Mappers and Reducers, in Hive you just write statements which look very much like standard SQL. The Hive interpreter turns that SQL into MapReduce code, which it then runs on the cluster. And an alternative is Pig, which allows you to write code to analyze your data in a fairly simple scripting language rather than MapReduce. Again, the code is turned into actual Java MapReduce and run on the cluster.
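To make the idea of turning a query into MapReduce more concrete, the sketch below shows how a query along the lines of SELECT page, COUNT(*) FROM logs GROUP BY page could be expressed as a map step and a reduce step. It is a plain-Python illustration with invented sample data and function names; it is not the code Hive or Pig actually generates, and it does not use the Hadoop API.

```python
# Conceptual illustration only: plain Python, not Hive output or the Hadoop API.
# Roughly mirrors: SELECT page, COUNT(*) FROM logs GROUP BY page

from itertools import groupby
from operator import itemgetter


def mapper(record):
    """Emit a (key, 1) pair for each log record; the key is the page."""
    yield record["page"], 1


def reducer(key, values):
    """Sum all the counts that share one key."""
    return key, sum(values)


def run_job(records):
    # Map phase: every record produces intermediate key/value pairs.
    pairs = [pair for record in records for pair in mapper(record)]
    # Shuffle and sort: bring pairs with the same key next to each other.
    pairs.sort(key=itemgetter(0))
    # Reduce phase: one call per distinct key.
    return [
        reducer(key, (value for _, value in group))
        for key, group in groupby(pairs, key=itemgetter(0))
    ]


logs = [  # invented sample data
    {"page": "/home"},
    {"page": "/cart"},
    {"page": "/home"},
]

result = run_job(logs)
print(result)
```

The GROUP BY column plays the role of the key the mapper emits, and the aggregate (here a count) is what the reducer computes for each key.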
Hive and Pig are great, but they're still running MapReduce jobs, which means they will take a reasonable amount of time, especially when running on really large amounts of data. So another open source project called Impala was developed, which again allows you to query your data using SQL but which directly accesses that data, rather than accessing it via MapReduce. Impala is optimized for low-latency queries (in other words, Impala queries run very quickly, typically many times faster than Hive queries), while Hive is optimized for long-running batch processing jobs.
Another project used by many people is Sqoop. That takes data from a traditional relational database server such as Microsoft SQL Server and puts it in HDFS as delimited files so it can be processed along with the other data on the cluster. Then there's Flume, which ingests data as it's generated by external systems. HBase is a real-time database built on top of HDFS. Hue is a graphical front end to the cluster. Oozie is a workflow management tool. Mahout is a machine learning library.
In fact, there are so many different ecosystem projects that making them all talk to each other, and work well with each other, can be tricky. To make installing and maintaining a cluster easier, Cloudera, the company we work for, has put together a distribution of Hadoop called CDH. This takes all the key ecosystem projects, along with Hadoop itself, and packages them together so that installation is a really simple process. And the components are all tested together, so you can be sure that there are no incompatibilities between them. Of course it's completely free and open source, just like Hadoop itself. You could install everything from scratch yourself, but it's far easier to use CDH, and that's certainly what we'd recommend. In the next lesson, in fact, you'll be downloading and running a virtual machine which has CDH installed.
Conclusion
So in this lesson you learned what big data is, and how Hadoop can help with big data problems. In the next lesson, we'll take a deeper look at the two key parts of Hadoop: that's HDFS, the Hadoop Distributed File System, and MapReduce, the way you can process that data.