Hadoop Notes

Hadoop Handbook

Uploaded by

Vijay Vishwanath Thombare

Intro to Hadoop and MapReduce

Lesson 1 Notes

Introduction

Hi! Welcome to Fundamentals of Hadoop and MapReduce. My name's Sarah
Sproehnle, and I'm the Vice President of Educational Services at Cloudera, a
company which helps develop, support, and manage Hadoop.

And I'm Ian Wrigley, Cloudera's Senior Curriculum Manager. Between us, Sarah
and I have been responsible for bringing Hadoop training to over 20,000 people,
and we're excited to reach a much bigger audience here at Udacity. During this
course we're going to discuss what big data is, what Hadoop is, why it's useful,
and how to write MapReduce code.

By the end of the course, you'll be able to describe the kinds of problems Hadoop addresses,
and you'll have written MapReduce programs to efficiently analyze very large Web server log
files. In fact, you'll have had hands-on experience running a Hadoop job by the end of lesson two.

So, let's start. In this lesson, we're going to define 'big data', the sort of problems it
introduces, and how to address those problems.

Sources of Data

Organizations have been generating data since way back, but as time goes on, more and more
data is being generated. IBM estimates that as much as 90% of the data in the world today
has been created in the last two years alone.

Just as a simple example, think about your cell phone. Whenever it's turned on, it's
connecting to cell towers to get reception. As you move around, it will connect to different
towers, and at different signal strengths depending on how far away from them you are. All of
that connection data is collected by the phone company, and it's logged.

Copyright 2014 Udacity, Inc. All Rights Reserved.

They can use it to find dead spots in their coverage, to work out which towers are the busiest
and need increased capacity... they can even trace you if you make an emergency call but don't
give your exact location. That's an enormous amount of data right there.

Another example is when you visit a Web site like Amazon or Netflix. Everything you do there
is logged: what pages you viewed, what products you looked at, how long you spent on each
page... even things like what Web browser you were using and what sort of computer you were
connecting from. Again, huge amounts of data.

And that's just in the corporate world. In medicine, for example, each X-ray creates huge
amounts of potentially incredibly valuable information, and comparing large numbers of them can
help us to detect similarities in tumors.

This increase in the amount of data we're generating opens up huge possibilities. But it comes
with problems too. We have to store all that data, and we have to be able to process it in a
sensible amount of time.

Quiz: What is a Big Data problem?

This course is about Hadoop, and how it helps to deal with Big Data. But not everything is
actually a big data problem. There are lots of cases where you can use traditional systems to
store, manage, and process your data. So the first thing you need to do is decide if what you
have really does fall under the heading of big data in the first place. And to make that call, we
have to create some kind of definition for what big data is.

Let's start with a quick question. Which of these would you consider to be big data? You are not
going to be graded on this answer, but give it your best guess.

[ ] order details for a purchase at a store
[ ] all orders across hundreds of branches nationwide
[ ] information about a person's stock portfolio
[ ] all stock transactions made on the New York Stock Exchange during the year

Answer:
For most people, the answers are going to be 2 and 4. A list of purchases at a single store is
almost certainly small enough to be easily handled by a traditional relational database system
or even just a spreadsheet. Orders from hundreds of stores nationwide, though, could start to
overwhelm traditional systems. Likewise, information about a single person's stock portfolio is a
small and easily managed chunk of data. But data on trades across the entire NYSE for a year
will run into tens or hundreds of terabytes, and that's where traditional systems really do start
to struggle.

Definition of Big Data

There's no one definition for big data; it's a very subjective term. Most people would consider a
data set of terabytes or more to be big data, but there are certainly people using Hadoop with
great success on smaller chunks of data than that. One reasonable definition is that it's data
which can't comfortably be processed on a single machine.

Quiz: Challenges
But Big Data is more than just the size of the data. What additional problems can you see in
this field?

[ ] most data is worthless and it's hard to find the useful parts
[ ] it's hard to gather data
[ ] data is created very fast
[ ] data from different sources is in different formats

Answer:
The challenges with big data are that it is created very fast and that it comes from different
sources, which may use a variety of formats. In my experience, most data is not worthless
but actually does have a lot of value.

The 3 Vs of Big Data:

When you read or talk about Big Data, you'll often hear people refer to the three Vs. Volume
refers to the size of data that you're dealing with, Variety refers to the fact that the data is
often coming from lots of different sources and in many different formats, and Velocity refers to
the speed at which the data is being generated, and the speed at which it needs to be made
available for processing. So let's look in more detail at each of them.

Volume
The price to store data has dropped incredibly over the last 60 years. In 1980, the cost per
gigabyte was several hundred thousand dollars. In 2013, it's well under 10 cents.

Although it's worth saying that if you actually want to store the data reliably, you're going to
end up paying rather more than that: probably several dollars per gigabyte, maybe even more.

That's particularly the case with more traditional data storage devices such as storage area
networks, or SANs, which can be extremely expensive. The high cost of reliable storage puts a
cap on the amount of data companies can practically store. At some point, they'd say, "OK, it's
too expensive to store all that data that I'm not doing anything with. Let's just store the
critical stuff: my actual sales, for example, rather than all that stuff about how long people
spent on each page of my Web site." But it turns out, as we'll see, that the data they're
currently throwing away can be incredibly useful. What we need is a cheaper way to store it
reliably.

And of course storing the data is only one part of the equation; you also need to be able to read
and process it efficiently. Storing a terabyte of data on a SAN isn't so hard, but streaming the
data from the SAN across the network to some central processor can take a long time, and
processing it can be extremely slow.

Quiz: Volume

Which of the following data do you think is worth storing and analyzing?

[ ] transactions (financial, government related)
[ ] logs (records of activity, location)
[ ] business data (product catalogs, prices, customers)
[ ] user data (images, documents, video)
[ ] sensor data (temperature, pollution)
[ ] medical data (x-rays, brain activity records)
[ ] social (email, Twitter, etc.)

Answer:
And the answer is that all of these can provide useful information. But in order to store it,
you'll need a way to scale your storage capacity up to massive volume. Hadoop, which stores data
in a distributed way across multiple machines, does that. You'll see just how in the next lesson.

Variety

The second V is data variety. For a long time, people have used databases to store and
process their data: either smaller databases like MySQL, or big data warehouses based on
software from companies like Oracle and IBM. But for a data warehouse to effectively process
information, all that information has to fit nicely into a predefined set of tables. The problem
is that these days, lots of the data you want to store is what we tend to call unstructured data,
or semi-structured data. Sarah can give us some examples.

By unstructured, we mean the data arrives in lots of different formats. For example, a bank
might have a list of your credit card and account transactions, but they may also have scans of
your checks, records of your interactions with customer service representatives on the Web and
over the phone, perhaps even recordings of those phone calls. All of that data is in a variety of
different formats, and it can be hard to store and reconcile it all using traditional systems.

And this also ties back to volume. You want to store that data in its original format so you're
not throwing any information away. That way you can then process the data later in different ways
you might not even have thought of originally.

For instance, if we just transcribe call center conversations into text, we have what people said
to our customer service representatives. But if we have the actual recordings, then later on we
might develop software which can interpret the tone of voice the customer uses, and that might
lead to a very different interpretation of the data. And the nice thing about Hadoop is that it
doesn't care what format the data comes in. Unlike a traditional database, you can just store the
data in its raw format, and manipulate and reformat it later.

Quiz: Data Variety

Sometimes the most unlikely data can be extremely useful and lead to savings due to better
planning. For example, a conventional logistics coordination system might send the closest
truck to the warehouse to pick up the package. However, it might be that the closest truck is
not the best solution: perhaps there are traffic jams, or the most direct route is on small
roads that would take longer to drive. Maybe the truck doesn't have enough free space for the
new load. So what kind of data would be helpful in making a better plan that could save money
and time for the company?

[ ] Current GPS location from all trucks
[ ] Current itineraries for all trucks
[ ] Current traffic speed in related areas as reported by services such as Waze
[ ] Current load of trucks by volume and weight
[ ] Fuel efficiency of the different vehicles

Answer:
And again, all of these answers are correct. You can save a lot of money, and time, by making
better decisions, driven by more varied data. The world we live in is extremely complex, and
there are a lot of variables to consider that you can tweak to get large benefits.

Velocity

Velocity, the third V, is about the speed at which the data arrives, ready to be processed. We
need to be able to accept and store that data even when it's coming in at a rate of terabytes or
more a day, which is often the case. If we can't store it as it arrives, we'll end up discarding
some of it, and that's what we absolutely want to avoid.
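
To get a feel for what "terabytes a day" means for a storage system, here is a rough
back-of-the-envelope sketch. It assumes decimal units (1 TB = 1,000,000 MB) and a perfectly even
arrival rate, which real traffic will not have; the function name is just for illustration.

```python
# Back-of-the-envelope: sustained write rate needed to keep up with a daily volume.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def required_mb_per_second(terabytes_per_day: float) -> float:
    """Average ingest rate in MB/s, assuming decimal units (1 TB = 1,000,000 MB)."""
    megabytes_per_day = terabytes_per_day * 1_000_000
    return megabytes_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    for tb in (1, 10, 100):
        rate = required_mb_per_second(tb)
        print(f"{tb:>4} TB/day needs about {rate:,.1f} MB/s sustained")
```

One terabyte a day works out to roughly 11.6 MB/s around the clock, and peaks will be higher
than the average, so the storage layer needs headroom beyond this figure.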

What problems can we solve?

Think about an e-commerce Web site. If we know what products you've looked at in the past, we
could recommend similar products the next time you visit our site. If you spent five minutes
looking at a particular item, we could maybe send you an email informing you when that item is
on sale. If we know that you typically browse our site using a first-generation iPad, we could
suggest the latest model.

This is a huge difference to what we would do before, when we only stored records of actual
purchases. If we can store and process all of our Web server log files, along with the purchase
data that's in our traditional data warehouse, we can give the customer a much better shopping
experience, which should directly translate into bigger profits.
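
As a small illustration of combining browsing logs with purchase records, here is a sketch that
finds products a user looked at but never bought. The log format ("user action product") and all
the identifiers are made up for this example; real Web server logs are far messier.

```python
from collections import Counter
from typing import List

# Hypothetical, simplified log format: "<user_id> <action> <product_id>"
SAMPLE_LOG = """\
u1 view p10
u1 view p10
u1 view p22
u1 purchase p22
u2 view p10
"""


def viewed_not_purchased(log_text: str, user_id: str) -> List[str]:
    """Products the user viewed but did not buy, most-viewed first."""
    views: Counter = Counter()
    purchased = set()
    for line in log_text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip malformed lines rather than failing
        user, action, product = parts
        if user != user_id:
            continue
        if action == "view":
            views[product] += 1
        elif action == "purchase":
            purchased.add(product)
    return [product for product, _ in views.most_common() if product not in purchased]


if __name__ == "__main__":
    print(viewed_not_purchased(SAMPLE_LOG, "u1"))
```

A site could use a list like this to decide which sale notifications to send. On a single small
file this runs anywhere; the point of Hadoop is to do the same kind of thing when the logs no
longer fit on one machine.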

Yet another example is a movie site like Netflix. Based on what they know about your viewing
habits, they can recommend movies to you. As you can see here, because of what Ian's rated
highly before, the movie on the left is recommended for him, and they can even predict what
rating he'll give the movie.

History of solving data problems

So there are plenty of things we can do with big data. But first we have to solve a couple of
problems. We need to be able to store the data in a cost-effective way, and we need to be able
to process it efficiently. And it turns out that these are not easy problems to solve when we're
talking about massive amounts of data. Fortunately, though, some extremely smart people at
Google were working on them in the late 1990s and released the results of their work as
research papers in 2003 and 2004. Let's see what Doug Cutting, one of the founders of Hadoop,
has to say.

DOUG CUTTING about History of Hadoop:

So, let me tell you how Hadoop came to be. About ten years ago, in around 2003, I was working
on an open source web search engine called Nutch, and we knew it needed to be something very
scalable, because the Web was, you know, billions of pages: terabytes, petabytes of data that
we needed to be able to process. We set about doing the best job we could, and it was tough. We
got things up and running on four or five machines, not very well, and around that time Google
published some papers about how they were doing things internally. They published a paper
about their distributed file system, GFS, and one about their processing framework, MapReduce.
So my partner in this project at the time, Mike Cafarella, and I set about trying to reimplement
these in open source, so that more people could use them than just folks at Google. It took us
a couple of years, and we had Nutch up and running on, instead of four or five machines, 20 to
40 machines. It wasn't perfect, it wasn't totally reliable, but it worked. And we realized that
to get it to the point where it scaled to thousands of machines, and was as bulletproof as it
needed to be, would take more than just the two of us working part time.

Around that time, Yahoo approached me and said they were interested in investing in this. So I
went to work for Yahoo in January of 2006. The first thing we did there was take the parts of
Nutch that were a distributed computing platform and put them into a separate project, a new
project christened Hadoop. Over the next couple of years, with Yahoo's help and the help of
others, we took Hadoop and really got it to the point where it did scale to petabytes, running
on thousands of processors, and doing so quite reliably.

It spread to lots of companies, mostly in the Internet sector, and became quite a success. After
that, we started to see a bunch of other projects grow up around it. And Hadoop's grown to be
the kernel of pretty much an operating system for big data. We've got tools that allow you to
more easily do MapReduce programming, so you can develop using SQL or a data flow language
called Pig. And we've also got the beginnings of higher-level tools. We've got interactive SQL
with Impala. We've got Search. And so we're really seeing this develop into a general-purpose
platform for data processing that scales much better, and that is much more flexible, than
anything else that's out there.

That's the story of the genesis of Hadoop: it's based on work done by the folks at Google, and
it's grown from small beginnings to the point now where hundreds of people contribute to the
project, and where it's being used by thousands and thousands of companies worldwide. The
Hadoop logo is actually a little yellow elephant, but do you know where the name came from?
There's a funny story attached to that. Here's Doug again.

DOUG about Name of Hadoop

So the name Hadoop comes from my son's toy elephant. When he was about two, a friend gave him
a little stuffed elephant which he played with incessantly. And we overheard him calling it
something, this strange word that he invented, and he said "Hadoop". So I immediately wrote it
down, because I was in the software business, and we're always looking for good names. And this
one came with a mascot, even. And a few years later, when I needed a project name, I pulled it
out. Now, I wrote it down as H-A-D-O-O-P, and figured that everyone would say Hadoop. Now it
turns out everyone says Hadoop instead, but I persist in saying Hadoop. Now my son, of course,
is 13, and expects royalties for the name. He wants more credit. He also accuses me of stealing
the toy. At some point, he was using it in some kind of rocket ship experiment, and I had to
rescue it. And now it lives in my sock drawer, for safety.

Hadoop Cluster
The core Hadoop project consists of a way to store data, known as the Hadoop Distributed File
System, or HDFS, and a way to process the data, called MapReduce. The key concept is that we
split the data up and store it across a collection of machines, known as a cluster. Then, when
we want to process the data, we process it where it's actually stored. Rather than retrieving
the data from a central server, it's already on the cluster, and we can process it in place. You
can add more machines to the cluster (make the cluster bigger) as the amount of data you're
storing grows and, indeed, many people start with just a few machines and add more as they're
needed. The machines in the cluster don't need to be particularly high-end: although most
clusters are built using rack-mount servers, they are typically mid-range servers rather than
top-of-the-range equipment.
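
The "store it across machines, then process it where it lives" idea can be sketched in a few
lines. This is only a toy simulation on one machine, assuming made-up helper names and a simple
round-robin placement; it is not how HDFS actually places data, which is covered in the next
lesson.

```python
from collections import Counter
from typing import Dict, List


def split_across_nodes(records: List[str], num_nodes: int) -> List[List[str]]:
    """Distribute records round-robin across simulated nodes (illustrative placement only)."""
    nodes: List[List[str]] = [[] for _ in range(num_nodes)]
    for index, record in enumerate(records):
        nodes[index % num_nodes].append(record)
    return nodes


def process_locally(node_records: List[str]) -> Counter:
    """Work done on the node that already holds the data: count each distinct record."""
    return Counter(node_records)


def combine(partials: List[Counter]) -> Dict[str, int]:
    """Merge the small per-node results; only these summaries cross the 'network'."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


if __name__ == "__main__":
    records = ["a", "b", "a", "c", "a", "b"]
    nodes = split_across_nodes(records, num_nodes=3)
    print(combine([process_locally(node) for node in nodes]))
```

The design point is that each node only ever touches its own slice of the data, and only the
small per-node summaries are brought together. Adding a node means each slice gets smaller.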

Hadoop Ecosystem

Core Hadoop consists of HDFS and MapReduce.

But since the project was first started, an awful lot of other software has grown up around it.
And that's what we call the Hadoop Ecosystem. Some of the software is intended to make it easy to
load data into the Hadoop cluster, while lots of it is designed to make Hadoop easier to use. For
example, as you'll see in the next lesson, writing MapReduce code isn't completely simple. You
need to know a programming language like Java, or Python, or Ruby, or Perl. But there are lots
of folks out there who aren't programmers but who can write SQL queries to access data in a
traditional relational database like SQL Server. And of course a lot of business intelligence
tools also want to hook into Hadoop.

For that reason, other open source projects have been created to make it easier for people to
query their data without knowing how to code. Two key ones are Hive and Pig. Instead of having
to write Mappers and Reducers, in Hive you just write statements which look very much like
standard SQL. The Hive interpreter turns that SQL into MapReduce code, which it then runs on the
cluster. And an alternative is Pig, which allows you to write code to analyse your data in a
fairly simple scripting language rather than MapReduce; again, the code is turned into actual
Java MapReduce and run on the cluster.
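
To make the "SQL turned into MapReduce" idea concrete, here is a hand-written sketch of how a
query such as SELECT page, COUNT(*) FROM hits GROUP BY page could be expressed as a mapper and a
reducer. The table and column names are hypothetical, and this is not the code Hive actually
generates; it only shows the shape of the translation.

```python
from itertools import groupby
from operator import itemgetter
from typing import Dict, Iterable, Iterator, List, Tuple


def mapper(rows: Iterable[Dict[str, str]]) -> Iterator[Tuple[str, int]]:
    """Emit (group-by key, 1) for every input row."""
    for row in rows:
        yield (row["page"], 1)


def reducer(sorted_pairs: Iterable[Tuple[str, int]]) -> Iterator[Tuple[str, int]]:
    """Sum the counts for each key; expects pairs already sorted by key."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield (key, sum(count for _, count in group))


def run_query(rows: Iterable[Dict[str, str]]) -> List[Tuple[str, int]]:
    """Map, then sort by key (standing in for the shuffle step), then reduce."""
    shuffled = sorted(mapper(rows), key=itemgetter(0))
    return list(reducer(shuffled))


if __name__ == "__main__":
    hits = [{"page": "/home"}, {"page": "/cart"}, {"page": "/home"}]
    print(run_query(hits))
```

The person writing the query only has to think about the one-line SQL statement; the mapper,
the sort, and the reducer are what the tool takes care of on their behalf.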

Hive and Pig are great, but they're still running MapReduce jobs, which means they will take a
reasonable amount of time, especially when running on really large amounts of data. So another
open source project called Impala was developed, which again allows you to query your data
using SQL but which directly accesses that data, rather than accessing it via MapReduce.
Impala is optimized for low-latency queries (in other words, Impala queries run very quickly,
typically many times faster than Hive queries), while Hive is optimized for long-running batch
processing jobs.

Another project used by many people is Sqoop. That takes data from a traditional relational
database server such as Microsoft SQL Server and puts it in HDFS as delimited files so it can be
processed along with the other data on the cluster. Then there's Flume, which ingests data as
it's generated by external systems. HBase is a real-time database built on top of HDFS. Hue is a
graphical front end to the cluster. Oozie is a workflow management tool. Mahout is a machine
learning library.
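
The Sqoop idea mentioned above, turning a relational table into delimited text, can be
illustrated with nothing but the Python standard library. This sketch uses an in-memory SQLite
database and a made-up orders table as stand-ins; it does not use Sqoop or HDFS, and the function
names are hypothetical.

```python
import csv
import io
import sqlite3


def build_sample_db() -> sqlite3.Connection:
    """Create a tiny in-memory database standing in for a real relational server."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, store TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "north", 19.99), (2, "south", 5.0)],
    )
    return conn


def export_table_as_delimited(conn: sqlite3.Connection, table: str, delimiter: str = ",") -> str:
    """Dump every row of a table as delimited text, one row per line."""
    # The table name is interpolated directly, which is only acceptable for
    # trusted, illustrative input like this example.
    cursor = conn.execute(f"SELECT * FROM {table} ORDER BY rowid")
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(cursor)
    return buffer.getvalue()


if __name__ == "__main__":
    print(export_table_as_delimited(build_sample_db(), "orders"))
```

Once the rows are plain delimited text, they can sit alongside log files and other raw data and
be processed by the same jobs.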

In fact, there are so many different ecosystem projects that making them all talk to each other,
and work well with each other, can be tricky. To make installing and maintaining a cluster
easier, Cloudera, the company we work for, has put together a distribution of Hadoop called CDH.
This takes all the key ecosystem projects, along with Hadoop itself, and packages them together
so that installation is a really simple process. And the components are all tested together, so
you can be sure that there are no incompatibilities between them. Of course it's completely free
and open source, just like Hadoop itself. You could install everything from scratch yourself, but
it's far easier to use CDH, and that's certainly what we'd recommend. In the next lesson, in
fact, you'll be downloading and running a virtual machine which has CDH installed.

Conclusion
So in this lesson you learned what big data is, and how Hadoop can help with big data
problems. In the next lesson, we'll take a deeper look at the two key parts of Hadoop: that's
HDFS, the Hadoop Distributed File System, and MapReduce, the way you can process that
data.
