BDA Techneo

Index

Module No.   Chapter Name
1            Introduction to Big Data and Hadoop
2            Hadoop HDFS and MapReduce
3            NoSQL
4            Mining Data Streams
5            Finding Similar Items and Clustering
6            Real-Time Big Data Models

MODULE 1
Introduction to Big Data and Hadoop

Syllabus: Introduction to Big Data, Big Data characteristics, Types of Big Data, Traditional vs. Big Data business approach, Case Study of Big Data Solutions, Concept of Hadoop, Core Hadoop Components, Hadoop Ecosystem.

Contents:
Introduction to Big Data
What is Big Data?
Characteristics of Big Data
Types of Big Data
Traditional versus Big Data approach
Case study of Big Data solutions
Concept of Hadoop
Core components of Hadoop
Hadoop Common Package
Hadoop Distributed File System (HDFS)
Hadoop MapReduce
Yet Another Resource Negotiator (YARN)
Hadoop Ecosystem

Introduction to Big Data

Businesses and governments record enormous volumes of transactions, and data on the order of terabytes (10^12 bytes) or more is gathered through data tracking and mobile devices. These large and complex datasets are generated continuously and collected in various formats. The task is not only to store and process them, but also to extract insights from them.

What is Big Data?

Big data refers to the massive datasets that are collected from various sources for business needs, to reveal new insights for optimized decision making. According to IBM sources, businesses and consumer life create 2.5 exabytes of data per day. It was predicted that data volumes would reach the zettabyte (10^21 bytes) scale by 2015, with about 90% of that data produced in the preceding five years. These data are analyzed to reveal hidden correlations and patterns, which is the task of Big Data Analytics.

Suppose a personal computer (PC) can hold 500 GB of data; storing zettabytes of data would then require billions of PCs. Google stores data on millions of servers around the world. Every day, millions of text messages are sent, and Facebook has millions of users who share content, photos, and videos. This is the information that Big Data Analytics works on, as shown in Fig. 1.1.
Big data is the result of three computing trends: Mobile Computing (using handheld devices), Social Networking, and Cloud Computing.

Fig. 1.1: Big Data: Result of three computing trends

Challenges and Considerations

Working with big data presents several challenges:

Storage: Storing massive volumes of data requires scalable and cost-effective storage solutions. Traditional databases may not be sufficient, leading to the adoption of distributed file systems like the Hadoop Distributed File System (HDFS) or cloud storage options.

Processing: Processing large-scale data necessitates parallel and distributed computing techniques. Technologies like Apache Hadoop, Apache Spark, and data streaming frameworks are commonly used to process big data in a distributed and scalable manner.

Analysis: Advanced analytics techniques, including machine learning, natural language processing, and data mining, are employed to extract insights from big data. These techniques help uncover patterns, trends, and anomalies that would be challenging to identify using traditional approaches.

Security: Organizations must implement robust security measures and comply with relevant data protection regulations.
10. Meteorology: Weather sensors and satellites all over the globe help collect large volumes of data to track climate conditions. Meteorologists extensively use Big Data to study the patterns of natural disasters, prepare weather forecasts, and the like.

11. Education: Many educational institutions have embraced the usage of Big Data for improving curricula, attracting the best talent, and reducing dropout rates by improving student outcomes, targeting global recruiting, and optimizing the overall student experience.

Opportunities and Impact

Big data has the potential to bring significant opportunities and impact across various domains:

Business Insights: Big data analytics helps organizations gain valuable insights into customer behavior, market trends, and operational efficiencies. It enables data-driven decision making and aids in optimizing processes, improving customer experience, and identifying new business opportunities.

Healthcare: Big data analysis enhances medical research and personalized medicine. Analyzing large patient datasets, genomic data, and monitoring data leads to better disease diagnosis, treatment, and prevention.

Smart Cities: Big data technologies can facilitate the development of smart cities. By analyzing data from sensors, IoT devices, and social media, cities can improve energy management, waste management, and public infrastructure.

Scientific Research: Big data plays a significant role in scientific research. Analyzing large datasets accelerates discoveries and leads to new insights.

So we can say that big data represents the vast amount of data generated in our digital world. It poses unique challenges but also presents opportunities for organizations and society as a whole. Effectively harnessing and analyzing big data can lead to valuable insights, innovation, and improved decision making across domains.

Characteristics of Big Data

Volume: Big data involves massive volumes of data that exceed the capacity of traditional data storage and processing systems. It can range from terabytes (10^12 bytes) to petabytes (10^15 bytes) or even exabytes (10^18 bytes) and beyond. This immense volume challenges traditional data management techniques.
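To make the scales named under Volume concrete, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the 500 GB drive size follows the PC example used earlier in the chapter, the one-zettabyte total is an assumed round figure, and the helper name `drives_needed` is hypothetical.

```python
# Decimal (SI) byte units as used in the text.
UNITS = {
    "terabyte": 10**12,
    "petabyte": 10**15,
    "exabyte": 10**18,
    "zettabyte": 10**21,
}

def drives_needed(total_bytes: int, drive_bytes: int) -> int:
    """Number of drives of size drive_bytes needed to hold total_bytes (rounded up)."""
    return -(-total_bytes // drive_bytes)  # ceiling division on integers

# Assumption for illustration: one zettabyte spread over 500 GB PCs.
pc_capacity = 500 * 10**9
pcs = drives_needed(UNITS["zettabyte"], pc_capacity)
print(pcs)

# Each unit in the list is 1000 times the previous one.
print(UNITS["petabyte"] // UNITS["terabyte"])
```

Under these assumptions a single zettabyte already needs two billion such PCs, which is why storage at this scale is spread over clusters instead of individual machines.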
Velocity: Big data is generated and collected at high speeds, often in real time or near real time. Social media streams, financial markets, Internet of Things (IoT) devices, and other sources produce data at an unprecedented velocity. The velocity of big data requires efficient processing and analysis methods to derive timely insights.

Variety: Big data encompasses a wide variety of data types and formats. It includes structured data (e.g., relational databases), semi-structured data (e.g., XML, JSON), and unstructured data.

Structured data: A Structured Query Language (SQL) is needed to bring the data together. Structured data is easy to enter, query, and analyze. All of the data follows the same format. However, forcing a consistent structure also means that any alteration of data is difficult, as each record has to be updated to adhere to the new structure. Examples of structured data include numbers, dates, and strings.

The Reduce Stage

A Reduce task processes the output of a map task. Similar to the map stage, all reduce tasks occur at the same time, and they work independently. The data is aggregated and combined to deliver the desired output. The final result is a reduced set of <key, value> pairs which MapReduce, by default, stores in HDFS.

How Hadoop Map and Reduce Work Together

The final output we are looking for is how many times the words Apache, Hadoop, Class, and Track appear in total in all documents. The input consists of six documents, and the cluster consists of three servers.

First, in the map stage, the input data (the six documents) is split and distributed across the cluster (the three servers). In this case, each map task works on a split containing two documents. During mapping, there is no communication between the nodes. They perform independently.

Then, map tasks create a <key, value> pair for every word. These pairs show how many times a word occurs. A word is a key, and a value is its count. For example, one document contains three of the four words we are looking for: Apache 7 times, Class 8 times, and Track 6 times. The key-value pairs in one map task output look like this:

<apache, 7>
<class, 8>
<track, 6>

This process is done in parallel tasks on all nodes for all documents and gives a unique output.

After input splitting and mapping completes, the outputs of every map task are shuffled. This is the first step of the Reduce stage. Since we are looking for the frequency of occurrence for four words, there are four parallel Reduce tasks. The reduce tasks can run on the same nodes as the map tasks, or they can run on any other node.

The shuffle step ensures the keys Apache, Hadoop, Class, and Track are sorted for the reduce step. This process groups the values by keys in the form of <key, value-list> pairs.

Distributed File System Organization

Files are rarely updated. Rather, they are read as data for some calculation, and additional data is appended to files from time to time.

Files are divided into chunks, which are typically 64 megabytes in size. Chunks are replicated at different compute nodes. Moreover, the nodes holding copies of one chunk should be located on different racks, so that not all copies are lost due to a rack failure. Normally, both the chunk size and the degree of replication can be decided by the user.

To find the chunks of a file, there is another small file called the master node or name node for that file. The master node is itself replicated, and a directory for the file system as a whole knows where to find its copies. The directory itself can be replicated, and all participants using the DFS know where the directory copies are.

MapReduce

MapReduce enables massive scalability across hundreds or thousands of servers in a Hadoop cluster. As the processing component, MapReduce is the heart of Apache Hadoop. The term "MapReduce" refers to the two separate tasks that Hadoop programs perform. The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).

In the early days of Hadoop (version 1), JobTracker and TaskTracker daemons ran operations in MapReduce. At the time, a Hadoop cluster could only support MapReduce applications.

A JobTracker controlled the distribution of application requests to the compute resources in a cluster. Since it monitored the execution and the status of MapReduce, it resided on a master node.

A TaskTracker processed the requests that came from the JobTracker. All task trackers were distributed across the slave nodes in a Hadoop cluster.
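The word-count flow described earlier in this section (map, then shuffle, then reduce) can be imitated in plain Python on a single machine. This is a minimal sketch under stated assumptions, not Hadoop code: the function names `map_task`, `shuffle`, `reduce_task`, and `word_count` are hypothetical, and the six short documents are made up so that three splits of two documents each mirror the three-server example.

```python
from collections import defaultdict

# The four words whose total frequency we want, as in the example.
TARGET_WORDS = {"apache", "hadoop", "class", "track"}

def map_task(documents):
    """Map stage for one split: emit a (word, count) pair per target word found."""
    counts = defaultdict(int)
    for doc in documents:
        for word in doc.lower().split():
            if word in TARGET_WORDS:
                counts[word] += 1
    return list(counts.items())

def shuffle(map_outputs):
    """Shuffle step: group the partial counts by key and sort the keys."""
    grouped = defaultdict(list)
    for output in map_outputs:
        for word, count in output:
            grouped[word].append(count)
    return dict(sorted(grouped.items()))

def reduce_task(word, counts):
    """Reduce stage for one key: sum the list of partial counts."""
    return word, sum(counts)

def word_count(splits):
    map_outputs = [map_task(split) for split in splits]  # independent map tasks
    grouped = shuffle(map_outputs)                       # <key, value-list> pairs
    return dict(reduce_task(w, c) for w, c in grouped.items())

# Made-up input: six documents in three splits of two documents each.
splits = [
    ["apache hadoop apache", "class track"],
    ["hadoop hadoop class", "track apache"],
    ["class class track", "apache hadoop"],
]
result = word_count(splits)
print(result)
```

In a real cluster each `map_task` call would run on a different node and each key would go to its own reduce task; here the calls run one after another, which keeps the data flow visible without any Hadoop dependency.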
The tasks should be big enough to justify the task handling time. If you divide a job into unusually small segments, the total time to prepare the splits and create tasks may outweigh the time needed to produce the actual job output.

Scalability: A better way to scale is horizontal scalability, that is, adding more machines to the cluster. Nodes can be added to the cluster on the fly without any downtime.

Data Integrity: Data integrity refers to the correctness of data. HDFS ensures data integrity by constantly checking the data against the checksum calculated during the write of the file. While reading a file, if the checksum does not match the original checksum, the data is said to be corrupted. The client then retrieves the data block from another DataNode that has a replica of that block. The NameNode discards the corrupted block and creates an additional new replica.

To tolerate failures, two requirements follow:

Files must be stored redundantly. If we did not duplicate the file at several compute nodes, then if one node failed, all its files would be unavailable until the node is replaced. If we did not back up the files at all, and the disk crashes, the files would be lost forever.

Computations must be divided into tasks, such that if any one task fails to execute to completion, it can be restarted without affecting other tasks.

MapReduce is a technique in which a huge program is subdivided into small tasks that run in parallel to make computation faster and save time.

Mapper: The Mapper organizes data into (key, value) pairs. For example, in a dictionary you search for the word "Data" and its associated meaning is "facts and statistics collected together for reference or analysis". Here the key is "Data" and the value associated with it is "facts and statistics collected together for reference or analysis".

Reducer: The Reducer is responsible for processing data in parallel and producing the final output.

Matrix Multiplication by MapReduce

Algorithm 1: The map function
1. For each element m_ij of M do
2.    produce (key, value) pairs as ((i, k), (M, j, m_ij)), for k = 1, 2, 3, ... up to the number of columns of N
3. For each element n_jk of N do
4.    produce (key, value) pairs as ((i, k), (N, j, n_jk)), for i = 1, 2, 3, ... up to the number of rows of M
5. Return the set of (key, value) pairs, such that each key (i, k) has a list with values (M, j, m_ij) and (N, j, n_jk) for all possible values of j

Algorithm 2: The reduce function
1. For each key (i, k) do
2.    sort the values beginning with M by j into listM
3.    sort the values beginning with N by j into listN
4.    multiply m_ij and n_jk for the j-th value of each list
5.    sum up m_ij * n_jk
6. Return (i, k), sum of m_ij * n_jk over j

Worked example: Matrix A is a 2 x 2 matrix, which means the number of rows (i) = 2 and the number of columns (j) = 2. Matrix B is also a 2 x 2 matrix, where the number of rows (j) = 2 and the number of columns (k) = 2.

A = | 1  2 |      B = | 5  6 |
    | 3  4 |          | 7  8 |

Each cell of the matrices is labelled A_ij and B_jk. For example, element 3 in matrix A is called A_21, i.e. 2nd row, 1st column. One-step matrix multiplication has one mapper and one reducer. The formulas are:

Mapper for Matrix A: (key, value) = ((i, k), (A, j, A_ij)) for all k
Mapper for Matrix B: (key, value) = ((i, k), (B, j, B_jk)) for all i

Computing the mapper for Matrix A: i, j, and k each take the values 1 and 2. Therefore, when k = 1, i can have two values (1 and 2), and each case has two further values, j = 1 and j = 2. Substituting all values:

k = 1   i = 1   j = 1   ((1, 1), (A, 1, 1))
                j = 2   ((1, 1), (A, 2, 2))
        i = 2   j = 1   ((2, 1), (A, 1, 3))
                j = 2   ((2, 1), (A, 2, 4))
k = 2   i = 1   j = 1   ((1, 2), (A, 1, 1))
                j = 2   ((1, 2), (A, 2, 2))
        i = 2   j = 1   ((2, 2), (A, 1, 3))
                j = 2   ((2, 2), (A, 2, 4))

Computing the mapper for Matrix B:

i = 1   j = 1   k = 1   ((1, 1), (B, 1, 5))
                k = 2   ((1, 2), (B, 1, 6))
        j = 2   k = 1   ((1, 1), (B, 2, 7))
                k = 2   ((1, 2), (B, 2, 8))
i = 2   j = 1   k = 1   ((2, 1), (B, 1, 5))
                k = 2   ((2, 2), (B, 1, 6))
        j = 2   k = 1   ((2, 1), (B, 2, 7))
                k = 2   ((2, 2), (B, 2, 8))

The reducer groups the values by key, with the values taken from the mapper step above, and keeps a separate list for Matrix A and Matrix B:

(1, 1): A list = (A, 1, 1), (A, 2, 2); B list = (B, 1, 5), (B, 2, 7). Now A x B: (1 * 5) + (2 * 7) = 19
(1, 2): A list = (A, 1, 1), (A, 2, 2); B list = (B, 1, 6), (B, 2, 8). Now A x B: (1 * 6) + (2 * 8) = 22
(2, 1): A list = (A, 1, 3), (A, 2, 4); B list = (B, 1, 5), (B, 2, 7). Now A x B: (3 * 5) + (4 * 7) = 43
(2, 2): A list = (A, 1, 3), (A, 2, 4); B list = (B, 1, 6), (B, 2, 8). Now A x B: (3 * 6) + (4 * 8) = 50

Therefore, the final matrix is:

| 19  22 |
| 43  50 |

Relational-Algebra Operations by MapReduce

Selection: Apply a condition C to each tuple in the relation and produce as output only those tuples that satisfy C. The result of this selection is denoted by σ_C(R). Selections really do not need the full power of MapReduce. They can be done most conveniently in the map portion alone, though they could also be done in the reduce portion alone.

The pseudocode is as follows:

    Map(key, value):
        for tuple in value:
            if tuple satisfies C:
                emit(tuple, tuple)

    Reduce(key, values):
        emit(key, key)

Projection: For some subset S of the attributes of the relation, produce from each tuple only the components for the attributes in S. The result of this projection is denoted by π_S(R). Projection is performed similarly to selection.

In the word-count example, each of the four tasks in the Reduce stage produces a final key-value pair. The reduce tasks also run at the same time and work independently.

Features of NoSQL

1. Wide data type variety
2. Distributed architecture: Most NoSQL databases can be executed in a distributed fashion.
3. Bulk upload
4. Lower ACID guarantees: ACID concepts are relaxed in exchange for scalability and throughput.
5. Distributed storage: Mostly uses asynchronous replication between distributed nodes (asynchronous multi-master replication, peer-to-peer replication, HDFS replication), providing only eventual consistency.
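Returning to the one-step matrix multiplication from earlier in this section, the map and reduce functions can be sketched in plain Python. This is an illustration under stated assumptions, not Hadoop code: the function names `map_matrices`, `shuffle`, `reduce_cell`, and `multiply` are hypothetical, the shuffle is simulated with a dictionary, and the matrices use the values read from the worked example.

```python
from collections import defaultdict

def map_matrices(A, B):
    """Map step: emit ((i, k), (name, j, value)) for every element (1-based indices)."""
    rows_a, cols_b = len(A), len(B[0])
    pairs = []
    for i, row in enumerate(A, start=1):
        for j, a_ij in enumerate(row, start=1):
            for k in range(1, cols_b + 1):      # A_ij is needed for every column k of B
                pairs.append(((i, k), ("A", j, a_ij)))
    for j, row in enumerate(B, start=1):
        for k, b_jk in enumerate(row, start=1):
            for i in range(1, rows_a + 1):      # B_jk is needed for every row i of A
                pairs.append(((i, k), ("B", j, b_jk)))
    return pairs

def shuffle(pairs):
    """Group all emitted values by their (i, k) key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_cell(values):
    """Reduce step for one key: pair A and B values on j, multiply, and sum."""
    a = {j: v for name, j, v in values if name == "A"}
    b = {j: v for name, j, v in values if name == "B"}
    return sum(a[j] * b[j] for j in a)

def multiply(A, B):
    grouped = shuffle(map_matrices(A, B))
    rows, cols = len(A), len(B[0])
    return [[reduce_cell(grouped[(i, k)]) for k in range(1, cols + 1)]
            for i in range(1, rows + 1)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
product = multiply(A, B)
print(product)
```

Each key (i, k) ends up with exactly the row of A and the column of B that it needs, so every output cell can be reduced independently of the others.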
