// Fragment of DeserializableDemo: reading the serialized Student object back in
Student s = null;
try {
    FileInputStream fileIn = new FileInputStream("student.ser");
    ObjectInputStream in = new ObjectInputStream(fileIn);
    s = (Student) in.readObject();
    in.close();
    fileIn.close();
} catch (IOException i) {
    i.printStackTrace();
} catch (ClassNotFoundException c) {
    System.out.println("Student class not found");
    c.printStackTrace();
    return;
}
System.out.println("Deserialized student...");
System.out.println("Registered No.: " + s.regdno);
System.out.println("Name: " + s.name);
System.out.println("Course: " + s.course);
System.out.println("Aadhar No.: " + s.aadhar);

Output:
E:\Java\Eclipse\BDA_secA\src>javac DeserializableDemo.java
E:\Java\Eclipse\BDA_secA\src>java DeserializableDemo
Deserialized student...
Registered No.: 11
Name: Rohith

2. Introduction to Big Data

> What is Big Data and Big Data Analytics (BDA)?

"Big Data is an evolving term that describes any large amount of structured, semi-structured and unstructured data that has the potential to be mined for information."

"Big Data Analytics (BDA) is the process of examining large data sets containing a variety of data types - i.e., big data - to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information."

Hierarchy of memory units:

1024 Bytes      = 1 Kilobyte
1024 Kilobytes  = 1 Megabyte
1024 Megabytes  = 1 Gigabyte
1024 Gigabytes  = 1 Terabyte
1024 Terabytes  = 1 Petabyte
1024 Petabytes  = 1 Exabyte
1024 Exabytes   = 1 Zettabyte
1024 Zettabytes = 1 Yottabyte

> Characteristics of Big Data (or) Why is Big Data different from any other data?

There are "Five V's" that characterize this data: Volume, Velocity, Variety, Veracity and Validity.

1. Volume (Data Quantity): Most organizations were already struggling with the increasing size of their databases as the Big Data tsunami hit the data stores.

2. Velocity (Data Speed): There are two aspects to velocity: one representing the throughput of data and the other representing latency. Throughput represents the data moving in the pipes. Latency is the other measure of velocity. Analytics used to be a "store and report" environment where reporting typically contained data as of yesterday.

3. Variety (Data Types): The source data includes unstructured data (text, sound, and video) in addition to structured data. A number of applications are gathering data from emails, documents, or blogs.

4. Veracity (Data Quality): Veracity represents both the credibility of the data source as well as the suitability of the data for the target audience.

5. Validity (Data Correctness): Validity means: is the data correct and accurate for its future use? Clearly, valid data is key to making the right decisions.

As per IBM, the characteristics of Big Data are the "V"s described in the following Figure.

[Figure: IBM's characterization of Big Data]

> Importance of Big Data:

1. Access to social data from search engines and sites like Facebook and Twitter is enabling organizations to fine-tune their business strategies. Marketing agencies are learning about the response to their campaigns, promotions, and other advertising media.

2. Traditional customer feedback systems are getting replaced by new systems designed with Big Data technologies. In these new systems, Big Data and natural language processing technologies are being used to read and evaluate consumer responses.

3. Based on information in social media, such as preferences and product perception of their consumers, product companies and retail organizations are planning their production.
4. Determining root causes of failures, issues, and defects in near-real time.

5. Detecting fraudulent behavior before it affects the organization.

> When do you use Big Data technologies?

1. Big Data solutions are ideal for analyzing not only raw structured data, but semi-structured and unstructured data from a wide variety of sources.

2. Big Data solutions are ideal when all, or most, of the data needs to be analyzed versus a sample of the data, or when a sampling of the data isn't nearly as effective as a large set of data from which to derive analysis.

3. Big Data solutions are ideal for iterative and exploratory analysis when business measures on data are not predetermined.

> Patterns for Big Data Development:

The following six most common usage patterns represent great Big Data opportunities - business problems that weren't easy to solve before - and help us gain an understanding of how Big Data can help us (or how it's helping our competitors make us less competitive if we are not paying attention):

1. IT for IT Log Analytics
2. The Fraud Detection Pattern
3. The Social Media Pattern
4. The Call Center Mantra: "This Call May Be Recorded for Quality Assurance Purposes"
5. Risk Patterns for Modeling and Management
6. Big Data and the Energy Sector

1. IT for IT Log Analytics: IT departments need logs at their disposal, and today they just can't store enough logs and analyze them in a cost-efficient manner, so logs are typically kept for emergencies and discarded as soon as possible. Another reason why IT departments keep large amounts of data in logs is to look for rare problems. It is often the case that the most common problems are known and easy to deal with, but the problem that happens "once in a while" is typically more difficult to diagnose and prevent from occurring again.

But there are more reasons why log analysis is a Big Data problem apart from its sheer size. The nature of these logs is semi-structured and raw, so they aren't always suited for traditional database processing. In addition, log formats are constantly changing due to hardware and software upgrades, so they can't be tied to strict, inflexible analysis paradigms. Finally, not only do we need to perform analysis on the longevity of the logs to determine trends and patterns and to find failures, but we also need to ensure the analysis is done on all the data.

Log analytics is actually a pattern that IBM established after working with a number of companies, including some large financial services sector (FSS) companies. This use case comes up with quite a few customers; for that reason, this pattern is called IT for IT. If we are new to this usage pattern and wondering just who is interested in IT-for-IT Big Data solutions, we should know that this is an internal use case within an organization itself. An internal IT-for-IT implementation is well suited for any organization with a large data center footprint, especially if it is relatively complex. For example, service-oriented architecture (SOA) applications with lots of moving parts, federated data centers, and so on, all suffer from the same issues outlined in this section.

Some large insurance and retail clients need to know the answers to such questions as "What are the precursors to failures?", "How are these systems all related?", and more. These are the types of questions that conventional monitoring doesn't answer; a Big Data platform finally offers the opportunity to get some new and better insight into the problems at hand.
2. The Fraud Detection Pattern: Fraud detection comes up a lot in the financial services vertical, but we will find it in any sort of claims- or transaction-based environment (online auctions, insurance claims, underwriting, and so on). Pretty much anywhere some sort of financial transaction is involved presents a potential for misuse and the universal threat of fraud. If we leverage a Big Data platform, we have the opportunity to do more than we have ever done before to identify it or, better yet, stop it.

Traditionally, in fraud cases, samples and models are used to identify customers that characterize a certain kind of profile. The problem with this is that although it works, we are profiling a segment and not working at the granularity of an individual transaction or person. As per customer experiences, it is estimated that only 20 percent (or maybe less) of the available information that could be useful for fraud modeling is actually being used. The traditional approach is shown in the following Figure.

[Figure: The traditional fraud detection approach]

We can use BigInsights to provide an elastic and cost-effective repository to establish which of the remaining 80 percent of the information is useful for fraud modeling, and then feed newly discovered high-value information back into the fraud model, as shown in the following Figure.

[Figure: A modern fraud detection ecosystem]

A modern-day fraud detection ecosystem provides a low-cost Big Data platform for exploratory modeling and discovery. Typically, fraud detection works after a transaction gets stored, only to get pulled out of storage and analyzed; storing something to instantly pull it back out again feels like latency to us. With Streams, we can apply the fraud detection models as the transaction is happening.

3. The Social Media Pattern: Perhaps the most talked-about Big Data usage pattern is social media and customer sentiment. More specifically, we can determine how sentiment is impacting sales, the effectiveness or receptiveness of marketing campaigns, the accuracy of the marketing mix (product, price, promotion, and placement), and so on.

Social media analytics is a pretty hot topic, so hot in fact that IBM has built a solution specifically to accelerate our use of it: Cognos Consumer Insights (CCI). CCI can tell what people are saying, how topics are trending in social media, and all sorts of things that affect the business, all packed into a rich visualization engine.

4. The Call Center Mantra: "This Call May Be Recorded for Quality Assurance Purposes": It seems that when we want our call with a customer service representative (CSR) to be recorded for quality assurance purposes, the "may" part never works in our favor. The challenge of call center efficiency is somewhat similar to the fraud detection pattern. Call centers of all kinds want to find better ways to process information to address what's going on in the business with lower latency. This is a really interesting Big Data use case, because it uses analytics-in-motion and analytics-at-rest. Using in-motion analytics (Streams) means that we basically build our models and find out what's interesting based upon the conversations that have been converted from voice to text, or with vibe analysis, as the call is happening. Using at-rest analytics (BigInsights), we build up these models and then promote them back into Streams to examine and analyze the calls that are actually happening in real time: it's truly a closed-loop feedback mechanism.

5. Risk: Patterns for Modeling and Management: Risk modeling and management is another big opportunity and common Big Data usage pattern.
Risk modeling brings into focus a frequent question when it comes to the Big Data usage patterns: "How much of our data do we use in our modeling?" The financial crisis of 2008, the associated subprime loan crisis, and its outcome have made risk modeling and management a key area of focus for financial institutions.

Two problems are associated with this usage pattern: "How much of the data will we use for our model?" and "How can we keep up with the data's velocity?" The answer to the second question, unfortunately, is often, "We can't." Finally, consider that financial services firms tend to move their risk models and dashboards to intraday positions rather than just close-of-day positions, and we can see yet another challenge that can't be solved with traditional systems alone. Another characteristic of today's financial markets is that there are massive trading volumes, which require better models to manage risk.

6. Big Data and the Energy Sector: The energy sector provides many Big Data use case challenges in how to deal with the massive volumes of sensor data from remote installations. Many companies are using only a fraction of the data being collected, because they lack the infrastructure to store or analyze the available scale of data.

Vestas is primarily engaged in the development, manufacturing, sale, and maintenance of power systems that use wind energy to generate electricity through its wind turbines. Its product range includes land and offshore wind turbines. At the time of writing, it had more than 43,000 wind turbines in 65 countries on 5 continents. Vestas used the IBM BigInsights platform to achieve its vision about the generation of clean energy.

> Data in the Warehouse and Data in Hadoop:

Traditional warehouses are mostly ideal for analyzing structured data from various systems and producing insights with known and relatively stable measurements. On the other hand, a Hadoop-based platform is well suited to deal with semi-structured and unstructured data, as well as when a data discovery process is needed.

The authors could say that data warehouse data is "trusted enough to be public," while Hadoop data isn't as trusted (public can mean vastly distributed within the company and not for external consumption), and although this will likely change in the future, today this is something that experience suggests characterizes these repositories.

A Hadoop-based repository scheme stores the entire business entity, and the fidelity of the Tweet, transaction, Facebook post, and more is kept intact. Data in Hadoop might seem of low value today; IT departments pick and choose high-valued data and put it through rigorous cleansing and transformation processes because they know that data has a high known value per byte.

Unstructured data can't be easily stored in a warehouse. A Big Data platform can store all of the data in its native business object format and get value out of it through massive parallelism on readily available components.

UNIT-II

1. Hadoop: definition, Understanding distributed systems and Hadoop, Comparing SQL databases and Hadoop, Understanding MapReduce, Counting words with Hadoop - running your first program, History of Hadoop, Starting Hadoop - The building blocks of Hadoop, NameNode, DataNode, Secondary NameNode, JobTracker and TaskTracker.
2. HDFS: Components of Hadoop - Working with files in HDFS, Anatomy of a MapReduce program, Reading and writing in the Hadoop Distributed File System - The Design of HDFS, HDFS Concepts, The Command-Line Interface, Hadoop Filesystems, The Java Interface, Data Flow, Parallel Copying with distcp, Hadoop Archives.

1. Introduction to Hadoop

Hadoop: Hadoop is an open source framework for writing and running distributed applications that process large amounts of data. Distributed computing is a wide and varied field, but the key distinctions of Hadoop are that it is:

- Accessible - Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon's Elastic Compute Cloud (EC2).
- Robust - Because it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions (errors). It can gracefully handle most such failures.
- Scalable - Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
- Simple - Hadoop allows users to quickly write efficient parallel code.

The following Figure illustrates how one interacts with a Hadoop cluster. As we can see, a Hadoop cluster is a set of commodity machines networked together in one location. Data storage and processing all occur within this "cloud" of machines. Different users can submit computing "jobs" to Hadoop from individual clients, which can be their own desktop machines in remote locations from the Hadoop cluster.

[Figure: A Hadoop cluster and its clients]

> Understanding distributed systems and Hadoop:

A lot of low-end/commodity machines tied together as a single functional unit is known as a distributed system. A high-end machine with four I/O channels, each having a throughput of 100 MB/sec, will require about three hours to read a 4 TB data set (4 TB divided by the aggregate 400 MB/sec is 10,000 seconds, or roughly 2.8 hours). With Hadoop, this same data set will be divided into smaller (typically 64 MB) blocks that are spread among many machines in the cluster via the Hadoop Distributed File System (HDFS). With a modest degree of replication, the cluster machines can read the data set in parallel and provide a much higher throughput. And such a cluster of commodity machines turns out to be cheaper than one high-end server.

> Comparing SQL databases and Hadoop:

SQL (Structured Query Language) is designed for structured data. Many of Hadoop's initial applications deal with unstructured data such as text. From this perspective Hadoop provides a more general paradigm than SQL. SQL is a query language which can be implemented on top of Hadoop as the execution engine. But in practice, SQL databases tend to refer to a whole set of legacy technologies, with several dominant vendors, optimized for a historical set of applications. The following concepts give a more detailed comparison of Hadoop with typical SQL databases on specific dimensions.

1. Scale-Out Instead of Scale-Up: Scaling commercial relational databases is expensive. Their design is friendlier to scaling up: to run a bigger database we need to buy a bigger machine, which is expensive. Unfortunately, at some point there won't be a big enough machine available for the larger data sets. Hadoop is designed to be a scale-out architecture operating on a cluster of commodity PC machines. Adding more resources means adding more machines to the Hadoop cluster. A Hadoop cluster with tens to hundreds of commodity machines is standard. In fact, other than for development purposes, there's no reason to run Hadoop on a single server.
2. Key/Value Pairs Instead of Relational Tables: A fundamental principle of relational databases is that data resides in tables having a relational structure defined by a schema. Hadoop uses key/value pairs as its base data unit, which is flexible enough to work with less-structured data types. In Hadoop, data can originate in any form (structured/unstructured/semi-structured), but it eventually transforms into (key/value) pairs for the processing functions to work on.

3. Functional Programming (MapReduce) Instead of Declarative Queries (SQL): SQL is fundamentally a high-level declarative language. By executing queries, the required data will be retrieved from the database. Under MapReduce we specify the actual steps in processing the data, which is more similar to an execution plan for a SQL engine. Under SQL we have query statements; under MapReduce we have scripts and code. MapReduce allows us to process data in a more general fashion than SQL queries. For example, we can build complex statistical models from our data or reformat our image data. SQL is not well designed for such tasks.

4. Offline Batch Processing Instead of Online Transactions: Hadoop is designed for offline processing and analysis of large-scale data. It doesn't work for random reading and writing of a few records, which is the type of load for online transaction processing. In fact, Hadoop is best used as a write-once, read-many-times type of data store. In this aspect it's similar to data warehouses in the SQL world. This is how Hadoop relates to distributed systems and SQL databases at a high level.

> Understanding MapReduce:

MapReduce is a data processing model. Its main advantage is easy scaling of data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But, once we write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This is the reason why MapReduce has attracted many programmers.

Ex: To count the number of times each word occurs in a set of documents. Say we have a set of documents containing only one document with only one sentence: Do as I say, not as I do. We derive the word counts shown below:

as   2
do   2
i    2
not  1
say  1

When the set of documents is small, a straightforward program will do the job. The pseudo-code is:

define wordCount as Multiset;
for each document in documentSet {
    T = tokenize(document);
    for each token in T {
        wordCount[token]++;
    }
}
display(wordCount);

The program loops through all the documents. For each document, the words are extracted one by one using a tokenization process. For each word, its corresponding entry in a multiset called wordCount is incremented by one. At the end, a display() function prints out all the entries in wordCount.

The above code works fine until the set of documents we want to process becomes large. When it is large, we speed it up by rewriting the program so that it distributes the work over several machines. Each machine will process a distinct fraction of the documents. When all the machines have completed this, a second phase of processing will combine the results of all the machines. The pseudo-code for the first phase, to be distributed over many machines, is:

define wordCount as Multiset;
for each document in documentSubset {
    T = tokenize(document);
    for each token in T {
        wordCount[token]++;
    }
}
sendToSecondPhase(wordCount);

The pseudo-code for the second phase is:

define totalWordCount as Multiset;
for each wordCount received from firstPhase {
    multisetAdd(totalWordCount, wordCount);
}

This word counting program is getting complicated. To make it work across a cluster of distributed machines, we need to add a number of functionalities:

a. Store files over many processing machines (for phase one).
b. Write a disk-based hash table permitting processing without being limited by RAM capacity.
c. Partition the intermediate data (that is, wordCount) from phase one.
d. Shuffle the partitions to the appropriate machines in phase two.
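As a concrete illustration of the first, single-machine pseudo-code above, here is a minimal plain-Java version (the class name NaiveWordCount and its details are my own, not from the text):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NaiveWordCount {
    // In-memory equivalent of the pseudo-code: one multiset (map) of counts.
    public static Map<String, Integer> countWords(List<String> documentSet) {
        Map<String, Integer> wordCount = new HashMap<>();
        for (String document : documentSet) {
            // tokenize(): lowercase and split on anything that isn't a letter.
            for (String token : document.toLowerCase().split("[^a-z]+")) {
                if (!token.isEmpty()) {
                    wordCount.merge(token, 1, Integer::sum);
                }
            }
        }
        return wordCount;
    }

    public static void main(String[] args) {
        // display(wordCount): prints {not=1, do=2, i=2, as=2, say=1}
        System.out.println(countWords(List.of("Do as I say, not as I do")));
    }
}

This version works only as long as the document set (and the wordCount map) fits on one machine, which is exactly the limitation the two-phase rewrite and the list above are meant to remove.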
> Scaling the same program in MapReduce:

MapReduce programs are executed in two main phases, called mapping and reducing. Each phase is defined by a data processing function, and these functions are called mapper and reducer, respectively. In the mapping phase, MapReduce takes the input data and feeds each data element to the mapper. In the reducing phase, the reducer processes all the outputs from the mapper and arrives at a final result. In simple terms, the mapper is meant to filter and transform the input into something that the reducer can aggregate over.

The MapReduce framework was designed to help in writing scalable, distributed programs. This two-phase design pattern is used in scaling many programs, and became the basis of the framework. Partitioning and shuffling are common design patterns along with mapping and reducing. The MapReduce framework provides a default implementation that works in most situations. MapReduce uses lists and (key/value) pairs as its main data primitives. The keys are often strings but can also be dummy values to be ignored, and values are often integers or complex object types. The map and reduce functions must obey the following constraint on the types of keys and values:

            Input               Output
map         <k1, v1>            list(<k2, v2>)
reduce      <k2, list(v2)>      list(<k3, v3>)

In the MapReduce framework we write applications by specifying the mapper and reducer. The following steps explain the complete data flow:

1. The input to the application must be structured as a list of (key/value) pairs, list(<k1, v1>). The input format for processing multiple files is usually list(<String filename, String file_content>).

2. The list of (key/value) pairs is broken up and each individual (key/value) pair, <k1, v1>, is processed by calling the map function of the mapper. For word counting, the mapper takes <String filename, String file_content> and promptly ignores filename. It can output a list of <String word, Integer count>. The counts can instead be output as a list of <String word, Integer 1> with repeated entries, and we let the complete aggregation be done later. That is, in the output list we can have the (key/value) pair <"foo", 3> once, or we can have the pair <"foo", 1> three times.

3. The output of all the mappers is (conceptually) aggregated into one giant list of <k2, v2> pairs. All pairs sharing the same k2 are grouped together into a new (key/value) pair, <k2, list(v2)>. For example, the map output for one document may be a list with the pair <"foo", 1> once, and the map output for another document may be a list with the pair <"foo", 1> twice. The aggregated pair the reducer will see is <"foo", list(1, 1, 1)>. In word counting, the output of our reducer is <"foo", 3>, which is the total number of times "foo" has occurred in our document set. Each reducer works on a different word. The MapReduce framework automatically collects all the <k3, v3> pairs and writes them to file(s).

Pseudo-code for the map and reduce functions for word counting:

map(String filename, String document) {
    List<String> T = tokenize(document);
    for each token in T {
        emit((String) token, (Integer) 1);
    }
}

reduce(String token, List<Integer> values) {
    Integer sum = 0;
    for each value in values {
        sum = sum + value;
    }
    emit((String) token, (Integer) sum);
}

In the above pseudo-code, a special function called emit() is provided by the framework to generate the elements in the output list one at a time. The emit() function further relieves the programmer from managing a large list. In this way Hadoop makes building scalable distributed programs easy.
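To make the data flow above concrete, here is a small plain-Java simulation (my own illustrative code, not from the text): it runs the map step over each document, groups the emitted pairs by key as the framework's shuffle step would, and then applies the reduce step.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LocalMapReduce {
    public static void main(String[] args) {
        List<String> documents = List.of("do as i say", "not as i do");

        // Map phase: emit a <word, 1> pair for every token (step 2 of the data flow).
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String document : documents) {
            for (String token : document.split("\\s+")) {
                emitted.add(Map.entry(token, 1));
            }
        }

        // Shuffle: group pairs by key into <k2, list(v2)> (step 3 of the data flow).
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : emitted) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }

        // Reduce phase: sum each word's list of counts and "emit" the total.
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int v : entry.getValue()) sum += v;
            System.out.println(entry.getKey() + "\t" + sum);
        }
    }
}

In a real Hadoop job, the map and reduce loops run on different machines and the grouping is done by the framework's partition/sort/shuffle machinery; the logic, however, is exactly this.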
> Counting words with Hadoop - running your first program:

Running Hadoop on a single machine is mainly useful for development work. Linux is the official development and production platform for Hadoop, although Windows is a supported development platform as well. For a Windows box, we'll need to install Cygwin (http://www.cygwin.com/) to enable shell and Unix scripts.

To run Hadoop requires Java (version 1.6 or higher). Mac users should get it from Apple. We can download the latest JDK for other operating systems from Sun at http://java.sun.com/javase/downloads/index.jsp (now www.oracle.com). Install it and remember the root of the Java installation, which we'll need later.

To install Hadoop, first get the latest version release at http://hadoop.apache.org/core/releases.html. After we unpack the distribution, edit the script conf/hadoop-env.sh to set JAVA_HOME to the root of the Java installation we have remembered from earlier. For example, in Mac OS X, we'll replace this line

# export JAVA_HOME=/usr/lib/j2sdk1.5-sun

with the following line:

export JAVA_HOME=/Library/Java/Home

We'll be using the Hadoop script quite often. Run the following command:

bin/hadoop

We only need to know that the command to run a (Java) Hadoop program is bin/hadoop jar <jar>. As the command implies, Hadoop programs written in Java are packaged in jar files for execution. The following command shows about a dozen example programs prepackaged with Hadoop:

bin/hadoop jar hadoop-*-examples.jar

One of the programs is "wordcount". The important (inner) classes of that program, MapClass and Reduce, are shown in the following figure.

[Figure: The MapClass and Reduce inner classes of WordCount]

WordCount uses Java's StringTokenizer in its default setting, which tokenizes based only on whitespace. To ignore standard punctuation marks, we add them to the StringTokenizer's list of delimiter characters:

StringTokenizer itr = new StringTokenizer(line, " \t\n\r\f,.:;?![]'");

When looping through the set of tokens, each token is extracted and cast into a Text object. In Hadoop, the special class Text is used in place of String. We want the word count to ignore capitalization, so we lowercase all the words before turning them into Text objects:

word.set(itr.nextToken().toLowerCase());

Finally, we want only words that appear more than four times. We modify the code to collect the word count into the output only if that condition is met (this is Hadoop's equivalent of the emit() function in our pseudo-code):

if (sum > 4) output.collect(key, new IntWritable(sum));

After making changes to those three lines, we can recompile the program and execute it again. The results are shown in the following table.

[Table: words appearing more than four times, with their counts]

Without specifying any arguments, executing wordcount will show its usage information:

bin/hadoop jar hadoop-*-examples.jar wordcount

which shows the arguments:

wordcount [-m <maps>] [-r <reduces>] <input> <output>

The only required parameters are an input directory (<input>) where the program reads its input and an output directory (<output>) where the program will dump its output. To execute wordcount, we need to first create an input directory named "input" and put some documents in it. We can add any text document to the directory. To see the wordcount results:

bin/hadoop jar hadoop-*-examples.jar wordcount input output
more output/*

We'll see a word count of every word used in the documents, listed in alphabetical order. The source code for wordcount is available and included in the installation at src/examples/org/apache/hadoop/examples/WordCount.java. We can modify it as per our requirements.
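For reference, a sketch of how those three modified lines sit inside the WordCount classes, assuming the pre-0.20 org.apache.hadoop.mapred interfaces (this assembly is my own reconstruction for illustration, not the shipped example verbatim):

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class ModifiedWordCount {
    public static class MapClass extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            // Change 1: treat punctuation as delimiters, not just whitespace.
            StringTokenizer itr = new StringTokenizer(line, " \t\n\r\f,.:;?![]'");
            while (itr.hasMoreTokens()) {
                // Change 2: lowercase so counting ignores capitalization.
                word.set(itr.nextToken().toLowerCase());
                output.collect(word, one);
            }
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            // Change 3: emit only words that appear more than four times.
            if (sum > 4) output.collect(key, new IntWritable(sum));
        }
    }
}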
> History of Hadoop:

Hadoop is a versatile (flexible) tool that allows new users to access the power of distributed computing. By using distributed storage and transferring code instead of data, Hadoop avoids the costly transmission step when working with large data sets. Moreover, the redundancy of data allows Hadoop to recover should a single node fail. This also eases the creation of programs with Hadoop using the MapReduce framework.

> Starting Hadoop - The building blocks of Hadoop:

On a fully configured cluster, "running Hadoop" means running a set of daemons, or resident programs, on the different servers in the network. These daemons have specific roles; some exist only on one server, some exist across multiple servers. The daemons include:

1. NameNode
2. DataNode
3. Secondary NameNode
4. JobTracker
5. TaskTracker

1. NameNode: The distributed storage system is called the Hadoop File System, or HDFS. The NameNode is the master of HDFS that directs the slave DataNode daemons to perform the low-level I/O tasks. The NameNode is the bookkeeper of HDFS; it keeps track of how the files are broken down into file blocks, which nodes store those blocks, and the overall health of the distributed filesystem.

The function of the NameNode is memory and I/O intensive. As such, the server hosting the NameNode typically doesn't store any user data or perform any computations for a MapReduce program, to lower the workload on the machine. This means that the NameNode server doesn't double as a DataNode or a TaskTracker.

There is unfortunately a negative aspect to the importance of the NameNode: it's a single point of failure of the Hadoop cluster. For any of the other daemons, if their host nodes fail for software or hardware reasons, the Hadoop cluster will likely continue to function smoothly, or we can quickly restart it. Not so for the NameNode.

2. DataNode: Each slave machine in the cluster will host a DataNode daemon to perform the grunt work of the distributed filesystem: reading and writing HDFS blocks to actual files on the local filesystem. When we want to read or write an HDFS file, the file is broken into blocks and the NameNode will tell the client which DataNode each block resides in. The client communicates directly with the DataNode daemons to process the local files corresponding to the blocks. Furthermore, a DataNode may communicate with other DataNodes to replicate its data blocks for redundancy. The following figure illustrates the roles of the NameNode and DataNodes.

[Figure: NameNode and DataNode interaction in HDFS]

The data1 file takes up three blocks, which we denote 1, 2, and 3, and the data2 file consists of blocks 4 and 5. The content of the files is distributed among the DataNodes. In this illustration, each block has three replicas. For example, block 1 (used for data1) is replicated over the three rightmost DataNodes. This ensures that if any one DataNode crashes or becomes inaccessible over the network, we'll still be able to read the files.

DataNodes are constantly reporting to the NameNode. Each of the DataNodes informs the NameNode of the blocks it's currently storing. After this mapping is complete, the DataNodes continually poll the NameNode to provide information regarding local changes as well as to receive instructions to create, move, or delete blocks from the local disk.
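A minimal sketch of that client-side read path using the org.apache.hadoop.fs API (the path /user/example/data1.txt is a hypothetical example; the NameNode lookup and the direct DataNode transfers happen inside the library):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The client first asks the NameNode for the block locations of the file,
        // then streams each block directly from a DataNode holding a replica.
        FileSystem fs = FileSystem.get(conf);
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/user/example/data1.txt"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}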
3. Secondary NameNode: The Secondary NameNode (SNN) is an assistant daemon for monitoring the state of the cluster HDFS. Like the NameNode, each cluster has one SNN, and it typically resides on its own machine as well. No other DataNode or TaskTracker daemons run on the same server. The SNN differs from the NameNode in that this process doesn't receive or record any real-time changes to HDFS. Instead, it communicates with the NameNode to take snapshots of the HDFS metadata at intervals defined by the cluster configuration.

The NameNode is a single point of failure for a Hadoop cluster, and the SNN snapshots help minimize the downtime and loss of data. However, a NameNode failure requires human intervention to reconfigure the cluster to use the SNN as the primary NameNode.

4. JobTracker: There is only one JobTracker daemon per Hadoop cluster. It's typically run on a server as a master node of the cluster. The JobTracker daemon is the liaison between our application and Hadoop. Once we submit our code to the cluster, the JobTracker determines the execution plan by determining which files to process, assigns nodes to different tasks, and monitors all tasks as they're running. If a task fails, the JobTracker will automatically relaunch the task, possibly on a different node, up to a predefined limit of retries.

5. TaskTracker: The JobTracker is the master overseeing the overall execution of a MapReduce job, and the TaskTrackers manage the execution of individual tasks on each slave node. The interaction between the JobTracker and TaskTrackers is shown in the following diagram. Each TaskTracker is responsible for executing the individual tasks that the JobTracker assigns. Although there is a single TaskTracker per slave node, each TaskTracker can spawn multiple JVMs to handle many map or reduce tasks in parallel.

[Figure: JobTracker and TaskTracker interaction]

One responsibility of the TaskTracker is to constantly communicate with the JobTracker. If the JobTracker fails to receive a heartbeat from a TaskTracker within a specified amount of time, it will assume the TaskTracker has crashed and will resubmit the corresponding tasks to other nodes in the cluster.

The topology of one typical Hadoop cluster is described in the following figure:

[Figure: Topology of a typical Hadoop cluster]

UNIT-III

1. MapReduce Programming: Writing basic MapReduce programs - Getting the patent data, constructing the basic template of a MapReduce program, Counting things, Adapting for Hadoop's API changes, Streaming in Hadoop.

2. MapReduce Advanced Programming: Advanced MapReduce - Chaining MapReduce jobs, joining data from different sources.

1. MapReduce Programming

> Writing basic MapReduce programs:

To write and understand MapReduce programs, the patent data sets from the National Bureau of Economic Research (NBER) at http://www.nber.org/patents/ are chosen. The data sets were originally compiled for the paper "The NBER Patent Citations Data File: Lessons, Insights and Methodological Tools." We use the citation data set cite75_99.txt and the patent description data set apat63_99.txt. The citation data set contains citations from U.S. patents issued between 1975 and 1999, and has more than 16 million rows. The data sets are in the standard comma-separated values (CSV) format, with the first line being a header row.

> Constructing the basic template of a MapReduce program:

The following skeleton class, MyJob, is the basic template of a MapReduce program:
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJob extends Configured implements Tool {

    public static class MapClass extends MapReduceBase
            implements Mapper<Text, Text, Text, Text> {
        public void map(Text key, Text value,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            output.collect(value, key);
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> values,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            String csv = "";
            while (values.hasNext()) {
                if (csv.length() > 0) csv += ",";
                csv += values.next().toString();
            }
            output.collect(key, new Text(csv));
        }
    }

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        JobConf job = new JobConf(conf, MyJob.class);

        Path in = new Path(args[0]);
        Path out = new Path(args[1]);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);

        job.setJobName("MyJob");
        job.setMapperClass(MapClass.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormat(KeyValueTextInputFormat.class);
        job.setOutputFormat(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.set("key.value.separator.in.input.line", ",");

        JobClient.runJob(job);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new MyJob(), args);
        System.exit(res);
    }
}

A single class, called MyJob, completely defines each MapReduce job. Hadoop requires the Mapper and the Reducer to have their own static classes. These classes are quite small, and the template includes them as inner classes of the MyJob class. The advantage is that everything fits in one file, simplifying code management. Various nodes with different JVMs clone and run the Mapper and the Reducer during job execution, whereas the rest of the job class is executed only at the client machine.

The core of the skeleton is within the run() method, also known as the driver. The driver instantiates, configures, and passes a JobConf object named job to JobClient.runJob() to start the MapReduce job. Each job can reset the default job properties, such as InputFormat, OutputFormat, and so on.

The following command is used to execute the MyJob class:

bin/hadoop jar playground/MyJob.jar MyJob input/cite75_99.txt output

To see only the mapper's output (which we may want to do for debugging purposes), we can set the number of reducers to zero with the option -D mapred.reduce.tasks=0.

The convention for the template is to call the Mapper class MapClass and the Reducer class Reduce. The naming would seem more symmetrical if we called the Mapper class Map, but Java already has a class (interface) named Map. Both the Mapper and the Reducer extend MapReduceBase, which is a small class providing no-op implementations of the configure() and close() methods required by the two interfaces. We use the configure() and close() methods to set up and clean up the map (reduce) tasks. We won't need to override them except for more advanced jobs.

The signatures for the Mapper class and the Reducer class are:

void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter) throws IOException;

void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter) throws IOException;

The key/value pairs generated by the mapper are output via the collect() method of the OutputCollector object. Somewhere in the map() method we need to call:

output.collect((K2) k, (V2) v);

The reduce() method will likely have a loop to go through all the values of type V2:

while (values.hasNext()) {
    V2 v = values.next();
    ...
}

> Counting things:

For the patent citation data, we may want the number of citations a patent has received. This too is counting. The desired output would look like this:

1000026 1
1000033 2
1000043 1
1000044 2
1000045 1
1000046 2
1000049 1
1000051 1
1000054 1
1000065 1
1000067 3

In each record, a patent number is associated with the number of citations it has received. We can write a MapReduce program as the following for this task.
We can write « MapReduce program asthe following for his tsk Nene en ee eee eee marat av Borman Aencee PefenordNEDRY Ro cacy Sharon SPaGz Mabie EK TENG Page Dublic static clase mapciace extends tapreducetere ‘nplennce nappercrext, Text, MUwiLtable, Zarheicable> { private fina ccacto muturteable uno = ney tatiteteable(2); public void map(naee key, Toxt value, ‘Reporter reporter) throws iOmxception { : clesctongeac ca aneazr paar vase ceri , [pobtic static class Roduce excende maptedacesase ‘Seplenents meducor-rateritable, ineirteable, ntlcable, rttcicabie> c pubic vold reduco(mncuritable key, Zterstercintiritable> values, eporter reporter} throws Tomxeeption { ” seine (vases hasten ()) fcoune so vatuee-nexe(} gee) d Guepue-eotiece ay, new Entweteable (count): , , Doblic nt run¢etring{) arge) throvs Brception ( ‘Configuration cont = getceatOs shcent job = nev gobcont (conf, cteationttscogran.class) och a = new Fechiasse(01)) Path oat = nev Patharge(}7 Pilemputrormat.certapurberna[ ob, 45) ‘ilccutputsoraac.seroutpurfann(jeb, ex) SO SE EEE TEU betmareAsncte Prefer OD Rj Clap momran SAPO Hae BUNTING Page esobsana “estat son Job. zetourputkeyClass (zncuritable.class) Job cetourpacvaluesiasa anuxritable.cia00) jee. unre Job , putt scic vod main(scring{] args) earovs sxcapcion ( fe = Goolninner.runinew Gonttgurstson' arse): . ‘The following Figure shows the numberof patents at various citation fequencies. The Plots ona log-og seal When a disibtion shows a ine ina logelog plo.’ considered to be a pover Jaw distribution, The citation count histogram seems 1 fit he description, slough is approximately parabolic curvature aso suggests tognormsl disrbaton, a ictng the mama of patos a aorntctaton oquondon Hany stents have one tation or ot aa hh i at sow ons gen). Sane Btents have hunceds ef eto. Ona glog Pap, ils looks se ena to {stg ne tobe considered a power astbtin Adapting for Hadoop's API changes: ‘The new map() and reduce.) methods ae contained in new absrae classes Mapperand Reducer, respectively, They replace the Mapper and Reducer mrfaces in the orginal APL (org. apache. hadoop .maprod.Happer and {TEV Boheocs Anscte Profenar HOD OV. as Calg Baron SBOE Hebe BDO SMe Page jorg-apache.nadoop.mapred.Reducex), The new abstact classes also replace the MapReduceBase clas, which ha been deprecated, ‘The previous My program i ewriten for new APL 0.20 es the folowing: nal) cttation « venue tastsina( eplit") buble etatte clas Haduce eatands Redversfont, ext, Tent, Ter ( MF {ffeerelengtnt) > 0) ca¥ o *.*) rte stows sre or sah ttein ‘recap op-seteoaucerclaza(neaucn-clnee)} cy ‘ictntrormstnns econo clase) puto cas public statte void matn(stringt} stge) takow Exception { ‘syatenentt zee , We have wo rewrite the template using Text ioput Format ©. We expect all Hadoop lasses to support the new APT when version 0:21 ie released ober Aaa Pofoor& Taj cle. Sonam SDDE ese Beno Pape D Streaming in Hat ‘We have been using Java to write all our Hadoop programs. Hadoop support her languages via generic API called streaming: In practic, Steaming is most useful for ‘witing simple, short MapReduce programs that are more rapidly developed in sripting language that can ake advantage of non-Jva braves, Hadoop Sueaming interacts with programs using the UNIX steaming paradigm, Inputs come in through $10: and outputs go to $1000", Data has o be tex based and each line is considered a record. 
The ove dataflow in Hadoop Streaming i ik pipe where ata steams throvgh the mapper, th output of which i sorte and steamed through the reducer, The command is coat (Amput_file] | laapper] | ort | [reducer] >foutpot_ette) ‘Streaming with UNIX commands: The following single line command is uses 10 get a list of cid patents in ive75_99. et Dln/ideop Jae contetb/atveasinghadsop-2.19.1-streantng See The Streaming API sin a contrib package at contrib/streaming/tadoop~*~ streaning Jae, The fist pat and the ~Lnput and the -autput ergumeatsspeciffat ‘we're runing a Streaming program with te corresponding input and output illdiretory, “The mappe and reducer are specifies arguments in quote. The ouspt of Mis command is “The first sow as the column descriptor “CITED” from the orginal file Noe thatthe rows re sorted lexicographiclly because Steaming processes everything as ext and doesn’t, now otber datatypes. To got record coun the following command wir command (we -)) is used: Tr botnet ofr @ POD Re ClegeDinoranSHai2 Male BSPIDLZNS Page IO bin/uadcop Jar centetb/etreaning/aadoop-0.19..-atee Here we -1 as the mapper to count the number of records in each split Hadoop ‘Streaming (since version 0.19.0) supports he GenerscoptionsParser. TI ‘sed for specifying configuration properties. Tae mapper dietly outputs the record count without any reducer, so We set napred. reduce. tasks 100 and don’ specify the reducer option at all. The final count is 3256906, Morethan 3 milion patents have been cited sccording to our data Streaming with seripts: We can use any executable script tat proceses a ineoriented date stream fiom stort and outputs to sT20UP with Hadoop Swesming, Far example, the following Python sript randomly samples data from STDIN. For each line, we choose random integer between I and 100 and check against the user-given argument > argument is Gys-eravt0)) IT spore sya, random AE (render. Zandine (2,200) “The following command calls the Python script with an argument of 10; sampled output .ext will have (proximately) 10 percent ofthe records in input.txt. soe (eve.srevttI)) S cat input.txt | Randonsanple.py 10 >sample: {output.txt Hadoop Streaming supports a £16 option to package our executable file as pat of {he job submission. The following command is used to execute Random Sampler: ‘Bin/nadcep jar contrth/atreaning/hadoop-0.19. 1-2 input input /atee7S 99.008 w ~esposr “Randcasanple.oy 10 “este Randoncanl SD mapred.reauce. toes] “The random sampling seript was implemented in Python, although any scripting language that works with sTOrN and STDOUT would work. The following sipt code is ‘evrien in PHP §, 7 arn Aas Pfeatr HOD BV. ap Cage Banarres BBA Aoi SaWO Ne Pape TT Feanconisamrte pip a PHP slit pntns random ines ton STON) “roe fe ceanarnsagey ee gare ¢ ccule this Steam sript using the following commane: SS “maoper ‘pip Bandonsanote.zhp 10 Lo. dveereaning. sar ‘The mapper can calculate the maximum over uividual split. Baek mapper will ‘ouput a single value at the end, A single redoce outputs the global maximum. The following Python scrip for a mapper is used o compute the maximum over sph. ao ‘The mapper is ‘att ributetax-py 6. IC outputs the maximum ofthe ninth column {na split The single reducer collows al the mapper outpus. Given seven mappes, the final outputs: Each line records the maximum avers particular split, We se that one split has 2er0 claims in al its records. 
The mapper can also calculate the maximum over an individual split. Each mapper will output a single value at the end, and a single reducer outputs the global maximum. The following Python script (AttributeMax.py) for a mapper is used to compute the maximum over a split:

#!/usr/bin/env python
import sys

index = int(sys.argv[1])
max_val = 0
for line in sys.stdin:
    fields = line.strip().split(",")
    if fields[index].isdigit():
        val = int(fields[index])
        if val > max_val:
            max_val = val
print max_val

The mapper is 'AttributeMax.py 8'. It outputs the maximum of the ninth column in a split. The single reducer collects all the mapper outputs. Given seven mappers, the final output has seven lines; each line records the maximum over a particular split. We see that one split has zero claims in all its records. This sounds suspicious until we recall that the claim count attribute is not available for patents before 1975.

> Streaming with key/value pairs: By default, Streaming uses the tab character to separate the key from the value in a record. When there's no tab character, the entire record is considered the key and the value is empty text. For our data sets, which have no tab character, this provides the illusion that we're processing each individual record as a whole unit. Furthermore, even if the records do have tab characters in them, the Streaming API will only shuffle and sort the records into a different order. Let's examine how key/value pairs work in the Streaming API for each step of the MapReduce data flow:

1. As we've seen, the mapper under Streaming reads a split through STDIN and extracts each line as a record. Our mapper can choose to interpret each input record as a key/value pair or a line of text.

2. The Streaming API will interpret each line of our mapper's output as a key/value pair separated by tab. Similar to the standard MapReduce model, we apply the partitioner to the key to find the right reducer to shuffle the record to. All key/value pairs with the same key will end up at the same reducer.

3. At each reducer, key/value pairs are sorted according to the key by the Streaming API. Recall that in the Java model, all key/value pairs of the same key are grouped together into one key and a list of values. This group is then presented to the reduce() method. Under the Streaming API our reducer is responsible for performing the grouping. This is not too bad, as the key/value pairs are already sorted by key: all records of the same key are in one contiguous chunk. Our reducer will read one line at a time from STDIN and will keep track of the new keys.

4. For all practical purposes, the output (STDOUT) of our reducer is written to a file directly. Technically a no-op step is taken before the file write. In this step the Streaming API breaks each line of the reducer's output by the tab character and feeds the key/value pair to the default TextOutputFormat, which by default re-inserts the tab character before writing the result to a file. Without tab characters in the reducer output it will show the same no-op behavior. We can reconfigure the default behavior to do something different, but it makes sense to leave it as a no-op and push the processing into our reducer.

2. MapReduce Advanced Programming (Advanced MapReduce)

> Chaining MapReduce jobs:

Many complex tasks need to be broken down into simpler subtasks, each accomplished by an individual MapReduce job. For example, to find the ten most-cited patents, a sequence of two MapReduce jobs is needed: the first one creates the inverted citation data and counts the number of citations for each patent, and the second job finds the top ten in that output.

> Chaining MapReduce jobs in a sequence:

Though we can execute the jobs manually one after the other, it's more convenient to automate the execution sequence, with the output of one MapReduce job being the input to the next. This is analogous to Unix pipes, as in the following:

mapreduce-1 | mapreduce-2 | mapreduce-3 | ...

> Chaining MapReduce jobs with complex dependency:

Hadoop has a mechanism to simplify the management of nonlinear job dependencies via the Job and JobControl classes. A Job object is a representation of a MapReduce job. We instantiate a Job object by passing a JobConf object to its constructor. In addition to holding job configuration information, Job also holds dependency information, specified through the addDependingJob() method. For Job objects x and y,

x.addDependingJob(y)

means x will not start until y has finished, as sketched below.
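A minimal sketch of wiring up such a dependency with the old-API jobcontrol classes (the JobConf objects jobConfA and jobConfB are placeholders for fully configured jobs, where job B reads the output that job A writes):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class ChainedJobsDriver {
    public static void runChain(JobConf jobConfA, JobConf jobConfB)
            throws Exception {
        Job jobA = new Job(jobConfA);
        Job jobB = new Job(jobConfB);
        jobB.addDependingJob(jobA);   // B will not start until A has finished

        JobControl control = new JobControl("chain");
        control.addJob(jobA);
        control.addJob(jobB);

        // JobControl implements Runnable; drive it on a thread until done.
        Thread controller = new Thread(control);
        controller.start();
        while (!control.allFinished()) {
            Thread.sleep(1000);
        }
        control.stop();
    }
}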
> Chaining preprocessing and postprocessing steps:

A lot of data processing tasks involve record-oriented preprocessing and postprocessing. For example, in processing documents for information retrieval, we may have one step to remove stop words (words like a, the, and is that occur frequently but aren't too meaningful), and another step for stemming (converting different forms of a word into the same form, such as finishing and finished into finish). We can write a separate MapReduce job for each of the pre- and postprocessing steps and chain them together, using IdentityReducer (or no reducer at all) for these steps. This approach is inefficient, as each step in the chain takes up I/O and storage to process the intermediate results. Another approach is to write a mapper such that it calls all the preprocessing steps beforehand and the reducer calls all the postprocessing steps afterward. Hadoop introduced the ChainMapper and ChainReducer classes in version 0.19.0 to simplify the composition of pre- and postprocessing.

For example, take four mappers (Map1, Map2, Map3, and Map4) and one reducer (Reduce), chained into a single MapReduce job in this sequence:

Map1 | Map2 | Reduce | Map3 | Map4

We need to make sure the key and value outputs of one task have matching types (classes) with the inputs of the next task. This is explained in the following driver code, where each addMapper()/setReducer() call declares the input and output key/value classes of that stage:

Configuration conf = getConf();
JobConf job = new JobConf(conf);
job.setJobName("ChainJob");
FileInputFormat.setInputPaths(job, in);
FileOutputFormat.setOutputPath(job, out);

JobConf map1Conf = new JobConf(false);
ChainMapper.addMapper(job, Map1.class,
    LongWritable.class, Text.class, Text.class, Text.class,
    true, map1Conf);

JobConf map2Conf = new JobConf(false);
ChainMapper.addMapper(job, Map2.class,
    Text.class, Text.class, LongWritable.class, Text.class,
    true, map2Conf);

JobConf reduceConf = new JobConf(false);
ChainReducer.setReducer(job, Reduce.class,
    LongWritable.class, Text.class, Text.class, Text.class,
    true, reduceConf);

JobConf map3Conf = new JobConf(false);
ChainReducer.addMapper(job, Map3.class,
    Text.class, Text.class, LongWritable.class, Text.class,
    true, map3Conf);

JobConf map4Conf = new JobConf(false);
ChainReducer.addMapper(job, Map4.class,
    LongWritable.class, Text.class, LongWritable.class, Text.class,
    true, map4Conf);

JobClient.runJob(job);

The driver sets up the "global" JobConf object with the job's name, input path, output path, and so forth.

> Joining data from different sources:

Unfortunately, joining data in Hadoop is more complex, and there are several possible approaches with different trade-offs. We use a couple of toy data sets to better illustrate joining in Hadoop. Let's take a comma-separated Customers file where each record has three fields: Customer ID, Name, and Phone Number. We put four records in the file for illustration:

1,Stephanie Leung,555-555-5555
2,Edward Kim,123-456-7890
3,Jose Madriz,281-330-8004
4,David Stork,408-555-0000

We store customer orders in a separate file, called Orders. It's in CSV format, with four fields: Customer ID, Order ID, Price, and Purchase Date. If we applied an inner join of the two data sets above, the desired output is:

Desired output of an inner join between Customers and Orders:
1,Stephanie Leung,555-555-5555,B,88.25,20-May-2008
2,Edward Kim,123-456-7890,C,32.00,30-Nov-2007
3,Jose Madriz,281-330-8004,A,12.95,02-Jun-2008
3,Jose Madriz,281-330-8004,D,25.02,22-Jan-2008

Hadoop can also perform outer joins, but we focus on inner joins.

> Reduce-side joining:

Hadoop has a contrib package called datajoin that works as a generic framework for data joining in Hadoop. Its jar file is at contrib/datajoin/hadoop-*-datajoin.jar. To distinguish it from other joining techniques, it's called the reduce-side join, as we do most of the processing on the reduce side. It's also known as the repartitioned join (or the repartitioned sort-merge join), as it is the same as the database technique of the same name. Although it's not the most efficient joining technique, it's the most general and forms the basis of some more advanced techniques (such as the semijoin).

Reduce-side joining introduces some new terminologies and concepts, namely data source, tag, and group key. A data source is similar to a table in relational databases. We have two data sources in our toy example: Customers and Orders. A data source can be a single file or multiple files. The important point is that all the records in a data source have the same structure, equivalent to a schema.

The MapReduce paradigm calls for processing each record one at a time in a stateless manner.
If we want some state information to persist, we have to tag the record with such state, so that the record type (Customers or Orders) travels with the record itself. Tagging the record ensures that specific metadata will always go along with the record. For the purpose of data joining, we want to tag each record with its data source. The group key functions like a join key in a relational database. For our example, the group key is the Customer ID.

DATA FLOW OF A REDUCE-SIDE JOIN: The following Figure illustrates the data flow of a repartitioned join on the toy data sets Customers and Orders, up to the reduce stage.

[Figure: Data flow of a repartitioned (reduce-side) join]

The function reduce() will take its input and do a full cross-product on the values. Reduce() creates all combinations of the values, with the constraint that a combination will not be tagged more than once. In cases where reduce() sees values of distinct tags, the cross-product is the original set of values. In our example, this is the case for group keys 1, 2, and 4. The following Figure illustrates the cross-product for group key 3. We have three values, one tagged with Customers and two tagged with Orders. The cross-product creates two combinations. Each combination consists of the Customers value and one of the Orders values.

[Figure: Cross-product for group key 3]

IMPLEMENTING A JOIN WITH THE DATAJOIN PACKAGE: Hadoop's datajoin package has three abstract classes that we inherit and make concrete: DataJoinMapperBase, DataJoinReducerBase, and TaggedMapOutput. As the names suggest, our MapClass will extend DataJoinMapperBase, and our Reduce class will extend DataJoinReducerBase. The datajoin package has already implemented the map() and reduce() methods in these respective base classes to perform the join dataflow.

> Replicated joins using DistributedCache:

Hadoop has a mechanism called distributed cache that's designed to distribute files to all nodes in a cluster. Distributed cache is handled by the appropriately named class DistributedCache. The mapper loads the smaller data source into memory when the task starts, so each map() call can join against it locally:

public static class MapClass extends MapReduceBase
        implements Mapper<Text, Text, Text, Text> {

    private Hashtable<String, String> joinData = new Hashtable<String, String>();

    @Override
    public void configure(JobConf conf) {
        // Read the file distributed via DistributedCache into joinData,
        // so each map() call can do an in-memory lookup (a replicated join).
        ...
    }
}

> Semijoin: reduce-side join with map-side filtering:

When processing records from Customers and Orders, the mapper will drop any record whose key is not in the set CustomerID415 (415 is the area code). This is sometimes called a semijoin, taking the terminology from the database world.

Last but not least, what if the file CustomerID415 is still too big to fit in memory? Or maybe CustomerID415 does fit in memory, but its size makes replicating it across all the mappers inefficient. This situation calls for a data structure called a Bloom filter. A Bloom filter is a compact representation of a set that supports only the "contains" query, as sketched below.
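A minimal Bloom filter sketch in plain Java (illustrative only; Hadoop also ships an implementation as org.apache.hadoop.util.bloom.BloomFilter). It trades a small false-positive rate for a fixed, compact bit array:

import java.util.BitSet;

public class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int numHashes;

    public SimpleBloomFilter(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    // Derive several bit positions from the key's hashCode (a toy hashing scheme).
    private int position(String key, int i) {
        int h = key.hashCode() * (i + 1) + i;
        return Math.abs(h % size);
    }

    public void add(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(key, i));
        }
    }

    // May return a false positive, but never a false negative.
    public boolean contains(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(key, i))) return false;
        }
        return true;
    }
}

In the semijoin scenario, each mapper would load a Bloom filter built over the join keys instead of the full key set, and drop non-matching records before the shuffle; the few false positives are eliminated later in the reducer.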
UNIT-IV

1. Graph Representation in MapReduce: Modeling data and solving problems with graphs, Shortest Path Algorithm, Friends-of-Friends Algorithm, PageRank Algorithm, Bloom Filters.

1. Graph Representation in MapReduce

> Modeling Data and Solving Problems with Graphs:

A graph consists of a number of nodes (formally called vertices) and links (formally called edges) that connect nodes together. The following Figure shows a graph with nodes and edges.

[Figure: A small graph with nodes and edges]

Graphs are mathematical constructs that represent an interconnected set of objects. They're used to represent data such as the hyperlink structure of the internet and social networks (where they represent relationships between users), and in internet routing to determine optimal paths for forwarding packets. The edges can be directed (implying a one-way relationship) or undirected. For example, we can use a directed graph to model relationships between users in a social network, because relationships are not always bidirectional. The following Figure shows examples of directed and undirected graphs.

[Figure: Examples of directed and undirected graphs]

Graphs can be cyclic or acyclic. In cyclic graphs it's possible for a vertex to reach itself by traversing a sequence of edges. In an acyclic graph it's not possible for a vertex to traverse a path to reach itself. The following Figure shows examples of cyclic and acyclic graphs.

[Figure: Examples of cyclic and acyclic graphs]

> Modeling Graphs:

There are two common ways of representing graphs: with adjacency matrices and with adjacency lists.

ADJACENCY MATRIX: In this representation, we model a graph as an N x N square matrix M, where N is the number of nodes and M[i][j] represents an edge between nodes i and j. The following Figure shows a directed graph representing connections in a social graph. The arrows indicate a one-way relationship between two people. The adjacency matrix shows how this graph would be represented.

[Figure: An adjacency matrix representation of a graph]

The disadvantage of adjacency matrices is that they model both the existence and the lack of a relationship, which makes them a dense data structure.

ADJACENCY LIST: Adjacency lists are similar to adjacency matrices, other than the fact that they don't model the lack of a relationship. The following Figure shows an adjacency list for the above graph.

[Figure: An adjacency list representation of a graph]

The advantage of the adjacency list is that it offers a sparse representation of the data, which is good because it requires less space. It also fits well when representing graphs in MapReduce, because the key can represent a vertex and the values are a list of vertices that denote a directed or undirected relationship to that node.
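As an illustration of that last point (my own sketch, not from the text), a graph stored one vertex per line, with the key vertex and its adjacent vertices separated by tabs, parses naturally into the key/value form a mapper receives:

import java.util.Arrays;
import java.util.List;

public class AdjacencyListLine {
    public static void main(String[] args) {
        // One vertex per line: the key vertex, then its adjacent vertices.
        String line = "dee\tali\tbob";

        String[] parts = line.split("\t");
        String vertex = parts[0];                              // the map key
        List<String> adjacent =
                Arrays.asList(parts).subList(1, parts.length); // the map value

        System.out.println(vertex + " -> " + adjacent);        // dee -> [ali, bob]
    }
}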
Shortest path algorithm:
This algorithm is a common problem in graph theory, where the goal is to find the shortest route between two nodes. The following Figure shows an example of this algorithm on a graph where the edges don't have a weight, in which case the shortest path is the path with the smallest number of hops, or intermediary nodes, between the source and the destination. Applications of this algorithm include traffic-mapping software that determines the shortest route between two addresses, routers that compute the shortest path tree for each route, and social networks that determine connections between users.

Find the shortest distance between two users:
Dijkstra's algorithm is a shortest path algorithm, and its basic implementation uses a sequential, iterative process to traverse the entire graph from the starting node.

Problem:
We need to use MapReduce to find the shortest path between two people in a social graph.

Solution:
Use an adjacency list to model the graph, and for each node store the distance from the original node, as well as a backpointer to the original node. Use the mapper to propagate the distance to the original node, and the reducer to restore the state of the graph. Iterate until the target node has been reached.

Discussion:
The following Figure shows a small social network, which we'll use for this technique. Our goal is to find the shortest path between Dee and Joe. There are four paths that we can take from Dee to Joe, but only one of them results in the fewest number of hops.

[Figure: The social network used for this technique]

We'll implement a parallel breadth-first search algorithm to find the shortest path between two users. Because we're operating on a social network, we don't need to care about weights on our edges. The pseudo-code for the algorithm is as follows:

    map(node-name, node):
        emit(node-name, node)                      // preserve the node itself
        if node.distance != INFINITE:              // propagate neighbors only if
                                                   // the node has been reached
            neighbor-distance = node.distance + 1
            for adj-node in node.adjacent-nodes:
                // emit the adjacent node, the minimum known distance,
                // and the path backpointer
                emit(adj-node, (neighbor-distance, node.backpointer + node-name))

    reduce(node-name, nodes):
        // keep the minimum distance seen for this node and restore its
        // adjacent nodes from the preserved copy
        emit(node-name, node-with-minimum-distance)

    Pseudo-code for breadth-first parallel search on a graph using MapReduce

The following Figure shows the algorithm iterations in play with our social graph. Just like Dijkstra's algorithm, we'll start with all node distances set to infinite and set the distance for the starting node, Dee, to zero. With each MapReduce pass, we determine the nodes that don't have an infinite distance and propagate their distance values to their adjacent nodes. We continue this until we reach the end node.

[Figure: The algorithm iterations in play with our social graph]
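A sketch of the map side of this algorithm in Java follows. The encoding is an assumption for illustration (not taken from the original listing): each node's value is the tab-separated string "distance\tbackpointer\tneighbour1\tneighbour2...", -1 stands for an infinite distance, and the job uses KeyValueTextInputFormat so the node name arrives as the key:

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map side of the parallel BFS, under the value encoding assumed above.
    public class ShortestPathMap extends Mapper<Text, Text, Text, Text> {

        @Override
        protected void map(Text nodeName, Text value, Context context)
                throws IOException, InterruptedException {

            // 1. Re-emit the node unchanged so the reducer can restore
            //    its adjacency list.
            context.write(nodeName, value);

            String[] parts = value.toString().split("\t");
            int distance = Integer.parseInt(parts[0]);
            if (distance < 0) {
                return;        // node not reached yet; nothing to propagate
            }

            // 2. Propagate distance + 1, appending this node to the path
            //    backpointer, to every adjacent node.
            String backpointer =
                    parts[1].isEmpty() ? nodeName.toString()
                                       : parts[1] + ":" + nodeName;
            for (int i = 2; i < parts.length; i++) {
                context.write(new Text(parts[i]),
                              new Text((distance + 1) + "\t" + backpointer));
            }
        }
    }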
[Figure: The initial social network file format and the MapReduce-optimized form]

We first need to create the starting point. This is done by reading the social network (which is stored as an adjacency list) from file and setting the initial distance values. The Figure above shows the two file formats, the second being the format that is used iteratively in our MapReduce code. Our first step is to create the MapReduce form from the original file (test-data/ch7/friends-short-path.txt): in the transformed file, each node carries its current distance (0 for the starting node dee, infinite for every other node) and a backpointer, in addition to its adjacent nodes.

The reducer calculates the minimum distance for each node and outputs the minimum distance, the backpointer, and the original adjacent nodes. This is shown in the following code:

    public static class Reduce extends Reducer<Text, Text, Text, Text> {

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {

            int minDistance = Integer.MAX_VALUE;   // effectively "infinite"
            String backpointer = "";
            String adjacentNodes = "";

            for (Text value : values) {
                String[] parts = value.toString().split("\t", 3);

                // keep the smallest concrete distance (and its backpointer)
                int distance = Integer.parseInt(parts[0]);
                if (distance >= 0 && distance < minDistance) {
                    minDistance = distance;
                    backpointer = parts[1];
                }

                // only the preserved original node carries the adjacency list
                if (parts.length == 3) {
                    adjacentNodes = parts[2];
                }
            }

            // re-emit the node with its minimum distance, backpointer, and
            // original adjacent nodes, ready for the next iteration
            String distanceOut = minDistance == Integer.MAX_VALUE
                    ? "-1" : String.valueOf(minDistance);
            context.write(key, new Text(
                    distanceOut + "\t" + backpointer + "\t" + adjacentNodes));
        }
    }

Now we can run our code. We need to copy the input file into HDFS and then kick off our MapReduce job, specifying the start node name (dee) and the target node name (joe):

    $ hadoop fs -put test-data/ch7/friends-short-path.txt friends-short-path.txt

The output shows that the minimum number of hops between Dee and Joe is 2, and that Ali was the connecting node.
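Because each MapReduce pass only pushes distances one hop further, a driver has to re-run the job until the target node obtains a distance. A sketch of such a loop follows; it assumes the reducer is extended to increment a user-defined counter (TARGET_REACHED is a hypothetical name) once joe receives a concrete distance:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ShortestPathDriver {

        // hypothetical counter the reducer bumps when the target gets a distance
        public enum PathCounter { TARGET_REACHED }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path input = new Path("friends-short-path.txt");

            for (int pass = 0; pass < 10; pass++) {    // cap the iterations
                Path output = new Path("shortest-path-" + pass);

                Job job = Job.getInstance(conf, "BFS pass " + pass);
                job.setJarByClass(ShortestPathDriver.class);
                // mapper/reducer and input/output format setup omitted for brevity
                FileInputFormat.addInputPath(job, input);
                FileOutputFormat.setOutputPath(job, output);

                if (!job.waitForCompletion(true)) {
                    throw new IllegalStateException("pass " + pass + " failed");
                }
                if (job.getCounters()
                       .findCounter(PathCounter.TARGET_REACHED).getValue() > 0) {
                    break;                    // joe has been reached; stop
                }
                input = output;               // next pass reads this pass's output
            }
        }
    }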
Friends-of-friends (FoF):
The FoF algorithm is used by social networking sites such as LinkedIn and Facebook to help users broaden their networks. The friends-of-friends (FoF) algorithm suggests friends that a user may know but who aren't part of the user's immediate network. The following Figure shows the FoFs to be in the 2nd-degree network.

[Figure: FoFs are in a user's 2nd-degree network]

Problem:
We want to implement the FoF algorithm in MapReduce.

Solution:
Two MapReduce jobs are required to calculate the FoFs for each user in a social network. The first job calculates the common friends for each user, and the second job sorts the common friends by the number of connections to our friends.

The following Figure shows a network of people with Jim, one of the users, highlighted.

[Figure: An example FoF network, where Joe and Jon are considered FoFs of Jim]

In the above graph, Jim's FoFs are represented in bold (Dee, Joe, and Jon). Next to each of Jim's FoFs is the number of friends that the FoF and Jim have in common. Our goal here is to determine all the FoFs and order them by the number of friends in common. Therefore, our expected results would have Joe as the first FoF recommendation, followed by Dee, and then Jon. The text file that represents the social graph for this technique is stored at test-data/ch7/friends.txt, with one tab-separated adjacency-list line per user.

The following first MapReduce job code calculates the FoFs for each user:

    public static class Map extends Mapper<Text, Text, TextPair, IntWritable> {

        // TextPair is a custom WritableComparable holding two names
        // in a consistent order
        private static final IntWritable ONE = new IntWritable(1);
        private static final IntWritable TWO = new IntWritable(2);

        @Override
        protected void map(Text user, Text friends, Context context)
                throws IOException, InterruptedException {

            String[] friendList = friends.toString().split("\t");

            for (int i = 0; i < friendList.length; i++) {
                // the user and each friend are one hop apart
                context.write(new TextPair(user.toString(), friendList[i]), ONE);

                // any two friends of the user are two hops apart via the user
                for (int j = i + 1; j < friendList.length; j++) {
                    context.write(
                            new TextPair(friendList[i], friendList[j]), TWO);
                }
            }
        }
    }

    public static class Reduce
            extends Reducer<TextPair, IntWritable, TextPair, IntWritable> {

        private IntWritable friendsInCommon = new IntWritable();

        @Override
        public void reduce(TextPair key, Iterable<IntWritable> values,
                           Context context)
                throws IOException, InterruptedException {

            boolean alreadyFriends = false;
            int commonFriends = 0;

            for (IntWritable hops : values) {
                if (hops.get() == 1) {
                    // the two users are direct friends,
                    // so they can't be FoFs of each other
                    alreadyFriends = true;
                    break;
                }
                commonFriends++;
            }

            // emit the pair and their number of common friends,
            // but only if they aren't already friends
            if (!alreadyFriends) {
                friendsInCommon.set(commonFriends);
                context.write(key, friendsInCommon);
            }
        }
    }

The following second MapReduce job code sorts the FoFs by the number of shared common friends:

    public static class Map extends Mapper<Text, Text, Person, Person> {

        // Person is a custom composite Writable holding a name and a
        // common-friend count; secondary sort orders values by that count
        private Person outputKey = new Person();
        private Person outputValue = new Person();

        @Override
        protected void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {

            // input: a pair of users and their number of common friends
            String[] parts = key.toString().split("\t");
            int commonFriends = Integer.parseInt(value.toString());

            // emit in both directions so each user sees all of their FoFs
            outputKey.set(parts[0], commonFriends);
            outputValue.set(parts[1], commonFriends);
            context.write(outputKey, outputValue);

            outputKey.set(parts[1], commonFriends);
            outputValue.set(parts[0], commonFriends);
            context.write(outputKey, outputValue);
        }
    }

    public static class Reduce extends Reducer<Person, Person, Text, Text> {

        private Text name = new Text();
        private Text potentialFriends = new Text();

        @Override
        public void reduce(Person key, Iterable<Person> values, Context context)
                throws IOException, InterruptedException {

            // the values arrive ordered by the number of common friends,
            // so we simply keep the first ten recommendations
            StringBuilder sb = new StringBuilder();
            int count = 0;
            for (Person potentialFriend : values) {
                if (sb.length() > 0) {
                    sb.append(",");
                }
                sb.append(potentialFriend.getName())
                  .append(":")
                  .append(potentialFriend.getCommonFriends());
                if (++count == 10) {
                    break;        // keep the top-10 recommendations
                }
            }

            name.set(key.getName());
            potentialFriends.set(sb.toString());
            context.write(name, potentialFriends);
        }
    }

Using the Bloom filter in a MapReduce job:
The only additional item we need to perform in our MapReduce driver code is to add the Bloom filter file we created in the previous technique into the distributed cache, as follows:

    DistributedCache.addCacheFile(bloomFilterPath.toUri(), conf);

Our MapReduce job doesn't have a reducer, so the map output will be written directly to HDFS. We launch the job, passing it the input file and the Bloom filter file created in the previous technique, and then inspect the output. A look at the output of the job in HDFS verifies that it contains only the users joe, alison, and marie, who were present in the Bloom filter.
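The map-side counterpart of that driver line reads the filter back from the cache and uses it to drop records. The following is a minimal sketch, assuming the old-style (org.apache.hadoop.mapred) API used elsewhere in these notes and that the Bloom filter is the only file in the cache; the class name BloomFilterMap is illustrative:

    import java.io.DataInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.util.bloom.BloomFilter;
    import org.apache.hadoop.util.bloom.Key;

    // Loads the Bloom filter from the distributed cache in configure(),
    // then drops any record whose key fails the membership test.
    public class BloomFilterMap extends MapReduceBase
            implements Mapper<Text, Text, Text, Text> {

        private BloomFilter filter = new BloomFilter();

        @Override
        public void configure(JobConf job) {
            try {
                Path[] cached = DistributedCache.getLocalCacheFiles(job);
                DataInputStream in = new DataInputStream(
                        new FileInputStream(cached[0].toString()));
                filter.readFields(in);    // deserialize the Bloom filter
                in.close();
            } catch (IOException e) {
                throw new RuntimeException("Unable to load Bloom filter", e);
            }
        }

        @Override
        public void map(Text user, Text value,
                        OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            // emit only records whose key is (probably) in the filter
            if (filter.membershipTest(new Key(user.toString().getBytes()))) {
                output.collect(user, value);
            }
        }
    }

Because membershipTest() can return false positives but never false negatives, this filtering never loses a matching record; at worst a few non-matching records slip through and are discarded downstream.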
