BDA Text Book 1-Part 2
The Big Data Technology Landscape

Puzzle on Architecture
Across:
1. ______ is an important advantage of a shared nothing architecture.
Down:
1. In this architecture, central memory is shared by multiple processors.
2. In this architecture, multiple processors have their own private memory.
Answers: Across: 1. Scalability. Down: 1. Shared Memory; 2. Shared Disk.

BRIEF CONTENTS
* What's in Store?
* NoSQL (Not Only SQL): Where is it Used?; What is it?; Types of NoSQL Databases; Advantages of NoSQL; What We Miss with NoSQL?; Use of NoSQL in Industry; NoSQL Vendors; SQL versus NoSQL; NewSQL; Comparison of SQL, NoSQL, and NewSQL
* Hadoop: Features of Hadoop; Key Advantages of Hadoop; Versions of Hadoop; Overview of Hadoop Ecosystems; Hadoop Distributions; Hadoop versus SQL; Integrated Hadoop Systems Offered by Leading Market Vendors; Cloud-Based Hadoop Solutions

"The goal is to turn data into information, and information into insight."
- Carly Fiorina, former CEO, Hewlett-Packard Co.

WHAT'S IN STORE
The focus of this chapter is on understanding the "big data technology landscape". The chapter is an overview of NoSQL and Hadoop. There are separate chapters on NoSQL (MongoDB and Cassandra) as well as on Hadoop later in the book.

The big data technology landscape can be majorly studied under two important technologies:
1. NoSQL
2. Hadoop

4.1 NoSQL (NOT ONLY SQL)
The term NoSQL was first coined by Carlo Strozzi in 1998 for his lightweight, open-source, non-relational database that did not expose the standard SQL interface. The term was re-introduced by Eric Evans in 2009 to label the emerging class of non-relational, distributed data stores.

4.1.1 Where is it Used?
NoSQL databases are widely used in big data and other real-time web applications. Refer Figure 4.1. NoSQL databases are used to stock log data which can then be pulled for analysis. Likewise, they are used to store social media data and all such data which cannot be stored and analyzed comfortably in an RDBMS.

4.1.2 What is it?
NoSQL stands for Not Only SQL.
These are non-relational, open-source, distributed databases. They are hugely popular today owing to their ability to scale out (scale horizontally) and their adeptness at dealing with a rich variety of data: structured, semi-structured, and unstructured. Refer Figure 4.2 for additional features of NoSQL.

Features of NoSQL:
1. Non-relational: NoSQL databases do not adhere to the relational model; they are key-value, document-oriented, column-oriented, or graph databases.
2. Distributed: The data is spread across several nodes/servers.

Figure 4.1 Where to use NoSQL?

Figure 4.2 What is NoSQL?

3. No support for ACID properties (Atomicity, Consistency, Isolation, and Durability): They do not offer support for the ACID properties of transactions. On the contrary, they have adherence to Brewer's CAP (Consistency, Availability, and Partition tolerance) theorem and are often seen compromising on consistency in favor of availability and partition tolerance.
4. No fixed table schema: NoSQL databases are becoming increasingly popular owing to their support for flexibility of the schema. They do not mandate the data to strictly adhere to any schema structure.

4.1.3 Types of NoSQL Databases
We have already stated that NoSQL databases are non-relational. They can be broadly classified into the following:
1. Key-value or the big hash table.
2. Schema-less.

Refer Figure 4.3. Let us take a closer look at key-value and a few other types of schema-less databases:

1. Key-value: It maintains a big hash table of keys and values. For example: Dynamo, Redis, Riak, etc.
Sample key-value pairs in a key-value database:
Key        | Value
First Name | Simmonds
Last Name  | David

2. Document: It maintains data in collections constituted of documents. For example: Apache CouchDB, Couchbase, MarkLogic, MongoDB, etc.
Sample document in a document database:
{
  "Book Name": "Fundamentals of Business Analytics",
  "Publisher": "Wiley India",
  "Year of Publication": "2011"
}

3. Column: Each storage block has data from only one column. For example: Cassandra, HBase, etc.

Figure 4.3 Types of NoSQL databases
4. Graph: These are also called network databases. A graph database stores data in nodes. For example: Neo4j, HyperGraphDB, etc.
A sample graph in a graph database stores entities (such as persons) as nodes and relationships such as "knows since 2000" or "member since 2000" as the edges connecting them.

Refer Table 4.1 for popular schema-less databases.

4.1.4 Why NoSQL?
1. It has a scale-out architecture instead of the monolithic architecture of relational databases.
2. It can house large volumes of structured, semi-structured, and unstructured data.
3. Dynamic schema: NoSQL databases allow insertion of data without a pre-defined schema. In other words, they facilitate application changes in real time, which thus supports faster development, easy code integration, and requires less database administration.
4. Auto-sharding: It automatically spreads data across an arbitrary number of servers. The application in question is more often than not unaware of the composition of the server pool. It balances the load of data and queries over the available servers, and if and when a server goes down, it is quickly replaced without any major activity disruptions.
5. Replication: It offers good support for replication, which in turn guarantees high availability, fault tolerance, and disaster recovery.

4.1.5 Advantages of NoSQL
Let us enumerate the advantages of NoSQL. Refer Figure 4.4.
1. Can easily scale up and down: NoSQL databases support scaling rapidly and elastically, and even allow scaling to the cloud.
(a) Cluster scale: It allows distribution of the database across 100+ nodes, often in multiple data centers.
(b) Performance scale: It sustains over 100,000+ database reads and writes per second.
(c) Data scale: It supports housing of 1 billion+ documents in the database.

Table 4.1 Popular schema-less databases
Key-Value Data Store | Column-Oriented Data Store | Document Data Store | Graph Data Store
Riak                 | Cassandra                  | MongoDB             | InfiniteGraph
Redis                | HBase                      | CouchDB             | Neo4j
Membase              | Hypertable                 | RavenDB             | AllegroGraph

Figure 4.4 Advantages of NoSQL

2. Doesn't require a pre-defined schema: NoSQL does not require any adherence to a pre-defined schema. It is pretty flexible.
For example, if we look at MongoDB, the documents (equivalent of records in an RDBMS) in a collection (equivalent of a table in an RDBMS) can have different sets of key-value pairs:

{"_id": 101, "BookName": "Fundamentals of Business Analytics", "AuthorName": "Seema Acharya", "Publisher": "Wiley India"}
{"_id": 102, "BookName": "Big Data and Analytics"}

3. Cheap, easy to implement: Deploying NoSQL properly allows for all of the benefits of scale, high availability, fault tolerance, etc., while also lowering operational costs.
4. Relaxes the data consistency requirement: NoSQL databases have adherence to the CAP theorem (Consistency, Availability, and Partition Tolerance). Most of the NoSQL databases compromise on consistency in favor of availability and partition tolerance; however, they do go for eventual consistency.
5. Data can be replicated to multiple nodes and can be partitioned: There are two terms that we will discuss here:
(a) Sharding: Sharding is when different pieces of data are distributed across multiple servers. NoSQL databases support auto-sharding, meaning they can natively and automatically spread data across an arbitrary number of servers, without requiring the application to even be aware of the composition of the server pool. Servers can be added or removed from the data layer without application downtime. This means that data and query load are automatically balanced across servers, and when a server goes down, it can be quickly and transparently replaced with no application disruption.
(b) Replication: Replication is when multiple copies of data are stored across the cluster and even across data centers. This promises high availability and fault tolerance.

4.1.6 What We Miss With NoSQL?
With NoSQL around, we have been able to counter the problem of scale (NoSQL scales out). There is also the flexibility with respect to schema design. However, a few features of conventional RDBMSs are greatly missed. Refer Figure 4.5.

Figure 4.5 What we miss with NoSQL?
NoSQL does not support joins. However, it compensates for this by allowing embedded documents, as in MongoDB. It does not have provision for the ACID properties of transactions; instead, it obeys Eric Brewer's CAP theorem. NoSQL does not have a standard SQL interface, but NoSQL databases such as MongoDB and Cassandra have their own rich query languages (the MongoDB query language and the Cassandra Query Language, CQL) to compensate for the lack of one. One thing which is clearly missed is easy integration with other applications that support SQL.

4.1.7 Use of NoSQL in Industry
NoSQL is being put to use in varied industries. NoSQL databases are used to support analysis for applications such as web user data analysis, log analysis, sensor feed analysis, making recommendations for upsell and cross-sell, etc. Refer Figure 4.6.

Figure 4.6 Use of NoSQL in industry
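One common industry use of the key-value type from Section 4.1.3 is session management: each user session lives under a single key and is fetched in one lookup. The sketch below is purely illustrative — a plain in-memory dictionary stands in for a real key-value store such as Redis or Riak, and the `SessionStore` class and its method names are invented for this example:

```python
# Illustrative in-memory key-value store for session data.
# A real deployment would use a product such as Redis or Riak; the dict
# here is a stand-in that makes the access pattern easy to see.

class SessionStore:
    def __init__(self):
        self._data = {}              # key -> value (value is opaque to the store)

    def put(self, key, value):
        self._data[key] = value      # overwrite on write; no schema enforced

    def get(self, key, default=None):
        return self._data.get(key, default)

store = SessionStore()
store.put("session:101", {"user": "David", "cart": ["book-1", "book-2"]})
print(store.get("session:101")["user"])   # prints: David
```

Because the store never interprets the value, any session shape can be saved without schema changes — exactly the "no fixed table schema" property discussed earlier.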
4.1.8 NoSQL Vendors
Refer Table 4.2 for a few popular NoSQL vendors.

4.1.9 SQL versus NoSQL
Refer Table 4.3 for a few salient differences between SQL and NoSQL.

Table 4.2 Few popular NoSQL vendors
Company  | Product   | Most Widely Used by
Amazon   | DynamoDB  | LinkedIn, Mozilla
Facebook | Cassandra | Netflix, Twitter, eBay
Google   | BigTable  | Adobe Photoshop

Table 4.3 SQL versus NoSQL
SQL                                                    | NoSQL
Relational databases                                   | Non-relational, distributed databases
Relational model                                       | Model-less approach
Pre-defined schema                                     | Dynamic schema for unstructured data
Table-based databases                                  | Document-based, graph-based, wide column store, or key-value pairs databases
Vertically scalable (by increasing system resources)   | Horizontally scalable (by creating a cluster of commodity machines)
Uses SQL                                               | Uses UnQL (Unstructured Query Language)
Not preferred for large datasets                       | Largely preferred for large datasets
Not a best fit for hierarchical data                   | Best fit for hierarchical storage as it follows the key-value pair way of storing data, similar to JSON (JavaScript Object Notation)
Emphasis on ACID properties                            | Follows Brewer's CAP theorem
Excellent support from vendors                         | Relies heavily on community support
Supports complex querying and data-keeping needs       | Does not have good support for complex querying
Can be configured for strong consistency               | A few support strong consistency (e.g., MongoDB); a few others can be configured for eventual consistency (e.g., Cassandra)
Examples: Oracle, DB2, MySQL, MS SQL, PostgreSQL, etc. | Examples: MongoDB, HBase, Cassandra, Redis, Neo4j, CouchDB, Couchbase, Riak, etc.

Figure 4.7 Characteristics of NewSQL

4.1.10 NewSQL
There is yet another new term doing the rounds: "NewSQL". So what is NewSQL, and how is it different from SQL and NoSQL? What is it that we love about NoSQL that is not there with our traditional RDBMS, and what is it that we love about SQL that NoSQL does not support? You guessed it right! We need a database that has the same scalable performance of NoSQL systems for On-Line Transaction Processing (OLTP) while still maintaining the ACID guarantees of a traditional database. This new modern RDBMS is called NewSQL.
It supports the relational data model and uses SQL as its primary interface.

4.1.10.1 Characteristics of NewSQL
Refer Figure 4.7 to learn about the characteristics of NewSQL. NewSQL is based on the shared nothing architecture with a SQL interface for application interaction.

4.1.11 Comparison of SQL, NoSQL, and NewSQL
Refer Table 4.4 for a comparative study of SQL, NoSQL, and NewSQL.

Table 4.4 Comparative study of SQL, NoSQL, and NewSQL
                                  | SQL                         | NoSQL                          | NewSQL
Adherence to ACID properties      | Yes                         | No                             | Yes
OLTP/OLAP                         | Yes                         | No                             | Yes
Schema rigidity                   | Yes                         | No                             | Maybe
Adherence to the relational model | Yes                         | No                             | Yes
Data format flexibility           | No                          | Yes                            | Maybe
Scalability                       | Scale up (vertical scaling) | Scale out (horizontal scaling) | Scale out (horizontal scaling)
Distributed computing             | Yes                         | Yes                            | Yes
Community support                 | Huge                        | Growing                        | Slowly growing

Figure 4.8 Hadoop

4.2 HADOOP
Hadoop is an open-source project of the Apache Software Foundation. It is a framework written in Java, originally developed by Doug Cutting in 2005, who named it after his son's toy elephant. He was working with Yahoo then. It was created to support distribution for "Nutch", the text search engine. Hadoop uses Google's MapReduce and Google File System technologies as its foundation. Hadoop is now a core part of the computing infrastructure for companies such as Yahoo, Facebook, LinkedIn, Twitter, etc. Refer Figure 4.8.

4.2.1 Features of Hadoop
Let us cite a few features of Hadoop:
1. It is optimized to handle massive quantities of structured, semi-structured, and unstructured data, using commodity hardware, that is, relatively inexpensive computers.
2. Hadoop has a shared nothing architecture.
3. It replicates its data across multiple computers so that if one goes down, the data can still be processed from another machine that stores its replica.
4. Hadoop is for high throughput rather than low latency. It is a batch operation handling massive quantities of data; therefore the response time is not immediate.
5. It complements On-Line Transaction Processing (OLTP) and On-Line Analytical Processing (OLAP). However, it is not a replacement for a relational database management system.
6. It is NOT good when work cannot be parallelized or when there are dependencies within the data.
7. It is NOT good for processing small files. It works best with huge data files and datasets.

4.2.2 Key Advantages of Hadoop
Refer Figure 4.9 for a quick look at the key advantages of Hadoop:
1. Stores data in its native format: Hadoop's data storage framework (HDFS, the Hadoop Distributed File System) can store data in its native format. There is no structure that is imposed while keying in or storing data. HDFS is pretty much schema-less. It is only later, when the data needs to be processed, that structure is imposed on the raw data.

Figure 4.9 Key advantages of Hadoop

2. Scalable: Hadoop can store and distribute very large datasets (involving thousands of terabytes of data) across hundreds of inexpensive servers that operate in parallel.
3. Cost-effective: Owing to its scale-out architecture, Hadoop has a much reduced cost per terabyte of storage and processing.
4. Resilient to failure: Hadoop is fault-tolerant. It practices replication of data diligently, which means whenever data is sent to any node, the same data also gets replicated to other nodes in the cluster, thereby ensuring that in the event of a node failure, there will always be another copy of the data available for use.
5. Flexibility: One of the key advantages of Hadoop is its ability to work with all kinds of data: structured, semi-structured, and unstructured. It can help derive meaningful business insights from email conversations, social media data, clickstream data, etc. It can be put to several purposes such as log analysis, data mining, recommendation systems, market campaign analysis, etc.
6. Fast: Processing is extremely fast in Hadoop compared to other conventional systems, owing to the "move code to data" paradigm.
4.2.3 Versions of Hadoop
There are two versions of Hadoop available:
1. Hadoop 1.0
2. Hadoop 2.0
Let us take a look at the features of both. Refer Figure 4.10.

Figure 4.10 Versions of Hadoop

4.2.3.1 Hadoop 1.0
It has two main parts:
1. Data storage framework: It is a general-purpose file system called the Hadoop Distributed File System (HDFS). HDFS is schema-less. It simply stores data files, and these data files can be in just about any
None ofthe options were viable sc Ted co proces ine iene cated bythe daa beng moved in and ou the Hadoop ser Let us look a whether these limitations have been wholly a in pats esolved by Hadoop 20 42.3.2 Hadoop 2.0 1 Hadoop 2.0, HDES coninucs tobe the data storage framework. However anew and separate resource management famework lle Yet Anaher Reurce Negotiaor (YARN) ha Ben added. Aay application capable of viding sl inc patil aks i supported hy YARN. YARN coordinates the allocation of subiaks of the submised application, thereby farther enhancing the BGEbily. scalability, and eiceney of the applications. It wok by having an Application atria place ofthe erstwhile JobTckeh, ru ning applications on resources governed by a new NodeManager fin place ofthe erstwhile Tsk Tracer). Appleton asters able tor sny application and nt jusc MapRece Ta other words, means that the MapRedce Programming expertise is no longet required Furthermore, it not ony supports batch processing bu also eat proces MapReduce i no longer the only dat ocesng option: other semaive data procesingRunctons sich ar data standardbacon, Inst daca management can now bs performed natively in HFS.Big Daa ad Arig oe aa (Petar Batanase Rae ee en fener TFS Cora Figure 4.11 Hadoop ecosystem. 4.2.4 Overview of Hadoop Ecosystems ‘We ill discuss he Hadoop ecoystem in brief ere Ie will be covered in detail in Chapter 5. The following are the components ofthe Hadoop ecosystem (shown in Figure 4.11) 1. HFS: Hadoop Distbuced File System. Ie simply stores data fs as close to he orginal form a possible. 2, Base: tis Hadoop’ database and compares well wth an RDBMS. Ie supports structured data storage for large tables. 23. Hive: Iccnables analysis of lage datasets using a language very similar o standard ANSI SQL. This. | implies that anyone fila with SQL should be able to access dat stored on a Hadoop cluster 4. Pigg Pigis an easy to understand data low language. 
It helps with the analysis of large datasets, which is quite the order with Hadoop. Even if one does not have proficiency in MapReduce programming, analysts and the persons entrusted with the task of comprehending data will still be able to analyze the data in a Hadoop cluster, as Pig scripts are automatically converted into MapReduce jobs by the Pig interpreter. MapReduce programming will be covered in detail in Chapter 8.
5. ZooKeeper: It is a coordination service for distributed applications.
6. Oozie: It is a workflow scheduler system to manage Apache Hadoop jobs.
7. Mahout: It is a scalable machine learning and data mining library.
8. Chukwa: It is a data collection system for managing large distributed systems.
9. Sqoop: It is used to transfer bulk data between Hadoop and structured data stores such as relational databases.
10. Ambari: It is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters.

4.2.5 Hadoop Distributions
Hadoop is an open-source Apache project. Anyone can freely download the core aspects of Hadoop. The core aspects of Hadoop include the following:
1. Hadoop Common
2. Hadoop Distributed File System (HDFS)
3. Hadoop YARN (Yet Another Resource Negotiator)
4. Hadoop MapReduce

There are a few companies, such as IBM, Amazon Web Services, Microsoft, Teradata, Hortonworks, and Cloudera, that have packaged Hadoop into more easily consumable distributions or services. Although each of these companies has a slightly different strategy, the key essence remains Hadoop's ability to distribute data and workloads across potentially thousands of servers, thus making big data manageable. A few Hadoop distributions are given in Figure 4.12.

Figure 4.12 Hadoop distributions

4.2.6 Hadoop versus SQL
Table 4.5 lists the differences between Hadoop and SQL.
Table 4.5 Hadoop versus SQL
Hadoop                   | SQL
Scale out                | Scale up
Key-value pairs          | Relational tables
Functional programming   | Declarative queries
Offline batch processing | Online transaction processing

4.2.7 Integrated Hadoop Systems Offered by Leading Market Vendors
Recall that Hadoop 1.0 has two main parts, the data storage framework and the data processing framework, and that in Hadoop 2.0 a new and separate resource management framework called Yet Another Resource Negotiator (YARN) has been added. Refer Figure 4.13 to get a glimpse of the leading market vendors offering integrated Hadoop systems.

Figure 4.13 Integrated Hadoop systems

4.2.8 Cloud-Based Hadoop Solutions
Amazon Web Services holds out a comprehensive, end-to-end portfolio of cloud computing services to help manage big data. The aim is to achieve this and more while retaining the emphasis on reducing costs, scaling to meet demand, and accelerating the speed of innovation.
The Google Cloud Storage connector for Hadoop empowers one to perform MapReduce jobs directly on data in Google Cloud Storage, without the need to copy it to local disk and run it in the Hadoop Distributed File System (HDFS). The connector simplifies Hadoop deployment, and at the same time reduces cost and provides performance comparable to HDFS, all this while increasing reliability by eliminating the single point of failure of the NameNode. Refer Figure 4.14.

Figure 4.14 Cloud-based solution

POINT ME (BOOKS)
* Hadoop for Dummies, Dirk deRoos, Paul C. Zikopoulos, Roman B. Melnyk, Bruce Brown, Wiley India Pvt. Ltd.
* NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, Pramod J. Sadalage and Martin Fowler

CONNECT ME (INTERNET RESOURCES)
* http://www.mongodb.com/nosql-explained
* http://nosql-database.org
* http://www.techrepublic.com/blog/10-things/10-things-you-should-know-about-nosql-databases
* http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html
* http://hadoop.apache.org/

TEST ME (Fill in the blanks)
1. The expansion for CAP is ______.
5. ______ has no support for the ACID properties of transactions.
Answers: 1. Consistency, Availability, and Partition tolerance; 5. NoSQL.

REMIND ME
* Hadoop has a shared nothing architecture.

Place the following words in the relevant basket (SQL or NoSQL):
(a) Relational (b) Distributed (c) Predefined schema (d) Wide column stores (e) Vertically scalable (f) Key-value pairs (g) MySQL (h) CouchDB (i) Neo4j (j) Cassandra (k) Large dataset (l) ACID properties (m) Brewer's CAP theorem (n) Document based database (o) Scales horizontally (p) Avoids join operations (q) JSON data (r) Table or relations

Answers:
SQL: Relational; Predefined schema; Vertically scalable; MySQL; ACID properties; Table or relations
NoSQL: Distributed; Wide column stores; Key-value pairs; CouchDB; Neo4j; Cassandra; Large dataset; Brewer's CAP theorem; Document based database; Scales horizontally; Avoids join operations; JSON data

Introduction to Hadoop

BRIEF CONTENTS
* What's in Store?
* Introducing Hadoop
* Data: The Treasure Trove
* Why Hadoop?
* Why not RDBMS?
* RDBMS versus Hadoop
* Distributed Computing Challenges: Hardware Failure; How to Process this Gigantic Store of Data?
* History of Hadoop: The Name "Hadoop"
* Hadoop Overview: Key Aspects of Hadoop; Hadoop Components; Hadoop Conceptual Layer; High-Level Architecture of Hadoop
* Hadoop Distributions
* HDFS: HDFS Daemons; Anatomy of File Read; Anatomy of File Write; Replica Placement Strategy; Working with HDFS Commands; Special Features of HDFS
* Processing Data with Hadoop: MapReduce Daemons; How does MapReduce Work?; MapReduce Example
* Managing Resources and Applications with Hadoop YARN: Limitations of Hadoop 1.0 Architecture; HDFS Limitation; Hadoop 2: HDFS; Hadoop 2 YARN: Taking Hadoop Beyond Batch
* Interacting with Hadoop Ecosystem: Pig; Hive; Sqoop; HBase

"There were 5 exabytes of information created between the dawn of civilization and 2003, but that much information is now created every 2 days."
- Eric Schmidt, of Google, said in 2010

WHAT'S IN STORE?
We assume that you are already familiar with the distributed file system and the distributed computing model. The focus of this chapter will be to build on this knowledge base and to comprehend and appreciate how Hadoop stores and processes colossal volumes of data. It will be our endeavor to get you the importance of Hadoop with case studies and scenarios. We will also discuss HDFS commands and MapReduce programming; however, MapReduce programming will be discussed in detail in Chapter 8.
We suggest you refer to some of the learning resources provided at the end of this chapter and also complete the "Test Me" exercises.

5.1 INTRODUCING HADOOP
Today, "big data" seems to be the buzzword! Enterprises the world over are beginning to realize that there is a huge volume of untapped information before them in the form of structured, semi-structured, and unstructured data. This varied variety of data is spread across the networks.
Let us look at a few statistics to get an idea of the amount of data which gets generated every day, every minute, and every second.
1. Every day:
(a) NYSE (New York Stock Exchange) generates 1.5 billion shares and trade data.
(b) Facebook stores 2.7 billion comments and Likes.
(c) Google processes about 24 petabytes of data.
2. Every minute:
(a) Facebook users share nearly 2.5 million pieces of content.
(b) Twitter users tweet nearly 300,000 times.
(c) Instagram users post nearly 220,000 new photos.
(d) YouTube users upload 72 hours of new video content.
(e) Apple users download nearly 50,000 apps.
(f) Email users send over 200 million messages.
(g) Amazon generates over $80,000 in online sales.
(h) Google receives over 4 million search queries.
3. Every second:
(a) Banking applications process more than 10,000 credit card transactions.

Figure 5.1 Challenges with big volume, variety, and velocity of data

5.2 WHY HADOOP?
Ever wondered why Hadoop has been and is one of the most wanted technologies? The key consideration (the rationale behind its huge popularity) is its capability to handle massive amounts of data, of different categories, fairly quickly. The other considerations are (Figure 5.2):

Figure 5.2 Key considerations of Hadoop

Data: The Treasure Trove
1. It provides business advantages such as generating product recommendations, inventing new products, analyzing the market, and many, many more.
2. It provides a few early key indicators that can turn the fortune of a business.
3. It provides room for precise analysis: if we have more data for analysis, then we have greater precision of analysis.
To process, analyze, and make sense of these different kinds of data, we need a system that scales and addresses the challenges shown in Figure 5.1.

1. Low cost: Hadoop is an open-source framework and uses commodity hardware (commodity hardware is relatively inexpensive and easy-to-obtain hardware) to store enormous quantities of data.
2. Computing power: Hadoop is based on a distributed computing model which processes very large volumes of data fairly quickly. The more the number of computing nodes, the more the processing power at hand.
3. Scalability: This boils down to simply adding nodes as the system grows, and it requires much less administration.
4. Storage flexibility: Unlike in an RDBMS, in Hadoop data need not be pre-processed before storing it. You can store as much data as you want and decide how to use it later. This holds true even for unstructured data such as text, images, and videos.
5. Inherent data protection: Hadoop protects data and executing applications against hardware failure. If a node fails, it automatically redirects the jobs that had been assigned to the failed node to other functioning nodes, ensuring that distributed computing does not fail. It also stores multiple copies (replicas) of the data on various nodes across the cluster.

Hadoop makes use of commodity hardware, a distributed file system, and distributed computing, as shown in Figure 5.3. In this new design, a group of machines gathered together is known as a cluster. With this new paradigm, the data can be managed with Hadoop as follows:
1. The data is broken into chunks, and the chunks/blocks of each file are distributed across several nodes; each node works on some chunk of the data, as shown in Figure 5.3.
2. Locally available compute resources are used to process each chunk of data in parallel.
3. The Hadoop framework handles failover smartly and automatically.

Figure 5.3 Hadoop framework (distributed file system, commodity hardware)

5.3 WHY NOT RDBMS?
RDBMS is not suitable for storing and processing large files, images, and videos. RDBMS is also not a good choice when it comes to advanced analytics involving machine learning. Figure 5.4 describes RDBMS with respect to the cost of storage: as the volume of data shows an upward trend, it calls for huge investment.

Figure 5.4 RDBMS with respect to cost of storage (upward trend)

5.4 RDBMS VERSUS HADOOP
Table 5.1 describes the differences between RDBMS and Hadoop.

Table 5.1 RDBMS versus Hadoop
Parameters | RDBMS                                    | Hadoop
System     | Relational Database Management System    | Node-based flat structure
Data       | Suitable for structured data             | Suitable for structured and unstructured data; supports a variety of data formats in real time such as XML, JSON, and text-based flat file formats
Processing | OLTP                                     | Analytical, big data processing
Choice     | When the data needs consistent relationships | Big data processing, which does not require any consistent relationships between data
Processor  | Needs expensive hardware or high-end processors to store huge volumes of data | In a Hadoop cluster, a node requires only a processor, a network card, and a few hard drives
Cost       | Around $10,000 to $24,000 per terabyte of storage | Around $4,000 per terabyte of storage

5.5 DISTRIBUTED COMPUTING CHALLENGES
Although there are several challenges with distributed computing, we will focus on two major challenges.

5.5.1 Hardware Failure
In a distributed system, several servers are networked together. This implies that, more often than not, there may be a possibility of hardware failure. And when such a failure does happen, how does one retrieve the
"Yes, it might! Therefore, I share it with at least three team members, so that even if one of them is on leave or out of office for some reason, our work will not be stalled."

5.5.2 How to Process This Gigantic Store of Data?

In a distributed system, the data is spread across the network on several machines. A key challenge here is to integrate the data available on several machines prior to processing it. Hadoop solves this problem by using MapReduce programming, a programming model to process the data. (MapReduce programming is discussed a little later.)

5.6 HISTORY OF HADOOP

Hadoop was created by Doug Cutting, the creator of Apache Lucene (a commonly used text search library). Hadoop started as a part of the Apache Nutch project (an open-source web search engine), itself a part of the Lucene project. Refer Figure 5.6 for more details.

5.6.1 The Name "Hadoop"

The name Hadoop is not an acronym; it is a made-up name. The project creator, Doug Cutting, explains how the name came about: "The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such names."

Figure 5.6 Hadoop history (Doug Cutting added DFS and MapReduce to Nutch; Yahoo! later hired Doug Cutting).

Subprojects and "contrib" modules in Hadoop also tend to have names that are unrelated to their function, often with an elephant or other animal theme ("Pig", for example).

Reference: Hadoop, The Definitive Guide, 3rd Edition, O'Reilly Publication, page 9.

5.7 HADOOP OVERVIEW

Hadoop is an open-source software framework to store and process massive amounts of data in a distributed fashion on large clusters of commodity hardware. Basically, Hadoop accomplishes two tasks:

1. Massive data storage.
2. Faster data processing.

5.7.1 Key Aspects of Hadoop

Figure 5.7 describes the key aspects of Hadoop.

Figure 5.7 Key aspects of Hadoop.
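One of these key aspects, fault tolerance through replication (Section 5.5.1), can be made concrete with a small sketch. The cluster model below is a toy written in Python purely for illustration (node names, block IDs, and the random placement policy are invented; real HDFS placement is rack-aware, as Section 5.10.4 describes): storing each block on RF distinct nodes means any single node failure still leaves every block readable.

```python
import random

def place_replicas(blocks, nodes, rf=3):
    """Assign each block to rf distinct nodes (a toy stand-in for HDFS placement)."""
    placement = {}
    for block in blocks:
        placement[block] = random.sample(nodes, rf)  # rf distinct nodes per block
    return placement

def survives_failure(placement, failed_node):
    """Every block is still readable if at least one replica lives on a healthy node."""
    return all(any(n != failed_node for n in replicas)
               for replicas in placement.values())

nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_replicas(["blk_1", "blk_2", "blk_3"], nodes, rf=3)
print(survives_failure(placement, "node3"))  # True: rf=3 tolerates any one failure
```

With rf=1 the same check would fail whenever the failed node holds a block's only copy, which is exactly why Alex shares client inputs with three team members rather than one.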
5.7.2 Hadoop Components

Figure 5.8 depicts the Hadoop components.

Figure 5.8 Hadoop components.

Hadoop Core Components:

1. HDFS:
   (a) Storage component.
   (b) Distributes data across several nodes.
   (c) Natively redundant.
2. MapReduce:
   (a) Computational framework.
   (b) Splits a task across multiple nodes.
   (c) Processes data in parallel.

Hadoop Ecosystem: The Hadoop ecosystem consists of support projects that enhance the functionality of the Hadoop core components. The ecosystem projects are as follows:

1. HIVE
2. PIG
3. SQOOP
4. HBASE
5. FLUME
6. OOZIE
7. MAHOUT

5.7.3 Hadoop Conceptual Layer

Hadoop is conceptually divided into a Data Storage Layer, which stores huge volumes of data, and a Data Processing Layer, which processes data in parallel to extract richer and more meaningful insights from it (Figure 5.9).

5.7.4 High-Level Architecture of Hadoop

Hadoop has a distributed master/slave architecture. The master node is known as the NameNode and the slave nodes are known as DataNodes. Figure 5.10 depicts the master/slave architecture of the Hadoop framework.

Figure 5.9 Hadoop conceptual layer.
Figure 5.10 Hadoop high-level architecture. Reference: Hadoop in Practice, Alex Holmes.

Let us look at the key components of the master node:

1. Master DFS: Its main responsibility is partitioning the data storage across the slave nodes. It also keeps track of the locations of data on the DataNodes.
2. Master MapReduce: It decides and schedules computation tasks on the slave nodes.

5.8 USE CASE OF HADOOP

5.8.1 ClickStream Data

ClickStream data (mouse clicks) helps you to understand the purchasing behavior of customers. ClickStream analysis helps online marketers to optimize their product web pages, promotional content, etc., to improve their business.

Figure 5.11 ClickStream data analysis.

ClickStream analysis (Figure 5.11) using Hadoop provides three key benefits:

1. Hadoop helps to join ClickStream data with other data sources such as Customer Relationship Management data (customer demographics data, sales data, and information on advertising campaigns).
This additional data often provides the much-needed information to understand customer behavior.
2. Hadoop's scalability helps you to store years of data without ample incremental cost, so you can perform temporal or year-over-year analysis on ClickStream data which your competitors may miss.
3. Business analysts can use Apache Pig or Apache Hive for website analysis. With these tools, you can organize ClickStream data by user session, refine it, and feed it to visualization or analytics tools.

Reference: Hortonworks, "Business Value of Hadoop" (hortonworks.com, 2014).

5.9 HADOOP DISTRIBUTORS

The companies shown in Figure 5.12 provide products that include Apache Hadoop, commercial support, and/or tools and utilities related to Hadoop.

Figure 5.12 Common Hadoop distributors.

5.10 HDFS (HADOOP DISTRIBUTED FILE SYSTEM)

Some key points of the Hadoop Distributed File System are as follows:

1. It is the storage component of Hadoop.
2. It is a distributed file system.
3. It is modeled after the Google File System.
4. It is optimized for high throughput (HDFS leverages large block sizes and moves computation to where the data is stored).
5. You can replicate a file a configured number of times, which makes it tolerant of both software and hardware failure.
6. It re-replicates data blocks automatically when nodes fail.
7. You realize the power of HDFS when you perform reads or writes on large files (gigabytes and larger).
8. It sits on top of a native file system, such as ext3 and ext4, as described in Figure 5.13.

Figure 5.13 HDFS on top of the native file system.

Figure 5.14 describes important key points of HDFS, and Figure 5.15 describes the HDFS architecture. A client application interacts with the NameNode for metadata-related activities and communicates with the DataNodes to read and write files. The DataNodes converse with one another for pipelined reads and writes.

Let us assume that the file "Sample.txt" is of size 192 MB. As per the default data block size (64 MB), it will be split into three blocks and replicated across the nodes of the cluster based on the default replication factor.

5.10.1 HDFS Daemons

5.10.1.1 NameNode

HDFS breaks large files into smaller pieces called blocks. The NameNode uses a rack ID to identify the DataNodes in a rack (a rack is a collection of DataNodes within the cluster). The NameNode keeps track of which blocks of a file are placed on which DataNodes, and it manages file-related operations such as read, write, create, and delete. Its main job is managing the file system namespace, which is the collection of files in the cluster. The namespace includes the mapping of blocks to files and the file properties, and it is stored in a file called FsImage. The NameNode uses an EditLog (transaction log) to record every transaction that happens to the file system metadata. Refer Figure 5.16.

When the NameNode starts, it reads the FsImage and EditLog from disk and applies all the transactions from the EditLog to its in-memory representation of the FsImage. Then it flushes a new version of the FsImage to disk and truncates the old EditLog, because the changes have now been folded into the FsImage. There is a single NameNode per cluster.

Reference: http://hadoop.apache.org/docs/1.0.4/hdfs_design.html

Figure 5.14 Hadoop Distributed File System: key points.
Figure 5.15 Hadoop Distributed File System architecture. Reference: Hadoop in Practice, Alex Holmes.
Figure 5.16 NameNode.

5.10.1.2 DataNode

There are multiple DataNodes per cluster.
During pipelined reads and writes, the DataNodes communicate with each other. A DataNode also continuously sends a "heartbeat" message to the NameNode to confirm the connectivity between the NameNode and the DataNode. In case there is no heartbeat from a DataNode, the NameNode replicates that DataNode's blocks elsewhere within the cluster and keeps on running as if nothing had happened.

Let us explain the concept behind the heartbeat report sent by the DataNodes to the NameNode. Reference: Wrox Certified Big Data Developer.

You work for a renowned IT organization. Every day when you come to office, you are required to swipe in to record your attendance. This record of attendance is then shared with your manager to keep him posted on who all from his team have reported for work. Your manager is able to allocate tasks to the team members who are present in office; the tasks for the day cannot be allocated to team members who have not turned up. Likewise, the heartbeat report is a way by which the DataNodes inform the NameNode that they are up and functional and can be assigned tasks. Figure 5.17 depicts this scenario.

Figure 5.17 NameNode and DataNode communication.

5.10.1.3 Secondary NameNode

The Secondary NameNode takes a snapshot of the HDFS metadata at intervals specified in the Hadoop configuration. Since the memory requirements of the Secondary NameNode are the same as those of the NameNode, it is better to run the NameNode and the Secondary NameNode on different machines. In case of failure of the NameNode, the Secondary NameNode can be configured manually to bring up the cluster. However, the Secondary NameNode does not record any real-time changes that happen to the HDFS metadata.

5.10.2 Anatomy of File Read

Figure 5.18 describes the anatomy of a file read. The steps involved are as follows:

1. The client opens the file that it wishes to read by calling open() on the DistributedFileSystem.
2. The DistributedFileSystem communicates with the NameNode to get the locations of the data blocks. The NameNode returns the addresses of the DataNodes that the data blocks are stored on. Subsequent to this, the DistributedFileSystem returns an FSDataInputStream to the client to read from the file.
3. The client then calls read() on the stream. DFSInputStream, which holds the addresses of the DataNodes for the first few blocks of the file, connects to the closest DataNode for the first block in the file.
4. The client calls read() repeatedly to stream the data from the DataNode.
5. When the end of a block is reached, DFSInputStream closes the connection with that DataNode. It repeats the steps to find the best DataNode for the next block and for subsequent blocks.
6. When the client completes the reading of the file, it calls close() on the FSDataInputStream to close the connection.

Reference: Hadoop, The Definitive Guide, 3rd Edition, O'Reilly Publication.

Figure 5.18 File read.

5.10.3 Anatomy of File Write

Figure 5.19 describes the anatomy of a file write. The steps involved are as follows:

1. The client calls create() on the DistributedFileSystem to create a file.
2. An RPC call to the NameNode happens through the DistributedFileSystem to create a new file. The NameNode performs various checks (for example, whether such a file already exists). Initially, the NameNode creates the file without associating any data blocks to it. The DistributedFileSystem returns an FSDataOutputStream to the client to perform the write.
3. As the client writes data, the data is split into packets by DFSOutputStream, which are then written to an internal queue called the data queue. The DataStreamer consumes the data queue. The DataStreamer requests the NameNode to allocate new blocks by selecting a list of suitable DataNodes to store the replicas. This list of DataNodes makes a pipeline. Here, we will go with the default replication factor of three, so there will be three nodes in the pipeline for the first block.
4. The DataStreamer streams the packets to the first DataNode in the pipeline, which stores each packet and forwards it to the second DataNode in the pipeline. In the same way, the second DataNode stores the packet and forwards it to the third DataNode in the pipeline.
5. In addition to the internal data queue, DFSOutputStream also manages an "ack queue" of packets that are waiting to be acknowledged by the DataNodes. A packet is removed from the ack queue only when it has been acknowledged by all the DataNodes in the pipeline.
6. When the client finishes writing the file, it calls close() on the stream.
7. This flushes all the remaining packets to the DataNode pipeline and waits for the relevant acknowledgments before communicating with the NameNode to inform the client that the creation of the file is complete.

Reference: Hadoop, The Definitive Guide, 3rd Edition, O'Reilly Publication.

Figure 5.19 File write.

5.10.4 Replica Placement Strategy

5.10.4.1 Hadoop Default Replica Placement Strategy

As per the Hadoop default replica placement strategy, the first replica is placed on the same node as the client. The second replica is placed on a node on a different rack. The third replica is placed on the same rack as the second, but on a different node in that rack. Once the replica locations have been set, a pipeline is built. This strategy provides good reliability. Figure 5.20 describes a typical replica pipeline.

Reference: Hadoop, The Definitive Guide, 3rd Edition, O'Reilly Publication.

Figure 5.20 Replica placement strategy.

5.10.5 Working with HDFS Commands

Objective: To get the list of directories and files at the root of HDFS.
Act:
hadoop fs -ls /

Objective: To get the complete list of directories and files of HDFS.
Act:
hadoop fs -lsr /

Objective: To create a directory (say, sample) in HDFS.
Act:
hadoop fs -mkdir /sample

Objective: To copy a file from the local file system to HDFS.
Act:
hadoop fs -put /root/sample/test.txt /sample/test.txt

Objective: To copy a file from HDFS to the local file system.
Act:
hadoop fs -get /sample/test.txt /root/sample/testsample.txt

Objective: To copy a file from the local file system to HDFS via the copyFromLocal command.
Act:
hadoop fs -copyFromLocal /root/sample/test.txt /sample/testsample.txt

Objective: To copy a file from HDFS to the local file system via the copyToLocal command.
Act:
hadoop fs -copyToLocal /sample/test.txt /root/sample/testsample1.txt

Objective: To display the contents of an HDFS file on the console.
Act:
hadoop fs -cat /sample/test.txt

Objective: To copy a file from one directory to another on HDFS.
Act:
hadoop fs -cp /sample/test.txt /sample1

Objective: To remove a directory from HDFS.
Act:
hadoop fs -rmr /sample1

5.10.6 Special Features of HDFS

1. Data replication: There is absolutely no need for a client application to track all the blocks. HDFS directs the client to the nearest replica to ensure high performance.
2. Data pipeline: A client application writes a block to the first DataNode in the pipeline. Then this DataNode takes over and forwards the data to the next node in the pipeline. This process continues for all the data blocks, and subsequently all the replicas are written to disk.

Reference: Wrox Certified Big Data Developer.

5.11 PROCESSING DATA WITH HADOOP

MapReduce programming is a software framework that helps you process massive amounts of data in parallel. In MapReduce programming, the input dataset is split into independent chunks. Map tasks process these independent chunks completely in parallel. The output produced by the map tasks serves as intermediate data and is stored on the local disk of that server. The outputs of the mappers are automatically shuffled and sorted by the framework, which sorts the output based on keys. This sorted output becomes the input to the reduce tasks. A reduce task produces the reduced output by combining the outputs of the various mappers. Job inputs and outputs are stored in the file system.
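The split, map, shuffle/sort, reduce flow just described can be imitated in a few lines of plain Python. This is a single-process conceptual model only (the function names are mine, and real Hadoop distributes each phase across nodes), but the data movement is the same:

```python
from collections import defaultdict
from itertools import chain

def map_phase(chunk):
    # Emit a (word, 1) pair for every word in this chunk, like a Hadoop Mapper.
    return [(word, 1) for word in chunk.split()]

def shuffle_sort(pairs):
    # Group intermediate pairs by key, like the framework's shuffle-and-sort step.
    grouped = defaultdict(list)
    for key, value in sorted(pairs):
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Combine the values for each key, like a Hadoop Reducer.
    return {key: sum(values) for key, values in grouped.items()}

chunks = ["big data big", "data big analytics"]          # the "input splits"
intermediate = chain.from_iterable(map_phase(c) for c in chunks)
result = reduce_phase(shuffle_sort(intermediate))
print(result)  # {'analytics': 1, 'big': 3, 'data': 2}
```

Each chunk could be mapped on a different node; only the grouped intermediate pairs would then travel over the network to the reducers, which is what makes the model scale.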
The MapReduce framework also takes care of other tasks, such as scheduling, monitoring, and re-executing failed tasks. The Hadoop Distributed File System and the MapReduce framework run on the same set of nodes. This configuration allows effective scheduling of tasks on the nodes where the data is present (data locality), which in turn results in very high throughput.

There are two daemons associated with MapReduce programming: a single master JobTracker per cluster and one slave TaskTracker per cluster node. The JobTracker is responsible for scheduling tasks to the TaskTrackers, monitoring the tasks, and re-executing a task in case a TaskTracker fails. The TaskTrackers execute the tasks. Refer Figure 5.21.

The MapReduce functions and input/output locations are implemented via MapReduce applications. These applications use suitable interfaces to construct the job. The application and the job parameters together are known as the job configuration. The Hadoop job client submits the job (jar/executable, etc.) to the JobTracker. Then it is the responsibility of the JobTracker to schedule tasks to the slaves. In addition to scheduling, it also monitors the tasks and provides status information to the job client.

Figure 5.21 MapReduce programming phases and daemons.
Reference: http://hadoop.apache.org/docs/1.0.4/mapred_tutorial.html

5.11.1 MapReduce Daemons

1. JobTracker: It provides connectivity between Hadoop and your application. It is a master daemon responsible for executing the overall MapReduce job; there is a single JobTracker per Hadoop cluster. It creates the execution plan by deciding which task to assign to which node, and it monitors all the running tasks. When a task fails, it automatically reschedules the task to a different node after a predefined number of retries.
2. TaskTracker: This daemon is responsible for executing the individual tasks assigned by the JobTracker. There is a single TaskTracker per slave node, and it spawns multiple Java Virtual Machines (JVMs) to handle multiple map or reduce tasks in parallel. The TaskTracker continuously sends heartbeat messages to the JobTracker. When the JobTracker fails to receive a heartbeat from a TaskTracker, it infers that the TaskTracker has failed and resubmits its tasks to another available node in the cluster. Once the client submits a job to the JobTracker, the JobTracker partitions the job and assigns the various MapReduce tasks to the TaskTrackers in the cluster. Figure 5.22 depicts the JobTracker and TaskTracker interaction.

Reference: Hadoop in Action, Chuck Lam.

Figure 5.22 JobTracker and TaskTracker interaction.

5.11.2 How Does MapReduce Work?

MapReduce divides a data analysis task into two parts: map and reduce. Figure 5.23 depicts how MapReduce programming works. In this example, there are two mappers and one reducer. Each mapper works on the partial dataset that is stored on its node, and the reducer combines the outputs from the mappers to produce the reduced result set.

Reference: Wrox Big Data Certification Material.

Figure 5.23 MapReduce programming workflow.

Figure 5.24 describes the working model of MapReduce programming. The following steps describe how MapReduce performs its task:

1. First, the input dataset is split into multiple pieces of data (several small subsets).
2. Next, the framework creates a master process and several worker processes and executes the worker processes remotely.
3. Several map tasks work simultaneously, each reading the piece of data that was assigned to it. Each map worker uses the map function to extract the relevant data present on its server and generates key/value pairs for the extracted data.
4. Each map worker uses a partitioner function to divide its output into regions. The partitioner decides which reducer should get the output of the specified mapper.
5. When the map workers complete their work, the master instructs the reduce workers to begin. The reduce workers contact the map workers to get the key/value data for their partition. The data thus received is shuffled and sorted by key.
6. Each reduce worker then calls the reduce function for every unique key and writes the output to a file.
7. When all the reduce workers complete their work, the master transfers control back to the user program.

Figure 5.24 MapReduce programming architecture.

5.11.3 MapReduce Example

The famous example for MapReduce programming is Word Count. For example, consider that you need to count the occurrences of each word across 50 files. You can achieve this using MapReduce programming. Refer Figure 5.25.

Figure 5.25 Word count example.

Word Count MapReduce Programming using Java

A MapReduce program requires three classes:

1. Driver class: This class specifies the job configuration details.
2. Mapper class: This class overrides the map function based on the problem statement.
3. Reducer class: This class overrides the reduce function based on the problem statement.

WordCounter.java: Driver Program

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCounter {
    public static void main(String[] args) throws IOException,
            InterruptedException, ClassNotFoundException {
        Job job = new Job();
        job.setJobName("wordcounter");
        job.setJarByClass(WordCounter.class);
        job.setMapperClass(WordCounterMap.class);
        job.setReducerClass(WordCounterRed.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/sample/word.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/sample/wordcount"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

WordCounterMap.java: Mapper Class

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCounterMap extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The sample input is comma-separated; emit (word, 1) for each word.
        String[] words = value.toString().split(",");
        for (String word : words) {
            context.write(new Text(word), new IntWritable(1));
        }
    }
}

WordCounterRed.java: Reducer Class

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCounterRed extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the 1s emitted by the mappers for this word.
        int count = 0;
        for (IntWritable val : values) {
            count += val.get();
        }
        context.write(word, new IntWritable(count));
    }
}

Table 5.2 describes the differences between SQL and MapReduce.

Table 5.2 SQL versus MapReduce

Access: SQL supports interactive and batch access; MapReduce is batch only.
Structure: SQL requires a static schema; MapReduce handles dynamic structure.
Updates: In SQL, data is read and written many times; in MapReduce, data is written once and read many times.
Integrity: High in SQL; low in MapReduce.
Scaling: SQL scales nonlinearly; MapReduce scales linearly.

5.12 MANAGING RESOURCES AND APPLICATIONS WITH HADOOP YARN (YET ANOTHER RESOURCE NEGOTIATOR)

5.12.1 Limitations of Hadoop 1.0 Architecture

1. In Hadoop 1.0, HDFS and MapReduce are the core components, while the other components are built around them. A single NameNode is responsible for managing the entire namespace of the Hadoop cluster.
2. It has a restricted processing model, suitable only for batch-oriented MapReduce jobs.
3. Hadoop MapReduce is not suitable for interactive analysis.
4. Hadoop 1.0 is not suitable for machine learning algorithms, graphs, and other memory-intensive algorithms.
5. MapReduce is responsible for both cluster resource management and data processing. In this architecture, the map slots might be "full" while the reduce slots are empty, and vice versa. This causes resource utilization issues and needs to be improved for proper resource utilization.

5.12.2 HDFS Limitation

The NameNode holds all file metadata in main memory. Although main memory today is not as small and expensive as it was two decades ago, there is still a limit on the number of objects that can be held in the memory of a single NameNode, which can quickly become overwhelmed as the load on the system increases. In Hadoop 2.x, this is resolved with the help of HDFS Federation.

5.12.3 Hadoop 2: HDFS

HDFS 2 consists of two major components: (a) a namespace service and (b) a blocks storage service. The namespace service takes care of file-related operations such as creating and modifying files and directories. The blocks storage service handles DataNode cluster management and replication.

HDFS 2 features:

1. Horizontal scalability.
2. High availability.

HDFS Federation uses multiple independent NameNodes for horizontal scalability. The NameNodes are independent of each other, meaning they need no coordination among themselves. The DataNodes are common storage for blocks and are shared by all the NameNodes: every DataNode in the cluster registers with each NameNode in the cluster.

High availability of the NameNode is obtained with the help of a passive standby NameNode. In Hadoop 2, the active-passive NameNode pair handles failover automatically. All namespace edits are recorded to shared NFS storage, and there is a single writer at any point of time. The passive NameNode reads the edits from the shared storage and keeps its metadata updated. In case of an active NameNode failure, the passive NameNode automatically becomes the active NameNode and starts writing to the shared storage. Figure 5.26 describes the active-passive NameNode interaction.

Reference: https://www.edureka.co/blog/introduction-to-hadoop-2-0-and-advantages-of-hadoop-2-0/

Figure 5.26 Active and passive NameNode interaction.

Figure 5.27 depicts the Hadoop 1.0 and Hadoop 2.0 architectures.

Figure 5.27 Hadoop 1.x versus Hadoop 2.x.

5.12.4 Hadoop 2 YARN: Taking Hadoop beyond Batch

YARN helps us to store all data in one place and interact with it in multiple ways, with predictable performance and quality of service. It was originally architected by Yahoo. Refer Figure 5.28.

Figure 5.28 Hadoop YARN (cluster resource management).

5.12.4.1 Fundamental Idea

The fundamental idea behind this architecture is to split the JobTracker's responsibilities of resource management and job scheduling/monitoring into separate daemons.
The daemons that are part of the YARN architecture are described below:

1. A global ResourceManager: Its main responsibility is to distribute resources among the various applications in the system. It has two main components:
   (a) Scheduler: The pluggable scheduler of the ResourceManager decides the allocation of resources to the various running applications. The Scheduler is just that, a pure scheduler, meaning it does NOT monitor or track the status of the applications.
   (b) ApplicationsManager: The ApplicationsManager does the following: accepts job submissions; negotiates the first resource (container) for executing the application-specific ApplicationMaster; and restarts the ApplicationMaster in case of failure.
2. NodeManager: This is a per-machine slave daemon. Its responsibility is launching the application containers for application execution. The NodeManager monitors resource usage such as memory, CPU, disk, and network, and reports this usage to the global ResourceManager.
3. Per-application ApplicationMaster: This is an application-specific entity. Its responsibility is to negotiate the required resources for execution from the ResourceManager. It works along with the NodeManager for executing and monitoring the component tasks.

5.12.4.2 Basic Concepts

Application:
1. An application is a job submitted to the framework.
2. Example: a MapReduce job.

Container:
1. The basic unit of allocation.
2. Enables fine-grained resource allocation across multiple resource types (memory, CPU, disk, network, etc.), for example:
   (a) container_0 = 2 GB, 1 CPU
   (b) container_1 = 1 GB, 6 CPU
3. Replaces the fixed map/reduce slots.

YARN Architecture

Figure 5.29 depicts the YARN architecture. The steps involved are as follows:

1. A client program submits the application, which includes the necessary specifications to launch the application-specific ApplicationMaster itself.
2. The ResourceManager launches the ApplicationMaster by assigning some container.
3. The ApplicationMaster, on boot-up, registers with the ResourceManager. This allows the client program to query the ResourceManager directly for details.
4. During the normal course, the ApplicationMaster negotiates appropriate resource containers via the resource-request protocol.
5. On successful container allocations, the ApplicationMaster launches each container by providing the container launch specification to the NodeManager.
6. The NodeManager executes the application code and provides necessary information such as progress and status to the ApplicationMaster via an application-specific protocol.
7. During the application execution, the client that submitted the job communicates directly with the ApplicationMaster to get status, progress updates, etc., via an application-specific protocol.
8. Once the application has been processed completely, the ApplicationMaster deregisters with the ResourceManager and shuts down, allowing its own container to be repurposed.

Figure 5.29 YARN architecture.
Reference: http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/

5.13 INTERACTING WITH HADOOP ECOSYSTEM

The Hadoop ecosystem was introduced in Chapter 4. Here we will look at it in more detail.

5.13.1 Pig

Pig is a dataflow system for Hadoop. It uses Pig Latin to specify the data flow. Pig is an alternative to MapReduce programming: it abstracts some details and allows you to focus on data processing. It consists of two components:

1. Pig Latin: the data processing language.
2. Compiler: translates Pig Latin to MapReduce programs.

Figure 5.30 depicts Pig in the Hadoop ecosystem.

5.13.2 Hive

Hive is a data warehousing layer on top of Hadoop. Analysis and queries can be done using an SQL-like language. Hive can be used for ad hoc queries, summarization, and data analysis. Figure 5.31 depicts Hive in the Hadoop ecosystem.

5.13.3 Sqoop

Sqoop is a tool which helps to transfer data between Hadoop and relational databases. With the help of Sqoop, you can import data from an RDBMS to HDFS and vice versa. Figure 5.32 depicts Sqoop in the Hadoop ecosystem.

5.13.4 HBase

HBase is a column-oriented NoSQL database for Hadoop, used to store billions of rows and millions of columns. HBase provides random read/write operations. It also supports record-level updates, which is not possible using HDFS directly. HBase sits on top of HDFS. Figure 5.33 depicts HBase in the Hadoop ecosystem.

Figure 5.30 Pig in the Hadoop ecosystem.
Figure 5.31 Hive in the Hadoop ecosystem.
Figure 5.32 Sqoop in the Hadoop ecosystem.
Figure 5.33 HBase in the Hadoop ecosystem.

REMIND ME

• The key consideration (the rationale behind the huge popularity of Hadoop) is its capability to handle massive amounts of data, of different categories, fairly quickly.
• Hadoop was created by Doug Cutting, the creator of Apache Lucene (a commonly used text search library). Hadoop grew out of the Apache Nutch project (an open-source web search engine), itself a part of the Lucene project.
• Hadoop is an open-source software framework. It stores and processes huge volumes of data in a distributed fashion on large clusters of commodity hardware. Basically, Hadoop accomplishes two tasks: massive data storage and faster data processing.
• The core components of Hadoop are HDFS and MapReduce.
• Apache Hadoop YARN is a sub-project of Hadoop 2.x. Hadoop 2.x has a YARN-based architecture, which provides a general processing platform that is not constrained to MapReduce only.

POINT ME (BOOKS)

• Hadoop, The Definitive Guide, 3rd Edition, O'Reilly Publication.
• Hadoop in Practice, Alex Holmes.
• Hadoop in Action, Chuck Lam.

CONNECT ME (INTERNET RESOURCES)

• http://hadoop.apache.org/docs/1.0.4/hdfs_design.html
• http://hadoop.apache.org/docs/1.0.4/mapred_tutorial.html
• http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/
• https://www.edureka.co/blog/introduction-to-hadoop-2-0-and-advantages-of-hadoop-2-0/
- …/blog/introduction-to-hadoop-2-0-and-advantages-of-hadoop-2-0

TEST ME

Fill Me
1. Hadoop is based on a ___ architecture.
2. RDBMS is the choice when ___ is the main concern.
3. Hadoop supports ___, ___, and ___ data formats.
4. RDBMS supports ___ data formats.
5. In Hadoop, data is processed in ___.
6. HDFS can be deployed on ___.
7. NameNode uses ___ to store the file system namespace.
8. NameNode uses ___ to record every transaction.
9. Secondary NameNode is a ___ daemon.
10. DataNode is responsible for ___ file operations.
11. Hadoop 2.x is based on ___ architecture.
12. YARN is responsible for ___.
13. Global ResourceManager distributes ___.
14. NodeManager is responsible for launching Application ___.
15. Application is a ___ submitted to the framework.
16. ___ is an open-source framework managed by the Apache Software Foundation.
17. The emphasis of HDFS is on ___ throughput of data access rather than ___ latency of data access.
18. An HDFS cluster consists of a single ___ and a number of ___.
19. Complete the series: Bits -> Bytes -> Kilobytes -> Megabytes -> Gigabytes -> ___ -> ___ -> ___ -> ___ -> Yottabytes.
20. HDFS has a ___ architecture.
21. HDFS is built using the ___ language.
22. The ___ maintains the file system namespace.
23. The number of copies of a file is called the ___ of that file.
24. The NameNode periodically receives a ___ and a ___ from each of the DataNodes in the cluster.
25. Receipt of a Heartbeat implies that the ___ is functioning properly.
26. ___ contains a list of all blocks on a DataNode.
27. The blocks of a file are replicated for ___ tolerance.
28. When the NameNode starts up, it reads the ___ and ___ from disk.
29. A typical block size used by HDFS is ___.
30. ___ are responsible for serving read and write requests from the file system's clients.
31. ___ perform block creation, deletion, and replication upon instruction from the ___.
32. ___ was the first to publicize MapReduce — a system they had used to scale their data processing needs.
33. ___ developed an open-source version of the MapReduce system called ___.
34. Hadoop is an open-source framework for writing and running ___ applications that process large amounts of data.
35. The key distinctions of Hadoop are that it is ___.
36. Hadoop runs on large clusters of ___.
37. Hadoop scales ___ to handle larger data by adding more ___ to the cluster.
38. Hadoop focuses on moving ___ to ___.
39. The move-code-to-data philosophy makes sense for ___-intensive processing.
40. Hadoop is designed to run on a cluster of ___ machines.
41. Hadoop uses ___ as its basic data unit, which is flexible enough to work with less-structured data types.
42. Hadoop is best used as a ___ once and ___ many times type of data store.
43. Under SQL we have ___ statements; under MapReduce we have ___ and ___.
44. Under the MapReduce model, data processing primitives are called ___ and ___.
45. The Mapper is meant to ___ and ___ the input into something that the Reducer can ___ over.
46. ___ and ___ are common design patterns that go along with mapping and reducing.
47. ___ is the de facto development and production platform for Hadoop.
48. ___ started out as a sub-project of ___, which in turn was a sub-project of ___.
49. ___ is a single point of failure of a Hadoop cluster.
50. ___ is the bookkeeper of HDFS.
51. ___ keeps track of how your files are broken down into file blocks, which nodes store those blocks, and the overall health of the distributed file system.
52. ___ communicates with the NameNode to take snapshots of the HDFS metadata at intervals defined by the user configuration.
53. There is only one ___ per Hadoop cluster.
54. There is a single ___ per slave node.

Answers:
1. Node
2. Consistency
3. Structured, semi-structured, and unstructured
4. Structured
5. Parallel
6. Low-cost hardware
7. FsImage
8. EditLog
9. Helper or housekeeping
10. Read/Write
11. YARN
12. Cluster management
13. Resources
14. Containers
15. Job
16. Hadoop
17. High, low
18. NameNode, DataNodes
19. Terabytes, Petabytes, Exabytes, Zettabytes
20. Master/slave
21. Java
22. NameNode
23. Replication factor
24. Heartbeat, Blockreport
25. DataNode
26. Blockreport
27. Fault
28. FsImage, EditLog
29. 64 MB
30. DataNodes
31. DataNodes, NameNode
32. Google
33. Doug Cutting, Hadoop
34. Distributed
35. Accessible, robust, and scalable
36. Commodity machines
37. Linearly, nodes
38. Code, data
39. Data
40. Commodity
41. Key/value pair
42. Write, read
43. Query; scripts and codes
44. Mappers, Reducers
45. Filter and transform; aggregate
46. Partitioning and shuffling
47. Linux
48. Hadoop, Nutch, Apache Lucene
49. NameNode
50. NameNode
51. NameNode
52. Secondary NameNode
53. JobTracker
54. TaskTracker

Match Me
1.
Column A: HDFS; MapReduce Programming; Master node; Slave node; Hadoop Implementation
Column B (jumbled): NameNode; Processing Data; Storage; Google File System and MapReduce; DataNode
Answer:
HDFS — Storage
MapReduce Programming — Processing Data
Master node — NameNode
Slave node — DataNode
Hadoop Implementation — Google File System and MapReduce

2.
Column A: JobTracker; MapReduce; TaskTracker; Job Configuration; Map
Column B (jumbled): Executes Task; Schedules Task; Programming Model; Converts input into Key/Value pair; Job Parameters
Answer:
JobTracker — Schedules Task
MapReduce — Programming Model
TaskTracker — Executes Task
Job Configuration — Job Parameters
Map — Converts input into Key/Value pair

3.
Column A: NameNode; JobTracker; DataNode; TaskTracker
Column B (jumbled): Handles processing on master; Handles storage on slave; Handles storage on master; Handles processing on slave
Answer:
NameNode — Handles storage on master
JobTracker — Handles processing on master
DataNode — Handles storage on slave
TaskTracker — Handles processing on slave

True or False
1. For using Hadoop to process your data, the data has to be moved/ingested into HDFS.
2. Sqoop is used to query HDFS data.
3. Oozie is used to import/export data from RDBMS.
4. hadoop fs -ls / will show the contents of the HDFS root directory.
5. The master node in Hadoop can be low on disk space but needs to have a good amount of RAM.
6. In production, the NameNode preferably runs on Red Hat OS.
7. Hadoop configurations are stored in CSV format.
Answers:
1. True
2. False
3. False
4. True
5. True
6. True
7. False

Pick the Right Choice
1. Which of the two are components of Hadoop?
(a) HDFS (b) Shuffler (c) MapReduce (d) Sqoop (e) Secondary NameNode
2.
How many blocks will be created for a file that is 300 MB? The default block size is 64 MB and the replication factor is 3.
(a) 30 (b) 15 (c) 5 (d) 100
3. Pig is a:
(a) Data flow language (b) Import/export tool (c) Scheduling engine (d) Shuffler
4. What does JobTracker do?
(a) Stores blocks of data (b) Stores metadata (c) Coordinates and schedules the job (d) Acts as a mini reducer
5. Which ecosystem project is ideal for use when we have multiple MapReduce and Pig programs to run in a sequence?
(a) Oozie (b) Hive (c) Pig (d) Sqoop
6. Which file is used for updating MapReduce settings?
(a) core-site.xml (b) hdfs-site.xml (c) mapred-site.xml (d) hadoop-env.sh
Answers:
1. (a) and (c)
2. (c)
3. (a)
4. (c)
5. (a)
6. (c)

Crossword Puzzle on Big Data and Hadoop
Complete the crossword below.
(crossword grid)
Across
2. One ___ Gigabytes are there in one Exabyte.
5. ___ is Splunk's new product to search, access, and report on Hadoop datasets.
6. Hadoop was named after a ___.
7. ___ open-source software was developed from Google's MapReduce concept.
8. The MapReduce programming model widely used in analytics was developed at ___.
Down
1. ___ created the popular Hadoop software framework for storage and processing of large datasets.
3. ___, a traditional IT company, is the biggest Big Data vendor in the world.
4. According to a study by IBM, approximately ___ amount of data existed in the digital universe in 2012.
Answer:
Across: 2. Billion; 5. Hunk; 6. Toy Elephant; 7. Hadoop; 8. Google.
Down: 1. Doug Cutting; 3. IBM; 4. 2.7 Zettabytes.

CHALLENGE ME
These are questions on topics that are not covered in the chapter. We will need you to research on your own.

1. What are the four modules that make up the Apache Hadoop framework?
Answer:
- Hadoop Common, which contains the common utilities and libraries necessary for Hadoop's other modules.
- Hadoop YARN, the framework's platform for resource management.
- Hadoop Distributed File System, or HDFS, which stores information on commodity machines.
- Hadoop MapReduce, a programming model used to process large sets of data.

2. Which modes can Hadoop be run in? List a few features of each mode.
Answer:
- Standalone, or local mode, which is one of the least commonly used environments. When it is used, it is usually only for running MapReduce programs. Standalone mode lacks a distributed file system, so it uses the local file system instead.
- Pseudo-distributed mode, which runs all daemons on a single machine. It is most commonly used in QA and development environments.
- Fully distributed mode, which is most commonly used in production environments. Unlike pseudo-distributed mode, fully distributed mode runs all daemons on a cluster of machines rather than a single one.

3. Where are Hadoop's configuration files located?
Answer: Hadoop's configuration files can be found inside the conf sub-directory.

4. List Hadoop's three configuration files.
Answer:
- hdfs-site.xml
- mapred-site.xml
- core-site.xml

5. How many NameNodes can run on a single Hadoop cluster?
Answer: Only one NameNode process can run on a single Hadoop cluster. The file system will go offline if this NameNode goes down.

6. What is a DataNode?
Answer: Unlike the NameNode, a DataNode actually stores data within the Hadoop distributed file system. DataNodes run on their own Java virtual machine process.

7. How many DataNodes can run on a single Hadoop cluster?
Answer: Hadoop slave nodes contain only one DataNode process each.

8. What is JobTracker in Hadoop?
Answer: JobTracker is used to submit and track jobs in MapReduce.

9. How many JobTracker processes can run on a single Hadoop cluster?
Answer: There can only be one JobTracker process running on a single Hadoop cluster. JobTracker processes run on their own Java virtual machine process. If JobTracker goes down, all currently active jobs stop.

10. What is the difference between replication and sharding?
Answer: Replication essentially takes the same data and copies it over several machines/nodes (the number of copies it makes depends on the defined replication factor). Sharding takes different data and places it on different machines. Sharding is particularly valuable for performance, as it can help with read and write operations. Replication is for fault tolerance.

11. What is polyglot persistence?
Answer: The official definition of polyglot is "a person who has the ability to speak, read, and write several languages". Now consider an organization that has grown over 35 years. It has a lot of applications which write to a number of data sources (RDBMSs, flat files, csv files, etc.). The organization also has several data marts and content management servers. This is a typical polyglot situation, as an analytics application may require the data to be read from all of these different types of data sources. You have a few questions which you need answered:
- Who are the customers who have purchased a product X in the last 12 months?
- Do you have comments left by these customers on social network sites?
- Are there repeat customers on the company's website?
- Have they recommended your product to their friends, colleagues, and relatives?
- Did they go to check the product elsewhere?
This calls for data to be collected from varied, disparate data sources (relational and non-relational) and analyzed. The above is a typical case of polyglot persistence.

12. What is BigTable?
Answer: It is a compressed, proprietary data storage system built on the Google File System. It is not distributed outside of Google, although it underlies the Google Datastore.

6 Introduction to MongoDB

BRIEF CONTENTS
- What's in Store
- What is MongoDB?
- Why MongoDB?
- Using JSON
- Creating or Generating a Unique Key
- Support for Dynamic Queries
- Storing Binary Data
- Replication
- Sharding
- Updating Information In-Place
- Terms Used in RDBMS and MongoDB
- Data Types in MongoDB
- MongoDB Query Language: CRUD (Create, Read, Update, and Delete)
- Insert(), Update(), Save(), Remove(), Find()
- Null Values
- Count, Limit, Sort, and Skip
- Arrays
- Aggregate Function
- MapReduce Function
- JavaScript Programming
- Cursors in MongoDB
- Indexes
- Mongoimport
- Mongoexport
- Automatic Generation of Unique Numbers for the "_id" Field

"You can have data without information, but you cannot have information without data."
— Daniel Keys Moran, computer programmer and science fiction author

WHAT'S IN STORE?
The relational database model has prevailed for decades. Of late, a new kind of database is gaining ground in the enterprise, called NoSQL (Not Only SQL). The focus of this chapter will be on exploring a NoSQL database called "MongoDB". We bring to you the features of MongoDB, such as auto-sharding and replication.

Hands-on assignment:
Step 1: Insert the given book documents into the "books" collection. One of them, for example, is:
  { _id: 5, Category: "Web Mining", BookName: "Learning R", Author: "Richard Cotton", price: 850, pages: 120 }
Step 2: Confirm the presence of the above documents in the "books" collection.
Step 3: Write map and reduce functions to split the books into the following two categories:
(a) Big books
(b) Small books
Books which have more than 300 pages should be in the big book category. Books which have less than 300 pages should be in the small book category.
Step 4: Count the number of books in each category.
Step 5: Store the output as follows, as documents in a new collection called "BookResult".
Book Category / Count of the Books:
(a) Big books: 2
(b) Small books: 3

Objective: To practice import, export, and aggregation in MongoDB.
Step 1: Pick any public dataset from the site www.kdnuggets.com. Convert it into CSV format. Make sure that you have at least two numeric columns.
Step 2: Use mongoimport to import data from the CSV format file into the MongoDB collection "MongoDBHandsOn" in the test database.
Step 3: Identify a grouping column.
Step 4: Compute the sum of the values in the first numeric column.
Step 5: Compute the average of the values in the second numeric column.

Objective: To copy the JSON documents from one MongoDB collection to another MongoDB collection.

ASSIGNMENT 4
Objective: Write the insert method to store the following document in MongoDB.
{
  Name: "Stephen More",
  Address: {
    City: "Bangalore",
    Street: "Electronics City",
    Affiliation: "XYZ Ltd."
  },
  Hobbies: ["Chess", "Lawn Tennis", "Baseball"]
}

7 Introduction to Cassandra

BRIEF CONTENTS
- What's in Store
- Apache Cassandra: An Introduction
- Features of Cassandra
  - Peer-to-Peer Network
  - Gossip and Failure Detection
  - Partitioner
  - Replication Factor
  - Anti-Entropy and Read Repair
  - Writes in Cassandra
  - Hinted Handoffs
  - Tunable Consistency: Read Consistency and Write Consistency
- Keyspaces
- CRUD Operations
- Collections
  - Set Collection
  - List Collection
  - Map Collection
- Using a Counter
- Time To Live (TTL)
- Alter Commands
  - Alter Table to Change the Data Type of a Column
  - Alter Table to Delete a Column
- Drop Table
- Drop a Database
- Import and Export
  - Export to CSV
  - Import from CSV
  - Import from STDIN
  - Export to STDOUT
- Querying System Tables
- Practice Examples

"Data is a precious thing and will last longer than the systems themselves."
— Tim Berners-Lee, inventor of the World Wide Web

WHAT'S IN STORE?
This chapter will cover another NoSQL database called "Cassandra". We will explore the features of Cassandra that have made it so immensely popular. The chapter will cover the basic CRUD (Create, Read, Update, and Delete) operations using cqlsh. Please attempt the Test Me exercises given at the end of the chapter to practice, learn, and comprehend Cassandra effectively.

7.1 APACHE CASSANDRA – AN INTRODUCTION
We shall start this chapter with a few points that a reader should know about Cassandra:
1. Apache Cassandra was born at Facebook. After Facebook open-sourced the code in 2008, Cassandra became an Apache Incubator project in 2009 and subsequently became a top-level Apache project in 2010.
2. It is built on Amazon's Dynamo and Google's BigTable.
3. Cassandra does NOT compromise on availability. Since it does not have a master-slave architecture, there is no question of a single point of failure.
This proves beneficial for business-critical applications that need to be up and running always and cannot afford to go down ever.
4. It is a highly scalable (i.e., it scales out), high-performance distributed database. It distributes and manages gigantic amounts of data across commodity servers.
5. It is a column-oriented database designed to support peer-to-peer symmetric nodes instead of a master-slave architecture.
6. It adheres to the Availability and Partition Tolerance properties of the CAP theorem. It takes care of consistency using the BASE (Basically Available, Soft State, Eventual Consistency) approach. Refer Figure 7.1.
Figure 7.1 Features of Cassandra.
A few companies that have successfully deployed Cassandra and benefited immensely from it are as follows:
1. Twitter
2. Netflix
3. Cisco
4. Adobe
5. eBay
6. Rackspace

7.2 FEATURES OF CASSANDRA

7.2.1 Peer-to-Peer Network
As with any other NoSQL database, Cassandra is designed to distribute and manage large data loads across multiple nodes in a cluster constituted of commodity hardware. Cassandra does NOT have a master-slave architecture, which means that it does NOT have a single point of failure. A node in Cassandra is structurally identical to any other node. Refer Figure 7.2. In case a node fails or is taken offline, it definitely impacts the throughput. However, it is a case of graceful degradation where everything does not come crashing down at any given instant owing to a node failure. One can still go about business as usual. Cassandra averts the problems of failure by employing a peer-to-peer distributed system across homogeneous nodes. It ensures that data is distributed across all nodes in the cluster. Each node exchanges information across the cluster every second.
Let us look at how a Cassandra node writes. Each write is written to the commit log sequentially. A write is taken to be successful only if it is written to the commit log. Data is then indexed and pushed to an in-memory structure called the "Memtable". When the Memtable is full, the contents are flushed to an "SSTable" (Sorted String Table) data file on the disk. The SSTable is immutable and append-only. It is stored on disk sequentially and is maintained for each Cassandra table. The partitioning and replication of all writes are performed automatically across the cluster.
Figure 7.2 Sample Cassandra cluster.
Figure 7.3 Gossip protocol.

7.2.2 Gossip and Failure Detection
Gossip protocol is used for intra-ring communication. It is a peer-to-peer communication protocol which eases the discovery and sharing of location and state information with other nodes in the cluster. Refer Figure 7.3. Although there are quite a few subtleties involved, at its core it is a simple and robust system: a node only has to send out the communication to a subset of other nodes. For repairing unread data, Cassandra uses what is called an anti-entropy version of the gossip protocol.

7.2.3 Partitioner
A partitioner takes the call on how to distribute data on the various nodes in a cluster. It also determines the node on which to place the very first copy of the data. Basically, a partitioner is a hash function to compute the token of the partition key. The partition key helps to identify a row uniquely.

7.2.4 Replication Factor
The replication factor determines the number of copies of data (replicas) that will be stored across nodes in a cluster. If one wishes to store only one copy of each row on one node, the replication factor should be set to one. However, if the need is for two copies of each row of data on two different nodes, one should go with a replication factor of two. The replication factor should ideally be more than one and not more than the number of nodes in the cluster. A replication strategy is employed to determine which nodes to place the data on. Two replication strategies are available:
1. SimpleStrategy
2. NetworkTopologyStrategy
The preferred one is NetworkTopologyStrategy, as it is simple and supports easy expansion to multiple data centers, should there be a need.

7.2.5 Anti-Entropy and Read Repair
A cluster is made up of several nodes. Since the cluster is constituted of commodity hardware, it is prone to failure. In order to achieve fault tolerance, a given piece of data is replicated on one or more nodes. A client can connect to any node in the cluster to read data. How many nodes will be read before responding to the client is based on the consistency level specified by the client. If the client-specified consistency is not met, the read operation blocks. There is a possibility that a few of the nodes may reply with an out-of-date value. In such a case, Cassandra will initiate a read repair operation to bring the replicas with stale values up to date. For repairing unread data, Cassandra uses an anti-entropy version of the gossip protocol. Anti-entropy implies comparing all the replicas of each piece of data and updating each replica to the newest version. The read repair operation is performed either before or after returning the value to the client, as per the specified consistency level.

7.2.6 Writes in Cassandra
Let us look at the behind-the-scenes activity when a client initiates a write request. Where does this write get written to? It is first written to the commit log. A write is taken to be a success only if it is written to the commit log. The next step is to push the write to a memory-resident data structure called the Memtable. A threshold value is defined for the Memtable. When the number of objects stored in the Memtable reaches this threshold, the contents of the Memtable are flushed to the disk in a file called SSTable (Sorted String Table). Flushing is a non-blocking operation. It is possible to have multiple Memtables for a single column family: one out of the set is current and the rest are waiting to be flushed.

7.2.7 Hinted Handoffs
The first question that arises is: Why is Cassandra hailed for availability? It works on the philosophy that it will always be available for writes. Assume that we have a cluster of three nodes: Node A, Node B, and Node C.
Node C is down for some reason. Refer Figure 7.6. We are maintaining a replication factor of 2, which implies that two copies of each row will be placed on two different nodes. The client makes a write request to Node A. Node A is the coordinator and serves as a proxy between the client and the nodes on which the replicas are to be placed. The client writes Row K to Node A. Node A then writes Row K to Node B and stores a hint for Node C. The hint will have the following information:
1. Location of the node on which the replica is to be placed.
2. Version metadata.
3. The actual data.
When Node C recovers and is back to its functional self, Node A reacts to the hint by forwarding the data to Node C.
Figure 7.6 Depiction of hinted handoffs.

7.2.8 Tunable Consistency
One of the features of Cassandra that has made it immensely popular is its ability to utilize tunable consistency. Database systems can go for either strong consistency or eventual consistency; Cassandra can cash in on either flavor of consistency depending on the requirements. In a distributed system, we work with several servers: a few of these servers are in one data center and others in other data centers. Let us take a look at what is meant by strong consistency and eventual consistency.
1. Strong consistency: If we work with strong consistency, it implies that each update propagates to all locations where that piece of data resides. Let us assume a single data center setup. Strong consistency will state that all of the servers that should have a copy of the data will have it before the client is acknowledged with a success. If we are wondering whether it will impact performance: yes, it will. It will cost a few extra milliseconds to write to all replicas.
2. Eventual consistency: If we work with eventual consistency, it implies that the client is acknowledged with a success as soon as a part of the cluster acknowledges the write. When should one go for eventual consistency? The choice is fairly obvious: when application performance matters the most. Example: A single server acknowledges the write and then begins propagating the data to the other servers.

7.2.8.1 Read Consistency
Let us understand what the read consistency level means: how many replicas must respond before sending out the result to the client application. There are several read consistency levels, as mentioned in Table 7.1.

7.2.8.2 Write Consistency
Let us understand what the write consistency level means: on how many replicas the write must succeed before sending out an acknowledgement to the client application. There are several write consistency levels, as mentioned in Table 7.2.

Table 7.1 Read consistency levels in Cassandra
ONE: Returns a response from the closest node (replica) holding the data.
QUORUM: Returns a result from a quorum of servers with the most recent timestamp for the data.
LOCAL_QUORUM: Returns a result from a quorum of servers with the most recent timestamp for the data in the same data center as the coordinator node.
EACH_QUORUM: Returns a result from a quorum of servers with the most recent timestamp in all data centers.
ALL: This provides the highest level of consistency of all levels and the lowest level of availability of all levels. It responds to a read request only after all replica nodes have responded.

Table 7.2 Write consistency levels in Cassandra
ALL: This is the highest level of consistency of all levels, as it necessitates that a write must be written to the commit log and Memtable on all replica nodes in the cluster.
EACH_QUORUM: A write must be written to the commit log and Memtable on a quorum of replica nodes in all data centers.
QUORUM: A write must be written to the commit log and Memtable on a quorum of replica nodes.
LOCAL_QUORUM: A write must be written to the commit log and Memtable on a quorum of replica nodes in the same data center as the coordinator node. This avoids the latency of inter-data-center communication.
ONE: A write must be written to the commit log and Memtable of at least one replica node.
TWO: A write must be written to the commit log and Memtable of at least two replica nodes.
THREE: A write must be written to the commit log and Memtable of at least three replica nodes.
LOCAL_ONE: A write must be sent to, and successfully acknowledged by, at least one replica node in the local data center.

7.3 CQL DATA TYPES
Table 7.3 lists the built-in data types for columns in CQL.

Table 7.3 Built-in data types in Cassandra
int: 32-bit signed integer
bigint: 64-bit signed long
double: 64-bit IEEE-754 floating point
float: 32-bit IEEE-754 floating point
boolean: true or false
blob: Arbitrary bytes, expressed in hexadecimal
counter: Distributed counter value
decimal: Variable-precision decimal
list: A collection of one or more ordered elements
map: A JSON-style array of elements
set: A collection of one or more elements
timestamp: Date plus time
varchar: UTF-8 encoded string
varint: Arbitrary-precision integer
text: UTF-8 encoded string

7.4 CQLSH
7.4.1 Logging into cqlsh
The below screenshot depicts the cqlsh command prompt after logging in using cqlsh.
(screenshot of the cqlsh prompt)
The upcoming sections have been designed as follows:
Objective: What is it that we are trying to achieve here?
Input (optional): What is the input that has been given to us to act upon?
Act: The actual statement/command to accomplish the task at hand.
Outcome: The result/output as a consequence of executing the statement.

Objective: To get help with CQL.
Act:
  HELP
Outcome: (screenshot of cqlsh output)

7.5 KEYSPACES
What is a keyspace? A keyspace is a container to hold application data. It is comparable to a relational database. It is used to group column families together. Typically, a cluster has one keyspace per application. Replication is controlled on a per-keyspace basis; therefore, data that has different replication requirements should reside in different keyspaces.
When one creates a keyspace, it is required to specify a strategy class. There are two choices available.
Either we can specify a "SimpleStrategy" class or a "NetworkTopologyStrategy" class. While using Cassandra for evaluation purposes, use the "SimpleStrategy" class; for production usage, go with the "NetworkTopologyStrategy" class.

Objective: To create a keyspace by the name "Students".
Act:
  CREATE KEYSPACE Students
  WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};
The replication factor stated above in the syntax for creating a keyspace is related to the number of copies of the keyspace data that is housed in the cluster.

Objective: To describe all the existing keyspaces.
Act:
  DESCRIBE KEYSPACES;
Outcome: (screenshot of cqlsh output)

Objective: To get more details on the existing keyspaces, such as keyspace name, durable writes, strategy class, and strategy options.
Act:
  SELECT * FROM system.schema_keyspaces;
Outcome: (screenshot of cqlsh output)
Note: Cassandra converted the Students keyspace to lowercase, as quotation marks were not used.

Objective: To use the keyspace "Students", use the following command: USE keyspace_name. USE connects the client session to the specified keyspace.
Act:
  USE students;

Objective: To create the table "student_info" in the keyspace "students".
Act:
  CREATE TABLE student_info (
    RollNo int PRIMARY KEY,
    StudName text,
    DateofJoining timestamp,
    LastExamPercent double
  );
The table "student_info" gets created in the keyspace "students".
Note: Tables can have either a single or a compound primary key. Always ensure that there is exactly one primary key definition. The primary key, however, can be simple (consisting of a single attribute) or composite (comprising two or more attributes).
Explanation about the composite PRIMARY KEY:
  PRIMARY KEY (column_name1, column_name2, column_name3, ...)
  PRIMARY KEY ((column_name4, column_name5), column_name6, column_name7, ...)
In the above syntax:
- column_name1 is the partition key.
- column_name2 and column_name3 are the clustering columns.
- column_name4 and column_name5 are the partitioning keys.
- column_name6 and column_name7 are the clustering columns.
The partition key is used to distribute the data in the table across the various nodes that constitute the cluster. The clustering columns are used to store data in sorted order on the disk.

Objective: To look up the names of all tables in the current keyspace, or in all the keyspaces if there is no current keyspace.
Act:
  DESCRIBE TABLES;
Outcome: (screenshot of cqlsh output listing student_info)

Objective: To describe the table "student_info", use the below command.
Act:
  DESCRIBE TABLE student_info;
Note: The output is a list of CQL commands with the help of which the table "student_info" can be recreated.
Outcome: (screenshot of the CREATE TABLE statement for student_info)