Bigdata Unit 1

UNIT I

What is Big Data?

Big Data is a collection of data that is huge in volume, yet growing exponentially with time. It is data of such large size and complexity that none of the traditional data management tools can store or process it efficiently. Big data is also data, but of huge size.

What is Data?

The quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.

What is an Example of Big Data?

- Social media: large volumes of new data are ingested into the databases of social media sites every day. This data is mainly generated through photo and video uploads, message exchanges, comments, etc.
- A single jet engine can generate a large amount of data within minutes of flight time. With many thousands of flights per day, the data generated reaches petabytes.

Big Data Analytics

- Big data analytics is the process of examining large and varied data sets to uncover information such as hidden patterns, correlations, market trends and customer preferences that can help organisations make informed business decisions.
- Big data analytics applications enable big data analysts, data scientists, predictive modelers, statisticians and other analytics professionals to analyze growing volumes of structured transaction data, plus other forms of data. For example: internet clickstream data, web server logs, social media content, text from customer emails and survey responses, mobile phone records, and machine data captured by sensors connected to the Internet of Things (IoT).

Data Analytics vs Data Analysis

- Data analytics is the broader, more advanced discipline. It includes data analysis as a sub-component and provides the logical framework on which the analysis is done. There are many analytics tools in the market, mainly Python, Apache Spark, etc.
- Data analysis consists of defining the data, then investigating, cleaning and transforming it to give a meaningful outcome. Tools for analysing the data include Tableau, Excel, etc.

The importance of big data analytics

Driven by specialized analytics systems and software, as well as high-powered computing systems, big data analytics offers various business benefits including:

- New revenue opportunities
- More effective marketing
- Better customer service
- Improved operational efficiency
- Competitive advantages over rivals

Structuring Big Data

Three different data structures

For the analysis of data, it is important to understand that there are three common types of data structures: structured, unstructured and semi-structured.

Structured Data

- Structured data is data that adheres to a predefined data model and is therefore straightforward to analyse. Structured data conforms to a tabular format with relationships between the different rows and columns. Common examples are Excel files and SQL databases. Each of these has structured rows and columns that can be sorted.
- Structured data depends on the existence of a data model: a model of how data can be stored, processed and accessed. Because of the data model, each field is discrete and can be accessed separately or jointly along with data from other fields. This makes structured data extremely powerful: it is possible to quickly aggregate data from various locations in the database.
- Structured data is considered the most traditional form of data storage, since the earliest versions of database management systems (DBMS) were able to store, process and access structured data.

[Example table: employee records with columns Employee ID, Employee Name, Gender, Department, Salary]

Unstructured Data

- Unstructured data is information that either does not have a predefined data model or is not organised in a predefined manner.
- Unstructured information is typically text-heavy, but may contain data such as dates, numbers and facts as well.
- This results in irregularities and ambiguities that make it difficult to understand using traditional programs, as compared to data stored in structured databases.
- Common examples of unstructured data include audio files, video files and NoSQL databases.

Semi-structured Data

- Semi-structured data is a form of structured data that does not conform to the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data.
- Therefore, it is also known as a self-describing structure. JSON and XML are examples of semi-structured data.

[Example: an XML record with tagged fields]
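The difference between structured and semi-structured data described above can be sketched in a few lines of Python. This is only an illustration: the records, field names and the `is_uniform` helper are hypothetical and not part of the original notes.

```python
import json

# Structured: every record has the same fixed fields, like rows in a table.
# (All names and values here are made up for illustration.)
structured_rows = [
    {"employee_id": 1, "name": "Asha", "department": "Finance"},
    {"employee_id": 2, "name": "Ravi", "department": "Admin"},
]

# Semi-structured: self-describing JSON. Keys act as tags, records may nest,
# and a field present in one record may be absent in another.
semi_structured = json.loads("""
[
  {"name": "Asha", "contacts": {"email": "asha@example.com"}},
  {"name": "Ravi", "skills": ["SQL", "Spark"]}
]
""")


def field_names(records):
    """Return the sorted set of top-level keys seen across all records."""
    return sorted({key for record in records for key in record})


def is_uniform(records):
    """True when every record carries exactly the same set of fields."""
    return all(set(record) == set(records[0]) for record in records)


print(field_names(structured_rows), is_uniform(structured_rows))
print(field_names(semi_structured), is_uniform(semi_structured))
```

The structured rows pass the uniformity check, while the JSON records do not, which is why semi-structured data usually needs a parsing step before it fits a tabular store.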
Big data analytics technologies and tools:

- Unstructured and semi-structured data types typically don't fit well in traditional data warehouses that are based on relational databases oriented to structured data sets.
- Further, data warehouses may not be able to handle the processing demands posed by sets of big data that need to be updated frequently or even continually, as in the case of real-time data on stock trading, the online activities of website visitors or the performance of mobile applications.
- As a result, many of the organizations that collect, process and analyze big data turn to NoSQL databases, as well as Hadoop and its companion data analytics tools, including:
  - YARN: a cluster management technology and one of the key features in second-generation Hadoop.
  - MapReduce: a software framework that allows developers to write programs that process large amounts of data in parallel across clusters of processors or stand-alone computers.
  - Spark: an open source, parallel processing framework that enables users to run large-scale data analytics applications across clustered systems.
  - Hive: an open source data warehouse system for querying and analyzing large data sets stored in Hadoop files.
  - Kafka: a distributed publish/subscribe messaging system.
  - Pig: an open source technology that offers a high-level mechanism for the parallel programming of MapReduce jobs executed on Hadoop clusters.

Exploring the use of Big Data in Business Context:

- Almost all organisations collect relevant data (either directly or through an agency).
- This data relates to customer feedback, information about suppliers and retailers, current market trends, etc.
- The continuously increasing cost of collecting this information will be just a waste of resources unless some logical conclusion and business insight can be derived from it. This is where big data analytics comes into the picture.
- This will help organisations to reduce cycle time, fulfil orders quickly and improve forecast accuracy.

AGENDA

- We are going to discuss different areas of big data applications:
  - Use of big data in social networking
  - Use of big data in preventing fraudulent activities
  - Use of big data in detecting fraudulent activities in the insurance sector
  - Use of big data in the retail sector
- In each area we will discuss the following aspects:
  - What is the data involved?
  - How to make optimum use of the data?
  - What are the useful insights from analytics of the data?

Use of big data in social networking

A. What is social network data?

- It refers to data generated from people socializing on social media websites such as Twitter, Facebook, etc.
- On a social media website you will find different people constantly adding and updating comments, statuses, preferences, etc. The following URL shows the social network data generated per second through various social media: www.internetlivestats.com

B. How to make optimum use of social networking big data?

https://youtu.be/JAO_3EvD3DY

Analyzing and mining the large volume of data on social networking sites, such as comments, statuses, posts and likes, shows business trends in general with respect to the "wants" and "preferences" of a wide audience. If this data can be systematically segregated on the basis of age group, location, gender, etc., then organisations can design products and services specific to people's needs. This is called social network analytics.

Cont.

- In fact, the data generated from social networking analytics enables an organisation to calculate the total revenue a customer can influence, instead of only the direct revenue he himself generates. Ex: food bloggers.
- Social networking analytics also has advanced applications such as predicting the online reputation of a brand (ex: TripAdvisor) and increasing profitability by targeting influential customers (ex: Instagram influencers).
C. What are the useful insights from Big Data in social networking?

The following are the areas in which the decision-making processes of organisations are influenced by social networking data:

Business intelligence: a data analysis process that converts a raw dataset into meaningful information that can add value to decision making. Social networking data and its appropriate analysis have proven to be a good aid in providing business intelligence. This can be understood from the following examples from different business sectors.

Marketing: Today the preferences of consumers have changed due to their busy schedules, so marketers aim to deliver what consumers desire by using interactive communication channels such as email, mobile, web, etc.

Example: Walmart acquired the social media analytics company Kosmix and established a branch, Walmart Labs, which analyses media communication such as blogs, tweets and transaction data to predict trends and learn about customers' wants.

Product design and development: By listening to what consumers want, understanding where the gap in the product offering is, and so on, organisations can make the right decisions on the direction of their product design and development.

Customer relationship management data: With the help of social networking analytics, organisations can identify customers in their customer networks who make a large number of calls and text messages and have a large network of friends. Such a customer is said to be highly influential, as studies have shown that when a user of a telephone network leaves, his friends also leave. In fact, some organisations reward their influencer customers with discounts and offers, and these customers in turn spread a positive brand image. Examples from other sectors: Google Pay, Airtel, etc.
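The idea of spotting influential customers from their network can be sketched as a simple contact count over call records. This is a minimal illustration under stated assumptions: the `call_records` data, the function names and the threshold are all hypothetical, and a real system would weigh many more signals than the number of distinct contacts.

```python
from collections import defaultdict

# Hypothetical call records: (caller, callee). Not real data.
call_records = [
    ("A", "B"), ("A", "C"), ("A", "D"),
    ("B", "C"), ("E", "A"),
]


def contact_counts(records):
    """Count the distinct people each customer has exchanged calls with."""
    contacts = defaultdict(set)
    for caller, callee in records:
        contacts[caller].add(callee)
        contacts[callee].add(caller)
    return {customer: len(peers) for customer, peers in contacts.items()}


def influencers(records, min_contacts):
    """Customers whose network size meets the (assumed) threshold."""
    counts = contact_counts(records)
    return sorted(c for c, n in counts.items() if n >= min_contacts)


print(contact_counts(call_records))
print(influencers(call_records, min_contacts=3))
```

Counting distinct contacts is the simplest network measure (degree); it is used here only because it maps directly onto "a large network of friends" in the text.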
Link analysis: Social network analytics can also help in law enforcement and anti-terrorism efforts, as it is possible to identify trouble groups or people who are directly or indirectly connected to each other. This type of analysis is called link analysis.

Sentiment analysis: refers to a computer programming technique to analyze human emotions, attitudes and views across popular social networking platforms, including Facebook, Twitter and blogs. The technique requires analytics skills as well as advanced computing applications. Business research organisations and marketing professionals across the globe use sentiment analysis in one form or another to identify and measure customer behaviour and online trends.

Preventing Fraudulent Activities

What are fraudulent activities?

- Fraud can be committed by both words and behaviour intended to deceive the other party, generally to gain an advantage over that party. Here financial frauds are discussed. Frauds that occur frequently in financial institutions such as banks and insurance companies, and that involve any type of monetary transaction, are called financial frauds.
- Due to such frauds, online retailers such as Amazon, eBay and Groupon suffer huge losses, and this is where big data analytics comes into use.

Types of financial frauds

Credit card fraud: This type of fraud is very common and relates to the use of credit card facilities. It commonly occurs when a fake or stolen card is used in an online transaction. In spite of security checks on the valid owner of the card, such as address verification or the card verification value (CVV) number, fraudsters manage to exploit the loopholes in the system.
Exchange or return policy fraud: Occurs when people take advantage of the exchange or return policies offered by online retailers.

- Example: customers returning a product after using it, reporting non-delivery and later attempting to sell the item online, etc.
- The online retailer can prevent such fraud by charging a restocking fee on returned goods, getting customer signatures on delivery, and tracing customers known to commit such frauds using their transaction patterns. This is where big data analytics comes into use. For example, the retailer can study customers' ordering patterns, frequency of change in shipping address, rush orders, sudden huge orders, etc.

Personal information fraud: This type of fraud occurs when fraudsters obtain the login credentials of customers and purchase a product using them, changing the existing delivery address. When the original customer realises this, he or she calls the retailer to refund the amount, as he or she has not made the transaction.

- According to consumer goods regulations, once fraud is proved the retailer has to refund the amount to the customer.

What are the useful insights from big data analytics in real-time fraud detection

- Live data matching: In this approach organisations compare live details of customers obtained from different sources to validate their authenticity.
- Ex: In an online transaction, big data could compare the incoming address with the geodata received from the customer's smartphone apps. A valid match between the two confirms the authenticity of the transaction.
- Ex: Costly products can have sensors attached to them that transmit their location information. When such products are delivered to customers, the streaming data obtained from the sensors provides a good source of information to trace any fraud.

How to make optimum use of customer data to prevent fraud
In order to deal with huge amounts of data and gain meaningful insights to avoid fraud, organisations need to derive analytics tools that differentiate between genuine and fraudulent customer entries. Organisations have to upgrade their knowledge about emerging methods of fraud and design the necessary prevention checks.

Example: a secure OTP acting as a second round of checks after the CVV; Google Pay introducing an additional security PIN apart from the regular PIN.

Image analytics

- This is another emerging field that can help detect frauds.
- Image analysis (also known as "computer vision" or image recognition) is the ability of computers to recognize attributes within an image. Some examples include facial recognition (smartphones) and position and movement analysis (Google Maps, etc.).
- Analytical systems that deal with big data are designed to integrate and understand images, videos, text, numbers and all forms of unstructured data to facilitate image analytics.

MPP (Massively Parallel Processing database)

- This technology is used in powerful fraud management systems in order to detect frauds. The system analyses each customer transaction on the basis of 500 different criteria or aspects to differentiate between a real and a fraudulent transaction.
- This level of analytic scalability needs an MPP system.
- MPP is a widely used database management approach for storing and analysing huge volumes of data.
- An MPP database has several independent pieces of data stored on multiple networks of connected computers.
- It eliminates the concept of one central server having a single CPU and disk.
- VISA payment services make use of MPP in their fraud management system.

Use of big data in detecting fraud in Insurance sector

- This is important to study because most cases of cheating and fraudulent activity occur in the insurance and retail sectors.

What is the data available in Insurance sector?
- In general, a company offering insurance is always willing to improve its ability to take decisions while processing claims and to ensure that each claim is a genuine one.
- The company has policies and procedures to help underwriters (officers who evaluate insurance coverage, claim details, etc.). However, underwriters do not always have the required data at the right time to make the necessary decisions, which delays processing and increases the chances of fraud.
- Before big data, insurance companies used to analyse only a small sample of the customer's data with fewer parameters, making the process less foolproof.

How to make optimum use of big data analytics in Insurance

- As a solution to these problems, big data analytical platforms increase the validity of data about customers by integrating internal data with data obtained from social media or other sources.
- Ex: A customer might indicate that his/her car was destroyed in a flood, but documentation from social media may show that the car was actually in another city on the day the flood occurred. This mismatch may hint at the presence of fraud.
- Thus information obtained from these platforms will enable insurance companies to diagnose customer claim behaviour and related issues.
- Big data can detect patterns of fraudulent behaviour from large amounts of structured and unstructured data given to it (ex: bank statements, medical bills, criminal records, etc.) and help in detecting frauds quicker and ensuring better actions.

Social Customer relationship management: Social customer relationship management is not a platform or technology, but a process. It makes it critical for insurance companies to link social media sites, such as Facebook and Twitter, to their CRM systems. When social media is integrated within an organisation, it provides high transparency in various issues related to customers.
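The flood-claim example above amounts to cross-checking a claim against location evidence from another source. The sketch below shows that check in its simplest form; the record layout, customer IDs, cities and the `location_mismatches` helper are all hypothetical, and a mismatch is only a hint for an underwriter to review, not proof of fraud.

```python
from datetime import date

# Hypothetical claim: the customer says the car was in Chennai on this date.
claim = {"customer": "C1", "date": date(2020, 8, 1), "city": "Chennai"}

# Hypothetical location evidence gathered from social media posts.
social_posts = [
    {"customer": "C1", "date": date(2020, 8, 1), "city": "Mumbai"},
    {"customer": "C1", "date": date(2020, 7, 30), "city": "Chennai"},
    {"customer": "C2", "date": date(2020, 8, 1), "city": "Delhi"},
]


def location_mismatches(claim, posts):
    """Posts by the same customer on the claim date placed in another city."""
    return [
        post for post in posts
        if post["customer"] == claim["customer"]
        and post["date"] == claim["date"]
        and post["city"] != claim["city"]
    ]


flags = location_mismatches(claim, social_posts)
print(f"{len(flags)} mismatch(es) found for claim review")
```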
What are the useful insights from big data analytics in Insurance

Social network analysis (SNA): a mixed approach using statistical methods, pattern analysis and link analysis to identify any kind of relationship within large amounts of data collected from different sources.

- For ex: data from public records such as criminal records, frequency of address changes, foreclosures (a legal practice in which a lender recovers money from a borrower who has defaulted on repayments) and declarations of bankruptcy are various data sources that can be assimilated into the SNA model, which helps to effectively detect the existence of fraud.
- Using this approach of incorporating information obtained from various data sources into a model, the insurance company can rate claims, where a high rating indicates that the claim may be a fraud. Ex: if a customer files a claim to get insurance money for a car destroyed in a fire, and the customer's statements in the claim report contain words indicating that valuables were removed from the car beforehand, this might indicate that the car was burnt on purpose.

Retail industry

What is big data in retail industry?

- In recent times omni-channel retailing has become a new buzzword. This process focuses on consumer experience by using all available channels (the word "omni" means all directions), including mobile, internet, television, showrooms, radio, mail, apps and many more evolving channels.
- Hence, considering the immense number of transactions in the omni-channel retail industry from all channels, there is a lot of scope for the use of big data technologies in extracting useful information such as relationship patterns and trends in the sales of products.

Cont.

- For example: at what time of the year do we sell the maximum number of leggings, and from which channel? Promotional coupons can then be designed for customers based on their ordering patterns.
- Further, to meet the demands of new customers, retailers are adopting specialized software applications. For example, customers are given information on whether a particular item is in stock in a nearby store or not (Apollo Pharmacy). This is where big data analytics comes into use.
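The leggings question above is an aggregation over sales records by time period and channel. A minimal sketch, assuming a hypothetical `sales` list with made-up figures and field names, could look like this:

```python
from collections import Counter

# Hypothetical omni-channel sales records (all figures are made up).
sales = [
    {"month": "Nov", "channel": "mobile app", "item": "leggings", "units": 120},
    {"month": "Nov", "channel": "store", "item": "leggings", "units": 80},
    {"month": "Dec", "channel": "mobile app", "item": "leggings", "units": 150},
    {"month": "Dec", "channel": "mobile app", "item": "leggings", "units": 40},
    {"month": "Dec", "channel": "store", "item": "shirts", "units": 300},
]


def best_month_and_channel(records, item):
    """Return ((month, channel), units) with the highest total for the item.

    Raises ValueError if the item has no sales at all.
    """
    totals = Counter()
    for record in records:
        if record["item"] == item:
            totals[(record["month"], record["channel"])] += record["units"]
    if not totals:
        raise ValueError(f"no sales recorded for {item!r}")
    return totals.most_common(1)[0]


print(best_month_and_channel(sales, "leggings"))
```

At real omni-channel volumes the same group-and-sum would run on a distributed engine rather than in a single Python process; the logic stays the same.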
RFID

https://www.youtube.com/watch?v-reQUE7 BOUSY

How to make optimum use of retail data: RFID technology

- The biggest evolution in automating the process of labelling and tracking retail goods is RFID (radio frequency identification).
- Walmart is the first retailer to implement RFID in its products.
- RFID helps better item tracking by differentiating items that are out of stock from those that are available on the shelf.
- With this technology the huge volume of transactional data associated with omni-channel retailing can be easily handled, and measures can be taken for enhancing customer experience.

Useful insights from retail data analytics:

- Asset management: Retail organisations can tag their material handling equipment, such as vehicles and tools, with RFID in order to trace them at any time and from any location. Readers fixed at specific locations can observe and record all movements of the tagged assets with great accuracy. This information also lessens the time needed for documentation.
- Regulatory compliance: To meet the regulations of agencies such as the FDA (Food and Drug Administration) and OSHA (Occupational Safety and Health Administration), manufacturers need to dispatch products such as medicines, regulated drugs, special foods with preservatives and hazardous chemicals with updated labels. RFID tags can be used as a labelling system for these goods. Logistics companies like DTC can also differentiate speed-delivery products from normal-delivery ones using RFID tags.
- Inventory control: RFID data allows manufacturers to track inventory for raw materials, work in progress (WIP) or finished goods (FG). Readers installed on shelves can update inventory automatically and raise alarms in case restocking is required. Further, the readers can be programmed to raise an alarm in case items are removed and placed elsewhere. Even Apollo Pharmacy manages its inventory of available drugs using this technology.
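The inventory-control behaviour described above (shelf readers updating stock automatically and raising a restock alarm) can be sketched as a small class. This is an illustration only: `ShelfInventory`, its methods, the tag IDs and the reorder level are hypothetical and do not correspond to any real RFID reader API.

```python
class ShelfInventory:
    """Toy model of a shelf reader that tracks tagged items (hypothetical)."""

    def __init__(self, reorder_level):
        self.reorder_level = reorder_level
        self.tags_on_shelf = {}  # tag_id -> item name

    def tag_seen(self, tag_id, item):
        """Reader detects a tagged item placed on the shelf."""
        self.tags_on_shelf[tag_id] = item

    def count(self, item):
        """Number of tagged units of this item currently on the shelf."""
        return sum(1 for name in self.tags_on_shelf.values() if name == item)

    def tag_removed(self, tag_id):
        """Reader detects a tag leaving the shelf.

        Returns a restock alert string when stock falls to the reorder
        level or below, otherwise None.
        """
        item = self.tags_on_shelf.pop(tag_id, None)
        if item is None:
            return None  # unknown tag: nothing to update
        remaining = self.count(item)
        if remaining <= self.reorder_level:
            return f"RESTOCK {item}: {remaining} left"
        return None


shelf = ShelfInventory(reorder_level=1)
for tag in ("t1", "t2", "t3"):
    shelf.tag_seen(tag, "paracetamol")
print(shelf.tag_removed("t1"))  # stock still above the reorder level
print(shelf.tag_removed("t2"))  # stock has reached the reorder level
```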
- Shipping and receiving: RFID tags can be used to speed up the final shipping of finished goods.
- Service and warranty authorisations: RFID tags can hold updated information about repairs and services done on the product. Once a repair or service has been completed, the information can be fed into the RFID tag on the product, so that if future repairs are required, technicians can access this information without accessing any external database. This helps in reducing calls and time-expensive enquiries into documents.

Introducing technologies for handling big data:

- Huge amounts of data from different sources need to be managed properly to derive productive results. The astronomical increase in the volume, velocity and variety of data collected from different sources at the same time is forcing organisations to adopt a data analysis strategy that can be used for analysing the entire data in a very short time.

Cont.

- This is done with the help of new software programs or applications that do the following:
  - Break up the given task into sub-tasks
  - Survey the available resources on hand
  - Assign the sub-tasks to the nodes or computing devices that are interconnected via a network
  - Finally, collect the outputs from all sub-tasks
- The above applications are based on the concepts of distributed and parallel computing.

Distributed computing and parallel computing: techniques for big data

- Distributed computing: In distributed computing, multiple computing resources are connected in a network and computing tasks are distributed across these resources.
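The four steps listed above (split the task, survey resources, assign sub-tasks, collect outputs) can be sketched on a single machine with Python's standard `concurrent.futures` module. This is a simplified stand-in, not a distributed system: worker threads play the role of networked nodes, and the `split` and `parallel_sum` helpers are hypothetical names chosen for the illustration.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Step 1: break the task into at most `parts` contiguous sub-tasks."""
    if not data:
        return []
    size = -(-len(data) // parts)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_sum(data, workers=None):
    """Sum a list by distributing chunks to workers and combining results."""
    # Step 2: survey the available resources (CPU count as a rough proxy).
    workers = workers or os.cpu_count() or 1
    chunks = split(data, workers)
    if not chunks:
        return 0
    # Step 3: assign each sub-task to a worker.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum, chunks))
    # Step 4: collect the outputs from all sub-tasks.
    return sum(partial_sums)


print(parallel_sum(list(range(1, 101)), workers=4))
```

In a framework such as Hadoop the same pattern is applied across machines, with the framework handling the assignment and collection steps.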
  This sharing of tasks increases the speed as well as the efficiency of the system.
- It is also more suitable for processing huge amounts of data in a limited time.

Characteristics of a distributed system

- Heterogeneity refers to the ability of the system to operate on a variety of different hardware and software components.
- Openness refers to how easily the system can be extended and improved.
- Concurrency refers to the system's ability to handle the access and use of shared resources.
- Scalability is one of the major characteristics determining the effectiveness of a distributed system. It refers to how easily the system can adapt as its size increases.

Cont.

Parallel computing: This is another way to improve the processing capability of a computer system, by adding additional computational resources to it. In this method complex computations are divided into sub-tasks, which can be handled individually by processing units running in parallel. In general, organisations use a combination of parallel and distributed techniques to process big data.

Parallel computing is a type of computing architecture in which several processors simultaneously execute multiple smaller calculations broken down from an overall larger, complex problem. Big data is a collection of data so huge that it cannot be processed by traditional systems.

Issues in big data handling systems:

- Latency: can be defined as the aggregate delay in the system because of delays in the completion of individual tasks.
  - Such a delay automatically leads to a slowdown in system performance as a whole and is often termed system delay.
  - The number of nodes designed into the distributed computing system to process individual tasks determines the level of scalability of the big data system.
  - Thus implementing distributed and parallel computing methodologies helps in handling latency.

Cont.

- Load balancing: The sharing of workload across various systems throughout the network to manage the load is known as load balancing.
  Distributed and parallel computing methodologies make use of the load balancing feature to handle growing amounts of big data more efficiently and flexibly.
- Virtualization: Big data virtualization is a process of creating virtual structures for big data systems, such as the hardware platform, storage device and operating system, to meet the goals and objectives of big data analytics.
  - This virtualization helps organisations to understand and easily navigate the flow of information across these physical systems.
  - Distributed and parallel computing methodologies make use of virtualization to segregate the processing and analysis tasks in a systematic framework to minimize errors.

Special techniques of distributed & parallel computing:

- Distributed and parallel computing techniques have been around for almost 50 years. Initially the technology was used in computer science research to solve complex problems by increasing scalability without investing in massive computing systems.
- Over time, the concepts of distributed and parallel computing have evolved into a number of techniques to process and manage huge amounts of data produced at high velocity.
- Some of these techniques are described below.

Contd.

- Cluster or grid computing: a form of parallel computing in which a bunch of computers (often called nodes) are connected through a LAN and used to solve complex operations so that they behave like a single machine. This reduces downtime and provides larger storage capacity. Primarily used in Hadoop.
- Massively parallel processing (MPP): primarily used in data warehousing. MPP is a widely used database management approach for storing and analysing huge volumes of data.
  - An MPP database has several independent pieces of data stored on multiple networks of connected computers.
  - It eliminates the concept of one central server having a single CPU and disk.
  - MPP platform examples are Greenplum and ParAccel (both popular database management companies).

Cont.
- High performance computing (HPC): HPC environments are the ones specially designed for processing floating-point data at high speed. HPC is used in research and business organisations to develop specialized applications where accurate results are most valuable. Example: pollution level detection, etc.

https://www.youtube.com/watch TIYEGt=1845

Why Hadoop?

Hadoop is sometimes expanded as "High Availability Distributed Object Oriented Platform", although the name actually comes from a toy elephant that belonged to Doug Cutting's son.

What it is and why it matters

Over the course of the evolution of big data handling systems, distributed computing environments have been used to process high volumes of data. However, the multiple nodes in such an environment may not always cooperate with each other (due to issues such as latency, data-related problems, system delay, etc.), thus leaving a lot of scope for errors. In this context Hadoop evolved as a platform or framework providing an improved programming model, which is used to create and run distributed systems quickly and efficiently with the fewest errors.

What is Hadoop?

"Hadoop is a framework that allows you to first store big data in a distributed environment, so that you can process it in parallel."

Hortonworks (a data software company based in California that developed and supported open source software for big data processing) definition: "An open source software platform for distributed storage and distributed processing of very large data sets", run on an entire cluster instead of one PC.

- "Distributed storage": A data set is spread across multiple drives. If one of them burns down, the data is still reliably stored.
- "Distributed processing": Hadoop can aggregate data using many CPUs in the cluster.

When to use Hadoop?

- Search: Yahoo, Amazon
- Log processing: Facebook, Yahoo
- Data warehouse: Facebook
- Video and image analysis: New York Times

When not to use Hadoop?

- Low-latency data access: quick access to small parts of data.
- Multiple data modification: Hadoop is a better fit only if we are primarily concerned about reading data and not modifying data.
- Lots of small files: Hadoop is suitable for scenarios where we have few but large files.

Evolution of Hadoop

- In 2003, Doug Cutting (a software designer who invented open source search technologies) launched project "Nutch" to handle billions of searches and index millions of web pages.
- In Oct 2003, Google released its paper on GFS (Google File System).
- In Dec 2004, Google released its paper on MapReduce.
- In 2005, Nutch used GFS and MapReduce to perform operations.
- In 2006, Yahoo created Hadoop based on GFS and MapReduce with Doug Cutting and team.
- In 2007, Yahoo started using Hadoop on a 1000-node cluster.
- In Jan 2008, Yahoo released Hadoop as an open source project to the Apache Software Foundation.
- In July 2008, Apache successfully tested a 4000-node cluster with Hadoop.
- In 2009, Hadoop successfully sorted a petabyte of data in less than 17 hours.
- In Dec 2011, Apache Hadoop released version 1.0.
- In Aug 2013, version 2.0.6 became available.

Hadoop Ecosystem:

- As we understand it, Hadoop is an open source software framework (a set of programs written in Java that allows for massively parallel computing, allowing big data sets to be stored and spread across multiple servers with little reduction in performance).
- Being a framework, Hadoop is made up of several modules that are supported by a large ecosystem of technologies.
- Thus the Hadoop ecosystem is defined as a platform which provides various services to solve the problems associated with big data.

Cont.
There are 4 major services provided:

- Data processing (tools: MapReduce, YARN)
- Data storage (tools: HDFS, HBase)
- Data access (tools: Hive, Pig, Sqoop, etc.)
- Data management (tools: Oozie, Flume, ZooKeeper, etc.)

Understanding Hadoop Ecosystem:

Following are the components that collectively form a Hadoop ecosystem: HDFS, YARN, MapReduce, Hive, Pig, Mahout (machine learning), Sqoop, Oozie, Flume, ZooKeeper and HBase.

Video: www.youtube.com/watch?v-aReuLtY

[Figure: Hadoop ecosystem]

Contd.

- HDFS, MapReduce and YARN are the core components of Apache Hadoop and they form the basic distributed Hadoop framework.
- There are several other Hadoop components that form an integral part of the Hadoop ecosystem with the intent of enhancing the power of Apache Hadoop in some way or other, such as providing better integration with databases, making Hadoop faster, or developing novel features and functionalities.
- In the next few slides we will discuss some of the eminent Hadoop components used extensively by enterprises and mentioned in our syllabus. They are Mahout, Sqoop, Oozie, Flume, ZooKeeper and HBase.

What is HDFS?

HDFS: Hadoop Distributed File System

- In HDFS, abstraction means representing the data as blocks of a file rather than as a single file, which simplifies the storage subsystem.
- Similar to virtualization, you can see HDFS logically as a single unit for storing big data, but you are actually storing your data across multiple nodes in a distributed fashion.
- HDFS follows a master-slave architecture.

[Figure: HDFS illustration and HDFS architecture]

What is YARN?
- Hadoop YARN (Yet Another Resource Negotiator) is a Hadoop ecosystem component that provides resource management and is responsible for allocating cluster resources to the various applications.
- YARN is the resource management and job scheduling technology in the open source Hadoop distributed processing framework. It is often described as the operating system of Hadoop, as it is responsible for managing resources.
- YARN performs data processing activities by allocating resources and scheduling tasks.
- YARN was introduced in Hadoop 2, where the resource negotiation part was split out from MapReduce. Where HDFS splits up the data storage across your cluster, YARN splits up the computation. YARN tries to align nodes so as to run jobs as efficiently as possible.

In HDFS, the NameNode is the master node and knows which blocks are stored on which DataNode and where the replicas of the data blocks are kept. The actual data is stored in the DataNodes.

MapReduce: a programming paradigm that enables massive scalability across hundreds or thousands of servers in a Hadoop cluster.

HIVE

- Hive is an application that runs over the Hadoop framework and provides an SQL-like interface for processing or querying data.
- Hive provides an SQL-like interface for working on data stored in Hadoop-integrated systems. That is, it makes your HDFS-stored data look like an SQL database.
- SQL is a structured query language used for processing structured and semi-structured data.

[Figure: user queries passing through Hive to HDFS]

What is Hive?

- Apache Hive is data warehouse software that facilitates querying and managing large datasets residing in distributed storage using SQL.
- In short, it transforms the queries into efficient MapReduce or Spark jobs.
- Hive is built upon Hadoop, and the query language for processing data in Hive is HiveQL. A query is converted into a MapReduce program and then processed by Hadoop.

Pig vs Hive

- Pig's scripting language, called Pig Latin, has an SQL-like syntax and can perform MapReduce jobs.
- Pig Latin programs are easier to write than the equivalent Java programs.

Oozie

- Oozie is an orchestration system for Hadoop jobs.
+ Oozie is an open-source Java web application available under the Apache license.
+ Oozie is designed to run multistage Hadoop jobs as a single job: an Oozie job.

Oozie In Operation
+ Apache Oozie is a scheduler system to run and manage Hadoop jobs in a distributed environment.
+ It allows multiple complex jobs to be combined and run in sequential order to achieve a bigger task.
+ Oozie detects completion of tasks through callback and polling.
+ When Oozie starts a task, it provides a unique callback HTTP URL to the task, and the task notifies that URL when it is complete.
+ If the task fails to invoke the callback URL, Oozie can poll the task for completion.

Workflow in Oozie
[Figure: workflow with Spark job, MapReduce job, Fork, Join, Pig job, Hive query]

Features Of Apache Oozie
+ Apache Oozie is a scheduler system to run and manage Hadoop jobs in a distributed environment.
+ Oozie allows multiple complex jobs to be combined and run in sequential order to achieve the desired output.
+ It is strongly integrated with the Hadoop stack, supporting various jobs like Pig, Hive, Sqoop, etc.
+ Further, Oozie is able to leverage the existing Hadoop machinery for problems such as load balancing and fail-over.

Types of Apache Oozie Jobs
The following three types of jobs are common in Oozie:
+ Oozie Workflow Jobs - Oozie jobs running on demand. Workflow actions can be different kinds of tasks, such as Hive tasks, Pig tasks, Shell actions, etc.
+ Oozie Coordinator Jobs - Oozie jobs running periodically.
+ Oozie Bundle - A collection of coordinator jobs managed as a single job.

Sqoop

Special features of Sqoop:
Apache Sqoop undertakes the following tasks to integrate bulk data movement between Hadoop and structured databases:
+ Sqoop fulfils the growing need to transfer data from the mainframe to HDFS.
+ It transfers data in parallel for effective performance and optimal system utilization.
+ Sqoop creates fast data copies from an external source into Hadoop.
+ It acts as a load balancer by mitigating extra storage and processing loads to other devices.
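
One way to picture the parallel transfer listed among the Sqoop features is to divide a table's key range into contiguous slices, one per worker, so that each worker copies only its own slice. The sketch below is not Sqoop's code. It is a minimal Python illustration that assumes a numeric key column; the names `split_ranges` and `where_clauses` are hypothetical.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide the inclusive key range [min_id, max_id] into contiguous slices.

    Hypothetical helper: returns at most num_mappers (low, high) pairs that
    together cover every key exactly once.
    """
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    if max_id < min_id:
        return []  # empty table: nothing to transfer

    total = max_id - min_id + 1
    workers = min(num_mappers, total)      # never more slices than keys
    base, extra = divmod(total, workers)   # the first `extra` slices get one more key

    ranges = []
    start = min_id
    for i in range(workers):
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def where_clauses(column, min_id, max_id, num_mappers):
    """Turn each slice into the filter one worker would apply to its query."""
    return [
        f"{column} BETWEEN {low} AND {high}"
        for low, high in split_ranges(min_id, max_id, num_mappers)
    ]
```

Because the slices do not overlap, the workers can run at the same time without copying any row twice, which is the source of the performance benefit the slide mentions.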
+ The Sqoop component is used for importing data from external sources, such as relational databases and variously structured data marts, into related Hadoop components like HDFS, HBase, or Hive.
+ Sqoop mainly helps in moving data from an enterprise database to a Hadoop cluster to perform the ETL (extract, transform, load) process.
+ It can also be used for exporting data from Hadoop components to external structured data stores.
[Figure: Sqoop export to external structured data stores]

FLUME

What is Flume?
+ Flume is relevant in cases when data must be brought from multiple servers into Hadoop immediately.
+ In such cases, the Flume component is used to gather and aggregate large amounts of data.
+ It facilitates the streaming of huge volumes of log files from various sources (like web servers such as Twitter, Facebook, etc.) into the Hadoop Distributed File System (HDFS).

Why Apache Flume?
+ Organizations running multiple web services across multiple servers and hosts will generate multitudes of log files on a daily basis.
+ These log files will contain information about activities that is required for both auditing and analytical purposes.
+ Also, when the rate of incoming data exceeds the rate at which data can be written to the destination, Flume acts as a mediator between the data producers and the centralized stores and provides a steady flow of data between them.

ZOOKEEPER

Introduction
+ Apache ZooKeeper is an open-source software framework designed to coordinate multiple services in the Hadoop ecosystem.
+ Organizing and maintaining a service in a distributed environment is a complicated task.
+ ZooKeeper allows developers to focus on core application logic without worrying about the distributed nature of the application.
+ ZooKeeper is a distributed coordination service to manage a large set of hosts.

Features of Zookeeper contd.
+ ZooKeeper is a coordinator. Many other tools, such as HDFS and HBase, rely on it.
+ It can keep track of which node is up or down, which one is the master node, which workers are available, and many more things.
+ It keeps track of things that can go wrong on the cluster, including a master node crashing, a worker crashing, or network trouble where a part of the cluster cannot see the rest of it.
+ ZooKeeper sits on the side of your system and tries to maintain a consistent picture of state across the entire distributed system.

HBase is a column-oriented non-relational database management system that runs on top of Hadoop Distributed File System (HDFS).

More on Data Storage part of HADOOP Ecosystem: HBASE
+ HBase is a Hadoop database: an open-source, non-relational, distributed, column-oriented database developed as part of Hadoop that runs on top of HDFS.
+ It is beneficial when large amounts of information are required to be stored and updated.
+ While MapReduce enhances Big Data processing, HBase takes care of its storage.
+ Use HBase when you need random, real-time read/write access to your Big Data.
+ Using HBase as a database, enterprises can create large tables with millions of rows and columns.

Features of HBase:
[Bullet list illegible in source]

Why HBase?
+ HBase is a specialized data store used on top of HDFS, which is relevant in the following cases:
+ When you need more random read and write access to data.
+ When you want the data to be stored in a more structured fashion.
+ When the velocity of data is very high.
+ When the log data of a website needs to be stored. Example: Facebook data stored in HBase.

Basic Blocks of HBase
+ HDFS is suited to storing large files but does not support fast individual record lookups. HBase provides fast lookups for larger tables and random access to individual records.
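
The random read/write access that distinguishes HBase from plain HDFS files can be pictured with a toy in-memory table. This is only a sketch under stated assumptions, not the HBase client API: the class name `ToyColumnStore` and its methods are hypothetical, columns are addressed by a "family:qualifier" string, and versioning by timestamp is left out.

```python
from collections import defaultdict


class ToyColumnStore:
    """In-memory stand-in for a column-oriented table.

    Data is kept as row key -> {"family:qualifier": value}, so one cell can
    be written or read directly by its row key without reading a whole file.
    """

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Write (or overwrite) a single cell."""
        self._rows[row_key][column] = value

    def get(self, row_key, column=None):
        """Random read: one cell, or the whole row when column is None."""
        row = self._rows.get(row_key, {})
        if column is None:
            return dict(row)
        return row.get(column)

    def scan(self, start_row, stop_row):
        """Yield (row_key, cells) in sorted key order for start_row <= key < stop_row."""
        for key in sorted(self._rows):
            if start_row <= key < stop_row:
                yield key, dict(self._rows[key])


if __name__ == "__main__":
    table = ToyColumnStore()
    table.put("user#42", "info:name", "Asha")
    table.put("user#42", "stats:logins", 7)
    table.put("user#77", "info:name", "Ravi")
    print(table.get("user#42", "info:name"))
    print(list(table.scan("user#1", "user#5")))
```

Rows do not need to share the same columns, which is why a table of this shape can grow to very many sparse columns without reserving space for empty cells.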
Summary and How HBase is combined with Hadoop
+ Apache HBase is the Hadoop database: a distributed, column-oriented, scalable big data store.
+ Use Apache HBase when you need random, real-time read/write access to your Big Data.
+ Its goal is hosting very large tables (billions of rows by millions of columns) atop clusters of commodity hardware.
+ Apache HBase is an open-source, distributed, non-relational database modelled after Google's "Bigtable: A Distributed Storage System for Structured Data".
+ Just as Google Bigtable uses the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

Understanding Hadoop components: Use cases Tabulation
[Table: Component of Hadoop, Brief Description]
