Big Data Analytics
(Course Code : ITDO8011) (Department Optional Course - 5)

Dr. Sangeeta Vhatkar (Thakur COE and Technology, Kandivali(E))
Dipali Pawar (Zeal COE and Research, Pune)
Rupali D. Pashte (Shree L. R. Tiwari COE, Mumbai)
Dr. Zahir Aalm (Thakur COE and Technology, Kandivali(E))

Introduction to Big Data

University Prescribed Syllabus : Introduction to Big Data, Big Data characteristics, types of Big Data, Traditional vs. Big Data business approach, Big Data challenges, examples of Big Data in real life, Big Data applications.
Self-learning Topics : Identification of Big Data applications and their solutions.

1.1 INTRODUCTION TO BIG DATA AND HADOOP

GQ. Firstly, we need to know : "what is data"?

(1) Nowadays the amount of data created by advanced technologies such as social networking sites and e-commerce is very large. It is really difficult to store such huge data using traditional data storage facilities.
(2) Until 2003, the total size of data produced was about 5 billion gigabytes. If this data were stored in the form of disks, it could fill an entire football field. By 2011, the same amount of data was being created every two days, and by 2013, every ten minutes. This is a tremendous rate of growth.
(3) In this chapter we will discuss some of the concepts related to big data and define the processes and engines behind Big Data Analytics. Big Data is a massive collection of data that continues to grow; some common sources are listed below.

1. Stock Exchange : The data in the share market regarding prices and status details of the shares of a number of companies is very huge.
2. Social Media Data : The data of social networking sites contains information about all the account holders, their posts, chat history, advertisements etc. Topmost sites like Facebook and WhatsApp have literally billions of users.
3. Video Sharing Portals : Video sharing portals like YouTube and Vimeo contain millions of videos, each of which requires a lot of memory to store.
4. Search Engine Data : Search engines like Google and Yahoo hold a great deal of metadata regarding various sites.
5. Transport Data : Transport data contains information about model, capacity, distance and availability of various vehicles.
6. Banking Data : The big giants in the banking domain like SBI or ICICI hold large amounts of data regarding the huge transactions of account holders.

1.2 BIG DATA CHARACTERISTICS

GQ. What are the characteristics of Big Data?
UQ. Describe any five characteristics of Big Data.
UQ. Explain what characteristic of social networks makes them Big Data.
UQ. Explain Big Data along with its V's.

(1) Volume represents the amount of data that is growing at a high rate, i.e. data volume in petabytes.
(2) Value refers to turning data into value. By turning accessed data into value, businesses may generate revenue.
(3) Veracity refers to the uncertainty of available data. Veracity arises due to the high volume of data, which brings incompleteness and inconsistency.
(4) Visualization is the process of displaying data in charts, graphs, maps, and other visual forms.
(5) Variety refers to the different data types, like text, audio, video, etc.
(6) Velocity is the rate at which data grows. Social media plays a major role in the velocity of growing data.
(7) Virality describes how quickly information gets spread across people-to-people (P2P) networks.

1.2.1 Volume
- As follows from the name, big data is used to refer to enormous amounts of information.
- We are talking not about gigabytes but terabytes and petabytes of data.
- The IoT (Internet of Things) is creating exponential growth in data.
- The volume of data is projected to change significantly in the coming years. Hence, Volume is one characteristic which needs to be considered while dealing with Big Data.

Volume [Data at Rest] : Terabytes, Petabytes; Records/Archives; Tables/Files; Distributed.

1.2.2 Variety
- Variety refers to heterogeneous sources and the nature of data, both structured and unstructured.
- Data comes in different formats - from structured, numeric data in traditional databases to unstructured text documents, emails, videos, audio, stock ticker data and financial transactions.
- This variety of unstructured data poses certain issues for storing, mining and analysing data.
- Organizing the data in a meaningful way is no simple task, especially when the data itself changes rapidly. Another challenge of Big Data processing goes beyond the massive volumes and increasing velocities of data: it also lies in manipulating the enormous variety of these data.

Variety [Data in many Forms] : Text; Multimedia; Structured; Unstructured.

1.2.3 Veracity
- Veracity describes whether the data can be trusted. Veracity refers to the uncertainty of available data; it arises due to the high volume of data, which brings incompleteness and inconsistency.
- Hygiene of data in analytics is important, because otherwise you cannot guarantee the accuracy of your results.
- Big data comes from so many different sources that it is difficult to link, match, cleanse and transform data across systems.
- However, the analysis is useless if the data being analysed is inaccurate or incomplete.
- Veracity is all about making sure the data is accurate, which requires processes to keep bad data from accumulating in your systems.

Veracity [Data in Doubt] : Trustworthiness; Authenticity; Accuracy; Availability.

1.2.4 Velocity
- Velocity is the speed at which data grows, is processed and becomes accessible.
- Data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc.
- The flow of data is massive and continuous.
- Most data are warehoused before analysis, but there is an increasing need for real-time processing of these enormous volumes.
- Real-time processing reduces storage requirements while providing more responsive, accurate and profitable responses.
- Data should be processed fast, by batch or in a stream-like manner, because it just keeps growing every year.

Velocity [Data in Motion] : Streaming; Batch; Real / Near Time Processes.

1.2.5 Value
- Value refers to turning data into value. By turning accessed big data into value, businesses may generate revenue.
- Value is the end game. After addressing volume, velocity, variety, variability, veracity, and visualization - which takes a lot of time, effort and resources - you want to be sure your organization is getting value from the data.
- For example, data that can be used to analyze consumer behavior is valuable for your company, because you can use the research results to make individualized offers.

Value [Data into Money] : Statistical; Events; Correlations.

1.2.6 Visualization
- Big data visualization is the process of displaying data in charts, graphs, maps, and other visual forms.
- It is used to help people easily understand and interpret their data at a glance, and to clearly show trends and patterns that arise from this data.
- Raw data comes in different formats, so creating data visualizations is a process of gathering, managing, and transforming data into a format that is most usable and meaningful.
- Big Data visualization makes your data as accessible as possible to everyone within your organization, whether they have technical data skills or not.

1.3 TYPES OF BIG DATA

1.3.1 Type #1 : Unstructured
- Unstructured data is the data which lacks a proper format or sequence and is not constrained by a fixed schema.

Advantages
1. It is not constrained by a fixed schema.
2. It is very flexible due to the absence of a schema.
3. Data is portable.
4. It is very scalable.
5. It can deal easily with the heterogeneity of sources.
6. These types of data have a variety of business intelligence and analytics applications.

Disadvantages
1. It is difficult to store and manage unstructured data due to the lack of schema and structure.
2. Indexing the data is difficult and error prone due to the unclear structure and the absence of pre-defined attributes. Due to this, search results are not very accurate.
3. Ensuring security of the data is a difficult task.

1.3.2 Type #2 : Structured
- Any data that can be stored, accessed and processed in the form of a fixed format is termed "structured" data.
- Over the years, computer science has achieved great success in developing techniques for working with this kind of data (where the format is well known in advance) and deriving value out of it.
- The size of such data can grow to a huge extent; typical sizes are in the range of multiple zettabytes. Data stored in a relational database management system is one example of structured data.
- Structured data is data which conforms to a data model, has a well-defined structure, follows a consistent order and can be easily accessed and used by a person or a computer program.
- Structured data is usually stored in well-defined schemas such as databases. It is generally tabular, with columns and rows that clearly define its attributes.
- SQL (Structured Query Language) is often used to manage structured data stored in databases.

1.3.2(A) Characteristics of Structured Data
- Data conforms to a data model and has an easily identifiable structure.
- Data is stored in the form of rows and columns. Example : a database.
- Data is well organised, so the definition, format and meaning of the data are explicitly known.
- Data resides in fixed fields within a record or file.
- Similar entities are grouped together to form relations or classes.
- Entities in the same group have the same attributes.
- Easy to access and query, so the data can be easily used by other programs.
- Data elements are addressable, so it is efficient to analyse and process them.

1.3.2(B) Sources of Structured Data
(1) SQL databases
(2) Spreadsheets such as Excel
(3) OLTP systems
(4) Online forms
(5) Sensors such as GPS or RFID tags
(6) Network and web server logs
(7) Medical devices

1.3.2(C) Advantages of Structured Data
1. Structured data has a well-defined structure that helps in easy storage and access of data.
2. Data can be indexed based on text strings as well as attributes. This makes search operations hassle-free.
3. Data mining is easy, i.e. knowledge can be easily extracted from the data.
4. Operations such as updating and deleting are easy due to the well-structured form of the data.
5. Business intelligence operations such as data warehousing can be easily undertaken.
6. Easily scalable in case there is an increment of data.
7. Ensuring security of the data is easy.

Structured - Example : Employee_Table

Employee_ID | Employee_Name | Gender | Department | Salary_In_lacs
1           | XYZ           | MALE   | FINANCE    | 850000
2           | ABC           | MALE   | ADMIN      | 250000
3           | PQR           | FEMALE | SALES      | 350000
4           | WR            | FEMALE | FINANCE    | 600000
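Because structured data such as the Employee_Table above has a fixed schema, it can be created and queried directly with SQL, as mentioned earlier. The following is a minimal sketch using Python's built-in sqlite3 module; the table and column names simply mirror the illustration above and are not part of any real system.

```python
import sqlite3

# In-memory database holding the structured Employee_Table example
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Employee_Table (
    Employee_ID INTEGER PRIMARY KEY,
    Employee_Name TEXT, Gender TEXT,
    Department TEXT, Salary_In_lacs INTEGER)""")

rows = [(1, "XYZ", "MALE", "FINANCE", 850000),
        (2, "ABC", "MALE", "ADMIN", 250000),
        (3, "PQR", "FEMALE", "SALES", 350000),
        (4, "WR", "FEMALE", "FINANCE", 600000)]
conn.executemany("INSERT INTO Employee_Table VALUES (?,?,?,?,?)", rows)

# Fixed fields make querying easy: list all FINANCE employees
for row in conn.execute(
        "SELECT Employee_Name, Salary_In_lacs FROM Employee_Table "
        "WHERE Department = 'FINANCE'"):
    print(row)   # ('XYZ', 850000) and ('WR', 600000)
```

Note how every record has exactly the same fields; this is precisely the property that unstructured and semi-structured data lack.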
1.3.3 Type #3 : Semi-structured
- Semi-structured data is the third type of big data. Semi-structured data can contain both of the forms of data described above.
- Semi-structured data pertains to data containing both the formats mentioned above, that is, structured and unstructured data.
- To be precise, it refers to data that, although it has not been classified under a particular repository (database), contains vital information or tags that segregate individual elements within the data.
- Web application data, which is unstructured, consists of log files, transaction history files etc.
- Online transaction processing systems are built to work with structured data, wherein data is stored in relations (tables).
- Semi-structured data is data that does not conform to a data model but has some structure. It lacks a fixed or rigid schema. It is data that does not reside in a relational database but has some organizational properties that make it easier to analyze. With some processing, we can store it in a relational database.

1.3.3(A) Characteristics of Semi-structured Data
1. Data does not conform to a data model but has some structure. Data cannot be stored in the form of rows and columns as in databases.
2. Semi-structured data contains tags and elements (metadata) which are used to group data and describe how the data is stored.
3. Similar entities are grouped together and organized in a hierarchy. Entities in the same group may or may not have the same attributes or properties.
4. It does not contain sufficient metadata, which makes automation and management of the data difficult.
5. The size and type of the same attributes in a group may differ.
6. Due to the lack of a well-defined structure, it cannot be used by computer programs easily.

1.3.3(B) Sources of Semi-structured Data
(1) E-mails
(2) XML and other markup languages
(3) Binary executables
(4) TCP/IP packets
(5) Zipped files
(6) Integration of data from different sources
(7) Web pages

1.3.3(C) Advantages and Disadvantages of Semi-structured Data

Advantages
1. The data is not constrained by a fixed schema.
2. Flexible, i.e. the schema can be easily changed.
3. Data is portable.
4. It is possible to view structured data as semi-structured data.
5. It supports users who cannot express their needs in SQL.
6. It can deal easily with the heterogeneity of sources.

Disadvantages
1. The lack of a fixed, rigid schema makes storage of the data difficult.
2. Interpreting the relationship between data is difficult, as there is no separation of the schema and the data.
3. Queries are less efficient as compared to structured data.

Semi-structured - Example
- A user can see semi-structured data as structured in form, but it is actually not defined with, e.g., a table definition as in a relational DBMS.
- Personal data stored in an XML file :

<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>20</age></rec>
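The tags in the XML sample above are exactly the metadata that characteristic 2 refers to: they let a program locate individual elements even though there is no table schema. A minimal sketch with Python's standard xml.etree.ElementTree module follows; the records are wrapped in an assumed <people> root element purely for well-formedness.

```python
import xml.etree.ElementTree as ET

xml_data = """<people>
  <rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
  <rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
  <rec><name>Satish Mane</name><sex>Male</sex><age>20</age></rec>
</people>"""

# The tags act as metadata: elements are addressable without a fixed schema
for rec in ET.fromstring(xml_data).findall("rec"):
    print(rec.findtext("name"), rec.findtext("age"))
```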
1.4 DIFFERENCE BETWEEN STRUCTURED, SEMI-STRUCTURED AND UNSTRUCTURED DATA

GQ. What is the difference between structured, semi-structured and unstructured data?

Property | Structured data | Semi-structured data | Unstructured data
Technology | Based on relational database tables | Based on XML/RDF (Resource Description Framework) | Based on character and binary data
Transaction management | Matured transactions and various concurrency techniques | Transactions adapted from the DBMS, not matured | No transaction management and no concurrency
Version management | Versioning over tuples, rows, tables | Versioning over tuples or graphs is possible | Versioned as a whole
Flexibility | Schema dependent, less flexible | More flexible than structured data but less flexible than unstructured data | More flexible; absence of schema
Scalability | Scaling the DB schema is very difficult | Scaling is simpler than for structured data | Very scalable
Robustness | Very robust | Newer technology, not very widespread | -
Query performance | Structured queries allow complex joins | Queries over anonymous nodes are possible | Only textual queries are possible

1.5 TRADITIONAL VS. BIG DATA BUSINESS APPROACH

UQ. Compare big data analytics with traditional analytics.

1. Traditional data
- Traditional data is the structured data which is majorly maintained by all types of businesses, from very small to big organizations.
- In a traditional database system, a centralized database architecture is used to store and maintain the data in a fixed format or fields in a file. For managing and accessing the data, Structured Query Language (SQL) is used.

2. Big data
- We can consider big data an upper version of traditional data. Big data deals with data sets too large or complex to manage in traditional data-processing application software.
- It deals with large volumes of structured, semi-structured and unstructured data.
- Volume, Velocity, Variety, Veracity and Value are the 5 V's of big data.
- Big data not only refers to a large amount of data but also to extracting meaningful information by analyzing huge, complex data sets.

Sr. No. | Traditional Data | Big Data
1. | Traditional data is generated at enterprise level. | Big data is generated both outside and at enterprise level.
2. | Its volume ranges from gigabytes to terabytes. | Its volume ranges from petabytes to zettabytes or exabytes.
3. | Traditional database systems deal with structured data. | Big data systems deal with structured, semi-structured and unstructured data.
4. | Traditional data is generated per hour or per day or more slowly. | Big data is generated more frequently, mainly per second.
5. | The traditional data source is centralized and managed in centralized form. | The big data source is distributed and managed in distributed form.
6. | Data integration is very easy. | Data integration is very difficult.
7. | A normal system configuration is capable of processing traditional data. | A high system configuration is required to process big data.
8. | The size of the data is very small. | The size is more than the traditional data size.
9. | Traditional database tools are required to perform any database operation. | Special kinds of database tools are required to perform any database operation.
10. | Normal functions can manipulate the data. | Special kinds of functions are needed to manipulate the data.
11. | Its data model is strict-schema based and static. | Its data model is flat-schema based and dynamic.
12. | Traditional data is stable, with known inter-relationships. | Big data is not stable, with unknown relationships.
13. | Traditional data is in a manageable volume. | Big data is in a huge volume which becomes unmanageable.
14. | It is easy to manage and manipulate the data. | It is difficult to manage and manipulate the data.
15. | Its data sources include ERP transaction data, CRM transaction data, financial data, organizational data, web transaction data etc. | Its data sources include social media, device data, sensor data, video, images, audio etc.

1.6 BIG DATA APPLICATIONS

There are various big data applications, as shown in Fig. 1.6.1 : (1) Fraud detection, (2) IT log analytics, (3) Call center analytics, (4) Social media analysis.

(Fig. 1.6.1 : Big data applications)

1. Fraud detection
- Fraud detection is a Big Data application example for businesses whose operations involve any type of claims or transaction processing.
- A number of times the detection of fraud is concluded long after the fact, at which point the damage has already been done and all that is left is to decrease the harm and revise policies to prevent it in future.
- Big Data platforms can analyze the claims and transactions of businesses. They identify large-scale patterns across many transactions or detect anomalous behaviour of some user.

2. IT log analytics
- An enormous quantity of logs and trace data is generated in IT solutions and IT departments. Many times such data goes unexamined: organizations simply don't have the manpower.
- Big data has the ability to quickly identify large-scale patterns to help in diagnosing and preventing problems, which helps the organization.

3. Call center analytics
- Now we turn to the customer-facing Big Data application examples, of which call center analytics are particularly powerful. Without a Big Data solution, much of the insight that a call center can provide will be ignored or exposed too late.
- By making sense of time/quality resolution metrics, Big Data solutions are able to identify recurring problems and customer and staff behaviour patterns. Big data can also capture and process the call content itself.

4. Social media analysis
- With the help of social media we can observe real-time insights into how the market is responding to products and campaigns.
- With the help of these insights, it is possible for companies to adjust their pricing, promotion, and campaign placement to get optimal results.

1.7 BIG DATA CHALLENGES

1. Sharing and Accessing Data
- Perhaps the most frequent challenge in big data efforts is the inaccessibility of data sets from external sources.
- Sharing data can cause substantial challenges, including the need for inter- and intra-institutional legal documents.
- Accessing data from public repositories leads to multiple difficulties.
- It is necessary for the data to be available in an accurate, complete and timely manner, because if data in a company's information system is to be used to make accurate decisions in time, then it becomes necessary for the data to be available in this manner.

2. Privacy and Security
- This is another most important challenge with Big Data.
- This challenge includes sensitive, conceptual and technical aspects as well as legal significance.
- Most organizations are unable to maintain regular checks due to the large amounts of data generated. However, it is necessary to perform security checks and observation in real time, because that is most beneficial.
- There is some information about a person which, when combined with external large data, may lead to facts about that person which may be secretive, and which he might not want the data owner to know about him.
- Some organizations collect information about people in order to add value to their business. This is done by gaining insights into their lives that they are unaware of.

3. Analytical Challenges
- There are some huge analytical challenges in big data which raise some main questions: how to deal with a problem if the data volume gets too large? How to find out the important data points? How to use data to the best advantage?
- The large amounts of data on which this type of analysis is to be done can be structured (organized data), semi-structured (semi-organized data) or unstructured (unorganized data).
- There are two techniques through which decision making can be done: either incorporate massive data volumes in the analysis, or determine upfront which big data is relevant.

4. Technical Challenges

Quality of data
1. Collecting a large amount of data and storing it comes at a cost. Big companies, business leaders and IT leaders always want large data storage.
2. For better results and conclusions, big data focuses on quality data storage rather than having irrelevant data.
3. This further raises the questions of how it can be ensured that data is relevant, how much data would be enough for decision making, and whether the stored data is accurate or not.

Fault tolerance
1. Fault tolerance is another technical challenge; fault-tolerant computing is extremely hard, involving intricate algorithms.
2. Newer technologies like cloud computing and big data always intend that whenever a failure occurs, the damage done should be within an acceptable threshold, i.e. the whole task should not have to begin again from scratch.

Scalability
1. Big data projects can grow and evolve rapidly. The scalability issue of Big Data has led towards cloud computing.
2. It leads to various challenges, such as how to run and execute various jobs so that the goal of each workload can be achieved cost-effectively.
3. It also requires dealing with system failures in an efficient manner. This again leads to a big question: what kinds of storage devices are to be used?

1.8 EXAMPLES OF BIG DATA IN REAL LIFE

(1) In the Education Industry
The University of Alabama has more than 38,000 students and an ocean of data. In the past, when there were no real solutions to analyze that much data, some of it seemed useless. Now, administrators can use analytics and data visualizations on this data to draw out patterns of students, revolutionizing the university's operations, recruitment, and retention efforts.

(2) In Healthcare
Wearable devices and sensors have been introduced in the healthcare industry which can provide a real-time feed to the electronic health record of a patient. One such technology is Apple's: Apple has come up with Apple HealthKit, CareKit, and ResearchKit. The main goal is to empower iPhone users to store and access their real-time health records on their phones.

(3) In the Government Sector
The Food and Drug Administration (FDA), which runs under the jurisdiction of the Federal Government of the USA, leverages the analysis of big data to discover patterns and associations in order to identify and examine the expected or unexpected occurrences of food-based infections.

(4) In the Media and Entertainment Industry
Spotify, an on-demand music-providing platform, uses Big Data Analytics, collects data from all its users around the globe, and then uses the analyzed data to give informed music recommendations and suggestions to every individual user. Amazon Prime, which offers videos, music, and Kindle books in a one-stop shop, is also big on using big data.

(5) In the Space Sector
NASA is collecting data from different satellites and rovers about the geography, atmospheric conditions, and other factors of Mars for their upcoming mission. It uses big data to manage all that data and to run simulations.

(6) In Weather Patterns
Deep Thunder, a research project by IBM, provides weather forecasting through high-performance computing of big data. IBM is also assisting Tokyo with improved weather forecasting for natural disasters and for predicting the probability of damaged power lines.

(7) In the Transportation Industry
Uber generates and uses a huge amount of data regarding drivers, their vehicles, locations, every trip from every vehicle, etc. All this data is analyzed and then used to predict supply, demand, the location of drivers, and the fares that will be set for every trip.
(8) In the Banking Sector
Various anti-money laundering software packages, such as SAS AML, use data analytics in banking to detect suspicious transactions and analyze customer data. Bank of America has been a SAS customer for more than 25 years.

(9) In Marketing
Amazon has collected data about the purchases made by millions of people around the world. It analyzes the purchase patterns and payment methods used by the customers and uses the results to make new offers and advertisements.

(10) In Business Insights
Netflix is using Big Data to understand user behavior, the type of content users like, popular movies on the website, content that it can suggest to each user, and which series or movies it should invest in.

Introduction to Big Data Frameworks

University Prescribed Syllabus : What is Hadoop? Core Hadoop components; Hadoop ecosystem; Working with Apache Spark; What is NoSQL? NoSQL data architecture patterns : Key-value stores, Graph stores, Column family (Bigtable) stores, Document stores, MongoDB.
Self-learning Topics : HDFS vs GFS, MongoDB vs other NoSQL systems, Implementation of Apache Spark.

2.1 CONCEPT OF HADOOP

2.1.1 What is Hadoop?
Hadoop is an open-source software platform for storing massive volumes of data and running applications on clusters (groups) of commodity hardware. It gives us massive data storage capability, massive computational power, and the ability to handle virtually limitless jobs, whether running or waiting. Its essential purpose is to support growing big data technologies and forward-thinking analytics like predictive analytics, machine learning and data mining. Hadoop has the capability to handle different modes of data, such as structured, unstructured and semi-structured data. It gives us the flexibility to collect, process, and investigate data that the old data warehouse concept failed to handle.

2.1.2 History of Hadoop
- Hadoop was introduced by Doug Cutting and Mike Cafarella in 2002. Its beginning was the Google File System paper, published by Google.
- In 2002, Doug Cutting and Mike Cafarella started to work on the Apache Nutch project, an open-source (i.e. free) web crawler software project.
- While working on Apache Nutch, they faced issues with big data: storing that data would have required investing a lot of money, which became the challenge of completing the project. Out of this problem, the idea of Hadoop came into existence.
- In 2003, Google presented a file system known as GFS (Google File System), a proprietary distributed file system developed to provide efficient access to data.
- In 2004, Google released a white paper on MapReduce. This technique simplifies data processing on large clusters (groups).
- In 2005, Doug Cutting and Mike Cafarella presented a new file system known as NDFS (Nutch Distributed File System); this file system also included MapReduce.
- In 2006, Doug Cutting left Google and joined Yahoo. Based on the Nutch project, Doug Cutting announced a new project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System).
- Hadoop's first version, 0.1.0, was released in 2006. Doug Cutting named his project Hadoop after his son's toy elephant.
- In 2007, Yahoo successfully ran two clusters of 1000 machines.
- In 2008, Hadoop became the quickest system to sort 1 terabyte of data, on a 900-node cluster, in 209 seconds.
- In 2013, Hadoop 2.2 was released, and in 2017, Hadoop 3.0 was released.

2.1.3 Features of Hadoop
1. Suitable for Big Data Analysis : As Big Data tends to be distributed and unstructured in nature, Hadoop clusters are well matched for the analysis of Big Data. Since it is processing logic (not the actual data) that flows to the computing nodes, less network bandwidth is consumed. This concept is called data locality, and it helps to increase the throughput of Hadoop-based applications.
2. Scalability : Hadoop clusters can easily be scaled to any extent by adding extra cluster nodes, and thus allow for the growth of Big Data. Also, scaling does not require adjustments to the application logic.
3. Fault Tolerance : Hadoop has a facility to replicate the input data onto other cluster nodes. So, in the event of a cluster node failure, data processing can still proceed by using the copy stored on another cluster node.

2.1.4 Advantages of Hadoop
1. Fast : In HDFS the data is distributed over the cluster and mapped in a way that helps faster retrieval. Even the tools that process the data are often on the same servers, thus reducing the processing time. Hadoop can process terabytes of data in minutes and petabytes in hours.
2. Scalable : A Hadoop cluster can be extended by just adding nodes to the cluster, so the chance of hitting a capacity limit is low.
3. Cost Effective : Hadoop is open source and uses commodity hardware to store data, so it is cheaper compared to a traditional RDBMS.
4. Resilient to failure : HDFS has the property of replicating data over the network, so if one node goes down or some other network failure happens, Hadoop takes the backup copy and uses it. Normally, data is replicated three times, but the replication factor is configurable.

2.1.5 Challenges of Hadoop
1. Hadoop is a complex distributed system with low-level application programming interfaces.
2. Specialized skills are required for using Hadoop, which prevents most developers from efficiently building solutions.
3. Business logic and infrastructure APIs have no clear separation, so the burden falls on application developers.
4. Automated testing of end-to-end solutions is unfeasible or terrible.
5. Common data patterns often require support but lack steadiness and accuracy.
6. Hadoop is more than just disconnected storage.
7. Hadoop is a varied collection of many open-source projects.
8. Understanding multiple technologies and hand-coding the integration between them is difficult.
9. Significant effort is wasted on simple tasks like data ingestion and ETL (Extract, Transform, Load).

2.1.6 Limitations of Hadoop

1. Issue with Small Files
Hadoop is not suitable for small data. The Hadoop Distributed File System (HDFS) lacks the capability to efficiently support random reading of small files, because of its high-volume design. A small file is significantly smaller than the HDFS block size (default 128 MB), and small files are the main problem in HDFS. If we store vast numbers of small files, HDFS cannot handle them well; HDFS works best with a small number of large files for storing large data sets, rather than a large number of small files, as the rough calculation below illustrates.
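A back-of-the-envelope calculation shows why small files hurt: the NameNode keeps an in-memory object for every file and every block, each commonly estimated at roughly 150 bytes. The sketch below uses that rule-of-thumb figure (an assumption, not an exact constant) to compare 1 GB stored as one large file against 10,000 small files.

```python
BLOCK_SIZE = 128 * 1024 * 1024   # default HDFS block size (128 MB)
OBJ_BYTES = 150                  # rough NameNode memory per file/block object (rule of thumb)

def namenode_bytes(num_files, file_size):
    # one metadata object per file plus one per block of each file
    blocks_per_file = max(1, -(-file_size // BLOCK_SIZE))  # ceiling division
    return num_files * OBJ_BYTES * (1 + blocks_per_file)

one_gb = 1024 * 1024 * 1024
print(namenode_bytes(1, one_gb))            # 1 file of 1 GB   -> ~1.3 KB of metadata
print(namenode_bytes(10_000, 100 * 1024))   # 10,000 x 100 KB  -> ~3 MB of metadata
```

The same gigabyte of data costs the NameNode thousands of times more memory when it arrives as many small files, which is why HDFS is designed around large files.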
2. Support for Batch Processing Only
Hadoop supports batch processing only; it is not suited for streaming data, and hence overall performance is slower. The MapReduce framework of Hadoop does not leverage the memory of the Hadoop cluster to the maximum extent.

3. No Real-time Data Processing
Apache Hadoop is designed for batch processing, which allows it to take a vast amount of data as input, process it, and produce the result. Even though batch processing is very efficient for processing high volumes of data, depending on the size of the data being processed and the computational power of the system, the output can be significantly delayed. Hadoop is therefore not appropriate for real-time data processing.

4. No Delta Iteration
Hadoop is not well organized for iterative processing: it does not support cyclic data flow, i.e. a chain of stages in which the output of an earlier stage is the input to the succeeding stage.

5. Latency
The MapReduce framework of Hadoop is comparatively slower, since it supports various formats, structures and huge volumes of data. In MapReduce, Map takes a set of data and converts it into another set of data, where individual elements are broken down into key-value pairs, and Reduce takes the output from the Map as input and processes it further. MapReduce requires plenty of time to accomplish these tasks, thereby increasing latency.

6. Security
Hadoop is challenging when managing complex applications. If the user does not know how to enable the security features of the platform, data can be at risk. At the storage and network levels, Hadoop is missing encryption, which is a major point of concern. Hadoop supports Kerberos authentication, which is hard to manage. HDFS supports access control lists (ACLs) and a traditional file permissions model. However, third-party vendors have enabled organizations to leverage Active Directory Kerberos and LDAP for authentication.

7. No Abstraction
Hadoop does not have any kind of abstraction; thus, MapReduce developers need to hand-code each operation, which makes it difficult to work with.

8. No Caching
Hadoop is not efficient for caching. In Hadoop, MapReduce cannot cache intermediate data in memory for further use, which reduces the performance of Hadoop.

9. Lengthy Lines of Code
Hadoop has approximately 120,000 lines of code. The number of lines produces a corresponding number of bugs, and it takes more time to execute the programs.

UQ. Explain how Hadoop goals are covered in the Hadoop Distributed File System.

HDFS and MapReduce are the two major components of Hadoop: HDFS is useful from the infrastructure point of view, whereas MapReduce is useful as the programming model. HDFS is key to understanding the scalability of Hadoop from a single node to thousands of nodes. It covers the goals of Hadoop as follows:

1. Handling of large datasets : As Hadoop supports distributed storage and processing of large data sets, the HDFS architecture is designed to be most useful for storing and retrieving large data.
2. Fault tolerance and data replication : In HDFS, data files are divided into big blocks, and for fault tolerance each block is stored on three nodes, of which two are on the same rack and one is on a different rack. A block is considered the unit of data stored on each data node. The redundancy of data leads to robustness, fault detection, quick recovery of data, and scalability (see the worked sketch after this list).
3. Commodity hardware : HDFS assumes that the cluster will consist of common hardware, i.e. less expensive machines. An important feature of Hadoop is that HDFS can be installed on any average commodity hardware; installation and execution of Hadoop do not require supercomputers or high-end hardware. This reduces the overall cost.
4. Data locality : Data locality means moving the computation logic near to the data, instead of moving data to the application space. This reduces the bandwidth utilization in the system. HDFS provides interfaces for applications to relocate themselves closer to the location where the data resides.
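To make goal 2 concrete, here is a small sketch computing how HDFS would split and replicate a file; the 128 MB block size and replication factor of 3 are the defaults mentioned above, and the 500 MB file is just an example input.

```python
BLOCK = 128          # default HDFS block size in MB
REPLICATION = 3      # each block is stored on three data nodes

def hdfs_footprint(file_mb):
    blocks = -(-file_mb // BLOCK)   # ceiling division: number of blocks
    # blocks, total block replicas, and raw cluster storage consumed
    return blocks, blocks * REPLICATION, file_mb * REPLICATION

blocks, replicas, raw_mb = hdfs_footprint(500)
# A 500 MB file -> 4 blocks, 12 block replicas, ~1500 MB of raw cluster storage
print(blocks, replicas, raw_mb)
```

The 3x raw storage cost is the price paid for the robustness and quick recovery described in goal 2.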
2.2 BIG DATA IN DIGITAL INDIA

- Digital India is a vision to transform India into a digitally empowered society and build a knowledge economy. The vision mostly focuses on three main areas: (i) digital infrastructure as a basic utility for every citizen, (ii) governance and services on demand, and (iii) digital empowerment of the citizens of India.
- Subsequently, to fulfil this vision, the availability of different data resources has increased.
- In big data, huge amounts of data are collected and stored, whether in structured, unstructured or semi-structured form. This data may contain various business-related transactions, email, images, audio, surveillance camera videos, logs, unstructured data from blogs or messages on social media, medical data, bank transaction data, e-Governance data, media data, defence-related data and IT-sector data.
- If this data is efficiently cleaned and then analyzed, it can help in data visualization for business trade for various enterprises and organizations.
- This digital technology has made the progress of enterprises and organizations much easier. Data collected from tweets, blogs and other social network sites can be used by an enterprise or organization to analyze consumers' views. It helps them to understand the needs and choices of their customers.

2.3 WORKING WITH APACHE SPARK

Spark is intended to enhance, not replace, the Hadoop stack. From day one, Spark was designed to read and write data from and to HDFS, as well as other storage systems such as HBase and Amazon's S3. Hence, Hadoop users can enrich their processing capabilities by combining Spark with Hadoop MapReduce, HBase, and other big data frameworks. Spark supports the big data attributes:

1. Volume : Means the amount of data. Spark was designed to handle huge volumes of data distributed over the machines of a cluster.
2. Velocity : Refers to the rate at which data grows and arrives. Spark's fast, in-memory processing makes it possible to keep up with rapidly arriving data.
3. Variety : Refers to the different types of data coming from various data platforms, such as online systems, sensors, social media and web capture, whether structured or unstructured.
4. Virality : Means complexity, which indicates that the data must be able to transfer via different multiple data channels.

Second, the Spark community has constantly focused on making Spark as easy to use as possible. No matter whether you run Hadoop 1.x or Hadoop 2.0 (YARN), and no matter whether you have administrative privileges to configure the Hadoop cluster or not, there is a way for you to run Spark. In particular, there are three ways to deploy Spark in a Hadoop cluster: standalone, YARN, and SIMR.

- Standalone deployment : With the standalone deployment, one can statically allocate resources on all or a subset of machines in a Hadoop cluster and run Spark side by side with Hadoop MapReduce. The user can then run arbitrary Spark jobs on their HDFS data. Its simplicity makes this the deployment of choice for many Hadoop 1.x users.
- Hadoop YARN : Hadoop users who have already deployed or are planning to deploy Hadoop YARN can simply run Spark on YARN without any pre-installation or administrative access required. This allows users to easily integrate Spark into their Hadoop stack and take advantage of the full power of Spark, as well as of other components running on top of Spark.
- Spark In MapReduce (SIMR) : For Hadoop users who are not running YARN yet, another option, in addition to the standalone deployment, is to use SIMR to launch Spark jobs inside MapReduce. With SIMR, users can start experimenting with Spark and use its shell within a couple of minutes of downloading it. This tremendously lowers the barrier of deployment, and lets virtually everyone play with Spark.
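To illustrate what running Spark over HDFS data looks like in any of these deployment modes, here is a minimal word-count sketch in PySpark. The hdfs:// paths are placeholders; the RDD operations (textFile, flatMap, map, reduceByKey) are standard Spark API, and the master is supplied by however the job is launched (standalone, YARN, or SIMR).

```python
from pyspark import SparkContext

sc = SparkContext(appName="WordCount")   # master comes from the deployment/launcher

counts = (sc.textFile("hdfs:///data/docs/*.txt")       # read input files from HDFS
            .flatMap(lambda line: line.split())        # map: one record per word
            .map(lambda word: (word, 1))               # emit <word, 1> pairs
            .reduceByKey(lambda a, b: a + b))          # reduce: sum the counts per word

counts.saveAsTextFile("hdfs:///data/word_counts")      # write results back to HDFS
```

The same program runs unchanged whichever of the three deployment modes is used; only the way the job is submitted differs.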
2.4 WHAT IS NoSQL?

2.4.1 Introduction to NoSQL
- A database is a systematic collection of data, and a database management system supports the storage and manipulation of that data, which makes data management easy. For example, an online telephone directory uses a database to store data on phone numbers and other contact details, which a service provider can use to manage billing, client-related issues, fault handling, etc. That means a database management system provides the mechanism to store and retrieve data. Broadly, database systems can be classified as RDBMS (Relational Database Management Systems), OLAP (Online Analytical Processing) systems, and NoSQL (Not only SQL) systems.
- NoSQL refers to all databases and data stores that are not based on the relational database management system (RDBMS) principles. NoSQL databases are the new set of databases that have emerged in the recent past as an alternative solution to relational databases.
- Carlo Strozzi introduced the term NoSQL to name his file-based database in 1998.
- NoSQL does not represent a single product or technology; it represents a group of products and various related data concepts for storage and management. NoSQL is an approach to database management that can accommodate a wide variety of data models, including key-value, document, column and graph formats.
- A NoSQL database is generally non-relational, distributed, flexible and scalable. So we can sum it up as: NoSQL is an approach to database design that provides flexible schemas for the storage and retrieval of data beyond the traditional table structures found in relational databases. It relates to large data sets accessed and manipulated on a Web scale.

2.4.2 Brief History of NoSQL Databases
- 1998 : Carlo Strozzi uses the term NoSQL for his lightweight, open-source relational database.
- 2000 : Graph database Neo4j is launched.
- 2004 : Google BigTable is launched.
- 2005 : CouchDB is launched.
- 2007 : The research paper on Amazon Dynamo is released.
- 2008 : Facebook open-sources the Cassandra project.
- 2009 : The term NoSQL is reintroduced.

2.4.3 Why NoSQL?
The concept of NoSQL databases became popular with internet giants like Google, Facebook and Amazon, who deal with huge volumes of data. The system response time becomes slow when an RDBMS is used for massive volumes of data. To resolve this problem, we could scale up our systems by upgrading the existing hardware, but that process is expensive. The alternative is to distribute the database load over multiple hosts whenever the load increases; this method is known as scaling out. NoSQL databases are non-relational, so they scale out better than relational databases, as they are designed with web applications in mind. NoSQL databases are exactly the type of database that can handle all sorts of semi-structured data, unstructured data, rapidly changing data and big data. So, NoSQL databases have emerged to resolve the problems related to volume and semi-structured data.

2.4.4 CAP Theorem
The CAP theorem plays an important role in NoSQL databases. The CAP theorem is also called Brewer's theorem; it states that it is impossible for a distributed data store to offer more than two out of three guarantees: Consistency, Availability, and Partition tolerance. So basically, some NoSQL databases offer consistency and partition tolerance, while some offer availability and partition tolerance. Partition tolerance is common, as NoSQL databases are distributed in nature; based on the requirement, we can choose which NoSQL database is to be used. Different types of NoSQL databases are available based on data models.

(Fig. 2.4.1 : CAP property)

Consistency
- This means that the data in the database remains consistent after the execution of an operation.
- For example, after an update operation, all clients see the same data.

Availability
- This means that the system is always on (service guarantee availability), with no downtime.
"= Partition Tolerance + This means that the system continues to function even the communication among the servers is unreliable, ie., the servers may be partitioned into multiple groups that cannot communicate ‘with one another. as 22-29) (8-131) iw 2229 90:90 ree ‘Scanned with CamScanner a (mu. 22-29) 121) yios (mu-Sem.stT) ito. t08 ‘Data Fram)...Pg.no, ta Analytics impossible to fulfil all 3 requirements, Cy ides the basic requirements for 2 distributed system to folgy provides the basic Y ahe 3 requiements. Therefore all the cure Noga 2o “abe follow the cifferent combinations ofthe C. A. Prom gg CAP theorem. + In theoretically itis ere i the brief description of three combinations CA, Cp AP: ‘> CA- Single site cluster therefore all nodes re always in contagl) ‘When a partition occurs, the system blocks. ; CP ~ Some data may not be accessible, but the rest is sti) consstenvaccurate. [AP - System is still available under partitioning, data returned may be inaccurate ‘The use of the word consistency in CAP and its use in ACID dg 1 al not refer to the same identical concept In CAP, the term consistency refers to the consistency of values in different copies of the same data item in a replicated distributed system. In ACID, it refers tothe fact that a transaction will not violate the integrity constraints specified on the database schema % 2.4.5 Characteristics / Features of NoSQL Ug. Descrbe characteristics of a NoSQL database. 4, Non-relational + NoSQL databases never follow the relational model + Never provide tables with at fixed-column records, + Work with self-contained aggregates or BLOBs + Doesnt require object-elational mapping and data ‘normalization Big Data Analytics (MU-Sem 8-1) (nro to Big Data Fram.)._Pg.n0._ (2:24) + No complex features like query languages, query planners, referential integrity joins, ACID 2. Open-source NoSQL databases don’t require expensive licensing fees and can run on inexpensive hardware, rendering their deployment cost effective. 3. Schema-free © NoSQL databases are either schema-free or have relaxed schemas ‘© Do not require any sort of definition of the schema of the data ‘© Offers heterogeneous structures of data in the same domain ‘Simple API ‘+ Offers easy to use interfaces for storage and querying data provided ‘* APIs allow low-level data manipulation & selection methods Ul recr nao Puoictans (u- 22:29) (we131) BbrecrieoPsicion ‘Scanned with CamScanner . ig bat Anaytos ‘Text-based protocol JSON Mostly used 90 standar Web-enabled databases ata Fram)..Pg. no... (24 em), (inro 108 used with HTTP REST wig, most 1d based query language running as interaet-facing services, ea 2 tt ones betel tg fashion Often ACID concept can be sacrificed for scalability ang Mostly no synchronous replication between distributed nog “Asynchronous MultiMaster Replication, peer-to-peer, HDFS Replication nly providing eventual consistency and higher distribution, 2.4.6 Advantages and Disadvantages of NoSQL a Advantages of NoSQL 1. Scale(horizontal) 2. SQL databases are vertically scalable. This means that you increase the load on a single server by increasing things RAM, CPU or SSD. But on the other hand, NoSQL databases horizontally scalable, This means that you handle more traffic sharding, or adding more servers in your NoSQL database ‘Simple data model (fewer joins) Streaming) volume Reliability Schema-less (no modelling or prototyping) Dp (uu. 2225) 8-199) — Shared Nothing Architecture. 
This enables less coordinatiog, Bhrecnres 11 3n9 43 ‘ay Aqqeotda st sxaysereyo 30 suadaqut 'sBumns Jo sauas y “sé dust duu 249 se seg anjea-Cay yo waned ayy ut pa29109 steep 2X1 Tapa Sst apo asqeiep TOSON 21889 SOW aH J0 200) -95n ued ey Suogedyidde ony Ayquapr i "ussnedsia ’ | ened fembeinare TOson we enisckan jeep uiiuertg Dn a seiois onjen-Aoy STR saosin (a) saioig wewnsog (9) seuoig (ajqmilig) cjurey unjoy —(g) saris anjea-Koy (¥) susomed ammoonrgoxe wp sno} 2m Jo Kv" swotloy TOSON uw pasors cep auL ‘ep SurBuey {ypides 30 pammansun soipre azsteue pur “aAainas ‘exo 0) poou ym suonvetue’i0 1 fradde 0} sanunuoo yom “ywiod Sumas ssorea:8 s,JOSON uaaq SigenBre sey veg) “erp FurrweBio oy yovosdde ,01-P%, sn “AgP9y sm $1 puy wamdoqarap 01 wyBrens 08 01 2Iqe a8 ie sm ‘uay>s ywouydn ue ambos 3,upip fox asneoaq sosoaeieP oe nN eee Ssedojaaq “sip some sdysuoneas ys? | 102 "91ge1 asn Yous ‘saseqerep Og 40 jeuonele! Teuontpen J0 APH 2K) 0 Ro woq arsm saseqenp TOSON " ‘Sejdluexa yuenajar yim suaued 2115 hued wune> pue aos eiep ders vecieg SNWALLVd UNLDALIHOUY Viva TOSON 57 KA —— ee Se or Ba Tas Ra “ ' ermine Ti ‘Pwes-ny) souneuy e100 PD ATOSEN Y suaned YemDenDLe waieyp ay aie rey “DN Pr ms por oat ep Su 55 hemersonron van ose ‘seseqeiep SON 10} saAUP ssoursnq awos0q sey Aime Sm Squarory wep yo aumyor Kure appueY pur Arisea so8ueys uonwaridde ayepouruosoe wes Seu, “Kyse9 1p pafess 2q we pue snewayss 978 seseqEIep OS9tLL -aseqeiep JOSON 9 auoar2K0 1899 ar SONY 25241 TTY -saqnpayos Susan pur 1uaudoyan2p Ut SUROPNOTS nyo owas “pos pooua.9dx9 as TOA ‘sradoasap axeatjos 9 v9 swsonbas afte mnbos Suddeus fouone(as 129LG0 ‘AINEUED suoneaqydde Sunstx 0109 fauadxa sau Krpour 10 mau Suidofaxop wags a8ueg> pide ou st sson0ud STL 1 aq yp poweroosse st pure apd sage} aoucissiod yo anour 0} 1HSKHaTEIS say Jo voneUgIOD isos UL * gary amp wosy pur or exp 1°24 IS apes pue arzyep ‘arepdn “Hes 109 ayy ayesouas or sf 9Ke] StH JO Aaqig3st0% 22601 10 we apnpout 1 pau nok ‘soars a ep 06 JL deus yruoner ota 30 sdnosfqns povradas pu Pas asogenep 24030 0 ood amp st SWETE red xoqditoo ou! UL sep jo stunouse 2871 ryan Musn suonvondde Supima 3° sn yonaeuuoo pur : 3 assotesPu® Pounuosse o1 vonvaydde we a> 7 2) oaey oO ot sarabae or Trewesw" TB weld ee —_ ‘Scanned with CamScanner rF | _ gs Sata eal loted lel ig Data Fram)..P9.n0.. som 9. gn. t0 ig. 28 ata Analytics (MU-Sem 6. gosta Anaytos US Bi LEAT) (nto to Big Data Fram). Pp no. (2-34) related 10 the key, ay, " a GQ. State example of any two ky vai cted oF cO- sue is connect’ ae ically store informat databases for key-value Palt ically iad ‘hashtable where eac ey #8 UMAR i re of any form (JavaSeript Object Notas | . storage (YP! value may i os inary Large object (BLOB), Sings et) tication : This style of architecture 3s commonly used splat Ar er esoms los ni np seis its ability for wide management of data volumes, hey i ke weet ore da loads = Fig. 25.1: An example of Key-Value Keys and values are flexible, Keys can be image names, page URLs, or file path names that point to values like binary i HTML web pages, and PDF documents. Constraints associated with the key-value store databases is complexity in handling queries which will attempt to include ‘key-value pairs that may delay output and may cause data to cl ‘with many-to-many relationships. 
(22:29 (8-193), (wu 22:29) (48-193) databases (2 Marks) Examples here are DynamoDB (developed by Amazon) Berkeley DB (developed by Oracle) REDIS : An advanced open-source key-value store, also referred | to as a data structure server because Keys can include strings, | hashes, lists, sets and sorted sets. This product, written in C/C++, is searingly quick, which makes it perfect for data collection in | real time, Riak : An open source that is powerful, distributed database that predictably scales capability and simplifies creation by prototyping, developing, and deploying applications quickly. ‘Written in Erlang and C this technology gives transparent fault- toleranvfail-over functionality, a comprehensive and versatile API perfect for point-of-sale and factory control systems VoltDB : scalable database in memory that offers complete transactional ACID consistency and ultra-high throughput, self- referred to as the NewSQL. ‘This technology relies on segmentation and replication to achieve hhigh-availabilty data snapshots and durable command logging using Java stored processes (or crash recovery), making i ideal for capital markets digital networks, network services, and for online gaming. Teche Publications Es ‘Scanned with CamScanner F ug Data Fram)._Pg. no. som 847) (nto. t0 goat Anais MUS fig Oat Arayics U-Ser 67) (rot ig Ona Fram). no. 2-36 atabase 0: va 25.2 column store D3 + Basically, columns ate in this son of storage mode. Data is readily available and it is possible to perform queries such as Number, AVERAGE, COUNT on columns easly +The setbacks for this system includes: transactions should be avoided or not supported, queries can decrease high performance with table joins, record updates and deletes reduce storage efficiency, and it can be difficult to design efficient partitioning/indexing schemes. Q._ State example of any two column store databases (2 Marks) Examples here are + HBase : HBase is a distributed, portable, Big Data Store modelled after Google's BigTable technology, the Hadoop database ef «Google's BigTable a + Cassandra : An open-source distributed database management ae system built to manage very large volumes of data scattered over Seecaa | several servers without a single point of failure while delivering a Sear : highly accessible service. Fig. 25.2: An Example of Column Store + Written in Java, this product is best for non-transactional real- time data analysis with linear scalability and proven fault- tolerance combined with column indexes. + This pattem employs data storage in individual cells that is further divided into columns, rather than storing data in relational tuples + Databases that are column-oriented operate only on columns,» In the form of key-value pairs, the record database fetches and They together store vast quantities of data in columns. The column format and tiles will diverge from one row to another. accumulates information, but here the values are called documents. A complicated data structure can be represented as a + Esch column is handled differently, but stil, like conventional = databases, each individual column will contain several other columns (Niharika, 2020) + Itis hierarchical version of key-value databases, che Pts ! : | %& 2.5.3 Document Database : (tm 22-29) 48193) Tecr-Noo Publications (MU: 22:29) (M191) ‘Scanned with CamScanner rE. t~—” Big Data Analytics (MU-Sem. rr) (inv. to Big Data Fram). 9.00. 
+ The document can be in the form of text, arrays, strings, JSON (JavaScript Object Notation), XML (Extensible Markup Language) or any other format.
+ The use of nested documents is immensely popular and highly efficient, since most of the generated information is unstructured and is generally exchanged in the form of JSONs.

Fig. 2.5.3 : An example of a Document store

+ This format is extremely useful for semi-structured data, which it makes simple to store and retrieve. The drawbacks associated with this system include the difficulty of searching across documents and of handling strongly inter-related data.

(2) A Reduce Task processes an output of a map task. Similar to the map stage, all reduce tasks occur at the same time, and they work independently. The data is aggregated and combined to deliver the desired output. The final result is a reduced set of <key, value> pairs which MapReduce, by default, stores in HDFS.

3.1.3 How Hadoop Map and Reduce Work Together

+ As the name suggests, MapReduce works by processing input data in two stages — Map and Reduce. To demonstrate this, we will use a simple example : counting the number of occurrences of words in each document.
+ The final output we are looking for is : how many times the words Apache, Hadoop, Class, and Track appear in total in all documents.
+ For illustration purposes, the example environment consists of three nodes. The input contains six documents distributed across the cluster. We will keep it simple here, but in real circumstances there is no limit — you can have thousands of servers and billions of documents.

1. First, in the map stage, the input data (the six documents) is split and distributed across the cluster (the three servers). In this case, each map task works on a split containing two documents. During mapping, there is no communication between the nodes; they perform independently.

2. Then, the map tasks create a <key, value> pair for every word. These pairs show how many times a word occurs. A word is the key, and its count is the value. For example, one document contains three of the four words we are looking for : Apache 7 times, Class 8 times, and Track 6 times. The key-value pairs in that map task's output look like this :
<apache, 7>, <class, 8>, <track, 6>
This process runs in parallel tasks on all nodes for all documents, and each task produces its own output.

3. After input splitting and mapping completes, the outputs of every map task are shuffled. This is the first step of the Reduce stage. Since we are looking for the frequency of occurrence of four words, there are four parallel Reduce tasks. The reduce tasks can run on the same nodes as the map tasks or on any other node. The shuffle step ensures the keys Apache, Hadoop, Class, and Track are sorted for the reduce step; it groups all the values by key in the form of <key, value-list> pairs.

4. In the reduce step of the Reduce stage, each of the four reduce tasks processes a <key, value-list> pair to provide a final key-value pair. The reduce tasks also happen at the same time and work independently. In our example from the diagram, the reduce tasks produce one total per word :
<apache, ...>, <hadoop, ...>, <class, ...>, <track, ...>
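The whole four-step flow above can be imitated in a few lines of ordinary Python. The sketch below is a single-process simulation, not the Hadoop API, and the six one-line documents are invented stand-ins for real input files (so the totals are smaller than in the narration) :

# Single-process simulation of the Map -> Shuffle -> Reduce flow described
# above (not the Hadoop API). The six one-line "documents" are invented.
from collections import defaultdict

documents = [
    "apache hadoop track", "class apache apache",
    "hadoop class track",  "apache track track",
    "class hadoop apache", "track apache class",
]
WANTED = {"apache", "hadoop", "class", "track"}

# Map stage: each map task independently emits (word, count) pairs.
map_outputs = []
for doc in documents:                      # in Hadoop, each split runs in parallel
    counts = defaultdict(int)
    for word in doc.split():
        if word in WANTED:
            counts[word] += 1
    map_outputs.append(list(counts.items()))

# Shuffle: group all emitted values by key across every map output.
grouped = defaultdict(list)
for output in map_outputs:
    for word, count in output:
        grouped[word].append(count)

# Reduce stage: one reduce task per key sums that key's value list.
totals = {word: sum(values) for word, values in grouped.items()}
print(totals)  # {'apache': 6, 'hadoop': 3, 'track': 5, 'class': 4}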
3.2 COMBINER

+ A Combiner always works in between the Mapper and the Reducer. The output produced by the Mapper is the intermediate output, in terms of key-value pairs, and it is massive in size.
+ If we feed this huge output directly to the Reducer, it will increase network congestion. To minimize this network congestion, we place a Combiner in between the Mapper and the Reducer.
+ These combiners are also known as semi-reducers. It is not necessary to add a combiner to your Map-Reduce program; it is optional.
+ The Combiner is also a class in our Java program, like the Map and Reduce classes, and it is used in between those two classes.
+ A Combiner helps us produce abstract details, or a summary, of very large datasets. When we process very large datasets using Hadoop, a Combiner is very much necessary, and it results in an enhancement of overall performance.

3.2.1 How does a Combiner work?

+ In this example, two Mappers contain different data : the main text file is divided between two Mappers, each assigned a different line of our data. Since we have two lines of data, we have two Mappers, one to handle each line.
+ The Mappers produce the intermediate key-value pairs, where the key is the name of a particular word and the value is its count. For example, for the data "Geeks For Geeks For", the generated pairs are shown below.

Key-value pairs generated for the data "Geeks For Geeks For" :
(Geeks, 1), (For, 1), (Geeks, 1), (For, 1)

+ The key-value pairs generated by the Mapper are known as the intermediate key-value pairs, or the intermediate output, of the Mapper.
+ Now we can minimize the number of these key-value pairs by introducing a Combiner for each Mapper in our program. In our case, each Mapper has generated 4 key-value pairs. These intermediate pairs are not ready to be fed directly to the Reducer, because that would increase network congestion, so the Combiner combines them before sending them on.
+ The Combiner merges the intermediate key-value pairs by key. For the data above, the Combiner partially reduces them by merging the pairs that share a key, generating the new key-value pairs shown below.

Partially reduced key-value pairs with Combiner :
(Geeks, 2), (For, 2)

+ With the help of the Combiner, the Mapper output is partially reduced in size (fewer key-value pairs), and this smaller output can be made available to the Reducer for better performance. The Reducer then merges the outputs of all the combiners and produces the final output, which is stored in HDFS (Hadoop Distributed File System).

3.2.2 Advantages of Combiners

+ Reduces the time taken to transfer the data from the Mapper to the Reducer.
+ Reduces the size of the intermediate output generated by the Mapper.
+ Improves performance by minimizing network congestion.

3.2.3 Disadvantages of Combiners

+ The intermediate key-value pairs generated by the Mappers are stored on the local disk, and the combiners run later to partially reduce that output, which results in expensive disk input/output.
+ A Map-Reduce job cannot depend on the function of the combiner, because there is no guarantee of its execution.
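The local aggregation a combiner performs can be sketched in a few lines of plain Python (a simulation of the idea only, not Hadoop's Combiner class), using the same "Geeks For Geeks For" pairs as above :

# Sketch of the combiner idea: partially reduce one Mapper's output by
# merging values per key locally, before anything crosses the network.
from collections import Counter

def combiner(mapper_output):
    """Sum the values of duplicate keys within a single mapper's output."""
    combined = Counter()
    for key, value in mapper_output:
        combined[key] += value
    return list(combined.items())

mapper_output = [("Geeks", 1), ("For", 1), ("Geeks", 1), ("For", 1)]
print(combiner(mapper_output))  # [('Geeks', 2), ('For', 2)] -- 4 pairs become 2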
3.3 MATRIX VECTOR MULTIPLICATION BY MAPREDUCE

UQ. Write pseudo code for matrix-vector multiplication by MapReduce. Illustrate with an example showing all the steps.
UQ. What happens when the vector does not fit in memory in matrix-vector multiplication?
UQ. Write MapReduce pseudo code to multiply two matrices and illustrate the procedure on an example.

+ MapReduce is a technique in which a huge program is subdivided into small tasks that run in parallel to make computation faster and save time; it is mostly used in distributed systems. It has two important parts :
+ Mapper : It takes raw data as input and organizes it into <key, value> pairs. For example, in a dictionary you search for the word "Data", whose meaning is "facts and statistics collected together for reference or analysis". Here the key is "Data" and the value associated with it is "facts and statistics collected together for reference or analysis".
+ Reducer : It is responsible for processing the mapped data in parallel and producing the final output.

Algorithm 1 : The map function
1. for each element m_ij of M do
2.     produce (key, value) pairs ((i, k), (M, j, m_ij)) for k = 1, 2, 3, ... up to the number of columns of N
3. for each element n_jk of N do
4.     produce (key, value) pairs ((i, k), (N, j, n_jk)) for i = 1, 2, 3, ... up to the number of rows of M
5. return the set of (key, value) pairs, in which each key (i, k) has a list of values (M, j, m_ij) and (N, j, n_jk) for all possible values of j

Algorithm 2 : The reduce function
1. for each key (i, k) do
2.     sort the values beginning with M by j into list_M
3.     sort the values beginning with N by j into list_N
4.     multiply m_ij and n_jk for the j-th value of each list
5.     sum up the products m_ij × n_jk
6. return ((i, k), Σ_j m_ij × n_jk)

Let us consider a matrix multiplication example to visualize MapReduce. Consider the following matrices :

    A = | 1  2 |        B = | 5  6 |
        | 3  4 |            | 7  8 |

Here matrix A is a 2 × 2 matrix, which means the number of rows (i) = 2 and the number of columns (j) = 2. B is also a 2 × 2 matrix, where the number of rows (j) = 2 and the number of columns (k) = 2. Each cell of a matrix is labelled A_ij or B_jk; e.g. element 3 in matrix A is called A_21, i.e. 2nd row, 1st column. The multiplication uses one Mapper and one Reducer. The formulas are :

Mapper for matrix A : (key, value) = ((i, k), (A, j, A_ij)) for all k
Mapper for matrix B : (key, value) = ((i, k), (B, j, B_jk)) for all i

Computing the mapper for matrix A : here i, j and k all range over {1, 2}; for each element A_ij, k can take the values 1 and 2, and each case can have the further values j = 1 and j = 2. Substituting all values in the formula :

i = 1 : k = 1 : ((1, 1), (A, 1, 1)), ((1, 1), (A, 2, 2))
        k = 2 : ((1, 2), (A, 1, 1)), ((1, 2), (A, 2, 2))
i = 2 : k = 1 : ((2, 1), (A, 1, 3)), ((2, 1), (A, 2, 4))
        k = 2 : ((2, 2), (A, 1, 3)), ((2, 2), (A, 2, 4))

Computing the mapper for matrix B :

j = 1 : k = 1 : ((1, 1), (B, 1, 5)), ((2, 1), (B, 1, 5))
        k = 2 : ((1, 2), (B, 1, 6)), ((2, 2), (B, 1, 6))
j = 2 : k = 1 : ((1, 1), (B, 2, 7)), ((2, 1), (B, 2, 7))
        k = 2 : ((1, 2), (B, 2, 8)), ((2, 2), (B, 2, 8))

Reducer : for each key (i, k), make sorted lists A_list and B_list, compute Σ_j (A_ij × B_jk), and output ((i, k), sum).

Computing the reducer : we can observe from the Mapper computation that four keys occur — (1, 1), (1, 2), (2, 1) and (2, 2). Make separate lists for matrices A and B with the adjoining values taken from the Mapper step above :

(1, 1) => A_list = ((A, 1, 1), (A, 2, 2)); B_list = ((B, 1, 5), (B, 2, 7))
          Now A_1j × B_j1 : (1 × 5) + (2 × 7) = 19        ...(i)
(1, 2) => A_list = ((A, 1, 1), (A, 2, 2)); B_list = ((B, 1, 6), (B, 2, 8))
          Now A_1j × B_j2 : (1 × 6) + (2 × 8) = 22        ...(ii)
(2, 1) => A_list = ((A, 1, 3), (A, 2, 4)); B_list = ((B, 1, 5), (B, 2, 7))
          Now A_2j × B_j1 : (3 × 5) + (4 × 7) = 43        ...(iii)
(2, 2) => A_list = ((A, 1, 3), (A, 2, 4)); B_list = ((B, 1, 6), (B, 2, 8))
          Now A_2j × B_j2 : (3 × 6) + (4 × 8) = 50        ...(iv)

From (i), (ii), (iii) and (iv) we conclude that the result pairs are ((1, 1), 19), ((1, 2), 22), ((2, 1), 43), ((2, 2), 50). Therefore, the final matrix is :

    A × B = | 19  22 |
            | 43  50 |
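The same two-phase computation can be simulated in ordinary Python as a sanity check of the hand-worked example (a sketch only, not Hadoop code; the matrices are stored as {(row, col): value} dictionaries, a representation chosen here for convenience) :

# Simulation of Algorithms 1 and 2 on the 2x2 matrices A and B above.
from collections import defaultdict

A = {(1, 1): 1, (1, 2): 2, (2, 1): 3, (2, 2): 4}   # matrix M in Algorithm 1
B = {(1, 1): 5, (1, 2): 6, (2, 1): 7, (2, 2): 8}   # matrix N in Algorithm 1
I, K = 2, 2                                        # A is I x J, B is J x K

# Map: emit ((i, k), (matrix, j, value)) for every element, as in Algorithm 1.
intermediate = defaultdict(list)
for (i, j), a in A.items():          # send A_ij to every key (i, k)
    for k in range(1, K + 1):
        intermediate[(i, k)].append(("A", j, a))
for (j, k), b in B.items():          # send B_jk to every key (i, k)
    for i in range(1, I + 1):
        intermediate[(i, k)].append(("B", j, b))

# Reduce: for each key (i, k), pair A and B values by j and sum the products.
result = {}
for (i, k), values in intermediate.items():
    a_by_j = {j: v for tag, j, v in values if tag == "A"}
    b_by_j = {j: v for tag, j, v in values if tag == "B"}
    result[(i, k)] = sum(a_by_j[j] * b_by_j[j] for j in a_by_j)

print(result)  # {(1, 1): 19, (1, 2): 22, (2, 1): 43, (2, 2): 50}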
3.4 RELATIONAL ALGEBRA OPERATIONS

1. Selection
2. Projection
3. Natural Join
4. Union and Intersection
5. Grouping and Aggregation

Selection

+ Apply a condition c to each tuple in the relation R and produce as output only those tuples that satisfy c.
+ The result of this selection is denoted by σ_c(R).
+ Selections do not need the full power of MapReduce; they can be carried out in the map portion alone, since each tuple is tested against c independently.
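A minimal sketch of selection as a map-only job (plain Python; the relation and the condition c are invented for illustration) — the map function emits a tuple t as (t, t) when t satisfies c, and the reduce side is simply the identity :

# Sketch of sigma_c(R) in the MapReduce style: the map function emits
# (t, t) for each tuple t of R that satisfies c; no real reducing is
# needed, so no reduce function is shown. The data here is invented.
employees = [                       # relation R: (name, dept, salary) tuples
    ("asha",  "sales", 40000),
    ("ravi",  "it",    65000),
    ("meera", "it",    52000),
]

def selection_map(t, c):
    """Emit (t, t) if tuple t satisfies condition c, else emit nothing."""
    return [(t, t)] if c(t) else []

c = lambda t: t[2] > 50000          # condition c: salary > 50000
selected = [key for t in employees for key, _ in selection_map(t, c)]
print(selected)  # [('ravi', 'it', 65000), ('meera', 'it', 52000)]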
