R2021 DS4015 BIG DATA ANALYTICS UNIT 4

UNIT 4: FRAMEWORKS
MapReduce - Hadoop, Hive, MapR - Sharding - NoSQL Databases - S3 - Hadoop Distributed File Systems - Case Study: Preventing Private Information Inference Attacks on Social Networks - Grand Challenge: Applying Regulatory Science and Big Data to Improve Medical Device Innovation

PART-A

1. What is MapReduce?
MapReduce is a programming model used in big data to process data on multiple parallel nodes. MapReduce divides a task into smaller parts and assigns them to many machines; the partial results are then collected in one place and integrated to form the final result set.
MapReduce is a programming model or pattern within the Hadoop framework that is used to access big data stored in the Hadoop Distributed File System (HDFS). It is a core component, integral to the functioning of the Hadoop framework. A MapReduce program works in two phases, namely Map and Reduce. Map tasks deal with the splitting and mapping of data, while Reduce tasks shuffle and reduce the data. Figure 4.1 explains how MapReduce integrates the tasks.

Fig 4.1 MapReduce

2. What are the tasks involved in MapReduce?
In general, the MapReduce algorithm is divided into two components, "Map" and "Reduce":
1. The Map task takes input data sets and converts them into another data set, where each individual element is broken down into key-value pairs (also called tuples).
2. The Reduce task takes the output of the Map task as its input and combines those tuples into a smaller set of key-value pairs.
A minimal Java sketch of these two phases follows.
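To make the Map and Reduce phases concrete, here is a minimal word-count sketch using the standard Hadoop MapReduce API (org.apache.hadoop.mapreduce). The class name and the input/output paths passed on the command line are illustrative, not part of the original notes.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map task: split each input line and emit a (word, 1) key-value pair (tuple).
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce task: the framework shuffles pairs by key; sum the counts per word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Note that the programmer writes only the map and reduce functions; splitting the input, distributing the tasks across nodes and shuffling the intermediate pairs are handled by the framework.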
3. Write the benefits of using MapReduce algorithms.
The following are the key advantages of using MapReduce:
1. It offers distributed data and computation.
2. The tasks are independent, so entire nodes can fail and restart.
3. Linear scaling is the ideal case; MapReduce is designed to run on commodity hardware.
4. MapReduce is a simple programming model: the end programmer only needs to write the map and reduce tasks.

4. What is Hadoop?
Hadoop is an open-source framework from Apache used to store, process and analyze data that is very huge in volume. Hadoop is written in Java and is not OLAP (online analytical processing); it is used for batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn and many more. Moreover, it can be scaled up just by adding nodes to the cluster. Hadoop is used to develop applications that perform complete statistical analysis on huge amounts of data.

5. Mention the limitations of Hive.
- Hive is not capable of handling real-time data.
- It is not designed for online transaction processing.
- Hive queries have high latency.
Figure 4.2 represents the limitations of Hive.

Fig 4.2 Limitations of Hive

6. Discuss the features of Hive.
These are the following features of Hive:
- Hive is fast and scalable.
- It provides SQL-like queries (i.e., HQL) that are implicitly transformed into MapReduce or Spark jobs.
- It is capable of analyzing large datasets stored in HDFS.

14. What is meant by a NoSQL database?
NoSQL stands for "Not Only SQL" or "Not SQL". Though a better term would be "NoREL", NoSQL caught on; Carlo Strozzi introduced the NoSQL concept in 1998. A traditional RDBMS uses SQL syntax to store and retrieve data for further insights. Instead, a NoSQL database system encompasses a wide range of database technologies that can store structured, semi-structured, unstructured and polymorphic data. Figure 4.4 shows the comparison.

Fig 4.4 Comparison of SQL and NoSQL (relational/analytical databases vs. column-family, key-value, document and graph stores)

15. Why NoSQL?
The concept of NoSQL databases became popular with Internet giants like Google, Facebook and Amazon, who deal with huge volumes of data. The system response time becomes slow when you use an RDBMS for massive volumes of data. One way to resolve this problem is to "scale up" our systems by upgrading the existing hardware, but this process is expensive. The alternative is to distribute the database load onto multiple hosts whenever the load increases. This method is known as "scaling out." Figure 4.5 illustrates the two approaches.

Fig 4.5 Scale-up (vertical scaling) vs. scale-out (horizontal scaling)

20. What do you understand by sharding?
Sharding is the practice of optimizing database management systems by separating the rows or columns of a larger database table into multiple smaller tables. The new tables are called "shards" (or partitions), and each new table either has the same schema but unique rows (as is the case for "horizontal sharding") or has a schema that is a proper subset of the original table's schema (as is the case for "vertical sharding"). Figure 4.6 represents sharding.

Fig 4.6 Sharding: original table, vertical shards and horizontal shards

21. Why sharding?
- Database systems with big data sets or high-throughput requests can exceed the capacity of a single server.
- For example, high query rates can exhaust the CPU capacity of the server.
- Working set sizes larger than the system's RAM stress the I/O capacity of the disk drives.
Horizontal sharding is effective when queries tend to return a subset of rows that are often grouped together. For example, queries that filter data based on short date ranges are ideal for horizontal sharding, since the date range will necessarily limit querying to only a subset of the servers.
Vertical sharding is effective when queries tend to return only a subset of columns of the data. For example, if some queries request only names, and others request only addresses, then the names and addresses can be sharded onto separate servers.

22. How does sharding work?
Sharding addresses this problem through horizontal scaling: it breaks the system's dataset up and stores it over multiple servers, adding new servers to increase capacity as needed. A minimal routing sketch follows.
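The sketch below shows one simple way horizontal sharding can route rows to servers, by hashing the shard key. It is a self-contained illustration, not tied to any particular database; the shard count and host names are assumptions for the example.

import java.util.List;

// Minimal sketch of horizontal sharding: route each row to one of N servers
// by hashing its shard key. Host names here are illustrative placeholders.
public class ShardRouter {
  private final List<String> shardHosts;

  public ShardRouter(List<String> shardHosts) {
    this.shardHosts = shardHosts;
  }

  // The same key always maps to the same shard, so reads and writes
  // for one row go to one server while the total load spreads across all shards.
  public String shardFor(String shardKey) {
    int bucket = Math.floorMod(shardKey.hashCode(), shardHosts.size());
    return shardHosts.get(bucket);
  }

  public static void main(String[] args) {
    ShardRouter router = new ShardRouter(List.of("db-shard-0", "db-shard-1", "db-shard-2"));
    System.out.println(router.shardFor("user:1001"));
    System.out.println(router.shardFor("user:1002"));
  }
}

Note that adding a new server changes hashCode() % N for most keys, which is exactly the rebalancing problem discussed under the disadvantages of sharding below; production systems therefore tend to use consistent hashing or range-based partitioning instead of a plain modulus.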
PART-B

1. Discuss in detail about the MapReduce application model.
The MapReduce application model divides processing into the Map and Reduce phases described in Part-A: a map task emits key-value pairs and a reduce task aggregates them, as illustrated in the word-count sketch above.

Hive complex data types:
- Maps: Maps in Hive are similar to Java Maps.
  Syntax: MAP<primitive_type, data_type>
- Structs: Structs in Hive are similar to using complex data with comment.
  Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>

10. Write the advantages and disadvantages of sharding.
Advantages of sharding:
- Sharding adds more servers to a database and automatically adjusts the data load across the various servers.
- The number of operations each shard has to manage is reduced.
- It also increases the write capacity by splitting the write load over multiple instances.
- It gives high availability due to the deployment of replica servers for the shard and config servers.
- Total capacity is increased by adding multiple shards.

Disadvantages of sharding:
- Adds complexity in the system: sharding is a complicated task, and if it is not implemented properly you may lose data or get corrupt tables in your database. You also need to manage the data from multiple shard locations instead of managing and accessing it from a single entry point. This may affect the workflow of your team, which can be potentially disruptive to some teams.
- Rebalancing data: in a sharded database architecture, shards sometimes become unbalanced (when a shard outgrows the other shards) and may create a database hotspot. To overcome this problem and rebalance the data, you need to re-shard for even data distribution. Moving data from one shard to another is not a good idea because it requires a lot of downtime.
- Joining data from multiple shards is expensive: in a sharded architecture, you need to pull the data from different shards and perform joins across multiple networked servers.

Advantages of NoSQL:
- No need for a separate caching layer: it eliminates the need for a specific caching layer to store data.
- Provides fast performance and horizontal scalability.
- Can handle structured, semi-structured, and unstructured data with equal effect.
- Object-oriented programming, which is easy to use and flexible.
- NoSQL databases don't need a dedicated high-performance server.
- Supports key developer languages and platforms.
- Simpler to implement than an RDBMS.
- It can serve as the primary data source for online applications.
- Handles big data, managing data velocity, variety, volume, and complexity.
- Excels at distributed database and multi-data-center operations.
- Offers a flexible schema design which can easily be altered without downtime or service disruption (see the sketch after this list).

Disadvantages of NoSQL:
- No standardization rules.
- Limited query capabilities.
- RDBMS databases and tools are comparatively mature.
- It does not offer any traditional database capabilities, like consistency when multiple transactions are performed simultaneously.
- When the volume of data increases, it is difficult to maintain unique values as keys become difficult.
- Doesn't work as well with relational data.
- The learning curve is steep for new developers.
- Open-source options are not so popular with enterprises.
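As an illustration of the flexible-schema point above, here is a hedged sketch using the MongoDB Java sync driver (mongodb-driver-sync); MongoDB is one example of a document-oriented NoSQL store, and the connection string, database and collection names are assumptions for the example, not part of the original notes.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class NoSqlExample {
  public static void main(String[] args) {
    // Connection string is an assumption for a local test instance.
    try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
      MongoCollection<Document> users =
          client.getDatabase("demo").getCollection("users");

      // Two documents in the same collection with different shapes:
      // adding the extra "addresses" field needs no schema migration or downtime.
      users.insertOne(new Document("name", "Alice")
          .append("email", "alice@example.com"));
      users.insertOne(new Document("name", "Bob")
          .append("email", "bob@example.com")
          .append("addresses", java.util.List.of("12 Main St", "34 Side Rd")));

      System.out.println("documents stored: " + users.countDocuments());
    }
  }
}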
13. Write the advantages of Amazon S3.
Figure 4.22 summarizes the advantages of S3.

Fig 4.22 Advantages of S3

14. Write the steps to create an S3 bucket.
1. Sign in to the AWS Management Console. After signing in, the screen shown in Figure 4.24 appears.
Fig 4.24 AWS Management Console
2. Move to the S3 service. After clicking on S3, the screen shown in Figure 4.25 appears. If you do not have any buckets yet, the console prompts you to get started with Amazon S3: create a new bucket, upload your data and set up your permissions.
Fig 4.25 Steps to create a bucket
3. To create an S3 bucket, click on "Create bucket". On clicking the "Create bucket" button, the screen shown in Figure 4.26 appears.
Fig 4.26 Sign up to create a bucket
4. Enter the bucket name. The name should look like a DNS address, and it should be resolvable. A bucket is like a folder that stores objects. A bucket name should be unique, should start with a lowercase letter, must not contain any invalid characters, and should be 3 to 63 characters long, as shown in Figure 4.27.
Fig 4.27 Sign up to create a bucket
5. Click on the "Create" button. Now the bucket is created (Figure 4.28). Note that the bucket and its objects are not public; by default, all objects are private.
Fig 4.28 S3 buckets
6. Click on the "javatpointbucket" to upload a file into this bucket. On clicking, the screen shown in Figure 4.29 appears.
Fig 4.29 Amazon S3 buckets
7. Click on the "Upload" button to add files to your bucket (Figure 4.30), then click on the "Add files" button.
Fig 4.30 Uploading the files
8. Select a file, e.g. "jtp.jpg", and click on the "Upload" button (Figures 4.32 and 4.33). The file "jtp.jpg" has now been successfully uploaded to the bucket "javatpointbucket".
Fig 4.32 Upload jpg files
Fig 4.33 Uploaded jtp.jpg file
9. Move to the properties of the object "jtp.jpg" and click on the Object URL appearing on the right side of the screen (Figure 4.34).
Fig 4.34 Overview of the jtp.jpg file
10. To make the object accessible, edit the public access settings: enter "confirm" in the textbox, then click on the "Confirm" button (Figure 4.37).
Fig 4.37 Edit public access settings
11. Click on the "Actions" dropdown and then click on "Make public" (Figure 4.38).
Fig 4.38 Make public
12. Now, click on the Object URL of the object to open the file. An equivalent programmatic sketch of this flow follows.
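The same create-upload-publish flow can be scripted. Below is a hedged sketch using the AWS SDK for Java v1 (aws-java-sdk-s3); the bucket name, region and file path are illustrative assumptions, and making objects public also requires the bucket's public-access block settings to allow it, matching step 10 above.

import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.CannedAccessControlList;
import com.amazonaws.services.s3.model.PutObjectRequest;
import java.io.File;

public class S3BucketExample {
  public static void main(String[] args) {
    // Credentials come from the default provider chain (env vars, profile, etc.).
    AmazonS3 s3 = AmazonS3ClientBuilder.standard()
        .withRegion(Regions.US_EAST_1)        // region is an assumption
        .build();

    String bucket = "javatpointbucket-demo";   // must be globally unique and DNS-like

    // Step "Create bucket" from the console walkthrough.
    if (!s3.doesBucketExistV2(bucket)) {
      s3.createBucket(bucket);
    }

    // Steps "Upload" / "Add files": upload jtp.jpg, and "Make public" via a canned ACL.
    s3.putObject(new PutObjectRequest(bucket, "jtp.jpg", new File("jtp.jpg"))
        .withCannedAcl(CannedAccessControlList.PublicRead));

    // Equivalent of the console's "Object URL".
    System.out.println(s3.getUrl(bucket, "jtp.jpg"));
  }
}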
15. Explain the S3 storage classes.
S3 Standard
- It provides low latency and high throughput performance.
- It is designed for 99.99% availability and 99.999999999% durability.

S3 One Zone-Infrequent Access
- The S3 One Zone-IA storage class is used when data is accessed less frequently but requires rapid access when needed.
- It stores the data in a single availability zone, while the other storage classes store the data in a minimum of three availability zones. Due to this, its cost is 20% less than the Standard IA storage class.
- It is an optimal choice for less frequently accessed data that does not require the availability of the Standard or Standard IA storage classes.
- It is a good choice for storing backup data.
- It is cost-effective storage into which data can be replicated from another AWS Region using S3 Cross-Region Replication.
- It has the same durability, high performance, and low latency, with a low storage price and a low retrieval fee.
- It is designed for 99.5% availability and 99.999999999% durability of objects in a single availability zone.
- It provides lifecycle management for the automatic migration of objects to other S3 storage classes.
- The data can be lost if the availability zone is destroyed, as the data is stored in a single availability zone.

Glacier
- The S3 Glacier storage class is the cheapest storage class, but it can be used for archival only.
- You can store any amount of data at a lower cost than with the other storage classes.
- S3 Glacier provides three retrieval models:
  - Expedited: data is retrieved in a few minutes, at a very high fee.
  - Standard: the retrieval time of the standard model is 3 to 5 hours.
  - Bulk: the retrieval time of the bulk model is 5 to 12 hours.
- You can upload objects directly to S3 Glacier.
- It is designed for 99.999999999% durability of objects across multiple availability zones.

16. Explain in detail about the HDFS concept.
1. Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS blocks are 128 MB by default, and this is configurable. Files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike in an ordinary file system, if a file in HDFS is smaller than the block size, it does not occupy the full block's size; i.e., a 5 MB file stored in HDFS with a block size of 128 MB takes 5 MB of space only. The HDFS block size is large in order to minimize the cost of seeks.
2. NameNode: HDFS works in a master-worker pattern where the NameNode acts as the master. The NameNode is the controller and manager of HDFS, as it knows the status and the metadata of all the files in HDFS, the metadata information being file permissions, names and the location of each block. The metadata is small, so it is stored in the memory of the NameNode, allowing faster access to data. Moreover, since the HDFS cluster is accessed by multiple clients concurrently, all this information is handled by a single machine. File system operations like opening, closing, renaming etc. are executed by it.
3. DataNode: DataNodes store and retrieve blocks when they are told to, by the client or the NameNode. They report back to the NameNode periodically with the list of blocks that they are storing. A DataNode, being commodity hardware, also does the work of block creation, deletion and replication as instructed by the NameNode, as shown in Figure 4.40.

Fig 4.40 HDFS DataNode and NameNode (the NameNode's metadata maps files such as /data/pristine/catalina.log to blocks)
Fig 4.41 HDFS read

A short client-side read sketch follows.
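The following is a hedged Java sketch of the read path in Figure 4.41, using the standard Hadoop FileSystem API (org.apache.hadoop.fs). The NameNode URI and the file path are illustrative assumptions; the path simply reuses the example file name from Figure 4.40.

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    // The client contacts the NameNode (URI below is an assumption) to learn the
    // block locations of the file, then streams the blocks from the DataNodes.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

    try (InputStream in = fs.open(new Path("/data/pristine/catalina.log"))) {
      IOUtils.copyBytes(in, System.out, 4096, false); // copy file contents to stdout
    }
  }
}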
