R2021 DS4015 BIG DATA ANALYTICS UNIT 4

UNIT 4: FRAMEWORKS
MapReduce - Hadoop, Hive, MapR - Sharding - NoSQL Databases - S3 - Hadoop Distributed File Systems - Case Study: Preventing Private Information Inference Attacks on Social Networks - Grand Challenge: Applying Regulatory Science and Big Data to Improve Medical Device Innovation

PART-A

1. What is MapReduce?
MapReduce is a programming model used in big data to process data on multiple parallel nodes. MapReduce divides a task into smaller parts and assigns them to many machines; the partial results are then collected in one place and integrated to form the final result set.
MapReduce is a programming model or pattern within the Hadoop framework that is used to access big data stored in the Hadoop Distributed File System (HDFS). It is a core component, integral to the functioning of the Hadoop framework. A MapReduce program works in two phases, namely Map and Reduce. Map tasks deal with the splitting and mapping of data, while Reduce tasks shuffle and reduce the data. Figure 4.1 explains how MapReduce integrates the tasks.

Fig 4.1 MapReduce

2. What are the tasks involved in MapReduce?
In general, the MapReduce algorithm is divided into two components, "Map" and "Reduce":
1. The Map task takes input data sets and converts them into another data set, where each individual element is broken down into key-value pairs (also called tuples).
2. The Reduce task takes the output of the Map task as its input and combines those tuples into a smaller set of key-value pairs.
A minimal Java sketch of these two phases follows.
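To make the Map and Reduce phases concrete, here is a minimal word-count sketch using the standard Hadoop MapReduce API (org.apache.hadoop.mapreduce). The class name and the input/output paths passed on the command line are illustrative, not part of the original notes.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map task: split each input line and emit a (word, 1) key-value pair (tuple).
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce task: the framework shuffles pairs by key; sum the counts per word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Note that the programmer writes only the map and reduce functions; splitting the input, distributing the tasks across nodes and shuffling the intermediate pairs are handled by the framework.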
3. Write the benefits of using MapReduce algorithms.
The following are the key advantages of using MapReduce:
1. It offers distributed data and computation.
2. The tasks are independent, so entire nodes can fail and restart.
3. Linear scaling is the ideal case; MapReduce is designed to run on commodity hardware.
4. MapReduce is a simple programming model: the end programmer only needs to write the map and reduce tasks.

4. What is Hadoop?
Hadoop is an open-source framework from Apache used to store, process and analyze data that is very huge in volume. Hadoop is written in Java and is not OLAP (online analytical processing); it is used for batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn and many more. Moreover, it can be scaled up just by adding nodes to the cluster. Hadoop is used to develop applications that perform complete statistical analysis on huge amounts of data.

5. Mention the limitations of Hive.
- Hive is not capable of handling real-time data.
- It is not designed for online transaction processing.
- Hive queries have high latency.
Figure 4.2 represents the limitations of Hive.

Fig 4.2 Limitations of Hive

6. Discuss the features of Hive.
These are the following features of Hive:
- Hive is fast and scalable.
- It provides SQL-like queries (i.e., HQL) that are implicitly transformed into MapReduce or Spark jobs.
- It is capable of analyzing large datasets stored in HDFS.

14. What is meant by a NoSQL database?
NoSQL stands for "Not Only SQL" or "Not SQL". Though a better term would be "NoREL", NoSQL caught on; Carlo Strozzi introduced the NoSQL concept in 1998. A traditional RDBMS uses SQL syntax to store and retrieve data for further insights. Instead, a NoSQL database system encompasses a wide range of database technologies that can store structured, semi-structured, unstructured and polymorphic data. Figure 4.4 shows the comparison.

Fig 4.4 Comparison of SQL and NoSQL (relational/analytical databases vs. column-family, key-value, document and graph stores)

15. Why NoSQL?
The concept of NoSQL databases became popular with Internet giants like Google, Facebook and Amazon, who deal with huge volumes of data. The system response time becomes slow when you use an RDBMS for massive volumes of data. One way to resolve this problem is to "scale up" our systems by upgrading the existing hardware, but this process is expensive. The alternative is to distribute the database load onto multiple hosts whenever the load increases. This method is known as "scaling out." Figure 4.5 illustrates the two approaches.

Fig 4.5 Scale-up (vertical scaling) vs. scale-out (horizontal scaling)

20. What do you understand by sharding?
Sharding is the practice of optimizing database management systems by separating the rows or columns of a larger database table into multiple smaller tables. The new tables are called "shards" (or partitions), and each new table either has the same schema but unique rows (as is the case for "horizontal sharding") or has a schema that is a proper subset of the original table's schema (as is the case for "vertical sharding"). Figure 4.6 represents sharding.

Fig 4.6 Sharding: original table, vertical shards and horizontal shards

21. Why sharding?
- Database systems with big data sets or high-throughput requests can exceed the capacity of a single server.
- For example, high query rates can exhaust the CPU capacity of the server.
- Working set sizes larger than the system's RAM stress the I/O capacity of the disk drives.
Horizontal sharding is effective when queries tend to return a subset of rows that are often grouped together. For example, queries that filter data based on short date ranges are ideal for horizontal sharding, since the date range will necessarily limit querying to only a subset of the servers.
Vertical sharding is effective when queries tend to return only a subset of columns of the data. For example, if some queries request only names, and others request only addresses, then the names and addresses can be sharded onto separate servers.

22. How does sharding work?
Sharding addresses this problem through horizontal scaling: it breaks the system's dataset up and stores it over multiple servers, adding new servers to increase capacity as needed. A minimal routing sketch follows.
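The sketch below shows one simple way horizontal sharding can route rows to servers, by hashing the shard key. It is a self-contained illustration, not tied to any particular database; the shard count and host names are assumptions for the example.

import java.util.List;

// Minimal sketch of horizontal sharding: route each row to one of N servers
// by hashing its shard key. Host names here are illustrative placeholders.
public class ShardRouter {
  private final List<String> shardHosts;

  public ShardRouter(List<String> shardHosts) {
    this.shardHosts = shardHosts;
  }

  // The same key always maps to the same shard, so reads and writes
  // for one row go to one server while the total load spreads across all shards.
  public String shardFor(String shardKey) {
    int bucket = Math.floorMod(shardKey.hashCode(), shardHosts.size());
    return shardHosts.get(bucket);
  }

  public static void main(String[] args) {
    ShardRouter router = new ShardRouter(List.of("db-shard-0", "db-shard-1", "db-shard-2"));
    System.out.println(router.shardFor("user:1001"));
    System.out.println(router.shardFor("user:1002"));
  }
}

Note that adding a new server changes hashCode() % N for most keys, which is exactly the rebalancing problem discussed under the disadvantages of sharding below; production systems therefore tend to use consistent hashing or range-based partitioning instead of a plain modulus.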
PART-B

1. Discuss in detail about the MapReduce application model.
The MapReduce application model divides processing into the Map and Reduce phases described in Part-A: a map task emits key-value pairs and a reduce task aggregates them, as illustrated in the word-count sketch above.

Hive complex data types:
- Maps: Maps in Hive are similar to Java Maps.
  Syntax: MAP<primitive_type, data_type>
- Structs: Structs in Hive are similar to using complex data with comment.
  Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>

10. Write the advantages and disadvantages of sharding.
Advantages of sharding:
- Sharding adds more servers to a database and automatically adjusts the data load across the various servers.
- The number of operations each shard has to manage is reduced.
- It also increases the write capacity by splitting the write load over multiple instances.
- It gives high availability due to the deployment of replica servers for the shard and config servers.
- Total capacity is increased by adding multiple shards.

Disadvantages of sharding:
- Adds complexity in the system: sharding is a complicated task, and if it is not implemented properly you may lose data or get corrupt tables in your database. You also need to manage the data from multiple shard locations instead of managing and accessing it from a single entry point. This may affect the workflow of your team, which can be potentially disruptive to some teams.
- Rebalancing data: in a sharded database architecture, shards sometimes become unbalanced (when a shard outgrows the other shards) and may create a database hotspot. To overcome this problem and rebalance the data, you need to re-shard for even data distribution. Moving data from one shard to another is not a good idea because it requires a lot of downtime.
- Joining data from multiple shards is expensive: in a sharded architecture, you need to pull the data from different shards and perform joins across multiple networked servers.

Advantages of NoSQL:
- No need for a separate caching layer: it eliminates the need for a specific caching layer to store data.
- Provides fast performance and horizontal scalability.
- Can handle structured, semi-structured, and unstructured data with equal effect.
- Object-oriented programming, which is easy to use and flexible.
- NoSQL databases don't need a dedicated high-performance server.
- Supports key developer languages and platforms.
- Simpler to implement than an RDBMS.
- It can serve as the primary data source for online applications.
- Handles big data, managing data velocity, variety, volume, and complexity.
- Excels at distributed database and multi-data-center operations.
- Offers a flexible schema design which can easily be altered without downtime or service disruption (see the sketch after this list).

Disadvantages of NoSQL:
- No standardization rules.
- Limited query capabilities.
- RDBMS databases and tools are comparatively mature.
- It does not offer any traditional database capabilities, like consistency when multiple transactions are performed simultaneously.
- When the volume of data increases, it is difficult to maintain unique values as keys become difficult.
- Doesn't work as well with relational data.
- The learning curve is steep for new developers.
- Open-source options are not so popular with enterprises.
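As an illustration of the flexible-schema point above, here is a hedged sketch using the MongoDB Java sync driver (mongodb-driver-sync); MongoDB is one example of a document-oriented NoSQL store, and the connection string, database and collection names are assumptions for the example, not part of the original notes.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class NoSqlExample {
  public static void main(String[] args) {
    // Connection string is an assumption for a local test instance.
    try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
      MongoCollection<Document> users =
          client.getDatabase("demo").getCollection("users");

      // Two documents in the same collection with different shapes:
      // adding the extra "addresses" field needs no schema migration or downtime.
      users.insertOne(new Document("name", "Alice")
          .append("email", "alice@example.com"));
      users.insertOne(new Document("name", "Bob")
          .append("email", "bob@example.com")
          .append("addresses", java.util.List.of("12 Main St", "34 Side Rd")));

      System.out.println("documents stored: " + users.countDocuments());
    }
  }
}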
13. Write the advantages of Amazon S3.
Figure 4.22 summarizes the advantages of S3.

Fig 4.22 Advantages of S3

14. Write the steps to create an S3 bucket.
1. Sign in to the AWS Management Console. After signing in, the screen shown in Figure 4.24 appears.
Fig 4.24 AWS Management Console
2. Move to the S3 service. After clicking on S3, the screen shown in Figure 4.25 appears. If you do not have any buckets yet, the console prompts you to get started with Amazon S3: create a new bucket, upload your data and set up your permissions.
Fig 4.25 Steps to create a bucket
3. To create an S3 bucket, click on "Create bucket". On clicking the "Create bucket" button, the screen shown in Figure 4.26 appears.
Fig 4.26 Sign up to create a bucket
4. Enter the bucket name. The name should look like a DNS address, and it should be resolvable. A bucket is like a folder that stores objects. A bucket name should be unique, should start with a lowercase letter, must not contain any invalid characters, and should be 3 to 63 characters long, as shown in Figure 4.27.
Fig 4.27 Sign up to create a bucket
5. Click on the "Create" button. Now the bucket is created (Figure 4.28). Note that the bucket and its objects are not public; by default, all objects are private.
Fig 4.28 S3 buckets
6. Click on the "javatpointbucket" to upload a file into this bucket. On clicking, the screen shown in Figure 4.29 appears.
Fig 4.29 Amazon S3 buckets
7. Click on the "Upload" button to add files to your bucket (Figure 4.30), then click on the "Add files" button.
Fig 4.30 Uploading the files
8. Select a file, e.g. "jtp.jpg", and click on the "Upload" button (Figures 4.32 and 4.33). The file "jtp.jpg" has now been successfully uploaded to the bucket "javatpointbucket".
Fig 4.32 Upload jpg files
Fig 4.33 Uploaded jtp.jpg file
9. Move to the properties of the object "jtp.jpg" and click on the Object URL appearing on the right side of the screen (Figure 4.34).
Fig 4.34 Overview of the jtp.jpg file
10. To make the object accessible, edit the public access settings: enter "confirm" in the textbox, then click on the "Confirm" button (Figure 4.37).
Fig 4.37 Edit public access settings
11. Click on the "Actions" dropdown and then click on "Make public" (Figure 4.38).
Fig 4.38 Make public
12. Now, click on the Object URL of the object to open the file. An equivalent programmatic sketch of this flow follows.
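The same create-upload-publish flow can be scripted. Below is a hedged sketch using the AWS SDK for Java v1 (aws-java-sdk-s3); the bucket name, region and file path are illustrative assumptions, and making objects public also requires the bucket's public-access block settings to allow it, matching step 10 above.

import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.CannedAccessControlList;
import com.amazonaws.services.s3.model.PutObjectRequest;
import java.io.File;

public class S3BucketExample {
  public static void main(String[] args) {
    // Credentials come from the default provider chain (env vars, profile, etc.).
    AmazonS3 s3 = AmazonS3ClientBuilder.standard()
        .withRegion(Regions.US_EAST_1)        // region is an assumption
        .build();

    String bucket = "javatpointbucket-demo";   // must be globally unique and DNS-like

    // Step "Create bucket" from the console walkthrough.
    if (!s3.doesBucketExistV2(bucket)) {
      s3.createBucket(bucket);
    }

    // Steps "Upload" / "Add files": upload jtp.jpg, and "Make public" via a canned ACL.
    s3.putObject(new PutObjectRequest(bucket, "jtp.jpg", new File("jtp.jpg"))
        .withCannedAcl(CannedAccessControlList.PublicRead));

    // Equivalent of the console's "Object URL".
    System.out.println(s3.getUrl(bucket, "jtp.jpg"));
  }
}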
15. Explain the S3 storage classes.
S3 Standard
- It provides low latency and high throughput performance.
- It is designed for 99.99% availability and 99.999999999% durability.

S3 One Zone-Infrequent Access
- The S3 One Zone-IA storage class is used when data is accessed less frequently but requires rapid access when needed.
- It stores the data in a single availability zone, while the other storage classes store the data in a minimum of three availability zones. Due to this, its cost is 20% less than the Standard IA storage class.
- It is an optimal choice for less frequently accessed data that does not require the availability of the Standard or Standard IA storage classes.
- It is a good choice for storing backup data.
- It is cost-effective storage into which data can be replicated from another AWS Region using S3 Cross-Region Replication.
- It has the same durability, high performance, and low latency, with a low storage price and a low retrieval fee.
- It is designed for 99.5% availability and 99.999999999% durability of objects in a single availability zone.
- It provides lifecycle management for the automatic migration of objects to other S3 storage classes.
- The data can be lost if the availability zone is destroyed, as the data is stored in a single availability zone.

Glacier
- The S3 Glacier storage class is the cheapest storage class, but it can be used for archival only.
- You can store any amount of data at a lower cost than with the other storage classes.
- S3 Glacier provides three retrieval models:
  - Expedited: data is retrieved in a few minutes, at a very high fee.
  - Standard: the retrieval time of the standard model is 3 to 5 hours.
  - Bulk: the retrieval time of the bulk model is 5 to 12 hours.
- You can upload objects directly to S3 Glacier.
- It is designed for 99.999999999% durability of objects across multiple availability zones.

16. Explain in detail about the HDFS concept.
1. Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS blocks are 128 MB by default, and this is configurable. Files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike in an ordinary file system, if a file in HDFS is smaller than the block size, it does not occupy the full block's size; i.e., a 5 MB file stored in HDFS with a block size of 128 MB takes 5 MB of space only. The HDFS block size is large in order to minimize the cost of seeks.
2. NameNode: HDFS works in a master-worker pattern where the NameNode acts as the master. The NameNode is the controller and manager of HDFS, as it knows the status and the metadata of all the files in HDFS, the metadata information being file permissions, names and the location of each block. The metadata is small, so it is stored in the memory of the NameNode, allowing faster access to data. Moreover, since the HDFS cluster is accessed by multiple clients concurrently, all this information is handled by a single machine. File system operations like opening, closing, renaming etc. are executed by it.
3. DataNode: DataNodes store and retrieve blocks when they are told to, by the client or the NameNode. They report back to the NameNode periodically with the list of blocks that they are storing. A DataNode, being commodity hardware, also does the work of block creation, deletion and replication as instructed by the NameNode, as shown in Figure 4.40.

Fig 4.40 HDFS DataNode and NameNode (the NameNode's metadata maps files such as /data/pristine/catalina.log to blocks)
Fig 4.41 HDFS read

A short client-side read sketch follows.
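The following is a hedged Java sketch of the read path in Figure 4.41, using the standard Hadoop FileSystem API (org.apache.hadoop.fs). The NameNode URI and the file path are illustrative assumptions; the path simply reuses the example file name from Figure 4.40.

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    // The client contacts the NameNode (URI below is an assumption) to learn the
    // block locations of the file, then streams the blocks from the DataNodes.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

    try (InputStream in = fs.open(new Path("/data/pristine/catalina.log"))) {
      IOUtils.copyBytes(in, System.out, 4096, false); // copy file contents to stdout
    }
  }
}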
