
Unit-5

Hadoop
1. Hadoop is an open-source software framework used for storing data and running
applications on clusters of commodity hardware.

2. It provides massive storage for any kind of data, enormous processing power and the
ability to handle virtually limitless concurrent tasks or jobs.

3. The Hadoop ecosystem is a framework of various types of complex and evolving tools and
components. Some of these elements are very different from each other in terms of their
architecture; however, what keeps them all together under a single roof is that they all
derive their functionalities from the scalability and power of Hadoop.

4. The Hadoop ecosystem can be defined as a comprehensive collection of tools and
technologies that can be effectively implemented and deployed to provide big data
solutions in a cost-effective manner.

5. MapReduce and the Hadoop Distributed File System (HDFS) are the two core components of the
Hadoop ecosystem that are used to manage big data. However, on their own they are not
sufficient to deal with all big data challenges.

6. Along with these two, the Hadoop ecosystem provides a collection of various elements to
support the complete development and deployment of big data solutions.

Use of Hadoop:

1. Ability to store and process huge amounts of any kind of data quickly.
2. Computing power: Hadoop's distributed computing model processes big data fast.
3. Fault tolerance: Data and application processing are protected against hardware failure.
If a node goes down, jobs are automatically redirected to other nodes to make sure that
distributed computing does not fail. Multiple copies of all data are stored automatically.
4. Flexibility: Unlike traditional relational databases, we do not have to preprocess data
before storing it. We can store as much data as we want and decide how to use it later. That
includes unstructured data like text, images and videos.
5. Low cost: The open-source framework is free and uses commodity hardware to store
large quantities of data.
6. Scalability: We can easily grow our system to handle more data simply by adding nodes.

Features of Hadoop:

1. Suitable for big data analysis:

i. As big data tends to be distributed and unstructured in nature, Hadoop clusters are best
suited for analysis of big data.
ii. Since it is the processing logic (not the actual data) that flows to the computing nodes, less
network bandwidth is consumed.
iii. This concept is called data locality, and it helps to increase the efficiency of
Hadoop-based applications.

2. Scalability:

i. Hadoop clusters can easily be scaled to any extent by adding additional cluster nodes, thus
allowing for the growth of big data.
ii. Scaling does not require modifications to application logic.

3. Fault tolerance:

i. The Hadoop ecosystem has a provision to replicate the input data onto other cluster nodes.
ii. In case of a cluster node failure, data processing can still proceed by using the data stored
on another cluster node.

Modules of Hadoop:

1. HDFS (Hadoop Distributed File System): Files are broken into blocks and stored on nodes
across the distributed architecture.

2. YARN (Yet Another Resource Negotiator): It is used for job scheduling and managing the
cluster.

3. MapReduce:

i. This is a framework that helps Java programs perform parallel computation on data
using key-value pairs.
ii. The Map task takes input data and converts it into a data set that can be computed as
key-value pairs.
iii. The output of the Map task is consumed by the Reduce task, and the Reducer then gives the
desired result (a minimal mapper sketch follows this list).

4. Hadoop Common: These Java libraries are used to start Hadoop and are used by other
Hadoop modules.
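
To make the key-value idea concrete, here is a minimal mapper sketch in Java using the Hadoop MapReduce API. It assumes the classic word-count scenario; the class name and the word-count example itself are illustrative and not taken from these notes.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits one (word, 1) pair for every word in every input line.
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The record reader supplies each input line as (byte offset, line text).
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // intermediate key-value pair
        }
    }
}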

Advantages of Hadoop
o Fast: In HDFS, the data is distributed over the cluster and mapped, which helps in
faster retrieval. Even the tools to process the data are often on the same servers,
thus reducing the processing time. Hadoop is able to process terabytes of data in minutes
and petabytes in hours.
o Scalable: Hadoop cluster can be extended by just adding nodes in the cluster.
o Cost Effective: Hadoop is open source and uses commodity hardware to store data,
so it is really cost-effective compared to a traditional relational database management
system.
o Resilient to failure: HDFS has the property that it can replicate data over the
network, so if one node is down or some other network failure happens, then
Hadoop takes the other copy of the data and uses it. Normally, data is replicated three
times, but the replication factor is configurable (see the sketch below).
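
As an illustration of the configurable replication factor, a small sketch using the HDFS Java FileSystem API; the file path is hypothetical and the cluster configuration files are assumed to be on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationExample {
    public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS and dfs.replication from the cluster configuration files.
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            // Ask HDFS to keep three replicas of a hypothetical file that is already stored.
            fs.setReplication(new Path("/data/sample.txt"), (short) 3);
        }
    }
}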

Hadoop Architecture

NameNode

o It is a single master server that exists in the HDFS cluster.
o As it is a single node, it may become the cause of a single point of failure.
o It manages the file system namespace by executing operations such as opening,
renaming, and closing files.
o It simplifies the architecture of the system.
DataNode

o The HDFS cluster contains multiple DataNodes.
o Each DataNode contains multiple data blocks.
o These data blocks are used to store data.
o It is the responsibility of the DataNode to serve read and write requests from the file system's
clients (a client-side read sketch follows this section).
o It performs block creation, deletion, and replication upon instruction from the
NameNode.
Job Tracker

o The role of the Job Tracker is to accept MapReduce jobs from clients and process the
data by using the NameNode.
o In response, the NameNode provides metadata to the Job Tracker.
Task Tracker

o It works as a slave node for the Job Tracker.
o It receives tasks and code from the Job Tracker and applies that code to the file. This
process can also be called a Mapper.
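
The sketch below, again using the HDFS Java FileSystem API with a hypothetical path, shows the client-side view of this architecture: opening a file consults the NameNode's metadata, while the actual bytes are streamed from the DataNodes that hold the blocks.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf);
             // open() consults the NameNode for the file's block locations (metadata);
             // the returned stream then reads those blocks from the DataNodes.
             FSDataInputStream in = fs.open(new Path("/data/sample.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}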

MapReduce
1. MapReduce is based on a parallel programming framework used to process large amounts of
data dispersed across different systems.

2. The process is initiated when a user request is received to execute the MapReduce
program and terminated once the results are written back to the HDFS (Hadoop Distributed
File System).

3. MapReduce facilitates the processing and analysis of both unstructured and semi-
structured data collected from different sources, which may not be analyzed effectively by
other traditional tools.

4. MapReduce enables computational processing of data stored in a file system without the
requirement of loading the data initially into a database.

5. It primarily supports two operations, map and reduce.

6. These operations execute in parallel on a set of worker nodes.

7. MapReduce works on a master-worker approach in which the master process controls
and directs the entire activity, such as collecting, segregating, and delegating the data
among the different workers.

Working and Phases of MapReduce

1. The MapReduce algorithm contains two important tasks, namely Map and Reduce:

i. The Map task takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key-value pairs).

ii. The Reduce task takes the output from the Map as an input and combines those data
tuples (key-value pairs) into a smaller set of tuples.

2. The Reduce task is always performed after the Map task (a minimal reducer sketch follows).
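
Continuing the hypothetical word-count example from earlier, a minimal reducer sketch: after the shuffle and sort step, all counts for one word arrive together and are combined into a single (word, total) pair.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Combines the (word, 1) pairs produced by the mapper into (word, total) pairs.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // All values sharing this key arrive together after the shuffle and sort step.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result); // a smaller set of tuples than the input
    }
}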


Phases of MapReduce:

1. Input phase: Here we have a record reader that translates each record in an input file and
sends the parsed data to the mapper in the form of key-value pairs.

2. Map: Map is a user-defined function, which takes a series of key-value pairs and
processes each one of them to generate zero or more key-value pairs.

3. Intermediate keys: The key-value pairs generated by the mapper are known as
intermediate keys.

4. Combiner:

i. A combiner is a type of local reducer that groups similar data from the map phase into
identifiable sets.

ii. It takes the intermediate keys from the mapper as input and applies a user-defined code
to aggregate the values in the small scope of one mapper.

iii. It is not a part of the main MapReduce algorithm; it is optional.

5. Shuffle and sort:

i. The Reducer task starts with the shuffle and sort step.

ii. It downloads the grouped key-value pairs onto the local machine where the reducer is
running.

iii. The individual key-value pairs are sorted by key into a larger data list.

iv. The data list groups the equivalent keys together so that their values can be iterated over
easily in the reducer task.

6. Reducer:

i. The reducer takes the grouped key-value paired data as input and runs a reducer function
on each one of them.

ii. Here, the data can be aggregated, filtered, and combined in a number of ways, and this
can require a wide range of processing.

iii. Once the execution is over, it gives zero or more key-value pairs to the final step.

7. Output phase:

In the output phase, we have an output formatter that translates the final key-value pairs
from the reducer function and writes them to a file using a record writer. (A driver sketch
that wires all of these phases together follows.)
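
A driver sketch for the hypothetical word-count job: it registers the mapper and reducer sketches shown earlier, reuses the reducer as the optional combiner, and points the input and output phases at hypothetical HDFS paths. The shuffle and sort step between map and reduce is handled by the framework itself and needs no code here.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenizerMapper.class);   // map phase
        job.setCombinerClass(IntSumReducer.class);   // optional local reducer (combiner)
        job.setReducerClass(IntSumReducer.class);    // reduce phase

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Hypothetical HDFS paths for the input records and the final output files.
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
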
Features of MapReduce:

1. Scheduling:

i. MapReduce involves two operations, map and reduce, which are executed by dividing
large problems into smaller chunks that are run in parallel by different computing resources.

ii. The operation of breaking tasks into subtasks and running these subtasks independently
in parallel is called mapping, which is performed ahead of the reduce operation.

2. Synchronization:

i. Execution of several concurrent processes requires synchronization.

ii. The MapReduce program execution framework is aware of the mapping and reducing
operations that are taking place in the program.

3. Co-location of code/data (Data locality):

i. The effectiveness of a data processing mechanism depends on the location of the code and
the data required for the code to execute.

ii. The best result is obtained when both code and data reside on the same machine.

iii. This means that the co-location of the code and data produces the most effective
processing outcome.

4. Handling of errors/faults:
i. MapReduce engines provide a high level of fault tolerance and robustness in handling
errors.

ii. The reason for building this robustness into the engines is the high tendency of the
underlying nodes to experience errors or faults.

5. Scale-out architecture:

i. MapReduce engines are built in such a way that they can accommodate more machines,
as and when required.

ii. This possibility of introducing more computing resources to the architecture makes the
MapReduce programming model more suited to the higher computational demands of big
data.

Working of MapReduce algorithm:

1. Take a large dataset or set of records.

2. Perform iteration over the data.

3. Extract some interesting patterns to prepare an output list by using the map function.

4. Arrange the output list properly to enable optimization for further processing.

5. Compute a set of results by using the reduce function.

6. Provide the final output (a plain-Java walkthrough of these steps follows).
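
As a minimal sketch of these six steps without any Hadoop dependencies, the following plain-Java program runs hypothetical word-count records through map, group/sort, and reduce in memory; the sample records are made up for illustration.

import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MapReduceWalkthrough {
    public static void main(String[] args) {
        // 1. Take a (small) set of records.
        List<String> records = Arrays.asList("deer bear river", "car car river", "deer car bear");

        // 2-3. Iterate over the data and map each record to (word, 1) pairs.
        List<SimpleEntry<String, Integer>> mapped = records.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .map(word -> new SimpleEntry<>(word, 1))
                .collect(Collectors.toList());

        // 4. Arrange the output list: group the values of equal keys together (shuffle/sort).
        Map<String, List<Integer>> grouped = mapped.stream()
                .collect(Collectors.groupingBy(SimpleEntry::getKey,
                        Collectors.mapping(SimpleEntry::getValue, Collectors.toList())));

        // 5. Reduce: compute one result per key by summing its values.
        Map<String, Integer> reduced = grouped.entrySet().stream()
                .collect(Collectors.toMap(Map.Entry::getKey,
                        e -> e.getValue().stream().mapToInt(Integer::intValue).sum()));

        // 6. Provide the final output: deer=2, bear=2, river=2, car=3 (order may vary).
        System.out.println(reduced);
    }
}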
