
Unit-5

Hadoop
1. Hadoop is an open-source software framework used for storing data and running
applications on clusters of commodity hardware.

2. It provides massive storage for any kind of data, enormous processing power and the
ability to handle virtually limitless concurrent tasks or jobs.

3. The Hadoop ecosystem is a framework of various types of complex and evolving tools and
components. Some of these elements are very different from each other in terms of their
architecture; however, what keeps them all together under a single roof is that they all
derive their functionalities from the scalability and power of Hadoop.

4. The Hadoop ecosystem can be defined as a comprehensive collection of tools and
technologies that can be effectively implemented and deployed to provide big data
solutions in a cost-effective manner.

5. MapReduce and the Hadoop Distributed File System (HDFS) are the two core components of the
Hadoop ecosystem that are used to manage big data. However, on their own they are not
sufficient to deal with all big data challenges.

6. Along with these two, the Hadoop ecosystem provides a collection of various elements to
support the complete development and deployment of big data solutions.

Use of Hadoop:

1. Ability to store and process huge amounts of any kind of data quickly.
2. Computing power: Hadoop's distributed computing model processes big data fast.
3. Fault tolerance: Data and application processing are protected against hardware failure.
If a node goes down, jobs are automatically redirected to other nodes to make sure that
distributed computing does not fail. Multiple copies of all data are stored automatically.
4. Flexibility: Unlike traditional relational databases, we do not have to preprocess data
before storing it. We can store as much data as we want and decide how to use it later. That
includes unstructured data like text, images and videos.
5. Low cost: The open-source framework is free and uses commodity hardware to store
large quantities of data.
6. Scalability: We can easily grow our system to handle more data simply by adding nodes.

Features of Hadoop:

1. Suitable for big data analysis:

i. As big data tends to be distributed and unstructured in nature, Hadoop clusters are best
suited for analysis of big data.
ii. Since it is the processing logic (not the actual data) that flows to the computing nodes, less
network bandwidth is consumed.
iii. This concept is called data locality, and it helps to increase the efficiency of
Hadoop-based applications.

2. Scalability:

i. Hadoop clusters can easily be scaled to any extent by adding additional cluster nodes, thus
allowing for the growth of big data.
ii. Scaling does not require modifications to application logic.

3. Fault tolerance:

i. The Hadoop ecosystem has a provision to replicate the input data onto other cluster nodes.
ii. In case of a cluster node failure, data processing can still proceed by using the data stored
on another cluster node.

Modules of Hadoop:

1. HDFS (Hadoop Distributed File System): Files are broken into blocks and stored on nodes
across the distributed architecture.

2. YARN (Yet Another Resource Negotiator): It is used for job scheduling and managing the
cluster.

3. MapReduce:

i. This is a framework that helps Java programs perform parallel computation on data
using key-value pairs.
ii. The Map task takes input data and converts it into a data set that can be computed as
key-value pairs.
iii. The output of the Map task is consumed by the Reduce task, and the Reducer then gives the
desired result (a minimal mapper sketch follows this list).

4. Hadoop Common: These Java libraries are used to start Hadoop and are used by other
Hadoop modules.
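
To make the key-value idea concrete, here is a minimal mapper sketch in Java using the Hadoop MapReduce API. It assumes the classic word-count scenario; the class name and the word-count example itself are illustrative and not taken from these notes.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits one (word, 1) pair for every word in every input line.
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The record reader supplies each input line as (byte offset, line text).
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // intermediate key-value pair
        }
    }
}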

Advantages of Hadoop
o Fast: In HDFS, the data is distributed over the cluster and mapped, which helps in
faster retrieval. Even the tools to process the data are often on the same servers,
thus reducing the processing time. Hadoop is able to process terabytes of data in minutes
and petabytes in hours.
o Scalable: Hadoop cluster can be extended by just adding nodes in the cluster.
o Cost Effective: Hadoop is open source and uses commodity hardware to store data,
so it is really cost-effective compared to a traditional relational database management
system.
o Resilient to failure: HDFS has the property that it can replicate data over the
network, so if one node is down or some other network failure happens, then
Hadoop takes the other copy of the data and uses it. Normally, data is replicated three
times, but the replication factor is configurable (see the sketch below).
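
As an illustration of the configurable replication factor, a small sketch using the HDFS Java FileSystem API; the file path is hypothetical and the cluster configuration files are assumed to be on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationExample {
    public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS and dfs.replication from the cluster configuration files.
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            // Ask HDFS to keep three replicas of a hypothetical file that is already stored.
            fs.setReplication(new Path("/data/sample.txt"), (short) 3);
        }
    }
}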

Hadoop Architecture

NameNode

o It is a single master server that exists in the HDFS cluster.
o As it is a single node, it may become the cause of a single point of failure.
o It manages the file system namespace by executing operations such as opening,
renaming, and closing files.
o It simplifies the architecture of the system.
DataNode

o The HDFS cluster contains multiple DataNodes.
o Each DataNode contains multiple data blocks.
o These data blocks are used to store data.
o It is the responsibility of the DataNode to serve read and write requests from the file system's
clients (a client-side read sketch follows this section).
o It performs block creation, deletion, and replication upon instruction from the
NameNode.
Job Tracker

o The role of the Job Tracker is to accept MapReduce jobs from clients and process the
data by using the NameNode.
o In response, the NameNode provides metadata to the Job Tracker.
Task Tracker

o It works as a slave node for the Job Tracker.
o It receives tasks and code from the Job Tracker and applies that code to the file. This
process can also be called a Mapper.
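
The sketch below, again using the HDFS Java FileSystem API with a hypothetical path, shows the client-side view of this architecture: opening a file consults the NameNode's metadata, while the actual bytes are streamed from the DataNodes that hold the blocks.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf);
             // open() consults the NameNode for the file's block locations (metadata);
             // the returned stream then reads those blocks from the DataNodes.
             FSDataInputStream in = fs.open(new Path("/data/sample.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}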

MapReduce
1. MapReduce is based on a parallel programming framework used to process large amounts of
data dispersed across different systems.

2. The process is initiated when a user request is received to execute the MapReduce
program and terminated once the results are written back to the HDFS (Hadoop Distributed
File System).

3. MapReduce facilitates the processing and analysis of both unstructured and semi-
structured data collected from different sources, which may not be analyzed effectively by
other traditional tools.

4. MapReduce enables computational processing of data stored in a file system without the
requirement of loading the data initially into a database.

5. It primarily supports two operations, map and reduce.

6. These operations execute in parallel on a set of worker nodes.

7. MapReduce works on a master-worker approach in which the master process controls
and directs the entire activity, such as collecting, segregating, and delegating the data
among the different workers.

Working and Phases of MapReduce

1. The MapReduce algorithm contains two important tasks, namely Map and Reduce:

i. The Map task takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key-value pairs).

ii. The Reduce task takes the output from the Map as an input and combines those data
tuples (key-value pairs) into a smaller set of tuples.

2. The Reduce task is always performed after the Map task (a minimal reducer sketch follows).
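
Continuing the hypothetical word-count example from earlier, a minimal reducer sketch: after the shuffle and sort step, all counts for one word arrive together and are combined into a single (word, total) pair.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Combines the (word, 1) pairs produced by the mapper into (word, total) pairs.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // All values sharing this key arrive together after the shuffle and sort step.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result); // a smaller set of tuples than the input
    }
}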


Phases of MapReduce:

1. Input phase: Here we have a record reader that translates each record in an input file and
sends the parsed data to the mapper in the form of key-value pairs.

2. Map: Map is a user-defined function, which takes a series of key-value pairs and
processes each one of them to generate zero or more key-value pairs.

3. Intermediate keys: The key-value pairs generated by the mapper are known as
intermediate keys.

4. Combiner:

i. A combiner is a type of local reducer that groups similar data from the map phase into
identifiable sets.

ii. It takes the intermediate keys from the mapper as input and applies a user-defined code
to aggregate the values in the small scope of one mapper.

iii. It is not a part of the main MapReduce algorithm; it is optional.

5. Shuffle and sort:

i. The Reducer task starts with the shuffle and sort step.

ii. It downloads the grouped key-value pairs onto the local machine where the reducer is
running.

iii. The individual key-value pairs are sorted by key into a larger data list.

iv. The data list groups the equivalent keys together so that their values can be iterated over
easily in the reducer task.

6. Reducer:

i. The reducer takes the grouped key-value paired data as input and runs a reducer function
on each one of them.

ii. Here, the data can be aggregated, filtered, and combined in a number of ways, and this
can require a wide range of processing.

iii. Once the execution is over, it gives zero or more key-value pairs to the final step.

7. Output phase:

In the output phase, we have an output formatter that translates the final key-value pairs
from the reducer function and writes them to a file using a record writer. (A driver sketch
that wires all of these phases together follows.)
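
A driver sketch for the hypothetical word-count job: it registers the mapper and reducer sketches shown earlier, reuses the reducer as the optional combiner, and points the input and output phases at hypothetical HDFS paths. The shuffle and sort step between map and reduce is handled by the framework itself and needs no code here.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenizerMapper.class);   // map phase
        job.setCombinerClass(IntSumReducer.class);   // optional local reducer (combiner)
        job.setReducerClass(IntSumReducer.class);    // reduce phase

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Hypothetical HDFS paths for the input records and the final output files.
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
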
Features of MapReduce:

1. Scheduling:

i. MapReduce involves two operations, map and reduce, which are executed by dividing
large problems into smaller chunks that are run in parallel by different computing resources.

ii. The operation of breaking tasks into subtasks and running these subtasks independently
in parallel is called mapping, which is performed ahead of the reduce operation.

2. Synchronization:

i. Execution of several concurrent processes requires synchronization.

ii. The MapReduce program execution framework is aware of the mapping and reducing
operations that are taking place in the program.

3. Co-location of code/data (Data locality):

i. The effectiveness of a data processing mechanism depends on the location of the code and
the data required for the code to execute.

ii. The best result is obtained when both code and data reside on the same machine.

iii. This means that the co-location of the code and data produces the most effective
processing outcome.

4. Handling of errors/faults:
i. MapReduce engines provide a high level of fault tolerance and robustness in handling
errors.

ii. The reason for building this robustness into the engines is the high tendency of the
underlying nodes to experience errors or faults.

5. Scale-out architecture:

i. MapReduce engines are built in such a way that they can accommodate more machines,
as and when required.

ii. This possibility of introducing more computing resources to the architecture makes the
MapReduce programming model more suited to the higher computational demands of big
data.

Working of MapReduce algorithm:

1. Take a large dataset or set of records.

2. Perform iteration over the data.

3. Extract some interesting patterns to prepare an output list by using the map function.

4. Arrange the output list properly to enable optimization for further processing.

5. Compute a set of results by using the reduce function.

6. Provide the final output (a plain-Java walkthrough of these steps follows).
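
As a minimal sketch of these six steps without any Hadoop dependencies, the following plain-Java program runs hypothetical word-count records through map, group/sort, and reduce in memory; the sample records are made up for illustration.

import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MapReduceWalkthrough {
    public static void main(String[] args) {
        // 1. Take a (small) set of records.
        List<String> records = Arrays.asList("deer bear river", "car car river", "deer car bear");

        // 2-3. Iterate over the data and map each record to (word, 1) pairs.
        List<SimpleEntry<String, Integer>> mapped = records.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .map(word -> new SimpleEntry<>(word, 1))
                .collect(Collectors.toList());

        // 4. Arrange the output list: group the values of equal keys together (shuffle/sort).
        Map<String, List<Integer>> grouped = mapped.stream()
                .collect(Collectors.groupingBy(SimpleEntry::getKey,
                        Collectors.mapping(SimpleEntry::getValue, Collectors.toList())));

        // 5. Reduce: compute one result per key by summing its values.
        Map<String, Integer> reduced = grouped.entrySet().stream()
                .collect(Collectors.toMap(Map.Entry::getKey,
                        e -> e.getValue().stream().mapToInt(Integer::intValue).sum()));

        // 6. Provide the final output: deer=2, bear=2, river=2, car=3 (order may vary).
        System.out.println(reduced);
    }
}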
