Unit 3
MapReduce is a framework with which we can write applications that process huge amounts
of data, in parallel, on large clusters of commodity hardware in a reliable manner.
To take advantage of Hadoop's parallel processing, a query must be expressed as a MapReduce job.
MapReduce is a processing technique and a programming model for distributed computing based
on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce.
Map takes a set of data and converts it into another set of data, where individual elements are
broken down into tuples (key/value pairs). The Reduce task then takes the output
of a Map as its input and combines those data tuples into a smaller set of tuples.
The Algorithm
● Generally, the MapReduce paradigm is based on sending the computation to where the data
resides.
● A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and
the reduce stage (a WordCount sketch illustrating the map and reduce stages follows this list).
➢ Map stage: The map or mapper’s job is to process the input data. Generally,
the input data is in the form of a file or directory and is stored in the Hadoop
Distributed File System (HDFS). The input file is passed to the mapper function line by line.
The mapper processes the data and creates several small chunks of data.
➢ Reduce stage: This stage is the combination of the Shuffle stage and the
Reduce stage. The Reducer’s job is to process the data that comes from the
mapper. After processing, it produces a new set of output, which is stored
in HDFS.
● During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate
servers in the cluster.
● The framework manages all the details of data-passing such as issuing tasks, verifying
task completion, and copying data around the cluster between the nodes.
● Most of the computing takes place on nodes with the data on local disks, which reduces
network traffic.
● After completion of the given tasks, the cluster collects and reduces the data to form
an appropriate result, and sends it back to the Hadoop server.
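As a minimal sketch of these two stages, the classic WordCount example below (the class names are illustrative, not part of this unit's material) uses the standard org.apache.hadoop.mapreduce API: the map stage breaks each input line into (word, 1) tuples, and the reduce stage combines the tuples for each word into a single (word, count) pair.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map stage: every input line is broken down into (word, 1) tuples.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);                  // emit (word, 1)
        }
    }
}

// Reduce stage: the tuples for each word are combined into a single (word, count) pair.
// (In a real project this class would live in its own file.)
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));      // emit (word, total count)
    }
}

The shuffle stage in between is handled by the framework itself, which groups all values emitted for the same key before handing them to the reducer.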
RecordReader
Apache Hadoop can process arbitrary data such as log files, text files, structured data, etc. We
know that the actual data is stored in HDFS, while an InputSplit is a logical partition of that data.
An InputSplit is the chunk of data processed by a single map task (i.e. one Mapper processes one
InputSplit at a time). Each split must then be divided into records. Note that
InputSplits do not contain the actual data; rather, they hold references to the actual data
(HDFS blocks).
InputFormat is responsible for validating the input data, creating the InputSplits, and dividing
them into records.
RecordReader reads the data from an InputSplit, one record at a time, and converts it into the
key-value pairs that are given as input to the Mapper class.
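As a brief, hedged illustration: the default TextInputFormat supplies a line-based RecordReader, so each record handed to the Mapper is one line of the split, with the byte offset of the line as the key (LongWritable) and the line itself as the value (Text). A driver can select it explicitly as follows (the class and job names are placeholders):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Minimal sketch: TextInputFormat validates the input, creates the InputSplits,
// and provides a RecordReader that turns each line into an (offset, line) key-value pair.
class InputFormatSketch {
    public static void main(String[] args) throws IOException {
        Job job = Job.getInstance(new Configuration(), "record-reader-sketch");
        job.setInputFormatClass(TextInputFormat.class);  // the default, shown here explicitly
    }
}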
Combiner Phase
The Combiner class is used between the Map class and the Reduce class to reduce the
volume of data transferred between Map and Reduce. Usually, the output of the map task is
large, so the amount of data transferred to the reduce task would otherwise be high.
(Figure: MapReduce task diagram showing the Combiner phase.)
The Combiner phase takes each key-value pair from the Map phase, processes it, and
produces its output as key-value-collection pairs: it reads the key-value pairs and groups the
values for each common key (for example, each word) into a collection. Usually, the code and
operation of a Combiner are similar to those of a Reducer. A code snippet for the Mapper,
Combiner, and Reducer class declarations follows.
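As a minimal sketch (reusing the illustrative WordCountMapper and WordCountReducer classes from the earlier example), the Mapper, Combiner, and Reducer can be declared on the job as follows. These lines sit inside a job driver's main() method; a full driver sketch appears at the end of this MapReduce discussion. Because the combiner performs the same aggregation as the reducer, the reducer class is registered for both roles.

// Declaring the Mapper, Combiner and Reducer classes on the job
// (WordCountMapper / WordCountReducer are the illustrative classes shown earlier).
job.setMapperClass(WordCountMapper.class);
job.setCombinerClass(WordCountReducer.class);   // the Combiner runs the same logic as the Reducer
job.setReducerClass(WordCountReducer.class);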
Reducer Phase
The Reducer phase takes each key-value-collection pair from the Combiner phase, processes
it, and passes its output on as key-value pairs. Note that the Combiner's functionality is the same
as the Reducer's.
A Hadoop cluster can comprise a single node (a single-node cluster) or thousands of nodes.
Once you have installed Hadoop, you can try out the following basic commands to work
with HDFS (a short usage example follows the list):
▪ hadoop fs -ls
▪ hadoop fs -put <path_of_local> <path_in_hdfs>
▪ hadoop fs -get <path_in_hdfs> <path_of_local>
▪ hadoop fs -cat <path_of_file_in_hdfs>
▪ hadoop fs -rmr <path_in_hdfs>
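For example, assuming a local file named sample.txt and an HDFS directory /user/hadoop (both names are hypothetical), a file can be copied into HDFS and read back with:
▪ hadoop fs -put sample.txt /user/hadoop/sample.txt
▪ hadoop fs -cat /user/hadoop/sample.txt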
With the help of the following diagram, let us try to understand the different components of
a Hadoop cluster.
(Figure: a 4-node Hadoop cluster.)
In the diagram, the Name Node, Secondary Name Node and the Job Tracker are running on a
single machine. Usually, in production clusters with more than 20-30 nodes, these daemons run
on separate nodes.
Hadoop follows a Master-Slave architecture. As mentioned earlier, a file in HDFS is split into
blocks and replicated across Data Nodes in a Hadoop cluster. In the diagram, the three files
A, B and C have been split into blocks and spread, with a replication factor of 3, across the
different Data Nodes.
Name Node
The Name Node in Hadoop is the node where Hadoop stores all the location information of
the files in HDFS. In other words, it holds the metadata for HDFS. Whenever a file is placed
in the cluster, a corresponding entry of its location is maintained by the Name Node. So, for the
files A, B and C, we would have something like the following in the Name Node:
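Purely for illustration (the block and node names below are hypothetical, and the real metadata format is internal to Hadoop), the mapping could look like this for a replication factor of 3:
File A -> blocks A1, A2; each block stored on three Data Nodes (e.g. Data Nodes 1, 2, 3)
File B -> blocks B1, B2; each block stored on three Data Nodes
File C -> block C1; stored on three Data Nodes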
This information is required when retrieving data from the cluster as the data is spread across
multiple machines. The Name Node is a Single Point of Failure for the Hadoop Cluster.
IMPORTANT - The Secondary Name Node is not a failover node for the Name Node.
The Secondary Name Node is responsible for performing periodic housekeeping functions for
the Name Node. It only creates checkpoints of the file system metadata held in the Name Node.
Data Node
The Data Node is responsible for storing the files in HDFS. It manages the file blocks within
the node. It sends information to the Name Node about the files and blocks stored in that
node and responds to the Name Node for all file system operations.
Job Tracker
The Job Tracker is responsible for taking in requests from a client and assigning Task
Trackers the tasks to be performed. The Job Tracker tries to assign each task to the Task
Tracker on the Data Node where the data is locally present (data locality). If that is not
possible, it will at least try to assign the task to a Task Tracker within the same rack. If for some
reason a node fails, the Job Tracker assigns the task to another Task Tracker where a
replica of the data exists, since the data blocks are replicated across the Data Nodes. This
ensures that the job does not fail even if a node fails within the cluster.
Task Tracker
The Task Tracker is a daemon that accepts tasks (Map, Reduce and Shuffle) from the Job
Tracker. The Task Tracker keeps sending a heartbeat message to the Job Tracker to notify it
that it is alive. Along with the heartbeat, it also reports the number of free slots it has available to
process tasks. The Task Tracker starts and monitors the Map and Reduce tasks and sends
progress/status information back to the Job Tracker.
1. A Client (usually a MapReduce driver program, like the sketch shown after this list) submits a job to the Job Tracker.
2. The Job Tracker gets information from the Name Node about the location of the data
within the Data Nodes. The Job Tracker places the client program (usually a jar file
along with its configuration file) in HDFS. Once placed, the Job Tracker tries to
assign tasks to Task Trackers on the Data Nodes based on data locality.
3. The Task Tracker takes care of starting the Map tasks on the Data Nodes by picking
up the client program from the shared location on the HDFS.
4. The progress of the operation is relayed back to the Job Tracker by the Task Tracker.
5. On completion of a Map task, an intermediate file is created on the local file system
of the Task Tracker.
6. Results from Map tasks are then passed on to the Reduce task.
7. The Reduce task works on all the data received from the Map tasks and writes the final
output to HDFS.
8. After the job completes, the intermediate data generated by the Task Trackers is
deleted.
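To make the workflow above concrete, the following is a minimal sketch of a client driver program (the class name and command-line input/output paths are illustrative) that configures the WordCount job sketched earlier, points it at input and output locations in HDFS, and submits it to the cluster as in step 1.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative driver: submits the WordCount job sketched earlier in this unit.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");       // step 1: the client creates the job
        job.setJarByClass(WordCountDriver.class);            // the jar is shipped to the cluster via HDFS
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // final output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);        // submit and wait for completion
    }
}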
A very important feature of Hadoop to note here is that the program goes to where the data is,
and not the other way around, resulting in efficient processing of the data.
Google App Engine (GAE) is a Platform as a Service (PaaS) cloud-based Web hosting
service on Google's infrastructure. For an application to run on GAE, it must comply with
Google's platform standards, which narrows the range of applications that can be run and
severely limits those applications' portability.
Google App Engine lets you run web applications on Google's infrastructure.
● Easy to build.
● Easy to maintain.
● Easy to scale as traffic and storage needs grow.
Java
● App Engine runs Java apps on a Java 7 virtual machine (currently Java 6 is supported
as well).
● Uses the Java Servlet standard for web applications (a minimal servlet sketch follows this list):
● WAR (Web Application Archive) directory structure.
● Servlet classes
● Java Server Pages (JSP)
● Static and data files
● Deployment descriptor (web.xml)
● Other configuration files
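As a brief, hedged illustration of the servlet standard listed above (the class name and URL pattern are hypothetical), a Java App Engine application is essentially an ordinary servlet packaged in a WAR together with its web.xml deployment descriptor:

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Minimal servlet: App Engine routes requests to it according to web.xml.
public class HelloServlet extends HttpServlet {
    @Override
    public void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        resp.setContentType("text/plain");
        resp.getWriter().println("Hello from Google App Engine");
    }
}

Inside the WAR, the deployment descriptor (web.xml) would map a URL pattern such as /hello to this servlet class.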
Python
● Local development servers are available to anyone for developing and testing local
applications.
● Only whitelisted applications can be deployed on Google App Engine.
Google’s Go