
UNIT V PROGRAMMING MODEL

Introduction to Hadoop Framework - MapReduce, Input splitting, map and reduce functions, specifying input and output parameters, configuring and running a job - Developing MapReduce Applications - Design of Hadoop file system - Setting up Hadoop Cluster - Aneka: Cloud Application Platform, Thread Programming, Task Programming and Map-Reduce Programming in Aneka.
Introduction to Hadoop Framework

Apache Hadoop is an open-source, Java-based programming framework that supports the processing of large data sets in a distributed computing environment.

The Apache Hadoop framework is composed of the following modules:

1. Hadoop Common – a collection of common utilities and libraries that support the other Hadoop modules.

2. Hadoop Distributed File System (HDFS) – the primary distributed storage system used by Hadoop applications to hold large volumes of data. HDFS is scalable and fault-tolerant and works closely with a wide variety of concurrent data access applications.

3. Hadoop YARN (Yet Another Resource Negotiator) – Apache Hadoop YARN is the resource management and job scheduling technology in the open source Hadoop distributed processing framework. YARN is responsible for allocating system resources to the various applications running in a Hadoop cluster and scheduling tasks to be executed on different cluster nodes.

4. Hadoop MapReduce – MapReduce is a framework using which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner.
❖ In a Hadoop cluster, data is distributed to all the nodes of the cluster as it is
being loaded in. The Hadoop Distributed File System (HDFS) will split large data
files into chunks which are managed by different nodes in the cluster.
❖ In addition to this, each chunk is replicated across several machines, so that a
single machine failure does not result in any data being unavailable.

❖ Even though the file chunks are replicated and distributed across several machines,
they form a single namespace, so their contents are universally accessible.
MAPREDUCE in Hadoop

❖ Hadoop will not run just any program and distribute it across a cluster. Programs must
be written to conform to a particular programming model, named "MapReduce."

❖ In MapReduce, records are processed by tasks called Mappers. The output from the
Mappers is then brought together into a second set of tasks called Reducers, where
results from different mappers can be merged together.

Hadoop Architecture:

❖ HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients.
❖ In addition, there are a number of DataNodes, usually one per node in the cluster,
which manage storage attached to the nodes that they run on.
❖ HDFS exposes a file system namespace and allows user data to be stored in files.
Internally, a file is split into one or more blocks and these blocks are stored in a set of
DataNodes.
❖ The NameNode executes file system namespace operations like opening, closing,
and renaming files and directories. It also determines the mapping of blocks to
DataNodes.
❖ The DataNodes are responsible for serving read and write requests from the file
system’s clients. The DataNodes also perform block creation, deletion, and
replication upon instruction from the NameNode.
❖ The NameNode and DataNode are pieces of software designed to run on commodity
machines.
❖ HDFS is built using the Java language; any machine that supports Java can
run the NameNode or the DataNode software.

Specifying input and output parameters – Data Format

MapReduce has a simple model of data processing: inputs and outputs for the map and reduce functions are key-value pairs.
MapReduce Types:
The map and reduce functions in Hadoop MapReduce have the following general form:
map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
In general, the map input key and value types (K1 and V1) are different from the map output
types (K2 and V2). However, the reduce input must have the same types as the map output,
although the reduce output types may be different again (K3 and V3).

InputFormat: Hadoop can process many different types of data formats, from flat text files to
databases. In this section, we explore the different formats available.

How the input files are split up and read is defined by the InputFormat. An InputFormat is a
class that provides the following functionality:

● Selects the files or other objects that should be used for input
● Defines the InputSplits that break a file into tasks
● Provides a factory for RecordReader objects that read the file
Input Splits and Records:
An input split is a chunk of the input that is processed by a single map. Each map processes a
single split. Each split is divided into records, and the map processes each record—a key-value
pair—in turn. Splits and records are logical.
FileInputFormat:
FileInputFormat is the base class for all implementations of InputFormat that use files as their data
source.
It provides two things: a place to define which files are included as the input to a job, and an
implementation for generating splits for the input files. The job of dividing splits into records is
performed by subclasses.
Preventing splitting:
Some applications don’t want files to be split, so that a single mapper can process each input file in its entirety. For example, a simple way to check if all the records in a file are sorted is to go through the records in order, checking whether each record is not less than the preceding one. Implemented as a map task, this algorithm will work only if one map processes the whole file.
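One common way to prevent splitting (shown here as a hedged sketch; the subclass name NonSplittableTextInputFormat is illustrative) is to subclass a concrete FileInputFormat such as TextInputFormat and override its isSplitable() method to return false, so that every input file becomes a single split handled by one mapper:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Illustrative subclass: each input file becomes exactly one split,
// so a single mapper sees the whole file.
public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split, regardless of file size
    }
}

A job would then select this class with job.setInputFormatClass(NonSplittableTextInputFormat.class).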
Text Input:
Hadoop excels at processing unstructured text.
TextInputFormat:
TextInputFormat is the default InputFormat. Each record is a line of input. The key, a LongWritable, is the byte offset within the file of the beginning of the line. The value is the contents of the line, excluding any line terminators (newline, carriage return).
KeyValueTextInputFormat:
TextInputFormat’s keys, being simply the offset within the file, are not normally very useful. It is
common for each line in a file to be a key-value pair, separated by a delimiter such as a tab character.
For example, this is the output produced by TextOutputFormat, Hadoop’s default OutputFormat. To
interpret such files correctly, KeyValueTextInputFormat is appropriate.
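By default the key-value separator is a tab character; it can be changed through configuration. A minimal sketch, assuming the Hadoop 2.x property name used by the key-value line record reader:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

// Assumed property name; switches the key-value delimiter from tab to a comma.
Configuration conf = new Configuration();
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
Job job = Job.getInstance(conf, "key-value input example");
job.setInputFormatClass(KeyValueTextInputFormat.class);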
NLineInputFormat
With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable number of
lines of input. The number depends on the size of the split and the length of the lines. If you want
your mappers to receive a fixed number of lines of input, then NLineInputFormat is the InputFormat
to use. Like TextInputFormat, the keys are the byte offsets within the file and the values are the lines
themselves.
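The number of lines per mapper is configurable. A minimal sketch, assuming an existing Job object named job and the static helper provided by the new-API NLineInputFormat:

import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

// Each mapper will receive splits of (at most) 10 input lines.
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.setNumLinesPerSplit(job, 10);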

XML :
Most XML parsers operate on whole XML documents, so if a large XML document is made up of
multiple input splits, then it is a challenge to parse these individually. Of course, you can process the
entire XML document in one mapper (if it is not too large) using the technique in “Processing a
whole file as a record” .

Binary Input:
Hadoop MapReduce is not just restricted to processing textual data—it also has support for binary formats.

SequenceFileInputFormat:
Hadoop’s sequence file format stores sequences of binary key-value pairs. Sequence files are well suited as a format for MapReduce data since they are splittable (they have sync points so that readers can synchronize with record boundaries from an arbitrary point in the file, such as the start of a split), they support compression as a part of the format, and they can store arbitrary types using a variety of serialization frameworks.

Database Input (and Output)

DBInputFormat is an input format for reading data from a relational database, using JDBC.

Output Formats

Hadoop has output data formats that correspond to the input formats.

Text Output:
The default output format, TextOutputFormat, writes records as lines of text. Its keys and values may be of any type, since TextOutputFormat turns them into strings by calling toString() on them. Each key-value pair is separated by a tab character.
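The tab separator can be overridden through configuration. A minimal sketch, assuming the Hadoop 2.x property name and an existing Job object named job:

// Assumed property name; changes the separator TextOutputFormat writes between key and value.
job.getConfiguration().set("mapreduce.output.textoutputformat.separator", ",");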

OutputFormat: The (key, value) pairs provided to this OutputCollector are then written to output
files.
❖ Hadoop can process many different types of data formats, from flat text files to
databases.
❖ If it is flat file, the data is stored using a line-oriented ASCII format, in which each
line is a record.
❖ For example, in the NCDC (National Climatic Data Center) data given below, the format supports a rich set of meteorological elements, many of which are optional or have variable data lengths.

Data files are organized by date and weather station.


Analyzing the Data with Hadoop
To take advantage of the parallel processing that Hadoop provides, express the query as a
MapReduce job.
Map and Reduce:
MapReduce works by breaking the processing into two phases: the map phase and the reduce
phase. Each phase has key-value pairs as input and output, the types of which may be chosen
by the programmer. The programmer also specifies two functions: the map function and the
reduce function.

The input to map phase is the raw NCDC data. We choose a text input format that gives us
each line in the dataset as a text value. The key is the offset of the beginning of the line from
the beginning of the file.

To visualize the way the map works, consider a few sample lines of NCDC input data (the sample records themselves are not reproduced here). These lines are presented to the map function as key-value pairs. The keys are the line offsets within the file, which we ignore in our map function. The map function merely extracts the year and the air temperature from each line and emits them as its output (the temperature values have been interpreted as integers):
(1950, 0)

(1950, 22)
(1950, −11)
(1949, 111)
(1949, 78)
The output from the map function is processed by the MapReduce framework before
being sent to the reduce function. This processing sorts and groups the key- value pairs by
key. So, continuing the example, our reduce function sees the following input:
(1949, [111, 78])
(1950, [0, 22, −11])
Each year appears with a list of all its air temperature readings. All the reduce function has to
do now is iterate through the list and pick up the maximum reading:
(1949, 111)
(1950, 22)
This is the final output: the maximum global temperature recorded in each year. The whole
data flow is illustrated in the following figure.
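To make the walkthrough concrete, here is a hedged sketch of the mapper and reducer in the Hadoop Java API. The class names are illustrative, and the fixed character offsets used to pull out the year and the temperature are assumptions about the NCDC record layout:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Assumed fixed-width offsets: year in columns 15-18, signed temperature in 87-91.
        String year = line.substring(15, 19);
        int airTemperature = Integer.parseInt(line.substring(87, 92).replace("+", ""));
        context.write(new Text(year), new IntWritable(airTemperature));
    }
}

// (In a real project this would be a public class in its own source file.)
class MaxTemperatureReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int maxValue = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            maxValue = Math.max(maxValue, value.get()); // keep the largest reading per year
        }
        context.write(key, new IntWritable(maxValue));
    }
}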
Combiner Functions

❖ Hadoop allows the user to specify a combiner function to be run on the map output—
the combiner function’s output forms the input to the reduce function.
❖ Since the combiner function is an optimization, Hadoop does not provide a guarantee
of how many times it will call it for a particular map output record, if at all.
❖ This is best illustrated with an example. Suppose that for the maximum temperature
example, readings for the year 1950 were processed by two maps (because they were
in different splits). Imagine the first map produced the output:

(1950, 0)
(1950, 20)
(1950, 10)
And the second produced:
(1950, 25)
(1950, 15)
The reduce function would be called with a list of all the values:
(1950, [0, 20, 10, 25, 15])
with output:
(1950, 25)
since 25 is the maximum value in the list. We could use a combiner function that, just like the reduce function, finds the maximum temperature for each map output. The reduce would then be called with:
(1950, [20, 25])
and the reduce would produce the same output as before. Expressed as function calls on the temperature values, the calculation is:
max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25

Configuring and Running a Job:
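As a hedged illustration of the usual pattern, a driver class builds a Job object, wires up the mapper, the reducer and (optionally) the combiner discussed above, sets the output key/value types and the input/output paths, and then submits the job. The class names and paths below are placeholders that match the max-temperature sketch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "max temperature");

        job.setJarByClass(MaxTemperatureDriver.class);
        job.setMapperClass(MaxTemperatureMapper.class);
        job.setCombinerClass(MaxTemperatureReducer.class); // optional optimization
        job.setReducerClass(MaxTemperatureReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The packaged job would then typically be launched from the command line with something like hadoop jar maxtemperature.jar MaxTemperatureDriver <input path> <output path>, where the jar name and paths are placeholders.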

Design of HDFS
HDFS is a file system designed for storing very large files with streaming data access patterns,
running on clusters of commodity hardware.

“Very large”

Files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters
running today that store petabytes of data.

Streaming data access

HDFS is built around the idea that the most efficient data processing pattern is a
write-once, read-many-times pattern.

Commodity hardware

Hadoop doesn’t require expensive, highly reliable hardware to run on. It’s designed to run on
clusters of commodity hardware (commonly available hardware available from multiple
vendors) .

These are areas where HDFS is not a good fit today:

Low-latency data access

Applications that require low-latency access to data, in the tens of milliseconds range, will not work well with HDFS. Remember, HDFS is optimized for delivering a high throughput of data, and this may be at the expense of latency.

Lots of small files:

Since the namenode holds filesystem metadata in memory, the limit to the number of files in a filesystem is governed by the amount of memory on the namenode. As a rule of thumb, each file, directory, and block takes about 150 bytes. So, for example, if you had one million files, each taking one block, you would need at least 300 MB of memory. While storing millions of files is feasible, billions is beyond the capability of current hardware.

Multiple writers, arbitrary file modifications:

Files in HDFS may be written to by a single writer. Writes are always made at the end of the file. There is no support for multiple writers, or for modifications at arbitrary offsets in the file.
HDFS Concepts
The following diagram illustrates the Hadoop concepts

Important components in HDFS Architecture are:


● Blocks
● Name Node
● Data Nodes
HDFS Blocks

❖ HDFS is a block structured file system. Each HDFS file is broken into blocks of
fixed size usually 128 MB which are stored across various data nodes on the cluster.
Each of these blocks is stored as a separate file on local file system on data nodes
(Commodity machines on cluster).
❖ Thus to access a file on HDFS, multiple data nodes need to be referenced and the list
of the data nodes which need to be accessed is determined by the file system metadata
stored on Name Node.
❖ So, any HDFS client trying to access/read a HDFS file, will get block information from
Name Node first, and then based on the block id’s and locations, data will be read from
corresponding data nodes/computer machines on cluster.
❖ HDFS’s fsck command is useful to get the files and blocks details of file system.
❖ Example: The following command lists the blocks that make up each file in the file system.

$ hadoop fsck / -files -blocks

Advantages of Blocks
1. Quick Seek Time:
By default, the HDFS block size is 128 MB, which is much larger than that of any other file system. In HDFS, a large block size is maintained to reduce the seek time for block access.
2. Ability to Store Large Files:
Another benefit of this block structure is that there is no need to store all blocks of a file on the same disk or node. So, a file's size can be larger than the size of a disk or node.
3. How Fault Tolerance is achieved with HDFS Blocks:
HDFS blocks feature suits well with the replication for providing fault tolerance and
availability.
By default, each block is replicated to three separate machines. This insures blocks against corruption and disk or machine failure. If a block becomes unavailable, a copy can be read from another machine. A block that is no longer available due to corruption or machine failure can be replicated from one of its alternative machines to other live machines to bring the replication factor back to the normal level (3 by default).
Name Node

❖ Name Node is the single point of contact for accessing files in HDFS, and it determines the block ids and locations for data access. So, the Name Node plays the Master role in the Master/Slaves Architecture, whereas the Data Nodes act as slaves. File System metadata is stored on the Name Node.
❖ File System Metadata contains File names, File Permissions and locations of each
block of files. Thus, Metadata is relatively small in size and fits into Main Memory
of a computer machine. So, it is stored in Main Memory of Name Node to allow fast
access.

Data Node
Data Nodes are the slaves part of Master/Slaves Architecture and on which
actual HDFS files are stored in the form of fixed size chunks of data which are
called blocks.

Data Nodes serve read and write requests of clients on HDFS files and also
perform block creation, replication and deletions.

The Command-Line Interface


There are many other interfaces to HDFS, but the command line is one of the simplest and,
to many developers, the most familiar. It provides a command line interface called FS shell
that lets a user interact with the data in HDFS. The syntax of this command set is similar to
other shells (e.g. bash, csh) that users are already familiar with. Here are some sample
action/command pairs:
Action: Create a directory named /foodir
Command: bin/hadoop dfs -mkdir /foodir

Action: View the contents of a file named /foodir/myfile.txt
Command: bin/hadoop dfs -cat /foodir/myfile.txt

Hadoop Filesystem
Hadoop has an abstract notion of filesystem, of which HDFS is just one implementation.
The Java abstract class org.apache.hadoop.fs.FileSystem represents a filesystem in Hadoop,
and there are several concrete implementations, such as the local filesystem, HDFS, and Amazon S3.
JAVA INTERFACE
Reading Data from a Hadoop URL
One of the simplest ways to read a file from a Hadoop filesystem is by using a java.net.URL
object to open a stream to read the data from.

The general syntax is:

InputStream in = null;
try {
    in = new URL("hdfs://host/path").openStream();
    // process in
} finally {
    IOUtils.closeStream(in);
}

(For the JVM to recognise Hadoop's hdfs URL scheme, the application must first call URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()), which can be done only once per JVM.)

Writing HDFS Files Through FileSystem API:


To write a file in HDFS:
● First we need to get an instance of FileSystem.
● Create a file with the create() method on the file system instance, which will return an FSDataOutputStream.
● We can copy bytes from any other stream to the output stream using IOUtils.copyBytes(), or write directly with write() (or one of its variants) on the FSDataOutputStream object, as sketched below.
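A minimal sketch of these steps, with placeholder host, port and path names:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        String dst = "hdfs://host:9000/user/demo/output.txt"; // placeholder URI
        Configuration conf = new Configuration();
        // 1. Get an instance of FileSystem for the target URI.
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        // 2. create() returns an FSDataOutputStream for the new file.
        FSDataOutputStream out = fs.create(new Path(dst));
        // 3. Copy bytes from a local input stream into HDFS.
        InputStream in = new BufferedInputStream(new FileInputStream("localfile.txt"));
        IOUtils.copyBytes(in, out, 4096, true); // closes both streams when done
    }
}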
Data Flow
Anatomy of File Read
❖ HDFS has a master and slave kind of architecture. Namenode acts as master and
Datanodes as worker.
❖ All the metadata information is with namenode and the original data is stored on the
datanodes.
❖ The figure below gives an idea of how data flows between the Client interacting with HDFS, i.e. the Namenode and the Datanodes.

The following steps are involved in reading the file from HDFS:
Let’s suppose a Client (a HDFS Client) wants to read a file from HDFS.
Step 1: First the Client will open the file by giving a call to open() method on FileSystem
object, which is an instance of DistributedFileSystem class.
Step 2: DistributedFileSystem calls the Namenode, using RPC (Remote Procedure Call), to determine the locations of the blocks for the file. For each block, the NameNode returns the addresses of all the DataNodes that have a copy of that block. The client will interact with the respective DataNodes to read the file. The NameNode also provides a token to the client, which it shows to the DataNodes for authentication.

The DistributedFileSystem returns an object of FSDataInputStream (an input stream that supports file seeks) to the client, which reads data from this FSDataInputStream.

Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the
DataNode addresses for the first few blocks in the file, then connects to the first closest
DataNode for the first block in the file.

Step 4: Data is streamed from the DataNode back to the client, which calls read() repeatedly
on the stream.

Step 5: When the end of the block is reached, DFSInputStream will close the connection to
the DataNode , then find the best DataNode for the next block.

Step 6: Blocks are read in order, with the DFSInputStream opening new connections to
datanodes as the client reads through the stream. It will also call the namenode to retrieve the
datanode locations for the next batch of blocks as needed. When the client has finished
reading, it calls close() on the FSDataInputStream.
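For comparison with the URL-based read shown earlier, a minimal sketch of a client read through FileSystem.open() and the seekable FSDataInputStream described in the steps above (the URI is a placeholder):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        String src = "hdfs://host:9000/user/demo/input.txt"; // placeholder URI
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(src), conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(new Path(src)); // the open() call described in the steps above
            IOUtils.copyBytes(in, System.out, 4096, false);
            in.seek(0); // FSDataInputStream supports seeking back to re-read the file
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}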
Anatomy of File Write

Step 1: The client creates the file by calling create() method on DistributedFileSystem.

Step 2: DistributedFileSystem makes an RPC call to the namenode to create a new file in the
filesystem’s namespace, with no blocks associated with it. The namenode performs various
checks to make sure the file doesn’t already exist and that the client has the right permissions
to create the file.

Step 3: As the client writes data, DFSOutputStream splits it into packets, which it writes to
an internal queue, called the data queue. The data queue is consumed by the DataStreamer,
which is responsible for asking the namenode to allocate new blocks by picking a list of
suitable datanodes to store the replicas. The list of datanodes forms a pipeline, and here we’ll
assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first datanode
in the pipeline, which stores the packet and forwards it to the second datanode in the pipeline.

Step 4: Similarly, the second datanode stores the packet and forwards it to the third (and last)
datanode in the pipeline.

Step 5: DFSOutputStream also maintains an internal queue of packets that are waiting to be
acknowledged by datanodes, called the ack queue. A packet is removed from the ack queue
only when it has been acknowledged by all the datanodes in the pipeline.

Step 6: When the client has finished writing data, it calls close() on the stream.
This action flushes all the remaining packets to the datanode pipeline and waits for
acknowledgments before contacting the namenode to signal that the file is complete.

Setting up a Hadoop Cluster


In general, a computer cluster is a collection of various computers that work collectively as a
single system.

“A hadoop cluster is a collection of independent components connected through a dedicated network to work as a single centralized data processing resource.”

Advantages of a Hadoop Cluster Setup

● As big data grows exponentially, the parallel processing capability of a Hadoop cluster helps increase the speed of the analysis process.
● A Hadoop cluster setup is inexpensive as it is held down by cheap commodity hardware. Any organization can set up a powerful hadoop cluster without having to spend on expensive server hardware.
● Hadoop clusters are resilient to failure: whenever data is sent to a particular node for analysis, it is also replicated to other nodes on the hadoop cluster. If the node fails, the replicated copy of the data present on another node in the cluster can be used for analysis.
There are two ways to install Hadoop, i.e. Single node and Multi node.

Single node cluster means only one DataNode running and setting up all the NameNode,
DataNode, ResourceManager and NodeManager on a single machine. This is used for
studying and testing purposes.
While in a Multi node cluster, there are more than one DataNode running and each
DataNode is running on different machines. The multi node cluster is practically used in
organizations for analyzing Big Data.

Steps to set up a Hadoop Cluster

Step 1: Download the Java package and save the file in the home directory.

Step 2: Extract the Java Tar File.

Step 3: Download the Hadoop 2.9.0 Package.

Step 4: Extract the Hadoop tar File.

Command: tar -xvf hadoop-2.9.0.tar.gz

Step 5: Add the Hadoop and Java paths in the bash file (.bashrc). Open the .bashrc file and add the Hadoop and Java paths as shown below.

Command: gedit .bashrc
Fig: Hadoop Installation – Setting Environment Variable

Then, save the bash file and close it.

For applying all these changes to the current Terminal, execute the source command.

Command: source .bashrc

To make sure that Java and Hadoop have been properly installed on the system and
can be accessed through the Terminal, execute the java -version and hadoop version
commands.

Command: java -version

Fig: Hadoop Installation – Checking Java Version

Command: hadoop version


Fig: Hadoop Installation – Checking Hadoop Version

Step 6: Edit the Hadoop Configuration files.

Command: cd hadoop-2.9.0/etc/hadoop/

Command: ls

All the Hadoop configuration files are located in the hadoop-2.9.0/etc/hadoop directory, as seen in the snapshot below:

Fig: Hadoop Installation – Hadoop Configuration Files

Step 7: Open core-site.xml and edit the property mentioned below inside
configuration tag:
core-site.xml informs Hadoop daemon where NameNode runs in the cluster. It
contains configuration settings of Hadoop core such as I/O settings that are common
to HDFS & MapReduce.

Command: gedit core-site.xml

<?xml version="1.0" encoding="UTF-8"?>


<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

Step 8: Edit hdfs-site.xml and edit the property mentioned below inside configuration tag:

hdfs-site.xml contains configuration settings of HDFS daemons (i.e. NameNode, DataNode, Secondary NameNode). It also includes the replication factor and block size of HDFS.

Command: gedit hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>


<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>

<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
</property>
</configuration>

Step 9: Edit the mapred-site.xml file and edit the property mentioned below inside
configuration tag:

mapred-site.xml contains configuration settings of the MapReduce application, like the number of JVMs that can run in parallel, the size of the mapper and the reducer process, CPU cores available for a process, etc.

In some cases, the mapred-site.xml file is not available, so we have to create it using the mapred-site.xml.template file.

Command: cp mapred-site.xml.template mapred-site.xml

Command: gedit mapred-site.xml

<?xml version="1.0" encoding="UTF-8"?>


<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>

</property>
</configuration>

Step 10: Edit yarn-site.xml and edit the property mentioned below inside
configuration tag:

yarn-site.xml contains configuration settings of ResourceManager and NodeManager, like the application memory management size, the operations needed on programs and algorithms, etc.

Command: gedit yarn-site.xml

<?xml version="1.0"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>

Step 11: Edit hadoop-env.sh and add the Java Path as mentioned below:

hadoop-env.sh contains the environment variables that are used in the script to run
Hadoop like Java home path, etc.

Command: gedit hadoop-env.sh


Fig: Hadoop Installation – Configuring hadoop-env.sh

Step 12: Go to Hadoop home directory and format the NameNode.

Command: cd

Command: cd hadoop-2.9.0

Command: bin/hadoop namenode -format

This formats the HDFS via NameNode. This command is only executed for the first
time. Formatting the file system means initializing the directory specified by the
dfs.name.dir variable.

Step 13: Once the NameNode is formatted, go to the hadoop-2.9.0/sbin directory and start all the daemons.

Command: cd hadoop-2.9.0/sbin

Either start all daemons with a single command or do it individually.

Command: ./start-all.sh

The above command is a combination of start-dfs.sh, start-yarn.sh and mr-jobhistory-daemon.sh.

Step 14: To check that all the Hadoop services are up and running, run the below
command.

Command: jps
Fig: Hadoop Installation – Checking Daemons

Step 15: Now open the Mozilla browser and go to localhost:50070/dfshealth.html to check the NameNode interface.

Fig: Hadoop Installation – Starting WebUI


Cloud Software Environments - Eucalyptus, OpenNebula, OpenStack, Nimbus
a)Eucalyptus

⮚ Eucalyptus is an open source software platform for implementing Infrastructure as a Service (IaaS) in a private or hybrid cloud computing environment.

⮚ It combines existing virtualized infrastructure to create cloud resources for infrastructure as a service, network as a service and storage as a service.

⮚ The name Eucalyptus is an acronym for Elastic Utility Computing Architecture Linking Your Programs To Useful Systems.

⮚ Eucalyptus was founded out of a research project in the Computer Science Department at the University of California, Santa Barbara.

⮚ Eucalyptus Systems announced a formal agreement with Amazon Web Services (AWS) in March 2012, allowing administrators to move instances between a Eucalyptus private cloud and the Amazon Elastic Compute Cloud (EC2) to create a hybrid cloud.

⮚ The partnership also allows Eucalyptus to work with Amazon’s product teams to develop unique AWS-compatible features.

Eucalyptus features include:

● Supports both Linux and Windows virtual machines (VMs).

● Application program interface- (API) compatible with Amazon EC2 platform.

● Compatible with Amazon Web Services (AWS) and Simple Storage Service (S3).
● Works with multiple hypervisors including VMware, Xen and KVM.

● Internal processes communications are secured through SOAP and WS-Security.

● Multiple clusters can be virtualized as a single cloud.

Eucalyptus Architecture

⮚ The Eucalyptus system is an open software environment.


⮚ The following figure shows the architecture based on the need to manage VM
images.
⮚ The system supports cloud programmers in VM image management.
Essentially, the system has been extended to support the development of both
the compute cloud and storage cloud.

The Eucalyptus architecture for VM image management.


Eucalyptus is comprised of several components: Cloud Controller, Walrus, Cluster Controller, Storage Controller, and Node Controller. Each component is a stand-alone web service.

Cloud Controller

The Cloud Controller (CLC) is the entry-point into the cloud for administrators,
developers, project managers, and end-users.

As the interface to the management platform, the CLC is responsible for exposing and
managing the underlying virtualized resources (servers, network, and storage).

Walrus

Walrus allows users to store persistent data, organized as buckets and objects. We can
use Walrus to create, delete, and list buckets, or to put, get, and delete objects, or to set access
control policies.

Walrus is interface compatible with Amazon’s Simple Storage Service (S3). It provides a mechanism for storing and accessing virtual machine images and user data.

Cluster Controller

CCs gather information about a set of NCs and schedules virtual machine (VM)
execution on specific NCs. The CC also manages the virtual machine networks. All NCs
associated with a single CC must be in the same subnet.

Storage Controller

The Storage Controller (SC) provides functionality similar to the Amazon Elastic
Block Store (Amazon EBS). The SC is capable of interfacing with various storage systems.
Node Controller

The Node Controller (NC) executes on any machine that hosts VM instances. The NC
controls VM activities, including the execution, inspection, and termination of VM instances.

VM Image Management

⮚ Eucalyptus stores images in Walrus, the block storage system that is analogous to
the Amazon S3 service.
⮚ This image is uploaded into a user-defined bucket within Walrus, and can be
retrieved anytime from any availability zone.
⮚ The Eucalyptus system is available in a commercial proprietary version as well as
the open source version.

b) OpenNebula

⮚ OpenNebula is an open source toolkit which allows users to transform existing infrastructure into an IaaS cloud with cloud-like interfaces.
⮚ The following figure shows the OpenNebula architecture and its main components.
⮚ The architecture of OpenNebula has been designed to be flexible and modular to allow
integration with different storage and network infrastructure configurations, and
hypervisor technologies.
OpenNebula architecture and its main components

⮚ Here, the core is a centralized component that manages the VM full life cycle,
including setting of networks dynamically for groups of VMs and managing their
storage requirements.
⮚ Another important component is the capacity manager or scheduler.
⮚ The default capacity scheduler is a requirement/rank matchmaker. However, it is also
possible to develop more complex scheduling policies, through a lease model and
advance reservations.
⮚ The last main components are the access drivers. They provide basic functionality of
the monitoring, storage, and virtualization services available in the cluster.
⮚ OpenNebula is not tied to any specific environment and can provide a uniform
management layer regardless of the virtualization platform.
⮚ OpenNebula offers management interfaces to integrate the core's functionality within
other data-center management tools, such as accounting or monitoring frameworks.
⮚ To this end, OpenNebula implements the libvirt API , an open interface for VM
management, as well as a command-line interface (CLI).
⮚ A subset of this functionality is exposed to external users through a cloud interface.
⮚ When the local resources are insufficient, OpenNebula can support a hybrid cloud
model by using cloud drivers to interface with external clouds.
⮚ OpenNebula currently includes an EC2 driver, which can submit requests to Amazon
EC2 and Eucalyptus , as well as an Elastic Hosts driver.

c) OpenStack
▪ OpenStack was introduced by Rackspace and NASA in July 2010.
▪ OpenStack is a set of software tools for building and managing cloud computing
platforms for public and private clouds.
▪ It focuses on the development of two aspects of cloud computing to address compute
and storage aspects with the OpenStack Compute and OpenStack Storage solutions.
▪ “OpenStack Compute for creating and managing large groups of virtual private
servers”
▪ “OpenStack Object Storage software for creating redundant, scalable object
storage using clusters of commodity servers to store terabytes or even petabytes
of data.”
▪ OpenStack is an open source cloud computing platform for all types of clouds, which aims to be simple to implement, massively scalable, and feature rich.
▪ OpenStack provides an Infrastructure-as-a-Service (IaaS) solution through a set of
interrelated services. Each service offers an application programming interface (API)
that facilitates this integration.

OpenStack Compute
⮚ OpenStack is developing a cloud computing fabric controller, a component of an
IAAS system, known as Nova.
⮚ Nova is an OpenStack project designed to provide massively scalable, on demand,
self service access to compute resources.
⮚ The architecture of Nova is built on the concepts of shared-nothing and
messaging-based information exchange.
⮚ Hence, most communication in Nova is facilitated by message queues.
⮚ To prevent blocking components while waiting for a response from others, deferred
objects are introduced. Such objects include callbacks that get triggered when a
response is received.
⮚ To achieve the shared-nothing paradigm, the overall system state is kept in a
distributed data system.
⮚ State updates are made consistent through atomic transactions.

⮚ Nova is implemented in Python while utilizing a number of externally supported libraries and components. This includes boto, an Amazon API provided in Python, and Tornado, a fast HTTP server used to implement the S3 capabilities in OpenStack.
⮚ The Figure shows the main architecture of Open Stack Compute. In this architecture,
the API Server receives HTTP requests from boto, converts the commands to and from
the API format, and forwards the requests to the cloud controller.
OpenStack Nova system architecture
⮚ The cloud controller maintains the global state of the system, ensures authorization
while interacting with the User Manager via Lightweight Directory Access Protocol
(LDAP), interacts with the S3 service, and manages nodes, as well as storage workers
through a queue.
⮚ Additionally, Nova integrates networking components to manage private networks,
public IP addressing, virtual private network (VPN) connectivity, and firewall rules.
⮚ It includes the following types:

• NetworkController manages address and virtual LAN (VLAN) allocations

• RoutingNode governs the NAT (network address translation) conversion of public IPs to private IPs, and enforces firewall rules

• AddressingNode runs Dynamic Host Configuration Protocol (DHCP) services for private networks

• TunnelingNode provides VPN connectivity


OpenStack Storage

Openstack Storage uses the following components to deliver high availability, high
durability, and high concurrency:

● Proxy servers - Handle all of the incoming API requests.


● Rings - Map logical names of data to locations on particular disks.
● Zones - Isolate data from other zones. A failure in one zone does not impact the rest
of the cluster as data replicates across zones.
● Accounts and containers - Each account and container are individual databases that
are distributed across the cluster. An account database contains the list of containers
in that account. A container database contains the list of objects in that container.
● Objects - The data itself.
● Partitions - A partition stores objects, account databases, and container databases and
helps manage locations where data lives in the cluster.

Object Storage building blocks


Proxy servers

Proxy servers are the public face of Object Storage and handle all of the incoming API
requests. Once a proxy server receives a request, it determines the storage node based on the
object’s URL.

Proxy servers use a shared-nothing architecture and can be scaled as needed based on
projected workloads. A minimum of two proxy servers should be deployed for redundancy.
If one proxy server fails, the others take over.

Rings

A ring represents a mapping between the names of entities stored on disks and their physical
locations. There are separate rings for accounts, containers, and objects. When other
components need to perform any operation on an object, container, or account, they need to
interact with the appropriate ring to determine their location in the cluster.

The ring maintains this mapping using zones, devices, partitions, and replicas. Each partition
in the ring is replicated, by default, three times across the cluster, and partition locations are
stored in the mapping maintained by the ring. The ring is also responsible for determining
which devices are used for handoff in failure scenarios.

Accounts and containers

Each account and container is an individual SQLite database that is distributed across the
cluster. An account database contains the list of containers in that account. A container
database contains the list of objects in that container.
Partitions

A partition is a collection of stored data. This includes account databases, container databases,
and objects. Partitions are core to the replication system.

d) Nimbus

⮚ Nimbus is a set of open source tools that together provide an IaaS cloud computing
solution.
⮚ The following figure shows the architecture of Nimbus, which allows a client to lease
remote resources by deploying VMs on those resources and configuring them to
represent the environment desired by the user.
⮚ To this end, Nimbus provides a special web interface known as Nimbus Web. Its aim
is to provide administrative and user functions in a friendly interface.
⮚ Nimbus Web is centered around a Python Django web application that is intended to
be deployable completely separate from the Nimbus service.
⮚ As shown in Figure, a storage cloud implementation called Cumulus has been tightly
integrated with the other central services, although it can also be used stand-alone.
⮚ Cumulus is compatible with the Amazon S3 REST API, but extends its capabilities by including features such as quota management. Therefore, clients such as boto and s3cmd, which work against the S3 REST API, also work with Cumulus.
⮚ On the other hand, the Nimbus cloud client uses the Java Jets3t library to interact with
Cumulus. Nimbus supports two resource management strategies.
⮚ The first is the default “resource pool” mode. In this mode, the service has direct
control of a pool of VM manager nodes and it assumes it can start VMs.
⮚ The other supported mode is called “pilot.” Here, the service makes requests to a
cluster’s Local Resource Management System (LRMS) to get a VM manager available
to deploy VMs.
⮚ Nimbus also provides an implementation of Amazon’s EC2 interface that allows users to use clients developed for the real EC2 system against Nimbus-based clouds.

Figure: Nimbus Cloud Infrastructure


Aneka in Cloud Computing

⮚ Aneka includes an extensible set of APIs associated with programming models like
MapReduce.
⮚ These APIs support different cloud models like a private, public, hybrid Cloud.
⮚ Manjrasoft focuses on creating innovative software technologies to simplify the development and deployment of private or public cloud applications. Its product, Aneka, plays the role of an application platform as a service for multiple cloud computing models.

Multiple Structures:

⮚ Aneka is a software platform for developing cloud computing applications; cloud applications are executed within Aneka.
⮚ Aneka is a pure PaaS solution for cloud computing. It is a cloud middleware product that can be deployed on a heterogeneous set of resources: a network of computers, a multicore server, datacenters, virtual cloud infrastructures, or a mixture of these.
⮚ The framework provides both middleware for managing and scaling distributed applications and an extensible set of APIs for developing them.
⮚ Figure provides a complete overview of the components of the Aneka framework. The
core infrastructure of the system provides a uniform layer that allows the framework
to be deployed over different platforms and operating systems.
⮚ The physical and virtual resources representing the bare metal of the cloud are
managed by the Aneka container, which is installed on each node and constitutes the
basic building block of the middleware. A collection of
interconnected containers constitutes the Aneka Cloud: a single domain in which
services are made available to users and developers. The container features three
different classes of services: Fabric Services, Foundation Services, and Execution
Services.
⮚ These take care of infrastructure management, supporting services for the Aneka Cloud,
and application management and execution, respectively. These services are made
available to developers and administrators by means of the application management and
development layer, which includes interfaces and APIs for developing cloud applications
and the management tools and interfaces for controlling Aneka Clouds.

Aneka framework overview


Anatomy of the Aneka container

⮚ The Aneka container constitutes the building blocks of Aneka Clouds and represents
the runtime machinery available to services and applications. The container, the unit
of deployment in Aneka Clouds, is a lightweight software layer designed to host
services and interact with the underlying operating system and hardware. The main
role of the container is to provide a lightweight environment in which to deploy
services and some basic capabilities such as communication channels through which
it interacts with other nodes in the Aneka Cloud. Almost all operations performed
within Aneka are carried out by the services managed by the container. The services
installed in the Aneka container can be classified into three major categories:
• Fabric Services
• Foundation Services
• Application Services
⮚ The services stack resides on top of the Platform Abstraction Layer (PAL),
representing the interface to the underlying operating system and hardware. It provides
a uniform view of the software and hardware environment in which the container is
running. Persistence and security traverse all the services stack to provide a secure and
reliable infrastructure. In the following sections we discuss the components of these
layers in more detail.
⮚ Fast and Simple: Task Programming Model:
⮚ The Task Programming Model provides developers with the ability to express applications as a collection of independent tasks. Each task can perform different operations, or the same operation on different data, and can be executed in any order by the runtime environment. This is a scenario into which many scientific applications fit, and it is a very popular model for Grid Computing. Task programming also allows the parallelization of legacy applications on the Cloud.
⮚ Concurrent Applications: Thread Programming Model

⮚ Thread Programming Model offers developers the capability of running multithreaded


applications on the Aneka Cloud. The main abstraction of this model is the concept of
thread which mimics the semantics of the common local thread but is executed
remotely in a distributed environment. This model offers finer control on the execution
of the individual components (threads) of an application but requires more
management when compared to Task Programming, which is based on a “submit and
forget” pattern.
⮚ The Aneka Thread supports almost all of the operations available for traditional local
threads. More specifically, an Aneka thread has been designed to mirror the interface of the System.Threading.Thread .NET class, so that developers can easily move
existing multi-threaded applications to the Aneka platform with minimal changes.
Ideally, applications can be transparently ported to Aneka just by substituting local
threads with Aneka Threads and introducing minimal changes to the code. This model
covers all the application scenarios of the Task Programming and solves the additional
challenges of providing a distributed runtime environment for local multi-threaded
applications.
⮚ The SDK (Software Development Kit) includes the Application Programming
Interface (API) and tools needed for the rapid development of applications. The Aneka API supports three popular cloud programming models: Tasks, Threads and MapReduce.

Data Intensive Applications: MapReduce Programing Model


⮚ MapReduce Programming Model is an implementation of the MapReduce model
proposed by Google, in .NET on the Aneka platform. MapReduce has been designed
to process huge quantities of data by using simple operations that extract useful information from a dataset (the map function) and aggregate this information together (the reduce function) to produce the final results. Developers provide the logic for
these two operations and the dataset, and Aneka will do the rest, making the results
accessible when the application is completed.

Task Programming Model:


The Aneka Task Programming model is part of the Aneka Cloud Platform, a middleware
designed for developing and deploying applications on cloud infrastructures. Aneka provides a
framework to build applications that can leverage distributed resources in a cloud environment,
allowing users to focus on the application logic while abstracting much of the complexity involved
in resource management and task scheduling.

Overview of Aneka Task Programming Model

The Task Programming Model is one of the primary programming models supported by Aneka,
enabling the execution of independent tasks in parallel. It is particularly useful for embarrassingly
parallel workloads, where tasks do not need to communicate or share data with each other. Each task
runs independently, and the system manages the distribution and execution across available
resources.

Key Features of the Aneka Task Programming Model:

1. Independent Task Execution:


○ Each task is a self-contained unit of computation, making it suitable for parallel
execution. These tasks can be executed on different nodes within a cloud
infrastructure, speeding up the overall application by leveraging multiple resources.
2. Task Scheduling:
○ Aneka automatically schedules tasks on available resources within the cloud. The
platform includes an intelligent scheduler that optimizes resource usage, taking into
account factors such as load balancing, priorities, and deadlines.
3. Fault Tolerance:
○ Aneka includes mechanisms to handle task failures. If a task fails, it can be
automatically rescheduled and executed again, ensuring reliability and robustness in
distributed computing environments.
4. Task Management API:
○ Aneka provides APIs for creating, submitting, and managing tasks. Developers can
easily create tasks, submit them for execution, and monitor their progress through
these APIs.
5. Resource Management:
○ The Aneka framework abstracts the underlying resources, whether they're physical
machines, virtual machines, or containers. It dynamically allocates resources for task
execution and can scale based on the workload requirements.
6. Hybrid Cloud Support:
○ Aneka supports both private and public cloud environments, allowing users to run
tasks across multiple cloud providers (like AWS, Microsoft Azure, or private data
centers). This enables flexible deployment strategies and better resource utilization.
7. Task Dependencies:
○ While the Task Programming Model typically focuses on independent tasks, Aneka
also provides support for workflows where tasks may have dependencies, allowing
more complex task execution patterns if needed.

How Aneka Task Programming Works:

1. Application Developer Perspective:


○ A developer writes an application by defining tasks. Each task performs a specific
computation, such as processing a dataset, running simulations, or performing
calculations. These tasks are bundled into an application for parallel execution.
2. Task Submission:
○ The developer submits the tasks to the Aneka cloud. The Aneka platform handles the
scheduling and execution of these tasks. Developers do not need to worry about the
low-level details of how resources are allocated or where tasks are executed.
3. Execution in the Cloud:
○ Aneka distributes the tasks across available computing resources. Tasks are executed
independently, and the results are collected and made available to the developer or
end-user.
4. Monitoring and Control:
○ Aneka provides a dashboard for monitoring the progress of tasks. Developers or
system administrators can track task execution, handle failed tasks, and manage
resource allocation dynamically.

Use Cases for Aneka Task Programming:

● High-Throughput Computing: Applications requiring the processing of large datasets


(e.g., in bioinformatics, image processing, or financial modeling) can benefit from the
parallel execution of independent tasks.
● Scientific Simulations: Researchers often need to run multiple simulations with different
parameters, where each simulation can be treated as a separate task.
● Monte Carlo Simulations: Many scientific and financial applications rely on Monte Carlo
methods, where independent simulations are run in parallel, making them perfect for the
Aneka Task Programming Model.

Comparison with Other Models:

The Aneka Task Programming Model is designed for embarrassingly parallel tasks, making it
simpler than more complex distributed computing models such as:

● MapReduce Model: MapReduce is suited for scenarios where tasks need to be distributed
but may also require aggregation (e.g., data analytics). Unlike MapReduce, the Aneka Task
Model does not require a reduce phase.
● MPI (Message Passing Interface): MPI is used for tasks that require communication
between nodes. In contrast, Aneka’s Task Model is for independent tasks, where
communication is minimal or not required.

Advantages:

● Simplicity: It is straightforward to use and requires minimal modifications to convert


applications into cloud-based parallel applications.
● Scalability: By leveraging cloud resources, Aneka can scale applications dynamically to
handle larger workloads efficiently.
● Cost-Effectiveness: By distributing tasks across available resources, Aneka helps optimize
resource usage, reducing operational costs.
MapReduce Programming Model
In the Aneka Cloud Platform, the MapReduce programming model is supported as one of its
parallel computing paradigms, providing a framework to handle large-scale data processing tasks.
It is an extension of the well-known MapReduce model popularized by Google, and it allows
developers to process vast amounts of data in parallel by splitting tasks into smaller subtasks (Map
phase) and combining the results (Reduce phase).

MapReduce Overview in Aneka

The MapReduce model in Aneka works by dividing a large computational task into two primary
functions:
1. Map Function: This phase involves processing input data and transforming it into a key-
value pair format. Each independent input chunk is processed in parallel by mapper nodes.
2. Reduce Function: This phase takes the output from the Map phase, aggregates, or combines
the results based on the keys generated, and produces a final result. Each reducer node
handles a subset of the intermediate results from the Map phase.

How Map and Reduce Work in Aneka:

Map Phase:

● The input data is divided into multiple splits or chunks, which can be processed
independently.
● Each split is processed by a mapper node. The mapper processes data and generates key-
value pairs as output.
● The mappers work independently, which allows for parallel execution across different nodes
in the Aneka cloud.

def map(key, value):
    # process the input (value) and output key-value pairs
    for word in value.split():
        emit(word, 1)

In this example, if the input is a text file, the map function tokenizes the text into words and outputs
each word with a count of 1.

Reduce Phase:

● The output from the Map phase (key-value pairs) is shuffled and sorted by the key.
● The reduce function receives all values associated with a specific key and processes them to
generate a final output.
● The reducers aggregate the values for each key, typically by summing, counting, or merging
the values.
def reduce(key, values):
    # aggregate the values associated with a key
    result = sum(values)
    emit(key, result)

Process Flow in Aneka’s MapReduce Model:

1. Input Data Splitting:


○ Aneka takes the input dataset and splits it into smaller parts, which are then
distributed to different nodes (computing resources).
2. Map Phase Execution:
○ Each split is processed by a mapper node, which generates key-value pairs.
○ All mappers work in parallel across the cloud infrastructure.
3. Shuffling and Sorting:
○ After the Map phase, Aneka performs a shuffling operation, where the key-value
pairs generated by mappers are sorted based on the key and then grouped together for
the reduce phase.
4. Reduce Phase Execution:
○ The grouped key-value pairs are distributed to reducer nodes.
○ The reducers process these groups, aggregating the data and generating final results.
5. Output:
○ The final output of the MapReduce job is written to a storage system, which can be
distributed, such as a cloud-based file system or database.

Benefits of MapReduce in Aneka:

1. Scalability: The MapReduce model in Aneka allows developers to scale applications


effortlessly by leveraging multiple nodes in a cloud environment, processing vast amounts
of data in parallel.
2. Fault Tolerance: Aneka ensures that tasks can be retried in case of failure. If a node fails
during the Map or Reduce phase, the task is reassigned to another node, ensuring the process
continues.
3. Simplified Data Processing: MapReduce abstracts the complexity of parallel programming.
Developers focus on writing the Map and Reduce functions, while Aneka handles the
distribution, scheduling, and execution of tasks.
4. Cost Efficiency: By using cloud resources for parallel data processing, organizations can
minimize processing time and resource usage, reducing costs for large-scale data processing
applications.

Use Cases for MapReduce in Aneka:

● Big Data Analytics: Processing large datasets, such as logs, sensor data, or transaction data,
to generate insights.
● Text Processing: Analyzing large text corpora (e.g., word frequency analysis or natural
language processing).
● Data Aggregation: Performing operations like counting, summing, or averaging large
datasets distributed across cloud resources.
● Scientific Simulations: Running simulations that generate vast amounts of data, where the
intermediate results need to be processed in parallel.

Comparison with Aneka Task Programming Model:

● MapReduce is suited for applications with large datasets that can be broken down into
smaller tasks that involve mapping and reducing.
● Task Programming Model is more general-purpose and deals with independent tasks that
do not require intermediate data aggregation like MapReduce.
