
UNIT – IV

1.Parallel and Distributed Computing


There are two main types of computation: parallel computing and distributed computing. A computer
system performs tasks according to the instructions it is given. A single processor can execute only one
task at a time, which is not an efficient way to handle large workloads.

Parallel computing solves this problem by allowing numerous processors to work on tasks
simultaneously. Modern computers support parallel processing to improve system performance.

In contrast, distributed computing enables several computers to communicate with one another over a
network and collaborate to achieve a common goal. Distributed computing is commonly used by
organizations such as Facebook and Google that allow people to share resources.

1.1 Distributed Computing :

Distributed computing is defined as a type of computing where multiple computer systems work on a
single problem. All the computer systems are linked together, and the problem is divided into
sub-problems, each of which is solved by a different computer system.

The goal of distributed computing is to increase the performance and efficiency of the system and to
ensure fault tolerance. Each processor has its own local memory, and all the processors communicate
with each other over a network.

It comprises several software components that reside on different systems but operate as a single
system. A distributed system's computers can be physically close together and linked by a local network or
geographically distant and linked by a wide area network (WAN).

A distributed system can be made up of any number of different configurations, such as mainframes,
PCs, workstations, and minicomputers. The main aim of distributed computing is to make a network work as
a single computer. There are several benefits to using distributed computing: it enables scalability, makes
it simpler to share resources, and improves the efficiency of computation.

Advantages and Disadvantages of Distributed Computing

Advantages

1. It is flexible, making it simple to install, use, and debug new services.


2. In distributed computing, you may add multiple machines as required.
3. If the system crashes on one server, that doesn't affect other servers.
4. A distributed computer system may combine the computational capacity of several computers,
making it faster than traditional systems.

Disadvantages

1. Data security and sharing are the main issues in distributed systems due to the open nature of such
systems.
2. Because of the distribution across multiple servers, troubleshooting and diagnostics are more
challenging.
3. The main disadvantage of distributed computer systems is the lack of software support.

1.2 Parallel Computing

Parallel computing is defined as a type of computing where multiple computer systems are used
simultaneously. Here a problem is broken into sub-problems and then further broken down into instructions.
These instructions from each sub-problem are executed concurrently on different processors.

It utilizes several processors, each of which completes the tasks allocated to it. In other words, parallel
computing involves performing numerous tasks simultaneously. Either a shared-memory or a
distributed-memory system can be used for parallel computing. In shared-memory systems, all processors
share a single memory; in distributed-memory systems, each processor has its own local memory and the
processors exchange data by passing messages.

A parallel computing system consists of multiple processors that communicate with each other and
perform multiple tasks over a shared memory simultaneously. The goal of parallel computing is to save
time and provide concurrency.
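
As a simple illustration of the idea, the Java sketch below splits one problem (summing an array) into sub-problems that run concurrently on the available processor cores and then combines the partial results. The class and variable names are our own illustrative choices, not part of any particular framework.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch: divide an array sum into chunks processed in parallel.
public class ParallelSum {
    public static void main(String[] args) throws Exception {
        int[] data = new int[1_000_000];
        Arrays.fill(data, 1);

        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        int chunk = data.length / cores;

        List<Future<Long>> parts = new ArrayList<>();
        for (int i = 0; i < cores; i++) {
            final int start = i * chunk;
            final int end = (i == cores - 1) ? data.length : start + chunk;
            // each sub-problem is executed concurrently on its own thread
            parts.add(pool.submit(() -> {
                long s = 0;
                for (int j = start; j < end; j++) s += data[j];
                return s;
            }));
        }

        long total = 0;
        for (Future<Long> f : parts) total += f.get();  // combine partial results
        pool.shutdown();
        System.out.println("Sum = " + total);
    }
}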

Advantages

1. It saves time and money because many resources working together cut down on time and costs.
2. It can solve larger problems that are difficult to handle with serial computing.
3. You can do many things at once using many computing resources.
4. Parallel computing is much better than serial computing for modeling, simulating, and
comprehending complicated real-world events.

Disadvantages

1. Multi-core architectures consume a lot of power.
2. Parallel solutions are more difficult to implement, debug, and prove correct due to the complexity of
communication and coordination, and they can sometimes perform worse than their serial equivalents.
Comparison of Parallel and Distributed Computing

Definition:
  Parallel computing - a type of computation in which various processes run simultaneously within a single system.
  Distributed computing - a type of computing in which components located on various networked systems interact and coordinate their actions by passing messages to one another.

Communication:
  Parallel computing - the processors communicate with one another via a bus.
  Distributed computing - the computer systems connect with one another via a network.

Functionality:
  Parallel computing - several processors execute various tasks simultaneously.
  Distributed computing - several computers execute tasks simultaneously.

Number of computers:
  Parallel computing - occurs within a single computer system.
  Distributed computing - involves various computers.

Memory:
  Parallel computing - the system may have distributed or shared memory.
  Distributed computing - each computer system has its own memory.

Usage:
  Parallel computing - helps to improve system performance.
  Distributed computing - allows for scalability, resource sharing, and the efficient completion of computation tasks.

2.Map Reduce
MapReduce is a programming model for writing applications that can process Big Data in parallel on
multiple nodes. MapReduce provides analytical capabilities for analyzing huge volumes of complex data.

Why MapReduce?

Traditional enterprise systems normally have a centralized server to store and process data. This
traditional model is not suitable for processing huge volumes of data, which cannot be accommodated by
standard database servers. Moreover, the centralized system creates a bottleneck while processing
multiple files simultaneously.

Google solved this bottleneck issue using an algorithm called MapReduce. MapReduce divides a
task into small parts and assigns them to many computers. Later, the results are collected at one place and
integrated to form the result dataset.

How MapReduce Works?

The MapReduce algorithm contains two important tasks, namely Map and Reduce.

 The Map task takes a set of data and converts it into another set of data, where individual elements are
broken down into tuples (key-value pairs).
 The Reduce task takes the output from the Map as an input and combines those data tuples (key-value pairs)
into a smaller set of tuples.

The reduce task is always performed after the map job.

Let us now take a close look at each of the phases and try to understand their significance.

 Input Phase − Here we have a Record Reader that translates each record in an input file and sends the
parsed data to the mapper in the form of key-value pairs.
 Map − Map is a user-defined function, which takes a series of key-value pairs and processes each one of
them to generate zero or more key-value pairs.
 Intermediate Keys − The key-value pairs generated by the mapper are known as intermediate keys.
 Combiner − A combiner is a type of local Reducer that groups similar data from the map phase into
identifiable sets. It takes the intermediate keys from the mapper as input and applies a user-defined code to
aggregate the values in a small scope of one mapper. It is not a part of the main MapReduce algorithm; it is
optional.
 Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It downloads the grouped key-
value pairs onto the local machine, where the Reducer is running. The individual key-value pairs are sorted
by key into a larger data list. The data list groups the equivalent keys together so that their values can be
iterated easily in the Reducer task.
 Reducer − The Reducer takes the grouped key-value pairs as input and runs a Reducer function on
each one of them. Here, the data can be aggregated, filtered, and combined in a number of ways, which
may require a wide range of processing. Once the execution is over, it gives zero or more key-value pairs
to the final step.
 Output Phase − In the output phase, we have an output formatter that translates the final key-value pairs
from the Reducer function and writes them onto a file using a record writer.
The two tasks, Map and Reduce, are usually illustrated with a small dataflow diagram: input splits feed the mappers, the intermediate key-value pairs are shuffled and sorted by key, and the reducers produce the final output.
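
To make the model concrete, here is a minimal sketch of the classic word-count example written against the standard Hadoop MapReduce Java API; the class names are our own, and the driver/job configuration is omitted for brevity.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: for every input line, emit (word, 1) as an intermediate key-value pair.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce: receive (word, [1, 1, ...]) after shuffle and sort, and emit (word, count).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}

Because this reducer is associative and commutative, the same class can also be registered as the optional combiner (job.setCombinerClass(IntSumReducer.class)) to pre-aggregate counts on the mapper side.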

3.Twister & Iterative MapReduce


The standard MapReduce framework involves a lot of overhead when used for iterative computations.
Twister is a framework designed to perform iterative MapReduce efficiently.
1.) Static and variable data: Any iterative algorithm works with both static and variable data. The
variable data are combined with the static data (usually the larger part of the two) to generate another set
of variable data, and the process is repeated until a given condition or constraint is met.

In a normal MapReduce job using Hadoop or DryadLINQ, the static data are reloaded every time the
computation is performed, even though they remain fixed throughout the computation. This is extra
overhead.

Twister introduces a "config" phase for both map and reduce tasks to load any static data that is required.
Loading the static data only once is also helpful for long-running Map/Reduce tasks.

2.) Fat map tasks: To reduce data-access overhead, the map task is configurable and can directly access
large blocks of data or files. This makes it easy to place heavy computational work on the map side.

3.) Combine operation: Unlike GFS-based MapReduce, where the outputs of the reducers are stored in
separate files, Twister adds a new phase after map and reduce, called combine, that collectively merges
the outputs coming from all the reducers.

4.) Programming extensions: Some of the additional functions that support the iterative functionality of
Twister are:
i) mapReduceBCast(Value value) for sending a single value to all map tasks. For example, the "Value"
can be a set of parameters, a resource (file or executable) name, or even a block of data.
ii) configureMaps(Value[] values) and configureReduce(Value[] values) to configure map and reduce
tasks with additional static data.
Twister is designed to support iterative MapReduce computations efficiently. To achieve this, it reads
data from the local disks of the worker nodes and handles the intermediate data in the distributed memory
of the worker nodes. The messaging infrastructure in Twister is called the broker network, and it is
responsible for performing data transfer using publish/subscribe messaging.
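
Putting these pieces together, an iterative Twister computation typically follows the loop sketched below. This is an illustrative Java-style sketch only: configureMaps(...) and mapReduceBCast(...) are the extension calls named above, while driver, combineResult() and checkConvergence() are hypothetical placeholders rather than actual Twister API.

// Illustrative sketch of an iterative Twister-style computation (not actual API).
// Static data are loaded once; only the variable data move on each iteration.

driver.configureMaps(staticDataPartitions);   // "config" phase: load static data once

Value variable = initialVariableData;
while (!checkConvergence(variable)) {         // hypothetical convergence test
    // broadcast the current variable data to every map task
    driver.mapReduceBCast(variable);

    // the combine phase gathers the outputs of all reducers into a single value
    variable = driver.combineResult();        // hypothetical accessor
}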

Twister has three main entities:

1. A client-side driver responsible for driving the entire MapReduce computation.
2. A Twister daemon running on every worker node.
3. The broker network.

Access Data

To access input data for map tasks, Twister either reads data from the local disks of the worker nodes or
receives data directly via the broker network. All data are kept as native files, and having the data as
native files allows Twister to pass data directly to any executable. Additionally, a set of tools allows
typical file operations to be performed, such as:
(i) create directories, (ii) delete directories, (iii) distribute input files across worker nodes, (iv) copy a set of
resources/input files to all worker nodes, (v) collect output files from the worker nodes to a given location,
and (vi) create a partition file for a given set of data that is distributed across the worker nodes.

Intermediate Data

The intermediate data are stored in the distributed memory of the worker nodes. Keeping the map
output in distributed memory speeds up the computation, because the output of the map tasks is sent
from memory directly to the reducers.

Messaging

The use of a publish/subscribe messaging infrastructure improves the efficiency of the Twister runtime. It
uses the scalable NaradaBrokering messaging infrastructure to connect different brokers into the broker
network and reduce the load on any one of them.

Fault Tolerance

There are three assumptions made when providing fault tolerance for iterative MapReduce:
(i) failure of the master node is rare, and no support is provided for it;
(ii) the communication network can be made fault tolerant independently of the Twister runtime;
(iii) the data are replicated among the nodes of the computation infrastructure.
Based on these assumptions, Twister tries to handle failures of map/reduce tasks, daemons, and worker nodes.

4.Hadoop Introduction
Hadoop is an open-source software framework that is used for storing and processing large amounts
of data in a distributed computing environment. It is designed to handle big data and is based on the
MapReduce programming model, which allows for the parallel processing of large datasets.

What is Hadoop?

Hadoop is an open source software programming framework for storing a large amount of data and
performing the computation. Its framework is based on Java programming with some native code in C and
shell scripts.

Hadoop has two main components:

HDFS (Hadoop Distributed File System): This is the storage component of Hadoop, which allows
for the storage of large amounts of data across multiple machines. It is designed to work with commodity
hardware, which makes it cost-effective.

YARN (Yet Another Resource Negotiator): This is the resource management component of Hadoop,
which manages the allocation of resources (such as CPU and memory) for processing the data stored in
HDFS.
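
As a small illustration of how applications interact with HDFS, the sketch below writes and reads a file using the Hadoop FileSystem Java API; the path is purely illustrative, and the cluster address is assumed to come from the core-site.xml configuration on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal HDFS read/write sketch using the Hadoop FileSystem API.
public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up fs.defaultFS from core-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt");  // illustrative path

        // Write a small file; HDFS replicates its blocks across DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello HDFS");
        }

        // Read the file back.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }

        fs.close();
    }
}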

Hadoop also includes several additional modules that provide additional functionality, such as Hive
(a SQL-like query language), Pig (a high-level platform for creating MapReduce programs), and HBase (a
non-relational, distributed database).

Hadoop is commonly used in big data scenarios such as data warehousing, business intelligence, and
machine learning. It’s also used for data processing, data analysis, and data mining. It enables the distributed
processing of large data sets across clusters of computers using a simple programming model.

Hadoop has several key features that make it well-suited for big data processing:

 Distributed Storage: Hadoop stores large data sets across multiple machines, allowing for the storage
and processing of extremely large amounts of data.
 Scalability: Hadoop can scale from a single server to thousands of machines, making it easy to add
more capacity as needed.
 Fault-Tolerance: Hadoop is designed to be highly fault-tolerant, meaning it can continue to operate
even in the presence of hardware failures.
 Data locality: Hadoop provides a data locality feature, whereby data is processed on the same node
where it is stored; this helps to reduce network traffic and improve performance.
 High Availability: Hadoop provides a High Availability feature, which helps to ensure that the data
is always available and is not lost.
 Flexible Data Processing: Hadoop’s MapReduce programming model allows for the processing of
data in a distributed fashion, making it easy to implement a wide variety of data processing tasks.
 Data Integrity: Hadoop provides a built-in checksum feature, which helps to ensure that the stored
data is consistent and correct.
 Data Replication: Hadoop provides a data replication feature, which replicates data across the
cluster for fault tolerance.
 Data Compression: Hadoop provides built-in data compression, which helps to reduce storage
space and improve performance.
 YARN: A resource management platform that allows multiple data processing engines like real-time
streaming, batch processing, and interactive SQL, to run and process data stored in HDFS.
5.What is Apache Hadoop?

Apache Hadoop is an open-source software platform designed for distributed storage and data
processing. It is well-known for its capacity to manage large amounts of data across clusters of commodity
servers. Apache Hadoop is extensively utilized in big data applications, serving as a framework for parallel
data processing and analysis. This makes it an essential tool for organizations that handle massive datasets.

5.1 Native Hadoop Library

Hadoop has native implementations of certain components for performance reasons and because Java
implementations are not available for them. These components are available in a single, dynamically-linked
native library called the native hadoop library. On the *nix platforms, the library is named libhadoop.so.

Usage

It is fairly easy to use the native hadoop library:

1. Review the components.


2. Review the supported platforms.
3. Either download a hadoop release, which will include a pre-built version of the native hadoop
library, or build your own version of the native hadoop library. Whether you download or build, the
name for the library is the same: libhadoop.so
4. Install the compression codec development packages (>zlib-1.2, >gzip-1.2):
o If you download the library, install one or more development packages - whichever
compression codecs you want to use with your deployment.
o If you build the library, it is mandatory to install both development packages.
5. Check the runtime log files.

Components

The native hadoop library includes various components:

 Compression Codecs (bzip2, lz4, zlib)


 Native IO utilities for HDFS Short-Circuit Local Reads and Centralized Cache Management in
HDFS
 CRC32 checksum implementation

Supported Platforms

The native hadoop library is supported on *nix platforms only. The library does not work with Cygwin or
the Mac OS X platform.

The native hadoop library is mainly used on the GNU/Linux platform and has been tested on these
distributions:

 RHEL4/Fedora
 Ubuntu
 Gentoo

On all the above distributions a 32/64 bit native hadoop library will work with a respective 32/64 bit jvm.
Download

The pre-built 32-bit i386-Linux native hadoop library is available as part of the hadoop distribution and is
located in the lib/native directory. You can download the hadoop distribution from Hadoop Common
Releases.

Be sure to install the zlib and/or gzip development packages - whichever compression codecs you want to
use with your deployment.

Build

The native hadoop library is written in ANSI C and is built using the GNU autotools-chain (autoconf,
autoheader, automake, autoscan, libtool). This means it should be straight-forward to build the library on
any platform with a standards-compliant C compiler and the GNU autotools-chain (see the supported
platforms).

The packages you need to install on the target platform are:

 C compiler (e.g. GNU C Compiler)
 GNU Autotools chain: autoconf, automake, libtool
 zlib development package (stable version >= 1.2.0)
 openssl development package (e.g. libssl-dev)

Once you installed the prerequisite packages use the standard hadoop pom.xml file and pass along the native
flag to build the native hadoop library:

$ mvn package -Pdist,native -DskipTests -Dtar

You should see the newly-built library in:

$ hadoop-dist/target/hadoop-3.4.1/lib/native

Please note the following:

 It is mandatory to install both the zlib and gzip development packages on the target platform in order
to build the native hadoop library; however, for deployment it is sufficient to install just one package
if you wish to use only one codec.
 It is necessary to have the correct 32/64 libraries for zlib, depending on the 32/64 bit jvm for the
target platform, in order to build and deploy the native hadoop library.

Runtime

The bin/hadoop script ensures that the native hadoop library is on the library path via the system property:
-Djava.library.path=<path>

During runtime, check the hadoop log files for your MapReduce tasks.

 If everything is all right, then: DEBUG util.NativeCodeLoader - Trying to load the custom-built
native-hadoop library... INFO util.NativeCodeLoader - Loaded the native-hadoop library
 If something goes wrong, then: WARN util.NativeCodeLoader - Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
Check

NativeLibraryChecker is a tool to check whether native libraries are loaded correctly. You can launch
NativeLibraryChecker as follows:

$ hadoop checknative -a
14/12/06 01:30:45 WARN bzip2.Bzip2Factory: Failed to load/initialize native-bzip2 library system-native,
will use pure-Java version
14/12/06 01:30:45 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
hadoop: true /home/ozawa/hadoop/lib/native/libhadoop.so.1.0.0
zlib: true /lib/x86_64-linux-gnu/libz.so.1
zstd: true /usr/lib/libzstd.so.1
lz4: true revision:99
bzip2: false

Native Shared Libraries

You can load any native shared library using DistributedCache for distributing and symlinking the library
files.

This example shows how to distribute a shared library, libmyexample.so, on Unix-like systems and load it
from a MapReduce task.

1. First copy the library to HDFS:
   bin/hadoop fs -copyFromLocal libmyexample.so.1 /libraries/libmyexample.so.1
2. The job launching program should contain the following:
   DistributedCache.createSymlink(conf);
   DistributedCache.addCacheFile("hdfs://host:port/libraries/libmyexample.so.1#libmyexample.so", conf);
3. The MapReduce task can contain:
   System.loadLibrary("myexample");

6.Mapping Applications:
Application mapping is the process of creating a visual representation of an organization's
applications and their relationships to each other. This can be done manually, but it is increasingly becoming
the standard to automate application mapping through specialized software. It is important for several
reasons.
Furthermore, application mapping allows you to understand the dependencies between different applications,
enabling you to make informed decisions when modifying or replacing components. This flexibility
empowers businesses to adapt to changing market dynamics and stay ahead of the competition.

6.1Key Components of Application Mapping

Discovery and Inventory

The first and most crucial component of application mapping is the discovery and inventory process.
This is where all the applications, services, and resources within the cloud environment are identified and
cataloged.

The discovery process involves scanning the cloud environment to detect applications, servers,
databases, and other related resources. Advanced tools can even identify the versions, patches, and
configurations of these resources.
Inventory, on the other hand, is about organizing the discovered resources. It involves categorizing
and tagging the resources based on various factors such as their function, owner, location, etc. The inventory
process is essential for maintaining an accurate and up-to-date record of the cloud environment.

Visualization and Documentation

The next component of application mapping is visualization and documentation. Once all the
resources are discovered and inventoried, they need to be represented visually. This is where application
mapping tools come into play. These tools generate diagrams that depict the relationships and dependencies
between different applications and services.

Visualization is about understanding the interconnectedness of the cloud environment. It shows how a
change in one part of the system can impact other parts. This knowledge is critical for planning and executing
changes in the cloud environment.

Documentation, on the other hand, is about recording the details of the cloud environment. It includes
information about the applications, their configurations, dependencies, and more. Documentation serves as a
reference point for the IT team and aids in troubleshooting and problem resolution.

Performance Monitoring

The last but certainly not the least component of application mapping is performance monitoring. This
involves continuously tracking the performance of the applications and services within the cloud
environment.

Performance monitoring helps to identify any issues or anomalies in the system. It provides insights
into the health and performance of the applications and services. With this information, the IT team can take
proactive measures to rectify issues before they escalate and impact business operations.

6.2 Benefits of Application Mapping in the Cloud

Enhanced Visibility and Control

One of the major benefits of application mapping in the cloud is enhanced visibility and control. With
a comprehensive map of the cloud environment, IT teams have a clear understanding of the system’s
structure and behavior.

This visibility extends to the minutest details of the cloud environment — from the number of
resources to their configurations, dependencies, and performance. It provides the IT team with a holistic view
of the cloud environment, enabling them to manage and control it more effectively.

Efficient Resource Utilization

Another significant benefit of application mapping in the cloud is efficient resource utilization. The
cloud environment is a vast ecosystem of resources. Without a proper mapping system, it’s easy to lose track
of these resources, leading to underutilization or overutilization.
Application mapping helps to avoid this scenario. By providing a clear picture of the resources and
their usage, it enables the IT team to optimize resource allocation. This not only leads to cost savings but also
improves the overall performance of the cloud environment.

Improved Security and Compliance

One of the primary advantages of application mapping in the cloud is the enhanced security and
compliance it provides. By mapping out the various components and dependencies of your applications, you
gain a holistic view of your cloud environment. This allows you to identify potential vulnerabilities and
address them proactively.

Scalability and Flexibility

As your business grows, your application requirements evolve, and you need to scale your cloud
resources accordingly. By mapping your applications, you can identify potential bottlenecks, optimize
resource allocation, and allocate additional resources where needed. This enables you to scale your
applications seamlessly, ensuring optimal performance and user experience.

7.What is Google App Engine?


Google App Engine (GAE) is a platform-as-a-service (PaaS) product that enables web app
developers and enterprises to build, deploy and host scalable, high-performance applications in Google's fully
managed cloud environment without having to worry about infrastructure provisioning or management.
GAE is Google's fully managed and serverless application development platform. It handles all the
work of uploading and running the code on Google Cloud. GAE's flexible environment provisions all the
necessary infrastructure based on the central processing unit (CPU) and memory requirements specified by
the developer.
With GAE, developers can create applications in multiple supported languages or run
custom containers in a preferred language or framework. Each language has a software development kit
(SDK) and runtime to enable app development and testing. GAE also provides a wide range of developer
tools to simplify app development, testing, debugging, deployment and performance monitoring.
Programming Support of Google App Engine
GAE provides a programming model for two supported languages: Java and Python. A client environment
includes an Eclipse plug-in for Java that allows you to debug your GAE application on your local machine.
The Google Web Toolkit is available for Java web application developers. Python is used with frameworks
such as Django and CherryPy, but Google also provides its own webapp Python environment.
There are several powerful constructs for storing and accessing data. The data store is a NoSQL data
management system for entities. Java offers the Java Data Objects (JDO) and Java Persistence API (JPA)
interfaces, implemented by the DataNucleus Access Platform, while Python has a SQL-like query language
called GQL.

The performance of the data store can be enhanced by in-memory caching using the memcache,
which can also be used independently of the data store. Recently, Google added the blobstore which is
suitable for large files as its size limit is 2 GB.

There are several mechanisms for incorporating external resources. The Google SDC Secure Data
Connection can tunnel through the Internet and link your intranet to an external GAE application. The URL
Fetch operation provides the ability for applications to fetch resources and communicate with other hosts
over the Internet using HTTP and HTTPS requests.
After creating a Cloud account, you may Start Building your App

 Using the Go template/HTML package


 Python-based webapp2 with Jinja2
 PHP and Cloud SQL
 using Java’s Maven
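
For the Java route, a minimal GAE application is essentially a standard servlet mapped in the project's deployment descriptor. The sketch below is a generic example of such a handler (the class name and message are ours), built and deployed with Maven as mentioned above.

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Minimal request handler for the App Engine Java standard environment.
// The servlet is mapped to a URL in web.xml, and appengine-web.xml holds
// the App Engine specific settings (runtime, instance class, etc.).
public class HelloAppEngineServlet extends HttpServlet {
    @Override
    public void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        resp.setContentType("text/plain");
        resp.getWriter().println("Hello from Google App Engine");
    }
}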

7.1 Advantages of Google App Engine

The Google App Engine has a lot of benefits that can help you advance your app ideas, including:
 Infrastructure for Security: The Internet infrastructure that Google uses is arguably the safest in the
entire world. Since the application data and code are hosted on extremely secure servers, there has
rarely been any kind of illegal access to date.
 Faster Time to Market: For every organization, getting a product or service to market quickly is
crucial. When it comes to quickly releasing the product, encouraging the development and
maintenance of an app is essential. A firm can grow swiftly with Google Cloud App Engine’s
assistance.
 Quick to Start: You don’t need to spend a lot of time prototyping or deploying the app to users
because there is no hardware or product to buy and maintain.
 Easy to Use: The tools that you need to create, test, launch, and update the applications are included
in Google App Engine (GAE).
 Rich set of APIs & Services: A number of built-in APIs and services in Google App Engine enable
developers to create strong, feature-rich apps.
 Scalability: This is one of the deciding variables for the success of any software. When using the
Google app engine to construct apps, you may access technologies like GFS, Big Table, and others
that Google uses to build its own apps.
 Performance and Reliability: Among international brands, Google ranks among the top ones.
Therefore, you must bear that in mind while talking about performance and reliability.
 Cost Savings: To administer your servers, you don’t need to employ engineers or even do it yourself.
The money you save might be put toward developing other areas of your company.
 Platform Independence: Since the app engine platform only has a few dependencies, you can easily
relocate all of your data to another environment.

8.Introduction to Eucalyptus
The term open-source cloud refers to cloud software or applications that are publicly available for users to
set up for their own purposes or for their organization.

Eucalyptus is a Linux-based open-source software architecture for cloud computing and also a storage
platform that implements Infrastructure as a Service (IaaS). It provides quick and efficient computing services.
Eucalyptus was designed to provide services compatible with Amazon's EC2 cloud and Simple Storage Service (S3).

Eucalyptus Architecture

Eucalyptus CLIs can manage both Amazon Web Services and their own private instances. Clients are free to
migrate instances from Eucalyptus to Amazon Elastic Compute Cloud. The virtualization layer oversees the
network, storage, and compute resources. Instances are isolated from one another by hardware virtualization.

8.1 Components of Architecture

 Node Controller: manages the lifecycle of instances running on each node. It interacts with the
operating system, the hypervisor, and the Cluster Controller, and it controls the working of VM
instances on the host machine.
 Cluster Controller: manages one or more Node Controllers and communicates with the Cloud
Controller. It gathers information and schedules VM execution.
 Storage Controller (Walrus): allows the creation of snapshots of volumes and provides persistent
block storage for VM instances. Walrus is a simple file storage system that stores images and
snapshots, and stores and serves files using the S3 (Simple Storage Service) APIs.
 Cloud Controller: the front end for the entire architecture. It exposes compliant web services to
client tools on one side and interacts with the rest of the components on the other side.

8.2 Operation Modes Of Eucalyptus

 Managed Mode: Provides numerous security groups to users, as the network is large. Each security
group is assigned a set or a subset of IP addresses. Ingress rules are applied through the security groups
specified by the user. The network is isolated by VLAN between the Cluster Controller and the Node
Controller. Two IP addresses are assigned to each virtual machine.
 Managed (No VLAN) Mode: The root user on a virtual machine can snoop into other virtual
machines running on the same network layer. It does not provide VM network isolation.
 System Mode: The simplest of all modes, with the least number of features. A MAC address is
assigned to a virtual machine instance and attached to the Node Controller's bridge Ethernet device.
 Static Mode: Similar to System mode but with more control over the assignment of IP addresses.
Each MAC address/IP address pair is mapped to a static entry within the DHCP server.

Important Features are:-

 Images: A good example is the Eucalyptus Machine Image, which is a bundle of software uploaded
to the cloud.
 Instances: When we run the image and utilize it, it becomes an instance.
 Networking: It can be further subdivided into three modes: Static mode (allocates IP addresses to
instances), System mode (assigns a MAC address and attaches the instance's network interface to the
physical network via the Node Controller), and Managed mode (creates a local network of instances).
 Access Control: It is used to apply restrictions to clients.
 Elastic Block Storage: It provides block-level storage volumes to attach to an instance.
 Auto-scaling and Load Balancing: It is used to create or destroy instances or services based on
requirements.

Advantages Of The Eucalyptus Cloud

 Eucalyptus can be used to build both private and public clouds.
 Amazon or Eucalyptus machine images can be run on either cloud.
 Its API is fully compatible with the Amazon Web Services APIs.
 Eucalyptus can be used with DevOps tools such as Chef and Puppet.
 Although it is not as popular yet, it has the potential to be an alternative to OpenStack and
CloudStack.
 It is used to build hybrid, public, and private clouds.
 It allows users to turn their own data centers into a private cloud and, hence, extend the services to
other organizations.

9.Open Nebula
What Is OpenNebula?

OpenNebula is an open-source cloud computing platform that streamlines and simplifies the creation and
management of virtualized hybrid, public, and private clouds. It is a straightforward yet feature-rich,
flexible solution for building and managing enterprise clouds and data center virtualization. You can gain
control over your cloud infrastructure with OpenNebula while enjoying flexibility and simplicity. You can
also centrally administer and monitor virtual systems across different hypervisors and storage systems with
OpenNebula.

It supports many hypervisors like KVM, VMware, and Xen. OpenNebula also offers compatibility
with various storage backends.
This versatility enables you to leverage your existing infrastructure. This will let you choose the
storage solution that suits your needs. Thanks to OpenNebula’s extensive APIs and CLI tools, you can
integrate with existing systems seamlessly. You can also connect OpenNebula with monitoring and billing
tools. This will enable automation and cost optimization. The platform’s vibrant community and rich
ecosystem provide valuable support. OpenNebula is there with resources to assist you in harnessing its
features effectively.

OpenNebula architecture

9.1 How Does OpenNebula Work?

OpenNebula abstracts physical resources such as servers, storage, and networking. Then it presents
them to the user as a unified pool of resources that can be allocated and managed on demand. The platform
includes several components that work together to provide this functionality.

OpenNebula’s front end is the central management component. It allows users to create and manage
virtual resources. These resources include virtual machines, networks, and storage volumes. It
communicates with the OpenNebula nodes responsible for running the virtual machines. These nodes are
also responsible for managing the virtual networks and storage.

OpenNebula nodes use virtualization technologies like KVM or VMware to run virtual machines.
Each node can host many virtual machines. And the platform can scale up or down by adding or removing
nodes to the system. OpenNebula also supports hybrid cloud deployments. OpenNebula deploys nodes in
public clouds such as Amazon Web Services or Microsoft Azure.

OpenNebula also has a scheduler. This handles the allocation of virtual resources to users based on
defined policies. The scheduler ensures that resources are used efficiently and fairly among users. It can also
balance the workload across the available nodes.

9.2 Importance of OpenNebula

Centralized Management

It eliminates the need to switch between tools to manage different aspects of your IT infrastructure.
You get a single interface to manage your private cloud computing needs. You can also manage your
infrastructure and virtualization needs. You can create, manage, and track your networks, storage, and
virtual machines from anywhere. The interface is user-friendly and intuitive, and it allows you to manage
your infrastructure efficiently without extensive technical knowledge. OpenNebula also supports role-based
access control, so you can control who has access to specific resources and functions within the
platform.
Scalability

OpenNebula is highly scalable. You can easily add or remove resources to meet your changing
needs. Depending on your requirements, you can scale your infrastructure up or down without extra
hardware or software. OpenNebula also supports automatic resource allocation. This means you can set
resource usage policies and let the platform manage resources for you. This, in turn, makes it easy to
manage large-scale deployments without compromising performance or efficiency.

Cost-Effectiveness

OpenNebula is an open-source platform, meaning it’s free to use and distribute. Since you don’t need
to pay for expensive licenses or subscriptions, it is a cost-effective solution for businesses of all sizes.
OpenNebula also supports a wide range of hardware and software. This makes it easy to use existing
infrastructure and tools without extra investment.

Flexibility

OpenNebula is a flexible platform that supports a variety of virtualization technologies. This includes
KVM, VMware, and Xen. This flexibility allows you to select the ideal virtualization technology for your
needs. This, in turn, gets you free from vendor or solution restrictions. OpenNebula also supports a range of
storage backends. This includes local disks, NFS, Ceph, and GlusterFS, giving you flexibility in managing
your storage.

10.OpenStack Architecture
Introduction
OpenStack is an open-standard and free platform for cloud computing. Mostly, it is deployed
as IaaS (Infrastructure-as-a-Service) in both private and public clouds, where virtual servers and other types
of resources are made available to users. The platform consists of interrelated components that control
diverse pools of compute, storage, and networking resources from multi-vendor hardware throughout the
data center. Users manage it through command-line tools, RESTful web services, or a web-based dashboard.
OpenStack began in 2010 as a joint project of NASA and Rackspace Hosting. It is managed by the
OpenStack Foundation, a non-profit entity established in September 2012 to promote the OpenStack
community and software. More than 50 enterprises have joined the project.

Architecture of OpenStack
OpenStack contains a modular architecture along with several code names for the components.
OpenStack components

 Apart from the various projects which constitute the OpenStack platform, there are nine major
services, namely Nova, Neutron, Swift, Cinder, Keystone, Glance, Horizon, Ceilometer, and Heat. Here
is a basic definition of each of these components.
 Nova (compute service): It manages the compute resources like creating, deleting, and handling the
scheduling. It can be seen as a program dedicated to the automation of resources that are responsible
for the virtualization of services and high-performance computing.
 Neutron (networking service): It is responsible for connecting all the networks across OpenStack. It
is an API driven service that manages all networks and IP addresses.
 Swift (object storage): It is an object storage service with high fault-tolerance capabilities, used to
store and retrieve unstructured data objects with the help of a RESTful API. Being a distributed
platform, it is also used to provide redundant storage within servers that are clustered together. It is
able to successfully manage petabytes of data.
 Cinder (block storage): It is responsible for providing persistent block storage that is made accessible
using an API (self- service). Consequently, it allows users to define and manage the amount of cloud
storage required.
 Keystone (identity service provider): It is responsible for all types of authentications and
authorizations in the OpenStack services. It is a directory-based service that uses a central repository
to map the correct services with the correct user.
 Glance (image service provider): It is responsible for registering, storing, and retrieving virtual disk
images from the complete network. These images are stored in a wide range of back-end systems.
 Horizon (dashboard): It is responsible for providing a web-based interface for OpenStack services. It
is used to manage, provision, and monitor cloud resources.
 Ceilometer (telemetry): It is responsible for metering and billing of services used. Also, it is used to
generate alarms when a certain threshold is exceeded.
 Heat (orchestration): It is used for on-demand service provisioning with auto-scaling of cloud
resources. It works in coordination with the ceilometer.

Features of OpenStack

 Modular architecture: OpenStack is designed with a modular architecture that enables users to
deploy only the components they need. This makes it easier to customize and scale the platform to
meet specific business requirements.
 Multi-tenancy support: OpenStack provides multi-tenancy support, which enables multiple users to
access the same cloud infrastructure while maintaining security and isolation between them. This is
particularly important for cloud service providers who need to offer services to multiple customers.
 Open-source software: OpenStack is an open-source software platform that is free to use and modify.
This enables users to customize the platform to meet their specific requirements, without the need for
expensive proprietary software licenses.
 Distributed architecture: OpenStack is designed with a distributed architecture that enables users to
scale their cloud infrastructure horizontally across multiple physical servers. This makes it easier to
handle large workloads and improve system performance.
 API-driven: OpenStack is API-driven, which means that all components can be accessed and
controlled through a set of APIs. This makes it easier to automate and integrate with other tools and
services.
 Comprehensive dashboard: OpenStack provides a comprehensive dashboard that enables users to
manage their cloud infrastructure and resources through a user-friendly web interface. This makes it
easier to monitor and manage cloud resources without the need for specialized technical skills.
 Resource pooling: OpenStack enables users to pool computing, storage, and networking resources,
which can be dynamically allocated and de-allocated based on demand. This enables users to
optimize resource utilization and reduce waste.

Advantages of using OpenStack

 It boosts rapid provisioning of resources due to which orchestration and scaling up and down of
resources becomes easy.
 Deployment of applications using OpenStack does not consume a large amount of time.
 Since resources are scalable therefore they are used more wisely and efficiently.
 The regulatory compliances associated with its usage are manageable.

Disadvantages of using OpenStack

 OpenStack is not very robust when orchestration is considered.


 Even today, the APIs provided and supported by OpenStack are not compatible with many of the
hybrid cloud providers, thus integrating solutions becomes difficult.
 Like all cloud service providers OpenStack services also come with the risk of security breaches.

11.What Is Aneka in Cloud Computing


Aneka is an agent-based software product that provides the support necessary for the development
and deployment of distributed applications in the cloud. In particular, it enables users to make effective
use of numerous cloud resources by offering a logical means of unifying different computational
programming interfaces and tools.

By using Aneka, consumers are able to run applications on a cloud structure of their own making without
compromising efficiency and effectiveness. The platform is general-purpose and can be used for
computation and data processing, both for workloads with a large number of tasks and for complex
workflows.
11.1 Classification of Aneka Services in Cloud Computing

1. Fabric Services

 The Fabric services in Aneka represent the basic part of the infrastructural framework through which
the resources of the cloud environment can be managed and automated. They deal with the physical,
low-level provisioning and allocation of resources, as well as virtualization. Here are some key
components:
 Resource Provisioning: Fabric services provision computational assets such as virtual machines and
containers, or deploy bare-metal hardware.
 Resource Virtualization: These services conceal the lower-level physical resources and offer virtual
instances for running applications. They are also responsible for identifying, distributing, and
isolating resources in order to optimize their use.
 Networking: Fabric services are fairly involved with the connectivity of the network as it is in the
context of virtual networking and routing thereby facilitating interactions between various parts of
the cloud.
 Storage Management: They manage storage assets within a system, specifically creating and
managing storage volumes, managing file systems as well as performing data replication for failover.
2. Foundation Services

Foundation services rely on the fabric layer and provide further support for the development of
applications in the distributed environment. They provide the basic building blocks that are necessary for
constructing applications that are portable and elastic. Key components include:

 Task Execution: Foundation services are responsible for coordinating the work and processes in the
systems of a distributed environment. These include the capability of managing the tasks’ schedule,
distributing the workload, and using fault tolerance measures that guarantee efficient execution of
tasks.
 Data Management: These provide the main function of data storage and retrieval as we see in
distributed applications. The need to be able to support distributed file systems, databases, or
requests and data caching mechanisms is also present.
 Security and Authentication: Foundation services include the security of data-bearing services
implemented by authentication, authorization, and encryption standards to comply with the required
level of security.
 Monitoring and Logging: They allow us to track the application usage and its behaviour in real-time
mode as well as track all the events and the measures of activity for the usage in the analysis of the
incident.

3. Application Services

Application services in Aneka are more specialized services built on top of the core infrastructure to
support the needs of different types of applications. They represent typical application templates or
scenarios that help to speed up application assembly. Key components include:

 Middleware Services: Application services can involve various distributed applications fundamental
components like messaging services, event processing services or a service orchestration framework
in case of complex application integration.
 Data Analytics and Machine Learning: Certain application services are dedicated to delivering
toolkits and platforms for analyzing the data, training as well as deploying machine learning models
and performing predictive analysis.
 Content Delivery and Streaming: These services focus on the efficient transport of multimedia
content, streaming information, or real-time communications for video streaming services or online
gaming, for instance.
 IoT Integration: Application services can provide support for IoT devices and protocols, and for the
collection, processing, and analysis of sensor data from distributed IoT networks.

Components of the Aneka Framework

1. Aneka Runtime Environment

 The Aneka Runtime Environment is the component within the Aneka computing system that
supports the execution of distributed applications. It is built around the Aneka container, which is
responsible for scheduling computational tasks and distributing jobs across the available nodes. Key
features include:
 Task Execution Management: The Aneka container is responsible for the management of specific
tasks; it decides how tasks are assigned to resources and then manages their execution, their progress,
and any issue or failure that occurs in the process.
 Resource Abstraction: It hides the backend computing resources, which may be physical hosts,
virtual hosts, or containers, and presents a common execution model for applications.
 Scalability and Fault Tolerance: The main features of the runtime environment include the ability to
scale anticipating the levels of workload along with the means of handling faults so that distributed
applications can run effectively.

2. Aneka Development Toolkit

 The Aneka Development Toolkit is made up of tools, a library, and an Application Programming
Interface that can be used by developers in creating distributed applications on Aneka. It includes:
 Task Submission APIs: Interface for enlisting tasks and jobs to be run in an Aneka runtime
environment, as well as defining characteristics of job execution.
 Resource Management APIs: APIs for acquiring and using the compute resources allotted to an
application, including APIs that inform applications of the compute resources available to them and
of when to release resources for other uses.
 Development Libraries: Software libraries for data handling, interaction with other processes and
services, and defining workloads in distributed environments.

3.Aneka Software Development Kit (SDK)

 The SDK provides detailed documentation and samples that enable programmers to build their own
components, applications, or services on top of the Aneka framework in order to satisfy their specific
needs. It includes:
 API Documentation: The detailed manual of the Aneka APIs: how to use basic and advanced
methods, how some of them work, and recommendations for Aneka application development.
 Development Tools: Components of an IDE for building Aneka applications, which include code
editing tools, debuggers, and unit test tools that can be used as plug-ins in the supported IDEs –
Eclipse or Visual Studio.
 Sample Applications: Example code stubs and starter Aneka applications illustrating key aspects of
Aneka application implementation: task submission, resource management, and data processing.

Advantages of Aneka in Cloud Computing

 Scalability: Aneka provisions and allocates resources dynamically, so applications can scale to
match the required workload. It uses resources efficiently and allows for horizontal scaling to make
sure the underlying cloud platforms are used to their full benefit.
 Flexibility: Aneka supports various programming paradigms and orientations allowing software
developers to execute a broad range of different types of distributed applications as per their needs. It
organizes the architectural design and the deployment of an application while enabling it to be used
in a variety of contexts and under various architectures of the application.
 Cost Efficiency: Aneka has the potential to minimize the overall cost of infrastructure, as it increases
resource utilization and allows for predictable scaling of cloud deployments. Customers are billed
only for the resources they actually use, which avoids wasting some resources while others sit idle,
so good cost-performance ratios are achieved.
 Ease of Development: Aneka focuses on easing the creation of distributed applications by offering
high-level frameworks, tools, and libraries. It provides APIs for task submission, resource
management, and data processing, which allows applications to be built more efficiently and in a
shorter time.
 Portability: Currently, Aneka applications are independent of the specific cloud platform and
infrastructure software. It works on public, private or hybrid cloud environments without requiring
additional modifications and thus provides contractual freedom.
 Reliability and Fault Tolerance: Aneka includes mechanisms for graceful failure handling and job
resiliency, which enable distributed applications to be developed and run dependably. It also tracks
applications and provides failover at the cluster level in case of application failures.
 Integration Capabilities: Aneka can easily work in conjunction with current and active cloud
solutions, virtualization solutions, and containerization technologies. It comes with integrations for
different clouds and lets you work with third-party services and APIs, which is useful for functioning
in conjunction with existing systems and tools.
 Performance Optimization: Aneka improves resource utilization, schedules tasks efficiently, and
processes data efficiently. It utilizes parallelism, distribution, and caching techniques to optimize the
rate at which an application runs and its response time.
 Monitoring and Management: The features of Aneka include, monitoring and management tools for
assessing the performance of the applications that are hosted in it, consumption rates of the resources
as well as the general health of the system. It offers a dashboard, logging as well as analyses to
support proactive monitoring and diagnosing.

Disadvantages of Aneka in Cloud Computing

 Learning Curve: Aneka may take some time to understand for developers who are new to
distributed computing or who are not familiar with the programming models and abstractions used
by the system, so there is an initial learning curve to overcome.
 Complexity: Dealing with complexity while constructing and administering distributed applications
based on Aneka might occur if the application scale reaches considerable sizes or encompasses
sophisticated structural designs. Due to the distributed computing environment utilized by Aneka,
developers who wish to maximize the platform should know distributed computing concepts and
patterns.
 Integration Challenges: Aneka may be challenging to integrate with other frameworks, applications,
or services. Compatibility concerns can emerge when integrating Aneka with existing environments
or platforms, and differing configurations and disparate APIs can add further complexity.
 Resource Overhead: While Aneka’s runtime environment and middleware components can be
beneficial for the management and delivery of computational resources, they may also cause
additional overhead in the required memory, computational or network capabilities. This overhead
could potentially slow down application performance or even raise the amount of resources required
for execution, especially in contexts where resources are limited.
 Performance Bottlenecks: At some moments, resource utilization, scheduling, or communication
strategies of Aneka may become an issue and slow down the application. Application performance as
well as its scalability might be vital and should sometimes be tuned and profiled.

12.CloudSim
CloudSim is an open-source framework, which is used to simulate cloud computing infrastructure
and services. It is developed by the CLOUDS Lab organization and is written entirely in Java. It is used for
modelling and simulating a cloud computing environment as a means for evaluating a hypothesis prior to
software development in order to reproduce tests and results.

For example, if you were to deploy an application or a website on the cloud and wanted to test the
services and load that your product can handle and also tune its performance to overcome bottlenecks before
risking deployment, then such evaluations could be performed by simply coding a simulation of that
environment with the help of various flexible and scalable classes provided by the CloudSim package, free
of cost.

CloudSim Architecture:

CloudSim Core Simulation Engine provides interfaces for the management of resources such as VM,
memory and bandwidth of virtualized Datacenters.

CloudSim layer manages the creation and execution of core entities such as VMs, Cloudlets, Hosts etc. It
also handles network-related execution along with the provisioning of resources and their execution and
management.

User Code is the layer controlled by the user. The developer can write the requirements of the hardware
specifications in this layer according to the scenario.

Some of the most common classes used during simulation are:

 Datacenter: used for modelling the foundational hardware equipment of any cloud environment, that
is the Datacenter. This class provides methods to specify the functional requirements of the
Datacenter as well as methods to set the allocation policies of the VMs etc.
 Host: this class executes actions related to management of virtual machines. It also defines policies
for provisioning memory and bandwidth to the virtual machines, as well as allocating CPU cores to
the virtual machines.
 VM: this class represents a virtual machine by providing data members defining a VM’s bandwidth,
RAM, mips (million instructions per second), size while also providing setter and getter methods for
these parameters.
 Cloudlet: a cloudlet class represents any task that is run on a VM, like a processing task, or a
memory access task, or a file updating task etc. It stores parameters defining the characteristics of a
task such as its length, size, mi (million instructions) and provides methods similarly to VM class
while also providing methods that define a task’s execution time, status, cost and history.
 DatacenterBroker: is an entity acting on behalf of the user/customer. It is responsible for functioning
of VMs, including VM creation, management, destruction and submission of cloudlets to the VM.
 CloudSim: this is the class responsible for initializing and starting the simulation environment after
all the necessary cloud entities have been defined and later stopping after all the entities have been
destroyed.
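
To show how these classes fit together, here is a minimal sketch of a CloudSim simulation with one datacenter, one VM, and one cloudlet, written against the commonly used CloudSim 3.x Java API; the numeric parameters are arbitrary illustrative values.

import java.util.ArrayList;
import java.util.Calendar;
import java.util.LinkedList;
import java.util.List;
import org.cloudbus.cloudsim.*;
import org.cloudbus.cloudsim.core.CloudSim;
import org.cloudbus.cloudsim.provisioners.BwProvisionerSimple;
import org.cloudbus.cloudsim.provisioners.PeProvisionerSimple;
import org.cloudbus.cloudsim.provisioners.RamProvisionerSimple;

public class MinimalCloudSim {
    public static void main(String[] args) throws Exception {
        // 1. Initialise the core simulation engine (1 user, no trace events).
        CloudSim.init(1, Calendar.getInstance(), false);

        // 2. Model the hardware: one host with one processing element (PE) of 1000 MIPS.
        List<Pe> peList = new ArrayList<>();
        peList.add(new Pe(0, new PeProvisionerSimple(1000)));
        List<Host> hostList = new ArrayList<>();
        hostList.add(new Host(0, new RamProvisionerSimple(2048), new BwProvisionerSimple(10000),
                1_000_000, peList, new VmSchedulerTimeShared(peList)));

        DatacenterCharacteristics characteristics = new DatacenterCharacteristics(
                "x86", "Linux", "Xen", hostList, 10.0, 3.0, 0.05, 0.001, 0.0);
        new Datacenter("Datacenter_0", characteristics,
                new VmAllocationPolicySimple(hostList), new LinkedList<Storage>(), 0);

        // 3. Create a broker that acts on behalf of the user.
        DatacenterBroker broker = new DatacenterBroker("Broker_0");

        // 4. Describe one VM and one cloudlet (task) and submit them to the broker.
        Vm vm = new Vm(0, broker.getId(), 1000, 1, 512, 1000, 10000,
                "Xen", new CloudletSchedulerTimeShared());
        Cloudlet cloudlet = new Cloudlet(0, 400_000, 1, 300, 300,
                new UtilizationModelFull(), new UtilizationModelFull(), new UtilizationModelFull());
        cloudlet.setUserId(broker.getId());

        List<Vm> vmList = new ArrayList<>();
        vmList.add(vm);
        List<Cloudlet> cloudletList = new ArrayList<>();
        cloudletList.add(cloudlet);
        broker.submitVmList(vmList);
        broker.submitCloudletList(cloudletList);

        // 5. Run the simulation and print the outcome.
        CloudSim.startSimulation();
        CloudSim.stopSimulation();
        List<Cloudlet> finished = broker.getCloudletReceivedList();
        for (Cloudlet c : finished) {
            System.out.println("Cloudlet " + c.getCloudletId() + " finished with status "
                    + c.getCloudletStatusString() + " in " + c.getActualCPUTime() + " s");
        }
    }
}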

Features of CloudSim:
 Large scale virtualized Datacenters, servers and hosts.
 Customizable policies for provisioning host to virtual machines.
 Energy-aware computational resources.
 Application containers and federated clouds (joining and management of multiple public clouds).
 Datacenter network topologies and message-passing applications.
 Dynamic insertion of simulation entities with stop and resume of simulation.
 User-defined allocation and provisioning policies.
