Module 3-2
Security is the foremost concern for cloud computing. Gaining the trust of users,
especially for sensitive applications, is crucial. While public clouds are popular for
their scalability and cost efficiency, they may not be suitable for sensitive workloads;
in such cases, private or hybrid clouds provide more secure alternatives.
Data Vulnerability:
Data at rest and data in transit are particularly vulnerable to unauthorized access
and breaches. While encryption protects data in storage, it must be decrypted for
processing, which exposes it to potential attacks.
Data replication, essential for fault tolerance and service continuity, increases the
risk of unauthorized access or compromise, especially if proper safeguards are not
in place.
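To make the point about data at rest concrete, here is a minimal sketch of symmetric encryption and the unavoidable decryption step before processing. It assumes the third-party Python cryptography package; the record contents are invented for illustration.

```python
# Minimal sketch: encrypting data at rest and decrypting it for processing.
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # in practice held in a key-management service
cipher = Fernet(key)

record = b"patient-id=1842, diagnosis=..."
stored = cipher.encrypt(record)      # what actually lands on disk / in object storage

# To process the record it must be decrypted first; this is the exposure window
# the text refers to: plaintext exists in memory while the computation runs.
plaintext = cipher.decrypt(stored)
result = plaintext.upper()           # stand-in for any processing step
```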
Threat Landscape:
The three cloud delivery models face distinct challenges due to their unique
characteristics.
SaaS (Software-as-a-Service):
IaaS (Infrastructure-as-a-Service):
PaaS (Platform-as-a-Service):
Vendor Lock-In: Users often find themselves tied to a specific cloud provider due
to proprietary APIs, data formats, and integration tools. This dependence can
hinder flexibility, especially if a service critical to operations becomes unavailable.
Beyond technical challenges, cloud computing has significant social and economic
implications:
The core unit in workflow modeling is a task, which has several attributes:
•Name: unique identifier
•Description: natural-language explanation
•Actions: changes caused by task execution
•Preconditions: conditions that must be true before execution
•Post-conditions: conditions that must be true after execution
•Attributes: resource requirements, security needs, reversibility, etc.
•Exceptions: error-handling mechanisms
Tasks can be primitive (indivisible) or composite (composed of multiple tasks with a
defined execution order).
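As an illustration only, the attributes above can be captured in a small Python record; the field names and helper methods are assumptions for this sketch, not part of a standard workflow definition language.

```python
# Illustrative task record mirroring the attributes listed above.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Task:
    name: str                                   # unique identifier
    description: str                            # natural-language explanation
    actions: List[str] = field(default_factory=list)         # changes caused by execution
    preconditions: List[Callable[[], bool]] = field(default_factory=list)
    postconditions: List[Callable[[], bool]] = field(default_factory=list)
    attributes: Dict[str, str] = field(default_factory=dict)  # resources, security, reversibility
    subtasks: List["Task"] = field(default_factory=list)      # non-empty => composite task

    def is_primitive(self) -> bool:
        return not self.subtasks

    def ready(self) -> bool:
        """A task may start only when all of its preconditions hold."""
        return all(check() for check in self.preconditions)
```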
A routing task manages the flow between tasks, enabling sequential, concurrent, or
iterative execution.
A process description (or workflow schema) outlines task execution order and is often
written in a Workflow Definition Language (WFDL).
Workflow descriptions resemble flowcharts, supporting branching, concurrency, and
iteration.
Errors like deadlocks (when tasks block each other due to resource contention) can
occur. One way to prevent deadlocks is to acquire all necessary resources at once,
though this may reduce resource utilization.
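The "acquire everything up front" strategy can be sketched with ordinary locks; the resource names and helper below are illustrative assumptions, not part of the original text.

```python
# All-or-nothing resource acquisition: a task either gets every lock it needs or none.
import threading

resources = {"db": threading.Lock(), "gpu": threading.Lock(), "scratch": threading.Lock()}

def acquire_all(names):
    """Try to grab every needed lock; back off completely if any is busy."""
    taken = []
    for name in sorted(names):              # a fixed global order also rules out circular waits
        lock = resources[name]
        if lock.acquire(blocking=False):
            taken.append(lock)
        else:
            for held in taken:              # release everything and let the caller retry later
                held.release()
            return None
    return taken

held = acquire_all({"db", "scratch"})
if held:
    try:
        pass                                # run the task with every resource in hand
    finally:
        for lock in held:
            lock.release()
# The trade-off noted above: resources sit idle while the task holds them,
# which lowers overall utilization.
```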
Workflow Patterns
Workflow patterns define the relationships between tasks (a code sketch follows the list):
1. Sequence Pattern – Tasks are executed one after another.
2. AND Split Pattern – Multiple tasks are triggered concurrently.
3. Synchronization Pattern – A task starts only after multiple preceding tasks complete.
4. XOR Split Pattern – A decision determines which of two tasks will execute.
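A minimal sketch of the four patterns, assuming Python's asyncio as the execution engine and placeholder task bodies:

```python
# Sequence, AND split, synchronization, and XOR split expressed with asyncio.
import asyncio
import random

async def task(name):
    print("running", name)

async def workflow():
    # 1. Sequence: B starts only after A finishes.
    await task("A")
    await task("B")

    # 2. AND split + 3. Synchronization: C and D run concurrently,
    #    and E starts only after both have completed.
    await asyncio.gather(task("C"), task("D"))
    await task("E")

    # 4. XOR split: a decision selects exactly one of two branches.
    if random.random() < 0.5:
        await task("F")
    else:
        await task("G")

asyncio.run(workflow())
```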
3.4 Coordination Based on a State Machine Model: The ZooKeeper
What is a Distributed Coordination Service?
A distributed coordination service (DCS) allows multiple nodes of a distributed system
to operate and coordinate with one another. It provides tools to manage all nodes and
to integrate everything, ensuring consistency and preventing errors.
A popular example of a distributed coordination service is Apache ZooKeeper.
ZooKeeper is a popular open-source tool designed to handle this challenge by
providing services like leader election, distributed locking, and configuration
management.
It follows a state machine model, where multiple servers work together, electing a
leader for coordination. Clients can connect to any server via TCP connections to send
requests and receive responses.
The application programming interface (API) to the ZooKeeper service is very simple
and consists of seven operations:
•create – add a node at a given location on the tree.
•delete – delete a node.
•get data – read data from a node.
•set data – write data to a node.
•get children – retrieve a list of the children of the node.
•synch – wait for the data to propagate.
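A sketch of these operations using the third-party kazoo Python client; the server address and paths are assumptions for illustration.

```python
# Basic ZooKeeper operations via the kazoo client (assumes a server on localhost:2181).
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()                                        # open a TCP session to one server

zk.create("/app/config", b"v1", makepath=True)    # create: add a node on the tree
data, stat = zk.get("/app/config")                # get data: read data from a node
zk.set("/app/config", b"v2")                      # set data: write data to a node
children = zk.get_children("/app")                # get children: list a node's children
zk.sync("/app/config")                            # synch: wait for updates to propagate
zk.delete("/app/config")                          # delete: remove the node

zk.stop()
```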
The system also supports the creation of ephemeral nodes, described below.
How ZooKeeper Works
1. Clients connect via TCP, sending requests and receiving updates.
2. Reads are handled by local replicas, while writes go through the leader.
3. An atomic broadcast protocol (Zab, a Paxos-like protocol) ensures consistency by
requiring quorum agreement for updates.
4. ZooKeeper supports ephemeral nodes, which exist only for the duration of a client
session.
Use Cases
ZooKeeper is widely used in distributed systems for service coordination,
synchronization, leader election, and group membership. Notable users include
Yahoo!’s Message Broker and various cloud applications.
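As one example of group membership, ephemeral nodes can track which workers are alive; this is a hedged sketch with the kazoo client, and the paths and addresses are invented for illustration.

```python
# Group membership via ephemeral nodes: a member's node vanishes when its session ends.
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

zk.ensure_path("/members")
# Each worker registers an ephemeral, sequenced child holding its address.
zk.create("/members/worker-", b"10.0.0.7:8080", ephemeral=True, sequence=True)

# Any client can watch the live membership list and react to joins and failures.
@zk.ChildrenWatch("/members")
def on_membership_change(members):
    print("current members:", members)
```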
The details of the workflow of GrepTheWeb are captured in Figure 4.7(b) and
consist of the following steps:
1. The startup phase. Creates several queues – launch, monitor, billing, and
shutdown queues. Starts the corresponding controller threads. Each thread
periodically polls its input queue and, when a message is available, retrieves the
message, parses it, and takes the required actions (a generic controller-loop sketch
follows this list).
2. The processing phase. This phase is triggered by a StartGrep user request; then a
launch message is enqueued in the launch queue. The launch controller thread picks
up the message and executes the launch task; then, it updates the status and time
stamps in the Amazon Simple DB domain.
Finally, it enqueues a message in the monitor queue and deletes the message from the
launch queue.
The processing phase consists of the following steps:
a. The launch task starts Amazon EC2 instances. It uses an Amazon Machine Image
(AMI) with the Java Runtime Environment preinstalled, deploys the required Hadoop
libraries, and starts a Hadoop job (running Map/Reduce tasks).
b. Hadoop runs map tasks on Amazon EC2 slave nodes in parallel. A map task
takes files from Amazon S3, runs a regular expression, and writes the match results
locally, along with a description of up to five matches. Then the combine/reduce
task combines and sorts the results and consolidates the output.
c. Final results are stored on Amazon S3 in the output bucket.
3. The monitoring phase. The monitor controller thread retrieves the message left at
the beginning of the processing phase, validates the status/error in Amazon Simple
DB, and executes the monitor task.
It updates the status in the Amazon Simple DB domain and enqueues messages in
the shutdown and billing queues.
The monitor task checks for the Hadoop status periodically and updates the Simple
DB items with status/error and the Amazon S3 output file.
Finally, it deletes the message from the monitor queue when the processing is
completed.
4. The shutdown phase. The shutdown controller thread retrieves the message from
the shutdown queue and executes the shutdown task, which updates the status and
time stamps in the Amazon Simple DB domain. Finally, it deletes the message from
the shutdown queue after processing.
The shutdown phase consists of the following steps:
a. The shutdown task kills the Hadoop processes, terminates the EC2 instances after
getting EC2 topology information from Amazon Simple DB, and disposes of the
infrastructure.
b. The billing task gets the EC2 topology information, Simple DB usage, and S3 file
and query input, calculates the charges, and passes the information to the billing
service.
5. The cleanup phase. Archives the Simple DB data with user info.
6. User interactions with the system. Users get the status and output results. The
GetStatus call is applied to the service endpoint to get the status of the overall system
(all controllers and Hadoop), and the filtered results are downloaded from Amazon S3
after completion.
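The controller-thread pattern used throughout these phases can be sketched with in-process queues standing in for Amazon SQS and a dictionary standing in for the Simple DB domain; everything here is a simplified stand-in, not the actual GrepTheWeb code.

```python
# Generic controller loop: poll a queue, act, record status, hand off, delete the message.
import queue
import threading
import time

launch_q, monitor_q = queue.Queue(), queue.Queue()
status_db = {}                                    # stand-in for the Amazon Simple DB domain

def launch_controller():
    while True:
        try:
            msg = launch_q.get(timeout=1)         # periodically poll the input queue
        except queue.Empty:
            continue
        job_id = msg["job_id"]
        status_db[job_id] = {"state": "LAUNCHING", "ts": time.time()}
        # ... the launch task would start EC2 instances and the Hadoop job here ...
        status_db[job_id] = {"state": "RUNNING", "ts": time.time()}
        monitor_q.put({"job_id": job_id})         # enqueue a message for the monitor controller
        launch_q.task_done()                      # "delete" the message from the launch queue

threading.Thread(target=launch_controller, daemon=True).start()
launch_q.put({"job_id": "grep-42"})               # a StartGrep request enqueues a launch message
launch_q.join()                                   # wait until the message has been handled
```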
To optimize the end-to-end transfer rates in the S3 storage system, multiple files are
bundled up and stored as S3 objects. Another performance optimization is to run a
script that sorts the keys and the URL pointers and uploads them to S3 in sorted order.
In addition, multiple fetch threads are started to retrieve the objects in parallel.
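A sketch of the multi-threaded fetch optimization using boto3; the bucket and key names are placeholders, and credentials are assumed to be configured in the environment.

```python
# Fetch several S3 objects in parallel with a small thread pool.
from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client("s3")
BUCKET = "example-output-bucket"                  # placeholder bucket name

def fetch(key):
    obj = s3.get_object(Bucket=BUCKET, Key=key)
    return key, obj["Body"].read()

keys = sorted(["results/part-0001", "results/part-0002", "results/part-0003"])
with ThreadPoolExecutor(max_workers=8) as pool:   # multiple fetch threads in parallel
    for key, data in pool.map(fetch, keys):
        print(key, len(data), "bytes")
```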
This application illustrates the means to create an on-demand infrastructure and run it
on a massively distributed system in a manner that allows it to run in parallel and
scale up and down based on the number of users and the problem size.
3.7 Clouds for Science and Engineering:
In a talk delivered in 2007 and posted on his Web site just before he went missing in
January 2007, computer scientist Jim Gray discussed eScience as a transformative
scientific method. Today, eScience unifies experiment, theory, and simulation; data
captured from measuring instruments or generated by simulations are processed by
software systems, and data and knowledge are stored by computer systems and
analyzed using statistical packages.
The Web search technology allows scientists to discover text documents related to such
data, but the binary encoding of many of the documents poses serious challenges.
Metadata is used to describe digital data and provides an invaluable aid for discovering
useful information in a scientific data set.
A recent paper describes a system for data discovery that supports automated fine-
grained metadata extraction and summarization schemes for browsing large data sets
and is extensible to different scientific domains.
The system, called Glean, is designed to run on a computer cluster or on a cloud; its
run-time system supports two computational models, one based on MapReduce and the
other on graph-based orchestration.
Traditionally, supercomputers have been the primary choice for large-scale scientific
computations due to their high processing power and efficient interconnects. However,
cloud computing is emerging as a flexible and cost-effective alternative.
Each of these applications has different computational demands, with some being more
dependent on raw processing power while others require high-speed communication
between nodes.
The results show that while EC2 performed well for compute-intensive tasks, it
struggled with applications requiring frequent inter-node communication.
Carver, Franklin, and Lawrencium (HPC systems at Lawrence Berkeley National
Laboratory) demonstrated superior performance, particularly for workloads that
involve extensive data exchange.
One of the key challenges of using cloud computing for HPC is its high network
latency and lower communication bandwidth, which make it inefficient for parallel-
processing applications that rely on fast interconnects.
Additionally, cloud platforms often suffer from performance variability, as resources
are shared among multiple users.
Despite these limitations, cloud computing remains a viable option for independent,
compute-heavy tasks and can complement supercomputers in a hybrid HPC model,
where cloud infrastructure is used for data preprocessing and storage while high-
performance systems handle complex computations.
In conclusion, while cloud computing is not yet a full replacement for supercomputers,
it provides a scalable and cost-effective solution for certain scientific workloads.
The results in Table 4.1 give us some ideas about the characteristics of scientific
applications likely to run efficiently on the cloud. Communication-intensive
applications will be affected by the increased latency (more than 70 times larger than
Carver) and lower bandwidth (more than 70 times smaller than Carver).
3.9 Cloud Computing for Biology Research
Biology, a field requiring vast computational power, has been an early adopter of cloud
computing to handle large-scale data processing.
[Note: computational power refers to the ability of a computer to process data and perform tasks.]
To complete the computation efficiently, the team allocated 3,700 weighted instances
across three data centers, using 475 extra-large VMs (each with 8-core CPUs, 14GB
RAM, and 2TB storage). The computation, which would have taken 6–7 CPU-years,
was completed in 14 days, producing 260GB of compressed output across 400,000 files.
Social networks have grown in both size and functionality, making large-scale data
analysis crucial.
Cloud computing enables efficient distribution of computational workloads for
evaluating social closeness, which is highly resource-intensive.
Traditional methods like sampling and surveying are inadequate for large networks.
Social intelligence, involving knowledge discovery and pattern recognition, benefits
from cloud resources.
Case-based reasoning (CBR) is a preferred approach for large-scale recommendation
systems, as it handles data accumulation better than rule-based systems.
The BetterLife 2.0 system demonstrates CBR in social computing. It consists of a cloud
layer, a CBR engine, and an API. Using MapReduce, the system computes pairwise
social closeness, retrieving similar cases efficiently.
This iterative process allows CBR systems to improve over time by learning from past
experiences.
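To illustrate what a pairwise social-closeness computation looks like, here is a small sketch that uses Jaccard overlap of friend sets; the metric and the data are assumptions for illustration, not the actual BetterLife 2.0 algorithm, and the MapReduce distribution is reduced to a plain loop over pairs.

```python
# Pairwise closeness as Jaccard overlap of friend sets (each pair would be a map task).
from itertools import combinations

friends = {
    "alice": {"bob", "carol", "dave"},
    "bob":   {"alice", "carol"},
    "carol": {"alice", "bob", "erin"},
}

def closeness(a, b):
    common = friends[a] & friends[b]
    union = friends[a] | friends[b]
    return len(common) / len(union) if union else 0.0

scores = {(a, b): closeness(a, b) for a, b in combinations(friends, 2)}
print(max(scores, key=scores.get), "is the closest pair")
```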
In the past, social networks have been constructed for a specific application domain like
biology (MyExperiment) and nanoscience (nanoHub), enabling researchers to share
workflows.
To address accountability, credit-based models such as PlanetLab's and middleware
solutions such as BOINC are used. In a credit-based system, users earn credits by
contributing resources and then spend those credits when using other resources.
Digital Content
Cloud computing provides a flexible and scalable infrastructure for managing digital
content, enabling efficient storage, distribution, and delivery of various media formats,
including documents, images, audio, and video.
The new technologies supported by cloud computing favor the creation of digital
content.
Data mashups or composite services combine data extracted from different sources.
[A data mashup is the process of combining data from multiple sources into a single
data source.]
Event-driven mashups, also called Svc, interact through events rather than the
traditional request/response model.
To improve reliability, the mashup system uses Java Message Service (JMS) for
asynchronous communication. Fault tolerance is achieved through VMware
vSphere, where primary and secondary virtual machines (VMs) run simultaneously. If
one fails, the other seamlessly takes over, ensuring continuous operation.
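The event-driven interaction itself is easy to sketch with a message queue standing in for a JMS-style broker; the event shape and component names below are invented for illustration.

```python
# Publish/subscribe sketch: components react to events instead of pairing requests with responses.
import queue
import threading

events = queue.Queue()

def weather_source():
    events.put({"type": "weather", "city": "Rome", "temp_c": 21})    # publish an event

def mashup_consumer():
    while True:
        evt = events.get()                        # consume whenever an event arrives
        if evt["type"] == "weather":
            print(f"Overlaying {evt['temp_c']} C on the map for {evt['city']}")
        events.task_done()

threading.Thread(target=mashup_consumer, daemon=True).start()
weather_source()
events.join()                                     # wait until the event has been handled
```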
Social computing, cloud technology, and digital content are deeply interconnected.
Cloud computing enhances data analysis, recommendation systems, volunteer
computing, and social media scalability. It also ensures fault tolerance and service
reliability, making it a crucial backbone of modern digital applications.
1. Task Initialization
2. Map Phase
Each Map worker reads its assigned input split and processes it using the
user-defined Map function.
The intermediate <key, value> pairs generated are buffered in memory and
then partitioned into R regions, stored locally on disk.
3. Reduce Phase
The master informs the Reduce workers where the intermediate data is stored;
Reduce workers retrieve this data via remote procedure calls (RPCs).
The Reduce function processes each unique key and its associated values.
4. Completion
The master tracks the state of tasks and pings workers periodically.
Once all tasks are completed, the master signals the user program that execution is finished.
The system ensures efficient scheduling and fault tolerance, making it robust for
processing large-scale data (a minimal in-memory sketch follows).
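A minimal in-memory sketch of these phases, with a word count standing in for the user-defined Map and Reduce functions; the data, the value of R, and the hash partitioning are illustrative choices.

```python
# Map phase -> partition into R regions -> Reduce phase, all in one process.
from collections import defaultdict

R = 3
splits = ["the quick brown fox", "the lazy dog", "quick quick fox"]

def map_fn(text):                       # user-defined Map: emit <word, 1> pairs
    return [(word, 1) for word in text.split()]

# Map phase: each "worker" processes one split and partitions its output into R regions.
regions = [defaultdict(list) for _ in range(R)]
for split in splits:
    for key, value in map_fn(split):
        regions[hash(key) % R][key].append(value)

# Reduce phase: each "worker" takes one region and reduces every key stored in it.
def reduce_fn(key, values):             # user-defined Reduce: sum the counts
    return key, sum(values)

results = [reduce_fn(k, vs) for region in regions for k, vs in region.items()]
print(sorted(results))
```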
[Figure: MapReduce execution. The application forks a master instance; M map
instances each read an input segment and write intermediate results to local disk;
R reduce instances pull the intermediate data and write the final results to shared
storage.]