Notes - Cloud Computing
https://www.javatpoint.com/cloud-computing-tutorial
What is Cloud?
Cloud refers to the internet or a network. Cloud computing is the processing, configuring, accessing, and storing of data, and the developing, deploying, delivering, and maintaining of applications and services, over the internet. It includes both hardware and software computing resources delivered through the internet.
Why Cloud
Traditional systems require dedicated, self-maintained infrastructure: computers, database servers, networking servers, and so on; the cloud removes this burden by renting the same resources on demand.
Parallel Computing:
Parallel computing is the use of multiple processing elements simultaneously to solve a problem. The problem is broken down into instructions that are solved concurrently, with every processing resource applied to the work operating at the same time.
Types of Parallelism:
1. Bit-level parallelism –
It is the form of parallel computing based on increasing the processor's word size. A larger word size reduces the number of instructions the system must execute to perform a task on large-sized data.
Example: Consider a scenario where an 8-bit processor must compute the sum of two 16-bit integers. It must first add the 8 lower-order bits and then the 8 higher-order bits, thus requiring two instructions to perform the operation, whereas a 16-bit processor can perform the operation with just one instruction (see the sketch after this list).
2. Instruction-level parallelism –
Without it, a processor can issue at most one instruction per clock cycle. Instructions can be re-ordered and grouped so that they are executed concurrently without affecting the result of the program; this is called instruction-level parallelism.
3. Task Parallelism –
Task parallelism decomposes a task into subtasks and allocates each subtask to a processor for execution. The processors execute the subtasks concurrently (see the sketch after this list).
4. Data-level parallelism (DLP) –
It is the form of parallel computing in which the same operation is performed simultaneously on multiple elements of a data set, as in SIMD processors that apply one instruction to many data items at once.
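Two small Java sketches of the ideas above (illustrative code, not from the original notes). The first mimics the 8-bit example: a 16-bit sum is computed as two 8-bit additions plus a carry, which is exactly why the 8-bit processor needs two instructions.

public class BitLevelAddDemo {
    public static void main(String[] args) {
        int a = 0x1234, b = 0x0F0F;                 // two 16-bit operands
        int loSum = (a & 0xFF) + (b & 0xFF);        // step 1: add the 8 low-order bits
        int carry = loSum >> 8;                     // carry out of the low byte
        int hiSum = ((a >> 8) & 0xFF) + ((b >> 8) & 0xFF) + carry; // step 2: add the high bits
        int result = ((hiSum & 0xFF) << 8) | (loSum & 0xFF);
        System.out.printf("0x%04X + 0x%04X = 0x%04X%n", a, b, result & 0xFFFF);
    }
}

The second sketches task parallelism: one problem (summing a range of numbers) is decomposed into two independent subtasks that a thread pool executes concurrently; the particular subtasks are arbitrary placeholders.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class TaskParallelismDemo {
    static long sum(long from, long to, long step) {
        long s = 0;
        for (long i = from; i < to; i += step) s += i;
        return s;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2); // two processing elements
        Future<Long> evens = pool.submit(() -> sum(0, 1_000_000, 2)); // subtask 1
        Future<Long> odds  = pool.submit(() -> sum(1, 1_000_000, 2)); // subtask 2
        System.out.println("total = " + (evens.get() + odds.get())); // combine the results
        pool.shutdown();
    }
}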
Service-Oriented Architecture (SOA)
Services might aggregate information and data retrieved from other services, or create workflows of services to satisfy the request of a given service consumer. This practice is known as service orchestration. Another important interaction pattern is service choreography, which is the coordinated interaction of services without a single point of control.
Guiding Principles of SOA:
1. Standardized service contract: Specified through one or more service
description documents.
2. Loose coupling: Services are designed as self-contained components and maintain relationships that minimize dependencies on other services.
3. Abstraction: A service is completely defined by service contracts and
description documents. They hide their logic, which is encapsulated
within their implementation.
4. Reusability: Designed as components, services can be reused more
effectively, thus reducing development time and the associated costs.
5. Autonomy: Services have control over the logic they encapsulate and, from a service consumer's point of view, there is no need to know about their implementation.
6. Discoverability: Services are defined by description documents that
constitute supplemental metadata through which they can be effectively
discovered. Service discovery provides an effective means for utilizing
third-party resources.
7. Composability: Using services as building blocks, sophisticated and
complex operations can be implemented. Service orchestration and
choreography provide solid support for composing services and
achieving business goals.
Advantages of SOA:
● Service reusability: In SOA, applications are made from existing
services. Thus, services can be reused to make many applications.
● Easy maintenance: As services are independent of each other they
can be updated and modified easily without affecting other services.
● Platform independent: SOA allows making a complex application by
combining services picked from different sources, independent of the
platform.
● Availability: SOA facilities are easily available to anyone on request.
● Reliability: SOA applications are more reliable because it is easier to debug small services than a huge monolithic codebase.
● Scalability: Services can run on different servers within an environment, which increases scalability.
Disadvantages of SOA:
● High overhead: A validation of input parameters of services is done
whenever services interact this decreases performance as it increases
load and response time.
● High investment: A huge initial investment is required for SOA.
● Complex service management: When services interact, they exchange messages to perform tasks; the number of messages may run into the millions. Handling such a large number of messages becomes a cumbersome task.
Practical applications of SOA: SOA is used in many ways around us whether it
is mentioned or not.
1. SOA infrastructure is used by many armies and air forces to deploy
situational awareness systems.
2. SOA is used to improve healthcare delivery.
3. Nowadays many apps are games and they use inbuilt functions to run.
For example, an app might need GPS so it uses the inbuilt GPS
functions of the device. This is SOA in mobile solutions.
4. SOA helps museums maintain a virtualized storage pool for their information and content.
Service-Oriented Terminologies
Let's see some important service-oriented terminologies:
● Services - The services are the logical entities defined by one or more published
interfaces.
● Service provider - It is a software entity that implements a service specification.
● Service consumer - It can be called a requestor or client that calls a service provider. A service consumer can be another service or an end-user application.
● Service locator - It is a service provider that acts as a registry. It is responsible
for examining service provider interfaces and service locations.
● Service broker - It is a service provider that passes service requests to one or more additional service providers.
Characteristics of SOA
The services have the following characteristics:
● They are loosely coupled.
● They support interoperability.
● They are location-transparent.
● They are self-contained.
Components of service-oriented architecture
The service-oriented architecture stack can be categorized into two parts - functional
aspects and quality of service aspects.
Functional aspects
The functional aspect contains:
● Transport - It transports the service requests from the service consumer to the
service provider and service responses from the service provider to the service
consumer.
● Service Communication Protocol - It allows the service provider and the service
consumer to communicate with each other.
● Service Description - It describes the service and data required to invoke it.
● Service - It is an actual service.
● Business Process - It represents the group of services called in a particular
sequence associated with the particular rules to meet the business
requirements.
● Service Registry - It contains the description of data which is used by service
providers to publish their services.
Quality of Service aspects
The quality of service aspects contains:
● Policy - It represents the set of protocols according to which a service provider
makes and provides the services to consumers.
● Security - It represents the set of protocols required for identification and
authorization.
● Transaction - It provides the surety of consistent results. This means that if a group of services is used to complete a business function, either all of them must complete or none of them.
● Management - It defines the set of attributes used to manage the services.
https://www.vmware.com/in/topics/glossary/content/hypervisor.html
Web Technology
Due to cloud computing’s fundamental reliance on internetworking, Web browser
universality, and the ease of Web-based service development, Web technology is
generally used as both the implementation medium and the management
interface for cloud services.
This section introduces the primary Web technologies and discusses their
relationship to cloud services.
Resources vs. IT Resources
Artifacts accessible via the World Wide Web are referred to as resources or Web
resources. This is a more generic term than IT resources. An IT resource, within
the context of cloud computing, represents a physical or virtual IT-related artifact
that can be software or hardware-based. A resource on the Web, however, can
represent a wide range of artifacts accessible via the World Wide Web. For
example, a JPG image file accessed via a Web browser is considered a
resource. For examples of common IT resources, see the IT Resource section.
Furthermore, the term resource may be used in a broader sense to refer to
general types of processable artifacts that may not exist as standalone IT
resources. For example, CPUs and RAM memory are types of resources that are
grouped into resource pools and can be allocated to actual IT resources.
This section covers the following topics:
● Basic Web Technology
● Web Applications
Web Applications
A distributed application that uses Web-based technologies (and generally relies
on Web browsers for the presentation of user-interfaces) is typically considered a
Web application. These applications can be found in all kinds of cloud-based
environments due to their high accessibility.
Figure 1 presents a common architectural abstraction for Web applications that is
based on the basic three-tier model. The first tier is called the presentation layer,
which represents the user-interface. The middle tier is the application layer that
implements application logic, while the third tier is the data layer that is
comprised of persistent data stores.
The presentation layer has components on both the client and server-side. Web
servers receive client requests and retrieve requested resources directly as static
Web content and indirectly as dynamic Web content, which is generated
according to the application logic. Web servers interact with application servers in
order to execute the requested application logic, which then typically involves
interaction with one or more underlying databases.
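As a minimal sketch of the middle tier (the class name and logic below are illustrative, not from the original text), here is a Java servlet that receives a client request, applies a small piece of application logic, and returns dynamically generated Web content to the presentation layer; a real application layer would also query the data tier:

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class GreetingServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        String name = req.getParameter("name"); // input sent by the presentation tier
        // Application logic (trivial here); a real servlet would also call the data tier.
        resp.setContentType("text/html");
        PrintWriter out = resp.getWriter();
        out.println("<html><body>Hello, " + (name == null ? "world" : name) + "!</body></html>");
    }
}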
PaaS ready-made environments enable cloud consumers to develop and deploy
Web applications. Typical PaaS offerings have separate instances of the Web
server, application server, and data storage server environments.
Note
For more information about URLs, HTTP, HTML, and XML, visit
www.servicetechspecs.com.
Multitenant Technology
The multitenant application design was created to enable multiple users (tenants) to
access the same application logic simultaneously. Each tenant has its own view of the
application that it uses, administers, and customizes as a dedicated instance of the
software while remaining unaware of other tenants that are using the same application.
Multitenant applications ensure that tenants do not have access to data and
configuration information that is not their own. Tenants can individually customize
features of the application, such as:
● User Interface – Tenants can define a specialized “look and feel” for their
application interface.
● Business Process – Tenants can customize the rules, logic, and workflows of
the business processes that are implemented in the application.
● Data Model – Tenants can extend the data schema of the application to
include, exclude, or rename fields in the application data structures.
● Access Control – Tenants can independently control the access rights for
users and groups.
Multitenant application architecture is often significantly more complex than that of
single-tenant applications. Multitenant applications need to support the sharing of
various artifacts by multiple users (including portals, data schemas, middleware, and
databases), while maintaining security levels that segregate individual tenant
operational environments.
Common characteristics of multitenant applications include:
● Usage Isolation – The usage behavior of one tenant does not affect the
application availability and performance of other tenants.
● Data Security – Tenants cannot access data that belongs to other tenants.
● Recovery – Backup and restore procedures are separately executed for the
data of each tenant.
● Application Upgrade – Tenants are not negatively affected by the synchronous
upgrading of shared software artifacts.
● Scalability – The application can scale to accommodate increases in usage by
existing tenants and/or increases in the number of tenants.
● Metered Usage – Tenants are charged only for the application processing and
features that are actually consumed.
● Data Tier Isolation – Tenants can have individual databases, tables, and/or
schemas isolated from other tenants. Alternatively, databases, tables, and/or
schemas can be designed to be intentionally shared by tenants.
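A minimal sketch of the shared-schema variant of data tier isolation (the table, column, and helper names are hypothetical): every row carries a tenant identifier, and every query is scoped to the calling tenant, so one tenant can never read another tenant's rows.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class TenantScopedQueries {
    /** Lists the orders visible to one tenant; rows of other tenants are never returned. */
    static void listOrders(Connection conn, String tenantId) throws SQLException {
        String sql = "SELECT id, item FROM orders WHERE tenant_id = ?"; // shared table, per-row tenant column
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, tenantId); // scope every query to the calling tenant
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getLong("id") + " " + rs.getString("item"));
                }
            }
        }
    }
}

The alternative designs mentioned above (a database, table, or schema per tenant) achieve the same isolation at the infrastructure level rather than in every query.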
A multitenant application that is being concurrently used by two different tenants is
illustrated in Figure 5.11. This type of application is typical with SaaS implementations.
Cloud Architecture – Public cloud – Private cloud – Hybrid cloud – Types of Services in Cloud – Infrastructure as a Service (IaaS) – Platform as a Service (PaaS) – Software as a Service (SaaS) – Cloud Storage – Pros and cons of Cloud Storage – Virtualization concepts – Disaster Recovery mechanism.
Unit IV – Cloud Platforms and its Security
Resource management in Cloud – Scheduling – Scheduling MapReduce algorithms – Cloud resource management policies – Overview of cloud security – Security issues – Security principles – Security in Virtual Machines – Standards in Cloud Security.
Resource Management System
1. Admission control – prevents the system from accepting workloads that violate high-level system policies.
2. Capacity allocation – allocates resources for individual activations of a service.
3. Load balancing – distributes the workload evenly among the servers (see the sketch after this list).
4. Energy optimization – minimizes energy consumption.
5. Quality of service (QoS) guarantees – the ability to satisfy timing or other conditions specified by a Service Level Agreement.
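Load balancing, for instance, can be as simple as rotating through the servers in turn. A minimal Java sketch (the server names are placeholders, not from the notes):

import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class RoundRobinBalancer {
    private final List<String> servers;
    private final AtomicInteger next = new AtomicInteger();

    public RoundRobinBalancer(List<String> servers) {
        this.servers = servers;
    }

    /** Returns the server that should receive the next request. */
    public String pick() {
        int i = Math.floorMod(next.getAndIncrement(), servers.size());
        return servers.get(i);
    }

    public static void main(String[] args) {
        RoundRobinBalancer lb = new RoundRobinBalancer(List.of("s1", "s2", "s3"));
        for (int k = 0; k < 6; k++) System.out.println(lb.pick()); // s1 s2 s3 s1 s2 s3
    }
}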
Control theory uses feedback to guarantee system stability and predict transient behavior.
Machine learning does not need a performance model of the system.
Utility-based approaches require a performance model and a mechanism to correlate user-level performance with cost (for example, a utility function that yields a reward while response time stays within the SLA target and a penalty once it exceeds it).
Market-oriented/economic mechanisms do not require a model of the system, e.g., combinatorial auctions for bundles of resources.
Scheduling
The aim of using scheduling techniques in a cloud environment is to improve system throughput and load balance, maximize resource utilization, save energy, reduce costs, and minimize the total processing time. Therefore, the scheduler should consider the virtualized resources and the users' required constraints to achieve an efficient matching between jobs and resources. Each scheduling technique is based on one or more strategies. The most important strategies or objectives commonly used are time, cost, energy, QoS, and fault tolerance.
Scheduling Policies
Scheduling Types:
Different types and categories of scheduling in cloud computing systems are shown in the source slides (figure omitted here).
MapReduce algorithm
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
● The map task is done by means of Mapper Class
● The reduce task is done by means of Reducer Class.
Mapper class takes the input, tokenizes it, maps and sorts it. The output of Mapper
class is used as input by Reducer class, which in turn searches matching pairs and
reduces them.
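A sketch of these two classes using Hadoop's Java MapReduce API, in the classic word-count form (the type choices are illustrative): the Mapper tokenizes each input line and emits (word, 1) pairs, and the Reducer searches out the matching pairs for each key and reduces them to a sum.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString()); // tokenize the line
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get(); // reduce the matching pairs
            context.write(key, new IntWritable(sum));
        }
    }
}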
MapReduce implements various mathematical algorithms to divide a task into small
parts and assign them to multiple systems. In technical terms, MapReduce algorithm
helps in sending the Map & Reduce tasks to appropriate servers in a cluster.
These mathematical algorithms may include the following −
● Sorting
● Searching
● Indexing
● TF-IDF
Sorting
Sorting is one of the basic MapReduce algorithms used to process and analyze data. MapReduce implements a sorting algorithm to automatically sort the output key-value pairs from the mapper by their keys.
● Sorting methods are implemented in the mapper class itself.
● In the Shuffle and Sort phase, after tokenizing the values in the mapper class, the Context class collects the matching valued keys as a collection.
● To collect similar key-value pairs (intermediate keys), the framework uses the RawComparator class to sort the key-value pairs.
● The set of intermediate key-value pairs for a given Reducer is automatically sorted by Hadoop to form key-values (K2, {V2, V2, …}) before they are presented to the Reducer.
Searching
Searching plays an important role in the MapReduce algorithm. It helps in the combiner phase (optional) and in the Reducer phase.
Indexing
Normally, indexing is used to point to particular data and its address. It performs batch indexing on the input files for a particular Mapper.
The indexing technique normally used in MapReduce is known as an inverted index. Search engines like Google and Bing use the inverted indexing technique. Let us try to understand how indexing works with the help of a simple example.
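Since the worked example itself is missing from these notes, here is the standard illustration of an inverted index. Given three documents
D1 = "it is what it is", D2 = "what is it", D3 = "it is a banana",
the inverted index maps each term to the set of documents that contain it:
"a" → {D3}, "banana" → {D3}, "is" → {D1, D2, D3}, "it" → {D1, D2, D3}, "what" → {D1, D2}.
A search for "what is it" can then be answered by intersecting the sets for the query terms: {D1, D2} ∩ {D1, D2, D3} ∩ {D1, D2, D3} = {D1, D2}.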
TF-IDF
TF-IDF is a text processing algorithm whose name is short for Term Frequency − Inverse Document Frequency. It is one of the common web analysis algorithms. Here, 'term frequency' refers to the number of times a term appears in a document.
Term Frequency (TF)
It measures how frequently a particular term occurs in a document. It is calculated by
the number of times a word appears in a document divided by the total number of
words in that document.
TF(the) = (Number of times the term 'the' appears in a document) / (Total number of terms in the document)
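For example, if 'the' appears 2 times in a 6-word document, TF(the) = 2/6 ≈ 0.33. The notes stop at TF; the standard companion formula is IDF(t) = log(Total number of documents / Number of documents containing the term t), and the combined weight is TF-IDF(t) = TF(t) × IDF(t). A small Java sketch (the helper names are mine, not part of any library):

import java.util.List;

public class TfIdfDemo {
    /** TF: occurrences of the term divided by the total number of terms in the document. */
    static double tf(List<String> doc, String term) {
        long count = doc.stream().filter(term::equals).count();
        return (double) count / doc.size();
    }

    /** IDF: log of (total documents / documents containing the term). */
    static double idf(int totalDocs, int docsWithTerm) {
        return Math.log((double) totalDocs / docsWithTerm);
    }

    public static void main(String[] args) {
        List<String> doc = List.of("the", "cat", "sat", "on", "the", "mat");
        double tf = tf(doc, "the");                  // 2 / 6 ≈ 0.33
        double tfidf = tf * idf(10_000_000, 1_000);  // rare terms get higher weight
        System.out.println("TF = " + tf + ", TF-IDF = " + tfidf);
    }
}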
Cloud security
Cloud security refers to the technologies, policies, controls, and services that protect cloud data, applications, and infrastructure from threats.
(Figure omitted: it showed the three parties involved, namely the cloud user, the cloud service provider, and the computing resources.)
The way to approach cloud security is different for every organization and can be
dependent on several variables. However, the National Institute of Standards and
Technology (NIST) has made a list of best practices that can be followed to establish a
secure and sustainable cloud computing framework.
The NIST has created necessary steps for every organization to self-assess their
security preparedness and apply adequate preventative and recovery security
measures to their systems. These principles are built on the NIST's five pillars of a
cybersecurity framework: Identify, Protect, Detect, Respond, and Recover.
Another emerging technology in cloud security that supports the execution of NIST's
cybersecurity framework is cloud security posture management (CSPM). CSPM
solutions are designed to address a common flaw in many cloud environments -
misconfigurations.
Cloud infrastructures that remain misconfigured by enterprises or even cloud providers
can lead to several vulnerabilities that significantly increase an organization's attack
surface. CSPM addresses these issues by helping to organize and deploy the core
components of cloud security. These include identity and access management (IAM),
regulatory compliance management, traffic monitoring, threat response, risk mitigation,
and digital asset management.
https://www.slideshare.net/mkotari/scheduling-in-cloud
https://www.javatpoint.com/what-is-hadoop
What is Hadoop
Hadoop is an open-source framework from Apache used to store, process, and analyze data of very large volume. Hadoop is written in Java and is not OLAP (online analytical processing); it is used for batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn, and many more. Moreover, it can be scaled up just by adding nodes to the cluster.
Modules of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS was developed on the basis of it. It states that files will be broken into blocks and stored on nodes over the distributed architecture.
2. Yarn: Yet Another Resource Negotiator is used for job scheduling and managing the cluster.
3. Map Reduce: This is a framework that helps Java programs perform parallel computation on data using key-value pairs. The Map task takes input data and converts it into a data set that can be computed as key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the reducer gives the desired result.
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by
other Hadoop modules.
Hadoop Architecture
The Hadoop architecture is a package of the file system, MapReduce engine and the
HDFS (Hadoop Distributed File System). The MapReduce engine can be
MapReduce/MR1 or YARN/MR2.
A Hadoop cluster consists of a single master and multiple slave nodes. The master node includes the Job Tracker and NameNode, whereas the slave nodes include the Task Tracker and DataNode.
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop. It has a master/slave architecture, consisting of a single NameNode performing the role of master and multiple DataNodes performing the role of slaves.
Both NameNode and DataNode are capable enough to run on commodity machines.
The Java language is used to develop HDFS. So any machine that supports Java
language can easily run the NameNode and DataNode software.
NameNode
● It is a single master server existing in the HDFS cluster.
● As it is a single node, it may become a single point of failure.
● It manages the file system namespace by executing operations such as opening, renaming, and closing files.
● It simplifies the architecture of the system.
DataNode
● The HDFS cluster contains multiple DataNodes.
● Each DataNode contains multiple data blocks.
● These data blocks are used to store data.
● It is the responsibility of the DataNode to serve read and write requests from the file system's clients.
● It performs block creation, deletion, and replication upon instruction from the
NameNode.
Job Tracker
● The role of the Job Tracker is to accept MapReduce jobs from clients and process the data by using the NameNode.
● In response, NameNode provides metadata to Job Tracker.
Task Tracker
● It works as a slave node for Job Tracker.
● It receives tasks and code from the Job Tracker and applies that code to the file. This process can also be called a Mapper.
MapReduce Layer
MapReduce comes into play when the client application submits a MapReduce job to the Job Tracker. In response, the Job Tracker sends the request to the appropriate Task Trackers. Sometimes a TaskTracker fails or times out; in such a case, that part of the job is rescheduled.
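For completeness, a sketch of submitting such a job with the newer org.apache.hadoop.mapreduce API, which replaced the JobTracker/TaskTracker interfaces described above (it reuses the WordCount classes sketched earlier; input and output paths come from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class); // optional local pre-aggregation
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);       // block until the job finishes
    }
}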
Advantages of Hadoop
● Fast: In HDFS, the data is distributed over the cluster and mapped, which helps in faster retrieval. Even the tools to process the data are often on the same servers, thus reducing processing time. Hadoop is able to process terabytes of data in minutes and petabytes in hours.
● Scalable: Hadoop cluster can be extended by just adding nodes in the cluster.
● Cost Effective: Hadoop is open source and uses commodity hardware to store
data so it is really cost effective as compared to traditional relational database
management systems.
● Resilient to failure: HDFS can replicate data over the network, so if one node is down or some other network failure happens, Hadoop takes another copy of the data and uses it. Normally, data is replicated three times, but the replication factor is configurable.
What is HDFS
Hadoop comes with a distributed file system called HDFS. In HDFS, data is distributed over several machines and replicated to ensure durability in the face of failure and high availability to parallel applications.
It is cost effective as it uses commodity hardware. It involves the concepts of blocks, data nodes, and the name node.
HDFS Concepts
1. Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS blocks are 128 MB by default, and this is configurable. Files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike an ordinary file system, a file in HDFS that is smaller than the block size does not occupy the full block's size; e.g., a 5 MB file stored in HDFS with a block size of 128 MB takes only 5 MB of space. The HDFS block size is large simply to minimize the cost of seeks.
2. Name Node: HDFS works in a master-worker pattern where the name node acts as master. The name node is the controller and manager of HDFS, as it knows the status and the metadata of all the files in HDFS; the metadata comprises file permissions, names, and the location of each block. The metadata is small, so it is stored in the memory of the name node, allowing faster access to data. Moreover, as the HDFS cluster is accessed by multiple clients concurrently, all this information is handled by a single machine. File system operations like opening, closing, and renaming are executed by it.
3. Data Node: Data nodes store and retrieve blocks when they are told to, by the client or the name node. They report back to the name node periodically with a list of the blocks they are storing. The data node, being commodity hardware, also does the work of block creation, deletion, and replication as stated by the name node.
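A short sketch of a client talking to the name node and data nodes through Hadoop's FileSystem API (the hostname, port, and paths below are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000"); // hypothetical name node address
        FileSystem fs = FileSystem.get(conf);
        // The client asks the name node where blocks live; data flows to/from the data nodes.
        fs.copyFromLocalFile(new Path("/tmp/local.txt"), new Path("/data/remote.txt"));
        System.out.println("Default block size: " + fs.getDefaultBlockSize(new Path("/data")) + " bytes");
        fs.close();
    }
}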
Features of HDFS
● Highly Scalable - HDFS is highly scalable, as it can scale to hundreds of nodes in a single cluster.
● Replication - Due to unfavorable conditions, the node containing the data may be lost. To overcome such problems, HDFS always maintains a copy of the data on a different machine.
● Fault tolerance - In HDFS, fault tolerance signifies the robustness of the system in the event of failure. HDFS is so highly fault-tolerant that if any machine fails, another machine containing a copy of that data automatically becomes active.
● Distributed data storage - This is one of the most important features of HDFS, which makes Hadoop very powerful. Here, data is divided into multiple blocks and stored across nodes.
● Portable - HDFS is designed in such a way that it can be easily ported from one platform to another.
What is YARN
Yet Another Resource Negotiator takes programming beyond Java and makes it interactive, letting other applications such as HBase and Spark work on it. Different YARN applications can co-exist on the same cluster, so MapReduce, HBase, and Spark can all run at the same time, bringing great benefits for manageability and cluster utilization.
Components Of YARN
● Client: For submitting MapReduce jobs.
● Resource Manager: To manage the use of resources across the cluster.
● Node Manager: For launching and monitoring the compute containers on machines in the cluster.
● Map Reduce Application Master: Coordinates the tasks running the MapReduce job. The application master and the MapReduce tasks run in containers that are scheduled by the resource manager and managed by the node managers.
JobTracker and TaskTracker were used in previous versions of Hadoop and were responsible for handling resources and checking progress. Hadoop 2.0, however, has a ResourceManager and NodeManager to overcome the shortfalls of JobTracker and TaskTracker.
Usage of MapReduce
● It can be used in various applications like document clustering, distributed sorting, and web link-graph reversal.
● It can be used for distributed pattern-based searching.
● We can also use MapReduce in machine learning.
● It was used by Google to regenerate Google's index of the World Wide Web.
● It can be used in multiple computing environments such as multi-cluster,
multi-core, and mobile environments.
VirtualBox
VirtualBox is open-source software for virtualizing the x86 computing
architecture. It acts as a hypervisor, creating a VM (virtual machine) where
the user can run another OS (operating system).
The operating system where VirtualBox runs is called the "host" OS. The operating system running in the VM is called the "guest" OS. VirtualBox supports Windows, Linux, or macOS as its host OS.
When configuring a virtual machine, the user can specify how many CPU cores, and how much RAM and disk space, should be devoted to the VM.
When the VM is running, it can be "paused": system execution is frozen at that moment in time, and the user can resume using it later (see the example below).
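A hedged example of that configure/pause/resume workflow using VirtualBox's VBoxManage command-line tool (the VM name is hypothetical; the flags are standard VBoxManage options):

VBoxManage createvm --name "demo-vm" --ostype Ubuntu_64 --register   # create and register a VM
VBoxManage modifyvm "demo-vm" --cpus 2 --memory 4096                 # devote 2 CPU cores and 4 GB RAM
VBoxManage startvm "demo-vm" --type headless                         # boot the guest OS without a GUI
VBoxManage controlvm "demo-vm" pause                                 # freeze execution at this moment
VBoxManage controlvm "demo-vm" resume                                # continue where it left off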
VirtualBox was originally developed by Innotek GmbH and released on January 17, 2007 as an open-source software package. The company was later purchased by Sun Microsystems. On January 27, 2010, Oracle Corporation purchased Sun and took over development of VirtualBox.
Supported guest operating systems
Guest operating systems supported by VirtualBox include:
● Windows 10, 8, 7, XP, Vista, 2000, NT, and 98.
● Linux distributions based on Linux kernel 2.4 and
newer, including Ubuntu, Debian, OpenSUSE,
Mandriva/Mandrake, Fedora, RHEL, and Arch Linux.
● Solaris and OpenSolaris.
● Mac OS X Server Leopard and Snow Leopard.
● OpenBSD and FreeBSD.
● MS-DOS.
● OS/2.
● QNX.
● BeOS R5.
● Haiku.
● ReactOS.
NIMBUS
Nimbus is an open-source toolkit to convert a computer cluster into an
Infrastructure-as-a-Service cloud to provide compute cycles for scientific
communities. It allows a client to lease remote resources by deploying
virtual machines (VMs) on those resources and configuring them to
represent an environment desired by the user.
Nimbus comprises two products:
● Nimbus Infrastructure is an open source EC2/S3-compatible
Infrastructure-as-a-Service implementation specifically targeting
features of interest to the scientific community such as support for
proxy credentials, batch schedulers, best-effort allocations and
others.
● Nimbus Platform is an integrated set of tools, operating in a
multi-cloud environment, that deliver the power and versatility of
infrastructure clouds to scientific users. Nimbus Platform allows
you to reliably deploy, scale, and manage cloud resources.