Mod 4

Fig: Cloud computing overview – client devices (laptops, desktops, mobiles, PDAs) access SaaS, PaaS, and IaaS cloud services through an Internet provider.
FEATURES OF CLOUD COMPUTING
1. Scalability –
• Scalability is the addition of new resources to an existing infrastructure.
• Data storage capacity, processing power and networking can all be scaled using the existing cloud computing infrastructure. Better yet, scaling can be done quickly and easily.
• A system’s scalability refers to its ability to handle an increased workload with the available hardware resources.
2. Elasticity –
• Elasticity refers to a system’s ability to grow or shrink dynamically in response to changing workload demands.
• No extra payment is required for acquiring specific cloud services.
• A cloud does not require customers to declare their resource requirements in advance.
• When demand unexpectedly surges, properly configured cloud applications and services instantly and automatically add resources to handle the load. When the demand abates, services return to their original resource levels.
3. Resource Pooling
Resource pooling means that a cloud service provider can share resources among several clients, providing each of them with a different set of services as per their requirements.
Multiple organizations that use similar kinds of resources to carry out computing practices have no need to individually acquire all the resources.
Resources provided by the provider are shared by multiple unrelated customers.
Pooling resources at the software level means that a consumer is not the only one using the software.
4. Self Service – Cloud computing involves a simple user interface that
helps customers to directly access the cloud services they want.
It makes getting the resources you need very quick and easy.
In on-demand self service, the user accesses cloud services through an
online control panel.
5. Low Cost
Cloud offers customized solutions, especially to organizations that cannot afford a large initial investment. Cloud provides a pay-as-you-use option, in which organizations need to sign up only for those resources that are essential.
6. Fault Tolerance & Data Security – offering uninterrupted services to customers.
• Data security is one of the best characteristics of cloud computing. Cloud services create a copy of the stored data to prevent any form of data loss. If one server loses the data by any chance, the copied version is restored from another server.
7. Resilience
Resilience in cloud computing means the ability of the service to quickly recover from any disruption. A cloud’s resilience is measured by how fast its servers, databases, and network systems restart and recover from any kind of harm or damage.
Amazon CloudWatch, for example, provides a monitoring system that can also track estimated Amazon Web Services charges.
CLOUD DEPLOYMENT MODELS
▪ Public Cloud
▪ Private Cloud
▪ Community Cloud
▪ Hybrid Cloud
Public Cloud (End-User Level Cloud)
• As the name suggests, this type of cloud deployment model supports all
users who want to make use of a computing resource, such as hardware (OS,
CPU, memory, storage) or software (application server, database) on a
subscription basis. Most common uses of public clouds are for application
development and testing, non-mission-critical tasks such as file-sharing, and
e-mail service.
- E.g.: Verizon, Amazon Web Services, and Rackspace.
Fig: Level of accessibility in a public cloud – companies X, Y, and Z all access the shared public cloud services (IaaS/PaaS/SaaS).
Private Cloud (Enterprise Level Cloud)
A private cloud is typically infrastructure used by a single organization. Such
infrastructure may be managed by the organization itself to support various
user groups, or it could be managed by a service provider that takes care of it
either on-site or off-site.
Private clouds are more expensive than public clouds due to the capital
expenditure involved in acquiring and maintaining them. However, private
clouds are better able to address the security and privacy concerns of
organizations today.
- Remains entirely in the ownership of the organization using it.
Fig: Level of accessibility in a private cloud – the cloud services remain within the owning organization.

PHASES OF A MAPREDUCE JOB
(Input File → Input Split → Map → Combine → Shuffle & Sort → Reduce → Output)
1. Input File
• The data for a MapReduce task is stored in input files, which typically live in HDFS.
• The files contain both structured and unstructured data.
2. Input Split
• The Hadoop framework divides the huge input file into smaller chunks/blocks; these chunks are referred to as input splits.
• For each input split, Hadoop creates one map task to process the records in that split. That is how parallelism is achieved in the Hadoop framework.
3. Map
• The Mapper class contains the map logic.
• This logic is applied to the ‘n’ input blocks/splits spread across various data nodes.
• The mapper receives its input records as key-value pairs (k, v), where the key is the offset address of each record and the value is the entire record content; it then emits its own intermediate key-value pairs as output.
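As an illustration, here is a minimal word-count style Mapper sketch using the standard org.apache.hadoop.mapreduce API, assuming the default TextInputFormat (byte offset as key, one line of text as value); the class name TokenizerMapper and the tokenizing logic are illustrative, not part of the notes above:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key = byte offset of the record, input value = the record (one line of text).
// Output key = a word, output value = the count 1 for that occurrence.
public class TokenizerMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        // Emit one (word, 1) pair for every token in the input record.
        for (String token : record.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}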
4. Combine
• The combiner is a mini-reducer (semi-reducer) that performs local aggregation on the mapper's output.
• It is an optional phase.
• The job of the combiner is to optimize the output of the mapper before it is fed to the reducer, in order to reduce the size of the data that is moved to the reducer (the combiner is enabled in the driver sketch shown after step 7).
• In this phase, the various outputs of the mappers are locally reduced at the node level.
5. Shuffle & Sort
• The key-value pair output of the various mappers, (k, v), goes into the Shuffle and Sort phase.
• The intermediate pairs are sorted by key, and all values belonging to the same key are grouped together.
• The output of the Shuffle and Sort phase is again key-value pairs, now as a key and an array of values (k, v[]).
6. Reduce
• The output of the Shuffle and Sort phase, (k, v[]), becomes the input of the Reducer phase.
• In this phase, the reducer function’s logic is executed and all the values are aggregated against their corresponding keys.
• The reducer consolidates the outputs of the various mappers and computes the final job output.
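For concreteness, a matching Reducer sketch that consumes the grouped (k, v[]) pairs; the class name IntSumReducer and the summation logic are again illustrative:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input: a key and the iterable of all values grouped under that key (k, v[]).
// Output: one aggregated (key, total) pair per key.
public class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();           // aggregate all values for this key
        }
        total.set(sum);
        context.write(key, total);    // final consolidated output for this key
    }
}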
7. Output
The final output is then written into files (one per reducer) in an output directory of HDFS.
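A minimal driver sketch, assuming the TokenizerMapper and IntSumReducer classes from the previous sketches, showing how the mapper, optional combiner, reducer, and HDFS input/output paths are wired together; argument validation and error handling are omitted:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenizerMapper.class);
        // Optional combiner: local aggregation of mapper output before the shuffle.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input file(s) in HDFS, and the HDFS directory where the part-r-* output
        // files are written; the output directory must not already exist.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}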
Features of MapReduce
• Scheduling (of tasks among nodes based on availability)
Mapping divides big tasks into subtasks, assigns them to individual nodes in the cluster, and executes them in parallel; so the MapReduce model requires scheduling.
• Synchronization (among running subtasks)
Accomplished by a barrier between the map and reduce phases of processing. Synchronization refers to the mechanisms that allow multiple concurrently running processes to "join up", for example to share intermediate results or exchange state information.
• Co-location of Code/Data
In order to achieve data locality, the scheduler starts tasks on the node that holds a particular block of data needed by the task.
• Handling Errors/Faults
There is a high chance of failure among running nodes. The MapReduce engine has the capability to recognize and rectify faults effectively. It also identifies incomplete tasks and reassigns them to other available nodes.
Benefits of MapReduce
1. Fault-tolerance
• During the middle of a MapReduce job, if a machine carrying a few data blocks fails, the architecture handles the failure.
• It considers the replicated copies of the blocks on alternate machines for further processing.
2. Resilience
• Each node periodically updates its status to the master node.
• If a slave node doesn’t send its notification, the master node
reassigns the currently running task of that slave node to other
available nodes in the cluster.
3. Quick
• Data processing is quick as MapReduce uses HDFS as the storage
system.
4. Parallel Processing
• MapReduce tasks process multiple chunks of the same dataset in parallel by dividing the tasks.
• This gives the advantage of task completion in less time.
5. Availability
• Multiple replicas of the same data are sent to numerous nodes in the
network.
• Thus, in case of any failure, other copies are readily available for
processing without any loss.
6. Scalability
• MapReduce lets you run applications on a huge number of nodes, using terabytes and petabytes of data, by accommodating new nodes if needed.
HBase
• HBase is a column-oriented NoSQL database management system that runs on top of HDFS (Hadoop Distributed File System) and is efficient for structured data storage and processing.
• It is modeled after Google's Bigtable and is primarily written in Java.
• Apache HBase is needed for real-time Big Data applications.
• HBase is used extensively for random read and write operations.
• HBase stores a large amount of data in the form of tables.
• HBase stores data as key/value pairs in a columnar model. In this model, all the columns are grouped together as column families.
• HBase on top of Hadoop increases the throughput and performance of a distributed cluster setup. In turn, it provides faster random read and write operations.
• One can store data in HDFS either directly or through HBase.
• Good for structured as well as semi-structured data.
HBase Table: Columns & Rows
• HBase is a column-oriented, non-relational database. This means that data is
stored in individual columns, and indexed by a unique row key.
• Table is a collection of rows.
• Row is a collection of column families.
• Column family is a collection of columns.
• Column is a collection of key value pairs.
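As a hedged illustration of this model, the following Java client sketch writes and reads one row using the standard HBase client API; the table name "employee", column family "personal", and column qualifiers "name"/"city" are assumptions, and a table with that column family must already exist in the cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("employee"))) {

            // Write: row key "emp1", column family "personal", columns "name" and "city".
            Put put = new Put(Bytes.toBytes("emp1"));
            put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("city"), Bytes.toBytes("Pune"));
            table.put(put);

            // Random read: fetch the row back by its row key.
            Get get = new Get(Bytes.toBytes("emp1"));
            Result result = table.get(get);
            byte[] name = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}

Note how every stored value is addressed by (row key, column family, column qualifier), which mirrors the table → row → column family → column hierarchy described above.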