
Unit II

Virtualization - Virtualization (CPU, Memory, I/O) - Amazon EC2 - Software
Defined Networks (SDN) - Software Defined Storage (SDS) - Cloud Storage -
Introduction to Storage Systems - Cloud Storage Concepts - Distributed File
Systems (HDFS, Ceph FS) - Cloud Databases (HBase, MongoDB, Cassandra,
DynamoDB) - Cloud Object Storage (Amazon S3, OpenStack Swift, Ceph).
Introduction
• Virtualization is a foundational element of cloud computing.
• Virtualization is a technique that allows a single physical instance of an application or resource to be shared among multiple organizations or tenants (customers).
• It does so by assigning a logical name to a physical resource and providing a pointer to that physical resource on demand.
• It enables multiple users to share a single physical instance of a resource at a time.
Virtualization
• Virtualization is a technology that creates virtual representations of
physical resources such as servers, storage, networks, and other
hardware.
• It allows multiple operating systems and applications to run on the
same physical machine simultaneously, enhancing hardware utilization
and flexibility.
• In other words, virtualization is a technique that allows a single physical instance of a resource or an application to be shared among multiple customers and organizations.
• It does so by assigning a logical name to a physical resource and providing a pointer to that physical resource when demanded.
• For instance, you can segment server capacity into small parts to distribute the server's capabilities among several environments and users.
• The global virtualization market is currently estimated at around 40 to 60 billion US dollars and is expected to grow to nearly 120 billion US dollars within the next few years.
• A virtual machine provides an environment that is logically separated from the underlying hardware.
• The machine on which the virtual machine is created is known as the Host Machine, and that virtual machine is referred to as a Guest Machine.
• A "host machine" is the physical computer that provides computing resources to run virtual machines, while a "guest machine" is a virtual machine running on that host: essentially an isolated operating system environment that uses the host's hardware resources.
• Multiple guest machines can run simultaneously on a single host machine, with a hypervisor managing resource allocation.
Hypervisor?
• Virtual machines are managed by software or firmware known as a hypervisor.
• The hypervisor is a firmware or low-level program that acts as a Virtual Machine Manager.
There are two types of hypervisor:
Type 1 hypervisor
Type 2 hypervisor
Type 1 hypervisors
• A Type 1 hypervisor runs directly on the host machine's physical
hardware.
• It's referred to as a bare-metal hypervisor.
• The Type 1 hypervisor doesn't have to load an underlying OS.
• With direct access to the underlying hardware and no other software (such as a host OS and device drivers) to contend with for virtualization, Type 1 hypervisors are regarded as the most efficient and best-performing hypervisors available for enterprise computing.
• In fact, Type 1 hypervisors are often referred to as virtualization operating systems.
Type 2 hypervisors
• A Type 2 hypervisor is typically installed on top of an existing host OS.
• It is sometimes called a hosted hypervisor because it relies on the host
machine's preexisting OS to manage calls to CPU, memory, storage and
network resources.
Difference Between Type 1 and Type 2 Hypervisors
• Also known as: Type 1 - bare-metal hypervisor; Type 2 - hosted hypervisor.
• Runs on: Type 1 - the underlying physical host machine hardware; Type 2 - an underlying operating system (host OS).
• Best suited for: Type 1 - large, resource-intensive, or fixed-use workloads; Type 2 - desktop and development environments.
• Examples: Type 1 - VMware ESXi, Microsoft Hyper-V, KVM; Type 2 - Oracle VM VirtualBox, VMware Workstation, Microsoft Virtual PC.
• Almost any type of IT infrastructure can be virtualized, including
servers, desktops, networks, storage, data, applications, data centers,
CPUs, and GPUs.
 CPU Virtualization
 Memory Virtualization
 I/O Virtualization
1. CPU Virtualization (Processor Virtualization) in Cloud Computing
• CPU virtualization allows multiple operating systems to run on a single physical
machine by sharing the CPU resources.
• CPU virtualization is an innovative technology that enables the creation of virtual
central processing units (CPUs) or processors that exist independently from physical
hardware
• This is achieved using a hypervisor, which is a software layer that manages the
virtual machines (VMs) and allocates CPU time to each VM.
• CPU virtualization is one of the cloud computing technologies that lets a single CPU act as multiple machines working together.
• A single CPU can run numerous operating systems (OS) via CPU virtualization in cloud computing.
• This is possible by creating virtual machines (VMs) that share the physical resources of the CPU. Each virtual machine cannot see or interact with the other VMs' data or processes.
Example:
• Consider an example to understand CPU virtualization. Imagine we have a physical
server with a single CPU.
• We want to run two different operating systems on this server: Windows and Linux.
• This can easily be done by creating two Virtual Machines (VMs), one for Windows and one for Linux.
• The virtualization software will create a virtual CPU for each VM.
• The virtual CPUs will execute on the physical CPU but separately. This means the
Windows Virtual Machine cannot view or communicate with the Linux VM, and
vice versa.
• The virtualization software will also allocate memory and other resources to each
VM. This guarantees each VM has enough resources to execute.
Working of CPU Virtualization:
Step 1: Creating Virtual Machines (VMs)
• Let's say you have a powerful computer with a CPU, memory, and other resources.
• To start CPU virtualization, you use special software called a hypervisor. This
is like the conductor of a virtual orchestra.
• The hypervisor creates virtual machines (VMs) – these are like separate,
isolated worlds within your computer.
• The “virtual” resources of each VM include CPU, memory, and storage. It’s
like having multiple mini-computers inside your main computer.
Step 2: Allocating Resources
• The hypervisor carefully divides the real CPU’s processing power among the
VMs. It’s like giving each VM its own slice of the CPU pie.
• It also makes sure that each virtual machine (VM) gets its share of memory, storage, and other resources.
Step 3: Isolation and Independence
• Each VM operates in its own isolated environment. It can’t see or interfere
with what’s happening in other VMs.
Step 4: Running Operating Systems and Apps
• Within each Virtual Machine, you can install & run different operating
systems (like Windows, Linux) and applications.
• The VM thinks it’s a real computer, even though it’s sharing the actual
computer’s resources with other VMs.
Step 5: Managing Workloads
• The hypervisor acts as a smart manager, deciding when each VM gets to use
the real CPU.
• It ensures that no VM takes up all the CPU time, making sure everyone gets
their turn to work.
Step 6: Efficient Use of Resources
• Even though there’s only one physical CPU, each VM believes it has its own
dedicated CPU.
• The hypervisor cleverly switches between VMs so that all the tasks appear to
be happening simultaneously.
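The same flow can be seen programmatically. Below is a minimal, hedged sketch using the libvirt Python bindings (a widely used open-source hypervisor management API, used here only as an illustration); the connection URI, the guest name "guest-vm", and the vCPU count are assumptions for the example, not values from this unit.

# Sketch: inspecting and adjusting a guest's virtual CPUs with libvirt.
# Assumption: a KVM/QEMU host with an existing running guest named "guest-vm".
import libvirt

conn = libvirt.open("qemu:///system")        # connect to the hypervisor on this host
dom = conn.lookupByName("guest-vm")          # hypothetical guest machine

# dom.info() returns: state, max memory (KiB), current memory (KiB), vCPU count, CPU time (ns)
state, max_mem_kib, mem_kib, n_vcpus, cpu_time_ns = dom.info()
print(f"Guest currently sees {n_vcpus} virtual CPU(s)")

# Ask the hypervisor to expose 2 vCPUs to the guest; the hypervisor still decides
# how these virtual CPUs are scheduled onto the physical CPU cores.
dom.setVcpus(2)

conn.close()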
2. Memory Virtualization
• Memory virtualization abstracts and manages physical memory (RAM) to provide
virtual memory to VMs.
• It creates a layer of abstraction between the physical memory and the software
running on the VMs.
• Memory virtualization in cloud computing is a technique that allows multiple
virtual machines (VMs) to share the physical memory of a host machine
efficiently.
• This process abstracts the physical memory into a virtual memory pool, which can
be dynamically allocated to different VMs based on their needs.
• This approach enhances resource utilization, flexibility, and scalability in cloud
environments.
• This virtualization enables the dynamic allocation of cloud memory to cloud user
instances.
• Memory virtualization is vital for handling a large number of users and workloads.
• Memory virtualization and storage virtualization are not the same in cloud
computing.
• But both are essential components of virtualization technology.
• Here's a brief overview of each:
• Memory virtualization abstracts the physical memory resources of a system
to create a virtual memory pool that can be dynamically allocated to virtual
machines (VMs) as needed.
• Storage virtualization abstracts physical storage resources to create a virtual
storage pool that can be managed and allocated to VMs or applications as
needed.
Working:
• The hypervisor allocates memory to each VM, ensuring that they have enough
resources to run their applications.
• Memory virtualization techniques, such as paging and swapping, are used to manage
memory efficiently.
• Techniques like memory ballooning and overcommitment help optimize memory
usage across multiple VMs.
• memory ballooning" is a technique where a virtual machine (VM) can dynamically
give back unused memory to the host system, allowing the host to allocate that
memory to other VMs that need it, essentially optimizing memory usage across
multiple virtual machines on the same host by reclaiming unused memory from less
demanding VMs.
• overcommitment" refers to the practice of a cloud service provider allocating more
virtualized computing resources (like CPU cores or memory) to users than the
physical hardware actually has available, essentially "selling" more capacity than they
physically possess
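As a hedged illustration of ballooning, the libvirt API (assumed here as one possible management interface, not mandated by this unit) exposes the guest's balloon target directly; the guest name and sizes below are made up for the example.

# Sketch: shrinking a guest's balloon target so the host can reclaim memory.
# Assumption: a KVM/QEMU host with a running guest named "guest-vm".
import libvirt

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("guest-vm")

state, max_mem_kib, current_kib, n_vcpus, _ = dom.info()
print(f"Max memory: {max_mem_kib} KiB, current balloon target: {current_kib} KiB")

# Lower the balloon target to 1 GiB; the balloon driver inside the guest hands the
# difference back to the host, which can then allocate it to other VMs.
dom.setMemory(1024 * 1024)    # argument is in KiB

conn.close()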
3. I/O Device Virtualization
• I/O device virtualization allows VMs to share physical I/O devices, such as
network interfaces, storage devices, and graphics cards.
• This is achieved by creating virtual versions of these devices, which are then
allocated to the VMs.
Working:
• The hypervisor provides virtual I/O devices to each VM, allowing them to
interact with the physical devices as if they were directly connected.
• Techniques like full device emulation, paravirtualization, and direct I/O are
used to manage I/O virtualization.
• Virtualization frameworks, such as VirtIO, provide standardized interfaces
for virtual I/O devices.
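For instance, on a KVM/QEMU host a paravirtualized (VirtIO) disk can be attached to a running guest through the hypervisor's management API. This is only a sketch; the guest name, image path, and device name are assumptions.

# Sketch: attaching a VirtIO (paravirtualized) disk to a running guest via libvirt.
import libvirt

disk_xml = """
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source file='/var/lib/libvirt/images/data.qcow2'/>
  <target dev='vdb' bus='virtio'/>
</disk>
"""

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("guest-vm")   # hypothetical guest machine
dom.attachDevice(disk_xml)            # the guest now sees a new virtio block device (e.g. /dev/vdb)
conn.close()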
Example:
• Imagine a data center running multiple VMs that require GPU
resources for tasks like video rendering or AI model training.
• Instead of each VM needing its own physical GPU, GPU virtualization
allows all VMs to share a single physical GPU, optimizing resource
usage and reducing costs.
• NVIDIA GeForce NOW is a cloud gaming service that allows users to
stream and play high-end video games on various devices, including
low-end PCs, Macs, and mobile devices.
Advantages of Virtualization in Cloud Computing
• Scalability: Easily scale resources up or down based on demand.
• Cost Efficiency: Optimize resource usage and reduce hardware costs.
• Flexibility: Run multiple operating systems and applications on the same
physical hardware.
• Isolation: Ensure that VMs are isolated from each other, improving security
and stability.
Drawbacks of Virtualization in Cloud Computing
• Performance Overhead: Virtualization can introduce some performance
overhead due to the additional layer of abstraction.
• Complexity: Managing virtualized environments can be complex and requires
expertise.
• Security Risks: Ensuring the security of virtualized environments is crucial, as
vulnerabilities in the hypervisor can affect all VMs
Amazon EC2
• Amazon EC2 (Elastic Compute Cloud) is essentially a virtual machine.
• "EC2 instance" refers to a single virtual server running on Amazon's cloud
infrastructure.
• It allows users to rent virtualized computing resources such as CPUs and memory to run applications without managing physical hardware, which makes it a type of virtual machine service.
• Amazon EC2 is a fundamental building block of AWS, enabling businesses
and developers to leverage the power of cloud computing for a wide range
of applications and services.
• Amazon EC2 provides scalable computing capacity in the AWS cloud. Leveraging it enables organizations to develop and deploy applications faster, without needing to invest in hardware upfront. Users can launch virtual servers and configure security, networking, and storage.
• Through Amazon EC2, teams can access reliable, scalable
infrastructure and secure computing capabilities in the AWS cloud.
• Amazon EC2 reduces the time required to obtain and boot new server instances to minutes, rather than the days or weeks once needed to procure and set up physical servers.
• You can scale the compute capacity up and down as per the computing
requirement changes.
• Amazon EC2 provides developers with the tools to build failure-resilient applications and isolate them from common failure scenarios.
• The Amazon EC2 platform offers more than 750 instance types (virtual server configurations) to support different types of enterprise workloads.
• Users can also choose from the latest processor, storage, networking,
operating system and purchase model to meet their requirements.
• These components are packaged into pre-configured templates for every
Amazon EC2 instance.
• Amazon EC2 is also optimized for machine learning inference,
deep learning and generative artificial intelligence applications.
• Users can also use Amazon EC2 to build on-demand macOS workloads
for Apple devices and platforms
• Important features of Amazon EC2
• Multiple storage options. Users can choose between multiple storage
options, including block-level storage, instance storage and
object storage for their specific requirements.
• Enhanced Networking. This provides high packet-per-second
performance, low network jitter and low latency compared to traditional
implementations.
• Access to Intel's features. Amazon EC2 instances that feature an Intel processor let users access many processor features, including Intel Advanced Encryption Standard New Instructions, Intel Advanced Vector Extensions, Intel Deep Learning Boost and Intel Turbo Boost Technology.
• Support for cluster networking. Some Amazon EC2 instances support
cluster networking when launched into a common cluster placement
group for low-latency networking between the instances in the cluster.
• Auto Scaling. Users can automatically scale Amazon EC2 capacity up
or down by adding or removing instances according to their own
conditions, using EC2's dynamic and predictive scaling policies.
• Optimized CPU configurations. With this feature, users can better
control their Amazon EC2 instances by specifying a custom number of
virtual CPUs when launching new instances. They can also disable Intel
Hyper-Threading Technology for workloads that perform well enough
with single-threaded CPUs.
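To make the idea of launching and configuring virtual servers concrete, here is a minimal sketch using the AWS SDK for Python (boto3). The AMI ID, key pair name, and security group ID are placeholders, not real values from this unit.

# Sketch: launching a single EC2 instance with boto3 (placeholder IDs throughout).
import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")

instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",           # placeholder AMI (machine image) ID
    InstanceType="t3.micro",                   # small general-purpose instance type
    MinCount=1,
    MaxCount=1,
    KeyName="my-key-pair",                     # assumed existing key pair
    SecurityGroupIds=["sg-0123456789abcdef0"]  # assumed existing security group
)

instance = instances[0]
instance.wait_until_running()                  # EC2 boots the virtual server in minutes
instance.reload()
print(instance.id, instance.public_ip_address)

A real deployment would also pick storage volumes, subnets, and Auto Scaling policies as described in the features above.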
Use Cases:
• Web Hosting: Host websites and web applications with scalable
computing power.
• Big Data Processing: Run large-scale data processing tasks, such as
data analysis and machine learning.
• Gaming: Host online gaming servers that can handle high traffic and
low latency.
• Application Development: Develop, test, and deploy applications in a
flexible and scalable environment.
• Disaster Recovery: Use EC2 as part of a disaster recovery plan to
ensure business continuity in case of infrastructure failures.
Software Defined Networks (SDN)
• Software-defined networking (SDN) is a category of technologies that make it possible
to manage a network via software.
• In traditional networks, the hardware (like routers and switches) decides how data
moves through the network, but SDN changes this by moving the decision-making to a
central software system
• This setup allows network administrators to manage and control the entire network
using a single, unified interface. Instead of configuring each device individually, they
can program and adjust the network from one central place. This makes managing the
network much easier and more efficient.
• SDN technology enables IT administrators to configure their networks using a software
application.
• SDN software is interoperable, meaning it should be able to work with any router or
switch, no matter which vendor made it.
• Software provides real-time monitoring and analytics of network
performance and health.
• Based on analytics, the SDN controller can make adaptive decisions to
optimize network performance.
• Software abstracts the physical network infrastructure to create virtual
networks, allowing for more efficient use of resources.
• Resources can be dynamically allocated based on current network
demands.
• Communication Protocol: OpenFlow is a standard protocol used for
communication between the SDN controller and the network devices
(switches, routers).
• Flow Rules: The controller uses OpenFlow to install flow rules on
network devices, dictating how packets should be handled.
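As an illustration of how a controller might push such flow rules over OpenFlow, below is a small sketch written against the Ryu controller framework (one example SDN controller, not prescribed by this unit); the match fields and priority are arbitrary example choices.

# Sketch: a Ryu controller app that installs one OpenFlow rule on every switch
# that connects, telling the data plane to drop TCP traffic to port 23 (telnet).
from ryu.base import app_manager
from ryu.controller import ofp_event
from ryu.controller.handler import CONFIG_DISPATCHER, set_ev_cls
from ryu.ofproto import ofproto_v1_3


class DropTelnet(app_manager.RyuApp):
    OFP_VERSIONS = [ofproto_v1_3.OFP_VERSION]

    @set_ev_cls(ofp_event.EventOFPSwitchFeatures, CONFIG_DISPATCHER)
    def on_switch_connect(self, ev):
        dp = ev.msg.datapath                      # the switch (data-plane device)
        parser = dp.ofproto_parser
        match = parser.OFPMatch(eth_type=0x0800,  # IPv4
                                ip_proto=6,       # TCP
                                tcp_dst=23)       # telnet
        # No instructions/actions means matching packets are dropped.
        flow_mod = parser.OFPFlowMod(datapath=dp, priority=10,
                                     match=match, instructions=[])
        dp.send_msg(flow_mod)                     # controller -> switch via OpenFlow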
Working of SDN

• SDN achieves the above by separating the control plane from the data plane.
• Data Plane includes: forwarding of packets, segmentation and reassembly of data, etc.
• i.e., all the activities needed to move a packet to its destination.
• Control Plane includes: making routing tables, setting packet-handling policies.
• i.e., decisions on traffic.
Components of Software Defining Networking (SDN)
The three main components that make the SDN are:
1. SDN Applications: SDN applications relay requests to the network through the SDN Controller using APIs.
2. SDN Controller: SDN Controller collects network information from hardware
and sends this information to applications. It acts as the centralized control unit
for the entire network. It is responsible for making routing decisions, managing
policies, and controlling the flow of data. Examples: OpenDaylight, ONOS,
Cisco APIC.
3. SDN Networking Devices: SDN Network devices help in forwarding and data
processing tasks.
Role of SDN in Cloud Computing
• In cloud computing, Software-Defined Networking (SDN) plays a crucial role, making it ideal for dynamic cloud environments where infrastructure needs to scale quickly and adapt to changing workloads.
• Some of the roles are:
 providing a centralized and programmable way to
manage network resources
 enabling flexible network configuration
 rapid provisioning of virtual networks
 efficient traffic management
 enhanced security.
Key aspects of SDN in cloud computing:
• Dynamic resource allocation: SDN allows cloud providers to
dynamically allocate network resources to virtual machines (VMs) based
on their needs, ensuring optimal performance and efficient utilization of
network bandwidth.
• Scalability: With a centralized controller, SDN can easily scale
networks up or down to accommodate fluctuating cloud usage, quickly
provisioning new virtual networks as required.
• Automation: SDN enables automated network configuration and policy
management, streamlining network operations and reducing manual
intervention.
• Multi-cloud management: SDN facilitates consistent network
management across different cloud platforms, allowing seamless
connectivity and policy enforcement in hybrid cloud environments.
• Security enhancements: By defining granular security policies and
implementing network segmentation, SDN helps improve cloud
network security by controlling access and isolating potential threats.
• Application-aware networking: SDN can prioritize specific
application traffic based on their requirements, optimizing network
performance for critical cloud applications.
Benefits of using SDN in cloud computing:
• Faster deployment: Streamlined network provisioning process allows
for quick deployment of new cloud services and applications.
• Cost optimization: Efficient resource allocation and automation can
lead to reduced network operational costs.
• Improved agility: Enables rapid response to changing cloud
requirements and application needs.
• Enhanced flexibility: Allows for customized network configurations
based on specific application demands.
Examples of SDN
1. OpenDaylight (ODL)
2. ONOS (Open Network Operating System)
3. Cisco Application Centric Infrastructure (ACI)
4. VMware NSX
5. Big Switch Networks (Now part of Arista Networks)
SDS - Software-defined storage
• Software-defined storage (SDS) is a data storage approach that uses
software to manage storage independently of the underlying hardware.
• In traditional systems, known as monolithic SAN (Storage Area Network) or NAS (Network-Attached Storage) systems, all the storage components are housed within a single, unified device.
• So there is no modularity and no ability to easily expand the storage by adding separate units.
• Software-defined storage systems, in contrast, let users upgrade the software separately from the hardware.
Working of SDS
• Software-defined storage (SDS) works by using a software layer to
abstract the physical storage hardware.
• This software layer acts as a "storage hypervisor" that manages data placement, provisioning, and other storage services, enabling users to access storage without needing to know the underlying hardware details.
• It allows for centralized management and control of diverse storage
resources across different types of devices.
• It creates a virtual pool of storage that can be dynamically allocated and accessed as needed.
Why is SDS important?
• Unlike monolithic SAN and NAS systems, software-defined storage
systems let users upgrade the software separately from the hardware.
• Common characteristics of SDS products include the following
capabilities:
 Aggregate cost-effective storage resources.
 Scale out the system across a server cluster.
 Manage the shared storage pool and storage services through a
single administrative interface.
 Set policies to control storage features and functionality.
SDS in cloud computing
• In cloud computing, Software-Defined Storage (SDS) plays a crucial role by allowing users to provision and manage storage resources across different cloud environments without being limited by specific physical hardware constraints, making it ideal for dynamic cloud workloads and hybrid cloud setups.
• Its role is to enable flexible, scalable, and cost-effective data storage management by decoupling storage from the underlying hardware.
Key points about SDS in cloud computing:
• Abstraction of storage: SDS separates the storage software from the
physical hardware, providing a unified view of storage across different
devices and allowing for easier management and allocation of storage
resources.
• Scalability: With SDS, users can easily scale storage capacity by
adding new nodes to the cluster without needing to replace entire
storage arrays, making it well-suited for rapidly growing data needs in
the cloud.
• Cost-efficiency: By allowing the use of commodity hardware, SDS
can help reduce storage costs compared to traditional proprietary
storage systems.
• Flexibility: SDS enables dynamic provisioning of storage, allowing
users to quickly allocate storage based on application requirements,
regardless of the underlying hardware.
• Data mobility: SDS facilitates seamless data migration between
different cloud environments, including public and private clouds, due
to its software-based nature.
• Automation: SDS often leverages APIs to automate storage
management tasks like data replication, snapshots, and tiering,
improving operational efficiency.
Use cases for SDS in cloud computing:
• Big data analytics: SDS can efficiently manage large datasets
distributed across multiple cloud nodes, enabling fast data processing
for analytics applications.
• Virtualized environments: SDS is well-suited for managing storage in
virtualized environments, allowing for flexible allocation of storage to
virtual machines.
• Disaster recovery: SDS can be used to create geographically dispersed
data replicas for disaster recovery purposes.
Example SDS Solutions:
1. DataCore SANsymphony
2. StarWind VSAN
3. Nutanix Unified Storage
4. Red Hat Ceph
Storage Systems
• Storage systems refer to the hardware and software components used to
store data.
• These systems are critical for data persistence, ensuring that information
remains available even when devices are powered off.
• Storage systems are vital for managing data in both personal and
professional settings.
• With various types and technologies available, they cater to different
requirements, from high-performance computing to long-term data
archiving.
• Understanding these systems helps in selecting the right storage solution
for specific needs.
Types of Storage Systems:
1. Primary Storage (Main Memory) - includes RAM and cache (volatile).
2. Secondary Storage - includes hard disk drives (HDDs), solid-state drives (SSDs), and optical discs (non-volatile).
3. Tertiary Storage - magnetic tapes and certain optical storage media.
4. Cloud Storage - storage provided over the internet by third-party services such as Amazon S3, Google Cloud, Microsoft Azure, etc.
5. Network Attached Storage (NAS) - dedicated file storage device connected to a network.
6. Storage Area Network (SAN) - high-speed network that connects storage devices to servers, providing block-level storage.
1. Primary Storage (Main Memory):
Description: Primary storage, also known as main memory, includes RAM (Random Access Memory) and cache. It is volatile, meaning it loses its data when the power is turned off.
Use Case: Used for temporary storage of data that is actively being accessed and processed by the CPU.
2.Secondary Storage:

• Description: Non-volatile storage used for long-term data retention. It includes hard
disk drives (HDDs), solid-state drives (SSDs), and optical discs.

• Use Case: Stores operating systems, applications, and user data. Data persists even
when the device is powered off.

3. Tertiary Storage:

Description: Primarily used for archival and backup purposes. Examples include
magnetic tapes and certain optical storage media.

Use Case: Long-term storage of data that is not frequently accessed but must be
preserved for legal or historical reasons.
4.Cloud Storage:

Description: Data storage provided over the internet by third-party services such as Amazon
S3, Google Cloud Storage, and Microsoft Azure Storage.

Use Case: Offers scalable and flexible storage solutions with the ability to access data from
anywhere with an internet connection.

5. Network Attached Storage (NAS):

Description:

Dedicated file storage device connected to a network, providing data access to multiple clients.

Use Case: Ideal for small to medium-sized businesses that need centralized storage accessible
by multiple users.
6. Storage Area Network (SAN):
Description: High-speed network that connects storage devices to servers, providing block-level storage.
Use Case: Used in enterprise environments for high-performance and high-availability storage needs.
Cloud storage
• Cloud Storage is a mode of computer data storage in which digital
data is stored on servers in off-site locations.
• Cloud Storage enables organizations to store, access, and maintain
data so that they do not need to own and operate their own data
centers, moving expenses from a capital expenditure model to
operational.
• The servers are maintained by a third-party provider who is
responsible for hosting, managing, and securing data stored on its
infrastructure. The provider ensures that data on its servers is always
accessible via public or private internet connections.
Working
• Cloud Storage uses remote servers to save data, such as files, business data,
videos, or images.
• Users upload data to servers via an internet connection, where it is saved on
a virtual machine on a physical server.
• To maintain availability and provide redundancy, cloud providers will often
spread data to multiple virtual machines in data centers located across the
world.
• If storage needs increase, the cloud provider will spin up more virtual
machines to handle the load.
• Users can access data in Cloud Storage through an internet connection and
software such as web portal, browser, or mobile app via an application
programming interface (API).
Cloud Storage is available in four different models: (refer Unit I PPTs)
1. Public
2. Private
3. Hybrid
4. Multi cloud
• Multicloud is when an organization uses cloud computing services
from at least two cloud providers to run their applications. Instead
of using a single-cloud stack, multicloud environments typically
include a combination of two or more public clouds, two or more
private clouds, or some combination of both.
• "Multi-cloud" means multiple public clouds. A company that uses
a multi-cloud deployment incorporates multiple public clouds
from more than one cloud provider. Instead of a business using
one vendor for cloud hosting, storage, and the full application
stack, in a multi-cloud configuration they use several.
Types of Cloud Storage
Cloud Storage comes in three different types: object, file, and block.
Object
• Object storage is a data storage architecture for large stores of unstructured data. It
designates each piece of data as an object, keeps it in a separate storehouse, and bundles it
with metadata and a unique identifier for easy access and retrieval.
File
• File storage organizes data in a hierarchical format of files and folders. File storage is
common in personal computing where data is saved as files and those files are organized in
folders. File storage makes it easy to locate and retrieve individual data items when they are
needed. File storage is most often used in directories and data repositories.
Block
• Block storage breaks data into blocks, each with a unique identifier, and then stores those blocks as separate pieces on the server. The cloud network stores those blocks wherever it is most efficient for the system. Block storage is best used for large volumes of data that require low latency, such as workloads that require high performance or databases.
Advantages of Cloud Storage:
• Total cost of ownership
• Elasticity
• Flexibility
• Security
• Sustainability
• Redundancy

Disadvantages of Cloud Storage:
• Compliance
• Latency
• Control
• Outages
Distributed File Systems - HDFS,
Ceph FS
• A distributed file system (DFS) is a file system that is distributed across multiple file servers and locations.
• It permits programs to access and store remote data in the same way as local files.
• It also permits users to access files from any system.
• It allows network users to share information and files in a regulated and permitted manner.
• However, the servers have complete control over the data and provide users with access control.
Advantages
• It allows users to access and store data.
• It helps to improve access time, network efficiency, and availability of files.
• It provides transparency of data even if the server or disk fails.
• It permits data to be shared remotely.
• It helps to enhance the ability to change the amount of data and exchange data.
Disadvantages
• In a DFS, the database connection is complicated.
• In a DFS, database handling is also more complex than in a single-user system.
• If all nodes try to transfer data simultaneously, there is a chance that overloading
will happen.
• There is a possibility that messages and data would be missed in the network
while moving from one node to another.
HDFS - Hadoop Distributed File
System.
• Hadoop is an open source framework from Apache and is used to store, process, and analyze data which are very huge in volume.
• It allows for the distributed processing of large data sets across
clusters of computers using simple programming models.
• It is designed to scale up from single servers to thousands of
machines, each offering local computation and storage.
Modules of Hadoop
1.HDFS: Hadoop Distributed File System, states that the files will be
broken into blocks and stored in nodes over the distributed architecture.
(storage unit)
2.YARN: Yet Another Resource Negotiator, is used for job scheduling and managing the cluster. (resource management unit)
3.Map Reduce: This is a framework which helps Java programs to do the
parallel computation on data using key value pair (processing unit)
4.Hadoop Common: These Java libraries are used to start Hadoop and
are used by other Hadoop modules
HDFS - Introduction
• The Hadoop Distributed File System (HDFS) was designed for Big
Data processing.
• HDFS is the basic storage system of Hadoop.
• The large data files running on a cluster of commodity hardware are
stored in HDFS.
• It can store data in a reliable manner even when hardware fails.
Components of HDFS:
1. Name Node
2. Data Node
3. Secondary Name Node
Name Node
• The NameNode is the master of HDFS.
• HDFS breaks large files into smaller pieces called blocks.
• NameNode uses a rackID to identify DataNodes in the rack. (A rack is a collection of DataNodes within the cluster.)
• NameNode keeps track of the blocks of a file.
• NameNode is the bookkeeper of HDFS: it keeps track of how files are broken down into blocks and which DataNode stores these blocks.
• The file system namespace is a collection of files in the cluster; it includes the mapping of blocks of a file and file properties, and is stored in a file called FsImage.
• NameNode uses an EditLog (transaction log) to record every transaction that happens to the file system metadata.
Data Node
• The DataNode is the slave of HDFS.
• There are multiple DataNodes per cluster.
• Each slave machine in the cluster runs a DataNode daemon for reading and writing HDFS blocks of the actual file on its local file system.
• During pipeline reads and writes, DataNodes communicate with each other.
• A DataNode also continuously sends a "heartbeat" message to the NameNode to ensure connectivity between the NameNode and the DataNode.
• If no heartbeat is received for a period of time, the NameNode assumes that the DataNode has failed and its blocks are re-replicated.
Secondary Name Node
• Takes snapshots of HDFS metadata at intervals specified in the Hadoop configuration.
• Its memory requirement is the same as the NameNode's.
• But the secondary NameNode runs on a different machine.
• In case of NameNode failure, the secondary NameNode can be configured manually to bring up the cluster, i.e., we make the secondary NameNode the NameNode.
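To show how a client program interacts with HDFS through the NameNode and DataNodes, here is a small hedged sketch using the third-party Python "hdfs" (WebHDFS) client; the NameNode address, user, and paths are assumptions for the example.

# Sketch: writing and reading a file in HDFS over WebHDFS (assumed endpoint and paths).
from hdfs import InsecureClient

client = InsecureClient("http://namenode-host:9870", user="hadoop")

client.makedirs("/user/hadoop/demo")
with client.write("/user/hadoop/demo/hello.txt", overwrite=True, encoding="utf-8") as writer:
    writer.write("hello hdfs\n")       # the file is split into blocks and stored on DataNodes

print(client.list("/user/hadoop/demo"))  # listing and other metadata come from the NameNode

with client.read("/user/hadoop/demo/hello.txt", encoding="utf-8") as reader:
    print(reader.read())               # block data is streamed back from the DataNodes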
CephFS
• The Ceph File System (CephFS) is a file system built on top of Ceph’s
distributed object store, called RADOS (Reliable Autonomic
Distributed Object Storage).
• The Ceph File System has two primary components:
Clients
Metadata Servers (MDS)
Clients
• The CephFS clients perform I/O operations on behalf of applications using CephFS, such as ceph-fuse for FUSE clients and kcephfs for kernel clients.
• CephFS clients send metadata requests to an active Metadata Server.
• In return, the CephFS client learns of the file metadata, and can begin safely caching both metadata and file data.

Metadata Servers (MDS)
• Provide metadata to CephFS clients.
• Manage metadata related to files stored on the Ceph File System.
• Coordinate access to the shared Red Hat Ceph Storage cluster.
• Cache hot metadata to reduce requests to the backing metadata pool store.
• Manage the CephFS clients' caches to maintain cache coherence.
• Replicate hot metadata between active MDS daemons.
• Coalesce metadata mutations to a compact journal with regular flushes to the backing metadata pool.
3.GFS
• Google Inc. developed the Google File System (GFS), a scalable distributed
file system (DFS), to meet the company’s growing data processing needs.
• GFS offers fault tolerance, dependability, scalability, availability, and
performance to big networks and connected nodes.
• GFS is made up of a number of storage systems constructed from
inexpensive commodity hardware parts.
• The search engine, which creates enormous volumes of data that must be
kept, is only one example of how it is customized to meet Google’s various
data use and storage requirements.
Components of GFS
• A group of computers makes up GFS. A cluster is just a group of connected computers.
There could be hundreds or even thousands of computers in each cluster. There are
three basic entities included in any GFS cluster as follows:
• GFS Clients: They can be computer programs or applications which may be used to
request files. Requests may be made to access and modify already-existing files or add
new files to the system.
• GFS Master Server: It serves as the cluster’s coordinator. It preserves a record of the
cluster’s actions in an operation log. Additionally, it keeps track of the data that
describes chunks, or metadata. The chunks’ place in the overall file and which files they
belong to are indicated by the metadata to the master server.
• GFS Chunk Servers: They are the GFS’s workhorses. They keep 64 MB-sized file
chunks. The master server does not receive any chunks from the chunk servers. Instead,
they directly deliver the desired chunks to the client. The GFS makes numerous copies of each chunk and stores them on various chunk servers in order to ensure stability; the default is three copies. Each copy is referred to as a replica.
• A GFS cluster has a single master, multiple chunkservers, and multiple
clients.
• GFS divides files into fixed-size chunks.
• The system identifies each chunk by an immutable and globally unique
64-bit chunk handle assigned by the master at the creation time.
• Chunkservers store chunks on local disks and read/write chunks using
a chunk handle and byte range.
• GFS replicates chunks on multiple chunkservers (three replicas by
default) for reliability.
• The master handles all file system metadata, including the namespace,
access control information, mapping from files to chunks, and chunk
locations.
• It also controls chunk lease management, garbage collection, and
chunk migration between chunkservers.
• The master communicates with each chunkserver periodically
through HeartBeat messages.
• The GFS client communicates with the master and chunkservers to
read or write data.
• Clients interact with the master only for metadata operations; they
communicate directly to the chunkservers for data-related operations.
Cloud Databases
• A cloud database is a database service built and accessed through a cloud
computing platform.
• It serves many of the same functions as a traditional database with the
added flexibility of cloud computing. Users install software on a cloud
infrastructure to implement the database.
• The cloud database is a type of database that is stored on a cloud
computing platform.
• It allows businesses to store, manage, and access data remotely without
the need for physical servers.
• A cloud database is a database that is hosted on a cloud computing
platform, such as Amazon Web Services, Microsoft Azure, or
Google Cloud Platform.
• It enables users to store, manage, and access data through the
Internet.
• The data is stored in a remote server, which eliminates the need for
physical servers in the user’s location.
• Cloud databases are accessible from anywhere with an internet
connection, making them an ideal solution for businesses with
remote teams or with employees working from home.
• There are several types of cloud databases available to businesses in terms of how data is
stored and retrieved, each with its own unique features and advantages. Here are some of the
most common types of cloud databases:
1.Relational Databases: Relational databases are the most common type of database used in
businesses today. They are based on the relational model and store data in tables with columns
and rows. Relational databases are ideal for storing structured data and are commonly used for
applications such as inventory management, customer relationship management (CRM), and
human resources management.
2.NoSQL Databases: NoSQL databases handle large volumes of unstructured or semi-structured
data. They are often used for applications that require high scalability and availability, such as
social media platforms and e-commerce sites. NoSQL databases come in several flavours,
including document databases, key-value stores, and graph databases.
3.In-memory Databases: In-memory databases use RAM to store data instead of on disk. This
allows them to provide extremely fast data access and processing times. In-memory databases
are commonly used for applications that require real-time data processing, such as financial
trading systems and online gaming.
4. Columnar Databases: Columnar databases store data in columns instead of rows,
which allows for faster data retrieval and analysis. They are commonly used for data
warehousing and business intelligence applications.
5. Time-series Databases: Time-series databases are designed to handle large
volumes of time-stamped data, such as sensor data, logs, and metrics. They are
optimized for fast writes and reads and are commonly used in industries such as
finance, healthcare, and manufacturing.
6. Object-oriented Databases: Object-oriented databases store data in objects, which
makes them ideal for applications that use object-oriented programming languages
such as Java and Python. They are commonly used for complex applications that
require flexible data modelling, such as scientific research and engineering.
7. NewSQL Databases: NewSQL databases are a new type of database that combines
the scalability of NoSQL databases with the ACID (Atomicity, Consistency, Isolation,
Durability) properties of traditional relational databases. NewSQL databases are
designed to handle large volumes of structured data and are ideal for high-
performance transaction processing applications, such as e-commerce and financial
systems.
Cloud Database Offerings
There are many cloud database offerings available from various cloud service
providers.
1. Amazon Web Services (AWS) - Amazon Relational Database Service (RDS)
2. Microsoft Azure - Azure SQL Database
3. Google Cloud Platform (GCP) - Cloud SQL
4. IBM Cloud - IBM Db2 on Cloud
5. Oracle Cloud - Oracle Database Cloud Service
6. MongoDB - MongoDB Atlas
7. Google Firebase - Firebase Realtime Database
HBase, MongoDB, Cassandra,
DynamoDB
1. HBase
• Since 1970, RDBMS has been the solution for data storage and maintenance-related problems.
• After the advent of big data, companies realized the benefit of processing big data and started opting for solutions like Hadoop.
processing big data and started opting for solutions like Hadoop.
• HBase is a distributed column-oriented database built on top of the
Hadoop file system.
• It is an open-source project and is horizontally scalable.
• HBase is a data model that is similar to Google’s big table designed to
provide quick random access to huge amounts of structured data.
• It leverages the fault tolerance provided by the Hadoop File System
(HDFS).
• It is a part of the Hadoop ecosystem that provides random real-time
read/write access to data in the Hadoop File System.
• One can store the data in HDFS either directly or through HBase.
• Data consumer reads/accesses the data in HDFS randomly using HBase.
• HBase sits on top of the Hadoop File System and provides read and
write access.
• HBase is a column-oriented database and the tables in it are sorted by
row.
• The table schema defines only column families, which are the key-value pairs.
• A table can have multiple column families, and each column family can have any number of columns.
• Subsequent column values are stored contiguously on the disk. Each cell value of the table has a timestamp.
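A short, hedged sketch of this column-family model using the third-party happybase Python client (which talks to HBase through its Thrift gateway); the host, table, and column names are illustrative assumptions.

# Sketch: storing and reading column-family data in HBase via happybase (assumed Thrift host).
import happybase

connection = happybase.Connection("hbase-thrift-host")   # assumed HBase Thrift server

connection.create_table("users", {"info": dict(), "stats": dict()})  # two column families
table = connection.table("users")

# A row key plus column-family:qualifier -> value pairs; cells are timestamped by HBase.
table.put(b"row1", {b"info:name": b"Asha", b"stats:logins": b"3"})

row = table.row(b"row1")
print(row[b"info:name"])          # quick random read access by row key

connection.close()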
HBase Architecture:
• HBase architecture has 3 main components: HMaster, Region Server, and Zookeeper.
1. HMaster
• The implementation of the Master Server in HBase is HMaster.
• It is the process that assigns regions to Region Servers and handles DDL (create, delete table) operations.
• It monitors all Region Server instances present in the cluster.
• In a distributed environment, the Master runs several background threads. HMaster has many features, such as controlling load balancing, failover, etc.
2. Region Server –
• HBase Tables are divided horizontally by row key range into Regions.
• Regions are the basic building elements of HBase cluster that consists
of the distribution of tables and are comprised of Column families.
• Region Server runs on HDFS DataNode which is present in Hadoop
cluster.
• A Region Server is responsible for several things, such as handling, managing, and executing read and write HBase operations on its set of regions.
• The default size of a region is 256 MB.
3. Zookeeper
• It is like a coordinator in HBase.
• It provides services like maintaining configuration information,
naming, providing distributed synchronization, server failure
notification etc.
• Clients communicate with region servers via zookeeper.
Advantages of HBase:
1. Can store large data sets
2. Database can be shared
3. Cost-effective from gigabytes to petabytes
4. High availability through failover and replication

Disadvantages of HBase:
1. No support for SQL structure
2. No transaction support
3. Sorted only on key
4. Memory issues on the cluster

2. MongoDB
• MongoDB is an open-source, cross-platform, and distributed
document-based database designed for ease of application
development and scaling.
• It is a NoSQL database developed by MongoDB Inc.
• The name MongoDB is derived from the word "humongous", which means huge or enormous.
• MongoDB database is built to store a huge amount of data and also
perform fast.
• MongoDB is not a Relational Database Management System (RDBMS).
• It's called a "NoSQL" database.
• Unlike SQL-based databases, it does not normalize data into schemas and tables where every table has a fixed structure.
• Instead, it stores data in collections as JSON-based documents and does not enforce schemas.
• It does not have tables, rows, and columns as other SQL (RDBMS)
databases.
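A minimal sketch with the official pymongo driver, assuming a locally reachable MongoDB server; the database, collection, and document fields are made up for the example.

# Sketch: storing schema-less JSON-like documents in MongoDB with pymongo.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed local MongoDB instance
orders = client["shop"]["orders"]                   # database "shop", collection "orders"

# Documents in the same collection need not share a fixed structure.
orders.insert_one({"customer": "Asha",
                   "items": [{"sku": "A1", "qty": 2}],
                   "total": 49.5})

for doc in orders.find({"total": {"$gt": 20}}):     # query by field, no table schema required
    print(doc["customer"], doc["total"])

client.close()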
Advantages of MongoDB
1.MongoDB stores data as JSON based document that does not enforce
the schema. It allows us to store hierarchical data in a document. This
makes it easy to store and retrieve data in an efficient manner.
2.It is easy to scale up or down as per the requirement since it is a
document based database. MongoDB also allows us to split data across
multiple servers.
3.MongoDB provides rich features like indexing, aggregation, file store,
etc.
4.MongoDB performs fast with huge data.
5.MongoDB provides drivers to store and fetch data from different
applications developed in different technologies such as C#, Java,
Python, Node.js, etc.
6.MongoDB provides tools to manage MongoDB databases.
3. Cassandra
• Apache Cassandra is a highly scalable, high-performance distributed
database designed to handle large amounts of data across many
commodity servers, providing high availability with no single point of
failure.
• It is a type of NoSQL database.
• Apache Cassandra is an open source, distributed and
decentralized/distributed storage system (database), for managing very
large amounts of structured data spread out across the world.
• It provides highly available service with no single point of failure.
• Cassandra was developed at Facebook for inbox search.
• It was open-sourced by Facebook in July 2008.
• Cassandra was accepted into Apache Incubator in March 2009.
• It was made an Apache top-level project since February 2010.
• The design goal of Cassandra is to handle big data workloads across
multiple nodes without any single point of failure.
• Cassandra has peer-to-peer distributed system across its nodes, and data
is distributed among all the nodes in a cluster.
Features of Cassandra
Components of Cassandra
• Users can access Cassandra through its nodes using Cassandra Query
Language (CQL).
• CQL treats the database (Keyspace) as a container of tables.
• Programmers use cqlsh (a prompt to work with CQL) or separate application language drivers.
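A brief sketch of CQL from an application, using the DataStax Python driver and a single local node; the keyspace, table, and replication settings are assumptions for the example.

# Sketch: creating a keyspace/table and reading/writing rows with the Cassandra Python driver.
from uuid import uuid4
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])        # assumed single local Cassandra node
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.users (user_id uuid PRIMARY KEY, name text, email text)
""")

session.execute("INSERT INTO demo.users (user_id, name, email) VALUES (%s, %s, %s)",
                (uuid4(), "Asha", "asha@example.com"))

for row in session.execute("SELECT name, email FROM demo.users"):
    print(row.name, row.email)

cluster.shutdown()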
Advantages:
• High scalability: Cassandra can easily scale out horizontally by adding
more nodes to handle increased data traffic.
• High performance: Cassandra’s design and architecture enable it to achieve
fast read and write performance, making it particularly well-suited for
applications that require a large amount of data read and write operations.
• Strong Consistency: Cassandra offers a strong consistency model to ensure
the consistency and accuracy of data.
• Fault-tolerance: Cassandra has high fault-tolerance, ensuring data
availability and integrity even in the event of node failure.
• Flexible data model: Cassandra has support for a flexible data model that
can store various types of data, including structured and unstructured data.
Drawback:
• Complexity: Configuring and managing Cassandra can be relatively
complex, requiring a certain level of expertise and experience.
• Cassandra does not support transactions, making operations on data
relatively complex.
• Data modeling limitations: Cassandra’s data model is relatively simple
and does not support complex queries or features typically found in
relational databases.
• Data Redundancy: In order to ensure data availability and fault
tolerance, Cassandra utilizes data redundancy, which results in an
increase in storage capacity.
• Scalability and distributed storage are the main focuses in Cassandra’s
design, which could impose limitations on query performance.
4. DynamoDB
• Amazon DynamoDB is a fully managed, serverless NoSQL database service
provided by Amazon Web Services (AWS)
• Serverless database with no server provisioning, patching, or management
required.
• There's no software to install, maintain, or operate, with zero downtime
maintenance.
• Features include DynamoDB streams for building serverless event-driven
applications and zero-ETL integration with Amazon OpenSearch Service for
powerful search capabilities without the need to design, build, or maintain
complex data pipelines.
• Reliability is supported with managed backups, point-in-time recovery, and a
broad set of security controls and compliance standards
• DynamoDB allows users to create databases capable of storing and
retrieving any amount of data and comes in handy while serving any
amount of traffic.
• It dynamically manages each customer’s requests and provides high
performance by automatically distributing data and traffic over servers.
• It is a fully managed NoSQL database service that is fast, predictable in
terms of performance, and seamlessly scalable.
• It relieves the user from the administrative burdens of operating and
scaling a distributed database as the user doesn’t have to worry about
hardware provisioning, patching Softwares, or cluster scaling.
• It also eliminates the operational burden and complexity involved in
protecting sensitive data by providing encryption at REST.
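A minimal boto3 sketch of working with a DynamoDB table; the table name, key schema, and region are illustrative assumptions.

# Sketch: creating an on-demand DynamoDB table and reading/writing one item with boto3.
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")

table = dynamodb.create_table(
    TableName="Orders",                                              # assumed table name
    KeySchema=[{"AttributeName": "order_id", "KeyType": "HASH"}],    # partition key
    AttributeDefinitions=[{"AttributeName": "order_id", "AttributeType": "S"}],
    BillingMode="PAY_PER_REQUEST")                                   # serverless, no capacity planning
table.wait_until_exists()

table.put_item(Item={"order_id": "o-1001", "customer": "Asha", "total": 49})
print(table.get_item(Key={"order_id": "o-1001"})["Item"])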
Advantage of DynamoDB:
• The main advantages of opting for DynamoDB are listed below:
• It has fast and predictable performance and is highly scalable.
• It offloads the administrative burden operation and scaling.
• It offers encryption at REST for data protection.
• Its scalability is highly flexible.
• AWS Management Console can be used to monitor resource utilization and
performance metrics.
• It provides on-demand backups.
• It enables point-in-time recovery for your Amazon DynamoDB tables. Point-
in-time recovery helps protect your tables from accidental write or delete
operations. With point-in-time recovery, you can restore that table to any
point in time during the last 35 days.
• It can be highly automated.
Limitations of DynamoDB –
• A read capacity unit covers an item of up to 4 KB per second, and a write capacity unit covers an item of up to 1 KB per second.
• All tables and global secondary indexes must have a minimum of one
read and one write capacity unit.
• Table sizes have no limits, but accounts have a 256 table limit unless
you request a higher cap.
• Only Five local and twenty global secondary (default quota) indexes per
table are permitted.
• DynamoDB does not prevent the use of reserved words as names.
• The partition key length sits at a minimum of 1 byte and a maximum of 2048 bytes; however, DynamoDB places no limit on values.
Cloud Object Storage
(Amazon S3, OpenStack Swift, Ceph).
• Object storage is a technology that stores and manages data in an unstructured
format called objects.
• Modern organizations create and analyze large volumes of unstructured data such
as photos, videos, email, web pages, sensor data, and audio files.
• Cloud object storage systems distribute this data across multiple physical devices
but allow users to access the content efficiently from a single, virtual storage
repository.
• Object storage solutions are ideal for building cloud native applications that
require scale and flexibility, and can also be used to import existing data stores for
analytics, backup, or archive.
• Cloud Object Storage is widely used by major cloud service providers like AWS, Google Cloud, and IBM Cloud.
• It's a popular choice for modern applications that require scalable and cost-effective storage solutions.
Amazon S3 - Amazon Simple
Storage Service
• It is an object storage service that offers industry-leading scalability,
data availability, security, and performance.
• Customers of all sizes and industries can use Amazon S3 to store and
protect any amount of data for a range of use cases, such as data lakes,
websites, mobile applications, backup and restore, archive, enterprise
applications, IoT devices, and big data analytics.
• Amazon S3 provides management features so that you can optimize,
organize, and configure access to your data to meet your specific
business, organizational, and compliance requirements.
What is Amazon S3 Used for?
• Amazon S3 is used for various purposes in the cloud because of its robust scaling and data-security features. The following are a few of the most common uses of the Amazon S3 service.
• Data Storage: Amazon S3 is a strong option for both small and large storage applications. It helps store and retrieve data for data-intensive applications as needed.

• Backup and Recovery: Many Organizations are using Amazon S3 to backup their
critical data and maintain the data durability and availability for recovery needs.

• Hosting Static Websites: Amazon S3 can store HTML, CSS and other web content from users/developers, allowing them to host static websites that benefit from low-latency access and cost-effectiveness.
• Data Archiving: Integration with the Amazon S3 Glacier service provides a cost-effective solution for long-term storage of data that is accessed infrequently.

• Big Data Analytics: Amazon S3 is often used as a data lake because of its capacity to store large amounts of both structured and unstructured data, offering seamless integration with other AWS analytics and AWS machine learning services.
• Amazon S3 bucket is a fundamental Storage Container feature in AWS
S3 Service.
• It provides a secure and scalable repository for storing of Objects such as
Text data, Images, Audio and Video files over AWS Cloud.
• Each S3 bucket name must be globally unique, and each bucket can be configured with an ACL (Access Control List).
• Amazon S3 organizes data into uniquely named S3 buckets and lets you customize each bucket with access controls.
• It allows users to store objects inside S3 buckets, with features like versioning and lifecycle management of stored data at scale.
How To Use an Amazon S3 Bucket?
• You can use Amazon S3 buckets by following the simple steps mentioned
below. To learn more about configuring Amazon S3, refer to Amazon S3 –
Creating a S3 Bucket.
• Step 1: Log in to your AWS account, search for S3, and open the S3
console. Click the "Create bucket" option and configure the settings that
are shown.
• Step 2: After configuring the bucket, upload objects into it as required,
either through the AWS console or with the AWS CLI. The following command
uploads an object into an S3 bucket:
aws s3 cp <local-file-path> s3://<bucket-name>/
• Step 3: Control the permissions of the uploaded objects and decide who can
access the bucket. You can make the bucket public or private; by default, S3
buckets are private.
• Step 4: Manage the bucket's lifecycle by defining transition rules (see the
sketch after these steps). Based on the rules you define, objects transition
into different storage classes according to their age.
• Step 5: Enable the services needed to monitor and analyze S3. For example,
enable S3 server access logging to record who requested the objects in the
bucket.
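• A hedged sketch of Steps 4 and 5 with the boto3 SDK is shown below (the
same settings can also be applied in the console); the bucket names are
hypothetical.

# A minimal sketch of Steps 4 and 5, assuming boto3 is installed and AWS
# credentials are configured; bucket names are hypothetical.
import boto3

s3 = boto3.client("s3")

# Step 4: lifecycle rule - move objects under the "logs/" prefix to the
# STANDARD_IA storage class after 30 days and expire them after 365 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-then-expire",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
                "Expiration": {"Days": 365},
            }
        ]
    },
)

# Step 5: server access logging - write access logs to a separate bucket.
s3.put_bucket_logging(
    Bucket="example-bucket",
    BucketLoggingStatus={
        "LoggingEnabled": {
            "TargetBucket": "example-log-bucket",
            "TargetPrefix": "access-logs/",
        }
    },
)

• Note that the target log bucket must already grant the S3 log delivery
service permission to write into it before access logs appear.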
Advantages
• Scalability: Amazon S3 is highly scalable. You can store an unlimited
amount of data, and it automatically scales to accommodate your storage
needs. This makes it suitable for both small startups and large enterprises.
• Durability and Availability: Data stored in Amazon S3 benefits from high
durability and availability. Amazon replicates data across multiple data
centers, reducing the risk of data loss.
• Security: Amazon S3 offers robust security features, including encryption
at rest and in transit. Access control and identity management through AWS
Identity and Access Management (IAM) provide granular control over who can
access your data.
• Versatility: Amazon S3 is suitable for various use cases. It serves as an
excellent choice for data archiving, data analytics, content distribution,
and more.
• Data Versioning: Amazon S3 allows you to enable data versioning. This means
you can preserve, retrieve, and restore every version of every object stored
in your S3 bucket.
• Data Transfer Acceleration: Amazon S3 Transfer Acceleration speeds up
uploading and downloading of objects by using Amazon CloudFront's globally
distributed edge locations.
• Integration with AWS Services: S3 seamlessly integrates with various AWS
services, enabling data processing, analysis, and integration into your AWS
applications.
Disadvantages
• Cost: While the pay-as-you-go model can be cost-effective, data transfer
and storage costs can accumulate, particularly with extensive usage. It's
important to manage your usage to control costs.
• Complexity: Setting up and configuring Amazon S3 for specific use cases can
be complex, particularly for those who are new to AWS.
• Data Management: While Amazon S3 offers data versioning and lifecycle
policies, managing large amounts of data and ensuring efficient organization
can be challenging without careful planning.
• Network Latency: The performance of Amazon S3 can be affected by network
latency when transferring data over the internet. This can result in slower
data retrieval times.
• Object Limitations: Amazon S3 imposes limitations on individual object
sizes and bucket names, which may require additional management
considerations for very large files.
OpenStack Swift
• OpenStack Swift is the object storage service of OpenStack, a free,
open-standard cloud computing platform that first came into existence on
July 21, 2010.
• OpenStack began as a joint project of Rackspace Hosting and NASA to make
cloud computing more ubiquitous.
• It is deployed as Infrastructure-as-a-Service (IaaS) in both public and
private clouds, where virtual resources are made available to users.
• The software platform contains interrelated components that control
multi-vendor hardware pools of processing, storage, networking
resources through a data center.
• In OpenStack, the tools which are used to build this platform are
referred to as “projects”.
• These projects handle a large number of services including computing,
networking, and storage services.
• Unlike virtualization, in which resources such as RAM and CPU are
abstracted from the hardware using hypervisors, OpenStack uses a number of
APIs to abstract those resources so that users and administrators can
interact with the cloud services directly.
Architecture
• OpenStack Swift employs a distributed architecture with no central point
of control. Here are its key components:
1.Proxy Server: Handles incoming API requests and routes them to the
appropriate storage nodes.
2.Storage Nodes: Store the actual data objects and manage replication and
consistency.
3.Ring: Maps the logical location of data to the physical location across the
cluster.
4.Account, Container, and Object Servers: Handle requests related to
user accounts, containers (similar to directories), and the objects (files)
themselves.
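• To make the request path concrete, the sketch below uses the
python-swiftclient library (an assumption; Swift can also be driven through
its raw HTTP API) to create a container and store an object. The auth URL,
credentials, and names are hypothetical.

# A minimal sketch, assuming python-swiftclient is installed and Keystone v3
# authentication; the endpoint, credentials, and names below are hypothetical.
from swiftclient import client as swift_client

conn = swift_client.Connection(
    authurl="https://keystone.example.com:5000/v3",
    user="demo",
    key="secret",
    auth_version="3",
    os_options={"project_name": "demo", "user_domain_name": "Default",
                "project_domain_name": "Default"},
)

# The proxy server receives these requests and routes them, via the ring,
# to the account, container, and object servers on the storage nodes.
conn.put_container("backups")
conn.put_object("backups", "notes.txt", contents=b"hello swift",
                content_type="text/plain")

# Read the object back.
headers, body = conn.get_object("backups", "notes.txt")
print(body.decode())

• Because every request enters through the proxy tier and the ring maps data
to nodes, capacity and throughput grow by adding servers rather than by
relying on any central controller.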
Features
• Scalability: Designed to scale out horizontally by adding more servers.
• Durability: Uses replication to ensure data is always available even if
some nodes fail.
• Availability: Data is distributed across multiple nodes to ensure high
availability.
• Concurrency: Optimized for high read/write concurrency.
Use Cases
• Backup and Archival: Ideal for storing backups and archival data.
• Media Storage: Efficiently handles large volumes of media files like
videos and images.
• Big Data and Analytics: Useful for storing large datasets used in
analytics and machine learning.
Advantages of using OpenStack
• It enables rapid provisioning of resources, which makes orchestration and
scaling of resources up and down easy.
• Deployment of applications using OpenStack does not consume a large amount
of time.
• Since resources are scalable, they are used more wisely and efficiently.
• The regulatory compliances associated with its usage are manageable.
Disadvantages of using OpenStack
• OpenStack is not very robust where orchestration is concerned.
• Even today, the APIs provided and supported by OpenStack are not compatible
with many hybrid cloud providers, so integrating solutions becomes difficult.
• Like all cloud service providers, OpenStack services also come with the
risk of security breaches.
Ceph
• Ceph is an open-source, distributed storage system designed to provide
excellent performance, reliability, and scalability.
• It offers unified storage services with object, block, and file interfaces
from a single cluster built from commodity hardware components.
Key Features of Ceph Object Storage
1.Scalability: Ceph can scale out to thousands of nodes, making it
suitable for large-scale storage needs.
2.High Availability: Data is replicated across multiple nodes to ensure
high availability and fault tolerance.
3.Flexibility: Supports multiple storage interfaces, including S3, Swift,
and POSIX, allowing it to integrate with various applications and services
(see the sketch after this list).
4.Durability: Uses replication and erasure coding to protect data against
hardware failures.
5.Self-Healing: Automatically detects and repairs data corruption and
hardware failures
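• Because the Ceph RADOS Gateway (RGW) exposes an S3-compatible API, an
existing S3 client can talk to a Ceph cluster with only an endpoint change.
The sketch below uses boto3 against a hypothetical gateway endpoint,
credentials, and bucket name.

# A minimal sketch, assuming a Ceph RADOS Gateway is running and exposing its
# S3-compatible API; the endpoint URL, keys, and bucket name are hypothetical.
import boto3

rgw = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.com:7480",   # RGW endpoint, not AWS
    aws_access_key_id="RGW_ACCESS_KEY",
    aws_secret_access_key="RGW_SECRET_KEY",
)

rgw.create_bucket(Bucket="sensor-archive")
rgw.put_object(Bucket="sensor-archive", Key="readings/day1.csv",
               Body=b"timestamp,value\n")

for obj in rgw.list_objects_v2(Bucket="sensor-archive").get("Contents", []):
    print(obj["Key"], obj["Size"])

• The same data could equally be reached through Swift-compatible clients,
since RGW exposes both the S3 and Swift APIs.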
Common Use Cases
• Backup and Archival: Ideal for storing large amounts of unstructured
data.
• Big Data and Analytics: Supports data-intensive workloads like data
lakes and analytics.
• Cloud Storage: Provides a scalable and reliable backend for cloud storage
services.
• IoT: Aggregates data from various devices and ensures interoperability
and security
Core Components
• Ceph Monitors (MONs): These maintain the master copy of the cluster
map, which includes the state of the entire cluster. Monitors ensure high
availability and consistency.
• Ceph OSD Daemons (OSDs): These store data as objects on storage
nodes. OSDs handle read, write, replication, and recovery operations
• Ceph Metadata Servers (MDS): These manage metadata for the Ceph
File System (CephFS), including file attributes and directory structures.
• Ceph Managers: These provide detailed information about placement
groups, process metadata, and host metadata, improving performance at
scale
Data Distribution and Management
• CRUSH Algorithm: Ceph uses the Controlled Replication Under
Scalable Hashing (CRUSH) algorithm to determine the placement of
data objects across the cluster. This ensures data distribution is
balanced and avoids bottlenecks
• RADOS: The Reliable Autonomic Distributed Object Store (RADOS)
protocol is the foundation of Ceph, providing a unified interface for
object, block, and file storage
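• To show how an application can reach RADOS directly, the sketch below uses
the python-rados bindings (an assumption; most deployments instead go through
RGW, RBD, or CephFS) to write and read an object in a hypothetical pool.
CRUSH, not a central lookup table, decides which OSDs actually hold it.

# A minimal sketch, assuming the python-rados bindings are installed and a
# ceph.conf plus client keyring are available; the pool name is hypothetical.
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()

# Open an I/O context on a pool; CRUSH maps each object in the pool to a
# placement group and then to a set of OSDs.
ioctx = cluster.open_ioctx("demo-pool")
ioctx.write_full("greeting", b"hello rados")
print(ioctx.read("greeting"))

ioctx.close()
cluster.shutdown()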