
ISBM College of Engineering, Pune

Department of Computer Engineering


Cloud Computing:310254(C)
-----------------------------------------------------------------------------------------------
Unit-02
Data Storage: Introduction to Enterprise Data Storage, Direct Attached Storage, Storage
Area Network, Network Attached Storage, Data Storage Management, File System, Cloud
Data Stores, Using Grids for Data Storage. Cloud Storage: Data Management, Provisioning
Cloud storage, Data Intensive Technologies for Cloud Computing. Cloud Storage from
LANs to WANs: Cloud Characteristics, Distributed Data Storage.
----------------------------------------------------------------------------------------------------------------

2.1 INTRODUCTION TO ENTERPRISE DATA STORAGE


Enterprise Data Storage refers to the management and storage of large volumes of data within
an organization or enterprise. It involves the use of sophisticated storage systems and
technologies to store, organize, and retrieve data efficiently. The data stored can include
critical business information, databases, applications, user files, and other digital assets
necessary for the organization's operations. Enterprise data storage solutions are designed to
meet the scalability, reliability, and performance requirements of large and complex data
environments.

The various types of storage subsystems are:


• Direct Attached Storage (DAS)
• Storage Area Network (SAN)
• Network Attached Storage (NAS)

1. Direct Attached Storage (DAS):


Description: Direct Attached Storage (DAS) refers to a storage system directly attached to a
computer or server. In this configuration, storage devices, such as hard drives or external
storage arrays, are connected physically to the host system. The connection is typically made
using interfaces like USB(Universal Serial Bus), SATA(Serial Advanced Technology
Attachment or Serial ATA), SAS(Serial Attached SCSI(Small Computer System Interface)),
or FireWire.
Key Characteristics:
• Direct connection to a single server or computer.
• Storage devices are not shared among multiple systems.
• Typically used for local storage needs of a specific server.
• Provides high-speed access to data due to the direct connection.
Use Cases:
• Small-scale applications and databases.
• Local storage for individual servers or workstations.
• Environments where shared storage is not a requirement.
2. Storage Area Network (SAN):
Description: A Storage Area Network (SAN) is a dedicated network that connects multiple
storage devices to multiple servers. SANs use high-speed Fibre Channel or Ethernet
connections to create a separate storage network. This architecture allows multiple servers to
access a shared pool of storage resources.
Key Characteristics:
• Dedicated network for storage, separate from the local area network (LAN).
• Multiple servers can access shared storage resources simultaneously.
• Centralized storage management.
• Enables features like high availability, scalability, and centralized backup.

Use Cases:
• Enterprise-level applications with high-performance requirements.
• Database servers, virtualization platforms, and large-scale data storage.
3. Network Attached Storage (NAS):
Description: Network Attached Storage (NAS) is a file-level storage system connected to a
network. It provides file-based storage services to other devices on the network. NAS devices
are equipped with their own operating systems and file systems and are accessed using
protocols such as NFS (Network File System) or SMB/CIFS (Server Message
Block/Common Internet File System).
Key Characteristics:
• Storage is directly connected to the network.
• Utilizes standard Ethernet connections.
• File-level access to data.
• Often includes additional features like built-in security, data protection, and remote
access.
Use Cases:
• Shared file storage for small to medium-sized businesses.
• Home networks for centralized file sharing and media streaming.
• Backup and data archiving.

In conclusion, each storage architecture—Direct Attached Storage (DAS), Storage Area


Network (SAN), and Network Attached Storage (NAS)—has its own advantages and is suited
for specific use cases. DAS is suitable for local storage needs, SAN is designed for high-
performance and shared storage environments, and NAS provides file-level access for
network-based file sharing. The choice depends on factors such as performance requirements,
scalability, and the nature of the data storage needs.

Direct Attached Storage (DAS):


• Directly connected to a single computer or server.
• Uses common interfaces like SATA, SAS, or USB.
• Provides high performance and low latency for local storage needs.
• Examples: Internal hard drives, external hard drives, RAID arrays.
Storage Area Network (SAN):
• Separate, high-speed network dedicated to storage.
• Multiple servers can share storage resources.
• Block-level access: Servers treat SAN storage as locally attached disks.
• Ideal for large-scale, high-performance applications like databases and virtual
machines.
• Complex and expensive to set up and manage.
Network Attached Storage (NAS):
• File-level storage device connected to a network.
• Provides file sharing and access for multiple users and devices.
• Accessed through file-sharing protocols like NFS or SMB/CIFS.
• Often used for file servers, media servers, and backups.
• More cost-effective and easier to manage than SANs.

Comparison of DAS, SAN and NAS

Feature       DAS                SAN                 NAS
Connection    Direct to server   Separate network    Network
Access        Block-level        Block-level         File-level
Performance   High               High                Moderate
Scalability   Limited            High                Moderate
Sharing       No                 Yes                 Yes
Cost          Low                High                Moderate
Complexity    Low                High                Moderate

Choosing the Right Storage:


• DAS: Ideal for individual servers or workstations requiring high performance and
direct control.
• SAN: Best for high-performance, mission-critical applications requiring shared
storage and advanced features like snapshots and replication.
• NAS: Suitable for file sharing, collaboration, backups, and general-purpose storage
needs across multiple users and devices.
Consider factors like:
• Performance requirements
• Data sharing needs
• Budget
• Management complexity
• Scalability
• Specific application requirements
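The selection logic above can be sketched as a small illustrative helper in Python. The rules and inputs below are simplifying assumptions for demonstration, not a formal sizing method.

# Illustrative sketch only: maps rough requirements to DAS, SAN or NAS
# using the comparison criteria above. The thresholds are assumptions.

def suggest_storage(needs_sharing: bool, performance: str, budget: str) -> str:
    """Return a suggested architecture; performance is 'high' or 'moderate',
    budget is 'low', 'moderate' or 'high'."""
    if not needs_sharing:
        return "DAS"   # single host, direct attachment, lowest cost and complexity
    if performance == "high" and budget == "high":
        return "SAN"   # shared block-level storage for mission-critical workloads
    return "NAS"       # shared file-level storage, moderate cost and complexity

print(suggest_storage(needs_sharing=False, performance="high", budget="low"))      # DAS
print(suggest_storage(needs_sharing=True,  performance="high", budget="high"))     # SAN
print(suggest_storage(needs_sharing=True,  performance="moderate", budget="low"))  # NAS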

2.2 DATA STORAGE MANAGEMENT

2.2.1 Data Storage Management


Data storage management (Enterprise Data storage management) refers to the processes and
tools used to efficiently and securely store, organize, protect, and retrieve digital information.
It's like having a well-maintained library for your digital assets, ensuring everything is easily
accessible while also safe from harm.

What it encompasses:
• Provisioning: Allocating storage resources for different needs.
• Data organization: Categorizing and structuring data for efficient retrieval.
• Data protection: Implementing security measures to prevent unauthorized access and
data loss.
• Backup and recovery: Creating copies of data for disaster recovery purposes.
• Performance optimization: Ensuring efficient data access and retrieval speeds.
• Compliance: Maintaining adherence to data privacy regulations.
• Cost management: Optimizing storage resources to avoid unnecessary expenses.

Data Storage Management involves the planning, organization, and administration of an


organization's data storage infrastructure. It encompasses a set of processes, policies, and
technologies aimed at efficiently storing, organizing, securing, and retrieving data. Effective
data storage management is crucial for ensuring data availability, reliability, and security
while optimizing storage resources.

Here are key aspects of data storage management:
1. Capacity Planning:
• Assessing current and future storage needs to ensure that the storage
infrastructure can accommodate the growing volume of data. This involves
predicting data growth rates and planning for additional storage capacity (a
small forecasting sketch follows this list).
2. Data Classification and Tiering:
• Categorizing data based on its importance, access frequency, and performance
requirements. Implementing a tiered storage strategy allows organizations to
allocate different storage types and costs based on the characteristics of the
data.
3. Data Backup and Recovery:
• Establishing robust backup and recovery processes to protect against data loss.
Regularly backing up critical data and testing the recovery processes are
essential components of storage management.
4. Data Security:
• Implementing security measures to protect stored data from unauthorized
access, data breaches, and cyber threats. This includes encryption, access
controls, and monitoring for suspicious activities.
5. Storage Virtualization:
• Employing storage virtualization technologies to abstract physical storage
resources and provide a unified view of the storage infrastructure.
Virtualization enhances flexibility, simplifies management, and optimizes
resource utilization.
6. Performance Optimization:
• Monitoring and optimizing storage performance to ensure timely data access.
This may involve tuning storage configurations, implementing caching
strategies, and utilizing high-performance storage technologies.
7. Lifecycle Management:
• Managing the entire lifecycle of data from creation to deletion. Implementing
policies for data retention, archiving, and disposal in compliance with
regulatory requirements and organizational policies.
8. Disaster Recovery Planning:
• Developing and maintaining a disaster recovery plan to ensure business
continuity in the event of data loss or system failures. This includes off-site
backups, redundant storage systems, and recovery procedures.
9. Cloud Integration:
• Integrating cloud storage solutions into the overall storage strategy.
Leveraging cloud storage for scalability, cost-effectiveness, and off-site
backup, while ensuring data sovereignty and compliance.
10. Monitoring and Reporting:
• Implementing monitoring tools to track storage performance, usage patterns,
and potential issues. Generating reports to assess the health of the storage
infrastructure and making informed decisions based on data analytics.
11. Compliance and Governance:
• Ensuring that storage practices align with industry regulations, compliance
standards, and governance policies relevant to the organization's sector. This
includes maintaining audit trails and documentation.
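As referenced in the capacity planning item above, growth forecasting often reduces to simple compound-growth arithmetic. The sketch below is illustrative only; the current usage (10 TB) and 30% annual growth rate are assumed figures.

# Minimal capacity-planning sketch: project future storage demand from an
# assumed current usage and annual growth rate (compound growth).

def forecast_capacity(current_tb: float, annual_growth: float, years: int) -> float:
    """Projected capacity in TB after `years`, given a fractional annual growth rate."""
    return current_tb * (1 + annual_growth) ** years

current = 10.0    # TB in use today (assumed)
growth = 0.30     # 30% growth per year (assumed)
for year in range(1, 6):
    print(f"Year {year}: {forecast_capacity(current, growth, year):.1f} TB")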

Effective data storage management requires a holistic approach that considers various factors,
including data lifecycle, security, performance, and compliance. It aims to strike a balance
between providing seamless access to data and optimizing storage resources to meet business
objectives.

2.2.2 Data Storage Management Tools


Data storage management tools play a crucial role in efficiently administering, monitoring,
and optimizing an organization's data storage infrastructure. These tools offer features for
capacity planning, performance monitoring, data protection, and overall storage optimization.

Maintaining storage devices is a tedious job for storage administrators. They adopt some
utilities to monitor and manage storage devices. Management level tasks are configuration,
migration, provisioning, archiving and storage monitoring/reporting. Storage Resource
Management (SRM) tools include configuration tools, provisioning tools and measurement
tools.
Configuration tools handle the set-up of storage resources. These tools help to organize
and manage RAID devices by assigning groups, defining levels or assigning spare drives.
Provisioning tools define and control access to storage resources for preventing a network
user from being able to use any other user’s storage.
Measurement tools analyse performance based on behavioural information about a storage
device. An administrator can use that information for future capacity and upgrade
planning.

Here are some commonly used data storage management tools:

Storage Provisioning and Management:


• Handle storage resource allocation, configuration, and optimization.
• Examples:
o SAN Management Software (IBM Spectrum Virtualize, Dell EMC
PowerStore Manager, HPE Storage Manager)
o NAS Management Software (NetApp OnCommand, Synology DSM, QNAP
QTS)
o Storage Resource Management (SRM) Software (Veritas SRM, IBM Tivoli
Storage Productivity Center, SolarWinds Storage Resource Monitor)
Data Backup and Recovery:
• Create and manage data copies for disaster recovery and restoration.
• Examples:
o Backup Software (Veeam Backup & Replication, Veritas
NetBackup, Commvault Simpana)
o Disk Cloning Software (Acronis True Image, Macrium Reflect, Paragon Hard
Disk Manager)
o Cloud Backup Services (Amazon S3, Microsoft Azure Backup, Google Cloud
Storage)
Data Protection:
• Prevent unauthorized access, data breaches, and data loss.
• Examples:
o Encryption Software (BitLocker, VeraCrypt, Symantec Encryption Desktop)
o Access Control Solutions (Active Directory, OpenLDAP, Azure Active
Directory)

o Data Loss Prevention (DLP) Software (Symantec DLP, McAfee
DLP, Forcepoint DLP)
Performance Optimization:
• Monitor, analyze, and enhance storage performance.
• Examples:
o Storage Performance Management (SPM) Software (SolarWinds Storage
Resource Monitor, Dynatrace, NetApp OnCommand Performance Manager)
o Storage Caching (EMC FlashCache, NetApp Flash Pool, IBM Easy Tier)
Compliance:
• Ensure adherence to data retention and privacy regulations.
• Examples:
o Data Archiving Software (Commvault Simpana Archive, Veritas Enterprise
Vault, IBM Spectrum Archive)
o Data Retention Management Software
Additional Considerations:
• Storage Visualization Tools: Create visual representations of storage environments for
easier management (e.g., HPE Storage Insights, NetApp OnCommand Insight).
• Storage Migration Tools: Assist in moving data between storage systems or platforms
(e.g., Dell EMC PowerPath Migration Enabler, NetApp XCP).

2.2.3 Storage Management Process

Data storage management tools must rely on policies which governs the usage of storage
devices and its procedures. Storage management includes three areas—
• change management,
• performance and capacity planning and
• tiering (tiered storage).
The process used to request, schedule, implement and evaluate adjustments to the storage
infrastructure is called change management.
The change management process defines the way a request is made and approved and
documents the steps used to configure and provision the requested space on a storage array or
server. Change management may also document processes such as data migration and
maintains the integrity and availability of that data for network users.

Performance and capacity planning are used to measure the performance of a system
in-terms of storage and utilization. The result of performance and consumption analysis is
used to make sensible decisions about subsequent storage purchases.

Storage management encompasses three crucial areas that work together to ensure efficient
and effective use of your storage resources:
1. Change Management:
• Focuses on the lifecycle of data - from creation and modification to archiving and
deletion.
• Key activities include:
o Data classification: Categorizing data based on its importance, access
frequency, and regulatory requirements.
o Data migration: Moving data between different storage tiers based on its
access needs and value.

o Data backups and disaster recovery: Ensuring data protection and availability
in case of unexpected events.
o Version control: Maintaining different versions of files for rollbacks or
historical analysis.
o Data security: Implementing access controls, encryption, and other measures
to protect sensitive information.
2. Performance and Capacity Planning:
• Aims to balance storage capacity with the performance needs of applications and
users.
• Key activities involve:
o Performance monitoring: Tracking storage metrics like I/O
operations, response times, and disk utilization.
o Capacity forecasting: Predicting future storage needs based on historical data
and expected growth.
o Resource allocation: Assigning storage resources to applications and users
based on their requirements.
o Storage optimization: Implementing techniques like data
compression, deduplication, and caching to improve efficiency and utilization.
o Cost optimization: Choosing the most cost-effective storage solutions while
meeting performance and capacity requirements.
3. Tiering (Tiered Storage):
• Utilizes different storage technologies with varying performance and cost
characteristics to optimize storage costs and performance.
• Common tiers include:
o Tier 1 (High-performance): SSDs used for frequently accessed data requiring
fast response times.
o Tier 2 (Performance): High-performance HDDs or hybrid solid-state hard disk
drives (SSHDs) for data accessed regularly but not as frequently as Tier 1.
o Tier 3 (Capacity): Standard HDDs used for infrequently accessed data
requiring high storage capacity and lower cost.
o Tier 4 (Archive): Tape storage or cloud storage for rarely accessed data
requiring long-term retention but minimal performance.
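A tiering policy like the one above is often implemented as a simple rule over access recency and frequency. The thresholds in this sketch are illustrative assumptions; real policies are tuned per workload.

# Illustrative tiering policy: choose a storage tier for a data set based on
# how recently and how often it is accessed. All thresholds are assumptions.
from datetime import datetime, timedelta

def choose_tier(last_access: datetime, accesses_per_day: float) -> str:
    age = datetime.now() - last_access
    if accesses_per_day >= 100:
        return "Tier 1 (SSD)"                 # hot, latency-sensitive data
    if age <= timedelta(days=30):
        return "Tier 2 (fast HDD/SSHD)"       # warm data, accessed regularly
    if age <= timedelta(days=365):
        return "Tier 3 (capacity HDD)"        # cool data, infrequent access
    return "Tier 4 (tape/cloud archive)"      # cold data, long-term retention

print(choose_tier(datetime.now(), accesses_per_day=500))
print(choose_tier(datetime.now() - timedelta(days=400), accesses_per_day=0.01))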

2.3.4 Data Storage Challenges


An in-depth understanding of storage devices minimizes risk and lets an administrator handle
challenges such as finding the cause of performance degradation or keeping costs in check.
Managing traditional storage devices is a complicated task because of high operating costs and
performance and scalability issues. Key challenges include massive data demand, the
performance barrier, and power consumption and cost.

Massive Data Demand


An industry survey estimated that the digital universe would grow to about 45 zettabytes by
2020. For scale, one terabyte equals 1024 gigabytes, one petabyte equals 1024 terabytes, one
exabyte equals 1024 petabytes and one zettabyte equals 1024 exabytes.

Performance Barrier
Rapid growth in data has caused a parallel increase in the size of databases. With traditional
storage, the response time for queries is high and needs to be reduced. Whether it is a
social networking site, an enterprise database or a web application, all of them require faster
disk access to read and write data.
Power Consumption and Cost
Because of increasing storage demands, IT organizations and data centres need larger storage
at minimal cost. Low-cost storage tends to lag in performance and still carries other expenses
such as licensing and maintenance. In addition, factors such as the power consumed by storage
devices, cooling systems, the manpower needed to manage them and the floor space for data
centres must be considered.

1. Data Volume and Growth:


• Exponential data surge: Data volume is exploding across all industries, with no signs
of slowing down. This puts immense pressure on storage infrastructure to
accommodate ever-increasing amounts of information.
• Unstructured data complexity: A significant portion of data is unstructured
(e.g., videos, images, emails), making it difficult to organize, manage, and analyze
effectively.
2. Performance and Scalability:
• Demanding access speeds: Modern applications and analytics require fast and reliable
access to large datasets, pushing storage systems to their limits.
• Scalability limitations: Outgrowing storage capacity is a constant concern, requiring
organizations to invest in expansion or struggle with performance bottlenecks.
3. Security and Compliance:
• Increased cyber threats: Data breaches and ransomware attacks are
commonplace, necessitating robust security measures to protect sensitive information.
• Complex compliance requirements: Stringent data privacy regulations, like GDPR
and HIPAA, add additional layers of complexity and compliance pressure.
4. Cost Management:
• Costly storage solutions: High-performance storage technologies can be
expensive, forcing organizations to find a balance between cost and capabilities.
• Wasted storage resources: Underutilized or obsolete data consumes valuable storage
space and incurs unnecessary costs.
5. Other Challenges:
• Data loss and disaster recovery: Protecting against hardware failures, natural
disasters, and other unforeseen events is critical to ensure data availability.
• Skill shortage: Finding qualified professionals with expertise in data storage
management can be a challenge for organizations.
• Integration and compatibility: Managing data across diverse systems and platforms
can be complex, requiring integration and compatibility efforts.

2.3.5 Unified Storage


'Unified storage' is a newer solution that addresses the issues discussed earlier. This type of
storage combines NAS and SAN and is also termed NUS (network unified storage). It handles
both file-level and block-level access, so storage devices can be accessed by single or multiple
hosts. Its main advantages are reduced cost and support for both Fibre Channel and iSCSI.

2.4 FILE SYSTEMS

A file system is a structure used in a computer to store data on a hard disk. When we install a
new hard disk, we need to partition and format it using a file system before storing data.

There are three file systems in use in Windows OS: NTFS, FAT32 and the rarely used FAT.
2.4.1 FAT File System
The FAT (File Allocation Table) file system is a widely used file system format that dates
back to the early days of personal computing. Microsoft and IBM originally developed it and
has seen several versions and variations over the years. FAT file systems are relatively
straightforward, making them suitable for use on various storage devices, including floppy
disks, hard drives, USB flash drives, and memory cards.

The FAT system was devised in the early days of personal computing. It was designed for
systems with very little RAM and small disks, and it required far fewer system resources than
file systems such as those of UNIX. The FAT system has essentially made a comeback: thumb
(flash) drives are now very common, and their smaller capacities make FAT practical; the
smallest are still formatted with FAT16.

2.4.2 NTFS

In the 1990s, Microsoft recognized that DOS-based Windows was inadequate for the demands
of business and industry, and started working on better software that could suit larger systems.
NTFS is much more complex than FAT. While files are in use, the system areas can be
customized, enlarged or moved as required. NTFS incorporates much more security. NTFS is
not well suited to very small disks.

NTFS stands for "New Technology File System," and it is a modern file system developed by
Microsoft. NTFS is the default file system used by the Windows family of operating systems,
including Windows NT, Windows 2000, Windows XP, Windows Vista, Windows 7,
Windows 8, Windows 10, and Windows Server editions.

The FAT (File Allocation Table) and NTFS (New Technology File System) file systems are
both used to manage and organize files and folders on storage devices in Windows operating
systems.

Feature                     FAT File System                      NTFS File System
File System Structure       Simpler structure with FAT           Complex structure with MFT (Master File Table)
Security and Permissions    Minimal security and permissions     Robust security and permissions
Compression and Encryption  No support                           Supports compression and encryption
Performance                 Generally faster on smaller devices  Performs well on modern hardware
Compatibility               Universally compatible               Mainly used in Windows
File Size Limitations       Varies (up to 4 GB for FAT32)        Up to 256 TB
Volume Limitations          Varies (up to 2 TB for FAT32)        Up to 16 EB
Metadata and Journaling     Limited metadata; no journaling      Extensive metadata with journaling
Unicode Support             Limited                              Full Unicode support
Data Deduplication          Not supported                        Supported in recent Windows versions
Use Cases                   Compatibility, older devices,        Modern Windows systems,
                            removable media                      enterprise environments
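To check which file system each mounted volume actually uses, a short sketch with the third-party psutil package can help. psutil is an assumption here (install it separately with pip); the reported type names (NTFS, FAT32, exFAT, ext4, etc.) vary by platform.

# Minimal sketch: list mounted volumes, their file system types and usage.
# Assumes the third-party 'psutil' package is installed (pip install psutil).
import psutil

for part in psutil.disk_partitions(all=False):
    usage = psutil.disk_usage(part.mountpoint)
    print(f"{part.device:20} {part.fstype:8} "
          f"{usage.total / 2**30:8.1f} GiB total, {usage.percent:5.1f}% used")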

2.4.3 Cloud File System


A cloud file system is a type of file storage system that is hosted and maintained by cloud
service providers. It allows users to store, manage, and access their files and data over the
internet rather than on local physical storage devices such as hard drives or servers. Cloud file
systems offer several benefits, including:

1. Accessibility: Users can access their files from anywhere with an internet connection,
making it convenient for remote work and collaboration.

2. Scalability: Cloud file systems can easily scale to accommodate growing storage needs
without the need to invest in additional hardware.

3. Redundancy: Many cloud providers replicate data across multiple data centers, ensuring
data redundancy and minimizing the risk of data loss.

4. Backup and Recovery: Cloud file systems often include built-in backup and recovery
options, making it easier to recover lost data.

5. Collaboration: Cloud file systems often provide features for real-time collaboration,
allowing multiple users to edit and share files simultaneously.

6. Cost-Efficiency: Users typically pay for the storage and services they use on a subscription
or pay-as-you-go basis, reducing upfront costs.

Popular examples of cloud file systems and storage services include Google Drive, Dropbox,
Microsoft OneDrive, and Amazon S3 (Simple Storage Service). These services offer both
personal and enterprise-level solutions for storing and managing files and data in the cloud.
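For example, Amazon S3 exposes cloud object/file storage through a simple web-service API. The sketch below uses the boto3 SDK; the bucket name is hypothetical, and AWS credentials and region are assumed to be configured already (e.g. via environment variables or ~/.aws/credentials).

# Minimal boto3 sketch: store and retrieve a small object in Amazon S3.
import boto3

s3 = boto3.client("s3")
bucket = "my-example-bucket"   # hypothetical bucket name; must exist beforehand

# Upload a small object to the bucket.
s3.put_object(Bucket=bucket, Key="notes/unit02.txt", Body=b"cloud storage demo")

# Read it back.
obj = s3.get_object(Bucket=bucket, Key="notes/unit02.txt")
print(obj["Body"].read().decode())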

In cloud file systems, the considerations are:


● It must sustain basic file system functionality.
● It should be open source.
● It should be mature enough that users will at least consider trusting their data to it.
● It should be shared, i.e., available over a network.
● It should be scalable in parallel.
● It should provide reasonable data protection, even on commodity hardware with only internal
storage.

Organizations that use cloud computing outsource massive amounts of data and workloads to
cloud providers. Due to its low cost, lower management overhead, and elasticity,
organizations move towards using cloud computing.
In cloud storage, when a system failure occurs, hosts or consumers may discover corruption or
loss of data only from their service provider's report. This consumer-provider gap creates
business risk and complicates compliance with SLAs. A cloud file system should be scalable
enough to host large organizations' file systems under different workloads while meeting their
performance requirements. Cloud file systems should provide higher throughput than local file
systems and keep operation latency minimal. The system should also scale to multiple hosts
operating in parallel. Transparency and backwards compatibility are important to facilitate
migration to the cloud with less effort.

Following are some of the cloud file systems.

Ghost File System


The Ghost cloud file system runs on Amazon Web Services (AWS). It provides a highly
redundant, elastic, mountable, cost-effective and standards-based file system. The Ghost cloud
file system offers a fully featured, scalable and stable cloud file system. GFS (Ghost File
System) runs over Amazon's S3, EC2 and SimpleDB web services. When using GFS, the user
has complete control of the data, which can be accessed as a standard network disk drive.

Benefits of Ghost CFS


● Elastic and cost-efficient: Pay for what you use from 1 GB to hundreds of terabytes.
● Multi-region redundancy: Aiming to take advantage of AWS’s 99.99% availability
● Highly secure: Uses your own AWS account (ghost cannot access your data).
● No administration: Scales elastically with built-in redundancy—no provisioning or backup.
● Anywhere: Mount on a server or client or access files via a web page or from a mobile
phone.

Features of Ghost CFS


● Mature elastic file system in the cloud.
● All files and metadata duplicated across multiple AWS availability regions.
● WebDAV for standard mounting on any Linux, Windows or Mac server or client in the world.
● FTP access.
● Web interface for user management and for file upload/download.
● File name search.
● Side-loading of files from torrent and from URL.

Gluster File System


GlusterFS is an open-source, distributed file system capable of handling multiple clients and
large volumes of data. GlusterFS clusters storage devices over the network, aggregating disk
and memory resources and managing data as a single unit. GlusterFS is based on a stackable
user-space design and delivers good performance even for heavy workloads.
GlusterFS supports any client with a valid IP address on the network. Users are no longer
locked into legacy storage platforms, which are costly and monolithic. GlusterFS gives users
the ability to deploy a scale-out, virtualized, centrally managed pool of storage.

Attributes of GlusterFS include scalability and performance, high availability, global
namespace, elastic hash algorithm, elastic volume manager, gluster console manager, and
standards-based.

Hadoop File System

The Hadoop Distributed File System (HDFS) is a distributed file storage system designed to
store and manage vast amounts of data across a cluster of commodity hardware. It is a
fundamental component of the Apache Hadoop ecosystem, which is widely used for big data
processing and analytics.
HDFS is designed to run on commodity hardware. In HDFS, files are stored in blocks ranging
from 64 MB to 1024 MB, with a classic default of 64 MB. Blocks are distributed across the
cluster and replicated for fault tolerance.
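The block size and replication factor determine how much raw cluster capacity a file consumes. Here is a small worked calculation, using the 64 MB block size mentioned above and the commonly used replication factor of 3 (an assumption):

# Worked example: how a file maps onto HDFS blocks and raw storage.
import math

def hdfs_footprint(file_mb: float, block_mb: int = 64, replication: int = 3):
    blocks = math.ceil(file_mb / block_mb)   # number of blocks the file is split into
    raw_mb = file_mb * replication           # replicas store actual bytes; the last block is not padded
    return blocks, raw_mb

blocks, raw = hdfs_footprint(1000)           # a 1000 MB file
print(f"Blocks: {blocks}, raw storage with replication: {raw:.0f} MB")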

XtreemFS: A Distributed and Replicated File System


XtreemFS is a distributed, replicated, open-source file system. XtreemFS allows users to mount
and access files over the Internet. Using XtreemFS, a user can replicate files across data centres
to reduce network congestion and latency and to increase data availability. Installing XtreemFS
is quite easy, but replicating the files is a bit more involved.

Kosmos File System


The Kosmos Distributed File System (KFS) provides high performance together with
availability and reliability, targeting workloads such as search engines, data mining and grid
computing. It is implemented in C++ using standard system components such as the STL, the
Boost libraries, aio and log4cpp. KFS integrates with Hadoop and Hypertable.

CloudFS
CloudFS is a distributed file system designed for the case where the file system itself is
provided as a service. CloudFS is based on GlusterFS, a basic distributed file system, and is
supported by Red Hat and hosted by Fedora.

2.5 CLOUD DATA STORES

A data store is a data repository where data are stored as objects; the term also covers flat files
that hold data. Cloud data stores, often referred to as cloud
databases, are centralized repositories of data that are hosted and managed in cloud
computing environments. These data stores provide scalable and highly available solutions
for storing and managing data over the internet. Cloud data stores offer several advantages,
including ease of access, scalability, reliability, and the ability to offload the management of
database infrastructure to cloud service providers. There are various types of cloud data
stores, each designed for specific use cases and data management needs.
Data stores can be of different types:
• Relational databases (Examples: MySQL, PostgreSQL, Microsoft SQL Server, Oracle

Database)
• Object-oriented databases
• Operational data stores
• Schema-less data stores, e.g. Apache Cassandra or Dynamo
• Paper files
• Data files (spread sheets, flat files, etc)

Here are some common types of cloud data stores:

1. Relational Database as a Service (RDBaaS):


o RDBaaS offers cloud-hosted relational database management systems (RDBMS),
such as MySQL, PostgreSQL, SQL Server, and Oracle Database.
o These databases provide structured data storage with support for complex queries and
transactions.
o RDBaaS solutions are suitable for applications that require ACID (Atomicity,
Consistency, Isolation, Durability) compliance and structured data modeling.

2. NoSQL Databases:
o NoSQL databases, including document stores, key-value stores, column-family stores,
and graph databases, are designed for handling unstructured or semi-structured data.
o They provide flexibility and horizontal scalability, making them suitable for big data
and real-time applications.
o Popular NoSQL databases in the cloud include MongoDB, Cassandra, Redis, and
Amazon DynamoDB.

3. Object Storage:
o Object storage services store data as objects in a flat namespace and are suitable for
storing unstructured data, such as documents, images, videos, and backups.
o Leading cloud object storage platforms include Amazon S3, Google Cloud Storage,
and Azure Blob Storage.

4. Data Warehouses:
o Cloud data warehouses are optimized for analytics and business intelligence (BI)
workloads.
o They enable the storage, management, and analysis of large datasets.
o Examples include Amazon Redshift, Google BigQuery, and Snowflake.

5. Time-Series Databases:
o Time-series databases specialize in handling time-stamped data, making them ideal
for IoT (Internet of Things) applications, telemetry data, and monitoring.
o Prominent cloud time-series databases include InfluxDB and Amazon Timestream.

6. Multi-Model Databases:
o Multi-model databases allow the storage and retrieval of data in various formats,
including structured, semi-structured, and unstructured data.
o They offer versatility for diverse data types and use cases.

7. Data Lakes:

o Data lakes are repositories for storing large volumes of raw, unprocessed data from
diverse sources.
o They facilitate data exploration, analytics, and data processing in a flexible manner.

8. Graph Databases:
o Graph databases are designed for managing and querying graph-structured data,
making them suitable for applications like social networks, recommendation systems,
and fraud detection.

9. Search Engines:
o Cloud-based search engines provide efficient indexing and searching capabilities for
full-text search and faceted search applications.

Cloud data stores can be deployed on various cloud platforms, including Amazon Web
Services (AWS), Google Cloud Platform (GCP), Microsoft Azure, IBM Cloud, and others.
Organizations can choose the most appropriate type of cloud data store based on their specific
data management needs, performance requirements, and scalability demands, thereby
leveraging the benefits of cloud computing for their data storage and processing needs.

Distributed Data Store


A distributed data store is like a distributed database in which information is stored on
multiple nodes. These data stores are typically non-relational databases that can search data
quickly across a large number of nodes.
A distributed data store, often referred to as a distributed database or distributed storage
system, is a specialized data storage solution designed to store and manage data across
multiple physical or virtual locations, often referred to as nodes or servers. Unlike traditional
centralized databases that store data in a single location, distributed data stores distribute data
across a network of nodes, providing advantages such as improved scalability, fault tolerance,
and high availability.

Here are some key characteristics and components of distributed data stores:

▪ Distribution and Replication: Data is distributed across multiple nodes to achieve load
balancing, scalability, and fault tolerance. In some cases, data is replicated across
multiple nodes to ensure redundancy and data availability, even in the event of node
failures.

▪ Consistency Models: Distributed data stores implement various consistency models to


manage how data consistency is maintained across distributed nodes. Common models
include strong consistency, eventual consistency, and causal consistency, each offering a
different trade-off between consistency and performance.

▪ Partitioning: Data partitioning involves dividing data into smaller subsets or shards,
which are distributed across nodes. Partitioning can be based on various criteria, such as
a key range or hash value, and helps distribute data evenly and efficiently (see the sketch
after this list).

▪ Replication Strategies: Distributed data stores use different strategies for data replication.
These strategies can include master-slave replication, multi-master replication, and
leaderless replication, depending on the system's architecture.

▪ CAP Theorem: The CAP theorem, proposed by computer scientist Eric Brewer, describes
the trade-offs among three properties in distributed systems: Consistency, Availability,
and Partition tolerance. Distributed data stores must make trade-offs based on the CAP
theorem to meet specific application requirements.

▪ High Availability: Distributed data stores are designed to provide high availability,
ensuring that data remains accessible even when individual nodes or components fail.
This is achieved through redundancy and failover mechanisms.

▪ Scalability: Scalability is a crucial feature of distributed data stores. They can scale
horizontally by adding new nodes to accommodate growing data and user loads. This
allows for linear scaling and improved performance.

▪ Query and Access Patterns: Distributed data stores support various query and access
patterns, depending on their data model. Common data models include key-value,
document, column-family, and graph databases, each optimized for specific use cases
and query types.

▪ Consensus Protocols: Distributed data stores often employ consensus protocols like
Paxos or Raft to ensure that nodes agree on the state of the system, especially when
electing leaders or handling distributed transactions.

▪ Data Integrity and Security: Data integrity and security mechanisms are crucial in
distributed data stores. Encryption, access controls, and auditing are employed to protect
data from unauthorized access and tampering.

▪ Global Distribution: Some distributed data stores support global distribution, allowing
data to be stored across geographically dispersed data centers. This feature is valuable for
reducing latency and ensuring data access in different regions.
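As referenced in the partitioning point above, the simplest placement scheme hashes a record's key to pick a node. The sketch below is deliberately naive and for illustration only; production systems typically use consistent hashing or range partitioning so that adding nodes does not remap most keys.

# Naive hash-partitioning sketch: map each key to one of N nodes.
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]   # hypothetical node names

def node_for_key(key: str) -> str:
    digest = hashlib.md5(key.encode()).hexdigest() # stable across runs, unlike built-in hash()
    return NODES[int(digest, 16) % len(NODES)]

for k in ["user:42", "user:43", "order:9001"]:
    print(k, "->", node_for_key(k))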

Examples of popular distributed data stores include Apache Cassandra, Apache HBase,
Amazon DynamoDB, Google Cloud Spanner, and Microsoft Azure Cosmos DB.
Organizations choose distributed data stores based on their specific needs, considering factors
like data volume, access patterns, consistency requirements, and geographic distribution.
These systems play a critical role in modern applications and services that require high levels
of availability, scalability, and fault tolerance.

Examples for this kind of data storage are Google’s BigTable, Amazon’s Dynamo and
Windows Azure Storage.
Some distributed data stores can recover the original file when parts of it are damaged or
unavailable by using forward error correction techniques. Others download the affected file
from a different mirror.

Types of Data Stores

Established IT organizations have started using advanced technologies for managing large
volumes of data arising from social computing and data analysis applications.

BigTable
BigTable is a compressed, high-performance, proprietary data storage system built on Google
File System, Chubby Lock Service, SSTable and a number of other Google technologies.
BigTable was developed in 2004 and is used in a number of Google applications such as web
indexing, Google Earth, Google Reader, Google Maps, Google Book Search, MapReduce,
Blogger.com, Google Code hosting, Orkut, YouTube and Gmail. The motivations for
developing BigTable include scalability and better control of performance.
BigTable maps two arbitrary string values (a row key and a column key) and a timestamp into
an associated arbitrary byte array. BigTable is designed to scale into the petabyte range across
multiple machines; it is easy to add more machines, and the system automatically starts using
the available resources without any configuration changes.

Other similar softwares are as follows:


● Apache Accumulo: built on top of Hadoop, ZooKeeper and Thrift. It provides server-side
programming mechanisms and is implemented in Java.
● Apache Cassandra: combines Dynamo's distributed design with BigTable's data model;
written in Java.
● HBase: provides BigTable-like capabilities and is written in Java.
● Hypertable: designed for clusters of servers, especially for storage and processing.
● KDI: Kosmix's attempt at a BigTable clone, written in C++.

Dynamo: A Distributed Storage System


Dynamo is a highly available, proprietary, key-value structured storage system, that is, a
distributed data store. It can act as a database and also as a distributed hash table (DHT). It is
used within parts of Amazon Web Services such as Amazon S3.
Relational databases have long powered the World Wide Web; retail sites in particular rely on
them to let visitors browse and search for products easily. However, it is difficult to build
redundancy and parallelism with relational databases, so they become a single point of failure,
and replication is also limited.
Dynamo is a distributed storage system, not a relational database. Like a relational database it
stores information to be retrieved; however, it stores the data as objects rather than as tables.
The advantage of using Dynamo is that it is responsive and consistent in providing a
distributed storage solution.
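Below is a minimal sketch of storing and retrieving a key-value style item in Amazon DynamoDB through the boto3 SDK. The table name and attributes are hypothetical; the table (with 'id' as its partition key) and AWS credentials are assumed to exist already.

# Minimal boto3 DynamoDB sketch: put and get an item by key.
import boto3

table = boto3.resource("dynamodb").Table("Products")   # hypothetical table

table.put_item(Item={"id": "sku-123", "name": "Kettle", "price": 1999})
item = table.get_item(Key={"id": "sku-123"}).get("Item")
print(item)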

2.6 USING GRIDS FOR DATA STORAGE

Grid computing is a distributed computing model that involves the coordinated use of a large
number of interconnected and geographically dispersed computers to solve complex
computational problems or perform resource-intensive tasks. Unlike traditional computing
systems where a single powerful supercomputer handles tasks, grid computing leverages the
collective processing power and resources of multiple networked computers, making it a
form of distributed computing.

Key characteristics and concepts of grid computing include:
• Resource Sharing
• High Performance
• Distributed Architecture
• Scalability
• Fault Tolerance
Definition by AWS: Grid computing is a computing infrastructure that combines computer
resources spread over different geographical locations to achieve a common goal. All unused
resources on multiple computers are pooled together and made available for a single task.
Organizations use grid computing to perform large tasks or solve complex problems that are
difficult to do on a single computer.

For example, meteorologists use grid computing for weather modeling. Weather modeling is
a computation-intensive problem that requires complex data management and analysis.
Processing massive amounts of weather data on a single computer is slow and time
consuming. That’s why meteorologists run the analysis over geographically dispersed grid
computing infrastructure and combine the results.

2.6.1 Grid Storage for Grid Computing

Grids can be used for data storage in a distributed and highly efficient manner. Grid-based
data storage systems leverage the collective storage capacity and resources of multiple
interconnected computers across a network.
Grid storage is a critical component of grid computing infrastructure, providing the necessary
storage capabilities to support the distributed and collaborative nature of grid computing
environments. Grid computing involves the coordination of resources from multiple
geographically dispersed locations to solve complex computational problems.

2.6.2 Grid Oriented Storage (GOS)

Grid Oriented Storage (GOS) is a dedicated data storage architecture connected directly to a
computational grid. It acts as a data bank and reservoir for data that can be shared among
multiple grid clients. GOS is a successor of Network-Attached Storage (NAS) products for the
grid computing era. GOS accelerates all kinds of applications in terms of performance and
transparency. A GOS system contains multiple hard disks arranged into logical, redundant
storage containers, like traditional file servers. GOS deals with long-distance, heterogeneous
and single-image file operations. GOS acts as a file server and uses the file-based GOS-FS
protocol. Similar to GridFTP, GOS-FS integrates a parallel stream engine and the Grid
Security Infrastructure (GSI). GOS-FS can be used as an underlying platform to utilize the
available bandwidth and accelerate performance in grid-based applications.

3. CLOUD STORAGE

Cloud storage is virtualized storage on demand, also called Data Storage as a Service (DaaS).
Cloud storage is data storage hosted remotely on storage devices reached over the Internet and
maintained by a third party (the service provider). Cloud storage is a part of cloud computing.
It is deployed over WAN infrastructure, which includes hardware components such as switches

and routers. Cloud storage can be deployed in many ways.
For example:
● Local data (desktop/laptop) can be backed up to cloud storage.
● A virtual disk can be ‘sync’ to the cloud and distributed.
● The cloud can be used as a reservoir for storing data.

Important Points:

1. Cloud Storage and Cloud Computing:


Cloud storage is a part of cloud computing.
Before using cloud storage, you need to know its rules and services.

2. Two Types of Resources:


Clients use resources called functional interfaces (like data paths).
Service providers manage resources called management interfaces (like
control paths).

3. Creating a Standard Model:


We should make a standard model for both functional and management
interfaces.
This model should match the services provided by the cloud storage.

4. Foundation for Cloud Storage:


This standard model will be the basis for how cloud storage works.

Cloud storage came under the limelight because of the following attributes available in cloud
computing: pay-as-you-use, elasticity and simplicity (management). It is important that any
provider providing storage as a service should also provide these attributes to the consumer.

Following are some additional cloud storage attributes:


● Resource pooling and multi-tenancy: Multiple consumers can use shared single storage
device. Storage resources are pooled and consumers can be assigned and unassigned
resources according to their needs.
● Scalable and elastic: Virtualized storage can be easily expanded on need basis.
● Accessible through standard protocols and formats, including HTTP, FTP, XML, SOAP and REST.
● Service-based: consumers need not invest up front, that is, no CAPEX (Capital Expenditure);
they only pay for usage, that is, OPEX (Operational Expenditure).
● Pricing based on usage
● Shared and collaborative
● On-demand self-service
Cloud storage can be accessed through web-based applications or through web services
Application Programming Interfaces (APIs), through which data are stored and retrieved. IT
organizations have started developing personalized web applications for easy access to cloud
storage services.


3.1 DATA MANAGEMENT FOR CLOUD STORAGE

For cloud storage, a standards document is published by SNIA (Storage Networking Industry
Association): the Storage Industry Resource Domain Model (SIRDM). It stresses the
importance of simplicity for cloud storage. Figure 1 shows the SIRDM model, which uses the
CDMI standard. The SIRDM model adopts three types of metadata:

storage system metadata,

data system metadata and
user metadata.

By using these metadata, cloud storage interface can offer services without adding
unnecessary complexity in managing the data.
Storage system and data system metadata are used to meet the requirements of the data while
keeping management simple.

User metadata is used by the cloud to find the data objects and containers. Storage system
metadata is used by the cloud to offer basic storage functions like assigning, modifying and
access control. Data system metadata is used by the cloud to offer data as a service based on
user requirements and controls the operation based on that data.

To create, retrieve, update and delete objects in a cloud the cloud data management interface
(CDMI) is used.

The functions in CDMI are:


● Discovery of cloud storage offerings by clients
● Management of containers and the data
● Synchronization of metadata with containers and objects
CDMI is also used to manage containers, domains, security access and billing information.
The CDMI standard is also used as a protocol for accessing storage.
CDMI defines how to manage data as well as the ways of storing and retrieving it. 'Data path'
refers to how data is stored and retrieved; 'control path' refers to how data is managed. The
CDMI standard supports both data path and control path interfaces.
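Because CDMI is an HTTP/REST interface, creating and reading a data object reduces to PUT and GET requests carrying CDMI media types. The sketch below is a hedged illustration using the requests library: the endpoint URL is hypothetical, authentication is omitted, and the header and media-type values follow the CDMI specification as commonly documented.

# Hedged sketch of CDMI data-path access over HTTP.
import requests

BASE = "https://cloud.example.com/cdmi"            # hypothetical CDMI endpoint
HEADERS = {
    "X-CDMI-Specification-Version": "1.1.1",       # CDMI version header (assumed value)
    "Content-Type": "application/cdmi-object",
    "Accept": "application/cdmi-object",
}

# Create (or update) a data object inside a container.
body = {"mimetype": "text/plain", "value": "hello cloud storage"}
requests.put(f"{BASE}/mycontainer/hello.txt", json=body, headers=HEADERS)

# Retrieve the object; the CDMI response is JSON carrying the value and metadata.
resp = requests.get(f"{BASE}/mycontainer/hello.txt", headers=HEADERS)
print(resp.json().get("value"))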

3.3 Cloud Storage Requirements


Multi-tenancy
In a multi-tenancy model, resources are pooled so that they may be shared by multiple
customers based on their needs. Combined with the elasticity of cloud computing, the shared
storage pool model makes the provider cost effective and simplifies billing.

Security
Secure cloud storage requires a secure transmission channel and methods. Securing data can
be done using encryption, authentication and authorization.
● Encryption is the process of scrambling data in such a manner as to make it unreadable
without special information, called a key, to make it readable again.
● Authentication is the process of determining a user's identity. Authentication can employ
passwords, biometrics, identifying tokens and other means.
● Authorization determines access rights on the data and the levels of authorization. To
provide secure cloud storage, access must be restricted for the communication channel, the
data source and the cloud storage sites.
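Client-side encryption before upload is one common way to apply the encryption point above. Below is a minimal sketch using the Fernet recipe from the third-party cryptography package (an assumption; any well-vetted library would do). Key management, the hard part in practice, is not shown.

# Minimal client-side encryption sketch (pip install cryptography).
from cryptography.fernet import Fernet

key = Fernet.generate_key()        # keep this key secure and separate from the data
f = Fernet(key)

ciphertext = f.encrypt(b"sensitive business record")   # this is what goes to cloud storage
plaintext = f.decrypt(ciphertext)                      # only possible with the key
print(plaintext.decode())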

Secure Transmission Channel

The four primary methods used to secure network communications are as follows:
1. Transport Layer Security (TLS) and Secure Sockets Layer (SSL)
2. Hypertext Transfer Protocol Secure (HTTPS)
3. Private Networks
4. Virtual Private Networks (VPNs)

Performance

Cloud storage performance can be categorized into two aspects: speed and latency. Factors that
affect cloud storage performance include the available network bandwidth, the types of
systems at the provider's end, and the methods adopted for compression and caching.

Quality of Service (QoS)


Quality of service (QoS) refers to the levels of performance and efficiency that the system can
provide.

Data Protection and Availability


To ensure that data is protected from loss and theft, providers must take some precautionary
measures:
● Physical site security
● Protection against power loss
● Protection against loss of network access
● Data redundancy

● Server redundancy and server fail-over
● Redundant data sites
● Levels of redundancy
● Versioning and data retention
● Accessibility of cloud storage as live data
● Backup to tape or other media
● Data availability in the event of contract disputes
Metering and Billing
Metering and billing in cloud storage are based on the amount of data uploaded, downloaded
and stored, and on the number and types of requests made, as illustrated below.
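The sketch below shows a simple pay-per-use calculation of this kind. The rates and the
function itself are hypothetical and purely illustrative; they do not correspond to any
particular provider's price list.

def monthly_storage_bill(gb_stored, gb_uploaded, gb_downloaded, requests_made,
                         store_rate=0.02, ingress_rate=0.0, egress_rate=0.08,
                         request_rate=0.0000005):
    """Estimate a monthly bill from metered usage; all rates are made-up examples."""
    cost = gb_stored * store_rate          # charge for capacity held during the month
    cost += gb_uploaded * ingress_rate     # many providers charge nothing for ingress
    cost += gb_downloaded * egress_rate    # charge for data transferred out
    cost += requests_made * request_rate   # charge per API request
    return round(cost, 2)

print(monthly_storage_bill(gb_stored=500, gb_uploaded=50,
                           gb_downloaded=120, requests_made=2_000_000))   # -> 20.6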

2.4 PROVISIONING CLOUD STORAGE

Provisioning cloud storage refers to the process of allocating and configuring storage
resources in a cloud computing environment to meet the needs of users and applications.

It involves setting up, managing, and optimizing storage services in the cloud to ensure data
availability, scalability, and performance.

Here are key points to understand about provisioning cloud storage:

1. Resource Allocation: Provisioning involves allocating storage capacity, whether it's in the
form of block storage, object storage, or file storage, based on the requirements of the
application or user. It ensures that the right amount of storage is available when needed.

2. Service Selection: Cloud providers offer various storage services, each designed for
specific use cases. Provisioning includes selecting the appropriate storage service based on
factors like data access patterns, performance needs, and cost considerations.

3. Configuration: Once the storage service is chosen, it needs to be configured properly. This
includes setting access controls, defining storage policies, and configuring backup and data
retention settings (see the provisioning sketch after this list).

4. Scalability: One of the advantages of cloud storage is its scalability. Provisioning should
take into account the potential growth in data volume and ensure that the storage resources
can scale up or down dynamically as needed.

5. Data Migration: In some cases, data may need to be migrated from on-premises storage to
the cloud or between different cloud providers. Provisioning includes planning and executing
data migration strategies.

6. Data Protection: Ensuring data integrity and security is crucial. Provisioning involves
implementing data protection measures such as encryption, redundancy, and backup
solutions.

7. Performance Optimization: Depending on the workload, provisioning may require
configuring storage for optimal performance. This could involve setting up caching,
optimizing data layout, and selecting storage media with appropriate I/O characteristics.

8. Cost Management: Provisioning should consider cost-effective storage solutions. This may
involve choosing the right storage tier (e.g., hot, cool, or archive) and optimizing resource
utilization to control expenses.

9. Monitoring and Management: After provisioning, ongoing monitoring and management are
essential. Cloud management tools and services help track storage usage, performance
metrics, and compliance with service-level agreements (SLAs).

10. User and Application Access: Provisioning includes granting appropriate access to users
and applications while ensuring that data remains secure and compliant with regulatory
requirements.

11. Compliance and Governance: Depending on the industry and region, there may be
specific compliance requirements for data storage and management. Provisioning should
adhere to these regulations.
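A minimal sketch of how several of the steps above (configuration, data protection and access
restriction) might be automated for one popular object store, Amazon S3, is given below using
the boto3 SDK. The bucket name and region are assumptions for the example, and a
provider-specific API such as this is only one of many possible provisioning interfaces.

import boto3

s3 = boto3.client("s3", region_name="ap-south-1")   # region chosen arbitrarily for the example
bucket = "isbm-unit02-demo-bucket"                  # hypothetical name; bucket names must be globally unique

# Step 3 (configuration): create the bucket in the chosen region.
s3.create_bucket(Bucket=bucket,
                 CreateBucketConfiguration={"LocationConstraint": "ap-south-1"})

# Step 6 (data protection): enable default server-side encryption and versioning.
s3.put_bucket_encryption(
    Bucket=bucket,
    ServerSideEncryptionConfiguration={
        "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
    },
)
s3.put_bucket_versioning(Bucket=bucket, VersioningConfiguration={"Status": "Enabled"})

# Step 10 (user and application access): block public access to the bucket.
s3.put_public_access_block(
    Bucket=bucket,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True, "IgnorePublicAcls": True,
        "BlockPublicPolicy": True, "RestrictPublicBuckets": True,
    },
)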

Cloud computing means sharing third-party resources via the Internet. This sharing can be
done on a need basis, and consumers need not invest in any infrastructure at their end.
Storage clouds increase the efficiency of storing data in remote places by sharing the storage
devices provided by the service providers. Storage capacity can be increased on demand
using multi-tenancy methods.

Private storage clouds reside behind an organization's firewall, are deployed for in-house
customers, and are designed to provide the elasticity and simplicity of the cloud model. By
adopting the Cloud Data Management Interface (CDMI) standard, service providers can
implement a method for metering the storage and data usage of consumers. This interface
also helps providers bill IT organizations based on their usage. The advantage of this
interface is that IT organizations need not write or use a different adapter for each service
provider; with this one interface, they can connect to different service providers.

2.5 DATA-INTENSIVE TECHNOLOGIES FOR CLOUD COMPUTING

In cloud computing, "data-intensive" and "compute-intensive" are two different workload
characteristics that describe the nature of tasks or applications being run on cloud
infrastructure. Here's what each term means:

1. Data-Intensive:
Data-intensive workloads are those in which the primary focus is on processing and
analysing large volumes of data.
These workloads involve handling massive datasets, such as big data analytics, data
mining, data warehousing, and data-driven scientific research.
Data-intensive tasks often require high-speed data processing, storage, retrieval, and
analysis to extract valuable insights from the data.
Examples include running MapReduce jobs in Hadoop, processing real-time streams of
data, or conducting complex database queries on vast datasets.

2. Compute-Intensive:

Compute-intensive workloads, on the other hand, prioritize processing power and
computational resources over data storage and handling.
These workloads involve performing complex calculations, simulations, modeling, and
scientific computations that demand significant CPU and memory resources.
Compute-intensive tasks may not require large datasets, but they require substantial
computational capabilities to execute.
Examples include rendering computer graphics, running scientific simulations (e.g.,
weather forecasting, and molecular modeling), and solving mathematical equations.

In practical terms, cloud service providers offer specialized instances or virtual machine types
tailored for either data-intensive or compute-intensive workloads to optimize performance
and cost efficiency. Users can choose the most suitable instance type based on their specific
workload requirements.

Application programs that are compute bound, described as compute-intensive, require more
execution time. Parallel processing of this type of application involves running individual
algorithms in parallel within a process.

Processing Approach

Data-intensive computing platforms use a parallel computing approach. This approach
combines multiple processors and disks into computing clusters connected via a high-speed
network. The data to be processed is handled independently by the computing resources
available in the clusters, which improves performance and scalability. A cluster can be
defined as a parallel and distributed system consisting of multiple inter-connected standalone
computers working as a single computing resource. This form of parallel computing is well
suited to data-intensive computing.

Data-intensive computing systems share several common characteristics that distinguish
them from other forms of computing:
• The principal mechanism used to bring together the data and the programs or
algorithms that perform the computation
• The programming model used
• Reliability and availability
• Scalability of both hardware and software

System Architecture
For data-intensive computing, an array of system architectures has been implemented. A
number of solutions have emerged; one of them is the MapReduce concept, which was
developed by Google and is available as an open-source implementation known as Hadoop.
This project is used by Yahoo, Facebook and others.

Apart from this, a proprietary system architecture for data-intensive computing, called HPCC
(High-Performance Computing Cluster), was developed by LexisNexis Risk Solutions.

MapReduce and HPCC (High-Performance Computing Cluster) are two different distributed
computing frameworks used in cloud computing and big data processing. Here's an overview
of each:

1. MapReduce:
The MapReduce architecture and programming model, pioneered by Google, is an example
of data-intensive computing.
The MapReduce architecture uses a functional programming style built around key-value
pairs: a map function is applied to the input data to produce a set of intermediate key-value
pairs, and a reduce function merges all intermediate values associated with the same
intermediate key. The system takes care of details such as partitioning the input data,
scheduling and execution automatically, so programmers without experience in parallel
programming can use a large distributed processing environment without difficulty.

MapReduce is a programming model and processing framework developed by Google to
process and generate large datasets in a parallel and distributed computing environment.
It is designed to handle massive amounts of data and to parallelize the processing across a
cluster of computers.

MapReduce divides a computational task into two main stages:
• the Map stage and
• the Reduce stage.

1. Map Phase:
• Input Splitting: The input data is divided into smaller chunks called input
splits. Each split is typically a block of data stored in a distributed file system
like HDFS (Hadoop Distributed File System).
• Mapping: In this phase, each input split is processed by multiple mapper tasks
running on different nodes of a cluster. Mappers apply a user-defined function
(the "map" function) to each record in their input split, generating a set of
intermediate key-value pairs.
• Shuffling and Sorting: The generated key-value pairs from all mappers are
sorted by key. This sorting is crucial because it groups together all key-value
pairs with the same key, making it easier for the reduce phase to process them.
2. Shuffle and Sort:
• Partitioning: The sorted key-value pairs are partitioned into smaller sets
based on keys. Each partition is sent to a reducer.
• Shuffling: The partitions are transferred across the network to the nodes
where reducers are running. This process is known as shuffling.
• Sorting: Within each partition, the key-value pairs are sorted by key again.
This sorting ensures that the reducer receives data in a sorted order.
3. Reduce Phase:
• Reduce Function: Each reducer task receives one partition of the data
generated by the mappers. It applies a user-defined function (the "reduce"
function) to process the key-value pairs within the partition. The reducer's
output is typically a set of aggregated or processed results.
• Final Output: The output from all reducer tasks is collected and merged to
produce the final result of the MapReduce job.

4. Output:
• The final output of the MapReduce job can be stored in a distributed file
system, a database, or any other storage system, depending on the application's
requirements.
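To make the four phases concrete, here is a small, self-contained Python simulation of word
counting in the MapReduce style. It runs on a single machine and only imitates the map,
shuffle/sort and reduce stages; a real job would be submitted to a framework such as Hadoop
rather than written this way.

from collections import defaultdict

def map_phase(record):
    """Map: emit (word, 1) for every word in one input record (line)."""
    return [(word.lower(), 1) for word in record.split()]

def shuffle_and_sort(intermediate):
    """Shuffle/sort: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(key, values):
    """Reduce: aggregate all values that share a key."""
    return key, sum(values)

input_splits = ["cloud storage scales out",
                "cloud storage is elastic",
                "storage is cheap"]

intermediate = []
for split in input_splits:            # each split would go to a different mapper task
    intermediate.extend(map_phase(split))

grouped = shuffle_and_sort(intermediate)
result = dict(reduce_phase(k, v) for k, v in grouped.items())
print(result)                         # {'cheap': 1, 'cloud': 2, 'elastic': 1, ...}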

2. HPCC (High-Performance Computing Cluster):

LexisNexis Risk Solutions independently developed and implemented a solution for data-
intensive computing called HPCC (High-Performance Computing Cluster).
The LexisNexis approach structures clusters from commodity hardware running the Linux
operating system. Custom system software and middleware components were created and
layered on top of the base Linux operating system to provide the execution environment and
distributed file system support that are essential for data-intensive computing. LexisNexis
also implemented a new high-level language for data-intensive computing called ECL.

A High-Performance Computing (HPC) cluster is a group of interconnected computers, often
referred to as nodes or servers, that work together to solve complex computational problems
or perform data-intensive tasks at high speed. These clusters are designed to deliver
exceptional processing power, memory capacity, and storage resources, making them suitable
for scientific research, simulations, big data analytics, and other compute-intensive
applications.

Key characteristics and components of an HPC cluster include:


1. Nodes: Each node in the cluster is a standalone computer with its CPU, memory, and
storage. Nodes are often uniform in terms of hardware specifications to simplify
management and resource allocation.

2. Interconnect: High-speed interconnects, such as InfiniBand or high-speed Ethernet,
connect the nodes within the cluster. These connections allow for rapid data transfer
and communication between nodes, crucial for parallel processing.
3. Parallel Processing: HPC clusters excel at parallel processing, where tasks are
divided into smaller subtasks and processed simultaneously on multiple nodes. This
parallelism significantly speeds up computations.
4. Job Scheduler: HPC clusters typically employ job scheduling software that manages
the allocation of resources to different jobs or tasks submitted by users. It ensures
efficient resource utilization and prevents conflicts.
5. Storage: HPC clusters often have dedicated high-performance storage systems, such
as Network Attached Storage (NAS) or parallel file systems, to handle the large
volumes of data generated and processed during computations.
6. Compute Nodes: These are the nodes where the actual computations take place. They
have CPUs with multiple cores and sufficient memory to handle demanding
workloads.
7. Master Node: The master node is responsible for managing and coordinating
activities across the cluster. It often hosts the job scheduler and provides a central
point of control.
8. Parallel Programming Models: To leverage the full potential of HPC clusters,
applications need to be designed to take advantage of parallelism. Programming
models like MPI (Message Passing Interface) and OpenMP are commonly used for
parallel programming.
9. Cooling and Power: HPC clusters generate a significant amount of heat due to the
intensive computational workloads. Proper cooling and power management are
critical to prevent overheating and ensure system stability.
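Point 8 above names MPI as a common parallel programming model on HPC clusters. The
minimal sketch below uses the mpi4py binding (an assumption about the available tooling) to
split a large sum across the processes of a cluster job; it would typically be launched with a
command such as mpirun -n 4 python sum_example.py.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()       # this process's id within the job
size = comm.Get_size()       # total number of processes in the job

# Each process sums a disjoint slice of the work (here, a strided range of integers).
n = 1_000_000
local_total = sum(range(rank, n, size))

# Partial results are combined on rank 0, which plays the role of the master node.
total = comm.reduce(local_total, op=MPI.SUM, root=0)
if rank == 0:
    print("sum of 0..{} = {}".format(n - 1, total))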

In summary, MapReduce and HPCC are both distributed computing frameworks used for big
data processing in cloud computing environments. MapReduce is known for its simplicity
and is often associated with Hadoop, while HPCC provides a more versatile set of tools for
data-intensive tasks. The choice between them depends on the specific requirements of the
data processing workload.

2.6 CLOUD STORAGE FROM LANS TO WANS

Cloud Storage from LANs to WANs: Cloud Characteristics, Distributed Data Storage.

The topic "Cloud Storage from LANs to WANs" suggests a discussion or exploration of the
evolution and expansion of cloud storage technology and its use in both Local Area Networks
(LANs) and Wide Area Networks (WANs). Here's a breakdown of the meaning:

1. Cloud Storage: This refers to the practice of storing data and files on remote servers that
are accessed over the internet, as opposed to traditional local storage on physical devices like
hard drives. Cloud storage services allow users to store, manage, and retrieve data from
anywhere with an internet connection.

2. LANs (Local Area Networks): LANs are networks that connect devices within a limited
geographic area, such as a home, office, or campus. LANs are typically used for local
communication and data sharing within a specific location.

3. WANs (Wide Area Networks): WANs, on the other hand, are networks that cover larger
geographic areas, often spanning across cities, regions, or even countries. WANs are used to
connect LANs and enable data transmission over long distances.

The phrase "from LANs to WANs" implies a discussion that covers the use of cloud storage
not only within local networks (LANs) but also in scenarios where data needs to be accessed
and shared across wide area networks (WANs). It may explore topics such as data
synchronization, data access over the internet, data security, and the challenges and benefits
of using cloud storage in various network environments.

Cloud Characteristics:
There are three characteristics of a cloud computing environment that are most pertinent to
consider before choosing cloud storage.

1. Computing power is elastic when it can perform parallel operations. In general,
applications designed to run on top of a shared-nothing architecture are well matched to such
an environment. Some cloud computing products, for example Google's App Engine, supply
not only a cloud computing infrastructure but also an entire software stack with a constrained
API, so that software developers are compelled to write programs that can run in a
shared-nothing environment and therefore support elastic scaling.

2. Data is retained at an unknown host server. In general, handing over data raises many
security issues, so suitable precautions should be taken. The very term 'cloud computing'
implies that the computing and storage resources are operated from some remote, unspecified
location. In practice, the data is physically stored in a specific host country and is subject to
local laws and regulations. Since most cloud computing vendors give their customers little
control over where data is stored, customers have little choice but to assume that, unless the
data is encrypted using a key not available to the host, the data may be accessed by a third
party without the customer's knowledge.

3. Data is often replicated across distant locations. Data availability and durability are
paramount for cloud storage providers, as data loss or tampering can be damaging to both the
business and the organization's reputation. Data availability and durability are normally
achieved through transparent replication. Large cloud computing providers with data centres
dispersed throughout the world can provide high levels of fault tolerance by replicating data
at distant locations across continents. Amazon's S3 cloud storage service replicates data
across 'regions' and 'availability zones' so that data and applications can survive even when
an entire location fails.

Additional Characteristics:

Cloud computing is characterized by several key features and attributes that distinguish it
from traditional computing models. These characteristics define the foundation of cloud
technology and its advantages. Here are the main cloud characteristics:

1. On-Demand Self-Service: Cloud services can be provisioned and managed by users
without requiring human intervention from the service provider. Users can allocate resources
as needed through a self-service interface.

2. Broad Network Access: Cloud services are accessible over the internet or through a
network. Users can access cloud resources using a variety of devices, including laptops,
smartphones, and tablets.

3. Resource Pooling: Cloud providers use multi-tenant models to pool computing resources
(e.g., servers, storage, and networking) to serve multiple customers. Resources are
dynamically allocated and reassigned based on demand.

4. Rapid Elasticity: Cloud resources can be scaled up or down quickly to accommodate
changing workloads. This elasticity allows users to access additional resources during traffic
spikes and release them when they are no longer needed.

5. Measured Service: Cloud resources are metered, and users are billed based on their actual
usage. This pay-as-you-go model provides cost efficiency and flexibility.

6. Self-Service Catalog: Cloud providers often offer a catalog of available services and
resources. Users can select and configure these services according to their requirements.

7. Automation and Orchestration: Cloud environments rely heavily on automation to manage
and provision resources efficiently. Orchestration tools allow users to automate complex
workflows and processes.

8. Virtualization: Virtualization technologies, such as virtual machines (VMs) and containers,
are fundamental to cloud computing. They enable the isolation and abstraction of resources,
improving resource utilization.

9. High Availability and Redundancy: Cloud providers typically offer high availability and
redundancy to ensure that services remain accessible even in the event of hardware failures or
disruptions.

10. Security and Compliance: Cloud providers implement robust security measures to protect
data and resources. They often comply with industry-specific regulations and standards.

11. Scalability: Cloud services can easily scale horizontally or vertically to accommodate
growing workloads or user demands.

12. Global Reach: Cloud providers operate data centers in various regions around the world,
allowing users to deploy applications and services in geographically distributed locations.

13. Service Models: Cloud computing offers various service models, including Infrastructure
as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS), catering
to different levels of abstraction and management.

14. Deployment Models: Cloud services can be deployed in different models, including
public cloud (accessible to the general public), private cloud (used exclusively by a single

organization), hybrid cloud (combining public and private cloud resources), and community
cloud (shared by multiple organizations with common interests).

These cloud characteristics collectively enable organizations to achieve greater flexibility,
cost-effectiveness, scalability, and agility in their IT operations. Cloud computing has become
a fundamental technology for businesses and individuals alike, revolutionizing the way
computing resources are consumed and managed.

DISTRIBUTED DATA STORAGE:

Distributed data storage, often referred to as distributed storage or distributed file storage, is a
method of storing data across multiple physical or virtual locations, often on different servers
or nodes within a network or cloud infrastructure. This approach to data storage offers several
advantages and is commonly used in modern computing environments.

Here are some key aspects of distributed data storage:

1. Redundancy: Distributed storage systems replicate data across multiple nodes or servers.
This redundancy ensures data availability and fault tolerance. If one node fails, data can be
retrieved from another.

2. Scalability: Distributed storage can scale horizontally by adding more nodes or servers to
the storage cluster. This allows organizations to accommodate growing data volumes and
increased demand for storage capacity.

3. Load Balancing: Data can be distributed evenly across nodes, helping to balance the load
on the storage infrastructure and preventing any single node from becoming a bottleneck (see
the consistent-hashing sketch after this list).

4. Data Accessibility: Distributed storage provides high availability and accessibility to data.
Users and applications can access data from any node in the distributed storage system,
enhancing data retrieval performance.

5. Data Security: Data can be encrypted and secured within a distributed storage system.
Access controls and authentication mechanisms can be implemented to protect sensitive
information.

6. Data Consistency: Distributed storage systems often implement mechanisms for
maintaining data consistency across nodes. This ensures that data remains synchronized and
coherent, even in distributed environments.

7. Latency Reduction: By distributing data closer to where it's needed, distributed storage can
reduce data access latency. This is particularly beneficial for applications that require low-
latency access to data.

8. Disaster Recovery: Distributed storage can aid in disaster recovery efforts. Data
redundancy and geographical distribution of nodes can protect against data loss in the event
of hardware failures or disasters.

9. Cost-Effective: Distributed storage can be cost-effective because it allows organizations to
use commodity hardware and scale incrementally as needed, avoiding the upfront costs of
large, centralized storage systems.

10. Cloud-Based Solutions: Many cloud providers offer distributed storage services, allowing
organizations to leverage the benefits of distributed storage without managing their
infrastructure.
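Many distributed stores achieve the load balancing and scalability described above by
deciding data placement with a technique such as consistent hashing. The sketch below is a
deliberately simplified, stand-alone illustration of the idea; it is not the placement algorithm
of any particular product.

import hashlib
from bisect import bisect_right

class ConsistentHashRing:
    """Toy consistent-hash ring: a key maps to the first node clockwise from its hash."""

    def __init__(self, nodes, vnodes=100):
        self.ring = []                           # sorted list of (hash, node) points
        for node in nodes:
            for i in range(vnodes):              # virtual nodes smooth the distribution
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key):
        idx = bisect_right(self.ring, (self._hash(key), "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["storage-node-1", "storage-node-2", "storage-node-3"])
for obj in ["invoice-2023.pdf", "photo-001.jpg", "log-2024-01-01.txt"]:
    print(obj, "->", ring.node_for(obj))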

Common examples of distributed storage technologies and systems include:

Distributed File Systems: Examples include Hadoop Distributed File System (HDFS) and
GlusterFS, which distribute files across multiple servers or nodes.

Object Storage: Services like Amazon S3 and Azure Blob Storage offer distributed object
storage, where data is stored as objects that can be accessed via unique URLs.

NoSQL Databases: NoSQL databases like Apache Cassandra and MongoDB are designed for
distributed data storage and retrieval.

Content Delivery Networks (CDNs): CDNs distribute web content to multiple servers
located in different regions to reduce latency and improve content delivery speed.

Blockchain: Blockchain technology is inherently distributed, with data replicated across
multiple nodes for security and integrity.

Distributed data storage plays a crucial role in modern computing, enabling organizations to
effectively manage and safeguard their ever-growing volumes of data while ensuring high
availability and accessibility. It has become a fundamental component of cloud computing,
big data analytics, and scalable web applications.

Amazon Dynamo, CouchDB, and ThruDB are all distributed NoSQL databases, each with its
own set of features and characteristics. Let's briefly explore each of them:

1. Amazon DynamoDB:
Type: Managed NoSQL database service.
Managed By: Amazon Web Services (AWS).
Key Features:
Fully Managed: DynamoDB is a fully managed database service, which means AWS
handles tasks like infrastructure provisioning, patching, and scaling.
NoSQL: It is a NoSQL database, suitable for flexible and schemaless data models.
Highly Scalable: DynamoDB can automatically scale both read and write capacity to
handle varying workloads.
Low Latency: Designed for low-latency performance, making it suitable for applications
that require quick data retrieval.
Strong Consistency: Offers strong consistency options for read and write operations.
Security: Provides robust security features like encryption, access control, and
authentication.
Global Tables: Allows you to replicate data across multiple AWS regions for high
availability and disaster recovery.
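A brief, hedged example of using DynamoDB from Python with the boto3 SDK is shown
below. The table name and attributes are hypothetical, and the table is assumed to already
exist with 'username' as its partition key.

import boto3

dynamodb = boto3.resource("dynamodb", region_name="ap-south-1")
table = dynamodb.Table("Users")    # hypothetical table keyed on 'username'

# Write an item; beyond the key, the attribute set is schemaless.
table.put_item(Item={"username": "asmita", "plan": "student", "quota_gb": 25})

# Read it back with a strongly consistent read, one of DynamoDB's consistency options.
resp = table.get_item(Key={"username": "asmita"}, ConsistentRead=True)
print(resp.get("Item"))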

2. CouchDB:
Type: Open-source NoSQL database.
Key Features:
Distributed: CouchDB is designed to be distributed, making it suitable for applications
that require data synchronization across multiple nodes.
Document-Oriented: Stores data in JSON-like documents, making it flexible for a variety
of data types and structures.
Multi-Version Concurrency Control (MVCC): Supports concurrent access to data while
maintaining data consistency.
Built-in Conflict Resolution: Provides conflict resolution mechanisms for distributed
databases.
RESTful API: Offers a RESTful API for data access, which is easy to work with for web
and mobile applications.
Replication: Supports bidirectional replication, allowing data synchronization between
CouchDB instances.
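Because CouchDB exposes a RESTful HTTP API, it can be exercised with nothing more than
an HTTP client. The sketch below uses Python's requests library against a locally running
CouchDB instance; the URL, credentials and database name are assumptions made for the
example.

import requests

COUCH = "http://localhost:5984"
AUTH = ("admin", "password")        # assumed local administrator credentials

# Create a database (returns an error status if it already exists).
requests.put(f"{COUCH}/notes", auth=AUTH)

# Store a JSON document under an explicit id.
doc = {"course": "Cloud Computing", "unit": 2, "topic": "distributed data storage"}
requests.put(f"{COUCH}/notes/unit-02", json=doc, auth=AUTH)

# Fetch the document back; CouchDB adds _id and _rev fields, the latter used by MVCC.
resp = requests.get(f"{COUCH}/notes/unit-02", auth=AUTH)
print(resp.json())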

3. ThruDB:
Type: Open-source distributed NoSQL database.
Key Features:
Column-Oriented: ThruDB is designed as a column-family store, suitable for handling
large volumes of data.
Distributed and Fault Tolerant: Supports distribution across multiple nodes and includes
fault-tolerance mechanisms.
Schemaless: Allows for flexible data modeling and storage.
Thrift Interface: Provides a Thrift-based API for data access and manipulation.
MapReduce: Supports MapReduce for querying and processing large datasets.
Notable for Time-Series Data: ThruDB is sometimes used for time-series data applications
due to its distributed nature and column-oriented storage.

When choosing between these databases, it's essential to consider your specific use case and
requirements. DynamoDB, being a managed service, is convenient for AWS-based
applications, while CouchDB and ThruDB offer more flexibility and control over your
database infrastructure. The choice also depends on factors like data modeling, scalability
needs, and the level of administrative control you require.

Additional Learning :
Big data:

Big data refers to extremely large and complex datasets that cannot be easily managed,
processed, or analyzed using traditional data processing tools and methods. These datasets are
characterized by the "three Vs": volume, velocity, and variety, and they often include other
attributes like veracity and value. Let's break down these characteristics:

1. Volume: Big data involves massive amounts of data. This can range from terabytes to
petabytes and beyond. The sheer quantity of data is one of the defining features of big data.

2. Velocity: Data is generated and collected at an unprecedented speed. This could be real-
time data streaming from sensors, social media updates, financial transactions, or any other
source that produces a constant flow of information.

3. Variety: Big data comes in diverse formats. It includes structured data (like traditional
databases), semi-structured data (like XML or JSON), and unstructured data (like text
documents, social media posts, images, and videos). Dealing with this variety of data types is
a challenge.

4. Veracity: Veracity refers to the reliability and quality of the data. Big data can be messy,
with inaccuracies, inconsistencies, and missing values. Ensuring data quality is a significant
concern.

5. Value: The ultimate goal of working with big data is to extract valuable insights, make
informed decisions, and gain a competitive advantage. The value of big data comes from its
potential to provide actionable information.

To effectively manage and harness the potential of big data, organizations use various
technologies and approaches, including:

- Distributed Computing: Tools like Hadoop and Apache Spark distribute data processing
tasks across clusters of computers, enabling parallel processing of large datasets.

- Data Storage: NoSQL databases, columnar databases, and distributed file systems are used
to store and retrieve big data efficiently.

- Data Warehousing: Data warehouses are designed to store and manage large datasets for
business intelligence and analytics purposes.

- Machine Learning and Data Analytics: Advanced analytics and machine learning algorithms
are applied to extract meaningful patterns, correlations, and insights from big data.

- Data Visualization: Tools for data visualization help make complex data understandable and
actionable by presenting it in graphical forms.

- Cloud Computing: Cloud platforms offer scalable and flexible infrastructure for storing and
processing big data, reducing the need for large on-premises data centers.
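As an illustration of the distributed-computing item above, the sketch below shows a word
count written against Apache Spark's Python API (PySpark). The input and output paths are
placeholders; on a real cluster they would point at HDFS or an object store.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("hdfs:///data/input/*.txt")        # placeholder input path

counts = (lines.flatMap(lambda line: line.split())     # map: split each line into words
               .map(lambda word: (word, 1))            # map: emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))       # reduce: sum the counts per word

counts.saveAsTextFile("hdfs:///data/output/wordcount") # placeholder output path
spark.stop()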

Industries such as finance, healthcare, e-commerce, social media, and scientific research
benefit from big data analytics by improving decision-making, customer experiences, fraud
detection, and more. However, working with big data also raises privacy and security
concerns, which require careful consideration and management.

In summary, big data refers to the massive and complex datasets generated at high speeds in
various formats. Effectively harnessing the potential of big data requires specialized tools,
technologies, and approaches to turn this data into valuable insights and actionable
information.

Experiential Learning:

Storage management process:
The storage management process involves a series of activities and procedures to efficiently
handle an organization's data storage infrastructure. This process includes various tasks
related to planning, provisioning, monitoring, optimizing, and securing storage resources.
Here is an overview of the storage management process:
1. Assessment and Planning:
• Objective: Understand current and future storage needs.
• Activities:
• Assess existing storage infrastructure.
• Predict future data growth rates.
• Identify performance requirements.
• Plan for capacity expansion.
2. Capacity Planning:
• Objective: Ensure that the storage infrastructure can meet growing data
demands.
• Activities:
• Estimate storage capacity requirements.
• Allocate resources based on data classifications.
• Plan for scalability and expansion.
3. Data Classification and Tiering:
• Objective: Categorize data based on importance and access patterns.
• Activities:
• Classify data based on usage frequency.
• Implement a tiered storage strategy.
• Allocate different storage types and costs based on data characteristics.
4. Provisioning:
• Objective: Allocate storage resources as needed.
• Activities:
• Allocate storage space to applications or users.
• Implement storage virtualization.
• Configure storage systems based on requirements.
5. Security Implementation:
• Objective: Protect data from unauthorized access and cyber threats.
• Activities:
• Implement encryption for sensitive data.
• Set up access controls and authentication.
• Monitor for security breaches.
6. Performance Monitoring and Optimization:
• Objective: Ensure optimal performance and identify potential issues.
• Activities:
• Monitor storage performance in real-time.
• Optimize configurations for better performance.
• Implement caching strategies.
7. Backup and Recovery:
• Objective: Protect data against loss and enable recovery.
• Activities:
• Establish regular backup schedules.
• Test recovery procedures.
• Implement data deduplication for efficient storage.

8. Lifecycle Management:
• Objective: Manage data from creation to deletion.
• Activities:
• Define data retention policies.
• Archive data as needed.
• Dispose of data securely.
9. Disaster Recovery Planning:
• Objective: Ensure business continuity in case of data loss or system failures.
• Activities:
• Develop a comprehensive disaster recovery plan.
• Test the plan regularly.
• Include off-site backups and redundant systems.
10. Cloud Integration:
• Objective: Leverage cloud storage for scalability and flexibility.
• Activities:
• Integrate cloud storage solutions.
• Implement a hybrid cloud strategy.
• Ensure data sovereignty and compliance.
11. Monitoring and Reporting:
• Objective: Gain insights into storage performance and health.
• Activities:
• Use monitoring tools to track storage metrics.
• Generate reports for analysis.
• Make informed decisions based on data analytics.
12. Compliance and Governance:
• Objective: Ensure storage practices align with industry regulations.
• Activities:
• Adhere to compliance standards.
• Maintain audit trails.
• Document storage processes and policies.
The storage management process is iterative, with ongoing monitoring, evaluation, and
adjustments to ensure that the storage infrastructure aligns with the organization's evolving
needs and industry standards.
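Steps 3 (data classification and tiering) and 8 (lifecycle management) above are often
expressed as machine-readable policies. The hedged sketch below shows one way this could
look for an S3 bucket using boto3; the bucket name, prefix, storage tier and retention periods
are illustrative assumptions only.

import boto3

s3 = boto3.client("s3")

# Move objects under 'logs/' to a colder tier after 30 days and delete them after a year.
s3.put_bucket_lifecycle_configuration(
    Bucket="isbm-unit02-demo-bucket",      # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-and-expire-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},
            }
        ]
    },
)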

