Unit 5
Hierarchical Storage Management (HSM)
Hierarchical Storage Management (HSM) is a data storage and management strategy that
automatically moves data between different storage tiers based on its access frequency and
importance. This approach optimizes storage costs and performance by storing frequently accessed
data on high-performance, high-cost storage and less frequently accessed data on lower-cost, slower
storage.
Imagine you have a closet. You keep your favorite clothes you wear every day in the front, easy to
reach. Less-used items, like seasonal clothing, are stored in boxes on a higher shelf. Old clothes you
rarely use are stored in a distant storage unit.
HSM works similarly for digital data. It automatically moves data between different storage tiers
based on how often it's accessed:
1. Hot Storage: High-performance, high-cost storage for frequently accessed data. This is like
your favorite clothes in the front of the closet.
2. Warm Storage: Slower, lower-cost storage for less frequently accessed data. This is like your
seasonal clothes on the higher shelf.
3. Cold Storage: Very slow, very low-cost storage for infrequently accessed data. This is like your
old clothes in the distant storage unit.
Concept of HSM
o HSM divides storage into multiple tiers based on performance and cost, such as
SSDs, HDDs, and archival storage like tape or cold cloud storage.
o For example, inactive files may be moved from SSDs to HDDs or from HDDs to
archival cloud storage.
o HSM systems are designed to scale with the growing volume of data in cloud
environments.
o HSM integrates seamlessly with cloud services like Amazon S3 Glacier, Azure Blob
Storage Archive Tier, and Google Coldline Storage.
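To make the tiering policy concrete, here is a minimal Python sketch (the tier names, age thresholds, and file catalog are illustrative assumptions, not part of any particular product) that decides a file's tier from its last-access time and migrates it when needed:

import time

# Hypothetical age thresholds (in days) for demoting data between tiers.
HOT_TO_WARM_DAYS = 30
WARM_TO_COLD_DAYS = 180

def choose_tier(last_access_ts, now=None):
    # Decide which tier a file belongs in, based on how long ago it was last accessed.
    now = now if now is not None else time.time()
    age_days = (now - last_access_ts) / 86400
    if age_days < HOT_TO_WARM_DAYS:
        return "hot"    # e.g., SSD
    if age_days < WARM_TO_COLD_DAYS:
        return "warm"   # e.g., HDD
    return "cold"       # e.g., tape or archival cloud storage

def rebalance(catalog):
    # catalog maps file name -> {"tier": current tier, "last_access": unix timestamp}.
    for name, meta in catalog.items():
        target = choose_tier(meta["last_access"])
        if target != meta["tier"]:
            print(f"migrating {name}: {meta['tier']} -> {target}")
            meta["tier"] = target  # a real HSM would copy the data, then delete the source

# Example: one file used two days ago, one untouched for over a year.
files = {
    "report.docx": {"tier": "hot", "last_access": time.time() - 2 * 86400},
    "video_2019.mp4": {"tier": "hot", "last_access": time.time() - 400 * 86400},
}
rebalance(files)   # prints: migrating video_2019.mp4: hot -> cold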
Advantages of HSM
1. Cost Optimization:
o Reduces storage costs by using high-performance storage only for active data and
archival storage for inactive data.
o Frees up space in high-speed storage tiers for critical workloads by migrating less-
accessed data to lower-cost storage.
o Simplifies data management through automation, reducing the need for manual
oversight.
2. Better Performance:
3. Seamless Access:
o Users and applications experience seamless access to data, regardless of its physical
location in the storage hierarchy.
o Archiving less frequently used data to slower tiers reduces the volume of data that
needs to be backed up regularly, saving time and resources.
o HSM integrates with existing file systems and applications, minimizing disruptions to
workflows.
4. Disaster Recovery:
o Data stored in multiple tiers across geographic locations ensures better disaster
recovery options.
5. Energy Efficiency:
o Lower-cost, slower storage tiers consume less power, contributing to greener IT operations.
6. Efficient Archiving:
o Enables efficient long-term storage for large volumes of data, such as medical records, video archives, and financial records.
7. Scalability:
o As data volumes grow, HSM systems can expand to accommodate additional tiers or cloud resources.
8. Hybrid Flexibility:
o HSM can be implemented in hybrid environments, integrating on-premises and cloud storage to maximize flexibility.
Examples in Cloud Platforms
o Azure Blob Storage offers tiered storage options such as Hot, Cool, and Archive tiers, with automated data movement between them.
o Google Cloud Storage provides storage classes like Standard, Nearline, and Coldline, enabling cost-effective data management.
Data Consistency in Distributed File Systems
Distributed file systems in the cloud store data across multiple servers or locations to ensure
scalability and availability. However, ensuring data consistency in such systems is complex due to the
distributed nature of data and the need for real-time access. Let’s explore the challenges and
considerations in simpler terms.
1. Network Delays
o Data updates need to be shared across multiple servers. If there is a delay in this
process, some servers might have outdated data while others have the updated
version.
o Example: A file edited in one server might not immediately appear updated on
another.
2. System Failures
o When a server crashes or goes offline, it may miss updates made elsewhere; restoring data
consistency after such failures takes time and effort.
3. Concurrent Access
o When multiple users or applications access and modify the same file simultaneously,
conflicts can occur.
o Example: Two users editing a shared document at the same time may overwrite each
other's changes.
4. Replication Delays
o Data is often replicated (copied) across multiple locations to ensure safety and
availability. However, delays in this process can result in some copies being outdated.
o Example: You upload a file, but another server still shows the old version.
5. Network Partitions
o Sometimes, servers get disconnected from each other due to network issues, leading to
isolated updates. When the connection is restored, merging these updates can be tricky.
6. Consistency vs. Availability Trade-off
o In distributed systems, you can prioritize either consistency (all users see the same
data) or availability (data is accessible even during failures), but not both at the same
time.
7. Multi-Region Data Management
o Managing consistent updates across data stored in multiple regions (e.g., countries)
is harder due to longer communication times.
8. Eventual Consistency
o Some systems allow temporary inconsistencies, promising that all copies of data will
eventually become consistent. This works well for applications where immediate
consistency isn’t critical but can confuse users expecting instant updates.
Techniques for Ensuring Consistency
1. Consistency Models
o Strong consistency guarantees that every read returns the latest write, while eventual
consistency allows temporary differences between copies.
2. Replication Techniques
o Synchronous Replication: Updates are written to all copies before the operation
completes, guaranteeing consistency at the cost of higher latency.
o Asynchronous Replication: Updates are applied to one copy first and then shared
with others, improving speed but risking temporary inconsistencies.
3. Conflict Resolution
o Strategies such as last-write-wins or version tracking decide which change is kept
when concurrent updates collide.
4. Quorum-Based Methods
o To ensure reliable updates, a majority (quorum) of servers must agree on the change
before it's finalized.
o Example: If five servers store a file, at least three must confirm the update (see the
sketch after this list).
5. Consensus Protocols
o Protocols like Paxos and Raft help all servers agree on the same updates, ensuring
consistency.
6. Data Partitioning
o Dividing data into smaller parts and distributing them strategically can reduce
dependencies between servers, improving consistency.
7. Monitoring and Auditing
o Regular monitoring of servers ensures that inconsistencies are quickly identified and
resolved.
8. Client-Side Considerations
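The quorum idea from point 4 can be illustrated with a short Python sketch (an in-memory toy with made-up replica objects, not a real replication protocol): a write succeeds only when a majority of replicas acknowledge it, and a read consults a majority and trusts the copy with the highest version.

import random

class Replica:
    # One copy of the data on one server.
    def __init__(self):
        self.version = 0
        self.value = None
        self.up = True   # set to False to simulate a failed or unreachable server

def quorum_write(replicas, value, version):
    # Apply the write to every reachable replica; succeed only with a majority of acks.
    acks = 0
    for r in replicas:
        if r.up:
            r.value, r.version = value, version
            acks += 1
    return acks >= len(replicas) // 2 + 1

def quorum_read(replicas):
    # Ask a majority of replicas and trust the copy with the highest version number.
    reachable = [r for r in replicas if r.up]
    needed = len(replicas) // 2 + 1
    if len(reachable) < needed:
        raise RuntimeError("not enough replicas for a read quorum")
    sample = random.sample(reachable, needed)
    return max(sample, key=lambda r: r.version).value

servers = [Replica() for _ in range(5)]
servers[4].up = False                                  # one of five servers is down
print(quorum_write(servers, "file v1", version=1))     # True: 4 acks >= 3 needed
print(quorum_read(servers))                            # "file v1"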
Examples from Cloud Storage Services
1. Amazon S3
o Follows an eventual consistency model for overwrites but ensures strong
consistency for new objects.
o Example: If you update a file, the change might take some time to reflect across all
locations.
o Provides strong consistency for all operations, ensuring users always see the latest
data.
Google File System (GFS)
GFS is a highly scalable DFS designed by Google, optimized for large-scale data processing.
Key Features
1. Reliability:
o Data is replicated across multiple chunk servers (typically three copies of each chunk)
to survive failures.
2. Large Files:
o Optimized for a moderate number of very large files rather than millions of small ones.
3. File Segmentation:
o Files are divided into large chunks (64 MB), each assigned a unique identifier.
o Chunk servers store data chunks and communicate with clients directly.
6. Fault Tolerance:
7. Write Operations:
8. Garbage Collection:
o Implements lazy garbage collection to optimize storage reclamation.
Architecture of GFS
1. Master Server:
o Maintains the file system's metadata, including namespace and access control.
o Tracks locations of chunk replicas but does not participate in file I/O directly.
2. Chunk Servers:
o Store chunks on local disks and serve read/write requests from clients.
3. Clients:
o Directly interact with chunk servers for data reads and writes.
o During a write, the primary replica processes the request, synchronizes the secondary
replicas, and then responds to the client.
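As a rough illustration of how the master tracks chunk metadata while clients talk to chunk servers directly, here is a simplified Python sketch (the handle format, placement policy, and server names are assumptions; real GFS placement, leases, and re-replication are more involved):

import uuid

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks, as described above

class MasterMetadata:
    # The master keeps only metadata: which chunks make up a file and where replicas live.
    def __init__(self, chunk_servers, replicas=3):
        self.chunk_servers = chunk_servers
        self.replicas = replicas
        self.files = {}        # file name -> ordered list of chunk handles
        self.locations = {}    # chunk handle -> chunk servers holding a replica

    def register_file(self, name, size):
        handles = []
        num_chunks = (size + CHUNK_SIZE - 1) // CHUNK_SIZE
        for i in range(num_chunks):
            handle = uuid.uuid4().hex   # unique identifier for the chunk
            # Simple round-robin placement of replicas on distinct servers.
            servers = [self.chunk_servers[(i + k) % len(self.chunk_servers)]
                       for k in range(self.replicas)]
            self.locations[handle] = servers
            handles.append(handle)
        self.files[name] = handles
        return handles

master = MasterMetadata(chunk_servers=["cs1", "cs2", "cs3", "cs4"])
for handle in master.register_file("web_crawl.log", size=200 * 1024 * 1024):
    print(handle[:8], "->", master.locations[handle])
# The client would then read and write those chunks by contacting the listed servers directly.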
Block Storage
Block storage is a type of cloud storage where data is divided into fixed-sized chunks called blocks.
Each block has a unique identifier and can be accessed independently. Block storage is like the "hard
drive" of the cloud—it is fast, flexible, and designed for high-performance applications.
How Block Storage Works
1. Data Division: Data is broken into blocks, each stored separately with a unique address.
o Example: A 5GB file could be divided into smaller 1MB blocks, which are stored
across different servers.
2. Direct Access: Blocks are accessed directly by their unique address, allowing quick data
retrieval without scanning the entire storage.
3. Independent Updates: Individual blocks can be modified or replaced without affecting other
blocks.
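The three behaviours above can be sketched in a few lines of Python (an in-memory toy volume with an assumed 1 MB block size and simple integer addresses):

BLOCK_SIZE = 1 * 1024 * 1024   # 1 MB blocks, matching the example above

class BlockVolume:
    # A toy block device: fixed-size blocks, addressed by number, updated independently.
    def __init__(self):
        self.blocks = {}   # block address -> bytes

    def write(self, data):
        # Split the data into fixed-size blocks and return the addresses used.
        addresses = []
        for offset in range(0, len(data), BLOCK_SIZE):
            address = len(self.blocks)   # next free block address
            self.blocks[address] = data[offset:offset + BLOCK_SIZE]
            addresses.append(address)
        return addresses

    def read(self, address):
        return self.blocks[address]      # direct access by address, no scanning

    def update(self, address, data):
        self.blocks[address] = data      # other blocks are left untouched

volume = BlockVolume()
addresses = volume.write(b"x" * (5 * BLOCK_SIZE))   # a 5 MB file becomes 5 blocks
volume.update(addresses[2], b"new contents")        # modify one block independently
print(len(addresses), volume.read(addresses[2]))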
Key Features of Block Storage
1. Fixed-Sized Blocks
Data is stored in standardized blocks, making it easy to manage and retrieve specific pieces
of data.
2. High Performance
Provides low latency and high IOPS (Input/Output Operations Per Second), ideal for
demanding applications like databases or virtual machines.
3. Flexibility
Compatible with various file systems (e.g., NTFS, ext4) and operating systems.
4. Scalability
You can add more blocks as your storage needs grow, making it highly scalable for
businesses.
5. Redundancy
Data is often replicated across multiple servers or zones, ensuring reliability even if one
server fails.
6. Snapshot Support
Block storage systems allow creating snapshots (point-in-time backups), which are useful for
quick recovery (see the sketch after this list).
7. Encryption
Supports encryption to secure sensitive data both at rest and during transmission.
8. Multi-Region Availability
Blocks can be stored across different geographic locations to improve data access speed and
disaster recovery.
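As a sketch of the snapshot feature from point 6 (a toy copy-on-write scheme, not any vendor's implementation), the following Python code takes a point-in-time copy of a volume's block map; later writes go to fresh blocks, so the snapshot still sees the old data.

class SnapshotVolume:
    # Copy-on-write blocks: a snapshot is just a saved copy of the block map.
    def __init__(self):
        self.store = {}       # physical block id -> data
        self.block_map = {}   # logical block number -> physical block id
        self.next_id = 0

    def write(self, logical_block, data):
        # Every write goes to a fresh physical block (copy-on-write).
        self.store[self.next_id] = data
        self.block_map[logical_block] = self.next_id
        self.next_id += 1

    def snapshot(self):
        # Point-in-time view: copy the small block map, not the data itself.
        return dict(self.block_map)

    def read(self, logical_block, block_map=None):
        mapping = self.block_map if block_map is None else block_map
        return self.store[mapping[logical_block]]

volume = SnapshotVolume()
volume.write(0, b"version 1")
snap = volume.snapshot()            # take a point-in-time backup
volume.write(0, b"version 2")       # later writes do not disturb the snapshot
print(volume.read(0))               # b'version 2' (current data)
print(volume.read(0, snap))         # b'version 1' (recovered from the snapshot)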
Advantages of Block Storage
1. High Performance
Block storage is designed for applications requiring fast data access, such as databases,
virtual machines, and high-speed analytics.
2. Flexibility
Works like a physical hard drive and can be formatted, partitioned, and managed based on
application needs.
3. Reliability
Redundant copies of blocks ensure data availability even in case of hardware or server
failures.
4. Dynamic Scaling
Automatically scales storage capacity based on workload requirements, saving costs during
low usage.
5. Secure
Supports encryption and secure access controls, ensuring data privacy and compliance with
standards like GDPR and HIPAA.
6. Multi-Purpose
Can handle both structured and unstructured data, making it versatile for various use cases:
o Database storage.
o Email servers.
Use Cases of Block Storage
1. Databases
o Block storage is often used for high-performance databases, ensuring quick data
reads and writes.
2. Virtual Machines
o Provides storage for operating systems and applications running on virtual machines.
3. Big Data Analytics
o Suitable for analytics and processing tasks requiring high-speed data access.
4. Enterprise Applications
o Cloud block storage services such as Amazon EBS offer General Purpose (GP3),
Provisioned IOPS (IO1), and Cold HDD options for various workloads.
Limitations of Block Storage
Cost:
High-performance block storage can be expensive compared to other storage types like
object storage, and it is not cost-efficient for storing large amounts of infrequently accessed data.
Complexity:
Block volumes must be attached to servers, formatted with a file system, and managed
individually, which adds operational overhead compared to object storage.
Storage-as-a-Service (STaaS) in Cloud Storage Systems
How STaaS Works
1. Provisioning:
o Users request storage through a cloud provider's interface (e.g., web portal, CLI, or
APIs); see the sketch after this list.
o Providers allocate storage from their data centers, which users can access and
manage remotely.
2. Storage Types:
o Object Storage: Scalable storage for unstructured data like media and backups.
o Block Storage: Low-latency volumes for databases and virtual machines.
o File Storage: Shared file systems accessed over protocols such as NFS or SMB.
3. Data Accessibility:
o Users access stored data via secure protocols such as HTTP/HTTPS, NFS, or SMB.
4. Scalability:
o Storage capacity can be scaled up or down on demand, making it ideal for dynamic
workloads.
5. Pay-as-You-Go Model:
o Users pay for storage based on consumption, eliminating upfront capital expenses.
6. Redundancy and Durability:
o Providers replicate data across multiple geographic locations for durability and fault
tolerance.
7. Security and Compliance:
o STaaS includes features like encryption, access controls, and compliance with
standards (e.g., GDPR, HIPAA).
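As one concrete illustration of consuming storage through a provider's API, the sketch below uses the AWS SDK for Python (boto3) against Amazon S3; the bucket and object names are placeholders, and running it assumes AWS credentials and an existing bucket.

import boto3

s3 = boto3.client("s3")   # credentials are taken from the environment or an IAM role

BUCKET = "example-staas-bucket"    # placeholder; the bucket must already exist
KEY = "backups/report.pdf"         # placeholder object name

# Upload a local file; you pay only for what is stored and transferred.
s3.upload_file("report.pdf", BUCKET, KEY)

# Download it again over HTTPS from anywhere with the right permissions.
s3.download_file(BUCKET, KEY, "report_copy.pdf")

# List what is stored under a prefix.
for obj in s3.list_objects_v2(Bucket=BUCKET, Prefix="backups/").get("Contents", []):
    print(obj["Key"], obj["Size"])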
Features of STaaS
1. On-Demand Storage:
2. Global Accessibility:
4. Data Tiering:
5. Backup and Disaster Recovery:
o STaaS solutions often include automated backup and disaster recovery options.
6. Multi-Tenancy:
Benefits of STaaS
1. Cost Efficiency:
3. Reliability:
4. Simplified Management:
5. Fast Deployment:
6. Enhanced Security:
7. Global Collaboration:
o Teams across the globe can access shared files and resources seamlessly.
Examples of STaaS Providers
2. Microsoft Azure:
o Services: Azure Blob Storage, Azure Files, and Azure Disk Storage.
o Features: Integration with Azure services, multi-tier storage, and advanced security.
4. IBM Cloud:
Distributed and Parallel File Systems
Distributed and parallel file systems are specialized systems designed to store, manage, and access
large volumes of data across multiple servers. These systems are foundational in cloud computing,
big data processing, and high-performance computing (HPC) environments, offering scalability,
reliability, and efficiency.
Distributed File System (DFS)
Definition:
A Distributed File System (DFS) stores data across multiple servers or nodes, presenting a unified
view of files to users and applications. The goal is to provide reliable, scalable, and fault-tolerant
access to data, even in the face of hardware or network failures.
Key Features:
1. Data Distribution:
o Files are divided into smaller chunks (blocks) and distributed across multiple nodes.
2. Fault Tolerance:
3. Transparency:
o Users see the system as a single file system, hiding the complexity of the underlying
architecture.
4. Scalability:
o The system can scale horizontally by adding more nodes as data grows.
5. Replication:
o Copies of data blocks are maintained in multiple locations for durability.
Examples:
1. HDFS (Hadoop Distributed File System):
o Used in big data applications, optimized for high-throughput access to large datasets.
2. Google File System (GFS):
o Designed for large-scale data storage and retrieval for Google’s internal applications.
3. Azure Files:
Use Cases:
Parallel File System (PFS)
Definition:
A Parallel File System (PFS) is designed to handle massive amounts of data with high-speed
processing by enabling multiple nodes or clients to read and write data simultaneously. This
parallelism ensures better performance for workloads requiring high throughput and low latency.
Key Features:
1. Simultaneous Access:
2. High Performance:
o Optimized for workloads that demand low-latency access and high IOPS
(Input/Output Operations Per Second).
3. Striping:
o Data is divided into stripes and distributed across multiple storage nodes, allowing
concurrent data access (see the sketch after the examples below).
4. Metadata Servers:
Examples:
1. Lustre:
o Widely used in HPC environments for its high throughput and scalability.
2. IBM Spectrum Scale (GPFS):
o Used in both enterprise and HPC environments, providing parallel access to data.
3. BeeGFS:
o Optimized for ease of use and performance in HPC and data-intensive applications.
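The striping feature described above can be sketched in Python as follows (a toy round-robin layout with an assumed 4 MB stripe unit and hypothetical server names; a real parallel file system fetches the stripes from the servers in parallel):

STRIPE_SIZE = 4 * 1024 * 1024   # 4 MB stripe unit (an illustrative choice)

def stripe(data, nodes):
    # Deal fixed-size stripe units out round-robin across the storage nodes.
    layout = {node: [] for node in nodes}
    for offset in range(0, len(data), STRIPE_SIZE):
        node = nodes[(offset // STRIPE_SIZE) % len(nodes)]
        layout[node].append(data[offset:offset + STRIPE_SIZE])
    return layout

def reassemble(layout, nodes, total_units):
    # Put the stripes back in order; a real PFS would read the per-node stripes concurrently.
    parts = []
    for i in range(total_units):
        node = nodes[i % len(nodes)]
        parts.append(layout[node][i // len(nodes)])
    return b"".join(parts)

nodes = ["oss1", "oss2", "oss3"]          # hypothetical object storage servers
data = bytes(10 * 1024 * 1024)            # a 10 MB file -> 3 stripe units
layout = stripe(data, nodes)
print({node: [len(u) for u in units] for node, units in layout.items()})
assert reassemble(layout, nodes, total_units=3) == data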
Use Cases:
Data partitioning and distribution are fundamental techniques in Distributed File Systems (DFS) to
manage and store data efficiently across multiple nodes in a distributed environment. These
techniques ensure scalability, fault tolerance, and high availability while optimizing performance.
1. Data Partitioning
Definition:
Data partitioning refers to dividing large datasets into smaller, manageable chunks or partitions. Each
partition contains a subset of the total data and can be stored on a different node in the distributed
system.
Key Features:
2. Independent Management:
3. Logical Division:
o Data is logically partitioned based on predefined rules, such as file size, hash values,
or ranges.
Benefits:
1. Scalability:
o Partitioning allows the system to store and process massive datasets by distributing
them across multiple nodes.
2. Improved Performance:
3. Fault Isolation:
o If a node storing a specific partition fails, only that portion of data needs to be
recovered or replicated.
Techniques:
1. Range-Based Partitioning:
o Data is divided based on ranges of values (e.g., alphabetical ranges for file names).
2. Hash-Based Partitioning:
o A hash function applied to a key (e.g., a file name) determines which partition stores
the data, spreading items evenly across nodes (see the sketch below).
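A minimal Python sketch of hash-based partitioning (the partition count and key names are illustrative); note that this simple modulo scheme reshuffles most keys whenever the partition count changes, which is the problem consistent hashing, described under Data Distribution below, addresses:

import hashlib

NUM_PARTITIONS = 8   # illustrative partition count

def partition_for(key):
    # Map a key (e.g., a file name) to a partition using a stable hash.
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

for name in ["invoice_2023.pdf", "photo_001.jpg", "logs/app.log"]:
    print(name, "-> partition", partition_for(name))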
2. Data Distribution
Definition:
Data distribution involves placing the partitioned data blocks across multiple nodes in the distributed
system. The goal is to ensure redundancy, balance the load, and optimize access times.
Key Features:
1. Replication:
o Each data block is replicated across multiple nodes to ensure durability and
availability.
2. Load Balancing:
o Data is distributed evenly across nodes to prevent hotspots and ensure efficient
utilization of resources.
3. Geographic Distribution:
Benefits:
1. Fault Tolerance:
2. High Availability:
o Data placement algorithms ensure that data is stored close to where it is frequently
accessed, reducing latency.
Techniques:
1. Random Distribution:
2. Location-Aware Distribution:
3. Consistent Hashing:
o Ensures even distribution of data blocks across nodes while minimizing data
movement when nodes are added or removed.
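Here is a small Python sketch of consistent hashing (no virtual nodes or replication, which production systems add); adding a node moves only the keys that now fall nearest to it on the ring:

import bisect
import hashlib

def ring_hash(value):
    # A stable hash that places both nodes and keys on the same ring.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        self.ring = sorted((ring_hash(n), n) for n in nodes)

    def node_for(self, key):
        # A key is stored on the first node clockwise from its position on the ring.
        points = [point for point, _ in self.ring]
        index = bisect.bisect(points, ring_hash(key)) % len(self.ring)
        return self.ring[index][1]

    def add_node(self, node):
        bisect.insort(self.ring, (ring_hash(node), node))

keys = [f"block-{i}" for i in range(10)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}
ring.add_node("node-d")                               # expanding the cluster
after = {k: ring.node_for(k) for k in keys}
print("keys that moved:", [k for k in keys if before[k] != after[k]])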
The evolution of storage technology has been driven by the need to store increasing amounts of data
efficiently, securely, and cost-effectively. From basic physical storage methods to modern cloud-based
solutions, storage technology has undergone several transformative phases.
1. Early Storage (Punch Cards and Magnetic Drums)
Description:
o Punch cards (1890s) stored data in holes punched in paper. They were used for
simple computational tasks.
o Magnetic drums (1930s) provided small amounts of storage with slower access
speeds.
Limitations:
2. Magnetic Storage
Key Technologies: Magnetic Tapes, Floppy Disks, Hard Disk Drives (HDDs).
Advantages: Faster access times and higher capacity than tapes or floppy disks.
Evolution:
o From large, bulky drives (IBM 305 RAMAC) to compact, multi-terabyte drives.
3. Optical Storage
Description:
o Advantages:
o Limitations:
4. Solid-State Storage
Limitations:
5. Networked Storage
Key Technologies: Network Attached Storage (NAS), Storage Area Network (SAN).
Description:
o Advantages:
o Limitations:
6. Virtualized Storage
Description:
o Advantages:
Simplifies management.
o Limitations:
7. Cloud Storage
Description:
o Advantages:
o Examples: Amazon S3, Google Cloud Storage, Microsoft Azure Blob Storage.
8. Emerging Storage Technologies
Key Innovations:
o DNA Storage: Experimental storage using DNA strands, promising immense data
density.
o Cold Storage: Designed for archival purposes with low-cost, high-capacity solutions
(e.g., Amazon Glacier).