Notes CC Unit 02
Comparison of DAS, SAN and NAS
Direct Attached Storage (DAS) connects storage directly to a single server, so its capacity cannot easily be shared with other hosts. A Storage Area Network (SAN) provides block-level storage over a dedicated high-speed network and is shared by many servers. Network Attached Storage (NAS) provides file-level storage to clients over the ordinary LAN.
Storage management encompasses the following activities:
• Provisioning: Allocating storage resources for different needs.
• Data organization: Categorizing and structuring data for efficient retrieval.
• Data protection: Implementing security measures to prevent unauthorized access and
data loss.
• Backup and recovery: Creating copies of data for disaster recovery purposes.
• Performance optimization: Ensuring efficient data access and retrieval speeds.
• Compliance: Maintaining adherence to data privacy regulations.
• Cost management: Optimizing storage resources to avoid unnecessary expenses.
Maintaining storage devices is a tedious job for storage administrators, so they adopt utilities to monitor and manage those devices. Management-level tasks include configuration, migration, provisioning, archiving and storage monitoring/reporting. Storage Resource Management (SRM) tools fall into three groups: configuration tools, provisioning tools and measurement tools.
Configuration tools handle the set-up of storage resources. These tools help to organize
and manage RAID devices by assigning groups, defining levels or assigning spare drives.
Provisioning tools define and control access to storage resources so that one network user cannot use another user's storage.
Measurement tools analyse performance based on behavioural information about a storage
device. An administrator can use that information for future capacity and upgrade
planning.
Data storage management tools must rely on policies that govern the usage of storage devices and the associated procedures. Storage management includes three areas:
• change management,
• performance and capacity planning and
• tiering (tiered storage).
The process used to request, schedule, implement and evaluate adjustments to the storage
infrastructure is called change management.
The change management process defines the way a request is made and approved and
documents the steps used to configure and provision the requested space on a storage array or
server. Change management may also document processes such as data migration and
maintains the integrity and availability of that data for network users.
Performance and capacity planning are used to measure the performance of a system in terms of storage and utilization. The results of performance and consumption analysis are used to make sensible decisions about subsequent storage purchases.
Storage management encompasses three crucial areas that work together to ensure efficient
and effective use of your storage resources:
1. Change Management:
• Focuses on the lifecycle of data - from creation and modification to archiving and
deletion.
• Key activities include:
o Data classification: Categorizing data based on its importance, access
frequency, and regulatory requirements.
o Data migration: Moving data between different storage tiers based on its access needs and value (a small tier-assignment sketch follows this list).
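To make classification-driven migration concrete, here is a minimal Python sketch. The tier names and age thresholds are invented for illustration; a real policy would come from the organization's storage-management rules.

```python
import os
import time

# Hypothetical tier rules: (maximum days since last access, tier name).
TIER_RULES = [
    (30, "hot"),                # recently used data stays on fast primary storage
    (180, "warm"),              # less active data moves to a cheaper secondary tier
    (float("inf"), "archive"),  # long-idle data goes to archival storage
]

def classify_file(path: str) -> str:
    """Return the storage tier a file should live in, based on last access time."""
    days_idle = (time.time() - os.path.getatime(path)) / 86400
    for max_days, tier in TIER_RULES:
        if days_idle <= max_days:
            return tier
    return "archive"

if __name__ == "__main__":
    # Classify every regular file in the current directory.
    for name in os.listdir("."):
        if os.path.isfile(name):
            print(f"{name}: {classify_file(name)}")
```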
Performance Barrier
Rapid growth in data has caused a parallel increase in the size of databases. With traditional storage methods, the response time for queries becomes slow and needs to be improved.
2.4 File Systems in Windows
Three file systems are in use in Windows operating systems: NTFS, FAT32 and the rarely used FAT.
2.4.1 FAT File System
The FAT (File Allocation Table) file system is a widely used file system format that dates
back to the early days of personal computing. Microsoft and IBM originally developed it, and it has seen several versions and variations over the years. FAT file systems are relatively
straightforward, making them suitable for use on various storage devices, including floppy
disks, hard drives, USB flash drives, and memory cards.
The FAT system was first devised in the early years of personal computing. It was planned for systems with very small RAM and small disks, and it required far fewer system resources than file systems such as those of UNIX. The FAT system has essentially made a comeback: thumb and flash drives have become very common, and their smaller capacities make FAT useful; the smaller drives are often formatted with FAT16.
2.4.2 NTFS
In the 1990s, Microsoft recognized that DOS-based Windows was inadequate for the demands of business and industry, so it began working on a file system better suited to larger systems.
NTFS is considerably more complex and flexible than FAT. While files are in use, the system areas can be customized, enlarged, or moved as required. NTFS has much more security incorporated. NTFS is not well suited to small-sized disks.
NTFS stands for "New Technology File System," and it is a modern file system developed by
Microsoft. NTFS is the default file system used by the Windows family of operating systems,
including Windows NT, Windows 2000, Windows XP, Windows Vista, Windows 7,
Windows 8, Windows 10, and Windows Server editions.
The FAT (File Allocation Table) and NTFS (New Technology File System) file systems are
both used to manage and organize files and folders on storage devices in Windows operating
systems.
Cloud file systems store files on remote infrastructure that is accessed over the internet, and they offer several advantages:
1. Accessibility: Users can access their files from anywhere with an internet connection,
making it convenient for remote work and collaboration.
2. Scalability: Cloud file systems can easily scale to accommodate growing storage needs
without the need to invest in additional hardware.
3. Redundancy: Many cloud providers replicate data across multiple data centers, ensuring
data redundancy and minimizing the risk of data loss.
4. Backup and Recovery: Cloud file systems often include built-in backup and recovery
options, making it easier to recover lost data.
5. Collaboration: Cloud file systems often provide features for real-time collaboration,
allowing multiple users to edit and share files simultaneously.
6. Cost-Efficiency: Users typically pay for the storage and services they use on a subscription
or pay-as-you-go basis, reducing upfront costs.
Popular examples of cloud file systems and storage services include Google Drive, Dropbox,
Microsoft OneDrive, and Amazon S3 (Simple Storage Service). These services offer both
personal and enterprise-level solutions for storing and managing files and data in the cloud.
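Amazon S3 is one of the services named above. As a brief illustration, the following Python sketch uses the boto3 library to upload a file and generate a time-limited sharing link; the bucket and object names are placeholders, and AWS credentials are assumed to be configured in the environment.

```python
import boto3

# Assumes AWS credentials are already configured (environment variables
# or ~/.aws/credentials). Bucket and key names below are hypothetical.
s3 = boto3.client("s3")

BUCKET = "example-bucket"
KEY = "reports/summary.pdf"

# Upload a local file as an object in the bucket.
s3.upload_file("summary.pdf", BUCKET, KEY)

# Generate a pre-signed URL so the object can be shared for one hour
# without making the bucket public.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": BUCKET, "Key": KEY},
    ExpiresIn=3600,
)
print(url)
```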
The Hadoop Distributed File System (HDFS) is a distributed file storage system designed to
store and manage vast amounts of data across a cluster of commodity hardware. It is a
fundamental component of the Apache Hadoop ecosystem, which is widely used for big data
processing and analytics.
HDFS stores files in blocks, which typically range from 64 MB to 1024 MB in size; the default block size is 64 MB. The blocks are distributed across the cluster and replicated for fault tolerance.
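To see what block-based storage implies in practice, here is a small Python sketch (plain arithmetic, no Hadoop libraries) that estimates how many blocks a file occupies with a 64 MB block size and how much raw capacity it consumes with an assumed replication factor of 3.

```python
import math

BLOCK_SIZE_MB = 64       # default block size mentioned above
REPLICATION_FACTOR = 3   # assumed replication factor

def hdfs_footprint(file_size_mb: float):
    """Return (number of blocks, approximate raw storage consumed in MB)."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    raw_mb = file_size_mb * REPLICATION_FACTOR
    return blocks, raw_mb

# Example: a 1 GB (1024 MB) file is split into 16 blocks and, with three
# replicas, consumes roughly 3 GB of raw cluster capacity.
print(hdfs_footprint(1024))   # -> (16, 3072)
```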
CloudFS
CloudFS is a distributed file system intended to solve the problems that arise when the file system itself is provided as a service. CloudFS is based on GlusterFS, a basic distributed file system; it is supported by Red Hat and hosted by Fedora.
A data store is a data repository where data are stored as objects; it may comprise databases or flat files that store data. Cloud data stores, often referred to as cloud
databases, are centralized repositories of data that are hosted and managed in cloud
computing environments. These data stores provide scalable and highly available solutions
for storing and managing data over the internet. Cloud data stores offer several advantages,
including ease of access, scalability, reliability, and the ability to offload the management of
database infrastructure to cloud service providers. There are various types of cloud data
stores, each designed for specific use cases and data management needs.
Data stores can be of different types:
1. Relational Databases:
o Relational databases store structured data in tables with predefined schemas.
o Examples in the cloud include MySQL, PostgreSQL, Microsoft SQL Server and Oracle.
2. NoSQL Databases:
o NoSQL databases, including document stores, key-value stores, column-family stores,
and graph databases, are designed for handling unstructured or semi-structured data.
o They provide flexibility and horizontal scalability, making them suitable for big data
and real-time applications.
o Popular NoSQL databases in the cloud include MongoDB, Cassandra, Redis, and
Amazon DynamoDB.
3. Object Storage:
o Object storage services store data as objects in a flat namespace and are suitable for
storing unstructured data, such as documents, images, videos, and backups.
o Leading cloud object storage platforms include Amazon S3, Google Cloud Storage,
and Azure Blob Storage.
4. Data Warehouses:
o Cloud data warehouses are optimized for analytics and business intelligence (BI)
workloads.
o They enable the storage, management, and analysis of large datasets.
o Examples include Amazon Redshift, Google BigQuery, and Snowflake.
5. Time-Series Databases:
o Time-series databases specialize in handling time-stamped data, making them ideal
for IoT (Internet of Things) applications, telemetry data, and monitoring.
o Prominent cloud time-series databases include InfluxDB and Amazon Timestream.
6. Multi-Model Databases:
o Multi-model databases allow the storage and retrieval of data in various formats,
including structured, semi-structured, and unstructured data.
o They offer versatility for diverse data types and use cases.
7. Data Lakes:
o Data lakes store large volumes of raw data in its native format until it is needed for processing and analysis.
8. Graph Databases:
o Graph databases are designed for managing and querying graph-structured data,
making them suitable for applications like social networks, recommendation systems,
and fraud detection.
9. Search Engines:
o Cloud-based search engines provide efficient indexing and searching capabilities for
full-text search and faceted search applications.
Cloud data stores can be deployed on various cloud platforms, including Amazon Web
Services (AWS), Google Cloud Platform (GCP), Microsoft Azure, IBM Cloud, and others.
Organizations can choose the most appropriate type of cloud data store based on their specific
data management needs, performance requirements, and scalability demands, thereby
leveraging the benefits of cloud computing for their data storage and processing needs.
Here are some key characteristics and components of distributed data stores:
▪ Distribution and Replication: Data is distributed across multiple nodes to achieve load
balancing, scalability, and fault tolerance. In some cases, data is replicated across
multiple nodes to ensure redundancy and data availability, even in the event of node
failures.
▪ Partitioning: Data partitioning involves dividing data into smaller subsets or shards, which are distributed across nodes. Partitioning can be based on various criteria, such as a key range or hash value, and helps distribute data evenly and efficiently (a small hash-partitioning sketch follows this list).
▪ CAP Theorem: The CAP theorem, proposed by computer scientist Eric Brewer, describes
the trade-offs among three properties in distributed systems: Consistency, Availability,
and Partition tolerance. Distributed data stores must make trade-offs based on the CAP
theorem to meet specific application requirements.
▪ High Availability: Distributed data stores are designed to provide high availability,
ensuring that data remains accessible even when individual nodes or components fail.
This is achieved through redundancy and failover mechanisms.
▪ Scalability: Scalability is a crucial feature of distributed data stores. They can scale
horizontally by adding new nodes to accommodate growing data and user loads. This
allows for linear scaling and improved performance.
▪ Query and Access Patterns: Distributed data stores support various query and access
patterns, depending on their data model. Common data models include key-value,
document, column-family, and graph databases, each optimized for specific use cases
and query types.
▪ Consensus Protocols: Distributed data stores often employ consensus protocols like
Paxos or Raft to ensure that nodes agree on the state of the system, especially when
electing leaders or handling distributed transactions.
▪ Data Integrity and Security: Data integrity and security mechanisms are crucial in
distributed data stores. Encryption, access controls, and auditing are employed to protect
data from unauthorized access and tampering.
▪ Global Distribution: Some distributed data stores support global distribution, allowing
data to be stored across geographically dispersed data centers. This feature is valuable for
reducing latency and ensuring data access in different regions.
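As referenced under the Partitioning point above, here is a minimal Python sketch of hash-based partitioning; the number of shards and the sample keys are made up for illustration.

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical number of shards/nodes

def partition_for(key: str) -> int:
    """Map a key to a shard using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# Distribute some sample keys across the shards.
shards = defaultdict(list)
for key in ["user:1", "user:2", "order:17", "order:18", "cart:9"]:
    shards[partition_for(key)].append(key)

for shard_id, keys in sorted(shards.items()):
    print(shard_id, keys)
```

Note that with simple modulo hashing, changing NUM_PARTITIONS remaps most keys; production systems often use consistent hashing instead to limit data movement when nodes are added or removed.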
Examples of popular distributed data stores include Apache Cassandra, Apache HBase,
Amazon DynamoDB, Google Cloud Spanner, and Microsoft Azure Cosmos DB.
Organizations choose distributed data stores based on their specific needs, considering factors
like data volume, access patterns, consistency requirements, and geographic distribution.
These systems play a critical role in modern applications and services that require high levels
of availability, scalability, and fault tolerance.
Examples of this kind of data storage are Google’s BigTable, Amazon’s Dynamo and Windows Azure Storage.
Some distributed data stores can recover the original file when parts of it are damaged or unavailable by using forward error correction techniques; others download the affected file again from a different mirror.
BigTable
BigTable is a compressed, high-performance, proprietary data storage system built on the Google File System, Chubby Lock Service, SSTable and a small number of other Google technologies.
BigTable was developed in 2004 and is used in a number of Google applications such as web indexing, Google Earth, Google Reader, Google Maps, Google Book Search, MapReduce, Blogger.com, Google Code hosting, Orkut, YouTube and Gmail. The motivations for developing BigTable include scalability and better control over performance.
BigTable maps two arbitrary string values (a row key and a column key) and a timestamp to an associated arbitrary byte array. BigTable is designed to scale into the petabyte range across multiple machines; it is easy to add more machines, and the system automatically starts using the available resources without any configuration changes.
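To illustrate the (row key, column key, timestamp) to byte array mapping described above, here is a toy in-memory Python model. It is not the BigTable API; it only mimics the shape of the data model, keeping versions newest-first per cell.

```python
import time
from collections import defaultdict
from typing import Optional

class ToyBigTable:
    """Toy in-memory model: (row key, column key, timestamp) -> bytes."""

    def __init__(self):
        # row -> column -> list of (timestamp, value), newest first
        self._cells = defaultdict(lambda: defaultdict(list))

    def put(self, row: str, column: str, value: bytes) -> None:
        """Store a new timestamped version of a cell."""
        self._cells[row][column].insert(0, (time.time(), value))

    def get(self, row: str, column: str) -> Optional[bytes]:
        """Return the most recent version of a cell, or None if absent."""
        versions = self._cells[row][column]
        return versions[0][1] if versions else None

table = ToyBigTable()
table.put("com.example.www", "contents:html", b"<html>...</html>")
print(table.get("com.example.www", "contents:html"))
```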
Grid computing is a distributed computing model that involves the coordinated use of a large
number of interconnected and geographically dispersed computers to solve complex
computational problems or perform resource-intensive tasks. Unlike traditional computing
systems where a single powerful supercomputer handles tasks, grid computing leverages the
collective processing power and resources of multiple networked computers, making it a
form of distributed computing.
For example, meteorologists use grid computing for weather modeling. Weather modeling is
a computation-intensive problem that requires complex data management and analysis.
Processing massive amounts of weather data on a single computer is slow and time
consuming. That’s why meteorologists run the analysis over geographically dispersed grid
computing infrastructure and combine the results.
Grids can be used for data storage in a distributed and highly efficient manner. Grid-based
data storage systems leverage the collective storage capacity and resources of multiple
interconnected computers across a network.
Grid storage is a critical component of grid computing infrastructure, providing the necessary
storage capabilities to support the distributed and collaborative nature of grid computing
environments. Grid computing involves the coordination of resources from multiple
geographically dispersed locations to solve complex computational problems.
Grid Oriented Storage (GOS) is a dedicated data storage architecture connected directly to a computational grid. It acts as a data bank and reservoir for data, which can be shared among multiple grid clients. GOS is a successor to Network-Attached Storage (NAS) products in the grid computing era and accelerates all kinds of applications in terms of performance and transparency. A GOS system contains multiple hard disks, arranged into logical, redundant storage containers like traditional file servers. GOS deals with long-distance, heterogeneous and single-image file operations. GOS acts as a file server and uses the file-based GOS-FS protocol. Similar to GridFTP, GOS-FS integrates a parallel stream engine and Grid Security Infrastructure (GSI). GOS-FS can be used as an underlying platform to utilize the available bandwidth and accelerate performance in grid-based applications.
3. CLOUD STORAGE
Cloud storage is virtualized storage on demand, also called Data Storage as a Service (DaaS). Cloud storage is data storage hosted remotely on storage devices reachable over the web and maintained by a third party (the service provider). Cloud storage is a part of cloud computing. It is deployed over WAN infrastructure, which includes hardware components such as switches.
Important Points:
Cloud storage came under the limelight because of the following attributes available in cloud
computing: pay-as-you-use, elasticity and simplicity of management. Any provider offering storage as a service should also deliver these attributes to the consumer.
Cloud Storage: Data Management, Provisioning Cloud Storage, Data Intensive Technologies
for Cloud Computing. Cloud Storage from LANs to WANs: Cloud Characteristics,
Distributed Data Storage.
For cloud storage, a standards document has been published by SNIA (Storage Networking Industry Association): the Storage Industry Resource Domain Model (SIRDM). It stresses the importance of simplicity for cloud storage. Figure 1 shows the SIRDM model, which uses the CDMI standards. The SIRDM model adopts three kinds of metadata: user metadata, storage system metadata and data system metadata.
By using these metadata, a cloud storage interface can offer services without adding unnecessary complexity to managing the data. Storage system and data system metadata are used to meet the requirements of the data while maintaining the required simplicity.
User metadata is used by the cloud to find the data objects and containers. Storage system
metadata is used by the cloud to offer basic storage functions like assigning, modifying and
access control. Data system metadata is used by the cloud to offer data as a service based on
user requirements and controls the operation based on that data.
To create, retrieve, update and delete objects in a cloud, the Cloud Data Management Interface (CDMI) is used.
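As a rough illustration of the RESTful style CDMI uses, here is a hedged Python sketch with the requests library. The endpoint, credentials, container and object names are placeholders, and header details (such as the specification version value) vary between providers and CDMI versions, so treat this as a sketch rather than a definitive client.

```python
import requests

# Hypothetical CDMI endpoint and credentials; real values depend on the provider.
BASE_URL = "https://cloud.example.com/cdmi"
AUTH = ("user", "password")

headers = {
    "X-CDMI-Specification-Version": "1.1",      # assumed version header value
    "Content-Type": "application/cdmi-object",
    "Accept": "application/cdmi-object",
}

# Create (or update) a data object by PUTting JSON that carries its value.
body = {"mimetype": "text/plain", "value": "hello cloud"}
resp = requests.put(f"{BASE_URL}/container1/hello.txt",
                    json=body, headers=headers, auth=AUTH)
print(resp.status_code)

# Retrieve the same object.
resp = requests.get(f"{BASE_URL}/container1/hello.txt",
                    headers=headers, auth=AUTH)
print(resp.json().get("value"))

# Delete the object.
requests.delete(f"{BASE_URL}/container1/hello.txt",
                headers={"X-CDMI-Specification-Version": "1.1"}, auth=AUTH)
```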
Security
Secure cloud storage requires a secure transmission channel and methods. Securing data can
be done using encryption, authentication and authorization.
● Encryption is the process of scrambling data so that it cannot be read without special information, called a key, which makes it readable again (a small client-side encryption sketch follows this list).
● Authentication is the process of verifying a user's identity. Authentication can employ passwords, biometrics, identifying tokens and other means.
● Authorization determines access rights on the data and the levels of authorization. To
provide secure cloud storage, access must be restricted for the communication channel, the
data source and the cloud storage sites.
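As referenced in the encryption bullet above, here is a small Python sketch of encrypting data on the client before it is handed to a storage provider. It uses the third-party cryptography package (Fernet symmetric encryption); key handling is deliberately simplified, whereas a real deployment would use a key-management service.

```python
from cryptography.fernet import Fernet

# Generate a symmetric key once and keep it somewhere safe; anyone holding
# this key can decrypt the data.
key = Fernet.generate_key()
cipher = Fernet(key)

plaintext = b"customer records to be stored in the cloud"

# Encrypt before upload so the provider only ever sees ciphertext.
ciphertext = cipher.encrypt(plaintext)

# Decrypt after download.
assert cipher.decrypt(ciphertext) == plaintext
print(len(ciphertext), "bytes of ciphertext")
```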
The four primary methods used to secure network communications are as follows:
1. Transport Layer Security (TLS) and Secure Sockets Layer (SSL)
2. Hypertext Transfer Protocol Secure (HTTPS)
3. Private Networks
4. Virtual Private Networks (VPNs)
Performance
Cloud storage performance can be viewed along two dimensions: speed and latency. Factors that affect cloud storage performance are the available network bandwidth, the type of systems at the provider's end, and the methods adopted for compression and caching.
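To make the bandwidth and latency factors concrete, a back-of-the-envelope estimate of transfer time is the network latency plus the object size divided by the available bandwidth. The numbers in the sketch below are illustrative, not measurements.

```python
def transfer_time_seconds(size_mb: float, bandwidth_mbps: float,
                          latency_ms: float) -> float:
    """Rough estimate: latency plus time to push the bits over the link.

    size_mb        - object size in megabytes
    bandwidth_mbps - available bandwidth in megabits per second
    latency_ms     - network round-trip latency in milliseconds
    """
    size_megabits = size_mb * 8
    return latency_ms / 1000 + size_megabits / bandwidth_mbps

# Example: a 100 MB object over a 50 Mbit/s link with 80 ms latency takes
# roughly 16 seconds; bandwidth, not latency, dominates for large objects.
print(round(transfer_time_seconds(100, 50, 80), 1))
```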
Provisioning cloud storage refers to the process of allocating and configuring storage
resources in a cloud computing environment to meet the needs of users and applications.
It involves setting up, managing, and optimizing storage services in the cloud to ensure data
availability, scalability, and performance.
1. Resource Allocation: Provisioning involves allocating storage capacity, whether it's in the
form of block storage, object storage, or file storage, based on the requirements of the
application or user. It ensures that the right amount of storage is available when needed.
2. Service Selection: Cloud providers offer various storage services, each designed for
specific use cases. Provisioning includes selecting the appropriate storage service based on
factors like data access patterns, performance needs, and cost considerations.
3. Configuration: Once the storage service is chosen, it needs to be configured properly. This
includes setting access controls, defining storage policies, and configuring backup and data
retention settings.
4. Scalability: One of the advantages of cloud storage is its scalability. Provisioning should
take into account the potential growth in data volume and ensure that the storage resources
can scale up or down dynamically as needed.
5. Data Migration: In some cases, data may need to be migrated from on-premises storage to
the cloud or between different cloud providers. Provisioning includes planning and executing
data migration strategies.
6. Data Protection: Ensuring data integrity and security is crucial. Provisioning involves
implementing data protection measures such as encryption, redundancy, and backup
solutions.
9. Monitoring and Management: After provisioning, ongoing monitoring and management are
essential. Cloud management tools and services help track storage usage, performance
metrics, and compliance with service-level agreements (SLAs).
10. User and Application Access: Provisioning includes granting appropriate access to users
and applications while ensuring that data remains secure and compliant with regulatory
requirements.
11. Compliance and Governance: Depending on the industry and region, there may be
specific compliance requirements for data storage and management. Provisioning should
adhere to these regulations.
Cloud means sharing third-party resources via the Internet. This sharing can be done on a need basis, and there is no need to invest in any infrastructure at the consumer's end. Storage clouds increase the efficiency of storing data in remote places by sharing the storage devices provided by the service providers. Storage capacity can be increased on a need basis, which can be achieved using multi-tenancy methods.
Private storage clouds reside behind an organization's firewall; they are deployed for in-house customers and are designed to provide the elasticity and simplicity of the cloud model. By adopting the Cloud Data Management Interface (CDMI) standard, service providers can implement metering of consumers' storage and data usage. This interface also helps providers bill IT organizations based on their usage. The advantage of this interface is that IT organizations need not write or use different adapters for each service provider; by using this one interface, they can connect with different service providers.
1. Data-Intensive:
Data-intensive workloads are those in which the primary focus is on processing and
analysing large volumes of data.
These workloads involve handling massive datasets, such as big data analytics, data
mining, data warehousing, and data-driven scientific research.
Data-intensive tasks often require high-speed data processing, storage, retrieval, and
analysis to extract valuable insights from the data.
Examples include running MapReduce jobs in Hadoop, processing real-time streams of
data, or conducting complex database queries on vast datasets.
2. Compute-Intensive:
Compute-intensive workloads are those in which the primary demand is on processing power rather than on moving or storing data; examples include scientific simulations, numerical modelling, rendering and cryptographic computations.
In practical terms, cloud service providers offer specialized instances or virtual machine types
tailored for either data-intensive or compute-intensive workloads to optimize performance
and cost efficiency. Users can choose the most suitable instance type based on their specific
workload requirements.
Processing Approach
System Architecture
For data-intensive computing, an array of system architectures has been implemented. A number of solutions have emerged; one of them is the MapReduce concept, which was developed by Google and is available as an open-source implementation known as Hadoop. Apart from this, a proprietary system architecture for data-intensive computing has been developed by LexisNexis Risk Solutions, called HPCC (High-Performance Computing Cluster).
1. MapReduce:
The MapReduce architecture and programming model, pioneered by Google, is an example of an architecture for data-intensive computing.
MapReduce uses a functional programming style: a user-defined map function is applied to key-value pairs derived from the input data to produce a set of intermediate key-value pairs, and a reduce function merges all intermediate values that share the same intermediate key. The system takes care of details such as partitioning the input data, scheduling and execution automatically, so programmers with no experience in parallel programming can use a large distributed processing environment without difficulty. (A minimal word-count sketch follows the phase breakdown below.)
1. Map Phase:
• Input Splitting: The input data is divided into smaller chunks called input
splits. Each split is typically a block of data stored in a distributed file system
like HDFS (Hadoop Distributed File System).
• Mapping: In this phase, each input split is processed by multiple mapper tasks
running on different nodes of a cluster. Mappers apply a user-defined function
(the "map" function) to each record in their input split, generating a set of
intermediate key-value pairs.
• Shuffling and Sorting: The generated key-value pairs from all mappers are
sorted by key. This sorting is crucial because it groups together all key-value
pairs with the same key, making it easier for the reduce phase to process them.
2. Shuffle and Sort:
• Partitioning: The sorted key-value pairs are partitioned into smaller sets
based on keys. Each partition is sent to a reducer.
• Shuffling: The partitions are transferred across the network to the nodes
where reducers are running. This process is known as shuffling.
• Sorting: Within each partition, the key-value pairs are sorted by key again.
This sorting ensures that the reducer receives data in a sorted order.
3. Reduce Phase:
• Reduce Function: Each reducer task receives one partition of the data
generated by the mappers. It applies a user-defined function (the "reduce"
function) to process the key-value pairs within the partition. The reducer's
output is typically a set of aggregated or processed results.
• Final Output: The output from all reducer tasks is collected and merged to
produce the final result of the MapReduce job.
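To connect the phases above to code, here is a minimal, single-process Python simulation of the classic word-count example. A real job would run across a cluster (for instance via Hadoop Streaming), but the map, shuffle/sort and reduce steps below mirror the phases just described.

```python
from itertools import groupby

documents = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog",
]

def map_phase(records):
    """Map: emit (word, 1) for every word in every input record."""
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle_sort(pairs):
    """Shuffle and sort: group intermediate pairs by key."""
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, [count for _, count in group]

def reduce_phase(grouped):
    """Reduce: sum the counts for each key."""
    for key, counts in grouped:
        yield key, sum(counts)

result = dict(reduce_phase(shuffle_sort(map_phase(documents))))
print(result)   # {'brown': 1, 'dog': 2, 'fox': 1, 'lazy': 1, 'quick': 2, 'the': 3}
```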
LexisNexis Risk Solutions independently developed and implemented a solution for data-
intensive computing called the HPCC (High-Performance Computing Cluster).
The LexisNexis approach builds clusters from commodity hardware running the Linux operating system. Custom system software and middleware components were created and layered on top of Linux to provide the execution environment and the distributed file system support that are essential for data-intensive computing. LexisNexis also implemented a new high-level language for data-intensive computing called ECL.
In summary, MapReduce and HPCC are both distributed computing frameworks used for big
data processing in cloud computing environments. MapReduce is known for its simplicity
and is often associated with Hadoop, while HPCC provides a more versatile set of tools for
data-intensive tasks. The choice between them depends on the specific requirements of the
data processing workload.
Cloud Storage from LANs to WANs: Cloud Characteristics, Distributed Data Storage.
The topic "Cloud Storage from LANs to WANs" suggests a discussion or exploration of the
evolution and expansion of cloud storage technology and its use in both Local Area Networks
(LANs) and Wide Area Networks (WANs). Here's a breakdown of the meaning:
1. Cloud Storage: This refers to the practice of storing data and files on remote servers that
are accessed over the internet, as opposed to traditional local storage on physical devices like
hard drives. Cloud storage services allow users to store, manage, and retrieve data from
anywhere with an internet connection.
2. LANs (Local Area Networks): LANs are networks that connect devices within a limited geographic area, such as a home, office, or campus. LANs are typically used for local communication and data sharing within a specific location.
3. WANs (Wide Area Networks): WANs connect devices and networks across large geographic areas, typically linking multiple LANs together; the internet is the largest example.
The phrase "from LANs to WANs" implies a discussion that covers the use of cloud storage
not only within local networks (LANs) but also in scenarios where data needs to be accessed
and shared across wide area networks (WANs). It may explore topics such as data
synchronization, data access over the internet, data security, and the challenges and benefits
of using cloud storage in various network environments.
Cloud Characteristics:
There are three characteristics of a cloud computing environment that are most pertinent to consider before choosing storage in the cloud.
2. Data is retained at an unknown host server. In general, letting go of data raises many security issues, so suitable precautions should be taken. The very name 'cloud computing' implies that the computing and storage resources are operated from some remote, unseen location, but in practice the data is physically stored in a specific host country and is subject to local laws and regulations. Since most cloud computing vendors give their clientele little control over where data is stored, the clientele has no alternative but to assume that, unless the data is encrypted with a key unavailable to the host, it may be accessed by a third party without the customer's knowledge.
3. Data is often duplicated over distant locations. Data accessibility and durability are paramount for cloud storage providers, as data tampering can be damaging for both the business and the organization's reputation. Data accessibility and durability are normally accomplished through hidden replication. Large cloud computing providers with data hubs dispersed throughout the world have the capability to provide high levels of fault tolerance by duplicating data at distant locations across continents. Amazon's S3 cloud storage service replicates data over 'regions' and 'availability zones' so that data and applications can survive even when a whole location fails.
Additional Characteristics:
Cloud computing is characterized by several key features and attributes that distinguish it
from traditional computing models. These characteristics define the foundation of cloud
technology and its advantages. Here are the main cloud characteristics:
2. Broad Network Access: Cloud services are accessible over the internet or through a
network. Users can access cloud resources using a variety of devices, including laptops,
smartphones, and tablets.
3. Resource Pooling: Cloud providers use multi-tenant models to pool computing resources
(e.g., servers, storage, and networking) to serve multiple customers. Resources are
dynamically allocated and reassigned based on demand.
5. Measured Service: Cloud resources are metered, and users are billed based on their actual
usage. This pay-as-you-go model provides cost efficiency and flexibility.
6. Self-Service Catalog: Cloud providers often offer a catalog of available services and
resources. Users can select and configure these services according to their requirements.
9. High Availability and Redundancy: Cloud providers typically offer high availability and
redundancy to ensure that services remain accessible even in the event of hardware failures or
disruptions.
10. Security and Compliance: Cloud providers implement robust security measures to protect
data and resources. They often comply with industry-specific regulations and standards.
11. Scalability: Cloud services can easily scale horizontally or vertically to accommodate
growing workloads or user demands.
12. Global Reach: Cloud providers operate data centers in various regions around the world,
allowing users to deploy applications and services in geographically distributed locations.
13. Service Models: Cloud computing offers various service models, including Infrastructure
as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS), catering
to different levels of abstraction and management.
14. Deployment Models: Cloud services can be deployed in different models, including public cloud (accessible to the general public), private cloud (used exclusively by a single organization), and hybrid cloud (a combination of the two).
Distributed Data Storage:
Distributed data storage, often referred to as distributed storage or distributed file storage, is a
method of storing data across multiple physical or virtual locations, often on different servers
or nodes within a network or cloud infrastructure. This approach to data storage offers several
advantages and is commonly used in modern computing environments.
1. Redundancy: Distributed storage systems replicate data across multiple nodes or servers.
This redundancy ensures data availability and fault tolerance. If one node fails, data can be
retrieved from another.
2. Scalability: Distributed storage can scale horizontally by adding more nodes or servers to
the storage cluster. This allows organizations to accommodate growing data volumes and
increased demand for storage capacity.
3. Load Balancing: Data can be distributed evenly across nodes, helping to balance the load
on the storage infrastructure. This prevents any single node from becoming a bottleneck.
4. Data Accessibility: Distributed storage provides high availability and accessibility to data.
Users and applications can access data from any node in the distributed storage system,
enhancing data retrieval performance.
5. Data Security: Data can be encrypted and secured within a distributed storage system.
Access controls and authentication mechanisms can be implemented to protect sensitive
information.
7. Latency Reduction: By distributing data closer to where it's needed, distributed storage can
reduce data access latency. This is particularly beneficial for applications that require low-
latency access to data.
8. Disaster Recovery: Distributed storage can aid in disaster recovery efforts. Data
redundancy and geographical distribution of nodes can protect against data loss in the event
of hardware failures or disasters.
10. Cloud-Based Solutions: Many cloud providers offer distributed storage services, allowing
organizations to leverage the benefits of distributed storage without managing their
infrastructure.
Common technologies used for distributed data storage include the following:
Distributed File Systems: Examples include Hadoop Distributed File System (HDFS) and GlusterFS, which distribute files across multiple servers or nodes.
Object Storage: Services like Amazon S3 and Azure Blob Storage offer distributed object
storage, where data is stored as objects that can be accessed via unique URLs.
NoSQL Databases: NoSQL databases like Apache Cassandra and MongoDB are designed for
distributed data storage and retrieval.
Content Delivery Networks (CDNs): CDNs distribute web content to multiple servers
located in different regions to reduce latency and improve content delivery speed.
Distributed data storage plays a crucial role in modern computing, enabling organizations to
effectively manage and safeguard their ever-growing volumes of data while ensuring high
availability and accessibility. It has become a fundamental component of cloud computing,
big data analytics, and scalable web applications.
Amazon Dynamo, CouchDB, and ThruDB are all distributed NoSQL databases, each with its
own set of features and characteristics. Let's briefly explore each of them:
1. Amazon DynamoDB:
Type: Managed NoSQL database service.
Managed By: Amazon Web Services (AWS).
Key Features:
Fully Managed: DynamoDB is a fully managed database service, which means AWS
handles tasks like infrastructure provisioning, patching, and scaling.
NoSQL: It is a NoSQL database, suitable for flexible and schemaless data models.
Highly Scalable: DynamoDB can automatically scale both read and write capacity to
handle varying workloads.
Low Latency: Designed for low-latency performance, making it suitable for applications
that require quick data retrieval.
Strong Consistency: Offers strong consistency options for read and write operations.
Security: Provides robust security features like encryption, access control, and
authentication.
Global Tables: Allows you to replicate data across multiple AWS regions for high availability and disaster recovery. (A brief usage sketch in code follows this list.)
3. ThruDB:
Type: Open-source distributed NoSQL database.
Key Features:
Column-oriented: ThruDB is designed as a column-family store, suitable for handling
large volumes of data.
Distributed and Fault Tolerant: Supports distribution across multiple nodes and includes
fault tolerance mechanisms.
Schemaless: Allows for flexible data modeling and storage.
Thrift Interface: Provides a Thrift-based API for data access and manipulation.
MapReduce: Supports MapReduce for querying and processing large datasets.
Notable for Time-Series Data: ThruDB is sometimes used for time-series data applications due to its distributed nature and column-oriented storage.
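As referenced in the DynamoDB section above, here is a hedged boto3 sketch of writing and reading an item. The table name and attribute names are placeholders; it assumes a table named "Users" with a string partition key "user_id" already exists and that AWS credentials are configured.

```python
import boto3

# Assumes credentials are configured and a table "Users" with partition key
# "user_id" (string) already exists; all names here are hypothetical.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Users")

# Write an item.
table.put_item(Item={"user_id": "u-123", "name": "Asha", "plan": "pro"})

# Read it back, requesting strong consistency.
response = table.get_item(Key={"user_id": "u-123"}, ConsistentRead=True)
print(response.get("Item"))
```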
When choosing between these databases, it's essential to consider your specific use case and
requirements. DynamoDB, being a managed service, is convenient for AWS-based
applications, while CouchDB and ThruDB offer more flexibility and control over your
database infrastructure. The choice also depends on factors like data modeling, scalability
needs, and the level of administrative control you require.
Additional Learning:
Big data:
Big data refers to extremely large and complex datasets that cannot be easily managed,
processed, or analyzed using traditional data processing tools and methods. These datasets are
characterized by the "three Vs": volume, velocity, and variety, and they often include other
attributes like veracity and value. Let's break down these characteristics:
1. Volume: Big data involves massive amounts of data. This can range from terabytes to petabytes and beyond. The sheer quantity of data is one of the defining features of big data.
2. Velocity: Big data is generated and must be captured and processed at high speed, often in real time or near-real time, as with streaming sensor data or social media feeds.
3. Variety: Big data comes in diverse formats. It includes structured data (like traditional
databases), semi-structured data (like XML or JSON), and unstructured data (like text
documents, social media posts, images, and videos). Dealing with this variety of data types is
a challenge.
4. Veracity: Veracity refers to the reliability and quality of the data. Big data can be messy,
with inaccuracies, inconsistencies, and missing values. Ensuring data quality is a significant
concern.
5. Value: The ultimate goal of working with big data is to extract valuable insights, make
informed decisions, and gain a competitive advantage. The value of big data comes from its
potential to provide actionable information.
To effectively manage and harness the potential of big data, organizations use various
technologies and approaches, including:
- Distributed Computing: Tools like Hadoop and Apache Spark distribute data processing
tasks across clusters of computers, enabling parallel processing of large datasets.
- Data Storage: NoSQL databases, columnar databases, and distributed file systems are used
to store and retrieve big data efficiently.
- Data Warehousing: Data warehouses are designed to store and manage large datasets for
business intelligence and analytics purposes.
- Machine Learning and Data Analytics: Advanced analytics and machine learning algorithms
are applied to extract meaningful patterns, correlations, and insights from big data.
- Data Visualization: Tools for data visualization help make complex data understandable and
actionable by presenting it in graphical forms.
- Cloud Computing: Cloud platforms offer scalable and flexible infrastructure for storing and
processing big data, reducing the need for large on-premises data centers.
Industries such as finance, healthcare, e-commerce, social media, and scientific research
benefit from big data analytics by improving decision-making, customer experiences, fraud
detection, and more. However, working with big data also raises privacy and security
concerns, which require careful consideration and management.
In summary, big data refers to the massive and complex datasets generated at high speeds in
various formats. Effectively harnessing the potential of big data requires specialized tools,
technologies, and approaches to turn this data into valuable insights and actionable
information.
Experiential Learning: