Week 4 – Azure/AWS Storage

The document discusses various data storage architectures, specifically focusing on Lambda and Kappa architectures for data processing. It outlines the differences between these architectures, including their complexity, processing models, and use cases. Additionally, it covers AWS and Azure storage options, emphasizing the importance of selecting appropriate storage types based on data characteristics and organizational needs.


BUAN6335

Organizing for Business Analytics Platforms


Week 4

Storage Scenarios
Prof. Mandar Samant

Unless otherwise stated, this presentation refers to study material from Microsoft Azure Learn, AWS documentation, and Snowflake Academic Courses.
• Data pipeline and Data Platform Services scenario example

Quick Recap: Data Lakehouse

Source: https://www.databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html
Modern Data Analytics Architecture – Conceptual

Organization

Source: Big data architectures - Azure Architecture Center | Microsoft Learn


What are the Lambda and Kappa Data Architecture Patterns?

• In traditional big data processing, one drawback is that it introduces latency — if processing takes a few hours, a query may return results that are several hours old.
• The lambda architecture, first proposed by Nathan Marz,
addresses this problem by creating two paths for data flow. All
data coming into the system goes through these two paths:
• A batch layer (cold path) stores all of the incoming data
in its raw form and performs batch processing on the
data. The result of this processing is stored as a batch
view.
• A speed layer (hot path) analyzes data in real-time. This
layer is designed for low latency at the expense of
accuracy.
• The batch layer feeds into a serving layer that indexes the
batch view for efficient querying. The speed layer updates the
serving layer with incremental updates based on the most
recent data.
• Eventually, the hot and cold paths converge at the analytics
client application. If the client needs to display timely yet
potentially less accurate data in real-time, it will acquire its
result from the hot path. Otherwise, it will select results from
the cold path to display less timely but more accurate data.

Lambda Architecture

Source: Big data architectures - Azure Architecture Center | Microsoft Learn
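The hot/cold serving decision described above can be sketched in a few lines of plain Python. The metric name and counts are hypothetical; this is an illustration of the client's path selection, not any particular framework:

```python
# Hypothetical metric counts; a sketch of a lambda-architecture client.
batch_view = {"page_views": 10_250}   # cold path: accurate, hours old
realtime_view = {"page_views": 37}    # hot path: recent increments, approximate

def query(metric: str, need_fresh: bool) -> int:
    """Serve from the hot path when freshness matters, else the cold path."""
    if need_fresh:
        # Hot path: last batch view plus recent, possibly less accurate, increments.
        return batch_view[metric] + realtime_view[metric]
    # Cold path: accurate but only as fresh as the last batch run.
    return batch_view[metric]

print(query("page_views", need_fresh=True))   # 10287
print(query("page_views", need_fresh=False))  # 10250
```

The batch view is periodically recomputed from all raw data; at that point the speed layer's increments for the covered window are superseded.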


Kappa Architecture

• A drawback to the lambda architecture is its complexity.


o Processing logic appears in two different places —
the cold and hot paths — using different
frameworks.
o This leads to duplicate computation logic and the
complexity of managing the architecture for both
paths.

• Jay Kreps proposed kappa architecture as a streamlined alternative to lambda architecture. It shares the same
core goals as lambda architecture, but with a significant
difference: All data is channeled through a single path
using a stream processing system, eliminating the need
for managing separate paths and frameworks.

• The data is ingested as a stream of events into a distributed, fault-tolerant unified log. These events are
ordered, and the current state of an event is changed
only by a new event being appended. Similar to a lambda
architecture's speed layer, all event processing is
performed on the input stream and persisted as a real-
time view.
• If you need to recompute the entire data set (equivalent
to what the batch layer does in lambda), you simply
replay the stream, typically using parallelism to complete
the computation in a timely fashion.
Source: Big data architectures - Azure Architecture Center | Microsoft Learn
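The replay idea can be sketched as follows. The events are hypothetical and a plain list stands in for the Kafka-style unified log; the point is that one processing function serves both the live view and any full recompute:

```python
# Hypothetical append-only event log (standing in for a unified log).
events = [
    {"user": "a", "delta": +5},
    {"user": "b", "delta": +3},
    {"user": "a", "delta": -2},
]

def process(stream):
    """The single processing path: fold events into the current state."""
    state = {}
    for ev in stream:
        state[ev["user"]] = state.get(ev["user"], 0) + ev["delta"]
    return state

live_view = process(events)   # real-time view, built as events arrive
recomputed = process(events)  # "batch" recompute = replay the same log
assert live_view == recomputed == {"a": 3, "b": 3}
```

In practice the replay would be parallelized across log partitions to finish in a timely fashion, but the processing logic stays the same.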
Quick Comparison: Lambda vs. Kappa

• Processing Model – Lambda: Combines batch and real-time processing. Kappa: Focuses solely on stream (real-time) processing.
• Layers – Lambda: Batch layer, Speed layer, Serving layer. Kappa: Single pipeline for both real-time and historical data.
• Complexity – Lambda: Higher due to managing separate batch and speed layers. Kappa: Simpler, as it uses only one stream processing layer.
• Fault Tolerance – Lambda: Fault-tolerant, as batch processing ensures data accuracy. Kappa: Fault-tolerant with real-time processing but depends on stream integrity.
• Use Case – Lambda: Suitable for both real-time and batch processing needs. Kappa: Best for real-time processing; batch processing is less emphasized.
• Data Reprocessing – Lambda: Batch layer allows accurate reprocessing of historical data. Kappa: Reprocessing is done by replaying the stream in real-time.
• Latency – Lambda: Higher latency in batch processing; low latency in real-time. Kappa: Low latency for all data due to stream processing.
• Accuracy – Lambda: Batch layer provides high accuracy; speed layer offers immediate but less accurate results. Kappa: Provides consistent results, but may not match the accuracy of dedicated batch processing.

Source: Lambda Architecture vs. Kappa Architecture in System Design - GeeksforGeeks


AWS services to manage data movement and governance

Key design considerations:
• Seamless data movement – AWS Glue
• Unified governance – AWS Lake Formation

Source: AWS Documentation


End-to-End Azure Synapse Architecture – Data Analytics Pipelines: Deeper view

Source: Analytics end-to-end with Azure Synapse - Azure Architecture Center | Microsoft Learn
Matching ingestion services to variety, volume, and velocity

Ingest
• SaaS apps → Azure Data Factory
• OLTP, ERP, CRM business applications → Azure Synapse Analytics Pipelines
• File shares → Azure File Sync
• Web, devices, IoT sensors, social media → Azure Event Hubs / Azure IoT Hub
• On-premises data with limited connectivity → Azure Data Box
Store and Manage Enterprise Data
Volume, Velocity, Veracity, and Variety (Hint: Data lake)

Services: Blob Storage Access Tiers, Azure Data Share, Microsoft Purview, Azure Synapse Analytics, Azure Data Lake Storage Gen 2

Modern data architecture storage layer

Storage layer – Catalog: Microsoft Purview, Azure Data Share
Storage layer – Storage: Azure Synapse Analytics (built-in integration), Azure Blob Storage – Data Lake
Storage for variety, volume, and velocity

Storage layer – Storage:
• Structured data is loaded into classic DWH schemas (Azure Synapse Analytics). Use case: BI dashboards.
• Semistructured data is loaded into staging tables.
• Unstructured, semistructured, and structured data is stored as objects (Azure Data Lake Storage Gen 2). Use case: Big data AI/ML.

Data Lake Zones/Layers for data in different states


• Raw layer or data lake one: Think of the raw layer as a reservoir that stores data in its natural and original state. It's unfiltered and unpurified.
• You might store the data in its original format, such as JSON or CSV. Or it might be cost-effective to store the file contents as a column in a compressed file format, like Avro, Parquet, or Databricks Delta Lake.
• This raw data is immutable. Keep your raw data locked down, and if you give permissions to any consumers, automated or human, ensure that they're read-only.

Source: Data lake zones and containers - Cloud Adoption Framework | Microsoft Learn
Data Lake Zones/Layers, continued

Enriched layer or data lake two
• Think of the enriched layer as a filtration layer. It removes impurities and can also involve enrichment.
• Your standardization container holds systems of record and masters. Folders are segmented first by subject area, then by entity. Data is available in merged, partitioned tables that are optimized for analytics consumption.

Source: Data lake zones and containers - Cloud Adoption Framework | Microsoft Learn
Data Lake Zones/Layers, continued

Curated layer or data lake two
• Your curated layer is your consumption layer. It's optimized for analytics rather than data ingestion or processing. The curated layer might store data in denormalized data marts or star schemas.
• Data from your standardized container is transformed into high-value data products that are served to your data consumers. This data has structure. It can be served to the consumers as-is, such as data science notebooks, or through another read data store, such as Azure SQL Database.

Source: Data lake zones and containers - Cloud Adoption Framework | Microsoft Learn
Data Lake Zones/Layers, continued

Development layer or data lake three
• Your data consumers can bring other useful data products along with the data ingested into your standardized container.

Source: Data lake zones and containers - Cloud Adoption Framework | Microsoft Learn
Extracting Insights from the Data
– Ad hoc, structured, irrespective of data load

Services: Azure Synapse Analytics, Azure Databricks, HDInsight, Data Lake Analytics, Azure Data Explorer, Azure Stream Analytics, Elastic Jobs on Azure, Microsoft Purview, Azure Data Factory
Visualization and machine learning services for advanced analytics and model predictions

Services: Power Platform, Power BI, Machine Learning

AWS: Sample use-cases for AWS data and analytics services
AWS Data Analytics Pipeline Services – Representative view

• Collection: Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Managed Streaming for Kafka, Amazon AppFlow, AWS DataSync, AWS Snowball, AWS Direct Connect; automation via AWS Database Migration Service
• Storage: Amazon S3, Amazon S3 Glacier, Amazon DynamoDB, Amazon RDS, Amazon ES, Amazon Aurora
• Process for analysis: Amazon Redshift, Amazon Athena, Amazon Kinesis Data Analytics, Amazon EMR, AWS Glue
• Visualization: Amazon QuickSight, Amazon SageMaker, Amazon CloudSearch
Break (10 Min)

Storage Introduction

• Data Architecture Pattern Tier: Data Lake, Data Lakehouse, Data Warehouse, Cloud Data Warehouse
• Storage System Tier: File System/RDBMS, Block Storage, Object Storage, Cache/In-Memory Storage
• Raw Tier: SSD, HDD, RAM
• Cross-cutting: Networking, Compression, Serialization
Types of Storage

File Storage
• Data is saved together in a single file with a file extension type such as .jpg, .docx or .pdf.
• Files may also be stored on a network-attached storage (NAS) device. These devices are specific to file storage, making it a faster option than general network servers.
• File storage uses a hierarchical structure where files are organized by the user in folders and subfolders, which makes it easier to find and manage files (path or directory structure, etc.).
• Use cases: File archival, file storage.

Block Storage
• Data is split into fixed blocks of data and then stored separately with unique identifiers. The blocks can be stored in different environments.
• Block storage is the default storage for both hard disk drives and frequently updated data. You can store blocks on Storage Area Networks (SANs) or in cloud storage environments.
• Many organizations are transitioning away from block storage because of its limited scale and lack of metadata.
• Use cases: Databases, email servers.

Object Storage
• Object storage is a system that divides data into separate, self-contained units that are stored in a flat environment, with all objects at the same level. There are no folders or sub-directories like those used with file storage.
• Objects also contain metadata, which is information about the file that helps with processing and usability.
• Unlike with file storage, you must use an Application Programming Interface (API) to access and manage objects.
• Use cases: IoT data, backup and recovery, archival, video surveillance.

Source:
https://phoenixnap.com/blog/object-storage-vs-block-storage
https://www.ibm.com/cloud/blog/object-vs-file-vs-block-storage
Object vs. Block Storage

Point of Comparison: Object storage vs. Block storage

• Data storage – Object: Unique, identifiable, and distinct units called objects store data in a flat-file system. Block: Fixed-sized blocks store portions of the data in a hierarchical system and reassemble when needed.
• Metadata – Object: Unlimited, customizable contextual information. Block: Limited, basic information.
• Cost – Object: More cost-effective. Block: More expensive.
• Scalability – Object: Unlimited scalability. Block: Limited scalability.
• Performance – Object: Suitable for high volumes of unstructured data; performs best with large files. Block: Best for transactional data and database storage; performs best with small files.
• Location – Object: A centralized or geographically dispersed system that stores data on-premises or in private, hybrid, or public cloud. Block: A centralized system that stores data on-premises or in private cloud; latency may become an issue if the application and the storage are geographically far apart.

Source:
https://phoenixnap.com/blog/object-storage-vs-block-storage
https://www.ibm.com/cloud/blog/object-vs-file-vs-block-storage
Storage Selection Aspects

• Cost: Because the costs involved with block and file storage are higher, many organizations choose object
storage for high volumes of data.

• Management ease: The metadata and searchability make object storage a top choice for high volumes
of data. File storage, with its hierarchical organization system, is more appropriate for lower volumes of
data.

• Volume: Organizations with high volumes of data often choose object or block storage.

• Retrievability: Data is relatively retrievable from all three types of storage, though file and object storage
are typically easier to access.

• Handling of metadata: Although file storage contains very basic metadata, information with extensive
metadata is typically best served by object storage.

• Data protection: While the data is stored, it's essential that the data is protected from breaches and
cybersecurity threats.

• Storage use cases: Each type of storage is most effective for different use cases and workflows. By
understanding their specific needs, organizations can select the type that fits the majority of their storage
use cases.

Source:
https://phoenixnap.com/blog/object-storage-vs-block-storage
https://www.ibm.com/cloud/blog/object-vs-file-vs-block-storage
Azure Storage Options

Azure virtual machines (VMs) are one of several types of on-demand, scalable computing resources that Azure offers. Typically, you choose a virtual machine when you need more control over the computing environment than the other choices offer.

Introduction to Azure Storage - Cloud storage on Azure | Microsoft Learn


Azure General Purpose Storage Account

Azure Storage Architecture



Azure Storage Options

Azure Files
• Quick description: Offers fully managed cloud file shares that you can access from anywhere. You can mount Azure file shares from cloud or on-premises deployments of Windows, Linux, and macOS.
• Use case: You want to "lift and shift" an application to the cloud that already uses the native file system APIs to share data, or replace or supplement on-premises file servers or NAS devices.

Azure Blob Storage
• Quick description: Allows unstructured data to be stored and accessed at a massive scale in block blobs. Also supports Azure Data Lake Storage for enterprise big data analytics solutions.
• Use case: You want your application to support streaming and random access scenarios, access application data from anywhere, or build an enterprise data lake on Azure and perform big data analytics.

Azure Disks
• Quick description: Allows data to be persistently stored and accessed from an attached virtual hard disk.
• Use case: You want to store data that isn't required to be accessed from outside the virtual machine to which the disk is attached.

Azure Elastic SAN
• Quick description: A fully integrated solution that simplifies deploying, scaling, managing, and configuring a SAN, while also offering built-in cloud capabilities like high availability.
• Use case: You want large scale storage that is interoperable with multiple types of compute resources (such as SQL, MariaDB, Azure virtual machines, and Azure Kubernetes Services) via the iSCSI protocol.

Azure Container Storage
• Quick description: A volume management, deployment, and orchestration service that integrates with Kubernetes and is built natively for containers.
• Use case: You want to dynamically and automatically provision persistent volumes to store data for stateful applications running on Kubernetes clusters.

Azure Queues
• Quick description: Allows for asynchronous message queueing between application components.
• Use case: You want to decouple application components and use asynchronous messaging to communicate between them.

Azure Tables
• Quick description: Allows you to store structured NoSQL data in the cloud, providing a key/attribute store with a schemaless design.
• Use case: You want to store flexible datasets like user data for web applications, address books, device information, or other types of metadata your service requires.
Azure Blob Storage – Foundation for Data Lake
(Object Storage, AWS S3 Equivalent)

• Blob Storage is optimized for storing massive amounts of unstructured data. Unstructured data is data that doesn't adhere to a particular data model or definition, such as text or binary data.
• Blob Storage is designed for:
o Serving images or documents directly to a browser.
o Storing files for distributed access.
o Streaming video and audio.
o Writing to log files.
o Storing data for backup and restore, disaster recovery, and
archiving.
o Storing data for analysis by an on-premises or Azure-hosted
service.
How do Well-Architected pillars work for Blob Storage

• Secured: Authentication with Microsoft Entra ID (formerly Azure Active Directory) and role-based access control (RBAC), plus encryption at rest and advanced threat protection.
• Scalable, durable, and available: Sixteen nines of designed durability with geo-replication and flexibility to scale as needed.
• Optimized for data lakes: File namespace and multi-protocol access support, enabling analytics workloads for data insights.
• Comprehensive data management: End-to-end lifecycle management, policy-based access control, and immutable (WORM) storage.
General Blob Storage concepts: Types of Resources

Blob Storage offers three types of resources:
• The storage account
• A container in the storage account
• A blob in a container

Source: Introduction to Blob (object) Storage - Azure Storage | Microsoft Learn
General Blob Storage concepts: Storage Accounts

A storage account provides a unique namespace in Azure for your data. Every object that you
store in Azure Storage has an address that includes your unique account name. The combination
of the account name and the Blob Storage endpoint forms the base address for the objects in
your storage account.
For example, if your storage account is named buanutsom, then the default endpoint for
Blob Storage is:
https://buanutsom.blob.core.windows.net

• General-purpose v2 (Standard): Standard storage account type for blobs, file shares, queues, and tables. Recommended for most scenarios using Blob Storage or one of the other Azure Storage services.
• Block blob (Premium): Premium storage account type for block blobs and append blobs. Recommended for scenarios with high transaction rates or that use smaller objects or require consistently low storage latency.
• Page blob (Premium): Premium storage account type for page blobs only.

Source: Introduction to Blob (object) Storage - Azure Storage | Microsoft Learn


General Blob Storage concepts: Containers

Containers
A container organizes a set of blobs, similar to a directory in a file system. A storage
account can include an unlimited number of containers, and a container can store an
unlimited number of blobs.
A container name must be a valid DNS name, as it forms part of the unique URI (Uniform
resource identifier) used to address the container or its blobs. Follow these rules when
naming a container:
• Container names can be between 3 and 63 characters long.
• Container names must start with a letter or number, and can contain only lowercase
letters, numbers, and the dash (-) character.
• Two or more consecutive dash characters aren't permitted in container names.
o The URI for a container is similar to:
o https://myaccount.blob.core.windows.net/mycontainer

Source: Introduction to Blob (object) Storage - Azure Storage | Microsoft Learn
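The naming rules and URI scheme above can be checked with a short sketch. The account and container names are hypothetical, and Azure applies a few additional rules beyond the ones listed on this slide:

```python
import re

def is_valid_container_name(name: str) -> bool:
    """Check the naming rules stated above (Azure applies a few more)."""
    if not (3 <= len(name) <= 63):
        return False                      # 3-63 characters
    if not re.fullmatch(r"[a-z0-9][a-z0-9-]*", name):
        return False                      # lowercase letters, numbers, dashes;
                                          # must start with a letter or number
    if "--" in name:
        return False                      # no consecutive dashes
    return True

account = "myaccount"        # hypothetical storage account name
container = "mycontainer"
assert is_valid_container_name(container)
assert not is_valid_container_name("My-Container")  # uppercase not allowed
assert not is_valid_container_name("a--b")          # consecutive dashes
assert not is_valid_container_name("ab")            # too short
print(f"https://{account}.blob.core.windows.net/{container}")
```

This prints the container's base URI, matching the endpoint format shown earlier for storage accounts.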


General Blob Storage concepts: Types of Blobs

Blobs
Azure Storage supports three types of blobs:
• Block blobs store text and binary data. Block blobs are made up of blocks of
data that can be managed individually. Block blobs can store up to about
190.7 TiB.
• Append blobs are made up of blocks like block blobs, but are optimized for
append operations. Append blobs are ideal for scenarios such as logging data
from virtual machines.
• Page blobs store random access files up to 8 TiB in size. Page blobs store
virtual hard drive (VHD) files and serve as disks for Azure virtual machines.

Source: Introduction to Blob (object) Storage - Azure Storage | Microsoft Learn
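The three blob types differ mainly in their write and access patterns. A toy model in plain Python (not the Azure SDK) makes the contrast concrete:

```python
# Toy contrast of the three blob types' access patterns; illustration only.
class BlockBlob:
    """Blocks are staged individually, then committed as one blob."""
    def __init__(self):
        self.staged, self.committed = {}, []
    def stage_block(self, block_id, data):
        self.staged[block_id] = data
    def commit(self, block_ids):
        self.committed = [self.staged[b] for b in block_ids]

class AppendBlob:
    """Optimized for append-only writes, e.g. VM log streams."""
    def __init__(self):
        self.blocks = []
    def append(self, data):
        self.blocks.append(data)

class PageBlob:
    """Random-access reads/writes at fixed offsets, e.g. VHD disks."""
    def __init__(self, size):
        self.pages = bytearray(size)
    def write(self, offset, data):
        self.pages[offset:offset + len(data)] = data

b = BlockBlob()
b.stage_block("b1", b"part1"); b.stage_block("b2", b"part2")
b.commit(["b1", "b2"])
assert b.committed == [b"part1", b"part2"]

a = AppendBlob()
a.append(b"log line 1"); a.append(b"log line 2")

p = PageBlob(16)
p.write(0, b"boot")
assert bytes(p.pages[:4]) == b"boot"
```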


Azure Blob Storage Tiers

• Hot tier: for frequently accessed data that you want to access as soon as possible (OLTP-style workloads).
• Cool tier: for infrequently accessed data that can be stored on cheaper hardware; access is not necessarily fast.
Azure Data Lake Storage Gen 2

Quick recap: What is a data lake?

A data lake is a single, centralized repository where you can store all your
data, both structured and unstructured. A data lake enables your
organization to quickly and more easily store, access, and analyze a wide
variety of data in a single location.
With a data lake, you don't need to conform your data to fit an existing
structure. Instead, you can store your data in its raw or native format, usually
as files or as binary large objects (blobs).

Azure Data Lake Storage is a cloud-based enterprise data lake solution. It's
engineered to store massive amounts of data in any format and to facilitate
big data analytical workloads. You use it to capture data of any type and
ingestion speed in a single location for easy access and analysis using various
frameworks.

Source: Azure Data Lake Storage Introduction - Azure Storage | Microsoft Learn
Understand Azure Data Lake Storage Gen2

Distributed cloud storage for data lakes
• HDFS-compatibility – common file system for Hadoop, Spark, and others
• Flexible security through folder and file level permissions
• Built on Azure Storage:
  – High performance and scalability
  – Data redundancy through built-in replication

© Copyright Microsoft Corporation. All rights reserved.


Azure Data Lake Storage Gen 2 vs Azure Blob Storage

Enable Hierarchical Namespace in a blob container to use Azure Data Lake Storage Gen2

• Azure Blob Storage (flat namespace): Blobs can be organized in virtual directories (for example, blob1 and folder1/blob2 in a container), but each path is considered a single blob in a flat namespace – folder level operations are not supported.
• Azure Data Lake Storage Gen2 (hierarchical namespace): The file system includes directories and files, and is compatible with large scale data analytics systems like Hadoop, Databricks, and Azure Synapse Analytics.



(ADLS) Data Lake Storage Gen 2 and Blob Storage are inseparable

Source: Under the hood: Performance, scale, security for cloud analytics with ADLS Gen2 | Microsoft Azure Blog
Security end-to-end for ADLS

For customers wanting to build a data lake to serve the entire enterprise, security is no
lightweight consideration. There are multiple aspects to providing end-to-end security for
your data lake:
• Authentication – Azure Active Directory OAuth bearer tokens provide industry-standard
authentication mechanisms backed by the same identity service used throughout Azure
and Office365.
• Access control – A combination of Azure Role Based Access Control (RBAC) and POSIX-
compliant Access Control Lists (ACLs) to provide flexible and scalable access control.
Significantly, the POSIX ACLs are the same mechanism used within Hadoop.
• Encryption at rest and transit – Data stored in ADLS is encrypted using either a system-
supplied or customer-managed key. Additionally, data is encrypted using TLS 1.2 whilst in
transit.
• Network transport security – Given that ADLS exposes endpoints on the public Internet,
transport-level protections are provided via Storage Firewalls that securely restrict where
the data may be accessed from, enforced at the packet level.

Source: Under the hood: Performance, scale, security for cloud analytics with ADLS Gen2 | Microsoft Azure Blog
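A toy of how coarse RBAC and per-file ACLs can combine, in the spirit of the access-control bullet above. The role names and ACL entries are hypothetical, and real ADLS evaluation involves more rules (inheritance, default ACLs, and so on):

```python
# Hypothetical users, roles, and POSIX-style ACL entries; illustration only.
rbac = {"alice": "Storage Blob Data Reader"}           # account-level role
acls = {"/lake/raw/file1.csv": {"alice": "r--", "bob": "rw-"}}

def can_read(user: str, path: str) -> bool:
    # RBAC may grant data-plane read access broadly...
    if rbac.get(user) in ("Storage Blob Data Reader",
                          "Storage Blob Data Contributor"):
        return True
    # ...otherwise fall back to the file's ACL entry.
    return "r" in acls.get(path, {}).get(user, "---")

assert can_read("alice", "/lake/raw/file1.csv")      # via RBAC role
assert can_read("bob", "/lake/raw/file1.csv")        # via ACL entry
assert not can_read("carol", "/lake/raw/file1.csv")  # neither grants access
```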
Data Encryption: Data At Rest

• Data at rest includes information that resides in persistent storage on physical media, in
any digital format. The media can include files on magnetic or optical media, archived
data, and data backups. Microsoft Azure offers a variety of data storage solutions to
meet different needs, including file, disk, blob, and table storage. Microsoft also
provides encryption to protect Azure SQL Database, Azure Cosmos DB, and Azure Data
Lake.
• Data encryption at rest using AES 256 data encryption is available for services across the
software as a service (SaaS), platform as a service (PaaS), and infrastructure as a service
(IaaS) cloud models.

• Azure Data Lake storage is an enterprise-wide repository of every type of data collected
in a single place prior to any formal definition of requirements or schema. Data Lake
Store supports "on by default," transparent encryption of data at rest, which is set up
during the creation of your account. By default, Azure Data Lake Store manages the keys
for you, but you have the option to manage them yourself.
• Three types of keys are used in encrypting and decrypting data: the Master Encryption
Key (MEK), Data Encryption Key (DEK), and Block Encryption Key (BEK). The MEK is used
to encrypt the DEK, which is stored on persistent media, and the BEK is derived from the
DEK and the data block. If you are managing your own keys, you can rotate the MEK.
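The MEK/DEK/BEK hierarchy can be illustrated with a toy envelope-encryption sketch. XOR stands in for AES here, and the BEK is derived from the DEK plus the block's index; this is an illustration of the key hierarchy only, not secure cryptography:

```python
import hashlib
import secrets

def xor(data: bytes, key: bytes) -> bytes:
    """Toy XOR 'cipher' standing in for AES; illustration only, not secure."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

# Key hierarchy as described: the MEK wraps the DEK; BEKs are derived per block.
mek = secrets.token_bytes(32)      # Master Encryption Key (rotatable)
dek = secrets.token_bytes(32)      # Data Encryption Key
wrapped_dek = xor(dek, mek)        # only the wrapped DEK goes to persistent media

def bek_for_block(dek: bytes, block_index: int) -> bytes:
    # Block Encryption Key derived from the DEK and the data block's position.
    return hashlib.sha256(dek + block_index.to_bytes(4, "big")).digest()

block = b"some data at rest"
ciphertext = xor(block, bek_for_block(dek, 0))

# Decrypt path: unwrap the DEK with the MEK, re-derive the BEK.
recovered_dek = xor(wrapped_dek, mek)
assert xor(ciphertext, bek_for_block(recovered_dek, 0)) == block
```

Because only the wrapped DEK is persisted, rotating the MEK means re-wrapping the DEK without re-encrypting the stored data blocks.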
Encryption of data in transit

Data-link Layer encryption in Azure
Whenever Azure customer traffic moves between datacenters – outside physical boundaries not controlled by Microsoft (or on behalf of Microsoft) – a data-link layer encryption method using the IEEE 802.1AE MAC Security Standards (also known as MACsec) is applied from point-to-point across the underlying network hardware. The packets are encrypted on the devices before being sent, preventing physical "man-in-the-middle" or snooping/wiretapping attacks.

TLS encryption in Azure
Microsoft gives customers the ability to use the Transport Layer Security (TLS) protocol to protect data when it's traveling between the cloud services and customers. Microsoft data centers negotiate a TLS connection with client systems that connect to Azure services. TLS provides strong authentication, message privacy, and integrity (enabling detection of message tampering, interception, and forgery), interoperability, algorithm flexibility, and ease of deployment and use.

Azure Storage transactions
When you interact with Azure Storage through the Azure portal, all transactions take place over HTTPS. You can also use the Storage REST API over HTTPS to interact with Azure Storage. You can enforce the use of HTTPS when you call the REST APIs to access objects in storage accounts by enabling the secure transfer required for the storage account.

In-transit encryption in Data Lake
Data in transit (also known as data in motion) is also always encrypted in Data Lake Store. In addition to encrypting data prior to storing it in persistent media, the data is also always secured in transit by using HTTPS. HTTPS is the only protocol that is supported for the Data Lake Store REST interfaces.

Key management with Key Vault
Without proper protection and management of the keys, encryption is rendered useless. Key Vault is the Microsoft-recommended solution for managing and controlling access to encryption keys used by cloud services. Permissions to access keys can be assigned to services or to users through Microsoft Entra accounts.

Source: https://learn.microsoft.com/en-us/azure/security/fundamentals/encryption-overview
Azure Storage Redundancy

Azure Storage always stores multiple copies of your data to protect it from planned and
unplanned events. Examples of these events include transient hardware failures, network or
power outages, and massive natural disasters. Redundancy ensures that your storage account
meets its availability and durability targets even in the face of failures.
When deciding which redundancy option is best for your scenario, consider the tradeoffs
between lower costs and higher availability.

The factors that help determine which redundancy option you should choose include:
• How your data is replicated within the primary region.
• Whether your data is replicated from a primary region to a second, geographically distant
region, to protect against regional disasters (geo-replication).
• Whether your application requires read access to the replicated data in the secondary
region during an outage in the primary region (geo-replication with read access).

Source (next few slides): https://learn.microsoft.com/en-us/azure/storage/common/storage-redundancy
Azure Storage Redundancy… contd

• The services that comprise Azure Storage are managed through a common Azure
resource called a storage account.
• The storage account represents a shared pool of storage that can be used to
deploy storage resources such as blob containers (Blob Storage), file shares (Azure
Files), tables (Table Storage), or queues (Queue Storage).
• The redundancy setting for a storage account is shared for all storage services
exposed by that account.
• All storage resources deployed in the same storage account have the same
redundancy setting. Consider isolating different types of resources in separate
storage accounts if they have different redundancy requirements.
Azure Storage Redundancy… contd
Redundancy in the primary region

Data in an Azure Storage account is always replicated three times in the primary region. Azure Storage offers two options for how your data is replicated in the primary region:
• Locally redundant storage (LRS) copies your data synchronously
three times within a single physical location in the primary region.
LRS is the least expensive replication option, but isn't
recommended for applications requiring high availability or
durability.
• Zone-redundant storage (ZRS) copies your data synchronously
across three Azure availability zones in the primary region. For
applications requiring high availability, Microsoft recommends
using ZRS in the primary region, and also replicating to a
secondary region.
• LRS provides at least 99.999999999% (11 nines) durability of objects
over a given year.
• A write request to a storage account that is using LRS happens
synchronously. The write operation returns successfully only after
the data is written to all three replicas.
Azure Storage Redundancy… contd
LRS - Redundancy in the primary region

LRS is the lowest-cost redundancy option and offers the least


durability compared to other options. LRS protects your data against
server rack and drive failures. However, if a disaster such as fire or
flooding occurs within the data center, all replicas of a storage
account using LRS might be lost or unrecoverable. To mitigate this
risk, Microsoft recommends using zone-redundant
storage (ZRS), geo-redundant storage (GRS), or geo-zone-redundant
storage (GZRS).
LRS is a good choice for the following scenarios:
• If your application stores data that can be easily reconstructed if
data loss occurs, consider choosing LRS.
• If your application is restricted to replicating data only within a
region due to data governance requirements, consider choosing
LRS. In some cases, the paired regions across which data is
geo-replicated might be in another country or region.
Azure Storage Redundancy… contd
Zone Redundant Storage (ZRS)

• Zone-redundant storage (ZRS) replicates your storage


account synchronously across three Azure availability
zones in the primary region.
• Each availability zone is a separate physical location with
independent power, cooling, and networking. ZRS offers
durability for storage resources of at least
99.9999999999% (12 9s) over a given year.
• When you utilize ZRS, your data remains accessible for
both read and write operations even if a zone becomes
unavailable.
• If a zone becomes unavailable, Azure undertakes
networking updates such as Domain Name System (DNS)
repointing. These updates could affect your application if
you access data before the updates are complete.
• A write request to a storage account that is using ZRS
happens synchronously. The write operation returns
successfully only after the data is written to all replicas
across the three availability zones. If an availability zone is
temporarily unavailable, the operation returns successfully
after the data is written to all available zones.
• Microsoft recommends using ZRS in the primary region for
scenarios that require high availability. ZRS is also
recommended for restricting replication of data to a
particular region to meet data governance requirements.
Why geo and secondary-region redundancy?

• ZRS provides excellent performance, low latency, and resiliency for your data
if it becomes temporarily unavailable. However, ZRS by itself might not fully
protect your data against a regional disaster where multiple zones are
permanently affected.
• Geo-zone-redundant storage (GZRS) uses ZRS in the primary region and also
geo-replicates your data to a secondary region. GZRS is available in many
regions, and is recommended for protection against regional disasters.
• Redundancy options can help provide high durability for your applications. In
many regions, you can copy the data within your storage account to a
secondary region located hundreds of miles away from the primary region.
Copying your storage account to a secondary region ensures that your data
remains durable during a complete regional outage or a disaster in which
the primary region isn't recoverable.
• When you create a storage account, you select the primary region for the
account. The paired secondary region is determined based on the primary
region, and can't be changed.

Secondary region redundancy


Azure Storage offers two options for copying your data to a secondary region:

• Geo-redundant storage (GRS) copies your data synchronously three times within a
single physical location in the primary region using LRS. It then copies your data
asynchronously to a single physical location in the secondary region. Within the
secondary region, your data is copied synchronously three times using LRS.
• Geo-zone-redundant storage (GZRS) copies your data synchronously across three
Azure availability zones in the primary region using ZRS. It then copies your data
asynchronously to a single physical location in the secondary region. Within the
secondary region, your data is copied synchronously three times using LRS.
• When you utilize GRS or GZRS, the data in the secondary region isn't available for
read or write access unless there's a failover to the secondary region.
Secondary region redundancy

GRS: Geo-redundant storage

• Geo-redundant storage (GRS) copies your data synchronously three times within a single
physical location in the primary region using LRS. It then copies your data asynchronously
to a single physical location in a secondary region that is hundreds of miles away from the
primary region. GRS offers durability for storage resources of at least
99.99999999999999% (16 9s) over a given year.
• A write operation is first committed to the primary location and replicated using LRS.
The update is then replicated asynchronously to the secondary region. When data is
written to the secondary location, it also replicates within that location using LRS.
Secondary region redundancy

Geo-zone-redundant storage

• Geo-zone-redundant storage (GZRS) combines the high availability provided by redundancy across
availability zones with protection from regional outages provided by geo-replication. Data in a GZRS storage
account is copied across three Azure availability zones in the primary region. In addition, it also replicates to
a secondary geographic region for protection from regional disasters. Microsoft recommends using GZRS for
applications requiring maximum consistency, durability, and availability, excellent performance, and
resilience for disaster recovery.
• With a GZRS storage account, you can continue to read and write data if an availability zone becomes
unavailable or is unrecoverable. Additionally, your data also remains durable during a complete regional
outage or a disaster in which the primary region isn't recoverable. GZRS is designed to provide at least
99.99999999999999% (16 9s) durability of objects over a given year.

Plan for data loss


Because data is replicated asynchronously from the primary to the secondary region, the
secondary region is typically behind the primary region in terms of write operations. If a disaster
strikes the primary region, it's likely that some data would be lost and that files within a
directory or container wouldn't be consistent. For more information about how to plan for
potential data loss, see the Azure Storage redundancy documentation listed in the references.
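The replication lag is your potential data-loss window (the recovery point objective, RPO). Azure surfaces a "Last Sync Time" for geo-redundant accounts; a sketch of turning it into an RPO estimate, with hypothetical timestamps:

```python
from datetime import datetime, timedelta, timezone

def estimate_rpo(last_sync_time: datetime, now: datetime) -> timedelta:
    """Writes made after last_sync_time may be lost if the primary fails now."""
    return now - last_sync_time

# Hypothetical values for illustration:
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_sync = datetime(2024, 1, 1, 11, 58, tzinfo=timezone.utc)
print(estimate_rpo(last_sync, now))  # 0:02:00
```

In production you would read the account's actual last sync time from the Azure SDK or portal rather than hard-coding it as above.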
Comparison of redundancy options:

Percent durability of objects over a given year
• LRS: at least 99.999999999% (11 9s)
• ZRS: at least 99.9999999999% (12 9s)
• GRS/RA-GRS: at least 99.99999999999999% (16 9s)
• GZRS/RA-GZRS: at least 99.99999999999999% (16 9s)

Availability for read requests
• LRS: at least 99.9% (99% for cool/cold/archive access tiers)
• ZRS: at least 99.9% (99% for cool/cold access tier)
• GRS: at least 99.9% (99% for cool/cold/archive access tiers); RA-GRS: at least 99.99% (99.9% for cool/cold/archive access tiers)
• GZRS: at least 99.9% (99% for cool/cold access tier); RA-GZRS: at least 99.99% (99.9% for cool/cold access tier)

Availability for write requests
• LRS: at least 99.9% (99% for cool/cold/archive access tiers)
• ZRS: at least 99.9% (99% for cool/cold access tier)
• GRS/RA-GRS: at least 99.9% (99% for cool/cold/archive access tiers)
• GZRS/RA-GZRS: at least 99.9% (99% for cool/cold access tier)

Number of copies of data maintained on separate nodes
• LRS: three copies within a single region
• ZRS: three copies across separate availability zones within a single region
• GRS/RA-GRS: six copies total, including three in the primary region and three in the secondary region
• GZRS/RA-GZRS: six copies total, including three across separate availability zones in the primary region and three locally redundant copies in the secondary region
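The comparison above can be encoded as data so an application (or a design review) can pick the cheapest option that meets its requirements. A sketch; the cost ordering LRS < ZRS < GRS < GZRS is the usual assumption, not published pricing:

```python
# Redundancy options in rough order of increasing cost/protection.
OPTIONS = [
    {"name": "LRS",  "zones": 1, "regions": 1, "nines": 11},
    {"name": "ZRS",  "zones": 3, "regions": 1, "nines": 12},
    {"name": "GRS",  "zones": 1, "regions": 2, "nines": 16},
    {"name": "GZRS", "zones": 3, "regions": 2, "nines": 16},
]

def choose_redundancy(need_zone_ha: bool, need_region_dr: bool) -> str:
    """Return the first (assumed cheapest) option meeting the requirements."""
    for opt in OPTIONS:
        if need_zone_ha and opt["zones"] < 2:
            continue
        if need_region_dr and opt["regions"] < 2:
            continue
        return opt["name"]
    raise ValueError("no option satisfies the requirements")

print(choose_redundancy(False, False))  # LRS
print(choose_redundancy(True, False))   # ZRS
print(choose_redundancy(False, True))   # GRS
print(choose_redundancy(True, True))    # GZRS
```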

A bit about AWS Storage Options

What is Amazon S3
“Amazon Simple Storage Service (Amazon S3) is an object storage service that offers
industry-leading scalability, data availability, security, and performance. Customers of
all sizes and industries can use Amazon S3 to store and protect any amount of data for a
range of use cases, such as data lakes, websites, mobile applications, backup and
restore, archive, enterprise applications, IoT devices, and big data analytics. Amazon
S3 provides management features so that you can optimize, organize, and configure access to
your data to meet your specific business, organizational, and compliance requirements.”

• Internet-scale storage: grow without limits
• Built-in redundancy: designed for 99.999999999% durability
• Low price per GB per month: no commitment, no up-front cost
• Benefit from AWS's massive security investments

Highly durable object storage for all types of data


Key Features of S3

1. Storage Classes: S3 storage classes provide a way to customize your approach to
cloud storage. Each tier has a particular role, and the classes span seven tiers so that
users pay only for the storage they truly need. Storage classes: S3 Standard,
S3 Intelligent-Tiering, S3 Standard-Infrequent Access (S3 Standard-IA),
S3 One Zone-Infrequent Access (S3 One Zone-IA), S3 Outposts, Glacier,
Glacier Deep Archive.
2. Storage Management: Data lifecycle management to manage costs, meet regulatory
requirements, and reduce latency; save multiple distinct copies of your data for
compliance requirements; cost monitoring.
3. Strong Consistency: Amazon S3 provides strong read-after-write consistency for PUT
and DELETE requests of objects in your Amazon S3 bucket in all AWS Regions. This
behavior applies to writes of new objects as well as PUT requests that overwrite
existing objects and DELETE requests.
4. Access and Data Management: Multi-factor authentication delete; flexible access
control mechanisms; time-limited access to objects; access logs; multiple client- and
server-side encryption options.
5. Data Processing: Event notifications delivered using SQS, SNS, or Lambda enable you
to trigger workflows, alerts, or other processing.
6. Data Analytics: Amazon S3 offers features to help you gain visibility into your storage
usage, which empowers you to better understand, analyze, and optimize your storage
at scale.

AMAZON S3 BASE CONCEPTS



S3 Components

• Buckets
• Objects
• Keys
• S3 Versioning
• Version ID
• Bucket policy
• Access control lists (ACLs)
• S3 Access Points
• Regions

BUCKETS

• Containers for objects stored in S3
• Serve several purposes:
  o Organise the Amazon S3 namespace at the highest level
  o Identify the account responsible for charges
  o Play a role in access control
  o Serve as the unit of aggregation for usage reporting

OBJECTS

• Fundamental entities stored in Amazon S3
• Consist of data & metadata:
  o Data portion is opaque to Amazon S3
  o Metadata is a set of name-value pairs that describe the object
• An object is uniquely identified within a bucket by a key (name) and a version ID

KEYS

• Unique identifier for an object within a bucket.


• Every object in a bucket has exactly one key
• Combination of a bucket, key & version ID
uniquely identify each object
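Since bucket name plus key (plus version ID, when versioning is on) uniquely identify an object, the address can be written as an s3:// URI. A small sketch; the `?versionId=` suffix mirrors how the S3 REST API addresses a specific version:

```python
def s3_uri(bucket: str, key: str, version_id: str = "") -> str:
    """Build the s3:// URI for an object; version ID as a query suffix."""
    uri = f"s3://{bucket}/{key}"
    if version_id:
        uri += f"?versionId={version_id}"
    return uri

print(s3_uri("ianm-aws-docs", "s3-webinar.pptx"))
# s3://ianm-aws-docs/s3-webinar.pptx
print(s3_uri("ianm-aws-docs", "s3-webinar.pptx", "3HL4kqtJvjVBH40Nrjfkd"))
```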

REGIONS

• The geographical region where


Amazon S3 will store the buckets
that you create
• Choose a region to optimise
latency, minimise costs, or
address regulatory
requirements.

Globally Unique

Bucket Name + Object Name (key)



Example: globally unique bucket names, with object keys unique within each bucket:
• Bucket ianm-aws-docs: s3-webinar.pptx, vid/s3-webinar.mp4
• Bucket ianm-aws-bootstrap: wp/bootstrap.sh, wp/credentials.txt
• Bucket aws-exampl.es: index.html, logo.png

Object key
• Unique within a bucket
• Max 1024 bytes, UTF-8
• Includes 'path' prefixes

An example object key: assets/js/jquery/plugins/jtables.js
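Although keys are flat names, 'path' prefixes plus a delimiter let tools present folder-like views; this is what S3's `ListObjectsV2` does with `Prefix` and `Delimiter="/"`. A pure-Python sketch of that grouping (no AWS call):

```python
def list_by_prefix(keys, prefix="", delimiter="/"):
    """Mimic S3 prefix/delimiter listing: return (objects, common_prefixes)."""
    objects, common = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Everything up to the first delimiter becomes a "folder".
            common.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(key)
    return objects, sorted(common)

keys = [
    "assets/js/jquery/plugins/jtables.js",
    "assets/css/site.css",
    "index.html",
]
print(list_by_prefix(keys))             # (['index.html'], ['assets/'])
print(list_by_prefix(keys, "assets/"))  # ([], ['assets/css/', 'assets/js/'])
```

The real API returns the grouped prefixes under `CommonPrefixes`; the point is that the "folders" exist only as a view over flat keys.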

ACCESS CONTROLS

SECURE BY DEFAULT
• You decide what to share: apply policies to buckets and objects
• Policies, ACLs & IAM: use S3 policies, ACLs or IAM to define rules

IAM Policies
• Fine grained, role-based access
• Apply policies to S3 at role, user & group level

Bucket Policies
• Fine grained
• Apply policies at the bucket level in S3
• Incorporate user restrictions without using IAM

ACLs
• Coarse grained
• Apply access control rules at the bucket and/or object level in S3

Example of each mechanism granting access:
• IAM policy (attached to users Bob and Jane): Allow Action PutObject on Resource arn:aws:s3:::mybucket/*
• Bucket policy (on mybucket): Allow principals Bob, Jane; Action PutObject; Resource arn:aws:s3:::mybucket/*
• ACL (on myobject in mybucket): Allow Everyone Read
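A bucket policy like the one sketched above is just a JSON document attached to the bucket. A minimal builder; the account ID and user ARNs below are placeholders for illustration:

```python
import json

def put_object_policy(bucket: str, principals: list) -> str:
    """Bucket policy allowing the given IAM principals to PutObject."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": principals},
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{bucket}/*",
        }],
    }
    return json.dumps(policy, indent=2)

# Hypothetical account and users:
print(put_object_policy("mybucket", [
    "arn:aws:iam::111122223333:user/Bob",
    "arn:aws:iam::111122223333:user/Jane",
]))
```

The resulting JSON is what you would paste into the bucket's policy editor or pass to an API call that sets the bucket policy.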

STORAGE CLASSES
Types of Storage Classes

https://aws.amazon.com/s3/storage-classes-infographic/
S3 storage classes

• S3 Standard: High durability, availability, and performance object storage for frequently accessed data.
  Typical use: data lakes, cloud-native apps, content distribution engines, websites.
  Durability: 99.999999999% (11 9s). Availability: 99.99%. Availability Zones: ≥3. Min storage duration: N/A. First-byte latency: milliseconds.

• S3 Intelligent-Tiering: Only class that moves objects across tiers automatically to save cost.
  Typical use: data lakes and applications with lots of data changes.
  Durability: 11 9s. Availability: 99.9%. AZs: ≥3. Min storage duration: N/A. First-byte latency: milliseconds.

• S3 Standard-IA: Data that is accessed less frequently, but requires rapid access when needed.
  Typical use: long-term storage, backup, and DR strategy.
  Durability: 11 9s. Availability: 99.9%. AZs: ≥3. Min storage duration: 30 days. First-byte latency: milliseconds.

• S3 One Zone-IA: Data that is accessed less frequently but requires rapid access when needed; uses only one availability zone.
  Typical use: cost-effective storage and backup of data that may need quarterly access.
  Durability: 11 9s. Availability: 99.5%. AZs: 1. Min storage duration: 30 days. First-byte latency: milliseconds.

• S3 Glacier Instant Retrieval: Archive storage class that delivers the lowest-cost storage for long-lived data that is rarely accessed and requires retrieval in milliseconds.
  Typical use: long-term storage of data that still needs occasional immediate access.
  Durability: 11 9s. Availability: 99.9%. AZs: ≥3. Min storage duration: 90 days. First-byte latency: milliseconds.

• S3 Glacier Flexible Retrieval: Low-cost storage, up to 10% lower cost than S3 Glacier Instant Retrieval, for archive data that is accessed 1-2 times per year and is retrieved asynchronously.
  Typical use: data storage with the facility of querying data-at-rest as needed.
  Durability: 11 9s. Availability: 99.99%. AZs: ≥3. Min storage duration: 90 days. First-byte latency: minutes or hours.

• S3 Glacier Deep Archive: Lowest-cost storage class; supports long-term retention and digital preservation for data that may be accessed once or twice in a year.
  Typical use: long-term digital preservation of data with at most yearly access needs.
  Durability: 11 9s. Availability: 99.99%. AZs: ≥3. Min storage duration: 180 days. First-byte latency: hours.

References
Azure Storage Redundancy:
https://learn.microsoft.com/en-us/azure/storage/common/storage-redundancy

Azure Blob Storage and Data Lake Storage Documentation:


About Blob (object) storage - Azure Storage | Microsoft Learn

AWS S3 Documentation:
Amazon Simple Storage Service Documentation

Azure Data Encryption at Rest and in transit


Azure encryption overview | Microsoft Learn

Thank you…

Backup

Ingestion and storage layers in the reference architecture

Ingestion
• Matches AWS services to data source characteristics
• Integrates with storage

Storage
• Provides durable, scalable storage
• Includes a metadata catalog for governance and discoverability of data

© 2022, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Matching ingestion services to variety, volume, and velocity

• SaaS apps → Amazon AppFlow
• OLTP, ERP, CRM, LOB datastores → AWS DMS
• File shares → AWS DataSync
• Web, devices, sensors, social media → Kinesis Data Streams, Kinesis Data Firehose
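The source-to-service matching above is effectively a lookup table; a sketch, where the source-type labels are my own shorthand, not AWS terminology:

```python
# Source-type -> ingestion-service mapping from the slide.
INGESTION = {
    "saas_app":   "Amazon AppFlow",
    "oltp":       "AWS DMS",
    "erp":        "AWS DMS",
    "crm":        "AWS DMS",
    "lob":        "AWS DMS",
    "file_share": "AWS DataSync",
    "stream":     "Amazon Kinesis Data Streams / Firehose",
}

def ingestion_service(source_type: str) -> str:
    """Look up the ingestion service suggested for a source type."""
    return INGESTION.get(source_type, "no match on this slide")

print(ingestion_service("saas_app"))    # Amazon AppFlow
print(ingestion_service("file_share"))  # AWS DataSync
```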
Services that help move data from datastores to AWS:
• AWS DMS
• AWS Snowball
• AWS Snowmobile
• Amazon AppFlow
• Amazon Kinesis Data Streams
• Amazon Kinesis Data Firehose
• Amazon Managed Streaming for Apache Kafka (MSK)
• AWS DataSync

More about the AWS Snow Family: https://aws.amazon.com/snow/


Enterprise data vision is hindered by on-site datastores when it comes to the 4 V's:
Volume, Velocity, Veracity, and Variety (Hint: data lake)

• Amazon Simple Storage Service (Amazon S3)
• Amazon S3 Glacier
• AWS Lake Formation
• AWS Glue
Modern data architecture storage layer

Storage layer, catalog:
• AWS Glue Data Catalog
• AWS Lake Formation

Storage layer, storage:
• Amazon Redshift
• Amazon S3 (native integration with Redshift)
Storage for variety, volume, and velocity

• Amazon Redshift: highly structured data is loaded into traditional schemas; semistructured data is loaded into staging tables. Use case: fast BI dashboards.
• Amazon S3: unstructured, semistructured, and structured data is stored as objects. Use case: big data AI/ML.

Storage zones for data in different states

Catalog layer for governance and discoverability
How to extract insights from the data

• Amazon Redshift
• Amazon EMR (Elastic MapReduce)
• AWS Glue
• Amazon Athena
• Amazon Elasticsearch Service (now Amazon OpenSearch Service)
• Amazon Kinesis Data Analytics

Incorporate data visualization and machine learning services for advanced analytics and predictions:

• AWS Data Exchange
• Amazon QuickSight
• Amazon SageMaker

Backup:
AWS:
Modern data architecture pipeline: Processing and
consumption
Design Principles and Patterns for Data Pipelines

Processing and consumption layers in the reference architecture

Processing
• Transforms data into a consumable state
• Uses purpose-built components

Analysis and Visualization (Consumption)
• Democratizes consumption across the organization
• Provides unified access to stored data and metadata

Modern architecture pipeline: Processing

Modern data architecture: Consumption layer

Consumption
• Interactive SQL: Amazon Athena, Amazon Redshift
• Business intelligence: Amazon QuickSight
• Machine learning: Amazon SageMaker
Consuming data by using interactive SQL

Consuming data for business intelligence

Consuming data for ML
