Week 4: Azure & AWS Storage
Storage Scenarios
Prof. Mandar Samant
Unless otherwise stated, this presentation refers to study material from Microsoft Azure Learn, AWS documentation, and Snowflake Academic Courses.
• Data pipeline and Data Platform Services scenario example
Quick Recap
Data Lakehouse
Source: https://www.databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html
Modern Data Analytics Architecture – Conceptual
Organization Patterns?
Lambda vs. Kappa Architecture
• Processing Model — Lambda: combines batch and real-time processing. Kappa: focuses solely on stream (real-time) processing.
• Complexity — Lambda: higher, due to managing separate batch and speed layers. Kappa: simpler, as it uses only one stream processing layer.
• Fault Tolerance — Lambda: fault-tolerant, as batch processing ensures data accuracy. Kappa: fault-tolerant with real-time processing, but depends on stream integrity.
• Use Case — Lambda: suitable for both real-time and batch processing needs. Kappa: best for real-time processing; batch processing is less emphasized.
• Data Reprocessing — Lambda: batch layer allows accurate reprocessing of historical data. Kappa: reprocessing is done by replaying the stream in real time.
• Accuracy — Lambda: batch layer provides high accuracy; speed layer offers immediate but less accurate results. Kappa: provides consistent results, but may not match the accuracy of dedicated batch processing.
Key design considerations:
• Seamless data movement
• Unified governance (e.g., Lake Formation, AWS Glue)
Source: Analytics end-to-end with Azure Synapse - Azure Architecture Center | Microsoft Learn
Matching ingestion services to variety, volume, and velocity
Ingest (diagram):
• SaaS apps — Azure Data Factory
• Business applications (OLTP, ERP, CRM) — Azure Synapse Analytics Pipelines
• File shares — Azure File Sync
• On-premises data with limited connectivity — Azure Event Hub
Store and Manage Enterprise Data
Azure Synapse Analytics — built-in integration from Ingest to Process, with Azure Data Lake Storage Gen2 (diagram).
Data Lake Zones/Layers (continued over several slides; diagrams)
Source: Data lake zones and containers - Cloud Adoption Framework | Microsoft Learn
Related services (diagram): Azure Stream Analytics, Elastic Jobs on Azure, Microsoft Purview, Azure Data Factory.
AWS:
Sample use-cases for AWS data and
analytics services
AWS Data Analytics Pipeline Services
Representative view
Services shown (diagram): Amazon Aurora, AWS Glue, Amazon AppFlow, AWS DataSync, AWS Database Migration Service, and AWS automation tooling.
Storage
Storage Introduction
Source:
https://phoenixnap.com/blog/object-storage-vs-block-storage
https://www.ibm.com/cloud/blog/object-vs-file-vs-block-storage
Object vs. Block Storage
• Data storage — Object: unique, identifiable, and distinct units called objects store data in a flat file system; suitable for high volumes of unstructured data. Block: fixed-sized blocks store portions of the data in a hierarchical system and are reassembled when needed; best for transactional data and database storage.
• Performance — Object: performs best with large files. Block: performs best with small files.
Storage Selection Aspects
• Cost: Because the costs involved with block and file storage are higher, many organizations choose object
storage for high volumes of data.
• Management ease: The metadata and searchability make object storage a top choice for high volumes
of data. File storage, with its hierarchical organization system, is more appropriate for lower volumes of
data.
• Volume: Organizations with high volumes of data often choose object or block storage.
• Retrievability: Data is relatively retrievable from all three types of storage, though file and object storage
are typically easier to access.
• Handling of metadata: Although file storage contains very basic metadata, information with extensive
metadata is typically best served by object storage.
• Data protection: While the data is stored, it's essential that the data is protected from breaches and
cybersecurity threats.
• Storage use cases: Each type of storage is most effective for different use cases and workflows. By
understanding their specific needs, organizations can select the type that fits the majority of their storage
use cases.
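The aspects above can be condensed into a toy decision helper. The function name, inputs, and priority order are my own illustrative assumptions, not an official selection procedure:

```python
def suggest_storage_type(high_volume: bool, unstructured: bool,
                         transactional: bool, rich_metadata: bool) -> str:
    """Toy heuristic mapping the selection aspects above to a storage type.

    Priorities are illustrative: transactional workloads point to block
    storage; high volumes of unstructured or metadata-rich data point to
    object storage; lower volumes fit file storage's hierarchy.
    """
    if transactional:
        return "block"      # databases, transactional data
    if high_volume and (unstructured or rich_metadata):
        return "object"     # scalable, metadata-rich, searchable
    return "file"           # hierarchical organization for lower volumes

print(suggest_storage_type(True, True, False, True))     # object
print(suggest_storage_type(False, False, True, False))   # block
print(suggest_storage_type(False, False, False, False))  # file
```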
Azure Storage Options
Azure Storage Account (example account name: "myaccount")
Azure Storage — key capabilities:
• Secured — authentication with Microsoft Entra ID (formerly Azure Active Directory) and role-based access control (RBAC), plus encryption at rest and advanced threat protection.
• Scalable, durable, and available — sixteen nines of designed durability with geo-replication and flexibility to scale as needed; at that durability, an object is effectively never lost once stored.
• Optimized for data lakes — file namespace and multi-protocol access support, enabling analytics workloads for data insights; the same data can serve varied systems (e.g., RDBMS-style access alongside analytics).
• Comprehensive data management — end-to-end lifecycle management, policy-based access control, and immutable (WORM) storage.
Example: a UTD JSOM storage account might contain containers such as ITM or MSBA, with folders like sales or product inside.
Source: Introduction to Blob (object) Storage - Azure Storage | Microsoft Learn
General BLOB storage concepts:
Storage Accounts
A storage account provides a unique namespace in Azure for your data. Every object that you
store in Azure Storage has an address that includes your unique account name. The combination
of the account name and the Blob Storage endpoint forms the base address for the objects in
your storage account.
For example, if your storage account is named buanutsom, then the default endpoint for
Blob Storage is:
http://buanutsom.blob.core.windows.net
Storage account types:
• General-purpose v2 (Standard) — standard storage account type for blobs, file shares, queues, and tables. Recommended for most scenarios using Blob Storage or one of the other Azure Storage services.
• Block blob (Premium) — premium storage account type for block blobs and append blobs. Recommended for scenarios with high transaction rates, smaller objects, or a need for consistently low storage latency.
• Page blob (Premium) — premium storage account type for page blobs only.
Containers
A container organizes a set of blobs, similar to a directory in a file system. A storage
account can include an unlimited number of containers, and a container can store an
unlimited number of blobs.
A container name must be a valid DNS name, as it forms part of the unique URI (Uniform
resource identifier) used to address the container or its blobs. Follow these rules when
naming a container:
• Container names can be between 3 and 63 characters long.
• Container names must start with a letter or number, and can contain only lowercase
letters, numbers, and the dash (-) character.
• Two or more consecutive dash characters aren't permitted in container names.
The URI for a container is similar to:
https://myaccount.blob.core.windows.net/mycontainer
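These naming rules can be captured in a small validator (a sketch; the regex encodes only the rules listed above, and as a side effect also rejects a trailing dash):

```python
import re

# Rules from above: 3-63 characters, starts with a letter or number,
# only lowercase letters, numbers, and dashes, no consecutive dashes.
_CONTAINER_RE = re.compile(r"^[a-z0-9](?:-?[a-z0-9])*$")

def is_valid_container_name(name: str) -> bool:
    """Check a container name against the DNS-style naming rules."""
    return 3 <= len(name) <= 63 and bool(_CONTAINER_RE.fullmatch(name))

print(is_valid_container_name("my-container"))   # True
print(is_valid_container_name("my--container"))  # False (consecutive dashes)
print(is_valid_container_name("MyContainer"))    # False (uppercase)
```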
Blobs
Azure Storage supports three types of blobs:
• Block blobs store text and binary data. Block blobs are made up of blocks of
data that can be managed individually. Block blobs can store up to about
190.7 TiB.
• Append blobs are made up of blocks like block blobs, but are optimized for
append operations. Append blobs are ideal for scenarios such as logging data
from virtual machines.
• Page blobs store random access files up to 8 TiB in size. Page blobs store
virtual hard drive (VHD) files and serve as disks for Azure virtual machines.
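To keep the three blob types straight, here is a tiny lookup built only from the facts above (the dictionary structure is my own):

```python
# Summary of the three Azure blob types described above.
BLOB_TYPES = {
    "block":  "text and binary data in individually managed blocks; up to ~190.7 TiB",
    "append": "block-based but optimized for append operations (e.g., VM logging)",
    "page":   "random-access files up to 8 TiB; backs VHDs for Azure virtual machines",
}

for name, summary in BLOB_TYPES.items():
    print(f"{name:>6} blob: {summary}")
```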
A data lake is a single, centralized repository where you can store all your
data, both structured and unstructured. A data lake enables your
organization to quickly and more easily store, access, and analyze a wide
variety of data in a single location.
With a data lake, you don't need to conform your data to fit an existing
structure. Instead, you can store your data in its raw or native format, usually
as files or as binary large objects (blobs).
Azure Data Lake Storage is a cloud-based enterprise data lake solution. It's
engineered to store massive amounts of data in any format and to facilitate
big data analytical workloads. You use it to capture data of any type and
ingestion speed in a single location for easy access and analysis using various
frameworks.
Source: Azure Data Lake Storage Introduction - Azure Storage | Microsoft Learn
Understand Azure Data Lake Storage Gen2
Distributed cloud storage for data lakes:
• HDFS compatibility — a common file system for Hadoop, Spark, and others
• Flexible security through folder- and file-level permissions
• Built on Azure Storage:
  – High performance and scalability
  – Data redundancy through built-in replication
Blob Storage: blobs can be organized into virtual directories, but each path is considered a single blob in a flat namespace, and folder-level operations are not supported.
Data Lake Storage Gen2: the file system includes directories and files, and is compatible with large-scale data analytics systems like Hadoop, Databricks, and Azure Synapse Analytics.
Source: Under the hood: Performance, scale, security for cloud analytics with ADLS Gen2 | Microsoft Azure Blog
For customers wanting to build a data lake to serve the entire enterprise, security is no
lightweight consideration. There are multiple aspects to providing end-to-end security for
your data lake:
• Authentication – Azure Active Directory OAuth bearer tokens provide industry-standard
authentication mechanisms backed by the same identity service used throughout Azure
and Office365.
• Access control – A combination of Azure Role Based Access Control (RBAC) and POSIX-
compliant Access Control Lists (ACLs) to provide flexible and scalable access control.
Significantly, the POSIX ACLs are the same mechanism used within Hadoop.
• Encryption at rest and transit – Data stored in ADLS is encrypted using either a system-
supplied or customer-managed key. Additionally, data is encrypted using TLS 1.2 whilst in
transit.
• Network transport security – Given that ADLS exposes endpoints on the public Internet,
transport-level protections are provided via Storage Firewalls that securely restrict where
the data may be accessed from, enforced at the packet level.
Source: Under the hood: Performance, scale, security for cloud analytics with ADLS Gen2 | Microsoft Azure Blog
Data Encryption: Data At Rest
• Data at rest includes information that resides in persistent storage on physical media, in
any digital format. The media can include files on magnetic or optical media, archived
data, and data backups. Microsoft Azure offers a variety of data storage solutions to
meet different needs, including file, disk, blob, and table storage. Microsoft also
provides encryption to protect Azure SQL Database, Azure Cosmos DB, and Azure Data
Lake.
• Data encryption at rest using AES 256 data encryption is available for services across the
software as a service (SaaS), platform as a service (PaaS), and infrastructure as a service
(IaaS) cloud models.
• Azure Data Lake storage is an enterprise-wide repository of every type of data collected
in a single place prior to any formal definition of requirements or schema. Data Lake
Store supports "on by default," transparent encryption of data at rest, which is set up
during the creation of your account. By default, Azure Data Lake Store manages the keys
for you, but you have the option to manage them yourself.
• Three types of keys are used in encrypting and decrypting data: the Master Encryption
Key (MEK), Data Encryption Key (DEK), and Block Encryption Key (BEK). The MEK is used
to encrypt the DEK, which is stored on persistent media, and the BEK is derived from the
DEK and the data block. If you are managing your own keys, you can rotate the MEK.
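The MEK→DEK→BEK hierarchy can be illustrated with a toy sketch. This is not real cryptography — the service uses AES-256 — HMAC-SHA256 and XOR merely stand in here for the actual key-wrap and derivation algorithms:

```python
import hashlib
import hmac
import secrets

def wrap_dek(mek: bytes, dek: bytes) -> bytes:
    """Toy stand-in for encrypting the DEK under the MEK before it is
    stored on persistent media (XOR with an HMAC-derived keystream)."""
    keystream = hmac.new(mek, b"wrap", hashlib.sha256).digest()
    return bytes(a ^ b for a, b in zip(dek, keystream))

def derive_bek(dek: bytes, block_index: int) -> bytes:
    """The BEK is derived from the DEK and the data block (here, its index)."""
    return hmac.new(dek, block_index.to_bytes(8, "big"), hashlib.sha256).digest()

mek = secrets.token_bytes(32)   # Master Encryption Key (the rotatable root)
dek = secrets.token_bytes(32)   # Data Encryption Key (persisted only wrapped)
stored_dek = wrap_dek(mek, dek)

# Unwrapping with the same MEK recovers the DEK (XOR is self-inverse).
assert wrap_dek(mek, stored_dek) == dek
# Each data block gets its own BEK derived from the DEK.
assert derive_bek(dek, 0) != derive_bek(dek, 1)
```

Rotating the MEK only requires re-wrapping the DEK, not re-encrypting the data — which is why the MEK is the key you can rotate yourself.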
Encryption of data in transit
Diagram panels: data-link layer encryption in Azure; TLS encryption in Azure; Azure Storage transactions.
Source: https://learn.microsoft.com/en-us/azure/security/fundamentals/encryption-overview
Azure Storage Redundancy
(Instructor aside: redundancy is an area where Azure storage is argued to compare favorably with S3.)
Azure Storage always stores multiple copies of your data to protect it from planned and
unplanned events. Examples of these events include transient hardware failures, network or
power outages, and massive natural disasters. Redundancy ensures that your storage account
meets its availability and durability targets even in the face of failures.
When deciding which redundancy option is best for your scenario, consider the tradeoffs
between lower costs and higher availability.
The factors that help determine which redundancy option you should choose include:
• How your data is replicated within the primary region.
• Whether your data is replicated from a primary region to a second, geographically distant
region, to protect against regional disasters (geo-replication).
• Whether your application requires read access to the replicated data in the secondary
region during an outage in the primary region (geo-replication with read access).
• The services that comprise Azure Storage are managed through a common Azure
resource called a storage account.
• The storage account represents a shared pool of storage that can be used to
deploy storage resources such as blob containers (Blob Storage), file shares (Azure
Files), tables (Table Storage), or queues (Queue Storage).
• The redundancy setting for a storage account is shared for all storage services
exposed by that account.
• All storage resources deployed in the same storage account have the same
redundancy setting. Consider isolating different types of resources in separate
storage accounts if they have different redundancy requirements.
Azure Storage Redundancy… contd
Redundancy in the primary region
• ZRS provides excellent performance, low latency, and resiliency for your data
if it becomes temporarily unavailable. However, ZRS by itself might not fully
protect your data against a regional disaster where multiple zones are
permanently affected.
• Geo-zone-redundant storage (GZRS) uses ZRS in the primary region and also
geo-replicates your data to a secondary region. GZRS is available in many
regions, and is recommended for protection against regional disasters.
• Redundancy options can help provide high durability for your applications. In
many regions, you can copy the data within your storage account to a
secondary region located hundreds of miles away from the primary region.
Copying your storage account to a secondary region ensures that your data
remains durable during a complete regional outage or a disaster in which
the primary region isn't recoverable.
• When you create a storage account, you select the primary region for the
account. The paired secondary region is determined based on the primary
region, and can't be changed.
• Geo-redundant storage (GRS) copies your data synchronously three times within a
single physical location in the primary region using LRS. It then copies your data
asynchronously to a single physical location in the secondary region. Within the
secondary region, your data is copied synchronously three times using LRS.
• Geo-zone-redundant storage (GZRS) copies your data synchronously across three
Azure availability zones in the primary region using ZRS. It then copies your data
asynchronously to a single physical location in the secondary region. Within the
secondary region, your data is copied synchronously three times using LRS.
• When you utilize GRS or GZRS, the data in the secondary region isn't available for
read or write access unless there's a failover to the secondary region.
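A small summary of the copy counts implied by these options — assembled from the descriptions above; the tuple layout is my own:

```python
# Copies maintained per redundancy option, as described above:
# (synchronous copies in the primary region, copies in the secondary region).
REDUNDANCY_COPIES = {
    "LRS":  (3, 0),  # three copies in a single physical location
    "ZRS":  (3, 0),  # three copies across availability zones
    "GRS":  (3, 3),  # LRS in primary, async-replicated LRS in secondary
    "GZRS": (3, 3),  # ZRS in primary, async-replicated LRS in secondary
}

def total_copies(option: str) -> int:
    primary, secondary = REDUNDANCY_COPIES[option]
    return primary + secondary

print(total_copies("GRS"))   # 6
print(total_copies("ZRS"))   # 3
```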
Secondary region redundancy
• Geo-redundant storage (GRS) copies your data synchronously three times within a single
physical location in the primary region using LRS. It then copies your data asynchronously
to a single physical location in a secondary region that is hundreds of miles away from the
primary region. GRS offers durability for storage resources of at least
99.99999999999999% (16 9s) over a given year.
• A write operation is first committed to the primary location and replicated using LRS.
The update is then replicated asynchronously to the secondary region. When data is
written to the secondary location, it also replicates within that location using LRS.
Secondary region redundancy
Geo-zone-redundant storage
• Geo-zone-redundant storage (GZRS) combines the high availability provided by redundancy across
availability zones with protection from regional outages provided by geo-replication. Data in a GZRS storage
account is copied across three Azure availability zones in the primary region. In addition, it also replicates to
a secondary geographic region for protection from regional disasters. Microsoft recommends using GZRS for
applications requiring maximum consistency, durability, and availability, excellent performance, and
resilience for disaster recovery.
• With a GZRS storage account, you can continue to read and write data if an availability zone becomes
unavailable or is unrecoverable. Additionally, your data also remains durable during a complete regional
outage or a disaster in which the primary region isn't recoverable. GZRS is designed to provide at least
99.99999999999999% (16 9s) durability of objects over a given year.
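To make "16 9s" concrete, here is a back-of-the-envelope calculation with exact rational arithmetic (interpreting durability as an annual per-object survival probability; the one-billion-object figure is just an example):

```python
from fractions import Fraction

# 99.99999999999999% durability = sixteen nines.
durability = Fraction(9999999999999999, 10**16)
annual_loss_probability = 1 - durability      # 1/10^16 per object per year

# Expected objects lost per year if you store one billion objects:
expected_losses = annual_loss_probability * 10**9
print(expected_losses)   # 1/10000000
# i.e., storing a billion objects, you would expect to lose a single
# object roughly once every ten million years.
```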
Durability and availability by redundancy option:
• LRS — durability of objects over a given year: at least 99.999999999% (11 9s). Read availability: at least 99.9% (99% for cool/cold/archive access tiers). Write availability: at least 99.9% (99% for cool/cold/archive access tiers). Copies: three, on separate nodes within a single region.
• ZRS — durability: at least 99.9999999999% (12 9s). Read availability: at least 99.9% (99% for cool/cold access tier). Write availability: at least 99.9% (99% for cool/cold access tier). Copies: three, across separate availability zones within a single region.
• GRS / RA-GRS — durability: at least 99.99999999999999% (16 9s). Read availability: at least 99.9% (99% for cool/cold/archive access tiers) for GRS; at least 99.99% (99.9% for cool/cold/archive access tiers) for RA-GRS. Write availability: at least 99.9% (99% for cool/cold/archive access tiers). Copies: six total — three in the primary region and three in the secondary region.
• GZRS / RA-GZRS — durability: at least 99.99999999999999% (16 9s). Read availability: at least 99.9% (99% for cool/cold access tier) for GZRS; at least 99.99% (99.9% for cool/cold access tier) for RA-GZRS. Write availability: at least 99.9% (99% for cool/cold access tier). Copies: six total — three across separate availability zones in the primary region and three locally redundant copies in the secondary region.
A bit about AWS Storage Options
What is Amazon S3
“Amazon Simple Storage Service (Amazon S3) is an object storage service that offers
industry-leading scalability, data availability, security, and performance. Customers of
all sizes and industries can use Amazon S3 to store and protect any amount of data for a
range of use cases, such as data lakes, websites, mobile applications, backup and
restore, archive, enterprise applications, IoT devices, and big data analytics. Amazon
S3 provides management features so that you can optimize, organize, and configure access to
your data to meet your specific business, organizational, and compliance requirements.”
S3 Components
• Buckets
• Objects
• Keys
• S3 Versioning
• Version ID
• Bucket policy
• Access control lists (ACLs)
• S3 Access Points
• Regions
S3 fundamentals (diagram): buckets (globally unique names), objects, keys (unique within a bucket), and regions.
Object key: at most 1024 bytes of UTF-8, including 'path' prefixes; unique within a bucket. For example:
assets/js/jquery/plugins/jtables.js
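The key rules can be expressed directly — a sketch; S3 has no real directories, so "folders" are just prefixes split on '/':

```python
def key_is_valid(key: str) -> bool:
    """An object key must be at most 1024 bytes when UTF-8 encoded."""
    return 0 < len(key.encode("utf-8")) <= 1024

def key_prefixes(key: str) -> list[str]:
    """List the 'folder' prefixes implied by a key's '/'-separated path."""
    parts = key.split("/")[:-1]
    return ["/".join(parts[: i + 1]) + "/" for i in range(len(parts))]

key = "assets/js/jquery/plugins/jtables.js"
print(key_is_valid(key))   # True
print(key_prefixes(key))
# ['assets/', 'assets/js/', 'assets/js/jquery/', 'assets/js/jquery/plugins/']
```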
ACCESS CONTROLS
SECURE BY DEFAULT
STORAGE CLASSES
Types of Storage Classes
https://aws.amazon.com/s3/storage-classes-infographic/
S3 Storage classes: Innovation of its own type
• S3 Standard — high durability, availability, and performance object storage for frequently accessed data. Typical use: data lakes, cloud-native apps, content distribution engines, websites. Availability Zones: ≥3. Minimum storage duration: N/A. First-byte latency: milliseconds.
• S3 Intelligent-Tiering* — the only class that moves objects across tiers automatically to save cost. Typical use: data lakes and applications with frequently changing data. Availability Zones: ≥3. Minimum storage duration: N/A. First-byte latency: milliseconds.
• S3 Standard-IA — data accessed less frequently, but requiring rapid access when needed. Typical use: long-term storage, backup, and DR strategy. Availability Zones: ≥3. Minimum storage duration: 30 days. First-byte latency: milliseconds.
• S3 One Zone-IA† — data accessed less frequently but requiring rapid access when needed; uses only one availability zone. Typical use: cost-effective backup. Availability Zones: 1. Minimum storage duration: 30 days. First-byte latency: milliseconds.
• S3 Glacier Instant Retrieval — archive storage class that delivers the lowest-cost storage for long-lived data that is rarely accessed and requires retrieval in milliseconds. Typical use: long-term storage of data that may need quarterly access. Availability Zones: ≥3. Minimum storage duration: 90 days. First-byte latency: milliseconds.
• S3 Glacier Flexible Retrieval — low-cost storage, up to 10% lower cost than S3 Glacier Instant Retrieval, for archive data accessed 1–2 times per year and retrieved asynchronously. Typical use: data storage with the ability to query data at rest as needed. Availability Zones: ≥3. Minimum storage duration: 90 days. First-byte latency: minutes or hours.
• S3 Glacier Deep Archive — lowest-cost storage class; supports long-term retention and digital preservation for data that may be accessed once or twice a year. Typical use: long-term digital preservation of data accessed at most yearly. Availability Zones: ≥3. Minimum storage duration: 180 days. First-byte latency: hours.
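The minimum-storage-duration column lends itself to a small early-deletion check. The class identifiers and the fee logic here are simplified illustrations; actual billing rules vary:

```python
# Minimum storage duration (days) per class, from the comparison above.
MIN_STORAGE_DAYS = {
    "STANDARD": 0,
    "INTELLIGENT_TIERING": 0,
    "STANDARD_IA": 30,
    "ONEZONE_IA": 30,
    "GLACIER_IR": 90,
    "GLACIER_FLEXIBLE": 90,
    "DEEP_ARCHIVE": 180,
}

def incurs_early_delete_charge(storage_class: str, age_days: int) -> bool:
    """True if deleting an object now would fall inside the class's
    minimum storage duration (simplified; real billing rules vary)."""
    return age_days < MIN_STORAGE_DAYS[storage_class]

print(incurs_early_delete_charge("STANDARD_IA", 10))    # True
print(incurs_early_delete_charge("DEEP_ARCHIVE", 200))  # False
```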
References
Azure Storage Redundancy:
https://learn.microsoft.com/en-us/azure/storage/common/storage-redundancy
AWS S3 Documentation:
Amazon Simple Storage Service Documentation
Thank you…
Backup
Ingestion Storage
© 2022, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Matching ingestion services to variety, volume, and velocity
Ingest (diagram):
• SaaS apps — Amazon AppFlow
• Databases — AWS DMS
• File shares — AWS DataSync
• Streaming data — Kinesis Data Streams
Help move data from datastores to AWS
Modern data architecture storage layer (diagram): Amazon Redshift with native integration to Amazon S3.
Storage for variety, volume, and velocity
Highly structured data is loaded into traditional schemas. Use case: fast BI dashboards.
Amazon S3
Catalog layer for governance and discoverability
How to extract insights from the data
Backup: AWS
Modern data architecture pipeline: Processing and consumption
Design Principles and Patterns for Data Pipelines
Modern data architecture: Consumption layer
Consumption (diagram):
• Interactive SQL — Athena, Amazon Redshift
• Business intelligence — QuickSight
• Machine learning — SageMaker
Consuming data by using interactive SQL
Consuming data for business intelligence