Azure - Implementation Notes

Azure Data Engineer

Azure Storage:

Azure Storage is a Microsoft-managed service providing cloud storage that is highly available, secure,
durable, scalable, and redundant. Azure Storage includes Azure Blobs (objects), Azure Data Lake Storage
Gen2, Azure Files, Azure Queues, and Azure Tables.

An Azure account refers to the Azure billing account ---> mapped to the email ID that you used
to sign up for Azure ---> an account can contain multiple subscriptions; each subscription
can have multiple resource groups, and the resource groups, in turn, can contain
multiple resources.
---> Billing is done at the level of subscriptions.

To Create an Azure Storage Account:

Basics:
1.) Subscription (by default, a subscription can have up to 250 storage accounts per region; the limit
can be raised to 500 by request, as shown in the limits table later in these notes)
2.) Resource group (A Resource group is a container that holds related resources for an Azure
solution)
3.) Storage account name (Globally Unique)
4.) Region (Proximity to Users, Compliance Requirements, Redundancy and Disaster Recovery,
Pricing, Service Availability based Region, Network Performance between your applications and
the chosen region. Review the SLAs for Azure Storage services in different regions)

5.) Performance (Standard and Premium)


6.) Redundancy (LRS, ZRS, GRS, GZRS) --- see the SDK sketch after this list
LRS ---> Locally redundant storage: replicates data three times within a single data center
ZRS ---> Zone-redundant storage: replicates data across availability zones in the primary region
GRS ---> Geo-redundant storage: replicates data to a secondary region for disaster recovery
GZRS ---> Geo-zone-redundant storage: combines ZRS and GRS for maximum redundancy.
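
The Basics choices above map directly onto the management SDK's create call. Here is a minimal sketch using the Azure SDK for Python (azure-mgmt-storage); the subscription ID, resource group, account name, and region are placeholders, and the call shape should be verified against your SDK version:

# pip install azure-identity azure-mgmt-storage
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

# Placeholder values -- replace with your own subscription/resource group/name/region
subscription_id = "<subscription-id>"
resource_group = "my-resource-group"
account_name = "mystorageacct123"  # must be globally unique, 3-24 lowercase letters/digits

client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

# Performance is chosen via the SKU tier (Standard/Premium) and
# redundancy via the SKU name suffix (LRS, ZRS, GRS, GZRS).
poller = client.storage_accounts.begin_create(
    resource_group,
    account_name,
    {
        "location": "eastus",              # region
        "kind": "StorageV2",               # general-purpose v2
        "sku": {"name": "Standard_GZRS"},  # performance + redundancy
    },
)
account = poller.result()
print(account.id)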

Advanced:

1.) Require secure transfer for REST API operations ---> requests to the storage account must be
made over HTTPS (secured with TLS); requests over plain HTTP are rejected
2.) Allow enabling public access on individual containers-----> By default, containers within a storage
account are private. Enabling this option allows you to grant public access to specific containers if
needed.
3.) Enable storage account key access----> allows you to access the storage account using the
account keys
4.) Default to Azure Active Directory authorization in the Azure portal---> allows you to use Azure
Active Directory (AD) for authentication and authorization instead of storage account keys. It
provides more secure and granular access control to your storage account resources.
5.) Minimum TLS version- Transport Layer Security and Choosing a higher version ensures stronger
encryption and better security.
6.) Enable hierarchical namespace
7.) ACCESS PROTOCOLS - Enable SFTP and network file system v3----> Enabling these protocols
allows you to access your storage account using SFTP (Secure File Transfer Protocol) and NFS
(Network File System) v3.
8.) BLOB STORAGE - Allow cross-tenant replication, and default Access tier (Hot or Cool)
9.) AZURE FILES - Enable Large File Shares
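
Most of these Advanced options are properties on the same create call. Below is a sketch extending the earlier example; the property names are taken from the azure-mgmt-storage models as an assumption, so verify them against your installed SDK version:

# Extending the earlier begin_create sketch with Advanced options.
# Property names below are assumed from the azure-mgmt-storage models.
parameters = {
    "location": "eastus",
    "kind": "StorageV2",
    "sku": {"name": "Standard_GZRS"},
    "enable_https_traffic_only": True,  # require secure transfer (HTTPS)
    "allow_blob_public_access": False,  # keep containers private by default
    "allow_shared_key_access": True,    # storage account key access
    "minimum_tls_version": "TLS1_2",    # minimum TLS version
    "is_hns_enabled": True,             # hierarchical namespace (ADLS Gen2)
}
poller = client.storage_accounts.begin_create(resource_group, account_name, parameters)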

Networking

1.) Network access ------>
1. Enable public access from all networks
2. Enable public access from selected virtual networks and IP addresses
3. Disable public access and use private access

Virtual networks

Network routing

Routing Preferences ------> Microsoft network routing and Internet routing

Microsoft network routing keeps traffic on the Microsoft global network for as long as
possible before handing it off to the client, while Internet routing hands traffic off to the
public internet at a point of presence closer to the client.
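
As a further sketch, network rules and the routing preference can be applied through the management SDK as well (model and property names are assumptions based on azure-mgmt-storage; the IP range is a placeholder):

# Hedged sketch: restricting network access and setting routing preference
# via storage_accounts.update on the account created earlier.
client.storage_accounts.update(
    resource_group,
    account_name,
    {
        "network_rule_set": {
            "default_action": "Deny",  # block public access by default
            "ip_rules": [{"ip_address_or_range": "203.0.113.0/24"}],  # example range
        },
        "routing_preference": {
            "routing_choice": "MicrosoftRouting",  # or "InternetRouting"
        },
    },
)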

Data Protection

1.) Enable point-in-time restore for containers
2.) Enable soft delete for blobs [Days to retain deleted blobs; soft delete enables you to recover
blobs that were previously marked for deletion, including blobs that were overwritten.]
3.) Enable soft delete for containers
4.) Enable soft delete for file shares
(An SDK sketch for the blob-level protections follows the Tracking section below.)

Tracking:

Enable versioning for blobs---> Use versioning to automatically maintain previous versions of your
blobs.

Enable blob change feed ---> Keep track of create, modification, and delete changes to blobs in your
account.
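
The data protection and tracking toggles above correspond to blob service properties. A sketch, again assuming the azure-mgmt-storage operation and model shapes (verify locally):

# Hedged sketch: enabling blob data protection and tracking features on the
# account created earlier. Point-in-time restore requires versioning, change
# feed, and blob soft delete, all of which are enabled here.
client.blob_services.set_service_properties(
    resource_group,
    account_name,
    parameters={
        "delete_retention_policy": {"enabled": True, "days": 7},            # blob soft delete
        "container_delete_retention_policy": {"enabled": True, "days": 7},  # container soft delete
        "is_versioning_enabled": True,                                      # blob versioning
        "change_feed": {"enabled": True},                                   # blob change feed
        "restore_policy": {"enabled": True, "days": 6},                     # point-in-time restore
    },
)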

Access control:

Enable version-level immutability support

- Allows you to set a time-based retention policy at the account level that will apply to all blob
versions. Enable this feature to set a default policy at the account level. Without enabling it, you can still
set a default policy at the container level or set policies for specific blob versions. Versioning is required
for this property to be enabled.

Encryption:

Encryption type -----> 1. Microsoft-managed keys (MMK)
2. Customer-managed keys (CMK)

Customer-managed keys -------> 1. Blob and file service only, or
2. All service types

Customer-managed key (CMK) support can be limited to the blob and file services only, or
extended to all service types. After the storage account is created, this choice cannot be
changed. (A configuration sketch follows below.)
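
A sketch of the customer-managed key settings; the model shape is an assumption based on azure-mgmt-storage, and the Key Vault values are placeholders. A managed identity with access to the vault is also required (omitted here for brevity):

# Hedged sketch: customer-managed keys (CMK) from Azure Key Vault.
# Model shape assumed from azure-mgmt-storage; Key Vault values are placeholders.
encryption_settings = {
    "key_source": "Microsoft.Keyvault",
    "key_vault_properties": {
        "key_name": "my-cmk-key",                             # placeholder key name
        "key_vault_uri": "https://myvault.vault.azure.net/",  # placeholder vault URI
    },
    "services": {
        "blob": {"enabled": True},  # CMK scope: blob and file only, or
        "file": {"enabled": True},  # all service types (cannot change later)
    },
}
# Passed as the "encryption" field of the begin_create parameters shown earlier.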

Designing a partition strategy for files in Azure:

1. Choose a partition key: Determine a partition key based on the characteristics of your data, such
as customer ID, date, or geographical location. This key will be used to distribute your data across
different partitions.
2. Select a partitioning scheme: Azure provides two partitioning schemes: partition by range and
partition by hash. Partition by range is suitable when you have sequential or time-based data.
Partition by hash is useful when you want to distribute data uniformly across partitions.

3. Define the partitioning strategy: Implement the chosen partitioning scheme by creating a
partition map. This map specifies the partition key, the partition boundaries (in the case of range
partitioning), and the number of partitions (in the case of hash partitioning).
4. Distribute the data: When writing data to Azure, include the partition key in the data. Azure will
use this key to determine the appropriate partition for storing the data.

A file partition strategy has two parts that depend on each other: the partition key and the partition
logic. For example, if the partition key is the file's create date, the partition logic must use that date to
route each file to its exact partition.

Example for partition by range:

def get_partition_key(date):
    if "2020-01-01" <= date <= "2020-06-30":
        return "Partition A"
    elif "2020-07-01" <= date <= "2020-12-31":
        return "Partition B"
    else:
        return "Invalid Date Range"

# Example usage
file_date = "2020-05-15"
partition_key = get_partition_key(file_date)
print(partition_key)  # Output: Partition A

Example for partition by hash:

import hashlib

def get_partition_key(file_name):
    # Generate a hash value for the file name
    hash_value = hashlib.md5(file_name.encode()).hexdigest()

    # Extract a portion of the hash value to use as the partition key
    partition_key = hash_value[:2]

    return partition_key

def store_file(file_name, file_content):
    partition_key = get_partition_key(file_name)

    # Logic to store the file in the appropriate partition based on the partition key.
    # For example, you can use Azure Blob Storage and create a container per partition.
    # Using the Azure Blob Storage SDK:
    # blob_service_client = BlobServiceClient.from_connection_string(connection_string)
    # container_client = blob_service_client.get_container_client(partition_key)
    # blob_client = container_client.get_blob_client(file_name)
    # blob_client.upload_blob(file_content)

def access_file(file_name):
    partition_key = get_partition_key(file_name)

    # Logic to access the file based on the partition key: retrieve the file
    # from the corresponding partition container.
    # Using the Azure Blob Storage SDK:
    # blob_service_client = BlobServiceClient.from_connection_string(connection_string)
    # container_client = blob_service_client.get_container_client(partition_key)
    # blob_client = container_client.get_blob_client(file_name)
    # file_content = blob_client.download_blob().readall()
    # return file_content

Azure Storage uses <account name + container name + blob name> as the partition key.

Designing a partition strategy for analytical workloads

There are three main types of partition strategies for analytical workloads. These are listed here:

 Horizontal partitioning, which is also known as sharding
 Vertical partitioning
 Functional partitioning

Horizontal partitioning

In a horizontal partition, we divide the table data horizontally, and subsets of rows are stored in
different data stores. Each subset of rows (with the same schema as the parent table) is called a
shard, and each shard is stored in a different database instance. (A minimal shard-routing sketch
follows the note below.)

NOTE

Don't try to balance the data evenly across partitions unless your use case specifically
requires it: the most recent data is usually accessed far more than older data, so the
partitions holding recent data can become bottlenecks due to high access rates regardless
of how evenly the data is sized.
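
To make the sharding idea concrete, here is a minimal self-contained sketch that routes rows to shards by hashing a customer-ID partition key; the shard names and connection strings are purely illustrative:

# Minimal shard-routing sketch. The shard connection strings are hypothetical;
# in practice each shard would be a separate database instance.
SHARDS = {
    0: "connection-string-for-shard-0",
    1: "connection-string-for-shard-1",
    2: "connection-string-for-shard-2",
}

def shard_for_customer(customer_id: int) -> str:
    # Every shard holds the same schema; rows are routed by the partition key.
    return SHARDS[customer_id % len(SHARDS)]

print(shard_for_customer(101))  # routes customer 101 to one of the three shards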

Vertical partitioning

In a vertical partition, we divide the data vertically, and each subset of the columns is stored
separately in a different data store. This is ideal for column-oriented data stores such as HBase,
Cosmos DB, and so on.
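
As an illustrative sketch, vertical partitioning can be pictured as splitting one record's columns between a frequently accessed store and a rarely accessed one (both stores here are stand-in dictionaries, not real data stores):

# Illustrative sketch: splitting one logical record's columns across two stores.
hot_store = {}   # frequently accessed columns (e.g. name, price)
cold_store = {}  # rarely accessed, bulky columns (e.g. long descriptions, blobs)

def save_product(product_id, name, price, description):
    hot_store[product_id] = {"name": name, "price": price}
    cold_store[product_id] = {"description": description}

save_product(1, "widget", 9.99, "A long marketing description ...")
print(hot_store[1])  # reads touch only the hot column subset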

Functional partitioning

Functional partitions are similar to vertical partitions, except that here, we store entire tables or
entities in different data stores. They can be used to segregate data belonging to different
organizations, frequently used tables from infrequently used ones, read-write tables from read-
only ones, sensitive data from general data, and so on.

Designing a partition strategy for efficiency/performance

 Design effective folder structures to improve the efficiency of data reads and writes.
 Partition data such that a significant amount of data can be pruned while running
queries.
 File sizes in the range of 256 megabytes (MB) to 100 gigabytes (GB) perform really
well with analytical engines such as HDInsight and Azure Synapse. So, aggregate
the files into this range before running the analytical engines on them (see the
compaction sketch after this list).
 For I/O-intensive jobs, try to keep the optimal I/O buffer sizes in the range of 4 to 16
MB; anything too big or too small will become inefficient.
 Run more containers or executors per virtual machine (VM) (such as Apache Spark
executors or Apache Yet Another Resource Negotiator (YARN) containers).
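
As mentioned in the list above, small files should be compacted before analytics. A sketch using PySpark; the paths are placeholders, and the partition count should be chosen so output files land in the recommended size range for your data volume:

# Hedged sketch: compacting many small files into fewer large ones with PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

# Placeholder ADLS Gen2 paths -- replace <account> with your storage account.
df = spark.read.parquet("abfss://raw@<account>.dfs.core.windows.net/events/")
df.coalesce(8).write.mode("overwrite").parquet(
    "abfss://curated@<account>.dfs.core.windows.net/events_compacted/"
)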

Iterative query performance improvement process

1. List business-critical queries, the most frequently run queries, and the slowest queries.
2. Check the query plans for each of these queries using the EXPLAIN keyword and see the
amount of data being used at each stage (we will be learning about how to view query
plans in the later chapters; a short sketch also follows this list).
3. Identify the joins or filters that are taking the most time. Identify the corresponding data
partitions.
4. Try to split the corresponding input data partitions into smaller partitions, or change the
application logic to perform isolated processing on top of each partition and later merge
only the filtered data.
5. You could also try to see if other partitioning keys would work better and if you need to
repartition the data to get better job performance for each partition.
6. If any particular partitioning technique doesn't work, you can explore having more than
one piece of partitioning logic; for example, you could apply horizontal partitioning
within functional partitioning, and so on.
7. Monitor the partitioning regularly to check if the application access patterns are balanced
and well distributed. Try to identify hot spots early on.
8. Iterate this process until you hit the preferred query execution time.
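
For step 2, Spark exposes query plans through the EXPLAIN keyword in SQL and through DataFrame.explain(). A sketch, where the path and the event_date column are placeholders:

# Hedged sketch: inspecting a query plan in PySpark (step 2 above).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("query-plans").getOrCreate()
df = spark.read.parquet("abfss://curated@<account>.dfs.core.windows.net/events_compacted/")

# SQL-style EXPLAIN:
df.createOrReplaceTempView("events")
spark.sql("EXPLAIN SELECT count(*) FROM events WHERE event_date = '2020-05-15'").show(truncate=False)

# DataFrame equivalent; True prints the full logical and physical plans.
df.filter(df.event_date == "2020-05-15").explain(True)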

Designing a partition strategy for Azure Synapse Analytics

A dedicated SQL pool is a massively parallel processing (MPP) system that splits the queries
into 60 parallel queries and executes them in parallel. Each of these smaller queries runs on
something called a distribution. A distribution is a basic unit of processing and storage for a
dedicated SQL pool. There are three different ways to distribute (shard) data among
distributions, as listed here:

 Round-robin tables
 Hash tables

 Replicated tables

Partitioning is supported on all the distribution types in the preceding list. Apart from the
distribution types, a dedicated SQL pool also supports three types of tables: clustered
columnstore, clustered index, and heap tables. Partitioning is supported in all of these table
types, too.

In a dedicated SQL pool, data is already distributed across its 60 distributions, so we need to be
careful in deciding if we need to further partition the data. The clustered columnstore tables work
optimally when the number of rows per table in a distribution is around 1 million.

For example, if we plan to partition the data further by the months of a year, we are talking about
12 partitions x 60 distributions = 720 sub-divisions. Each of these divisions needs to have at least
1 million rows; in other words, the table (usually a fact table) will need more than 720 million
rows. So, we have to be careful not to over-partition the data when it comes to dedicated SQL
pools. (An illustrative DDL sketch follows.)
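
A sketch of a hash-distributed, monthly-partitioned clustered columnstore fact table, with the DDL submitted from Python via pyodbc; the connection string, table, and column names are placeholders, and the boundary list is truncated for brevity:

# Hedged sketch: distributed + partitioned table in a dedicated SQL pool.
import pyodbc

ddl = """
CREATE TABLE dbo.FactSales
(
    SaleId   BIGINT NOT NULL,
    SaleDate DATE   NOT NULL,
    Amount   DECIMAL(18, 2)
)
WITH
(
    DISTRIBUTION = HASH(SaleId),  -- shard rows across the 60 distributions
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (SaleDate RANGE RIGHT FOR VALUES
        ('2020-02-01', '2020-03-01', '2020-04-01'))  -- monthly boundaries (truncated)
)
"""

# autocommit so the DDL takes effect without an explicit commit
with pyodbc.connect("<dedicated-sql-pool-connection-string>", autocommit=True) as conn:
    conn.execute(ddl)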

Identifying when partitioning is needed in ADLS Gen2

As we have learned in the previous chapter, we can partition data according to our requirements
—such as performance, scalability, security, operational overhead, and so on—but there is
another reason why we might end up partitioning our data, and that is the various I/O bandwidth
limits that are imposed at subscription levels by Azure. These limits apply to both Blob storage
and ADLS Gen2.

The rate at which we ingest data into an Azure Storage system is called the ingress rate, and
the rate at which we move the data out of the Azure Storage system is called the egress rate.

Resource ---> Limit

Maximum number of storage accounts with standard endpoints per region per subscription
(including standard and premium storage accounts) ---> 250 by default, 500 by request

Maximum number of storage accounts with Azure DNS zone endpoints (preview) per region per
subscription (including standard and premium storage accounts) ---> 5000 (preview)

Default maximum storage account capacity ---> 5 PiB

Maximum number of blob containers, blobs, file shares, tables, queues, entities, or messages
per storage account ---> No limit

Default maximum request rate per storage account ---> 20,000 requests per second

Default maximum ingress per general-purpose v2 and Blob storage account (LRS/GRS) in the
following regions ---> 60 Gbps
 Australia East
 Central US
 East Asia
 East US 2
 Japan East
 Korea Central
 North Europe
 South Central US
 Southeast Asia
 UK South
 West Europe
 West US

Default maximum ingress per general-purpose v2 and Blob storage account (ZRS) in the
following regions ---> 60 Gbps
 Australia East
 Central US
 East US
 East US 2
 Japan East
 North Europe
 South Central US
 Southeast Asia
 UK South
 West Europe
 West US 2

Default maximum ingress per general-purpose v2 and Blob storage account in regions not
listed above ---> 25 Gbps

Default maximum ingress for general-purpose v1 storage accounts (all regions) ---> 10 Gbps

Default maximum egress for general-purpose v2 and Blob storage accounts (LRS/GRS) in the
following regions ---> 120 Gbps
 Australia East
 Central US
 East Asia
 East US 2
 Japan East
 Korea Central
 North Europe
 South Central US
 Southeast Asia
 UK South
 West Europe
 West US

Default maximum egress for general-purpose v2 and Blob storage accounts (ZRS) in the
following regions ---> 120 Gbps
 Australia East
 Central US
 East US
 East US 2
 Japan East
 North Europe
 South Central US
 Southeast Asia
 UK South
 West Europe
 West US 2

Default maximum egress for general-purpose v2 and Blob storage accounts in regions not
listed above ---> 50 Gbps

Maximum number of IP address rules per storage account ---> 200
Maximum number of virtual network rules per storage account ---> 200
Maximum number of resource instance rules per storage account ---> 200
Maximum number of private endpoints per storage account ---> 200

Develop data processing (40–45%) (4)



Ingest and transform data (Chapter 8)

Transforming data by using Apache Spark

Apache Spark supports transformations with three different Application Programming
Interfaces (APIs): Resilient Distributed Datasets (RDDs), DataFrames, and Datasets. We will
learn about RDD and DataFrame transformations in this chapter. Datasets are just extensions of
DataFrames, with additional features like being type-safe (where the compiler will strictly check
for data types) and providing an object-oriented (OO) interface.

What are RDDs?

RDDs are immutable, fault-tolerant collections of data objects that can be operated on in
parallel by Spark.
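
A small sketch contrasting an RDD transformation with the equivalent DataFrame transformation (the data and column names are illustrative):

# Hedged sketch: the same filter/projection as an RDD transformation and as a
# DataFrame transformation.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()

# RDD API: transformations (filter/map) are lazy; collect() triggers execution.
rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("c", 3)])
doubled = rdd.filter(lambda kv: kv[1] > 1).map(lambda kv: (kv[0], kv[1] * 2))
print(doubled.collect())  # [('b', 4), ('c', 6)]

# DataFrame API: same logic, declarative and optimized by the Catalyst optimizer.
df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["key", "value"])
df.filter(df.value > 1).withColumn("value", df.value * 2).show()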
