Azure Data Engineer
Prerequisites
• Azure fundamentals
• Good overview of Azure Event Hubs, IoT Hub and IoT Edge
• Implementing and using Azure Key Vault
• Very good knowledge of Azure AD
• In multiple places the Azure documentation lacks links to working code; in those cases, write your own where possible.
Skills Measured
• Services in Scope
• SQL Databases
• Azure Synapse Analytics
• Data Lake Gen2
• Cosmos DB
• Azure Databricks
• Azure Data Factory
• Stream Analytics
• Horizontals
• Azure Monitor
• Diagnostics and Log Analytics
• Optimization
• Security
• High Availability
• Disaster Recovery
SQL Databases
Implementation Models
Deployment Models
• Managed Instance
• High compatibility with SQL Server
• Size up to 8 TB
• Supports a private IP in a VNet
• Supports BYOL
• SQL Databases – Single and Elastic Pools
• Low compatibility with SQL Server
• Size up to 100 TB
• No native private IP, since it is not deployed inside a VNet (private access is possible only through Private Link)
• SQL Virtual Machines
• Azure Doc – Feature comparison
Purchasing Models
• vCore
• DTU – Blended hardware model (CPU, memory and I/O bundled together)
• Azure Doc – Purchasing model, Service Tiers
Azure SQL Databases
Elastic Pool
• Geo-Replication
• Replicates data to the same or another region
• Supports reads on the secondary
• Supports multiple replicas
• Requires a connection string update after failover
• Supports only SQL databases
• Azure Docs - Overview
Continuity databases
• Failover groups
• Support both SQL databases and managed instances
• Do not support same-region replication
• No need to change the connection string after failover
• Azure Docs – Overview and failover group tutorials (5 tutorials at the time of writing)
• Backup and Recovery
• Azure Docs - Overview and configure long-term retention policy
Azure SQL Databases
Data Security
• Auditing (audit action groups)
• SUCCESSFUL_DATABASE_AUTHENTICATION_GROUP
• FAILED_DATABASE_AUTHENTICATION_GROUP
• Azure Doc - Overview
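• Firewalls and Virtual Networks
• Use firewall rules to restrict database access to a single IP address or to IP ranges.
• Rules configured in the portal apply at server level, not database level (database-level rules can only be created with T-SQL).
• If you allow a specific client IP, make sure it is a static IP.
• Azure Docs – Overview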
• Service endpoint (virtual network rule)
• Enable the service endpoint on a subnet in the VNet, then create a virtual network rule on the server
• A service endpoint applies to a single region, so the subnet must be in the same region as the server
• Azure Docs - Overview
• Private Link
• It gives the configured Azure service, in this case Azure SQL, a private IP address inside the VNet
• A service endpoint requires a network/firewall rule on the server for SQL access; Private Link does not.
• Azure Docs - Overview
• TDE
• Customer-managed key, which uses Key Vault integration
• Azure Docs - Encryption with own key
• Service-managed key
• Azure Docs - Overview
• Always Encrypted
• Keeps sensitive data encrypted at rest in the database and also while it moves between client and server
• Azure Docs – Always Encrypted
• Azure AD Authentication
• Permissions can be managed using external / AD groups
• Set an Azure AD admin account on the server
• Create contained database users mapped to AAD accounts
• Azure Docs - Overview, Configure AAD
Azure SQL Databases
Optimize / Performance Tuning
• Azure Diagnostics: types of telemetry available and how to export them. The important ones are Basic, Automatic Tuning and SQLInsights.
• Azure Docs - Overview
• Intelligent Insights – Azure Doc
• Automatic Tuning
• Three actions are available: Create Index, Drop Index and Force Last Good Plan.
• Drop Index is disabled by default
• Servers can inherit Azure defaults, and databases can inherit settings from their server.
• Azure Doc - Overview and how to implement
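The inheritance chain for automatic tuning (database, then server, then Azure defaults) can be pictured with a small sketch. This is only a model of the idea, not an Azure API: the `INHERIT` sentinel, the dictionaries and the function names are hypothetical, and of the default values only "Drop Index is off" comes from the notes above.

```python
INHERIT = "INHERIT"  # hypothetical marker meaning "take the value from the level above"

# Illustrative platform defaults. Only DROP_INDEX = OFF is taken from the notes;
# the other two values are assumptions made for this example.
AZURE_DEFAULTS = {
    "CREATE_INDEX": "ON",
    "DROP_INDEX": "OFF",
    "FORCE_LAST_GOOD_PLAN": "ON",
}


def resolve(option, database, server, defaults=AZURE_DEFAULTS):
    """Return the effective ON/OFF state of one tuning option."""
    # The database setting wins, then the server setting, then the defaults.
    for level in (database, server):
        setting = level.get(option, INHERIT)
        if setting != INHERIT:
            return setting
    return defaults[option]


def effective_settings(database, server, defaults=AZURE_DEFAULTS):
    """Resolve every known tuning option for one database."""
    return {opt: resolve(opt, database, server, defaults) for opt in defaults}


server = {"DROP_INDEX": "ON"}        # the server overrides one default
database = {"CREATE_INDEX": "OFF"}   # the database overrides one option
settings = effective_settings(database, server)
print(settings)
```

Here the database switches Create Index off for itself, picks up Drop Index from its server, and falls back to the defaults for Force Last Good Plan.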
Azure SQL Databases
Monitor
Get Started
Data Distribution
• Data is always stored in 60 distributions that are spread across the compute nodes. With 60 compute nodes, each node holds one distribution (the most expensive setup); with a single node, all 60 distributions sit on that node (the cheapest setup)
• The control node splits each query into 60 smaller queries that run in parallel, one per distribution
• Choosing Distribution Column
• This column cannot be updated
• Must have many unique values and distribute the data evenly across the 60 distributions. Data skew can lead to performance issues
• Use a column from GROUP BY, not from the WHERE clause
• To optimize JOIN performance, the join columns must be hash distributed, be compared with an equality operator, and have the same data type
• To change the distribution column, recreate the table with CTAS using the new distribution column, then collect fresh statistics on the new table.
• Azure Doc – Table Distribution and replicated tables
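To see why a distribution column needs many unique values, the following sketch hashes two hypothetical columns into 60 buckets and compares how evenly the rows land. It is a simulation only: MD5 stands in for Synapse's internal hash function (which is not public), and the column data is made up.

```python
import hashlib
from collections import Counter

DISTRIBUTIONS = 60  # number of distributions a hash-distributed table is spread over


def distribution_for(value, distributions=DISTRIBUTIONS):
    """Map a column value to a distribution (stand-in hash, not Synapse's own)."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % distributions


def rows_per_distribution(values, distributions=DISTRIBUTIONS):
    """Count how many rows each distribution would receive."""
    counts = Counter(distribution_for(v, distributions) for v in values)
    return [counts.get(d, 0) for d in range(distributions)]


def skew_ratio(counts):
    """Largest distribution divided by the average; 1.0 means perfectly even."""
    average = sum(counts) / len(counts)
    return max(counts) / average


# Hypothetical columns: a high-cardinality id and a three-value country code.
customer_ids = range(60_000)
country_codes = ["US", "IN", "DE"] * 20_000

even = rows_per_distribution(customer_ids)
skewed = rows_per_distribution(country_codes)

print("customer_id skew:", round(skew_ratio(even), 2))
print("country_code skew:", round(skew_ratio(skewed), 2))
```

The three-value column can fill at most three of the 60 distributions, so the slowest distribution does at least 20 times the average work, while the id column stays close to even.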
Azure Synapse
Partitioning
Data Loading
Security
Optimize
Data Distribution
Partitions
• Physical partitions hold one or more logical partitions
• Logical partitions are based on partition keys, e.g. UserId
• In addition to the partition key, each item has an item ID. The two together form the item's index, which uniquely identifies it
• Each physical partition provides up to 10,000 RU/s of throughput and 50 GB of storage
• A hot partition can occur if the load is not distributed evenly across partition key values
• The partition key should have high cardinality
• For read-heavy containers, choose a key that appears in query filters
• Azure Doc - Overview
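The per-partition limits above give a quick lower bound on how many physical partitions a container needs. The sketch below is a back-of-the-envelope estimate only; Cosmos DB manages the real partition count itself, and the function name and sample numbers are hypothetical.

```python
import math

MAX_RU_PER_PARTITION = 10_000  # throughput limit per physical partition (from the notes)
MAX_GB_PER_PARTITION = 50      # storage limit per physical partition (from the notes)


def min_physical_partitions(provisioned_ru, stored_gb):
    """Lower bound on physical partitions for a given throughput and data size."""
    by_throughput = math.ceil(provisioned_ru / MAX_RU_PER_PARTITION)
    by_storage = math.ceil(stored_gb / MAX_GB_PER_PARTITION)
    # Whichever limit is hit first decides the count; there is always at least one.
    return max(1, by_throughput, by_storage)


print(min_physical_partitions(25_000, 40))   # throughput-bound
print(min_physical_partitions(4_000, 120))   # storage-bound
```

Because throughput is shared out across physical partitions, a hot partition key can exhaust one partition's share while the rest sit idle, which is why an even spread of load matters.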
Azure CosmosDB
Consistency Levels
• Strong: Reads are guaranteed to return the most recent committed version of an item. A client never sees an uncommitted or partial write. Users are always guaranteed to read the latest committed write.
• Bounded Staleness: Reads might lag behind writes by at most "K" versions (that is, "updates") of an item or by a "T" time interval. Provides strong consistency for single-master, single-region clients.
• Session: Within a single client session, reads are guaranteed to honor the consistent-prefix, monotonic reads, monotonic writes, read-your-writes, and write-follows-reads guarantees. Clients outside the session see either consistent prefix or eventual consistency.
• Consistent Prefix: Reads never see out-of-order writes.
• Eventual: There's no ordering guarantee for reads. In the absence of any further writes, the replicas eventually converge. Eventual consistency is the weakest form of consistency because a client may read values that are older than the ones it had read before.
Monitor CosmosDB
Throughput
Data Access
• Access can be granted via AD IAM permissions or via keys and resource tokens.
• Account management activities such as master key rotation and global replication are available via AD only
• Keys and resource tokens allow control of data operations, which AD does not
• Restrict user access to data operations - Azure Doc
• Master keys (primary & secondary) can be regenerated and rotated
• To rotate, point all applications at the secondary key, regenerate the primary key, switch the applications to the new primary, and then regenerate the secondary key
• Resource tokens can be generated by a mid-tier service for end devices such as mobile clients
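One order for rotating both master keys without cutting off an application is: move the application to the secondary key, regenerate the primary, move the application to the new primary, then regenerate the secondary. The sketch below walks through that order against an in-memory stand-in. `FakeAccount` and `rotate_both_keys` are hypothetical names, and nothing here calls Azure.

```python
import secrets


class FakeAccount:
    """Hypothetical in-memory stand-in for an account with two master keys."""

    def __init__(self):
        self.keys = {
            "primary": secrets.token_hex(16),
            "secondary": secrets.token_hex(16),
        }

    def regenerate(self, kind):
        """Replace one key; the old value stops working immediately."""
        self.keys[kind] = secrets.token_hex(16)

    def accepts(self, key):
        return key in self.keys.values()


def rotate_both_keys(account, app):
    """Rotate both keys while the app always holds a key the account accepts.

    `app` is a dict whose "key" entry is the key the application connects with.
    Returns one boolean per step: could the app still connect after that step?
    """
    still_connected = []

    app["key"] = account.keys["secondary"]   # 1. move the app to the secondary key
    still_connected.append(account.accepts(app["key"]))

    account.regenerate("primary")            # 2. primary is unused now, replace it
    still_connected.append(account.accepts(app["key"]))

    app["key"] = account.keys["primary"]     # 3. move the app to the new primary key
    still_connected.append(account.accepts(app["key"]))

    account.regenerate("secondary")          # 4. secondary is unused now, replace it
    still_connected.append(account.accepts(app["key"]))

    return still_connected


account = FakeAccount()
old_keys = set(account.keys.values())
app = {"key": account.keys["primary"]}
results = rotate_both_keys(account, app)
print(results)
```

At the end both original keys are invalid, yet the application held a working key after every step.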
• IAM
• The Cosmos DB Operator role can administer accounts, databases and containers, but cannot read data or access the keys
• Account, database and container changes can be locked down so that key-based access cannot make them - "disableKeyBasedMetadataWriteAccess"
• IP address whitelisting
• Access to Cosmos DB can be limited to specific IP addresses or IP CIDR blocks
• By using a service endpoint, access can be limited to a certain subnet in a VNet
• This is similar to how it works for SQL Databases
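The effect of an IP allow-list that mixes single addresses and CIDR blocks can be checked locally with Python's standard `ipaddress` module. This is a sketch of the matching logic only, not of how the service evaluates its firewall. The addresses are documentation-range examples and the helper names are hypothetical.

```python
import ipaddress


def build_allow_list(entries):
    """Parse single addresses and CIDR blocks into network objects."""
    # A bare address such as "203.0.113.10" becomes a /32 network.
    return [ipaddress.ip_network(entry, strict=False) for entry in entries]


def is_allowed(client_ip, allow_list):
    """True if the client address falls inside any allowed network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in allow_list)


allow_list = build_allow_list(["203.0.113.10", "198.51.100.0/24"])

for ip in ["203.0.113.10", "203.0.113.11", "198.51.100.77", "198.51.101.1"]:
    print(ip, is_allowed(ip, allow_list))
```

This also shows why a static client IP matters: once the client's address changes, it no longer matches a single-address rule.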
Azure CosmosDB
Reference Architecture
• CosmosDB
• CosmosDB with IoT
Azure Data Lake
Azure DataLake
Architecture
Data Access
• RBAC
• Shared Key and Shared Access Signature
• ACLs on files and directories
• RBAC vs ACL
• RBAC is evaluated first and takes precedence. If access is granted based on RBAC, no ACL check is performed.
• RBAC does not provide file / directory level access control
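The evaluation order described above (RBAC first, ACL only if RBAC does not grant access) can be sketched as a small decision function. It is a deliberately simplified model: the data structures and names are hypothetical, and real ACL evaluation also involves permissions on parent directories, which are left out here.

```python
def can_access(user, path, operation, rbac_grants, acl_entries):
    """Return (allowed, decided_by) for one user, path and operation.

    rbac_grants: {user: set of operations granted account-wide}  (hypothetical)
    acl_entries: {path: {user: set of operations}}               (hypothetical)
    """
    # Step 1: RBAC is checked first; a grant here skips the ACL check entirely.
    if operation in rbac_grants.get(user, set()):
        return True, "RBAC"
    # Step 2: only without an RBAC grant does the per-path ACL decide.
    if operation in acl_entries.get(path, {}).get(user, set()):
        return True, "ACL"
    return False, "none"


rbac_grants = {"alice": {"read", "write"}}
acl_entries = {"/raw/sales.csv": {"bob": {"read"}}}

print(can_access("alice", "/raw/sales.csv", "write", rbac_grants, acl_entries))
print(can_access("bob", "/raw/sales.csv", "read", rbac_grants, acl_entries))
print(can_access("bob", "/raw/sales.csv", "write", rbac_grants, acl_entries))
```

Since an RBAC grant applies across the whole scope, restricting a user to individual files or directories means leaving the RBAC data role off and relying on ACLs.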
ADF Exercises
• The best way to learn ADF is hands-on; the links below cover the required range of topics:
• ADF Overview
• ADF Create / Implement
• ADF Using CMK
• ADF – COPY Data
• ADF – Mapping Data Flows
• ADF – Use Key Vault secrets
• ADF – e2e LAB
Azure DataFactory
• IR – Create Azure IR
ADF – Triggers
Access Control