Azure Data Engineer Guide
Demysttify.Tech
Skills measured
Data Storage:
Type of Data
Azure Storage
Four configuration options are available:
1. Azure Blob
2. Azure Files
3. Azure Queues
4. Azure Tables
Performance:
Standard allows you to have any data service (Blob, File, Queue, and Table) and
uses magnetic disk drives.
Premium limits you to one specific type of blob called a page blob and uses
solid-state drives (SSD) for storage.
Access tier:
Hot
o When the data is accessed frequently.
Cool
o When the data is not accessed often.
Note:
Azure Data Lake Storage (ADLS) Gen2 can be enabled on an Azure Storage account.
Hierarchical Namespace:
o The ADLS Gen2 hierarchical namespace accelerates big data analytics
workloads and enables file-level access control lists (ACLs)
Account kind: StorageV2 (general purpose v2)
o The current offering that supports all storage types and all of the latest
features
A storage account is a container that groups a set of Azure Storage services
together.
Enterprise Security
o Can integrate with a variety of Azure data platform services and Power BI
Azure HDInsight
Deploys clusters of Hadoop, Storm, or Spark
o Normalizing values
o Missing/Null data
o De-duplication
o Pivoting Data frames
Advanced Transformations
Cosmos DB
You can build globally distributed databases with Cosmos DB; it can handle:
Document databases
Key value stores
Column family stores
Graph databases
Scalability
Performance
Availability
Programming Models
Choosing a Partition Key
Enables quick lookup of data
Enables auto-scaling when needed
Selecting the right partition key is an important part of the development process
The partition key is the value used to organise your data into logical divisions.
o e.g.: In a retail scenario,
a ProductID or UserID value is a good choice of partition key.
Note: A logical partition can hold 10 GB of information, which means each unique
partition key value can have up to 10 GB of data.
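The idea of organising data into logical divisions by partition key can be sketched as follows. This is illustrative only; Cosmos DB's actual hashing scheme is internal to the service, and the item fields are invented for the retail example above.

```python
import hashlib
from collections import defaultdict

def logical_partition(partition_key: str, partitions: int = 4) -> int:
    """Map a partition key to a partition by hashing (illustrative only;
    Cosmos DB's real hash scheme is internal to the service)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % partitions

# All items sharing a UserID land in the same logical partition,
# so lookups by UserID touch only one partition.
orders = [
    {"UserID": "u1", "ProductID": "p9"},
    {"UserID": "u2", "ProductID": "p3"},
    {"UserID": "u1", "ProductID": "p5"},
]
by_partition = defaultdict(list)
for o in orders:
    by_partition[logical_partition(o["UserID"])].append(o)
```

A good key (like UserID here) spreads items evenly across partitions while keeping each lookup within one partition.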
Creating a Cosmos-DB
az account list --output table   # lists the Azure subscriptions that we have
export DB_NAME="Products"
Consistency levels and guarantees
Eventual consistency provides the weakest read consistency but offers the lowest
latency for both reads and writes. ‼️ 🚩
Q: Which consistency level should I use to get the lowest latency of reads and
writes? ‼️ 🚩 Answer: Eventual consistency
o Creates a database with near-100% compatibility with the latest SQL
Server.
o Useful for SQL Server customers who would like to migrate on-premises
server instances in a "lift and shift" manner.
AZURE SQL-DW
3 types
Enterprise DW
o Centralized data store that provides analytics and decision support
Data Marts
o Designed for the needs of a single Team or business unit such as sales
Bottom-Up Architecture
Top-down Architecture
o Starts in minutes
o Integrated with AzureML, PowerBI & ADF
o Enterprise Ready
Azure-DW GEN-2
Creation of Azure DW
-- Load the data from Azure Blob Storage into SQL Data Warehouse
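One way to perform that load is the `COPY INTO` statement; the sketch below assumes a staging table and placeholder account, container, and credential names.

```sql
-- Load CSV files from Blob Storage into a staging table
-- (account, container, and key below are placeholders)
COPY INTO dbo.StageProducts
FROM 'https://myaccount.blob.core.windows.net/mycontainer/products/'
WITH (
    FILE_TYPE = 'CSV',
    FIRSTROW = 2,   -- skip the header row
    CREDENTIAL = (IDENTITY = 'Storage Account Key', SECRET = '<account-key>')
);
```

PolyBase external tables are the other common loading path; `COPY INTO` is usually simpler because it needs no external object definitions.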
Data Streams
A highly scalable publish-subscribe service that can ingest millions of events per
second and stream them into multiple applications
Navigate to Entities
Event Hub
Shared Access policies
o The policy will generate a primary key, a secondary key, and the connection
string
Linked Services
Linked services are much like connection strings, which define the connection
information needed for Data Factory to connect to external resources.
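A linked service is defined as a JSON document. A minimal sketch for an Azure Blob Storage connection might look like the following; the name and connection string are placeholders.

```json
{
  "name": "MyBlobStorageLinkedService",
  "properties": {
    "type": "AzureBlobStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
    }
  }
}
```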
Data Sets
Activities within ADF define the actions performed on your data. There are
three categories: data movement, data transformation, and control activities.
Pipelines
Network Security
Securing your network from attacks and unauthorized access is an important part of any
architecture.
Encryption
RSA
EC
Managing Encryption
Databases store sensitive information, such as physical addresses, email addresses,
and phone numbers. The following features are used to protect this data:
Azure Monitor
Azure Monitor provides a holistic monitoring approach by collecting, analysing, and
acting on telemetry from both cloud and on-premises environments
Metric Data
Provides quantifiable information about a system over time that enables you to
observe the behaviour of a system.
Log Data
Logs can be queried and even analysed using Log Analytics. In addition, this
information is typically presented in the overview page of an Azure Resource in
the Azure portal.
Alerts
Alerts notify you of critical conditions and potentially take corrective automated
actions based on triggers from metrics or logs.
Measures the performance and reachability of the networks that you have
configured.
Contains rich, out-of-the-box views that give you insights into key scenarios,
including:
o Monitor client and server errors.
o Check requests per hour
Connectivity Issues
SQL Database
Cosmos DB
Colocation of Resources
Storage Issues ‼️ 🚩
Consistency
Corruption
Data redundancy
Data redundancy is the process of storing data in multiple locations to ensure that it is
highly available.
Disaster Recovery
Processes should be in place for backing up or providing failover for
databases in an Azure data platform technology. Depending on the circumstances,
numerous approaches can be adopted.
Scenarios
1. Recommended service: Azure Cosmos-DB
Semi-structured: because of the need to extend or modify the schema for new
products
Azure Cosmos DB indexes every field by default
Advantages:
Creates and updates will be somewhat infrequent and can have higher latency
than read operations.
Latency & throughput: Retrievals by ID need to support low latency and high
throughput. Creates and updates can have higher latency than read operations.
Hash-distributed: use for fact tables and large dimension tables when the data size is
more than 100 GB.
Note:
Language support: one service supports SQL, Python, R, and Scala; the other supports
SQL, Python, and R (not Scala).
Item naming across Cosmos DB APIs: SQL API: Document; Cassandra API: Row;
MongoDB API: Document; Table API: Item; Gremlin API: Node or Edge.
Data Masking
(table columns: attribute, masking value, example)
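The effect of dynamic data masking on query results can be sketched like this; these are illustrative functions mimicking the common email and credit-card masking rules, not the SQL feature itself.

```python
def mask_email(email: str) -> str:
    """Email mask: keep the first letter, mask the rest (aXXX@XXXX.com style)."""
    local, _, domain = email.partition("@")
    return local[:1] + "XXX@XXXX." + domain.rsplit(".", 1)[-1]

def mask_credit_card(number: str) -> str:
    """Credit-card mask: expose only the last four digits."""
    digits = [c for c in number if c.isdigit()]
    return "XXXX-XXXX-XXXX-" + "".join(digits[-4:])

print(mask_email("alice@contoso.com"))          # aXXX@XXXX.com
print(mask_credit_card("4111 1111 1111 1234"))  # XXXX-XXXX-XXXX-1234
```

The real feature masks values in result sets for non-privileged users while leaving the stored data unchanged.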
Performance Parameters
Azure Stream Analytics: depends on Streaming Units (SU)
Azure Cosmos DB: depends on Request Units (RU)
Azure Data Factory: depends on Data Integration Units (DIU)
Scaling
o Compute resources can be scaled in two different directions:
Scaling up is the action of adding more resources to a single
instance.
Scaling out is the addition of instances.
Performance: when optimizing for performance, you'll look at network and
storage to ensure performance is acceptable. Both can impact the response time
of your application and databases.
Patterns and Practices
o Partitioning
In many large-scale solutions, data is divided into separate
partitions that can be managed and accessed separately.
o Scaling
Is the process of allocating scale units to match performance
requirements. This can be done either automatically or manually.
o Caching
Is a mechanism to store frequently used data or assets (web pages,
images) for faster retrieval.
Availability
o Focus on maintaining uptime through small-scale incidents and temporary
conditions like partial network outages.
Recoverability
o Focus on recovery from data loss and from large scale disasters.
o Recovery Point Objective
The maximum duration of acceptable data loss.
o Recovery Time Objective
The maximum duration of acceptable downtime.
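The caching pattern above can be sketched as a small in-memory cache with expiry. This is an illustrative toy, not a substitute for a service such as Azure Cache for Redis.

```python
import time

class TTLCache:
    """Store frequently used values; entries expire after ttl seconds."""
    def __init__(self, ttl: float = 60.0):
        self.ttl = ttl
        self._store = {}  # key -> (value, expiry time)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires = entry
        if time.monotonic() > expires:
            del self._store[key]  # expired: evict and report a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)

cache = TTLCache(ttl=30.0)
cache.set("home-page", "<html>...</html>")
assert cache.get("home-page") == "<html>...</html>"
```

The expiry check on read keeps stale assets from being served without needing a background eviction thread.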
Azure Storage
BLOB
Blob storage is also the backbone for creating a storage account that can be used as
Data Lake storage
CosmosDB
Allows existing MongoDB client SDKs, drivers, and tools to interact with the data
transparently, as if they are running against an actual MongoDB database.
Data is stored in document format, similar to Core (SQL)
Using Cassandra Query language (CQL), the data will appear to be a partitioned
row store.
The original table API only allows for indexing on the partition and row keys;
there are no secondary indexes.
Storing table data in Cosmos DB automatically indexes all the properties and requires
no index management.
Querying is accomplished by using OData and LINQ queries in code, and the
original REST API for GET operations.
v) Gremlin API
Provides a graph-based view over the data. Remember that at the lowest level, all
data in any Azure Cosmos DB account is stored in an ARS (atom-record-sequence) format.
Use a traversal language to query a graph database; Azure Cosmos DB
supports Apache TinkerPop's Gremlin language.
- We are not looking at any relationships, so Gremlin is not the right choice
- The other Cosmos DB APIs are not used, since the existing queries are MongoDB-native
and therefore the MongoDB API is the best fit
Enables you to build efficient and scalable solutions for each of the patterns
shown below
SQL Security
Lambda Architecture
When working with very large data sets, it can take a long time to run the sort of
queries that clients need.
These often require algorithms such as Spark or MapReduce that operate in parallel
across the entire data set.
The results are then stored separately from the raw data and used for querying.
The drawback to this approach is that it introduces latency.
The lambda architecture addresses this problem by creating two paths for data flow:
Kappa Architecture
IoT
IoT Edge devices: devices cannot always be connected to the cloud; in this case IoT
Edge devices contain some processing and analysis logic within them, so that there is
no constant dependency on the cloud.
IoT devices: are constantly connected to the cloud, which provides the capability to
perform data processing and analysis.
Cloud gateway (IoT Hub): provides a gateway for a device to connect securely to the
cloud and send data. It acts as a message broker between the devices and the other
Azure services.
Batch Processing
In a big data context, batch processing may operate over very large data sets,
where the computation takes significant time.
One example of batch processing is transforming a large set of flat, semi-
structured CSV or JSON files into a schematized and structured format that is
ready for further querying.
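That kind of batch transformation can be sketched with the standard library; the toy schema and sample rows below are invented, and a real job at scale would run on Spark or a similar engine.

```python
import csv
import io
import json

# A tiny flat CSV input standing in for a large batch of files.
raw = """id,name,price
1,Widget,2.50
2,Gadget,
3,Sprocket,7.00
"""

def schematize(row):
    # Cast each field to a fixed schema; missing prices become None.
    return {
        "id": int(row["id"]),
        "name": row["name"],
        "price": float(row["price"]) if row["price"] else None,
    }

records = [schematize(r) for r in csv.DictReader(io.StringIO(raw))]
print(json.dumps(records[0]))  # {"id": 1, "name": "Widget", "price": 2.5}
```

The output is structured and typed, so downstream queries no longer need to parse or validate the raw text.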
To read data from multiple data sources such as Azure Blob Storage, ADLS, Azure
Cosmos DB, or SQL DW, and turn it into breakthrough insights using Spark.
Real-time Processing
Challenges
A cloud-based data integration service that allows you to orchestrate and automate
data movement and data transformation.
ADF Components
pg. 61
Demysttify.Tech
pg. 62
Demysttify.Tech
pg. 63
Demysttify.Tech
Complex Workloads
Identity Management
Identifying users that access your resources is an important part of security design.
Infrastructure Protection
Roles are defined as collections of access permissions. Security principals are mapped to
roles directly or through group membership.
Roles are sets of permissions that users can be granted. Management groups
add the ability to group subscriptions together and apply policy at an even
higher level.
An Azure service can be assigned an identity to ease the management of service access
to other Azure resources.
Service Principals:
Managed identities:
When you create a managed identity for a service, you create an account on the
Azure AD tenant. Azure infrastructure will automatically take care of
authentication.
Azure services such as Blob storage, file shares, Table storage, and Data Lake
Store all build on Azure Storage.
High-level security benefits for the data in the cloud:
o Protect the data at rest
That is, encrypt the data before persisting it into storage and
decrypt it while retrieving, e.g. Blob, Queue
o Protect the data in transit
o Support browser cross-domain access
o Control who can access data
o Audit storage access
All data written to storage is encrypted with SSE, i.e. the 256-bit Advanced
Encryption Standard (AES) cipher. SSE automatically encrypts data on writing to
Azure Storage. This feature cannot be disabled.
For VMs, Azure lets you encrypt virtual hard disks by using Azure Disk Encryption.
This encryption uses BitLocker for Windows images and DM-Crypt for
Linux.
Azure Key Vault stores the keys to help you control and manage
disk-encryption keys and secrets.
Azure Key-Vault
Safeguard cryptographic keys and other secrets used by cloud apps and services.
Encryption in transit
Keep your data secure by enabling transport-level security between Azure and
the client.
Always use HTTPS to secure communication over the public internet.
When you call the REST APIs to access objects in storage accounts, you can
enforce the use of HTTPS by requiring secure transfer for the storage account.
Grant access to user and service identities from Azure Active Directory
Grant access to storage scopes ranging from entire enterprise down to one blob
container
Define custom roles that match your security model
Leverage Privileged Identity Management to reduce standing administrative
access.
AAD authentication and RBAC: Azure Storage currently supports AAD, OAuth, and
RBAC on the Storage Resource Provider via ARM.
Storage Explorer provides the ability to manage access policies for containers.
A shared access signature (SAS) provides you with a way to grant limited access
to other clients, without exposing your account key.
Provides delegated access to resources in your storage account.
Service level
Account level
o Targets the storage account and can apply to multiple services and
resources
o For example, you can use an account-level SAS to allow the ability to
create file systems.
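A SAS token is essentially a set of claims (permissions, expiry, resource) plus an HMAC signature over them. The signing idea can be sketched like this; it is an illustrative scheme, not Azure's exact string-to-sign format, and the key and field names are invented.

```python
import base64
import hashlib
import hmac
from urllib.parse import urlencode

def sign_sas(account_key_b64: str, permissions: str, expiry: str, resource: str) -> str:
    """Build a SAS-style query string: claims plus an HMAC-SHA256 signature.
    Illustrative only; Azure's real string-to-sign has more fields in a fixed order."""
    string_to_sign = "\n".join([permissions, expiry, resource])
    key = base64.b64decode(account_key_b64)
    sig = base64.b64encode(
        hmac.new(key, string_to_sign.encode("utf-8"), hashlib.sha256).digest()
    ).decode("utf-8")
    return urlencode({"sp": permissions, "se": expiry, "sr": resource, "sig": sig})

token = sign_sas(base64.b64encode(b"fake-account-key").decode(), "r", "2024-01-01", "b")
```

Because the signature is derived from the account key, the service can verify a token's claims without storing it, and tampering with any claim invalidates the signature.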
Immutability Policies
Firewall rules
Azure SQL DB has a built-in firewall that is used to allow and deny network access
to both the database server itself and individual databases.
Server-level firewall rules
o Allow access to Azure services
o IP address rules
o Virtual network rules
Database-level firewall rules
o IP address rules
Network Security
Network security is protecting the communication of resources within and outside of
your network. The goal is to limit exposure at the network layer across your services and
systems
Internet protection:
Assess the resources that are internet-facing, and only allow inbound and
outbound communication when necessary. Ensure that they are restricted to only the
ports/protocols required.
To isolate Azure services to only allow communication from virtual networks, use
VNet service endpoints. With service endpoints, Azure service resources can be
secured to your virtual network.
Network Integration:
Storage Firewall
o Block internet access to data
o Grant access to clients in a specific VNet
o Grant access to clients from on-premises networks via the public peering
network gateway
Private endpoints
An Azure private endpoint is a fundamental building block for Private Link in Azure.
It enables services like Azure VMs to communicate privately with Private Link
resources.
It is a network interface that connects you privately and securely to a
service powered by Azure Private Link.
A private endpoint assigns a private IP address from your Azure Virtual Network
(VNet) to the storage account.
A private endpoint enables communication from the same VNet, regionally peered
VNets, globally peered VNets, and on-premises using VPN or ExpressRoute, as well as
from services powered by Private Link.
It secures all traffic between your VNet and the storage account over a private
link.
Azure SQL database - Secure your data in transit, at rest and on display
For SQL Server you can create audits that contain specifications for server-level
events and specifications for database-level events.
Audited events can be written to the event logs or to audit files
There are several levels of auditing for SQL Server, depending on government or
standards requirements for your installation.
Azure SQL DB and Azure Synapse Analytics auditing tracks database events and
writes them to an audit log in your Azure storage account.
Enable Threat detection to know any malicious activities on SQL DB or potential
security threats.
Use an Azure SQL Database managed instance securely with public endpoints
The managed instance must integrate with multi-tenant-only PaaS offerings.
You need higher throughput of data exchange than is possible when you're using
a VPN.
Company policies prohibit PaaS inside corporate networks.
Azure SQL Database and Azure Synapse Analytics data discovery & classification
Labels
o The main classification attributes used to define the sensitivity level of the
data stored in the column.
Information Types
o Provide additional granularity into the type of data stored in the column.
Miscellaneous
Serverless Computing
Containers
Azure Kubernetes Service allows you to set up virtual machines to act as your
nodes. Azure hosts the Kubernetes management plane and only bills for the
running worker nodes that host your containers.
Performance Bottlenecks
Azure Monitor
A single management point for infrastructure-level logs and monitoring for most
of your Azure services.
Log Analytics
You can query and aggregate data across logs. This cross-source correlation can
help you identify issues or performance problems that may not be evident when
looking at logs or metrics individually.
Telemetry can include individual page request times, exceptions within your
application, and even custom metrics to track business logic. This telemetry can
provide a wealth of insight into apps.
It uses two types of keys to authenticate users and provide access to its data and
resources.
Does not store any data except for linked service credentials for cloud data
stores, which are encrypted by using certificates.
Information Protection: Always Encrypted provides encryption-in-use (column-level
granularity; data is decrypted only for processing by the client).
CONTROL: Virtual Network firewall rules
DESCRIPTION: Use this feature to allow traffic from a specific virtual network within
the Azure boundary.
Encryption on Azure
(table columns: type; technique or service used; enables encryption of)
SQLDW or Synapse
To achieve the fastest loading speed for moving data into a DW table,
load data into a staging table. Define the staging table as a heap and use
round-robin for the distribution option.
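That staging-table definition might look like the following T-SQL sketch; the table and column names are illustrative.

```sql
-- Staging table defined as a heap with round-robin distribution
-- for the fastest possible load (names are placeholders)
CREATE TABLE dbo.StageSales
(
    SaleId  INT           NOT NULL,
    Amount  DECIMAL(18,2) NULL
)
WITH
(
    HEAP,
    DISTRIBUTION = ROUND_ROBIN
);
```

After loading, data is typically inserted from the staging heap into the final hash-distributed, columnstore-indexed table.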
Distribution Type
IoT Hub, Event Hubs, and Blob storage are the three ways to bring data into Stream Analytics.
For anything related to RBAC identities, in the majority of cases the answer is Service
Principal.
In MySQL, sharding is the best way to partition the data.
o Criteria to select a column for sharding
Unique (data should be well distributed)
Cosmos DB Partition keys should generally be based on unique values.
For an operational database, a nonclustered columnstore index (rather than a
clustered columnstore index) will improve the performance of analytics queries.
You can use Azure Event Hubs, IoT hub, and Azure Blob storage for streaming
data input.
Azure Stream Analytics Supports Azure SQL DB or Azure Blob storage
for reference data input.
Primary key and secondary key grant access to remotely administer the Storage
account.
Event Hubs Capture creates files in Avro format.
If notebooks are involved, or scheduling or autoscaling of clusters, the answer is
Databricks.
RBAC support for Databricks is via Premium clusters.
If a question is based on IoT Hub or Event Hubs, the answer for processing is most
likely Stream Analytics.
If you see the term "relationship", or nodes and vertices, in a Cosmos DB question,
the default option is the Gremlin API.
If hierarchical or big-data-related storage is involved, then ADLS Gen2.
If flat-file-related storage, then Blob storage.