DP 203 Notes

The document provides a comprehensive guide on using Azure Blob Storage, including creating storage accounts, managing files within containers, and implementing security measures such as Access Keys and Role-Based Access Control (RBAC). It also covers the creation of linked services in Azure Data Factory for data integration, utilizing Key Vault for secret management, and generating SAS tokens for secure access. Best practices for managing access and permissions are emphasized throughout the document.

Blob Storage

Blob Storage does not provide a true hierarchy for storing files; the namespace is flat, and folders are only virtual prefixes on blob names.

Root

Container 1 Container 2

File1 File2 Subfolder


If you implement the file structure above and hop from one container to another, you will find that the empty Subfolder disappears. Because the namespace is flat, a virtual folder only exists while at least one blob carries its prefix, so to make it persist you need to add a file inside the Subfolder (File 3 in the second diagram).

Root

Container 1 Container 2

File1 File2 Subfolder

File 3
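The same behaviour can be seen from code. Below is a minimal sketch using the azure-storage-blob Python SDK; the connection string, container, and file names are placeholders. Uploading a blob whose name contains a "Subfolder/" prefix is what makes the virtual folder appear.

from azure.storage.blob import BlobServiceClient

# Placeholder connection string; in practice read it from configuration or Key Vault.
conn_str = "<storage-account-connection-string>"
service = BlobServiceClient.from_connection_string(conn_str)

# "Subfolder/" is not a real folder, just a prefix in the blob name.
blob = service.get_blob_client(container="container2", blob="Subfolder/File3.txt")
with open("File3.txt", "rb") as data:
    blob.upload_blob(data, overwrite=True)

# Deleting Subfolder/File3.txt would make the "Subfolder" disappear again.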

Creating a Storage Account


Search “Storage Accounts”

Click on “Create”

Fill all the relevant details


Review + create

Creating Blob Storage


Once inside the Storage Account, navigate to "Storage browser". Besides Blob containers, it also exposes Table and Queue storage. For example, if a Queue contains 2 messages and you dequeue, Message 1 is removed from the queue.
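As a sketch of the same dequeue behaviour with the azure-storage-queue Python SDK (the queue name and connection string are placeholders):

from azure.storage.queue import QueueClient

queue = QueueClient.from_connection_string("<connection-string>", queue_name="demo-queue")
queue.send_message("Message 1")
queue.send_message("Message 2")

# Receive and delete (dequeue) the first message; only Message 2 remains.
for msg in queue.receive_messages(max_messages=1):
    print("Dequeued:", msg.content)
    queue.delete_message(msg)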

Lifecycle Management Policy


Creating a Pipeline using Data Factory
In this demo we will create a linked service for our on-premises MySQL setup
Since the source is an on-premises MySQL setup, you cannot use the AutoResolveIntegrationRuntime, because that IR only handles cloud-to-cloud data movement. So, in this case we go with a Self-Hosted Integration Runtime.
Give it a Name and Description
Option 1:
If you choose to go with this option, then you just need to install the express
setup on to your local computer.

Option 2:

If you go with this option, download the IR installer and register the IR on any machine you want using one of the two authentication keys.

Target Side Setup

Inside the target Data Lake, make a container and name it as RAW. It will act
as Raw Layer
Set Up Linked Service for Target
Then Create Dataset for both Source and Target
After setting the appropriate properties and selecting the dedicated Linked Service, the dataset automatically detects which IR that Linked Service uses, and you will see the list of tables from the database you specified in the Linked Service.

After that you can Preview the data


Now create a new dataset for the target side i.e. ADLSg2
Select a file format

If you choose to store the data as CSV, you will also be required to provide a file path.
Creating Managed VNet IR
Configure as per the requirements

Go to virtual network
Enable and create

Now, after creating the Managed VNet IR, use it in a Linked Service: from the drop-down choose the Managed VNet IR. This IR requires managed private endpoints.

After selecting the Subscription, Server and Database, you will see there is no Managed Private Endpoint, so create one.

Either you can create it from there, as shown in the figure, or you can create it from the menu bar.
Choose Azure SQL Database.

Choose subscription and server name

So far you have created a managed private endpoint (MPE) and specified the server name inside it. This way the Azure SQL Server instance can detect that some application (ADF in this case) wants to connect to it securely using that MPE.
Go to your Azure SQL Server
Now if you create a Linked Service, choose the Managed VNet Integration Runtime and provide the necessary parameters, you will see that the Managed Private Endpoint is approved.
And create the linked service

Data Lake Security


Two ways to access Data Lake:

a) Anonymous Access
b) Access with Identity

Anonymous Access
 From the Azure Portal search for Storage Account
 Enable hierarchical namespace
 After doing that, navigate to “Containers” on the left-hand side tab of
Storage Account and create some containers
To make a container publicly accessible simply navigate to the container and
click on the option above: Change access level

You can see the access level has been changed. For anonymous access to work, you also have to allow it globally at the Storage Account level as shown below. But always ensure that setting is disabled, because you do not want your data leaked. Even if the anonymous access level for the "public" container shows as enabled, you will not be able to see the content inside it, because you just disabled the access globally at the Storage Account level.

A1) Access Key

 On the left menu, under “Security + networking” you shall see “Access
keys” option
 There will be 2 keys: key1 and key2
 You can use any one of the keys to establish a connection or you can save
these keys as Secret inside a Key Vault
Creating a Linked Service based on Access Keys
Mention your subscription and Storage account name

This method of using Access Keys gives full control over the Storage Account. Also, if you go to the JSON definition of the Linked Service, you will see the encrypted credential details. The connection will run successfully because you are using Access Keys as authentication.
Now, after successfully creating the Linked Service, you can open the code view and see the details of your Access Key under "encryptedCredential". Because this option gives full powers, it is better to have key access disabled at the Storage Account level when you don't need it.

 Also, suppose you have multiple Linked Services for a Storage Account and all of them use Access Keys for authentication. If you suspect that your key has been compromised, you rotate the key, and the Linked Services configured with the old key won't work anymore. This means you will have to use the newly generated key in all the Linked Services that are connected to that Storage Account.

Storage Account Level

ADF-Linked Service level

The connection will not get established, and you will be asked to provide the new Storage Account key (Access key). This means you would have to update this value in all the associated Linked Services.
Best practice is to store the Access Keys inside a Key Vault so that you do not have to hard-code your Storage Account keys explicitly in the definition of any Linked Service.
Next, we will learn how to use a Key Vault to store Access Keys, SAS tokens, or anything else that you would not like to hard-code.

Key Vault
 Create a key Vault
 Choose Subscription
 Choose Resource Group

Another thing to decide is what type of authentication other services will use to connect to Key Vault. There are 2 types, shown below under "Permission model".

Access configuration: choose "Vault access policy" for now. By choosing that option, Key Vault is telling all resources that anyone who wants to connect to it has to be authorized through a vault access policy.
How to save secrets inside a Key-Vault

 Go to “Secret” on menu at left side


 Then Generate/Import a Secret
You will save one of the key values in the Key Vault. However, if you did not define any access policy, you will not be able to create a secret, because an access policy defines what a user can do with the objects (keys, secrets, certificates) inside a Key Vault.

Access Policy for Key Vault


Key Vault supports up to 1,024 access policy entries. Because of this limitation, it is recommended to assign access policies to groups of users rather than to individual users.

1. Select the permissions you want under Key permissions, Secret


permissions, and Certificate permissions.
2. Under the Principal selection pane, enter the name of the user,
app or service principal in the search field and select the
appropriate result.

If you are using a Managed identity for the app, search for and
select the name of the app itself.

3. Review the access policy changes and select Create to save the
access policy.
4. Back on the Access policies page, verify that your access policy is
listed.
Now you have all the secrets inside the Key Vault. The next question is who will come to grab those secrets and how they should talk to Key Vault so that it can provide them.
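For an application running outside ADF, fetching a secret looks roughly like the sketch below, using the azure-identity and azure-keyvault-secrets Python packages. The vault name and secret name are placeholders; DefaultAzureCredential picks up a managed identity when one is available.

from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# DefaultAzureCredential uses the managed identity in Azure, or your az login locally.
credential = DefaultAzureCredential()
client = SecretClient(vault_url="https://<your-key-vault-name>.vault.azure.net/", credential=credential)

# Read the stored Storage Account access key (placeholder secret name).
secret = client.get_secret("storage-access-key")
print(secret.name, "retrieved; value length:", len(secret.value))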

Managed identity
Where to look for it?

Go to any one of the created Resources

On the left-hand menu you should see the Identity option, where you can find the managed identity. At the top there are two tabs: System-assigned and User-assigned; both are managed identities.

Next, go back to the Key Vault and, inside the Access Policy, add the managed identity of whichever resource should be able to access the secrets inside the Key Vault.

Set whatsoever permission you want to grant to the


Managed Identity
Choose the Managed Identity as Principal

But this is not the best practice. We would like to have a separate group
created (inside Microsoft Entra ID). Then as a Principal, you choose that
group.
Creating a group in Microsoft Entra

Go to Manage  Groups  New group

Give some name to the group


Make yourself the owner of the group and add the managed identity of ADF as a member of the group, as shown below.
Because we wanted our ADF to read secrets from Key Vault, we put ADF inside a group and we create an access policy for this group inside the Key Vault. Since ADF is already in the group, we indirectly gave permission to ADF (but in a better way).

Therefore, you should see the name of the group you created
Now, finally go back to Key Vault and create an Access Policy with
necessary permissions and add group that you created inside
Microsoft Entra.

In Principal, give your group that you created inside Microsoft


Entra ID

Now you should see your group inside the Access Policy of Key
Vault
So, long story short, we want our ADF to access Key Vault so we will create a
Linked Service pointing to the Key Vault

 Provide your Key Vault name


 As you are using a System-assigned managed identity, select that
option

This linked service already points to the Key Vault, so we will make use of it. It is like the scenario where you want to reach the Key Vault but do not know its location, so you ask the Linked Service that points to it.

In real life, if you do not know the location of a place, you open Google Maps and search for it. This linked service is something similar: our ADF will ask this Linked Service for the location of the Key Vault.

Now, let us create a Linked Service to the Data Lake, but at the same time ADF will ask the Linked Service pointing to Key Vault to go grab the secret on its behalf. That Linked Service will only fetch the secrets ADF is eligible to see.

Create a Linked Service

Change the Account selection method to Azure Key Vault

The moment you switch to Azure Key Vault, ADF automatically detects that there is already a Linked Service that can show it the path to grab the secret, so ADF can establish a connection to the Data Lake using that Linked Service.
See! When ADF asked the Linked Service (the one pointing to the Key Vault) for directions to the Key Vault, the Linked Service automatically presented the options ADF has (in the form of secret names). Finally, using the secret, the connection was established successfully.
Access to Data Lake using SAS Tokens
This is taken from inside the Data Lake.

This is what an account-level SAS token looks like. Why account-level? Because you are generating it at the Storage Account level.

It is more granular than Access Keys because of these options:

You can choose for what service you want to generate SAS tokens, e.g., Blob, File, Queue, and Table.
What resource types? (Service, Container, Object.) Did you notice one thing? It does not let us pick a specific container, so it is not really that granular. Anyway, let's generate one SAS.

Oopsie! Under the hood, an account SAS is signed with the Access Keys!

One more hint as to why it is called an account-level SAS: at the account level, SAS tokens can be generated for the Blob, File, Queue and Table services, but not for any specific container or directory.
Get the Blob service SAS URL; but we want a Data Lake SAS URL, so simply replace the "blob" keyword with the "dfs" keyword.
Create a new Linked Service (configured to use SAS to connect with the Data Lake):
Authentication type: choose SAS URI
Paste the URL in the section below

Do not forget to replace "blob" with "dfs".
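As a hedged illustration, an account-level SAS can also be generated from Python with azure-storage-blob; note how it is signed with the account key and scoped to services and resource types, not to a specific container. All names below are placeholders.

from datetime import datetime, timedelta, timezone
from azure.storage.blob import generate_account_sas, ResourceTypes, AccountSasPermissions

account_name = "<storage-account>"
account_key = "<access-key>"          # the signing key, e.g. pulled from Key Vault

sas = generate_account_sas(
    account_name=account_name,
    account_key=account_key,
    resource_types=ResourceTypes(service=True, container=True, object=True),
    permission=AccountSasPermissions(read=True, list=True),
    expiry=datetime.now(timezone.utc) + timedelta(hours=1),
)

# Use the dfs endpoint for Data Lake (ADLS Gen2) instead of the blob endpoint.
print(f"https://{account_name}.dfs.core.windows.net/?{sas}")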


Service SAS (actually granular)
Now we have created a new directory with two images inside it, and we want to generate a SAS token for just one image.

Click on the 3 dots on the right side, then click on Generate SAS.

Under the hood, of course, these SAS would also be using Access Keys

But what is this "Stored Access Policy" thing?

This is a policy that you define at the service level (containers, directories), and when you generate a service SAS token you reference the policy. But we have not created a policy yet.

How to create a Policy?


Let us create a Policy for Raw container

Add policy
Then you can grant whatever permissions you want; SAS tokens created with this policy will automatically inherit them.

You should see your created Policy for the RAW container
Now, let’s go create a Service Level SAS token by passing this Policy inside
the token

You can see that the policy we created appears among the options. Notice that whatever policy you defined at the container level is automatically inherited by this token.

At a high level, when you define a stored access policy at the container or directory level, you grant it a set of permissions that describe what can be done in that container or directory. So, if you generate a service SAS token that references the policy, then whichever user, application, or service principal uses that SAS token, the container or directory knows exactly what extent of work that token allows on it.
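A rough Python sketch of the same flow, defining a stored access policy on a container and then issuing a service (blob) SAS that references it via policy_id, is shown below; the container, blob, and policy names are placeholders.

from datetime import datetime, timedelta, timezone
from azure.storage.blob import (BlobServiceClient, AccessPolicy,
                                ContainerSasPermissions, generate_blob_sas)

service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("raw")

# 1) Stored access policy on the RAW container: read/list, valid for 7 days.
policy = AccessPolicy(
    permission=ContainerSasPermissions(read=True, list=True),
    start=datetime.now(timezone.utc),
    expiry=datetime.now(timezone.utc) + timedelta(days=7),
)
container.set_container_access_policy(signed_identifiers={"raw-read-policy": policy})

# 2) Service SAS for a single blob that inherits its permissions from the policy.
sas = generate_blob_sas(
    account_name=service.account_name,
    container_name="raw",
    blob_name="images/picture1.jpg",
    account_key="<access-key>",
    policy_id="raw-read-policy",
)
print(sas)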

User Delegated SAS


 This doesn't use Access Keys; it is signed with a user delegation key obtained via Microsoft Entra ID credentials.
 You generate a user delegation SAS in the same place where you create a service-level SAS token.

Let me show you what I am talking about

Just switch to ‘User delegation key’


From the permissions tab you can choose what type of permissions you want your SAS token to have. Notice that there is no usage of Access Keys at all.
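A hedged sketch of the same idea in Python: the SAS is signed with a user delegation key issued against your Entra ID identity (here via DefaultAzureCredential), not with an account key. Names are placeholders, and the identity needs an appropriate Storage Blob Data RBAC role on the account.

from datetime import datetime, timedelta, timezone
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient, BlobSasPermissions, generate_blob_sas

account_url = "https://<storage-account>.blob.core.windows.net"
service = BlobServiceClient(account_url, credential=DefaultAzureCredential())

# The key itself is issued by Entra ID for a limited time window.
start = datetime.now(timezone.utc)
expiry = start + timedelta(hours=1)
delegation_key = service.get_user_delegation_key(start, expiry)

sas = generate_blob_sas(
    account_name=service.account_name,
    container_name="raw",
    blob_name="images/picture1.jpg",
    user_delegation_key=delegation_key,   # no account key anywhere
    permission=BlobSasPermissions(read=True),
    expiry=expiry,
)
print(sas)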

Role-Based Access Control(RBAC)


Where to check Roles
Resource Group level  Access control(IAM)  Roles

If you want to see the set of permissions of the built-in roles, navigate to the "View" option and you can see the JSON definition of the role.
In that JSON-like structure, "actions" lists what the role can do, and "notActions" lists the things excluded from it.
For example, in case of Contributor role
 Actions = what he can do
 notActions = inside this is what a Contributor cannot do

In a similar way, we can create our own custom role by defining sets of permissions under actions and notActions.

How to assign Roles

Resource Group Level  Access control(IAM)

Navigate to Role assignment


Then select “Add”

Choose “Add role assignment”


And let’s say you want to assign someone a Contributor role
Choose “Privileged administrator roles”

After that you will see 3 options


 “Selected role” is what Role you want to assign
 “Assign access to” is to who you want to assign role to
 “Member” adds members who should have that role

To check different types of Roles defined on a Scope(e.g. Resource Group)


Go to a particular Scope  Navigate to “Access control (IAM)” “Role assignment”
Then you can see list of Security Principals along with their roles as shown in the figure
below.
To check your access/permissions

To check someone else’s permissions


Go to the Scope on which you want to verify access permission
Go to Access control (IAM)
Go to “Check access”
Search the Security Principal
Azure permissions are split into two planes:
Control plane: managing the resource itself; does not grant access to the data
Data plane: grants access to the data itself

To access data in Data Lake:

Roles
Storage Blob Data Owner
Storage Blob Data Contributor
Storage Blob Data Reader

Assigning RBAC to ADF(Managed Identity)


We want to grant ADF access to the Data Lake. Therefore: Scope = Data Lake, Security Principal = ADF, Role = Storage Blob Data Contributor.
This can be broken down in two phases:

In Microsoft Entra ID create a group and add Managed Identity of ADF as a member.
At the Storage Account grant Storage Blob Data Owner/Contributor/Reader role.

Create a group -> Add ADF as a member to the group -> Grant RBAC to that group

Go to your Microsoft Entra ID


Navigate to Groups
You will see list of Groups that had already been created

Add a new group


Give it a name

Add yourself as Owner

Add Managed Identity of ADF as member


After creating the group, you should see the group in the main menu.
 Now, go to the Scope(Resource Group/Storage Account/Container)
 Navigate to Access Control (IAM)
 Grant Data Contributor role to the recently created group
Access Control (IAM) Role assignmentsAdd

Switch to “Privileged administrator roles”


Search for Storage Blob Data Contributor role
As you already created a group, the settings should look like below. If you had not created a group, you would have to choose "Managed identity" again and again for every assignment.

Select member as the recently created group


Now to verify if our ADF could access the Storage Account
Let’s create a New Linked service from ADF to Data Lake that uses RBAC
Choose “System-assigned managed identity”

Here, notice that you only provided the Storage Account name. How does it work? When you give the Storage Account name, ADF authenticates with its system-assigned managed identity, and because that identity (through the group) was already granted RBAC on that Storage Account, the connection succeeds.

To enforce RBAC you must utilize Microsoft Entra.

Setting up CICD for Data Factory


We only want our Dev instance of ADF configured with Git, not the Prod instance of ADF.
Go to ADF that you will use for Dev environment
Click on Manage.
Click on Git configuration option
Click on ‘Configure’

Then choose Azure DevOps Git as Repository Type

Also choose your Microsoft Entra ID


Provide name of organization that you created inside Azure DevOps

Give an appropriate name to your project and click on create.


Entra ID Association with Azure DevOps Organization

 Confirm that your Azure DevOps organization is linked to the correct Microsoft
Entra ID.

o Go to Azure DevOps → Organization Settings → Azure Active


Directory.

o Check if the Azure DevOps organization is connected to the correct


Microsoft Entra ID tenant.
Note: You will be logged out automatically. To log back in, use https://dev.azure.com/<organization_name>.
After this, if you try to choose an Azure DevOps organization name, you should see kaamKarde on the list. But you cannot see any project yet.
Check Azure DevOps Project Visibility

 Ensure that the project visibility settings in Azure DevOps allow it to be


detected by ADF.

o Go to Azure DevOps → Project Settings → General → Visibility.


o Set it to Public or ensure your account has access if it's Private.

If it still doesn’t work


Add your ADF

Collaboration Branch: give name of your Main Branch


Root folder: give its name according to documentation

If you are seeing something like this, it means your Dev instance of ADF is Git-configured.
Now, to verify that you made this setup correctly, go to Azure DevOps and check whether a folder named "/data-factory/" was created.

And yes, it indeed got created.

Also, inside the folder you should see the JSON definition of your Data Factory.
The next step is to make sure Global Parameters are included in the ARM template.
In Azure DevOps:
Create a folder /devops/ and then, in that folder, create the following files (see the example package.json after this list):
 package.json
 adf-build-job.yml
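For reference, a minimal package.json along the lines of the Microsoft documentation for the ADF automated-publish (npm) build looks roughly like this; the package version shown is only an example and may differ in your setup:

{
  "scripts": {
    "build": "node node_modules/@microsoft/azure-data-factory-utilities/lib/index"
  },
  "dependencies": {
    "@microsoft/azure-data-factory-utilities": "^1.0.0"
  }
}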

Go to your repository and click on 3 dots

Create this file, but inside a folder as indicated above

Provide the folder and name of the file


Paste the json script and hit commit

adf-build-job.yml

Copy the YAML script and hit commit

After this, create a script for the pipeline (adf-azure-pipeline.yml). This file is

the definition of our CI/CD pipeline.
Now paste the YAML script.

In the YAML, replace the template variables and provide the path to the adf-build-job.yml file:
subscriptionId: the subscription ID of the Dev environment

resourceGroupName: Resource Group name of Dev Env.

dataFactoryName: Dev instance of Data Factory name


repoRootFolder: path to the data-factory folder where the Data Factory definition is stored

packageJsonFolder: path to the folder where the package.json file is stored

After this create a Pipeline on Azure DevOps


After this just run the pipeline manually to check if it is building a package
Azure Databricks
It is a Data and Analytics platform
Use cases:

 Build enterprise Data Lakehouse


 ETL & Data Engineering jobs
 ML, AI, Data Science
 Data Warehousing, Analytics, BI
 Data Governance and Data Sharing
 Streaming Analytics

Where does Databricks fit in the BI flow?


 We have different types of sources viz. files, DB, API etc. and we dump the data
into our Data Lake.
 So, the data is ingested into DataLake using tools like ADF, Logic Apps or
pipelines in Synapse.
 We also split the data inside the Data Lake into layers to avoid a total mess.
 Once we have data in the RAW layer, we want to start transforming it: applying business rules, handling nulls/duplicates/anomalies, etc. We do that with Databricks.
 Therefore, Databricks reads data from the RAW layer of the Data Lake, does some transformation, and writes the data back to the Data Lake.

Why use Databricks when we used to transform data inside the database?

Earlier, the transformation happened inside the database itself, so the storage and compute infrastructure were tightly coupled. With growing data volume, the same database was being overloaded and processing was not finishing within the desired time range. So companies started to scale up, moving their databases to more powerful servers. But scaling up is quite costly.

 Eventually a point was reached where we could not scale up our resources further, because it was not practically, physically, or economically possible.
 So a solution was developed where, instead of scaling up, we scale out, distributing the task among several worker nodes.
 Since each worker node gets only a small portion of the whole work, processing should be faster and cheaper.

Illustration of Spark Architecture


 Because Databricks was developed on top of Apache Spark, its architecture is important to look at.
 Distribution of work happens automatically.
 Instead of scaling up single machine we scaled out.

 In Spark we have separation of compute from storage.


Provisioning Azure Databricks
 Select Databricks
 Create a Workspace
Fill appropriate fields

Pricing Options

Then create your workspace with default settings for now

Go to Resources Launch
Workspace
Databricks UI

How to create a cluster?


Go to Compute

Then top right corner hit “Create compute”


Name of Cluster

 Policy - you can set a policy that says how many nodes a user can deploy.
 Multi-Node/Single-Node - with Multi Node you will see options including worker type, min workers, and max workers, but this is not the case for Single Node. In the single-node setting the UI will look something like this:
Node type = this refers to the driver node

Cluster Access Mode


Performance

Which software (Databricks Runtime) version you want to use - select it from there.
The Photon acceleration option makes your workloads run faster.

Set when you want your cluster to shut down automatically after a period of inactivity.

After all settings are done hit Create to create your cluster.

Spot instances in case of multi-node cluster

Microsoft must be prepared upfront to serve customers who ask for VMs, so it keeps some spare VMs deployed, and it pays for that idle capacity. Microsoft therefore offers those spare VMs at a cheaper price as spot instances (the orange area in the diagram). But in case of high demand, Microsoft will obviously serve its regular customers first, so your spot VMs can be evicted, taken away from you, and handed over to the regular customers.
Databricks can then replace an evicted spot instance with a regular VM. So, spot instances are VMs from that spare (orange) capacity pool.

This is OK in a dev environment.

Creating a Notebook

New Notebook

Multiple cells in notebook

Languages supported
WE WILL IMPLEMENT THIS:

Read the data from the Data Lake by establishing a connection, and then save it back to the same Data Lake in Delta format.

Connecting Databricks to DataLake

https://learn.microsoft.com/en-us/azure/databricks/introduction/
https://learn.microsoft.com/en-us/azure/databricks/connect/storage/azure-storage

Using Account Keys: Fastest and Worst method

Copy it

Configure the component of the code:

<storage-account>

Grab the Access Key from the Storage Account, paste it inside the notebook, and execute the cell.

Make sure the notebook is attached to your cluster.

At this point the connection is configured.

To grab the data, provide the path to your data:

 storage-account-name
 container-name

Because Databricks defaults to the Delta Lake format, tell it explicitly that the data is in JSON format, as in the sketch below.
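A minimal PySpark sketch of this access-key approach, following the pattern in the linked Microsoft documentation; the storage account, container, and path names are placeholders, and the key is best pulled from a secret scope rather than pasted in:

# Authenticate to ADLS Gen2 with the storage account access key (quick but least secure).
spark.conf.set(
    "fs.azure.account.key.<storage-account>.dfs.core.windows.net",
    "<access-key>")

# Data is JSON, not the default Delta, so state the format explicitly.
df = (spark.read
      .format("json")
      .load("abfss://raw@<storage-account>.dfs.core.windows.net/lego/minifigs.json"))
display(df)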
Saving data back to DataLake as Delta

Create a new container to save data

Create the conformed container manually inside DataLake
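Writing the dataframe back to the new conformed container in Delta format can look roughly like this, reusing the df from the previous sketch (container and folder names are placeholders):

# Save the dataframe to the conformed layer as a Delta table (parquet files + _delta_log).
(df.write
   .format("delta")
   .mode("overwrite")
   .save("abfss://conformed@<storage-account>.dfs.core.windows.net/minifigs"))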


Lecture 32

Pin Cluster
Multiple Cluster

 You can create different clusters with different configurations and different purposes depending on the workload, and then it is up to you which cluster you attach your notebook to.
 If a cluster is terminated or stopped, you will not pay for it.

Workspace Menu
Workspace

You can organize your notebooks in some logical order

 Shared: accessible to everyone


 Users:
If you would like to organise your notebooks in some folder

Create a new folder

Inside the project create a notebook


So, after the creation under workspace, under Project you will see your
notebook

Then you can set permissions by right click on the Project Folder like who
should have access to this Project Folder and content inside it

Apart from this all the users will have their own workspace
So, you can also create folders, notebooks within your workspace

If you want to version control your notebooks


Recent shows the notebooks you have been working on recently.

You can see every change made to your notebook from there

Databricks Utilities
Most commonly used
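The most commonly used ones are the file system utilities; a few illustrative calls (paths are placeholders apart from the built-in sample datasets):

# File system utilities
display(dbutils.fs.ls("/databricks-datasets"))       # browse the built-in sample datasets
dbutils.fs.mkdirs("/tmp/demo")                       # create a folder
dbutils.fs.head("/databricks-datasets/README.md")    # peek at the start of a file

# Built-in help
dbutils.help()
dbutils.fs.help()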

Sample Datasets

Reading some of those datasets


But where are all those sample datasets coming from?

Databricks managed RG

While provisioning a Databricks workspace, we must indicate a resource group (RG) inside which the workspace will be located. That RG may contain many other resources like ADF, Logic Apps, ADLS Gen2, etc.

But if you go to the Azure portal,

you will see an additional RG whose name begins with "databricks". This means that the Databricks workspace also creates a secondary, managed RG in which it creates many resources of its own.

If you go inside that Storage Account which is managed by Databricks


Indeed a DataLake

Inside the containers


You will see some containers

If you go inside any one of those containers

You will see an error like this: you cannot see what is inside. This RG is given to the Databricks workspace to deploy all its resources (VMs, networking, network interface cards, etc.). Also, all those sample datasets come from a container in that managed Storage Account (from root, specifically).
Loading CSV data from Storage Account to Databricks
Establish connection with Storage Account

Go to Data Lake and grab primary endpoints

Paste it

Change the protocol, indicate which container you want to use, and this dbutils call will display the content of your Data Lake.

Now to read the content of CSV file

You will see this error.

The issue is that you used square brackets in the path, so escape them.


Problem

Solution

Verify the schema of csv


To define schema explicitly
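A sketch of reading the CSV with an explicit schema instead of relying on inference; the file path and column names are placeholders:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("first_name", StringType(), True),
    StructField("last_name", StringType(), True),
])

df = (spark.read
      .format("csv")
      .option("header", "true")
      .schema(schema)                 # skip inference, enforce these types
      .load("abfss://raw@<storage-account>.dfs.core.windows.net/people.csv"))
df.printSchema()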

Loading CSV using SQL


Exchanging Data between languages

This will fail

So, you have to register the dataframe as a table first


SQL to Python
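A rough sketch of exchanging data between Python and SQL in a Databricks notebook (table and view names are placeholders): register the dataframe as a temporary view so SQL can see it, then read SQL results back into Python.

# Python cell: make the dataframe visible to SQL
df.createOrReplaceTempView("people")

# Python cell: run SQL from Python and get a dataframe back
grouped = spark.sql("""
    SELECT last_name, COUNT(*) AS cnt
    FROM people
    GROUP BY last_name
    ORDER BY cnt DESC
""")
display(grouped.limit(10))

# In a %sql cell, the result of the last query is also exposed back to Python as _sqldf.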

Transformations:
Suppose you want to group the rows by last name

First 10 rows:
Lecture 33
Connecting to ADLSg2

1) Unity Catalog
Databricks is promoting this method; they consider it best for data governance.
2) Legacy Methods
Other methods that existed prior to Unity Catalog.
a) Access via Account Key
Just a string that gives full access to a Storage Account (Data Lake).
b) Access via Service Principal
Simply an identity explicitly created in Azure that you can later grant permissions to.
You will create a Service Principal and grant it access to the Data Lake. For this you can use either RBAC or ACLs.

Go to Azure Portal -> Microsoft Entra ID

Inside Microsoft Entra ID you can create Service Principals.

Go to App Registration
Then click “New registration”

Give name to it

And hit “Register”


You can create several types of credentials for this Service Principal, but here we are using client secrets.

Create a new Client Secret.

Give it a name.

You will get a secret value that can be used later; copy it now, because it is only shown once.

So far the Service Principal is created. The next part is granting this Service Principal RBAC on the Data Lake.

Storage Account Level

Go to Access Control -> Role Assignment

You will see all the role assignments under this. If there is already a group present that has the same set of permissions you want your newly created Service Principal to have (the Storage Blob Data Contributor role in this case), then you simply add your Service Principal to that group. To do this, go to:

Microsoft Entra ID -> Groups -> find your group -> add a new member (the Service Principal in this case) to the group.
This assumes there is already a group called "Data-Lake-Contributor" that has the permissions we want our Databricks notebook to use.
Now, if you go to Databricks, create a new notebook, and try to connect to the Data Lake, the connection should fail because you haven't configured Databricks yet.
As per the documentation

https://learn.microsoft.com/en-us/azure/databricks/connect/storage/azure-storage

You can do the configuration in 2 ways: either scoped only to the present notebook, or for the whole cluster. For the whole cluster means whatever notebook you create will already have this configuration.

Fill the service_credential part: this holds the secret of the Service Principal you created earlier in Microsoft Entra ID (remember, you created an App Registration for it).
In place of storage-account, give the name of your Data Lake.

Then provide the application-id.

This is just the identifier of the Service Principal.

To find it: Microsoft Entra ID -> App registrations -> All applications -> your service principal -> Application (client) ID.
Finally, mention the directory-id (tenant ID).
Neither of these two is a secret, so it is fine to have them in the notebook. A sketch of the full configuration follows.
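Putting it together, the notebook-scoped configuration from the linked Microsoft documentation looks roughly like this (the storage account, secret scope, and secret key names are placeholders):

# Service principal (OAuth) access to ADLS Gen2; the client secret comes from a secret scope.
service_credential = dbutils.secrets.get(scope="<scope>", key="<service-credential-key>")

spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net",
               "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net",
               service_credential)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net",
               "https://login.microsoftonline.com/<directory-id>/oauth2/token")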

c) Access via SAS tokens


Go to Storage Account -> Shared access signature
Generate SAS token

Copy the code from the documentation

Fill in the storage account name and provide the value of the SAS token, roughly as in the sketch below.
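The corresponding notebook configuration from the documentation, sketched with placeholders:

# SAS-token access to ADLS Gen2; the token itself is best kept in a secret scope.
spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type.<storage-account>.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set("fs.azure.sas.fixed.token.<storage-account>.dfs.core.windows.net",
               dbutils.secrets.get(scope="<scope>", key="<sas-token-key>"))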

d) Secret Scopes

In this you utilize Azure Key Vault

Then inside it you save secret of your Service Principal


Then inside Databricks you create a Secret Scope. It is just a way to access an Azure Key Vault from Databricks.

Those Secret Scopes work only if the Key Vault is configured to use the Access Policy permission model.
Go to the Key Vault and add your secret to it.
Inside the value, give the secret of the Service Principal that you created earlier in Microsoft Entra ID during App Registration.

From Databricks side


https://learn.microsoft.com/en-us/azure/databricks/security/secrets/secret-scopes

The scope-creation URL has two parts: the address of your Databricks workspace instance, followed by #secrets/createScope.

Copy and paste the URL.

"Manage Principal" defines who will be able to manage permissions and access the secret scope.
The DNS name and Resource ID tell Databricks which Key Vault you are talking about.

Key Vault -> Properties

DNS Name = Go to Key Vault -> Properties -> Vault URI

Resource ID = Resource ID of Key Vault

Create Scope

Go to Key Vault -> Access Policy

You will see new policy created as shown below


Now go to documentation to grab the code

secret-scope = the name of the secret scope you created

key = the secret you want to read from it

You can also list all the secret scopes, and list all the secrets available in the Key Vault (a sketch follows).
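For example (scope and key names are placeholders):

# Read a secret from the Key Vault-backed scope
client_secret = dbutils.secrets.get(scope="<secret-scope>", key="<secret-name>")

# List all secret scopes defined in the workspace
dbutils.secrets.listScopes()

# List the secrets (names only, never values) available in one scope
dbutils.secrets.list("<secret-scope>")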

3. Deprecated Method
a) Access via Mount Points

Earlier you had to provide the protocol (abfss), the container name (raw), the path to your storage account (tybuldatalakedemo), suffixed by dfs.core.windows.net.

Mount points simplify the access patterns to our data in a Data Lake.

They have lots of disadvantages, and they are not supported with Unity Catalog.

Our ADLS Gen2 will have some hierarchy.

When you are creating a Mount point:

 provide a location to which it should point. E.g. our mount point should
point to our DataLake
 Indicate which protocol(driver) to be used to connect to DataLake. In
our case we are using abfss
 Credentials(how to connect to DataLake)
Basically, all these 3 things together are encapsulated inside the mount point, so end users later on do not have to care about them and can use the mount point like a regular path.

How to create it?


https://learn.microsoft.com/en-us/azure/databricks/dbfs/mounts?
source=recommendations

From Documentation
source = the path your mount point should point to: provide your container and storage account.

mount_point = the name/path you want the mount point to be called.

extra_configs = specifies how you want to connect to the Data Lake. We are using Service Principal based authentication, as we did earlier: provide the application-id, directory-id and secret. A sketch follows.
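Roughly, following the mounts documentation (all angle-bracket names and the container are placeholders):

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope>", key="<service-credential-key>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://raw@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/raw",
    extra_configs=configs,
)

dbutils.fs.mounts()                   # list existing mount points
display(dbutils.fs.ls("/mnt/raw"))    # query the mount point like a regular path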

Listing the mount points and querying a mount point show the locations they point towards.

b) AAD Credential Passthrough

 Available with the Premium tier.
 Requires some configuration on the cluster, under Advanced options while creating the cluster.

Users are granted some access on Storage Accounts, and whenever those users connect through Databricks to do their job, they are given the same permissions that they have on the Storage Account (Data Lake).

Therefore, a user's permissions are checked based on the permissions they have on the Storage Account.

In this way, each user has a different set of permissions on the Storage Account and, based on that, a different level of control over the data. Meaning different users see different views of the data.

Databricks:

Create a New Notebook -> make sure you connect to your new cluster

Disadvantage:
Doesn’t work with ADF

Not supported with workflows

Lecture 35

We want to save our results of our Data Transformation

In this demo we are saving the transformed data back on DataLake.


Whenever you have to connect from Databricks to the Data Lake, you have to configure your notebook using one of the methods discussed above.

Here we are using the Access via Service Principal method.

Earlier you pasted the configuration code in the first cell of the notebook. In this method, you instead put that code on the cluster, as shown below.

Cluster Configuration:

Go to compute
Then to your desired cluster that you wish to use

Navigate to “Edit” at top right corner


Navigate to “Advanced options”
This is the place where you must paste the code. And then you would have
to restart the cluster

Then go back to your notebook to verify that connectivity worked fine

You could see that you are able to list the content therefore connectivity is
established.

In this demo we are configuring the access settings not at notebook level but
cluster level.

Preparing the data:

In this step, read whatever data you want, but at the end turn the data into a dataframe and register the dataframe as a table/view. Once it is in table form, you can use SQL-like queries.
Transform your data

In this apply whatsoever transformation you want to apply

Saving the data to ADLSg2


Create a new container(curated)

We will use this container to save the transformed data.

There are multiple ways to save data to the Data Lake, but we will save our data in Delta format.

We'll just grab the data and save it as Delta.

The result of the SQL transformation is available in _sqldf.

You will use a Python command such as the sketch below:
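A minimal sketch (container and folder names are placeholders):

# _sqldf holds the result of the previous %sql cell; persist it to the curated layer as Delta.
(_sqldf.write
   .format("delta")
   .mode("overwrite")
   .save("abfss://curated@<storage-account>.dfs.core.windows.net/minifigs"))

# Read it back to verify.
display(spark.read.format("delta")
        .load("abfss://curated@<storage-account>.dfs.core.windows.net/minifigs"))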

If you revisit your curated container, you will see parquet files along with the _delta_log folder. Then, to verify that you can read that curated data, simply read it in Databricks using a PySpark command.

Therefore, we are able to read data

Save the data as managed tables


So far you had to provide the full path to the location where the data is stored, and the format had to be specified. But if you want to make other developers' lives easier, you may want to hide all those details and expose an object with a user-friendly name that they can use in their queries.
Register your table in the Catalog (a metastore); that will allow other users to use it.

How to do that?

You create a database (a logical container that stores the definitions of your objects).

At this moment our database is created somewhere.

Then you create a table. You have to indicate which database it should go in, then the name of the table, and then write a query that will populate the table (see the sketch below).
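Roughly, using spark.sql from Python; the database, table, and source view are placeholders following the lecture's minifigs example:

# Create a database and a managed table populated from a transformed view.
spark.sql("CREATE DATABASE IF NOT EXISTS lego")

spark.sql("""
    CREATE TABLE IF NOT EXISTS lego.minifigs
    AS SELECT * FROM minifigs_curated
""")

# Any user can now query it without knowing where the files live.
display(spark.sql("SELECT COUNT(*) FROM lego.minifigs"))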
The benefit of creating this table in a database is that any end user can read the data without explicitly mentioning the path of the source in your Data Lake.

Therefore, your data is persisted somewhere.


How to see what databases and what tables are available in our
environment:

Go to Catalog

You can also see you created database

Also, the tables available


In the Overview you can see some structure of the table, Sample Data

The data is using delta format

Where is the data getting stored? It uses delta format, and delta is a file
format so file has to be stored somewhere

Now if you go to Details you will see:


Type = Managed (by Databricks)

Location = dbfs(Databricks File System)

Even though you are the admin of the whole Azure subscription, you won't have permission to see the data stored in the Data Lake managed by Databricks.

The only way to access that data is through Databricks. What if some other service, like ADF, would like to connect to that data? That is not possible without going through Databricks. And because you have no control over that Data Lake, you also cannot set permissions on it using RBAC or ACLs.

We created a database and, inside it, a table (e.g. minifigs). The data is not stored inside Databricks itself but in a storage layer. Because you did not provide a path, Databricks stored it inside its managed RG. The flow is shown below.

To be specific, the definitions of the database and table are stored inside the Hive Metastore (which contains the metadata of our data).
Next, we tell Databricks to register a table whose data lives in our own Data Lake (an external table).

Go to your notebook

Create a table

Just a little update: provide a LOCATION pointing to your Data Lake. Apart from that, everything in the code is similar (a sketch follows).
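For example (the path and names are placeholders):

# External table: same DDL, but the data lives at an explicit location in our Data Lake.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lego.minifigs_external
    USING DELTA
    LOCATION 'abfss://curated@<storage-account>.dfs.core.windows.net/minifigs'
""")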

Now, to verify, if you go to the Catalog you should see your external table, which means it is registered in the Catalog.

And now if you go to Details:

Type = External

Location = a container inside your Data Lake

You can even see the sample data.

But how come you can see sample data while browsing the Catalog, go from there to the Data Lake, and query the data?

Answer: we did the configuration to connect to the data at the cluster level.


Uploading data quickly:

Drag and drop file in the canvas


After uploading the file you specify where to save the data
 At which catalog
 At which schema
 At which table name

Here, "schema" is used analogously to "database".

Create the table

After that you should see the new table.

But you will find that it is saved as a managed table (managed by Databricks).

Database vs Schema
Go back to your notebook and use one of the tables present in the Hive Metastore. The table should be external.

Add a row in your external table

INSERT:
If you rerun the count statement you should see one more row added to the
external table

To check that the row was really added, verify it at the scope of the Storage Account: you can see a new file was created.

You can also use a SELECT statement to verify.

Likewise you can update and delete data

You can see the updated name


Delete the row:

As the row no longer is present you will see result like this:

Identity Column
In many cases, e.g. slowly changing dimensions (SCD), we would like to have some kind of artificial ID column that we can use to uniquely identify a row. It is important in the case of SCD type 2, where we may have many versions of the same row with the same business key, which means we cannot use that business key to identify the row.

In Databricks we can do that with the identity column property, roughly as sketched below.
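A sketch (table and column names are placeholders):

# Delta table with an automatically generated surrogate key.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lego.dim_minifigs (
        id BIGINT GENERATED ALWAYS AS IDENTITY,
        fig_num STRING,
        name STRING
    ) USING DELTA
""")

spark.sql("INSERT INTO lego.dim_minifigs (fig_num, name) VALUES ('fig-000001', 'Toy Store Employee')")
display(spark.sql("SELECT * FROM lego.dim_minifigs"))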

After running this command your table will get registered in the Catalog

It would be empty. Now insert data to the above table

Now query the table. You should see the ID column whose values would be
automatically provided by Databricks
Lecture 35

Automating the process with Azure Databricks Autoloader


We know that we have multiple data sources like CSV, JSON, XML, databases, APIs, etc.

We do not want to develop any reporting solution directly on top of the source data. Therefore, we ingest the data into our Data Lake, transform it, and then build reporting solutions.

For ingestion we used ADF.

While ingesting data, we try to avoid a data swamp, so we make different layers within our Data Lake.

Inside the RAW layer we keep the data in its native state, meaning as it is.

But we know that there is another file format that is better and more optimal for data analytics.

Therefore, the next thing we do is convert the data into Delta format. Usually the place where we save the Delta files, following the Microsoft documentation, is a Conformed container. You can also name this container Silver; it is totally up to the business requirement.
It is also up to you whether you only convert the data types or do some other transformation as well.
Agenda: grab the data from the RAW layer, convert the data types and file format to Delta (and maybe some other steps), and save it to another layer. This process can be achieved with Autoloader.

Autoloader: a feature provided by Databricks

Environment Setup

Inside the RAW container, create an AutoLoaderInput directory. Suppose we have daily processing in which ADF connects to data sources, grabs the data, and saves it into the AutoLoaderInput directory in CSV format (in our demo). Now, we want Autoloader to detect those files, pick them up, process them, and save them as Delta in a different location, i.e., a separate directory inside the Conformed container. Additionally, we want to register the newly created data inside the Databricks Catalog so that you can easily use the data in your queries.
For the demo, we manually uploaded a CSV file inside RAW -> AutoLoaderInput.

But in real scenarios this task would be done by ADF.

Code for Autoloader

 file_path = location from which Autoloader should detect and read files (the input for Autoloader)
 target_path = location where the output should be saved in Delta format (the conformed container)
 checkpoint_path = needed by Autoloader to track its progress by referring to this location
 table_name = name of the database and table to be used in the Hive Metastore Catalog

spark.readStream = a hint that Autoloader uses Structured Streaming under the hood

.format("cloudFiles") = telling Databricks that we want to use Autoloader

.option("cloudFiles.format", "csv") = format of the input files for Autoloader

.option("cloudFiles.schemaLocation", checkpoint_path) = location where the schema of the input data processed by Autoloader is saved

.load(file_path) = read the input files from the given location

We are adding 2 columns: source_file and processing_time.

.writeStream = once we have the data, this writes it somewhere, and that somewhere is defined using .option("path", target_path). While doing that it records its progress to .option("checkpointLocation", checkpoint_path).

The trigger asks the stream to process new data every 1 second.

You also ask it to register the data as a Hive Metastore table using .toTable(table_name). A sketch of the whole script follows.
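Putting the pieces together, a hedged sketch of the Autoloader script (paths, database, and table names are placeholders):

from pyspark.sql.functions import col, current_timestamp

file_path = "abfss://raw@<storage-account>.dfs.core.windows.net/AutoLoaderInput/"
target_path = "abfss://conformed@<storage-account>.dfs.core.windows.net/autoloader_output/"
checkpoint_path = "abfss://conformed@<storage-account>.dfs.core.windows.net/_checkpoints/autoloader_demo/"
table_name = "lego.autoloader_demo"

(spark.readStream
    .format("cloudFiles")                          # use Autoloader
    .option("cloudFiles.format", "csv")            # input files are CSV
    .option("cloudFiles.schemaLocation", checkpoint_path)
    .load(file_path)
    .select("*",
            col("_metadata.file_path").alias("source_file"),
            current_timestamp().alias("processing_time"))
    .writeStream
    .option("checkpointLocation", checkpoint_path)
    .option("path", target_path)
    .trigger(processingTime="1 second")
    .toTable(table_name))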

Now, to verify that your file inside the AutoLoaderInput directory was read and then written to the conformed container, simply query the database and table used earlier.

Since you already registered the data in the Hive Metastore when you wrote the Autoloader script,

it should be available.
Now check how our table looks in the Catalog:

indeed it is an External table, and the data is present in our Data Lake.


Schema Inference

The first command will drop your table, and the last 2 commands will remove the data from the Data Lake.

We will ask Autoloader to infer the schema (data types) of the columns while loading the files.

Schema Evolution

Suppose the source starts producing some new columns. On the other hand, Autoloader has been saving the schema of previously read files inside the checkpoint_path.
Inside the _schemas directory you will have different versions. Those versions are files containing the schemas Databricks inferred; in this demo, reading 2 files with different columns produced 2 versions, and a new version file is added each time the inferred schema changes.

If you open those files, you can see the columns.

If you want Autoloader to handle the schema mismatch, you use the schema evolution option (see the demo below).

After rerunning the program, you will see the new column has been added,

and for the rest of the rows the value will be null.

There are different modes for Autoloader to address schema evolution (set via cloudFiles.schemaEvolutionMode):

Mode 1 - addNewColumns (the default): if there is a schema mismatch the stream fails, but the new columns are added to the schema, and the process can then be restarted.

Mode 2 - rescue: the schema will not evolve automatically, and Autoloader will not fail on a schema mismatch. The data from new columns is saved in a special "rescued data" column.

Mode 3 - failOnNewColumns: the process simply fails if there is a schema mismatch. New columns are not stored anywhere; you must manually update the schema or delete the file that is creating the error (a strict mode).

Mode 4 - none: any new columns are simply ignored.

Demo (Rescue Mode)

Change the schema evolution mode (e.g. .option("cloudFiles.schemaEvolutionMode", "rescue")).

Start the stream.

Now, if you rerun your SELECT statement,

you will see something like this:

If later the business wants to incorporate the "Salary" column, you can simply parse the rescued data column, retrieve the value, and put it inside a new column.

Error Handling

Suppose you are uploading a file that has an "Age" column which accepts integer values, but one row has a string entry in the "Age" column.

If you are using rescue mode, you will then see an entry something like the one shown below.

You would realise the source data is rubbish, and you clean it by updating the value in the Age column and changing its data type.
Batch Processing
There may be situations where ADF dumps data into the container monitored by Autoloader only once a day. Suppose we configured our pipeline to run at midnight; that would mean our Autoloader sits idle most of the time, and under the hood Autoloader uses a Databricks cluster as its compute layer, so you would pay even when nothing is being done.

Change the trigger mode, for example as sketched below.
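A hedged sketch, reusing the variables from the earlier Autoloader script:

# Batch-style run: process everything currently available, then stop, so the cluster
# does not sit idle between daily loads.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", checkpoint_path)
    .load(file_path)
    .writeStream
    .option("checkpointLocation", checkpoint_path)
    .option("path", target_path)
    .trigger(availableNow=True)          # instead of the continuous 1-second trigger
    .toTable(table_name))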

File Detection Mode


But how would Autoloader know which files to process and which files it has already processed?

Autoloader saves its progress inside the checkpoint location; this way it knows what it has already done.

Another thing is that we configure an input location for Autoloader so that it can detect whether something new has been uploaded there. For this, Autoloader uses 2 modes:
a) Directory listing mode

Every time Autoloader starts, it goes to the input location and lists everything stored in the directory. If there are lots of files, the listing process might take some time and might cost some money.
b) Notification mode

Some process is uploading files to the data lake. In Azure, many events get created; for example, when a file is uploaded to a data lake, a Blob Created event is fired, and multiple services can react to that event.
Under the hood, Autoloader uses Event Grid to manage those events and subscribe to them.
Autoloader subscribes to the file-created events for the particular location in the Data Lake. Whenever such an event happens, it is saved as a message in a queue that is part of a Storage Account, and each message contains information about what file was created and where it is stored.

Lecture 39
Azure Synapse Analytics
Apache Spark is good at processing huge amounts of data, and the tool we have been using so far that uses Spark under the hood was Databricks. Databricks is not a product developed by Microsoft; maybe your company has a rule not to use 3rd-party solutions like Databricks. Microsoft gives you the option to use Spark Pools in Synapse.

In reality it is difficult to work with plain Spark because it involves configuring various nodes, so we were using Databricks, which made it easy to deploy clusters without caring about the infrastructure.

But if we don't want to leave the Azure environment, we can do the same thing we have been doing inside Databricks by deploying clusters as Spark Pools. Therefore, Spark Pools can also be used for data transformation.
One advantage of using Spark Pools is that they integrate well with other Azure services.

Setting up Synapse Workspace

Search “Synapse Analytics”

Click on “Create”
You will have to provide a separate Data Lake for Synapse, where it will hold catalog data and metadata associated with the workspace.

Suppose we created the following Storage Account for Synapse. The next step is to create a linked service pointing to the Data Lake dedicated to the Synapse Workspace.
Go to your Synapse Workspace

To open UI
How to make a Spark Pool: it is something like a cluster in Databricks, simply compute power that we will use to run queries in our notebooks.

Synapse -> Manage


Create a New Spark Pool

Configure it properly
Minimum number of nodes: always 3.

The smallest pool you can create consists of 3 nodes, whereas in Databricks we could create a single-node cluster.

Autoscale means the number of nodes can be increased on demand, but not above the configured maximum (30 here). Also, as in Databricks, the pool can be paused automatically when there is no activity.

Apache Spark version: Synapse is always behind Databricks in releasing new versions of Spark.
Data Tab

The Data tab has 2 inner tabs,

and inside the Linked tab you will see all the linked services related to data that are defined in the workspace.
It also allows you to browse the content of the Data Lake without leaving the UI.

If you open the RAW container into which you were ingesting data:

here we did not configure the security part yet.
One more cool thing: if you right-click on a file and import it into a notebook, code will automatically be generated and you can run it.

Suppose you are creating a notebook on a Spark Pool: you specify which pool the notebook should be attached to.
Displaying Data

In Synapse the utility library is mssparkutils (instead of dbutils), and cell magics use %% (e.g. %%sql), not %.
Saving data as delta

Synapse Spark doesn't know what _sqldf is (that variable is Databricks-specific), so explicitly assign your transformed results to a dataframe, as in the sketch below.

Again, read the delta file.
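A hedged sketch for a Synapse notebook (paths and table names are placeholders):

# No _sqldf in Synapse: run the SQL from PySpark and keep the result explicitly.
df = spark.sql("SELECT last_name, COUNT(*) AS cnt FROM minifigs GROUP BY last_name")

# Synapse does not default to Delta, so state the format when writing...
df.write.format("delta").mode("overwrite") \
  .save("abfss://curated@<storage-account>.dfs.core.windows.net/minifigs_agg")

# ...and again when reading it back.
df2 = spark.read.format("delta") \
  .load("abfss://curated@<storage-account>.dfs.core.windows.net/minifigs_agg")
display(df2)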

Creating Database

Populate the table


Now read data from this table.

It means that the data was saved somewhere and the minifig table was registered somewhere. But where?
If you open it,

you can see the minifig table with its metadata.


Remember, in the Databricks Hive Metastore we were able to see the properties of tables in the UI. In the case of Synapse, you have to write a script.

Now you are interested in where minifig is getting stored:

that's the primary storage of the Synapse Workspace.

One more thing: Databricks saves tables as Delta by default, whereas Synapse doesn't.

So if you want to save a file as Delta, you have to tell Synapse explicitly.
Saving Data to our Data lake
Lecture 41
Transforming data with Data Flows

What are data flows?


Data flows are visually designed data transformations in Azure Synapse
Analytics. Data flows allow data engineers to develop data transformation
logic without writing code. The resulting data flows are executed as activities
within Azure Synapse Analytics pipelines that use scaled-out Apache Spark
clusters. Data flow activities can be operationalized using existing Azure
Synapse Analytics scheduling, control, flow, and monitoring capabilities.

Data flows provide an entirely visual experience with no coding required.


Your data flows run on Synapse-managed execution clusters for scaled-out
data processing. Azure Synapse Analytics handles all the code translation,
path optimization, and execution of your data flow jobs.

Authoring data flows


Data flow has a unique authoring canvas designed to make building
transformation logic easy. The data flow canvas is separated into three parts:
the top bar, the graph, and the configuration panel.
Graph

The graph displays the transformation stream. It shows the lineage of source
data as it flows into one or more sinks. To add a new source, select Add
source. To add a new transformation, select the plus sign on the lower right
of an existing transformation. Learn more on how to manage the data flow
graph.
Configuration panel

The configuration panel shows the settings specific to the currently selected
transformation. If no transformation is selected, it shows the data flow. In the
overall data flow configuration, you can add parameters via
the Parameters tab.

Each transformation contains at least four configuration tabs.

Transformation settings

The first tab in each transformation's configuration pane contains the


settings specific to that transformation.

Optimize

The Optimize tab contains settings to configure partitioning schemes.


Inspect

The Inspect tab provides a view into the metadata of the data stream that
you're transforming. You can see column counts, the columns changed, the
columns added, data types, the column order, and column
references. Inspect is a read-only view of your metadata. You don't need to
have debug mode enabled to see metadata in the Inspect pane.

As you change the shape of your data through transformations, you'll see the
metadata changes flow in the Inspect pane. If there isn't a defined schema
in your source transformation, then metadata won't be visible in
the Inspect pane. Lack of metadata is common in schema drift scenarios.
Top bar

The top bar contains actions that affect the whole data flow, like validation
and debug settings. You can view the underlying JSON code and data flow
script of your transformation logic as well.

Getting started
Data flows are created from the Develop pane in Synapse studio. To create a
data flow, select the plus sign next to Develop, and then select Data Flow.

This action takes you to the data flow canvas, where you can create your
transformation logic. Select Add source to start configuring your source
transformation.

Source transformation in mapping data flows


Data flows are available both in Azure Data Factory and Azure Synapse
Pipelines.

A source transformation configures your data source for the data flow. When
you design data flows, your first step is always configuring a source
transformation. To add a source, select the Add Source box in the data flow
canvas.
Every data flow requires at least one source transformation, but you can add
as many sources as necessary to complete your data transformations. You
can join those sources together with a join, lookup, or a union
transformation.

Each source transformation is associated with exactly one dataset or linked service. The dataset defines the shape and location of the data you want to write to or read from. If you use a file-based dataset, you can use wildcards and file lists in your source to work with more than one file at a time.

Integration datasets
This option reuses one of the datasets we created previously. It lets us choose a dataset that is available and visible in the scope of the whole Synapse Workspace.

Inline datasets
These persist only within the scope of a particular data flow and cannot be used outside it.

Inline datasets are recommended when you use flexible schemas, one-off
source instances, or parameterized sources. If your source is heavily
parameterized, inline datasets allow you to not create a "dummy" object.
Inline datasets are based in Spark, and their properties are native to data
flow.

Schema options

Because an inline dataset is defined inside the data flow, there isn't a
defined schema associated with the inline dataset. On the Projection tab, you
can import the source data schema and store that schema as your source
projection. On this tab, you find a "Schema options" button that allows you to
define the behavior of ADF's schema discovery service.

 Use projected schema: This option is useful when you have a large
number of source files that ADF scans as your source. ADF's
default behavior is to discover the schema of every source file.
But if you have a pre-defined projection already stored in your
source transformation, you can set this to true and ADF skips
auto-discovery of every schema. With this option turned on, the
source transformation can read all files in a much faster manner,
applying the pre-defined schema to every file.
 Allow schema drift: Turn on schema drift so that your data flow
allows new columns that aren't already defined in the source
schema.
 Validate schema: Setting this option causes the data flow to fail if
any column and type defined in the projection doesn't match the
discovered schema of the source data.
 Infer drifted column types: When new drifted columns are
identified by ADF, those new columns are cast to the appropriate
data type using ADF's automatic type inference.
Workspace DB (Synapse workspaces only)
In Azure Synapse workspaces, an additional option is present in data flow
source transformations called Workspace DB. This allows you to directly pick a
workspace database of any available type as your source data without
requiring additional linked services or datasets. The databases created
through the Azure Synapse database templates are also accessible when you
select Workspace DB.
Source settings
After you've added a source, configure via the Source settings tab. Here
you can pick or create the dataset your source points at. You can also select
schema and sampling options for your data.

Development values for dataset parameters can be configured in debug settings. (Debug mode must be turned on.)
Output stream name: The name of the source transformation.

Source type: Choose whether you want to use an inline dataset or an existing dataset object.

Test connection: Test whether or not the data flow's Spark service can
successfully connect to the linked service used in your source dataset.
Debug mode must be on for this feature to be enabled.

Schema drift: Schema drift is the ability of the service to natively handle
flexible schemas in your data flows without needing to explicitly define
column changes.

 Select the Allow schema drift check box if the source columns
change often. This setting allows all incoming source fields to flow
through the transformations to the sink.

 Selecting Infer drifted column types instructs the service to detect and define data types for each new column discovered. With this feature turned off, all drifted columns are of type string.

Validate schema: If Validate schema is selected, the data flow fails to run
if the incoming source data doesn't match the defined schema of the
dataset.

Skip line count: The Skip line count field specifies how many lines to
ignore at the beginning of the dataset.
Sampling: Enable Sampling to limit the number of rows from your source.
Use this setting when you test or sample data from your source for
debugging purposes. This is very useful when executing data flows in debug
mode from a pipeline.

To validate that your source is configured correctly, turn on debug mode and fetch a data preview. For more information, see Debug mode.

Source options
The Source options tab contains settings specific to the connector and
format chosen. This includes details like isolation level for those data sources
that support it (like on-premises SQL Servers, Azure SQL Databases, and
Azure SQL Managed instances), and other data source specific settings as
well.

Projection

Shows the schema of the input and output data, giving you an idea of what the data looks like when it enters the transformation and how it will look when it leaves. Although data flows are a visual tool, they still need compute power to execute the steps; for pipelines, Integration Runtimes were our compute infrastructure.

Debug Mode tells Azure to prepare the Integration Runtime that will execute the various steps in our data flow.

Debug time to live controls when the Spark cluster is shut down while not in use.

Integration Runtime choice: Self-hosted integration runtimes are not supported. Azure integration runtimes and Azure integration runtimes with managed virtual network are supported.

Like schemas in datasets, the projection in a source defines the data columns, types, and formats from the source data. For most dataset types, such as SQL and Parquet, the projection in a source is fixed to reflect the schema defined in a dataset. When your source files aren't strongly typed (for example, flat .csv files rather than Parquet files), you can define the data types for each field in the source transformation.

Import schema

Select the Import schema button on the Projection tab to use an active
debug cluster to create a schema projection. It's available in every source
type. Importing the schema here overrides the projection defined in the
dataset. The dataset object won't be changed.

Importing schema is useful in datasets like Avro and Azure Cosmos DB that
support complex data structures that don't require schema definitions to
exist in the dataset. For inline datasets, importing schema is the only way to
reference column metadata without schema drift.

If your text file has no defined schema, select Detect data type so that the
service samples and infers the data types. Select Define default format to
autodetect the default data formats.

Reset schema resets the projection to what is defined in the referenced dataset.

Overwrite schema allows you to modify the projected data types here in the source, overwriting the schema-defined data types. You can alternatively
modify the column data types in a downstream derived-column
transformation. Use a select transformation to modify the column names.

Data preview

If debug mode is on, the Data Preview tab gives you an interactive
snapshot of the data at each transform.
To fix the above error, go to Debug Settings → Row limit and lower the number of rows to be read.


Optimize the source transformation
The Optimize tab allows for editing of partition information at each
transformation step. In most cases, Use current partitioning optimizes for
the ideal partitioning structure for a source.

If you're reading from an Azure SQL Database source, custom Source partitioning likely reads data the fastest. The service reads large queries by making connections to your database in parallel. This source partitioning can be done on a column or by using a query.
Flattening the Array

Configure your Flatten transformation:

Incoming stream: the input to the Flatten transformation.

Unroll by: the array column you want to explode.

Input columns: how the output data should look.

If you preview the data, you will see that after selecting a column under Unroll by, the array column gets exploded into one row per element.

Then, under Input columns, add the output columns explicitly.

Handling Null
For example, when you ingest data from an API, the previous-page value is set to NULL for the first page. Now maybe we have a business rule that says to replace that NULL with "Not Available".

Derived Column (Schema modifier)

It allows us to add a new column or alter an existing one. Name your transformation, and under Columns choose the column you want to modify. Inside the Expression, write your logic.
Removing NULL values this time

Filter = the equivalent of WHERE conditions in SELECT statements. A sketch of typical expressions follows below.
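As a rough illustration of the mapping data flow expression language, a Derived Column can replace NULLs and a Filter can drop them. The column name previous_page is only a hypothetical example:

Derived Column expression (replace NULL with a default value):
    iifNull(previous_page, 'Not Available')

Filter expression (keep only rows where the column is not NULL):
    !isNull(previous_page)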


Removing and renaming columns

Use the Select transformation. Simply hit the bin icon to remove a column; the same transformation also lets you rename columns.

Converting Data types


String to Datetime
Choose the column
Specify type (timestamp)
Specify format

Conditional Logic
Add a new column and build an expression for it; a sketch of both the type conversion and the conditional expression follows below.
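A rough sketch of the corresponding data flow expressions (the column names order_date and num_parts are hypothetical):

Derived Column expression (string to timestamp, with an explicit format):
    toTimestamp(order_date, 'yyyy-MM-dd HH:mm:ss')

Derived Column expression (conditional logic for a new column):
    iif(num_parts > 100, 'large', 'small')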

Saving transformed data to a data lake


Choose Sink transformation
Specify Sink type

Add a new dataset that will store your results as delta format
But you will see that there is no option to create a dataset that saves the results as delta if we choose Integration dataset as the Sink type.

Let's proceed with an Inline dataset instead. When choosing the Inline dataset, you will see in the options that it supports the delta format.

Indicate the linked service

If you want your results to be stored in a separate container, you will have to configure your Sink transformation a bit: explicitly add a directory.

Executing Dataflows
Inside your pipeline, simply add a Data flow activity. This is the only way to run your flow.

There you provide the name of your recently created data flow, then Debug the pipeline.

LECTURE 44
Loading data to Dedicated SQL Pool

How to load data to a dedicated SQL Pool

How to connect to that dedicated SQL Pool
In this demo we already have the transformed data in the curated layer of the Data Lake. On the target side we will delete all existing records, then grab everything from the curated layer and load it into the dedicated SQL Pool.

Polybase

We are considering a pull approach here, where the SQL Pool pulls the data from an external source (the Data Lake in this case). This approach can be used when you have an on-premises Data Warehouse in which the data is already transformed, for example by stored procedures, so the migration is just a lift-and-shift case. There is no involvement of Databricks, ADF, or Synapse pipelines.

The Polybase approach is simply the ability to query data that lives outside our database using T-SQL. You have to create a few objects to implement this approach.

Go to Synapse Workspace → Develop tab → SQL Script.

Switch the connection to your dedicated SQL Pool.

To read the data externally, you need to connect to the source (the Data Lake in our case). For this you create an object called a Database Scoped Credential and configure it to use Synapse's Managed Identity. This means the SQL Pool will use the Managed Identity of the Synapse Analytics Workspace to connect to the Data Lake.

At this point the credential is only defined; it hasn't been used yet. A hedged sketch follows below.
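A minimal T-SQL sketch of this step, assuming the credential name DataLakeCredential and that no master key exists in the pool yet:

-- A master key must exist before a database scoped credential can be created
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<a strong password>';

-- Use the workspace's managed identity to authenticate against the Data Lake
CREATE DATABASE SCOPED CREDENTIAL DataLakeCredential
WITH IDENTITY = 'Managed Service Identity';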

The next step is to create an External Data Source object. It is simply a pointer to the source: while defining it, you specify the location of the source container/directory on your Data Lake along with the authentication method used to connect to it, as sketched below.
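A hedged sketch, assuming a storage account named mydatalake and a container named curated:

-- Pointer to the curated container, authenticated via the credential above
CREATE EXTERNAL DATA SOURCE CuratedDataLake
WITH (
    LOCATION   = 'abfss://curated@mydatalake.dfs.core.windows.net',
    CREDENTIAL = DataLakeCredential,
    TYPE       = HADOOP
);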

The next step is to define the external file format, i.e., the format of the files Synapse will read from the curated container. Unfortunately, Polybase cannot read the delta format; it can only read CSV or Parquet.
CSV file format

Defining the CSV file format:

FORMAT_TYPE = specifies that the file is delimited text (CSV)
FIELD_TERMINATOR = columns are separated by the "," character
STRING_DELIMITER = strings are delimited by the '"' character

A hedged sketch follows below.
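A minimal sketch, assuming the format name MinifigsCSV used later in these notes:

CREATE EXTERNAL FILE FORMAT MinifigsCSV
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        STRING_DELIMITER = '"',
        FIRST_ROW = 2          -- skip the header row (assumption)
    )
);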

The options available when defining a CSV format are limited. If the data contains special characters you won't be able to read it correctly, because there is no option to specify escape characters. For example, a record with embedded double quotes will fail to parse.

Then you create an External Table referencing the External Data Source. EXTERNAL TABLE means the data is not stored inside Synapse; it lives somewhere else. Because every column in a CSV arrives as text, we define all columns as string to avoid data type mismatches. We also reference the container via DATA_SOURCE, the exact path via LOCATION, and the file format defined earlier as MinifigsCSV. A hedged sketch follows below.
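A rough sketch; the table and column names (dbo.MinifigsExternal, fig_num, name, num_parts) are hypothetical placeholders:

CREATE EXTERNAL TABLE dbo.MinifigsExternal
(
    fig_num   NVARCHAR(200),
    name      NVARCHAR(400),
    num_parts NVARCHAR(200)    -- all string types, since CSV carries no schema
)
WITH (
    LOCATION    = '/minifigs/',       -- folder inside the curated container
    DATA_SOURCE = CuratedDataLake,
    FILE_FORMAT = MinifigsCSV
);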

Note: Make sure to grant the Storage Blob Data Contributor role on the Data Lake to the managed identity of the Synapse Workspace.
The table definition will appear under the SQL database, but the data is not under Synapse's jurisdiction; it still lies in the Data Lake.
Querying the data
Parquet file format

Create a new file format named MinifigsParquet:

FORMAT_TYPE = PARQUET
DATA_COMPRESSION = snappy

Then create another external table that uses the above file format. This time you don't keep the column data types as string; you give each column its proper data type, and there is no type mismatch when loading because with Parquet the schema travels with the file. You also reference the FILE_FORMAT you just created and change the LOCATION. A hedged sketch of both objects follows below.
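A rough sketch under the same assumptions as before (column types and paths are hypothetical):

CREATE EXTERNAL FILE FORMAT MinifigsParquet
WITH (
    FORMAT_TYPE      = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);

CREATE EXTERNAL TABLE dbo.MinifigsParquetExternal
(
    fig_num   NVARCHAR(200),
    name      NVARCHAR(400),
    num_parts INT               -- proper types: Parquet carries the schema
)
WITH (
    LOCATION    = '/minifigs_parquet/',
    DATA_SOURCE = CuratedDataLake,
    FILE_FORMAT = MinifigsParquet
);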

At this point DML operations are not supported, meaning you can only read the data. UPDATE and DELETE are not supported because the data still lives on the Data Lake. To make the data available inside the dedicated SQL Pool, you have to create a separate table and load the data from the external table into it using CTAS (Create Table As Select).

You give the new table a name and a distribution, and populate it with the rows already stored in the external table. Now you can read, update, and delete data in this table, as sketched below.
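A minimal CTAS sketch (the distribution choice and table names are assumptions):

CREATE TABLE dbo.Minifigs
WITH (
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM dbo.MinifigsParquetExternal;

-- DML now works because the data physically lives in the pool
UPDATE dbo.Minifigs SET name = 'Unknown' WHERE name IS NULL;
DELETE FROM dbo.Minifigs WHERE num_parts = 0;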

The next challenge: if the data changes in the source, the change is reflected in the external table but not in the staging table against which you run your DML queries.

Another Method
Copy

Delta is not supported here either.

You simply provide the target schema and table name (COPY INTO), the location of the data on the external source (FROM), the authentication method (CREDENTIAL), and the file type (FILE_TYPE).
First, clear out the staging table left over from the Polybase pull exercise by using TRUNCATE TABLE, as sketched below.
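A hedged sketch, assuming a Parquet source and managed identity authentication (the storage URL is a placeholder):

TRUNCATE TABLE dbo.Minifigs;

COPY INTO dbo.Minifigs
FROM 'https://mydatalake.dfs.core.windows.net/curated/minifigs_parquet/'
WITH (
    FILE_TYPE  = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Managed Identity')
);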

PUSH

Go to ADF and use the Copy activity. We already know the Copy activity exists in both ADF and Synapse for moving data from one place to another.

Create a new pipeline → grab a Copy data activity → define Source and Sink.
You will find that there is no option to choose the delta file format while creating a dataset.
Note: The delta file format is not supported in pipelines.

Creating a Data Flow


We know we used Data Flows to save files in delta format.

We use Inline as the Source type, because datasets at the pipeline level do not support the delta format.

Choose a linked service pointing to the Data Lake under the Linked service option. Under Source options, specify where in your Data Lake the data is located.

To peek inside the data, enable Data flow debug mode

In the Projection option click Import schema


You can also see your data by clicking on Data preview

The next thing is how to connect to the Synapse dedicated SQL Pool.

Grant the db_owner role to the managed identity of ADF on the SQL Pool.

[tybuladf] = name of the managed identity

FROM EXTERNAL PROVIDER = Microsoft Entra ID
Grant the managed identity the db_owner permission.

Also grant the ADMINISTER DATABASE BULK OPERATIONS permission, as sketched below.
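A hedged T-SQL sketch, run against the dedicated SQL pool database (the identity name [tybuladf] comes from these notes; everything else is standard syntax):

-- Create a database user for the ADF managed identity (Microsoft Entra ID)
CREATE USER [tybuladf] FROM EXTERNAL PROVIDER;

-- Grant db_owner so ADF can create and write to tables
EXEC sp_addrolemember 'db_owner', 'tybuladf';

-- Needed for bulk loading (COPY / PolyBase staging)
GRANT ADMINISTER DATABASE BULK OPERATIONS TO [tybuladf];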

Therefore, ADF can connect to dedicated SQL Pool


Then add Sink activity

Because the sink is the dedicated SQL Pool, create a new dataset and a new linked service to Synapse, and indicate which table the data should be stored in. Optionally write a SQL query under Pre SQL scripts.

Enable the staging option and define a staging location: the data is first dumped to the Data Lake and then loaded from there into the dedicated SQL Pool. The data flow converts the data into a form supported by Polybase.
Copy Activity in ADF
Create a new pipeline → use a Copy activity.

Source side configuration

Choose the dataset that is already configured to read a CSV from the Data Lake, and just update the path to the data you would like to read.

You will also notice that this Copy activity has additional functionality that was not present earlier.
Sink side configuration

Configure it accordingly
Because you are reading a CSV file, you can make some changes under the Mapping tab.

Under the Settings tab you can enable the staging option.
If you do not enable it and click Debug, you will see a prompt as shown below.

Where to provide information about the staging area:

Use the same linked service that points to the Data Lake, then browse and choose the stage container.

Using Spark Pools

We saw that when we did transformations using Spark Pools, the data was already available in delta format, and Spark can read delta. The question now is: can a Spark Pool connect to a SQL Pool?

Create a New Notebook in Synapse Workspace


Attach it to the Spark Pool

Try to connect to the dedicated SQL Pool:

Data tab → Workspace → SQL database → right-click the table you want to load → New Notebook → Load to DataFrame.

This generates Scala code that connects to the table and reads its content. Alternatively, you can use Python:

synapsesql("<dedicated-sql-pool>.<schema>.<table>")
This way you can read data from a table stored in the SQL Pool into a DataFrame in the Spark Pool, as sketched below.
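A rough PySpark sketch using the Synapse dedicated SQL pool connector available in Synapse Spark pools; the pool name sqlpool01 and table dbo.Minifigs are placeholders:

# Run in a Synapse notebook attached to a Spark pool
df = (spark.read
          .synapsesql("sqlpool01.dbo.Minifigs"))   # <pool>.<schema>.<table>

df.show(10)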

Saving data to SQL Pool

This time we read the delta-format data with the Spark Pool and wrote it to the dedicated SQL Pool using spark.write.synapsesql, which under the hood uses the COPY operation. A sketch follows below.
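A rough sketch under the same assumptions (paths and names are placeholders):

# Read the curated delta data from the Data Lake
df = spark.read.format("delta").load(
    "abfss://curated@mydatalake.dfs.core.windows.net/minifigs")

# Write to the dedicated SQL pool; the connector stages the data in storage
# and then loads it with COPY
df.write.synapsesql("sqlpool01.dbo.MinifigsFromSpark")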

To check the staging area used by Spark Pool

Using Databricks
We already have a Databricks workspace that saved the data to our Data Lake in delta format; we will reuse the same workspace.
Set some initial configuration defining how to connect to the Data Lake and to Synapse. We will use a dedicated service principal and store its secret in a Key Vault.

You also need to grant permissions to the service principal inside the SQL Pool.

Verify that you can retrieve the data from the Data Lake.

Now write the data to the dedicated SQL Pool.

sqldw = the connector available in Databricks for connecting to a SQL Pool.

You can also read data from the SQL Pool back into Databricks, as sketched below.
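A hedged sketch of the Databricks Azure Synapse (sqldw) connector; the JDBC URL, temp directory, and table names are placeholders, and the authentication options (service principal or SQL credentials) are omitted because they depend on how the workspace was configured:

jdbc_url = "jdbc:sqlserver://myworkspace.sql.azuresynapse.net:1433;database=sqlpool01"
temp_dir = "abfss://stage@mydatalake.dfs.core.windows.net/databricks-tmp"

# Write a DataFrame to the dedicated SQL pool (staged via temp_dir)
(df.write
   .format("com.databricks.spark.sqldw")
   .option("url", jdbc_url)                               # auth options omitted (assumption)
   .option("tempDir", temp_dir)
   .option("forwardSparkAzureStorageCredentials", "true")
   .option("dbTable", "dbo.MinifigsFromDatabricks")
   .mode("append")
   .save())

# Read from the pool back into Databricks
df_back = (spark.read
             .format("com.databricks.spark.sqldw")
             .option("url", jdbc_url)
             .option("tempDir", temp_dir)
             .option("forwardSparkAzureStorageCredentials", "true")
             .option("query", "SELECT COUNT(*) AS cnt FROM dbo.MinifigsFromDatabricks")
             .load())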

Lecture 45

Result Set Caching


Go to Synapse Workspace → Create a Notebook

Suppose this is your query

And this is the output

Upon re-running the query, you will notice that it still takes a long time to execute.
The solution is to cache the result set.
The following highlighted line just assigns a label to the query so that you can refer to the same query again and again by this label.
Enabling result set caching

This has to be executed on the master database.

After setting this up, if you rerun your query the result will be cached.

If you rerun the query yet again, it will be served from the cache.
The cache is evicted when it hasn't been used for 48 hours, when the underlying data changes, or when it reaches its size limit; results of non-deterministic queries are not cached. A hedged sketch follows below.
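A minimal T-SQL sketch of the whole flow; the pool name sqlpool01, the table, and the label text are placeholders:

-- Run on the master database
ALTER DATABASE sqlpool01 SET RESULT_SET_CACHING ON;

-- Run on the dedicated pool: label the query so it is easy to find again
SELECT name, COUNT(*) AS cnt
FROM dbo.Minifigs
GROUP BY name
OPTION (LABEL = 'minifigs_by_name');

-- Check whether the run was served from the result set cache
SELECT request_id, command, result_cache_hit
FROM sys.dm_pdw_exec_requests
WHERE [label] = 'minifigs_by_name';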
