DP 203 Notes
(Figure: Storage Account hierarchy - Root -> Container 1 / Container 2 -> File 3)
Click on “Create”
To dequeue
Message 1 is gone
Option 2:
If you go with this option, then download the IR and register the IR anywhere
you want with one of those 2 keys.
Inside the target Data Lake, create a container and name it RAW. It will act as the Raw layer.
Set Up Linked Service for Target
Then Create Dataset for both Source and Target
After setting the appropriate properties, you will see that by selecting the dedicated Linked Service, the dataset automatically detects which IR that Linked Service is using, and you will then see the list of tables of the database that you specified in the Linked Service.
If you go with the option to store the data as CSV, you will also be required to provide a file path.
Creating Managed VNet IR
Configure as per the requirements
Go to virtual network
Enable and create
Now, after creating the Managed VNet IR, use it in a Linked Service
And from the dropdown choose the Managed VNet IR
Both of these IRs require private endpoints
After selecting Subscription, Server and Database you will see there is no
Managed private endpoint. So, create one
Either you can create from there as shown in the fig or you can create from
the menu bar
Choose for Azure SQL Database
So far you have created an MPE and specified the server name inside it. In this way the Azure SQL Server instance detects that some application (ADF in this case) wants to connect to it securely using that MPE.
Go to your Azure SQL Server
Now if you create a Linked Service, choose ManageVNetIntegrationRuntime, and provide the necessary parameters, you will see that the Managed Private Endpoint is approved
And create the linked service
a) Anonymous Access
b) Access with Identity
Anonymous Access
From the Azure Portal search for Storage Account
Enable hierarchical namespace
After doing that, navigate to “Containers” on the left-hand side tab of
Storage Account and create some containers
To make a container publicly accessible simply navigate to the container and
click on the option above: Change access level
You can see the status has been changed. To give someone access you also have to set the permission globally at the Storage Account level as shown below. But always ensure that those settings stay disabled when not needed, because you do not want your data leaked.
Even if the anonymous access level for the "public" container shows as enabled, you will not be able to see the content inside it, because you just disabled the access globally at the Storage Account level.
On the left menu, under “Security + networking” you shall see “Access
keys” option
There will be 2 keys: key1 and key2
You can use any one of the keys to establish a connection or you can save
these keys as Secret inside a Key Vault
Creating a Linked Service based on Access Keys
Mention your subscription and Storage account name
This method of using Access Keys gives full control over the Storage Account. Also, if you go to the JSON definition of the Linked Service, you will see the encryption details. The connection will run successfully because you are using Access Keys as authentication.
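To illustrate what "full control via the account key" means outside of ADF, here is a minimal Python sketch using the Azure SDK (assumes the azure-storage-file-datalake package; the account name and key are placeholders):

    from azure.storage.filedatalake import DataLakeServiceClient

    account_key = "<access-key-from-portal>"  # key1 or key2 from the "Access keys" blade
    client = DataLakeServiceClient(
        account_url="https://<storage-account>.dfs.core.windows.net",
        credential=account_key,
    )

    # Full control: the key allows listing, reading, writing and deleting in any container.
    for fs in client.list_file_systems():
        print(fs.name)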
Now, after successfully creating the Linked Service, you can open the code view and see the details of your Access key under "encryptedCredential". As this option gives full power, it is better to have it disabled at the Storage Account level.
Also, consider the case where you have multiple Linked Services for a Storage Account and all of them use Access Keys for authentication. In such a situation, if you suspect that your key has been compromised, you would rotate the key, and the Linked Services that were configured earlier with the old key won't work. This means you will have to use the newly generated key in all the Linked Services that are connected to that Storage Account.
The connection will not get established, and you will be asked to provide the new Storage Account key (Access key), meaning you would have to update this value in all the associated Linked Services.
Best practice is to store all those Access keys inside a Key Vault so that you
do not have to hard code your Storage Account keys explicitly in the
definition of any Linked Service
Further, we will learn how to utilize a Key Vault to store Access keys or SAS
token or anything that you would not like to hard code
Key Vault
Create a key Vault
Choose Subscription
Choose Resource Group
Another thing is how (what type of Authentication) some other services will
use to connect to Key Vault. There are 2 types shown below under
“Permission model”
If you are using a Managed identity for the app, search for and
select the name of the app itself.
3. Review the access policy changes and select Create to save the
access policy.
4. Back on the Access policies page, verify that your access policy is
listed.
Now you have all the secrets inside the Key Vault. The next thing is who will come to grab those secrets and how they should talk to the Key Vault so that the Key Vault can provide those secrets.
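As a hedged sketch of that interaction, this is roughly how a service with a managed identity could fetch a secret from the Key Vault using the Python SDK (vault and secret names are placeholders):

    from azure.identity import DefaultAzureCredential
    from azure.keyvault.secrets import SecretClient

    credential = DefaultAzureCredential()  # resolves to the managed identity when running in Azure
    client = SecretClient(vault_url="https://<key-vault-name>.vault.azure.net",
                          credential=credential)

    storage_key = client.get_secret("<secret-name>").value  # e.g. the stored Access key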
Managed identity
Where to look for it?
On the menu at left side, you should see Managed Identity option
You can get Managed Identity from there. Both are identities and
in addition, on the top, you can see there are two options:
System-assigned and User-assigned.
The next thing is to go back to the Key Vault and, inside the Access Policies, add the Managed Identity of whatever resource should be able to access the secrets inside the Key Vault.
But this is not the best practice. We would like to have a separate group
created (inside Microsoft Entra ID). Then as a Principal, you choose that
group.
Creating a group in Microsoft Entra
Therefore, you should see the name of the group you created
Now, finally go back to Key Vault and create an Access Policy with
necessary permissions and add group that you created inside
Microsoft Entra.
Now you should see your group inside the Access Policy of Key
Vault
So, long story short, we want our ADF to access Key Vault so we will create a
Linked Service pointing to the Key Vault
Now, this Linked Service already points towards the Key Vault, so we will make use of it. It is like a scenario where you want to access the Key Vault but do not know its location, so you ask the Linked Service that is pointing towards the Key Vault.
In real life, if you do not know the location of a place, you open Google Maps and search for that location. This Linked Service is something similar: our ADF will ask this Linked Service for the location of the Key Vault.
Now, let us create a Linked Service to the Data Lake, but at the same time ADF will ask the Linked Service pointing to the Key Vault to go and grab the secrets for it. The Linked Service pointing towards the Key Vault will only grab those secrets that ADF is eligible to see.
The moment you switch to Azure Key Vault, you will see that ADF automatically detects that there is already a Linked Service that will show ADF the path to go and grab the secret. So ADF can establish a connection with the Data Lake using that Linked Service.
See! When ADF asked the Linked Service (that was pointing to the Key Vault) for directions to the Key Vault, the Linked Service went and automatically provided the options ADF has (in the form of a secret name). Then, finally, using that secret the connection was established successfully.
Access to Data Lake using SAS Tokens
This is taken from inside the Data Lake.
This is what the Account Level SAS token looks like, why Account Level SAS?
Because you are grabbing it at the Account Level.
You can choose for what service you want to generate SAS tokens for, e.g.,
Blob, File, Queue, and Table.
What resource types? (Service, Container, Object). Did you notice one thing? It does not give us an option to pick a specific container (meh! not really granular). Anyway, let's generate one SAS.
Because at the account level SAS tokens can be generated for Blob, File, Queue and Table, but not for any specific container or directory.
Get the Blob service SAS URL: but we want to have Data Lake SAS token
Just simply replace “blob” keyword with “dfs” keyword.
Create a new Linked Service (Configured to use SAS to
connect with Data Lake)
Authentication type: choose SAS URI
Paste URL in the below section
Under the hood, of course, these SAS would also be using Access Keys
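For illustration, a minimal Python sketch of connecting with that account-level SAS token instead of the key (placeholder values; assumes azure-storage-file-datalake):

    from azure.storage.filedatalake import DataLakeServiceClient

    sas_token = "<account-level-sas-token>"  # the "?sv=..." string generated above
    client = DataLakeServiceClient(
        account_url="https://<storage-account>.dfs.core.windows.net",
        credential=sas_token,
    )

    # The client can now only do what the SAS permissions/resource types allow.
    print([fs.name for fs in client.list_file_systems()])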
Add policy
Then you can grant whatever permissions you want to be
automatically inherited by SAS token in there
You should see your created Policy for the RAW container
Now, let’s go create a Service Level SAS token by passing this Policy inside
the token
You can see that our created Policy appeared to choose from the options
You can see that whatever policy you defined at the container level is automatically inherited by this token.
Talking about the high-level overview: when you define an Access Policy at the container or directory level, you grant some permissions to it. It signifies what that Access Policy allows on the container or directory. So, if you generate a Service SAS token by making use of an Access Policy, then for whatever user/application/service principal uses that SAS token, the container or directory will understand to what extent, or what piece of work, that user can perform on it with that SAS token.
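A sketch of generating such a service-level SAS against a stored access policy from Python (names are placeholders; permissions and expiry come from the policy itself):

    from azure.storage.blob import generate_container_sas

    sas = generate_container_sas(
        account_name="<storage-account>",
        container_name="raw",
        account_key="<access-key>",
        policy_id="<stored-access-policy-name>",  # rights are inherited from this policy
    )
    print(sas)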
And if you want to see the set of permissions available in the default roles, you can navigate to the "View" option and then see the JSON definition.
In the fig above, which is a JSON-like structure, actions are the things an owner can do, and notActions is the list of things the owner cannot do.
For example, in case of Contributor role
Actions = what he can do
notActions = inside this is what a Contributor cannot do
In a similar way, we can create our own customized role by defining a set of permissions under actions and notActions, as sketched below.
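As a rough sketch, the shape of such a custom role definition looks like this (values are illustrative only, expressed here as a Python dict mirroring the JSON):

    custom_role = {
        "Name": "Custom Storage Data Reader",
        "IsCustom": True,
        "Actions": ["Microsoft.Storage/storageAccounts/read"],
        "NotActions": [],
        "DataActions": ["Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read"],
        "NotDataActions": [],
        "AssignableScopes": ["/subscriptions/<subscription-id>"],
    }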
Roles
Storage Blob Data Owner
Storage Blob Data Contributor
Storage Blob Data Reader
In Microsoft Entra ID create a group and add Managed Identity of ADF as a member.
At the Storage Account grant Storage Blob Data Owner/Contributor/Reader role.
Create a group -> Add ADF as a member to the group -> Grant RBAC to that group
Here you will see that you just provided “Storage Account Name”
How is it working?
When you give the Storage Account name, the service sees that you are creating the Linked Service from the ADF instance to which you have already granted RBAC on that Storage Account.
Confirm that your Azure DevOps organization is linked to the correct Microsoft
Entra ID.
If you are seeing something like this that would mean your Dev
instance of ADF is Git configured
Now, to verify that you made this setup correctly, go to Azure DevOps and see whether a folder named "/data-factory/" was created.
Also, inside the folder you should see json definition of your Data Factory
Next step is to make sure to include Global Parameters inside ARM template
In Azure DevOps:
Create a folder /devops/ and then, in that folder create following files:
package.json
adf-build-job.yml
In the JSON code, replace the template variables and provide the path to the adf-build-job.yml file.
subscriptionId : Dev Resource Group ID
So, a point was reached where we could not scale up our resources further because it was not practically, physically, or economically possible.
So, a solution was developed where, instead of scaling up, we scale out, and the task is distributed among several worker nodes.
Since each worker node gets a small portion of the whole work, it should be faster and cheaper.
Pricing Options
Go to Resources Launch
Workspace
Databricks UI
Policy - you can set a policy which says how many nodes a user can deploy
Multi-Node/Single-Node - with Multi Node you will see options
including Worker type, min worker, max worker but this is not the case
for Single Node. In single node setting UI will look something like this:
Node type = it is talking about the driver node
Which S/W version of Databricks you want to use? Select from there
this makes your workloads run faster
After all settings are done hit Create to create your cluster.
Creating a Notebook
New Notebook
Languages supported
WE WILL IMPLEMENT THIS
Read the data from DataLake by establishing connection and then save it
back to the same DataLake in delta format.
https://learn.microsoft.com/en-us/azure/databricks/introduction/
https://learn.microsoft.com/en-us/azure/databricks/connect/
storage/azure-storage
Copy it
<storage-account>
Grab the Access keys from the Storage Account and paste it inside the
notebook
Execute the cell
Because Databricks by default assumes the Delta Lake format, you have to tell Databricks explicitly that the data is in JSON format.
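A minimal notebook sketch (placeholder names) of authenticating with the account key and reading the JSON file:

    access_key = "<access-key>"
    spark.conf.set("fs.azure.account.key.<storage-account>.dfs.core.windows.net", access_key)

    df = (spark.read
          .format("json")   # override the default (delta) format
          .load("abfss://raw@<storage-account>.dfs.core.windows.net/<path-to-file>.json"))
    display(df)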
Saving data back to DataLake as Delta
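A sketch of the write-back (placeholder path):

    (df.write
       .format("delta")
       .mode("overwrite")
       .save("abfss://raw@<storage-account>.dfs.core.windows.net/delta/<table-name>"))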
Pin Cluster
Multiple Cluster
Workspace Menu
Workspace
Then you can set permissions by right-clicking on the Project Folder, e.g., who should have access to this Project Folder and the content inside it.
Apart from this all the users will have their own workspace
So, you can also create folders, notebooks within your workspace
You can see every change made to your notebook from there
Databricks Utilities
Most commonly used
Sample Datasets
Databricks managed RG
You will see an error like this; you cannot see what is inside, because this RG is given to the Databricks workspace to deploy all the resources like VMs, networking components, network interface cards, etc. Also, all those sample datasets come from that Storage Account container (from root, specifically).
Loading CSV data from Storage Account to Databricks
Establish connection with Storage Account
Paste it
Change the protocol and indicate which container you want to use
This dbutils will display content of your Data Lake
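For example, a sketch with placeholder paths of listing the container and loading the CSV:

    dbutils.fs.ls("abfss://raw@<storage-account>.dfs.core.windows.net/")

    df = (spark.read
          .format("csv")
          .option("header", "true")
          .option("inferSchema", "true")
          .load("abfss://raw@<storage-account>.dfs.core.windows.net/<file>.csv"))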
Solution
Transformations:
Suppose you want to group the rows by last name
First 10 rows:
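A sketch of those two steps (the column name is assumed):

    display(df.groupBy("LastName").count())  # group the rows by last name
    display(df.limit(10))                    # first 10 rows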
Lecture 33
Connecting to ADLSg2
1) Unity Catalog
Databricks is promoting this method. They think it is best for data
governance.
2) Legacy Method
other methods that existed prior to Unity Catalog
a) Access via Account Key
just a string that gives full access to a Storage Account (Data Lake).
b) Access via Service Principal
Simply an account explicitly created in Azure that you can later use to
grant some permissions.
You will create a Service Principal, and grant it access to DataLake. For
this you can either use RBAC or ACL
Go to App Registration
Then click “New registration”
Give name to it
Give it a name
You will see all the roles under this. If there is already some group present there which has the same set of permissions that you want your newly created Service Principal to have (the Storage Blob Data Contributor role in this case), then you simply add your Service Principal to that group. To do this you have to go to
Microsoft Entra ID -> Groups -> find your group -> add a new member (the Service Principal in this case) to the group
This proves that there is already a group called "Data-Lake-Contributor" that has the permissions we want our Databricks notebook to have.
Now if you go to Databricks, create a new notebook, and then try to connect to the Data Lake, the connection should fail because you haven't configured Databricks yet.
As per the documentation
https://learn.microsoft.com/en-us/azure/databricks/connect/storage/azure-
storage
You can do the configuration in 2 ways: either just in the scope of the present notebook, or for the whole cluster. For the whole cluster means that whatever notebook you create will already have this configuration.
Fill in the service_credential part: this will hold the secret of the Service Principal you created earlier in Microsoft Entra ID through its App Registration.
And in place of storage-account give name of your DataLake
To find it: Microsoft Entra ID -> App registrations -> All applications -> your Service Principal -> Application ID
Finally mention directory-id
Both are not secrets, so it is fine to have them there
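Putting those pieces together, a sketch of the notebook-level configuration along the lines of the documented pattern (all angle-bracket values are placeholders; in practice the client secret should come from a secret scope, covered in the next section):

    storage_account = "<storage-account>"
    service_credential = "<service-principal-secret>"  # better: dbutils.secrets.get(...)

    spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
    spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
                   "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net",
                   "<application-id>")
    spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net",
                   service_credential)
    spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
                   "https://login.microsoftonline.com/<directory-id>/oauth2/token")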
d) Secret Scopes
These Secret Scopes work only if the Key Vault is configured to use the Access Policy permission model.
Go to Key Vault
Add your Secret to Key Vault
Inside the value, give the secret of the Service Principal that you created earlier in Microsoft Entra ID during App Registration.
First part is address of your databricks workspace instance and second part
is secret
Create Scope
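Once the Key Vault-backed scope exists, the notebook reads the secret through dbutils (a sketch with placeholder scope and secret names):

    service_credential = dbutils.secrets.get(scope="<scope-name>", key="<secret-name>")

    # You can also list what the scope exposes; values are redacted in notebook output.
    dbutils.secrets.list("<scope-name>")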
3. Deprecated Method
a) Access via Mount Points
They simplify access patterns to our data on a Data Lake. With Mount Points you provide a location to which they should point, e.g., our mount point should point to our Data Lake.
Indicate which protocol (driver) is to be used to connect to the Data Lake. In our case we are using abfss.
Credentials(how to connect to DataLake)
So, basically, all these 3 things together will be encapsulated inside the Mount Point, so end users later on will not have to care about those 3 things while using the Mount Point as a regular path.
From Documentation
Source = Provide a path where your Mount Point should point
Mountpoint :
Listing Mount-points:
Querying Mount-point:
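Putting the three pieces (location, driver, credentials) together, a sketch of creating, listing and querying a mount point (placeholder names throughout):

    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": "<application-id>",
        "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("<scope-name>", "<secret-name>"),
        "fs.azure.account.oauth2.client.endpoint":
            "https://login.microsoftonline.com/<directory-id>/oauth2/token",
    }

    dbutils.fs.mount(
        source="abfss://raw@<storage-account>.dfs.core.windows.net/",  # where it points
        mount_point="/mnt/raw",                                        # how users address it
        extra_configs=configs,                                         # how to connect
    )

    dbutils.fs.mounts()        # list existing mount points
    dbutils.fs.ls("/mnt/raw")  # query through the mount point like a regular path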
They point towards
In this way each user will have different type of permissions on Storage
Account based on which they can have different level of control on data.
Meaning different users have different versions of data.
Databricks:
Create a New Notebook -> make sure you connect to your new cluster
Disadvantage:
Doesn’t work with ADF
Lecture 35
In this demo we are using the Access via Service Principal method.
Earlier you were pasting the Python configuration in the first cell of the notebook, but in this method you will put that configuration inside the cluster's Spark config, as shown below.
Cluster Configuration:
Go to compute
Then to your desired cluster that you wish to use
You can see that you are able to list the content; therefore connectivity is established.
In this demo we are configuring the access settings not at notebook level but
cluster level.
In this step, read whatever data you want to read, but at the end get the data into a DataFrame and then register the DataFrame as a table. Once it is in table form you can use SQL-like queries.
Transform your data
There are multiple ways to save data to DataLake. But we will save our data
as delta format
If you revisit your curated container, you would see parquet files along with
the _delta_log folder.
Then to verify if you can read that curated data you simply read it on
Databricks using pyspark command.
How to do that?
Then you create a table. You have to indicate which database you want it to be put in, then the name of the table, and then write a query that will populate the table, as sketched below.
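A sketch of those steps (database, table and view names are placeholders):

    df.createOrReplaceTempView("minifigs_tmp")

    spark.sql("CREATE DATABASE IF NOT EXISTS lego")
    spark.sql("""
      CREATE TABLE IF NOT EXISTS lego.minifigs
      AS SELECT * FROM minifigs_tmp
    """)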
The benefit of creating this table in a database is that any end user can read this data without explicitly mentioning the path of the source in your Data Lake:
Go to Catalog
Where is the data getting stored? It uses the delta format, and delta is a file format, so the files have to be stored somewhere.
Even though you are the admin of the whole Azure subscription, you won't have permission to see the data stored in the Data Lake managed by Databricks.
The only way to access that data is to use Databricks. What about other services, like ADF, that would like to connect to that data? This will not be possible without going through Databricks. And because you have no control over that Data Lake, you have no way to set permissions using RBAC or ACLs.
Go to your notebook
Create a table
Just a little update: provide a LOCATION pointing to your own Data Lake. Apart from this, everything in the code is the same, as sketched below.
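A sketch of that change (names and path are placeholders):

    spark.sql("""
      CREATE TABLE IF NOT EXISTS lego.minifigs_ext
      USING DELTA
      LOCATION 'abfss://curated@<storage-account>.dfs.core.windows.net/minifigs'
    """)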
Now to verify, if you go to Catalog you should see your external table
This would mean it is registered at the Catalog
Type = External
But how come you can see sample data when you are browsing your Catalog? Because from there it can go to the Data Lake and query the data.
Database vs Schema
Go back to your notebook and use one of the tables that is present in the Hive Metastore. But the table should be external.
INSERT:
If you rerun the count statement you should see one more row added to the
external table
To check whether the row was really added, verify it at the Storage Account level.
You can see a new file was created.
As the row is no longer present, you will see a result like this:
Identity Column
In many cases, such as SCD, we would like to have some kind of artificial ID column that we could use to uniquely identify a row. It is important in the case of SCD type 2, where we may have many versions of the same row with the same business key, which means we cannot use that business key to identify the row.
After running this command your table will get registered in the Catalog
Now query the table. You should see the ID column whose values would be
automatically provided by Databricks
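For reference, a sketch of how such an identity column can be declared with Delta Lake (table and column names are placeholders):

    spark.sql("""
      CREATE TABLE IF NOT EXISTS lego.minifigs_scd (
        surrogate_id BIGINT GENERATED ALWAYS AS IDENTITY,
        fig_num      STRING,
        name         STRING
      ) USING DELTA
    """)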
Lecture 35
And we do not want to develop any reporting solution directly from the source data. Therefore, we ingest the data into our Data Lake, transform the data, and then build reporting solutions.
While ingesting data we try to avoid a Data Swamp, so we make different layers within our Data Lake.
Inside the RAW layer we keep the data in its native state, meaning as it is.
But we know that there is another file format that is better and optimal for
data analytics.
Therefore, the next thing we would do is convert the data into delta format.
And usually the place where we save the delta files, according to the Microsoft documentation, is the Conformed container. You can also name this container Silver; it is totally up to the business requirements.
It is totally up to you if you are only converting the data type or doing some
other transformation as well.
Agenda: grab the data from the RAW layer, convert the data types, convert the file format to delta (maybe with some other steps), and save it to another layer. This process can be achieved with Autoloader.
Environment Setup
In which ADF connects to data sources, grabs the data, and saves it to the AutoLoaderInput directory in CSV format (in our demo). Now, we want Autoloader to detect those files, pick them up, process them, and save them as delta in a different location, i.e., a separate directory in the Conformed container. Additionally, we want to register the newly created data inside the Databricks Catalog so that you can easily use the data in your queries.
For the demo we manually uploaded a CSV file inside RAW ->
AutoLoaderInput
.writeStream = once we have the data, this writes it somewhere, and that somewhere is defined using .option("path", target_path). While doing so, it records its progress to .option("checkpointLocation", checkpoint_path). You are also asking it to register the data as a Hive Metastore table using .toTable(table_name). The full script is sketched below.
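A sketch of the whole Autoloader script described above (paths and names are placeholders):

    source_path     = "abfss://raw@<storage-account>.dfs.core.windows.net/AutoLoaderInput/"
    target_path     = "abfss://conformed@<storage-account>.dfs.core.windows.net/minifigs/"
    checkpoint_path = "abfss://conformed@<storage-account>.dfs.core.windows.net/_checkpoints/minifigs/"
    table_name      = "lego.minifigs_autoloader"

    df = (spark.readStream
          .format("cloudFiles")                            # Autoloader
          .option("cloudFiles.format", "csv")              # input files are CSV
          .option("cloudFiles.schemaLocation", checkpoint_path)
          .option("header", "true")
          .load(source_path))

    (df.writeStream
       .option("path", target_path)                        # where the delta files land
       .option("checkpointLocation", checkpoint_path)      # progress tracking
       .toTable(table_name))                               # register in the Hive Metastore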
Now, to verify that your file inside the AutoLoaderInput directory was read and then put into the conformed container, simply query the database and table used earlier:
Since you already registered the data in Hive Metastore when you wrote a
script for Autoloader
It should be available
Now let's check what our table looks like in the Catalog.
The first command will drop your table and the last 2 commands will remove the data from the Data Lake.
We will ask the Autoloader to infer the schema(data type) of the columns
while loading the files
Schema Evolution
Suppose the source starts producing some new columns. On the other hand, Autoloader has been saving the schema of previously read files inside the "checkpoint_path".
And inside _schemas you will have different versions.
Those versions are basically files that contain the schema generated by Databricks upon reading the two different files. Meaning, if you read n files with differing schemas, the total number of files created inside this _schemas directory will be n.
If you want Autoloader to handle the schema mismatch issue, you use the following option (sketched below):
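A sketch of the relevant option (reusing the placeholder paths from the earlier script):

    df = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "csv")
          .option("cloudFiles.schemaLocation", checkpoint_path)
          .option("cloudFiles.schemaEvolutionMode", "addNewColumns")  # also: "rescue", "failOnNewColumns", "none"
          .load(source_path))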
After rerunning the program, you will see that the new column has been added, and for the rest of the rows its value will be null.
Mode 2 = the schema will not evolve automatically, and Autoloader will not fail if there is a schema mismatch. The new columns will be saved in a separate column called the "rescued data column".
Mode 3 = the process will simply fail if there is a schema mismatch. New columns will not be stored anywhere. You must manually update the schema or just delete the file that is causing the error (a strict mode).
If later the business wants to incorporate the "Salary" column, you can simply parse the rescued data column, retrieve the value, and put it inside the new column.
Error Handling
Suppose you are uploading some file and in that file there is an "Age" column that accepts integer values, but there is a row that has a string entry in the "Age" column.
And then, if you are using rescue mode, you will see an entry something like the one shown below.
You would realize the source data is rubbish, and you clean it by updating the value in the Age column and changing its data type.
Batch Processing
Now there might arise situations where ADF is dumping data into the container that is monitored by Autoloader. And suppose we configured our pipeline to run at midnight; that would mean our Autoloader would sit idle most of the time, and since under the hood Autoloader uses the Databricks cluster as its compute layer, you would have to pay even when nothing is being done. One common workaround is sketched below.
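One common workaround is to run the same stream as a scheduled batch job with an availableNow trigger, so it processes whatever is new and then stops (a sketch reusing the placeholder names above):

    (df.writeStream
       .option("path", target_path)
       .option("checkpointLocation", checkpoint_path)
       .trigger(availableNow=True)   # process all pending files, then shut the stream down
       .toTable(table_name))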
The Autoloader saves its progress inside checkpoint location. In this way it knows what
things it already has done.
Another thing is that we configure an input location for Autoloader so that it can detect whether something new has been uploaded to that location. For this, Autoloader uses 2 modes:
a) Directory listing mode
Every time Autoloader starts, it goes to the input location and lists everything that is stored in the directory. If there are lots of files, the listing process might take some time and might cost some money.
b) Notification Mode
Some process is uploading files to the data lake. In Azure there are many events that get created; for example, when some file is uploaded to a data lake, a new "Blob Created" event is fired, and there are multiple services that can react to the event.
Under the hood it uses Event Grid, which manages those events and lets services subscribe to them when something happens.
Autoloader subscribes to the file-created events for the particular location in the Data Lake. In the end, Autoloader will create a Queue that is part of the Storage Account, and all of those events will be saved in that queue as messages. Each message contains information about what file was created and where it is stored. Enabling this mode is sketched below.
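A sketch of switching Autoloader from directory listing to notification mode (placeholder paths again):

    df = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "csv")
          .option("cloudFiles.useNotifications", "true")  # Event Grid + storage queue under the hood
          .option("cloudFiles.schemaLocation", checkpoint_path)
          .load(source_path))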
Lecture 39
Azure Synapse Analytics
Apache Spark is good at processing huge amounts of data, and the tool which we were using so far that uses Spark under the hood was Databricks. Databricks is not a product developed by Microsoft, and maybe your company has a rule not to use 3rd-party solutions like Databricks. Microsoft gives the option to use Spark Pools in Synapse.
In reality it is difficult to work with raw Spark because it involves configuring various nodes. So we were using Databricks, which made it easier to deploy clusters without caring about the infrastructure.
But if we don't want to leave the Azure environment, we can do the same thing we have been doing inside Databricks and deploy a cluster by utilizing Spark Pools. Therefore, Spark Pools can also be used for data transformation.
One advantage of using Spark Pools is that they integrate well with other Azure services.
Click on “Create”
You will have to provide a separate Data Lake for Synapse where it will hold catalog data and
metadata associated with the workspace.
Suppose we created the following Storage Account for Synapse. Next step is to create a
linked service pointing to the Data Lake dedicated for Synapse Workspace
Go to your Synapse Workspace
To open UI
How to make a Spark Pool: it is something like a cluster in Databricks, simply compute power that we will use to run queries in our notebooks.
Configure it properly
Minimum number of nodes is always 3.
The smallest pool that you can create consists of 3 nodes, whereas in Databricks we could create a single-node cluster.
This means that, as per demand, the number of nodes can be increased, but not more than 30.
Also, like what we had in Databricks, we can pause the pool when there is no activity.
Apache Spark version: Synapse is always behind Databricks in releasing new versions of Spark.
Data Tab
And inside the Linked tab you will see all the linked services related to data that is defined in
the workspace
It also allows you to browse the content of the Data Lake without leaving the UI
Suppose you are creating a Notebook on Spark Pool. You specify to which pools this
notebook should be attached
Displaying Data
mssparkutils
Magic commands use %% (e.g., %%sql), not % as in Databricks.
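A small sketch of the Synapse equivalents (placeholder path; display() works the same way as in Databricks):

    from notebookutils import mssparkutils

    mssparkutils.fs.ls("abfss://raw@<storage-account>.dfs.core.windows.net/")
    display(df.limit(10))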
Saving data as delta
Creating Database
It means that the data was saved somewhere and the minifig table was registered somewhere. But where?
And if you open it
One more thing: Databricks saves files as delta by default, whereas Synapse doesn't.
So if you want to save a file as delta, you have to tell Synapse explicitly.
Saving Data to our Data lake
Lecture 41
Transforming data with Data Flows
The graph displays the transformation stream. It shows the lineage of source
data as it flows into one or more sinks. To add a new source, select Add
source. To add a new transformation, select the plus sign on the lower right
of an existing transformation. Learn more on how to manage the data flow
graph.
Configuration panel
The configuration panel shows the settings specific to the currently selected
transformation. If no transformation is selected, it shows the data flow. In the
overall data flow configuration, you can add parameters via
the Parameters tab.
Transformation settings
Optimize
The Inspect tab provides a view into the metadata of the data stream that
you're transforming. You can see column counts, the columns changed, the
columns added, data types, the column order, and column
references. Inspect is a read-only view of your metadata. You don't need to
have debug mode enabled to see metadata in the Inspect pane.
As you change the shape of your data through transformations, you'll see the
metadata changes flow in the Inspect pane. If there isn't a defined schema
in your source transformation, then metadata won't be visible in
the Inspect pane. Lack of metadata is common in schema drift scenarios.
Top bar
The top bar contains actions that affect the whole data flow, like validation
and debug settings. You can view the underlying JSON code and data flow
script of your transformation logic as well.
Getting started
Data flows are created from the Develop pane in Synapse studio. To create a
data flow, select the plus sign next to Develop, and then select Data Flow.
This action takes you to the data flow canvas, where you can create your
transformation logic. Select Add source to start configuring your source
transformation.
A source transformation configures your data source for the data flow. When
you design data flows, your first step is always configuring a source
transformation. To add a source, select the Add Source box in the data flow
canvas.
Every data flow requires at least one source transformation, but you can add
as many sources as necessary to complete your data transformations. You
can join those sources together with a join, lookup, or a union
transformation.
Integration datasets
It means reusing one of the datasets that we created previously. This option lets us choose a dataset that is available and visible in the scope of our whole Synapse Workspace.
Inline datasets
Persists only to the scope of a particular dataflow. They will not be available
to be used outside.
Inline datasets are recommended when you use flexible schemas, one-off
source instances, or parameterized sources. If your source is heavily
parameterized, inline datasets allow you to not create a "dummy" object.
Inline datasets are based in Spark, and their properties are native to data
flow.
Schema options
Because an inline dataset is defined inside the data flow, there isn't a
defined schema associated with the inline dataset. On the Projection tab, you
can import the source data schema and store that schema as your source
projection. On this tab, you find a "Schema options" button that allows you to
define the behavior of ADF's schema discovery service.
Use projected schema: This option is useful when you have a large
number of source files that ADF scans as your source. ADF's
default behavior is to discover the schema of every source file.
But if you have a pre-defined projection already stored in your
source transformation, you can set this to true and ADF skips
auto-discovery of every schema. With this option turned on, the
source transformation can read all files in a much faster manner,
applying the pre-defined schema to every file.
Allow schema drift: Turn on schema drift so that your data flow
allows new columns that aren't already defined in the source
schema.
Validate schema: Setting this option causes the data flow to fail if
any column and type defined in the projection doesn't match the
discovered schema of the source data.
Infer drifted column types: When new drifted columns are
identified by ADF, those new columns are cast to the appropriate
data type using ADF's automatic type inference.
Workspace DB (Synapse workspaces only)
In Azure Synapse workspaces, an additional option is present in data flow
source transformations called Workspace DB. This allows you to directly pick a
workspace database of any available type as your source data without
requiring additional linked services or datasets. The databases created
through the Azure Synapse database templates are also accessible when you
select Workspace DB.
Source settings
After you've added a source, configure via the Source settings tab. Here
you can pick or create the dataset your source points at. You can also select
schema and sampling options for your data.
Test connection: Test whether or not the data flow's Spark service can
successfully connect to the linked service used in your source dataset.
Debug mode must be on for this feature to be enabled.
Schema drift: Schema drift is the ability of the service to natively handle
flexible schemas in your data flows without needing to explicitly define
column changes.
Select the Allow schema drift check box if the source columns
change often. This setting allows all incoming source fields to flow
through the transformations to the sink.
Validate schema: If Validate schema is selected, the data flow fails to run
if the incoming source data doesn't match the defined schema of the
dataset.
Skip line count: The Skip line count field specifies how many lines to
ignore at the beginning of the dataset.
Sampling: Enable Sampling to limit the number of rows from your source.
Use this setting when you test or sample data from your source for
debugging purposes. This is very useful when executing data flows in debug
mode from a pipeline.
Source options
The Source options tab contains settings specific to the connector and
format chosen. This includes details like isolation level for those data sources
that support it (like on-premises SQL Servers, Azure SQL Databases, and
Azure SQL Managed instances), and other data source specific settings as
well.
Projection
Used to show the schema of the input and output data, meaning it gives some idea of what our data looks like when it enters the transformation and what it will look like when it leaves the transformation.
Although data flows are a visual tool, they need compute power to execute our steps; in the case of pipelines, Integration Runtimes were our compute infrastructure.
Debug time to live controls shutting down the Spark cluster when it is not in use.
Import schema
Select the Import schema button on the Projection tab to use an active
debug cluster to create a schema projection. It's available in every source
type. Importing the schema here overrides the projection defined in the
dataset. The dataset object won't be changed.
Importing schema is useful in datasets like Avro and Azure Cosmos DB that
support complex data structures that don't require schema definitions to
exist in the dataset. For inline datasets, importing schema is the only way to
reference column metadata without schema drift.
If your text file has no defined schema, select Detect data type so that the
service samples and infers the data types. Select Define default format to
autodetect the default data formats.
Overwrite schema allows you to modify the projected data types here at the source, overwriting the schema-defined data types. You can alternatively modify the column data types in a downstream derived-column transformation. Use a select transformation to modify the column names.
Data preview
If debug mode is on, the Data Preview tab gives you an interactive
snapshot of the data at each transform.
To fix the above error:
Handling Null
For example, when you ingest data from an API, its previous-page value is set to NULL for the first page. Now maybe we have some business rule that says to replace the NULL with "Not Available".
Name your transformation and under Columns choose the column you want
to play with
Inside the Expression write your logic
Removing NULL values this time
SELECT transformation
Conditional Logic
Add a new column and build an expression of it
Add a new dataset that will store your results as delta format
But you will see that there is no option to create a dataset to save our
results as delta in case we choose Integration dataset as Sink type.
Let's proceed with an Inline dataset. While choosing the Inline dataset, you will see from the options that it supports the delta format.
Executing Dataflows
Inside your pipeline simply add Dataflow as an activity. This is the only
way to run your flow
There you will provide the name of your recently created Data flow.
Then you will Debug the pipeline
LECTURE 44
Loading data to Dedicated SQL Pool
Polybase
We are considering the pull approach here, where the SQL Pool pulls the data from an external source (the Data Lake in this case). This approach can be used when you have some on-premises Data Warehouse and the data is already transformed there, implemented as stored procedures. Therefore, this is just a lift-and-shift case; there is no involvement of Databricks, ADF, or Synapse.
The next step is to define the external file format, meaning the format of the file to be read by Synapse from this curated container. But unfortunately, Polybase does not read the delta file format; it can read either the CSV or the parquet file format.
CSV file format
The options available while defining the CSV format are limited. If there are some special characters, you won't be able to read the data; there is no option to specify escape characters.
Then you create an External Table referencing the External Data Source. EXTERNAL TABLE means the data is not stored inside Synapse; it lies somewhere else. We know that in the case of CSVs all columns come in as strings, so to avoid datatype mismatches we define all the columns as string type. Also, we reference the container via DATA_SOURCE, specify where exactly under LOCATION, and for the file format we provide the format defined earlier as MinifigsCSV.
Then recreate another External Table to use the above file format
This time you did not keep the column datatypes as string; instead, you explicitly gave appropriate datatypes to the columns, and there is no data mismatch upon loading the data because in Parquet the schema comes with the file. Also, you specify the FILE_FORMAT that you already created, and you changed the LOCATION.
At this point, DML operations will not be supported, meaning you will only be able to read the data. Updating and deleting records will not be supported because the data still lives on the Data Lake. To make the data available inside the dedicated SQL Pool, you will have to create another, separate table and then load the data from the External Table into that table using
CTAS = Create Table As Select, as sketched below.
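A sketch of the CTAS step, shown here as T-SQL submitted from Python with pyodbc (server, database, credentials, table names and the distribution choice are all placeholders/illustrative):

    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 18 for SQL Server};"
        "SERVER=<workspace>.sql.azuresynapse.net;"
        "DATABASE=<dedicated-sql-pool>;UID=<user>;PWD=<password>",
        autocommit=True,
    )

    # Load from the external table into a regular distributed table that supports DML.
    conn.execute("""
        CREATE TABLE dbo.StagingMinifigs
        WITH (DISTRIBUTION = ROUND_ROBIN)
        AS SELECT * FROM dbo.MinifigsExternal;
    """)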
Now, the next challenge is that if the data changes in the source, it is reflected in the external table but not in the staging table on which you fire the DML queries.
Another Method
Copy
PUSH
Go to ADF and use copy activity. We already know that copy activity exists
both in ADF and Synapse if we want to move data from one place to another.
Create a new pipeline -> grab a Copy data activity -> define Source and Sink
You will find that there is no option to choose delta file format while creating
a Dataset
Note: Delta file format is not supported in Pipelines
We are using Source Type as Inline because Dataset at pipeline level does
not support delta format
Choose a Linked Service pointing to the Data Lake under the Linked service option.
Under Source options specify where in your Data Lake your data is present.
As you know your sink location is dedicated SQL Pool so create a new dataset
Create a new linked service to Synapse
Indicate to which table you want to store the data
Write a SQL query in Pre SQL scripts
Define a staging location where the data will first be dumped to a data lake; from there it will be loaded into the dedicated SQL Pool. The Data flow converts the data into a form that is supported by Polybase.
Copy Activity in ADF
Create a new pipeline -> use a Copy activity
Choose the Dataset that is already configured to read a csv from a DataLake.
Just update the path to the data that you would like to be read
You can also notice that this copy activity has an additional functionality that
was not present earlier
Sink side configuration
Configure it accordingly
Because you are reading a csv file you can make some changes under
Mapping tab
We saw that when we did transformations using Spark Pools, we already had our data available in delta format, and Spark can read the delta file format. Now the question is: can a Spark Pool connect to a SQL Pool?
This will result in Scala code to connect to the table and read its content.
Or else you can use the Python code:
synapsesql("<dedicated-sql-pool>.<schema>.<table>")
This way you can read data from a table saved in our SQL Pool into a dataframe in the Spark Pool, as sketched below.
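A sketch of the Python form (pool, schema and table names are placeholders; the write shown assumes the default internal table type):

    # Read a dedicated SQL Pool table into a Spark dataframe...
    df = spark.read.synapsesql("<dedicated-sql-pool>.<schema>.<table>")

    # ...and write a dataframe back to the pool.
    df.write.synapsesql("<dedicated-sql-pool>.<schema>.<new-table>")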
This time we were able to read the data from the delta format using Spark Pools, and to write it to the SQL Pool we utilized spark.write.synapsesql, but under the hood it used a COPY operation.
Using Databricks
We have got our Databricks workspace that saved the data to our Data Lake
as delta format. We will utilize the same workspace.
You have to set some initial configurations defining how to connect to the data lake and to Synapse. We will use a dedicated Service Principal and store its secret in a Key Vault.
But you also need to grant permissions to the Service Principal inside the SQL Pool.
Also, you can read the data from SQL Pools to Databricks
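A sketch of both directions using the Databricks Azure Synapse connector (option names follow the "com.databricks.spark.sqldw" format; every value is a placeholder):

    jdbc_url = ("jdbc:sqlserver://<server>.sql.azuresynapse.net:1433;"
                "database=<dedicated-sql-pool>")
    temp_dir = "abfss://staging@<storage-account>.dfs.core.windows.net/tmp"

    # Write a dataframe to the dedicated SQL Pool.
    (df.write
       .format("com.databricks.spark.sqldw")
       .option("url", jdbc_url)
       .option("enableServicePrincipalAuth", "true")  # uses the Service Principal configured above
       .option("dbTable", "dbo.Minifigs")
       .option("tempDir", temp_dir)
       .mode("overwrite")
       .save())

    # Read a table from the dedicated SQL Pool into a dataframe.
    df2 = (spark.read
           .format("com.databricks.spark.sqldw")
           .option("url", jdbc_url)
           .option("enableServicePrincipalAuth", "true")
           .option("dbTable", "dbo.Minifigs")
           .option("tempDir", temp_dir)
           .load())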
Lecture 45
Upon re-running the query, you will notice that this time the query takes more time to execute.
The solution is to cache the results.
The following highlighted line just provides a label for the query so that you can refer to the same query again and again by this label.
Enabling result set caching
After setting this up, if you rerun your query, the result will be cached.
And if you rerun the query again, it will take the data from the cache. A sketch of enabling this is shown below.
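A sketch of enabling this from Python with pyodbc (the ALTER DATABASE must run against master; all names and credentials are placeholders):

    import pyodbc

    master = pyodbc.connect(
        "DRIVER={ODBC Driver 18 for SQL Server};"
        "SERVER=<workspace>.sql.azuresynapse.net;DATABASE=master;UID=<user>;PWD=<password>",
        autocommit=True,
    )
    master.execute("ALTER DATABASE [<dedicated-sql-pool>] SET RESULT_SET_CACHING ON;")

    # A labelled query against the pool itself; rerunning it should now hit the result cache.
    pool = pyodbc.connect(
        "DRIVER={ODBC Driver 18 for SQL Server};"
        "SERVER=<workspace>.sql.azuresynapse.net;DATABASE=<dedicated-sql-pool>;UID=<user>;PWD=<password>",
        autocommit=True,
    )
    row = pool.execute("SELECT COUNT(*) FROM dbo.Minifigs OPTION (LABEL = 'count minifigs');").fetchone()
    print(row[0])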
The cache is evicted: every 48 hours (if not used), if the underlying data has changed, if the cache size reaches its limit, or if the query is non-deterministic.