DP-203 Azure Data Engineer Study Material
SaaS >> Software as a Service (examples of SaaS: Skype, Gmail, Facebook, WhatsApp, etc.)
Azure cloud services are designed to deploy and manage even complex apps through virtual infrastructure. Azure supports various programming languages, devices, databases, operating systems, and extensive frameworks.
Therefore, Azure services aimed at professionals and enterprises offer all-around alternatives to traditional organizational processes, with the top Azure services greatly improving performance.
Example of Resources/Services:
Resources/Services                Category
Storage Accounts                  PaaS
Data Lake Gen2 Storage Accounts   PaaS
SQL Database                      PaaS
SQL Server                        PaaS (IaaS when installed on an Azure VM)
Azure Data Factory (ADF)          PaaS
Azure Storage Explorer (ASE)      Client tool
AzCopy                            Command-line tool
Azure Databricks                  PaaS
the same resource group. We must wait until this operation has been
completed.
What is the Azure Storage service: Azure Storage is Microsoft's cloud storage solution for modern data storage scenarios.
What is an Azure Storage Account: an Azure storage account contains all our storage data objects, such as blobs, files, queues and tables. The storage account provides a unique namespace for our storage data that is accessible from anywhere in the world over HTTP or HTTPS.
Container/Blob and File storage are the main services that Azure Data Engineers work with.
Table and Queue storage are mostly used by application developers.
Container/Blob (Binary Large Object): it is used to store binary large objects. In a blob we can store unstructured data, and it is part of the storage service.
1. Blob storage offers three types of blob: (i) Page blob, (ii) Append blob, (iii) Block blob.
2. Page blob: used to keep VM disks and other data that needs frequent random reads and writes; here we basically store unstructured data (e.g. video files of 2-3 hrs, VM disks, DB files, unstructured DB files, etc.).
3. Append blob: used for logging purposes such as VM logs, diagnostics logs, etc.
4. Block blob: gives us URL access to the data, which helps us keep data such as docs, videos, images, PDFs, etc.
5. A storage account (SA) is just a namespace (or placeholder); once we have created a storage account, we get access to the Blob/Queue/Table/File share storage services, etc.
6. When we create blob or file storage in a storage account, we create a container, which is nothing but a folder.
7. Normally we upload .vhd files to a page blob. When we create a page blob or block blob and go into it, we get a URL; this URL is private by default and won't open in a browser.
8. When we create an SA we basically fill in the details below (a programmatic sketch follows this list):
(i) Subscription (ii) Resource group (iii) Storage account name (iv) Location (v) Performance: standard or premium (vi) Account kind (vii) Replication (viii) Access tier
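The same details can be supplied from code. This is a minimal sketch using the Azure SDK for Python (azure-identity + azure-mgmt-storage); the subscription ID, resource group and account name are placeholders, not values from this material.

```python
# Hedged sketch: create a StorageV2 account with the management SDK.
# Assumes `pip install azure-identity azure-mgmt-storage`; all names below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountCreateParameters, Sku

subscription_id = "<subscription-id>"            # assumption: your subscription
client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

poller = client.storage_accounts.begin_create(
    resource_group_name="my-rg",                 # assumption: existing resource group
    account_name="practice11",                   # must be globally unique, lowercase
    parameters=StorageAccountCreateParameters(
        location="eastus",                       # Location
        kind="StorageV2",                        # Account kind
        sku=Sku(name="Standard_LRS"),            # Performance + Replication
        access_tier="Hot",                       # Access tier
    ),
)
account = poller.result()                        # wait for provisioning to finish
print(account.primary_endpoints.blob)            # blob endpoint of the new SA
```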
19. If we want to download/upload data from all networks, we choose the option All networks; if we want to restrict access to a particular network, we choose Selected networks.
20. Data protection: (i) Disable (ii) Enable.
21. It allows us to recover our blob data when blobs or blob snapshots are deleted; if we overwrite the blob data by any chance, then enabling this option keeps our data for a specific time (7, 9, 20 or 50 days, based on our choice).
22. File share storage: we create a file share and then help users map this file share for the team.
23. We can create a directory or folder in the file share; when connecting it to a machine it gives us an option for which OS or machine we want to connect (Windows, Linux, macOS), and the file share is mapped to our machine.
24. For whichever drive letter we choose, we get a corresponding script to run in PowerShell to map the file share to our machine or VM.
25. To map the file share to our VM we need port 445 (the SMB/CIFS port). This port must be open in our environment/by our internet service provider; only if this port is open can we map it.
26. In an Azure VM, check whether port 445 is open; if it is open, we can map our file share to that VM (an SDK-based alternative that avoids SMB is sketched below).
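If port 445 is blocked, the same file-share operations can still be done over HTTPS with the azure-storage-file-share SDK. This is a minimal sketch under assumed names; the connection string, share and file names are placeholders.

```python
# Hedged sketch: create a share, a directory and a file without mounting over SMB/445.
from azure.storage.fileshare import ShareClient

conn_str = "<storage-account-connection-string>"     # from the Access keys blade
share = ShareClient.from_connection_string(conn_str, share_name="myshare")
share.create_share()                                  # create the file share
share.create_directory("reports")                     # directory inside the share

file_client = share.get_file_client("reports/notes.txt")
file_client.upload_file(b"hello from the SDK")        # upload some content
```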
27. When we create a folder/container in blob storage we have 3 options for the public access level: (i) Private (ii) Blob (iii) Container.
(i) Private: only accessible to the owner/subscriber who created the blob storage, or to users the owner has granted access to.
(ii) Blob: anyone can read blobs only (a blob means a single file); a blob sits under a container, and a container can have many blobs under it.
(iii) Container: anonymous read access for containers and blobs.
28. After we upload files/docs to the Storage Account (SA) blob, if we open the link in a browser it looks somewhat like the URL below (a code sketch follows the link).
https://Practice11.blob.core.windows.net/manish/1st%20March(5).jpg
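The same URL can be obtained from code after an upload. A short sketch with azure-storage-blob; the connection string, container and blob names are placeholders:

```python
# Upload a blob and print its URL; with the container access level left at
# Private, opening this URL in a browser fails unless a SAS or credential is supplied.
from azure.storage.blob import BlobServiceClient

conn_str = "<storage-account-connection-string>"
service = BlobServiceClient.from_connection_string(conn_str)
container = service.get_container_client("manish")          # container = folder
blob_client = container.upload_blob(name="hello.txt", data=b"hello")
print(blob_client.url)   # e.g. https://<account>.blob.core.windows.net/manish/hello.txt
```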
storage and a CDN is implemented, then the endpoint for that user is served from, say, New York; when any other user accesses the same video/file, it is served from the cache and the request does not go back to the origin network. Once it is cached, further user requests do not hit the network.
CDN can be implemented for cloud services, Storage Accounts, web applications and custom origins; here we create the CDN under the storage account.
The purpose of creating a CDN is to make sure that the files we access from blob storage are also accessible through the CDN.
After implementing the CDN, in the overview we can see the endpoint host name, e.g. https://educdn.azureedge.net.
After implementing the CDN, in the overview we can see the origin host name, e.g. https://practice11.blob.core.windows.net.
We can implement the CDN for blob, web application, custom origin, cloud services, etc.
When we click on Caching rules (left side under CDN) we can see the query string caching behaviour options: (i) Ignore query string (ii) Bypass caching for query string (iii) Cache every unique URL.
The purpose of creating a CDN is to make sure the files we access in blob storage are also accessible from the CDN endpoint host name, so the links we have are:
Blob: https://practice11.blob.core.windows.net/Gareth/mydetails
CDN: https://educdn.azureedge.net/Gareth/mydetails
educdn >> CDN name
.azureedge.net >> CDN host name
Gareth >> folder or container name that we created under blob storage
mydetails >> file that we uploaded under the blob folder/container
For storage offerings we can implement CDN only for blob storage.
Once the CDN is implemented and enabled, customers can access the same file using the endpoint host name.
CDN endpoint: a CDN endpoint is a subdomain of the CDN host name (i.e. .azureedge.net) which is used to deliver the files over HTTP or HTTPS (a small illustration follows).
We do not know in which location the CDN is created; Microsoft takes care of where the CDN is created.
analytics, and that is a centralized platform where we get all alert notifications.
Web Application Firewalls:
A WAF prevents malicious attacks close to the attack sources, before they enter our virtual network. We get global protection at scale without sacrificing performance. A WAF policy easily links to any Azure Front Door profile in our subscription. New rules can be deployed within minutes, so we can respond quickly to changing threat patterns. Further information about the WAF is available at the link below:
What is Azure Web Application Firewall on Azure Front Door? | Microsoft Learn
3. When we click on Add an account in ASE we get multiple options to log in. The different options to log in to ASE are as follows:
(i) Add an Azure account (ii) Use a connection string (iii) Use a storage account name and key (iv) Use a shared access signature (SAS)
4. (i) Add an Azure account: here we pass our Azure subscription credentials, which is global admin access. After logging in we can perform many operations in ASE, e.g. upload/download/open/new folder/rename/delete/create snapshot, etc. If we are not able to see one of the SAs in ASE, it means we don't have access to that particular SA; with global admin credentials we can see all SAs and all their contents in ASE.
5. In ASE we can perform operations on all the SA offerings: Blob/File/Table/Queue storage.
6. Connecting to multiple SAs in ASE is possible.
7. (ii) Use a storage account name & key: to log in to ASE we pass the SA name (exactly as it is) and the key, along with a display name; the key value is obtained from Access keys (left side, in the Settings section, after opening the SA). When logging in with this option we have to pass:
Display name (it can be anything we like)
SA name (it must be exactly the same)
Account key (the key value, exactly the same). If the key has been shared by a global admin with users in Europe/USA etc., they can still connect to ASE and access the SA content.
8. If we connect to ASE using the options above, we can only see the contents of that given storage account, not the contents of all SAs (very important to remember).
9. A user who logs in by passing the key value has full rights to delete/add/modify the data in the SA. After we delete any files/folders from blob storage there is a retention period; they are kept in a recycle-bin-like state for about 7/15/30/60 days, as per our choice.
10. Once we share this key with anyone in the world, they get complete access to our SA. If they are no longer in our organization and we don't want them to access our SA content, we simply regenerate the key; the old key value expires, but we then have to share the new key with all users across the globe again.
11. If we want to remove an SA that we have added to ASE, simply select the SA >> right click >> Detach.
12. If we log in to ASE for one SA by passing its account name and key, and then pass another account's name and key, we can see the other SA's details as well, and so on; we can add as many SAs as we want.
13. (iii) Use a connection string: here we give a display name of our choice and pass the string value from the SA Access keys. In Access keys we have two key values and two connection string values. Why do we have two?
14. It is for backup: we have key1 and key2. If we regenerate key1, access via key1 is gone, so we can give key1 to higher-priority users and key2 to lower-priority users; if either key is compromised, we regenerate only that key so the other users are not impacted. We can categorize the users and share the keys accordingly (the sketch below shows both sign-in options in code).
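The two sign-in options above map directly onto the azure-storage-blob client. A minimal sketch; the account name, keys and connection string are placeholders:

```python
# Key 1 and key 2 are interchangeable; regenerating one immediately invalidates
# any client that was constructed with it.
from azure.storage.blob import BlobServiceClient

# (iii) connection string from the Access keys blade
service = BlobServiceClient.from_connection_string("<connection-string-1>")

# (ii) storage account name + key (the account name is part of the URL)
service = BlobServiceClient(
    account_url="https://practice11.blob.core.windows.net",
    credential="<account-key-1-or-2>",
)
print([c.name for c in service.list_containers()])   # full access to this SA only
```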
15. When we upload a file in ASE we get an option under Blob type asking which blob type to store the file as: block blob/page blob/append blob.
16. (iv) Shared Access Signature (SAS): with this feature we can grant a user access to our SA for a specific period of time. When we grant access using SAS, that access is restricted (it has limitations, unlike a connection string or SA name & key).
17. Use cases of SAS:
(i) If we want to give file share access to one set of users (ABC) and not to another set (BCD) in the SA, we go with the SAS concept.
(ii) If some users are going to be in our team for only 3 months, we can use SAS for that specific period of time.
18. When we configure SAS (left side) under the SA we get the options below:
Allowed services >> (i) Blob (ii) File (iii) Queue (iv) Table
Allowed resource types >> (i) Service (ii) Container (iii) Object
Allowed permissions >> (i) Read (ii) Write (iii) Delete (iv) List (v) Add (vi) Create (vii) Update, etc.
Start & expiry date/time >> here we can give even 1 day or 1 hour
Allowed protocols >> (i) HTTPS only (ii) HTTPS and HTTP
19. If we select only Blob at the top, then we get only the blob service in the SAS URL (a code sketch of generating an account SAS follows).
Soft delete: Azure Files share storage and Blob storage offer soft delete so that we can more easily recover our data when it is mistakenly deleted by an application or another storage account user.
Soft delete allows us to recover our file shares and blob data in case of accidental deletes.
When soft delete for Azure file shares & blob storage is enabled, a deleted file share or blob transitions to a soft-deleted state instead of being permanently erased. We can configure the amount of time soft-deleted data is recoverable before it is permanently deleted, and undelete the share any time during this retention period. After being undeleted, the share and all of its contents, including snapshots, are restored to the state they were in prior to deletion. Soft delete only works at the file share level - individual files that are deleted will still be permanently erased.
Soft delete can be enabled on either new or existing file shares. Soft
delete is also backwards compatible, so you don't have to make any
changes to your applications to take advantage of the protections of soft
delete.
To permanently delete a file share in a soft delete state before its expiry
time, you must undelete the share, disable soft delete, and then delete
the share again. Then you should re-enable soft delete, since any other
file shares in that storage account will be vulnerable to accidental deletion
while soft delete is off.
For soft-deleted premium file shares, the file share quota (the provisioned
size of a file share) is used in the total storage account quota calculation
until the soft-deleted share expiry date, when the share is fully deleted.
If you enable soft delete for file shares, delete some file shares, and then disable soft delete, you can still access and recover those file shares as long as they are within their retention period. When you enable soft delete, you also need to configure the retention period.
Retention period:
The retention period is the amount of time that soft deleted file shares are
stored and available for recovery. For file shares that are explicitly
deleted, the retention period clock starts when the data is deleted.
Currently you can specify a retention period between 1 and 365 days. You
can change the soft delete retention period at any time. An updated
retention period will only apply to shares deleted after the retention
period has been updated. Shares deleted before the retention period
update will expire based on the retention period that was configured when
that data was deleted.
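For blob storage, the same retention setting can be applied from code. A minimal sketch (blob service only; the connection string is a placeholder) of what the Data protection tab does when the account is created:

```python
# Turn on blob soft delete with a 7-day retention period (valid range 1-365 days).
from azure.storage.blob import BlobServiceClient, RetentionPolicy

service = BlobServiceClient.from_connection_string("<connection-string>")
service.set_service_properties(
    delete_retention_policy=RetentionPolicy(enabled=True, days=7)
)
props = service.get_service_properties()
print(props["delete_retention_policy"].days)   # confirm the configured retention
```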
Implementation steps:
1) Create a storage account (when creating an SA, soft delete is automatically enabled; we can see this in the Data protection tab while creating the SA, and we can also change the number of retention days for the file share/blob storage).
2) Create two file shares and upload a file inside each file share.
3) Delete one of the file shares by clicking the 3 dots (extreme right), click Delete share >> check the checkbox "I agree to the deletion of my file share" >> and finally click Delete.
4) Now if we click Refresh under File shares, we see there is only one file share available.
5) Click on Show deleted shares; here we see the deleted file share as well.
6) Now click on the deleted file share, click the three dots (extreme right) and click Undelete.
Note: soft delete is enabled/works at the file share level, not on the individual files we upload inside it; if we delete a file that we uploaded inside the file share, that file cannot be recovered.
Scenario 1: How to delete a file share permanently when the soft delete is
enabled
If we want to delete a file share permanently while soft delete is enabled, we first have to disable soft delete; then we can delete the file share, and this time it is deleted permanently.
Implementation of the above point:
1) Click on Soft delete: 7 days, choose Soft delete for all file shares as Disabled and finally click the Save button (at the bottom).
2) Now if you delete a file share it is deleted permanently, because soft delete is disabled; if we click Show deleted shares it will not show the deleted file share, because this time the file share was deleted permanently.
Scenario 2: How to delete a file share in a soft-deleted state before its expiry time.
Step 1: First delete the share (before doing this, ensure soft delete is enabled) and click on Show deleted shares.
Step 2: Undelete the file share which we have deleted.
Step 3: Disable soft delete.
Step 4: Delete the same file share again.
Step 5: Now if we click on Show deleted shares we won't see the file share which we deleted.
Hence, we can also delete a file share in a soft-deleted state before its expiry time.
Step 2: Now go inside Storage Account 1 and click on Object replication (left side) >> + Create replication rules
Step 3: Destination subscription: Free trial
Step 4: Destination storage account: here carefully select the storage account into which we want to replicate the data
Step 5: Source container: here select the container from the 1st storage account
Step 6: Destination container: here select the container from the 2nd storage account
Step 7: Copy over: click on Change, then we find the 3 options below:
(i) Everything, if we want to copy everything
(ii) Only new objects
(iii) Custom
Click on Save and finally click on Create (to create an object replication rule).
Note: once the object replication rule has been implemented, if we delete the objects/files from the 1st storage account container, the objects/files in the 2nd storage account container are deleted automatically as well.
Azure Data Lake Storage Gen2 storage accounts:
Azure Data Lake Storage Gen2 (ADLS Gen2) is a cloud-based repository/storage account for both structured and unstructured data. For example, we could use it to store everything from documents to images to social media streams. Data Lake Storage Gen2 is built on top of Blob Storage. This gives us the best of both worlds.
Azure Blob Storage is one of the most common Azure storage types. It's an object storage service for workloads that need high-capacity storage. Azure Data Lake is a storage service intended primarily for big data analytics workloads.
Azure Data Lake Gen1 is a storage service that's optimized for big data analytics workloads. Its hierarchical file system can store machine-learning data, including log files, as well as interactive streaming analytics.
Azure Data Lake Gen2 converges the features and capabilities of Data Lake
Gen1 with Blob Storage. It inherits the file system semantics, file-level security
and scaling features of Gen1 and builds them on Blob Storage. This results in a
low-cost, tiered-access, high-security and high availability big data storage
option.
Azure Blob Storage and Data Lake are each well suited to specific situations and uses. One challenge with Azure blobs is that customers can incur lots of data transfer charges. Along with the typical data transfer read/write charges at the various tiers -- Premium, Hot, Cool and Archive -- there are iterative read/write operation charges, indexing charges, SSH FTP transfer charges, fees for data transfers of geo-replicated data and more. Each transfer type may only cost fractions of cents, but when doing hundreds of thousands of transactions, these costs can add up quickly.
Azure Data Lake enables users to store and analyze petabytes (PB) of data
quickly and efficiently. It centralizes data storage, encrypts all data and offers
role-based access control. Because Data Lake storage is highly customizable, it
is economical. Users can independently scale storage and computing services
and use object-level tiering to optimize costs.
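A short sketch of what the hierarchical namespace means in practice, using the azure-storage-file-datalake SDK (the account name, key, file system and paths are placeholders): real directories such as Year/Month/Day can be created, which is the key difference from flat blob names.

```python
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://1963adlsa.dfs.core.windows.net",   # note the .dfs endpoint
    credential="<account-key>",
)
fs = service.create_file_system("raw")                      # container
directory = fs.create_directory("sales/2024/03/15")         # nested folder structure
file_client = directory.create_file("orders.csv")
data = b"order_id,amount\n1,100\n"
file_client.append_data(data, offset=0, length=len(data))
file_client.flush_data(len(data))                           # commit the upload
```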
Azure Logic Apps simplifies the way that we connect legacy, modern, and
cutting-edge systems across cloud, on premises, and hybrid environments
and provides low-code-no-code tools for you to develop highly scalable
integration solutions for your enterprise and business-to-business (B2B)
scenarios.
This list describes just a few example tasks, business processes, and
workloads that we can automate using Azure Logic Apps:
Implementation steps:
Step 1: Create a resource group.
Step 2: Create a storage account and create a queue storage service named mylogicappqueue in the SA.
Step 3: Search for Logic App in the Azure portal >> + Add >> and fill in the details accordingly
Logic App name: NareshStudentsAPPScheduler / any name of your choice
Plan type: Consumption
Leave the rest of the values at their defaults and provision the Logic App; it then automatically navigates us to the Logic App Designer page (or click on Overview of the Logic App >> scroll down a little >> Category: Schedule >> click on "Scheduler - Add message to queue" >> Use this template >> click on + / Sign in and pass the details below)
(i) Connection name: MyconnectionforLA
(ii) Authentication type: Access key
To see the messages that we have set, click on Logic app designer (left side under the Logic App).
Note: if we want to put all the error messages in a separate queue, create a new queue in the same storage account with the name myqueue-error >> go to the Logic App >> Logic app designer (left side) >> click on the Handle errors box >> Queue Name: myqueue-error >> Save >> Run Trigger.
Now come to the storage account >> go to myqueue-error, and here we will see all the messages coming in as "Some error occurred" in a separate queue (a sketch of reading both queues in code follows).
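The messages that the Logic App scheduler puts into the two queues can also be read with the azure-storage-queue SDK. A minimal sketch; the connection string is a placeholder and the queue names are the ones used in the steps above:

```python
from azure.storage.queue import QueueClient

conn_str = "<storage-account-connection-string>"
for queue_name in ("mylogicappqueue", "myqueue-error"):
    queue = QueueClient.from_connection_string(conn_str, queue_name)
    for msg in queue.receive_messages():
        print(queue_name, msg.content)
        queue.delete_message(msg)        # remove the message once processed
```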
SQL DB as a service in Azure:
Basically, Azure gives us two options to run SQL Server workloads.
PaaS: Azure SQL Database (DBaaS), a fully managed database service.
IaaS: SQL Server on Azure VMs, i.e. SQL Server installed inside a VM that we manage.
Azure SQL Database: Azure SQL DB is a cloud-based relational database service built on SQL Server technologies. It supports T-SQL commands, tables, indexes, views, primary keys, stored procedures, triggers, roles, functions, etc.
SQL Database delivers predictable performance, scalability with no downtime, business continuity and data protection with almost zero administration, so we can focus on rapid app development and accelerating our time to market rather than managing virtual machines and infrastructure. As it is based on the SQL Server engine, SQL DB supports existing SQL Server tools, libraries and APIs, which makes it easier for us to move and extend to the cloud (see the connection sketch below).
In Azure SQL DB’s are available in two purchasing models DTU & vCore. SQL
Databases is available in
(i)Basic,
(ii)Standard/General Purpose,
(iii)Premium (Business critical & Hyper scale service tiers) each service tier
offers different level of performance and capabilities to support lightweight to
heavyweight database workloads, we can build our first app on a small
Page | 22
23
database for few months and then we can change the service tier manually or
programmatically at any time based upon our convenience without any
downtime to our apps and customers.
Benefits of SQL DB as a service:
High availability: for each SQL DB created on Azure there are three replicas of that database.
On demand: we can quickly provision a DB when needed with a few mouse clicks.
Reduced management overhead: it allows us to extend our business applications into the cloud by building on core SQL Server functionality.
SQL Database deployment options:
(i) Single database: an isolated single DB; it has its own guaranteed compute, memory and storage.
(ii) Elastic pool: a collection of single DBs with a fixed set of resources, such as CPU and memory, shared by all DBs in the pool.
(iii) Managed instance: a set of databases which can be used together.
Azure SQL Database purchasing models:
There are two purchasing models (service tiers): DTU & vCore.
1) Database Transaction Unit (DTU):
DTU stands for Database Transaction Unit and is a combined measure of compute, storage & IO resources. The DTU-based model is not supported for managed instances.
2) vCore:
vCores (virtual cores) provide higher compute, memory and storage limits and give us greater control over the compute and storage resources that we create and pay for.
Implementation steps:
Search for SQL Server in the Azure portal >> Create and deploy the SQL server in Azure.
If we click on SQL databases (left side) we will see there are no DBs available inside this SQL server yet; now we will provision a new DB in the Azure portal.
and for this phone we can now see the ratings, feedback and reviews, and all of this is nothing but data, which helps us decide whether to purchase the phone or not; in the same way it helps other customers decide whether to purchase the product. Even after delivery it sends you mails and reminders to provide ratings and feedback about the product (this is also a type of data that helps us and other customers too).
Hence data plays a key role nowadays, there will be plenty of opportunities in the future as well, and the volume of data is growing enormously these days.
Previously we were dealing with data in KB, MB & GB, but now business or enterprise applications hold data in gigabytes (GB) and terabytes (TB), and down the line, after some years, we may expect this data to grow to
(i) PB >> Petabyte
(ii) EB >> Exabyte
(iii) ZB >> Zettabyte
(iv) YB >> Yottabyte
(v) BB >> Brontobyte
(vi) Geopbyte
Enterprise applications may contain structured, semi-structured and unstructured data, and to process, transform, execute and load this data, Azure cloud computing offers a variety of different services for Azure Data Engineers, such as (i) Blob SA, (ii) ADLS Gen2 SA, (iii) MS SQL DB, (iv) Cosmos DB, (v) ADF, (vi) Azure Databricks, (vii) FTP servers, (viii) JSON, (ix) GitHub portal, (x) Azure DevOps portal, etc.
We process, segregate and execute this data to add value to the business. For example, if you shop frequently on Flipkart or Amazon, they keep sending you new product launches, attractive discounts on special events, and products similar to those you searched for previously; even if you just searched and left the shopping midway, they keep all the records of your browsing history and interests and keep sending promotional events and messages.
If the sales of a product are decreasing quarter-wise or week-wise, then by capturing all the data and records we can analyse it and work out why sales decreased or increased: whether there was less advertising, the sales team was on leave, the market was down, a recession occurred, etc.
These days, if you are a first-time customer of apps like Zomato, Swiggy, RedBus or AbhiBus, they give 50% discounts to first-timers; with this they attract customers, collect our information and also advertise their business applications, adding value. All of this is possible only if we collect data from customers when they visit our web application.
This data is collected from a variety of sources and loaded into cloud computing after performing some transformations and analysis on top of it, and finally loaded into some SA; from there data scientists use Power BI reports, Tableau, and the different visualization tools available in the market.
We receive data from many sources, and that data could be structured, semi-structured or unstructured; to process, transform and load these types of data, Azure offers us a variety of different activities, controls and data flows as part of Azure data engineering.
We can save almost 6x in cost if we process, extract, transform and load the data via Azure cloud resources into Azure storage services, and with this we can improve performance, productivity, time, manpower, cost, etc. with cloud computing.
There are many tools in the market, like SSIS, Informatica, Oracle BI, etc., but the cloud data engineering resources (SA, Gen2 SA, ADF, Databricks, SQL DBs in the cloud, etc.) are beneficial and different when compared to traditional data transformation tools.
Many organizations are now moving their jobs and ETL workloads from SSIS and Informatica to the Azure data engineering resources and services offered by Azure cloud computing:
(i)Cost.
(ii)Productivity.
(iii)Performance.
(iv)Efficiency.
(v)Security(encryption).
(vi)Reliability.
(vii)Easy to use.
(viii)native to many source and destination platforms…. etc. etc.
Azure data engineering services are very cost-effective compared to the other tools available in the market, and because of this too, tech firms and clients prefer Azure data engineering services to process, extract, transform and load their data.
If we look at one of the important Azure data engineering services, Azure Data Factory (ADF): it is serverless, so we do not need to worry about the underlying infrastructure; everything is taken care of by Microsoft and it is offered as a PaaS service. SSIS, Informatica and DataStage ETL jobs are all getting migrated to Azure Data Factory.
With Microsoft Azure we only pay for what we need. For example, if we create an ADF pipeline to move data from source to target, we pay only for the processing time taken to move the data; e.g. if there is 100 GB of data to move from on-prem to the cloud and it takes some 20 minutes to load, then we end up paying Azure (i.e. Microsoft) only for those 20 minutes.
Azure data engineering services can access, process and load data from any resource, from any region, and at any time. There is no need for us to create a VPN from a security point of view, because Microsoft provides default encryption for all the Azure resources we use; we just need access (a linked service) to connect to the different sources.
In ADF we can fetch data from a variety of different sources; the data could be of any size, any format, any shape. We can fetch/extract the data from the source, transform it and load it into destinations.
Based on the business needs we can insert/dump the data into the destination in a different format from the source (e.g. if the data in the source is in Excel format, we can extract, transform and load it in .csv format).
We can even load data from multiple sources into one single destination (a single SQL DB table, one single file, or Parquet format; this varies from business to business).
Microsoft has built in many sources and destinations as datasets native to Azure Data Factory Studio for Azure Data Engineers.
Process & Procedures to Load the Data from Source to Target using ADF:
If we want to move files (file1, file2 & file3, etc.) from our source to the target, we need some compute infrastructure; this compute infrastructure is provided by the integration runtime, and the IR is automatically managed by ADF based on the volume of data that has to move from source to target.
Integration Runtime (IR): it basically provides the compute infrastructure used by ADF; this compute infrastructure covers network, storage, memory, etc., and all these things are taken care of by the integration runtime. By default we get an IR when creating an ADF service. If we want to move data from cloud to cloud, or from a public network to the cloud, we do not need to install any external integration runtime; the default Azure IR is enough.
When performing transformations we are performing operations like joins, unions, select, where, aggregations (max, min, avg, sum, etc.), having, etc.
When the client doesn't want to load the data as-is and needs some transformations to be performed before loading the data into the target, these transformations are helpful.
ADF is a complete ETL (Extract Transform Load) or ELT (Extract Load Transform) tool, in which we can extract, transform and load the data into the destination.
Building blocks of Azure Data Factory::
(ii) We can also move data from cloud to cloud; it might be from AWS or from a public network, and for this we can use the Azure integration runtime to connect to the source and then transfer the data from one place to another.
If the data is available in AWS and we want to move it to the Azure cloud, or move it within the Azure cloud, this can be done with the integration runtime.
This integration runtime concept is very useful, and it provides the compute infrastructure. If we want to move data from a private (on-premises) network, then we must install a piece of software, an executable file (i.e. SHIR >> Self-Hosted Integration Runtime), as shown in the image below.
Azure Data Factory version 1 had a lot of limitations and problems, due to which version 1 was deprecated; the second version of Azure Data Factory became generally available in 2018, and now ADF version 2 is used everywhere.
When we want to move data from on-prem to an Azure cloud DB, we must install the Self-Hosted Integration Runtime (SHIR) on the on-prem server (where our DB is), because the on-prem servers are connected to a private network.
Whatever the data type - unstructured, semi-structured or structured - when we want to move it from the source (on-prem) to the target (Azure cloud DB), we have to install the SHIR on-prem.
When we want to move data from one cloud DB to another, we need linked services.
If we don’t want to move our data to other regions, then here we can
create our own integration runtime and by default will have the Azure
Integration runtime.
In ADF itself we are having a data flows, maybe our source system hosted
on a virtual network and here we can create Azure integration runtime
and we can enable the virtual network option and we can connect the
security to our system (source system)
ADF is completely code free tool, most of the things we can configure
and setup using drag & drop.
If we want to automate our workflow and we want to schedule our
pipeline, then in ADF itself we have a different kind of triggers available.
Here we can see how the components of ADF are clearly dependent on
each other.
Triggers:
o Triggers are used to schedule execution of pipeline.
o Pipelines and triggers have many to many relationships, ex:
multiple triggers can Kick off a single pipeline or a single trigger
can Kick off multiple pipelines.
When to use ADF:
We can use ADF when we are building a big data analytics solution on Microsoft Azure.
We can use ADF when we are building a modern data warehouse solution that relies on technologies such as (i) SQL Server, (ii) SSIS and (iii) SQL Server Analysis Services.
ADF also provides the ability to run SSIS packages on Azure or build a modern ETL/ELT pipeline, letting us access both on-premises and cloud data services.
We can use ADF to migrate or copy data from a physical server to the cloud, or from a non-Azure cloud to Azure (Blob storage, Data Lake storage, SQL, Cosmos DB).
ADF can be used to migrate both structured and binary data.
For any ETL product in the market, what customers currently look for is cost, productivity, performance and security, and ADF provides all of these features.
Compared to other ETL tools, ADF is very effective for building big data analytics solutions on Microsoft Azure and modern data warehouse solutions, with a lot of benefits and features for Azure Data Engineers.
The underlying infrastructure is entirely managed by ADF; even when we are running SSIS packages, the underlying infrastructure is managed by Azure, and this is the reason companies and clients are moving to ADF.
When we don’t want to maintain your underlying infrastructure, and
everything should be managed by Microsoft and we want to move to any
PAAS services then everything is managed by the cloud vendor
(Microsoft Azure)
With ADF we can connect to any public network and we can move the
data to the cloud we can move either structured or unstructured data or
binary data (like audio files, video files, image files, sensor data,
streaming data…etc.) using ADF services.
around 8 minutes with 200 Mbps bandwidth), then ADF will bill us not more than about $12 for the monthly execution.
Cloud scale: ADF, being a PaaS offering, can quickly scale if needed; for big data movement with data sizes from terabytes to petabytes, we will need the scale of multiple nodes to chunk data in parallel.
Enterprise-grade security: the biggest concern around any data integration solution is security, as the data may well contain sensitive personally identifiable information (PII).
High-performance hybrid connectivity: ADF supports more than 90 connectors, including on-premises sources, which helps us build a data integration solution that covers our on-premises sources.
Easy interaction: because ADF supports so many connectors, it is easy to interact with all kinds of technologies.
Visual UI authoring and monitoring tool: it makes us super productive, as we can go with drag-and-drop development. The main goal of the visual tool is to allow us to be productive with ADF by getting pipelines up and running quickly, without requiring us to write a single line of code.
Scheduled pipeline execution: every business has different latency requirements (hourly, daily, weekly, monthly, and so on), and jobs can be scheduled as per the business requirements.
Complete data flow end to end with ADF Process:
Data flow:
Data flows allow data engineers to develop graphical data transformation logic without writing code.
Data flows are executed as activities within Azure Data Factory pipelines, using scaled-out Apache Spark clusters.
Within ADF, integration runtimes (IR) are the compute infrastructure used to provide data integration capabilities such as data flows & data movement. ADF has the following three IR types:
1) Azure integration runtime: All patching, scaling, and maintenance of the
underlying infrastructure are managed by Microsoft, and the IR can only
access the data stores and services in public networks.
2) Self-hosted integration runtime: The infrastructure and hardware are
managed by us, and we will need to address all the patching, scaling and
maintenance, the IR can access the resources in both public and private
networks.
3) Azure-SSIS integration runtimes: VM’s running the SSIS engine allow us
to natively execute SSIS packages. All the patching, scaling and
maintenance are managed by Microsoft, the IR can access resources in
both public & private networks.
Mapping data flows for transformation & aggregation:
Mapping data flows are visually designed data transformations in Azure Data Factory. They allow data engineers to develop data transformation logic without writing code; the resulting data flows are executed as activities within ADF pipelines that use scaled-out Apache Spark clusters.
There are three different cluster types available in mapping data flows:
General purpose: use the default general-purpose cluster when we intend to balance performance and cost; this cluster is ideal for most data flow workloads.
Memory optimized: use the more costly, per-core memory-optimized clusters if our data flow has many joins and lookups, since they can store more data in memory and will minimize any out-of-memory errors we may get. If we experience out-of-memory errors when executing data flows, switch to a memory-optimized Azure IR configuration.
In real-time projects we create a separate storage account for each environment:
for the Test environment's data we create a separate SA,
for the Dev environment's data we create a separate SA,
for the Prod environment's data we create a separate SA, etc.
We use Azure Storage Explorer to configure all the different environments' SAs in one place, to avoid going to the Azure portal and opening each SA every time for the different environments.
When we don't want data moved to specific regions, we can do this configuration at the time of defining the integration runtime, mentioning the regions to which data should not be transferred; if an Azure Data Engineer tries to do so, it should fail automatically.
To securely connect to the storage account, we can use the REST API.
There is no practical restriction on loading data into a storage account; we can load 100 GB, 200 GB, 1 TB or 5 TB, etc.
Create a Data Lake Gen2 storage account: while creating the SA, in the Advanced tab under Data Lake Storage Gen2, check the checkbox Enable hierarchical namespace (checking this checkbox means we are creating a Data Lake Gen2 SA; the only difference between a blob SA and a Data Lake Gen2 SA is this checkbox). Data Lake Storage Gen2 accelerates big data analytics workloads; if we want to perform any analytics on top of the data, we should load the data into Data Lake Storage Gen2.
The main difference between blob and Data Lake Storage Gen2 is that a data lake stores the data in a hierarchical folder structure (Year/Month/Week/Day/Hour, etc.).
When we want to perform analytics on data, we should load it into Data Lake Storage Gen2. Blob storage is not meant for big data analytics; Azure Data Lake Storage (ADLS) Gen2 is meant for big data analytics. We can also apply access control lists up to five levels in ADLS Gen2, but not on blob storage.
The blob storage service supports only object-based storage, while ADLS Gen2 supports both file- and object-based storage.
ADLS Gen2 has stronger security controls compared to the blob storage service.
In the Advanced tab of an Azure storage account, when Enable hierarchical namespace is checked it is ADLS Gen2, and when it is not checked it is the blob storage service.
Implementation steps for copying data from a Blob SA to an ADLS Gen2 SA using ADF:
Implementation steps for copying a zip file from a Blob SA to an ADLS Gen2 SA using ADF:
Step 1: Create a Blob SA (1925blbsa >> as source) >> create a container/folder in it and upload the zip folder below.
If we want to copy data from one SA to another and both SAs are Blob SAs, then there is no need to create 2 separate linked services.
If we want to copy files which are in the form of videos, clips, reels, etc. - basically unstructured data - then we use the binary file format.
Implementation steps to perform the Get Metadata activity in Azure Data Factory (ADF):
Step 1: Create a blob storage account and a blob container inside it, and place/upload the .csv file below.
Step 2: Create an ADLS Gen2 storage account and a container inside it.
Step 3: Create the ADF >> Launch ADF Studio >> Author >> Pipelines >> click the 3 dots of Pipelines and select New pipeline >> Name: DynamicPipeline >> in the Activities pane (at the center) type Get Metadata >> drag and drop the Get Metadata control to the center >> click on the Settings tab (below the center) >> + New >> search blob storage >> click on Azure Blob Storage >> Continue >> choose DelimitedText (csv) file >> Continue >> Name: DS_Input >> click on Linked service >> + New >> Name: LS_BlbStorage >> select the Azure subscription & blob storage account (carefully) >> Test connection >> Create >> click on the folder icon >> myblobcon >> select the file here >> OK >> ensure the First row as header checkbox is selected >> Import schema: None >> OK
Step 4: In the Settings tab, for Field list click on + New >> click the dropdown box each time, click on + New each time, and select the options below, one per box:
Column count
Content MD5 (Message Digest)
Item name
Item type
Exists
Last modified
Size
Structure
Step 5: Click on Publish all (at the top) >> Publish >> click on Debug >> wait some 5 minutes until it is deployed, and in the Output tab see the arrows for inputs and outputs (as shown below).
Metadata means when the file was created, on which date and time it was modified, what the file size is, and what the file type is (Excel, csv, notepad, etc.); the header columns are called the schema (a code sketch of reading the same properties follows).
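The same properties that Get Metadata returns can be read directly with the blob SDK. A minimal sketch; the connection string, container and blob names are placeholders:

```python
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
blob = service.get_blob_client(container="myblobcon", blob="input.csv")
props = blob.get_blob_properties()
print(props.size)                              # Size
print(props.last_modified)                     # Last modified
print(props.content_settings.content_type)     # Item type
print(props.content_settings.content_md5)      # Content MD5
```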
Implementation steps to perform the Validation & If Condition activities in Azure Data Factory (ADF):
Step 1: Continue in the same ADF and in the same pipeline as above >> in the Activities pane (at the center) type Validation >> drag and drop the Validation control to the center >> Name: Validate if file exists >> click on the Settings tab >> Dataset: click the knob and choose DS_Link (this dataset was created in the demo above, so that demo is related to this one) >> Timeout: 7.00:00:00 (means this control will keep validating whether the file is there or not for up to 7 days) >> Sleep: 30 (means it checks every 30 seconds) >> Minimum size: 10 bytes (means if the file size is less than 10 bytes, don't pick the file) >> Now establish a connection between the Validation control and the Get Metadata control (created in the demo above) by dragging the green line from the Validation control to the Get Metadata control >> Publish all (this should succeed) >> Debug (this should succeed) >> wait some 1-2 minutes (depending on the file size); the Validation control then checks whether the file exists in the blbsa (as this is the source), and if yes, the Get Metadata control gives us the metadata details of the file.
Step 2: Now remove the .csv file from the blbsa (as this is our source) >> Publish >> Debug the pipeline >> now we should get a Status of Timed out, because we removed the .csv file from the source and the run times out according to the timeout we set for the Validation control in the Settings tab.
Step 3: Search for the If Condition control (generally we use the If Condition control to check whether the file really exists and whether the file content is as expected) in Activities and drag and drop it >> click on the If Condition control >> in the General tab give the name accordingly >> click on the Activities tab >> click on the Expression box >> Add dynamic content; a window opens on the right side, and below it we can see all the expression options >> make a single click on the Get Metadata control's column count and then add the @equals method as shown below in the expression box >> after writing the expression as shown below, finally click on OK (at the bottom).
Step 5: Click on the Sink tab >> + New >> in the search type gen2 and select Azure Data Lake Storage Gen2 >> Continue >> Delimited text >> Continue >> Name: DS_Output >> Linked service: + New >> Name: LS_Adlstorage >> Azure subscription: choose accordingly >> Storage account name: 1963adlsa (we created this SA at the top and gave it this name; choose the storage account according to whatever name you gave) >> Create
Step 6: Click on the folder icon (under File path) >> myconadl (we created this container with this name, so that is what we select here) >> OK >> ensure the First row as header box is selected >> Import schema: None >> OK >> now, still in the Sink tab, scroll down and change the File extension: .csv >> Publish >> click on the pipeline
Step 7: Click on Publish (this should publish successfully) >> Debug >> now all the controls placed in the pipeline should succeed, and we can see the file copied from the source blob storage account to the ADLS storage account.
Implementation steps to perform multiple activities or controls in ADF pipelines:
Step 1: Create a blob storage account (1959blbsa) and a blob container inside it, and place/upload multiple .csv files as well as zip folders/files.
Step 7: Come to the pipeline >> in Activities search Get Metadata and drag and drop this activity/control into the pipeline from the Activities pane >> in the General tab give Name: Get Metadata >> click the Settings tab >> for Dataset click the knob and choose DS_Inputfiles >> Field list: + New >> from the dropdown box select Child items (as shown in the image below) >> Publish and Debug >> after this completes, check the input and output.
Step 8: Now in the Activities pane search for Filter >> drag and drop the Filter activity/control into the pipeline >> establish a connection with the green line extension >> in the General tab give Name: MyFilter >> click on the Settings tab >> click on the Items box >> Add dynamic content >> a window opens on the right side; in it select Get Metadata childitems >> OK >> click on the Condition box >> Add dynamic content >> a window opens on the right side; in it select the Activity outputs tab >> write expression 1) below in the pipeline expression builder box >> and then finally click on OK
1) @endswith(item().name,'.csv') >> this expression picks all the files that have a .csv extension
2) @startswith(item().name,'Sales_') >> this expression picks all the files whose name starts with Sales_
Step 9: Now click on Publish and Debug; if we look at the input and output of the Filter activity (as shown in the image below) after Debug succeeds, we will see in the output that only files with the .csv extension are picked to copy from the source to the target, while in our source storage account we kept files with both .zip and .csv extensions.
Step 10: Now drag and drop the ForEach control/activity from the Activities pane into the pipeline >> establish a connection (by dragging the green line) from the Filter control to the ForEach control >> click on the ForEach activity >> in the General tab pass Name: ForEachFile >> in the Settings tab >> click on the Items box >> Add dynamic content >> in the Activity outputs tab >> click on MyFilter (the name we gave above to the Filter control) and type the expression below in the pipeline expression builder box >> OK
@activity('MyFilter').output.value
Sequential: when we are trying to process a single file which is, say, 50 GB and there is no parallel processing, we tick Sequential; we can then process the 50 GB file easily and load it into the target. Here the files are loaded one by one in sequential order; up to 3 or 4 files we can go with this Sequential option.
Batch count: when we have a number of files (like 40-50 files or more, of around 100 MB or 200 MB each, etc.) we use Batch count; the batch count does parallel processing of the files, and parallel processing always improves performance. The default value for Batch count is 20 (a conceptual sketch of sequential vs parallel copying follows).
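This is not the ADF implementation, just a conceptual sketch of the difference between Sequential and Batch count: a plain loop versus a thread pool that copies several files at once. The connection string and container names are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
src = service.get_container_client("sourcecon")
dst = service.get_container_client("targetcon")

def copy_one(name: str) -> None:
    data = src.download_blob(name).readall()
    dst.upload_blob(name, data, overwrite=True)

files = [b.name for b in src.list_blobs() if b.name.endswith(".csv")]

# Sequential: one file after another (fine for a handful of large files)
for name in files:
    copy_one(name)

# "Batch count": parallel workers, like ADF's default batch count of 20
with ThreadPoolExecutor(max_workers=20) as pool:
    pool.map(copy_one, files)
```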
Step 11: In the ForEach control click on the pencil icon (as shown below) and in the Activities pane search for the Copy data control and drag and drop it into the pipeline >> in the General tab give Name: CopyDataFromSourceToTarget >> in the Source tab click on + New >> in the search box type blob storage >> click on it >> Continue >> DelimitedText >> OK >> Name: DS_InputForMultiFiles >> Linked service: LS_Source (created above) >> for File path click on the folder icon >> select the container (created above) >> OK >> ensure the First row as header checkbox >> Import schema: None >> OK
Step 12: Now, still in the Source tab, click on Open under Source dataset >> Parameters >> + New >> Name: SourceFiles >> click on the Connection tab >> click on the File name box, then click on Add dynamic content >> click on SourceFiles, and an expression appears in the pipeline expression builder >> OK
Step 13: In the Source tab click on the SourceFiles textbox >> Add dynamic content >> click on ForEach item >> and type the expression below in the pipeline expression builder > OK
@item().name
Note: instead of hard-coding values, we can pass parameters for datasets, linked services & at the pipeline level; once we create a parameter, we cannot modify it during the run (unlike variables; a sketch of supplying parameter values per run follows below).
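Once the pipeline exposes parameters, callers supply values per run. A hedged sketch with azure-mgmt-datafactory; the subscription, resource group, factory and pipeline names (PL_CopyFromHttp) are placeholders, not names from this material:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
run = adf.pipelines.create_run(
    resource_group_name="my-rg",
    factory_name="my-adf",
    pipeline_name="PL_CopyFromHttp",          # hypothetical pipeline name
    parameters={                              # values injected at run time
        "SourceBaseURL": "https://raw.githubusercontent.com/",
        "SourceRelativeURL": "<relative-path-to-file>.csv",
    },
)
print(run.run_id)                             # track this run in Monitor
```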
Linked service for the source:
Step 5:
Launch ADF Studio >> Manage >> Linked services >> + New >> in the search type HTTP >> click on HTTP >> Continue >> Name: LS_HTTP >> Authentication type: Anonymous >> expand the parameters (in the same window, just scroll down
SourceRelativeURL: suresh12345/AzureDataEngineering_Batch/main/ecdc_data/country_response.csv
Now, from the URL path above, there are many files; we can pass the SourceBaseURL & SourceRelativeURL accordingly (as mentioned above) and load the data of any file from GitHub into our Azure storage services in the cloud.
Hence here we dynamically pass the URL values (i.e. GitHub links) of the GitHub account from which we directly load the data into our Azure Data Lake Gen2 storage service (a plain-Python sketch of the same flow follows).
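What the HTTP linked service plus copy activity do can be sketched in plain Python: fetch base URL + relative URL, then land the bytes in Data Lake Gen2. The URLs reuse the example above; the ADLS account name, key and container are placeholders.

```python
import requests
from azure.storage.filedatalake import DataLakeServiceClient

base_url = "https://raw.githubusercontent.com/"
relative_url = "suresh12345/AzureDataEngineering_Batch/main/ecdc_data/country_response.csv"
payload = requests.get(base_url + relative_url).content       # extract over HTTP

service = DataLakeServiceClient(
    account_url="https://1963adlsa.dfs.core.windows.net",     # assumed ADLS account
    credential="<account-key>",
)
fs = service.get_file_system_client("myconadl")               # assumed container
file_client = fs.get_file_client("ecdc/country_response.csv")
file_client.upload_data(payload, overwrite=True)              # load into ADLS Gen2
```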
Allocating variables to ADF pipelines:
When we create variables in an ADF pipeline we can modify them whenever we want, whereas parameters cannot be modified once created.
Step 12: Now delete the parameters of the pipeline >> go to the pipeline >> Parameters tab >> select the 2 parameters, i.e. SourceBaseURL & SourceRelativeURL >> click Delete at the top
Step 13:
Click on the Variables tab >> + New >> SourceBaseURL >> click on + New again >> SourceRelativeURL >> in the same window, in the Default value box, pass the URL values for SourceBaseURL & SourceRelativeURL
SourceBaseURL: https://raw.githubusercontent.com/
SourceRelativeURL: suresh12345/AzureDataEngineering_Batch/main/ecdc_data/hospital_admissions.csv
Step 14:
Click on the Copy data control >> click on the Source tab >> click the BaseURL box >> a window opens on the right side >> remove the old expression >> click on the Variables tab in the newly opened window >> click on SourceBaseURL >> OK >> click the RelativeURL box >> a window opens on the right side >> remove the old expression >> click on the Variables tab in the newly opened window >> click on SourceRelativeURL >> OK
Note: we have passed values to the variables, but if we sometimes want to overwrite the variable values, we can use the Set variable control.
Step 15: In the Activities pane search for Set variable >> drag and drop this control into the ADF pipeline before the Copy control >> establish a connection with the green line >> click on the Set variable control >> in the General tab give Name: SetVariable >> click on the Settings tab >> Variable type: Pipeline variable >> Name: SourceRelativeURL >> and pass the value as
Value: suresh12345/AzureDataEngineering_Batch/main/ecdc_data/testing.csv
Step 16: Publish the pipeline >> Debug; now if we look, the variable value that we passed for the pipeline has been overwritten by the value that we passed in the Set variable control.
Hence, like this, we can overwrite the value of a variable using the Set variable control within ADF pipelines in ADF Studio.
Creating dynamic pipelines with the Lookup activity to copy multiple files' data into ADLS Gen2:
The Lookup activity can retrieve a dataset from any of the data sources supported by Data Factory and Synapse pipelines. We can use it to dynamically determine which objects to operate on in a subsequent activity, instead of hard-coding the object name. Some object examples are files and tables.
The Lookup activity can be used to read a config file, a single row, or a config table; we use the Lookup activity to retrieve data from multiple sources, and it can read and return the contents of a configuration file or table.
From the screen above we can understand that we are reading the data from 2 different files, i.e.: 1) cases_deaths.csv & 2) hospital_admissions.csv.
Alternatively, take the file below and upload it directly into the source storage account container (first put it on your desktop and then upload it into the SA container).
Step 2:
Create a storage account (e.g. 1964blbsaconfig; this is a blob storage account, not an ADLS Gen2 storage account) >> create a container inside the storage account named config >> go inside the config folder and upload the config file (ecdc_file_list_for_2_files.json) which we prepared in the step above, from the path below (here we have saved all these files in Downloads):
C:\Users\wasay\Downloads\AzureDataEngineering_Batch-main\
AzureDataEngineering_Batch-main\config\section5
So whenever we want to modify or add an extra file, there is no need to touch the ADF pipeline; we can go directly to the config file and add the new file's details, just as we added them for the 2 files above.
Creating ADF & Linked service for the storage account:
Step 3:
Create ADF>>Launch ADF studios>>click on manage ikon(left side)>>Linked
services>>+New>>in search type blob storage>>click on Azure blob
storage>>continue>>Name: LS_BlbSA>>select the subscription & Storage
account carefully>>test connection>>create
Create a Dataset for StorageAccount:
Step 4: Create a Dataset>>New Dataset>>in search type blob storage>>click on
Azure blob storage>>continue>>Json>>continue>>Name:
DS_BlbJsonconfig>>Linked Service: LS_BlbSA>>File path: click on folder
ikon>>click on config>>click on ecdc_file_list_for_2_files.json>>ok>>Import
schema: None>>ok
Step 5: Create a new pipeline>>Name: PL_Dynamicpipeline>>in the Activities pane search for the Lookup activity>>drag and drop the Lookup activity from the Activities pane onto the pipeline canvas>>Click the Lookup control and in the General tab>>Name: GetConfigfilesFromBlbSA>>click on the Settings tab>>click on the Source dataset box knob>>and click on DS_BlbJsonconfig>>make sure to uncheck the First row only checkbox
Step6:
Now publish the pipeline>>Debug>>If we click on the output and look at the window that opens, the count will be 2 because we have mentioned 2 files in the config file in the blob storage account.
Hence, the Lookup control reads the files based upon the number of files we pass in the config file.
Copying data from multiple files:
Step1:
Create a Blob Storage Account(1971blbsa)>>create a container(myconfig)
inside the SA>>upload multiple .csv files inside the container as shown in image
below>>Get the below .csv files from below path in our laptop C:\Users\wasay\
Downloads\AzureDataEngineering_Batch-main\AzureDataEngineering_Batch-
main\ecdc_data
Step2:
Create an Azure Datalake storage Gen2 Storage account(1974adlsa)>>create
container(mydestconfig) inside the SA and this is our target SA.
Step3:
Create ADF>>Launch ADF studios>>click on the Author icon (left side)>>pipeline>>new pipeline>>Name: PL_LoadAllFiles>>In the Activities pane search copy data>>click on the General tab>>Name: LoadMultipleFiles>>click on the Source tab>>+New>>in search type blob storage>>click on Azure blob storage>>Continue>>Delimited text>>Continue>>Name: ds_inputfiles01>>Linked service: LS_BlbSA>>For File path: click on the folder
And if we look at the above image, we have mentioned a .zip file, so only that one zip file will get copied from the source SA to the destination SA.
Note2: If we want to increase the processing power (DIUs) of the ADF pipeline, select the Copy control>>Settings tab>>for Maximum data integration unit pass 8, 16, 32 or whatever value we want; as we increase the value, the processing power/performance increases and the files get copied from the source SA to the destination SA more quickly. (By default it is Auto, meaning the number of DIUs is scaled automatically based upon the volume of files being copied from the source to the destination Storage Account.)
Copying the files from GitHub dynamically with the use of dynamic parameter allocation (automation process):
Step1:
Create a storage account (e.g. 1964blbsaconfig; this is a blob storage account, not an ADL Gen2 storage account)>>create a container inside the storage account named config>>go inside the config container and upload the below .json config file.
So whenever we want to modify or add an extra file, there is no need to touch the ADF pipelines; we can go directly to the config file and add the new file details, just like we added the above 2 files.
Step2:
Create an ADL Gen2 storage account, create one container inside it, and also create one folder (mypracticedata) inside the container.
Creating ADF & Linked services for the storage accounts:
Step 3:
Create ADF>>Launch ADF studios>>click on the Manage icon (left side)>>Linked services>>+New>>in search type blob storage>>click on Azure blob storage>>continue>>Name: LS_BlbSA>>select the subscription & Storage account carefully>>test connection>>create
Step4:
Create a linked service (LS_ADLSGen2Connection) for ADL Gen2 Storage the same way as in the above step, but this one is for the destination storage account.
Create a Dataset for the Lookup activity:
Step 5: Create a Dataset>>New Dataset>>in search type blob storage>>click on Azure blob storage>>continue>>Json>>continue>>Name: DS_BlbJsonconfig>>Linked Service: LS_BlbSA>>File path: click on the folder
@item().sinkFileName
Step15:
Publish>>Debug>>Now, looking at the config file in the blob SA, the files whose names we mentioned there get copied to our destination storage account, and here we have copied the files dynamically with the help of parameters.
Below is the GitHub URL which contains multiple .csv files (as source):
Now if we want to copy multiple files, like 3, 4, 5, 6, ..., n files, there is no need to touch the ADF pipeline or any of the activities/controls we have designed; we can directly add the baseURL, relativeURL & file names in the config file present in the source storage account, run the ADF pipeline, and all the .csv files we mentioned in the config file will get copied to our ADL SA.
To add the entries to the config file in the source SA>>go inside the blob source SA>>container>>click on the config file>>click on Edit and add the entries as shown in the above image.
Come to the pipeline>>Publish (if required)>>Debug>>and here we will see that all four files we mentioned in the .json config file have been copied to the destination storage account (ADL SA).
Hence, we have shown that we can dynamically load or copy the data (files) from the source to the destination storage account without touching the ADF pipelines again and again.
Note: Click on the ForEach activity>>Settings tab>>Batch count: 2 (then it will process 2 files at a time; if we pass 4 it will process 4 files at a time; whatever number we pass, it will process that many files in parallel while copying from source to target, and the default batch count, if we don't mention anything, is 20).
Triggers: Basically, we have 3 types of triggers for ADF pipelines, i.e.:
1) Schedule-based triggers
2) Tumbling window triggers
3) Event-based triggers
So whenever we want to modify or add an extra file, there is no need to touch the ADF pipelines; we can go directly to the config file and add the new file details, just like we added the above 2 files.
Step2:
Create an ADL Gen2 storage account, create one container inside it, and also create one folder (mypracticedata) inside the container.
Creating ADF & Linked services for the storage accounts:
Step 3:
Create ADF>>Launch ADF studios>>click on the Manage icon (left side)>>Linked services>>+New>>in search type blob storage>>click on Azure blob storage>>continue>>Name: LS_BlbSA>>select the subscription & Storage account carefully>>test connection>>create
Step4:
Create a linked service (LS_ADLSGen2Connection) for ADL Gen2 Storage the same way as in the above step, but this one is for the destination storage account.
Create a Dataset for the Lookup activity:
Step 5: Create a Dataset>>New Dataset>>in search type blob storage>>click on Azure blob storage>>continue>>Json>>continue>>Name: DS_BlbJsonconfig>>Linked Service: LS_BlbSA>>File path: click on the folder icon>>click on config>>click on ecdc_file_list_for_2_files.json>>ok>>Import schema: None>>ok>>Publish
@item().sinkFileName
Step 17: Publish>>Click on the Monitor icon (left side). This time we won't click on Debug manually (i.e. run the pipeline ourselves); it should get executed by itself because we have set up a schedule trigger. Click on Refresh (top center) and we can see the trigger being executed automatically by itself, without us debugging it manually.
Note: Every time the pipeline gets triggered, Microsoft will charge us, so we can stop this trigger>>click the Manage icon (left side)>>click on the Stop button (middle, as shown in the below image); once we stop the trigger we can see its status as Stopped, and finally click on Publish to save the changes.
Copying the data from Azure SQL DB to ADL Gen2 Storage Account:
With Azure SQL Database, we can create a highly available and high-
performance data storage layer for the applications and solutions in
Azure. SQL Database can be the right choice for a variety of modern cloud
applications because it enables us to process both relational data
and nonrelational structures, such as graphs, JSON, spatial, and XML.
Azure SQL Database is based on the latest stable version of the Microsoft
SQL Server database engine. we can use advanced query processing features,
such as high-performance in-memory technologies and intelligent query processing. In
fact, the newest capabilities of SQL Server are released first to Azure SQL
Database, and then to SQL Server itself. You get the newest SQL Server
capabilities with no overhead for patching or upgrading, tested across
millions of databases.
Step3: Launch SSMS on your laptop>>pass the details as shown below and click on Connect; this logs us in to the Azure DB server which we created in the Azure cloud portal.
Step4:
Create a blob Storage Account>>create container inside the SA and upload
multiple .csv files inside the container.
Step5:
Create ADF>>Launch ADF studios>>Manage(left side)>>+New>>in search type
blob>>click on Azure blob storage>>continue>>Name: LS_FilesToSqlDB>>select
the subscription and SA carefully>>test connection>>Create
Step6:
click on Author>>Pipeline>>New pipeline>>Name: PL_BLOB_TO_SQLDB>>in
activities pane search for copydata activity and drag and drop on pipeline
canvas>>click on General tab>>Name: Copy_data_from_blob_to_SqlDB>>click
on Source tab>>+New>>In search type blob>>click on Azure blob
storage>>Continue>>Delimited text>>Continue>>Name:
ds_inputdataset>>Linked service: LS_FilesToSqlDB>>For file path: click on the folder icon>>click on the folder (myblobcon) and don't select any one particular file here, just go inside the folder>>say ok>>ensure to check the box for First row as header>>Import schema: None>>Ok
Step7:
In the Source tab, click on Open (which takes us inside the dataset)>>click on Browse>>click on the container (the one we created inside the SA)>>click on any one .csv file (that we have uploaded in our blob SA) of our choice>>ok
Step8:
Come to pipeline>>Click on Sink tab>>+New>>in search type sql>>click on
Azure SQL Database>>Continue>>Name: ds_SQLConnection>>Linked service:
+New>>Name: LS_CsvFilesToSqlDB>>Select the subscription, DB server name
whatever we gave at the step1, and DB carefully>>Authentication Type: SQL
authentication>>Username: Gareth>>Password:
Shaikpet@123>>test connection>>Create>>Table Name: None>>check the Edit checkbox>>Import schema: None>>Ok
Step9:
Go to the pipeline>>click on the Copy Data activity/control>>click on the Sink tab>>Open>>check the Edit checkbox first>>in the schema name text box: dbo>>in the table name text box: Product (we can give any name; this is going to be our table in the SQL DB in the Azure portal).
Step10:
Go to the pipeline>>click on the Copy Data activity/control>>click on the Sink tab>>For Table option: Auto create table (with this it will automatically create the table for us in the target, i.e. the SQL DB)>>Publish>>Debug
Step11:
Now go to the SQL DB in the Azure portal>>click on Query editor (left side)>>log in with the username and password>>expand the Tables folder, and here we will find a table created with the name we passed, along with the data. Just run the below query to verify the data in the SQL DB:
Select * from [dbo].[Product]
Hence, here we have uploaded the data from a .csv file into a SQL DB table using ADF pipelines and the Copy activity.
Implementation steps for loading the data from the blob SA into multiple SQL DB tables:
Step1:
Create an SQL DB in Azure portal as per the above demo process and
procedure.
Step2:
Create a blob Storage Account>>create container inside the SA and upload
multiple .csv files and excel files inside the container.
Step3:
Create ADF>>Launch ADF studios>>Manage(left side)>>+New>>in search type
blob>>click on Azure blob storage>>continue>>Name: LS_FilesToSqlDB>>select
the subscription and SA carefully>>test connection>>Create.
Step4:
Launch ADF>>Author>>Dataset>>New Dataset>>In search type blob>>Azure blob storage>>continue>>delimited>>continue>>Name: ds_blbfiles>>Linked service: LS_FilesToSqlDB>>For file path click on the folder icon>>click on just the container (don't choose any specific file here)>>Ok>>check the First row as header checkbox>>Import schema: None>>ok
Step5:
Click on Author>>Dataset>>click on the 3 dots>>New Dataset>>in search type blob>>Click on Azure blob storage>>continue>>Delimited>>continue>>Name: ds_blbfilestosql>>Linked service: LS_FilesToSqlDB>>For File path click on the folder icon>>click on just the container (the one we created inside the SA; don't select any one particular file)>>Ok>>check the First row as header checkbox>>Import schema: None>>ok>>Publish all
Step6:
click on Author>>Pipeline>>New pipeline>>Name: PL_BLOB_TO_SQLDB>>in
activities pane search for Get Metadata activity/control and drag and drop in
pipeline canvas>>In General tab>>Name: Get Files>>click on settings tab>>for
Dataset: ds_blbfiles
Step7:
Now click on Publish All>>Debug, and here we will see the output for the 2 controls/activities, i.e. the Get Metadata control and the Filter control.
Step9:
Now search for ForEach control and drop in pipeline canvas after Filter
control>>establish a connection with green line b/w Filter control and ForEach
control>>Click on ForEach activity>>in General tab>>Name: ForEachFile>>Click
on settings tab>>click on items text box>>add dynamic content>>and click on
Filter CSV Files(Filter CSV Files activity output) and add .value as shown below
>>click Ok
@activity('Filter CSV Files').output.value
Step10:
Now double click on ForEach activity(to go inside the ForEach activity)>>in
Activities pane search for copy data control/activity and paste it in pipeline
canvas>>In General tab>>Name: CopyFilesFromBlbStorageToSQLDB>>click on
source tab>>For Sourcedataset box choose ds_blbfilestosql>>click on open(to
go inside the dataset)>>Click on parameters tab>>+New>>Name:
SourceFiles>>click on connections tab>>click on Filename text box>>add
dynamic content>>Just click once on SourceFiles then an expression will get
printed on Pipeline Expression Builder>>Ok
Step11:
Click on copyData activity>>Source tab>>click on SourceFile text box>>Add
dynamic content>>click on ForEachFile and an expression will get printed and
add .name as shown below>>click on Ok
@item().name
Step12:
Click on CopyControl>>Sink tab>>+New>>in search type SQL>>click on Azure
SQL Database>>continue>>Name: ds_sqltables>>click on Linked service
box>>+New>>Name: LS_Filesinsqltables>>select subscription, sql server name,
DB name, Username, Password accordingly and then click on test
connection>>click create>>click on Edit check box>>Import schema as
None>>and click on Ok finally.
Step13:
Click on Open(to go inside the dataset) in Sink tab>>Click on parameters tab>>
+New>>Name: TableName>>click back on connection tab>>click on edit check
box>>schema name text box type dbo and for table name text box click Add
dynamic content>>click TableName the expression will get printed in pipeline
expression builder>>Ok
Step14:
Come to pipeline at copy control>>in Sink tab>>click on TableName text
box>>Add dynamic content>>Click on ForEachFile then expression will get
printed as shown below and add .name>>Ok.
@item().name
Step15:
Come to copydata activity>>click on sink tab>>For Table option choose Auto
create table radio button>>Publish All>>Debug>>Now go inside the SQLDB in
azure portal click on Query Editor>>pass the creds>>and here we can see all
the tables got created along with data in it. We may see the same in SSMS in
our laptop.
Step16:
If we look in the SQL DB, all the tables have been created and the data has been loaded into them, but all the table names have a .csv extension. If we want to remove the .csv extension from each table, log in to SSMS and delete all the tables with the below SQL command:
Drop Table table_name1,table_name2,table_name3,table_name4;
Step17:
come to copy activity>>sink tab>>click on TableName text box>>Add dynamic
content>>and put the below expression in pipeline expression builder>>and
click ok finally>>Publish All>>Debug
@replace(item().name,'.csv','')
Hence, now all the tables are created again in our DB without the .csv extension, with all the data inside them.
Implementation steps to copy the data from SQLDB to ADL Gen2 Storage
Account:
Step1:
Create an SQL DB in Azure portal as per the above demo process and
procedure.
Step2:
Create a blob Storage Account>>create container inside the SA and upload
multiple .csv files and excel files inside the container.
Step3:
Create ADF>>Launch ADF studios>>Manage(left side)>>+New>>in search type
blob>>click on Azure blob storage>>continue>>Name: LS_FilesToSqlDB>>select
the subscription and SA carefully>>test connection>>Create.
Step4:
Launch ADF>>Author>>Dataset>>New Dataset>>In search type blob>>Azure
blob storage>>continue>>delimited>>continue>>Name: ds_blbfiles>>Linked
service: LS_FilesToSqlDB>>For file path click on the folder icon>>click on just the container (don't choose any specific file here)>>Ok>>check the First row as header checkbox>>Import schema: None>>ok
Step5:
Now click on Publish All>>Debug, and here we will see the output for the 2 controls/activities, i.e. the Get Metadata control and the Filter control.
Step9:
Now search for the ForEach control and drop it onto the pipeline canvas after the Filter control>>establish a connection with the green line between the Filter control and the ForEach
Step11:
Click on copyData activity>>Source tab>>click on SourceFile text box>>Add
dynamic content>>click on ForEachFile and an expression will get printed and
add .name as shown below>>click on Ok
@item().name
Step12:
Click on CopyControl>>Sink tab>>+New>>in search type SQL>>click on Azure
SQL Database>>continue>>Name: ds_sqltables>>click on Linked service
box>>+New>>Name: LS_Filesinsqltables>>select subscription, sql server name,
DB name, Username, Password accordingly and then click on test
connection>>click create>>click on Edit check box>>Import schema as
None>>and click on Ok finally.
Step13:
Click on Open(to go inside the dataset) in Sink tab>>Click on parameters tab>>
+New>>Name: TableName>>click back on connection tab>>click on edit check
box>>schema name text box type dbo and for table name text box click Add
dynamic content>>click TableName the expression will get printed in pipeline
expression builder>>Ok
Step14:
Come to pipeline at copy control>>in Sink tab>>click on TableName text
box>>Add dynamic content>>Click on ForEachFile then expression will get
printed as shown below and add .name>>Ok.
@item().name
Step15:
Come to the Copy Data activity>>click on the Sink tab>>For Table option choose the Auto create table radio button>>still in the Sink tab, click on the TableName text box>>Add dynamic content>>put the below expression in the pipeline expression builder>>click OK finally>>Publish All>>Debug
@replace(item().name,'.csv','')
Now all the tables are created in our DB with all the data inside the tables.
Step16:
Create ADL Gen2 Storage account(ex: sqltoadlsa) and container inside the SA.
Step17:
Create a new pipeline>>Name: PL_SQLDB_TOADL>>drag and drop lookup
activity>>in the General tab>>Name: GetTables>>in the Settings tab>>+New>>in search type SQL (because we are pulling the data from the SQL DB)>>continue>>Name: ds_inputtables>>Linked service: LS_Filesinsqltables (this linked service we created in step 12)>>check the Edit checkbox>>Import schema: None>>Ok
Step18:
Create a new Dataset(ex: ds_ADLSGen2)>>in search type Gen2>>click on Azure
Data Lake Storage Gen2>>Continue>>DelimitedText>>Continue>>Name:
ds_ADLSGen2>>Linked service: +New>>Name: LS_ADLSGen2>>select the
subscription, Storage Account(sqltoadlsa) carefully>>test connections>>create
Step19:
Still in the Settings tab>>uncheck the First row only checkbox>>for Use query click on the Query radio button, then a box will appear>>click on that box>>click on Add dynamic content>>and paste the below query:
SELECT * FROM database_name.INFORMATION_SCHEMA.TABLES WHERE table_type = 'BASE TABLE'
Note: In the above query, for database_name pass the SQL DB name that we gave at the time of DB creation in the Azure portal.
Step20:
Double click on ForEach activity>>drag and drop copy activity/control>>In
General tab>>Name: CopyDataFromSQLDBToADLSA>>In Source
tab>>+New>>in search type sql>>click on Azure SQL
Database>>Continue>>Name: ds_inputsqltables>>Ok>>click on open(inside
source tab only)>>click the Edit checkbox>>click on parameters
tab>>+New>>For Name text box pass Table_Schema>>click again on
+New>>For Name text box pass Table_Name
Step21:
Come to pipeline>>click on copy activity>>in Source tab>>For Source dataset
text box drop down choose ds_inputsqltables dataset(this is just a refresh we
are doing to populate the parameters)>>click on Table_schema txt box>>Add
dynamic content>>click on ForEachTable and add .Table_SCHEMA as shown
below and say ok
@item().TABLE_SCHEMA
Step22:
Come to Copy Data activity>>Now click on Sink tab>>for Sink dataset choose
ds_ADLSGen2>>click on +New(in sink tab only)>>in search type gen2>>click on
Azure Data Lake Storage Gen2>>Continue>>Delimited
text>>continue>>Name: ds_outputfiles>>Linked service: LS_ADLSGen2>>click on the folder icon>>click on the container>>in the Directory box pass OutputTables>>check First row as header>>Import schema: None>>Ok>>Now click on Open (in the Sink tab only)>>click on the Parameters tab>>+New>>Name: Filename>>click back on the Connection tab>>click on the filename text box>>Add dynamic content>>click on Filename>>ok
Step23:
Come to the Copy activity>>Sink tab>>for Sink dataset carefully choose ds_outputfiles>>click on the Filename text box>>Add dynamic content and paste the below expression in the pipeline expression builder and click OK finally.
@concat(item().TABLE_SCHEMA,'_',item().TABLE_NAME,'.csv')
Step24:
Click on copy activity>>click on source tab>>click open>>click on connection
tab>>click on schema name txt box>>add dynamic content>>click on Table_
Schema>>Ok>>click on table name text box>>add dynamic content>>click on
Table_Name>>ok
Publish All>>Debug>>Hence, now we can see that all the tables we have in the SQL DB have been loaded into our ADL Gen2 storage account in the form of .csv files.
Execution of the above pipelines in sequence, one after the other:
If we want to execute multiple pipelines one after the other based upon the project requirements, we can create a new pipeline>>Name: PL_Executepipeline>>in the Activities pane search for the Execute Pipeline activity>>drag and drop it onto the pipeline canvas>>In the General tab>>Name: ExecuteFirstPipeline>>In the Settings tab>>For Invoked pipeline choose PL_BLOB_TO_SQLDB>>drag and drop the Execute Pipeline activity one more time from the Activities pane onto the pipeline canvas>>establish a connection between the two activities>>In the General tab>>Name: ExecuteSecondPipeline>>In the Settings tab>>For Invoked pipeline choose PL_SQLDB_TOADL>>Publish All>>Debug.
When should we pick schedule-based triggers, tumbling window triggers, or event-based triggers?
can pass the value if required>>Blob path ends with: .csv (if we particularly want to pick up only the .csv files)>>For Event: check the box Blob is created (meaning when a file is placed)>>check the box Start trigger on creation>>Continue>>Ok>>Publish All.
Step5:
Now whenever someone places a new file with the .csv extension in the blob SA container, this trigger will get initiated and execute the pipeline.
Step6:
Now let us put one .csv file in our storage account and then go to Monitor inside ADF Studio>>Trigger runs (left side)>>Refresh>>here we will see that a trigger has been initiated, and if we click on Pipeline runs (left side, above) we will see that the pipeline we configured inside the trigger has been executed.
Azure Key Vault:
Azure Key Vault is a cloud service for securely storing and accessing
secrets(passwords). A secret is anything that we want to tightly control access
to, such as API keys, passwords, certificates. Key Vault service supports two
types of containers: vaults and managed hardware security module (HSM)
pools. Vaults support storing software and HSM-backed keys, secrets, and
certificates. Managed HSM pools only support HSM-backed keys. See Azure Key
Vault REST API overview for complete details.
Why we use Azure Key Vault:
Our applications can securely access the information they need by using URIs.
These URIs allow the applications to retrieve specific versions of a secret.
There's no need to write custom code to protect any of the secret information stored in Key Vault. For further information about Azure Key Vault please refer to the link: Azure Key Vault Overview - Azure Key Vault | Microsoft Learn
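As a quick illustration of how an application reads a secret without hard-coding it, here is a hedged Python sketch using the Azure Key Vault Secrets SDK; the vault URL and secret name are placeholders, and DefaultAzureCredential is just one possible credential type (for example a managed identity when running inside Azure):

from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Placeholder values; replace with your own Key Vault URL and secret name.
vault_url = "https://<your-key-vault-name>.vault.azure.net/"
credential = DefaultAzureCredential()

client = SecretClient(vault_url=vault_url, credential=credential)
secret = client.get_secret("sql-db-password")  # hypothetical secret name
print("Retrieved secret:", secret.name)  # the value itself is in secret.value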
(ii) System Assigned Managed Identity (here we get everything by default); we use this option for security purposes.
GitHub is a website and cloud-based service that helps developers store and manage their code, as well as track and control changes to their code. To understand exactly what GitHub is, we need to know two connected principles:
Version control
Git (the version control software GitHub is built on)
The key GitHub concepts built on top of these are:
Repositories
Branches
Commits
Pull Requests
Repository
Branch
Commits
Each commit (change) has a description explaining why a change was made.
Pull Requests
2) Password
3) Username
4) We will get an OTP on our mail; pass the OTP and we will be logged in to the GitHub platform.
Implementation steps to setup the code repository for ADF in GitHub
Platform:
1) Go to the Azure portal and create a blob SA, an ADF, then a linked service in ADF, a dataset in ADF, and a pipeline; just put one simple activity/control, Get Metadata, then publish and debug the pipeline.
2) If we already have an account on the GitHub platform, then go to the below link
https://github.com/signup
click on Sign in at the top right, pass the mail ID & password, and sign in.
3)click on New (available in green color left side)>>Repository name: ADF-
Repo>>scroll down a little>>Description: My first repo for ADF
artifacts>>choose public/private option (as per project requirements). Here I
am choosing public>>check the box Add a README file>>Create repository.
4)come to Azure portal>>click on Azure Data factory>>choose Set up code
repository as shown in below image>>
development or activities that we have done, it will get merged with the master branch and then we will do the publish.
If we log in to the GitHub portal, we can see that we have 2 branches ((i) master branch & (ii) practice branch) in the ADF-Repo that we created above; the same is shown below.
Click on Create pull request; it will redirect us to the GitHub portal, where we again click on Create pull request (right side, green color)>>Pass the
Organizations
21. For Azure DevOps, first open the Azure DevOps portal in our browser and log in with our Azure account; this is the URL:
https://dev.azure.com
22. Once we log in to the application, we will see the organizations on the left side. The first step is to choose an organization, i.e. under which organization we are going to create a new project.
If we want to disconnect the ADF from our GitHub repo (as we set up above), then go inside the ADF>>Launch ADF Studios>>Manage (left side)>>Git configuration>>Disconnect>>Enter the ADF name (ex: NareshADF1911)>>Disconnect
come to Azure portal(here ensure to come to Azure paid subscription i.e:
[email protected]) in another tab>>Active Directory>>App
registrations (left side)>>+New registration>>Name:
newappreg>>register>>click on App to get all the details of App
registration.
Come to dev.azure.com (https://dev.azure.com ) in another tab>>right
click on top>>switch directory>>create New organization in Azure devops
portal>>Name: TestADF-Repo>>Pass the captcha>>Continue.
Now inside this organization create a new project with name: ADF-
Project>>visibility: private>>Create project
In this project we can create multiple work items (epics, issues, tasks, etc.), assign the work items to different members of the team, and work on all the Boards main services & sub-services.
Click on Data Factory on top>>click on Setup code repository (as shown
below)
translation, path optimization, and execution of your data flow jobs; below is the link for Azure Data Flow
Step6:
First click on DetailsData>>Data preview tab>>Refresh (to see the source data)>>Click on the Projection tab>>Detect data type>>and here we can change the source columns' data types if needed (like the YearPassed column as integer, the Name column as string, the Marks column as string).
This way we can prepare our source transformation according to the input types that we have for the dataflow.
Inline Datasets in the Dataflow source (Source settings tab): Inline datasets are Spark-based. Whatever transformations we do inside dataflows are internally converted to Spark (Scala) code and run on top of a Databricks Spark cluster. Since this is driver-and-cluster code, and a cluster is a group of machines, instead of running on a single node it runs on a group of machines in parallel and provides the output for us.
When a format is supported for both inline and in a dataset object, there
are benefits to both. Dataset objects are reusable entities that can be
used in other data flows and activities such as Copy. These reusable
entities are especially useful when we use a hardened schema. Datasets
aren't based in Spark. Occasionally, you might need to override certain
settings or schema projection in the source transformation.
Inline datasets are recommended when you use flexible schemas, one-off
source instances, or parameterized sources. If your source is heavily
parameterized, inline datasets allow you to not create a "dummy" object.
Inline datasets are based in Spark, and their properties are native to data
flow.
To use an inline dataset, select the format you want in the Source
type selector. Instead of selecting a source dataset, you select the linked
service you want to connect to.
Click on the + symbol below the source control; a window will pop up below, and in the search type filter and select the Filter transformation, as shown in the below image (if we want to filter any rows we use this Filter transformation)>>
Step8:
Click on filter transformation>>In Filter settings tab>>Output stream name:
FilterRows>>click on Filter on box>>click on open expression builder and type
the below expression>>Save and finish
YearPassed == 2009 || Product_Type == 'Electronics'
Note: This expression is written purely against the columns of the .csv file that we are uploading; here YearPassed and Product_Type are columns in the .csv file that we uploaded/placed in Step1.
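Since dataflow transformations are ultimately executed as Spark code, a rough PySpark equivalent of this filter may help make the logic concrete. This is only an illustrative sketch: the file path is hypothetical, and the YearPassed and Product_Type column names come from the sample .csv used in this demo:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical path; read the same .csv that was uploaded in Step1.
df = spark.read.csv("/mnt/mycon/sample_rows.csv", header=True, inferSchema=True)

# Equivalent of the dataflow filter: YearPassed == 2009 || Product_Type == 'Electronics'
filtered = df.filter((F.col("YearPassed") == 2009) | (F.col("Product_Type") == "Electronics"))
display(filtered)  # display() is available in Databricks notebooks; use filtered.show() elsewhere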
Step9:
Step10:
Click on the + symbol below the Filter transformation control; a window will pop up below, and in the search type sink>>click on the Sink transformation (as shown in the image below)
Step11:
Click on Sink transformation>>Sink tab>>Output stream name: SinkData>>For
Dataset: ds_blbsa(this dataset we have created above)>>Publish All
Step12:
Launch ADF studios in another tab>>click on Manage(left side)>>Integration
runtimes>>+New>>Azure, Self-Hosted>>Continue>>Azure>>Continue>>Name:
integrationRuntime1>>create>>Publish All
Step13:
Click on Manage>>Integration runtimes>>click on integrationRuntime>>Data
flow runtime then here we can decide the compute size like small, medium,
large or we can even customize accordingly as per project
requirements>>Publish All
Step14:
Create a new pipeline>>Name: Run_Dataflow>>drag and drop Data flow
activity/control from activities pane to pipeline canvas>>click on Dataflow
activity>>Name: RunDataFlow>>In Settings tab>>Data flow: df_dataflow>>
Publish All(if required)>>Refresh the page if Publish All is succeeded>>Run on
(Azure IR):integration Runtime1(this we have created above)>>for Logging level
choose Basic radio button>>Publish All.
Step15:
If we want the file name in our destination blb SA as per our choice then come
to Dataflow>>click on sink transformation>>in Settings tab>>File name option:
Name file as column data>>column data: choose any column>>Publish
All>>Debug and now we can see our files in destination blb SA as per the file
name we want.
Step6:
Click on + >>a window will appear at the bottom>>in search type select>>click on the Select transformation>>click on the Select settings tab>>scroll down and here we will see all the columns that are going to appear in our destination; if we don't need some columns, we can select them and delete them as shown in the below image, based upon the project requirements.
Step7:
Click on + on Select transformations>>in below window search for Sort and
click on Sort transformation>>in Sort settings tab scroll down and for Sort
conditions click the knob and select the column name(means here we are
inserting the data in destination based upon names in Ascending order)
Step8:
Click on + on sort transformation>> in below window search for sink and click
on sink transformation>>click on Sink transformation>>In Sink tab>>for Dataset
click the knob and choose ds_blbsa>>Enable dataflow debug at the top
Step9:
Click on the Sort transformation>>click on the Data preview tab>>Refresh>>here we can see all the columns we are going to insert into our destination, and we can also notice that the last column (the one we removed with the Select transformation) is not getting inserted into our destination.
Step10:
Click on the Sink transformation>>Optimize tab>>click on the Single partition radio button>>click on the Settings tab>>File name option: Output to single file>>Output to single file: Details.csv (this file will appear in our destination blob SA)>>Publish All>>Debug
Step11:
Create a pipeline>>Name: Run_DataflowAgain>>Drag and drop Data flow
activity into pipeline canvas>>in General tab>>Name:df_dataflow1>>in settings
tab>>Data flow:df_dataflow1(this data flow we have created in above
steps)>>Publish All>>Debug.
Note: Now if we look in the destination storage account (i.e. the blob SA), the Details file is present with only 2 columns, whereas in our source storage account (ADL SA) this same file has multiple columns (maybe 3-4).
Implementation of Dataflows using Aggregate & Sink transformation:
Step1:
Create a Blb SA Storage Account and a container inside it and upload below .csv
file inside the container.
Step2:
Create ADF and a linked service for the blob SA, and choose Delimited text while creating the dataset because we have uploaded a .csv file to the storage account.
Step3:
Create a new data flow>>Name: df_dataflow2>>Click on Add source box>>in
Source settings tab>>For Dataset: +New>>in search type blob
storage>>continue>>delimited text>>continue>>For File path: click on the folder icon and select the file>>check First row as header>>Import schema: From connection/store>>ok
Step4:
Click on Source transformation>>click on Projection tab and change the data
type for Sales and Year column to Short as shown in below image.
Step5:
Enable data flow debug>>Ok>>click on the Data preview tab>>Refresh (to check the data)>>click on + on the Source transformation>>in search type Aggregate>>click on the Aggregate transformation>>in the Aggregate settings tab scroll down and for the Group by columns select Country>>click on Aggregates (as shown below)>>For the column name type, say, MaxSales>>click on ANY for Expression; an expression builder will open, and there type the below expression>>Save and finish
max(Sales)
Step6:
Click on +Add>>For the column name type, say, MinSales>>click on ANY for Expression; an expression builder will open, and there type the below expression>>Save and finish
min(Sales)
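For intuition, the Spark equivalent of this Aggregate transformation (group by Country, compute the max and min of Sales) might look roughly like the sketch below; the file path is hypothetical and df is assumed to hold the sales .csv used in this demo:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Assumed sales data with Country, Sales and Year columns.
df = spark.read.csv("/mnt/mycon/sales.csv", header=True, inferSchema=True)

agg_df = (
    df.groupBy("Country")
      .agg(
          F.max("Sales").alias("MaxSales"),  # same idea as the MaxSales aggregate column
          F.min("Sales").alias("MinSales"),  # same idea as the MinSales aggregate column
      )
)
display(agg_df)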
Step7:
Step8:
Step9:
Step10:
Step11:
Click on + on Aggregate transformation>>In search type sink>>click on sink
transformation>>in Sink tab>>Output stream name: DataLoading>>Dataset:
Note1: Hence, with all the above transformations we can see that, for a single file's data, we have produced multiple aggregations using the Aggregate transformation, split the data into multiple files, and loaded them into our blob SA (which we used as both source and destination in this demo).
Note2:
If we want all the details in one single file instead of multiple files, then click on the Sink transformation>>Click on the Optimize tab>>click the Single partition radio button>>click on the Settings tab>>For File name option: Output to single file>>For Output to single file: SalesDetails.csv (we can give any file name as per the project requirements).
Implementation of Dataflow with conditional split & Sink transformation:
Step1:
Create a Blb SA, container inside the SA and upload below .csv file inside the
Blb SA
Step2:
Create an ADL Gen2 storage account, a container, and a folder (ex: conditional split) inside the container of the SA.
Create an ADF>>Launch ADF>>
(i)Create a Linked Service (LS_Blbsa) for Blb SA & Create a dataset(ds_blbsa) for
Blb SA
(ii) Create a Linked Service (LS_Adlsa) for the ADL Gen2 SA & create a dataset (ds_adlsa) for the ADL Gen2 SA
Step4:
In ADF studio create new Dataflow>>Name: df_dataflow12>>click on source
transformation>>For Dataset: ds_blbsa>>Enable Data flow debug option(on
top).
Step5:
Click on + Source transformation>>in search type conditional split>>click on
conditional split>>in conditional split settings tab>>Split on: All matching
conditions>>For Stream names(text box): USAUK>>For condition box click on
ANY and in expression builder type the below expression>>Save and finish
Country == 'USA' || Country == 'UK'
Step6:
Click on + on extreme right(as shown below) to add a new row condition>>For
Stream names(text box):USAIND>> For condition box click on ANY and in
expression builder type the below expression>>Save and finish
Step7:
Step1:
Create a Blb SA, container inside the SA and upload below 2 .csv file inside the
Blb SA
Step2:
Create an ADF>>Launch ADF>>
(i)Create a Linked Service (LS_BlbSA) for Blb SA & Create a dataset(ds_blbsa) for
Blb SA
(ii)Create a Linked Service (LS_AdlSA) for Adl SA & Create a dataset(ds_adlsa)
for Adl Gen2 SA
Step3:
Create a new dataflow>>Name: df_dataflow5>>Click on Add Source for Source
transformation>>in Source settings tab>>Output stream name:
source1>>Dataset: ds_blbsa (this we have created at the top and for this
source1 we have set Sales_File_2014.csv)>>For Dataset: click on open and here
we see the Sales_File_2014.csv(as shown in below image) and if we are not
seeing this file click on Browse and select Sales_File_2014.csv file
Step4:
Click on Add Source(below) for Source transformation>>in Source settings
tab>>Output stream name: source2>>Dataset: ds_blbsa(this we have created
at the top and for this source2 we have set Sales_File_2020.csv and if we are
not seeing this file click on Browse and select Sales_File_2020.csv file)
Step5:
Step2:
Create an ADF>>Launch ADF>>
(i)Create a Linked Service (LS_BlbSA) for Blb SA & Create a dataset(ds_blbsa) for
Blb SA
(ii)Create a Linked Service (LS_AdlSA) for Adl SA & Create a dataset(ds_adlsa)
for Adl Gen2 SA
Step3:
In ADF create a Dataflow>>Name: df_dataflow7>>click on Add source>>click on
Source transformation(source1)>>in Source settings tab>>Dataset:
ds_blbsa>>For Dataset click on open and click on Browse to keep
Sales_Files_2014.csv.
Step4:
Step6:
Click on + on Exists transformation>>in search type sink>>click on sink
transformation>>in Sink tab>>For Dataset: ds_adlsa>>in Settings tab>>File
name option: Name file as column data>>click Refresh(@ top right of the page
as shown below)
Step7:
Create a new pipeline>>Name: PL_df_dataflow7>>Drag and drop the Data Flow activity>>Name: Dataflow7>>in the Settings tab>>Data flow: df_dataflow7>>Publish All>>Publish>>Debug.
Step2:
Create an ADF>>Launch ADF>>
(i)Create a Linked Service (LS_BlbSA) for Blb SA & Create a dataset(ds_blbsa) for
Blb SA
(ii)Create a Linked Service (LS_AdlSA) for Adl SA & Create a dataset(ds_adlsa)
for Adl Gen2 SA
Step3:
In ADF create a Dataflow>>Name: df_dataflow11>>click on Add source>>click
on Source transformation(source1)>>in Source settings tab>>Dataset:
ds_blbsa>>For Dataset click on open and click on Browse to keep
2017_Students_Batch.csv.
Step4:
Click on Add source again>>click on source transformation(source2)>>in Source
settings tab>>Dataset: ds_blbsa>>For Dataset click on open and click on
Browse to keep 2018_Students_Batch.csv.
Step5:
Click on + of source1>>in search type Join>click on Join transformation>>in Join
settings tab>>Right stream: source2>>For Join type: inner (choose which type
of join we want to consider like Inner join, Left outer join, Full outer
join….etc)>>Join conditions: StudentsID(for both Left: source1’s column and
Right: source2’s column)>>Enable Data flow debug option.
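As a rough Spark analogue of this Join transformation (a hedged sketch only; the file paths are hypothetical and the StudentsID column mirrors the two source files assumed in this demo):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Dataframes assumed to correspond to the two source files.
source1_df = spark.read.csv("/mnt/mycon/2017_Students_Batch.csv", header=True, inferSchema=True)
source2_df = spark.read.csv("/mnt/mycon/2018_Students_Batch.csv", header=True, inferSchema=True)

# Inner join on StudentsID, like the Join transformation above; swap "inner" for
# "left_outer", "full_outer", etc. to try the other join types.
joined = source1_df.join(source2_df, on="StudentsID", how="inner")
display(joined)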
Step6:
Click on + of Join transformation>>in search type sink>>click on sink
transformation>>in Sink tab>>Dataset: ds_adlsa>>in Settings tab>>File name
option: Output to single file>>Output to single file: Innerjoinresults.csv>>in
Optimize tab>>single partition>>Publish All>>Publish.
Step7:
Step2:
Create an ADF>>Launch ADF>>
(i)Create a Linked Service (LS_BlbSA) for Blb SA &
Create a dataset(ds_derivedcol2014) for Blb SA
(ii)Create a Linked Service (LS_AdlSA) for Adl SA & Create a dataset(ds_adlsa)
for Adl Gen2 SA
Step3:
Create a new dataflow>>Name:df_dataflow55>>click on Add source>>in
Source settings tab>>Dataset: ds_derivedcol2014
Step4:
toInteger(trim(right(Country, 6),'()'))
Step6:
Click on + (shown below)>>click on Add column (as shown below)>>in the 2nd column which was just generated, type Country as the name>>click on ANY on the 2nd column>>write the below expression in the expression builder.
toString(left(Country, length(Country)-6))
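For intuition, a hedged PySpark sketch of the same derived-column logic: the Country column appears to hold values like "India (2014)", from which the year is extracted as an integer and then stripped from the country name. The regex approach below is just one equivalent way of doing what the right/left/trim dataflow expressions above do; df is assumed to be the dataframe read from the 2014 sales file:

from pyspark.sql import functions as F

# df is assumed to contain Country values of the form "SomeCountry (2014)".
derived = (
    df.withColumn("Year", F.regexp_extract("Country", r"\((\d{4})\)", 1).cast("int"))
      .withColumn("Country", F.trim(F.regexp_replace("Country", r"\s*\(\d{4}\)$", "")))
)
display(derived)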
Step7:
Click on the Derived column transformation>>click on the Data preview tab>>Refresh>>Now here we can see the Country column carrying only country names, and a new derived column, Year, has emerged which carries only the year (as shown in the image below).
Step8:
Click on + on Derived column>>in search type sink>>click on sink
transformation>>in Sink tab>>Dataset: ds_adlsa>>in Settings tab>>File name
option:Output to single file>>Output to single
file:Derivedcolumnresults.csv>>in Optimize tab>>click on Single partition>>In
Data preview tab>>Refresh(to see how the data is getting loaded in our
destination ADL Gen2 Storage account)>>Publish All>>Publish
Step9:
Create a new pipeline>>Name: PL_df_dataflow55>>drag and drop the Data Flow activity>>in the Settings tab>>Data flow: df_dataflow55>>Compute size: Medium (optional)>>Publish All>>Publish>>Debug
Step10:
Now come to destination Storage Account i.e.: ADL Gen2 SA and we can see a
Derivedcolumnresults.csv file in destination SA
Step3:
Create an ADF>>Create a Linked service(LS_Sql) for SQL DB, create this LS for
Adventure Works DB>>Create a Dataset(ds_Sql) for [SalesLT].[Product] table
present in AdventureWorks SQL DB>>Publish All>>Publish
Step4:
Create a new dataflow>>Name:df_dataflow77>>Click on Add source>>in
Source settings tab>>Dataset:ds_sql>>Enable data flow debug option>>In Data
Preview tab>>Refresh
Step5:
Click on + on source transformation>>in search type Derived column>>click on
Derived column>>in Derived column’s settings tab>>For columns: Color>>click
on ANY (shown below) to open the expression builder and type the below
expression>>Save and finish
Step9:
Click on +Add >>Add column(as shown below)>>for newly added column pass
the name as size(as shown below)>>double click on ANY>>and write the below
expression>>Save and finish
Step6:
Click on + on Derived column>>in search type Pivot>>click on Pivot>>in Pivot
settings tab>>scroll down>>click on Group by >>For columns: Size(as shown
below)
Step6: Now click on Pivot key>>scroll down>>For Pivot Key: Color(as shown
below)
Step7:
Now click on Pivoted columns(as shown below)>>double click on ANY to open
the expression builder and write the below expression for avg of standard
cost>>save & finish>>give name as Avg for next text box(as shown in image
below).
avg(StandardCost)
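A hedged PySpark sketch of the same pivot (group by Size, pivot on Color, average of StandardCost); product_df is assumed to be the [SalesLT].[Product] data loaded into a dataframe:

from pyspark.sql import functions as F

# product_df is assumed to contain the Size, Color and StandardCost columns
# of the [SalesLT].[Product] table.
pivoted = (
    product_df.groupBy("Size")              # Group by: Size
              .pivot("Color")               # Pivot key: Color
              .agg(F.avg("StandardCost"))   # Pivoted column: avg(StandardCost)
)
display(pivoted)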
Step8:
Click on the Pivot transformation>>Data preview>>Refresh>>to see how the data is reflected, with the pivoted values turned into columns since we have used the Pivot transformation.
Step9:
Click on + on the Pivot transformation>>in search type sink>>click on the Sink transformation>>in the Sink tab>>Dataset: ds_adlsa>>in the Settings tab>>File name option: Output to single file>>Output to single file: pivotresults.csv>>in the Optimize tab>>click on Single partition>>In the Data preview tab>>click on Refresh, and finally we can see what data is going to be inserted into our destination, i.e. the ADL Gen2 SA, from the SQL DB table (i.e. [SalesLT].[Product])>>Publish All>>Publish. Finally, all our transformations look like the image below.
Step10:
Create a new pipeline>>Name: PL_df_dataflow77>>Drag and drop the Data Flow activity onto the pipeline canvas>>in the Settings tab>>Data flow: df_dataflow77>>Compute size: Medium>>Publish All>>Publish>>Debug
Hence, here we can see that the data got exported from the AdventureWorks DB (source) to the ADL Gen2 storage account (destination SA).
Union & Union All:
UNION and UNION ALL in SQL are used to retrieve data from two or more tables. UNION returns the distinct records from both tables, while UNION ALL returns all the records from both tables.
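A quick hedged illustration in PySpark (the engine these dataflows run on): DataFrame.union behaves like UNION ALL, and adding distinct() gives UNION semantics. The two dataframes below are small hypothetical examples with identical schemas:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(1, "USA"), (2, "UK")], ["id", "country"])
df2 = spark.createDataFrame([(2, "UK"), (3, "IND")], ["id", "country"])

union_all_df = df1.union(df2)            # UNION ALL: keeps the duplicate (2, "UK") row
union_df = df1.union(df2).distinct()     # UNION: removes duplicate rows
union_all_df.show()
union_df.show()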
Windows Functions:
A window function performs a calculation across a set of table rows that are
somehow related to the current row. This is comparable to the type of
calculation that can be done with an aggregate function. But unlike regular
aggregate functions, use of a window function does not cause rows to become
grouped into a single output row — the rows retain their separate identities.
Behind the scenes, the window function can access more than just the current
row of the query result.
RANK():
As the name suggests, the rank function assigns a rank to all the rows within every partition. Rank is assigned such that rank 1 is given to the first row, and rows having the same value are assigned the same rank. For the next rank after two equal rank values, one rank value is skipped.
DENSE_RANK():
It assigns a rank to each row within a partition. Just like the rank function, the first row is assigned rank 1 and rows having the same value have the same rank, but unlike RANK no rank values are skipped.
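The later steps apply these functions through the Window transformation; as a hedged PySpark sketch of the same idea (ranking products by StandardCost, matching the columns used in this demo, with product_df assumed to hold the [SalesLT].[Product] data):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Order by StandardCost descending; add partitionBy(...) if the ranks should
# restart per group (for example per Color).
w = Window.orderBy(F.desc("StandardCost"))

ranked = (
    product_df.withColumn("Rank", F.rank().over(w))             # ties share a rank, next rank is skipped
              .withColumn("DenseRank", F.dense_rank().over(w))  # ties share a rank, no gaps
              .withColumn("RowNumber", F.row_number().over(w))  # unique sequential number
)
display(ranked)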
Step2:
Create an ADF>>Create a Linked service(LS_Sql) for SQL DB, create this LS for
Adventure Works DB>>Create a Dataset(ds_Sql) for [SalesLT].[Product] table
present in AdventureWorks SQL DB>>Publish All>>Publish
Step3:
Create ADL Gen2 Storage Account, create container inside it, folder(ex:
DensRanks) and create a Dataset in ADF for this ADL Gen2 SA
Step4:
Step6:
Click on the Window transformation>>Data preview>>Refresh>>if we navigate to the extreme right we can see Rank (as shown below), which shows the same rank 3 for 3 rows because the standard cost has the same value, and for the next rank it took 6, not 4. Here, if rows have the same value (ex: StandardCost) they are given the same rank.
Step7:
Step8:
Click on the Window transformation>>Data preview>>Refresh>>navigate to the extreme right and here we can see the Rank and DenseRank values; with DenseRank the next immediate ranks are not skipped (as shown below), in contrast to Rank.
Step9:
Click on Window transformation>>in Windows settings tab>>click on
+Add>>Add column (as shown below)>>type RowNumber for the newly
launched column (on left as shown below)>>type rowNumber()(on right as
shown below) in expression box
Step10:
Click on Window transformation>>Data preview>>Refresh>>navigate to
extreme right and here we can see Rank and DenseRank & RowNumber values
as shown below
Step11:
Click on + on Window transformation>>in search type sink>>click on Sink
transformation>>Dataset: ds_adlsa>>File name option:Output to single
file>>Output to single file:WindowsRanksresults.csv>>In Optimize tab>>select
single partition>>Data preview>>Refresh>>Publish All>>Publish.
Step12:
Create a new pipeline>>Name: PL_df_dataflow99>>Drag and drop the
Dataflow activity>>In settings tab>>Data flow: df_dataflow99>>Publish
All>>Publish>>Debug
your data on RAM and no longer makes sense to fit all your
data on a local machine. On a high level, it is a unified
analytics engine for Big Data processing, with built-in
modules for streaming, SQL, machine learning, and graph
processing. Spark is one of the latest technologies that is
being used to quickly and easily handle Big Data and can
interact with language shells like Scala, Python, and R.
The Spark follows the master-slave architecture. Its cluster consists of a single
master and multiple slaves.
The Spark architecture consists of a single master and multiple slaves. Based upon the volume of data and the workload, we can configure the Databricks cluster; we can specify up to 12 worker nodes, depending on the workload and the volume of data, for the transformations. We have to choose all these things while creating a Databricks Spark cluster. The Spark architecture depends upon two abstractions: the Resilient Distributed Dataset (RDD) and the Directed Acyclic Graph (DAG).
Resilient Distributed Datasets are groups of data items that can be stored in-memory on the worker nodes.
Spark Components
The Spark project consists of different types of tightly integrated components.
At its core, Spark is a computational engine that can schedule, distribute and
monitor multiple applications.
Spark Core
o The Spark Core is the heart of Spark and performs the core functionality.
Spark SQL
o The Spark SQL is built on the top of Spark Core. It provides support for
structured data.
o It allows us to query the data via SQL (Structured Query Language) as well as the Apache Hive variant of SQL, called HQL (Hive Query Language).
o It supports JDBC and ODBC connections that establish a relation
between Java objects and existing databases, data warehouses and
business intelligence tools.
o It also supports various sources of data like Hive tables, Parquet, and
JSON.
Spark Streaming
o Spark Streaming is a Spark component that supports scalable and fault-
tolerant processing of streaming data.
o It uses Spark Core's fast scheduling capability to perform streaming
analytics.
o It accepts data in mini-batches and performs RDD transformations on
that data.
o Its design ensures that the applications written for streaming data can be
reused to analyse batches of historical data with little modification.
o The log files generated by web servers can be considered as a real-time
example of a data stream.
MLlib
o The MLlib is a Machine Learning library that contains various machine
learning algorithms.
o These include correlations and hypothesis testing, classification and
regression, clustering, and principal component analysis.
o It is nine times faster than the disk-based implementation used by
Apache Mahout.
GraphX
o The GraphX is a library that is used to manipulate graphs and perform
graph-parallel computations.
o It facilitates creating a directed graph with arbitrary properties attached to each vertex and edge.
o To manipulate graphs, it supports various fundamental operators like subgraph, joinVertices, and aggregateMessages.
Azure Databricks:
Azure Databricks is an industry-leading, cloud-based data engineering tool used for processing, exploring, and transforming Big Data and using the data with machine learning models. It provides a fast and simple way to set up and use a cluster to analyse and model Big Data. In a nutshell, it is the platform that allows us to use PySpark (the collaboration of Apache Spark and Python) to work with Big Data. (Databricks also offers a community edition that is completely free to use.)
Azure Databricks is a data analytics platform optimized for the Microsoft Azure cloud services platform. Azure Databricks offers three environments (Databricks SQL, Databricks Data Science & Engineering, and Databricks Machine Learning). Some of its advantages:
It can process large amounts of data and, since it is part of Azure, the data is cloud native.
The clusters are easy to set up and configure.
It has an Azure Synapse Analytics connector as well as the ability to connect to Azure DB.
It is integrated with Active Directory.
It supports multiple languages. Scala is the main language, but it also works well with Python, SQL, and R.
Azure Databricks is also an ETL tool where we extract the data from a source and load it into a target.
Whatever transformations we have done so far, like moving the data from source to target and even with dataflows, can be done with ADF. If we want to do complex transformations, or any user-defined functions, then we use Azure Databricks, where we can process huge volumes of data. If we are receiving petabytes or gigabytes of data from a source and we need to do complex kinds of transformations on it, then we can use the Azure Databricks service.
In Azure Databricks we can connect to any kind of data source: on-premises sources, Azure Blob Storage, Data Lake Gen2 Storage, etc., and we can move the data from any source to any destination using Azure Databricks.
i.e.:
(i) Multi node>>here multiple users can connect to the cluster and to the cluster notebooks that have been created.
(ii) Single node>>here a single user can connect to the cluster and to the cluster notebooks that have been created.
We have multiple Long Term Support (LTS) versions to choose from while provisioning Azure Databricks, as shown below.
Step2:
Mouse over to left>>click on Compute>>Create compute(center)>>create a
cluster>>Fill the details accordingly as shown in image below and finally click
on Create compute(@ below)
Pools: When we want to keep a set of idle, ready-to-use instances so that clusters can start and scale more quickly, we can make use of pools.
Azure Databricks makes a distinction between all-purpose clusters and job clusters. We use all-purpose clusters to analyze data collaboratively using interactive notebooks, and we use job clusters to run fast and robust automated jobs. We can create an all-purpose cluster using the UI, the CLI, or the REST API.
Steps to see the clusters in Azure Data Bricks:
Copy the below Python script and paste it in Python notebook(as shown
below) and hit shift+enter to run the script in Python notebook
print("Spark version", sc.version, spark.sparkContext.version, spark.version)
print("Python version", sc.pythonVer)
In Azure Data bricks cluster Spark supports four (4) different types of
languages…i.e: Python, Scala, SQL & R-programming…
If we want to see the version history of a notebook, then click on File(@
the top) in the notebook>>scroll down>>Version history.
Sometimes, if we get errors while executing the Python scripts in cluster notebooks, click on Run at the top of the notebook>>click on Restart compute resource, or go to the compute/cluster and restart the cluster.
Refer the link below for Azure Databricks hands-on!
Azure Databricks Hands-on. This tutorial will explain what is… | by Jean-Christophe Baey |
Medium (From this link we can get all the python, Scala codes…etc)
Generate a new cell in the notebook by clicking at the top right of the cell as shown in the image below.
After the cell has been generated in the notebook, paste the Scala code (which we can get from the above link) into the cell body as shown below.
Access Keys:
I/4UBW2dm+Cl1XX2i2N9Y5LA3d1VCQB6WbX64p+fRpXxQPcfDG/DLKbcwAgPbi
goE0cufB+4TIH6+ASt9xzTnA==
Container name: mycon
Step4: Now from the above link copy the entire code (the "To set up the file access, you need to do this:" section), paste it into a PySpark notebook cell, and change the SA name, container name & access key accordingly.
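For reference, a minimal sketch of what that mount code typically looks like, assuming the container mycon shown above; the storage account name and access key below are placeholders to be replaced with your own values (dbutils is predefined in Databricks notebooks):

# Hedged sketch of mounting a blob container with the WASBS driver in Databricks.
storage_account = "<your-storage-account-name>"   # placeholder: replace with your SA name
container = "mycon"
access_key = "<your-storage-account-access-key>"  # the Access key shown above

dbutils.fs.mount(
    source=f"wasbs://{container}@{storage_account}.blob.core.windows.net/",
    mount_point=f"/mnt/{container}",
    extra_configs={
        f"fs.azure.account.key.{storage_account}.blob.core.windows.net": access_key
    },
)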
Step5: Click on the knob at the extreme top right of the cell body and run the Python code inside the cell by clicking on Run cell, and we see the output as below.
Mounting: /mnt/mycon
Reading .csv file from Blob SA with WASBS METHOD from Azure Databricks
cluster:
Step1: Create a Blob Storage Account & container inside the storage account
and upload the .csv file inside the SA container.
Note: These commands are absolutely case sensitive, so while typing them in the cluster notebook we have to pay full attention to upper case and lower case.
Alternatively, we can directly read the file by passing the below commands in another cell, which gives the same output as above.
df = spark.read.csv("/mnt/mycon/usd_to_eur.csv", header = True, inferSchema = True)
display(df)
Step5: Click on + as shown in the image below; here we can do data visualization with multiple types of charts (line chart, bar chart, area chart, pie chart, scatter chart, bubble chart, etc.) and can also apply various filters on it.
Step6: Create a new cell and type the below commands, which give different results:
df.printSchema()>>this command shows the schema of the csv file
df.describe().show()>>this command shows aggregate statistics over the file records (like count, mean, min, max, stddev, etc.)
df.head(5)>>this command shows only the top 5 records (whatever number we pass here, that many records will be displayed).
Step7: Create a new cell and type the below command, which creates a temporary view so that the DataFrame can also be queried with SQL:
df.createOrReplaceTempView("xrate")>>registers the result as the temp view xrate, which lets us move from Python to SQL
Step8: Create a new cell and paste the below commands to get the output grouped by year and ordered by year descending from xrate (the temp view):
df = spark.sql("SELECT YEAR(Date) as year, COUNT(Date) as count, MEAN(Rate) as mean From xrate GROUP BY YEAR(Date) ORDER BY year DESC")>>command to get the data from xrate
display(df)>>command to display the output.
Step9:
Create a new cell; if we want to write the SQL query directly, first select SQL at the top right inside the cell and then we can write the SQL query directly, as shown below.
SELECT YEAR(Date) as year, COUNT(Date) as count, MEAN(Rate) as mean From xrate GROUP BY YEAR(Date) ORDER BY year DESC
The same aggregation can also be written with the PySpark DataFrame API; note that the functions module must be imported first:
from pyspark.sql import functions as f  # needed for year, count, mean and desc

retDF = (
    df
    .groupBy(f.year("Date").alias("year"))
    .agg(f.count("Date").alias("count"), f.mean("Rate").alias("mean"))
    .sort(f.desc("year"))
)
display(retDF.head(4))
PySpark: PySpark is the Python API for Apache Spark; it integrates the Spark engine with Python so that we can call Spark functionality from our Python code.
In the code above, groupBy, agg, sort, etc. are all DataFrame methods, and we are applying these methods on top of the DataFrame (df).
Importing Apache Spark libraries & writing the code in a Databricks cluster with %scala:
Create a new cell in the same cluster notebook and paste the below code:
%scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
val df = spark.table("xrate")
// or
// val df = spark.sql("select * from xrate")
val Row(minValue, maxValue) = df.select(min("Rate"), max("Rate")).head
In the above code, we are importing Apache Spark functions to get the Min & Max values of the Rate column.
Hence, like this we can write code in Azure Databricks cluster notebooks in either Python, SQL or Scala by mentioning the corresponding magic command in the cell body.
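For comparison, the same Min & Max result can be obtained from a Python cell with spark.sql (a minimal sketch, assuming the xrate temp view created in the earlier steps):

# Query the xrate temp view from Python and compute the min/max of the Rate column
minmax_df = spark.sql("SELECT MIN(Rate) AS minValue, MAX(Rate) AS maxValue FROM xrate")
display(minmax_df)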
Azure AzCopy:
AzCopy is a command-line utility that we can use to copy blobs or files to or from a storage account; we download AzCopy, connect to our storage account, and then transfer the files.
Migration from Private cloud to Public cloud (Forward Migration):
Download AzCopy from the link below onto our laptop
https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10
After it gets downloaded>>extract the zip file>>go inside the folders>>copy (Ctrl+C) azcopy.exe and paste it into the below path on our laptop>>
C:\Windows\System32>>
Go to the Azure portal, create a storage account and create a container (blob) storage service
Go inside the blob/container storage>>click on Properties (left side)>>copy the URL and paste it into a separate notepad
Now go to Shared access signature (inside the storage account)>>select all the options>>click the radio button HTTPS and HTTP (to be sure)>>click on Generate SAS and connection string>>copy the SAS token (carefully)>>concatenate this SAS token with the container/blob storage service URL (as in the example shown below)
?sv=2020-08-04&ss=b&srt=sco&sp=rwdlacitfx&se=2021-12-24T20:09:55Z&st=2021-12-24T12:09:55Z&spr=https,http&sig=HmtQmRiO0C%2BablXp8%2B961rT6GtcYZSuJxakd8josccs%3D >> SAS generated token
Doing the concatenation (as shown below)
https://mysa1972.blob.core.windows.net/mycontainer/?sv=2020-08-04&ss=b&srt=sco&sp=rwdlacitfx&se=2021-12-24T20:09:55Z&st=2021-12-24T12:09:55Z&spr=https,http&sig=HmtQmRiO0C%2BablXp8%2B961rT6GtcYZSuJxakd8josccs%3D
Now search for Command Prompt on our laptop, open it with Run as administrator>>type azcopy.exe copy "here give the source path where our files are present on our local laptop, to copy to our Azure container storage service" "here give the container storage service URL along with the SAS token" --recursive>>and then finally hit Enter
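For example (a sketch only, assuming a hypothetical local folder C:\data\myfiles and the example container URL with SAS token shown above):
azcopy.exe copy "C:\data\myfiles" "https://mysa1972.blob.core.windows.net/mycontainer/?sv=2020-08-04&ss=b&srt=sco&sp=rwdlacitfx&se=2021-12-24T20:09:55Z&st=2021-12-24T12:09:55Z&spr=https,http&sig=HmtQmRiO0C%2BablXp8%2B961rT6GtcYZSuJxakd8josccs%3D" --recursive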
Now come to our container storage service and we should be able to see all the files/data that we have uploaded using AzCopy from our local laptop to the Azure cloud storage service.
Hence, we have migrated the data from On-prem (Private Cloud) to the Public cloud.
Migration of Data from One Storage Account to Another (Cloud-to-Cloud Migration)
Note: Firstly, ensure the 2nd (destination) storage account is empty and has a container/blob storage service created inside it, and does not already contain the files/data which we are going to copy using Azure AzCopy; if the same files/data are already present in the destination storage account and we run the AzCopy command again, it will skip the files that are already there.
Follow all the steps same as above, and now in the command prompt:
azcopy.exe copy "Source Storage Account URL (container_url_followed_with SAS token)" "Destination Storage Account URL (container_url_followed_with SAS token)" --recursive
Migration from Public cloud to Private cloud (Reverse Migration):
Create an empty folder in any drive on your laptop (ex: F drive)
Create a Storage Account, create a blob container and keep some data in it (ex: files)
Open the command prompt with Administrator access and pass the below command using AzCopy, as shown below.
azcopy.exe copy "Source Storage Account URL (container_url_followed_with SAS token)" "Destination path, i.e. our local F drive folder on our laptop" --recursive
same region; now if we want to replicate it the other way around, in case something has happened in this LRS region, then we can manually pick GRS and click on the Save button (at the top).
Hence, Storage Account replication can be implemented from one region (as primary) to another region (secondary) for Disaster Recovery (RPO & RTO) purposes, based on the specific requirement in projects.
Step2: Refresh the Databases folder and here we will see the AdventureWorks2014 DB in our SSMS
Step5: After the migration has been completed in the DMA tool, connect to the Azure SQL Server and SQL DB via SSMS and check whether the data has been migrated successfully, with all the data and entries.
Importing/Migrating a Full DB directly from On-prem (local server) to a Cloud DB Server:
Note: Try to import or migrate as small a DB as possible, else it will take hours to get migrated.
Step1: Deploy a SQL Server and SQL DB in the Azure portal
Step2: Create a Storage Account and upload the .bacpac file of the SQL DB into the Storage Account container, where it is stored as a block blob.
Step3: Go to the SQL Server (which got deployed along with the DB in the cloud portal)>>Import database>>Select backup>>click the storage account (which we created in the above step)>>click on mycon (the container we created inside the SA)>>click the DB file (which we have uploaded)>>Select>>OK
Step4: Wait for some time until the import of the DB is completed in our Azure SQL Server.
Step5: In the Azure portal we can find the DB which we have imported; go inside the DB (in the Azure portal)>>click on Query Editor (left side)>>pass the userId
and password and say OK>>expand the Tables folder and here we can see all the tables that we have in our DB, and we can write queries in the query editor to verify the data.