Azure Databricks Documentation

By Abdelrahman Omar


bbi.ai
Contents
Introduction
Azure Databricks with Data Factory
Set up the portal environment
Create an Azure Databricks service
Create a Spark cluster in Azure Databricks
Azure Databricks development
Developing the ODS script
Load data to Synapse
Introduction
Azure Databricks is a fast, easy, and collaborative Apache Spark-based analytics platform that
provides a fully managed and secure cloud environment for data engineers, data scientists, and
machine learning engineers. With Azure Databricks, you can easily process vast amounts of
data, build and train machine learning models, and collaborate on projects with your team
members.

One of the key benefits of using Azure Databricks is that it allows you to easily scale your data
analytics and machine learning workloads, without the need to manage infrastructure. You can
spin up clusters of virtual machines to process your data and then shut them down when you're
finished, which means you only pay for the compute resources you use.

In addition to being highly scalable, Azure Databricks is also highly flexible. It integrates with a
wide range of data sources and tools, including Azure Blob Storage, Azure Data Lake Storage,
Azure Synapse Analytics, and Azure Stream Analytics. You can use a variety of programming
languages, such as Python, R, SQL, and Scala, to analyze your data and build machine learning
models.

Azure Databricks also provides a range of collaboration and productivity features, including a
web-based notebook interface for writing and executing code, built-in version control for
managing code changes, and a range of data visualization tools for exploring your data.

Overall, Azure Databricks is an excellent platform for anyone looking to perform data analytics
and machine learning in a fast, easy, and collaborative environment, without the need to
manage infrastructure.

Azure Databricks with Data Factory
Azure Databricks and Azure Data Factory are two powerful Azure services that can be
integrated to provide an end-to-end solution for data engineering and advanced analytics.
Azure Databricks is a collaborative analytics platform that can process big data, machine
learning, and advanced analytics, while Azure Data Factory is a cloud-based data integration
service that allows you to create, schedule, and manage workflows to move and transform data.
Below, we discuss the main uses of integrating Azure Databricks with Azure Data Factory.

Data ingestion and transformation:


One of the main uses of integrating Azure Databricks with Azure Data Factory is data ingestion
and transformation. With this integration, you can use Azure Data Factory to move data from
various sources to Azure Databricks for processing. Azure Databricks can then be used to
perform data transformation and cleansing, and the transformed data can be written back to a
storage account using Azure Data Factory. This process can be automated using Data Factory
pipelines, allowing you to easily manage the entire data ingestion and transformation process.

Machine learning:
Another key use of integrating Azure Databricks with Azure Data Factory is machine learning.
With this integration, you can use Azure Data Factory to move data from various sources to
Azure Databricks, where you can build and train machine learning models. You can then use
Azure Data Factory to deploy the trained models to a production environment. This process can
be automated using Data Factory pipelines, allowing you to easily manage the entire machine
learning process.

Real-time data processing:


Integrating Azure Databricks with Azure Data Factory can also be used for real-time data
processing. With this integration, you can use Azure Data Factory to move data from various
sources to Azure Event Hubs, which can then trigger an Azure Databricks notebook. The
notebook can process the data in real time, and the processed data can be written back to a
storage account using Azure Data Factory. This process can be automated using Data Factory
pipelines, allowing you to easily manage the entire real-time processing workflow.
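
For illustration, a minimal sketch of the Databricks side of this pattern is shown below. It reads events from an Event Hub through its Kafka-compatible endpoint with Structured Streaming and writes them to Delta Lake; the namespace, event hub name, secret scope, and paths are placeholders, not values from this guide.

# Minimal sketch: stream events from Azure Event Hubs (Kafka-compatible endpoint)
# into Delta Lake from a Databricks notebook. All bracketed values are placeholders.
connection_string = dbutils.secrets.get(scope="<secret-scope>",
                                         key="<eventhub-connection-string>")

# Databricks ships a shaded Kafka client, hence the "kafkashaded." class prefix.
sasl_config = ('kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule '
               f'required username="$ConnectionString" password="{connection_string}";')

events_df = (spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "<namespace>.servicebus.windows.net:9093")
             .option("subscribe", "<event-hub-name>")
             .option("kafka.security.protocol", "SASL_SSL")
             .option("kafka.sasl.mechanism", "PLAIN")
             .option("kafka.sasl.jaas.config", sasl_config)
             .load())

# Decode the payload and continuously append it to a Delta table in storage.
(events_df.selectExpr("CAST(value AS STRING) AS body", "timestamp")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/events")
    .start("/mnt/ods/events"))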

Automated data engineering:
Integrating Azure Databricks with Azure Data Factory can also be used for automated data
engineering. With this integration, you can use Azure Data Factory to move data from various
sources to Azure Databricks, where you can perform data transformation and engineering. You
can then use Azure Data Factory to deploy the transformed data to a production environment.
This process can be automated using Data Factory pipelines, allowing you to easily manage the
entire data engineering process.

Conclusion:
Integrating Azure Databricks with Azure Data Factory can provide a powerful end-to-end
solution for data engineering and advanced analytics. With this integration, you can easily move
data from various sources to Azure Databricks for processing, perform data transformation and
cleansing, build and train machine learning models, and perform real-time data processing. This
process can be automated using Data Factory pipelines, allowing you to easily manage the
entire process.

Set up the portal environment

Sign up for the portal
1- Sign up for the Azure portal.

2- After signing up, sign in to the Azure portal.

Create a resource group

3- Create a resource group.

4- Select Review + create, and then select Create.

Create storage account

To create an Azure storage account with the Azure portal, follow these steps:

1. From the left portal menu, select Storage accounts to display a list of your storage
accounts. If the portal menu isn't visible, click the menu button to toggle it on.

2- On the Storage accounts page, select Create.

3- Enter standard configuration of the basic properties for a new storage account.

4- Enter standard configuration of the advanced properties for a new storage account.

5- Enter standard configuration of the networking properties for a new storage account.

6- Enter standard configuration of the data protection properties for a new storage
account.

7- Enter standard configuration of the encryption properties for a new storage account.

8- Enter standard configuration of the index tag properties for a new storage account.

When you navigate to the Review + create tab, Azure runs validation on the storage account
settings that you have chosen. If validation passes, you can proceed to create the storage
account.

If validation fails, then the portal indicates which settings need to be modified.

The following image shows the Review tab data prior to the creation of a new storage account.

Service principal

In this section, we'll learn how to create an Azure Active Directory (Azure AD) application and service principal
that can be used with role-based access control. When you register a new application in
Azure AD, a service principal is automatically created for the app registration. The service
principal is the app's identity in the Azure AD tenant. Access to resources is restricted by the
roles assigned to the service principal, giving you control over which resources can be accessed
and at which level. For security reasons, it's always recommended to use service principals with
automated tools rather than allowing them to sign in with a user identity.

1- Sign-in to the Azure portal.


2- Search for and Select Azure Active Directory.
3- Select App registrations, then select New registration
4- Name the application, for example "example-app".
5- Select a supported account type, which determines who can use the application.
6- Under Redirect URI, select Web for the type of application you want to create.
Enter the URI where the access token is sent to.
7- Select Register.

Assign a role to the application

To access resources in your subscription, you must assign a role to the application. Decide
which role offers the right permissions for the application. To learn about the available roles,
see Azure built-in roles.

You can set the scope at the level of the subscription, resource group, or resource. Permissions
are inherited to lower levels of scope.

1- Sign-in to the Azure portal.


2- Select the level of scope you wish to assign the application to. For example, to assign a
role at the subscription scope, search for and select Subscriptions. If you don't see the
subscription you're looking for, select the global subscriptions filter. Make sure the
subscription you want is selected for the tenant.
3- Select Access control (IAM).
4- Select Add, then select Add role assignment.
5- In the Role tab, select the role you wish to assign to the application in the list. For
example, to allow the application to execute actions like reboot, start and stop
instances, select the Contributor role.
6- Select Next.
7- On the Members tab, select Assign access to, then select User, group, or service
principal.
8- Select Select members. By default, Azure AD applications aren't displayed in the
available options. To find your application, search for it by name.
9- Select the Select button, then select Review + assign.

Your service principal is set up. You can start using it to run your scripts or apps. To manage
your service principal (permissions, user consented permissions, see which users have
consented, review permissions, see sign in information, and more), go to Enterprise
applications.

The next section shows how to get values that are needed when signing in programmatically.

Sign in to the application

When programmatically signing in, pass the tenant ID and the application ID in your
authentication request. You also need a certificate or an authentication key. To obtain the
directory (tenant) ID and application ID:

1- Search for and select Azure Active Directory.


2- From App registrations in Azure AD, select your application.
3- On the app's overview page, copy the Directory (tenant) ID value and store it in your
application code.
4- Copy the Application (client) ID value and store it in your application code.

Set up authentication

There are two types of authentication available for service principals: password-based
authentication (application secret) and certificate-based authentication. We recommend using
a certificate, but you can also create an application secret.

Create a new application secret

1- Search for and select Azure Active Directory.


2- Select App registrations and select your application from the list.
3- Select Certificates & secrets.
4- Select Client secrets, and then select New client secret.
5- Provide a description of the secret, and a duration.
6- Select Add.

Once you've saved the client secret, the value of the client secret is displayed. Copy this value
because you won't be able to retrieve the key later. You'll provide the key value with the
application ID to sign in as the application. Store the key value where your application can
retrieve it.
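
As a sketch of what the programmatic sign-in can look like, the snippet below uses the azure-identity Python package with the directory (tenant) ID, application (client) ID, and client secret collected in the previous steps; the bracketed values are placeholders you would replace with your own.

# Minimal sketch: sign in as the service principal with the azure-identity package.
# pip install azure-identity
from azure.identity import ClientSecretCredential

credential = ClientSecretCredential(
    tenant_id="<directory-tenant-id>",
    client_id="<application-client-id>",
    client_secret="<client-secret-value>",
)

# Request a token for Azure Resource Manager to confirm the sign-in works.
token = credential.get_token("https://management.azure.com/.default")
print("Token acquired, expires on:", token.expires_on)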

Configure access policies on resources

You might need to configure extra permissions on resources that your application needs to
access. For example, you must also update a key vault's access policies to give your application
access to keys, secrets, or certificates.

To configure access policies:

1- Select your key vault and select Access policies.


2- Select Add access policy, then select the key, secret, and certificate permissions you
want to grant your application. Select the service principal you created previously.
3- Select Add to add the access policy.
4- Save.

Configure SQL Server
If you want to use a database as the source, you can configure a SQL server.

1- Sign in to the Azure portal.


2- Enter the standard properties for the SQL server.
3- Select Review + create, and then select Create.

Create an Azure Databricks service

In this section, you create an Azure Databricks service by using the Azure portal.

1- From the Azure portal menu, select Create a resource.

2- Then, select Analytics > Azure Databricks.

3- Under Azure Databricks Service, provide the following values to create a Databricks
service:

4- Select Pin to dashboard and then select Create.
5- The account creation takes a few minutes. To monitor the operation status, view
the progress bar at the top.

Create a Spark cluster in Azure Databricks

1- In the Azure portal, go to the Databricks service that you created, and select Launch
Workspace.
2- You're redirected to the Azure Databricks portal. From the portal, select Cluster.

3- In the New cluster page, provide the values to create a cluster.

4- Fill in values for the following fields, and accept the default values for the other fields:
5- Enter a name for the cluster.

6- Make sure you select the Terminate after __ minutes of inactivity check box. If the
cluster isn't being used, provide a duration (in minutes) to terminate the cluster.

7- Select Create cluster. After the cluster is running, you can attach notebooks to the
cluster and run Spark jobs.

Azure Databricks development


1- In the Azure portal, go to the Azure Databricks service that you created, and select
Launch Workspace.
2- On the left, select Workspace. From the Workspace drop-down, select Create >
Notebook.

3- In the Create Notebook dialog box, enter a name for the notebook. Select Python as the
language, and then select the Spark cluster that you created earlier.

4- Select Create.
5- The following code block sets default service principal credentials for any ADLS Gen 2
account accessed in the Spark session. The second code block appends the account
name to the setting to specify credentials for a specific ADLS Gen 2 account. Copy and
paste either code block into the first cell of your Azure Databricks notebook.

Session configuration
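
A minimal sketch of the session configuration cell, assuming the service principal and secret scope created earlier (all bracketed values are placeholders):

# Session configuration: default service principal credentials for any
# ADLS Gen2 account accessed in this Spark session.
# Replace the bracketed placeholders with your application (client) ID,
# secret scope/key, and directory (tenant) ID.
spark.conf.set("fs.azure.account.auth.type", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id", "<application-client-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret",
               dbutils.secrets.get(scope="<secret-scope>", key="<service-credential-key>"))
spark.conf.set("fs.azure.account.oauth2.client.endpoint",
               "https://login.microsoftonline.com/<directory-tenant-id>/oauth2/token")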

Account configuration
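
A minimal sketch of the account configuration cell, which appends the storage account name to each setting key (again, all bracketed values are placeholders):

# Account configuration: the same settings scoped to a single ADLS Gen2 account
# by appending "<storage-account>.dfs.core.windows.net" to each key.
storage_account = "<storage-account-name>"

spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net",
               "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net",
               "<application-client-id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net",
               dbutils.secrets.get(scope="<secret-scope>", key="<service-credential-key>"))
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               "https://login.microsoftonline.com/<directory-tenant-id>/oauth2/token")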

Configure SQL Server as a source
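
A minimal sketch of reading a SQL Server table through the Spark JDBC source; the server, database, table, and secret names are illustrative placeholders:

# Read a table from the SQL Server (Azure SQL Database) source over JDBC.
# Server, database, table, and secret names are placeholders.
jdbc_url = ("jdbc:sqlserver://<server-name>.database.windows.net:1433;"
            "database=<database-name>;encrypt=true;")

source_df = (spark.read
             .format("jdbc")
             .option("url", jdbc_url)
             .option("dbtable", "dbo.customers")
             .option("user", "<sql-admin-user>")
             .option("password", dbutils.secrets.get(scope="<secret-scope>",
                                                     key="<sql-password-key>"))
             .load())

display(source_df)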

Developing the ODS script
First, we will develop a staging script to extract data from the SQL Server source,
with Delta Lake as the target for the ODS.

1- Create a notebook with Python as the language.

2- Define the source as SQL Server.

3- Define the target as Delta Lake.


Before we define the source, we write a SQL CREATE statement to create the ODS table on Delta
Lake.
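
A hypothetical version of that CREATE statement, run from a notebook cell; the database, table, columns, and storage path are illustrative assumptions:

# Create the ODS table as a Delta table on ADLS Gen2.
# Database, table, schema, and path are illustrative assumptions.
spark.sql("CREATE DATABASE IF NOT EXISTS ods")
spark.sql("""
    CREATE TABLE IF NOT EXISTS ods.customers (
        customer_id   INT,
        customer_name STRING,
        email         STRING,
        load_date     TIMESTAMP
    )
    USING DELTA
    LOCATION 'abfss://ods@<storage-account-name>.dfs.core.windows.net/customers'
""")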

Then define the ODS target table.
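
Assuming the JDBC source DataFrame from the earlier step, the staging load into that ODS Delta table could be sketched as follows (the path matches the CREATE statement above):

# Stage the SQL Server data into the ODS Delta table.
# source_df is the DataFrame read over JDBC earlier; columns are assumed to
# match the ODS table, with a load timestamp added.
from pyspark.sql import functions as F

(source_df
    .withColumn("load_date", F.current_timestamp())
    .write
    .format("delta")
    .mode("append")
    .save("abfss://ods@<storage-account-name>.dfs.core.windows.net/customers"))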

Load data to Synapse


In this section, we will load data from Delta Lake to Synapse.

1- Define the source as Delta Lake.


2- Define the target as Synapse.
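
A minimal sketch of this load using the Azure Synapse connector (com.databricks.spark.sqldw); the Synapse SQL endpoint, staging container, credentials, and table name are placeholders:

# Read the ODS Delta table and load it into a dedicated SQL pool in Azure Synapse.
# The connector stages data in ADLS via tempDir before loading it into the pool.
# This sketch assumes storage credentials can be forwarded from the Spark session;
# with service-principal OAuth, the connector's service principal options would be
# used instead.
ods_df = spark.read.format("delta").load(
    "abfss://ods@<storage-account-name>.dfs.core.windows.net/customers")

(ods_df.write
    .format("com.databricks.spark.sqldw")
    .option("url", "jdbc:sqlserver://<synapse-workspace>.sql.azuresynapse.net:1433;"
                   "database=<sql-pool-name>;user=<sql-user>;password=<password>;encrypt=true;")
    .option("tempDir", "abfss://staging@<storage-account-name>.dfs.core.windows.net/tempdir")
    .option("forwardSparkAzureStorageCredentials", "true")
    .option("dbTable", "dbo.customers")
    .mode("overwrite")
    .save())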
