Lab 3 - Enabling Team Based Data Science With Azure Databricks
Pre-requisites: It is assumed that the case study for this lab has already been read, and that the content and lab for Module 1: Azure for the Data Engineer have also been completed.
Lab files: The files for this lab are located in the Allfiles\Labfiles\Starter\DP-200.3 folder.
Lab overview
By the end of this lab, the student will be able to explain why Azure Databricks can be used to help in data science projects. The student will provision an Azure Databricks instance and will then create a workspace that is used to perform simple data preparation tasks from a Data Lake Storage Gen2 store. Finally, the student will perform a walk-through of performing transformations using Azure Databricks.
Lab objectives
After completing this lab, you will be able to:
o Explain how Azure Databricks can be used in data science projects
o Provision an Azure Databricks instance and create a Databricks cluster
o Perform simple data preparation tasks from a Data Lake Storage Gen2 store
o Perform basic transformations using Azure Databricks notebooks
Scenario
In response to a request from the Information Services (IS) department, you will start the process of building a predictive analytics platform by listing the benefits of using the technology. The department will be joined by data scientists, and it wants to ensure that a predictive analytics environment is available to the new team members.
You will stand up and provision an Azure Databricks environment, and then test that this environment works by performing a simple data preparation routine on the service, ingesting data from a pre-existing Data Lake Storage Gen2 account. As a data engineer, you have been told that you may be required to help the data scientists perform data preparation exercises. To that end, it has been recommended that you walk through a notebook that shows how to perform basic transformations.
IMPORTANT: As you go through this lab, make a note of any issue(s) that you encounter in any provisioning or configuration tasks and log them in the table in the document located at \Labfiles\DP-200-Issues-Doc.docx. Document the lab number, note the technology, describe the issue, and record the resolution. Save this document, as you will refer back to it in a later module.
Individual exercise
1. From the content you have learned in this course so far, identify the digital
transformation requirement that Azure Databricks will meet and a candidate data source
for Azure Databricks.
Result: After you have completed this exercise, you will have created a Microsoft Word document that identifies the digital transformation requirement that Azure Databricks will meet and a candidate data source.
Individual exercise
3. In the New screen, click in the Search the Marketplace text box, and type the
word databricks. Click Azure Databricks in the list that appears.
o Subscription: the name of the subscription you are using in this lab
o Location: the name of the Azure region that is closest to the lab location and where you can provision Azure VMs.
Note: The provisioning will take approximately 3 minutes. The Databricks Runtime is built on top of Apache Spark and is natively built for the Azure cloud. Azure Databricks completely abstracts away the infrastructure complexity and the need for specialized expertise to set up and configure your data infrastructure. For data engineers who care about the performance of production jobs, Azure Databricks provides a Spark engine that is faster and more performant through various optimizations at the I/O layer and processing layer (Databricks I/O).
Note: You will be signed into the Azure Databricks Workspace in a separate tab in
Microsoft Edge.
3. In the Create Cluster screen, under New Cluster, create a Databricks Cluster with the
following settings, and then click on Create Cluster:
o Cluster Mode: Standard
o Pool: None
o Make sure that you select the Terminate after 60 minutes of inactivity check box. This setting terminates the cluster automatically after the specified duration (in minutes) of inactivity.
Note: The creation of the Azure Databricks cluster will take approximately 10 minutes; the creation of a Spark cluster is simplified through the graphical user interface. You will note that the State shows Pending while the cluster is being created. This will change to Running when the cluster has been created.
Note: While the cluster is being created, go back and perform Exercise 1.
Individual exercise
Task 2: Collect the Azure Data Lake Store Gen2 account name
1. In Microsoft Edge, click on the Azure portal tab, click Resource groups, and then
click awrgstudxx, and then click on awdlsstudxx, where xx are your initials.
2. In the awdlsstudxx screen, under Settings, click Access keys, click the copy icon next to the Storage account name, and then paste the name into Notepad.
Task 3: Enable your Databricks instance to access the Data Lake Storage Gen2 store
1. In the Azure portal, click the Home hyperlink, and then click the Azure Active Directory icon.
8. In the Add a client secret screen, type a description of DL Access Key and a duration of In 1 year for the key. When done, click Add.
Important: When you click Add, the key will appear as shown in the graphic below. You only have one opportunity to copy this key value into Notepad.
10. Assign the Storage Blob Data Contributor permission to your resource group. In the Azure portal, click the Home hyperlink, click the Resource groups icon, and then click the resource group awrgstudxx, where xx are your initials.
15. In the Add role assignment blade, under Select, select DLAccess, and then click Save.
16. In the Azure portal, click the Home hyperlink, and then click the Azure Active Directory icon. Note your role. If you have the User role, you must make sure that non-administrators can register applications.
17. Click Users, and then click User settings in the Users - All users blade. Check the App registrations setting. This value can only be set by an administrator. If set to Yes, any user in the Azure AD tenant can register an app.
20. Click the copy icon next to the Directory ID to get your tenant ID, and paste it into Notepad.
2. In the Azure Databricks blade on the left of Microsoft Edge, under Workspace, click the drop-down arrow next to Workspace, point to Create, and then click Notebook.
5. Ensure that Cluster states the name of the cluster that you created earlier, and then click Create.
Note: This will open a Notebook with the title My Notebook (Scala).
6. In the Notebook, in the cell Cmd 1, copy the following code and paste it into the cell:
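The Cmd 1 listing is not reproduced in this extract. What follows is a minimal sketch, assuming the standard Spark configuration for OAuth access to Data Lake Storage Gen2 with a service principal; replace <storage-account-name>, <application-id>, <authentication-key>, and <tenant-id> with the values that you copied into Notepad in the earlier tasks:
//Configure the session to authenticate to the storage account
//using the DLAccess service principal created in Task 3
spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net", "<authentication-key>")
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net", "https://login.microsoftonline.com/<tenant-id>/oauth2/token")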
14. In the Notebook, in the cell under Cmd 1, click the Run icon, and then click Run Cell, as highlighted in the following graphic.
Note: A message will be returned at the bottom of the cell that states "Command took 0.0X seconds -- by person at 4/4/2019, 2:46:48 PM on awdbclstudxx".
Task 5: Read data in Azure Databricks
1. In the Notebook, hover your mouse at the top right of cell Cmd 1, and click the Add Cell Below icon. A new cell will appear named Cmd 2.
2. In the Notebook, in the cell Cmd 2, copy the following code and paste it into the cell:
//Read JSON data in Azure Data Lake Storage Gen2 file system
val df = spark.read.json("abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/preferences.json")
6. In the Notebook, in the cell under Cmd 2, click the Run icon, and then click Run Cell.
Note: A message will be returned at the bottom of the cell that states that a Spark job has executed and "Command took 0.0X seconds -- by person at 4/4/2019, 2:46:48 PM on awdbclstudxx".
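As an optional check that is not part of the original lab steps, you could add another cell and print the schema that Spark inferred from the JSON file, to confirm that the DataFrame loaded as expected:
//Optional: inspect the schema inferred from preferences.json
df.printSchema()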
7. In the Notebook, hover your mouse at the top right of cell Cmd 2, and click the Add Cell Below icon. A new cell will appear named Cmd 3.
8. In the Notebook, in the cell Cmd 3, copy the following code and paste it into the cell:
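The Cmd 3 listing is not reproduced in this extract. Based on the note that follows, which describes a table of results, the cell was most likely a simple display of the DataFrame, along these lines:
//Display the contents of the DataFrame as a table
df.show()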
Note: A message will be returned at the bottom of the cell that states that a Spark job has executed, a table of results is returned, and "Command took 0.0X seconds -- by person at 4/4/2019, 2:46:48 PM on awdbclstudxx".
Result: In this exercise, you performed the steps necessary to set up the permissions for Azure Databricks to access data in an Azure Data Lake Storage Gen2 store. You then used Scala to connect to the Data Lake Storage store, and you read data and created a table output showing the preferences of people.
The main tasks for this exercise are as follows:
1. Retrieve specific columns from a Dataset
2. Perform a column rename on a Dataset
3. Add an Annotation
Task 1: Retrieving specific columns from a Dataset
2. In the Notebook, in the cell Cmd 4, copy the following code and paste it into the cell:
//Retrieve specific columns from a JSON dataset in Azure Data Lake Storage Gen2 file system
val specificColumnsDf = df.select("firstname", "lastname", "gender", "location", "page")
specificColumnsDf.show()
6. In the Notebook, in the cell under Cmd 4, click the Run icon, and then click Run Cell.
Note: A message will be returned at the bottom of the cell that states that a Spark job has executed, a table of results is returned, and "Command took 0.0X seconds -- by person at 4/4/2019, 2:46:48 PM on awdbclstudxx".
Task 2: Performing a column rename on a Dataset
1. In the Notebook, hover your mouse at the top right of cell Cmd 4, and click the Add Cell Below icon. A new cell will appear named Cmd 5.
2. In the Notebook, in the cell Cmd 5, copy the following code and paste it into the cell:
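The Cmd 5 listing is not reproduced in this extract. Based on the annotation in Task 3, which states that the column "page" has been renamed to "bike_preference", the cell most likely performed a rename along these lines; the name renamedColumnsDf is an assumption:
//Rename the column "page" to "bike_preference"
val renamedColumnsDf = specificColumnsDf.withColumnRenamed("page", "bike_preference")
renamedColumnsDf.show()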
6. In the Notebook, in the cell under Cmd 5, click the Run icon, and then click Run Cell.
Note: A message will be returned at the bottom of the cell that states that a Spark job has executed, a table of results is returned, and "Command took 0.0X seconds -- by person at 4/4/2019, 2:46:48 PM on awdbclstudxx".
Task 3: Adding Annotations
1. In the Notebook, hover your mouse at the top right of cell Cmd 5, and click the Add Cell Below icon. A new cell will appear named Cmd 6.
2. In the Notebook, in the cell Cmd 6, copy the following text and paste it into the cell. The %md magic command at the start of the cell renders its contents as markdown, which turns the cell into an annotation:
%md
This code connects to the Data Lake Storage filesystem named "Data" and reads data in the preferences.json file stored in that data lake. Then a simple query has been created to retrieve data and the column "page" has been renamed to "bike_preference".
4. In the Notebook, in the cell under Cmd 6, click the down-pointing arrow icon, and then click Move up. Repeat this until the cell appears at the top of the Notebook.
Note: A future lab will explore how this data can be exported to another data platform technology.
Result: After you have completed this exercise, you will have created an annotation within a notebook.
If the URLs are inaccessible, there is a copy of the notebooks in the Allfiles\Labfiles\Starter\DP-200.3\Post Course Review folder.
Basic transformations
1. Within the Workspace, using the command bar on the left, select Workspace, then Users, and select your username (the entry with the house icon).
2. In the blade that appears, select the downwards pointing chevron next to your name,
and select Import.
3. On the Import Notebooks dialog, select URL, and paste in the following URL:
https://github.com/MicrosoftDocs/mslearn-perform-basic-data-transformation-in-azure-databricks/blob/master/DBC/05.1-Basic-ETL.dbc?raw=true
4. Select Import.
5. The folder will contain one or more notebooks that you can use to learn basic transformations using Scala or Python.
Follow the instructions within the notebook until you've completed the entire notebook. Then continue with the remaining notebooks in order.
[Note] You'll find corresponding notebooks within the Solutions subfolder. These contain
completed cells for exercises that ask you to complete one or more challenges. Refer to these if
you get stuck or simply want to see the solution.
Advanced transformations
1. Within the Workspace, using the command bar on the left, select Workspace, then Users, and select your username (the entry with the house icon).
2. In the blade that appears, select the downwards pointing chevron next to your name,
and select Import.
3. On the Import Notebooks dialog, select URL, and paste in the following URL:
https://github.com/MicrosoftDocs/mslearn-perform-advanced-data-transformation-in-azure-databricks/blob/master/DBC/05.2-Advanced-ETL.dbc?raw=true
4. Select Import.
5. The folder will contain one or more notebooks that you can use to learn advanced transformations using Scala or Python.
Follow the instructions within the notebook until you've completed the entire notebook. Then continue with the remaining notebooks in order.
[Note] You'll find corresponding notebooks within the Solutions subfolder. These contain
completed cells for exercises that ask you to complete one or more challenges. Refer to these if
you get stuck or simply want to see the solution.