Lab 3 - Enabling Team Based Data Science With Azure Databricks
Pre-requisites: It is assumed that the case study for this lab has already been read, and that the content and lab for Module 1: Azure for the Data Engineer have also been completed.
Lab files: The files for this lab are located in the Allfiles\Labfiles\Starter\DP-200.3 folder.
Lab overview
By the end of this lab, the student will be able to explain why Azure Databricks can be used to help in data science projects. The student will provision an Azure Databricks instance and will then create a workspace that is used to perform simple data preparation tasks from a Data Lake Storage Gen2 store. Finally, the student will perform a walk-through of performing transformations using Azure Databricks.
Lab objectives
After completing this lab, you will be able to:
o Explain how Azure Databricks can be used in data science projects
o Provision an Azure Databricks instance and create a Databricks cluster
o Perform simple data preparation tasks from a Data Lake Storage Gen2 store
o Perform basic transformations using Azure Databricks notebooks
Scenario
In response to a request from the Information Services (IS) department, you will start the process of building a predictive analytics platform by listing the benefits of using the technology. The department will be joined by data scientists, and it wants to ensure that a predictive analytics environment is available to the new team members.
You will stand up and provision an Azure Databricks environment, and then test that this environment works by performing a simple data preparation routine on the service, ingesting data from a pre-existing Data Lake Storage Gen2 account. As a data engineer, you have been told that you may be required to help the data scientists perform data preparation exercises. To that end, it has been recommended that you walk through a notebook that shows how to perform basic transformations.
IMPORTANT: As you go through this lab, make a note of any issue(s) that you encounter in any provisioning or configuration tasks and log them in the table in the document located at \Labfiles\DP-200-Issues-Doc.docx. Document the lab number, note the technology, describe the issue, and record the resolution. Save this document, as you will refer back to it in a later module.
Individual exercise
1. From the content you have learned in this course so far, identify the digital
transformation requirement that Azure Databricks will meet and a candidate data source
for Azure Databricks.
Result: After you have completed this exercise, you will have created a Microsoft Word document that identifies the digital transformation requirement that Azure Databricks will meet and a candidate data source.
Individual exercise
3. In the New screen, click in the Search the Marketplace text box, and type the
word databricks. Click Azure Databricks in the list that appears.
o Subscription: the name of the subscription you are using in this lab
o Location: the name of the Azure region that is closest to the lab location and where you can provision Azure VMs.
Note: The provisioning will take approximately 3 minutes. The Databricks Runtime is built on top of Apache Spark and is natively built for the Azure cloud. Azure Databricks completely abstracts away the infrastructure complexity and the need for specialized expertise to set up and configure your data infrastructure. For data engineers who care about the performance of production jobs, Azure Databricks provides a Spark engine that is faster and more performant through various optimizations at the I/O layer and processing layer (Databricks I/O).
Note: You will be signed into the Azure Databricks Workspace in a separate tab in
Microsoft Edge.
3. In the Create Cluster screen, under New Cluster, create a Databricks Cluster with the
following settings, and then click on Create Cluster:
o Cluster Mode: Standard
o Pool: None
o Make sure that you select the Terminate after 60 minutes of inactivity check box. This setting terminates the cluster automatically after the specified duration (in minutes) of inactivity.
Note: The creation of the Azure Databricks cluster will take approximately 10 minutes; the creation of a Spark cluster is simplified through the graphical user interface. You will note that the State shows Pending while the cluster is being created. This will change to Running when the cluster has been created.
Note: While the cluster is being created, go back and perform Exercise 1.
Individual exercise
Task 2: Collect the Azure Data Lake Store Gen2 account name
1. In Microsoft Edge, click on the Azure portal tab, click Resource groups, and then
click awrgstudxx, and then click on awdlsstudxx, where xx are your initials.
2. In the awdlsstudxx screen, under Settings, click Access keys, click the copy icon next to the Storage account name, and then paste the name into Notepad.
Task 3: Enable your Databricks instance to access the Data Lake Storage Gen2 store
1. In the Azure portal, click the Home hyperlink, and then click the Azure Active Directory icon.
8. In the Add a client secret screen, type a description of DL Access Key and a duration of In 1 year for the key. When done, click Add.
Important: When you click Add, the key will appear as shown in the graphic below. You only have one opportunity to copy this key value into Notepad.
10. Assign the Storage Blob Data Contributor permission to your resource group. In the Azure portal, click the Home hyperlink, click the Resource groups icon, and then click the resource group awrgstudxx, where xx are your initials.
15. In the Add role assignment blade, under Select, select DLAccess, and then click Save.
16. In the Azure portal, click the Home hyperlink, and then click the Azure Active Directory icon. Note your role. If you have the User role, you must make sure that non-administrators can register applications.
17. Click Users, and then click User settings in the Users - All users blade. Check the App registrations setting. This value can only be set by an administrator. If set to Yes, any user in the Azure AD tenant can register an app.
20. Click the copy icon next to the Directory ID to get your tenant ID, and paste it into Notepad.
2. In the Azure Databricks blade on the left of Microsoft Edge, under Workspace, click the drop-down arrow next to Workspace, point to Create, and then click Notebook.
5. Ensure that Cluster states the name of the cluster that you created earlier, and then click Create.
Note: This will open a Notebook with the title My Notebook (Scala).
6. In the Notebook, in the cell Cmd 1, copy the following code and paste it into the cell:
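The Cmd 1 listing is not reproduced in this extract. What follows is a minimal sketch, assuming the standard Spark configuration for OAuth access to Data Lake Storage Gen2 with a service principal; replace <storage-account-name>, <application-id>, <authentication-key>, and <tenant-id> with the values that you copied into Notepad in the earlier tasks:
//Configure the session to authenticate to the storage account
//using the DLAccess service principal created in Task 3
spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net", "<authentication-key>")
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net", "https://login.microsoftonline.com/<tenant-id>/oauth2/token")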
14. In the Notebook, in the cell under Cmd 1, click the Run icon, and then click Run Cell, as highlighted in the following graphic.
Note: A message will be returned at the bottom of the cell that states "Command took 0.0X seconds -- by person at 4/4/2019, 2:46:48 PM on awdbclstudxx".
Task 5: Read data in Azure Databricks
1. In the Notebook, hover your mouse at the top right of cell Cmd 1, and click the Add Cell Below icon. A new cell will appear named Cmd 2.
2. In the Notebook, in the cell Cmd 2, copy the following code and paste it into the cell:
//Read JSON data in Azure Data Lake Storage Gen2 file system
val df = spark.read.json("abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/preferences.json")
6. In the Notebook, in the cell under Cmd 2, click the Run icon, and then click Run Cell.
Note: A message will be returned at the bottom of the cell that states that a Spark job has executed and "Command took 0.0X seconds -- by person at 4/4/2019, 2:46:48 PM on awdbclstudxx".
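As an optional check that is not part of the original lab steps, you could add another cell and print the schema that Spark inferred from the JSON file, to confirm that the DataFrame loaded as expected:
//Optional: inspect the schema inferred from preferences.json
df.printSchema()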
7. In the Notebook, hover your mouse at the top right of cell Cmd 2, and click the Add Cell Below icon. A new cell will appear named Cmd 3.
8. In the Notebook, in the cell Cmd 3, copy the following code and paste it into the cell:
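The Cmd 3 listing is not reproduced in this extract. Based on the note that follows, which describes a table of results, the cell was most likely a simple display of the DataFrame, along these lines:
//Display the contents of the DataFrame as a table
df.show()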
Note: A message will be returned at the bottom of the cell that states that a Spark job has executed, a table of results is returned, and "Command took 0.0X seconds -- by person at 4/4/2019, 2:46:48 PM on awdbclstudxx".
Result: In this exercise, you performed the steps necessary to set up the permissions for Azure Databricks to access data in an Azure Data Lake Storage Gen2 store. You then used Scala to connect to the Data Lake Storage store, and you read data and created a table output showing the preferences of people.
The main tasks for this exercise are as follows:
1. Retrieve specific columns from a Dataset
2. Perform a column rename on a Dataset
3. Add an Annotation
Task 1: Retrieving specific columns from a Dataset
2. In the Notebook, in the cell Cmd 4, copy the following code and paste it into the cell:
//Retrieve specific columns from a JSON dataset in Azure Data Lake Storage Gen2 file system
val specificColumnsDf = df.select("firstname", "lastname", "gender", "location", "page")
specificColumnsDf.show()
6. In the Notebook, in the cell under Cmd 4, click the Run icon, and then click Run Cell.
Note: A message will be returned at the bottom of the cell that states that a Spark job has executed, a table of results is returned, and "Command took 0.0X seconds -- by person at 4/4/2019, 2:46:48 PM on awdbclstudxx".
Task 2: Performing a column rename on a Dataset
1. In the Notebook, hover your mouse at the top right of cell Cmd 4, and click the Add Cell Below icon. A new cell will appear named Cmd 5.
2. In the Notebook, in the cell Cmd 5, copy the following code and paste it into the cell:
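The Cmd 5 listing is not reproduced in this extract. Based on the annotation in Task 3, which states that the column "page" has been renamed to "bike_preference", the cell most likely performed a rename along these lines; the name renamedColumnsDf is an assumption:
//Rename the column "page" to "bike_preference"
val renamedColumnsDf = specificColumnsDf.withColumnRenamed("page", "bike_preference")
renamedColumnsDf.show()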
6. In the Notebook, in the cell under Cmd 5, click the Run icon, and then click Run Cell.
Note: A message will be returned at the bottom of the cell that states that a Spark job has executed, a table of results is returned, and "Command took 0.0X seconds -- by person at 4/4/2019, 2:46:48 PM on awdbclstudxx".
Task 3: Adding Annotations
1. In the Notebook, hover your mouse at the top right of cell Cmd 5, and click the Add Cell Below icon. A new cell will appear named Cmd 6.
2. In the Notebook, in the cell Cmd 6, copy the following text and paste it into the cell. The %md magic command at the start of the cell renders its contents as markdown, which turns the cell into an annotation:
%md
This code connects to the Data Lake Storage filesystem named "Data" and reads data in the preferences.json file stored in that data lake. Then a simple query has been created to retrieve data and the column "page" has been renamed to "bike_preference".
4. In the Notebook, in the cell under Cmd 6, click the down-pointing arrow icon, and then click Move up. Repeat this until the cell appears at the top of the Notebook.
Note: A future lab will explore how this data can be exported to another data platform technology.
Result: After you have completed this exercise, you will have created an annotation within a notebook.
If the URLs are inaccessible, there is a copy of the notebooks in the Allfiles\Labfiles\Starter\DP-200.3\Post Course Review folder.
Basic transformations
1. Within the Workspace, using the command bar on the left, select Workspace, then Users, and select your username (the entry with the house icon).
2. In the blade that appears, select the downwards pointing chevron next to your name,
and select Import.
3. On the Import Notebooks dialog, select URL, and paste in the following URL:
https://github.com/MicrosoftDocs/mslearn-perform-basic-data-transformation-in-azure-databricks/blob/master/DBC/05.1-Basic-ETL.dbc?raw=true
4. Select Import.
5. The folder will contain one or more notebooks that you can use to learn basic transformations using Scala or Python.
Follow the instructions within the notebook until you've completed the entire notebook. Then continue with the remaining notebooks in order.
[Note] You'll find corresponding notebooks within the Solutions subfolder. These contain
completed cells for exercises that ask you to complete one or more challenges. Refer to these if
you get stuck or simply want to see the solution.
Advanced transformations
1. Within the Workspace, using the command bar on the left, select Workspace, then Users, and select your username (the entry with the house icon).
2. In the blade that appears, select the downwards pointing chevron next to your name,
and select Import.
3. On the Import Notebooks dialog, select URL, and paste in the following URL:
https://github.com/MicrosoftDocs/mslearn-perform-advanced-data-transformation-in-azure-databricks/blob/master/DBC/05.2-Advanced-ETL.dbc?raw=true
4. Select Import.
5. The folder will contain one or more notebooks that you can use to learn advanced transformations using Scala or Python.
Follow the instructions within the notebook until you've completed the entire notebook. Then continue with the remaining notebooks in order.
[Note] You'll find corresponding notebooks within the Solutions subfolder. These contain
completed cells for exercises that ask you to complete one or more challenges. Refer to these if
you get stuck or simply want to see the solution.