Big Data and Visualization
Ideal Audience
CIOs
VPs and Directors of Business Intelligence
IT Managers
Data Architects and DBAs
Data Analysts and Data Scientists
Overview
This hands-on lab is designed to provide exposure to many of Microsoft’s transformative line-of-business applications built using Microsoft big data and advanced analytics technologies. The goal is to show an end-to-end solution that leverages many of these technologies, but not necessarily to do work in every possible component. The lab architecture is described in the Solution Architecture section below.
AdventureWorks Travel (AWT) provides concierge services for business travelers. In an increasingly
crowded market, they are always looking for ways to differentiate themselves, and provide added
value to their corporate customers. They are looking to pilot a web app that their internal customer
service agents can use to provide additional information useful to the traveler during the flight
booking process. They want to enable their agents to enter in the flight information and produce a
prediction as to whether the departing flight will encounter a 15-minute or longer delay, considering
the weather forecasted for the departure hour. In this hands-on lab, attendees will build an end-to-
end solution to predict flight delays, accounting for the weather forecast.
Solution Architecture
Below is a diagram of the solution architecture you will build in this lab. Please study this carefully so
you understand the whole of the solution as you are working on the various components.
The solution begins with loading their historical data into blob storage using Azure Data Factory (ADF). By setting up a pipeline containing a copy activity configured to copy time-partitioned source data, they could pull all of their historical information, as well as ingest any future data, into Azure blob storage through a scheduled, continuously running pipeline. Because their historical data is stored on-premises, AWT would need to install and configure an Azure Data Factory Integration Runtime (formerly known as a Data Management Gateway). Azure Machine Learning (Azure ML)
would be used to develop a two-class classification machine learning model, which would then be
operationalized as a Predictive Web Service using ML Studio. After operationalizing the ML model, a
second ADF pipeline, using a Linked Service pointing to Azure ML’s Batch Execution API and an
AzureMLBatchExecution activity, would be used to apply the operational model to data as it is moved
to the proper location in Azure storage. The scored data in Azure storage can be explored and
prepared using Spark SQL on HDInsight, and the results visualized using a map visualization in Power
BI.
Setup Requirements
A corporate email address (e.g., your @microsoft.com email)
A Microsoft Azure subscription (must be pay-as-you-go or MSDN)
Additional Requirements
You will need a subscription to Microsoft Azure. Please see the next page for how to create a trial
subscription.
Azure Registration
We need an active Azure subscription in order to perform this workshop. There are a few ways to
accomplish this. If you already have an active Azure subscription, you can skip the remainder of this
page. Otherwise, you'll either need to use an Azure Pass or create a trial account. The instructions for
both are below.
Azure Pass
If you've been provided with a voucher, formally known as an Azure Pass, then you can use that to
create a subscription. In order to use the Azure Pass, direct your browser to
https://www.microsoftazurepass.com and, following the prompts, use the code provided to create
your subscription.
Trial Subscription
Direct your browser to https://azure.microsoft.com/en-us/free/ and begin by clicking on the green
button that reads Start free.
1. In the first section, complete the form in its entirety. Make sure you use your real email
address for the important notifications.
2. In the second section, enter a real mobile phone number to receive a text verification
number. Click send message and re-type the received code.
3. Enter a valid credit card number. NOTE: You will not be charged. The card is used only to verify your identity in order to comply with federal regulations. You may see a temporary $1.00 hold from Microsoft on your account statement, but, again, this is for verification only and will "fall off" your account within 2-3 banking days.
This may take a minute or two, but you should see a welcome screen informing you that your
subscription is ready. The Azure subscription is good for up to $200 of resources for 30 days. After 30
days, your subscription (and resources) will be suspended unless you convert your trial subscription
to a paid one. And, should you choose to do so, you can elect to use a different credit card than the
one you just entered.
Synopsis: In this exercise, you will set up your environment for use in the rest of the hands-on lab.
You should follow all the steps provided in this section to prepare your environment
before attending the hands-on lab.
Deploy to Azure
2. In the Custom deployment blade that appears, enter the following values:
Resource group: Use an existing Resource group, or create a new one by entering a
unique name, such as “bigdatalab-[your initials or first name]”.
Location: Select a location for the Resource group. We recommend using East US, East
US 2, West Central US, or West US 2, as some resources, such as Data Factory, are
only available in those regions.
App name: Enter a unique name, such as your initials or first name. This value must
be between 3 and 10 characters long, and should not contain any special characters.
Note the name, as you will need to use it in your Lab VM deployment in Task 3 as
well.
Cluster Login User Name: Enter a name, or accept the default. Note all references to
this in the lab use the default user name, demouser, so if you change it, please note
it for future reference throughout the lab.
Cluster Login Password: Enter a password, or accept the default. Note all references
to this in the lab use the default password, Password.1!!, so if you change it, please
note it for future reference throughout the lab.
Select Purchase.
1. Navigate to http://www.wunderground.com/weather/api/.
4. Scroll down until you see the area titled How much will you use our service? Ensure
Developer is selected.
6. Complete the Create an Account form by providing your email address and a password, and
agreeing to the terms.
10. Complete the brief contact form. When answering where will the API be used, select
Website. For Will the API be used for commercial use, select No. Select Purchase Key.
11. You should be taken to a page that displays your key, similar to the following:
12. Take note of your API Key. It is available from the text box labeled Key ID.
13. To verify that your API Key is working, modify the following URL to include your API Key, inserting it between /api/ and /hourly10day:
http://api.wunderground.com/api//hourly10day/q/SEATAC.json.
14. Open your modified link in a browser. You should get a JSON result showing the 10-day, hourly weather forecast for the Seattle-Tacoma International Airport.
Deploy to Azure
2. In the Custom deployment blade that appears, enter the following values:
Resource group: Choose Use Existing, and select the same resource group you used
when deploying your HDInsight cluster and Azure ML workspace, above.
App name: IMPORTANT: You must enter the same App name you used in the
deployment above in Task 1.
VM User Name: Enter a name, or accept the default. Note all references to this in the
lab use the default user name, demouser, so if you change it, please note it for
future reference throughout the lab.
VM Password: Enter a password, or accept the default. Note all references to this in
the lab use the default password, Password.1!!, so if you change it, please note it for
future reference throughout the lab.
Select Purchase.
3. The deployment will take about 10 minutes to complete.
2. From the left side menu in the Azure portal, click on Resource groups, then enter your
resource group name into the filter box, and select it from the list.
3. Next, select your lab virtual machine from the list.
6. Select Connect, and enter the following credentials (or the non-default credentials if you
changed them):
7. In a web browser on the Lab VM navigate to the Power BI Desktop download page
(https://powerbi.microsoft.com/en-us/desktop/).
5. When the install is complete, uncheck View Release Notes, and select Finish.
Build an ML Model
Synopsis: In this exercise, attendees will implement a classification experiment. They will load the training data from their local machine into a dataset. Then, they will explore the data to identify the primary components they should use for prediction, and use two different algorithms for predicting the classification. They will evaluate the performance of both algorithms and choose the one that performs best. The selected model will be exposed as a web service that is integrated with the sample web app.
2. On the Machine Learning Studio workspace blade, select Launch Machine Learning
Studio.
2. Download the three CSV sample datasets from here: http://bit.ly/2wGAqrl (If you get an error,
or the page won’t open, try pasting the URL into a new browser window and verify the case
sensitive URL is exactly as shown).
3. Extract the ZIP and verify you have the following files:
FlightDelaysWithAirportCodes.csv
FlightWeatherWithAirportCodes.csv
AirportCodeLocationLookupClean.csv
4. In the Machine Learning Studio browser window, select + NEW at the bottom left.
5. Select Dataset under New, and then select From Local File.
3. Next, you will explore the Flight delays datasets to understand what kind of cleanup (e.g.,
data munging) will be necessary.
7. Because all 20 columns are displayed, you can scroll the grid horizontally. Scroll until you see
the DepDel15 column, and select it to view statistics about the column. The DepDel15
column displays a 1 when the flight was delayed at least 15 minutes and 0 if there was no
such delay. In the model you will construct, you will try to predict the value of this column for
future data. Notice in the Statistics panel that a value of 27444 appears for Missing Values.
This means that 27,444 rows do not have a value in this column. Since this value is very
important to our model, we will need to eliminate any rows that do not have a value for this
column.
8. Next, select the CRSDepTime column. Our model will approximate departure times to the
nearest hour, but departure time is captured as an integer. For example, 8:37 am is captured
as 837. Therefore, we will need to process the CRSDepTime column and round it down to the
nearest hour. This rounding requires two steps: first, divide the value by 100 (so that 837
becomes 8.37); second, round the result down to the nearest whole hour (so that 8.37 becomes 8).
10. Close the Visualize dialog, and go back to the design surface.
11. To perform our data munging, we have multiple options, but in this case, we’ve chosen to use
an Execute R Script module, which will perform the following tasks:
Remove rows that are missing a value for DepDel15
Round CRSDepTime down to the nearest hour, and store the result in a new column named CRSDepHour
Trim the dataset to only the columns used by the predictive model
12. To add the module, search for Execute R Script by entering “Execute R” into the Search
experiment items box.
13. Drag this module on to the design surface beneath your FlightDelaysWithAirportCodes
dataset. Select the small circle at the bottom of the FlightDelaysWithAirportCodes dataset,
drag and release when your mouse is over the circle found in the top left of the Execute R
Script module. These circles are referred to as ports, and by taking this action you have
connected the output port of the dataset with the input port of the Execute R Script module,
meaning data from the dataset will flow along this path.
14. In the Properties panel for Execute R Script module, select the Double Windows icon to
maximize the script editor.
15. Replace the script with the following (press CTRL+A to select all, then CTRL+V to paste):

# Import data from the input port
ds.flights <- maml.mapInputPort(1)

# Remove rows that do not have a value for DepDel15, since that is the label we want to predict
ds.flights <- ds.flights[!is.na(ds.flights$DepDel15), ]

# Round departure times down to the nearest hour, and export the result as a new column named "CRSDepHour"
ds.flights[, "CRSDepHour"] <- floor(ds.flights[, "CRSDepTime"] / 100)

# Trim the columns to only those we will use for the predictive model
ds.flights <- ds.flights[, c("OriginAirportCode", "OriginLatitude", "OriginLongitude", "Month", "DayofMonth", "CRSDepHour", "DayOfWeek", "Carrier", "DestAirportCode", "DestLatitude", "DestLongitude", "DepDel15")]

# Send the prepared data set to the output port
maml.mapOutputPort("ds.flights")
16. Select the check mark in the bottom right to save the script (Do not worry if the formatting
is off before hitting the check mark.)
17. Select Save on the command bar at the bottom to save your in-progress experiment.
18. Select Run in the command bar at the bottom to run the experiment.
19. When the experiment is finished running, you will see a finished message in the top right
corner of the design surface, and green check marks over all modules that ran.
20. You should run your experiment whenever you need to update the metadata describing what
data is flowing through the modules, so that newly added modules can be aware of the
shape of your data (most modules have dialogs that can suggest columns, but before they
can make suggestions you need to have run your experiment).
21. To verify the results of our R script, right-click the left output port (Result Dataset) of the
Execute R Script module and select Visualize.
22. In the dialog that appears, scroll over to DepDel15 and select the column. In the statistics
you should see that Missing Values reads 0.
23. Now, select the CRSDepHour column, and verify that our new column contains the rounded
hour values from our CRSDepTime column.
24. Finally, observe that we have reduced the number of columns from 20 to 12. Close the
dialog.
25. At this point the Flight Delay Data is prepared, and we turn to preparing the historical
weather data.
3. Observe that this data set has 406,516 rows and 29 columns. For this model, we are going to
focus on predicting delays using WindSpeed (in MPH), SeaLevelPressure (in inches of Hg),
and HourlyPrecip (in inches). We will focus on preparing the data for those features.
4. In the dialog, select the WindSpeed column, and review the statistics. Observe that the
Feature Type was inferred as String and that there are 32 Missing Values. Below that,
examine the histogram to see that, even though the type was inferred as string, the values
are all actually numbers (e.g. the x-axis values are 0, 6, 5, 7, 3, 8, 9, 10, 11, 13). We will
need to ensure that we remove any missing values and convert WindSpeed to its proper type
as a numeric feature.
5. Next, select the SeaLevelPressure column. Observe that the Feature Type was inferred as
String and there are 0 Missing Values. Scroll down to the histogram, and observe that many
of the features are of a numeric value (e.g., 29.96, 30.01, etc.), but there are many features
with the string value of M for Missing. We will need to replace this value of "M" with a suitable
numeric value so that we can convert this feature to be a numeric feature.
6. Finally, examine the HourlyPrecip feature. Observe that it too was inferred to have a
Feature Type of String and is missing values for 374,503 rows. Looking at the histogram,
observe that besides the numeric values, there is a value T (for Trace amount of rain). We
need to replace T with a suitable numeric value and convert this to a numeric feature.
7. To perform our data cleanup, we will use a Python script, in which we will perform the
following tasks:
WindSpeed: Replace missing values with 0.0, and “M” values with 0.005
HourlyPrecip: Replace missing values with 0.0, and “T” values with 0.005
SeaLevelPressure: Replace “M” (missing) values with a suitable numeric value, so the column can be converted to a numeric feature
Round the “Time” column down to the nearest hour, and add the value to a new column named
“Hour”
10. Paste the following script into the Python script window, and select the checkmark at the
bottom right of the dialog (press CTRL+A to select all, then CTRL+V to paste, and then
immediately select the checkmark; don't worry if the formatting is off before hitting the
checkmark).
# imports
import pandas as pd
import math

# Round an HHMM-style integer time (e.g., 837) down to the hour (e.g., 8)
def roundDown(x):
    z = int(math.floor(x / 100.0))
    return z

# Entry point for the Execute Python Script module
def azureml_main(dataframe1 = None, dataframe2 = None):
    # WindSpeed: replace missing values with 0.0 and "M" values with 0.005, then convert to numeric
    dataframe1['WindSpeed'] = dataframe1['WindSpeed'].replace('M', 0.005).fillna(0.0).astype('float64')
    # HourlyPrecip: replace missing values with 0.0 and "T" (trace) values with 0.005, then convert to numeric
    dataframe1['HourlyPrecip'] = dataframe1['HourlyPrecip'].replace('T', 0.005).fillna(0.0).astype('float64')
    # SeaLevelPressure: replace "M" (missing) values with a numeric stand-in (29.92 here is illustrative), then convert
    dataframe1['SeaLevelPressure'] = dataframe1['SeaLevelPressure'].replace('M', 29.92).astype('float64')
    # Round the Time column (assumed to be an integer in HHMM form) down to the nearest hour, in a new "Hour" column
    dataframe1['Hour'] = dataframe1['Time'].apply(roundDown)
    # Pare down the variables in the Weather dataset to just the columns being used by the model
    df_result = dataframe1[['AirportCode', 'Month', 'Day', 'Hour', 'WindSpeed', 'SeaLevelPressure', 'HourlyPrecip']]
    # Return value must be of a sequence of pandas.DataFrame
    return df_result
13. Right-click the first output port of the Execute Python Script module, and select Visualize.
14. In the statistics, verify that there are now only the 7 columns we are interested in, and that
WindSpeed, SeaLevelPressure, and HourlyPrecip are now all Numeric Feature types and that
they have no missing values.
2. Drag a Join Data module onto the design surface, beneath and centered between both
Execute R and Python Script modules. Connect the output port (1) of the Execute R Script
module to input port (1) of the Join Data module, and the output port (1) of the Execute
Python Script module to the input port (2) of the Join Data module.
3. In the Properties panel for the Join Data module, relate the rows of data between the two
sets L (the flight delays) and R (the weather).
4. Select Launch Column selector under Join key columns for L. Set the Join key columns
for L to include OriginAirportCode, Month, DayofMonth, and CRSDepHour, and select the
check box in the bottom right.
5. Select Launch Column selector under Join key columns for R. Set the join key columns
for R to include AirportCode, Month, Day, and Hour, and select the check box in the bottom
right.
6. Leave the Join Type at Inner Join, and uncheck Keep right key columns in joined table (so
that we do not include the redundant values of AirportCode, Month, Day, and Hour).
7. Next, drag an Edit Metadata module onto the design surface below the Join Data module,
and connect its input port to the output port of the Join Data module. We will use this module
to convert the fields that are unbounded String feature types into enumeration-like
Categorical features.
8. On the Properties panel of the Edit Metadata module, select Launch column selector and
set the Selected columns to DayOfWeek, Carrier, DestAirportCode, and OriginAirportCode,
and select the checkbox in the bottom right.
11. Launch the column selector, choose Begin With All Columns, then choose Exclude and set
the selected columns to exclude: OriginLatitude, OriginLongitude, DestLatitude, and
DestLongitude.
12. Save your experiment.
13. Run the experiment to verify everything works as expected. When it completes, right-click
the output of the Select Columns in Dataset module, and select Visualize. You will see the
joined datasets as output.
1. To create our training and validation datasets, drag a Split Data module beneath Select
Columns in Dataset, and connect the output of the Select Columns in Dataset module to the
input of the Split Data module.
2. On the Properties panel for the Split Data module, set the Fraction of rows in the first
output dataset to 0.7 (so 70% of the historical data will flow to output port 1). Set the
Random seed to 7634.
3. Next, add a Train Model module and connect it to output 1 of the Split Data module.
4. On the Properties panel for the Train Model module, set the Selected columns to
DepDel15.
5. Drag a Two-Class Logistic Regression module above and to the left of the Train Model
module and connect the output to the leftmost input of the Train Model module
6. Below the Train Model drop a Score Model module. Connect the output of the Train Model
module to the leftmost input port of the Score Model and connect the rightmost output of the
Split Data module to the rightmost input of the Score Model.
7. Save the experiment.
9. When the experiment is finished running (which takes a few minutes), right-click on the
output port of the Score Model module and select Visualize to see the results of its
predictions. You should have a total of 13 columns.
10. If you scroll to the right so that you can see the last two columns, observe there are Scored
Labels and Scored Probabilities columns. The former is the prediction (1 for predicting
delay, 0 for predicting no delay) and the latter is the probability of the prediction. In the
following screenshot, for example, the last row shows a delay prediction with a 53.1%
probability.
11. While this view enables you to see the prediction results for the first 100 rows, if you want to
get more detailed statistics across the prediction results to evaluate your model's
performance, you can use the Evaluate Model module.
12. Drag an Evaluate Model module on to the design surface beneath the Score Model module.
Connect the output of the Score Model module to the leftmost input of the Evaluate Model
module.
14. When the experiment is finished running, right-click the output of the Evaluate Model module
and select Visualize. In this dialog box, you are presented with various ways to understand
how your model is performing in the aggregate. While we will not cover how to interpret
these results in detail, we can examine the ROC chart that tells us that at least our model
(the blue curve) is performing better than random (the light gray straight line going from 0,0
to 1,1)—which is a good start for our first model!
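As an aside, the same train/score/evaluate flow you just built visually in ML Studio can be expressed in a few lines of Python with scikit-learn. The sketch below is only an illustration of the concepts (70/30 split, two-class logistic regression, scoring, and an AUC-style evaluation); it is not part of the lab steps, and the input file name is a placeholder.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Load the prepared (joined flight + weather) data; the file name is a placeholder
df = pd.read_csv("flights_and_weather_prepared.csv")

# DepDel15 is the label; everything else is a feature (one-hot encode the categorical columns)
X = pd.get_dummies(df.drop(columns=["DepDel15"]))
y = df["DepDel15"]

# 70/30 split, mirroring the Split Data module's 0.7 fraction and fixed random seed
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=7634)

# Two-class logistic regression, mirroring the Two-Class Logistic Regression module
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Score the held-out rows and evaluate, similar to the Score Model / Evaluate Model modules
probabilities = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, probabilities))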
Task 8: Operationalize the experiment
1. Now that we have a functioning model, let us package it up into a predictive experiment that
can be called as a web service.
2. In the command bar at the bottom, select Set Up Web Service and then select Predictive
Web Service [Recommended]. (If Predictive Web Service is grayed out, run the
experiment again.)
3. A copy of your training experiment is created, and a new tab labeled Predictive
Experiment is added, which contains the trained model wrapped between a web service input
module (e.g., the web service action you invoke with parameters) and a web service output
module (e.g., how the result of scoring the parameters is returned).
4. We will make some adjustments to the web service input and output modules to control the
parameters we require and the results we return.
5. Move the Web Service Input module down, so it is to the right of the Join Data module.
Connect the output of the Web service input module to the input of the Edit Metadata module.
6. Right-click the line connecting the Join Data module and the Edit Metadata module and select
Delete.
7. In between the Join Data and the Edit Metadata modules, drop a Select Columns in
Dataset module. Connect the Join Data module’s output to the Select Columns module’s
input, and the Select Columns output to the Edit Metadata module’s input.
8. In the Properties panel for the Select Columns in Dataset module, set the Select columns to
All Columns, and select Exclude. Enter columns DepDel15, OriginLatitude,
OriginLongitude, DestLatitude, and DestLongitude.
9. This configuration will update the web service metadata so that these columns do not appear
as required input parameters for the web service.
10. Select the Select Columns in Dataset module that comes after the Metadata Editor
module, and delete it.
11. Connect the output of the Edit Metadata module directly to the right input of the Score Model
module.
12. As we removed the latitude and longitude columns from the dataset to remove them as input
to the web service, we have to add them back in before we return the result so that the
results can be easily visualized on a map.
13. To add these fields back, begin by deleting the line between the Score Model and Web
service output.
15. Add a Join Data module, and position it below and to the left of the
AirportCodeLocationLookupClean module.
16. Connect the output of the Score Model module to the leftmost input of the Join Data
module and the output of the AirportCodeLocationLookupClean dataset to the rightmost
input of the Join Data module.
17. In the Properties panel for the Join Data module, for the Join key columns for L set the
selected columns to OriginAirportCode. For the Join key columns for R, set the Selected
columns to AIRPORT. Uncheck Keep right key columns in joined table.
18. Add a Select Columns in Dataset module beneath the Join Data module. Connect the Join
Data output to the input of the Select Columns in Dataset module.
19. In the Property panel, begin with All Columns, and set the Selected columns to Exclude
the columns: AIRPORT_ID and DISPLAY_AIRPORT_NAME.
20. Add an Edit Metadata module. Connect the output of the Select Columns in Dataset module
to the input of the Edit Metadata module.
21. In the Properties panel for the Metadata Editor, use the column selector to set the Selected
columns to LATITUDE and LONGITUDE. In the New column names enter: OriginLatitude,
OriginLongitude.
22. Connect the output of the Edit Metadata to the input of the web service output module.
23. Run the experiment.
24. When the experiment is finished running, select Deploy Web Service, Deploy Web
Service [NEW] Preview.
25. On the Deploy experiment page, select Create New… in the Price Plan drop down, and enter
Dev Test as the Plan Name. Select Standard DevTest (FREE) under Monthly Plan Options.
26. Select Deploy.
27. When the deployment is complete, you will be taken to the Web Service Quickstart page.
Select the Consume tab.
28. Leave the Consume page open for reference during Exercise 4, Task 1. At that point, you
will need to copy the Primary Key and Batch Requests Uri (omitting the query string,
“?api-version=2.0”).
Setup Azure Data Factory
Synopsis: In this exercise, attendees will create a baseline environment for Azure Data Factory
development for further operationalization of data movement and processing. You will create a Data
Factory service, and then install the Integration Runtime which is the agent that facilitates data
movement from on-premises to Microsoft Azure.
2. From the left side menu in the Azure portal, click on Resource groups, then enter your
resource group name into the filter box, and select it from the list.
Note: You may need to launch an InPrivate/Incognito session in your browser if you
have multiple Microsoft accounts.
2. From the top left corner of the Azure portal, select + Create a resource, and select Data +
Analytics, then select Data Factory.
3. On the New data factory blade, enter the following:
Name: Provide a name, such as bigdata-adf
Subscription: Select your subscription
Resource Group: Choose Use existing, and select the Resource Group you created
when deploying the lab prerequisites
Version: Select V1
Location: Select one of the available locations from the list nearest the one used by
your Resource Group
Select Create
4. The ADF deployment will take several minutes.
5. Once the deployment is completed, you will receive a notification that it succeeded.
6. Select the Go to resource button, to navigate to the newly created Data Factory.
7. On the Data Factory blade, select Author and Deploy under Actions.
8. Next, select …More, then New integration runtime (gateway).
13. Paste the key1 value into the box in the middle of the Microsoft Integration Runtime
Configuration Manager screen.
14. Select Register.
15. It will take a minute or two to register. If it takes more than a couple of minutes, and the
screen does not respond or returns an error message, close the screen by clicking the
Cancel button.
16. The next screen will be New Integration Runtime (Self-hosted) Node. Select Finish.
17. You will then get a screen with a confirmation message.
18. Select the Launch Configuration Manager button to view the connection details.
19. You can now return to the Azure portal, and click OK twice to complete the Integration
Runtime setup.
20. You can view the Integration Runtime by expanding Integration runtimes on the Author and
Deploy blade.
21. Close the Author and Deploy blade to return to the Azure Data Factory blade. Leave this
open for the next exercise.
Develop a data factory
pipeline for data movement
Synopsis: In this exercise, you will create an Azure Data Factory pipeline to copy data (.CSV file)
from an on-premises server (Lab VM) to Azure Blob Storage. The goal of the exercise is to
demonstrate data movement from an on-premises location to Azure Storage (via the Integration
Runtime). You will see how assets are created, deployed, executed, and monitored.
5. From the Specify File server share connection screen, enter the following:
6. On the Choose the input file or folder screen, select the folder FlightsAndWeather, and
select Choose.
7. On the next screen, check the Copy files recursively check box, and select Next.
8. On the File format settings page, leave the default settings, and select Next.
9. On the Destination screen, select Azure Blob Storage, and select Next.
10. On the Specify the Azure Blob storage account screen, enter the following:
Connection name: BlobStorageOutput
Account selection method: Leave as From Azure subscriptions
Azure Subscription: Select your subscription
Storage account name: Select <YOUR_APP_NAME>sparkstorage. Make sure you
select the storage account with the sparkstorage suffix, or you will have issues with
subsequent exercises. This ensures data will be copied to the storage account that
the Spark cluster uses for its data files.
11. Before selecting Next, please ensure you have selected the proper sparkstorage account.
Finally, select Next.
12. From the Choose the output file or folder tab, enter the following:
17. After saving the Copy settings, select Next on the Summary tab.
18. On the Deployment screen, you will see a message that the deployment is in progress, and
after a minute or two, that the deployment completed.
19. Select the Click here to monitor copy pipeline link at the bottom of the Deployment
screen.
20. From the Data Factory Resource Explorer, you should see the pipeline activity status Ready.
This indicates the CSV files are successfully copied from your VM to your Azure Blob Storage
location.
4. You may need to adjust the Start time in the window, as follows, and then select Apply.
Operationalize ML Scoring
with Azure ML and Data
Factory
Synopsis: In this exercise, you will extend the Data Factory to operationalize the scoring of data
using the previously created Azure Machine Learning (ML) model.
Back in Exercise 1, Task 9, you left your ML Web Service’s Consume page open.
Return to that page, and copy and paste the following values into the JSON below.
The value of mlEndpoint below is your web service’s Batch Requests URL;
remember to remove the query string (e.g., “?api-version=2.0”).
apiKey is the Primary Key of your web service.
Your tenant string should be populated automatically.
Delete the other optional settings (updateResourceEndpoint, servicePrincipalId,
servicePrincipalKey).
{
    "name": "AzureMLLinkedService",
    "properties": {
        "type": "AzureML",
        "description": "",
        "typeProperties": {
            "mlEndpoint": "<Specify the batch scoring URL>",
            "apiKey": "<Specify the published workspace model’s API key>",
            "tenant": "<Specify your tenant string>"
        }
    }
}
6. Select Deploy.
Task 2: Create Azure ML input dataset
1. Still on the Author and Deploy blade, select …More again.
2. To create a new dataset that will be copied into Azure Blob storage, select New dataset
from the top.
5. Select Deploy.
Task 3: Create Azure ML scored dataset
1. Select …More again, and select New dataset.
4. Select Deploy.
3. Select Deploy.
Task 5: Monitor pipeline activities
1. Close the Author and Deploy blade, and return to the Data Factory overview.
3. Once again, you may need to shift the start time in order to see the items in progress and
ready states.
Synopsis: In this exercise, you will prepare a summary of flight delay data in HDFS using Spark SQL.
1. In the Azure portal, navigate to your HDInsight cluster, and from the Overview blade select
Secure Shell (SSH).
2. On the SSH + Cluster login blade, select your cluster from the Hostname drop down, then
select the copy button next to the SSH command.
3. On your Lab VM, open a new Git Bash terminal window.
4. At the prompt, paste the SSH command you copied from your HDInsight SSH + Cluster login
blade.
5. Enter yes, if prompted about continuing, and enter the following password for the
sshuser:
Abc!1234567890
6. At the sshuser prompt within the bash terminal, enter the following command to install
pandas on the cluster:
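The exact command may vary; on HDInsight Spark clusters, which ship with the Anaconda Python distribution, a typical way to install pandas is with conda (the Anaconda path below is an assumption and may differ on your cluster):

sudo -HE /usr/bin/anaconda/bin/conda install pandas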
4. Jupyter Notebook will open in a new browser window. Log in with the following credentials:
6. Copy the text below, and paste it into the first cell in the Jupyter notebook. This will read
the data from our Scored_FlightsAndWeather.csv file, and output it into a Hive table named
“FlightDelays.”
import spark.sqlContext.implicits._
// Read the scored CSV output into a DataFrame (the storage path is an assumption; adjust it to where your pipeline wrote Scored_FlightsAndWeather.csv)
val resultDataFrame = spark.read.option("header", "true").option("inferSchema", "true").csv("/Scored_FlightsAndWeather.csv")
resultDataFrame.write.mode("overwrite").saveAsTable("FlightDelays")
9. You will see an asterisk appear between the brackets in front of the cell.
11. Below the cell, you will see the output from executing the command.
12. Now, we can query the hive table which was created by the previous command. Paste the
text below into the empty cell at the bottom on the notebook, and select the Run cell
button for that cell.
%%sql
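For example, a minimal query against the FlightDelays table created above (the LIMIT clause is only there to keep the output small) might be:

%%sql
SELECT * FROM flightdelays LIMIT 10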
13. Once completed you will see the results displayed as a table.
14. Next, you will create a table that summarizes the flight delays data. Instead of containing
one row per flight, this new summary table will contain one row per origin airport at a given
hour, along with a count of the quantity of anticipated delays. In a new cell below the results
of our previous cell, paste the following text, and select the Run cell button from the
toolbar.
%%sql
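A sketch of such a summary query, assuming hypothetical column names in the FlightDelays table (PredictedDelay for the scored label and DelayProbability for the scored probability) and producing the fields used later in the Power BI report (OriginLatLong, Day, Hour, NumDelays, AvgDelayProbability), might look like this:

%%sql
SELECT OriginAirportCode,
       CONCAT(CAST(OriginLatitude AS STRING), ',', CAST(OriginLongitude AS STRING)) AS OriginLatLong,
       DayofMonth AS Day,
       CRSDepHour AS Hour,
       SUM(PredictedDelay) AS NumDelays,
       AVG(DelayProbability) AS AvgDelayProbability
FROM flightdelays
GROUP BY OriginAirportCode, OriginLatitude, OriginLongitude, DayofMonth, CRSDepHour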
15. Execution of this cell should return a results table like the following.
16. Since the summary data looks good, the final step is to save this summary calculation as a
table, which we can later query using Power BI (in the next exercise).
17. To accomplish this, paste the text below into a new cell, and select the Run cell button
from the toolbar.
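One way to persist such a summary, assuming the hypothetical query sketched above and the flightdelayssummary table name used when connecting from Power BI in the next exercise, is a CREATE TABLE ... AS SELECT cell like the following:

%%sql
CREATE TABLE flightdelayssummary AS
SELECT OriginAirportCode,
       CONCAT(CAST(OriginLatitude AS STRING), ',', CAST(OriginLongitude AS STRING)) AS OriginLatLong,
       DayofMonth AS Day,
       CRSDepHour AS Hour,
       SUM(PredictedDelay) AS NumDelays,
       AVG(DelayProbability) AS AvgDelayProbability
FROM flightdelays
GROUP BY OriginAirportCode, OriginLatitude, OriginLongitude, DayofMonth, CRSDepHour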
18. To verify the table was successfully created, go to another new cell, and enter the
following query.
%%sql
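For example, assuming the table name flightdelayssummary from the sketch above, a simple verification query would be:

%%sql
SELECT * FROM flightdelayssummary LIMIT 10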
21. You can also select Pie, Scatter, Line, Area, and Bar chart visualizations of the dataset.
Visualizing in Power BI Desktop
Synopsis: In this exercise, you will create a Power BI Report to visualize the data in HDInsight Spark.
2. From the left side menu in the Azure portal, click on Resource groups, then enter your
resource group name into the filter box, and select it from the list.
2. When Power BI Desktop opens, you will need to enter your personal information, or Sign
in if you already have an account.
3. Select Get data on the screen that is displayed next.
4. Select Azure from the left, and select Azure HDInsight Spark (Beta) from the list of
available data sources.
5. Select Connect.
6. You will receive a prompt warning you that the Spark connector is still in preview. Select
Continue.
7. On the next screen, you will be prompted for your HDInsight Spark cluster URL.
8. To find your Spark cluster URL, go into the Azure portal, and navigate to your Spark
cluster, as you did in Exercise 5, Task 1. Once on the cluster blade, look for the URL under
the Essentials section
9. Copy the URL, and paste it into the Server box on the Power BI Azure HDInsight Spark
dialog.
10. Select DirectQuery for the Data Connectivity mode, and select OK.
13. In the Navigator dialog, check the box next to flightdelayssummary, and select Load.
14. It will take several minutes for the data to load into the Power BI Desktop client.
3. With the Map visualization still selected, drag the OriginLatLong field to the Location field
under Visualizations.
4. Next, drag the NumDelays field to the Size field under Visualizations.
5. You should now see a map that looks similar to the following (resize and zoom on your map if
necessary):
6. Unselect the Map visualization by clicking on the white space next to the map in the report
area.
7. From the Visualizations area, select the Stacked Column Chart icon to add a bar chart
visual to the report’s design surface.
8. With the Stacked Column Chart still selected, drag the Day field and drop it into the Axis
field located under Visualizations.
9. Next, drag the AvgDelayProbability field over, and drop it into the Value field.
10. Grab the corner of the new Stacked Column Chart visual on the report design surface, and
drag it out to make it as wide as the bottom of your report design surface. It should look
something like the following.
11. Unselect the Stacked Column Chart visual by clicking on the white space next to the map
on the design surface.
12. From the Visualizations area, select the Treemap icon to add this visualization to the report.
13. With the Treemap visualization selected, drag the OriginAirportCode field into the Group
field under Visualizations.
14. Next, drag the NumDelays field over, and drop it into the Values field.
15. Grab the corner of the Treemap visual on the report design surface, and expand it to fill the
area between the map and the right edge of the design surface. The report should now look
similar to the following.
16. You can cross filter any of the visualizations on the report by clicking on one of the other
visuals within the report, as shown below. (This may take a few seconds to change, as the
data is loaded.)
17. You can save the report, by clicking Save from the File menu, and entering a name and
location for the file.
Deploy Intelligent Web App
Synopsis: In this exercise, you will deploy an intelligent web application to Azure from GitHub. This
application leverages the operationalized machine learning model that was deployed in Exercise 1 to
bring action-oriented insight to an already existing business process.
2. Read through the README information on the GitHub page and capture the required
parameters.
NOTE: If you run into errors during the deployment that indicate a bad request or unauthorized,
verify that the account you are logged into the portal with is either a Service Administrator or a
Co-Administrator. You won’t have permissions to deploy the website otherwise.
7. After a short time, the deployment will complete, and you will be presented with a link to
your newly deployed web application. CTRL+Click to open it in a new tab.
8. Try a few different combinations of origin, destination, date, and time in the application. The
information you are shown is the result of both the ML API you published, as well as
information retrieved from the Weather Underground API.
Synopsis: In this exercise, attendees will deprovision any Azure resources that were created in
support of the workshop.
You should follow all steps provided after attending the Hands-on workshop.
2. Search for the name of your resource group and select it from the list.
3. Select Delete in the command bar and confirm the deletion by re-typing the Resource
group name and selecting Delete.