Fundamentals of Machine Learning With QA
Machine learning is the basis for most modern artificial intelligence
solutions. A familiarity with the core concepts on which machine learning
is based is an important foundation for understanding AI.
Learning objectives
After completing this module, you will be able to:
Describe core concepts of machine learning
Identify different types of machine learning
Describe considerations for training and evaluating machine learning
models
Describe core concepts of deep learning
Use automated machine learning in Azure Machine Learning service
Introduction
In this module, you'll explore some of the core concepts on which machine
learning is based, learn how to identify different kinds of machine learning
models, and examine the ways in which machine learning models are
trained and evaluated. Finally, you'll learn how to use Microsoft Azure
Machine Learning to train and deploy a machine learning model, without
needing to write any code.
Machine learning has its origins in statistics and mathematical modeling of
data. The fundamental idea of machine learning is to use data from past
observations to predict unknown outcomes or values. For example:
The proprietor of an ice cream store might use an app that combines
historical sales and weather records to predict how many ice creams
they're likely to sell on a given day, based on the weather forecast.
A doctor might use clinical data from past patients to run automated
tests that predict whether a new patient is at risk from diabetes
based on factors like weight, blood glucose level, and other
measurements.
A researcher in the Antarctic might use past observations to automate
the identification of different penguin species (such
as Adelie, Gentoo, or Chinstrap) based on measurements of a bird's
flippers, bill, and other physical attributes.
y = f(x)
4. Now that the training phase is complete, the trained model can
be used for inferencing. The model is essentially a software
program that encapsulates the function produced by the
training process. You can input a set of feature values, and
receive as an output a prediction of the corresponding label.
Because the output from the model is a prediction that was
calculated by the function, and not an observed value, you'll
often see the output from the function shown as ŷ (which is
rather delightfully verbalized as "y-hat").
There are multiple types of machine learning, and you must apply the
appropriate type depending on what you're trying to predict. A breakdown
of common types of machine learning is shown in the following diagram.
Regression
Binary classification
Multiclass classification
In some cases, clustering is used to determine the set of classes that exist
before training a classification model. For example, you might use
clustering to segment your customers into groups, and then analyze those
groups to identify and categorize different classes of customer (high value
- low volume, frequent small purchaser, and so on). You could then use
your categorizations to label the observations in your clustering results
and use the labeled data to train a classification model that predicts to
which customer category a new customer might belong.
Regression
After each train, validate, and evaluate iteration, you can repeat the
process with different algorithms and parameters until an acceptable
evaluation metric is achieved.
Example - regression
For our example, let's stick with the ice cream sales scenario we discussed
previously. For our feature, we'll consider the temperature (let's assume
the value is the maximum temperature on a given day), and the label we
want to train a model to predict is the number of ice creams sold that day.
We'll start with some historic data that includes records of daily
temperatures (x) and ice cream sales (y):
Temperature (x) Ice cream sales (y)
51 1
52 0
67 14
65 14
70 23
69 20
72 23
75 26
73 22
81 30
78 26
83 36
We'll start by splitting the data and using a subset of it to train a model.
Here's the training dataset:
Temperature (x) Ice cream sales (y)
51 1
65 14
69 20
72 23
75 26
81 30
To get an insight into how these x and y values might relate to one another,
we can plot them as coordinates along two axes, like this:
Now we're ready to apply an algorithm to our training data and fit it to a
function that applies an operation to x to calculate y. One such algorithm
is linear regression, which works by deriving a function that produces a
straight line through the intersections of the x and y values while
minimizing the average distance between the line and the plotted points,
like this:
The line is a visual representation of the function in which the slope of the
line describes how to calculate the value of y for a given value of x. The
line intercepts the x axis at 50, so when x is 50, y is 0. As you can see
from the axis markers in the plot, the line slopes so that every increase of
5 along the x axis results in an increase of 5 up the y axis; so when x is
55, y is 5; when x is 60, y is 10, and so on. To calculate a value of y for a
given value of x, the function simply subtracts 50; in other words, the
function can be expressed like this:
f(x) = x-50
You can use this function to predict the number of ice creams sold on a
day with any given temperature. For example, suppose the weather
forecast tells us that tomorrow it will be 77 degrees. We can apply our
model to calculate 77-50 and predict that we'll sell 27 ice creams
tomorrow.
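As a rough sketch of how such a line could be derived, the following Python fits the training data with ordinary least squares. This is an assumption: ordinary least squares minimizes squared distances, whereas the text describes the fit more loosely, and the names fit_line and predict are illustrative only. The fitted slope and intercept come out close to 1 and -50, which the text rounds to f(x) = x-50.

```python
# Training data from the example: daily temperature (x) and ice creams sold (y).
temperatures = [51, 65, 69, 72, 75, 81]
sales = [1, 14, 20, 23, 26, 30]


def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(temperatures, sales)


def predict(x):
    """Apply the fitted function to a new temperature value."""
    return slope * x + intercept


print(round(slope, 2), round(intercept, 2))
print(round(predict(77)))  # 27, matching the forecast example in the text
```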
To validate the model and evaluate how well it predicts, we held back
some data for which we know the label (y) value. Here's the data we held
back:
Temperature (x) Ice cream sales (y)
52 0
67 14
70 23
73 22
78 26
83 36
We can use the model to predict the label for each of the observations in
this dataset based on the feature (x) value; and then compare the
predicted label (ŷ) to the known actual label value (y).
Using the model we trained earlier, which encapsulates the function f(x)
= x-50, results in the following predictions:
Temperature (x) Actual sales (y) Predicted sales (ŷ)
52 0 2
67 14 17
70 23 20
73 22 23
78 26 28
83 36 33
We can plot both the predicted and actual labels against the feature
values like this:
Based on the differences between the predicted and actual values, you
can calculate some common metrics that are used to evaluate a
regression model.
The variance in this example indicates by how many ice creams each
prediction was wrong. It doesn't matter if the prediction
was over or under the actual value (so for example, -3 and +3 both
indicate a variance of 3). This metric is known as the absolute error for
each prediction, and can be summarized for the whole validation set as
the mean absolute error (MAE).
In the ice cream example, the mean (average) of the absolute errors (2, 3,
3, 1, 2, and 3) is 2.33.
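The MAE calculation can be reproduced in a few lines of Python. This is a minimal sketch that uses the validation data and the function f(x) = x-50 from the example; the variable names are illustrative.

```python
# Validation data from the example.
temperatures = [52, 67, 70, 73, 78, 83]
actual = [0, 14, 23, 22, 26, 36]

# Predictions from the trained function f(x) = x - 50.
predicted = [x - 50 for x in temperatures]

# Absolute error ignores whether a prediction was over or under.
abs_errors = [abs(y - y_hat) for y, y_hat in zip(actual, predicted)]
mae = sum(abs_errors) / len(abs_errors)

print(abs_errors)     # [2, 3, 3, 1, 2, 3]
print(round(mae, 2))  # 2.33
```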
The mean absolute error metric takes all discrepancies between predicted
and actual labels into account equally. However, it may be more desirable
to have a model that is consistently wrong by a small amount than one
that makes fewer, but larger errors. One way to produce a metric that
"amplifies" larger errors is to square the individual errors and calculate
the mean of the squared values. This metric is known as the mean
squared error (MSE).
In our ice cream example, the mean of the squared absolute values
(which are 4, 9, 9, 1, 4, and 9) is 6.
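As a minimal sketch, the same calculation in Python (variable names are illustrative) shows how squaring gives the errors of 3 more weight than the errors of 1 or 2:

```python
# Actual and predicted labels for the validation data in the example.
actual = [0, 14, 23, 22, 26, 36]
predicted = [2, 17, 20, 23, 28, 33]

# Squaring each error amplifies the larger discrepancies.
squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)]
mse = sum(squared_errors) / len(squared_errors)

print(squared_errors)  # [4, 9, 9, 1, 4, 9]
print(mse)             # 6.0
```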
The mean squared error helps take the magnitude of errors into account,
but because it squares the error values, the resulting metric no longer
represents the quantity measured by the label. In other words, we can say
that the MSE of our model is 6, but that doesn't measure its accuracy in
terms of the number of ice creams that were mispredicted; 6 is just a
numeric score that indicates the level of error in the validation
predictions.
All of the metrics so far compare the discrepancy between the predicted
and actual values in order to evaluate the model. However, in reality,
there's some natural random variance in the daily sales of ice cream that
the model takes into account. In a linear regression model, the training
algorithm fits a straight line that minimizes the mean variance between
the function and the known label values. The coefficient of
determination (more commonly referred to as R² or R-Squared) is a
metric that measures the proportion of variance in the validation results
that can be explained by the model, as opposed to some anomalous
aspect of the validation data (for example, a day with a highly unusual
number of ice creams sales because of a local festival).
The calculation for R² is more complex than for the previous metrics. It
compares the sum of squared differences between predicted and actual
labels with the sum of squared differences between the actual label
values and the mean of actual label values, like this:
R² = 1 - ∑(y-ŷ)² ÷ ∑(y-ȳ)²
Don't worry too much if that looks complicated; most machine learning
tools can calculate the metric for you. The important point is that the
result is usually a value between 0 and 1 that describes the proportion of
variance explained by the model (it can fall below 0 when a model fits the
validation data worse than simply predicting the mean). In simple terms, the
closer to 1 this value is, the better the model is fitting the validation data.
In the case of the ice cream regression model, the R² calculated from the
validation data is 0.95.
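To make the formula concrete, here is a minimal Python sketch that applies it to the validation data from the ice cream example. The variable names are illustrative.

```python
# Actual and predicted labels for the validation data in the example.
actual = [0, 14, 23, 22, 26, 36]
predicted = [2, 17, 20, 23, 28, 33]

mean_actual = sum(actual) / len(actual)

# Sum of squared differences between actual and predicted labels.
ss_residual = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))

# Sum of squared differences between actual labels and their mean.
ss_total = sum((y - mean_actual) ** 2 for y in actual)

r_squared = 1 - ss_residual / ss_total
print(round(r_squared, 2))  # 0.95
```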
Iterative training
After multiple iterations, the model that results in the best evaluation
metric that's acceptable for the specific scenario is selected.
Binary classification
Binary classification algorithms are used to train a model that predicts one
of two possible labels for a single class; essentially, it
predicts true or false. In most real scenarios, the data observations
used to train and validate the model consist of multiple feature (x) values
and a y value that is either 1 or 0.
Blood glucose (x) Diabetes diagnosis (y)
67 0
103 1
114 1
72 0
116 1
65 0
To train the model, we'll use an algorithm to fit the training data to a
function that calculates the probability of the class label being true (in
other words, that the patient has diabetes). Probability is measured as a
value between 0.0 and 1.0, such that the total probability for all possible
classes is 1.0. So for example, if the probability of a patient having
diabetes is 0.7, then there's a corresponding probability of 0.3 that the
patient isn't diabetic.
There are many algorithms that can be used for binary classification, such
as logistic regression, which derives a sigmoid (S-shaped) function with
values between 0.0 and 1.0, like this:
f(x) = P(y=1 | x)
For three of the six observations in the training data, we know that y is
definitely true, so the probability for those observations that y=1
is 1.0 and for the other three, we know that y is definitely false, so the
probability that y=1 is 0.0. The S-shaped curve describes the probability
distribution so that plotting a value of x on the line identifies the
corresponding probability that y is 1.
The diagram also includes a horizontal line to indicate the threshold at
which a model based on this function will predict true (1) or false (0). The
threshold lies at the mid-point for y (P(y) = 0.5). For any values at this
point or above, the model will predict true (1); while for any values below
this point it will predict false (0). For example, for a patient with a blood
glucose level of 90, the function would result in a probability value of 0.9.
Since 0.9 is higher than the threshold of 0.5, the model would
predict true (1) - in other words, the patient is predicted to have diabetes.
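The thresholding step can be sketched in Python as follows. The WEIGHT and BIAS values are hypothetical, chosen only to give an S-shaped curve over plausible glucose values; they are not the parameters of the model described in the text, and the function names are illustrative.

```python
import math


def sigmoid(z):
    """S-shaped function that maps any number to a value between 0.0 and 1.0."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameters for illustration only (not fitted to the example data).
WEIGHT = 0.2
BIAS = -17.0


def probability_of_diabetes(glucose):
    """Estimate P(y=1 | x) for a blood glucose value x."""
    return sigmoid(WEIGHT * glucose + BIAS)


def predict_label(probability, threshold=0.5):
    """Predict 1 at or above the threshold, otherwise 0."""
    return 1 if probability >= threshold else 0


for glucose in (60, 110):
    p = probability_of_diabetes(glucose)
    print(glucose, round(p, 3), predict_label(p))
```

Raising or lowering the threshold argument changes which probabilities are converted to a prediction of 1, without retraining the underlying function.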
Blood glucose (x) Diabetes diagnosis (y)
66 0
107 1
112 1
71 0
87 1
89 1
Blood glucose (x) Actual diabetes diagnosis (y) Predicted diabetes diagnosis (ŷ)
66 0 0
107 1 1
112 1 1
71 0 0
87 1 0
89 1 1
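The metrics in this section are built from four counts taken from these validation results: true negatives (TN), false positives (FP), false negatives (FN), and true positives (TP). As a minimal sketch, they can be tallied in Python like this (the function name is illustrative):

```python
# Actual and predicted labels from the validation results.
actual = [0, 1, 1, 0, 1, 1]
predicted = [0, 1, 1, 0, 0, 1]


def confusion_counts(actual, predicted):
    """Count each combination of actual and predicted binary labels."""
    counts = {"TN": 0, "FP": 0, "FN": 0, "TP": 0}
    for y, y_hat in zip(actual, predicted):
        if y == 1 and y_hat == 1:
            counts["TP"] += 1
        elif y == 0 and y_hat == 0:
            counts["TN"] += 1
        elif y == 0 and y_hat == 1:
            counts["FP"] += 1
        else:
            counts["FN"] += 1
    return counts


counts = confusion_counts(actual, predicted)
print(counts)  # {'TN': 2, 'FP': 0, 'FN': 1, 'TP': 3}
```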
Accuracy
The simplest metric you can calculate from the confusion matrix
is accuracy - the proportion of predictions that the model got right.
Accuracy is calculated as:
(TN+TP) ÷ (TN+FN+FP+TP)
(2+3) ÷ (2+1+0+3)
=5÷6
= 0.83
Accuracy might initially seem like a good metric to evaluate a model, but
consider this. Suppose 11% of the population has diabetes. You could
create a model that always predicts 0, and it would achieve an accuracy
of 89%, even though it makes no real attempt to differentiate between
patients by evaluating their features. What we really need is a deeper
understanding of how the model performs at predicting 1 for positive
cases and 0 for negative cases.
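The following Python sketch illustrates the point. It builds a hypothetical population of 100 patients in which 11 have diabetes, and scores a "model" that always predicts 0; the names are illustrative.

```python
def accuracy(actual, predicted):
    """Proportion of predictions that match the actual labels."""
    correct = sum(1 for y, y_hat in zip(actual, predicted) if y == y_hat)
    return correct / len(actual)


# Hypothetical population: 11 patients with diabetes, 89 without.
population = [1] * 11 + [0] * 89

# A model that ignores the features and always predicts 0.
always_zero = [0] * len(population)

baseline_accuracy = accuracy(population, always_zero)
print(baseline_accuracy)  # 0.89

# It never identifies a single positive case, despite the high accuracy.
positives_found = sum(1 for y, y_hat in zip(population, always_zero)
                      if y == 1 and y_hat == 1)
print(positives_found)  # 0
```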
Recall
Recall is a metric that measures the proportion of positive cases that the
model identified correctly. In other words, compared to the number of
patients who have diabetes, how many did the model predict to have
diabetes?
TP ÷ (TP+FN)
3 ÷ (3+1)
=3÷4
= 0.75
Precision
TP ÷ (TP+FP)
3 ÷ (3+0)
=3÷3
= 1.0
F1-score
(2 × Precision × Recall) ÷ (Precision + Recall)
= (2 × 1.0 × 0.75) ÷ (1.0 + 0.75)
= 1.5 ÷ 1.75
= 0.86
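As a minimal sketch, recall, precision, and the F1-score that combines them can be computed from the confusion matrix counts for the diabetes example like this (variable names are illustrative):

```python
# Confusion matrix counts for the diabetes validation results.
tn, fp, fn, tp = 2, 0, 1, 3

# Of the patients who have diabetes, how many did the model identify?
recall = tp / (tp + fn)

# Of the patients predicted to have diabetes, how many actually do?
precision = tp / (tp + fp)

# F1-score: (2 x precision x recall) / (precision + recall).
f1 = (2 * precision * recall) / (precision + recall)

print(recall)        # 0.75
print(precision)     # 1.0
print(round(f1, 2))  # 0.86
```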
Another name for recall is the true positive rate (TPR), and there's an
equivalent metric called the false positive rate (FPR) that is calculated
as FP÷(FP+TN). We already know that the TPR for our model when using
a threshold of 0.5 is 0.75, and we can use the formula for FPR to calculate
a value of 0÷2 = 0.
The ROC curve for a perfect model would go straight up the TPR axis on
the left and then across the FPR axis at the top. Since the plot area for the
curve measures 1x1, the area under this perfect curve would be 1.0
(meaning that the model is correct 100% of the time). In contrast, a
diagonal line from the bottom-left to the top-right represents the results
that would be achieved by randomly guessing a binary label; producing an
area under the curve of 0.5. In other words, given two possible class
labels, you could reasonably expect to guess correctly 50% of the time.
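A short Python sketch can confirm these figures. It calculates the (FPR, TPR) point for the diabetes model at a threshold of 0.5, and uses the trapezoidal rule (an assumed, but common, way to measure the area) for the perfect and random-guess curves. The function names are illustrative.

```python
def rates(tn, fp, fn, tp):
    """Return (false positive rate, true positive rate)."""
    return fp / (fp + tn), tp / (tp + fn)


def area_under_curve(points):
    """Trapezoidal area under (fpr, tpr) points sorted by fpr."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Point on the ROC curve for the diabetes model at a threshold of 0.5.
print(rates(tn=2, fp=0, fn=1, tp=3))  # (0.0, 0.75)

# Perfect model: straight up the left side, then across the top.
perfect = [(0, 0), (0, 1), (1, 1)]
# Random guessing: diagonal from bottom-left to top-right.
random_guess = [(0, 0), (1, 1)]

print(area_under_curve(perfect))       # 1.0
print(area_under_curve(random_guess))  # 0.5
```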
In the case of our diabetes model, the curve above is produced, and
the area under the curve (AUC) metric is 0.875. Since the AUC is higher
than 0.5, we can conclude the model performs better at predicting
whether or not a patient has diabetes than randomly guessing.
Multiclass classification
0: Adelie
1: Gentoo
2: Chinstrap
Flipper length (x) Species (y)
167 0
172 0
225 2
197 1
189 1
232 2
158 0
f0(x) = P(y=0 | x)
f1(x) = P(y=1 | x)
f2(x) = P(y=2 | x)
The elements in the vector represent the probabilities for classes 0, 1, and
2 respectively; so in this case, the class with the highest probability is 2.
Let's assume that we've validated our multiclass classifier, and obtained
the following results:
Flipper length (x) Actual species (y) Predicted species (ŷ)
165 0 0
171 0 0
205 2 1
195 1 1
183 1 1
221 2 2
214 2 2
The confusion matrix for a multiclass classifier is similar to that of a binary
classifier, except that it shows the number of predictions for each
combination of predicted (ŷ) and actual class labels (y):
From this confusion matrix, we can determine the metrics for each
individual class as follows:
To calculate the overall accuracy, recall, and precision metrics, you use
the total of the TP, TN, FP, and FN metrics:
The overall F1-score is calculated using the overall recall and precision
metrics:
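As a sketch of these calculations in Python, the following code tallies TP, TN, FP, and FN for each class and then totals them to get the overall metrics. It assumes that the second column of the validation results is the actual species and the third is the predicted species; the function and variable names are illustrative.

```python
# Validation results (assumed column order: actual, then predicted).
actual = [0, 0, 2, 1, 1, 2, 2]
predicted = [0, 0, 1, 1, 1, 2, 2]
classes = [0, 1, 2]  # 0: Adelie, 1: Gentoo, 2: Chinstrap


def per_class_counts(actual, predicted, cls):
    """Treat one class as positive and all other classes as negative."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for y, p in pairs if y == cls and p == cls)
    fp = sum(1 for y, p in pairs if y != cls and p == cls)
    fn = sum(1 for y, p in pairs if y == cls and p != cls)
    tn = len(pairs) - tp - fp - fn
    return tp, tn, fp, fn


by_class = {c: per_class_counts(actual, predicted, c) for c in classes}

# Total each count across the classes for the overall metrics.
tp, tn, fp, fn = (sum(values) for values in zip(*by_class.values()))

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = (2 * precision * recall) / (precision + recall)

print(by_class)
print(round(accuracy, 2), round(recall, 2), round(precision, 2), round(f1, 2))
```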
Clustering
Example - clustering
There are no known labels in the dataset, just two features. The goal is
not to identify the different types (species) of flower; just to group similar
flowers together based on the number of leaves and petals.
Leaves Petals
0 5
0 6
1 3
1 3
1 6
1 8
2 3
2 7
2 8
There are multiple algorithms you can use for clustering. One of the most
commonly used algorithms is K-Means clustering, which consists of the
following steps:
Since there's no known label with which to compare the predicted cluster
assignments, evaluation of a clustering model is based on how well the
resulting clusters are separated from one another.
There are multiple metrics that you can use to evaluate cluster
separation, including:
Average distance to cluster center: How close, on average, each
point in the cluster is to the centroid of the cluster.
Average distance to other center: How close, on average, each
point in the cluster is to the centroid of all other clusters.
Maximum distance to cluster center: The furthest distance
between a point in the cluster and its centroid.
Silhouette: A value between -1 and 1 that summarizes the ratio of
distance between points in the same cluster and points in different
clusters (The closer to 1, the better the cluster separation).
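As a rough sketch, the following Python clusters the flower data with a basic K-Means loop (assign each point to its nearest centroid, move each centroid to the mean of its points, repeat until assignments stop changing) and then reports two of the metrics above. The choice of three clusters and of the starting centroids is an assumption made for illustration, and the function names are hypothetical.

```python
import math

# Flower observations from the example (two features per flower).
points = [(0, 5), (0, 6), (1, 3), (1, 3), (1, 6), (1, 8), (2, 3), (2, 7), (2, 8)]


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def k_means(points, initial_centroids, max_iterations=100):
    centroids = list(initial_centroids)
    assignments = None
    for _ in range(max_iterations):
        new_assignments = [nearest(p, centroids) for p in points]
        if new_assignments == assignments:
            break  # Clusters are stable.
        assignments = new_assignments
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # Keep the old centroid if a cluster is empty.
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignments, centroids


def distances_to_center(points, assignments, centroids, cluster):
    """Return (average, maximum) distance from a cluster's points to its centroid."""
    dists = [math.dist(p, centroids[cluster])
             for p, a in zip(points, assignments) if a == cluster]
    return sum(dists) / len(dists), max(dists)


# Assumed starting centroids: three of the observations, picked arbitrarily.
assignments, centroids = k_means(points, [points[0], points[2], points[8]])
print(assignments)
for cluster in range(3):
    print(cluster, distances_to_center(points, assignments, centroids, cluster))
```

Different starting centroids can lead to different clusters, which is one reason the separation metrics are worth checking.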
Deep learning
Biological neural network: Neurons fire in response to electrochemical stimuli. When fired,
the signal is passed to connected neurons.
Artificial neural network: Each neuron is a function that operates on an input va… The function
is wrapped in an activation function that … pass the output on.
Just like other machine learning techniques discussed in this module, deep
learning involves fitting training data to a function that can predict a label
(y) based on the value of one or more features (x). The function (f(x)) is
the outer layer of a nested function in which each layer of the neural
network encapsulates functions that operate on x and the weight (w)
values associated with them. The algorithm used to train the model
involves iteratively feeding the feature values (x) in the training data
forward through the layers to calculate output values for ŷ, validating the
model to evaluate how far off the calculated ŷ values are from the
known y values (which quantifies the level of error, or loss, in the model),
and then modifying the weights (w) to reduce the loss. The trained model
includes the final weight values that result in the most accurate
predictions.
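The training loop described above can be sketched in Python with a deliberately tiny "network": a single weight and no activation function. The data, learning rate, and number of epochs are hypothetical, and gradient descent on mean squared error is an assumed (though common) choice of optimizer and loss.

```python
# Hypothetical training data in which the true relationship is y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]


def loss(w):
    """Mean squared error between predictions (w * x) and known labels."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient(w):
    """Derivative of the loss with respect to the weight."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.0                # Initial weight.
learning_rate = 0.05   # Hypothetical step size.
initial_loss = loss(w)

for epoch in range(100):
    # Feed forward, measure the loss, then adjust the weight to reduce it.
    w -= learning_rate * gradient(w)

final_loss = loss(w)
print(round(w, 4), initial_loss, final_loss)
```

A real neural network repeats the same idea across many weights and layers, using the loss to decide how each weight should change.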
The label we're trying to predict (y) is the species of the penguin, and
there are three possible species it could be:
Adelie
Gentoo
Chinstrap
The process for inferencing a predicted penguin class using this network
is:
1. The feature vector for a penguin observation is fed into the input
layer of the neural network, which consists of a neuron for
each x value. In this example, the following x vector is used as the
input: [37.3, 16.8, 19.2, 30.0]
2. The functions for the first layer of neurons each calculate a weighted
sum by combining the x value and w weight, and pass it to an
activation function that determines if it meets the threshold to be
passed on to the next layer.
3. Each neuron in a layer is connected to all of the neurons in the next
layer (an architecture sometimes called a fully connected network) so
the results of each layer are fed forward through the network until
they reach the output layer.
4. The output layer produces a vector of values; in this case, using
a softmax or similar function to calculate the probability distribution
for the three possible classes of penguin. In this example, the output
vector is: [0.2, 0.7, 0.1]
5. The elements of the vector represent the probabilities for classes 0,
1, and 2. The second value is the highest, so the model predicts that
the species of the penguin is 1 (Gentoo).
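The inferencing steps above can be sketched in plain Python. All weights and biases here are hypothetical placeholders rather than values from a trained model, ReLU is an assumed activation function, and a single hidden layer of three neurons is an arbitrary choice; only the input vector, the example output vector, and the class order come from the text.

```python
import math

SPECIES = ["Adelie", "Gentoo", "Chinstrap"]  # Classes 0, 1, and 2.


def dense(inputs, weights, biases):
    """Weighted sum plus bias for each neuron in a fully connected layer."""
    return [sum(w * x for w, x in zip(neuron_weights, inputs)) + b
            for neuron_weights, b in zip(weights, biases)]


def relu(values):
    """Assumed activation: pass positive values on, otherwise pass 0."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Convert raw output values to probabilities that sum to 1."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def predict_species(probabilities):
    """Return the name of the class with the highest probability."""
    return SPECIES[probabilities.index(max(probabilities))]


x = [37.3, 16.8, 19.2, 30.0]  # Feature vector from the example.

# Hypothetical weights and biases, for illustration only.
hidden_w = [[0.01, -0.02, 0.03, 0.01], [0.02, 0.01, -0.01, 0.0], [-0.01, 0.02, 0.01, 0.02]]
hidden_b = [0.0, 0.1, -0.1]
output_w = [[0.5, -0.2, 0.1], [-0.3, 0.4, 0.2], [0.1, 0.1, -0.4]]
output_b = [0.0, 0.0, 0.0]

hidden = relu(dense(x, hidden_w, hidden_b))
probabilities = softmax(dense(hidden, output_w, output_b))
print(probabilities)

# Using the output vector given in the text:
print(predict_species([0.2, 0.7, 0.1]))  # Gentoo
```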
While it's easier to think of each case in the training data being passed
through the network one at a time, in reality the data is batched into
matrices and processed using linear algebraic calculations. For this
reason, neural network training is best performed on computers with
graphical processing units (GPUs) that are optimized for vector and matrix
manipulation.
To create an Azure Machine Learning workspace, you can use the Azure
portal, as shown here:
Azure Machine Learning studio
4. Select Launch studio (or open a new browser tab and navigate
to https://ml.azure.com, and sign into Azure Machine Learning
studio using your Microsoft account). Close any messages that are
displayed.
5. In Azure Machine Learning studio, you should see your newly created
workspace. If not, select All workspaces in the left-hand menu and then
select the workspace you just created.
Basic settings:
o Data type:
Name: bike-rentals
Description: Historic bike rental data
Type: Tabular
o Data source:
Select From web files
o Web URL:
Web URL: https://aka.ms/bike-rentals
Skip data validation: do not select
o Settings:
File format: Delimited
Delimiter: Comma
Encoding: UTF-8
Column headers: Only first file has headers
Skip rows: None
Dataset contains multi-line data: do not select
o Schema:
Include all columns other than Path
Review the automatically detected types
Task settings:
o Max trials: 3
o Max nodes: 3
o Timeout: 15
o Iteration timeout: 15
Compute:
4. Wait for the job to finish. It might take a while — now might be a
good time for a coffee break!
When the automated machine learning job has completed, you can review
the best model it trained.
1. On the Overview tab of the automated machine learning job, note
the best model summary.
Note You may see a message under the status “Warning: User specified exit
score reached…”. This is an expected message. Please continue to the next step.
2. Select the text under Algorithm name for the best model to view
its details.
1. On the Model tab for the best model trained by your automated machine
learning job, select Deploy and use the Web service option to deploy the
model with the following settings:
o Name: predict-rentals
o Description: Predict cycle rentals
o Compute type: Azure Container Instance
o Enable authentication: Selected
2. Wait for the deployment to start - this may take a few seconds.
The Deploy status for the predict-rentals endpoint will be indicated in
the main part of the page as Running.
3. Wait for the Deploy status to change to Succeeded. This may take 5-10
minutes.
Test the deployed service
{
  "Inputs": {
    "data": [
      {
        "day": 1,
        "mnth": 1,
        "year": 2022,
        "season": 2,
        "holiday": 0,
        "weekday": 1,
        "workingday": 1,
        "weathersit": 2,
        "temp": 0.3,
        "atemp": 0.3,
        "hum": 0.3,
        "windspeed": 0.3
      }
    ]
  },
  "GlobalParameters": 1.0
}
{
  "Results": [
    444.27799000000000
  ]
}
The test pane took the input data and used the model you trained to
return the predicted number of rentals.
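Outside the test pane, a client application could send the same JSON to the deployed service. The following Python is a sketch only: ENDPOINT_URL and API_KEY are placeholders to be replaced with values for your own endpoint, passing the key as a Bearer token in the Authorization header is an assumption about how authentication is configured, and the request is built but deliberately not sent.

```python
import json
import urllib.request

# Placeholders: substitute the scoring URL and key for your own endpoint.
ENDPOINT_URL = "https://example.invalid/score"
API_KEY = "<your-key>"

# The same input data used in the test pane.
payload = {
    "Inputs": {
        "data": [
            {
                "day": 1, "mnth": 1, "year": 2022, "season": 2,
                "holiday": 0, "weekday": 1, "workingday": 1, "weathersit": 2,
                "temp": 0.3, "atemp": 0.3, "hum": 0.3, "windspeed": 0.3,
            }
        ]
    },
    "GlobalParameters": 1.0,
}


def build_request(url, key, body):
    """Build (but do not send) a POST request that carries the JSON body."""
    data = json.dumps(body).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + key,  # Assumed key-based authentication.
    }
    return urllib.request.Request(url, data=data, headers=headers)


request = build_request(ENDPOINT_URL, API_KEY, payload)

# To call a real endpoint, you would send the request like this:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)

# Parsing a response shaped like the one shown above:
sample_response = '{"Results": [444.27799]}'
predicted_rentals = json.loads(sample_response)["Results"][0]
print(predicted_rentals)
```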
Let’s review what you have done. You used a dataset of historical bicycle
rental data to train a model. The model predicts the number of bicycle
rentals expected on a given day, based on seasonal and
meteorological features.
Clean-up
1. In the Azure portal, in the Resource groups page, open the resource
group you specified when creating your Azure Machine Learning
workspace.
2. Click Delete resource group, type the resource group name to confirm
you want to delete it, and select Delete.
Knowledge check
1. You want to create a model to predict the cost of heating an office building based on its size in
square feet and the number of employees working there. What kind of machine learning problem is
this?
Regression
Clustering
2. You need to evaluate a classification model. Which metric can you use?
Precision
Silhouette
To evaluate the aggregate difference between predicted and actual label values
Correct. A loss function determines the overall variance, or loss, between predicted and actual label
values.
To calculate the cost of training a neural network rather than a statistical model
4. What does automated machine learning in Azure Machine Learning enable you to do?
Automatically provision Azure Machine Learning workspaces for new data scientists in an
organization
Automatically run multiple training jobs using different algorithms and parameters to find the best
model
Correct. Automated machine learning runs multiple training jobs, varying algorithms and parameters,
to find the best model for your data.
Summary
The module also introduced Azure Machine Learning; a cloud platform for
end-to-end machine learning operations, and gave you the opportunity to
use automated machine learning in Azure Machine Learning for yourself.
Tip
To learn more about Azure Machine Learning and its capabilities, see
the Azure Machine Learning page.