Unit 2 Supervised Learning

The document provides an overview of machine learning types, including supervised, unsupervised, and semi-supervised learning, detailing their concepts, goals, and examples. It discusses specific case studies, such as digit recognition using the MNIST dataset and wine quality prediction using Principal Component Analysis (PCA). Additionally, it highlights the challenges of labeling data and the use of semi-supervised techniques like label propagation and active learning to enhance model training.

Machine learning can be broadly categorized into several types, primarily based on the kind of data the algorithms learn from and the type of task they perform.

Here's a breakdown of the main types:

1. Supervised Learning:
 Concept: The algorithm learns from labelled data, which means the data includes both the input features and the correct output (or target variable). It's like learning with a teacher who provides the answers.
 Goal: To learn a mapping function that can predict the output for new, unseen input data.
 Examples:
o Classification: Predicting categories (e.g., spam or not spam, cat or dog). Algorithms include Logistic Regression, Support Vector Machines, Decision Trees, Random Forests, Naive Bayes.
o Regression: Predicting continuous values (e.g., house prices, stock prices). Algorithms include Linear Regression, Polynomial Regression, Support Vector Regression, Decision Tree Regression, Random Forest Regression.
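Both task types follow the same fit/predict pattern in Scikit-learn. A minimal sketch on synthetic data (the data and model choices here are illustrative, not from the examples above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

# Classification: predict a category (which side of a line each point falls on)
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = LogisticRegression().fit(X, y_class)

# Regression: predict a continuous value (a noisy linear function of the inputs)
y_reg = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)
reg = LinearRegression().fit(X, y_reg)

print(clf.predict(X[:2]))  # category labels (0 or 1)
print(reg.predict(X[:2]))  # continuous values
```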
2. Unsupervised Learning:
 Concept: The algorithm learns from unlabelled data, meaning the data only includes input features without corresponding outputs. It's like learning without a teacher, discovering patterns on its own.
 Goal: To find patterns, structures, or groupings in the data.
 Examples:
o Clustering: Grouping similar data points together (e.g., customer segmentation, document clustering). Algorithms include K-Means Clustering, Hierarchical Clustering, DBSCAN.
o Dimensionality Reduction: Reducing the number of features while preserving important information (e.g., feature extraction, data visualization). Algorithms include Principal Component Analysis (PCA), t-SNE.
o Association Rule Mining: Discovering relationships between variables (e.g., market basket analysis, recommendation systems). Algorithms include Apriori, FP-Growth.
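Clustering and dimensionality reduction share the same unsupervised interface: no target labels are passed to fit. A short sketch on synthetic data (the blob positions and cluster count are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Two well-separated groups of points in 5 dimensions, with no labels attached
X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(8, 1, (50, 5))])

# Clustering: group similar points together
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Dimensionality reduction: project the 5 features down to 2
X2 = PCA(n_components=2).fit_transform(X)
print(labels[:5], X2.shape)
```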
3. Semi-Supervised Learning:
 Concept: A combination of supervised and unsupervised learning. The
algorithm learns from a dataset that contains both labelled and unlabelled
data.
 Goal: To leverage the limited labelled data to improve the learning from the
abundant unlabelled data.
 Examples:
o Image classification with a small number of labelled images and a large
number of unlabelled images.
o Natural language processing tasks where labelled data is scarce.
1. Supervised Learning:

CASE STUDY: DISCERNING DIGITS FROM IMAGES

These images aren’t unlike the Captcha checks many websites have in place to make sure
you’re not a computer trying to hack into the user accounts.

Our research goal is to let a computer recognize images of numbers (step one of the data science process).
The data we'll be working on is the MNIST data set, which is often used in the data science literature for teaching and benchmarking.
The MNIST images can be found in the datasets package of Scikit-learn and are already normalized for you (all scaled to the same size: 8×8 pixels), so we won't need much data preparation (step three of the data science process). But let's first fetch our data as step two of the data science process, with the following listing.
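That fetch step can be sketched as follows, assuming Scikit-learn's bundled digits subset:

```python
from sklearn import datasets

digits = datasets.load_digits()  # Scikit-learn's bundled MNIST-style subset
print(digits.images.shape)       # (1797, 8, 8): 1,797 images of 8x8 pixels
print(digits.target[:10])        # the label (actual digit) for each image
```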
Each entry in digits.images is a two-dimensional array (a matrix) reflecting the shape of the image; pl.matshow() displays it as a picture, but the classifier expects a flat list of features per image. To flatten it, we need to call reshape() on digits.images. The net result will be a one-dimensional array that looks something like this:
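The flattening step can be sketched like this (assuming the digits object loaded from Scikit-learn):

```python
from sklearn import datasets

digits = datasets.load_digits()
n = len(digits.images)
# Each image is an 8x8 matrix; the classifier wants one row of 64 values per image
data = digits.images.reshape((n, -1))
print(digits.images[0].shape, "->", data[0].shape)  # (8, 8) -> (64,)
```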
From this point on, it’s a standard classification problem, which brings us to step five of the data science process:
model building.

Now that we have a way to pass the contents of an image into the classifier, we need to pass it a training data set so it
can start learning how to predict the numbers in the images.

We mentioned earlier that Scikit-learn contains a subset of the MNIST database (about 1,800 images), so we'll use that. Each image is also labeled with the number it actually shows. Training on these labeled images builds a probabilistic model in memory of the most likely digit shown in an image given its grayscale values.

Once the program has gone through the training set and built the model, we can then pass it the test set of data to
see how well it has learned to interpret the images using the model.
The following listing shows how to implement these steps in code.
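A sketch of those steps, using a Gaussian Naive Bayes classifier (the split ratio and random seed are assumptions, not taken from the original listing):

```python
from sklearn import datasets
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

digits = datasets.load_digits()
X = digits.images.reshape((len(digits.images), -1))  # flatten 8x8 -> 64 features
y = digits.target

# Hold out a test set so the model is judged on images it never saw
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GaussianNB().fit(X_train, y_train)  # build the probabilistic model
y_pred = model.predict(X_test)

cm = confusion_matrix(y_test, y_pred)  # rows: actual digit, columns: predicted digit
print(cm)
```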
The end result of this code is called a confusion matrix, such as the one shown in figure 3.6. Returned as a two-
dimensional array, it shows how often the number predicted was the correct number on the main diagonal and also in the
matrix entry (i,j), where j was predicted but the image showed i.

Looking at figure 3.6, we can see that the model predicted the number 2 correctly 17 times (at coordinates 3,3), but also that it predicted the number 8 fifteen times when the image actually showed a 2 (at 9,3).
From the confusion matrix, we can deduce that for most images the predictions are quite accurate. In a good
model you’d expect the sum of the numbers on the main diagonal of the matrix (also known as the matrix trace)
to be very high compared to the sum of all matrix entries, indicating that the predictions were correct for the
most part.

Let’s assume we want to show off our results in a more easily understandable way or we want to inspect several of
the images and the predictions our program has made: we can use the following code to display one next to the
other. Then we can see where the program has gone wrong and needs a little more training. If we’re satisfied with
the results, the model building ends here and we arrive at step six: presenting the results.
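Displaying a handful of test images next to the model's guesses can be sketched as follows (the layout details and output file name are assumptions):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so this also runs without a display
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

digits = datasets.load_digits()
X = digits.images.reshape((len(digits.images), -1))
X_train, X_test, y_train, y_test, img_train, img_test = train_test_split(
    X, digits.target, digits.images, random_state=0)

y_pred = GaussianNB().fit(X_train, y_train).predict(X_test)

# Show the first six test images with the predicted digit above each one
fig, axes = plt.subplots(1, 6, figsize=(9, 2))
for ax, image, pred in zip(axes, img_test, y_pred):
    ax.matshow(image, cmap=plt.cm.gray_r)
    ax.set_title(f"pred: {pred}")
    ax.axis("off")
fig.savefig("predictions.png")
```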
Figure 3.7 shows how all predictions seem to be correct except for the digit number 2, which the model labels as 8.
We should forgive this mistake, as this 2 does share visual similarities with an 8.
The bottom left number is ambiguous, even to humans; is it a 5 or a 3? It’s debatable, but the algorithm
thinks it’s a 3. By discerning which images were misinterpreted, we can train the model further by labeling
them with the correct number they display and feeding them back into the model as a new training set
(step 5 of the data science process). This will make the model more accurate, so the cycle of learn, predict,
correct continues and the predictions become more accurate. This is a controlled data set we’re using for
the example. All the examples are the same size and they are all in 16 shades of gray.

In this supervised learning example, it’s apparent that without the labels associated with each image telling the
program what number that image shows, a model cannot be built and predictions cannot be made.
Unsupervised Learning

CASE STUDY: FINDING LATENT VARIABLES IN A WINE QUALITY DATA SET

In this short case study, you’ll use a technique known as Principal Component Analysis (PCA) to find latent variables in a
data set that describes the quality of wine.
Then you’ll compare how well a set of latent variables works in predicting the quality of wine against the original
observable set.
Part one of the data science process is to set our research goal: We want to explain the subjective “wine quality”
feedback using the different wine properties.
Our first job then is to download the data set (step two: acquiring data), as shown in the following listing, and
prepare it for analysis (step three: data preparation).
Then we can run the PCA algorithm and view the results to look at our options.
Because PCA is an explorative technique, we now arrive at step four of the data science process:
Data exploration, as shown in the following listing.
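The wine-quality CSV itself isn't reproduced here, so the following sketch uses Scikit-learn's bundled wine data (13 physicochemical features) as a stand-in; the same calls apply to the UCI wine-quality file. Standardizing first matters because PCA is sensitive to feature scales:

```python
from sklearn.datasets import load_wine  # stand-in for the UCI wine-quality data
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_wine().data)  # put features on one scale

pca = PCA().fit(X)
# explained_variance_ratio_ is what the elbow plot draws: the share of total
# information each successive latent variable captures
for i, ratio in enumerate(pca.explained_variance_ratio_):
    print(i, round(ratio, 3))
```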
The plot generated from the wine data set is shown in figure 3.8. What you hope to see is an elbow or hockey stick
shape in the plot. This indicates that a few variables can represent the majority of the information in the data set while
the rest only add a little more.
In our plot, PCA tells us that reducing the set down to one variable can capture approximately 28% of the total
information in the set (the plot is zero-based, so variable one is at position zero on the x axis), two variables will capture
approximately 17% more or 45% total, and so on. Table 3.3 shows you the full read-out.

An elbow shape in the plot suggests that five variables can hold most of the information found inside the data. You could argue for a cut-off at six or seven variables instead, but we'll opt for the simpler data set and accept that it retains slightly less of the original variance.
At this point, we could go ahead and see if the original data set recoded with five latent variables is good enough to
predict the quality of the wine accurately, but before we do, we’ll see how we might identify what they represent.
INTERPRETING THE NEW VARIABLES
With the initial decision made to reduce the data set from 11 original variables to 5 latent variables, we can check to
see whether it’s possible to interpret or name them based on their relationships with the originals.
Actual names are easier to work with than codes such as lv1, lv2, and so on. We can add the line of code in the
following listing to generate a table that shows how the two sets of variables correlate.
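That correlation table can be sketched as follows (again using the bundled wine data as a stand-in; the lv1, lv2, … names are illustrative):

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X = StandardScaler().fit_transform(wine.data)

n_latent = 5
scores = PCA(n_components=n_latent).fit_transform(X)  # the latent variables
lv = pd.DataFrame(scores, columns=[f"lv{i + 1}" for i in range(n_latent)])
orig = pd.DataFrame(X, columns=wine.feature_names)

# One row per original variable, one column per latent variable
corr = pd.concat([orig, lv], axis=1).corr().loc[wine.feature_names, lv.columns]
print(corr.round(2))
```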
COMPARING THE ACCURACY OF THE ORIGINAL DATA SET WITH LATENT VARIABLES
Now that we’ve decided our data set should be recoded into 5 latent variables rather than the 11 originals, it’s time to
see how well the new data set works for predicting the quality of wine when compared to the original. We’ll use the
Naïve Bayes Classifier algorithm we saw in the previous example for supervised learning to help.
Let’s start by seeing how well the original 11 variables could predict the wine quality scores.
The following listing presents the code to do this.
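A sketch of that baseline, using the bundled wine data as a stand-in for the wine-quality set (its class label stands in for the quality score):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

wine = load_wine()  # stand-in: the UCI set has 11 variables plus a quality score
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, random_state=0)

model = GaussianNB().fit(X_train, y_train)  # train on all original variables
baseline = model.score(X_test, y_test)      # fraction of test wines predicted correctly
print(f"accuracy with all original variables: {baseline:.2f}")
```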
Now we’ll run the same prediction test, but starting with only 1 latent variable instead of the original 11. Then we’ll
add another, see how it did, add another, and so on to see how the predictive performance improves. The following
listing shows how this is done.
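The add-one-variable-at-a-time loop can be sketched like this (same stand-in data; the accuracies collected in scores are what a plot like figure 3.9 would draw):

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X = StandardScaler().fit_transform(wine.data)
X_train, X_test, y_train, y_test = train_test_split(X, wine.target, random_state=0)

scores = []
for k in range(1, X.shape[1] + 1):
    pca = PCA(n_components=k).fit(X_train)  # fit PCA on training data only
    model = GaussianNB().fit(pca.transform(X_train), y_train)
    scores.append(model.score(pca.transform(X_test), y_test))
    print(k, round(scores[-1], 3))
```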
The resulting plot is shown in figure 3.9.
The plot in figure 3.9 shows that with only 3 latent variables, the classifier does a better job of predicting
wine quality than with the original 11. Also, adding more latent variables beyond 5 doesn’t add as much
predictive power as the first 5.
This shows our choice of cutting off at 5 variables was a good one, as we’d hoped.
We looked at how to group similar variables, but it’s also possible to group observations.
Semi-supervised learning

It shouldn’t surprise you to learn that while we’d like all our data to be labeled so we can use the more powerful
supervised machine learning techniques, in reality we often start with only minimally labeled data, if it’s labeled at
all. We can use our unsupervised machine learning techniques to analyze what we have and perhaps add labels to
the data set, but it will be prohibitively costly to label it all.

Our goal then is to train our predictor models with as little labeled data as possible. This is where semi-supervised
learning techniques come in—hybrids of the two approaches we’ve already seen.
Take for example the plot in figure 3.12. In this case, the data has only two labeled observations; normally this is too
few to make valid predictions.
A common semi-supervised learning technique is label propagation. In this technique, you start with a labeled data set and give the same label to similar data points. This is similar to running a clustering algorithm over the data set and labeling each cluster based on the labels it contains. If we were to apply this approach to the data set in figure 3.12, we might end up with something like figure 3.13.

One special approach to semi-supervised learning worth mentioning here is active learning. In active learning, the program points out the observations it wants to see labeled for its next round of learning, based on criteria you have specified. For example, you might have it ask for labels on the observations the algorithm is least certain about, or you might use multiple models to make a prediction and select the points where the models disagree the most.
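Label propagation is available directly in Scikit-learn, where unlabeled points are marked with -1. A minimal sketch on two synthetic clusters with one labeled observation each (the cluster positions are illustrative, mimicking figure 3.12):

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

rng = np.random.default_rng(0)
# Two clusters of points, as in figure 3.12
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])

# Only one labeled observation per cluster; -1 means "unlabeled"
y = np.full(100, -1)
y[0], y[50] = 0, 1

model = LabelPropagation().fit(X, y)  # spread labels to nearby, similar points
print(model.transduction_[:3], model.transduction_[50:53])
```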
