Chapter5 - Machine Learning
Many different types of machine learning algorithms have been designed to help
solve complex real-world problems. ML algorithms are automated and
self-modifying, so they continue to improve over time.
Before we delve into the top 10 machine learning algorithms you should know,
let's take a look at the different types of machine learning algorithms and how
they are classified.
• Supervised Learning
• Unsupervised Learning
• Semi-supervised Learning
• Reinforcement Learning
However, these four types of ML algorithms are further classified into more
specific types.
What Are The 10 Popular Machine Learning Algorithms?
Below is the list of Top 10 commonly used Machine Learning (ML) Algorithms:
• Linear regression
• Logistic regression
• Decision tree
• SVM algorithm
• KNN algorithm
• K-means
How Learning These Vital Algorithms Can Enhance Your Skills in Machine
Learning
The most popular machine learning algorithms fall into three types:
supervised learning, unsupervised learning, and reinforcement learning. All
three techniques are used in this list of 10 common machine learning
algorithms:
List of Popular Machine Learning Algorithms
1. Linear Regression
Linear regression fits a line to the data, described by the equation
Y = a*X + b. In this equation:
• Y – Dependent Variable
• a – Slope
• X – Independent variable
• b – Intercept
The coefficients a and b are derived by minimizing the sum of the squared
distances between the data points and the regression line.
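As a minimal sketch of that minimization, the closed-form least squares solution for a single input can be written in a few lines of plain Python. The function name fit_line and the sample points are hypothetical and chosen only for illustration.

```python
def fit_line(xs, ys):
    """Return (a, b) minimizing the sum of squared residuals of y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - a * mean_x
    return a, b


# Hypothetical points lying exactly on y = 2x + 1.
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)
```

The sketch assumes at least two distinct X values; with identical X values the slope is undefined.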
2. Logistic Regression
The methods listed below are often used to help improve logistic regression
models:
• include interaction terms
• eliminate features
• apply regularization techniques
3. Decision Tree
5. Naive Bayes Algorithm
Even if the features of a data point are related to each other, a Naive Bayes
classifier would consider all of these properties independently when
calculating the probability of a particular outcome.
A Naive Bayesian model is easy to build and useful for massive datasets.
Despite its simplicity, it is known to sometimes outperform even highly
sophisticated classification methods.
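The independence assumption can be illustrated with a short sketch: the score for each class is its prior multiplied by one likelihood per observed feature, and the scores are then normalized. The class labels, feature names, and probabilities below are hypothetical values made up for illustration, not estimates from any dataset.

```python
def naive_bayes_posteriors(priors, likelihoods, observed):
    """Return normalized P(class | observed features) under the naive assumption."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        # Independence assumption: multiply the per-feature likelihoods.
        for feature in observed:
            score *= likelihoods[label][feature]
        scores[label] = score
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


# Hypothetical priors and likelihoods, for illustration only.
priors = {"yes": 0.4, "no": 0.6}
likelihoods = {
    "yes": {"feature_1": 0.7, "feature_2": 0.8},
    "no": {"feature_1": 0.1, "feature_2": 0.3},
}
posteriors = naive_bayes_posteriors(priors, likelihoods, ["feature_1", "feature_2"])
print(posteriors)
```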
6. KNN (K-Nearest Neighbors) Algorithm
KNN can be easily understood by comparing it to real life. For example, if you
want information about a person, it makes sense to talk to his or her friends
and colleagues!
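In code, "asking the neighbors" might look like the following sketch, which assumes Euclidean distance and a simple majority vote. The function name knn_predict and the sample points are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort the training examples by Euclidean distance to the query.
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    # The most common label among the k neighbors wins.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical two-dimensional points in two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels = ["a", "a", "a", "b", "b", "b"]
prediction = knn_predict(points, labels, (1.2, 1.4), k=3)
print(prediction)
```

An odd k is a common choice for two-class problems because it avoids tied votes.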
7. K-Means
• Each data point forms a cluster with its closest centroid, which gives K
clusters.
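The assignment step above, combined with the usual centroid-update step, can be sketched as follows. Only the assignment step is described in the text, so the update step, the starting centroids, and the sample points are assumptions made for illustration.

```python
import math


def k_means(points, centroids, iterations=10):
    """Run Lloyd's algorithm from the given starting centroids (a list of tuples)."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroid
            for centroid, cluster in zip(centroids, clusters)
        ]
        if new_centroids == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    return centroids, clusters


# Hypothetical points and starting centroids (K = 2).
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

Real implementations usually pick the starting centroids at random and rerun several times, since the result depends on the initialization.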
In today's world, vast amounts of data are being stored and analyzed by
corporations, government agencies, and research organizations. As a data
scientist, you know that this raw data contains a lot of information. The
challenge is to identify significant patterns and variables.
In practice, x almost always represents multiple data points. So, for example, a
housing price predictor might consider not only square footage (x1) but also
number of bedrooms (x2), number of bathrooms (x3), number of floors (x4),
year built (x5), ZIP code (x6), and so forth. Determining which inputs to use is
an important part of ML design. However, for the sake of explanation, it is
easiest to assume a single input value.
So let's say our simple predictor has the form h(x) = θ0 + θ1x, where θ0 and
θ1 are constants. Our goal is to find the values of θ0 and θ1 that make our
predictor work as well as possible.
Optimizing the predictor h(x) is done using training examples. For each
training example, we have an input value x_train, for which a corresponding
output, y, is known in advance. For each example, we find the difference
between the known, correct value y, and our predicted value h(x_train).
With enough training examples, these differences give us a useful way to
measure the “wrongness” of h(x). We can then tweak h(x) by tweaking the
values of θ0 and θ1 to make it “less wrong”. This process is repeated until
the system has converged on the best values for θ0 and θ1. In this way, the
predictor becomes trained, and is ready to do some real-world predicting.
We’re using simple problems for the sake of illustration, but the reason ML
exists is that, in the real world, problems are much more complex. On this
flat screen, we can present a picture of, at most, a three-dimensional dataset,
but ML problems often deal with data with millions of dimensions and very
complex predictor functions. ML solves problems that cannot be solved by
numerical means alone.
With that in mind, let’s look at another simple example. Say we have the
following training data, wherein company employees have rated their
satisfaction on a scale of 1 to 100:
First, notice that the data is a little noisy. That is, while we can see that there is
a pattern to it (i.e., employee satisfaction tends to go up as salary goes up), it
does not all fit neatly on a straight line. This will always be the case with real-
world data (and we absolutely want to train our machine using real-world
data). How can we train a machine to perfectly predict an employee’s level of
satisfaction? The answer, of course, is that we can’t. The goal of ML is never to
make “perfect” guesses because ML deals in domains where there is no such
thing. The goal is to make guesses that are good enough to be useful.
It’s obvious that this is a terrible guess and that this machine doesn’t know
very much.
Now let’s give this predictor all the salaries from our training set, and note the
differences between the resulting predicted satisfaction ratings and the actual
satisfaction ratings of the corresponding employees. If we perform a little
mathematical wizardry (which I will describe later in the article), we can
calculate, with very high certainty, that values of 13.12 for θ0 and 0.61 for
θ1 are going to give us a better predictor.
And if we repeat this process, say 1,500 times, our predictor will end up
looking like this:
At this point, if we repeat the process, we will find that θ0 and θ1 will no
longer change by any appreciable amount, and thus we see that the system has
converged. If we haven’t made any mistakes, this means we’ve found the
optimal predictor. Accordingly, if we now ask the machine again for the
satisfaction rating of the employee who makes $60,000, it will predict a rating
of ~60.
This function takes input in four dimensions and has a variety of polynomial
terms. Deriving a normal equation for this function is a significant challenge.
Many modern machine learning problems take thousands or even millions of
dimensions of data to build predictions using hundreds of coefficients.
Predicting how an organism’s genome will be expressed or what the climate
will be like in 50 years are examples of such complex problems.
Let’s take a closer look at how this iterative process works. In the above
example, how do we make sure θ0 and θ1 are getting better with each step, not
worse? The answer lies in our “measurement of wrongness”, along with a little
calculus. (This is the “mathematical wizardry” mentioned previously.)
The wrongness measure is known as the cost function (aka loss function), J(θ).
The input θ represents all of the coefficients we are using in our predictor.
In our case, θ is really the pair θ0 and θ1. J(θ0, θ1) gives us a mathematical
measurement of how wrong our predictor is when it uses the given values of θ0
and θ1.
The choice of the cost function is another important piece of an ML program.
In different contexts, being “wrong” can mean very different things. In our
employee satisfaction example, the well-established standard is the linear least
squares function:
With least squares, the penalty for a bad guess goes up quadratically with the
difference between the guess and the correct answer, so it acts as a very
“strict” measurement of wrongness. The cost function computes an average
penalty across all the training examples.
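As a rough sketch, a least squares cost for the single-input predictor can be computed as below. It assumes the common convention J(θ0, θ1) = (1/2m) Σ (h(x_i) - y_i)^2 over m training examples, where the factor 1/2 only simplifies later derivatives. The function names and the training data are hypothetical.

```python
def predict(theta0, theta1, x):
    """Linear predictor h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def least_squares_cost(theta0, theta1, xs, ys):
    """Average squared error over the training set, with the conventional 1/2 factor."""
    m = len(xs)
    total = sum((predict(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)


# Hypothetical training data that follows y = 2x exactly.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
perfect_cost = least_squares_cost(0.0, 2.0, xs, ys)  # predictor matches the data
worse_cost = least_squares_cost(0.0, 1.0, xs, ys)    # predictor underestimates
print(perfect_cost, worse_cost)
```

Because the errors are squared, doubling an error quadruples its penalty, which is the "strict" behavior described above.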
Now we see that our goal is to find θ0 and θ1 for our predictor h(x) such that
our cost function J(θ0, θ1) is as small as possible. We call on the power of
calculus to accomplish this.
Consider the following plot of a cost function for some particular machine
learning problem:
Here we can see the cost associated with different values of θ0 and θ1. We can
see the graph has a slight bowl to its shape. The bottom of the bowl represents
the lowest cost our predictor can give us based on the given training data. The
goal is to “roll down the hill” and find θ0 and θ1 corresponding to this point.
This is where calculus comes into this machine learning tutorial. For the sake
of keeping this explanation manageable, I won’t write out the equations here,
but essentially what we do is take the gradient of J(θ0, θ1), which is the pair
of derivatives of J(θ0, θ1) (one over θ0 and one over θ1). The gradient will be
different for every different value of θ0 and θ1, and defines the “slope of the
hill” and, in particular, “which way is down” for these particular θs. For
example, when we plug our current values of θ into the gradient, it may tell us
that adding a little to θ0 and subtracting a little from θ1 will take us in the
direction of the cost function-valley floor. Therefore, we add a little to θ0,
subtract a little from θ1, and voilà! We have completed one round of our
learning algorithm. Our updated predictor, h(x) = θ0 + θ1x, will return better
predictions than before. Our machine is now a little bit smarter.
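One round of this "rolling downhill" can be sketched in Python as follows. The sketch assumes the least squares cost with a 1/(2m) factor, so the two partial derivatives are the mean error and the mean of error times x. The learning rate, step count, and training data are arbitrary illustrative choices, not values from the text.

```python
def gradient_step(theta0, theta1, xs, ys, learning_rate):
    """Perform one gradient descent update for h(x) = theta0 + theta1 * x."""
    m = len(xs)
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    # Partial derivatives of J = (1 / 2m) * sum(error^2).
    grad0 = sum(errors) / m
    grad1 = sum(e * x for e, x in zip(errors, xs)) / m
    # Step "downhill", against the gradient.
    return theta0 - learning_rate * grad0, theta1 - learning_rate * grad1


def train(xs, ys, learning_rate=0.05, steps=5000):
    """Repeat the update from an arbitrary starting point of (0, 0)."""
    theta0, theta1 = 0.0, 0.0
    for _ in range(steps):
        theta0, theta1 = gradient_step(theta0, theta1, xs, ys, learning_rate)
    return theta0, theta1


# Hypothetical noise-free data generated from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
theta0, theta1 = train(xs, ys)
print(theta0, theta1)
```

If the learning rate is too large the updates overshoot the valley floor and diverge; if it is too small, convergence takes many more steps.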
That covers the basic theory underlying the majority of supervised machine
learning systems. But the basic concepts can be applied in a variety of ways,
depending on the problem at hand.
Classification Problems in Machine Learning
As it turns out, the underlying machine learning theory is more or less the
same. The major differences are the design of the predictor h(x) and the
design of the cost function J(θ).
Our examples so far have focused on regression problems, so now let’s take a
look at a classification example.
Here are the results of a cookie quality testing study, where the training
examples have all been labeled as either “good cookie” ( y = 1 ) in blue or “bad
cookie” ( y = 0 ) in red.
It turns out there’s a nice function that captures this behavior well. It’s called
the sigmoid function, g(z) , and it looks something like this:
Notice that the sigmoid function transforms our output into the range
between 0 and 1.
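For reference, the sigmoid is defined as g(z) = 1 / (1 + e^(-z)). A small sketch follows; the two-branch form is an implementation choice made here to avoid floating-point overflow for large negative inputs.

```python
import math


def sigmoid(z):
    """Logistic sigmoid g(z) = 1 / (1 + e^(-z)), computed in a numerically stable way."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, rewrite as e^z / (1 + e^z) so exp() never overflows.
    e = math.exp(z)
    return e / (1.0 + e)


for z in (-5.0, 0.0, 5.0):
    print(z, sigmoid(z))
```

An input of 0 maps to exactly 0.5, large positive inputs approach 1, and large negative inputs approach 0.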
The logic behind the design of the cost function is also different in
classification. Again we ask “What does it mean for a guess to be wrong?” and
this time a very good rule of thumb is that if the correct guess was 0 and we
guessed 1, then we were completely wrong—and vice-versa. Since you can’t
be more wrong than completely wrong, the penalty in this case is enormous.
Alternatively, if the correct guess was 0 and we guessed 0, our cost function
should not add any cost for each time this happens. If the guess was right, but
we weren’t completely confident (e.g., y = 1 , but h(x) = 0.8 ), this should
come with a small cost, and if our guess was wrong but we weren’t completely
confident (e.g., y = 1 but h(x) = 0.3 ), this should come with some significant
cost but not as much as if we were completely wrong.
This behavior is captured by the log function, such that:
Again, the cost function gives us the average cost over all of our training
examples.
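The log-based penalty described above is commonly written as -log(h(x)) when y = 1 and -log(1 - h(x)) when y = 0. The sketch below assumes that standard form and that h(x) lies strictly between 0 and 1; the function names are hypothetical.

```python
import math


def example_cost(y, h):
    """Penalty for one example: -log(h) when y == 1, -log(1 - h) when y == 0."""
    return -math.log(h) if y == 1 else -math.log(1.0 - h)


def average_log_cost(ys, hs):
    """Average the per-example penalty over all training examples."""
    return sum(example_cost(y, h) for y, h in zip(ys, hs)) / len(ys)


# The two cases from the text: right but not fully confident, and wrong but unsure.
confident_right = example_cost(1, 0.8)
unsure_wrong = example_cost(1, 0.3)
print(confident_right, unsure_wrong)
```

As h(x) approaches the completely wrong answer, the penalty grows without bound, which matches the "enormous" cost for being completely wrong.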
So here we’ve described how the predictor h(x) and the cost function J(θ)
differ between regression and classification, but gradient descent still works
fine.
A classification predictor can be visualized by drawing the boundary line; i.e.,
the barrier where the prediction changes from a “yes” (a prediction greater
than 0.5) to a “no” (a prediction less than 0.5). With a well-designed system,
our cookie data can generate a classification boundary that looks like this:
The machine learning algorithms used for unsupervised learning are very
different from those used for supervised learning, and the topic merits its own
post. However, for something to chew on in the meantime, take a look at
clustering algorithms such as k-means, and also look into dimensionality
reduction systems such as principal component analysis.
We’ve covered much of the basic theory underlying the field of machine
learning but, of course, we have only scratched the surface.
Keep in mind that to really apply the theories contained in this introduction to
real-life machine learning examples, a much deeper understanding of these
topics is necessary. There are many subtleties and pitfalls in ML and many
ways to be led astray by what appears to be a perfectly well-tuned thinking
machine. Almost every part of the basic theory can be played with and altered
endlessly, and the results are often fascinating. Many grow into whole new
fields of study that are better suited to particular problems.
What is a Confusion Matrix in Machine Learning?
when it is not predicting the minority classes. This is where confusion matrices
are useful.
Here we will use hierarchical clustering to group data points and visualize the
clusters using both a dendrogram and scatter plot.
Have you been in a situation where you expected your machine learning model
to perform really well but it sputtered out a poor accuracy? You’ve done all the
hard work – so where did the classification model go wrong? How can you
correct this?
There are plenty of ways to gauge the performance of your classification model
but none have stood the test of time like the confusion matrix. It helps us
evaluate how our model performed, where it went wrong and offers us
guidance to correct our path.
In this article, we will explore what a confusion matrix is in machine learning
and how it gives a holistic view of the performance of your model. And despite
its name, you will realize that a confusion matrix is a pretty simple yet
powerful concept. So let’s unravel the mystery around the confusion matrix!
But wait – what’s TP, FP, FN and TN here? That’s the crucial part of a confusion
matrix. Let’s understand each term below.
Understanding True Positive, True Negative, False Positive and False Negative
in a Confusion Matrix
• True Positive (TP) = 560; meaning 560 positive class data points were
correctly classified by the model
• True Negative (TN) = 330; meaning 330 negative class data points were
correctly classified by the model
• False Positive (FP) = 60; meaning 60 negative class data points were
incorrectly classified as belonging to the positive class by the model
• False Negative (FN) = 50; meaning 50 positive class data points were
incorrectly classified as belonging to the negative class by the model
This turned out to be a pretty decent classifier for our dataset considering the
relatively larger number of true positive and true negative values.
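Using the four counts listed above, overall accuracy is the share of correct predictions, (TP + TN) divided by the total number of data points. A quick check in Python:

```python
# Counts taken from the example above.
tp, tn, fp, fn = 560, 330, 60, 50

total = tp + tn + fp + fn
# Accuracy: correctly classified points divided by all points.
accuracy = (tp + tn) / total
print(total, accuracy)
```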
Remember the Type 1 and Type 2 errors: a false positive is a Type 1 error and a
false negative is a Type 2 error. Interviewers love to ask about the difference
between these two.
2. Why Do We Need a Confusion Matrix?
Let’s say you want to predict how many people are infected with a contagious
virus before they show symptoms, and isolate them from the healthy population
(ringing any bells yet?). The two values for our target variable would be:
Sick and Not Sick.
Our dataset is an example of an imbalanced dataset. There are 947 data points
for the negative class and 3 data points for the positive class. This
But it is giving the wrong idea about the result. Think about it.
Our model is saying “I can predict sick people 96% of the time”. However, it is
doing the opposite. It is predicting the people who will not get sick with 96%
accuracy while the sick are spreading the virus!
Do you think this is a correct metric for our model given the seriousness of the
issue? Shouldn’t we be measuring how many positive cases we can predict
correctly to arrest the spread of the contagious virus? Or maybe, out of the
cases predicted as positive, how many are actually positive, to check the
reliability of our model?
This is where we come across the dual concept of Precision and Recall.
Precision tells us how many of the cases predicted as positive actually turned
out to be positive.
Recall tells us how many of the actual positive cases we were able to predict
correctly with our model.
50% of the cases predicted as positive turned out to be positive cases,
whereas 75% of the actual positives were successfully predicted by our model.
Awesome!
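In terms of the confusion matrix counts, precision is TP / (TP + FP) and recall is TP / (TP + FN). The sketch below uses hypothetical counts (TP = 30, FP = 30, FN = 10), chosen only because they are consistent with the 50% and 75% figures quoted above.

```python
def precision(tp, fp):
    """Share of predicted positives that are actually positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positives that the model predicted as positive."""
    return tp / (tp + fn)


# Hypothetical counts consistent with the 50% / 75% figures in the text.
tp, fp, fn = 30, 30, 10
p = precision(tp, fp)
r = recall(tp, fn)
print(p, r)
```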
Recall is a useful metric in cases where a false negative is costlier than a
false positive. Recall is important in medical cases where it doesn’t matter
whether we raise a false alarm but the actual positive cases should not go
undetected!
But there will be cases where there is no clear distinction between whether
Precision is more important or Recall. What should we do in those cases? We
combine them!
4. F1-Score
In practice, when we try to increase the precision of our model, the recall goes
down, and vice-versa. The F1-score captures both the trends in a single value:
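The F1-score is the harmonic mean of precision and recall: F1 = 2 * Precision * Recall / (Precision + Recall). A minimal sketch follows; the function name harmonic_f1 is hypothetical, and returning 0 when both inputs are 0 is a convention assumed here.

```python
def harmonic_f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# With a precision of 0.5 and a recall of 0.75:
f1 = harmonic_f1(0.5, 0.75)
print(f1)
```

Because it is a harmonic mean, the score is pulled toward the lower of the two values, so a model cannot score well by maximizing only one of them.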
But there is a catch here. The interpretability of the F1-score is poor. This
means that we don’t know what our classifier is maximizing – precision or
recall? So, we use it in combination with other evaluation metrics, which
together give us a complete picture of the result.
5. Confusion Matrix using scikit-learn in Python
You know the theory – now let’s put it into practice. You can create a confusion
matrix with the Scikit-learn (sklearn) library in Python.
We can import the confusion_matrix function from sklearn.metrics. Let’s split
our dataset into the input features and target output dataset.
Figure 11: Splitting data into variables and target dataset
As we can see, our data contains a wide range of values: some are single
digits, and some have three digits. To make our calculations more
straightforward, we will scale our data and reduce it to a small range of values
using the StandardScaler.
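To show what this kind of scaling does, here is a plain-Python sketch of standardization for a single feature: subtract the mean and divide by the standard deviation. It assumes the population standard deviation, which is what scikit-learn's StandardScaler uses; the sample values are hypothetical.

```python
import statistics


def standardize(values):
    """Rescale values to zero mean and unit variance (population standard deviation)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical feature column with mean 5 and standard deviation 2.
scaled = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(scaled)
```

The sketch does not handle a constant column, where the standard deviation is zero.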
Now, let's split our dataset into two: one to train our model and another to
test our model. To do this, we use train_test_split imported from sklearn.
Using a Logistic Regression Model, we will perform Classification on our train
data and predict our test data to check the accuracy.
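The scaling, splitting, and training steps might look like the sketch below. The chapter's own dataset is not reproduced here, so the sketch generates a synthetic stand-in with make_classification; the sample size, split ratio, and random seeds are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 rows, 5 features, binary target.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold out 25% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the classifier and predict the held-out rows.
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
pred = model.predict(X_test_scaled)
print(pred[:10])
```

Note one design choice: this sketch splits first and fits the scaler on the training portion only, a common precaution so that test-set statistics do not leak into training, whereas the text scales before splitting.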
Confusion Matrix for Machine Learning
To find the accuracy and the other metrics derived from a confusion matrix, we
can import accuracy_score and classification_report from the same library.
Using the predicted values (pred) and our actual values (y_test), we can create
a confusion matrix with the confusion_matrix function.
Then, using the ravel() method of the array returned by confusion_matrix, we
can get the True Negative, False Positive, False Negative, and True Positive
values, in that order.
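A small sketch of these two calls follows. The label lists are made up for illustration and stand in for y_test and pred; 1 is treated as the positive class.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration; 1 is the positive class.
y_test = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_test, pred)

# For binary labels sorted as [0, 1], ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()
print(cm)
print(tn, fp, fn, tp)
```

The unpacking order is easy to get wrong: it starts with the true negatives, not the true positives.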
Figure 16: Extracting matrix value
Finally, using the classification report, we can find the values of various metrics
of our confusion matrix.
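As a sketch, both helpers take the actual labels first and the predicted labels second. The label lists below are made up for illustration and stand in for y_test and pred.

```python
from sklearn.metrics import accuracy_score, classification_report

# Made-up labels for illustration; 1 is the positive class.
y_test = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

# Fraction of predictions that match the actual labels.
accuracy = accuracy_score(y_test, pred)

# Text table with per-class precision, recall, F1-score, and support.
report = classification_report(y_test, pred)
print(accuracy)
print(report)
```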