Machine Learning (6CS4-02) Unit-1 Notes
PSO1: Ability to interpret and analyze network specific and cyber security issues, automation
in real world environment.
PSO2: Ability to Design and Develop Mobile and Web-based applications under realistic
constraints.
Course Outcome:
CO-PO Mapping:
CO: Understand the concept of machine learning and apply supervised learning techniques.
Mapping: PO1: 3, PO2: 3, PO3: 3, PO4: 3, PO5: 2, PO6: 1, PO7: 1, PO8: 1, PO9: 1, PO10: 2, PO11: 1, PO12: 3
Unit-I (10 lectures). Topics, with lectures required and cumulative lecture number:
1. Introduction to subject and scope (1 lecture; Lecture No. 1)
2. Introduction to learning, Types of learning and Applications (1 lecture; Lecture No. 2)
3. Supervised Learning (1 lecture; Lecture No. 3)
4. Linear Regression Model (1 lecture; Lecture No. 4)
5. Naïve Bayes Classifier (1 lecture; Lecture No. 5)
6. Decision Tree (1 lecture; Lecture No. 6)
7. K-nearest Neighbor (1 lecture; Lecture No. 7)
8. Logistic Regression (1 lecture; Lecture No. 8)
9. Support Vector Machine (1 lecture; Lecture No. 9)
10. Random Forest Algorithm (1 lecture; Lecture No. 10)
Text Book:
Machine Learning by Tom M. Mitchell
Introduction:
Machine learning is a subfield of artificial intelligence (AI). The goal of machine learning
generally is to understand the structure of data and fit that data into models that can be
understood and utilized by people.
Although machine learning is a field within computer science, it differs from traditional
computational approaches. In traditional computing, algorithms are sets of explicitly
programmed instructions used by computers to calculate or solve problems. Machine learning
algorithms instead allow computers to train on data inputs and use statistical analysis to
output values that fall within a specific range. Because of this, machine learning lets
computers build models from sample data in order to automate decision-making processes
based on data inputs.
Two of the most widely adopted machine learning methods are supervised learning, which
trains algorithms on example input and output data that is labeled by humans, and
unsupervised learning, which provides the algorithm with no labeled data so that it can find
structure within its input data on its own.
Supervised Learning
Supervised learning: Supervised learning is when the model is trained on a labelled dataset.
A labelled dataset is one that has both input and output parameters. In this type of learning,
both the training and validation datasets are labelled.
Learning (training): Learn a model using the training data.
Testing: Test the model using unseen test data to assess the model accuracy
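As a minimal sketch of this learn-then-test workflow, the following Python snippet (using scikit-learn and its built-in Iris dataset, both chosen here purely for illustration; the classifier is one of the algorithms covered later in this unit) trains a model on labelled data and then measures its accuracy on unseen test data:

# Sketch of the supervised learning workflow: train on labelled data, test on unseen data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)          # labelled dataset: inputs X, outputs y

# Learning (training): learn a model using the training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# Testing: assess accuracy on unseen test data
print("Test accuracy:", model.score(X_test, y_test))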
This is similar to a teacher-student scenario. A teacher guides the student to learn from books
and other materials. The student is then tested; if the answers are correct, the student passes.
Otherwise, the teacher corrects the student and makes the student learn from the mistakes
made in the past. That is the basic principle of Supervised Learning.
Suppose you have a niece who has just turned 2 years old and is learning to speak. She knows
the words Papa and Mumma, as her parents have taught her what to call them. You want to
teach her what a dog and a cat are. So what do you do? You either show her videos of dogs
and cats, or you bring a dog and a cat and show them to her in real life so that she can
understand how they are different.
Now there are certain things you tell her so that she understands the differences between the 2
animals.
Now you take your niece back home and show her pictures of different dogs and cats. If she
is able to differentiate between the dog and cat, you have successfully taught her.
So what happened here? You were there to guide her to the goal of differentiating between a
dog and a cat. You taught her every difference there is between a dog and a cat. You then
tested whether she had learnt. If she had, she called a dog a dog and a cat a cat. If not, you
taught her again until she got it right. You acted as the supervisor and your niece acted as the
algorithm that had to learn. You already knew what a dog was and what a cat was, and you
made sure that she was learning the correct thing. That is the principle that Supervised
Learning follows.
Why is it Important?
Learning gives the algorithm experience, which can be used to output predictions for new
unseen data.
Experience also helps in optimizing the performance of the algorithm.
Real-world computations can also be taken care of by Supervised Learning algorithms.
Types of Supervised Learning
Supervised Learning has been broadly classified into 2 types.
Regression
Classification
Regression is the kind of Supervised Learning that learns from labelled datasets and is then
able to predict a continuous-valued output for new data given to the algorithm. It is used
whenever the required output is a number, such as an amount of money or a height.
Classification:
It is a Supervised Learning task where the output has defined labels (discrete values).
Classification means grouping the output into a class.
For example, consider a dataset where the output "Purchased" has the defined labels 0 and 1;
1 means the customer will purchase and 0 means the customer won't purchase. The goal here
is to predict discrete values belonging to a particular class and to evaluate on the basis of
accuracy, as illustrated in the sketch after this list.
Classification can be either binary or multi-class. In binary classification, the model predicts
one of two labels (0 or 1, yes or no), whereas in multi-class classification the model chooses
among more than two classes.
Example: Gmail classifies mail into more than one class, such as Social, Promotions,
Updates, and Forums.
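A small sketch of binary classification in Python may help here. The tiny "Purchased" dataset below is invented for illustration (age and salary in thousands as inputs, 0/1 as the label); scikit-learn handles multi-class data with exactly the same API:

# Illustrative sketch of binary classification on a tiny, made-up "Purchased" dataset.
# Inputs: [age, salary in thousands]; label: 1 = will purchase, 0 = won't purchase.
from sklearn.linear_model import LogisticRegression

X = [[22, 25], [35, 60], [47, 90], [52, 110], [28, 30], [41, 72]]
y = [0, 0, 1, 1, 0, 1]

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)
print(clf.predict([[45, 85]]))   # predicts a discrete label: 0 or 1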
Supervised Learning algorithms are used in a variety of applications. Let's go through some
of the most well-known of them.
To retrace and summarize what we have covered so far: we had an overview of what Machine
Learning is and its various types, then understood in depth what supervised learning is and
why it is so important, and finally went through the two types of supervised learning,
regression and classification. The sections below discuss the main supervised learning
algorithms one by one, starting with linear regression.
Linear Regression Model:
Linear regression is one of the simplest statistical models in machine learning, and
understanding its algorithm is a crucial first step. It is used to show the linear relationship
between a dependent variable and one or more independent variables.
Before we drill down into linear regression in depth, here is a quick overview of regression,
since Linear Regression is one type of regression algorithm.
Linear Regression: This algorithm assumes that there is a linear relationship between the
two variables, Input (X) and Output (Y), of the data it has learnt from. The input variable is
called the Independent Variable and the output variable is called the Dependent Variable.
When unseen data is passed to the algorithm, it uses the learnt function to calculate and map
the input to a continuous output value.
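As an illustrative sketch (the five data points below are invented, not from the notes), a linear regression can be fitted and then used to map an unseen input to a continuous output like this:

# Minimal linear regression sketch: fit y = w*x + b on a tiny, invented dataset
# and map an unseen input to a continuous output.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])     # independent variable (input)
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])    # dependent variable (output)

model = LinearRegression()
model.fit(X, y)
print(model.coef_, model.intercept_)        # learned slope and intercept
print(model.predict([[6]]))                 # continuous-valued prediction for unseen data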
Logistic Regression: This algorithm predicts discrete values for the set of independent
variables that have been passed to it. It makes the prediction by mapping the unseen data
through the logit (sigmoid) function. The algorithm predicts the probability of the new data
belonging to a class, so its output lies in the range 0 to 1.
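The following small sketch shows the logistic (sigmoid) function that produces this 0-to-1 output; the weight and bias values are hypothetical, standing in for parameters a trained model would learn:

# Sketch of the logistic (sigmoid) function: it maps any real-valued input
# to a probability between 0 and 1. The weight and bias below are made up.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = 0.8, -2.0          # hypothetical learned weight and bias
x_new = 3.5               # unseen input value
p = sigmoid(w * x_new + b)
print(p)                  # probability in (0, 1); predict class 1 if p >= 0.5, else class 0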
Classification, on the other hand, is the kind of learning where the algorithm needs to map
new data to one of the two classes in our dataset. The classes need to be mapped to either 1
or 0, which in real life translates to "Yes" or "No", "Rains" or "Does Not Rain", and so forth.
The output will be one of the classes and not a number as it was in Regression. Some of the
most well-known algorithms are discussed below.
Naïve Bayes Classifier:
Naive Bayes is a classification technique based on Bayes' theorem with an assumption of
independence among predictors, i.e. it assumes that the presence of a particular feature in a
class is unrelated to the presence of any other feature. For example, a fruit may be considered
to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend
on each other or upon the existence of the other features, all of these properties independently
contribute to the probability that this fruit is an apple, and that is why it is known as "Naive".
Naive Bayes model is easy to build and particularly useful for very large data sets. Along with
simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods.
Bayes theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x)
and P(x|c). Look at the equation below:
P(c|x) = P(x|c) * P(c) / P(x)
Above,
P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
P(c) is the prior probability of the class.
P(x|c) is the likelihood, which is the probability of the predictor given the class.
P(x) is the prior probability of the predictor.
Example: consider a training data set of weather conditions (Sunny, Overcast, Rainy) and a
corresponding target variable "Play" (Yes or No), and suppose we want to classify whether
players will play if the weather is Sunny.
Step 1: Convert the data set into a frequency table.
Step 2: Create a Likelihood table by finding the probabilities, e.g. Overcast probability = 0.29
and probability of playing = 0.64.
Step 3: Now use the Naive Bayes equation to calculate the posterior probability for each
class. The class with the highest posterior probability is the outcome of the prediction.
Here, P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which
is the higher probability, so the prediction is Yes.
Naive Bayes uses a similar method to predict the probability of different classes based on
various attributes. This algorithm is mostly used in text classification and in problems having
multiple classes.
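To make the worked example above concrete, the short sketch below recomputes P(Yes | Sunny) in Python. The counts assumed here (9 "Yes" and 5 "No" days out of 14, with 3 of the "Yes" days being Sunny) are the ones consistent with the 0.33, 0.64 and 0.36 figures quoted above:

# Worked sketch of the P(Yes | Sunny) calculation, using counts assumed
# to match the likelihood-table figures in these notes.
p_yes       = 9 / 14                 # P(Yes)         = 0.64
p_sunny     = 5 / 14                 # P(Sunny)       = 0.36
p_sunny_yes = 3 / 9                  # P(Sunny | Yes) = 0.33

p_yes_sunny = p_sunny_yes * p_yes / p_sunny   # Bayes theorem
print(round(p_yes_sunny, 2))                  # ~0.6, so "Yes" is the predicted class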
Pros:
It is easy and fast to predict the class of a test data set. It also performs well in multi-class
prediction.
When the assumption of independence holds, a Naive Bayes classifier performs better
compared to other models like logistic regression, and you need less training data.
It performs well with categorical input variables compared to numerical variable(s). For
numerical variables, a normal distribution is assumed (bell curve, which is a strong
assumption).
Cons:
If a categorical variable has a category in the test data set which was not observed in the
training data set, the model will assign it a zero probability and will be unable to make a
prediction. This is often known as the "Zero Frequency" problem. To solve it, we can use a
smoothing technique; one of the simplest is Laplace estimation (see the sketch below).
On the other side, Naive Bayes is also known to be a bad estimator, so the probability outputs
from predict_proba are not to be taken too seriously.
Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it
is almost impossible to get a set of predictors which are completely independent.
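As a brief sketch of the zero-frequency fix: in scikit-learn, Laplace smoothing corresponds to the alpha parameter of the Naive Bayes estimators (alpha=1.0 is add-one smoothing). The tiny dataset below is invented purely to show the effect:

# Sketch: Laplace (add-one) smoothing avoids zero probabilities for categories
# that never appeared with a given class in the training data.
from sklearn.naive_bayes import CategoricalNB

X_train = [[0], [0], [1], [1], [2]]   # one categorical feature with 3 categories
y_train = [0, 0, 1, 1, 1]

clf = CategoricalNB(alpha=1.0)        # alpha=1.0 is Laplace smoothing
clf.fit(X_train, y_train)
print(clf.predict_proba([[2]]))       # class 0 still gets a small non-zero probability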
Decision Tree:
Let's say that you hosted a huge party and you want to know how many of your guests were
non-vegetarians. To solve this problem, let's create a simple Decision Tree.
(Figure: Decision Tree example - Decision Tree Algorithm.)
The figure shows a Decision Tree that classifies a guest as either vegetarian or non-vegetarian.
Each node represents a predictor variable that will help to conclude whether or not a guest is
a non-vegetarian. As you traverse down the tree, you must make decisions at each node, until
you reach a leaf node.
Now that you know the logic of a Decision Tree, let's define a set of terms related to a
Decision Tree.
Root Node: The root node is the starting point of a tree. At this point, the first split is
performed.
Internal Nodes: Each internal node represents a decision point (predictor variable) that
eventually leads to the prediction of the outcome.
Leaf/Terminal Nodes: Leaf nodes represent the final class of the outcome and therefore
they're also called terminating nodes.
Branches: Branches are the connections between nodes and are represented as arrows. Each
branch represents a response such as yes or no.
So that is the basic structure of a Decision Tree. Now let's try to understand the workflow of
a Decision Tree.
How Does The Decision Tree Algorithm Work?
The Decision Tree Algorithm follows the below steps:
Step 1: Select the feature (predictor variable) that best classifies the data set into the desired
classes and assign that feature to the root node.
Step 2: Traverse down from the root node, whilst making relevant decisions at each internal
node such that each internal node best classifies the data.
Step 3: Route back to step 1 and repeat until you assign a class to the input data.
The above-mentioned steps represent the general workflow of a Decision Tree used for
classification purposes.
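A minimal sketch of this workflow with scikit-learn's DecisionTreeClassifier is shown below. The two yes/no predictor variables ("eats_meat", "eats_eggs") and the guest labels are hypothetical stand-ins for the party example above:

# Sketch of the classification workflow with a decision tree; the data is invented.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0]]   # 1 = yes, 0 = no
y = [1, 1, 0, 0, 1, 0]                                 # 1 = non-vegetarian, 0 = vegetarian

tree = DecisionTreeClassifier(criterion="entropy")     # entropy-based splits, as in ID3
tree.fit(X, y)
print(export_text(tree, feature_names=["eats_meat", "eats_eggs"]))  # learnt splits
print(tree.predict([[0, 1]]))                          # classify a new guest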
ID3 Algorithm:
The ID3 algorithm follows the below workflow in order to build a Decision Tree:
The first step in this algorithm states that we must select the best attribute. What does that
mean? The best attribute (predictor variable) is the one that separates the data set into the
desired classes most effectively, i.e. it is the feature that best splits the data set.
Now the next question in your head must be, "How do I decide which variable/feature best
splits the data?" Two measures are used for this:
1. Information Gain
2. Entropy
What Is Entropy?
Entropy measures the impurity or uncertainty present in the data. It is used to decide how a Decision
Tree can split the data.
Information Gain (IG) is important because it is used to choose the variable that best splits
the data at each node of a Decision Tree. The variable with the highest IG is used to split the
data at the root node.
To understand how Information Gain and Entropy are used to create a Decision Tree, let's
look at an example. The below data set represents the speed of a car based on certain
parameters. Your problem statement is to study this data set and create a Decision Tree that
classifies the speed of a car (response variable) as either slow or fast, depending on the
following predictor variables:
Road type
Obstruction
Speed limit
We'll be building a Decision Tree using these variables in order to predict the speed of a car.
As mentioned earlier, we must first decide which variable best splits the data set, assign that
variable to the root node, and repeat the same process for the other nodes as well.
At this point, you might be wondering how you know which variable best separates the data.
The answer is: the variable with the highest Information Gain best divides the data into the
desired output classes.
So, let's begin by calculating the Entropy and Information Gain (IG) for each of the predictor
variables, starting with "Road type".
In our data set, there are four observations in the "Road type" column that correspond to four
labels in the "Speed of car" column. We shall begin by calculating the entropy of the parent
node (Speed of car).
Step one is to find out the fraction of the two classes present in the parent node. We know
that there are a total of four values present in the parent node, out of which two samples
belong to the "slow" class and the other two belong to the "fast" class, therefore:
p(slow) = no. of "slow" outcomes in the parent node / total number of outcomes = 2/4 = 0.5
p(fast) = no. of "fast" outcomes in the parent node / total number of outcomes = 2/4 = 0.5
Entropy(parent) = -p(slow) * log2(p(slow)) - p(fast) * log2(p(fast)) = -(0.5)(-1) - (0.5)(-1) = 1
Now that we know that the entropy of the parent node is 1, let's see how to calculate the
Information Gain for the "Road type" variable. Remember that, if the Information Gain of the
"Road type" variable is greater than the Information Gain of all the other predictor variables,
only then can the root node be split by using the "Road type" variable.
In order to calculate the Information Gain of the "Road type" variable, we first need to split
the root node by the "Road type" variable.
(Figure: the parent node split by the "Road type" variable - Decision Tree Algorithm.)
In the above illustration, we've split the parent node by using the "Road type" variable; the
child nodes denote the corresponding responses as shown in the data set. Now, we need to
measure the entropy of the child nodes.
The entropy of the right-hand side child node (fast) is 0, because all of the outcomes in this
node belong to one class (fast). In a similar manner, we must find the entropy of the left-hand
side node (slow, slow, fast).
In this node there are two types of outcomes (fast and slow), therefore we first need to
calculate the fraction of slow and fast outcomes for this particular node.
[Weighted avg] Entropy(children) = (no. of outcomes in left child node / total no. of outcomes
in parent node) * (entropy of left node) + (no. of outcomes in right child node / total no. of
outcomes in parent node) * (entropy of right node)
By using the above formula you'll find that the weighted average Entropy(children) = 0.675
Our final step is to substitute the above weighted average into the IG formula in order to
calculate the final IG of the "Road type" variable:
Information Gain = Entropy(parent) - [weighted avg] Entropy(children)
Therefore, IG(Road type) = 1 - 0.675 = 0.325
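The same entropy and information gain arithmetic can be checked with a short Python sketch. The label lists below mirror the example above: a parent node of two "slow" and two "fast" outcomes, split by Road type into a (slow, slow, fast) child and a (fast) child:

# Sketch: computing the entropy and information gain worked out above.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

parent = ["slow", "slow", "fast", "fast"]
left, right = ["slow", "slow", "fast"], ["fast"]

weighted_children = ((len(left) / len(parent)) * entropy(left)
                     + (len(right) / len(parent)) * entropy(right))
info_gain = entropy(parent) - weighted_children
print(entropy(parent), weighted_children, info_gain)
# Exact values: 1.0, ~0.689, ~0.311 (the notes round the left-node entropy to 0.9,
# which gives the 0.675 and 0.325 quoted above).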
K-Nearest Neighbor (KNN):
Let's try to understand the KNN algorithm with a simple example. Let's say we want a
machine to distinguish between images of cats and dogs. To do this we must input a dataset
of cat and dog images and train our model to detect the animals based on certain features. For
example, features such as pointy ears can be used to identify cats, and similarly we can
identify dogs based on their long ears.
After the model has studied the dataset during the training phase, when a new image is given
to it, the KNN algorithm will classify the image as either a cat or a dog depending on the
similarity of its features. So if the new image has pointy ears, it will be classified as a cat
because it is similar to the cat images. In this manner, the KNN algorithm classifies data
points based on how similar they are to their neighboring data points.
Features Of KNN Algorithm
The KNN algorithm has the following features:
KNN is a Supervised Learning algorithm that uses a labeled input data set to predict the
output of new data points.
It is one of the simplest Machine Learning algorithms and can be easily implemented for a
varied set of problems.
It is mainly based on feature similarity. KNN checks how similar a data point is to its
neighbor and classifies the data point into the class it is most similar to.
Unlike most algorithms, KNN is a non-parametric model which means that it does not make
any assumptions about the data set. This makes the algorithm more effective since it can
handle realistic data.
KNN is a lazy algorithm; this means that it memorizes the training data set instead of learning
a discriminative function from the training data.
KNN can be used for solving both classification and regression problems.
KNN Algorithm Example
To understand how the KNN algorithm works, consider the following scenario: we have two
classes of data, namely Class A (squares) and Class B (triangles), plotted as points in two
dimensions. The problem statement is to assign a new input data point to one of the two
classes by using the KNN algorithm.
The first step in the KNN algorithm is to define the value of "K". But what does the "K" in
the KNN algorithm stand for? "K" stands for the number of Nearest Neighbors, hence the
name K Nearest Neighbors (KNN).
Suppose we define the value of "K" as 3. This means that the algorithm will consider the
three neighbors that are closest to the new data point in order to decide the class of this new
data point.
The closeness between the data points is calculated by using measures such as the Euclidean
and Manhattan distance, which are explained below.
At "K" = 3, suppose the neighbors include two squares and one triangle. If we classify the
new data point based on "K" = 3, it would be assigned to Class A (squares).
But what if the "K" value is set to 7? Here, we are telling the algorithm to look at the seven
nearest neighbors and classify the new data point into the class it is most similar to. At
"K" = 7, suppose the neighbors include three squares and four triangles. If we classify the
new data point based on "K" = 7, it would be assigned to Class B (triangles), since the
majority of its neighbors are of Class B.
In practice, there is a lot more to consider while implementing the KNN algorithm. Earlier
we mentioned that KNN uses the Euclidean distance as a measure to check the distance
between a new data point and its neighbors; let's see how.
(Figure: Euclidean distance between points P1 and P2 - KNN Algorithm.)
Consider two points P1 and P2 as shown in the figure; we measure the distance between them
using the Euclidean distance measure. The coordinates of P1 and P2 are (1, 4) and (5, 1)
respectively. The Euclidean distance can be calculated like so:
dist(P1, P2) = sqrt((5 - 1)^2 + (1 - 4)^2) = sqrt(16 + 9) = 5
It is as simple as that! KNN makes use of a simple measure in order to solve complex
problems; this is one of the reasons why KNN is such a commonly used algorithm.
Formally, let Xi denote the feature variables (data points) for i = 1, 2, ..., n, and let Ci denote
the output class of Xi for each i, where Ci ∈ {1, 2, 3, ..., c} and c is the total number of
classes. Now suppose there is a new data point "x" whose output class needs to be predicted.
This can be done by using the K-Nearest Neighbor (KNN) algorithm.
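Putting the pieces together, here is a small hand-rolled KNN sketch in Python: it computes the Euclidean distance from a new point to every labelled point, keeps the K nearest, and predicts by majority vote. The points and labels are invented for illustration:

# Sketch of KNN by hand: Euclidean distance to every labelled point,
# keep the K nearest, predict by majority vote.
from collections import Counter
from math import dist   # Euclidean distance (Python 3.8+)

train = [((1, 4), "A"), ((2, 3), "A"), ((5, 1), "B"), ((6, 2), "B"), ((5, 3), "B")]
x_new = (4, 2)
K = 3

neighbours = sorted(train, key=lambda pt: dist(pt[0], x_new))[:K]
votes = Counter(label for _, label in neighbours)
print(votes.most_common(1)[0][0])   # predicted class of the new point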
Support Vector Machine (SVM):
An SVM is implemented in a slightly different way than other machine learning algorithms.
It is capable of performing classification, regression and outlier detection.
Advantages of SVM
Effective in high dimensional spaces
Still effective in cases where the number of dimensions is greater than the number of
samples
Uses a subset of training points (the support vectors) in the decision function, which makes
it memory efficient
Different kernel functions can be specified for the decision function, which also makes it
versatile
Disadvantages of SVM
If the number of features is much larger than the number of samples, avoiding over-fitting by
choosing suitable kernel functions and a regularization term is crucial.
SVMs do not directly provide probability estimates; these are calculated using five-fold
cross-validation.
To select the maximum-margin hyperplane between the given classes, the support vector
machine follows these steps:
Generate hyperplanes which segregate the classes in the best possible way
Select the hyperplane with the maximum margin from the nearest data points of either class
How to deal with inseparable and non-linear planes
In some cases, hyperplanes can not be very efficient. In those cases, the support vector
machine uses a kernel trick to transform the input into a higher-dimensional space. With this,
it becomes easier to segregate the points. Let us take a look at the SVM kernels.
SVM Kernels
An SVM kernel basically adds more dimensions to a low dimensional space to make it easier
to segregate the data. It converts the inseparable problem to separable problems by adding
more dimensions using the kernel trick. A support vector machine is implemented in practice
by a kernel. The kernel trick helps to make a more accurate classifier. Let us take a look at
the different kernels in the Support vector machine.
Linear Kernel: A linear kernel can be used as a normal dot product between any two given
observations. The product between the two vectors is the sum of the products of each pair of
input values. The linear kernel equation is:
K(x, xi) = sum(x * xi)
Radial Basis Function (RBF) Kernel: The RBF kernel is commonly used in SVM
classification; it can map the input space into an infinite-dimensional space. The RBF kernel
equation is:
K(x, xi) = exp(-gamma * ||x - xi||^2), where gamma > 0
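As a brief sketch, scikit-learn's SVC lets you pick these kernels via its kernel parameter; the tiny 2-D dataset below is invented purely to show the API:

# Sketch: SVM classifiers with a linear and an RBF kernel in scikit-learn.
from sklearn.svm import SVC

X = [[0, 0], [1, 1], [1, 0], [0, 1], [3, 3], [4, 4], [3, 4], [4, 3]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm    = SVC(kernel="rbf", gamma="scale", C=1.0).fit(X, y)

print(linear_svm.predict([[2, 2]]), rbf_svm.predict([[0.5, 0.5]]))
print(linear_svm.support_vectors_)   # only a subset of training points defines the boundary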
Random Forest Algorithm:
As the name suggests, "Random Forest is a classifier that contains a number of decision trees on
various subsets of the given dataset and takes the average to improve the predictive accuracy of that
dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree
and, based on the majority votes of the predictions, predicts the final output.
A greater number of trees in the forest leads to higher accuracy and prevents the problem of
overfitting.
(Figure: working of the Random Forest algorithm.)
Note: To better understand the Random Forest algorithm, you should have knowledge of the
Decision Tree algorithm.
Assumptions for Random Forest:
o There should be some actual values in the feature variables of the dataset so that the classifier can
predict accurate results rather than guessed results.
o The predictions from each tree must have very low correlations.
Why use Random Forest?
o It predicts output with high accuracy, and even for a large dataset it runs efficiently.
The working process can be explained in the below steps:
Step-1: Select random K data points from the training set.
Step-2: Build the decision trees associated with the selected data points (subsets).
Step-3: Choose the number N of decision trees that you want to build.
Step-4: Repeat Step 1 and Step 2.
Step-5: For new data points, find the predictions of each decision tree, and assign the new data point to
the category that wins the majority of votes.
The working of the algorithm can be better understood by the below example:
Example: Suppose there is a dataset that contains multiple fruit images. This dataset is given to the
Random Forest classifier. The dataset is divided into subsets and given to each decision tree. During
the training phase, each decision tree produces a prediction result, and when a new data point occurs,
the Random Forest classifier predicts the final decision based on the majority of the results.
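A minimal sketch of this ensemble idea with scikit-learn's RandomForestClassifier is given below; it uses the built-in Wine dataset as a stand-in, since the fruit-image example would first need image features to be extracted:

# Sketch: many decision trees trained on random subsets, final prediction by majority vote.
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)  # N = 100 trees
forest.fit(X_train, y_train)
print("Test accuracy:", forest.score(X_test, y_test))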
Applications of Random Forest
There are mainly four sectors where Random Forest is mostly used:
1. Banking: The banking sector mostly uses this algorithm for the identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the disease can be
identified.
3. Land Use: We can identify areas of similar land use with this algorithm.
Advantages of Random Forest:
o It enhances the accuracy of the model and prevents the overfitting issue.