UNIT 4 Supervised Learning
Supervised learning is a machine learning approach where an algorithm learns a function that
maps an input to an output based on labeled examples. It involves training a model on a set of
inputs and their corresponding correct outputs, allowing the model to learn the underlying
patterns and make predictions on unseen data.
Key Takeaways
• Supervised learning involves training a model on a labeled dataset and learning the underlying patterns to make predictions on unseen data.
Supervised learning starts with a dataset that contains input examples paired with
their correct outputs, also known as labels. The dataset is divided into a training set and a test
set. The training set is used to train the model, while the test set evaluates the model’s
performance. During training, the model learns to identify patterns and correlations between
input features and the corresponding output labels.
The typical supervised learning workflow involves the following steps (a short sketch in code follows the list):
1. Preprocessing the data: This involves handling missing values, feature scaling, and
encoding categorical variables to prepare the dataset for training.
2. Splitting the dataset: The dataset is split into a training set and a test set to assess the
performance of the model on unseen data.
3. Selecting an algorithm: Choosing the most suitable algorithm for the given problem,
such as linear regression, decision trees, or support vector machines.
4. Training the model: The algorithm is trained on the training set by adjusting its
internal parameters to minimize the prediction errors.
5. Evaluating the model: The trained model is evaluated using metrics like accuracy,
precision, recall, or mean squared error, depending on the problem.
6. Making predictions: Once the model is trained and evaluated, it can make predictions
on new, unseen data points by applying the learned function.
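The following is a minimal sketch of this workflow in Python with scikit-learn. The iris dataset, the logistic regression model, and the 80/20 split are illustrative assumptions, not choices made in these notes:

# Minimal sketch of the six steps above (illustrative dataset and model).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                       # labeled examples (inputs + correct outputs)

# Step 2: split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Step 1: preprocessing (here, simple feature scaling)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Steps 3 and 4: select an algorithm and train it on the training set
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Step 5: evaluate the trained model on unseen data
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Step 6: make predictions on new, unseen data points
print("prediction:", model.predict(X_test[:1]))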
*Supervised learning enables machines to learn patterns and make predictions based on
existing knowledge.* By providing labeled examples, this approach allows the model to
make informed decisions and predictions in various domains.
Generalization error can be measured by the mean squared error (MSE). As model capacity increases, the bias decreases because the model fits the training data better. However, the variance increases: as the model becomes sophisticated enough to fit more patterns of the current dataset, changing the dataset (even to another drawn from the same distribution) has a larger impact on what is learned.
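This trade-off is summarized by the standard bias-variance decomposition of the expected test MSE at a point x, stated here for reference (f̂ denotes the learned model and σ² the irreducible noise):

E[(y − f̂(x))²] = (Bias[f̂(x)])² + Var[f̂(x)] + σ²

As capacity grows, the squared-bias term falls while the variance term rises, so the generalization error typically follows a U-shaped curve.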
VC Dimension
In the world of machine learning, understanding the complexity of a model is crucial for
ensuring good performance on new, unseen data. The Vapnik-Chervonenkis (VC) dimension
is a fundamental concept that helps quantify the complexity of a hypothesis class. Introduced
by Vladimir Vapnik and Alexey Chervonenkis, this measure provides insight into a model’s
capacity to fit a variety of functions. This section explores the concept of the VC dimension, its mathematical foundations, and its implications for machine learning models and generalization.
What is VC Dimension?
The VC dimension of a hypothesis class H is the size of the largest set of points that H can shatter, i.e., classify correctly under every possible labeling of those points.

Let
H = {set of linear classifiers in 2 dimensions}.

Then any 3 points (in general position, i.e., not collinear) can be classified correctly by H with a separating hyperplane for every possible labeling, as shown in the following figure.

For any 4 points in the 2D plane, however, a linear classifier cannot shatter all possible labelings of the points. For example, when diagonally opposite points of a quadrilateral share a class (an XOR-style labeling), no separating hyperplane can be drawn to classify the set. That is why the VC dimension of H is 3.
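As an informal illustration of shattering, the following Python sketch (the specific points and the use of a near-hard-margin linear SVM are assumptions of this example) enumerates every labeling of a point set and checks whether a linear classifier fits it perfectly:

# Sketch: enumerate all labelings of a point set and check whether a
# linear classifier can realize each one (i.e., whether the set is shattered).
from itertools import product
import numpy as np
from sklearn.svm import SVC

def is_shattered(points):
    points = np.asarray(points, dtype=float)
    for labels in product([0, 1], repeat=len(points)):
        labels = np.array(labels)
        if len(set(labels)) < 2:
            continue  # a single-class labeling is trivially realizable
        clf = SVC(kernel="linear", C=1e6)    # nearly hard-margin linear classifier
        clf.fit(points, labels)
        if clf.score(points, labels) < 1.0:  # some labeling cannot be realized
            return False
    return True

three_points = [(0, 0), (1, 0), (0, 1)]            # non-collinear
four_points  = [(0, 0), (1, 1), (1, 0), (0, 1)]    # the XOR-style labeling will fail

print("3 points shattered by linear classifiers:", is_shattered(three_points))  # True
print("4 points shattered by linear classifiers:", is_shattered(four_points))   # False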
Model Assumptions in Regression
Fitting a regression model by ordinary least squares adjusts the coefficients based on your training data. If there is noise in the training data, the estimated coefficients won't generalize well to future data. This is where regularization comes in: it shrinks, or regularizes, these learned estimates towards zero.
Ridge Regression
In ridge regression, the residual sum of squares (RSS) is modified by adding a shrinkage penalty, and the coefficients are estimated by minimizing

RSS + λ Σj βj²   (sum over j = 1, …, p)

Here, λ ≥ 0 is the tuning parameter that decides how much we want to penalize the flexibility of our model. The flexibility of a model is reflected in the size of its coefficients, and if we want to minimize the above function, then these coefficients need to be small. This is how the ridge regression technique prevents coefficients from rising too high. Also, notice that we shrink the estimated association of each variable with the response, except the intercept β0; this intercept is a measure of the mean value of the response when xi1 = xi2 = … = xip = 0.
The coefficients produced by the standard least squares method are scale equivariant, i.e., if we multiply each input by a constant c, then the corresponding coefficient is scaled by a factor of 1/c. Therefore, regardless of how a predictor is scaled, the product of predictor and coefficient (Xjβj) remains the same. However, this is not the case with ridge regression, and therefore we need to standardize the predictors, i.e., bring them to the same scale, before performing ridge regression. This is done by dividing each predictor by its sample standard deviation:

x̃ij = xij / sqrt( (1/n) Σi (xij − x̄j)² )
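As a minimal sketch (the synthetic data and the particular λ values are illustrative assumptions), the following Python code standardizes the predictors and shows how the ridge coefficients shrink as λ grows:

# Sketch: ridge regression with standardized predictors; larger alpha (λ)
# shrinks the estimated coefficients towards zero.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

for alpha in [0.1, 10.0, 1000.0]:                  # alpha plays the role of λ
    model = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    model.fit(X, y)
    coefs = model.named_steps["ridge"].coef_
    print(f"lambda={alpha:7.1f}  coefficients={np.round(coefs, 2)}")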
Lasso
Lasso is a regularization technique similar to ridge regression, but it penalizes the sum of the absolute values of the coefficients (λ Σj |βj|) rather than their squares; this penalty can shrink some coefficients exactly to zero, effectively performing variable selection.

A standard least squares model tends to have some variance, i.e., it won't generalize well to a dataset different from its training data. Regularization significantly reduces the variance of the model without a substantial increase in its bias. The tuning parameter λ used in the regularization techniques described above therefore controls the trade-off between bias and variance. As the value of λ rises, the coefficients shrink, which reduces the variance. Up to a point, this increase in λ is beneficial because it only reduces the variance (hence avoiding overfitting) without losing any important properties in the data. But after a certain value, the model starts losing important properties, giving rise to bias and thus underfitting. Therefore, the value of λ should be chosen carefully.
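One common way to choose λ carefully is cross-validation. A minimal sketch with scikit-learn (the synthetic dataset is an assumption of this example; LassoCV picks λ from its default grid by 5-fold cross-validation):

# Sketch: choose λ for the lasso by cross-validation; note that some
# coefficients are shrunk exactly to zero (variable selection).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

model = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
model.fit(X, y)

lasso = model.named_steps["lassocv"]
print("chosen lambda:", lasso.alpha_)
print("coefficients :", np.round(lasso.coef_, 2))   # several will be exactly 0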
Bayes Theorem:
Bayes' theorem relates the posterior probability of a class y given the predictors X to quantities that can be estimated from the data:

P(y | X) = P(X | y) P(y) / P(X)

Example:
X is given as X = (x1, x2, …, xn), the vector of feature values. Under the naive assumption that the features are independent given the class, the posterior becomes

P(y | x1, …, xn) = P(x1 | y) P(x2 | y) … P(xn | y) P(y) / ( P(x1) P(x2) … P(xn) )
Now, you can obtain the value of each term by looking at the dataset and substitute them into the equation. For all entries in the dataset, the denominator does not change; it remains constant. Therefore, the denominator can be removed and a proportionality introduced:

P(y | x1, …, xn) ∝ P(y) P(x1 | y) P(x2 | y) … P(xn | y)

In our case, the class variable (y) has only two outcomes, yes or no, but there could be cases where the classification is multiclass. In either case, we need to find the class y with maximum posterior probability:

y = argmax over y of P(y) P(x1 | y) P(x2 | y) … P(xn | y)

Using the above function, we can obtain the class, given the predictors.
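A minimal sketch of a naive Bayes classifier with scikit-learn (the iris dataset and Gaussian likelihoods are illustrative assumptions; for the yes/no categorical example above, CategoricalNB would be the closer fit):

# Sketch: naive Bayes picks the class y that maximizes P(y) * Π P(xi | y).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)          # estimates P(y) and P(xi | y)
print("test accuracy:", nb.score(X_test, y_test))
print("class probabilities for one sample:", nb.predict_proba(X_test[:1]))
print("argmax class:", nb.predict(X_test[:1]))   # class with maximum posterior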
K-Nearest Neighbors (K-NN)
Suppose there are two categories, Category A and Category B, and we have a new data point x1; in which of these categories will this data point lie? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the diagram below:
The working of K-NN can be explained on the basis of the following algorithm:
Suppose we have a new data point and we need to put it in the required
category. Consider the below image:
o Firstly, we choose the number of neighbors; here we choose k = 5.
o Next, we calculate the Euclidean distance between the new data point and the existing data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. It can be calculated as:

d = sqrt( (x2 − x1)² + (y2 − y1)² )

o By calculating the Euclidean distances, we find the k nearest neighbors: three nearest neighbors in Category A and two nearest neighbors in Category B. Consider the image below:
o As 3 of the 5 nearest neighbors are from Category A, the new data point must belong to Category A (a short sketch of this procedure follows).
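A minimal K-NN sketch in Python (the coordinates of the training points and of the new point x1 are made up for illustration), using Euclidean distance and k = 5 as above:

# Sketch: classify a new point by majority vote among its k=5 nearest
# neighbors under Euclidean distance.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Illustrative training points: Category A = 0, Category B = 1
X = np.array([[1, 2], [2, 3], [3, 3], [2, 1], [3, 2],    # Category A
              [6, 5], [7, 7], [8, 6], [7, 5], [6, 7]])   # Category B
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean").fit(X, y)

x1 = np.array([[4, 4]])                         # the new data point
dist, idx = knn.kneighbors(x1)                  # distances to the 5 nearest neighbors
print("nearest neighbor labels:", y[idx[0]])    # e.g. 3 from Category A, 2 from B
print("predicted category     :", knn.predict(x1))  # majority vote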
Logistic Regression
When the output to be predicted is a category rather than a continuous value, you will have to use a type of function different from linear functions, called a logistic function, or a sigmoid function.
Sigmoid Function
The sigmoid function (logistic function) is a function that resembles an "S"-shaped curve when plotted on a graph. It takes any real-valued input and "squishes" it into the range between 0 and 1, pushing values towards the margins at the top and bottom so that they can be labeled as 0 or 1.
The equation for the sigmoid function is:

σ(x) = 1 / (1 + e^(−x))
Let's see how the sigmoid function represents the given dataset. It gives a value y that is extremely close to 0 if x is a large negative value and close to 1 if x is a large positive value. In logistic regression, the inputs are first run through a typical linear function and the result is then squeezed towards 0 or 1 by the sigmoid, so the inputs can be put into distinct categories.
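A minimal sketch of the sigmoid in Python (the sample inputs stand in for outputs of a linear function and are arbitrary):

# Sketch: the sigmoid squashes any real number into the range (0, 1);
# thresholding the output at 0.5 turns it into a class label.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([-6.0, -1.0, 0.0, 1.0, 6.0])   # e.g. outputs of a linear function w·x + b
p = sigmoid(z)
print("probabilities:", np.round(p, 3))      # ~0 for large negative, ~1 for large positive
print("labels       :", (p >= 0.5).astype(int))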
Introduction to SVM
Support vector machines (SVMs) are powerful yet flexible supervised
machine learning algorithms which are used both for classification and
regression. But generally, they are used in classification problems. SVMs were first introduced in the 1960s and later refined in the 1990s. SVMs have their own unique way of implementation as compared to other machine learning algorithms. They are extremely popular because of their ability to handle multiple continuous and categorical variables.
Working of SVM
An SVM model is basically a representation of the different classes separated by a hyperplane in a multidimensional space. The hyperplane is generated in an iterative manner by the SVM so that the classification error is minimized. The goal of SVM is to divide the dataset into classes by finding a maximum marginal hyperplane (MMH), i.e., the hyperplane whose distance to the nearest points of each class (the margin) is as large as possible.
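A minimal linear SVM sketch (the synthetic, roughly separable blobs are an assumption of this example), showing the fitted maximum-margin hyperplane and its support vectors:

# Sketch: a linear SVM finds the separating hyperplane with maximum margin;
# the points lying on or inside the margin are the support vectors.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0, cluster_std=1.0)

svm = SVC(kernel="linear", C=1.0).fit(X, y)
print("hyperplane weights w:", svm.coef_[0])
print("intercept b         :", svm.intercept_[0])
print("support vectors per class:", svm.n_support_)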
Decision Tree:
Tree models where the target variable can take a discrete set of values
are called classification trees. Decision trees where the target variable
can take continuous values (typically real numbers) are
called regression trees. Classification And Regression Tree (CART) is a general term for both.
While building a decision tree, at each node of the tree we ask a different type of question. Based on the question asked, we calculate the information gain corresponding to it.
Information Gain
Information gain measures how much a question (split) reduces the impurity of a node: a pure node contains samples of only one class, while an impure node contains a mixture of classes. At each node, the question with the highest information gain is chosen.
• Get the list of rows (the dataset) that are taken into consideration for making the decision tree (recursively at each node).
• Divide the node on the best question, i.e., the one with the highest information gain. Repeat from step 1 until we get pure (leaf) nodes, as in the sketch below.
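Here is a minimal sketch of that recursive procedure using scikit-learn (the iris dataset and the depth limit are illustrative assumptions; criterion="entropy" makes the splits use information gain):

# Sketch: grow a decision tree whose splits are chosen by information gain
# (criterion="entropy") until the leaves are (nearly) pure, then inspect it.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=load_iris().feature_names))
print("training accuracy:", tree.score(X, y))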
A single decision tree is prone to overfitting; this is one motivation for combining many trees into a random forest, described next.
Random Forest
Since the random forest combines multiple trees to predict the class of
the dataset, it is possible that some decision trees may predict the
correct output, while others may not. But together, all the trees predict
the correct output. Therefore, below are two assumptions for a better random forest classifier:
• The dataset should contain some actual signal in its feature variables, so that the classifier can predict accurate results rather than guessed results.
• The predictions from the individual trees should have very low correlations with each other.
The working process can be explained in the following steps and diagram:
Step-1: Select K random data points (a random subset) from the training set.
Step-2: Build the decision tree associated with the selected data points (subset).
Step-3: Choose the number N of decision trees that you want to build.
Step-4: Repeat Step-1 and Step-2 until N trees have been built.
Step-5: For a new data point, find the prediction of each decision tree, and assign the new data point to the category that wins the majority of votes. A short sketch of this procedure follows.
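A minimal random forest sketch (the breast-cancer dataset and N = 100 trees are illustrative assumptions; scikit-learn performs the random subset selection internally via bootstrap sampling):

# Sketch: a random forest builds N decision trees on random subsets of the
# data and classifies new points by majority vote over the trees.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100,   # N, the number of trees
                                bootstrap=True,      # each tree sees a random subset
                                random_state=0).fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))
print("majority-vote prediction for one new point:", forest.predict(X_test[:1]))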