Machine Learning (R20a0518)
ENGINEERING
DIGITAL NOTES
ON
MACHINE LEARNING
R20A0518
Prepared by padmaja
Overfitting, Underfitting
Expectation-Maximization
V
Reinforcement Learning: Exploration and exploitation trade-offs
Non-associative learning
UNIT-1
Machine Learning is a concept that allows a machine to learn from examples and experience without
being explicitly programmed. Instead of writing the logic yourself, you feed data to a generic
algorithm, and the algorithm (the machine) builds the logic based on the given data.
A Machine Learning algorithm is trained on a training data set to create a model. When new input
data is introduced, the model makes a prediction. The prediction is evaluated for accuracy; if the
accuracy is acceptable, the model is deployed. If the accuracy is not acceptable, the algorithm is
trained again, repeatedly, with an augmented training data set.
MACHINE LEARNING 1
DEPARTMENT OF CSE AY:2023-24
Supervised Learning is learning guided by a teacher. The dataset acts as the teacher, and its role is to
train the model or the machine. Once the model is trained, it can start making predictions or decisions
when new data is given to it.
In unsupervised learning, the model learns through observation and finds structures in the data. Once the
model is given a dataset, it automatically finds patterns and relationships in the dataset by creating
clusters in it. What it cannot do is add labels to the clusters: it cannot say that this is a group of
apples or mangoes, but it will separate all the apples from the mangoes.
Suppose we present images of apples, bananas and mangoes to the model. Based on some patterns and
relationships, it creates clusters and divides the dataset into those clusters. When new data is fed to
the model, it adds it to one of the created clusters.
Classification of Machine Learning Algorithms
Machine Learning algorithms can be classified into:
1. Supervised Algorithms – Linear Regression, Logistic Regression, Support Vector Machine (SVM),
Decision Trees, Random Forest
2. Unsupervised Algorithms – K Means Clustering
3. Reinforcement Algorithms
given labels based on certain parameters through which the machine will learn these features and
patterns and classify some new input data based on the learning from this training data.
Supervised Learning Algorithms can be broadly divided into two types of algorithms, Classification and
Regression.
Classification Algorithms
Just as the name suggests, these algorithms are used to classify data into predefined classes or labels.
Regression Algorithms
These algorithms are used to determine the mathematical relationship between two or more variables and
the level of dependency between them. They can be used to predict an output based on the
interdependency of two or more variables. For example, an increase in the price of a product will decrease
its consumption, so the amount of consumption depends on the price of the product. Here, the amount of
consumption is called the dependent variable and the price of the product is called the independent
variable. The level of dependency of consumption on price helps us predict the future amount of
consumption based on a change in the price of the product.
We have two types of regression algorithms: Linear Regression and Logistic Regression.
blue etc. The graph of logistic regression consists of a non-linear sigmoid function which demonstrates the
probabilities of the variables.
Another machine learning concept which is extensively used in the field is Neural Networks.
Normalization is a scaling technique in Machine Learning applied during data preparation to change
the values of numeric columns in the dataset to a common scale. It is not mandatory for every dataset;
it is required only when the features of the dataset have different ranges. In that case it helps to
enhance the performance and reliability of a machine learning model.
o Xn = Normalized value
o Xmaximum = Maximum value of a feature
o Xminimum = Minimum value of a feature
Example: Let's assume we have a dataset whose feature has the maximum and minimum values mentioned
above. To normalize the data, values are shifted and rescaled so that their range lies between 0 and 1.
This technique is also known as Min-Max scaling. In this scaling
Case 1: If X equals the minimum value, the numerator is 0, so the normalized value is also 0.
Xn = (X - Xminimum) / ( Xmaximum - Xminimum)
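The Min-Max formula above can be sketched in a few lines of Python. This is a minimal illustration, not a library routine; the function name `min_max_normalize` and the sample `ages` values are assumptions made for the example.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the range [0, 1] using Min-Max scaling."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # All values are identical, so the formula would divide by zero.
        return [0.0 for _ in values]
    # Xn = (X - Xminimum) / (Xmaximum - Xminimum)
    return [(x - x_min) / (x_max - x_min) for x in values]


ages = [20, 30, 40, 60]  # hypothetical feature values
scaled = min_max_normalize(ages)
print(scaled)
```

The smallest value maps to 0 and the largest to 1, which matches the two boundary cases of the formula.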
This method was introduced by Karl Pearson. It works on a condition that while the data in a higher
dimensional space is mapped to data in a lower dimension space, the variance of the data in the lower
dimensional space should be maximum.
It involves the following steps:
Construct the covariance matrix of the data.
Compute the eigenvectors of this matrix.
Eigenvectors corresponding to the largest eigenvalues are used to reconstruct a large fraction of the
variance of the original data.
Hence, we are left with fewer eigenvectors, and there might have been some data loss in the
process. But the most important variances should be retained by the remaining eigenvectors.
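The steps listed above (covariance matrix, eigenvectors, keeping the largest eigenvalues) can be sketched with NumPy as follows. The function name `pca` and the synthetic two-column dataset are assumptions for illustration only.

```python
import numpy as np


def pca(data, n_components):
    """Project `data` (rows are samples) onto its top principal components."""
    centered = data - data.mean(axis=0)
    # Step 1: covariance matrix of the (centered) data.
    cov = np.cov(centered, rowvar=False)
    # Step 2: eigenvectors of the symmetric covariance matrix.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Step 3: keep the eigenvectors with the largest eigenvalues.
    order = np.argsort(eigenvalues)[::-1][:n_components]
    explained = eigenvalues[order].sum() / eigenvalues.sum()
    return centered @ eigenvectors[:, order], explained


# Hypothetical data: the second column is almost a multiple of the first.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([x, 2 * x + 0.1 * rng.normal(size=200)])

projected, explained = pca(data, n_components=1)
print(projected.shape, round(float(explained), 4))
```

Because the two columns are strongly correlated, a single component retains almost all of the variance, which is the situation where dropping eigenvectors loses little information.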
Many machine learning problems are nonlinear, and nonlinear feature mappings can help to produce new
features which make prediction problems linear. In this section we will discuss the following idea:
transformation of the dataset to a new higher-dimensional (in some cases infinite-dimensional) feature
space and the use of PCA in that space in order to produce uncorrelated features. Such a method is called
Kernel Principal Component Analysis or KPCA.
where . Will consider that the dimensionality of the feature space equals to .
Eigendecomposition of is given by
By the definition of
And therefore
So far, we have assumed that the mapping is known. From the equations above, we can see that the only
thing we need for the data transformation is the eigendecomposition of a Gram matrix . Dot products,
which are its elements, can be defined without any definition of . The function defining such dot
products in some Hilbert space is called a kernel. Valid kernels satisfy Mercer’s theorem. There are
many different types of kernels; several popular ones are:
1. Linear: $k(x, y) = x^T y$;
2. Gaussian: $k(x, y) = \exp(-\gamma \lVert x - y \rVert^2)$;
3. Polynomial: $k(x, y) = (x^T y + c)^d$.
Using a kernel function we can write new equation for a projection of some data item onto -th
eigenvector:
So far, we have assumed that the columns of have zero mean. Using
Summary: Now we are ready to write the whole sequence of steps to perform KPCA:
1. Calculate .
2. Calculate .
3. Find the eigenvectors of corresponding to nonzero eigenvalues and normalize them:
.
4. Sort the eigenvectors found in descending order of their corresponding eigenvalues.
5. Perform projections onto the given subset of eigenvectors.
The method described above requires choosing the number of components, the kernel and its parameters. It
should be noted that the number of nonlinear principal components in the general case is infinite, but since
we are computing the eigenvectors of a matrix, at maximum we can calculate nonlinear principal
components.
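The sequence of steps summarized above can be sketched with NumPy. This is a minimal sketch under stated assumptions: a Gaussian kernel with a hypothetical parameter `gamma`, the standard centering of the Gram matrix, and eigenvectors scaled by one over the square root of their eigenvalues. The function names and the random data are illustrative, not taken from the notes.

```python
import numpy as np


def gaussian_kernel_matrix(data, gamma):
    """Gram matrix with entries exp(-gamma * ||x_i - x_j||^2)."""
    sq_norms = (data ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * data @ data.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def kernel_pca(data, n_components, gamma=1.0):
    n = data.shape[0]
    # Step 1: Gram (kernel) matrix.
    k = gaussian_kernel_matrix(data, gamma)
    # Step 2: center the Gram matrix in feature space.
    ones = np.full((n, n), 1.0 / n)
    k_centered = k - ones @ k - k @ ones + ones @ k @ ones
    # Steps 3 and 4: eigenvectors, sorted by descending eigenvalue.
    eigenvalues, eigenvectors = np.linalg.eigh(k_centered)
    order = np.argsort(eigenvalues)[::-1][:n_components]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    # Normalize so the feature-space eigenvectors have unit length.
    alphas = eigenvectors / np.sqrt(eigenvalues)
    # Step 5: project the training points onto the chosen components.
    return k_centered @ alphas


rng = np.random.default_rng(0)
points = rng.normal(size=(30, 2))  # hypothetical dataset
projected = kernel_pca(points, n_components=2, gamma=0.5)
print(projected.shape)
```

Since the Gram matrix has one row per sample, the number of components that can be extracted this way is bounded by the number of samples.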
UNIT-II
Regression models: Simple Linear Regression, Multiple Linear Regression. Cost Function, Gradient Descent,
Performance Metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), R-Squared error, Adjusted R
Square.
Supervised and unsupervised learning are the approaches most used by machine learning engineers and
data practitioners. Reinforcement learning is powerful but complex to apply to problems.
Supervised learning
As we know, machine learning takes data as input. Let's call this data the training data.
What are inputs and labels (targets)? For example, for the addition of two numbers a=5, b=6 with result 11,
the inputs are 5 and 6 and the target is 11.
We first train the model with lots of training data (inputs and targets); then, with new data and the
logic learned before, we predict the output.
(Note: We don’t get exactly 6 as the answer; we may get a value which is close to 6, depending on the
training data and algorithm.)
This process is called Supervised Learning, which is generally fast and accurate.
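The addition example can be turned into a tiny supervised-learning sketch. The training pairs below are made up for illustration, and the model is assumed to be a plain linear least-squares fit; the notes do not prescribe a particular algorithm.

```python
import numpy as np

# Training data: each input is a pair (a, b); the target is their sum.
inputs = np.array([[1, 2], [3, 4], [5, 1], [2, 7], [6, 3]], dtype=float)
targets = inputs.sum(axis=1)

# "Training": find the weights that best map inputs to targets.
weights, *_ = np.linalg.lstsq(inputs, targets, rcond=None)

# "Prediction": apply the learned weights to unseen inputs a=5, b=6.
prediction = float(np.array([5.0, 6.0]) @ weights)
print(weights, prediction)
```

The learned weights come out close to 1 and 1, so the prediction is close to 11. The model was never told the rule "add the numbers"; it recovered it from the examples.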
Regression: This is a type of problem where we need to predict a continuous response value (ex:
above we predict a number which can vary from -infinity to +infinity).
How many total runs can be on the board in a cricket game? There are many such things we can predict if
we wish.
Classification: This is a type of problem where we predict a categorical response value, where the data
can be separated into specific “classes” (ex: we predict one of the values in a set of values).
Unsupervised learning
The training data does not include targets here, so we don’t tell the system where to go; the system has
to work it out itself from the data we give.
Here the training data is not structured (it contains noisy data, unknown data, etc.).
Unsupervised process
There are also different types of unsupervised learning, like clustering and anomaly detection (clustering
is the best known).
This is a bit similar to multi-class classification, but here we don’t provide the labels; the system
understands the data itself and clusters it.
Unsupervised learning is a bit difficult to implement and it is not used as widely as supervised learning.
Reinforcement Learning is a type of Machine Learning, and thereby also a branch of Artificial Intelligence.
It allows machines and software agents to automatically determine the ideal behavior within a specific
context, in order to maximize their performance. Simple reward feedback is required for the agent to learn
its behavior; this is known as the reinforcement signal.
There are many different algorithms that tackle this issue. As a matter of fact, Reinforcement Learning is
defined by a specific type of problem, and all its solutions are classed as Reinforcement Learning
algorithms. In the problem, an agent is supposed to decide the best action to select based on its current
state. When this step is repeated, the problem is known as a Markov Decision Process.
In order to produce intelligent programs (also called agents), reinforcement learning goes through the
following steps:
3. After the action is performed, the agent receives a reward or reinforcement from the environment.
Q-Learning
Use cases:
Some applications of reinforcement learning algorithms are computer-played board games (Chess, Go),
robotic hands, and self-driving cars.
Regression analysis is a statistical method to model the relationship between a dependent (target)
variable and one or more independent (predictor) variables. More specifically, regression
analysis helps us to understand how the value of the dependent variable changes with respect to an
independent variable when the other independent variables are held fixed. It predicts continuous/real
values such as temperature, age, salary, price, etc.
We can understand the concept of regression analysis using the below example:
Example: Suppose there is a marketing company A, which runs various advertisements every year and gets
sales from them. The below list shows the advertisements made by the company in the last 5 years and the
corresponding sales:
Now, the company wants to spend $200 on advertisement in the year 2019 and wants to know the
predicted sales for this year. To solve this type of prediction problem in machine learning,
we need regression analysis.
Regression is a supervised learning technique which helps in finding the correlation between variables and
enables us to predict the continuous output variable based on one or more predictor variables. It is
mainly used for prediction, forecasting, time series modeling, and determining the cause-effect
relationship between variables.
In Regression, we plot a graph between the variables which best fits the given datapoints; using this plot,
the machine learning model can make predictions about the data. In simple words, "Regression shows a line or
curve that passes through the datapoints on the target-predictor graph in such a way that the vertical
distance between the datapoints and the regression line is minimum." The distance between the datapoints and
the line tells whether a model has captured a strong relationship or not.
Types of Regression
There are various types of regression which are used in data science and machine learning. Each type has its
own importance in different scenarios, but at the core, all the regression methods analyze the effect of the
independent variables on the dependent variable. Here we discuss some important types of regression,
which are given below:
o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression
COST FUNCTION
The cost function, also called the loss function, computes the difference or distance between the actual
output and the predicted output.
It determines the performance of a Machine Learning model using a single real number, known
as the cost value or model error. This value depicts the average error between the actual and predicted
outputs.
On a broader level, the cost function evaluates how accurately the model maps the relationship between
input and output data. Understanding the consistencies and inconsistencies in the model’s performance
for a given dataset is critical. These models work with real-world applications, and the slightest
error can impact the overall projection and incur losses.
Depending upon the given dataset, use case, problem, and purpose, there are primarily three types
of cost functions as follows:
Regression Cost Function
In simpler words, Regression in Machine Learning is the method of working back from ambiguous and
hard-to-interpret data to a more explicit and meaningful model.
It is a predictive modeling technique to examine the relationship between independent features and
dependent outcomes.
The Regression models operate on serial data or variables. Therefore, they predict continuous
outcomes like weather forecasts, probability of loan approvals, car and home costs, the expected
employees’ salary, etc.
When the cost function deals with the problem statement of the Regression Model, it is known as the
Regression Cost Function. It computes the error as the distance between the actual output and the
predicted output.
The Regression Cost Functions are the simplest and are well suited to linear regression. The most
common among them are:
i. Mean Error (ME)
ME is the most straightforward approach and acts as a foundation for other Regression Cost
Functions. It computes the error for every training example and calculates the mean of all derived
errors.
ME is usually not recommended because the error values can be either positive or negative. During the
mean calculation, they can cancel each other out and give a mean error near zero even when the
individual errors are large.
ii. Mean Absolute Error (MAE)
MAE, also known as L1 Loss, overcomes the drawback of Mean Error (ME) mentioned above. It
computes the absolute distance between the actual output and predicted output and is less sensitive to
anomalies (outliers), because it does not disproportionately penalize the high errors they cause.
Overall, it copes well with datasets that contain anomalies.
However, MAE comes with the drawback of being non-differentiable at zero. Thus, it can perform
poorly in Loss Function Optimization Algorithms that involve differentiation to evaluate optimal
coefficients.
iii. Mean Squared Error (MSE)
MSE, also known as L2 Loss, is used most frequently and addresses the drawbacks of
both ME and MAE. It computes the “square” of the distance between the actual output and
predicted output, so negative errors cannot cancel positive ones.
Because errors are squared, MSE penalizes the high errors caused by anomalies, and because it is
differentiable everywhere it suits Loss Function Optimization Algorithms for evaluating optimal
coefficients.
Its extensions are Root Mean Squared Error (RMSE) and Root Mean Squared
Logarithmic Error (RMSLE).
Unlike MAE, MSE is highly sensitive to anomalies, because squaring magnifies a large error
many times over.
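The three cost functions described above (plus RMSE) can be compared on a small example. The numbers are hypothetical and chosen so that positive and negative errors cancel; the function names are illustrative.

```python
import math


def mean_error(actual, predicted):
    return sum(a - p for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


actual = [3.0, 5.0, 7.0, 9.0]      # hypothetical true values
predicted = [2.0, 6.0, 3.0, 13.0]  # errors: +1, -1, +4, -4

me = mean_error(actual, predicted)
mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
rmse = math.sqrt(mse)
print(me, mae, mse, rmse)
```

ME comes out as zero even though every prediction is wrong, which is the cancellation problem noted for ME. MSE is dominated by the two errors of size 4, illustrating its sensitivity to large errors compared with MAE.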
In machine learning, the training period is one of the critical phases for making the model more
accurate. To understand how precisely a model works, you can just run it across the required case
scenarios. But to know how wrong the model is, or which points cause more faults in the
output, a comparative function is required.
A cost function yields a single real number that indicates the distance between the actual output and
the predicted output of an ML model. The algorithm that optimizes this cost function to find the
minimum possible error in the model is called gradient descent.
Gradient Descent is an optimization algorithm that minimizes the cost function by repeatedly
adjusting the model parameters in the direction that reduces the cost.
The cost takes different values for different parameter settings. Gradient Descent iteratively tweaks
the model coefficients (parameters) so that the cost function shrinks step by step, which reduces
model errors without wasted computation.
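A minimal sketch of gradient descent for a one-variable linear model y = w*x + b with the MSE cost is shown below. The learning rate, step count and data are assumptions chosen so that this toy example converges; they are not tuned values from the notes.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w*x + b by minimizing the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every training example.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE with respect to w and b.
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Move the parameters against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # hypothetical data generated by y = 2x + 1
w, b = gradient_descent(xs, ys)
print(round(w, 4), round(b, 4))
```

Each iteration lowers the cost a little; if the learning rate is too large the updates overshoot and the cost grows instead, so in practice it has to be chosen with care.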
Decision Trees
regression tasks. It has a hierarchical tree structure consisting of a root node, branches, internal
nodes, and leaf nodes. Decision trees are used for classification and regression tasks, providing
easy-to-understand models.
A decision tree is a hierarchical model used in decision support that depicts decisions and their
potential outcomes, incorporating chance events, resource expenses, and utility. This algorithmic
model utilizes conditional control statements and is a non-parametric, supervised learning method, useful
for both classification and regression tasks. The tree structure is comprised of a root node, branches,
It is a tool that has applications spanning several different areas. Decision trees can be used for
classification as well as regression problems. The name itself suggests that it uses a flowchart-like
tree structure to show the predictions that result from a series of feature-based splits. It starts with a
Root Node: The initial node at the beginning of a decision tree, where the entire population
Decision Nodes: Nodes resulting from the splitting of root nodes are known as decision
nodes. These nodes represent intermediate decisions or conditions within the tree.
Leaf Nodes: Nodes where further splitting is not possible, often indicating the final
decision tree is referred to as a sub-tree. It represents a specific portion of the decision tree.
Pruning: The process of removing or cutting down specific nodes in a decision tree to
Branch / Sub-Tree: A subsection of the entire decision tree is referred to as a branch or sub-
tree. It represents a specific path of decisions and outcomes within the tree.
Parent and Child Node: In a decision tree, a node that is divided into sub-nodes is known as
a parent node, and the sub-nodes emerging from it are referred to as child nodes. The parent
node represents a decision or condition, while the child nodes represent the potential
Decision trees are drawn upside down, which means the root is at the top and this root is then split into
several nodes. In layman's terms, decision trees are nothing but a bunch of if-else statements.
The tree checks whether a condition is true, and if it is, it goes to the next node attached to that decision.
In the below diagram the tree will first ask: what is the weather? Is it sunny, cloudy, or rainy? Depending
on the answer, it will go to the next feature, which is humidity or wind. It will again check whether there is
a strong wind or a weak one; if it’s a weak wind and it’s rainy then the person may go and play.
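The "bunch of if-else statements" view can be written out directly. The sketch below is hypothetical: the rainy and cloudy branches follow the text, while the sunny/humidity rule and the rainy/strong-wind outcome are assumptions, since the diagram itself is not reproduced here.

```python
def will_play(weather, humidity=None, wind=None):
    """Hand-written decision tree for the weather example."""
    if weather == "cloudy":
        # The training data always says "yes" for cloudy weather.
        return True
    if weather == "rainy":
        # Rainy with a weak wind: the person may go and play.
        # Assumption: a strong wind means no play.
        return wind == "weak"
    if weather == "sunny":
        # Assumption (not stated in the text): play only when humidity is normal.
        return humidity == "normal"
    raise ValueError(f"unknown weather: {weather!r}")


print(will_play("rainy", wind="weak"))
print(will_play("cloudy"))
```

A decision tree learning algorithm produces this kind of nested rule automatically by choosing which feature to test at each node.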
Did you notice anything in the above flowchart? We see that if the weather is cloudy then we must
To answer this question, we need to know about a few more concepts like entropy, information gain,
and Gini index. But in simple terms, the output for the training dataset is always
yes for cloudy weather; since there is no disorder here, we don’t need to split the node further.
The goal of machine learning is to decrease uncertainty or disorders from the dataset and for this,
Now you must be thinking how do I know what should be the root node? what should be the
decision node? when should I stop splitting? To decide this, there is a metric called “Entropy”
1. Starting at the Root: The algorithm begins at the top, called the “root node,” representing
2. Asking the Best Questions: It looks for the most important feature or question that splits
the data into the most distinct groups. This is like asking a question at a fork in the tree.
3. Branching Out: Based on the answer to that question, it divides the data into smaller
subsets, creating new branches. Each branch represents a possible route through the tree.
4. Repeating the Process: The algorithm continues asking questions and splitting the data at
each branch until it reaches the final “leaf nodes,” representing the predicted outcomes or
classifications.
Linear Regression:
o Linear regression is a statistical regression method which is used for predictive analysis.
o It is one of the simplest algorithms; it works on regression and shows
the relationship between continuous variables.
o It is used for solving the regression problem in machine learning.
o Linear regression shows the linear relationship between the independent variable (X-axis) and the
dependent variable (Y-axis), hence the name linear regression.
o If there is only one input variable (x), then such linear regression is called simple linear
regression. And if there is more than one input variable, then such linear regression is called multiple linear
regression.
o The relationship between variables in the linear regression model can be explained using the below
image. Here we are predicting the salary of an employee on the basis of years of experience.
K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases based on a
similarity measure (e.g., distance functions). KNN has been used in statistical estimation and pattern
recognition as a non-parametric technique since the beginning of the 1970s.
ALGORITHM
A case is classified by a majority vote of its neighbors, with the case being assigned to the class most
common amongst its K nearest neighbors measured by a distance function. If K = 1, then the case is simply
assigned to the class of its nearest neighbor.
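The majority-vote rule can be sketched as follows. Euclidean distance is assumed as the distance function, and the labelled points are invented for the example.

```python
import math
from collections import Counter


def knn_classify(training, query, k):
    """Return the most common label among the k training points nearest to query."""
    # Sort the stored cases by Euclidean distance to the query point.
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    # Majority vote among the k nearest neighbours.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labelled cases: (features, class label)
training = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((5, 5), "B"), ((6, 5), "B"), ((5, 6), "B"),
]

print(knn_classify(training, (2, 2), k=3))
print(knn_classify(training, (5, 4), k=1))
```

There is no training step beyond storing the cases; all the work happens at prediction time, when distances to every stored case are computed.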
What is a classifier?
A classifier is a machine learning model that is used to discriminate different objects based on certain
features.
Bayes Theorem:
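For reference, Bayes' theorem is commonly stated as follows for two events A and B with P(B) nonzero. The notation here is the standard textbook form, not necessarily the exact layout used in the original notes.

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```

Here P(A | B) is the posterior, P(B | A) the likelihood, P(A) the prior and P(B) the evidence.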
Using Bayes theorem, we can find the probability of A happening, given that B has occurred. Here, B
is the evidence and A is the hypothesis. The assumption made here is that the predictors/features are
independent. That is, the presence of one particular feature does not affect the others. Hence it is called
naive.
Example:
Let us take an example to get some better intuition. Consider the problem of playing golf. The dataset is
represented as below.
We classify whether the day is suitable for playing golf, given the features of the day. The columns represent
these features and the rows represent individual entries. If we take the first row of the dataset, we can observe
that it is not suitable for playing golf if the outlook is rainy, the temperature is hot, the humidity is high
and it is not windy. We make two assumptions here. First, as stated above, we consider that these predictors
are independent. That is, if the temperature is hot, it does not necessarily mean that the humidity is high.
The second assumption is that all the predictors have an equal effect on the outcome. That is, the day being
windy does not have more importance in deciding whether to play golf or not.
The variable y is the class variable (play golf), which represents whether it is suitable to play golf or not
given the conditions. Variable X represents the parameters/features.
X is given as,
$X = (x_1, x_2, \ldots, x_n)$
Here x_1, x_2, …, x_n represent the features, i.e. they can be mapped to outlook, temperature, humidity and
windy. By substituting for X and expanding using the chain rule we get,
Now, you can obtain the values for each by looking at the dataset and substitute them into the equation. For all
entries in the dataset, the denominator does not change; it remains static. Therefore, the denominator can be
removed and a proportionality can be introduced.
In our case, the class variable (y) has only two outcomes, yes or no. There could be cases where the
classification is multiclass. Therefore, we need to find the class y with maximum probability.
Using the above function, we can obtain the class, given the predictors.
Types of Naive Bayes Classifier:
Naive Bayes Classifier technique is based on the so-called Bayesian theorem and is particularly suited when
the dimensionality of the inputs is high. Despite its simplicity, Naive Bayes can often outperform more
sophisticated classification methods.
To demonstrate the concept of Naïve Bayes Classification, consider the example displayed in the illustration
above. As indicated, the objects can be classified as either GREEN or RED. Our task is to classify new cases
as they arrive, i.e., decide to which class label they belong, based on the currently existing objects.
Since there are twice as many GREEN objects as RED, it is reasonable to believe that a new case (which
hasn't been observed yet) is twice as likely to have membership GREEN rather than RED. In the Bayesian
analysis, this belief is known as the prior probability. Prior probabilities are based on previous experience, in
this case the percentage of GREEN and RED objects, and are often used to predict outcomes before they actually
happen.
Thus, we can write:
Although the prior probabilities indicate that X may belong to GREEN (given that there are twice as many
GREEN compared to RED) the likelihood indicates otherwise; that the class membership of X is RED (given
that there are more RED objects in the vicinity of X than GREEN). In the Bayesian analysis, the final
classification is produced by combining both sources of information, i.e., the prior and the likelihood, to form
a posterior probability using the so-called Bayes' rule (named after Rev. Thomas Bayes 1702-1761).
Assume that you are given characteristic information about 10,000 people living in your town. You are asked to
study them and come up with an algorithm which should be able to tell whether a new person coming to the
town is male or female.
The tree shown above divides the data in such a way that we gain the maximum information. To understand
the tree:
– If a person’s hair length is less than 5 inches and weight is greater than 55 kg, then there is an 80% chance
of that person being male.
If you are familiar with Predictive Modelling, e.g., Logistic Regression, Random Forest etc., you might be
wondering what the difference is between a Logistic Model and a Decision Tree,
because in both algorithms we are trying to predict a categorical variable.
There are a few fundamental differences between the two, but ideally both approaches should give you the
same results. The best use of Decision Trees is when your solution requires a representation. For example,
you are working for a Telecom Operator and building a solution using which a call center agent can take a
There is very little chance that a call center executive will understand Logistic Regression or its
equations, but with a more visually appealing solution you might gain better adoption from your call center
team.
How does Decision Tree work?
There are multiple algorithms for building a decision tree, which can be used according to the
characteristics of the problem you are trying to solve. A few of the commonly used algorithms are listed below:
Though the methods differ between decision tree building algorithms, all of them work on the
principle of greediness. The algorithms search for the variable which gives the maximum information gain or
divides the data in the most homogeneous way.
For example, consider the following hypothetical dataset which contains the Lead Actor and Genre of a movie
along with its success at the box office:
Lead Actor Genre Hit(Y/N)
Let's say you want to identify the success of the movie but you can use only one variable. There are the
following two ways in which this can be done:
You can clearly observe that Method 1 (based on lead actor) splits the data best, while the second method
(based on genre) has produced mixed results. Decision Tree algorithms do similar things when it comes to
selecting variables.
There are various metrics which decision trees use in order to find the best split variables. We’ll go
through them one by one and try to understand what they mean.
Entropy & Information Gain
The word Entropy is borrowed from Thermodynamics, where it is a measure of variability, chaos or
randomness. Shannon extended the thermodynamic entropy concept in 1948 and introduced it into statistical
studies, suggesting the following formula for statistical entropy:
The graph shown above shows the variation of Entropy with the probability of a class. We can clearly see
that Entropy is maximum when the two classes are equally probable. Now you can understand
that when a decision algorithm tries to split the data, it selects the variable which will give us the
maximum reduction in system Entropy.
The impurity or entropy captured by splitting the data using Method 1 can be calculated using the
following formula: “Entropy (Parent) – Weighted Average of Children Entropy”
Which is,
Now using the method used above, we can calculate the Information Gain as:
Hence, we can clearly see that Method 1 gives more than 4 times the information gain of Method 2,
so the variable used in Method 1 is the best split variable.
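The calculation described above can be sketched in Python. This is a minimal illustration that assumes the standard Shannon entropy, H = -sum(p_i * log2(p_i)), and the "parent entropy minus weighted average of children entropy" rule quoted earlier. The function names, the class counts and the two candidate splits are hypothetical; they are not the movie data discussed above.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(parent, children):
    """Entropy(parent) minus the size-weighted average entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical parent node: 14 movies, 9 hits ("Y") and 5 flops ("N").
parent = ["Y"] * 9 + ["N"] * 5
# Hypothetical split A: the two child nodes are fairly pure.
split_a = [["Y"] * 7 + ["N"] * 1, ["Y"] * 2 + ["N"] * 4]
# Hypothetical split B: both child nodes stay mixed.
split_b = [["Y"] * 5 + ["N"] * 3, ["Y"] * 4 + ["N"] * 2]

gain_a = information_gain(parent, split_a)
gain_b = information_gain(parent, split_b)
print(f"gain A = {gain_a:.3f}, gain B = {gain_b:.3f}")
```

A greedy decision tree would pick the split with the larger gain, here split A.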
Gain Ratio
Soon after information gain came into use, it was realized that it is biased toward multi-valued
attributes. To overcome this issue, “Gain Ratio” came into the picture, which is more reliable than
information gain. The gain ratio can be defined as:
Assume we are dividing our variable into ‘n’ child nodes and Di represents the number of records going
into the i-th child node. The gain ratio thus takes care of distribution bias while building a decision tree.
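A short sketch of how this correction works, assuming the usual C4.5-style definition: gain ratio = information gain / split information, where split information is the entropy of the child-node sizes Di. The function names and the numbers are hypothetical and only illustrate the bias correction.

```python
import math

def split_info(child_sizes):
    """Entropy of the distribution of records over the child nodes (sizes D1..Dn)."""
    total = sum(child_sizes)
    return -sum((d / total) * math.log2(d / total) for d in child_sizes if d > 0)

def gain_ratio(info_gain, child_sizes):
    """Information gain normalised by split information (0.0 if there is no real split)."""
    si = split_info(child_sizes)
    return info_gain / si if si > 0 else 0.0

# Hypothetical: two variables with the same information gain of 0.25 on 14 records.
two_way = gain_ratio(0.25, [7, 7])       # binary variable, two balanced child nodes
many_way = gain_ratio(0.25, [1] * 14)    # multi-valued variable, one record per child

print(f"two-way: {two_way:.3f}, many-way: {many_way:.3f}")
```

The multi-valued variable is penalised because its split information is much larger, even though the raw information gain is identical.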
And Hence,
Gini Index
One more metric which can be used while building a decision tree is the Gini Index (the Gini Index is
mostly used in CART). The Gini index measures the impurity of a data partition K. The formula for the Gini Index can be
written down as:
Where m is the number of classes, and P_i is the probability that an observation in K belongs to class i. The Gini
Index assumes a binary split for each attribute in S, let us say into T1 and T2. The Gini index of K given this
partitioning is given by:
Which is nothing but a weighted sum of the impurities of the split nodes. The reduction in impurity is
given by:
Similar to Information Gain and Gain Ratio, the split which gives the maximum reduction in impurity is
chosen for dividing our data.
= 0.49
= 0.24 + 0.19
= 0.43
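The Gini calculation can be sketched as follows. This assumes the common definitions Gini(K) = 1 - sum(P_i^2), a size-weighted sum of the Gini values of T1 and T2 for the split, and the reduction as their difference. The function names and the example partition are hypothetical and do not reproduce the worked numbers above.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a partition: 1 minus the sum of squared class probabilities."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def gini_split(t1, t2):
    """Size-weighted Gini impurity of a binary split into T1 and T2."""
    total = len(t1) + len(t2)
    return len(t1) / total * gini(t1) + len(t2) / total * gini(t2)

# Hypothetical partition K with two classes and one candidate binary split.
k = ["Y"] * 5 + ["N"] * 5
t1 = ["Y"] * 4 + ["N"] * 1
t2 = ["Y"] * 1 + ["N"] * 4

reduction = gini(k) - gini_split(t1, t2)
print(f"Gini(K) = {gini(k):.2f}, split = {gini_split(t1, t2):.2f}, reduction = {reduction:.2f}")
```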
LINEAR REGRESSION
Linear regression is a statistical approach for modelling the relationship between a dependent variable
and a given set of independent variables.
Simple linear regression is an approach for predicting a response using a single feature.
It is assumed that the two variables are linearly related. Hence, we try to find a linear function
that predicts the response value (y) as accurately as possible as a function of the feature or independent
variable (x).
Let us consider a dataset where we have a value of response y for every feature x:
Now, the task is to find the line which best fits the above scatter plot so that we can predict the
response for any new feature value (i.e. a value of x not present in the dataset).
This line is called the regression line.
The equation of the regression line is represented as:
h(x_i) = b_0 + b_1 x_i
Here,
h(x_i) represents the predicted response value for the ith observation.
b_0 and b_1 are regression coefficients and represent the y-intercept and slope of the regression line
respectively.
To create our model, we must “learn” or estimate the values of the regression coefficients b_0 and b_1.
Once we’ve estimated these coefficients, we can use the model to predict responses.
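One common way to estimate b_0 and b_1 is ordinary least squares. The sketch below assumes that choice (the text above does not name an estimation method) and uses a small hypothetical dataset; `fit_line` and `predict` are illustrative names.

```python
def fit_line(x, y):
    """Least-squares estimates of the intercept b_0 and slope b_1 of a regression line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations of x and y, and sum of squared deviations of x.
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_xx = sum((xi - mean_x) ** 2 for xi in x)
    b_1 = ss_xy / ss_xx
    b_0 = mean_y - b_1 * mean_x
    return b_0, b_1

def predict(b_0, b_1, x_new):
    """Predicted response h(x) = b_0 + b_1 * x."""
    return b_0 + b_1 * x_new

# Hypothetical data lying exactly on the line y = 1 + 2x.
x = [0, 1, 2, 3, 4]
y = [1, 3, 5, 7, 9]
b_0, b_1 = fit_line(x, y)
print(b_0, b_1, predict(b_0, b_1, 10))
```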
LOGISTIC REGRESSION
Consider a model with features x1, x2, x3 … xn. Let the binary output be denoted by Y, which can take the
values 0 or 1.
Let p be the probability of Y = 1; we can denote it as p = P(Y=1).
The mathematical relationship between these variables can be denoted as:
ln(p/(1−p)) = b0 + b1x1 + b2x2 + … + bnxn
Here the term p/(1−p) is known as the odds and denotes the likelihood of the event taking place.
Thus ln(p/(1−p)) is known as the log odds and is simply used to map the probability, which lies between 0
and 1, to the range (−∞,+∞). The terms b0, b1, b2… are parameters (or weights) that we will
estimate during training.
This is the basic math behind what we are going to do. We are interested in the probability p in this
equation, so we simplify the equation to obtain the value of p:
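1. The log term ln on the LHS can be removed by raising the RHS as a power of e:
p/(1−p) = e^(b0 + b1x1 + b2x2 + … + bnxn)
2. Solving for p gives:
p = 1/(1 + e^−(b0 + b1x1 + b2x2 + … + bnxn))
This actually turns out to be the equation of the Sigmoid Function which is widely used in other machine
learning applications. The Sigmoid Function is given by:
S(x) = 1/(1 + e^−x)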
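A minimal sketch of the sigmoid, S(z) = 1/(1 + e^(-z)), and of how it turns the weighted sum of the features into a probability. The weights, bias and feature values are hypothetical; `log_odds` is included only to show that it inverts the sigmoid.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def probability(weights, bias, features):
    """p = sigmoid(b0 + b1*x1 + ... + bn*xn)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def log_odds(p):
    """ln(p / (1 - p)), the inverse of the sigmoid."""
    return math.log(p / (1.0 - p))

# Hypothetical weights b1, b2 and bias b0 for a model with two features.
p = probability(weights=[0.5, -0.25], bias=0.1, features=[2.0, 4.0])
print(p, log_odds(p))
```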
Now we will be using the above derived equation to make our predictions. Before that we will train our
model to obtain the values of our parameters b0, b1, b2… that result in least error. This is where the error
or loss function comes in.
Loss Function
The loss is basically the error in our predicted value. In other words, it is the difference between our
predicted value and the actual value. We will be using the L2 Loss Function to calculate the error.
Theoretically you can use any function to calculate the error. This function can be broken down as:
1. Let the actual value be yᵢ. Let the value predicted using our model be denoted as ȳᵢ. Find the
difference between the actual and predicted value.
Now that we have the error, we need to update the values of our parameters to minimize this error. This is
where the “learning” actually happens, since our model is updating itself based on its previous output to
obtain a more accurate output in the next step. Hence with each iteration our model becomes more and
more accurate. We will be using the Gradient Descent Algorithm to estimate our parameters. Another
commonly used approach is Maximum Likelihood Estimation.
The loss or error on the y axis and number of iterations on the x axis.
You might know that the partial derivative of a function at its minimum value is equal to 0. Gradient
descent uses this concept to estimate the parameters or weights of our model by minimizing the
loss function.
For simplicity, for the rest of this tutorial let us assume that our output depends only on a single feature x.
So we can rewrite our equation as:
p = 1/(1 + e^−(b0 + b1x))
Thus we need to estimate the values of weights b0 and b1 using our given training data.
1. Initially let b0=0 and b1=0. Let L be the learning rate. The learning rate controls by how much the
values of b0 and b1 are updated at each step in the learning process. Here let L=0.001.
2. Calculate the partial derivative with respect to b0 and b1. The value of the partial derivative will
tell us how far the loss function is from its minimum value. It is a measure of how much our weights
need to be updated to attain minimum or ideally 0 error. In case you have more than one feature, you
need to calculate the partial derivative for each weight b0, b1 … bn where n is the number of features.
4. We repeat this process until our loss function is a very small value or ideally reaches 0 (meaning no
errors and 100% accuracy). The number of times we repeat this learning process is known as iterations or
epochs.
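The training loop described in these steps can be sketched for the single-feature case. This is an illustration under stated assumptions: the loss is the mean L2 loss, and the update step (not shown in the text above) is taken to be the usual rule b = b - L * derivative. The training data and the function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def l2_loss(b0, b1, xs, ys):
    """Mean squared difference between actual values and predicted probabilities."""
    return sum((y - sigmoid(b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def train(xs, ys, L=0.001, epochs=1000):
    b0, b1 = 0.0, 0.0                      # step 1: initial weights
    n = len(xs)
    for _ in range(epochs):                # step 4: repeat for a number of epochs
        d_b0 = d_b1 = 0.0
        for x, y in zip(xs, ys):           # step 2: partial derivatives of the loss
            p = sigmoid(b0 + b1 * x)
            common = -2.0 * (y - p) * p * (1.0 - p)
            d_b0 += common / n
            d_b1 += common * x / n
        b0 -= L * d_b0                     # assumed update rule: move against the gradient
        b1 -= L * d_b1
    return b0, b1

# Hypothetical training data: negative x values belong to class 0, positive to class 1.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]
b0, b1 = train(xs, ys)
print(l2_loss(0.0, 0.0, xs, ys), l2_loss(b0, b1, xs, ys))
```

With such a small learning rate the loss falls slowly; in practice the number of epochs or the value of L is tuned.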
In this equation b0 is the regression coefficient for the intercept and the bi values are the regression
coefficients (for variables 1 through k) computed from the data.
For example, we could estimate (i.e., predict) a person's weight as a function of the person's height and
gender. You could use linear regression to estimate the respective regression coefficients from a sample of
data by measuring height and weight and observing the subjects' gender. For many data analysis problems,
estimates of the linear relationships between variables are adequate to describe the observed data and to
make reasonable predictions for new observations.
However, there are many relationships that cannot adequately be summarized by a simple linear equation, for
two major reasons:
Distribution of dependent variable. First, the dependent variable of interest may have a non-continuous
distribution, and thus, the predicted values should also follow the respective distribution; any other predicted
values are not logically possible. For example, a researcher may be interested in predicting one of three
possible discrete outcomes (e.g., a consumer's choice of one of three alternative products). In that case, the
dependent variable can only take on 3 distinct values, and the distribution of the dependent variable is said to
be multinomial. Or suppose you are trying to predict people's family planning choices, specifically, how
many children families will have, as a function of income and various other socioeconomic indicators. The
dependent variable - number of children - is discrete (i.e., a family may have 1, 2, or 3 children and so on, but
cannot have 2.4 children), and most likely the distribution of that variable is highly skewed (i.e., most
families have 1, 2, or 3 children, fewer will have 4 or 5, very few will have 6
Support Vector Machines (SVMs) are supervised learning models with associated learning algorithms that
analyze data for classification (classification means knowing what belongs to what, e.g. ‘apple’ belongs to
class ‘fruit’ while ‘dog’ belongs to class ‘animals’; see fig. 1).
In support vector machines, it looks somewhat like a line which separates the blue balls from the red.
SVM is a classifier formally defined by a separating hyperplane. A hyperplane is a subspace of one
dimension less than its ambient space. The dimension of a mathematical space (or object) is informally
defined as the minimum number of coordinates (x, y, z axes) needed to specify any point (like each blue and
red point) within it, while an ambient space is the space surrounding a mathematical object.
Therefore the hyperplane of the two-dimensional space below (fig. 2) is a one-dimensional line dividing the red
and blue dots.
Can you try to solve the above problem linearly like we did with Fig. 2? No!
The red and blue balls cannot be separated by a straight line as they are randomly distributed and this, in
reality, is how most real-life problem data are: randomly distributed.
In machine learning, a “kernel” usually refers to the kernel trick, a method of using a linear
classifier to solve a non-linear problem. It entails transforming linearly inseparable data (Fig. 3) into
linearly separable data (Fig. 2). The kernel function is what is applied on each data instance to map the
original non-linear observations into a higher-dimensional space in which they become separable.
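The idea of mapping data into a higher-dimensional space can be illustrated with an explicit feature map. The sketch below is hypothetical: it places "red" points on an inner ring and "blue" points on an outer ring, which no straight line can separate in 2D, and adds a third coordinate x^2 + y^2. Note that a real kernel method computes similarities in the mapped space implicitly rather than building the mapped points as done here.

```python
import math

def feature_map(point):
    """Map (x, y) to (x, y, x^2 + y^2); the new coordinate is the squared radius."""
    x, y = point
    return (x, y, x * x + y * y)

def on_circle(radius, count=8):
    """Evenly spaced points on a circle of the given radius around the origin."""
    return [(radius * math.cos(2 * math.pi * i / count),
             radius * math.sin(2 * math.pi * i / count)) for i in range(count)]

red = on_circle(1.0)    # inner ring
blue = on_circle(3.0)   # outer ring surrounding the red points

red_z = [feature_map(p)[2] for p in red]
blue_z = [feature_map(p)[2] for p in blue]

# In the mapped space the flat plane z = 5 separates the two classes.
threshold = 5.0
separable = all(z < threshold for z in red_z) and all(z > threshold for z in blue_z)
print(separable)
```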
Using the dog breed prediction example again, kernels offer a better alternative. Instead of defining a slew of
features, you define a single kernel function to compute similarity between breeds of dog. You provide this
kernel, together with the data and labels to the learning algorithm, and out comes a classifier.
So this is with two features, and we see we have a 2D graph. If we had three features, we could have a 3D
graph. The 3D graph would be a little more challenging for us to visually group and divide, but still do-able.
The problem occurs when we have four features, or four-thousand features. Now you can start to understand
the power of machine learning, seeing and analyzing a number of dimensions imperceptible to us.
Common examples include image classification (is it a cat, dog, human, etc.) or
handwritten digit recognition (classifying an image of a handwritten number into a digit
from 0 to 9).
This algorithm applies the same trick as k-means, with one difference: in the calculation of
distance, a kernel method is used instead of the Euclidean distance.
UNIT-III
Supervised Learning – II (Neural Networks)
Neural Network Representation – Problems – Perceptrons, Activation Functions, Artificial Neural
Networks (ANN), Back Propagation Algorithm.
Convolutional Neural Networks - Convolution and Pooling layers, Recurrent Neural Networks (RNN).
Classification Metrics: Confusion matrix, Precision, Recall, Accuracy, F-Score, ROC curves
A neuron consists of three basic components: weights, thresholds and a single activation function. An
Artificial Neural Network (ANN) model based on biological neural systems is shown in Figure 2.
Training: It is the process in which the network is taught to change its weights and biases.
Learning: It is the internal process of training where the artificial neural system learns to
update/adapt the weights and biases.
Features of Backpropagation:
1. It is the gradient descent method as used in the case of a simple perceptron network with
a differentiable unit.
2. It differs from other networks in the process by which the weights are
calculated during the learning period of the network.
3. Training is done in three stages:
the feed-forward of the input training pattern
the calculation and backpropagation of the error
the updating of the weights
Backpropagation Algorithm:
Step 3: Calculate the output of each neuron from the input layer to the hidden layer to the output
layer.
Step 4: Calculate the error in the outputs
Backpropagation Error= Actual Output – Desired Output
Step 5: From the output layer, go back to the hidden layer to adjust the weights to reduce the
error.
Step 6: Repeat the process until the desired output is achieved.
Parameters:
x = input training vector, x = (x1, x2, ..., xn).
t = target vector, t = (t1, t2, ..., tn).
δk = error at output unit.
δj = error at hidden layer.
α = learning rate.
V0j = bias of hidden unit j.
Backpropagation is short for “backpropagation of errors” and is very useful for training neural networks.
It is fast, simple and easy to implement. Backpropagation has few parameters to
tune apart from the number of inputs and the learning rate α. It is a flexible method because no prior
knowledge of the network is required.
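The three training stages (feed-forward, backpropagation of the error, weight update) can be sketched for a tiny network with one hidden layer of sigmoid units. This is a minimal illustration, not the algorithm listing from these notes: the training data (the logical OR function), the network size, the squared-error loss and the per-sample update are all assumptions. The error follows the convention above, actual output minus desired output.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, alpha=0.5, epochs=2000, hidden=3, seed=0):
    """Train a 2-input, one-hidden-layer network; returns the mean squared error per epoch."""
    rng = random.Random(seed)
    # v[j] = [bias V0j, weight from x1, weight from x2] for hidden unit j.
    v = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(hidden)]
    # w = [output bias, weight from hidden unit 1, ..., weight from hidden unit n].
    w = [rng.uniform(-1, 1) for _ in range(hidden + 1)]
    losses = []
    for _ in range(epochs):
        loss = 0.0
        for x, t in data:
            # Stage 1: feed-forward from input layer to hidden layer to output layer.
            z = [sigmoid(vj[0] + vj[1] * x[0] + vj[2] * x[1]) for vj in v]
            y = sigmoid(w[0] + sum(wj * zj for wj, zj in zip(w[1:], z)))
            loss += (y - t) ** 2
            # Stage 2: error at the output unit (delta_k) and at each hidden unit (delta_j).
            delta_k = (y - t) * y * (1 - y)
            delta_j = [delta_k * w[j + 1] * z[j] * (1 - z[j]) for j in range(hidden)]
            # Stage 3: update the weights against the gradient, scaled by the learning rate.
            w[0] -= alpha * delta_k
            for j in range(hidden):
                w[j + 1] -= alpha * delta_k * z[j]
                v[j][0] -= alpha * delta_j[j]
                v[j][1] -= alpha * delta_j[j] * x[0]
                v[j][2] -= alpha * delta_j[j] * x[1]
        losses.append(loss / len(data))
    return losses

# Hypothetical training set: the logical OR of two binary inputs.
or_data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
losses = train(or_data)
print(losses[0], losses[-1])
```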
Types of Backpropagation
Advantages:
Disadvantages:
It is sensitive to noisy data and irregularities. Noisy data can lead to inaccurate results.
Performance is highly dependent on input data.
Training can take a long time.
The matrix-based approach is preferred over a mini-batch.
Deep learning is a type of computer software that mimics the network of neurons in a brain. It is a subset of
machine learning and is called deep learning because it makes use of deep neural networks.
Each hidden layer is composed of neurons. The neurons are connected to each other. A neuron
processes the input signal it receives and then propagates it to the layer above it. The strength of the signal
given to the neuron in the next layer depends on the weight, bias and activation function.
The network consumes large amounts of input data and processes them through multiple layers; the network
can learn increasingly complex features of the data at each layer.
Deep learning is a powerful tool for turning predictions into actionable results. Deep learning excels in pattern
discovery (unsupervised learning) and knowledge-based prediction. Big data is the fuel for deep learning.
When both are combined, an organization can reap unprecedented results in terms of productivity, sales,
management, and innovation.
Deep learning can outperform traditional methods. For instance, deep learning algorithms are reported to be
41% more accurate than traditional machine learning algorithms in image classification, 27% more accurate in
facial recognition and 25% more accurate in voice recognition.
It has been shown that simple deep learning techniques like CNN can, in some cases, imitate the knowledge
of experts in medicine and other fields. The current wave of machine learning, however, requires training
data sets that are not only labeled but also sufficiently broad and universal.
Deep-learning methods require thousands of observations for models to become relatively good at
classification tasks and, in some cases, millions for them to perform at the level of humans. Unsurprisingly,
deep learning is popular in giant tech companies, which accumulate petabytes of data. This
allows them to create impressive and highly accurate deep learning models.
Unsupervised
Unsupervised feature learning is learning features from unlabeled data. The goal of unsupervised feature
learning is often to discover low-dimensional features that capture some structure underlying the
high-dimensional input data. When the feature learning is performed in an unsupervised way, it enables a form of
semi-supervised learning where features learned from an unlabeled dataset are then employed to improve
performance in a supervised setting with labeled data. Several approaches are introduced in the following.
A Recurrent Neural Network is architected in the same way as a “traditional” Neural Network. We
have some inputs, we have some hidden layers and we have some outputs.
The only difference is that each hidden unit is doing a slightly different function. So, let’s explore how this
hidden unit works.
A recurrent hidden unit computes a function of an input and its own previous output, also known as the cell
state. For textual data, an input could be a vector representing a word x(i) in a sentence of n words (also
known as word embedding).
W and U are weight matrices and tanh is the hyperbolic tangent function.
Similarly, at the next step, it computes a function of the new input and its previous cell state: s2 =
tanh(Wx1 + Us1). This behavior is similar to a hidden unit in a feed-forward network. The difference, specific
to sequences, is that we are adding an additional term to incorporate the unit's own previous state.
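The recurrence s_next = tanh(W x + U s_prev) can be sketched in a few lines. The weight matrices, the input sequence and the zero initial state are hypothetical values chosen only to show that the same W and U are reused at every step.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def rnn_step(W, U, x, s_prev):
    """One recurrent step: tanh(W x + U s_prev)."""
    wx = matvec(W, x)
    us = matvec(U, s_prev)
    return [math.tanh(a + b) for a, b in zip(wx, us)]

# Hypothetical weights: 2 input features, 3 hidden units.
W = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
U = [[0.1, 0.0, 0.2], [0.0, 0.1, -0.1], [0.3, -0.2, 0.1]]
# Hypothetical sequence of 4 input vectors (for text, these would be word embeddings).
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]

state = [0.0, 0.0, 0.0]          # assumed initial cell state
states = []
for x in sequence:               # "unrolling": one step per element of the sequence
    state = rnn_step(W, U, x, state)
    states.append(state)

print(states[-1])
```

Because W and U do not depend on the step, the same loop handles a sequence of any length.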
A common way of viewing recurrent neural networks is by unfolding them across time. We can notice that
we are using the same weight matrices W and U throughout the sequence. This solves our problem of
parameter sharing. We don’t have new parameters for every point of the sequence. Thus, once we learn
something, it can apply at any point in the sequence.
The fact of not having new parameters for every point of the sequence also helps us deal with variable-
length sequences. In case of a sequence that has a length of 4, we could unroll this RNN to four timesteps.
In other cases, we can unroll it to ten timesteps since the length of the sequence is not prespecified in the
algorithm. By unrolling we simply mean that we write out the network for the complete sequence. For
example, if the sequence we care about is a sentence of 5 words, the network would be unrolled into a 5-
layer neural network, one layer for each word.
1. Input Layer: This is the layer in which we give input to our model. The number of neurons in
this layer is equal to the total number of features in our data (the number of pixels in the case of an
image).
2. Hidden Layer: The input from the input layer is then fed into the hidden layer. There can
be many hidden layers depending upon our model and data size. Each hidden layer can have
a different number of neurons, which is generally greater than the number of features. The output from
each layer is computed by matrix multiplication of the output of the previous layer with the learnable weights
of that layer, then by the addition of learnable biases, followed by an activation function which makes
the network nonlinear.
3. Output Layer: The output from the hidden layer is then fed into a logistic function like
sigmoid or softmax which converts the output of each class into the probability score of each class.
The data is then fed into the model and the output from each layer is obtained; this step is called
feedforward. We then calculate the error using an error function; some common error functions are
cross-entropy, square loss error, etc. After that, we backpropagate into the model by calculating the
derivatives. This step is called backpropagation, which is used to minimize the
loss. Here’s the basic Python code for a neural network with random inputs and two hidden
layers.
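A minimal sketch of such a network, showing the feedforward step only. It is an illustrative stand-in rather than the original listing: the layer sizes, the ReLU activation in the hidden layers, the random Gaussian weights and the helper names are all assumptions.

```python
import math
import random

def dense(inputs, weights, biases):
    """Weighted sum of the inputs plus a bias for each neuron in the layer."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

def softmax(values):
    """Convert raw output scores into probabilities that sum to 1."""
    highest = max(values)
    exps = [math.exp(v - highest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def random_layer(rng, n_in, n_out):
    """Random weights and zero biases for a layer with n_in inputs and n_out neurons."""
    weights = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
sizes = [4, 8, 8, 3]   # hypothetical: 4 features, two hidden layers of 8 neurons, 3 classes
layers = [random_layer(rng, n_in, n_out) for n_in, n_out in zip(sizes, sizes[1:])]
x = [rng.random() for _ in range(sizes[0])]   # random input vector

h = x
for weights, biases in layers[:-1]:           # the two hidden layers
    h = relu(dense(h, weights, biases))
probs = softmax(dense(h, *layers[-1]))        # output layer
print(probs)
```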
Now imagine taking a small patch of this image and running a small neural network on it, with say, k
outputs, and representing them vertically. Now slide that neural network across the whole image; as a result,
we will get another image with a different width, height, and depth. Instead of just R, G, and B channels
we now have more channels but a smaller width and height. This operation is called convolution. If the
patch size is the same as that of the image, it will be a regular neural network. Because of this small
patch, we have fewer weights.
Now let’s talk about a bit of mathematics that is involved in the whole
convolution process.
Convolution layers consist of a set of learnable filters (a patch in the above image). Every filter has
small width and height and the same depth as that of input volume (3 if the input layer is image input).
For example, suppose we have to run convolution on an image with dimension 34x34x3. The possible size
of the filters can be a x a x 3, where ‘a’ can be 3, 5, 7, etc., but small compared to the image dimension.
During forward pass, we slide each filter across the whole input volume step by step where each
step is called stride (which can have value 2 or 3 or even 4 for high dimensional images) and compute
the dot product between the weights of filters and patch from input volume.
As we slide our filters we’ll get a 2-D output for each filter and we’ll stack them together and as a
result, we’ll get output volume having a depth equal to the number of filters. The network will learn all
the filters.
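The forward pass of a single filter can be sketched as below. This is a naive illustration without padding, using a hypothetical 34x34x3 image of ones and a 3x3x3 filter of ones; a real layer would stack the 2-D outputs of many learned filters. The output-size formula (n + 2*padding - f) // stride + 1 is the standard one and is stated here as an assumption, since the text above does not give it.

```python
def conv_output_size(n, f, stride=1, padding=0):
    """Number of positions a filter of size f can take along an axis of length n."""
    return (n + 2 * padding - f) // stride + 1

def convolve(image, kernel, stride=1):
    """Slide one a x a x D filter over an H x W x D image (nested lists); returns a 2-D map."""
    a = len(kernel)
    out_h = conv_output_size(len(image), a, stride)
    out_w = conv_output_size(len(image[0]), a, stride)
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            for di in range(a):
                for dj in range(a):
                    pixel = image[i * stride + di][j * stride + dj]
                    # Dot product across the depth (e.g. the R, G, B channels).
                    total += sum(p * k for p, k in zip(pixel, kernel[di][dj]))
            row.append(total)
        output.append(row)
    return output

image = [[[1.0, 1.0, 1.0] for _ in range(34)] for _ in range(34)]   # 34 x 34 x 3
kernel = [[[1.0, 1.0, 1.0] for _ in range(3)] for _ in range(3)]    # 3 x 3 x 3
feature_map = convolve(image, kernel, stride=1)
print(len(feature_map), len(feature_map[0]))
```

With stride 1 and no padding, the 34x34 input shrinks to a 32x32 map for each filter.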
Types of layers:
1. Input Layer: This layer holds the raw input of the image with width 32, height 32, and depth 3.
2. Convolution Layer: This layer computes the output volume by computing the dot product between all filters
and image patches. Suppose we use a total of 12 filters for this layer; we’ll get an output volume of
dimension 32 x 32 x 12.
3. Activation Function Layer: This layer applies an element-wise activation function to the output of
the convolution layer. Some common activation functions are RELU: max(0, x), Sigmoid: 1/(1+e^-x),
Tanh, Leaky RELU, etc. The volume remains unchanged, hence the output volume will have dimension 32 x 32
x 12.
4. Pool Layer: This layer is periodically inserted in convnets and its main function is to reduce the size
of the volume, which makes computation faster, reduces memory use and also helps prevent overfitting. Two common
types of pooling layers are max pooling and average pooling. If we use a max pool with 2 x 2 filters and
stride 2, the resultant volume will be of dimension 16 x 16 x 12.
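Max pooling with a 2 x 2 window and stride 2 can be sketched for a single channel; applying it to each of the 12 channels separately halves the width and height (32 to 16) and leaves the depth unchanged. The function name and the example channel values are hypothetical.

```python
def max_pool(channel, size=2, stride=2):
    """Keep the largest value in each size x size window of a 2-D channel."""
    out_h = (len(channel) - size) // stride + 1
    out_w = (len(channel[0]) - size) // stride + 1
    return [[max(channel[i * stride + di][j * stride + dj]
                 for di in range(size) for dj in range(size))
             for j in range(out_w)]
            for i in range(out_h)]

# Hypothetical 32 x 32 channel whose values increase from left to right and top to bottom.
channel = [[row * 32 + col for col in range(32)] for row in range(32)]
pooled = max_pool(channel)
print(len(pooled), len(pooled[0]))
```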
Performance Metrics
• Accuracy can be calculated by summing the values lying on the “main diagonal” of the confusion matrix and
dividing by the total number of samples, i.e.
Accuracy = (True Positives + True Negatives)/Total Number of Samples
• Precision: It is the number of correct positive results divided by the number of positive results predicted by
the classifier.
• Recall: It is the number of correct positive results divided by the number of all relevant samples
(all samples that should have been identified as positive).
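These metrics can be computed directly from the four cells of a binary confusion matrix. In the sketch below accuracy uses the main diagonal (true positives and true negatives), and the F-score is taken to be the harmonic mean of precision and recall, which is an assumption since it is not defined above. The counts are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F-score from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total        # correct predictions on the main diagonal
    precision = tp / (tp + fp)          # correct positives / predicted positives
    recall = tp / (tp + fn)             # correct positives / actual positives
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_score

# Hypothetical confusion matrix for 100 samples.
accuracy, precision, recall, f_score = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(accuracy, precision, recall, f_score)
```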
Structured prediction is an umbrella term for supervised machine learning techniques that involve predicting
structured objects, rather than scalar discrete or real values.
Similar to commonly used supervised learning techniques, structured prediction models are typically trained
by means of observed data in which the true prediction value is used to adjust model parameters. Due to the
complexity of the model and the interrelations of predicted variables, the process of prediction using a trained
model and of training itself is often computationally infeasible, so approximate inference and learning
methods are used.
For example, the problem of translating a natural language sentence into a syntactic representation such as a
parse tree can be seen as a structured prediction problem in which the structured output domain is the set of
all possible parse trees. Structured prediction is also used in a wide variety of application
domains including bioinformatics, natural language processing, speech recognition, and computer vision.
Sequence tagging is a class of problems prevalent in natural language processing, where input data are often
sequences (e.g. sentences of text). The sequence tagging problem appears in several guises, e.g. part-of-
speech tagging and named entity recognition. In POS tagging, for example, each word in a sequence must
receive a "tag" (class label) that expresses its "type" of word:
DT - Determiner
VB - Verb
JJ - Adjective
NN - Noun
Ranking:
Learning to Rank (LTR) is a class of techniques that apply supervised machine learning (ML) to solve
ranking problems. The main difference between LTR and traditional supervised ML is this:
The most common application of LTR is search engine ranking, but it's useful anywhere you need to produce
a ranked list of items.
The training data for an LTR model consists of a list of items and a "ground truth" score for each of those
items. For search engine ranking, this translates to a list of results for a query and a relevance rating for each
of those results with respect to the query. The most common way used by major search engines to generate
these relevance ratings is to ask human raters to rate results for a set of queries.
Learning to rank algorithms have been applied in areas other than information retrieval:
UNIT - IV
Model Validation in Classification : Cross Validation - Holdout Method, K-Fold, Stratified K-Fold, Leave-
One-Out Cross Validation. Bias-Variance tradeoff, Regularization , Overfitting, Underfitting. Ensemble
Methods: Boosting, Bagging, Random Forest.
Model validation is the process of evaluating a trained model on a test data set. This estimates the
generalization ability of a trained model. Here I provide a step by step approach to complete the first iteration of
model validation in minutes.
The basic recipe for applying a supervised machine learning model is:
Cross-validation (CV) is a technique used to train and evaluate an ML model using several portions of a dataset.
This implies that rather than splitting the dataset into only two parts, one to train on and another to test on, the
dataset is divided into more slices, or “folds”. CV techniques use these slices to train the ML model
and to test its predictive capability and hence its accuracy.
In the process of building a training set, different portions of data are gathered, while the remaining ones
are reserved for constructing a validation set. This strategic approach ensures that the model
continuously leverages new and diverse data during training and testing stages, promoting its ability to
adapt to various scenarios and challenges.
One key objective of employing cross-validation is to safeguard the model against overfitting.
Overfitting occurs when a model simply memorizes the samples in the training set, resulting in an
artificially high score on the training data. However, such a model may struggle to generalize to unseen
data, leading to a lack of useful results. By validating the model's performance on a separate validation
set, CV helps identify whether the model has truly learned meaningful patterns and can generalize to new and
unseen scenarios effectively.
1. Slice the dataset and reserve a portion of it as the validation set.
2. Using what's left, train the ML model.
3. Use CV techniques to test the model using the reserved portion of the dataset created in step 1.
1. CV assists in realizing the optimal tuning of hyperparameters (or model settings) that increase
the overall efficiency of the ML model's performance.
2. Training data is efficiently utilized as every observation is employed for both testing and
training.
1. One of the main considerations with cross-validation (CV) is the significant increase in testing
and training time it requires for machine learning models. This is because CV involves multiple
iterative training and testing cycles to estimate the accuracy of the model.
Each cycle includes test preparation, execution, and analysis of the results.
Therefore, understanding the time commitment
involved in CV is crucial for effectively leveraging its potential benefits.
2. Additional computation translates to increased resource demands. Cross-validation is known for
its high computational expense, necessitating ample processing power. Together with the first
drawback of extended time, this further inflates the budgetary requirements of an ML model
project.
Cross-validation in machine learning is a crucial technique for evaluating the performance of predictive
models. It involves dividing the available data into multiple subsets, or folds, to train and test the model
iteratively. Non-exhaustive methods, such as k-fold cross-validation, randomly partition the data into k
subsets and train the model on k-1 folds while evaluating it on the remaining fold. On the other hand,
exhaustive methods, like leave-one-out cross-validation, systematically leave out one data point at a
time for testing while training the model on the remaining data points. These methods provide a
comprehensive assessment of the model's performance and help in addressing overfitting or underfitting
issues effectively.
1. Holdout Method
2. K-Fold CV
3. Stratified K-Fold CV
4. Leave-P-Out CV
5. Leave-One-Out CV
Holdout Method
The holdout method is a basic CV approach in which the original dataset is divided into two discrete
segments:
1. Training Data - As a reminder this set is used to fit and train the model.
2. Test Data - This set is used to evaluate the model.
As a non-exhaustive method, the holdout method trains the ML model on the training dataset and
evaluates it using the test dataset.
In the majority of cases, the training dataset is much larger than the test dataset.
Therefore, a standard holdout split ratio is 70:30 or 80:20. Furthermore, the overall dataset is
randomly shuffled before dividing it into the training and test portions using the predetermined
ratio.
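A holdout split can be sketched as a shuffle followed by a cut at the chosen ratio. The function name, the fixed random seed and the example data are hypothetical; the 80:20 ratio is one of the standard choices mentioned above.

```python
import random

def holdout_split(data, test_ratio=0.2, seed=0):
    """Randomly shuffle the data, then split it into training and test portions."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical dataset of 100 observations, split 80:20.
data = list(range(100))
train, test = holdout_split(data, test_ratio=0.2)
print(len(train), len(test))
```

Changing the seed gives a different split, which is one reason holdout results can vary from run to run.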
There are several disadvantages to the holdout method that need to be considered. One drawback is that
as the model trains on distinct combinations of data points, it can sometimes yield inconsistent results,
which can introduce doubt into the validity of the model and the overall validation process.
Another concern is that there is no certainty that the training dataset selected fully represents the
complete dataset. If the original data sample is not large enough, there is a possibility that the test data
may contain information that the model will fail to recognize because it was not included in the original
training data portion.
However, despite these limitations, the Holdout CV method can be considered ideal in situations where
time is a scarce project resource and there is an urgency to train and test an ML model using a large
dataset.
K fold Cross-Validation
The k-fold cross-validation method is considered an improvement over the holdout method due to its
ability to provide additional consistency to the overall testing score of machine learning models. This
improvement is achieved by applying a specific procedure for selecting and dividing the training and
testing datasets.
To implement k-fold cross-validation, the original dataset is divided into k number of partitions. The
holdout method is then performed k number of occasions, each time using a different partition as the
testing set, while the remaining partitions are used for training. This repeated process helps to obtain a
more reliable and robust evaluation of the model's performance by leveraging a larger amount of data for
testing and training purposes.
Let us look at an example: if the value of k is set to six, there will be six subsets of equal size, or
folds, of data. In the first iteration, the model trains on five of the subsets and validates on the remaining
one. In the second iteration, the model re-trains on a different combination of five subsets and is tested on the
subset left out. And so on for six iterations in total, so that each subset is used exactly once for testing.
The k-fold cross-validation randomly splits the original dataset into k number of folds
The test results of each iteration are then averaged, and the result is called the CV accuracy. Finally, CV
accuracy is used as a performance metric to compare different ML models.
It is important to note that the value of k is not fixed by the method; it is chosen by the practitioner.
However, the k value is commonly set to ten within the data science field. The k-fold cross-validation
approach is widely recognized for producing evaluations that depend less on one particular split. By
ensuring that each data point is used for testing exactly once and for training in the remaining
iterations, this technique makes the evaluation of the models more objective.
Moreover, the k-fold method proves to be particularly advantageous for data science projects
with a limited amount of data. It maximizes the use of the available data by repeatedly training and
testing on different subsets.
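The procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the helper names (k_fold_indices, cross_validate) and the toy threshold model are assumptions made up for this example.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th index, so fold sizes differ by at most one.
    return [indices[i::k] for i in range(k)]

def cross_validate(xs, ys, k, fit, predict, seed=0):
    """Return the per-fold accuracies and their mean (the CV accuracy)."""
    folds = k_fold_indices(len(xs), k, seed)
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        # Train on the other k-1 folds, test on the held-out fold.
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        correct = sum(predict(model, xs[i]) == ys[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return scores, sum(scores) / len(scores)

# Hypothetical toy model: predict 1 when x exceeds the mean of the training inputs.
def fit_threshold(train_x, train_y):
    return sum(train_x) / len(train_x)

def predict_threshold(threshold, x):
    return int(x > threshold)

xs = list(range(30))
ys = [int(x >= 15) for x in xs]
folds = k_fold_indices(len(xs), 6)
scores, cv_accuracy = cross_validate(xs, ys, 6, fit_threshold, predict_threshold)
print(scores, cv_accuracy)
```

Each index appears in exactly one fold, so every record is tested once and used for training in the other five iterations.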
Jake VanderPlas gives the process of model validation in four simple and clear steps. There is also a whole
process needed before we even get to his first step, such as extracting the information we need from the data
to make a good judgement when choosing a class of model, and adding finishing touches to confirm the results
afterwards. I will go into depth about these steps and break them down further.
Data cleansing and wrangling.
Feature engineering to optimize the metrics. (Skip this during first pass).
Data pre-processing.
Feature selection.
Model selection.
Model validation.
Get the best model and check it against test data set.
Domain knowledge of the problem at hand will be of great use for feature engineering. This is a bigger topic
in itself and requires an extensive investment of time and resources.
Data pre-processing.
Data pre-processing converts features into a format that is more suitable for the estimators. In general,
machine learning models prefer a standardized data set. I will make use of RobustScaler for our
example.
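To show what this kind of scaling does, here is a minimal sketch in plain Python. It assumes the default behaviour of scikit-learn's RobustScaler (subtract the median, divide by the interquartile range); the function name robust_scale and the sample values are made up for illustration.

```python
import statistics

def robust_scale(values):
    """Centre on the median and divide by the interquartile range (IQR)."""
    median = statistics.median(values)
    # 'inclusive' quartiles use linear interpolation between data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        iqr = 1.0  # avoid division by zero for constant features
    return [(v - median) / iqr for v in values]

feature = [1.0, 2.0, 3.0, 4.0, 100.0]  # hypothetical feature with one outlier
scaled = robust_scale(feature)
print(scaled)
```

Because the median and the quartiles barely move when a single extreme value is added, the outlier does not distort the scale of the remaining values.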
Feature selection.
High variance: The model is very sensitive to the provided inputs for the learned features.
Low accuracy: One model (or one algorithm) to fit the entire training data might not provide
you with the nuance your project requires.
Features noise and bias: The model relies heavily on too few features while making a
prediction.
Ensemble Algorithm
A single algorithm may not make the perfect prediction for a given data set. Machine learning
algorithms have their limitations and producing a model with high accuracy is challenging. If we
build and combine multiple models, we have the chance to boost the overall accuracy. We then
implement the combination of models by aggregating the output from each model with two
objectives:
1. Bagging
2. Boosting
3. Stacking
4. Blending
BAGGING
The idea of bagging (bootstrap aggregating) is to train several models independently of one another.
Each model learns from a slightly different subset of the training data set, and the predictions of
all models are then aggregated. Bagging reduces variance and minimizes overfitting. One example of such a
technique is the random forest algorithm.
This technique is based on a bootstrapping sampling technique. Bootstrapping creates multiple sets
of the original training data with replacement. Replacement enables the duplication of sample
instances in a set. Each subset has the same size and can be used to train models in parallel.
The method involves:
Creating multiple subsets from the original dataset with replacement,
Building a base model for each of the subsets,
Running all the models in parallel,
Combining predictions from all models to obtain final predictions.
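The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the base learner is a hypothetical 1-nearest-neighbour model on a single feature, the data set is made up, and the models are built one after another rather than in parallel.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) records from data with replacement."""
    return [rng.choice(data) for _ in data]

def fit_base_model(sample):
    """Hypothetical base learner: 1-nearest-neighbour, which just stores its sample."""
    return list(sample)

def predict_base_model(model, x):
    """Return the label of the stored record whose feature is closest to x."""
    nearest = min(model, key=lambda record: abs(record[0] - x))
    return nearest[1]

def bagging_fit(data, n_models, seed=0):
    """Build one base model per bootstrap sample."""
    rng = random.Random(seed)
    return [fit_base_model(bootstrap_sample(data, rng)) for _ in range(n_models)]

def bagging_predict(models, x):
    """Combine the individual predictions by majority vote."""
    votes = Counter(predict_base_model(model, x) for model in models)
    return votes.most_common(1)[0][0]

# Made-up data: one feature x, label 1 when x >= 10.
data = [(x, int(x >= 10)) for x in range(20)]
models = bagging_fit(data, n_models=25)  # odd count avoids tied votes
low_prediction = bagging_predict(models, 2)
high_prediction = bagging_predict(models, 17)
print(low_prediction, high_prediction)
```

For a regression task the majority vote in bagging_predict would be replaced by an average of the individual predictions.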
Boosting
Boosting is a machine learning ensemble technique that reduces bias and variance by converting weak learners
into strong learners. The weak learners are applied to the dataset in a sequential manner. The first step is building
an initial model and fitting it to the training set.
A second model that tries to fix the errors generated by the first model is then fitted. Here’s what the entire
process looks like:
Create a subset from the original data,
Build an initial model with this data,
Run predictions on the whole data set,
Calculate the error using the predictions and the actual values,
Assign more weight to the incorrect predictions,
Create another model that attempts to fix errors from the last model,
Run predictions on the entire dataset with the new model,
Create several models with each model aiming at correcting the errors generated by the previous one,
Obtain the final model by taking the weighted mean of all the models.
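One concrete way to realize this process is AdaBoost with decision stumps, sketched below. This is an illustrative choice, not necessarily the variant these notes have in mind: it re-weights the whole data set instead of drawing subsets, and the function names and the ten data points are assumptions made for the example.

```python
import math

def stump_predict(threshold, polarity, x):
    """A decision stump: predict `polarity` when x >= threshold, else the opposite."""
    return polarity if x >= threshold else -polarity

def fit_stump(xs, ys, weights):
    """Pick the threshold and polarity with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(w for w, x, y in zip(weights, xs, ys)
                        if stump_predict(threshold, polarity, x) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    """Fit `rounds` stumps sequentially, re-weighting the data after each one."""
    weights = [1.0 / len(xs)] * len(xs)
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = fit_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - error) / error)  # weight of this stump
        ensemble.append((alpha, threshold, polarity))
        # Misclassified points get more weight, correct ones less.
        weights = [w * math.exp(-alpha * y * stump_predict(threshold, polarity, x))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def ensemble_predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(t, p, x) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

# Made-up labels (+1 / -1) that no single stump can separate.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(xs, ys, rounds=3)
predictions = [ensemble_predict(ensemble, x) for x in xs]
print(predictions)
```

No single stump classifies more than seven of these ten points correctly, but the weighted vote of three stumps fits all of them.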
The Random Forest algorithm's widespread popularity stems from its user-friendly nature and
adaptability, enabling it to tackle both classification and regression problems effectively. The
algorithm’s strength lies in its ability to handle complex datasets and mitigate overfitting.
One of the most important features of the Random Forest Algorithm is that it can handle data
sets containing continuous variables, as in the case of regression, and categorical variables, as in
the case of classification. It performs well on both classification and regression tasks.
In this tutorial, we will understand the working of random forest and implement random forest on a
classification task.
As mentioned earlier, Random forest works on the Bagging principle. Now let’s dive in and
understand bagging in detail.
Step 1: In the Random forest model, a subset of data points and a subset of features is selected
for constructing each decision tree. Simply put, n random records and m features are taken from
the data set having k number of records.
Step 2: Individual decision trees are constructed for each sample.
Step 3: Each decision tree will generate an output.
Step 4: Final output is considered based on Majority Voting or Averaging for Classification and
regression, respectively.
For example
Consider the fruit basket as the data, as shown in the figure below. Now n samples are
taken from the fruit basket, and an individual decision tree is constructed for each sample. Each
decision tree will generate an output, as shown in the figure. The final output is decided by
majority voting. In the figure below, you can see that the majority of the decision trees give an
apple as output rather than a banana, so the final output is taken as an apple.
Diversity: Not all attributes/variables/features are considered while making an individual tree;
each tree is different.
Immune to the curse of dimensionality: Since each tree does not consider all the features, the
feature space is reduced.
Parallelization: Each tree is created independently out of different data and attributes. This
means we can fully use the CPU to build random forests.
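Train-Test split: In a random forest, a separate validation split is often unnecessary, because each
tree's bootstrap sample leaves out roughly one third (about 37%) of the records. These out-of-bag
records are not seen by that decision tree and can be used to estimate its error.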
Stability: Stability arises because the result is based on majority voting/ averaging.
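The share of records that a single tree never sees can be checked numerically. Assuming each tree draws n records with replacement from a data set of n records, the chance that a given record is never drawn is (1 - 1/n)^n, which approaches 1/e (about 0.368) as n grows. The sketch below compares this value with a small simulation; the function name and the sample sizes are arbitrary choices.

```python
import math
import random

def out_of_bag_fraction(n_records, rng):
    """Fraction of records that a bootstrap sample of size n_records never picks."""
    drawn = {rng.randrange(n_records) for _ in range(n_records)}
    return 1 - len(drawn) / n_records

rng = random.Random(0)
n = 1000
simulated = sum(out_of_bag_fraction(n, rng) for _ in range(200)) / 200
theoretical = (1 - 1 / n) ** n
print(simulated, theoretical, math.exp(-1))
```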
UNIT – V
Unsupervised Learning : Clustering-K-means, K-Modes, K-Prototypes, Gaussian Mixture
Models, Expectation-Maximization.
Reinforcement Learning: Exploration and exploitation trade-offs, non-associative learning, Markov decision
processes, Q-learning.
Introduction to clustering
As the name suggests, unsupervised learning is a machine learning technique in which models are
not supervised using a training dataset. Instead, the models themselves find the hidden patterns and
insights in the given data. It can be compared to the learning which takes place in the human brain
while learning new things. It can be defined as:
“Unsupervised learning is a type of machine learning in which models are trained using an unlabeled
dataset and are allowed to act on that data without any supervision.”
Below are some main reasons which describe the importance of Unsupervised Learning:
o Unsupervised learning works on unlabeled and uncategorized data,
which makes unsupervised learning more important.
o In the real world, we do not always have input data with the corresponding
output, so to solve such cases we need unsupervised learning.
Here, we have taken unlabeled input data, which means it is not categorized and the corresponding
outputs are also not given. Now, this unlabeled input data is fed to the machine learning model in order to
train it. First, it will interpret the raw data to find the hidden patterns in the data, and then it will
apply a suitable algorithm such as k-means clustering.
Once it applies the suitable algorithm, the algorithm divides the data objects into groups according to
the similarities and differences between the objects.
Types of Unsupervised Learning Algorithm:
The unsupervised learning algorithm can be further categorized into two types of problems:
Clustering: Clustering is a method of grouping objects into clusters such that the objects with the most
similarities remain in one group and have few or no similarities with the objects of another group.
Cluster analysis finds the commonalities between the data objects and categorizes them according to the
presence and absence of those commonalities.
o Hierarchical clustering
o Anomaly detection
o Neural Networks
One of the most widely used clustering algorithms is k-means. It groups the data into k clusters
according to the similarities among them, where k is given as input to the algorithm. I’ll
start with a simple example.
Let’s imagine we have 5 objects (say 5 people) and for each of them we know two features: height and weight.
As you probably already know, I’m using Python libraries to analyze my data. The k-means
algorithm is implemented in the scikit-learn package. To use it, you will just need the following line
in your script:
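The line itself is missing from these notes. With scikit-learn, the usual import is `from sklearn.cluster import KMeans`; that this is what the original showed is an assumption. To show what the algorithm does without depending on any library, here is a minimal plain-Python sketch. The five height and weight pairs are made up for illustration.

```python
import math
import random

def k_means(points, k, max_iterations=100, seed=0):
    """Plain k-means: alternate between assigning points and moving centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda c: math.dist(point, centroids[c]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for index, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(axis) / len(cluster) for axis in zip(*cluster)))
            else:
                new_centroids.append(centroids[index])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters

# Hypothetical (height in cm, weight in kg) for five people.
people = [(185, 72), (170, 56), (168, 60), (179, 68), (182, 72)]
centroids, clusters = k_means(people, k=2)
print(clusters)
```

With these values the two shorter, lighter people end up in one cluster and the three taller, heavier people in the other.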
At this point, you may have noticed something. The basic concept of k-means rests on
mathematical calculations (means, Euclidean distances). But what if our data is non-numerical or, in
other words, categorical? Imagine, for instance, having the ID code and date of birth of the five
people of the previous example, instead of their heights and weights.
We could think of transforming our categorical values into numerical values and then applying
k-means. But beware: k-means uses numerical distances, so it could treat two really different
objects as close merely because they have been assigned two close numbers.
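For categorical data, a common alternative (the one used by k-modes) is to count mismatching attributes instead of measuring numerical distance, and to use the most frequent value of each attribute as the cluster centre. A minimal sketch follows; the function names and the three records are assumptions made up for illustration.

```python
from collections import Counter

def matching_dissimilarity(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

def cluster_mode(records):
    """Attribute-wise most frequent value: the 'centre' of a categorical cluster."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))

# Hypothetical categorical records: (eye colour, smoker, region).
records = [
    ("blue", "yes", "north"),
    ("blue", "no", "north"),
    ("green", "yes", "north"),
]
mode = cluster_mode(records)
distances = [matching_dissimilarity(record, mode) for record in records]
print(mode, distances)
```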
The Expectation-step is used to assign data points to the nearest cluster, and the Maximization-step is
used to recompute the centroid of each cluster.
When using the K-means algorithm, we must keep the following points in mind:
It is suggested to normalize the data while dealing with clustering algorithms such as K-Means,
since such algorithms employ distance-based measurement to identify the similarity between data points.
Because of the iterative nature of K-Means and the random initialization of centroids, K-Means
may become stuck in a local optimum and fail to converge to the global optimum.
k-Prototype
One of the conventional clustering methods, commonly and efficiently used for large data, is the
K-Means algorithm. However, it is not well suited to data
that contains categorical variables. This problem arises because the cost function in K-Means is
calculated using the Euclidean distance, which is only suitable for numerical data. K-Mode, in turn, is
suitable only for categorical data, not for mixed data types.
Facing these problems, Huang proposed an algorithm called K-Prototype, which was created in order to
cluster data with mixed data types (numerical and categorical variables).
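The core idea can be illustrated by the dissimilarity measure that K-Prototype uses: squared Euclidean distance on the numerical attributes plus a weight gamma times the number of mismatches on the categorical attributes. The sketch below shows only this measure, not the full clustering loop; the function name, the gamma value and the sample record are assumptions chosen for illustration.

```python
def k_prototypes_distance(record, prototype, numeric_idx, categorical_idx, gamma=1.0):
    """Mixed dissimilarity: numeric part plus gamma times categorical mismatches."""
    numeric_part = sum((record[i] - prototype[i]) ** 2 for i in numeric_idx)
    categorical_part = sum(record[i] != prototype[i] for i in categorical_idx)
    return numeric_part + gamma * categorical_part

# Hypothetical record and cluster prototype: two numbers, two categories.
record = (1.0, 2.0, "a", "x")
prototype = (0.0, 0.0, "a", "y")
distance = k_prototypes_distance(record, prototype, [0, 1], [2, 3], gamma=0.5)
print(distance)
```

The weight gamma controls how much a categorical mismatch counts relative to numerical differences, so in practice it is chosen with the scale of the numerical attributes in mind.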
Reinforcement learning
Reinforcement learning addresses the question of how an autonomous agent that senses and acts in its
environment can learn to choose optimal actions to achieve its goals.
Introduction
Consider building a learning robot. The robot, or agent, has a set of sensors
to observe the state of its environment, and a set of actions it can perform to alter this state.
Its task is to learn a control strategy, or policy, for choosing actions that achieve its goals.
The goals of the agent can be defined by a reward function that assigns a
numerical value to each distinct action the agent may take from each distinct state.
This reward function may be built into the robot, or known only to an external
teacher who provides the reward value for each action performed by the robot.
The task of the robot is to perform sequences of actions, observe their
consequences, and learn a control policy.
The control policy is one that, from any initial state, chooses actions that
maximize the reward accumulated over time by the agent.
Example:
A mobile robot may have sensors such as a camera and sonars, and actions such as "move
forward" and "turn."
The robot may have a goal of docking onto its battery charger whenever its battery level is low.
The goal of docking to the battery charger can be captured by assigning a positive
reward (e.g., +100) to state-action transitions that immediately result in a connection to the charger
and a reward of zero to every other state-action transition.
1. Delayed reward: The task of the agent is to learn a target function π that maps from
the current state s to the optimal action a = π(s). In reinforcement learning, training information is
not available in the form (s, π(s)). Instead, the trainer provides only a sequence of immediate reward values
as the agent executes its sequence of actions. The agent, therefore, faces the problem of temporal
credit assignment: determining which of the actions in its sequence are to be credited with
producing the eventual rewards.
3. Partially observable states: Although it is convenient to assume that the agent's sensors can
perceive the entire state of the environment at each time step, in many practical situations sensors
provide only partial information.
In such cases, the agent needs to consider its previous observations together with its current sensor
data when choosing actions, and the best policy may be one that chooses actions specifically to
improve the observability of the environment.
4. Life-long learning: A robot is often required to learn several related tasks within the same
environment, using the same sensors. For example, a mobile robot may need to learn how to dock
on its battery charger, how to navigate through narrow corridors, and how to pick up output from
laser printers. This setting raises the possibility of using previously obtained experience or
knowledge to reduce sample complexity when learning new tasks.
Learning Task
Consider a Markov decision process (MDP) where the agent can perceive a set S of distinct states of
its environment and has a set A of actions that it can perform.
At each discrete time step t, the agent senses the current state s_t, chooses a current action
a_t, and performs it.
The environment responds by giving the agent a reward r_t = r(s_t, a_t) and by producing the
succeeding state s_{t+1} = δ(s_t, a_t). Here the functions δ(s_t, a_t) and r(s_t, a_t) depend only on
the current state and action, and not on earlier states or actions.
The task of the agent is to learn a policy, π: S → A, for selecting its next action a_t based on
the current observed state s_t; that is, π(s_t) = a_t.
How shall we specify precisely which policy π we would like the agent to learn?
The average reward criterion considers the average reward per time step over the entire lifetime of the agent.
We require that the agent learn a policy π that maximizes Vπ(s) for all states s. Such a
policy is called an optimal policy and is denoted by π*.
We refer to the value function Vπ*(s) of such an optimal policy as V*(s). V*(s) gives the maximum
discounted cumulative reward that the agent can obtain starting from state s.
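The definition of Vπ is not reproduced in this part of the notes. Assuming the usual discounted cumulative reward with a discount factor γ (0 ≤ γ < 1), which matches the description of V* above, it can be written as:

```latex
V^{\pi}(s_t) \equiv r_t + \gamma r_{t+1} + \gamma^{2} r_{t+2} + \cdots
             = \sum_{i=0}^{\infty} \gamma^{i} r_{t+i},
\qquad
\pi^{*} \equiv \arg\max_{\pi} V^{\pi}(s) \quad \text{for all } s
```

Here r_{t+i} is the reward received i steps after time t when the agent follows policy π starting from state s_t.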
Example:
The six grid squares in this diagram represent six possible states, or locations, for the agent.
Each arrow in the diagram represents a possible action the agent can take to move from one state
to another.
The number associated with each arrow represents the immediate reward r(s, a) the
agent receives if it executes the corresponding state-action transition.
The immediate reward in this environment is defined to be zero for all state-action
transitions except for those leading into the state labelled G. The state G is the goal
state, and the agent can receive reward by entering this state.
Once the states, actions, and immediate rewards are defined, we choose a value for the
discount factor γ and determine the optimal policy π* and its value function V*(s).
Let’s choose γ = 0.9. The diagram at the bottom of the figure shows one optimal policy.
Values of V*(s) and Q(s, a) follow from r(s, a) and the discount factor γ =
0.9. An optimal policy, corresponding to actions with maximal Q values, is also shown.
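Since the figure is not reproduced here, the values can be recomputed under explicit assumptions: a grid of 2 rows and 3 columns, the goal state G in the top-right cell, a reward of 100 for any move into G, zero reward otherwise, and no actions available once G is reached. The layout, the reward of 100 and all function names below are assumptions for this sketch.

```python
GAMMA = 0.9
ROWS, COLS = 2, 3   # assumed layout of the six states
GOAL = (0, 2)       # assumed position of the goal state G
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def delta(state, action):
    """Deterministic transition function: the cell reached by the move."""
    dr, dc = MOVES[action]
    return (state[0] + dr, state[1] + dc)

def actions(state):
    """Moves that stay on the grid; G is absorbing, so it has none."""
    if state == GOAL:
        return []
    return [a for a in MOVES
            if 0 <= delta(state, a)[0] < ROWS and 0 <= delta(state, a)[1] < COLS]

def reward(state, action):
    """Assumed reward: 100 for entering G, zero for every other transition."""
    return 100 if delta(state, action) == GOAL else 0

def value_iteration(sweeps=50):
    """Repeatedly apply V(s) = max_a [ r(s, a) + gamma * V(delta(s, a)) ]."""
    states = [(r, c) for r in range(ROWS) for c in range(COLS)]
    values = {s: 0.0 for s in states}
    for _ in range(sweeps):
        values = {
            s: max((reward(s, a) + GAMMA * values[delta(s, a)] for a in actions(s)),
                   default=0.0)
            for s in states
        }
    return values

v_star = value_iteration()
for row in range(ROWS):
    print([round(v_star[(row, col)], 1) for col in range(COLS)])
```

Under these assumptions the cells next to G have value 100, the cells two moves away have value 90, and the cell three moves away has value 81.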
Non-Associative Learning:
As applied to animal behavior, non-associative learning refers to instances where behavior toward a
stimulus changes in the absence of any apparent associated stimulus or event (such as a reward or punishment).
In non-associative learning, the person is being trained on how to respond to a certain situation.
There is a right and a wrong answer.
Supervised learning algorithms use non-associative learning. These algorithms learn from the
training data. Primarily, they are taught based on the assumption that there is a right or wrong answer.
The cost function, or loss, associated with the algorithm is a similar concept to ‘punishment.’
In non-associative machine learning, you use the training data set to teach the machine learning
algorithm how to predict on the data set.
This is instead of letting the algorithm learn for itself what the outcome should be.
1. REGRESSION ANALYSIS
The classic example of supervised ML using regression is the prediction of house prices.
For example, the training data could pair the number of rooms a house has (input) with the price of the house (output).
This training data will teach the machine how the number of rooms and price are related, allowing it
to predict the output, the cost of a house, from the input, the number of rooms.
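As a small illustration of this idea, the sketch below fits a straight line to made-up room counts and prices using ordinary least squares. The numbers and the function name fit_line are assumptions for this example, not real housing data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data: number of rooms and price (in thousands).
rooms = [1, 2, 3, 4, 5]
prices = [100, 150, 200, 250, 300]
slope, intercept = fit_line(rooms, prices)
predicted_price = slope * 6 + intercept  # prediction for a six-room house
print(slope, intercept, predicted_price)
```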
2. CLASSIFICATION ANALYSIS
If we move on to classification analysis, we begin to use machine learning to determine which group
an object belongs to. One of the classic examples is whether a tumor is malignant or benign.
Or you could use it to say yes or no to whether someone is likely to pass an exam.
Another example is, will this person develop diabetes? Yes or No.
In classification analysis, the labeled training data set will have a sample set of people and their
characteristics alongside whether or not they developed diabetes.
This training data is there to teach the machine how different characteristics of a person’s genetics
or lifestyle contribute to whether or not they would get diabetes.
Q LEARNING
The training information available to the learner is the sequence of immediate rewards r(s_i, a_i)
for i = 0, 1, 2, . . . .
Given this kind of training information, it is easier to learn a numerical evaluation
function defined over states and actions, and then implement the optimal policy in terms of
this evaluation function.
What evaluation function should the agent attempt to learn?
One obvious choice is V*. The agent should prefer state s1 over state s2 whenever
V*(s1) > V*(s2), because the cumulative future reward will be greater from s1.
The optimal action in state s is the action a that maximizes the sum of the immediate
reward r(s, a) plus the value V* of the immediate successor state, discounted by γ.
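Written as a formula, with δ(s, a) denoting the successor state as in the Learning Task section, the statement above reads:

```latex
\pi^{*}(s) = \arg\max_{a} \left[ r(s, a) + \gamma \, V^{*}(\delta(s, a)) \right]
```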
The Q Function
The value of the evaluation function Q(s, a) is the reward received immediately
upon executing action a from state s, plus the value (discounted by γ) of
following the optimal policy thereafter.
The key problem is finding a reliable way to estimate training values for Q, given only
a sequence of immediate rewards r spread out over time.
Rewriting Equation
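The equations themselves are missing at this point in the notes. A reconstruction consistent with the definitions of r, δ, γ and V* used above, offered as a sketch of the standard derivation rather than the original layout, is:

```latex
\begin{aligned}
Q(s, a) &\equiv r(s, a) + \gamma \, V^{*}(\delta(s, a)) \\
V^{*}(s) &= \max_{a'} Q(s, a') \\
Q(s, a) &= r(s, a) + \gamma \max_{a'} Q(\delta(s, a), a')
\end{aligned}
```

The last line follows by substituting the second line into the first, and it defines Q recursively in terms of itself. This is what allows Q to be learned from immediate rewards alone.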
Q learning algorithm:
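The algorithm listing is not reproduced here. The sketch below shows tabular Q learning for a deterministic MDP, assuming the training rule Q̂(s, a) ← r + γ max over a' of Q̂(s', a'), where s' is the state reached from s by action a. The grid layout (2 rows, 3 columns, goal in the top-right cell, reward 100 for entering it), the purely random action selection and all names are assumptions for illustration.

```python
import random

GAMMA = 0.9
ROWS, COLS = 2, 3   # assumed layout of the six-state grid
GOAL = (0, 2)       # assumed position of the absorbing goal state G
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def legal_actions(state):
    """Moves that stay on the grid; the goal state has no actions."""
    if state == GOAL:
        return []
    return [name for name, (dr, dc) in MOVES.items()
            if 0 <= state[0] + dr < ROWS and 0 <= state[1] + dc < COLS]

def step(state, action):
    """Deterministic environment: returns (next_state, immediate_reward)."""
    dr, dc = MOVES[action]
    next_state = (state[0] + dr, state[1] + dc)
    return next_state, (100 if next_state == GOAL else 0)

def q_learning(episodes=500, seed=0):
    rng = random.Random(seed)
    states = [(r, c) for r in range(ROWS) for c in range(COLS)]
    # Initialise every table entry Q_hat(s, a) to zero.
    q_hat = {(s, a): 0.0 for s in states for a in legal_actions(s)}
    start_states = [s for s in states if s != GOAL]
    for _ in range(episodes):
        state = rng.choice(start_states)
        while state != GOAL:
            # Explore by picking a random action so all pairs keep being visited.
            action = rng.choice(legal_actions(state))
            next_state, reward = step(state, action)
            best_next = max((q_hat[(next_state, a)] for a in legal_actions(next_state)),
                            default=0.0)
            # Training rule for the deterministic case.
            q_hat[(state, action)] = reward + GAMMA * best_next
            state = next_state
    return q_hat

q_hat = q_learning()
print(q_hat[((0, 0), "right")], q_hat[((1, 0), "up")])
```

Random action selection is the simplest way to keep visiting every state-action pair; a practical agent would balance such exploration against exploiting the values it has already learned.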
An Illustrative Example
To illustrate the operation of the Q learning algorithm, consider a single action taken
by an agent, and the corresponding refinement to Q̂.
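The agent moves one cell to the right in its grid world and receives an
immediate reward of zero for this transition.
The new Q̂ estimate for this transition is the sum of the received reward (zero) and the highest
Q̂ value associated with the resulting state (100), discounted by γ (0.9).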
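As a worked equation, writing s_1 for the state before the move, s_2 for the state after it and a_right for the action (labels introduced here for illustration), the update is:

```latex
\hat{Q}(s_1, a_{\text{right}}) \leftarrow r + \gamma \max_{a'} \hat{Q}(s_2, a')
  = 0 + 0.9 \times 100 = 90
```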
Convergence
Will the Q Learning Algorithm converge toward a Q̂ equal to the true Q function?
Yes, under certain conditions.
1. Assume the system is a deterministic MDP.
2. Assume the immediate reward values are bounded; that is, there exists
some positive constant c such that for all states s and actions a, |r(s, a)| < c.
3. Assume the agent selects actions in such a fashion that it visits every possible
state-action pair infinitely often.
Here are four machine learning trends that could become a reality in the near future:
Algorithms can help companies unearth insights about their business, but this proposition can be
expensive, with no guarantee of a bottom-line increase. Companies often have to collect
data, hire data scientists and train them to deal with changing databases. Now that more data metrics
are becoming available, the cost to store them is dropping thanks to the cloud. There will no longer be
the need to manage infrastructure, as cloud systems can generate new models as the scale of an
operation increases, while also delivering more accurate results. More open-source ML frameworks
are coming into the fold, offering pre-trained platforms that can tag images, recommend products and
perform natural language processing tasks.
One of the tasks that ML can help companies deal with is the manipulation and classification of
large quantities of vectors in high-dimensional spaces. Current algorithms take a large chunk of time
to solve these problems, costing companies more to complete their business processes. Quantum
computers are expected to attract growing interest because they may be able to manipulate
high-dimensional vectors in a fraction of the time. They should be able to process more vectors and
dimensions in a shorter period of time than traditional algorithms.
3) Improved Personalization
Retailers are already making waves in developing recommendation engines that reach their target
audience more accurately. Taking this a step further, ML will be able to make the personalization
techniques of these engines more precise. The technology will offer more specific data that
retailers can then use in ads to improve the shopping experience for consumers.
4) Data on Data
As the amount of data available increases, the cost of storing this data decreases at roughly the same
rate. ML has great potential in generating data of the highest quality that will lead to better models,
an improved user experience and more data that helps repeat and improve upon this cycle. Companies
such as Tesla add a million miles of driving data every hour to enhance their self-driving capabilities.
Its Autopilot feature learns from this data and improves the software that propels these self-driving
vehicles forward as the company gathers more data on the possible pitfalls of autonomous driving
technology.