
Machine Learning

Samatrix Consulting Pvt Ltd


What is Machine Learning?
Machine Learning
• To introduce machine learning, let's start with a simple example.
• Suppose you have been assigned as a data scientist to advise a company on how to improve the sales of a particular product.
• The company has provided you with sales data from 200 different markets.
• The data also contains the advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper.
• The client cannot directly increase the sales of the product.
• But they can adjust the advertising budget for each of the three media.
Machine Learning
• As a data scientist, if you can establish the relationship between advertising expenditure and sales, you can advise the client on how to adjust the budgets so that sales increase.
• So, the objective is to develop a model that can predict sales on the basis of the three media budgets, as sketched below.
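As a first illustration, here is a minimal sketch of fitting a linear model to such data with scikit-learn. The file name advertising.csv and the column names TV, radio, newspaper, and sales are hypothetical, assumed only for this example.

import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical file: one row per market, with the three media budgets
# (in thousands of dollars) and sales (in thousands of units).
data = pd.read_csv("advertising.csv")

X = data[["TV", "radio", "newspaper"]]  # advertising budgets (inputs)
y = data["sales"]                       # sales (output)

# Fit a linear model: sales ~ b0 + b1*TV + b2*radio + b3*newspaper
model = LinearRegression().fit(X, y)

# The fitted coefficients suggest how sales respond to each budget.
print(dict(zip(X.columns, model.coef_)))

# Predict sales for a new allocation of the three budgets.
new = pd.DataFrame({"TV": [100.0], "radio": [25.0], "newspaper": [10.0]})
print(model.predict(new))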
Machine Learning

Fig 1: Sales, in thousands of units, plotted as a function of the TV, radio, and newspaper advertising budgets, in thousands of dollars, across 200 markets.
Input – Output Variables

Independent Vs Dependent Variables

Function Approximation
• Let us consider another dataset, the income dataset.
• The left-hand panel shows the plot of income versus years of education for 30 individuals.
• Using the plot, you may be able to predict income given the years of education.
• But the function that relates income to the years of education is not known.
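In the standard statistical learning formulation (a reconstruction, since the slide's equation did not survive extraction), we assume the data were generated by a fixed but unknown function f plus noise:

$$Y = f(X) + \epsilon$$

where $\epsilon$ is a random error term, independent of $X$, with mean zero. Function approximation means estimating f from observed data; here, income plays the role of Y and years of education the role of X.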
Why Approximate f
• We estimate f for two main reasons: prediction and inference, discussed next.
Prediction

Reducible - Irreducible Errors
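The equations on these slides did not survive extraction; the following is the standard decomposition they describe. If we predict Y with $\hat{Y} = \hat{f}(X)$, then, for fixed $\hat{f}$ and $X$,

$$E\big[(Y - \hat{Y})^2\big] = \underbrace{\big[f(X) - \hat{f}(X)\big]^2}_{\text{reducible}} + \underbrace{\operatorname{Var}(\epsilon)}_{\text{irreducible}}$$

The reducible error can be shrunk by estimating f more accurately; the irreducible error $\operatorname{Var}(\epsilon)$ sets a floor on prediction accuracy that no model can beat.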

Inference

Parametric Approach

Non-parametric Approach

Prediction Accuracy vs Model Interpretability
Trade Off
• On the other hand, generalized additive models (GAMs) extend the
linear model to allow for certain non-linear relationships.
• Hence, GAMs are more flexible than linear regression.
• However, they are less interpretable than linear regression, because
the relationship between each predictor and the response is now
modeled using a curve.
• Finally, fully non-linear methods such as bagging, boosting, and
support vector machines with non-linear kernels are highly flexible
approaches that are harder to interpret.
Trade Off
• Hence, we can state that when inference is the goal, we should use
simple and relatively inflexible machine learning methods.
• However, there might be situations when we are interested only in prediction, not in the interpretability of the predictive model.
• For instance, if we want to build a model to predict the price of a
stock, we would be interested in an algorithm that can predict
accurately whereas the interpretability is not a concern.
• In such cases, we should use the most flexible model available.
Assessing Model Accuracy
No Free Lunch Theorem
• During this course we will introduce a wide range of machine learning models.
• These models are more complex than the standard linear regression approach.
• The question is: why do we need so many different machine learning approaches, rather than a single best method?
• In statistics and machine learning, this is explained by the no free lunch theorem.
• For a given data set, one specific approach may give the best results, but a different approach may work better on a similar but different data set.
• Hence, for each data set we need to explore and decide which approach provides the best results.
• The most challenging part of machine learning is selecting the approach that gives the best results for the data at hand.
Measuring Quality of Fit
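The slide's formula is missing; in the regression setting, the standard measure of fit it refers to is the mean squared error:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\big(y_i - \hat{f}(x_i)\big)^2$$

where $\hat{f}(x_i)$ is the prediction for the i-th observation. The MSE is small when the predictions are close to the true responses and large when they differ substantially.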

Training MSE
• We compute the MSE using the training data that we used to fit the model.
• Hence, we call it the training MSE. In practice, however, we are not really concerned with the model's performance on the training data.
• Rather, we are interested in the accuracy of the predictions we get on previously unseen test data.
• Why are we interested in unseen test data rather than training data?
• Suppose our goal is to develop a machine learning model to predict the stock price based on historical stock returns.
• We can use the last six months of stock return data to train our model.
• We would not be interested in how well the model predicts the stock price for a past date.
• Rather, we would be interested in how well the model can predict the stock price the next day or the next month.
Training MSE
• Similarly, suppose we have clinical data that includes weight, blood pressure, height, age, and family history of disease for a number of patients.
• We also have information about whether each patient has diabetes.
• This data can be used to train a machine learning model to predict the risk of diabetes based on clinical observations.
• In practice, we are interested in accurately predicting diabetes risk for future patients based on their clinical observations.
• We do not want to know how accurately the model predicts diabetes risk for the patients used to train the model.
• We already know which of those patients have diabetes.
Test MSE
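The slide's formula is missing; the standard quantity it defines is the average squared prediction error over previously unseen test observations $(x_0, y_0)$:

$$\mathrm{Ave}\big(y_0 - \hat{f}(x_0)\big)^2$$

We want to choose the method that makes this test MSE, not the training MSE, as small as possible.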

Model Selection
• How do we select a model that minimizes the test MSE?
• In certain situations, the test data set might be available.
• In other words, we have a set of observations that we did not use to
train the machine learning method.
• In this case, we can evaluate the test observations and select the
model with the smallest test MSE.
Model Selection
• On the other hand, in certain situations, the test observations are not
available.
• In such situations, we can select the model with the smallest training
MSE.
• Even though the training MSE and test MSE appear to be closely
related, there is no guarantee that the model with the lowest training
MSE will also have the lowest test MSE.
• For many machine learning methods, the training set MSE can be
quite small, but the test MSE is often much larger.
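To make this concrete, here is a minimal sketch, with synthetic data invented for the example, that fits polynomials of increasing flexibility and compares training MSE with test MSE; the training MSE keeps falling as flexibility grows, while the test MSE eventually worsens:

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a smooth true function plus noise.
x = rng.uniform(0, 10, 60)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# Hold out a test set that the fitting step never sees.
x_train, x_test = x[:40], x[40:]
y_train, y_test = y[:40], y[40:]

for degree in (1, 3, 10):
    # Flexibility grows with the polynomial degree.
    coefs = np.polyfit(x_train, y_train, degree)
    mse_train = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    mse_test = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {mse_train:.3f}, test MSE {mse_test:.3f}")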
Model Selection
• The test MSE is shown by the red curve in the right-hand panel.
• The test MSE, along with the training MSE, initially declines as the level of flexibility increases.
• At a certain point the test MSE levels off, and then it starts to increase again.
Model Selection
• When we overfit the training data, the test MSE will be very large
because the supposed patterns that the method found in the training
data simply don’t exist in the test data.
• Note that regardless of whether or not overfitting has occurred, we
almost always expect the training MSE to be smaller than the test
MSE because most machine learning methods either directly or
indirectly seek to minimize the training MSE.
• Overfitting refers specifically to the case in which a less flexible model
would have yielded a smaller test MSE.
Bias – Variance Trade Off
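The central equation of this section did not survive extraction; the standard decomposition it refers to is: for a test point $x_0$, the expected test MSE can be decomposed as

$$E\Big[\big(y_0 - \hat{f}(x_0)\big)^2\Big] = \operatorname{Var}\big(\hat{f}(x_0)\big) + \big[\operatorname{Bias}\big(\hat{f}(x_0)\big)\big]^2 + \operatorname{Var}(\epsilon)$$

so a method must achieve both low variance and low squared bias to attain a low expected test error; $\operatorname{Var}(\epsilon)$ is the irreducible error.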

Meaning of Bias – Variance Trade Off
• We can generalize the concept. As the model becomes more flexible, the
variance increases and the bias decreases.
• By analyzing the relative rate of change of these two quantities, we can
determine whether the test MSE will increase or decrease.
• As the flexibility of the model increases, the bias tends to initially decrease
faster than the variance increases.
• As a result, the expected test MSE decreases.
• After some point an increase in flexibility has little impact on the bias but it
starts to significantly increase the variance.
• Due to this, the test MSE increases.
• You can note this pattern of decreasing test MSE followed by increasing test
MSE in the right-hand panels of Figures 9–11.
Meaning of Bias – Variance Trade Off

Fig 12: The three plots illustrate the relationship between bias and variance for the examples in Figures 9–11.
Meaning of Bias – Variance Trade Off
• The relationship between bias, variance, and test set MSE in Figure 12
is referred to as the bias-variance trade-off.
• Good test set performance of a machine learning method requires
low variance as well as low squared bias.
• This is referred to as a trade-off because it is easy to obtain a method
with extremely low bias but high variance (for instance, by drawing a
curve that passes through every single training observation) or a
method with very low variance but high bias (by fitting a horizontal
line to the data).
• The challenge lies in finding a method for which both the variance
and the squared bias are low.
Regression vs Classification
• Problems with a quantitative (numerical) response are referred to as regression problems, while problems with a qualitative (categorical) response are referred to as classification problems.
Approaches for Prediction

Linear Model

Linear Model - Classification

Model Accuracy - Classification

Bayes Classifier
• We can determine the Bayes classifier's prediction using the Bayes decision boundary.
• An observation that falls on the orange side of the boundary is assigned to the orange class, whereas an observation on the blue side of the boundary is assigned to the blue class.
• The Bayes classifier gives the lowest possible test error rate, known as the Bayes error rate.
• The Bayes error rate is analogous to the irreducible error.
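As a reconstruction of the missing formula: the Bayes classifier assigns a test observation $x_0$ to the class $j$ for which

$$\Pr(Y = j \mid X = x_0)$$

is largest, and the overall Bayes error rate is $1 - E\big[\max_j \Pr(Y = j \mid X)\big]$, the expectation being taken over X.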
K Nearest Neighbour
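The equations on these slides are missing; the standard KNN estimate they describe is: given a positive integer K and a test observation $x_0$, let $\mathcal{N}_0$ be the K training points closest to $x_0$ and estimate

$$\Pr(Y = j \mid X = x_0) = \frac{1}{K}\sum_{i \in \mathcal{N}_0} I(y_i = j)$$

then classify $x_0$ to the class with the highest estimated probability. A small K gives a very flexible classifier (low bias, high variance); a large K gives a smoother, less flexible one.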
• Hence, for both regression and classification models, choosing the correct level of flexibility is critical to success.
• The bias-variance tradeoff, and the resulting U-shape in the test error,
can make this a challenging task.
Thanks
Samatrix Consulting Pvt Ltd
• KNN follows the idea that “birds of a feather flock together”: similar things are near each other.
• K-NN is one of the simplest machine learning algorithms, based on the supervised learning technique.
• The K-NN algorithm can be used for regression as well as classification, but it is mostly used for classification problems.
• K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.
• It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and, at the time of classification, performs an action on the dataset.
• At the training phase the KNN algorithm just stores the dataset, and when it gets new data, it classifies that data into the category most similar to the new data.
Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification we can use the KNN algorithm, since it works on a similarity measure. The KNN model will compare the features of the new image with those of the cat and dog images and, based on the most similar features, put it in either the cat or the dog category. A minimal sketch follows.
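Here is a minimal sketch of KNN classification with scikit-learn; the feature values are invented for illustration (in practice, image features would come from a feature extractor):

from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data: two numeric features per animal
# (e.g. ear length and snout length, invented for illustration).
X_train = [[3.0, 2.0], [2.8, 2.2], [3.2, 1.9],   # cats
           [6.0, 7.5], [5.5, 8.0], [6.2, 7.0]]   # dogs
y_train = ["cat", "cat", "cat", "dog", "dog", "dog"]

# K = 3: classify by majority vote among the 3 nearest neighbours.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# A new creature with cat-like measurements.
print(knn.predict([[3.1, 2.1]]))  # expected: ['cat']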
Advantages of KNN
1. No training period: KNN is called a lazy learner (instance-based learning). It does not learn anything in the training period and does not derive any discriminative function from the training data. In other words, there is no training period: it stores the training dataset and learns from it only at the time of making real-time predictions. This makes KNN much faster to train than algorithms that require an explicit training phase, e.g. SVM or linear regression.
2. Since the KNN algorithm requires no training before making predictions, new data can be added seamlessly without affecting the accuracy of the algorithm.
3. KNN is very easy to implement. Only two parameters are required: the value of K and the distance function (e.g. Euclidean or Manhattan).
Disadvantages of KNN
1. Does not work well with large datasets: in large datasets, the cost of calculating the distance between the new point and each existing point is huge, which degrades the performance of the algorithm.
2. Does not work well with high dimensions: with a large number of dimensions, it becomes difficult for the algorithm to compute meaningful distances in each dimension.
3. Needs feature scaling: we need to do feature scaling (standardization or normalization) before applying KNN to any dataset; otherwise KNN may generate wrong predictions (see the sketch after this list).
4. Sensitive to noisy data, missing values, and outliers: KNN is sensitive to noise in the dataset, so we need to impute missing values and remove outliers manually.
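A minimal sketch of the feature-scaling point, using a scikit-learn pipeline so the scaler is fitted on the training data only (feature names and values are invented for illustration):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Features on very different scales, e.g. income (tens of thousands)
# and age (tens); without scaling, income dominates the distance.
X_train = [[52000, 23], [61000, 45], [35000, 31], [78000, 52]]
y_train = [0, 1, 0, 1]

# StandardScaler standardizes each feature before distances are computed.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X_train, y_train)

print(model.predict([[58000, 40]]))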
Thanks
Samatrix Consulting Pvt Ltd
