1 Machine Learning
Fig 1: The plots display sales (in thousands of units) as a function of the TV, Radio, and Newspaper advertising budgets (in thousands of dollars) across 200 markets.
Input – Output Variables
• In the advertising data, the TV, Radio, and Newspaper budgets are the input variables, typically denoted X; sales is the output variable, denoted Y.
Independent vs Dependent Variables
• Input variables are also called predictors, features, or independent variables; the output variable is also called the response or dependent variable.
Function Approximation
• Let us consider another dataset, the income dataset.
• The left-hand panel shows the plot of income versus years of education for 30 individuals.
• Using the plot, you may be able to predict income given the years of education.
• But the function that relates income to years of education is not known.
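As a rough illustration of estimating such an unknown function, the sketch below fits a straight line to hypothetical (education, income) pairs; all values are invented, since the actual income dataset is not reproduced here.

import numpy as np

# Hypothetical (years of education, income in $1000s) pairs -- invented values,
# standing in for the 30 individuals in the income dataset.
years = np.array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19], dtype=float)
income = np.array([22, 24, 27, 31, 36, 42, 50, 58, 65, 73], dtype=float)

# One simple estimate f_hat of the unknown f: a least-squares straight line.
coef = np.polyfit(years, income, deg=1)
f_hat = np.poly1d(coef)

print(f_hat(16.0))   # predicted income for 16 years of education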
Function Approximation
• We assume there is some fixed but unknown function f such that Y = f(X) + ε, where ε is a random error term that is independent of X and has mean zero.
• Machine learning methods aim to estimate f; we denote the estimate by f̂ and the resulting prediction by Ŷ = f̂(X).
Why Approximate f
• There are two main reasons why we may wish to estimate f: prediction and inference.
Prediction
• In prediction, we use the estimate f̂ to predict the response: Ŷ = f̂(X). Here f̂ is often treated as a black box; we care about the accuracy of Ŷ, not about the exact form of f̂.
Reducible – Irreducible Errors
• The accuracy of Ŷ as a prediction for Y depends on two quantities: the reducible error and the irreducible error.
• f̂ will not be a perfect estimate of f; the resulting error is reducible, because we can potentially improve the accuracy of f̂ by using a better machine learning method.
• Even if we could estimate f perfectly, the prediction would still contain error, because Y also depends on the noise ε, which cannot be predicted from X. This is the irreducible error.
• Formally, for a given f̂ and X: E[(Y − Ŷ)²] = [f(X) − f̂(X)]² + Var(ε), where the first term is reducible and the second is irreducible.
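A minimal simulation (with a made-up f and noise level) shows why the irreducible error sets a floor on prediction accuracy: even the perfect estimate f̂ = f cannot do better than Var(ε).

import numpy as np

rng = np.random.default_rng(0)

def f(x):                                   # made-up "true" function
    return 3.0 + 2.0 * x

x = rng.uniform(0, 10, size=100_000)
eps = rng.normal(0.0, 2.0, size=x.size)     # Var(eps) = 4
y = f(x) + eps

# Even the perfect estimate f_hat = f cannot beat the irreducible error.
mse_perfect = np.mean((y - f(x)) ** 2)
print(mse_perfect)                          # close to Var(eps) = 4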
Inference
• In inference, we want to understand how Y changes as a function of X₁, ..., Xₚ, so f̂ cannot be treated as a black box.
• Typical questions: Which predictors are associated with the response? What is the relationship between the response and each predictor? Can the relationship be adequately summarized by a linear model, or is it more complicated?
• For example, in the advertising data we may ask which media contribute to sales, and how large an increase in sales is associated with a given increase in TV advertising.
Parametric Approach
• Parametric methods reduce the problem of estimating f to the problem of estimating a set of parameters. They involve a two-step, model-based approach.
• Step 1: Make an assumption about the functional form of f. For example, assume f is linear in X: f(X) = β₀ + β₁X₁ + β₂X₂ + ... + βₚXₚ.
• Step 2: Use the training data to fit (train) the model, i.e., estimate the parameters β₀, β₁, ..., βₚ. The most common fitting approach for the linear model is ordinary least squares.
• The main disadvantage is that the chosen form will usually not match the true unknown f; if the model is too far from the true f, our estimate will be poor.
• We can choose more flexible parametric models, but these require estimating more parameters and can lead to overfitting: following the errors, or noise, too closely. A sketch of the two-step approach follows this list.
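The sketch below illustrates the two-step parametric approach with a hypothetical linear data-generating process; all values and parameter settings are invented for illustration.

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical training data: two predictor budgets and a response (invented).
X = rng.uniform(0, 100, size=(200, 2))            # 200 markets, 2 predictors
beta_true = np.array([5.0, 0.05, 0.10])           # made-up "true" parameters
y = beta_true[0] + X @ beta_true[1:] + rng.normal(0, 1, size=200)

# Step 1 assumed f(X) = beta_0 + beta_1*X_1 + beta_2*X_2.
# Step 2: estimate the parameters by ordinary least squares.
X_design = np.column_stack([np.ones(len(X)), X])  # add intercept column
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(beta_hat)                                   # should be close to beta_true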
Non-parametric Approach
• Non-parametric methods do not make explicit assumptions about the functional form of f.
• Instead, they seek an estimate of f that gets as close to the data points as possible without being too rough or too wiggly.
• Advantage: by avoiding an assumed functional form, they can fit a much wider range of possible shapes for f.
• Disadvantage: since the problem is not reduced to estimating a small number of parameters, a very large number of observations is needed to obtain an accurate estimate of f.
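One classic non-parametric method is k-nearest-neighbours regression; the sketch below (on invented data) estimates f at a point by averaging the responses of nearby training points, assuming no functional form for f.

import numpy as np

rng = np.random.default_rng(2)

# Hypothetical one-dimensional training data (values invented for illustration).
x_train = rng.uniform(0, 10, size=500)
y_train = np.sin(x_train) + rng.normal(0, 0.3, size=500)

def knn_predict(x0, k=10):
    """k-nearest-neighbours regression: average the responses of the k
    training points closest to x0 -- no functional form assumed for f."""
    nearest = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[nearest].mean()

print(knn_predict(2.5))   # close to sin(2.5) ~ 0.60 given enough data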
Prediction Accuracy vs Model Interpretability
Trade Off
• Of the many machine learning methods available, some are less flexible (more restrictive) and some are more flexible.
• In general, as the flexibility of a method increases, its interpretability decreases.
• For example, linear regression is a relatively inflexible approach, but it is highly interpretable: the effect of each predictor on the response is summarized by a single coefficient.
Trade Off
• On the other hand, generalized additive models (GAMs) extend the
linear model to allow for certain non-linear relationships.
• Hence, GAMs are more flexible than linear regression.
• However, they are less interpretable than linear regression, because
the relationship between each predictor and the response is now
modeled using a curve.
• Finally, fully non-linear methods such as bagging, boosting, and
support vector machines with non-linear kernels are highly flexible
approaches that are harder to interpret.
Trade Off
• Hence, we can state that when inference is the goal, we should use
simple and relatively inflexible machine learning methods.
• However, there might be situations when we are only interested in prediction, not in the interpretability of the predictive model.
• For instance, if we want to build a model to predict the price of a stock, we would be interested in an algorithm that predicts accurately; interpretability is not a concern.
• In such cases, we should use the most flexible model available.
Assessing Model Accuracy
No Free Lunch Theorem
• During this course we will introduce a wide range of machine learning models.
• These models are more complex than the standard linear regression approach.
• The question is: why do we need so many different machine learning approaches, rather than a single best method?
• In statistics and machine learning, the no free lunch theorem applies: no one method dominates all others over all possible data sets.
• For a given data set, one specific approach may give the best results, but another approach may give better results on a similar but different data set.
• Hence, for each data set we need to explore and decide which approach provides the best results.
• Selecting the approach that performs best on a given data set is one of the most challenging parts of machine learning.
Measuring Quality of Fit
• To evaluate the performance of a machine learning method, we need to measure how well its predictions match the observed data.
• In regression, the most commonly used measure is the mean squared error (MSE): MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ − f̂(xᵢ))², which is small when the predicted responses are close to the true responses.
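In code, the MSE is a one-liner; a minimal sketch:

import numpy as np

def mse(y, y_hat):
    """Mean squared error between observed responses y and predictions y_hat."""
    return np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2)

print(mse([3.0, 5.0], [2.5, 5.5]))   # (0.25 + 0.25) / 2 = 0.25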
Training MSE
• We compute the MSE using the training data that we used to fit the model; hence, we call it the training MSE.
• However, in practice we are not concerned with the performance of the model on the training data.
• Rather, we are interested in the accuracy of the predictions we get on previously unseen test data.
• Why are we interested in unseen test data rather than training data?
• Suppose our goal is to develop a machine learning model to predict the stock price based on historical stock returns.
• We can use the last 6 months of stock return data to train our model.
• We would not be interested in how well the model predicts the stock price for a past date.
• Rather, we would be interested in how well the model can predict the stock price the next day or the next month.
Training MSE
• Similarly, suppose we have clinical data that includes weight, blood pressure, height, age, and family history of disease for a number of patients.
• We also have information about whether each patient has diabetes.
• This data can be used to train a machine learning model to predict the risk of diabetes based on clinical observations.
• In practice, we are interested in accurately predicting diabetes risk for future patients based on their clinical observations.
• We do not care how accurately the model predicts diabetes risk for the patients used to train the model.
• We already know which of those patients have diabetes.
Test MSE
• Suppose (x₀, y₀) is a previously unseen test observation that was not used to train the model.
• The test MSE is the average squared prediction error over such test observations: Ave(y₀ − f̂(x₀))².
• We want to choose the method that gives the lowest test MSE, not the lowest training MSE.
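The sketch below makes the training/test distinction concrete: it fits polynomials of increasing flexibility to hypothetical data and reports both MSEs. The data-generating function, noise level, and sample sizes are invented for illustration.

import numpy as np

rng = np.random.default_rng(3)

def f(x):                                  # made-up "true" function
    return np.sin(1.5 * x)

def sample(n):
    x = rng.uniform(0, 5, size=n)
    return x, f(x) + rng.normal(0, 0.3, size=n)

x_train, y_train = sample(50)
x_test, y_test = sample(1000)              # previously unseen observations

for degree in [1, 3, 10, 20]:              # increasing flexibility
    f_hat = np.poly1d(np.polyfit(x_train, y_train, deg=degree))
    train_mse = np.mean((y_train - f_hat(x_train)) ** 2)
    test_mse = np.mean((y_test - f_hat(x_test)) ** 2)
    print(degree, round(train_mse, 3), round(test_mse, 3))

# Training MSE keeps falling as the degree grows, while test MSE
# falls at first and then rises again once the model overfits.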
Model Selection
• How do we select a model that minimizes the test MSE?
• In certain situations, a test data set might be available: a set of observations that we did not use to train the machine learning method.
• In this case, we can evaluate the models on the test observations and select the model with the smallest test MSE.
Model Selection
• On the other hand, in certain situations, the test observations are not
available.
• In such situations, we can select the model with the smallest training
MSE.
• Even though the training MSE and test MSE appear to be closely
related, there is no guarantee that the model with the lowest training
MSE will also have the lowest test MSE.
• For many machine learning methods, the training set MSE can be
quite small, but the test MSE is often much larger.
Model Selection
• [Figure: fits of increasing flexibility to the data (left-hand panel) and the corresponding training and test MSE curves as a function of flexibility (right-hand panel).]
Model Selection
• The test MSE is shown as the red curve in the right-hand panel.
• Both the training MSE and the test MSE initially decline as the level of flexibility increases.
• At a certain point, however, the test MSE levels off and then starts to increase again.
Model Selection
• When we overfit the training data, the test MSE will be very large
because the supposed patterns that the method found in the training
data simply don’t exist in the test data.
• Note that regardless of whether or not overfitting has occurred, we
almost always expect the training MSE to be smaller than the test
MSE because most machine learning methods either directly or
indirectly seek to minimize the training MSE.
• Overfitting refers specifically to the case in which a less flexible model
would have yielded a smaller test MSE.
Bias – Variance Trade Off
• The U-shape observed in the test MSE curves is the result of two competing properties of machine learning methods: bias and variance.
• The expected test MSE at a point x₀ can be decomposed into three fundamental quantities: E[(y₀ − f̂(x₀))²] = Var(f̂(x₀)) + [Bias(f̂(x₀))]² + Var(ε).
• To minimize the expected test error, we need a method that simultaneously achieves low variance and low bias; the test MSE can never fall below Var(ε), the irreducible error.
Meaning of Bias – Variance Trade Off
• Variance refers to the amount by which f̂ would change if we estimated it using a different training data set. Ideally f̂ should not vary too much between training sets; a method with high variance can change substantially in response to small changes in the data. More flexible methods generally have higher variance.
• Bias refers to the error introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model. For example, linear regression assumes a linear relationship, which rarely holds exactly in practice. More flexible methods generally have lower bias.
Meaning of Bias – Variance Trade Off
• We can generalize the concept. As the model becomes more flexible, the
variance increases and the bias decreases.
• By analyzing the relative rate of change of these two quantities, we can
determine whether the test MSE will increase or decrease.
• As the flexibility of the model increases, the bias tends to initially decrease
faster than the variance increases.
• As a result, the expected test MSE decreases.
• After some point an increase in flexibility has little impact on the bias but it
starts to significantly increase the variance.
• Due to this, the test MSE increases.
• You can note this pattern of decreasing test MSE followed by increasing test
MSE in the right-hand panels of Figures 9–11.
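The decomposition can be checked numerically. The sketch below (with an invented true function and noise level) refits polynomials of each degree to many simulated training sets, estimates the squared bias and variance of f̂(x₀) at a fixed test point, and confirms that their sum plus Var(ε) traces the U-shaped expected test MSE.

import numpy as np

rng = np.random.default_rng(4)

def f(x):                                   # made-up "true" function
    return np.sin(1.5 * x)

x0 = 2.0                                    # fixed test point
x_grid = np.linspace(0, 5, 50)              # fixed training inputs
n_sims, sigma = 500, 0.3

for degree in [1, 3, 10]:
    preds = np.empty(n_sims)
    for s in range(n_sims):
        y = f(x_grid) + rng.normal(0, sigma, size=x_grid.size)
        f_hat = np.poly1d(np.polyfit(x_grid, y, deg=degree))
        preds[s] = f_hat(x0)                # prediction at x0 from this training set
    bias_sq = (preds.mean() - f(x0)) ** 2
    variance = preds.var()
    expected_test_mse = bias_sq + variance + sigma ** 2
    print(degree, round(bias_sq, 4), round(variance, 4), round(expected_test_mse, 4))

# As the degree (flexibility) grows, squared bias falls and variance rises;
# their sum plus Var(eps) tracks the expected test MSE at x0.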