Theory in Machine Learning
Predicting y from x ∈ ℝ
[Figure: three fits to the same data. Left: linear model y = w0 + w1x; the data points do not lie on the straight line. Middle: adding an extra feature x² gives y = w0 + w1x + w2x², a slightly better fit. Right: more features fit the data even better.]
• On the other hand, a complicated, low-bias model likely fits the training data very well, so its predictions vary wildly even when the predictor values change slightly.
• This means the model has high variance, and it will not generalize well to new/unseen data.
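This behavior can be sketched numerically (a minimal illustration on a toy sine-plus-noise dataset; the data and degrees are assumptions, not from the slides): refitting a flexible degree-9 polynomial on fresh noisy samples makes its prediction at a fixed point swing far more than a linear fit's.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_predict(degree, x0=0.95):
    # Draw a fresh noisy training sample and fit a polynomial of the given degree.
    x = rng.uniform(0, 1, 10)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=10)
    coeffs = np.polyfit(x, t, degree)
    return np.polyval(coeffs, x0)  # prediction at a fixed test point

# Repeat the experiment many times: the spread of predictions measures variance.
linear = [fit_predict(1) for _ in range(200)]
flexible = [fit_predict(9) for _ in range(200)]

# The flexible (low-bias) model's predictions vary far more across samples.
spread_linear, spread_flexible = np.std(linear), np.std(flexible)
```

The spread of the degree-9 predictions dwarfs that of the linear model, which is exactly the high-variance behavior described above.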
Visualizing Bias
[Figure sequence: a true concept generates training data; fitting a linear model leaves systematic mistakes, the bias.]
Visualizing Variance
[Figure sequence: refitting the linear model on different training samples makes the mistakes vary. Variance: sensitivity to changes & noise.]
Bias and Variance: More Powerful Model
[Figure: powerful models can represent complex concepts; the powerful model predicts + and − correctly and makes no mistakes on this training data.]
Overfitting vs Underfitting
Overfitting:
• Fitting the data too well
• Features are noisy / uncorrelated to the concept
• Modeling process very sensitive (powerful)
• Too much search
Underfitting:
• Learning too little of the true concept
• Features don’t capture the concept
• Too much bias in the model
• Too little search to fit the model
[Figure: Overfitting]
Avoid overfitting
3. Regularization
Underfitting
• A model is said to underfit when it cannot capture the underlying trend of the training data, which hurts its generalization ability.
• Common causes: having too little data but many features, or trying to fit a linear model to non-linear data.
• Train the model on the training dataset and then compute the loss on a validation dataset.
• Each data point is held out in turn and used to test a model trained on the other N−1 observations.
• Average squared validation loss: LCV = (1/N) ∑ₙ (tₙ − wᵀxₙ)²
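The leave-one-out procedure above can be sketched as follows (a minimal illustration on synthetic data; the dataset and true weights are assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 20
# Synthetic linear data: a bias column plus one feature.
X = np.column_stack([np.ones(N), rng.uniform(0, 1, N)])
t = 3.0 + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=N)

losses = []
for n in range(N):
    # Hold out observation n and train on the other N-1 observations.
    mask = np.arange(N) != n
    w, *_ = np.linalg.lstsq(X[mask], t[mask], rcond=None)
    # Squared validation loss on the held-out point: (t_n - w^T x_n)^2
    losses.append((t[n] - w @ X[n]) ** 2)

L_CV = np.mean(losses)  # average squared validation loss
```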
The total error of the model is composed of three terms: the (bias)²,
the variance, and an irreducible error term.
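In symbols, for a true function f, a learned predictor ŷ, and noise variance σ², this standard decomposition reads:

```latex
\mathbb{E}\big[(t - \hat{y}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{y}(x)] - f(x)\big)^2}_{(\text{bias})^2}
  + \underbrace{\mathbb{E}\big[(\hat{y}(x) - \mathbb{E}[\hat{y}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```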
Bias vs. Variance
• Weights for some predictors have little predictive power, resulting in a high-variance, low-bias model.
[Figure: learned weight values]
The model assigns non-zero weights to the noise features, despite none of them having any predictive power. The noise features take values similar to some of the real features in the dataset.
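Regularization, mentioned above as a remedy for overfitting, addresses exactly this: a sketch using the closed-form ridge solution on synthetic data with one real feature and three pure-noise features (all values here are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
X_real = rng.normal(size=(N, 1))    # one feature carrying real signal
X_noise = rng.normal(size=(N, 3))   # three features of pure noise
X = np.hstack([X_real, X_noise])
t = 2.0 * X_real[:, 0] + rng.normal(scale=0.5, size=N)

def ridge_weights(X, t, lam):
    # Closed-form ridge regression: w = (X^T X + lam*I)^{-1} X^T t
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ t)

w_ols = ridge_weights(X, t, 0.0)     # plain least squares (lam = 0)
w_ridge = ridge_weights(X, t, 10.0)  # regularized solution
# The penalty shrinks the weights, pulling the noise-feature weights toward zero.
```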
Example
• Hyundai: this column indicates the money spent on advertising Hyundai cars in the given market.
• Sales: this column indicates the sales of cars in the given market (value of the sales in thousands).
Example
• Let’s visualize the relationship between the feature and the sales response using scatterplots. To calculate the coefficients, we will use the least-squares criterion: we find the line that minimizes the sum of squared errors.
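A minimal sketch of that least-squares computation (the spend/sales numbers here are made up for illustration; they are not the slides’ dataset):

```python
import numpy as np

# Hypothetical advertising spend x and sales y, both in thousands.
x = np.array([23.0, 26.0, 30.0, 34.0, 43.0, 48.0, 52.0, 57.0])
y = np.array([651.0, 762.0, 856.0, 1063.0, 1190.0, 1298.0, 1421.0, 1440.0])

# Least-squares estimates for the line y = A + B x:
B = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
A = y.mean() - B * x.mean()
```

The slope B is the covariance of x and y divided by the variance of x, and the intercept A forces the line through the point of means.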
Example
• From the previous step we got the values of A and B. We will use the model to predict the future sales of Hyundai cars.
• Let’s say in a new market Hyundai is spending 50 thousand dollars on advertising. That means the new value of X will be 50.
• Now use Y = A + BX to predict the new value.
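Putting the prediction step together (self-contained, with the same made-up numbers as before; `np.polyfit` is used here as a convenient stand-in for the manual least-squares formulas):

```python
import numpy as np

# Hypothetical ad spend x and sales y, both in thousands (illustration only).
x = np.array([23.0, 26.0, 30.0, 34.0, 43.0, 48.0, 52.0, 57.0])
y = np.array([651.0, 762.0, 856.0, 1063.0, 1190.0, 1298.0, 1421.0, 1440.0])

B, A = np.polyfit(x, y, 1)  # slope B and intercept A of the least-squares line

# Predict sales for a new market spending 50 thousand dollars on advertising.
y_pred = A + B * 50.0
```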
Example
• Let’s plot the observed data and the least-squares line using the predicted values and the new x value.
Limitations of Regression Analysis:
Loss =