Curve Fitting
Nazar Khan
Real-world Data
Real-world data has 2 important properties:
1. an underlying regularity,
2. corruption by random noise.
We fit such data with a polynomial model

y(x, \mathbf{w}) = \sum_{j=0}^{M} w_j x^j = w_0 + w_1 x + \cdots + w_M x^M

where M is the order of the polynomial.
I Function y(x, w) is a
I non-linear function of the input x, but
I a linear function of the parameters w.
I So our model y(x, w) is a linear model (see the sketch below).
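As an illustration, a minimal sketch (assuming NumPy; the helper name poly_eval is ours) of why the model is linear in w: y(x, w) is a plain dot product between the fixed feature vector (1, x, ..., x^M) and w.

```python
import numpy as np

def poly_eval(x, w):
    """Evaluate y(x, w) = w_0 + w_1 x + ... + w_M x^M.

    The features (1, x, ..., x^M) are non-linear in x, but y is a
    dot product with w, hence linear in the parameters.
    """
    features = x ** np.arange(len(w))   # (1, x, x^2, ..., x^M)
    return features @ w

# Example: w = (1, -2, 0.5) gives y(x) = 1 - 2x + 0.5 x^2
print(poly_eval(3.0, np.array([1.0, -2.0, 0.5])))   # -> -0.5
```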
We determine the parameters w by minimizing the sum-of-squares error

E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{y(x_n, \mathbf{w}) - t_n\}^2
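A minimal sketch (assuming NumPy) of minimizing E(w) by ordinary least squares; the synthetic sin-plus-noise data is illustrative only.

```python
import numpy as np

# Synthetic data: an underlying regularity (sin curve) corrupted by noise.
rng = np.random.default_rng(0)
N = 10
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)

M = 3                                            # polynomial order
Phi = np.vander(x, M + 1, increasing=True)       # design matrix, Phi[n, j] = x_n ** j

# Least squares minimizes sum_n (Phi w - t)_n^2, i.e. E(w) up to the factor 1/2.
w_star, *_ = np.linalg.lstsq(Phi, t, rcond=None)

E_train = 0.5 * np.sum((Phi @ w_star - t) ** 2)  # E(w*) on the training set
print(w_star, E_train)
```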
Over-fitting
I Lower order polynomials can't capture the variation in data.
I Higher order leads to over-fitting.
I Fitted polynomial passes exactly through each data point.
I But it oscillates wildly in-between.
I Gives a very poor representation of the real underlying function.
I Over-fitting is bad because it gives poor generalization.
Over-fitting
I To check the generalization performance of a certain w∗, compute E(w∗) on a new test set.
I Alternative performance measure: root-mean-square error (RMS)
E_{\mathrm{RMS}} = \sqrt{\frac{2 E(\mathbf{w}^*)}{N}}
I The mean ensures datasets of different sizes are treated equally. (How?)
I The square root brings the squared error back to the scale of the target
variable t. A small computation is sketched below.
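A small sketch (assuming NumPy) of computing E_RMS; w_star, x_train, t_train, x_test, t_test are placeholders for a fitted weight vector and for training and held-out test sets.

```python
import numpy as np

def rms_error(w, x, t):
    """E_RMS = sqrt(2 E(w) / N): the mean removes the dependence on N,
    the square root returns to the scale of the targets t."""
    Phi = np.vander(x, len(w), increasing=True)   # design matrix for this order
    E = 0.5 * np.sum((Phi @ w - t) ** 2)          # sum-of-squares error E(w)
    return np.sqrt(2.0 * E / len(x))

# Typical usage: compare training and test error for a fitted w_star, e.g.
# rms_error(w_star, x_train, t_train) vs. rms_error(w_star, x_test, t_test)
```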
Root-mean-square error on training and test sets for various polynomial orders M.
Paradox?
I A polynomial of order M contains all polynomials of lower order.
I So a higher order should always be better than a lower order.
I But it's not. Why?
I Because a higher-order polynomial starts fitting the noise instead of the
underlying function.
Over-fitting
I Large M =⇒ more flexibility =⇒ more tuning to noise.
I But, if we have more data, then over-fitting is reduced.
I Over-fitting can be discouraged by adding a penalty term to the error function:

\tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{y(x_n, \mathbf{w}) - t_n\}^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2

where \|\mathbf{w}\|^2 = \mathbf{w}^T \mathbf{w} = w_0^2 + w_1^2 + \cdots + w_M^2 and \lambda controls the relative
importance of the regularizer compared to the error term.
I This technique is called regularization; it is also known as shrinkage or weight-decay.
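A minimal sketch (assuming NumPy) of minimizing the penalized error; setting the gradient of Ẽ(w) to zero gives the standard ridge-regression solution w = (ΦᵀΦ + λI)⁻¹ Φᵀ t. The data and λ values are illustrative only.

```python
import numpy as np

def fit_regularized(x, t, M, lam):
    """Minimize (1/2)||Phi w - t||^2 + (lam/2)||w||^2.

    Setting the gradient to zero gives (Phi^T Phi + lam I) w = Phi^T t.
    """
    Phi = np.vander(x, M + 1, increasing=True)    # Phi[n, j] = x_n ** j
    A = Phi.T @ Phi + lam * np.eye(M + 1)
    return np.linalg.solve(A, Phi.T @ t)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=10)

print(fit_regularized(x, t, M=9, lam=0.0))    # unregularized: large, oscillatory weights
print(fit_regularized(x, t, M=9, lam=1e-3))   # penalized: weights shrink, fit is smoother
```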
Effect of regularization