Lab 7 - Bias and Variance
This tutorial simulates the bias-variance decomposition and trade-off in R.
Suppose the true regression function that generates a set of observations is defined by f(x) = x². The generated data contain noise due to external factors. Assuming that the noise follows a normal distribution with μ = 0 and σ = 0.1, the generated data are defined by

y = x² + ε

where ε ~ N(μ = 0, σ = 0.1).
We would like to train predictive models f̂ to estimate the true regression function, f. Specifically, we build five predictive models as follows:

f̂₀(x) = β₀
f̂₁(x) = β₀ + β₁x
f̂₂(x) = β₀ + β₁x + β₂x²
f̂₆(x) = β₀ + β₁x + … + β₆x⁶
f̂₁₀(x) = β₀ + β₁x + … + β₁₀x¹⁰
f̂₀ is a horizontal line, f̂₁ is a linear regression model, and f̂₂, f̂₆ and f̂₁₀ are polynomial regression models of degree 2, 6 and 10, respectively. The βᵢ are the parameters that define the regression models. Let's generate the data and train the models as follows.
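A minimal sketch of this step (the helper get_data, the uniform sampling of the inputs on [0, 1] and the training size n = 100 are illustrative assumptions, not part of the original listing):

> f <- function(x) x ^ 2                              # true regression function
> get_data <- function(n = 100, sigma = 0.1) {
>   x <- runif(n, min = 0, max = 1)                   # assumed: uniform inputs on [0, 1]
>   data.frame(x = x, y = f(x) + rnorm(n, mean = 0, sd = sigma))
> }
> sim_data <- get_data()
> fit_0  <- lm(y ~ 1, data = sim_data)                # horizontal line
> fit_1  <- lm(y ~ poly(x, 1),  data = sim_data)      # linear regression
> fit_2  <- lm(y ~ poly(x, 2),  data = sim_data)      # polynomial, degree 2
> fit_6  <- lm(y ~ poly(x, 6),  data = sim_data)      # polynomial, degree 6
> fit_10 <- lm(y ~ poly(x, 10), data = sim_data)      # polynomial, degree 10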
We generate a sequence of points from 0 to 1 with a step of 0.01 and compute each model's predictions at these points.
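A sketch of this step with predict() (the grid name x_grid is an assumption; the prediction names match those used later in the simulation):

> x_grid <- data.frame(x = seq(0, 1, by = 0.01))
> pred_hl  <- predict(fit_0,  newdata = x_grid)   # horizontal line
> pred_lr  <- predict(fit_1,  newdata = x_grid)   # linear regression
> pred_pr1 <- predict(fit_2,  newdata = x_grid)   # polynomial, degree 2
> pred_pr2 <- predict(fit_6,  newdata = x_grid)   # polynomial, degree 6
> pred_pr3 <- predict(fit_10, newdata = x_grid)   # polynomial, degree 10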
Now, let's plot the data and the predictions of the models, as well as the true regression function.
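A base-graphics sketch of the plot (the colours are arbitrary choices):

> plot(y ~ x, data = sim_data, col = "grey", pch = 20)
> lines(x_grid$x, f(x_grid$x), lwd = 2)           # true regression, f(x) = x^2
> lines(x_grid$x, pred_hl,  col = "red")
> lines(x_grid$x, pred_lr,  col = "blue")
> lines(x_grid$x, pred_pr1, col = "darkgreen")
> lines(x_grid$x, pred_pr2, col = "orange")
> lines(x_grid$x, pred_pr3, col = "purple")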
As we can see, the linear model fits the data reasonably well. The polynomial models of degree 2 and 6 fit the data much better, while the degree-10 polynomial appears to overfit the data.
Simulating and Calculating Bias and Variance Errors
We will now perform 300 simulations to understand the relationship between the bias error and the variance error for the five predictive models when predicting at the point x = 0.90. We define the necessary variables for the simulation:
> set.seed(1)
> mu <- 0
> sigma <- 0.1
> n_sims <- 300     # number of simulations
> x <- 0.90         # prediction point
> n_models <- 5     # number of predictive models
We define a for loop to perform the simulation. In each iteration, we generate a training data set, build the predictive models using the training data, and predict at x. We store the predictions in a matrix called "predictions".
> predictions <- matrix(0, nrow = n_sims, ncol = n_models)
> for (i in 1:n_sims) {
>   sim_data <- get_data(n = 100, sigma = sigma)  # fresh training set (illustrative helper from above)
>   new_x <- data.frame(x = x)                    # the prediction point x = 0.90
>   pred_hl  <- predict(lm(y ~ 1, data = sim_data), new_x)
>   pred_lr  <- predict(lm(y ~ poly(x, 1),  data = sim_data), new_x)
>   pred_pr1 <- predict(lm(y ~ poly(x, 2),  data = sim_data), new_x)
>   pred_pr2 <- predict(lm(y ~ poly(x, 6),  data = sim_data), new_x)
>   pred_pr3 <- predict(lm(y ~ poly(x, 10), data = sim_data), new_x)
>   predictions[i, ] <- c(
>     pred_hl,
>     pred_lr,
>     pred_pr1,
>     pred_pr2,
>     pred_pr3
>   )
> }
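After the loop, each row of predictions holds the five models' predictions at x = 0.90 for one simulated training set, and each column collects the 300 repeated predictions of one model.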
Now, let's calculate the average bias error and the variance error. We write functions to calculate the errors as follows.

> get_bias <- function(estimate, truth) {   # signature matches the apply() call below
>   mean(estimate) - truth
> }
> get_var <- function(estimate) {           # name assumed by analogy with get_bias
>   mean((estimate - mean(estimate)) ^ 2)
> }
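These functions estimate bias(x) = E[f̂(x)] − f(x) and var(x) = E[(f̂(x) − E[f̂(x)])²] from the simulated predictions; together with the irreducible noise σ², they decompose the expected squared prediction error at x as E[(y − f̂(x))²] = bias(x)² + var(x) + σ².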
Calculate the errors as follows. We use the squared bias in the table; since the bias can be positive or negative, the squared bias is more useful for observing the trend as model complexity increases.

> bias <- apply(predictions, 2, get_bias, f(x = x))
> variance <- apply(predictions, 2, get_var)
> error_table <- data.frame(model = c("hl", "lr", "pr1", "pr2", "pr3"),  # labels assumed
>                           bias_sq = bias ^ 2, variance = variance)
> error_table
As we can see, the bias decreases as the complexity of the model increases. The opposite trend can be seen for the variance: as model complexity increases, the variance increases.