MDL Assignment 2, Spring 2023
Assignment 2
Maximum Marks: 100
Deadline: 11:55 PM, 17th February, 2023
1 Introduction
1.1 Bias-Variance trade-off
When we discuss model prediction, it is important to understand the main sources of prediction
error: bias and variance. There is a trade-off between a model's ability to minimise bias
and its ability to minimise variance. A proper understanding of these errors helps distinguish a
layman from an expert in Machine Learning. Before using different classifiers, it is
important to understand how to choose which classifier to use.
Let us get started and understand some basic definitions that are relevant.
For basic definitions, such as what happens when f̂ is applied to an unseen sample x, refer here.
● Bias is the difference between the average prediction of our model and the correct
value that we are trying to predict. A model with high bias oversimplifies the underlying
relationship and does not capture the data well, which leads to high error on both training
and test data. (A small numeric sketch of bias and variance follows this list.)
● Variance is the variability of a model prediction for a given data point. Again,
imagine you can repeat the entire model-building process multiple times. The
variance is how much the predictions for a given point vary between different
realisations of the model.
● Noise is any unwanted distortion in the data: anything spurious and extraneous to the
original data that was not intended to be present in the first place but was introduced
by a faulty capturing process.
● Irreducible error is the error that cannot be reduced by creating good models. It
is a measure of the amount of noise in the data. Here, it is important to understand
that no matter how good we make our model, our data will have a certain amount
of noise or irreducible error that cannot be removed.
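To make the definitions of bias and variance above concrete, here is a minimal numeric sketch (not part of the assignment code); the prediction values and array shapes are made-up assumptions purely for illustration.

    import numpy as np

    # Made-up predictions from three models (one per resample), all evaluated
    # on the same three test points, plus the true targets for those points.
    predictions = np.array([[2.1, 3.9, 6.2],   # model 1
                            [1.9, 4.1, 5.8],   # model 2
                            [2.0, 4.0, 6.0]])  # model 3
    y_true = np.array([2.0, 4.0, 7.0])

    mean_prediction = predictions.mean(axis=0)   # average prediction per test point
    bias = mean_prediction - y_true              # bias at each test point
    variance = predictions.var(axis=0)           # spread of predictions at each point

    print("average bias^2:  ", np.mean(bias ** 2))
    print("average variance:", np.mean(variance))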
For a simple linear regression model with only one feature, the equation is:

y = wx + b

where,
● y = Predicted value / target value
● x = Input
● w = Gradient / slope / weight
● b = Bias

For a multivariable regression model, the equation is:

y = w₁x₁ + w₂x₂ + … + wₙxₙ + b
Once we have the prediction function, we need to determine the values of the weight(s) and
bias. To see how to calculate the weight(s) and bias, refer to this article.
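For intuition only, one common way to obtain the weight and bias in the single-feature case is the closed-form least-squares solution sketched below on made-up data; this is not a method mandated by the assignment, just an illustration of what "calculating the weight and bias" means.

    import numpy as np

    # Made-up one-dimensional data.
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

    # Closed-form least-squares estimates for y = w*x + b.
    w = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b = y.mean() - w * x.mean()
    print(w, b)  # estimated slope and intercept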
2 Tasks
2.1 Task 1: Linear Regression
Write a brief explanation of what the method LinearRegression().fit() does.
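As a hedged illustration of the behaviour you are asked to describe, the snippet below fits LinearRegression on toy data; after fit() runs, the learned slope and intercept are available as coef_ and intercept_.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Toy data: y is exactly 3*x + 1.
    X = np.arange(10, dtype=float).reshape(-1, 1)  # sklearn expects a 2-D feature matrix
    y = 3 * X.ravel() + 1

    model = LinearRegression()
    model.fit(X, y)  # estimates coefficients that minimise the squared error on (X, y)
    print(model.coef_, model.intercept_)  # approximately [3.] and 1.0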
2.2 Task 2: Gradient Descent
Explain how gradient descent works to find the coefficients. For simplicity, take the case
where there is one independent variable and one dependent variable.
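A minimal sketch of gradient descent for this one-variable case is given below; the toy data, learning rate and iteration count are arbitrary assumptions, and your written explanation should focus on the update rule rather than any specific numbers.

    import numpy as np

    # Toy data, roughly y = 2*x + 0.5.
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([0.4, 2.6, 4.5, 6.4, 8.6])

    w, b = 0.0, 0.0  # arbitrary starting coefficients
    lr = 0.01        # learning rate (assumed)
    for _ in range(5000):
        error = (w * x + b) - y
        dw = 2 * np.mean(error * x)  # gradient of MSE w.r.t. w
        db = 2 * np.mean(error)      # gradient of MSE w.r.t. b
        w -= lr * dw
        b -= lr * db

    print(w, b)  # approaches the least-squares slope and intercept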
2.3.2 Task
After re-sampling the data, you have 21 different datasets (20 train sets and 1 test set).
Train a linear classifier on each of the 20 train sets separately so that you have 20
different classifiers or models. Now you can calculate the bias and variance of the model
using the test set. You need to repeat the above process for the following class of
functions,
• y = w₁x + b
• y = w₂x² + w₁x + b
• y = w₃x³ + w₂x² + w₁x + b
And so on, up to polynomials of degree 15. The only two functions that you are
allowed to use (from sklearn) are:
• linear_model.LinearRegression().fit()
• preprocessing.PolynomialFeatures()
These functions will help you find the appropriate coefficients with the default
parameters. Tabulate the values of bias and variance, and also write a detailed report
explaining how bias and variance change as you vary your function classes.
Note: Whenever we are talking about the bias and variance of the model,
it refers to the average bias and variance of the model over all the test points.
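A minimal sketch of this procedure is given below; the loading of the 20 train sets and the test set is not shown, the inputs are assumed to be one-dimensional numpy arrays, and the synthetic data at the end exists only so the sketch runs.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures

    def bias_variance_for_degree(degree, train_sets, X_test, y_test):
        poly = PolynomialFeatures(degree)
        X_test_poly = poly.fit_transform(X_test.reshape(-1, 1))

        # One row of test predictions per trained model (20 rows in the assignment).
        preds = []
        for X_train, y_train in train_sets:
            X_train_poly = poly.fit_transform(X_train.reshape(-1, 1))
            model = LinearRegression().fit(X_train_poly, y_train)
            preds.append(model.predict(X_test_poly))
        preds = np.array(preds)

        mean_pred = preds.mean(axis=0)
        bias_sq = np.mean((mean_pred - y_test) ** 2)  # averaged over test points
        variance = np.mean(preds.var(axis=0))         # averaged over test points
        mse = np.mean((preds - y_test) ** 2)          # averaged over models and points
        return bias_sq, variance, mse

    # Synthetic stand-in data so the sketch is runnable end to end.
    rng = np.random.default_rng(0)
    def make_split(n):
        X = rng.uniform(-1, 1, n)
        return X, np.sin(2 * X) + rng.normal(0, 0.1, n)
    train_sets = [make_split(40) for _ in range(20)]
    X_test, y_test = make_split(50)
    print(bias_variance_for_degree(3, train_sets, X_test, y_test))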
Plot the variation of Bias², Variance and MSE against the degree of the polynomial in the
same graph.
Note: The formulae for Bias² and Variance are for a single input, but as the test data
contains more than one input, take the mean wherever required. You need to plot the
graph for polynomials of degree up to 10 only (plotting higher degrees makes the graph
difficult to interpret).
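One possible way to produce this plot with matplotlib is sketched below; the per-degree values are placeholders purely so the snippet runs, and should be replaced by the Bias², Variance and MSE values you compute.

    import matplotlib.pyplot as plt

    degrees = list(range(1, 11))
    # Placeholder values only; substitute your computed results per degree.
    bias_sq_list = [1.0 / d for d in degrees]
    variance_list = [0.01 * d for d in degrees]
    mse_list = [b + v for b, v in zip(bias_sq_list, variance_list)]

    plt.plot(degrees, bias_sq_list, label="Bias^2")
    plt.plot(degrees, variance_list, label="Variance")
    plt.plot(degrees, mse_list, label="MSE")
    plt.xlabel("Degree of polynomial")
    plt.ylabel("Error")
    plt.legend()
    plt.show()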
3 Bonus
We have provided you with the data of a capacitor discharging through a loop containing a
resistor. The charge on the capacitor (dependent variable) is a function of time (independent
variable) and decays exponentially according to the following equation:

q(t) = q₀ e^(−t/RC)

Given this relation, perform linear regression on the data and report the values of
Capacitance (C) and Resistance (R).
Note: You cannot directly perform linear regression since the function is an exponential
one. You have to figure out another way to use linear regression on the dataset.
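A hedged sketch of one such approach: taking the logarithm of the charge turns the exponential relation into a straight line in t, whose slope gives the product RC. The arrays t and q below are synthetic stand-ins for the provided data, and separating R and C individually depends on whatever additional information the assignment data provides.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Synthetic stand-in for the provided (time, charge) data.
    t = np.linspace(0.0, 5.0, 50)
    q = 2.0 * np.exp(-t / 1.5)

    # q = q0 * exp(-t / (R*C))  =>  ln(q) = ln(q0) - t / (R*C), linear in t.
    model = LinearRegression().fit(t.reshape(-1, 1), np.log(q))
    rc_product = -1.0 / model.coef_[0]   # slope is -1/(R*C)
    q0 = np.exp(model.intercept_)
    print("R*C =", rc_product, "q0 =", q0)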
4 General Instructions
● The data is in numpy array format.
● Submit a zip file named rollnumber_assgn2.zip containing the source code and the
report:
○ code.ipynb
○ bonus.ipynb (if done)
○ report.pdf
○ readme.md (if any assumptions)
● All coding has to be done in Python3 only, using Jupyter Notebook.
● Report should include all details needed for evaluation. Please include relevant
graphs, tables, analysis, observations and writeup as required for each of the tasks
above.
● Get familiar with numpy, matplotlib, pickle, pandas dataframe and sklearn.
● You should write vectorised code, which performs much better than iterating over
individual elements (see the short example after this list).
● Plagiarism will be penalised heavily.
● Manual evaluations will be held; further details about them will be announced later.
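As a small illustration of the vectorisation point above, both snippets below compute the same sum of squared errors; the second, vectorised form is the style expected.

    import numpy as np

    y_true = np.random.rand(100_000)
    y_pred = np.random.rand(100_000)

    # Loop version: one Python-level iteration per element (slow).
    total = 0.0
    for a, b in zip(y_true, y_pred):
        total += (a - b) ** 2

    # Vectorised version: the whole computation runs inside numpy (fast).
    total_vec = np.sum((y_true - y_pred) ** 2)

    assert np.isclose(total, total_vec)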
5 Marking Scheme
● Task 1: 5 marks
● Task 2: 5 marks
● Task 3: 30 marks
● Task 4: 10 marks
● Task 5: 20 marks
● Viva: 30 marks
● Bonus: 20 marks
Note: Marks lost in any task can be covered by the bonus. However, the bonus will not
compensate for marks lost in the Viva. The maximum mark for this assignment is 100.