IE506 Assignment1
1 Instructions
Answer all questions. Write your answers clearly. You can score a maximum of 50 marks in this assignment.
Answers for Question 1 should be provided in a single pdf file.
Name the pdf file as “IE506 yourrollno assignment1 q1.pdf”.
Use a separate Python notebook (.ipynb) file for each programming-based question (questions 2 and 3).
Name the .ipynb files as “IE506 rollno assignment1 q2.ipynb”, “IE506 rollno assignment1 q3.ipynb”.
Make sure that your answers and plots are clearly visible in .ipynb files.
Create a folder “IE506 rollno assignment1” and place all your solution .pdf and .ipynb files in the folder.
Zip the folder “IE506 rollno assignment1” to create “IE506 rollno assignment1.zip”. Upload the single zip
file “IE506 rollno assignment1.zip” to Moodle.
There will be no extensions to the submission deadline.
Note: Submissions not following the instructions will not be evaluated.
2 Questions
1. Recall that the conditional mean of the response variable (the parametrized model-based estimate),
conditioned on an input vector $x$ of $d$ attributes, is given by
$E[Y \mid X = (x_1, x_2, \ldots, x_d)] = \beta_0 + \sum_{j=1}^{d} \beta_j x_j$.
Letting $x = (x_1\ x_2\ \ldots\ x_d\ 1)^\top$ and $\beta = (\beta_1\ \beta_2\ \ldots\ \beta_d\ \beta_0)^\top$,
note that $E[Y \mid X = (x_1, x_2, \ldots, x_d)] = \beta^\top x$.
Let us denote $E[Y \mid X = (x_1, x_2, \ldots, x_d)]$ by $\hat{y}$ (predicted value).
Consider a data set $D = \{(x^i, y^i)\}_{i=1}^{n}$. Let $y = (y^1\ y^2\ \ldots\ y^n)^\top$ and let
$\hat{y} = (\hat{y}^1\ \hat{y}^2\ \ldots\ \hat{y}^n)^\top$ denote
the vectors containing actual values and predicted values of the response variable for the $n$ samples
in data set $D$. Consider the OLS objective function to determine $\beta$ values by solving
$\min_{\beta} J(\beta) = \|y - X\beta\|_2^2$, where $X$ is a feature matrix whose construction is given in the notebook shared in class.
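To make the notation above concrete, here is a minimal sketch of the objective $J(\beta)$. It assumes that each row of $X$ is one sample with a 1 appended as the last entry, matching the ordering of $x$ and $\beta$ defined above; the class notebook's construction may differ. The toy data and the helper names `build_feature_matrix` and `ols_objective` are hypothetical.

```python
import numpy as np

# Toy data, made up for illustration: n = 4 samples, d = 2 attributes.
features = np.array([[1.0, 2.0],
                     [2.0, 0.0],
                     [3.0, 1.0],
                     [4.0, 3.0]])
y = np.array([3.0, 2.0, 5.0, 7.0])


def build_feature_matrix(features):
    """Append a column of ones as the LAST column, so row i is (x_1, ..., x_d, 1).

    Assumption: this mirrors x = (x_1 ... x_d 1)^T from the question text.
    """
    n = features.shape[0]
    return np.hstack([features, np.ones((n, 1))])


def ols_objective(beta, X, y):
    """Return J(beta) = ||y - X beta||_2^2 for beta = (beta_1, ..., beta_d, beta_0)."""
    residual = y - X @ beta
    return float(residual @ residual)


X = build_feature_matrix(features)
beta = np.array([1.0, 1.0, 0.0])  # arbitrary trial value, not the minimizer

print(X.shape)                    # (n, d + 1)
print(ols_objective(beta, X, y))  # value of J at the trial beta
```

Placing the intercept last keeps $\beta^\top x$ consistent with the definition $\beta = (\beta_1\ \ldots\ \beta_d\ \beta_0)^\top$; if the ones column were placed first, the entries of $\beta$ would have to be reordered accordingly.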
(a) [2 marks] Using the zero-gradient condition $\nabla_{\beta} J = 0$ discussed in class, and under appropriate
assumptions, find a suitable matrix $A$ such that $\hat{y} = Ay$. State the assumptions you used.
(b) [4 marks] In the matrix $A$ from part 1(a), denote the i-th diagonal entry by $a_{ii}$. Verify whether
$\sum_{i} a_{ii} = d + 1$. Also find suitable $p$ and $q$ such that $p \le a_{ii} \le q$.
(c) In the notebooks shared in class, write code to compute $a_{ii}$.
(d) [3 marks] Check whether $a_{ii}$ can be represented as $a_{ii} = \frac{\partial \hat{y}^i}{\partial y^i}$. Using this relation, explain the meaning of $a_{ii}$.
(e) [2 marks] Explore other possible meanings of $a_{ii}$ and explain the importance of $a_{ii}$ based
on your investigations.
Assignment 1: Due on 30th January 2024 (11:59 PM IST)
(f) [3 marks] Recall that the residual for the i-th sample is given by $e^i = y^i - \hat{y}^i$. In the notebooks
shared, write code to compute the standardized residuals $e_{\star}^i = e^i / \sigma$, where
$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (e^i)^2}{n - (d+1)}}$.
Explain a possible reason for using $1/(n - (d + 1))$ as a scaling factor to compute the standard
deviation $\sigma$ of the residuals.
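The standardized-residual formula in part (f) can be written out directly, as in the following sketch. It assumes $X$ already contains the column of ones, so its number of columns equals $d + 1$. The toy data and the function name `standardized_residuals` are hypothetical, and `np.linalg.lstsq` is used only as a stand-in for whichever solver the class notebooks provide.

```python
import numpy as np


def standardized_residuals(X, y, beta):
    """Return (e_star, sigma) with e_star^i = e^i / sigma and
    sigma = sqrt(sum_i (e^i)^2 / (n - (d + 1)))."""
    n, n_params = X.shape              # n_params = d + 1 (ones column included)
    e = y - X @ beta                   # residuals e^i = y^i - yhat^i
    sigma = np.sqrt(np.sum(e ** 2) / (n - n_params))
    return e / sigma, sigma


# Toy data, made up for illustration: one attribute (d = 1), n = 4 samples.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 2.0, 4.0])
X = np.column_stack([x, np.ones_like(x)])   # ones column last

# Stand-in least-squares fit; replace with the course's own solver.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

e_star, sigma = standardized_residuals(X, y, beta)
print(sigma)
print(e_star)
```

One quick sanity check on any implementation: by construction, the squared standardized residuals sum to $n - (d + 1)$.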
2. For the following questions, do not use any Python package. Write the complete code yourself. You
must reuse code provided in the notebooks used for class lectures.
(a) [1 mark] Read the dataset in data1.txt into a pandas dataframe.
(b) [1 mark] Display the corresponding data description and understand the contents of the data
in data1.txt.
(c) [1 mark] Display the number of samples and number of attributes.
(d) [1 mark] Replace the column names of data frame with meaningful column names, designed by
you using the description in data1.txt.
(e) [1 mark] Display the maximum, minimum, median, first quartile, third quartile information for
each relevant column in the dataframe. Use an appropriate pandas command.
(f) [1 mark] Use an appropriate pandas command to check whether any column in the dataframe contains
missing values. If any rows have missing values, drop those rows. If not,
clearly indicate that there are no missing values.
(g) [2 marks] Find the regression coefficients for the data in data1.txt and plot the regression
line. Compute $R^2$. Explain your observations.
(h) [2 marks] Plot the residual vs fitted values. Explain your observations.
(i) [2 marks] Plot the standardized residual (discussed in question 1) vs fitted values. Compare
this plot with the residual vs fitted values plot. Explain your observations.
(j) [2 marks] Compute $a_{ii}$ (discussed in question 1) for every i-th sample in the data set. Find
the set $I$ of indices of the samples for which $a_{ii} > (2/n) \sum_{i} a_{ii}$. Rerun the regression to find
the regression coefficients based on those samples whose indices are not in $I$. Using the new
coefficients, plot the residual vs fitted values, standardized residual vs fitted values plots and
compute $R^2$. Comment on your observations. Can the samples whose indices are in $I$ be called
outliers?
(k) [2 marks] Compute $a_{ii}$ for every i-th sample in the data set. Find the set $I$ of indices of the
samples for which $a_{ii} > (3/n) \sum_{i} a_{ii}$. Rerun the regression to find the regression coefficients
based on those samples whose indices are not in $I$. Using the new coefficients, plot the residual
vs fitted values, standardized residual vs fitted values plots and compute $R^2$. Comment on your
observations. Can the samples whose indices are in $I$ be called outliers? Compare and contrast
the observations in parts 2(j) and 2(k).
(l) [2 marks] In parts 2(j) and 2(k), we have used the condition $a_{ii} > (p/n) \sum_{i} a_{ii}$, where $p \in \{2, 3\}$.
Explain why such a condition might be useful to segregate problematic samples.
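The selection rule used in parts 2(j) to 2(l) is a plain threshold on the diagonal entries, which can be sketched as below. The function name `flag_indices` and the values in `a_diag` are hypothetical placeholders; in the assignment the $a_{ii}$ values come from the matrix $A$ of Question 1.

```python
import numpy as np


def flag_indices(a_diag, p):
    """Return the set I of indices i with a_ii > (p / n) * sum_i a_ii."""
    a_diag = np.asarray(a_diag, dtype=float)
    n = a_diag.size
    threshold = (p / n) * a_diag.sum()
    return set(np.flatnonzero(a_diag > threshold).tolist())


# Hypothetical diagonal entries for n = 9 samples (placeholder values only).
a_diag = [0.1, 0.1, 0.45, 0.1, 0.1, 0.7, 0.1, 0.25, 0.1]

I_2 = flag_indices(a_diag, p=2)   # rule from part 2(j)
I_3 = flag_indices(a_diag, p=3)   # rule from part 2(k)

# Boolean mask of the samples kept for the rerun regression (indices not in I).
keep_mask = np.array([i not in I_2 for i in range(len(a_diag))])

print(sorted(I_2), sorted(I_3))
print(keep_mask)
```

The mask can then index the rows of the feature matrix and the response vector before refitting.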
3. [18 marks] Repeat the analysis in Question 2 for the data provided in file data2.txt. Write all
your observations clearly.
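Both data files go through the same loading and inspection steps, parts 2(a) to 2(f). The sketch below shows one way these might look in pandas. The layout of data1.txt and data2.txt is not shown in this document, so a whitespace-delimited file with no header row is assumed, and an in-memory stand-in with made-up numbers replaces the real file. The column names are hypothetical.

```python
import io

import pandas as pd

# Stand-in for data1.txt / data2.txt. ASSUMPTION: whitespace-delimited values,
# no header row. The numbers are made up for illustration.
raw = io.StringIO(
    "1.0 2.1\n"
    "2.0 3.9\n"
    "3.0 NaN\n"
    "4.0 8.2\n"
)

# With the real file, pass its path instead of `raw`.
df = pd.read_csv(raw, sep=r"\s+", header=None)

# Hypothetical names; choose names based on the data description.
df.columns = ["input_x", "response_y"]

# Number of samples (rows) and attributes (columns).
n_samples, n_attributes = df.shape
print(n_samples, n_attributes)

# min, max, median (50%), first quartile (25%) and third quartile (75%).
summary = df.describe()
print(summary)

# Count missing values per column, then drop any rows that contain them.
missing_per_column = df.isna().sum()
print(missing_per_column)
if missing_per_column.sum() == 0:
    print("There are no missing values.")
df_clean = df.dropna(axis=0).reset_index(drop=True)
```

If the real files begin with description lines, the `skiprows` or `comment` arguments of `pd.read_csv` may be needed; that depends on the actual file layout.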