Assignment 2
True.
2. Model Assumptions: Linear regression assumes a linear relationship between the predictors
and the target variable. In high-dimensional spaces, this assumption may not hold, leading to
poor performance. k-NN does not assume a specific functional form and can adapt to
complex, non-linear relationships.
3. Overfitting: In high-dimensional spaces, linear regression may overfit the training data,
especially if the number of features is large compared to the number of observations. k-NN
can be less prone to overfitting if the value of k is chosen appropriately, as it relies on local
neighborhoods rather than fitting a global model.
Overall, while linear regression is a powerful technique for many applications, k-NN can sometimes
provide better performance in high-dimensional spaces where the relationships between variables
are complex and non-linear.
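To make this concrete, here is a minimal scikit-learn sketch (synthetic non-linear data and a hand-picked k, not part of the original question) comparing the two models:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic example: a non-linear target that a global linear fit cannot capture.
rng = np.random.default_rng(42)
X = rng.uniform(-2, 2, size=(400, 5))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=400)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lin = LinearRegression().fit(X_train, y_train)
knn = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)  # k chosen by hand here

# On this non-linear data, k-NN usually achieves the higher test R^2.
print("Linear regression R^2:", r2_score(y_test, lin.predict(X_test)))
print("k-NN (k=5) R^2:       ", r2_score(y_test, knn.predict(X_test)))
```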
2. Given the following dataset, find the univariate regression function that best fits the
dataset.
f(x) = 1×x + 4
f(x) = 1×x + 5
f(x) = 1.5×x + 3
f(x) = 2×x + 1
To determine the best univariate regression function that fits the given dataset, follow these steps:
Dataset:
x = [2, 3, 4, 10]
y = [5.5, 6.5, 9, 18.5]
Step 1: Compute the Best-Fit Line Using Linear Regression
We use the least squares method to find the line of best fit in the form y = mx + b.
Calculate the means of x and y:
x̄ = (2 + 3 + 4 + 10) / 4 = 19/4 = 4.75
ȳ = (5.5 + 6.5 + 9 + 18.5) / 4 = 39.5/4 = 9.875
Calculate the slope m:
m = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²
Compute the numerator and denominator:
Numerator = (2 − 4.75)(5.5 − 9.875) + (3 − 4.75)(6.5 − 9.875) + (4 − 4.75)(9 − 9.875) + (10 − 4.75)(18.5 − 9.875)
= (−2.75)(−4.375) + (−1.75)(−3.375) + (−0.75)(−0.875) + (5.25)(8.625)
= 12.03125 + 5.90625 + 0.65625 + 45.28125
= 63.875
Denominator = (2 − 4.75)² + (3 − 4.75)² + (4 − 4.75)² + (10 − 4.75)²
= (−2.75)² + (−1.75)² + (−0.75)² + (5.25)²
= 7.5625 + 3.0625 + 0.5625 + 27.5625
= 38.75
m = 63.875 / 38.75 ≈ 1.648 ≈ 1.65
Calculate the intercept b:
b = ȳ − m x̄
b = 9.875 − 1.648 × 4.75
b ≈ 9.875 − 7.83
b ≈ 2.05
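The same arithmetic can be reproduced in a few lines of NumPy; this is just a numerical check of the hand computation above, not part of the original answer:

```python
import numpy as np

x = np.array([2, 3, 4, 10])
y = np.array([5.5, 6.5, 9, 18.5])

x_bar, y_bar = x.mean(), y.mean()              # 4.75, 9.875
numerator = np.sum((x - x_bar) * (y - y_bar))  # 63.875
denominator = np.sum((x - x_bar) ** 2)         # 38.75
m = numerator / denominator                    # ~1.648
b = y_bar - m * x_bar                          # ~2.045
print(m, b)
```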
Step 2: Form the Regression Equation
The regression equation is:
y ≈ 1.65x + 2.05
Step 3: Compare With the Given Options
Let's compare this to the options provided:
f(x) = 1×x + 4
f(x) = 1×x + 5
f(x) = 1.5×x + 3
f(x) = 2×x + 1
Our computed regression function y ≈ 1.65x + 2.05 is closest to:
f(x) = 1.5×x + 3
Although it is not an exact match, f(x) = 1.5×x + 3 gives the smallest error on the data among the
four options (see the sketch below), so it is the best choice given the options and the rounding used
in practice.
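To make the comparison concrete, a small NumPy sketch can score each of the four options by its sum of squared errors (SSE) on the dataset:

```python
import numpy as np

x = np.array([2, 3, 4, 10])
y = np.array([5.5, 6.5, 9, 18.5])

# The four candidate lines from the options, as (slope, intercept) pairs.
candidates = {
    "f(x) = 1.0x + 4": (1.0, 4),
    "f(x) = 1.0x + 5": (1.0, 5),
    "f(x) = 1.5x + 3": (1.5, 3),
    "f(x) = 2.0x + 1": (2.0, 1),
}

# Sum of squared errors for each candidate; 1.5x + 3 yields the smallest.
for name, (m, b) in candidates.items():
    sse = np.sum((y - (m * x + b)) ** 2)
    print(f"{name}: SSE = {sse:.4f}")
```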
3. Given a training dataset of 500 instances, with each input instance having 6 dimensions
and each output being a scalar value, the dimensions of the design matrix used in applying
linear regression to this data are:
500 × 6
500 × 7
500 × 62
None of the above
In linear regression, the design matrix, often denoted X, includes all the input features
for each instance, along with an additional column for the intercept (bias term) if the model
includes an intercept.
Given:
The training dataset consists of 500 instances.
Each instance has 6 dimensions (features).
Each output is a scalar value.
Dimensions of the Design Matrix
For linear regression:
1. Input Features: There are 6 features for each instance.
2. Intercept Term: If the model includes an intercept term, an additional column is added to the
design matrix to account for this.
Therefore, the design matrix X will have:
Rows: Equal to the number of instances (500).
Columns: Equal to the number of features plus 1 (for the intercept term).
So, the dimensions of the design matrix will be 500 × (6 + 1) = 500 × 7.
Conclusion
The correct dimension of the design matrix for applying linear regression to this dataset is:
500 × 7
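As an illustration, a short NumPy sketch (with random placeholder feature values) builds such a design matrix and confirms the 500 × 7 shape:

```python
import numpy as np

# 500 instances, 6 features each (random placeholder values).
features = np.random.rand(500, 6)

# Prepend a column of ones for the intercept (bias) term.
X = np.column_stack([np.ones(500), features])
print(X.shape)  # (500, 7)
```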
4. Assertion A: Binary encoding is usually preferred over One-hot encoding to represent
categorical data (e.g., colors, gender, etc.)
Reason R: Binary encoding is more memory efficient when compared to One-hot encoding
Both A and R are true and R is the correct explanation of A
Both A and R are true but R is not the correct explanation of A
A is true but R is false
A is false but R is true
To evaluate the assertion and reason, let's break down both concepts:
Assertion A: Binary encoding is usually preferred over One-hot encoding to represent
categorical data (e.g., colors, gender, etc.)
Binary Encoding vs. One-hot Encoding:
One-hot Encoding: Each category is represented as a binary vector with a single 1 and all
other entries 0. For a categorical variable with n possible categories, you need n
binary columns. This can lead to high-dimensional data if n is large.
Binary Encoding: Categories are first assigned unique integer values, and these integer values
are then converted to binary form. For a categorical variable with n categories, you use
⌈log2(n)⌉ binary columns. This is often more memory efficient than one-hot encoding.
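A minimal sketch of the two encodings (assuming pandas and a hypothetical six-category color column) shows the difference in column counts:

```python
import math
import numpy as np
import pandas as pd

# Hypothetical categorical column with n = 6 distinct colors (illustrative only).
colors = pd.Series(["red", "green", "blue", "yellow", "purple", "cyan"])
n = colors.nunique()

# One-hot encoding: n columns, one per category.
one_hot = pd.get_dummies(colors)
print("one-hot columns:", one_hot.shape[1])  # 6

# Binary encoding: integer-code each category, then write the code in base 2,
# which needs ceil(log2(n)) columns.
width = math.ceil(math.log2(n))              # 3 for n = 6
codes = colors.astype("category").cat.codes.to_numpy()
binary = pd.DataFrame({f"bit_{b}": (codes >> b) & 1 for b in range(width)})
print("binary columns:", binary.shape[1])    # 3
```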
5. Which of the following statements about the first principal component v in PCA are TRUE?
1. v = argmax Σᵢ (vᵀxᵢ)²  s.t. |v| = 1
2. v = argmin Σᵢ (vᵀxᵢ)²  s.t. |v| = 1
Let's analyze each statement regarding Principal Component Analysis (PCA): the first principal
component maximizes the variance of the projections of the (centered) data, i.e., it is the
eigenvector corresponding to the largest eigenvalue of the covariance matrix.
The TRUE statements are:
v = argmax Σᵢ (vᵀxᵢ)²  s.t. |v| = 1
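A small NumPy sketch (synthetic, centered data; not part of the original answer) illustrates that the eigenvector of the covariance matrix with the largest eigenvalue attains the argmax objective above:

```python
import numpy as np

# Synthetic data with one dominant direction of variance, then centered.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[3, 0, 0], [0, 1, 0], [0, 0, 0.2]])
X = X - X.mean(axis=0)

cov = np.cov(X, rowvar=False)            # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
v_pc1 = eigvecs[:, -1]                   # eigenvector of the largest eigenvalue

def projected_energy(v):
    return np.sum((X @ v) ** 2)          # sum_i (v^T x_i)^2

# The principal direction beats random unit vectors on the argmax objective.
random_vs = rng.normal(size=(1000, 3))
random_vs /= np.linalg.norm(random_vs, axis=1, keepdims=True)
assert projected_energy(v_pc1) >= max(projected_energy(v) for v in random_vs)
```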