
1. State True or False:

Typically, linear regression tends to underperform compared to k-nearest neighbor algorithms when dealing with high-dimensional input spaces.

True.

In high-dimensional input spaces, linear regression can underperform compared to k-nearest neighbor (k-NN) algorithms for several reasons:

1. Curse of Dimensionality: As the number of dimensions (features) increases, the data become sparse and a single global linear fit may fail to capture the underlying patterns. k-NN, on the other hand, adapts to local patterns rather than relying on one global trend.

2. Model Assumptions: Linear regression assumes a linear relationship between the predictors
and the target variable. In high-dimensional spaces, this assumption may not hold, leading to
poor performance. k-NN does not assume a specific functional form and can adapt to
complex, non-linear relationships.

3. Overfitting: In high-dimensional spaces, linear regression may overfit the training data,
especially if the number of features is large compared to the number of observations. k-NN
can be less prone to overfitting if the value of k is chosen appropriately, as it relies on local
neighborhoods rather than fitting a global model.

Overall, while linear regression is a powerful technique for many applications, k-NN can sometimes
provide better performance in high-dimensional spaces where the relationships between variables
are complex and non-linear.
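To make point 3 above concrete, here is a minimal sketch (not part of the original question; the synthetic data, dimensions, and choice of k are assumptions made purely for illustration). It fits both models when the number of features is close to the number of training points and prints train and test MSE; the exact numbers, and which model does better, will vary with the data-generating process.

```python
# Hedged illustration: with almost as many features as training points, ordinary
# linear regression can fit the training set very closely yet generalize poorly,
# while k-NN's error comes from local averaging instead. All values here are
# assumptions for the sketch; results depend on the data and on k.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n_train, n_test, d = 60, 500, 50                  # nearly as many features as training points
X_train = rng.normal(size=(n_train, d))
X_test = rng.normal(size=(n_test, d))
signal = lambda X: X[:, 0]                        # assumed true signal: depends on one feature
y_train = signal(X_train) + rng.normal(size=n_train)
y_test = signal(X_test) + rng.normal(size=n_test)

for name, model in [("linear regression", LinearRegression()),
                    ("k-NN (k=5)       ", KNeighborsRegressor(n_neighbors=5))]:
    model.fit(X_train, y_train)
    print(name,
          "train MSE:", round(mean_squared_error(y_train, model.predict(X_train)), 2),
          "test MSE:", round(mean_squared_error(y_test, model.predict(X_test)), 2))
```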

2. Given the following dataset, find the univariate regression function that best fits the dataset.

f(x) = 1·x + 4
f(x) = 1·x + 5
f(x) = 1.5·x + 3
f(x) = 2·x + 1

To determine the best univariate regression function for the given dataset, follow these steps.

Dataset
x = [2, 3, 4, 10]
y = [5.5, 6.5, 9, 18.5]

Step 1: Compute the Best-Fit Line Using Linear Regression
We use the least squares method to find the line of best fit in the form y = mx + b.

Calculate the means of x and y:
x̄ = (2 + 3 + 4 + 10) / 4 = 19/4 = 4.75
ȳ = (5.5 + 6.5 + 9 + 18.5) / 4 = 39.5/4 = 9.875

Calculate the slope m:
m = ∑(xᵢ − x̄)(yᵢ − ȳ) / ∑(xᵢ − x̄)²

Compute the numerator and denominator:
Numerator = (2 − 4.75)(5.5 − 9.875) + (3 − 4.75)(6.5 − 9.875) + (4 − 4.75)(9 − 9.875) + (10 − 4.75)(18.5 − 9.875)
= (−2.75)(−4.375) + (−1.75)(−3.375) + (−0.75)(−0.875) + (5.25)(8.625)
= 12.03125 + 5.90625 + 0.65625 + 45.28125
= 63.875

Denominator = (2 − 4.75)² + (3 − 4.75)² + (4 − 4.75)² + (10 − 4.75)²
= (−2.75)² + (−1.75)² + (−0.75)² + (5.25)²
= 7.5625 + 3.0625 + 0.5625 + 27.5625
= 38.75

m = 63.875 / 38.75 ≈ 1.648

Calculate the intercept b:
b = ȳ − m·x̄ = 9.875 − 1.648 × 4.75 ≈ 9.875 − 7.83 ≈ 2.05

Step 2: Form the Regression Equation
The regression equation is:
y ≈ 1.648x + 2.05

Step 3: Compare With Given Options
f(x) = 1·x + 4
f(x) = 1·x + 5
f(x) = 1.5·x + 3
f(x) = 2·x + 1

Our computed regression function y ≈ 1.648x + 2.05 is closest to:
f(x) = 1.5·x + 3

Although not an exact match, f(x) = 1.5·x + 3 is the best of the given options: among the four candidates it gives the smallest sum of squared errors on the data.
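As a quick sanity check (not part of the original question), the short NumPy sketch below recomputes the least-squares fit and compares the sum of squared errors (SSE) of the four candidate functions on the given data.

```python
# Verify the hand calculation and rank the candidate functions by SSE.
import numpy as np

x = np.array([2, 3, 4, 10], dtype=float)
y = np.array([5.5, 6.5, 9, 18.5])

# Least-squares slope and intercept, as computed by hand above.
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()
print(f"best fit: y = {m:.3f}x + {b:.3f}")           # ~ y = 1.648x + 2.045

# SSE of each option; the smallest SSE identifies the best of the given choices.
options = {"1.0x + 4": (1.0, 4), "1.0x + 5": (1.0, 5),
           "1.5x + 3": (1.5, 3), "2.0x + 1": (2.0, 1)}
for name, (slope, intercept) in options.items():
    sse = np.sum((y - (slope * x + intercept)) ** 2)
    print(name, "SSE =", sse)                         # 1.5x + 3 gives the smallest SSE
```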

3. Given a training data set of 500 instances, with each input instance having 6 dimensions
and each output being a scalar value, the dimensions of the design matrix used in applying
linear regression to this data are
500 × 6
500 × 7
500 × 62
None of the above
In linear regression, the design matrix, often denoted as X, includes all the input features
for each instance, along with an additional column for the intercept (bias term) if you are
using a model with an intercept.
Given:
 The training dataset consists of 500 instances.
 Each instance has 6 dimensions (features).
 Each output is a scalar value.
Dimensions of the Design Matrix
For linear regression:
1. Input Features: There are 6 features for each instance.
2. Intercept Term: If the model includes an intercept term, an additional column is added to the
design matrix to account for this.
Therefore, the design matrix X will have:
 Rows: Equal to the number of instances (500).
 Columns: Equal to the number of features plus 1 (for the intercept term).
So, the dimensions of the design matrix will be 500 × (6 + 1) = 500 × 7.
Conclusion
The correct dimension of the design matrix for applying linear regression to this dataset is:
 500 × 7
4. Assertion A: Binary encoding is usually preferred over One-hot encoding to represent
categorical data (e.g., colors, gender, etc.)
Reason R: Binary encoding is more memory efficient when compared to One-hot encoding
Both A and R are true and R is the correct explanation of A
Both A and R are true but R is not the correct explanation of A
A is true but R is false
A is false but R is true
To evaluate the assertion and reason, let's break down both concepts:
Assertion A: Binary encoding is usually preferred over One-hot encoding to represent
categorical data (e.g., colors, gender, etc.)
Binary Encoding vs. One-hot Encoding:
 One-hot Encoding: Each category is represented as a binary vector with a single 1 and all
other entries as 0. For a categorical variable with n possible categories, you need n
binary columns. This can lead to high-dimensional data if n is large.
 Binary Encoding: Categories are first assigned unique integer values, which are then
converted to binary form. For a categorical variable with n categories, you use
⌈log₂(n)⌉ binary columns. This is often more memory efficient compared to one-hot
encoding, especially when n is large.


Assertion A is true in many cases because binary encoding can be more compact than one-
hot encoding, especially with many categories.
Reason R: Binary encoding is more memory efficient when compared to One-hot encoding
Binary encoding is indeed more memory efficient than one-hot encoding, especially for
variables with a large number of categories. This is because it reduces the number of
columns required to represent the data.
Reason R is true and accurately describes the advantage of binary encoding over one-hot
encoding.
Explanation of Relationship:
Both A and R are true, and R is the correct explanation of A.
Binary encoding is preferred due to its memory efficiency, which is a correct explanation for
why it might be chosen over one-hot encoding.
Therefore, the correct answer is:
 Both A and R are true and R is the correct explanation of A
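To make the memory argument concrete, here is a small illustrative sketch; the color categories are made up, and the binary encoding is written out by hand rather than using a dedicated encoding library.

```python
# Compare the number of columns produced by one-hot and binary encoding.
import math
import pandas as pd

colors = pd.Series(["red", "green", "blue", "yellow", "purple", "red", "blue"])
n_categories = colors.nunique()

one_hot = pd.get_dummies(colors)                      # one column per category
print("one-hot columns:", one_hot.shape[1])           # 5

# Binary encoding: map each category to an integer, then write it in base 2.
codes = colors.astype("category").cat.codes
n_bits = math.ceil(math.log2(n_categories))           # ceil(log2(5)) = 3
binary = pd.DataFrame({f"bit_{i}": (codes // 2**i) % 2 for i in range(n_bits)})
print("binary-encoded columns:", binary.shape[1])     # 3
```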
5. Select the TRUE statement
Subset selection methods are more likely to improve test error by only focusing on the
most important features and by reducing variance in the fit.
Subset selection methods are more likely to improve train error by only focusing on the
most important features and by reducing variance in the fit.
Subset selection methods are more likely to improve both test and train error by focusing
on the most important features and by reducing variance in the fit.
Subset selection methods don’t help in performance gain in any way.
To determine the true statement regarding subset selection methods, let's review what
subset selection methods are and how they affect training and test errors.
Subset Selection Methods
Subset selection methods are techniques used in feature selection where the goal is to
choose a subset of the most relevant features from a larger set. Common methods include:
1. Forward Selection: Adding features one by one to find the best subset.
2. Backward Elimination: Starting with all features and removing the least significant ones.
3. Stepwise Selection: A combination of forward selection and backward elimination.
Effects on Training and Test Errors
1. Training Error: Subset selection restricts the model to a subset of the features, so it
generally cannot fit the training data better than the full model; any effect on training error
is incidental rather than the goal. Choosing features purely to minimize training error can
also amount to overfitting the selection process.
2. Test Error: The main advantage of subset selection methods is to improve generalization and
reduce variance. By focusing only on the most important features, these methods can help to
avoid overfitting and thus potentially improve test error. The test error might improve if the
reduced feature set generalizes better to unseen data, but this is not guaranteed.
Evaluation of Statements
1. Subset selection methods are more likely to improve test error by only focusing on the
most important features and by reducing variance in the fit.
o True: By selecting a subset of important features, the model can reduce overfitting
and variance, which often leads to improved test error.
2. Subset selection methods are more likely to improve train error by only focusing on the
most important features and by reducing variance in the fit.
o False: Subset selection methods usually improve test error rather than training error.
Training error may decrease, but the primary goal is to improve the model’s
performance on unseen data (test error).
3. Subset selection methods are more likely to improve both test and train error by focusing
on the most important features and by reducing variance in the fit.
o False: While subset selection methods can improve test error, they typically do not
improve training error significantly. The focus is on improving generalization rather
than specifically reducing training error.
4. Subset selection methods don’t help in performance gain in any way.
o False: Subset selection methods can help in performance gain by improving model
generalization and reducing overfitting.
Conclusion
The true statement is:
 Subset selection methods are more likely to improve test error by only focusing on the
most important features and by reducing variance in the fit.
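As an illustration (not part of the original question), the sketch below runs forward selection with scikit-learn's SequentialFeatureSelector on synthetic data in which only a few of many features are informative. Whether the cross-validated (test-like) score actually improves depends on the dataset; this is a sketch under those assumptions, not a guarantee.

```python
# Forward feature selection with scikit-learn, compared against using all features.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=10.0, random_state=0)

full_score = cross_val_score(LinearRegression(), X, y, cv=5).mean()

selector = SequentialFeatureSelector(LinearRegression(),
                                     n_features_to_select=10,
                                     direction="forward", cv=5)
X_subset = selector.fit_transform(X, y)
subset_score = cross_val_score(LinearRegression(), X_subset, y, cv=5).mean()

print("CV R^2, all 50 features :", round(full_score, 3))
print("CV R^2, selected subset :", round(subset_score, 3))
```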
6. Rank the 3 subset selection methods in terms of computational efficiency:
Forward stepwise selection, best subset selection, and forward stagewise regression.
Forward stepwise selection, forward stagewise regression and best subset selection.
Best subset selection, forward stagewise regression and forward stepwise selection.
Best subset selection, forward stepwise selection and forward stagewise regression.
To rank the subset selection methods in terms of computational efficiency, let's analyze
each method:
1. Best Subset Selection
 Description: Best subset selection evaluates all possible subsets of features to find the one
that best fits the model according to some criterion (e.g., minimizing error).
 Computational Complexity: It is the most computationally intensive because it requires
evaluating 2^p subsets, where p is the number of features. This exponential growth
makes it impractical for large numbers of features.
2. Forward Stepwise Selection
 Description: Forward stepwise selection starts with no features and adds them one by one
based on their contribution to improving the model. At each step, it evaluates all
remaining features to decide which one to add.
 Computational Complexity: It is less computationally intensive than best subset selection.
For each feature added, it requires evaluating all remaining features, leading to a
complexity of approximately O(p²) model fits, where p is the number of features.
3. Forward Stagewise Regression
 Description: Forward stagewise regression is similar to forward stepwise selection but
proceeds more gradually. At each stage, it updates the coefficient of a single feature by
only a small increment, rather than refitting the full model.
 Computational Complexity: It is generally more computationally efficient than both best
subset selection and forward stepwise selection because it makes smaller, incremental
updates and does not evaluate as many feature combinations in each step.
Ranking by Computational Efficiency
1. Forward Stagewise Regression: Most efficient, due to its incremental approach and fewer
evaluations.
2. Forward Stepwise Selection: More efficient than best subset selection but less efficient
than forward stagewise regression.
3. Best Subset Selection: Least efficient, as it involves evaluating all possible subsets.
Conclusion
The correct ranking of the subset selection methods in terms of computational efficiency
is:
 Forward stagewise regression, forward stepwise selection, and best subset selection.
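To make the complexity comparison concrete, here is a rough sketch that counts the number of candidate model fits needed by best subset selection and forward stepwise selection for p features. Forward stagewise regression is not included because it performs many small coefficient updates rather than full model refits, so a simple fit count does not apply to it.

```python
# Count full least-squares refits required by two of the strategies for p features.

def n_fits_best_subset(p: int) -> int:
    return 2 ** p - 1                # every non-empty subset of features is fit once

def n_fits_forward_stepwise(p: int) -> int:
    return p * (p + 1) // 2          # at step k, the p - k + 1 remaining features are each tried

for p in (5, 10, 20):
    print(f"p = {p:2d}   best subset: {n_fits_best_subset(p):>9,d}   "
          f"forward stepwise: {n_fits_forward_stepwise(p):>4d}")
```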
7. Choose the TRUE statements from the following: (Multiple correct choice)
Ridge regression since it reduces the coefficients of all variables, makes the final fit a lot
more interpretable.
Lasso regression since it doesn’t deal with a squared power is easier to optimize than
ridge regression.
Ridge regression has a more stable optimization than lasso regression.
Lasso regression is better suited for interpretability than ridge regression.
Let's evaluate each statement regarding Ridge and Lasso regression:
1. Ridge regression since it reduces the coefficients of all variables, makes the final fit a lot
more interpretable.
o False: Ridge regression applies L2 regularization, which reduces the magnitude of
all coefficients but does not force any coefficients to be exactly zero. This means all
variables remain in the model, potentially making the final fit less interpretable
compared to Lasso regression, which can zero out some coefficients.
2. Lasso regression since it doesn’t deal with a squared power is easier to optimize than ridge
regression.
o False: Lasso regression applies L1 regularization, which can lead to sparse solutions
(coefficients exactly zero). The optimization problem for Lasso is not necessarily
easier to solve than Ridge regression; in fact, it can be more complex due to the L1
penalty and the need for algorithms that handle the non-differentiability at zero.
3. Ridge regression has a more stable optimization than Lasso regression.
o True: Ridge regression (L2 regularization) has a more stable optimization process
because it involves differentiable L2 norms, which provides a smooth and
continuous penalty. Lasso regression (L1 regularization) can lead to non-
differentiability at zero, making optimization more challenging.
4. Lasso regression is better suited for interpretability than ridge regression.
o True: Lasso regression tends to produce sparse models by driving some coefficients
exactly to zero. This sparsity can make the model easier to interpret because it
effectively selects a subset of important features, potentially leading to a more
straightforward and interpretable model.
Conclusion
The TRUE statements are:
 Ridge regression has a more stable optimization than Lasso regression.
 Lasso regression is better suited for interpretability than Ridge regression.
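A minimal sketch (synthetic data and an arbitrary regularization strength, both assumptions for illustration) of the interpretability point: with a comparable penalty, Ridge leaves every coefficient non-zero while Lasso tends to drive many of them exactly to zero.

```python
# Contrast Ridge and Lasso coefficient sparsity on synthetic regression data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

print("Ridge non-zero coefficients:", np.sum(ridge.coef_ != 0))   # typically all 20
print("Lasso non-zero coefficients:", np.sum(lasso.coef_ != 0))   # typically far fewer
```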
8. Which of the following statements are TRUE? Let xᵢ be the i-th datapoint in a
dataset of N points. Let v represent the first principal component of the dataset.
(Multiple answer question)
v = argmax ∑_{i=1}^N (vᵀxᵢ)²  s.t. ‖v‖ = 1
v = argmin ∑_{i=1}^N (vᵀxᵢ)²  s.t. ‖v‖ = 1
Scaling at the start of performing PCA is done just for better numerical stability and
computational benefits but plays no role in determining the final principal components of
a dataset.
The resultant vectors obtained when performing PCA on a dataset can vary based on the
scale of the dataset.

Let's analyze each statement regarding Principal Component Analysis (PCA):

1. v = argmax ∑_{i=1}^N (vᵀxᵢ)²  s.t. ‖v‖ = 1

True: This statement describes the objective of PCA (for centered data). The first principal
component v is the vector that maximizes the variance of the projections vᵀxᵢ.
Mathematically, this is equivalent to maximizing the sum of squared projections
∑_{i=1}^N (vᵀxᵢ)² subject to the constraint that v is a unit vector (‖v‖ = 1). This
corresponds to the eigenvector associated with the largest eigenvalue of the covariance matrix.

2. v = argmin ∑_{i=1}^N (vᵀxᵢ)²  s.t. ‖v‖ = 1

False: This statement is incorrect because PCA aims to maximize the variance of the
projections, not minimize it. Minimizing the variance of the projections is not the goal of
PCA and does not correspond to the principal components.
3. Scaling at the start of performing PCA is done just for better numerical stability and
computational benefits but plays no role in determining the final principal components of
a dataset.
False: Scaling (or standardization) is crucial in PCA, especially when the features have
different units or scales. If the features are not scaled, the PCA might be dominated by
features with larger scales, and the principal components obtained will be biased towards
those features. Scaling ensures that each feature contributes equally to the computation of
the principal components.
4. The resultant vectors obtained when performing PCA on a dataset can vary based on the
scale of the dataset.
True: The principal components obtained from PCA can indeed vary depending on the
scale of the dataset. If the features are on different scales and are not standardized, PCA
will give more weight to features with larger scales. Standardizing the features to have zero
mean and unit variance ensures that the principal components are not biased by the scale
of the features.
Conclusion

The TRUE statements are:
 v = argmax ∑_{i=1}^N (vᵀxᵢ)²  s.t. ‖v‖ = 1
 The resultant vectors obtained when performing PCA on a dataset can vary based on the
scale of the dataset.
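A minimal sketch (the two correlated features with deliberately different scales are an assumption made for illustration) showing that the first principal component depends on whether the data is standardized first:

```python
# Compare PCA with and without standardization on features of very different scale.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
height_m = rng.normal(1.7, 0.1, size=300)                   # small numeric scale
weight_g = 60_000 * height_m + rng.normal(0, 5_000, 300)    # correlated, huge scale (grams)
X = np.column_stack([height_m, weight_g])

pca_raw = PCA(n_components=1).fit(X)
pca_scaled = PCA(n_components=1).fit(StandardScaler().fit_transform(X))

print("first PC, raw data   :", pca_raw.components_[0])     # dominated by the large-scale feature
print("first PC, scaled data:", pca_scaled.components_[0])  # both features get similar absolute weight
```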
