
Marks table: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 | Total

CS 215: Data Interpretation and Analysis 2024, End-Semester exam

November 14, 2024, 1:30–4:30 pm
Roll:                Name:

Write all your answers in the space provided. Do not spend time/space giving irrelevant details
or details not asked for. Use the marks as a guideline for the amount of time you should spend on
a question. The exam is closed book. You are only allowed to refer to five pages of hand-written
notes.

1. Annie thinks she has just discovered a very efficient algorithm for solving a challenging graph
theory task. Her advisor asked her to run her algorithm 20 times and a baseline algorithm
30 times, and record the time taken. Let A_1, ..., A_20 denote the recorded times of Annie's
algorithm and B_1, ..., B_30 the recorded times of the baseline. You need to suggest a test
statistic to establish whether Annie's algorithm is faster than the baseline, under the assumption
that both running times follow a Gaussian distribution with possibly different means but
the same variance. [Running time cannot be negative! But we will ignore this aberration
for now.]

(a) Write down the null and alternative hypotheses for this test. ..1
H_0: µ_A = µ_B, H_1: µ_A < µ_B. The null hypothesis can also be written as H_0: µ_A ≥ µ_B.

(b) Write down the test statistic in terms of the A_i's and B_j's, and the condition under which
you will accept Annie's claim that her algorithm is faster than the baseline. Assume a
10% significance level. ..3
M_A = Σ_i A_i/20, M_B = Σ_j B_j/30, S = (Σ_i (A_i − M_A)² + Σ_j (B_j − M_B)²)/48, and
T = (M_A − M_B)/√(S(1/20 + 1/30)).
Under H_0, T ∼ t_48, the t-distribution with 48 degrees of freedom. We accept Annie's
claim if the observed T is less than −t_{0.1,48}, where t_{0.1,48} is the upper 10% point of the
t-distribution with 48 df.
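A minimal sketch of this pooled two-sample t-test in Python; the timing data below are simulated placeholders, not exam data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
A = rng.normal(loc=9.5, scale=2.0, size=20)   # hypothetical timings for Annie's algorithm
B = rng.normal(loc=11.0, scale=2.0, size=30)  # hypothetical timings for the baseline

MA, MB = A.mean(), B.mean()
# Pooled variance with 20 + 30 - 2 = 48 degrees of freedom
S = (((A - MA) ** 2).sum() + ((B - MB) ** 2).sum()) / 48
T = (MA - MB) / np.sqrt(S * (1 / 20 + 1 / 30))

# Reject H0 (accept that Annie's algorithm is faster) if T < -t_{0.1,48}
t_crit = stats.t.ppf(0.90, df=48)   # upper 10% point of t_48
print(f"T = {T:.3f}, critical value = {-t_crit:.3f}, reject H0: {T < -t_crit}")
```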
(c) However, Annie suspects that a fraction p of her timings may be abnormally high because
the server was facing an unusually high load. When 0 < p < 0.2, suggest a different test
statistic that is less affected by a few abnormal measurements. Make sure that the test
statistic you suggest is robust to abnormal values while also being as precise as possible
on normal data. ..4
We make use of the information that at most a fraction p of the observations may be unusually
high, that is, the outliers are one-sided. Instead of the sample mean, use a winsorized or
trimmed mean in which only the top p fraction of the largest running times are dropped (or
clipped) from both the A and the B observations.

Note: If anyone mentioned a trimmed or winsorized mean that drops both the largest
and the smallest values, the credit should be reduced, since that estimate is not as efficient.
The median would be even less efficient than these.
Instead of computing the shared variance on the entire data, we compute it on the
subset obtained after removing the top p fraction of the largest values in each set.
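A minimal numpy sketch of the one-sided (upper) trimmed mean; the function name, data, and p below are illustrative assumptions:

```python
import numpy as np

def upper_trimmed_mean(x, p):
    """Mean after dropping only the largest fraction p of the values
    (one-sided trimming, since outliers are known to be abnormally high)."""
    x = np.sort(np.asarray(x, dtype=float))
    keep = len(x) - int(np.floor(p * len(x)))
    return x[:keep].mean()

times = np.array([9.8, 10.1, 9.9, 10.3, 55.0])  # last value: server under heavy load
print(upper_trimmed_mean(times, p=0.2))          # ~10.0, unaffected by the outlier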

2. A very large number of students took a challenging test. You do not know the distribution
of marks in the test. Your friend Aditya, who obtained 70 absolute marks in the test, claims
that his marks are in the top 5% among all test takers. In order to test his claim, you
ask around and find the absolute marks of n other test takers. Denote these as X_1, ..., X_n.
Design a hypothesis test to check Aditya's claim. Clearly state the null and alternative
hypotheses, the test statistic, the distribution of the test statistic, and how you will accept
or reject Aditya's claim. ..4
H_0: F^{-1}(0.95) ≥ 70, H_1: F^{-1}(0.95) < 70.
T = Σ_i I(X_i ≥ 70), i.e., the number of observations greater than or equal to 70.
Under the null hypothesis, T ∼ Bin(n, 0.05).
Accept Aditya's claim (H_1) if the observed value t is such that P_{H_0}(T ≤ t) < α, where α is a
small significance level, say 0.1.
Rubrics: 1 mark for the expressions of H_0 and H_1, 1 mark for the test statistic, 2 marks for
the final answer. If the null hypothesis is wrong, then a maximum of 1 mark will be awarded
for this question.
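A hedged sketch of this test in Python; the marks below are made up, and scipy's binomial distribution is used only for the tail probability:

```python
import numpy as np
from scipy import stats

marks = np.array([42, 55, 61, 71, 38, 50, 66, 45, 58, 63])  # hypothetical X_1..X_n
n = len(marks)
t = int((marks >= 70).sum())           # number of observed marks at or above 70

# Under H0 the count of marks >= 70 is Bin(n, 0.05); small counts favour Aditya's claim.
p_value = stats.binom.cdf(t, n, 0.05)  # P_H0(T <= t)
alpha = 0.1
print(f"t = {t}, p-value = {p_value:.3f}, accept Aditya's claim: {p_value < alpha}")
```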

3. What is the breakdown value of the test statistic used in the signed-rank hypothesis test of
a distribution being symmetric about a median value m_0, given n data samples X_1, ..., X_n?
Recall that the breakdown value of a statistic is defined as the largest number m such that if
we replace m − 1 values in the data by arbitrary values, the statistic still remains within a
bounded set. ..2
The test statistic used in the signed-rank test depends only on ranks and signs, so it is always
bounded irrespective of the values. Hence the breakdown value is n.
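A small numerical illustration (arbitrary data; scipy's Wilcoxon signed-rank statistic is used) showing the statistic stays bounded even when several values are replaced by arbitrarily large numbers:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=20)
m0 = 0.0

w_clean = stats.wilcoxon(x - m0).statistic
x_corrupted = x.copy()
x_corrupted[:5] = 1e9          # replace some values by arbitrarily large numbers
w_corrupted = stats.wilcoxon(x_corrupted - m0).statistic

# Both statistics stay within [0, n(n+1)/2]; ranks bound the influence of any single value.
print(w_clean, w_corrupted, 20 * 21 / 2)
```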

4. Consider a two-dimensional data distribution with mean µ = [0, 0]' and covariance
Σ = [[1, ρ], [ρ, 1]], where ρ > 0.

(a) Write the Mahalanobis distance from µ of a point p = [p_1, p_2]' when ρ = 0.5. ..1
Simple application of the formula for the squared Mahalanobis distance p'Σ^{-1}p yields
(p_1² + p_2² − 2(0.5)p_1 p_2)/(1 − 0.5²) = (p_1² + p_2² − p_1 p_2)/0.75; the distance is its square root.


(b) Among the set of points S: x_1² + x_2² = 1, identify the subset whose Mahalanobis distance
from µ is the largest, and the subset where it is smallest. Interpret your answer. ..3
The squared Mahalanobis distance from the mean can be written as (x_1² + x_2² − 2ρx_1x_2)/(1 − ρ²).
If we restrict to points in S, the minimum is attained at x_1 = x_2 = 1/√2 or x_1 = x_2 = −1/√2,
and the maximum at x_1 = 1/√2, x_2 = −1/√2 or vice versa. Among all points
in S, the point [1/√2, 1/√2] is the most expected given the positive correlation, and the
point [1/√2, −1/√2] expresses negative correlation and is the least expected.
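A quick numerical check of this claim with numpy (ρ = 0.5 is just an example value):

```python
import numpy as np

rho = 0.5
Sigma = np.array([[1.0, rho], [rho, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

theta = np.linspace(0, 2 * np.pi, 3601)
pts = np.stack([np.cos(theta), np.sin(theta)], axis=1)   # points on x1^2 + x2^2 = 1
d2 = np.einsum("ij,jk,ik->i", pts, Sigma_inv, pts)       # squared Mahalanobis distances

print("min at", pts[d2.argmin()].round(3), "max at", pts[d2.argmax()].round(3))
# min near (1/sqrt(2), 1/sqrt(2)) (or its negation), max near (-1/sqrt(2), 1/sqrt(2))
```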

5. Consider a p-dimensional Normal distribution X ∼ N(µ, Σ) where µ is of size p × 1 and Σ is
of size p × p. Assume p = 2.

(a) Let Z_1 = X_1 + X_2, Z_2 = X_1 − X_2, Z_3 = X_1. What is the joint distribution of Z =
[Z_1, Z_2, Z_3]? Provide both the form of the distribution and the parameters in terms of
the parameters of X. ..4
Gaussian distribution.
Write Z_i = a_i'X where a_1 = [1, 1]', a_2 = [1, −1]', a_3 = [1, 0]'. With these,
Cov(Z_i, Z_j) = a_i'Σa_j, and the mean of each Z_i is a_i'µ.
Rubrics: 2 marks for the right mean, 2 marks for the right covariance matrix. Common
mistake: adding variances, which is wrong since the X's are not independent.
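Equivalently, Z = AX with A stacking the a_i' as rows, so Z ∼ N(Aµ, AΣA'). A short numpy check with an assumed µ and Σ (both are placeholders, not from the exam):

```python
import numpy as np

mu = np.array([1.0, 2.0])                      # assumed mean of X
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])     # assumed covariance of X

A = np.array([[1.0, 1.0],    # Z1 = X1 + X2
              [1.0, -1.0],   # Z2 = X1 - X2
              [1.0, 0.0]])   # Z3 = X1

mean_Z = A @ mu          # mean of Z
cov_Z = A @ Sigma @ A.T  # covariance of Z; singular (rank 2) since Z3 = (Z1 + Z2)/2
print(mean_Z)
print(cov_Z)
```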
(b) Write the conditional distribution P(X_1 | X_2 = 0). ..1
Simple application of the formula for the conditional Gaussian gives
P(X_1 | X_2 = 0) ∼ N(µ_1 + (Σ_12/Σ_22)(0 − µ_2), Σ_11 − Σ_12²/Σ_22).
Rubrics: 1 mark for substitution in the conditional distribution formula. Here Σ_11 denotes
the variance of X_1, Σ_22 the variance of X_2, and Σ_12 their covariance. In general,
µ_{1|2} = µ_1 + Σ_12 Σ_22^{-1}(x_2 − µ_2)
Σ_{1|2} = Σ_11 − Σ_12 Σ_22^{-1} Σ_21
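A small numeric sketch of the substitution, with an assumed µ and Σ (values are illustrative only):

```python
import numpy as np

mu = np.array([1.0, 2.0])                      # assumed [mu1, mu2]
Sigma = np.array([[2.0, 0.6], [0.6, 1.5]])     # assumed covariance of (X1, X2)
x2 = 0.0

cond_mean = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x2 - mu[1])
cond_var = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]
print(cond_mean, cond_var)   # parameters of P(X1 | X2 = 0)
```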

(c) Show with an example that the conditional distribution P(X_2 | X_1 ≥ 0) is not guaranteed
to be Gaussian. ..3
Consider a Gaussian distribution with µ_1 = µ_2 and Σ = [[1, 1], [1, 1]]. In this case X_1 = X_2
almost surely. We know that P(X_2 | X_2 ≥ 0) is a half-Gaussian distribution, not a Gaussian
distribution. Rubrics: 3 marks for a valid example and justification that it is not Gaussian;
1 mark only if an example is not given.
6. Given two uncorrelated zero-mean random variables X_1 ∼ N(0, 100) and X_2 ∼ N(0, 1),
what are the principal components of (X_1, X_2)? ..3
We need to find the covariance matrix of the random vector X = [X_1, X_2]^T. Since X_1 and
X_2 are uncorrelated, Cov(X_1, X_2) = 0. Hence the covariance matrix is
C = [[Var(X_1), Cov(X_1, X_2)], [Cov(X_1, X_2), Var(X_2)]] = [[100, 0], [0, 1]].
The eigenvalues of the matrix are obtained by setting det([[100 − λ, 0], [0, 1 − λ]]) = 0,
giving λ_1 = 100 and λ_2 = 1. The corresponding eigenvectors are [1, 0]^T and [0, 1]^T.
Clearly one of the eigenvalues is much larger than the other, so the principal component of
the data is the projection of the data on the first eigenvector, which is X_1.
Rubrics: 1 mark for the covariance matrix and 2 marks for the final solution. If the covariance
matrix is not mentioned or is incorrect, then 0 marks have been awarded.
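A brief numpy confirmation of the eigen-decomposition (only the covariance matrix from the solution is used):

```python
import numpy as np

C = np.array([[100.0, 0.0],
              [0.0, 1.0]])
eigvals, eigvecs = np.linalg.eigh(C)   # eigh returns eigenvalues in ascending order
print(eigvals)    # [  1. 100.]
print(eigvecs)    # columns are the eigenvectors [0, 1]^T and [1, 0]^T
# The principal component is the direction of the largest eigenvalue: [1, 0]^T, i.e. X1.
```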
7. Suppose you are trying to design a projection method for a creature that can only visualize
in one dimension. For that we can just use the SNE or T-SNE algorithm, but learning
just a single coordinate Y_i for each point X_i in the high-dimensional space.

(a) If the original dataset is 2-dimensional and consists of points uniformly distributed on
the surface of a unit circle, what will be the likely SNE projection? ..2
The points are likely to be arranged in a line where the two halves of the circle collapse
onto each other. For example, if you consider a unit circle, then (1, 0) and (−1, 0) will
collapse to the origin, while (0, 1) maps to π/2 and (0, −1) maps to −π/2. This way all
near neighbors in 2-D stay close together in 1-D. The major distortion is that points
on the diametrically opposite semi-circle will be pulled in close.
(b) What is the main reason for favoring the T-SNE projection over the SNE projection?
Illustrate with an example if needed. ..3
SNE leads to over-crowding of points because, for far-away points, the penalty of being
far away is a drastic drop in probability under the Gaussian distribution. Heavy-tailed
distributions like the t-distribution used in T-SNE are better able to spread out
the far-away points.
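A tiny numeric illustration of the heavy-tail argument (the kernels are written out directly; the distances are arbitrary):

```python
import numpy as np

d = np.array([0.5, 1.0, 3.0, 5.0])           # pairwise distances in the embedding
gaussian = np.exp(-d ** 2)                    # unnormalized SNE similarity (sigma = 1)
student_t = 1.0 / (1.0 + d ** 2)              # unnormalized T-SNE similarity (1 d.o.f.)
for di, g, t in zip(d, gaussian, student_t):
    print(f"d={di}: gaussian={g:.2e}, t={t:.2e}")
# At d=5 the Gaussian similarity is ~1e-11 while the t kernel still gives ~0.04,
# so far-apart points feel much less pull under the t kernel.
```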
8. Express an MA(1) model as an AR(q) model for an adequate value of q. ..2

x_1 = η + w_1
x_2 = η + θw_1 + w_2 = η + θ(x_1 − η) + w_2
⋮
x_t = η + θw_{t−1} + w_t
    = η(1 − θ + θ² − ⋯ + (−θ)^{t−1}) + θx_{t−1} − θ²x_{t−2} + ⋯ − (−θ)^{t−1}x_1 + w_t

So every past value of x appears: no finite q suffices in general, and the MA(1) model
corresponds to an AR(∞) model (truncated at the start of the series).
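A sketch verifying this inverted (AR) representation on a simulated MA(1) series, using the same convention as above that x_1 = η + w_1 (η, θ, and the horizon are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
eta, theta, T = 5.0, 0.6, 200
w = rng.normal(size=T + 1)
w[0] = 0.0                               # convention x_1 = eta + w_1, as in the derivation
x = eta + w[1:] + theta * w[:-1]         # MA(1): x_t = eta + w_t + theta * w_{t-1}

# Reconstruct the last observation from its own past via the inverted (AR) form
t = T                                                        # 1-based time index
const = eta * sum((-theta) ** j for j in range(t))           # eta * (1 - theta + theta^2 - ...)
ar_part = sum((-1) ** (k + 1) * theta ** k * x[t - 1 - k] for k in range(1, t))
print(x[t - 1], const + ar_part + w[t])                      # the two values agree
```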

9. Suppose the amount a person A saves in a year is 90% of the amount he saved the previous
year plus a random amount caused by fluctuations in the market conditions. Let S_t denote
the market condition at year t, and let it follow a Gaussian distribution with mean 10 and
variance 16 Lakhs. Assume the market condition is random and uncorrelated from one year
to the next. Assume the person started with zero savings in year 2000, and let the amount saved
in year 2000+t be x_t.
(a) Express x_t with an appropriate time-series model. Is his annual rate of savings showing
an increasing or decreasing trend along time? ..3
This is an AR(1) series of the form x_t = 10 + 0.9x_{t−1} + W_t where W_t ∼ N(0, 16). Since
the coefficient of x_{t−1} has magnitude less than 1, this is a stationary series (no long-run trend).
(b) What is the expected savings at the end of 2030? ..2
We need to sum up the expected values from 2000 to 2030. Since the series is (approximately)
stationary, this becomes 30·E(x_t) = 30 × 10/(1 − 0.9) = 3000 Lakhs.
(c) Now consider a different person B whose amount saved per year is S_t, the market
condition this year, plus 90% of the market condition in the previous year, S_{t−1}. Express x_t
with an appropriate time-series model. Is his annual rate of savings showing any increasing or
decreasing trend along time? ..2
This is an MA(1) series of the form x_t = 10 + 9 + 0.9w_{t−1} + w_t = 19 + 0.9w_{t−1} + w_t,
where w_t ∼ N(0, 16). This is a stationary model.
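A simulation sketch of the two savings models; the seed and horizon are arbitrary, and the shocks are drawn with standard deviation 4 so that their variance is 16:

```python
import numpy as np

rng = np.random.default_rng(3)
T = 31                                           # years 2000, 2001, ..., 2030
w = rng.normal(loc=0.0, scale=4.0, size=T)       # market shocks, Var = 16

# Person A: AR(1), x_t = 10 + 0.9 x_{t-1} + w_t, starting from zero savings in 2000
xA = np.zeros(T)
for t in range(1, T):
    xA[t] = 10 + 0.9 * xA[t - 1] + w[t]

# Person B: MA(1), x_t = 19 + w_t + 0.9 w_{t-1}
xB = 19 + w[1:] + 0.9 * w[:-1]

print(xA[-1], xB[-1], xA.sum())   # 2030 saving for A and B, and A's cumulative savings
```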
10. Suppose you have n observations as a dataset D = {(x_i, y_i): i = 1...n}, where the response
variable y_i = βx_i + α + ϵ_i with ϵ_i ∼ N(0, σ²). You partition the dataset D into k random
disjoint partitions of size n/k each, estimate parameters (B_1, A_1), ..., (B_k, A_k) from each of
the k partitions, and return the estimate B̄ = Σ_j B_j/k, Ā = Σ_j A_j/k. [Recall that in
class we had analyzed that for a standard linear regression model on n samples, the variance
of the estimated parameters is given as Var(B) = σ²/(Σ_{i=1}^n x_i² − n x̄²) and
Var(A) = (σ² Σ_{i=1}^n x_i²/n)/(Σ_{i=1}^n x_i² − n x̄²).]

(a) Write the expression for the distribution of B̄, Ā. ..4
Each A_j and B_j is Gaussian with mean α and β respectively, and with the variance given
above computed on the n/k points of its partition. The average of k independent Gaussians
is Gaussian, with the averaged mean and with variance equal to the average of the variances
divided by k. Thus
Ā ∼ N(α, Σ_{p=1}^k [σ² (k/n) Σ_{i∈D_p} x_i²] / [k² (Σ_{i∈D_p} x_i² − (n/k) x̄_p²)]),
where D_p denotes the n/k instances in the p-th data partition and x̄_p their mean. Similarly,
B̄ ∼ N(β, Σ_{p=1}^k σ² / [k² (Σ_{i∈D_p} x_i² − (n/k) x̄_p²)]).
Rubric: 1 mark for mentioning the Gaussian distribution for both, 1 mark for mentioning the
mean for both, 2 marks for correct variances. 1 mark deducted if the variance is expressed in
terms of A_j or B_j without explicit computation. 0.5 marks deducted for basic errors
WITHIN the variance term for A_j or B_j.
(b) Contrast the above with the estimators A, B of normal linear regression over the entire
data. Which estimator has a smaller risk? Justify. ..3
The bias is the same (zero) for both estimators, so we only need to compare variances. Let us
consider the B parameter. We need to contrast these two variance terms:
Σ_{p=1}^k σ²/(k²(Σ_{i∈D_p} x_i² − (n/k) x̄_p²))  versus  σ²/(Σ_{i=1}^n x_i² − n x̄²).
Let SS denote Σ_{i=1}^n x_i² − n x̄² and let SS_p denote Σ_{i∈D_p} x_i² − (n/k) x̄_p². Each SS_p ≥ 0
and Σ_p SS_p ≤ SS, since the within-partition scatter cannot exceed the total scatter. By the
AM-HM inequality, Σ_p 1/SS_p ≥ k²/Σ_p SS_p ≥ k²/SS, so the variance of B̄ is at least σ²/SS,
the variance of B. Thus the variance, and hence the risk, of B̄ is greater; the full-data estimator
is better.
Rubric: 1 mark for the correct conclusion, 1 mark for writing the risk term (identifying 0 bias),
1 mark for explicit comparison of the variance terms. Other correct justifications given 2
or 3 marks depending on rigor of proof.
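A Monte Carlo sketch comparing the two slope estimators; all constants below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, sigma, alpha, beta = 120, 4, 1.0, 2.0, 0.5
reps = 2000

def slope(xv, yv):
    # least-squares slope B = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
    return ((xv - xv.mean()) * (yv - yv.mean())).sum() / ((xv - xv.mean()) ** 2).sum()

full_B, avg_B = [], []
for _ in range(reps):
    x = rng.uniform(0, 10, size=n)
    y = alpha + beta * x + rng.normal(scale=sigma, size=n)
    full_B.append(slope(x, y))                       # slope from the entire data
    parts = rng.permutation(n).reshape(k, n // k)    # k random disjoint partitions
    avg_B.append(np.mean([slope(x[p], y[p]) for p in parts]))

print(np.var(full_B), np.var(avg_B))   # the averaged estimator shows the larger variance
```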

11. Suppose you want to estimate the effect of fan speed on the temperature of a GPU. You
have software controls to change the fan speed to any value between 1000 RPM and 3000
RPM. You have time to try 30 different fan speed settings and record the temperature at
each setting. Thereafter, you will estimate a linear regression model for temperature
y as a linear function of rpm x. At what 30 fan speeds will you perform your experiment,
to reduce the variance of your estimated linear regression parameters? Provide justification.
..3 Uniformly distributed over the range between 1000 and 3000 RPM, to maximize the
variance of the x values, which appears in the denominator of the parameter variances.
Rubrics: 3 marks for the correct distribution with justification, 1 mark for only mentioning
either the distribution or giving correct justification, 0 otherwise.

12. In standard linear regression we assumed that the response variable Y follows a Gaussian
distribution with mean E[Y |x] = µx = α + βx. Instead, assume Y is a count variable that
follows a Poisson distribution and we model its dependence on x as log E[Y |x] = log λx =
α + βx. You are given a dataset D = {(xi , yi ) : i = 1 . . . n}.

(a) Write the expression for the maximum likelihood estimation of the parameters. ..4
Rubrics: 4 marks for the correct expression and 2 marks for a partially correct expression.
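For reference, a standard form of the answer under the stated model (not quoted from the key): with Y_i ∼ Poisson(λ_{x_i}) and log λ_{x_i} = α + βx_i, the log-likelihood is
ℓ(α, β) = Σ_{i=1}^n [y_i(α + βx_i) − e^{α+βx_i} − log(y_i!)],
and the MLE is (α̂, β̂) = argmax_{α,β} ℓ(α, β).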

(b) Compute gradients with respect to any one parameter. ..2 Easy.
Rubrics: 2 marks for the correct gradient calculation for either α or β, but no marks if
the MLE expression in the previous part is incorrect.
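For instance, the gradient of the log-likelihood above with respect to β is ∂ℓ/∂β = Σ_i (y_i − e^{α+βx_i}) x_i (and ∂ℓ/∂α = Σ_i (y_i − e^{α+βx_i})). A small numeric check of this gradient against a finite difference, with made-up counts:

```python
import numpy as np

x = np.array([0.1, 0.4, 0.3, -0.3])      # the x values from part (c), reused as a toy example
y = np.array([1, 3, 2, 0])                # hypothetical counts
alpha, beta = 0.2, 0.5                    # arbitrary parameter values

def loglik(a, b):
    lam = np.exp(a + b * x)
    return np.sum(y * (a + b * x) - lam)  # constant -log(y_i!) terms dropped

grad_beta = np.sum((y - np.exp(alpha + beta * x)) * x)     # analytic gradient wrt beta
eps = 1e-6
numeric = (loglik(alpha, beta + eps) - loglik(alpha, beta - eps)) / (2 * eps)
print(grad_beta, numeric)                 # the two agree
```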

(c) We have four datapoints x_1 = 0.1, x_2 = 0.4, x_3 = 0.3, x_4 = −0.3.

i. Consider the following kernel: K(t) = (1/2) I(−1 ≤ t ≤ 1). The bandwidth is h = 0.3.
What is the value of the kernel density estimator at x = 0.2? ..1
Three of the four points lie within h = 0.3 of x = 0.2, so the estimate is
(1/(4 × 0.3)) × 3 × (1/2) = 1.25.
Rubrics: 1/2 mark for answer and 1/2 mark for steps.
ii. What is the empirical cumulative distribution function value at x = 0.35? ..1
3/4, since three of the four points (0.1, 0.3, −0.3) are ≤ 0.35.
Rubrics: 1/2 mark for answer and 1/2 mark for steps.
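A short sketch checking both values for the given four points, bandwidth, and box kernel:

```python
import numpy as np

data = np.array([0.1, 0.4, 0.3, -0.3])
h, n = 0.3, len(data)

def box_kernel(t):
    return 0.5 * ((t >= -1) & (t <= 1))

kde_at_02 = box_kernel((0.2 - data) / h).sum() / (n * h)   # 3 points within +-0.3 of 0.2
ecdf_at_035 = (data <= 0.35).mean()                         # 3 of the 4 points are <= 0.35
print(kde_at_02, ecdf_at_035)                               # 1.25 and 0.75
```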
(d) Analyze the bias and variance of a kernel density estimator with the Beta density
f(x) = x^{m−1}(1 − x)^{n−1}/B(m, n) for 0 ≤ x ≤ 1, when the kernel is a uniform kernel
K(t) = (1/2) I(−1 ≤ t ≤ 1) of width h, and the estimator is made on n samples X_1, ..., X_n. ..4
Just apply the formulas discussed in class.
The bias is approximated as (1/2) σ_K² h² f''(x). For the uniform kernel σ_K² = ∫ t² K(t) dt = 1/3,
and for the given density function f''(x) can be calculated easily by double differentiation.
The approximate variance formula is f(x) ∫ K²(t) dt / (nh). For the given kernel,
∫ K²(t) dt = 1/2, and f(x) is as given.
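If one wants to carry out the double differentiation, a quick symbolic sketch (sympy's beta is the Beta function; the symbols m and n are the Beta parameters, with n reused as in the question):

```python
import sympy as sp

x, m, nn = sp.symbols("x m n", positive=True)
f = x**(m - 1) * (1 - x)**(nn - 1) / sp.beta(m, nn)   # Beta(m, n) density
f2 = sp.simplify(sp.diff(f, x, 2))                    # f''(x), needed in the bias (1/2)(1/3)h^2 f''(x)
print(f2)
```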

Total: 65
