Compre FoDS
1. Suppose there are 100 features for a classification problem and you are given one million examples.
Suppose you are given the best possible five principal components v1, v2, v3, v4, v5 (i.e., eigenvectors
of the covariance matrix) after performing PCA, and let the corresponding eigenvalues be m1, m2, m3, m4,
m5. Suppose you are also given a test example x.
a. What would be the representation (or components) of this test example in the five-dimensional
principal-component space?
b. Specify the formula for the percentage of the original data's variance that is captured if all
points are transformed to their five-dimensional components.
c. How many principal components should be considered to achieve 100 percent accuracy (i.e., to capture all of the original variance)?
d. Let us consider another problem. Suppose there is a problem with two features and the four
training data points are (1,0), (2,3), (4,1) and (5,4). The eigenvector (with maximum eigenvalue)
of the corresponding covariance matrix is (2^{-1/2}, 2^{-1/2}). Write down the four data points after they
are transformed to the first principal component. Draw a two-dimensional graph to pictorially
represent your findings. What percentage of the original variance is captured after this
transformation? [2 + 2 + 2 + 6 = 12 Marks]
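A minimal numpy sketch (not part of the question) that checks part (d) numerically; the data and the expected outputs in the comments follow directly from the four given points:

```python
import numpy as np

# The four training points from question 1(d).
X = np.array([[1.0, 0.0], [2.0, 3.0], [4.0, 1.0], [5.0, 4.0]])

# Center the data and form the sample covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)           # [[10/3, 2], [2, 10/3]]

# Eigendecomposition; eigh returns eigenvalues in ascending order,
# so the last column is the first principal component (up to sign).
eigvals, eigvecs = np.linalg.eigh(cov)
v1 = eigvecs[:, -1]                      # ~(1/sqrt(2), 1/sqrt(2)), up to sign

# One-dimensional representation of each point on the first PC.
scores = Xc @ v1
print("first PC:", v1)
print("projected points:", scores)       # ~[-2.83, 0.0, 0.0, 2.83], up to sign

# Fraction of total variance captured by the first component:
# (16/3) / (16/3 + 4/3) = 0.8, i.e., 80 percent.
print("variance captured:", eigvals[-1] / eigvals.sum())
```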
2. a. Let x be a D-variate feature vector and t a single-variate target attribute for a regression problem.
Let p(x, t) be the joint probability distribution from which the data are generated, and suppose you are
given p(x, t). Let L(t, y(x)) be the loss function, taken for this problem to be the squared loss
(y(x) − t)^2. If you are given the freedom to make y(x) as flexible as you wish, find the y(x) that
minimizes the expected loss.
b. With the y(x) found by the above methodology, would the expected loss be zero or non-zero? If
it is zero, prove it; otherwise, derive the remaining expected-loss term. [6 + 6 = 12 Marks]
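A LaTeX sketch of the standard squared-loss argument this question is after; the conditional-mean result and the residual-noise term below are textbook facts, not additions to the question:

```latex
% Requires \usepackage{amsmath,amssymb}.
\begin{align*}
\mathbb{E}[L] &= \iint \{y(\mathbf{x}) - t\}^{2}\, p(\mathbf{x}, t)\, d\mathbf{x}\, dt,\\
\frac{\delta\, \mathbb{E}[L]}{\delta\, y(\mathbf{x})}
  &= 2 \int \{y(\mathbf{x}) - t\}\, p(\mathbf{x}, t)\, dt = 0
  \;\Longrightarrow\;
  y^{*}(\mathbf{x}) = \frac{\int t\, p(\mathbf{x}, t)\, dt}{p(\mathbf{x})}
                    = \mathbb{E}[t \mid \mathbf{x}].
\end{align*}
% Substituting y* back in, the loss that remains is the intrinsic noise,
% which is non-zero in general:
\begin{equation*}
\mathbb{E}[L]_{\min} = \int \operatorname{Var}[t \mid \mathbf{x}]\, p(\mathbf{x})\, d\mathbf{x} \;\ge\; 0.
\end{equation*}
```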
3. a. The objective function in ridge regression can also be derived using a probabilistic approach by
assuming an appropriate prior distribution. Suppose there is a regression problem with ‘D’ features and a
single-variate target attribute, and you are given ‘N’ training data points. By assuming an appropriate
prior distribution, derive the objective function that is typically used in ridge regression.
Note: You may make necessary and appropriate assumptions to solve this problem.
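A sketch of the intended MAP derivation, assuming (as the note allows) a Gaussian noise model and a zero-mean isotropic Gaussian prior on the weights:

```latex
% Requires \usepackage{amsmath,amssymb}. Assumed model:
%   t_n = w^T x_n + eps,  eps ~ N(0, sigma^2),  prior w ~ N(0, alpha^{-1} I).
\begin{align*}
-\ln p(\mathbf{w} \mid \mathcal{D})
  &= -\ln p(\mathcal{D} \mid \mathbf{w}) - \ln p(\mathbf{w}) + \text{const}\\
  &= \frac{1}{2\sigma^{2}} \sum_{n=1}^{N} \bigl(t_n - \mathbf{w}^{\top}\mathbf{x}_n\bigr)^{2}
     + \frac{\alpha}{2}\, \lVert \mathbf{w} \rVert^{2} + \text{const}.
\end{align*}
% Minimizing this is exactly ridge regression with lambda = sigma^2 * alpha.
```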
b. In Bayesian curve fitting, derive the probability density function of the target attribute given the feature
vector of the test example and all training examples. You need not work out the parameters of this
distribution, but the derivation should be complete. [5 + 5 = 10 Marks]
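For part (b), the quantity to derive is the posterior predictive density; a sketch of the standard marginalization step:

```latex
% Requires \usepackage{amsmath,amssymb}.
\begin{equation*}
p(t \mid \mathbf{x}, \mathbf{X}, \mathbf{t})
  = \int p(t \mid \mathbf{x}, \mathbf{w})\,
         p(\mathbf{w} \mid \mathbf{X}, \mathbf{t})\, d\mathbf{w}.
\end{equation*}
% With a Gaussian likelihood and a Gaussian prior, both factors are Gaussian and
% the integral is itself Gaussian, N(t | m(x), s^2(x)); per the question, the
% explicit forms of m(x) and s^2(x) need not be worked out.
```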
4.a. Identify at least two advantages and two disadvantages of using color to visually represent
information.
4.b. Would simple random sampling (without replacement) be a good approach to sampling? Why or why
not?
4.c. Describe how a box plot can give information about whether the values of an attribute are
symmetrically distributed (a numerical sketch follows this question).
4.d. Explain the information conveyed by the star-coordinates plots of the Iris dataset shown in the
following figures. [2 + 2 + 3 + 3 = 10 Marks]
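For part 4(c) above, a minimal Python sketch with synthetic data (illustrative only): a box plot draws the quartiles, so a median that sits midway between Q1 and Q3, with whiskers of similar length, indicates a roughly symmetric attribute:

```python
import numpy as np

# Synthetic samples: one symmetric, one right-skewed.
rng = np.random.default_rng(0)
samples = {
    "symmetric": rng.normal(loc=0.0, scale=1.0, size=1000),
    "right-skewed": rng.exponential(scale=1.0, size=1000),
}

for name, data in samples.items():
    q1, med, q3 = np.percentile(data, [25, 50, 75])
    # For a symmetric attribute, (med - q1) and (q3 - med) are about equal;
    # skew shows up as one half of the box being longer than the other.
    print(f"{name}: Q1={q1:.2f} median={med:.2f} Q3={q3:.2f} "
          f"lower box half={med - q1:.2f} upper box half={q3 - med:.2f}")
```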
5. a. In real-world data, tuples with missing values for some attributes are a common occurrence. Describe
two methods for handling this problem.
5.b. Suppose a group of 12 sales price records has been sorted as follows: 5, 10, 11, 13, 15, 35, 50, 55, 72,
92, 204, 215. Partition them into three bins using each of the following methods: (i) equal-frequency
partitioning; (ii) equal-width partitioning.
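A minimal Python sketch verifying both partitionings in 5(b); with min = 5 and max = 215 the equal-width bin width is (215 − 5)/3 = 70:

```python
import numpy as np

prices = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
k = 3

# (i) Equal-frequency partitioning: 12 sorted values, so 4 per bin.
size = len(prices) // k
eq_freq = [prices[i * size:(i + 1) * size] for i in range(k)]
print("equal-frequency:", eq_freq)
# -> [[5, 10, 11, 13], [15, 35, 50, 55], [72, 92, 204, 215]]

# (ii) Equal-width partitioning: interior bin edges at 75 and 145.
width = (max(prices) - min(prices)) / k
edges = [min(prices) + width * i for i in range(1, k)]
eq_width = [[] for _ in range(k)]
for p in prices:
    eq_width[int(np.searchsorted(edges, p, side="right"))].append(p)
print("equal-width:", eq_width)
# -> [[5, 10, 11, 13, 15, 35, 50, 55, 72], [92], [204, 215]]
```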
5.c. Robust data loading poses a challenge in database systems because the input data are often dirty. In
many cases, an input record may have several missing values and some records could be contaminated
(i.e., with some data values out of range or of a different data type than expected). Work out an automated
data cleaning and loading algorithm so that the erroneous data will be marked and contaminated data will
not be mistakenly inserted into the database during data loading.
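One possible shape for the algorithm asked for in 5(c), sketched in Python; the schema, value ranges, and CSV input below are illustrative assumptions, not part of the question:

```python
import csv

SCHEMA = {"id": int, "price": float}      # expected column -> expected type
RANGES = {"price": (0.0, 10_000.0)}       # expected value range per column

def clean_load(path):
    """Validate each record; return rows to insert and rows to quarantine."""
    good, rejected = [], []
    with open(path, newline="") as f:
        # start=2: line 1 of the file is the header row.
        for lineno, row in enumerate(csv.DictReader(f), start=2):
            errors, parsed = [], {}
            for col, typ in SCHEMA.items():
                raw = (row.get(col) or "").strip()
                if not raw:                        # missing value
                    errors.append(f"{col}: missing")
                    continue
                try:
                    val = typ(raw)                 # wrong data type
                except ValueError:
                    errors.append(f"{col}: bad type {raw!r}")
                    continue
                lo, hi = RANGES.get(col, (None, None))
                if lo is not None and not (lo <= val <= hi):
                    errors.append(f"{col}: {val} out of range")
                    continue
                parsed[col] = val
            # Erroneous records are marked and kept out of the database.
            (rejected if errors else good).append((lineno, parsed, errors))
    return good, rejected   # insert `good`; log/quarantine `rejected`
```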
5.d. Distinguish between noise and outliers by answering the following questions.
(a) Is noise ever interesting or desirable? Outliers?
(b) Can noise objects be outliers?
(c) Are noise objects always outliers?
(d) Can noise make a typical value into an unusual one, or vice versa? [3 + 3 + 3 + 3 = 12 Marks]
6.a. Suppose there is a linear regression problem with only one feature and one target attribute. Let
(x1,y1), (x2,y2), …, (xN,yN) be ‘N’ training data points, and you are asked to fit a simple linear regression
by minimizing the sum of squared errors. Let the resulting regression model be y = α + βx.
Prove or disprove that the fitted regression line (i.e., y = α + βx) passes through the mean of the
training examples (x̄, ȳ). [10 Marks]
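A sketch of the expected argument for 6(a): the normal equation for α alone settles the claim:

```latex
% Requires \usepackage{amsmath}.
\begin{align*}
\frac{\partial}{\partial \alpha} \sum_{i=1}^{N} (y_i - \alpha - \beta x_i)^{2}
  &= -2 \sum_{i=1}^{N} (y_i - \alpha - \beta x_i) = 0\\
\Longrightarrow\; \sum_{i=1}^{N} y_i = N\alpha + \beta \sum_{i=1}^{N} x_i
  \;&\Longrightarrow\; \bar{y} = \alpha + \beta \bar{x},
\end{align*}
% so the fitted line does pass through the mean point (x-bar, y-bar).
```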
6.b. Suppose there is a learning problem with ‘D’ features and we would like to find the subset of
features that gives optimal results. As the number of subsets of the feature set is finite (2^D), we could in
principle find the best feature subset by computing the validation error for each subset. Give
reasons why we typically make use of heuristics such as forward/backward greedy feature-selection
algorithms instead. [4 Marks]
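A counting sketch for 6(b) (illustrative only): exhaustive search must validate every non-empty subset, while a forward greedy pass fits at most D candidate models per round for at most D rounds:

```python
# Models evaluated: exhaustive subset search vs. a forward greedy pass.
for D in (10, 20, 50, 100):
    exhaustive = 2 ** D - 1          # every non-empty feature subset
    greedy = D * (D + 1) // 2        # <= D rounds, <= D candidates per round
    print(f"D={D:4d}: exhaustive={exhaustive:.3e}  greedy={greedy}")
```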
7. We know that the variance of f(x) is defined as Var[f] = E[(f(x) − E[f(x)])^2]. Show that the variance of f(x)
can also be written as follows: Var[f] = E[f(x)^2] − E[f(x)]^2. [5 Marks]
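A sketch of the required expansion, using only linearity of expectation and the fact that E[f(x)] is a constant:

```latex
% Requires \usepackage{amsmath,amssymb}.
\begin{align*}
\operatorname{Var}[f]
  &= \mathbb{E}\bigl[(f(x) - \mathbb{E}[f(x)])^{2}\bigr]\\
  &= \mathbb{E}\bigl[f(x)^{2}\bigr] - 2\,\mathbb{E}[f(x)]\,\mathbb{E}[f(x)] + \mathbb{E}[f(x)]^{2}\\
  &= \mathbb{E}\bigl[f(x)^{2}\bigr] - \mathbb{E}[f(x)]^{2}.
\end{align*}
```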
8. Let X be a discrete random variable taking the values 1, 2, …, n. We can have numerous discrete
probability distributions for the random variable X. Find a probability distribution with maximum
entropy and another with minimum entropy. [5 Marks]
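A sketch of the expected answer: with H(X) = −Σ_i p_i log p_i, the uniform distribution attains the maximum and a point mass the minimum:

```latex
% Requires \usepackage{amsmath,amssymb}.
\begin{align*}
p_i = \tfrac{1}{n}\;(i = 1,\dots,n)
  &\;\Longrightarrow\; H(X) = \log n \quad \text{(maximum entropy)},\\
p_k = 1,\; p_i = 0 \;(i \neq k)
  &\;\Longrightarrow\; H(X) = 0 \quad \text{(minimum entropy)}.
\end{align*}
```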