CS6011: Kernel Methods For Pattern Analysis
Assignment 1
Submitted by: CS08B025, CS08B036
Regression tasks:
i) Polynomial Curve fitting (Dataset I, univariate data):
The degree of polynomial for which the MSE on validation data is minimum is 7, with a corresponding MSE of 4.9869e+003 on test data. From the graph, we can see a steep decrease in MSE from degree 2 to degree 3, so the data can be approximated well by a 3rd-degree polynomial. Also, as the degree of the polynomial increases (from 16 to 20), the MSE on test data increases, indicating overfitting.
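The degree sweep described above can be sketched as follows. This is a minimal illustration with hypothetical cubic-plus-noise data standing in for Dataset I; the actual data and the chosen degree range differ in the report.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical univariate data standing in for Dataset I: a cubic trend plus noise.
x_train = rng.uniform(-3, 3, 60)
y_train = x_train**3 - 2 * x_train + rng.normal(0, 1.0, 60)
x_val = rng.uniform(-3, 3, 30)
y_val = x_val**3 - 2 * x_val + rng.normal(0, 1.0, 30)

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

# Fit polynomials of increasing degree on the training data and keep
# the degree with the lowest MSE on the validation data.
val_mse = {d: mse(np.polyfit(x_train, y_train, d), x_val, y_val)
           for d in range(1, 10)}
best_degree = min(val_mse, key=val_mse.get)
```

Because the hypothetical data is generated from a cubic, the validation MSE drops sharply once the degree reaches 3, mirroring the steep decrease the report observes between degrees 2 and 3.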
[Figure: fitted curve on training, validation, and test data] Approximating the given data by a polynomial of degree 7 (optimum): Y(x, W) = w0 + w1 * x + w2 * x^2 + ... + w7 * x^7.
[Figure: scatter plots with target output on the x-axis and model output on the y-axis, for training, validation, and test data]
These plots can be approximated by straight lines of slope close to 1, which shows that the target output and the model output are approximately equal. Ideally, if the model approximated the data exactly, all the points in the plot would fall on a line of slope 1.
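The "slope close to 1" observation can be checked numerically by fitting a line to the (target, model output) pairs. The values below are hypothetical stand-ins for the model's predictions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical target values and model predictions that track them closely.
target = rng.uniform(0, 10, 200)
model_out = target + rng.normal(0, 0.1, 200)   # near-perfect model

# Least-squares line: model_out ~ slope * target + intercept.
# A slope near 1 (and a small intercept) means predictions match targets.
slope, intercept = np.polyfit(target, model_out, 1)
```

A poorly fitting model would show up here as a slope far from 1 or a large intercept, even if its scatter plot looks superficially linear.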
ii) Linear Model for Regression using Gaussian Basis Functions (Dataset II, bivariate data):
[Figure: MSE plots for training, validation, and test data] When lambda = 1, the optimum width of the Gaussian = 7.6000, no. of clusters = 20, and MSE (validation data) = 338.0205.
From the graphs, the MSE is very high when c is low (close to zero), because the value of each basis function exp(-(r*r)/(c*c)) is then very low.
Lambda = 0.5
[Figure: MSE plots for training, validation, and test data] When lambda = 0.5, the optimum width of the Gaussian = 8.4000, no. of clusters = 20, and MSE (validation data) = 263.5788.
As the value of c (the width of the Gaussian) increases from 0, the MSE first decreases and then, after an optimum c is reached, starts increasing. When c is near zero, outputs are predicted only by the points very near each mean, and points farther from the means are predicted poorly; the reverse happens as c increases. Also, as the number of clusters increases, the MSE decreases. However, increasing the number of clusters beyond a point causes the Gaussians to overlap, which leads to wrong predictions. For a larger variance, even a slightly smaller number of clusters may give high error.
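The width-and-cluster sweep described above can be sketched as follows. This is a minimal stand-in, not the report's experiment: it uses synthetic bivariate data and randomly chosen centers in place of k-means, with ridge-regularised least squares for the weights.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical bivariate regression data standing in for Dataset II.
X = rng.uniform(-5, 5, (200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.1, 200)
X_tr, y_tr, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

def design_matrix(X, centers, c):
    """Gaussian basis functions phi_j(x) = exp(-||x - mu_j||^2 / c^2), plus a bias column."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.hstack([np.ones((len(X), 1)), np.exp(-d2 / c**2)])

def ridge_fit(Phi, y, lam):
    """Regularised least squares: w = (Phi^T Phi + lam I)^(-1) Phi^T y."""
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ y)

def val_mse(n_centers, c, lam=0.5):
    # Random training points as centers -- a crude stand-in for k-means.
    centers = X_tr[rng.choice(len(X_tr), n_centers, replace=False)]
    w = ridge_fit(design_matrix(X_tr, centers, c), y_tr, lam)
    pred = design_matrix(X_val, centers, c) @ w
    return np.mean((pred - y_val) ** 2)

# As the report notes, MSE is very high for a width near zero and
# improves once c is large enough for the basis functions to overlap.
err_narrow, err_wide = val_mse(20, 0.05), val_mse(20, 3.0)
```

With c = 0.05 each basis function is nearly zero away from its own center, so the model collapses to predicting roughly the training mean, reproducing the high-MSE regime the graphs show.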
Lambda = 0
When lambda = 0, the optimum width of the Gaussian = 9.9000, no. of clusters = 18, and MSE (validation data) = 39.8534.
[Figure: scatter plots with target output on the x-axis and model output on the y-axis, for training, validation, and test data]
iii) Generalized RBF:
[Figure: MSE plots for training, validation, and test data]
When lambda = 0, the optimum width of the Gaussian = 1.4970, no. of clusters = 6, and MSE (validation data) = 223.691. The MSE stays in approximately the same range for different values of c and numbers of clusters.
[Figure: scatter plots with target output on the x-axis and model output on the y-axis, for training, validation, and test data]
For dataset II:
Lambda = 0.5
[Figure: plots of mean squared error for training, validation, and test data]
When lambda = 0.5, the optimum width of the Gaussian = 3.4000, no. of clusters = 20, and MSE (validation data) = 302.9161. For high values of the width of the Gaussian and of the number of clusters, the MSE is also high.
[Figure: scatter plots with target output on the x-axis and model output on the y-axis, for training, validation, and test data]
The performance of the generalized RBF is better for Dataset II (bivariate) than for Dataset I (univariate), as seen from the scatter plots. The linear model for regression performed better than the generalized RBF, although the reverse is expected.
Lambda = 0.5
[Figure: MSE plots for training, validation, and test data]
When lambda = 0.5, the optimum width of the Gaussian = 190, no. of clusters = 5, and MSE (validation data) = 410.3364. The MSE behaves similarly to the previous case (dataset II): it initially decreases with c and, after reaching a minimum, starts increasing.
[Figure: scatter plots with target output on the x-axis and model output on the y-axis, for training, validation, and test data]
[Figure: MSE plots for training, validation, and test data]
Optimum numbers of nodes in the hidden layers: hidden layer 1 = 11, hidden layer 2 = 7. We observe that choosing optimal parameters in MLFFNN regression is very important, as the error changes drastically for different hidden-layer sizes.
[Figure: scatter plots with target output on the x-axis and model output on the y-axis, for training, validation, and test data]
[Figure: plots for training, validation, and test data]
Optimum numbers of nodes in the hidden layers: hidden layer 1 = 7, hidden layer 2 = 9.
[Figure: scatter plots with target output on the x-axis and model output on the y-axis, for training, validation, and test data]
[Figure: scatter plots with target output on the x-axis and model output on the y-axis, for training, validation, and test data]
Classification Tasks:
GMM Dataset Ia (Linearly Separable Data):
While training the GMM on the UCI benchmark data, we performed k-means clustering in the initialization phase. After k-means, when we computed the covariance matrices, we found that some rows of the covariance matrices were all zeros. We inferred that, for all the feature vectors, the values of a particular dimension were the same. To overcome this ill-conditioned covariance matrix problem, we added Gaussian noise to that dimension. We also had to scale the data appropriately; otherwise, all the components collapsed into a single component.
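The ill-conditioned covariance problem and the noise fix can be reproduced in a few lines. The data below is hypothetical, with one dimension held constant to mimic the situation described:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical feature matrix in which one dimension is constant across
# all vectors, so the sample covariance matrix is singular.
X = rng.normal(0, 1, (50, 3))
X[:, 1] = 4.2                       # constant dimension -> zero row/column in covariance

cov = np.cov(X, rowvar=False)
singular = np.linalg.matrix_rank(cov) < cov.shape[0]

# Adding small Gaussian noise to the constant dimension, as described
# above, restores a full-rank (invertible) covariance matrix.
X[:, 1] += rng.normal(0, 1e-3, len(X))
cov_fixed = np.cov(X, rowvar=False)
fixed = np.linalg.matrix_rank(cov_fixed) == cov_fixed.shape[0]
```

An alternative to adding noise is to add a small constant to the covariance diagonal (ridge-style regularisation), which has the same effect of making the matrix invertible.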
We use a greedy strategy for finding the number of mixtures. We set the maximum number of mixtures to k_max and initialize the number of mixtures to k_max/2 for all classes. Then we vary k from 1 to k_max for each class, keeping the k's of the other classes fixed. Validation is done to improve each class's accuracy, without taking global accuracy into consideration. After we have an optimal k for each class, we vary the k of each class from 1 to k_max and check whether global accuracy on the validation data increases. This is done twice.
After applying the above method, we observed the following. The linearly separable data and the overlapping data were both Gaussian distributions, so k = 1 was optimal. For the non-linearly separable data, k greater than 17 classified the data accurately for both classes. We used the same covariance matrix for all mixtures in this case, as estimating separate parameters is not possible with so little data. For the UCI benchmark data we got k's equal to 10, 19, 3, 7, 1 and were able to get accuracy greater than 98.5%. We infer that increasing k beyond a certain point does not increase accuracy, as the clusters converge to the same mean. For the image dataset, the GMM is comparable to the MLFFNN and the Bayes classifier; all the classifiers give almost the same accuracy on this dataset. We were not able to train the GMM with full covariance matrices for all mixtures because of the scarcity of data for images (100 training points, as we had only 100 images per class).
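A simplified version of the greedy per-class k search might look like the sketch below. The 2-class data is hypothetical, scikit-learn's GaussianMixture stands in for our EM implementation, and the report's k_max schedule and two-pass refinement are reduced to a single sweep:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)

# Hypothetical 2-class data: class 0 is a single Gaussian blob,
# class 1 is bimodal, so their best component counts can differ.
X0 = rng.normal([0.0, 0.0], 0.5, (200, 2))
X1 = np.vstack([rng.normal([4.0, 0.0], 0.5, (100, 2)),
                rng.normal([0.0, 4.0], 0.5, (100, 2))])
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)
idx = rng.permutation(len(X))
X_tr, y_tr = X[idx[:300]], y[idx[:300]]
X_val, y_val = X[idx[300:]], y[idx[300:]]

def fit_class_gmms(ks):
    """Fit one GMM per class, with ks[c] components for class c."""
    return [GaussianMixture(n_components=ks[c], reg_covar=1e-6,
                            random_state=0).fit(X_tr[y_tr == c])
            for c in (0, 1)]

def accuracy(gmms, X, y):
    """Classify by the larger per-class log-likelihood (equal priors assumed)."""
    ll = np.column_stack([g.score_samples(X) for g in gmms])
    return float(np.mean(ll.argmax(axis=1) == y))

# Greedy sweep: vary k for one class at a time, keeping the other
# class's k fixed, and keep the best score on the validation data.
ks = [2, 2]
for c in (0, 1):
    best_k, best_acc = ks[c], -1.0
    for k in (1, 2, 3):
        trial = list(ks)
        trial[c] = k
        a = accuracy(fit_class_gmms(trial), X_val, y_val)
        if a > best_acc:
            best_k, best_acc = k, a
    ks[c] = best_k

final_acc = accuracy(fit_class_gmms(ks), X_val, y_val)
```

The `reg_covar` argument adds a small value to the covariance diagonals, which is scikit-learn's built-in answer to the ill-conditioned covariance problem discussed earlier.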
Accuracy = 100. Confusion matrix:
100 0 0 0
0 100 0 0
0 0 100 0
0 0 0 100
Optimum no. of clusters (K): 1, 1, 1, 1
Optimum no. of clusters (K): 17, 17. Accuracy: 100. Confusion matrix: 489 0 / 0 488.
Confusion Matrix = 80 0 14 4 6 7 0 0 89 7 16 77 8 13 0 79
Dataset III:
Optimum no. of clusters (K): 3, 1, 3. Accuracy: 76.25. Confusion matrix: 141 11 19 96 13 21.
Accuracy: 100. Confusion matrix: 100 0 0 100 0 0 0 0.
Accuracy: 81.25. Confusion matrix: 80 0 0 89 7 16 8 13.
For Dataset II:
Accuracy: 91.704036. Confusion matrix: 273 1 0 184 0 6 0 0 0 2.
Accuracy = 100. Confusion matrix:
100 0 0 0
0 100 0 0
0 0 100 0
0 0 0 100
A perceptron was used to classify the linearly separable dataset, with a voting mechanism for multi-class classification. We got 100% accuracy on the test data. However, the boundary drawn is never the optimal separating hyperplane.
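The perceptron-with-voting scheme can be sketched as below. This is a minimal illustration with hypothetical 3-class linearly separable blobs and one-vs-one pairwise perceptrons; the report's datasets and exact voting details may differ:

```python
import numpy as np

rng = np.random.default_rng(5)

def train_perceptron(X, y, epochs=50):
    """Classic perceptron on labels in {-1, +1}; returns augmented weights."""
    Xa = np.hstack([X, np.ones((len(X), 1))])   # bias via augmented input
    w = np.zeros(Xa.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xa, y):
            if yi * (w @ xi) <= 0:              # misclassified -> update
                w += yi * xi
    return w

def predict_vote(ws, pairs, X):
    """One-vs-one voting: each pairwise perceptron votes for one class."""
    Xa = np.hstack([X, np.ones((len(X), 1))])
    votes = np.zeros((len(X), 3), dtype=int)
    for w, (a, b) in zip(ws, pairs):
        s = Xa @ w
        votes[np.arange(len(X)), np.where(s > 0, a, b)] += 1
    return votes.argmax(axis=1)

# Hypothetical linearly separable 3-class data.
means = np.array([[0, 0], [6, 0], [0, 6]])
X = np.vstack([rng.normal(m, 0.5, (50, 2)) for m in means])
y = np.repeat([0, 1, 2], 50)

# One perceptron per pair of classes (class a labelled +1, class b labelled -1).
pairs = [(0, 1), (0, 2), (1, 2)]
ws = [train_perceptron(np.vstack([X[y == a], X[y == b]]),
                       np.array([1] * 50 + [-1] * 50)) for a, b in pairs]
acc = np.mean(predict_vote(ws, pairs, X) == y)
```

The perceptron convergence theorem guarantees each pairwise boundary is found for separable data, but, as noted above, the boundary it stops at is whichever separating hyperplane it reaches first, not the maximum-margin one.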
MLFFNN:
For Dataset Ia :
It gave very good results on the synthetic data and the UCI benchmark data (> 99% for all of these except the overlapping dataset). However, its performance was the poorest (75%) on the image dataset, though the difference between all the classifiers was within 2%. Although we get very good accuracies, we need to choose the number of nodes in the hidden layers correctly; otherwise, the MLFFNN can give very bad results. When we varied the number of nodes in the two hidden layers from 7 to 12, the accuracy varied from 20% to 99.4% for the UCI benchmark dataset, from 50% to 75% for the image dataset, from 92% to 100% for the non-linearly separable synthetic dataset, and from 25% to 84.5% for the overlapping data. So cross-validation plays a very important role in model selection for the MLFFNN; for the other classifiers (even the GMM), the variability was not as high. We used sigmoidal transfer functions for the hidden layers and a linear transfer function for the output layer. Making the transfer functions of both hidden layers linear decreased accuracy drastically for the non-linearly separable dataset, as a linear classifier cannot solve a non-linear problem. Keeping one of the layers non-linear, we were able to get 100% accuracy for a few combinations of hidden-layer sizes, but in general the performance was not good for most combinations. So it is better to have non-linear transfer functions.
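The claim that an all-linear network reduces to a linear classifier can be verified directly: composing linear layers yields a single linear map, while inserting a sigmoid breaks the equivalence. The layer widths (7 and 9) follow the text; the weights are random stand-ins and biases are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(6)

# Three weight matrices: input (2) -> hidden (7) -> hidden (9) -> output (2).
W1 = rng.normal(size=(7, 2))
W2 = rng.normal(size=(9, 7))
W3 = rng.normal(size=(2, 9))
x = rng.normal(size=(2, 5))                  # 5 input vectors, 2 features each

# Forward pass with linear activations in both hidden layers...
deep_out = W3 @ (W2 @ (W1 @ x))
# ...is exactly a single linear layer: the network gains no expressive power.
collapsed = (W3 @ W2 @ W1) @ x

# A sigmoid between layers breaks this collapse, which is what lets the
# network represent non-linear decision boundaries.
sig = lambda z: 1.0 / (1.0 + np.exp(-z))
nonlinear_out = W3 @ sig(W2 @ sig(W1 @ x))
```

This is why making both hidden layers linear failed on the non-linearly separable dataset: the whole network was equivalent to one linear discriminant.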
Hidden layer 1 = 10, hidden layer 2 = 7. Accuracy: 100. Confusion matrix:
100 0 0 0
0 100 0 0
0 0 100 0
0 0 0 100
Hidden layer 1 = 8, hidden layer 2 = 9. Accuracy: 100. Confusion matrix: 489 0 / 0 488.
Accuracy: 84.75. Confusion matrix: 76 0 0 87 2 16 4 2.
From the outputs of the GMM, MLFFNN, and Bayes classifier, we infer that the image dataset has a lot of overlap, while the UCI benchmark dataset is non-linearly separable with very little overlap.