Machine Learning Assignment

This document outlines an assignment for an introduction to machine learning course. It involves experiments with linear classification and regression methods. Students must generate synthetic datasets, train linear classifiers and k-NN on the data, perform linear and ridge regression on a real-world dataset, and submit code and a report detailing their results. They are instructed not to collaborate with others on the individual assignment.
Copyright
© Attribution Non-Commercial (BY-NC)

IITM-CS5011 : Introduction to Machine Learning Assignment #2

Given on: Aug 19, 10pm        Due on: Sep 02, 11:55pm

The goal of this assignment is to experiment with linear methods for classification and regression. This is an individual assignment. Collaborations and discussions with others are strictly prohibited. You may use Matlab, Octave, Python, R or Java for your implementation. If you are using any other language, please contact us before you proceed. You have to turn in well-documented code along with a detailed report of the results of the experiments electronically in Moodle. Typeset your report in LaTeX. Your report should contain detailed answers to all of the questions asked below. Look at the end of the assignment for submission instructions.

1. You will use a synthetic data set for the classification task. Generate two classes with 10 features each. Each class is given by a multivariate Gaussian distribution, with both classes sharing the same covariance matrix. Ensure that the covariance matrix is not spherical, i.e., that it is not a diagonal matrix with all the diagonal entries being the same. Generate 1000 examples for each class. Choose the centroids for the classes close enough so that there is some overlap in the classes. Specify clearly the details of the parameters used for the data generation. Randomly pick 40% of each class (i.e., 400 data points per class) as a test set, and train the classifiers on the remaining 60% of the data. When you report performance results, they should be on the left-out 40%. Call this dataset DS1.

2. For DS1, learn a linear classifier by using regression on an indicator variable. Report the best fit achieved by the classifier, along with the coefficients learnt.

3. For DS1, use k-NN to learn a classifier. Repeat the experiment for different values of k and report the performance for each value. Technically this is not a linear classifier, but I want you to appreciate how powerful linear classifiers can be. Do you do better than regression on indicator variables or worse? Are there particular values of k which perform better?

4. Now, instead of having a single multivariate Gaussian distribution per class, each class is going to be generated by a mixture of 3 Gaussians. For each class, define 3 Gaussians, with the first Gaussian of the first class sharing its covariance matrix with the first Gaussian of the second class, and so on. For both classes, fix the mixture probabilities as (0.1, 0.42, 0.48), i.e., a sample arises from the first Gaussian with probability 0.1, from the second with probability 0.42, and so on. Now sample from this distribution and generate a dataset similar to the one in question 1. Call this dataset DS2. Now perform the experiments in questions 2 and 3 again, but using DS2. What do you observe? Can you comment on the performance of both classifiers when you use DS1 and DS2?

5. For the regression tasks, you will use the Communities and Crime Data Set from the UCI repository (http://archive.ics.uci.edu/ml/datasets/Communities+and+Crime). This is a real-life data set and as such would not have the nice properties that we expect. Your first job is to make this dataset usable by filling in all the missing values. Use the sample mean of the missing attribute. Is this a good choice? What else might you use? If you have a better method, describe it, and you may use it for filling in the missing data. Turn in the complete data set.

6. Fit the above data using linear regression. Report the residual error of the best fit achieved, along with the coefficients learnt.

7. Use ridge regression on the above data. Repeat the experiment for different values of the regularization parameter λ. Report the residual error for each value, along with the coefficients learnt. Which value of λ gives the best fit? Is it possible to use the information you obtained during this experiment for feature selection? If so, what is the best fit you achieve with a reduced set of features?
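
As a rough illustration of the workflow in questions 1 to 3, the following Python sketch generates a DS1-style dataset, fits a linear classifier by least-squares regression on a 0/1 indicator, and compares it with k-NN. It assumes NumPy is available. The function names (make_ds1, fit_indicator_regression, predict_indicator, knn_predict), the way the covariance matrix is built, the centroid shift of 0.5, the random seed and the values of k are all illustrative assumptions, not parameters prescribed by the assignment.

```python
import numpy as np

def make_ds1(n_per_class=1000, n_features=10, shift=0.5, test_frac=0.4, seed=0):
    """Two Gaussian classes sharing one non-spherical covariance (assumed setup)."""
    rng = np.random.default_rng(seed)
    # A A^T / d + I is symmetric positive definite and, for a random A,
    # not diagonal, so the shared covariance is not spherical.
    A = rng.normal(size=(n_features, n_features))
    cov = A @ A.T / n_features + np.eye(n_features)
    # Centroids placed close together so that the classes overlap.
    means = (np.zeros(n_features), np.full(n_features, shift))
    n_test = int(round(test_frac * n_per_class))
    train, test = [], []
    for label, mu in enumerate(means):
        X = rng.multivariate_normal(mu, cov, size=n_per_class)
        y = np.full(n_per_class, label)
        idx = rng.permutation(n_per_class)  # random split within each class
        test.append((X[idx[:n_test]], y[idx[:n_test]]))
        train.append((X[idx[n_test:]], y[idx[n_test:]]))
    X_train = np.vstack([X for X, _ in train])
    y_train = np.concatenate([y for _, y in train])
    X_test = np.vstack([X for X, _ in test])
    y_test = np.concatenate([y for _, y in test])
    return X_train, y_train, X_test, y_test

def fit_indicator_regression(X, y):
    """Least-squares fit of the 0/1 class indicator on [1, X]."""
    Xb = np.hstack([np.ones((len(X), 1)), X])
    coef, *_ = np.linalg.lstsq(Xb, y.astype(float), rcond=None)
    return coef  # coef[0] is the intercept

def predict_indicator(coef, X):
    """Predict class 1 when the fitted indicator exceeds 0.5."""
    Xb = np.hstack([np.ones((len(X), 1)), X])
    return (Xb @ coef > 0.5).astype(int)

def knn_predict(X_train, y_train, X_test, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    d2 = (
        (X_test ** 2).sum(axis=1)[:, None]
        - 2.0 * X_test @ X_train.T
        + (X_train ** 2).sum(axis=1)[None, :]
    )
    nearest = np.argpartition(d2, k - 1, axis=1)[:, :k]
    return (y_train[nearest].mean(axis=1) > 0.5).astype(int)

X_train, y_train, X_test, y_test = make_ds1()
coef = fit_indicator_regression(X_train, y_train)
acc_lr = float(np.mean(predict_indicator(coef, X_test) == y_test))
print(f"indicator regression: test accuracy {acc_lr:.3f}")

knn_acc = {}
for k in (1, 5, 15, 45):  # odd k avoids ties in a two-class vote
    knn_acc[k] = float(np.mean(knn_predict(X_train, y_train, X_test, k) == y_test))
    print(f"k-NN with k={k}: test accuracy {knn_acc[k]:.3f}")
```

The same functions could be reused for DS2 by replacing only the sampling step with a draw from a three-component mixture. All accuracies are computed on the held-out 40%, as the assignment requires.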

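For the regression questions (5 to 7), a minimal sketch of mean imputation followed by closed-form ridge regression is given below. It assumes NumPy and assumes that missing entries have already been encoded as NaN. Small random toy data stands in for the Communities and Crime attributes, and the function names and the grid of λ values are hypothetical choices made only for illustration. The ridge coefficients are computed on centered data as (XᵀX + λI)⁻¹Xᵀy, leaving the intercept unpenalized; λ = 0 reduces to ordinary linear regression.

```python
import numpy as np

def impute_column_means(X):
    """Replace each NaN with the mean of the observed values in its column."""
    col_means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_means, X)

def fit_ridge(X, y, lam):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y_mean - x_mean @ beta
    return intercept, beta

def residual_sum_of_squares(X, y, intercept, beta):
    """Sum of squared residuals of the fitted linear model."""
    r = y - (intercept + X @ beta)
    return float(r @ r)

# Toy stand-in for the real data (an assumption, not the UCI dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_beta = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
y = X @ true_beta + 0.1 * rng.normal(size=200)
X[rng.random(X.shape) < 0.05] = np.nan  # mark roughly 5% of entries as missing
X = impute_column_means(X)

lambdas = (0.0, 1.0, 10.0, 100.0)
for lam in lambdas:
    b0, beta = fit_ridge(X, y, lam)
    rss = residual_sum_of_squares(X, y, b0, beta)
    print(f"lambda={lam:6.1f}  RSS={rss:10.3f}  beta={np.round(beta, 3)}")
```

On the training data the residual error can only grow as λ increases, so comparing values of λ is more informative on the held-out test split. Coefficients that shrink towards zero quickly as λ grows are one possible starting point for the feature-selection question.
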
Submission Instructions
Specify your choice of language in the link provided in Moodle. Submit a single tarball/zip file containing the following files in the specified directory structure. Use the following naming convention:

cs5011_a2_rollno.tar.gz
    cs5011_a2_rollno/
        Dataset/
            DS1-train.csv
            DS1-test.csv
            DS2-train.csv
            DS2-test.csv
            CandC-train.csv
            CandC-test.csv
        Report/
            rollno-report.pdf
        Code/
            all your code files
