7641 Assignment 1
Supervised Learning
1 Assignment Weight
The assignment is worth 15% of the total points.
Read everything below carefully, as this assignment has changed from term to term.
2 Objective
The purpose of this project is to explore techniques in supervised learning. It is important to realize that
understanding an algorithm or technique requires understanding how it behaves empirically under a variety of
circumstances. As such, rather than implement each of the algorithms, you will be asked to experiment with
them and compare their performance. This is quite involved and also possibly quite different from what you
are used to; however, it is central and in many ways the essence of supervised learning.
3 Procedure
First, you should design two interesting classification problems. For the purposes of this assignment, a classification problem is just a set of training examples and a set of test examples. You can download data, use data from your own research, or make up your own. Be careful about the datasets you choose, though: you'll need to explain why they are interesting, use them in later assignments, and have a deep understanding of them.
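For example, a minimal sketch of turning a downloaded dataset into training and test sets with pandas and scikit-learn might look like the following; the file data.csv and its label column are hypothetical placeholders for whatever data you choose:

    # Minimal sketch: turning a downloaded dataset into train/test sets.
    # "data.csv" and its "label" column are hypothetical placeholders.
    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("data.csv")
    X = df.drop(columns=["label"])   # feature columns
    y = df["label"]                  # class labels

    # A stratified split keeps class proportions similar in both sets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42
    )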
After selecting two interesting classification problems, you will go through the process of exploring the data,
tuning the algorithms you’ve learned about, and writing a thorough analysis of your findings. You need not
implement any learning algorithm yourself; however, you must participate in the journey of exploring, tuning,
and analyzing. Concretely, this means:
• You may program in any language you wish and are allowed to use any library, as long as it was not
written specifically to solve this assignment.
• TAs must be able to recreate your experiments on a standard Linux machine if necessary.
• The analysis you provide in the report is paramount.
You should experiment with five learning algorithms on each dataset (a brief scikit-learn sketch follows the list). They are:
• Decision Trees. Be sure to use some form of pruning. You are not required to use information gain to split attributes (for example, the Gini index is sometimes used instead), but you should describe whatever criterion you do use.
• Neural Networks. You may use networks of nodes with as many layers as you like and any activation
function you see fit.
• Boosted Decision Trees. As with decision trees, you will want to use some form of pruning. Since you are using boosting, you can afford to be much more aggressive with your pruning.
• Support Vector Machines. Make sure to try at least two different kernel functions.
• k-Nearest Neighbors. Be sure to experiment with different values of k.
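As a rough, non-authoritative sketch (assuming scikit-learn), the five algorithms might be set up as below; every hyperparameter value shown is a placeholder for you to tune, not a recommendation:

    # Sketch only: scikit-learn instantiations of the five algorithms.
    # Every hyperparameter value below is a placeholder to be tuned, not a recommendation.
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier

    classifiers = {
        # Cost-complexity pruning (ccp_alpha) is one pruning option; Gini is the default split criterion.
        "decision_tree": DecisionTreeClassifier(criterion="gini", ccp_alpha=1e-3),
        # One hidden layer with ReLU; layer sizes and activation are entirely up to you.
        "neural_net": MLPClassifier(hidden_layer_sizes=(32,), activation="relu", max_iter=500),
        # Boosting over aggressively pruned (shallow) trees.
        # (Older scikit-learn versions call this parameter base_estimator.)
        "boosted_trees": AdaBoostClassifier(
            estimator=DecisionTreeClassifier(max_depth=1), n_estimators=100
        ),
        # Try at least two kernels, e.g. RBF and linear.
        "svm_rbf": SVC(kernel="rbf"),
        "svm_linear": SVC(kernel="linear"),
        # Vary k and see what happens.
        "knn": KNeighborsClassifier(n_neighbors=5),
    }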
Each algorithm is described in detail in your textbook, the assigned readings on Canvas, and on the internet.
Instead of implementing the algorithms yourself, you should use libraries that do this for you and make sure to
provide proper attribution. Also, note that you'll need to do some fiddling to obtain good results and graphs, and this might require you to modify these libraries in various ways. At a minimum, your written analysis should include:
• A description of your classification problems, and why you feel they are interesting. Think hard about
this. To be interesting the problems should be non-trivial on the one hand, but capable of admitting
comparisons and analysis of the various algorithms on the other. Avoid the mistake of working on the largest, most complicated, and messiest dataset you can find. The key is to be interesting and clear; there are no points for hairy and complex.
• The training and testing error rates you obtained by running the various learning algorithms on your problems. At the very least you should include graphs that show performance on both training and test data as a function of training set size (note that this implies you need to design a classification problem that has more than a trivial amount of data) and, for the algorithms that are iterative, as a function of training time/iterations. Both of these kinds of graphs are referred to as learning curves. (A scikit-learn sketch for generating such curves appears after this list.)
• Graphs for each algorithm showing training and testing error rates as a function of selected hyperparameter
ranges. This type of graph is referred to as a model complexity graph (also sometimes validation curve).
Please experiment with more than one hyperparameter and make sure the results and subsequent analysis
you provide are meaningful.
• Analyses of your results. Why did you get the results you did? Compare and contrast the different algorithms. What sort of changes might you make to each of those algorithms to improve performance? How fast were they in terms of wall clock time? Iterations? Would cross validation help? How much performance was due to the problems you chose? Which algorithm performed best? How do you define best? Be creative and think of as many questions, and as many answers, as you can.
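One possible way, and certainly not the only one, to generate the numbers behind learning curves and model complexity curves is scikit-learn's learning_curve and validation_curve helpers. In the sketch below, the decision tree, the max_depth sweep, and the synthetic stand-in data are all illustrative assumptions:

    # Sketch: producing learning-curve and model-complexity-curve data for one classifier.
    # The classifier, the swept hyperparameter, and the synthetic data are illustrative only.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import learning_curve, validation_curve
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)  # stand-in data
    clf = DecisionTreeClassifier(random_state=0)

    # Learning curve: cross-validated score as a function of training set size.
    sizes, tr_scores, va_scores = learning_curve(
        clf, X, y, train_sizes=np.linspace(0.1, 1.0, 8), cv=5, scoring="accuracy"
    )
    train_error = 1.0 - tr_scores.mean(axis=1)
    val_error = 1.0 - va_scores.mean(axis=1)

    # Model complexity ("validation") curve: score as a function of one hyperparameter.
    depth_range = np.arange(1, 21)
    tr2, va2 = validation_curve(
        clf, X, y, param_name="max_depth", param_range=depth_range, cv=5, scoring="accuracy"
    )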
Some of the many libraries you may find useful:
Machine learning:
• scikit-learn (python)
• Weka (java)
• e1071/nnet/randomForest (R)
• ML toolbox (matlab)
• tensorflow/pytorch (python)
Plotting:
• matplotlib (python)
• seaborn (python)
• yellowbrick (python)
• ggplot2 (R)
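For reference, a minimal matplotlib sketch of a learning-curve plot might look like the following; the sizes, train_error, and val_error arrays are assumed to come from the learning_curve sketch above:

    # Sketch: plotting training vs. validation error against training set size.
    # `sizes`, `train_error`, and `val_error` are assumed to come from the earlier learning_curve sketch.
    import matplotlib.pyplot as plt

    plt.figure()
    plt.plot(sizes, train_error, marker="o", label="training error")
    plt.plot(sizes, val_error, marker="o", label="validation error")
    plt.xlabel("Training set size")
    plt.ylabel("Error rate")
    plt.title("Learning curve (decision tree)")
    plt.legend()
    plt.savefig("learning_curve_dt.png", dpi=150)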
4 Submission Details
You must submit:
• A file named README.txt containing instructions for running your code (see note below)
• A file named yourgtaccount-analysis.pdf containing your writeup (GT account is what you log in with,
not your all-digits ID)
Note: we need to be able to get to your code and your data. Providing entire libraries isn't necessary when a URL would suffice; however, you should at least provide any files you needed to modify, along with enough supporting explanation that we can reproduce your results on a standard Linux machine.
5 Rescoring Criteria
When your assignment is scored, you will receive feedback explaining your errors and successes in some level of
detail. This feedback is for your benefit, both on this assignment and on future assignments; internalizing it is considered part of your learning goals.
If you are convinced that your score is in error in light of the feedback, you may request a rescore within a week
of the score and feedback being returned to you. A rescore request is only valid if it includes an explanation of
where the grader made an error.
It is important to note that because we consider your ability to internalize feedback a learning goal, we also
assess it. This ability is considered 10 percent of each assignment. We default to assigning you full credit. If
you request a rescore and do not receive at least 5 points as a result of the request, you will lose those 10 points.
6 Plagiarism
You are expected to cite any sources you draw on; the Williams College guide to citation basics is a good starting point [Col]. In particular, "if you use an author's specific word or words, you must place those words within quotation marks and you must credit the source" [Wis]. It is good style to use quotations sparingly. Obviously, you cannot quote other people's assignments and assume that is acceptable. Speaking of acceptable, citing is not a get-out-of-jail-free card: you cannot copy text willy-nilly, cite it all, and then claim it's not plagiarism just because you cited it. Too many quotes of more than, say, two sentences will be considered plagiarism and a terminal lack of academic originality.
Your README file will include pointers to any code and libraries you used.
If we catch you...
We report all suspected cases of plagiarism to the Office of Student Integrity. Students who are under investigation are not allowed to drop the course in question, and the consequences can be severe, ranging from a lowered grade to expulsion from the program.
References
[Col] Williams College. Citing Your Sources: Citing Basics. URL: https://libguides.williams.edu/citing.
[Wis] University of Wisconsin - Madison. Quoting and Paraphrasing. URL: https://writing.wisc.edu/handbook/assignments/quotingsources.
Original assignment description written by Charles Isbell. Updated for Spring 2024 by John Mansfield and Theodore LaGrow. Modified for LaTeX by John Mansfield.