Class Note For Machine Learning at University
Course Website:
https://www.cs.toronto.edu/~rgrosse/courses/csc311_f20/
Main source of information is the course webpage; check regularly!
Discussions: Piazza.
Sign up: https://piazza.com/utoronto.ca/fall2020/csc311
Your grade does not depend on your participation on Piazza; it is simply a good way to ask questions and discuss with your instructor, TAs, and peers. We will only allow questions that are related to the course materials/assignments/exams.
You only need to pay attention to the course website for content and
Quercus for links.
(45%) 4 assignments
• Combination of pen & paper derivations and programming exercises
• Weighted equally
(5%) Read some classic papers
• Worth 5%, honor system
(25%) Two 1-hour exams held during normal class time
• Your higher mark will count for 15%, and your lower mark for 10%
• See website for times and dates (tentative)
(25%) Project
• Will require you to apply several algorithms to a challenge problem and to write a short report analyzing the results
• Due during the final evaluation period
• More details TBA
Collaboration on the assignments is not allowed. Each student is responsible for his/her
own work. Discussion of assignments should be limited to clarification of the handout itself,
and should not involve any sharing of pseudocode or code or simulation results. Violation of
this policy is grounds for a semester grade of F, in accordance with university regulations.
Assignments should be handed in by the deadline; a late penalty of 10% per day will be assessed thereafter (up to 3 days, after which submission is blocked).
Extensions will be granted only in special situations, and you will need to complete an
absence declaration form and notify us to request special consideration, or otherwise have a
written request approved by the course instructors at least one week before the due date.
Why not jump straight to csc412/413 and learn neural nets first?
The principles you learn in this course will be essential to
understand and apply neural nets.
The techniques in this course are still the first things to try for a
new ML problem.
• E.g., try logistic regression before building a deep neural net! (A baseline sketch follows below.)
There’s a whole world of probabilistic graphical models.
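As a quick illustration of that advice, here is a minimal logistic-regression baseline. It is only a sketch: scikit-learn and the Iris toy dataset are assumptions for the example, not tools the course prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data stands in for a real problem (illustrative assumption).
X, t = load_iris(return_X_y=True)
X_train, X_test, t_train, t_test = train_test_split(X, t, random_state=0)

# A simple, fast baseline to beat before trying anything deeper.
clf = LogisticRegression(max_iter=1000).fit(X_train, t_train)
print("baseline accuracy:", clf.score(X_test, t_test))
```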
ML workflow sketch:
1. Should I use ML on this problem?
• Is there a pattern to detect?
• Can I solve it analytically?
• Do I have data?
2. Gather and organize the data.
• Preprocessing, cleaning, visualizing.
3. Establish a baseline.
4. Choose a model, loss, regularization, ...
5. Optimize (could be simple, could be a PhD...).
6. Search over hyperparameters.
7. Analyze performance & mistakes, and iterate back to step 4 (or 2).
Algorithm (1-NN):
1. Find the example (x∗, t∗) in the stored training set closest to x. That is:
$$ x^* = \operatorname*{argmin}_{x^{(i)} \in \text{training set}} \operatorname{distance}\big(x^{(i)}, x\big) $$
2. Output y = t∗
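A minimal NumPy sketch of this 1-NN rule, with Euclidean distance standing in for the generic distance function (an assumption; any metric works):

```python
import numpy as np

def nearest_neighbor_predict(X_train, t_train, x):
    # Distance from x to every stored training example (Euclidean here).
    dists = np.linalg.norm(X_train - x, axis=1)
    # x* = argmin_{x^(i) in training set} distance(x^(i), x)
    i_star = np.argmin(dists)
    # Output the stored target t* of the closest example.
    return t_train[i_star]
```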
Algorithm (kNN):
1. Find the k examples {(x^{(i)}, t^{(i)})} closest to the test instance x
2. Classification output is the majority class:
$$ y = \operatorname*{argmax}_{t^{(z)}} \sum_{i=1}^{k} \mathbb{I}\big(t^{(z)} = t^{(i)}\big) $$
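A corresponding sketch for kNN, again assuming Euclidean distance; ties in the majority vote are broken arbitrarily here (the slides do not specify a tie-breaking rule):

```python
import numpy as np

def knn_predict(X_train, t_train, x, k):
    # Distances from x to all training examples.
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest examples.
    nearest = np.argsort(dists)[:k]
    # Majority vote: y = argmax_t sum_i I(t = t^(i)).
    classes, counts = np.unique(t_train[nearest], return_counts=True)
    return classes[np.argmax(counts)]
```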
Tradeoffs in choosing k?
Small k
• Good at capturing fine-grained patterns
• May overfit, i.e., be sensitive to random idiosyncrasies of the training data
Large k
• Makes stable predictions by averaging over lots of examples
• May underfit, i.e., fail to capture important regularities
Balancing k
• The optimal choice of k depends on the number of data points n.
• Nice theoretical properties if k → ∞ and k/n → 0.
• Rule of thumb: choose k < √n.
• We can choose k using a validation set (next slides); see the sketch below.
The test set is used only at the very end, to measure the generalization
performance of the final configuration.
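A sketch of choosing k on a held-out validation set, reusing the illustrative knn_predict from above; note the test set is touched only after k is fixed:

```python
import numpy as np

def choose_k(X_train, t_train, X_val, t_val, candidate_ks):
    # Evaluate each candidate k on the validation set; keep the best.
    best_k, best_acc = None, -1.0
    for k in candidate_ks:
        preds = np.array([knn_predict(X_train, t_train, x, k)
                          for x in X_val])
        acc = np.mean(preds == t_val)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k
```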
Image credit:
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_swiss_roll.html
Simple fix: normalize each dimension to be zero mean and unit variance.
I.e., compute the mean µ_j and standard deviation σ_j, and take
$$ \tilde{x}_j = \frac{x_j - \mu_j}{\sigma_j} $$
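A NumPy sketch of this standardization; the statistics are computed on the training set only and then reused for test data (and we assume no dimension is constant, so σ_j > 0):

```python
import numpy as np

def standardize(X_train, X_test):
    # Per-dimension mean mu_j and standard deviation sigma_j,
    # estimated from the training set only.
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    # Apply the same transform x~_j = (x_j - mu_j) / sigma_j to both sets.
    return (X_train - mu) / sigma, (X_test - mu) / sigma
```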
[Belongie, Malik, and Puzicha, 2002. Shape matching and object recognition using shape
contexts.]
Example: 80 Million Tiny Images
Simple algorithm that does all its work at test time — in a sense,
no learning!
Can control the complexity by varying k
Suffers from the Curse of Dimensionality (see the sketch below)
Next time: parametric models, which learn a compact summary of
the data rather than referring back to it at test time.
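To make the curse of dimensionality concrete, here is a small NumPy sketch (an illustration, not course material): for random points in the unit hypercube, the ratio of nearest to farthest distance approaches 1 as the dimension grows, so the "nearest" neighbor becomes barely closer than anything else.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:
    # 1000 random points in the unit hypercube [0, 1]^d.
    X = rng.random((1000, d))
    # Distances from the first point to all others.
    dists = np.linalg.norm(X[1:] - X[0], axis=1)
    # In high dimensions this ratio creeps toward 1:
    # nearest and farthest neighbors look almost alike.
    print(d, dists.min() / dists.max())
```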