
CSC 311: Introduction to Machine Learning

Lecture 1 - Introduction and Nearest Neighbors

Roger Grosse Chris Maddison Juhan Bae Silviu Pitis

University of Toronto, Fall 2020



This course

Broad introduction to machine learning
- Algorithms and principles for supervised learning
  - nearest neighbors, decision trees, ensembles, linear regression, logistic regression, SVMs
- Unsupervised learning: PCA, K-means, mixture models
- Basics of reinforcement learning

Coursework is aimed at advanced undergrads. We will use multivariate calculus, probability, and linear algebra.



Course Information

Course Website: https://www.cs.toronto.edu/~rgrosse/courses/csc311_f20/
Main source of information is the course webpage; check regularly!

Announcements, grades, & links: Quercus.


Did you receive the announcement?

Discussions: Piazza.
Sign up: https://piazza.com/utoronto.ca/fall2020/csc311
Your grade does not depend on your participation on Piazza. It's just a good way to ask questions and discuss with your instructor, TAs, and peers. We will only allow questions that are related to the course materials/assignments/exams.



Office hours: This week we are trialling Gather Town.
Roger Grosse, Monday 1PM-3PM
Silviu Pitis, Monday 6PM-8PM
Juhan Bae, Thursday 2PM-4PM
Chris Maddison, Friday 10AM-12PM

You only need to pay attention to the course website for content and
Quercus for links.



Course Information

Lectures will be delivered synchronously via Zoom, and recorded for asynchronous viewing by enrolled students. All information about attending virtual lectures, tutorials, and office hours will be sent to enrolled students through Quercus.
You may download recorded lectures for your own academic use,
but you should not copy, share, or use them for any other purpose.
During lecture, please keep yourself on mute unless called upon.
In case of illness, you should fill out the absence declaration form
on ACORN and notify the instructors to request special
consideration.
For accessibility services: If you require additional academic
accommodations, please contact UofT Accessibility Services as
soon as possible, studentlife.utoronto.ca/as.



Course Information

Poll on course schedule!



Course Information

Recommended readings will be given for each lecture. But the following will be useful throughout the course:
- Hastie, Tibshirani, and Friedman: "The Elements of Statistical Learning"
- Christopher Bishop: "Pattern Recognition and Machine Learning", 2006.
- Kevin Murphy: "Machine Learning: a Probabilistic Perspective", 2012.
- David MacKay: "Information Theory, Inference, and Learning Algorithms", 2003.
- Shai Shalev-Shwartz & Shai Ben-David: "Understanding Machine Learning: From Theory to Algorithms", 2014.
- David Barber: "Bayesian Reasoning and Machine Learning", 2012.
- Richard S. Sutton and Andrew G. Barto: "Reinforcement Learning: An Introduction" (2nd ed.), 2018.
There are lots of freely available, high-quality ML resources.



Requirements and Marking

(45%) 4 assignments
- Combination of pen & paper derivations and programming exercises
- Weighted equally

(5%) Read some classic papers
- Worth 5%, honor system

(25%) Two 1-hour exams held during normal class time
- Your higher mark will count for 15%, and your lower mark for 10%
- See website for times and dates (tentative)

(25%) Project
- Will require you to apply several algorithms to a challenge problem and to write a short report analyzing the results
- Due during the final evaluation period
- More details TBA



More on Assignments

Collaboration on the assignments is not allowed. Each student is responsible for his/her
own work. Discussion of assignments should be limited to clarification of the handout itself,
and should not involve any sharing of pseudocode or code or simulation results. Violation of
this policy is grounds for a semester grade of F, in accordance with university regulations.

The schedule of assignments will be posted on the course webpage.

Assignments should be handed in by the deadline; a late penalty of 10% per day will be assessed thereafter (up to 3 days, after which submission is blocked).

Extensions will be granted only in special situations, and you will need to complete an
absence declaration form and notify us to request special consideration, or otherwise have a
written request approved by the course instructors at least one week before the due date.



Related Courses

More advanced ML courses such as CSC413 (Neural Networks and Deep Learning) and CSC412 (Probabilistic Learning and Reasoning) both build upon the material in this course.
If you’ve already taken an applied statistics course, there will be
some overlap.
This is the second academic year this course is listed only as an
undergrad course. Previously it was CSC411, with a bit more
content and heavier workload. We borrow liberally from the
previous editions.



What is learning?

"The activity or process of gaining knowledge or skill by studying, practicing, being taught, or experiencing something."

    Merriam-Webster dictionary

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

    Tom Mitchell




What is machine learning?

For many problems, it's difficult to program the correct behavior by hand
- recognizing people and objects
- understanding human speech

Machine learning approach: program an algorithm to automatically learn from data, or from experience

Why might you want to use a learning algorithm?
- hard to code up a solution by hand (e.g. vision, speech)
- system needs to adapt to a changing environment (e.g. spam detection)
- want the system to perform better than the human programmers
- privacy/fairness (e.g. ranking search results)



What is machine learning?

It's similar to statistics...
- Both fields try to uncover patterns in data
- Both fields draw heavily on calculus, probability, and linear algebra, and share many of the same core algorithms

But it's not statistics!
- Stats is more concerned with helping scientists and policymakers draw good conclusions; ML is more concerned with building autonomous agents
- Stats puts more emphasis on interpretability and mathematical rigor; ML puts more emphasis on predictive performance, scalability, and autonomy



Relations to AI

Nowadays, "machine learning" is often brought up together with "artificial intelligence" (AI).

AI does not always imply a learning-based system:
- Symbolic reasoning
- Rule-based systems
- Tree search
- etc.

A learning-based system is learned from data, which gives it more flexibility and makes it good at solving pattern recognition problems.



Relations to human learning

Human learning is:
- Very data efficient
- An entire multitasking system (vision, language, motor control, etc.)
- Takes at least a few years :)

Machine learning built to serve a specific purpose doesn't have to look like human learning in the end.

It may borrow ideas from biological systems, e.g., neural networks.

It may perform better or worse than humans.



What is machine learning?

Types of machine learning:
- Supervised learning: have labeled examples of the correct behavior
- Reinforcement learning: learning system (agent) interacts with the world and learns to maximize a scalar reward signal
- Unsupervised learning: no labeled examples; instead, looking for "interesting" patterns in the data



History of machine learning

1957 — Perceptron algorithm (implemented as a circuit!)

1959 — Arthur Samuel wrote a learning-based checkers program that could defeat him

1969 — Minsky and Papert's book Perceptrons (limitations of linear models)

1980s — Some foundational ideas
- Connectionist psychologists explored neural models of cognition
- 1984 — Leslie Valiant formalized the problem of learning as PAC learning
- 1986 — Backpropagation (re-)discovered by Geoffrey Hinton and colleagues
- 1988 — Judea Pearl's book Probabilistic Reasoning in Intelligent Systems introduced Bayesian networks



History of machine learning
1990s — the "AI Winter", a time of pessimism and low funding

But looking back, the '90s were also sort of a golden age for ML research:
- Markov chain Monte Carlo
- variational inference
- kernels and support vector machines
- boosting
- convolutional networks
- reinforcement learning

2000s — applied AI fields (vision, NLP, etc.) adopted ML

2010s — deep learning
- 2010–2012 — neural nets smashed previous records in speech-to-text and object recognition
- increasing adoption by the tech industry
- 2016 — AlphaGo defeated the human Go champion
- 2018–now — generating photorealistic images and videos
- 2020 — GPT-3 language model

Now — increasing attention to ethical and societal implications
Computer vision: Object detection, semantic segmentation, pose
estimation, and almost every other task is done with ML.

Instance segmentation - Link



Speech: Speech to text, personal assistants, speaker identification...



NLP: Machine translation, sentiment analysis, topic modeling, spam
filtering.



Playing Games

DOTA2 - Link



E-commerce & Recommender Systems: Amazon, Netflix, ...



Why this class?

Why not jump straight to CSC412/413, and learn neural nets first?
The principles you learn in this course will be essential to
understand and apply neural nets.
The techniques in this course are still the first things to try for a
new ML problem.
I E.g., try logistic regression before building a deep neural net!
There’s a whole world of probabilistic graphical models.



Why this class?
2017 Kaggle survey of data science and ML practitioners: what data
science methods do you use at work?



ML Workflow

ML workflow sketch:
1. Should I use ML on this problem?
   - Is there a pattern to detect?
   - Can I solve it analytically?
   - Do I have data?
2. Gather and organize data.
   - Preprocessing, cleaning, visualizing.
3. Establish a baseline.
4. Choose a model, loss, regularization, ...
5. Optimization (could be simple, could be a PhD...).
6. Hyperparameter search.
7. Analyze performance & mistakes, and iterate back to step 4 (or 2).



Implementing machine learning systems

You will often need to derive an algorithm (with pencil and paper), and then translate the math into code.

Array processing (NumPy):
- vectorize computations (express them in terms of matrix/vector operations) to exploit hardware efficiency
- this also makes your code cleaner and more readable! (see the sketch below)
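
For instance, a minimal sketch (with illustrative array names and random data, not taken from the course materials) comparing a Python loop to a vectorized NumPy expression for the same computation:

```python
import numpy as np

# Illustrative data: N points in d dimensions and one query point.
X = np.random.rand(1000, 10)   # training inputs, shape (N, d)
x = np.random.rand(10)         # query input, shape (d,)

# Loop version: one squared distance at a time (slow in pure Python).
dists_loop = np.array([np.sum((X[i] - x) ** 2) for i in range(X.shape[0])])

# Vectorized version: broadcasting subtracts x from every row at once.
dists_vec = np.sum((X - x) ** 2, axis=1)

assert np.allclose(dists_loop, dists_vec)
```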



Implementing machine learning systems

Neural net frameworks: PyTorch, TensorFlow, JAX, etc.
- automatic differentiation
- compiling computation graphs
- libraries of algorithms and network primitives
- support for graphics processing units (GPUs)

Why take this class if these frameworks do so much for you?
- So you know what to do if something goes wrong!
- Debugging learning algorithms requires sophisticated detective work, which requires understanding what goes on under the hood.
- That's why we derive things by hand in this class!



Preliminaries and Nearest Neighbor Methods



Introduction

Today (and for much of this course) we focus on supervised learning. This means we are given a training set consisting of inputs and corresponding labels, e.g.

Task                      Inputs           Labels
object recognition        image            object category
image captioning          image            caption
document classification   text             document category
speech-to-text            audio waveform   text
...                       ...              ...



Input Vectors

What an image looks like to the computer:

[Image credit: Andrej Karpathy]



Input Vectors

Machine learning algorithms need to handle lots of types of data: images, text, audio waveforms, credit card transactions, etc.

Common strategy: represent the input as an input vector in R^d
- Representation = mapping to another space that's easy to manipulate
- Vectors are a great representation since we can do linear algebra!



Input Vectors
Can use raw pixels:

Can do much better if you compute a vector of meaningful features.
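
As a concrete illustration, a minimal sketch (the random array stands in for real pixel data) of turning a grayscale image into a raw-pixel vector in R^d:

```python
import numpy as np

# Stand-in for a 28x28 grayscale image with pixel intensities in [0, 255].
image = np.random.randint(0, 256, size=(28, 28))

# Raw-pixel representation: flatten into a vector in R^784, scaled to [0, 1].
x = image.reshape(-1).astype(np.float64) / 255.0
print(x.shape)  # (784,)
```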


Input Vectors

Mathematically, our training set consists of a collection of pairs of an input vector x ∈ R^d and its corresponding target, or label, t
- Regression: t is a real number (e.g. stock price)
- Classification: t is an element of a discrete set {1, . . . , C}
- These days, t is often a highly structured object (e.g. image)

Denote the training set {(x^(1), t^(1)), . . . , (x^(N), t^(N))}
- Note: these superscripts have nothing to do with exponentiation!



Nearest Neighbors
Suppose we’re given a novel input vector x we’d like to classify.
The idea: find the nearest input vector to x in the training set and copy
its label.
Can formalize "nearest" in terms of Euclidean distance:

    \|x^{(a)} - x^{(b)}\|_2 = \sqrt{\sum_{j=1}^{d} \left(x_j^{(a)} - x_j^{(b)}\right)^2}

Algorithm:
1. Find the example (x*, t*) from the stored training set closest to x. That is:

       x^* = \operatorname{argmin}_{x^{(i)} \in \text{training set}} \; \operatorname{distance}(x^{(i)}, x)

2. Output y = t*

Note: we don’t need to compute the square root. Why?
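
A minimal NumPy sketch of this rule (the function name and the toy data are illustrative assumptions, not from the slides):

```python
import numpy as np

def nearest_neighbor_predict(X_train, t_train, x):
    """Classify x by copying the label of its nearest training point."""
    # Squared Euclidean distances; skipping the square root doesn't change the argmin.
    dists = np.sum((X_train - x) ** 2, axis=1)
    return t_train[np.argmin(dists)]

# Toy example with three stored points and two classes.
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
t_train = np.array([0, 0, 1])
print(nearest_neighbor_predict(X_train, t_train, np.array([4.0, 4.5])))  # prints 1
```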


Nearest Neighbors: Decision Boundaries

We can visualize the behavior in the classification setting using a Voronoi diagram.



Nearest Neighbors: Decision Boundaries

Decision boundary: the boundary between regions of input space assigned to different categories.



Nearest Neighbors: Decision Boundaries

Example: 2D decision boundary





k-Nearest Neighbors
[Pic by Olga Veksler]

Nearest neighbors is sensitive to noise or mis-labeled data ("class noise").

Solution? Smooth by having the k nearest neighbors vote.

Algorithm (kNN):
1. Find the k examples {(x^(i), t^(i))} closest to the test instance x
2. Classification output is the majority class:

       y = \operatorname{argmax}_{t^{(z)}} \sum_{i=1}^{k} I(t^{(z)} = t^{(i)})

I{statement} is the indicator function, equal to one whenever the statement is true. We could also write this as δ(t^(z), t^(i)), with δ(a, b) = 1 if a = b, 0 otherwise.
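
A minimal sketch of this voting rule in NumPy (the helper name is my own; it assumes integer class labels):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, t_train, x, k=3):
    """Classify x by a majority vote among its k nearest training points."""
    dists = np.sum((X_train - x) ** 2, axis=1)   # squared Euclidean distances
    nearest = np.argsort(dists)[:k]              # indices of the k closest points
    votes = Counter(t_train[nearest])            # label counts among those neighbors
    return votes.most_common(1)[0][0]            # majority class (ties broken arbitrarily)
```

With k = 1 this reduces to the nearest-neighbor rule from the previous slide.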
K-Nearest neighbors
k=1

[Image credit: ”The Elements of Statistical Learning”]


K-Nearest neighbors
k=15

[Image credit: ”The Elements of Statistical Learning”]


k-Nearest Neighbors

Tradeoffs in choosing k?

Small k
- Good at capturing fine-grained patterns
- May overfit, i.e. be sensitive to random idiosyncrasies in the training data

Large k
- Makes stable predictions by averaging over lots of examples
- May underfit, i.e. fail to capture important regularities

Balancing k
- Optimal choice of k depends on the number of data points n.
- Nice theoretical properties if k → ∞ and k/n → 0.
- Rule of thumb: choose k < √n.
- We can choose k using a validation set (next slides).



K-Nearest neighbors

We would like our algorithm to generalize to data it hasn't seen before.

We can measure the generalization error (error rate on new examples) using a test set.

[Image credit: ”The Elements of Statistical Learning”]


Validation and Test Sets

k is an example of a hyperparameter, something we can't fit as part of the learning algorithm itself.

We can tune hyperparameters using a validation set.

The test set is used only at the very end, to measure the generalization performance of the final configuration.
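
A minimal, self-contained sketch of this procedure (the toy data, split sizes, and helper functions are illustrative assumptions, not part of the course materials):

```python
import numpy as np

# Toy data: two Gaussian blobs (purely illustrative).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(3.0, 1.0, (100, 2))])
t = np.concatenate([np.zeros(100, dtype=int), np.ones(100, dtype=int)])

# Shuffle, then split into train / validation / test.
idx = rng.permutation(len(X))
X, t = X[idx], t[idx]
X_tr, t_tr = X[:120], t[:120]
X_val, t_val = X[120:160], t[120:160]
X_te, t_te = X[160:], t[160:]

def knn_predict(X_train, t_train, x, k):
    """Majority vote among the k nearest training points."""
    dists = np.sum((X_train - x) ** 2, axis=1)
    nearest = np.argsort(dists)[:k]
    return int(np.argmax(np.bincount(t_train[nearest])))

def error_rate(X_eval, t_eval, k):
    preds = np.array([knn_predict(X_tr, t_tr, x, k) for x in X_eval])
    return float(np.mean(preds != t_eval))

# Choose k on the validation set...
candidate_ks = [1, 3, 5, 9, 15]
best_k = min(candidate_ks, key=lambda k: error_rate(X_val, t_val, k))

# ...and touch the test set only once, to estimate generalization error.
print(best_k, error_rate(X_te, t_te, best_k))
```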



Pitfalls: The Curse of Dimensionality
Low-dimensional visualizations are misleading! In high dimensions, "most" points are far apart.

If we want the nearest neighbor to be closer than ε, how many points do we need to guarantee it?
- The volume of a single ball of radius ε is O(ε^d).
- The total volume of [0, 1]^d is 1.
- Therefore O((1/ε)^d) balls are needed to cover the volume.
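
To get a feel for that scaling, a tiny sketch (the choice ε = 0.1 is just an illustrative assumption):

```python
# Number of radius-epsilon balls needed to cover [0, 1]^d, up to constants.
epsilon = 0.1
for d in [1, 2, 10, 100]:
    print(d, (1 / epsilon) ** d)   # 10, 100, 1e10, 1e100: exponential in d
```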


[Image credit: ”The Elements of Statistical Learning”]


Pitfalls: The Curse of Dimensionality

In high dimensions, "most" points are approximately the same distance apart.

We can show this by applying the rules of expectation and covariance of random variables in surprising ways ("optional" homework question coming up...).

Picture to keep in mind:



Pitfalls: The Curse of Dimensionality
Saving grace: some datasets (e.g. images) may have low intrinsic
dimension, i.e. lie on or near a low-dimensional manifold.

Image credit: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_swiss_roll.html

The neighborhood structure (and hence the Curse of Dimensionality) depends on the intrinsic dimension.

The space of megapixel images is 3 million-dimensional. The true number of degrees of freedom is much smaller.



Pitfalls: Normalization

Nearest neighbors can be sensitive to the ranges of different features. Often, the units are arbitrary:

Simple fix: normalize each dimension to be zero mean and unit variance. I.e., compute the mean µ_j and standard deviation σ_j, and take

    \tilde{x}_j = \frac{x_j - \mu_j}{\sigma_j}

Caution: depending on the problem, the scale might be important!
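
A minimal sketch of this standardization in NumPy (the small constant guarding against zero variance is my own addition):

```python
import numpy as np

def standardize(X_train, X_test):
    """Normalize each feature to zero mean and unit variance using training-set statistics."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + 1e-12   # avoid division by zero for constant features
    return (X_train - mu) / sigma, (X_test - mu) / sigma
```

Note that µ_j and σ_j are computed on the training set only and then reused for the test set, so information about the test data does not leak into preprocessing.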



Pitfalls: Computational Cost

Number of computations at training time: 0

Number of computations at test time, per query (naive algorithm):
- Calculate D-dimensional Euclidean distances with N data points: O(ND)
- Sort the distances: O(N log N)

This must be done for each query, which is very expensive by the standards of a learning algorithm!

Need to store the entire dataset in memory!

Tons of work has gone into algorithms and data structures for efficient nearest neighbors with high dimensions and/or large datasets.
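
As one small example of such a speedup, a sketch (my own, not from the slides) using NumPy's argpartition to find the k smallest distances in average O(N) time instead of fully sorting all N of them; real large-scale systems go much further with approximate nearest-neighbor indexes:

```python
import numpy as np

def k_nearest_indices(X_train, x, k):
    """Indices of the k training points closest to x, without sorting all N distances."""
    dists = np.sum((X_train - x) ** 2, axis=1)
    return np.argpartition(dists, k)[:k]   # average O(N) rather than O(N log N)
```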



Example: Digit Classification
Decent performance when lots of data



Example: Digit Classification

KNN can perform a lot better with a good similarity measure.

Example: shape contexts for object recognition. In order to achieve invariance to image transformations, they tried to warp one image to match the other image.
- Distance measure: average distance between corresponding points on warped images

Achieved 0.63% error on MNIST, compared with 3% for Euclidean KNN. Competitive with conv nets at the time, but required careful engineering.

[Belongie, Malik, and Puzicha, 2002. Shape matching and object recognition using shape
contexts.]
Example: 80 Million Tiny Images

80 Million Tiny Images was the first extremely large image dataset. It consisted of color images scaled down to 32 × 32.

With a large dataset, you can find much better semantic matches, and KNN can do some surprising things.

Note: this required a carefully chosen similarity metric.

[Torralba, Fergus, and Freeman, 2007. 80 Million Tiny Images.]



Example: 80 Million Tiny Images

[Torralba, Fergus, and Freeman, 2007. 80 Million Tiny Images.]


Conclusions

Simple algorithm that does all its work at test time — in a sense,
no learning!
Can control the complexity by varying k
Suffers from the Curse of Dimensionality
Next time: parametric models, which learn a compact summary of
the data rather than referring back to it at test time.



Questions?

