Introduction to
Machine Learning
https://introml.mit.edu
Spring 2023!

Marzyeh Ghassemi ([email protected])
Tomas Lozano-Perez ([email protected])
Wojciech Matusik ([email protected])
Vince Monardo ([email protected])
Shen Shen ([email protected])
Ashia Wilson ([email protected])
Full Staff
plus ~7 awesome LAs
Section 4 staff (Recitation + Lab and Lab roles)
Rest of Today
● Start our ML journey with an overview
● Work through recitation handout with others at your table
● Ask questions by putting yourself in the help queue
● No worries if you don't have introml access yet; it's a great chance to get to know
your neighbor (ask them to put you in the queue)
What we're teaching: Machine Learning!
Given:
• a collection of examples (gene sequences, documents, tree sections)
• an encoding of those examples in a computer (as vectors)
Derive:
• a computational model (called a hypothesis) that describes relationships
within and among the examples, and that is expected to characterize new
examples from that same population well, so as to make good predictions or decisions
A model might:
• classify images of cells as to whether they're cancerous
• specify groupings (clusters) of documents that address similar topics
• steer a car appropriately given lidar images of the surroundings
Very roughly, ML can be categorized into supervised, unsupervised, and reinforcement learning
(the categorization can be refined, e.g. there are active learning, semi-supervised, selective, contrastive,
few-shot, inverse reinforcement learning… )
[Figure 2: Two-dimensional PCA projection of the 1000-dimensional Skip-gram vectors of countries and their
capital cities; visible labels include Turkey, Ankara, Tokyo, Poland, Greece, Athens, Rome, Spain]
which is used to replace every log P(w_O | w_I) term in the Skip-gram objective. Thus the task is to
distinguish the target word w_O from draws from the noise distribution P_n(w) using logistic
regression, where there are k negative samples for each data sample. Our experiments indicate that
values of k in the range 5–20 are useful for small training datasets, while for large datasets the k can
be as small as 2–5. The main difference between the Negative sampling and NCE is that NCE needs
both samples and the numerical probabilities of the noise distribution, while Negative sampling uses
only samples. And while NCE approximately maximizes the log probability of the softmax, this
property is not important for our application.
Both NCE and NEG have the noise distribution P_n(w) as a free parameter. We investigated a number
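For concreteness, a minimal NumPy sketch of the NEG objective for a single (input word, context word) pair, assuming the word vectors are already given as arrays; the function name and arguments are illustrative, not from the paper or the course code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_objective(v_in, v_out_pos, v_out_negs):
    # v_in:       input vector of the center word w_I, shape (d,)
    # v_out_pos:  output vector of the observed context word w_O, shape (d,)
    # v_out_negs: output vectors of k words drawn from the noise
    #             distribution P_n(w), shape (k, d)
    pos_term = np.log(sigmoid(v_out_pos @ v_in))               # score w_O as a "real" pair
    neg_term = np.sum(np.log(sigmoid(-(v_out_negs @ v_in))))   # score noise words as "fake"
    return pos_term + neg_term  # maximized over the word vectors
```

Maximizing this is exactly the logistic-regression task described above: distinguish w_O from the k noise samples.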
In very large corpora, the most frequent words can easily occur hundreds of millions of times (e.g.,
“in”, “the”, and “a”). Such words usually provide less information value than the rare words. For
example, while the Skip-gram model benefits from observing the co-occurrences of “France” and
“Paris”, it benefits much less from observing the frequent co-occurrences of “France” and “the”, as
nearly every word co-occurs frequently within a sentence with “the”. This idea can also be applied
in the opposite direction; the vector representations of frequent words do not change significantly
after training on several million examples.
[Slides adapted from 6.790]
To counter the imbalance between the rare and frequent words, we used a simple subsampling
approach: each word w_i in the training set is discarded with probability computed by the formula

P(w_i) = 1 − √(t / f(w_i))          (5)

where f(w_i) is the frequency of word w_i and t is a chosen threshold, typically around 10^-5.
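A minimal sketch of this discard rule (the function name and the clamping at zero are my additions; the threshold default follows the paper's suggestion of about 10^-5):

```python
import numpy as np

def discard_prob(freq, t=1e-5):
    # freq: relative frequency f(w_i) of the word in the corpus
    # P(w_i) = 1 - sqrt(t / f(w_i)), clamped at 0 for rare words
    return max(0.0, 1.0 - np.sqrt(t / freq))

# A very frequent word ("the", f ≈ 0.05) is discarded most of the time,
# while a rare word (f = 1e-7) is always kept.
print(discard_prob(0.05))   # ≈ 0.986
print(discard_prob(1e-7))   # 0.0
```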
ChatGPT
Reinforcement learning
• Feature vector
• Label
• Training data
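To make these three ingredients concrete, here is a minimal sketch (array names and shapes are illustrative assumptions, not the course's code) of a dataset of n feature vectors in R^d with one label each:

```python
import numpy as np

n, d = 100, 3                 # n examples, each with d features
X = np.random.randn(n, d)    # feature vectors, one per row
y = np.random.randn(n)       # labels, one per example (real-valued here)

# The training data is the collection of (feature vector, label) pairs
training_data = list(zip(X, y))
print(X.shape, y.shape, len(training_data))   # (100, 3) (100,) 100
```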
What do we want?
We want a “good” way to label new feature
vectors
• How to label? Learn a hypothesis
How well our hypothesis labels new feature vectors depends largely
on how expressive the hypothesis class is.
What do we want?
We may consider the class of linear
regressors:
• Hypotheses take the form: h(x; θ, θ₀) = θᵀx + θ₀
• Parameters to learn: Θ = (θ, θ₀)
• What we really want is to generalize to future data!
• What we don’t want:
• Model does not capture the input-output relationship (e.g.,
hypothesis class not expressive enough) → Underfitting
• Model too specific to training data → Overfitting
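As a sketch of what a hypothesis from this class looks like in code (the least-squares fit and all names below are illustrative assumptions, not the course's reference implementation):

```python
import numpy as np

def linear_hypothesis(x, theta, theta_0):
    # h(x; theta, theta_0) = theta^T x + theta_0
    return theta @ x + theta_0

def fit_linear(X, y):
    # One way to pick parameters from training data: ordinary least squares,
    # after appending a column of 1s so theta_0 is learned too.
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
    w, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    return w[:-1], w[-1]

# Example on synthetic data generated by a known linear rule
X = np.random.randn(50, 2)
y = X @ np.array([2.0, -1.0]) + 0.5
theta, theta_0 = fit_linear(X, y)
print(linear_hypothesis(np.array([1.0, 1.0]), theta, theta_0))   # ≈ 1.5
```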
How good is a hypothesis?
Hopefully it predicts well on future data
• How good is a regressor at one point? Measure a loss, e.g. squared loss
L(g, a) = (g − a)^2
g: guess,
a: actual
• Training error: average loss over the n training examples,
E_train(h) = (1/n) * sum over i of L(h(x^(i)), y^(i))
• What we want: low error on new data from the same population (test error), not just low training error
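A minimal sketch of these two quantities (squared loss at one point, and training error as the average loss over the training set); the hypothesis h used below is an illustrative stand-in:

```python
import numpy as np

def squared_loss(guess, actual):
    # Loss at one point: L(g, a) = (g - a)^2
    return (guess - actual) ** 2

def training_error(h, X, y):
    # Average loss of hypothesis h over the n training examples
    guesses = np.array([h(x) for x in X])
    return float(np.mean(squared_loss(guesses, y)))

# Example: a linear hypothesis evaluated on data it fits exactly
theta, theta_0 = np.array([2.0, -1.0]), 0.5
h = lambda x: theta @ x + theta_0
X = np.random.randn(50, 2)
y = X @ theta + theta_0
print(training_error(h, X, y))   # 0.0: every guess equals the actual label
```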