Week 3: Learning I
Lectures on learning
§ Learning: a process for improving the performance of an agent
through experience
§ Learning I (today):
§ The general idea: generalization from experience
§ Supervised learning: classification and regression
§ Learning II: neural networks and deep learning
§ Reinforcement learning: learning complex V and Q functions
Supervised learning
§ To learn an unknown target function f
§ Input: a training set of labeled examples (xj,yj) where yj = f(xj)
§ E.g., xj is an image, f(xj) is the label “giraffe”
§ E.g., xj is a seismic signal, f(xj) is the label “explosion”
§ Output: hypothesis h that is “close” to f, i.e., predicts well on unseen
examples (“test set”)
§ Many possible hypothesis families for h
§ Linear models, logistic regression, neural networks, decision trees, instance-based methods (nearest-neighbor), grammars, kernelized separators, etc.
§ Classification = learning f with discrete output values
§ Regression = learning f with real-valued outputs
Inductive Learning (Science)
§ Simplest form: learn a function from examples
§ A target function: g
§ Examples: input-output pairs (x, g(x))
§ E.g. x is an email and g(x) is spam / ham
§ E.g. x is a house and g(x) is its selling price
§ Problem:
§ Given a hypothesis space H
§ Given a training set of examples (xi, g(xi))
§ Find a hypothesis h(x) such that h ~ g
§ Includes:
§ Classification (outputs = class labels)
§ Regression (outputs = real numbers)
Classification example: Object recognition
[Figure: an input image x; the task is to output the label f(x)]
Example: Spam Filter
§ Input: an email
§ Output: spam/ham
§ Setup:
§ Get a large collection of example emails, each labeled “spam” or “ham” (by hand)
§ Learn to predict labels of new incoming emails
§ Classifiers reject 200 billion spam emails per day
§ Features: the attributes used to make the ham/spam decision
§ Words: FREE!
§ Text Patterns: $dd, CAPS
§ Non-text: SenderInContacts, AnchorLinkMismatch
§ …
[Example emails shown on the slide:]
“Dear Sir. First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …”
“TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT. 99 MILLION EMAIL ADDRESSES FOR ONLY $99”
“Ok, I know this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened.”
Example: Digit Recognition
[Figure: a handwritten digit image, e.g. the digit “1”]
§ Features: The attributes used to make the digit decision
§ Pixels: (6,8)=ON
§ Shape Patterns: NumComponents, AspectRatio, NumLoops
§ …
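As a minimal sketch of how such features might be computed (the feature names and image layout here are assumptions for illustration, not a prescribed pipeline):

```python
import numpy as np

def digit_features(img):
    """Extract simple features from a binarized digit image.

    img: 2-D numpy array of 0/1 pixel values (at least 7 rows, 9 columns,
    so the (6,8) pixel from the slide exists).
    """
    rows = np.any(img, axis=1)               # rows containing any ON pixel
    cols = np.any(img, axis=0)               # columns containing any ON pixel
    height = max(int(rows.sum()), 1)         # bounding-box height
    width = int(cols.sum())                  # bounding-box width
    return {
        "pixel_6_8_on": bool(img[6, 8]),     # Pixels: (6,8)=ON
        "aspect_ratio": width / height,      # Shape pattern: AspectRatio
        "num_on_pixels": int(img.sum()),     # a simple extra feature
    }
    # NumComponents and NumLoops would need connected-component analysis,
    # omitted here to keep the sketch short.
```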
Other Classification Tasks
§ Medical diagnosis
§ input: symptoms
§ output: disease
§ Automatic essay grading
§ input: document
§ output: grades
§ Fraud detection
§ input: account activity
§ output: fraud / no fraud
§ Email routing
§ input: customer complaint email
§ output: which department needs to ignore this email
§ Fruit and vegetable inspection
§ input: image (or gas analysis)
§ output: moldy or OK
§ … many more
Regression example: Curve fitting
[A sequence of five figures showing different curves fit to the same data points]
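A minimal sketch of the idea behind a curve-fitting sequence like this (synthetic data and the particular degrees are assumptions for illustration): fitting polynomials of increasing degree to the same points drives training error toward zero, but the high-degree fit oscillates between the points.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)  # noisy samples

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)            # least-squares polynomial fit
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree {degree}: training MSE = {train_err:.4f}")
# Training error shrinks as the degree grows; the degree-9 curve passes
# (nearly) through every point yet generalizes poorly between them.
```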
Basic questions
§ Which hypothesis space H to choose?
§ How to measure degree of fit?
§ How to trade off degree of fit vs. complexity?
§ “Ockham’s razor”
§ How do we find a good h?
§ How do we know if a good h will predict well?
Training and Testing
A few important points about learning
§ Data: labeled instances, e.g. emails marked spam/ham
§ Training set
§ Held-out set (validation set)
§ Test set
§ Features: attribute-value pairs which characterize each x
§ Experimentation cycle
§ Learn parameters (e.g. model probabilities) on the training set
§ (Tune hyperparameters on the held-out set)
§ Compute accuracy on the test set
§ Very important: never “peek” at the test set!
§ Evaluation
§ Accuracy: fraction of instances predicted correctly
§ Overfitting and generalization
§ Want a classifier which does well on test data
§ Overfitting: fitting the training data very closely, but not generalizing well
§ Underfitting: fits the training set poorly
[Diagram: the data divided into Training Data, Held-Out Data (validation set), and Test Data]
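A minimal sketch of this experimentation cycle (the split fractions and the model's fit/predict interface are assumptions, not part of the lecture):

```python
import numpy as np

def split_data(X, y, train_frac=0.6, heldout_frac=0.2, seed=0):
    """Shuffle once, then split into training, held-out, and test sets."""
    idx = np.random.default_rng(seed).permutation(len(X))
    n_train = int(train_frac * len(X))
    n_held = int(heldout_frac * len(X))
    train = idx[:n_train]
    held = idx[n_train:n_train + n_held]
    test = idx[n_train + n_held:]        # never touched until the very end
    return (X[train], y[train]), (X[held], y[held]), (X[test], y[test])

def accuracy(model, X, y):
    """Fraction of instances predicted correctly."""
    return float(np.mean(model.predict(X) == y))

# Cycle: train on the training set, pick hyperparameters by held-out
# accuracy, and only then compute accuracy once on the test set.
```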
Linear regression
[Figure: house price ($1000s) vs. house size in square feet, with a fitted line through the data]
§ Prediction: hw(x) = w0 + w1x
[Figure: the vertical gap between an observation y and the prediction hw(x) is the error, or “residual”]
Find w
§ Define loss function
§ Loss = ____________________________
§ We want the weights w* that minimize loss
§ At w* the derivatives of loss w.r.t. each weight are zero:
§ ∂Loss/∂w0 = __________________________
§ ∂Loss/∂w1 = __________________________
§ Exact solutions for N examples:
§ w1 = [N Σj xjyj – (Σj xj)(Σj yj)] / [N Σj xj² – (Σj xj)²] and w0 = (1/N)[Σj yj – w1 Σj xj]
§ For the general case where x is an n-dimensional vector:
§ X is the data matrix (all the data, one example per row); y is the column of labels
§ w* = (XᵀX)⁻¹Xᵀy
Least squares: Minimizing squared error
§ L2 loss function: sum of squared errors over all examples
§ Loss = Σj (yj – hw(xj))² = Σj (yj – (w0 + w1xj))²
§ We want the weights w* that minimize loss
§ At w* the derivatives of loss w.r.t. each weight are zero:
§ ∂Loss/∂w0 = –2 Σj (yj – (w0 + w1xj)) = 0
§ ∂Loss/∂w1 = –2 Σj (yj – (w0 + w1xj)) xj = 0
§ Exact solutions for N examples:
§ w1 = [N Σj xjyj – (Σj xj)(Σj yj)] / [N Σj xj² – (Σj xj)²] and w0 = (1/N)[Σj yj – w1 Σj xj]
§ For the general case where x is an n-dimensional vector:
§ X is the data matrix (all the data, one example per row); y is the column of labels
§ w* = (XᵀX)⁻¹Xᵀy
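A sketch of both closed-form solutions in NumPy (solving the normal equations with np.linalg.solve rather than forming an explicit inverse, a standard numerical choice):

```python
import numpy as np

def fit_line(x, y):
    """Closed-form simple linear regression: the w1 and w0 from the slide."""
    N = len(x)
    w1 = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / \
         (N * np.sum(x ** 2) - np.sum(x) ** 2)
    w0 = (np.sum(y) - w1 * np.sum(x)) / N
    return w0, w1

def fit_linear(X, y):
    """General case: w* = (X^T X)^{-1} X^T y, with a column of 1s for w0."""
    X1 = np.column_stack([np.ones(len(X)), X])   # prepend intercept feature
    return np.linalg.solve(X1.T @ X1, X1.T @ y)  # solve the normal equations
```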
Regression vs Classification
§ Linear regression when output is binary, y ∈ {−1, 1}
§ hw(x) = w0 + w1x
[Figure: binary-labeled data with the regression line w0 + w1x]
§ Linear classification
§ Used with discrete output values
§ Threshold a linear function:
§ hw(x) = 1 if w0 + w1x ≥ 0
§ hw(x) = −1 if w0 + w1x < 0
§ w: weight vector
§ Activation function g
[Figure: the step function g(w0 + w1x)]
Threshold perceptron as linear classifier
Binary Decision Rule
§ A threshold perceptron is a single unit that outputs
§ y = hw(x) = 1 when w·x ≥ 0
§ y = hw(x) = −1 when w·x < 0
§ In the input vector space
§ Examples are points x
§ The equation w·x = 0 defines a hyperplane
§ One side corresponds to y = 1 (SPAM)
§ The other corresponds to y = −1 (HAM)
[Figure: feature space with axes “free” and “money”; with weights w0 = −3, wfree = 4, wmoney = 2, the line w·x = 0 separates the SPAM side (y = 1) from the HAM side (y = −1)]
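Using the weights from the figure (w0 = −3, wfree = 4, wmoney = 2), the decision rule is just the sign of a dot product. A minimal sketch, with word counts as the assumed feature values:

```python
import numpy as np

w = np.array([-3.0, 4.0, 2.0])   # [w0, w_free, w_money] from the figure

def classify(x):
    """Threshold perceptron: +1 (SPAM) if w.x >= 0, else -1 (HAM)."""
    return 1 if np.dot(w, x) >= 0 else -1

# x = [1 (bias), count of "free", count of "money"]
print(classify(np.array([1.0, 1.0, 1.0])))  # -3 + 4 + 2 =  3 -> +1 (SPAM)
print(classify(np.array([1.0, 0.0, 1.0])))  # -3 + 0 + 2 = -1 -> -1 (HAM)
```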
Example
[Figure: two example emails classified by the perceptron with weights w0 = −3, wfree = 4, wmoney = 2. One reads “Dear Stuart, I’m leaving Macrosoft to return to academia. The money is …”; the other (partly cut off) asks “… undergraduates! Do I need to finish my BA first before applying?” Each email’s feature vector (x0 = 1, plus counts of “free” and “money”) is dotted with w to decide SPAM vs HAM]
Non-Separable
Example: Earthquakes vs nuclear explosions
[Figures: (left) seismic events plotted in (x1, x2) feature space — the two classes overlap, so the data are not linearly separable; (right) proportion correct (0.4–1.0) vs. number of weight updates (0–700) for perceptron learning]
Non-Separable
§ Convergence: if the training data are non-separable, perceptron learning will converge to a minimum-error solution, provided the learning rate α is decayed appropriately (e.g., α = 1/t)
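A sketch of perceptron learning with the decayed rate α = 1/t (the data arrays and the mistake-driven form of the update are assumptions; this is the classic update w ← w + α·y·x on errors):

```python
import numpy as np

def train_perceptron(X, y, epochs=100):
    """Perceptron learning with decayed learning rate alpha = 1/t.

    X: (N, d) array with a bias feature in column 0; y: labels in {-1, +1}.
    """
    w = np.zeros(X.shape[1])
    t = 0
    for _ in range(epochs):
        for xj, yj in zip(X, y):
            t += 1
            alpha = 1.0 / t                        # decayed learning rate
            pred = 1 if np.dot(w, xj) >= 0 else -1
            if pred != yj:                         # update only on mistakes
                w += alpha * yj * xj
    return w
```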
Perceptron learning with fixed α
[Figures: (left) the earthquake/explosion data in (x1, x2) space with the learned boundary; (right) proportion correct vs. number of weight updates (0–100,000) — with a fixed learning rate, accuracy keeps fluctuating]
Perceptron learning with decayed α
[Figures: the same data and axes; with the decayed learning rate α = 1/t, the proportion correct settles as updates accumulate]
[Figure: the threshold function Threshold(w · x) plotted against x — a step function]
Perceptrons hopeless for XOR function
[Figure: three panels plotting x2 vs. x1 with points at the corners of the unit square: (a) x1 and x2 and (b) x1 or x2 are linearly separable, but (c) x1 xor x2 is not]
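One quick way to see this concretely: brute-force a grid of weight vectors and observe that none classifies all four XOR points correctly (the grid itself is an illustrative assumption; the impossibility holds for every linear rule):

```python
import itertools
import numpy as np

# The four XOR examples, with a bias feature of 1 in column 0.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])   # x1 xor x2, encoded as -1/+1

best = 0
for w in itertools.product(np.linspace(-3, 3, 13), repeat=3):
    preds = np.where(X @ np.array(w) >= 0, 1, -1)
    best = max(best, int(np.sum(preds == y)))
print(best)   # prints 3: every linear rule gets at most 3 of the 4 points right
```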
Basic questions
§ Which hypothesis space H to choose?
§ How to measure degree of fit?
§ How to trade off degree of fit vs. complexity?
§ “Ockham’s razor”
§ How do we find a good h?
§ How do we know if a good h will predict well?
Classical stats/ML: Minimize loss function
§ Which hypothesis space H to choose?
§ E.g., linear combinations of features: hw(x) = wTx
§ How to measure degree of fit?
§ Loss function, e.g., squared error Σj (yj – wTxj)2
§ How to trade off degree of fit vs. complexity?
§ Regularization: complexity penalty, e.g., ||w||2
§ How do we find a good h?
§ Optimization (closed-form, numerical); discrete search
§ How do we know if a good h will predict well?
§ Try it and see (cross-validation, bootstrap, etc.)
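“Try it and see” can be as simple as k-fold cross-validation. A minimal sketch (the make_model factory and its fit/predict interface are assumed for illustration):

```python
import numpy as np

def cross_val_accuracy(make_model, X, y, k=5, seed=0):
    """k-fold cross-validation: average accuracy over k held-out folds."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = make_model()                 # fresh model for each fold
        model.fit(X[train], y[train])
        scores.append(np.mean(model.predict(X[test]) == y[test]))
    return float(np.mean(scores))
```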
Probabilistic: Max. likelihood, max. a posteriori
§ Which hypothesis space H to choose?
§ Probability model P(y | x,h) , e.g., Y ~ N(wTx,σ2)
§ How to measure degree of fit?
§ Data likelihood Πj P(yj | xj,h)
§ How to trade off degree of fit vs. complexity?
§ Regularization or prior: argmaxh P(h) Πj P(yj | xj,h) (maximum a posteriori, MAP)
§ How do we find a good h?
§ Optimization (closed-form, numerical); discrete search
§ How do we know if a good h will predict well?
§ Empirical process theory (generalizes Chebyshev, CLT, PAC…);
§ Key assumption: the data are i.i.d. (independent, identically distributed)
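For the Gaussian model on this slide, the MAP objective with a Gaussian prior on w reduces to least squares plus the familiar ridge penalty. A sketch (the prior and the λ parameterization are stated assumptions):

```python
import numpy as np

def fit_ridge(X, y, lam=1.0):
    """MAP estimate for Y ~ N(w^T x, sigma^2) with prior w ~ N(0, tau^2 I).

    Equivalent to minimizing sum_j (y_j - w^T x_j)^2 + lam * ||w||^2,
    with lam = sigma^2 / tau^2. Closed form: (X^T X + lam I)^{-1} X^T y.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```

Setting lam=0 recovers the maximum-likelihood (plain least-squares) solution.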
Bayesian: Computing posterior over H
§ Which hypothesis space H to choose?
§ All hypotheses with nonzero a priori probability
§ How to measure degree of fit?
§ Data probability, as for MLE/MAP
§ How to trade off degree of fit vs. complexity?
§ Use prior, as for MAP
§ How do we find a good h?
§ Don’t! Bayes predictor: P(y|x,D) = Σh P(y|x,h) P(h|D) ∝ Σh P(y|x,h) P(D|h) P(h)
§ How do we know if a good h will predict well?
§ Silly question! Bayesian prediction is optimal!!
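A sketch of the Bayes predictor over a tiny discrete hypothesis space (representing each hypothesis as a function returning P(y | x, h) is an illustrative choice, not part of the lecture):

```python
import numpy as np

def bayes_predict(hypotheses, priors, data, x_new, y_new):
    """Posterior-weighted prediction: P(y|x,D) = sum_h P(y|x,h) P(h|D).

    hypotheses: list of functions h(y, x) returning P(y | x, h)
    priors: P(h) for each hypothesis; data: list of (xj, yj) pairs.
    """
    # Unnormalized posterior: P(h) * prod_j P(yj | xj, h)
    post = np.array([p * np.prod([h(yj, xj) for xj, yj in data])
                     for h, p in zip(hypotheses, priors)])
    post /= post.sum()                       # normalize to get P(h|D)
    return sum(w * h(y_new, x_new) for h, w in zip(hypotheses, post))
```

No single h is ever selected; every hypothesis contributes in proportion to its posterior probability.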
Acknowledgement