Week3_LearningI

CSCI218: Foundations of Artificial Intelligence
Lectures on learning
§ Learning: a process for improving the performance of an agent
through experience
§ Learning I (today):
§ The general idea: generalization from experience
§ Supervised learning: classification and regression
§ Learning II: neural networks and deep learning
§ Reinforcement learning: learning complex V and Q functions
Supervised learning
§ To learn an unknown target function f
§ Input: a training set of labeled examples (xj,yj) where yj = f(xj)
§ E.g., xj is an image, f(xj) is the label “giraffe”
§ E.g., xj is a seismic signal, f(xj) is the label “explosion”
§ Output: hypothesis h that is “close” to f, i.e., predicts well on unseen
examples (“test set”)
§ Many possible hypothesis families for h
§ Linear models, logistic regression, neural networks, decision trees, examples (nearest-neighbor), grammars, kernelized separators, etc.
§ Classification = learning f with discrete output value
§ Regression = learning f with real-valued output value
Inductive Learning (Science)
§ Simplest form: learn a function from examples
§ A target function: g
§ Examples: input-output pairs (x, g(x))
§ E.g. x is an email and g(x) is spam / ham
§ E.g. x is a house and g(x) is its selling price

§ Problem:
§ Given a hypothesis space H
§ Given a training set of examples (xi, g(xi))
§ Find a hypothesis h(x) such that h ~ g

§ Includes:
§ Classification (outputs = class labels)
§ Regression (outputs = real numbers)
Classification example: Object recognition
[Figure: training images x with labels f(x) = giraffe, giraffe, giraffe, llama, llama, llama; for a new image X, the task is to predict f(X) = ?]
Example: Spam Filter
§ Input: an email
§ Output: spam/ham
§ Setup:
§ Get a large collection of example emails, each labeled "spam" or "ham" (by hand)
§ Learn to predict labels of new incoming emails
§ Classifiers reject 200 billion spam emails per day
§ Features: The attributes used to make the ham / spam decision
§ Words: FREE!
§ Text Patterns: $dd, CAPS
§ Non-text: SenderInContacts, AnchorLinkMismatch
§ …
Example emails from the slide:
"Dear Sir. First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …"
"TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT. 99 MILLION EMAIL ADDRESSES FOR ONLY $99"
"Ok, I know this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened."
Example: Digit Recognition
§ Input: images / pixel grids
§ Output: a digit 0-9
§ Setup:
§ MNIST data set of 60K hand-labeled images
§ Note: someone has to hand label all this data!
§ Want to learn to predict labels of new digit images
§ Features: The attributes used to make the digit decision
§ Pixels: (6,8)=ON
§ Shape Patterns: NumComponents, AspectRatio, NumLoops
§ …
[Figure: example digit images labeled 0, 1, 2, 1, and an unlabeled query image marked "??"]
Other Classification Tasks
§ Medical diagnosis
§ input: symptoms
§ output: disease
§ Automatic essay grading
§ input: document
§ output: grades
§ Fraud detection
§ input: account activity
§ output: fraud / no fraud
§ Email routing
§ input: customer complaint email
§ output: which department needs to ignore this email
§ Fruit and vegetable inspection
§ input: image (or gas analysis)
§ output: moldy or OK

§ … many more
Regression example: Curve fitting
[A sequence of figure slides showing different curves fit to the same set of data points]
Basic questions
§ Which hypothesis space H to choose?
§ How to measure degree of fit?
§ How to trade off degree of fit vs. complexity?
§ “Ockham’s razor”
§ How do we find a good h?
§ How do we know if a good h will predict well?
Training and Testing
A few important points about learning
§ Data: labeled instances, e.g. emails marked spam/ham
§ Training set
§ Held out set (validation set)
§ Test set
§ Features: attribute-value pairs which characterize each x
§ Experimentation cycle
§ Learn parameters (e.g. model probabilities) on training set
§ (Tune hyperparameters on held-out set)
§ Compute accuracy on the test set
§ Very important: never "peek" at the test set!
§ Evaluation
§ Accuracy: fraction of instances predicted correctly
§ Overfitting and generalization
§ Want a classifier which does well on test data
§ Overfitting: fitting the training data very closely, but not generalizing well
§ Underfitting: fits the training set poorly
[Sidebar diagram: the data split into Training Data, Held-Out Data (validation set), and Test Data]
A few important points about learning
§ What should we learn where?
§ Learn parameters from training data
§ Tune hyperparameters on different data
§ Why?
§ For each value of the hyperparameters, train and then test on the held-out data
§ Choose the best value and do a final test on the test data (a concrete sketch of this cycle follows below)

§ What are examples of hyperparameters?
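A minimal Python sketch of this experimentation cycle (not from the original slides; the dataset, the k-nearest-neighbour model, the split sizes, and the candidate values of k are assumptions made for illustration):

import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a small labeled dataset and split it into training, held-out, and test data.
X, y = load_digits(return_X_y=True)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# The hyperparameter (here, k) is tuned on the held-out (validation) data only.
best_k, best_acc = None, -1.0
for k in [1, 3, 5, 7]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = model.score(X_val, y_val)
    if acc > best_acc:
        best_k, best_acc = k, acc

# The final model is evaluated once on the test set, which was never "peeked" at above.
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print("chosen k:", best_k, "test accuracy:", final.score(X_test, y_test))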


Supervised Learning
§ Classification = learning f with discrete output value
§ Regression = learning f with real-valued output value
Linear Regression
Hypothesis family: Linear functions, hw(x) = w0 + w1x (more generally, hw(x) = wᵀx)
Linear Regression
Data: (x, y = f(x)), where x is the house size and y is the house price
[Figure: scatter plot of Berkeley house prices, 2009; house price in $1000 vs. house size in square feet]


Linear regression = fitting a straight line/hyperplane
Prediction: hw(x) = w0 + w1x
[Figures: a fitted line hw(x) through example points, and the straight-line fit to the Berkeley house prices, 2009 (house price in $1000 vs. house size in square feet)]


Prediction error
Error (or "residual") on one instance: y − hw(x), the difference between the observation y and the prediction hw(x)
[Figure: a fitted line hw(x) with a vertical segment marking the residual between an observation y and its prediction]
Find w
§ Define a loss function, e.g. Loss(w) = Σj (yj − hw(xj))²
§ Find the w* that minimizes the loss function: w* = argminw Loss(w)
Least squares: Minimizing squared error
§ L2 loss function: sum of squared errors over all examples
§ Loss = Σj (yj − hw(xj))² = Σj (yj − (w0 + w1xj))²
§ We want the weights w* that minimize loss
§ At w* the derivatives of loss w.r.t. each weight are zero:
§ ∂Loss/∂w0 = −2 Σj (yj − (w0 + w1xj)) = 0
§ ∂Loss/∂w1 = −2 Σj (yj − (w0 + w1xj)) xj = 0
§ Exact solutions for N examples:
§ w1 = [N Σj xjyj − (Σj xj)(Σj yj)] / [N Σj xj² − (Σj xj)²] and w0 = (1/N) [Σj yj − w1 Σj xj]
§ For the general case where x is an n-dimensional vector
§ X is the data matrix (all the data, one example per row); y is the column vector of labels
§ w* = (XᵀX)⁻¹Xᵀy
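A minimal numerical sketch of these formulas (the toy size/price numbers below are invented for illustration; they are not the Berkeley data):

import numpy as np

# Toy data: x = house size (sq ft), y = price ($1000). Values invented for illustration.
x = np.array([1000.0, 1500.0, 2000.0, 2500.0, 3000.0])
y = np.array([ 300.0,  450.0,  500.0,  650.0,  700.0])
N = len(x)

# One-dimensional closed-form solution from the slide.
w1 = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / (N * np.sum(x ** 2) - np.sum(x) ** 2)
w0 = (np.sum(y) - w1 * np.sum(x)) / N

# General case: data matrix X (one example per row, with a column of 1s for w0), w* = (X^T X)^-1 X^T y.
X = np.column_stack([np.ones(N), x])
w_star = np.linalg.solve(X.T @ X, X.T @ y)  # solving the normal equations avoids an explicit inverse

print(w0, w1)   # one-dimensional formulas
print(w_star)   # should agree with [w0, w1]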
Regression vs Classification
§ Linear regression when output is binary, y ∈ {−1, 1}
§ hw(x) = w0 + w1x
§ [Figure: the line w0 + w1x fit to binary-valued y plotted against x]
§ Linear classification
§ Used with discrete output values
§ Threshold a linear function
§ hw(x) = 1, if w0 + w1x ≥ 0
§ hw(x) = −1, if w0 + w1x < 0
§ w: weight vector
§ Activation function g
§ [Figure: the step function y = g(w0 + w1x) plotted against x]
Threshold perceptron as linear classifier
Binary Decision Rule
§ A threshold perceptron is a single unit that outputs
§ y = hw(x) = 1 when w ⋅ x ≥ 0
§ y = hw(x) = −1 when w ⋅ x < 0
§ In the input vector space
§ Examples are points x
§ The equation w ⋅ x = 0 defines a hyperplane
§ One side corresponds to y = 1
§ The other side corresponds to y = −1
[Figure: spam example with weights w0 = −3, wfree = 4, wmoney = 2; in the (free, money) feature plane the line w ⋅ x = 0 separates the y = 1 (SPAM) region from the y = −1 (HAM) region]
Example
Email: "Dear Stuart, I'm leaving Macrosoft to return to academia. The money is great here but I prefer to be free to do my own research; and I really love teaching undergrads! Do I need to finish my BA first before applying? Best wishes, Bill"
Features: x0 = 1, xfree = 1, xmoney = 1
Weights: w0 = −3, wfree = 4, wmoney = 2
w ⋅ x = (−3)(1) + (4)(1) + (2)(1) = 3 ≥ 0, so the decision rule outputs y = 1 (SPAM)
[Figure: the point (free, money) = (1, 1) in feature space falls on the y = 1 (SPAM) side of the boundary w ⋅ x = 0]
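The score in this example can be reproduced with a few lines of Python (weights and features copied from the slide):

import numpy as np

w = np.array([-3.0, 4.0, 2.0])  # [w0, w_free, w_money]
x = np.array([ 1.0, 1.0, 1.0])  # [x0 (bias), x_free, x_money]
score = np.dot(w, x)
print(score, "SPAM" if score >= 0 else "HAM")  # 3.0 -> classified as SPAM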
Weight Updates
§ The perceptron's thresholded output is not differentiable, so the closed-form least-squares solution used for linear regression does not apply; we need a different way to find the weights.
Perceptron learning rule
§ If the true label y* ≠ hw(x) (an error), adjust the weights
§ If w ⋅ x < 0 but the output should be y* = 1
§ This is called a false negative
§ Should increase weights on positive inputs
§ Should decrease weights on negative inputs
§ If w ⋅ x ≥ 0 but the output should be y* = −1
§ This is called a false positive
§ Should decrease weights on positive inputs
§ Should increase weights on negative inputs
Perceptron Learning Rule
§ Start with weights = 0
§ For each training instance:
§ If wrong: adjust the weight vector by adding or subtracting the feature vector: w ← w + α y* x, where y* is the true label and α is the learning rate
§ (Reminder: y = hw(x) = 1 when w ⋅ x ≥ 0, and −1 when w ⋅ x < 0)
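A minimal sketch of this training loop in Python, assuming labels y* in {−1, +1}, a data matrix whose rows already include the bias feature x0 = 1, and a fixed learning rate α (the epoch count is an arbitrary choice for illustration):

import numpy as np

def perceptron_train(X, y, alpha=0.5, epochs=100):
    """X: (N, d) array, one example per row (first column all 1s); y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])                # start with weights = 0
    for _ in range(epochs):
        for xj, yj in zip(X, y):
            pred = 1 if np.dot(w, xj) >= 0 else -1
            if pred != yj:                  # an error: adjust the weights
                w = w + alpha * yj * xj     # add or subtract the feature vector
    return w

On a false negative (yj = +1) this adds the feature vector, raising w ⋅ x; on a false positive it subtracts it, lowering the score.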
Example
Email: "Dear Stuart, I wanted to let you know that I have decided to leave Macrosoft and return to academia. The money is great here but I prefer to be free to pursue more interesting research and I really love teaching undergraduates! Do I need to finish my BA first before applying? Best wishes, Bill"
Features: x0 = 1, xfree = 1, xmoney = 1
Current weights: w0 = −3, wfree = 4, wmoney = 2
w ⋅ x = (−3)(1) + (4)(1) + (2)(1) = 3 ≥ 0, so the perceptron predicts y = 1 (SPAM), but the true label is y* = −1 (HAM), so the weights are updated:
w ← w + α y* x = (−3, 4, 2) + 0.5 × (−1) × (1, 1, 1) = (−3.5, 3.5, 1.5), with α = 0.5
[Figure: the point (free, money) = (1, 1) in feature space, with the decision boundary w ⋅ x = 0 separating the SPAM and HAM regions]
Perceptron convergence theorem
§ A learning problem is linearly separable iff there is some hyperplane exactly separating positive from negative examples
§ Convergence: if the training data are linearly separable, perceptron learning applied repeatedly to the training set will eventually converge to a perfect separator
[Figure: "Separable" and "Non-Separable" scatter plots of labeled examples]
Example: Earthquakes vs nuclear explosions
[Figure: left, the training examples plotted in the (x1, x2) plane with the learned linear separator; right, proportion correct vs. number of weight updates]
63 examples, 657 updates required


Perceptron convergence theorem
§ A learning problem is linearly separable iff there is some hyperplane exactly separating positive from negative examples
§ Convergence (separable case): if the training data are separable, perceptron learning applied repeatedly to the training set will eventually converge to a perfect separator
§ Convergence (non-separable case): if the training data are non-separable, perceptron learning will converge to a minimum-error solution provided the learning rate α is decayed appropriately (e.g., α = 1/t)
[Figure: "Separable" and "Non-Separable" scatter plots of labeled examples]
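In code, the only change needed for the non-separable case is to replace the fixed α in the training-loop sketch given earlier with a schedule that decays with the number of weight updates t, for example:

def decaying_alpha(t):
    # One possible schedule; alpha = 1.0 / (t + 1) implements the alpha = 1/t idea above,
    # and alpha = 1000 / (1000 + t) is the schedule used in the plots on the slides below.
    return 1000.0 / (1000.0 + t)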
Perceptron learning with fixed α
[Figure: left, the training examples in the (x1, x2) plane; right, proportion correct vs. number of weight updates]
71 examples, 100,000 updates
Fixed α = 0.2: no convergence
Perceptron learning with decaying α
[Figure: left, the training examples in the (x1, x2) plane; right, proportion correct vs. number of weight updates]
71 examples, 100,000 updates
Decaying α = 1000/(1000 + t): near-convergence
Non-Separable Case
Even the best linear boundary makes at least one mistake
Other Linear Classifiers
§ Perceptron is perfectly happy as long as it separates the training data
§ [Figure: step activation y = g_threshold(w ⋅ x) plotted against x]
§ Logistic Regression
§ g_logistic(z) = 1 / (1 + e^(−z))
§ [Figure: sigmoid curve y = g_logistic(w ⋅ x) plotted against x]
§ Support Vector Machines (SVM)
§ Maximize margin between boundary and nearest points
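A minimal gradient-ascent sketch of logistic regression (not from the slides; it assumes labels y in {0, 1}, and the learning rate and iteration count are arbitrary choices for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_train(X, y, lr=0.1, iters=1000):
    """X: (N, d) array with a leading column of 1s; y: labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ w)                      # predicted probability of class 1 per example
        w = w + lr * (X.T @ (y - p)) / len(y)   # gradient ascent on the log-likelihood
    return w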
Perceptrons hopeless for XOR function
[Figure: three panels plotting the four points (0,0), (0,1), (1,0), (1,1) in the (x1, x2) plane, labeled by (a) x1 and x2, (b) x1 or x2, (c) x1 xor x2; the first two functions are linearly separable, XOR is not]
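One can check the XOR claim directly by running the perceptron rule from earlier on the four XOR examples; since no weight vector classifies all four correctly, the error count never reaches zero (a quick self-contained sketch, assuming labels in {−1, +1}):

import numpy as np

X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)  # bias, x1, x2
y = np.array([-1, 1, 1, -1])                                             # x1 xor x2, as -1/+1

w = np.zeros(3)
for epoch in range(1000):
    errors = 0
    for xj, yj in zip(X, y):
        pred = 1 if np.dot(w, xj) >= 0 else -1
        if pred != yj:
            errors += 1
            w = w + 0.5 * yj * xj
    if errors == 0:
        break
print("errors in final epoch:", errors)  # stays above 0: XOR is not linearly separable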
Basic questions
§ Which hypothesis space H to choose?
§ How to measure degree of fit?
§ How to trade off degree of fit vs. complexity?
§ “Ockham’s razor”
§ How do we find a good h?
§ How do we know if a good h will predict well?
Classical stats/ML: Minimize loss function
§ Which hypothesis space H to choose?
§ E.g., linear combinations of features: hw(x) = wᵀx
§ How to measure degree of fit?
§ Loss function, e.g., squared error Σj (yj − wᵀxj)²
§ How to trade off degree of fit vs. complexity?
§ Regularization: complexity penalty, e.g., ||w||²
§ How do we find a good h?
§ Optimization (closed-form, numerical); discrete search
§ How do we know if a good h will predict well?
§ Try it and see (cross-validation, bootstrap, etc.)
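As one concrete instance of such a complexity penalty, adding λ||w||² to the squared-error loss gives ridge regression, whose closed form is a small change to the least-squares solution shown earlier (a sketch, not part of the slides; λ would be chosen on held-out data, and in practice the bias weight is often left unpenalized):

import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Minimize sum_j (y_j - w.x_j)^2 + lam * ||w||^2; closed form w* = (X^T X + lam I)^-1 X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)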
Probabilistic: Max. likelihood, max. a posteriori
§ Which hypothesis space H to choose?
§ Probability model P(y | x, h), e.g., Y ~ N(wᵀx, σ²)
§ How to measure degree of fit?
§ Data likelihood Πj P(yj | xj, h)
§ How to trade off degree of fit vs. complexity?
§ Regularization or prior: argmaxh P(h) Πj P(yj | xj, h) (maximum a posteriori, MAP)
§ How do we find a good h?
§ Optimization (closed-form, numerical); discrete search
§ How do we know if a good h will predict well?
§ Empirical process theory (generalizes Chebyshev, CLT, PAC…)
§ Key assumption is that the data are i.i.d. (independent and identically distributed)
Bayesian: Computing posterior over H
§ Which hypothesis space H to choose?
§ All hypotheses with nonzero a priori probability
§ How to measure degree of fit?
§ Data probability, as for MLE/MAP
§ How to trade off degree of fit vs. complexity?
§ Use prior, as for MAP
§ How do we find a good h?
§ Don't! Bayes predictor P(y|x,D) = Σh P(y|x,h) P(h|D) ∝ Σh P(y|x,h) P(D|h) P(h)
§ How do we know if a good h will predict well?
§ Silly question! Bayesian prediction is optimal!!
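A tiny numerical sketch of this Bayes predictor over a finite hypothesis space (the three candidate "spam rate" hypotheses, the uniform prior, and the observed counts are invented for illustration):

import numpy as np

# Hypotheses: three candidate values for the probability that an email is spam.
hs = np.array([0.2, 0.5, 0.8])
prior = np.array([1/3, 1/3, 1/3])

# Observed data D: 7 spam out of 10 emails. Likelihood P(D|h), up to a constant factor.
likelihood = hs**7 * (1 - hs)**3

# Posterior P(h|D) ~ P(D|h) P(h), then the Bayes predictor sums over all hypotheses.
posterior = prior * likelihood
posterior /= posterior.sum()
print("P(next email is spam | D) =", np.sum(hs * posterior))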
Acknowledgement

The lecture slides are based on materials from ai.berkeley.edu


Questions
