Machine Learning
Advice for applying machine learning
Debugging a learning algorithm
• Suppose you have implemented regularized
linear regression to predict housing prices
• Trained it
• But, when you test on new data you find it
makes unacceptably large errors in its
predictions
• What should you try next?
– Get more training data
• Getting more data often helps, although you should always do some preliminary testing to make
sure more data will actually make a difference (discussed later)
– Try a smaller set of features
• Carefully select small subset
• You can do this by hand, or use some dimensionality reduction technique (e.g. PCA)
– Try getting additional features
• LOOK at the data
• Can be very time consuming
– Adding polynomial features
– Building your own, new, better features based on your knowledge of the
problem
• Can be risky if you accidentally overfit your data by creating new features which are
inherently specific/relevant to your training data
– Try decreasing or increasing λ
• Change how important the regularization term is in your calculations
– There are some simple techniques which can let you rule out half the things
on the list
• Save you a lot of time!
• This approach might work, but it’s very time-consuming, and it’s largely a
matter of luck whether you end up fixing the real problem.
Evaluating a hypothesis
• When we fit parameters to training data, we try
to minimize the error.
• We might think a low training error is good - but it doesn't
necessarily mean a good parameter set
– Could, in fact, be indicative of overfitting
– This means your model will fail to generalize
• How do you tell if a hypothesis is overfitting?
– Could plot hθ(x)
– But with lots of features may be impossible to
plot
• Standard way to evaluate a hypothesis is
– Split data into two portions
• 1st portion is training set
• 2nd portion is test set
– Typical split might be 70:30 (training:test)
• A typical train and test scheme would be
– 1) Learn parameters θ from the training data,
minimizing J(θ) on the 70% training portion of the data
– 2) Compute the test error
– Jtest(θ) = average square error as measured on the
test set
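For regression, the test error mentioned above is just the average squared error over the test set. As a hedged reconstruction of the standard formula (the factor 1/2 is a common convention and could be dropped; m_test denotes the number of test examples):

$$ J_{test}(\theta) = \frac{1}{2\,m_{test}} \sum_{i=1}^{m_{test}} \Big( h_\theta\big(x^{(i)}_{test}\big) - y^{(i)}_{test} \Big)^2 $$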
• Sometimes there is a better way -
misclassification error (0/1 misclassification)
• We define the error as follows
– i.e. it's the fraction of the test set that the hypothesis mislabels (see the sketch below)
• These are the standard techniques for evaluating
a learned hypothesis
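A sketch of the 0/1 misclassification error referred to above, using the standard definitions (a per-example error, then its average over the test set):

$$ \mathrm{err}\big(h_\theta(x), y\big) = \begin{cases} 1 & \text{if } h_\theta(x) \ge 0.5 \text{ and } y = 0,\ \text{or } h_\theta(x) < 0.5 \text{ and } y = 1 \\ 0 & \text{otherwise} \end{cases} $$

$$ \text{Test error} = \frac{1}{m_{test}} \sum_{i=1}^{m_{test}} \mathrm{err}\big(h_\theta(x^{(i)}_{test}),\, y^{(i)}_{test}\big) $$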
Model selection and training, validation
and test sets
• How to choose degree of polynomial
• Model selection problem - try to choose the degree
of the polynomial used to fit the data
Improved model selection
• Given a data set, instead split it into three pieces
– 1 - Training set (60%) - m values
– 2 - Cross validation (CV) set (20%) - mcv
– 3 - Test set (20%) - mtest
• As before, we can calculate
– Training error
– Cross validation error
– Test error
– Then
• Minimize cost function for each of the models as before
• Test these hypotheses on the cross validation set to generate
the cross validation error
• Pick the hypothesis with the lowest cross validation error
– e.g. pick θ5
• Finally
– Estimate generalization error of model using the test set
• Final note
– In machine learning as practiced today - many people will
select the model using the test set and then check the
model is OK for generalization using the test error (which
we've said is bad because it gives a biased, optimistic estimate)
• With a MASSIVE test set this is maybe OK
– But better practice is to have separate training and
validation sets
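A minimal Python/NumPy sketch of the degree-selection procedure above; it assumes a one-dimensional input and plain (unregularized) least squares for simplicity, and the helper name poly_features is hypothetical:

```python
import numpy as np

def poly_features(x, d):
    # Map a 1-D input to [1, x, x^2, ..., x^d] (intercept included).
    return np.column_stack([x ** p for p in range(d + 1)])

def select_degree(x_train, y_train, x_cv, y_cv, max_degree=10):
    """Fit one model per polynomial degree on the training set and
    pick the degree with the lowest cross validation error."""
    cv_errors = []
    for d in range(1, max_degree + 1):
        X_tr, X_val = poly_features(x_train, d), poly_features(x_cv, d)
        theta, *_ = np.linalg.lstsq(X_tr, y_train, rcond=None)
        cv_errors.append(np.mean((X_val @ theta - y_cv) ** 2) / 2.0)
    return int(np.argmin(cv_errors)) + 1, cv_errors
```

The test set is then used only once, on the chosen degree, to estimate the generalization error.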
K-fold Cross Validation
• In some applications, data is scarce
• The k-fold cross validation technique is
designed to give an accurate estimate of the
true error without wasting too much data
• In k-fold cross validation the original training
set is partitioned into k subsets (folds) of
size m/k
• For each fold, the algorithm is trained on the
union of the other folds and then its error is
estimated on that held-out fold.
• The average of these errors is the
estimate of the true error.
• The special case k = m, where m is the number
of examples, is called leave-one-out cross
validation (LOOCV).
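A minimal sketch of k-fold cross validation, assuming NumPy; train_fn and error_fn are hypothetical placeholders for whatever learner and loss you are using:

```python
import numpy as np

def k_fold_cv_error(X, y, train_fn, error_fn, k=10):
    """Estimate the true error: split the data into k folds, train on
    k-1 of them, evaluate on the held-out fold, and average."""
    indices = np.random.permutation(len(y))
    folds = np.array_split(indices, k)
    errors = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train_idx], y[train_idx])             # train on the other folds
        errors.append(error_fn(model, X[val_idx], y[val_idx]))   # error on the held-out fold
    return float(np.mean(errors))
```

Setting k equal to the number of examples gives leave-one-out cross validation.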
Diagnosis - bias vs. variance
• If you get bad results, it is usually because of one of
– High bias - underfitting problem
– High variance - overfitting problem
• Important to work out which is the problem
• Let's define training and cross validation error as before
• Now plot
– x = degree of polynomial d
– y = error for both training and cross validation (two lines)
• CV error and test set error will be very similar
– This plot helps us understand the error
• How do we apply this for diagnostics?
– If the cv error is high, we're either at the low end of d (underfitting)
or the high end of d (overfitting)
Regularization and bias/variance
• How are bias and variance affected by regularization?
– Plotting training and cross validation error against λ can help
show whether you're picking a good value for λ
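A minimal sketch of that λ sweep, assuming NumPy, closed-form regularized linear regression, and feature matrices that already include an intercept column; the candidate λ values are illustrative only:

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form regularized linear regression; the intercept
    # (assumed to be the first column of X) is not penalized.
    reg = lam * np.eye(X.shape[1])
    reg[0, 0] = 0.0
    return np.linalg.pinv(X.T @ X + reg) @ (X.T @ y)

def choose_lambda(X_train, y_train, X_cv, y_cv,
                  lambdas=(0.0, 0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0)):
    """Train one model per candidate lambda and pick the value with
    the lowest (unregularized) cross validation error."""
    cv_errors = []
    for lam in lambdas:
        theta = ridge_fit(X_train, y_train, lam)
        cv_errors.append(np.mean((X_cv @ theta - y_cv) ** 2) / 2.0)
    return lambdas[int(np.argmin(cv_errors))], cv_errors
```

Plotting cv_errors (and the corresponding training errors) against λ gives the picture the slide refers to: small λ tends toward high variance, large λ toward high bias.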
Learning curves
• A learning curve is often useful to plot
for algorithmic sanity checking or improving performance
• What is a learning curve?
– Plot Jtrain (average squared error on training set) and
Jcv (average squared error on cross validation set)
– Plot against m (number of training examples)
• m is fixed by your data set
• So artificially reduce m and recalculate the errors with the smaller training set
sizes
– Jtrain
• Error on smaller sample
sizes is smaller (as less variance
to accommodate)
• So as m grows error grows
– Jcv
• Error on cross validation set
• When you have a tiny training
set you generalize badly
• But as the training set grows your hypothesis generalizes better
• So cv error will decrease as m increases
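A minimal sketch of plotting a learning curve, assuming NumPy and Matplotlib; train_fn and error_fn are the same kind of hypothetical placeholders as in the k-fold sketch:

```python
import numpy as np
import matplotlib.pyplot as plt

def learning_curve(X_train, y_train, X_cv, y_cv, train_fn, error_fn):
    """Train on progressively larger subsets of the training data and
    plot training error and cross validation error against m."""
    m = len(y_train)
    sizes = np.unique(np.linspace(1, m, num=min(m, 20), dtype=int))
    j_train, j_cv = [], []
    for size in sizes:
        model = train_fn(X_train[:size], y_train[:size])
        j_train.append(error_fn(model, X_train[:size], y_train[:size]))
        j_cv.append(error_fn(model, X_cv, y_cv))   # always the full CV set
    plt.plot(sizes, j_train, label="J_train")
    plt.plot(sizes, j_cv, label="J_cv")
    plt.xlabel("m (training set size)")
    plt.ylabel("error")
    plt.legend()
    plt.show()
    return sizes, j_train, j_cv
```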
• What do these curves look like if you have
– High bias
– e.g. fitting a straight line to the data
– Jtrain
• Training error is small at first and grows
• Training error becomes close to the cross validation error
• So the performance on the cross validation and training sets ends up
being similar (but very poor)
– Jcv
• Straight line fit is similar for a few vs. a lot of data
• So it doesn't generalize any better with lots of data because the
function just doesn't fit the data
– No increase in data will help it fit
• The problem with high bias is because cross validation and training
error are both high
• Also implies that if a learning algorithm has high bias as we get
more examples the cross validation error doesn't decrease
– So if an algorithm is already suffering from high bias, more data does
not help
• High variance e.g. high order polynomial
• Jtrain
– When the training set is small, the training error is small too
– As the training set size increases, the value is still small
– But it slowly increases (in a near linear fashion)
– Error is still low
• Jcv
– Error remains high, even when you have a moderate number of examples
– Because the problem with high variance (overfitting) is your model doesn't
generalize
• An indicative diagnostic that you have high variance is that there's a big
gap between training error and cross validation error
• If a learning algorithm is suffering from high variance, more data is
probably going to help
What to do next
• How do these ideas help us choose how we approach a problem?
• Original example
– Trained a learning algorithm (regularized linear regression)
– But, when you test on new data you find it makes unacceptably large errors in its predictions
– What should you try next?
• How do we decide what to do?
– Get more examples --> helps to fix high variance
• Not good if you have high bias
– Try adding additional features --> helps to fix high bias (because the hypothesis is too simple;
adding features makes it more expressive)
Choosing the network architecture
• Selecting a network architecture
• One option is to use a small neural network
– Few (maybe one) hidden layer and few hidden units
– Such networks are prone to underfitting
– But they are computationally cheaper
• Larger network
– More hidden layers
• How do you decide that a larger network is good?
• Using a single hidden layer is a good default
– Also try with 1, 2, 3 hidden layers and see which performs best on the cross validation set
– So like before, use three sets (training, cross validation, test)
• More units
– This is computationally expensive
– Prone to overfitting
• Use regularization to address overfitting
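A minimal sketch of that architecture search, assuming hypothetical train_network(n_hidden_layers, X, y) and error_fn(model, X, y) helpers (neither is defined in the slides):

```python
def choose_architecture(X_train, y_train, X_cv, y_cv, train_network, error_fn):
    """Train networks with 1, 2 and 3 hidden layers and keep the one
    with the lowest cross validation error, as the slide suggests."""
    best_model, best_layers, best_err = None, None, float("inf")
    for n_hidden_layers in (1, 2, 3):
        model = train_network(n_hidden_layers, X_train, y_train)
        err = error_fn(model, X_cv, y_cv)
        if err < best_err:
            best_model, best_layers, best_err = model, n_hidden_layers, err
    return best_model, best_layers
```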
Quiz
Suppose you have implemented regularized logistic
regression to classify what object is in an image. However,
when you test your hypothesis on a new set of images,
you find it makes unacceptably large errors with its
predictions on the new images. However, your
hypothesis performs well on the training set. Which of
the following are promising steps to take?
A. Try adding polynomial features
B. Get more training examples
C. Use fewer training examples
D. Try using smaller set of features
B, D
Quiz
Suppose you have implemented regularized logistic
regression to classify what object is in an image. However,
when you test your hypothesis on a new set of images,
you find it makes unacceptably large errors with its
predictions on the new images. However, your
hypothesis performs well on the training set. Which of
the following are promising steps to take?
A. Try decreasing the regularization parameter λ
B. Try evaluating the hypothesis on a cross validation set
rather than the test set
C. Try using smaller set of features
D. Try increasing the regularization parameter λ
C, D
Quiz
Suppose you have implemented regularized logistic regression
to predict what items customers will purchase on a web
shopping site. However, when you test your hypothesis on a
new set of customers, you find it makes unacceptably large
errors with its predictions. Furthermore,
your hypothesis performs poorly on the training set. Which of
the following are promising steps to take?
A. Try decreasing the regularization parameter λ
B. Try adding polynomial features
C. Try using smaller set of features
D. Try to obtain and use additional features
A, B, D
Quiz
Which of the following statements are true?
A. The performance of the learning algorithm on the training set will
typically be better than its performance on the test set
B. Suppose you are training a regularized linear regression model.
The recommended way to choose what value of regularization
parameter λ to use is to choose the value of λ which gives the
lowest test error
C. Suppose you are training a regularized linear regression model.
The recommended way to choose what value of regularization
parameter λ to use is to choose the value of λ which gives the
lowest cross validation error
D. Suppose you are training a regularized linear regression model.
The recommended way to choose what value of regularization
parameter λ to use is to choose the value of λ which gives the
lowest training error
A, C
Quiz
Which of the following statements are true?
A. If a learning algorithm is suffering from high bias, only
adding more training examples may not improve the test
error significantly.
B. A model with more parameters is more prone to
overfitting and typically has higher variance.
C. When debugging learning algorithms, it is useful to plot a
learning curve to understand if there is a high bias or high
variance problem.
D. If a neural network has much lower training error than
test error, then adding more layers will help bring the test
error down because we can fit the test set better.
A, B, C
Quiz
You train a learning algorithm and find that it has unacceptably high error on the
test set. You plot the learning curve and obtain the figure below. Is the algorithm
suffering from high bias, high variance, or neither?
A. High bias
B. High variance
C. Neither
How to fix a high bias or a high
variance problem?
How to fix a high bias problem
• The following tricks are employed to fix a high bias problem.
• Train longer:
– Many machine learning algorithms are set up as iterative optimization problems where the
training error (or a function of the training error) is minimized. Just letting the algorithm run
for more hours or days can help reduce the bias. In neural networks, you can change a
parameter called the “learning rate” which will help the training error go down faster.
• Train a more complex model:
– A more complex model will fit the training data better. In the case of neural networks, one
can add more layers. In the case of an SVM, you can use a non-linear SVM instead of a
linear one.
• Obtain more features:
– Sometimes you just do not have enough information to train a model. For example, if you are
trying to train a model that can predict the gender of a person based on the color of their hair,
the problem might be impossible to solve. But if you add a new feature — the length of the
hair — the problem becomes more tractable.
• Decrease regularization:
– In a naive implementation, the parameters could take any values. This makes the model very
flexible. However, we may want to constrain the flexibility of the model to prevent overfitting,
so usually a regularization term is added to the cost function we are trying to minimize. We
can weaken that constraint by reducing the value of the regularization parameter.
How to fix a high variance problem?
• Obtain more data:
– Because the validation error is large, it means that the training set and
the validation set that were randomly chosen from the same dataset,
somehow have different characteristics. This usually means that you
do not have enough data and you need to collect more.
• Decrease number of features:
– Sometimes collecting more data is not an option. In that case, you can
reduce the number of features. You may have to remove features
manually. For example, in our previous example of identifying the
gender of a person based on hair color and hair length, you may
decide to drop hair color and keep hair length.
• Increase regularization:
– When we have a high variance problem the model is fitting the
training data too closely. In fact, the model is probably fitting even the noise in the
training set and therefore not performing as well on the validation set.
We can reduce the flexibility of the model by using regularization that
puts constraints on the magnitude of the parameters.
Machine Learning System Design
What to do next?
• Build an initial model quickly:
1. Train using training set — Fit the parameters
2. Development set — Tune the parameters
3. Test set — Assess the performance
• Prioritize Next Steps:
1. Use Bias and Variance Analysis to deal with
underfitting and overfitting
2. Analyse what is causing the errors and fix them
until you have the required model ready!
Prioritizing what to work on -
spam classification example
• The idea of prioritizing what to work on is perhaps the most
important skill programmers typically need to develop
• Building a spam classifier
• Spam is email advertising
Spam classification
• What kind of features might we define
– Spam (1)
• Misspelled word
– Not spam (0)
• Real content
• How do we build a classifier
to distinguish between the two
– Feature representation
• How do we represent x (features of the email)?
– y = spam (1) or not spam (0)
Approach - choosing our own features
• Choose 100 words which are indicative of an email being
spam or not spam
– Spam --> e.g. buy, discount, deal
– Non spam --> Andrew, now
– All these words go into one long vector
• Encode this into a reference vector
– See which words appear in a message
• Define a feature vector x
– Where each entry is 1 or 0 depending on whether the corresponding word in the
reference vector is present or not
• This is a bitmap of the word content of your email
– i.e. don't recount if a word appears more than once
– In practice it's more common to have a training set and pick the
most frequently used n words, where n is 10,000 to 50,000
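A minimal sketch of that bitmap feature vector; the letters-only tokenization and the tiny reference list are assumptions for illustration only:

```python
import re

def email_to_feature_vector(email_text, reference_words):
    """Binary (bitmap) feature vector: x[j] = 1 if reference_words[j]
    appears anywhere in the email, else 0; repeats are not recounted."""
    tokens = set(re.findall(r"[a-z]+", email_text.lower()))  # crude tokenization
    return [1 if word in tokens else 0 for word in reference_words]

# Tiny hand-picked reference list (the slide suggests ~100 hand-chosen
# words, or the 10,000-50,000 most frequent words in practice).
reference = ["buy", "discount", "deal", "andrew", "now"]
x = email_to_feature_vector("Huge DISCOUNT - buy now, best deal ever!", reference)
print(x)  # -> [1, 1, 1, 0, 1]
```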
What's the best use of your time to improve system
accuracy?
• Natural inclination is to collect lots of data
– Honey pot anti-spam projects try and get fake email addresses
into spammers' hands, collect loads of spam
• Develop sophisticated features based on email routing
information (contained in email header)
– Spammers often try and obscure origins of email
– Send through unusual routes
• Develop sophisticated features for message body analysis
– Discount == discounts?
– DEAL == deal?
• Develop a sophisticated algorithm to detect misspellings
– Spammers use misspelled words to get around detection systems
Accuracy, recall, precision, and related metrics
• Recall: the true positive rate (TPR), or the proportion of all actual positives that
were classified correctly as positives, is also known as recall.
• Precision is the proportion of all the model's positive classifications that are
actually positive. It is mathematically defined as Precision = TP / (TP + FP).
Precision and Recall
• Suppose there are 100 games in your test set. There
are four possible outcomes:
– You predict team will win and they do win.
You predict team will win but they lose.
You predict team will lose and they do lose.
You predict team will lose but they win.
– True Positive (you predict + and are correct),
False Positive (you predict + but are incorrect),
True Negative (you predict – and are correct),
False Negative (you predict – but are incorrect).
• Suppose that for the 100 games, your results are:
– True Positive (TP) = 40 (correctly predicted a win)
False Positive (FP) = 20 (incorrectly predicted a win)
True Negative (TN) = 30 (correctly predicted a loss)
False Negative (FN) = 10 (incorrectly predicted a loss)
• These counts form a 2×2 table, called a “confusion matrix”
Precision and Recall
• Accuracy = (TP + TN) / total = (40 + 30) / 100 = 0.70
• Precision = TP / (TP+FP) = 40 / (40+20) = 40/60 = 0.67
• Recall = TP / (TP+FN) = 40 / (40+10) = 40/50 = 0.80
• In the case of a logistic regression classifier, you can adjust
something called the threshold, which is an internal
number between 0 and 1 that determines whether a
prediction is positive or not.
• As you increase the threshold value above 0.5, it becomes
more difficult for a data item to be classified as positive.
• If you change the threshold for a logistic regression
classifier, it turns out the precision and recall will change.
• If the precision increases (your chance of winning your bet),
the recall (your number of betting opportunities) will
decrease. And vice versa.
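A small sketch that reproduces the numbers above from the four confusion matrix counts (just the slide's own formulas, wrapped in a function):

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision and recall from the confusion matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# The 100-game example: TP=40, FP=20, TN=30, FN=10.
print(classification_metrics(40, 20, 30, 10))  # -> (0.7, 0.666..., 0.8)
```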
Trading off precision and recall
• For many applications we want to control the trade-
off between precision and recall
• Example
– Trained a logistic regression classifier
• Predict 1 if hθ(x) >= 0.5
• Predict 0 if hθ(x) < 0.5
– This classifier may give some value for precision and some value
for recall
– Predict 1 only if very confident
• One way to do this is to raise the prediction threshold
– Predict 1 if hθ(x) >= 0.8
– Predict 0 if hθ(x) < 0.8
• Now we can be more confident a 1 is a true positive
• But the classifier has lower recall - we predict y = 1 for a smaller number of
patients
– Risk of false negatives
– Another example - avoid false negatives
• A false negative is probably worse for the cancer example
– Now we may set a lower threshold
» Predict 1 if hθ(x) >= 0.3
» Predict 0 if hθ(x) < 0.3
– i.e. predict cancer if there is at least a 30% chance they have it
– So now we have a higher recall, but lower precision
» Risk of false positives, because we're less discriminating in deciding
when to say the person has cancer
• This threshold defines the trade-off
– We can show this graphically by plotting precision vs. recall
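A minimal sketch of that trade-off, assuming you already have the classifier's predicted probabilities and the true labels as NumPy arrays; the grid of thresholds is arbitrary:

```python
import numpy as np

def precision_recall_tradeoff(y_true, probs, thresholds=np.linspace(0.05, 0.95, 19)):
    """Precision and recall of 'predict 1 if h(x) >= t' for each threshold t.
    Raising t tends to raise precision and lower recall, and vice versa."""
    points = []
    for t in thresholds:
        pred = (probs >= t).astype(int)
        tp = int(np.sum((pred == 1) & (y_true == 1)))
        fp = int(np.sum((pred == 1) & (y_true == 0)))
        fn = int(np.sum((pred == 0) & (y_true == 1)))
        precision = tp / (tp + fp) if tp + fp > 0 else 1.0
        recall = tp / (tp + fn) if tp + fn > 0 else 0.0
        points.append((float(t), precision, recall))
    return points  # plot precision vs. recall from these points
```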
Quiz
Suppose you have trained a logistic regression classifier which is
outputting hθ(x). Currently, you predict 1 if hθ(x) ≥ threshold, and
predict 0 if hθ(x) < threshold, where currently the threshold is set to
0.5. Suppose you decrease the threshold to 0.1. Which of the following
are true? Check all that apply.
A. The classifier is likely to now have higher recall.
B. The classifier is likely to have unchanged precision and recall, but
higher accuracy
C. The classifier is likely to now have lower precision.
D. The classifier is likely to have unchanged precision and recall, but
lower accuracy
A, C
Quiz
Suppose you are working on a spam classifier, where spam emails are positive
examples ( y=1 ) and non-spam emails are negative examples ( y=0 ). You have a
training set of emails in which 99% of the emails are non-spam and the other 1% is
spam. Which of the following statements are true? Check all that apply
A. If you always predict non-spam (output y=0) your classifier will have 99%
accuracy on the training set, and it will likely perform similarly on the cross
validation set.
B. If you always predict non-spam (output y=0) your classifier will have an accuracy
of 99%
C. A good classifier should have both a high precision and high recall on the cross
validation set
D. If you always predict non-spam (output y=0) your classifier will have 99%
accuracy on the training set, but it will do much worse on the cross validation set
because it has overfit the training data
A, C
Quiz
Which of the following statements are true? Check all that apply.
A. It is a good idea to spend a lot of time collecting a large amount of data
before building your first version of a learning algorithm. If your model is
underfitting the training set, then obtaining more data is likely to help.
B. On skewed datasets (e.g., when there are more positive examples than
negative examples), accuracy is not a good measure of performance and
you should instead use F1 score based on the precision and recall.
C. After training a logistic regression classifier, you must use 0.5 as your
threshold for predicting whether an example is positive or negative.
D. Using a very large training set makes it unlikely for the model to overfit the
training data.