
Machine Learning

U19ADS503- Machine Learning


UNIT IV ADVICE FOR APPLYING MACHINE LEARNING 9
Debugging a learning algorithm - evaluating a hypothesis -
model selection and training, validation and test sets - bias vs.
variance - regularization and bias/variance - learning curves -
machine learning system design
UNIT V OTHER TOPICS 9
Unsupervised learning - k-means algorithm - optimization
objective - choosing the number of clusters - Dimensionality
reduction - principal component analysis - Anomaly detection
- algorithm - developing and evaluating the algorithm -
anomaly detection vs. supervised learning - Case study -
recommender system - collaborative filtering - Large scale
machine learning - online learning - MapReduce and
parallelism.

Advice for applying machine
learning

Debugging a learning algorithm
• Suppose you have implemented regularized
linear regression to predict housing prices

• Trained it
• But, when you test on new data you find it
makes unacceptably large errors in its
predictions
• What should you try next?
– Get more training data
• More data often helps, though not always; do some preliminary testing to make
sure more data will actually make a difference (discussed later)
– Try a smaller set of features
• Carefully select small subset
• You can do this by hand, or use some dimensionality reduction technique (e.g. PCA)
– Try getting additional features
• LOOK at the data
• Can be very time consuming
– Adding polynomial features
– Building your own, new, better features based on your knowledge of the
problem
• Can be risky if you accidentally overfit your data by creating new features which are
inherently specific/relevant to your training data
– Try decreasing or increasing λ
• Change how important the regularization term is in your calculations
– There are some simple techniques which can let you rule out half the things
on the list
• Save you a lot of time!
• Picking fixes at random might work, but it is very time-consuming, and it is largely
a matter of luck whether you end up fixing what the real problem is.

Evaluating a hypothesis
• When we fit parameters to training data, we try
to minimize the error.
• We might think a low training error is good - but it doesn't
necessarily mean a good parameter set
– It could, in fact, be indicative of overfitting
– This means your model will fail to generalize
• How do you tell if a hypothesis is overfitting?
– Could plot hθ(x)
– But with lots of features may be impossible to
plot
• Standard way to evaluate a hypothesis is
– Split data into two portions
• 1st portion is training set
• 2nd portion is test set
– Typical split might be 70:30 (training:test)

• A typical train and test scheme would be
– 1) Learn parameters θ from the training data,
minimizing J(θ) using the 70% training portion
– 2) Compute the test error
– For linear regression, Jtest(θ) is the average squared error
as measured on the test set:

Jtest(θ) = (1 / (2·mtest)) · Σ i=1..mtest ( hθ(xtest(i)) − ytest(i) )²

• This is the definition of the test set error

• If we were using logistic regression, the analogous test error would use the logistic cost:

Jtest(θ) = −(1 / mtest) · Σ i=1..mtest [ ytest(i) · log hθ(xtest(i)) + (1 − ytest(i)) · log(1 − hθ(xtest(i))) ]
• Sometimes there is a better way -
misclassification error (0/1 misclassification)
• We define the per-example error as follows:

err(hθ(x), y) = 1 if (hθ(x) ≥ 0.5 and y = 0) or (hθ(x) < 0.5 and y = 1), and 0 otherwise

• Then the test error is

Test error = (1 / mtest) · Σ i=1..mtest err(hθ(xtest(i)), ytest(i))

– i.e. it's the fraction of the test set that the hypothesis mislabels
• These are the standard techniques for evaluating
a learned hypothesis
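
As a rough, minimal illustration of the 70/30 split and the two test-set error measures above, here is a Python sketch on a small synthetic dataset (the data, the plain least-squares fit and the 0.5 threshold are illustrative assumptions, not part of the original notes):

```python
# Minimal sketch of evaluating a hypothesis on a held-out test set.
# A contrived 0/1 target is used so that both error measures apply.
import numpy as np

rng = np.random.default_rng(0)
m = 100
X = rng.normal(size=(m, 3))                      # 3 features, synthetic data
y = (X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=m) > 0).astype(float)

# 70/30 split (shuffle first so the split is random)
idx = rng.permutation(m)
train_idx, test_idx = idx[:70], idx[70:]
X_train, y_train = X[train_idx], y[train_idx]
X_test,  y_test  = X[test_idx],  y[test_idx]

# Fit a simple linear hypothesis h_theta(x) = theta^T x by least squares
Xb_train = np.hstack([np.ones((len(X_train), 1)), X_train])   # add bias column
theta, *_ = np.linalg.lstsq(Xb_train, y_train, rcond=None)

Xb_test = np.hstack([np.ones((len(X_test), 1)), X_test])
h_test = Xb_test @ theta

# Average squared error on the test set, J_test(theta)
j_test = np.mean((h_test - y_test) ** 2) / 2.0

# 0/1 misclassification error, thresholding the hypothesis at 0.5
predictions = (h_test >= 0.5).astype(float)
misclassification_error = np.mean(predictions != y_test)

print("J_test:", j_test, "misclassification error:", misclassification_error)
```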
Model selection and training, validation
and test sets
• How to choose the degree of the polynomial
• Model selection problem - try to choose the degree
of a polynomial to fit the data

• d = what degree of polynomial do you want to pick?


• Choose a model, fit that model and get an estimate
of how well your hypothesis will generalize
• You could
– Take model 1, minimize with training data which generates a parameter
vector θ1 (where d =1)
– Take model 2, do the same, get a different θ2 (where d = 2)
– And so on
– Take these parameters and look at the test set error for each using the
previous formula
• Jtest(θ1)
• Jtest(θ2)
• ...
• Jtest(θ10)
• You could then
– See which model has the lowest test set error
• Say, for example, d=5 is the lowest
– Now take the d=5 model and say, how well does it generalize?
• You could use Jtest(θ5)
• BUT, this is going to be an optimistic estimate of the generalization error, because the
extra parameter d was fit to the test set (i.e. we specifically chose it because its test set
error was small)
• So this is not a good way to evaluate whether it will generalize
• To address this problem, we do something a bit different for model
selection

Improved model selection
• Given a training set instead split into three pieces
– 1 - Training set (60%) - m values
– 2 - Cross validation (CV) set (20%) - mcv
– 3 - Test set (20%) - mtest
• As before, we can calculate
– Training error
– Cross validation error
– Test error

– Then
• Minimize the cost function for each of the models as before
• Test these hypotheses on the cross validation set to generate
the cross validation error
• Pick the hypothesis with the lowest cross validation error
– e.g. pick θ5
• Finally
– Estimate generalization error of model using the test set
• Final note
– In machine learning as practiced today, many people will
select the model using the test set and then check the
model's generalization using the test error (which, as noted
above, gives a biased estimate)
• With a MASSIVE test set this is maybe OK
– But better practice is to keep separate training, validation
and test sets, as in the sketch below
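
A minimal sketch of this improved model-selection procedure (60/20/20 split, choose the polynomial degree on the cross validation set, report generalization error on the test set), assuming a synthetic 1-D regression problem:

```python
# Sketch of model selection by polynomial degree using a 60/20/20 split.
import numpy as np

rng = np.random.default_rng(1)
m = 200
x = rng.uniform(-3, 3, size=m)
y = 0.5 * x**3 - x + rng.normal(scale=2.0, size=m)   # true signal is cubic

idx = rng.permutation(m)
tr, cv, te = idx[:120], idx[120:160], idx[160:]       # 60% / 20% / 20%

def fit_poly(x, y, d):
    """Least-squares fit of a degree-d polynomial; returns coefficients."""
    return np.polyfit(x, y, d)

def avg_sq_error(coeffs, x, y):
    return np.mean((np.polyval(coeffs, x) - y) ** 2) / 2.0

# Fit each candidate degree on the training set, score on the CV set
degrees = range(1, 11)
cv_errors = {}
for d in degrees:
    coeffs = fit_poly(x[tr], y[tr], d)
    cv_errors[d] = avg_sq_error(coeffs, x[cv], y[cv])

best_d = min(cv_errors, key=cv_errors.get)
best_coeffs = fit_poly(x[tr], y[tr], best_d)

# Only the final, chosen model touches the test set
print("chosen degree:", best_d,
      "test error:", avg_sq_error(best_coeffs, x[te], y[te]))
```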

K-fold Cross Validation
• In some applications, data is scarce
• The k-fold cross validation technique is
designed to give an accurate estimate of the
true error without wasting too much data
• In k-fold cross validation the original training
set is partitioned into k subsets (folds) of
size m/k
• For each fold, the algorithm is trained on the
union of the other folds and its error is then
estimated on the held-out fold.
• The average of these fold errors is the
estimate of the true error.
• The special case k = m, where m is the number
of examples, is called leave-one-out cross
validation (LOOCV).
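
A minimal sketch of k-fold cross validation using plain NumPy; the degree-2 polynomial model and the random seed are illustrative assumptions:

```python
# Sketch of k-fold cross validation for estimating the true error.
import numpy as np

def k_fold_error(x, y, k, degree=2):
    m = len(x)
    idx = np.random.default_rng(2).permutation(m)
    folds = np.array_split(idx, k)            # k folds of size ~m/k
    errors = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        pred = np.polyval(coeffs, x[val_idx])
        errors.append(np.mean((pred - y[val_idx]) ** 2))
    return np.mean(errors)                    # average fold error ~ true error

# Example usage (x, y assumed to exist): estimate = k_fold_error(x, y, k=10)
# k = m gives leave-one-out cross validation (LOOCV)
```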
Diagnosis - bias vs. variance
• If you get bad results it is usually because of one of
– High bias - underfitting problem
– High variance - overfitting problem
• Important to work out which is the problem

The degree of a model will increase as you move towards overfitting

• Lets define training and cross validation error as before
• Now plot
– x = degree of polynomial d
– y = error for both training and cross validation (two lines)
• CV error and test set error will be very similar
– This plot helps us understand the error

• We want to minimize both errors


– Which is why, in the example plot, the d=2 model is the sweet spot

• How do we apply this for diagnostics
– If cv error is high we're either at the high or the low end of d

– if d is too small --> this probably corresponds to a high bias problem


– if d is too large --> this probably corresponds to a high variance problem
• For the high bias case, we find both cross validation and training error
are high
– Doesn't fit training data well
– Doesn't generalize either
• For high variance, we find the cross validation error is high but training
error is low
– So we suffer from overfitting (training is low, cross validation is high)
– i.e. training set fits well
– But generalizes poorly

Regularization and bias/variance
• How are bias and variance affected by regularization?

• Consider fitting a high-order polynomial with a regularized
cost function (the regularization term keeps the parameter
values small):

J(θ) = (1 / (2m)) · Σ i=1..m ( hθ(x(i)) − y(i) )² + (λ / (2m)) · Σ j=1..n θj²

– Consider three cases
• λ = large
– All θ values are heavily penalized
– So most parameters end up being close to zero
– So the hypothesis ends up being nearly flat (roughly hθ(x) ≈ θ0)
– So high bias -> underfitting the data
• λ = intermediate
– Only intermediate values give a reasonable fit
• λ = small
– e.g. λ = 0
– The regularization term has no effect
– So high variance -> overfitting
Choosing λ
Have a set or range of values to use
Often increment by factors of 2 so
model(1)= λ = 0
model(2)= λ = 0.01
model(3)= λ = 0.02
model(4) = λ = 0.04
model(5) = λ = 0.08
.
.
.
model(p) = λ = 10
This gives a number of models which have different λ
• With these models
– Take each one (the p-th model)
– Minimize its cost function
– This will generate some parameter vector
• Call this θ(p)
– So now we have a set of parameter vectors corresponding
to models with different λ values
• Take all of the hypothesis and use the cross validation
set to validate them
– Measure average squared error on cross validation set
– Pick the model which gives the lowest error
• Finally, take the one we've selected and test it with the
test set
• Bias/variance as a function of λ
– Plot λ vs. Jtrain and Jcv

– Such a plot can help show whether you're picking a good value
for λ (see the sketch below)
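
A minimal sketch of this λ-selection loop. Ridge regression (L2-regularized linear regression, solved in closed form) stands in for the regularized model, and X_train/y_train, X_cv/y_cv, X_test/y_test are assumed to already exist from a 60/20/20 split:

```python
# Sketch of choosing the regularization parameter lambda with a CV set.
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution; the bias term is not regularized."""
    Xb = np.hstack([np.ones((len(X), 1)), X])
    reg = lam * np.eye(Xb.shape[1])
    reg[0, 0] = 0.0                            # don't penalize the intercept
    return np.linalg.solve(Xb.T @ Xb + reg, Xb.T @ y)

def avg_sq_error(theta, X, y):
    Xb = np.hstack([np.ones((len(X), 1)), X])
    return np.mean((Xb @ theta - y) ** 2) / 2.0

# Candidate lambdas roughly doubling each step, as in the notes
lambdas = [0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24]
cv_errors = []
for lam in lambdas:
    theta = ridge_fit(X_train, y_train, lam)
    cv_errors.append(avg_sq_error(theta, X_cv, y_cv))

best_lam = lambdas[int(np.argmin(cv_errors))]
theta_best = ridge_fit(X_train, y_train, best_lam)
print("chosen lambda:", best_lam,
      "test error:", avg_sq_error(theta_best, X_test, y_test))
```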
Learning curves
• A learning curve is often useful to plot
for algorithmic sanity checking or improving performance
• What is a learning curve?
– Plot Jtrain (average squared error on training set) and
Jcv (average squared error on cross validation set)
– Plot against m (number of training examples)
• m is a constant
• So artificially reduce m and recalculate errors with the smaller training set
sizes
– Jtrain
• Error on smaller sample
sizes is smaller (as less variance
to accommodate)
• So as m grows error grows
– Jcv
• Error on the cross validation set
• When you have a tiny training
set you generalize badly
• But as the training set grows your hypothesis generalizes better
• So the cv error will decrease as m increases
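
A minimal sketch of computing a learning curve, reusing the hypothetical ridge_fit and avg_sq_error helpers from the λ-selection sketch above; X_train/y_train and X_cv/y_cv are again assumed to exist, and λ = 1.0 is just an illustrative fixed value:

```python
# Sketch of plotting learning curves: train on increasing subset sizes and
# record J_train and J_cv for each size.
import numpy as np
import matplotlib.pyplot as plt

sizes = np.linspace(5, len(X_train), 20, dtype=int)
j_train, j_cv = [], []
for m_sub in sizes:
    theta = ridge_fit(X_train[:m_sub], y_train[:m_sub], lam=1.0)
    j_train.append(avg_sq_error(theta, X_train[:m_sub], y_train[:m_sub]))
    j_cv.append(avg_sq_error(theta, X_cv, y_cv))

plt.plot(sizes, j_train, label="J_train")
plt.plot(sizes, j_cv, label="J_cv")
plt.xlabel("training set size m")
plt.ylabel("error")
plt.legend()
plt.show()
```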
• What do these curves look like if you have
– High bias
– e.g. setting straight line to data
– Jtrain
• Training error is small at first and grows
• Training error becomes close to cross validation
• So the performance of the cross validation and training set end up
being similar (but very poor)
– Jcv
• Straight line fit is similar for a few vs. a lot of data
• So it doesn't generalize any better with lots of data because the
function just doesn't fit the data
– No increase in data will help it fit
• The problem with high bias is because cross validation and training
error are both high
• Also implies that if a learning algorithm has high bias as we get
more examples the cross validation error doesn't decrease
– So if an algorithm is already suffering from high bias, more data does
not help
• High variance e.g. high order polynomial
• Jtrain
– When set is small, training error is small too
– As training set sizes increases, value is still small
– But slowly increases (in a near linear fashion)
– Error is still low
• Jcv
– Error remains high, even when you have a moderate number of examples
– Because the problem with high variance (overfitting) is your model doesn't
generalize
• An indicative diagnostic that you have high variance is that there's a big
gap between training error and cross validation error
• If a learning algorithm is suffering from high variance, more data is
probably going to help

What to do next
• How do these ideas help us chose how we approach a problem?
• Original example
– Trained a learning algorithm (regularized linear regression)
– But, when you test on new data you find it makes unacceptably large errors in its predictions
– What should you try next?
• How do we decide what to do?
– Get more examples --> helps to fix high variance
• Not good if you have high bias

– Smaller set of features --> fixes high variance (overfitting)


• Not good if you have high bias

– Try adding additional features --> fixes high bias (the hypothesis is too simple; extra
features make it more expressive)

– Add polynomial terms --> fixes high bias

– Decreasing λ --> fixes high bias

– Increasing λ --> fixes high variance

Choosing the network architecture
• selecting a network architecture
• One option is to use a small neural network
– Few (maybe one) hidden layers and few hidden units
– Such networks are prone to underfitting
– But they are computationally cheaper
• Larger network
– More hidden layers and/or more hidden units
• How do you decide whether a larger network is better?
• Using a single hidden layer is a good default
– Also try 1, 2, 3 hidden layers and see which performs best on the cross validation set (see the sketch below)
– So, as before, split the data into three sets (training, cross validation, test)
• More units
– This is computationally expensive
– Prone to overfitting
• Use regularization to address overfitting
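
One possible way to compare architectures on the cross validation set is sketched below using scikit-learn's MLPClassifier; the candidate layer sizes, the regularization strength alpha, and the X_train/y_train, X_cv/y_cv variables are illustrative assumptions, not part of the original notes:

```python
# Sketch of comparing network architectures on a cross validation set.
from sklearn.neural_network import MLPClassifier

candidate_architectures = [(25,), (25, 25), (25, 25, 25)]   # 1, 2, 3 hidden layers
best_arch, best_cv_acc = None, -1.0
for arch in candidate_architectures:
    # alpha is the L2 regularization strength, used to control overfitting
    net = MLPClassifier(hidden_layer_sizes=arch, alpha=0.01,
                        max_iter=500, random_state=0)
    net.fit(X_train, y_train)
    cv_acc = net.score(X_cv, y_cv)
    if cv_acc > best_cv_acc:
        best_arch, best_cv_acc = arch, cv_acc

print("best architecture on CV set:", best_arch, "CV accuracy:", best_cv_acc)
```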

Quiz
Suppose you have implemented regularized logistic
regression to classify what object is in an image. However,
when you test your hypothesis on a new set of images,
you find it makes unacceptably large errors in its
predictions on the new set of images. However, your
hypothesis performs well on the training set. Which of
the following are promising steps to take?
A. Try adding polynomial features
B. Get more training examples
C. Use fewer training examples
D. Try using a smaller set of features

B, D
Quiz
Suppose you have implemented regularized logistic
regression to classify what object is in an image. However,
when you test your hypothesis on a new set of images,
you find it makes unacceptably large errors in its
predictions on the new set of images. However, your
hypothesis performs well on the training set. Which of
the following are promising steps to take?
A. Try decreasing the regularization parameter λ
B. Try evaluating the hypothesis on a cross validation set
rather than the test set
C. Try using a smaller set of features
D. Try increasing the regularization parameter λ
C, D
Quiz
Suppose you have implemented regularized logistic regression
to predict what items customers will purchase on a web
shopping site. However, when you test your hypothesis on a
new set of customers, you find it makes unacceptably large
errors in its predictions on the new set of customers. Furthermore,
your hypothesis performs poorly on the training set. Which of
the following are promising steps to take?
A. Try decreasing the regularization parameter λ
B. Try adding polynomial features
C. Try using a smaller set of features
D. Try to obtain and use additional features

A, B, D
Quiz
Which of the following statements are true?
A. The performance of the learning algorithm on the training set will
typically be better than its performance on the test set
B. Suppose you are training a regularized linear regression model.
The recommended way to choose what value of regularization
parameter λ to use is to choose the value of λ which gives the
lowest test error
C. Suppose you are training a regularized linear regression model.
The recommended way to choose what value of regularization
parameter λ to use is to choose the value of λ which gives the
lowest cross validation error
D. Suppose you are training a regularized linear regression model.
The recommended way to choose what value of regularization
parameter λ to use is to choose the value of λ which gives the
lowest training error

A, C
Quiz
Which of the following statements are true?
A. If a learning algorithm is suffering from high bias, only
adding more training examples may not improve the test
error significantly.
B. A model with more parameters is more prone to
overfitting and typically has higher variance.
C. When debugging learning algorithms, it is useful to plot a
learning curve to understand if there is a high bias or high
variance problem.
D. If a neural network has much lower training error than
test error, then adding more layers will help bring the test
error down because we can fit the test set better.

A, B, C
Quiz
You train a learning algorithm and find that it has unacceptably high error on the
test set. You plot the learning curve and obtain the figure below. Is the algorithm
suffering from high bias, high variance, or neither?

A. High bias
B. High variance
C. Neither
How to fix a high bias or a high
variance problem?
How to fix a high bias problem
• The following tricks are employed to fix a high bias problem.
• Train longer:
– Many machine learning algorithms are set up as iterative optimization problems where the
training error (or a function of the training error) is minimized. Just letting the algorithm run
for more hours or days can help reduce the bias. In Neural Networks, you can change a
parameter called the “learning rate” which will help the training error go down faster.
• Train a more complex model:
– A more complex model will fit the training data better. In the case of Neural Networks, one
can add more layers. In the case of an SVM, you can use a non-linear SVM instead of a
linear one.
• Obtain more features:
– Sometimes you just do not have enough information to train a model. For example, if you are
trying to train a model that can predict the gender of a person based on the color of their hair,
the problem might be impossible to solve. But if you add a new feature — the length of the
hair — the problem becomes more tractable.
• Decrease regularization:
– In a naive implementation, the parameters could take any values. This makes the model very
flexible. However, we may want to constrain the flexibility of the model to prevent overfitting,
so usually a regularization term is added to the cost function we are trying to minimize. To fix
high bias, we can weaken this constraint by reducing the value of the regularization parameter λ.
How to fix a high variance problem?
• Obtain more data:
– Because the validation error is large, it means that the training set and
the validation set that were randomly chosen from the same dataset,
somehow have different characteristics. This usually means that you
do not have enough data and you need to collect more.
• Decrease number of features:
– Sometimes collecting more data is not an option. In that case, you can
reduce the number of features. You may have to remove features
manually. For example, in our previous example of identifying the
gender of a person based on hair color and hair length, you may
decide to drop hair color and keep hair length.
• Increase regularization:
– When we have a high variance problem the model is fitting the
training data too closely. In fact, the model is probably fitting even the noise in the
training set and therefore not performing as well on the validation set.
We can reduce the flexibility of the model by using regularization, which
puts constraints on the magnitude of the parameters.
Machine Learning System Design
What to do next?
• Build an initial model quickly:
1. Train using training set — Fit the parameters
2. Development set — Tune the parameters
3. Test set — Assess the performance
• Prioritize Next Steps:
1. Use Bias and Variance Analysis to deal with
underfitting and overfitting
2. Analyse what is causing the error and fix them
until you have the required model ready!
Prioritizing what to work on -
spam classification example
• Idea of prioritizing what to work on is perhaps the most
important skill programmers typically need to develop
• Building a spam classifier
• Spam is email advertising
Spam classification
• What kind of features might we define
– Spam (1)
• Misspelled word
– Not spam (0)
• Real content
• How do we build a classifier
to distinguish between the two
– Feature representation
• How do we represent x (features of the email)?
– y = spam (1) or not spam (0)
Approach - choosing our own features
• Choose 100 words which are indicative of an email being
spam or not spam
– Spam --> e.g. buy, discount, deal
– Non spam --> Andrew, now
– All these words go into one long vector
• Encode this into a reference vector
– See which words appear in a message
• Define a feature vector x
– Whose j-th entry is 1 if the corresponding word in the reference vector
is present in the email and 0 otherwise
• This is a bitmap of the word content of your email
– i.e. don't recount if a word appears more than once
– In practice it's more common to take a training set and pick the
most frequently used n words, where n is 10,000 to 50,000
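
A minimal sketch of turning an email into such a binary word-presence feature vector; the five-word vocabulary is a tiny stand-in for the 10,000 to 50,000 most frequent words you would pick from a real training set:

```python
# Sketch of building a binary word-presence feature vector for an email.
import re

vocabulary = ["buy", "discount", "deal", "andrew", "now"]   # illustrative only

def email_to_feature_vector(email_text, vocab):
    words = set(re.findall(r"[a-z]+", email_text.lower()))  # unique words only
    # x_j = 1 if vocab word j appears anywhere in the email, else 0
    return [1 if w in words else 0 for w in vocab]

x = email_to_feature_vector("Huge DISCOUNT - buy now!!!", vocabulary)
print(x)   # [1, 1, 0, 0, 1]
```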
What's the best use of your time to improve system
accuracy?
• Natural inclination is to collect lots of data
– Honey pot anti-spam projects try and get fake email addresses
into spammers' hands, collect loads of spam
• Develop sophisticated features based on email routing
information (contained in email header)
– Spammers often try and obscure origins of email
– Send through unusual routes
• Develop sophisticated features for message body analysis
– Discount == discounts?
– DEAL == deal?
• Develop sophisticated algorithm to detect misspelling
– Spammers use misspelled word to get around detection systems

Note: Often a research group randomly focuses on just one of these options


Error analysis
• If you're building a machine learning system often
good to start by building a simple algorithm
which you can implement quickly
– Spend at most 24 hours developing an initial,
quickly bootstrapped algorithm
• Implement and test on cross validation data
– Plot learning curves to decide if more data, features
etc will help algorithmic optimization
• Hard to tell in advance what is important
• Learning curves really help with this
• Way of avoiding premature optimization
• Error analysis
• Manually examine the samples (in cross validation set) that
the algorithm made errors on
• See if we can work out why
– Systematic patterns - help design new features to avoid these
shortcomings
• e.g.
– Built a spam classifier with 500 examples in CV set - gets 100
wrong
– Manually look at 100 and categorize them depending on
features
• e.g. type of email
– Looking at those email
• May find most common type of spam emails are pharmacy
emails, phishing emails
• What features would have helped classify them correctly
– e.g. deliberate misspelling
– Unusual email routing
– Unusual punctuation
– May find that some "spammer technique" is causing a lot of your misses
• Importance of numerical evaluation
– Have a way of numerically evaluating the algorithm
– If you're developing an algorithm, it's really good to have some
performance calculation which gives a single real number to tell
you how well it's doing
– e.g.
• Say we are deciding if we should treat a set of similar words as the
same word
• This is done by stemming in NLP (e.g. "Porter stemmer" looks at
the etymological stem of a word)
• This may make your algorithm better or worse
– Also worth considering weighting errors (false positives vs. false negatives)
» e.g. is a false positive really bad, or is it worth having a few of them to
improve performance a lot?
• Can use numerical evaluation to compare the changes
– See if a change improves an algorithm or not
– A single real number may be hard/complicated to compute
• But makes it much easier to evaluate how changes impact your
algorithm
• You should do error analysis on the cross validation set
instead of the test set
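
A minimal sketch of this kind of single-number evaluation: compare a pipeline with and without one change (here, a very crude stemming step that strips a trailing "s") using the leave-one-out error rate on a toy spam dataset. The emails, labels and vocabulary are all illustrative assumptions:

```python
# Sketch of numerical evaluation: one CV error number per pipeline variant.
import numpy as np
from sklearn.linear_model import LogisticRegression

emails = ["huge discounts deal", "discount deals now", "meeting notes andrew",
          "project update now", "buy cheap deals", "lunch with andrew"]
labels = np.array([1, 1, 0, 0, 1, 0])          # 1 = spam, 0 = not spam
vocab = ["buy", "discount", "deal", "andrew", "now"]

def featurize(texts, stem):
    def norm(w):
        return w[:-1] if stem and w.endswith("s") else w
    return np.array([[1 if v in {norm(w) for w in t.split()} else 0 for v in vocab]
                     for t in texts])

def cv_error(stem):
    # Leave-one-out CV error rate as the single evaluation number
    errors = 0
    for i in range(len(emails)):
        tr = [j for j in range(len(emails)) if j != i]
        X_tr = featurize([emails[j] for j in tr], stem)
        X_te = featurize([emails[i]], stem)
        model = LogisticRegression().fit(X_tr, labels[tr])
        errors += int(model.predict(X_te)[0] != labels[i])
    return errors / len(emails)

# Keep whichever variant gives the lower cross validation error
print("error without stemming:", cv_error(stem=False))
print("error with stemming:   ", cv_error(stem=True))
```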
Error metrics for skewed classes
• One case where it's hard to come up with a good error metric -
skewed classes
• Example
– Cancer classification
• Train logistic regression model hθ(x) where
– Cancer means y = 1
– Otherwise y = 0
• Test classifier on test set
– Get 1% error → looks good
– But only 0.5% of the patients actually have cancer
» Now, 1% error looks very bad!
– this is an example of skewed classes
• Another example
– Algorithm has 99.2% accuracy
– Make a change, now get 99.5% accuracy
• Does this really represent an improvement to the algorithm?
– Did we do something useful, or did we just create something which
predicts y = 0 more often
• Get very low error, but classifier is still not great
Precision and recall
• Two new metrics - precision and recall
– Both give a value between 0 and 1
– Evaluating classifier on a test set
– For a test set, the actual class is 1 or 0
– Algorithm predicts some value for class, predicting a
value for each example in the test set
• Considering this, classification can be
– True positive (we guessed 1, it was 1)
– False positive (we guessed 1, it was 0)
– True negative (we guessed 0, it was 0)
– False negative (we guessed 0, it was 1)
Precision and Recall
• Precision
– How often does our algorithm cause a false alarm?
– Of all patients we predicted have cancer, what fraction of them actually have
cancer
• = true positives / # predicted positive
• = true positives / (true positive + false positive)
– High precision is good (i.e. closer to 1)
• You want a big number, because you want false positive to be as close to 0 as possible
• Recall
– How sensitive is our algorithm?
– Of all patients in set that actually have cancer, what fraction did we correctly
detect
• = true positives / # actual positives
• = true positive / (true positive + false negative)
– High recall is good (i.e. closer to 1)
• You want a big number, because you want false negative to be as close to 0 as possible
• By computing precision and recall, we get a better sense of how an
algorithm is doing
• Typically we say the presence of a rare class is what we're trying to
determine (e.g. positive (1) is the existence of the rare thing)
Accuracy, recall, precision, and related metrics
Accuracy is the proportion of all classifications that were correct, whether
positive or negative. It is mathematically defined as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Recall: The true positive rate (TPR), or the proportion of all actual positives that
were classified correctly as positives, is also known as recall. It is mathematically defined as:

Recall = TP / (TP + FN)

Precision is the proportion of all the model's positive classifications that are
actually positive. It is mathematically defined as:

Precision = TP / (TP + FP)
Precision and Recall
• Suppose there are 100 games in your test set. There
are four possible outcomes:
– You predict team will win and they do win.
You predict team will win but they lose.
You predict team will lose and they do lose.
You predict team will lose but they win.
– True Positive (you predict + and are correct),
False Positive (you predict + but are incorrect),
True Negative (you predict – and are correct),
False Negative (you predict – but are incorrect).
• Suppose that for the 100 games, your results are:
– True Positive (TP) = 40 (correctly predicted a win)
False Positive (FP) = 20 (incorrectly predicted a win)
True Negative (TN) = 30 (correctly predicted a loss)
False Negative (FN) = 10 (incorrectly predicted a loss)
• This 2×2 table is called a “confusion matrix”
Precision and Recall
• Accuracy = (40 + 30) / 100 = 0.70
• Precision = TP / (TP+FP) = 40 / (40+20) = 40/60 = 0.67
• Recall = TP / (TP+FN) = 40 / (40+10) = 40/50 = 0.80
• In the case of a logistic regression classifier, you can adjust
something called the threshold, which is an internal
number between 0 and 1 that determines whether a
prediction is positive or not.
• As you increase the threshold value above 0.5, it becomes
more difficult for a data item to be classified as positive.
• If you change the threshold for a logistic regression
classifier, it turns out the precision and recall will change.
• If the precision increases (your chance of winning your bet),
the recall (your number of betting opportunities) will
decrease. And vice versa.
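
A minimal sketch that reproduces the numbers from the 100-games example above directly from the confusion-matrix counts (the F1 score defined in the next section is included for completeness):

```python
# Sketch of computing accuracy, precision, recall and F1 from confusion-matrix counts.
def metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

acc, p, r, f1 = metrics(tp=40, fp=20, tn=30, fn=10)
print(f"accuracy={acc:.2f} precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
# accuracy=0.70 precision=0.67 recall=0.80 F1=0.73
```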
Trading off precision and recall
• For many applications we want to control the trade-
off between precision and recall
• Example
– Trained a logistic regression classifier
• Predict 1 if hθ(x) >= 0.5
• Predict 0 if hθ(x) < 0.5
– This classifier may give some value for precision and some value
for recall
– Predict 1 only if very confident
• One way to do this is to modify the prediction threshold
– Predict 1 if hθ(x) >= 0.8
– Predict 0 if hθ(x) < 0.8
• Now we can be more confident a 1 is a true positive
• But classifier has lower recall - predict y = 1 for a smaller number of
patients
– Risk of false negatives
– Another example - avoid false negatives
• This is probably worse for the cancer example
– Now we may set a lower threshold
» Predict 1 if hθ(x) >= 0.3
» Predict 0 if hθ(x) < 0.3
– i.e. 30% chance they have cancer
– So now we have a higher recall, but lower precision
» Risk of false positives, because we're less discriminating in deciding what
means the person has cancer
• This threshold defines the trade-off
– We can show this graphically by plotting precision vs. recall
– This curve can take many different shapes depending on classifier details
• Is there a way to automatically choose the threshold?
– Or, if we have a few algorithms, how do
we compare different algorithms or parameter sets?

• How do we decide which of these algorithms is best?


• We spoke previously about using a single real number
evaluation metric
• By switching to precision/recall we have two numbers
• Now comparison becomes harder
– Better to have just one number
• How can we convert P & R into one number?
• One option is the simple average - (P + R)/2
– This is not such a good solution
» A classifier which predicts y = 1 all the time gets a high recall
and low precision, yet can still look good under the average
» Similarly, if we predict y = 1 very rarely we get high precision and low recall
» For three example classifiers the averages here would be 0.45, 0.4 and 0.51
• 0.51 is best, despite that classifier having a recall of 1 - i.e. it predicts y = 1 for everything
» So the simple average isn't a great metric
– F1 score (fscore)
» F1 = 2 * (P * R) / (P + R)
» The F1 score is like an average of precision and recall that gives a higher weight to
the lower of the two values
– There are many formulas for combining precision and recall into a single value
» If P = 0 or R = 0 then the F1 score = 0
» If P = 1 and R = 1 then the F1 score = 1
» All other values lie between 0 and 1
• Threshold offers a way to control trade-off between precision and
recall
• F1 score gives a single real number evaluation metric
– If you're trying to automatically set the threshold, one way is to try a
range of threshold values and evaluate them on your cross
validation set
• Then pick the threshold which gives the best fscore.
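
A minimal sketch of choosing the threshold by F1 score on the cross validation set; probabilities_cv (the classifier's hθ(x) outputs on the CV set) and y_cv (the true CV labels) are assumed to already exist:

```python
# Sketch of picking a classification threshold by F1 score on the CV set.
import numpy as np

def f1_at_threshold(probabilities, y_true, threshold):
    preds = (probabilities >= threshold).astype(int)
    tp = int(np.sum((preds == 1) & (y_true == 1)))
    fp = int(np.sum((preds == 1) & (y_true == 0)))
    fn = int(np.sum((preds == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

thresholds = np.linspace(0.05, 0.95, 19)
scores = [f1_at_threshold(probabilities_cv, y_cv, t) for t in thresholds]
best_threshold = thresholds[int(np.argmax(scores))]
print("best threshold by F1 on the CV set:", best_threshold)
```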
Data for machine learning
• There have been studies of using different algorithms on data
• Algorithms
– Perceptron (logistic regression)
– Winnow
• Like logistic regression
• Used less now
– Memory based
• Used less now
– Naive Bayes
• Varied training set size and tried algorithms on a range of sizes
• What can we conclude
• Algorithms give remarkably similar performance
• As training set sizes increases accuracy increases
• Take an algorithm, give it more data, should beat a "better" one with less data
• Shows that
– Algorithm choice is pretty similar
– More data helps
• When is this true and when is it not?
– If we can correctly assume that features x have enough information to
predict y accurately, then more data will probably help
• A useful test to determine if this is true can be, "given x, can a human expert predict y?"
– So let's say we use a learning algorithm with many parameters, such as logistic
regression or linear regression with many features, or neural networks with
many hidden units
• These are powerful learning algorithms with many parameters which can fit complex
functions
– Such algorithms are low bias algorithms
• Use a small training set
– Training error should be small
• Use a very large training set
– With so much data even a complex algorithm is unlikely to overfit
– So the training set error should be close to the test set error
– And therefore the test set error should also be small
– Another way to think about this is we want our algorithm to have low bias and
low variance
• Low bias --> use complex algorithm
• Low variance --> use large training set
Quiz
You are working on a spam classification system
using regularized logistic regression. "Spam" is a
positive class (y = 1) and "not spam" is the
negative class (y = 0). You have trained your
classifier and there are m = 1000 examples in
the cross-validation set. The chart of predicted
class vs. actual class is:
                     Actual class: 1    Actual class: 0
Predicted class: 1         85                890
Predicted class: 0         15                 10
What is the classifier's recall (as a value from 0 to 1)?

Recall = TP / (TP + FN) = 85 / (85 + 15) = 0.85
Quiz
Suppose a massive dataset is available for training a learning algorithm.
Training on a lot of data is likely to give good performance when two of the
following conditions hold true. Which are the two?
A. The classes are not too skewed.
B. A human expert on the application domain can confidently
predict y when given only the features x (or, more generally, if we have
some way to be confident that x contains sufficient information to
predict y).
C. Our learning algorithm is able to represent fairly complex functions (for
example, if we train a neural network or other model with a large
number of parameters).
D. When we are willing to include high order polynomial features of x

B, C
Quiz
Suppose you have trained a logistic regression classifier which is
outputting hѲ(x). Currently, you predict 1 if hѲ(x) ≥ threshold, and
predict 0 if hѲ(x) < threshold, where the threshold is currently set to
0.5. Suppose you decrease the threshold to 0.1. Which of the following
are true? Check all that apply.
A. The classifier is likely to now have higher recall.
B. The classifier is likely to have unchanged precision and recall, but
higher accuracy
C. The classifier is likely to now have lower precision.
D. The classifier is likely to have unchanged precision and recall, but
lower accuracy

A,C
Quiz
Suppose you are working on a spam classifier, where spam emails are positive
examples ( y=1 ) and non-spam emails are negative examples ( y=0 ). You have a
training set of emails in which 99% of the emails are non-spam and the other 1% is
spam. Which of the following statements are true? Check all that apply
A. If you always predict non-spam (output y=0) your classifier will have 99%
accuracy on the training set, and it will likely perform similarly on the cross
validation set.
B. If you always predict non-spam (output y=0) your classifier will have an accuracy
of 99%
C. A good classifier should have both a high precision and high recall on the cross
validation set
D. If you always predict non-spam (output y=0) your classifier will have 99%
accuracy on the training set, but it will do much worse on the cross validation set
because it has overfit the training data

A
Quiz
Which of the following statements are true? Check all that apply.
A. It is a good idea to spend a lot of time collecting a large amount of data
before building your first version of a learning algorithm. If your model is
underfitting the training set, then obtaining more data is likely to help.
B. On skewed datasets (e.g., when there are more positive examples than
negative examples), accuracy is not a good measure of performance and
you should instead use F1 score based on the precision and recall.
C. After training a logistic regression classifier, you must use 0.5 as your
threshold for predicting whether an example is positive or negative.
D. Using a very large training set makes it unlikely for model to overfit the
training data.
