Machine Learning
Advice for applying machine learning
Debugging a learning algorithm
• Suppose you have implemented regularized
linear regression to predict housing prices
• Trained it
• But, when you test on new data you find it
makes unacceptably large errors in its
predictions
• What should you try next?
– Get more training data
• Getting more data often helps, although you should always do some preliminary testing to make
sure more data will actually make a difference (discussed later)
– Try a smaller set of features
• Carefully select small subset
• You can do this by hand, or use some dimensionality reduction technique (e.g. PCA)
– Try getting additional features
• LOOK at the data
• Can be very time consuming
– Adding polynomial features
– Building your own, new, better features based on your knowledge of the
problem
• Can be risky if you accidentally overfit your data by creating new features which are
inherently specific/relevant to your training data
– Try decreasing or increasing λ
• Change how important the regularization term is in your calculations
– There are some simple techniques which can let you rule out half the things
on the list
• Save you a lot of time!
• This approach might work, but it’s very time-consuming, and it’s largely a
matter of luck whether you end up fixing the real problem.
Evaluating a hypothesis
• When we fit parameters to training data, we try
to minimize the error.
• We might think a low training error is good - but it doesn't
necessarily mean a good parameter set
– Could, in fact, be indicative of overfitting
– This means your model will fail to generalize
• How do you tell if a hypothesis is overfitting?
– Could plot hθ(x)
– But with lots of features may be impossible to
plot
• Standard way to evaluate a hypothesis is
– Split data into two portions
• 1st portion is training set
• 2nd portion is test set
– Typical split might be 70:30 (training:test)
• A typical train and test scheme would be
– 1) Learn parameters θ from the training data,
minimizing J(θ) on the 70% training portion of the data
– 2) Compute the test error
– Jtest(θ) = average square error as measured on the
test set
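For regression, the test error mentioned above is just the average squared error over the test set. As a hedged reconstruction of the standard formula (the factor 1/2 is a common convention and could be dropped; m_test denotes the number of test examples):

$$ J_{test}(\theta) = \frac{1}{2\,m_{test}} \sum_{i=1}^{m_{test}} \Big( h_\theta\big(x^{(i)}_{test}\big) - y^{(i)}_{test} \Big)^2 $$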
• Sometimes there is a better way -
misclassification error (0/1 misclassification)
• We define the error as follows
– i.e. it's the fraction of the test set that the hypothesis mislabels (see the sketch below)
• These are the standard techniques for evaluating
a learned hypothesis
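A sketch of the 0/1 misclassification error referred to above, using the standard definitions (a per-example error, then its average over the test set):

$$ \mathrm{err}\big(h_\theta(x), y\big) = \begin{cases} 1 & \text{if } h_\theta(x) \ge 0.5 \text{ and } y = 0,\ \text{or } h_\theta(x) < 0.5 \text{ and } y = 1 \\ 0 & \text{otherwise} \end{cases} $$

$$ \text{Test error} = \frac{1}{m_{test}} \sum_{i=1}^{m_{test}} \mathrm{err}\big(h_\theta(x^{(i)}_{test}),\, y^{(i)}_{test}\big) $$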
Model selection and training, validation
and test sets
• How to choose degree of polynomial
• Model selection problem - try to choose the degree
of the polynomial used to fit the data
Improved model selection
• Given a data set, instead split it into three pieces
– 1 - Training set (60%) - m values
– 2 - Cross validation (CV) set (20%) - mcv
– 3 - Test set (20%) - mtest
• As before, we can calculate
– Training error
– Cross validation error
– Test error
– Then
• Minimize cost function for each of the models as before
• Test these hypotheses on the cross validation set to generate
the cross validation error
• Pick the hypothesis with the lowest cross validation error
– e.g. pick θ5
• Finally
– Estimate generalization error of model using the test set
• Final note
– In machine learning as practiced today - many people will
select the model using the test set and then check the
model is OK for generalization using the test error (which
we've said is bad because it gives a biased, optimistic estimate)
• With a MASSIVE test set this is maybe OK
– But better practice is to have separate training and
validation sets
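A minimal Python/NumPy sketch of the degree-selection procedure above; it assumes a one-dimensional input and plain (unregularized) least squares for simplicity, and the helper name poly_features is hypothetical:

```python
import numpy as np

def poly_features(x, d):
    # Map a 1-D input to [1, x, x^2, ..., x^d] (intercept included).
    return np.column_stack([x ** p for p in range(d + 1)])

def select_degree(x_train, y_train, x_cv, y_cv, max_degree=10):
    """Fit one model per polynomial degree on the training set and
    pick the degree with the lowest cross validation error."""
    cv_errors = []
    for d in range(1, max_degree + 1):
        X_tr, X_val = poly_features(x_train, d), poly_features(x_cv, d)
        theta, *_ = np.linalg.lstsq(X_tr, y_train, rcond=None)
        cv_errors.append(np.mean((X_val @ theta - y_cv) ** 2) / 2.0)
    return int(np.argmin(cv_errors)) + 1, cv_errors
```

The test set is then used only once, on the chosen degree, to estimate the generalization error.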
K-fold Cross Validation
• In some applications, data is scarce
• The k-fold cross validation technique is
designed to give an accurate estimate of the
true error without wasting too much data
• In k-fold cross validation the original training
set is partitioned into k subsets (folds) of
size m/k
• For each fold, the algorithm is trained on the
union of the other folds and then its error is
estimated on that held-out fold.
• The average of these errors is the
estimate of the true error.
• The special case k = m, where m is the number
of examples, is called leave-one-out cross
validation (LOOCV).
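A minimal sketch of k-fold cross validation, assuming NumPy; train_fn and error_fn are hypothetical placeholders for whatever learner and loss you are using:

```python
import numpy as np

def k_fold_cv_error(X, y, train_fn, error_fn, k=10):
    """Estimate the true error: split the data into k folds, train on
    k-1 of them, evaluate on the held-out fold, and average."""
    indices = np.random.permutation(len(y))
    folds = np.array_split(indices, k)
    errors = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train_idx], y[train_idx])             # train on the other folds
        errors.append(error_fn(model, X[val_idx], y[val_idx]))   # error on the held-out fold
    return float(np.mean(errors))
```

Setting k equal to the number of examples gives leave-one-out cross validation.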
Diagnosis - bias vs. variance
• If you get bad results, it is usually because of one of
– High bias - underfitting problem
– High variance - overfitting problem
• Important to work out which is the problem
• Let's define training and cross validation error as before
• Now plot
– x = degree of polynomial d
– y = error for both training and cross validation (two lines)
• CV error and test set error will be very similar
– This plot helps us understand the error
• How do we apply this for diagnostics?
– If the cv error is high, we're either at the low end of d (underfitting)
or the high end of d (overfitting)
Regularization and bias/variance
• How are bias and variance affected by regularization?
– Plotting training and cross validation error against λ can help
show whether you're picking a good value for λ
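A minimal sketch of that λ sweep, assuming NumPy, closed-form regularized linear regression, and feature matrices that already include an intercept column; the candidate λ values are illustrative only:

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form regularized linear regression; the intercept
    # (assumed to be the first column of X) is not penalized.
    reg = lam * np.eye(X.shape[1])
    reg[0, 0] = 0.0
    return np.linalg.pinv(X.T @ X + reg) @ (X.T @ y)

def choose_lambda(X_train, y_train, X_cv, y_cv,
                  lambdas=(0.0, 0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0)):
    """Train one model per candidate lambda and pick the value with
    the lowest (unregularized) cross validation error."""
    cv_errors = []
    for lam in lambdas:
        theta = ridge_fit(X_train, y_train, lam)
        cv_errors.append(np.mean((X_cv @ theta - y_cv) ** 2) / 2.0)
    return lambdas[int(np.argmin(cv_errors))], cv_errors
```

Plotting cv_errors (and the corresponding training errors) against λ gives the picture the slide refers to: small λ tends toward high variance, large λ toward high bias.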
Learning curves
• A learning curve is often useful to plot
for algorithmic sanity checking or improving performance
• What is a learning curve?
– Plot Jtrain (average squared error on training set) and
Jcv (average squared error on cross validation set)
– Plot against m (number of training examples)
• m is fixed by your data set
• So artificially reduce m and recalculate the errors with the smaller training set
sizes
– Jtrain
• Error on smaller sample
sizes is smaller (as less variance
to accommodate)
• So as m grows error grows
– Jcv
• Error on cross validation set
• When you have a tiny training
set you generalize badly
• But as the training set grows your hypothesis generalizes better
• So cv error will decrease as m increases
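A minimal sketch of plotting a learning curve, assuming NumPy and Matplotlib; train_fn and error_fn are the same kind of hypothetical placeholders as in the k-fold sketch:

```python
import numpy as np
import matplotlib.pyplot as plt

def learning_curve(X_train, y_train, X_cv, y_cv, train_fn, error_fn):
    """Train on progressively larger subsets of the training data and
    plot training error and cross validation error against m."""
    m = len(y_train)
    sizes = np.unique(np.linspace(1, m, num=min(m, 20), dtype=int))
    j_train, j_cv = [], []
    for size in sizes:
        model = train_fn(X_train[:size], y_train[:size])
        j_train.append(error_fn(model, X_train[:size], y_train[:size]))
        j_cv.append(error_fn(model, X_cv, y_cv))   # always the full CV set
    plt.plot(sizes, j_train, label="J_train")
    plt.plot(sizes, j_cv, label="J_cv")
    plt.xlabel("m (training set size)")
    plt.ylabel("error")
    plt.legend()
    plt.show()
    return sizes, j_train, j_cv
```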
• What do these curves look like if you have
– High bias
– e.g. fitting a straight line to the data
– Jtrain
• Training error is small at first and grows
• Training error becomes close to the cross validation error
• So the performance on the cross validation and training sets ends up
being similar (but very poor)
– Jcv
• Straight line fit is similar for a few vs. a lot of data
• So it doesn't generalize any better with lots of data because the
function just doesn't fit the data
– No increase in data will help it fit
• The problem with high bias is because cross validation and training
error are both high
• Also implies that if a learning algorithm has high bias as we get
more examples the cross validation error doesn't decrease
– So if an algorithm is already suffering from high bias, more data does
not help
• High variance e.g. high order polynomial
• Jtrain
– When the training set is small, the training error is small too
– As the training set size increases, the value is still small
– But it slowly increases (in a near linear fashion)
– Error is still low
• Jcv
– Error remains high, even when you have a moderate number of examples
– Because the problem with high variance (overfitting) is your model doesn't
generalize
• An indicative diagnostic that you have high variance is that there's a big
gap between training error and cross validation error
• If a learning algorithm is suffering from high variance, more data is
probably going to help
What to do next
• How do these ideas help us choose how we approach a problem?
• Original example
– Trained a learning algorithm (regularized linear regression)
– But, when you test on new data you find it makes unacceptably large errors in its predictions
– What should you try next?
• How do we decide what to do?
– Get more examples --> helps to fix high variance
• Not good if you have high bias
– Try adding additional features --> helps to fix high bias (because the hypothesis is too simple;
adding features makes it more expressive)
Choosing the network architecture
• Selecting a network architecture
• One option is to use a small neural network
– Few (maybe one) hidden layer and few hidden units
– Such networks are prone to underfitting
– But they are computationally cheaper
• Larger network
– More hidden layers
• How do you decide that a larger network is good?
• Using a single hidden layer is a good default
– Also try with 1, 2, 3 hidden layers and see which performs best on the cross validation set
– So like before, use three sets (training, cross validation, test)
• More units
– This is computationally expensive
– Prone to overfitting
• Use regularization to address overfitting
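A minimal sketch of that architecture search, assuming hypothetical train_network(n_hidden_layers, X, y) and error_fn(model, X, y) helpers (neither is defined in the slides):

```python
def choose_architecture(X_train, y_train, X_cv, y_cv, train_network, error_fn):
    """Train networks with 1, 2 and 3 hidden layers and keep the one
    with the lowest cross validation error, as the slide suggests."""
    best_model, best_layers, best_err = None, None, float("inf")
    for n_hidden_layers in (1, 2, 3):
        model = train_network(n_hidden_layers, X_train, y_train)
        err = error_fn(model, X_cv, y_cv)
        if err < best_err:
            best_model, best_layers, best_err = model, n_hidden_layers, err
    return best_model, best_layers
```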
Quiz
Suppose you have implemented regularized logistic
regression to classify what object is in an image. However,
when you test your hypothesis on a new set of images,
you find it makes unacceptably large errors with its
predictions on the new images. However, your
hypothesis performs well on the training set. Which of
the following are promising steps to take?
A. Try adding polynomial features
B. Get more training examples
C. Use fewer training examples
D. Try using smaller set of features
B, D
Quiz
Suppose you have implemented regularized logistic
regression to classify what object is in an image. However,
when you test your hypothesis on a new set of images,
you find it makes unacceptably large errors with its
predictions on the new images. However, your
hypothesis performs well on the training set. Which of
the following are promising steps to take?
A. Try decreasing the regularization parameter λ
B. Try evaluating the hypothesis on a cross validation set
rather than the test set
C. Try using smaller set of features
D. Try increasing the regularization parameter λ
C, D
Quiz
Suppose you have implemented regularized logistic regression
to predict what items customers will purchase on a web
shopping site. However, when you test your hypothesis on a
new set of customers, you find it makes unacceptably large
errors with its predictions. Furthermore,
your hypothesis performs poorly on the training set. Which of
the following are promising steps to take?
A. Try decreasing the regularization parameter λ
B. Try adding polynomial features
C. Try using smaller set of features
D. Try to obtain and use additional features
A, B, D
Quiz
Which of the following statements are true?
A. The performance of the learning algorithm on the training set will
typically be better than its performance on the test set
B. Suppose you are training a regularized linear regression model.
The recommended way to choose what value of regularization
parameter λ to use is to choose the value of λ which gives the
lowest test error
C. Suppose you are training a regularized linear regression model.
The recommended way to choose what value of regularization
parameter λ to use is to choose the value of λ which gives the
lowest cross validation error
D. Suppose you are training a regularized linear regression model.
The recommended way to choose what value of regularization
parameter λ to use is to choose the value of λ which gives the
lowest training error
A, C
Quiz
Which of the following statements are true?
A. If a learning algorithm is suffering from high bias, only
adding more training examples may not improve the test
error significantly.
B. A model with more parameters is more prone to
overfitting and typically has higher variance.
C. When debugging learning algorithms, it is useful to plot a
learning curve to understand if there is a high bias or high
variance problem.
D. If a neural network has much lower training error than
test error, then adding more layers will help bring the test
error down because we can fit the test set better.
A, B, C
Quiz
You train a learning algorithm and find that it has unacceptably high error on the
test set. You plot the learning curve and obtain the figure below. Is the algorithm
suffering from high bias, high variance, or neither?
A. High bias
B. High variance
C. Neither
How to fix a high bias or a high
variance problem?
How to fix a high bias problem
• The following tricks are employed to fix a high bias problem.
• Train longer:
– Many machine learning algorithms are set up as iterative optimization problems where the
training error (or a function of the training error) is minimized. Just letting the algorithm run
for more hours or days can help reduce the bias. In neural networks, you can change a
parameter called the “learning rate” which will help the training error go down faster.
• Train a more complex model:
– A more complex model will fit the training data better. In the case of neural networks, one
can add more layers. In the case of an SVM, you can use a non-linear SVM instead of a
linear one.
• Obtain more features:
– Sometimes you just do not have enough information to train a model. For example, if you are
trying to train a model that can predict the gender of a person based on the color of their hair,
the problem might be impossible to solve. But if you add a new feature — the length of the
hair — the problem becomes more tractable.
• Decrease regularization:
– In a naive implementation, the parameters could take any values. This makes the model very
flexible. However, we may want to constrain the flexibility of the model to prevent overfitting,
so usually a regularization term is added to the cost function we are trying to minimize. We
can weaken that constraint by reducing the value of the regularization parameter.
How to fix a high variance problem?
• Obtain more data:
– Because the validation error is large, it means that the training set and
the validation set that were randomly chosen from the same dataset,
somehow have different characteristics. This usually means that you
do not have enough data and you need to collect more.
• Decrease number of features:
– Sometimes collecting more data is not an option. In that case, you can
reduce the number of features. You may have to remove features
manually. For example, in our previous example of identifying the
gender of a person based on hair color and hair length, you may
decide to drop hair color and keep hair length.
• Increase regularization:
– When we have a high variance problem the model is fitting the
training data too closely. In fact, the model is probably fitting even the noise in the
training set and therefore not performing as well on the validation set.
We can reduce the flexibility of the model by using regularization that
puts constraints on the magnitude of the parameters.
Machine Learning System Design
What to do next?
• Build an initial model quickly:
1. Train using training set — Fit the parameters
2. Development set — Tune the parameters
3. Test set — Assess the performance
• Prioritize Next Steps:
1. Use Bias and Variance Analysis to deal with
underfitting and overfitting
2. Analyse what is causing the errors and fix them
until you have the required model ready!
Prioritizing what to work on -
spam classification example
• The idea of prioritizing what to work on is perhaps the most
important skill programmers typically need to develop
• Building a spam classifier
• Spam is email advertising
Spam classification
• What kind of features might we define
– Spam (1)
• Misspelled word
– Not spam (0)
• Real content
• How do we build a classifier
to distinguish between the two
– Feature representation
• How do we represent x (features of the email)?
– y = spam (1) or not spam (0)
Approach - choosing our own features
• Choose 100 words which are indicative of an email being
spam or not spam
– Spam --> e.g. buy, discount, deal
– Non spam --> Andrew, now
– All these words go into one long vector
• Encode this into a reference vector
– See which words appear in a message
• Define a feature vector x
– Where each entry is 1 or 0 depending on whether the corresponding word in the
reference vector is present or not
• This is a bitmap of the word content of your email
– i.e. don't recount if a word appears more than once
– In practice it's more common to have a training set and pick the
most frequently used n words, where n is 10,000 to 50,000
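A minimal sketch of that bitmap feature vector; the letters-only tokenization and the tiny reference list are assumptions for illustration only:

```python
import re

def email_to_feature_vector(email_text, reference_words):
    """Binary (bitmap) feature vector: x[j] = 1 if reference_words[j]
    appears anywhere in the email, else 0; repeats are not recounted."""
    tokens = set(re.findall(r"[a-z]+", email_text.lower()))  # crude tokenization
    return [1 if word in tokens else 0 for word in reference_words]

# Tiny hand-picked reference list (the slide suggests ~100 hand-chosen
# words, or the 10,000-50,000 most frequent words in practice).
reference = ["buy", "discount", "deal", "andrew", "now"]
x = email_to_feature_vector("Huge DISCOUNT - buy now, best deal ever!", reference)
print(x)  # -> [1, 1, 1, 0, 1]
```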
What's the best use of your time to improve system
accuracy?
• Natural inclination is to collect lots of data
– Honey pot anti-spam projects try and get fake email addresses
into spammers' hands, collect loads of spam
• Develop sophisticated features based on email routing
information (contained in email header)
– Spammers often try and obscure origins of email
– Send through unusual routes
• Develop sophisticated features for message body analysis
– Discount == discounts?
– DEAL == deal?
• Develop a sophisticated algorithm to detect misspellings
– Spammers use misspelled words to get around detection systems
Accuracy, recall, precision, and related metrics
• Recall: the true positive rate (TPR), or the proportion of all actual positives that
were classified correctly as positives, is also known as recall.
• Precision is the proportion of all the model's positive classifications that are
actually positive. It is mathematically defined as Precision = TP / (TP + FP).
Precision and Recall
• Suppose there are 100 games in your test set. There
are four possible outcomes:
– You predict team will win and they do win.
You predict team will win but they lose.
You predict team will lose and they do lose.
You predict team will lose but they win.
– True Positive (you predict + and are correct),
False Positive (you predict + but are incorrect),
True Negative (you predict – and are correct),
False Negative (you predict – but are incorrect).
• Suppose that for the 100 games, your results are:
– True Positive (TP) = 40 (correctly predicted a win)
False Positive (FP) = 20 (incorrectly predicted a win)
True Negative (TN) = 30 (correctly predicted a loss)
False Negative (FN) = 10 (incorrectly predicted a loss)
• These counts form a 2×2 table, called a “confusion matrix”
Precision and Recall
• Accuracy = (TP + TN) / total = (40 + 30) / 100 = 0.70
• Precision = TP / (TP+FP) = 40 / (40+20) = 40/60 = 0.67
• Recall = TP / (TP+FN) = 40 / (40+10) = 40/50 = 0.80
• In the case of a logistic regression classifier, you can adjust
something called the threshold, which is an internal
number between 0 and 1 that determines whether a
prediction is positive or not.
• As you increase the threshold value above 0.5, it becomes
more difficult for a data item to be classified as positive.
• If you change the threshold for a logistic regression
classifier, it turns out the precision and recall will change.
• If the precision increases (your chance of winning your bet),
the recall (your number of betting opportunities) will
decrease. And vice versa.
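A small sketch that reproduces the numbers above from the four confusion matrix counts (just the slide's own formulas, wrapped in a function):

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision and recall from the confusion matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# The 100-game example: TP=40, FP=20, TN=30, FN=10.
print(classification_metrics(40, 20, 30, 10))  # -> (0.7, 0.666..., 0.8)
```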
Trading off precision and recall
• For many applications we want to control the trade-
off between precision and recall
• Example
– Trained a logistic regression classifier
• Predict 1 if hθ(x) >= 0.5
• Predict 0 if hθ(x) < 0.5
– This classifier may give some value for precision and some value
for recall
– Predict 1 only if very confident
• One way to do this is to raise the prediction threshold
– Predict 1 if hθ(x) >= 0.8
– Predict 0 if hθ(x) < 0.8
• Now we can be more confident a 1 is a true positive
• But the classifier has lower recall - we predict y = 1 for a smaller number of
patients
– Risk of false negatives
– Another example - avoid false negatives
• A false negative is probably worse for the cancer example
– Now we may set a lower threshold
» Predict 1 if hθ(x) >= 0.3
» Predict 0 if hθ(x) < 0.3
– i.e. predict cancer if there is at least a 30% chance they have it
– So now we have a higher recall, but lower precision
» Risk of false positives, because we're less discriminating in deciding
when to say the person has cancer
• This threshold defines the trade-off
– We can show this graphically by plotting precision vs. recall
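A minimal sketch of that trade-off, assuming you already have the classifier's predicted probabilities and the true labels as NumPy arrays; the grid of thresholds is arbitrary:

```python
import numpy as np

def precision_recall_tradeoff(y_true, probs, thresholds=np.linspace(0.05, 0.95, 19)):
    """Precision and recall of 'predict 1 if h(x) >= t' for each threshold t.
    Raising t tends to raise precision and lower recall, and vice versa."""
    points = []
    for t in thresholds:
        pred = (probs >= t).astype(int)
        tp = int(np.sum((pred == 1) & (y_true == 1)))
        fp = int(np.sum((pred == 1) & (y_true == 0)))
        fn = int(np.sum((pred == 0) & (y_true == 1)))
        precision = tp / (tp + fp) if tp + fp > 0 else 1.0
        recall = tp / (tp + fn) if tp + fn > 0 else 0.0
        points.append((float(t), precision, recall))
    return points  # plot precision vs. recall from these points
```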
Quiz
Suppose you have trained a logistic regression classifier which is
outputting hθ(x). Currently, you predict 1 if hθ(x) ≥ threshold, and
predict 0 if hθ(x) < threshold, where currently the threshold is set to
0.5. Suppose you decrease the threshold to 0.1. Which of the following
are true? Check all that apply.
A. The classifier is likely to now have higher recall.
B. The classifier is likely to have unchanged precision and recall, but
higher accuracy
C. The classifier is likely to now have lower precision.
D. The classifier is likely to have unchanged precision and recall, but
lower accuracy
A, C
Quiz
Suppose you are working on a spam classifier, where spam emails are positive
examples ( y=1 ) and non-spam emails are negative examples ( y=0 ). You have a
training set of emails in which 99% of the emails are non-spam and the other 1% is
spam. Which of the following statements are true? Check all that apply
A. If you always predict non-spam (output y=0) your classifier will have 99%
accuracy on the training set, and it will likely perform similarly on the cross
validation set.
B. If you always predict non-spam (output y=0) your classifier will have an accuracy
of 99%
C. A good classifier should have both a high precision and high recall on the cross
validation set
D. If you always predict non-spam (output y=0) your classifier will have 99%
accuracy on the training set, but it will do much worse on the cross validation set
because it has overfit the training data
A, C
Quiz
Which of the following statements are true? Check all that apply.
A. It is a good idea to spend a lot of time collecting a large amount of data
before building your first version of a learning algorithm. If your model is
underfitting the training set, then obtaining more data is likely to help.
B. On skewed datasets (e.g., when there are more positive examples than
negative examples), accuracy is not a good measure of performance and
you should instead use F1 score based on the precision and recall.
C. After training a logistic regression classifier, you must use 0.5 as your
threshold for predicting whether an example is positive or negative.
D. Using a very large training set makes it unlikely for the model to overfit the
training data.