
Module 2

Supervised
Learning
Copyright © 2018 McGraw Hill Education, All Rights Reserved.

Inductive learning
Task of inductive learning:
• Given a collection of examples of a function f, return
a function h that approximates f.
• The approximating function h is called the hypothesis
function.
• The unknown true function f correctly maps the input
space X (of the entire data) to the output space Y.
• The central aim of designing h is to suggest decisions for
unseen patterns.
• A better approximation of f leads to better generalization.
• Generalization performance is the fundamental problem
in inductive learning.
• Off-training-set error (the error on points not in the
training set) is used as a measure of generalization
performance.
• Inductive learning assumes that the best hypothesis
regarding unseen patterns is the one induced by the
observed training set.
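A minimal sketch of this setting (synthetic data and a hypothetical model choice), in which a hypothesis h is induced from labeled examples of an unknown f and then judged by its off-training-set error:

```python
# Minimal sketch: induce a hypothesis h from examples of an unknown f,
# then evaluate it on points not seen during training (off-training-set error).
# The data and model choice here are purely illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=50)
y = np.sin(x) + 0.1 * rng.normal(size=50)    # noisy samples of the unknown f

x_train, y_train = x[:35], y[:35]            # observed training set
x_test,  y_test  = x[35:], y[35:]            # "unseen" patterns

h = np.poly1d(np.polyfit(x_train, y_train, deg=3))   # hypothesis function h

train_err = np.mean((h(x_train) - y_train) ** 2)
test_err  = np.mean((h(x_test)  - y_test)  ** 2)     # off-training-set error
print(train_err, test_err)
```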
Occam’s Razor Principle
• A simpler algorithm can be expected to perform better
on a test set.
• “Simpler” may stand for fewer parameters, less
training time, fewer features, and so forth.
• Generally, the search for a design is stopped when the
solution is “good enough” rather than optimal.
• Occam’s razor principle recommends hypothesis
functions that avoid overfitting of the training data.
Overfitting
• When, as model complexity increases, performance on the
training set keeps improving while performance on the test
set deteriorates, overfitting has occurred.

The accuracy of the classifier over the training examples
increases monotonically as the classifier grows in
complexity. However, the accuracy over the independent
test examples first increases and then decreases.
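This behaviour can be reproduced with a minimal sketch (illustrative synthetic data; polynomial degree stands in for classifier complexity):

```python
# Illustrative sketch: as model complexity (polynomial degree) grows,
# training error keeps falling while error on held-out data eventually rises.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=60)
y = np.sin(x) + 0.2 * rng.normal(size=60)

x_tr, y_tr = x[:40], y[:40]
x_te, y_te = x[40:], y[40:]

for degree in (1, 3, 6, 12):
    h = np.poly1d(np.polyfit(x_tr, y_tr, degree))
    tr = np.mean((h(x_tr) - y_tr) ** 2)
    te = np.mean((h(x_te) - y_te) ** 2)
    print(f"degree={degree:2d}  train MSE={tr:.3f}  test MSE={te:.3f}")
```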
Heuristic Search in Inductive
Learning
Goal of machine learning:
• Not to learn an exact representation of the training data.
• To build a statistical model of the process that generates
the data.
Success of learning: Depends on hypothesis space
complexity and sample complexity.
Search problem: finding a hypothesis function of appropriate
complexity that is consistent with the given training data.
The machine learning community depends on tools that
appear to be heuristic, trial-and-error tools.
Estimating Generalization
Errors
Holdout method and random subsampling
• A certain amount of data is reserved for testing and the
rest is used for training.
• To partition the dataset D, randomly sample a set of training
examples from D and use the rest for testing.
• For time-series data, use the earlier part for training
and the later for testing.
• Usually, one-third of the data is used for testing.
• This way of partitioning time-series data is suitable
because, when the learning machine is used in the real
world, unseen data come from the future.
• Samples used for training and testing should have the
same distribution.
• Whether a sample is representative cannot be verified,
since the underlying distribution is unknown.
• Check: In classification problems, each class should be
represented in about the right proportion in the training
and test sets.
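A minimal sketch of the holdout method with a stratified split, using scikit-learn (an assumed dependency; the data are synthetic and purely illustrative):

```python
# Minimal sketch of the holdout method with stratification.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(300, 5)                              # 300 samples, 5 features
y = np.random.choice([0, 1], size=300, p=[0.9, 0.1])    # unbalanced labels

# Reserve one-third of the data for testing; stratify so that each class is
# represented in about the right proportion in both training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)

print(y_train.mean(), y_test.mean())                    # similar class proportions
```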
K-Fold Cross-Validation
• Data is randomly partitioned into K mutually exclusive
subsets or “folds”, each of approximately equal size.
• In iteration k, the k-th fold is used as the test set and the
remaining folds are collectively used to train the model.
• The error estimates obtained from the K iterations are
averaged to yield an overall error estimate.
• If stratification is adopted, the procedure is called
stratified K-fold cross-validation for classification.
• K = 10 folds is the standard number used for predicting
the error rate of a learning technique.
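A minimal sketch of stratified 10-fold cross-validation with scikit-learn (an assumed dependency; the data and classifier are illustrative choices):

```python
# Minimal sketch: stratified K-fold cross-validation, K = 10.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

X = np.random.rand(200, 4)
y = np.random.choice([0, 1], size=200)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
errors = []
for train_idx, test_idx in skf.split(X, y):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    errors.append(1.0 - model.score(X[test_idx], y[test_idx]))

print("estimated error rate:", np.mean(errors))   # averaged over the K folds
```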


Assessing Regression
Accuracy
Mean Square Error
• Most commonly used metric: MSE = (1/N) Σ (y(i) − ŷ(i))²,
the average squared prediction error over the N data points.
Root Mean Square Error
• RMSE = √MSE; it has the same dimensions as the predicted
value itself.
Sum-of-Error-Squares
• SSE = Σ (y(i) − ŷ(i))² = N · MSE; a simple mathematical
manipulation of MSE.
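A minimal sketch computing these metrics with NumPy (the target and prediction arrays are illustrative):

```python
# Minimal sketch: regression accuracy metrics.
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

errors = y_true - y_pred
sse  = np.sum(errors ** 2)          # sum-of-error-squares
mse  = sse / len(y_true)            # mean square error
rmse = np.sqrt(mse)                 # root mean square error, same units as y

print(sse, mse, rmse)
```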
Assessing Classification
Accuracy
Misclassification Error
• The basic metric for assessing the accuracy of classification
algorithms is the number of samples misclassified by the
model.
• For binary classification problems, the error is the fraction
of data points whose predicted class label differs from the
true class label.
• For 0% error, the predicted label must equal the true label
for all data points.
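A minimal sketch of this error rate (the label vectors are illustrative):

```python
# Minimal sketch: misclassification error for a binary problem.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

error_rate = np.mean(y_pred != y_true)   # fraction of misclassified samples
print(error_rate)                        # 0.0 means every data point is correct
```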


Confusion Matrix
• Decisions based on the misclassification error rate alone
lead to poor performance when the data are unbalanced.
• For example, in the case of financial fraud detection, the
proportion of fraud cases is extremely small.
• In such classification problems, the interest is mainly in
the minority cases.
• The class that the user is interested in is commonly
called the positive class, and the rest the negative class.
• A single prediction on the test set has four possible
outcomes.
1. The true positive (TP) and true negative (TN) are
correct classifications.
2. A false positive (FP) occurs when the outcome is
incorrectly predicted as positive when it is actually
negative.
3. A false negative (FN) occurs when the outcome is
incorrectly predicted as negative when it is actually
positive.

Confusion matrix:
                             Hypothesized class (prediction)
                             Classified +ve    Classified -ve
Actual class    Actual +ve        TP                FN
(observation)   Actual -ve        FP                TN
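A minimal sketch of computing these counts with scikit-learn (an assumed dependency; the labels are illustrative, with 1 as the positive minority class):

```python
# Minimal sketch: confusion-matrix counts for a binary problem.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# With labels=[0, 1], ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(tp, fn, fp, tn)
```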
Misclassification Rate
• Misclassification rate = (FP + FN) / (TP + TN + FP + FN).
• FP = FN = 0 is desired.
True Positive Rate (tp rate)
• tp rate (sensitivity) = TP / (TP + FN).
• Determines the sensitivity in the detection of abnormal
events.
• A classification method with high sensitivity would rarely
miss an abnormal event.
True Negative Rate
• tn rate (specificity) = TN / (TN + FP).
• Determines the specificity in the detection of the abnormal
event.
• High specificity results in a low rate of false alarms caused
by classifying a normal event as abnormal.
• Simultaneously high sensitivity and high specificity are
desired.
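A minimal sketch of these rates computed from confusion-matrix counts (the counts are illustrative values):

```python
# Minimal sketch: misclassification rate, sensitivity (tp rate) and
# specificity (tn rate) from confusion-matrix counts.
tp, fn, fp, tn = 40, 10, 5, 945

misclassification_rate = (fp + fn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # true positive rate: how rarely abnormal events are missed
specificity = tn / (tn + fp)   # true negative rate: how rarely false alarms are raised

print(misclassification_rate, sensitivity, specificity)
```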
ROC Curves
• When a classifier algorithm is applied to a test set, it
yields a confusion matrix, which corresponds to one point
in ROC space.
• An ROC curve is created by thresholding the classifier
with respect to its complexity.
• Each level of complexity in the space of the hypothesis
class produces a different point in the ROC space.
• Two learning schemes are compared by analyzing their
ROC curves in the same ROC space.
A sample ROC curve
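In practice, a common way to trace out an ROC curve is to sweep the decision threshold on a scoring classifier; the following is a minimal sketch of that approach with scikit-learn (an assumed dependency; the data and model are illustrative), rather than the complexity sweep described above:

```python
# Minimal sketch: ROC curve by sweeping the decision threshold of a
# scoring classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

X_tr, X_te, y_tr, y_te = X[:200], X[200:], y[:200], y[200:]
scores = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, scores)   # each threshold gives one ROC point
print("area under the ROC curve:", auc(fpr, tpr))
```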
An Overview of the Design
Cycle

An overview of the design cycle
