
4 DL

Machine learning is the study of algorithms that improve their performance on tasks through experience, utilizing statistics for inference and computer science for efficient algorithms. It has applications in various fields such as speech recognition, natural language processing, and medical analysis, with methods including supervised, unsupervised, and reinforcement learning. The document also discusses classification, regression, and clustering strategies, emphasizing the importance of model generalization and the bias-variance trade-off in building effective classifiers.

Uploaded by

kushalgangwar98
Copyright
© All Rights Reserved

Machine Learning

What is Machine Learning?


• Machine Learning
– Study of algorithms that
– improve their performance
– at some task
– with experience
• Optimize a performance criterion using example data
or experience.
• Role of Statistics: Inference from a sample
• Role of Computer science: Efficient algorithms to
– Solve the optimization problem
– Represent and evaluate the model for inference
Growth of Machine Learning
• Machine learning is the preferred approach to
– Speech recognition, Natural language processing
– Computer vision
– Medical outcomes analysis
– Robot control
– Computational biology
• This trend is accelerating
– Improved machine learning algorithms
– Improved data capture, networking, faster computers
– Software too complex to write by hand
– New sensors / IO devices
– Demand for self-customization to user, environment
– It turns out to be difficult to extract knowledge from human experts →
failure of expert systems in the 1980s.
Alpaydin & Ch. Eick: ML Topic1
Applications
• Association Analysis
• Supervised Learning
– Classification
– Regression/Prediction
• Unsupervised Learning
• Reinforcement Learning

Learning Associations
• Basket analysis:
P (Y | X ) probability that somebody who buys X also
buys Y where X and Y are products/services.

Example: P ( chips | beer ) = 0.7


Market-Basket transactions
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
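
As a rough illustration, P(Y | X) can be estimated from the transaction table above by counting baskets. The sketch below is only one way to do it; the function name `conditional_probability` is an assumption made for this example and does not come from the slides.

```python
# Market-basket transactions copied from the table above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def conditional_probability(y, x, baskets):
    """Estimate P(y | x): the fraction of baskets containing x that also contain y."""
    with_x = [basket for basket in baskets if x in basket]
    if not with_x:
        raise ValueError(f"no basket contains {x!r}")
    return sum(1 for basket in with_x if y in basket) / len(with_x)


# Diaper appears in 4 baskets; 3 of those also contain Beer.
p_beer_given_diaper = conditional_probability("Beer", "Diaper", transactions)
print(p_beer_given_diaper)  # 0.75
```

With only five transactions the estimate is very coarse; it is meant to show the counting, not to be a reliable probability.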
Classification
• Example: Credit scoring
• Differentiating between low-risk and high-risk customers from their income and savings
Discriminant: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
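
The discriminant above can be written directly as a small function. This is a minimal sketch: the threshold values below are hypothetical placeholders chosen for illustration, whereas in practice θ1 and θ2 would be learned from labeled customer data.

```python
def credit_risk(income, savings, theta1, theta2):
    """Apply the discriminant: low-risk only if both thresholds are exceeded."""
    if income > theta1 and savings > theta2:
        return "low-risk"
    return "high-risk"


# Hypothetical thresholds (assumptions for this example, not from the slides).
THETA1 = 30_000
THETA2 = 5_000

print(credit_risk(45_000, 8_000, THETA1, THETA2))  # both thresholds exceeded
print(credit_risk(45_000, 2_000, THETA1, THETA2))  # savings too low
```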

Classification: Applications
• Aka Pattern recognition
• Face recognition: Pose, lighting, occlusion (glasses,
beard), make-up, hair style
• Character recognition: Different handwriting styles.
• Speech recognition: Temporal dependency.
– Use of a dictionary or the syntax of the language.
– Sensor fusion: Combine multiple modalities; e.g., visual (lip
image) and acoustic for speech
• Medical diagnosis: From symptoms to illnesses.
Face Recognition

Training examples of a person

Test images

AT&T Laboratories, Cambridge UK


http://www.uk.research.att.com/facedatabase.html
Prediction: Regression

• Example: Price of a used car
• x: car attributes
y: price
y = g(x | θ)
g( ): model, θ: parameters
• Linear model: y = wx + w0
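
To make the linear model y = wx + w0 concrete, here is a minimal sketch that estimates w and w0 by ordinary least squares. The slides do not say which fitting method is intended, so least squares is an assumption, and the car ages and prices below are made-up numbers used only for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = w * x + w0; returns (w, w0)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx
    w0 = mean_y - w * mean_x
    return w, w0


# Hypothetical data: car age in years vs. price in thousands.
ages = [1, 2, 3, 4, 5]
prices = [18.0, 16.1, 13.9, 12.0, 10.0]

w, w0 = fit_line(ages, prices)
predicted = w * 6 + w0  # predicted price of a 6-year-old car
print(w, w0, predicted)
```

Here the single attribute x is the car's age; a real used-car model would use several attributes, which turns w into a vector.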

Regression Applications

• Navigating a car: Angle of the steering wheel.

Different Data Analysis Tasks

• Classification
– Assign a category (i.e., a class) for a new instance
• Clustering
– Form clusters (i.e., groups) with a set of instances
• Pattern detection
– Identify regularities (i.e., patterns) in temporal or spatial data
• Simulation
– Define mathematical formulas that can generate data similar to observations collected
Supervised Learning: Uses
Example: decision trees, tools that create rules

• Prediction of future cases: Use the rule to predict the output for future inputs
• Knowledge extraction: The rule is easy to understand
• Compression: The rule is simpler than the data it
explains
• Outlier detection: Exceptions that are not covered by
the rule, e.g., fraud
Unsupervised Learning
• Unsupervised learning is a type of machine learning
algorithm used to draw inferences from datasets
consisting of input data without labeled responses.
• Clustering: Grouping similar instances
• Other applications:
– Predicting the weather
– Calculating the height of a person in the school.
– Summarization.

Reinforcement Learning
• Topics:
– Policies: what actions should an agent take in a particular
situation
– Utility estimation: how good is a state (→used by policy)
• No supervised output but delayed reward
• Credit assignment problem (what was responsible for
the outcome)
• Applications:
– Game playing
– Robot in a maze
– Multiple agents, partial observability, ...
Clustering Strategies
• K-means
– Iteratively re-assign points to the nearest cluster
center
• Agglomerative clustering
– Start with each point as its own cluster and iteratively
merge the closest clusters
• Mean-shift clustering
– Estimate modes of pdf
• Spectral clustering
– Split the nodes in a graph based on assigned links with
similarity weights

As we go down this chart, the clustering strategies have more tendency to transitively group points even if they are not nearby in feature space
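
The K-means strategy listed above (iteratively re-assign points to the nearest cluster center) can be sketched as follows. This is a bare-bones illustration, not a production implementation: the initial centers are passed in by hand, the sample points are invented, and the stopping rule (centers no longer change) is an assumption.

```python
import math


def kmeans(points, initial_centers, iterations=10):
    """Plain k-means: alternate nearest-center assignment and mean updates."""
    centers = [tuple(c) for c in initial_centers]
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its assigned points.
        new_centers = []
        for old, members in zip(centers, clusters):
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(old)  # an empty cluster keeps its old center
        if new_centers == centers:
            break  # assignments are stable
        centers = new_centers
    return centers, clusters


# Hypothetical 2D points forming two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
final_centers, final_clusters = kmeans(points, [(0, 0), (10, 10)])
print(final_centers)
```

The result depends on the initial centers; practical implementations usually try several random initializations.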
The machine learning framework
• Apply a prediction function to a feature representation of
the image to get the desired output:

f([image of an apple]) = “apple”
f([image of a tomato]) = “tomato”
f([image of a cow]) = “cow”
Slide credit: L. Lazebnik
The machine learning framework
y = f(x)
where y is the output, f is the prediction function, and x is the image feature.

• Training: given a training set of labeled examples {(x1,y1), …, (xN,yN)}, estimate the prediction function f by minimizing
the prediction error on the training set
• Testing: apply f to a never-before-seen test example x and
output the predicted value y = f(x)
Slide credit: L. Lazebnik
Steps
• Training: Training images → Image features → Training (with training labels) → Learned model
• Testing: Test image → Image features → Learned model → Prediction
Slide credit: D. Hoiem and L. Lazebnik
Features
• Raw pixels

• Histograms

• GIST descriptors

• …
Slide credit: L. Lazebnik
Classifiers: Nearest neighbor

[Figure: training examples from class 1, training examples from class 2, and a test example]

f(x) = label of the training example nearest to x

• All we need is a distance function for our inputs
• No training required!
Slide credit: L. Lazebnik
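
A minimal sketch of the nearest-neighbor rule, assuming Euclidean distance as the distance function (the slide leaves the choice open). The training examples and class names below are invented for illustration.

```python
import math


def nearest_neighbor_label(x, training_set, distance=math.dist):
    """Return the label of the training example closest to x."""
    _, label = min(training_set, key=lambda example: distance(x, example[0]))
    return label


# Hypothetical labeled examples: (feature vector, label).
training_set = [
    ((1.0, 1.0), "class 1"),
    ((1.5, 2.0), "class 1"),
    ((6.0, 5.0), "class 2"),
    ((7.0, 6.5), "class 2"),
]

print(nearest_neighbor_label((2.0, 1.5), training_set))
print(nearest_neighbor_label((6.5, 6.0), training_set))
```

There is no training step: the "model" is just the stored examples, so all the work happens at prediction time.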
Classifiers: Linear

• Find a linear function to separate the classes:

f(x) = sgn(w · x + b)

Slide credit: L. Lazebnik
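
The linear decision rule f(x) = sgn(w · x + b) takes only a few lines of code. In this sketch the weights and bias are hand-picked (hypothetical) rather than learned, and a score of exactly zero is mapped to +1, which is a convention chosen for the example.

```python
def linear_classify(x, w, b):
    """Return +1 or -1 according to the sign of w . x + b."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical parameters: the boundary is the line x1 = x2.
w = (1.0, -1.0)
b = 0.0

print(linear_classify((3.0, 1.0), w, b))  # score = 2.0
print(linear_classify((1.0, 3.0), w, b))  # score = -2.0
```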


Many classifiers to choose from
Which is the best one?
• SVM
• Neural networks
• Naïve Bayes
• Bayesian network
• Logistic regression
• Randomized Forests
• Boosted Decision Trees
• K-nearest neighbor
• RBMs
• Etc.

Slide credit: D. Hoiem


Recognition task and supervision
• Images in the training set must be annotated with the
“correct answer” that the model is expected to produce

Contains a motorbike

Slide credit: L. Lazebnik


Generalization

[Figure: training set (labels known) and test set (labels unknown)]

• How well does a learned model generalize from the data it was trained on to a new test set?
Slide credit: L. Lazebnik
Generalization
• Components of generalization error
– Bias: how much does the average model over all training sets differ
from the true model?
• Error due to inaccurate assumptions/simplifications made by
the model
– Variance: how much models estimated from different training
sets differ from each other
• Underfitting: model is too “simple” to represent all the
relevant class characteristics
– High bias and low variance
– High training error and high test error
• Overfitting: model is too “complex” and fits irrelevant
characteristics (noise) in the data
– Low bias and high variance
– Low training error and high test error
Slide credit: L. Lazebnik
Bias-Variance Trade-off

• Models with too few parameters are inaccurate because of a large bias (not enough flexibility).

• Models with too many parameters are inaccurate because of a large variance (too much sensitivity to the sample).

Slide credit: D. Hoiem


Bias-variance tradeoff

[Figure: training error and test error plotted against model complexity. Low complexity: underfitting, high bias, low variance. High complexity: overfitting, low bias, high variance.]

Slide credit: D. Hoiem


Remember…
• No classifier is inherently
better than any other: you
need to make assumptions to
generalize

• Three kinds of error
– Inherent: unavoidable
– Bias: due to over-simplifications
– Variance: due to inability to
perfectly estimate parameters
from limited data

Slide credit: D. Hoiem
How to reduce variance?

• Choose a simpler classifier

• Regularize the parameters

• Get more training data

Slide credit: D. Hoiem


Very brief tour of some classifiers
• K-nearest neighbor
• SVM
• Boosted Decision Trees
• Neural networks
• Naïve Bayes
• Bayesian network
• Logistic regression
• Randomized Forests
• RBMs
• Etc.
Classification
• Assign input vector to one of two or more
classes
• Any decision rule divides input space into
decision regions separated by decision
boundaries

Slide credit: L. Lazebnik


Nearest Neighbor Classifier

• Assign label of nearest training data point to each test data point

[Figure, from Duda et al.: Voronoi partitioning of feature space for two-category 2D and 3D data. Source: D. Lowe]
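K-nearest neighbor

[Figure: 2D scatter plot (axes x1, x2) of training points from classes x and o, with two test points marked +]

1-nearest neighbor

[Figure: the same scatter plot, illustrating classification with k = 1]

3-nearest neighbor

[Figure: the same scatter plot, illustrating classification with k = 3]

5-nearest neighbor

[Figure: the same scatter plot, illustrating classification with k = 5]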
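
The effect of k can be sketched with a small majority-vote implementation. The points below are invented so that a single stray "o" sits among the "x" examples; they are not the points from the slide figures, and Euclidean distance is an assumption.

```python
import math
from collections import Counter


def knn_label(x, training_set, k):
    """Label x by majority vote among its k nearest training examples."""
    neighbors = sorted(training_set, key=lambda example: math.dist(x, example[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical training data: one "o" outlier inside the "x" region.
training_set = [
    ((0.0, 0.0), "o"),
    ((1.0, 0.0), "x"),
    ((0.0, 1.0), "x"),
    ((1.0, 1.0), "x"),
    ((5.0, 5.0), "o"),
    ((6.0, 5.0), "o"),
]
query = (0.2, 0.2)

for k in (1, 3, 5):
    print(k, knn_label(query, training_set, k))
```

With k = 1 the query takes the label of the nearby outlier; with k = 3 or k = 5 the surrounding "x" points outvote it. Using an odd k with two classes avoids tied votes.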
Classifiers: Logistic Regression

Maximize likelihood of label given data, assuming a log-linear model

[Figure: male and female examples plotted by pitch of voice (x1) and height (x2)]

$$\log \frac{P(x_1, x_2 \mid y = 1)}{P(x_1, x_2 \mid y = -1)} = w^T x$$

$$P(y = 1 \mid x_1, x_2) = \frac{1}{1 + \exp(-w^T x)}$$
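
The posterior formula above is straightforward to evaluate once a weight vector is available. This sketch only computes P(y = 1 | x) for given weights; it does not perform the likelihood maximization, and the weight values are hypothetical.

```python
import math


def posterior_positive(x, w):
    """P(y = 1 | x) = 1 / (1 + exp(-w^T x)) for weight vector w."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights for two features (x1, x2).
w = (1.0, 2.0)

print(posterior_positive((0.0, 0.0), w))  # w^T x = 0 gives probability 0.5
print(posterior_positive((0.0, 1.0), w))  # w^T x = 2
```

Points with w^T x = 0 lie on the decision boundary, where both classes are equally likely under the model.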
Classifiers: Linear SVM

[Figure: 2D scatter plot (axes x1, x2) of training points from classes x and o]

• Find a linear function to separate the classes:
f(x) = sgn(w · x + b)
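Nonlinear SVMs
• Datasets that are linearly separable work out great:

[Figure: points on a one-dimensional x axis]

• But what if the dataset is just too hard?

[Figure: points on a one-dimensional x axis]

• We can map it to a higher-dimensional space:

[Figure: the same points plotted with x^2 against x]

Slide credit: Andrew Moore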


Nonlinear SVMs
• General idea: the original input space can
always be mapped to some higher-dimensional
feature space where the training set is
separable:

Φ: x → φ(x)

Slide credit: Andrew Moore
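
As a small illustration of the mapping idea, the sketch below takes one-dimensional points that no single threshold can separate and maps each x to φ(x) = (x, x^2), where a linear rule works. The data, the particular map, and the hand-picked separator are all assumptions made for this example; an SVM would learn the separator instead.

```python
def phi(x):
    """Map a 1D input to the 2D feature space (x, x^2)."""
    return (x, x * x)


def classify(x, w=(0.0, 1.0), b=-2.5):
    """Linear rule sgn(w . phi(x) + b) applied in the mapped space."""
    z = phi(x)
    score = w[0] * z[0] + w[1] * z[1] + b
    return 1 if score >= 0 else -1


# Hypothetical 1D data: the negative class sits between the positives.
data = [(-3, 1), (-2, 1), (-1, -1), (0, -1), (1, -1), (2, 1), (3, 1)]

for x, label in data:
    print(x, label, classify(x))
```

In the original space the rule corresponds to x^2 > 2.5, which is not a single threshold on x but is linear in the mapped features.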


Classifiers: Decision Trees

[Figure: 2D scatter plot (axes x1, x2) of training points from classes x and o]
Classification Process

1. Classification tasks
2. Building a classifier
3. Evaluating a classifier

Classifying Mushrooms

◆ What mushrooms are edible, i.e., not poisonous?
◆ Book lists many kinds of mushrooms identified as either edible, poisonous, or of unknown edibility
◆ Given a new kind of mushroom not listed in the book, is it edible?

https://archive.ics.uci.edu/ml/datasets/Mushroom
Classifying Iris Plants

◆ Iris flowers have different sepal and petal shapes:
◆ Iris Setosa
◆ Iris Versicolour
◆ Iris Virginica

◆ Suppose you are shown lots of examples of each type. Given a new iris flower, what type is it?
https://en.wikipedia.org/wiki/Iris_setosa
https://en.wikipedia.org/wiki/Iris_versicolor
1. Classification Tasks

Classification Tasks

◆ Given:
◆ A set of classes
◆ Instances (examples) of each class

◆ Generate: A method (aka model) that, when given a new instance, will determine its class

http://www.business-insight.com/html/intelligence/bi_overfitting.html
Classification Tasks

◆ Given:
◆ A set of classes
◆ Instances of each class
◆ Generate: A method that, when given a new instance, will determine its class
◆ Instances are described as a set of features or attributes and their values
◆ The class that the instance belongs to is also called its “label”
◆ Input is a set of “labeled instances”
Classification Tasks

◆ Given: A set of labeled instances
◆ Generate: A method (aka model) that, when given a new instance, will hypothesize its class
Classifying a New Instance

Classifying New Instances

Training and Test Sets
Training instances
(training set)

Test instances
(test set)

Contamination
Training instances
(training set)

Test instances
(test set)

When training and test sets overlap – this should NEVER happen
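
One way to guard against contamination is to build the two sets from a single shuffled list and then check that they share no instances. The sketch below is illustrative only: the function names, the 25% test fraction, and the fixed seed are assumptions, not part of the slides.

```python
import random


def train_test_split(instances, test_fraction=0.25, seed=0):
    """Shuffle once, then cut the list so the two parts cannot overlap by position."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def is_contaminated(train, test):
    """True if any test instance also appears in the training set."""
    return any(instance in train for instance in test)


# Hypothetical instances, represented here by integer IDs.
train, test = train_test_split(range(20))
print(len(train), len(test), is_contaminated(train, test))
```

If the raw data contains duplicate instances, a positional split can still place identical copies on both sides, so the explicit check remains useful.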

About Classification Tasks

◆ Classes must be disjoint, i.e., each instance belongs to only one class
◆ Classification tasks are “binary” if there are only two classes
◆ The classification method will rarely be perfect; it will make mistakes in its classification of new instances
2. Building a Classifier

What is a Modeler?
◆ A mathematical/algorithmic approach to generalize from instances so it can make predictions about instances that it has not seen before
◆ Its output is called a model
Types of Modelers/Models

◆ Logistic regression

◆ Naïve Bayes classifiers

◆ Support vector machines (SVMs)

◆ Decision trees

◆ Random forests

◆ Kernel methods

◆ Genetic algorithms

◆ Neural networks
http://tjo-en.hatenablog.com/entry/2014/01/06/234155
What Modeler to Choose?

◆ Logistic regression
◆ Naïve Bayes classifiers
◆ Support vector machines (SVMs)
◆ Decision trees
◆ Random forests
◆ Kernel methods
◆ Genetic algorithms (GAs)
◆ Neural networks: perceptrons

◆ Data scientists try different modelers, with different parameters, and check the accuracy to figure out which one works best for the data at hand
Ensembles
◆ An ensemble method uses several
algorithms that do the same task,
and combines their results
◆ “Ensemble learning”

◆ A combination function joins the results
◆ Majority vote: each algorithm
gets a vote
◆ Weighted voting: each
algorithm’s vote has a weight
◆ Other complex combination
functions
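
The two simple combination functions named above can be sketched as follows. The predictions and weights are invented for illustration; how weights are chosen (for example, from each algorithm's validation accuracy) is left open by the slides.

```python
from collections import Counter, defaultdict


def majority_vote(predictions):
    """Each algorithm gets one vote; return the most common label."""
    return Counter(predictions).most_common(1)[0][0]


def weighted_vote(predictions, weights):
    """Each algorithm's vote counts with its weight; return the heaviest label."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Hypothetical outputs of three classifiers for one mushroom.
predictions = ["edible", "poisonous", "edible"]
weights = [0.2, 0.7, 0.3]

print(majority_vote(predictions))
print(weighted_vote(predictions, weights))
```

The example shows that the two rules can disagree: two low-weight votes win the majority vote but lose the weighted vote to one heavily weighted classifier.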

http://magizbox.com/index.php/machine-learning/ds-model-building/ensemble/
3. Evaluating a Classifier

Classification Accuracy

◆ Accuracy: percentage of correct classifications

$$\text{Accuracy} = \frac{\text{Total test instances classified correctly}}{\text{Total number of test instances}}$$
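
The accuracy ratio can be computed directly from paired lists of true and predicted labels. This is a minimal sketch with made-up labels; the function returns a fraction between 0 and 1, which can be multiplied by 100 to get a percentage.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of test instances whose predicted label matches the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one test instance is required")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical test-set labels.
true_labels = ["edible", "poisonous", "edible", "edible"]
predicted_labels = ["edible", "poisonous", "poisonous", "edible"]

print(accuracy(true_labels, predicted_labels))  # 3 of 4 correct
```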

Evaluating a Classifier:
n-fold Cross Validation
◆ Suppose m labeled instances
◆ Divide into n subsets (“folds”) of equal size
◆ Run classifier n times, with each of the subsets as the test set
◆ The rest (n-1) for training
◆ Each run gives an accuracy result
Translated from image by Joan.domenech91 (Own work) [CC BY-SA 3.0
(http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
(https://commons.wikimedia.org/wiki/File:K-fold_cross_validation.jpg)
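
The n-fold procedure described above can be sketched as a generator of (training set, test set) pairs plus a loop that averages the per-fold accuracy. The names `train_fn` and `accuracy_fn` are hypothetical callables standing in for whatever modeler and metric are being evaluated. Folds are formed by striding through the list, so they are exactly equal in size only when m is divisible by n.

```python
def cross_validation_folds(instances, n):
    """Yield (train, test) pairs; each of the n folds is the test set exactly once."""
    instances = list(instances)
    folds = [instances[i::n] for i in range(n)]
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


def cross_validate(instances, n, train_fn, accuracy_fn):
    """Average accuracy over n runs, training on n-1 folds and testing on the rest."""
    scores = []
    for train, test in cross_validation_folds(instances, n):
        model = train_fn(train)
        scores.append(accuracy_fn(model, test))
    return sum(scores) / len(scores)


# Ten hypothetical instances split into five folds.
pairs = list(cross_validation_folds(range(10), 5))
for train, test in pairs:
    print(test, train)
```

Averaging is one common way to combine the n accuracy results; the slides only state that each run produces one. In practice the instances are usually shuffled before the folds are formed.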
Evaluating a Classifier:
Confusion Matrix

                 Classified positive   Classified negative
Actual positive  True positive (TP)    False negative (FN)
Actual negative  False positive (FP)   True negative (TN)

TP: number of positive examples classified correctly
FN: number of positive examples classified incorrectly
FP: number of negative examples classified incorrectly
TN: number of negative examples classified correctly
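
The four counts can be tallied in a single pass over the test set. In this sketch the labels are encoded as 1 (positive) and 0 (negative), which is an assumption for the example; any label values work as long as the positive one is passed in.

```python
def confusion_counts(actual, predicted, positive):
    """Return (TP, FN, FP, TN) for the given positive label."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive:
            if p == positive:
                tp += 1  # positive example classified correctly
            else:
                fn += 1  # positive example classified incorrectly
        else:
            if p == positive:
                fp += 1  # negative example classified incorrectly
            else:
                tn += 1  # negative example classified correctly
    return tp, fn, fp, tn


# Hypothetical test-set labels.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

print(confusion_counts(actual, predicted, positive=1))  # (3, 1, 1, 3)
```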
Evaluating a Classifier:
Precision and Recall

TP: number of positive examples classified correctly
FN: number of positive examples classified incorrectly
FP: number of negative examples classified incorrectly
TN: number of negative examples classified correctly

$$\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN}$$

Note that the focus is on the positive class
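
A short sketch of both formulas, using made-up counts. Returning 0.0 when a denominator is zero is a convention chosen for this example; the slides do not address that case.

```python
def precision(tp, fp):
    """TP / (TP + FP): of the examples classified positive, how many are positive."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """TP / (TP + FN): of the actual positives, how many were found."""
    return tp / (tp + fn) if tp + fn else 0.0


# Hypothetical counts from a confusion matrix.
tp, fp, fn = 8, 2, 8

print(precision(tp, fp))  # 0.8
print(recall(tp, fn))     # 0.5
```

Neither formula uses TN, which is why the focus is on the positive class.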


Evaluating a Classifier:
Other Metrics

◆ There are many other accuracy metrics
◆ F1-score
◆ Receiver Operating Characteristic (ROC) curve
◆ Area Under the Curve (AUC)
Evaluating a Classifier:
Other Metrics

◆ Other accuracy metrics
◆ F1-score
◆ Receiver Operating Characteristic (ROC) curve
◆ Area Under the Curve (AUC)
◆ Other concerns
◆ Explainability of classifier results
◆ Cost of examples
◆ Cost of feature values
◆ Labeling
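
The slides name the F1-score without defining it. Under its standard definition it is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below uses that definition, with made-up input values and a zero result when both inputs are zero (a convention assumed here).

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical precision and recall values.
print(f1_score(0.8, 0.5))
print(f1_score(1.0, 1.0))
```

Because it is a harmonic mean, F1 is pulled toward the smaller of the two values, so a classifier cannot score well by maximizing only precision or only recall.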

Overfitting
◆ A model overfits the training data when it is very accurate
with that data, and may not do so well with new test data

[Figure: Model 1 and Model 2 shown on the training data and on the test data]
Induction

◆ Induction requires inferring general rules about examples seen in the past
◆ Contrast with deduction: inferring things that are
a logical consequence of what we have seen in
the past
◆ Classifiers use induction: they generate general
rules about the target classes
◆ The rules are used to make predictions about new data
◆ These predictions can be wrong

When Facing a Classification
Task
◆ What features to choose
◆ Try defining different features
◆ For some problems, hundreds and maybe thousands of features may be possible
◆ Sometimes the features are not directly observable (i.e., there are “latent” variables)
◆ What classes to choose
◆ Edible / poisonous?
◆ Edible / poisonous / unknown?
◆ How many labeled examples
◆ May require a lot of work
◆ What modeler to choose
◆ Better to try different ones
What to remember about classifiers

• No free lunch: machine learning algorithms are tools, not dogmas

• Try simple classifiers first

• Better to have smart features and simple classifiers than simple features and smart classifiers

• Use increasingly powerful classifiers with more training data (bias-variance tradeoff)

Slide credit: D. Hoiem


Resources: Datasets
• UCI Repository:
http://www.ics.uci.edu/~mlearn/MLRepository.html

• UCI KDD Archive:
http://kdd.ics.uci.edu/summary.data.application.html

• Statlib: http://lib.stat.cmu.edu/
• Delve: http://www.cs.utoronto.ca/~delve/
Resources: Journals
• Journal of Machine Learning Research
www.jmlr.org
• Machine Learning
• IEEE Transactions on Neural Networks
• IEEE Transactions on Pattern Analysis and
Machine Intelligence
• Annals of Statistics
• Journal of the American Statistical Association
• ...
Resources: Conferences

• International Conference on Machine Learning (ICML)
• European Conference on Machine Learning (ECML)
• Neural Information Processing Systems (NIPS)
• Computational Learning
• International Joint Conference on Artificial Intelligence (IJCAI)
• ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD)
• IEEE Int. Conf. on Data Mining (ICDM)
Some Machine Learning References

• General
– Tom Mitchell, Machine Learning, McGraw Hill, 1997
– Christopher Bishop, Neural Networks for Pattern
Recognition, Oxford University Press, 1995
• Adaboost
– Friedman, Hastie, and Tibshirani, “Additive logistic
regression: a statistical view of boosting”, Annals of
Statistics, 2000
• SVMs
– http://www.support-vector.net/icml-tutorial.pdf

Slide credit: D. Hoiem
