
1

Chapter 19: Machine Learning
Eng. Abdulrazak A. Dirie
Learning from Examples: Machine Learning 2

Machine Learning
 Learning: Improve performance after making observations about the world. That is, learn what works
and what doesn't to get closer to optimal decisions.
 How do we learn a model that makes better decisions from data/experience?
 Supervised Learning: Learn a function (model) that maps inputs to outputs from a
training set. Examples:
 Use a naïve Bayes classifier to distinguish between spam/no spam
 Learn a playout policy to simulate games (current board -> good move)
 Unsupervised Learning: Organize data (e.g., clustering, embedding)
 Deep Learning: Learn complex functions with multi-layer neural networks (covered later in this chapter)
 Reinforcement Learning: Learn from rewards/punishments (e.g., winning a game)
obtained via interaction with the environment over time.
Supervised Learning 3
 Examples
 We assume there exists a target function f that produces iid (independent and identically distributed) examples,
possibly with noise and errors.
 Examples are observed input-output pairs (x, y) with y = f(x), where x is a vector called the feature vector.

 Learning problem
 Given a hypothesis space H of representable models,
 find a hypothesis h ∈ H such that h(x) ≈ y for the examples.
 That is, we want to approximate f by h.

 Supervised learning includes
 Classification (outputs = class labels). E.g., x is an email and y is spam / ham.
 Regression (outputs = real numbers). E.g., x is a house and y is its selling price.
Consistency vs. Simplicity 4

 Example: Univariate curve fitting (regression, function approximation)

[Figure: training examples (points) and learned models (lines) of increasing complexity; a straight line is very simple but not very consistent with the data, while a high-degree curve fits the examples closely.]

 Consistency: the hypothesis agrees with the observed examples, i.e., h(x) ≈ y.
 Simplicity: small number of model parameters.
Measuring Consistency using Loss 5

 Goal of learning: Find a hypothesis h that makes predictions ŷ = h(x) that are consistent with the examples (x, y).
That is, h(x) ≈ y.

 Measure mistakes: loss function l(y, ŷ)
 Absolute-value loss: l₁(y, ŷ) = |y − ŷ|   (for regression)
 Squared-error loss: l₂(y, ŷ) = (y − ŷ)²   (for regression)
 0/1 loss: l₀/₁(y, ŷ) = 0 if y = ŷ, otherwise 1   (for classification)
 Log loss, cross-entropy loss, and many others…

 Empirical loss: average loss over the N examples in the dataset:
L(h) = (1/N) Σᵢ l(yᵢ, h(xᵢ))   (a small code sketch of these losses follows this slide)
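The loss definitions above map directly onto code. Below is a minimal NumPy sketch (my own illustration, not part of the original slides); all function and variable names are assumptions.

```python
# Minimal sketch of the loss functions above (illustrative; names are my own).
import numpy as np

def absolute_loss(y_true, y_pred):
    return np.abs(y_true - y_pred)              # l1(y, y_hat) = |y - y_hat|

def squared_loss(y_true, y_pred):
    return (y_true - y_pred) ** 2               # l2(y, y_hat) = (y - y_hat)^2

def zero_one_loss(y_true, y_pred):
    return (y_true != y_pred).astype(float)     # 0 if correct, 1 if wrong

def empirical_loss(loss_fn, y_true, y_pred):
    # Average loss over the N examples in the dataset.
    return float(np.mean(loss_fn(y_true, y_pred)))

y_obs = np.array([3.0, 5.0, 7.0])               # observed outputs y_i
y_hat = np.array([2.5, 5.5, 8.0])               # predictions h(x_i)
print(empirical_loss(squared_loss, y_obs, y_hat))   # 0.5
```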
Learning a Consistent Model by Minimizing the Loss 6

 Empirical loss: L(h) = (1/N) Σᵢ l(yᵢ, h(xᵢ))

 Find the best hypothesis h* that minimizes the loss: h* = argmin_{h ∈ H} L(h)

 Reasons why h* can still differ from the true target function f:
 a) Realizability: f may not be in the hypothesis space H.
 b) f is nondeterministic or examples are noisy.
 c) It is computationally intractable to search all of H,
so we use a non-optimal heuristic.
The Bayes Classifier 7

 For 0/1 loss, the empirical loss is minimized by the model that predicts for each x the most likely class using MAP
(maximum a posteriori) estimates: ŷ = argmax_y P(y | x). This is called the Bayes classifier.

 Optimality: The Bayes classifier is optimal for 0/1 loss. It is the most consistent classifier possible, with the
lowest possible error, called the Bayes error rate. No better classifier is possible!

 Issue: The classifier requires P(y | x) to be learned from the examples.
 It needs the complete joint probability distribution, which in the general case requires a probability table with one entry
for each possible value of the feature vector x (a toy illustration follows this slide).
 This is impractical (unless a simple Bayes network exists), so most classifiers try to approximate the Bayes
classifier using a simpler model with fewer parameters.
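As a toy illustration (my own, not from the slides), the sketch below stores P(x, y) explicitly for a single binary feature and predicts argmax_y P(y | x). With many features the table needs one entry per possible feature vector, which is exactly what makes the exact Bayes classifier impractical. The probabilities are made up.

```python
# Toy Bayes classifier from an explicit joint probability table (made-up numbers).
from collections import defaultdict

# P(x, y) for one binary feature ("is the email in all caps?") and classes spam/ham.
joint = {
    (True,  "spam"): 0.25, (True,  "ham"): 0.05,
    (False, "spam"): 0.10, (False, "ham"): 0.60,
}

def bayes_classify(x):
    # P(y | x) is proportional to P(x, y), so pick the class with the largest joint probability.
    scores = defaultdict(float)
    for (feature_value, y), p in joint.items():
        if feature_value == x:
            scores[y] += p
    return max(scores, key=scores.get)

print(bayes_classify(True))    # -> 'spam'
print(bayes_classify(False))   # -> 'ham'
```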
Simplicity 8

 Ease of use
 Simpler hypotheses have fewer model parameters to estimate and store.

 Generalization: How well does the hypothesis perform on new data?
 We do not want the model to be too specific to the training examples (an issue called overfitting).
 Simpler models typically generalize better to new examples.

 How to achieve simplicity? (A regularized-loss sketch follows this list.)
 a) Model bias: Restrict H to simpler models (e.g., assumptions like independence, only consider linear
models).
 b) Feature selection: use fewer variables from the feature vector.
 c) Regularization: penalize the model for its complexity (e.g., number of parameters) by minimizing
the loss plus a penalty term: L(h) + λ · Complexity(h). Too little penalty risks overfitting.
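A minimal sketch (my own illustration) of option (c): a squared-error loss plus an L2 penalty on the weights of a linear model; lambda_ controls how strongly complexity is penalized and is an assumption.

```python
# Regularized empirical loss: data loss + penalty term (illustrative sketch).
import numpy as np

def regularized_loss(w, X, y, lambda_=0.1):
    data_loss = np.mean((y - X @ w) ** 2)    # consistency with the data
    penalty = lambda_ * np.sum(w ** 2)       # penalty term for model complexity
    return data_loss + penalty
```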
Model Selection: Bias vs. Variance 9

[Figure: learned models h (lines) fitted to two different samples (points) drawn from the same function, arranged from simpler models (left) to models that are more consistent with the data (right).]

 Bias: error caused by restrictions imposed by the model class. Simpler models have high bias; flexible models have low bias.
 Variance: difference in the learned model due to slightly different data. Simpler models have low variance; flexible models have high variance.
 This is a tradeoff (the bias-variance tradeoff).
The Dataset 10

 Rows: the examples (also called instances or observations).
 Columns: the feature vector (features, variables, attributes) plus the class label.

[Table: an example dataset with one row per example; the feature columns include attributes such as Alternate, Hungry, Patrons, Reservation, and WaitEstimate, followed by a class label column.]
Find a hypothesis (called “model”) to predict the class given the features.
Feature Engineering 11

 Add information sources as new variables to the model.
 Add derived features that help the classifier.
 Embedding: E.g., convert words to vectors, where similarity
between vectors reflects semantic similarity.

 Example for spam detection: In addition to the words themselves, consider features like the following (a small feature-extraction sketch follows this slide):


 Have you emailed the sender before?
 Have 1000+ other people just gotten the same email?
 Is the header information consistent?
 Is the email in ALL CAPS?
 Do inline URLs point where they say they point?
 Does the email address you by (your) name?

 Feature Selection: Which features should be used in the model is a
model selection problem (choose between models with different features).
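A small sketch (my own; the function and feature names are made up) of turning some of the spam questions above into derived features:

```python
# Derive extra spam features from a raw email (illustrative; names are my own).
def spam_features(email_text: str, sender: str, known_senders: set) -> dict:
    words = email_text.split()
    return {
        "num_words": len(words),
        "all_caps_ratio": sum(w.isupper() for w in words) / max(len(words), 1),
        "known_sender": sender in known_senders,   # "Have you emailed the sender before?"
        "contains_url": "http" in email_text.lower(),
    }

print(spam_features("WIN a FREE prize NOW", "stranger@example.com", {"friend@example.com"}))
```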
12

Training and Testing
Model Evaluation (Testing) 13

The model was trained on the training examples. We want to test how well the model
will perform on new examples (i.e., how well it generalizes to new data).

 Testing loss: Calculate the empirical loss for predictions on a testing data set that
is different from the data used for training.

 For classification we often use the accuracy measure, the proportion of correctly
classified test examples (a small sketch follows this slide):

accuracy = (1/N) Σᵢ 1(ŷᵢ = yᵢ)

where 1(·) is an indicator function returning 1 if ŷᵢ = yᵢ and otherwise 0.
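A one-function sketch (my own) of the accuracy measure defined above:

```python
# Accuracy = mean of the indicator 1(y_hat_i == y_i) over the test examples.
import numpy as np

def accuracy(y_true, y_pred):
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

print(accuracy(["spam", "ham", "ham"], ["spam", "spam", "ham"]))   # ~0.67
```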


Training a Model 14

 Models are "trained" (learned) on the training data. This involves estimating:

 1. Model parameters (the model): E.g., probabilities, weights, factors.

 2. Hyperparameters: Many learning algorithms have choices for learning rate,
regularization strength, maximal decision tree depth, selected features, ... The algorithm tries
to optimize the model parameters given user-specified hyperparameters.

 We need to tune the hyperparameters!

[Figure: the dataset is split into training data and test data.]
Hyperparameter Tuning/Model Selection 15

1. Hold a validation data set back from the training data.

2. Learn models using the training set with different hyperparameters. Often a
grid of possible hyperparameter combinations or some greedy search is used.

3. Evaluate the models using the validation data and choose the model with the
best accuracy. Selecting the right type of model, hyperparameters, and features is
called model selection.

4. Learn the final model with the chosen hyperparameters using all training data
(including the validation data). (A small tuning sketch follows this slide.)

 Notes:
 The validation set was not used for training, so we get generalization accuracy for the
different hyperparameter settings.
 If no model selection is necessary, then no validation set is used.

[Figure: the training data is further split into training data and validation data; the test data stays separate.]
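A minimal sketch (my own, with toy data) of steps 1-4 using scikit-learn. The model choice (k-nearest neighbors) and the candidate hyperparameter values are assumptions, and X, y play the role of the training + validation data; the test set is assumed to be held out elsewhere.

```python
# Validation-set hyperparameter tuning (illustrative sketch).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = np.random.rand(200, 4), np.random.randint(0, 2, 200)   # toy training+validation data

# 1. Hold a validation set back from the training data.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# 2. + 3. Train with different hyperparameters and pick the best on the validation set.
best_k, best_acc = None, -1.0
for k in [1, 3, 5, 7]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = accuracy_score(y_val, model.predict(X_val))
    if acc > best_acc:
        best_k, best_acc = k, acc

# 4. Retrain the final model with the chosen hyperparameter on all training data.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X, y)
```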
Testing a Model 16

 After the model is selected, the final model is evaluated against the
test set to estimate the final model accuracy.

 Very important: never "peek" at the test set during training!

[Figure: training data and test data are kept strictly separate.]
How to Split the Dataset 17

 Random splits: Split the data randomly into, e.g.,
60% training, 20% validation, and 20% testing.

 Stratified splits: Like random splits, but balance classes and other properties of the
examples.

 k-fold cross validation: Uses the training & validation data better (a sketch follows this slide).
 Split the training & validation data randomly into k folds.
 For k rounds, hold one fold back for evaluation and use the remaining folds for training.
 Use the average error/accuracy as a better estimate.
 Some algorithms/tools do this internally.

 LOOCV (leave-one-out cross validation): used if very little data is available.
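A minimal k-fold cross-validation sketch (my own, with toy data) using scikit-learn's KFold; the model choice is an assumption.

```python
# 5-fold cross validation: average validation accuracy over the folds (illustrative).
import numpy as np
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = np.random.rand(100, 3), np.random.randint(0, 2, 100)   # toy data

scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = GaussianNB().fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[val_idx], model.predict(X[val_idx])))

print(np.mean(scores))   # average accuracy over the 5 folds
```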
Learning Curve: The Effect of Training Data Size 18

[Figure: accuracy of a classifier as the amount of available training data increases.]

 More data is better!
 At some point the learning curve flattens out and more data does not contribute much.
Comparing to Baselines 19

 First step: get a baseline
 Baselines are very simple straw-man models.
 They help to determine how hard the task is.
 They help to find out what a good accuracy is.

 Weak baseline: The most-frequent-label classifier (a small sketch follows this slide).
 Gives all test instances whatever label was most common in the training set.
 Example: For spam filtering, give every message the label "ham."
 Accuracy might be very high if the problem is skewed (called class imbalance).
 Example: If calling everything "ham" already gets 66% right, then a classifier that gets 70% isn't very
good…

 Strong baseline: For research, we typically compare to the previously published
state-of-the-art as a baseline.
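A tiny sketch (my own) of the weak most-frequent-label baseline; scikit-learn offers the same idea as DummyClassifier(strategy="most_frequent").

```python
# Most-frequent-label baseline: predict the majority training label for everything.
from collections import Counter

def most_frequent_label_baseline(train_labels, n_test):
    majority = Counter(train_labels).most_common(1)[0][0]
    return [majority] * n_test

print(most_frequent_label_baseline(["ham", "ham", "spam"], n_test=4))
# -> ['ham', 'ham', 'ham', 'ham']
```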
20

Types of Models
 Regression: predict a number
 Classification: predict a label
Regression: Linear Regression 21

 Model: ŷ = h_w(x) = wᵀx
 Empirical loss: squared-error loss over the whole data matrix X: L(w) = ‖y − Xw‖²
 Gradient: ∇L(w), a vector of partial derivatives ∂L/∂wⱼ
 Find: w* such that ∇L(w*) = 0

 Gradient descent: repeatedly update w ← w − α ∇L(w), where α is the learning rate.

 Analytical solution: w* = (XᵀX)⁻¹Xᵀy, using the pseudoinverse of X.
(A small sketch of both approaches follows this slide.)
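A NumPy sketch (my own, with synthetic data) of both solution strategies above: the analytical pseudoinverse solution and gradient descent on the squared-error loss. The learning rate and iteration count are arbitrary choices.

```python
# Linear regression: analytical solution vs. gradient descent (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(50), rng.random(50)]                     # design matrix with a bias column
y = X @ np.array([2.0, 3.0]) + 0.1 * rng.standard_normal(50)

# Analytical solution using the pseudoinverse of X.
w_closed = np.linalg.pinv(X) @ y

# Gradient descent on L(w) = (1/N)||y - Xw||^2; gradient is -(2/N) X^T (y - Xw).
w, alpha = np.zeros(2), 0.01                               # alpha = learning rate
for _ in range(5000):
    w -= alpha * (-2 / len(y)) * (X.T @ (y - X @ w))

print(w_closed, w)   # both should be close to the true weights [2.0, 3.0]
```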
Naïve Bayes Classifier 22

 Approximates a Bayes classifier with the naïve independence assumption that all features are
conditionally independent given the class:
ŷ = argmax_y P(y) ∏ⱼ P(xⱼ | y)

 The priors P(y) and the conditionals P(xⱼ | y) are estimated from the data by counting
(a small sketch follows this slide).

 Gaussian Naïve Bayes classifiers extend the approach to continuous features by assuming each
P(xⱼ | y) is a normal (Gaussian) distribution. The parameters of the normal distribution are estimated from the
data.
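A minimal sketch (my own) of a naïve Bayes classifier for binary features, estimating P(y) and P(xⱼ | y) by counting; no smoothing is applied, to keep the sketch short.

```python
# Naive Bayes for binary features: estimate probabilities by counting (illustrative).
import numpy as np

def train_nb(X, y):
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}               # P(y)
    likelihoods = {c: X[y == c].mean(axis=0) for c in classes}   # P(x_j = 1 | y)
    return priors, likelihoods

def predict_nb(x, priors, likelihoods):
    def score(c):
        p = likelihoods[c]
        return priors[c] * np.prod(np.where(x == 1, p, 1 - p))   # P(y) * prod_j P(x_j | y)
    return max(priors, key=score)

X = np.array([[1, 1], [1, 0], [0, 0], [0, 1]])
y = np.array(["spam", "spam", "ham", "ham"])
priors, likelihoods = train_nb(X, y)
print(predict_nb(np.array([1, 1]), priors, likelihoods))   # -> 'spam'
```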
Decision Trees 23

 A sequence of decisions represented as a tree.

 Many implementations that differ by
 How are features selected to split?
 When to stop splitting?
 Is the tree pruned?

 Approximates a Bayes classifier by predicting the most common class among the training
examples that fall into the same leaf (i.e., the same region of the feature space) as x.
(A small sketch follows this slide.)
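A small sketch (my own, with made-up data and feature names) of fitting and inspecting a decision tree with scikit-learn; max_depth is one way to control when splitting stops.

```python
# Decision tree classifier on a tiny toy dataset (illustrative sketch).
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

X = np.array([[0, 1], [1, 1], [1, 0], [0, 0]])
y = np.array(["wait", "wait", "leave", "leave"])

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["Hungry", "Patrons"]))   # the learned splits
print(tree.predict([[1, 1]]))                                   # -> ['wait']
```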


K-Nearest Neighbors Classifier 24

 The class is predicted as the majority class in the set of the k nearest neighbors; k is a hyperparameter.
Larger k smooths the decision boundary.
 Neighbors are found using a distance measure (e.g., Euclidean distance between points).
 Approximates a Bayes classifier by estimating P(y | x) as the fraction of the k nearest neighbors
of x that have class y. (A small sketch follows this slide.)
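A minimal sketch (my own) of k-nearest-neighbor prediction with Euclidean distance; k and the toy data are placeholders.

```python
# k-NN: predict the majority class among the k nearest training points (illustrative).
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    distances = np.linalg.norm(X_train - x, axis=1)        # Euclidean distances to x
    nearest = np.argsort(distances)[:k]                     # indices of the k nearest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]   # majority class

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array(["ham", "ham", "spam", "spam"])
print(knn_predict(X_train, y_train, np.array([0.95, 0.9]), k=3))   # -> 'spam'
```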
Support Vector Machine (SVM) 25

[Figure: a linear decision boundary with the maximum margin; the margin is defined by the support vectors.]

 A linear classifier that finds the maximum-margin separator using only the points that are "support
vectors" and quadratic optimization.
 The kernel trick can be used to learn non-linear decision boundaries (a small sketch follows this slide).
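A short sketch (my own) using scikit-learn's SVC: the RBF kernel illustrates the kernel trick on a toy XOR-like dataset that no linear boundary can separate. The C and gamma values are arbitrary assumptions.

```python
# SVM with an RBF kernel on a non-linearly-separable toy problem (illustrative).
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # XOR-like pattern
y = np.array([0, 1, 1, 0])

model = SVC(kernel="rbf", C=1.0, gamma=2.0).fit(X, y)
print(model.predict([[0.9, 0.1]]))   # -> [1]
print(model.support_vectors_)        # the points that define the margin
```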
Artificial Neural Networks/Deep Learning 26
[Figure: a computational graph with an input layer, a hidden layer, and an output layer; for classification,
the output layer typically uses a softmax activation function returning P(y | x). A single perceptron computes
a weighted sum with a bias term followed by a non-linear activation function.]

 Represent h as a network of weighted sums with non-linear activation functions g (e.g., logistic, ReLU).
 Learn the weights from examples using backpropagation of prediction errors (gradient descent).
 ANNs are universal approximators: large networks can approximate any function (no bias).
Regularization is typically used to avoid overfitting.
 Deep learning adds more hidden layers and layer types (e.g., convolution layers) for better learning.
(A small forward-pass sketch follows this slide.)
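A minimal NumPy sketch (my own) of the forward pass of a one-hidden-layer network: weighted sums plus a bias, a non-linear activation g (ReLU here), and a softmax output. The weights are random for illustration; in practice they are learned with backpropagation as described above.

```python
# Forward pass of a tiny one-hidden-layer network (illustrative sketch).
import numpy as np

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((3, 4)), np.zeros(3)   # 4 input features -> 3 hidden units
W2, b2 = rng.standard_normal((2, 3)), np.zeros(2)   # 3 hidden units -> 2 classes

x = np.array([0.5, -1.0, 2.0, 0.1])                 # one input example
hidden = relu(W1 @ x + b1)                          # hidden layer activations
probs = softmax(W2 @ hidden + b2)                   # class probabilities P(y | x)
print(probs)
```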
Other Models and Methods 27

 Many other models and often-used methods exist:

• Generalized linear model (GLM): This important model family includes linear regression and the
popular classification method logistic regression.
• Regularization: enforce simplicity by using a penalty for complexity.
• Kernel trick: Let a linear classifier learn non-linear decision boundaries (= a linear boundary in a
high-dimensional space).
• Ensemble Learning: Use many models and combine the results (e.g., random forest, boosting).
• Embedding and Dimensionality Reduction: Learn how to represent data in a simpler way.
Some Use Cases of ML for Intelligent Agents 28

 Learn Actions
• Directly learn the best action from examples.
• Such a model can also be used as a playout policy for Monte Carlo tree search, with data from self-play.

 Learn Heuristics
• Learn evaluation functions for states.
• Can learn a heuristic for minimax search from examples.

 Perception
• Natural language processing: Use deep learning / word embeddings / language models to understand
concepts, translate between languages, or generate text.
• Speech recognition: Identify the most likely sequence of words.
• Vision: Object recognition in images/videos. Generate images/video.

 Compressing Tables
• Neural networks can be used as a compact representation of tables that do not fit in memory, e.g., a
joint probability table or a state utility table.
• The tables can be learned from data.

 Bottom line: Learning a function is often more effective than hard-coding it.
 However, we do not always know how it performs in very rare cases!
Conclusion 29

 Machine learning
 Supervised learning (k-nearest neighbor, linear regression, ANN)
 Unsupervised learning (clustering, …)
 Deep learning (ANN, CNN, RNN, …)
 Reinforcement learning (Q-table, …)
 Practice of supervised learning:
 Linear regression
 Support vector machine
 Ensemble learning
 Decision tree
 Naïve Bayes classifier
30

END
