Lecture 7 Overview of ML models
Parameter
• The machine learns from the training data to map the target function, but the form of that function is unknown.
• Different algorithms make different assumptions (biases) about the function's structure, so our task as machine learning practitioners is to try various machine learning algorithms and see which one is effective at modeling the underlying function.
• Thus machine learning models are parameterized so that their behavior can be tuned for a given problem. These models can have many parameters, and finding the best combination of parameters can be treated as a search problem.
What is a parameter in a machine learning model?
• A model parameter is a configuration variable that is internal to the model and whose value can be estimated from the given data.
• They are required by the model when making predictions.
• Their values define the skill of the model on your problem.
• They are estimated or learned from historical training data.
• They are often not set manually by the practitioner.
• They are often saved as part of the learned model.
• The examples of model parameters include:
• The weights in an artificial neural network.
• The support vectors in a support vector machine.
• The coefficients in linear regression or logistic regression.
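A minimal sketch of these ideas, assuming scikit-learn and a synthetic dataset (neither is prescribed by the lecture): the coefficients and intercept below are model parameters, estimated from the training data and saved as part of the fitted model object.

```python
# Sketch: model parameters (coefficients, intercept) are learned from data
# and stored inside the fitted model. Synthetic data, for illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 2))      # 100 training instances, 2 features
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5    # underlying target function

model = LinearRegression()
model.fit(X, y)                            # parameters are estimated here

print(model.coef_)       # learned coefficients, approximately [3.0, -2.0]
print(model.intercept_)  # learned intercept, approximately 0.5
```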
What is a parametric model?
• A parametric model is a learning model that summarizes the data with a set of parameters of fixed size (independent of the number of training instances). Parametric machine learning algorithms optimize the function to a known form.
• In a parametric model, you know in advance which model you are going to fit to the data, for example a linear regression line.
• Assuming the functional form of a line simplifies the learning process greatly. All we have to do is estimate the coefficients of the line equation, and we have a predictive model for the problem. With the intercept and the coefficients, one can predict any value along the regression line.
• Some more examples of parametric machine learning algorithms include:
• Logistic Regression
• Linear Discriminant Analysis
• Perceptron
• Naive Bayes
• Simple Neural Networks
What is a nonparametric model?
• Nonparametric machine learning algorithms are those that do not make specific assumptions about the form of the mapping function.
• By not making assumptions, they are free to learn any functional form from the training data.
• The word nonparametric does not mean that the model lacks parameters, but rather that the parameters are flexible and not fixed in advance.
• A simple nonparametric model to understand is the k-nearest neighbors algorithm, which makes predictions for a new data instance based on the k most similar training patterns. The only assumption it makes about the data set is that the training patterns that are most similar are most likely to have a similar result (see the sketch after the list below).
• Some more examples of popular nonparametric machine learning algorithms are:
• k-Nearest Neighbors
• Decision Trees like CART and C4.5
• Support Vector Machines
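To make the contrast concrete, here is a small sketch (scikit-learn on synthetic data, both chosen for illustration): the parametric logistic regression ends up with a fixed number of coefficients set by the feature count, while the nonparametric k-NN classifier keeps the training instances themselves.

```python
# Sketch: parametric (fixed-size parameters) vs nonparametric (stores training data).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

param_model = LogisticRegression().fit(X, y)
knn_model = KNeighborsClassifier(n_neighbors=5).fit(X, y)

print(param_model.coef_.shape)   # (1, 2): parameter count fixed by the feature count
print(knn_model.n_samples_fit_)  # 200: k-NN keeps all training samples around
```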
Example ML models for classification
Decision Trees
• A decision tree is a flowchart-like tree structure where an internal node represents a
feature(or attribute), the branch represents a decision rule, and each leaf node
represents the outcome.
• The topmost node in a decision tree is known as the root node. The tree learns to partition the data on the basis of attribute values, splitting in a recursive manner called recursive partitioning. This flowchart-like structure helps you in decision-making, and its visualization mimics human-level thinking. That is why decision trees are easy to understand and interpret.
• A decision tree is a white-box type of ML algorithm: it exposes its internal decision-making logic, which is not available in black-box algorithms such as a neural network. Its training time is also faster than that of a neural network (see the sketch after this list).
• The time complexity of decision trees is a function of the number of records and
attributes in the given data. The decision tree is a distribution-free or non-parametric
method which does not depend upon probability distribution assumptions. Decision
trees can handle high-dimensional data with good accuracy.
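The white-box character is easy to see in code. A minimal sketch, assuming scikit-learn and the bundled iris dataset as a stand-in (not data from the lecture):

```python
# Sketch: fit a small decision tree and print its learned rules, showing the
# flowchart-like structure (internal nodes test a feature, leaves give a class).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

print(export_text(tree, feature_names=load_iris().feature_names))
```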
Overview
• Consider a classification problem that involves nominal data – data described by a list of
attributes (e.g., categorizing people as short or tall using gender, height, age, and
ethnicity).
• How can we use such nominal data for classification? How can we learn the categories of
such data? Nonmetric methods such as decision trees provide a way to deal with such
data.
• Decision trees attempt to classify a pattern
through a sequence of questions. For
example, attributes such as gender and
height can be used to classify people as
short or tall. But the best threshold for
height is gender dependent.
• A decision tree consists of nodes and leaves, with each leaf denoting a class.
• Classes (tall or short) are the outputs of the tree.
• Attributes (gender and height) are a set of features that describe the data.
• The input data consists of values of the different attributes. Using these attribute values, the decision tree generates a class as the output for each input instance.
Basic Principles
• If we continue to grow the tree until each leaf node has the lowest impurity, then the data
will be overfit.
• Two strategies: (1) stop the tree from growing early, or (2) grow the full tree and then prune it.
• A traditional approach to stopping splitting relies on cross-validation:
Validation: train a tree on 90% of the data and test on 10% of the data (referred to as
the held-out set).
Cross-validation: repeat for several independently chosen partitions.
Stopping Criterion: Continue splitting until the error on the held-out data is minimized.
• Reduction In Impurity: stop if the candidate split leads to a marginal reduction of the
impurity (drawback: leads to an unbalanced tree).
• Cost-Complexity: use a global criterion function that combines size and impurity, α · size + Σ_{leaf nodes} i(N), where α trades off tree size against the total impurity of the leaf nodes. This approach is related to minimum description length when the impurity is based on entropy.
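A sketch of the cost-complexity idea using scikit-learn's minimal cost-complexity pruning (the ccp_alpha parameter plays the role of the size penalty above); the iris dataset is just a stand-in:

```python
# Sketch: larger alpha penalizes tree size more heavily, giving smaller trees.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate alpha values along the pruning path of a fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
    print(f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}")
```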
Support Vector Machines
[Figure: maximum-margin separation with hyperplane H: w·x + b = 0 and the parallel hyperplanes H1: w·x + b = +1 and H2: w·x + b = -1, with margins d+ and d- on either side.]
• Recall that the distance from a point (x0, y0) to the line Ax + By + C = 0 is |A·x0 + B·y0 + C| / sqrt(A² + B²).
• The distance between H and H1 is |w·x + b| / ||w|| = 1 / ||w||.
Kernel Trick
• The kernel trick makes your brain hurt when you first learn about it, but it's actually very simple.
K(x^a, x^b) = φ(x^a) · φ(x^b)
The left-hand side lets the kernel do the work; the right-hand side does the scalar product in the obvious way, in the high-dimensional feature space.
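A tiny numerical check of this identity, assuming the quadratic kernel K(a, b) = (a·b)² in 2-D, whose explicit feature map is φ(x) = (x1², √2·x1·x2, x2²):

```python
# Sketch: the kernel value equals the scalar product after the explicit mapping.
import numpy as np

def phi(x):
    # Explicit mapping of a 2-D point into the 3-D feature space.
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

lhs = (a @ b) ** 2       # letting the kernel do the work (stay in 2-D)
rhs = phi(a) @ phi(b)    # doing the scalar product in the obvious way (in 3-D)

print(lhs, rhs)          # both are 1.0 here: (1*3 + 2*(-1))^2 = 1
```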
Dealing with the test data
• If we choose a mapping to a high-D space for which the
kernel trick works, we do not have to pay a
computational price for the high-dimensionality when
we find the best hyper-plane.
• We cannot express the hyperplane by using its normal vector
in the high-dimensional space because this vector would
have a huge number of components.
• Luckily, we can express it in terms of the support vectors.
• But what about the test data? We cannot compute the scalar product w · φ(x) directly, because φ(x) lives in the high-dimensional space.
Dealing with the test data
• We need to decide which side of the separating
hyperplane a test point lies on and this requires us to
compute a scalar product.
• We can express this scalar product as a weighted
average of scalar products with the stored support
vectors
• This could still be slow if there are a lot of support vectors .
The classification rule
• The final classification rule is quite simple:
bias + Σ_{s ∈ SV} w_s · K(x^test, x^s) > 0
where SV is the set of support vectors.
• All the cleverness goes into selecting the support vectors that
maximize the margin and computing the weight to use on each
support vector.
• We also need to choose a good kernel function and we may
need to choose a lambda for dealing with non-separable
cases.
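A sketch of this rule with scikit-learn's SVC on toy data (the dataset and the RBF kernel with gamma = 0.5 are illustrative choices): the decision value is the bias plus a weighted sum of kernel evaluations against the stored support vectors, and it matches decision_function.

```python
# Sketch: reproduce the SVM classification rule from the stored support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)

clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

x_test = np.array([0.3, -0.8])
# K(x_test, x_s) for every stored support vector x_s (RBF kernel, gamma = 0.5).
k = np.exp(-0.5 * np.sum((clf.support_vectors_ - x_test) ** 2, axis=1))
decision = clf.intercept_[0] + np.dot(clf.dual_coef_[0], k)

print(decision, clf.decision_function([x_test])[0])   # the two values agree
```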
Some commonly used kernels
Polynomial: K(x, y) = (x·y + 1)^p
Gaussian radial basis function: K(x, y) = exp(-||x - y||² / (2σ²))
Neural network (sigmoid): K(x, y) = tanh(k·(x·y) - δ)
For the neural network kernel, there is one “hidden unit” per
support vector, so the process of fitting the maximum margin
hyperplane decides how many hidden units to use. Also, it may
violate Mercer’s condition.
Performance
• Support Vector Machines work very well in practice.
• The user must choose the kernel function and its parameters,
but the rest is automatic.
• The test performance is very good.
• They can be expensive in time and space for big datasets
• The computation of the maximum-margin hyper-plane
depends on the square of the number of training cases.
• We need to store all the support vectors.
• SVMs are very good if you have no idea about what structure to impose on the task.
• The kernel trick can also be used to do PCA in a much higher-
dimensional space, thus giving a non-linear version of PCA in the
original space.
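A minimal sketch of that last remark, assuming scikit-learn's KernelPCA and a toy ring-shaped dataset (both illustrative):

```python
# Sketch: KernelPCA with an RBF kernel acts as a non-linear PCA in the original space.
import numpy as np
from sklearn.decomposition import PCA, KernelPCA

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.normal(size=(200, 2))

linear = PCA(n_components=2).fit_transform(X)                        # ordinary PCA
nonlinear = KernelPCA(n_components=2, kernel="rbf", gamma=5.0).fit_transform(X)

print(linear.shape, nonlinear.shape)   # both (200, 2), but different embeddings
```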
Naive Bayes Classifier
• Naive Bayes is a statistical classification technique based on Bayes' Theorem. It is one of the simplest supervised learning algorithms. The Naive Bayes classifier is a fast, accurate and reliable algorithm, with high accuracy and speed on large datasets.
• Naive Bayes classifier assumes that the effect of a particular feature in a
class is independent of other features. For example, a loan applicant is
desirable or not depending on his/her income, previous loan and
transaction history, age, and location. Even if these features are
interdependent, these features are still considered independently. This assumption simplifies computation, and that is why it is called naive. This assumption is known as class conditional independence.
• Bayes' Theorem computes the posterior as P(h|D) = P(D|h) · P(h) / P(D), where:
• P(h): the probability of hypothesis h being true (regardless of the data). This is known as the prior probability of h.
• P(D): the probability of the data (regardless of the hypothesis). This is known as the prior probability of the data (the evidence).
• P(h|D): the probability of hypothesis h given the data D. This is known as the posterior probability.
• P(D|h): the probability of the data D given that hypothesis h is true. This is the likelihood, i.e. the probability of the predictor given the class.
• First approach (in the case of a single feature)
• The Naive Bayes classifier calculates the probability of an event in the following steps:
• Step 1: Calculate the prior probability for the given class labels.
• Step 2: Find the likelihood probability of each attribute for each class.
• Step 3: Put these values into the Bayes formula and calculate the posterior probability.
• Step 4: See which class has the higher posterior probability; the input is assigned to that class.
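A worked sketch of the four steps with hypothetical counts (these numbers are invented for illustration and are not from the lecture's dataset): 40 spam and 60 not-spam emails, of which 28 spam and 6 not-spam contain the word "offer"; we classify an email containing "offer".

```python
# Step 1: prior probabilities of the class labels (hypothetical counts).
p_spam, p_ham = 40 / 100, 60 / 100

# Step 2: likelihood of the attribute ("offer" present) for each class.
p_offer_given_spam = 28 / 40
p_offer_given_ham = 6 / 60

# Step 3: posteriors via Bayes' formula; the shared divisor P(offer) cancels
# out because we only compare the two classes.
post_spam = p_offer_given_spam * p_spam   # 0.7 * 0.4 = 0.28
post_ham = p_offer_given_ham * p_ham      # 0.1 * 0.6 = 0.06

# Step 4: pick the class with the higher posterior probability.
print("spam" if post_spam > post_ham else "not spam")   # -> spam
```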
Example
• Training Dataset
Test data: decide whether the text “overall liked the movie” is a positive review or a negative review.
Example
• We have to calculate,
P(positive | overall liked the movie) — the probability that the tag of a sentence is positive given that the sentence is “overall
liked the movie”.
P(negative | overall liked the movie) — the probability that the tag of a sentence is negative given that the sentence is “overall
liked the movie”.
• Our features will be the counts of each of these words.
In our case, we compute P(positive | overall liked the movie) using Bayes' theorem:
• P(positive | overall liked the movie) = P(overall liked the movie | positive) * P(positive) / P(overall liked the movie)
• P(negative| overall liked the movie) = P(overall liked the movie | negative) * P(negative) / P(overall liked the movie)
• Since for our classifier we have to find out which tag has a bigger probability, we can discard the divisor which is the same for both
tags,
• P(positive | overall liked the movie) = P(overall liked the movie | positive)* P(positive)
• P(negative| overall liked the movie) = P(overall liked the movie | negative) * P(negative)
• Then, by the naive assumption that the words are conditionally independent given the class, the likelihood factorizes:
• P(overall liked the movie| positive) = P(overall | positive) * P(liked | positive) * P(the | positive) * P(movie | positive)
• To get the class probabilities, we first calculate the a priori probability of each tag: for a given sentence in our training data, the probability that it is positive, P(positive), is 3/5, and P(negative) is 2/5.
Scikit-learn provides different naïve Bayes classifier models, namely Gaussian, Multinomial, Complement and Bernoulli. They differ mainly in the assumption they make regarding the distribution of P(features | Y), i.e. the probability of the predictors given the class.
1. Gaussian Naïve Bayes: assumes that the data from each label is drawn from a simple Gaussian distribution.
2. Multinomial Naïve Bayes: assumes that the features are drawn from a simple multinomial distribution.
3. Bernoulli Naïve Bayes: assumes that the features are binary (0s and 1s) in nature. An application of Bernoulli Naïve Bayes is text classification with the ‘bag of words’ model.
4. Complement Naïve Bayes: designed to correct the severe assumptions made by the Multinomial Naïve Bayes classifier; this kind of NB classifier is suitable for imbalanced data sets.
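A short sketch showing that the four scikit-learn variants share the same fit/predict interface and differ only in the distributional assumption; the toy count data below is illustrative.

```python
# Sketch: the four naive Bayes variants applied to the same toy count matrix.
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB, ComplementNB

rng = np.random.default_rng(0)
X_counts = rng.integers(0, 5, size=(60, 10))   # non-negative word counts
y = rng.integers(0, 2, size=60)                # two class labels

for nb in (GaussianNB(), MultinomialNB(), BernoulliNB(), ComplementNB()):
    nb.fit(X_counts, y)
    print(type(nb).__name__, nb.predict(X_counts[:3]))
```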
Machine Learning Algorithms/Models
• We have looked at several ML algorithms for classification:
• Naïve Bayes classifier
• Decision Trees
• k-Nearest Neighbours
• Logistic Regression
• Support Vector Machines
• Artificial Neural Networks
• …
• True Positive: 12 (you predicted the positive case correctly!): the system predicted spam and the emails are truly spam.
• True Negative: 77 (you predicted the negative case correctly!): the system predicted not spam and the emails are truly not spam.
• False Positive: 8 (you predicted these emails are spam, but in actual fact they are not spam). This is a type-I error in this case.
• False Negative: 3 (you predicted that these three emails are not spam, but they actually are spam. This is dangerous! Be careful!). This is a type-II error in this case.
Classify email as either spam or not spam
• Accuracy - the ratio of correctly predicted emails to the total number of emails, which is (12 + 77)/100 = 0.89.
• Precision - the ratio 12/(12 + 8) = 0.6, which measures how many of the emails the system flagged as spam are actually spam.
• Recall - the ratio 12/(12 + 3) = 0.8, which measures how many of the truly spam emails the system manages to detect.
                            Actual
                            Spam    Not spam
System (Predicted)  Spam      12         8
                Not spam       3        77
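The three metrics can be computed directly from the confusion matrix above; a minimal sketch:

```python
# Accuracy, precision and recall from the spam confusion matrix (TP=12, FP=8, FN=3, TN=77).
tp, fp, fn, tn = 12, 8, 3, 77

accuracy = (tp + tn) / (tp + tn + fp + fn)   # (12 + 77) / 100 = 0.89
precision = tp / (tp + fp)                   # 12 / 20 = 0.60
recall = tp / (tp + fn)                      # 12 / 15 = 0.80

print(accuracy, precision, recall)
```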
Cross-Validation
• Cross-validation involves partitioning your data into
distinct training and test subsets.
• The goal of ML is not to replicate the training data, but to predict unseen
data well, i.e., to generalize well.
• For best generalization, we should match the complexity of the hypothesis
class H with the complexity of the function underlying the data:
• If H is less complex than the underlying function: underfitting. Ex: fitting a line to data generated from a cubic polynomial.
• The excessively simple model fails to learn the intricate patterns and underlying trends of the given dataset.
• If H is more complex: overfitting. Ex: fitting a cubic polynomial to data generated from a line.
• As model complexity increases, the model tends to fit the noise present in the data (see the sketch after this list).
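A sketch of matching hypothesis complexity to the data, using a held-out validation set as described above (the noisy cubic data and the candidate polynomial degrees are illustrative choices):

```python
# Compare held-out error across polynomial degrees: too low a degree underfits,
# too high a degree tends to overfit the noise in the training data.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 60)
y = x**3 - 0.5 * x + 0.1 * rng.normal(size=60)   # cubic target plus noise

x_train, y_train = x[:54], y[:54]                # ~90% for training
x_val, y_val = x[54:], y[54:]                    # ~10% held out

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    val_err = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    print(f"degree {degree}: held-out MSE = {val_err:.4f}")
```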
Over-fitting
• Your model should ideally fit an infinite sample of the
type of data you’re interested in.
• In reality, you only have a finite set to train on. A good
model for this subset is a good model for the infinite
set, up to a point.
• Beyond that point, the model quality (measured on new
data) starts to decrease.
• Beyond that point, the model is over-fitting the data.
Model selection and generalization
Bias-Variance Tradeoff
Bias is the amount of error introduced by approximating
real-world phenomena with a simplified model.
Variance is how much your model's test error changes
based on variation in the training data. It reflects the
model's sensitivity to the idiosyncrasies of the data set it
was trained on.
As a model increases in complexity and becomes more wiggly (flexible), its bias decreases (it does a good job of explaining the training data), but its variance increases (it doesn't generalize as well). Ultimately, in order to have a good model, you need one with low bias and low variance.
Bias-Variance Tradeoff
• There are two major sources of error in machine learning: bias and
variance. Understanding them will help you decide whether adding data, as
well as other tactics to improve performance, are a good use of time.
• Suppose you hope to build a cat recognizer that has 5% error. Right now,
your training set has an error rate of 15%, and your dev set has an error
rate of 16%. In this case, adding training data probably won’t help much.
• First, the algorithm’s error rate on the training set. In this example, it is
15%. We think of this informally as the algorithm’s bias.
• Second, how much worse the algorithm does on the dev (or test) set than on the training set. We think of this informally as the algorithm's variance.
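A tiny arithmetic sketch of this informal diagnosis, using the numbers from the cat-recognizer example above (the variable names are mine):

```python
# Informal diagnosis: training error ~ bias, and how much worse the dev set
# does than the training set ~ variance.
target_error = 0.05    # the 5% goal for the cat recognizer
train_error = 0.15     # the algorithm's bias (informally)
dev_error = 0.16

variance = dev_error - train_error          # about 0.01: little overfitting
gap_to_target = train_error - target_error  # about 0.10: bias is the main problem

print(variance, gap_to_target)  # adding more data mainly reduces variance, not bias
```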
Bias-Variance Tradeoff
[Figure: model error versus number of iterations, showing the training error curve and the error on new data.]
Bias-Variance Tradeoff
Remember that the only thing we care about is how the model
performs on test data.
Classification example - Complex model
• Should we keep the hypothesis class simple rather than complex?
•Easier to use and to train (fewer parameters, faster).
• Easier to explain or interpret.
•Less variance in the learned model than for a complex model (less affected by
single instances), but also higher bias.
• Given comparable empirical error, a simple model will generalize
better than a complex one. (Occam’s razor : simpler explanations are
more plausible; eliminate unnecessary complexity.)
Model selection and generalization
• In summary, in ML algorithms there is a tradeoff between 3 factors:
•the complexity c(H) of the hypothesis class
•the amount of training data N
•the generalization error E
Noise
• Noise is any unwanted anomaly in the data. It can be due to:
• Imprecision in recording the input attributes: x_n.
• Errors in labeling the input vectors: y_n.
•Attributes not considered that affect the label (hidden or latent attributes,
may be unobservable).
• Noise makes learning harder.