ML 2
Part II
A Biswas
IIEST, Shibpur
Syllabus
Introduction
Learning Problems, Well-posed learning problems, Designing learning systems.
Concept Learning
Concept learning task, Inductive hypothesis, Ordering of Hypothesis, General-to-specific ordering of
hypotheses. Version spaces, Inductive Bias.
Regression
Linear regression, Notion of cost function, Logistic Regression, Cost function for logistic regression,
application of logistic regression to multi-class classification.
Continued …
Syllabus (continued)
Supervised Learning
Support Vector Machine, Decision tree Learning, Representation, Problems, Decision Tree Learning Algorithm,
Attributes, Inductive Bias, Overfitting.
Bayes Theorem, Bayesian learning, Maximum likelihood, Least squared error hypothesis, Gradient Search, Naive
Bayes classifier, Bayesian Network, Expectation Maximization Algorithm.
Unsupervised learning
Clustering, K-means clustering, hierarchical clustering.
Instance-Based Learning
k-Nearest Neighbour Algorithm, Radial Basis Function, Locally Weighted Regression, Locally Weighted Function.
Neural networks
Linear threshold units, Perceptrons, Multilayer networks and back-propagation, recurrent networks. Probabilistic
Machine Learning, Maximum Likelihood Estimation.
Regularization, Preventing Overfitting, Ensemble Learning: Bagging and Boosting, Dimensionality reduction
Machine Learning Basics
A machine learning algorithm is an algorithm that is able to
learn from data.
Machine Learning Basics
P ↑ for T with E : the performance measure P at tasks in T improves with experience E.
Machine Learning Basics
Representing an example :
a vector x ∈ ℝⁿ, where each xi is a feature.
Classification :
The learning algorithm produces a function f : ℝⁿ → {1, …, k}.
Regression :
f : ℝⁿ → ℝ
Clustering
Supervised Learning
Supervised learning algorithms experience a dataset
containing features, but each example is also associated with
a label or target.
Every feature xi^(j), j = 1, …, D, is also a real number.
Linear Regression Problem
Problem Statement : …
Objective is to build a model fw,b(x) as a linear combination of features of example x:
fw,b(x) = wx + b
y ← fw,b(x)
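As an illustration (not from the slides), here is a minimal Python sketch that fits fw,b(x) = wx + b to toy one-dimensional data by gradient descent on the mean squared error cost; the data, learning rate, and iteration count are all assumed.

import numpy as np

# Toy 1-D data: y is roughly 3*x + 2 plus noise (assumed for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 2.0 + 0.1 * rng.normal(size=100)

w, b = 0.0, 0.0          # parameters of the model f_{w,b}(x) = w*x + b
lr = 0.1                 # learning rate (arbitrary choice)

for _ in range(1000):
    err = w * x + b - y
    # Gradients of the mean squared error cost (1/N) * sum((w*x + b - y)^2)
    grad_w = 2.0 * np.mean(err * x)
    grad_b = 2.0 * np.mean(err)
    w -= lr * grad_w
    b -= lr * grad_b

print(f"learned w={w:.2f}, b={b:.2f}")   # should approach w ≈ 3, b ≈ 2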
Logistic Regression
Logistic Regression
Problem Statement
Logistic Regression
If we now apply our model fŵ,b̂ to xi, we will get some value 0 < p < 1 as output.
If yi is the positive class, the likelihood of yi being the positive class, according to our model, is given by p.
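A minimal sketch, assuming toy parameter values, of the sigmoid model and the per-example likelihood described above (not the slides' own code):

import numpy as np

def sigmoid(z):
    """Standard logistic function: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, b):
    """Model f_{w,b}(x): probability that x belongs to the positive class."""
    return sigmoid(np.dot(w, x) + b)

def likelihood(x, y, w, b):
    """Likelihood of label y (1 = positive, 0 = negative) under the model:
    p if y is positive, 1 - p if y is negative."""
    p = predict_proba(x, w, b)
    return p if y == 1 else 1.0 - p

# Hypothetical fitted parameters and one example (made up for illustration)
w_hat, b_hat = np.array([1.5, -0.7]), 0.2
x_i, y_i = np.array([0.4, 1.1]), 1
print(likelihood(x_i, y_i, w_hat, b_hat))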
Problem statement:
The prediction given by the above model, fID3^S(x), would be the same for any input x.
Decision Tree Learning
Search through all features j=1, … , D and all thresholds t.
Split the set S into two subsets:
S− = {(x, y) | (x, y) ∈ S, x^(j) < t} and S+ = {(x, y) | (x, y) ∈ S, x^(j) ≥ t}.
Decision Tree Learning
The two new subsets would go to two new leaf nodes, and we evaluate, for all possible pairs (j, t), how good the split with pieces S− and S+ is.
Finally, we pick the best such values (j, t), split S into
S− and S+, form two new leaf nodes, and continue
recursively on S− and S+ (or quit if no split produces a
model that’s sufficiently better than the current one).
Decision Tree Learning
Decision Tree Learning
The algorithm stops at a leaf node when:
All examples in the leaf node are classified correctly by
the one-piece model
We cannot find an attribute to split upon.
The split reduces the entropy by less than some ϵ (the value of which has to be found experimentally).
The tree reaches some maximum depth d (also has to be
found experimentally).
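To make the greedy split search concrete, here is a hedged sketch of one ID3-style step (not the slides' own code): it scans every feature j and threshold t and scores the split of S into S− and S+ by weighted entropy; the tiny dataset is assumed.

import numpy as np

def entropy(labels):
    """Shannon entropy of a label array."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(X, y):
    """Search all features j and thresholds t; return the (j, t) whose split
    into S- (x^(j) < t) and S+ (x^(j) >= t) gives the lowest weighted entropy."""
    n, d = X.shape
    best = (None, None, np.inf)
    for j in range(d):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] < t], y[X[:, j] >= t]
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best

# Tiny made-up dataset for illustration
X = np.array([[1.0, 5.0], [2.0, 4.0], [3.0, 1.0], [4.0, 0.5]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))   # expected to split on feature 0 around t = 3.0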
Decision Tree Learning
Because in ID3, the decision to split the dataset on each
iteration is local (doesn’t depend on future splits), the
algorithm doesn’t guarantee an optimal solution.
where wx means w^(1)x^(1) + w^(2)x^(2) + ⋯ + w^(D)x^(D), and D is the dimension of the feature vector x.
Support Vector Machine
Now, the predicted label for some input feature vector x is
given by
y = sign(wx - b)
Once these optimal values are identified, the model f(x) is then defined as
f(x) = sign(w*x − b*)
Support Vector Machine
To predict whether an email message is spam or not spam
using an SVM model:
- take a text of the message,
- convert it into a feature vector,
- then multiply this vector by w*, subtract b*
- and take the sign of the result.
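A hedged sketch of that prediction pipeline; the keyword-count featurizer and the values of w* and b* below are invented placeholders, not a trained model.

import numpy as np

# Hypothetical vocabulary-based feature extraction: count how often each
# keyword occurs in the message (a stand-in for a real featurizer).
VOCAB = ["free", "winner", "meeting", "project"]

def featurize(text):
    words = text.lower().split()
    return np.array([words.count(tok) for tok in VOCAB], dtype=float)

# Placeholder parameters, as if they had been returned by SVM training.
w_star = np.array([1.2, 0.9, -1.0, -0.8])
b_star = 0.5

def predict(text):
    """sign(w* . x - b*): +1 -> spam, -1 -> not spam."""
    x = featurize(text)
    return 1 if np.dot(w_star, x) - b_star > 0 else -1

print(predict("You are a winner claim your free prize"))    # likely +1 (spam)
print(predict("Agenda for the project meeting tomorrow"))    # likely -1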
Inherent nonlinearity.
Support Vector Machine
So, as of now, we want to satisfy the constraints:
wxi − b ≥ +1 if yi = +1, and
wxi − b ≤ −1 if yi = −1,
and minimize ||w|| so that the hyperplane is equally distant from the closest examples of each class.
Support Vector Machine
Minimising ||w|| is equivalent to minimising ½||w||².
The use of this term makes it possible to perform quadratic
programming optimization.
The optimization problem for SVM:
minimize ½||w||², subject to yi(wxi − b) ≥ 1 for i = 1, …, N.
Support Vector Machine: Dealing with Noise
To extend SVM to cases in which the data is not linearly
separable, we introduce the hinge loss function:
max(0,1 − yi(wxi − b))
The hinge loss function is zero if the constraints a) and b)
are satisfied, that is, if wxi lies on the correct side of the
decision boundary.
For data on the wrong side of the decision boundary, the
function’s value is proportional to the distance from the
decision boundary.
Support Vector Machine: Dealing with Noise
So, we have to minimise the cost function:
C||w||² + (1/N) Σi max(0, 1 − yi(wxi − b)),
where the hyperparameter C determines the tradeoff between increasing the size of the decision boundary and ensuring that each xi lies on the correct side of the decision boundary.
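A minimal sketch, assuming toy data and arbitrary C and step size, of evaluating this cost and minimizing it by plain sub-gradient descent (not the slides' own solver):

import numpy as np

def svm_cost(w, b, X, y, C):
    """C*||w||^2 + average hinge loss max(0, 1 - y_i (w.x_i - b))."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w - b))
    return C * np.dot(w, w) + hinge.mean()

def train(X, y, C=0.01, lr=0.01, epochs=2000):
    """Sub-gradient descent on the soft-margin cost (toy scale only)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        active = y * (X @ w - b) < 1.0        # examples with nonzero hinge loss
        # Sub-gradients of C*||w||^2 + (1/N) * sum of hinge losses
        grad_w = 2.0 * C * w - (y[active][:, None] * X[active]).sum(axis=0) / n
        grad_b = y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Tiny linearly separable toy data with labels in {-1, +1}
X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -2.0], [-1.5, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = train(X, y)
print(svm_cost(w, b, X, y, C=0.01), np.sign(X @ w - b))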
Support Vector Machine: for Inherent Non-Linearity
Optimisation algorithm for SVM to find w and b (the dual formulation):
maximize Σi αi − ½ Σi Σk yi αi yk αk (xi xk), subject to Σi αi yi = 0 and αi ≥ 0 for all i.
Support Vector Machine: for Inherent Non-Linearity
The term xixk is the place where the feature vectors are
used.
To transform the vector space into a higher-dimensional space, we would transform xi into ϕ(xi) and xk into ϕ(xk) and then compute their dot product, which is computationally costly.
Support Vector Machine: for Inherent Non-Linearity
By using the kernel trick, we can get rid of a costly
transformation of original feature vectors into higher-
dimensional vectors and avoid computing their dot-
product.
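As an illustration of the kernel trick, a Gaussian (RBF) kernel returns the dot product of the implicitly transformed vectors ϕ(xi)·ϕ(xk) without ever constructing ϕ; the bandwidth gamma below is an arbitrary choice.

import numpy as np

def rbf_kernel(x_i, x_k, gamma=0.5):
    """k(x_i, x_k) = exp(-gamma * ||x_i - x_k||^2).
    Equals the dot product phi(x_i).phi(x_k) in an (infinite-dimensional)
    feature space, so phi never has to be computed explicitly."""
    diff = x_i - x_k
    return np.exp(-gamma * np.dot(diff, diff))

x_i = np.array([1.0, 2.0])
x_k = np.array([2.0, 0.0])
print(rbf_kernel(x_i, x_k))    # kernel value used in place of x_i . x_k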
The value f̂(xq) returned by this algorithm as its estimate of f(xq) is just the most common value of f among the k training examples nearest to xq.
k-Nearest Neighbors (kNN)
If we choose k = 1, then the 1-NEAREST NEIGHBOR algorithm assigns to f̂(xq) the value f(xi), where xi is the training instance nearest to xq.
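A minimal sketch of the basic rule on assumed toy data: f̂(xq) is the most common label among the k training examples nearest to xq (with k = 1 it reduces to the single nearest example).

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_q, k=3):
    """Return the most common label among the k training examples
    nearest (in Euclidean distance) to the query point x_q."""
    dists = np.linalg.norm(X_train - x_q, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny made-up dataset
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array(["a", "a", "b", "b"])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1]), k=3))   # -> "a"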
k-Nearest Neighbors (kNN)
Very distant examples will have very little effect on f̂(xq); the only disadvantage of considering all examples is that the classifier will run more slowly.
Remarks on kNN
The distance-weighted k-NEAREST NEIGHBOR algorithm
is a highly effective inductive inference method for many
practical problems.
It is robust to noisy training data and quite effective
when it is provided a sufficiently large set of training
data.
Note that by taking the weighted average of the k
neighbors nearest to the query point, it can smooth out
the impact of isolated noisy training examples.
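A hedged sketch of the distance-weighted variant for regression, using the common 1/d² weighting (the toy data and the choice of weighting are assumptions):

import numpy as np

def weighted_knn_regress(X_train, y_train, x_q, k=3, eps=1e-12):
    """Distance-weighted k-NN: average of the k nearest targets, each
    weighted by 1 / d(x_q, x_i)^2, so closer neighbours dominate."""
    dists = np.linalg.norm(X_train - x_q, axis=1)
    nearest = np.argsort(dists)[:k]
    w = 1.0 / (dists[nearest] ** 2 + eps)   # eps avoids division by zero
    return np.sum(w * y_train[nearest]) / np.sum(w)

# Toy data: target roughly follows the inputs, with one noisy target
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_train = np.array([0.0, 1.0, 1.0, 5.0])   # last target is noisy
print(weighted_knn_regress(X_train, y_train, np.array([0.1, 0.1]), k=3))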
Remarks on kNN
What is the inductive bias of k-NEAREST NEIGHBOR? It corresponds to the assumption that the classification of an instance xq will be most similar to the classification of other instances that are nearby in Euclidean distance.
Remarks on kNN
One interesting approach to overcoming this problem (distances being dominated by many irrelevant attributes, i.e. the curse of dimensionality) is to weight each attribute differently when calculating the distance between two instances.
Remarks on kNN
Hence, one algorithm is to select a random subset of
the available data to use as training examples, then
determine the values of z1, …, zn that lead to the
minimum error in classifying the remaining examples.
This leave-one-out approach is easily implemented in k-NEAREST-NEIGHBOR algorithms because no additional training effort is required each time the training set is redefined.
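A hedged sketch of the leave-one-out idea for judging attribute weights z1, …, zn: each candidate weighting is scored by classifying every example using all the other examples as training data (the toy data and the two candidate weightings are assumptions; a real search would optimize z rather than enumerate candidates).

import numpy as np

def loo_error(X, y, z, k=3):
    """Leave-one-out error of k-NN when each attribute is scaled by z:
    classify every example using all the *other* examples as training data."""
    Xs = X * z                                   # stretch the axes by z_1..z_n
    errors = 0
    for i in range(len(X)):
        dists = np.linalg.norm(Xs - Xs[i], axis=1)
        dists[i] = np.inf                        # leave example i out
        nearest = np.argsort(dists)[:k]
        if np.bincount(y[nearest]).argmax() != y[i]:
            errors += 1
    return errors / len(X)

# Toy data: attribute 0 is informative, attribute 1 is pure noise
rng = np.random.default_rng(1)
X = np.column_stack([np.r_[np.zeros(10), np.ones(10)], rng.normal(size=20)])
y = np.r_[np.zeros(10, dtype=int), np.ones(10, dtype=int)]

for z in (np.array([1.0, 1.0]), np.array([1.0, 0.0])):   # candidate weightings
    print(z, loo_error(X, y, z))   # compare LOO error with and without the noisy attribute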
Remarks on kNN
Because this algorithm delays all processing until a new
query is received, significant computation can be required
to process each new query.
Locally weighted linear regression
To choose weights that minimize the squared error summed over the set D of training examples:
E = ½ Σx∈D (f(x) − f̂(x))²
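A hedged sketch, on assumed toy data, of locally weighted linear regression: for each query xq we solve a weighted least-squares problem in which each training example is weighted by a Gaussian kernel of its distance to xq, then predict with the locally fitted line.

import numpy as np

def lwlr_predict(X, y, x_q, tau=0.5):
    """Locally weighted linear regression: fit w by weighted least squares,
    where training example x_i gets weight exp(-||x_i - x_q||^2 / (2 tau^2)),
    then predict w . [1, x_q]."""
    Xb = np.column_stack([np.ones(len(X)), X])        # add intercept column
    xq_b = np.r_[1.0, np.atleast_1d(x_q)]
    d2 = np.sum((X - x_q) ** 2, axis=1)
    k = np.exp(-d2 / (2.0 * tau ** 2))                # kernel weights
    W = np.diag(k)
    w = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)  # weighted normal equations
    return xq_b @ w

# Toy nonlinear data: y = sin(x) sampled on a grid of 1-D inputs
X = np.linspace(0, 6, 50).reshape(-1, 1)
y = np.sin(X).ravel()
print(lwlr_predict(X, y, np.array([3.0])))            # close to sin(3) ≈ 0.14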