Machine Learning - Unit - 1
Outline
Unit 1 -> Introduction to ML, Supervised Learning.
Unit 2 -> Bayesian Decision Theory, Decision Trees.
In general terms, once the machine (computer) has been trained on past example data with appropriate algorithms, it adapts to changes automatically, even for complex problems.
• A model is defined up to some parameters.
• Learning is the execution of a computer program to optimize the parameters of the model using training data or past experience.
• The model may be predictive (to make predictions in the future) or descriptive (to gain knowledge).
What Is Machine Learning?
Features:
Learning Associations
Classification
Regression
Supervised Learning
Unsupervised Learning
Reinforcement Learning
Examples of Machine Learning Applications
1. Learning Associations
Association rule Finds interesting associations and relationships among large sets
of data items.
This rule shows how frequently an itemset occurs in a transaction.
Example - Market Basket Analysis.
* Finding associations between products bought by customers, distinguishing among customers, and targeting them for cross-selling.
Estimate P(Y | X, D)
where P is the probability, Y is the product or set of products, X is the condition, and D is the set of customer attributes (demographic attributes like gender, age, marital status, etc.).
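As a minimal sketch (the transactions, product names, and the helper name confidence are made up for illustration), P(Y | X) can be estimated from market-basket data as the fraction of baskets containing X that also contain Y:

```python
# A minimal sketch (toy data): estimating the association
# P(Y | X) = "probability that a customer who buys X also buys Y"
# from a small list of market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "chips"},
]

def confidence(x, y, baskets):
    """Estimate P(Y | X): fraction of baskets containing X that also contain Y."""
    with_x = [b for b in baskets if x in b]
    if not with_x:
        return 0.0
    return sum(y in b for b in with_x) / len(with_x)

print(confidence("bread", "milk", transactions))  # 2 of 3 bread baskets contain milk -> 0.67
```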
2. Classification
Categorizes a set of data into classes.
Uses a set of features or parameters to characterize each object.
Supervised learning concept (Labelled data).
Example – The bank assesses the financial capacity of a customer
(income, savings, collaterals, profession, age, past financial history).
Predicting the risk of a new customer from the given inputs and past transaction records.
Classifying the customers into two classes: High-risk customer and Low-risk customer.
After training with the past data, a classification rule learned may be of the form:
IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk,
for suitable values of θ1 and θ2.
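A minimal sketch of such a rule, with made-up threshold values for θ1 and θ2 (a real system would learn these from the past data):

```python
# A minimal sketch of the IF-THEN discriminant above; the threshold
# values are illustrative assumptions, not learned from real data.
THETA1 = 40_000   # income threshold (assumed)
THETA2 = 10_000   # savings threshold (assumed)

def classify(income, savings):
    """Return 'low-risk' if both thresholds are exceeded, else 'high-risk'."""
    return "low-risk" if income > THETA1 and savings > THETA2 else "high-risk"

print(classify(income=55_000, savings=15_000))  # low-risk
print(classify(income=25_000, savings=3_000))   # high-risk
```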
• Prediction Once we have a rule that fits the past data, if the future is similar to
the past, then we can make correct predictions for novel instances.
• Knowledge Extraction Learning a rule from data also allows knowledge extraction. The rule is a simple model that explains the properties underlying the data.
• Compression Learning also performs compression: by fitting a rule to the data, we get an explanation that is simpler than the data, requiring less memory to store and less computation to process.
• Outlier Detection Finding the instances that do not obey the rule and are
exceptions. (Which may imply anomalies requiring attention— for example,
fraud).
3. Regression
A Supervised learning concept.
Helps in finding the correlation between variables.
Used to predict a continuous value (output).
Example – Predicting the price of a used car.
Inputs are the car attributes—brand, year, engine capacity, mileage. (independent variables)
The output is the price of the car. (dependent variable)
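A minimal sketch of this idea on made-up data, fitting a straight line from one attribute (mileage) to price with least squares:

```python
# A minimal sketch (toy, assumed values): fitting a linear regression
# price = w * mileage + b to predict the price of a used car from one attribute.
import numpy as np

mileage = np.array([20_000, 45_000, 60_000, 90_000, 120_000], dtype=float)  # km
price   = np.array([9.0, 7.5, 6.8, 5.2, 3.9])                               # lakh Rs (made up)

w, b = np.polyfit(mileage, price, deg=1)      # least-squares fit of a line
print(f"predicted price at 75,000 km: {w * 75_000 + b:.2f} lakh Rs")
```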
Figure: Training set for the class of a "family car." Each data point corresponds to one example car; the coordinates of the point indicate the price and engine power of that car. '+' denotes a positive example of the class (a family car), and '−' denotes a negative example (not a family car; it is another type of car).
Two input attributes: X1 -> Price (in Rs) and X2 -> Engine Power (in cubic cm).
Represent each car using two numeric values: x = [x1, x2]^T, where x1 is the price and x2 is the engine power.
Its label r denotes its type:
r = 1 if x is a positive example, and r = 0 if x is a negative example.
Each car is thus represented by an ordered pair (x, r), and the training set contains N such examples:
X = {x^t, r^t}_{t=1}^N, where t indexes different examples in the set.
Learning a class from examples…
Figure: Example of a hypothesis class. The class of family car, C, is a rectangle in the price-engine power space, with engine power bounded between e1 and e2 (and price between two corresponding limits).
Our training data can now be plotted in the two-dimensional (x1, x2) space, where each instance t is a data point at coordinates (x1^t, x2^t) and its type (positive or negative) is given by r^t.
Figure: C is the actual class and h is our induced hypothesis (both rectangles in the price-engine power space).
A point where C is 1 but h is 0 is a false negative.
A point where C is 0 but h is 1 is a false positive.
Other points, namely true positives and true negatives, are correctly classified.
For points to be in general position, no combination of 3 points should lie on a straight line. [In general, no subset of (n+1) points should lie in an (n−1)-dimensional subspace.]
If the 3 points lie on a straight line, they are not in general position in a 2D space.
VC Dimension…
For 3 points in general position in two dimensions, all 2^3 = 8 possible labelings can be separated perfectly by a line (hyperplane); hence 3 such points can be shattered.
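A small sketch of this claim (the points, the labels, and the simple perceptron trainer are illustrative choices, not from the source): it checks that every one of the 8 labelings of 3 points in general position is linearly separable.

```python
# A minimal sketch: 3 points in general position in 2D can be shattered by a
# linear classifier, i.e. all 2^3 = 8 labelings are separable, so the VC
# dimension of lines in 2D is at least 3.
import itertools
import numpy as np

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # in general position

def perceptron_separates(X, y, epochs=1000):
    """Return True if a simple perceptron reaches zero training error."""
    w = np.zeros(2)
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x, target in zip(X, y):
            pred = 1 if w @ x + b > 0 else -1
            if pred != target:
                w += target * x
                b += target
                errors += 1
        if errors == 0:
            return True
    return False

for labels in itertools.product([-1, 1], repeat=3):
    print(labels, perceptron_separates(points, np.array(labels)))   # True for all 8
```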
VC Dimension…
Points to Remember
For good generalization, VC Dimension of a hypothesis should be finite.
VC Dimension of a Linear Classifier in n dimensions: (n + 1). {Points should be in general position.}
VC Dimension of a Non-Linear Classifier: Very difficult to compute.
3. Probably Approximately Correct (PAC) Learning
Need for PAC learning:
To find how many examples are needed when using a hypothesis. (For example: using the tightest rectangle S.)
We want the hypothesis to be approximately correct, namely, that the error probability be bounded by some value.
We also want to know that our hypothesis will be correct most of the time (if not always); so we want to be probably correct as well (with a probability we can specify).
PAC Learning Given a class, C, and examples drawn from some unknown but fixed probability distribution, p(x), we want to find the number of examples, N, such that with probability at least 1 − δ, the hypothesis h has error at most ε, for arbitrary δ ≤ 1/2 and ε > 0 (CΔh is the region of difference between C and h):
P{CΔh ≤ ε} ≥ 1 − δ
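As a worked example, assuming the standard sample-complexity bound for the tightest axis-aligned rectangle, N ≥ (4/ε) ln(4/δ) (stated here as an assumption, not derived in these notes), one can compute how many examples suffice for given ε and δ:

```python
# A worked example, assuming the bound N >= (4/epsilon) * ln(4/delta)
# for the tightest axis-aligned rectangle hypothesis.
import math

def pac_examples(epsilon, delta):
    """Number of examples so that, with probability >= 1 - delta, error <= epsilon."""
    return math.ceil((4 / epsilon) * math.log(4 / delta))

print(pac_examples(epsilon=0.1, delta=0.05))   # about 176 examples under this assumption
```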
Interpretations of Noise:
Imprecision in recording the input attributes. (Shifts the data points in the input space.)
Teacher noise: Errors in labeling the data points. (Positive labeled as negative and negative as positive.)
Hidden or latent attributes: Additional attributes that are not taken into account but affect the label of an instance. (Unobservable.)
Noise…
When there is noise, there is no simple boundary between the positive and negative instances.
Principle of Occam’s razor: Simpler explanations are more plausible and any
unnecessary complexity should be shaved off.
5. Learning Multiple Classes
Two – class Problem: In the example of learning a family car, we have positive examples
belonging to the class family car and the negative examples belonging to all other cars.
General case: We have K classes, denoted as Ci, i = 1, . . . , K, and an input instance belongs to one and exactly one of them. The training set is now of the form:
X = {x^t, r^t}_{t=1}^N
where r has K dimensions:
r_i^t = 1 if x^t ∈ Ci, and r_i^t = 0 if x^t ∈ Cj, j ≠ i.
Learn the boundary separating the instances of one class from the instances of all other
classes.
View a K-class classification problem as K two-class problems.
The training examples belonging to Ci are the positive instances of hypothesis hi and the
examples of all other classes are the negative instances of hi.
The total empirical error takes a sum over the predictions for all classes over all instances.
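A minimal sketch of this one-vs-rest view on toy data (the class positions and the choice of logistic regression as the base two-class learner are assumptions for illustration):

```python
# A minimal sketch (toy data): a K-class problem viewed as K two-class
# problems; h_i is trained with class C_i as positive and all others negative.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [4.8, 5.2], [9.0, 1.0], [8.7, 1.2]])
r = np.array([0, 0, 1, 1, 2, 2])        # K = 3 class labels
K = 3

hypotheses = []
for i in range(K):
    h_i = LogisticRegression().fit(X, (r == i).astype(int))   # one-vs-rest labels
    hypotheses.append(h_i)

x_new = np.array([[5.1, 4.9]])
scores = [h.predict_proba(x_new)[0, 1] for h in hypotheses]   # confidence of each h_i
print(int(np.argmax(scores)))                                  # predicted class (1 here)
```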
5. Learning Multiple Classes…
For a given x:
Ideally only one of hi(x), i = 1, . . . , K is 1 and we can choose a class.
But when none, or two or more, of the hi(x) are 1, we cannot choose a class; this is the case of doubt, and the classifier rejects such cases.
In our example of learning a family car, we used only one hypothesis and only
modeled the positive examples.
Any negative example outside is not a family car.
Sometimes we may prefer to build two hypotheses, one for the positive and the
other for the negative instances.
This assumes a structure also for the negative instances that can be covered by
another hypothesis.
6. Regression…
If the linear model is too simple, it is too constrained and incurs a large approximation
error.
When the order of the polynomial is increased, the error on the training data decreases.
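A minimal sketch on toy one-dimensional data (the underlying sine function and the noise level are made up) showing the training error dropping as the polynomial order grows:

```python
# A minimal sketch (toy data): as the polynomial order increases,
# the error on the training data decreases (eventually overfitting the noise).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)   # noisy underlying function

for order in (1, 3, 9):
    coeffs = np.polyfit(x, y, deg=order)          # least-squares polynomial fit
    y_hat = np.polyval(coeffs, x)
    train_error = np.mean((y - y_hat) ** 2)
    print(f"order {order}: training MSE = {train_error:.4f}")
```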
7. Model Selection and Generalization
Example: Consider learning a Boolean function. All inputs and the output are binary
(0 or 1).
There are 2^d possible ways to write d binary values; therefore, with d inputs, the training set has at most 2^d examples. There are 2^(2^d) possible Boolean functions of d inputs.
Interpret Learning: Each distinct training example removes half the hypotheses,
namely, those whose guesses are wrong. For example, let us say we have x1 = 0, x2 = 1
and the output is 0; this removes h5, h6, h7, h8, h13, h14, h15, h16.
• Start with all possible hypotheses.
• See more training examples.
• Remove those hypotheses that are not consistent with the training data.
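A minimal sketch of this elimination process for d = 2 (the enumeration of the 16 Boolean functions and the helper name are illustrative):

```python
# A minimal sketch for d = 2 inputs: there are 2^(2^2) = 16 Boolean functions.
# Each training example removes the hypotheses whose guess on it is wrong.
from itertools import product

inputs = list(product([0, 1], repeat=2))            # the 4 possible inputs (x1, x2)
hypotheses = list(product([0, 1], repeat=4))        # each h = its outputs on the 4 inputs -> 16

def consistent(h, example, label):
    return h[inputs.index(example)] == label

# One training example, x1 = 0, x2 = 1 with output 0, halves the hypothesis space.
remaining = [h for h in hypotheses if consistent(h, (0, 1), 0)]
print(len(hypotheses), "->", len(remaining))        # 16 -> 8
```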
Ill-posed problem The data by itself is not sufficient to find a unique solution.
If the training set we are given contains only a small subset of all possible instances, the solution is not unique. (We know the correct output only for a small percentage of the cases.)
Model Selection and Generalization…
Inductive Bias The set of assumptions we make to have learning possible.
When learning is ill-posed, we should make some extra assumptions to have a unique
solution with the data we have.
Example: Assume a hypothesis class H.
• In learning the class of family car, there are many ways of separating the positive
examples from the negative examples.
• Assuming the shape of a rectangle is one inductive bias, and then the rectangle with the
largest margin is another inductive bias.
• Each hypothesis class has a certain capacity and can learn only certain functions.
• The class of functions that can be learned can be extended by using a hypothesis class
with larger capacity. (i.e., More complex hypothesis)
• The hypothesis class that is a union of two rectangles has higher capacity, but its
hypotheses are more complex.
Model Selection and Generalization…
Model Selection How to choose the right bias. (Choosing among possible hypothesis classes H.)
NOTE: The aim of machine learning is the prediction of new cases, rarely the replication of the training data.
We want to be able to generate the right output for an input instance outside the training set, one for which the correct output is not given in the training set.
Generalization How well a model trained on the training set predicts the right
output for new instances.
Underfitting If H is less complex than the underlying function, we have underfitting.
As we increase the complexity, the training error decreases.
Overfitting If there is noise, an overcomplex hypothesis may learn not only the
underlying function but also the noise in the data and may make a bad fit.
Having more training data helps but only up to a certain point.
Model Selection and Generalization…
Triple trade-off In learning algorithms, that are trained from example data, there is a
trade-off between 3 factors:
• The complexity of the hypothesis we fit to data (the capacity of the hypothesis class).
• The amount of training data.
• The generalization error on new examples.
As the amount of training data increases, the generalization error decreases.
As the complexity of the model class H increases, the generalization error decreases first and then
starts to increase.
Divide the training set into two parts: One part for training (i.e., to fit a hypothesis).
Validation set Used to test the generalization ability. (Choose the best model.)
Cross-validation Assuming large enough training and validation sets, the hypothesis that is the most accurate on the validation set is the best one (the one that has the best inductive bias).
Test set Also called the publication set, containing examples not used in training or
validation.
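A minimal sketch of this procedure on toy data (the split sizes and candidate polynomial orders are arbitrary choices): fit on the training part, pick the order with the lowest validation error, and keep the test part untouched:

```python
# A minimal sketch (toy data): train/validation/test split for model selection.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 60)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

train, valid, test = np.split(rng.permutation(x.size), [30, 45])   # 30 / 15 / 15 split

best_order, best_err = None, np.inf
for order in range(1, 8):
    coeffs = np.polyfit(x[train], y[train], deg=order)             # fit on training part
    val_err = np.mean((y[valid] - np.polyval(coeffs, x[valid])) ** 2)
    if val_err < best_err:                                          # keep the best model
        best_order, best_err = order, val_err

print("chosen polynomial order:", best_order)                      # test set still unused
```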
Model Selection and Generalization…
Example: Taking a course
Training Set: The example problems that the instructor solves in class while teaching a
subject.
Validation Set: Exam questions.
Test set: The problems we solve in our later, professional life.
• We cannot keep on using the same training/validation split.
• Because after having been used once, the validation set effectively becomes part of the training data.
• This will be like an instructor who uses the same exam questions every year;
• A smart student will figure out not to bother with the lectures and will only memorize the
answers to those questions.
Always remember that the training data we use is a random sample, that is, for the same application, if
we collect data once more, we will get a slightly different dataset. Slight differences in error will allow
us to estimate how large differences should be to be considered significant and not due to chance.
Dimensions of a Supervised Machine Learning Algorithm
Sample Independent and identically distributed (iid).
Sample: X = {x^t, r^t}_{t=1}^N
The ordering is not important and all instances are drawn from the same joint
distribution p(x, r).
t indexes one of the N instances.
xt is the arbitrary dimensional input.
rt is the associated desired output.
rt is 0/1 for two-class learning, is a K-dimensional binary vector (where exactly
one of the dimensions is 1 and all others 0) for (K > 2)-class classification, and is
a real value in regression.
Aim To build a good and useful approximation to rt using the model g(xt |θ).
Dimensions of a Supervised Machine Learning Algorithm…
Three decisions that must be made:
1. Model we use in learning, denoted as g(x|θ).
• where g(·) is the model, x is the input, and θ are the parameters.
• g(·) defines the hypothesis class H, and a particular value of θ instantiates one hypothesis h
∈ H.
• For example,
• In class learning, we have taken a rectangle as our model whose four coordinates make up
θ;
• In linear regression, the model is the linear function of the input whose slope and intercept
are the parameters learned from the data.
• The model (inductive bias), or H, is fixed by the machine learning system designer based on
his or her knowledge of the application.
• The hypothesis h is chosen (parameters are tuned) by a learning algorithm using the training
set, sampled from p(x, r).
Dimensions of a Supervised Machine Learning Algorithm…
2. Loss function, L(·)
• To compute the difference between the desired output, rt , and our approximation to it,
g(xt |θ), given the current value of the parameters, θ.
• The approximation error, or loss, is the sum of losses over the individual instances: E(θ|X) = Σt L(rt, g(xt|θ)).
• In class learning, where outputs are 0/1, L(·) checks for equality or not; In regression,
because the output is a numeric value, we have ordering information for distance and
one possibility is to use the square of the difference.
Dimensions of a Supervised Machine Learning Algorithm…
3. Optimization procedure to find θ∗ that minimizes the total error: θ∗ = arg minθ E(θ|X).
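A minimal sketch tying the three decisions together on made-up data: the model g(x|θ) is a line with θ = (w, b), the loss is the squared error, and plain gradient descent is the optimization procedure (all choices here are illustrative assumptions):

```python
# A minimal sketch (toy data) of the three decisions:
# model g(x|theta) = w*x + b, squared-error loss, and gradient descent
# as the optimization procedure finding theta* that minimizes the total error.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 50)
r = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)   # desired outputs

w, b = 0.0, 0.0                                           # theta = (w, b)
lr = 0.1
for _ in range(2000):
    g = w * x + b                                         # model predictions g(x|theta)
    error = g - r                                         # per-instance error
    w -= lr * np.mean(error * x)                          # gradient step on mean squared loss
    b -= lr * np.mean(error)

print(f"learned theta: w = {w:.2f}, b = {b:.2f}")          # close to w = 3, b = 1
```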