
Machine Learning

Outline
 Unit 1  Introduction to ML, Supervised Learning.
 Unit 2  Bayesian Decision Theory, Decision Trees.
 Unit 3  Clustering, Hidden Markov Models.
 Unit 4  Reinforcement Learning, Combining Multiple Learners.
 Unit 5  Design and Analysis of Machine Learning Experiments.


Unit - 1
Introduction to ML
What Is Machine Learning?
Programming computers to optimize a performance criterion using example data or past experience.

In general terms  Once the machine (computer) has been trained on past example data with appropriate algorithms, it adapts to changes automatically, even for complex problems.
• A model is defined up to some parameters.
• Learning is the execution of a computer program to optimize the parameters of the model using training data or past experience.
• The model may be predictive (to make predictions in the future) or descriptive (to gain knowledge from data).
What Is Machine Learning?

Features:

 The ability of a computer to learn on its own, without being explicitly programmed.
 The application of machine learning methods to large databases is called data mining.
 A part of Artificial Intelligence.
 Uses the theory of statistics in building mathematical models.
 The core task is making inference from a sample.
What Is Machine Learning?

Importance of Machine Learning


 Automation (no human intervention needed).
 Adapts to changes easily.
 Handles a wide variety of data, even in uncertain environments.
 Identifies useful patterns and regularities.
 Constructs a good and useful approximation.
 Provides results with high predictive accuracy.
 Has a wide area of applications.
Examples of Machine Learning Applications

Learning Associations

Classification

Regression

Supervised Learning

Unsupervised Learning

Reinforcement Learning
Examples of Machine Learning Applications
1. Learning Associations
 Association rule  Finds interesting associations and relationships among large sets
of data items.
 This rule shows how frequently an itemset occurs in a transaction.
 Example - Market Basket Analysis.
* Finding associations between products bought by customers helps to distinguish among customers and to target them for cross-selling.

Estimate P(Y | X, D),
where P is the probability, Y is the product (or set of products) whose purchase we want to predict, X is the condition (for example, products already bought), and D is the set of customer attributes (demographic attributes such as gender, age, marital status, etc.). A minimal estimation sketch on toy data follows below.
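A minimal sketch of this estimation on a toy, hypothetical list of transactions (the items and counts are made up purely for illustration):

# Estimate P(Y | X) = #(transactions containing X and Y) / #(transactions containing X)
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

def conditional_probability(transactions, x, y):
    with_x = [t for t in transactions if x in t]
    if not with_x:
        return 0.0
    return sum(1 for t in with_x if y in t) / len(with_x)

# "Customers who buy bread also buy butter with probability 0.75"
print(conditional_probability(transactions, "bread", "butter"))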
2. Classification
 Categorizes a set of data into classes.
 Uses a set of features or parameters to characterize each object.
 Supervised learning concept (Labelled data).
 Example – A bank assesses the financial capacity of a customer
(income, savings, collaterals, profession, age, past financial history).
 The risk of a new customer is predicted from these inputs and the bank's past transaction records.
 The customers are classified into two classes: high-risk and low-risk.
 After training with the past data, a classification rule that is learned may be of the form,
for suitable values of θ1 and θ2:

IF income> θ1 AND savings> θ2 THEN low-risk ELSE high-risk


Classification continued….
• Discriminant  A function that separates the examples of different classes.
• Calculate a probability P(Y|X), where X = customer attributes and Y = 0/1 (0 → low-risk, 1 → high-risk).
• Consider classification as learning an association from X to Y.
• Then for a given X = x, if we have P(Y = 1|X = x) = 0.8, we say that the customer has an 80 percent probability of being high-risk, or equivalently a 20 percent probability of being low-risk.
• It can then be decided whether to accept or refuse the loan depending on the possible gain and loss. (A minimal sketch of such a threshold rule follows below.)
[Figure: customers plotted in the income–savings plane; the thresholds θ1 (income) and θ2 (savings) define the discriminant that separates low-risk (+) from high-risk (−) customers.]
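A minimal sketch of the IF-THEN rule above as a Python function; the threshold values θ1 and θ2 are hypothetical placeholders, not learned values:

THETA1 = 30_000   # hypothetical income threshold (theta1)
THETA2 = 10_000   # hypothetical savings threshold (theta2)

def credit_risk(income, savings, theta1=THETA1, theta2=THETA2):
    # Return 0 for low-risk, 1 for high-risk (Y = 0/1 as in the slides)
    return 0 if (income > theta1 and savings > theta2) else 1

print(credit_risk(income=45_000, savings=20_000))  # 0 -> low-risk
print(credit_risk(income=25_000, savings=5_000))   # 1 -> high-risk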
Classification continued….

Pattern recognition and its applications in ML:


Optical Character Recognition (OCR)  Recognizing character codes from their images; learning the different sequences of words and the syntax and semantics of languages using ML algorithms. (Eg: Learning different styles of handwritten characters.)
Face Recognition  The input is an image, the classes are people to be recognized, and the learning
program should learn to associate the face images to identities. (Eg: Recognizing different identities of
3D face image).
Medical Diagnosis  The inputs are the relevant information about the patient and the classes are
the illnesses. The inputs contain the patient’s age, gender, past medical history, and current symptoms.
 Speech Recognition  The input is acoustic (sound) and the classes are words that can be uttered.
The association to be learned is from an acoustic signal to a word of some language.
 Biometrics  Recognition or authentication of people using their physiological and/or behavioral characteristics, which requires an integration of inputs from different modalities. (Eg: iris, fingerprints, face.)
Classification continued….

• Prediction  Once we have a rule that fits the past data, if the future is similar to the past, we can make correct predictions for novel instances.
• Knowledge Extraction  Learning a rule from data also allows knowledge extraction. The rule is a simple model that explains the properties of the underlying data.
• Compression  Learning also performs compression: by fitting a rule to the data, we obtain an explanation that is simpler than the data itself, requiring less memory to store and less computation to process.
• Outlier Detection  Finding the instances that do not obey the rule and are exceptions, which may imply anomalies requiring attention (for example, fraud).
3. Regression
 A Supervised learning concept.
 Helps in finding the correlation between variables.
 Used to predict a continuous value (output).
 Example – Predicting the price of a used car.
 Inputs are the car attributes—brand, year, engine capacity, mileage. (independent variables)
 The output is the price of the car. (dependent variable)

Estimating a regression problem (a small fitting sketch follows below):
Let X denote the car attributes and Y the price of the car. By surveying past transactions, we can collect training data, and the machine learning program fits a function to this data to learn Y as a function of X:
y = wx + w0,
for suitable values of w and w0.
[Figure: Y (price) plotted against X (mileage), with the fitted line y = wx + w0.]
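A minimal least-squares sketch of the fit y = wx + w0; the mileage and price values are hypothetical, chosen only to illustrate the fitting step:

import numpy as np

x = np.array([10, 30, 50, 70, 90], dtype=float)   # mileage (thousands of km, made up)
y = np.array([9.0, 8.0, 6.5, 5.5, 4.0])           # price (made-up units)

w, w0 = np.polyfit(x, y, deg=1)                   # least-squares line
print(f"y = {w:.3f} * x + {w0:.3f}")
print("predicted price at x = 60:", w * 60 + w0)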
Supervised Learning
 The training set contains predefined sample data (labelled data).
 Consider there is an input, X, an output, Y, and the task is to learn the mapping from the
input to the output.
The approach in machine learning is that we assume a model defined up to a set of parameters:
y = g(x|θ)
 where g(·) is the model and θ are its parameters.
 Y is a number in regression. Y is a class code (e.g., 0/1) in the case of classification.
 Another example of regression is driving an autonomous car.
 Inputs in such a case are provided by sensors on the car—for example, a video camera,
GPS, etc.
 Training data can be collected by monitoring and recording the actions of a human driver.
 The output is the angle by which the steering wheel should be turned at each time, to
advance without hitting obstacles and deviating from the route.
4. Unsupervised Learning
Contains unlabelled input data.
The aim is to find the regularities in the input.
Density Estimation  There is a structure to the input space such that certain patterns
occur more often than others, and we want to see what generally happens and what does
not.
Clustering  A method of density estimation which aims to find clusters or groupings of
input.
 Example 1: Customer segmentation --> From past transactions and using demographic
info (customer attributes), finding what type of customers frequently occur. Deciding
services and products to specific and different groups.
 Example 2: Image Compression --> The input instances are image pixels represented as
RGB values. A clustering program groups pixels with similar colors in the same group, and
such groups correspond to the colors occurring frequently in the image.
 Example 3: Document Clustering --> The aim is to group similar documents. For
example, news reports can be subdivided as those related to politics, sports, fashion, arts,
and so on. Commonly, a document is represented as a “bag of words”.
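A minimal clustering sketch in the spirit of the customer-segmentation example, using k-means (one common clustering algorithm; the slides do not name a specific one). The two features and all the numbers are hypothetical:

import numpy as np

rng = np.random.default_rng(0)
# Two made-up customer groups described by (age, annual spend)
X = np.vstack([rng.normal([25, 20], 3, (50, 2)),
               rng.normal([55, 80], 3, (50, 2))])

def kmeans(X, k, n_iter=50):
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest center
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(axis=2), axis=1)
        # move each center to the mean of the points assigned to it
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

labels, centers = kmeans(X, k=2)
print("cluster centers:\n", centers)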
5. Reinforcement Learning

• Policy  Finding the sequence of correct actions to reach the goal.


• The ML program should be able to assess the goodness of policies and learn
from past good action sequences to be able to generate a policy.
• Example 1: Game Playing --> A single move is not enough to reach the goal; it is the sequence of right moves that matters. A game like chess has a small number of rules, but it is very complex because of the large number of possible moves at each state and the large number of moves that a game contains.
• Example 2: A robot navigating in an environment --> At any time, the robot can
move in one of a number of directions. After a number of trial runs, it should learn
the correct sequence of actions to reach to the goal state from an initial state,
without hitting any of the obstacles.
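A compact tabular Q-learning sketch in the spirit of the robot-navigation example (Q-learning is one standard reinforcement learning algorithm, not named in the slides). The corridor environment, rewards and hyperparameters are all hypothetical:

import random

N_STATES, GOAL = 5, 4                    # positions 0..4 in a corridor, 4 is the goal
ACTIONS = [-1, +1]                       # move left / move right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.1

for episode in range(200):               # trial runs
    s = 0
    while s != GOAL:
        a = random.choice(ACTIONS) if random.random() < epsilon \
            else max(ACTIONS, key=lambda act: Q[(s, act)])
        s_next = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s_next == GOAL else 0.0
        # Q-learning update: learn the value of each (state, action) pair
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in ACTIONS) - Q[(s, a)])
        s = s_next

# the learned greedy action in each non-goal state (should be +1, i.e., move right)
print([max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(GOAL)])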
Exercise
1. Let us say you are given the task of building an automated taxi. Define the
constraints. What are the inputs? What is the output? How can you communicate
with the passenger? Do you need to communicate with the other automated taxis,
that is, do you need a “language”?
2. In basket analysis, we want to find the dependence between two items X and Y.
Given a database of customer transactions, how can you find these dependencies?
How would you generalize this to more than two items?
3. How can you predict the next command to be typed by the user? Or the next page
to be downloaded over the Web? When would such a prediction be useful? When
would it be annoying?
Exercise
4. In your everyday newspaper, find five sample news reports for each category of
politics, sports, and the arts. Go over these reports and find words that are used
frequently for each category, which may help us discriminate between different
categories. For example, a news report on politics is likely to include words such as
“government,” “recession,” “congress,” and so forth, whereas a news report on the
arts may include “album,” “canvas,” or “theater.” There are also words such as
“goal” that are ambiguous.
5. Take a word, for example, “machine.” Write it ten times. Also ask a friend to
write it ten times. Analyzing these twenty images, try to find features, types of
strokes, curvatures, loops, how you make the dots, and so on, that discriminate your
handwriting from your friend’s.
6. In estimating the price of a used car, rather than estimating the absolute price it
makes more sense to estimate the percent depreciation over the original price. Why?
Unit -1
Supervised Learning
1. Learning a Class from Examples
Example: Learn the class, C, of a “family car.” (Survey from people)
• Positive examples  The cars that people look at and label as family cars (cars they believe are family cars).
• Negative examples  Other cars that they do not believe are family cars.
 Class Learning  Finding a description that is shared by all positive examples and none
of the negative examples.
• Helps to make a prediction and knowledge extraction from the description of learned
data.
Aim : To understand the expectations of people and to make conclusion.
 Input Representation  Features that separate family cars from others: Price and
Engine power (input attributes).
 Additional attributes: Seating capacity and Color.
Learning a class from examples…

[Figure: training set for the class of a "family car"; each data point corresponds to one example car, and the coordinates of the point indicate the price and engine power of that car. '+' denotes a positive example of the class (a family car), and '−' denotes a negative example (not a family car); it is another type of car.]
 Two input attributes: x1 → Price (in Rs) and x2 → Engine power (in cubic cms).
 Represent each car by two numeric values, x = [x1, x2]^T.
 Its label denotes its type:
r = 1 if x is a positive example, and r = 0 if x is a negative example.
 Each car is represented by such an ordered pair (x, r). The training set contains N such examples:
X = {x^t, r^t}, t = 1, …, N,
where t indexes the different examples in the set.
Learning a class from examples…
Example of a hypothesis class:
 The class of family car is a rectangle in the price–engine power space.
 Our training data can now be plotted in the two-dimensional (x1, x2) space, where each instance t is a data point at coordinates (x1^t, x2^t) and its type, namely positive versus negative, is given by r^t.
 After analyzing the data, the price and engine power of a family car should each lie in a certain range, for suitable values of p1, p2, e1 and e2:
(p1 ≤ price ≤ p2) AND (e1 ≤ engine power ≤ e2)
[Figure: the class C drawn as a rectangle in the price–engine power plane, bounded by p1 and p2 on the price axis and by e1 and e2 on the engine power axis.]
 Hypothesis (h) The learning algorithm then finds the particular hypothesis, h ∈ H,
to approximate C as closely as possible (from set of rectangles). [H->Hypothesis class]
Learning a class from examples…
Drawbacks to overcome:
• In this hypothesis class, the values of the parameters are not known.
• Do not know which particular h ∈ H is equal, or closest, to C.
Solution: Restrict the attention to this hypothesis class, then learning the class reduces to
the easier problem of finding the four parameters (p1, p2, e1, e2) that define h.
Aim: To find h ∈ H that is as similar as possible to C.
 Let the hypothesis h make a prediction for an instance x such that
h(x) = 1 if h classifies x as a positive example,
h(x) = 0 if h classifies x as a negative example.
In real life we do not know C(x), so we cannot evaluate how well h(x) matches C(x). We
have only a training set X, which is a small subset of the set of all possible x.
Empirical error  The proportion of training instances where predictions of h do not
match the required values given in X.
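A minimal sketch of a rectangle hypothesis h and its empirical error on a tiny, hypothetical training set of (price, engine power, label) triples:

X = [(4.5, 1.2, 0), (6.0, 1.4, 1), (7.0, 1.6, 1), (8.5, 1.8, 1), (11.0, 2.5, 0)]

def h(price, power, p1, p2, e1, e2):
    # h(x) = 1 if (p1 <= price <= p2) AND (e1 <= engine power <= e2), else 0
    return 1 if (p1 <= price <= p2 and e1 <= power <= e2) else 0

def empirical_error(X, p1, p2, e1, e2):
    # proportion of training instances where h(x^t) != r^t
    wrong = sum(h(x1, x2, p1, p2, e1, e2) != r for x1, x2, r in X)
    return wrong / len(X)

print(empirical_error(X, p1=5.5, p2=9.0, e1=1.3, e2=2.0))  # 0.0 on this toy set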
Learning a class from examples…
In our example, the hypothesis class H is the set of all possible rectangles.
Find the values of the four parameters (p1, p2, e1, e2) given the training set, to
include all the positive examples and none of the negative examples.
Different candidate hypothesis may make different generalization predictions.
 Generalization Problem Finding how well our hypothesis will correctly
classify future examples that are not part of the training set.
• Most specific Hypothesis (S)  The tightest rectangle that includes all the
positive examples and none of the negative examples.
• Most general hypothesis (G)  The largest rectangle we can draw that includes
all the positive examples and none of the negative examples.
• Version space  Any h∈ H between S and G is a valid hypothesis with no error,
said to be consistent with the training set, and such h make up the version space.
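A minimal sketch of computing the most specific hypothesis S, the tightest rectangle around the positive examples (same hypothetical toy data as above; the most general hypothesis G would instead be the largest rectangle that still excludes every negative example):

X = [(4.5, 1.2, 0), (6.0, 1.4, 1), (7.0, 1.6, 1), (8.5, 1.8, 1), (11.0, 2.5, 0)]
positives = [(x1, x2) for x1, x2, r in X if r == 1]

p1 = min(x1 for x1, _ in positives)   # smallest positive price
p2 = max(x1 for x1, _ in positives)   # largest positive price
e1 = min(x2 for _, x2 in positives)   # smallest positive engine power
e2 = max(x2 for _, x2 in positives)   # largest positive engine power

print("S =", (p1, p2, e1, e2))        # (6.0, 8.5, 1.4, 1.8)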
Learning a class from examples…

 C is the actual class and h is our induced hypothesis.
 The point where C is 1 but h is 0 is a false negative.
 The point where C is 0 but h is 1 is a false positive.
 Other points, namely true positives and true negatives, are correctly classified.
[Figure: the actual class C and the hypothesis h drawn as rectangles in the price–engine power plane; the regions covered by only one of the two rectangles contain the false positives and false negatives.]
 False negative: a data point classified as a negative example (say class 0) that is actually a positive example (belongs to class 1).
 False positive: a data point classified as a positive example (say class 1) that is actually a negative example (belongs to class 0).
 True negative: a data point classified as a negative example (class 0) that is actually a negative example (belongs to class 0).
 True positive: a data point classified as a positive example (class 1) that is actually a positive example (belongs to class 1).
Learning a class from examples…
Candidate Elimination Algorithm  Incrementally updates the S- and G-sets as it
sees training instances one by one.
 Margin  Distance between the boundary and the instances closest to it.
 Doubt  Any instance that falls in between S and G is a case of doubt, which cannot
be labelled with certainty due to lack of data. The system rejects the instance and
defers the decision to a human expert.

 For our error function to have a minimum at the h with the maximum margin, we should use an error (loss) function which not only checks whether an instance is on the correct side of the boundary but also how far away it is.
 We choose the hypothesis with the largest margin, for best separation. The shaded instances are those that define (or support) the margin; the other instances can be removed without affecting h.
2. Vapnik-Chervonenkis (VC) Dimension
VC Dimension  The maximum number of points that can be shattered by H is called the Vapnik–Chervonenkis (VC) dimension of H, denoted VC(H).
 It measures the capacity of the hypothesis class H.
 It is the cardinality (size) of the largest set of points that the classification algorithm can shatter.
 Example: Consider a dataset containing N points. These N points can be labeled in 2^N ways as positive and negative.
 If, for every such labeling, we can find a hypothesis h ∈ H that separates the positive examples from the negative, we say that H shatters the N points.
 For instance, for a dataset with 2 fixed points whose Cartesian coordinates are (0,0) and (1,0), the points can be labeled {+,+}, {+,−}, {−,+}, {−,−}, so there are 2^2 = 4 labeling schemes for the dataset.
VC Dimension…

For the points to be in general position, no combination of 3 points should lie on a straight line. [No subset of (n+1) points lies in an (n−1)-dimensional space.]
[Figure: 3 points lying on a straight line; such points are not in general position in 2D space.]
VC Dimension…

There are 2^3 = 8 possible labelings, and a straight line shatters 3 points in 2D space, provided the points are in general position.
[Figure: the 8 labelings of 3 points, each separable by a straight line.]
VC Dimension…

There are 2^4 = 16 possible labelings of 4 points, and no straight line can separate all of them (for example, the labeling in which diagonally opposite points share a label). So 4 points in 2D space are not shattered by a straight line, and the VC dimension of a straight line in 2D is 3.
VC Dimension…
Example:
• Let's consider hyperplanes (i.e. lines in 2D).
• It is easy to find a set of three points that can be classified correctly no
matter how they are labeled:

For all 2^3 = 8 possible labelings, we can find a hyperplane that separates them perfectly (a brute-force check is sketched below).
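A brute-force check of this claim: for each of the 2^3 = 8 labelings of three points in general position, a simple perceptron (which converges whenever a linear separator exists) finds a separating line. The points chosen here are an arbitrary non-collinear triple:

import itertools
import numpy as np

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # not collinear

def linearly_separable(X, y, max_epochs=1000):
    w = np.zeros(3)                                 # 2 weights + bias
    Xb = np.hstack([X, np.ones((len(X), 1))])       # append a constant 1 for the bias
    for _ in range(max_epochs):
        updated = False
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) <= 0:                  # misclassified (labels are +1/-1)
                w += yi * xi
                updated = True
        if not updated:
            return True                             # every point is on the correct side
    return False

labelings = itertools.product([-1, +1], repeat=3)
print(all(linearly_separable(points, np.array(y)) for y in labelings))  # True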
VC Dimension…

Decision boundary → Type of classifier
 Straight lines → Linear classifiers (e.g., linear regression)
 Circles → Non-linear classifiers (e.g., Support Vector Machines)
 Rectangles → Decision Trees

Points to Remember
 For good generalization, VC Dimension of a hypothesis should be finite.
 VC Dimension of a Linear Classifier: (n+1) {Points should be in general position}
 VC Dimension of a Non-Linear Classifier: Very difficult to compute.
3. Probably Approximately Correct (PAC) Learning
Need for PAC learning:
 To find how many training examples are needed when using a given hypothesis (for example, the tightest rectangle S).
 We want the hypothesis to be approximately correct, namely, that its error probability be bounded by some value ε.
 We also want to know that our hypothesis will be correct most of the time (if not always); so we want it to be probably correct as well (with a probability 1 − δ that we can specify).
PAC Learning  Given a class C and examples drawn from some unknown but fixed probability distribution p(x), we want to find the number of examples N such that, with probability at least 1 − δ, the hypothesis h has error at most ε, for arbitrary δ ≤ 1/2 and ε > 0 (C Δ h is the region of difference between C and h):
P{C Δ h ≤ ε} ≥ 1 − δ

 Goal of a PAC learner  To build, with high probability (traditionally denoted 1 − δ), a hypothesis that is approximately correct (error rate less than ε).
PAC Learning...
 S is the tightest possible rectangle; the error region between C and h = S is the sum of four rectangular strips.
 Make sure that the probability of a positive example falling in this region (and causing an error) is at most ε.
 If the probability of each strip is upper-bounded by ε/4, the total error is at most 4(ε/4) = ε.
Note: The overlaps in the corners are counted twice, so the total actual error in this case is less than 4(ε/4).
 The probability that a randomly drawn example misses one strip is 1 − ε/4.
 The probability that all N independent draws miss the strip is (1 − ε/4)^N.
 The probability that all N independent draws miss any of the four strips is at most 4(1 − ε/4)^N, which we would like to be at most δ.
 Dividing both sides by 4, using (1 − ε/4)^N ≤ e^(−Nε/4), taking the (natural) log and rearranging terms, we have N ≥ (4/ε) log(4/δ). (A quick numeric check follows below.)
Summary: confidence probability at least 1 − δ; error probability at most ε; arbitrarily large confidence by decreasing δ; arbitrarily small error by decreasing ε.
[Figure: the tightest rectangle S inside the class C; the difference C Δ S is covered by four rectangular strips.]
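A quick numeric check of the bound N ≥ (4/ε) log(4/δ), using hypothetical values of ε and δ:

import math

def pac_sample_size(eps, delta):
    # N >= (4/eps) * ln(4/delta) guarantees error <= eps with probability >= 1 - delta
    return math.ceil((4.0 / eps) * math.log(4.0 / delta))

print(pac_sample_size(eps=0.1, delta=0.05))   # 176
print(pac_sample_size(eps=0.01, delta=0.05))  # 1753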
4. Noise
 Noise is any unwanted anomaly in the data.
 Due to noise, the class may be more difficult to learn.
 Zero error may be infeasible with a simple hypothesis class.

Interpretations of noise:
 Imprecision in recording the input attributes, which shifts the data points in the input space.
 Teacher noise  Errors in labeling the data points (positives labeled as negative and negatives as positive).
 Hidden or latent attributes  Additional attributes that are not taken into account but affect the label of an instance (unobservable).
Noise…
 When there is noise, there is not a simple boundary between the positive and negative instances.
 Zero misclassification error may not be possible with a simple hypothesis.
 A rectangle is a simple hypothesis with four parameters defining the corners.
 An arbitrary closed form can be drawn by piecewise functions with a larger number of control points.
 With a complex model, one can make a perfect fit to the data and attain zero error.
Noise…
 Using the simple rectangle (unless its training error is much bigger) makes more sense for the following reasons:
• It is a simple model to use. It is easy to check whether a point is inside or outside a rectangle and thus whether it is classified as a positive or a negative instance.
• It is a simple model to train and has fewer parameters. It is easier to find the corner values of a rectangle than the control points of an arbitrary shape. A simple model has less variance and more bias; finding the optimal model corresponds to minimizing both the bias and the variance.
• It is a simple model to explain. A rectangle simply corresponds to defining intervals on the two attributes. By learning a simple model, we can extract information from the raw data given in the training set.
• A simple model like the rectangle will be a better discriminator than the wiggly shape; a simple model generalizes better than a complex model.

Principle of Occam’s razor: Simpler explanations are more plausible and any
unnecessary complexity should be shaved off.
5. Learning Multiple Classes
 Two – class Problem: In the example of learning a family car, we have positive examples
belonging to the class family car and the negative examples belonging to all other cars.
 General case: We have K classes, denoted C_i, i = 1, …, K, and an input instance belongs to one and exactly one of them. The training set is now of the form
X = {x^t, r^t}, t = 1, …, N,
where the label r^t has K dimensions:
r_i^t = 1 if x^t ∈ C_i, and r_i^t = 0 if x^t ∈ C_j, j ≠ i.
 Learn the boundary separating the instances of one class from the instances of all other
classes.
 View a K-class classification problem as K two-class problems.
 The training examples belonging to Ci are the positive instances of hypothesis hi and the
examples of all other classes are the negative instances of hi.
 The total empirical error takes a sum over the predictions for all classes over all instances.
5. Learning Multiple Classes…
For a given x:
 Ideally only one of the h_i(x), i = 1, …, K, is 1 and we can choose a class.
 But when none, or two or more, of the h_i(x) are 1, we cannot choose a class; this is a case of doubt and the classifier rejects such cases. (A one-vs-rest sketch follows at the end of this slide.)
 In our example of learning a family car, we used only one hypothesis and only
modeled the positive examples.
 Any negative example outside is not a family car.
 Sometimes we may prefer to build two hypotheses, one for the positive and the
other for the negative instances.
 This assumes a structure also for the negative instances that can be covered by
another hypothesis.
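A minimal one-vs-rest sketch of viewing a K-class problem as K two-class problems; train_binary is a hypothetical stand-in for any two-class learner that returns a function h(x) ∈ {0, 1}:

def one_vs_rest_train(X, r, K, train_binary):
    hypotheses = []
    for i in range(K):
        # examples of class i are positive, examples of all other classes are negative
        labels = [1 if ri == i else 0 for ri in r]
        hypotheses.append(train_binary(X, labels))
    return hypotheses

def one_vs_rest_predict(hypotheses, x):
    votes = [h(x) for h in hypotheses]
    if votes.count(1) != 1:
        return None          # no class, or more than one: a case of doubt -> reject
    return votes.index(1)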
5. Learning Multiple Classes…

Example: Separating family cars from sports cars.
 Each class has a structure of its own.
 Advantage: If the input is a luxury sedan, we can have both hypotheses decide negative and reject the input.
 There are three classes: family car, sports car, and luxury sedan.
 Three hypotheses are induced, each one covering the instances of one class and leaving outside the instances of the other two classes. '?' marks the reject regions where no class, or more than one class, is chosen.
5. Learning Multiple Classes…
Classes with similar distribution:
 If in a dataset, we expect to have all classes with similar distribution— shapes in
the input space—then the same hypothesis class can be used for all classes.
 Example: In a handwritten digit recognition dataset, we would expect all digits
to have similar distributions.
 Classes with different distribution:
 Example: In a medical diagnosis dataset, for example, where we have two
classes for sick and healthy people, we may have completely different
distributions for the two classes;
• There may be multiple ways for a person to be sick, reflected differently in the
inputs.
• All healthy people are alike; each sick person is sick in his or her own way.
6. Regression

 A supervised learning technique.
 A statistical method for modeling the relationship between a dependent (target) variable and one or more independent (predictor) variables.
 Regression analysis helps us to understand how the value of the dependent variable changes with an independent variable when the other independent variables are held fixed.
 It predicts continuous/real values such as temperature, age, salary, price, etc.
 Example: Predicting the price of an old house, the price of a used car, etc.
 Regression fits a line or curve through the data points on the target–predictor graph such that the vertical distance between the data points and the regression line is minimized.
 The distance between the data points and the line tells whether the model has captured a strong relationship or not.
Regression…
Terminologies Related to the Regression Analysis:
•Dependent Variable: The main factor in Regression analysis which we want to predict or understand is
called the dependent variable. It is also called target variable.
•Independent Variable: The factors which affect the dependent variables or which are used to predict the
values of the dependent variables are called independent variable, also called as a predictor.
•Outliers: An outlier is an observation with either a very low or a very high value in comparison to the other observed values. An outlier may distort the result, so it should be handled with care. (Example: in the scores 25, 29, 3, 32, 85, 33, 27, 28, both 3 and 85 are outliers.)
•Multicollinearity: If the independent variables are highly correlated with each other, the condition is called multicollinearity. It should not be present in the dataset, because it creates problems when ranking the most influential variables. (Examples: height and weight; 1-month, 6-month and 1-year returns.)
•Underfitting and Overfitting: If our algorithm works well on the training dataset but not on the test dataset, the problem is called overfitting. If our algorithm does not perform well even on the training dataset, the problem is called underfitting.
Regression…
Linear Regression Shows the linear relationship between the independent
variable (X-axis) and the dependent variable (Y-axis).
Simple linear regression  There is only one input variable (x).
Multiple linear regression  There is more than one input variable.
Example: Predicting the salary of an employee on the basis of the year of experience.

 Mathematical equation for linear regression: Y = aX + b
 Here, Y = dependent variable (target variable),
 X = independent variable (predictor variable),
 a and b are the linear coefficients.
Regression…
Polynomial Regression  A type of regression which models the non-linear dataset using a linear
model.
 It is similar to multiple linear regression, but it fits a non-linear curve between the value of x and
corresponding conditional values of y.
 Suppose there is a dataset whose data points follow a non-linear pattern; in such a case, linear regression will not fit those data points well. To cover such data points, we need polynomial regression.
 In polynomial regression, the original features are transformed into polynomial features of a given degree and then modeled using a linear model, which means the data points are best fitted using a polynomial curve.

 The equation for polynomial regression is also derived from the linear regression equation: the linear regression equation Y = b0 + b1x is transformed into the polynomial regression equation Y = b0 + b1x + b2x^2 + b3x^3 + … + bnx^n.
 Here Y is the predicted/target output and b0, b1, …, bn are the regression coefficients; x is the independent/input variable.
 Polynomial: a single variable appears with different degrees.
 Multiple linear: multiple variables, each with the same (first) degree.
(A small fitting sketch follows below.)
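A small fitting sketch: building polynomial features and fitting them with least squares (the data are synthetic, generated only to illustrate the fit):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 20)
y = 1.0 + 0.5 * x - 0.8 * x**2 + 0.1 * x**3 + rng.normal(0, 0.2, x.size)

coeffs = np.polyfit(x, y, deg=3)        # returns b_n, ..., b_1, b_0 (highest degree first)
y_hat = np.polyval(coeffs, x)           # predictions of the fitted polynomial

print("coefficients:", np.round(coeffs, 2))
print("training MSE:", np.mean((y - y_hat) ** 2))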
Regression
Regression  Learning a numeric function and the output generated is
a numeric value.
 In ML, the function is not known but we have a training set of examples
drawn from it: (rt ∈ Ʀ) X ={xt , rt}Nt=1
 Interpolation: If there is a noise, the task is interpolation.
 Find the function f (x) that passes through these points such that we
have, rt = f(xt)
Extrapolation: If x is outside of the range of xt in the training set.
(Polynomial interpolation).
 Time-series prediction If we have data up to the present then we
want to predict the value for the future.
Regression…
Regression  There is noise added to the output of the unknown
function rt = f(xt) + ε
• where f (x) ∈ Ʀ is the unknown function and ε is random noise. (extra
hidden variables that we cannot observe) where Zt denote those hidden
variables. rt = f*(xt, zt)
 Approximate the output by our model g(x).
 Calculate the empirical error.
 r and g(x) are numeric quantities (∈ Ʀ) there is an ordering defined on
their values.
 Define a distance between values. The square of the difference is one
error (loss) function that can be used; another is the absolute value of the
difference.
Regression…
Aim: To find g(·) that minimizes the empirical error.
 Assume a hypothesis class for g(·) with a small set of parameters.
 Assume that g(x) is linear: A single input linear model --> g(x) = w1x + w0
• where w1 and w0 are the parameters to learn from data.

 If the linear model is too simple, it is too constrained and incurs a large approximation
error.
 When the order of the polynomial is increased, the error on the training data decreases.
7. Model Selection and Generalization
Example: Consider learning a Boolean function. All inputs and the output are binary
(0 or 1).
There are 2^d possible ways to write d binary values; therefore, with d inputs, the training set has at most 2^d examples. There are 2^(2^d) possible Boolean functions of d inputs.
 Interpreting learning: Each distinct training example removes half of the remaining hypotheses, namely, those whose guesses are wrong. For example, say we have x1 = 0, x2 = 1 and the output is 0; this removes h5, h6, h7, h8, h13, h14, h15, h16. (A small enumeration sketch follows after this list.)
• Start with all possible hypothesis.
• See more training examples.
• Remove those hypotheses that are not consistent with the training data.
 ill-posed problem  The data by itself is not sufficient to find a unique solution.
 If the training set we are given contains only a small subset of all possible instances, the
solution is not unique. (The output is for only a small percentage of the cases).
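A small enumeration sketch for d = 2: all 2^(2^d) = 16 Boolean functions are listed, and each training example removes the hypotheses that disagree with it (half of them, as stated above):

import itertools

d = 2
inputs = list(itertools.product([0, 1], repeat=d))               # the 2^d = 4 possible inputs
hypotheses = [dict(zip(inputs, outs))                            # the 16 possible functions
              for outs in itertools.product([0, 1], repeat=len(inputs))]

training_data = [((0, 1), 0)]                                    # x1 = 0, x2 = 1, output 0
for x, r in training_data:
    hypotheses = [h for h in hypotheses if h[x] == r]            # keep only consistent ones

print(len(hypotheses))                                           # 8 of the 16 remain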
Model Selection and Generalization…
Inductive Bias  The set of assumptions we make to have learning possible.
 When learning is ill-posed, we should make some extra assumptions to have a unique
solution with the data we have.
 Example: Assume a hypothesis class H.
• In learning the class of family car, there are many ways of separating the positive
examples from the negative examples.
• Assuming the shape of a rectangle is one inductive bias, and then the rectangle with the
largest margin is another inductive bias.
• Each hypothesis class has a certain capacity and can learn only certain functions.
• The class of functions that can be learned can be extended by using a hypothesis class
with larger capacity. (i.e., More complex hypothesis)
• The hypothesis class that is a union of two rectangles has higher capacity, but its
hypotheses are more complex.
Model Selection and Generalization…
 Model Selection  How to choose the right bias. (Choosing between possible
hypothesis H).
 NOTE: The aim of machine learning is the prediction of new cases, not merely to replicate the training data.
 We want to generate the right output for an input instance outside the training set, one for which the correct output is not given in the training set.

 Generalization  How well a model trained on the training set predicts the right
output for new instances.
 Underfitting  If H is less complex than the function, we have underfitting.
 Increase the complexity, the training error decreases.
 Overfitting  If there is noise, an overcomplex hypothesis may learn not only the
underlying function but also the noise in the data and may make a bad fit.
 Having more training data helps but only up to a certain point.
Model Selection and Generalization…
Triple trade-off  In learning algorithms, that are trained from example data, there is a
trade-off between 3 factors:
• The complexity of the hypothesis we fit to data (the capacity of the hypothesis class).
• The amount of training data.
• The generalization error on new examples.
As the amount of training data increases, the generalization error decreases.
As the complexity of the model class H increases, the generalization error decreases first and then
starts to increase.

 Divide the training set into two parts: One part for training (i.e., to fit hypothesis).
 Validation set  Used to test the generalization ability. (Choose the best model)
 Cross – Validation  Assuming large training and validation sets, the hypothesis that is
the most accurate on the validation set is the best one (the one that has the best inductive
bias).
 Test set  Also called the publication set, containing examples not used in training or
validation.
Model Selection and Generalization…
Example: Taking a course
 Training Set: The example problems that the instructor solves in class while teaching a
subject.
 Validation Set: Exam questions.
 Test set: The problems we solve in our later, professional life.
• We cannot keep on using the same training/validation.
• Because after having been used once, the validation set effectively becomes part of
training data.
• This will be like an instructor who uses the same exam questions every year;
• A smart student will figure out not to bother with the lectures and will only memorize the
answers to those questions.

Always remember that the training data we use is a random sample, that is, for the same application, if
we collect data once more, we will get a slightly different dataset. Slight differences in error will allow
us to estimate how large differences should be to be considered significant and not due to chance.
Dimensions of a Supervised Machine Learning Algorithm
 Sample Independent and identically distributed (iid).
Sample: X ={xt , rt}Nt=1
The ordering is not important and all instances are drawn from the same joint
distribution p(x, r).
 t indexes one of the N instances.
 xt is the arbitrary dimensional input.
 rt is the associated desired output.
 rt is 0/1 for two-class learning, is a K-dimensional binary vector (where exactly
one of the dimensions is 1 and all others 0) for (K > 2)-class classification, and is
a real value in regression.
Aim  To build a good and useful approximation to rt using the model g(xt |θ).
Dimensions of a Supervised Machine Learning Algorithm…
 Three decisions that must be made:
1. Model we use in learning, denoted as g(x|θ).
• where g(·) is the model, x is the input, and θ are the parameters.
• g(·) defines the hypothesis class H, and a particular value of θ instantiates one hypothesis h
∈ H.
• For example,
• In class learning, we have taken a rectangle as our model whose four coordinates make up
θ;
• In linear regression, the model is the linear function of the input whose slope and intercept
are the parameters learned from the data.
• The model (inductive bias), or H, is fixed by the machine learning system designer based on
his or her knowledge of the application.
• The hypothesis h is chosen (parameters are tuned) by a learning algorithm using the training
set, sampled from p(x, r).
Dimensions of a Supervised Machine Learning Algorithm…
2. Loss function, L(·)
• To compute the difference between the desired output, rt , and our approximation to it,
g(xt |θ), given the current value of the parameters, θ.
• The approximation error, or loss, is the sum of the losses over the individual instances:
E(θ|X) = Σ_t L(r^t, g(x^t|θ))

• In class learning, where outputs are 0/1, L(·) checks for equality or not; in regression, because the output is a numeric value, we have ordering information for distance, and one possibility is to use the square of the difference. (A small sketch of both losses follows below.)
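A minimal sketch of these two loss functions and of summing them over a training set; the data and the model g(x) = 2x are hypothetical:

def zero_one_loss(r, g):
    return 0 if r == g else 1            # class learning: check equality or not

def squared_loss(r, g):
    return (r - g) ** 2                  # regression: square of the difference

def total_error(data, predict, loss):
    # E(theta|X) = sum over instances of L(r^t, g(x^t|theta))
    return sum(loss(r, predict(x)) for x, r in data)

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
print(total_error(data, predict=lambda x: 2 * x, loss=squared_loss))  # ~0.06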
Dimensions of a Supervised Machine Learning Algorithm…
3. Optimization procedure to find θ* that minimizes the total error:
θ* = arg min_θ E(θ|X)
• where arg min returns the argument that minimizes.


• In regression, we can solve analytically for the optimum.
• With more complex models and error functions, we may need to use more complex
optimization methods, for example, gradient-based methods or genetic algorithms.
Dimensions of a Supervised Machine Learning Algorithm…
 Conditions to be satisfied:
• First: The hypothesis class of g(·) should be large enough, that is, have enough
capacity, to include the unknown function that generated the data that is represented
in rt in a noisy form.
• Second: There should be enough training data to allow us to pinpoint the correct (or a
good enough) hypothesis from the hypothesis class.
• Third: We should have a good optimization method that finds the correct hypothesis
given the training data.
Exercises
1. Let us say our hypothesis class is a circle instead of a rectangle. What are the
parameters? How can the parameters of a circle hypothesis be calculated in such a
case? What if it is an ellipse? Why does it make more sense to use an ellipse instead
of a circle? How can you generalize your code to K > 2 classes?
2. Imagine our hypothesis is not one rectangle but a union of two (or m > 1)
rectangles. What is the advantage of such a hypothesis class? Show that any class
can be represented by such a hypothesis class with large enough m.
3. If we have a supervisor who can provide us with the label for any x, where should
we choose x to learn with fewer queries?
4. Show that the VC dimension of the triangle hypothesis class is 7 in two
dimensions. (Hint: For best separation, it is best to place the seven points
equidistant on a circle.)
5. One source of noise is error in the labels. Can you propose a method to find data
points that are highly likely to be mislabeled?
