
Week 1-4: Introduction to Machine Learning

Dr. Rajesh Kumar Tripathy
Assistant Professor, EEE
BITS Pilani, Hyderabad Campus
Week 1-3 (Lectures 1-8)

Syllabus: Introduction to machine learning; supervised, unsupervised and semi-supervised learning; classification and regression problems; linear regression; gradient descent (batch gradient descent and stochastic gradient descent); logistic regression; multiclass extension of logistic regression; performance measures for classifiers (binary class and multiclass); likelihood ratio test; Bayesian multiclass classifier with ML and MAP criteria.

Evaluation: Assignment 1 (please submit the report for Assignment 1 along with the pseudo-code).

Textbooks:

T1. Simon Haykin, “Neural Networks – A Comprehensive Foundation”, Pearson Education, 1999.
T2. H. J. Zimmermann, “Fuzzy Set Theory and its Applications”, 3rd Edition, Kluwer Academic, 1996.

Reference books/Materials:

R1: CS229 Lecture notes, Stanford University
R2: CS231 Convolutional Neural Networks for Visual Recognition, Stanford University
R3: https://gyan.iitg.ernet.in/handle/123456789/833
R4: https://www.sciencedirect.com/science/article/pii/S0925231206000385
R5: https://www.springer.com/cda/content/document/cda_downloaddocument/9783319284354-c2.pdf?SGWID=0-0-45-1545215-p177863021
Introduction to Pattern Recognition

❑ Pattern recognition stems from the need for automated machine recognition of objects, signals or images, or the need for automated decision-making based on a given set of parameters or features.

Applications:

❑ Speech recognition (e.g., automated voice-activated customer service)
❑ Speaker identification (forensic applications)
❑ Handwritten character recognition (such as the one used by the postal system to automatically read the addresses on envelopes)
❑ Identification of a system malfunction based on sensor data
❑ Loan or credit card application decisions based on an individual’s credit report data
❑ Automated digital mammography analysis for early detection of cancer
❑ Automated electrocardiogram (ECG) or electroencephalogram (EEG) analysis for cardiovascular or neurological disorder diagnosis
❑ Biometrics (personal identification based on biological data such as iris scan, fingerprint, heart sound, ECG, etc.)
Components of Pattern Recognition System
What is Machine Learning?

• Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.

• Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.

• Machine learning is broadly categorized into three types: supervised learning, unsupervised learning and semi-supervised learning.
Supervised Learning

❑ In supervised learning, the model is trained on labeled examples (input-output pairs), and the trained model is then used to predict the label of unseen test data.

[Figure: a trained model receives test data (an image of a fruit) and predicts the test label, “Banana”.]
Unsupervised Learning

• Unsupervised learning is the training of a machine using information that is neither classified nor labeled, allowing the algorithm to act on that information without guidance.

Clustering: A clustering problem is one where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior.
Semi-supervised Learning

❑ This kind of learning falls somewhere in between supervised and unsupervised learning, since it uses both labeled and unlabeled data for training – typically a small amount of labeled data and a large amount of unlabeled data.

❑ Systems that use this method can considerably improve learning accuracy.

❑ Usually, semi-supervised learning is chosen when labeling the data requires skilled and relevant resources, so only a limited amount of labeled data can be acquired for training.
Category of Supervised Learning

❑ Supervised learning is classified into two categories of algorithms:

❑ Classification: A classification problem is one where the output variable is a category, such as “red” or “blue”, or “disease” and “no disease”.

❑ Regression: A regression problem is one where the output variable is a real value, such as “dollars” or “weight”.
Linear Regression

❑ Living area and number of bedrooms: features or attributes.

❑ Price of the house: output (for regression problems) or class label (for classification problems).

❑ x^(i) is the feature vector for the ith instance and is given by x^(i) = [x1, x2], where x1 is the living area and x2 is the number of bedrooms. y^(i) is the output, i.e., the price of the house for the ith instance.

Initially, the hypothesis is given as

    h_w(x) = w0 + w1*x1 + w2*x2

We can also write it in the intercept form (with x0 = 1) as

    h_w(x) = Σ_{j=0}^{n} wj*xj = w'x

where 'n' is the number of input variables or features. Here, for the house price prediction problem, n is given as 2.
The cost function can be defined as

    J(w) = (1/2) Σ_{i=1}^{m} ( h_w(x^(i)) - y^(i) )^2

Our objective is to evaluate the parameter vector 'w' so as to minimize the above cost function. The popular least mean squares (LMS) algorithm is widely used to estimate the parameter 'w'.

The LMS algorithm uses gradient descent, which starts with some initial w and repeatedly performs the update

    wj := wj - α * ∂J(w)/∂wj

where α is the learning rate and its value varies from 0 to 1.

This is a natural algorithm that takes a step in the direction of steepest decrease of the cost function 'J'. For one instance, the partial derivative is evaluated as

    ∂J(w)/∂wj = ( h_w(x) - y ) * xj

Thus, for a single training instance (the ith instance), the LMS update rule is given by

    wj := wj + α * ( y^(i) - h_w(x^(i)) ) * xj^(i)

For the entire training set, the LMS update rule is given as

    wj := wj + α * Σ_{i=1}^{m} ( y^(i) - h_w(x^(i)) ) * xj^(i)

where 'j' indexes the attributes or features with j = 0, 1, 2, ..., n and the training instances vary over i = 1, 2, ..., m. This method looks at every example in the entire training set on every step, and is called batch gradient descent.
Stochastic gradient descent

❑ In this algorithm, we repeatedly run through the training set, and each time we
encounter a training example, we update the parameters according to the
gradient of the error with respect to that single training example only. This
algorithm is called stochastic gradient descent (also incremental gradient
descent).

❑ Often, stochastic gradient descent gets w “close” to the minimum much faster
than batch gradient descent.
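As an illustration of the idea above, here is a minimal Octave sketch of stochastic gradient descent for linear regression; the function name and arguments (X, y, alpha, nepochs) are assumptions for this example, matching the batch implementation shown later in these slides.

% Minimal sketch of stochastic gradient descent for linear regression.
% Assumptions: X is an m-by-(n+1) feature matrix with a leading column of
% ones, y is an m-by-1 output vector, alpha is the learning rate.
function w = sgd_linear_regression(X, y, alpha, nepochs)
  [m, n1] = size(X);
  w = zeros(n1, 1);
  for epoch = 1:nepochs
    idx = randperm(m);                  % visit the examples in random order
    for i = idx
      err = y(i) - X(i, :) * w;         % error on this single example
      w = w + alpha * err * X(i, :)';   % LMS update using one example only
    end
  end
end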
Regularized Linear Regression (Ridge Regression)

❑ Regularization helps to deal with the bias-variance problem of model development. When small changes are made to the data, such as switching from the training to the testing data, there can be wild changes in the estimates. Regularization can often smooth this problem out substantially.

❑ For highly correlated features, regularization is helpful for smoother minimization of the cost function.

The cost function for ridge regression is given by

    J(w) = (1/2) Σ_{i=1}^{m} ( h_w(x^(i)) - y^(i) )^2 + (λ/2) Σ_{j=1}^{n} wj^2

The partial derivative with respect to wj is given by

    ∂J(w)/∂wj = Σ_{i=1}^{m} ( h_w(x^(i)) - y^(i) ) * xj^(i) + λ*wj

For one instance, the partial derivative is evaluated as

    ∂J(w)/∂wj = ( h_w(x) - y ) * xj + λ*wj

The gradient descent update for ridge regression is given by

    wj := wj + α * [ ( y^(i) - h_w(x^(i)) ) * xj^(i) - λ*wj ]
Least Angle Regression
Implementation of Linear Regression

% X is the feature matrix X = [x0, x1] and y is the output vector.
% The weight vector is w = [w0; w1].

for i = 1:K                                    % K = number of iterations, assigned initially
    T1 = w(1) - alpha * ((X * w) - y)' * X(:, 1);
    T2 = w(2) - alpha * ((X * w) - y)' * X(:, 2);
    w(1) = T1;                                 % update both weights simultaneously
    w(2) = T2;
    J(i) = evaluatecostfunction(X, y, w);      % record the cost after each iteration
end
Vectorization based Linear regression
Vectorization based Ridge regression
Vectorization based Least Angle regression
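The vectorized implementations referenced in the three slide titles above are not reproduced here. As a rough sketch under stated assumptions, the ordinary and ridge regression updates can be vectorized as follows (least angle regression is omitted); the function name and arguments are illustrative, not from the original slides.

% Sketch of vectorized gradient descent for linear and ridge regression.
% Assumptions: X is m-by-(n+1) with a leading column of ones, y is m-by-1;
% lambda = 0 gives ordinary linear regression, lambda > 0 gives ridge regression.
function [w, J] = vectorized_regression(X, y, alpha, K, lambda)
  n1 = size(X, 2);
  w = zeros(n1, 1);
  J = zeros(K, 1);
  penalty = lambda * [0; ones(n1 - 1, 1)];   % do not penalize the bias weight w0
  for k = 1:K
    grad = X' * (X * w - y) + penalty .* w;  % full-batch gradient in one line
    w = w - alpha * grad;
    J(k) = 0.5 * sum((X * w - y) .^ 2) + 0.5 * lambda * sum(w(2:end) .^ 2);
  end
end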
Linear Regression from a Probabilistic Perspective

Let us assume that the target variables and the inputs are related via the equation

    y^(i) = w'x^(i) + e^(i)

where the error terms e^(i) are independently and identically distributed according to a Gaussian distribution with zero mean and some variance σ^2.

The probability density function is given by

    p(e^(i)) = 1/(sqrt(2π)*σ) * exp( -(e^(i))^2 / (2σ^2) )

This can also be written as

    p(y^(i) | x^(i); w) = 1/(sqrt(2π)*σ) * exp( -(y^(i) - w'x^(i))^2 / (2σ^2) )

The likelihood function considering all the training examples is given by

    L(w) = Π_{i=1}^{m} p(y^(i) | x^(i); w)

The minimization of the squared error is the same as the maximization of the log-likelihood, since

    log L(w) = m*log( 1/(sqrt(2π)*σ) ) - (1/σ^2) * (1/2) Σ_{i=1}^{m} ( y^(i) - w'x^(i) )^2
Logistic regression (Binary Classification)

The hypothesis for logistic regression is given as

    h_w(x) = g(w'x) = 1 / (1 + exp(-w'x))

where g(z) = 1 / (1 + exp(-z)) is the logistic function or the sigmoid function. g(z) tends towards 1 as z → ∞, and g(z) tends towards 0 as z → −∞.

The hypothesis for the classification task is given by

    P(y = 1 | x; w) = h_w(x)
    P(y = 0 | x; w) = 1 - h_w(x)

More compactly, it can be written as

    p(y | x; w) = ( h_w(x) )^y * ( 1 - h_w(x) )^(1-y)

The likelihood function considering 'm' training examples is given by

    L(w) = Π_{i=1}^{m} ( h_w(x^(i)) )^(y^(i)) * ( 1 - h_w(x^(i)) )^(1 - y^(i))

The log-likelihood to be maximized is given by

    l(w) = Σ_{i=1}^{m} [ y^(i)*log h_w(x^(i)) + (1 - y^(i))*log(1 - h_w(x^(i))) ]

The partial derivative of the log-likelihood for a single training instance is given by

    ∂l(w)/∂wj = ( y - h_w(x) ) * xj

Hence, using the LMS-style update rule, the weight values for logistic regression are estimated as

    wj := wj + α * ( y^(i) - h_w(x^(i)) ) * xj^(i)

For test data (xt), the output is evaluated as

    yt = 1 if h_w(xt) >= 0.5, and yt = 0 otherwise.
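A minimal Octave sketch of this training procedure is given below, assuming X has a leading column of ones and y contains 0/1 labels; the function name and the use of batch (rather than per-example) updates are illustrative choices, not from the slides.

% Sketch of logistic regression trained by gradient ascent on the log-likelihood.
% Assumptions: X is m-by-(n+1) with a leading column of ones, y is m-by-1 with
% entries in {0, 1}, alpha is the learning rate, K is the number of iterations.
function w = train_logistic(X, y, alpha, K)
  w = zeros(size(X, 2), 1);
  for k = 1:K
    h = 1 ./ (1 + exp(-X * w));      % sigmoid hypothesis for all examples
    w = w + alpha * X' * (y - h);    % batch gradient ascent on the log-likelihood
  end
end

% Prediction for test data xt (a row vector with a leading 1):
%   yt = (1 / (1 + exp(-xt * w))) >= 0.5;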


Multiclass Extension of Logistic Regression (One Vs All)

❑ Let's consider a four-class classification problem with class labels 1, 2, 3 and 4.

❑ In the One Vs All algorithm, 4 binary-class models are created (for 'n' classes, 'n' models). The following block diagram explains the procedure for the One Vs All based multiclass coding algorithm.

    yt = argmax(y1, y2, y3, y4)

where y1, y2, y3 and y4 are the scores (predicted probabilities of the positive class) obtained from model1, model2, model3 and model4, respectively, and the test instance is assigned to the class whose model produces the maximum score.
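A minimal sketch of the One Vs All scheme, reusing the hypothetical train_logistic function from the earlier sketch; the function names and the score-based argmax decision are assumptions for illustration.

% Sketch of One Vs All multiclass logistic regression.
% Assumptions: X is m-by-(n+1) with a leading column of ones, labels is m-by-1
% with entries in 1..nclasses, and train_logistic is the sketch shown earlier.
function W = train_one_vs_all(X, labels, nclasses, alpha, K)
  W = zeros(size(X, 2), nclasses);
  for c = 1:nclasses
    yc = double(labels == c);          % class c versus the rest
    W(:, c) = train_logistic(X, yc, alpha, K);
  end
end

% Prediction: pick the class whose model gives the largest score.
%   scores = 1 ./ (1 + exp(-xt * W));  % one score per class
%   [~, yt] = max(scores);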
Multiclass Extension of Logistic Regression (One Vs One)

❑ In the One Vs One algorithm, 6 binary-class models are created for the 4-class classification task (for 'n' classes, 'n(n-1)/2' models). The following block diagram explains the procedure for the One Vs One based multiclass coding algorithm.

    yt = mode(y1, y2, y3, y4, y5, y6)

where y1, y2, y3, y4, y5 and y6 are the predicted class labels from model1, model2, model3, model4, model5 and model6, respectively, i.e., the test instance is assigned the class label that receives the most votes.
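A corresponding sketch of the One Vs One scheme, again reusing the hypothetical train_logistic function; the pairwise bookkeeping shown here is one possible layout, not taken from the slides.

% Sketch of One Vs One multiclass logistic regression.
% Assumptions: X is m-by-(n+1) with a leading column of ones, labels is m-by-1
% with entries in 1..nclasses; one binary model is trained per class pair.
function models = train_one_vs_one(X, labels, nclasses, alpha, K)
  models = {};
  for a = 1:nclasses-1
    for b = a+1:nclasses
      idx = (labels == a) | (labels == b);
      yab = double(labels(idx) == a);                 % 1 for class a, 0 for class b
      w = train_logistic(X(idx, :), yab, alpha, K);
      models{end+1} = struct('a', a, 'b', b, 'w', w); % store the pair and its weights
    end
  end
end

% Prediction: each pairwise model votes for one of its two classes; the test
% instance is assigned the label with the most votes (the mode of the votes).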
Logistic Regression with L2-Norm Regularization

The cost function (regularized log-likelihood) for logistic regression with L2-norm regularization is given by

    l(w) = Σ_{i=1}^{m} [ y^(i)*log h_w(x^(i)) + (1 - y^(i))*log(1 - h_w(x^(i))) ] - (λ/2) Σ_{j=1}^{n} wj^2

Now, the gradient of the cost function for a single training instance is given by

    ∂l(w)/∂wj = ( y - h_w(x) ) * xj - λ*wj

Hence, using the LMS-style update rule, the weight values are estimated as

    wj := wj + α * [ ( y^(i) - h_w(x^(i)) ) * xj^(i) - λ*wj ]

Logistic Regression with L1-Norm Regularization

The cost function (regularized log-likelihood) for logistic regression with L1-norm regularization is given by

    l(w) = Σ_{i=1}^{m} [ y^(i)*log h_w(x^(i)) + (1 - y^(i))*log(1 - h_w(x^(i))) ] - λ Σ_{j=1}^{n} |wj|

Now, the gradient (subgradient) of the cost function for a single training instance is given by

    ∂l(w)/∂wj = ( y - h_w(x) ) * xj - λ*sign(wj)

Hence, using the LMS-style update rule, the weight values are estimated as

    wj := wj + α * [ ( y^(i) - h_w(x^(i)) ) * xj^(i) - λ*sign(wj) ]
Unsupervised Learning (k-means clustering)

Let X = {x1, x2, x3, ..., xm} be the set of data points and V = {v1, v2, ..., vc} be the set of cluster centers.

❑ Randomly select 'c' cluster centers.

❑ Calculate the distance between each data point and each cluster center.

❑ Assign each data point to the cluster center whose distance from the data point is the minimum over all the cluster centers.

❑ Recalculate each new cluster center as the mean of the data points assigned to it, using

    vk = (1/ck) * Σ (data points assigned to the kth cluster)

where 'ck' represents the number of data points in the kth cluster.

❑ Repeat the assignment and recalculation steps until the cluster centers no longer change, as sketched below.
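A minimal Octave sketch of the procedure above; the function name and the convergence test are illustrative assumptions.

% Sketch of k-means clustering.
% Assumptions: X is m-by-d (one data point per row), c is the number of clusters.
function [V, assign] = kmeans_sketch(X, c, maxiter)
  m = size(X, 1);
  V = X(randperm(m, c), :);                       % randomly pick c points as initial centers
  for it = 1:maxiter
    D = zeros(m, c);
    for k = 1:c
      D(:, k) = sum((X - V(k, :)) .^ 2, 2);       % squared distance to center k
    end
    [~, assign] = min(D, [], 2);                  % nearest center for each point
    Vold = V;
    for k = 1:c
      if any(assign == k)
        V(k, :) = mean(X(assign == k, :), 1);     % recompute center as the cluster mean
      end
    end
    if isequal(V, Vold), break; end               % stop when centers no longer change
  end
end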


How to Select Training and Test Instances?

Hold-out cross-validation:

❑ The hold-out method is the simplest kind of cross-validation. The data set is separated into two sets, called the training set and the testing set.

❑ The function approximator fits a function using the training set only. Then the function approximator is asked to predict the output values for the data in the testing set (it has never seen these output values before).

❑ Either 60/40, 70/30 or 80/20 based hold-out cross-validation splits are followed.
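A small Octave sketch of a random hold-out split (here 70/30); the variable names are assumptions for this example.

% Sketch of a 70/30 hold-out split.
% Assumptions: X is m-by-d, y is m-by-1.
m = size(X, 1);
idx = randperm(m);                     % shuffle the instances
ntrain = round(0.7 * m);               % 70% for training, 30% for testing
Xtrain = X(idx(1:ntrain), :);      ytrain = y(idx(1:ntrain));
Xtest  = X(idx(ntrain+1:end), :);  ytest  = y(idx(ntrain+1:end));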
K-fold cross-validation

❑ K-fold cross validation is one way to improve over the holdout method. The data set is divided into
k subsets, and the holdout method is repeated k times.

❑ Each time, one of the k subsets is used as the test set and the other k-1 subsets are put together
to form a training set.

❑ Then the average error across all k trials is computed. The advantage of this method is that it
matters less how the data gets divided.

❑ The disadvantage of this method is that the training algorithm has to be rerun from scratch k
times, which means it takes k times as much computation to make an evaluation.
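A minimal Octave sketch of the k-fold procedure; train_and_evaluate is a hypothetical placeholder for fitting a model on the training folds and returning its error on the held-out fold.

% Sketch of k-fold cross-validation.
% Assumptions: X is m-by-d, y is m-by-1, k is the number of folds, and
% train_and_evaluate(Xtr, ytr, Xte, yte) is a user-supplied function that
% returns the test error of a model trained on (Xtr, ytr).
m = size(X, 1);
idx = randperm(m);
fold = mod(0:m-1, k) + 1;              % fold label (1..k) for each shuffled instance
err = zeros(k, 1);
for f = 1:k
  test_rows  = idx(fold == f);
  train_rows = idx(fold ~= f);
  err(f) = train_and_evaluate(X(train_rows, :), y(train_rows), ...
                              X(test_rows, :),  y(test_rows));
end
avg_error = mean(err);                 % average error across the k trials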
Leave-one-out cross validation

❑ Leave-one-out cross-validation is K-fold cross-validation taken to its logical extreme, with K equal to N, the number of data points in the set.

❑ That means that, N separate times, the function approximator is trained on all the data except for one point, and a prediction is made for that point.
Performance Measures for Binary Classifier

❑ The performance of a binary classifier is evaluated using the confusion matrix, which arranges the counts of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN).

❑ The sensitivity (SE) is defined as the proportion of abnormal episodes that are accurately classified as abnormal, and it is given by

    SE = TP / (TP + FN)

❑ The specificity (SP) is defined as the proportion of normal episodes that are accurately classified as normal, and it is given by

    SP = TN / (TN + FP)

❑ The accuracy (Acc) is defined as the proportion of episodes that are correctly classified, whether normal or abnormal. It is given by

    Acc = (TP + TN) / (TP + TN + FP + FN)
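A short Octave sketch computing these measures from predicted and true 0/1 labels; the variable names are assumptions.

% Sketch: sensitivity, specificity and accuracy from 0/1 labels.
% Assumptions: ytrue and ypred are column vectors with 1 = abnormal, 0 = normal.
TP = sum(ypred == 1 & ytrue == 1);
TN = sum(ypred == 0 & ytrue == 0);
FP = sum(ypred == 1 & ytrue == 0);
FN = sum(ypred == 0 & ytrue == 1);
SE  = TP / (TP + FN);                  % sensitivity
SP  = TN / (TN + FP);                  % specificity
Acc = (TP + TN) / (TP + TN + FP + FN); % accuracy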
Performance Measures for Multiclass Classifier

❑ The performance measures for a multiclass classifier are the individual class accuracy and the overall accuracy. These measures are evaluated from the multiclass confusion matrix C, whose entry C(i, j) is the number of instances of class i that are classified as class j.

❑ The individual class accuracy (IA) value of the ith class is given by

    IA_i = C(i, i) / Σ_{j} C(i, j)

where i, j = 1, 2, 3, 4, 5 for a five-class problem.

❑ The overall accuracy (OA) of the multiclass classifier is evaluated as

    OA = Σ_{i} C(i, i) / Σ_{i} Σ_{j} C(i, j)
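A corresponding Octave sketch, assuming C is the multiclass confusion matrix with true classes along the rows.

% Sketch: individual class accuracy and overall accuracy from a confusion
% matrix C (rows = true class, columns = predicted class).
IA = diag(C) ./ sum(C, 2);             % one accuracy value per class
OA = trace(C) / sum(C(:));             % overall accuracy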
Bayesian Decision Theory

The feature matrix and the class label vector are denoted by X and y, respectively. The feature matrix consists of 'm' feature vectors x^(i), i = 1, 2, ..., m.

The posterior probability evaluated using Bayes' theorem is given by

    P(Cj | x) = p(x | Cj) * P(Cj) / p(x)

The posterior probability can also be written as

    posterior = ( likelihood * prior ) / evidence

The likelihood is modeled using the Normal or Gaussian distribution. For a single-dimensional feature, the likelihood function is given by

    p(x | Cj) = 1/(sqrt(2π)*σj) * exp( -(x - μj)^2 / (2σj^2) )

For a multi-dimensional feature vector, the likelihood function is given by

    p(x | Cj) = 1/((2π)^(d/2) * |Σj|^(1/2)) * exp( -(1/2)*(x - μj)' * inv(Σj) * (x - μj) )

where μj and Σj are the mean vector and covariance matrix of class Cj, and d is the feature dimension.
Likelihood Ratio Test (LRT) for Two-class Bayesian Classifier

The decision rule for the two-class Bayesian classifier using the a posteriori probabilities is given by

    Decide C1 if P(C1 | x) > P(C2 | x); otherwise decide C2.

We can also write

    Decide C1 if p(x | C1)*P(C1) / p(x) > p(x | C2)*P(C2) / p(x); otherwise decide C2.

As p(x) does not affect the decision rule, it can be eliminated. The LRT is then defined as

    Λ(x) = p(x | C1) / p(x | C2)  >  P(C2) / P(C1)  =>  decide C1; otherwise decide C2.
Maximum a posteriori Probability (MAP) Decision Rule

The LRT is given by

    Λ(x) = p(x | C1) / p(x | C2)  >  P(C2) / P(C1)  =>  decide C1

The LRT can also be written as

    p(x | C1)*P(C1)  >  p(x | C2)*P(C2)  =>  decide C1

For the binary class case, the MAP decision rule is therefore to decide the class with the larger posterior, i.e., the larger product of likelihood and prior.

MAP decision rule for Multiclass Classification: the MAP decision rule for multiclass classification is given by

    y_hat = argmax_j [ p(x | Cj) * P(Cj) ],  j = 1, 2, ..., number of classes
Maximum Likelihood (ML) Decision Rule

The LRT for the two-class classifier is given by

    Λ(x) = p(x | C1) / p(x | C2)  >  P(C2) / P(C1)  =>  decide C1

For equal priors, the decision rule for the LRT reduces to

    Λ(x) = p(x | C1) / p(x | C2)  >  1  =>  decide C1

ML decision rule for Multiclass Classification: the ML decision rule for multiclass classification is given by

    y_hat = argmax_j p(x | Cj),  j = 1, 2, ..., number of classes
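As a rough illustration of the MAP and ML rules above, here is a minimal Octave sketch of a Gaussian Bayesian classifier with per-class mean and covariance; the function name and the use of log-probabilities are assumptions for this example, not taken from the slides.

% Sketch of a multiclass Bayesian classifier with Gaussian likelihoods.
% Assumptions: Xtrain is m-by-d, labels has entries in 1..nclasses, xt is a
% 1-by-d test vector. Equal priors give the ML rule; empirical class
% frequencies as priors give the MAP rule.
function yt = gaussian_bayes_predict(Xtrain, labels, nclasses, xt, use_map)
  d = size(Xtrain, 2);
  logpost = zeros(nclasses, 1);
  for c = 1:nclasses
    Xc = Xtrain(labels == c, :);
    mu = mean(Xc, 1);
    Sigma = cov(Xc) + 1e-6 * eye(d);              % small ridge for numerical stability
    diffv = (xt - mu)';
    loglik = -0.5 * (d*log(2*pi) + log(det(Sigma)) + diffv' * (Sigma \ diffv));
    if use_map
      prior = size(Xc, 1) / size(Xtrain, 1);      % MAP: empirical class prior
    else
      prior = 1 / nclasses;                       % ML: equal priors
    end
    logpost(c) = loglik + log(prior);             % log [ p(x|Cc) * P(Cc) ]
  end
  [~, yt] = max(logpost);                         % argmax over classes
end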
