Classification BP Regression KNN Other Classifiers - Final

The document discusses the fundamentals of neural networks, focusing on back propagation and regression techniques. It explains the structure of neural networks, including single and multi-layer feed-forward networks, data normalization methods, and the back propagation algorithm for training the network. Additionally, it highlights various applications of neural networks in fields such as finance, medicine, and image recognition.

Uploaded by

researchanalystforapurpose


Classification-Prediction: Back Propagation & Regression

Dr. Manish Kumar

Associate Professor
Chair: Data Analytics Lab & M.Tech (Data Engg.)
Department of Information Technology
Indian Institute of Information Technology-Allahabad, Prayagraj
Back Propagation
Basics of Neural Network
■ What is a Neural Network
■ Neural Network Classifier
■ Data Normalization
■ Neuron and bias of a neuron
■ Single Layer Feed Forward
■ Limitation
■ Multi Layer Feed Forward
■ Back propagation
Neural Networks
What is a Neural Network?
• Biologically motivated approach to machine learning

Similarity with Biological Network

• The fundamental processing element of a neural network is a neuron, which:

1. Receives inputs from other sources
2. Combines them in some way
3. Performs a generally nonlinear operation on the result

• A human brain has on the order of 100 billion neurons
• An ant brain has about 250,000 neurons
Neural Network

■ Neural Network is a set of connected

INPUT/OUTPUT UNITS, where each connection has a WEIGHT
associated with it.

■ Neural Network learning is also called CONNECTIONIST learning due

to the connections between units.

■ It is a case of SUPERVISED, INDUCTIVE or CLASSIFICATION

learning.
Neural Network
■ Neural Network learns by adjusting the
weights so as to be able to correctly classify
the training data and hence, after testing
phase, to classify unknown data.

■ A neural network needs a long time for training.

■ A neural network has a high tolerance to noisy and incomplete data.
Neural Network Classifier
■ Input: Classification data
It contains classification attribute
■ Data is divided, as in any classification problem.
[Training data and Testing data]

■ All data must be normalized.

(i.e. all attribute values in the database are rescaled to lie in the interval [0,1] or [-1,1], the ranges a neural network works with)

■ Two basic normalization techniques

[1] Max-Min normalization
[2] Decimal Scaling normalization
Data Normalization

[1] The max-min normalization formula is as follows:

$$v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A$$

Here minA and maxA are the minimum and maximum values of the attribute A. Max-min normalization maps a value v of A to v’ in the range [new_minA, new_maxA].
Example of Max-Min Normalization

Max- Min normalization formula

Example: We want to normalize data to range of the interval [0,1].

We put: new_maxA = 1, new_minA = 0.

Say maxA was 100 and minA was 20 (the maximum and minimum values of the attribute).

Now, if v = 40 (the attribute value for this particular pattern), v’ is calculated as
v’ = (40 - 20) x (1 - 0) / (100 - 20) + 0
=> v’ = 20 x 1/80
=> v’ = 0.25
Decimal Scaling Normalization
[2] Decimal Scaling Normalization

Normalization by decimal scaling normalizes by moving the decimal point of the values of attribute A:

$$v' = \frac{v}{10^{j}}$$

Here j is the smallest integer such that max|v’| < 1.

Example :

A – values range from -986 to 917. Max |v| = 986, so j = 3.

v = -986 normalizes to v’ = -986/1000 = -0.986
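
The two normalization techniques can be sketched in a few lines of Python. This is only an illustration of the formulas above; the function names `min_max_normalize` and `decimal_scaling` are made up for this example and do not come from the slides.

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def decimal_scaling(values):
    """Divide every value by 10**j, with j the smallest integer giving max|v'| < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


# The slide examples: v = 40 with minA = 20, maxA = 100, and A in [-986, 917]
print(min_max_normalize(40, 20, 100))
scaled, j = decimal_scaling([-986, 917])
print(j, scaled)
```

With the slide values, the first call reproduces v’ = 0.25 and the second finds j = 3.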

One Neuron as
a Network
■ Here x1 and x2 are normalized attribute values of the data.

■ y is the output of the neuron, i.e. the class label.

■ The value of x1 is multiplied by a weight w1 and the value of x2 is multiplied by a weight w2; the weighted values are the input to the neuron.

■ Given that w1 = 0.5 and w2 = 0.5,

■ and say the value of x1 is 0.3 and the value of x2 is 0.8,

■ the weighted sum is:

■ sum = w1 x x1 + w2 x x2 = 0.5 x 0.3 + 0.5 x 0.8 = 0.55
One Neuron as a Network

■ The neuron receives the weighted sum as input and

calculates the output as a function of input as follows :

■ y = f(x) , where f(x) is defined as

■ f(x) = 0 { when x< 0.5 }

■ f(x) = 1 { when x >= 0.5 }

■ For our example, x (the weighted sum) is 0.55, so y = 1.

■ That means the corresponding input attribute values are classified in class 1.

■ If for another set of input values x = 0.45, then f(x) = 0,

■ so we conclude that those input values are classified in class 0.
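
The single-neuron classifier described here can be written as a short Python sketch. The function name `threshold_neuron` is hypothetical; the weights, inputs and the 0.5 threshold are the ones used on the slide, and the second input pair is an assumed example chosen to give a weighted sum of 0.45.

```python
def threshold_neuron(inputs, weights, threshold=0.5):
    """Return class 1 if the weighted sum reaches the threshold, else class 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum >= threshold else 0


weights = [0.5, 0.5]
print(threshold_neuron([0.3, 0.8], weights))  # weighted sum 0.55
print(threshold_neuron([0.1, 0.8], weights))  # weighted sum 0.45 (assumed inputs)
```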

Bias of a Neuron

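■ We need a bias value to be added to the weighted sum ∑wixi so that the decision boundary can be shifted away from the origin:
v = ∑wixi + b, where b is the bias

[Figure: three parallel lines in the (x1, x2) plane, x1 - x2 = -1, x1 - x2 = 0 and x1 - x2 = 1, showing how the bias shifts the line.]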
Bias as extra input

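[Figure: neuron with the bias as an extra input. The inputs x0 = +1, x1, x2, …, xm (attribute values) with weights w0, w1, w2, …, wm feed a summing function that produces v; an activation function maps v to the output class y.]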
Neuron with Activation
■ The neuron is the basic information processing unit of a
NN. It consists of:

1. A set of links, describing the neuron inputs, with weights W1, W2, …, Wm

2. An adder function (linear combiner) for computing the weighted sum of the inputs (real numbers):

$$u = \sum_{j=1}^{m} W_j x_j$$

3. An activation function for limiting the amplitude of the neuron output.
A Multilayer Feed-Forward
Neural Network
[Figure: a fully connected multilayer feed-forward network. The input record xi enters at the input nodes, passes through the hidden nodes via the weights wij, and the output nodes give the output class.]
Neural Network Learning

■ The inputs are fed simultaneously into the

input layer.

■ The weighted outputs of these units are fed

into hidden layer.

■ The weighted outputs of the last hidden layer

are inputs to units making up the output
layer.
Network

■ The units in the hidden layers and output layer are

sometimes referred to as neurodes, due to their
symbolic biological basis, or as output units.

■ A network containing two hidden layers is called a

three-layer neural network, and so on.

■ The network is feed-forward in that none of the

weights cycles back to an input unit or to an output
unit of a previous layer.
A Multilayered Feed – Forward Network
■ INPUT: records without class attribute with
normalized attributes values.

■ INPUT VECTOR: X = { x1, x2, …. xn}

where n is the number of (non class) attributes.

■ INPUT LAYER – there are as many nodes as

non-class attributes i.e. as the length of the input
vector.

■ HIDDEN LAYER – the number of nodes in the hidden

layer and the number of hidden layers depends on
implementation.
A Multilayered Feed–Forward
Network
■ OUTPUT LAYER – corresponds to the class attribute.
■ There are as many nodes as classes (values of the
class attribute).

k= 1, 2,.. #classes

• Network is fully connected, i.e. each unit provides input

to each unit in the next forward layer.
Classification by Back propagation

■ Back Propagation learns by iteratively

processing a set of training data (samples).

■ For each sample, weights are modified to

minimize the error between network’s
classification and actual classification.
Steps in Back propagation
Algorithm
■ STEP ONE: initialize the weights and biases.

■ The weights in the network are initialized to

random numbers from the interval [-1,1].

■ Each unit has a BIAS associated with it

■ The biases are similarly initialized to random

numbers from the interval [-1,1].

■ STEP TWO: feed the training sample.

Steps in Back propagation Algorithm
( cont..)

■ STEP THREE: Propagate the inputs forward; we

compute the net input and output of each unit
in the hidden and output layers.

■ STEP FOUR: back propagate the error.

■ STEP FIVE: update weights and biases to

reflect the propagated errors.

■ STEP SIX: terminating conditions.

Propagation through Hidden Layer (One Node)

[Figure: inputs x0, x1, …, xn (input vector x) with weights w0j, w1j, …, wnj (weight vector w) form a weighted sum ∑, which is combined with the bias Θj; the activation function f produces the output y.]


■ The inputs to unit j are outputs from the previous layer. These are
multiplied by their corresponding weights in order to form a
weighted sum, which is added to the bias associated with unit j.
■ A nonlinear activation function f is applied to the net input.
Propagate the inputs forward

■ For unit j in the input layer, its output is equal to its input, that is,

$$O_j = I_j$$

for input unit j.

• The net input to each unit in the hidden and output layers is computed as follows.
• Given a unit j in a hidden or output layer, the net input is

$$I_j = \sum_i w_{ij} O_i + \theta_j$$

where wij is the weight of the connection from unit i in the previous layer to unit j; Oi is the output of unit i from the previous layer; and θj is the bias of the unit.

Propagate the inputs forward
■ Each unit in the hidden and output layers takes its net input and then applies an activation function. The function symbolizes the activation of the neuron represented by the unit. Here the logistic (sigmoid) function is used; it is also called a squashing function because it maps a large input domain onto the range 0 to 1.
■ Given a net input Ij to unit j, the output of unit j, Oj = f(Ij), is computed as

$$O_j = \frac{1}{1 + e^{-I_j}}$$
Back propagate the error
■ When reaching the output layer, the error is computed and propagated backwards.
■ For a unit k in the output layer, the error is computed by the formula:

$$Err_k = O_k (1 - O_k)(T_k - O_k)$$

where Ok is the actual output of unit k (computed by the activation function); Tk is the true output based on the known class label of the training sample; and Ok(1 - Ok) is the derivative (rate of change) of the activation function.

Back propagate the error
■ The error is propagated backwards by updating weights and biases to reflect the error of the network classification.
■ For a unit j in the hidden layer, the error is computed by the formula:

$$Err_j = O_j (1 - O_j) \sum_k Err_k \, w_{jk}$$

where wjk is the weight of the connection from unit j to unit k in the next higher layer, and Errk is the error of unit k.
Update weights and biases
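■ Weights are updated by the following equations, where l is a constant between 0.0 and 1.0 reflecting the learning rate; this learning rate is fixed for the implementation.

$$\Delta w_{ij} = (l)\, Err_j \, O_i$$
$$w_{ij} = w_{ij} + \Delta w_{ij}$$

• Biases are updated by the following equations:

$$\Delta \theta_j = (l)\, Err_j$$
$$\theta_j = \theta_j + \Delta \theta_j$$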

Update weights and biases
■ We are updating weights and biases after the presentation of each sample.
■ This is called case updating.

■ Epoch: one iteration through the training set is called an epoch.

■ Epoch updating: alternatively, the weight and bias increments could be accumulated in variables and the weights and biases updated after all of the samples of the training set have been presented.

■ Case updating is more common in practice because it tends to yield more accurate results.

Terminating Conditions
■ Training stops when
• all Δwij in the previous epoch are below some threshold, or
• the percentage of samples misclassified in the previous epoch is below some threshold, or
• a prespecified number of epochs has expired.

• In practice, several hundreds of thousands of epochs may be required before the weights converge.
Backpropagation Formulas

[Figure: network diagram with the input vector xi at the input nodes, weights wij, hidden nodes, and output nodes producing the output vector.]
Example of Back propagation
Inputs = 3, hidden neurons = 2, outputs = 1

Initialize weights: random numbers from -1.0 to 1.0

Initial input and weights

x1   x2   x3   w14   w15    w24   w25   w34    w35   w46    w56
1    0    1    0.2   -0.3   0.4   0.1   -0.5   0.2   -0.3   -0.2

Example ( cont.. )

■ A bias is added to the hidden and output nodes
■ Initialize biases: random values from -1.0 to 1.0

■ Bias (random)

θ4     θ5    θ6
-0.4   0.2   0.1

Net Input and Output Calculation

Unit j   Net input Ij                                   Output Oj
4        0.2 + 0 - 0.5 - 0.4 = -0.7                     1/(1 + e^0.7) = 0.332
5        -0.3 + 0 + 0.2 + 0.2 = 0.1                     1/(1 + e^-0.1) = 0.525
6        (-0.3)(0.332) - (0.2)(0.525) + 0.1 = -0.105    1/(1 + e^0.105) = 0.474
Calculation of Error at Each
Node

Unit j   Error j
6        0.474 x (1 - 0.474) x (1 - 0.474) = 0.1311   (we assume T6 = 1)
5        0.525 x (1 - 0.525) x 0.1311 x (-0.2) = -0.0065
4        0.332 x (1 - 0.332) x 0.1311 x (-0.3) = -0.0087
Calculation of weights and Bias Updating

Learning rate l = 0.9

Weight   New value
w46      -0.3 + (0.9)(0.1311)(0.332) = -0.261
w56      -0.2 + (0.9)(0.1311)(0.525) = -0.138
w14      0.2 + (0.9)(-0.0087)(1) = 0.192
w15      -0.3 + (0.9)(-0.0065)(1) = -0.306
…        (the remaining weights are updated similarly)
θ6       0.1 + (0.9)(0.1311) = 0.218
…        (the remaining biases are updated similarly)
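
The whole worked example (forward pass, error back propagation, and one case update) can be checked with a short Python sketch. The data structures and variable names (`w`, `theta`, `out`, `err`) are choices made for this illustration only; the numbers are the initial values from the slides, with T6 = 1 and l = 0.9.

```python
import math


def sigmoid(x):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-x))


# Initial inputs, weights and biases from the worked example
x = {1: 1.0, 2: 0.0, 3: 1.0}
w = {(1, 4): 0.2, (1, 5): -0.3, (2, 4): 0.4, (2, 5): 0.1,
     (3, 4): -0.5, (3, 5): 0.2, (4, 6): -0.3, (5, 6): -0.2}
theta = {4: -0.4, 5: 0.2, 6: 0.1}
target, rate = 1.0, 0.9

# Forward pass: hidden units 4 and 5, then output unit 6
out = dict(x)
for j in (4, 5):
    out[j] = sigmoid(sum(w[(i, j)] * out[i] for i in (1, 2, 3)) + theta[j])
out[6] = sigmoid(sum(w[(j, 6)] * out[j] for j in (4, 5)) + theta[6])

# Backward pass: output error first, then hidden errors
err = {6: out[6] * (1 - out[6]) * (target - out[6])}
for j in (4, 5):
    err[j] = out[j] * (1 - out[j]) * err[6] * w[(j, 6)]

# Case update of every weight and bias
new_w = {(i, j): w_ij + rate * err[j] * out[i] for (i, j), w_ij in w.items()}
new_theta = {j: t + rate * err[j] for j, t in theta.items()}

print(round(out[6], 3), round(err[6], 4), round(new_w[(4, 6)], 3))
```

Rounded to the precision used on the slides, the script gives O6 = 0.474, Err6 = 0.1311 and a new w46 of -0.261.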
Network Pruning and Rule Extraction

■ Network pruning
■ A fully connected network is hard to articulate (interpret)
■ N input nodes, h hidden nodes and m output nodes lead to h(m+N) weights
■ Pruning: remove some of the links without affecting the classification accuracy of the network
Applications
■ Handwritten Digit Recognition
■ Face recognition
■ Time series prediction
■ Process identification
■ Process control
■ Optical character recognition
Application-II

■ Forecasting/Market Prediction: finance and banking

■ Manufacturing: quality control, fault diagnosis

■ Medicine: analysis of electrocardiogram data, RNA & DNA

sequencing, drug development without animal testing

■ Control: process, robotics

Regression
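Recall: Covariance

$$cov(X,Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

Interpreting Covariance
cov(X,Y) > 0: X and Y are positively correlated
cov(X,Y) < 0: X and Y are inversely correlated
cov(X,Y) = 0: X and Y are uncorrelated (no linear relationship). Independent variables have zero covariance, but zero covariance alone does not imply independence.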
Correlation coefficient

■ Pearson’s Correlation Coefficient is standardized covariance (unitless):

$$r = \frac{cov(x,y)}{\sqrt{var(x)\, var(y)}}$$
Correlation
■ Measures the relative strength of the linear
relationship between two variables
■ Unit-less
■ Ranges between –1 and 1
■ The closer to –1, the stronger the negative linear
relationship
■ The closer to 1, the stronger the positive linear
relationship
■ The closer to 0, the weaker any linear relationship
Scatter Plots of Data with Various Correlation Coefficients
[Figure: six scatter plots of Y against X, with r = -1, r = -.6 and r = 0 in the top row and r = +1, r = +.3 and r = 0 in the bottom row.]
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Linear Correlation
[Figure: scatter plots of Y against X contrasting linear relationships with curvilinear relationships.]
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Linear Correlation
[Figure: scatter plots of Y against X contrasting strong relationships with weak relationships.]
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Linear Correlation
[Figure: scatter plot against X showing no relationship.]
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
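Calculating by hand…

$$r = \frac{\dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n-1}}{\sqrt{\dfrac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1} \cdot \dfrac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-1}}}$$

Simpler calculation formula…

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}} = \frac{SS_{xy}}{\sqrt{SS_x \, SS_y}}$$

where SSxy is the numerator of the covariance and SSx, SSy are the numerators of the variances.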
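
As a quick check of the "calculating by hand" idea, here is a small Python sketch of the correlation coefficient computed from the numerator of the covariance and the numerators of the two variances. The function name `pearson_r` and the five data points are invented for this example.

```python
import math


def pearson_r(xs, ys):
    """Correlation coefficient from the covariance and variance numerators."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return ss_xy / math.sqrt(ss_x * ss_y)


# Hypothetical data: SSxy = 8, SSx = 10, SSy = 10
r = pearson_r([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(round(r, 3))
```

For these made-up points SSxy = 8 and SSx = SSy = 10, so r = 0.8.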
Distribution of the correlation coefficient:

The sample correlation coefficient follows a t-distribution with n-2 degrees of freedom (since you have to estimate the standard error):

$$SE(\hat{r}) = \sqrt{\frac{1 - r^2}{n - 2}}$$

*Note: like a proportion, the variance of the correlation coefficient depends on the correlation coefficient itself, so substitute in the estimated r.
■ In probability and statistics, Student's t-distribution (or simply the t-distribution) is a continuous probability distribution that generalizes the standard normal distribution. Like the latter, it is symmetric around zero and bell-shaped.

■ The Z distribution is a special case of the normal distribution with a

mean of 0 and standard deviation of 1. The t-distribution is similar to
the Z-distribution, but is sensitive to sample size and is used for small
or moderate samples when the population standard deviation is
unknown.
Linear regression

In correlation, the two variables are treated as equals. In regression, one

variable is considered independent (=predictor) variable (X) and the other the
dependent (=outcome) variable Y.
What is “Linear”?
■ Remember this:
■ Y = mX + B?

[Figure: a straight line with slope m crossing the Y axis at the intercept B.]
What’s Slope?

A slope of 2 means that every 1-unit change in X

yields a 2-unit change in Y.
Prediction

If you know something about X, this knowledge helps you

predict something about Y. (Sound familiar?…sound
like conditional probabilities?)
Regression equation…
Expected value of y at a given level of x:

$$E(y_i \mid x_i) = \alpha + \beta x_i$$

Predicted value for an individual…
yi = α + β*xi + random errori

Here α + β*xi is fixed (exactly on the line), and random errori follows a normal distribution.
Assumptions (or the fine print)
■ Linear regression assumes that…
■ 1. The relationship between X and Y is linear
■ 2. Y is distributed normally at each value of X
■ 3. The variance of Y at every value of X is the
same (homogeneity of variances)
■ 4. The observations are independent
The standard error of Y given X is the average variability around the
regression line at any given value of X. It is assumed to be equal at
all values of X.

[Figure: normal curves with the same spread Sy/x drawn around the regression line at several values of X.]
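Regression Picture

[Figure: an observation yi, its fitted value ŷi on the regression line, and the naïve mean ȳ. A is the distance from yi to ȳ, B is the distance from ŷi to ȳ, and C is the distance from yi to ŷi.]

*Least squares estimation gave us the line (β) that minimized C².

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$

A² = B² + C², i.e. SStotal = SSreg + SSresidual

R² = SSreg/SStotal

■ A² (SStotal): total squared distance of observations from naïve mean of y; total variation
■ B² (SSreg): distance from regression line to naïve mean of y; variability due to x (regression)
■ C² (SSresidual): variance around the regression line; additional variability not explained by x, which is what the least squares method aims to minimize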
Estimating the intercept and slope: least squares estimation
** Least Squares Estimation
A little calculus….
What are we trying to estimate? β, the slope, from the regression equation yi = α + β*xi + random errori.

What’s the constraint? We are trying to minimize the squared distance (hence the “least squares”) between the observations themselves and the predicted values (these differences are also called the “residuals”, or left-over unexplained variability).

Differencei = yi – (βxi + α)     Differencei² = (yi – (βxi + α))²

Find the β that gives the minimum sum of the squared differences. How do you minimize a function? Take the derivative; set it equal to zero; and solve. Typical max/min problem from calculus….

$$\frac{d}{d\beta} \sum_{i=1}^{n} \big(y_i - (\beta x_i + \alpha)\big)^2 = -2 \sum_{i=1}^{n} x_i \big(y_i - \beta x_i - \alpha\big) = 0$$

From here it takes a little math trickery to solve for β…

Resulting formulas…

Slope (beta coefficient): $\hat{\beta} = \dfrac{cov(x,y)}{var(x)}$

Intercept: $\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}$

Regression line always goes through the point $(\bar{x}, \bar{y})$
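
The resulting formulas are easy to try out. The sketch below fits a line by least squares and computes R² = SSreg/SStotal; the helper names `fit_line` and `r_squared` and the five data points are assumptions made for this illustration.

```python
def fit_line(xs, ys):
    """Least squares estimates: slope = SSxy / SSx, intercept = mean_y - slope * mean_x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    beta = ss_xy / ss_x
    alpha = mean_y - beta * mean_x
    return alpha, beta


def r_squared(xs, ys, alpha, beta):
    """R^2 = SSreg / SStotal for the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_reg = sum((alpha + beta * x - mean_y) ** 2 for x in xs)
    return ss_reg / ss_total


# Hypothetical data with mean_x = 3 and mean_y = 3
xs, ys = [1, 2, 3, 4, 5], [2, 1, 4, 3, 5]
alpha, beta = fit_line(xs, ys)
print(round(alpha, 3), round(beta, 3), round(r_squared(xs, ys, alpha, beta), 3))
```

For these points the slope is 0.8 and the intercept 0.6, the line passes through (3, 3), and R² is 0.64, the square of the correlation coefficient of the same data.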

Relationship with correlation

In correlation, the two variables are treated as equals. In regression, one variable is considered
independent (=predictor) variable (X) and the other the dependent (=outcome) variable Y.
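Formula for the standard error of beta (you will not have to calculate by hand!):

$$s_{\hat{\beta}} = \sqrt{\frac{s_{y/x}^{2}}{SS_x}}, \qquad s_{y/x}^{2} = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}, \qquad SS_x = \sum_{i=1}^{n} (x_i - \bar{x})^2$$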
Residual Analysis: check
assumptions

■ The residual for observation i, ei, is the difference

between its observed and predicted value
■ Check the assumptions of regression by examining the
residuals
■ Examine for linearity assumption
■ Examine for constant variance for all levels of X
(homoscedasticity)
■ Evaluate normal distribution assumption
■ Evaluate independence assumption
■ Graphical Analysis of Residuals
■ Can plot residuals vs. X
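Residual Analysis for Linearity
[Figure: two panels, each a scatter plot of Y against x with the corresponding plot of residuals against x. Left panel: not linear. Right panel: ✔ linear.]
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall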
Residual Analysis for Homoscedasticity
[Figure: two panels, each a scatter plot of Y against x with the corresponding plot of residuals against x. Left panel: non-constant variance. Right panel: ✔ constant variance.]
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Residual Analysis for Independence
[Figure: plots of residuals against X. Patterned residuals: not independent. Patternless residuals: ✔ independent.]
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Other types of multivariate
regression
● Multiple linear regression is for normally
distributed outcomes

● Logistic regression is for binary outcomes

● Cox proportional hazards regression is used when

time-to-event is the outcome
Common multivariate regression models.

● Continuous outcome (dependent variable)
   Example outcome variable: blood pressure
   Appropriate multivariate regression model: linear regression
   Example equation: blood pressure (mmHg) = α + βsalt*salt consumption (tsp/day) + βage*age (years) + βsmoker*ever smoker (yes=1/no=0)
   What do the coefficients give you? Slopes: they tell you how much the outcome variable increases for every 1-unit increase in each predictor.

● Binary outcome
   Example outcome variable: high blood pressure (yes/no)
   Appropriate multivariate regression model: logistic regression
   Example equation: ln (odds of high blood pressure) = α + βsalt*salt consumption (tsp/day) + βage*age (years) + βsmoker*ever smoker (yes=1/no=0)
   What do the coefficients give you? Odds ratios: they tell you how much the odds of the outcome increase for every 1-unit increase in each predictor.

● Time-to-event outcome
   Example outcome variable: time-to-death
   Appropriate multivariate regression model: Cox regression
   Example equation: ln (rate of death) = α + βsalt*salt consumption (tsp/day) + βage*age (years) + βsmoker*ever smoker (yes=1/no=0)
   What do the coefficients give you? Hazard ratios: they tell you how much the rate of the outcome increases for every 1-unit increase in each predictor.
Multivariate regression pitfalls
● Multi-collinearity
● Residual confounding
● Overfitting
Multicollinearity
● Multicollinearity arises when two variables that
measure the same thing or similar things (e.g.,
weight and BMI) are both included in a multiple
regression model; they will, in effect, cancel each
other out and generally destroy your model.

● Model building and diagnostics are tricky

business!
Overfitting
■ In multivariate modeling, you can get
highly significant but meaningless
results if you put too many predictors in
the model.
■ The model is fit perfectly to the quirks
of your particular sample, but has no
predictive ability in a new sample.
Overfitting: class data
example
■ I asked SAS to automatically find
predictors of optimism in our class
dataset. Here’s the resulting linear
regression model:
Variable    Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept   11.80175             2.98341          11.96067     15.65     0.0019
exercise    -0.29106             0.09798          6.74569      8.83      0.0117
sleep       -1.91592             0.39494          17.98818     23.53     0.0004
obama       1.73993              0.24352          39.01944     51.05     <.0001
Clinton     -0.83128             0.17066          18.13489     23.73     0.0004
mathLove    0.45653              0.10668          13.99925     18.32     0.0011

Exercise, sleep, and high ratings for Clinton are negatively related to optimism (highly
significant!) and high ratings for Obama and high love of math are positively related to
optimism (highly significant!).
Overfitting
Rule of thumb: You need at
least 10 subjects for each
additional predictor
variable in the multivariate
regression model.

Pure noise variables still produce good R2 values if the model is

overfitted. The distribution of R2 values from a series of
simulated regression models containing only noise variables.
(Figure 1 from: Babyak, MA. What You See May Not Be What You Get: A Brief, Nontechnical Introduction
to Overfitting in Regression-Type Models. Psychosomatic Medicine 66:411-421 (2004).)
■ R-squared values range from 0 to 1. An R-squared value of 0 means that the model explains 0% of the variability in the dependent variable, a value of 1 means that it explains 100%, a value of 0.5 means that it explains 50%, and so on.
K-Nearest Neighbor
Different Learning Methods
■ Eager Learning
■ Explicit description of target function on
the whole training set
■ Instance-based Learning
■ Learning=storing all training instances
■ Classification=assigning target function to
a new instance
■ Referred to as “Lazy” learning
K-Nearest Neighbor
■ Features
■ All instances correspond to points in an
n-dimensional Euclidean space
■ Classification is delayed till a new instance
arrives
■ Classification done by comparing feature
vectors of the different points
■ Target function may be discrete or
real-valued
1-Nearest Neighbor
3-Nearest Neighbor
K-Nearest Neighbor
■ An arbitrary instance is represented by (a1(x), a2(x), a3(x), …, an(x))
■ ai(x) denotes features
■ Euclidean distance between two instances

$$d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} \big(a_r(x_i) - a_r(x_j)\big)^2}$$

■ Continuous valued target function
■ mean value of the k nearest training examples
• K-nearest neighbours uses the local neighborhood to obtain a prediction
• The K memorized examples most similar to the one being classified are retrieved
• A distance function is needed to compare the similarity of examples
■ If the ranges of the features differ, features with bigger values will dominate the decision
■ In general, feature values are normalized prior to distance calculation
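
A minimal Python sketch of the k-nearest-neighbour idea follows: majority vote for a discrete target and the mean of the neighbours for a real-valued target. The function names and the toy data are invented for this example, and the features are assumed to be already normalized.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def knn_classify(train, query, k=3):
    """train is a list of (feature_vector, label) pairs; returns the majority label."""
    neighbours = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def knn_regress(train, query, k=3):
    """Real-valued target: mean value of the k nearest training examples."""
    neighbours = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    return sum(value for _, value in neighbours) / k


# Hypothetical, already normalized training data
points = [((0.1, 0.2), "A"), ((0.2, 0.1), "A"), ((0.15, 0.15), "A"),
          ((0.8, 0.9), "B"), ((0.9, 0.8), "B"), ((0.85, 0.85), "B")]
print(knn_classify(points, (0.2, 0.2)), knn_classify(points, (0.9, 0.9)))
```

Sorting the whole training set for every query mirrors the "compare against all stored instances" description; it is the simplest choice, not the fastest.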
Voronoi Diagram
■ Decision surface formed by the training
examples
Distance-Weighted Nearest
Neighbor Algorithm
■ Assign weights to the neighbors based on their ‘distance’ from the query point
■ Weight ‘may’ be inverse square of the distances:

$$w_i = \frac{1}{d(x_q, x_i)^2}$$

■ All training points may influence a particular instance
▪ Shepard’s method
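
The distance-weighted variant can be sketched as follows, assuming the inverse-square weighting mentioned above. The function name `weighted_knn_classify`, the handling of an exact match (distance zero), and the three training points are assumptions of this illustration.

```python
import math
from collections import defaultdict


def weighted_knn_classify(train, query, k=3):
    """Vote among the k nearest neighbours, each weighted by 1 / distance**2."""
    nearest = sorted((math.dist(x, query), label) for x, label in train)[:k]
    scores = defaultdict(float)
    for distance, label in nearest:
        if distance == 0:
            return label  # assumed convention: an exact match decides the class
        scores[label] += 1.0 / distance ** 2
    return max(scores, key=scores.get)


# Hypothetical 1-D data: one close "A" point, two farther "B" points
train = [((0.0,), "A"), ((3.0,), "B"), ((4.0,), "B")]
print(weighted_knn_classify(train, (1.0,), k=3))
```

Here a plain majority vote over the three neighbours would pick "B", while the inverse-square weights (1 for "A" against 1/4 + 1/9 for "B") pick "A".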
Remarks
+Highly effective inductive inference method for
noisy training data and complex target
functions
+Target function for a whole space may be
described as a combination of less complex
local approximations
+Learning is very simple
- Classification is time consuming
Nearest-Neighbor Classifiers: Issues

– The value of k, the number of nearest neighbors to retrieve

– Choice of Distance Metric to compute distance between
records
– Computational complexity
– Size of training set
– Dimension of data
Value of K
■ Choosing the value of k:
■ If k is too small, sensitive to noise points
■ If k is too large, neighborhood may include points
from other classes

Rule of thumb:
K = sqrt(N)
N: number of training points
Distance Metrics
Distance Measure: Scale
Effects
■ Different features may have different
measurement scales
■ E.g., patient weight in kg (range [50,200])
vs. blood protein values in ng/dL (range
[-3,3])
■ Consequences
■ Patient weight will have a much greater
influence on the distance between samples
■ May bias the performance of the classifier
Nearest Neighbour :
Computational Complexity
■ Expensive
■ To determine the nearest neighbour of a query point q, must
compute the distance to all N training examples
+ Pre-sort training examples into fast data structures (kd-trees)
+ Compute only an approximate distance
+ Remove redundant data (condensing)
■ Storage Requirements
■ Must store all training data P
+ Remove redundant data (condensing)
- Pre-sorting often increases the storage requirements
■ High Dimensional Data
■ “Curse of Dimensionality”
■ Required amount of training data increases exponentially with
dimension
■ Computational cost also increases dramatically
■ Partitioning techniques degrade to linear search in high dimension
Reduction in Computational
Complexity
■ Reduce size of training set
■ Condensation, editing

■ Use geometric data structure for high

dimensional search
Condensation: Decision Regions

Each cell contains one sample, and every location within the cell is closer to that sample than to any other sample.

A Voronoi diagram divides the space into such cells.

Every query point will be assigned the classification of the sample within that cell. The decision boundary separates the class regions based on the 1-NN decision rule.

Knowledge of this boundary is sufficient to classify new points.

The boundary itself is rarely computed; many algorithms seek to retain only those points necessary to generate an identical boundary.
Condensing

■ Aim is to reduce the number of training samples

■ Retain only the samples that are needed to define the decision boundary

■ Decision Boundary Consistent – a subset whose nearest neighbour decision

boundary is identical to the boundary of the entire training set
■ Minimum Consistent Set – the smallest subset of the training data that correctly
classifies all of the original training data

Original data Condensed data Minimum Consistent Set

Condensing
■ Condensed Nearest Neighbour (CNN)

1. Initialize subset with a single (or K) training example
2. Classify all remaining samples using the subset, and transfer any incorrectly classified samples to the subset
3. Return to 2 until no transfers occurred or the subset is full

• Incremental
• Order dependent
• Neither minimal nor decision boundary consistent
• O(n³) for brute-force method
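
The three CNN steps can be sketched in Python as below. This is a simplified illustration: it starts from the first training example, uses the 1-NN rule, and has no cap on the subset size (the "subset is full" condition is left out). The names `condense` and `nearest_label` and the toy data are hypothetical.

```python
import math


def nearest_label(subset, point):
    """1-NN rule: label of the closest stored sample."""
    return min(subset, key=lambda item: math.dist(item[0], point))[1]


def condense(train):
    """Condensed Nearest Neighbour: keep only samples the subset misclassifies."""
    subset = [train[0]]              # step 1: start with a single example
    remaining = list(train[1:])
    transferred = True
    while transferred and remaining:  # step 3: repeat until no transfers occur
        transferred = False
        still_remaining = []
        for features, label in remaining:  # step 2: classify with the subset
            if nearest_label(subset, features) != label:
                subset.append((features, label))
                transferred = True
            else:
                still_remaining.append((features, label))
        remaining = still_remaining
    return subset


# Hypothetical 1-D data with two well separated classes
train = [((0.0,), "A"), ((0.1,), "A"), ((0.2,), "A"),
         ((1.0,), "B"), ((1.1,), "B"), ((1.2,), "B")]
subset = condense(train)
print(subset)
```

On this toy data the subset ends up with one sample per class, and it still classifies every original training point correctly. Because the result depends on the order of the training examples, a different ordering may keep different samples.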
High dimensional search
■ Given a point set and a nearest neighbor query point

■ Find the points enclosed in a rectangle (range) around the query

■ Perform linear search for nearest neighbor only in the rectangle

[Figure: a point set with a rectangular range drawn around the query point.]
KNN: Alternate Terminologies

■ Instance Based Learning

■ Lazy Learning
■ Case Based Reasoning
■ Exemplar Based Learning
Classifier Accuracy

▪ How can it be measured?

▪ Holdout Method (Random Subsampling)
▪ K-fold Cross Validation
▪ Bootstrapping
▪ How can we improve classifier accuracy?
▪ Bagging
▪ Boosting
▪ Is accuracy enough to judge a classifier?
Predictor Error Measures
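■ Measure predictor accuracy: measure how far off the predicted value is from the actual known value
■ Loss function: measures the error between yi and the predicted value yi’
■ Absolute error: |yi – yi’|
■ Squared error: (yi – yi’)²
■ Test error (generalization error): the average loss over the test set (d is the number of test tuples and ȳ is the mean of the actual values)
■ Mean absolute error: $\dfrac{\sum_{i=1}^{d} |y_i - y_i'|}{d}$     Mean squared error: $\dfrac{\sum_{i=1}^{d} (y_i - y_i')^2}{d}$

■ Relative absolute error: $\dfrac{\sum_{i=1}^{d} |y_i - y_i'|}{\sum_{i=1}^{d} |y_i - \bar{y}|}$     Relative squared error: $\dfrac{\sum_{i=1}^{d} (y_i - y_i')^2}{\sum_{i=1}^{d} (y_i - \bar{y})^2}$

The mean squared error exaggerates the presence of outliers.

The (square) root mean squared error is popularly used; similarly, the root relative squared error.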
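
The error measures can be compared side by side with a small Python sketch. The function name `error_measures` and the four data points are made up for this example, and ȳ is taken here as the mean of the supplied actual values, which is an assumption of the sketch.

```python
import math


def error_measures(actual, predicted):
    """Return MAE, MSE, RMSE, relative absolute and relative squared error."""
    d = len(actual)
    mean_actual = sum(actual) / d  # assumption: mean of the supplied actual values
    abs_err = [abs(y - p) for y, p in zip(actual, predicted)]
    sq_err = [(y - p) ** 2 for y, p in zip(actual, predicted)]
    mse = sum(sq_err) / d
    return {
        "mae": sum(abs_err) / d,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "rae": sum(abs_err) / sum(abs(y - mean_actual) for y in actual),
        "rse": sum(sq_err) / sum((y - mean_actual) ** 2 for y in actual),
    }


# Hypothetical test set with one badly predicted value (an outlier)
m = error_measures([1, 2, 3, 4], [1, 2, 3, 8])
print(m)
```

With a single prediction off by 4, the mean absolute error is 1.0 but the mean squared error is 4.0, which illustrates how squaring exaggerates the outlier.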
Evaluating the Accuracy of a Classifier or
Predictor (I)

■ Holdout method
■ Given data is randomly partitioned into two independent sets
■ Training set (e.g., 2/3) for model construction
■ Test set (e.g., 1/3) for accuracy estimation
■ Random sampling: a variation of holdout
■ Repeat holdout k times, accuracy = avg. of the accuracies obtained
■ Cross-validation (k-fold, where k = 10 is most popular)
■ Randomly partition the data into k mutually exclusive subsets, each
approximately equal size
■ At i-th iteration, use Di as test set and others as training set
■ Leave-one-out: k folds where k = # of tuples, for small sized data
■ Stratified cross-validation: folds are stratified so that class dist. in each
fold is approx. the same as that in the initial data
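
A plain (non-stratified) k-fold split can be sketched with the standard library alone. The generator name `k_fold_indices` and the fixed seed are choices made for this illustration; stratification is deliberately left out.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)        # random partition of the data
    folds = [indices[i::k] for i in range(k)]   # k mutually exclusive subsets
    for i in range(k):
        test = folds[i]                         # fold i is the test set
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(10, 5))
for train_idx, test_idx in splits:
    print(sorted(test_idx), len(train_idx))
```

Setting k equal to the number of tuples gives leave-one-out. The accuracy estimate would then be the average of the accuracies measured on the k test folds.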
Evaluating the Accuracy of a Classifier or
Predictor (II)

■ Bootstrap
■ Works well with small data sets
■ Samples the given training tuples uniformly with replacement
■ i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set
■ There are several bootstrap methods; a common one is the .632 bootstrap
■ Suppose we are given a data set of d tuples. The data set is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set end up forming the test set. About 63.2% of the original data will end up in the bootstrap, and the remaining 36.8% will form the test set (since (1 – 1/d)^d ≈ e^-1 = 0.368)
■ Repeat the sampling procedure k times; the overall accuracy of the model is:

$$Acc(M) = \frac{1}{k} \sum_{i=1}^{k} \big(0.632 \times Acc(M_i)_{test\_set} + 0.368 \times Acc(M_i)_{train\_set}\big)$$
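
The 63.2% figure can be checked empirically with a short simulation. The function name `bootstrap_split`, the seed, and the data set size of 10,000 are assumptions of this sketch.

```python
import random


def bootstrap_split(d, seed=0):
    """Sample d indices with replacement; unsampled indices form the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(d) for _ in range(d)]
    chosen = set(train)
    test = [i for i in range(d) if i not in chosen]
    return train, test


d = 10000
train, test = bootstrap_split(d)
print(len(set(train)) / d, len(test) / d)
```

For a data set of this size the fraction of distinct tuples in the training sample should come out close to 0.632, with the rest left for testing.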
Ensemble Methods: Increasing the Accuracy

■ Ensemble methods
■ Use a combination of models to increase accuracy

■ Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an improved model M*
■ Popular ensemble methods
■ Bagging: averaging the prediction over a collection of

classifiers (e.g. Random Forest)

■ Boosting: weighted vote with a collection of classifiers

■ Ensemble: combining a set of heterogeneous classifiers

Bagging: Bootstrap Aggregation
■ Analogy: Diagnosis based on multiple doctors’ majority vote
■ Training
■ Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled with replacement from D (i.e., bootstrap)
■ A classifier model Mi is learned for each training set Di
■ Classification: classify an unknown sample X
■ Each classifier Mi returns its class prediction
■ The bagged classifier M* counts the votes and assigns the class with the most votes to X
■ Prediction: can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple
■ Accuracy
■ Often significantly better than a single classifier derived from D
■ For noisy data: not considerably worse, more robust
■ Proved improved accuracy in prediction
Boosting

■ Analogy: Consult several doctors and combine their weighted diagnoses, with each weight assigned based on previous diagnostic accuracy
■ How does boosting work?
■ Weights are assigned to each training tuple
■ A series of k classifiers is iteratively learned
■ After a classifier Mi is learned, the weights are updated to allow the
subsequent classifier, Mi+1, to pay more attention to the training tuples that
were misclassified by Mi
■ The final M* combines the votes of each individual classifier, where the
weight of each classifier's vote is a function of its accuracy
■ The boosting algorithm can be extended for the prediction of continuous values
■ Compared with bagging: boosting tends to achieve greater accuracy, but it
also risks overfitting the model to misclassified data
Adaboost (Freund and Schapire, 1997)
■ Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
■ Initially, all the weights of tuples are set the same (1/d)
■ Generate k classifiers in k rounds. At round i,
■ Tuples from D are sampled (with replacement) to form a training
set Di of the same size
■ Each tuple’s chance of being selected is based on its weight
■ A classification model Mi is derived from Di
■ Its error rate is calculated using Di as a test set
■ If a tuple is misclassified, its weight is increased; otherwise it is
decreased
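■ Error rate: err(Xj) is the misclassification error of tuple Xj (1 if misclassified, 0 otherwise). The error rate of classifier
Mi is the sum of the weights of the misclassified tuples:
$$error(M_i) = \sum_{j=1}^{d} w_j \times err(X_j)$$

■ The weight of classifier Mi’s vote is
$$\log\frac{1 - error(M_i)}{error(M_i)}$$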
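One AdaBoost round can be traced numerically with the sketch below. It assumes the textbook formulation: the error rate is the sum of the weights of the misclassified tuples, the vote weight is log((1 − error) / error), and correctly classified tuples are scaled by error / (1 − error) before the weights are renormalised. The update rule and the function names are assumptions for illustration; the slides only state that misclassified tuples gain weight and the others lose it.

```python
import math

def error_rate(weights, misclassified):
    """error(Mi): sum of the weights of the misclassified tuples."""
    return sum(w for w, bad in zip(weights, misclassified) if bad)

def vote_weight(err):
    """Weight of the classifier's vote: log((1 - err) / err)."""
    return math.log((1 - err) / err)

def update_weights(weights, misclassified, err):
    """Scale down correctly classified tuples, then renormalise to the old total."""
    factor = err / (1 - err)
    scaled = [w if bad else w * factor for w, bad in zip(weights, misclassified)]
    total_before, total_after = sum(weights), sum(scaled)
    return [w * total_before / total_after for w in scaled]

d = 5
weights = [1 / d] * d                                # all tuples start at 1/d
misclassified = [True, False, False, False, False]   # Mi gets one tuple wrong
err = error_rate(weights, misclassified)
alpha = vote_weight(err)
new_weights = update_weights(weights, misclassified, err)
```

After renormalisation the misclassified tuple carries a larger share of the total weight, so it is more likely to be sampled into the next training set.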

Model Selection: ROC Curves

■ ROC (Receiver Operating Characteristics) curves: for visual comparison of classification models
■ Originated from signal detection theory
■ Shows the trade-off between the true positive rate and the false positive rate
■ The area under the ROC curve is a measure of the accuracy of the model
■ The closer to the diagonal line (i.e., the closer the area is to 0.5), the less accurate is the model
■ Vertical axis represents the true positive rate
■ Horizontal axis represents the false positive rate
■ The plot also shows a diagonal line
■ A model with perfect accuracy will have an area of 1.0
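A small sketch can show how the curve and its area are obtained from ranked scores. It assumes binary labels (1 = positive), distinct scores (ties are not handled), and the trapezoidal rule for the area; `roc_points` and `auc` are hypothetical helper names.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points, lowering the threshold one tuple at a time."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == 1:
            tp += 1      # a positive ranked here raises the true positive rate
        else:
            fp += 1      # a negative ranked here raises the false positive rate
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.7, 0.6]
perfect = auc(roc_points([1, 1, 0, 0], scores))   # positives ranked first
mixed = auc(roc_points([1, 0, 1, 0], scores))     # classes interleaved
worst = auc(roc_points([0, 0, 1, 1], scores))     # positives ranked last
```

A ranking that places every positive above every negative reaches the top-left corner of the plot, while an interleaved ranking pulls the curve toward the diagonal.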
Classification of Class-Imbalanced Data Sets

■ Class-imbalance problem: rare positive examples but numerous negative ones, e.g., medical diagnosis, fraud, oil-spill, fault, etc.
■ Traditional methods assume a balanced distribution of classes and equal
error costs: not suitable for class-imbalanced data
■ Typical methods for imbalanced data in 2-class classification:
■ Oversampling: re-sampling of data from positive class
■ Under-sampling: randomly eliminate tuples from negative class
■ Threshold-moving: moves the decision threshold, t, so that
rare-class tuples are easier to classify, reducing the chance of
false negative errors
■ Ensemble techniques: combine multiple classifiers, as introduced above
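The first three remedies can be sketched in a few lines. The example assumes the negative class is at least as large as the positive class and that a classifier outputs a score in [0, 1]; the helper names and the 5/95 split are illustrative assumptions.

```python
import random

def oversample(pos, neg, rng):
    """Resample positives with replacement until both classes have equal size."""
    extra = [rng.choice(pos) for _ in range(len(neg) - len(pos))]
    return pos + extra, neg

def undersample(pos, neg, rng):
    """Randomly eliminate negatives until both classes have equal size."""
    return pos, rng.sample(neg, len(pos))

def classify(score, t=0.5):
    """Threshold-moving: predict the rare class when the score reaches t."""
    return 1 if score >= t else 0

rng = random.Random(0)
pos = [("p", i) for i in range(5)]     # rare positive class
neg = [("n", i) for i in range(95)]    # numerous negative class
over_pos, over_neg = oversample(pos, neg, rng)
under_pos, under_neg = undersample(pos, neg, rng)
```

Lowering `t` below 0.5 makes the classifier more willing to predict the rare class, trading additional false positives for fewer false negatives.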

■ Feedback: [email protected]

■ I acknowledge all the authors and websites whose content was part of my
lecture.

■ Thanks

Manish@iiita
