Classification BP Regression KNN Other Classifiers - Final
Say maxA was 100 and minA was 20 (that is, the maximum and minimum values of the attribute).
Now, if v = 40 (the value of this attribute for the particular pattern), v' is calculated as:
v' = (40 - 20) x (1 - 0) / (100 - 20) + 0
=> v' = 20 x 1/80
=> v' = 0.25
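A minimal Python sketch of this min-max computation (the function name is illustrative; the numbers are the ones from the example above):

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] onto [new_min, new_max] (min-max normalization)."""
    return (v - min_a) * (new_max - new_min) / (max_a - min_a) + new_min

print(min_max_normalize(40, 20, 100))  # 0.25, as in the worked example
```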
Decimal Scaling Normalization
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1.
Example: if the attribute values range from -986 to 917, the maximum absolute value is 986, so j = 3 and every value is divided by 1,000 (e.g., -986 normalizes to -0.986).
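A small Python sketch of decimal scaling under this definition; the sample values are just the ones used in the example above:

```python
import math

def decimal_scaling(values):
    """Divide every value by 10^j, where j is the smallest integer that makes
    the largest absolute normalized value less than 1."""
    j = math.floor(math.log10(max(abs(v) for v in values))) + 1
    return [v / 10 ** j for v in values], j

normalized, j = decimal_scaling([-986, 917])
print(j, normalized)  # j = 3 -> [-0.986, 0.917]
```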
One Neuron as a Network
■ The input values x1 and x2, multiplied by the weight values w1 and w2, are fed into the neuron x.
Bias of a Neuron
[Figure: the lines x1 - x2 = -1, x1 - x2 = 0, and x1 - x2 = 1 in the (x1, x2) plane, showing how the bias shifts the decision boundary.]
Bias as extra input
[Figure: the bias treated as an extra input x0 = +1 with weight w0; the input attribute values x1, x2, …, xm with weights w1, w2, …, wm feed a summing function whose result v passes through an activation function to produce the output y (the class).]
Neuron with Activation
■ The neuron is the basic information-processing unit of a NN. It consists of:
■ a set of connecting links (synapses), each characterized by a weight;
■ an adder that computes the weighted sum of the inputs; and
■ an activation function that limits the amplitude of the neuron's output.
Neural Network Learning
[Figure: a fully connected feed-forward network with input nodes (fed by the input record xi), hidden nodes, and output nodes Ok, k = 1, 2, …, #classes; wij denotes the connection weights.]
[Figure: a hidden or output unit j with inputs x0, x1, …, xn, weights w0j, w1j, …, wnj, a bias θj, a summing stage ∑ and an activation function f producing the output y.]
■ The inputs to unit j are outputs from the previous layer. These are
multiplied by their corresponding weights in order to form a
weighted sum, which is added to the bias associated with unit j.
■ A nonlinear activation function f is applied to the net input.
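A minimal Python sketch of this computation for a single unit j, assuming the sigmoid activation used in the forward-propagation equations below:

```python
import math

def unit_output(prev_outputs, weights, bias):
    """Net input = weighted sum of the previous layer's outputs plus the bias,
    squashed by a sigmoid activation function."""
    net = sum(o * w for o, w in zip(prev_outputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-net))
```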
Propagate the inputs forward
■ For unit j in the input layer, its output is equal to its input, that is, Oj = Ij.
■ For a hidden or output unit j, the net input is Ij = Σi wij Oi + θj and the output is Oj = 1 / (1 + e^(-Ij)), where wij is the weight of the connection from unit i in the previous layer to unit j and Oi is the output of unit i from the previous layer.
Backpropagate the error
■ For unit j in the output layer, the error is Errj = Oj (1 - Oj)(Tj - Oj), where Tj is the known target value.
■ For unit j in a hidden layer, the error is Errj = Oj (1 - Oj) Σk Errk wjk, where wjk is the weight of the connection from unit j to unit k in the next higher layer, and Errk is the error of unit k.
Update weights and biases
■ Weights and biases are updated by the following equations:
Δwij = (l) Errj Oi,  wij = wij + Δwij
Δθj = (l) Errj,  θj = θj + Δθj
where l is a constant between 0.0 and 1.0 reflecting the learning rate; in this implementation the learning rate is fixed.
Example of Backpropagation
Network: 3 input units, 2 hidden units (units 4 and 5), 1 output unit (unit 6).
Initialize the weights and the biases θ4, θ5, θ6 randomly.
Net input and output of unit 6:
I6 = (-0.3)(0.332) + (-0.2)(0.525) + 0.1 = -0.105
O6 = 1 / (1 + e^0.105) = 0.474
Calculation of Error at Each Node
Assuming the target value T6 = 1, the error of output unit 6 is
Err6 = O6 (1 - O6)(T6 - O6) = 0.474 (1 - 0.474)(1 - 0.474) = 0.1311
The errors of the remaining units are computed similarly.
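A short Python check of this worked example, using only the values given above (O4 = 0.332, O5 = 0.525, w46 = -0.3, w56 = -0.2, θ6 = 0.1, T6 = 1); the learning rate l = 0.9 in the final update step is an assumed value, since the slides do not specify it:

```python
import math

O4, O5 = 0.332, 0.525      # outputs of hidden units 4 and 5 (given above)
w46, w56 = -0.3, -0.2      # weights into output unit 6
theta6, T6 = 0.1, 1.0      # bias of unit 6 and its target value

# Forward pass for unit 6: weighted sum plus bias, then sigmoid activation.
I6 = w46 * O4 + w56 * O5 + theta6        # ~ -0.105
O6 = 1.0 / (1.0 + math.exp(-I6))         # ~ 0.474

# Error of the output unit: Err_j = O_j (1 - O_j) (T_j - O_j).
Err6 = O6 * (1.0 - O6) * (T6 - O6)       # ~ 0.1311

# One weight/bias update with an assumed learning rate l = 0.9.
l = 0.9
w46_new = w46 + l * Err6 * O4
theta6_new = theta6 + l * Err6

print(f"I6={I6:.3f} O6={O6:.3f} Err6={Err6:.4f} w46'={w46_new:.3f} theta6'={theta6_new:.3f}")
```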
Network Pruning and Rule Extraction
■ Network pruning
■ Fully connected network will be hard to
articulate
■ N input nodes, h hidden nodes and m output
nodes lead to h(m+N) weights
■ Pruning: Remove some of the links without
affecting classification accuracy of the network
Applications
■ Handwritten Digit Recognition
■ Face recognition
■ Time series prediction
■ Process identification
■ Process control
■ Optical character recognition
Application-II
[Figure: scatter plots of Y against X for data with various correlation coefficients: r = -1, r = -0.6, r = 0, r = +1, r = +0.3, and r = 0.]
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Linear Correlation
[Figure: scatter plots contrasting linear relationships with curvilinear relationships.]
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Linear Correlation
[Figure: scatter plots contrasting strong relationships with weak relationships.]
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Linear Correlation
[Figure: scatter plot showing no relationship between X and Y.]
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Calculating by hand…
r = Σ(xi - x̄)(yi - ȳ) / √( Σ(xi - x̄)² · Σ(yi - ȳ)² )
Simpler calculation formula: r = SSxy / √(SSxx · SSyy), i.e., the numerator of the covariance divided by the square root of the product of the numerators of the two variances.
Distribution of the correlation coefficient: its standard error is SE(r) = √((1 - r²) / (n - 2)).
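A short Python sketch of this hand calculation, with SSxy, SSxx and SSyy as defined above:

```python
def pearson_r(x, y):
    """Pearson correlation coefficient: r = SSxy / sqrt(SSxx * SSyy)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    ss_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    ss_xx = sum((xi - mx) ** 2 for xi in x)
    ss_yy = sum((yi - my) ** 2 for yi in y)
    return ss_xy / (ss_xx * ss_yy) ** 0.5
```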
What's Slope?
The slope β is the expected change in Y for a 1-unit increase in X.
Regression Picture
[Figure: for each observation yi, the distances A, B and C relative to the naïve mean of y and the fitted regression line, with Sy/x denoting the scatter around the line; squaring and summing these distances over all observations gives the sums of squares in the table below.]
R² = SSreg / SStotal
A (SStotal): total squared distance of the observations from the naïve mean of y; the total variation.
B (SSreg): distance from the regression line to the naïve mean of y; the variability due to x (the regression).
C (SSresidual): variance around the regression line; the additional variability not explained by x, which the least squares method aims to minimize.
Estimating the intercept and slope: least squares estimation
** Least Squares Estimation
A little calculus….
What are we trying to estimate? β, the slope, from the model ŷ = α + βx.
What's the constraint? We are trying to minimize the squared distance (hence the "least squares") between the observations themselves and the predicted values, i.e., the residuals yi - ŷi (the left-over unexplained variability).
Find the β that gives the minimum sum of the squared differences, Σ(yi - ŷi)². How do you minimize a function? Take the derivative, set it equal to zero, and solve: a typical max/min problem from calculus.
This gives the slope β̂ = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)², and the intercept α̂ = ȳ - β̂x̄.
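A brief Python sketch of these least-squares estimates, together with the R² = SSreg / SStotal defined earlier (function and variable names are illustrative):

```python
def least_squares(x, y):
    """Least-squares slope and intercept for y = alpha + beta * x, plus R^2."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    beta = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
    alpha = my - beta * mx                      # intercept = mean(y) - slope * mean(x)
    ss_total = sum((yi - my) ** 2 for yi in y)
    ss_resid = sum((yi - (alpha + beta * xi)) ** 2 for xi, yi in zip(x, y))
    r_squared = 1.0 - ss_resid / ss_total       # equals SSreg / SStotal for this fit
    return beta, alpha, r_squared
```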
In correlation, the two variables are treated as equals. In regression, one variable is considered
independent (=predictor) variable (X) and the other the dependent (=outcome) variable Y.
Formula for the standard error of beta (you will not have to calculate this by hand!):
SE(β̂) = sy·x / √( Σ(xi - x̄)² ), where s²y·x = SSresidual / (n - 2) is the variance around the regression line.
Residual Analysis: check
assumptions
[Figure: residual-vs-x plots contrasting a not-linear fit (a curved residual pattern) with a ✔ linear fit (a random scatter of residuals around zero).]
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Residual Analysis for Homoscedasticity
[Figure: residual-vs-x plots contrasting non-constant variance with constant variance (homoscedasticity).]
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Residual Analysis for Independence
[Figure: residual plots over X contrasting residuals that are not independent (a systematic pattern) with ✔ independent residuals (no pattern).]
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Other types of multivariate
regression
● Multiple linear regression is for normally
distributed outcomes
● Binary outcome (e.g., high blood pressure, yes/no): logistic regression.
Model: ln(odds of high blood pressure) = α + βsalt·salt consumption (tsp/day) + βage·age (years) + βsmoker·ever smoker (yes=1/no=0).
Gives odds ratios, which tell you how much the odds of the outcome increase for every 1-unit increase in each predictor.
● Time-to-event outcome (e.g., time to death): Cox regression.
Model: ln(rate of death) = α + βsalt·salt consumption (tsp/day) + βage·age (years) + βsmoker·ever smoker (yes=1/no=0).
Gives hazard ratios, which tell you how much the rate of the outcome increases for every 1-unit increase in each predictor.
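For the logistic model above, the reported odds ratio for a predictor is the exponential of its coefficient; a tiny sketch, where the coefficient value is purely hypothetical and not taken from the slides:

```python
import math

beta_salt = 0.25                      # hypothetical coefficient for salt (tsp/day)
odds_ratio = math.exp(beta_salt)      # odds ratio per 1 tsp/day increase in salt
print(f"odds of high blood pressure multiply by {odds_ratio:.2f} per extra tsp/day")
```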
Multivariate regression pitfalls
● Multi-collinearity
● Residual confounding
● Overfitting
Multicollinearity
● Multicollinearity arises when two variables that
measure the same thing or similar things (e.g.,
weight and BMI) are both included in a multiple
regression model; they will, in effect, cancel each
other out and generally destroy your model.
Exercise, sleep, and high ratings for Clinton are negatively related to optimism (highly
significant!) and high ratings for Obama and high love of math are positively related to
optimism (highly significant!).
Overfitting
Rule of thumb: You need at
least 10 subjects for each
additional predictor
variable in the multivariate
regression model.
Distance-weighted kNN
▪ Neighbours can be weighted by their distances to the query point.
▪ When all training points may influence a particular instance, this is Shepard's method.
Remarks
+Highly effective inductive inference method for
noisy training data and complex target
functions
+Target function for a whole space may be
described as a combination of less complex
local approximations
+Learning is very simple
- Classification is time consuming
Nearest-Neighbor Classifiers: Issues
Rule of thumb:
K = sqrt(N)
N: number of training points
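A minimal Python sketch of a k-NN classifier that defaults to this rule of thumb (Euclidean distance is assumed; see the scale-effects caveat that follows):

```python
import math
from collections import Counter

def knn_classify(query, X_train, y_train, k=None):
    """Classify `query` by majority vote among its k nearest training points.
    Defaults to the rule of thumb k = sqrt(N)."""
    if k is None:
        k = max(1, round(math.sqrt(len(X_train))))
    neighbours = sorted(
        ((math.dist(query, x), label) for x, label in zip(X_train, y_train)),
        key=lambda pair: pair[0],
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]
```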
Distance Metrics
■ Euclidean distance: d(x, y) = √( Σi (xi - yi)² )
■ Manhattan distance: d(x, y) = Σi |xi - yi|
Distance Measure: Scale
Effects
■ Different features may have different
measurement scales
■ E.g., patient weight in kg (range [50,200])
vs. blood protein values in ng/dL (range
[-3,3])
■ Consequences
■ Patient weight will have a much greater
influence on the distance between samples
■ May bias the performance of the classifier
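One common remedy, sketched below, is to standardize (z-score) each feature before computing distances so that, e.g., weight in kg and protein values contribute comparably; the row-of-features data layout is an assumption:

```python
def standardize_columns(X):
    """Z-score each column of X (a list of equal-length feature rows) so that
    features on different measurement scales contribute comparably to distances."""
    n = len(X)
    m = len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(m)]
    stds = [(sum((row[j] - means[j]) ** 2 for row in X) / n) ** 0.5 or 1.0 for j in range(m)]
    return [[(row[j] - means[j]) / stds[j] for j in range(m)] for row in X]
```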
Nearest Neighbour: Computational Complexity
■ Expensive
■ To determine the nearest neighbour of a query point q, must
compute the distance to all N training examples
+ Pre-sort training examples into fast data structures (kd-trees); see the sketch after this list
+ Compute only an approximate distance
+ Remove redundant data (condensing)
■ Storage Requirements
■ Must store all training data
+ Remove redundant data (condensing)
- Pre-sorting often increases the storage requirements
■ High Dimensional Data
■ “Curse of Dimensionality”
■ Required amount of training data increases exponentially with
dimension
■ Computational cost also increases dramatically
■ Partitioning techniques degrade to linear search in high dimension
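As a sketch of the pre-sorting idea mentioned in the list above (assuming SciPy is available), a kd-tree is built once and then answers nearest-neighbour queries without scanning all N training points on every query:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.random((10_000, 5))    # N training points in 5 dimensions (toy data)
tree = cKDTree(X_train)              # pre-sort the training set once

query = rng.random(5)
dist, idx = tree.query(query, k=1)   # nearest neighbour of the query point
print(idx, dist)
```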
Reduction in Computational
Complexity
■ Reduce size of training set
■ Condensation, editing
Every query point will be assigned the classification of the training sample in whose cell (Voronoi cell) it falls. The decision boundary separates the class regions based on the 1-NN decision rule.
The boundary itself is rarely computed; many algorithms seek to retain only those points necessary to generate an identical
boundary.
Condensing
KNN: Alternate Terminologies
■ Instance-based learning
■ Lazy learning
■ Memory-based learning
■ Case-based reasoning
Evaluating the Accuracy of a Classifier or Predictor (I)
■ Holdout method
■ Given data is randomly partitioned into two independent sets
■ Training set (e.g., 2/3) for model construction
■ Test set (e.g., 1/3) for accuracy estimation
■ Random sampling: a variation of holdout
■ Repeat holdout k times, accuracy = avg. of the accuracies obtained
■ Cross-validation (k-fold, where k = 10 is most popular)
■ Randomly partition the data into k mutually exclusive subsets, each of approximately equal size
■ At i-th iteration, use Di as test set and others as training set
■ Leave-one-out: k folds where k = # of tuples, for small sized data
■ Stratified cross-validation: folds are stratified so that class dist. in each
fold is approx. the same as that in the initial data
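A compact Python sketch of the k-fold procedure just described (k = 10 by default); `train_and_score` is an assumed callback that trains a model on the given training indices and returns its accuracy on the test indices:

```python
import random

def k_fold_accuracy(n, train_and_score, k=10, seed=0):
    """Randomly partition indices 0..n-1 into k folds of roughly equal size;
    at iteration i, fold Di is the test set and the rest form the training set.
    Returns the average of the k accuracies."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    accuracies = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        accuracies.append(train_and_score(train_idx, test_idx))
    return sum(accuracies) / k
```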
Evaluating the Accuracy of a Classifier or
Predictor (II)
■ Bootstrap
■ Works well with small data sets
■ Samples the given training tuples uniformly with replacement
■ i.e., each time a tuple is selected, it is equally likely to be
selected again and re-added to the training set
■ There are several bootstrap methods; a common one is the .632 bootstrap
■ Suppose we are given a data set of d tuples. The data set is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set end up forming the test set. About 63.2% of the original data will end up in the bootstrap sample, and the remaining 36.8% will form the test set (since (1 - 1/d)^d ≈ e^(-1) = 0.368)
■ Repeat the sampling procedure k times; the overall accuracy of the model is
Acc(M) = (1/k) Σi [ 0.632 · Acc(Mi)test_set + 0.368 · Acc(Mi)train_set ]
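A small simulation of the .632 bootstrap's sampling step, checking the ≈63.2% / 36.8% split quoted above (the data set size d and the seed are arbitrary):

```python
import random

d = 1000                                       # number of tuples (arbitrary demo size)
rng = random.Random(42)

train = [rng.randrange(d) for _ in range(d)]   # sample d times with replacement
test = set(range(d)) - set(train)              # tuples never drawn form the test set

print(f"unique tuples in the bootstrap sample: {len(set(train)) / d:.1%}")  # ~63.2%
print(f"tuples left for the test set:          {len(test) / d:.1%}")        # ~36.8%
```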
Ensemble Methods: Increasing the Accuracy
■ Ensemble methods
■ Use a combination of models to increase accuracy
■ Feedback: [email protected]
■ I acknowledge all the authors and websites whose content was part of my
lecture.
■ Thanks
Manish@iiita