Pattern Recognition Application
DATA MINING
Data mining is the discovery of hidden, valuable patterns in data repositories.
Data mining is important because:
o It facilitates the process of extracting the required information from huge amounts of data.
o It helps in making crucial decisions.
This can be achieved by a wide variety of data mining techniques, which can be categorized into supervised and unsupervised learning.
Supervised Learning
Supervised learning algorithms are those used in classification and prediction.
Data is available in which the value of the outcome of interest is known.
Unsupervised Learning
No outcome variable to predict or classify
CLASSIFICATION MODEL BASIC BLOCKS
Data preprocessing → predictive model → evaluation
The Steps in Data Mining
1. Develop an understanding of the purpose of the data mining project
Is it a one-shot effort to answer a question (or questions), or an ongoing application?
2. Obtain the dataset to be used in the analysis.
Random sampling from a large database to capture the records to be used in the analysis.
Pulling together data from different databases.
Usually the analysis to be done requires thousands or tens of thousands of records.
The Steps in Data Mining
3. Explore, clean, and preprocess the data.
The Steps in Data Mining
4. Reduce the data
If supervised learning is involved, separate the data into training, validation, and test datasets.
Eliminating unneeded variables,
Transforming variables,
Creating new variables.
Make sure you know what each variable means, and whether it is sensible to include it in the model.
5. Determine the data mining task
Classification, prediction, clustering, etc.
6. Choose the data mining techniques to be used
Regression, neural nets, hierarchical clustering, etc.
The Steps in Data Mining
7. Use algorithms to perform the task.
This is an iterative process: trying multiple variants, and often using multiple variants of the same algorithm (choosing different variables or settings within the algorithm).
When appropriate, feedback from the algorithm's performance on validation data is used to refine the settings.
8. Interpret the results of the algorithms.
Choose the best algorithm to deploy.
Apply the final choice to the test data to get an idea of how well it will perform.
9. Deploy the model.
Integrate the model into operational systems
Run it on real records to produce decisions or actions.
Example
Open MATLAB and write the following command:
load fisheriris
Iris Data
Fisher's Iris database (Fisher, 1936) is perhaps the best known database in the pattern recognition literature.
The dataset contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other two; the latter are not linearly separable from each other.
Iris Data
The database contains the following attributes:
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
- Iris Setosa
- Iris Versicolour
- Iris Virginica
Fisher's Iris database is available in MATLAB (load fisheriris).
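If you want to check what load fisheriris puts in the workspace (meas holds the 150x4 numeric measurements, species the 150x1 cell array of class names), you can run:
load fisheriris
whos meas species    % meas: 150x4 double, species: 150x1 cell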
Plot the Data
Now let's plot the Iris data and see how the sepal measurements differ between species.
You can use the two columns containing the sepal measurements.
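One way to produce this plot (a minimal sketch, assuming the Statistics and Machine Learning Toolbox and that fisheriris is already loaded):
gscatter(meas(:,1), meas(:,2), species);   % sepal length vs. sepal width, grouped by species
xlabel('Sepal length (cm)');
ylabel('Sepal width (cm)');
title('Fisher''s Iris data: sepal measurements by species');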
Next, fit a linear discriminant classifier to the two sepal columns and obtain its predictions on the training data.
MATLAB 2016
lda = fitcdiscr(meas(:,1:2), species);   % fit a linear discriminant classifier to the sepal columns
ldaClass = resubPredict(lda);            % predicted labels for the training data (resubstitution)
Assignment
Prepare a presentation discussing the theory of operation of the discriminant analysis classifier, including:
1. Idea of operation
2. Mathematical formulation
3. Advantages and disadvantages
4. The different forms of writing its function in different versions of MATLAB (2010, 2013, 2016, 2017), with a detailed description of the function employed in the version you are using
Computing the error
Now compute the resubstitution error, which
is the misclassification error (the proportion of
misclassified observations) on the training set.
MATLAB 2013
bad = ~strcmp(ldaClass, species);   % logical vector: true where a prediction is wrong
N = size(meas,1);                   % number of observations
ldaResubErr = sum(bad)/N;           % proportion misclassified on the training set
MATLAB 2017
ldaResubErr = resubLoss(lda)        % resubstitution (training-set) misclassification error
Evaluation
How predictive is the model we learned?
Error on the training data is not a good
indicator of performance on future data
Q: Why?
A: Because new data will probably not be exactly
the same as the training data!
Overfitting (fitting the training data too precisely) usually leads to poor results on new data.
Evaluation issues
Possible evaluation measures:
Classification Accuracy
Total cost/benefit when different errors involve
different costs
Lift and ROC curves
Error in numeric predictions
How reliable are the predicted results?
Classifier error rate
Natural performance measure for
classification problems: error rate
Success: the instance's class is predicted correctly
Error: the instance's class is predicted incorrectly
Error rate: proportion of errors made over the whole set of instances
The training set error rate is way too optimistic!
you can find patterns even in random data
Evaluation on LARGE data
If many (thousands) of examples are available,
including several hundred examples from each
class, then a simple evaluation is sufficient
Randomly split data into training and test sets
(usually 2/3 for train, 1/3 for test)
Build a classifier using the train set and
evaluate it using the test set.
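A minimal sketch of this train/test evaluation on the Iris data (assumes fisheriris is loaded; cvpartition performs a stratified holdout split):
rng(1);                                      % for reproducibility
cv = cvpartition(species, 'HoldOut', 1/3);   % ~2/3 train, ~1/3 test, stratified by class
Xtrain = meas(training(cv), 1:2);            % sepal columns only, as in the example above
ytrain = species(training(cv));
Xtest  = meas(test(cv), 1:2);
ytest  = species(test(cv));
mdl  = fitcdiscr(Xtrain, ytrain);            % build the classifier on the training set
pred = predict(mdl, Xtest);                  % evaluate it on the held-out test set
testErr = mean(~strcmp(pred, ytest))         % test-set error rate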
Classification Step 1:
Split data into train and test sets
Diagram: data with known results ("the past") is split into a training set and a testing set.
Classification Step 2:
Build a model on a training set
Diagram: the model builder learns a model from the training set (results known); the testing set is held aside.
Classification Step 3:
Evaluate on test set (Re-train?)
Diagram: the model's predictions (Y/N) on the testing set are compared with the known results to evaluate the model, and the model may be re-trained.
Handling unbalanced data
Sometimes, classes have very unequal frequency
Attrition prediction: 97% stay, 3% attrite (in a month)
medical diagnosis: 90% healthy, 10% disease
eCommerce: 99% don't buy, 1% buy
Security: >99.99% of Americans are not terrorists
Similar situation with multiple classes
A majority-class classifier can be 97% correct, but useless
Balancing unbalanced data
With two classes, a good approach is to build BALANCED train and test sets, and train the model on a balanced set (as sketched below):
randomly select the desired number of minority-class instances
add an equal number of randomly selected majority-class instances
Generalize balancing to multiple classes:
Ensure that each class is represented with approximately equal proportions in the train and test sets
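A minimal sketch of this undersampling idea for two classes (hypothetical variables: X is the feature matrix, y a cell array of labels, and 'yes' is assumed to be the minority-class label):
minIdx = find(strcmp(y, 'yes'));                          % minority-class rows (assumed label)
majIdx = find(~strcmp(y, 'yes'));                         % majority-class rows
pick   = majIdx(randperm(numel(majIdx), numel(minIdx)));  % equal-size random sample of the majority class
balIdx = [minIdx; pick(:)];                               % balanced index set
Xbal = X(balIdx, :);
ybal = y(balIdx);                                         % each class now has the same count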
A note on parameter tuning
It is important that the test data is not used in any way to
create the classifier
Some learning schemes operate in two stages:
Stage 1: builds the basic structure
Stage 2: optimizes parameter settings
The test data can't be used for parameter tuning!
Proper procedure uses three sets: training data, validation
data, and test data
Validation data is used to optimize parameters
Making the most of the data
Once evaluation is complete, all the data can
be used to build the final classifier
Generally, the larger the training data the
better the classifier (but returns diminish)
The larger the test data the more accurate the
error estimate
Classification:
Train, Validation, Test split
Diagram: the model builder learns from the training set (results known); its predictions on the validation set are evaluated to tune and select the model; the final model then gets a final evaluation on a separate final test set.
Evaluating Classification &
Predictive Performance
Why Evaluate?
Accuracy Measures
(Classification)
Misclassification error
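In the notation used in the example below, the misclassification (overall error) rate is
err = (number of misclassified records) / (total number of records)
and accuracy = 1 - err.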
Confusion Matrix
Confusion matrix glossary
In a 2-class problem where the class is either C or not C, the confusion matrix looks like this:
                 Classifier Output
True Class       C         not C
C                TP        FN
not C            FP        TN
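In terms of these counts (with n = TP + FP + TN + FN), the measures used in the example below are:
Overall error rate = (FP + FN) / n
Accuracy = (TP + TN) / n = 1 - error rate
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)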
Example
Solution:
TP = 201, FP = 25, TN = 2689, FN = 85
Overall error rate = (25+85)/3000 = 3.67%
Accuracy = 1 - err = (201+2689)/3000 = 96.33%
Sensitivity = 201/(201+85) = 70.28%
Specificity = 2689/(2689+25) = 99.08%
Cutoff for classification
Most DM algorithms classify via a 2-step process:
For each record,
1. Compute probability of belonging to class 1
2. Compare to cutoff value, and classify accordingly
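A minimal sketch of this two-step procedure in MATLAB, assuming a fitted classifier mdl (e.g. from fitcdiscr) whose second score column is the probability of class 1, and new records Xnew; 0.5 is just an illustrative cutoff:
[~, post] = predict(mdl, Xnew);     % step 1: posterior probability of each class for every record
cutoff = 0.5;                       % illustrative cutoff value
predClass1 = post(:,2) >= cutoff;   % step 2: classify as class 1 when its probability exceeds the cutoff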
Cutoff Table
Actual Class   Prob. of "1"       Actual Class   Prob. of "1"
1              0.996              1              0.506
1              0.988              0              0.471
1              0.984              0              0.337
1              0.980              1              0.218
1              0.948              0              0.199
1              0.889              0              0.149
1              0.848              0              0.048
0              0.762              0              0.038
1              0.707              0              0.025
1              0.681              0              0.022
1              0.656              0              0.016
0              0.622              0              0.004

Confusion matrix with cutoff = 0.25 (columns: predicted owner, predicted non-owner; class "1" = owner):
owner          11     1
non-owner       4     8

Confusion matrix with cutoff = 0.75 (columns: predicted owner, predicted non-owner):
owner           7     5
non-owner       1    11
Confusion Matrix on Iris data example
[ldaResubCM, grpOrder] = confusionmat(species, ldaClass)

ldaResubCM =
    49     1     0
     0    36    14
     0    15    35

grpOrder =
  3x1 cell array
    {'setosa'    }
    {'versicolor'}
    {'virginica' }
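The resubstitution error computed earlier can be recovered from this matrix: the off-diagonal entries are the misclassified observations (a quick check, assuming ldaResubCM as above):
nErrors = sum(ldaResubCM(:)) - sum(diag(ldaResubCM));   % 1 + 14 + 15 = 30 misclassified
ldaResubErrCheck = nErrors / sum(ldaResubCM(:))         % 30/150 = 0.20, matching resubLoss(lda)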