Classification Ppts 2021


Classification

Classification consists of assigning a class label to a set of unclassified cases.

1. Supervised Classification
The set of possible classes is known in advance.

2. Unsupervised Classification
The set of possible classes is not known in advance. After classification we can try to assign a name to each discovered class. Unsupervised classification is also called clustering.
Why Classification? A motivating application

• Credit approval
– A bank wants to classify its customers based on whether
they are expected to pay back their approved loans
– The history of past customers is used to train the classifier
– The classifier provides rules, which identify potentially
reliable future customers
– Classification rule:
• If age = “31...40” and income = high then credit_rating =
excellent
– Future customers
• Paul: age = 35, income = high → excellent credit rating
• John: age = 20, income = medium → fair credit rating
Supervised Classification
The input data, also called the training set,
consists of multiple records each having multiple
attributes or features.
Each record is tagged with a class label.
The objective of classification is to analyze the
input data and to develop an accurate description
or model for each class using the features present
in the data.
This model is used to classify test data for which
the class descriptions are not known.
Classification and Prediction
• Classification is the process of finding a model that describes data classes, so that the model can be used to predict the class of objects whose class label is unknown.

• Prediction:
– predicts unknown or missing values
• Typical applications

– Credit approval
– Target marketing
– Medical diagnosis
– Fraud detection



Classification—A Two-Step Process
• Model construction:
– describing a set of predetermined classes
– The model is represented as classification rules,
decision trees, or mathematical formulae

• Model usage:
– for classifying future or unknown objects
– test sample is compared with the classified result
from the model



Process (1): Model Construction

The training data are fed to a classification algorithm, which produces the classifier (model).

Training Data:

NAME | RANK | YEARS | TENURED
Mike | Assistant Prof | 3 | no
Mary | Assistant Prof | 7 | yes
Bill | Professor | 2 | yes
Jim | Associate Prof | 7 | yes
Dave | Assistant Prof | 6 | no
Anne | Associate Prof | 3 | no

Classifier (Model):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Process (2): Using the Model in Prediction

The classifier is applied first to testing data, then to unseen data.

Testing Data:

NAME | RANK | YEARS | TENURED
Tom | Assistant Prof | 2 | no
Merlisa | Associate Prof | 7 | yes
George | Professor | 5 | yes
Joseph | Assistant Prof | 7 | yes

Unseen Data: (Jeff, Professor, 4) → Tenured?
Classification Model

• A classification model can be represented in various forms, such as:
» IF-THEN rules
» A decision tree
» A neural network
Decision Tree - Classification
• Decision tree builds classification models in the form of a
tree structure.
• It breaks down a dataset into smaller and smaller subsets
while at the same time an associated decision tree is
incrementally developed.
• The final result is a tree with decision nodes and leaf nodes.
• A decision node has two or more branches.
• A leaf node represents a classification or decision.
• The topmost decision node in a tree, which corresponds to the best predictor, is called the root node.
• Decision trees can handle both categorical and numerical data.
Example 1: Using the given training data set, create a classification model using a decision tree.

[Figure: training data and the resulting decision tree, with branches Outlook = Overcast and Outlook = Sunny]
Example 2: Using the given training data set, create a classification model using a decision tree.
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Output: A Decision Tree for “buys_computer”

age?
– <=30: student?
    – no: no
    – yes: yes
– 31..40: yes
– >40: credit rating?
    – excellent: no
    – fair: yes
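The same tree can be reproduced in code. Below is a minimal sketch (ours, not from the slides) that fits scikit-learn's DecisionTreeClassifier with the entropy (information-gain) criterion to the buys_computer table; scikit-learn grows binary trees over one-hot encoded attributes, so the printed tree is an equivalent binary form of the multiway tree above.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# The buys_computer training set from the slides.
data = pd.DataFrame({
    "age":           ["<=30", "<=30", "31...40", ">40", ">40", ">40", "31...40",
                      "<=30", "<=30", ">40", "<=30", "31...40", "31...40", ">40"],
    "income":        ["high", "high", "high", "medium", "low", "low", "low",
                      "medium", "low", "medium", "medium", "medium", "high", "medium"],
    "student":       ["no", "no", "no", "no", "yes", "yes", "yes",
                      "no", "yes", "yes", "yes", "no", "yes", "no"],
    "credit_rating": ["fair", "excellent", "fair", "fair", "fair", "excellent",
                      "excellent", "fair", "fair", "fair", "excellent", "excellent",
                      "fair", "excellent"],
    "buys_computer": ["no", "no", "yes", "yes", "yes", "no", "yes",
                      "no", "yes", "yes", "yes", "yes", "yes", "no"],
})

# One-hot encode the categorical attributes so sklearn can split on them.
X = pd.get_dummies(data.drop(columns="buys_computer"))
y = data["buys_computer"]

# The "entropy" criterion corresponds to the information-gain measure (ID3/C4.5).
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```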


Algorithm for Decision Tree Induction

• Basic algorithm (a greedy algorithm)


– Tree is constructed in a top-down recursive divide-and-conquer manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized in
advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning – majority
voting is employed for classifying the leaf
– There are no samples left



Attribute Selection Measure: Information Gain
(ID3/C4.5)
• Select the attribute with the highest information gain
• Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D|
• Expected information (entropy) needed to classify a tuple in D:

    Info(D) = − Σ_{i=1}^{m} pi log2(pi)

• Information needed (after using attribute A to split D into v partitions) to classify D:

    Info_A(D) = Σ_{j=1}^{v} (|Dj| / |D|) × Info(Dj)

• Information gained by branching on attribute A:

    Gain(A) = Info(D) − Info_A(D)


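To make these formulas concrete, here is a small Python sketch (ours, not from the slides) that computes Info(D), Info_A(D), and Gain(A) for records stored as dicts; the helper names info and gain are our own.

```python
import math
from collections import Counter

def info(labels):
    """Expected information (entropy) Info(D) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gain(rows, attr, label_attr):
    """Information gain of splitting the records `rows` (dicts) on `attr`."""
    labels = [r[label_attr] for r in rows]
    partitions = {}
    for r in rows:                      # group class labels by attribute value
        partitions.setdefault(r[attr], []).append(r[label_attr])
    info_a = sum(len(p) / len(rows) * info(p) for p in partitions.values())
    return info(labels) - info_a        # Gain(A) = Info(D) - Info_A(D)
```

For the buys_computer data above, gain(rows, "age", "buys_computer") evaluates to about 0.246, the largest of the four attributes, which is why age becomes the root of the tree.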
Example 3: Create a classification model using a decision tree for the following training data set.

Sr. No. | Income | Age | Own House
1 | Very high | Young | Yes
2 | High | Medium | Yes
3 | Low | Young | Rented
4 | High | Medium | Yes
5 | Very high | Medium | Yes
6 | Medium | Young | Yes
7 | High | Old | Yes
8 | Medium | Medium | Rented
9 | Low | Medium | Rented
10 | Low | Old | Rented
11 | High | Young | Yes
12 | Medium | Old | Rented

Total no. of records = 12; Yes = 7, Rented = 5.

Entropy(Own House) = E(7/12, 5/12) = E(0.58, 0.42)
= −(0.58 log2 0.58) − (0.42 log2 0.42)
= 0.98
• Step 1: for E(Income) we have

Income | Yes | Rented | Total
Very high | 2 | 0 | 2
High | 4 | 0 | 4
Low | 0 | 3 | 3
Medium | 1 | 2 | 3

E(Own House, Income) = p(vh)·E(vh) + p(h)·E(h) + p(l)·E(l) + p(m)·E(m)

E(O,I) = [2/12 · E(2/2, 0/2)] + [4/12 · E(4/4, 0/4)] + [3/12 · E(0/3, 3/3)] + [3/12 · E(1/3, 2/3)]

The first three partitions are pure (entropy 0), so only the Medium partition contributes:

E(O,I) = (3/12) · E(1/3, 2/3)
= 0.25 × [−(0.33 log2 0.33) − (0.67 log2 0.67)]
= 0.25 × 0.92
= 0.23

• Gain(O,I) = E(O) − E(O,I) = 0.98 − 0.23 = 0.75

• Information Gain
Information gain is based on the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches).
Step 2: for E(Age) we have

Age | Yes | Rented | Total
Young | 3 | 1 | 4
Medium | 3 | 2 | 5
Old | 1 | 2 | 3

E(Own House, Age) = p(y)·E(y) + p(m)·E(m) + p(o)·E(o)
= [(4/12)·E(3/4, 1/4)] + [(5/12)·E(3/5, 2/5)] + [(3/12)·E(1/3, 2/3)]
= [(4/12)·E(0.75, 0.25)] + [(5/12)·E(0.6, 0.4)] + [(3/12)·E(0.33, 0.67)]
= 0.90

• Gain(O,A) = E(O) − E(O,A) = 0.98 − 0.90 = 0.08
The Income attribute has the highest gain, so it is used as the decision attribute in the root node.

R1: If (Income = Very high) then Own House = Yes

R2: If (Income = Medium) and (Age = Old) then Own House = Rented

For Income = Medium we have three records:

Income | Age | Own House
Medium | Young | Yes
Medium | Medium | Rented
Medium | Old | Rented
Bayesian classification
• A Bayes classifier is a simple probabilistic classifier
• In simple terms, a naive Bayes classifier assumes that the
presence (or absence) of a particular feature of a class is
unrelated to the presence (or absence) of any other feature.
• For example, a fruit may be considered to be an apple if it is
red, round, and about 4" in diameter. Even if these features
depend on each other or upon the existence of the other
features, a naive Bayes classifier considers all of these
properties to independently contribute to the probability that
this fruit is an apple
• Bayesian classifiers are able to predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.
Naïve Bayesian classification

• Naïve Bayes Algorithm (for discrete input attributes)

– Learning Phase: Given a training set S,
  For each target value ci (ci = c1, …, cL):
    P̂(C = ci) ← estimate P(C = ci) with examples in S
  For every attribute value xjk of each attribute Xj (j = 1, …, n; k = 1, …, Nj):
    P̂(Xj = xjk | C = ci) ← estimate P(Xj = xjk | C = ci) with examples in S
  Output: conditional probability tables; for Xj, Nj × L elements

– Test Phase: Given an unknown instance X′ = (a1, …, an),
  look up the tables to assign the label c* to X′ if

    [P̂(a1 | c*) ⋯ P̂(an | c*)] · P̂(c*) > [P̂(a1 | c) ⋯ P̂(an | c)] · P̂(c), for all c ≠ c*, c = c1, …, cL
Example 1: Naïve Bayes Classifier Example

Predict the class label for an unknown sample “X” using Naïve Bayesian classification, given the play-tennis training data (target attribute: Play).

X = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
Learning phase:

P(Play=Yes) = 9/14    P(Play=No) = 5/14

Outlook | Play=Yes | Play=No
Sunny | 2/9 | 3/5
Overcast | 4/9 | 0/5
Rain | 3/9 | 2/5

Temperature | Play=Yes | Play=No
Hot | 2/9 | 2/5
Mild | 4/9 | 2/5
Cool | 3/9 | 1/5

Humidity | Play=Yes | Play=No
High | 3/9 | 4/5
Normal | 6/9 | 1/5

Wind | Play=Yes | Play=No
Strong | 3/9 | 3/5
Weak | 6/9 | 2/5
• Test Phase
– Given a new instance,
  x′ = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
• MAP rule:

P(x′|Yes)·P(Yes) = P(Outlook=Sunny|Yes) · P(Temperature=Cool|Yes) · P(Humidity=High|Yes) · P(Wind=Strong|Yes) · P(Yes)
= 2/9 · 3/9 · 3/9 · 3/9 · 9/14
= 0.0053
P(x′|No)·P(No) = P(Outlook=Sunny|No) · P(Temperature=Cool|No) · P(Humidity=High|No) · P(Wind=Strong|No) · P(No)
= 3/5 · 1/5 · 4/5 · 3/5 · 5/14
= 0.0206

Since P(x′|Yes)·P(Yes) = 0.0053 < P(x′|No)·P(No) = 0.0206, we label x′ as “Play tennis = No”.
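The whole calculation can be verified with a few lines of code. This sketch (ours, not from the slides) encodes the learned conditional probability tables and applies the MAP rule, reproducing the 0.0053 and 0.0206 scores.

```python
from fractions import Fraction as F

# Conditional probability tables from the learning phase above.
prior = {"Yes": F(9, 14), "No": F(5, 14)}
cond = {
    "Yes": {"Sunny": F(2, 9), "Cool": F(3, 9), "High": F(3, 9), "Strong": F(3, 9)},
    "No":  {"Sunny": F(3, 5), "Cool": F(1, 5), "High": F(4, 5), "Strong": F(3, 5)},
}

x = ["Sunny", "Cool", "High", "Strong"]  # the unknown instance x'

# MAP rule: pick the class maximizing P(a1|c) ... P(an|c) P(c).
scores = {}
for c in prior:
    score = prior[c]
    for a in x:
        score *= cond[c][a]
    scores[c] = score
    print(f"{c}: {float(score):.4f}")    # Yes: 0.0053, No: 0.0206

print("Predicted:", max(scores, key=scores.get))  # -> No
```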
Example 2: Naïve Bayesian Classification Example

• Predict the class label of an unknown sample using Naïve Bayesian classification on the following training dataset from the AllElectronics customer database.
• The unknown sample is
X′ = {age = “<=30”, Income = “medium”, Student = “yes”, Credit_rating = “fair”}
Age Income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
• P(X′|Yes)·P(Yes) = 0.028
• P(X′|No)·P(No) = 0.007
• Since 0.028 > 0.007, the naïve Bayesian classifier predicts buys_computer = “yes” for sample X′.
• Weka classifier demo for naïve Bayes:
https://www.youtube.com/watch?v=UzT4W1tOKD4
Surprise Test (20 marks)

ID | Homeowner | Status | Income | Defaulted
1 | Yes | Employed | High | No
2 | No | Business | Average | No
3 | No | Employed | Low | No
4 | Yes | Business | High | No
5 | No | Unemployed | Average | Yes
6 | No | Business | Low | No
7 | Yes | Unemployed | High | No
8 | No | Employed | Average | Yes
9 | No | Business | Low | No
10 | No | Employed | Average | Yes

Illustrate the Decision Tree and Naïve Bayesian classification techniques for the above data set.
Show how we can classify a new tuple with (Homeowner = Yes; Status = Employed; Income = Average).
Classification by Decision Tree Induction
• Decision tree
– A flow-chart-like tree structure
– Internal node denotes a test on an attribute
– Branch represents an outcome of the test
– Leaf nodes represent class labels or class distribution
• Decision tree generation consists of two phases
– Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes
– Tree pruning
• Identify and remove branches that reflect noise or outliers
• Use of decision tree: Classifying an unknown sample
– Test the attribute values of the sample against the decision tree
Algorithm for Decision Tree Induction (pseudocode)

Algorithm GenDecTree(Sample S, Attlist A)
1. Create a node N.
2. If all samples are of the same class C, then label N with C; terminate.
3. If A is empty, then label N with the most common class C in S (majority voting); terminate.
4. Select a ∈ A with the highest information gain; label N with a.
5. For each value v of a:
   a. Grow a branch from N with condition a = v;
   b. Let Sv be the subset of samples in S with a = v;
   c. If Sv is empty, then attach a leaf labeled with the most common class in S;
   d. Else attach the node generated by GenDecTree(Sv, A − {a}).
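A direct Python transcription of this pseudocode might look like the following sketch (ours); it reuses the info/gain helpers sketched earlier and represents the tree as nested dicts keyed by attribute and value.

```python
from collections import Counter

def gen_dec_tree(rows, attrs, label_attr):
    """ID3-style induction: rows are dicts, attrs is a list of attribute names."""
    labels = [r[label_attr] for r in rows]
    if len(set(labels)) == 1:                 # step 2: all samples same class
        return labels[0]
    if not attrs:                             # step 3: no attributes left
        return Counter(labels).most_common(1)[0][0]
    a = max(attrs, key=lambda at: gain(rows, at, label_attr))  # step 4
    node = {a: {}}
    for v in {r[a] for r in rows}:            # step 5: one branch per value of a
        sv = [r for r in rows if r[a] == v]
        rest = [at for at in attrs if at != a]
        node[a][v] = gen_dec_tree(sv, rest, label_attr)
    return node
```

Because this loop branches only over attribute values actually present in S, the empty-subset case of step 5c never arises in this sketch.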
Attribute Selection Measure
• Information gain (ID3/C4.5)
– All attributes are assumed to be categorical
– Can be modified for continuous-valued attributes
• Gini index (IBM IntelligentMiner)
– All attributes are assumed continuous-valued
– Assume there exist several possible split values for each attribute
– May need other tools, such as clustering, to get the possible split values
– Can be modified for categorical attributes
Information Gain (ID3/C4.5)
• Select the attribute with the highest information gain
• Assume there are two classes, P and N
– Let the set of examples S contain p elements of class P and n elements of
class N
– The amount of information, needed to decide if an arbitrary example in S
belongs to P or N is defined as

    I(p, n) = −(p / (p+n)) log2(p / (p+n)) − (n / (p+n)) log2(n / (p+n))
Prediction

• In prediction we predict continuous values of a response variable with the help of predictor variables
• Prediction can be done with the statistical technique of regression
• Regression assumes the data fit some kind of function and involves the study of that function
• The most widely used approach for numeric prediction is regression

Regression can be of the following kinds:
• Linear regression
• Multiple linear regression
• Non-linear regression
Linear regression (single predictor variable)
• Data are modeled using a straight line
• The regression line is represented by the expression

    Y = α + βX

where X is the predictor variable and Y is the response variable.
Multiple linear regression (multiple predictor variables)

• In both linear and multiple linear regression, the predictor and response variables have a linear relationship.
Non-linear regression:

• If the response variable and predictor variable have a polynomial relationship, the regression is called non-linear regression.
How are α and β calculated?
• Using the least squares method:

    β = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²
    α = ȳ − β·x̄

where x̄ is the mean value of x and ȳ is the mean value of y.
Example
The table below shows the marks obtained by students in the midterm and final year exams.

Midterm (x) | Final year (y)
45 | 60
70 | 70
60 | 54
84 | 82
75 | 68
84 | 76

Find:
1) The equation of prediction (linear regression formula)
2) What will the final year marks be if the midterm marks are 40?
• Solution:
• Compute x̄ = 69.67 and ȳ = 68.33, then:

x | y | (x − x̄) | (y − ȳ) | (x − x̄)(y − ȳ) | (x − x̄)²
45 | 60 | −24.67 | −8.33 | 205.56 | 608.44
70 | 70 | 0.33 | 1.67 | 0.56 | 0.11
60 | 54 | −9.67 | −14.33 | 138.56 | 93.44
84 | 82 | 14.33 | 13.67 | 195.89 | 205.44
75 | 68 | 5.33 | −0.33 | −1.78 | 28.44
84 | 76 | 14.33 | 7.67 | 109.89 | 205.44

Σ(x − x̄)(y − ȳ) ≈ 648.62    Σ(x − x̄)² ≈ 1141.36

β = 648.62 / 1141.36 ≈ 0.568
α = ȳ − β·x̄ = 68.33 − 0.568 × 69.67 ≈ 28.74
• Now, the prediction equation is Y = 28.74 + 0.568X.

• For X = 40, we get Y ≈ 51.5.

These are the predicted final year marks for a student getting 40 marks in the midterm.
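The least-squares fit can be verified with a short script (ours, not from the slides); np.polyfit should give the same α and β as the hand computation.

```python
import numpy as np

x = np.array([45, 70, 60, 84, 75, 84], dtype=float)  # midterm marks
y = np.array([60, 70, 54, 82, 68, 76], dtype=float)  # final year marks

# Least-squares estimates: beta = S_xy / S_xx, alpha = y_bar - beta * x_bar
beta = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha = y.mean() - beta * x.mean()
print(f"Y = {alpha:.2f} + {beta:.3f} X")   # ~ Y = 28.74 + 0.568 X

# np.polyfit(x, y, 1) returns [slope, intercept] and should agree.
print(np.polyfit(x, y, 1))

print("Prediction for X = 40:", alpha + beta * 40)  # ~ 51.5
```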
Model evaluation and selection

Q1) Explain different methods that can be used to evaluate and compare the accuracy of different classification algorithms.
Ans:
• Methods for estimating a classifier’s accuracy:
– Holdout method, random subsampling
– Cross-validation
– Bootstrap
• Comparing classifiers:
– ROC curves

Q2) Explain Bagging & Boosting of classification models.
Model Evaluation and Selection
• Evaluation metrics: How can we measure accuracy? Other metrics to
consider?
• Use validation test set of class-labeled tuples instead of training set
when assessing accuracy
• Methods for estimating a classifier’s accuracy:
– Holdout method, random subsampling
– Cross-validation
– Bootstrap
• Comparing classifiers:
– Confidence intervals
– Cost-benefit analysis and ROC Curves

Classifier Evaluation Metrics: Confusion Matrix

Confusion Matrix:

Actual class \ Predicted class | C1 | ¬C1
C1 | True Positives (TP) | False Negatives (FN)
¬C1 | False Positives (FP) | True Negatives (TN)

Example of Confusion Matrix:

Actual class \ Predicted class | buy_computer = yes | buy_computer = no | Total
buy_computer = yes | 6954 | 46 | 7000
buy_computer = no | 412 | 2588 | 3000
Total | 7366 | 2634 | 10000

• Given m classes, an entry CMi,j in a confusion matrix indicates the number of tuples in class i that were labeled by the classifier as class j.
• A confusion matrix may have extra rows/columns to provide totals.
Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity

A\P | C | ¬C | Total
C | TP | FN | P
¬C | FP | TN | N
Total | P′ | N′ | All

1. Classifier accuracy, or recognition rate: percentage of test set tuples that are correctly classified.
   Accuracy = (TP + TN) / All
2. Error rate: 1 − accuracy, or
   Error rate = (FP + FN) / All
3. Sensitivity: true positive recognition rate.
   Sensitivity = TP / P
4. Specificity: true negative recognition rate.
   Specificity = TN / N
Classifier Evaluation Metrics: Precision and Recall, and F-measures

5. Precision: exactness – what % of tuples that the classifier labeled as positive are actually positive.
   Precision = TP / (TP + FP)
6. Recall: completeness – what % of positive tuples did the classifier label as positive?
   Recall = TP / (TP + FN)
   • A perfect score is 1.0.
   • There is an inverse relationship between precision and recall.
7. F-measure (F1 or F-score): harmonic mean of precision and recall:
   F1 = 2 × Precision × Recall / (Precision + Recall)
Classifier Evaluation Metrics: Example

Actual class \ Predicted class | cancer = yes | cancer = no | Total | Recognition (%)
cancer = yes | 90 | 210 | 300 | 30.00 (sensitivity)
cancer = no | 140 | 9560 | 9700 | 98.56 (specificity)
Total | 230 | 9770 | 10000 | 96.50 (accuracy)

– Precision = 90/230 = 39.13%    Recall = 90/300 = 30.00%
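As a quick check, the metrics for this cancer example can be computed directly from the confusion-matrix counts; the following sketch (ours, not from the slides) does exactly that.

```python
# Confusion-matrix counts from the cancer example above.
TP, FN = 90, 210    # actual cancer = yes
FP, TN = 140, 9560  # actual cancer = no
P, N = TP + FN, FP + TN

accuracy    = (TP + TN) / (P + N)   # 0.9650
error_rate  = (FP + FN) / (P + N)   # 0.0350
sensitivity = TP / P                # 0.3000 (= recall)
specificity = TN / N                # 0.9856
precision   = TP / (TP + FP)        # 0.3913
f1 = 2 * precision * sensitivity / (precision + sensitivity)  # 0.3396

for name, v in [("accuracy", accuracy), ("error rate", error_rate),
                ("sensitivity", sensitivity), ("specificity", specificity),
                ("precision", precision), ("F1", f1)]:
    print(f"{name}: {v:.4f}")
```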
Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
• Holdout method
– Given data is randomly partitioned into two independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
– Random sampling: a variation of holdout
• Repeat holdout k times, accuracy = avg. of the accuracies
obtained
• Cross-validation (k-fold, where k = 10 is most popular)
– Randomly partition the data into k mutually exclusive subsets D1, …, Dk, each of approximately equal size
– At i-th iteration, use Di as test set and others as training set
– Leave-one-out: k folds where k = # of tuples, for small sized
data
– *Stratified cross-validation*: folds are stratified so that class
dist. in each fold is approx. the same as that in the initial data
5-Fold Cross-Validation

[Figure: the data is split into 5 folds; each fold serves once as the test set while the remaining 4 folds form the training set]
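A minimal sketch (ours, not from the slides) of k-fold cross-validation with scikit-learn; the decision-tree classifier and the synthetic data are placeholders for whatever model and dataset are being evaluated.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; any (X, y) dataset works here.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 10-fold cross-validation (k = 10 is the most popular choice).
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("per-fold accuracy:", scores.round(3))
print("mean accuracy:", scores.mean().round(3))
```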
Evaluating Classifier Accuracy: Bootstrap
• Bootstrap
– Works well with small data sets
– Samples the given training tuples uniformly with replacement
• i.e., each time a tuple is selected, it is equally likely to be selected
again and re-added to the training set
• There are several bootstrap methods; a common one is the .632 bootstrap:
– A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set end up forming the test set. About 63.2% of the original data end up in the bootstrap sample, and the remaining 36.8% form the test set (since (1 − 1/d)^d ≈ e⁻¹ = 0.368)
– Repeat the sampling procedure k times; the overall accuracy of the model is

    Acc(M) = (1/k) Σ_{i=1}^{k} [ 0.632 × Acc(Mi)_test_set + 0.368 × Acc(Mi)_train_set ]
The bootstrap method involves iteratively resampling a dataset with replacement.
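The 63.2% figure is easy to verify empirically; this small sketch (ours) draws a bootstrap sample and measures how much of the original data it covers.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10_000                               # number of tuples in the data set
idx = np.arange(d)

# Draw one bootstrap sample: d draws with replacement.
sample = rng.choice(idx, size=d, replace=True)
in_sample = np.unique(sample).size / d   # fraction that made it into training
print(f"in bootstrap sample: {in_sample:.3f}")     # ~0.632
print(f"left for test set:  {1 - in_sample:.3f}")  # ~0.368
```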
Estimating Confidence Intervals:
Table for t-distribution

• Symmetric
• Significance level: e.g., sig = 0.05 or 5% means M1 & M2 are significantly different for 95% of the population
• Confidence limit: z = sig/2
Model Selection: ROC Curves

• ROC (Receiver Operating Characteristic) curves: for visual comparison of classification models
• Originated from signal detection theory
• Shows the trade-off between the true positive rate and the false positive rate
• The area under the ROC curve is a measure of the accuracy of the model
• Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list
• The vertical axis represents the true positive rate; the horizontal axis represents the false positive rate
• The plot also shows a diagonal line; a model with perfect accuracy will have an area of 1.0
• The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model
Issues Affecting Model Selection
• Accuracy
– classifier accuracy: predicting class label
• Speed
– time to construct the model (training time)
– time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
– understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
Ensemble Methods: Increasing the Accuracy

• Ensemble methods
– Use a combination of models to increase accuracy
– Combine a series of k learned models, M1, M2, …, Mk, with
the aim of creating an improved model M*
• Popular ensemble methods
– Bagging: averaging the prediction over a collection of classifiers, e.g., Random Forest
– Boosting: weighted vote with a collection of classifiers, e.g., AdaBoost
Bagging: Bootstrap Aggregation

• Analogy: Diagnosis based on multiple doctors’ majority vote


• Training
– Given a set D of d tuples, at each iteration i, a training set Di of d tuples is
sampled with replacement from D (i.e., bootstrap)
– A classifier model Mi is learned for each training set Di
• Classification: classify an unknown sample X
– Each classifier Mi returns its class prediction
– The bagged classifier M* counts the votes and assigns the class with the
most votes to X
• Prediction: can be applied to the prediction of continuous values by taking
the average value of each prediction for a given test tuple
• Accuracy
– Often significantly better than a single classifier derived from D
– For noisy data: not considerably worse, more robust
– Proven improved accuracy in prediction
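A minimal sketch (ours, not from the slides) of bagging with scikit-learn's BaggingClassifier; the decision-tree base estimator and synthetic data are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# k = 25 trees, each trained on a bootstrap sample of the training set;
# the bagged classifier M* predicts by majority vote.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25,
                        random_state=0).fit(X_tr, y_tr)
print("single tree:", DecisionTreeClassifier(random_state=0)
      .fit(X_tr, y_tr).score(X_te, y_te))
print("bagged trees:", bag.score(X_te, y_te))
```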
Boosting
• Analogy: Consult several doctors, based on a combination of
weighted diagnoses—weight assigned based on the previous
diagnosis accuracy
• How does boosting work?
– Weights are assigned to each training tuple
– A series of k classifiers is iteratively learned
– After a classifier Mi is learned, the weights are updated to allow
the subsequent classifier, Mi+1, to pay more attention to the
training tuples that were misclassified by Mi
– The final M* combines the votes of each individual classifier,
where the weight of each classifier's vote is a function of its
accuracy
• Boosting algorithm can be extended for numeric prediction
• Comparing with bagging: Boosting tends to have greater accuracy,
but it also risks overfitting the model to misclassified data
Adaboost

• Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)


• Initially, all the weights of tuples are set the same (1/d)
• Generate k classifiers in k rounds. At round i,
– Tuples from D are sampled (with replacement) to form a training set Di
of the same size
– Each tuple’s chance of being selected is based on its weight
– A classification model Mi is derived from Di
– Its error rate is calculated using Di as a test set
– If a tuple is misclassified, its weight is increased, o.w. it is decreased
• Error rate: err(Xj) is the misclassification error of tuple Xj (1 if misclassified, 0 otherwise). Classifier Mi’s error rate is the weighted sum of the misclassified tuples:

    error(Mi) = Σ_{j=1}^{d} wj × err(Xj)

• The weight of classifier Mi’s vote is

    log( (1 − error(Mi)) / error(Mi) )
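For completeness, a minimal sketch (ours, not from the slides) using scikit-learn's AdaBoostClassifier, plus the vote-weight formula above evaluated for an example error rate.

```python
import math

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# k = 50 boosting rounds over weighted resamples of the training data.
ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
print("AdaBoost accuracy:", ada.score(X_te, y_te))

# Vote weight of a classifier with, say, a 10% weighted error rate.
err = 0.10
print("vote weight:", math.log((1 - err) / err))  # ~2.197
```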
Random Forest
• Random Forest:
– Each classifier in the ensemble is a decision tree classifier and is
generated using a random selection of attributes at each node to
determine the split
– During classification, each tree votes and the most popular class is
returned
• Two Methods to construct Random Forest:
– Forest-RI (random input selection): Randomly select, at each node, F
attributes as candidates for the split at the node. The CART methodology
is used to grow the trees to maximum size
– Forest-RC (random linear combinations): Creates new attributes (or
features) that are a linear combination of the existing attributes (reduces
the correlation between individual classifiers)
• Comparable in accuracy to AdaBoost, but more robust to errors and outliers
• Insensitive to the number of attributes selected for consideration at each split, and faster than bagging or boosting
Classification of Class-Imbalanced Data Sets

• Class-imbalance problem: Rare positive example but numerous


negative ones, e.g., medical diagnosis, fraud, oil-spill, fault, etc.
• Traditional methods assume a balanced distribution of classes and
equal error costs: not suitable for class-imbalanced data
• Typical methods for imbalanced data in 2-class classification:
– Oversampling: re-sampling of data from positive class
– Under-sampling: randomly eliminate tuples from negative
class
– Threshold-moving: moves the decision threshold, t, so that the
rare class tuples are easier to classify, and hence, less chance of
costly false negative errors
– Ensemble techniques: Ensemble multiple classifiers introduced
above
• The class-imbalance problem remains difficult for multiclass tasks
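As an illustration of the oversampling idea, here is a small sketch (ours, not from the slides) that randomly re-samples the rare positive class with numpy until the classes are balanced; dedicated libraries (e.g., imbalanced-learn) offer more refined variants such as SMOTE.

```python
import numpy as np

rng = np.random.default_rng(0)

# Imbalanced toy labels: 5% positive, 95% negative.
y = np.array([1] * 50 + [0] * 950)
X = rng.normal(size=(y.size, 3))

pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]

# Oversample: draw positive indices with replacement until |pos| == |neg|.
extra = rng.choice(pos, size=neg.size - pos.size, replace=True)
idx = np.concatenate([neg, pos, extra])

X_bal, y_bal = X[idx], y[idx]
print("before:", np.bincount(y))      # [950  50]
print("after: ", np.bincount(y_bal))  # [950 950]
```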
