Lectures 7 and 8 - Data Analysis in Management - MBM
Learning objectives
Upon completing this lecture, you should be able to:
Classification task
One of the most common data mining tasks is
classification.
The classification model examines a large set of records, each
record containing information on the target variable as well as a
set of input or predictor variables.
WHAT TASKS CAN CLASSIFICATION METHODS ACCOMPLISH?
Suppose that there is a target categorical variable, such as
income bracket, which, for example, could be partitioned
into three classes or categories: high income, middle income,
and low income.
Training Set (excerpt)
Tid  Attrib1  Attrib2  Attrib3  Class
3    No       Small    70K      No
6    No       Medium   60K      No

Apply Model

Test Set (excerpt)
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
15   No       Large    67K      ?
k-NEAREST NEIGHBOUR (kNN) ALGORITHM
What is kNN?
classification.
However, one may feel that neighbours that are closer or more
similar to the new record should be weighted more heavily than
more distant neighbours.
Computationally expensive.
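The idea of weighting closer neighbours more heavily can be sketched as follows. This is a minimal illustration, not a definitive implementation: the Euclidean distance, the inverse-distance weights, the function name `weighted_knn_predict` and the toy records are all assumptions made for the example. The scan over every training record also shows where the computational cost comes from.

```python
import math
from collections import defaultdict


def weighted_knn_predict(train, new_record, k=3):
    """Classify new_record by a distance-weighted vote of its k nearest neighbours.

    train is a list of (features, label) pairs; features are numeric tuples.
    """
    # Distance from the new record to every training record.
    distances = [
        (math.dist(features, new_record), label) for features, label in train
    ]
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]

    votes = defaultdict(float)
    for distance, label in nearest:
        # Assumed weighting scheme: inverse distance, so closer neighbours count more.
        # The small constant avoids division by zero for identical records.
        votes[label] += 1.0 / (distance + 1e-9)
    return max(votes, key=votes.get)


# Hypothetical toy data: two "low" records and three "high" records.
train = [
    ((1.0, 1.0), "low"),
    ((1.2, 0.9), "low"),
    ((5.0, 5.0), "high"),
    ((5.2, 4.8), "high"),
    ((4.9, 5.1), "high"),
]

print(weighted_knn_predict(train, (2.5, 2.5), k=5))
```

With k = 5 an unweighted majority vote would pick "high" (three neighbours against two), whereas the weighted vote favours the two closer "low" records.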
Classification trees
Misclassification error,
Entropy,
Gini index.
DECISION TREES
Measures of impurity (uncertainty)
Misclassification error:
\[ 1 - \max_k p_k \]
where p_k is the proportion of records in the node that belong to class k.
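The three impurity measures named for classification trees can be compared on a single node with a short sketch. The misclassification error follows the formula above; the entropy and Gini index use their usual textbook definitions (minus the sum of p_k log2 p_k, and one minus the sum of squared p_k), which are assumed here rather than taken from the slides. Function names and the example labels are hypothetical.

```python
import math


def class_proportions(labels):
    """Proportion p_k of each class among the labels in a node."""
    n = len(labels)
    return [labels.count(c) / n for c in set(labels)]


def misclassification_error(labels):
    # 1 - max_k p_k
    return 1.0 - max(class_proportions(labels))


def entropy(labels):
    # Assumed standard definition: -sum_k p_k * log2(p_k)
    return -sum(p * math.log2(p) for p in class_proportions(labels) if p > 0)


def gini_index(labels):
    # Assumed standard definition: 1 - sum_k p_k^2
    return 1.0 - sum(p * p for p in class_proportions(labels))


# Hypothetical nodes: one evenly mixed, one pure.
mixed_node = ["yes"] * 5 + ["no"] * 5
pure_node = ["yes"] * 10

for name, node in [("mixed", mixed_node), ("pure", pure_node)]:
    print(name, misclassification_error(node), entropy(node), gini_index(node))
```

All three measures are zero for a pure node and reach their maximum for a two-class node when the classes are evenly split.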
Hit ratio (also called hit rate, overall accuracy, ACC or the
percentage correctly classified)
Percentage of observations (objects, individuals,
respondents, firms, etc.) correctly classified.
It is calculated as the number of objects in the diagonal of
the classification matrix divided by the total number of
observations.
\[ \text{hit\_ratio} = \frac{TN + TP}{TN + FN + FP + TP} \]
Note: the higher the hit ratio, the better.
Evaluating classification accuracy
Measures of predictive accuracy
\[ \text{Positive predictive value} = PPV = \frac{TP}{TP + FP} \]
\[ \text{Negative predictive value} = NPV = \frac{TN}{TN + FN} \]
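As a worked illustration, the hit ratio, PPV and NPV can be computed directly from the four cells of a two-class classification matrix. The counts below are made up for the example, the function name `classification_metrics` is hypothetical, and the sketch assumes every denominator is non-zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy measures from the counts of a 2x2 classification matrix."""
    total = tn + fn + fp + tp
    return {
        # Share of all observations on the diagonal (correctly classified).
        "hit_ratio": (tn + tp) / total,
        # Share of predicted positives that are truly positive.
        "ppv": tp / (tp + fp),
        # Share of predicted negatives that are truly negative.
        "npv": tn / (tn + fn),
    }


# Hypothetical counts for 100 observations.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, 85 of the 100 observations lie on the diagonal, so the hit ratio is 0.85; PPV is 40/45 and NPV is 45/55.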