3 DM Classification
DM Task: Predictive Modeling
• A predictive model makes a prediction/forecast about the
values of data using known results from historical data.
– Prediction Methods use existing variables to predict unknown
or future values of other variables.
• Predict one variable, Y, given a set of other variables, X.
Here, X could be an n-dimensional vector.
– In effect, this is a function approximation through learning the
relationship between Y and X.
• Many, many algorithms for predictive modeling in
statistics and machine learning, including
– Classification, regression, etc.
• Often the emphasis is on predictive accuracy, less
emphasis on understanding the model.
Prediction Problems:
Classification vs. Numeric Prediction
• Classification:
– predicts categorical class labels (discrete or nominal).
– classifies data (constructs a model) based on the
training set and the values (class labels) in a classifying
attribute and uses it in classifying new data.
• Numeric Prediction:
– models continuous-valued functions, i.e., predicts
unknown or missing values.
Models and Patterns
• Model = abstract representation of given training data.
  e.g., a very simple linear model structure: Y = aX + b
  – a and b are parameters determined from the data
  – Y = aX + b is the model structure
  – Y = 0.9X + 0.3 is a particular model
• Pattern represents “local structure” in a dataset.
  – E.g., if X > x then Y > y with probability p
• Example: Given a finite sample of <x, f(x)> pairs, can we create a
  model that holds for future values?
  ✓ To guess the true function f, find some pattern (called a
    hypothesis) in the training examples, and assume that the pattern
    will hold for future examples too.

Sample <x, f(x)> pairs:
  x    f(x)
  1    1
  2    4
  3    9
  4    16
  5    ?
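As a concrete illustration (a minimal Python sketch, not from the slides), the parameters a and b of the model structure Y = aX + b can be estimated from the <x, f(x)> sample above by least squares. Note how the chosen model structure limits the hypothesis: the fitted line predicts 20 for x = 5, although the sample looks quadratic.

# Fit the linear model structure Y = aX + b by ordinary least squares
# to the <x, f(x)> sample from the slide.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 4.0, 9.0, 16.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least-squares estimates of the parameters a and b.
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x

print(f"particular model: Y = {a:.1f}X + {b:.1f}")  # Y = 5.0X + -5.0
print(f"prediction for x = 5: {a * 5 + b:.1f}")     # 20.0, though the data hint at x**2 = 25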
Predictive Modeling: Customer Scoring
• Example: a bank has a database of 1 million past
customers, 10% of whom took out mortgages.
• Use machine learning to rank new customers as a
function of p(mortgage | customer data)
• Customer data:
– History of transactions with the bank
– Other credit data (obtained from Experian, etc.)
– Demographic data on the customer or where they live
• Techniques:
– Binary classification: logistic regression, decision trees,
etc.
– Many, many applications of this nature
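A minimal sketch of this kind of scoring with scikit-learn's LogisticRegression; the feature columns and customer records below are hypothetical placeholders, not real bank data.

# Rank new customers by p(mortgage | customer data) with a binary classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical feature columns: [n_transactions, credit_score, income_k]
X_train = np.array([[120, 700, 60], [5, 550, 25], [80, 680, 55], [10, 590, 30]])
y_train = np.array([1, 0, 1, 0])  # 1 = took out a mortgage

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

X_new = np.array([[90, 660, 50], [15, 600, 28]])
scores = model.predict_proba(X_new)[:, 1]  # estimated p(mortgage | customer data)
print(scores, np.argsort(-scores))         # scores and best-prospect-first ranking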
Descriptive Modeling
• A descriptive model serves as a way to explore the properties of
the data examined, not to predict new properties.
[Figure: scatter plot of the data; horizontal axis: Red Blood Cell Volume]
Example of Pattern Discovery
• Example in retail: Customer transactions to consumer behavior
– People who bought “Da Vinci Code” also bought “The Five
People You Meet in Heaven” (www.amazon.com)
• Example: football player behavior
– If player A is in the game, player B’s scoring rate increases from
25% chance per game to 95% chance per game
• What about the following?
ADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADAADABDBBDABABBCDDDCDDABDC
BBDBDBCBBABBBCBBABCBBACBBDBAACCADDADBDBBCBBCCBBBDCABDDBBADDBBBB
CCACDABBABDDCDDBBABDBDDBDDBCACDBBCCBBACDCADCBACCADCCCACCDDADCBC
ADADBAACCDDDCBDBDCCCCACACACCDABDDBCADADBCBDDADABCCABDAACABCABAC
BDDDCBADCBDADDDDCDDCADCCBBADABBAAADAAABCCBCABDBAADCBCDACBCABABC
CBACBDABDDDADAABADCDCCDBBCDBDADDCCBBCDBAADADBCAAAADBDCADBDBBBCD
CCBCCCDCCADAADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADACCCDABAABBCB
DBDBADBBBBCDADABABBDACDCDDDBBCDBBCBBCCDABCADDADBACBBBCCDBAAADDD
BDDCABACBCADCDCBAAADCADDADAABBACCBB
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
– Supervision: The training data (observations, measurements,
etc.) are accompanied by labels indicating the class of the
observations
– New data is classified based on the training set
• Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc., the aim is to
find natural groupings or clusters in the data
Basic Data Mining algorithms
• Classification (which is also called Supervised learning) maps
data into predefined groups or classes to enhance the
prediction process
• Clustering (which is also called Unsupervised learning)
groups similar data together into clusters.
– It is used to find appropriate groupings of elements for a set of
data.
– Unlike classification, clustering is a kind of undirected knowledge
discovery or unsupervised learning; that is, there is no target field,
and the relationships among the data are identified by a bottom-up
approach.
• Association Rule (also known as market basket analysis)
– Discovers interesting associations between attributes contained
in a database.
– Based on frequency counts of how often items occur together in
events, an association rule says that if item X is part of an event,
then item Y is also part of the event some percentage of the time.
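A minimal sketch of the frequency-count idea behind association rules, on a hypothetical set of transactions: support is the fraction of events containing both X and Y, and confidence is the fraction of events containing X that also contain Y.

# Support and confidence of the rule X => Y from frequency counts.
transactions = [  # hypothetical market baskets
    {"Da Vinci Code", "Five People You Meet in Heaven"},
    {"Da Vinci Code", "Five People You Meet in Heaven", "Atlas"},
    {"Da Vinci Code"},
    {"Atlas"},
]
X, Y = "Da Vinci Code", "Five People You Meet in Heaven"

n = len(transactions)
count_x = sum(1 for t in transactions if X in t)
count_xy = sum(1 for t in transactions if X in t and Y in t)

print(f"support = {count_xy / n:.2f}")           # both X and Y: 2/4 = 0.50
print(f"confidence = {count_xy / count_x:.2f}")  # Y given X: 2/3 = 0.67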
Classification
Classification: Definition
• Classification is a data mining (machine learning) technique
used to predict group membership for data instances.
• Given a collection of records (training set), each record
contains a set of attributes, one of which is the class attribute.
– Find a model for the class attribute as a function of the values of
the other attributes.
• Goal: previously unseen records should be assigned a class as
accurately as possible. A test set is used to determine the
accuracy of the model.
– Usually, the given data set is divided into training and test sets,
with training set used to build the model and test set used to
validate it.
• For example, one may use classification to predict whether the
weather on a particular day will be “sunny”, “rainy” or “cloudy”.
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
– The set of tuples used for model construction is the training set
– The model is represented as classification rules, decision trees,
or mathematical formulae
• Model usage: for classifying future or unknown objects
– Estimate accuracy of the model
• The known label of a test sample is compared with the
classified result from the model
• Accuracy rate is the percentage of test set samples that are
correctly classified by the model
• Test set is independent of the training set
– If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known
Illustrating Classification Task
Training Set (Tid, Attrib1, Attrib2, Attrib3, Class):
  1    Yes   Large    125K   No
  2    No    Medium   100K   No
  3    No    Small    70K    No
  6    No    Medium   60K    No
    → Learning algorithm → Model

Test Set (Tid, Attrib1, Attrib2, Attrib3, Class):
  11   No    Small    55K    ?
  15   No    Large    67K    ?
    → Apply Model
Confusion Matrix for Performance Evaluation
                      PREDICTED CLASS
                      Class=Yes    Class=No
ACTUAL   Class=Yes    a (TP)       b (FN)
CLASS    Class=No     c (FP)       d (TN)

Accuracy = (a + d) / (a + b + c + d)
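A minimal sketch (with hypothetical label lists) of filling the four cells and computing the accuracy rate defined earlier:

# Build the confusion matrix cells a (TP), b (FN), c (FP), d (TN)
# from actual vs. predicted class labels.
actual    = ["Yes", "Yes", "No", "No", "Yes", "No"]   # hypothetical
predicted = ["Yes", "No",  "No", "Yes", "Yes", "No"]  # hypothetical

tp = sum(a == "Yes" and p == "Yes" for a, p in zip(actual, predicted))
fn = sum(a == "Yes" and p == "No"  for a, p in zip(actual, predicted))
fp = sum(a == "No"  and p == "Yes" for a, p in zip(actual, predicted))
tn = sum(a == "No"  and p == "No"  for a, p in zip(actual, predicted))

accuracy = (tp + tn) / (tp + fn + fp + tn)
print(tp, fn, fp, tn, accuracy)  # 2 1 1 2 0.666...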
Distance measure (e.g., for nearest-neighbor classification): the squared Euclidean distance
$\mathrm{Dist}(X, Y) = \sum_{i=1}^{n} (X_i - Y_i)^2$
Example
• We have data from a questionnaire survey (asking people's
opinions) and from objective testing, with two attributes
(acid durability and strength), to classify whether a special
paper tissue is good or not. Here are four training samples:

X1 = Acid Durability (seconds)   X2 = Strength (kg/m²)   Y = Classification
         7                                7                     Bad
         7                                4                     Bad
         3                                4                     Good
         1                                4                     Good
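The distance measure above suggests how a nearest-neighbor classifier would label a new sample. A minimal sketch, with a hypothetical query point (3, 7) and k = 3:

# Classify a new tissue sample by majority vote of its k nearest
# training samples under the squared Euclidean distance.
train = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
query = (3, 7)  # hypothetical new sample: (acid durability, strength)
k = 3

def dist(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

neighbors = sorted(train, key=lambda s: dist(s[0], query))[:k]
labels = [label for _, label in neighbors]
print(max(set(labels), key=labels.count))  # "Good" (2 of the 3 nearest)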
Decision Tree
Decision Trees
• A decision tree is a tree where internal nodes are simple
decision rules on one or more attributes and leaf nodes are
predicted class labels.
✓ Given an instance of an object or situation, which is specified
by a set of properties, the tree returns a "yes" or "no"
decision about that instance.

              Attribute_1
          /        |        \
     value-1    value-2    value-3
        |          |          |
  Attribute_2    Class1    Attribute_2
• Information Gain
– Select the attribute with the highest information gain, i.e., the
one that creates the smallest average disorder
• First, compute the disorder using Entropy: the expected
information needed to classify objects into classes
• Second, measure the Information Gain: calculate by how
much the disorder of a set would reduce by knowing the value
of a particular attribute.
Entropy
• The Entropy measures the disorder of a set S containing a
total of n examples, of which n+ are positive and n− are
negative; it is given by:

$D(n_+, n_-) = -\frac{n_+}{n}\log_2\frac{n_+}{n} - \frac{n_-}{n}\log_2\frac{n_-}{n} = \mathrm{Entropy}(S)$
• Some useful properties of the Entropy:
– D(n,m) = D(m,n)
– D(0,m) = D(m,0) = 0
✓D(S)=0 means that all the examples in S have the same
class
– D(m,m) = 1
✓D(S)=1 means that half the examples in S are of one class
and half are in the opposite class
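A minimal Python sketch of D(n+, n−) that checks the properties above:

from math import log2

def entropy(n_pos, n_neg):
    # D(n+, n-) = -(n+/n) log2(n+/n) - (n-/n) log2(n-/n),
    # with the convention 0 * log2(0) = 0.
    n = n_pos + n_neg
    return -sum(p / n * log2(p / n) for p in (n_pos, n_neg) if p)

print(entropy(4, 4))  # D(m, m) = 1: half one class, half the other
print(entropy(0, 5))  # D(0, m) = 0: all examples in the same class
print(entropy(3, 5))  # 0.954..., used in the example below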
Information Gain
• The Information Gain measures the expected
reduction in entropy due to splitting on an attribute A
$\mathrm{GAIN}_{split} = \mathrm{Entropy}(S) - \sum_{i=1}^{k}\frac{n_i}{n}\,\mathrm{Entropy}(i)$

– Parent node S is split into k partitions; $n_i$ is the number of
records in partition i
• Example (a set of 8 examples with 3 positive and 5 negative):

$D(3_+, 5_-) = -\frac{3}{8}\log_2\frac{3}{8} - \frac{5}{8}\log_2\frac{5}{8} = 0.954$
Which decision variable minimises the
disorder?
Test     Average Disorder
Hair     0.50
Height   0.69
Weight   0.94
Lotion   0.61

• Which decision variable maximises the Info Gain then?
• Remember, it's the one which minimises the average disorder.
✓ Gain(hair)   = 0.954 - 0.50 = 0.454
✓ Gain(height) = 0.954 - 0.69 = 0.264
✓ Gain(weight) = 0.954 - 0.94 = 0.014
✓ Gain(lotion) = 0.954 - 0.61 = 0.344
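A minimal sketch that reproduces the Hair row (average disorder 0.50, Gain(hair) = 0.454), using the per-hair-colour class counts that can be read off the trees on the next slides:

from math import log2

def entropy(n_pos, n_neg):
    n = n_pos + n_neg
    return -sum(p / n * log2(p / n) for p in (n_pos, n_neg) if p)

# hair colour -> (sunburned, not sunburned): blonde = Sarah, Annie vs.
# Dana, Katie; red = Emily; brown = Alex, Pete, John
partitions = {"blonde": (2, 2), "red": (1, 0), "brown": (0, 3)}
n = 8

avg_disorder = sum((pos + neg) / n * entropy(pos, neg)
                   for pos, neg in partitions.values())
print(f"average disorder = {avg_disorder:.2f}")            # 0.50
print(f"Gain(hair) = {entropy(3, 5) - avg_disorder:.3f}")  # 0.454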
The best decision tree?
is_sunburned (splitting on Hair colour alone):
  Hair colour
  ├─ blonde → ?   (Sunburned = Sarah, Annie; None = Dana, Katie)
  ├─ red    → Emily
  └─ brown  → Alex, Pete, John

is_sunburned (resolving the blonde branch with Lotion used):
  Hair colour
  ├─ blonde → Lotion used
  │             ├─ no  → Sarah, Annie (sunburned)
  │             └─ yes → Dana, Katie (none)
  ├─ red    → Emily
  └─ brown  → Alex, Pete, John
Sunburn sufferers are ...
• You can view a Decision Tree as an IF-THEN-ELSE
statement which tells us whether someone will suffer
from sunburn.
if (hair-colour = "red") then
    return (sunburned = yes)
else if (hair-colour = "blonde" and lotion-used = "no") then
    return (sunburned = yes)
else
    return (sunburned = no)
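The same rules as a minimal Python function, applied to the eight people from the slides; lotion use for the non-blonde people is assumed false, which does not change any result because hair colour alone decides for them.

def is_sunburned(hair_colour, lotion_used):
    # Direct translation of the IF-THEN-ELSE rules above.
    if hair_colour == "red":
        return True
    if hair_colour == "blonde" and not lotion_used:
        return True
    return False

people = {
    "Sarah": ("blonde", False), "Annie": ("blonde", False),
    "Dana": ("blonde", True), "Katie": ("blonde", True),
    "Emily": ("red", False), "Alex": ("brown", False),   # lotion assumed
    "Pete": ("brown", False), "John": ("brown", False),  # lotion assumed
}
for name, (hair, lotion) in people.items():
    print(name, is_sunburned(hair, lotion))  # True for Sarah, Annie, Emily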
Why decision tree induction in DM?
• Relatively faster learning speed (than other classification
methods)
• Convertible to simple and easy to understand classification if-
then-else rules
• Comparable classification accuracy with other methods
• Does not require any prior knowledge of data distribution,
works well on noisy data.
Pros:
+ Reasonable training time
+ Fast application
+ Easy to interpret
+ Easy to implement
+ Can handle a large number of features

Cons:
- Cannot handle complicated relationships between features
- Simple decision boundaries
- Problems with lots of missing data
Neural Network
Brain and Machine
• The Brain
– Pattern Recognition
– Association
– Complexity
– Noise Tolerance
• The Machine
– Calculation
– Precision
– Logic
Features of the Brain
• Ten billion (10^10) neurons
✓ Neuron switching time > 10^-3 secs
• Face Recognition ~0.1secs
• On average, each neuron has several thousand
connections
• Hundreds of operations per second
• High degree of parallel computation
• Distributed representations
• Die off frequently (never replaced)
• Compensated for problems by massive parallelism
Neural Network classifier
• It is represented as a layered set of interconnected
processors. These processor nodes have a relationship
with the neurons of the brain. Each node has a weighted
connection to several other nodes in adjacent layers.
Individual nodes take the input received from connected
nodes and use the weights together to compute output
values.
• The inputs are fed simultaneously into the input layer.
• The weighted outputs of these units are fed into the
hidden layer.
• The weighted outputs of the last hidden layer are inputs
to units making up the output layer.
Architecture of Neural network
• Neural networks are used to look for patterns in data, learn
these patterns, and then classify new patterns & make forecasts
• A network with the input and output layer only is called a
single-layered neural network, whereas a multilayer neural
network is a generalized one with one or more hidden layers.
– A network containing two hidden layers is called a three-layer neural
network, and so on.

Single-layered NN: inputs x1, …, xn with weights w1, …, wn feed the output unit directly:
$o = \sigma\left(\sum_{i=1}^{n} w_i x_i\right)$, with the sigmoid activation $\sigma(y) = \frac{1}{1 + e^{-y}}$

Multilayer NN: input nodes → hidden nodes → output nodes
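A minimal sketch of the forward pass of one such unit; the weights and inputs below are hypothetical.

from math import exp

def sigmoid(y):
    return 1.0 / (1.0 + exp(-y))

def unit_output(weights, inputs):
    # o = sigmoid( sum_i w_i * x_i )
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)))

print(unit_output([0.5, -0.3, 0.8], [1.0, 2.0, 0.5]))  # ~0.574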
A Multilayer Neural Network
• INPUT: records with class attribute, with normalized
attribute values.
–INPUT VECTOR: X = {x1, x2, …, xn}, where n
is the number of attributes.
–INPUT LAYER – there are as many nodes as
attributes, i.e., as the length of the input vector.
• HIDDEN LAYER – neither its input nor its
output can be observed from outside.
–The number of nodes in the hidden layer
and the number of hidden layers depend
on the implementation.
• OUTPUT LAYER – corresponds to the class attribute.
–There are as many nodes as classes (values of the class
attribute): Ok, where k = 1, 2, …, n and n is the number of classes.
[Figure: input layer → hidden layer → output layer]
Hidden layer: Neuron with Activation
• The neuron is the basic information processing unit of a NN. It
consists of:
1 A set of links, describing the neuron inputs, with weights W1,
W2, …, Wm
2 An adder for summing the weighted input signals
3 An activation function (e.g., the sigmoid above) that limits the
output of the neuron
ANN Training Example
• Training – epoch 3 (perceptron rule, learning rate 0.1; an X marks
a misclassified pattern, which triggers a weight update):
y1 = 0.72*0 + 0.62*0 - 0.42 = -0.42 → y = 0
y2 = 0.72*1 + 0.62*0 - 0.42 = 0.30 → y = 1 X
W1(3) = 0.72 + 0.1 * (0 - 1) * 1 = 0.62
W2(3) = 0.62 + 0.1 * (0 - 1) * 0 = 0.62
W0(3) = 0.42 + 0.1 * (0 - 1) * (-1) = 0.52
y3 = 0.62*0 + 0.62*1 - 0.52 = 0.10 → y = 1
y4 = 0.62*1 + 0.62*1 - 0.52 = 0.72 → y = 1
ANN Training Example
• Training – epoch 4:
y1 = 0.62*0 + 0.62*0 - 0.52 = -0.52 → y = 0
y2 = 0.62*1 + 0.62*0 - 0.52 = 0.10 → y = 1 X
W1(4) = 0.62 + 0.1 * (0 - 1) * 1 = 0.52
W2(4) = 0.62 + 0.1 * (0 - 1) * 0 = 0.62
W0(4) = 0.52 + 0.1 * (0 - 1) * (-1) = 0.62
y3 = 0.52*0 + 0.62*1 - 0.62 = 0 → y = 0 X
W1(4) = 0.52 + 0.1 * (1 - 0) * 0 = 0.52
W2(4) = 0.62 + 0.1 * (1 - 0) * 1 = 0.72
W0(4) = 0.62 + 0.1 * (1 - 0) * (-1) = 0.52
y4 = 0.52*1 + 0.72*1 - 0.52 = 0.72 → y = 1
• Finally:
y1 = 0.52*0 + 0.72*0 - 0.52 = -0.52 → y = 0
y2 = 0.52*1 + 0.72*0 - 0.52 = 0.0 → y = 0
y3 = 0.52*0 + 0.72*1 - 0.52 = 0.20 → y = 1
y4 = 0.52*1 + 0.72*1 - 0.52 = 0.72 → y = 1
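A minimal sketch that reproduces this run. The learning rate 0.1, the bias input of -1 weighted by W0, the step activation (output 1 iff the weighted sum is positive), and the targets 0, 0, 1, 1 for the four input patterns are all inferred from the numbers above.

# Perceptron-style training loop, starting from the epoch-3 weights.
patterns = [((0, 0), 0), ((1, 0), 0), ((0, 1), 1), ((1, 1), 1)]
w1, w2, w0 = 0.72, 0.62, 0.42  # weights at the start of epoch 3
eta = 0.1                      # learning rate

for epoch in (3, 4, 5):
    errors = 0
    for (x1, x2), target in patterns:
        y = 1 if w1 * x1 + w2 * x2 - w0 > 0 else 0
        if y != target:  # misclassified (the "X" in the slides): update
            errors += 1
            w1 += eta * (target - y) * x1
            w2 += eta * (target - y) * x2
            w0 += eta * (target - y) * (-1)
    print(f"epoch {epoch}: w1={w1:.2f} w2={w2:.2f} w0={w0:.2f} errors={errors}")
# epoch 3: 0.62 0.62 0.52 (1 error); epoch 4: 0.52 0.72 0.52 (2); epoch 5: 0 errors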
ANN Training Example
[Figure: the four input patterns in the (x1, x2) unit square (+ = class 1, o = class 0), with the learned decision boundary separating them]
Logical Functions
• McCulloch and Pitts: every Boolean function can be implemented
with an artificial neuron (except XOR).

With a bias input a0 = -1, the unit outputs 1 when W0*a0 + W1*a1 + W2*a2 > 0:
AND: W0 = 1.5, W1 = 1, W2 = 1
OR:  W0 = 0.5, W1 = 1, W2 = 1
NOT: W0 = -0.5, W1 = -1

Example with weights (0.3, 0.5, -0.4), where I1 = -1 is the bias input:
I1   I2   I3   Summation                              Output
-1   0    0    (-1*0.3) + (0*0.5) + (0*-0.4) = -0.3   0
-1   0    1    (-1*0.3) + (0*0.5) + (1*-0.4) = -0.7   0
-1   1    0    (-1*0.3) + (1*0.5) + (0*-0.4) = 0.2    1
-1   1    1    (-1*0.3) + (1*0.5) + (1*-0.4) = -0.2   0
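A minimal sketch verifying the AND, OR and NOT weight settings above, with a bias input a0 = -1 and a step activation:

def fires(weights, inputs):
    # Output 1 iff the weighted sum of inputs (bias included) exceeds 0.
    return 1 if sum(w * a for w, a in zip(weights, inputs)) > 0 else 0

AND = (1.5, 1, 1)    # (W0, W1, W2), applied to (a0 = -1, a1, a2)
OR  = (0.5, 1, 1)
NOT = (-0.5, -1)     # (W0, W1), applied to (a0 = -1, a1)

for a1 in (0, 1):
    for a2 in (0, 1):
        print(f"AND({a1},{a2}) = {fires(AND, (-1, a1, a2))}",
              f"OR({a1},{a2}) = {fires(OR, (-1, a1, a2))}")
    print(f"NOT({a1}) = {fires(NOT, (-1, a1))}")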
Pros and Cons of Neural Network
• Useful for learning complex data like handwriting, speech
and image recognition
Pros:
+ Can learn more complicated class boundaries
+ Fast application
+ Can handle a large number of features

Cons:
- Slow training time
- Hard to interpret & understand the learned function (weights)
- Hard to implement: trial & error for choosing the number of nodes