Chapter 02_DM tasks_Part I_Classification
Classification: Definition
• Classification is a data mining (machine learning) technique used to predict group
membership for data instances.
• Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class:
– Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model.
– Usually, the given data set is divided into training and test sets; the training set is used to build the model and the test set is used to validate it (see the code sketch below).
• For example, one may use classification to predict whether the weather on a
particular day will be “sunny”, “rainy” or “cloudy”.
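As a concrete illustration of the workflow just described (divide the data, build the model on the training set, validate on the test set), here is a minimal sketch using scikit-learn; the library choice, the toy iris dataset and all variable names are assumptions made for illustration, not part of the original slides:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Records whose class attribute is known
X, y = load_iris(return_X_y=True)

# Divide the data set into a training set (to build the model) and a test set (to validate it)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Find a model for the class attribute as a function of the other attributes
model = DecisionTreeClassifier().fit(X_train, y_train)

# Previously unseen records are assigned a class; the test set measures accuracy
print("Test-set accuracy:", accuracy_score(y_test, model.predict(X_test)))
```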
Supervised vs. Unsupervised Learning (1)
❑ Supervised learning (classification)
❑ Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the classes to which they belong
❑ New data is classified based on the model built from the training set
Training data with class labels (the labelled training instances are fed to a learning algorithm, which builds the model):

Outlook  | Temp | Humidity | Windy | Play Golf
Rainy    | Hot  | High     | False | No
Rainy    | Hot  | High     | True  | No
Overcast | Hot  | High     | False | Yes
Confusion Matrix & Performance Evaluation
                      PREDICTED CLASS
                      Class=Yes    Class=No
ACTUAL   Class=Yes    a (TP)       b (FN)
CLASS    Class=No     c (FP)       d (TN)

• Accuracy = (a + d) / (a + b + c + d)
• Other metrics for performance evaluation are Precision, Recall & F-Measure
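For concreteness, these measures can be computed directly from the four counts in the matrix; the counts below are made-up values used only for illustration:

```python
# a = TP, b = FN, c = FP, d = TN (made-up example counts)
a, b, c, d = 50, 10, 5, 35

accuracy  = (a + d) / (a + b + c + d)
precision = a / (a + c)   # of the records predicted Yes, the fraction that are truly Yes
recall    = a / (a + b)   # of the truly Yes records, the fraction that were found
f_measure = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f_measure)
```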
Classification methods
• Goal: predict the class C = f(x1, x2, …, xn)
• There are various classification methods. Popular classification techniques include the following.
– Decision tree classifier: divides the decision space into piecewise-constant regions.
Decision Tree Induction: Algorithm
❑ Decision tree performs classification by constructing a tree based on training instances.
❑ The tree is traversed for each test instance to find a leaf, and the class of the leaf is the
predicted class.
❑ Basic algorithm (a code sketch follows this list)
❑ Tree is constructed in a top-down, recursive, divide-and-conquer manner
❑ At the start, all the training examples are at the root
❑ At each node, the splitting attribute is selected on the basis of a heuristic or statistical measure (e.g., information gain, Gini index) computed from the training examples at that node
❑ Conditions for stopping partitioning
❑ All samples for a given node belong to the same class
❑ There are no remaining attributes for further partitioning – majority voting is employed
for classifying the leaf
❑ There are no samples left
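The basic algorithm above can be written as a short recursive function. The following is a minimal sketch, assuming records are Python dicts of attribute values and information gain is the selection measure; the function and variable names are illustrative, not from the slides:

```python
import math
from collections import Counter

def entropy(labels):
    """Expected information (disorder) needed to classify the given labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(records, labels, attr):
    """Reduction in entropy obtained by splitting on attr."""
    total = len(labels)
    remainder = 0.0
    for value in set(r[attr] for r in records):
        subset = [lab for r, lab in zip(records, labels) if r[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def build_tree(records, labels, attributes):
    # Stopping condition: all samples at this node belong to the same class
    if len(set(labels)) == 1:
        return labels[0]
    # Stopping condition: no remaining attributes -> majority voting for the leaf
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Select the attribute with the best heuristic score and partition on it
    best = max(attributes, key=lambda a: information_gain(records, labels, a))
    tree = {best: {}}
    for value in set(r[best] for r in records):
        idx = [i for i, r in enumerate(records) if r[best] == value]
        tree[best][value] = build_tree([records[i] for i in idx],
                                       [labels[i] for i in idx],
                                       [a for a in attributes if a != best])
    return tree
```

Classifying a test instance then amounts to walking down the nested dictionaries until a leaf (class label) is reached.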
Attribute Selection Measure
• Information gain
–Select the attribute with the highest information gain
• First, compute the disorder using Entropy; the expected information needed to classify
objects into classes
• Second, measure the Information Gain: by how much the disorder of a set would be reduced by knowing the value of a particular attribute.
– With the information gain measure we want:
• a large gain
• equivalently: a small average disorder after the split
• GINI index
– An alternative to information gain that measures the impurity of a split in the classification task
– Select the attribute with the smallest GINI value
Entropy
• The Entropy measures the disorder of a set S containing a total of n examples, of which n+ are positive and n− are negative, and it is given by:
$$\mathrm{Entropy}(S) = -\frac{n_+}{n}\log_2\frac{n_+}{n} - \frac{n_-}{n}\log_2\frac{n_-}{n}$$
• Information Gain: measures the reduction in Entropy achieved by the split; choose the split that achieves the largest reduction (maximizes GAIN)
– Used in ID3 and C4.5
– Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure
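Putting the two steps together, the gain of an attribute A on a set S can be written out as follows (a standard formulation, included here because the slide's formula image is not reproduced):

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$

For example, a set with 9 positive and 5 negative examples has

$$\mathrm{Entropy}(S) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940$$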
Decision Tree Induction: An Example
Outlook has the highest gain, therefore it will be the root node. We cannot yet decide on the Sunny and Rainy branches, because Sunny has 2 yes and 3 no and Rainy has 3 yes and 2 no; but we can decide on Overcast, because all of its examples are categorized as Yes.
For Sunny:
Attribute Selection by Information Gain
• Class P: buys_computer = “yes”
• Class N: buys_computer = “no”
• $\mathrm{Info}(D) = I(9, 5) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} = 0.940$
• Compute the entropy for age: $\mathrm{Info}_{age}(D) = \tfrac{5}{14} I(2,3) + \tfrac{4}{14} I(4,0) + \tfrac{5}{14} I(3,2) = 0.694$
• Hence $\mathrm{Gain}(age) = \mathrm{Info}(D) - \mathrm{Info}_{age}(D) = 0.940 - 0.694 = 0.246$
• Similarly, the gains of income, student and credit_rating are computed; age gives the highest gain and is therefore chosen as the splitting attribute.
Output: A Decision Tree for “buys_computer”
[Decision tree: the root node tests age. The branch "<=30" leads to a test on student (no → buys_computer = no, yes → buys_computer = yes); the branch "31..40" leads directly to buys_computer = yes; the branch ">40" leads to a test on credit_rating (excellent → yes, fair → no), as captured by the rules below.]
Classification Rules
IF age = “<=30” & student = “no” THEN buys_computer = “no”
IF age = “<=30” & student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” & credit_rating = “excellent” THEN buys_computer = “yes”
IF age = “>40” & credit_rating = “fair” THEN buys_computer = “no”
Exercise 2: The problem of “Sunburn”
• You want to predict whether another person is likely to get sunburned if they go back to the beach. How can you do this?
• Data collected: make the prediction based on the observed properties of the people
Exercise 3: 'Is the customer Good, Doubtful or Poor?'
Customer ID | Debt   | Income | Marital Status | Risk
Abel        | High   | High   | Married        | Good
Ben         | Low    | High   | Married        | Doubtful
Candy       | Medium | Low    | Unmarried      | Poor
Dale        | High   | Low    | Married        | Poor
Ellen       | High   | Low    | Married        | Poor
Fred        | High   | Low    | Married        | Poor
George      | Low    | High   | Unmarried      | Doubtful
Harry       | Low    | Medium | Married        | Doubtful
Igor        | Low    | High   | Married        | Good
Jack        | High   | High   | Married        | Doubtful
Kate        | Low    | Low    | Married        | Poor
Lane        | Medium | High   | Unmarried      | Good
Mary        | High   | Low    | Unmarried      | Poor
Nancy       | Low    | Medium | Unmarried      | Doubtful
Othello     | Medium | High   | Unmarried      | Good
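One possible way to check an answer to this exercise is to encode the table and let a library build the tree. The sketch below uses pandas and scikit-learn; the library choice, the ordinal encoding of the nominal attributes and all names are assumptions made here for illustration:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

# The exercise table, transcribed from the slide
data = pd.DataFrame({
    "Debt":   ["High","Low","Medium","High","High","High","Low","Low","Low","High","Low","Medium","High","Low","Medium"],
    "Income": ["High","High","Low","Low","Low","Low","High","Medium","High","High","Low","High","Low","Medium","High"],
    "Marital Status": ["Married","Married","Unmarried","Married","Married","Married","Unmarried","Married","Married","Married","Married","Unmarried","Unmarried","Unmarried","Unmarried"],
    "Risk":   ["Good","Doubtful","Poor","Poor","Poor","Poor","Doubtful","Doubtful","Good","Doubtful","Poor","Good","Poor","Doubtful","Good"],
})

# Categorical attributes must be encoded numerically for scikit-learn
enc = OrdinalEncoder()
X = enc.fit_transform(data[["Debt", "Income", "Marital Status"]])
y = data["Risk"]

# Build a tree using entropy (information gain) as the attribute selection measure
tree = DecisionTreeClassifier(criterion="entropy")
tree.fit(X, y)
print(export_text(tree, feature_names=["Debt", "Income", "Marital Status"]))
```

Note that the ordinal encoding turns the nominal attributes into numbers, so the library produces binary splits rather than the multiway splits used in the lecture examples.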
Pros and Cons of decision trees
Pros:
• Reasonable training time
• Fast application
• Easy to interpret
• Easy to implement
• Can handle a large number of features

Cons:
• Cannot handle complicated relationships between features
• Simple decision boundaries
• Problems with lots of missing data
Classification: Basic Concepts
• Classification: Basic Concepts
• Decision Tree Induction
• Bayes Classification Methods
• Lazy Learners (or learning from your neighbors)
• Linear Classifiers
• Model Evaluation and Selection
• Techniques to Improve Classification Accuracy
• Summary
Why Bayesian Classification?
• Provides practical learning algorithms
– Probabilistic learning: calculate explicit probabilities for hypotheses, e.g. Naïve Bayes
• Prior knowledge and observed data can be combined
– Incremental: Each training example can incrementally increase/decrease the
probability that a hypothesis is correct.
• It is a generative (model based) approach, which offers a useful conceptual
framework
– Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities
– Any kind of object (e.g. sequences) can be classified, based on a probabilistic model specification
Bayesian Classifiers
• Approach:
– Compute the posterior probability P(C | A1, A2, …, An) for all values of C using Bayes' theorem:
$$P(C \mid A_1, A_2, \ldots, A_n) = \frac{P(A_1, A_2, \ldots, A_n \mid C)\,P(C)}{P(A_1, A_2, \ldots, A_n)}$$
– Choose the value of C that maximizes this posterior. The naïve Bayes classifier additionally assumes the attributes are conditionally independent given the class, so that $P(A_1, \ldots, A_n \mid C) = \prod_i P(A_i \mid C)$.
Play-tennis example
Based on the examples in the table, classify the following unseen sample X :
x=(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=strong)
• That means: Play tennis or not?
• Working (9 of the 14 training examples are "yes", 5 are "no"):
– P(yes)·P(Sunny|yes)·P(Cool|yes)·P(High|yes)·P(Strong|yes) = (9/14)(2/9)(3/9)(3/9)(3/9) = 0.0053
– P(no)·P(Sunny|no)·P(Cool|no)·P(High|no)·P(Strong|no) = (5/14)(3/5)(1/5)(4/5)(3/5) = 0.0206
– Since 0.0206 > 0.0053, the sample is classified as: do not play tennis.
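The working above can be checked with a few lines of Python; the probabilities are read off the training table as in the slide, and everything else is illustrative:

```python
from math import prod

# Naive Bayes score for each class: prior * product of class-conditional probabilities
# for x = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
score_yes = prod([9/14, 2/9, 3/9, 3/9, 3/9])   # P(yes) * prod_i P(attribute_i | yes)
score_no  = prod([5/14, 3/5, 1/5, 4/5, 3/5])   # P(no)  * prod_i P(attribute_i | no)

print(round(score_yes, 4), round(score_no, 4))             # 0.0053 0.0206
print("Play tennis = Yes" if score_yes > score_no else "Play tennis = No")
```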
Exercise: Naïve Bayes Classifier
A: attributes
M: mammals
N: non-mammals
Naive Bayesian Classifier
• Advantages
–Easy to implement
–Good results obtained in most of the cases
–Robust to isolated noise points
–Handle missing values by ignoring the instance during probability estimate calculations
–Robust to irrelevant attributes
• Disadvantages
–The class conditional independence assumption may not hold for some attributes, which leads to a loss of accuracy
–In practice, dependencies exist among variables
• E.g. hospitals: patients: profile: age, family history, etc. symptoms: fever, cough etc.
Disease: lung cancer, diabetes, etc.
• Dependencies among these cannot be modeled by Naïve Bayesian classifier
– How to deal with these dependencies? Bayesian Belief Networks
Neural Network
Brain and Machine
• The Brain
– Pattern Recognition
– Association
– Complexity
– Noise Tolerance
• The Machine
– Calculation
– Precision
– Logic
Neural Network classifier
• It is represented as a layered set of interconnected processors. These processing nodes are loosely modelled on the neurons of the brain. Each node has a weighted connection to several other nodes in adjacent layers. Individual nodes take the inputs received from connected nodes and use the weights to compute output values.
• The inputs are fed simultaneously into the input layer.
• The weighted outputs of these units are fed into hidden layer.
• The weighted outputs of the last hidden layer are inputs to units making up the
output layer.
Architecture of Neural network
• Neural networks are used to look for patterns in data, learn these patterns, and then
classify new patterns & make forecasts
• A network with only the input and output layers is called a single-layer neural network, whereas a multilayer neural network is a generalization with one or more hidden layers.
– A network containing two hidden layers is called a three-layer neural network, and so on.
• Each neuron in the network has three basic components:
1. A set of connecting links (synapses), each characterized by a weight
2. An adder function (linear combiner) for computing the weighted sum of the (real-valued) inputs: $\mathrm{net} = \sum_i w_i x_i - w_0$, where $w_0$ is the threshold weight carried by a fixed bias input of −1
3. An activation function (also called squashing function) for limiting the output behavior of the neuron: $y = f(\mathrm{net})$
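A single neuron therefore reduces to a few lines of code; the sketch below is an assumed illustration of the adder function followed by two common activation functions (a hard threshold and a sigmoid), using the same w0-as-threshold convention as the slides:

```python
import math

def adder(weights, inputs, threshold):
    """Linear combiner: weighted sum of the inputs minus the threshold w0."""
    return sum(w * x for w, x in zip(weights, inputs)) - threshold

def step(net):
    """Hard-limiting activation: fire (1) only if the net input is positive."""
    return 1 if net > 0 else 0

def sigmoid(net):
    """Smooth squashing activation, commonly used with back propagation."""
    return 1.0 / (1.0 + math.exp(-net))

# Example neuron with weights w1, w2 and threshold w0 (arbitrary illustrative values)
net = adder([0.72, 0.62], [1, 0], 0.42)
print(net, step(net), round(sigmoid(net), 3))
```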
Activation Functions
[Figures omitted: plots of the common activation functions used to limit a neuron's output.]
Two Topologies of neural network
• NN can be designed in a feed forward or recurrent manner
• In a feed-forward neural network, the connections between the units do not form a directed cycle.
– In this network, the information moves in only one direction, forward, from the input nodes, through the hidden nodes (if any), to the output nodes.
– In a recurrent network, by contrast, the connections may form cycles (feedback loops).
Training the neural network
• The purpose is to learn to generalize using a set of sample patterns where the
desired output is known.
• Back Propagation is the most commonly used method for training multilayer
feed forward NN.
– Back propagation learns by iteratively processing a set of training data (samples).
– For each sample, weights are modified to minimize the error between the desired
output and the actual output.
• After propagating an input through the network, the error is calculated and
the error is propagated back through the network while the weights are
adjusted in order to make the error smaller.
Training Algorithm
• The applied learning algorithm is as follows (a code sketch is given below)
–Initialize the weights and threshold to small random numbers.
–Present a vector x to the neuron inputs and calculate the output using the adder function.
–Update each weight: $w_i \leftarrow w_i + \eta\,(t - y)\,x_i$, where t is the target output and η is the learning rate (this is the rule applied in the worked example on the following slides).
–Repeat over the training vectors until the outputs match the targets.
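A minimal sketch of this loop, written to match the conventions of the worked example that follows (bias input fixed at −1, step activation, η = 0.1); the code itself is an illustrative assumption, not reproduced from the slides:

```python
def train(samples, w1, w2, w0, eta=0.1, epochs=10):
    for _ in range(epochs):
        for x1, x2, target in samples:
            net = w1 * x1 + w2 * x2 - w0        # adder function (the bias input is -1)
            y = 1 if net > 0 else 0             # step activation
            error = target - y
            w1 += eta * error * x1              # weight update rule
            w2 += eta * error * x2
            w0 += eta * error * (-1)            # the threshold is updated via the -1 bias input
    return w1, w2, w0

# The four training vectors of the following example: (x1, x2, target output)
samples = [(0, 0, 0), (1, 0, 0), (0, 1, 1), (1, 1, 1)]
print(train(samples, 0.92, 0.62, 0.22))         # converges to (0.52, 0.72, 0.52)
```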
ANN Training Example
• Given the following two inputs x1 and x2, find the equation (decision boundary) that separates the two classes.

Bias | 1st input (x1) | 2nd input (x2) | Target output
 -1  |       0        |       0        |      0
 -1  |       1        |       0        |      0
 -1  |       0        |       1        |      1
 -1  |       1        |       1        |      1

• Let us say we have the following initialization:
W1(0) = 0.92, W2(0) = 0.62, W0(0) = 0.22, η = 0.1
• Training – epochs 1 and 2 update the weights from these initial values to W1 = 0.72, W2 = 0.62, W0 = 0.42.
• Training – epoch 3:
y1 = 0.72*0 + 0.62*0 – 0.42 = –0.42 → y = 0 (correct)
y2 = 0.72*1 + 0.62*0 – 0.42 = 0.3 → y = 1 X (target is 0, so the weights are updated)
W1(3) = 0.72 + 0.1 * (0 – 1) * 1 = 0.62
W2(3) = 0.62 + 0.1 * (0 – 1) * 0 = 0.62
W0(3) = 0.42 + 0.1 * (0 – 1) * (–1) = 0.52
y3 = 0.62*0 + 0.62*1 – 0.52 = 0.1 → y = 1 (correct)
y4 = 0.62*1 + 0.62*1 – 0.52 = 0.72 → y = 1 (correct)
ANN Training Example
• Training – epoch 4:
y1 = 0.62*0 + 0.62*0 – 0.52 = –0.52 → y = 0 (correct)
y2 = 0.62*1 + 0.62*0 – 0.52 = 0.10 → y = 1 X (target is 0, so the weights are updated)
W1(4) = 0.62 + 0.1 * (0 – 1) * 1 = 0.52
W2(4) = 0.62 + 0.1 * (0 – 1) * 0 = 0.62
W0(4) = 0.52 + 0.1 * (0 – 1) * (–1) = 0.62
y3 = 0.52*0 + 0.62*1 – 0.62 = 0 → y = 0 X (target is 1, so the weights are updated again)
W1(4) = 0.52 + 0.1 * (1 – 0) * 0 = 0.52
W2(4) = 0.62 + 0.1 * (1 – 0) * 1 = 0.72
W0(4) = 0.62 + 0.1 * (1 – 0) * (–1) = 0.52
y4 = 0.52*1 + 0.72*1 – 0.52 = 0.72 → y = 1 (correct)
• Finally, all four training vectors are classified correctly:
y1 = 0.52*0 + 0.72*0 – 0.52 = –0.52 → y = 0
y2 = 0.52*1 + 0.72*0 – 0.52 = 0.0 → y = 0
y3 = 0.52*0 + 0.72*1 – 0.52 = 0.2 → y = 1
y4 = 0.52*1 + 0.72*1 – 0.52 = 0.72 → y = 1
ANN Training Example
[Plot omitted: the four training points in the (x1, x2) plane, with the learned line separating the positive (+) examples from the negative (o) examples.]
Logical Functions
• McCulloch and Pitts: the basic Boolean functions AND, OR and NOT can each be implemented with a single artificial neuron (XOR cannot).
• With inputs a1, a2, a bias input a0 = −1 and threshold weight W0, the unit outputs 1 when W1·a1 + W2·a2 − W0 > 0:
– AND: W0 = 1.5, W1 = 1, W2 = 1
– OR: W0 = 0.5, W1 = 1, W2 = 1
– NOT: W0 = −0.5, W1 = −1
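These settings are easy to verify with a tiny threshold-unit simulation (an illustrative sketch, not from the slides):

```python
from itertools import product

def fires(w0, w1, w2, a1, a2=0):
    """Threshold unit: output 1 when w1*a1 + w2*a2 - w0 > 0 (the -1 bias input carries w0)."""
    return 1 if w1 * a1 + w2 * a2 - w0 > 0 else 0

for a1, a2 in product([0, 1], repeat=2):
    print(a1, a2, "AND:", fires(1.5, 1, 1, a1, a2), "OR:", fires(0.5, 1, 1, a1, a2))

for a1 in (0, 1):
    print(a1, "NOT:", fires(-0.5, -1, 0, a1))
```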
Pros and Cons of Neural Network
• Useful for learning complex data like handwriting, speech and image recognition
• Has a high tolerance to noisy and incomplete data

Pros:
+ Can learn more complicated class boundaries
+ Fast application
+ Can handle a large number of features

Cons:
– Slow training time (neural networks need a long time for training)
– Hard to interpret & understand the learned function (weights)
– Hard to implement: trial & error for choosing the number of nodes

Conclusion: use neural nets only if decision trees fail.