
DtSc 5140 : Data Analysis and Visualization

Credit Hours: 3(2+1)

Gebremedhin Gebreyohans (PhD)


March, 2025
Chapter Two

DM tasks: Part I- Classification


CHAPTER 2: DM TASKS: (CLASSIFICATION AND CLUSTERING)
◼ Classification
◼ Concepts of Classification;
◼ K-Nearest Neighbor;
◼ Decision Trees;
◼ Naïve Bayes;
◼ Neural Networks
◼ Clustering
◼ Overview of Clustering;
◼ Partitioning algorithms:
◼ K-Means & K-Medoids;
◼ Hierarchical Clustering: Divisive & Agglomerative Algorithms;
◼ Single-link, Double link & Average link clustering
Slide 1-3
Classification: Basic Concepts
❑ Classification: Basic Concepts

❑ Decision Tree Induction

❑ Bayes Classification Methods

❑ Lazy Learners (or learning from your neighbors)

❑ Linear Classifiers

❑ Model Evaluation and Selection

❑ Techniques to Improve Classification Accuracy

❑ Summary

4
Classification: Definition
• Classification is a data mining (machine learning) technique used to predict group
membership for data instances.
• Given a collection of records (training set), each record contains a set of attributes,
one of the attributes is the class.
– Find a model for class attribute as a function of the values of other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model.
– Usually, the given data set is divided into training and test sets, with training set
used to build the model and test set used to validate it.
• For example, one may use classification to predict whether the weather on a
particular day will be “sunny”, “rainy” or “cloudy”.

5
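A minimal sketch of this train-and-test workflow (assuming scikit-learn is available; the iris data and the decision-tree model are placeholders for any dataset and classifier):

```python
# Sketch: split labelled records into a training set and a test set,
# build a model on the training set, and measure accuracy on the test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                 # records (attributes) and their class labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier()                  # any classifier could be used here
model.fit(X_train, y_train)                       # learn the class attribute from the other attributes
y_pred = model.predict(X_test)                    # assign classes to previously unseen records
print("Test accuracy:", accuracy_score(y_test, y_pred))
```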
Supervised vs. Unsupervised Learning (1)
❑ Supervised learning (classification)
❑ Supervision: The training data such as observations or measurements are
accompanied by labels indicating the classes which they belong to
❑ New data is classified based on the models built from the training set
Training Data with class label:
Outlook   Temp  Humidity  Windy  Play Golf
Rainy     Hot   High      False  No
Rainy     Hot   High      True   No
Overcast  Hot   High      False  Yes
Sunny     Mild  High      False  Yes
Sunny     Cool  Normal    False  Yes
Sunny     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Rainy     Mild  High      False  No

[Diagram: the training instances are fed to a learning algorithm that builds a model; test instances are then passed to the model, which predicts the class (Positive or Negative).]
6
Classification—Model Construction, Validation and
Testing
❑ Model Construction and Training
❑ Model: Represented as decision trees, rules, mathematical formulas, or other forms
❑ Assumption: Each sample belongs to a predefined class /class label
❑ Training Set: The set of samples used for model construction
❑ Model Validation and Testing:
❑ Test: Estimate accuracy of the model
❑ The known label of test sample VS. the classified result from the model
❑ Accuracy: % of test set samples that are correctly classified by the model
❑ Test set is independent of training set
❑ Validation: If the test set is used to select or refine models, it is called a validation (or
development) set
❑ Model Deployment: If the accuracy is acceptable, use the model to classify new data

7
Confusion Matrix & Performance Evaluation
                         PREDICTED CLASS
                         Class=Yes      Class=No
ACTUAL     Class=Yes     a (TP)         b (FN)
CLASS      Class=No      c (FP)         d (TN)

• The most widely-used metric is the Accuracy of the system:
  Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

• Other metrics for performance evaluation are Precision, Recall & F-Measure:
  Precision = TP / (TP + FP)
  Recall = TP / (TP + FN)
  F-Measure = 2 · Precision · Recall / (Precision + Recall)
9
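A minimal sketch of these metrics computed from confusion-matrix counts (plain Python; the counts in the example are hypothetical):

```python
def confusion_metrics(tp, fn, fp, tn):
    """Compute accuracy, precision, recall and F-measure from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

# Hypothetical counts: 40 true positives, 10 false negatives, 5 false positives, 45 true negatives
print(confusion_metrics(tp=40, fn=10, fp=5, tn=45))   # (0.85, 0.888..., 0.8, 0.842...)
```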
Classification methods
• Goal: Predict class Ci = f(x1, x2, …, xn)
• There are various classification methods. Popular classification
techniques include the following.
– Decision tree classifier: divide decision space into piecewise constant regions.

– Bayesian network: a probabilistic model

– K-Nearest Neighbour: classify based on similarity measurement

– Neural networks: partition by non-linear boundaries

– Support vector machine: solves non-linearly separable problems

12
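For illustration, each of these families has an off-the-shelf implementation; a minimal sketch assuming scikit-learn:

```python
# Sketch only: instantiating the classifier families listed above with scikit-learn.
from sklearn.tree import DecisionTreeClassifier        # decision tree classifier
from sklearn.naive_bayes import GaussianNB             # Bayesian (Naive Bayes) classifier
from sklearn.neighbors import KNeighborsClassifier     # K-Nearest Neighbour
from sklearn.neural_network import MLPClassifier       # neural network (multilayer perceptron)
from sklearn.svm import SVC                            # support vector machine

models = {
    "decision tree": DecisionTreeClassifier(),
    "naive bayes":   GaussianNB(),
    "k-NN":          KNeighborsClassifier(n_neighbors=3),
    "neural net":    MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000),
    "SVM":           SVC(kernel="rbf"),
}
# Each model exposes the same interface: model.fit(X_train, y_train) then model.predict(X_test).
```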


Classification: Basic Concepts
❑ Classification: Basic Concepts

❑ Decision Tree Induction

❑ Bayes Classification Methods

❑ Lazy Learners (or learning from your neighbors)

❑ Linear Classifiers

❑ Model Evaluation and Selection

❑ Techniques to Improve Classification Accuracy

❑ Summary

13
Decision Tree Induction: Algorithm
❑ Decision tree performs classification by constructing a tree based on training instances.
❑ The tree is traversed for each test instance to find a leaf, and the class of the leaf is the
predicted class.
❑ Basic algorithm
❑ Tree is constructed in a top-down, recursive, divide-and-conquer manner
❑ At start, all the training examples are at the root
❑ On each node, attributes are selected based on the training examples on that node, and
a heuristic or statistical measure (e.g., information gain, Gini index)
❑ Conditions for stopping partitioning
❑ All samples for a given node belong to the same class
❑ There are no remaining attributes for further partitioning – majority voting is employed
for classifying the leaf
❑ There are no samples left
14
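A minimal sketch of this top-down, divide-and-conquer procedure (plain Python, using information gain for attribute selection; the function and variable names are illustrative, not taken from the slides):

```python
import math
from collections import Counter

def entropy(labels):
    """Expected information needed to classify a set with the given class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Reduction in entropy obtained by splitting the examples on attribute attr."""
    n = len(labels)
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attrs):
    if not rows:
        return None                       # no samples left (the parent's majority class would be used)
    if len(set(labels)) == 1:
        return labels[0]                  # all samples belong to the same class -> leaf
    if not attrs:
        return Counter(labels).most_common(1)[0][0]   # no attributes left -> majority voting
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))   # heuristic attribute selection
    tree = {best: {}}
    for value in set(row[best] for row in rows):      # recursive divide-and-conquer
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = build_tree([rows[i] for i in idx],
                                       [labels[i] for i in idx],
                                       [a for a in attrs if a != best])
    return tree

def classify(tree, row):
    """Traverse the tree for a test instance; the leaf reached is the predicted class."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][row[attr]]
    return tree

# Tiny usage example with a few of the play-golf records from the earlier slide:
rows = [
    {"Outlook": "Rainy", "Windy": "False"}, {"Outlook": "Overcast", "Windy": "False"},
    {"Outlook": "Sunny", "Windy": "False"}, {"Outlook": "Rainy", "Windy": "True"},
]
labels = ["No", "Yes", "Yes", "No"]
tree = build_tree(rows, labels, ["Outlook", "Windy"])
print(classify(tree, {"Outlook": "Overcast", "Windy": "True"}))   # Yes
```

Calling classify(build_tree(rows, labels, attrs), test_row) mirrors the traversal described above: the tree is walked from the root until a leaf is reached and the leaf's class is returned.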
Attribute Selection Measure
• Information gain
–Select the attribute with the highest information gain
• First, compute the disorder using Entropy; the expected information needed to classify
objects into classes
• Second, measure the Information Gain; to calculate by how much the disorder of a set
would reduce by knowing the value of a particular attribute.
– In information gain measure we want:-
• large Gain
• same as: small average disorder created
• GINI index
– An alternative to information gain that measures the impurity of attributes in the classification
task
– Select the attribute with the smallest GINI value

15
Entropy
• The Entropy measures the disorder of a set S containing a total of n examples, of which n+ are
positive and n− are negative, and it is given by:

  D(n+, n−) = −(n+/n)·log2(n+/n) − (n−/n)·log2(n−/n)

• Some useful properties of the Entropy:


• D(n,m) = D(m,n)
• D(0,m) = 0
• D(S)=0 means that all the examples in S have the same class
• D(m,m) = 1
• D(S)=1 means that half the examples in S are of one class and half are the opposite class
16
Information Gain
• The Information Gain measures the expected reduction in entropy due to splitting on an
attribute A

  GAIN_split = Entropy(p) − Σ (ni / n) · Entropy(i),   summing over the partitions i = 1 … k

  where the parent node p is split into k partitions, ni is the number of records in partition i,
  and n is the total number of records in p

• Information Gain: Measures Reduction in Entropy achieved because of the split. Choose the
split that achieves most reduction (maximizes GAIN)
– Used in ID3 and C4.5
– Disadvantage: Tends to prefer splits that result in large number of partitions, each being
small but pure.
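A small numeric check of these formulas (plain Python; the 9 yes / 5 no counts match the buys_computer example used later, where E(9, 5) = 0.940, and the three-way split corresponds to the Sunny / Overcast / Rainy partitions of the classic play-tennis data):

```python
import math

def entropy(counts):
    """D(n1, n2, ...): expected information needed to classify a set with the given class counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

# Entropy of a set with 9 positive and 5 negative examples (cf. E(9, 5) = 0.940 later in the slides).
print(round(entropy([9, 5]), 3))         # 0.940

# Information gain of splitting those 14 records into partitions with class counts (2,3), (4,0), (3,2).
parent = entropy([9, 5])
children = [(5, entropy([2, 3])), (4, entropy([4, 0])), (5, entropy([3, 2]))]
gain = parent - sum((size / 14) * e for size, e in children)
print(round(gain, 3))                    # about 0.247
```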
Decision Tree Induction: An Example

[Image: the play-tennis training dataset (days D1–D14 with attributes Outlook, Temperature, Humidity, Wind and the class PlayTennis) and the information-gain calculation for each attribute.]
Now, Outlook has the highest gain; therefore it will be the root node.
We can't decide on Sunny and Rain, because Sunny has 2 yes and 3 no and Rain has 3 yes and 2 no,
but we can decide on Overcast because all of its examples are categorized as yes.

Next step: we need to calculate the gains for the Sunny and Rain branches.
First let's do Sunny, whose examples are (D1, D2, D8, D9, D11).
For Sunny, Humidity has the highest gain; therefore it will be the next node.
For Rain, Wind has the highest gain; therefore it will be the next node.
Classification Rules
IF outlook = “sunny” & humidity = “high” THEN play tennis = “no”
IF outlook = “sunny” & humidity = “normal” THEN play tennis = “yes”
IF outlook = “overcast” THEN play tennis = “yes”
IF outlook = “rain” & wind = “strong” THEN play tennis = “no”
IF outlook = “rain” & wind = “weak” THEN play tennis = “yes”
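These rules translate directly into code; a minimal sketch (plain Python, attribute values as lowercase strings):

```python
def play_tennis(outlook, humidity, wind):
    """Apply the decision-tree rules above to one instance; returns 'yes' or 'no'."""
    if outlook == "sunny":
        return "no" if humidity == "high" else "yes"
    if outlook == "overcast":
        return "yes"
    if outlook == "rain":
        return "no" if wind == "strong" else "yes"
    raise ValueError("unknown outlook value: " + outlook)

print(play_tennis("sunny", "high", "weak"))     # no
print(play_tennis("rain", "normal", "weak"))    # yes
```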
Exercise 1 : Decision Tree for “buy computer or not”.
Use the training Dataset given below to construct decision tree

27
Attribute Selection by Information Gain
• Class P: buys_computer = “yes”
• Class N: buys_computer = “no”
• E(P, N) = E(9, 5) = 0.940
• Compute the entropy for age:
  E(age) = (5/14)·E(2, 3) + (4/14)·E(4, 0) + (5/14)·E(3, 2) = 0.694
  Hence Gain(age) = E(9, 5) − E(age) = 0.940 − 0.694 = 0.246
• Similarly, Gain(income) = 0.029, Gain(student) = 0.151 and Gain(credit_rating) = 0.048,
  so age is selected as the splitting attribute.
28
Output: A Decision Tree for “buys_computer”
age?
  <=30   → student?
             no  → buys_computer = no
             yes → buys_computer = yes
  31..40 → buys_computer = yes
  >40    → credit_rating?
             excellent → buys_computer = no
             fair      → buys_computer = yes
Classification Rules
IF age = “<=30” & student = “no” THEN buys_computer = “no”
IF age = “<=30” & student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” & credit_rating = “excellent” THEN buys_computer = “no”
IF age = “>40” & credit_rating = “fair” THEN buys_computer = “yes”
29
Exercise 2: The problem of “Sunburn”
• You want to predict whether another person is likely to get sunburned if he is back
to the beach. How can you do this?
• Data Collected: predict based on the observed properties of the people

30
Exercise: 3 ‘is the customer Good, Doubtful or Poor?’
Customer ID  Debt    Income  Marital Status  Risk
Abel         High    High    Married         Good
Ben          Low     High    Married         Doubtful
Candy        Medium  Low     Unmarried       Poor
Dale         High    Low     Married         Poor
Ellen        High    Low     Married         Poor
Fred         High    Low     Married         Poor
George       Low     High    Unmarried       Doubtful
Harry        Low     Medium  Married         Doubtful
Igor         Low     High    Married         Good
Jack         High    High    Married         Doubtful
Kate         Low     Low     Married         Poor
Lane         Medium  High    Unmarried       Good
Mary         High    Low     Unmarried       Poor
Nancy        Low     Medium  Unmarried       Doubtful
Othello      Medium  High    Unmarried       Good
31
Pros and Cons of decision trees
Pros
• Reasonable training time
• Fast application
• Easy to interpret
• Easy to implement
• Can handle a large number of features

Cons
– Cannot handle complicated relationships between features
– Simple decision boundaries
– Problems with lots of missing data
Why decision tree induction in data mining?


• Relatively faster learning speed (than other classification methods)
• Convertible to simple and easy to understand classification if-then-else rules
• Comparable classification accuracy with other methods
• Does not require any prior knowledge of data distribution, works well on noisy data.

32
Classification: Basic Concepts
• Classification: Basic Concepts
• Decision Tree Induction
• Bayes Classification Methods
• Lazy Learners (or learning from your neighbors)
• Linear Classifiers
• Model Evaluation and Selection
• Techniques to Improve Classification Accuracy
• Summary
Why Bayesian Classification?
• Provides practical learning algorithms
– Probabilistic learning: Calculate explicit probabilities for hypothesis. E.g.
Naïve Bayes
• Prior knowledge and observed data can be combined
– Incremental: Each training example can incrementally increase/decrease the
probability that a hypothesis is correct.
• It is a generative (model based) approach, which offers a useful conceptual
framework
– Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities
– Any kind of object (e.g. sequences) can be classified, based on a probabilistic model specification

34
Bayesian Classifiers
• Approach:
– compute the posterior probability P(C | A1, A2, …, An) for all values of C using the Bayes
theorem:

  P(C | A1, A2, …, An) = P(A1, A2, …, An | C) · P(C) / P(A1, A2, …, An)
– Choose value of C that maximizes


P(C | A1, A2, …, An)

– Equivalent to choosing value of C that maximizes


P(A1, A2, …, An|C) P(C)

• How to estimate P(A1, A2, …, An | C )?


35
Example. ‘Play Tennis’ data
• Suppose that you have a free afternoon and you are thinking about whether or not to go and play
tennis. How do you decide? Based on the training data, the goal is to predict whether this player
will play tennis.
• The following training data examples are prepared for the classifier:

36
Play-tennis example
Based on the examples in the table, classify the following unseen sample X :
x=(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=strong)
• That means: Play tennis or not?

• Working:
– P(yes) · P(Sunny|yes) · P(Cool|yes) · P(High|yes) · P(Strong|yes) = (0.64)(2/9)(3/9)(3/9)(3/9) = 0.0053
– P(no) · P(Sunny|no) · P(Cool|no) · P(High|no) · P(Strong|no) = (0.36)(3/5)(1/5)(4/5)(3/5) = 0.0206
– Since 0.0206 > 0.0053, the classifier predicts Play Tennis = no for the sample X.
38
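A minimal sketch of this calculation (plain Python; the priors and conditional probabilities are the ones used in the working above):

```python
# Naive Bayes "by hand": multiply the class prior by the conditional probability of each
# attribute value, then pick the class with the largest product.
priors = {"yes": 9/14, "no": 5/14}
cond = {                       # P(attribute = value | class), estimated from the training table
    "yes": {"Outlook=Sunny": 2/9, "Temperature=Cool": 3/9, "Humidity=High": 3/9, "Wind=Strong": 3/9},
    "no":  {"Outlook=Sunny": 3/5, "Temperature=Cool": 1/5, "Humidity=High": 4/5, "Wind=Strong": 3/5},
}

x = ["Outlook=Sunny", "Temperature=Cool", "Humidity=High", "Wind=Strong"]

scores = {}
for c in priors:
    score = priors[c]
    for attr_value in x:
        score *= cond[c][attr_value]
    scores[c] = score

print(scores)                          # {'yes': ~0.0053, 'no': ~0.0206}
print(max(scores, key=scores.get))     # 'no' -> the player is predicted not to play
```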
Exercise: Naïve Bayes Classifier
A: attributes
M: mammals
N: non-mammals

P(A|M)P(M) > P(A|N)P(N)


=> Mammals

39
Naive Bayesian Classifier
• Advantages
–Easy to implement
–Good results obtained in most of the cases
–Robust to isolated noise points
–Handle missing values by ignoring the instance during probability estimate calculations
–Robust to irrelevant attributes
• Disadvantages
–Class conditional independence assumption may not hold for some attributes, therefore
loss of accuracy
–Practically dependencies exist among variables
• E.g. hospitals: patients: profile: age, family history, etc. symptoms: fever, cough etc.
Disease: lung cancer, diabetes, etc.
• Dependencies among these cannot be modeled by Naïve Bayesian classifier
– How to deal with these dependencies? Bayesian Belief Networks

40
Neural Network

41
Brain and Machine
• The Brain
– Pattern Recognition
– Association
– Complexity
– Noise Tolerance

• The Machine
– Calculation
– Precision
– Logic
42
Neural Network classifier
• It is represented as a layered set of interconnected processors. These processor
nodes have a relationship with the neurons of the brain. Each node has a
weighted connection to several other nodes in adjacent layers. Individual
nodes take the input received from connected nodes and combine it with the
connection weights to compute output values.
• The inputs are fed simultaneously into the input layer.
• The weighted outputs of these units are fed into hidden layer.
• The weighted outputs of the last hidden layer are inputs to units making up the
output layer.

43
Architecture of Neural network
• Neural networks are used to look for patterns in data, learn these patterns, and then
classify new patterns & make forecasts
• A network with the input and output layers only is called a single-layered neural network,
whereas a multilayer neural network is a generalized one with one or more hidden layers.
– A network containing two hidden layers is called a three-layer neural network, and so on.

[Diagram: a single-layered NN with inputs x1, x2, x3 connected through weights w1, w2, w3 directly to an output node; a multilayer NN with input nodes, hidden nodes and output nodes.]
44
A Multilayer Neural Network
• INPUT: records with class attribute, with normalized attribute values.
– INPUT VECTOR: X = {x1, x2, …, xn}, where n is the number of attributes.
– INPUT LAYER – there are as many nodes as attributes, i.e. as the length of the input vector.
• HIDDEN LAYER – neither its input nor its output can be observed from outside.
– The number of nodes in the hidden layer and the number of hidden layers depends on the implementation.
• OUTPUT LAYER – corresponds to the class attribute.
– There are as many nodes as classes (values of the class attribute): Ok, where k = 1, 2, …, number of classes.

[Diagram: input layer → hidden layer → output layer.]
45
Hidden layer: Neuron with Activation
• The neuron is the basic information processing unit of a NN. It consists of:
1. A set of links, describing the neuron inputs, with weights W1, W2, …, Wm

2. An adder function (linear combiner) for computing the weighted sum of the (real-valued) inputs:

   net = W1·x1 + W2·x2 + … + Wm·xm

3. An activation function (also called squashing function): for limiting the output behavior of
the neuron.

46
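A minimal sketch of one neuron's computation (plain Python; a step and a sigmoid activation are shown as examples, and the weights and inputs are illustrative values):

```python
import math

def adder(weights, inputs):
    """Linear combiner: weighted sum of the neuron inputs."""
    return sum(w * x for w, x in zip(weights, inputs))

def step(net, threshold=0.0):
    """Hard-limiting (threshold) activation."""
    return 1 if net > threshold else 0

def sigmoid(net):
    """Squashing activation: 1 / (1 + e^-x), output between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-net))

# One neuron with three inputs (the third input is a constant -1 acting as the bias).
weights, inputs = [0.92, 0.62, 0.22], [1.0, 0.0, -1.0]
net = adder(weights, inputs)
print(net, step(net), round(sigmoid(net), 3))   # 0.70 1 0.668
```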
Activation Functions

• (a) is a step function or threshold function (hardlimiting):


• (b) is a sigmoid function: 1/(1 + e^(−x))
• Changing the bias weight W0,i moves the threshold location
– Bias helps the neural network to be more flexible, since it shifts the activation function left or right,
centering it on some value other than x = 0. To this effect an additional node is added to the input layer
with a constant input, say 1 or −1. When this is multiplied by the weights of the hidden layer, it provides
a bias (DC offset) to the activation function.

47
Activation Functions

48
Two Topologies of neural network
• NN can be designed in a feed forward or recurrent manner
• In a feed forward neural network connections between the units
do not form a directed cycle.
– In this network, the information moves in only one direction, forward, from
the input nodes, through the hidden nodes (if any) & to the output nodes.

– There are no cycles or loops and no feedback connections in the network, that is, no
connections extending from outputs of units to inputs of units in the same layer or
previous layers.

• In recurrent networks data circulates back & forth until the


activation of the units is stabilized
– Recurrent networks have a feedback loop where data can be fed back into
the input at some point before it is fed forward again for further processing
and final output.

49
Training the neural network
• The purpose is to learn to generalize using a set of sample patterns where the
desired output is known.
• Back Propagation is the most commonly used method for training multilayer
feed forward NN.
– Back propagation learns by iteratively processing a set of training data (samples).
– For each sample, weights are modified to minimize the error between the desired
output and the actual output.
• After propagating an input through the network, the error is calculated and
the error is propagated back through the network while the weights are
adjusted in order to make the error smaller.

50
Training Algorithm
• The applied learning algorithm is as follows
– Initialize the weights and threshold to small random numbers.
– Present a vector x to the neuron inputs and calculate the output using the adder function:
    net = W1·x1 + W2·x2 + W0·(−1)   (the threshold W0 enters through a constant bias input of −1)
– Apply the activation function such that
    y = 1 if net > 0, otherwise y = 0
– Update the weights according to the error:
    Wi(new) = Wi(old) + η·(target − y)·xi

51
ANN Training Example
Given the following two inputs x1, x2, find the equation that helps to draw the boundary.
• Let's say we have the following initializations:
  W1(0) = 0.92, W2(0) = 0.62, W0(0) = 0.22, η = 0.1

  Bias   1st input (x1)   2nd input (x2)   Target output
  -1     0                0                0
  -1     1                0                0
  -1     0                1                1
  -1     1                1                1
• Training – epoch 1:

y1 = 0.92*0 + 0.62*0 – 0.22 = -0.22 🡪 y = 0

y2 = 0.92*1 + 0.62*0 – 0.22 = 0.7 🡪 y =1 X

W1(1) = 0.92 + 0.1 * (0 – 1) * 1 = 0.82

W2(1) = 0.62 + 0.1 * (0 – 1) * 0 = 0.62

W0(1) = 0.22 + 0.1 * (0 – 1) * (-1)= 0.32

y3 = 0.82*0 + 0.62*1 – 0.32 = 0.3 🡪 y = 1


52
ANN Training Example
• Training – epoch 2:
y1 = 0.82*0 + 0.62*0 – 0.32 = -0.32 🡪 y= 0
y2 = 0.82*1 + 0.62*0 – 0.32 = 0.5 🡪 y= 1 X
W1(2) = 0.82 + 0.1 * (0 – 1) * 1 = 0.72
W2(2) = 0.62 + 0.1 * (0 – 1) * 0 = 0.62
W0(2) = 0.32 + 0.1 * (0 – 1) * (-1)= 0.42
y3 = 0.72*0 + 0.62*1 – 0.42 = 0.2 🡪 y= 1
y4 = 0.72*1 + 0.62*1 – 0.42 = 0.92 🡪 y = 1

• Training – epoch 3:
y1 = 0.72*0 + 0.62*0 – 0.42 = -0.42 🡪 y = 0
y2 = 0.72*1 + 0.62*0 – 0.42 = 0.3 🡪 y = 1 X
W1(3) = 0.72 + 0.1 * (0 – 1) * 1 = 0.62
W2(3) = 0.62 + 0.1 * (0 – 1) * 0 = 0.62
W0(3) = 0.42 + 0.1 * (0 – 1) * (-1)= 0.52
y3 = 0.62*0 + 0.62*1 – 0.52 = 0.1🡪 y = 1 53
ANN Training Example
• Training – epoch 4:
y1 = 0.62*0 + 0.62*0 – 0.52 = -0.52 🡪 y = 0
y2 = 0.62*1 + 0.62*0 – 0.52 = 0.10🡪 y = 1 X
W1(4) = 0.62 + 0.1 * (0 – 1) * 1 = 0.52
W2(4) = 0.62 + 0.1 * (0 – 1) * 0 = 0.62
W0(4) = 0.52 + 0.1 * (0 – 1) * (-1)= 0.62
y3 = 0.52*0 + 0.62*1 – 0.62 = 0 🡪 y = 0 X
W1(4) = 0.52 + 0.1 * (1 – 0) * 0 = 0.52
W2(4) = 0.62 + 0.1 * (1 – 0) * 1 = 0.72
W0(4) = 0.62 + 0.1 * (1 – 0) * (-1)= 0.52
y4 = 0.52*1 + 0.72*1 – 0.52 = 0.72 🡪 y = 1

• Finally:
y1 = 0.52*0 + 0.72*0 – 0.52 = -0.52 🡪 y = 0
y2 = 0.52*1 + 0.72*0 – 0.52 = -0.0 🡪 y = 0
y3 = 0.52*0 + 0.72*1 – 0.52 = 0.2 🡪 y = 1
y4 = 0.52*1 + 0.72*1 – 0.52 = 0.72 🡪 y = 1
54
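A minimal sketch of this perceptron training loop (plain Python, using the same initial weights, learning rate and data as the hand-worked example, so the weight updates above can be reproduced):

```python
# Perceptron training with a step activation and the update rule w <- w + eta*(target - y)*x.
# The bias/threshold is handled as an extra input fixed at -1 with weight W0.
data = [([0, 0], 0), ([1, 0], 0), ([0, 1], 1), ([1, 1], 1)]   # (x1, x2) -> target
w1, w2, w0 = 0.92, 0.62, 0.22                                 # initial weights, as in the slides
eta = 0.1

for epoch in range(1, 6):
    errors = 0
    for (x1, x2), target in data:
        net = w1 * x1 + w2 * x2 + w0 * (-1)      # adder function (bias input is -1)
        y = 1 if net > 0 else 0                  # step activation
        if y != target:                          # update weights only on a misclassification
            w1 += eta * (target - y) * x1
            w2 += eta * (target - y) * x2
            w0 += eta * (target - y) * (-1)
            errors += 1
    print(f"after epoch {epoch}: W1={w1:.2f}, W2={w2:.2f}, W0={w0:.2f}, errors={errors}")
    if errors == 0:                              # stop once every example is classified correctly
        break

# The learned boundary is W1*x1 + W2*x2 = W0, i.e. 0.52*x1 + 0.72*x2 = 0.52 for this run.
```

Running this reproduces the weight sequence shown in epochs 1 through 4 above and stops in the fifth pass, where all four examples are classified correctly.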
ANN Training Example

[Plot: in the x1–x2 plane the positive examples (x2 = 1) lie above the learned separating line
0.52·x1 + 0.72·x2 = 0.52, and the negative examples (x2 = 0) lie below it.]

55
Logical Functions
• McCulloch and Pitts: Boolean functions can be implemented with an
artificial neuron (but not XOR).
[Neuron weights (inputs a1, a2; threshold weight W0 on the constant input a0):
  AND: W0 = 1.5, W1 = 1, W2 = 1
  OR:  W0 = 0.5, W1 = 1, W2 = 1
  NOT: W0 = −0.5, W1 = −1]

AND Function        OR Function         NOT Function
A  B  Output        A  B  Output        A  Output
0  0  0             0  0  0             0  1
0  1  0             0  1  1             1  0
1  0  0             1  0  1
1  1  1             1  1  1
56
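A minimal sketch showing that these weights really implement the three Boolean functions (plain Python; the output is computed as step(W1·a1 + W2·a2 − W0)):

```python
def step(net):
    return 1 if net > 0 else 0

def and_neuron(a1, a2):
    return step(1 * a1 + 1 * a2 - 1.5)    # W1 = 1, W2 = 1, threshold W0 = 1.5

def or_neuron(a1, a2):
    return step(1 * a1 + 1 * a2 - 0.5)    # W1 = 1, W2 = 1, threshold W0 = 0.5

def not_neuron(a1):
    return step(-1 * a1 + 0.5)            # W1 = -1, threshold W0 = -0.5

for a1 in (0, 1):
    for a2 in (0, 1):
        print(a1, a2, "AND:", and_neuron(a1, a2), "OR:", or_neuron(a1, a2))
print("NOT 0:", not_neuron(0), "NOT 1:", not_neuron(1))
```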
Training Perceptrons
[Perceptron: inputs −1 (bias), x and y, each connected through an unknown weight W = ? to the output node.]

For AND:
A  B  Output
0  0  0
0  1  0
1  0  0
1  1  1

• Initialize with random weight values. What are the


weight values?
• Use the activation function:

• By updating the weights find the equation and draw


the separating line?
57
Exercise: Training Perceptrons
For AND:
A  B  Output
0  0  0
0  1  0
1  0  0
1  1  1

[Perceptron: input −1 (bias) with weight W = 0.3, input x with weight W = 0.5, and input y with weight W = −0.4, feeding the output node.]

58
Pros and Cons of Neural Network
• Useful for learning complex data like handwriting, speech
and image recognition
Pros
+ Can learn more complicated class boundaries
+ Fast application
+ Can handle a large number of features

Cons
– Slow training time
– Hard to interpret & understand the learned function (weights)
– Hard to implement: trial & error for choosing the number of nodes
A Neural Network needs a long time for training, but it has a high tolerance to noisy and
incomplete data.
Conclusion: Use neural nets only if decision trees fail.
59
