Module 3
Introduction
A mining model is created by applying an algorithm to data, but it is more than an algorithm or a metadata container: it is a set of data, statistics, and patterns that can be applied to new data to generate predictions and make inferences about relationships. This module will introduce the application of data mining tasks under predictive and descriptive models.
Learning Activities (to include Content/Discussion of the Topic)
MODELING
Predictive modeling is the process by which a model is created to predict an outcome. If the outcome is categorical, it is called classification; if the outcome is numerical, it is called regression. Descriptive modeling, or clustering, is the assignment of observations to clusters so that observations in the same cluster are similar. Finally, association rules can find interesting associations amongst observations.
3.1. CLASSIFICATION
1. Frequency Table
o ZeroR
o OneR
o Naive Bayesian
o Decision Tree
2. Covariance Matrix
o Linear Discriminant Analysis
o Logistic Regression
3. Similarity Functions
o K Nearest Neighbors
4. Others
o Artificial Neural Network
o Support Vector Machine
3.1.1 ZeroR
ZeroR is the simplest classification method; it relies only on the target and ignores all predictors. A ZeroR classifier simply predicts the majority category (class). Although there is no predictive power in ZeroR, it is useful for determining a baseline performance as a benchmark for other classification methods.
Algorithm
Construct a frequency table for the target and select its most frequent value.
Example:
"Play Golf = Yes" is the ZeroR model for the following dataset with an accuracy of
0.64.
Predictors Contribution
There is nothing to be said about the predictors' contribution to the model because ZeroR does not use any of them.
Model Evaluation
The following confusion matrix shows that ZeroR only predicts the majority class
correctly. As mentioned before, ZeroR is only useful for determining a baseline
performance for other classification methods.
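The whole procedure fits in a few lines. Below is a minimal Python sketch (not the Weka implementation) that builds the frequency table for the target of the weather_nominal dataset listed below and reports the majority class and the resulting baseline accuracy.

```python
from collections import Counter

# Target column (Play golf) of the weather_nominal dataset
play_golf = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
             "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

# ZeroR: build a frequency table for the target and pick its most frequent value
freq = Counter(play_golf)
majority_class, majority_count = freq.most_common(1)[0]

baseline_accuracy = majority_count / len(play_golf)
print(f"ZeroR model: always predict '{majority_class}'")
print(f"Baseline accuracy: {baseline_accuracy:.2f}")   # 9/14 = 0.64
```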
ZeroR - Exercise
1. Open "Weka".
2. Click on "Open file ..." and load the dataset (weather_nominal.csv).
3. Click on the "Classify" tab and choose "ZeroR".
4. Select "Play Golf" as the target from the list box and click on "Start".
5. Check the "Classifier output" pane for the result.
Weather_nominal dataset
Outlook,Temperature,Humidity,Windy,Play golf
Rainy,Hot,High,FALSE,No
Rainy,Hot,High,TRUE,No
Overcast,Hot,High,FALSE,Yes
Sunny,Mild,High,FALSE,Yes
Sunny,Cool,Normal,FALSE,Yes
Sunny,Cool,Normal,TRUE,No
Overcast,Cool,Normal,TRUE,Yes
Rainy,Mild,High,FALSE,No
Rainy,Cool,Normal,FALSE,Yes
Sunny,Mild,Normal,FALSE,Yes
Rainy,Mild,Normal,TRUE,Yes
Overcast,Mild,High,TRUE,Yes
Overcast,Hot,Normal,FALSE,Yes
Sunny,Mild,High,TRUE,No
3.1.2 OneR
OneR, short for "One Rule", is a simple, yet accurate, classification algorithm
that generates one rule for each predictor in the data, then selects the rule with
the smallest total error as its "one rule". To create a rule for a predictor, we
construct a frequency table for each predictor against the target. It has been
shown that OneR produces rules only slightly less accurate than state-of-the-
art classification algorithms while producing rules that are simple for humans
to interpret.
OneR Algorithm
Example:
Finding the best predictor with the smallest total error using OneR algorithm
based on related frequency tables.
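As an illustration of the algorithm (a sketch, not the Weka implementation), the following Python code builds a frequency table for each predictor of the weather_nominal dataset listed earlier, derives one rule per predictor, and selects the predictor with the smallest total error.

```python
from collections import Counter

# weather_nominal dataset: (Outlook, Temperature, Humidity, Windy) -> Play golf
data = [
    ("Rainy", "Hot", "High", "FALSE", "No"),    ("Rainy", "Hot", "High", "TRUE", "No"),
    ("Overcast", "Hot", "High", "FALSE", "Yes"), ("Sunny", "Mild", "High", "FALSE", "Yes"),
    ("Sunny", "Cool", "Normal", "FALSE", "Yes"), ("Sunny", "Cool", "Normal", "TRUE", "No"),
    ("Overcast", "Cool", "Normal", "TRUE", "Yes"), ("Rainy", "Mild", "High", "FALSE", "No"),
    ("Rainy", "Cool", "Normal", "FALSE", "Yes"), ("Sunny", "Mild", "Normal", "FALSE", "Yes"),
    ("Rainy", "Mild", "Normal", "TRUE", "Yes"),  ("Overcast", "Mild", "High", "TRUE", "Yes"),
    ("Overcast", "Hot", "Normal", "FALSE", "Yes"), ("Sunny", "Mild", "High", "TRUE", "No"),
]
predictors = ["Outlook", "Temperature", "Humidity", "Windy"]

best = None
for i, name in enumerate(predictors):
    # Frequency table: predictor value -> class counts
    table = {}
    for row in data:
        table.setdefault(row[i], Counter())[row[-1]] += 1
    # Rule: each value predicts its majority class; errors are the remaining counts
    errors = sum(sum(c.values()) - max(c.values()) for c in table.values())
    rule = {value: counts.most_common(1)[0][0] for value, counts in table.items()}
    print(f"{name}: total error = {errors}/{len(data)}, rule = {rule}")
    if best is None or errors < best[1]:
        best = (name, errors, rule)

print("OneR chooses:", best[0], "with rule", best[2])
```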
Predictors Contribution
Simply put, the total error calculated from the frequency tables is the measure of each predictor's contribution. A lower total error means a higher contribution to the predictability of the model.
Model Evaluation
The following confusion matrix shows significant predictive power. OneR does not generate a score or probability, which means evaluation charts (Gain, Lift, K-S and ROC) are not applicable.
OneR - Exercise
1. Open "Weka".
2. Click on "Open file ..." and load the dataset (weather_nominal.csv).
3. Click on the "Classify" tab and choose "OneR".
4. Select "Play Golf" as the target from the list box and click on "Start".
5. Check the "Classifier output" pane for the result.
Naive Bayesian
Algorithm
Example:
Add 1 to the count for every attribute value-class combination (Laplace estimator)
when an attribute value (Outlook=Overcast) doesn’t occur with every class value
(Play Golf=no).
Numerical Predictors
The probability density function for the normal distribution is defined by two
parameters (mean and standard deviation).
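The two ingredients described above can be sketched in a few lines of Python. This is only an illustration (not the Orange or Weka implementation); the Outlook values for the "No" class are taken from the weather_nominal data listed earlier.

```python
import math
from collections import Counter

def laplace_prob(value, values_in_class, all_values):
    """P(attribute=value | class) with add-one (Laplace) smoothing."""
    counts = Counter(values_in_class)
    return (counts[value] + 1) / (len(values_in_class) + len(all_values))

def gaussian_prob(x, mean, std):
    """Likelihood of a numerical predictor under a normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

# P(Outlook=Overcast | Play Golf=No) never becomes zero with smoothing, even though
# Overcast does not occur together with "No" in the weather data.
outlook_given_no = ["Rainy", "Rainy", "Rainy", "Sunny", "Sunny"]
print(laplace_prob("Overcast", outlook_given_no, {"Rainy", "Overcast", "Sunny"}))  # 1/8
```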
Example:
Predictors Contribution
Kononenko's information gain as a sum of information contributed by each
attribute can offer an explanation on how values of the predictors influence the
class probability.
Open "Orange".
Drag and drop "File" widget and double click to load a dataset
(credit_scoring.txt).
Drag and drop "Select Attributes" widget and connect it to the "File" widget.
Open "Select Attributes" and set the target (class) and predictors
(attributes).
Drag and drop "Naive Bayes" widget and connect it to the "Select Attributes"
widget.
Drag and drop "Test Learners" widget and connect it to the "Naive Bayes"
and the "Select Attributes" widget.
Drag and drop "Confusion Matrix", "Lift Curve" and "ROC Analysis" widgets
and connect it to the "Test Learners" widget.
Credit_scoring Dataset
326 3 N
12 4 N
12 4 N
38 4 N
200 4 N
302 4 N
30 5 N
44 5 N
131 5 N
23 7 N
152 7 N
89 10 N
24 12 N
41 12 N
151 14 N
25 17 N
48 18 N
12 19 N
267 20 N
18 22 N
43 25 N
15 28 N
60 28 N
86 28 N
12 29 N
77 29 N
38 30 N
46 30 N
64 30 N
77 30 N
132 30 N
150 30 N
12 31 N
24 31 N
12 32 N
27 32 N
36 34 N
26 46 N
26 46 N
116 46 N
88 53 N
27 60 N
84 60 N
162 60 N
277 60 N
51 66 N
12 11 N
100 26 Y
78 27 Y
61 48 Y
343 28 Y
24 49 Y
43 53 Y
417 13 Y
275 54 Y
42 57 Y
26 58 Y
42 59 Y
47 59 Y
170 59 Y
12 60 Y
49 62 Y
36 64 Y
48 65 Y
17 66 Y
85 69 Y
350 39 Y
207 70 Y
229 70 Y
65 72 Y
37 76 Y
73 80 Y
Confusion Matrix
https://orange3.readthedocs.io/projects/orange-visual-programming/en/latest/widgets/model/naivebayes.html
https://www.solver.com/xlminer/help/classification-using-naive-bayes-example
https://orangedatamining.com/widget-catalog/evaluate/liftcurve/
Algorithm
The core algorithm for building decision trees called ID3 by J. R. Quinlan which
employs a top-down, greedy search through the space of possible branches with
no backtracking. ID3 uses Entropy and Information Gain to construct a decision
tree.
Entropy
A decision tree is built top-down from a root node and involves partitioning the
data into subsets that contain instances with similar values (homogenous). ID3
algorithm uses entropy to calculate the homogeneity of a sample. If the sample is
completely homogeneous the entropy is zero, and if the sample is equally divided it has an entropy of one.
To build a decision tree, we need to calculate two types of entropy using frequency
tables as follows:
Information Gain
The information gain is based on the decrease in entropy after a dataset is split on
an attribute. Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches).
Step 2: The dataset is then split on the different attributes. The entropy for each
branch is calculated. Then it is added proportionally, to get total entropy for the
split. The resulting entropy is subtracted from the entropy before the split. The
result is the Information Gain, or decrease in entropy.
Step 3: Choose attribute with the largest information gain as the decision node.
Step 4b: A branch with entropy more than 0 needs further splitting.
Step 5: The ID3 algorithm is run recursively on the non-leaf branches, until all data
is classified.
A decision tree can easily be transformed to a set of rules by mapping from the
root node to the leaf nodes one by one.
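The two quantities ID3 relies on can be sketched directly from the definitions above. The following Python snippet is illustrative only; the usage example reproduces the entropy of the weather target (9 "Yes" and 5 "No"), which is approximately 0.94.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels: -sum(p * log2(p))."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attr_index, target_index=-1):
    """Entropy before the split minus the weighted entropy of each branch."""
    target = [r[target_index] for r in rows]
    branches = {}
    for r in rows:
        branches.setdefault(r[attr_index], []).append(r[target_index])
    split_entropy = sum(len(b) / len(rows) * entropy(b) for b in branches.values())
    return entropy(target) - split_entropy

print(round(entropy(["Yes"] * 9 + ["No"] * 5), 2))   # 0.94 for the weather target
```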
Open "Orange".
Drag and drop "File" widget and double click to load a dataset
(credit_scoring.txt).
Drag and drop "Select Attributes" widget and connect it to the "File"
widget.
Open "Select Attributes" and set the target (class) and predictors
(attributes).
Drag and drop "Classification Tree" widget and connect it to the "Select
Attributes" widget.
Drag and drop "Test Learners" widget and connect it to the "Classification
Tree" and the "Select Attributes" widget.
Drag and drop "Confusion Matrix", "Lift Curve" and "ROC Analysis"
widgets and connect it to the "Test Learners" widget.
Confusion Matrix
Overfitting is a significant practical difficulty for decision tree models and many
other predictive models. Overfitting happens when the learning algorithm
continues to develop hypotheses that reduce training set error at the cost of an
increased test set error. There are several approaches to avoiding overfitting in
building decision trees.
Pre-pruning, which stops growing the tree earlier, before it perfectly classifies
the training set.
Post-pruning, which allows the tree to perfectly classify the training set and
then prunes it.
1. Use a distinct dataset from the training set (called validation set), to
evaluate the effect of post-pruning nodes from the tree.
2. Build the tree by using the training set, then apply a statistical test to
estimate whether pruning or expanding a particular node is likely to
produce an improvement beyond the training set.
o Error estimation
o Significance testing (e.g., Chi-square test)
3. Minimum Description Length principle: Use an explicit measure of the
complexity for encoding the training set and the decision tree, stopping
growth of the tree when this encoding size (size(tree) +
size(misclassifications(tree))) is minimized.
The first method is the most common approach. In this approach, the available
data are separated into two sets of examples: a training set, which is used to build
the decision tree, and a validation set, which is used to evaluate the impact of
pruning the tree. The second method is also a common approach. Here, we explain
the error estimation and Chi2 test.
The error rate at the parent node is 0.46 and since the error rate for its children
(0.51) increases with the split, we do not want to keep the children.
In the Chi2 test we construct the corresponding frequency table and calculate the Chi2 value and its probability.
Bad 4 1 4
Good 2 1 2
If we require the probability to be less than a limit (e.g., 0.05), then we decide not to split the node.
The information gain equation, G(T,X), is biased toward attributes that have a large number of values over attributes that have a smaller number of values. These 'Super Attributes' will easily be selected as the root, resulting in a broad tree that classifies perfectly on the training data but performs poorly on unseen instances. We can penalize attributes with large numbers of values by using an alternative method for attribute selection, referred to as Gain Ratio.
Example:
The following example shows a frequency table between the target (Play Golf) and
the ID attribute which has a unique value for each record of the dataset.
The information gain for ID is maximum (0.94) without using the split information.
However, with the adjustment the information gain dropped to 0.25.
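The adjustment can be reproduced with a few lines of Python (an illustrative sketch). For the ID attribute, the split information of 14 single-record branches is log2(14) ≈ 3.81, so the gain of 0.94 falls to roughly 0.25, as stated above.

```python
import math

def split_information(branch_sizes):
    """Penalty term: entropy of the split itself."""
    total = sum(branch_sizes)
    return -sum((n / total) * math.log2(n / total) for n in branch_sizes)

def gain_ratio(information_gain, branch_sizes):
    return information_gain / split_information(branch_sizes)

# The ID attribute splits the 14 records into 14 single-record branches:
print(round(split_information([1] * 14), 2))   # log2(14) = 3.81
print(round(gain_ratio(0.94, [1] * 14), 2))    # = 0.25, as in the example above
```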
Algorithm
LDA is based upon the concept of searching for a linear combination of variables
(predictors) that best separates two classes (targets). To capture the notion of
separability, Fisher defined the following score function.
Given the score function, the problem is to estimate the linear coefficients that
maximize the score which can be solved by the following equations.
Example:
Suppose we received a dataset from a bank regarding its small business clients
who defaulted (red square) and those that did not (blue circle) separated by
delinquent days (DAYSDELQ) and number of months in business (BUSAGE). We use
LDA to find an optimal linear model that best separates two classes (default and
non-default).
The first step is to calculate the mean (average) vectors, covariance matrices and
class probabilities.
Then, we calculate pooled covariance matrix and finally the coefficients of the
linear model.
A Mahalanobis distance of 2.32 shows a small overlap between two groups which
means a good separation between classes by the linear model.
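The computation described above (class means, pooled covariance matrix, linear coefficients, and Mahalanobis distance between the group means) can be sketched in numpy. The few rows used here are taken from the credit_scoring data listed earlier purely for illustration; the column meanings (BUSAGE vs. DAYSDELQ) are not asserted.

```python
import numpy as np

# First five rows of each class from the credit_scoring data (two predictors each)
X_default     = np.array([[100., 26.], [78., 27.], [61., 48.], [343., 28.], [24., 49.]])
X_non_default = np.array([[326., 3.], [12., 4.], [12., 4.], [38., 4.], [200., 4.]])

mu1, mu2 = X_default.mean(axis=0), X_non_default.mean(axis=0)
n1, n2 = len(X_default), len(X_non_default)

# Pooled (within-class) covariance matrix
C1 = np.cov(X_default, rowvar=False)
C2 = np.cov(X_non_default, rowvar=False)
C_pooled = ((n1 - 1) * C1 + (n2 - 1) * C2) / (n1 + n2 - 2)

# Coefficients of the linear discriminant and Mahalanobis distance between the means
w = np.linalg.solve(C_pooled, mu1 - mu2)
mahalanobis = np.sqrt((mu1 - mu2) @ w)
print("coefficients:", np.round(w, 4))
print("Mahalanobis distance between groups:", round(float(mahalanobis), 2))
```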
Predictors Contribution
A simple linear correlation between the model scores and predictors can be used to test which predictors contribute significantly to the discriminant function. Correlation varies from -1 to 1, with -1 and 1 meaning the highest contribution but in different directions, and 0 meaning no contribution at all.
where Ck is the covariance matrix for the class k (-1 means inverse matrix), |Ck| is
the determinant of the covariance matrix Ck, and P(ck) is the prior probability of the
class k. The classification rule is simply to find the class with highest Z value.
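The discriminant score Z referred to above can be written in the following standard form (the original figure is not reproduced here; the mean vector of class k, denoted μ_k, is an assumption consistent with the description):

$$Z_k(x) \;=\; \ln P(c_k) \;-\; \tfrac{1}{2}\ln\left|C_k\right| \;-\; \tfrac{1}{2}\,(x-\mu_k)^{\mathsf{T}}\, C_k^{-1}\,(x-\mu_k)$$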
A linear regression will predict values outside the acceptable range (e.g., predicting probabilities outside the range 0 to 1).
Since dichotomous experiments can only have one of two possible values for each experiment, the residuals will not be normally distributed about the predicted line.
On the other hand, a logistic regression produces a logistic curve, which is limited
to values between 0 and 1. Logistic regression is similar to a linear regression, but
the curve is constructed using the natural logarithm of the “odds” of the target
variable, rather than the probability. Moreover, the predictors do not have to be
normally distributed or have equal variance in each group.
In the logistic regression the constant (b0) moves the curve left and right and the
slope (b1) defines the steepness of the curve. By simple transformation, the logistic
regression equation can be written in terms of an odds ratio.
Finally, taking the natural log of both sides, we can write the equation in terms of
log-odds (logit) which is a linear function of the predictors. The coefficient (b1) is
the amount the logit (log-odds) changes with a one unit change in x.
There are several analogies between linear regression and logistic regression. Just
as ordinary least square regression is the method used to estimate coefficients for
the best fit line in linear regression, logistic regression uses maximum likelihood
estimation (MLE) to obtain the model coefficients that relate predictors to the
target. After this initial function is estimated, the process is repeated until LL (Log
Likelihood) does not change significantly.
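The logistic curve and a simple maximum-likelihood fit can be sketched as follows. This is an illustrative gradient-ascent sketch, not the procedure used by any particular package; the data, learning rate, and iteration count are made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.01, n_iter=5000):
    """Simple maximum-likelihood fit by gradient ascent on the log-likelihood."""
    X = np.column_stack([np.ones(len(X)), X])        # add intercept b0
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ b)                           # predicted probabilities
        b += lr * X.T @ (y - p)                      # gradient of the log-likelihood
    return b

# Tiny illustrative data: one predictor x, binary target y
x = np.array([1., 2., 3., 4., 5., 6., 7., 8.])
y = np.array([0,  0,  1,  0,  1,  0,  1,  1])
b = fit_logistic(x.reshape(-1, 1), y)
print("b0, b1 =", np.round(b, 3))
print("odds ratio exp(b1) =", round(float(np.exp(b[1])), 3))
```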
Pseudo R2
There are several measures intended to mimic the R2 analysis to evaluate the
goodness-of-fit of logistic models, but they cannot be interpreted as one would
interpret an R2, and different pseudo R2 measures can arrive at very different values. Here we discuss three pseudo R2 measures.
The likelihood ratio test provides the means for comparing the likelihood of the
data under one model (e.g., full model) against the likelihood of the data under
another, more restricted model (e.g., intercept model).
where 'p' is the logistic model predicted probability. The next step is to calculate
the difference between these two log-likelihoods.
Wald test
A Wald test is used to evaluate the statistical significance of each coefficient (b) in
the model.
where W is the Wald's statistic with a normal distribution (like Z-test), b is the
coefficient and SE is its standard error. The W value is then squared, yielding a
Wald statistic with a chi-square distribution.
Predictors Contributions
The Wald test is usually used to assess the significance of prediction of each
predictor. Another indicator of contribution of a predictor is exp(b) or odds-ratio of
coefficient which is the amount the logit (log-odds) changes, with a one unit
change in the predictor (x).
Exercise
Open "Orange".
Drag and drop "File" widget and double click to load a dataset
(credit_scoring.txt).
Drag and drop "Select Attributes" widget and connect it to the "File"
widget.
Open "Select Attributes" and set the target (class) and predictors
(attributes).
Drag and drop "Logistic Regression" widget and connect it to the "Select
Attributes" widget.
Drag and drop "Test Learners" widget and connect it to the "Logistic
Regression" and the "Select Attributes" widget.
Drag and drop "Confusion Matrix", "Lift Curve" and "ROC Analysis"
widgets and connect it to the "Test Learners" widget.
Confusion Matrix
Algorithm
A case is classified by a majority vote of its neighbors, with the case being assigned
to the class most common amongst its K nearest neighbors measured by a distance
function. If K = 1, then the case is simply assigned to the class of its nearest
neighbor.
It should also be noted that all three distance measures are only valid for
continuous variables. In the instance of categorical variables the Hamming distance
must be used. It also brings up the issue of standardization of the numerical
variables between 0 and 1 when there is a mixture of numerical and categorical
variables in the dataset.
Choosing the optimal value for K is best done by first inspecting the data. In
general, a large K value is more precise as it reduces the overall noise but there is
no guarantee. Cross-validation is another way to retrospectively determine a good
K value by using an independent dataset to validate the K value. Historically, the optimal K for most datasets has been between 3 and 10, which produces much better results than 1NN.
Example:
Consider the following data concerning credit default. Age and Loan are two
numerical variables (predictors) and Default is the target.
We can now use the training set to classify an unknown case (Age=48 and
Loan=$142,000) using Euclidean distance. If K=1 then the nearest neighbor is the
last case in the training set with Default=Y.
With K=3, there are two Default=Y and one Default=N out of three closest
neighbors. The prediction for the unknown case is again Default=Y.
Standardized Distance
One major drawback in calculating distance measures directly from the training set
is in the case where variables have different measurement scales or there is a
mixture of numerical and categorical variables. For example, if one variable is
based on annual income in dollars, and the other is based on age in years then
income will have a much higher influence on the distance calculated. One solution
is to standardize the training set as shown below.
Using the standardized distance on the same training set, the unknown case
returned a different neighbor which is not a good sign of robustness.
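The following Python sketch combines the two ideas above: min-max standardization and a majority vote over Euclidean distances. The training cases are hypothetical (the module's table is not reproduced here); the unknown case (Age=48, Loan=$142,000) is the one from the text.

```python
import numpy as np
from collections import Counter

def min_max_standardize(X):
    """Rescale each column to [0, 1] so large-scale variables (e.g., Loan in dollars)
    do not dominate small-scale ones (e.g., Age in years)."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins), mins, maxs

def knn_predict(X_train, y_train, x_new, k=3):
    """Majority vote among the k nearest neighbours (Euclidean distance)."""
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# Hypothetical (Age, Loan) training cases with a Default target
X = np.array([[25., 40000.], [35., 60000.], [45., 80000.], [20., 20000.],
              [60., 100000.], [55., 140000.], [33., 150000.], [52., 18000.]])
y = np.array(["N", "N", "N", "N", "Y", "Y", "Y", "N"])

X_std, mins, maxs = min_max_standardize(X)
x_new = (np.array([48., 142000.]) - mins) / (maxs - mins)   # unknown case from the text
print("K=3 prediction:", knn_predict(X_std, y, x_new, k=3))
```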
Open "Orange".
Drag and drop "File" widget and double click to load a dataset
(credit_scoring.txt).
Drag and drop "Select Attributes" widget and connect it to the "File"
widget.
Open "Select Attributes" and set the target (class) and predictors
(attributes).
Drag and drop "k Nearest Neighbours" widget and connect it to the
"Select Attributes" widget.
Drag and drop "Test Learners" widget and connect it to the "k Nearest
Neighbours" and the "Select Attributes" widget.
Drag and drop "Confusion Matrix", "Lift Curve" and "ROC Analysis"
widgets and connect it to the "Test Learners" widget.
Confusion Matrix
The input nodes take in information, in a form that can be numerically expressed. The information is presented as activation values, where each node is given a number: the higher the number, the greater the activation. This
information is then passed throughout the network. Based on the connection
strengths (weights), inhibition or excitation, and transfer functions, the activation
value is passed from node to node. Each of the nodes sums the activation values it
receives; it then modifies the value based on its transfer function. The activation
flows through the network, through hidden layers, until it reaches the output
nodes. The output nodes then reflect the input in a meaningful way to the outside
world.
The transfer function translates the input signals to output signals. Four types of
transfer functions are commonly used: unit step (threshold), sigmoid, piecewise linear, and Gaussian.
Sigmoid
The sigmoid family consists of two functions, logistic and tangential (hyperbolic tangent). The values of the logistic function range from 0 to 1, and from -1 to +1 for the tangential function.
Piecewise Linear
Gaussian
Gaussian functions are bell-shaped curves that are continuous. The node output (high/low) is interpreted in terms of class membership (1/0), depending on how close the net input is to a chosen value of the average.
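The four transfer functions described above can be sketched as follows. Thresholds, bounds, and the Gaussian parameters are illustrative defaults, not values prescribed by the module.

```python
import numpy as np

def unit_step(x, threshold=0.0):
    """Output is 1 if the total input exceeds the threshold, else 0."""
    return np.where(x > threshold, 1.0, 0.0)

def logistic(x):
    """Sigmoid (logistic): output between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-x))

def tangential(x):
    """Sigmoid (hyperbolic tangent): output between -1 and +1."""
    return np.tanh(x)

def piecewise_linear(x, lo=-1.0, hi=1.0):
    """Proportional to the total weighted input inside [lo, hi], clipped outside."""
    return np.clip(x, lo, hi)

def gaussian(x, mean=0.0, std=1.0):
    """Bell-shaped response, highest when the input is near the chosen mean."""
    return np.exp(-((x - mean) ** 2) / (2 * std ** 2))

x = np.linspace(-3, 3, 7)
for f in (unit_step, logistic, tangential, piecewise_linear, gaussian):
    print(f.__name__, np.round(f(x), 2))
```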
Algorithm
There are different types of neural networks, but they are generally classified into
feed-forward and feed-back networks.
A feed-back network has feed-back paths meaning they can have signals traveling
in both directions using loops. All possible connections between neurons are
allowed. Since loops are present in this type of network, it becomes a non-linear
dynamic system which changes continuously until it reaches a state of equilibrium.
Feed-back networks are often used in associative memories and optimization
problems where the network looks for the best arrangement of interconnected
factors.
Algorithm
The single layer perceptron does not have a priori knowledge, so the initial weights
are assigned randomly. SLP sums all the weighted inputs and if the sum is above
the threshold (some predetermined value), SLP is said to be activated (output=1).
The input values are presented to the perceptron, and if the predicted output is
the same as the desired output, then the performance is considered satisfactory
and no changes to the weights are made. However, if the output does not match
the desired output, then the weights need to be changed to reduce the error.
Because the SLP is a linear classifier, if the cases are not linearly separable the learning process will never reach a point where all the cases are classified properly. The most famous example of the perceptron's inability to solve problems with linearly non-separable cases is the XOR problem.
A multi-layer perceptron (MLP) has the same structure as a single layer perceptron with one or more hidden layers. The backpropagation algorithm consists of two phases: the forward phase, where the activations are propagated from the input to the output layer, and the backward phase, where the error between the observed actual value and the requested nominal (target) value in the output layer is propagated backwards in order to modify the weights and bias values.
Forward propagation:
Propagate inputs by adding all the weighted inputs and then computing outputs
using sigmoid threshold.
Backward propagation:
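The original equations for the two phases are not reproduced here. The following is a minimal numpy sketch of forward and backward propagation for a one-hidden-layer network with sigmoid activations; the toy XOR-style data, layer sizes, learning rate, and iteration count are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy XOR-style problem; the third input column acts as a constant bias input.
X = np.array([[0., 0., 1.], [0., 1., 1.], [1., 0., 1.], [1., 1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

rng = np.random.default_rng(1)
W1 = rng.uniform(-1, 1, size=(3, 4))   # input -> hidden weights
W2 = rng.uniform(-1, 1, size=(4, 1))   # hidden -> output weights

for _ in range(60000):
    # Forward phase: propagate activations from the input to the output layer
    hidden = sigmoid(X @ W1)
    output = sigmoid(hidden @ W2)

    # Backward phase: propagate the output error back and adjust the weights
    output_delta = (y - output) * output * (1 - output)
    hidden_delta = (output_delta @ W2.T) * hidden * (1 - hidden)
    W2 += hidden.T @ output_delta
    W1 += X.T @ hidden_delta

print(np.round(output, 2))   # predictions should approach [[0], [1], [1], [0]]
```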
1. Open "Weka".
2. Click on "Open file ..." and load the dataset (slump.csv).
3. Click on the "Classify" tab and choose "functions > MultilayerPeceptron".
4. Set the parameters before running the model.
5. Select "Compressive Strength" as the target from the list box and click on
"Start".
6. Check the output pane for the result.
You can change the model parameters before running the model.
Slump Dataset
Cement  Slag  Fly ash  Water  SP  Coarse Aggr.  Fine Aggr.  SLUMP (cm)  FLOW (cm)  Compressive Strength (28-day) (MPa)
273 82 105 210 9 904 680 23 62 34.99
163 149 191 180 12 843 746 0 20 41.14
162 148 191 179 16 840 743 1 20 41.81
162 148 190 179 19 838 741 3 21.5 42.08
154 112 144 220 10 923 658 20 64 26.82
147 89 115 202 9 860 829 23 55 25.21
152 139 178 168 18 944 695 0 20 38.86
145 0 227 240 6 750 853 14.5 58.5 36.59
152 0 237 204 6 785 892 15.5 51 32.71
304 0 140 214 6 895 722 19 51 38.46
145 106 136 208 10 751 883 24.5 61 26.02
148 109 139 193 7 768 902 23.75 58 28.03
142 130 167 215 6 735 836 25.5 67 31.37
354 0 0 234 6 959 691 17 54 33.91
374 0 0 190 7 1013 730 14.5 42.5 32.44
159 116 149 175 15 953 720 23.5 54.5 34.05
153 0 239 200 6 1002 684 12 35 28.29
295 106 136 206 11 750 766 25 68.5 41.01
310 0 143 168 10 914 804 20.5 48.2 49.3
296 97 0 219 9 932 685 15 48.5 29.23
305 100 0 196 10 959 705 20 49 29.77
310 0 143 218 10 787 804 13 46 36.19
148 180 0 183 11 972 757 0 20 18.52
146 178 0 192 11 961 749 18 46 17.19
142 130 167 174 11 883 785 0 20 36.72
140 128 164 183 12 871 775 23.75 53 33.38
308 111 142 217 10 783 686 25 70 42.08
295 106 136 208 6 871 650 26.5 70 39.4
298 107 137 201 6 878 655 16 26 41.27
314 0 161 207 6 851 757 21.5 64 41.14
321 0 164 190 5 870 774 24 60 45.82
349 0 178 230 6 785 721 20 68.5 43.95
366 0 187 191 7 824 757 24.75 62.7 52.65
274 89 115 202 9 759 827 26.5 68 35.52
137 167 214 226 6 708 757 27.5 70 34.45
275 99 127 184 13 810 790 25.75 64.5 43.54
252 76 97 194 8 835 821 23 54 33.11
165 150 0 182 12 1023 729 14.5 20 18.26
158 0 246 174 7 1035 706 19 43 34.99
RBF networks have three layers: input layer, hidden layer and output layer. One neuron in the input layer corresponds to each predictor variable. With respect to categorical variables, n-1 neurons are used, where n is the number of categories. The hidden layer has a variable number of neurons. Each neuron consists of a radial basis function centered on a point with the same dimensions as the predictor variables. The output layer computes a weighted sum of the outputs from the hidden layer to form the network outputs.
Algorithm
h(x) is the Gaussian activation function with the parameters r (the radius or
standard deviation) and c (the center or average taken from the input space)
defined separately at each RBF unit. The learning process is based on adjusting the
parameters of the network to reproduce a set of input-output patterns. There are
three types of parameters: the weight w between the hidden nodes and the output nodes, the center c of each neuron of the hidden layer, and the unit width r.
Any clustering algorithm can be used to determine the RBF unit centers (e.g., K-
means clustering). A set of clusters each with r-dimensional centers is determined
by the number of input variables or nodes of the input layer. The cluster centers
become the centers of the RBF units. The number of clusters, H, is a design
parameter and determines the number of nodes in the hidden layer. The K-means
clustering algorithm proceeds as follows:
When the RBF centers have been established, the width of each RBF unit can be calculated using the K-nearest neighbors algorithm. A number K is chosen, and for each center, the K nearest centers are found. The root-mean-squared distance between the current cluster center and its K nearest neighbors is calculated, and this is the value chosen for the unit width (r). So, if the current cluster center is cj, the r value is:
A typical value for K is 2, in which case r is set to be the average distance from the two nearest neighboring cluster centers.
Weights (w)
Using the linear mapping, the w vector is calculated from the output vector (y) and the design matrix H.
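The three steps (centers, widths, weights) can be sketched in numpy as follows. The 1-D sine data, the evenly spaced centers (used in place of K-means for brevity), and the number of units are all illustrative assumptions, not part of the module's dataset.

```python
import numpy as np

def rbf_design_matrix(X, centers, widths):
    """H[i, j] = Gaussian activation of RBF unit j for input i."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * widths ** 2))

# Illustrative 1-D regression data: y = sin(x) with a little noise
rng = np.random.default_rng(1)
X = np.linspace(0, 2 * np.pi, 40).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=40)

# 1) Centers: here simply evenly spaced (K-means clustering could be used instead)
centers = np.linspace(0, 2 * np.pi, 6).reshape(-1, 1)

# 2) Widths: root-mean-squared distance to the K=2 nearest other centers
dist = np.abs(centers - centers.T)
widths = np.array([np.sqrt(np.mean(np.sort(row)[1:3] ** 2)) for row in dist])

# 3) Weights: least-squares solution of H w = y
H = rbf_design_matrix(X, centers, widths)
w, *_ = np.linalg.lstsq(H, y, rcond=None)
print("fitted weights:", np.round(w, 2))
```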
1. Open "Weka".
2. Click on "Open file ..." and load the dataset (credit_scoring.csv).
3. Click on the "Classify" tab and choose "functions > RBFNetwork".
4. Set the parameters before running the model.
5. Select "DEFAULT" as the target from the list box and click on "Start".
6. Check the "Classifier output" pane for the result.
Algorithm
The beauty of SVM is that if the data is linearly separable, there is a unique global
minimum value. An ideal SVM analysis should produce a hyperplane that
completely separates the vectors (cases) into two non-overlapping classes.
However, perfect separation may not be possible, or it may result in a model with
so many cases that the model does not classify correctly. In this situation SVM finds
the hyperplane that maximizes the margin and minimizes the misclassifications.
The algorithm tries to keep the slack variables at zero while maximizing the margin. However, it does not minimize the number of misclassifications (an NP-complete problem) but the sum of distances from the margin hyperplanes.
The simplest way to separate two groups of data is with a straight line (1 dimension), flat plane (2 dimensions) or an N-dimensional hyperplane. However, there are situations where a nonlinear region can separate the groups more efficiently. SVM handles this by using a kernel function (nonlinear) to map the data into a different space where a hyperplane (linear) can be used to do the separation. It means a non-linear function is learned by a linear learning machine in a high-dimensional feature space, while the capacity of the system is controlled by a parameter that does not depend on the dimensionality of the space. This is called the kernel trick, which means the kernel function transforms the data into a higher dimensional feature space to make it possible to perform the linear separation.
Map data into new space, then take the inner product of the new vectors. The
image of the inner product of the data is the inner product of the images of the
data. Two kernel functions are shown below.
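The original figure with the two kernel functions is not reproduced here. As a hedged illustration, two commonly used kernels (polynomial and Gaussian/RBF) and the resulting kernel (Gram) matrix are sketched below; the parameter values are arbitrary.

```python
import numpy as np

def polynomial_kernel(x, z, degree=2, c=1.0):
    """K(x, z) = (x.z + c)^degree: an inner product in a higher-dimensional space."""
    return (np.dot(x, z) + c) ** degree

def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2): the Gaussian (radial basis) kernel."""
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(z)) ** 2))

# Kernel matrix: inner products of the images of the data in feature space,
# computed without ever mapping the data explicitly (the "kernel trick").
X = np.array([[0., 0.], [1., 1.], [2., 0.]])
K = np.array([[rbf_kernel(a, b) for b in X] for a in X])
print(np.round(K, 3))
```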
Open "Orange".
Drag and drop "File" widget and double click to load a dataset
(credit_scoring.txt).
Drag and drop "Select Attributes" widget and connect it to the "File"
widget.
Open "Select Attributes" and set the target (class) and predictors
(attributes).
Drag and drop "SVM" widget and connect it to the "Select Attributes"
widget.
Drag and drop "Test Learners" widget and connect it to the "SVM" and
the "Select Attributes" widget.
Drag and drop "Confusion Matrix", "Lift Curve" and "ROC Analysis"
widgets and connect it to the "Test Learners" widget.
Confusion Matrix
Activity # 1.
Completely perform the simulation process on classification using your dataset and compare the output of the algorithms using Weka or Orange for data analysis. Identify which algorithms perform best for classification.
3.2 REGRESSION
Regression is a data science task of predicting the value of target (numerical
variable) by building a model based on one or more predictors (numerical and
categorical variables).
1. Frequency Table
o Decision Tree
2. Covariance Matrix
o Multiple Linear Regression
3. Similarity Function
o K Nearest Neighbors
4. Others
o Artificial Neural Network
o Support Vector Machine
Standard Deviation
A decision tree is built top-down from a root node and involves partitioning
the data into subsets that contain instances with similar values (homogenous).
We use standard deviation to calculate the homogeneity of a numerical
sample. If the numerical sample is completely homogeneous its standard
deviation is zero.
Step 2: The dataset is then split on the different attributes. The standard deviation
for each branch is calculated. The resulting standard deviation is subtracted from
the standard deviation before the split. The result is the standard deviation
reduction.
Step 4a: The dataset is divided based on the values of the selected
attribute. This process is run recursively on the non-leaf branches,
until all data is processed.
Step 4b: "Overcast" subset does not need any further splitting because
its CV (8%) is less than the threshold (10%). The related leaf node gets
the average of the "Overcast" subset.
Step 4c: However, the "Sunny" branch has an CV (28%) more than the
threshold (10%) which needs further splitting. We select "Windy" as
the best best node after "Outlook" because it has the largest SDR.
Because the number of data points for both branches (FALSE and
TRUE) is three or fewer, we stop further branching and assign
the average of each branch to the related leaf node.
Step 4d: Moreover, the "rainy" branch has an CV (22%) which is more
than the threshold (10%). This branch needs further splitting. We select
"Temp" as the best best node because it has the largest SDR.
Because the number of data points for all three branches (Cool, Hot and
Mild) is three or fewer, we stop further branching and assign the
average of each branch to the related leaf node.
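The two quantities driving these steps, standard deviation reduction (SDR) and the coefficient of variation (CV) stopping criterion, can be sketched in a few lines. The numeric target values and the split used here are hypothetical, since the module's tables are not reproduced.

```python
import numpy as np

def sdr(target, branches):
    """Standard deviation reduction: SD before the split minus the weighted SD after."""
    target = np.asarray(target, dtype=float)
    weighted = sum(len(b) / len(target) * np.std(np.asarray(b, dtype=float))
                   for b in branches)
    return np.std(target) - weighted

def cv(values):
    """Coefficient of variation (%), compared against the stopping threshold."""
    values = np.asarray(values, dtype=float)
    return 100.0 * np.std(values) / np.mean(values)

# Hypothetical numeric target (e.g., hours played), split by one attribute's values
target = [25, 30, 46, 45, 52, 23, 43, 35, 38]
branches = [[25, 30, 23, 35], [46, 45, 52, 43, 38]]
print("SDR =", round(sdr(target, branches), 2))
print("branch CVs (%) =", [round(cv(b), 1) for b in branches])
```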
MLR is based on ordinary least squares (OLS): the model is fit such that the sum of squares of the differences between the observed and predicted values is minimized.
The MLR model is based on several assumptions (e.g., errors are normally
distributed with zero mean and constant variance). Provided the assumptions
are satisfied, the regression estimators are optimal in the sense that they
are unbiased, efficient, and consistent. Unbiased means that the expected
value of the estimator is equal to the true value of the parameter. Efficient
means that the estimator has a smaller variance than any other unbiased estimator.
Consistent means that the bias and variance of the estimator approach zero
as the sample size approaches infinity.
The advantage of the F-ratio over R2 is that the F-ratio incorporates sample
size and number of predictors in assessment of significance of the
regression model. A model can have a high R2 and still not be statistically
significant.
Example
MLR.xls
X1 X2 X3 Y
2 5 1 2
2 4 2 1
1 5 4 1
1 3 4 1
3 6 5 5
4 4 6 4
5 6 3 7
5 4 3 6
7 3 7 7
6 3 7 8
Multicollinearity
The diagonal values of the (X'X)-1 matrix are called Variance Inflation Factors (VIFs), and they are very useful measures of multicollinearity. If any VIF exceeds 5, multicollinearity is a problem.
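The OLS fit and the VIFs for the MLR.xls data listed above can be sketched as follows. The VIFs here are computed from the inverse of the correlation matrix of the predictors, which is the standardized form of the (X'X)-1 diagonal described above; this is an illustration, not the XLMiner or Weka output.

```python
import numpy as np

# MLR.xls data from the table above
X = np.array([[2, 5, 1], [2, 4, 2], [1, 5, 4], [1, 3, 4], [3, 6, 5],
              [4, 4, 6], [5, 6, 3], [5, 4, 3], [7, 3, 7], [6, 3, 7]], dtype=float)
y = np.array([2, 1, 1, 1, 5, 4, 7, 6, 7, 8], dtype=float)

# OLS coefficients: minimize the sum of squared differences between observed and
# predicted values. The design matrix includes a column of ones for the intercept.
A = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print("b0, b1, b2, b3 =", np.round(beta, 3))

# Variance Inflation Factors from the inverse correlation matrix of the predictors
R = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(R))
print("VIFs:", np.round(vif, 2))   # a VIF above 5 would signal multicollinearity
```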
Model Selection
MLR - Exercise
1. Open "Weka".
2. Click on "Open file ..." and load the dataset (mlr.csv).
3. Click on the "Classify" tab and choose "Linear Regression".
4. Set the parameters before running the model.
5. Select "Y" as the target from the list box and click on "Start".
6. Check the "Classifier output" pane for the result.
Algorithm
The above three distance measures are only valid for continuous
variables. In the case of categorical variables, you must use the
Hamming distance, which is a measure of the number of instances in
which corresponding symbols are different in two strings of equal
length.
Choosing the optimal value for K is best done by first inspecting the data. In
general, a large K value is more precise as it reduces the overall noise; however,
the compromise is that the distinct boundaries within the feature space are
blurred. Cross-validation is another way to retrospectively determine a good K
value by using an independent data set to validate your K value. The optimal K
for most datasets is 10 or more. That produces much better results than 1-NN.
Example:
We can now use the training set to predict the value of an unknown case (Age=33
and Loan=$150,000) using Euclidean distance. If K=1 then the nearest
neighbor is the last case in the training set with HPI=264.
Standardized Distance
One major drawback in calculating distance measures directly from
the training set is in the case where variables have different
measurement scales or there is a mixture of numerical and
categorical variables. For example, if one variable is based on annual
income in dollars, and the other is based on age in years then income
will have a much higher influence on the distance calculated. One
solution is to standardize the training set as shown below.
Evaluation Results
The input nodes take in information, in a form that can be numerically expressed. The information is presented as activation values, where each node is given a number: the higher the number, the greater the activation. This
information is then passed throughout the network. Based on the connection
strengths (weights), inhibition or excitation, and transfer functions, the activation
value is passed from node to node. Each of the nodes sums the activation values it
receives; it then modifies the value based on its transfer function. The activation
flows through the network, through hidden layers, until it reaches the output
nodes. The output nodes then reflect the input in a meaningful way to the outside
world. The difference between predicted value and actual value (error) will be
propagated backward by apportioning them to each node's weights according to
the amount of this error the node is responsible for (e.g., gradient descent
algorithm).
The output is set at one of two levels, depending on whether the total
input is greater than or less than some threshold value.
Sigmoid
The sigmoid family consists of two functions, logistic and tangential (hyperbolic tangent). The values of the logistic function range from 0 to 1, and from -1 to +1 for the tangential function.
Piecewise Linear
The output is proportional to the total weighted output.
Gaussian
Gaussian functions are bell-shaped curves that are continuous. The node
output (high/low) is interpreted in terms of class membership (1/0),
depending on how close the net input is to a chosen value of average.
Linear
Like a linear regression, a linear activation function transforms the
weighted sum inputs of the neuron to an output using a linear function.
Algorithm
There are different types of neural networks, but they are generally classified
into feed-forward and feed-back networks.
Algorithm
The single layer perceptron does not have a priori knowledge, so the initial
weights are assigned randomly. SLP sums all the weighted inputs and if the sum
is above the threshold (some predetermined value), SLP is said to be activated
(output=1).
The input values are presented to the perceptron, and if the predicted output is
the same as the desired output, then the performance is considered satisfactory
and no changes to the weights are made. However, if the output does not match
the desired output, then the weights need to be changed to reduce the error.
Because the SLP is a linear classifier, if the cases are not linearly separable the learning process will never reach a point where all the cases are classified properly. The most famous example of the perceptron's inability to solve problems with linearly non-separable cases is the XOR problem.
Forward propagation:
Propagate inputs by adding all the weighted inputs and then computing outputs
using sigmoid threshold.
Backward propagation:
Example:
A Neural Network in 11 lines of Python
Exercise
You can change the model parameters before running the model.
RBF networks have three layers: input layer, hidden layer and output layer. One neuron in the input layer corresponds to each predictor variable. With respect to categorical variables, n-1 neurons are used, where n is the number of categories. The hidden layer has a variable number of neurons. Each neuron consists of a radial basis function centered on a point with the same dimensions as the predictor variables. The output layer computes a weighted sum of the outputs from the hidden layer to form the network outputs.
Algorithm
h(x) is the Gaussian activation function with the parameters r (the radius or
standard deviation) and c (the center or average taken from the input space)
defined separately at each RBF unit. The learning process is based on adjusting the
parameters of the network to reproduce a set of input-output patterns. There are
three types of parameters: the weight w between the hidden nodes and the output nodes, the center c of each neuron of the hidden layer, and the unit width r.
Any clustering algorithm can be used to determine the RBF unit centers (e.g., K-
means clustering). A set of clusters each with r-dimensional centers is determined
by the number of input variables or nodes of the input layer. The cluster centers
become the centers of the RBF units. The number of clusters, H, is a design
parameter and determines the number of nodes in the hidden layer. The K-means
clustering algorithm proceeds as follows:
4. Repeat steps 2 and 3, until the cluster centers do not change during the
subsequent iterations.
When the RBF centers have been established, the width of each RBF unit can be calculated using the K-nearest neighbors algorithm. A number K is chosen, and for each center, the K nearest centers are found. The root-mean-squared distance between the current cluster center and its K nearest neighbors is calculated, and this is the value chosen for the unit width (r). So, if the current cluster center is cj, the r value is:
A typical value for K is 2, in which case r is set to be the average distance from the two nearest neighboring cluster centers.
Weights (w)
Using the linear mapping, the w vector is calculated from the output vector (y) and the design matrix H.
1. Open "Weka".
2. Click on "Open file ..." and load the dataset (credit_scoring.csv).
3. Click on the "Classify" tab and choose "functions > RBFNetwork".
4. Set the parameters before running the model.
5. Select "DEFAULT" as the target from the list box and click on "Start".
6. Check the "Classifier output" pane for the result.
You can change the model parameters before running the model.
Linear SVR
Non-linear SVR
The kernel functions transform the data into a higher dimensional feature
space to make it possible to perform the linear separation.
Kernel functions
You can change the model parameters before running the model.
Activity # 2:
Graphically present a visualization of regression on your dataset, utilizing different regression algorithms. Compare and interpret the performance of each algorithm.
3.3 Clustering
A cluster is a subset of data which are similar. Clustering (also called unsupervised
learning) is the process of dividing a dataset into groups such that the members
of each group are as similar (close) as possible to one another, and different
groups are as dissimilar (far) as possible from one another. Clustering can uncover
previously undetected relationships in a dataset. There are many applications for
cluster analysis. For example, in business, cluster analysis can be used to discover
and characterize customer segments for marketing purposes and in biology, it can
be used for classification of plants and animals given their features.
1. Hierarchical
o Agglomerative
o Divisive
2. Partitive
o K Means
o Self-Organizing Map
Divisive method
In the divisive or top-down clustering method we assign all of the observations to a single cluster and then partition the cluster into the two least similar clusters. Finally, we proceed recursively on each cluster until there is one cluster for each observation. There is evidence that divisive algorithms produce more accurate hierarchies than agglomerative algorithms in some circumstances, but they are conceptually more complex.
Agglomerative method
In agglomerative or bottom-up clustering method we assign each
observation to its own cluster. Then, compute the similarity (e.g., distance)
between each of the clusters and join the two most similar clusters.
Finally, repeat steps 2 and 3 until there is only a single cluster left. The
related algorithm is shown below.
Single Linkage
In single linkage hierarchical clustering, the distance between two clusters is defined as the shortest distance between two points in each cluster. For example, the distance between clusters "r" and "s" is equal to the distance between their two closest points.
Complete Linkage
In complete linkage hierarchical clustering, the distance between two clusters is defined as the longest distance between two points in each cluster. For example, the distance between clusters "r" and "s" is equal to the distance between their two furthest points.
Average Linkage
In average linkage hierarchical clustering, the distance between two clusters is defined as the average distance between every point of one cluster and every point of the other cluster.
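The three inter-cluster distances can be sketched directly from these definitions. The two small example clusters below are made up purely for illustration.

```python
import numpy as np

def pairwise_distances(cluster_r, cluster_s):
    """All Euclidean distances between points of cluster r and points of cluster s."""
    r, s = np.asarray(cluster_r, float), np.asarray(cluster_s, float)
    return np.sqrt(((r[:, None, :] - s[None, :, :]) ** 2).sum(axis=2))

def single_linkage(r, s):    # shortest distance between the two clusters
    return pairwise_distances(r, s).min()

def complete_linkage(r, s):  # longest distance between the two clusters
    return pairwise_distances(r, s).max()

def average_linkage(r, s):   # average distance over all pairs of points
    return pairwise_distances(r, s).mean()

# Illustrative clusters "r" and "s"
r = [[1, 1], [2, 1]]
s = [[5, 4], [6, 5], [7, 4]]
print(single_linkage(r, s), complete_linkage(r, s), average_linkage(r, s))
```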
Open "Orange".
Drag and drop "File" widget and double click to load a dataset (iris.txt).
Drag and drop "Select Attributes" widget and connect it to the "File"
widget.
Open "Select Attributes" and set the class (target) and attributes
(predictors).
Drag and drop "Example Distance" widget and connect it to the "Select
Attributes" widget.
Drag and drop "Hierarchical Clustering" widget and connect it to the
"Example Distance" widget.
Algorithm
Example:
Suppose we want to group the visitors to a website using just their age (one-
dimensional space) as follows:
n = 19
15,15,16,19,19,20,20,21,22,28,35,40,41,42,43,44,60,61,65
Initial clusters (random centroid or average):
k=2
c1 = 16
c2 = 22
Iteration 1:
c1 = 15.33
c2 = 36.25
xi   c1   c2   Distance 1   Distance 2   Nearest Cluster   New Centroid
15 16 22 1 7 1
15 16 22 1 7 1 15.33
16 16 22 0 6 1
19 16 22 9 3 2
19 16 22 9 3 2
20 16 22 16 2 2
20 16 22 16 2 2
21 16 22 25 1 2 36.25
22 16 22 36 0 2
28 16 22 12 6 2
35 16 22 19 13 2
40 16 22 24 18 2
41 16 22 25 19 2
42 16 22 26 20 2
43 16 22 27 21 2
44 16 22 28 22 2
60 16 22 44 38 2
61 16 22 45 39 2
65 16 22 49 43 2
Iteration 2:
c1 = 18.56
c2 = 45.90
xi   c1   c2   Distance 1   Distance 2   Nearest Cluster   New Centroid
15 15.33 36.25 0.33 21.25 1
15 15.33 36.25 0.33 21.25 1
16 15.33 36.25 0.67 20.25 1
19 15.33 36.25 3.67 17.25 1
19 15.33 36.25 3.67 17.25 1 18.56
20 15.33 36.25 4.67 16.25 1
20 15.33 36.25 4.67 16.25 1
21 15.33 36.25 5.67 15.25 1
22 15.33 36.25 6.67 14.25 1
28 15.33 36.25 12.67 8.25 2
35 15.33 36.25 19.67 1.25 2
40 15.33 36.25 24.67 3.75 2
41 15.33 36.25 25.67 4.75 2
42 15.33 36.25 26.67 5.75 2
45.9
43 15.33 36.25 27.67 6.75 2
44 15.33 36.25 28.67 7.75 2
60 15.33 36.25 44.67 23.75 2
61 15.33 36.25 45.67 24.75 2
65 15.33 36.25 49.67 28.75 2
Iteration 3:
c1 = 19.50
c2 = 47.89
xi   c1   c2   Distance 1   Distance 2   Nearest Cluster   New Centroid
15 18.56 45.9 3.56 30.9 1
15 18.56 45.9 3.56 30.9 1
16 18.56 45.9 2.56 29.9 1
19 18.56 45.9 0.44 26.9 1
19 18.56 45.9 0.44 26.9 1
19.50
20 18.56 45.9 1.44 25.9 1
20 18.56 45.9 1.44 25.9 1
21 18.56 45.9 2.44 24.9 1
22 18.56 45.9 3.44 23.9 1
28 18.56 45.9 9.44 17.9 1
35 18.56 45.9 16.44 10.9 2
40 18.56 45.9 21.44 5.9 2
41 18.56 45.9 22.44 4.9 2
42 18.56 45.9 23.44 3.9 2
43 18.56 45.9 24.44 2.9 2 47.89
44 18.56 45.9 25.44 1.9 2
60 18.56 45.9 41.44 14.1 2
61 18.56 45.9 42.44 15.1 2
65 18.56 45.9 46.44 19.1 2
Iteration 4:
c1 = 19.50
c2 = 47.89
xi   c1   c2   Distance 1   Distance 2   Nearest Cluster   New Centroid
15 19.5 47.89 4.50 32.89 1
15 19.5 47.89 4.50 32.89 1
16 19.5 47.89 3.50 31.89 1 19.50
19 19.5 47.89 0.50 28.89 1
19 19.5 47.89 0.50 28.89 1
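The worked example above can be reproduced with a short Python sketch: the 19 ages, two clusters, and the initial centroids 16 and 22, iterating until the centroids stop changing. It should converge to approximately 19.5 and 47.89, as in the tables above.

```python
import numpy as np

ages = np.array([15, 15, 16, 19, 19, 20, 20, 21, 22, 28,
                 35, 40, 41, 42, 43, 44, 60, 61, 65], dtype=float)
centroids = np.array([16.0, 22.0])          # initial centroids from the example

while True:
    # Assign each age to its nearest centroid (1-D Euclidean distance)
    labels = np.argmin(np.abs(ages[:, None] - centroids[None, :]), axis=1)
    # Recompute each centroid as the mean of its assigned points
    new_centroids = np.array([ages[labels == k].mean() for k in range(2)])
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print("final centroids:", np.round(centroids, 2))   # approx. [19.5, 47.89]
```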
Open "Orange".
Drag and drop "File" widget and double click to load a dataset
(credit_scoring.txt).
Drag and drop "Select Attributes" widget and connect it to the "File"
widget.
Open "Select Attributes" and set the class (target) and attributes
(predictors).
Drag and drop "k-Means Clustering" widget and connect it to the "Select
Attributes" widget.
Drag and drop "Scatterplot" widget and connect it to the "k-Means
Clustering" widget to view the plot with the class label.
Drag and drop "Data Tables" widget and connect it to the "k-Means
Clustering" widget to view the dataset with a new cluster column.
Scatter Plot
Algorithm
3- Calculating the Best Matching Unit (BMU). Each node is examined to find the one whose weights are most similar to the input vector. This unit is known as the Best Matching Unit (BMU) since its vector is most similar to the input vector. The selection is done with the Euclidean distance formula, which is a measure of similarity between two vectors. The distance between the input vector and the weights of each node is calculated in order to find the BMU.
4- Calculating the size of the neighborhood around the BMU. The size of the neighborhood around the BMU decreases with an exponential decay function. It shrinks on each iteration until it reaches just the BMU.
As training goes on, the neighborhood gradually shrinks. At the end of training,
the neighborhoods have shrunk to zero size.
The influence rate shows the amount of influence a node's distance from the BMU has on its learning. In the simplest form the influence rate is equal to 1 for all the nodes close to the BMU and zero for the others, but a Gaussian function is common too. Finally, from a random distribution of weights and through many iterations, the SOM is able to arrive at a map of stable zones. In the end, interpretation of the data is to be done by a human, but SOM is a great technique for presenting otherwise invisible patterns in the data.
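The training step described above (find the BMU, shrink the neighborhood, move nearby nodes toward the input) can be sketched as follows. The map size, decay schedule, learning rate, and random input data are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
grid_w, grid_h, dim = 5, 5, 2                 # 5x5 map of nodes, 2-D input vectors
weights = rng.random((grid_w, grid_h, dim))   # random initial node weights
coords = np.dstack(np.meshgrid(np.arange(grid_w), np.arange(grid_h), indexing="ij"))

X = rng.random((200, dim))                    # illustrative input vectors
n_iter, sigma0, lr0 = 1000, 2.0, 0.5

for t in range(n_iter):
    x = X[t % len(X)]
    # 1) Best Matching Unit: node whose weights are closest to the input vector
    dist = np.linalg.norm(weights - x, axis=2)
    bmu = np.unravel_index(np.argmin(dist), dist.shape)
    # 2) Neighbourhood size and learning rate shrink with exponential decay
    sigma = sigma0 * np.exp(-t / n_iter)
    lr = lr0 * np.exp(-t / n_iter)
    # 3) Gaussian influence of the BMU on each node, by distance on the grid
    grid_d2 = ((coords - np.array(bmu)) ** 2).sum(axis=2)
    influence = np.exp(-grid_d2 / (2 * sigma ** 2))
    # 4) Move node weights toward the input, scaled by influence and learning rate
    weights += lr * influence[..., None] * (x - weights)

print("trained weight map shape:", weights.shape)
```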
Open "Orange".
Drag and drop "File" widget and double click to load a dataset
(credit_scoring.txt).
Drag and drop "Select Attributes" widget and connect it to the "File"
widget.
Open "Select Attributes" and set the class (target) and attributes
( predictors).
Drag and drop "SOM" widget and connect it to the "Select Attributes"
widget.
Drag and drop "SOM Visualizer" widget and connect it to the "SOM"
widget to view the map.
SOM Map
Activity # 3:
Using your dataset, apply various clustering techniques to discover interesting patterns. What techniques can be used to measure the quality of the clusters formed after applying the clustering techniques?
3.4 ASSOCIATION RULE MINING
Example:
AIS Algorithm
SETM Algorithm
The SETM algorithm has the same disadvantage as the AIS algorithm. Another disadvantage is that for each candidate itemset, there are as many entries as its support value.
Apriori Algorithm
1. Candidate itemsets are generated using only the large itemsets of the
previous pass without considering the transactions in the database.
2. The large itemset of the previous pass is joined with itself to generate
all itemsets whose size is higher by 1.
3. Each generated itemset that has a subset which is not large is deleted.
The remaining itemsets are the candidate ones.
The Apriori algorithm takes advantage of the fact that any subset of a frequent itemset is also a frequent itemset. The algorithm can therefore reduce the number of candidates being considered by only exploring the itemsets whose support count is greater than the minimum support count. Any candidate itemset that has an infrequent subset can be pruned.
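The join, prune, and support-counting loop can be sketched in Python as follows. The tiny transaction database and the minimum support count are illustrative only.

```python
from itertools import combinations

transactions = [
    {"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C", "D"},
]
min_support = 3   # minimum support count (illustrative)

def support(itemset):
    return sum(itemset <= t for t in transactions)

# L1: frequent 1-itemsets
items = sorted({i for t in transactions for i in t})
frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]

k = 1
while frequent[-1]:
    prev = frequent[-1]
    # Join step: combine frequent k-itemsets to build candidates of size k+1
    candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
    # Prune step: drop candidates that have any infrequent k-item subset
    candidates = {c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, k))}
    # Support counting: keep only candidates meeting the minimum support count
    frequent.append({c for c in candidates if support(c) >= min_support})
    k += 1

for level, sets in enumerate(frequent, start=1):
    for s in sets:
        print(level, sorted(s), "support =", support(s))
```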
AprioriTid Algorithm
1. The database is not used at all for counting the support of candidate
itemsets after the first pass.
2. The candidate itemsets are generated the same way as in the Apriori algorithm.
3. Another set C’ is generated of which each member has the TID of each
transaction and the large itemsets present in this transaction. This set
is used to count the support of each candidate itemset.
The advantage is that the number of entries in C’ may be smaller than the
number of transactions in the database, especially in the later passes.
AprioriHybrid Algorithm
Apriori does better than AprioriTid in the earlier passes. However, AprioriTid
does better than Apriori in the later passes. Hence, a hybrid algorithm can be
designed that uses Apriori in the initial passes and switches to AprioriTid when
it expects that the set C’ will fit in memory.
Open "Orange".
Drag and drop "File" widget and double click to load a dataset
(contact_lenses.txt).
Drag and drop "Association Rules" widget and connect it to the "File"
widget.
Open "Association Rules" and set the support and confidence.
Drag and drop "Association Rules Filter" widget and connect it to the
"Association Rules" widget to view the extracted rules.
Drag and drop "Association Rules Explorer" widget and connect it to the
"Association Rules" widget to explore the extracted rules.
https://www.kdnuggets.com/2016/04/association-rules-apriori-algorithm-tutorial.html
Activity # 4.
Using your dataset produce a simulated output for mining frequent
patterns using the association rule mining methods.
1. The following table consists of training data from an employee database. The
data have been generalized. For example, “31 ... 35” for age represents the age
range of 31 to 35. For a given row entry, count represents the number of data
tuples having the values for department, status, age, and salary given in that row.
2. The following table shows the midterm and final exam grades obtained for
students in a database course.
Midterm exam (x)   Final exam (y)
72 84
50 63
81 77
74 78
94 90
86 75
59 49
83 79
65 77
33 52
88 74
81 90
3. Suppose that the data mining task is to cluster the following eight points (with
(x, y) representing location) into three clusters:
A1(2, 10), A2(2, 5), A3(8, 4), B1(5, 8), B2(7, 5), B3(6, 4), C1(1, 2), C2(4, 9).
The distance function is Euclidean distance. Suppose initially we assign A1, B1, and
C1 as the center of each cluster, respectively. Use the k-means algorithm to show
only
(a) The three cluster centers after the first round execution
(b) The final three clusters
4. A database has five transactions. Let min sup = 60% and min conf = 80%.
(a) Find all frequent itemsets using Apriori and FP-growth, respectively. Compare the efficiency of the two mining processes.
(b) List all of the strong association rules (with support s and confidence c) matching the following metarule, where X is a variable representing customers, and itemi denotes variables representing items (e.g., "A", "B", etc.):
Review of Concepts
Classification is a data mining function that assigns items in a collection to target categories or classes. The goal of classification is to accurately predict the target class for each case in the data. In machine learning, classification refers to a predictive modeling problem where a class label is predicted for a given example of input data. Classification constructs the classification model by using a training data set.
Association rule mining, at a basic level, involves the use of machine learning
models to analyze data for patterns, or co-occurrences, in a
database. Association rules are created by searching data for frequent if-then
patterns and using the criteria support and confidence to identify the most
important relationships.
References
Han, J., Kamber, M. and Pei, J. (2011). Data Mining: Concepts and Techniques, 3rd edition. Morgan Kaufmann.
Dunham, M.H. (2003). Data Mining Introductory and Advanced Topics. Pearson Education Inc., Upper Saddle River, New Jersey.