
SURIGAO STATE COLLEGE OF TECHNOLOGY
CS 325 – Data Mining (compiled by: DR. MONALEE A. DELA CERNA)

LEARNING MODULE NO. 3


Title MODELING

Predicting the Future


Data science predicts the future by means of modeling.

Topic 3.1 Classification
3.2 Regression
3.3 Clustering
3.4 Association

Time Frame 20 hrs.

Introduction A mining model is created by applying an algorithm to data, but it is more than
an algorithm or a metadata container: it is a set of data, statistics, and patterns
that can be applied to new data to generate predictions and make inferences
about relationships. This module will introduce the application of data mining
tasks under predictive and descriptive models.

Objectives In this module, learners will be able to:


1. Completely perform the simulation process on classification using the
statistical software for powerful data analysis.
2. Graphically present a scatter plot on regression analysis.
3. Utilize various techniques in clustering to discover interesting patterns in large
databases.
4. Produce a simulated output for mining frequent patterns using the association
rule mining method.


Learning Activities
(to include Content/Discussion of the Topic)

MODELING
Predictive modeling is the process by which a model is created to predict an
outcome. If the outcome is categorical it is called classification and if the outcome
is numerical it is called regression. Descriptive modeling or clustering is the
assignment of observations into clusters so that observations in the same cluster
are similar. Finally, association rules can find interesting associations amongst
observations.

3.1. CLASSIFICATION

Classification is a data mining task of predicting the value of a categorical variable


(target or class) by building a model based on one or more numerical and/or
categorical variables (predictors or attributes). Four main groups of classification
algorithms are:

1. Frequency Table
o ZeroR
o OneR
o Naive Bayesian
o Decision Tree
2. Covariance Matrix
o Linear Discriminant Analysis
o Logistic Regression

3. Similarity Functions
o K Nearest Neighbors

4. Others
o Artificial Neural Network
o Support Vector Machine


3.1.1 ZeroR
ZeroR is the simplest classification method: it relies only on the target and ignores all predictors. The ZeroR classifier simply predicts the majority category (class). Although ZeroR has no predictive power, it is useful for establishing a baseline performance as a benchmark for other classification methods.

Algorithm
Construct a frequency table for the target and select its most frequent value.

Example:
"Play Golf = Yes" is the ZeroR model for the following dataset with an accuracy of
0.64.
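
A minimal Python sketch of this computation is shown below, using the Play Golf column of the weather_nominal dataset listed in the exercise that follows. Python is used here only for illustration; the exercise itself is done in Weka.

from collections import Counter

# Play Golf column of the weather_nominal dataset listed in the exercise below
play_golf = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
             "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

counts = Counter(play_golf)                                      # {'Yes': 9, 'No': 5}
majority_class, majority_count = counts.most_common(1)[0]

print("ZeroR model: always predict", majority_class)             # Yes
print("Accuracy:", round(majority_count / len(play_golf), 2))    # 9/14 = 0.64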

Predictors Contribution

There is nothing to be said about the predictors' contribution to the model because ZeroR does not use any of them.

Model Evaluation

The following confusion matrix shows that ZeroR only predicts the majority class
correctly. As mentioned before, ZeroR is only useful for determining a baseline
performance for other classification methods.


ZeroR - Exercise

1. Open "Weka".
2. Click on "Open file ..." and load the dataset (weather_nominal.csv).
3. Click on the "Classify" tab and choose "ZeroR".
4. Select "Play Golf" as the target from the list box and click on "Start".
5. Check the "Classifier output" pane for the result.

Datasets of Weather_nominal

Outlook,Temperature,Humidity,Windy,Play golf
Rainy,Hot,High,FALSE,No
Rainy,Hot,High,TRUE,No
Overcast,Hot,High,FALSE,Yes
Sunny,Mild,High,FALSE,Yes
Sunny,Cool,Normal,FALSE,Yes
Sunny,Cool,Normal,TRUE,No
Overcast,Cool,Normal,TRUE,Yes
Rainy,Mild,High,FALSE,No
Rainy,Cool,Normal,FALSE,Yes
Sunny,Mild,Normal,FALSE,Yes
Rainy,Mild,Normal,TRUE,Yes
Overcast,Mild,High,TRUE,Yes
Overcast,Hot,Normal,FALSE,Yes
Sunny,Mild,High,TRUE,No


3.1.2 OneR
OneR, short for "One Rule", is a simple, yet accurate, classification algorithm
that generates one rule for each predictor in the data, then selects the rule with
the smallest total error as its "one rule". To create a rule for a predictor, we
construct a frequency table for each predictor against the target. It has been
shown that OneR produces rules only slightly less accurate than state-of-the-
art classification algorithms while producing rules that are simple for humans
to interpret.

OneR Algorithm

For each predictor:
    For each value of that predictor, make a rule as follows:
        Count how often each value of the target (class) appears.
        Find the most frequent class.
        Make the rule assign that class to this value of the predictor.
    Calculate the total error of the rules of each predictor.
Choose the predictor with the smallest total error.
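
The sketch below is a minimal Python illustration of this algorithm on the weather_nominal dataset listed in the ZeroR exercise above; it is an assumption of this sketch, not output copied from Weka.

from collections import Counter, defaultdict

# weather_nominal dataset (Outlook, Temperature, Humidity, Windy, Play golf)
rows = [
    ("Rainy", "Hot", "High", False, "No"),       ("Rainy", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),   ("Sunny", "Mild", "High", False, "Yes"),
    ("Sunny", "Cool", "Normal", False, "Yes"),   ("Sunny", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"), ("Rainy", "Mild", "High", False, "No"),
    ("Rainy", "Cool", "Normal", False, "Yes"),   ("Sunny", "Mild", "Normal", False, "Yes"),
    ("Rainy", "Mild", "Normal", True, "Yes"),    ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"), ("Sunny", "Mild", "High", True, "No"),
]
predictors = ["Outlook", "Temperature", "Humidity", "Windy"]

best = None
for i, name in enumerate(predictors):
    # frequency table: predictor value -> class counts
    table = defaultdict(Counter)
    for row in rows:
        table[row[i]][row[-1]] += 1
    # one rule per value: predict its most frequent class; everything else is an error
    rule = {value: counts.most_common(1)[0][0] for value, counts in table.items()}
    errors = sum(sum(c.values()) - c.most_common(1)[0][1] for c in table.values())
    print(name, "total error =", errors, "/", len(rows), rule)
    if best is None or errors < best[1]:
        best = (name, errors, rule)

print("OneR selects:", best[0], "with rules", best[2])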


Example:

Finding the best predictor with the smallest total error using the OneR algorithm, based on the related frequency tables.

The best predictor is:

Predictors Contribution

Simply put, the total error calculated from the frequency tables is the measure of each predictor's contribution. A low total error means a higher contribution to the predictability of the model.

Model Evaluation

The following confusion matrix shows significant predictability power. OneR does
not generate score or probability, which means evaluation charts (Gain, Lift, K-S
and ROC) are not applicable.


OneR - Exercise

1. Open "Weka".
2. Click on "Open file ..." and load the dataset (weather_nominal.csv).
3. Click on the "Classify" tab and choose "OneR".
4. Select "Play Golf" as the target from the list box and click on "Start".
5. Check the "Classifier output" pane for the result.



3.1.3 Naive Bayesian


The Naive Bayesian classifier is based on Bayes’ theorem with independence
assumptions between predictors. A Naive Bayesian model is easy to build, with no
complicated iterative parameter estimation which makes it particularly useful for
very large datasets. Despite its simplicity, the Naive Bayesian classifier often does
surprisingly well and is widely used because it often outperforms more
sophisticated classification methods.

Algorithm

Bayes' theorem provides a way of calculating the posterior probability, P(c|x), from P(c), P(x), and P(x|c): P(c|x) = P(x|c) P(c) / P(x). The Naive Bayes classifier assumes that the effect of the value of a predictor (x) on a given class (c) is independent of the values of the other predictors. This assumption is called class conditional independence.


 P(c|x) is the posterior probability of the class (target) given the predictor (attribute).
 P(c) is the prior probability of the class.
 P(x|c) is the likelihood, which is the probability of the predictor given the class.
 P(x) is the prior probability of the predictor.

Example:

The posterior probability can be calculated by first constructing a frequency table for each attribute against the target, then transforming the frequency tables into likelihood tables, and finally using the Naive Bayesian equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of the prediction.
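
As an illustration, the sketch below computes the unnormalized posterior scores for a hypothetical new day on the weather_nominal data used earlier. It does not yet apply the Laplace correction discussed next, so an unseen attribute value would give a zero probability.

from collections import Counter, defaultdict

# weather_nominal dataset (Outlook, Temperature, Humidity, Windy, Play golf)
rows = [
    ("Rainy", "Hot", "High", False, "No"),       ("Rainy", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),   ("Sunny", "Mild", "High", False, "Yes"),
    ("Sunny", "Cool", "Normal", False, "Yes"),   ("Sunny", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"), ("Rainy", "Mild", "High", False, "No"),
    ("Rainy", "Cool", "Normal", False, "Yes"),   ("Sunny", "Mild", "Normal", False, "Yes"),
    ("Rainy", "Mild", "Normal", True, "Yes"),    ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"), ("Sunny", "Mild", "High", True, "No"),
]
n_predictors = 4

class_counts = Counter(row[-1] for row in rows)          # prior frequencies of the classes
# frequency tables: predictor index -> class -> value -> count
freq = defaultdict(lambda: defaultdict(Counter))
for row in rows:
    for i in range(n_predictors):
        freq[i][row[-1]][row[i]] += 1

def posterior_scores(case):
    """Unnormalised posterior: P(c) times the product of P(x_i | c), for each class c."""
    scores = {}
    for c, n_c in class_counts.items():
        p = n_c / len(rows)                              # prior P(c)
        for i, value in enumerate(case):
            p *= freq[i][c][value] / n_c                 # likelihood P(x_i | c)
        scores[c] = p
    return scores

new_day = ("Rainy", "Mild", "Normal", True)              # hypothetical new case
scores = posterior_scores(new_day)
print(scores, "->", max(scores, key=scores.get))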

The zero-frequency problem

Add 1 to the count for every attribute value-class combination (the Laplace estimator) when an attribute value (e.g., Outlook=Overcast) does not occur with every class value (e.g., Play Golf=No).

Numerical Predictors

Numerical variables need to be transformed into their categorical counterparts (binning) before constructing their frequency tables. The other option is to use the distribution of the numerical variable to estimate the likelihood directly; for example, one common practice is to assume a normal distribution for numerical variables.

The probability density function of the normal distribution is defined by two parameters (mean and standard deviation).


Example:

Humidity values, mean and standard deviation for each class (Play Golf):

Play Golf = Yes: Humidity = 86, 96, 80, 65, 70, 80, 70, 90, 75   (Mean = 79.1, StDev = 10.2)
Play Golf = No:  Humidity = 85, 90, 70, 95, 91                   (Mean = 86.2, StDev = 9.7)
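
A small Python sketch of this calculation is given below; it evaluates the normal probability density at a hypothetical humidity value of 74, using the means and standard deviations from the table above.

import math

def normal_pdf(x, mean, stdev):
    """Probability density function of the normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2 * stdev ** 2)) / (stdev * math.sqrt(2 * math.pi))

# likelihood of a hypothetical humidity value of 74 under each class,
# using the means and standard deviations from the table above
print("P(Humidity=74 | Play Golf=Yes) ~", round(normal_pdf(74, 79.1, 10.2), 4))
print("P(Humidity=74 | Play Golf=No)  ~", round(normal_pdf(74, 86.2, 9.7), 4))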

Predictors Contribution
Kononenko's information gain, as a sum of the information contributed by each attribute, can offer an explanation of how the values of the predictors influence the class probability.

The contribution of predictors can also be visualized by plotting nomograms. A nomogram plots the log odds ratios for each value of each predictor. The lengths of the lines correspond to the spans of the odds ratios, suggesting the importance of the related predictor. It also shows the impact of individual values of each predictor.


Naive Bayesian - Exercise

 Open "Orange".
 Drag and drop "File" widget and double click to load a dataset
(credit_scoring.txt).
 Drag and drop "Select Attributes" widget and connect it to the "File" widget.
 Open "Select Attributes" and set the target (class) and predictors
(attributes).
 Drag and drop "Naive Bayes" widget and connect it to the "Select Attributes"
widget.
 Drag and drop "Test Learners" widget and connect it to the "Naive Bayes"
and the "Select Attributes" widget.
 Drag and drop "Confusion Matrix", "Lift Curve" and "ROC Analysis" widgets
and connect it to the "Test Learners" widget.

Credit_scoring Dataset

BUSAGE DAYSDELQ DEFAULT


87 2 N
89 2 N
90 2 N
90 2 N
101 2 N
110 2 N
115 2 N
115 2 N
115 2 N
117 2 N
127 2 N
149 2 N
150 2 N
157 2 N
166 2 N
170 2 N
183 2 N
183 2 N
190 2 N
201 2 N
202 2 N
207 2 N
212 2 N
220 2 N
310 2 N
322 2 N
399 2 N
415 2 N

326 3 N
12 4 N
12 4 N
38 4 N
200 4 N
302 4 N
30 5 N
44 5 N
131 5 N
23 7 N
152 7 N
89 10 N
24 12 N
41 12 N
151 14 N
25 17 N
48 18 N
12 19 N
267 20 N
18 22 N
43 25 N
15 28 N
60 28 N
86 28 N
12 29 N
77 29 N
38 30 N
46 30 N
64 30 N
77 30 N
132 30 N
150 30 N
12 31 N
24 31 N
12 32 N
27 32 N
36 34 N
26 46 N
26 46 N
116 46 N
88 53 N
27 60 N
84 60 N
162 60 N
277 60 N
51 66 N


12 11 N
100 26 Y
78 27 Y
61 48 Y
343 28 Y
24 49 Y
43 53 Y
417 13 Y
275 54 Y
42 57 Y
26 58 Y
42 59 Y
47 59 Y
170 59 Y
12 60 Y
49 62 Y
36 64 Y
48 65 Y
17 66 Y
85 69 Y
350 39 Y
207 70 Y
229 70 Y
65 72 Y
37 76 Y
73 80 Y


Confusion Matrix

https://orange3.readthedocs.io/projects/orange-visual-programming/en/latest/widgets/model/naivebayes.html
https://www.solver.com/xlminer/help/classification-using-naive-bayes-example
https://orangedatamining.com/widget-catalog/evaluate/liftcurve/

3.1.4 Decision Tree – Classification


A decision tree builds classification or regression models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node (e.g., Outlook) has two or more branches (e.g., Sunny, Overcast and Rainy). A leaf node (e.g., Play) represents a classification or decision. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data.


Algorithm

The core algorithm for building decision trees, called ID3 and developed by J. R. Quinlan, employs a top-down, greedy search through the space of possible branches with no backtracking. ID3 uses Entropy and Information Gain to construct a decision tree.

Entropy

A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous). The ID3 algorithm uses entropy to calculate the homogeneity of a sample. If the sample is completely homogeneous the entropy is zero, and if the sample is equally divided between the classes it has an entropy of one.

To build a decision tree, we need to calculate two types of entropy using frequency tables, as follows:

a) Entropy using the frequency table of one attribute (the target), E(S) = -sum over the classes of p(i) log2 p(i):


b) Entropy using the frequency table of two attributes (target and predictor), E(T, X) = sum over the values c of X of P(c) E(c), i.e., the weighted sum of the branch entropies:

Information Gain

The information gain is based on the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches).

Step 1: Calculate the entropy of the target.

Step 2: The dataset is then split on the different attributes. The entropy for each branch is calculated and added proportionally to get the total entropy for the split. The resulting entropy is subtracted from the entropy before the split. The result is the Information Gain, or decrease in entropy.


Step 3: Choose attribute with the largest information gain as the decision node.

Step 4a: A branch with entropy of 0 is a leaf node.

Step 4b: A branch with entropy more than 0 needs further splitting.


Step 5: The ID3 algorithm is run recursively on the non-leaf branches, until all data
is classified.
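
The sketch below illustrates Steps 1 and 2 in Python: it computes the entropy of the Play Golf target and the information gain of the Outlook attribute for the weather_nominal data used earlier. The module's own figures remain the reference for the full worked example.

import math
from collections import Counter, defaultdict

def entropy(labels):
    """E(S) = -sum of p(i) * log2 p(i) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    """Entropy of the target minus the weighted entropy of each branch after the split."""
    n = len(labels)
    branches = defaultdict(list)
    for value, label in zip(attribute_values, labels):
        branches[value].append(label)
    split_entropy = sum(len(b) / n * entropy(b) for b in branches.values())
    return entropy(labels) - split_entropy

# Outlook and Play Golf columns from the weather_nominal dataset used earlier
outlook   = ["Rainy", "Rainy", "Overcast", "Sunny", "Sunny", "Sunny", "Overcast",
             "Rainy", "Rainy", "Sunny", "Rainy", "Overcast", "Overcast", "Sunny"]
play_golf = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
             "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

print("Entropy(Play Golf) =", round(entropy(play_golf), 3))              # about 0.94
print("Gain(Play Golf, Outlook) =", round(information_gain(outlook, play_golf), 3))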

Decision Tree to Decision Rules

A decision tree can easily be transformed to a set of rules by mapping from the
root node to the leaf nodes one by one.

Decision Trees – Issues

 Working with continuous attributes (binning)


 Avoiding overfitting
 Super Attributes (attributes with many values)
 Working with missing values

Decision Tree - Exercise

 Open "Orange".
 Drag and drop "File" widget and double click to load a dataset
(credit_scoring.txt).
 Drag and drop "Select Attributes" widget and connect it to the "File"
widget.
 Open "Select Attributes" and set the target (class) and predictors
(attributes).
 Drag and drop "Classification Tree" widget and connect it to the "Select
Attributes" widget.
 Drag and drop "Test Learners" widget and connect it to the "Classification
Tree" and the "Select Attributes" widget.
 Drag and drop "Confusion Matrix", "Lift Curve" and "ROC Analysis"
widgets and connect it to the "Test Learners" widget.


Confusion Matrix


Decision Tree - Overfitting

Overfitting is a significant practical difficulty for decision tree models and many other predictive models. Overfitting happens when the learning algorithm continues to develop hypotheses that reduce the training set error at the cost of an increased test set error. There are several approaches to avoiding overfitting in building decision trees:

 Pre-pruning, which stops growing the tree earlier, before it perfectly classifies the training set.
 Post-pruning, which allows the tree to perfectly classify the training set and then prunes it.

Practically, the second approach, post-pruning overfit trees, is more successful because it is not easy to precisely estimate when to stop growing the tree.

The important step of tree pruning is to define a criterion to be used to determine the correct final tree size, using one of the following methods:

1. Use a distinct dataset from the training set (called validation set), to
evaluate the effect of post-pruning nodes from the tree.
2. Build the tree by using the training set, then apply a statistical test to
estimate whether pruning or expanding a particular node is likely to
produce an improvement beyond the training set.
o Error estimation
o Significance testing (e.g., Chi-square test)
3. Minimum Description Length principle: use an explicit measure of the complexity of encoding the training set and the decision tree, stopping growth of the tree when this encoding size (size(tree) + size(misclassifications(tree))) is minimized.

The first method is the most common approach. In this approach, the available
data are separated into two sets of examples: a training set, which is used to build
the decision tree, and a validation set, which is used to evaluate the impact of
pruning the tree. The second method is also a common approach. Here, we explain
the error estimation and Chi2 test.

Post-pruning using Error estimation


The error estimate for a sub-tree is the weighted sum of the error estimates of all its leaves. The error estimate (e) for a node with N training instances, observed error rate f and confidence factor z is the pessimistic (upper-bound) estimate, commonly written as:

e = (f + z^2/(2N) + z * sqrt(f/N - f^2/N + z^2/(4N^2))) / (1 + z^2/N)


In the following example we set z to 0.69, which is equal to a confidence level of 75%.

The error rate at the parent node is 0.46, and since the error rate for its children (0.51) increases with the split, we do not want to keep the children.

Post-pruning using Chi2 test

In Chi2 test we construct the corresponding frequency table and calculate the
Chi2 value and its probability.

Bronze Silver Gold

Bad 4 1 4

Good 2 1 2

Chi2 = 0.21, Probability = 0.90, degrees of freedom = 2

If we require the probability to be less than a limit (e.g., 0.05), then, since 0.90 is well above that limit, we decide not to split the node.
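
Assuming SciPy is available, the same test can be reproduced with scipy.stats.chi2_contingency, as sketched below; the observed frequency table is the Bronze/Silver/Gold table above.

from scipy.stats import chi2_contingency

# observed frequency table from the example above (rows: Bad/Good, columns: Bronze/Silver/Gold)
observed = [[4, 1, 4],
            [2, 1, 2]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(round(chi2, 2), round(p_value, 2), dof)     # roughly 0.21, 0.90, 2

# require p < 0.05 before accepting the split
print("split the node" if p_value < 0.05 else "do not split the node")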


Decision Tree - Super Attributes

The information gain equation G(T,X) is biased toward attributes that have a large number of values over attributes that have a smaller number of values. These 'super attributes' will easily be selected as the root, resulting in a broad tree that classifies perfectly on the training data but performs poorly on unseen instances. We can penalize attributes with large numbers of values by using an alternative method for attribute selection, referred to as Gain Ratio.

Example:

The following example shows a frequency table between the target (Play Golf) and
the ID attribute which has a unique value for each record of the dataset.


The information gain for ID is maximum (0.94) without using the split information.
However, with the adjustment the information gain dropped to 0.25.

3.1.5 Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) is a classification method originally developed in 1936 by R. A. Fisher. It is simple, mathematically robust and often produces models whose accuracy is as good as more complex methods.

Algorithm

LDA is based upon the concept of searching for a linear combination of variables
(predictors) that best separates two classes (targets). To capture the notion of
separability, Fisher defined the following score function.

Given the score function, the problem is to estimate the linear coefficients that
maximize the score which can be solved by the following equations.


One way of assessing the effectiveness of the discrimination is to calculate the Mahalanobis distance between the two groups. A distance greater than 3 means that the two averages differ by more than 3 standard deviations, so the overlap (probability of misclassification) is quite small.

Finally, a new point is classified by projecting it onto the maximally separating direction and classifying it as C1 if:

Example:

Suppose we received a dataset from a bank regarding its small business clients
who defaulted (red square) and those that did not (blue circle) separated by
delinquent days (DAYSDELQ) and number of months in business (BUSAGE). We use
LDA to find an optimal linear model that best separates two classes (default and
non-default).


The first step is to calculate the mean (average) vectors, covariance matrices and
class probabilities.

Then, we calculate pooled covariance matrix and finally the coefficients of the
linear model.

A Mahalanobis distance of 2.32 shows a small overlap between two groups which
means a good separation between classes by the linear model.
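
The module computes the LDA coefficients by hand from the pooled covariance matrix. Purely as an illustration (scikit-learn is an assumption of this sketch and is not part of the module's Weka/Orange toolset), the code below fits LinearDiscriminantAnalysis on a handful of (BUSAGE, DAYSDELQ) cases taken from the credit_scoring dataset listed earlier; the query point is hypothetical.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# a few (BUSAGE, DAYSDELQ) cases taken from the credit_scoring dataset listed earlier
X = np.array([[87, 2], [110, 2], [202, 2], [30, 5], [24, 12], [60, 28],
              [100, 26], [343, 28], [42, 57], [12, 60], [85, 69], [73, 80]], dtype=float)
y = np.array(["N", "N", "N", "N", "N", "N", "Y", "Y", "Y", "Y", "Y", "Y"])

lda = LinearDiscriminantAnalysis().fit(X, y)
print("coefficients:", lda.coef_, "intercept:", lda.intercept_)

# classify a hypothetical client with BUSAGE=48 and DAYSDELQ=40
print("prediction:", lda.predict([[48.0, 40.0]]))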

Predictors Contribution

A simple linear correlation between the model scores and the predictors can be used to test which predictors contribute significantly to the discriminant function. Correlation varies from -1 to 1, with -1 and 1 meaning the highest contribution but in different directions, and 0 meaning no contribution at all.


Quadratic Discriminant Analysis (QDA)


QDA is a general discriminant function with quadratic decision boundaries which can be used to classify datasets with two or more classes. QDA has more predictive power than LDA, but it needs to estimate the covariance matrix for each class.

where Ck is the covariance matrix for class k (-1 means the inverse matrix), |Ck| is the determinant of the covariance matrix Ck, and P(ck) is the prior probability of class k. The classification rule is simply to find the class with the highest Z value.

3.1.6 Logistic Regression


Logistic regression predicts the probability of an outcome that can only have two values (i.e., a dichotomy). The prediction is based on the use of one or several predictors (numerical and categorical). A linear regression is not appropriate for predicting the value of a binary variable, for two reasons:

 A linear regression will predict values outside the acceptable range (e.g., predicting probabilities outside the range 0 to 1).
 Since dichotomous experiments can only have one of two possible values for each experiment, the residuals will not be normally distributed about the predicted line.

On the other hand, a logistic regression produces a logistic curve, which is limited
to values between 0 and 1. Logistic regression is similar to a linear regression, but
the curve is constructed using the natural logarithm of the “odds” of the target
variable, rather than the probability. Moreover, the predictors do not have to be
normally distributed or have equal variance in each group.

In the logistic regression the constant (b0) moves the curve left and right and the
slope (b1) defines the steepness of the curve. By simple transformation, the logistic
regression equation can be written in terms of an odds ratio.


Finally, taking the natural log of both sides, we can write the equation in terms of
log-odds (logit) which is a linear function of the predictors. The coefficient (b1) is
the amount the logit (log-odds) changes with a one unit change in x.

As mentioned before, logistic regression can handle any number of numerical


and/or categorical variables.

There are several analogies between linear regression and logistic regression. Just
as ordinary least square regression is the method used to estimate coefficients for
the best fit line in linear regression, logistic regression uses maximum likelihood
estimation (MLE) to obtain the model coefficients that relate predictors to the
target. After this initial function is estimated, the process is repeated until LL (Log
Likelihood) does not change significantly.
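
As a rough illustration of these ideas, the sketch below fits a logistic model on a few (BUSAGE, DAYSDELQ) cases from the credit_scoring dataset listed earlier and reports exp(b), the odds ratios discussed below. Using scikit-learn's LogisticRegression is an assumption of this sketch; the module's own exercise uses the Orange widget instead.

import numpy as np
from sklearn.linear_model import LogisticRegression

# a few (BUSAGE, DAYSDELQ) cases taken from the credit_scoring dataset listed earlier
X = np.array([[87, 2], [110, 2], [202, 2], [30, 5], [24, 12], [60, 28],
              [100, 26], [343, 28], [42, 57], [12, 60], [85, 69], [73, 80]], dtype=float)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])    # 1 = DEFAULT "Y", 0 = "N"

model = LogisticRegression(max_iter=1000).fit(X, y)
b = model.coef_[0]

print("log-odds coefficients b:", b)
print("odds ratios exp(b):", np.exp(b))               # change in the odds per one-unit change in x
print("P(default | BUSAGE=48, DAYSDELQ=40):", model.predict_proba([[48.0, 40.0]])[0, 1])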

A pseudo R2 value is also available to indicate the adequacy of the regression model. The likelihood ratio test is a test of the significance of the difference between the likelihood ratio for the baseline model and the likelihood ratio for a reduced model; this difference is called the "model chi-square". The Wald test is used to test the statistical significance of each coefficient (b) in the model (i.e., the predictors' contribution).

Pseudo R2

There are several measures intended to mimic the R2 analysis to evaluate the goodness-of-fit of logistic models, but they cannot be interpreted as one would interpret an R2, and different pseudo R2 measures can arrive at very different values. Here we discuss three pseudo R2 measures.


Likelihood Ratio Test

The likelihood ratio test provides the means for comparing the likelihood of the
data under one model (e.g., full model) against the likelihood of the data under
another, more restricted model (e.g., intercept model).

where 'p' is the logistic model predicted probability. The next step is to calculate
the difference between these two log-likelihoods.

This difference between the two log-likelihoods is multiplied by a factor of 2 in order to be assessed for statistical significance using standard significance levels (Chi2 test). The degrees of freedom for the test equal the difference in the number of parameters being estimated under the two models (e.g., full and intercept).

Wald test

A Wald test is used to evaluate the statistical significance of each coefficient (b) in
the model.


where W = b / SE is the Wald statistic, which follows a normal distribution (like a Z-test), b is the coefficient and SE is its standard error. The W value is then squared, yielding a Wald statistic with a chi-square distribution.

Predictors Contributions

The Wald test is usually used to assess the significance of prediction of each predictor. Another indicator of the contribution of a predictor is exp(b), the odds ratio of the coefficient, which is the factor by which the odds change with a one-unit change in the predictor (x).

Logistic Regression - Exercise

 Open "Orange".
 Drag and drop "File" widget and double click to load a dataset
(credit_scoring.txt).
 Drag and drop "Select Attributes" widget and connect it to the "File"
widget.
 Open "Select Attributes" and set the target (class) and predictors
(attributes).
 Drag and drop "Logistic Regression" widget and connect it to the "Select
Attributes" widget.
 Drag and drop "Test Learners" widget and connect it to the "Logistic
Regression" and the "Select Attributes" widget.
 Drag and drop "Confusion Matrix", "Lift Curve" and "ROC Analysis"
widgets and connect it to the "Test Learners" widget.


Confusion Matrix


3.1.7 K Nearest Neighbors - Classification


K nearest neighbors (KNN) is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions). KNN has been used in statistical estimation and pattern recognition since the beginning of the 1970s as a non-parametric technique.

Algorithm

A case is classified by a majority vote of its neighbors, with the case being assigned
to the class most common amongst its K nearest neighbors measured by a distance
function. If K = 1, then the case is simply assigned to the class of its nearest
neighbor.

It should also be noted that all three distance measures are only valid for
continuous variables. In the instance of categorical variables the Hamming distance
must be used. It also brings up the issue of standardization of the numerical
variables between 0 and 1 when there is a mixture of numerical and categorical
variables in the dataset.


Choosing the optimal value for K is best done by first inspecting the data. In general, a larger K value is more precise as it reduces the overall noise, but there is no guarantee. Cross-validation is another way to retrospectively determine a good K value by using an independent dataset to validate it. Historically, the optimal K for most datasets has been between 3 and 10. That produces much better results than 1NN.

Example:

Consider the following data concerning credit default. Age and Loan are two
numerical variables (predictors) and Default is the target.

We can now use the training set to classify an unknown case (Age=48 and
Loan=$142,000) using Euclidean distance. If K=1 then the nearest neighbor is the
last case in the training set with Default=Y.

D = Sqrt[(48-33)^2 + (142000-150000)^2] = 8000.01 >> Default=Y


With K=3, there are two Default=Y and one Default=N out of three closest
neighbors. The prediction for the unknown case is again Default=Y.
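
The distance computation in this example can be checked with a few lines of Python, as sketched below. The coordinates of the nearest training case (Age=33, Loan=$150,000) are taken from the distance shown above, since the full Age/Loan table is not reproduced in this module.

import math

def euclidean(a, b):
    """Euclidean distance between two cases."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# nearest training case (Age=33, Loan=$150,000) and the unknown case (Age=48, Loan=$142,000)
known   = (33, 150000)
unknown = (48, 142000)
print(round(euclidean(known, unknown), 2))   # 8000.01 -> predict Default=Y for K=1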

Standardized Distance

One major drawback in calculating distance measures directly from the training set
is in the case where variables have different measurement scales or there is a
mixture of numerical and categorical variables. For example, if one variable is
based on annual income in dollars, and the other is based on age in years then
income will have a much higher influence on the distance calculated. One solution
is to standardize the training set as shown below.

Using the standardized distance on the same training set, the unknown case
returned a different neighbor which is not a good sign of robustness.

K Nearest Neighbors - Exercise

 Open "Orange".
 Drag and drop "File" widget and double click to load a dataset
(credit_scoring.txt).
 Drag and drop "Select Attributes" widget and connect it to the "File"
widget.
 Open "Select Attributes" and set the target (class) and predictors
(attributes).
 Drag and drop "k Nearest Neighbours" widget and connect it to the
"Select Attributes" widget.
 Drag and drop "Test Learners" widget and connect it to the "k Nearest
Neighbours" and the "Select Attributes" widget.
 Drag and drop "Confusion Matrix", "Lift Curve" and "ROC Analysis"
widgets and connect it to the "Test Learners" widget.


Confusion Matrix


3.1.8 Artificial Neural Network


An artificial neural network (ANN) is a system based on the biological neural network, such as the brain. The brain has approximately 100 billion neurons, which communicate through electro-chemical signals. The neurons are connected through junctions called synapses. Each neuron receives thousands of connections from other neurons, constantly receiving incoming signals that reach the cell body. If the resulting sum of the signals surpasses a certain threshold, a response is sent through the axon. The ANN attempts to recreate a computational mirror of the biological neural network, although it is not truly comparable, since the number and complexity of the neurons used in a biological neural network are many times greater than those in an artificial neural network.

An ANN is comprised of a network of artificial neurons (also known as "nodes"). These nodes are connected to each other, and each connection is assigned a value based on its strength: inhibition (the maximum being -1.0) or excitation (the maximum being +1.0). If the value of the connection is high, it indicates a strong connection. A transfer function is built into each node's design. There are three types of neurons in an ANN: input nodes, hidden nodes, and output nodes.


The input nodes take in information, in a form that can be numerically expressed. The information is presented as activation values, where each node is given a number; the higher the number, the greater the activation. This information is then passed throughout the network. Based on the connection strengths (weights), inhibition or excitation, and transfer functions, the activation value is passed from node to node. Each node sums the activation values it receives and then modifies the value based on its transfer function. The activation flows through the network, through the hidden layers, until it reaches the output nodes. The output nodes then reflect the input in a meaningful way to the outside world.

Transfer (Activation) Functions

The transfer function translates the input signals to output signals. Four types of transfer functions are commonly used: unit step (threshold), sigmoid, piecewise linear, and Gaussian.

Unit step (threshold)


The output is set at one of two levels, depending on whether the total input is
greater than or less than some threshold value.


Sigmoid

The sigmoid family consists of two functions, the logistic and the tangential. The values of the logistic function range from 0 to 1, while those of the tangential (hyperbolic tangent) function range from -1 to +1.

Piecewise Linear

The output is proportional to the total weighted output.

Gaussian

Gaussian functions are continuous, bell-shaped curves. The node output (high/low) is interpreted in terms of class membership (1/0), depending on how close the net input is to a chosen value of the average (center).
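
The sketch below gives simple Python versions of these four transfer functions; the thresholds, centers and widths are arbitrary illustrative choices.

import math

def unit_step(x, threshold=0.0):
    """Output is 1 if the total input exceeds the threshold, otherwise 0."""
    return 1.0 if x >= threshold else 0.0

def logistic(x):
    """Sigmoid (logistic): output ranges from 0 to 1."""
    return 1.0 / (1.0 + math.exp(-x))

def tangential(x):
    """Sigmoid (tangential / hyperbolic tangent): output ranges from -1 to +1."""
    return math.tanh(x)

def piecewise_linear(x, lo=-1.0, hi=1.0):
    """Proportional to the total weighted input inside [lo, hi], clipped outside."""
    return max(lo, min(hi, x))

def gaussian(x, center=0.0, width=1.0):
    """Bell-shaped response, highest when the input is close to the chosen center."""
    return math.exp(-((x - center) ** 2) / (2 * width ** 2))

for f in (unit_step, logistic, tangential, piecewise_linear, gaussian):
    print(f.__name__, [round(f(x), 3) for x in (-2.0, 0.0, 2.0)])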


Algorithm

There are different types of neural networks, but they are generally classified into
feed-forward and feed-back networks.

A feed-forward network is a non-recurrent network which contains inputs, outputs, and hidden layers; the signals can only travel in one direction. Input data
is passed onto a layer of processing elements where it performs calculations. Each
processing element makes its computation based upon a weighted sum of its
inputs. The new calculated values then become the new input values that feed the
next layer. This process continues until it has gone through all the layers and
determines the output. A threshold transfer function is sometimes used to
quantify the output of a neuron in the output layer. Feed-forward networks
include Perceptron (linear and non-linear) and Radial Basis Function networks.
Feed-forward networks are often used in data mining.

A feed-back network has feed-back paths meaning they can have signals traveling
in both directions using loops. All possible connections between neurons are
allowed. Since loops are present in this type of network, it becomes a non-linear
dynamic system which changes continuously until it reaches a state of equilibrium.
Feed-back networks are often used in associative memories and optimization
problems where the network looks for the best arrangement of interconnected
factors.

Artificial Neural Network - Perceptron

A single layer perceptron (SLP) is a feed-forward network based on a threshold transfer function. The SLP is the simplest type of artificial neural network and can only classify linearly separable cases with a binary target (1, 0).

Algorithm
The single layer perceptron does not have a priori knowledge, so the initial weights
are assigned randomly. SLP sums all the weighted inputs and if the sum is above
the threshold (some predetermined value), SLP is said to be activated (output=1).


The input values are presented to the perceptron, and if the predicted output is
the same as the desired output, then the performance is considered satisfactory
and no changes to the weights are made. However, if the output does not match
the desired output, then the weights need to be changed to reduce the error.
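
A minimal Python sketch of this learning rule is shown below, trained on the AND function (a linearly separable case); the learning rate of 0.1 and the zero initial weights are assumptions of the sketch.

# Single layer perceptron trained on the AND function (linearly separable)
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = [0.0, 0.0]
bias = 0.0
rate = 0.1

def predict(x):
    s = sum(wi * xi for wi, xi in zip(w, x)) + bias
    return 1 if s > 0 else 0              # threshold (unit step) transfer function

for epoch in range(20):
    errors = 0
    for x, target in data:
        error = target - predict(x)       # 0 when the prediction matches the desired output
        if error != 0:                    # otherwise adjust the weights to reduce the error
            w = [wi + rate * error * xi for wi, xi in zip(w, x)]
            bias += rate * error
            errors += 1
    if errors == 0:                       # all cases classified properly
        break

print("weights:", w, "bias:", bias)
print([predict(x) for x, _ in data])      # [0, 0, 0, 1]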

Because the SLP is a linear classifier, if the cases are not linearly separable the learning process will never reach a point where all the cases are classified properly. The most famous example of the perceptron's inability to solve problems with linearly non-separable cases is the XOR problem.

However, a multi-layer perceptron using the backpropagation algorithm can successfully classify the XOR data.

Multi-layer Perceptron - Backpropagation algorithm

A multi-layer perceptron (MLP) has the same structure of a single layer perceptron
with one or more hidden layers. The backpropagation algorithm consists of two
phases: the forward phase where the activations are propagated from the input to
the output layer, and the backward phase, where the error between the observed
actual and the requested nominal value in the output layer is propagated
backwards in order to modify the weights and bias values.


Forward propagation:

Propagate the inputs by summing all the weighted inputs and then computing the outputs using the sigmoid threshold.

Backward propagation:

Propagate the errors backwards by apportioning them to each unit according to the amount of the error the unit is responsible for.
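
The sketch below is a compact NumPy illustration of the two phases on the XOR data; the hidden layer size (4), the learning rate (0.5) and the iteration count are assumptions of the sketch, not values prescribed by the module.

import numpy as np

# XOR inputs and targets
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))    # input -> hidden (4 hidden units)
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))    # hidden -> output
rate = 0.5

for _ in range(20000):
    # forward phase: propagate activations from the input layer to the output layer
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)

    # backward phase: propagate the error back and apportion it to each unit
    output_delta = (output - y) * output * (1 - output)
    hidden_delta = (output_delta @ W2.T) * hidden * (1 - hidden)

    # modify the weights and bias values
    W2 -= rate * hidden.T @ output_delta
    b2 -= rate * output_delta.sum(axis=0, keepdims=True)
    W1 -= rate * X.T @ hidden_delta
    b1 -= rate * hidden_delta.sum(axis=0, keepdims=True)

print(np.round(output, 2).ravel())    # typically converges toward [0, 1, 1, 0]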

Artificial Neural Network - Perceptron - Exercise

1. Open "Weka".
2. Click on "Open file ..." and load the dataset (slump.csv).
3. Click on the "Classify" tab and choose "functions > MultilayerPeceptron".
4. Set the parameters before running the model.
5. Select "Compressive Strength" as the target from the list box and click on
"Start".
6. Check the output pane for the result.


You can change the model parameters before running the model.


Slump Dataset

Cement  Slag  Fly ash  Water  SP  Coarse Aggr.  Fine Aggr.  SLUMP (cm)  FLOW (cm)  Compressive Strength (28-day)(MPa)
273 82 105 210 9 904 680 23 62 34.99
163 149 191 180 12 843 746 0 20 41.14
162 148 191 179 16 840 743 1 20 41.81
162 148 190 179 19 838 741 3 21.5 42.08
154 112 144 220 10 923 658 20 64 26.82
147 89 115 202 9 860 829 23 55 25.21
152 139 178 168 18 944 695 0 20 38.86
145 0 227 240 6 750 853 14.5 58.5 36.59
152 0 237 204 6 785 892 15.5 51 32.71
304 0 140 214 6 895 722 19 51 38.46
145 106 136 208 10 751 883 24.5 61 26.02
148 109 139 193 7 768 902 23.75 58 28.03
142 130 167 215 6 735 836 25.5 67 31.37
354 0 0 234 6 959 691 17 54 33.91
374 0 0 190 7 1013 730 14.5 42.5 32.44
159 116 149 175 15 953 720 23.5 54.5 34.05
153 0 239 200 6 1002 684 12 35 28.29
295 106 136 206 11 750 766 25 68.5 41.01
310 0 143 168 10 914 804 20.5 48.2 49.3
296 97 0 219 9 932 685 15 48.5 29.23
305 100 0 196 10 959 705 20 49 29.77
310 0 143 218 10 787 804 13 46 36.19
148 180 0 183 11 972 757 0 20 18.52
146 178 0 192 11 961 749 18 46 17.19
142 130 167 174 11 883 785 0 20 36.72
140 128 164 183 12 871 775 23.75 53 33.38
308 111 142 217 10 783 686 25 70 42.08
295 106 136 208 6 871 650 26.5 70 39.4
298 107 137 201 6 878 655 16 26 41.27
314 0 161 207 6 851 757 21.5 64 41.14
321 0 164 190 5 870 774 24 60 45.82
349 0 178 230 6 785 721 20 68.5 43.95
366 0 187 191 7 824 757 24.75 62.7 52.65
274 89 115 202 9 759 827 26.5 68 35.52
137 167 214 226 6 708 757 27.5 70 34.45
275 99 127 184 13 810 790 25.75 64.5 43.54
252 76 97 194 8 835 821 23 54 33.11
165 150 0 182 12 1023 729 14.5 20 18.26
158 0 246 174 7 1035 706 19 43 34.99


156 0 243 180 11 1022 698 21 57 33.78


145 177 227 209 11 752 715 2.5 20 35.66
154 141 181 234 11 797 683 23 65 33.51
160 146 188 203 11 829 710 13 38 33.51
291 105 0 205 6 859 797 24 59 27.62
298 107 0 186 6 879 815 3 20 30.97
318 126 0 210 6 861 737 17.5 48 31.77
280 92 118 207 9 883 679 25.5 64 37.39
287 94 121 188 9 904 696 25 61 43.01
332 0 170 160 6 900 806 0 20 58.53
326 0 167 174 6 884 792 21.5 42 52.65
320 0 163 188 9 866 776 23.5 60 45.69
342 136 0 225 11 770 747 21 61 32.04
356 142 0 193 11 801 778 8 30 36.46
309 0 142 218 10 912 680 24 62 38.59
322 0 149 186 8 951 709 20.5 61.5 45.42
159 193 0 208 12 821 818 23 50 19.19
307 110 0 189 10 904 765 22 40 31.5
313 124 0 205 11 846 758 22 49 29.63
143 131 168 217 6 891 672 25 69 26.42
140 128 164 237 6 869 656 24 65 29.5
278 0 117 205 9 875 799 19 48 32.71
288 0 121 177 7 908 829 22.5 48.5 39.93
299 107 0 210 10 881 745 25 63 28.29
291 104 0 231 9 857 725 23 69 30.43
265 86 111 195 6 833 790 27 60 37.39
159 0 248 175 12 1041 683 21 51 35.39
160 0 250 168 12 1049 688 18 48 37.66
166 0 260 183 13 859 827 21 54 40.34
320 127 164 211 6 721 723 2 20 46.36
336 134 0 222 6 756 787 26 64 31.9
276 90 116 180 9 870 768 0 20 44.08
313 112 0 220 10 794 789 23 58 28.16
322 116 0 196 10 818 813 25.5 67 29.77
294 106 136 207 6 747 778 24 47 41.27
146 106 137 209 6 875 765 24 67 27.89
149 109 139 193 6 892 780 23.5 58.5 28.7
159 0 187 176 11 990 789 12 39 32.57
261 78 100 201 9 864 761 23 63.5 34.18
140 1.4 198 174.9 4.4 1050 781 16.25 31 30.83
141.1 0.6 210 188.8 4.6 996.1 789 23.5 53 30.43
140.1 4.2 216 193.9 4.7 1050 710 24.5 57 26.42
140.1 12 226 207.8 4.9 1021 684 21 64 26.28
160.2 0.3 240 233.5 9.2 781 841 24 75 36.19


140.2 31 239 169.4 5.3 1028 743 21.25 46 36.32


140.2 45 235 171.3 5.5 1048 704 23.5 52.5 33.78
140.5 61 239 182.5 5.7 1018 681 24.5 60 30.97
143.3 92 240 200.8 6.2 964.8 647 25 55 27.09
194.3 0.3 240 234.2 8.9 780.6 811 26.5 78 38.46
150.4 111 240 168.1 6.5 1000 667 9.5 27.5 37.92
150.3 111 239 167.3 6.5 999.5 671 14.5 36.5 38.19
155.4 122 240 179.9 6.7 966.8 653 14.5 41.5 35.52
165.3 143 238 200.4 7.1 883.2 653 17 27 32.84
303.8 0.2 240 236.4 8.3 780.1 715 25 78 44.48
172 162 239 166 7.4 953.3 641 0 20 41.54
172.8 158 240 166.4 7.4 952.6 644 0 20 41.81
184.3 153 239 179 7.5 920.2 641 0 20 41.01
215.6 113 239 198.7 7.4 884 649 27.5 64 39.13
295.3 0 240 236.2 8.3 780.3 723 25 77 44.08
248.3 101 239 168.9 7.7 954.2 641 0 20 49.97
248 101 240 169.1 7.7 949.9 644 2 20 50.23
258.8 88 240 175.3 7.6 938.9 646 0 20 50.5
297.1 41 240 194 7.5 908.9 652 27.5 67 49.17
348.7 0.1 223 208.5 9.6 786.2 758 29 78 48.7

Radial Basis Function Networks (RBF)

RBF networks have three layers: an input layer, a hidden layer and an output layer. One neuron in the input layer corresponds to each predictor variable. With respect to categorical variables, n-1 neurons are used, where n is the number of categories. The hidden layer has a variable number of neurons. Each neuron consists of a radial basis function centered on a point with the same dimensions as the predictor variables. The output layer forms the network outputs as a weighted sum of the outputs from the hidden layer.


Algorithm

h(x) is the Gaussian activation function with the parameters r (the radius or standard deviation) and c (the center or average taken from the input space) defined separately at each RBF unit. The learning process is based on adjusting the parameters of the network to reproduce a set of input-output patterns. There are three types of parameters: the weight w between the hidden nodes and the output nodes, the center c of each neuron of the hidden layer, and the unit width r.

Unit Center (c)

Any clustering algorithm can be used to determine the RBF unit centers (e.g., K-means clustering). A set of clusters is determined, each with a center whose dimension equals the number of input variables (nodes of the input layer). The cluster centers become the centers of the RBF units. The number of clusters, H, is a design parameter and determines the number of nodes in the hidden layer. The K-means clustering algorithm proceeds as follows:

1. Initialize the center of each cluster to a different randomly selected training pattern.
2. Assign each training pattern to the nearest cluster. This can be
accomplished by calculating the Euclidean distances between the training
patterns and the cluster centers.
3. When all training patterns are assigned, calculate the average position for
each cluster center. They then become new cluster centers.
4. Repeat steps 2 and 3, until the cluster centers do not change during the
subsequent iterations.

Unit width (r)

When the RBF centers have been established, the width of each RBF unit can be calculated using the K-nearest neighbors algorithm. A number K is chosen, and for each center, the K nearest centers are found. The root-mean-squared distance between the current cluster center and its K nearest neighbors is calculated, and this is the value chosen for the unit width (r). So, if the current cluster center is cj, the r value is:

A typical value for K is 2, in which case r is set to be the average distance from the two nearest neighboring cluster centers.


Weights (w)

Using the linear mapping, the w vector is calculated from the output vector (y) and the design matrix H.

In contrast to the multi-layer perceptron, there are no local minima in RBF networks.

Artificial Neural Network - RBF - Exercise

1. Open "Weka".
2. Click on "Open file ..." and load the dataset (credit_scoring.csv).
3. Click on the "Classify" tab and choose "functions > RBFNetwork".
4. Set the parameters before running the model.
5. Select "DEFAULT" as the target from the list box and click on "Start".
6. Check the "Classifier output" pane for the result.


3.1.9 Support Vector Machine - Classification (SVM)

A Support Vector Machine (SVM) performs classification by finding the hyperplane that maximizes the margin between the two classes. The vectors (cases) that define the hyperplane are the support vectors.

Algorithm

1. Define an optimal hyperplane: maximize the margin.
2. Extend the above definition for non-linearly separable problems: have a penalty term for misclassifications.
3. Map the data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that the data is mapped implicitly to this space.

To define an optimal hyperplane we need to maximize the width of the margin (w).


We find w and b by solving the following objective function using Quadratic Programming.

The beauty of SVM is that if the data is linearly separable, there is a unique global
minimum value. An ideal SVM analysis should produce a hyperplane that
completely separates the vectors (cases) into two non-overlapping classes.
However, perfect separation may not be possible, or it may result in a model with
so many cases that the model does not classify correctly. In this situation SVM finds
the hyperplane that maximizes the margin and minimizes the misclassifications.


The algorithm tries to keep the slack variables at zero while maximizing the margin. However, it does not minimize the number of misclassifications (an NP-complete problem) but rather the sum of distances from the margin hyperplanes.

The simplest way to separate two groups of data is with a straight line (1 dimension), a flat plane (2 dimensions) or an N-dimensional hyperplane. However, there are situations where a nonlinear region can separate the groups more efficiently. SVM handles this by using a kernel function (nonlinear) to map the data into a different space where a hyperplane (linear) can be used to do the separation. It means a non-linear function is learned by a linear learning machine in a high-dimensional feature space, while the capacity of the system is controlled by a parameter that does not depend on the dimensionality of the space. This is called the kernel trick, which means the kernel function transforms the data into a higher-dimensional feature space to make it possible to perform the linear separation.


Map data into new space, then take the inner product of the new vectors. The
image of the inner product of the data is the inner product of the images of the
data. Two kernel functions are shown below.
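
As an illustration of why the kernel trick matters, the sketch below compares a linear kernel with an RBF kernel using scikit-learn's SVC on synthetic, circularly separable data; both the library and the data are assumptions of this sketch rather than part of the module's exercises.

import numpy as np
from sklearn.svm import SVC

# two classes that are not linearly separable in the original space
# (one class inside a circle, the other outside it)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(int)

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)   # nonlinear kernel

print("linear kernel training accuracy:", linear_svm.score(X, y))
print("RBF kernel training accuracy:   ", rbf_svm.score(X, y))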

Support Vector Machine - Exercise

 Open "Orange".
 Drag and drop "File" widget and double click to load a dataset
(credit_scoring.txt).
 Drag and drop "Select Attributes" widget and connect it to the "File"
widget.
 Open "Select Attributes" and set the target (class) and predictors
(attributes).
 Drag and drop "SVM" widget and connect it to the "Select Attributes"
widget.
 Drag and drop "Test Learners" widget and connect it to the "SVM" and
the "Select Attributes" widget.
 Drag and drop "Confusion Matrix", "Lift Curve" and "ROC Analysis"
widgets and connect it to the "Test Learners" widget.


Confusion Matrix

Activity # 1.
Completely perform the simulation process on classification using your own dataset and compare the outputs of the algorithms using Weka or Orange for data analysis. Identify which algorithms perform best for classification.


3.2 REGRESSION
Regression is a data science task of predicting the value of target (numerical
variable) by building a model based on one or more predictors (numerical and
categorical variables).
1. Frequency Table
o Decision Tree
2. Covariance Matrix
o Multiple Linear Regression
3. Similarity Function
o K Nearest Neighbors
4. Others
o Artificial Neural Network
o Support Vector Machine

3.2.1 Decision Tree - Regression


A decision tree builds regression or classification models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node (e.g., Outlook) has two or more branches (e.g., Sunny, Overcast and Rainy), each representing values for the attribute tested. A leaf node (e.g., Hours Played) represents a decision on the numerical target. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data.

Decision Tree Algorithm


The core algorithm for building decision trees called ID3 by J. R. Quinlan which
employs a top-down, greedy search through the space of possible branches
with no backtracking. The ID3 algorithm can be used to construct a decision
tree for regression by replacing Information Gain with Standard Deviation
Reduction.


Standard Deviation

A decision tree is built top-down from a root node and involves partitioning
the data into subsets that contain instances with similar values (homogenous).
We use standard deviation to calculate the homogeneity of a numerical
sample. If the numerical sample is completely homogeneous its standard
deviation is zero.

a) Standard deviation for one attribute:

 Standard Deviation (S) is used for tree building (branching).
 Coefficient of Variation (CV) is used to decide when to stop branching. We can use Count (n) as well.
 Average (Avg) is the value in the leaf nodes.

b) Standard deviation for two attributes (target and predictor):


Standard Deviation Reduction

The standard deviation reduction is based on the decrease in standard deviation after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest standard deviation reduction (i.e., the most homogeneous branches).

Step 1: The standard deviation of the target is calculated.

Standard deviation (Hours Played) = 9.32

Step 2: The dataset is then split on the different attributes. The standard deviation
for each branch is calculated. The resulting standard deviation is subtracted from
the standard deviation before the split. The result is the standard deviation
reduction.

Step 3: The attribute with the largest standard deviation reduction is chosen for the decision node.


Step 4a: The dataset is divided based on the values of the selected attribute. This process is run recursively on the non-leaf branches, until all data is processed.

In practice, we need some termination criteria; for example, when the coefficient of variation (CV) for a branch becomes smaller than a certain threshold (e.g., 10%) and/or when too few instances (n) remain in the branch (e.g., 3).

Step 4b: "Overcast" subset does not need any further splitting because
its CV (8%) is less than the threshold (10%). The related leaf node gets
the average of the "Overcast" subset.

Step 4c: However, the "Sunny" branch has an CV (28%) more than the
threshold (10%) which needs further splitting. We select "Windy" as
the best best node after "Outlook" because it has the largest SDR.


Because the number of data points for both branches (FALSE and TRUE) is equal to or less than 3, we stop further branching and assign the average of each branch to the related leaf node.

Step 4d: Moreover, the "Rainy" branch has a CV (22%), which is more than the threshold (10%), so this branch needs further splitting. We select "Temp" as the best node because it has the largest SDR.


Because the number of data points for all three branches (Cool, Hot and
Mild) is three or fewer, we stop further branching and assign the
average of each branch to the related leaf node.

When the number of instances at a leaf node is more than one, we calculate the average as the final value for the target.
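
As a hedged illustration of the branching criterion above, the following Python sketch computes the Standard Deviation Reduction of a numerical target for a categorical predictor. The column names echo the example (Outlook, Hours Played), but the rows themselves are made up for illustration only.

    import statistics
    from collections import defaultdict

    def sdr(rows, predictor, target="Hours Played"):
        """Standard Deviation Reduction of `target` when splitting on `predictor`."""
        overall = statistics.pstdev(r[target] for r in rows)
        groups = defaultdict(list)
        for r in rows:
            groups[r[predictor]].append(r[target])
        weighted = sum(len(v) / len(rows) * statistics.pstdev(v) for v in groups.values())
        return overall - weighted

    # Illustrative rows only (not the full weather table from the text)
    data = [
        {"Outlook": "Sunny",    "Hours Played": 30},
        {"Outlook": "Sunny",    "Hours Played": 25},
        {"Outlook": "Overcast", "Hours Played": 46},
        {"Outlook": "Overcast", "Hours Played": 43},
        {"Outlook": "Rainy",    "Hours Played": 35},
        {"Outlook": "Rainy",    "Hours Played": 38},
    ]
    # The attribute with the largest SDR becomes the decision node
    print(sdr(data, "Outlook"))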

Decision Tree Regression - Exercise


 Open "Orange".
 Drag and drop "File" widget and double click to load a dataset (slump.txt).
 Drag and drop "Select Attributes" widget and connect it to the "File"
widget.
 Open "Select Attributes" and set the class (comprehensive strength) and
attributes (predictors).
 Drag and drop "Regression Tree" widget and connect it to the "Select 
Attributes" widget.
 Drag and drop "Regression Tree Graph" widget and connect it to the
"Regression Tree" widget.
 Drag and drop "Test Learners" widget and connect it to the "Regression
Tree" and the "Select Attributes" widget.


Regression Tree Graph

Test Learners Result


3.2.2 Multiple Linear Regression

Multiple linear regression (MLR) is a method used to model the linear


relationship between a dependent variable (target) and one or more
independent variables (predictors).

MLR is based on ordinary least squares (OLS): the model is fit such that the
sum of squares of the differences between observed and predicted values is
minimized.

The MLR model is based on several assumptions (e.g., errors are normally
distributed with zero mean and constant variance). Provided the assumptions
are satisfied, the regression estimators are optimal in the sense that they
are unbiased, efficient, and consistent. Unbiased means that the expected
value of the estimator is equal to the true value of the parameter. Efficient
means that the estimator has a smaller variance than any other estimator.
Consistent means that the bias and variance of the estimator approach zero
as the sample size approaches infinity.

How good is the model?

R2, also called the coefficient of determination, summarizes the explanatory power
of the regression model and is computed from the sums-of-squares terms.


R2 describes the proportion of variance of the dependent variable


explained by the regression model. If the regression model is
“perfect”, SSE is zero, and R2 is 1. If the regression model is a total
failure, SSE is equal to SST, no variance is explained by regression,
and R2 is zero. It is important to keep in mind that there is no direct
relationship between high R2 and causation.

How significant is the model?

F-ratio estimates the statistical significance of the regression model and is


computed from the mean squared terms in the ANOVA table. The significance of
the F-ratio is obtained by referring to the F distribution table using two degrees
of freedom (dfMSR, dfMSE). p is the number of independent variables (e.g., p is one
for the simple linear regression).

The advantage of the F-ratio over R2 is that the F-ratio incorporates sample
size and number of predictors in assessment of significance of the
regression model. A model can have a high R2 and still not be statistically
significant.

How significant are the coefficients?

If the regression model is significant, we can use a t-test to estimate
the statistical significance of each coefficient.


Example

MLR.xls

X1 X2 X3 Y
2 5 1 2
2 4 2 1
1 5 4 1
1 3 4 1
3 6 5 5
4 4 6 4
5 6 3 7
5 4 3 6
7 3 7 7
6 3 7 8
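
As a minimal sketch (not part of the original module), the MLR.xls data above can be fit by ordinary least squares with NumPy. The code adds an intercept column, solves for the coefficients and reports R2 from the SSE and SST terms discussed earlier.

    import numpy as np

    # Data from MLR.xls above
    X = np.array([[2,5,1],[2,4,2],[1,5,4],[1,3,4],[3,6,5],
                  [4,4,6],[5,6,3],[5,4,3],[7,3,7],[6,3,7]], dtype=float)
    y = np.array([2,1,1,1,5,4,7,6,7,8], dtype=float)

    # Add an intercept column and solve the least-squares problem
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)

    y_hat = A @ coef
    sse = np.sum((y - y_hat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    r2 = 1 - sse / sst

    print("intercept and coefficients:", np.round(coef, 3))
    print("R2:", round(r2, 3))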

Multicollinearity

A high degree of multicollinearity between predictors produces unreliable
regression coefficient estimates. Signs of multicollinearity include:

1. High correlation between pairs of predictor variables.


2. Regression coefficients whose signs or magnitudes do not make good
physical sense.
3. Statistically nonsignificant regression coefficients on important
predictors.
4. Extreme sensitivity of sign or magnitude of regression coefficients to
insertion or deletion of a predictor.

The diagonal values of the (X'X)-1 matrix are called Variance Inflation Factors (VIFs), and
they are very useful measures of multicollinearity. If any VIF exceeds 5,
multicollinearity is a problem.
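
A small hedged sketch of the VIF check, assuming the common formulation in which the VIFs are read off the diagonal of the inverse of the predictors' correlation matrix (equivalent to (X'X)-1 after standardization); it reuses the predictor values from the MLR.xls example.

    import numpy as np

    # Predictors from MLR.xls
    X = np.array([[2,5,1],[2,4,2],[1,5,4],[1,3,4],[3,6,5],
                  [4,4,6],[5,6,3],[5,4,3],[7,3,7],[6,3,7]], dtype=float)

    R = np.corrcoef(X, rowvar=False)     # correlation matrix of the predictors
    vif = np.diag(np.linalg.inv(R))      # VIF_j = j-th diagonal element of the inverse
    print(dict(zip(["X1", "X2", "X3"], np.round(vif, 2))))
    # A VIF above 5 flags a predictor involved in strong multicollinearity.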


Model Selection

A frequent task in data mining is to drop predictors that do not contribute
significantly to model prediction. First, it has been shown that dropping
predictors with insignificant coefficients can reduce the average error of
predictions. Second, estimates of regression coefficients are likely to be
unstable due to multicollinearity in models with many variables. Finally, a
simpler model is a better model, offering more insight into the influence of
the predictors. There are two main methods of model selection:

 Forward selection: the best predictors are entered into the model, one by one.
 Backward elimination: the worst predictors are eliminated from the model, one by one.

MLR - Exercise
1. Open "Weka".
2. Click on "Open file ..." and load the dataset (mlr.csv).
3. Click on the "Classify" tab and choose "Linear Regression".
4. Set the parameters before running the model.
5. Select "Y" as the target from the list box and click on "Start".
6. Check the "Classifier output" pane for the result.


Set the parameters before running the model.

3.2.3 K Nearest Neighbors - Regression


K nearest neighbors is a simple algorithm that stores all available cases and
predicts the numerical target based on a similarity measure (e.g., distance
functions). KNN has been used in statistical estimation and pattern
recognition since the early 1970s as a non-parametric technique.


Algorithm

A simple implementation of KNN regression is to calculate the average of the


numerical target of the K nearest neighbors. Another approach uses an
inverse distance weighted average of the K nearest neighbors. KNN
regression uses the same distance functions as KNN classification.

The three distance measures above (e.g., Euclidean, Manhattan and Minkowski) are
only valid for continuous variables. In the case of categorical variables, you must use the
Hamming distance, which measures the number of positions at
which the corresponding symbols differ in two strings of equal
length.

Choosing the optimal value for K is best done by first inspecting the data. In
general, a large K value is more precise as it reduces the overall noise; however,
the compromise is that the distinct boundaries within the feature space are
blurred. Cross-validation is another way to retrospectively determine a good K
value by using an independent data set to validate your K value. The optimal K
for most datasets is 10 or more. That produces much better results than 1-NN.


Example:

Consider the following data concerning House Price Index or HPI.


Age and Loan are two numerical variables (predictors) and HPI is the
numerical target.

We can now use the training set to predict the target of an unknown case (Age=33
and Loan=$150,000) using Euclidean distance. If K=1, the nearest
neighbor is the last case in the training set, with HPI=264.

D = Sqrt[(48-33)^2 + (142000-150000)^2] = 8000.01 >> HPI = 264
By having K=3, the prediction for HPI is equal to the average of HPI
for the top three neighbors.

HPI = (264+139+139)/3 = 180.7
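
The same kind of prediction can be sketched with scikit-learn's KNeighborsRegressor. The Age/Loan/HPI rows below are hypothetical stand-ins rather than the full training table from the example, and both the simple average (uniform weights) and the inverse-distance-weighted variant are shown.

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor

    # Hypothetical Age/Loan/HPI rows, for illustration only
    X_train = np.array([[25,  40000], [35,  60000], [45, 100000],
                        [20,  20000], [48, 142000], [60, 100000]])
    y_train = np.array([135, 256, 231, 139, 264, 139])

    query = np.array([[33, 150000]])

    for weights in ("uniform", "distance"):
        knn = KNeighborsRegressor(n_neighbors=3, weights=weights)
        knn.fit(X_train, y_train)
        print(weights, knn.predict(query))
    # Because Loan is on a much larger scale than Age, it dominates the distance,
    # which is the motivation for the standardization discussed next.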

Standardized Distance
One major drawback in calculating distance measures directly from
the training set is in the case where variables have different
measurement scales or there is a mixture of numerical and
categorical variables. For example, if one variable is based on annual
income in dollars, and the other is based on age in years then income
will have a much higher influence on the distance calculated. One
solution is to standardize the training set as shown below.


As mentioned in KNN Classification, using the standardized distance
on the same training set, the unknown case returned a different
neighbor, which is not a good sign of robustness.

K Nearest Neighbors Regression - Exercise


 Open "Orange".
 Drag and drop "File" widget and double click to load a dataset
(slump.txt).
 Drag and drop "Select Attributes" widget and connect it to the "File"
widget.
 Open "Select Attributes" and set the class (comprehensive strength)
and attributes (predictors).
 Drag and drop "k Nearest Neighbors Regression" widget and connect it
to the "Select Attributes" widget.
 Drag and drop "Test Learners" widget and connect it to the "k Nearest
Neighbors Regression" widget and the "Select Attributes" widget.


Evaluation Results

3.2.4 Artificial Neural Network


An artificial neural network (ANN) is a system that is based on a biological neural
network, such as the brain. The brain has approximately 100 billion neurons, which
communicate through electro-chemical signals. The neurons are connected
through junctions called synapses. Each neuron receives thousands of connections
from other neurons, constantly receiving incoming signals to reach the cell body. If
the resulting sum of the signals surpasses a certain threshold, a response is sent
through the axon. The ANN attempts to recreate a computational mirror of the
biological neural network, although it is not comparable since the number and
complexity of the neurons used in a biological neural network is many times
greater than in an artificial neural network.

An ANN is composed of a network of artificial neurons (also known as "nodes").
These nodes are connected to each other, and each connection is assigned a value
representing its strength: inhibition (maximum -1.0) or excitation (maximum +1.0).
A high connection value indicates a strong connection. A transfer function is built
into each node's design. There are three types of neurons in an ANN: input
nodes, hidden nodes, and output nodes.


The input nodes take in information, in the form which can be numerically
expressed. The information is presented as activation values, where each node is
given a number, the higher the number, the greater the activation. This
information is then passed throughout the network. Based on the connection
strengths (weights), inhibition or excitation, and transfer functions, the activation
value is passed from node to node. Each of the nodes sums the activation values it
receives; it then modifies the value based on its transfer function. The activation
flows through the network, through hidden layers, until it reaches the output
nodes. The output nodes then reflect the input in a meaningful way to the outside
world. The difference between predicted value and actual value (error) will be
propagated backward by apportioning them to each node's weights according to
the amount of this error the node is responsible for (e.g., gradient descent
algorithm).

Transfer (Activation) Functions


The transfer function translates the input signals to output signals. Four types of
transfer functions are commonly used, Unit step (threshold), sigmoid, piecewise
linear, and Gaussian.


Unit step (threshold)

The output is set at one of two levels, depending on whether the total
input is greater than or less than some threshold value.

Sigmoid
The sigmoid function comes in two forms, logistic and tangential. The
values of the logistic function range from 0 to 1, while the tangential
function ranges from -1 to +1.

Piecewise Linear
The output is proportional to the total weighted output.


Gaussian
Gaussian functions are bell-shaped curves that are continuous. The node
output (high/low) is interpreted in terms of class membership (1/0),
depending on how close the net input is to a chosen value of average.

Linear
Like a linear regression, a linear activation function transforms the
weighted sum inputs of the neuron to an output using a linear function.
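
A short sketch of these transfer functions in NumPy may make the shapes concrete; the threshold, limits, center and width values are illustrative defaults, not prescribed ones.

    import numpy as np

    def unit_step(x, threshold=0.0):
        return np.where(x >= threshold, 1.0, 0.0)

    def logistic(x):                 # sigmoid, output in (0, 1)
        return 1.0 / (1.0 + np.exp(-x))

    def tangential(x):               # tanh, output in (-1, 1)
        return np.tanh(x)

    def piecewise_linear(x, lo=-1.0, hi=1.0):
        return np.clip(x, lo, hi)    # proportional to the input, clipped at the limits

    def gaussian(x, c=0.0, r=1.0):   # bell-shaped, centered at c with width r
        return np.exp(-((x - c) ** 2) / (2 * r ** 2))

    def linear(x):
        return x

    x = np.linspace(-3, 3, 7)
    print(logistic(x))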

Algorithm

There are different types of neural networks, but they are generally classified
into feed-forward and feed-back networks.

A feed-forward network is a non-recurrent network which contains inputs,


outputs, and hidden layers; the signals can only travel in one direction. Input data
is passed onto a layer of processing elements where it performs calculations.
Each processing element makes its computation based upon a weighted sum of
its inputs. The new calculated values then become the new input values that feed
the next layer. This process continues until it has gone through all the layers and
determines the output. A threshold transfer function is sometimes used to
quantify the output of a neuron in the output layer. Feed-forward networks
include Perceptron (linear and non-linear) and Radial Basis Function networks.
Feed-forward networks are often used in data mining.


A feed-back network (e.g., recurrent neural network or RNN) has feed-back


paths meaning they can have signals traveling in both directions using loops. All
possible connections between neurons are allowed. Since loops are present in
this type of network, it becomes a non-linear dynamic system which changes
continuously until it reaches a state of equilibrium. Feed-back networks are often
used in associative memories and optimization problems where the network
looks for the best arrangement of interconnected factors.

Artificial Neural Network – Perceptron

A single layer perceptron (SLP) is a feed-forward network based on a threshold


transfer function. SLP is the simplest type of artificial neural network and can only
classify linearly separable cases with a binary target (1, 0).

Algorithm
The single layer perceptron does not have a priori knowledge, so the initial
weights are assigned randomly. SLP sums all the weighted inputs and if the sum
is above the threshold (some predetermined value), SLP is said to be activated
(output=1).

The input values are presented to the perceptron, and if the predicted output is
the same as the desired output, then the performance is considered satisfactory
and no changes to the weights are made. However, if the output does not match
the desired output, then the weights need to be changed to reduce the error.


Because SLP is a linear classifier, if the cases are not linearly separable the
learning process will never reach a point where all the cases are classified
properly. The most famous example of the perceptron's inability to solve
problems with linearly non-separable cases is the XOR problem.

However, a multi-layer perceptron using the backpropagation algorithm can


successfully classify the XOR data.

Multi-layer Perceptron - Backpropagation algorithm


A multi-layer perceptron (MLP) has the same structure as a single layer
perceptron with one or more hidden layers. The backpropagation algorithm
consists of two phases: the forward phase, where the activations are propagated
from the input to the output layer, and the backward phase, where the error
between the observed actual value and the desired nominal value in the output
layer is propagated backwards in order to modify the weights and bias values.

Forward propagation:
Propagate inputs by adding all the weighted inputs and then computing outputs
using sigmoid threshold.

Backward propagation:

Propagates the errors backward by apportioning them to each unit according to


the amount of this error the unit is responsible for.


Example:
A Neural Network in 11 lines of Python
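
In the spirit of that example, the following is a minimal sketch (not the published code) of a two-layer network trained with backpropagation on the XOR data mentioned above; the hidden-layer size, learning rate of 1 and iteration count are illustrative choices.

    import numpy as np

    # XOR inputs; the third column of ones acts as a bias input
    X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
    y = np.array([[0], [1], [1], [0]])

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    rng = np.random.default_rng(0)
    W0 = 2 * rng.random((3, 4)) - 1   # input -> hidden weights
    W1 = 2 * rng.random((4, 1)) - 1   # hidden -> output weights

    for _ in range(60000):
        # Forward phase: propagate activations from input to output
        h = sigmoid(X @ W0)
        out = sigmoid(h @ W1)
        # Backward phase: apportion the error and adjust the weights
        out_delta = (y - out) * out * (1 - out)
        h_delta = (out_delta @ W1.T) * h * (1 - h)
        W1 += h.T @ out_delta
        W0 += X.T @ h_delta

    print(out.round(2))   # should be close to [[0], [1], [1], [0]]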

Exercise

Artificial Neural Network - Perceptron -


Exercise
1. Open "Weka".
2. Click on "Open file ..." and load the dataset (slump.csv).
3. Click on the "Classify" tab and choose "functions > MultilayerPeceptron".
4. Set the parameters before running the model.
5. Select "Compressive Strength" as the target from the list box and click on
"Start".
6. Check the output pane for the result.


You can change the model parameters before running the model.


Radial Basis Function Networks (RBF)

RBF networks have three layers: input layer, hidden layer and output layer. One
neuron in the input layer corresponds to each predictor variable. With respect to
categorical variables, n-1 neurons are used where n is the number of categories.
The hidden layer has a variable number of neurons. Each neuron consists of a radial
basis function centered on a point with the same dimensions as the predictor
variables. The output layer has a weighted sum of outputs from the hidden layer
to form the network outputs.

Algorithm
h(x) is the Gaussian activation function with the parameters r (the radius or
standard deviation) and c (the center or average taken from the input space)
defined separately at each RBF unit. The learning process is based on adjusting the
parameters of the network to reproduce a set of input-output patterns. There are
three types of parameters; the weight w between the hidden nodes and the
output nodes, the center c of each neuron of the hidden layer and the unit width r.
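
A small sketch of a single forward pass, assuming Gaussian units h(x) = exp(-||x - c||^2 / (2 r^2)) and a weighted-sum output layer; the centers, radii and weights below are illustrative values, not ones fitted to any dataset.

    import numpy as np

    def rbf_unit(x, c, r):
        """Gaussian activation of one hidden RBF unit with center c and width r."""
        return np.exp(-np.sum((x - c) ** 2) / (2 * r ** 2))

    # Illustrative centers, widths and output weights for a 3-unit hidden layer
    centers = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
    radii   = np.array([0.8, 0.8, 1.0])
    w       = np.array([0.4, -0.2, 0.7])   # hidden -> output weights

    x = np.array([1.2, 0.9])
    h = np.array([rbf_unit(x, c, r) for c, r in zip(centers, radii)])
    print(float(w @ h))                    # network output: weighted sum of unit activations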

Unit Center (c)

Any clustering algorithm can be used to determine the RBF unit centers (e.g., K-
means clustering). A set of clusters is determined, each with a center whose
dimension equals the number of input variables (nodes of the input layer). The cluster centers
become the centers of the RBF units. The number of clusters, H, is a design
parameter and determines the number of nodes in the hidden layer. The K-means
clustering algorithm proceeds as follows:

1. Initialize the center of each cluster to a different randomly selected


training pattern.
2. Assign each training pattern to the nearest cluster. This can be
accomplished by calculating the Euclidean distances between the
training patterns and the cluster centers.
3. When all training patterns are assigned, calculate the average position
for each cluster center. They then become new cluster centers.


4. Repeat steps 2 and 3, until the cluster centers do not change during the
subsequent iterations.

Unit width (r)

When the RBF centers have been established, the width of each RBF unit can be
calculated using the K-nearest neighbors algorithm. A number K is chosen, and for
each center, the K nearest centers are found. The root-mean-squared distance
between the current cluster center and its K nearest neighbors is calculated, and
this is the value chosen for the unit width (r). So, if the current cluster center is cj,
the r value is:

A typical value for K is 2, in which case r is set to be the average distance from the
two nearest neighboring cluster centers.

Weights (w)
Using the linear mapping, w vector is calculated using the output vector (y) and
the design matrix H.

In contrast to the multi-layer perceptron, there are no local minima in RBF.

Artificial Neural Network - RBF - Exercise

1. Open "Weka".
2. Click on "Open file ..." and load the dataset (credit_scoring.csv).
3. Click on the "Classify" tab and choose "functions > RBFNetwork".
4. Set the parameters before running the model.
5. Select "DEFAULT" as the target from the list box and click on "Start".
6. Check the "Classifier output" pane for the result.


You can change the model parameters before running the model.

3.2.5 Support Vector Machine - Regression (SVR)


Support Vector Machine can also be used as a regression method, maintaining all
the main features that characterize the algorithm (maximal margin). Support
Vector Regression (SVR) uses the same principles as the SVM for classification,
with only a few minor differences. First of all, because the output is a real number, it
becomes very difficult to predict the information at hand, which has infinite
possibilities. In the case of regression, a margin of tolerance (epsilon) is set as an
approximation to the margin already requested from the problem in the SVM case.
Besides this, the algorithm itself is more complicated and this has to be taken into
consideration. However, the main idea is always the same: to minimize error by
individualizing the hyperplane which maximizes the margin, keeping in mind that
part of the error is tolerated.
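
As a hedged sketch of the idea (using scikit-learn's SVR rather than Weka's SMOreg), the code below fits an epsilon-insensitive regression with an RBF kernel to a small synthetic dataset; C, epsilon and the data itself are illustrative.

    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(1)
    X = rng.uniform(0, 5, size=(40, 1))              # toy predictor
    y = np.sin(X).ravel() + rng.normal(0, 0.1, 40)   # toy numeric target

    # epsilon sets the margin of tolerance; the RBF kernel handles the non-linear case
    model = SVR(kernel="rbf", C=10.0, epsilon=0.1)
    model.fit(X, y)
    print(model.predict([[1.5], [3.0]]))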


Linear SVR

Non-linear SVR

The kernel functions transform the data into a higher dimensional feature
space to make it possible to perform the linear separation.


Kernel functions

Support Vector Regression - Exercise


1. Open "Weka".
2. Click on "Open file ..." and load the dataset (slump.csv).
3. Click on the "Classify" tab and choose "functions > SMOreg".
4. Set the parameters before running the model.
5. Select "Compressive Strength" as the target from the list box and click
on "Start".
6. Check the output pane for the result.


You can change the model parameters before running the model.

Activity # 2:
Graphically present a visualization on regression using your dataset
utilizing different algorithms of regression. Compare and interpret the
performance of each algorithm.


3.3 Clustering
A cluster is a subset of data which are similar. Clustering (also called unsupervised
learning) is the process of dividing a dataset into groups such that the members
of each group are as similar (close) as possible to one another, and different
groups are as dissimilar (far) as possible from one another. Clustering can uncover
previously undetected relationships in a dataset. There are many applications for
cluster analysis. For example, in business, cluster analysis can be used to discover
and characterize customer segments for marketing purposes and in biology, it can
be used for classification of plants and animals given their features.

Two main groups of clustering algorithms are:

1. Hierarchical
o Agglomerative
o Divisive
2. Partitive
o K Means
o Self-Organizing Map

A good clustering method requirements are:

 The ability to discover some or all of the hidden clusters.


 Within-cluster similarity and between-cluster dissimilarity.
 Ability to deal with various types of attributes.
 Can deal with noise and outliers.
 Can handle high dimensionality.
 Scalable, interpretable and usable.


An important issue in clustering is how to determine the similarity between


two objects, so that clusters can be formed from objects with high similarity
within clusters and low similarity between clusters. Commonly, to measure
similarity or dissimilarity between objects, a distance measure such as
Euclidean, Manhattan and Minkowski is used. A distance function returns a
lower value for pairs of objects that are more similar to one another.
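
The three distance functions named above can be sketched in a few lines of Python; p = 3 for Minkowski is just an illustrative choice.

    import numpy as np

    def euclidean(a, b):
        return np.sqrt(np.sum((a - b) ** 2))

    def manhattan(a, b):
        return np.sum(np.abs(a - b))

    def minkowski(a, b, p=3):
        return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

    a, b = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])
    print(euclidean(a, b), manhattan(a, b), minkowski(a, b))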

3.3.1 Hierarchical Clustering


Hierarchical clustering involves creating clusters that have a predetermined
ordering from top to bottom. For example, all files and folders on the hard disk are
organized in a hierarchy. There are two types of hierarchical
clustering, Divisive and Agglomerative.


Divisive method
In the divisive or top-down clustering method, we assign all of the observations
to a single cluster and then partition the cluster into the two least similar
clusters. Finally, we proceed recursively on each cluster until there is one
cluster for each observation. There is evidence that divisive algorithms
produce more accurate hierarchies than agglomerative algorithms in
some circumstances, but they are conceptually more complex.

Agglomerative method
In the agglomerative or bottom-up clustering method, we assign each
observation to its own cluster. Then, we compute the similarity (e.g., distance)
between each of the clusters and join the two most similar clusters.
Finally, we repeat the previous two steps until there is only a single cluster left. The
related algorithm is shown below.

Before any clustering is performed, it is required to determine the proximity


matrix containing the distance between each point using a distance function.
Then, the matrix is updated to display the distance between each cluster. The
following three methods differ in how the distance between each cluster is
measured.

Single Linkage
In single linkage hierarchical clustering, the distance between two
clusters is defined as the shortest distance between two points in each
cluster. For example, the distance between clusters “r” and “s” to the left
is equal to the length of the arrow between their two closest points.


Complete Linkage
In complete linkage hierarchical clustering, the distance between two
clusters is defined as the longest distance between two points in each
cluster. For example, the distance between clusters “r” and “s” to the left
is equal to the length of the arrow between their two furthest points.

Average Linkage

In average linkage hierarchical clustering, the distance between two
clusters is defined as the average distance between each point in one
cluster and every point in the other cluster. For example, the distance
between clusters “r” and “s” to the left is equal to the average length
of the arrows connecting the points of one cluster to the other.
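
As a hedged illustration, SciPy's hierarchical clustering exposes these three linkage criteria directly; the two-group toy data below is made up, and cutting the tree at two clusters is an arbitrary choice for display.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.3, (5, 2)), rng.normal(3, 0.3, (5, 2))])  # two toy groups

    for method in ("single", "complete", "average"):
        Z = linkage(X, method=method)          # agglomerative merge tree
        labels = fcluster(Z, t=2, criterion="maxclust")
        print(method, labels)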


Hierarchical Clustering - Exercise

 Open "Orange".
 Drag and drop "File" widget and double click to load a dataset (iris.txt).
 Drag and drop "Select Attributes" widget and connect it to the "File"
widget.
 Open "Select Attributes" and set the class (target) and attributes
(predictors).
 Drag and drop "Example Distance" widget and connect it to the "Select
Attributes" widget.
 Drag and drop "Hierarchical Clustering" widget and connect it to the
"Example Distance" widget.

Double click on the "Hierarchical Clustering" widget to view the clusters.


3.3.2 Partitive Clustering

3.3.2.1 K-Means Clustering


K-Means clustering intends to partition n objects into k clusters in which each
object belongs to the cluster with the nearest mean. This method produces
exactly k different clusters of greatest possible distinction. The best number of
clusters k leading to the greatest separation (distance) is not known a priori and
must be computed from the data. The objective of K-Means clustering is to
minimize the total intra-cluster variance, or the squared error function:

Algorithm

1. Cluster the data into k groups where k is predefined.
2. Select k points at random as cluster centers.
3. Assign objects to their closest cluster center according to the Euclidean
distance function.
4. Calculate the centroid or mean of all objects in each cluster.
5. Repeat steps 3 and 4 until the same points are assigned to each cluster
in consecutive rounds.

K-Means is a relatively efficient method. However, we need to specify the
number of clusters in advance, and the final results are sensitive to initialization
and often terminate at a local optimum. Unfortunately, there is no global
theoretical method to find the optimal number of clusters. A practical approach is
to compare the outcomes of multiple runs with different k and choose the best


one based on a predefined criterion. In general, a large k probably decreases the


error but increases the risk of overfitting.

Example:

Suppose we want to group the visitors to a website using just their age (one-
dimensional space) as follows:

n = 19

15,15,16,19,19,20,20,21,22,28,35,40,41,42,43,44,60,61,65
Initial clusters (random centroid or average):

k=2
c1 = 16
c2 = 22

Iteration 1:

c1 = 15.33
c2 = 36.25

xi   c1   c2   Distance 1   Distance 2   Nearest Cluster   New Centroid
15 16 22 1 7 1
15 16 22 1 7 1 15.33
16 16 22 0 6 1
19 16 22 9 3 2
19 16 22 9 3 2
20 16 22 16 2 2
20 16 22 16 2 2
21 16 22 25 1 2 36.25
22 16 22 36 0 2
28 16 22 12 6 2
35 16 22 19 13 2
40 16 22 24 18 2


41 16 22 25 19 2
42 16 22 26 20 2
43 16 22 27 21 2
44 16 22 28 22 2
60 16 22 44 38 2
61 16 22 45 39 2
65 16 22 49 43 2

Iteration 2:

c1 = 18.56
c2 = 45.90

xi   c1   c2   Distance 1   Distance 2   Nearest Cluster   New Centroid
15 15.33 36.25 0.33 21.25 1
15 15.33 36.25 0.33 21.25 1
16 15.33 36.25 0.67 20.25 1
19 15.33 36.25 3.67 17.25 1
19 15.33 36.25 3.67 17.25 1 18.56
20 15.33 36.25 4.67 16.25 1
20 15.33 36.25 4.67 16.25 1
21 15.33 36.25 5.67 15.25 1
22 15.33 36.25 6.67 14.25 1
28 15.33 36.25 12.67 8.25 2
35 15.33 36.25 19.67 1.25 2
40 15.33 36.25 24.67 3.75 2
41 15.33 36.25 25.67 4.75 2
42 15.33 36.25 26.67 5.75 2
45.9
43 15.33 36.25 27.67 6.75 2
44 15.33 36.25 28.67 7.75 2
60 15.33 36.25 44.67 23.75 2
61 15.33 36.25 45.67 24.75 2
65 15.33 36.25 49.67 28.75 2


Iteration 3:

c1 = 19.50
c2 = 47.89

xi   c1   c2   Distance 1   Distance 2   Nearest Cluster   New Centroid
15 18.56 45.9 3.56 30.9 1
15 18.56 45.9 3.56 30.9 1
16 18.56 45.9 2.56 29.9 1
19 18.56 45.9 0.44 26.9 1
19 18.56 45.9 0.44 26.9 1
19.50
20 18.56 45.9 1.44 25.9 1
20 18.56 45.9 1.44 25.9 1
21 18.56 45.9 2.44 24.9 1
22 18.56 45.9 3.44 23.9 1
28 18.56 45.9 9.44 17.9 1
35 18.56 45.9 16.44 10.9 2
40 18.56 45.9 21.44 5.9 2
41 18.56 45.9 22.44 4.9 2
42 18.56 45.9 23.44 3.9 2
43 18.56 45.9 24.44 2.9 2 47.89
44 18.56 45.9 25.44 1.9 2
60 18.56 45.9 41.44 14.1 2
61 18.56 45.9 42.44 15.1 2
65 18.56 45.9 46.44 19.1 2

Iteration 4:

c1 = 19.50
c2 = 47.89

xi   c1   c2   Distance 1   Distance 2   Nearest Cluster   New Centroid
15 19.5 47.89 4.50 32.89 1
15 19.5 47.89 4.50 32.89 1
16 19.5 47.89 3.50 31.89 1 19.50
19 19.5 47.89 0.50 28.89 1
19 19.5 47.89 0.50 28.89 1


20 19.5 47.89 0.50 27.89 1
20 19.5 47.89 0.50 27.89 1
21 19.5 47.89 1.50 26.89 1
22 19.5 47.89 2.50 25.89 1
28 19.5 47.89 8.50 19.89 1
35 19.5 47.89 15.50 12.89 2
40 19.5 47.89 20.50 7.89 2
41 19.5 47.89 21.50 6.89 2
42 19.5 47.89 22.50 5.89 2
43 19.5 47.89 23.50 4.89 2 47.89
44 19.5 47.89 24.50 3.89 2
60 19.5 47.89 40.50 12.11 2
61 19.5 47.89 41.50 13.11 2
65 19.5 47.89 45.50 17.11 2

No change between iterations 3 and 4 has been noted. By using clustering, two
groups have been identified: 15–28 and 35–65. The initial choice of centroids can
affect the output clusters, so the algorithm is often run multiple times with
different starting conditions in order to get a fair view of what the clusters should
be.
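
The worked example above can be reproduced with a short sketch of 1-D K-Means; it starts from the same initial centroids (16 and 22) and stops when the assignments no longer change.

    import statistics

    ages = [15,15,16,19,19,20,20,21,22,28,35,40,41,42,43,44,60,61,65]
    centroids = [16.0, 22.0]                      # same initial centers as above

    while True:
        clusters = [[], []]
        for x in ages:
            nearest = min(range(2), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        new_centroids = [statistics.mean(c) for c in clusters]
        if new_centroids == centroids:            # no change -> converged
            break
        centroids = new_centroids

    print(centroids)   # roughly [19.5, 47.89]
    print(clusters)    # ages 15-28 vs. 35-65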

K Means Clustering - Exercise

 Open "Orange".
 Drag and drop "File" widget and double click to load a dataset
(credit_scoring.txt).
 Drag and drop "Select Attributes" widget and connect it to the "File"
widget.
 Open "Select Attributes" and set the class (target) and attributes
(predictors).
 Drag and drop "k-Means Clustering" widget and connect it to the "Select
Attributes" widget.
 Drag and drop "Scatterplot" widget and connect it to the "k-Means
Clustering" widget to view the plot with the class label.
 Drag and drop "Data Tables" widget and connect it to the "k-Means
Clustering" widget to view the dataset with a new cluster column.


Scatter Plot


Dataset with Clusters

3.3.2.2 Self Organizing Map


A self-organizing map (SOM) is used for visualization and analysis of high-
dimensional datasets. SOM facilitates the presentation of high-dimensional datasets
in lower dimensional ones, usually 1-D, 2-D and 3-D. It is an unsupervised
learning algorithm and does not require a target vector since it learns to
classify data without supervision. A SOM is formed from a grid of nodes or units
to which the input data are presented. Every node is connected to the input,
and there is no connection between the nodes. SOM is a topology-preserving
technique and keeps the neighborhood relations in its mapping presentation.


Algorithm

1- Initialization of each node’s weights with a random number between 0


and 1

2- Choosing a random input vector from training dataset

3- Calculating the Best Matching Unit (BMU). Each node is examined to find
the one which its weights are most similar to the input vector. This unit is
known as the Best Matching Unit (BMU) since its vector is most similar to the
input vector. This selection is done by Euclidean distance formula, which is a
measure of similarity between two datasets. The distance between the input
vector and the weights of node is calculated in order to find the BMU.

4- Calculating the size of the neighborhood around the BMU. The size of the
neighborhood around the BMU is decreasing with an exponential decay
function. It shrinks on each iteration until reaching just the BMU.


5- Modification of nodes’ weights of the BMU and neighboring nodes, so that


their weight gets more similar to the weight of input vector. The weight of every
node within the neighborhood is adjusted, having greater change for neighbors
closer to the BMU.

The decay of learning rate is calculated for each iteration.

As training goes on, the neighborhood gradually shrinks. At the end of training,
the neighborhoods have shrunk to zero size.


The influence rate shows the amount of influence a node's distance from the BMU
has on its learning. In the simplest form, the influence rate is equal to 1 for all the
nodes close to the BMU and zero for others, but a Gaussian function is common
too. Finally, from a random distribution of weights and through many iterations,
SOM is able to arrive at a map of stable zones. At the end, interpretation of the data
has to be done by a human, but SOM is a great technique to reveal the invisible
patterns in the data.
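
A minimal sketch of the training loop described above: random initial weights, a randomly chosen input per iteration, BMU search by Euclidean distance, and an exponentially shrinking Gaussian neighborhood and learning rate. The grid size, decay constants and data are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    grid_w, grid_h, dim = 5, 5, 3
    weights = rng.random((grid_w, grid_h, dim))          # step 1: random weights in [0, 1]
    coords = np.dstack(np.meshgrid(np.arange(grid_w), np.arange(grid_h), indexing="ij"))

    data = rng.random((100, dim))
    n_iter, sigma0, lr0 = 200, 2.0, 0.5

    for t in range(n_iter):
        x = data[rng.integers(len(data))]                # step 2: random input vector
        dists = np.linalg.norm(weights - x, axis=2)
        bmu = np.unravel_index(np.argmin(dists), dists.shape)   # step 3: best matching unit
        sigma = sigma0 * np.exp(-t / n_iter)             # step 4: shrinking neighborhood
        lr = lr0 * np.exp(-t / n_iter)                   # decaying learning rate
        grid_dist2 = np.sum((coords - np.array(bmu)) ** 2, axis=2)
        influence = np.exp(-grid_dist2 / (2 * sigma ** 2))      # Gaussian influence rate
        weights += lr * influence[..., None] * (x - weights)    # step 5: move weights toward x

    print(weights.shape)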

Self Organizing Map - Exercise

 Open "Orange".
 Drag and drop "File" widget and double click to load a dataset
(credit_scoring.txt).
 Drag and drop "Select Attributes" widget and connect it to the "File"
widget.
 Open "Select Attributes" and set the class (target) and attributes
( predictors).
 Drag and drop "SOM" widget and connect it to the "Select Attributes"
widget.
 Drag and drop "SOM Visualizer" widget and connect it to the "SOM"
widget to view the map.


SOM Map

Activity # 3:
Using your dataset utilize various techniques in clustering to discover
interesting patterns. What are the techniques to measure the quality of
clusters formed after undergoing clustering techniques?


3.4 Association Rules


Association rule mining finds all sets of items (itemsets) that have support greater
than the minimum support and then uses the large itemsets to generate the
desired rules that have confidence greater than the minimum confidence.
The lift of a rule X ⇒ Y is the ratio of the observed support to that expected if X and
Y were independent. A typical and widely used example of an association rules
application is market basket analysis.
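
As a small illustration, support, confidence and lift of a single rule can be computed directly from a list of transactions; the baskets below are made up and unrelated to the module's datasets.

    def rule_metrics(transactions, X, Y):
        """Support, confidence and lift of the rule X -> Y over a list of itemsets."""
        n = len(transactions)
        supp_X  = sum(X <= t for t in transactions) / n
        supp_Y  = sum(Y <= t for t in transactions) / n
        supp_XY = sum((X | Y) <= t for t in transactions) / n
        confidence = supp_XY / supp_X
        lift = confidence / supp_Y
        return supp_XY, confidence, lift

    baskets = [{"milk", "bread"}, {"milk", "diapers", "beer"},
               {"bread", "diapers"}, {"milk", "bread", "diapers"}]
    print(rule_metrics(baskets, {"milk"}, {"bread"}))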

Example:

AIS Algorithm

1. Candidate itemsets are generated and counted on-the-fly as the database is scanned.
2. For each transaction, it is determined which of the large itemsets of the previous pass are contained in this transaction.
3. New candidate itemsets are generated by extending these large itemsets with other items in this transaction.


The disadvantage of the AIS algorithm is that it results in


unnecessarily generating and counting too many candidate
itemsets that turn out to be small.

SETM Algorithm

1. Candidate itemsets are generated on-the-fly as the database is scanned, but counted at the end of the pass.
2. New candidate itemsets are generated the same way as in the AIS algorithm, but the TID of the generating transaction is saved with the candidate itemset in a sequential structure.
3. At the end of the pass, the support count of candidate itemsets is determined by aggregating this sequential structure.

The SETM algorithm has the same disadvantage of the AIS algorithm. Another
disadvantage is that for each candidate itemset, there are as many entries as
its support value.


Apriori Algorithm

1. Candidate itemsets are generated using only the large itemsets of the previous pass without considering the transactions in the database.
2. The large itemsets of the previous pass are joined with themselves to generate all itemsets whose size is larger by 1.
3. Each generated itemset that has a subset which is not large is deleted. The remaining itemsets are the candidate ones.

The Apriori algorithm takes advantage of the fact that any subset of a frequent
itemset is also a frequent itemset. The algorithm can therefore reduce the
number of candidates being considered by only exploring the itemsets whose
support count is greater than the minimum support count. Any itemset can be
pruned if it has an infrequent subset.
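
A hedged sketch of the Apriori idea (join the surviving k-itemsets, prune candidates with an infrequent subset, count support) follows; it is a teaching simplification rather than an optimized implementation, and the grocery baskets are illustrative.

    from itertools import combinations

    def apriori(transactions, min_support):
        """Return all frequent itemsets with support >= min_support (fraction of transactions)."""
        n = len(transactions)
        items = sorted({i for t in transactions for i in t})
        frequent, k_sets = {}, [frozenset([i]) for i in items]
        while k_sets:
            # Count candidates, keep those meeting the minimum support
            counts = {c: sum(c <= t for t in transactions) for c in k_sets}
            current = {c: v / n for c, v in counts.items() if v / n >= min_support}
            frequent.update(current)
            # Join step: build (k+1)-candidates from the surviving k-itemsets,
            # pruning any candidate with an infrequent subset
            prev = list(current)
            k_sets = []
            for a, b in combinations(prev, 2):
                cand = a | b
                if len(cand) == len(a) + 1 and all(frozenset(s) in current
                                                   for s in combinations(cand, len(a))):
                    if cand not in k_sets:
                        k_sets.append(cand)
        return frequent

    baskets = [{"bread", "milk"}, {"bread", "diapers", "beer"},
               {"milk", "diapers", "beer"}, {"bread", "milk", "diapers"},
               {"bread", "milk", "beer"}]
    print(apriori([frozenset(t) for t in baskets], min_support=0.6))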

AprioriTid Algorithm

1. The database is not used at all for counting the support of candidate itemsets after the first pass.
2. The candidate itemsets are generated the same way as in the Apriori algorithm.
3. Another set C’ is generated, of which each member has the TID of each transaction and the large itemsets present in this transaction. This set is used to count the support of each candidate itemset.


The advantage is that the number of entries in C’ may be smaller than the
number of transactions in the database, especially in the later passes.

AprioriHybrid Algorithm

Apriori does better than AprioriTid in the earlier passes. However, AprioriTid
does better than Apriori in the later passes. Hence, a hybrid algorithm can be
designed that uses Apriori in the initial passes and switches to AprioriTid when
it expects that the set C’ will fit in memory.

Association Rules - Exercise

 Open "Orange".
 Drag and drop "File" widget and double click to load a dataset
(contact_lenses.txt).
 Drag and drop "Association Rules" widget and connect it to the "File"
widget.
 Open "Association Rules" and set the support and confidence.
 Drag and drop "Association Rules Filter" widget and connect it to the
"Association Rules" widget to view the extracted rules.
 Drag and drop "Association Rules Explorer" widget and connect it to the
"Association Rules" widget to explore the extracted rules.


Association Rules Filter


Association Rules Explorer

https://www.kdnuggets.com/2016/04/association-rules-apriori-algorithm-tutorial.html

Activity # 4.
Using your dataset produce a simulated output for mining frequent
patterns using the association rule mining methods.


Self-Evaluation

Name: ______________________________________ Date: __________
Program/Yr/Section: __________________________ Score: _________
Try to answer the questions below to test your knowledge of this lesson.

1. The following table consists of training data from an employee database. The
data have been generalized. For example, “31 ... 35” for age represents the age
range of 31 to 35. For a given row entry, count represents the number of data
tuples having the values for department, status, age, and salary given in that row.

department status age salary count


sales senior 31. . . 35 46K. . . 50K 30
sales junior 26. . . 30 26K. . . 30K 40
sales junior 31. . . 35 31K. . . 35K 40
systems junior 21. . . 25 46K. . . 50K 20
systems senior 31. . . 35 66K. . . 70K 5
systems junior 26. . . 30 46K. . . 50K 3
systems senior 41. . . 45 66K. . . 70K 3
marketing senior 36. . . 40 46K. . . 50K 10
marketing junior 31. . . 35 41K. . . 45K 4
secretary senior 46. . . 50 36K. . . 40K 4
secretary junior 26. . . 30 26K. . . 30K 6

Let status be the class label attribute.


(a) How would you modify the basic decision tree algorithm to take into
consideration the count of each generalized data tuple (i.e., of each row entry)?
(b) Use your algorithm to construct a decision tree from the given data.
(c) Given a data tuple having the values “systems,” “26. . . 30,” and “46–50K” for
the attributes department, age, and salary, respectively, what would a naive
Bayesian classification of the status for the tuple be?
(d) Design a multilayer feed-forward neural network for the given data. Label the
nodes in the input and output layers.
(e) Using the multilayer feed-forward neural network obtained above, show the
weight values after one iteration of the backpropagation algorithm, given the
training instance “(sales, senior, 31. . . 35, 46K. . . 50K).” Indicate your initial weight
values and biases, and the learning rate used.

2. The following table shows the midterm and final exam grades obtained for
students in a database course.

Midterm exam (x)   Final exam (y)
72 84
50 63
81 77
74 78
94 90
86 75


59 49
83 79
65 77
33 52
88 74
81 90

(a) Plot the data. Do x and y seem to have a linear relationship?


(b) Use the method of least squares to find an equation for the prediction of a
student’s final exam grade based on the student’s midterm grade in the course.
(c) Predict the final exam grade of a student who received an 86 on the midterm
exam.

3. Suppose that the data mining task is to cluster the following eight points (with
(x, y) representing location) into three clusters:
A1(2, 10), A2(2, 5), A3(8, 4), B1(5, 8), B2(7, 5), B3(6, 4), C1(1, 2), C2(4, 9).

The distance function is Euclidean distance. Suppose initially we assign A1, B1, and
C1 as the center of each cluster, respectively. Use the k-means algorithm to show
only
(a) The three cluster centers after the first round execution
(b) The final three clusters

4. A database has five transactions. Let min sup = 60% and min con f = 80%.

TID items bought


T100 {M, O, N, K, E, Y}
T200 {D, O, N, K, E, Y }
T300 {M, A, K, E}
T400 {M, U, C, K, Y}
T500 {C, O, O, K, I ,E}

(a) Find all frequent itemsets using Apriori and FP-growth, respectively. Compare
the efficiency of the two mining processes.
(b) List all of the strong association rules (with support s and confidence c)
matching the following metarule, where X is a variable representing customers,
and itemi denotes variables representing items (e.g., “A”, “B”, etc.):

∀x ∈ transaction, buys(X, item1)∧buys(X, item2) ⇒ buys(X, item3) [s, c]

Review of Concepts

Classification is a data mining function that assigns items in a collection to target
categories or classes. The goal of classification is to accurately predict the
target class for each case in the data. In machine learning, classification refers to
a predictive modeling problem where a class label is predicted for a
given example of input data. Classification constructs the classification model by
using the training data set.


Regression is a data mining technique used to predict a range of numeric values


(also called continuous values), given a particular dataset. A regression model is
tested by applying it to test data with known target values and comparing the
predicted values with the known values. A regression task begins with a data set in
which the target values are known. In the model build (training) process, a
regression algorithm estimates the value of the target as a function of the
predictors for each case in the build data. These relationships between predictors
and target are summarized in a model, which can then be applied to a different
data set in which the target values are unknown. Regression models are tested by
computing various statistics that measure the difference between the predicted
values and the expected values. The historical data for a regression project is
typically divided into two data sets: one for building the model, the other for
testing the model.

In clustering, a group of different data objects is classified as similar objects. One
group means a cluster of data. Data sets are divided into different groups in
cluster analysis, based on the similarity of the data. There are two
main clustering methods: hierarchical clustering, which builds a top-to-bottom
hierarchy of the data points to create clusters, and partitioning methods, which are
based on centroids, where each data point is assigned to a cluster based
on its proximity to the cluster centroid.

Association rule mining, at a basic level, involves the use of machine learning
models to analyze data for patterns, or co-occurrences, in a
database. Association rules are created by searching data for frequent if-then
patterns and using the criteria support and confidence to identify the most
important relationships.

References

Han, J., Kamber, M. and Pei, J. (2011). Data Mining: Concepts and Techniques, 3rd edition. Morgan Kaufmann.

Sayad, S. (2010-2021). An Introduction to Data Mining. http://www.saedsayad.com/data_mining

Dunham, M.H. (2003). Data Mining Introductory and Advanced Topics. Pearson
Education Inc. Upper Saddle River, New Jersey.

Data Mining Concepts. Oracle Database Online Documentation Library, 11g Release 2 (11.2). https://docs.oracle.com/cd/E11882_01/datamine.112/e16808/process.htm#DMCON002
