Unit IV and V

The document discusses clustering validation, outlining various criteria for evaluating clustering partitions, including external, internal, and relative indices. It details specific methods such as the silhouette index, within-groups sum of squares, and Jaccard measure, as well as clustering techniques like K-means, DBSCAN, and agglomerative hierarchical clustering. Additionally, it covers frequent itemset mining and association rules, emphasizing the importance of support thresholds and the concept of interestingness in patterns.


Unit-IV

Clustering Validation
To find good clustering partitions for a data set, regardless of the clustering algorithm used, the quality of the partitions must be evaluated. In contrast to the classification task, there is no clear definition of which clustering partition is best for a given data set. Several cluster validity criteria have been proposed, some automatic and others relying on expert input.
The automatic validation measures for clustering partition evaluation can be roughly divided into
three categories:
• External indices: external criteria use external information, such as class labels, if available, to define the quality of the clusters in a given partition. Two of the most common external measures are the corrected Rand index and the Jaccard measure.
• Internal indices: internal criteria look for compactness inside each cluster and/or separation between different clusters. Two of the most common internal measures are the silhouette index, which measures both compactness and separation, and the within-groups sum of squares, which measures only compactness.
• Relative indices: The relative criterion compares partitions found by two or more clustering
techniques or by different runs of the same technique.
Silhouette internal index
This evaluates both the compactness inside each cluster and the separation between clusters, measuring:
• how close to each other the objects inside a cluster are
• the separation of different clusters: how far the objects in each cluster are from the closest object in another cluster.
To do this, it applies the following equation to each object xi:
s(xi) = 1 − a(xi)/b(xi),   if a(xi) < b(xi)
s(xi) = 0,                 if a(xi) = b(xi)
s(xi) = b(xi)/a(xi) − 1,   if a(xi) > b(xi)     (5.5)
where:
• a(xi) is the average distance between xi and all other objects in its cluster
• b(xi) is the minimum average distance between xi and all other objects from each other cluster.
The average of all s(xi) values gives the silhouette measure of the partition.
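As an illustration, equation (5.5) can be translated directly into a minimal Python sketch. The NumPy data matrix X, the cluster labels and the use of Euclidean distance below are assumptions made only for this example.

import numpy as np

def silhouette(X, labels):
    # X: n x m matrix of instances; labels: cluster index of each instance
    n = len(X)
    s = np.zeros(n)
    for i in range(n):
        dists = np.linalg.norm(X - X[i], axis=1)            # distances from x_i to all objects
        same = (labels == labels[i])
        a = dists[same & (np.arange(n) != i)].mean()        # a(x_i): average distance inside its own cluster
        b = min(dists[labels == c].mean()                   # b(x_i): smallest average distance to another cluster
                for c in set(labels) if c != labels[i])
        if a < b:
            s[i] = 1 - a / b
        elif a == b:
            s[i] = 0.0
        else:
            s[i] = b / a - 1
    return s.mean()                                         # partition silhouette: average of all s(x_i)

# Hypothetical example: two small clusters in two dimensions
X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [5.2, 4.9]])
labels = np.array([0, 0, 1, 1])
print(silhouette(X, labels))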
Within-groups sum of squares
This is also an internal measure but only measures compactness. It sums the squared Euclidean
distance between each instance and the centroid of its cluster. From Equation (5.4) we know that
the squared Euclidean distance between two instances p and q with m attributes each is given by:
sed(p, q) = ∑_{k=1}^{m} (pk − qk)²
The within groups sum of squares is given by:
s = ∑_{i=1}^{K} ∑_{j=1}^{Ji} sed(pj, Ci)
where K is the number of clusters, Ji is the number of instances in cluster i, and Ci is the
centroid of cluster i.
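The measure can be computed with a few lines of Python. The sketch below is a minimal version, assuming the same hypothetical data matrix and labels as in the previous example and taking each centroid as the mean of its cluster.

import numpy as np

def within_groups_sum_of_squares(X, labels):
    total = 0.0
    for c in set(labels):
        members = X[labels == c]
        centroid = members.mean(axis=0)                # C_i: centroid of cluster i
        total += ((members - centroid) ** 2).sum()     # sum of sed(p_j, C_i) over the cluster
    return total

X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [5.2, 4.9]])
labels = np.array([0, 0, 1, 1])
print(within_groups_sum_of_squares(X, labels))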
Jaccard external measure
This is a variation of a similar measure used in classification tasks. It evaluates how uniform the
distribution of the objects in each cluster is with respect to the class label. It uses the following
equation:
J = M11 / (M01 + M10 + M11)     (5.8)
where:
• M01 is the number of objects in other clusters but with the same label
• M10 is the number of objects in the same cluster, but with different labels
• M00 is the number of objects in other clusters with different labels
• M11 is the number of objects in the same cluster with the same label.
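In practice these counts are often obtained over pairs of objects; the minimal sketch below follows that pair-counting convention, which is an assumption of this example, checking for every pair whether the two objects share a cluster and/or a class label.

from itertools import combinations

def jaccard_external(labels_pred, labels_true):
    # counts computed over pairs of objects (an assumption of this sketch)
    M11 = M10 = M01 = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_cluster = labels_pred[i] == labels_pred[j]
        same_class = labels_true[i] == labels_true[j]
        if same_cluster and same_class:
            M11 += 1
        elif same_cluster and not same_class:
            M10 += 1
        elif not same_cluster and same_class:
            M01 += 1
    return M11 / (M01 + M10 + M11)

# Hypothetical clustering of six labelled objects
labels_pred = [0, 0, 0, 1, 1, 1]
labels_true = ['a', 'a', 'b', 'b', 'b', 'b']
print(jaccard_external(labels_pred, labels_true))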
Clustering Techniques
Another criterion is the approach used to define what a cluster is, determining the elements to be included in the same cluster. According to this criterion, the main types of clusters are [20]:
• Separation-based: each object in the cluster is closer to every other object in the cluster than to
any object outside the cluster
• Prototype-based: each object in the cluster is closer to a prototype representing the cluster than
to a prototype representing any other cluster
• Graph-based: represents the data set by a graph structure associating each node with an object
and connecting objects that belong to the same cluster with an edge
• Density-based: a cluster is a region where the objects have a high number of close neighbors
(i.e. a dense region), surrounded by a region of low density
• Shared-property: a cluster is a group of objects that share a property.
Methods
• K-means: the most popular clustering algorithm and a representative of partitional and prototype-based clustering methods
• DBSCAN: another partitional clustering method, but in this case density-based
• Agglomerative hierarchical clustering: a representative of hierarchical and graph-based
clustering methods
K-means
Centroids are a key concept in order to understand k-means. They represent a kind of centre of
gravity for a set of instances. We start by describing the concept before explaining how k-means
works, how to read the results, and how to set the hyper-parameters.
Centroids and Distance Measures
A centroid can also be seen as a prototype or profile of all the objects in a cluster, for example the average of all the objects in the cluster. Thus, if we have several photos of cats and dogs and we put all the dogs in one cluster and all the cats in another, the centroid of the dog cluster would be a photo representing the average features of all dog photos. We can observe, therefore, that the centroid of a cluster is not usually one of the objects in the cluster.
Example 5.7
The centroid for the friends Bernhard, Gwyneth and James has the average age and education of
the three friends: 41 years (the average of 43, 38 and 42) and 3.4 (the average of 2.0, 4.2 and 4.1)
as the education level. As you can see none of the three friends has this age and education level.
In order to have an object of the cluster as a prototype, a medoid is used instead of a centroid.
The medoid of a cluster is the instance with the shortest sum of distances to the other instances of
the cluster.
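Using the numbers from Example 5.7, the sketch below computes both the centroid and the medoid of the cluster formed by the three friends; the use of Euclidean distance for the medoid is an assumption of this example.

import numpy as np

# (age, education level) for Bernhard, Gwyneth and James, as in Example 5.7
friends = np.array([[43, 2.0], [38, 4.2], [42, 4.1]])

centroid = friends.mean(axis=0)                 # about (41, 3.4); not one of the objects
dist = np.linalg.norm(friends[:, None] - friends[None, :], axis=2)
medoid = friends[dist.sum(axis=1).argmin()]     # the object with the smallest sum of distances to the others

print(centroid, medoid)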
How K-means Works
The way k-means works is shown graphically. This follows the k-means algorithm, the
pseudocode for which can be seen below.
Algorithm K-means
1: INPUT D the data set
2: INPUT d the distance measure
3: INPUT K the number of clusters
4: Define the initial K centroids (they are usually randomly defined, but can be defined explicitly in some software packages)
5: repeat
6: Associate each instance in D with the closest centroid according to the chosen distance measure d
7: Recalculate each centroid using all instances from D associated with it
8: until no instance in D changes its associated centroid
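The loop above can be written as a minimal Python sketch. The Euclidean distance, the random choice of initial centroids among the instances and the tiny example data set are assumptions of this illustration.

import numpy as np

def kmeans(D, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = D[rng.choice(len(D), K, replace=False)]     # step 4: initial K centroids, chosen at random
    assignment = np.full(len(D), -1)
    for _ in range(n_iter):
        dists = np.linalg.norm(D[:, None] - centroids[None, :], axis=2)
        new_assignment = dists.argmin(axis=1)               # step 6: associate each instance with the closest centroid
        if np.array_equal(new_assignment, assignment):      # step 8: stop when no instance changes centroid
            break
        assignment = new_assignment
        for k in range(K):                                  # step 7: recalculate each centroid
            if np.any(assignment == k):
                centroids[k] = D[assignment == k].mean(axis=0)
    return centroids, assignment

D = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [5.2, 4.9]])
print(kmeans(D, K=2))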
DBSCAN
Like k-means, DBSCAN (density-based spatial clustering of applications with noise) is used for
partitional clustering. In contrast to k-means, DBSCAN automatically defines the number of
clusters. DBSCAN is a density-based technique, defining objects forming a dense region as
belonging to the same cluster. Objects not belonging to dense regions are considered to be noise.
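A minimal usage sketch with scikit-learn's DBSCAN implementation follows; the data and the values of the two main hyper-parameters (the neighbourhood radius eps and the minimum number of neighbours min_samples) are assumptions of this example.

import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point, which should be marked as noise (label -1)
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.1],
              [9.0, 0.0]])

model = DBSCAN(eps=0.5, min_samples=2).fit(X)
print(model.labels_)        # cluster index per object; -1 means noise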
Agglomerative Hierarchical Clustering Technique
Hierarchical algorithms construct clusters progressively. This can be done by starting with all
instances in a single cluster and dividing it progressively, or by starting with as many clusters as
the number of instances and joining them up step by step. The first approach is top-down while
the second is bottom-up. The agglomerative hierarchical clustering method is a bottom-up
approach.
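A minimal sketch of the bottom-up approach using SciPy is shown below; the data, the Euclidean distance and the single-linkage merging criterion are assumptions of this example.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [5.2, 4.9]])

Z = linkage(X, method='single')                    # each row records one merge of two clusters
labels = fcluster(Z, t=2, criterion='maxclust')    # cut the hierarchy into two clusters
print(Z)
print(labels)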
Frequent Itemsets
An arbitrary combination of items is called an “itemset”. It is, in essence, an arbitrary subset of the set I of all items. Let us think a little about the number of possible itemsets (combinations of items). In total, there are 2^|I| − 1 possible itemsets, where |I| is the number of items in I.
Example 6.1 In our example, I = {Arabic, Indian, Mediterranean, Oriental, Fast food} is the set of all five items considered, so |I| = 5. The subsets {Fast food}, {Indian, Oriental} and {Arabic, Oriental, Fast food} are itemsets of size 1, 2 and 3, respectively. The number of all possible itemsets, created from items in I, of length 1 is five, of length 2 is ten and of length 3 is ten, while there are five itemsets of length 4 and one itemset of length 5. In total, there are 5 + 10 + 10 + 5 + 1 = 31 = 32 − 1 = 2^5 − 1 itemsets we can generate from items in I.
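The counts in Example 6.1 can be checked with a few lines of Python; the item names are those of the example.

from itertools import combinations

I = ['Arabic', 'Indian', 'Mediterranean', 'Oriental', 'Fast food']

itemsets = [set(c) for k in range(1, len(I) + 1) for c in combinations(I, k)]
print(len(itemsets))                                                     # 31 = 2**5 - 1
print([sum(1 for s in itemsets if len(s) == k) for k in range(1, 6)])    # [5, 10, 10, 5, 1]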
Setting the minsup Threshold
The minsup threshold is a highly important hyper-parameter, which has to be set carefully by the user according to their expectations of the results:
• Setting it to a very low value would give a large number of itemsets that would be too specific
to be considered “frequent”. These itemsets might apply in too few cases to be useful.
• On the other hand, very high values for minsup would give a small number of itemsets. These
would be too generic to be useful. Thus, the resulting information would probably not represent
new knowledge for the user. Another important aspect of the minsup value is whether the
number of frequent itemsets that results is small enough for subsequent analysis.
Example 6.4
All itemsets generated from I = {Arabic, Indian, Mediterranean, Oriental, Fast food} can be numbered and organized into a so-called “lattice”. Each itemset is connected to the subset(s) positioned above it and to the superset(s) positioned below it. The uppermost itemset (with the number 0) is an empty set, which should not be considered an itemset. It is introduced into the lattice only for the sake of completeness.
Apriori–a Join-based Method
The oldest and simplest technique for mining frequent itemsets follows the generic, so-called “join-based” principle, as set out below. Consider the Apriori principle applied to our dataset with a minimum support threshold minsup = 3. In the first step of the algorithm, the support of each itemset of length k = 1 is computed, resulting in four frequent itemsets and one non-frequent itemset. In the next step, itemsets of length k = 2 are generated from the frequent itemsets of length k = 1, so no itemset containing item F is considered. Step 2 results in four frequent itemsets, which are used to generate itemsets of length k = 3 in the following step.
Algorithm Apriori
1: INPUT T the transactional dataset
2: INPUT min_sup the minimum support threshold
3: Set k = 1
4: Set stop = false
5: repeat
6: Select all frequent itemsets of length k (with support at least min_sup)
7: if there are no two frequent itemsets of length k then
8: stop = true
9: else
10: Generate candidate itemsets of length k + 1 by joining the frequent itemsets of length k and set k = k + 1
11: until stop
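A minimal Python sketch of the Apriori idea described above is given below; the transactions and the absolute minsup count are assumptions of this example, not the data set used in the text.

from itertools import combinations

def apriori(transactions, min_sup):
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    # frequent itemsets of length k = 1
    frequent = {frozenset([i]) for i in items
                if sum(i in t for t in transactions) >= min_sup}
    all_frequent = set(frequent)
    k = 1
    while len(frequent) >= 2:
        k += 1
        # generate candidates of length k by joining frequent itemsets of length k - 1
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        frequent = {c for c in candidates
                    if sum(c <= t for t in transactions) >= min_sup}
        all_frequent |= frequent
    return all_frequent

# Hypothetical transactions over the restaurant items of the running example
T = [{'Indian', 'Oriental'}, {'Indian', 'Oriental', 'Arabic'},
     {'Indian', 'Mediterranean'}, {'Oriental', 'Mediterranean'},
     {'Indian', 'Oriental', 'Fast food'}]
print(apriori(T, min_sup=3))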
Eclat
The main obstacle for the Apriori algorithm is that in every step it needs to scan the whole
transactional database in order to count the support of candidate itemsets. Counting support is
one of the bottlenecks for frequent itemset mining algorithms, especially if the database does not fit into the memory. There are many technical reasons, not relevant here, why counting support is computationally expensive if the database is large and does not fit into the memory.
Behind Support and Confidence
Consider the lattice of association rules corresponding to the frequent itemset {I, M, O} found in the data. In some sense, each pattern reveals a kind of knowledge that might support further decisions by users of these patterns. However, only some patterns are “interesting” enough for the user, representing useful and unexpected knowledge. Evaluation of the interestingness of patterns depends on the application domain and also on the subjective opinion of the user.
Cross-support Patterns
It is not rare in real-world data that most of the items have relatively low or modest support, while a few of the items have high support. For example, more students at a university attend a course on introductory data analytics than one on quantum computing. If a pattern contains both low-support items and high-support items, then it is called a cross-support pattern. A cross-support pattern can represent interesting relationships between items but, most likely, it can also be spurious, since the items it contains are weakly correlated in the transactions. To measure the extent to which a pattern P can be called a cross-support pattern, the so-called support ratio is used.
It is defined as: supratio(P) = min{s(i1), s(i2), …, s(ik)} / max{s(i1), s(i2), …, s(ik)}
where s(i1),s(i2), …,s(ik) are the supports of items i1,i2,…,ik contained in P and min and max
return the minimum and maximum value, respectively, in their arguments. In other words,
supratio computes the ratio of the minimal support of items present in the pattern to the maximal
support of items present in the pattern.
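This definition translates directly into Python; the item supports passed in the example call are hypothetical values used only to illustrate the computation.

def support_ratio(item_supports):
    # item_supports: supports s(i1), ..., s(ik) of the items contained in pattern P
    return min(item_supports) / max(item_supports)

# Hypothetical pattern mixing a low-support and two high-support items
print(support_ratio([0.05, 0.60, 0.70]))   # 0.05 / 0.70, about 0.07, suggesting a cross-support pattern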
Lift
First, let us start with some considerations about the confidence measure and its relationship to
the strength of an association rule. We will use a so-called contingency table, which contains
some statistics related to an association rule. A contingency table related to two itemsets X and Y
appearing in the rule X ⇒ Y contains four frequency counts of transactions in which:
• X and Y are present
• X is present and Y is absent
• X is absent and Y is present
• neither X nor Y is present.
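A minimal sketch computing confidence and lift from these four counts follows. It uses the standard definitions confidence(X ⇒ Y) = s(X ∪ Y)/s(X) and lift(X ⇒ Y) = s(X ∪ Y)/(s(X) · s(Y)), which are not spelled out in the text above, and the counts in the example call are hypothetical.

def confidence_and_lift(f11, f10, f01, f00):
    # f11: X and Y present, f10: only X present, f01: only Y present, f00: neither present
    n = f11 + f10 + f01 + f00
    s_x, s_y, s_xy = (f11 + f10) / n, (f11 + f01) / n, f11 / n
    confidence = s_xy / s_x
    lift = s_xy / (s_x * s_y)
    return confidence, lift

print(confidence_and_lift(f11=40, f10=10, f01=20, f00=30))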
Simpson’s Paradox
A related phenomenon, called Simpson’s paradox, says that certain correlations between pairs of
itemsets(antecedents and consequents of rules) appearing in different groups of data may
disappear or be reversed when these groups are combined.
Example 6.23
Consider 800 transcript records (transactions) formed by two groups of students, A and B. The groups A and B may refer, for example, to students on the physics and biology study programs, respectively, while X = {basics of genetics} and Y = {introduction to data analytics} might be two itemsets, each consisting of a single course.
• In group A, the rule X ⇒ Y has high confidence (0.8) and good lift (1.79) values.
• In group B, the rule Y ⇒ X has high confidence (0.8) and good lift (1.66) values.
Other Types of Pattern
Sequential Patterns
The input to sequential pattern mining is a sequence database, denoted by S. Each row consists
of a sequence of events consecutively recorded in time. Each event is an itemset of arbitrary
length assembled from items available in the data.
Example 6.24 As an example, let the sequence database in Table 6.7 represent shopping records
of customers over some period of time. For example, the first row can be interpreted as follows:
the customer with ID=1 bought items as follows:
• first visit: items a and b
• second visit: items a, b and c
• third visit: items a, c, d, e
• fourth visit: items b and f.
Frequent Sequence Mining
Given a set of all available items I, a sequence database S and a threshold value minsup, frequent
sequence mining aims at finding those sequences, called frequent sequences, generated from I
for which support in S is at least minsup. It is important to mention that the number of frequent
sequences that can be generated from S with available items I is usually much larger than the
number of frequent itemsets generated from I.
Example 6.27
As an example, the number of all possible itemsets which can be generated from 6 items a, b, c, d, e and f, regardless of the value of the minsup threshold, is 2^6 − 1 = 64 − 1 = 63. The numbers of frequent sequences with respect to the sequence database in Table 6.7 are: 6 for minsup = 1.0, 20 for minsup = 0.8, 53 for minsup = 0.6 and 237 for minsup = 0.4.
Closed and Maximal Sequences
Similar to closed and maximal frequent itemsets, closed and maximal sequential patterns can be defined. A frequent sequential pattern is closed if it is not a subsequence of any other frequent sequential pattern with the same support. A frequent sequential pattern s is maximal if it is not a subsequence of any other frequent sequential pattern.
Example 6.28
Given the sequential database in Table 6.7 and minsup = 0.8, the frequent sequences ⟨{b,f}⟩ and ⟨{a},{f}⟩, both with support 0.8, are not closed since they are subsequences of ⟨{a},{b,f}⟩ with the same support 0.8, which is a maximal frequent sequence. On the other hand, the frequent sequence ⟨{a,e}⟩ with support 1.0 is closed since all of its “supersequences” ⟨{a},{a,e}⟩, ⟨{b},{a,e}⟩ and ⟨{c},{a,e}⟩ have less support, at 0.8.
UNIT-V

Predictive Performance Estimation


Generalization
When dealing with a predictive task represented by a data set, the main goal is to induce from
this data set a model able to correctly predict new objects of the same task. We want to minimize
the number or extent of future mispredictions. Since we cannot predict the future, what we do is
to estimate the predictive performance of the model for new data. We do so by separating our dataset into two mutually exclusive parts, one for training (model parameter tuning) and one for testing (evaluating the induced model on new data).
We use the training data set to induce one or more predictive models for a technique. We try
different configurations of the technique’s hyper-parameters. For each hyper-parameter value,
we induce one or more models. The hyper-parameter values that induce the models with the best
predictive performance in the training set are used to represent the technique’s predictive
performance. If we induce more than one model for the same hyper-parameter values, we use the
average predictive performance for the different models to define the predictive performance that
will be associated with the technique.
Usually, the larger the number of examples in the training set, the better the predictive
performance of the technique. The two main issues with performance estimation are how to
estimate the model performance for new data and what performance measure will be used in this
estimation.
Model Validation
The main goal of a predictive model is the prediction of the correct label for new objects. As
previously mentioned, what we can do is estimate the predictive performance using the model’s
predictive performance on a test set of data. In this process, we use predictive performance
estimation methods.
There are different model validation methods, which are usually based on data sampling [24].
The simplest method, holdout, divides the data set into two subsets:
• the training set is used for training
• the test set is used for testing.
The main deficiency of this method is that the predictive performance of the predictive models is
strongly influenced by the data selected for the training and test sets. The data partition in the holdout method for a data set with eight objects is illustrated in the corresponding figure. An alternative that reduces the problems with the holdout method is to sample the original data set several times, creating several partitions, each partition with one training set and one test set. Thus, the training predictive performance is the average of the predictive performances over the multiple training sets. The same idea is used to define the test predictive performance. By using several partitions, we reduce the chance of the predictive performance being influenced by the partition used and, as a result, have a more reliable predictive performance estimate. The main methods used to create the several partitions are:
• random sub-sampling
• k-fold cross-validation
• leave-one-out
• bootstrap.
The random sub-sampling method performs several holdouts, where the partition of the data set
into training and test sets is randomly defined.
A problem with both holdout and random sub-sampling is that half of the data is not used for model induction. This can be a waste if the data set is small or medium-sized. There are variations that use two thirds of the data in the training set and one third in the test set. This reduces the waste, but it is still there.
In the k-fold cross validation method, the original data set is divided into k subsets, called
“folds”, ideally of equal size. This results in k partitions, where for each partition one of the folds is used as the test set and the remaining folds are used as the training set. The value used for k is
usually 10. In this way, we use 90% of the data in the training set and 10% in the test set. Thus,
we end up with a larger training set than with random sub-sampling. Additionally we guarantee
that all objects are used for testing.
In a variation of k-fold cross-validation, called stratified k-fold cross-validation, the folds keep the same proportion of labels as found in the original data set. For regression, this is done by ensuring that the average of the target attribute is similar in all folds, while in classification it is done by ensuring that the number of objects per class is similar across the different folds. The data set partition obtained using k-fold with k equal to 4 for a data set with eight objects is illustrated in the corresponding figure.
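A minimal usage sketch of k-fold cross-validation with scikit-learn follows; the tiny data set of eight objects and the choice of k = 4 mirror the illustration mentioned above and are assumptions of this example.

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(8).reshape(8, 1)      # eight objects, one predictive attribute
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

kf = KFold(n_splits=4, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    # in each partition, one fold is the test set and the remaining folds form the training set
    print(fold, train_idx, test_idx)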
The leave-one-out method is a special case of k-fold cross validation when k is equal to the
number of objects in the data set. Leave-one-out provides a better estimate of the predictive
performance for new data than the 10-fold cross validation method. The average estimate
provided by leave-one-out tends to the true predictive performance.
The bootstrap validation method, like leave-one-out, also works better than 10-fold cross-
validation for small data sets. There are many variations of the bootstrap method. In the simplest
variation, the training set is defined by sampling from the original data set uniformly and with
replacement. Thus one object, after being selected from the original data set, is put back and can
be selected again, with the same probability as other objects in the data set. As a result, the same
object can be sampled more than once for inclusion in the training set.
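A minimal sketch of this simplest bootstrap variation is shown below: the training set is sampled uniformly and with replacement. The data set size and the use of the never-sampled objects as a test set are assumptions of this example.

import numpy as np

rng = np.random.default_rng(0)
n = 8                                             # eight objects, indexed 0..7
train_idx = rng.choice(n, size=n, replace=True)   # sampled with replacement: repetitions are possible
test_idx = np.setdiff1d(np.arange(n), train_idx)  # objects never sampled can serve as a test set
print(train_idx, test_idx)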
Predictive Performance Measures for Regression
Consider the linear model height = 128.017 + 0.611 × weight. This was induced from a training set of 14 instances. The test set consists of two instances, Omar and Patricia, whose predicted heights are 183.618 and 163.455 cm, respectively. The real heights of Omar and Patricia are, however, different from these predicted values, meaning that there are errors in our predictions. More concretely, let the real, measured heights of Omar and Patricia be 176 and 168 cm, respectively.
Let S = {(xi, yi) | i = 1, …, n} denote the set of n instances on which the prediction performance of a model is measured. Here, x is in bold because it is a vector of predictive attribute values and y is in lowercase because it is a known, single value, the target value. The predicted target attribute values will be denoted as ŷ1, …, ŷn.
Example 8.2
In our case, the set S, on which the performance is measured, contains two instances, so n = 2. These instances are (x1, y1) = ((Omar, 91), 176) and (x2, y2) = ((Patricia, 58), 168). The predicted values of the target attribute are ŷ1 = 183.618 and ŷ2 = 163.455.
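With these numbers, the prediction errors and the mean squared error on S can be computed directly; using the MSE here anticipates the measure mentioned later and is only an illustration.

import numpy as np

y_true = np.array([176.0, 168.0])        # real heights of Omar and Patricia
y_pred = np.array([183.618, 163.455])    # heights predicted by the linear model

errors = y_pred - y_true                 # 7.618 and -4.545
mse = np.mean(errors ** 2)               # mean squared error, about 39.35
print(errors, mse)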
Finding the Parameters of the Model
Linear Regression
The linear regression (LR) algorithm is one of the oldest and simplest regression algorithms. Although simple, it is able to induce good regression models, which are easily interpretable.
Let’s take a closer look at our model height = 128.017 + 0.611 × weight. This will allow us to
understand the main idea behind LR. Given the notation above, each instance x is associated with
only one attribute, the weight (that’s why it is usually called univariate linear regression) and the
target attribute y is associated with the height. As we have already shown, this model is the equation
of a line in a two-dimensional space. We can see that there are two parameters, β̂0 and β̂1, such that β̂1 is associated with the importance of the attribute x1, the weight. The other parameter, β̂0, is called the intercept and is the value of ŷ where the linear model crosses the y-axis, in other words when x1 = 0.
The result of the optimization process is that this line goes “through the middle” of these instances, represented by points. The objective function, in this case, could be defined as follows. Find the parameters β̂0, β̂1 representing a line such that the mean of the squared distances of the points to this line is minimal. Or, in other words, find a model ŷ = β̂0 + β̂1 × x1, called a univariate linear model, such that the MSE between yi and ŷi is minimal, considering all the instances (xi, yi) in the training set, where i = 1, …, n.
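The parameters of a univariate linear model can be obtained by ordinary least squares. The sketch below fits height as a function of weight on a small hypothetical training set, so the resulting coefficients will not exactly match the model height = 128.017 + 0.611 × weight used in the text.

import numpy as np

# Hypothetical training data: (weight in kg, height in cm)
weight = np.array([58, 65, 72, 80, 91], dtype=float)
height = np.array([163, 168, 172, 177, 184], dtype=float)

beta1, beta0 = np.polyfit(weight, height, deg=1)   # least-squares estimates of the slope and intercept
print(beta0, beta1)                                # model: height_hat = beta0 + beta1 * weight

predicted = beta0 + beta1 * weight
mse = np.mean((height - predicted) ** 2)           # the quantity the fit minimizes
print(mse)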
The Bias-variance Trade-off
Before moving forward, let us discuss the noise in the data, which significantly influences the
outcome of the learning process. Noise can have many causes such as, for example, imprecision of
the measuring devices or sensors. In our example, the height of each person is an integer, and is
therefore probably not precise. We usually do not measure the height on a more accurate scale.
Suppose that the relationship between the instances xi and their labels yi can be expressed, in general, by some hypothetical function f that maps each instance xi to some value f(xi). However, the real, measured value yi usually differs from this hypothetical value f(xi) by some, usually small, value 𝜖i corresponding to the noise. Thus we get: yi = f(xi) + 𝜖i (8.11), where the noise 𝜖i is the component of yi that cannot be predicted by any model. In addition, we usually assume normally distributed noise with zero mean, that is, 𝜖 ∼ N(0, 1). In other words, there are cases when yi is slightly below f(xi) (negative noise) and cases when yi is slightly above f(xi) (positive noise), but the average noise (its expected value) should be 0.
Shrinkage Methods
Multivariate linear regression has low bias, but high variance. Shrinkage methods try to minimize
the overall error by increasing the bias slightly while reducing the variance component of the error.
Two of the best-known shrinkage methods are ridge and lasso regression.
Ridge Regression
Ridge regression increases the bias component of the overall error by adding a penalty term for the coefficients β̂0, β̂1, …, β̂p to Equation (8.10), leading to an objective function that minimizes the sum of squared errors plus a penalty term λ ∑_j β̂j², where λ ≥ 0 is a hyper-parameter controlling the strength of the penalty.
Lasso Regression
The least absolute shrinkage and selection operator (lasso) regression algorithm is another penalized regression algorithm that can deal efficiently with high-dimensional data sets. It performs attribute selection by taking into account not only the predictive performance of the induced model, but also the complexity of the model. The complexity is measured by the number of predictive attributes used by the model. It does this by including in the equation of the multivariate linear regression model an additional weighting term, which depends on the sum of the absolute values of the β̂j weights. The weight values define the importance and number of predictive attributes in the induced model.
The lasso algorithm usually produces sparse solutions. Sparse means that a large number of
predictive attributes have zero weight, resulting in a regression model that uses a small number of
predictive attributes. As well as attribute selection, the lasso algorithm also performs shrinkage.
Mathematically, the lasso algorithm is very well founded.
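A minimal usage sketch of both shrinkage methods in scikit-learn follows; the random data, the number of attributes and the value of the penalty strength alpha are assumptions of this example. Note how lasso drives several coefficients exactly to zero, which is the sparsity discussed above, while ridge only shrinks them.

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))                                          # 50 instances, 10 predictive attributes
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=50)     # only two attributes matter

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print(ridge.coef_)    # all coefficients shrunk, but typically none exactly zero
print(lasso.coef_)    # several coefficients exactly zero: a sparse model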
Classification
Classification is one of the most common tasks in analytics, and the most common in predictive
analytics. Without noticing, we are classifying things all the time. We perform a classification task
when:
• we decide if we are going to stay at home, go out for dinner or visit a friend;
• we choose a meal from the menu of a restaurant;
• we decide if a particular person should be added to our social network;
• we decide if someone is a friend.
Classification task
A predictive task where the label to be assigned to a new, unlabeled, object, given the value of its
predictive attributes, is a qualitative value representing a class or category. The difficulty
(complexity) of a classification task depends on the data distribution in the training data set. The
most common classification task is binary classification, where the target attribute can have one of
only two possible values, for example “yes” or “no”. Usually, one of these classes is referred to as
the positive class and the other as the negative class. The positive class is usually the class of
particular interest. As an example, a medical diagnosis classification task can have only two classes:
healthy and sick. The “sick” class is the main class of interest, so it is the positive class.
Binary Classification
In addition to being the most common classification task, binary classification is the simplest. Most
other classification tasks, like multiclass, multilabel and hierarchical classification, can be
decomposed into a set of binary classification tasks. In these cases, the final classification is a
combination of the output from binary classifiers.
Let us look at an example of a binary classification task using the familiar data set of our contacts in
a social network tool. Suppose you want to go out for dinner and you want to predict who, among
your new contacts, would be a good company. Suppose also that, to make this decision, you can use
data from previous dinners with people in your social network and that these data have the
following three attributes: name, age and how the last dinner experience with that person was
(Table 9.1). The “how was the dinner” attribute is a qualitative attribute that has two possible
values: good and bad. Suppose further that Table 9.1 has the values of these three attributes for all
your contacts. Figure 9.1 illustrates how the data are distributed in this data set.
Using this data set, you can induce a simple classification model by applying a classification
algorithm to identify a vertical line able to separate objects into the two classes. For example, it is
possible to induce a classification model represented by a linear equation ŷ = β̂0 + β̂1 × x, which, in our case, since β̂0 = 0 and β̂1 = 1, would be simplified to dinner = age, where y is the target attribute dinner and x is the predictive attribute age. The classification of objects according to this equation can be represented by a simple rule that says: everybody whose age is less than 31 was bad company at the last dinner. The others are classified as good company. Thus a simple model can be induced, which can be a rule saying:
If person-age < 32 then dinner will be bad
Else dinner will be good
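This rule translates directly into code; the list of contacts below is hypothetical, since Table 9.1 is not reproduced here.

def dinner_prediction(age):
    # rule induced from the age attribute alone
    return "bad" if age < 32 else "good"

# Hypothetical new contacts: (name, age)
contacts = [("Andrew", 51), ("Bernhard", 43), ("Carolina", 28)]
for name, age in contacts:
    print(name, dinner_prediction(age))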
The application of this rule to the data set in Table 9.1 creates a dividing line separating objects in
the two classes, as illustrated in Figure 9.2. Any other vertical line separating the objects from the
two classes would be a valid classification model too. However, this data set is very well-behaved.
Classification tasks are usually not that easy. Suppose your data set was actually the data in Table
9.2, which is illustrated in Figure 9.3. A simple rule like the one used for the data set in Table 9.1
does not work any more; neither would any other rule represented by a vertical line separating
objects. An alternative to deal with this difficulty is to extract an additional predictive attribute from
our social network data that could allow the induction of a classification model able to discriminate
between the two classes. In this case, we will add the attribute “Education level”. The new data set,
now with eight objects, can be seen in Table 9.3. The distribution of the data in this data set is shown in Figure 9.4.

Since a classification task with two predictive attributes can be represented by a graph with two
axes, the representation is two-dimensional. Up to three predictive attributes, it is possible to
visualize the data distribution without a mathematical transformation.

A binary classification data set with three predictive attributes is linearly separable if a plane divides the objects into the two classes. For more than three predictive attributes, a classification data set is linearly separable if a hyperplane separates the objects into the two classes.
