Business Intelligence Unit 5
Classification Tree
Classification tree methods (i.e., decision tree methods) are recommended when the data
mining task involves classification or prediction of outcomes, and the goal is to generate
rules that can be easily explained and translated into SQL or a natural query language.
A Classification tree is built through a process known as binary recursive partitioning. This is
an iterative process of splitting the data into partitions, and then splitting it up further on
each of the branches.
Initially, a Training Set is created in which the classification label (e.g., purchaser or non-
purchaser) is known (pre-classified) for each record. Next, the algorithm systematically
assigns each record to one of two subsets on some basis (e.g., income > $75,000 or
income <= $75,000). The objective is to attain a homogeneous set of labels (e.g., purchaser or
non-purchaser) in each partition. This partitioning (splitting) is then applied to each of the
new partitions. The process continues until no more useful splits can be found. The heart of
the algorithm is the rule that determines the initial split (displayed in the following figure).
The process starts with a Training Set consisting of pre-classified records (target field or
dependent variable with a known class or label such as purchaser or non-purchaser). The
goal is to build a tree that distinguishes among the classes. For simplicity, assume that there
are only two target classes, and that each split is a binary partition. The partition (splitting)
criterion generalizes to multiple classes, and any multi-way partitioning can be achieved
through repeated binary splits. To choose the best splitter at a node, the algorithm considers
each input field in turn. In essence, each field is sorted. Every possible split is tried and
considered, and the best split is the one that produces the largest decrease in diversity of
the classification label within each partition (i.e., the increase in homogeneity). This is
repeated for all fields, and the winner is chosen as the best splitter for that node. The
process is continued at subsequent nodes until a full tree is generated.
XLMiner uses the Gini index as the splitting criterion, which is a commonly used measure of
inequality. The index ranges between 0 and 1. A Gini index of 0 indicates that all records in
the node belong to the same category. A Gini index close to 1 indicates that the records in
the node are spread evenly across many different categories. For a complete discussion of
this index, please see Leo Breiman and Jerome Friedman's book, Classification and
Regression Trees (3).
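As an illustration, the Gini index for a node can be computed directly from the class labels of the records that fall into that node. The short Python sketch below (with made-up purchaser/non-purchaser labels) shows the calculation; it illustrates the measure itself, not XLMiner's own implementation.

from collections import Counter

def gini_index(labels):
    # Gini impurity: 0 when all labels are identical; it approaches 1
    # as records spread evenly across many categories.
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini_index(["purchaser"] * 10))                         # 0.0 (pure node)
print(gini_index(["purchaser"] * 5 + ["non-purchaser"] * 5))  # 0.5 (evenly mixed, two classes)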
For example, if the classification tree is trying to predict height and it comes to a node
containing one tall person named X and several shorter people, the algorithm could decrease
diversity at that node by imposing a new rule that people named X are tall, and classify the
Training Data accordingly. In the real world, this rule is obviously inappropriate. Pruning
methods solve this problem: they let the tree grow to maximum size, then remove smaller
branches that fail to generalize. (Note: Do not include irrelevant fields such as name; this is
used simply as an illustration.)
Since the tree is grown from the Training Set, when it reaches full structure it usually
suffers from over-fitting (i.e., it is explaining random elements of the Training Data that are
not likely to be features of the larger population of data). This results in poor performance
on new data. Therefore, trees must be pruned using the Validation Set.
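The following Python sketch illustrates this grow-then-prune idea using scikit-learn as a stand-in for XLMiner (the data are synthetic): the full tree is grown on the Training Set, and the pruning level is then chosen by performance on a held-out Validation Set.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data split into Training and Validation Sets.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=0)

# Grow the full tree, then list candidate cost-complexity pruning levels.
full_tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X_train, y_train)
alphas = full_tree.cost_complexity_pruning_path(X_train, y_train).ccp_alphas

# Keep the pruned tree that generalizes best to the Validation Set.
best_tree = max(
    (DecisionTreeClassifier(criterion="gini", ccp_alpha=a, random_state=0).fit(X_train, y_train)
     for a in alphas),
    key=lambda t: t.score(X_valid, y_valid),
)
print("validation accuracy:", best_tree.score(X_valid, y_valid))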
Logistic Regression:
Logistic regression is a statistical method for analysing a data set in which there are one or more
independent variables that determine an outcome. The outcome is measured with a dichotomous
variable (in which there are only two possible outcomes). The goal of logistic regression is to find
the best fitting model to describe the relationship between the dichotomous characteristic of
interest (dependent variable = response or outcome variable) and a set of independent (predictor
or explanatory) variables.
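A brief Python sketch (on synthetic data, using scikit-learn) shows the idea: a dichotomous 0/1 outcome is modeled as a function of several predictors, and the fitted model returns both coefficients and predicted probabilities.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 5 predictor variables and a 0/1 outcome.
X, y = make_classification(n_samples=500, n_features=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

model = LogisticRegression().fit(X_train, y_train)
print("coefficients:", model.coef_)                              # effect of each predictor on the log-odds
print("class probabilities:", model.predict_proba(X_test[:3]))   # P(outcome = 0) and P(outcome = 1)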
Decision Trees:
A decision tree builds classification or regression models in the form of a tree structure. It breaks
down a data set into smaller and smaller subsets while, at the same time, an associated decision
tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A
decision node has two or more branches, and a leaf node represents a classification or decision.
The topmost decision node in a tree, which corresponds to the best predictor, is called the root
node. Decision trees can handle both categorical and numerical data.
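The structure described above (root node at the top, decision nodes branching on predictor values, leaf nodes giving the classification) can be seen by printing a fitted tree; the sketch below uses scikit-learn and the standard iris data purely as an example.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)
# The printed rules show the root split first, then decision nodes, then leaves ("class: ...").
print(export_text(tree, feature_names=list(iris.feature_names)))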
Random Forest:
Random forests or random decision forests are an ensemble learning method for classification,
regression and other tasks that operates by constructing a multitude of decision trees at training
time and outputting the class that is the mode of the classes (classification) or the mean prediction
(regression) of the individual trees. Random decision forests correct for decision trees' habit of
overfitting to their training set.
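A minimal scikit-learn sketch (on synthetic data) of this idea: many trees are grown on random subsets of the data and features, and their aggregated vote is the prediction.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)  # synthetic stand-in data
forest = RandomForestClassifier(n_estimators=200, random_state=0)          # 200 trees, votes aggregated
print("cross-validated accuracy:", cross_val_score(forest, X, y, cv=5).mean())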
Neural Network:
A neural network consists of units (neurons), arranged in layers, which convert an input vector
into some output. Each unit takes an input, applies an (often nonlinear) function to it and then
passes the output on to the next layer. Generally the networks are defined to be feed-forward: a
unit feeds its output to all the units in the next layer, but there is no feedback to the previous
layer. Weightings are applied to the signals passing from one unit to another, and it is these
weightings that are tuned in the training phase to adapt the neural network to the particular
problem at hand.
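The forward pass can be sketched in a few lines of NumPy; the weights and input below are random placeholders, since in practice the weights are exactly what training (e.g., backpropagation) adjusts.

import numpy as np

def relu(x):
    return np.maximum(0.0, x)   # a common nonlinear activation function

rng = np.random.default_rng(0)
x = rng.normal(size=4)          # input vector with 4 features
W1 = rng.normal(size=(4, 3))    # weights from the input layer to 3 hidden units
W2 = rng.normal(size=(3, 1))    # weights from the hidden layer to 1 output unit

hidden = relu(x @ W1)                        # hidden layer: weighted sum, then nonlinearity
output = 1 / (1 + np.exp(-(hidden @ W2)))    # sigmoid squashes the final signal into (0, 1)
print(output)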
Nearest Neighbor:
The nearest neighbor (k-NN) method classifies a new record by finding the k records in the
training set that are closest to it according to a distance measure (typically Euclidean distance)
and assigning it the class that is most common among those neighbors. It requires no explicit
model-building phase, but predictions can be slow on large data sets because distances to all
training records must be computed.
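A short scikit-learn sketch (using the iris data as a stand-in) of k-nearest-neighbor classification:

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
knn = KNeighborsClassifier(n_neighbors=5).fit(iris.data, iris.target)  # k = 5 neighbors
print(knn.predict(iris.data[:3]))   # each record gets the majority class of its 5 nearest neighbors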
---------------------------------------------------------------------------------------------------------------------
Association Rule
Association rule mining finds interesting associations and relationships among large sets of data
items. An association rule shows how frequently an itemset occurs in a transaction. A typical
example is Market Basket Analysis.
Market Basket Analysis is one of the key techniques used by large retailers to show associations
between items. It allows retailers to identify relationships between the items that people
frequently buy together.
Given a set of transactions, we can find rules that will predict the occurrence of an item based on
the occurrences of other items in the transaction.
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Before we start defining the rule, let us first see the basic definitions.
Frequent Itemset – An itemset whose support is greater than or equal to minsup threshold.
Association Rule – An implication expression of the form X -> Y, where X and Y are any 2
itemsets.
Support(s) –
The number of transactions that include the items in both the {X} and {Y} parts of the rule,
as a percentage of the total number of transactions. It is a measure of how frequently the
collection of items occurs together as a percentage of all transactions.
Supp(X => Y) = (number of transactions containing both X and Y) / (total number of transactions)
It is interpreted as the fraction of transactions that contain both X and Y.
Confidence(c) –
It is the ratio of the number of transactions that include all items in both {X} and {Y} to the
number of transactions that include all items in {X}.
Conf(X => Y) = Supp(X ∪ Y) / Supp(X)
It measures how often the items in Y appear in transactions that also contain the items in X.
Lift(l) –
The lift of the rule X => Y is the confidence of the rule divided by the expected
confidence, assuming that the itemsets X and Y are independent of each other. The
expected confidence is simply the support (frequency) of {Y}.
Lift(X => Y) = Conf(X => Y) / Supp(Y)
A lift value near 1 indicates that X and Y appear together about as often as expected; a value
greater than 1 means they appear together more often than expected, and a value less than 1
means they appear together less often than expected. Greater lift values indicate a stronger
association.
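As a worked example, the three measures can be computed for the rule {Milk, Diaper} => {Beer} on the five transactions in the table above; the short Python sketch below does exactly that.

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

X, Y = {"Milk", "Diaper"}, {"Beer"}
n = len(transactions)

def count(items):
    # Number of transactions that contain every item in the given itemset.
    return sum(1 for t in transactions if items <= t)

support = count(X | Y) / n               # fraction of transactions containing both X and Y -> 0.4
confidence = count(X | Y) / count(X)     # of the transactions with X, how many also contain Y -> 0.667
lift = confidence / (count(Y) / n)       # confidence relative to the overall frequency of Y -> 1.11

print(support, confidence, lift)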
Association rules are very useful for analyzing such datasets. The data is typically collected using
bar-code scanners in supermarkets. Such databases consist of a large number of transaction
records, each listing all the items bought by a customer in a single purchase. The manager can
then learn whether certain groups of items are consistently purchased together, and use this
information to adjust store layouts, cross-selling, and promotions.
-------------------------------------------------------------------------------------------------------------------
Apriori Algorithm:
Apriori algorithm, a classic algorithm, is useful in mining frequent itemsets and relevant
association rules. Usually, you operate this algorithm on a database containing a large number of
transactions. One such example is the items customers buy at a supermarket.
It helps customers buy their items with ease and enhances the sales performance of the
department store.
This algorithm has utility in the field of healthcare as it can help in detecting adverse drug
reactions (ADR) by producing association rules to indicate the combination of medications and
patient characteristics that could lead to ADRs.
It has got this odd name because it uses ‘prior’ knowledge of frequent itemset properties. The
credit for introducing this algorithm goes to Rakesh Agrawal and Ramakrishnan Srikant in 1994.
We shall now explore the apriori algorithm implementation in detail.
Three significant components comprise the apriori algorithm. They are as follows.
Support
Confidence
Lift
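The sketch below is a hedged, from-scratch illustration of the level-wise Apriori idea (it reuses the market-basket transactions from the table above and an illustrative minsup of 0.6, and finds frequent itemsets only rather than generating the rules themselves): frequent k-itemsets are joined into candidate (k+1)-itemsets, and any candidate with an infrequent subset is pruned before its support is counted.

from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
minsup = 0.6  # an itemset is frequent if it appears in at least 60% of transactions

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Level 1: frequent single items.
items = sorted({item for t in transactions for item in t})
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= minsup]
all_frequent = list(frequent)

# Levels 2, 3, ...: join frequent k-itemsets into (k+1)-candidates, prune any
# candidate with an infrequent subset ('prior' knowledge), then count support.
k = 2
while frequent:
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
    frequent = [c for c in candidates if support(c) >= minsup]
    all_frequent.extend(frequent)
    k += 1

for itemset in all_frequent:
    print(set(itemset), round(support(itemset), 2))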
Introduction to Clustering
Clustering is basically a type of unsupervised learning method. An unsupervised learning method
is a method in which we draw inferences from datasets consisting of input data without labeled
responses. Generally, it is used as a process to find meaningful structure, explanatory underlying
processes, generative features, and groupings inherent in a set of examples.
Clustering is the task of dividing the population or data points into a number of groups such that
data points in the same group are more similar to one another than to data points in other groups.
It is basically a grouping of objects on the basis of the similarity and dissimilarity between them.
5.2 Partitioning Methods
Partitioning methods relocate instances by moving them from one cluster to another, starting
from an initial partitioning. Such methods typically require that the number of clusters be
pre-set by the user. To achieve global optimality in partition-based clustering, an exhaustive
enumeration process of all possible partitions is required. Because this is not feasible, certain
greedy heuristics are used in the form of iterative optimization. Namely, a relocation method
iteratively relocates points between the k clusters. The following subsections present various
types of partitioning methods.
These algorithms, which tend to work well with isolated and compact clusters, are the most
intuitive and frequently used methods. The basic idea is to find a clustering structure that
minimizes a certain error criterion which measures the “distance” of each instance to its
representative value. The most well-known criterion is the Sum of Squared Error (SSE),
which measures the total squared Euclidean distance of instances to their representative
values. SSE may be globally optimized by exhaustively enumerating all partitions, which is
very time-consuming, or by giving an approximate solution (not necessarily leading to a
global minimum) using heuristics. The latter option is the most common alternative.
Clustering Methods
The simplest and most commonly used algorithm employing a squared error criterion is the
K-means algorithm. This algorithm partitions the data into K clusters (C1, C2, . . . , CK),
represented by their centers or means. The center of each cluster is calculated as the mean of
all the instances belonging to that cluster.
Figure 15.1 presents the pseudo-code of the K-means algorithm. The algorithm starts with an
initial set of cluster centers, chosen at random or according to some heuristic procedure. In
each iteration, each instance is assigned to its nearest cluster center according to the
Euclidean distance between the two. Then the cluster centers are re-calculated. The center of
each cluster is calculated as the mean of all the instances belonging to that cluster:
μk = (1/Nk) Σ xq, summed over q = 1, . . . , Nk
where Nk is the number of instances belonging to cluster k and μk is the mean of cluster k.
A number of convergence conditions are possible. For example, the search may stop when
the partitioning error is not reduced by the relocation of the centers. This indicates that the
present partition is locally optimal. Other stopping criteria can also be used, such as
exceeding a pre-defined number of iterations.
Input: S (instance set), K (number of clusters)
Output: clusters
1: Initialize K cluster centers.
2: while termination condition is not satisfied do
3:   Assign instances to the closest cluster center.
4:   Update cluster centers based on the assignment.
5: end while
Figure 15.1. K-means Algorithm.
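The pseudo-code in Figure 15.1 can be turned into a compact NumPy sketch (assuming purely numeric data; the two-cluster example data at the end are synthetic):

import numpy as np

def k_means(S, K, max_iterations=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = S[rng.choice(len(S), size=K, replace=False)]   # 1: initialize K cluster centers
    assignments = np.full(len(S), -1)
    for _ in range(max_iterations):                          # 2: while termination condition is not satisfied
        distances = np.linalg.norm(S[:, None, :] - centers[None, :, :], axis=2)
        new_assignments = distances.argmin(axis=1)           # 3: assign instances to the closest center
        if np.array_equal(new_assignments, assignments):
            break                                            # no reassignments: partition is locally optimal
        assignments = new_assignments
        centers = np.array([S[assignments == k].mean(axis=0) # 4: update centers as the cluster means
                            for k in range(K)])
    return centers, assignments

# Example usage on small synthetic data with two well-separated groups.
rng = np.random.default_rng(1)
S = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centers, labels = k_means(S, K=2)
print(centers)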
The K-means algorithm may be viewed as a gradient-descent procedure, which begins with
an initial set of K cluster centers and iteratively updates it so as to decrease the error
function. A rigorous proof of the finite convergence of the K-means type algorithms is given
in (Selim and Ismail, 1984). The complexity of T iterations of the K-means algorithm
performed on a sample of m instances, each characterized by N attributes, is O(T * K * m * N).
This linear complexity is one of the reasons for the popularity of the K-means algorithm.
Even if the number of instances is substantially large (which is often the case nowadays),
this algorithm is computationally attractive. Thus, the K-means algorithm has an advantage
in comparison to other clustering methods (e.g. hierarchical clustering methods), which have
non-linear complexity. Other reasons for the algorithm's popularity are its ease of
interpretation, simplicity of implementation, speed of convergence and adaptability to sparse
data (Dhillon and Modha, 2001).
The Achilles heel of the K-means algorithm is the selection of the initial partition. The
algorithm is very sensitive to this selection, which may make the difference between a global
and a local minimum. Being a typical partitioning algorithm, the K-means algorithm works
well only on data sets having isotropic clusters, and is not as versatile as single-link
algorithms, for instance. In addition, this algorithm is sensitive to noisy data and outliers (a
single outlier can increase the squared error dramatically); it is applicable only when a mean
is defined (namely, for numeric attributes); and it requires the number of clusters in advance,
which is not trivial when no prior knowledge is available.
The use of the K-means algorithm is often limited to numeric attributes. Huang (1998)
presented the K-prototypes algorithm, which is based on the K-means algorithm but removes
the numeric-data limitation while preserving its efficiency. The algorithm clusters objects
with numeric and categorical attributes in a way similar to the K-means algorithm. The
similarity measure on numeric attributes is the squared Euclidean distance; the similarity
measure on the categorical attributes is the number of mismatches between objects and the
cluster prototypes.
Another partitioning algorithm which attempts to minimize the SSE is the K-medoids or
PAM (partition around medoids — (Kaufmann and Rousseeuw, 1987)). This algorithm is
very similar to the K-means algorithm. It differs from the latter mainly in its representation
of the different clusters. Each cluster is represented by the most centric object in the cluster,
rather than by the implicit mean, which may not belong to the cluster. The K-medoids
method is more robust than the K-means algorithm in the presence of noise and outliers
because a medoid is less influenced by outliers or other extreme values than a mean.
However, its processing is more costly than the K-means method. Both methods require the
user to specify K, the number of clusters.
Other error criteria can be used instead of the SSE. Estivill-Castro (2000) analyzed the total
absolute error criterion. Namely, instead of summing up the squared error, he suggests
summing up the absolute error. While this criterion is superior in regard to robustness, it
requires more computational effort.