Business Intelligence Unit 5

UNIT V: Classification & Unsupervised Learning

In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. Examples are assigning a given email to the "spam" or "non-spam" class, and assigning a diagnosis to a given patient based on observed characteristics of the patient (sex, blood pressure, presence or absence of certain symptoms, etc.). Classification is an example of pattern recognition.

In the terminology of machine learning,[1] classification is considered an instance of supervised learning, i.e., learning where a training set of correctly identified observations is available. The corresponding unsupervised procedure is known as clustering, and involves grouping data into categories based on some measure of inherent similarity or distance.

An algorithm that implements classification, especially in a concrete implementation, is known as a classifier. The term "classifier" sometimes also refers to the mathematical function, implemented by a classification algorithm, that maps input data to a category.

Classification Tree
Classification tree methods (i.e., decision tree methods) are recommended when the data
mining task contains classifications or predictions of outcomes, and the goal is to generate
rules that can be easily explained and translated into SQL or a natural query language.

A Classification tree labels, records, and assigns variables to discrete classes. A Classification tree can also provide a measure of confidence that the classification is correct.

A Classification tree is built through a process known as binary recursive partitioning. This is an iterative process of splitting the data into partitions, and then splitting it up further on each of the branches.

Initially, a Training Set is created where the classification label (i.e., purchaser or non-purchaser) is known (pre-classified) for each record. Next, the algorithm systematically assigns each record to one of two subsets on some basis (i.e., income > $75,000 or income <= $75,000). The objective is to attain a homogeneous set of labels (i.e., purchaser or non-purchaser) in each partition. This partitioning (splitting) is then applied to each of the new partitions. The process continues until no more useful splits can be found. The heart of the algorithm is the rule that determines the initial split.
The process starts with a Training Set consisting of pre-classified records (target field or
dependent variable with a known class or label such as purchaser or non-purchaser). The
goal is to build a tree that distinguishes among the classes. For simplicity, assume that there
are only two target classes, and that each split is a binary partition. The partition (splitting)
criterion generalizes to multiple classes, and any multi-way partitioning can be achieved
through repeated binary splits. To choose the best splitter at a node, the algorithm considers
each input field in turn. In essence, each field is sorted. Every possible split is tried and
considered, and the best split is the one that produces the largest decrease in diversity of
the classification label within each partition (i.e., the increase in homogeneity). This is
repeated for all fields, and the winner is chosen as the best splitter for that node. The
process is continued at subsequent nodes until a full tree is generated.

XLMiner uses the Gini index as the splitting criterion, which is a commonly used measure of inequality. The index takes values between 0 and 1. A Gini index of 0 indicates that all records in the node belong to the same category. A Gini index of 1 indicates that each record in the node belongs to a different category. For a complete discussion of this index, please see Leo Breiman and Jerome Friedman's book, Classification and Regression Trees (3).
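As a rough illustration of the splitting criterion described above, the sketch below computes the weighted Gini index of a candidate binary split. The income threshold and the toy records are invented for the example and are not part of XLMiner.

```python
from collections import Counter

def gini(labels):
    """Gini index of a set of class labels: 0 = pure node, higher = more mixed."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(records, labels, split):
    """Weighted Gini index after applying a binary split function to each record."""
    left = [lab for rec, lab in zip(records, labels) if split(rec)]
    right = [lab for rec, lab in zip(records, labels) if not split(rec)]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# Hypothetical records echoing the income > $75,000 split mentioned in the text.
records = [{"income": 90000}, {"income": 40000}, {"income": 82000}, {"income": 30000}]
labels = ["purchaser", "non-purchaser", "purchaser", "non-purchaser"]
print(split_gini(records, labels, lambda r: r["income"] > 75000))  # 0.0 -> perfectly pure split
```

The best splitter at a node is the candidate split with the lowest weighted Gini index, i.e., the largest decrease in diversity.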

Pruning the Tree


Pruning is the process of removing leaves and branches to improve the performance of the
decision tree when moving from the Training Set (where the classification is known) to real-
world applications (where the classification is unknown). The tree-building algorithm makes
the best split at the root node where there are the largest number of records, and
considerable information. Each subsequent split has a smaller and less representative
population with which to work. Towards the end, idiosyncrasies of training records at a
particular node display patterns that are peculiar only to those records. These patterns can
become meaningless for prediction if you try to extend rules based on them to larger
populations.

For example, if the classification tree is trying to predict height and it comes to a node containing one tall person named X and several other shorter people, the algorithm can decrease diversity at that node with a new rule stating that people named X are tall, and thus classify the Training Data. In the real world, this rule is obviously inappropriate. Pruning methods solve this problem -- they let the tree grow to maximum size, then remove smaller branches that fail to generalize. (Note: Do not include irrelevant fields such as name; this is simply used as an illustration.)

Since the tree is grown from the Training Set, when it reaches full structure it usually suffers from over-fitting (i.e., it is explaining random elements of the Training Data that are not likely to be features of the larger population of data). This results in poor performance on new, unseen data. Therefore, trees must be pruned using the Validation Set.
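One concrete way to pick the pruning level against held-out data is cost-complexity pruning, sketched below with scikit-learn. This is an assumption for illustration: the text does not name a specific library or pruning method, and the synthetic data and validation split are invented.

```python
# Sketch: grow a full tree, then pick the pruning strength that does best on held-out data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=0)

# Candidate pruning strengths from the cost-complexity pruning path of the full tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_valid, y_valid)   # held-out accuracy plays the role of the Validation Set
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"chosen ccp_alpha={best_alpha:.4f}, validation accuracy={best_score:.3f}")
```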

Classification Models (Types of Classification Algorithms)
In machine learning and statistics, classification is a supervised learning approach in which the computer program learns from the data input given to it and then uses this learning to classify new observations. This data set may simply be bi-class (like identifying whether a person is male or female, or whether a mail is spam or non-spam) or it may be multi-class too. Some examples of classification problems are: speech recognition, handwriting recognition, biometric identification, document classification, etc.

Here we have the types of classification algorithms in Machine Learning:

1. Linear Classifiers: Logistic Regression, Naive Bayes Classifier
2. Support Vector Machines
3. Decision Trees
4. Boosted Trees
5. Random Forest
6. Neural Networks
7. Nearest Neighbor

Naive Bayes Classifier (Generative Learning Model) :

It is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. Even if these features depend on each other or upon the existence of the other features, all of these properties independently contribute to the probability. The Naive Bayes model is easy to build and particularly useful for very large data sets. Along with simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods.
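A minimal sketch of a Gaussian Naive Bayes classifier using scikit-learn; the library choice and the tiny height/weight data set are assumptions made only for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Invented toy data: each row is (height in cm, weight in kg).
X = np.array([[180, 80], [160, 55], [175, 75], [150, 50]])
y = np.array(["male", "female", "male", "female"])

model = GaussianNB().fit(X, y)
print(model.predict([[170, 70]]))         # predicted class
print(model.predict_proba([[170, 70]]))   # class probabilities under the independence assumption
```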

Logistic Regression (Predictive Learning Model) :

It is a statistical method for analysing a data set in which there are one or more independent
variables that determine an outcome. The outcome is measured with a dichotomous variable (in
which there are only two possible outcomes). The goal of logistic regression is to find the best
fitting model to describe the relationship between the dichotomous characteristic of interest
(dependent variable = response or outcome variable) and a set of independent (predictor or
explanatory) variables.
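A hedged sketch of logistic regression on a dichotomous outcome, again using scikit-learn; the hours-studied example is invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

hours_studied = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
passed = np.array([0, 0, 0, 1, 1, 1])   # dichotomous outcome: only two possible values

clf = LogisticRegression().fit(hours_studied, passed)
print(clf.predict_proba([[3.5]])[0, 1])  # estimated probability of the positive outcome
```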

Decision Trees:

A decision tree builds classification or regression models in the form of a tree structure. It breaks down a data set into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node has two or more branches, and a leaf node represents a classification or decision. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data.
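The sketch below fits a shallow decision tree and prints its decision and leaf nodes as readable rules; scikit-learn and the iris data set are assumptions chosen only to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# The root node is the best predictor; each leaf carries a final classification.
print(export_text(tree, feature_names=list(iris.feature_names)))
```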

Random Forest:

Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees’ habit of overfitting to their training set.
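A brief sketch of a random forest whose prediction is the mode of its individual trees' votes; scikit-learn and the synthetic data are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# 100 trees, each grown on a bootstrap sample with random feature subsets.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict(X[:3]))  # the mode of the 100 individual trees' votes
```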

Neural Network:

A neural network consists of units (neurons), arranged in layers, which convert an input vector into some output. Each unit takes an input, applies an (often nonlinear) function to it and then passes the output on to the next layer. Generally the networks are defined to be feed-forward: a unit feeds its output to all the units on the next layer, but there is no feedback to the previous layer. Weightings are applied to the signals passing from one unit to another, and it is these weightings which are tuned in the training phase to adapt the neural network to the particular problem at hand.
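A small NumPy sketch of the feed-forward computation described above: weighted signals flow from one layer to the next through a nonlinear function, with no feedback. The layer sizes, activation and random weights are arbitrary choices for illustration (in practice the weights would be tuned during training).

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

# Weights and biases for a tiny 3-input -> 4-hidden -> 2-output feed-forward network.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 2)), np.zeros(2)

def forward(x):
    hidden = relu(x @ W1 + b1)   # each hidden unit applies a nonlinear function to its weighted input
    return hidden @ W2 + b2      # output layer; no feedback to previous layers

print(forward(np.array([0.5, -1.2, 3.0])))
```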

Nearest Neighbor:

The k-nearest-neighbors algorithm is a classification algorithm, and it is supervised: it takes a bunch of labelled points and uses them to learn how to label other points. To label a new point, it looks at the labelled points closest to that new point (those are its nearest neighbors), and has those neighbors vote, so whichever label most of the neighbors have is the label for the new point (the “k” is the number of neighbors it checks).
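A from-scratch sketch of the k-nearest-neighbors vote just described; the labelled points are invented for illustration.

```python
from collections import Counter
import numpy as np

def knn_predict(train_X, train_y, new_point, k=3):
    distances = np.linalg.norm(train_X - new_point, axis=1)   # distance to every labelled point
    nearest = np.argsort(distances)[:k]                       # indices of the k closest points
    votes = Counter(train_y[i] for i in nearest)              # the neighbors vote with their labels
    return votes.most_common(1)[0][0]

train_X = np.array([[1, 1], [1, 2], [8, 8], [9, 8]])
train_y = np.array(["A", "A", "B", "B"])
print(knn_predict(train_X, train_y, np.array([2, 1]), k=3))   # -> "A"
```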

---------------------------------------------------------------------------------------------------------------------

Association Rule: Structure of Association Rule

Association Rule
Association rule mining finds interesting associations and relationships among large sets of data items. This rule shows how frequently an itemset occurs in a transaction. A typical example is Market Basket Analysis.

Market Basket Analysis is one of the key techniques used by large retailers to show associations between items. It allows retailers to identify relationships between the items that people buy together frequently.

Given a set of transactions, we can find rules that will predict the occurrence of an item based on
the occurrences of other items in the transaction.

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Before we start defining the rule, let us first see the basic definitions.

Support Count (σ) – Frequency of occurrence of an itemset.

Here, σ({Milk, Bread, Diaper}) = 2

Frequent Itemset – An itemset whose support is greater than or equal to minsup threshold.

Association Rule – An implication expression of the form X -> Y, where X and Y are any two itemsets.

Example: {Milk, Diaper}->{Beer}

Rule Evaluation Metrics (a worked computation on the example transactions follows this list) –

 Support(s) –
The number of transactions that include all items in both {X} and {Y} as a percentage of the total number of transactions. It is a measure of how frequently the collection of items occurs together across all transactions.
Support(X => Y) = σ(X ∪ Y) / (total number of transactions)
 Confidence(c) –
The ratio of the number of transactions that include all items in both {X} and {Y} to the number of transactions that include all items in {X}.
Conf(X => Y) = Supp(X ∪ Y) / Supp(X)
It measures how often the items in Y appear in transactions that also contain the items in X.
 Lift(l) –
The lift of the rule X => Y is the confidence of the rule divided by the expected confidence, assuming that the itemsets X and Y are independent of each other. The expected confidence is simply the support of {Y}.
Lift(X => Y) = Conf(X => Y) / Supp(Y)
A lift value near 1 indicates that X and Y appear together about as often as expected, a value greater than 1 means they appear together more often than expected, and a value less than 1 means they appear together less often than expected. Greater lift values indicate a stronger association.
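As referenced above, here is a minimal sketch that computes support, confidence and lift for the rule {Milk, Diaper} => {Beer} on the five example transactions; Python is used only for illustration.

```python
# Worked computation of support, confidence and lift for {Milk, Diaper} => {Beer}
# using the five example transactions from the table above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"Milk", "Diaper"}, {"Beer"}
supp = support(X | Y)                 # fraction of transactions containing both X and Y
conf = support(X | Y) / support(X)    # Conf(X => Y) = Supp(X u Y) / Supp(X)
lift = conf / support(Y)              # Lift(X => Y) = Conf(X => Y) / Supp(Y)
print(supp, conf, lift)               # 0.4, 0.666..., 1.111...
```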

Association rules are very useful in analyzing datasets. The data is often collected using bar-code scanners in supermarkets. Such databases consist of a large number of transaction records which list all items bought by a customer in a single purchase. The manager could then know if certain groups of items are consistently purchased together and use this data for adjusting store layouts, cross-selling, and promotions based on these statistics.

-------------------------------------------------------------------------------------------------------------------

Apriori Algorithm:
Apriori algorithm, a classic algorithm, is useful in mining frequent itemsets and relevant
association rules. Usually, you operate this algorithm on a database containing a large number of
transactions. One such example is the items customers buy at a supermarket.

It helps the customers buy their items with ease, and enhances the sales performance of the
departmental store.

This algorithm has utility in the field of healthcare as it can help in detecting adverse drug
reactions (ADR) by producing association rules to indicate the combination of medications and
patient characteristics that could lead to ADRs.

Apriori Algorithm – An Odd Name

It has got this odd name because it uses ‘prior’ knowledge of frequent itemset properties. The
credit for introducing this algorithm goes to Rakesh Agrawal and Ramakrishnan Srikant in 1994.
We shall now explore the apriori algorithm implementation in detail.

Apriori algorithm – The Theory

Three significant components comprise the apriori algorithm. They are as follows.

 Support
 Confidence
 Lift
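A compact, hedged sketch of the frequent-itemset step of Apriori on the same five example transactions used earlier. The minimum-support threshold and the simplified candidate-generation step are illustrative choices, not a full implementation of the published algorithm: frequent 1-itemsets are found first, and only frequent itemsets are extended to larger candidates, which is the "prior knowledge" the algorithm's name refers to.

```python
from itertools import chain

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
min_support = 0.6  # illustrative threshold, not taken from the text

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# Level 1: frequent single items.
items = sorted(set(chain.from_iterable(transactions)))
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
all_frequent = list(frequent)

k = 2
while frequent:
    # Candidate k-itemsets are unions of frequent (k-1)-itemsets; infrequent ones are pruned.
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    frequent = [c for c in candidates if support(c) >= min_support]
    all_frequent.extend(frequent)
    k += 1

for itemset in all_frequent:
    print(set(itemset), support(itemset))
```

Association rules (with their confidence and lift) would then be generated from these frequent itemsets, as in the metrics section above.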
Introduction to Clustering
It is basically a type of unsupervised learning method. An unsupervised learning method is a method in which we draw references from datasets consisting of input data without labeled responses. Generally, it is used as a process to find meaningful structure, explanatory underlying processes, generative features, and groupings inherent in a set of examples.

Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to the data points in other groups. It is basically a grouping of objects on the basis of similarity and dissimilarity between them.

Clustering Methods

In this section we describe the most well-known clustering algorithms. The main reason for having many clustering methods is the fact that the notion of “cluster” is not precisely defined (Estivill-Castro, 2000). Consequently many clustering methods have been developed, each of which uses a different induction principle. Farley and Raftery (1998) suggest dividing the clustering methods into two main groups: hierarchical and partitioning methods. Han and Kamber (2001) suggest categorizing the methods into three additional main categories: density-based methods, model-based clustering and grid-based methods. An alternative categorization based on the induction principle of the various clustering methods is presented in (Estivill-Castro, 2000).

5.1 Hierarchical Methods

These methods construct the clusters by recursively partitioning the instances in either a top-down or bottom-up fashion. These methods can be sub-divided as follows:

Agglomerative hierarchical clustering — Each object initially represents a cluster of its own. Then clusters are successively merged until the desired cluster structure is obtained.

Divisive hierarchical clustering — All objects initially belong to one cluster. Then the cluster is divided into sub-clusters, which are successively divided into their own sub-clusters. This process continues until the desired cluster structure is obtained.

The result of the hierarchical methods is a dendrogram, representing the nested grouping of objects and similarity levels at which groupings change. A clustering of the data objects is obtained by cutting the dendrogram at the desired similarity level. The merging or division of clusters is performed according to some similarity measure, chosen so as to optimize some criterion (such as a sum of squares). The hierarchical clustering methods could be further divided according to the manner in which the similarity measure is calculated (Jain et al., 1999):

Single-link clustering (also called the connectedness, the minimum method or the nearest neighbor method) — methods that consider the distance between two clusters to be equal to the shortest distance from any member of one cluster to any member of the other cluster. If the data consist of similarities, the similarity between a pair of clusters is considered to be equal to the greatest similarity from any member of one cluster to any member of the other cluster (Sneath and Sokal, 1973).

Complete-link clustering (also called the diameter, the maximum method or the furthest neighbor method) - methods that consider the distance between two clusters to be equal to the longest distance from any member of one cluster to any member of the other cluster (King, 1967).

Average-link clustering (also called the minimum variance method) - methods that consider the distance between two clusters to be equal to the average distance from any member of one cluster to any member of the other cluster. Such clustering algorithms may be found in (Ward, 1963) and (Murtagh, 1984).

The disadvantages of the single-link clustering and the average-link clustering can be summarized as follows (Guha et al., 1998): Single-link clustering has a drawback known as the “chaining effect”: a few points that form a bridge between two clusters cause the single-link clustering to unify these two clusters into one. Average-link clustering may cause elongated clusters to split and portions of neighboring elongated clusters to merge.

The complete-link clustering methods usually produce more compact clusters and more useful hierarchies than the single-link clustering methods, yet the single-link methods are more versatile. Generally, hierarchical methods are characterized by the following strengths:

Versatility — The single-link methods, for example, maintain good performance on data sets containing non-isotropic clusters, including well-separated, chain-like and concentric clusters.

Multiple partitions — Hierarchical methods produce not one partition, but multiple nested partitions, which allow different users to choose different partitions, according to the desired similarity level. The hierarchical partition is presented using the dendrogram.

The main disadvantages of the hierarchical methods are:

Inability to scale well — The time complexity of hierarchical algorithms is at least O(m²) (where m is the total number of instances), which is non-linear in the number of objects. Clustering a large number of objects using a hierarchical algorithm is also characterized by huge I/O costs.

Hierarchical methods can never undo what was done previously. Namely, there is no back-tracking capability.
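The sketch below (assuming SciPy, which the text does not mention) builds the merge hierarchy with the single-, complete- and average-link rules described above and then cuts the resulting dendrogram into a fixed number of clusters; the toy points are invented.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],   # one tight group
              [5.0, 5.0], [5.2, 5.1], [5.1, 4.9]])  # another tight group

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                    # dendrogram encoded as a merge table
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into 2 clusters
    print(method, labels)
```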

5.2 Partitioning Methods

Partitioning methods relocate instances by moving them from one cluster to another, starting from an initial partitioning. Such methods typically require that the number of clusters will be pre-set by the user. To achieve global optimality in partition-based clustering, an exhaustive enumeration process of all possible partitions is required. Because this is not feasible, certain greedy heuristics are used in the form of iterative optimization. Namely, a relocation method iteratively relocates points between the k clusters. The following subsections present various types of partitioning methods.

5.2.1 Error Minimization Algorithms

These algorithms, which tend to work well with isolated and compact clusters, are the most intuitive and frequently used methods. The basic idea is to find a clustering structure that minimizes a certain error criterion which measures the “distance” of each instance to its representative value. The most well-known criterion is the Sum of Squared Error (SSE), which measures the total squared Euclidean distance of instances to their representative values. SSE may be globally optimized by exhaustively enumerating all partitions, which is very time-consuming, or by giving an approximate solution (not necessarily leading to a global minimum) using heuristics. The latter option is the most common alternative.

The simplest and most commonly used algorithm employing a squared error criterion is the K-means algorithm. This algorithm partitions the data into K clusters (C1, C2, . . . , CK), represented by their centers or means. The center of each cluster is calculated as the mean of all the instances belonging to that cluster.

Figure 15.1 presents the pseudo-code of the K-means algorithm. The algorithm starts with an initial set of cluster centers, chosen at random or according to some heuristic procedure. In each iteration, each instance is assigned to its nearest cluster center according to the Euclidean distance between the two. Then the cluster centers are re-calculated. The center of each cluster is calculated as the mean of all the instances belonging to that cluster:

μ_k = (1 / N_k) * Σ_{q=1..N_k} x_q

where N_k is the number of instances belonging to cluster k and μ_k is the mean of cluster k.

A number of convergence conditions are possible. For example, the search may stop when the partitioning error is not reduced by the relocation of the centers. This indicates that the present partition is locally optimal. Other stopping criteria can also be used, such as exceeding a pre-defined number of iterations.

Input: S (instance set), K (number of clusters)
Output: clusters
1: Initialize K cluster centers.
2: while termination condition is not satisfied do
3:   Assign instances to the closest cluster center.
4:   Update cluster centers based on the assignment.
5: end while
Figure 15.1. K-means Algorithm.

The K-means algorithm may be viewed as a gradient-descent procedure, which begins with an initial set of K cluster centers and iteratively updates it so as to decrease the error function. A rigorous proof of the finite convergence of the K-means type algorithms is given in (Selim and Ismail, 1984). The complexity of T iterations of the K-means algorithm performed on a sample size of m instances, each characterized by N attributes, is O(T * K * m * N).

This linear complexity is one of the reasons for the popularity of the K-means algorithm. Even if the number of instances is substantially large (which often is the case nowadays), this algorithm is computationally attractive. Thus, the K-means algorithm has an advantage in comparison to other clustering methods (e.g. hierarchical clustering methods), which have non-linear complexity. Other reasons for the algorithm’s popularity are its ease of interpretation, simplicity of implementation, speed of convergence and adaptability to sparse data (Dhillon and Modha, 2001).

The Achilles heel of the K-means algorithm involves the selection of the initial partition. The algorithm is very sensitive to this selection, which may make the difference between a global and a local minimum. Being a typical partitioning algorithm, the K-means algorithm works well only on data sets having isotropic clusters, and is not as versatile as single-link algorithms, for instance. In addition, this algorithm is sensitive to noisy data and outliers (a single outlier can increase the squared error dramatically); it is applicable only when the mean is defined (namely, for numeric attributes); and it requires the number of clusters in advance, which is not trivial when no prior knowledge is available.

The use of the K-means algorithm is often limited to numeric attributes. Huang (1998) presented the K-prototypes algorithm, which is based on the K-means algorithm but removes numeric data limitations while preserving its efficiency. The algorithm clusters objects with numeric and categorical attributes in a way similar to the K-means algorithm. The similarity measure on numeric attributes is the squared Euclidean distance; the similarity measure on the categorical attributes is the number of mismatches between objects and the cluster prototypes.

Another partitioning algorithm which attempts to minimize the SSE is the K-medoids or PAM (partition around medoids — (Kaufmann and Rousseeuw, 1987)). This algorithm is very similar to the K-means algorithm. It differs from the latter mainly in its representation of the different clusters. Each cluster is represented by the most centric object in the cluster, rather than by the implicit mean that may not belong to the cluster. The K-medoids method is more robust than the K-means algorithm in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean. However, its processing is more costly than the K-means method. Both methods require the user to specify K, the number of clusters.

Other error criteria can be used instead of the SSE. Estivill-Castro (2000) analyzed the total absolute error criterion. Namely, instead of summing up the squared error, he suggests summing up the absolute error. While this criterion is superior in regard to robustness, it requires more computational effort.

5.2.2 Graph-Theoretic Clustering

Graph-theoretic methods are methods that produce clusters via graphs. The edges of the graph connect the instances represented as nodes. A well-known graph-theoretic algorithm is based on the Minimal Spanning Tree — MST (Zahn, 1971). Inconsistent edges are edges whose weight (in the case of clustering, length) is significantly larger than the average of nearby edge lengths. Another graph-theoretic approach constructs graphs based on limited neighborhood sets (Urquhart, 1982).

There is also a relation between hierarchical methods and graph-theoretic clustering:

Single-link clusters are subgraphs of the MST of the data instances. Each subgraph is a connected component, namely a set of instances in which each instance is connected to at least one other member of the set, so that the set is maximal with respect to this property. These subgraphs are formed according to some similarity threshold.

Complete-link clusters are maximal complete subgraphs, formed using a similarity threshold. A maximal complete subgraph is a subgraph such that each node is connected to every other node in the subgraph and the set is maximal with respect to this property.
