Data Mining Fifth Lecture
Frequent Itemsets
A set of items is referred to as an itemset. An itemset that
contains k items is a k-itemset. The occurrence frequency of an
itemset is the number of transactions that contain the itemset.
An itemset satisfies minimum support if its occurrence frequency
is greater than or equal to the product of min_sup and the total
number of transactions in D. The number of transactions required
for the itemset to satisfy minimum support is therefore referred
to as the minimum support count.
If an itemset satisfies minimum support, then it is a
frequent itemset. The set of frequent k-itemsets is
commonly denoted by Lk.
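In symbols (a restatement of the definitions above, with count(X) denoting the occurrence frequency of an itemset X and |D| the total number of transactions in D):

    % X satisfies minimum support, and is therefore frequent, iff
    \mathrm{count}(X) \;\ge\; \mathit{min\_sup} \times |D|
    % and the set of frequent k-itemsets is
    L_k = \{\, X : |X| = k,\ \mathrm{count}(X) \ge \mathit{min\_sup} \times |D| \,\}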
Example 2.1
Transaction-ID Items_bought
-------------------------------------------
2000 A, B, C
1000 A, C
4000 A, D
5000 B, E, F
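Applying the definitions to this data, a minimal sketch (not from the lecture; the 50% min_sup threshold is an assumed value) counts supports by brute force:

    from itertools import combinations

    # Transactions from Example 2.1 (transaction ID -> items bought)
    transactions = {
        2000: {"A", "B", "C"},
        1000: {"A", "C"},
        4000: {"A", "D"},
        5000: {"B", "E", "F"},
    }
    min_sup = 0.5                             # assumed threshold: 50%
    min_count = min_sup * len(transactions)   # minimum support count = 2

    items = sorted(set().union(*transactions.values()))
    for k in (1, 2):
        frequent = []
        for itemset in combinations(items, k):
            count = sum(set(itemset) <= t for t in transactions.values())
            if count >= min_count:
                frequent.append((itemset, count))
        print(f"L{k}:", frequent)

With this threshold, L1 contains {A} (count 3), {B} (count 2), and {C} (count 2), while L2 contains only {A, C} (count 2).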
3. CLASSIFICATION
Classification is the process of learning a model
that describes different classes of data. The
classes are predetermined.
Example: In a banking application, customers who apply for a
credit card may be classified as a good risk, a fair risk, or a
poor risk. Because the classes are known in advance, this type
of activity is also called supervised learning.
Once the model is built, it can be used to classify new data.
The first step, learning the model, is accomplished by using a
training set of data that has already been classified. Each
record in the training data contains an attribute, called the
class label, that indicates which class the record belongs to.
The model that is produced is usually in the form of a
decision tree or a set of rules.
Some of the important issues with regard to the model and
the algorithm that produces it include:
the model's ability to predict the correct class of new data,
the computational cost associated with the algorithm,
the scalability of the algorithm.
Let us examine the approach where the model is in the form of a
decision tree.
A decision tree is simply a graphical representation of the
description of each class or, in other words, a representation
of the classification rules.
Example 3.1
Suppose that we have a database of customers on the
AllElectronics mailing list. The database describes attributes
of the customers, such as their name, age, income, occupation,
and credit rating. The customers can be classified as to whether
or not they have purchased a computer at AllElectronics.
Suppose that new customers are added to the database and that
you would like to notify these customers of an upcoming computer
sale. Sending promotional literature to every new customer in
the database can be quite costly. A more cost-efficient method
would be to target only those new customers who are likely to
purchase a new computer. A classification model can be
constructed and used for this purpose.
Figure 2 shows a decision tree for the concept buys_computer,
indicating whether or not a customer at AllElectronics is likely
to purchase a computer.
Each internal node represents a test on an attribute. Each leaf node represents a class.
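To make the node/leaf distinction concrete, here is a minimal sketch of how such a tree could be represented and applied. The attributes and splits below are hypothetical stand-ins, not the exact tree of Figure 2:

    # Internal nodes test an attribute; leaves hold a class label.
    # Hypothetical splits for the buys_computer concept (illustration only).
    tree = ("age", {
        "youth":       ("student", {"no": "no", "yes": "yes"}),
        "middle_aged": "yes",
        "senior":      ("credit_rating", {"fair": "yes", "excellent": "no"}),
    })

    def classify(node, record):
        if isinstance(node, str):       # leaf node: a class label
            return node
        attribute, branches = node      # internal node: test an attribute
        return classify(branches[record[attribute]], record)

    print(classify(tree, {"age": "youth", "student": "yes"}))  # -> "yes"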
Decision Trees
For example, consider the widely referenced Iris data classification
problem introduced by Fisher (1936).
The purpose of the analysis is to learn how one can discriminate
between the three types of flowers, based on the four measures of
width and length of petals and sepals.
A classification tree will determine a set of logical if-then conditions
(instead of linear equations) for predicting or classifying cases.
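As a brief sketch of this idea in code (using scikit-learn's bundled copy of Fisher's Iris data; the depth limit is an arbitrary choice, and this is an illustration rather than any tool used in the lecture), a tree can be fitted and its if-then conditions printed:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()
    # Fit a small tree on the four petal/sepal measurements
    clf = DecisionTreeClassifier(max_depth=2, random_state=0)
    clf.fit(iris.data, iris.target)

    # Print the learned if-then conditions
    print(export_text(clf, feature_names=list(iris.feature_names)))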
Advantages of tree methods
Simplicity of results. In most cases, the interpretation of
results summarized in a tree is very simple. This simplicity is
useful not only for rapid classification of new observations;
tree methods also often yield a much simpler "model" for
explaining why observations are classified or predicted in a
particular manner.
E.g., when analyzing business problems, it is much easier to
present a few simple if-then statements to management than some
elaborate equations.
Regression Trees. The same tree-building approach can be used to predict continuous values rather than discrete classes.
Artificial Intelligence for Data Mining
Neural Network Characteristics
The number of inputs and outputs is variable.
Biological Background
A neuron: a many-input / one-output unit.
[Figure: a neuron modeled as a node with incoming and outgoing connections.]
Perceptron Training
Neural Network Learning
From experience: examples / training data
Evaluate output
Adapt weights
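A minimal perceptron training loop illustrating this evaluate-output / adapt-weights cycle (a generic sketch, not code from the lecture; the learning rate and the AND-function training data are assumed):

    # Perceptron: evaluate the output, then adapt the weights on each error.
    data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]  # logical AND
    w = [0.0, 0.0]
    bias, lr = 0.0, 0.1                      # lr: assumed learning rate

    for epoch in range(20):
        for x, target in data:
            output = int(w[0] * x[0] + w[1] * x[1] + bias > 0)   # evaluate output
            error = target - output
            w = [wi + lr * error * xi for wi, xi in zip(w, x)]   # adapt weights
            bias += lr * error

    print(w, bias)   # weights that realize the AND function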
Topologies: Back-Propagated Networks
Inputs are put through a hidden layer before the output layer.
All nodes are connected between layers.
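A sketch of a forward pass through such a topology (the layer sizes and random weights are assumptions for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    # Fully connected layers: 3 inputs -> 4 hidden units -> 2 outputs
    W_hidden = rng.normal(size=(3, 4))
    W_output = rng.normal(size=(4, 2))

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([0.5, -1.0, 2.0])        # example input
    hidden = sigmoid(x @ W_hidden)        # inputs pass through the hidden layer...
    output = sigmoid(hidden @ W_output)   # ...before the output layer
    print(output)

Back-propagation would then adjust W_hidden and W_output from the output error; only the forward pass is sketched here.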
BP Network: Supervised Training
The desired output of each training example is known; the error between actual and desired output drives the weight updates toward improved performance.
Neural Network Topology Characteristics
Set of inputs
Set of outputs
Applications of Neural Networks
Prediction: weather, stocks, disease
ANN and classification
ANNs can be classified into two categories: supervised and
unsupervised networks. Adaptive methods that attempt to reduce
the output error are supervised learning methods, whereas those
that develop internal representations without sample outputs are
called unsupervised learning methods.
ANNs can learn from information on a specific
problem. They perform well on classification tasks
and are therefore useful in data mining.
Information processing at a neuron in an ANN
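The figure itself is not reproduced here; in standard notation, a neuron with inputs x_1, ..., x_n, connection weights w_1, ..., w_n, bias b, and activation function f computes

    y = f\!\left( \sum_{i=1}^{n} w_i x_i + b \right)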
Machine Learning Algorithms
STATISTICA Machine Learning provides a number of advanced
statistical methods for handling regression and classification
tasks with multiple dependent and independent variables.
These methods include Support Vector Machines (SVM) for
regression and classification.
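STATISTICA is a commercial package; as a rough open-source illustration of the same two uses of SVMs (the toy data below is assumed), scikit-learn provides both a classifier and a regressor:

    from sklearn.svm import SVC, SVR

    X = [[0.0], [1.0], [2.0], [3.0]]     # toy inputs (assumed for illustration)
    y_class = [0, 0, 1, 1]               # classification targets
    y_reg = [0.1, 1.1, 1.9, 3.2]         # regression targets

    clf = SVC(kernel="rbf").fit(X, y_class)   # SVM for classification
    reg = SVR(kernel="rbf").fit(X, y_reg)     # SVM for regression
    print(clf.predict([[1.5]]), reg.predict([[1.5]]))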
Clustering of a set of objects based on the k-means method.
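The figure is not reproduced here; a minimal sketch of the k-means method itself, alternating the assignment and centroid-update steps (the data points and k = 2 are assumed):

    import numpy as np

    def k_means(points, k, iterations=10, seed=0):
        rng = np.random.default_rng(seed)
        centroids = points[rng.choice(len(points), k, replace=False)]
        for _ in range(iterations):
            # Assignment step: each point joins its nearest centroid's cluster
            dists = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: each centroid moves to the mean of its cluster
            centroids = np.array([points[labels == j].mean(axis=0)
                                  for j in range(k)])
        return labels, centroids

    points = np.array([[1, 1], [1.5, 2], [8, 8], [9, 9.5]])
    print(k_means(points, k=2))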
Hierarchical Clustering
A hierarchical clustering method works by grouping
data objects into a tree of clusters.
In general, there are two types of hierarchical
clustering methods:
Agglomerative hierarchical clustering: This bottom-up
strategy starts by placing each object in its own cluster and
then merges these atomic clusters into larger and larger
clusters, until all of the objects are in a single cluster or
until certain termination conditions are satisfied. Most
hierarchical clustering methods belong to this category.
They differ only in their definition of intercluster similarity.
Divisive hierarchical clustering: This top-down strategy
does the reverse of agglomerative hierarchical clustering by
starting with all objects in one cluster. It subdivides the
cluster into smaller and smaller pieces, until each object
forms a cluster on its own or until certain termination
conditions are satisfied, such as a desired number of clusters
being obtained or the distance between the two closest clusters
rising above a certain threshold.
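A brief agglomerative sketch using SciPy (the five 2-D points standing in for objects a-e are assumed, and 'single' linkage is just one possible definition of intercluster similarity):

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    # Five objects, echoing the {a, b, c, d, e} example (coordinates assumed)
    objects = np.array([[1, 1], [1.2, 1.1], [5, 5], [5.1, 5.2], [9, 9]])
    merges = linkage(objects, method="single")   # bottom-up merging of clusters

    # Terminate once the two closest clusters are farther apart than 2.0
    labels = fcluster(merges, t=2.0, criterion="distance")
    print(labels)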
Agglomerative and divisive hierarchical clustering on data objects {a, b, c, d, e}
7. POTENTIAL APPLICATIONS OF DM
Market Analysis and Management
Fraud Detection and Management
Applications
widely used in health care, retail, credit card services,
telecommunications (phone card fraud), etc.
Approach
use historical data to build models of fraudulent behavior
and use data mining to help identify similar instances
Examples
auto insurance: detect a group of people who stage
accidents to collect on insurance
money laundering: detect suspicious money transactions
(US Treasury's Financial Crimes Enforcement Network)
medical insurance: detect "professional" patients and rings
of doctors and rings of references
Some representative data mining tools