Unit 4
Association rules are created by analyzing data for frequent if/then patterns and using the criteria
support and confidence to identify the most important relationships. Support indicates how
frequently the items appear together in the database. Confidence indicates how often the if/then
statement has been found to be true, given that the items in the "if" part appear.
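As a minimal sketch of these two measures, assuming a small in-memory list of transactions (the transaction contents and the rule {bread} -> {butter} below are hypothetical illustrations, not data from the text):

# Hypothetical transactions; each is a set of purchased items.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # Of the transactions containing the antecedent, the fraction
    # that also contain the consequent.
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"bread", "butter"}, transactions))        # 0.5
print(confidence({"bread"}, {"butter"}, transactions))   # ~0.67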
In data mining, association rules are useful for analyzing and predicting customer behavior. They
play an important part in shopping basket data analysis, product clustering, catalog design and
store layout.
Programmers use association rules to build programs capable of machine learning. Machine
learning is a type of artificial intelligence (AI) that seeks to build programs with the ability to
become more efficient without being explicitly programmed.
The goal of the techniques described in this section is to detect relationships or associations
between specific values of categorical variables in large data sets. These powerful exploratory
techniques have a wide range of applications in many areas of business practice and also
research - from the analysis of consumer preferences or human resource management, to the
history of language. These techniques enable analysts and researchers to uncover hidden patterns
in large data sets, such as "customers who order product A often also order product B or C" or
"employees who said positive things about initiative X also frequently complain about issue Y
but are happy with issue Z."
Association rules mining has many applications other than market basket analysis, including
applications in marketing, customer segmentation, medicine, electronic commerce,
bioinformatics and finance.
How do Association Rules Work?
The usefulness of this technique to address unique data mining problems is best illustrated in a simple example. Suppose you are collecting data at the
mining problems is best illustrated in a simple example. Suppose you are collecting data at the
checkout cash registers at a large bookstore. Each customer transaction is logged in a database,
and consists of the titles of the books purchased by the respective customer, perhaps additional
magazine titles and other gift items that were purchased, and so on. Hence, each record in the
database will represent one customer (transaction), and may consist of a single book purchased
by that customer, or it may consist of many (perhaps hundreds of) different items that were
purchased, arranged in an arbitrary order depending on the order in which the different items
(books, magazines, and so on) came down the conveyor belt at the cash register. The purpose of
the analysis is to find associations between the items that were purchased, i.e., to derive
association rules that identify the items and co-occurrences of different items that appear with the
greatest (co-) frequencies. For example, you want to learn which books are likely to be
purchased by a customer who you know already purchased (or is about to purchase) a particular
book. This type of information could then quickly be used to suggest to the customer those
additional titles. You may already be "familiar" with the results of these types of analyses, if you
are a customer of various on-line (Web-based) retail businesses; many times when making a
purchase on-line, the vendor will suggest similar items (to the ones purchased by you) at the time
of "check-out", based on some rules such as "customers who buy book title A are also likely to
purchase book title B," and so on.
Many interesting algorithms have been proposed to discover association rules. A key feature
common to these methods is that they assume the underlying database is enormous, and they
therefore require multiple passes over the database.
5.3.2 Classification
Definition: Classification is a Data Mining (machine learning) technique used to predict group
membership for data instances. For example, you may wish to use classification to predict if the
weather on a particular day will be “sunny”, “rainy” or “cloudy”. Popular classification
techniques include decision trees and neural networks.
Regression and classification are two of the more popular predictive data mining techniques.
Classification involves finding rules that partition the data into disjoint groups. The input for the
classification is the training data set, whose class labels are already known. Classification
analyzes the training data set and constructs a model based on the class label, and aims to assign
a class label to future unlabelled records. Since the class field is known, this type of
classification is known as supervised learning. Such a classification process generates a set of
classification rules, which can be used to classify future data and to develop a better
understanding of each class in the database.
Applications include credit card analysis, banking, medical applications and the like.
Classification Example
Problem:
Given a new automobile insurance applicant, should he or she be classified as low risk, medium
risk or high risk?
Classification rules for the above problem could use a variety of data, such as the customer's
educational level, salary, age, etc.
Classification Rules:
Rule 1:
P.credit = “Excellent”
Rule 2:
∀ person P, P.degree = bachelors and (P.income ≥ 25,000 and P.income ≤ 75,000) ⇒ P.credit =
“Good”
A Decision Tree is a predictive model that, as its name implies, can be viewed as a tree.
Specifically, each branch of the tree is a classification question, and the leaves of the tree are
partitions of the dataset (database table/file) with their classification.
In the above classification, customers are classified into four groups: Bad, Good, Average and
Excellent. At any moment in time, a customer falls into exactly one of these groups.
5.3.3 Regression
Regression is the oldest and most well-known statistical technique that the data mining
community utilizes. Basically, regression takes a numerical dataset and develops a mathematical
formula that fits the data (e.g. y = a + bx, where y is the dependent variable and x is the
independent variable). When you're ready to use the results to predict future behavior, you
simply take your new data, plug it into the developed formula, and you've got a prediction. The
major limitation of this technique is that it only works well with continuous quantitative data
(like weight, speed or age).
If the data is categorical, where order is not significant (like color, name or gender), then you are
better off choosing another technique.
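As a minimal sketch of this idea, the formula y = a + bx can be fitted by ordinary least squares; the (x, y) pairs below are hypothetical:

# Hypothetical data points (x = years of experience, y = salary in $1000s).
xs = [1, 2, 3, 4, 5]
ys = [30, 35, 42, 48, 55]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares estimates of the slope b and intercept a in y = a + b*x.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

# Plug new data into the fitted formula to get a prediction.
print(a + b * 6)   # predicted y for x = 6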
There are two forms of data analysis that can be used for extracting models describing important
classes or to predict future data trends. These two forms are as follows −
Classification
Prediction
Classification models predict categorical class labels, and prediction models predict continuous-
valued functions. For example, we can build a classification model to categorize bank loan
applications as either safe or risky, or a prediction model to predict the expenditures in dollars of
potential customers on computer equipment, given their income and occupation.
What is classification?
Following are the examples of cases where the data analysis task is Classification −
A bank loan officer wants to analyze the data in order to know which customers (loan
applicants) are risky and which are safe.
A marketing manager at a company needs to analyze a customer with a given profile to
predict whether that customer will buy a new computer.
In both of the above examples, a model or classifier is constructed to predict the categorical
labels. These labels are risky or safe for loan application data and yes or no for marketing data.
What is prediction?
Following are the examples of cases where the data analysis task is Prediction −
Suppose the marketing manager needs to predict how much a given customer will spend during a
sale at his company. In this example we are asked to predict a numeric value. Therefore the
data analysis task is an example of numeric prediction. In this case, a model or predictor is
constructed that predicts a continuous-valued function, or ordered value.
Note − Regression analysis is a statistical methodology that is most often used for numeric
prediction.
With the help of the bank loan application that we have discussed above, let us understand the
working of classification. The Data Classification process includes two steps −
Building the Classifier (Learning Step) − In this step, a classification algorithm builds the
classifier by analyzing a training set made up of database tuples and their associated class labels.
Using the Classifier (Classification Step) − In this step, the classifier is used for classification.
Here the test data is used to estimate the accuracy of the classification rules. The classification
rules can be applied to the new data tuples if the accuracy is considered acceptable.
Preparing the data for classification and prediction involves the following activities −
Data Cleaning − Data cleaning involves removing the noise and treating missing values.
The noise is removed by applying smoothing techniques, and the problem of missing
values is solved by replacing a missing value with the most commonly occurring value
for that attribute.
Relevance Analysis − The database may also have irrelevant attributes. Correlation
analysis is used to know whether any two given attributes are related.
Data Transformation and Reduction − The data can be transformed by any of the
following methods.
o Normalization − The data is transformed using normalization. Normalization
involves scaling all values for a given attribute so that they fall within a small
specified range. Normalization is used when, in the learning step, neural
networks or methods involving distance measurements are used. (A small
normalization sketch follows this list.)
o Generalization − The data can also be transformed by generalizing it to a
higher-level concept. For this purpose we can use concept hierarchies.
Note − Data can also be reduced by some other methods such as wavelet transformation,
binning, histogram analysis, and clustering.
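As a minimal sketch of the normalization step, using min-max scaling into the range [0, 1] (the income values below are hypothetical):

incomes = [12000, 35000, 54000, 98000]   # hypothetical attribute values

lo, hi = min(incomes), max(incomes)

# Min-max normalization: rescale each value into the range [0, 1].
normalized = [(v - lo) / (hi - lo) for v in incomes]

print(normalized)   # [0.0, 0.267..., 0.488..., 1.0]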
Here are the criteria for comparing the methods of Classification and Prediction −
Accuracy − The accuracy of a classifier refers to its ability to predict the class label
correctly; the accuracy of a predictor refers to how well it can guess the value of the
predicted attribute for new data.
Speed − This refers to the computational cost of generating and using the classifier or
predictor.
Robustness − This refers to the ability of the classifier or predictor to make correct
predictions from noisy data.
Scalability − This refers to the ability to construct the classifier or predictor efficiently,
given a large amount of data.
Interpretability − This refers to the extent to which the classifier or predictor can be
understood.
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal
node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node
holds a class label. The topmost node in the tree is the root node.
Consider, for example, a decision tree for the concept buy_computer, which indicates whether a
customer at a company is likely to buy a computer. Each internal node represents a test on an
attribute, and each leaf node represents a class.
The benefits of having a decision tree are as follows −
It does not require any domain knowledge.
It is easy to comprehend.
The learning and classification steps of a decision tree are simple and fast.
A machine learning researcher named J. Ross Quinlan developed a decision tree algorithm known
as ID3 (Iterative Dichotomiser) in 1980. Later, he presented C4.5, which was the successor of
ID3. ID3 and C4.5 adopt a greedy approach: there is no backtracking, and the trees are
constructed in a top-down recursive divide-and-conquer manner.
Input:
Data partition, D, which is a set of training tuples and their associated class labels.
attribute_list, the set of candidate attributes.
Attribute_selection_method, a procedure to determine the splitting criterion that best
partitions the data tuples into individual classes. This criterion includes a
splitting_attribute and either a splitting point or a splitting subset.
Output:
A decision tree
Method
create a node N;
if the tuples in D are all of the same class C then
    return N as a leaf node labeled with class C;
if attribute_list is empty then
    return N as a leaf node labeled with the majority class in D;
apply Attribute_selection_method(D, attribute_list) to find the best splitting_criterion;
label node N with the splitting_criterion;
for each outcome j of the splitting_criterion
    let Dj be the set of data tuples in D satisfying outcome j;
    if Dj is empty then
        attach a leaf labeled with the majority class in D to node N;
    else
        attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
end for
return N;
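For a runnable counterpart to the pseudocode above, the following sketch uses scikit-learn's DecisionTreeClassifier; the library choice and the tiny encoded buy_computer-style training set are assumptions of this example, not part of the algorithm described here:

from sklearn.tree import DecisionTreeClassifier

# Hypothetical training tuples.
# Features: [age_group, income_level, student] encoded as integers.
X = [
    [0, 2, 0],   # youth, high income, not a student
    [0, 2, 1],   # youth, high income, student
    [1, 1, 0],   # middle-aged, medium income, not a student
    [2, 0, 1],   # senior, low income, student
]
y = ["no", "yes", "yes", "yes"]   # class label: buys_computer

# Entropy-based splitting is roughly in the spirit of ID3/C4.5.
clf = DecisionTreeClassifier(criterion="entropy")
clf.fit(X, y)

# Predict the class of a new, unlabeled tuple.
print(clf.predict([[0, 1, 1]]))   # e.g. ['yes']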
Tree Pruning
Tree pruning is performed in order to remove anomalies in the training data due to noise or
outliers. The pruned trees are smaller and less complex.
Cost Complexity
The cost complexity of a tree is measured by the number of leaves in the tree and the error rate of
the tree.
Bayes' Theorem
Bayes' Theorem is named after Thomas Bayes. There are two types of probabilities −
Posterior Probability, P(H|X) − the probability that hypothesis H holds given the observed data tuple X.
Prior Probability, P(H) − the initial probability of hypothesis H, before any data is observed.
Bayes' Theorem relates them as P(H|X) = P(X|H) P(H) / P(X).
Bayesian Belief Networks specify joint conditional probability distributions. They are also
known as Belief Networks, Bayesian Networks, or Probabilistic Networks.
Consider, as an example, a directed acyclic graph over six Boolean variables, including
FamilyHistory, Smoker, LungCancer and PositiveXray. The arcs in such a graph represent causal
knowledge. For example, lung cancer is influenced by a person's family history of lung cancer,
as well as by whether or not the person is a smoker. It is worth noting that the variable
PositiveXray is independent of whether the patient has a family history of lung cancer or
whether the patient is a smoker, given that we know the patient has lung cancer.
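As a minimal sketch of how a node's conditional probability distribution can be represented and combined, with entirely hypothetical probabilities for P(LungCancer | FamilyHistory, Smoker):

# Hypothetical conditional probability table for LungCancer given its parents.
# Keys are (family_history, smoker); values are P(LungCancer = True | parents).
cpt_lung_cancer = {
    (True,  True):  0.8,
    (True,  False): 0.5,
    (False, True):  0.7,
    (False, False): 0.1,
}

# Hypothetical prior probabilities for the parent nodes (assumed to be roots).
p_family_history = 0.3
p_smoker = 0.4

# Joint probability P(FamilyHistory=True, Smoker=True, LungCancer=True):
# product of each node's probability given its parents.
joint = p_family_history * p_smoker * cpt_lung_cancer[(True, True)]
print(joint)   # 0.3 * 0.4 * 0.8 = 0.096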
IF-THEN Rules
A rule-based classifier makes use of a set of IF-THEN rules for classification. We can express a
rule in the following form −
IF condition THEN conclusion
Points to remember −
The IF part of the rule is called the rule antecedent or precondition, and the THEN part is
called the rule consequent.
If the condition holds true for a given tuple, then the antecedent is satisfied.
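As a minimal sketch of a rule-based classifier applying IF-THEN rules to a data tuple (the attribute names, rules and default class below are hypothetical):

# Each rule is (condition, conclusion); the condition tests a tuple (a dict).
rules = [
    (lambda t: t["income"] > 75000, "Excellent"),
    (lambda t: t["degree"] == "bachelors" and 25000 <= t["income"] <= 75000, "Good"),
]

def classify(tuple_, rules, default="Average"):
    # Return the conclusion of the first rule whose antecedent the tuple satisfies.
    for condition, conclusion in rules:
        if condition(tuple_):
            return conclusion
    return default   # no rule antecedent was satisfied

applicant = {"degree": "bachelors", "income": 40000}
print(classify(applicant, rules))   # Good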
Rule Extraction
Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from a
decision tree.
Points to remember −
One rule is created for each path from the root to the leaf node.
To form a rule antecedent, each splitting criterion is logically ANDed.
The leaf node holds the class prediction, forming the rule consequent.
The sequential covering algorithm can be used to extract IF-THEN rules directly from the
training data; we do not need to generate a decision tree first. In this algorithm, each rule for a
given class covers many of the tuples of that class.
Some of the sequential covering algorithms are AQ, CN2, and RIPPER. As per the general
strategy, the rules are learned one at a time. Each time a rule is learned, the tuples covered by
that rule are removed, and the process continues for the rest of the tuples. (By contrast, the path
to each leaf of a decision tree corresponds to a rule.)
Note − Decision tree induction can be considered as learning a set of rules simultaneously.
The following is the sequential learning algorithm, in which rules are learned for one class at a
time. When learning a rule for class Ci, we want the rule to cover all the tuples of class Ci only
and no tuple from any other class.
for each class c do
    repeat
        Rule = Learn_One_Rule(D, Att_vals, c);
        remove the tuples covered by Rule from D;
        add Rule to Rule_set;
    until termination condition;
end for
return Rule_set;
Rule Pruning
The assessment of quality is made on the original set of training data. The rule may
perform well on the training data but less well on subsequent data. That is why rule
pruning is required.
The rule is pruned by removing a conjunct. The rule R is pruned if the pruned version of R
has greater quality, as assessed on an independent set of tuples.
FOIL is one of the simplest and most effective methods for rule pruning. For a given rule R,
FOIL_Prune(R) = (pos − neg) / (pos + neg)
where pos and neg are the number of positive and negative tuples covered by R, respectively.
Note − This value will increase with the accuracy of R on the pruning set. Hence, if the
FOIL_Prune value is higher for the pruned version of R, then we prune R.
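Expressed in code, with hypothetical counts of covered tuples:

def foil_prune(pos, neg):
    # FOIL_Prune(R) = (pos - neg) / (pos + neg) over the tuples covered by R.
    return (pos - neg) / (pos + neg)

print(foil_prune(40, 10))   # 0.6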
Here we will discuss other classification methods such as Genetic Algorithms, Rough Set
Approach, and Fuzzy Set Approach.
Genetic Algorithms
The idea of genetic algorithms is derived from natural evolution. In a genetic algorithm, an initial
population is first created, consisting of randomly generated rules. Each rule can be represented
by a string of bits.
For example, suppose that in a given training set the samples are described by two Boolean
attributes, A1 and A2, and that the training set contains two classes, C1 and C2.
The rule IF A1 AND NOT A2 THEN C2 can be encoded as the bit string 100, where the two
leftmost bits represent the attributes A1 and A2, respectively, and the rightmost bit represents the
class.
Likewise, the rule IF NOT A1 AND NOT A2 THEN C1 can be encoded as 001.
Note − If an attribute has K values where K > 2, then K bits can be used to encode the
attribute's values. Classes are also encoded in the same manner.
Points to remember −
Based on the notion of survival of the fittest, a new population is formed that consists of
the fittest rules in the current population, as well as offspring of these rules.
The fitness of a rule is assessed by its classification accuracy on a set of training samples.
Genetic operators such as crossover and mutation are applied to create offspring.
In crossover, substrings from a pair of rules are swapped to form a new pair of rules.
In mutation, randomly selected bits in a rule's string are inverted (both operators are
sketched in code after this list).
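As a minimal sketch of the crossover and mutation operators on bit-string rules (the example strings and crossover point are arbitrary):

import random

def crossover(rule_a, rule_b, point):
    # Swap the substrings after the given crossover point.
    return rule_a[:point] + rule_b[point:], rule_b[:point] + rule_a[point:]

def mutate(rule, n_bits=1):
    # Invert n randomly selected bits in the rule's bit string.
    bits = list(rule)
    for i in random.sample(range(len(bits)), n_bits):
        bits[i] = "1" if bits[i] == "0" else "0"
    return "".join(bits)

print(crossover("100", "001", point=1))   # ('101', '000')
print(mutate("100"))                      # e.g. '110'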
We can use the rough set approach to discover structural relationships within imprecise and noisy
data.
Note − This approach can only be applied to discrete-valued attributes. Therefore, continuous-
valued attributes must be discretized before use.
Rough Set Theory is based on the establishment of equivalence classes within the given training
data. The tuples that form an equivalence class are indiscernible, which means the samples are
identical with respect to the attributes describing the data.
There are some classes in the given real-world data which cannot be distinguished in terms of the
available attributes. We can use rough sets to roughly define such classes.
For a given class C, the rough set definition is approximated by two sets as follows −
Lower Approximation of C − the set of all data tuples that, based on knowledge of the
attributes, certainly belong to class C.
Upper Approximation of C − the set of all data tuples that, based on knowledge of the
attributes, cannot be described as not belonging to C.
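As a minimal sketch of computing the two approximations from equivalence classes (the tuple IDs and the membership of class C below are hypothetical):

# Equivalence classes: groups of tuple IDs that are indiscernible
# with respect to the available attributes.
equivalence_classes = [{1, 2}, {3, 4}, {5}]

# Hypothetical target class C, given as a set of tuple IDs.
C = {1, 2, 3}

# Lower approximation: union of equivalence classes fully contained in C.
lower = set().union(*(e for e in equivalence_classes if e <= C))

# Upper approximation: union of equivalence classes that overlap C at all.
upper = set().union(*(e for e in equivalence_classes if e & C))

print(lower)   # {1, 2}
print(upper)   # {1, 2, 3, 4}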
Fuzzy Set Theory is also called Possibility Theory. This theory was proposed by Lotfi Zadeh in
1965 as an alternative to two-valued logic and probability theory. It allows us to work at a high
level of abstraction and provides the means for dealing with imprecise measurements of data.
Fuzzy set theory also allows us to deal with vague or inexact facts. For example, membership in
a set of high incomes is inexact (e.g. if $50,000 is high, then what about $49,000 or $48,000?).
Unlike a traditional crisp set, where an element either belongs to the set S or to its complement,
in fuzzy set theory an element can belong to more than one fuzzy set.
For example, the income value $49,000 belongs to both the medium and high fuzzy sets, but to
differing degrees. Fuzzy set notation for this income value is as follows −
m_medium_income($49,000) = 0.10 and m_high_income($49,000) = 0.96, say,
where m is the membership function that operates on the fuzzy sets medium_income and
high_income, respectively; each membership degree lies between 0 and 1.
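As a minimal sketch of such membership functions, using piecewise-linear shapes whose breakpoints are hypothetical but chosen to reproduce the degrees quoted above:

def m_high_income(income):
    # Membership in the "high income" fuzzy set: 0 below $25,000,
    # rising linearly to 1 at $50,000 and above.
    if income <= 25000:
        return 0.0
    if income >= 50000:
        return 1.0
    return (income - 25000) / 25000

def m_medium_income(income):
    # Trapezoidal membership in "medium income": full membership between
    # $30,000 and $40,000, falling to 0 at $20,000 and $50,000.
    if income <= 20000 or income >= 50000:
        return 0.0
    if income < 30000:
        return (income - 20000) / 10000
    if income <= 40000:
        return 1.0
    return (50000 - income) / 10000

# $49,000 belongs to both fuzzy sets, to differing degrees.
print(m_medium_income(49000), m_high_income(49000))   # 0.1 0.96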