
Data Mining

"Data mining" can also be referred to as knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, or data dredging. It is essentially the process of extracting useful information from large volumes of data or data warehouses. The results of data mining are the patterns and knowledge gained at the end of the extraction process. In that sense, data mining can be seen as one step in the process of knowledge discovery or knowledge extraction.
Knowledge discovery as a process consists of an iterative sequence of the following steps:

1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms appropriate for mining, for instance by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied in order to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge, based on interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)
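As an illustration only, the following minimal Python sketch strings the cleaning, selection, and transformation steps together for a tiny in-memory data set; all records, function names, and thresholds are made up for the example.

# Minimal, illustrative KDD preprocessing sketch; records and names are hypothetical.
raw = [
    {"name": "a", "age": 25, "income": 40000, "buys": "yes"},
    {"name": "b", "age": 47, "income": None,  "buys": "no"},   # missing income
    {"name": "a", "age": 25, "income": 40000, "buys": "yes"},  # exact duplicate
    {"name": "c", "age": 52, "income": 88000, "buys": "no"},
]

def clean(records):
    # Data cleaning: drop records with missing values and exact duplicates.
    seen, out = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if None not in r.values() and key not in seen:
            seen.add(key)
            out.append(r)
    return out

def select(records, attributes):
    # Data selection: keep only the attributes relevant to the analysis task.
    return [{a: r[a] for a in attributes} for r in records]

def transform(records):
    # Data transformation: discretize income into coarse bins (arbitrary cutoff).
    for r in records:
        r["income_level"] = "high" if r["income"] > 50000 else "low"
    return records

prepared = transform(select(clean(raw), ["age", "income", "buys"]))
print(prepared)  # the mining step would now look for patterns in `prepared`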

Data mining architecture:

A detailed description of the main parts of a data mining architecture follows:
1. Data Sources:
Database, World Wide Web(WWW) and data warehouse are parts of data
sources. The data in these sources may be in the form of plain text,
spreadsheets or in other forms of media like photos or videos. WWW is one
of the biggest sources of data.
2. Database Server:
The database server contains the actual data ready to be processed. It
performs the task of handling data retrieval as per the request of the user.
3. Data Mining Engine:
It is one of the core components of the data mining architecture that
performs all kinds of data mining techniques like association, classification,
characterization, clustering, prediction, etc.
4. Pattern Evaluation Modules:
They are responsible for finding interesting patterns in the data and
sometimes they also interact with the database servers for producing the
result of the user requests.
5. Graphical User Interface:
Since the user cannot fully grasp the complexity of the data mining process, a graphical user interface helps the user communicate effectively with the data mining system.
6. Knowledge Base:
The knowledge base is an important part of the data mining engine that helps guide the search for result patterns. The data mining engine may also take input from the knowledge base, which can contain data derived from user experience. The objective of the knowledge base is to make the results more accurate and reliable.

Types of Data Mining architecture:


1. No Coupling:
The no-coupling architecture retrieves data directly from particular data sources rather than from a database system, even though a database would be a more efficient and accurate way to do the same. No-coupling architectures are considered poor and are only used for very simple data mining tasks.
2. Loose Coupling:
In the loose-coupling architecture, the data mining system retrieves data from the database and stores its results back in those systems. This architecture is suited to memory-based data mining.
3. Semi-Tight Coupling:
It uses several advantageous features of data warehouse systems, such as sorting, indexing, and aggregation. In this architecture, intermediate results can be stored in the database for better performance.
4. Tight Coupling:
In this architecture, a data warehouse is considered one of its most important components, and its features are employed for performing data mining tasks. This architecture provides scalability, performance, and integrated information.

Data Mining Techniques:

1. Association

Association analysis is the discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data. It is widely used for market basket or transaction data analysis. Association rule mining is a significant and highly active area of data mining research. One method of association-based classification, called associative classification, consists of two steps. In the first step, association rules are generated using a modified version of the standard association rule mining algorithm known as Apriori. The second step constructs a classifier based on the discovered association rules.
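To make the idea of frequently co-occurring conditions concrete, here is a small Python sketch that counts frequent item pairs in toy market-basket data. It shows only the support-counting idea behind Apriori, not the full algorithm; the transactions and the minimum support threshold are made up.

from itertools import combinations
from collections import Counter

# Toy market-basket transactions; a pair is "frequent" if it meets min_support.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]
min_support = 2  # minimum number of transactions containing the pair

pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        pair_counts[pair] += 1

frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent_pairs)
# From a frequent pair {X, Y}, a candidate rule X -> Y can then be scored by
# confidence = support({X, Y}) / support({X}).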

2. Classification

Classification is the process of finding a set of models (or functions) that describe and distinguish data classes or concepts, for the purpose of using the model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data (i.e., data objects whose class label is known). It may be represented in various forms, such as classification (if-then) rules, decision trees, and neural networks. Data mining uses several different types of classifier:
 Decision Tree
 SVM(Support Vector Machine)
 Generalized Linear Models
 Bayesian classification:
 Classification by Backpropagation
 K-NN Classifier
 Rule-Based Classification
 Frequent-Pattern Based Classification
 Rough set theory
 Fuzzy logic

3. Clustering:
Unlike classification and prediction, which analyze class-labeled data objects or
attributes, clustering analyzes data objects without consulting an identified class
label. In general, the class labels do not exist in the training data simply
because they are not known to begin with. Clustering can be used to generate
these labels. The objects are clustered based on the principle of maximizing the
intra-class similarity and minimizing the interclass similarity.  
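As a hedged illustration of this principle, the following sketch clusters unlabelled 2-D points with k-means, assuming scikit-learn is available; the points and the number of clusters are chosen only for the example.

# Illustrative clustering sketch using k-means from scikit-learn (assumed available).
import numpy as np
from sklearn.cluster import KMeans

# Unlabelled 2-D points: two visually separated groups.
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
              [8.0, 8.2], [7.9, 8.1], [8.3, 7.9]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster label assigned to each point
print(kmeans.cluster_centers_)  # points within a cluster lie close to their center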
4. Prediction:
Data prediction is a two-step process, similar to data classification, although for prediction we do not use the term "class label attribute", because the attribute whose values are being predicted is continuous-valued (ordered) rather than categorical (discrete-valued and unordered). The attribute can be referred to simply as the predicted attribute. Prediction can be viewed as the construction and use of a model to assess the class of an unlabeled object, or to assess the value or value ranges of an attribute that a given object is likely to have.
5. Regression

Regression can be defined as a statistical modeling method in which previously obtained data is used to predict a continuous quantity for new observations. This classifier is also known as the continuous value classifier. There are two types of regression models: linear regression and multiple linear regression models.
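A minimal sketch of linear regression: a line is fitted to previously obtained data with NumPy's least-squares polynomial fit and then used to predict a continuous value for a new observation (the data values are made up).

# Simple linear regression sketch using NumPy's least-squares polynomial fit.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # previously observed predictor values
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])    # observed continuous target values

slope, intercept = np.polyfit(x, y, deg=1)  # fit y ~ slope * x + intercept
new_x = 6.0
print(slope * new_x + intercept)            # predicted value for a new observation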

6. Artificial Neural network (ANN) Classifier Method

An artificial neural network (ANN), also referred to simply as a "neural network" (NN), is a computational model inspired by biological neural networks. It consists of an interconnected collection of artificial neurons. A neural network is a set of connected input/output units in which each connection has a weight associated with it. During the learning phase, the network learns by adjusting the weights so that it can predict the correct class label of the input samples.
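The following minimal sketch (not a full neural network) shows the weight-adjustment idea for a single artificial neuron learning an AND-style labelling; the learning rate, data, and number of passes are arbitrary choices for the example.

# Minimal single-neuron sketch: weights are adjusted so the unit predicts the label.
import numpy as np

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])  # inputs
y = np.array([0, 0, 0, 1])                                       # AND-style labels

w = np.zeros(2)
b = 0.0
lr = 0.1  # learning rate

for _ in range(20):                      # learning phase
    for xi, target in zip(X, y):
        pred = 1 if xi @ w + b > 0 else 0
        error = target - pred
        w += lr * error * xi             # adjust connection weights
        b += lr * error

print([1 if xi @ w + b > 0 else 0 for xi in X])  # predicted class labels: [0, 0, 0, 1]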
7. Outlier Detection

A database may contain data objects that do not comply with the general
behavior or model of the data. These data objects are Outliers. The
investigation of OUTLIER data is known as OUTLIER MINING. An outlier may
be detected using statistical tests which assume a distribution or probability
model for the data, or using distance measures where objects having a small
fraction of “close” neighbors in space are considered outliers.
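As a small illustration of the statistical approach, the sketch below flags values lying more than three standard deviations from the mean, assuming the data is roughly normally distributed; the data values and the 3-sigma threshold are conventional choices for the example.

# Simple statistical outlier check based on z-scores.
import numpy as np

data = np.array([10.1, 9.8, 10.3, 9.9, 10.0, 10.2,
                 9.7, 10.4, 10.1, 9.9, 10.0, 10.2, 42.0])  # 42.0 deviates strongly
z_scores = (data - data.mean()) / data.std()
print(data[np.abs(z_scores) > 3])  # -> [42.]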
8. Genetic Algorithm
Genetic algorithms are adaptive heuristic search algorithms that belong to the larger class of evolutionary algorithms. They are based on the ideas of natural selection and genetics: an intelligent exploitation of random search, guided by historical data, that directs the search into regions of the solution space with better performance. They are commonly used to generate high-quality solutions to optimization and search problems. In simple terms, they simulate "survival of the fittest" among individuals of consecutive generations in order to solve a problem.
Cluster Analysis
Cluster analysis is the process of finding similar groups of objects in order to form clusters. It is an unsupervised machine learning technique that acts on unlabelled data. Data points that are similar to one another are grouped together to form a cluster, in which all the objects belong to the same group.
Cluster:
The given data is divided into different groups by combining similar objects into a group. This group is nothing but a cluster: a collection of similar data that is grouped together.
For example, consider a dataset of vehicles that contains information about different vehicles such as cars, buses, bicycles, etc. Since this is unsupervised learning, there are no class labels such as Cars or Bikes for the vehicles; all the data is combined and is not in a structured form.
Our task is to convert the unlabelled data into labelled data, and this can be done using clusters.
The main idea of cluster analysis is to arrange all the data points into clusters, such as a cars cluster containing all the cars, a bikes cluster containing all the bikes, and so on.
Simply put, it is a partitioning of similar objects applied to unlabelled data.
Properties of Clustering :
1. Clustering Scalability: Nowadays there is a vast amount of data, so clustering has to deal with huge databases. In order to handle extensive databases, the clustering algorithm should be scalable; if it is not, we cannot get appropriate results and may be led to wrong conclusions.
2. High Dimensionality: The algorithm should be able to handle high-dimensional space as well as data of small size.
3. Algorithm Usability with Multiple Data Kinds: Different kinds of data can be used with clustering algorithms. The algorithm should be capable of dealing with different types of data, such as discrete, categorical, interval-based, and binary data.
4. Dealing with Unstructured Data: Some databases contain missing values and noisy or erroneous data. If an algorithm is sensitive to such data, it may produce poor-quality clusters. The algorithm should therefore be able to handle unstructured data and give it some structure by organizing it into groups of similar data objects. This makes the job of the data expert easier when processing the data and discovering new patterns.
5. Interpretability: The outcomes of clustering should be interpretable,
comprehensible, and usable. The interpretability reflects how easily the data is
understood.

Clustering Methods:
The clustering methods can be classified into the following categories:
  Partitioning Method
 Hierarchical Method
 Density-based Method
 Grid-Based Method
 Model-Based Method
  Constraint-based Method

Partitioning Method: It is used to partition the data in order to form clusters. If "n" partitions are made of the "p" objects in the database, then each partition is represented by a cluster, with n < p. The two conditions that need to be satisfied by this partitioning clustering method are:
 Each object must belong to exactly one group.
 There should be no group without even a single object.
Hierarchical Method: In this method, a hierarchical decomposition of the given set of data objects is created. Hierarchical methods are classified on the basis of how the hierarchical decomposition is formed.
Density-Based Method: The density-based method focuses mainly on density. In this method, a given cluster keeps growing as long as the density in the neighbourhood exceeds some threshold, i.e., for each data point within a given cluster, the neighbourhood of a given radius has to contain at least a minimum number of points.
Grid-Based Method: In the grid-based method, the object space is quantized into a finite number of cells that form a grid structure. One of the major advantages of this method is its fast processing time, which depends only on the number of cells in each dimension of the quantized space rather than on the number of data objects.
Model-Based Method: In the model-based method, a model is hypothesized for each of the clusters in order to find the data that best fits that model. The clusters are located by means of a density function, which reflects the spatial distribution of the data points. This approach also provides a way to automatically determine the number of clusters based on standard statistics, taking outliers or noise into account, and therefore yields robust clustering methods.
Constraint-Based Method: The constraint-based clustering method is
performed by the incorporation of application or user-oriented constraints. A
constraint refers to the user expectation or the properties of the desired
clustering results.  Constraints provide us with an interactive way of
communication with the clustering process. Constraints can be specified by the
user or the application requirement.  

Types of Data in Cluster Analysis


Interval-scaled variables are continuous measurements on a roughly linear scale. Typical
examples include weight and height, latitude and longitude coordinates (e.g., when clustering
houses), and weather temperature. The measurement unit used can affect the clustering
analysis. For example, changing measurement units from meters to inches for height, or from
kilograms to pounds for weight, may lead to a very different clustering structure.

Binary variables: A binary variable has only two states: 0 or 1, where 0 means that the variable is absent and 1 means that it is present. Given the variable smoker describing a patient, for instance, 1 indicates that the patient smokes, while 0 indicates that the patient does not. Treating binary variables as if they were interval-scaled can lead to misleading clustering results; therefore, methods specific to binary data are necessary for computing dissimilarities. A binary variable is symmetric if both of its states are equally valuable and carry the same weight; that is, there is no preference as to which outcome should be coded 0 or 1.

Dissimilarity based on symmetric binary variables is called symmetric binary dissimilarity. Its dissimilarity (or distance) measure, defined in Equation (7.9), can be used to assess the dissimilarity between objects i and j, where q is the number of variables that equal 1 for both objects, r is the number that equal 1 for i but 0 for j, s is the number that equal 0 for i but 1 for j, and t is the number that equal 0 for both objects:

d(i, j) = (r + s) / (q + r + s + t)    (7.9)
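A direct Python sketch of Equation (7.9), assuming both objects are described by the same list of 0/1 variables (the example vectors are made up):

# Symmetric binary dissimilarity d(i, j) = (r + s) / (q + r + s + t).
def symmetric_binary_dissimilarity(i, j):
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)  # both 1
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)  # 1 in i, 0 in j
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)  # 0 in i, 1 in j
    t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)  # both 0
    return (r + s) / (q + r + s + t)

print(symmetric_binary_dissimilarity([1, 0, 1, 1], [1, 1, 0, 1]))  # -> 0.5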

A binary variable is asymmetric if the outcomes of the states are not equally important, such as
the positive and negative outcomes.

Categorical variables: A categorical variable is a generalization of the binary variable in that it can
take on more than two states. For example, map color is a categorical variable that may have, say, five
states: red, yellow, green, pink, and blue.
Let the number of states of a categorical variable be M. The states can be denoted by letters, symbols,
or a set of integers, such as 1, 2,..., M. Notice that such integers are used just for data handling and do
not represent any specific ordering.
The dissimilarity between two objects i and j can be computed based on the ratio of mismatches:
d(i, j) = (p − m) / p    (7.12)

where m is the number of matches (i.e., the number of variables for which i and j are in the same state),
and p is the total number of variables.
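A direct Python sketch of Equation (7.12) for two objects described by categorical variables (the example states are made up):

# Mismatch-ratio dissimilarity d(i, j) = (p - m) / p for categorical variables.
def categorical_dissimilarity(i, j):
    p = len(i)                                   # total number of variables
    m = sum(1 for a, b in zip(i, j) if a == b)   # number of matching states
    return (p - m) / p

print(categorical_dissimilarity(["red", "small", "round"],
                                ["red", "large", "round"]))  # -> 1/3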

Ordinal Variables
A discrete ordinal variable resembles a categorical variable, except that the M states of the ordinal value
are ordered in a meaningful sequence. Ordinal variables are very useful for registering subjective
assessments of qualities that cannot be measured objectively. For example, professional ranks are often
enumerated in a sequential order, such as assistant, associate, and full for professors. A continuous
ordinal variable looks like a set of continuous data of an unknown scale; that is, the relative ordering of
the values is essential but their actual magnitude is not.

Ratio-Scaled Variables
A ratio-scaled variable makes a positive measurement on a nonlinear scale, such as an exponential scale, approximately following the formula

Ae^(Bt) or Ae^(−Bt)    (7.14)

where A and B are positive constants and t typically represents time. Common examples include the growth of a bacteria population or the decay of a radioactive element. There are three methods for handling ratio-scaled variables when computing the dissimilarity between objects:
 Treat ratio-scaled variables like interval-scaled variables. This, however, is not usually a good choice, since the scale is likely to be distorted.
 Apply a logarithmic transformation to a ratio-scaled variable f having value x_if for object i, using the formula y_if = log(x_if).
 Treat x_if as continuous ordinal data and treat the ranks as interval-scaled values.

Outlier
An outlier is an object that deviates significantly from the rest of the objects. Outliers can be caused by measurement or execution errors. The analysis of outlier data is referred to as outlier analysis or outlier mining. An outlier should not, however, simply be treated as noise or error.
Difference between outliers and noise
Noise is a random error or variance in a previously measured variable. Before finding the outliers present in a data set, it is recommended to first remove the noise.

Analysis of Outliers
Outliers are mostly discarded when data mining methods are applied, but outlier analysis is still used in certain applications such as fraud detection. This is mainly because rarely occurring events can carry much more interesting information than events that occur regularly.

Other applications where outlier detection plays a major role are:


 Detection of frauds in the insurance sector, credit cards, and the
healthcare sector.
 Fraud detection in telecom.
 In cybersecurity for detecting any form of intrusion.
 In the field of medical analysis.
 Detection of faults in the safety-critical systems.
 In marketing, outlier analysis helps in identifying the customer’s nature
of spending.
 Any sort of unusual responses that occurs due to certain medical
treatments can be analyzed through outlier analysis in data mining.
The process where the anomalous behavior of the outliers is identified in a
dataset is known as outlier analysis.  Also, known as “outlier mining”, the
process is defined to be an important task of data mining.
Outlier Detection methods
Various techniques combined with different approaches are applied for
detecting any anomalous behaviour in a dataset. A few techniques used for
outlier detection are:
1 Statistical Distribution-Based Outlier Detection
The statistical distribution-based approach to outlier detection assumes a distribution or
probability model for the given data set (e.g., a normal or Poisson distribution) and then
identifies outliers with respect to the model using a discordancy test. Application of the test
requires knowledge of the data set parameters (such as the assumed data distribution),
knowledge of distribution parameters (such as the mean and variance), and the expected
number of outliers.

"How does discordancy testing work?" A statistical discordancy test examines two hypotheses: a working hypothesis and an alternative hypothesis.
There are two basic types of procedures for detecting outliers:

Block procedures: In this case, either all of the suspect objects are treated as outliers or all of them
are accepted as consistent.
Consecutive (or sequential) procedures: An example of such a procedure is the inside-out procedure. Its main idea is that the object that is least "likely" to be an outlier is tested first. If it is found to be an outlier, then all of the more extreme values are also considered outliers; otherwise, the next most extreme object is tested, and so on. This procedure tends to be more effective than block procedures.
2. Distance-Based Outlier Detection
The notion of distance-based outliers was introduced to counter the main limitations imposed by statistical methods. An object o in a data set D is a distance-based (DB) outlier with parameters pct and dmin, that is, a DB(pct, dmin)-outlier, if at least a fraction pct of the objects in D lie at a distance greater than dmin from o. In other words, rather than relying on statistical tests, we can think of distance-based outliers as objects that do not have "enough" neighbors, where neighbors are defined based on distance from the given object. In comparison with statistical methods, distance-based outlier detection generalizes the ideas behind discordancy testing for various standard distributions, and it avoids the excessive computation that can be associated with fitting the observed distribution to some standard distribution and with selecting discordancy tests.
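The definition can be checked directly with a brute-force sketch like the one below (not one of the efficient algorithms that follow); the points and the pct and dmin values are made up.

# Brute-force sketch of the DB(pct, dmin) definition: an object is an outlier if
# at least a fraction pct of the other objects lie farther than dmin from it.
import math

def is_db_outlier(idx, data, pct, dmin):
    o = data[idx]
    others = [x for k, x in enumerate(data) if k != idx]
    far = sum(1 for x in others if math.dist(o, x) > dmin)
    return far / len(others) >= pct

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print([p for k, p in enumerate(points)
       if is_db_outlier(k, points, pct=0.9, dmin=3.0)])  # -> [(10, 10)]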

Several efficient algorithms for mining distance-based outliers have been developed. These are outlined
as follows.

Index-based algorithm: Given a data set, the index-based algorithm uses multidimensional indexing
structures, such as R-trees or k-d trees, to search for neighbors of each object o within radius dmin
around that object.

Nested-loop algorithm: The nested-loop algorithm has the same computational complexity as
the index-based algorithm but avoids index structure construction and tries to minimize the
number of I/Os. It divides the memory buffer space into two halves and the data set into
several logical blocks. By carefully choosing the order in which blocks are loaded into each
half, I/O efficiency can be achieved.

Cell-based algorithm: To avoid O(n²) computational complexity, a cell-based algorithm was developed for memory-resident data sets. Its complexity is O(c^k + n), where c is a constant depending on the number of cells and k is the dimensionality. In this method, the data space is partitioned into cells with a side length equal to dmin / (2√k). Each cell has two layers surrounding it.

Density-Based Local Outlier Detection


Statistical and distance-based outlier detection both depend on the overall or “global”
distribution of the given set of data points, D. However, data are usually not uniformly
distributed.
Deviation-Based Outlier Detection
Deviation-based outlier detection does not use statistical tests or distance-based measures to identify exceptional objects. Instead, it identifies outliers by examining the main characteristics of the objects in a group. Objects that "deviate" from this description are considered outliers. Hence, in this approach the term deviations is typically used to refer to outliers. There are two techniques for deviation-based outlier detection: the first sequentially compares objects in a set, while the second employs an OLAP data cube approach.

Missing Values
Imagine that you need to analyze AllElectronics sales and customer data. You note that many tuples have no recorded value for several attributes, such as customer income. How can you go about filling in the missing values for this attribute? Let's look at the following methods:

Ignore the tuple: This is usually done when the class label is missing (assuming the mining task involves classification). This method is not very effective, unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably.

Fill in the missing value manually: In general, this approach is time-consuming and may not be feasible given a large data set with many missing values.

Use a global constant to fill in the missing value: Replace all missing attribute values by the same
constant, such as a label like “Unknown” or −∞. If missing values are replaced by, say, “Unknown,” then
the mining program may mistakenly think that they form an interesting concept, since they all have a
value in common—that of “Unknown.” Hence, although this method is simple, it is not foolproof.

Use the attribute mean to fill in the missing value: For example, suppose that the average income of
AllElectronics customers is $56,000. Use this value to replace the missing value for income.

Use the attribute mean for all samples belonging to the same class as the given tuple: For example, if classifying customers according to credit risk, replace the missing value with the average income value for customers in the same credit risk category as that of the given tuple.

Use the most probable value to fill in the missing value: This may be determined with regression,
inference-based tools using a Bayesian formalism, or decision tree induction. For example, using the
other customer attributes in your data set, you may construct a decision tree to predict the missing
values for income.
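As a hedged illustration of the mean-based strategies above, the following sketch uses pandas (assumed available) for global-mean and same-class-mean imputation on a made-up table:

# Sketch: fill missing income with the global mean and with the class-wise mean.
import pandas as pd

df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high"],
    "income": [60000, None, 30000, None],
})

# Use the attribute mean for all tuples.
df["income_global_mean"] = df["income"].fillna(df["income"].mean())

# Use the attribute mean for samples in the same class as the given tuple.
class_mean = df.groupby("credit_risk")["income"].transform("mean")
df["income_class_mean"] = df["income"].fillna(class_mean)
print(df)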

What is classification?
Following are the examples of cases where the data analysis task is Classification −
 A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.
 A marketing manager at a company needs to analyze whether a customer with a given profile will buy a new computer.
In both of the above examples, a model or classifier is constructed to predict the
categorical labels. These labels are risky or safe for loan application data and yes or no
for marketing data.

What is prediction?
Following are the examples of cases where the data analysis task is Prediction −
Suppose the marketing manager needs to predict how much a given customer will spend during a sale at his company. In this example, we are asked to predict a numeric value, so the data analysis task is an example of numeric prediction. In this case, a model or predictor will be constructed that predicts a continuous-valued function, or ordered value.
Note − Regression analysis is a statistical methodology that is most often used for
numeric prediction.

Decision Trees
 Decision tree is the most powerful and popular tool for classification and prediction. A Decision tree is a
flowchart like tree structure, where each internal node denotes a test on an attribute, each branch
represents an outcome of the test, and each leaf node (terminal node) holds a class label. 

They help to analyze which parts of the database are really useful or which part contains a solution to
your problem.

It is a support tool that uses a decision chart or model and its possible consequences. That includes
results of chance events, resource costs, and utility.

From a decision perspective, a decision tree is the minimum number of questions that must be answered to assess the likelihood of making a correct decision.

By looking at the predictors or values for each split in the tree, you can draw some ideas or find answers
to the questions you have asked.

Decision trees enable you to approach the obstacle in a structured and systematic behavior.
In a decision tree, to predict the class label of a record we start from the root of the tree. We compare the value of the root attribute with the corresponding attribute of the record, follow the branch corresponding to that value, and jump to the next node, repeating the decision at each node.
Classification by Decision Tree Induction

Decision tree induction is the learning of decision trees from class-labeled training tuples. A decision
tree is a flowchart-like tree structure , where each internal node ( non leaf node) denotes a test on an
attribute ,each branch represents an outcome of the test, and each leaf node (or terminal node) holds a
class label. The top most node in a tree is the root node.

The following decision tree is for the concept buy_computer that indicates whether a customer
at a company is likely to buy a computer or not. Each internal node represents a test on an
attribute. Each leaf node represents a class.

The benefits of having a decision tree are as follows −

 It does not require any domain knowledge.


 It is easy to comprehend.
 The learning and classification steps of a decision tree are simple and fast.
Decision Tree Induction Algorithm
J. Ross Quinlan, a machine learning researcher, developed a decision tree algorithm known as ID3 (Iterative Dichotomiser) in 1980. He later presented C4.5, the successor of ID3. ID3 and C4.5 adopt a greedy approach. In these algorithms there is no backtracking; the trees are constructed in a top-down, recursive, divide-and-conquer manner.
Generating a decision tree from the training tuples of data partition D
Algorithm: Generate_decision_tree

Input:
  Data partition D, which is a set of training tuples and their associated class labels.
  attribute_list, the set of candidate attributes.
  Attribute_selection_method, a procedure to determine the splitting criterion that best
  partitions the data tuples into individual classes. This criterion includes a
  splitting_attribute and either a split point or a splitting subset.
Output:
  A decision tree.
Method:
  create a node N;
  if tuples in D are all of the same class C then
     return N as a leaf node labeled with class C;
  if attribute_list is empty then
     return N as a leaf node labeled with the majority class in D;   // majority voting
  apply Attribute_selection_method(D, attribute_list) to find the best splitting_criterion;
  label node N with splitting_criterion;
  if splitting_attribute is discrete-valued and multiway splits are allowed then   // not restricted to binary trees
     attribute_list = attribute_list - splitting_attribute;   // remove splitting_attribute
  for each outcome j of splitting_criterion
     // partition the tuples and grow subtrees for each partition
     let Dj be the set of data tuples in D satisfying outcome j;   // a partition
     if Dj is empty then
        attach a leaf labeled with the majority class in D to node N;
     else
        attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
  end for
  return N;
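For comparison, here is a hedged sketch of decision tree induction in practice using scikit-learn's DecisionTreeClassifier (assumed available) on a tiny, made-up buys_computer-style data set; the attributes, encoding, and depth limit are illustrative choices, not the algorithm above.

# Illustrative decision tree training and prediction with scikit-learn.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [age (years), student (0/1)]; label: buys_computer (0 = no, 1 = yes)
X = [[22, 1], [25, 0], [47, 0], [52, 0], [35, 1], [23, 1], [46, 1], [56, 0]]
y = [1, 0, 0, 0, 1, 1, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "student"]))  # the learned if-then splits
print(tree.predict([[30, 1]]))  # predicted class for a new, unseen customer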

Tree Pruning
Tree pruning is performed in order to remove anomalies in the training data due to
noise or outliers. The pruned trees are smaller and less complex.
Tree Pruning Approaches
There are two approaches to prune a tree −
 Pre-pruning − The tree is pruned by halting its construction early.
 Post-pruning - This approach removes a sub-tree from a fully grown tree.

Cost Complexity
The cost complexity is measured by the following two parameters −

 Number of leaves in the tree, and


 Error rate of the tree.

Rule Based Classification


IF-THEN Rules
A rule-based classifier makes use of a set of IF-THEN rules for classification. We can express a rule in the following form −
IF condition THEN conclusion

Let us consider a rule R1,


R1: IF age = youth AND student = yes
THEN buy_computer = yes

Points to remember −
 The IF part of the rule is called rule antecedent or precondition.
 The THEN part of the rule is called rule consequent.
 The antecedent part (the condition) consists of one or more attribute tests, and these tests are logically ANDed.
 The consequent part consists of the class prediction.
Note − We can also write rule R1 as follows −
R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes)
If the condition holds true for a given tuple, then the antecedent is satisfied.
Rule Extraction
Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from
a decision tree.
Points to remember −
To extract a rule from a decision tree −
 One rule is created for each path from the root to the leaf node.
 To form a rule antecedent, each splitting criterion is logically ANDed.
 The leaf node holds the class prediction, forming the rule consequent.

Rule Induction Using Sequential Covering Algorithm


The sequential covering algorithm can be used to extract IF-THEN rules from the training data. We do not need to generate a decision tree first. In this algorithm, each rule for a given class covers many of the tuples of that class.
Some of the sequential covering algorithms are AQ, CN2, and RIPPER. As per the general strategy, the rules are learned one at a time. Each time a rule is learned, the tuples covered by the rule are removed and the process continues for the remaining tuples.
Note − Decision tree induction can be considered as learning a set of rules simultaneously, because the path to each leaf in a decision tree corresponds to a rule.
The following is the sequential learning algorithm, where rules are learned for one class at a time. When learning a rule for a class Ci, we want the rule to cover all the tuples of class Ci and no tuples of any other class.
Algorithm: Sequential Covering

Input:
D, a data set class-labeled tuples,
Att_vals, the set of all attributes and their possible values.

Output: A Set of IF-THEN rules.


Method:
Rule_set={ }; // initial set of rules learned is empty

for each class c do

   repeat
      Rule = Learn_One_Rule(D, Att_vals, c);
      remove tuples covered by Rule from D;
      Rule_set = Rule_set + Rule;   // add the new rule to the rule set
   until termination condition;
end for
return Rule_set;
Rule Pruning
A rule is pruned for the following reasons −
 The assessment of quality is made on the original set of training data. The rule may perform well on training data but less well on subsequent data. That is why rule pruning is required.
 The rule is pruned by removing a conjunct. Rule R is pruned if the pruned version of R has greater quality, as assessed on an independent set of tuples.
FOIL is one of the simplest and most effective methods for rule pruning. For a given rule R,

FOIL_Prune(R) = (pos − neg) / (pos + neg)

where pos and neg are the number of positive and negative tuples covered by R, respectively.
Note − This value will increase with the accuracy of R on the pruning set. Hence, if the
FOIL_Prune value is higher for the pruned version of R, then we prune R.
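A small worked sketch of this comparison, with made-up pos and neg counts on a pruning set:

# Keep the pruned rule only if its FOIL_Prune value improves.
def foil_prune(pos, neg):
    return (pos - neg) / (pos + neg)

original = foil_prune(pos=80, neg=20)   # rule R on the pruning set
pruned = foil_prune(pos=78, neg=12)     # R with one conjunct removed
print(original, pruned, pruned > original)  # prune R if the pruned version scores higher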

Genetic Algorithms
The idea of genetic algorithm is derived from natural evolution. In genetic algorithm,
first of all, the initial population is created. This initial population consists of randomly
generated rules. We can represent each rule by a string of bits.
For example, in a given training set, the samples are described by two Boolean
attributes such as A1 and A2. And this given training set contains two classes such as
C1 and C2.
We can encode the rule IF A1 AND NOT A2 THEN C2 into a bit string 100. In this bit
representation, the two leftmost bits represent the attribute A1 and A2, respectively.
Likewise, the rule IF NOT A1 AND NOT A2 THEN C1 can be encoded as 001.
Note − If the attribute has K values where K > 2, then K bits can be used to encode the attribute's values. Classes are encoded in the same manner.
Points to remember −
 Based on the notion of the survival of the fittest, a new population is formed that
consists of the fittest rules in the current population and offspring values of these
rules as well.
 The fitness of a rule is assessed by its classification accuracy on a set of training
samples.
 The genetic operators such as crossover and mutation are applied to create
offspring.
 In crossover, substrings from a pair of rules are swapped to form a new pair of rules.
 In mutation, randomly selected bits in a rule's string are inverted.
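A minimal sketch of the crossover and mutation operators on such bit strings; the crossover point, random seed, and helper names are illustrative choices:

# Bit strings encode [A1 bit][A2 bit][class bit]; e.g. "100" is IF A1 AND NOT A2 THEN C2.
import random

random.seed(0)

def crossover(rule_a, rule_b, point=1):
    # Swap the substrings of the two rules after the crossover point.
    return rule_a[:point] + rule_b[point:], rule_b[:point] + rule_a[point:]

def mutate(rule):
    # Invert one randomly selected bit of the rule's string.
    i = random.randrange(len(rule))
    flipped = "1" if rule[i] == "0" else "0"
    return rule[:i] + flipped + rule[i + 1:]

population = ["100", "001"]                       # the two encoded rules from the example above
print(crossover(population[0], population[1]))    # offspring pair produced by crossover
print(mutate(population[0]))                      # offspring produced by mutation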

Fuzzy Set Approaches


Fuzzy set theory is also called possibility theory. It was proposed by Lotfi Zadeh in 1965 as an alternative to two-value logic and probability theory. This theory allows us to work at a high level of abstraction and provides the means for dealing with imprecise measurements of data.
Fuzzy set theory also allows us to deal with vague or inexact facts. For example, being a member of a set of high incomes is inexact (e.g., if $50,000 is high, then what about $49,000 or $48,000?). Unlike a traditional crisp set, where an element either belongs to S or to its complement, in fuzzy set theory an element can belong to more than one fuzzy set.
For example, the income value $49,000 belongs to both the medium and high fuzzy
sets but to differing degrees. Fuzzy set notation for this income value is as follows −

m_medium_income($49K) = 0.15 and m_high_income($49K) = 0.96


where 'm' is the membership function that operates on the fuzzy sets medium_income and high_income, respectively.
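The following sketch uses simple made-up piecewise-linear membership functions (they do not reproduce the exact 0.15 and 0.96 degrees above) to show how one income value can belong to two fuzzy sets to differing degrees:

# Illustrative fuzzy membership functions; the curve shapes are assumptions.
def membership_high_income(x):
    # 0 below $40K, rising linearly to 1 at $50K, 1 above.
    return max(0.0, min(1.0, (x - 40_000) / 10_000))

def membership_medium_income(x):
    # 1 between $25K and $45K, falling linearly to 0 at $55K.
    if x <= 45_000:
        return 1.0 if x >= 25_000 else max(0.0, (x - 15_000) / 10_000)
    return max(0.0, (55_000 - x) / 10_000)

income = 49_000
print(membership_medium_income(income), membership_high_income(income))
# -> 0.6 and 0.9 under these assumed curves: the value belongs to both fuzzy sets.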
