Unit 3 - DM FULL
Frequent patterns are patterns (e.g., itemsets, subsequences, or substructures) that appear frequently in a
data set. For example, a set of items, such as milk and bread, that appear frequently together in a
transaction data set is a frequent itemset. A subsequence, such as buying first a PC, then a digital camera,
and then a memory card, if it occurs frequently in a shopping history database, is a (frequent) sequential
pattern. A substructure can refer to different structural forms, such as subgraphs, subtrees, or sublattices,
which may be combined with itemsets or subsequences. If a substructure occurs frequently, it is called a
(frequent) structured pattern.
Market basket analysis is a data mining technique used by retailers to increase sales by better
understanding customer purchasing patterns. It involves analyzing large data sets, such as
purchase history, to reveal product groupings and products that are likely to be purchased
together.
The adoption of market basket analysis was aided by the advent of electronic point-of-sale
(POS) systems. Compared to handwritten records kept by store owners, the digital records
generated by POS systems made it easier for applications to process and analyze large volumes
of purchase data.
Implementation of market basket analysis requires a background in statistics and data science
and some algorithmic computer programming skills. For those without the needed technical
skills, commercial, off-the-shelf tools exist.
One example is the Shopping Basket Analysis tool in Microsoft Excel, which analyzes transaction
data contained in a spreadsheet and performs market basket analysis. A transaction ID must
relate to the items to be analyzed. The Shopping Basket Analysis tool then creates two
worksheets:
o The Shopping Basket Item Groups worksheet, which lists items that are frequently purchased together.
o The Shopping Basket Rules worksheet, which shows how items are related (for example, purchasers of Product A are likely to buy Product B).
For example, a customer who buys bread is also likely to buy butter, eggs, or milk, so these products are usually stocked on the same shelf or nearby. Association rule learning of this kind is commonly implemented with the following algorithms:
1. Apriori
2. Eclat
3. F-P Growth Algorithm
Here the IF element is called the antecedent, and the THEN statement is called the consequent. Relationships in which an association between two individual items can be found are known as single cardinality. Association rule mining is all about creating such rules, and as the number of items increases, the cardinality increases accordingly. So, to measure the associations between thousands of data items, several metrics are used. These metrics are given below:
o Support
o Confidence
o Lift
Support
Support tells how frequently an itemset appears in the data set. It is defined as the fraction of the transactions T that contain the itemset X:
Support(X) = (Number of transactions containing X) / (Total number of transactions in T)
Confidence
Confidence indicates how often the rule has been found to be true, that is, how often the items X and Y occur together in the data set given that X occurs. It is the ratio of the number of transactions that contain both X and Y to the number of transactions that contain X:
Confidence(X => Y) = Support(X ∪ Y) / Support(X)
Lift
Lift measures the strength of a rule. It is the ratio of the observed support to the support that would be expected if X and Y were independent of each other:
Lift(X => Y) = Support(X ∪ Y) / (Support(X) × Support(Y))
It has three possible values:
o Lift = 1: X and Y are independent; there is no association between them.
o Lift > 1: X and Y are positively correlated; the occurrence of one makes the other more likely.
o Lift < 1: X and Y are negatively correlated; the occurrence of one makes the other less likely.
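To make these three metrics concrete, here is a minimal Python sketch over a small, made-up basket of five transactions (the items and numbers are invented for illustration and are not taken from the notes):

transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "butter"},
    {"bread", "butter"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in `itemset`
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    # Support of X union Y divided by support of X
    return support(set(X) | set(Y), transactions) / support(X, transactions)

def lift(X, Y, transactions):
    # Observed support of X union Y divided by the support expected under independence
    return support(set(X) | set(Y), transactions) / (support(X, transactions) * support(Y, transactions))

print(support({"milk", "bread"}, transactions))       # 0.4
print(confidence({"milk"}, {"bread"}, transactions))  # 0.666...
print(lift({"milk"}, {"bread"}, transactions))        # 0.833... (< 1, weak negative association)

The same three functions apply unchanged to any list of transaction sets.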
Working of the Apriori Algorithm
#1) In the first iteration of the algorithm, each item is taken as a 1-itemset candidate. The algorithm counts the occurrences of each item.
#2) Let there be some minimum support, min_sup (e.g., 2). The set of 1-itemsets whose occurrence satisfies min_sup is determined. Only those candidates whose count is greater than or equal to min_sup are taken ahead to the next iteration; the others are pruned.
#3) Next, frequent 2-itemsets with min_sup are discovered. For this, in the join step, the 2-itemset candidates are generated by joining the set of frequent 1-itemsets with itself.
#4) The 2-itemset candidates are pruned using the min_sup threshold value. The table now contains only the 2-itemsets that satisfy min_sup.
#5) The next iteration forms 3-itemsets using the join and prune steps. This iteration relies on the antimonotone property: every 2-itemset subset of a candidate 3-itemset must itself be frequent. If all 2-itemset subsets are frequent, the candidate superset is kept; otherwise it is pruned.
#6) The next step forms 4-itemsets by joining the frequent 3-itemsets with themselves and pruning any candidate whose subsets do not meet the min_sup criterion. The algorithm stops when no further frequent itemsets can be generated.
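The join-and-prune loop of steps #1 to #6 can be sketched directly in Python. The toy transactions and min_sup value below are invented for demonstration; real work would normally use a library implementation:

from itertools import combinations

def apriori(transactions, min_sup=2):
    transactions = [set(t) for t in transactions]
    count = lambda itemset: sum(itemset <= t for t in transactions)
    # Steps 1-2: frequent 1-itemsets
    items = {item for t in transactions for item in t}
    frequent = {frozenset([i]) for i in items if count(frozenset([i])) >= min_sup}
    all_frequent, k = {}, 1
    while frequent:
        all_frequent.update({fs: count(fs) for fs in frequent})
        # Join step: build (k+1)-itemset candidates from the frequent k-itemsets
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        # Prune step (antimonotone property): every k-subset must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        # Support counting against min_sup
        frequent = {c for c in candidates if count(c) >= min_sup}
        k += 1
    return all_frequent

demo = [["A", "B", "C"], ["A", "B"], ["A", "C"], ["B", "C"], ["A", "B", "C"]]
for itemset, sup in sorted(apriori(demo).items(), key=lambda kv: (len(kv[0]), -kv[1])):
    print(sorted(itemset), sup)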
EXAMPLE
Find the frequent itemsets and generate association rules for a small transaction data set over the items Hot Dogs, Coke, and Chips. Assume that the minimum support count is 2 and the minimum confidence is 60%.
Let's start.
Only one 3-itemset, {Hot Dogs, Coke, Chips}, meets the minimum support count of 2, so it is the only frequent 3-itemset. The candidate association rules generated from it, together with their confidences, are:
[Hot Dogs^Coke]=>[Chips] //confidence = sup(Hot Dogs^Coke^Chips)/sup(Hot Dogs^Coke) =
2/2*100=100% //Selected
[Hot Dogs^Chips]=>[Coke] //confidence = sup(Hot Dogs^Coke^Chips)/sup(Hot Dogs^Chips) =
2/2*100=100% //Selected
[Coke^Chips]=>[Hot Dogs] //confidence = sup(Hot Dogs^Coke^Chips)/sup(Coke^Chips) =
2/3*100=66.67% //Selected
[Hot Dogs]=>[Coke^Chips] //confidence = sup(Hot Dogs^Coke^Chips)/sup(Hot Dogs) =
2/4*100=50% //Rejected
[Coke]=>[Hot Dogs^Chips] //confidence = sup(Hot Dogs^Coke^Chips)/sup(Coke) =
2/3*100=66.67% //Selected
[Chips]=>[Hot Dogs^Coke] //confidence = sup(Hot Dogs^Coke^Chips)/sup(Chips) =
2/4*100=50% //Rejected
There are four strong results (minimum confidence greater than 60%)
buys(X, “computer games”) ⇒ buys(X, “videos”) [support = 40%, confidence = 66%] (Rule 1)
Suppose that of the 10,000 transactions analyzed, 6,000 include computer games, 7,500 include videos, and 4,000 include both. Rule 1 is a strong association rule and would therefore be reported, since its support value of 4000 / 10,000 = 40% and confidence value of 4000 / 6000 = 66% satisfy the minimum support and minimum confidence thresholds, respectively. However, Rule 1 is misleading because the overall probability of purchasing videos is 7500 / 10,000 = 75%, which is even larger than 66%. In fact, computer games and videos are negatively associated because the purchase of one of these items actually decreases the likelihood of purchasing the other. Without fully understanding this phenomenon, we could easily make unwise business decisions based on Rule 1.
A correlation rule is measured not only by its support and confidence but also by the correlation
between itemsets A and B. There are many different correlation measures from which to choose. In this
subsection, we study several correlation measures to determine which would be good for mining large
data sets.
Lift is a simple correlation measure that is given as follows. The occurrence of itemset A is independent
of the occurrence of itemset B if P(A ∪B) = P(A)P(B); otherwise, itemsets A and B are dependent and
correlated as events. This definition can easily be extended to more than two itemsets. The lift between
the occurrence of A and B can be measured by computing
lift(A, B) = P(A ∪ B) / (P(A) P(B))    (Eq. 1)
If the resulting value of Eq. 1 is less than 1, then the occurrence of A is negatively correlated with the
occurrence of B, meaning that the occurrence of one likely leads to the absence of the other one. If the
resulting value is greater than 1, then A and B are positively correlated, meaning that the occurrence of
one implies the occurrence of the other. If the resulting value is equal to 1, then A and B are independent
and there is no correlation between them
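As a quick check, plugging the figures quoted above into Eq. 1 confirms the negative correlation (the 7,500 video count follows from the stated 75%):

# 10,000 transactions: 6,000 with computer games, 7,500 with videos, 4,000 with both
p_both = 4000 / 10000      # P(games AND videos) = 0.40
p_games = 6000 / 10000     # 0.60
p_videos = 7500 / 10000    # 0.75

print(round(p_both / (p_games * p_videos), 2))   # 0.89 < 1: negatively correlated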
A number of variations to this approach are described next, where each variation involves “playing” with
the support threshold in a slightly different way. The variations are illustrated in Figures where nodes
indicate an item or itemset that has been examined, and nodes with thick borders indicate that an
examined item or itemset is frequent.
Using uniform minimum support for all levels (referred to as uniform support): The same minimum
support threshold is used when mining at each abstraction level. For example, in Figure, a minimum
support threshold of 5% is used throughout (e.g., for mining from “computer” downward to “laptop
computer”). Both “computer” and “laptop computer” are found to be frequent, whereas “desktop
computer” is not. When a uniform minimum support threshold is used, the search procedure is simplified.
The method is also simple in that users are required to specify only one minimum support threshold. An
Apriori-like optimization technique can be adopted, based on the knowledge that an ancestor is a superset
of its descendants: The search avoids examining itemsets containing any item of which the ancestors do
not have minimum support. The uniform support approach, however, has some drawbacks. It is unlikely
that items at lower abstraction levels will occur as frequently as those at higher abstraction levels. If the
minimum support threshold is set too high, it could miss some meaningful associations occurring at low
abstraction levels. If the threshold is set too low, it may generate many uninteresting associations
occurring at high abstraction levels. This provides the motivation for the next approach.
Using reduced minimum support at lower levels (referred to as reduced support): Each abstraction
level has its own minimum support threshold. The deeper the abstraction level, the smaller the
corresponding threshold. For example, in Figure, the minimum support thresholds for levels 1 and 2 are
5% and 3%, respectively. In this way, “computer,” “laptop computer,” and “desktop computer” are all
considered frequent.
Using item or group-based minimum support (referred to as group-based support): Because users
or experts often have insight as to which groups are more important than others, it is sometimes more
desirable to set up user-specific, item, or group-based minimal support thresholds when mining multilevel
rules. For example, a user could set up the minimum support thresholds based on product price or on
items of interest, such as by setting particularly low support thresholds for “camera with price over
$1000” or “Tablet PC,” to pay particular attention to the association patterns containing items in these
categories. For mining patterns with mixed items from groups with different support thresholds, usually
the lowest support threshold among all the participating groups is taken as the support threshold in
mining. This will avoid filtering out valuable patterns containing items from the group with the lowest
support threshold. In the meantime, the minimal support threshold for each individual group should be
kept to avoid generating uninteresting itemsets from each group. Other interestingness measures can be
used after the itemset mining to extract truly interesting rules.
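As a small illustration of the rule "take the lowest threshold among the participating groups", here is a sketch with invented item groups and thresholds:

# Invented item-to-group mapping and per-group minimum supports
group_of_item = {"camera_over_1000": "high_end", "tablet_pc": "high_end",
                 "computer": "general", "printer": "general"}
min_sup_of_group = {"high_end": 0.001, "general": 0.05}

def mining_threshold(pattern):
    # For a pattern mixing groups, apply the lowest of the participating thresholds
    return min(min_sup_of_group[group_of_item[item]] for item in pattern)

print(mining_threshold({"camera_over_1000", "computer"}))   # 0.001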
Instead of considering transactional data only, sales and related information are often linked with
relational data or integrated into a data warehouse. Such data stores are multidimensional in nature. For
instance, in addition to keeping track of the items purchased in sales transactions, a relational database
may record other attributes associated with the items and/or transactions such as the item description or
the branch location of the sale. Additional relational information regarding the customers who purchased
the items (e.g., customer age, occupation, credit rating, income, and address) may also be stored.
Considering each database attribute or warehouse dimension as a predicate, we can therefore mine association rules containing multiple predicates, such as age, occupation, and buys (call such a rule Rule 2).
Association rules that involve two or more dimensions or predicates can be referred to as
multidimensional association rules. Rule 2 contains three predicates (age, occupation, and buys), each of
which occurs only once in the rule. Hence, we say that it has no repeated predicates. Multidimensional
association rules with no repeated predicates are called interdimensional association rules. We can also
mine multidimensional association rules with repeated predicates, which contain multiple occurrences of
some predicates. These rules are called hybrid-dimensional association rules. An example is a rule in which the predicate buys is repeated, appearing in both the antecedent and the consequent.
Database attributes can be nominal or quantitative. The values of nominal (or categorical) attributes are
“names of things.” Nominal attributes have a finite number of possible values, with no ordering among
the values (e.g., occupation, brand, color). Quantitative attributes are numeric and have an implicit
ordering among values (e.g., age, income, price). Techniques for mining multidimensional association
rules can be categorized into two basic approaches regarding the treatment of quantitative attributes. In
the first approach, quantitative attributes are discretized using predefined concept hierarchies. This
discretization occurs before mining. For instance, a concept hierarchy for income may be used to replace
the original numeric values of this attribute by interval labels such as “0..20K,” “21K..30K,” “31K..40K,”
and so on. Here, discretization is static and predetermined.
In the second approach, quantitative attributes are discretized or clustered into “bins” based on the data
distribution. These bins may be further combined during the mining process. The discretization process is
dynamic and established so as to satisfy some mining criteria such as maximizing the confidence of the
rules mined. Because this strategy treats the numeric attribute values as quantities rather than as
predefined ranges or categories, association rules mined from this approach are also referred to as
(dynamic) quantitative association rules.
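A minimal sketch of the first (static, predefined) approach, mapping raw income values to the interval labels mentioned above (the top bucket is an assumed catch-all, since the notes stop at "31K..40K"):

def income_label(income):
    # Replace a numeric income with a predefined interval label before mining
    if income <= 20_000:
        return "0..20K"
    elif income <= 30_000:
        return "21K..30K"
    elif income <= 40_000:
        return "31K..40K"
    return "41K+"   # assumed catch-all bucket

print([income_label(v) for v in (18_500, 24_000, 39_999, 55_000)])
# ['0..20K', '21K..30K', '31K..40K', '41K+']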
Mining that is guided by user-specified constraints is known as constraint-based mining. The constraints can include the following:
Knowledge type constraints: These specify the type of knowledge to be mined, such as association,
correlation, classification, or clustering.
Dimension/level constraints: These specify the desired dimensions (or attributes) of the data, the
abstraction levels, or the level of the concept hierarchies to be used in mining.
Interestingness constraints: These specify thresholds on statistical measures of rule interestingness such
as support, confidence, and correlation.
Rule constraints: These specify the form of, or conditions on, the rules to be mined. Such constraints
may be expressed as metarules (rule templates), as the maximum or minimum number of predicates that
can occur in the rule antecedent or consequent, or as relationships among attributes, attribute values,
and/or aggregates.
The challenge of mining colossal patterns. Consider a 40 × 40 square table where each row contains the
integers 1 through 40 in increasing order. Remove the integers on the diagonal, and this gives a 40 × 39
table. Add 20 identical rows to the bottom of the table, where each row contains the integers 41 through
79 in increasing order, resulting in a 60 × 39 table.
We consider each row as a transaction and set the minimum support threshold at 20. The table has an exponential number (about 40 choose 20, on the order of 10^11) of midsize closed/maximal frequent patterns of size 20, but only one that is colossal: α = (41, 42, ..., 79) of size 39. None of the frequent pattern mining algorithms that we have introduced so far can complete execution in a reasonable amount of time.
Figure: A simple colossal patterns example. The data set contains an exponential number of midsize patterns of size 20 but only one that is colossal, namely (41, 42, ..., 79).
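The transaction table described above is easy to generate, which makes the scale of the problem concrete; a quick construction sketch:

# 40 rows of 1..40 with the diagonal element removed, plus 20 identical rows of 41..79
rows = [[v for v in range(1, 41) if v != i] for i in range(1, 41)]   # 40 rows x 39 items
rows += [list(range(41, 80))] * 20                                   # 20 rows x 39 items

print(len(rows), len(rows[0]))   # 60 39
# With min_sup = 20, the only colossal frequent pattern is (41, 42, ..., 79),
# supported by exactly the last 20 rows.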
A new mining strategy called Pattern-Fusion was developed, which fuses a small number of shorter
frequent patterns into colossal pattern candidates. It thereby takes leaps in the pattern search space and
avoids the pitfalls of both breadth-first and depth first searches. This method finds a good approximation
to the complete set of colossal frequent patterns. The Pattern-Fusion method has the following major
characteristics. First, it traverses the tree in a bounded-breadth way. Only a fixed number of patterns in a
bounded-size candidate pool are used as starting nodes to search downward in the pattern tree. As such, it
avoids the problem of exponential search space. Second, Pattern-Fusion has the capability to identify
“shortcuts” whenever possible. Each pattern’s growth is not performed with one-item addition, but with
an agglomeration of multiple patterns in the pool. These shortcuts direct Pattern-Fusion much more
rapidly down the search tree toward the colossal patterns.
The clustering-based approach detects three clusters and takes the most representative pattern to be the "centermost" pattern of each cluster. These patterns are chosen to represent the data. The selected patterns are considered
“summarized patterns” in the sense that they represent or “provide a summary” of the clusters they stand
for. By contrast, in Figure (d) the redundancy-aware top-k patterns make a trade-off between significance
and redundancy. The three patterns chosen here have high significance and low redundancy. Observe, for
example, the two highly significant patterns that, based on their redundancy, are displayed next to each
other. The redundancy-aware top-k strategy selects only one of them, taking into consideration that two
would be redundant. To formalize the definition of redundancy-aware top-k patterns, we’ll need to define
the concepts of significance and redundancy.
1. In the first step, i.e., learning, a classification model is built from the training data.
2. In the second step, i.e., classification, the accuracy of the model is checked and the model is then used to classify new data. The class labels here are discrete values such as "yes" or "no", "safe" or "risky".
Regression Analysis
Regression analysis is used for the prediction of numeric attributes.
Numeric attributes are also called continuous values. A model built to predict continuous values instead of class labels is called a regression model. In a regression tree, the output at a leaf node is the mean of all observed values of the training tuples that fall in that node.
Training and Testing:
Suppose a person is sitting under a fan and the fan starts falling on him; he should move aside so as not to get hurt. This is the training part: learning to move away. During testing, if the person sees any heavy object coming towards him or falling on him and moves aside, then the system has tested positively; if the person does not move aside, then the system has tested negatively.
The same is the case with data: it must be trained in order to get accurate and reliable results.
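A minimal sketch of this train-then-test workflow using scikit-learn on a synthetic data set (all names and numbers here are illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000)   # step 1: learning from the training data
model.fit(X_train, y_train)

y_pred = model.predict(X_test)              # step 2: classifying unseen (test) data
print("test accuracy:", accuracy_score(y_test, y_pred))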
There are certain data types associated with data mining that tell us the format of the data (whether it is in text form or in numerical form).
Attributes – Represents different features of an object. Different types of attributes are:
1. Binary: Possesses only two values i.e. True or False
Example: Suppose there is a survey evaluating some products. We need to check whether it’s useful
or not. So, the Customer has to answer it in Yes or No.
Product usefulness: Yes / No
Symmetric: Both values are equally important in all aspects
Asymmetric: When both the values may not be important.
2. Nominal: When more than two outcomes are possible. The values are names or symbols rather than integers, with no ordering among them.
Example: One needs to choose some material but of different colors. So, the color might be
Yellow, Green, Black, Red.
Different Colors: Red, Green, Black, Yellow
Ordinal: Values that must have some meaningful order.
Example: Suppose there are grade sheets of few students which might contain different grades as per
their performance such as A, B, C, D
Grades: A, B, C, D
Continuous: May take an infinite number of values; typically represented as floating-point numbers.
Example: Measuring the weight of few Students in a sequence or orderly manner i.e. 50, 51, 52, 53
Weight: 50, 51, 52, 53
Discrete: Finite number of values.
Example: Marks of a Student in a few subjects: 65, 70, 75, 80, 90
Marks: 65, 70, 75, 80, 90
Syntax:
Mathematical notation: Classification is based on building a function f that takes an input feature vector X and predicts its outcome Y, a qualitative response taking values in a set of classes C; that is, Y = f(X).
Here a classifier (or model) is used, which is a supervised function; it can also be designed manually based on expert knowledge. It is constructed to predict class labels (for example, the label "Yes" or "No" for the approval of some event).
Classifiers can be categorized into two major types:
1. Discriminative: A discriminative classifier determines a single class for each row of data. It models the decision boundary directly from the observed data and therefore depends heavily on the quality of the data rather than on the class distributions.
Example: Logistic Regression
2. Generative: A generative classifier models the distribution of the individual classes and tries to learn the model that generates the data behind the scenes by estimating the assumptions and distributions of the model. It can then be used to predict unseen data.
Example: Naive Bayes Classifier
To illustrate, consider detecting spam emails by looking at previous data. Suppose there are 100 emails, divided into Class A: 25% (spam emails) and Class B: 75% (non-spam emails). A user wants to check whether an email containing the word "cheap" should be classified as spam.
In Class A (the 25 spam emails), 20 out of 25 contain the word "cheap".
In Class B (the 75 non-spam emails), 5 out of 75 contain the word "cheap"; the remaining 70 do not.
So, if an email contains the word "cheap", what is the probability of it being spam? (= 80%)
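Working through those counts with Bayes' theorem (as interpreted here: 20 of the 25 spam emails and 5 of the 75 non-spam emails contain the word "cheap") reproduces the 80% figure:

spam_with_cheap, nonspam_with_cheap = 20, 5

# Directly: fraction of "cheap" emails that are spam
print(spam_with_cheap / (spam_with_cheap + nonspam_with_cheap))   # 0.8

# Equivalent Bayes form: P(spam|cheap) = P(cheap|spam) * P(spam) / P(cheap)
p_cheap_given_spam = 20 / 25
p_spam = 25 / 100
p_cheap = (20 + 5) / 100
print(p_cheap_given_spam * p_spam / p_cheap)                      # 0.8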
Decision tree induction is the method of learning the decision trees from the training set. The training set
consists of attributes and class labels. Applications of decision tree induction include astronomy, financial
analysis, medical diagnosis, manufacturing, and production.
A decision tree is a flowchart tree-like structure that is made from training set tuples. The dataset is
broken down into smaller subsets and is present in the form of nodes of a tree. The tree structure has a
root node, internal nodes or decision nodes, leaf node, and branches.
The root node is the topmost node. It represents the best attribute selected for classification. Internal nodes (decision nodes) represent a test on an attribute of the dataset, while a leaf node (terminal node) represents a classification or decision label. The branches show the outcomes of the tests performed. Some decision trees have only binary nodes, meaning exactly two branches per node, while other decision trees are non-binary.
CART
The CART model, i.e., Classification and Regression Trees, is a decision tree algorithm for building models. A decision tree model in which the target values have a discrete nature is called a classification model; a discrete attribute takes values from a finite or countably infinite set (for example, age or size). A model in which the target values are continuous numbers is called a regression model; continuous variables are floating-point variables. These two model types together give CART its name. CART uses the Gini index as its splitting (attribute selection) measure.
ID3 was later extended into C4.5, its successor. ID3 and C4.5 follow a greedy top-down approach for constructing decision trees. The algorithm starts with a training dataset with class labels, which is partitioned into smaller and smaller subsets as the tree is being constructed.
#1) Initially, there are three parameters i.e. attribute list, attribute selection method and data
partition. The attribute list describes the attributes of the training set tuples.
#2) The attribute selection method describes the method for selecting the best attribute for discrimination
among tuples. The methods used for attribute selection can either be Information Gain or Gini Index.
#3) The structure of the tree (binary or non-binary) is decided by the attribute selection method.
#4) When constructing a decision tree, it starts as a single node representing the tuples.
#5) If the root node tuples represent different class labels, then it calls an attribute selection method to
split or partition the tuples. The step will lead to the formation of branches and decision nodes.
#6) The splitting method determines which attribute should be selected to partition the data tuples. It also determines the branches to be grown from the node according to the test outcomes. The main motive of the splitting criterion is that the partition at each branch of the decision tree should be as pure as possible, i.e., ideally contain tuples of a single class.
An example of splitting attribute is shown below:
#7) The above partitioning steps are followed recursively to form a decision tree for the training dataset
tuples.
#8) The partitioning stops only when either all the partitions have been made (each partition is pure) or the remaining tuples cannot be partitioned further.
#9) The complexity of the algorithm is O(n × |D| × log |D|), where n is the number of attributes in training dataset D and |D| is the number of tuples.
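A short sketch of decision tree induction with scikit-learn on an invented buys-computer style data set, using information gain (criterion="entropy") as the attribute selection method:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "age":     ["youth", "youth", "middle", "senior", "senior", "senior", "middle", "youth"],
    "student": ["no",    "yes",   "no",     "no",     "yes",    "yes",    "yes",    "no"],
    "buys":    ["no",    "yes",   "yes",    "no",     "yes",    "yes",    "yes",    "no"],
})
X = pd.get_dummies(data[["age", "student"]])    # one-hot encode the nominal attributes
y = data["buys"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))   # prints the learned tests and leaves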
Entropy:
Entropy is a common way to measure impurity. In a decision tree, it measures the randomness or impurity in the data set:
Entropy(D) = - Σ p_i log2(p_i)
where p_i is the probability that a tuple in D belongs to class C_i. The information is encoded in bits; therefore, log to the base 2 is used. Entropy(D) represents the average amount of information required to identify the class label of a tuple in dataset D.
The information still required for an exact classification after partitioning on an attribute X is given by the formula:
Info_X(D) = Σ_j (|D_j| / |D|) × Entropy(D_j)
where |D_j| / |D| acts as the weight of the j-th partition. This value represents the information needed to classify dataset D after partitioning it by X.
Information gain is the difference between the original information requirement and the expected information requirement after partitioning the tuples of dataset D on attribute X:
Gain(X) = Entropy(D) - Info_X(D)
Gain is the reduction in the information required that results from knowing the value of X. The attribute with the highest information gain is chosen as the "best" splitting attribute.
The Gini index measures the impurity of dataset D as:
Gini(D) = 1 - Σ p_i^2
where p_i is the probability that a tuple in D belongs to class C_i. The Gini index for a binary split of dataset D by attribute A into partitions D1 and D2 is given by:
Gini_A(D) = (|D1| / |D|) × Gini(D1) + (|D2| / |D|) × Gini(D2)
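The three measures above translate directly into code; the class counts and the two-way split below are assumed values used only to exercise the functions:

from collections import Counter
from math import log2

def entropy(labels):
    # Entropy(D) = -sum(p_i * log2(p_i)) over the class proportions
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    # Gini(D) = 1 - sum(p_i^2) over the class proportions
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def info_after_split(partitions):
    # Info_X(D): weighted average entropy of the partitions produced by attribute X
    total = sum(len(p) for p in partitions)
    return sum(len(p) / total * entropy(p) for p in partitions)

D = ["yes"] * 9 + ["no"] * 5                                   # assumed class counts
split = [["yes"] * 6 + ["no"] * 2, ["yes"] * 3 + ["no"] * 3]   # an assumed 2-way split
print(round(entropy(D), 3))                              # 0.940
print(round(gini(D), 3))                                 # 0.459
print(round(entropy(D) - info_after_split(split), 3))    # information gain of the split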
o Naïve: It is called naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple; each feature individually contributes to identifying it as an apple, without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the
probability of a hypothesis with prior knowledge. It depends on the conditional probability.
o The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) × P(A) / P(B)
Where,
P(A|B) is the posterior probability: the probability of hypothesis A given the observed evidence B.
P(B|A) is the likelihood probability: the probability of the evidence given that the hypothesis is true.
P(A) is the prior probability: the probability of the hypothesis before observing the evidence.
P(B) is the marginal probability: the probability of the evidence.
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether we should play or not on a particular day according to the weather conditions. To solve this problem, we need to follow the steps below:
1. Convert the given dataset into frequency tables.
2. Generate a likelihood table by finding the probabilities of the given features.
3. Use Bayes' theorem to calculate the posterior probability.
Problem: If the weather is sunny, should the player play or not?
Day  Weather   Play
0    Rainy     Yes
1    Sunny     Yes
2    Overcast  Yes
3    Overcast  Yes
4    Sunny     No
5    Rainy     Yes
6    Sunny     Yes
7    Overcast  Yes
8    Rainy     No
9    Sunny     No
10   Sunny     Yes
11   Rainy     No
12   Overcast  Yes
13   Overcast  Yes
Frequency table of weather conditions:
Weather   Yes  No
Overcast  5    0
Rainy     2    2
Sunny     3    2
Total     10   4
Likelihood table of weather conditions:
Weather   No           Yes
Overcast  0            5            5/14 = 0.36
Rainy     2            2            4/14 = 0.29
Sunny     2            3            5/14 = 0.35
All       4/14 = 0.29  10/14 = 0.71
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.30
P(Sunny) = 0.35
P(Yes) = 0.71
So P(Yes|Sunny) = 0.30 * 0.71 / 0.35 = 0.60
P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.50
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.50 * 0.29 / 0.35 = 0.41
Since P(Yes|Sunny) > P(No|Sunny), on a sunny day the player can play the game.
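The same calculation can be reproduced from the 14-row table with a few lines of Python (exact fractions give 0.60 and 0.40; the rounded hand calculation above gives approximately 0.41 for the second value):

from collections import Counter

data = [("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"), ("Overcast", "Yes"),
        ("Sunny", "No"), ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
        ("Rainy", "No"), ("Sunny", "No"), ("Sunny", "Yes"), ("Rainy", "No"),
        ("Overcast", "Yes"), ("Overcast", "Yes")]

play_counts = Counter(play for _, play in data)      # {'Yes': 10, 'No': 4}
joint_counts = Counter(data)                         # (weather, play) pair counts

def posterior(play, weather="Sunny"):
    p_weather_given_play = joint_counts[(weather, play)] / play_counts[play]
    p_play = play_counts[play] / len(data)
    p_weather = sum(1 for w, _ in data if w == weather) / len(data)
    return p_weather_given_play * p_play / p_weather

print(round(posterior("Yes"), 2), round(posterior("No"), 2))   # 0.6 0.4 -> predict "Yes"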
Points to remember −
The IF part of the rule is called rule antecedent or precondition.
The THEN part of the rule is called rule consequent.
The antecedent part (the condition) consists of one or more attribute tests, and these tests are logically ANDed.
The consequent part consists of class prediction.
Rule Extraction
Decision tree classifiers are a popular method of classification: it is easy to understand how decision trees work, and they are known for their accuracy. Decision trees can, however, become large and difficult to interpret. In comparison with a decision tree, IF-THEN rules may be easier for humans to understand, particularly if the decision tree is very large. To extract rules from a decision tree, one rule is created for each path from the root to a leaf node. Each splitting criterion along a given path is logically ANDed to form the rule antecedent ("IF" part). The leaf node holds the class prediction, forming the rule consequent ("THEN" part). Extracting classification rules from a decision tree: the decision tree in the figure below can be converted to the classification IF-THEN rules by tracing the path from the root node to each leaf node in the tree.
R1: IF age = youth AND student = no THEN buys computer = no
R2: IF age = youth AND student = yes THEN buys computer = yes
R3: IF age = middle aged THEN buys computer = yes
R4: IF age = senior AND credit rating = excellent THEN buys computer = yes
R5: IF age = senior AND credit rating = fair THEN buys computer = no
Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from a decision tree.
Points to remember −
To extract a rule from a decision tree −
One rule is created for each path from the root to the leaf node.
To form a rule antecedent, each splitting criterion is logically ANDed.
The leaf node holds the class prediction, forming the rule consequent.
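Rules R1 to R5 above can be encoded directly as a small rule-based classifier; a sketch:

def buys_computer(age, student=None, credit_rating=None):
    if age == "youth" and student == "no":                    # R1
        return "no"
    if age == "youth" and student == "yes":                   # R2
        return "yes"
    if age == "middle_aged":                                  # R3
        return "yes"
    if age == "senior" and credit_rating == "excellent":      # R4
        return "yes"
    if age == "senior" and credit_rating == "fair":           # R5
        return "no"
    return None   # no rule fires for this tuple

print(buys_computer("youth", student="yes"))           # yes
print(buys_computer("senior", credit_rating="fair"))   # no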
Model Evaluation
Model evaluation is the process of using different evaluation metrics to understand a machine learning
model’s performance, as well as its strengths and weaknesses. Model evaluation is important to assess the
efficacy of a model during initial research phases, and it also plays a role in model monitoring.
Classification
The most popular metrics for measuring classification performance include accuracy, precision, recall, and the confusion matrix.
Classification Metrics
Classification is about predicting the class labels given input data. In binary classification, there are only two possible output classes (i.e., a dichotomy). In multiclass classification, more than two possible classes can be present. Here we focus only on binary classification.
A very common example of binary classification is spam detection, where the input data could include the
email text and metadata (sender, sending time), and the output label is either “spam” or “not spam.” (See
Figure) Sometimes, people use some other names also for the two classes: “positive” and “negative,” or
“class 1” and “class 0.”
There are many ways of measuring classification performance. Accuracy, confusion matrix, log-loss, and AUC-ROC are some of the most popular metrics. Precision and recall are also widely used metrics for classification problems.
Accuracy
Accuracy simply measures how often the classifier correctly predicts. We can define accuracy as the ratio
of the number of correct predictions and the total number of predictions.
When a model gives an accuracy rate of 99%, you might think that the model is performing very well, but this is not always true and can be misleading in some situations. Let me explain this with the help of an example.
Consider a binary classification problem, where a model can achieve only two results, either model gives
a correct or incorrect prediction. Now imagine we have a classification task to predict if an image is a
dog or cat as shown in the image. In a supervised learning algorithm, we first fit/train a model on training
data, then test the model on testing data. Once we have the model’s predictions from the X_test data, we
compare them to the true y_values (the correct labels).
We feed the image of the dog into the trained model. Suppose the model predicts that this is a dog; we compare the prediction to the correct label, and it is correct. If instead the model predicts that this image is a cat, we again compare it to the correct label, and this time it is incorrect.
We repeat this process for all images in X_test data. Eventually, we’ll have a count of correct and
incorrect matches. But in reality, it is very rare that all incorrect or correct matches hold equal value.
Therefore one metric won’t tell the entire story.
Accuracy is useful when the target classes are well balanced, but it is not a good choice for unbalanced classes. Imagine a scenario where we had 99 images of dogs and only 1 image of a cat in our data. A model that always predicts "dog" would achieve 99% accuracy. In reality, data is often imbalanced, for example in spam email detection, credit card fraud, and medical diagnosis. Hence, if we want a better evaluation and a full picture of model performance, other metrics such as recall and precision should also be considered.
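A quick demonstration of the imbalance problem described above, using the 99-dogs-and-1-cat scenario:

from sklearn.metrics import accuracy_score

y_true = ["dog"] * 99 + ["cat"]
y_pred = ["dog"] * 100            # a useless classifier that always predicts "dog"

print(accuracy_score(y_true, y_pred))   # 0.99 despite never detecting the cat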
Confusion Matrix
Evaluation of the performance of a classification model is based on the counts of test records correctly
and incorrectly predicted by the model. The confusion matrix provides a more insightful picture: not only the overall performance of a predictive model, but also which classes are being predicted correctly or incorrectly and what types of errors are being made. To illustrate, the four basic counts (TP, FP, FN, TN), obtained by comparing each predicted value with the actual value, are laid out in the confusion matrix table below.
Precision is the ratio of true positives to all the positives predicted by the model: Precision = TP / (TP + FP).
Low precision: the more false positives the model predicts, the lower the precision.
Recall (sensitivity) is the ratio of true positives to all the actual positives in your dataset: Recall = TP / (TP + FN).
Low recall: the more false negatives the model predicts, the lower the recall.
The idea of recall and precision may seem abstract, so let me illustrate the difference with a real case: screening residents for COVID-19.
A TP (true positive) means a resident with COVID-19 is diagnosed with COVID-19.
A TN (true negative) means a healthy resident is identified as healthy.
A FP (false positive) means an actually healthy resident is predicted to have COVID-19.
A FN (false negative) means a resident who actually has COVID-19 is predicted to be healthy.
In this case, which type of error do you think carries the highest cost?
If we predict COVID-19 residents as healthy and they do not quarantine, there could be a massive number of new COVID-19 infections. Here the cost of false negatives is much higher than the cost of false positives.
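A short sketch computing the confusion matrix counts, precision, and recall on a small made-up screening example (1 = infected, 0 = healthy):

from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]   # two false negatives, one false positive

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP", tp, "FP", fp, "FN", fn, "TN", tn)            # TP 2 FP 1 FN 2 TN 5
print("precision:", precision_score(y_true, y_pred))     # 2 / (2 + 1) = 0.67
print("recall:   ", recall_score(y_true, y_pred))        # 2 / (2 + 2) = 0.50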
MODEL SELECTION
To train a model, we collect enormous quantities of data to help the machine learn better.
Usually, a good portion of the data collected is noise, while some of the columns of our dataset
might not contribute significantly to the performance of our model. Further, having a lot of data
can slow down the training process and cause the model to be slower. The model may also
learn from this irrelevant data and be inaccurate.
Feature selection is what separates good data scientists from the rest. Given the same model
and computational facilities, why do some people win in competitions with faster and more
accurate models? The answer is Feature Selection. Apart from choosing the right model for our
data, we need to choose the right data to put in our model.
Consider a table which contains information on old cars. The model decides which cars must be
crushed for spare parts.
In the above table, we can see that the model of the car, the year of manufacture, and the miles
it has traveled are pretty important to find out if the car is old enough to be crushed or not.
However, the name of the previous owner of the car does not decide if the car should be
crushed or not. Further, it can confuse the algorithm into finding patterns between names and
the other features. Hence we can drop the column.
What is Feature Selection?
Feature Selection is the method of reducing the input variable to your model by using only
relevant data and getting rid of noise in data.
It is the process of automatically choosing relevant features for your machine learning model
based on the type of problem you are trying to solve. We do this by including or excluding
important features without changing them. It helps in cutting down the noise in our data and
reducing the size of our input data.
1. Supervised Models: Supervised feature selection refers to the method which uses the output
label class for feature selection. They use the target variables to identify the variables which can
increase the efficiency of the model
2. Unsupervised Models: Unsupervised feature selection refers to the method which does not need
the output label class for feature selection. We use them for unlabelled data.
The input variables that we give to our machine learning models are called features. Each
column in our dataset constitutes a feature. To train an optimal model, we need to make sure
that we use only the essential features. If we have too many features, the model can capture
the unimportant patterns and learn from noise. The method of choosing the important
parameters of our data is called Feature Selection.
1. Filter Method: In this method, features are dropped based on their relation to the output, or
how they are correlating to the output. We use correlation to check if the features are positively
or negatively correlated to the output labels and drop features accordingly. Eg: Information
Gain, Chi-Square Test, Fisher’s Score, etc.
Filter Method flowchart
2. Wrapper Method: We split our data into subsets and train a model using this. Based on the
output of the model, we add and subtract features and train the model again. It forms the
subsets using a greedy approach and evaluates the accuracy of all the possible combinations of
features. Eg: Forward Selection, Backwards Elimination, etc.
3. Intrinsic Method: This method combines the qualities of both the Filter and Wrapper method
to create the best subset.
This method performs feature selection as part of the model training process itself, while keeping the computational cost to a minimum. Eg: Lasso and Ridge Regression.
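A minimal sketch of the filter method using scikit-learn's SelectKBest with a chi-square score on the Iris data set (the choice of data set and of k = 2 is only for illustration):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=chi2, k=2)    # keep the 2 highest-scoring features
X_reduced = selector.fit_transform(X, y)

print(X.shape, "->", X_reduced.shape)           # (150, 4) -> (150, 2)
print("chi-square scores:", selector.scores_)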
1) Bagging
Bagging gets its name because it combines Bootstrapping and Aggregation to form one Ensemble model.
Given a sample of data, different bootstrapped subsamples are extracted. A Decision Tree is created on
each of the bootstrapped subsamples. After each subsample’s Decision Tree has been formed, an
algorithm is used to aggregate the decision trees to develop the most efficient predictor. The image below
explains this:
2) Random Forest Models
Random Forest models are very similar to Bagging, though they work in a slightly different way. When deciding how to split, the decision trees in the Bagging method have the full set of features to select from. So, although the bootstrapped samples may look different, the trees tend to split on the same strong features throughout each model.
On the other hand, Random Forest models choose the split at each node from a randomly selected subset of features. Random Forest models therefore induce a level of variation, because each tree splits on different, randomly chosen features. This produces a more diverse set of trees, which can be aggregated into a more accurate prediction.
3) Boosting
The Boosting Method comprises the use of algorithms called Strong and Weak Learners. AdaBoost
(which stands for Adaptive Boosting) is the most used of all Boosting Algorithms, where the main model
is built on several weak learners. Weak learners are so-called because they are characteristically simple
with restricted prediction abilities and, as a result, are just slightly better at accuracy than random guesses.
However, unlike Bagging, Boosting is a sequential method and cannot be used for parallel operations.
The adaptation capability of AdaBoost was a significant factor in this technique becoming one of the earliest successful binary classifiers. Sequential decision trees were the core of this adaptability, with each tree adjusting its weights based on prior knowledge of accuracies.
Consider a credit-approval rule under which a customer who has had a job for at least two years will receive credit if her income is, say, $50,000, but not if it is $49,000. Such harsh thresholding may seem unfair. Instead, we can discretize income into categories (e.g., {low income, medium income, high income}) and then apply fuzzy logic to allow "fuzzy" thresholds or boundaries to be defined for each category (see the figure). Rather than having a
precise cutoff between categories, fuzzy logic uses truth values between 0.0 and 1.0 to represent the
degree of membership that a certain value has in a given category. Each category then represents a fuzzy
set. Hence, with fuzzy logic, we can capture the notion that an income of $49,000 is, more or less, high,
although not as high as an income of $50,000. Fuzzy logic systems typically provide graphical tools to
assist users in converting attribute values to fuzzy truth values. Fuzzy set theory is also known as
possibility theory. It was proposed by Lotfi Zadeh in 1965 as an alternative to traditional two-value logic
and probability theory. It lets us work at a high abstraction level and offers a means for dealing with
imprecise data measurement. Most important, fuzzy set theory allows us to deal with vague or inexact
facts. For example, being a member of a set of high incomes is inexact (e.g., if $50,000 is high, then what
about $49,000? or $48,000?) Unlike the notion of traditional “crisp” sets where an element belongs to
either a set S or its complement, in fuzzy set theory, elements can belong to more than one fuzzy set. For
example, the income value $49,000 belongs to both the medium and high fuzzy sets, but to differing
degrees. Using fuzzy set notation and following Figure 9.15, this can be shown as
m_medium income($49,000) = 0.15 and m_high income($49,000) = 0.96, where m denotes the membership function operating on the fuzzy sets of medium income and high income, respectively. In fuzzy set
theory, membership values for a given element, x (e.g., for $49,000), do not have to sum to 1. This is
unlike traditional probability theory, which is constrained by a summation axiom.
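The membership idea can be sketched with simple trapezoidal membership functions. The true shapes of the medium-income and high-income fuzzy sets are not given here, so the breakpoints below are assumptions chosen only so that the example roughly reproduces the quoted degrees of 0.15 and 0.96:

def trapezoid(x, a, b, c, d):
    # Membership rises from a to b, is 1.0 between b and c, and falls from c to d
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

def m_medium_income(x):
    return trapezoid(x, 10_000, 20_000, 40_000, 50_600)    # assumed breakpoints

def m_high_income(x):
    return trapezoid(x, 37_000, 49_500, 200_000, 250_000)  # assumed breakpoints

# An income of $49,000 belongs to both fuzzy sets, to differing degrees
print(round(m_medium_income(49_000), 2), round(m_high_income(49_000), 2))   # 0.15 0.96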
Statistical Problem –
The Statistical Problem arises when the hypothesis space is too large for the
amount of available data. Hence, there are many hypotheses with the same
accuracy on the data and the learning algorithm chooses only one of them! There is
a risk that the accuracy of the chosen hypothesis is low on unseen data!
Computational Problem –
The Computational Problem arises when the learning algorithm cannot guarantee finding the best hypothesis.
Representational Problem –
The Representational Problem arises when the hypothesis space does not contain
any good approximation of the target class(es).
If ensembles are used for classification, high accuracies can be accomplished when the different base models misclassify different training examples, even if the accuracy of each base classifier is low.
Bagging
Bagging (aka bootstrap aggregating) is a simple but powerful ensemble algorithm that facilitates the
increased stability & accuracy of classification models. The Bagging process works by generating multiple
training datasets via random sampling with replacement, applying the algorithm to each dataset, and then
taking the majority vote amongst the models to determine data classifications. Bagging is a particularly
popular method because it reduces variance, helps to prevent overfitting (i.e., forced applicability of random
irrelevant data), and it can be easily parallelized for application to large datasets.
Implementation steps of Bagging –
1. Multiple subsets of equal size are created from the original data set by selecting observations with replacement.
2. A base model is created on each of these subsets.
3. Each model is learned in parallel from each training set and independent of each
other.
4. The final predictions are determined by combining the predictions from all the
models.
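A minimal sketch of these steps using scikit-learn's BaggingClassifier (whose default base model is a decision tree) on a synthetic data set:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

bagger = BaggingClassifier(
    n_estimators=25,     # number of bootstrapped subsets / base models
    bootstrap=True,      # sample the training set with replacement
    random_state=0,
).fit(X_train, y_train)  # final predictions combine the base models' votes

print("bagging test accuracy:", bagger.score(X_test, y_test))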
Boosting
Boosting is a robust ensemble algorithm that is capable of reducing both bias & variance, and also
facilitates the conversion of weak learners (i.e., classifiers with weak correlations to the true classification) to
strong learners (i.e., well-correlated classifiers). Boosting creates strong classification tree models by
training models to concentrate on misclassified records from previous models; when this is done, all
classifiers are combined by a weighted majority vote. This process places a higher weight on incorrectly classified records while decreasing the weight of correct classifications -- this effectively forces subsequent
models to place a greater emphasis on misclassified records. The algorithm then computes the weighted
sum of votes for each class and assigns the best classification to the record. Boosting frequently yields
better models than bagging, but is not capable of parallelization; consequently, if the dataset is very large
(i.e., significant number of weak learners), then boosting may not be the most appropriate ensemble
method.
Random Forest:
Random Forest is an extension over bagging. Each classifier in the ensemble is a
decision tree classifier and is generated using a random selection of attributes at each
node to determine the split. During classification, each tree votes and the most popular
class is returned.
Implementation steps of Random Forest –
1. Multiple subsets are created from the original data set, selecting
observations with replacement.
2. A subset of features is selected randomly and whichever feature gives the
best split is used to split the node iteratively.
3. Each tree is grown to its largest possible size, without pruning.
4. The above steps are repeated, and the final prediction is based on the aggregation of predictions from n trees.
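A short sketch of a Random Forest, with an AdaBoost model from the Boosting section included for comparison (the data set and parameter values are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees whose votes are aggregated
    max_features="sqrt",  # random subset of features considered at each split
    random_state=0,
).fit(X_train, y_train)

boost = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("random forest accuracy:", forest.score(X_test, y_test))
print("adaboost accuracy:     ", boost.score(X_test, y_test))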
The genetic algorithm applies the same technique in data mining – it iteratively
performs the selection, crossover, mutation, and encoding process to evolve the
successive generation of models.
The idea of genetic algorithm is derived from natural evolution. In genetic algorithm, first of all,
the initial population is created. This initial population consists of randomly generated rules. We
can represent each rule by a string of bits.
For example, in a given training set, the samples are described by two Boolean attributes such
as A1 and A2. And this given training set contains two classes such as C1 and C2.
We can encode the rule IF A1 AND NOT A2 THEN C2 into a bit string 100. In this bit
representation, the two leftmost bits represent the attribute A1 and A2, respectively.
Likewise, the rule IF NOT A1 AND NOT A2 THEN C1 can be encoded as 001.
Note − If an attribute has K values where K > 2, then K bits can be used to encode the attribute values. The classes are also encoded in the same manner.
Points to remember −
Based on the notion of the survival of the fittest, a new population is formed that consists
of the fittest rules in the current population and offspring values of these rules as well.
The fitness of a rule is assessed by its classification accuracy on a set of training
samples.
The genetic operators such as crossover and mutation are applied to create offspring.
In crossover, substrings from a pair of rules are swapped to form a new pair of rules.
In mutation, randomly selected bits in a rule's string are inverted.
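A toy sketch of the encoding, crossover, and mutation operators described above, on 3-bit rule strings (fitness evaluation against training samples is omitted for brevity):

import random
random.seed(0)

population = ["100", "001", "110", "011"]   # initial, randomly generated rule strings

def crossover(rule1, rule2, point=2):
    # Swap the substrings of a pair of rules after the crossover point
    return rule1[:point] + rule2[point:], rule2[:point] + rule1[point:]

def mutate(rule, p=0.1):
    # Invert randomly selected bits in the rule's string
    return "".join(bit if random.random() > p else str(1 - int(bit)) for bit in rule)

child1, child2 = crossover(population[0], population[1])
print(child1, child2)       # '101' '000'
print(mutate(child1))       # child1 with each bit flipped with probability 0.1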
Information system –
In Rough Set, data model information is stored in a table. Each row (tuples) represents
a fact or an object. Often the facts are not consistent with each other. In Rough Set
terminology a data table is called an Information System. Thus, the information table
represents input data, gathered from any domain.
Indiscernibility –
Tables may contain many objects having the same features. A way of reducing table size is to store only one representative object for every set of objects with the same features. These objects are called indiscernible objects or tuples. With any subset of attributes P ⊆ A there is an associated equivalence relation IND(P):
IND(P) = {(x, y) ∈ U × U | for every attribute a in P, a(x) = a(y)}
Here IND(P) is called the P-indiscernibility relation, and x and y are indiscernible from each other by the attributes in P.
A set is said to be rough if its boundary region (the difference between its upper and lower approximations) is non-empty; otherwise the set is crisp.
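A small sketch, on an invented information table, showing the indiscernibility classes IND(P), the lower and upper approximations, and the boundary region that makes a set rough:

objects = {
    "x1": {"Headache": "yes", "Temp": "high",   "Flu": "yes"},
    "x2": {"Headache": "yes", "Temp": "high",   "Flu": "yes"},
    "x3": {"Headache": "yes", "Temp": "high",   "Flu": "no"},   # conflicts with x1, x2
    "x4": {"Headache": "no",  "Temp": "normal", "Flu": "no"},
}
P = ["Headache", "Temp"]
target = {x for x, row in objects.items() if row["Flu"] == "yes"}   # set to approximate

# IND(P): objects with identical values on every attribute in P are indiscernible
classes = {}
for x, row in objects.items():
    classes.setdefault(tuple(row[a] for a in P), set()).add(x)

lower, upper = set(), set()
for c in classes.values():
    if c <= target:    # equivalence class entirely inside the target set
        lower |= c
    if c & target:     # equivalence class overlapping the target set
        upper |= c

print("lower:", lower, "upper:", upper, "boundary:", upper - lower)
# The boundary is non-empty, so the set 'Flu = yes' is rough for this table.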