Unit 3 - DM FULL
Frequent patterns are patterns (e.g., itemsets, subsequences, or substructures) that appear frequently in a
data set. For example, a set of items, such as milk and bread, that appear frequently together in a
transaction data set is a frequent itemset. A subsequence, such as buying first a PC, then a digital camera,
and then a memory card, if it occurs frequently in a shopping history database, is a (frequent) sequential
pattern. A substructure can refer to different structural forms, such as subgraphs, subtrees, or sublattices,
which may be combined with itemsets or subsequences. If a substructure occurs frequently, it is called a
(frequent) structured pattern.
Market basket analysis is a data mining technique used by retailers to increase sales by better
understanding customer purchasing patterns. It involves analyzing large data sets, such as
purchase history, to reveal product groupings and products that are likely to be purchased
together.
The adoption of market basket analysis was aided by the advent of electronic point-of-sale
(POS) systems. Compared to handwritten records kept by store owners, the digital records
generated by POS systems made it easier for applications to process and analyze large volumes
of purchase data.
Implementation of market basket analysis requires a background in statistics and data science
and some algorithmic computer programming skills. For those without the needed technical
skills, commercial, off-the-shelf tools exist.
One example is the Shopping Basket Analysis tool in Microsoft Excel, which analyzes transaction
data contained in a spreadsheet and performs market basket analysis. A transaction ID must
relate to the items to be analyzed. The Shopping Basket Analysis tool then creates two
worksheets:
o The Shopping Basket Item Groups worksheet, which lists items that are frequently purchased together.
o The Shopping Basket Rules worksheet, which shows how items are related (for example, purchasers of Product A are likely to buy Product B).
For example, a customer who buys bread is also likely to buy butter, eggs, or milk, so these products are usually stocked on the same shelf or nearby. Association rule learning of this kind is commonly implemented with the following algorithms:
1. Apriori
2. Eclat
3. F-P Growth Algorithm
Here the IF element is called the antecedent, and the THEN statement is called the consequent. Relationships in which an association between two individual items can be found are known as single cardinality. Association rule mining is all about creating such rules, and as the number of items increases, the cardinality increases accordingly. So, to measure the associations between thousands of data items, several metrics are used. These metrics are given below:
o Support
o Confidence
o Lift
Support
Support tells how frequently an itemset appears in the data set. It is defined as the fraction of the transactions T that contain the itemset X:
Support(X) = (Number of transactions containing X) / (Total number of transactions in T)
Confidence
Confidence indicates how often the rule has been found to be true, that is, how often the items X and Y occur together in the data set given that X occurs. It is the ratio of the number of transactions that contain both X and Y to the number of transactions that contain X:
Confidence(X => Y) = Support(X ∪ Y) / Support(X)
Lift
Lift measures the strength of a rule. It is the ratio of the observed support to the support that would be expected if X and Y were independent of each other:
Lift(X => Y) = Support(X ∪ Y) / (Support(X) × Support(Y))
It has three possible values:
o Lift = 1: X and Y are independent; there is no association between them.
o Lift > 1: X and Y are positively correlated; the occurrence of one makes the other more likely.
o Lift < 1: X and Y are negatively correlated; the occurrence of one makes the other less likely.
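To make these three metrics concrete, here is a minimal Python sketch over a small, made-up basket of five transactions (the items and numbers are invented for illustration and are not taken from the notes):

transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "butter"},
    {"bread", "butter"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in `itemset`
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    # Support of X union Y divided by support of X
    return support(set(X) | set(Y), transactions) / support(X, transactions)

def lift(X, Y, transactions):
    # Observed support of X union Y divided by the support expected under independence
    return support(set(X) | set(Y), transactions) / (support(X, transactions) * support(Y, transactions))

print(support({"milk", "bread"}, transactions))       # 0.4
print(confidence({"milk"}, {"bread"}, transactions))  # 0.666...
print(lift({"milk"}, {"bread"}, transactions))        # 0.833... (< 1, weak negative association)

The same three functions apply unchanged to any list of transaction sets.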
Working of the Apriori Algorithm
#1) In the first iteration of the algorithm, each item is taken as a 1-itemset candidate. The algorithm counts the occurrences of each item.
#2) Let there be some minimum support, min_sup (e.g., 2). The set of 1-itemsets whose occurrence satisfies min_sup is determined. Only those candidates whose count is greater than or equal to min_sup are taken ahead to the next iteration; the others are pruned.
#3) Next, frequent 2-itemsets with min_sup are discovered. For this, in the join step, the 2-itemset candidates are generated by joining the set of frequent 1-itemsets with itself.
#4) The 2-itemset candidates are pruned using the min_sup threshold value. The table now contains only the 2-itemsets that satisfy min_sup.
#5) The next iteration forms 3-itemsets using the join and prune steps. This iteration relies on the antimonotone property: every 2-itemset subset of a candidate 3-itemset must itself be frequent. If all 2-itemset subsets are frequent, the candidate superset is kept; otherwise it is pruned.
#6) The next step forms 4-itemsets by joining the frequent 3-itemsets with themselves and pruning any candidate whose subsets do not meet the min_sup criterion. The algorithm stops when no further frequent itemsets can be generated.
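The join-and-prune loop of steps #1 to #6 can be sketched directly in Python. The toy transactions and min_sup value below are invented for demonstration; real work would normally use a library implementation:

from itertools import combinations

def apriori(transactions, min_sup=2):
    transactions = [set(t) for t in transactions]
    count = lambda itemset: sum(itemset <= t for t in transactions)
    # Steps 1-2: frequent 1-itemsets
    items = {item for t in transactions for item in t}
    frequent = {frozenset([i]) for i in items if count(frozenset([i])) >= min_sup}
    all_frequent, k = {}, 1
    while frequent:
        all_frequent.update({fs: count(fs) for fs in frequent})
        # Join step: build (k+1)-itemset candidates from the frequent k-itemsets
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        # Prune step (antimonotone property): every k-subset must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        # Support counting against min_sup
        frequent = {c for c in candidates if count(c) >= min_sup}
        k += 1
    return all_frequent

demo = [["A", "B", "C"], ["A", "B"], ["A", "C"], ["B", "C"], ["A", "B", "C"]]
for itemset, sup in sorted(apriori(demo).items(), key=lambda kv: (len(kv[0]), -kv[1])):
    print(sorted(itemset), sup)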
EXAMPLE
Find the frequent itemsets and generate association rules for a small transaction data set over the items Hot Dogs, Coke, and Chips. Assume that the minimum support count is 2 and the minimum confidence is 60%.
Let's start.
Only one 3-itemset, {Hot Dogs, Coke, Chips}, meets the minimum support count of 2, so it is the only frequent 3-itemset. The candidate association rules generated from it, together with their confidences, are:
[Hot Dogs^Coke]=>[Chips] //confidence = sup(Hot Dogs^Coke^Chips)/sup(Hot Dogs^Coke) =
2/2*100=100% //Selected
[Hot Dogs^Chips]=>[Coke] //confidence = sup(Hot Dogs^Coke^Chips)/sup(Hot Dogs^Chips) =
2/2*100=100% //Selected
[Coke^Chips]=>[Hot Dogs] //confidence = sup(Hot Dogs^Coke^Chips)/sup(Coke^Chips) =
2/3*100=66.67% //Selected
[Hot Dogs]=>[Coke^Chips] //confidence = sup(Hot Dogs^Coke^Chips)/sup(Hot Dogs) =
2/4*100=50% //Rejected
[Coke]=>[Hot Dogs^Chips] //confidence = sup(Hot Dogs^Coke^Chips)/sup(Coke) =
2/3*100=66.67% //Selected
[Chips]=>[Hot Dogs^Coke] //confidence = sup(Hot Dogs^Coke^Chips)/sup(Chips) =
2/4*100=50% //Rejected
There are four strong results (minimum confidence greater than 60%)
buys(X, “computer games”) ⇒ buys(X, “videos”) [support = 40%, confidence = 66%] (Rule 1)
Suppose that of the 10,000 transactions analyzed, 6,000 include computer games, 7,500 include videos, and 4,000 include both. Rule 1 is a strong association rule and would therefore be reported, since its support value of 4000 / 10,000 = 40% and confidence value of 4000 / 6000 = 66% satisfy the minimum support and minimum confidence thresholds, respectively. However, Rule 1 is misleading because the overall probability of purchasing videos is 7500 / 10,000 = 75%, which is even larger than 66%. In fact, computer games and videos are negatively associated because the purchase of one of these items actually decreases the likelihood of purchasing the other. Without fully understanding this phenomenon, we could easily make unwise business decisions based on Rule 1.
A correlation rule is measured not only by its support and confidence but also by the correlation
between itemsets A and B. There are many different correlation measures from which to choose. In this
subsection, we study several correlation measures to determine which would be good for mining large
data sets.
Lift is a simple correlation measure that is given as follows. The occurrence of itemset A is independent
of the occurrence of itemset B if P(A ∪B) = P(A)P(B); otherwise, itemsets A and B are dependent and
correlated as events. This definition can easily be extended to more than two itemsets. The lift between
the occurrence of A and B can be measured by computing
lift(A, B) = P(A ∪ B) / (P(A) P(B))    (Eq. 1)
If the resulting value of Eq. 1 is less than 1, then the occurrence of A is negatively correlated with the
occurrence of B, meaning that the occurrence of one likely leads to the absence of the other one. If the
resulting value is greater than 1, then A and B are positively correlated, meaning that the occurrence of
one implies the occurrence of the other. If the resulting value is equal to 1, then A and B are independent
and there is no correlation between them
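As a quick check, plugging the figures quoted above into Eq. 1 confirms the negative correlation (the 7,500 video count follows from the stated 75%):

# 10,000 transactions: 6,000 with computer games, 7,500 with videos, 4,000 with both
p_both = 4000 / 10000      # P(games AND videos) = 0.40
p_games = 6000 / 10000     # 0.60
p_videos = 7500 / 10000    # 0.75

print(round(p_both / (p_games * p_videos), 2))   # 0.89 < 1: negatively correlated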
A number of variations to this approach are described next, where each variation involves “playing” with
the support threshold in a slightly different way. The variations are illustrated in Figures where nodes
indicate an item or itemset that has been examined, and nodes with thick borders indicate that an
examined item or itemset is frequent.
Using uniform minimum support for all levels (referred to as uniform support): The same minimum
support threshold is used when mining at each abstraction level. For example, in Figure, a minimum
support threshold of 5% is used throughout (e.g., for mining from “computer” downward to “laptop
computer”). Both “computer” and “laptop computer” are found to be frequent, whereas “desktop
computer” is not. When a uniform minimum support threshold is used, the search procedure is simplified.
The method is also simple in that users are required to specify only one minimum support threshold. An
Apriori-like optimization technique can be adopted, based on the knowledge that an ancestor is a superset
of its descendants: The search avoids examining itemsets containing any item of which the ancestors do
not have minimum support. The uniform support approach, however, has some drawbacks. It is unlikely
that items at lower abstraction levels will occur as frequently as those at higher abstraction levels. If the
minimum support threshold is set too high, it could miss some meaningful associations occurring at low
abstraction levels. If the threshold is set too low, it may generate many uninteresting associations
occurring at high abstraction levels. This provides the motivation for the next approach.
Using reduced minimum support at lower levels (referred to as reduced support): Each abstraction
level has its own minimum support threshold. The deeper the abstraction level, the smaller the
corresponding threshold. For example, in Figure, the minimum support thresholds for levels 1 and 2 are
5% and 3%, respectively. In this way, “computer,” “laptop computer,” and “desktop computer” are all
considered frequent.
Using item or group-based minimum support (referred to as group-based support): Because users
or experts often have insight as to which groups are more important than others, it is sometimes more
desirable to set up user-specific, item, or group-based minimal support thresholds when mining multilevel
rules. For example, a user could set up the minimum support thresholds based on product price or on
items of interest, such as by setting particularly low support thresholds for “camera with price over
$1000” or “Tablet PC,” to pay particular attention to the association patterns containing items in these
categories. For mining patterns with mixed items from groups with different support thresholds, usually
the lowest support threshold among all the participating groups is taken as the support threshold in
mining. This will avoid filtering out valuable patterns containing items from the group with the lowest
support threshold. In the meantime, the minimal support threshold for each individual group should be
kept to avoid generating uninteresting itemsets from each group. Other interestingness measures can be
used after the itemset mining to extract truly interesting rules.
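As a small illustration of the rule "take the lowest threshold among the participating groups", here is a sketch with invented item groups and thresholds:

# Invented item-to-group mapping and per-group minimum supports
group_of_item = {"camera_over_1000": "high_end", "tablet_pc": "high_end",
                 "computer": "general", "printer": "general"}
min_sup_of_group = {"high_end": 0.001, "general": 0.05}

def mining_threshold(pattern):
    # For a pattern mixing groups, apply the lowest of the participating thresholds
    return min(min_sup_of_group[group_of_item[item]] for item in pattern)

print(mining_threshold({"camera_over_1000", "computer"}))   # 0.001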
Instead of considering transactional data only, sales and related information are often linked with
relational data or integrated into a data warehouse. Such data stores are multidimensional in nature. For
instance, in addition to keeping track of the items purchased in sales transactions, a relational database
may record other attributes associated with the items and/or transactions such as the item description or
the branch location of the sale. Additional relational information regarding the customers who purchased
the items (e.g., customer age, occupation, credit rating, income, and address) may also be stored.
Considering each database attribute or warehouse dimension as a predicate, we can therefore mine association rules containing multiple predicates, such as age, occupation, and buys (call such a rule Rule 2).
Association rules that involve two or more dimensions or predicates can be referred to as
multidimensional association rules. Rule 2 contains three predicates (age, occupation, and buys), each of
which occurs only once in the rule. Hence, we say that it has no repeated predicates. Multidimensional
association rules with no repeated predicates are called interdimensional association rules. We can also
mine multidimensional association rules with repeated predicates, which contain multiple occurrences of
some predicates. These rules are called hybrid-dimensional association rules. An example is a rule in which the predicate buys is repeated, appearing in both the antecedent and the consequent.
Database attributes can be nominal or quantitative. The values of nominal (or categorical) attributes are
“names of things.” Nominal attributes have a finite number of possible values, with no ordering among
the values (e.g., occupation, brand, color). Quantitative attributes are numeric and have an implicit
ordering among values (e.g., age, income, price). Techniques for mining multidimensional association
rules can be categorized into two basic approaches regarding the treatment of quantitative attributes. In
the first approach, quantitative attributes are discretized using predefined concept hierarchies. This
discretization occurs before mining. For instance, a concept hierarchy for income may be used to replace
the original numeric values of this attribute by interval labels such as “0..20K,” “21K..30K,” “31K..40K,”
and so on. Here, discretization is static and predetermined.
In the second approach, quantitative attributes are discretized or clustered into “bins” based on the data
distribution. These bins may be further combined during the mining process. The discretization process is
dynamic and established so as to satisfy some mining criteria such as maximizing the confidence of the
rules mined. Because this strategy treats the numeric attribute values as quantities rather than as
predefined ranges or categories, association rules mined from this approach are also referred to as
(dynamic) quantitative association rules.
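A minimal sketch of the first (static, predefined) approach, mapping raw income values to the interval labels mentioned above (the top bucket is an assumed catch-all, since the notes stop at "31K..40K"):

def income_label(income):
    # Replace a numeric income with a predefined interval label before mining
    if income <= 20_000:
        return "0..20K"
    elif income <= 30_000:
        return "21K..30K"
    elif income <= 40_000:
        return "31K..40K"
    return "41K+"   # assumed catch-all bucket

print([income_label(v) for v in (18_500, 24_000, 39_999, 55_000)])
# ['0..20K', '21K..30K', '31K..40K', '41K+']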
Mining that is guided by user-specified constraints is known as constraint-based mining. The constraints can include the following:
Knowledge type constraints: These specify the type of knowledge to be mined, such as association,
correlation, classification, or clustering.
Dimension/level constraints: These specify the desired dimensions (or attributes) of the data, the
abstraction levels, or the level of the concept hierarchies to be used in mining.
Interestingness constraints: These specify thresholds on statistical measures of rule interestingness such
as support, confidence, and correlation.
Rule constraints: These specify the form of, or conditions on, the rules to be mined. Such constraints
may be expressed as metarules (rule templates), as the maximum or minimum number of predicates that
can occur in the rule antecedent or consequent, or as relationships among attributes, attribute values,
and/or aggregates.
The challenge of mining colossal patterns. Consider a 40 × 40 square table where each row contains the
integers 1 through 40 in increasing order. Remove the integers on the diagonal, and this gives a 40 × 39
table. Add 20 identical rows to the bottom of the table, where each row contains the integers 41 through
79 in increasing order, resulting in a 60 × 39 table.
We consider each row as a transaction and set the minimum support threshold at 20. The table has an exponential number (about 40 choose 20, on the order of 10^11) of midsize closed/maximal frequent patterns of size 20, but only one that is colossal: α = (41, 42, ..., 79) of size 39. None of the frequent pattern mining algorithms that we have introduced so far can complete execution in a reasonable amount of time.
Figure: A simple colossal patterns example. The data set contains an exponential number of midsize patterns of size 20 but only one that is colossal, namely (41, 42, ..., 79).
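The transaction table described above is easy to generate, which makes the scale of the problem concrete; a quick construction sketch:

# 40 rows of 1..40 with the diagonal element removed, plus 20 identical rows of 41..79
rows = [[v for v in range(1, 41) if v != i] for i in range(1, 41)]   # 40 rows x 39 items
rows += [list(range(41, 80))] * 20                                   # 20 rows x 39 items

print(len(rows), len(rows[0]))   # 60 39
# With min_sup = 20, the only colossal frequent pattern is (41, 42, ..., 79),
# supported by exactly the last 20 rows.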
A new mining strategy called Pattern-Fusion was developed, which fuses a small number of shorter
frequent patterns into colossal pattern candidates. It thereby takes leaps in the pattern search space and
avoids the pitfalls of both breadth-first and depth first searches. This method finds a good approximation
to the complete set of colossal frequent patterns. The Pattern-Fusion method has the following major
characteristics. First, it traverses the tree in a bounded-breadth way. Only a fixed number of patterns in a
bounded-size candidate pool are used as starting nodes to search downward in the pattern tree. As such, it
avoids the problem of exponential search space. Second, Pattern-Fusion has the capability to identify
“shortcuts” whenever possible. Each pattern’s growth is not performed with one-item addition, but with
an agglomeration of multiple patterns in the pool. These shortcuts direct Pattern-Fusion much more
rapidly down the search tree toward the colossal patterns.
The clustering-based approach detects three clusters and takes the most representative pattern to be the "centermost" pattern of each cluster. These patterns are chosen to represent the data. The selected patterns are considered
“summarized patterns” in the sense that they represent or “provide a summary” of the clusters they stand
for. By contrast, in Figure (d) the redundancy-aware top-k patterns make a trade-off between significance
and redundancy. The three patterns chosen here have high significance and low redundancy. Observe, for
example, the two highly significant patterns that, based on their redundancy, are displayed next to each
other. The redundancy-aware top-k strategy selects only one of them, taking into consideration that two
would be redundant. To formalize the definition of redundancy-aware top-k patterns, we’ll need to define
the concepts of significance and redundancy.
1. In the first step, i.e., learning, a classification model is built from the training data.
2. In the second step, i.e., classification, the accuracy of the model is checked and the model is then used to classify new data. The class labels here are discrete values such as "yes" or "no", "safe" or "risky".
Regression Analysis
Regression analysis is used for the prediction of numeric attributes.
Numeric attributes are also called continuous values. A model built to predict continuous values instead of class labels is called a regression model. In a regression tree, the output at a leaf node is the mean of all observed values of the training tuples that fall in that node.
Training and Testing:
Suppose a person is sitting under a fan and the fan starts falling on him; he should move aside so as not to get hurt. This is the training part: learning to move away. During testing, if the person sees any heavy object coming towards him or falling on him and moves aside, then the system has tested positively; if the person does not move aside, then the system has tested negatively.
The same is the case with data: it must be trained in order to get accurate and reliable results.
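A minimal sketch of this train-then-test workflow using scikit-learn on a synthetic data set (all names and numbers here are illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000)   # step 1: learning from the training data
model.fit(X_train, y_train)

y_pred = model.predict(X_test)              # step 2: classifying unseen (test) data
print("test accuracy:", accuracy_score(y_test, y_pred))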
There are certain data types associated with data mining that tell us the format of the data (whether it is in text form or in numerical form).
Attributes – Represents different features of an object. Different types of attributes are:
1. Binary: Possesses only two values i.e. True or False
Example: Suppose there is a survey evaluating some products. We need to check whether it’s useful
or not. So, the Customer has to answer it in Yes or No.
Product usefulness: Yes / No
Symmetric: Both values are equally important in all aspects
Asymmetric: When both the values may not be important.
2. Nominal: When more than two outcomes are possible. The values are names or symbols rather than integers, with no ordering among them.
Example: One needs to choose some material but of different colors. So, the color might be
Yellow, Green, Black, Red.
Different Colors: Red, Green, Black, Yellow
Ordinal: Values that must have some meaningful order.
Example: Suppose there are grade sheets of few students which might contain different grades as per
their performance such as A, B, C, D
Grades: A, B, C, D
Continuous: May take an infinite number of values; typically represented as floating-point numbers.
Example: Measuring the weight of few Students in a sequence or orderly manner i.e. 50, 51, 52, 53
Weight: 50, 51, 52, 53
Discrete: Finite number of values.
Example: Marks of a Student in a few subjects: 65, 70, 75, 80, 90
Marks: 65, 70, 75, 80, 90
Syntax:
Mathematical notation: Classification is based on building a function f that takes an input feature vector X and predicts its outcome Y, a qualitative response taking values in a set of classes C; that is, Y = f(X).
Here a classifier (or model) is used, which is a supervised function; it can also be designed manually based on expert knowledge. It is constructed to predict class labels (for example, the label "Yes" or "No" for the approval of some event).
Classifiers can be categorized into two major types:
1. Discriminative: A discriminative classifier determines a single class for each row of data. It models the decision boundary directly from the observed data and therefore depends heavily on the quality of the data rather than on the class distributions.
Example: Logistic Regression
2. Generative: A generative classifier models the distribution of the individual classes and tries to learn the model that generates the data behind the scenes by estimating the assumptions and distributions of the model. It can then be used to predict unseen data.
Example: Naive Bayes Classifier
To illustrate, consider detecting spam emails by looking at previous data. Suppose there are 100 emails, divided into Class A: 25% (spam emails) and Class B: 75% (non-spam emails). A user wants to check whether an email containing the word "cheap" should be classified as spam.
In Class A (the 25 spam emails), 20 out of 25 contain the word "cheap".
In Class B (the 75 non-spam emails), 5 out of 75 contain the word "cheap"; the remaining 70 do not.
So, if an email contains the word "cheap", what is the probability of it being spam? (= 80%)
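Working through those counts with Bayes' theorem (as interpreted here: 20 of the 25 spam emails and 5 of the 75 non-spam emails contain the word "cheap") reproduces the 80% figure:

spam_with_cheap, nonspam_with_cheap = 20, 5

# Directly: fraction of "cheap" emails that are spam
print(spam_with_cheap / (spam_with_cheap + nonspam_with_cheap))   # 0.8

# Equivalent Bayes form: P(spam|cheap) = P(cheap|spam) * P(spam) / P(cheap)
p_cheap_given_spam = 20 / 25
p_spam = 25 / 100
p_cheap = (20 + 5) / 100
print(p_cheap_given_spam * p_spam / p_cheap)                      # 0.8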
Decision tree induction is the method of learning the decision trees from the training set. The training set
consists of attributes and class labels. Applications of decision tree induction include astronomy, financial
analysis, medical diagnosis, manufacturing, and production.
A decision tree is a flowchart tree-like structure that is made from training set tuples. The dataset is
broken down into smaller subsets and is present in the form of nodes of a tree. The tree structure has a
root node, internal nodes or decision nodes, leaf node, and branches.
The root node is the topmost node. It represents the best attribute selected for classification. Internal nodes (decision nodes) represent a test on an attribute of the dataset, while a leaf node (terminal node) represents a classification or decision label. The branches show the outcomes of the tests performed. Some decision trees have only binary nodes, meaning exactly two branches per node, while other decision trees are non-binary.
CART
The CART model, i.e., Classification and Regression Trees, is a decision tree algorithm for building models. A decision tree model in which the target values have a discrete nature is called a classification model; a discrete attribute takes values from a finite or countably infinite set (for example, age or size). A model in which the target values are continuous numbers is called a regression model; continuous variables are floating-point variables. These two model types together give CART its name. CART uses the Gini index as its splitting (attribute selection) measure.
ID3 was later extended into C4.5, its successor. ID3 and C4.5 follow a greedy top-down approach for constructing decision trees. The algorithm starts with a training dataset with class labels, which is partitioned into smaller and smaller subsets as the tree is being constructed.
#1) Initially, there are three parameters i.e. attribute list, attribute selection method and data
partition. The attribute list describes the attributes of the training set tuples.
#2) The attribute selection method describes the method for selecting the best attribute for discrimination
among tuples. The methods used for attribute selection can either be Information Gain or Gini Index.
#3) The structure of the tree (binary or non-binary) is decided by the attribute selection method.
#4) When constructing a decision tree, it starts as a single node representing the tuples.
#5) If the root node tuples represent different class labels, then it calls an attribute selection method to
split or partition the tuples. The step will lead to the formation of branches and decision nodes.
#6) The splitting method determines which attribute should be selected to partition the data tuples. It also determines the branches to be grown from the node according to the test outcomes. The main motive of the splitting criterion is that the partition at each branch of the decision tree should be as pure as possible, i.e., ideally contain tuples of a single class.
An example of splitting attribute is shown below:
#7) The above partitioning steps are followed recursively to form a decision tree for the training dataset
tuples.
#8) The partitioning stops only when either all the partitions have been made (each partition is pure) or the remaining tuples cannot be partitioned further.
#9) The complexity of the algorithm is O(n × |D| × log |D|), where n is the number of attributes in training dataset D and |D| is the number of tuples.
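A short sketch of decision tree induction with scikit-learn on an invented buys-computer style data set, using information gain (criterion="entropy") as the attribute selection method:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "age":     ["youth", "youth", "middle", "senior", "senior", "senior", "middle", "youth"],
    "student": ["no",    "yes",   "no",     "no",     "yes",    "yes",    "yes",    "no"],
    "buys":    ["no",    "yes",   "yes",    "no",     "yes",    "yes",    "yes",    "no"],
})
X = pd.get_dummies(data[["age", "student"]])    # one-hot encode the nominal attributes
y = data["buys"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))   # prints the learned tests and leaves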
Entropy:
Entropy is a common way to measure impurity. In a decision tree, it measures the randomness or impurity in the data set:
Entropy(D) = - Σ p_i log2(p_i)
where p_i is the probability that a tuple in D belongs to class C_i. The information is encoded in bits; therefore, log to the base 2 is used. Entropy(D) represents the average amount of information required to identify the class label of a tuple in dataset D.
The information still required for an exact classification after partitioning on an attribute X is given by the formula:
Info_X(D) = Σ_j (|D_j| / |D|) × Entropy(D_j)
where |D_j| / |D| acts as the weight of the j-th partition. This value represents the information needed to classify dataset D after partitioning it by X.
Information gain is the difference between the original information requirement and the expected information requirement after partitioning the tuples of dataset D on attribute X:
Gain(X) = Entropy(D) - Info_X(D)
Gain is the reduction in the information required that results from knowing the value of X. The attribute with the highest information gain is chosen as the "best" splitting attribute.
The Gini index measures the impurity of dataset D as:
Gini(D) = 1 - Σ p_i^2
where p_i is the probability that a tuple in D belongs to class C_i. The Gini index for a binary split of dataset D by attribute A into partitions D1 and D2 is given by:
Gini_A(D) = (|D1| / |D|) × Gini(D1) + (|D2| / |D|) × Gini(D2)
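The three measures above translate directly into code; the class counts and the two-way split below are assumed values used only to exercise the functions:

from collections import Counter
from math import log2

def entropy(labels):
    # Entropy(D) = -sum(p_i * log2(p_i)) over the class proportions
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    # Gini(D) = 1 - sum(p_i^2) over the class proportions
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def info_after_split(partitions):
    # Info_X(D): weighted average entropy of the partitions produced by attribute X
    total = sum(len(p) for p in partitions)
    return sum(len(p) / total * entropy(p) for p in partitions)

D = ["yes"] * 9 + ["no"] * 5                                   # assumed class counts
split = [["yes"] * 6 + ["no"] * 2, ["yes"] * 3 + ["no"] * 3]   # an assumed 2-way split
print(round(entropy(D), 3))                              # 0.940
print(round(gini(D), 3))                                 # 0.459
print(round(entropy(D) - info_after_split(split), 3))    # information gain of the split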
o Naïve: It is called naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple; each feature individually contributes to identifying it as an apple, without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the
probability of a hypothesis with prior knowledge. It depends on the conditional probability.
o The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) × P(A) / P(B)
Where,
P(A|B) is the posterior probability: the probability of hypothesis A given the observed evidence B.
P(B|A) is the likelihood probability: the probability of the evidence given that the hypothesis is true.
P(A) is the prior probability: the probability of the hypothesis before observing the evidence.
P(B) is the marginal probability: the probability of the evidence.
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether we should play or not on a particular day according to the weather conditions. To solve this problem, we need to follow the steps below:
1. Convert the given dataset into frequency tables.
2. Generate a likelihood table by finding the probabilities of the given features.
3. Use Bayes' theorem to calculate the posterior probability.
Problem: If the weather is sunny, should the player play or not?
Day  Weather   Play
0    Rainy     Yes
1    Sunny     Yes
2    Overcast  Yes
3    Overcast  Yes
4    Sunny     No
5    Rainy     Yes
6    Sunny     Yes
7    Overcast  Yes
8    Rainy     No
9    Sunny     No
10   Sunny     Yes
11   Rainy     No
12   Overcast  Yes
13   Overcast  Yes
Frequency table of weather conditions:
Weather   Yes  No
Overcast  5    0
Rainy     2    2
Sunny     3    2
Total     10   4
Likelihood table of weather conditions:
Weather   No           Yes
Overcast  0            5            5/14 = 0.36
Rainy     2            2            4/14 = 0.29
Sunny     2            3            5/14 = 0.35
All       4/14 = 0.29  10/14 = 0.71
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.30
P(Sunny) = 0.35
P(Yes) = 0.71
So P(Yes|Sunny) = 0.30 * 0.71 / 0.35 = 0.60
P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.50
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.50 * 0.29 / 0.35 = 0.41
Since P(Yes|Sunny) > P(No|Sunny), on a sunny day the player can play the game.
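The same calculation can be reproduced from the 14-row table with a few lines of Python (exact fractions give 0.60 and 0.40; the rounded hand calculation above gives approximately 0.41 for the second value):

from collections import Counter

data = [("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"), ("Overcast", "Yes"),
        ("Sunny", "No"), ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
        ("Rainy", "No"), ("Sunny", "No"), ("Sunny", "Yes"), ("Rainy", "No"),
        ("Overcast", "Yes"), ("Overcast", "Yes")]

play_counts = Counter(play for _, play in data)      # {'Yes': 10, 'No': 4}
joint_counts = Counter(data)                         # (weather, play) pair counts

def posterior(play, weather="Sunny"):
    p_weather_given_play = joint_counts[(weather, play)] / play_counts[play]
    p_play = play_counts[play] / len(data)
    p_weather = sum(1 for w, _ in data if w == weather) / len(data)
    return p_weather_given_play * p_play / p_weather

print(round(posterior("Yes"), 2), round(posterior("No"), 2))   # 0.6 0.4 -> predict "Yes"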
Points to remember −
The IF part of the rule is called rule antecedent or precondition.
The THEN part of the rule is called rule consequent.
The antecedent part (the condition) consists of one or more attribute tests, and these tests are logically ANDed.
The consequent part consists of class prediction.
Rule Extraction
Decision tree classifiers are a popular method of classification: it is easy to understand how decision trees work, and they are known for their accuracy. Decision trees can, however, become large and difficult to interpret. In comparison with a decision tree, IF-THEN rules may be easier for humans to understand, particularly if the decision tree is very large. To extract rules from a decision tree, one rule is created for each path from the root to a leaf node. Each splitting criterion along a given path is logically ANDed to form the rule antecedent ("IF" part). The leaf node holds the class prediction, forming the rule consequent ("THEN" part). Extracting classification rules from a decision tree: the decision tree in the figure below can be converted to the classification IF-THEN rules by tracing the path from the root node to each leaf node in the tree.
R1: IF age = youth AND student = no THEN buys computer = no
R2: IF age = youth AND student = yes THEN buys computer = yes
R3: IF age = middle aged THEN buys computer = yes
R4: IF age = senior AND credit rating = excellent THEN buys computer = yes
R5: IF age = senior AND credit rating = fair THEN buys computer = no
Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from a decision tree.
Points to remember −
To extract a rule from a decision tree −
One rule is created for each path from the root to the leaf node.
To form a rule antecedent, each splitting criterion is logically ANDed.
The leaf node holds the class prediction, forming the rule consequent.
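Rules R1 to R5 above can be encoded directly as a small rule-based classifier; a sketch:

def buys_computer(age, student=None, credit_rating=None):
    if age == "youth" and student == "no":                    # R1
        return "no"
    if age == "youth" and student == "yes":                   # R2
        return "yes"
    if age == "middle_aged":                                  # R3
        return "yes"
    if age == "senior" and credit_rating == "excellent":      # R4
        return "yes"
    if age == "senior" and credit_rating == "fair":           # R5
        return "no"
    return None   # no rule fires for this tuple

print(buys_computer("youth", student="yes"))           # yes
print(buys_computer("senior", credit_rating="fair"))   # no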
Model Evaluation
Model evaluation is the process of using different evaluation metrics to understand a machine learning
model’s performance, as well as its strengths and weaknesses. Model evaluation is important to assess the
efficacy of a model during initial research phases, and it also plays a role in model monitoring.
Classification
The most popular metrics for measuring classification performance include accuracy, precision, recall, and the confusion matrix.
Classification Metrics
Classification is about predicting the class labels given input data. In binary classification, there are only two possible output classes (i.e., a dichotomy). In multiclass classification, more than two possible classes can be present. Here we focus only on binary classification.
A very common example of binary classification is spam detection, where the input data could include the
email text and metadata (sender, sending time), and the output label is either “spam” or “not spam.” (See
Figure) Sometimes, people use some other names also for the two classes: “positive” and “negative,” or
“class 1” and “class 0.”
There are many ways of measuring classification performance. Accuracy, confusion matrix, log-loss, and AUC-ROC are some of the most popular metrics. Precision and recall are also widely used metrics for classification problems.
Accuracy
Accuracy simply measures how often the classifier correctly predicts. We can define accuracy as the ratio
of the number of correct predictions and the total number of predictions.
When a model gives an accuracy rate of 99%, you might think that the model is performing very well, but this is not always true and can be misleading in some situations. Let me explain this with the help of an example.
Consider a binary classification problem, where a model can achieve only two results, either model gives
a correct or incorrect prediction. Now imagine we have a classification task to predict if an image is a
dog or cat as shown in the image. In a supervised learning algorithm, we first fit/train a model on training
data, then test the model on testing data. Once we have the model’s predictions from the X_test data, we
compare them to the true y_values (the correct labels).
We feed the image of the dog into the trained model. Suppose the model predicts that this is a dog; we compare the prediction to the correct label, and it is correct. If instead the model predicts that this image is a cat, we again compare it to the correct label, and this time it is incorrect.
We repeat this process for all images in X_test data. Eventually, we’ll have a count of correct and
incorrect matches. But in reality, it is very rare that all incorrect or correct matches hold equal value.
Therefore one metric won’t tell the entire story.
Accuracy is useful when the target classes are well balanced, but it is not a good choice for unbalanced classes. Imagine a scenario where we had 99 images of dogs and only 1 image of a cat in our data. A model that always predicts "dog" would achieve 99% accuracy. In reality, data is often imbalanced, for example in spam email detection, credit card fraud, and medical diagnosis. Hence, if we want a better evaluation and a full picture of model performance, other metrics such as recall and precision should also be considered.
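A quick demonstration of the imbalance problem described above, using the 99-dogs-and-1-cat scenario:

from sklearn.metrics import accuracy_score

y_true = ["dog"] * 99 + ["cat"]
y_pred = ["dog"] * 100            # a useless classifier that always predicts "dog"

print(accuracy_score(y_true, y_pred))   # 0.99 despite never detecting the cat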
Confusion Matrix
Evaluation of the performance of a classification model is based on the counts of test records correctly
and incorrectly predicted by the model. The confusion matrix provides a more insightful picture: not only the overall performance of a predictive model, but also which classes are being predicted correctly or incorrectly and what types of errors are being made. To illustrate, the four basic counts (TP, FP, FN, TN), obtained by comparing each predicted value with the actual value, are laid out in the confusion matrix table below.
Precision is the ratio of true positives to all the positives predicted by the model: Precision = TP / (TP + FP).
Low precision: the more false positives the model predicts, the lower the precision.
Recall (sensitivity) is the ratio of true positives to all the actual positives in your dataset: Recall = TP / (TP + FN).
Low recall: the more false negatives the model predicts, the lower the recall.
The idea of recall and precision may seem abstract, so let me illustrate the difference with a real case: screening residents for COVID-19.
A TP (true positive) means a resident with COVID-19 is diagnosed with COVID-19.
A TN (true negative) means a healthy resident is identified as healthy.
A FP (false positive) means an actually healthy resident is predicted to have COVID-19.
A FN (false negative) means a resident who actually has COVID-19 is predicted to be healthy.
In this case, which type of error do you think carries the highest cost?
If we predict COVID-19 residents as healthy and they do not quarantine, there could be a massive number of new COVID-19 infections. Here the cost of false negatives is much higher than the cost of false positives.
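A short sketch computing the confusion matrix counts, precision, and recall on a small made-up screening example (1 = infected, 0 = healthy):

from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]   # two false negatives, one false positive

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP", tp, "FP", fp, "FN", fn, "TN", tn)            # TP 2 FP 1 FN 2 TN 5
print("precision:", precision_score(y_true, y_pred))     # 2 / (2 + 1) = 0.67
print("recall:   ", recall_score(y_true, y_pred))        # 2 / (2 + 2) = 0.50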
MODEL SELECTION
To train a model, we collect enormous quantities of data to help the machine learn better.
Usually, a good portion of the data collected is noise, while some of the columns of our dataset
might not contribute significantly to the performance of our model. Further, having a lot of data
can slow down the training process and cause the model to be slower. The model may also
learn from this irrelevant data and be inaccurate.
Feature selection is what separates good data scientists from the rest. Given the same model
and computational facilities, why do some people win in competitions with faster and more
accurate models? The answer is Feature Selection. Apart from choosing the right model for our
data, we need to choose the right data to put in our model.
Consider a table which contains information on old cars. The model decides which cars must be
crushed for spare parts.
In the above table, we can see that the model of the car, the year of manufacture, and the miles
it has traveled are pretty important to find out if the car is old enough to be crushed or not.
However, the name of the previous owner of the car does not decide if the car should be
crushed or not. Further, it can confuse the algorithm into finding patterns between names and
the other features. Hence we can drop the column.
What is Feature Selection?
Feature Selection is the method of reducing the input variable to your model by using only
relevant data and getting rid of noise in data.
It is the process of automatically choosing relevant features for your machine learning model
based on the type of problem you are trying to solve. We do this by including or excluding
important features without changing them. It helps in cutting down the noise in our data and
reducing the size of our input data.
1. Supervised Models: Supervised feature selection refers to the method which uses the output
label class for feature selection. They use the target variables to identify the variables which can
increase the efficiency of the model
2. Unsupervised Models: Unsupervised feature selection refers to the method which does not need
the output label class for feature selection. We use them for unlabelled data.
The input variables that we give to our machine learning models are called features. Each
column in our dataset constitutes a feature. To train an optimal model, we need to make sure
that we use only the essential features. If we have too many features, the model can capture
the unimportant patterns and learn from noise. The method of choosing the important
parameters of our data is called Feature Selection.
1. Filter Method: In this method, features are dropped based on their relation to the output, or
how they are correlating to the output. We use correlation to check if the features are positively
or negatively correlated to the output labels and drop features accordingly. Eg: Information
Gain, Chi-Square Test, Fisher’s Score, etc.
Filter Method flowchart
2. Wrapper Method: We split our data into subsets and train a model using this. Based on the
output of the model, we add and subtract features and train the model again. It forms the
subsets using a greedy approach and evaluates the accuracy of all the possible combinations of
features. Eg: Forward Selection, Backwards Elimination, etc.
3. Intrinsic Method: This method combines the qualities of both the Filter and Wrapper method
to create the best subset.
This method performs feature selection as part of the model training process itself, while keeping the computational cost to a minimum. Eg: Lasso and Ridge Regression.
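A minimal sketch of the filter method using scikit-learn's SelectKBest with a chi-square score on the Iris data set (the choice of data set and of k = 2 is only for illustration):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=chi2, k=2)    # keep the 2 highest-scoring features
X_reduced = selector.fit_transform(X, y)

print(X.shape, "->", X_reduced.shape)           # (150, 4) -> (150, 2)
print("chi-square scores:", selector.scores_)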
1) Bagging
Bagging gets its name because it combines Bootstrapping and Aggregation to form one Ensemble model.
Given a sample of data, different bootstrapped subsamples are extracted. A Decision Tree is created on
each of the bootstrapped subsamples. After each subsample’s Decision Tree has been formed, an
algorithm is used to aggregate the decision trees to develop the most efficient predictor. The image below
explains this:
2) Random Forest Models
Random Forest models are very similar to Bagging, though they work in a slightly different way. When deciding how to split, the decision trees in the Bagging method have the full set of features to select from. So, although the bootstrapped samples may look different, the trees tend to split on the same strong features throughout each model.
On the other hand, Random Forest models choose the split at each node from a randomly selected subset of features. Random Forest models therefore induce a level of variation, because each tree splits on different, randomly chosen features. This produces a more diverse set of trees, which can be aggregated into a more accurate prediction.
3) Boosting
The Boosting Method comprises the use of algorithms called Strong and Weak Learners. AdaBoost
(which stands for Adaptive Boosting) is the most used of all Boosting Algorithms, where the main model
is built on several weak learners. Weak learners are so-called because they are characteristically simple
with restricted prediction abilities and, as a result, are just slightly better at accuracy than random guesses.
However, unlike Bagging, Boosting is a sequential method and cannot be used for parallel operations.
The adaptation capability of AdaBoost was a significant factor in this technique becoming one of the earliest successful binary classifiers. Sequential decision trees were the core of this adaptability, with each tree adjusting its weights based on prior knowledge of accuracies.
Consider a credit-approval rule under which a customer who has had a job for at least two years will receive credit if her income is, say, $50,000, but not if it is $49,000. Such harsh thresholding may seem unfair. Instead, we can discretize income into categories (e.g., {low income, medium income, high income}) and then apply fuzzy logic to allow "fuzzy" thresholds or boundaries to be defined for each category (see the figure). Rather than having a
precise cutoff between categories, fuzzy logic uses truth values between 0.0 and 1.0 to represent the
degree of membership that a certain value has in a given category. Each category then represents a fuzzy
set. Hence, with fuzzy logic, we can capture the notion that an income of $49,000 is, more or less, high,
although not as high as an income of $50,000. Fuzzy logic systems typically provide graphical tools to
assist users in converting attribute values to fuzzy truth values. Fuzzy set theory is also known as
possibility theory. It was proposed by Lotfi Zadeh in 1965 as an alternative to traditional two-value logic
and probability theory. It lets us work at a high abstraction level and offers a means for dealing with
imprecise data measurement. Most important, fuzzy set theory allows us to deal with vague or inexact
facts. For example, being a member of a set of high incomes is inexact (e.g., if $50,000 is high, then what
about $49,000? or $48,000?) Unlike the notion of traditional “crisp” sets where an element belongs to
either a set S or its complement, in fuzzy set theory, elements can belong to more than one fuzzy set. For
example, the income value $49,000 belongs to both the medium and high fuzzy sets, but to differing
degrees. Using fuzzy set notation and following Figure 9.15, this can be shown as
m_medium income($49,000) = 0.15 and m_high income($49,000) = 0.96, where m denotes the membership function operating on the fuzzy sets of medium income and high income, respectively. In fuzzy set
theory, membership values for a given element, x (e.g., for $49,000), do not have to sum to 1. This is
unlike traditional probability theory, which is constrained by a summation axiom.
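The membership idea can be sketched with simple trapezoidal membership functions. The true shapes of the medium-income and high-income fuzzy sets are not given here, so the breakpoints below are assumptions chosen only so that the example roughly reproduces the quoted degrees of 0.15 and 0.96:

def trapezoid(x, a, b, c, d):
    # Membership rises from a to b, is 1.0 between b and c, and falls from c to d
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

def m_medium_income(x):
    return trapezoid(x, 10_000, 20_000, 40_000, 50_600)    # assumed breakpoints

def m_high_income(x):
    return trapezoid(x, 37_000, 49_500, 200_000, 250_000)  # assumed breakpoints

# An income of $49,000 belongs to both fuzzy sets, to differing degrees
print(round(m_medium_income(49_000), 2), round(m_high_income(49_000), 2))   # 0.15 0.96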
Statistical Problem –
The Statistical Problem arises when the hypothesis space is too large for the
amount of available data. Hence, there are many hypotheses with the same
accuracy on the data and the learning algorithm chooses only one of them! There is
a risk that the accuracy of the chosen hypothesis is low on unseen data!
Computational Problem –
The Computational Problem arises when the learning algorithm cannot guarantee finding the best hypothesis.
Representational Problem –
The Representational Problem arises when the hypothesis space does not contain
any good approximation of the target class(es).
If ensembles are used for classification, high accuracies can be accomplished when the different base models misclassify different training examples, even if the accuracy of each base classifier is low.
Bagging
Bagging (aka bootstrap aggregating) is a simple but powerful ensemble algorithm that facilitates the
increased stability & accuracy of classification models. The Bagging process works by generating multiple
training datasets via random sampling with replacement, applying the algorithm to each dataset, and then
taking the majority vote amongst the models to determine data classifications. Bagging is a particularly
popular method because it reduces variance, helps to prevent overfitting (i.e., forced applicability of random
irrelevant data), and it can be easily parallelized for application to large datasets.
Implementation steps of Bagging –
1. Multiple subsets of equal size are created from the original data set by selecting observations with replacement.
2. A base model is created on each of these subsets.
3. Each model is learned in parallel from each training set and independent of each
other.
4. The final predictions are determined by combining the predictions from all the
models.
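A minimal sketch of these steps using scikit-learn's BaggingClassifier (whose default base model is a decision tree) on a synthetic data set:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

bagger = BaggingClassifier(
    n_estimators=25,     # number of bootstrapped subsets / base models
    bootstrap=True,      # sample the training set with replacement
    random_state=0,
).fit(X_train, y_train)  # final predictions combine the base models' votes

print("bagging test accuracy:", bagger.score(X_test, y_test))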
Boosting
Boosting is a robust ensemble algorithm that is capable of reducing both bias & variance, and also
facilitates the conversion of weak learners (i.e., classifiers with weak correlations to the true classification) to
strong learners (i.e., well-correlated classifiers). Boosting creates strong classification tree models by
training models to concentrate on misclassified records from previous models; when this is done, all
classifiers are combined by a weighted majority vote. This process places a higher weight on incorrectly classified records while decreasing the weight of correct classifications -- this effectively forces subsequent
models to place a greater emphasis on misclassified records. The algorithm then computes the weighted
sum of votes for each class and assigns the best classification to the record. Boosting frequently yields
better models than bagging, but is not capable of parallelization; consequently, if the dataset is very large
(i.e., significant number of weak learners), then boosting may not be the most appropriate ensemble
method.
Random Forest:
Random Forest is an extension over bagging. Each classifier in the ensemble is a
decision tree classifier and is generated using a random selection of attributes at each
node to determine the split. During classification, each tree votes and the most popular
class is returned.
Implementation steps of Random Forest –
1. Multiple subsets are created from the original data set, selecting
observations with replacement.
2. A subset of features is selected randomly and whichever feature gives the
best split is used to split the node iteratively.
3. Each tree is grown to its largest possible size, without pruning.
4. The above steps are repeated, and the final prediction is based on the aggregation of predictions from n trees.
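A short sketch of a Random Forest, with an AdaBoost model from the Boosting section included for comparison (the data set and parameter values are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees whose votes are aggregated
    max_features="sqrt",  # random subset of features considered at each split
    random_state=0,
).fit(X_train, y_train)

boost = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("random forest accuracy:", forest.score(X_test, y_test))
print("adaboost accuracy:     ", boost.score(X_test, y_test))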
The genetic algorithm applies the same technique in data mining – it iteratively
performs the selection, crossover, mutation, and encoding process to evolve the
successive generation of models.
The idea of genetic algorithm is derived from natural evolution. In genetic algorithm, first of all,
the initial population is created. This initial population consists of randomly generated rules. We
can represent each rule by a string of bits.
For example, in a given training set, the samples are described by two Boolean attributes such
as A1 and A2. And this given training set contains two classes such as C1 and C2.
We can encode the rule IF A1 AND NOT A2 THEN C2 into a bit string 100. In this bit
representation, the two leftmost bits represent the attribute A1 and A2, respectively.
Likewise, the rule IF NOT A1 AND NOT A2 THEN C1 can be encoded as 001.
Note − If an attribute has K values where K > 2, then K bits can be used to encode the attribute values. The classes are also encoded in the same manner.
Points to remember −
Based on the notion of the survival of the fittest, a new population is formed that consists
of the fittest rules in the current population and offspring values of these rules as well.
The fitness of a rule is assessed by its classification accuracy on a set of training
samples.
The genetic operators such as crossover and mutation are applied to create offspring.
In crossover, substrings from a pair of rules are swapped to form a new pair of rules.
In mutation, randomly selected bits in a rule's string are inverted.
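A toy sketch of the encoding, crossover, and mutation operators described above, on 3-bit rule strings (fitness evaluation against training samples is omitted for brevity):

import random
random.seed(0)

population = ["100", "001", "110", "011"]   # initial, randomly generated rule strings

def crossover(rule1, rule2, point=2):
    # Swap the substrings of a pair of rules after the crossover point
    return rule1[:point] + rule2[point:], rule2[:point] + rule1[point:]

def mutate(rule, p=0.1):
    # Invert randomly selected bits in the rule's string
    return "".join(bit if random.random() > p else str(1 - int(bit)) for bit in rule)

child1, child2 = crossover(population[0], population[1])
print(child1, child2)       # '101' '000'
print(mutate(child1))       # child1 with each bit flipped with probability 0.1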
Information system –
In Rough Set, data model information is stored in a table. Each row (tuples) represents
a fact or an object. Often the facts are not consistent with each other. In Rough Set
terminology a data table is called an Information System. Thus, the information table
represents input data, gathered from any domain.
Indiscernibility –
Tables may contain many objects having the same features. A way of reducing table size is to store only one representative object for every set of objects with the same features. These objects are called indiscernible objects or tuples. With any subset of attributes P ⊆ A there is an associated equivalence relation IND(P):
IND(P) = {(x, y) ∈ U × U | for every attribute a in P, a(x) = a(y)}
Here IND(P) is called the P-indiscernibility relation, and x and y are indiscernible from each other by the attributes in P.
A set is said to be rough if its boundary region (the difference between its upper and lower approximations) is non-empty; otherwise the set is crisp.
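A small sketch, on an invented information table, showing the indiscernibility classes IND(P), the lower and upper approximations, and the boundary region that makes a set rough:

objects = {
    "x1": {"Headache": "yes", "Temp": "high",   "Flu": "yes"},
    "x2": {"Headache": "yes", "Temp": "high",   "Flu": "yes"},
    "x3": {"Headache": "yes", "Temp": "high",   "Flu": "no"},   # conflicts with x1, x2
    "x4": {"Headache": "no",  "Temp": "normal", "Flu": "no"},
}
P = ["Headache", "Temp"]
target = {x for x, row in objects.items() if row["Flu"] == "yes"}   # set to approximate

# IND(P): objects with identical values on every attribute in P are indiscernible
classes = {}
for x, row in objects.items():
    classes.setdefault(tuple(row[a] for a in P), set()).add(x)

lower, upper = set(), set()
for c in classes.values():
    if c <= target:    # equivalence class entirely inside the target set
        lower |= c
    if c & target:     # equivalence class overlapping the target set
        upper |= c

print("lower:", lower, "upper:", upper, "boundary:", upper - lower)
# The boundary is non-empty, so the set 'Flu = yes' is rough for this table.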