Decision tree learning stands as a prominent technique within supervised classification tasks
in data mining. It generates classification models in the form of a tree structure, where each
internal node signifies a test on an attribute, each branch represents the outcome of the test,
and each leaf node represents a class label. This approach aligns with supervised learning,
where the target outcome is known beforehand. Decision trees are adept at handling both
categorical data (e.g., gender, marital status) and numerical data (e.g., age, temperature). Their
primary function is to construct data models capable of predicting class labels or values that
aid in decision-making processes. These models are constructed from the training dataset fed
into the system.
As noted by Han and Kamber (2006), a decision tree can be visualized as a flowchart-like
structure. Internal nodes represent tests on attributes, branches represent the results of those
tests, and leaf nodes represent class labels (decisions made after evaluating all attributes). The
paths from the root node to leaf nodes embody classification rules.
The core objective of employing decision trees is to establish a model during the training
phase that can subsequently predict the class or value of the target variable. This is achieved
by learning straightforward decision rules derived from prior data (training data). When
making predictions for a new record's class label, the process begins at the root node. The root
attribute's value is compared to the corresponding attribute in the record. Based on this
comparison, the branch corresponding to that value is followed, leading to the next node in
the tree. This process continues until a leaf node is reached, which holds the predicted class
label.
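To make this traversal concrete, the following minimal Python sketch (an illustration only; the Node structure and the tiny example tree are hypothetical, not taken from any cited implementation) walks a record down a tree until a leaf is reached:

# Minimal sketch of prediction by traversing a decision tree.
class Node:
    def __init__(self, attribute=None, branches=None, label=None):
        self.attribute = attribute      # attribute tested at this node (None for leaves)
        self.branches = branches or {}  # maps attribute value -> child Node
        self.label = label              # class label (set only on leaf nodes)

def predict(node, record):
    """Follow the branches matching the record's attribute values until a leaf is reached."""
    while node.label is None:
        value = record[node.attribute]
        node = node.branches[value]
    return node.label

# Hypothetical example tree that tests 'fever' and then 'cough'.
tree = Node(attribute="fever", branches={
    "yes": Node(attribute="cough", branches={
        "yes": Node(label="infected"),
        "no": Node(label="not infected"),
    }),
    "no": Node(label="not infected"),
})

print(predict(tree, {"fever": "yes", "cough": "no"}))  # -> "not infected"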
Root Node: This node signifies the entire population or sample from which
subsequent divisions occur to create more homogenous subsets.
Splitting: This process involves dividing a node into two or more child nodes based
on specific attribute values.
Decision Node: A node that is not a leaf node and can be further split into sub-nodes
based on decision rules is referred to as a decision node.
Leaf/Terminal Node: Nodes that have reached a final classification and cannot be
further divided are termed leaf or terminal nodes.
Pruning: The process of strategically removing sub-nodes from a decision tree to
mitigate overfitting is known as pruning. It can be viewed as the opposite of splitting.
Branch/Sub-Tree: A branch refers to a sub-section of the entire tree structure that
originates from a single parent node and extends to its terminal nodes.
Parent and Child Node: A node that is partitioned into sub-nodes is designated as the
parent node of those sub-nodes. Conversely, the sub-nodes are considered the children
of the parent node.
A preview of a sample COVID-19 infection dataset is shown below [133]. Decision tree
algorithms can be applied to such data to predict whether a patient is infected.
Decision trees have garnered popularity for exploratory knowledge discovery due to their
inherent strengths. First, they do not necessitate extensive domain knowledge or parameter
tuning during construction, making them well-suited for initial exploration. Second, decision
trees effectively handle high-dimensional data. Furthermore, their knowledge representation
in the form of a tree structure is intuitive and generally user-friendly for human
comprehension. Additionally, decision tree algorithms boast efficient learning and
classification processes, leading to fast execution times. In general, decision tree classifiers
achieve good accuracy levels; however, successful application can be influenced by the
specific characteristics of the data at hand. The versatility of decision tree learning algorithms
has led to their widespread adoption for classification tasks across diverse domains, including
medicine, manufacturing, finance, astronomy, and molecular biology (Han & Kamber, 2006).
Decision trees can be categorized into two primary types based on the target variable:
Categorical Target Variable: When the target variable is categorical (e.g., yes/no,
disease classification), the resulting decision tree is classified as a categorical variable
decision tree.
Continuous Target Variable: Conversely, if the target variable is continuous (e.g.,
temperature, income level), the decision tree is classified as a continuous variable
decision tree.
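As a brief illustration of the two types, the sketch below uses scikit-learn (chosen here purely for convenience; the toy feature values are invented): DecisionTreeClassifier handles a categorical target, while DecisionTreeRegressor handles a continuous one.

# Illustrative only: trees for a categorical vs. a continuous target variable.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Categorical target (e.g., disease: 1 = yes, 0 = no); features are [age, smoker].
X_cls = [[45, 1], [30, 0], [60, 1], [25, 0]]
y_cls = [1, 0, 1, 0]
clf = DecisionTreeClassifier(max_depth=2).fit(X_cls, y_cls)
print(clf.predict([[50, 1]]))      # predicted class label

# Continuous target (e.g., blood pressure); feature is [age].
X_reg = [[45], [30], [60], [25]]
y_reg = [130.0, 118.0, 142.0, 110.0]
reg = DecisionTreeRegressor(max_depth=2).fit(X_reg, y_reg)
print(reg.predict([[50]]))         # predicted numeric value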
The process of selecting the most suitable attribute for splitting a node in a decision tree
significantly impacts the model's accuracy. These decision criteria differ between
classification and regression trees. Several algorithms exist within decision trees to determine
how to split a node into sub-nodes. When dealing with a dataset containing N attributes,
identifying the optimal attribute to place at the root node or at various internal node levels
presents a challenge. Randomly selecting attributes for splitting is not an effective approach,
as it can lead to poor model performance and low accuracy. Common and well-established
attribute selection methods include information gain, Gini index, gain ratio, reduction in
variance, and chi-square. These methods provide a structured approach to identifying the most
informative attribute for each split, ultimately leading to a more robust and accurate decision
tree.
Within the context of decision trees, entropy serves as a metric quantifying the level of
randomness or uncertainty associated with a dataset. A higher entropy value signifies greater
randomness or difficulty in drawing definitive conclusions from the data. Conversely, a lower
entropy value indicates a more homogenous dataset where clear classifications can be made.
In essence, entropy reflects the degree of disorder within a system. A perfectly homogenous
dataset, where all instances belong to the same class, will have an entropy of zero.
Conversely, a dataset in which the two classes are represented with equal probability (a 50%/50%
split) will exhibit the maximum entropy of 1.
Mathematically, the entropy of a data partition D with respect to the class attribute is represented as

Entropy(D) = − Σ_{i=1}^{m} p_i log₂(p_i)

where p_i is the proportion of instances in D belonging to class i and m is the number of classes.
Gini Index:
The Gini index, used by binary-split algorithms such as CART, measures the impurity of the
training tuples in a dataset D as

Gini(D) = 1 − Σ_{i=1}^{m} p_i²

where p_i is the probability that a tuple in D belongs to class i, estimated from the relative class frequencies.
While information gain is a popular measure for attribute selection in decision trees, it can
sometimes favor attributes with a high number of distinct values, even if those values hold
minimal discriminative power for classification. Gain ratio addresses this shortcoming by
incorporating the intrinsic information (or split information) of an attribute into the
calculation. This additional factor penalizes attributes with a large number of branches that
would result from a split, ultimately aiming to identify attributes that are not only informative
but also lead to more efficient tree structures. Gain ratio can be calculated with the following
mathematical formula [132]:

Gain Ratio = Gain(split) / SplitInfo(split), where
Gain(split) = Entropy(before) − Σ_{j=1}^{K} (n_j / n) · Entropy(j, after)
SplitInfo(split) = − Σ_{j=1}^{K} (n_j / n) · log₂(n_j / n)

Here "before" is the dataset before the split, K is the number of subsets generated by the
split, (j, after) is subset j after the split, n is the number of instances before the split, and
n_j is the number of instances in subset j.
Reduction in Variance (Regression Trees): For decision trees dealing with continuous
target variables (regression tasks), reduction in variance emerges as a prominent attribute
selection method. This approach leverages the standard formula for variance to identify the
optimal split point. The criterion for selecting a split revolves around minimizing the variance
within the resulting child nodes. In essence, the reduction in variance method aims to create
child nodes with greater homogeneity regarding the target variable's values.
The variance is computed with the standard formula

Variance = Σ (X − X̄)² / n

where X̄ (X-bar) is the mean of the values, X is an individual value, and n is the number of values.
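For illustration, the following small Python sketch (a hypothetical helper, not drawn from any cited source) evaluates the reduction in variance achieved by a candidate split:

# Illustrative reduction-in-variance calculation for a candidate regression-tree split.
def variance(values):
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

def variance_reduction(parent_values, child_value_groups):
    """Parent variance minus the size-weighted variance of the child nodes."""
    n = len(parent_values)
    weighted_child = sum(len(g) / n * variance(g) for g in child_value_groups)
    return variance(parent_values) - weighted_child

# Example: splitting target values [10, 12, 30, 32] into two homogeneous children.
parent = [10, 12, 30, 32]
print(variance_reduction(parent, [[10, 12], [30, 32]]))  # large reduction -> good split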
Chi-Square Test:
The chi-square test serves as another attribute selection method within decision trees,
particularly suited for tasks involving categorical target variables. It assesses the statistical
significance of the differences between the distribution of the target variable within a node
and its distribution across all child nodes resulting from a potential split. The chi-square
statistic is calculated based on the sum of the squared deviations between the observed and
expected frequencies of the target variable in each child node relative to the parent node.
Higher chi-square values indicate a greater level of statistical significance, suggesting that the
chosen attribute effectively partitions the data into more homogenous sub-groups based on the
target variable. This method is often employed in conjunction with algorithms like CHAID
(Chi-square Automatic Interaction Detector). It is important to note that the chi-square test is
most effective when dealing with categorical target variables with two or more categories.
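In symbols, writing O for an observed class frequency in a child node and E for the corresponding expected frequency derived from the parent node's class distribution, the statistic described above takes the standard form

χ² = Σ (O − E)² / E

with the sum running over the classes (and child nodes) produced by the candidate split.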
Common decision tree classifiers include ID3 (Iterative Dichotomiser 3), C4.5 (the successor
of ID3), CART (Classification and Regression Trees), CHAID (Chi-square Automatic Interaction
Detection), and MARS (Multivariate Adaptive Regression Splines, which extends decision-tree
ideas to model numerical data more effectively). Some of these are discussed below.
3.2.2 ID3 algorithm: The ID3 (Iterative Dichotomiser 3) algorithm, developed by Ross
Quinlan (1975) at the University of Sydney, is a foundational decision tree learning algorithm.
It employs a top-down greedy approach to construct a decision tree. Notably, ID3 is primarily
suited for classification tasks involving nominal features (categorical data with a fixed
number of discrete values). The core attribute selection method within ID3 is information
gain.
Leaf Node Creation: If all instances within a node belong to the same class, the node
is designated as a leaf node labeled with the corresponding class.
Iteration: Steps 1-4 (computing the entropy of the current data partition, evaluating the
information gain of each candidate attribute, selecting the attribute with the highest gain,
and splitting the data on that attribute) are repeated for the remaining features until either
all features have been exhausted or the decision tree consists entirely of leaf nodes.
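To make the information-gain criterion concrete, the following Python sketch (an illustration only, with invented toy records, not the original ID3 code) computes the entropy of a labelled dataset and the information gain of a candidate categorical attribute:

# Illustrative computation of entropy and information gain for ID3-style attribute selection.
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(records, attribute, target):
    """Entropy reduction achieved by splitting `records` on `attribute`."""
    base = entropy([r[target] for r in records])
    n = len(records)
    gain = base
    for value in {r[attribute] for r in records}:
        subset = [r[target] for r in records if r[attribute] == value]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# Hypothetical toy records: does a patient have heart disease?
data = [
    {"chest_pain": "yes", "smoker": "yes", "disease": "yes"},
    {"chest_pain": "yes", "smoker": "no",  "disease": "yes"},
    {"chest_pain": "no",  "smoker": "yes", "disease": "no"},
    {"chest_pain": "no",  "smoker": "no",  "disease": "no"},
]
print(information_gain(data, "chest_pain", "disease"))  # 1.0 (perfectly informative)
print(information_gain(data, "smoker", "disease"))      # 0.0 (uninformative)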
3.2.3 C4.5 Classifier: The C4.5 algorithm, developed by Ross Quinlan (1993) as a successor
to ID3, represents a significant advancement in decision tree learning. Building upon Hunt's
algorithm and implemented serially, C4.5 constructs decision trees using training data similar
to ID3. A key improvement lies in C4.5's pruning capabilities. Pruning involves strategically
replacing internal nodes with leaf nodes to reduce the overall error rate and mitigate
overfitting. Unlike ID3, C4.5 can handle both continuous and categorical attributes during tree
construction. Furthermore, it employs a more sophisticated method for tree pruning, which
helps to reduce misclassification errors stemming from noise or excessive detail within the
training data. While C4.5, like ID3, sorts data at each node to determine the optimal splitting
attribute, it utilizes the gain ratio impurity measure for evaluation (Quinlan, 1993). In essence,
the C4.5 algorithm leverages the concept of information gain or entropy reduction to identify
the best split point. The attribute with the highest normalized information gain is chosen for
decision making at each node. This approach aims to partition the data at each node into
subsets enriched in one class or another, ultimately leading to a more robust and accurate
decision tree. Comparative analyses, such as the work by Badr Hassina et al. [135], have
demonstrated that C4.5 exhibits superior performance compared to ID3, particularly when
dealing with datasets of varying sizes. The table and chart below summarize their results.
Source: [135]
Enhancements in C4.5
As a successor to ID3, C4.5 builds upon the foundation laid by its predecessor. While the
initial information gain calculation remains similar, C4.5 incorporates key improvements: it
selects attributes using the gain ratio rather than raw information gain, it handles both
continuous and categorical attributes, it tolerates missing attribute values, and it prunes the
tree after construction to reduce overfitting.
J48 is an open-source Java implementation of the C4.5 algorithm embedded within the
WEKA data mining suite. J48 offers the capability to generate either pruned or unpruned C4.5
decision trees. Essentially, it represents an optimized implementation of C4.5 principles. As
with C4.5, J48 is primarily suited for classification tasks (Quinlan, 1993). A key feature of
J48 is its ability to prune branches within the decision tree that do not significantly contribute
to overall predictive accuracy. This pruning process addresses the issue of overfitting that can
arise due to noise or outliers within the training data (Han & Kamber, 2006). J48 employs
statistical measures to strategically remove these less reliable branches, ultimately resulting in
a more robust and generalizable model. Similar to C4.5, J48 constructs decision trees by
leveraging the concept of information entropy during the learning process. It exploits the
inherent information content within each attribute to partition the data into smaller, more
homogenous subsets. J48 evaluates the normalized information gain (or change in entropy)
associated with each potential split point. The attribute that yields the highest normalized
information gain is then chosen for the decision, effectively guiding the tree construction
process.
Pruning represents a critical technique employed within decision tree learning to mitigate the
risk of overfitting. Overfitting occurs when a decision tree becomes overly complex and
adapts too closely to the specificities of the training data, potentially hindering its ability to
generalize effectively to unseen instances. Pruning strategies aim to strategically remove
unnecessary nodes from the decision tree, thereby reducing its overall size and complexity.
The core objective of pruning is to achieve a balance between model accuracy and efficiency.
Ideally, pruning should succeed in reducing the size of the tree without compromising its
predictive accuracy.
Importance of Pruning:
Pruning plays a vital role in decision tree learning by addressing the potential pitfalls of
overfitting and underfitting. Overfitting happens when a decision tree becomes excessively
intricate and conforms too closely to the training data, potentially hindering its
generalizability. Conversely, underfitting occurs when the tree is not complex enough to
capture the underlying patterns within the data. The figure below illustrates these concepts
visually, depicting a training dataset with overfitting and underfitting issues. Pruning tactics
aim to strike a balance between model complexity and accuracy, ideally reducing tree size
without sacrificing its ability to make accurate predictions.
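As one concrete illustration of this trade-off, the sketch below uses scikit-learn's CART-based DecisionTreeClassifier together with its cost-complexity pruning parameter; this is not the error-based pruning used by C4.5/J48, but it demonstrates the same idea of reducing tree size while monitoring accuracy:

# Illustration of pruning as a size/accuracy trade-off using cost-complexity pruning.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unpruned tree fits the training data closely and tends to overfit.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# A larger ccp_alpha removes subtrees whose extra complexity does not pay off in accuracy.
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_train, y_train)

for name, tree in [("unpruned", full_tree), ("pruned", pruned_tree)]:
    print(name,
          "nodes:", tree.tree_.node_count,
          "test accuracy:", round(tree.score(X_test, y_test), 3))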
Overfitting presents a substantial challenge in decision tree learning and various other
machine learning models. It arises when a model prioritizes memorizing the specifics of the
training data to an excessive degree, potentially at the expense of generalizability. An
overfitted decision tree may exhibit a very low overall cost function value on the training
data; however, its ability to accurately predict outcomes for unseen instances deteriorates.
This phenomenon occurs because the model captures idiosyncrasies or noise within the
training data that are not representative of the broader population.
Pruning directly addresses this risk: by removing unnecessary nodes, it reduces the model's
susceptibility to overfitting and thereby fosters improved generalization to unseen data.
Pruning decision trees can be achieved through two primary approaches: structured pruning
and unstructured pruning.
Structured Pruning: This strategy focuses on removing entire subtrees from the
decision tree. Structured pruning aims to maintain the overall architecture of the tree
while reducing its complexity. Subtrees that contribute minimally to the predictive
accuracy or introduce noise into the model are targeted for removal. This approach is
particularly well-suited for decision trees with a well-defined structure.
Unstructured Pruning: In contrast, unstructured pruning involves eliminating
individual nodes from the tree without necessarily adhering to a specific structure.
This method offers more flexibility but can potentially alter the overall architecture of
the tree to a greater extent. Unstructured pruning is often applied to decision trees
where the internal structure is less rigid, allowing for more targeted removal of
individual nodes that may be hindering performance.
The choice between structured and unstructured pruning depends on the specific
characteristics and complexity of the decision tree being optimized.
The Naive Bayes classifier is a probabilistic machine learning algorithm widely employed for
classification tasks, particularly in domains involving high-dimensional training data, such as
text classification. This algorithm's efficiency and simplicity make it a popular choice for
building fast and effective classification models. The "naive" aspect of the name stems from
its core assumption: that features are conditionally independent given the class label. In
simpler terms, the presence or absence of one feature is independent of the presence or
absence of another feature, considering the class membership. This assumption simplifies the
computational complexity of the algorithm.
1. Bayes' Theorem Foundation: Naive Bayes is built upon the principles of Bayes'
theorem, which allows for calculating the probability of an event occurring given prior
knowledge of related conditions. Mathematically, this is represented by the formula

P(A | B) = P(B | A) · P(A) / P(B)

where P(A | B) is the posterior probability of class A given the observed features B, P(B | A)
is the likelihood of the features given the class, P(A) is the prior probability of the class,
and P(B) is the probability of the observed features (the evidence).
2. Data Preparation: The algorithm requires a labeled training dataset, where each data
point comprises a set of features and their corresponding class labels. Features can be
categorical or numerical, and class labels represent discrete categories.
3. Parameter Estimation: During training, Naive Bayes estimates the prior probabilities
(P(A)) for each class A by calculating the frequency of each class within the training
data. Additionally, it estimates the likelihood probabilities (P(B_i | A)) for each
feature B_i given each class A.
4. Classification: For classifying a new data instance, the algorithm utilizes Bayes'
theorem to compute the posterior probability (P(A | B)) for each potential class A. The
class with the highest posterior probability is then assigned as the predicted class for
the given features B. Mathematically, this classification step can be expressed as

A* = argmax_A P(A) · Π_i P(B_i | A)

where A* is the predicted class, P(A) is the prior probability of class A, and P(B_i | A) is
the likelihood of feature B_i given class A; the evidence P(B) is identical for all classes and
can therefore be omitted from the comparison.
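The steps above can be illustrated with scikit-learn's GaussianNB (one possible implementation, used here with invented toy values): fit() performs the parameter estimation, and predict() applies the decision rule.

# Illustrative Naive Bayes classification on a toy numeric dataset.
from sklearn.naive_bayes import GaussianNB

# Hypothetical features: [age, resting blood pressure]; labels: 1 = disease, 0 = no disease.
X_train = [[63, 145], [37, 130], [41, 130], [56, 120], [57, 140], [44, 110]]
y_train = [1, 0, 0, 1, 1, 0]

model = GaussianNB()
model.fit(X_train, y_train)              # estimates priors P(A) and likelihoods P(B_i | A)

new_patient = [[52, 135]]
print(model.predict(new_patient))        # predicted class (highest posterior)
print(model.predict_proba(new_patient))  # posterior probabilities for each class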
The K-Nearest Neighbors (KNN) algorithm is a widely used non-parametric method for
supervised learning, applicable to both classification and regression problems. KNN
operates on the principle of locality, assuming that data points with similar characteristics
tend to share similar class labels or target values. During the training phase, KNN
memorizes the entire training dataset as a reference. When presented with a new, unseen
data point, the algorithm calculates the distances between the input point and all instances
in the training data using a chosen distance metric (e.g., Euclidean distance). KNN then
identifies the K closest neighbors (nearest data points) to the input point based on the
calculated distances. In classification tasks, the most frequent class label amongst these K
neighbors is assigned as the predicted label for the new data point. For regression tasks,
KNN predicts the target value of the new data point by calculating either the average or a
weighted average of the target values from its K nearest neighbors. The KNN algorithm's
simplicity and ease of implementation contribute to its popularity across various domains.
However, its performance is sensitive to the selection of the K parameter (number of
neighbors) and the chosen distance metric. Careful parameter tuning is crucial for
achieving optimal results with KNN.
The KNN algorithm follows a straightforward process for both classification and regression
tasks. Here's a breakdown of the key steps:
K Selection: A crucial initial step involves selecting the appropriate value for K, the
number of nearest neighbors to consider during prediction. This choice significantly
impacts the model's performance.
Distance Calculation: For each data point in the training set, a distance metric
(commonly Euclidean distance) is used to compute the distance between that point and
the new, unseen data point for which a prediction is required.
Nearest Neighbor Identification: Based on the calculated distances, the algorithm
identifies the K nearest neighbors to the new data point.
Prediction: In classification tasks, the most frequent class label among these K
neighbors is assigned as the predicted class for the new data point. For regression
tasks, the predicted target value is typically determined by calculating the average or a
weighted average of the target values from the K nearest neighbors.
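A minimal sketch of these steps with scikit-learn's KNeighborsClassifier (the toy values are invented purely for illustration) is shown below:

# Illustrative K-Nearest Neighbors classification (K = 3, Euclidean distance).
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical features: [age, cholesterol]; labels: 1 = disease, 0 = no disease.
X_train = [[63, 233], [37, 250], [41, 204], [56, 236], [57, 354], [44, 199]]
y_train = [1, 0, 0, 1, 1, 0]

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X_train, y_train)                # KNN simply memorizes the training data

new_patient = [[50, 240]]
print(knn.predict(new_patient))                            # majority class among the 3 nearest neighbors
print(knn.kneighbors(new_patient, return_distance=True))   # distances and indices of those neighbors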
Impact of K Value: The choice of K strongly influences the model's behaviour. A small K makes
predictions sensitive to noise and outliers in the training data, whereas a very large K
smooths the decision boundary and can blur genuine class distinctions. In practice, K is
typically selected through cross-validation, and an odd value is often preferred in binary
classification to avoid ties.
Advantages of KNN
The KNN algorithm offers several advantages that contribute to its popularity across various
machine learning applications:
Adaptability to New Data: Since KNN memorizes the entire training dataset, it can
inherently adapt to new data points as they become available. These new data points
are incorporated into the neighbor search process, potentially influencing future
predictions. This characteristic allows the model to stay relatively up-to-date with
evolving data patterns.
Limited Hyperparameter Tuning: KNN requires tuning only a small number of
hyperparameters compared to some other algorithms. The primary parameters to
consider are the number of neighbors (K) and the chosen distance metric. This
characteristic can be advantageous, especially in scenarios where extensive
hyperparameter tuning might be computationally expensive or time-consuming.
Support Vector Machines (SVMs) represent a powerful machine learning algorithm for
classification tasks. Originally introduced in the 1960s and further refined in the 1990s, SVMs
have gained widespread adoption due to their effectiveness, versatility, and computational
efficiency. The core principle behind SVMs lies in identifying an optimal hyperplane within a
high-dimensional space (where the number of dimensions corresponds to the number of
features) that effectively separates the data points belonging to different classes. SVM excels
in various applications, including text classification, image recognition tasks (handwriting and
face recognition), and gene identification, among others.
SVMs aim to achieve classification by identifying a hyperplane within the feature space that
maximizes the margin between the data points belonging to different classes. Here are some
central concepts underlying SVMs:
Support Vectors: These are data points that lie closest to the separating hyperplane.
They play a critical role in defining the model's decision boundary, as the SVM is
constructed such that it maximizes the margin while ensuring these closest points are
correctly classified.
Hyperplane: This represents a decision boundary within the high-dimensional feature
space that effectively separates the data points of distinct classes. The SVM algorithm
iteratively refines the hyperplane to achieve the optimal separation between classes.
Margin: The margin refers to the distance between the hyperplane and the closest
support vectors from both classes. A larger margin translates to a more robust
separation between classes, which is desirable for good generalization performance.
Conversely, a small margin indicates that the classes are not well-separated in the
feature space, potentially leading to overfitting issues. The SVM algorithm prioritizes
maximizing this margin during the hyperplane construction process.
SVM algorithms typically leverage the concept of margins to achieve optimal class
separation. The margin refers to the distance between the decision hyperplane and the closest
data points (support vectors) from each class. Maximizing this margin is a core principle in
SVM, as it generally leads to better generalization performance on unseen data. There are two
primary categories of margins in SVMs:
Hard Margin: A hard margin SVM aims to create a perfect separation between the
classes, ideally with no data points lying between the hyperplane and the margins.
This rigid approach strictly enforces the maximum margin criterion on all training data
points. While effective for linearly separable datasets, it may not be realistic for most
real-world data containing noise or inherent class overlap. Enforcing a hard margin in
such scenarios can lead to overfitting issues.
Soft Margin: A soft margin SVM acknowledges the potential for some data points to
violate the perfect separation criterion. It incorporates a cost function that penalizes
misclassified points while still aiming to maximize the margin. This approach allows
for some flexibility in handling noisy data or datasets with inherent class overlap. The
degree of flexibility is controlled by a cost parameter, which determines the trade-off
between maximizing the margin and allowing for misclassifications.
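To make the hard/soft-margin distinction concrete, the sketch below trains scikit-learn's linear SVC with two values of the cost parameter C on synthetic data generated only for illustration; a very large C approximates a hard margin, while a small C yields a softer margin that tolerates some misclassification.

# Illustration of the soft-margin cost parameter C in a linear SVM.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, slightly overlapping two-class data.
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, class_sep=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for C in (100.0, 0.1):   # large C ~ hard margin, small C ~ soft margin
    svm = SVC(kernel="linear", C=C).fit(X_train, y_train)
    print(f"C={C}: support vectors={svm.support_vectors_.shape[0]}, "
          f"test accuracy={svm.score(X_test, y_test):.3f}")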
3.4. Summary
This chapter outlines the methodological approaches used in the research to predict heart
disease using data mining classification techniques.
3.1 Introduction
This section provides a brief overview of the chapter's objective: exploring data
mining classification techniques for heart disease prediction.
This section introduces the concept of machine learning classification and highlights
decision trees as a prominent and interpretable algorithm.
It defines classification as a two-stage process: learning and prediction.
It explains how decision trees work by segmenting data into sub-regions based on
decision rules derived from features, ultimately leading to class labels.
This subsection delves deeper into decision trees, highlighting their strengths as a
supervised learning technique.
It describes a decision tree as a flowchart-like structure with nodes representing tests
on attributes, branches representing test outcomes, and leaf nodes representing class
labels.
It explains the core objective of decision trees: building models during training to
predict class labels for new data.
The process of making predictions using a decision tree is described, starting from the
root node and following branches based on attribute values until a leaf node (class
label) is reached.
Common components of decision trees are defined, including root node, splitting,
decision node, leaf/terminal node, pruning, branch/sub-tree, parent and child node.
This section covers the classification algorithms examined in the study, ranging from
decision tree variants to other classifiers, including:
o ID3 (Iterative Dichotomiser 3)
o C4.5 Algorithm (successor to ID3)
o J48 Decision Tree Algorithm (open-source Java implementation of C4.5)
o Pruning Decision Trees (techniques to mitigate overfitting)
o Naive Bayes
o K-Nearest Neighbors (KNN)
o Support Vector Machines (SVM)
For each algorithm, the subsection explains key concepts, functionalities, and
advantages.
Mathematical formulas are included for some algorithms (e.g., Information Gain for
ID3) to provide a deeper understanding.