CHAPTER 3: RESEARCH METHODOLOGY


3.1 INTRODUCTION
This chapter outlines the methodological approaches explored within this research endeavor.
As the core objective centers on predicting heart disease through data mining classification
techniques, a selection of prominent methods is examined in detail. Furthermore, the chapter
delves into the significance of feature selection techniques and their application within this
context. It is acknowledged that a multitude of techniques exist for disease prediction. This
study specifically compares a subset of these techniques across various datasets to arrive at a
comprehensive analysis of their effectiveness in predicting heart disease.

LITERATURE REVIEW: add literature review study for the entire study…
Minimum 1 page

3.2 DATA MINING CLASSIFICATION TECHNIQUES


Within the domain of machine learning classification, decision trees stand out as one of the
most comprehensible and interpretable algorithms. Classification itself can be characterized
as a two-stage process: a learning phase and a prediction phase. During the learning phase,
the model is constructed based on the provided training data. Subsequently, in the prediction
phase, the model is leveraged to forecast the response variable for new, unseen data
instances. Decision trees achieve classification by segmenting the data space into distinct
sub-regions based on a series of sequential decision rules. These rules are derived from the
features of the data, ultimately leading to a class label assignment for each instance [131].

Fig: Steps of classification task [131]
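To make the two phases concrete, the following minimal sketch separates a learning phase from a prediction phase using scikit-learn. The bundled breast cancer dataset, the train/test split, and the tree depth are illustrative assumptions only; they are not the configuration used in this study.

```python
# Minimal sketch of the two-phase classification task (illustrative only).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)   # stand-in dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Learning phase: the model is constructed from the training data.
model = DecisionTreeClassifier(max_depth=4, random_state=42)
model.fit(X_train, y_train)

# Prediction phase: the model forecasts labels for new, unseen instances.
y_pred = model.predict(X_test)
print("Accuracy on unseen data:", accuracy_score(y_test, y_pred))
```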


3.2.1 DECISION TREES

Decision tree learning stands as a prominent technique within supervised classification tasks
in data mining. It generates classification models in the form of a tree structure, where each
internal node signifies a test on an attribute, each branch represents the outcome of the test,
and each leaf node represents a class label. This approach aligns with supervised learning,
where the target outcome is known beforehand. Decision trees are adept at handling both
categorical data (e.g., gender, marital status) and numerical data (e.g., age, temperature). Their
primary function is to construct data models capable of predicting class labels or values that
aid in decision-making processes. These models are constructed from the training dataset fed
into the system.

As noted by Han and Kamber (2006), a decision tree can be visualized as a flowchart-like
structure. Internal nodes represent tests on attributes, branches represent the results of those
tests, and leaf nodes represent class labels (decisions made after evaluating all attributes). The
paths from the root node to leaf nodes embody classification rules.

The core objective of employing decision trees is to establish a model during the training
phase that can subsequently predict the class or value of the target variable. This is achieved
by learning straightforward decision rules derived from prior data (training data). When
making predictions for a new record's class label, the process begins at the root node. The root
attribute's value is compared to the corresponding attribute in the record. Based on this
comparison, the branch corresponding to that value is followed, leading to the next node in
the tree. This process continues until a leaf node is reached, which holds the predicted class
label.
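This root-to-leaf walk can be sketched in a few lines of Python. The tree below is a hypothetical, hand-written rule set used purely to illustrate the traversal; it is not a tree learned from the study's data.

```python
# A hypothetical decision tree: internal nodes test an attribute,
# branches carry the test outcome, and leaves carry a class label.
tree = {
    "attribute": "chest_pain",
    "branches": {
        "typical":  {"label": "disease"},
        "atypical": {
            "attribute": "exercise_angina",
            "branches": {"yes": {"label": "disease"},
                         "no":  {"label": "no disease"}},
        },
    },
}

def predict(node, record):
    """Walk from the root to a leaf by comparing attribute values."""
    while "label" not in node:                 # stop once a leaf node is reached
        value = record[node["attribute"]]      # test the attribute at this node
        node = node["branches"][value]         # follow the matching branch
    return node["label"]

print(predict(tree, {"chest_pain": "atypical", "exercise_angina": "no"}))  # no disease
```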

3.2.1.1 Components of Decision Trees:

 Root Node: This node signifies the entire population or sample from which
subsequent divisions occur to create more homogenous subsets.
 Splitting: This process involves dividing a node into two or more child nodes based
on specific attribute values.
 Decision Node: A node that is not a leaf node and can be further split into sub-nodes
based on decision rules is referred to as a decision node.
 Leaf/Terminal Node: Nodes that have reached a final classification and cannot be
further divided are termed leaf or terminal nodes.
 Pruning: The process of strategically removing sub-nodes from a decision tree to
mitigate overfitting is known as pruning. It can be viewed as the opposite of splitting.
 Branch/Sub-Tree: A branch refers to a sub-section of the entire tree structure that
originates from a single parent node and extends to its terminal nodes.
 Parent and Child Node: A node that is partitioned into sub-nodes is designated as the
parent node of those sub-nodes. Conversely, the sub-nodes are considered the children
of the parent node.

Figure: A simple Decision tree

A preview of a sample COVID-19 infection dataset is given in [133]. We can use decision tree algorithms to predict whether a patient is infected and to estimate the probability of infection.
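A minimal sketch of such a classifier is shown below, using scikit-learn's DecisionTreeClassifier. The symptom features and records are hypothetical placeholders, not the dataset of [133].

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical symptom records (1 = present, 0 = absent) and infection labels.
df = pd.DataFrame({
    "fever":       [1, 1, 0, 1, 0, 0, 1, 0],
    "dry_cough":   [1, 0, 0, 1, 1, 0, 1, 0],
    "sore_throat": [0, 1, 0, 1, 0, 0, 1, 1],
    "infected":    [1, 1, 0, 1, 0, 0, 1, 0],
})

X, y = df[["fever", "dry_cough", "sore_throat"]], df["infected"]
clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# Probability that a new patient (fever, no cough, sore throat) is infected.
new_patient = pd.DataFrame([{"fever": 1, "dry_cough": 0, "sore_throat": 1}])
print(clf.predict_proba(new_patient)[0])   # [P(not infected), P(infected)]
```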

Decision trees have garnered popularity for exploratory knowledge discovery due to their
inherent strengths. First, they do not necessitate extensive domain knowledge or parameter
tuning during construction, making them well-suited for initial exploration. Second, decision
trees effectively handle high-dimensional data. Furthermore, their knowledge representation
in the form of a tree structure is intuitive and generally user-friendly for human
comprehension. Additionally, decision tree algorithms boast efficient learning and
classification processes, leading to fast execution times. In general, decision tree classifiers
achieve good accuracy levels; however, successful application can be influenced by the
specific characteristics of the data at hand. The versatility of decision tree learning algorithms
has led to their widespread adoption for classification tasks across diverse domains, including
medicine, manufacturing, finance, astronomy, and molecular biology (Han & Kamber, 2006).

Types of Decision Trees:

Decision trees can be categorized into two primary types based on the target variable:

 Categorical Target Variable: When the target variable is categorical (e.g., yes/no,
disease classification), the resulting decision tree is classified as a categorical variable
decision tree.
 Continuous Target Variable: Conversely, if the target variable is continuous (e.g.,
temperature, income level), the decision tree is classified as a continuous variable
decision tree.

The process of selecting the most suitable attribute for splitting a node in a decision tree
significantly impacts the model's accuracy. These decision criteria differ between
classification and regression trees. Several algorithms exist within decision trees to determine
how to split a node into sub-nodes. When dealing with a dataset containing N attributes,
identifying the optimal attribute to place at the root node or at various internal node levels
presents a challenge. Randomly selecting attributes for splitting is not an effective approach,
as it can lead to poor model performance and low accuracy. Common and well-established
attribute selection methods include information gain, Gini index, gain ratio, reduction in
variance, and chi-square. These methods provide a structured approach to identifying the most
informative attribute for each split, ultimately leading to a more robust and accurate decision
tree.

Within the context of decision trees, entropy serves as a metric quantifying the level of
randomness or uncertainty associated with a dataset. A higher entropy value signifies greater
randomness or difficulty in drawing definitive conclusions from the data. Conversely, a lower
entropy value indicates a more homogenous dataset where clear classifications can be made.
In essence, entropy reflects the degree of disorder within a system. A perfectly homogenous
dataset, where all instances belong to the same class, will have an entropy of zero.
Conversely, a dataset where each class is represented with equal probability (50% — 50%)
will exhibit a maximum entropy of 1.
Mathematically, entropy for a single attribute is represented as

E(S) = - Σ_i p_i log2(p_i)

where S → the current state, and p_i → the probability of an event i of state S, or the percentage of class i in a node of state S.
Mathematically, entropy for multiple attributes is represented as

E(T, X) = Σ_{c ∈ X} P(c) E(c)

where T → the current state and X → the selected attribute, and the sum runs over the values c of X.


Information gain is the decrease in entropy. It computes the difference between the entropy before the split and the weighted average entropy after the split of the dataset, based on the given attribute's values. The ID3 (Iterative Dichotomiser) decision tree algorithm uses information gain. Mathematically, IG is represented as:

IG(T, X) = E(T) - E(T, X)
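As a small illustration, both quantities can be computed directly from class labels and attribute values; the toy data below are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """E(S) = - sum(p_i * log2(p_i)) over the classes present in S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """IG(T, X) = E(T) - E(T, X): entropy before the split minus the
    weighted entropy of the subsets produced by splitting on attribute X."""
    n = len(labels)
    subsets = {}
    for value, label in zip(attribute_values, labels):
        subsets.setdefault(value, []).append(label)
    weighted = sum((len(s) / n) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

print(entropy(["yes"] * 6))                      # zero entropy for a pure node
print(entropy(["yes", "no", "yes", "no"]))       # 1.0 for an evenly mixed node
# A perfectly separating attribute yields the maximum information gain of 1.
print(information_gain(["yes", "yes", "no", "no"],
                       ["high", "high", "low", "low"]))   # 1.0
```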

Gini Index:
The Gini index considers a binary split for each attribute. It measures the impurity of the training tuples in dataset D as

Gini(D) = 1 - Σ_i P_i²

where P_i is the probability that a tuple in D belongs to class C_i.


Gini Index works with the categorical target variable “Success” or “Failure”. CART
(Classification and Regression Tree) uses the Gini index method to create split points.
Steps to calculate the Gini index for a split:
a) Calculate the Gini score for each sub-node from the probabilities of success (p) and failure (q), i.e. p² + q².
b) Calculate the Gini index for the split using the weighted Gini score of each node of that split.
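A brief sketch of these two steps is given below. It uses the impurity form Gini(D) = 1 - Σ P_i² from above together with the weighted combination of step (b); the patient labels are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum(P_i^2) over the classes present in the node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_for_split(child_nodes):
    """Step (b): weighted Gini score of the child nodes of a candidate split."""
    total = sum(len(child) for child in child_nodes)
    return sum((len(child) / total) * gini(child) for child in child_nodes)

# Hypothetical split of eight patients into two child nodes.
left = ["disease", "disease", "disease", "healthy"]
right = ["healthy", "healthy", "healthy", "disease"]
print(gini(left))                      # 0.375
print(gini_for_split([left, right]))   # 0.375 (lower is purer)
```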

While information gain is a popular measure for attribute selection in decision trees, it can
sometimes favor attributes with a high number of distinct values, even if those values hold
minimal discriminative power for classification. Gain ratio addresses this shortcoming by
incorporating the intrinsic information (or split information) of an attribute into the
calculation. This additional factor penalizes attributes with a large number of branches that
would result from a split, ultimately aiming to identify attributes that are not only informative
but also lead to more efficient tree structures. Gain ratio can be calculated with the following
mathematical formula [132].

Gain ratio = [ E(before) - Σ_{j=1..K} w_j E(j, after) ] / [ - Σ_{j=1..K} w_j log2(w_j) ]

where "before" is the dataset before the split, K is the number of subsets generated by the split, (j, after) is subset j after the split, and w_j is the proportion of instances of "before" that fall into subset j. The numerator is the information gain and the denominator is the split (intrinsic) information.

Reduction in Variance (Regression Trees): For decision trees dealing with continuous
target variables (regression tasks), reduction in variance emerges as a prominent attribute
selection method. This approach leverages the standard formula for variance to identify the
optimal split point. The criterion for selecting a split revolves around minimizing the variance
within the resulting child nodes. In essence, the reduction in variance method aims to create
child nodes with greater homogeneity regarding the target variable's values.

Variance = Σ (X - X̄)² / n

where X̄ (X-bar) is the mean of the values, X is an actual value, and n is the number of values.
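A minimal sketch of this criterion, with hypothetical target values, is shown below: a candidate split is scored by how much it reduces the parent node's variance.

```python
def variance(values):
    """Variance = sum((X - X_bar)^2) / n."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

def variance_reduction(parent, children):
    """Parent variance minus the weighted variance of the child nodes."""
    n = len(parent)
    weighted = sum((len(c) / n) * variance(c) for c in children)
    return variance(parent) - weighted

# Hypothetical continuous target values before and after a candidate split.
parent = [10, 12, 30, 32, 11, 31]
left, right = [10, 12, 11], [30, 32, 31]
print(variance_reduction(parent, [left, right]))   # about 100: a large reduction indicates a good split
```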

Chi-Square Test:

The chi-square test serves as another attribute selection method within decision trees,
particularly suited for tasks involving categorical target variables. It assesses the statistical
significance of the differences between the distribution of the target variable within a node
and its distribution across all child nodes resulting from a potential split. The chi-square
statistic is calculated based on the sum of the squared deviations between the observed and
expected frequencies of the target variable in each child node relative to the parent node.
Higher chi-square values indicate a greater level of statistical significance, suggesting that the
chosen attribute effectively partitions the data into more homogenous sub-groups based on the
target variable. This method is often employed in conjunction with algorithms like CHAID
(Chi-square Automatic Interaction Detector). It is important to note that the chi-square test is
most effective when dealing with categorical target variables with two or more categories.

Mathematically, the chi-square statistic is represented as

χ² = Σ (O - E)² / E

where O is the observed frequency and E is the expected frequency of each class of the target variable within a child node.
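The statistic can be computed for a candidate split as sketched below; the patient counts are hypothetical.

```python
def chi_square_for_split(child_counts, parent_counts):
    """Sum of (observed - expected)^2 / expected over all child nodes and classes.

    child_counts: list of per-child dicts {class: observed count}
    parent_counts: dict {class: count} for the parent node
    """
    parent_total = sum(parent_counts.values())
    chi2 = 0.0
    for counts in child_counts:
        child_total = sum(counts.values())
        for cls, parent_count in parent_counts.items():
            expected = child_total * parent_count / parent_total   # expected frequency
            observed = counts.get(cls, 0)
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Hypothetical split of 20 patients (12 "disease", 8 "healthy") into two child nodes.
parent = {"disease": 12, "healthy": 8}
children = [{"disease": 10, "healthy": 2}, {"disease": 2, "healthy": 6}]
print(chi_square_for_split(children, parent))   # higher values indicate a more significant split
```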

Different decision tree classifiers include ID3 (Iterative Dichotomiser 3), C4.5 (the successor of ID3), CART (Classification and Regression Tree), CHAID (Chi-square Automatic Interaction Detection), and MARS (Multivariate Adaptive Regression Splines, which extends tree-based ideas to handle numerical data better). Some of them are discussed below.

3.2.2 ID3 algorithm: The ID3 (Iterative Dichotomiser 3) algorithm, developed by Ross
Quinlan (1975) at the University of Sydney, is a foundational decision tree learning algorithm.
It employs a top-down greedy approach to construct a decision tree. Notably, ID3 is primarily
suited for classification tasks involving nominal features (categorical data with a fixed
number of discrete values). The core attribute selection method within ID3 is information
gain.

Fig: Decision tree for COVID19 infected patient


ID3 Steps:

 Information Gain Calculation: The algorithm begins by calculating the information


gain for each feature within the dataset.
 Feature Splitting: If all instances in the dataset do not belong to the same class, the
dataset (S) is partitioned into subsets based on the feature with the highest information
gain.
 Decision Node Creation: A decision tree node is created using the feature that
yielded the maximum information gain.

 Leaf Node Creation: If all instances within a node belong to the same class, the node
is designated as a leaf node labeled with the corresponding class.
 Iteration: Steps 1-4 are repeated for the remaining features until either all features
have been exhausted or the decision tree consists entirely of leaf nodes.
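These steps can be condensed into a small recursive sketch. The implementation below is a bare-bones illustration for nominal features only (no pruning and no handling of missing values), and the symptom data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    n = len(labels)
    subsets = {}
    for row, y in zip(rows, labels):
        subsets.setdefault(row[feature], []).append(y)
    return entropy(labels) - sum((len(s) / n) * entropy(s) for s in subsets.values())

def id3(rows, labels, features):
    # Leaf node: all instances share one class, or no features remain.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # Decision node: split on the feature with the highest information gain.
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    node = {"feature": best, "branches": {}}
    for value in set(row[best] for row in rows):
        subset = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*subset)
        node["branches"][value] = id3(list(sub_rows), list(sub_labels),
                                      [f for f in features if f != best])
    return node

# Hypothetical nominal data: does a patient show COVID-19 symptoms?
rows = [{"fever": "yes", "cough": "yes"}, {"fever": "yes", "cough": "no"},
        {"fever": "no",  "cough": "yes"}, {"fever": "no",  "cough": "no"}]
labels = ["infected", "infected", "not infected", "not infected"]
print(id3(rows, labels, ["fever", "cough"]))
```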

3.2.3 C4.5 Classifier: The C4.5 algorithm, developed by Ross Quinlan (1993) as a successor
to ID3, represents a significant advancement in decision tree learning. Building upon Hunt's
algorithm and implemented serially, C4.5 constructs decision trees using training data similar
to ID3. A key improvement lies in C4.5's pruning capabilities. Pruning involves strategically
replacing internal nodes with leaf nodes to reduce the overall error rate and mitigate
overfitting. Unlike ID3, C4.5 can handle both continuous and categorical attributes during tree
construction. Furthermore, it employs a more sophisticated method for tree pruning, which
helps to reduce misclassification errors stemming from noise or excessive detail within the
training data. While C4.5, like ID3, sorts data at each node to determine the optimal splitting
attribute, it utilizes the gain ratio impurity measure for evaluation (Quinlan, 1993). In essence,
the C4.5 algorithm leverages the concept of information gain or entropy reduction to identify
the best split point. The attribute with the highest normalized information gain is chosen for
decision making at each node. This approach aims to partition the data at each node into
subsets enriched in one class or another, ultimately leading to a more robust and accurate
decision tree. Comparative analyses, such as the work by Badr Hassina et al. (year unknown)
[135], have demonstrated that C4.5 exhibits superior performance compared to ID3,
particularly when dealing with datasets of varying sizes. The following table and chart show their results.

Source: [135]

Enhancements in C4.5

As a successor to ID3, C4.5 builds upon the foundation laid by its predecessor. While the
initial information gain calculation remains similar, C4.5 incorporates key improvements:

 Continuous Data Handling: A significant advancement lies in C4.5's ability to


handle continuous data attributes within the decision tree, expanding its applicability
beyond purely categorical data.
 Missing Value Handling: C4.5 offers a mechanism to address missing values within
the data, allowing for more robust tree construction even in datasets with incomplete
information.
 Attribute Weighting: The algorithm introduces the capability to assign weights to
different attributes, enabling the incorporation of domain knowledge or feature
importance into the decision-making process.
 Post-Pruning: Unlike ID3's reliance on pre-pruning, C4.5 employs a post-pruning
strategy. This involves strategically removing unnecessary branches from the fully
constructed tree to reduce model complexity and mitigate overfitting.
 Pessimistic Error Estimation: C4.5 leverages a pessimistic error estimation approach
during pruning, which helps to prevent overfitting by favoring simpler sub-trees that
might perform slightly less well on the training data but are likely to generalize better
to unseen instances.

These enhancements collectively contribute to the superior performance and broader applicability of the C4.5 algorithm compared to ID3.

3.2.4 J48 Decision Tree Algorithm

J48 is an open-source Java implementation of the C4.5 algorithm embedded within the
WEKA data mining suite. J48 offers the capability to generate either pruned or unpruned C4.5
decision trees. Essentially, it represents an optimized implementation of C4.5 principles. As
with C4.5, J48 is primarily suited for classification tasks (Quinlan, 1993). A key feature of
J48 is its ability to prune branches within the decision tree that do not significantly contribute
to overall predictive accuracy. This pruning process addresses the issue of overfitting that can
arise due to noise or outliers within the training data (Han & Kamber, 2006). J48 employs
statistical measures to strategically remove these less reliable branches, ultimately resulting in
a more robust and generalizable model. Similar to C4.5, J48 constructs decision trees by
leveraging the concept of information entropy during the learning process. It exploits the
inherent information content within each attribute to partition the data into smaller, more
homogenous subsets. J48 evaluates the normalized information gain (or change in entropy)
associated with each potential split point. The attribute that yields the highest normalized
information gain is then chosen for the decision, effectively guiding the tree construction
process.
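For completeness, the sketch below shows one way J48 might be invoked from Python. It assumes the third-party python-weka-wrapper3 package, a local ARFF file named heart.arff, and the standard J48 options -C (pruning confidence factor) and -M (minimum instances per leaf); the wrapper's exact module paths should be verified against its documentation.

```python
# Sketch only: assumes the python-weka-wrapper3 package and a local heart.arff file.
import weka.core.jvm as jvm
from weka.core.converters import Loader
from weka.classifiers import Classifier, Evaluation
from weka.core.classes import Random

jvm.start()
try:
    data = Loader(classname="weka.core.converters.ArffLoader").load_file("heart.arff")
    data.class_is_last()                      # last attribute holds the class label
    # J48 with the default pruning confidence (0.25) and minimum leaf size (2).
    j48 = Classifier(classname="weka.classifiers.trees.J48",
                     options=["-C", "0.25", "-M", "2"])
    evaluation = Evaluation(data)
    evaluation.crossvalidate_model(j48, data, 10, Random(1))   # 10-fold cross-validation
    print(evaluation.percent_correct)
finally:
    jvm.stop()
```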

3.2.5 Pruning Decision Trees:

Pruning represents a critical technique employed within decision tree learning to mitigate the
risk of overfitting. Overfitting occurs when a decision tree becomes overly complex and
adapts too closely to the specificities of the training data, potentially hindering its ability to
generalize effectively to unseen instances. Pruning strategies aim to strategically remove
unnecessary nodes from the decision tree, thereby reducing its overall size and complexity.
The core objective of pruning is to achieve a balance between model accuracy and efficiency.
Ideally, pruning should succeed in reducing the size of the tree without compromising its
predictive accuracy.

Importance of Pruning:

Pruning plays a vital role in decision tree learning by addressing the potential pitfalls of
overfitting and underfitting. Overfitting happens when a decision tree becomes excessively
intricate and conforms too closely to the training data, potentially hindering its
generalizability. Conversely, underfitting occurs when the tree is not complex enough to
capture the underlying patterns within the data. The figure below illustrates these concepts
visually, depicting a training dataset with overfitting and underfitting issues. Pruning tactics
aim to strike a balance between model complexity and accuracy, ideally reducing tree size
without sacrificing its ability to make accurate predictions.

Fig: Graphical representation of underfitting vs. overfitting of data set [136]

Overfitting in Decision Trees:

Overfitting presents a substantial challenge in decision tree learning and various other
machine learning models. It arises when a model prioritizes memorizing the specifics of the
training data to an excessive degree, potentially at the expense of generalizability. An
overfitted decision tree may exhibit a very low overall cost function value on the training
data; however, its ability to accurately predict outcomes for unseen instances deteriorates.
This phenomenon occurs because the model captures idiosyncrasies or noise within the
training data that are not representative of the broader population.

Underfitting in Decision Trees:

Underfitting represents the opposite extreme of overfitting in decision trees. Underfitting occurs when a model fails to capture the underlying patterns and relationships within the
training data sufficiently. This can result in an overly simplistic decision tree with limited
predictive power. An underfitted tree may exhibit high error rates on both the training and
testing data, indicating that it has not learned effectively from the training examples. In
essence, the model lacks the necessary complexity to model the true relationships within the
data.

Pruning mitigates the risk of overfitting: by removing unnecessary nodes, it reduces the model's susceptibility to overfitting and thereby fosters improved generalization.

Pruning decision trees can be achieved through two primary approaches: structured pruning
and unstructured pruning.

 Structured Pruning: This strategy focuses on removing entire subtrees from the
decision tree. Structured pruning aims to maintain the overall architecture of the tree
while reducing its complexity. Subtrees that contribute minimally to the predictive
accuracy or introduce noise into the model are targeted for removal. This approach is
particularly well-suited for decision trees with a well-defined structure.
 Unstructured Pruning: In contrast, unstructured pruning involves eliminating
individual nodes from the tree without necessarily adhering to a specific structure.
This method offers more flexibility but can potentially alter the overall architecture of
the tree to a greater extent. Unstructured pruning is often applied to decision trees
where the internal structure is less rigid, allowing for more targeted removal of
individual nodes that may be hindering performance.

The choice between structured and unstructured pruning depends on the specific
characteristics and complexity of the decision tree being optimized.
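Scikit-learn's CART-style trees expose post-pruning through cost-complexity pruning, which illustrates the general idea (though not the specific pessimistic-error strategy of C4.5/J48). The dataset in the sketch below is a bundled placeholder, not the study's data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)            # placeholder dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate pruning strengths from the cost-complexity pruning path.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

for alpha in path.ccp_alphas[::10]:                    # sample a few pruning strengths
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={tree.get_n_leaves()}  "
          f"test accuracy={tree.score(X_test, y_test):.3f}")
```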

3.2.6 NAIVE BAYES

The Naive Bayes classifier is a probabilistic machine learning algorithm widely employed for
classification tasks, particularly in domains involving high-dimensional training data, such as
text classification. This algorithm's efficiency and simplicity make it a popular choice for
building fast and effective classification models. The "naive" aspect of the name stems from
its core assumption: that features are conditionally independent given the class label. In
simpler terms, the presence or absence of one feature is independent of the presence or
absence of another feature, considering the class membership. This assumption simplifies the
computational complexity of the algorithm.

Key Components and Steps:

1. Bayes' Theorem Foundation: Naive Bayes is built upon the principles of Bayes'
theorem, which allows for calculating the probability of an event occurring given prior
knowledge of related conditions. Mathematically, this is represented by the formula:

where:

 P(A | B) is the posterior probability of class A given features B


 P(B | A) is the likelihood probability of features B given class A
 P(A) is the prior probability of class A
 P(B) is the total probability of features B (often ignored as it acts as a constant for a
given instance)

2. Data Preparation: The algorithm requires a labeled training dataset, where each data
point comprises a set of features and their corresponding class labels. Features can be
categorical or numerical, and class labels represent discrete categories.
3. Parameter Estimation: During training, Naive Bayes estimates the prior probabilities
(P(A)) for each class A by calculating the frequency of each class within the training
data. Additionally, it estimates the likelihood probabilities (P(B_i | A)) for each
feature B_i given each class A.
4. Classification: For classifying a new data instance, the algorithm utilizes Bayes'
theorem to compute the posterior probability (P(A | B)) for each potential class A. The
class with the highest posterior probability is then assigned as the predicted class for
the given features B. Mathematically, this classification step can be expressed as:

predicted_class = argmax(P(A) * Π P(B_i | A))

where:

 P(A) is the prior probability of class A.


 P(B_i | A) is the likelihood probability of feature B_i given class A.
 Π symbol represents the product operator (multiplying probabilities across all features
B_i)

5. Laplace Smoothing (Optional): To address potential issues arising from zero probabilities (when a feature value in the test data is not present in the training data),
Laplace smoothing (add-one smoothing) can be incorporated. This technique adds a
small value (typically 1) to the count of each feature value, ensuring that no
probability becomes zero.
6. Performance Evaluation: Once the model is trained and used for predictions on the
test data, its performance is evaluated using various metrics such as accuracy,
precision, recall, F1-score, and others.
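The estimation, smoothing, and argmax steps above can be written out compactly from scratch, as sketched below with hypothetical categorical features; in practice, scikit-learn's CategoricalNB or MultinomialNB with alpha=1 applies the same Laplace smoothing.

```python
from collections import Counter, defaultdict

# Hypothetical categorical training data: (features, class label) pairs.
train = [({"chest_pain": "yes", "smoker": "yes"}, "disease"),
         ({"chest_pain": "yes", "smoker": "no"},  "disease"),
         ({"chest_pain": "no",  "smoker": "no"},  "healthy"),
         ({"chest_pain": "no",  "smoker": "yes"}, "healthy")]

# Parameter estimation: class frequencies and per-class feature-value counts.
class_counts = Counter(label for _, label in train)
feature_counts = defaultdict(Counter)            # (class, feature name) -> value counts
for features, label in train:
    for name, value in features.items():
        feature_counts[(label, name)][value] += 1

def likelihood(label, name, value, alpha=1):
    """P(B_i | A) with Laplace (add-one) smoothing so unseen values never give zero."""
    counts = feature_counts[(label, name)]
    num_values = len(counts) + 1                 # rough estimate of the value cardinality
    return (counts[value] + alpha) / (class_counts[label] + alpha * num_values)

def predict(features):
    """argmax over classes of P(A) * product of P(B_i | A)."""
    scores = {}
    for label, count in class_counts.items():
        score = count / len(train)               # prior P(A)
        for name, value in features.items():
            score *= likelihood(label, name, value)
        scores[label] = score
    return max(scores, key=scores.get)

print(predict({"chest_pain": "yes", "smoker": "no"}))   # expected: "disease"
```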

3.2.7 K-NEAREST NEIGHBORS

The K-Nearest Neighbors (KNN) algorithm is a widely used non-parametric method for
supervised learning, applicable to both classification and regression problems. KNN
operates on the principle of locality, assuming that data points with similar characteristics
tend to share similar class labels or target values. During the training phase, KNN
memorizes the entire training dataset as a reference. When presented with a new, unseen
data point, the algorithm calculates the distances between the input point and all instances
in the training data using a chosen distance metric (e.g., Euclidean distance). KNN then
identifies the K closest neighbors (nearest data points) to the input point based on the
calculated distances. In classification tasks, the most frequent class label amongst these K
neighbors is assigned as the predicted label for the new data point. For regression tasks,
KNN predicts the target value of the new data point by calculating either the average or a
weighted average of the target values from its K nearest neighbors. The KNN algorithm's
simplicity and ease of implementation contribute to its popularity across various domains.
However, its performance is sensitive to the selection of the K parameter (number of
neighbors) and the chosen distance metric. Careful parameter tuning is crucial for
achieving optimal results with KNN.

KNN Algorithm Steps and K Parameter Selection:

The KNN algorithm follows a straightforward process for both classification and regression
tasks. Here's a breakdown of the key steps:

 K Selection: A crucial initial step involves selecting the appropriate value for K, the
number of nearest neighbors to consider during prediction. This choice significantly
impacts the model's performance.
 Distance Calculation: For each data point in the training set, a distance metric
(commonly Euclidean distance) is used to compute the distance between that point and
the new, unseen data point for which a prediction is required.
 Nearest Neighbor Identification: Based on the calculated distances, the algorithm
identifies the K nearest neighbors to the new data point.
 Prediction: In classification tasks, the most frequent class label among these K
neighbors is assigned as the predicted class for the new data point. For regression
tasks, the predicted target value is typically determined by calculating the average or a
weighted average of the target values from the K nearest neighbors.

Impact of K Value:

Selecting an appropriate K value is essential for achieving optimal KNN performance. Choosing a small K value can lead to overfitting, where the model becomes overly sensitive
to noise in the training data and performs poorly on unseen instances. In this case, the limited
number of neighbors (small K) results in a high error rate for new data points, as there are
fewer votes to determine the prediction. Conversely, a large K value can cause underfitting,
where the decision boundaries become too relaxed, potentially increasing the number of
misclassifications. In this scenario, the model lacks sufficient granularity due to the influence
of a larger and potentially more diverse set of neighbors. Careful parameter tuning,
particularly with respect to K, is necessary to achieve a balance between these two extremes
and optimize KNN performance.
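A common way to perform this tuning is to score several candidate K values with cross-validation, as in the minimal sketch below; the bundled dataset and the candidate values of K are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)        # placeholder dataset

# Score a range of K values with 5-fold cross-validation; feature scaling matters
# because KNN relies on Euclidean distances between features.
for k in (1, 3, 5, 11, 21, 51):
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(knn, X, y, cv=5)
    print(f"K={k:2d}  mean CV accuracy={scores.mean():.3f}")
```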

Advantages of KNN

The KNN algorithm offers several advantages that contribute to its popularity across various
machine learning applications:

 Simplicity and Ease of Implementation: KNN is known for its straightforward


design and implementation. The core concept of identifying similar neighbors and
leveraging their characteristics for prediction is readily understandable. This ease of
use makes KNN an accessible choice for beginners and experienced practitioners
alike.
 Non-parametric Approach: KNN does not require making assumptions about the
underlying data distribution, unlike some machine learning models. This flexibility
allows it to work effectively with data that may not conform to specific statistical
distributions.
 Adaptability to New Data: Since KNN memorizes the entire training dataset, it can
inherently adapt to new data points as they become available. These new data points
are incorporated into the neighbor search process, potentially influencing future
predictions. This characteristic allows the model to stay relatively up-to-date with
evolving data patterns.
 Limited Hyperparameter Tuning: KNN requires tuning only a small number of
hyperparameters compared to some other algorithms. The primary parameters to
consider are the number of neighbors (K) and the chosen distance metric. This
characteristic can be advantageous, especially in scenarios where extensive
hyperparameter tuning might be computationally expensive or time-consuming.

3.2.8 SUPPORT VECTOR MACHINES (SVM)

Support Vector Machines (SVMs) represent a powerful machine learning algorithm for
classification tasks. Originally introduced in the 1960s and further refined in the 1990s, SVMs
have gained widespread adoption due to their effectiveness, versatility, and computational
efficiency. The core principle behind SVMs lies in identifying an optimal hyperplane within a
high-dimensional space (where the number of dimensions corresponds to the number of
features) that effectively separates the data points belonging to different classes. SVM excels
in various applications, including text classification, image recognition tasks (handwriting and
face recognition), and gene identification, among others.

An SVM model can be visualized as a representation of distinct classes separated by a hyperplane within a multidimensional space. The algorithm iteratively constructs this
hyperplane with the objective of minimizing the overall classification error. This optimization
process ensures that the data points closest to the hyperplane, termed support vectors, play a
crucial role in defining the model's decision boundary. The following figure shows how
different hyperplanes separate two different classes of items.

Source: https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47 (accessed 22-07-2023, 9:15 pm)

Key Concepts in Support Vector Machines (SVM)

SVMs aim to achieve classification by identifying a hyperplane within the feature space that
maximizes the margin between the data points belonging to different classes. Here are some
central concepts underlying SVMs:

 Support Vectors: These are data points that lie closest to the separating hyperplane.
They play a critical role in defining the model's decision boundary, as the SVM is
constructed such that it maximizes the margin while ensuring these closest points are
correctly classified.
 Hyperplane: This represents a decision boundary within the high-dimensional feature
space that effectively separates the data points of distinct classes. The SVM algorithm
iteratively refines the hyperplane to achieve the optimal separation between classes.
 Margin: The margin refers to the distance between the hyperplane and the closest
support vectors from both classes. A larger margin translates to a more robust
separation between classes, which is desirable for good generalization performance.
Conversely, a small margin indicates that the classes are not well-separated in the
feature space, potentially leading to overfitting issues. The SVM algorithm prioritizes
maximizing this margin during the hyperplane construction process.

SVM algorithms typically leverage the concept of margins to achieve optimal class
separation. The margin refers to the distance between the decision hyperplane and the closest
data points (support vectors) from each class. Maximizing this margin is a core principle in
SVM, as it generally leads to better generalization performance on unseen data. There are two
primary categories of margins in SVMs:

 Hard Margin: A hard margin SVM aims to create a perfect separation between the
classes, ideally with no data points lying between the hyperplane and the margins.
This rigid approach strictly enforces the maximum margin criterion on all training data
points. While effective for linearly separable datasets, it may not be realistic for most
real-world data containing noise or inherent class overlap. Enforcing a hard margin in
such scenarios can lead to overfitting issues.
 Soft Margin: A soft margin SVM acknowledges the potential for some data points to
violate the perfect separation criterion. It incorporates a cost function that penalizes
misclassified points while still aiming to maximize the margin. This approach allows
for some flexibility in handling noisy data or datasets with inherent class overlap. The
degree of flexibility is controlled by a cost parameter, which determines the trade-off
between maximizing the margin and allowing for misclassifications.
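The sketch below illustrates this soft-margin trade-off: in scikit-learn's SVC, the cost parameter C controls how strongly margin violations are penalized. The placeholder dataset and the candidate C values are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)        # placeholder dataset

# The cost parameter C controls the soft-margin trade-off:
# small C -> wider margin with more tolerated violations; large C -> harder margin.
for C in (0.01, 0.1, 1, 10, 100):
    svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=C))
    scores = cross_val_score(svm, X, y, cv=5)
    print(f"C={C:>6}  mean CV accuracy={scores.mean():.3f}")
```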

3.3. Hardware and Software Environment Used

3.4. Summary

This chapter outlines the methodological approaches used in the research to predict heart
disease using data mining classification techniques.

3.1 Introduction

 This section provides a brief overview of the chapter's objective: exploring data
mining classification techniques for heart disease prediction.

 It acknowledges the existence of various disease prediction techniques and emphasizes


the focus on comparing a subset of them across different datasets.

3.2 Data Mining Classification Techniques

 This section introduces the concept of machine learning classification and highlights
decision trees as a prominent and interpretable algorithm.
 It defines classification as a two-stage process: learning and prediction.
 It explains how decision trees work by segmenting data into sub-regions based on
decision rules derived from features, ultimately leading to class labels.

3.2.1 Decision Trees

 This subsection delves deeper into decision trees, highlighting their strengths as a
supervised learning technique.
 It describes a decision tree as a flowchart-like structure with nodes representing tests
on attributes, branches representing test outcomes, and leaf nodes representing class
labels.
 It explains the core objective of decision trees: building models during training to
predict class labels for new data.
 The process of making predictions using a decision tree is described, starting from the
root node and following branches based on attribute values until a leaf node (class
label) is reached.
 Common components of decision trees are defined, including root node, splitting,
decision node, leaf/terminal node, pruning, branch/sub-tree, parent and child node.

3.2.1.1 Components of Decision Trees

 This sub-subsection provides detailed definitions for each component of a decision


tree, offering a clear understanding of its structure.

3.2.2 - 3.2.8: Specific Classification Algorithms

 This section covers various classification algorithms beyond decision trees, including:
o ID3 (Iterative Dichotomiser 3)
o C4.5 Algorithm (successor to ID3)
o J48 Decision Tree Algorithm (open-source Java implementation of C4.5)
o Pruning Decision Trees (techniques to mitigate overfitting)
o Naive Bayes
o K-Nearest Neighbors (KNN)
o Support Vector Machines (SVM)
 For each algorithm, the subsection explains key concepts, functionalities, and
advantages.
 Mathematical formulas are included for some algorithms (e.g., Information Gain for
ID3) to provide a deeper understanding.

Overall, Chapter 3 provides a comprehensive overview of the research methodology, focusing on data mining classification techniques for heart disease prediction. It explains
core concepts and functionalities of various algorithms, laying the foundation for the
analysis and results presented in subsequent chapters.
