Unit 3 PDF
Classification algorithms: Logistic Regression, Decision Tree Classification, Neural Network, K-Nearest Neighbors (K-NN), Support Vector Machine, Naive Bayes (Gaussian, Multinomial, Bernoulli). Performance Measures: Confusion Matrix, Classification Accuracy, Classification Report: Precision, Recall, F1 Score and Support.
Decision Tree
Features of Decision Tree
1. Tree-Like Structure: Decision Trees have a flowchart-like structure, where each internal node represents a
"test" on an attribute, each branch represents the outcome of the test, and each leaf node represents a class
label (the decision taken after computing all attributes). The paths from root to leaf represent classification rules.
2. Simple to Understand and Interpret: One of the main advantages of Decision Trees is their simplicity and
ease of interpretation. They can be visualized, which makes it easy to understand how decisions are made
and explain the reasoning behind predictions.
3. Versatility: Decision Trees can handle both numerical and categorical data and can be used for both
regression and classification tasks, making them versatile across different types of data and problems.
4. Feature Importance: Decision Trees inherently perform feature selection, giving insights into the most
significant variables for making predictions. The top nodes in a tree are the most important features,
providing a straightforward way to identify critical variables (see the sketch after this list).
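To make these points concrete, here is a minimal sketch using scikit-learn (an assumed library choice; the notes do not name one). It fits a small decision tree on the built-in Iris dataset, prints the learned root-to-leaf classification rules, and lists the feature importances mentioned in point 4.

```python
# A minimal decision-tree sketch using scikit-learn (assumed; not specified in the notes).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
feature_names = load_iris().feature_names
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each internal node tests one attribute; each leaf assigns a class label.
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

# The root-to-leaf paths are the classification rules described above.
print(export_text(clf, feature_names=feature_names))

# Built-in feature importance: higher values mark more significant variables.
for name, importance in zip(feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")

print("Test accuracy:", clf.score(X_test, y_test))
```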
3. Random Forest
Random Forest is an ensemble learning technique that combines multiple decision trees to improve predictive
accuracy and control over-fitting. By aggregating the predictions of numerous trees, Random Forests enhance
the decision-making process, making them robust against noise and bias.
Random Forest uses numerous decision trees to increase prediction accuracy and reduce overfitting. It constructs
many trees and integrates their predictions to create a reliable model. Diversity is added by training each tree on a
random sample of the data and a random subset of the features. Random Forests excel at handling high-dimensional
data, provide feature importance metrics, and resist overfitting. Many fields use them for classification and regression.
Features of Random Forest
1. Ensemble Method: Random Forest uses the ensemble learning technique, where multiple learners (decision
trees, in this case) are trained to solve the same problem and combined to get better results. The ensemble
approach improves the model's accuracy and robustness.
2. Handling Both Types of Data: It can handle both categorical and continuous input and output variables,
making it versatile for different types of data.
3. Reduction in Overfitting: By averaging multiple trees, Random Forest reduces the risk of overfitting, making
the model more generalizable than a single decision tree (illustrated in the sketch after this list).
4. Handling Missing Values: Random Forest can handle missing values. When it encounters a missing value in a
variable, it can substitute the median (for numerical variables) or the mode (for categorical variables) of all
samples reaching the node where the missing value is encountered.
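The ensemble idea can be seen in a short sketch, again assuming scikit-learn: many randomized trees are trained and their votes aggregated, which typically generalizes better than a single tree.

```python
# A Random Forest sketch using scikit-learn (assumed library choice).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Single tree: prone to overfitting the training sample.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Ensemble of 100 trees, each grown on a bootstrap sample with random
# feature subsets at every split; predictions are aggregated by majority vote.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("Single tree accuracy: ", tree.score(X_test, y_test))
print("Random Forest accuracy:", forest.score(X_test, y_test))
print("Feature importances:   ", forest.feature_importances_)
```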
4. Support Vector Machine (SVM)
SVM is an effective classification and regression algorithm. It seeks the hyperplane that best separates the classes
while maximizing the margin. SVM works well in high-dimensional spaces and handles nonlinear feature interactions
with its kernel technique. It is a powerful classification algorithm known for its accuracy in high-dimensional spaces.
SVM is robust against overfitting and generalizes well to different datasets. It finds applications in image recognition,
text classification, and bioinformatics, among other fields where precision is paramount.
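A minimal SVM sketch under the same scikit-learn assumption follows. The RBF kernel illustrates the kernel technique mentioned above: it lets the classifier fit a nonlinear boundary while still solving a maximum-margin problem.

```python
# An SVM sketch using scikit-learn (assumed library choice).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling matters for margin-based methods; the RBF kernel handles
# nonlinear feature interactions without explicit feature engineering.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```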
5. K-Nearest Neighbors (KNN)
Features of K-Nearest Neighbors (KNN)
1. Instance-Based Learning: KNN is a type of instance-based or lazy learning algorithm, meaning it does not
explicitly learn a model. Instead, it memorizes the training dataset and uses it to make predictions.
2. Simplicity: One of the main advantages of KNN is its simplicity. The algorithm is straightforward to
understand and easy to implement, requiring no training phase in the traditional sense.
3. Non-Parametric: KNN is a non-parametric method, meaning it makes no underlying assumptions about the
distribution of the data. This flexibility allows it to be used in a wide variety of situations, including those
where the data distribution is unknown or non-standard.
4. Flexibility in Distance Choice: The algorithm's performance can be significantly influenced by the choice of
distance metric (e.g., Euclidean, Manhattan, Minkowski). This flexibility allows for customization based on
the specific characteristics of the data (see the sketch after this list).
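These points show up directly in a short scikit-learn sketch (library assumed). Note that there is no real training step: fit simply stores the data, and the distance metric is a tunable choice.

```python
# A KNN sketch using scikit-learn (assumed library choice).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Lazy" learning: fit() just memorizes the training set.
# The distance metric (euclidean, manhattan, minkowski, ...) is a free choice.
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)

# Each prediction finds the 5 nearest stored points and takes a majority vote.
print("Test accuracy:", knn.score(X_test, y_test))
```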
Confusion Matrix
A confusion matrix is a matrix that summarizes the performance of a machine learning model on a set of test data. It
is a means of displaying the number of accurate and inaccurate instances based on the model's predictions. It is often
used to measure the performance of classification models, which aim to predict a categorical label for each input
instance.
The matrix displays the number of instances produced by the model on the test data.
True Positive (TP): The model correctly predicted a positive outcome (the actual outcome was positive).
True Negative (TN): The model correctly predicted a negative outcome (the actual outcome was negative).
False Positive (FP): The model incorrectly predicted a positive outcome (the actual outcome was negative).
Also known as a Type I error.
False Negative (FN): The model incorrectly predicted a negative outcome (the actual outcome was positive).
Also known as a Type II error.
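These four counts can be read directly off a fitted binary classifier; a minimal sketch, again assuming scikit-learn (the classifier and dataset here are illustrative choices):

```python
# Building a confusion matrix with scikit-learn (assumed library choice).
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# For binary labels, ravel() yields the four counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"TP={tp}  TN={tn}  FP={fp} (Type I)  FN={fn} (Type II)")
```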
Why do we need a Confusion Matrix?
When assessing a classification model's performance, a confusion matrix is essential. It offers a thorough analysis of
true positive, true negative, false positive, and false negative predictions, facilitating a more profound
comprehension of a model's recall, accuracy, precision, and overall effectiveness in class distinction. When there is
an uneven class distribution in a dataset, this matrix is especially helpful in evaluating a model's performance beyond
basic accuracy metrics.
Metrics based on Confusion Matrix Data
1. Accuracy
Accuracy is used to measure the overall performance of the model. It is the ratio of correctly classified instances to
the total number of instances:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
2. Precision
Precision is a measure of how accurate a model's positive predictions are. It is defined as the ratio of true positive
predictions to the total number of positive predictions made by the model:
Precision = TP / (TP + FP)
3. Recall
Recall measures the effectiveness of a classification model in identifying all relevant instances from a dataset. It is the
ratio of the number of true positive (TP) instances to the sum of true positive and false negative (FN) instances:
Recall = TP / (TP + FN)
4. F1-Score
F1-score is used to evaluate the overall performance of a classification model. It is the harmonic mean of precision
and recall:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
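Precision, recall, F1-score, and support for each class can be obtained in one call; a sketch assuming scikit-learn, reusing a fitted classifier like the one above:

```python
# Classification report: precision, recall, F1-score and support per class.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# "support" is the number of true instances of each class in y_test.
print(classification_report(y_test, y_pred, target_names=["malignant", "benign"]))
```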
5. Specificity
Specificity is another important metric in the evaluation of classification models, particularly in binary classification. It
measures the ability of a model to correctly identify negative instances. Specificity is also known as the True Negative
Rate. The formula is given by:
Specificity = TN / (TN + FP)
For example, with TN = 3 and FP = 1:
Specificity = 3 / (3 + 1) = 3/4 = 0.75
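The formulas above can be verified directly from the four counts. In this small sketch, TN = 3 and FP = 1 come from the worked example; TP = 4 and FN = 2 are hypothetical values added purely for illustration.

```python
# Recomputing the metrics from raw counts (TN=3, FP=1 from the worked
# example above; TP=4, FN=2 are hypothetical values added for illustration).
tp, tn, fp, fn = 4, 3, 1, 2

accuracy    = (tp + tn) / (tp + tn + fp + fn)               # 7/10 = 0.70
precision   = tp / (tp + fp)                                # 4/5  = 0.80
recall      = tp / (tp + fn)                                # 4/6  ≈ 0.67
f1          = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)                                # 3/4  = 0.75, as above

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f} specificity={specificity:.2f}")
```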
1. Type 1 Error
A Type 1 error occurs when the model incorrectly predicts a positive instance (a false positive); precision is directly
affected by false positives.
For example, in a courtroom scenario, a Type 1 Error, often referred to as a false positive, occurs when the court
mistakenly convicts an individual as guilty when, in truth, they are innocent of the alleged crime. This grave error can
have profound consequences, leading to the wrongful punishment of an innocent person who did not commit the
offense in question. Preventing Type 1 Errors in legal proceedings is paramount to ensuring that justice is accurately
served and innocent individuals are protected from unwarranted harm and punishment.
2. Type 2 Error
A Type 2 error occurs when the model fails to predict a positive instance. Recall is directly affected by false negatives,
as it is the ratio of true positives to the sum of true positives and false negatives.
In the context of medical testing, a Type 2 Error, often known as a false negative, occurs when a diagnostic test fails
to detect the presence of a disease in a patient who genuinely has it. The consequences of such an error are
significant, as it may result in a delayed diagnosis and subsequent treatment.
Precision emphasizes minimizing false positives, while recall focuses on minimizing false negatives.