CC50 Toxicity Classification Using Radial Basis Function (RBF) Neural Network
Table of contents

I. Introduction
   Introduction
      Brief Introduction
   Libraries
   Data
      How to load CSV files with 4 methods
      Rename
      Miscellaneous
   Data Wrangling
      Normalization and Coercion
      Data Splitting
   Other models
Part I.
INTRODUCTION
Introduction
Brief Introduction
This documentation is one of my blog posts (see this page), providing a reproducible workflow from my research on predicting CC50, the Median Cytotoxic Concentration, using Radial Basis Function (RBF) Neural Networks. It is a classification problem in bioinformatics that uses machine learning, in particular neural networks, to categorize CC50 toxicity levels: the outcome is categorical, not quantitative. And as you can see in the title, I am going to evaluate how well the Radial Basis Function (RBF) Neural Network, or just the RBF Network, performs on the CC50 dataset. Later, I will discuss the concepts and usages of RBF Networks.
The dataset has a total of 9,196 observations in the training set and 3,070 observations in the test set. That is basically a 75:25 split: 75% of the observations belong to the train set and 25% to the test set. I have to remind you that this data is confidential, so I am not able to share or distribute the dataset, but I can show you what's going on with it.
One of the mainstays in this document is the use of box::use. What is it? Check box's documentation for more details. What it does, essentially, is give R its own "modular" system, a system that doesn't exist in R natively, which is unfortunate: without it, I can't use an R script or a folder as a module. A modular system is the way to organize your code, especially when your codebase is getting larger. If R ever gets this feature natively, organizing larger projects will be much easier.
Part II.
Chapter 1: Libraries to be used
Libraries
I use plenty of libraries to work with this machine learning problem. However, only 3 libraries are loaded, and those 3, found below, are the essentials for the analysis. The rest are dependencies called with package::.
library(torch)
library(tidyverse)
library(data.table)
1. torch - although I don't strictly need to attach this, I still load the entire namespace in order to walk you through using torch for classifying toxicity with neural networks. It comes in handy for converting the dataset into torch tensors and for calling the neural network optimizers.
2. tidyverse - I use this package, paired with dtplyr, for data manipulation (see normalization) and Exploratory Data Analysis (see EDA).
3. dtplyr - I pair this with dplyr for data manipulation. This package uses lazy evaluation, a technique that delays the evaluation of the code you write. It comes in handy if you want speed and memory efficiency with dplyr syntax: through R's non-standard evaluation, the dplyr code you write is translated into data.table syntax.
I wouldn't really call it "lazy evaluation", since the data does get evaluated eventually, anyway.
4. box - With the use function, you can treat a folder or an R script as if it were a module. This is similar to Python's modular system, where you can access code from another folder or a Python script.
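For instance, here is a minimal sketch, assuming a hypothetical script module_r/helpers.R whose exported function is marked with box's #' @export comment:

# contents of the hypothetical module_r/helpers.R:
#' @export
normalize <- function(x) (x - mean(x)) / sd(x)

# elsewhere, load the script as a module and call its export:
box::use(module_r/helpers[normalize])
normalize(1:10)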
Part III.
Data
CC50 Data Description:

DDµ1(ATO)bt: Perturbation term that describes the difference of the spectral moment of order 1 (weighted by the atomic weight) between the NPs used in the new and reference states, also depending on the biological systems.

DDµ3(POL)ns: Perturbation term that characterizes the change of the spectral moment of order 3 (weighted by the polarizability) between the NPs used in the new and reference states, also depending on the shapes of the NPs.

DDE(dm): Perturbation term that describes the variation of the electronegativity between the NPs used in the new and reference states, also depending on the conditions under which the sizes of the NPs were measured.

DDµ3(VAN)ta: Perturbation term that accounts for the difference of the spectral moment of order 3 (weighted by the atomic van der Waals radius) between the NPs used in the new and reference states, also depending on the exposure times.

DDµ2(ATO)ta: Perturbation term that characterizes the change of the spectral moment of order 2 (weighted by the atomic weight) between the NPs used in the new and reference states, also depending on the exposure times.

DGµ2(HYD)sc: Perturbation general spectral moment of order 2 weighted by the hydrophobicity, which accounts for the difference between the chemical structures of the coating agents used in the new and reference states.

DGµ5(PSA)sc: Perturbation general spectral moment of order 5 weighted by the polar surface area, which characterizes the difference between the chemical structures of the coating agents used in the new and reference states.

Series: Column used to index which rows are for Training or Prediction, used for data splitting in the dataset.

TEi(cj)_rf: Dummy classification variable describing the toxic effect of the NP used in the reference state.

How to load CSV files with 4 methods
I ran 4 trials to see which library is the fastest at reading CSV files.
First Trial: reading the data with duckdb::read_csv_duckdb. I tried DuckDB first because I expected it to be faster than the other libraries I know for reading CSV files. Also, don't forget to encode your CSV file in UTF-8; otherwise, duckdb::read_csv_duckdb can't read it.
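The original chunk is not reproduced in this skeleton, but a minimal sketch of the DuckDB route looks like this (the file name is a placeholder, and I use duckdb::duckdb_read_csv(), the CSV reader I am sure of; the exact helper the post uses may differ):

duckdb_con <- DBI::dbConnect(duckdb::duckdb())
# register the CSV contents as a table named "cc50" on the connection
duckdb::duckdb_read_csv(duckdb_con, name = "cc50", files = "cc50.csv")
cc50 <- DBI::dbReadTable(duckdb_con, "cc50")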
Fact: both readr and data.table are not strict about encoding, unlike DuckDB, where you need to set the CSV encoding to UTF-8 so that DuckDB can read your CSV file.
• DuckDB: 55.4ms
• readr::read_csv(): 79.5ms
• data.table::fread(): 5.45ms
• polars::pl$read_csv(): 6.2ms
And thus, I will choose data.table in this case because of its speed; not gonna lie, it is not just the fastest, it is also more readable and a no-brainer. Like readr and polars, it reads the data with just 1 line, and it did so in about 5 milliseconds.
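As a sketch (the file name is a placeholder):

cc50 <- data.table::fread("cc50.csv")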
Rename
I want to rename the variables used for the analysis, including all the features and the label. This is the description of the renaming:
Table 1.2.: CC50 Data Description: Before and After column names
Name Rename
DDV(me) x1
DDL(me) x2
DDµ1(ATO)bt x3
DDµ3(POL)ns x4
DDE(dm) x5
DDµ3(VAN)ta x6
DDµ2(ATO)ta x7
DGµ2(HYD)sc x8
DGµ5(PSA)sc x9
Series Series
TEi(cj)_rf y
I ran trials again to see which of the 3 libraries is the fastest at renaming the columns used for the analysis. Why only 3 libraries? Because polars in R is quite difficult to use when renaming columns.
SELECT
"DDV.me." AS x1,
"DDL.me." AS x2,
"DDµ1.ATO.bt" AS x3,
"DDµ3.POL.ns" AS x4,
"DDE.dm." AS x5,
"DDµ3.VAN.ta" AS x6,
"DDµ2.ATO.ta" AS x7,
"DGµ2.HYD.sc" AS x8,
"DGµ5.PSA.sc" AS x9,
"Series",
"TEi.cj._rf" AS y
FROM cc50
box::use(module_r/retrieve_df[...])
cc50 <- sql_res_query(conn = duckdb_con, "rename.sql")
cc50 |>
rename_with(
~ c(paste0("x", 1:9), "y"),
c(1:9, 11)
)
Benchmark:
• DuckDB: 18.7 ms
• dplyr: 824 µs
• data.table: 57.6 µs
I already showed you the dplyr code for renaming the variables, so readability is not an issue there. My issue is the speed, and thus I will choose the data.table syntax:
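The chunk itself is not in this skeleton; a minimal sketch with data.table::setnames(), which renames by reference (the old names taken from the SQL above), would be:

data.table::setnames(
  cc50,
  old = c("DDV.me.", "DDL.me.", "DDµ1.ATO.bt", "DDµ3.POL.ns", "DDE.dm.",
          "DDµ3.VAN.ta", "DDµ2.ATO.ta", "DGµ2.HYD.sc", "DGµ5.PSA.sc", "TEi.cj._rf"),
  new = c(paste0("x", 1:9), "y")
)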
And you don't have to assign the result to another variable, because the change happens by reference. Hence, once you run that data.table call (like the one above), the change is already made, and we're done renaming the columns.
Miscellaneous
I know this won't be of much use in this analysis, but I want to show you how to store the data through a SQL connection. DBI::dbWriteTable is the function for this:
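A minimal sketch, assuming the duckdb_con connection from earlier:

DBI::dbWriteTable(duckdb_con, "cc50", cc50, overwrite = TRUE)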
Part IV.
Data Wrangling and Exploratory Data Analysis
As far as I know, in Data Science, the basics of data analysis have 3 parts: Data Wrangling, Exploratory Data Analysis (EDA), and Feature Engineering. What I have done here is the EDA first, and then the Data Wrangling part. The Feature Engineering will be done after training the ML model.
Features
Since I can't share the data with you, I will explore it for you so that you'll at least understand what's going on with it.
Before starting the analysis, make sure to check the quality of your data. Start with an inspection of missing values.
cc50 |>
summarise(
across(
everything(),
~ sum(is.na(.x))
)
)
x1 x2 x3 x4 x5 x6 x7 x8 x9 Series y
1 0 0 0 0 0 0 0 0 0 0 0
Great! The data has no missing values, so there is nothing for drop_na to drop.
Here you can get a glimpse of the data through descriptive statistics. The data has 2 parts used for ML classification: Training and Prediction.
Again, I can't share the data with you, but this is the least I can do: summary statistics in an HTML table. The summary statistics of the data:
This PDF is only a skeleton. Please read the online HTML version
This is not a problem in the HTML version: click a number to treat it as a column number, or search for a variable (e.g. "x1", "x2") to filter out that specific column.
I chose the boxplot (layered with violin and sina plots) to visualize the features of the dataset. Here is the plot:
cc50 |>
mutate(
Series = factor(
Series,
levels = c("Training", "Prediction")
)
) |>
gather(key = "variable", value = "value", -Series, -y) |>
ggplot(aes(x = variable, y = value, fill = Series)) +
facet_wrap(~ variable, scales = "free") +
geom_violin(
aes(color = Series),
alpha = 0.7, # With transparency: 0.07; No transparency: 0.7
position = position_dodge(width = 0.9),
width = 0.8
) +
geom_boxplot(
aes(color = Series),
outlier.size = 2,
width = 0.4,
position = position_dodge(width = 0.9)
) +
ggforce::geom_sina(
alpha = 0.08,
aes(color = Series),
position = position_dodge(width = 0.9)
) +
scale_fill_manual(values = c("Training" = "#6FDCE3", "Prediction" = "#D5ED9F")) +
scale_color_manual(values = c("Training" = "#6FDCE3", "Prediction" = "#D5ED9F")) +
theme_minimal() +
labs(
title = "Features' Distribution",
x = "Variables", y = "Sizes (in �m)"
) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(
color = "#0099f8", size = 12, face = "bold", hjust = 0.5
),
axis.title.x = element_text(
color = "blue", size = 9, face = "bold"
),
axis.title.y = element_text(size = 9, face = "italic"),
plot.caption = element_text(face = "italic"),
text = element_text(family = "Times New Roman")
)
Based on that plot, for every feature except x5 and x6, too many observations fall outside the range of the whiskers: we found a lot of noise in the data.
Label
For the dependent variable y, I will summarize it for you through frequency statistics. I know the recoding will be done later, but I use it here so that you know the counts of each label in the data.
By Labels:
cc50 |>
  mutate(
    y = case_when(
      y == -1 ~ "Nontoxic",
      y == 1 ~ "Toxic"
    )
  ) |>
  count(y)
y n
<char> <int>
1: Nontoxic 4469
2: Toxic 7797
cc50 |>
mutate(
y = case_when(
y == -1 ~ "Nontoxic",
y == 1 ~ "Toxic"
)
) |>
group_by(Series) |>
count(y)
# A tibble: 4 x 3
# Groups: Series [2]
Series y n
<chr> <chr> <int>
1 Prediction Nontoxic 1124
2 Prediction Toxic 1946
3 Training Nontoxic 3345
4 Training Toxic 5851
matrix(
c(5851, 3345,
1946, 1124),
nrow = 2,
byrow = T,
dimnames = list(
c("Training", "Prediction"),
c("Toxic", "Nontoxic")
)
) |> knitr::kable()
Toxic Nontoxic
Training 5851 3345
Prediction 1946 1124
(cc50 |>
  mutate(
    Toxicity = case_when(
      y == -1 ~ "Nontoxic",
      y == 1 ~ "Toxic"
    )
  ) |>
  group_by(Series) |>
  mutate(total = n()) |>
  group_by(Series, Toxicity, total) |>
  summarise(n = n(), .groups = "drop") |>
  # the remaining steps are not reproduced in this PDF skeleton; a plausible
  # continuation computes each label's share per Series:
  mutate(prop = n / total))
Summary: the data has huge variation. Therefore, in this case, the solution is to normalize the features through standardization and to re-code the y variable into a factor/integer.
Data Wrangling
When I discovered that the dataset has large variation, I decided to normalize the features under the hood. The dataset is already in shape, in tidy format, and already renamed. What this part comes down to is normalizing the data through standardization, using transmutation with the mutate function, and splitting it with data.table's [ method.
$$z = \frac{x - \mu}{\sigma}$$
cc50 |>
  mutate(
    across(
      starts_with('x'),
      scale
    )
  )
What I did here: with the across function, I selected the columns that start with "x", then applied the standardization to those columns through the scale function (although we could just use mutate_if). But I am not satisfied with its speed, even though this was also fast. Thus, I went through several trials that may help speed up the normalization, starting with duckdb and then dtplyr.
For the trials, the first uses DuckDB window functions:
SELECT
(x1 - avg(x1) OVER ()) / stddev(x1) OVER () AS x1,
(x2 - avg(x2) OVER ()) / stddev(x2) OVER () AS x2,
(x3 - avg(x3) OVER ()) / stddev(x3) OVER () AS x3,
(x4 - avg(x4) OVER ()) / stddev(x4) OVER () AS x4,
(x5 - avg(x5) OVER ()) / stddev(x5) OVER () AS x5,
(x6 - avg(x6) OVER ()) / stddev(x6) OVER () AS x6,
(x7 - avg(x7) OVER ()) / stddev(x7) OVER () AS x7,
(x8 - avg(x8) OVER ()) / stddev(x8) OVER () AS x8,
(x9 - avg(x9) OVER ()) / stddev(x9) OVER () AS x9,
Series,
y
FROM cc50_new;
ii. Here, I am just recycling the dplyr code from before, but this time the cc50 dataset is loaded into dtplyr's lazy_dt. The way you write the normalization code is the same as dplyr's, and it is powered by data.table for speed.
cc50 |>
dtplyr::lazy_dt() |>
mutate(
across(
starts_with('x'),
scale
)
)
iii. Third trial: using tidytable, but instead of scale I use a lambda, the purrr-style lambda.
cc50 |>
  tidytable::as_tidytable() |>
  mutate(
    across(
      starts_with('x'),
      ~ (.x - mean(.x)) / sd(.x)
    )
  )
I considered other data-manipulation libraries as well, ones I didn't include in the speed trials. I chose the libraries above because they combine readability with speed and the ability to coerce types. I also tried polars, because that package is so fast (according to H2O's benchmark), but when I applied it to the CC50 data frame for normalization it was fast, yet not as fast as the 2 data.table-backed libraries I used, and I couldn't even coerce the y variable into a factor.
Data Splitting
After normalizing the data, splitting it is done with data.table. Well, there are other methods of splitting the dataset, like using dplyr's sample_n() and sample_frac() functions combined with the anti_join() function to gather the rows not captured by sample_frac(). But I prefer the data.table indexing method, and it's faster. It is important to note that data.table's [ method is not the usual [ indexing method of base R's data frame; it is more like SQL for a data frame, in a functional-programming style.
DT[i, j, by]

where i selects the rows (SQL's WHERE), j selects or computes the columns (SELECT), and by groups the rows (GROUP BY).
And thus, I have to coerce the cc50 data frame into a data.table using as.data.table() first. Then I go through data.table's data-splitting process, by Series.
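A minimal sketch of that split (the Series values come straight from the data):

cc50 <- data.table::as.data.table(cc50)
cc50_train <- cc50[Series == "Training"]
cc50_test  <- cc50[Series == "Prediction"]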
duckdb::dbDisconnect(duckdb_con)
Part V.
Neural Network with Torch
Like I said in the introduction, I'll be applying a Radial Basis Function (RBF) Network to the data. But I didn't mention that this model is just an experiment to see how well RBF Networks perform at classifying CC50 toxicity, even though I already did standard feedforward neural networks (I tried a Multilayer Perceptron (MLP) with the ReLU activation function for classification) and it worked well.
It's crucial to convert your data frames into torch tensor objects to ensure compatibility with the torch library. To do this, use the dataset function from the torch namespace, which extends torch's R6 dataset class through inheritance. This allows the resulting dataset generator to work seamlessly with my data frames.
Data to torch readings
# the top of this definition is not in the PDF skeleton; a plausible reconstruction:
cc50_dataset <- torch::dataset(
  name = "cc50_dataset",
  initialize = function(df) {
    self$x <- torch_tensor(as.matrix(as.data.frame(df)[, paste0("x", 1:9)]))
    self$y <- torch_tensor(
      as.integer(factor(df$y))
    )$to(torch_long())
  },
  .getitem = function(i) {
    list(x = self$x[i, ], y = self$y[i])
  },
  .length = function() {
    dim(self$x)[1]
  }
)
Then, encapsulate the data frames you have with the cc50_dataset generator. Once you've done that, you have successfully encapsulated your data frames (here, the generator is saved in the cc50_dataset class) and converted them into torch tensors.
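As a sketch, assuming the cc50_train and cc50_test splits from the Data Wrangling chapter:

train_ds <- cc50_dataset(cc50_train)
test_ds  <- cc50_dataset(cc50_test)
train_ds$.length()      # number of observations
train_ds$.getitem(1)    # first (x, y) pair, already as torch tensors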
Radial Basis Function (RBF) Neural Networks
The Radial Basis Function (RBF) Neural Network, or just RBF Network, is a typical Feedforward Neural Network (FfNN), except that its activation function is not one of the standard activations used in FfNNs, such as ReLU, sigmoid, softplus, etc.; it uses Radial Basis Functions instead. When the RBF is applied, the centres and the shapes (widths) are determined.
$$\phi(x) = \exp\left(-\frac{\lVert x - c \rVert^2}{2\sigma^2}\right)$$

where $x$ is the input vector, $c$ is the centre of the basis function, and $\sigma$ controls its width (shape).
Yup, the formula above looks a lot like standardization. But note that $\phi(x)$ is driven by a squared distance, $\lVert x - c \rVert^2$; to recover the distance itself, take the square root of that term.
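As a quick intuition check, here is a direct R transcription of the basis function (the inputs are made-up numbers):

rbf <- function(x, c, sigma) exp(-sum((x - c)^2) / (2 * sigma^2))
rbf(x = c(0.2, 0.5), c = c(0, 0), sigma = 1)  # close to 1 when x is near the centre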
When you get to 1 Layer and 2 Layers, notice how I call RBFNetwork even though it doesn't exist in the torch namespace and isn't a function created manually within this post/website. Don't misunderstand: that's because I have a separate R script, at .../module_r/torch_rbf.R, containing that function, and I make R reuse the code from .../module_r/torch_rbf.R as a module, using box::use with module_r/torch_rbf as the unquoted argument. As I said in the introduction, and as I did during Chapter 2 and Chapter 3, this is the way to access or reuse the code of a particular R script or folder as a module and import it with box::use. If you append [...] to the module, you get access to all of its (exported) functions. Go to this page to access the source code. I do this to show you the capabilities of R as a programming language.
For training the neural networks, I will apply the cross-entropy loss function, the classification one to be exact, and the ADAM algorithm as the optimizer. Both are used together during estimation.
1 Layer
For the 1-layer RBF Network, the whole process is deliberately not wrapped in a single function, so that I can explain the steps of training an RBF Network on the CC50 dataset one by one.
box::use(./module_r/torch_rbf[...])
This comes after loading the module; you can now access the RBFNetwork function found inside the torch_rbf module. The hidden layer has 52 neurons. Note that you can still increase the number of neurons as much as you want and as your hardware allows.
The cross-entropy loss function measures the difference between the predicted class probabilities and the true class labels, while the ADAM optimizer adjusts the model's weights iteratively to minimize this loss. Both functions are executed in tandem during training to optimize the network's performance.
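As a minimal sketch of that setup (the RBFNetwork arguments are my assumption; its real signature lives in module_r/torch_rbf.R, but 9 features in, 52 RBF neurons, and 2 classes out match this chapter):

model     <- RBFNetwork(9, 52, 2)           # hypothetical argument order
criterion <- nn_cross_entropy_loss()        # classification cross-entropy
optimizer <- optim_adam(model$parameters)   # ADAM with the default learning rate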
Training Set
For the number of epochs, I set it to 1,200. Meaning, the model is trained
over 1200 epochs, with progress being printed every 100 epochs, displaying
the current epoch and the loss.
# plausible reconstruction of the loop (model, criterion, optimizer as sketched
# above; train_x / train_y are the training tensors, names assumed):
for (epoch in 1:1200) {
  optimizer$zero_grad()
  output <- model(train_x)
  loss1_mod <- criterion(output, train_y)
  loss1_mod$backward()
  optimizer$step()
  if (epoch %% 100 == 0) {
    cat("Epoch:", epoch, "Loss:", as.numeric(loss1_mod), "\n")
  }
}
caret::confusionMatrix(
data = table(actual = cc50_train$y, predictions = predicted_cc50),
mode = 'everything'
)
Prediction
Actual Nontoxic Toxic
Nontoxic 2268 1077
Toxic 571 5280
Accuracy : 0.8208
95% CI : (0.8128, 0.8286)
No Information Rate : 0.6913
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5999
Sensitivity : 0.7989
Specificity : 0.8306
Pos Pred Value : 0.6780
Neg Pred Value : 0.9024
Precision : 0.6780
Recall : 0.7989
F1 : 0.7335
Prevalence : 0.3087
Detection Rate : 0.2466
Detection Prevalence : 0.3637
Balanced Accuracy : 0.8147
Prediction Set
caret::confusionMatrix(
data = table(
actual = cc50_test$y,
predictions = predicted_cc50_test
),
mode = 'everything'
)
Prediction
Actual Nontoxic Toxic
Nontoxic 703 421
Toxic 188 1758
Accuracy : 0.8016
95% CI : (0.7871, 0.8156)
No Information Rate : 0.7098
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.553
Sensitivity : 0.7890
Specificity : 0.8068
Pos Pred Value : 0.6254
Neg Pred Value : 0.9034
Precision : 0.6254
Recall : 0.7890
F1 : 0.6978
Prevalence : 0.2902
Detection Rate : 0.2290
Detection Prevalence : 0.3661
Balanced Accuracy : 0.7979
2 Layers
The RBF Network with 1 layer did pretty well at classifying the CC50 toxicity levels. This time, I am gonna add 1 more layer, for 2 hidden layers, to see how well that does. Is it better than the RBF Network with 1 layer? Let's find out.
Note: I already discussed the details in the previous part. So in this part, the torch_rbf_2layer module has a function that combines the whole workflow explained in the first part of this chapter, including prediction on new data.
Thus, the module this time:
box::use(./module_r/torch_rbf_2layer)
Training Set
caret::confusionMatrix(
data = table(
actual = cc50_train$y,
predictions = cc50_train_rbf$preds
),
mode = 'everything'
)
Prediction
Actual Nontoxic Toxic
Accuracy : 0.864
95% CI : (0.8568, 0.8709)
No Information Rate : 0.6468
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.7043
Sensitivity : 0.8224
Specificity : 0.8867
Pos Pred Value : 0.7985
Neg Pred Value : 0.9014
Precision : 0.7985
Recall : 0.8224
F1 : 0.8103
Prevalence : 0.3532
Detection Rate : 0.2905
Detection Prevalence : 0.3637
Balanced Accuracy : 0.8545
Prediction Set
caret::confusionMatrix(
data = table(
actual = cc50_test$y,
predictions = cc50_test_rbf$preds
),
mode = 'everything'
)
Prediction
Actual Nontoxic Toxic
Nontoxic 839 285
Toxic 215 1731
Accuracy : 0.8371
95% CI : (0.8236, 0.85)
No Information Rate : 0.6567
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.6444
Sensitivity : 0.7960
Specificity : 0.8586
Pos Pred Value : 0.7464
If you wonder why it is so slow, for both the 1-layer and 2-layer RBF Networks, I suspect the cause is inference: the literature I read confirms that RBF Networks are slow at inference, though not during training.
So, with the current structure of the 1-layer and 2-layer RBF Networks, the accuracy values are:
1. 1-layer: 0.8208 on the Training set, 0.8016 on the Prediction set
2. 2-layer: 0.8640 on the Training set, 0.8371 on the Prediction set
Part VI.
Chapter 5: Comparison to other models
Other models
The accuracy of those 2 neural networks is still high, despite the huge variance in the CC50 data. However, the big drawback of the neural network models is that training them is so slow, and you need a large number of parameters to train.
Now, the question is, how about we compare it to other models?
These are the list of models/algorithms to be compared:
1. Logistic Regression
2. Extreme Gradient Boosting (XGBoost)
3. Support Vector Machine
4. Naive Bayes
5. Random Forest
6. Penalized/Regularized Logistic Regression
7. K-nearest Neighbors (kNN)
As with the previous 2 RBF Network models, after applying these models the confusion matrix is computed using the caret::confusionMatrix function. The visualization of the predictions/classifications will be done in Chapter 6.
Other model: Logistic Regression
$$\text{Odds} = \hat{\pi} = \frac{p}{1 - p}$$

$$\log(\hat{\pi}) = \log\left(\frac{p}{1 - p}\right) = X\beta$$

$$p = \frac{1}{1 + \exp(-X\beta)}$$
With tidymodels workflow
As you can see in the equations above, the model's prediction happens by transforming the log odds back into a probability using the logistic (sigmoid) function, ensuring the output is between 0 and 1, which can then be used to classify the outcome (e.g., if $p \geq 0.5$, predict the positive class).
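For a quick numeric illustration with base R's plogis() (the threshold rule mirrors the sentence above):

p <- plogis(c(-2, 0, 2))               # 0.119, 0.500, 0.881
ifelse(p >= 0.5, "Toxic", "Nontoxic")  # "Nontoxic" "Toxic" "Toxic"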
As with the previous 2 RBF Network models, the use of a module still applies. There I have two functions: one that combines all of the tidymodels workflows, and another that applies base R's glm function, with its predictions already classified.
box::use(module_r/logit)
I have to note that if you want to use this as a module, it only applies to binary outcomes. The source code of module_r/logit is in the same directory as the previous 2 RBF modules; I stored it on this page.
Note

The function takes:
1. The formula
2. The original data to be used to train the ML models
Remember
As you can see in the code below, this will be the same as other
models in this chapter, where you can just apply the formula, data,
and new_data arguments.
model$model
── Preprocessor ──────────────────────────────────────────
0 Recipe Steps

── Model ─────────────────────────────────────────────────
(Intercept) x1 x2 x3 x4 x5 x6
0.72780 0.34341 -0.41703 0.20841 0.01045 -1.04238 -0.42912
x7 x8 x9
0.45369 0.40539 -0.64373
Training Set
caret::confusionMatrix(
data = table(
actual = cc50_train$y,
prediction = cc50_logit$prediction
),
mode = "everything"
)
Prediction
Actual Nontoxic Toxic
Nontoxic 1720 1625
Toxic 734 5117
Accuracy : 0.7435
95% CI : (0.7344, 0.7524)
No Information Rate : 0.7331
P-Value [Acc > NIR] : 0.0127
Kappa : 0.4123
Sensitivity : 0.7009
Specificity : 0.7590
Pos Pred Value : 0.5142
Neg Pred Value : 0.8746
Precision : 0.5142
Recall : 0.7009
F1 : 0.5932
Prevalence : 0.2669
Detection Rate : 0.1870
Detection Prevalence : 0.3637
Balanced Accuracy : 0.7299
Prediction Set
caret::confusionMatrix(
data = table(
actual = cc50_test$y,
prediction = cc50_logit$predict_new
),
mode = "everything"
)
Prediction
Actual Nontoxic Toxic
Nontoxic 540 584
Toxic 243 1703
Accuracy : 0.7306
95% CI : (0.7145, 0.7462)
No Information Rate : 0.745
P-Value [Acc > NIR] : 0.9667
Kappa : 0.3799
Sensitivity : 0.6897
Specificity : 0.7446
Pos Pred Value : 0.4804
Neg Pred Value : 0.8751
Precision : 0.4804
Recall : 0.6897
F1 : 0.5663
Prevalence : 0.2550
Detection Rate : 0.1759
Detection Prevalence : 0.3661
Balanced Accuracy : 0.7171
Without tidymodels workflow
Training Set
caret::confusionMatrix(
data = table(
actual = cc50_train$y,
prediction = cc50_logit$prediction
),
mode = "everything"
)
Prediction
Actual Nontoxic Toxic
Nontoxic 1720 1625
Accuracy : 0.7435
95% CI : (0.7344, 0.7524)
No Information Rate : 0.7331
P-Value [Acc > NIR] : 0.0127
Kappa : 0.4123
Sensitivity : 0.7009
Specificity : 0.7590
Pos Pred Value : 0.5142
Neg Pred Value : 0.8746
Precision : 0.5142
Recall : 0.7009
F1 : 0.5932
Prevalence : 0.2669
Detection Rate : 0.1870
Detection Prevalence : 0.3637
Balanced Accuracy : 0.7299
Prediction Set
caret::confusionMatrix(
data = table(
actual = cc50_test$y,
prediction = cc50_logit$predict_new
),
mode = "everything"
)
Prediction
Actual Nontoxic Toxic
Nontoxic 540 584
Toxic 243 1703
Accuracy : 0.7306
95% CI : (0.7145, 0.7462)
No Information Rate : 0.745
P-Value [Acc > NIR] : 0.9667
Kappa : 0.3799
Sensitivity : 0.6897
Specificity : 0.7446
Pos Pred Value : 0.4804
Neg Pred Value : 0.8751
Precision : 0.4804
Recall : 0.6897
F1 : 0.5663
Prevalence : 0.2550
Detection Rate : 0.1759
Detection Prevalence : 0.3661
Balanced Accuracy : 0.7171
This is what I mean when I say it doesn't really matter whether you apply tidymodels to control the workflow, including the hyperparameter tuning and 5-fold cross-validation, or not, unless you apply penalty parameters (see Other model: Regularized Logistic Regression). The results are still the same: neither reached 80% on the Training or the Prediction set. Anyway, this model is nowhere near as good as the 2 RBF Network models at classifying CC50 toxicity levels.
Other model: Extreme Gradient Boosting
This module contains a function that combines the whole XGBoost workflow with tidymodels, and a function that uses regular XGBoost via the xgboost package; both of them take formula and new_data arguments. This is how you load the module with box::use:
box::use(module_r/xgboost)
With tidymodels workflow
Note
##### xgb.Booster
raw: 693.7 Kb
call:
xgboost::xgb.train(params = list(eta = 1.04032930083058, max_depth = 6L,
gamma = 0, colsample_bytree = 1, colsample_bynode = 1, min_child_weight = 6L,
subsample = 1), data = x$data, nrounds = 500, watchlist = x$watchlist,
Training Set
caret::confusionMatrix(
  data = table(
    actual = cc50_train$y,
    prediction = cc50_tm_xgb$prediction
  )
)
Prediction
Actual Nontoxic Toxic
Nontoxic 3322 23
Toxic 6 5845
Accuracy : 0.9968
95% CI : (0.9955, 0.9979)
No Information Rate : 0.6381
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9932
Sensitivity : 0.9982
Specificity : 0.9961
Pos Pred Value : 0.9931
Neg Pred Value : 0.9990
Precision : 0.9931
Recall : 0.9982
F1 : 0.9957
Prevalence : 0.3619
Detection Rate : 0.3612
Detection Prevalence : 0.3637
Balanced Accuracy : 0.9971
Prediction Set
caret::confusionMatrix(
data = table(
actual = cc50_test$y,
prediction = cc50_tm_xgb$predict_new
)
)
Prediction
Actual Nontoxic Toxic
Nontoxic 1031 93
Toxic 62 1884
Accuracy : 0.9495
95% CI : (0.9412, 0.957)
No Information Rate : 0.644
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.8906
Sensitivity : 0.9433
Specificity : 0.9530
Pos Pred Value : 0.9173
Without tidymodels workflow
Now, if you apply the XGBoost algorithm without the workflows used in the tidymodels package, the hyperparameters are not tuned and iterated. Thus, I'll just rely on xgboost's defaults, with xgb.DMatrix as a helper for XGBoost's input data.
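A minimal sketch of that plain path (the 0/1 recoding of y is an assumption; adjust it to how your label is stored):

library(xgboost)
feat_cols <- paste0("x", 1:9)
dtrain <- xgb.DMatrix(
  data  = as.matrix(as.data.frame(cc50_train)[, feat_cols]),
  label = as.integer(cc50_train$y == "Toxic")  # hypothetical 0/1 recoding
)
xgb_fit <- xgboost(data = dtrain, nrounds = 100,
                   objective = "binary:logistic", verbose = 0)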
Training Set
caret::confusionMatrix(
  data = table(
    actual = cc50_train$y,
    prediction = cc50_xgb$predictions
  )
)
Prediction
Actual Nontoxic Toxic
Nontoxic 3288 57
Toxic 24 5827
Accuracy : 0.9912
95% CI : (0.9891, 0.993)
No Information Rate : 0.6398
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9809
Sensitivity : 0.9928
Specificity : 0.9903
Pos Pred Value : 0.9830
Neg Pred Value : 0.9959
Precision : 0.9830
Recall : 0.9928
F1 : 0.9878
Prevalence : 0.3602
Detection Rate : 0.3575
Detection Prevalence : 0.3637
Balanced Accuracy : 0.9915
Prediction Set
caret::confusionMatrix(
data = table(
actual = cc50_test$y,
prediction = cc50_xgb$predictions_new
)
)
Prediction
Actual Nontoxic Toxic
Nontoxic 1026 98
Toxic 73 1873
Accuracy : 0.9443
95% CI : (0.9356, 0.9521)
No Information Rate : 0.642
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.8794
Sensitivity : 0.9336
Specificity : 0.9503
Pos Pred Value : 0.9128
Neg Pred Value : 0.9625
Precision : 0.9128
Recall : 0.9336
F1 : 0.9231
Prevalence : 0.3580
Detection Rate : 0.3342
Detection Prevalence : 0.3661
Balanced Accuracy : 0.9419
You see, with or without tidymodels, both workflows reach high accuracies (on both the training and prediction sets), roughly equal to each other, even though the metrics slightly favor tidymodels' XGBoost. It also happens that the accuracy between the Training and Prediction sets, with or without tidymodels, does not deviate much, hence, nicely, no overfitting happened.
Other model: Support Vector Machine
Let's try another model, and that is the Support Vector Machine, or SVM for short. The SVM is one of the easier machine learning models to learn and understand. It works by finding the optimal boundary, or hyperplane, that best separates data into different classes. SVMs are used for classification and regression problems, and some literature applies the algorithm to small- to medium-sized datasets with clear separation between classes. Its goal is to fit a margin between the classes, with the support vectors being the data points that lie closest to the boundary. In other terminology, we refer to this as margin maximization, or "the street". This margin maximization is what makes SVMs robust for classification tasks. What sets the SVM apart is its ability to handle high-dimensional data (data that is wider, not longer) and to apply different kernel functions, like linear, polynomial, or radial basis functions (RBF), to transform the input space, making it more flexible for complex problems. I will apply the Radial Basis Function (RBF) kernel, since I used RBF Networks previously.
The module is assigned to the svm alias. This is how you load the module with box::use:
box::use(svm = module_r/svm)
Note
When tuning the SVM's hyperparameters to train the model on the CC50 dataset, my complaint is the speed. With the default tuning settings, training the SVM model on the CC50 dataset is SO SLOW! That's why I reduced the number of iterations to the bare minimum, via the integer value you pass to the n_iter argument.
With tidymodels workflow
Training Set
caret::confusionMatrix(
data = table(
actual = cc50_train$y,
prediction = cc50_tm_svm$prediction
)
)
Prediction
Accuracy : 0.9837
95% CI : (0.9809, 0.9862)
No Information Rate : 0.643
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9646
Sensitivity : 0.9866
Specificity : 0.9821
Pos Pred Value : 0.9683
Neg Pred Value : 0.9925
Precision : 0.9683
Recall : 0.9866
F1 : 0.9774
Prevalence : 0.3570
Detection Rate : 0.3522
Detection Prevalence : 0.3637
Balanced Accuracy : 0.9843
Prediction Set
caret::confusionMatrix(
  data = table(
    actual = cc50_test$y,
    prediction = cc50_tm_svm$predict_new
  )
)
Prediction
Actual Nontoxic Toxic
Nontoxic 919 205
Toxic 114 1832
Accuracy : 0.8961
95% CI : (0.8848, 0.9067)
No Information Rate : 0.6635
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.7722
Sensitivity : 0.8896
Specificity : 0.8994
Pos Pred Value : 0.8176
Neg Pred Value : 0.9414
Precision : 0.8176
Recall : 0.8896
F1 : 0.8521
Prevalence : 0.3365
Detection Rate : 0.2993
Detection Prevalence : 0.3661
Balanced Accuracy : 0.8945
Without tidymodels workflow
Now, if you apply the SVM model without the workflows used in the tidymodels package, the hyperparameters are not tuned and iterated. Thus, I'll just rely on e1071's svm defaults.
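A minimal sketch of that plain path (assuming y is stored as a factor; the column subsetting is mine):

library(e1071)
train_df <- as.data.frame(cc50_train)[, c(paste0("x", 1:9), "y")]
svm_fit  <- svm(y ~ ., data = train_df, kernel = "radial")  # RBF kernel, default cost/gamma
preds    <- predict(svm_fit, newdata = as.data.frame(cc50_test))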
Training Set
caret::confusionMatrix(
data = table(
actual = cc50_train$y,
prediction = cc50_svm$predictions
)
)
Prediction
Actual Nontoxic Toxic
Nontoxic 2257 1088
Toxic 443 5408
Accuracy : 0.8335
95% CI : (0.8257, 0.8411)
No Information Rate : 0.7064
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.6248
Sensitivity : 0.8359
Specificity : 0.8325
Pos Pred Value : 0.6747
Neg Pred Value : 0.9243
Precision : 0.6747
Recall : 0.8359
F1 : 0.7467
Prevalence : 0.2936
Detection Rate : 0.2454
Detection Prevalence : 0.3637
Balanced Accuracy : 0.8342
Prediction Set
caret::confusionMatrix(
data = table(
actual = cc50_test$y,
prediction = cc50_svm$predictions_new
)
)
Prediction
Actual Nontoxic Toxic
Nontoxic 728 396
Toxic 150 1796
Accuracy : 0.8221
95% CI : (0.8082, 0.8355)
No Information Rate : 0.714
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5983
Sensitivity : 0.8292
Specificity : 0.8193
Pos Pred Value : 0.6477
Neg Pred Value : 0.9229
Precision : 0.6477
Recall : 0.8292
F1 : 0.7273
Prevalence : 0.2860
Detection Rate : 0.2371
Detection Prevalence : 0.3661
Balanced Accuracy : 0.8243
As you can see, the results with and without the tidymodels workflow differ hugely from each other, in favor of the SVM model with the controlled, tuned workflow.
Other model: Naive Bayes
Naive Bayes is the only generative model used in this chapter, while the rest are discriminative models. It is a probabilistic classification model based on applying Bayes' theorem with strong (naive) independence assumptions between the features. This model is particularly useful for large datasets and text classification tasks due to its simplicity and efficiency. The independence assumption means that the presence of a particular feature in the dataset does not affect the presence of any other feature.
In practice, Naive Bayes classifiers work by calculating the posterior probability of each class given the features of the input data and then choosing the class with the highest posterior probability. This makes Naive Bayes models fast and scalable. Despite its simplicity, it often performs surprisingly well in practice.
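As a sketch with the naivebayes package used below (usekernel = TRUE matches the KDE tables in the printed result; the column subsetting is mine):

library(naivebayes)
train_df <- as.data.frame(cc50_train)[, c(paste0("x", 1:9), "y")]
nb_fit   <- naive_bayes(y ~ ., data = train_df, usekernel = TRUE)
head(predict(nb_fit, newdata = as.data.frame(cc50_test)))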
The name of the script is nb.R. It is so short that you won't need an alias (like nb) for this module, unlike the svm alias in the SVM part. Here's how you load it:
With tidymodels workflow

Note
The speed is not a big deal, roughly under 5 minutes. The hyperparameters to be tuned are not numerous; there are only 2 of them:
1. Smoothness
2. Laplace parameter
Actual result:
Call:
naive_bayes.default(x = maybe_data_frame(x), y = y, laplace = ~0.0945426446851343,
usekernel = TRUE, adjust = ~0.163482782384381)
--------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------
A priori probabilities:
Nontoxic Toxic
0.3637451 0.6362549
-----------------------------------------------------------------------------
Tables:
-----------------------------------------------------------------------------
:: x1::Nontoxic (KDE)
-----------------------------------------------------------------------------
Call:
density.default(x = x, adjust = ..1, na.rm = TRUE)
x y
Min. :-3.55164 Min. :0.00000
1st Qu.:-1.81466 1st Qu.:0.01215
Median :-0.07768 Median :0.09504
Mean :-0.07768 Mean :0.14366
3rd Qu.: 1.65931 3rd Qu.:0.21610
Max. : 3.39629 Max. :0.82503
-----------------------------------------------------------------------------
:: x1::Toxic (KDE)
-----------------------------------------------------------------------------
Call:
density.default(x = x, adjust = ..1, na.rm = TRUE)
x y
Min. :-3.30409 Min. :0.00000
1st Qu.:-1.65993 1st Qu.:0.00634
Median :-0.01578 Median :0.07529
--------------------------------------------------------------------------------------------
:: x2::Nontoxic (KDE)
--------------------------------------------------------------------------------------------
Call:
density.default(x = x, adjust = ..1, na.rm = TRUE)
x y
Min. :-3.0150 Min. :0.0000007
1st Qu.:-1.3854 1st Qu.:0.0163480
Median : 0.2443 Median :0.0746043
Mean : 0.2443 Mean :0.1531175
3rd Qu.: 1.8739 3rd Qu.:0.2681576
Max. : 3.5035 Max. :0.6575734
--------------------------------------------------------------------------------------------
:: x2::Toxic (KDE)
--------------------------------------------------------------------------------------------
Call:
density.default(x = x, adjust = ..1, na.rm = TRUE)
x y
Min. :-3.45844 Min. :0.0000921
1st Qu.:-1.75693 1st Qu.:0.0237468
-----------------------------------------------------------------------------
:: x3::Nontoxic (KDE)
-----------------------------------------------------------------------------
Call:
density.default(x = x, adjust = ..1, na.rm = TRUE)
x y
Min. :-4.2747 Min. :0.000000
1st Qu.:-2.3451 1st Qu.:0.006735
Median :-0.4155 Median :0.057628
Mean :-0.4155 Mean :0.129308
3rd Qu.: 1.5141 3rd Qu.:0.171811
Max. : 3.4437 Max. :1.424567
-----------------------------------------------------------------------------
:: x3::Toxic (KDE)
-----------------------------------------------------------------------------
Call:
density.default(x = x, adjust = ..1, na.rm = TRUE)
x y
Min. :-3.8168 Min. :0.000000
--------------------------------------------------------------------------------------------
:: x4::Nontoxic (KDE)
--------------------------------------------------------------------------------------------
Call:
density.default(x = x, adjust = ..1, na.rm = TRUE)
x y
Min. :-3.7336 Min. :0.00000
1st Qu.:-1.7856 1st Qu.:0.00000
Median : 0.1624 Median :0.00000
Mean : 0.1624 Mean :0.12812
3rd Qu.: 2.1103 3rd Qu.:0.02758
Max. : 4.0583 Max. :2.68670
--------------------------------------------------------------------------------------------
:: x4::Toxic (KDE)
--------------------------------------------------------------------------------------------
Call:
density.default(x = x, adjust = ..1, na.rm = TRUE)
x y
-----------------------------------------------------------------------------
:: x5::Nontoxic (KDE)
-----------------------------------------------------------------------------
Call:
density.default(x = x, adjust = ..1, na.rm = TRUE)
x y
Min. :-2.5025 Min. :0.00000
1st Qu.:-1.3117 1st Qu.:0.03801
Median :-0.1209 Median :0.12358
Mean :-0.1209 Mean :0.20952
3rd Qu.: 1.0698 3rd Qu.:0.33197
Max. : 2.2606 Max. :0.83230
-----------------------------------------------------------------------------
:: x5::Toxic (KDE)
-----------------------------------------------------------------------------
Call:
density.default(x = x, adjust = ..1, na.rm = TRUE)
x y
Min. :-2.6985 Min. :0.00000
1st Qu.:-1.4335 1st Qu.:0.05057
Median :-0.1685 Median :0.12368
Mean :-0.1685 Mean :0.19723
3rd Qu.: 1.0964 3rd Qu.:0.28863
Max. : 2.3614 Max. :1.50649
--------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------
Training Set
caret::confusionMatrix(
data = table(
actual = cc50_train$y,
prediction = cc50_nb$prediction
),
mode = "everything"
)
Prediction
Actual Nontoxic Toxic
Nontoxic 729 2616
Toxic 0 5851
Accuracy : 0.7155
95% CI : (0.7062, 0.7247)
No Information Rate : 0.9207
P-Value [Acc > NIR] : 1
Kappa : 0.2618
Sensitivity : 1.00000
Specificity : 0.69104
Pos Pred Value : 0.21794
Neg Pred Value : 1.00000
Precision : 0.21794
Recall : 1.00000
F1 : 0.35788
Prevalence : 0.07927
Detection Rate : 0.07927
Detection Prevalence : 0.36375
Balanced Accuracy : 0.84552
Prediction Set
caret::confusionMatrix(
  data = table(
    actual = cc50_test$y,
    prediction = cc50_nb$predict_new
  ),
  mode = "everything"
)
Prediction
Actual Nontoxic Toxic
Nontoxic 227 897
Toxic 0 1946
Accuracy : 0.7078
95% CI : (0.6914, 0.7239)
No Information Rate : 0.9261
P-Value [Acc > NIR] : 1
Kappa : 0.2429
Sensitivity : 1.00000
Specificity : 0.68449
Pos Pred Value : 0.20196
Neg Pred Value : 1.00000
Precision : 0.20196
Recall : 1.00000
F1 : 0.33605
Prevalence : 0.07394
Detection Rate : 0.07394
Detection Prevalence : 0.36612
Balanced Accuracy : 0.84224
Without tidymodels workflow
Training Set
caret::confusionMatrix(
data = table(
actual = cc50_train$y,
prediction = cc50_nb$predictions
),
mode = "everything"
)
Prediction
Actual Nontoxic Toxic
Nontoxic 1724 1621
Toxic 1143 4708
Accuracy : 0.6994
95% CI : (0.6899, 0.7088)
No Information Rate : 0.6882
P-Value [Acc > NIR] : 0.01034
Kappa : 0.3301
Sensitivity : 0.6013
Specificity : 0.7439
Pos Pred Value : 0.5154
Neg Pred Value : 0.8046
Precision : 0.5154
Recall : 0.6013
F1 : 0.5551
Prevalence : 0.3118
Detection Rate : 0.1875
Detection Prevalence : 0.3637
Balanced Accuracy : 0.6726
Prediction Set
caret::confusionMatrix(
data = table(
actual = cc50_test$y,
prediction = cc50_nb$predict_new
),
mode = "everything"
)
Prediction
Actual Nontoxic Toxic
Accuracy : 0.7049
95% CI : (0.6884, 0.721)
No Information Rate : 0.6906
P-Value [Acc > NIR] : 0.04427
Kappa : 0.3427
Sensitivity : 0.6147
Specificity : 0.7453
Pos Pred Value : 0.5196
Neg Pred Value : 0.8119
Precision : 0.5196
Recall : 0.6147
F1 : 0.5632
Prevalence : 0.3094
Detection Rate : 0.1902
Detection Prevalence : 0.3661
Balanced Accuracy : 0.6800
You can see the difference between the Naive Bayes that was tuned and controlled by tidymodels and the Naive Bayes that was run natively with the naivebayes package. Even though the tidymodels Naive Bayes has better results, Naive Bayes barely reaches 70% accuracy on all sets, and it is nowhere near as good as the previous 2 RBF Network models.
Other model: Random Forest
There are 2 common R implementations to choose from:
1. randomForest
2. ranger
However, among the choices, I picked the ranger package for its known speed, confirmed to be faster than the randomForest package. That's right: when you have a large dataset to handle, in this case 9,196 training and 3,070 prediction observations, ranger is surprisingly faster than randomForest. While randomForest is written in C, ranger is more optimized, maintained, and updated, and it is written in C++; neither compromises the accuracy of Random Forest. Furthermore, ranger can also be used when you are conducting survival analysis.
Thus, I'll be using the ranger package to get better performance, in terms of speed, when training the RF model. As you know, all of the model functions are stored in a script as a module and accessed using box::use. Hence, I stored the random forest workflow, with and without tidymodels, as functions within a module named random_forest.
box::use(module_r/random_forest)
With tidymodels workflow

Note
── Preprocessor ──────────────────────────────────────────
0 Recipe Steps

── Model ─────────────────────────────────────────────────
Ranger result
Call:
ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~16L, x), num.trees = ~
Just like with the results from the 2 RBF Network models, I am still using caret::confusionMatrix to obtain the metrics of the RF model's performance, since this RF model is a classification model, not a regression model.
Training Set
caret::confusionMatrix(
data = table(
actual = cc50_train$y,
prediction = cc50_tm_rf$predictions
)
)
Prediction
Actual Nontoxic Toxic
Nontoxic 3323 22
Toxic 1 5850
Accuracy : 0.9975
95% CI : (0.9962, 0.9984)
No Information Rate : 0.6385
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9946
Sensitivity : 0.9997
Specificity : 0.9963
Pos Pred Value : 0.9934
Neg Pred Value : 0.9998
Precision : 0.9934
Recall : 0.9997
F1 : 0.9966
Prevalence : 0.3615
Detection Rate : 0.3614
Detection Prevalence : 0.3637
Balanced Accuracy : 0.9980
After running that code, I obtain the metrics measuring the RF model's performance. They are quite large for the training set, implying that the RF model performed well compared to the previous 2 RBF Network models (though I could have obtained larger metrics for those if I ran them in batches and increased the parameters).
Prediction Set
However, the training-set metrics alone are not sufficient; I have to verify the metrics on the test set. If the metrics don't differ much, the RF model does not suffer from overfitting; otherwise, the conclusion would be overfitting.
caret::confusionMatrix(
data = table(
actual = cc50_test$y,
prediction = cc50_tm_rf$predictions_new
)
)
Prediction
Actual Nontoxic Toxic
Nontoxic 1026 98
Toxic 64 1882
Accuracy : 0.9472
95% CI : (0.9387, 0.9549)
No Information Rate : 0.645
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8856
Sensitivity : 0.9413
Specificity : 0.9505
Pos Pred Value : 0.9128
Neg Pred Value : 0.9671
Precision : 0.9128
Recall : 0.9413
F1 : 0.9268
Prevalence : 0.3550
Detection Rate : 0.3342
Detection Prevalence : 0.3661
Balanced Accuracy : 0.9459
Without tidymodels workflow
This time, the ranger random forest model is run with its defaults, and the hyperparameter tuning processes are not implemented. Let's see how it performs on the CC50 dataset.
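A minimal sketch of that default run (the column subsetting is mine):

library(ranger)
train_df <- as.data.frame(cc50_train)[, c(paste0("x", 1:9), "y")]
rf_fit   <- ranger(y ~ ., data = train_df)  # default mtry, 500 trees
preds    <- predict(rf_fit, data = as.data.frame(cc50_test))$predictions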
Training Set
This doesn't differ from the previous procedure: we obtain the classification model's metrics through the confusion matrix with caret::confusionMatrix, where the data is the table of the actual labels against the predictions.
caret::confusionMatrix(
data = table(
actual = cc50_train$y,
prediction = cc50_rf$train_preds
)
)
Prediction
Actual Nontoxic Toxic
Nontoxic 3063 282
Toxic 162 5689
Accuracy : 0.9517
Kappa : 0.8949
Sensitivity : 0.9498
Specificity : 0.9528
Pos Pred Value : 0.9157
Neg Pred Value : 0.9723
Precision : 0.9157
Recall : 0.9498
F1 : 0.9324
Prevalence : 0.3507
Detection Rate : 0.3331
Detection Prevalence : 0.3637
Balanced Accuracy : 0.9513
Prediction Set
caret::confusionMatrix(
data = table(
actual = cc50_test$y,
prediction = cc50_rf$test_preds
)
)
Prediction
Actual Nontoxic Toxic
Nontoxic 1022 102
Toxic 57 1889
Accuracy : 0.9482
95% CI : (0.9398, 0.9558)
No Information Rate : 0.6485
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8875
Sensitivity : 0.9472
Specificity : 0.9488
Pos Pred Value : 0.9093
Neg Pred Value : 0.9707
Precision : 0.9093
Recall : 0.9472
F1 : 0.9278
Prevalence : 0.3515
Detection Rate : 0.3329
Detection Prevalence : 0.3661
Balanced Accuracy : 0.9480
With roughly 95% accuracy on the Training set and 94% on the Prediction set, we can say that the ranger random forest is not overfitting the CC50 dataset.
When using tidymodels, the accuracy metrics are in its favor compared to not using it. But performance, in terms of training speed, favors the regular use of ranger. Hence, we'll be using the result of the tidymodels ranger RF model for comparison.
Other model: Regularized Logistic Regression

There are 3 penalties to choose from:
1. Ridge
2. LASSO
3. Elastic Net
And so, their workflows differ accordingly. Just like with the Support Vector Machine, the module is assigned to the PLR alias, and the source code of module_r/penal_reg lives in the same directory as the previous 2 RBF modules, stored on this page.
box::use(PLR = module_r/penal_reg)
Penalty: Ridge
What actually is Ridge? Some refer to it as regression with L2 regularization. With logistic regression as the model to be penalized, it is a modification of standard logistic regression in which a penalty term is added to the loss function $L$.
This is the equation:
$$L(\beta) = -\sum_{i=1}^{n}\left(y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right) + \lambda \sum_{j=1}^{k}\beta_j^2$$
The goal is to minimize the loss function of the 𝛽s in the logistic regression
equation (found in Chapter 5.1, in log odds part).
This is how it looks like:
$$\min_{\beta}\left(-\sum_{i=1}^{n}\left(y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right) + \lambda \sum_{j=1}^{k}\beta_j^2\right)$$
The tuned parameter here is just the Penalty, with mixture = 0 (the elastic-net mixture value that yields pure Ridge).
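A minimal sketch of the underlying glmnet call (alpha = 0 is the ridge case; alpha = 1 would be LASSO):

library(glmnet)
x_mat <- as.matrix(as.data.frame(cc50_train)[, paste0("x", 1:9)])
ridge_fit <- glmnet(x = x_mat, y = cc50_train$y, family = "binomial", alpha = 0)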
Actual result:
── Preprocessor ──────────────────────────────────────────
0 Recipe Steps

── Model ─────────────────────────────────────────────────
Df %Dev Lambda
1 9 0.00 159.000
2 9 0.05 144.900
3 9 0.05 132.000
4 9 0.06 120.300
5 9 0.07 109.600
6 9 0.07 99.870
7 9 0.08 91.000
8 9 0.09 82.920
9 9 0.09 75.550
10 9 0.10 68.840
11 9 0.11 62.720
12 9 0.12 57.150
13 9 0.14 52.070
14 9 0.15 47.450
15 9 0.16 43.230
16 9 0.18 39.390
17 9 0.20 35.890
18 9 0.22 32.700
19 9 0.24 29.800
20 9 0.26 27.150
21 9 0.28 24.740
22 9 0.31 22.540
23 9 0.34 20.540
24 9 0.37 18.710
25 9 0.41 17.050
26 9 0.44 15.540
27 9 0.49 14.160
28 9 0.53 12.900
29 9 0.58 11.750
30 9 0.63 10.710
31 9 0.69 9.758
32 9 0.75 8.891
33 9 0.82 8.101
34 9 0.90 7.381
35 9 0.98 6.726
36 9 1.06 6.128
37 9 1.16 5.584
38 9 1.26 5.088
39 9 1.36 4.636
40 9 1.48 4.224
41 9 1.61 3.849
42 9 1.74 3.507
43 9 1.88 3.195
44 9 2.03 2.911
45 9 2.20 2.653
46 9 2.37 2.417
...
and 54 more lines.
Training Set
caret::confusionMatrix(
data = table(
actual = cc50_train$y,
prediction = cc50_ridge$prediction
),
mode = "everything"
)
Prediction
Actual Nontoxic Toxic
Nontoxic 1592 1753
Toxic 658 5193
Accuracy : 0.7378
95% CI : (0.7287, 0.7468)
No Information Rate : 0.7553
P-Value [Acc > NIR] : 0.9999
Kappa : 0.3909
Sensitivity : 0.7076
Specificity : 0.7476
Pos Pred Value : 0.4759
Neg Pred Value : 0.8875
Precision : 0.4759
Recall : 0.7076
F1 : 0.5691
Prevalence : 0.2447
Detection Rate : 0.1731
Detection Prevalence : 0.3637
Balanced Accuracy : 0.7276
Prediction Set
caret::confusionMatrix(
data = table(
actual = cc50_test$y,
prediction = cc50_ridge$predict_new
),
mode = "everything"
)
Prediction
Actual Nontoxic Toxic
Nontoxic 500 624
Toxic 222 1724
Accuracy : 0.7244
95% CI : (0.7083, 0.7402)
No Information Rate : 0.7648
P-Value [Acc > NIR] : 1
Kappa : 0.3578
Sensitivity : 0.6925
Specificity : 0.7342
Pos Pred Value : 0.4448
Neg Pred Value : 0.8859
Precision : 0.4448
Recall : 0.6925
F1 : 0.5417
Prevalence : 0.2352
Detection Rate : 0.1629
Detection Prevalence : 0.3661
Balanced Accuracy : 0.7134
penal_type = "ridge"
)
Training Set
caret::confusionMatrix(
data = table(
actual = cc50_train$y,
prediction = cc50_ridge$predictions
),
mode = "everything"
)
Prediction
Actual Nontoxic Toxic
Nontoxic 1592 1753
Toxic 658 5193
Accuracy : 0.7378
95% CI : (0.7287, 0.7468)
No Information Rate : 0.7553
P-Value [Acc > NIR] : 0.9999
Kappa : 0.3909
Sensitivity : 0.7076
Specificity : 0.7476
Prediction Set
caret::confusionMatrix(
data = table(
actual = cc50_test$y,
prediction = cc50_ridge$predict_new
),
mode = "everything"
)
Prediction
Actual Nontoxic Toxic
Nontoxic 500 624
Toxic 222 1724
Accuracy : 0.7244
Kappa : 0.3578
Sensitivity : 0.6925
Specificity : 0.7342
Pos Pred Value : 0.4448
Neg Pred Value : 0.8859
Precision : 0.4448
Recall : 0.6925
F1 : 0.5417
Prevalence : 0.2352
Detection Rate : 0.1629
Detection Prevalence : 0.3661
Balanced Accuracy : 0.7134
Penalty: LASSO
$$L(\beta) = -\sum_{i=1}^{n}\left(y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right) + \lambda \sum_{j=1}^{k}|\beta_j|$$
The goal is to minimize the loss function of the 𝛽s in the logistic regression
equation (found in Chapter 5.1, in log odds part).
This is how it looks like:
$$\min_{\beta}\left(-\sum_{i=1}^{n}\left(y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right) + \lambda \sum_{j=1}^{k}|\beta_j|\right)$$
The tuned parameter here is just the Penalty, with mixture = 1 (the elastic-net mixture value that yields pure LASSO).
Actual result:
── Preprocessor ──────────────────────────────────────────
0 Recipe Steps

── Model ─────────────────────────────────────────────────
Df %Dev Lambda
1 0 0.00 0.159000
2 1 1.42 0.144900
3 1 2.60 0.132000
4 1 3.59 0.120300
5 1 4.42 0.109600
6 1 5.12 0.099870
7 1 5.71 0.091000
8 1 6.20 0.082920
9 1 6.62 0.075550
10 2 7.12 0.068840
11 2 7.66 0.062720
12 2 8.11 0.057150
13 2 8.50 0.052070
14 3 9.03 0.047450
15 3 9.55 0.043230
16 3 9.98 0.039390
17 4 10.45 0.035890
18 4 10.94 0.032700
19 5 11.44 0.029800
20 7 12.00 0.027150
21 7 12.57 0.024740
22 7 13.05 0.022540
23 8 13.63 0.020540
24 8 14.13 0.018710
25 8 14.55 0.017050
26 8 14.91 0.015540
27 8 15.22 0.014160
28 8 15.48 0.012900
29 8 15.70 0.011750
30 8 15.89 0.010710
31 8 16.05 0.009758
32 8 16.18 0.008891
33 8 16.30 0.008101
34 8 16.39 0.007381
35 8 16.47 0.006726
36 8 16.54 0.006128
37 8 16.60 0.005584
38 8 16.65 0.005088
39 8 16.69 0.004636
40 8 16.73 0.004224
41 8 16.75 0.003849
42 8 16.78 0.003507
43 8 16.80 0.003195
44 8 16.82 0.002911
45 8 16.83 0.002653
46 8 16.84 0.002417
...
and 14 more lines.
Training Set
caret::confusionMatrix(
data = table(
actual = cc50_train$y,
prediction = cc50_lasso$prediction
),
mode = "everything"
)
Prediction
Actual Nontoxic Toxic
Nontoxic 1717 1628
Toxic 717 5134
Accuracy : 0.745
95% CI : (0.736, 0.7539)
No Information Rate : 0.7353
P-Value [Acc > NIR] : 0.01794
Kappa : 0.415
Sensitivity : 0.7054
Specificity : 0.7592
Pos Pred Value : 0.5133
Neg Pred Value : 0.8775
Precision : 0.5133
Recall : 0.7054
F1 : 0.5942
Prevalence : 0.2647
Detection Rate : 0.1867
Detection Prevalence : 0.3637
Balanced Accuracy : 0.7323
Prediction Set
caret::confusionMatrix(
data = table(
actual = cc50_test$y,
prediction = cc50_lasso$predict_new
),
mode = "everything"
)
Prediction
Actual Nontoxic Toxic
Nontoxic 537 587
Toxic 238 1708
Accuracy : 0.7313
95% CI : (0.7152, 0.7469)
No Information Rate : 0.7476
P-Value [Acc > NIR] : 0.9815
Kappa : 0.3804
Sensitivity : 0.6929
Specificity : 0.7442
Pos Pred Value : 0.4778
Neg Pred Value : 0.8777
Precision : 0.4778
Recall : 0.6929
F1 : 0.5656
Prevalence : 0.2524
Detection Rate : 0.1749
Detection Prevalence : 0.3661
Balanced Accuracy : 0.7186
Training Set
caret::confusionMatrix(
data = table(
actual = cc50_train$y,
prediction = cc50_lasso$prediction
),
mode = "everything"
)
Prediction
Actual Nontoxic Toxic
Nontoxic 1716 1629
Toxic 722 5129
Accuracy : 0.7443
95% CI : (0.7353, 0.7532)
No Information Rate : 0.7349
P-Value [Acc > NIR] : 0.0202
Kappa : 0.4136
Sensitivity : 0.7039
Specificity : 0.7590
Pos Pred Value : 0.5130
Neg Pred Value : 0.8766
Precision : 0.5130
Recall : 0.7039
F1 : 0.5935
Prevalence : 0.2651
Detection Rate : 0.1866
Detection Prevalence : 0.3637
Balanced Accuracy : 0.7314
Prediction Set
caret::confusionMatrix(
data = table(
actual = cc50_test$y,
prediction = cc50_lasso$predict_new
),
mode = "everything"
)
Prediction
Actual Nontoxic Toxic
Nontoxic 538 586
Toxic 238 1708
Accuracy : 0.7316
95% CI : (0.7155, 0.7472)
No Information Rate : 0.7472
P-Value [Acc > NIR] : 0.9775
Kappa : 0.3813
Sensitivity : 0.6933
Specificity : 0.7446
Pos Pred Value : 0.4786
Neg Pred Value : 0.8777
Precision : 0.4786
Recall : 0.6933
F1 : 0.5663
Prevalence : 0.2528
Penalty: Elastic Net
Key properties of Elastic Net that I didn’t mention when covering Ridge and LASSO:
• While LASSO tends to select only one feature from a group of correlated
features, Elastic Net can retain all or several of them.
• The Elastic Net regularization encourages sparsity like LASSO but also
stabilizes the selection process like Ridge.
$$L(\beta) = -\sum_{i=1}^{n}\left(y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right) + \lambda\left(\alpha \sum_{j=1}^{k} |\beta_j| + (1 - \alpha) \sum_{j=1}^{k} \beta_j^{2}\right)$$
The goal is to minimize this loss function over the βs of the logistic regression
equation (found in Chapter 5.1, in the log-odds part). This is how it looks:
$$\min_{\beta}\left(-\sum_{i=1}^{n}\left(y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right) + \lambda\left(\alpha \sum_{j=1}^{k} |\beta_j| + (1 - \alpha) \sum_{j=1}^{k} \beta_j^{2}\right)\right)$$
The tuned parameters for Elastic Net are:
1. Penalty, or simply λ
2. Mixture, or simply α
A sketch of the corresponding specification is shown below.
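A minimal sketch, again assuming the parsnip glmnet engine (the spec object name is illustrative):

# both lambda (penalty) and alpha (mixture) are tuned for Elastic Net
en_spec <- parsnip::logistic_reg(
  penalty = tune::tune(),
  mixture = tune::tune()
) |>
  parsnip::set_engine("glmnet") |>
  parsnip::set_mode("classification")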
Actual result:
── Preprocessor ──────────────────────────────────────────
0 Recipe Steps
── Model ─────────────────────────────────────────────────
Df %Dev Lambda
1 0 0.00 0.31800
2 1 0.89 0.28980
3 1 1.71 0.26400
4 1 2.46 0.24060
5 1 3.16 0.21920
6 1 3.79 0.19970
7 1 4.36 0.18200
8 1 4.89 0.16580
9 1 5.36 0.15110
10 2 5.95 0.13770
11 2 6.52 0.12540
12 2 7.03 0.11430
13 2 7.48 0.10410
14 2 7.88 0.09489
15 3 8.38 0.08646
16 4 8.90 0.07878
17 4 9.40 0.07178
18 5 9.96 0.06541
19 5 10.51 0.05960
20 6 11.03 0.05430
21 7 11.52 0.04948
22 7 12.05 0.04508
23 8 12.58 0.04108
24 8 13.12 0.03743
25 8 13.59 0.03410
26 8 14.01 0.03107
27 8 14.39 0.02831
28 9 14.72 0.02580
29 9 15.01 0.02351
30 9 15.26 0.02142
31 9 15.49 0.01952
32 9 15.68 0.01778
33 9 15.85 0.01620
34 9 16.00 0.01476
35 9 16.13 0.01345
36 9 16.24 0.01226
37 9 16.34 0.01117
38 9 16.42 0.01018
39 9 16.49 0.00927
40 8 16.56 0.00845
41 8 16.61 0.00770
42 8 16.65 0.00701
43 8 16.69 0.00639
44 8 16.72 0.00582
45 8 16.75 0.00530
46 8 16.78 0.00483
...
and 19 more lines.
Training Set
caret::confusionMatrix(
data = table(
actual = cc50_train$y,
prediction = cc50_en$prediction
),
mode = "everything"
)
Prediction
Actual Nontoxic Toxic
Nontoxic 1714 1631
Toxic 720 5131
Accuracy : 0.7443
95% CI : (0.7353, 0.7532)
No Information Rate : 0.7353
P-Value [Acc > NIR] : 0.02527
Kappa : 0.4135
Sensitivity : 0.7042
Specificity : 0.7588
Pos Pred Value : 0.5124
Neg Pred Value : 0.8769
Precision : 0.5124
Recall : 0.7042
F1 : 0.5932
Prevalence : 0.2647
Detection Rate : 0.1864
Detection Prevalence : 0.3637
Balanced Accuracy : 0.7315
Prediction Set
caret::confusionMatrix(
data = table(
actual = cc50_test$y,
prediction = cc50_en$predict_new
),
mode = "everything"
)
Prediction
Actual Nontoxic Toxic
Nontoxic 536 588
Toxic 238 1708
Accuracy : 0.7309
95% CI : (0.7149, 0.7466)
No Information Rate : 0.7479
P-Value [Acc > NIR] : 0.985
Kappa : 0.3795
Sensitivity : 0.6925
Specificity : 0.7439
Pos Pred Value : 0.4769
Neg Pred Value : 0.8777
Precision : 0.4769
Recall : 0.6925
F1 : 0.5648
Prevalence : 0.2521
Detection Rate : 0.1746
Detection Prevalence : 0.3661
Balanced Accuracy : 0.7182
Training Set
caret::confusionMatrix(
data = table(
actual = cc50_train$y,
prediction = cc50_en$prediction
),
mode = "everything"
)
Prediction
Actual Nontoxic Toxic
Nontoxic 1714 1631
Toxic 720 5131
Accuracy : 0.7443
95% CI : (0.7353, 0.7532)
No Information Rate : 0.7353
P-Value [Acc > NIR] : 0.02527
Kappa : 0.4135
Sensitivity : 0.7042
Specificity : 0.7588
Pos Pred Value : 0.5124
Neg Pred Value : 0.8769
Precision : 0.5124
Recall : 0.7042
F1 : 0.5932
Prevalence : 0.2647
Detection Rate : 0.1864
Detection Prevalence : 0.3637
Balanced Accuracy : 0.7315
Prediction Set
caret::confusionMatrix(
data = table(
actual = cc50_test$y,
prediction = cc50_en$predict_new
),
mode = "everything"
)
Prediction
Actual Nontoxic Toxic
Nontoxic 536 588
Toxic 238 1708
Accuracy : 0.7309
95% CI : (0.7149, 0.7466)
No Information Rate : 0.7479
P-Value [Acc > NIR] : 0.985
Kappa : 0.3795
Sensitivity : 0.6925
Specificity : 0.7439
Pos Pred Value : 0.4769
Neg Pred Value : 0.8777
Precision : 0.4769
Recall : 0.6925
F1 : 0.5648
Prevalence : 0.2521
Detection Rate : 0.1746
Detection Prevalence : 0.3661
Balanced Accuracy : 0.7182
Just like ordinary logistic regression, almost nothing changes whether or not
you control the workflow (including cross-validation and hyperparameter tuning)
with tidymodels, except for a slight shift when the LASSO penalty is applied.
Either way, both LASSO and Elastic Net are arguably slightly better at
predicting the training and test sets than Ridge, with the two essentially
tied; but none of them beat the 2 RBF Network models, which reached about 80%
accuracy on both their Training and Prediction sets.
Other model: k-Nearest Neighbors
box::use(module_r/knn)
I stored the kNN functions in a single script, saved as knn.R inside a folder
named module_r. Hence the argument to the box::use function is module_r/knn,
with no alias, since the file name is already short. To view the source code
of module_r/knn, see this page.
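For readers unfamiliar with box, a minimal sketch of what such a module file can look like; the function name and body are illustrative, not the actual module contents:

# module_r/knn.R: functions tagged with #' @export form the module's interface
#' @export
fit_knn <- function(train, ks = 5L) {
  # kknn::train.kknn fits kNN over the supplied candidate neighborhood sizes
  kknn::train.kknn(y ~ ., data = train, ks = ks)
}

After box::use(module_r/knn), an exported function is called as knn$fit_knn(...).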
With tidymodels
I am impressed that kNN is actually faster than I expected. Here, keeping
mostly the defaults, I only let the function iterate over a few parameter
values, and I still got a better result. A sketch of the specification is
shown below.
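A minimal sketch, assuming parsnip's kknn engine, which wraps the kknn::train.kknn call printed below (the neighbor count is illustrative):

knn_spec <- parsnip::nearest_neighbor(neighbors = 5) |>
  parsnip::set_engine("kknn") |>
  parsnip::set_mode("classification")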
Actual result:
── Preprocessor ──────────────────────────────────────────
0 Recipe Steps
── Model ─────────────────────────────────────────────────
Call:
kknn::train.kknn(formula = ..y ~ ., data = data, ks = min_rows(6L, data, 5))
Training Set
caret::confusionMatrix(
data = table(
actual = cc50_train$y,
prediction = cc50_knn$prediction
),
mode = "everything"
)
Prediction
Actual Nontoxic Toxic
Nontoxic 3180 165
Toxic 119 5732
Accuracy : 0.9691
95% CI : (0.9654, 0.9726)
No Information Rate : 0.6413
Kappa : 0.9331
Sensitivity : 0.9639
Specificity : 0.9720
Pos Pred Value : 0.9507
Neg Pred Value : 0.9797
Precision : 0.9507
Recall : 0.9639
F1 : 0.9573
Prevalence : 0.3587
Detection Rate : 0.3458
Detection Prevalence : 0.3637
Balanced Accuracy : 0.9680
Prediction Set
caret::confusionMatrix(
  data = table(
    actual = cc50_test$y,
    prediction = cc50_knn$predict_new
  ),
  mode = "everything"
)
Prediction
Actual Nontoxic Toxic
Nontoxic 951 173
Toxic 153 1793
Accuracy : 0.8938
95% CI : (0.8824, 0.9045)
No Information Rate : 0.6404
P-Value [Acc > NIR] : <2e-16
Kappa : 0.7704
Sensitivity : 0.8614
Specificity : 0.9120
Pos Pred Value : 0.8461
Neg Pred Value : 0.9214
Precision : 0.8461
Recall : 0.8614
F1 : 0.8537
Prevalence : 0.3596
Detection Rate : 0.3098
Detection Prevalence : 0.3661
Balanced Accuracy : 0.8867
Without tidymodels workflow
Training Set
caret::confusionMatrix(
data = table(
actual = cc50_train$y,
prediction = cc50_knn$predictions
),
mode = "everything"
)
Prediction
Actual Nontoxic Toxic
Nontoxic 2977 368
Toxic 303 5548
Accuracy : 0.927
95% CI : (0.9215, 0.9323)
No Information Rate : 0.6433
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.8417
Sensitivity : 0.9076
Specificity : 0.9378
Pos Pred Value : 0.8900
Neg Pred Value : 0.9482
Precision : 0.8900
Recall : 0.9076
F1 : 0.8987
Prevalence : 0.3567
Detection Rate : 0.3237
Detection Prevalence : 0.3637
Balanced Accuracy : 0.9227
Prediction Set
caret::confusionMatrix(
data = table(
actual = cc50_test$y,
prediction = cc50_knn$predictions_new
),
mode = "everything"
)
Prediction
Actual Nontoxic Toxic
Nontoxic 914 210
Toxic 172 1774
Accuracy : 0.8756
95% CI : (0.8634, 0.887)
No Information Rate : 0.6463
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.73
Sensitivity : 0.8416
Specificity : 0.8942
Pos Pred Value : 0.8132
Neg Pred Value : 0.9116
Precision : 0.8132
Recall : 0.8416
F1 : 0.8271
Prevalence : 0.3537
Detection Rate : 0.2977
Detection Prevalence : 0.3661
Balanced Accuracy : 0.8679
With or without tidymodels, I still got a Training set accuracy above 90% and
a Prediction set accuracy above 85%, making kNN better than the previous 2
RBF Network models. It is an exceptional result either way, although the
numbers favor the tidymodels workflow.
Part VII.
Model Evaluation
Model Evaluation
The models will be evaluated on four metrics:
1. Accuracy
2. Precision
3. Recall
4. F1-Score
Metrics like Sensitivity and Specificity will be summarized using a Mosaic
plot and ROC-AUC curve plots.
Here is the summary table for Accuracy, Precision, Recall, and F1-Score:
This PDF is only a skeleton. Please read the online HTML version.
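Since the table itself lives in the HTML version, here is a minimal sketch of how such a summary can be computed with yardstick; the results data frame and its column names are illustrative assumptions:

# a metric set bundles the four metrics into a single callable
multi_metrics <- yardstick::metric_set(
  yardstick::accuracy,
  yardstick::precision,
  yardstick::recall,
  yardstick::f_meas
)
# results: a data frame with factor columns actual and prediction
multi_metrics(results, truth = actual, estimate = prediction)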
As you can see, the results of the 2 RBF Network models are the same whether
or not the workflow was controlled; however, the metrics shown are the ones I
obtained in Chapter 5 with the tidymodels workflows (not without them; I was
just showing you how that is done), since the models trained for Chapter 5
come from tidymodels.
Mosaic Plot
The mosaic plots of all the models, visualizing their confusion matrices, are
merged into 1 plot:
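The merged figure is in the HTML version. A minimal sketch of a single panel, using base R's mosaicplot on one of the confusion tables above (the model choice is illustrative):

# mosaic of the kNN test-set confusion matrix; tile areas reflect cell counts
cm <- table(
  actual = cc50_test$y,
  prediction = cc50_knn$predictions_new
)
mosaicplot(cm, color = TRUE, main = "k-Nearest Neighbors")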
ROC and AUC
The ROC-AUC curves (shown in the HTML version) are plotted for each model:
• RBF Network (1 Layer and 2 Layers)
• Logistic Regression
• Naive Bayes
• Random Forest
• k-Nearest Neighbors
A sketch of how one such curve can be produced follows.
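A minimal yardstick sketch for one model, assuming a results data frame with the true class actual and a predicted-probability column .pred_Nontoxic (both names are illustrative):

# roc_curve traces sensitivity/specificity across thresholds; roc_auc gives the area
roc_df <- yardstick::roc_curve(results, truth = actual, .pred_Nontoxic)
yardstick::roc_auc(results, truth = actual, .pred_Nontoxic)
ggplot2::autoplot(roc_df)  # quick plot of the curve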
Part VIII.
Summary and Conclusion
Most of the models other than the Radial Basis Function Neural Network (RBF
Network) outperform it, whether they are controlled by tidymodels or not. To
be fair, the RBF Network models were not trained over many parameter settings,
and my weak computational power made this model underperform. In conclusion,
this model is strong in training but weak in inference.