
CC50 Toxicity Classification using Radial Basis Function (RBF) Neural Network

An R implementation

Joshua Marie Ongcoy


Table of contents

I. INTRODUCTION
    Introduction
        Brief Introduction

II. Chapter 1: Libraries to be used
    Libraries

III. Chapter 2: Data and data frame frameworks
    Data
        How to load CSV files with 4 methods
        Rename
        Miscellaneous

IV. Chapter 3: Exploratory Data Analysis and Data Wrangling
    Data Wrangling and Exploratory Data Analysis
        Exploratory Data Analysis
            Features
            Label
        Data Wrangling
            Normalization and Coercion
            Data Splitting

V. Chapter 4: Radial Basis Neural Network
    Neural Network with Torch
        Data to torch readings
        Radial Basis Function (RBF) Neural Networks
            1 Layer
            2 Layers

VI. Chapter 5: Comparison to other models
    Other models
    Other model: Logistic Regression
        With tidymodels workflow (Training Set, Prediction Set)
        Without tidymodels workflow (Training Set, Prediction Set)
    Other model: Extreme Gradient Boosting
        With tidymodels workflow (Training Set, Prediction Set)
        Without tidymodels workflow (Training Set, Prediction Set)
    Other model: Support Vector Machine
        With tidymodels workflow (Training Set, Prediction Set)
        Without tidymodels workflow (Training Set, Prediction Set)
    Other model: Naive Bayes
        With tidymodels workflow (Training Set, Prediction Set)
        Without tidymodels workflow (Training Set, Prediction Set)
    Other model: Random Forest
        With tidymodels workflow (Training Set, Prediction Set)
        Without tidymodels workflow (Training Set, Prediction Set)
    Other model: Regularized Logistic Regression
        Penalty: Ridge (with and without tidymodels workflow)
        Penalty: LASSO (with and without tidymodels workflow)
        Penalty: Elastic Net (with and without tidymodels workflow)
    Other model: k-Nearest Neighbors
        With tidymodels (Training Set, Prediction Set)
        Without tidymodels workflow (Training Set, Prediction Set)

VII. Model Evaluation
    Model Evaluation
    Mosaic Plot
    ROC and AUC
        RBF Network (1 Layer, 2 Layers)
        Logistic Regression
        Extreme Gradient Boosting (XGBoost)
        Support Vector Machine (SVM)
        Naive Bayes
        Random Forest
        Penalized Logistic Regression
        k-Nearest Neighbors

VIII. Chapter 6: Summary and Conclusion
    Summary and Conclusion
Thanks for downloading this book.
Part I.

INTRODUCTION

Introduction

Brief Introduction

This documentation is one of my blog posts (see this page). It provides a reproducible workflow, drawn from my research on predicting CC50, the Median Cytotoxic Concentration, using Radial Basis Function (RBF) Neural Networks. This is a classification problem in bioinformatics: the machine learning task, here solved with neural networks, is to categorize the CC50 toxicity levels, a categorical outcome rather than a quantitative one. As the title suggests, I am going to evaluate how well the Radial Basis Function (RBF) Neural Network, or just RBF Network, performs on the CC50 dataset. Later, I will discuss the concepts and usage of RBF Networks.

The dataset has 9,196 observations in the training set and 3,070 observations in the test set, roughly a 75:25 split: 75% of the observations belong to the training set and 25% to the test set. I have to remind you that this data is confidential, so I cannot share or distribute the dataset, but I can show you what is going on inside it.
One of the mainstays of this document is box::use. What is it? Check box's documentation for more details. Essentially, it gives R its own "modular" system, something base R unfortunately lacks. Without it, I cannot use an R script or a folder as a module, and a modular system is the way to organize your code, especially when your codebase grows larger. If base R had this feature, I wouldn't need to create a whole package just to store and reuse my code.

In summary, this documentation shows the tools, techniques, and solutions for data science problems and how they were used in my study on classifying CC50 toxicity levels. All of it is done with R. If you want to see it done with Python, go to this page.

Part II.

Chapter 1: Libraries to be used

Libraries

I have plenty of libraries used to work on this machine learning problem. However, only 3 libraries are loaded, and those 3, found below, are the main essentials for the analysis. The rest are dependencies called with package::.

library(torch)
library(tidyverse)
library(data.table)

The reasons I chose these 3 packages as my dependencies for CC50 classification:

1. torch - although I don't strictly need to attach it, I load the entire namespace to teach you the process of using torch to classify toxicity with neural networks. It comes in handy for converting the dataset into torch tensors and for calling the neural network optimizers.

2. tidyverse - I used this package, paired with dtplyr, for data manipulation (see normalization) and Exploratory Data Analysis (see EDA).

3. data.table - an extra package for data manipulation. I used it just for data splitting, even though tidyverse/dplyr is already enough (see data splitting).


The additional (extra) packages are called through package:: in R:

1. duckdb - I use it to interface with DuckDB. For reading CSV files, it is the fastest tool I know of. But in order to interface with DuckDB, I have to create a connection instance with duckdb::dbConnect, and you have to write your own SQL scripts to work with duckdb's tables. Don't worry, the SQL code is shown later.

*Note: You can technically use DBI::dbConnect with duckdb::duckdb() as the driver, since dbConnect from duckdb is "borrowed" from DBI's namespace.
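A quick sketch of that equivalence; both calls below open the same kind of in-memory DuckDB connection (con1/con2 are illustrative names):

# duckdb re-exports DBI's dbConnect generic, so these are interchangeable
con1 <- duckdb::dbConnect(duckdb::duckdb())
con2 <- DBI::dbConnect(duckdb::duckdb())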

2. dtplyr - I pair this with dplyr for data manipulation. This package uses lazy evaluation, a technique that delays the execution of the code you write. It comes in handy if you want speed and memory efficiency with dplyr syntax: through R's non-standard evaluation, the dplyr code you write is translated into data.table syntax.

I wouldn't really call it "lazy evaluation", since the data does get evaluated eventually anyway.

3. box - with the use function, you can import a folder or an R script as if it were a module. This is similar to Python's module system, where you can access code from another folder or a Python script.
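A minimal sketch of box treating a script as a module; the script name and function here are hypothetical:

# module_r/helpers.R contains, e.g.:
# #' @export
# add_one <- function(x) x + 1

box::use(module_r/helpers[...])  # attach all exported objects
add_one(41)
#> [1] 42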

Part III.

Chapter 2: Data and data frame frameworks

Data

Initially, I just used tidyverse/readr and data.table to handle this data, but I really wanted to show you how to use R and SQL at the same time. For that, I chose DuckDB, because I was amazed by its speed, even though my initial choice was SQLite. Thus, I used the "extra" package duckdb to interface with DuckDB, just to load my data of 12,266 observations and 11 columns.
The description of the data:

Table 1.1.: CC50 Data Description: Descriptors used in the present QSTR-perturbation model

Name Description
DDV(me) Perturbation term that characterizes the change
of the molar volume between the NPs used in the
new output and reference states, also depending
on the measures of the toxic effects.
DDL(me) Perturbation term that accounts for the variation
of the size between the NPs used in the new and
reference states, also depending on the measures
of the toxic effects.


Name Description
DDµ1(ATO)bt Perturbation term that describes the difference of
the spectral moment of order 1 (weighted by the
atomic weight) between the NPs used in the new
and reference states, also depending on the
biological systems.
DDµ3(POL)ns Perturbation term that characterizes the change
of the spectral moment of order 3 (weighted by
the polarizability) between the NPs used in the
new and reference states, also depending on the
shapes of the NPs.
DDE(dm) Perturbation term that describes the variation of
the electronegativity between the NPs used in the
new and reference states, also depending on the
conditions under which the sizes of the NPs were
measured.
DDµ3(VAN)ta Perturbation term that accounts for the difference
of the spectral moment of order 3 (weighted by
the atomic van der Waals radius) between the
NPs used in the new and reference states, also
depending on the exposure times.
DDµ2(ATO)ta Perturbation term that characterizes the change
of the spectral moment of order 2 (weighted by
the atomic weight) between the NPs used in the
new and reference states, also depending on the
exposure times.
DGµ2(HYD)sc Perturbation general spectral moment of order 2
weighted by the hydrophobicity, which accounts
for the difference between the chemical structures
of the coating agents used in the new and
reference states.


Name Description
DGµ5(PSA)sc Perturbation general spectral moment of order 5
weighted by the polar surface area, which
characterizes the difference between the chemical
structures of the coating agents used in the new
and reference states.
Series Column used to index which rows are for Training
or Prediction, used for data splitting in the
dataset.
TEi(cj)_rf Dummy classification variable describing the toxic
effect of the NP used in the reference state.

How to load CSV files with 4 methods

I ran 4 trials to see which library is the fastest at reading CSV files.

First Trial: This is how you read your data with duckdb::read_csv_duckdb:

duckdb_con <- duckdb::dbConnect(duckdb::duckdb())


duckdb::read_csv_duckdb(
con = duckdb_con,
file = "data/cc50.csv",
table = "cc50"
)

Again, I chose DuckDB because it reads CSV files faster than any library I know. Also, don't forget to encode your CSV file in UTF-8; otherwise, duckdb::read_csv_duckdb can't read it.

Second Trial: I chose tidyverse's readr:


cc50 <- read_csv("data/cc50.csv")

Third Trial: Using data.table:

cc50 <- fread("data/cc50.csv")

Fact: Both readr and data.table are not strict about encoding, unlike DuckDB, where you need to set the CSV encoding to UTF-8 before DuckDB can read your file.

Fourth Trial: Using polars:

cc50 <- polars::pl$read_csv("data/cc50.csv")

The speed (from the result in bench::mark):

• DuckDB: 55.4ms
• readr::read_csv(): 79.5ms
• data.table::fread(): 5.45ms
• polars::pl$read_csv(): 6.2ms
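For reference, a sketch of how such a benchmark could be run with bench::mark; the DuckDB trial is omitted here because it needs a live connection, and result checking is disabled since the readers return different classes:

bench::mark(
  readr      = readr::read_csv("data/cc50.csv", show_col_types = FALSE),
  data.table = data.table::fread("data/cc50.csv"),
  polars     = polars::pl$read_csv("data/cc50.csv"),
  check = FALSE  # each reader returns a different data frame class
)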

And thus, I chose data.table in this case because of its speed; not gonna lie, it is not just faster but also more readable and a no-brainer.

cc50 <- fread("data/cc50.csv")

Like polars, I was able to read my data in about 5 milliseconds with just 1 line.


Rename

I want to rename the variables used in the analysis, including all the features and the label. These are the before and after column names:

Table 1.2.: CC50 Data Description: Before and After column names
Name Rename
DDV(me) x1
DDL(me) x2
DDµ1(ATO)bt x3
DDµ3(POL)ns x4
DDE(dm) x5
DDµ3(VAN)ta x6
DDµ2(ATO)ta x7
DGµ2(HYD)sc x8
DGµ5(PSA)sc x9
Series Series
TEi(cj)_rf y

As you can see, it would be difficult to keep my data analysis in sync with these column names. Thus, I have to rename them.

I ran trials again to see which of 3 libraries is the fastest at renaming the columns used in the analysis. Why only 3 libraries? Because renaming columns with polars in R is too cumbersome.

First Trial: with DuckDB’s SQL:

First, I need to write the SQL code in a separate script. Don't worry, I have a function that retrieves the table produced by the SQL as a data frame (or you can retrieve the data frame from the SQL code chunk itself). That function is called sql_res_query and it lives in a module named retrieve_df; the module is imported with box::use so that the function itself can be used directly.

This is the SQL code:

SELECT
"DDV.me." AS x1,
"DDL.me." AS x2,
"DDµ1.ATO.bt" AS x3,
"DDµ3.POL.ns" AS x4,
"DDE.dm." AS x5,
"DDµ3.VAN.ta" AS x6,
"DDµ2.ATO.ta" AS x7,
"DGµ2.HYD.sc" AS x8,
"DGµ5.PSA.sc" AS x9,
"Series",
"TEi.cj._rf" AS y
FROM cc50

I saved that SQL code in a script named rename.sql so that I don't lose track of it. And this is how you retrieve the result using sql_res_query from the retrieve_df module. To access the module, go to this page.

box::use(module_r/retrieve_df[...])
cc50 <- sql_res_query(conn = duckdb_con, "rename.sql")

You can even do this:

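For example, you can pass the SQL string directly to DBI::dbGetQuery instead of reading it from rename.sql (a sketch of the idea; the full column list is as in rename.sql):

cc50 <- DBI::dbGetQuery(duckdb_con, '
  SELECT
    "DDV.me." AS x1,
    "DDL.me." AS x2,
    -- ... remaining columns as in rename.sql ...
    "TEi.cj._rf" AS y
  FROM cc50
')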

But, I can’t do this because this would be difficult to benchmark


Second Trial: With dplyr:

cc50 |>
rename_with(
~ c(paste0("x", 1:9), "y"),
c(1:9, 11)
)

Third Trial: With data.table:

setnames(cc50, names(cc50)[c(1:9, 11)], c(paste0("x", 1:9), "y"))

Benchmark:

• DuckDB: 18.7 ms
• dplyr: 824 µs
• data.table: 57.6 µs

I already showed you the dplyr code for renaming the variables, so readability is not an issue. My issue is speed, and thus I choose the data.table syntax:


setnames(cc50, names(cc50)[c(1:9, 11)], c(paste0("x", 1:9), "y"))

And, you don’t have to assign it into another variable because the change
happens by reference. Hence, once you ran that data.table syntax (like
one in above), the change is already made. And therefore, we’re done
renaming the columns.

Miscellaneous

I know this won't be of much use in this analysis, but I want to show you how data is stored into an SQL connection. With DBI::dbWriteTable, you don't need to do this:

CREATE TABLE table_name (


column1 datatype constraints,
column2 datatype constraints,
column3 datatype constraints,
...
);

INSERT INTO table_name (column1, column2, column3, ...)


VALUES
(value1, value2, value3, ...),
(value1, value2, value3, ...),
...;

This is the equivalent when using R with DBI package:

DBI::dbWriteTable(conn, "table_name", table_name)


Fact: The DBI::dbWriteTable function automatically performs the CREATE TABLE ... and INSERT INTO statements for you. And this is how you handle data with R and SQL.
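A quick round trip to confirm the table landed, using DBI's reader (the table name is illustrative):

DBI::dbWriteTable(duckdb_con, "cc50_copy", cc50)
DBI::dbReadTable(duckdb_con, "cc50_copy") |> head()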

Part IV.

Chapter 3: Exploratory Data Analysis and Data Wrangling

Data Wrangling and Exploratory Data Analysis

As far as I know, the basics of data analysis in Data Science have 3 parts: Data Wrangling, Exploratory Data Analysis (EDA), and Feature Engineering. In my case, the EDA comes first, and then the Data Wrangling. The Feature Engineering is done after the training process of the ML model.

Exploratory Data Analysis

Features

Since I can’t share the data to you, I will explore the data for you so that
at least, you’ll understand what’s going on with the data.
Before the start of the analysis, make sure to check the quality of your
data. Start with the inspection of missing values.

cc50 |>
summarise(
across(
everything(),
~ sum(is.na(.x))
)
)


x1 x2 x3 x4 x5 x6 x7 x8 x9 Series y
1 0 0 0 0 0 0 0 0 0 0 0

Great! The data has no missing values, so there is no need to use drop_na to drop rows with NA.
Here, you can get a glimpse of the data through descriptive statistics. The data has 2 parts used for ML classification: Training and Prediction.

Again, I can't share the data with you, but this is the least I can do: summary statistics in an HTML table. (The interactive summary table is available only in the online HTML version, where you can page through the columns or search for a variable such as "x1" or "x2".)

I chose a boxplot (layered with violin and sina plots) to visualize the features of the dataset. Here is the plot:

cc50 |>
mutate(
Series = factor(
Series,
levels = c("Training", "Prediction")
)
) |>
gather(key = "variable", value = "value", -Series, -y) |>
ggplot(aes(x = variable, y = value, fill = Series)) +
facet_wrap(~ variable, scales = "free") +
geom_violin(
aes(color = Series),
alpha = 0.7, # With transparency: 0.07; No transparency: 0.7
position = position_dodge(width = 0.9),


width = 0.8
) +
geom_boxplot(
aes(color = Series),
outlier.size = 2,
width = 0.4,
position = position_dodge(width = 0.9)
) +
ggforce::geom_sina(
alpha = 0.08,
aes(color = Series),
position = position_dodge(width = 0.9)
) +
scale_fill_manual(values = c("Training" = "#6FDCE3", "Prediction" = "#D5ED9F")) +
scale_color_manual(values = c("Training" = "#6FDCE3", "Prediction" = "#D5ED9F")) +
theme_minimal() +
labs(
title = "Features' Distribution",
x = "Variables", y = "Sizes (in �m)"
) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(
color = "#0099f8", size = 12, face = "bold", hjust = 0.5
),
axis.title.x = element_text(
color = "blue", size = 9, face = "bold"
),
axis.title.y = element_text(size = 9, face = "italic"),
plot.caption = element_text(face = "italic"),
text = element_text(family = "Times New Roman")
)


Based on that plot, in every feature except x5 and x6, many observations fall outside the range of the whiskers. There is a lot of noise in the data.

Label

For the dependent variable y, I will summarize it through frequency statistics. I know this is done again later, but I use it here so that you know the counts of each label in the data.

By Labels:

cc50 |>
mutate(
y = case_when(
y == -1 ~ "Nontoxic",
y == 1 ~ "Toxic"
)


) |>
count(y)

y n
<char> <int>
1: Nontoxic 4469
2: Toxic 7797

The counts are: 4469 Nontoxic and 7797 Toxic.


By Series:

cc50 |>
mutate(
y = case_when(
y == -1 ~ "Nontoxic",
y == 1 ~ "Toxic"
)
) |>
group_by(Series) |>
count(y)

# A tibble: 4 x 3
# Groups: Series [2]
Series y n
<chr> <chr> <int>
1 Prediction Nontoxic 1124
2 Prediction Toxic 1946
3 Training Nontoxic 3345
4 Training Toxic 5851

So, for the summary of counts:


matrix(
c(5851, 3345,
1946, 1124),
nrow = 2,
byrow = T,
dimnames = list(
c("Training", "Prediction"),
c("Toxic", "Nontoxic")
)
) |> knitr::kable()

Toxic Nontoxic
Training 5851 3345
Prediction 1946 1124

• Training Set: 3345 Nontoxic and 5851 Toxic


• Prediction Set: 1124 Nontoxic and 1946 Toxic

The bar plot:

(cc50 |>
mutate(
Toxicity = case_when(
y == -1 ~ "Nontoxic",
y == 1 ~ "Toxic"
)
) |>
group_by(Series) |>
mutate(total = n()) |>
group_by(Series, Toxicity, total) |>
summarise(n = n(), .groups = "drop") |>


mutate(percentage = n / total * 100) |>


ggplot(aes(x = Series, y = n, fill = Toxicity)) +
geom_bar(stat = "identity", width = 0.6) +
geom_text(aes(
label = paste(n, paste0("(", round(percentage, 1), "%)")),
group = Toxicity,
color = Toxicity # Map color to Toxicity for text
),
family = "Times New Roman", # Apply Times New Roman font
position = position_stack(vjust = 0.5),
show.legend = FALSE # Disable legend for geom_text
) +
scale_fill_manual(
values = c("Toxic" = "#AAB396", "Nontoxic" = "#F7EED3"),
breaks = c("Toxic", "Nontoxic")
) +
scale_color_manual( # Define colors for the text labels
values = c("Toxic" = "#3C3D37", "Nontoxic" = "#229799")
) +
scale_x_discrete(
labels = function(x) {
paste0(x, "\nTotal: ", {cc50 |>
count(Series) |>
deframe()}[x])
}
) +
theme_minimal() +
theme(
legend.position = "top",
plot.title = element_text(color = "#7695FF", hjust = 0.5, size = 15),
text = element_text(family = "Times New Roman"),
axis.title.x = element_text(size = 14, face = "bold"),
axis.title.y = element_text(size = 14, face = "italic"),


axis.text.y = element_text(hjust = 0, size = 12, face = "italic"),


axis.text.x = element_text(hjust = 0, size = 12),
axis.text = element_text(family = "Times New Roman")
) +
coord_flip() +
labs(
x = "Series",
y = "Count",
title = "Counts of Toxicity Labels by Series"
))


Summary: The data has huge variation. Therefore, the solution here is to normalize the features through standardization and to re-code the y variable into a factor/integer.


Data Wrangling

When I discovered that the dataset has large variation, I decided that the features would be normalized. The dataset is already in shape, in tidy format, and already renamed. What remains in this part is to normalize the data through standardization with the mutate function, and to split the data with data.table's [ method.

Normalization and Coercion

In this part, I normalize all the features (independent variables) through standardization and coerce the label (dependent variable) into a factor.

A small recap of standardization:

𝑧 = (𝑥 − 𝜇) / 𝜎

where 𝑥 is the feature, 𝜇 is the mean of that feature, 𝜎 is the standard deviation of that feature, and 𝑧 is the resulting z-score.
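As a quick check that R's scale() computes exactly this z-score:

x <- c(2, 4, 6, 8)
all.equal(as.numeric(scale(x)), (x - mean(x)) / sd(x))
#> [1] TRUE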
Previously, I used dplyr for normalization via the mutate function, and it is reasonably fast for my dataset of 12,266 records. I then ran several trials to find a faster way of normalizing the data. This is the original dplyr code for normalization:

cc50 |>
mutate(
across(
starts_with('x'),
scale


)
)

What I did here: with the across function, I selected the columns that start with "x" and applied standardization to each of them through the scale function (we could also have used mutate_if). But I was not satisfied with its speed, even though this was already fast. Thus, I ran several trials that might speed up the normalization, starting with duckdb and then dtplyr. The trials:

i. Initially, I went with DuckDB:

cc50 <- sql_res_query(


conn = duckdb_con,
sql_path = "normalize.sql"
)

where normalize.sql contains:

SELECT
(x1 - avg(x1) OVER ()) / stddev(x1) OVER () AS x1,
(x2 - avg(x2) OVER ()) / stddev(x2) OVER () AS x2,
(x3 - avg(x3) OVER ()) / stddev(x3) OVER () AS x3,
(x4 - avg(x4) OVER ()) / stddev(x4) OVER () AS x4,
(x5 - avg(x5) OVER ()) / stddev(x5) OVER () AS x5,
(x6 - avg(x6) OVER ()) / stddev(x6) OVER () AS x6,
(x7 - avg(x7) OVER ()) / stddev(x7) OVER () AS x7,
(x8 - avg(x8) OVER ()) / stddev(x8) OVER () AS x8,
(x9 - avg(x9) OVER ()) / stddev(x9) OVER () AS x9,
Series,


y
FROM cc50_new;

It’s fast, but not faster than the next trial.

ii. Here, I recycle the dplyr code from before, but this time the cc50 dataset is wrapped in dtplyr's lazy_dt. The way you write the normalization code is the same as with dplyr, but it is powered by data.table for speed.

cc50 |>
dtplyr::lazy_dt() |>
mutate(
across(
starts_with('x'),
scale
)
)

I have to create a “lazy” data.table to get a faster execution. And then,


I realize that the code above is actually faster than DuckDB’s execution.
I already tried incorporating with changing the labels into factors. Since,
dtplyr is actually faster than the method that uses DuckDB for normal-
ization. Thus, I used the result from that code: the normalization and
conversion of label of cc50 into factor is done with dtplyr.

iii. Third Trial: Using tidytable, but instead of scale I use a purrr-style lambda.

cc50 |>
tidytable::as_tidytable() |>
mutate(
across(


starts_with('x'),
~ (.x - mean(.x)) / sd(.x)
)
)

When it comes to normalizing the data alone, the tidytable package is the fastest, but I had to reconsider it: when I add the conversion of the y values into factors, tidytable becomes slower. According to the benchmark, dtplyr is slower than tidytable for normalization alone, but once the coercion of y into a factor is included, tidytable falls behind dtplyr. Since the objective of this wrangling step is to normalize the features (independent variables) and to coerce the y variable, I disregard dplyr and tidytable for speed, disregard duckdb because of the coercion, and settle on dtplyr as the "engine".

cc50_label_name <- c("Nontoxic","Toxic")


cc50 <-
cc50 |>
dtplyr::lazy_dt() |>
mutate(
across(
starts_with('x'),
scale
),
y = factor(y, labels = cc50_label_name)
)

I considered other data manipulation libraries as well, ones I didn't include in the speed trials, but I chose these because they balance readability, speed, and the ability to coerce. For instance, I tried polars, because it is very fast (according to H2O's db-benchmark), but when I used it on the CC50 data frame for normalization, it was fast yet not as fast as the 2 data frame libraries above, and I couldn't even coerce the y variable into a factor.

Data Splitting

After normalizing the data, the splitting is done with data.table. There are other ways to split a dataset, such as dplyr's sample_n() and sample_frac() functions combined with anti_join() to gather the rows not captured by sample_frac(). But I prefer the data.table indexing method, and it's faster. It is important to note that data.table's [ method is not the usual [ indexing of base R's data frames; it is more like SQL expressed over R's data frames, in a functional style.

The general structure of [ in data.table looks like this:

DT[𝑖, 𝑗, 𝑏𝑦]

Where

• DT stands for the data frame of class data.table.
• 𝑖 indexes the rows, often used to filter rows based on conditions.
• 𝑗 indexes the columns, often wrapped in .() to return the selected columns; it can be one or several columns (without .(), a single column is returned as a vector).
• 𝑏𝑦 indexes the grouping, equivalent to dplyr's group_by and SQL's GROUP BY.

Furthermore, like dplyr, data.table uses non-standard evaluation (NSE), a LISP-y feature of R, resulting in less boilerplate syntax.
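A tiny illustration of the three slots on a toy table (illustrative data, not the CC50 data frame):

library(data.table)
dt <- data.table(g = c("a", "a", "b"), v = 1:3)
dt[v > 1]                        # i: filter rows
dt[, .(total = sum(v))]          # j: compute on columns
dt[, .(total = sum(v)), by = g]  # by: grouped computation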


And thus, I first have to convert the cc50 data frame into a data.table via as.data.table(), then go through data.table's splitting process, by Series.

split_data <- as.data.table(cc50)[, .(list(.SD)), by = Series]


cc50_train <- split_data[Series == "Training"]$V1[[1]]
cc50_test <- split_data[Series == "Prediction"]$V1[[1]]

In addition, this process automatically omits the Series column that indexes the rows as Training or Prediction. Now that the normalization and the splitting of the dataset into training and test sets are done, we can proceed to the modelling part. Moreover, if you are done with DuckDB as well, you have to disconnect the connection with the duckdb::dbDisconnect / DBI::dbDisconnect function.

duckdb::dbDisconnect(duckdb_con)

Part V.

Chapter 4: Radial Basis Neural Network

Neural Network with Torch

As I said in the introduction, I'll be fitting a Radial Basis Function (RBF) Network to the data. What I didn't mention is that this model is an experiment to see how well RBF Networks perform on the classification of CC50 toxicity, even though I had already tried a standard feedforward neural network (a Multilayer Perceptron (MLP) with the ReLU activation function for classification) and it worked well.

Data to torch readings

It’s crucial to convert your data frames into Torch tensor objects to ensure
compatibility with the torch library. To do this, use the dataset function
from the torch namespace, which extends the R6 class of torch::dataset
for inheritance. This allows the dataset method to work seamlessly with
my data frames.

cc50_dataset <- dataset(


name = "cc50_dataset",
initialize = function(df) {
self$x <- df |>
select(starts_with('x')) |>
as.matrix() |>
torch_tensor()
self$y <- torch_tensor(
as.numeric(df$y)


)$to(torch_long())
},
.getitem = function(i) {
list(x = self$x[i, ], y = self$y[i])
},
.length = function() {
dim(self$x)[1]
}
)

Within the dataset instance code:

• initialize: prepares the data; it selects the relevant features (independent variables) and the label (dependent variable), then converts them into torch tensors.

Extra items:

• .getitem: retrieves the i-th observation as a list containing both the input features (x) and the label (y).

• .length: returns the total number of observations in the dataset.

Then, wrap the data frames you have with the cc50_dataset generator built from the dataset method.

cc50_train_torch <- cc50_dataset(cc50_train)


cc50_test_torch <- cc50_dataset(cc50_test)

If you've done this as well, then you have successfully wrapped your data frames with the dataset generator (here saved as the cc50_dataset class) and converted them into torch tensors. A quick sanity check is sketched below.
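Calling the methods we defined on the dataset directly:

cc50_train_torch$.length()    # total number of training observations
cc50_train_torch$.getitem(1)  # first observation: list(x = ..., y = ...)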


Radial Basis Function (RBF) Neural Networks

The Radial Basis Function (RBF) Neural Network, or just RBF Network, is essentially a typical Feedforward Neural Network (FFNN), except that instead of the standard activation functions used in FFNNs (ReLU, sigmoid, softplus, etc.), it uses radial basis functions. When an RBF is applied, its centres and shapes are determined.

𝜙(𝑥) = exp(−||𝑥 − 𝑐||² / (2𝜎²))

where:

• 𝑥 is the feature, an input vector.


• 𝑐 is the centre parameter of the RBF.
• 𝜎 is the scale parameter of the RBF.

Yes, the expression above resembles standardization. Note that ||𝑥 − 𝑐||² in the exponent is a squared distance; to get the actual distance, we would take the square root of it, while 𝜎 controls the width of the basis function.
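A minimal sketch of the Gaussian basis function above, in base R with illustrative inputs (not the module's code):

rbf_gaussian <- function(x, centre, sigma) {
  exp(-sum((x - centre)^2) / (2 * sigma^2))
}
rbf_gaussian(c(1, 2), centre = c(0, 0), sigma = 1)
#> [1] 0.082085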
When you get to 1 Layer and 2 Layers, notice how I obtain RBFNetwork even though it doesn't exist in the torch namespace and isn't manually created anywhere within this post/website. Don't misunderstand: I have a separate R script, in the .../module_r/torch_rbf.R directory, that contains this function, and I let R reuse the code from .../module_r/torch_rbf.R as a module, using box::use with module_r/torch_rbf as the unquoted argument. As I said in the introduction, and as I did in Chapter 2 and Chapter 3, this is the way to access or reuse the code of a particular R script or folder as a module imported via box::use. If you add [...] after the module, you get access to all of its (exported) functions. Go to this page to access the source code. I do this to show you the capabilities of R as a programming language.


For training the neural networks, I apply the cross-entropy loss function (the classification one, to be exact) and the Adam algorithm as the optimizer. The two work in tandem during estimation.

1 Layer

For the 1-layer RBF Network, the process is not wrapped into a single function, so that I can explain step by step how an RBF Network is trained on the CC50 dataset.

box::use(./module_r/torch_rbf[...])

Once the module is imported, you can access the RBFNetwork function found inside the torch_rbf module. The hidden layer here has 52 nodes; note that you can still increase the number of neurons as much as you want (and as much as your hardware allows).

rbf_gauss_model <- RBFNetwork(


in_features = 9,
num_classes = 2,
num_rbf = 52,
basis_func = "gaussian"
)

loss_fn <- nn_cross_entropy_loss()


optimizer <- optim_adam(rbf_gauss_model$parameters, lr = 0.001)

The cross-entropy loss function measures the difference between the pre-
dicted class probabilities and the true class labels, while the ADAM opti-
mizer adjusts the model’s weights iteratively to minimize this loss. Both
functions are executed in tandem during training to optimize the network’s
performance.


Training Set

What happens during the training?

• After setting up the neural network architecture for classification, in the training loop the optimizer resets its gradients (optimizer$zero_grad()) at the start of each epoch, and the model makes predictions on the training data (rbf_gauss_model(cc50_train_torch$x)).

• The loss is calculated using the cross-entropy loss function (loss_fn)


by comparing the model’s predictions with the actual labels. Then,
the backward() function performs backpropagation to compute the
gradients, and the optimizer$step() updates the model’s parame-
ters based on these gradients.

• The training process ensures that the neural network learns to


minimize classification errors by continuously adjusting its weights
with the ADAM optimizer while measuring performance with
cross-entropy loss.

I set the number of epochs to 1,200, meaning the model is trained over 1,200 epochs, with progress printed every 100 epochs (the current epoch and the loss).

epochs <- 1200


for (epoch in 1:epochs) {
optimizer$zero_grad()
cc50_train_out <- rbf_gauss_model(cc50_train_torch$x)
loss1_mod <- loss_fn(cc50_train_out, cc50_train_torch$y)
loss1_mod$backward()
optimizer$step()

if (epoch %% 100 == 0) {
cat("Epoch:", epoch, "Loss:", as.numeric(loss1_mod), "\n")


}
}

This is what happens during the training process:

Epoch: 100 Loss: 0.6859608


Epoch: 200 Loss: 0.6239529
Epoch: 300 Loss: 0.5616919
Epoch: 400 Loss: 0.5109196
Epoch: 500 Loss: 0.4784966
Epoch: 600 Loss: 0.4558362
Epoch: 700 Loss: 0.4403452
Epoch: 800 Loss: 0.4298643
Epoch: 900 Loss: 0.4218033
Epoch: 1000 Loss: 0.4149215
Epoch: 1100 Loss: 0.4085923
Epoch: 1200 Loss: 0.4032557

After training the model, this is what I did:

• The predictions are stored in the cc50_train_out torch tensor. Retrieve it; since it is a torch object, coerce it into an R array via as.array (or torch's as_array function) so that you can work with the result in plain R.

• The result is a 2-dimensional array. You can extract the predicted class by mapping over the rows (observations) with apply (there are other options; purrr's framework can deal with this too), and you obtain a series of predictions as 1s and 2s (in Python you would get 0s and 1s, since Python's arrays start at 0, unlike R's). In the apply call, don't forget to pass 1 for the margin, since the array is 2-dimensional and 1 selects the row margin.


• After that, replace the 1s with "Nontoxic" and the 2s with "Toxic" by indexing the label names, cc50_label_name, with the mapped array; then build a confusion matrix as a matrix or a table (I chose table, which is more convenient), and pass it to caret::confusionMatrix via the data argument with mode = "everything" to show all the classification metrics.

cc50_train_pred <- apply(as.array(cc50_train_out), 1, which.max)


predicted_cc50 <- as.factor(cc50_label_name[cc50_train_pred])

caret::confusionMatrix(
data = table(actual = cc50_train$y, predictions = predicted_cc50),
mode = 'everything'
)

Confusion Matrix and Statistics

Prediction
Actual Nontoxic Toxic
Nontoxic 2268 1077
Toxic 571 5280

Accuracy : 0.8208
95% CI : (0.8128, 0.8286)
No Information Rate : 0.6913
P-Value [Acc > NIR] : < 2.2e-16

Kappa : 0.5999

Mcnemar's Test P-Value : < 2.2e-16

Sensitivity : 0.7989


Specificity : 0.8306
Pos Pred Value : 0.6780
Neg Pred Value : 0.9024
Precision : 0.6780
Recall : 0.7989
F1 : 0.7335
Prevalence : 0.3087
Detection Rate : 0.2466
Detection Prevalence : 0.3637
Balanced Accuracy : 0.8147

'Positive' Class : Nontoxic

The accuracy is 82.08%, the sensitivity is 79.89%, the specificity is 83.06%, and the F1 score is 73.35%. All the metrics are above 70%, with accuracy above 80%, so I can say the 1-layer RBF Network is satisfactory on the Training set. Let's see how well the RBF Network performs on the Prediction set (press the Prediction Set tab).

Prediction Set

Since rbf_gauss_model is already trained, the estimated parameters, or simply the weights, are retained. What we do here is plug the features of CC50's prediction set into the model and repeat the prediction steps from the Training Set part.

cc50_test_pred <- apply(


as.array(rbf_gauss_model(cc50_test_torch$x)), 1, which.max
)


predicted_cc50_test <- as.factor(cc50_label_name[cc50_test_pred])

caret::confusionMatrix(
data = table(
actual = cc50_test$y,
predictions = predicted_cc50_test
),
mode = 'everything'
)

Confusion Matrix and Statistics

Prediction
Actual Nontoxic Toxic
Nontoxic 703 421
Toxic 188 1758

Accuracy : 0.8016
95% CI : (0.7871, 0.8156)
No Information Rate : 0.7098
P-Value [Acc > NIR] : < 2.2e-16

Kappa : 0.553

Mcnemar's Test P-Value : < 2.2e-16

Sensitivity : 0.7890
Specificity : 0.8068
Pos Pred Value : 0.6254
Neg Pred Value : 0.9034
Precision : 0.6254
Recall : 0.7890
F1 : 0.6978


Prevalence : 0.2902
Detection Rate : 0.2290
Detection Prevalence : 0.3661
Balanced Accuracy : 0.7979

'Positive' Class : Nontoxic

The accuracy is 80.16%, the sensitivity is 78.90%, the specificity is 80.68%, and the F1 score is 69.78%. Most metrics are above 70%, with accuracy above 80%, so I can say the 1-layer RBF Network is satisfactory on the Prediction set as well. And since the Training and Prediction set accuracies are both satisfactory without a huge gap between them, the 1-layer RBF Network is not overfitting the CC50 dataset.

2 Layers

The 1-layer RBF Network did pretty well at classifying the CC50 toxicity levels; this time, I add 1 more layer, for 2 hidden layers, to see how well that does. Is it better than the 1-layer RBF Network? Let's find out.

Note: the details were already discussed in the previous part. So in this part, the torch_rbf_2layer module has a function that combines the whole workflow explained in the first part of this chapter, including the prediction step for new data.

Thus, the module this time:

box::use(./module_r/torch_rbf_2layer)

is composed of functions, under the name torch_rbf_2layer, that give you the model and the predictions.


Training Set

Here, I can’t put box::use(./module_r/torch_rbf_2layer[...])


like this to the code chunk where I load the namespace of that mod-
ule, because it might conflict with the previous module, where it has
fit_rbf_nn, as well. Regardless, I store the R script into a module
named torch_rbf_2layer, and with $, I can access the namespace from
that module.
If you see the source code, for the RBF Network with 2 layers, the first
hidden layer has RBF as the activation function while the second layer
has ReLu activation function.
Thus, the result:

cc50_train_rbf <- torch_rbf_2layer$fit_rbf_nn(


train_data = cc50_train,
label_name = cc50_label_name,
num_rbf1 = 38,
hidden_size = 37
)

caret::confusionMatrix(
data = table(
actual = cc50_train$y,
predictions = cc50_train_rbf$preds
),
mode = 'everything'
)

Confusion Matrix and Statistics

Prediction
Actual Nontoxic Toxic


Nontoxic 2671 674


Toxic 577 5274

Accuracy : 0.864
95% CI : (0.8568, 0.8709)
No Information Rate : 0.6468
P-Value [Acc > NIR] : < 2.2e-16

Kappa : 0.7043

Mcnemar's Test P-Value : 0.006644

Sensitivity : 0.8224
Specificity : 0.8867
Pos Pred Value : 0.7985
Neg Pred Value : 0.9014
Precision : 0.7985
Recall : 0.8224
F1 : 0.8103
Prevalence : 0.3532
Detection Rate : 0.2905
Detection Prevalence : 0.3637
Balanced Accuracy : 0.8545

'Positive' Class : Nontoxic

Prediction Set

Access the test_predict function from the torch_rbf_2layer module to create predictions for the "new" dataset. Thus, the result:


cc50_test_rbf <- torch_rbf_2layer$test_predict(


object = cc50_train_rbf,
test_data = cc50_test,
label_name = cc50_label_name
)

caret::confusionMatrix(
data = table(
actual = cc50_test$y,
predictions = cc50_test_rbf$preds
),
mode = 'everything'
)

Confusion Matrix and Statistics

Prediction
Actual Nontoxic Toxic
Nontoxic 839 285
Toxic 215 1731

Accuracy : 0.8371
95% CI : (0.8236, 0.85)
No Information Rate : 0.6567
P-Value [Acc > NIR] : < 2e-16

Kappa : 0.6444

Mcnemar's Test P-Value : 0.00203

Sensitivity : 0.7960
Specificity : 0.8586
Pos Pred Value : 0.7464


Neg Pred Value : 0.8895


Precision : 0.7464
Recall : 0.7960
F1 : 0.7704
Prevalence : 0.3433
Detection Rate : 0.2733
Detection Prevalence : 0.3661
Balanced Accuracy : 0.8273

'Positive' Class : Nontoxic

If you wonder why this is so slow, for both the 1-layer and 2-layer RBF Networks, I suspect the cause is the inference step of the RBF network. From the literature I have read, RBF Networks are indeed slow at inference, though not during training.
So, with the current structure of the 1-layer and 2-layer RBF Networks, the accuracy values are:

1. 1-layer

- Training Set: 82.08%

- Prediction Set: 80.16%

2. 2-layer

- Training Set: 86.40%

- Prediction Set: 83.71%

The accuracies could still be increased by adding more parameters; if I had more computational power, I might do that.

Part VI.

Chapter 5: Comparison to other models

Other models

The accuracy of those 2 neural networks is still high despite the huge variance in the CC50 data. However, the big drawback of the neural network models is training time: it is slow, and you need a large number of parameters to train.

Now, the question is: how do they compare to other models?
These are the list of models/algorithms to be compared:

1. Logistic Regression
2. Extreme Gradient Boosting (XGBoost)
3. Support Vector Machine
4. Naive Bayes
5. Random Forest
6. Penalized/Regularized Logistic Regression
7. K-nearest Neighbors (kNN)

As with the previous 2 RBF Network models, after applying each of these models the confusion matrix is computed with the caret::confusionMatrix function. The visualization of the predictions/classifications is done in Chapter 6.

Other model: Logistic Regression

Logistic Regression is one of the simplest methods that can be applied to a machine learning problem. It is a classical statistical method that predicts not a quantitative outcome but a categorical one, most often binary (e.g., yes/no). Why a categorical outcome? Logistic regression is actually a Generalized Linear Model (GLM) that models the log odds of the probability of the positive class; the odds, denoted 𝜋̂ here, are:

Odds = 𝜋̂ = 𝑝 / (1 − 𝑝)

Here, 𝑝 is the predicted probability of the positive outcome. Logistic regression takes the logarithm of these odds, or in short, the log odds, which leads to the following relationship:

log(𝜋̂) = log(𝑝 / (1 − 𝑝)) = 𝑋𝛽

where 𝑋 is a matrix of input features and 𝛽 is a vector of estimated coefficients. This equation shows how the log odds of the outcome are modeled as a linear combination of the input variables. Another thing to consider:

𝑝 = 1 / (1 + exp(−𝑋𝛽))


As you can see in the equation above, the model's prediction transforms the log odds back into a probability using the logistic (sigmoid) function, ensuring the output is between 0 and 1, which can then be used to classify the outcome (e.g., if 𝑝 ≥ 0.5, predict the positive class).
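As a quick numeric illustration of that cutoff (the sigmoid here is the inverse of the log-odds transform above):

sigmoid <- function(z) 1 / (1 + exp(-z))
sigmoid(0)    # 0.5: exactly on the decision boundary
sigmoid(2.2)  # ~0.90: predicted as the positive class at the 0.5 cutoff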
As with the previous 2 RBF Network models, a module is used here. It contains one function that combines the whole tidymodels workflow and another that applies base R's glm function, with its predictions already turned into classes.

box::use(module_r/logit)

I have to note that if you want to use this module, it only works with binary outcomes. The source code of module_r/logit lives in the same directory as the previous 2 RBF modules; I stored it in this page.

Note

The problem at hand is a classification task, not a regression task, so


it makes sense to create a confusion matrix to classify observations
into categories (e.g., positive or negative classes).

With tidymodels workflow

As I said, it is very simple to apply. With my module, the logistic regression function is even simpler to use, and more reusable and reproducible, thanks to the box package. You only need to supply the following:

1. The formula
2. The original data used to train the ML models
3. The new data used to predict "unknown" outcomes

Remember

As you can see in the code below, this is the same for the other models in this chapter: you just supply the formula, data, and new_data arguments.

cc50_logit <- logit$train_TM_logistic(


y ~ ., data = cc50_train, new_data = cc50_test
)

To extract the model information, just run

model$model

where model is a placeholder for the object you obtained from the logit$train_TM_logistic function.
Actual result:

── Workflow [trained] ──────────────────────────────────────────────
Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ────────────────────────────────────────────────────
0 Recipe Steps

── Model ───────────────────────────────────────────────────────────

Call: stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)

Coefficients:


(Intercept) x1 x2 x3 x4 x5 x6
0.72780 0.34341 -0.41703 0.20841 0.01045 -1.04238 -0.42912
x7 x8 x9
0.45369 0.40539 -0.64373

Degrees of Freedom: 9195 Total (i.e. Null); 9186 Residual


Null Deviance: 12060
Residual Deviance: 10020 AIC: 10040

Training Set

caret::confusionMatrix(
data = table(
actual = cc50_train$y,
prediction = cc50_logit$prediction
),
mode = "everything"
)

Confusion Matrix and Statistics

Prediction
Actual Nontoxic Toxic
Nontoxic 1720 1625
Toxic 734 5117

Accuracy : 0.7435
95% CI : (0.7344, 0.7524)
No Information Rate : 0.7331
P-Value [Acc > NIR] : 0.0127


Kappa : 0.4123

Mcnemar's Test P-Value : <2e-16

Sensitivity : 0.7009
Specificity : 0.7590
Pos Pred Value : 0.5142
Neg Pred Value : 0.8746
Precision : 0.5142
Recall : 0.7009
F1 : 0.5932
Prevalence : 0.2669
Detection Rate : 0.1870
Detection Prevalence : 0.3637
Balanced Accuracy : 0.7299

'Positive' Class : Nontoxic

Prediction Set

caret::confusionMatrix(
data = table(
actual = cc50_test$y,
prediction = cc50_logit$predict_new
),
mode = "everything"
)

Confusion Matrix and Statistics


Prediction
Actual Nontoxic Toxic
Nontoxic 540 584
Toxic 243 1703

Accuracy : 0.7306
95% CI : (0.7145, 0.7462)
No Information Rate : 0.745
P-Value [Acc > NIR] : 0.9667

Kappa : 0.3799

Mcnemar's Test P-Value : <2e-16

Sensitivity : 0.6897
Specificity : 0.7446
Pos Pred Value : 0.4804
Neg Pred Value : 0.8751
Precision : 0.4804
Recall : 0.6897
F1 : 0.5663
Prevalence : 0.2550
Detection Rate : 0.1759
Detection Prevalence : 0.3661
Balanced Accuracy : 0.7171

'Positive' Class : Nontoxic


Without tidymodels workflow

For logistic regression, it doesn't actually matter whether you use tidymodels or not: fitting logistic regression to the CC50 dataset, particularly the Training set, gives the same result either way. However, I did some calibration inside the logit module. First, I extract the levels of the factor of the dependent variable, or label; then I map the predicted values with a vectorized ifelse, so that predicted values under 0.5 are replaced with the first level, while predicted values above 0.5 are replaced with the second level. Thus, feel free to use this function.
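That thresholding step, sketched on its own (probs and lvls are illustrative names, not the module's actual code):

# probs: fitted probabilities from predict(fit, type = "response")
# lvls:  levels of the label, e.g. c("Nontoxic", "Toxic")
to_class <- function(probs, lvls) {
  factor(ifelse(probs < 0.5, lvls[1], lvls[2]), levels = lvls)
}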

cc50_logit <- logit$traintest_logit(


y ~ ., data = cc50_train, new_data = cc50_test
)

Training Set

caret::confusionMatrix(
data = table(
actual = cc50_train$y,
prediction = cc50_logit$prediction
),
mode = "everything"
)

Confusion Matrix and Statistics

Prediction
Actual Nontoxic Toxic
Nontoxic 1720 1625


Toxic 734 5117

Accuracy : 0.7435
95% CI : (0.7344, 0.7524)
No Information Rate : 0.7331
P-Value [Acc > NIR] : 0.0127

Kappa : 0.4123

Mcnemar's Test P-Value : <2e-16

Sensitivity : 0.7009
Specificity : 0.7590
Pos Pred Value : 0.5142
Neg Pred Value : 0.8746
Precision : 0.5142
Recall : 0.7009
F1 : 0.5932
Prevalence : 0.2669
Detection Rate : 0.1870
Detection Prevalence : 0.3637
Balanced Accuracy : 0.7299

'Positive' Class : Nontoxic

Prediction Set

caret::confusionMatrix(
data = table(
actual = cc50_test$y,
prediction = cc50_logit$predict_new


),
mode = "everything"
)

Confusion Matrix and Statistics

Prediction
Actual Nontoxic Toxic
Nontoxic 540 584
Toxic 243 1703

Accuracy : 0.7306
95% CI : (0.7145, 0.7462)
No Information Rate : 0.745
P-Value [Acc > NIR] : 0.9667

Kappa : 0.3799

Mcnemar's Test P-Value : <2e-16

Sensitivity : 0.6897
Specificity : 0.7446
Pos Pred Value : 0.4804
Neg Pred Value : 0.8751
Precision : 0.4804
Recall : 0.6897
F1 : 0.5663
Prevalence : 0.2550
Detection Rate : 0.1759
Detection Prevalence : 0.3661
Balanced Accuracy : 0.7171

'Positive' Class : Nontoxic


This is what I mean when I say it doesn't really matter whether you use tidymodels to control the workflow, including hyperparameter tuning and 5-fold cross-validation, unless you apply penalty parameters (see Other model: Regularized Logistic Regression). The results are the same either way: neither reaches 80% on the Training or Prediction set. Either way, this model is nowhere near as good as the 2 RBF Network models at classifying CC50 toxicity levels.

Other model: Extreme Gradient Boosting

One of the models to be compared with the RBF Network is Extreme Gradient Boosting, or XGBoost for short. XGBoost is another ML algorithm that uses ensembles of decision trees, where each tree is trained to correct the errors made by the previous ones. This approach is called boosting, as it progressively boosts the model's accuracy with each additional tree. What sets XGBoost apart from other boosting techniques is its ability to handle large datasets efficiently, with features like regularization to prevent overfitting, parallel processing for faster computation, and support for missing data. XGBoost is widely used for both classification and regression tasks, making it a versatile choice for many real-world applications, such as time series analysis.

This module contains a function that combines the whole XGBoost workflow with tidymodels and a function that uses plain XGBoost via the xgboost package; both take a formula and new_data as arguments. This is how you load the module with box::use:

box::use(module_r/xgboost)

Again, to view the source code of module_r/xgboost, click this page.


Note

The problem at hand is a classification task, not a regression task, so


it makes sense to create a confusion matrix to classify observations
into categories (e.g., positive or negative classes).

With tidymodels workflow

I used the fastest approach to the XGBoost algorithm with tidymodels, because my original implementation was TOO SLOW! It was estimated at roughly 7 hours just to finish training XGBoost with tidymodels; I hated it. Why so slow? I think I made a mistake in my tidymodels setup: I was trying to train 100 models at once, batched with 5-fold cross-validation. Do you really have to train 100 models on weak hardware for one task? That's why I reduced the number of hyperparameter combinations to be trained.

cc50_tm_xgb <- xgboost$train_TM_xgb(


formula = y ~ ., data = cc50_train, new_data = cc50_test
)

Actual result (from cc50_tm_xgb model):

parsnip model object

##### xgb.Booster
raw: 693.7 Kb
call:
xgboost::xgb.train(params = list(eta = 1.04032930083058, max_depth = 6L,
gamma = 0, colsample_bytree = 1, colsample_bynode = 1, min_child_weight = 6L,
subsample = 1), data = x$data, nrounds = 500, watchlist = x$watchlist,


verbose = 0, nthread = 1, objective = "binary:logistic")


params (as set within xgb.train):
eta = "1.04032930083058", max_depth = "6", gamma = "0", colsample_bytree =
xgb.attributes:
niter
callbacks:
cb.evaluation.log()
# of features: 9
niter: 500
nfeatures : 9
evaluation_log:
iter training_logloss
<num> <num>
1 0.43777884
2 0.39549225
---
499 0.01224666
500 0.01223759

The caret::confusionMatrix function is used to evaluate the performance
of ML models on classification tasks, just as for the previous 2 RBF
Network models. It is convenient to use since all you have to do is pass
the contingency table to the data argument. Since the XGBoost model's
task in this case is classification, I am going to use this function to
obtain the performance metrics of the XGBoost model.
Now, for the actual confusion matrix result:

Training Set

caret::confusionMatrix(
data = table(


actual = cc50_train$y,
prediction = cc50_tm_xgb$prediction
)
)

Confusion Matrix and Statistics

Prediction
Actual Nontoxic Toxic
Nontoxic 3322 23
Toxic 6 5845

Accuracy : 0.9968
95% CI : (0.9955, 0.9979)
No Information Rate : 0.6381
P-Value [Acc > NIR] : < 2.2e-16

Kappa : 0.9932

Mcnemar's Test P-Value : 0.002967

Sensitivity : 0.9982
Specificity : 0.9961
Pos Pred Value : 0.9931
Neg Pred Value : 0.9990
Precision : 0.9931
Recall : 0.9982
F1 : 0.9957
Prevalence : 0.3619
Detection Rate : 0.3612
Detection Prevalence : 0.3637
Balanced Accuracy : 0.9971


'Positive' Class : Nontoxic

Prediction Set

caret::confusionMatrix(
data = table(
actual = cc50_test$y,
prediction = cc50_tm_xgb$predict_new
)
)

Confusion Matrix and Statistics

Prediction
Actual Nontoxic Toxic
Nontoxic 1031 93
Toxic 62 1884

Accuracy : 0.9495
95% CI : (0.9412, 0.957)
No Information Rate : 0.644
P-Value [Acc > NIR] : < 2e-16

Kappa : 0.8906

Mcnemar's Test P-Value : 0.01597

Sensitivity : 0.9433
Specificity : 0.9530
Pos Pred Value : 0.9173


Neg Pred Value : 0.9681


Precision : 0.9173
Recall : 0.9433
F1 : 0.9301
Prevalence : 0.3560
Detection Rate : 0.3358
Detection Prevalence : 0.3661
Balanced Accuracy : 0.9481

'Positive' Class : Nontoxic

Without tidymodels workflow

Now, if you apply the XGBoost algorithm without the workflows used in the
tidymodels package, the hyperparameters are not tuned and iterated. Thus,
I'll just rely on xgboost's defaults, with xgb.DMatrix as the helper for
XGBoost's input data.
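
For intuition, here is a minimal sketch of what that default path looks
like when you call the xgboost package directly. This is an assumption
for illustration, not the module's exact code; traintest_xgb may handle
the columns differently.

# Minimal sketch (assumption): default XGBoost through xgb.DMatrix.
# `y` is a two-level factor; the predictors are numeric.
library(xgboost)

X_train <- model.matrix(y ~ . - 1, data = cc50_train)
X_test <- model.matrix(y ~ . - 1, data = cc50_test)
y_train <- as.integer(cc50_train$y) - 1L  # 0/1 encoding of the factor

dtrain <- xgb.DMatrix(data = X_train, label = y_train)
fit <- xgboost(data = dtrain, nrounds = 100,
               objective = "binary:logistic", verbose = 0)

# Probabilities above 0.5 map to the second factor level ("Toxic")
p_new <- predict(fit, xgb.DMatrix(data = X_test))
preds_new <- factor(levels(cc50_train$y)[(p_new > 0.5) + 1L],
                    levels = levels(cc50_train$y))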

cc50_xgb <- xgboost$traintest_xgb(
formula = y ~ ., data = cc50_train, new_data = cc50_test
)

Training Set

caret::confusionMatrix(
data = table(
actual = cc50_train$y,
prediction = cc50_xgb$predictions


)
)

Confusion Matrix and Statistics

Prediction
Actual Nontoxic Toxic
Nontoxic 3288 57
Toxic 24 5827

Accuracy : 0.9912
95% CI : (0.9891, 0.993)
No Information Rate : 0.6398
P-Value [Acc > NIR] : < 2.2e-16

Kappa : 0.9809

Mcnemar's Test P-Value : 0.0003772

Sensitivity : 0.9928
Specificity : 0.9903
Pos Pred Value : 0.9830
Neg Pred Value : 0.9959
Precision : 0.9830
Recall : 0.9928
F1 : 0.9878
Prevalence : 0.3602
Detection Rate : 0.3575
Detection Prevalence : 0.3637
Balanced Accuracy : 0.9915

'Positive' Class : Nontoxic


Prediction Set

caret::confusionMatrix(
data = table(
actual = cc50_test$y,
prediction = cc50_xgb$predictions_new
)
)

Confusion Matrix and Statistics

Prediction
Actual Nontoxic Toxic
Nontoxic 1026 98
Toxic 73 1873

Accuracy : 0.9443
95% CI : (0.9356, 0.9521)
No Information Rate : 0.642
P-Value [Acc > NIR] : < 2e-16

Kappa : 0.8794

Mcnemar's Test P-Value : 0.06646

Sensitivity : 0.9336
Specificity : 0.9503
Pos Pred Value : 0.9128
Neg Pred Value : 0.9625
Precision : 0.9128
Recall : 0.9336
F1 : 0.9231


Prevalence : 0.3580
Detection Rate : 0.3342
Detection Prevalence : 0.3661
Balanced Accuracy : 0.9419

'Positive' Class : Nontoxic

You see, with or without tidymodels, both workflows reach high accuracies
(on both the training and prediction sets), and the training set
accuracies are roughly equal to each other, even though the metrics
slightly favor tidymodels' XGBoost. Also, the accuracy between the
Training and Prediction sets, with or without tidymodels, does not
deviate much; hence, nicely, no overfitting happened.

Other model: Support Vector Machine

Let's try another model, and that is the Support Vector Machine,
or SVM for short. SVM is one of the easier machine learning algorithms
to learn and understand. It works by finding the optimal boundary, or
hyperplane, that best separates data into different classes. SVMs are
used for classification and regression problems, and some literature uses
this algorithm for small to medium-sized datasets with clear separation
between classes. Its goal is to fit a margin between the classes, with
the support vectors being the data points that lie closest to the
boundary. In other terminology, we refer to this as margin maximization,
or "the street". This margin maximization is what makes SVMs robust for
classification tasks. What sets SVM apart is its ability to handle
high-dimensional data (data that is wider than it is long) and to apply
different kernel functions, like linear, polynomial, or radial basis
functions (RBF), to transform the input space, making it more flexible
for complex problems. I will apply the Radial Basis Function (RBF)
kernel, since I used an RBF Network previously.
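
For reference, this is the Gaussian RBF kernel, written in kernlab's
parameterization (the same 𝜎 reported as the sigma hyperparameter in the
model output below):

$$K(\mathbf{x}, \mathbf{x}') = \exp\left(-\sigma \lVert \mathbf{x} - \mathbf{x}' \rVert^2\right)$$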

As with the other experimental ML models, the SVM model to be applied
here is used for a classification task, so we call it a Support Vector
Classifier. Use the module module_r/svm, which contains a function that
combines the whole workflow of SVM with tidymodels and a function that
uses regular SVM from the e1071 package; both of them take formula and
new_data arguments. But this time, the module is assigned the svm alias.
This is how you load the module with box::use:

box::use(svm = module_r/svm)

The location of the source code of module_r/svm is still on this page.

Note

The problem at hand is a classification task, not a regression task, so
it makes sense to create a confusion matrix to classify observations
into categories (e.g., positive or negative classes).

With tidymodels workflow

When tuning the hyperparameters of the SVM on the CC50 dataset, I have to
complain about the speed. With the default tuning settings, training the
SVM model on the CC50 dataset is SO SLOW! That's why I reduced the number
of iterations to the bare minimum, via the integer value you pass to the
n_iter argument.

cc50_tm_svm <- svm$train_TM_svm(
formula = y ~ ., data = cc50_train, new_data = cc50_test
)

The information of the cc50_tm_svm model, with an actual result:

parsnip model object

Support Vector Machine object of class "ksvm"


SV type: C-svc (classification)


parameter : cost C = 3.7394736982582

Gaussian Radial Basis kernel function.


Hyperparameter : sigma = 2.74325394773262

Number of Support Vectors : 5659

Objective Function Value : -3802.514


Training error : 0.016311
Probability model included.

The caret::confusionMatrix function was used for the previous 2 RBF
Network models to evaluate ML models in classification mode, and I am
still using it to obtain the performance metrics of the SVM model.
Now, for the actual confusion matrix result:

Training Set

caret::confusionMatrix(
data = table(
actual = cc50_train$y,
prediction = cc50_tm_svm$prediction
)
)

Confusion Matrix and Statistics

Prediction


Actual Nontoxic Toxic


Nontoxic 3239 106
Toxic 44 5807

Accuracy : 0.9837
95% CI : (0.9809, 0.9862)
No Information Rate : 0.643
P-Value [Acc > NIR] : < 2.2e-16

Kappa : 0.9646

Mcnemar's Test P-Value : 6.338e-07

Sensitivity : 0.9866
Specificity : 0.9821
Pos Pred Value : 0.9683
Neg Pred Value : 0.9925
Precision : 0.9683
Recall : 0.9866
F1 : 0.9774
Prevalence : 0.3570
Detection Rate : 0.3522
Detection Prevalence : 0.3637
Balanced Accuracy : 0.9843

'Positive' Class : Nontoxic

Prediction Set

caret::confusionMatrix(
data = table(
actual = cc50_test$y,
prediction = cc50_tm_svm$predict_new
)
)

Confusion Matrix and Statistics

Prediction
Actual Nontoxic Toxic
Nontoxic 919 205
Toxic 114 1832

Accuracy : 0.8961
95% CI : (0.8848, 0.9067)
No Information Rate : 0.6635
P-Value [Acc > NIR] : < 2.2e-16

Kappa : 0.7722

Mcnemar's Test P-Value : 4.679e-07

Sensitivity : 0.8896
Specificity : 0.8994
Pos Pred Value : 0.8176
Neg Pred Value : 0.9414
Precision : 0.8176
Recall : 0.8896
F1 : 0.8521
Prevalence : 0.3365
Detection Rate : 0.2993
Detection Prevalence : 0.3661
Balanced Accuracy : 0.8945


'Positive' Class : Nontoxic

Without tidymodels workflow

Now, if you apply the SVM model without the workflows used in the
tidymodels package, the hyperparameters are not tuned and iterated. Thus,
I'll just rely on e1071's svm defaults.
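
As a rough sketch of that default path (an assumption about what
traintest_svm roughly does internally, not the module's exact code):

# Minimal sketch (assumption): e1071's classification defaults are the
# RBF kernel with cost = 1 and gamma = 1 / (number of predictors).
library(e1071)

fit <- svm(y ~ ., data = cc50_train, kernel = "radial")

preds_train <- predict(fit, newdata = cc50_train)
preds_new <- predict(fit, newdata = cc50_test)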

cc50_svm <- svm$traintest_svm(
formula = y ~ ., data = cc50_train, new_data = cc50_test
)

Training Set

caret::confusionMatrix(
data = table(
actual = cc50_train$y,
prediction = cc50_svm$predictions
)
)

Confusion Matrix and Statistics

Prediction
Actual Nontoxic Toxic
Nontoxic 2257 1088
Toxic 443 5408


Accuracy : 0.8335
95% CI : (0.8257, 0.8411)
No Information Rate : 0.7064
P-Value [Acc > NIR] : < 2.2e-16

Kappa : 0.6248

Mcnemar's Test P-Value : < 2.2e-16

Sensitivity : 0.8359
Specificity : 0.8325
Pos Pred Value : 0.6747
Neg Pred Value : 0.9243
Precision : 0.6747
Recall : 0.8359
F1 : 0.7467
Prevalence : 0.2936
Detection Rate : 0.2454
Detection Prevalence : 0.3637
Balanced Accuracy : 0.8342

'Positive' Class : Nontoxic

Prediction Set

caret::confusionMatrix(
data = table(
actual = cc50_test$y,
prediction = cc50_svm$predictions_new
)
)


Confusion Matrix and Statistics

Prediction
Actual Nontoxic Toxic
Nontoxic 728 396
Toxic 150 1796

Accuracy : 0.8221
95% CI : (0.8082, 0.8355)
No Information Rate : 0.714
P-Value [Acc > NIR] : < 2.2e-16

Kappa : 0.5983

Mcnemar's Test P-Value : < 2.2e-16

Sensitivity : 0.8292
Specificity : 0.8193
Pos Pred Value : 0.6477
Neg Pred Value : 0.9229
Precision : 0.6477
Recall : 0.8292
F1 : 0.7273
Prevalence : 0.2860
Detection Rate : 0.2371
Detection Prevalence : 0.3661
Balanced Accuracy : 0.8243

'Positive' Class : Nontoxic

As you can see, the results with and without the tidymodels workflows
differ hugely from each other, in favor of the SVM model under the
control of the tidymodels workflows. Furthermore, no overfitting occurs,
since the Training and Prediction set metrics (both with and without the
tidymodels workflows) do not deviate strongly.

Other model: Naive Bayes
Naive Bayes is the only generative model used in this chapter, while the
rest are discriminative models. It is a probabilistic classification
model based on applying Bayes' theorem with strong (naive) independence
assumptions between the features. This model is particularly useful for
large datasets and text classification tasks due to its simplicity and
efficiency. The independence assumption means that the presence of a
particular feature in the dataset does not affect the presence of any
other feature.
In practice, Naive Bayes classifiers work by calculating the posterior
probability of each class given the features of the input data and then
choosing the class with the highest posterior probability. This makes
Naive Bayes models fast and scalable. Despite its simplicity, it often
performs surprisingly well in practice.
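
As a minimal sketch of the underlying call (an assumption about what the
module wraps, not its exact code):

# Minimal sketch (assumption): naivebayes with per-feature kernel density
# estimates; usekernel = TRUE matches the KDE tables in the output below.
library(naivebayes)

fit <- naive_bayes(y ~ ., data = cc50_train, usekernel = TRUE)

preds_new <- predict(fit, newdata = cc50_test, type = "class")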
The name of the script is nb.R. The module name is so short that you
won't need an alias like the one used in the SVM part. Here's how you
load it:

box::use(module_r/nb)

To view the source code of module_r/nb, click this page.

Note

The problem at hand is a classification task, not a regression task, so
it makes sense to create a confusion matrix to classify observations
into categories (e.g., positive or negative classes).


With tidymodels workflow

The speed is not a big deal here, roughly under 5 minutes. The
hyperparameter search space is not that big; there are only 2
hyperparameters to be tuned:

1. Smoothness
2. Laplace parameter

cc50_nb <- nb$train_TM_nb(
y ~ ., data = cc50_train, new_data = cc50_test
)

Actual result:

parsnip model object

======================================= Naive Bayes ========================================

Call:
naive_bayes.default(x = maybe_data_frame(x), y = y, laplace = ~0.0945426446851343,
usekernel = TRUE, adjust = ~0.163482782384381)

--------------------------------------------------------------------------------------------

Laplace smoothing: 0.09454264

--------------------------------------------------------------------------------------------

A priori probabilities:

Nontoxic Toxic
0.3637451 0.6362549


-----------------------------------------------------------------------------

Tables:

-----------------------------------------------------------------------------
:: x1::Nontoxic (KDE)
-----------------------------------------------------------------------------

Call:
density.default(x = x, adjust = ..1, na.rm = TRUE)

Data: x (3345 obs.); Bandwidth 'bw' = 0.02979

x y
Min. :-3.55164 Min. :0.00000
1st Qu.:-1.81466 1st Qu.:0.01215
Median :-0.07768 Median :0.09504
Mean :-0.07768 Mean :0.14366
3rd Qu.: 1.65931 3rd Qu.:0.21610
Max. : 3.39629 Max. :0.82503
-----------------------------------------------------------------------------
:: x1::Toxic (KDE)
-----------------------------------------------------------------------------

Call:
density.default(x = x, adjust = ..1, na.rm = TRUE)

Data: x (5851 obs.); Bandwidth 'bw' = 0.02372

x y
Min. :-3.30409 Min. :0.00000
1st Qu.:-1.65993 1st Qu.:0.00634
Median :-0.01578 Median :0.07529


Mean :-0.01578 Mean :0.15177


3rd Qu.: 1.62838 3rd Qu.:0.25987
Max. : 3.27254 Max. :1.34462

--------------------------------------------------------------------------------------------
:: x2::Nontoxic (KDE)
--------------------------------------------------------------------------------------------

Call:
density.default(x = x, adjust = ..1, na.rm = TRUE)

Data: x (3345 obs.); Bandwidth 'bw' = 0.02536

x y
Min. :-3.0150 Min. :0.0000007
1st Qu.:-1.3854 1st Qu.:0.0163480
Median : 0.2443 Median :0.0746043
Mean : 0.2443 Mean :0.1531175
3rd Qu.: 1.8739 3rd Qu.:0.2681576
Max. : 3.5035 Max. :0.6575734

--------------------------------------------------------------------------------------------
:: x2::Toxic (KDE)
--------------------------------------------------------------------------------------------

Call:
density.default(x = x, adjust = ..1, na.rm = TRUE)

Data: x (5851 obs.); Bandwidth 'bw' = 0.02518

x y
Min. :-3.45844 Min. :0.0000921
1st Qu.:-1.75693 1st Qu.:0.0237468


Median :-0.05542 Median :0.0690132


Mean :-0.05542 Mean :0.1466397
3rd Qu.: 1.64609 3rd Qu.:0.2570213
Max. : 3.34760 Max. :0.8798388

-----------------------------------------------------------------------------
:: x3::Nontoxic (KDE)
-----------------------------------------------------------------------------

Call:
density.default(x = x, adjust = ..1, na.rm = TRUE)

Data: x (3345 obs.); Bandwidth 'bw' = 0.02167

x y
Min. :-4.2747 Min. :0.000000
1st Qu.:-2.3451 1st Qu.:0.006735
Median :-0.4155 Median :0.057628
Mean :-0.4155 Mean :0.129308
3rd Qu.: 1.5141 3rd Qu.:0.171811
Max. : 3.4437 Max. :1.424567

-----------------------------------------------------------------------------
:: x3::Toxic (KDE)
-----------------------------------------------------------------------------

Call:
density.default(x = x, adjust = ..1, na.rm = TRUE)

Data: x (5851 obs.); Bandwidth 'bw' = 0.01991

x y
Min. :-3.8168 Min. :0.000000


1st Qu.:-1.8106 1st Qu.:0.006313


Median : 0.1955 Median :0.051742
Mean : 0.1955 Mean :0.124366
3rd Qu.: 2.2017 3rd Qu.:0.160653
Max. : 4.2078 Max. :1.572118

--------------------------------------------------------------------------------------------
:: x4::Nontoxic (KDE)
--------------------------------------------------------------------------------------------

Call:
density.default(x = x, adjust = ..1, na.rm = TRUE)

Data: x (3345 obs.); Bandwidth 'bw' = 0.01254

x y
Min. :-3.7336 Min. :0.00000
1st Qu.:-1.7856 1st Qu.:0.00000
Median : 0.1624 Median :0.00000
Mean : 0.1624 Mean :0.12812
3rd Qu.: 2.1103 3rd Qu.:0.02758
Max. : 4.0583 Max. :2.68670

--------------------------------------------------------------------------------------------
:: x4::Toxic (KDE)
--------------------------------------------------------------------------------------------

Call:
density.default(x = x, adjust = ..1, na.rm = TRUE)

Data: x (5851 obs.); Bandwidth 'bw' = 0.01288

x y


Min. :-4.2881 Min. :0.0000000


1st Qu.:-2.2012 1st Qu.:0.0000000
Median :-0.1144 Median :0.0000018
Mean :-0.1144 Mean :0.1195532
3rd Qu.: 1.9725 3rd Qu.:0.0663469
Max. : 4.0593 Max. :3.0142851

-----------------------------------------------------------------------------
:: x5::Nontoxic (KDE)
-----------------------------------------------------------------------------

Call:
density.default(x = x, adjust = ..1, na.rm = TRUE)

Data: x (3345 obs.); Bandwidth 'bw' = 0.02676

x y
Min. :-2.5025 Min. :0.00000
1st Qu.:-1.3117 1st Qu.:0.03801
Median :-0.1209 Median :0.12358
Mean :-0.1209 Mean :0.20952
3rd Qu.: 1.0698 3rd Qu.:0.33197
Max. : 2.2606 Max. :0.83230

-----------------------------------------------------------------------------
:: x5::Toxic (KDE)
-----------------------------------------------------------------------------

Call:
density.default(x = x, adjust = ..1, na.rm = TRUE)

Data: x (5851 obs.); Bandwidth 'bw' = 0.01887


x y
Min. :-2.6985 Min. :0.00000
1st Qu.:-1.4335 1st Qu.:0.05057
Median :-0.1685 Median :0.12368
Mean :-0.1685 Mean :0.19723
3rd Qu.: 1.0964 3rd Qu.:0.28863
Max. : 2.3614 Max. :1.50649

--------------------------------------------------------------------------------------------

# ... and 4 more tables

--------------------------------------------------------------------------------------------

Training Set

caret::confusionMatrix(
data = table(
actual = cc50_train$y,
prediction = cc50_nb$prediction
),
mode = "everything"
)

Confusion Matrix and Statistics

Prediction
Actual Nontoxic Toxic
Nontoxic 729 2616
Toxic 0 5851


Accuracy : 0.7155
95% CI : (0.7062, 0.7247)
No Information Rate : 0.9207
P-Value [Acc > NIR] : 1

Kappa : 0.2618

Mcnemar's Test P-Value : <2e-16

Sensitivity : 1.00000
Specificity : 0.69104
Pos Pred Value : 0.21794
Neg Pred Value : 1.00000
Precision : 0.21794
Recall : 1.00000
F1 : 0.35788
Prevalence : 0.07927
Detection Rate : 0.07927
Detection Prevalence : 0.36375
Balanced Accuracy : 0.84552

'Positive' Class : Nontoxic

Prediction Set

caret::confusionMatrix(
data = table(
actual = cc50_test$y,
prediction = cc50_nb$predict_new
),


mode = "everything"
)

Confusion Matrix and Statistics

Prediction
Actual Nontoxic Toxic
Nontoxic 227 897
Toxic 0 1946

Accuracy : 0.7078
95% CI : (0.6914, 0.7239)
No Information Rate : 0.9261
P-Value [Acc > NIR] : 1

Kappa : 0.2429

Mcnemar's Test P-Value : <2e-16

Sensitivity : 1.00000
Specificity : 0.68449
Pos Pred Value : 0.20196
Neg Pred Value : 1.00000
Precision : 0.20196
Recall : 1.00000
F1 : 0.33605
Prevalence : 0.07394
Detection Rate : 0.07394
Detection Prevalence : 0.36612
Balanced Accuracy : 0.84224

'Positive' Class : Nontoxic


Without tidymodels workflow

cc50_nb <- nb$traintest_nb(
y ~ ., data = cc50_train, new_data = cc50_test
)

Training Set

caret::confusionMatrix(
data = table(
actual = cc50_train$y,
prediction = cc50_nb$predictions
),
mode = "everything"
)

Confusion Matrix and Statistics

Prediction
Actual Nontoxic Toxic
Nontoxic 1724 1621
Toxic 1143 4708

Accuracy : 0.6994
95% CI : (0.6899, 0.7088)
No Information Rate : 0.6882
P-Value [Acc > NIR] : 0.01034

Kappa : 0.3301


Mcnemar's Test P-Value : < 2e-16

Sensitivity : 0.6013
Specificity : 0.7439
Pos Pred Value : 0.5154
Neg Pred Value : 0.8046
Precision : 0.5154
Recall : 0.6013
F1 : 0.5551
Prevalence : 0.3118
Detection Rate : 0.1875
Detection Prevalence : 0.3637
Balanced Accuracy : 0.6726

'Positive' Class : Nontoxic

Prediction Set

caret::confusionMatrix(
data = table(
actual = cc50_test$y,
prediction = cc50_nb$predict_new
),
mode = "everything"
)

Confusion Matrix and Statistics

Prediction
Actual Nontoxic Toxic


Nontoxic 584 540


Toxic 366 1580

Accuracy : 0.7049
95% CI : (0.6884, 0.721)
No Information Rate : 0.6906
P-Value [Acc > NIR] : 0.04427

Kappa : 0.3427

Mcnemar's Test P-Value : 9.055e-09

Sensitivity : 0.6147
Specificity : 0.7453
Pos Pred Value : 0.5196
Neg Pred Value : 0.8119
Precision : 0.5196
Recall : 0.6147
F1 : 0.5632
Prevalence : 0.3094
Detection Rate : 0.1902
Detection Prevalence : 0.3661
Balanced Accuracy : 0.6800

'Positive' Class : Nontoxic

You can see the difference between the Naive Bayes that was tuned and
controlled by tidymodels and the Naive Bayes that was run natively with
the naivebayes package. Even though the tidymodels Naive Bayes has better
results than the naivebayes one, Naive Bayes got barely 70% accuracy on
all sets, and it is in no way better than the previous 2 RBF Network
models.

Other model: Random Forest

Another model to be compared to the RBF Network is an ensemble algorithm:
the Random Forest, or RF for short. Random Forest is another machine
learning (ML) algorithm that builds multiple decision trees, lets those
decision trees cast their votes as predictions, and then takes the
majority vote as its output, improving prediction accuracy and reducing
overfitting. During training, each decision tree in the "forest" is
trained on a random subset of the data and a random selection of
features, which introduces diversity in the trees and leads to more
robust predictions. The algorithm summarizes the results from all the
trees (through a majority vote for classification or averaging for
regression) to make the final prediction, making it both accurate and
stable.

There are actually 2 packages in R for working with Random Forest
algorithms that I know of:

1. randomForest
2. ranger

Among the choices, I picked the ranger package due to its known speed,
confirmed to be faster than the randomForest package. That's right: when
you have a large dataset to handle, in this case 9,196 training
observations and 3,070 prediction observations, ranger is surprisingly
faster than randomForest. While randomForest is written in C, ranger is
more optimized, maintained, and updated, and it is written in C++;
neither of them compromises the accuracy of Random Forest. Furthermore,
ranger is also the one used when you are conducting survival analysis. A
minimal sketch of calling it directly is shown below.
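
The sketch below is an assumption about what the module's plain-ranger
path roughly does, not its exact code:

# Minimal sketch (assumption): a classification forest with ranger's
# default of 500 trees; `y` is a factor, so ranger classifies.
library(ranger)

fit <- ranger(y ~ ., data = cc50_train, num.trees = 500)

preds_train <- predict(fit, data = cc50_train)$predictions
preds_new <- predict(fit, data = cc50_test)$predictions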
Thus, I'll be using the ranger package to get better performance, in
terms of speed, when training the RF model. As you know, all of the model
functions are stored in a script as a module and accessed using box::use.
Hence, I stored the workflows of the random forest, with and without
tidymodels, as functions within a module named random_forest.

box::use(module_r/random_forest)

Recall that module_r is a folder that contains the scripts of all the
modules used in this documentation. This time, the script is
random_forest.R, and it qualifies as a module. To view the source code of
module_r/random_forest, click this page.

Note

The problem at hand is a classification task, not a regression task, so
it makes sense to create a confusion matrix to classify observations
into categories (e.g., positive or negative classes).

With tidymodels workflow

The implementation of ranger's RF model with tidymodels is slow, but the
estimated time is roughly 5 minutes. This is because I have many models
to be trained at the same time (see the source code). The time consumed
is not a big deal to me, which is why I went ahead and trained the RF
model.

cc50_tm_rf <- random_forest$train_TM_rf(
formula = y ~ ., data = cc50_train, new_data = cc50_test
)


Actual result (from cc50_tm_rf):

── Workflow [trained] ──────────────────────────────────────────────

Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────
0 Recipe Steps

── Model ───────────────────────────────────────────────────────────
Ranger result

Call:
ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~16L, x), num.trees = ~

Type: Probability estimation


Number of trees: 500
Sample size: 9196
Number of independent variables: 9
Mtry: 9
Target node size: 2
Variable importance mode: permutation
Splitrule: gini
OOB prediction error (Brier s.): 0.04119084

Just like with the previous results from the 2 RBF Network models, I am
still using caret::confusionMatrix to obtain the performance metrics of
the RF model, since this RF model is a classification model, not a
regression model.


Training Set

caret::confusionMatrix(
data = table(
actual = cc50_train$y,
prediction = cc50_tm_rf$predictions
)
)

Confusion Matrix and Statistics

Prediction
Actual Nontoxic Toxic
Nontoxic 3323 22
Toxic 1 5850

Accuracy : 0.9975
95% CI : (0.9962, 0.9984)
No Information Rate : 0.6385
P-Value [Acc > NIR] : < 2.2e-16

Kappa : 0.9946

Mcnemar's Test P-Value : 3.042e-05

Sensitivity : 0.9997
Specificity : 0.9963
Pos Pred Value : 0.9934
Neg Pred Value : 0.9998
Precision : 0.9934
Recall : 0.9997
F1 : 0.9966


Prevalence : 0.3615
Detection Rate : 0.3614
Detection Prevalence : 0.3637
Balanced Accuracy : 0.9980

'Positive' Class : Nontoxic

After running that code, I obtained the metrics that measure the
performance of the RF model, and they are quite large for the training
set, implying that the RF model performed well compared to the previous 2
RBF Network models (though I could have obtained large metrics for those
as well had I run them in batches and increased the parameters).

Prediction Set

However, the training set metrics alone are not sufficient; I have to
verify the metrics on the test set. If the metrics don't differ much, the
RF model does not suffer from overfitting; otherwise, the conclusion
would be overfitting.

caret::confusionMatrix(
data = table(
actual = cc50_test$y,
prediction = cc50_tm_rf$predictions_new
)
)

Confusion Matrix and Statistics

Prediction
Actual Nontoxic Toxic


Nontoxic 1026 98
Toxic 64 1882

Accuracy : 0.9472
95% CI : (0.9387, 0.9549)
No Information Rate : 0.645
P-Value [Acc > NIR] : < 2.2e-16

Kappa : 0.8856

Mcnemar's Test P-Value : 0.009522

Sensitivity : 0.9413
Specificity : 0.9505
Pos Pred Value : 0.9128
Neg Pred Value : 0.9671
Precision : 0.9128
Recall : 0.9413
F1 : 0.9268
Prevalence : 0.3550
Detection Rate : 0.3342
Detection Prevalence : 0.3661
Balanced Accuracy : 0.9459

'Positive' Class : Nontoxic

RF models can sometimes be prone to overfitting, but the RF model in this
documentation has large training set and prediction set metrics; hence,
we obtained a well-balanced RF model.


Without tidymodels workflow

This time, the ranger random forest model is run with its defaults and
the hyperparameter tuning process is not performed. Let's see how it
performs on the CC50 dataset.

cc50_rf <- random_forest$traintest_ranger(
formula = y ~ ., data = cc50_train, new_data = cc50_test
)

Training Set

This doesn't differ from the previous procedure, where we obtain the
metrics of the classification model through a confusion matrix with
caret::confusionMatrix, where the data argument is the table of the
actual classes against their predictions.

caret::confusionMatrix(
data = table(
actual = cc50_train$y,
prediction = cc50_rf$train_preds
)
)

Confusion Matrix and Statistics

Prediction
Actual Nontoxic Toxic
Nontoxic 3063 282
Toxic 162 5689

Accuracy : 0.9517


95% CI : (0.9471, 0.956)


No Information Rate : 0.6493
P-Value [Acc > NIR] : < 2.2e-16

Kappa : 0.8949

Mcnemar's Test P-Value : 1.628e-08

Sensitivity : 0.9498
Specificity : 0.9528
Pos Pred Value : 0.9157
Neg Pred Value : 0.9723
Precision : 0.9157
Recall : 0.9498
F1 : 0.9324
Prevalence : 0.3507
Detection Rate : 0.3331
Detection Prevalence : 0.3637
Balanced Accuracy : 0.9513

'Positive' Class : Nontoxic

Prediction Set

caret::confusionMatrix(
data = table(
actual = cc50_test$y,
prediction = cc50_rf$test_preds
)
)


Confusion Matrix and Statistics

Prediction
Actual Nontoxic Toxic
Nontoxic 1022 102
Toxic 57 1889

Accuracy : 0.9482
95% CI : (0.9398, 0.9558)
No Information Rate : 0.6485
P-Value [Acc > NIR] : < 2.2e-16

Kappa : 0.8875

Mcnemar's Test P-Value : 0.0004841

Sensitivity : 0.9472
Specificity : 0.9488
Pos Pred Value : 0.9093
Neg Pred Value : 0.9707
Precision : 0.9093
Recall : 0.9472
F1 : 0.9278
Prevalence : 0.3515
Detection Rate : 0.3329
Detection Prevalence : 0.3661
Balanced Accuracy : 0.9480

'Positive' Class : Nontoxic

I still obtained quite large metric values measuring the performance of
the RF model in classification. With approximately 95% for the Training
set and 94% for the Prediction set, we can say that the ranger random
forest is not overfitting the CC50 dataset.
When using tidymodels, the accuracy metrics are in its favor compared to
not using tidymodels. But the performance in terms of training speed
favors the plain use of ranger. Hence, we'll be using the result of the
tidymodels ranger RF model for comparison.

Other model: Regularized Logistic Regression

I use all 3 types of regularization/penalization in this task:

1. Ridge
2. LASSO
3. Elastic Net

The key points of those regularization methods are:

1. Shrinks coefficients: Large coefficients are penalized more, leading
to smaller weights. You will see this happen during the application.
2. Prevents overfitting: By discouraging overly complex models,
regularization ensures better generalization to new data.
3. Penalty parameter (denoted by 𝜆): This hyperparameter controls the
regularization strength. A larger 𝜆 leads to more shrinkage, whereas
𝜆 = 0 recovers standard logistic regression.

And so, their workflows differ. Just like with the Support Vector
Machine, the module is assigned the PLR alias, and the source code of
module_r/penal_reg lives in the same directory as the previous modules,
on this page.

box::use(PLR = module_r/penal_reg)


Note

The problem at hand is a classification task, not a regression task, so
it makes sense to create a confusion matrix to classify observations
into categories (e.g., positive or negative classes).

Penalty: Ridge

What is Ridge, actually? Some refer to it as regression with L2
regularization. With logistic regression as the model to be penalized, it
is a modification of standard logistic regression in which a penalty term
is added to the loss function 𝐿.
This is the equation:

$$L(\beta) = -\sum_{i=1}^{n}\left(y_i \log(p_i) + (1-y_i)\log(1-p_i)\right) + \lambda \sum_{j=1}^{k} \beta_j^2$$

The goal is to minimize this loss function over the 𝛽s of the logistic
regression equation (found in Chapter 5.1, in the log-odds part).
This is how it looks:

$$\min_{\beta}\left(-\sum_{i=1}^{n}\left(y_i \log(p_i) + (1-y_i)\log(1-p_i)\right) + \lambda \sum_{j=1}^{k} \beta_j^2\right)$$
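
Concretely, this is the penalty glmnet fits when alpha = 0. A minimal
sketch of the direct glmnet route, as an assumption for illustration
rather than the module's exact code (cv.glmnet picks 𝜆 by
cross-validation):

# Minimal sketch (assumption): ridge-penalized logistic regression,
# with lambda chosen by cross-validation.
library(glmnet)

X_train <- model.matrix(y ~ . - 1, data = cc50_train)
X_test <- model.matrix(y ~ . - 1, data = cc50_test)

cv_fit <- cv.glmnet(X_train, cc50_train$y, family = "binomial", alpha = 0)

preds_new <- predict(cv_fit, newx = X_test, s = "lambda.min", type = "class")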

With tidymodels workflow

The only tuned parameter for Ridge is the penalty parameter, with
mixture = 0.


cc50_ridge <- PLR$train_TM_logistic(
formula = y ~ .,
data = cc50_train,
new_data = cc50_test,
penal_type = "ridge"
)

Actual result:

── Workflow [trained] ──────────────────────────────────────────────

Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ────────────────────────────────────────────────────
0 Recipe Steps

── Model ───────────────────────────────────────────────────────────

Call: glmnet::glmnet(x = maybe_matrix(x), y = y, family = "binomial", alpha = ~0)

Df %Dev Lambda
1 9 0.00 159.000
2 9 0.05 144.900
3 9 0.05 132.000
4 9 0.06 120.300
5 9 0.07 109.600
6 9 0.07 99.870
7 9 0.08 91.000
8 9 0.09 82.920
9 9 0.09 75.550
10 9 0.10 68.840
11 9 0.11 62.720
12 9 0.12 57.150


13 9 0.14 52.070
14 9 0.15 47.450
15 9 0.16 43.230
16 9 0.18 39.390
17 9 0.20 35.890
18 9 0.22 32.700
19 9 0.24 29.800
20 9 0.26 27.150
21 9 0.28 24.740
22 9 0.31 22.540
23 9 0.34 20.540
24 9 0.37 18.710
25 9 0.41 17.050
26 9 0.44 15.540
27 9 0.49 14.160
28 9 0.53 12.900
29 9 0.58 11.750
30 9 0.63 10.710
31 9 0.69 9.758
32 9 0.75 8.891
33 9 0.82 8.101
34 9 0.90 7.381
35 9 0.98 6.726
36 9 1.06 6.128
37 9 1.16 5.584
38 9 1.26 5.088
39 9 1.36 4.636
40 9 1.48 4.224
41 9 1.61 3.849
42 9 1.74 3.507
43 9 1.88 3.195
44 9 2.03 2.911
45 9 2.20 2.653


46 9 2.37 2.417

...
and 54 more lines.

Training Set

caret::confusionMatrix(
data = table(
actual = cc50_train$y,
prediction = cc50_ridge$prediction
),
mode = "everything"
)

Confusion Matrix and Statistics

Prediction
Actual Nontoxic Toxic
Nontoxic 1592 1753
Toxic 658 5193

Accuracy : 0.7378
95% CI : (0.7287, 0.7468)
No Information Rate : 0.7553
P-Value [Acc > NIR] : 0.9999

Kappa : 0.3909

Mcnemar's Test P-Value : <2e-16


Sensitivity : 0.7076
Specificity : 0.7476
Pos Pred Value : 0.4759
Neg Pred Value : 0.8875
Precision : 0.4759
Recall : 0.7076
F1 : 0.5691
Prevalence : 0.2447
Detection Rate : 0.1731
Detection Prevalence : 0.3637
Balanced Accuracy : 0.7276

'Positive' Class : Nontoxic

Prediction Set

caret::confusionMatrix(
data = table(
actual = cc50_test$y,
prediction = cc50_ridge$predict_new
),
mode = "everything"
)

Confusion Matrix and Statistics

Prediction
Actual Nontoxic Toxic
Nontoxic 500 624
Toxic 222 1724


Accuracy : 0.7244
95% CI : (0.7083, 0.7402)
No Information Rate : 0.7648
P-Value [Acc > NIR] : 1

Kappa : 0.3578

Mcnemar's Test P-Value : <2e-16

Sensitivity : 0.6925
Specificity : 0.7342
Pos Pred Value : 0.4448
Neg Pred Value : 0.8859
Precision : 0.4448
Recall : 0.6925
F1 : 0.5417
Prevalence : 0.2352
Detection Rate : 0.1629
Detection Prevalence : 0.3661
Balanced Accuracy : 0.7134

'Positive' Class : Nontoxic

Without tidymodels workflow

cc50_ridge <- PLR$traintest_glmnet(
formula = y ~ .,
data = cc50_train,
new_data = cc50_test,
penal_type = "ridge"
)

Training Set

caret::confusionMatrix(
data = table(
actual = cc50_train$y,
prediction = cc50_ridge$predictions
),
mode = "everything"
)

Confusion Matrix and Statistics

Prediction
Actual Nontoxic Toxic
Nontoxic 1592 1753
Toxic 658 5193

Accuracy : 0.7378
95% CI : (0.7287, 0.7468)
No Information Rate : 0.7553
P-Value [Acc > NIR] : 0.9999

Kappa : 0.3909

Mcnemar's Test P-Value : <2e-16

Sensitivity : 0.7076
Specificity : 0.7476


Pos Pred Value : 0.4759


Neg Pred Value : 0.8875
Precision : 0.4759
Recall : 0.7076
F1 : 0.5691
Prevalence : 0.2447
Detection Rate : 0.1731
Detection Prevalence : 0.3637
Balanced Accuracy : 0.7276

'Positive' Class : Nontoxic

Prediction Set

caret::confusionMatrix(
data = table(
actual = cc50_test$y,
prediction = cc50_ridge$predict_new
),
mode = "everything"
)

Confusion Matrix and Statistics

Prediction
Actual Nontoxic Toxic
Nontoxic 500 624
Toxic 222 1724

Accuracy : 0.7244


95% CI : (0.7083, 0.7402)


No Information Rate : 0.7648
P-Value [Acc > NIR] : 1

Kappa : 0.3578

Mcnemar's Test P-Value : <2e-16

Sensitivity : 0.6925
Specificity : 0.7342
Pos Pred Value : 0.4448
Neg Pred Value : 0.8859
Precision : 0.4448
Recall : 0.6925
F1 : 0.5417
Prevalence : 0.2352
Detection Rate : 0.1629
Detection Prevalence : 0.3661
Balanced Accuracy : 0.7134

'Positive' Class : Nontoxic

Penalty: LASSO

What is LASSO, actually? It's an acronym for Least Absolute Shrinkage and
Selection Operator, and some refer to it as regression with L1
regularization. It is the same as Ridge, except that each beta is shrunk
by its absolute value, not its square.

This is the equation of the LASSO regression loss function:

$$L(\beta) = -\sum_{i=1}^{n}\left(y_i \log(p_i) + (1-y_i)\log(1-p_i)\right) + \lambda \sum_{j=1}^{k} |\beta_j|$$

The goal is to minimize this loss function over the 𝛽s of the logistic
regression equation (found in Chapter 5.1, in the log-odds part).
This is how it looks:

$$\min_{\beta}\left(-\sum_{i=1}^{n}\left(y_i \log(p_i) + (1-y_i)\log(1-p_i)\right) + \lambda \sum_{j=1}^{k} |\beta_j|\right)$$

With tidymodels workflow

The only tuned parameter for LASSO is the penalty parameter, with
mixture = 1.

cc50_lasso <- PLR$train_TM_logistic(
formula = y ~ .,
data = cc50_train,
new_data = cc50_test,
penal_type = "lasso"
)

Actual result:

── Workflow [trained] ──────────────────────────────────────────────

Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ────────────────────────────────────────────────────
0 Recipe Steps

── Model ───────────────────────────────────────────────────────────

Call: glmnet::glmnet(x = maybe_matrix(x), y = y, family = "binomial", a

Df %Dev Lambda
1 0 0.00 0.159000
2 1 1.42 0.144900
3 1 2.60 0.132000
4 1 3.59 0.120300
5 1 4.42 0.109600
6 1 5.12 0.099870
7 1 5.71 0.091000
8 1 6.20 0.082920
9 1 6.62 0.075550
10 2 7.12 0.068840
11 2 7.66 0.062720
12 2 8.11 0.057150
13 2 8.50 0.052070
14 3 9.03 0.047450
15 3 9.55 0.043230
16 3 9.98 0.039390
17 4 10.45 0.035890
18 4 10.94 0.032700
19 5 11.44 0.029800
20 7 12.00 0.027150
21 7 12.57 0.024740
22 7 13.05 0.022540
23 8 13.63 0.020540
24 8 14.13 0.018710
25 8 14.55 0.017050
26 8 14.91 0.015540
27 8 15.22 0.014160
28 8 15.48 0.012900


29 8 15.70 0.011750
30 8 15.89 0.010710
31 8 16.05 0.009758
32 8 16.18 0.008891
33 8 16.30 0.008101
34 8 16.39 0.007381
35 8 16.47 0.006726
36 8 16.54 0.006128
37 8 16.60 0.005584
38 8 16.65 0.005088
39 8 16.69 0.004636
40 8 16.73 0.004224
41 8 16.75 0.003849
42 8 16.78 0.003507
43 8 16.80 0.003195
44 8 16.82 0.002911
45 8 16.83 0.002653
46 8 16.84 0.002417

...
and 14 more lines.

Training Set

caret::confusionMatrix(
data = table(
actual = cc50_train$y,
prediction = cc50_lasso$prediction
),
mode = "everything"
)


Confusion Matrix and Statistics

Prediction
Actual Nontoxic Toxic
Nontoxic 1717 1628
Toxic 717 5134

Accuracy : 0.745
95% CI : (0.736, 0.7539)
No Information Rate : 0.7353
P-Value [Acc > NIR] : 0.01794

Kappa : 0.415

Mcnemar's Test P-Value : < 2e-16

Sensitivity : 0.7054
Specificity : 0.7592
Pos Pred Value : 0.5133
Neg Pred Value : 0.8775
Precision : 0.5133
Recall : 0.7054
F1 : 0.5942
Prevalence : 0.2647
Detection Rate : 0.1867
Detection Prevalence : 0.3637
Balanced Accuracy : 0.7323

'Positive' Class : Nontoxic


Prediction Set

caret::confusionMatrix(
data = table(
actual = cc50_test$y,
prediction = cc50_lasso$predict_new
),
mode = "everything"
)

Confusion Matrix and Statistics

Prediction
Actual Nontoxic Toxic
Nontoxic 537 587
Toxic 238 1708

Accuracy : 0.7313
95% CI : (0.7152, 0.7469)
No Information Rate : 0.7476
P-Value [Acc > NIR] : 0.9815

Kappa : 0.3804

Mcnemar's Test P-Value : <2e-16

Sensitivity : 0.6929
Specificity : 0.7442
Pos Pred Value : 0.4778
Neg Pred Value : 0.8777
Precision : 0.4778
Recall : 0.6929


F1 : 0.5656
Prevalence : 0.2524
Detection Rate : 0.1749
Detection Prevalence : 0.3661
Balanced Accuracy : 0.7186

'Positive' Class : Nontoxic

Without tidymodels workflow

cc50_lasso <- PLR$traintest_glmnet(
formula = y ~ .,
data = cc50_train,
new_data = cc50_test,
penal_type = "lasso"
)

Training Set

caret::confusionMatrix(
data = table(
actual = cc50_train$y,
prediction = cc50_lasso$prediction
),
mode = "everything"
)

Confusion Matrix and Statistics


Prediction
Actual Nontoxic Toxic
Nontoxic 1716 1629
Toxic 722 5129

Accuracy : 0.7443
95% CI : (0.7353, 0.7532)
No Information Rate : 0.7349
P-Value [Acc > NIR] : 0.0202

Kappa : 0.4136

Mcnemar's Test P-Value : <2e-16

Sensitivity : 0.7039
Specificity : 0.7590
Pos Pred Value : 0.5130
Neg Pred Value : 0.8766
Precision : 0.5130
Recall : 0.7039
F1 : 0.5935
Prevalence : 0.2651
Detection Rate : 0.1866
Detection Prevalence : 0.3637
Balanced Accuracy : 0.7314

'Positive' Class : Nontoxic

Prediction Set


caret::confusionMatrix(
data = table(
actual = cc50_test$y,
prediction = cc50_lasso$predict_new
),
mode = "everything"
)

Confusion Matrix and Statistics

Prediction
Actual Nontoxic Toxic
Nontoxic 538 586
Toxic 238 1708

Accuracy : 0.7316
95% CI : (0.7155, 0.7472)
No Information Rate : 0.7472
P-Value [Acc > NIR] : 0.9775

Kappa : 0.3813

Mcnemar's Test P-Value : <2e-16

Sensitivity : 0.6933
Specificity : 0.7446
Pos Pred Value : 0.4786
Neg Pred Value : 0.8777
Precision : 0.4786
Recall : 0.6933
F1 : 0.5663
Prevalence : 0.2528


Detection Rate : 0.1752


Detection Prevalence : 0.3661
Balanced Accuracy : 0.7189

'Positive' Class : Nontoxic

Penalty: Elastic Net

What about Elastic Net? It is a combination of Ridge and LASSO
regression. It is used to address the limitations of LASSO, especially
when there are highly correlated features or when the number of features
is much larger than the number of observations.

Key features that I didn't mention for either Ridge or LASSO:

• While LASSO tends to select only one feature from a group of
correlated features, Elastic Net can retain all or several of them.
• The Elastic Net regularization encourages sparsity like LASSO but also
stabilizes the selection process like Ridge.

This is the equation of the Elastic Net regression loss function:

$$L(\beta) = -\sum_{i=1}^{n}\left(y_i \log(p_i) + (1-y_i)\log(1-p_i)\right) + \lambda\left(\alpha \sum_{j=1}^{k} |\beta_j| + (1-\alpha) \sum_{j=1}^{k} \beta_j^2\right)$$

The goal is to minimize this loss function over the 𝛽s of the logistic
regression equation (found in Chapter 5.1, in the log-odds part).

This is how it looks:

$$\min_{\beta}\left(-\sum_{i=1}^{n}\left(y_i \log(p_i) + (1-y_i)\log(1-p_i)\right) + \lambda\left(\alpha \sum_{j=1}^{k} |\beta_j| + (1-\alpha) \sum_{j=1}^{k} \beta_j^2\right)\right)$$

The key features added in Elastic Net:

1. The combination of Ridge and LASSO.
2. The coefficient 𝛼:

• When 𝛼 = 1: the regularization becomes LASSO.
• When 𝛼 = 0: the regularization becomes Ridge.
• When 0 < 𝛼 < 1: the regularization is Elastic Net.

This 𝛼 mapping translates directly to glmnet's alpha argument, as
sketched below.
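
A minimal sketch for illustration (an assumption about how the module's
penal_type maps onto glmnet, not its exact code):

# Minimal sketch (assumption): glmnet's alpha argument selects the penalty.
library(glmnet)

X <- model.matrix(y ~ . - 1, data = cc50_train)

fit_ridge <- glmnet(X, cc50_train$y, family = "binomial", alpha = 0)
fit_lasso <- glmnet(X, cc50_train$y, family = "binomial", alpha = 1)
fit_elastic <- glmnet(X, cc50_train$y, family = "binomial", alpha = 0.5)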

With tidymodels workflow

The tuned parameters for Elastic Net are:

1. Penalty parameter
2. Mixture, or simply 𝛼

cc50_en <- PLR$train_TM_logistic(
formula = y ~ .,
data = cc50_train,
new_data = cc50_test,
penal_type = "elastic_net"
)

Actual result:


── Workflow [trained] ──────────────────────────────────────────────

Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ────────────────────────────────────────────────────
0 Recipe Steps

── Model ───────────────────────────────────────────────────────────

Call: glmnet::glmnet(x = maybe_matrix(x), y = y, family = "binomial", alpha = ~0.5)

Df %Dev Lambda
1 0 0.00 0.31800
2 1 0.89 0.28980
3 1 1.71 0.26400
4 1 2.46 0.24060
5 1 3.16 0.21920
6 1 3.79 0.19970
7 1 4.36 0.18200
8 1 4.89 0.16580
9 1 5.36 0.15110
10 2 5.95 0.13770
11 2 6.52 0.12540
12 2 7.03 0.11430
13 2 7.48 0.10410
14 2 7.88 0.09489
15 3 8.38 0.08646
16 4 8.90 0.07878
17 4 9.40 0.07178
18 5 9.96 0.06541
19 5 10.51 0.05960
20 6 11.03 0.05430
21 7 11.52 0.04948


22 7 12.05 0.04508
23 8 12.58 0.04108
24 8 13.12 0.03743
25 8 13.59 0.03410
26 8 14.01 0.03107
27 8 14.39 0.02831
28 9 14.72 0.02580
29 9 15.01 0.02351
30 9 15.26 0.02142
31 9 15.49 0.01952
32 9 15.68 0.01778
33 9 15.85 0.01620
34 9 16.00 0.01476
35 9 16.13 0.01345
36 9 16.24 0.01226
37 9 16.34 0.01117
38 9 16.42 0.01018
39 9 16.49 0.00927
40 8 16.56 0.00845
41 8 16.61 0.00770
42 8 16.65 0.00701
43 8 16.69 0.00639
44 8 16.72 0.00582
45 8 16.75 0.00530
46 8 16.78 0.00483

...
and 19 more lines.


Training Set

caret::confusionMatrix(
data = table(
actual = cc50_train$y,
prediction = cc50_en$prediction
),
mode = "everything"
)

Confusion Matrix and Statistics

Prediction
Actual Nontoxic Toxic
Nontoxic 1714 1631
Toxic 720 5131

Accuracy : 0.7443
95% CI : (0.7353, 0.7532)
No Information Rate : 0.7353
P-Value [Acc > NIR] : 0.02527

Kappa : 0.4135

Mcnemar's Test P-Value : < 2e-16

Sensitivity : 0.7042
Specificity : 0.7588
Pos Pred Value : 0.5124
Neg Pred Value : 0.8769
Precision : 0.5124
Recall : 0.7042


F1 : 0.5932
Prevalence : 0.2647
Detection Rate : 0.1864
Detection Prevalence : 0.3637
Balanced Accuracy : 0.7315

'Positive' Class : Nontoxic

Prediction Set

caret::confusionMatrix(
data = table(
actual = cc50_test$y,
prediction = cc50_en$predict_new
),
mode = "everything"
)

Confusion Matrix and Statistics

Prediction
Actual Nontoxic Toxic
Nontoxic 536 588
Toxic 238 1708

Accuracy : 0.7309
95% CI : (0.7149, 0.7466)
No Information Rate : 0.7479
P-Value [Acc > NIR] : 0.985


Kappa : 0.3795

Mcnemar's Test P-Value : <2e-16

Sensitivity : 0.6925
Specificity : 0.7439
Pos Pred Value : 0.4769
Neg Pred Value : 0.8777
Precision : 0.4769
Recall : 0.6925
F1 : 0.5648
Prevalence : 0.2521
Detection Rate : 0.1746
Detection Prevalence : 0.3661
Balanced Accuracy : 0.7182

'Positive' Class : Nontoxic

Without tidymodels workflow

cc50_en <- PLR$traintest_glmnet(
formula = y ~ .,
data = cc50_train,
new_data = cc50_test,
penal_type = "elastic_net"
)


Training Set

caret::confusionMatrix(
data = table(
actual = cc50_train$y,
prediction = cc50_en$prediction
),
mode = "everything"
)

Confusion Matrix and Statistics

Prediction
Actual Nontoxic Toxic
Nontoxic 1714 1631
Toxic 720 5131

Accuracy : 0.7443
95% CI : (0.7353, 0.7532)
No Information Rate : 0.7353
P-Value [Acc > NIR] : 0.02527

Kappa : 0.4135

Mcnemar's Test P-Value : < 2e-16

Sensitivity : 0.7042
Specificity : 0.7588
Pos Pred Value : 0.5124
Neg Pred Value : 0.8769
Precision : 0.5124
Recall : 0.7042


F1 : 0.5932
Prevalence : 0.2647
Detection Rate : 0.1864
Detection Prevalence : 0.3637
Balanced Accuracy : 0.7315

'Positive' Class : Nontoxic

Prediction Set

caret::confusionMatrix(
data = table(
actual = cc50_test$y,
prediction = cc50_en$predict_new
),
mode = "everything"
)

Confusion Matrix and Statistics

Prediction
Actual Nontoxic Toxic
Nontoxic 536 588
Toxic 238 1708

Accuracy : 0.7309
95% CI : (0.7149, 0.7466)
No Information Rate : 0.7479
P-Value [Acc > NIR] : 0.985


Kappa : 0.3795

Mcnemar's Test P-Value : <2e-16

Sensitivity : 0.6925
Specificity : 0.7439
Pos Pred Value : 0.4769
Neg Pred Value : 0.8777
Precision : 0.4769
Recall : 0.6925
F1 : 0.5648
Prevalence : 0.2521
Detection Rate : 0.1746
Detection Prevalence : 0.3661
Balanced Accuracy : 0.7182

'Positive' Class : Nontoxic

Just like with ordinary logistic regression, nothing changes whether or
not you control the workflow, including cross-validation and
hyperparameter tuning, with tidymodels, except in the part where the
LASSO penalty is applied. Whether applied with tidymodels or not, we can
arguably say that the Elastic Net penalty type is a better model for
predicting both the training and test sets than Ridge or LASSO, but none
of them got a better result than the 2 RBF Network models, which reached
80% accuracy on both their Training and Prediction sets.

Other model: k-Nearest Neighbors

Another model to be compared to the RBF Network model is a nonparametric
but simple supervised learning model: the k-Nearest Neighbors, or kNN for
short. kNN is an instance-based algorithm that makes predictions based on
the closest examples in the training data. It works by storing all
available cases and classifying new cases by taking a majority vote of
their k nearest neighbors. In other words, for each new instance, the
algorithm finds the k closest data points (according to a distance metric
like Euclidean distance), and the most common class label among those
neighbors is used as the prediction. For regression tasks, the algorithm
averages the values of the k nearest neighbors to make a prediction. This
makes kNN simple, yet effective for both classification and regression
tasks, especially when the data has a clear local structure.

For k-Nearest Neighbors (kNN), I chose the kknn package in R. While there
may be other implementations of kNN in R, kknn is a standout choice for
several reasons. It is relatively fast and efficient, even when handling
moderately large datasets like mine, with 9,196 training samples and
3,070 test samples. A minimal sketch of calling it directly is shown
below.
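
The sketch below is an assumption about the direct kknn route, not the
module's exact code:

# Minimal sketch (assumption): train.kknn picks the best k (up to kmax)
# by leave-one-out cross-validation on the training set.
library(kknn)

fit <- train.kknn(y ~ ., data = cc50_train, kmax = 15)

preds_new <- predict(fit, newdata = cc50_test)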

Like the other implementations, the workflow of the kNN model is
organized into a single function, whether it's built with tidymodels or
without. These functions are stored in a script as a module, which can
then be accessed using box::use. The script is named knn to reflect its
purpose.

box::use(module_r/knn)

I stored the functions in a single script named knn, right? I saved this
file as knn.R and stored it in a folder named module_r. Hence, the
argument to box::use is module_r/knn, with no alias, because one isn't
needed anyway since the file name is so short. To view the source code of
module_r/knn, click this page.

Note

The problem at hand is a classification task, not a regression task, so
it makes sense to create a confusion matrix to classify observations
into categories (e.g., positive or negative classes).

With tidymodels

I am impressed that kNN is actually faster than I thought. Here, with its
defaults, I only made the function iterate over a few parameters, and I
still got a better result.

cc50_knn <- knn$train_TM_knn(
formula = y ~ .,
data = cc50_train,
new_data = cc50_test
)

Actual result:

── Workflow [trained] ──────────────────────────────────────────────

Preprocessor: Recipe
Model: nearest_neighbor()

── Preprocessor ────────────────────────────────────────────────────
0 Recipe Steps

── Model ───────────────────────────────────────────────────────────

Call:
kknn::train.kknn(formula = ..y ~ ., data = data, ks = min_rows(6L, data, 5))

Type of response variable: nominal


Minimal misclassification: 0.1008047
Best kernel: optimal
Best k: 6

Training Set

caret::confusionMatrix(
data = table(
actual = cc50_train$y,
prediction = cc50_knn$prediction
),
mode = "everything"
)

Confusion Matrix and Statistics

Prediction
Actual Nontoxic Toxic
Nontoxic 3180 165
Toxic 119 5732

Accuracy : 0.9691
95% CI : (0.9654, 0.9726)
No Information Rate : 0.6413


P-Value [Acc > NIR] : < 2.2e-16

Kappa : 0.9331

Mcnemar's Test P-Value : 0.007579

Sensitivity : 0.9639
Specificity : 0.9720
Pos Pred Value : 0.9507
Neg Pred Value : 0.9797
Precision : 0.9507
Recall : 0.9639
F1 : 0.9573
Prevalence : 0.3587
Detection Rate : 0.3458
Detection Prevalence : 0.3637
Balanced Accuracy : 0.9680

'Positive' Class : Nontoxic

Prediction Set

caret::confusionMatrix(
data = table(
actual = cc50_test$y,
prediction = cc50_knn$predict_new
),
mode = "everything"
)

Confusion Matrix and Statistics


Prediction
Actual Nontoxic Toxic
Nontoxic 951 173
Toxic 153 1793

Accuracy : 0.8938
95% CI : (0.8824, 0.9045)
No Information Rate : 0.6404
P-Value [Acc > NIR] : <2e-16

Kappa : 0.7704

Mcnemar's Test P-Value : 0.2927

Sensitivity : 0.8614
Specificity : 0.9120
Pos Pred Value : 0.8461
Neg Pred Value : 0.9214
Precision : 0.8461
Recall : 0.8614
F1 : 0.8537
Prevalence : 0.3596
Detection Rate : 0.3098
Detection Prevalence : 0.3661
Balanced Accuracy : 0.8867

'Positive' Class : Nontoxic


Without tidymodels workflow

cc50_knn <- knn$traintest_knn(
formula = y ~ .,
data = cc50_train,
new_data = cc50_test
)

Training Set

caret::confusionMatrix(
data = table(
actual = cc50_train$y,
prediction = cc50_knn$predictions
),
mode = "everything"
)

Confusion Matrix and Statistics

Prediction
Actual Nontoxic Toxic
Nontoxic 2977 368
Toxic 303 5548

Accuracy : 0.927
95% CI : (0.9215, 0.9323)
No Information Rate : 0.6433
P-Value [Acc > NIR] : < 2e-16


Kappa : 0.8417

Mcnemar's Test P-Value : 0.01349

Sensitivity : 0.9076
Specificity : 0.9378
Pos Pred Value : 0.8900
Neg Pred Value : 0.9482
Precision : 0.8900
Recall : 0.9076
F1 : 0.8987
Prevalence : 0.3567
Detection Rate : 0.3237
Detection Prevalence : 0.3637
Balanced Accuracy : 0.9227

'Positive' Class : Nontoxic

Prediction Set

caret::confusionMatrix(
data = table(
actual = cc50_test$y,
prediction = cc50_knn$predictions_new
),
mode = "everything"
)

Confusion Matrix and Statistics


Prediction
Actual Nontoxic Toxic
Nontoxic 914 210
Toxic 172 1774

Accuracy : 0.8756
95% CI : (0.8634, 0.887)
No Information Rate : 0.6463
P-Value [Acc > NIR] : < 2e-16

Kappa : 0.73

Mcnemar's Test P-Value : 0.05835

Sensitivity : 0.8416
Specificity : 0.8942
Pos Pred Value : 0.8132
Neg Pred Value : 0.9116
Precision : 0.8132
Recall : 0.8416
F1 : 0.8271
Prevalence : 0.3537
Detection Rate : 0.2977
Detection Prevalence : 0.3661
Balanced Accuracy : 0.8679

'Positive' Class : Nontoxic

With or without tidymodels, I still got a Training set result greater
than 90% and a Prediction set result greater than 85%, making kNN better
than the previous 2 RBF Network models. That is an exceptional result,
even though the metrics are in favor of the tidymodels workflow.

Part VII.

Model Evaluation

Model Evaluation

Throughout Chapter 4 and Chapter 5, even though I calculated all of the
metrics with caret::confusionMatrix, the problem is that I summarized the
results via accuracy only. In this chapter, the things to consider when
evaluating the classification models' performance are:

1. Accuracy
2. Precision
3. Recall
4. F1-Score

Metrics like Sensitivity and Specificity will be summarized using the
Mosaic plot and the ROC-AUC curve plots. A short sketch of how these four
metrics fall out of a binary confusion matrix is given below.
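
For clarity, here are the definitions written out in code (a sketch; they
are the same quantities caret::confusionMatrix reports under
mode = "everything"):

# Sketch: summary metrics from the counts of a binary confusion matrix,
# where tp/fp/fn/tn are counted with respect to the positive class.
classification_metrics <- function(tp, fp, fn, tn) {
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)  # also known as sensitivity
  c(
    accuracy = (tp + tn) / (tp + fp + fn + tn),
    precision = precision,
    recall = recall,
    f1 = 2 * precision * recall / (precision + recall)
  )
}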
Here is the summary table for Accuracy, Precision, Recall, and F1-Score:
This PDF is only a skeleton. Please read the online HTML version
As you can see, the results of the 2 RBF Network models are the same,
with no controlled workflows or anything of the sort. However, the
metrics you see for the Chapter 5 models are the ones I obtained with the
tidymodels workflows (not the ones without; I only showed those to
demonstrate how it's done), since the models compared from Chapter 5 are
the tidymodels ones.

Mosaic Plot

The mosaic plots of all the models, visualizing their confusion matrices,
are merged into 1 plot:
ROC and AUC

RBF Network

1 Layer

2 Layers

Logistic Regression

Extreme Gradient Boosting (XGBoost)

Support Vector Machine (SVM)

Naive Bayes

Random Forest

Penalized Logistic Regression

k-Nearest Neighbors

Part VIII.

Chapter 6: Summary and Conclusion
Summary and Conclusion

Most of the models other than the Radial Basis Function Neural Network
(RBF Network) outperform it, whether they are controlled by tidymodels or
not. To be fair, the RBF Network models were not trained with many
parameters, and I had weak computational power, which made this model
underperform. In conclusion, this model is strong in training but weak in
inference.
