
IARJSET ISSN (O) 2393-8021, ISSN (P) 2394-1588

International Advanced Research Journal in Science, Engineering and Technology


Impact Factor 8.066Peer-reviewed & Refereed journalVol. 11, Issue 6, June 2024
DOI: 10.17148/IARJSET.2024.11624

AID362 Bioassay Classification and Regression (Neuronal Network and Extra Tree) with Machine Learning
Sowmya T1, Sandeep S2, Chirag Suresh3, Rachana N4, Vasantha Poojary5
Assistant Professor, Computer Science and Engineering, BMS College of Engineering, Bengaluru, India1
Assistant Professor, Electronics and Communication Engineering, BMS College of Engineering, Bengaluru, India2
Student, Computer Science and Engineering, BMS College of Engineering, Bengaluru, India3
Student, Computer Science and Engineering, BMS College of Engineering, Bengaluru, India4
Student, Computer Science and Engineering, BMS College of Engineering, Bengaluru, India5

Abstract: This study investigates the use of machine learning techniques, specifically neural networks and Extra Trees, to classify and predict outcomes for the AID362 bioassay dataset. The suitability of these algorithms for bioassay classification and regression is examined through systematic experimentation and analysis. The dataset, obtained from the UCI repository, is prepared using several preprocessing steps, including cleaning, normalization, and feature scaling. Extra Trees-based feature selection reduces dimensionality and mitigates overfitting. A range of performance indicators demonstrates the effectiveness of the trained models in handling high-dimensional, nonlinear bioassay data. The findings highlight the strength of neural networks in identifying complex patterns and the advantages of Extra Trees in handling complicated datasets.

Keywords: Bioassay, AID362, Classification, Regression.

I. INTRODUCTION

Pharmaceutical and toxicological research relies heavily on evaluating the biological activity and toxicity of chemical substances. The AID362 dataset is a useful resource that includes detailed data from several bioassays carried out on chemical compounds. These assays cover many aspects, including molecular descriptors, activity classifications, and toxicity levels.

In toxicological and pharmaceutical research, bioassays are essential for assessing the toxicity and biological activity of chemical substances. Bioassay data analysis has traditionally involved manual labor and lengthy processing times. The integration of machine learning techniques has transformed the field by providing automated approaches that improve accuracy and speed up analysis. The main focus of this work is the AID362 dataset, a repository containing a wealth of data on bioassays carried out on chemical substances. Our goal is to handle bioassay classification and regression tasks efficiently by exploiting machine learning methods, particularly neural networks and Extra Trees.

Neural networks: Inspired by the neural networks of the human brain, these models are able to recognize complex patterns and correlations in data. Their versatility and ability to perform nonlinear mappings make them well suited to bioassay analysis, which involves intricate relationships between chemical characteristics and biological responses.

Extra Trees: Extra Trees is an ensemble learning technique that builds several decision trees and combines their outputs, and it can be used alongside neural networks to enhance predictive performance. Extra Trees is especially effective at handling high-dimensional data and reducing noise, providing robust solutions for bioassay analysis.

The goals of this research project are to apply the neural network and Extra Tree models, assess their effectiveness on the AID362 dataset, and determine the advantages and disadvantages of each approach through a comparative analysis. By shedding light on the use of machine learning in bioassay analysis, this work aims to improve our knowledge of chemical compound behavior and toxicity and ultimately aid in the creation of safer and more effective medications.

II. LITERATURE SURVEY

Overview of bioassay analysis: Bioassay analysis measures the effect of substances on living organisms or biological systems in order to determine their biological activity. This can involve evaluating toxicity levels, determining the potency of drugs, and identifying possible therapeutic agents.

Bioassay classification and regression using machine learning: The application of machine learning techniques in bioassay analysis has grown in recent years because of their capacity to process large datasets and discern intricate patterns. Classification and regression methods are frequently used to predict a compound's biological activity and toxicity from its chemical features.

Neural networks: Composed of interconnected nodes arranged in layers, neural networks are modeled after the organization of the human brain. Their ability to learn intricate relationships between input and output variables has made them popular in many fields, including bioinformatics and drug development.

Extra Trees: Extra Trees is an ensemble learning technique that builds several decision trees and aggregates their predictions to improve accuracy. It works especially well with high-dimensional data and handles noisy or incomplete datasets with ease.

A review of related research on machine learning algorithms for bioassay classification and regression is presented. The
literature includes studies on deep learning-based structure-activity relationship modeling for chemical toxicity
classification and the use of decision trees, random forest, and deep neural networks for predicting acute oral toxicity.

III. IMPLEMENTATION

Data Collection and Preprocessing of the dataset

⚫ Description of the AID362 dataset: The AID362 dataset contains information on various bioassays conducted on chemical compounds. It includes features such as molecular descriptors, activity classes, and toxicity levels.
⚫ Data acquisition: The data for the project are obtained from the UCI repository, as described in the literature.
⚫ Data cleaning: The dataset underwent cleaning procedures to remove missing values, outliers, and redundant features, ensuring the quality and integrity of the data used for model training.
⚫ Data normalization and scaling: Normalization techniques were applied to scale the input features to a standard range, so that all features contribute equally during model training and biases towards particular variables are avoided.
⚫ Feature selection and extraction: Feature selection techniques were employed to identify the most relevant features for bioassay analysis, which improves model performance and reduces computational complexity. The Extra Trees ensemble feature selection technique is used to retrieve the optimal feature set, shrinking the feature space and helping to prevent overfitting (a Python sketch of these steps follows this list).
⚫ Model training: The neural network and Extra Tree models are trained on the preprocessed data and the extracted features.
⚫ Model evaluation: The performance of the models is evaluated using metrics such as accuracy, F1-score, precision, recall, and computation time.
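As a concrete illustration of this pipeline, the sketch below shows how the acquisition, cleaning, normalization, and Extra Trees feature-selection steps could be implemented with pandas and scikit-learn. It is a minimal sketch, not the authors' code: the file name AID362_bioassay.csv, the target column name Outcome, the min-max scaling choice, and the median importance threshold are all assumptions.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Data acquisition and cleaning (file and column names are assumed placeholders)
df = pd.read_csv("AID362_bioassay.csv")
df = df.dropna().drop_duplicates()

X = df.drop(columns=["Outcome"])   # molecular descriptors (assumed numeric)
y = df["Outcome"]                  # activity class / toxicity label

# Normalization: scale every feature to the [0, 1] range
X_scaled = MinMaxScaler().fit_transform(X)

# Extra Trees ensemble feature selection: keep features whose importance
# is above the median importance of the fitted ensemble
selector = SelectFromModel(
    ExtraTreesClassifier(n_estimators=100, random_state=42),
    threshold="median")
X_reduced = selector.fit_transform(X_scaled, y)
print("Selected features:", list(X.columns[selector.get_support()]))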

Neural network architecture: The neural network model comprises three types of layers of neurons: an input layer, hidden layers, and an output layer. A range of optimization strategies and activation functions was explored in an effort to improve model performance.

Extra Trees technique: An ensemble of decision trees was constructed using the Extra Trees technique. Parameters such as the number of trees and their maximum depth were tuned for optimal results.

Training and evaluating the models: The neural network and Extra Tree models were trained on the preprocessed dataset and assessed using metrics such as accuracy, precision, recall, and F1-score. Cross-validation techniques were employed to ensure robustness and generalizability.
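Continuing the sketch above, the fragment below trains both models on the reduced feature matrix and reports cross-validated accuracy, precision, recall, and F1-score. The layer sizes, activation function, tree count, and number of folds are illustrative assumptions rather than the settings used in the study.

from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_validate

# Two candidate models: a feed-forward neural network (input, hidden and
# output layers) and an Extra Trees ensemble; hyperparameters are assumed.
models = {
    "Neural network": MLPClassifier(hidden_layer_sizes=(64, 32),
                                    activation="relu", max_iter=500,
                                    random_state=42),
    "Extra Trees": ExtraTreesClassifier(n_estimators=200, max_depth=None,
                                        random_state=42),
}

scoring = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]
for name, model in models.items():
    cv = cross_validate(model, X_reduced, y, cv=5, scoring=scoring)
    summary = {s: round(cv["test_" + s].mean(), 3) for s in scoring}
    print(name, summary)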


Figure 1: Partitions (left) and decision tree structure (right) for a CART classification model with three classes labeled 1,
2, 3. At each intermediate node, a case goes to the left child node if and only if the condition is satisfied. The predicted
class is given beneath each leaf node.

In a classification problem, we have a training sample of n observations on a class variable Y that takes values 1, 2, ..., k, and p predictor variables, X1, ..., Xp. Our aim is to find a model for predicting the value of Y from new X values. In theory, the solution is simply a partition of the X space into k disjoint sets, A1, A2, ..., Ak, such that the predicted value of Y is j if X belongs to Aj, for j = 1, 2, ..., k. Two traditional methods for handling ordered values in the X variables are nearest-neighbor classification [13] and linear discriminant analysis [17]. These methods yield sets Aj with nonlinear and piecewise linear boundaries, respectively, which are not easy to interpret when p is large.

Classification tree approaches instead split the data set recursively, one X variable at a time, producing rectangular sets Aj that are easier to interpret. As an illustration, consider the scenario with three classes and two X variables shown in Figure 1. The data points and partitions are plotted in the left panel, and the corresponding decision tree structure is displayed in the right panel. The tree structure has the advantage of being applicable to any number of variables, whereas the plot on the left can be used with at most two.
The first published classification tree algorithm is THAID [16, 35]. Employing a measure of node impurity based on the distribution of the observed Y values in the node, THAID splits a node by exhaustively searching over all X and S for the split {X ∈ S} that minimizes the total impurity of its two child nodes. If X takes ordered values, the set S is an interval of the form (−∞, c]. Otherwise, S is a subset of the values taken by X. The process is applied recursively on the data in each child node. Splitting stops if the relative decrease in impurity is below a pre-specified threshold. Algorithm 1 gives the pseudocode for the basic steps.

Algorithm 1 Pseudocode for tree construction by exhaustive search


1. Start at the root node.
2. For each X, find the set S that minimizes the sum of the node impurities in the two child nodes, and choose the split {X* ∈ S*} that gives the minimum over all X and S.
3. If a stopping criterion is reached, exit. Otherwise, apply step 2 to each child node in turn.
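To make step 2 concrete, the sketch below searches for the best split of a single ordered variable using the Gini index as the impurity measure. It is an illustrative implementation of the exhaustive search, not code taken from THAID or CART.

import numpy as np

def gini(labels):
    # Gini impurity of a set of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    # x and y are NumPy arrays of equal length (one ordered predictor, class labels).
    # Exhaustive search over the m - 1 candidate cutpoints of an ordered variable:
    # choose c so that the weighted impurity of {X <= c} and {X > c} is smallest.
    best_c, best_impurity = None, np.inf
    for c in np.unique(x)[:-1]:
        left, right = y[x <= c], y[x > c]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if impurity < best_impurity:
            best_c, best_impurity = c, impurity
    return best_c, best_impurity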

C4.5 [38] and CART [5] are two later classification tree algorithms that follow this approach. C4.5 uses entropy for its impurity function, while CART uses a generalization of the binomial variance called the Gini index. Unlike THAID, however, they first grow an overly large tree and then prune it to a smaller size to minimize an estimate of the misclassification error. CART employs ten-fold (default) cross-validation, while C4.5 uses a heuristic formula to estimate error rates. CART is implemented in the R system [39] as RPART [43], which we use in the examples below.

Despite its simplicity and elegance, the exhaustive search approach has an undesirable property. Note that an ordered variable with m distinct values has (m − 1) splits of the form X ≤ c, and an unordered variable with m distinct values has (2^(m−1) − 1) splits of the form X ∈ S. Therefore, if everything else is equal, variables that have more distinct values have a greater chance of being selected. This selection bias affects the integrity of inferences drawn from the tree structure.

Building on an idea that originated in the FACT [34] algorithm, CRUISE [21, 22], GUIDE [31] and QUEST [33] use a two-step approach based on significance tests to split each node. First, each X is tested for association with Y and the most significant variable is selected. Then an exhaustive search is performed for the set S. Because every X has the same chance of being selected if each is independent of Y, this approach is effectively free of selection bias. Besides, much computation is
saved as the search for S is carried out only on the selected X variable. GUIDE and CRUISE use chi-squared tests, and QUEST uses chi-squared tests for unordered variables and analysis of variance tests for ordered variables. CTree [18], another unbiased method, uses permutation tests. Pseudocode for the GUIDE algorithm is given in Algorithm 2. The CRUISE, GUIDE and QUEST trees are pruned in the same way as CART.

Algorithm 2 Pseudocode for GUIDE classification tree construction


1. Start at the root node.
2. For each ordered variable X, convert it to an unordered variable X′ by grouping its values in the node into a small number of intervals. If X is unordered, set X′ = X.
3. Perform a chi-squared test of independence of each X′ variable versus Y on the data in the node and compute its significance probability.
4. Choose the variable X* associated with the X′ that has the smallest significance probability.
5. Find the split set {X* ∈ S*} that minimizes the sum of Gini indexes and use it to split the node into two child nodes.
6. If a stopping criterion is reached, exit. Otherwise, apply steps 2–5 to each child node.
7. Prune the tree with the CART method.
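A minimal sketch of the variable-selection part of this algorithm (steps 2–4) is given below, using pandas and SciPy. The binning of ordered variables into four equal-width intervals is an assumed simplification of GUIDE's grouping rule, so this is an illustration of the idea rather than the actual GUIDE implementation.

import pandas as pd
from scipy.stats import chi2_contingency

def select_split_variable(X, y, n_bins=4):
    # X: DataFrame of predictors, y: Series of class labels at this node.
    # Returns the name of the predictor most significantly associated with y.
    p_values = {}
    for col in X.columns:
        x = X[col]
        if pd.api.types.is_numeric_dtype(x):
            x = pd.cut(x, bins=n_bins)            # step 2: ordered X -> unordered X'
        table = pd.crosstab(x, y)                 # step 3: contingency table of X' vs Y
        table = table.loc[table.sum(axis=1) > 0]  # drop empty bins before the test
        _, p_value, _, _ = chi2_contingency(table)
        p_values[col] = p_value
    return min(p_values, key=p_values.get)        # step 4: smallest significance probability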

CHAID [20] employs yet another strategy. If X is an ordered variable, its data values in the node are split into ten intervals and one child node is assigned to each interval. If X is unordered, one child node is assigned to each value of X. CHAID then uses significance tests with Bonferroni corrections to iteratively merge pairs of child nodes. This approach has two consequences. First, some nodes may be split into more than two child nodes. Second, owing to the sequential nature of the tests and the inexactness of the corrections, the method is biased toward selecting variables with few distinct values.

CART, CRUISE and QUEST can allow splits on linear combinations of all the ordered variables, while GUIDE can split on combinations of two variables at a time. If there are missing values, CART and CRUISE use alternate splits on other variables when needed, C4.5 sends each observation with a missing value in a split through every branch using a probability weighting scheme, QUEST imputes the missing values locally, and GUIDE treats missing values as belonging to a separate category. All except C4.5 accept user-specified misclassification costs, and all except C4.5 and CHAID accept user-specified class prior probabilities. By default, all algorithms fit a constant model to each node, predicting Y to be the class with the smallest misclassification cost. CRUISE can optionally fit bivariate linear discriminant models, and GUIDE can fit bivariate kernel density and nearest neighbor models in the nodes. GUIDE can also produce ensemble models using bagging [3] and random forest [4] techniques. Table 1 summarizes the features of the algorithms.

Table 1: Comparison of classification tree methods. A check mark indicates presence of a feature. The codes are: b = missing value branch, c = constant model, d = discriminant model, i = missing value imputation, k = kernel density model, l = linear splits, m = missing value category, n = nearest neighbor model, u = univariate splits, s = surrogate splits, w = probability weights.

To see how the algorithms perform in a real application, we apply them to a data set on new cars for the 1993 model year [26]. There are 93 cars and 25 variables. We let the Y variable be the type of drive train, which takes three values (rear, front, or four-wheel drive). The X variables are listed in Table 2. Three are unordered (manuf, type, and airbag, taking 31, 6, and 3 values, respectively), two are binary-valued (manual and domestic), and the rest are ordered. The class frequencies are rather unequal: 16 (17.2%) are rear, 67 (72.0%) are front, and 10 (10.8%) are four-wheel drive vehicles. To avoid the randomness of ten-fold cross-validation, we use leave-one-out (i.e., n-fold) cross-validation to prune the CRUISE, GUIDE, QUEST, and RPART trees in this article.

Figure 2 shows the results when the 31-valued variable manuf is excluded. The CHAID tree is not shown because it has no splits. The wide variety of variables selected in the splits is due partly to differences between the algorithms and partly to the absence of a dominant X variable. Variable passngr is chosen by three algorithms (C4.5, GUIDE, QUEST); enginsz, fuel, length, and minprice by two; and hp, hwympg,
luggage, maxprice, rev, type, and width by one each. Variables airbag, citympg, cylin, midprice, and rpm are not selected by any. When GUIDE does not find a suitable variable to split a node, it looks for a linear split on a pair of variables. One such split, on enginsz and rseat, occurs at the node marked with an asterisk (*) in the GUIDE tree. Restricting the linear split to two variables allows the data and the split to be displayed in a plot, as shown in Figure 3. Clearly, no single split on either variable alone can do as well in separating the two classes there.

Figure 4 shows the C4.5, CRUISE and GUIDE trees when variable manuf is included. Now CHAID, QUEST and RPART give no splits. Comparing them with their counterparts in Figure 2, we see that the C4.5 tree is unchanged, the CRUISE tree has an additional split (on manuf), and the GUIDE tree is much shorter. This behavior is not uncommon when there are many variables with little or no predictive power: their introduction can substantially reduce the size of a tree structure and its prediction accuracy; see, e.g., [15] for more empirical evidence.

Table 3 reports the computational times used to fit the tree models on a computer with a 2.66 GHz Intel Core 2 Quad Extreme processor. The fastest algorithm is C4.5, which takes milliseconds. If manuf is excluded, the next fastest is RPART, at a tenth of a second. But if manuf is included, RPART takes more than three hours, roughly a 10^5-fold increase. This is a practical problem with the CART algorithm: because manuf takes 31 values, the algorithm must search through 2^30 − 1 (more than 1 billion) splits at the root node alone. (If Y takes only two values, a computational shortcut [5, p. 101] reduces the number of searches to just 30 splits.) C4.5 is not similarly affected because it does not search for binary splits on unordered X variables. Instead, C4.5 splits the node into one branch for each X value and then merges some branches after the tree is grown.
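The arithmetic behind this bottleneck is easy to verify: an unordered predictor with m categories admits 2^(m−1) − 1 candidate binary splits, so for manuf (m = 31) the root node alone has over a billion candidates. The two lines below simply evaluate that count.

m = 31                      # number of distinct manuf values
print(2 ** (m - 1) - 1)     # 1073741823 candidate binary splits at the root node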


CRUISE, GUIDE and QUEST are also unaffected because they search exhaustively for splits on unordered variables only if the number of values is small. If the latter is large, these algorithms employ a technique in [34] that uses linear discriminant analysis on dummy variables to find the splits.

Figure 2: CRUISE, QUEST, RPART, C4.5 and GUIDE trees for car data without manuf. The CHAID tree is trivial with no splits. At each intermediate node, a case goes to the left child node if and only if the condition is satisfied. The predicted class is given beneath each leaf node.


Figure 3: Data and split at the node marked with an asterisk (*) in the GUIDE tree in Figure 2.

Figure 4: CRUISE, C4.5 and GUIDE trees for car data with manuf included. The CHAID, RPART and QUEST trees are
trivial with no splits. Sets S1 and S2


A regression tree is similar to a classification tree, except that the Y variable takes ordered values and a regression model is fitted to each node to give the predicted values of Y. Historically, AID [36], which appeared a few years before THAID, was the first regression tree method. The AID and CART regression tree methods follow Algorithm 1, with the node impurity being the sum of squared deviations about the mean and the node prediction being the sample mean of Y. This yields piecewise constant models, which are easy to interpret but whose prediction accuracy frequently falls short of that of more smoothly constructed models.

Applying this method to piecewise linear models can be computationally demanding, however, as each prospective split requires the fitting of two linear models, one for each child node. M5' [44], a variant of Quinlan's [37] regression tree approach, builds piecewise linear models using a more computationally efficient method: it first builds a piecewise constant tree and then fits a linear regression model to the data in each leaf node. Because the tree structure is the same as that of a piecewise constant model, the resulting trees tend to be larger than those from other piecewise linear tree approaches. GUIDE [27] addresses the regression problem by employing classification tree methods.

At each node, GUIDE fits a regression model to the data and computes the residuals. It then defines a class variable Y′ taking values 1 or 2, depending on whether the residual is positive or not, and applies Algorithm 2 to the Y′ variable to split the node into two. This approach has three advantages: (i) the splits are unbiased; (ii) only one regression model is fitted at each node; and (iii) because it is based on residuals, the method is limited neither to piecewise constant models nor to the least squares criterion. Table 4 lists the main features of CART, GUIDE and M5'.
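The residual-sign idea is simple enough to sketch in a few lines. The fragment below, which reuses the select_split_variable helper from the earlier GUIDE sketch and assumes numeric predictors, is an illustrative reading of the approach rather than the actual GUIDE implementation: it fits one linear model at the node, converts residual signs into a two-class variable Y′, and hands that variable to the unbiased classification split selection.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def guide_regression_split_variable(X, y):
    # X: DataFrame of numeric predictors, y: Series of ordered responses at this node.
    node_model = LinearRegression().fit(X, y)          # one regression model per node
    residuals = y - node_model.predict(X)
    y_prime = pd.Series(np.where(residuals > 0, 1, 2), index=X.index)  # Y' in {1, 2}
    return select_split_variable(X, y_prime)           # unbiased chi-squared selection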

Table 4: Comparison of regression tree methods. A check mark indicates presence of a feature. The codes are: a = missing value category, c = constant model, g = global mean/mode imputation, l = linear splits, m = multiple linear model, p = polynomial model, r = stepwise linear model, s = surrogate splits, u = univariate splits, v = least squares, w = least median of squares, quantile, Poisson, and proportional hazards.


To compare CART, GUIDE and M5' with ordinary least squares (OLS) linear regression, we apply them to data on smoking and pulmonary function in children [40]. The data, collected from 654 children aged 3–19, give the forced expiratory volume (FEV, in liters), gender (sex, M/F), smoking status (smoke, Y/N), age (years), and height (ht, inches) of each child. Using an OLS model for predicting FEV that includes all the variables, Kahn [19] finds that smoke is the only one not statistically significant. He also finds a significant age-smoke interaction if ht is excluded, but not if ht and its square are both included. This difficulty with interpreting OLS models often occurs when collinearity is present (the correlation between age and height is 0.8).

Figure 5 shows five regression tree models: (a) GUIDE piecewise constant (with the sample mean of Y as the predicted value in each node), (b) GUIDE best simple linear (with a linear regression model involving only one predictor in each node), (c) GUIDE best simple quadratic (with a quadratic model involving only one predictor in each node), (d) GUIDE stepwise linear (with a stepwise linear regression model in each node), and (e) M5' piecewise constant. The CART tree (from RPART) is a subtree of (a), with its six leaf nodes marked by asterisks (*). In the piecewise polynomial models (b) and (c), the predictor variable is found independently in each node, and non-significant terms of the highest orders are dropped.

For example, for model (b) in Figure 5, a constant is fitted in the node containing females taller than 66.2 inches, because the linear term for ht is not significant at the 0.05 level. Similarly, two of the three leaf nodes in model (c) are fitted with first-degree polynomials in ht because the quadratic terms are not significant. Since the nodes, and hence the domains of the polynomials, are defined by the splits in the tree, the estimated regression coefficients typically vary between nodes.

Because the total model complexity is shared between the tree structure and the set of node models, the complexity of a tree structure often decreases as the complexity of the node models increases. The user can therefore choose a model by trading off tree structure complexity against node model complexity. Piecewise constant models are mainly used for the insights their tree structures provide, but they tend to have low prediction accuracy unless the data are sufficiently informative and plentiful to yield a tree with many nodes. The trouble is that the larger the tree, the harder it is to derive insight from it. Trees (a) and (e) are quite large, but because they split almost exclusively on ht, we can infer from the predicted values in the leaf nodes that FEV increases monotonically with ht.

The piecewise simple linear (b) and quadratic (c) models reduce tree complexity without much loss (if any) of interpretability. Instead of being used to split nodes, ht now serves exclusively as the predictor variable in each node. This suggests that ht has strong linear and possibly quadratic effects. On the other hand, the splits on age and sex point to interactions between them and ht. These interactions can be interpreted with the help of Figures 6 and 7, which plot the data values of FEV and ht together with the fitted regression functions, using a different symbol and color for each node. In Figure 6, the slope of ht is zero for the group of females taller than 66.2 inches, but it is constant and non-zero across the other groups. This indicates a three-way interaction involving age, ht and sex. A similar conclusion can be drawn from Figure 7, where there are only three groups, with the group of children aged 11 or below exhibiting a quadratic effect of ht on FEV. For children above age 11, the effect of ht is linear, but males have, on average, about half a liter more FEV than females. Thus the effect of sex seems to be due mainly to children older than 11 years. This conclusion is reinforced by the piecewise stepwise linear tree in Figure 5(d), where a stepwise linear model is fitted to each of the two leaf nodes. Age and ht are selected as linear predictors in both leaf nodes, but sex is selected only in the node corresponding to children taller than 66.5 inches, seventy percent of whom are above 11 years old.


Figure 5: GUIDE piecewise constant, simple linear, simple quadratic, and stepwise linear, and M5’ piecewise constant
regression trees for predicting FEV. The RPART tree is a subtree of (a), with leaf nodes marked by asterisks (*). The
mean FEV and linear predictors (with signs of the coefficients) are printed beneath each leaf node. Variable ht2 is the
square of ht.

Figure 6: Data and fitted regression lines in the five leaf nodes of the GUIDE piecewise simple linear model in Figure
5(b).


Figure 7: Data and fitted regression functions in the three leaf nodes of the GUIDE piecewise simple quadratic model in Figure 5(c).

Figure 8: Observed versus predicted values for the tree models in Figure 5 and two
ordinary least squares models.
Figure 8 plots the observed versus predicted values of the four GUIDE models and two OLS models containing all the variables, without and with the square of height. The discreteness of the predicted values from the piecewise constant model is obvious, as is the curvature in plot (e). The
piecewise simple quadratic model in plot (c) is strikingly similar to plot (f), where the OLS model includes ht-squared. This suggests that the two models have similar prediction accuracy. Model (c) has an advantage over model (f), however, because the former can be interpreted through its tree structure and the graph of its fitted function in Figure 7.

IV. RESULTS

Performance evaluation measures: Several metrics, including accuracy, precision, recall, and F1-score, were used to assess the performance of the neural network and Extra Tree models. Comparing these results established the effectiveness of each method for bioassay analysis.

Analysis of the neural network and Extra Tree models: The models' performance was compared across a variety of bioassay classification and regression tasks, and conclusions were drawn about the advantages and disadvantages of each algorithm for estimating biological activity and toxicity levels.

Interpretation of the outcomes: The experimental results were analyzed to identify trends and patterns in the data, highlighting the variables that affect bioassay outcomes and the predictive power of the machine learning models.

V. CONCLUSION

In summary, this project has demonstrated how neural networks and Extra Trees can be applied to bioassay classification and regression problems. Both algorithms produced encouraging results in accurately predicting the biological activity and toxicity thresholds of chemical substances.

Certain limitations of these approaches must nevertheless be acknowledged. The performance and efficiency of both neural networks and Extra Trees can be affected by hyperparameter sensitivity and computational complexity, and addressing these challenges is essential to improving the robustness and practical usefulness of the models.

Future work may investigate other ensemble techniques and machine learning algorithms for bioassay analysis. Experimenting with several approaches can reveal which ones produce the most accurate and dependable predictions. The models' interpretability and reliability could be further improved by incorporating domain knowledge and expert input into the modeling process, making them more suitable for practical settings.

By resolving these constraints and investigating new avenues for improvement, researchers can advance the field of bioassay analysis and ultimately contribute to the development of safer and more effective chemical compounds for diverse applications.

REFERENCES

[1] https://github.com/emirhanai/AID362-Bioassay-Classification-and-RegressionNeuronal-Network-and-Extra-Tree-with-Machine-Learnin/issues
[2] https://github.com/emirhanai/AID362-Bioassay-Classification-and-RegressionNeuronal-Network-and-Extra-Tree-with-Machine-Learnin
[3] https://www.frontiersin.org/journals/physiology/articles/10.3389/fphys.2019.01044/full
[4] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9573053/
[5] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8885585/

