JMP Neural Network Methodology
Christopher M. Gotwalt
Director of Statistical Research and Development
JMP Division
SAS Institute
1. Introduction
In JMP9, the Neural platform has been completely rewritten to make JMP
more competitive in the desktop data mining market. The legacy platform
is still available for backward compatibility via scripting and through
the user interface via a preference. The design goals for the project were to
create a new platform with a richer set of modeling options, with much im-
proved speed performance. Another important goal was to simplify certain
decisions, such as setting the penalty parameter, by finding the optimal value
of the penalty parameter for the user automatically as a part of the fitting
algorithm. This monograph describes the implementation details of the al-
gorithms and statistical methods used by the platform. Below is a summary
of the new features of the Neural platform. The features that are marked
(JMP-PRO) are only available in JMP-PRO.
• The early stopping rule is employed to speed the model fit and improve
predictive performance.
• The ability to fit two layer fully connected multilayer perceptron models
(JMP-PRO).
• Automated model construction through the use of gradient boosting
(JMP-PRO).
• Outlier resistant modeling via an option to train the model via least
absolute deviations (JMP-PRO).
• A richer set of penalty functions that can be applied to the input vari-
ables (JMP-PRO).
2. Modeling Details
The Neural platform in JMP-PRO can fit one and two layer, fully connected,
multilayer perceptron (MLP) neural network models. An accessible introduc-
tion to these models for those with some statistical training can be found in
The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman
(2001). The interested reader can refer there for the basic description and
structure of MLP models, which will not be given in detail here. The two
layer models are new to JMP and are offered only in JMP-PRO.
There are significant differences in the Neural platform between JMP9
and JMP9-PRO. The JMP9 version of the platform retains most of the
functionality of the JMP8 platform, and is similar in that it can fit single
layer neural models with the tanh activation function. However, the new
platform is simpler than the legacy platform in that many of the controls
have been suppressed. This was done for a variety of reasons. For example,
in previous versions of JMP, the user had to specify a penalty parameter to
regularize the network parameters. There isn’t any meaning to particular
values of the penalty parameter, and other than crossvalidation performance
there isn’t a reason to favor one value of the penalty parameter over another.
For this reason, the optimal value of the penalty parameter is found behind
the scenes. Another control that is in the old platform but not the new
one is the convergence tolerance. The convergence tolerance was removed
because it is no longer relevant. The platform now shortcuts out
of the optimization algorithm using a crossvalidation based early stopping
rule. This means that the optimizer won’t have the opportunity to use the
tolerance because it will almost never run to convergence.
There are two significant features, the Sequence of Fits option and the
Profiler confidence intervals, that are in the old platform but not the new
version. Sequence of Fits was a JMP8 Neural platform option that automated
the modeling process by fitting a set of models over a user-specified range of
numbers of nodes and penalty parameters. This has been superseded by
the boosting algorithms in JMP-PRO which build models several nodes at
a time in a stagewise fashion, and also by the automatic selection of the
penalty parameter which happens every time a model is fit. The Profiler
intervals in the legacy platform did not have a valid statistical justification
and were at best highly misleading. They should not have been added to
JMP in the first place, and the Profiler in the new platform does not have
confidence intervals.
model with unspecified structure and model parameters be f (x). The sum
of squared errors of the data given prediction formula, f , is
$$\mathrm{SSE} = \sum_{i=1}^{n} \left(y_i - f(x_i)\right)^2,$$
$$L_{\mathrm{Multinomial}} = \sum_{j=1}^{k-1} I(y_i = j)\,\theta_j + \log\left(1 + \sum_{j=1}^{k-1} e^{\theta_j}\right),$$
where I(yi = j) is the indicator function of the event that yi = j. The neural
model is such that each θj is a linear combination of the uppermost hidden
layer nodes and the set of parameters that correspond to θj plus an intercept
type parameter. In this way, the prediction formula of the probability that
yi = j, fj , is
$$f_j = \frac{e^{\theta_j}}{1 + \sum_{j'=1}^{k-1} e^{\theta_{j'}}}$$
for j < k, and
$$f_k = \frac{1}{1 + \sum_{j'=1}^{k-1} e^{\theta_{j'}}}.$$
For testing purposes, one can find the θ parameters from the prediction
probabilities using the relation, θj = log(fj) − log(fk). For the r2 statistics for
categorical responses, the null model is the one where fj (xi ) = p̂j for all i,
where p̂j is the sample proportion of observations where yi = j.
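To make the mapping concrete, here is a minimal sketch of the probability formulas and the θ-recovery relation above; the function names are illustrative, not part of JMP:

```python
import math

def class_probs(theta):
    """Map k-1 linear predictors theta_1..theta_{k-1} to k class
    probabilities, with level k as the reference level."""
    denom = 1.0 + sum(math.exp(t) for t in theta)
    probs = [math.exp(t) / denom for t in theta]
    probs.append(1.0 / denom)  # f_k for the reference level
    return probs

def recover_theta(probs):
    """Invert the mapping using theta_j = log(f_j) - log(f_k)."""
    fk = probs[-1]
    return [math.log(f) - math.log(fk) for f in probs[:-1]]
```

The inverse is useful exactly as the text describes: it lets one check the θ parameters against the predicted probabilities.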
In the case of multiple responses, the overall likelihood, LTotal, is the
sum of the loglikelihoods across all the individual responses. Rows that
have missing values for some, but not all, of the responses contribute the
loglikelihoods of their non-missing responses, rather than having the entire
row dropped.
The legacy platform was only able to handle a single categorical response
or multiple continuous responses, and could not handle mixed continuous and
categorical responses. When there were multiple continuous responses the
loss function employed was a weighted sum of the sums of squared errors
for the individual components. The weights were the reciprocal variances of
the responses. This led to a non-scale invariant estimation approach that
has the potential to lead to certain anomalies. For example, responses that
have low variance and are independent of the inputs will have more weight
than responses that have a large variance as a result of a strong relationship
with the input variables. The risk would be that the low variance response
with no relationship to the input variable would be overfit, while the other
response is underfit due to the weighting scheme.
models were implemented in the legacy platform. The way that each input
variable contributes to the design row depends on whether the independent
variable is continuous or categorical, and various modeling options in the
platform. By default, a continuous variable simply contributes its value to
the design row, in the same way that a continuous variable contributes to
the design row of a main effects linear model.
When the JMP-PRO only Transform Covariates option is enabled, a preprocessing
step will happen where a Johnson Su or Johnson Sb distribution
will be fit to each of the continuous input variables. The continuous
variable's design row contribution will then be the corresponding Johnson
transformation to a normal distribution. This preprocessing step uses maximum
likelihood, but for speed purposes only 10 iterations of Newton's method
are done. This means that the resulting transformation may differ some-
what from Johnson transformations done in the Distribution platform. The
purpose of the Transform Covariates feature is to transform the inputs to
approximate normality as a way to mitigate the impact of input variables
with outliers and heavily skewed distributions.
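A rough sketch of this preprocessing idea, using SciPy's generic maximum likelihood fit as a stand-in for the platform's internal 10-iteration Newton fit (the data and variable names are hypothetical, and the result will not match JMP's fit exactly):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=0.8, size=500)  # a skewed input variable

# Fit a Johnson Su distribution by maximum likelihood (SciPy's generic
# fit is a stand-in for the platform's internal Newton iterations).
a, b, loc, scale = stats.johnsonsu.fit(x)

# The fitted Johnson Su transform to approximate normality:
#   z = a + b * asinh((x - loc) / scale)
z = a + b * np.arcsinh((x - loc) / scale)
```

After the transform, z should be roughly standard normal, which is what blunts the influence of outliers and heavy skew in the inputs.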
When the JMP-PRO only Missing Value Parameterization is turned on
the continuous variables with missing cells will contribute two elements to the
design row, rather than just one. When the variable is not missing the first
element will simply be the value of the variable (or its Johnson transformed
value when the Transform Covariates option is enabled), and the second
element will be zero. When the variable is missing the first value will just be
the imputed mean of the variable and the second element will be one. So,
the second element is a zero-one valued indicator of whether the variable is
missing, and the first is just its numeric, possibly transformed, value or an
imputed mean. This approach is a simple-minded way to make use of as
much of the data as possible and build models that can provide reasonable
predictions even in the presence of incomplete data.
Categorical variables contribute to the design row in the same way that
they do in main effects models elsewhere in JMP, using what is referred to
as the effect parameterization. The Missing Value Parameterization simply
treats missing values of categorical input variables as if missing values were
an additional level of the variable.
For example, suppose that X1 and X2 are two input variables where X1
is continuous and X2 is categorical with three levels, A, B, and C. With-
out the missingness parameterization turned on, the design row for (x, A)
is [x, 1, 0, 1], where x is some non-missing value of X1 . The first element
corresponds to X1 ’s contribution, the next two elements correspond to the
contribution of X2 and the last element is the intercept term. Similarly, (x, B)
maps to a design row equal to [x, 0, 1, 1], and (x, C) maps to [x, −1, −1, 1].
Now with the missingness parameterization turned on there are two elements
that correspond to X1 and three for X2, because a missing value for X2
is now the last category. Now (., A) maps to [X̄1, 1, 1, 0, 0, 1], (x, C) maps to
[x, 0, 0, 0, 1, 1], and (., .) leads to a design row equal to [X̄1 , 1, −1, −1, −1, 1].
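The design-row rules in this example can be sketched as follows; `XBAR1`, `LEVELS`, and `design_row` are illustrative names, not part of JMP:

```python
XBAR1 = 10.0              # imputed mean of X1 (hypothetical value)
LEVELS = ["A", "B", "C"]  # observed levels of X2

def design_row(x1, x2, missing_param=True):
    """Build the design row for (x1, x2). With the Missing Value
    Parameterization, X1 contributes (value-or-mean, missing indicator)
    and missing acts as a fourth, last level of X2 under effect coding."""
    row = []
    if missing_param:
        row += [XBAR1, 1.0] if x1 is None else [x1, 0.0]
        levels = LEVELS + [None]          # missing is the last level
    else:
        row += [x1]
        levels = LEVELS
    # effect coding: last level -> all -1, else one-hot over first k-1
    if x2 == levels[-1]:
        row += [-1.0] * (len(levels) - 1)
    else:
        row += [1.0 if x2 == lv else 0.0 for lv in levels[:-1]]
    row += [1.0]                          # intercept term
    return row
```

Running this reproduces the design rows listed in the example above.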
During the numerical optimization of the model parameters one addi-
tional step occurs in the creation of the design rows. All the elements of the
design row other than the intercept are centered and scaled by their mean
and standard deviation. This is so that the penalty is applied to a scaled and
centered version of the parameters, which leads to a better fit and is
important for improving the optimization of the model. This centering and scaling
operation happens behind the scenes and is completely transparent to the
user.
less of whether the BFGS algorithm has converged in the traditional sense.
The parameter vector whose crossvalidation likelihood was the best over the
course of the BFGS iterations is the one that is kept. This is commonly re-
ferred to as the early stopping rule, and leads both to faster performance and
to models with improved predictive capacity. When, as in the legacy platform,
the model is fit until convergence on the training set, the model
will often have overfit the training data. This tendency is mitigated to a great
extent by retaining the parameter vector from the iteration that crossvalidated
best. This typically happens within the first 20-50 BFGS iterations, whereas
convergence on the training set may take hundreds or thousands of BFGS
iterations.
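The early stopping rule described above can be sketched as a generic loop; `step` and `val_loglik` are hypothetical stand-ins for the platform's BFGS iteration and the holdback likelihood:

```python
def fit_with_early_stopping(step, val_loglik, beta0, max_iter=100):
    """Iterate an optimizer but keep the parameter value with the best
    validation loglikelihood seen over the iterations (early stopping)."""
    beta = beta0
    best_beta, best_ll = beta, val_loglik(beta)
    for _ in range(max_iter):
        beta = step(beta)      # one optimizer iteration on the training set
        ll = val_loglik(beta)  # score on the holdback/validation set
        if ll > best_ll:       # retain the best crossvalidating iterate
            best_beta, best_ll = beta, ll
    return best_beta
```

The point is that the returned parameters need not be the final iterate: when the training objective keeps improving past the point where the validation likelihood peaks, the peak is what is kept.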
When K-fold crossvalidation is used, the procedure described is applied
to each of the K training-holdback partitions of the data for each trial value
of the penalty parameter. After the model has been fit to all the folds,
the validation set likelihood is summed across the folds using each of the
individual fold’s parameter estimates. This is the value that is optimized in
the selection of the penalty parameter. The fold whose parameter estimates
give the best value on the complete dataset is the one whose parameter
estimates are kept and used by the platform. The training-holdback pair that
led to the best model is the pair whose model diagnostics are given
in the report.
the penalty parameter proceeds, the values of the preceding fit can be used
as starting values for the next optimization step, thereby speeding up the
iterations. Another advantage is that once further increasing or decreasing
of the penalty parameter is no longer improving the model’s crossvalidation
likelihood, the algorithm can safely terminate, which also makes the fitting
process faster than the Sequence of Fits option in the legacy platform.
In JMP-PRO there are a number of new penalty function options, while
in previous versions and in the standard edition of JMP9 the only option was
the sum of squared (centered and scaled) parameters penalty. In addition to
the squared penalty, JMP-PRO offers the absolute value penalty, the weight
decay penalty, β 2 /(1 + β 2 ), and an option for no penalty. The penalty is
applied to the subset of parameters that are neither intercept-type parameters
nor parameters that lead out from the uppermost hidden layer to the
predicted values.
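As a sketch, the three penalty functions named above, each applied only to the non-intercept, non-output-layer parameters, might be written as:

```python
def squared_penalty(b):
    """Sum of squared (centered and scaled) parameters."""
    return sum(bi ** 2 for bi in b)

def absolute_penalty(b):
    """Sum of absolute values of the parameters."""
    return sum(abs(bi) for bi in b)

def weight_decay_penalty(b):
    """Weight decay penalty b^2 / (1 + b^2), bounded per parameter."""
    return sum(bi ** 2 / (1.0 + bi ** 2) for bi in b)
```

The weight decay form is bounded above by one per parameter, so very large parameters are penalized no more than moderately large ones, unlike the squared penalty.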
3.2 Boosting
In JMP-PRO an efficient and automatic way to fit large neural models is
through the use of boosting. Boosting is reviewed in Hastie, Tibshirani, and
Friedman, with emphasis on boosting of tree models. In the Neural platform,
boosting proceeds by selecting a base learner, a number of boosting iterations,
and a learning rate whose value should be in the interval (0, 1]. The base
learner is a fairly simple neural model, such as a single layer model with two
tanh units. The base model is repeatedly fit to the data. If the response
is continuous, after each iteration the predicted values of the base model
multiplied by the learning rate is subtracted from the response, and at the
next iteration the base learner is fit to this modified version of the response. If
the learning rate were 1.0, then the boosting algorithm would be equivalent
to recursively fitting a base model to residuals. The boosting iterations
proceed until either the maximum number of boosting iterations is attained,
or until for one iteration neither the training nor the validation likelihood can
be further improved. The last base learner fit by the boosting algorithm that
improved the validation and training likelihoods is not scaled by the learning
rate. If the response is categorical, then the boosting model is additive on
the log odds scale, rather than on the probability scale as in traditional gradient
boosting. This is done so that the overall boosted model can be expressed
simply as a large single neural model. The boosting approach is intended to
supplant the part of the Sequence of Fits option in earlier versions of JMP
that allowed the user to specify a range of numbers of nodes in the hidden
layer that would be fit. The approach outlined above should be both faster
and lead to models with better predictive performance than the models fit
using the legacy platform.
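A simplified sketch of the continuous-response boosting recursion described above; it omits the early stopping of the boosting iterations and the unscaled final base learner, and `fit_base` is a hypothetical stand-in for fitting the small neural base model:

```python
import numpy as np

def boost(x, y, fit_base, n_iter=20, rate=0.1):
    """Stagewise boosting for a continuous response: repeatedly fit a
    base learner to the current residuals, scale its predictions by the
    learning rate, and subtract them from the response. `fit_base`
    returns a prediction function fit to (x, residuals)."""
    models, resid = [], y.astype(float).copy()
    for _ in range(n_iter):
        f = fit_base(x, resid)
        resid = resid - rate * f(x)
        models.append(f)
    # overall prediction: learning-rate-weighted sum of the base learners
    return lambda xnew: rate * sum(f(xnew) for f in models)
```

With `rate=1.0` this reduces to recursively fitting the base model to residuals, as the text notes; smaller rates shrink each stage's contribution.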
4. Crossvalidation Statistics
For each of the training, validation, and test sets a variety of diagnostic
measures are provided. These diagnostic measures differ depending on whether
the response is categorical or continuous. In the continuous response case,
the crossvalidation
statistics include a generalized r2 , an entropy r2 , the (negative) loglikelihood,
the root mean squared error (RMSE), and mean absolute deviation. The r2
measures use a comparison to a null loglikelihood that we will refer to as LNull.
Note that the parameter estimates used in LNull are always computed over the
subset of the data in question, and this applies regardless of whether the
response is continuous or categorical, or any of the other platform options.
For example, in the validation set crossvalidation statistics report, the mean
and variance estimates in LNull are computed over the validation set, rather
than the training set. Note that this is a change from the legacy platform
which always used the estimates from the training set for the base model in
the crossvalidation statistics. When the Robust Fit option is not turned on,
LNull is the loglikelihood of the simple mean and variance normal distribution
model. The parameter estimates are the sample mean and the sample variance
(using the 1/n rather than the 1/(n − 1) definition). When the Robust Fit
option is enabled, LNull is a Laplacian loglikelihood using the sample median
as the location parameter and the mean absolute deviation as the scale
parameter. The generalized and entropy r2 measures are computed using
the following formulas,
$$r^2_{\mathrm{Generalized}} = 1 - e^{\frac{2}{n}\left(L(\hat{\beta}) - L_{\mathrm{Null}}\right)}$$

$$r^2_{\mathrm{Entropy}} = 1 - \frac{L_{\hat{\beta}}}{L_{\mathrm{Null}}},$$
where n is the size of the dataset in question (e.g. training, validation,
or test set), and Lβ̂ is the (negative) loglikelihood of the set in question
computed using the model parameters fit on the training data. Because of
the way early stopping and penalty functions are applied, degree of freedom
adjustments are not a natural fit for these models. As a result, the Neural
platform does not compute adjusted r2 measures. The generalized r2 reduces
to the standard definition of r2 in the case of the normal distribution, which
is employed when the Robust Fit option is not turned on. The RMSE and
mean absolute deviation are computed in the natural way for continuous
responses,
$$RMSE = \left(\frac{SSE}{n}\right)^{\frac{1}{2}}, \qquad MAD = \frac{SAD}{n}.$$
When the response is categorical, the misclassification rate is given in
addition to the model fit measures described above for continuous responses.
The misclassification rate is the proportion of times the actual level of the
response is not the one assigned the highest probability by the neural model.
The generalized r2 is scaled differently,
$$r^2_{\mathrm{Generalized}} = \frac{1 - e^{\frac{2}{n}\left(L(\hat{\beta}) - L_{\mathrm{Null}}\right)}}{1 - e^{-\frac{2}{n} L_{\mathrm{Null}}}}.$$
This is done so that a perfect fit will lead to a generalized r2 that is equal to
one, which would not be the case without the scaling. The definitions of the
RMSE and mean absolute deviation are
$$RMSE = \left(\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n_{\mathrm{Levels}}} I(y_i = j)\,(1 - f_j(x_i))^2\right)^{\frac{1}{2}}$$

$$MAD = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n_{\mathrm{Levels}}} I(y_i = j)\,|1 - f_j(x_i)|.$$
The confusion matrices record, for each actual level of the response, the
number of times the model predicts each of the levels. The
actual observations equal to each level are along the vertical axis, and the
predicted levels are along the horizontal axis. The confusion rate matrix is
equal to the confusion matrix with the rows divided by their row totals.
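A minimal sketch of the confusion matrix and confusion rate matrix construction described above (names are illustrative):

```python
import numpy as np

def confusion_matrices(actual, predicted, levels):
    """Confusion matrix with actual levels on the rows and predicted
    levels on the columns, and the confusion rate matrix, whose rows
    are the counts divided by their row totals."""
    k = len(levels)
    idx = {lv: i for i, lv in enumerate(levels)}
    counts = np.zeros((k, k))
    for a, p in zip(actual, predicted):
        counts[idx[a], idx[p]] += 1
    rates = counts / counts.sum(axis=1, keepdims=True)
    return counts, rates
```

Each row of the rate matrix sums to one, giving the conditional distribution of predictions for each actual level.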