Interpretability of Hinging Hyperplanes
The hinging hyperplane model was proposed by Breiman [20]. This type of nonlinear
model is often referenced in the literature since it suffers from convergence and range
problems [19, 33–35]. Methods such as a penalty on the hinging angle were proposed
to improve Breiman’s algorithm [18]; alternatively, the Gauss-Newton algorithm can
be used to obtain the final nonlinear model [34]. Several application examples have
also been published in the literature; e.g., it can be used in the identification of
piecewise affine systems via mixed-integer programming [36], and this model also
lends itself to forming hierarchical models [19].
In this chapter a much more applicable algorithm is proposed for hinging hyper-
plane identification. The key idea is that in a special case (c = 2), the fuzzy c-
regression method (FCRM) [37] can be used for identifying hinging hyperplane
models. To ensure that two local linear models used by the fuzzy c-regression algo-
rithm form a hinging hyperplane function, it has to be guaranteed that local models
intersect each other in the operating regime of the model. The proposed constrained
FCRM algorithm is able to identify one hinging hyperplane model; therefore, to
generate more complex regression trees, the described method should be recursively
applied. Hinging hyperplane models containing two linear submodels divide the
operating region of the model into two parts, since hinging hyperplane functions
define a linear separating function in the input space of the hinging hyperplane func-
tion. These separations result in a regression tree where branches correspond to
linear divisions of the operating regime based on the hinge of the hyperplanes at
a given node. This type of partitioning can be considered as the crisp version of a
fuzzy regression-based tree described in [38]. Fortunately, in the case of a hinging
hyperplane-based regression tree there is no need to select the best splitting variable
at a given node, but, on the other hand, it is not as interpretable as regression trees
utilizing univariate decisions at nodes.
To support the analysis and building of this special model structure, novel model
performance and complexity measures are presented in this work. Special attention
is given to modeling and controlling nonlinear dynamical systems. Therefore, an
application example related to the Box-Jenkins gas furnace benchmark identification
problem is added. It will also be shown that, thanks to the piecewise linear model
structure, the resulting regression tree can be easily utilized in model predictive
control. A detailed application example related to the model predictive control of a
water heater will demonstrate the benefits of the proposed framework.
A critical step in the application of model-based control is the development of a
suitable model for the process dynamics. This difficulty stems from a lack of knowledge
or understanding of the process to be controlled. Fuzzy modeling has been proven
to be effective for the approximation of uncertain nonlinear processes. Recently,
nonlinear black-box techniques using fuzzy and neuro-fuzzy modeling have received
a great deal of attention [39]. Readers interested in industrial applications can find
an excellent overview in [40]. Details of relevant model-based control applications
are well presented in [41, 42].
Most nonlinear identification methods are based on the NARX (Nonlinear AutoRegressive with eXogenous input) model [8]. The use of NARX black-box models for high-order dynamic processes is in some cases impractical. Data-driven identification techniques alone may yield unrealistic NARX models in terms of steady-state characteristics and local behavior, with unreliable parameter values. Moreover, the identified model can exhibit regimes which are not found in the original system [42].
This is typically due to insufficient information content of the identification data and
the overparametrization of the model. This problem can be remedied by incorporat-
ing prior knowledge into the identification method by constraining the parameters
of the model [43]. Another way to reduce the effects of overparametrization is to
restrict the structure of the NARX model, using, for instance, the Nonlinear Additive
AutoRegressive with eXogenous input (NAARX) model [44]. In this book a different approach is proposed: a hierarchical set of local linear models is identified to handle complex system dynamics.
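As a point of reference for the dynamic models discussed below, a NARX regressor matrix can be assembled from measured input/output data as in the following sketch. This is a generic illustration with assumed variable names, not code from the book.

```python
import numpy as np

def narx_regressors(y, u, na, nb):
    """Build regressors [1, y(k-1..k-na), u(k-1..k-nb)] and targets y(k)."""
    start = max(na, nb)
    X = np.array([[1.0]
                  + [y[k - i] for i in range(1, na + 1)]
                  + [u[k - j] for j in range(1, nb + 1)]
                  for k in range(start, len(y))])
    return X, np.asarray(y[start:])

# a linear ARX baseline is then a single least squares fit:
# X, t = narx_regressors(y, u, na=2, nb=2)
# theta, *_ = np.linalg.lstsq(X, t, rcond=None)
```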
Operating regime-based modeling is a widely applied technique for the identification of such nonlinear systems. There are two approaches to building operating regime-based models. An additive model uses the sum of certain basis functions to represent a nonlinear system, while the partitioning approach partitions the input space recursively to increase modeling accuracy locally [18]. Models generated by this approach are
often represented by trees [45]. Piecewise linear systems [46] can be easily repre-
sented in a regression tree structure [47]. A special type of regression tree is called
the locally linear model tree, which combines a heuristic strategy for input space
decomposition with a local linear least squares optimization (like LOLIMOT [1]).
These models are hierarchical models consisting of nodes and branches. Internal
nodes represent tests on input variables of the model, and branches correspond to
outcomes of the tests. In the case of regression trees, leaf (terminal) nodes contain regression models.
Thanks to the structured representation of the local linear models, hinging hyper-
planes lend themselves to a straightforward incorporation into model-based control
schemes. In this chapter this beneficial property is demonstrated in the design of an
instantaneous linearization-based model predictive control algorithm [32].
This chapter is organized as follows: the next section discusses how hinging hyperplane function approximation is done with the FCRM identification approach. The
description of the tree growing algorithm and the measures proposed to support model
building are given in Sect. 2.2. In Sect. 2.3, application examples are presented, while
Sect. 2.4 concludes the chapter.
The following section gives a brief description of the hinging hyperplane approach
on the basis of [18, 34, 48], followed by a description of how the constraints can be
incorporated into FCRM clustering.
For a sufficiently smooth function f (xk ), which can be linear or nonlinear,
assume that regression data {xk , yk } is available for k = 1, . . . , N . Function
f (xk ) can be represented as the sum of a series of hinging hyperplane functions
h i (xk ) i = 1, 2, . . . , K . Breiman [20] proved that we can use hinging hyperplanes to
approximate continuous functions on compact sets, guaranteeing a bounded approx-
imation error
\[
e_n = \Big\| f - \sum_{i=1}^{K} h_i(\mathbf{x}) \Big\| \le (2R)^4\, c^2 / K, \tag{2.1}
\]
where K is the number of hinging hyperplane functions, R is the radius of the sphere
in which the compact set is contained, and c is such that
\[
\int \|\mathbf{w}\|^{2}\, | f(\mathbf{w}) |\, d\mathbf{w} = c < \infty. \tag{2.2}
\]
The approximation with hinging hyperplane functions can get arbitrarily close if
a sufficiently large number of hinging hyperplane functions are used. The sum of
the hinging hyperplane functions, $\sum_{i=1}^{K} h_i(\mathbf{x}_k)$, constitutes a continuous piecewise
linear function. The number of input variables n in each hinging hyperplane function
and the number of hinging hyperplane functions K are two variables to be determined.
The explicit form for representing a function f (xk ) with hinging hyperplane functions
becomes (see Fig. 2.1)
\[
f(\mathbf{x}_k) = \sum_{i=1}^{K} h_i(\mathbf{x}_k) = \sum_{i=1}^{K} \max|\min \left\{ \mathbf{x}_k^T \theta_{1,i},\; \mathbf{x}_k^T \theta_{2,i} \right\}, \tag{2.3}
\]
where $\mathbf{x}_k = [x_{k,0}, x_{k,1}, x_{k,2}, \ldots, x_{k,n}]^T$, $x_{k,0} \equiv 1$, is the kth regressor vector and $y_k$ is the kth output variable. The two hyperplanes are continuously joined together at $\{\mathbf{x} : \mathbf{x}^T(\theta_1 - \theta_2) = 0\}$, as can be seen in Fig. 2.1; as a result, they are called hinging hyperplanes. The difference $\theta_1 - \theta_2$ is defined as the hinge of the two hyperplanes $y_k = \mathbf{x}_k^T\theta_1$ and $y_k = \mathbf{x}_k^T\theta_2$. The solid/shaded parts of the two hyperplanes are given explicitly by Eqs. (2.4) and (2.5).
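To make the max|min notation in (2.3) concrete, the sketch below (illustrative only; the names are not from the book) evaluates a single hinging hyperplane function for both choices. During identification, the variant that fits the training data better is the one retained.

```python
import numpy as np

def hinge(X, theta1, theta2, use_max=True):
    """Evaluate one hinging hyperplane function h(x) = max|min(x^T theta1, x^T theta2).

    X      : (N, n+1) regressor matrix whose first column is all ones (x_{k,0} = 1)
    theta1 : (n+1,) parameter vector of the first hyperplane
    theta2 : (n+1,) parameter vector of the second hyperplane
    use_max: True selects the max of the two planes, False the min
    """
    p1, p2 = X @ theta1, X @ theta2
    return np.maximum(p1, p2) if use_max else np.minimum(p1, p2)
```

The hinge itself is the set where the two planes meet, i.e., where X @ (theta1 - theta2) changes sign.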
The hinging hyperplane method has some interesting advantages for nonlinear func-
tion approximation and identification:
1. Hinging hyperplane functions can be located by a simple, computationally efficient method. In fact, hinging hyperplane models are piecewise linear models;
the linear models are usually obtained by repeated use of the linear least squares
method, which is very efficient. The aim is to improve the whole identification
method with more sophisticated ideas.
2. For nonlinear functions that resemble hinging hyperplane functions, the hinging
hyperplane method has very good and fast convergence properties.
The hinging hyperplane method practically combines some advantages of neural
networks (in particular, the ability to handle very large dimensional inputs) and
constructive wavelet-based estimators (availability of very fast training algorithms).
The essential hinging hyperplane search problem can be viewed as an extension
of the linear least squares regression problem. Linear least squares regression aims
to find the best parameter vector θ by minimizing a quadratic cost function with the regression model that gives the best linear approximation to y. For a nonsingular data matrix $\mathbf{X}$, the linear least squares estimate $\hat{y} = \mathbf{x}^T\theta$ is always uniquely available.
The hinging hyperplane search problem, on the other hand, aims to find the two
parameter vectors θ1 and θ2 , defined by
\[
[\theta_1, \theta_2] = \arg\min_{\theta_1, \theta_2} \sum_{k=1}^{N} \left( \max|\min \left\{ y_k - \mathbf{x}_k^T\theta_1,\; y_k - \mathbf{x}_k^T\theta_2 \right\} \right)^2. \tag{2.6}
\]
A brute force application of the Gauss-Newton method can solve the above-described optimization problem; however, two problems exist [18].
In this work, the fuzzy c-regression method is used instead: the parameters of the c regression models are obtained by minimizing the weighted prediction error
\[
E_m(U, \{\theta_i\}) = \sum_{i=1}^{c} \sum_{k=1}^{N} (\mu_{i,k})^m E_{i,k}(\theta_i), \tag{2.7}
\]
where the error measure is usually chosen as
\[
E_{i,k}(\theta_i) = \left( y_k - f_i(\mathbf{x}_k; \theta_i) \right)^2, \tag{2.8}
\]
but other measures can be applied as well, provided they fulfill the minimizer property
stated by Hathaway and Bezdek [37].
One possible approach to the minimization of the objective function (2.7) is the
group coordinate minimization method that results in the following algorithm:
• Initialization Given a set of data {(x1 , y1 ), . . . , (x N , y N )}, specify c, the structure
of the regression models (2.8) and the error measure (2.7). Choose a weighting
exponent m > 1 and a termination tolerance ε > 0. Initialize the partition matrix
randomly.
• Repeat For l = 1, 2, . . .
• Step 1 Calculate values for the model parameters θi that minimize the cost function
E m (U, {θi }).
• Step 2 Update the partition matrix
\[
\mu_{i,k}^{(l)} = \frac{1}{\sum_{j=1}^{c} \left( E_{i,k}/E_{j,k} \right)^{2/(m-1)}}, \quad 1 \le i \le c, \; 1 \le k \le N, \tag{2.9}
\]
until $\|U^{(l)} - U^{(l-1)}\| < \varepsilon$.
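The alternating optimization above can be sketched as follows. This is a minimal, unconstrained illustration with linear local models (the function and variable names are assumptions, not the book's code); the hinge constraints derived below would replace the plain weighted least squares in Step 1.

```python
import numpy as np

def fcrm(X, y, c=2, m=2.0, tol=1e-6, max_iter=100, seed=None):
    """Fuzzy c-regression clustering with linear local models y ~ X @ theta_i."""
    rng = np.random.default_rng(seed)
    N = len(y)
    U = rng.random((c, N))
    U /= U.sum(axis=0)                       # random initial fuzzy partition
    thetas = np.zeros((c, X.shape[1]))
    for _ in range(max_iter):
        U_old = U.copy()
        # Step 1: weighted least squares for each local model, minimizing (2.7)
        for i in range(c):
            W = np.diag(U[i] ** m)
            thetas[i] = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
        # Step 2: membership update from the prediction errors, Eq. (2.9)
        E = np.array([(y - X @ th) ** 2 for th in thetas]) + 1e-12
        U = 1.0 / ((E[:, None, :] / E[None, :, :]) ** (2.0 / (m - 1.0))).sum(axis=1)
        if np.abs(U - U_old).max() < tol:    # termination tolerance epsilon
            break
    return U, thetas
```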
(Figure 2.2: the two hinging hyperplanes, the hinge, and the cluster centers v1 and v2)
To guarantee that the two local linear models form a hinging hyperplane function, constraints have to be taken into consideration as follows. Cluster centers $v_i$ can also be computed from the result of FCRM as the weighted average of the known input data points:
\[
v_i = \frac{\sum_{k=1}^{N} \mathbf{x}_k\, \mu_{i,k}}{\sum_{k=1}^{N} \mu_{i,k}}, \tag{2.12}
\]
where the membership degree μi,k is interpreted as a weight representing the extent
to which the value predicted by the model matches yk . These cluster centers are
located in the ‘middle’ of the operating regime of the two linear submodels. Because
the two hyperplanes must cross each other, the following criterion can be specified (see Fig. 2.2): the hinge $\mathbf{x}^T(\theta_1 - \theta_2)$ has to change sign between the two cluster centers $v_1$ and $v_2$.
The following relative constraint can be used to take this criterion into account:
\[
\lambda_{rel,1,2} \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} \le 0, \quad \text{where} \quad \lambda_{rel,1,2} = \begin{bmatrix} v_1^T & -v_1^T \\ -v_2^T & v_2^T \end{bmatrix}. \tag{2.14}
\]
When linear equality and inequality constraints are defined on these prototypes,
quadratic programming (QP) has to be used instead of the least squares method. This
optimization problem can still be solved efficiently compared with other constrained nonlinear optimization algorithms.
Local linear constraints applied to fuzzy models can be grouped into the following
categories according to their validity region:
• Local constraints are valid only for the parameters of a single regression model, $\lambda_i \theta_i \le \omega_i$.
• Global constraints are related to all of the regression models, $\lambda_{gl} \theta_i \le \omega_{gl}$, $i = 1, \ldots, c$.
• Relative constraints define relationships between the parameters of different regression models, such as the hinge constraint (2.14), $\lambda_{rel}\,\theta \le \omega_{rel}$.
Fig. 2.3 Hinging hyperplane model with four local constraints and two parameters (local, global, and relative constraints shown in the $\theta_{i,1}$–$\theta_{i,2}$ parameter plane)
The constrained parameter estimation can then be formulated as the quadratic programming problem $\min_{\theta}\, \tfrac{1}{2}\theta^T \mathbf{H} \theta + \mathbf{c}^T \theta$ subject to the linear constraints, with $\mathbf{H} = 2\mathbf{X}^T \Phi \mathbf{X}$ and $\mathbf{c} = -2\mathbf{X}^T \Phi \mathbf{y}$, where
\[
\mathbf{y} = \begin{bmatrix} \mathbf{y} \\ \mathbf{y} \\ \vdots \\ \mathbf{y} \end{bmatrix}, \qquad
\theta = \begin{bmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_c \end{bmatrix}, \tag{2.17}
\]
\[
\mathbf{X} = \begin{bmatrix}
\mathbf{X}_1 & 0 & \cdots & 0 \\
0 & \mathbf{X}_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \mathbf{X}_c
\end{bmatrix}, \qquad
\Phi = \begin{bmatrix}
\Phi_1 & 0 & \cdots & 0 \\
0 & \Phi_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \Phi_c
\end{bmatrix}, \tag{2.18}
\]
The local, global, and relative constraints can be collected into a single system of linear inequalities
\[
\lambda\, \theta \le \omega, \tag{2.19}
\]
with
\[
\lambda = \begin{bmatrix}
\lambda_1 & 0 & \cdots & 0 \\
0 & \lambda_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \lambda_c \\
\lambda_{gl} & 0 & \cdots & 0 \\
0 & \lambda_{gl} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \lambda_{gl} \\
 & \{\lambda_{rel}\} & &
\end{bmatrix}, \qquad
\omega = \begin{bmatrix}
\omega_1 \\ \omega_2 \\ \vdots \\ \omega_c \\ \omega_{gl} \\ \omega_{gl} \\ \vdots \\ \omega_{gl} \\ \{\omega_{rel}\}
\end{bmatrix}. \tag{2.20}
\]
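A minimal sketch of how Step 1 of the clustering can be replaced by this constrained quadratic program for c = 2 is given below. It is illustrative only; the use of scipy's SLSQP solver and all names are assumptions, not the book's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def constrained_step1(X, y, U, v, m=2.0):
    """Weighted least squares for two local models subject to the relative
    (hinge) constraint lambda_rel [theta1; theta2] <= 0, cf. (2.14)-(2.20)."""
    n = X.shape[1]
    Xb = np.block([[X, np.zeros_like(X)],
                   [np.zeros_like(X), X]])            # block-diagonal X, Eq. (2.18)
    Phi = np.diag(np.concatenate([U[0] ** m, U[1] ** m]))
    yb = np.concatenate([y, y])
    H = 2.0 * Xb.T @ Phi @ Xb
    c = -2.0 * Xb.T @ Phi @ yb
    lam = np.vstack([np.concatenate([ v[0], -v[0]]),  # relative constraints, Eq. (2.14)
                     np.concatenate([-v[1],  v[1]])])
    cons = {"type": "ineq", "fun": lambda th: -lam @ th}   # i.e. lam @ th <= 0
    res = minimize(lambda th: 0.5 * th @ H @ th + c @ th,
                   np.zeros(2 * n),
                   jac=lambda th: H @ th + c,
                   constraints=[cons], method="SLSQP")
    return res.x[:n], res.x[n:]        # theta1, theta2
```

Here v is the pair of cluster centers computed from (2.12).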
Referring back to Fig. 2.1, it can be concluded that with this method both parts of the intersecting hyperplanes are described, and the part (max or min) that describes the training data most accurately is selected.
So far, the hinging hyperplane function identification method has been presented. The proposed technique can be used to determine the parameters of one hinging hyperplane function. The classical hinging hyperplane approach can be interpreted as identifying K hinging hyperplane models consisting of global model pairs, since their operating regimes cover the whole data set of N samples. This representation leads to several problems during model identification and also makes the model harder to interpret. To overcome this problem, a tree structure is proposed in which the data is recursively partitioned into subsets, and each subset is used to form the models of the lower levels of the tree. The concept is illustrated in Fig. 2.4, where the membership functions and the identified hinging hyperplane models are also shown.
During the identification the following phenomena can be taken into consideration
(and can be considered as benefits too):
• When using the hinging hyperplane function there is no need to find splitting
variables at the nonterminal nodes, since this procedure is based on the hinge.
• The populated tree is always a binary tree, either balanced or unbalanced, depending on the algorithm (greedy or non-greedy). Since the hinge splits the data, the samples on the left side of the hinge (together with θ1) always go to the left child, and the right side behaves similarly. For example, given a simple symmetrical binary tree structure, the first level contains one hinging hyperplane function, the second level contains two hinging hyperplane functions, the third level contains four, and in general the kth level contains 2^{k−1} hinging hyperplane functions.

Fig. 2.4 Hinging hyperplane-based regression tree for the basic data sample in the case of the greedy algorithm (expected and calculated outputs shown together with the left- and right-sided membership functions)
Summarizing the above, the parameters θ obtained during the recursive identification minimize the following cost function:
\[
E(\{\theta_i\}, \pi) = \sum_{i=1}^{K} \pi_i E_{m_i}(\theta_i), \tag{2.21}
\]
where K is the number of hinge functions (nodes) and π is the binary ($\pi_i \in \{0, 1\}$) terminal set, indicating that a given node is a final linear model ($\pi_i = 1$) and can be incorporated as a terminal node of the identified piecewise model.
A growing algorithm can be either balanced or greedy. In the balanced case the identification algorithm builds the tree until the desired stopping criterion is met, while the greedy one continues the tree building by choosing for splitting the node that performs worst during the building procedure; this operating regime needs more local models for better model performance. For a greedy algorithm, the crucial item is the selection of a good stopping criterion. Any of the following can be used to determine whether to continue the tree growing process or stop the procedure (a sketch of the greedy growing loop is given after the list):
1. The loss function becomes zero. This corresponds to the situation where the size of the data set is less than or equal to the dimension of the hinge. Since the hinging hyperplanes are located by linear least squares, when the number of data points does not exceed the number of parameters the data can be fitted exactly.
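Putting the growing strategy and the stopping criteria together, the greedy loop can be sketched as follows (illustrative Python only; fit_hinge is an assumed helper that stands for the constrained FCRM identification of one hinge from Sect. 2.1 and returns the two parameter vectors together with the node RMSE):

```python
import numpy as np

def grow_tree(X, y, fit_hinge, max_nodes=15, rmse_tol=0.05):
    """Greedy hinging hyperplane tree: always split the worst terminal node."""
    root = {"idx": np.arange(len(y))}
    terminals = [root]
    for _ in range(max_nodes):
        for node in terminals:
            if "rmse" not in node:                     # identify newly created nodes
                node["theta1"], node["theta2"], node["rmse"] = \
                    fit_hinge(X[node["idx"]], y[node["idx"]])
        worst_i = max(range(len(terminals)), key=lambda i: terminals[i]["rmse"])
        worst = terminals[worst_i]
        if worst["rmse"] < rmse_tol:                   # stopping criterion reached
            break
        # the hinge x^T(theta1 - theta2) = 0 splits the node's data in two
        side = X[worst["idx"]] @ (worst["theta1"] - worst["theta2"]) <= 0
        if side.all() or not side.any():               # degenerate split: stop here
            break
        left, right = {"idx": worst["idx"][side]}, {"idx": worst["idx"][~side]}
        worst["children"] = (left, right)
        terminals[worst_i:worst_i + 1] = [left, right]  # replace parent by children
    return root, terminals
```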
The well-known regression performance estimators can be used for node perfor-
mance measurement; in this work, root mean squared prediction error (RMSE)
was used:
\[
\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{k=1}^{N} \left( y_k - \hat{y}_k \right)^2}. \tag{2.23}
\]
Fig. 2.6 Node-by-node ρ (condition of the nodes) and RMSE results for non-greedy tree building

Fig. 2.7 Node-by-node ρ and RMSE results for greedy tree building
Accuracy and transparency of the proposed algorithm are shown based on multiple
datasets, two real life and two synthetic ones, followed by examples from the area
of dynamic system identification.
All datasets have been used before; most of them originate from well-known data repositories. Performance of the models is measured by the root mean squared
prediction error (RMSE; see Eq. 2.23).
Real life datasets:
• Abalone Dataset from the UCI machine learning repository, used to predict the age of abalone from physical measurements. Contains 4,177 cases with eight attributes (one nominal and seven continuous).
• Kin8nm Data containing information on the forward kinematics of an eight-link robot arm, from the DELVE repository. Contains 8,192 cases with eight continuous attributes.
Synthetic datasets:
• Fried Artificial dataset used by Friedman [50] containing ten continuous attributes with independent values uniformly distributed in the interval [0, 1]. The value of the output variable is obtained with Friedman's benchmark equation.
Table 2.2 10-fold cross-validation report for the hinging hyperplane-based tree
Data Sample MIN MEAN MAX Standard dev.
Fried Train 0.5822 0.8677 1.2107 0.227
Test 0.6226 0.9208 1.2673 0.2337
3Dsin Train 0.0906 0.1741 0.3162 0.0714
Test 0.0838 0.178 0.342 0.0801
Abalone Train 2.3496 2.6241 2.9256 0.1532
Test 2.3242 2.8803 3.451 0.3445
Kin8nm Train 0.1433 0.1515 0.1595 0.0054
Test 0.1464 0.1579 0.1729 0.0092
The well-known Box-Jenkins furnace data benchmark is used to illustrate the pro-
posed modeling approach and to compare its effectiveness with other methods. The
data set consists of 296 pairs of input-output observations taken from a laboratory
furnace with a sampling time of nine seconds. The process input is the methane flow
rate and the output is the percentage of CO2 in the off-gas. A number of researchers have concluded that a proper structure of a dynamic model for this system uses delayed values of the input and the output. The approximation power of the model can be seen in Fig. 2.8 and Table 2.3. Comparing its results with those of other techniques in [51], one can conclude that the modeling performance is in line with that of the other techniques, with a moderate number of identified hinging hyperplanes.
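For orientation, the identification data can be arranged as in the following sketch. The regressors y(k−1) and u(k−4), frequently used for this benchmark in the literature, are an assumption here; the snippet is not taken from the book.

```python
import numpy as np

def box_jenkins_regressors(y, u, ny=1, nu=4):
    """Regressors [1, y(k-ny), u(k-nu)] and targets y(k) (assumed structure)."""
    start = max(ny, nu)
    X = np.array([[1.0, y[k - ny], u[k - nu]] for k in range(start, len(y))])
    return X, np.asarray(y[start:])

# y: CO2 content in the off-gas, u: methane flow rate (296 samples each)
# X, t = box_jenkins_regressors(y, u)   # feed X, t to the tree identification
```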
So far, a general nonlinear modeling technique was presented and a new iden-
tification approach was given for hinging hyperplane-based nonlinear models:
y = f (x(k), θ ), where f (.) represents the hinging hyperplane-based tree structured
model and x(k) represents the input vector of the model. To identify a discrete-time
input-output model for a dynamical system, the dynamic model structure has to be
\[
y(k) = \sum_{i=1}^{n_a} a_i\, y(k-i) + \sum_{j=1}^{n_b} b_j\, f\left(u(k-j)\right), \tag{2.28}
\]
where $y(\cdot)$ and $u(\cdot)$ are the output and input of the system, respectively, and $n_a$ and $n_b$ are the output and input orders of the model. The parameters of the blocks of
the Hammerstein model (static nonlinearity and linear dynamics) can be identified
by the proposed method simultaneously if the same linear dynamic behavior can be
guaranteed by all of the local hinging hyperplane-based models. It can be done in an
elegant way utilizing the flexibility of the proposed identification approach: global
constraints can be formulated for the ai and b j parameters of the local models (for
a detailed discussion on what constraints have to be formulated, see [32]). In the
following, the hinging hyperplane modeling technique is applied on a Hammerstein
type system. It will be shown why it is an effective tool for the above-mentioned
purpose.
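To illustrate the structure of (2.28), a minimal Hammerstein simulation sketch is given below; static_nl stands for the identified static nonlinearity (e.g., the hinging hyperplane tree) and is an assumed placeholder, not the book's code.

```python
import numpy as np

def simulate_hammerstein(u, a, b, static_nl):
    """Simulate y(k) = sum_i a_i y(k-i) + sum_j b_j f(u(k-j)), cf. Eq. (2.28)."""
    na, nb = len(a), len(b)
    f_u = np.array([static_nl(uk) for uk in u])     # static nonlinearity first
    y = np.zeros(len(u))
    for k in range(max(na, nb), len(u)):
        y[k] = (sum(a[i] * y[k - i - 1] for i in range(na))
                + sum(b[j] * f_u[k - j - 1] for j in range(nb)))
    return y
```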
The modeling of a simulated water heater (Fig. 2.9) is used to illustrate the advantages
of the proposed hinging hyperplane-based models. The water flows through a pair
of metal pipes containing a cartridge heater.
The outlet temperature, Tout , of the water can be varied by adjusting the heating
signal, u, of the cartridge heater (see [32] or Appendix D for details). The performance
of the cartridge heater is given by:
\[
Q(u) = Q_M \left( u - \frac{\sin(2\pi u)}{2\pi} \right), \tag{2.29}
\]
where $Q_M$ is the maximal power and u is the heating signal (voltage). As the equation above shows, the heating performance is a static nonlinear function of the heating signal; hence, the Hammerstein model is a good match for this process. The aim is to construct a dynamic prediction model from data for the output temperature (the dependent variable, $y = T_{out}$) as a function of the control input, the heating signal u.
Fig. 2.10 Free-run simulation of the water heater: test data compared with the proposed hinging hyperplane model, a neural network, and a linear model (temperature [°C] versus simulation time [s])
(Figure: model-based control scheme with setpoint w, controller output u, and process output y)
The model predictive controller computes the control moves by minimizing the following cost function over the prediction horizon $H_p$ and control horizon $H_c$:
\[
J(H_p, H_c, \lambda) = \sum_{j=1}^{H_p} \left( w(k+j) - \hat{y}(k+j) \right)^2 + \lambda \sum_{j=1}^{H_c} \Delta u^2(k+j-1), \tag{2.30}
\]
The predicted outputs are obtained from the model in vector form as
\[
\hat{\mathbf{y}} = \mathbf{S}\,\Delta\bar{\mathbf{u}} + \mathbf{p}, \tag{2.31}
\]
where $\Delta\bar{\mathbf{u}} = [\Delta u(k), \ldots, \Delta u(k+H_c)]^T$, $\mathbf{p} = [p_1, p_2, \ldots, p_{H_p}]^T$, $\hat{\mathbf{y}} = [\hat{y}(k+1), \ldots, \hat{y}(k+H_p)]^T$, and $\mathbf{S}$, containing the parameters of a step-response model, is an $H_p \times H_c$ matrix with zero entries $s_{i,j}$ for $j > i$:
\[
\mathbf{S} = \begin{bmatrix}
s_1 & 0 & \cdots & 0 \\
s_2 & s_1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
s_{H_p} & s_{H_p-1} & \cdots & s_{H_p-H_c}
\end{bmatrix}. \tag{2.32}
\]
When constraints are considered, the minimum of the cost function can be found
by quadratic optimization with linear constraints:
\[
\min_{\Delta\bar{\mathbf{u}}} \left\{ (\mathbf{S}\Delta\bar{\mathbf{u}} + \mathbf{p} - \mathbf{w})^T (\mathbf{S}\Delta\bar{\mathbf{u}} + \mathbf{p} - \mathbf{w}) + \lambda\, \Delta\bar{\mathbf{u}}^T \Delta\bar{\mathbf{u}} \right\}, \tag{2.33}
\]
which is equivalent to
\[
\min_{\Delta\bar{\mathbf{u}}} \left\{ \tfrac{1}{2}\, \Delta\bar{\mathbf{u}}^T \mathbf{H}\, \Delta\bar{\mathbf{u}} + \mathbf{d}^T \Delta\bar{\mathbf{u}} \right\},
\]
with $\mathbf{H} = 2\left(\mathbf{S}^T\mathbf{S} + \lambda \mathbf{I}\right)$, $\mathbf{d} = -2\,\mathbf{S}^T(\mathbf{w} - \mathbf{p})$, where $\mathbf{I}$ is an $(H_c \times H_c)$ identity matrix.
The constraints defined on u and Δu can be formulated with the following inequality:
\[
\begin{bmatrix}
\mathbf{I}_{\Delta\bar{u}} \\ -\mathbf{I}_{\Delta\bar{u}} \\ \mathbf{I}_{H_c} \\ -\mathbf{I}_{H_c}
\end{bmatrix} \Delta\bar{\mathbf{u}} \le
\begin{bmatrix}
\mathbf{u}_{max} - \mathbf{I}_{\bar{u}}\, u(k-1) \\
-\mathbf{u}_{min} + \mathbf{I}_{\bar{u}}\, u(k-1) \\
\Delta\mathbf{u}_{max} \\
-\Delta\mathbf{u}_{min}
\end{bmatrix}, \tag{2.34}
\]
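A receding-horizon step that minimizes (2.33) subject to (2.34) can be sketched as follows. This is an illustrative implementation; the solver choice and all names are assumptions, with S and p obtained from the instantaneous linearization of the hinging hyperplane model.

```python
import numpy as np
from scipy.optimize import minimize

def mpc_step(S, p, w, lam, u_prev, u_min, u_max, du_min, du_max):
    """One constrained GPC step; returns the first control increment."""
    Hc = S.shape[1]
    H = 2.0 * (S.T @ S + lam * np.eye(Hc))
    d = -2.0 * S.T @ (w - p)
    L = np.tril(np.ones((Hc, Hc)))        # cumulative sum: u = u_prev + L @ du
    A = np.vstack([L, -L, np.eye(Hc), -np.eye(Hc)])
    b = np.concatenate([np.full(Hc, u_max - u_prev),
                        np.full(Hc, u_prev - u_min),
                        np.full(Hc, du_max),
                        np.full(Hc, -du_min)])
    cons = {"type": "ineq", "fun": lambda du: b - A @ du}   # A @ du <= b, Eq. (2.34)
    res = minimize(lambda du: 0.5 * du @ H @ du + d @ du,
                   np.zeros(Hc),
                   jac=lambda du: H @ du + d,
                   constraints=[cons], method="SLSQP")
    return res.x[0]    # receding horizon: only the first move is applied
```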
To handle the modeling error, the MPC is applied in the well-known internal model control (IMC) scheme, where the setpoint of the controller is shifted by the filtered modeling error. For this purpose, a first-order linear filter is used.
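The filter equation itself is not reproduced here; a generic first-order error filter of the kind typically used in IMC schemes could look like the following sketch (alpha is an assumed tuning parameter, not a value from the book):

```python
def filter_modeling_error(errors, alpha=0.3):
    """First-order filtering e_f(k) = (1 - alpha) e_f(k-1) + alpha e(k);
    the filtered modeling error e_f shifts the MPC setpoint in the IMC scheme."""
    e_f, filtered = 0.0, []
    for e in errors:                     # e(k) = y(k) - y_hat(k)
        e_f = (1.0 - alpha) * e_f + alpha * e
        filtered.append(e_f)
    return filtered
```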
Table 2.5 Simulation results (SSE, sum squared tracking error; CE, sum square of the control
actions)
The applied model in GPC SSE CE
Linear model 1085 1.61
Neural network model 956 1.39
Hinging hyperplane model 966 0.58
Notice also that the oscillatory behavior of the neural network model-based MPC
is due to the bad prediction of the steady-state gain of the system around the middle
region. However, as can be seen from Table 2.5, both nonlinear models achieved
approximately the same summed squared tracking error (SSE), although a smaller
control effort (CE) was needed for the hinging hyperplane-based MPC.
2.4 Conclusions
In this chapter, an algorithm was proposed that builds hinging hyperplane-based regression trees by constrained fuzzy c-regression clustering and recursive partitioning of the data. The complexity of the model is controlled by the proposed model performance measure. The resulting piecewise linear model can be effectively used to represent nonlinear dynamical systems. The resulting linear parameter varying (LPV) model can be easily utilized in model-based control.
To illustrate the advantages of the proposed approach, benchmark datasets were
modeled and a simulation example presented for the identification and model pre-
dictive control of a laboratory water heater.
The results show that, with the use of the proposed modeling framework, accurate
and transparent nonlinear models can be identified since the complexity and the
accuracy of the model can be easily controlled. The local linear models can be
easily interpreted and utilized to represent operating regimes of nonlinear dynamical
systems. Based on this interpretation, effective model-based control applications can
be designed.