Optimal binning: mathematical programming formulation

Guillermo Navas-Palencia
[email protected]

Abstract
The optimal binning is the optimal discretization of a variable into bins given a discrete
or continuous numeric target. We present a rigorous and extensible mathematical
programming formulation for solving the optimal binning problem for binary, continuous,
and multi-class targets, incorporating constraints not previously addressed.
For all three target types, we introduce a convex mixed-integer programming formulation.
Several algorithmic enhancements, such as the automatic determination of the most
suitable monotonic trend via a machine-learning-based classifier, and implementation
aspects are thoroughly discussed. The new mathematical programming formulations
are carefully implemented in the open-source Python library OptBinning.
1 Introduction
Binning (grouping or bucketing) is a technique to discretize the values of a continuous
variable into bins (groups or buckets). From a modeling perspective, the binning technique
may address prevalent data issues such as the handling of missing values, the presence
of outliers and statistical noise, and data scaling. Furthermore, the binning process is
a valuable interpretable tool to enhance the understanding of the nonlinear dependence
between a variable and a given target while reducing the model complexity. Ultimately,
resulting bins can be used to perform data transformations.
Binning techniques are extensively used in machine learning applications, exploratory
data analysis and as an algorithm to speed up learning tasks; recently, binning has been
applied to accelerate learning in gradient boosting decision tree [12]. In particular, binning
is widely used in credit risk modeling, being an essential tool for credit scorecard modeling
to maximize differentiation between high-risk and low-risk observations, and in expected
credit loss modeling.
There are several unsupervised and supervised binning techniques. Common unsuper-
vised techniques are equal-width and equal-size or equal-frequency interval binning. On the
other hand, well-known supervised techniques based on merging are Monotone Adjacent
Pooling Algorithm (MAPA), also known as Maximum Likelihood Monotone Coarse Classi-
fier (MLMCC) [21] and ChiMerge [13], whereas other techniques based on decision trees are
CART [2], Minimum Description Length Principle (MDLP) [3] and, more recently, conditional
inference trees (CTREE) [9].
The binning process might be required to satisfy certain constraints, ranging from a
minimum number of records per bin to monotonicity constraints. This variant of the binning
process is known as the optimal binning process. The optimal binning is generally solved
by iteratively merging an initial granular discretization until the imposed constraints are
satisfied. Performing this fine-tuning manually is likely to be unsatisfactory as the number
of constraints increases, leading to suboptimal or even infeasible solutions. However, we
note that this manual adjustment has been encouraged by some authors [20], legitimizing
the existing interplay of “art and science” in the binning process.
There are various commercial software tools for solving the optimal binning problem.
IBM SPSS and the MATLAB Financial Toolbox use MDLP and MAPA as their default
algorithm, respectively. The most advanced tool to solve the optimal binning problem
is available in the SAS Enterprise Miner software. A limited description of the proprietary
algorithm can be found in [17], where two mixed-integer programming (MIP) formulations
are sketched: a mixed-integer linear programming (MILP) formulation to obtain a fast,
possibly suboptimal solution, and a mixed-integer nonlinear programming (MINLP)
formulation to obtain an optimal solution. The suboptimal formulation is the default method
due to computational time limitations (MILP techniques are considerably more mature).
We note that the SAS implementation allows most of the constraints required in credit risk
modeling, making it an industry standard. Besides, there exist a few open-source solutions,
but the gap with the commercial options in terms of capabilities is still significant. Among
the available alternatives, we mention the MATLAB implementation of the monotone
optimal binning in [16], and the R specialized packages smbinning [7], relying on CTREE,
and MOB [22], which merely include basic functionalities.
In this paper, we develop a rigorous and extensible mathematical programming formulation
for solving the optimal binning problem. This general formulation can efficiently
handle binary, continuous, and multi-class target types. The presented formulations
incorporate the constraints generally required to produce a good binning [20], and new
constraints not previously addressed. For all three target types, we introduce a convex
mixed-integer programming formulation, ranging from an integer linear programming (ILP)
formulation for the simplest cases to a mixed-integer quadratic programming (MIQP)
formulation for those cases adding more involved constraints.
The remainder of the paper is organized as follows. Section 2 introduces our general
problem formulation and the corresponding mixed-integer programming formulation for each
supported target. We focus on the formulation for a binary target, investigating several
formulation variants. Then, in Section 3 we discuss in detail several algorithmic aspects
such as the automatic determination of the optimal monotonic trend and the development of
presolving algorithms to efficiently solve large size instances. Section 4 includes experiments
with real-world datasets and compares the performance of supported solvers for large size
instances. Finally, in Section 5, we present our conclusions and discuss possible research
directions.
be erased. Each column must contain exactly one 1.
\sum_{i=1}^{n} X_{ij} = 1, \quad j = 1, \ldots, n. \qquad (1)
• A solution has a last bin interval of the form [s_k, \infty), for k \leq n: the binary decision
variable X_{nn} = 1.
1                    1
0 1                  0 0
0 0 1                0 0 0
0 0 0 1              0 1 1 1
0 0 0 0 1            0 0 0 0 0
0 0 0 0 0 1          0 0 0 0 1 1
0 0 0 0 0 0 1        0 0 0 0 0 0 1

Figure 1: Lower triangular matrix X. Initial solution after pre-binning (left). Optimal
solution with 4 bins after merging pre-bins (right).
The described problem can be seen as a generalized assignment problem. A direct for-
mulation of metrics involving ratios such as the mean or most of the divergence measures on
merged bins leads to a non-convex MINLP formulation, due to the ratio of sums of binary
variables. Solving non-convex MINLP problems to optimality is a challenging task requiring
the use of global MINLP solvers, especially for large size instances.
Investigating the binary lower triangular matrix in Figure 1, it can be observed, by
analyzing the constraints in Equations (1) and (2) imposing continuity by rows, that a
feasible solution is entirely characterized by the position of the first 1 in each row. This
observation permits the pre-computation of the set of possible solutions by rows, obtaining
an aggregated matrix with the shape of X for each involved metric. Consequently, the non-
convex objective function and constraints are linearized, resulting in a convex formulation
by exploiting problem information. Using this reformulation, we shall see that the definition
of constraints for binary, continuous and multi-class targets is almost analogous.
p_i = \frac{r_i^{NE}}{r_T^{NE}}, \qquad q_i = \frac{r_i^{E}}{r_T^{E}},
where r_T^{NE} and r_T^{E} are the total number of non-event records and event records, respectively.
Next we define the Weight of Evidence (WoE) and event rate for each bin,

\mathrm{WoE}_i = \log\left(\frac{r_i^{NE}/r_T^{NE}}{r_i^{E}/r_T^{E}}\right), \qquad D_i = \frac{r_i^{E}}{r_i^{E} + r_i^{NE}}.
The Weight of Evidence WoEi and event rate Di for each bin are related by means of the
functional equations
\mathrm{WoE}_i = \log\left(\frac{1 - D_i}{D_i}\right) + \log\left(\frac{r_T^{E}}{r_T^{NE}}\right) = \log\left(\frac{r_T^{E}}{r_T^{NE}}\right) - \mathrm{logit}(D_i)

D_i = \left(1 + \frac{r_T^{NE}}{r_T^{E}} e^{\mathrm{WoE}_i}\right)^{-1} = \left(1 + e^{\mathrm{WoE}_i - \log(r_T^{E}/r_T^{NE})}\right)^{-1},

where D_i can be characterized as a logistic function of WoE_i, and WoE_i can be expressed
in terms of the logit function of D_i. This shows that WoE is inversely related to the event
rate. The constant term \log(r_T^{E}/r_T^{NE}) is the log ratio of the total number of events to the
total number of non-events.
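As a concrete illustration, the following sketch (not the OptBinning API; the counts are hypothetical) computes p_i, q_i, WoE_i and D_i from per-bin record counts and verifies the functional equation above:

```python
import numpy as np

# Hypothetical per-bin counts of non-event and event records.
r_ne = np.array([99, 286, 184])
r_e = np.array([445, 774, 344])

p = r_ne / r_ne.sum()   # p_i: distribution of non-event records
q = r_e / r_e.sum()     # q_i: distribution of event records
woe = np.log(p / q)     # WoE_i = log(p_i / q_i)
d = r_e / (r_e + r_ne)  # event rate D_i

# Check: WoE_i = log(r_T^E / r_T^NE) - logit(D_i).
logit_d = np.log(d / (1 - d))
assert np.allclose(woe, np.log(r_e.sum() / r_ne.sum()) - logit_d)
```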
Divergence measures serve to assess the discriminant power of a binning solution. The
Jeffreys’ divergence [10], also known as Information Value (IV) within the credit risk in-
dustry, is a symmetric measure expressible in terms of the Kullback-Leibler divergence
DKL (P ||Q) [14] defined by
J(P \,\|\, Q) = \mathrm{IV} = D_{KL}(P \,\|\, Q) + D_{KL}(Q \,\|\, P) = \sum_{i=1}^{n} (p_i - q_i) \log\left(\frac{p_i}{q_i}\right).
The IV statistic is unbounded, but some authors have proposed rules of thumb to set
quality thresholds [20]. Alternatively, the Jensen-Shannon divergence is a bounded symmetric
measure also expressible in terms of the Kullback-Leibler divergence,

\mathrm{JSD}(P \,\|\, Q) = \frac{1}{2}\left(D(P \,\|\, M) + D(Q \,\|\, M)\right), \qquad M = \frac{1}{2}(P + Q),

and bounded by \mathrm{JSD}(P \,\|\, Q) \in [0, \log(2)]. Note that these divergence measures cannot be
computed when r_i^{NE} = 0 and/or r_i^{E} = 0. Other divergence measures without this limitation
are described in [23].
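A minimal sketch of both divergence measures, assuming strictly positive p_i and q_i (the helper names are ours):

```python
import numpy as np

def kl(a, b):
    """Kullback-Leibler divergence D_KL(a || b), for a, b > 0."""
    return np.sum(a * np.log(a / b))

def information_value(p, q):
    """Jeffreys' divergence J(P || Q), i.e., the IV statistic."""
    return np.sum((p - q) * np.log(p / q))

def jensen_shannon(p, q):
    """Jensen-Shannon divergence, bounded by [0, log(2)]."""
    m = 0.5 * (p + q)
    return 0.5 * (kl(p, m) + kl(q, m))
```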
A good binning algorithm for a binary target should be characterized by the following
properties [20]:
Let us define the parameters of the mathematical programming formulation:
The objective function maximizes the discriminant power among bins, therefore we
maximize a divergence measure. The IV can be computed using the described parameters
and the decision variables X_{ij}, yielding
\mathrm{IV} = \sum_{i=1}^{n} \sum_{j=1}^{i} \left( \frac{r_j^{NE}}{r_T^{NE}} - \frac{r_j^{E}}{r_T^{E}} \right) X_{ij} \log\left( \frac{\sum_{z=1}^{i} (r_z^{NE}/r_T^{NE}) X_{iz}}{\sum_{z=1}^{i} (r_z^{E}/r_T^{E}) X_{iz}} \right).
The IV is the sum of the IV contributions per bin, i.e., the sum by rows. As previously
stated, given the constraints in Equations (1) and (2), an aggregated lower triangular matrix
V_{ij} \in \mathbb{R}_{+}, \forall (i, j) \in \{1, \ldots, n : i \geq j\}, with all possible IV values from bin merges can be
pre-computed as follows

V_{ij} = \sum_{z=j}^{i} \left( \frac{r_z^{NE}}{r_T^{NE}} - \frac{r_z^{E}}{r_T^{E}} \right) \log\left( \frac{\sum_{z=j}^{i} r_z^{NE}/r_T^{NE}}{\sum_{z=j}^{i} r_z^{E}/r_T^{E}} \right), \quad i = 1, \ldots, n; \; j = 1, \ldots, i. \qquad (3)
E
The optimal IV for each bin is determined by using the remarked observation that a solu-
tion is characterized by the position of the first 1 for each row, thus, using the continuity
constraint in (2), we obtain
i
X i−1
X
Vi· = Vi1 Xi1 + Vij (Xij − Xij−1 ) ⇐⇒ Vii Xii + (Vij − Vij+1 )Xij . (4)
j=2 j=1
for i = 1, \ldots, n. The latter expression is preferred to reduce the fill-in of the constraint
matrix. Similarly, a lower triangular matrix of event rates D_{ij} \in [0, 1], \forall (i, j) \in
\{1, \ldots, n : i \geq j\}, can be pre-computed as follows

D_{ij} = \frac{\sum_{z=j}^{i} r_z^{E}}{\sum_{z=j}^{i} r_z}, \quad i = 1, \ldots, n; \; j = 1, \ldots, i.
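The pre-computation of both aggregated matrices can be sketched directly from Equation (3) and the definition of D_{ij}; a straightforward version (the function name is ours) follows:

```python
import numpy as np

def aggregated_matrices(r_ne, r_e):
    """Lower triangular matrices V (IV of merging pre-bins j..i) and D
    (event rate of merging pre-bins j..i), per Equation (3)."""
    n = len(r_ne)
    t_ne, t_e = r_ne.sum(), r_e.sum()
    V = np.zeros((n, n))
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1):
            s_ne = r_ne[j:i + 1].sum()
            s_e = r_e[j:i + 1].sum()
            p, q = s_ne / t_ne, s_e / t_e
            V[i, j] = (p - q) * np.log(p / q)  # IV contribution of merged bin
            D[i, j] = s_e / (s_ne + s_e)       # event rate of merged bin
    return V, D
```

Note that this requires strictly positive non-event and event counts per pre-bin, which the pre-binning refinement in Section 3.4 guarantees.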
can be stated as follows:

\max_{X} \quad \sum_{i=1}^{n} \left( V_{ii} X_{ii} + \sum_{j=1}^{i-1} (V_{ij} - V_{i,j+1}) X_{ij} \right) \qquad (5a)

\text{s.t.} \quad \sum_{i=j}^{n} X_{ij} = 1, \quad j = 1, \ldots, n \qquad (5b)
Apart from the constraints (5b) and (5c) already described, constraint (5d) imposes a
lower and upper bound on the number of bins. Other range constraints (5e)-(5g) limit the
number of total, non-event, and event records per bin. Note that, to increase sparsity, the
range constraints are not implemented following the standard formulation, to avoid having
the data twice in the model. For example, constraint (5d) is replaced by

d + \sum_{i=1}^{n} X_{ii} - b_{\max} = 0, \quad 0 \leq d \leq b_{\max} - b_{\min}.
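As an illustrative sketch, the core of formulation (5) can be expressed with Google OR-Tools CP-SAT, one of the solvers later used in Section 3.4. CP-SAT requires integer coefficients, so V is scaled and rounded; the encoding below, including the simple two-sided version of (5d) and the assumed row-continuity form of constraint (2), is our own, not the library's internal model:

```python
from ortools.sat.python import cp_model

def build_model(V, b_min, b_max, scale=10**6):
    n = V.shape[0]
    model = cp_model.CpModel()
    X = {(i, j): model.NewBoolVar(f"x_{i}_{j}")
         for i in range(n) for j in range(i + 1)}

    # (5b): each pre-bin j belongs to exactly one merged bin.
    for j in range(n):
        model.Add(sum(X[i, j] for i in range(j, n)) == 1)

    # Continuity by rows (assumed form of (2)): within a row,
    # ones run contiguously up to the diagonal.
    for i in range(n):
        for j in range(1, i + 1):
            model.Add(X[i, j - 1] <= X[i, j])

    # (5d): bounds on the number of bins (ones on the diagonal).
    diag = sum(X[i, i] for i in range(n))
    model.Add(diag >= b_min)
    model.Add(diag <= b_max)

    # (5a): maximize total IV, using the telescoped coefficients from (4),
    # scaled to integers as required by CP-SAT.
    terms = [int(round(scale * V[i, i])) * X[i, i] for i in range(n)]
    for i in range(n):
        for j in range(i):
            terms.append(int(round(scale * (V[i, j] - V[i, j + 1]))) * X[i, j])
    model.Maximize(sum(terms))
    return model, X
```

Solving is then a call to cp_model.CpSolver().Solve(model).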
Monotonic trend: ascending and descending. The formulation for a monotonic
ascending trend can be stated as follows,

D_{zz} X_{zz} + \sum_{j=1}^{z-1} (D_{zj} - D_{z,j+1}) X_{zj} + \beta (X_{ii} + X_{zz} - 1)
\leq 1 + (D_{ii} - 1) X_{ii} + \sum_{j=1}^{i-1} (D_{ij} - D_{i,j+1}) X_{ij}, \quad i = 2, \ldots, n; \; z = 1, \ldots, i - 1.
The term 1 + (D_{ii} - 1) X_{ii}, or simply 1 - X_{ii}, is used to ensure that event rates are in
[0, 1] and that the ascending constraint is satisfied even if bin i is not selected. Note that this
is a big-M formulation, M + (D_{ii} - M) X_{ii} with M = 1, which suffices given D \in [0, 1];
however, a tighter (non-integer) M = \max(\{D_{ij} : i = 1, \ldots, n; \; i \geq j\}) can be used instead.
The parameter \beta is the minimum event rate difference between consecutive bins. The term
\beta (X_{ii} + X_{zz} - 1) is required to ensure that the difference between two selected bins i and z
is greater than or equal to \beta. Similarly, for the descending constraint,
D_{ii} X_{ii} + \sum_{j=1}^{i-1} (D_{ij} - D_{i,j+1}) X_{ij} + \beta (X_{ii} + X_{zz} - 1)
\leq 1 + (D_{zz} - 1) X_{zz} + \sum_{j=1}^{z-1} (D_{zj} - D_{z,j+1}) X_{zj}, \quad i = 2, \ldots, n; \; z = 1, \ldots, i - 1.
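Continuing the CP-SAT sketch above, the ascending constraints can be added by scaling the event rate coefficients to integers (again, an encoding of our own, not the paper's implementation):

```python
def add_ascending(model, X, D, beta, scale=10**6):
    """Add the ascending event rate constraints with big-M = 1 and
    minimum event rate difference beta, coefficients scaled for CP-SAT."""
    n = D.shape[0]
    s = lambda v: int(round(scale * v))
    for i in range(1, n):
        for z in range(i):
            lhs = s(D[z, z]) * X[z, z] + sum(
                s(D[z, j] - D[z, j + 1]) * X[z, j] for j in range(z))
            rhs = s(1.0) + s(D[i, i] - 1.0) * X[i, i] + sum(
                s(D[i, j] - D[i, j + 1]) * X[i, j] for j in range(i))
            model.Add(lhs + s(beta) * (X[z, z] + X[i, i] - 1) <= rhs)
    return model
```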
Monotonic trend: concave and convex. The concave and convex trends can be achieved
by taking the definition of concavity/convexity on equally spaced points: for consecutive
bins k < j < i, the event rate sequence is concave if -d_i + 2 d_j - d_k \geq 0 and convex if
d_i - 2 d_j + d_k \geq 0. Thus, replacing Equation (6) in this definition of concavity, we obtain
the concave trend constraints,
-\left( D_{ii} X_{ii} + \sum_{z=1}^{i-1} (D_{iz} - D_{i,z+1}) X_{iz} \right) + 2 \left( D_{jj} X_{jj} + \sum_{z=1}^{j-1} (D_{jz} - D_{j,z+1}) X_{jz} \right)
- \left( D_{kk} X_{kk} + \sum_{z=1}^{k-1} (D_{kz} - D_{k,z+1}) X_{kz} \right) \geq X_{ii} + X_{jj} + X_{kk} - 3,

for i = 3, \ldots, n; \; j = 2, \ldots, i - 1; \; k = 1, \ldots, j - 1.
Monotonic trend: peak and valley. The peak and valley trends define an event rate
function exhibiting a single trend change or reversal. The optimal trend change position
is determined by using disjoint constraints, which can be linearized using auxiliary binary
variables. The resulting additional constraints are as follows,
where t is the position of the optimal trend change bin, y_i are auxiliary binary variables,
and n in (7a) is the smallest big-M value for this formulation preserving the redundancy
of the constraints. Furthermore, for the peak trend we incorporate the following constraints,
y_i + y_z + 1 + (D_{zz} - 1) X_{zz} + \sum_{j=1}^{z-1} (D_{zj} - D_{z,j+1}) X_{zj}
\geq D_{ii} X_{ii} + \sum_{j=1}^{i-1} (D_{ij} - D_{i,j+1}) X_{ij}, \quad i = 2, \ldots, n; \; z = 1, \ldots, i - 1,

2 - y_i - y_z + 1 + (D_{ii} - 1) X_{ii} + \sum_{j=1}^{i-1} (D_{ij} - D_{i,j+1}) X_{ij}
\geq D_{zz} X_{zz} + \sum_{j=1}^{z-1} (D_{zj} - D_{z,j+1}) X_{zj}, \quad i = 2, \ldots, n; \; z = 1, \ldots, i - 1.
Note that none of these constraints are necessary if the position of the change bin t is
fixed in advance. For example, given t, the valley trend constraints are replaced by two sets
of constraints: one to guarantee a descending monotonic trend before t and another to
guarantee an ascending monotonic trend after t. Devising an effective heuristic to determine the
optimal t can yield solutions that are likely optimal while significantly reducing the problem size.
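For illustration only (the paper does not prescribe a specific heuristic), one simple choice is to take t as the extremum of a smoothed pre-bin event rate sequence:

```python
import numpy as np

def peak_change_point(event_rates, window=3):
    """Illustrative heuristic: candidate trend change bin t for a peak
    trend, the argmax of a moving-average-smoothed event rate sequence."""
    d = np.asarray(event_rates, dtype=float)
    kernel = np.ones(window) / window
    smoothed = np.convolve(d, kernel, mode="same")
    return int(np.argmax(smoothed))  # use argmin for a valley trend
```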
Taking w = (w_1, \ldots, w_n)^T, the standard deviation t can be incorporated into the formulation
with a different representation. Since \mathrm{std} = (w^T w / (m - 1))^{1/2}, then

\frac{\|w\|_2}{(m - 1)^{1/2}} \leq t \iff \|w\|_2^2 \leq (m - 1) t^2.
The non-convex MINLP formulation, using the parameter \gamma to control the importance of
the term t, is

\max_{X, \mu, w} \quad \sum_{i=1}^{n} \left( V_{ii} X_{ii} + \sum_{j=1}^{i-1} (V_{ij} - V_{i,j+1}) X_{ij} \right) - \gamma t \qquad (10a)
The MILP formulation is given by

\max_{X, p_{\min}, p_{\max}} \quad \sum_{i=1}^{n} \left( V_{ii} X_{ii} + \sum_{j=1}^{i-1} (V_{ij} - V_{i,j+1}) X_{ij} \right) - \gamma (p_{\max} - p_{\min}) \qquad (12a)
Maximum p-value constraint. A necessary constraint to guarantee that event rates
between consecutive bins are statistically different is a maximum p-value constraint, given
a significance level \alpha. Suitable statistical tests are the Z-test, Pearson's Chi-square
test, or Fisher's exact test. To perform these statistical tests we require aggregated
matrices of non-event and event records per bin,

R_{ij}^{NE} = \sum_{z=j}^{i} r_z^{NE}, \qquad R_{ij}^{E} = \sum_{z=j}^{i} r_z^{E}, \quad i = 1, \ldots, n; \; j = 1, \ldots, i.
The preprocessing procedure to detect pairs of pre-bins that do not satisfy the p-value
constraints using the Z-test is shown in Algorithm 1.
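Algorithm 1 is not reproduced here, but a sketch of the underlying two-proportion Z-test is shown below (our assumed form; OptBinning's exact procedure may differ):

```python
import numpy as np
from scipy import stats

def ztest_violation(e1, ne1, e2, ne2, alpha=0.05):
    """True if the event rates of two candidate bins, given their event (e)
    and non-event (ne) counts, are NOT statistically different at level alpha."""
    n1, n2 = e1 + ne1, e2 + ne2
    d1, d2 = e1 / n1, e2 / n2
    d_pooled = (e1 + e2) / (n1 + n2)
    z = (d1 - d2) / np.sqrt(d_pooled * (1 - d_pooled) * (1 / n1 + 1 / n2))
    p_value = 2 * stats.norm.sf(abs(z))  # two-sided p-value
    return p_value > alpha
```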
These constraints are added to the formulation by imposing that, at most, one of the
bins violating the maximum p-value constraint can be selected.
2.1.3 Mixed-integer programming reformulation for local and heuristic search
The number of binary decision variables X is n(n + 1)/2. For large n, the \mathcal{NP}-hardness of
the combinatorial optimization problem might limit the success of tree-search techniques. A
first approach to tackle this limitation is reformulating the problem to reduce the number
of decision variables. First, observe in Figure 1 that a solution is fully characterized by the
diagonal of X. Thus, given the diagonal, we can place the ones in the positions satisfying the
unique assignment (1) and continuity (2) constraints. On the other hand, to return indexed
elements in any aggregated matrix in the original formulation, we require the position of the
first one in each row. We note that this information can be retrieved by counting the number of
consecutive zeros between ones (selected bins) of the diagonal. To perform this operation we
use two auxiliary decision variables: an accumulator of preceding zeros a_i and the preceding
run-length of zeros z_i. A similar approach to counting consecutive ones is introduced in [11].
The described approach is illustrated in Figure 2.
x = (1, 0, 0, 1, 0, 1, 1)
a = (0, 1, 2, 0, 1, 0, 0)
z = (0, 0, 0, 2, 0, 1, 0)

Figure 2: Optimal solution X from Figure 1 (right) with its diagonal x, the accumulator of
preceding zeros a, and the preceding run-lengths of zeros z.
The positions in z are zero-based indexes of the reversed rows of the aggregated matrices.
For example, the aggregated lower triangular matrix R^E is now computed backward:
R_{ij}^{E} = \sum_{z=i}^{j} r_z^{E} for i = 1, \ldots, n; \; j = 1, \ldots, i. The same applies to the aggregated
matrices V, D, R and R^{NE}.
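A sketch of the counting scheme in Figure 2, reproducing the accumulator a_i and run-length z_i that the constraints (13c)-(13d) below enforce:

```python
def count_runs(x):
    """Accumulator a_i and preceding run-length of zeros z_i, per the
    recurrences a_i = (a_{i-1} + 1)(1 - x_i), z_i = a_{i-1}(1 - x_{i-1}) x_i."""
    n = len(x)
    a = [0] * n
    z = [0] * n
    for i in range(n):
        a_prev = a[i - 1] if i > 0 else 0
        x_prev = x[i - 1] if i > 0 else 0
        a[i] = (a_prev + 1) * (1 - x[i])
        z[i] = a_prev * (1 - x_prev) * x[i]
    return a, z

# Diagonal of Figure 2: reproduces a = [0, 1, 2, 0, 1, 0, 0]
# and z = [0, 0, 0, 2, 0, 1, 0].
a, z = count_runs([1, 0, 0, 1, 0, 1, 1])
```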
Let us define the parameters of the mathematical programming formulation:
and the decision variables:

x_i \in \{0, 1\}: binary indicator variable,
a_i \in \mathbb{N}_0: accumulator of preceding zeros,
z_i \in \mathbb{N}_0: preceding run-length of zeros.
The new formulation with 3n decision variables is stated as follows:

\max_{x} \quad \sum_{i=1}^{n} V_{[i, z_i]} x_i \qquad (13a)
\text{s.t.} \quad x_n = 1 \qquad (13b)
a_i = (a_{i-1} + 1)(1 - x_i), \quad i = 1, \ldots, n \qquad (13c)
z_i = a_{i-1} (1 - x_{i-1}) x_i, \quad i = 1, \ldots, n \qquad (13d)
b_{\min} \leq \sum_{i=1}^{n} x_i \leq b_{\max} \qquad (13e)
r_{\min} \leq R_{[i, z_i]} x_i \leq r_{\max}, \quad i = 1, \ldots, n \qquad (13f)
r_{\min}^{NE} \leq R_{[i, z_i]}^{NE} x_i \leq r_{\max}^{NE}, \quad i = 1, \ldots, n \qquad (13g)
r_{\min}^{E} \leq R_{[i, z_i]}^{E} x_i \leq r_{\max}^{E}, \quad i = 1, \ldots, n \qquad (13h)
x_i \in \{0, 1\}, \quad i = 1, \ldots, n \qquad (13i)
a_i \in \mathbb{N}_0, \quad i = 1, \ldots, n \qquad (13j)
z_i \in \mathbb{N}_0, \quad i = 1, \ldots, n \qquad (13k)
This MINLP formulation is particularly suitable for Local Search (LS) and heuristic
techniques, where the decision variable z_i can be used as an index, for example, V_{[i, z_i]}. The
nonlinear constraints (13c) and (13d) are needed to count consecutive zeros. After the
linearization of these constraints via big-M inequalities or indicator constraints [11], the
formulation is adequate for Constraint Programming (CP). Additional constraints such as
monotonicity constraints can be incorporated into (13b)-(13k) in a relatively simple manner:
Monotonic trend ascending

D_{[i, z_i]} x_i + 1 - x_i \geq D_{[j, z_j]} x_j + \beta (x_i + x_j - 1), \quad i = 2, \ldots, n; \; j = 1, \ldots, i - 1.

Monotonic trend descending

D_{[i, z_i]} x_i + \beta (x_i + x_j - 1) \leq 1 - x_j + D_{[j, z_j]} x_j, \quad i = 2, \ldots, n; \; j = 1, \ldots, i - 1.

Monotonic trend concave

-D_{[i, z_i]} x_i + 2 D_{[j, z_j]} x_j - D_{[k, z_k]} x_k \geq x_i + x_j + x_k - 3,

for i = 3, \ldots, n; \; j = 2, \ldots, i - 1; \; k = 1, \ldots, j - 1.

Monotonic trend convex

D_{[i, z_i]} x_i - 2 D_{[j, z_j]} x_j + D_{[k, z_k]} x_k \geq x_i + x_j + x_k - 3,

for i = 3, \ldots, n; \; j = 2, \ldots, i - 1; \; k = 1, \ldots, j - 1.

Monotonic trend peak: constraints (7a - 7c) and

y_i + y_j + 1 + (D_{[j, z_j]} - 1) x_j - D_{[i, z_i]} x_i \geq 0
2 - y_i - y_j + 1 + (D_{[i, z_i]} - 1) x_i - D_{[j, z_j]} x_j \geq 0,

for i = 2, \ldots, n; \; j = 1, \ldots, i - 1.

Monotonic trend valley: constraints (7a - 7c) and

y_i + y_j + 1 + (D_{[i, z_i]} - 1) x_i - D_{[j, z_j]} x_j \geq 0
2 - y_i - y_j + 1 + (D_{[j, z_j]} - 1) x_j - D_{[i, z_i]} x_i \geq 0,

for i = 2, \ldots, n; \; j = 1, \ldots, i - 1.
In Section 4, we compare the initial CP/MIP formulation to the presented LS formulation
for large size instances.
where \mu_i \in \mathbb{R} is the target mean for each pre-bin, U_{ij} \in \mathbb{R} is the aggregated matrix of
mean values, and s_i is the sum of target values for each pre-bin. A more robust approach
might replace the mean by an order statistic but, unfortunately, the aggregated matrix
computed from the pre-binning data would then require approximation methods. Replacing L_{ij}
in the objective function (5a) and changing the optimization sense to minimization, the
resulting formulation is given by
\min_{X} \quad \sum_{i=1}^{n} \left( L_{ii} X_{ii} + \sum_{j=1}^{i-1} (L_{ij} - L_{i,j+1}) X_{ij} \right) \qquad (17a)
The enforced constraint must be satisfied if the two literals X_{ii} and X_{zz} are true;
otherwise, the constraint is ignored. This is a half-reified linear constraint [4]. The parameter \beta
is the minimum mean difference between consecutive bins. Similarly, for the descending constraint,

X_{ii} = 1 \text{ and } X_{zz} = 1 \implies U_{ii} X_{ii} + \sum_{j=1}^{i-1} (U_{ij} - U_{i,j+1}) X_{ij} + \beta
\leq U_{zz} X_{zz} + \sum_{j=1}^{z-1} (U_{zj} - U_{z,j+1}) X_{zj}, \quad i = 2, \ldots, n; \; z = 1, \ldots, i - 1.
Furthermore, the concave and convex trends can be written in triple implication form
using the literals X_{ii}, X_{jj} and X_{kk}. The concave trend constraints are

X_{ii} = 1, X_{jj} = 1 \text{ and } X_{kk} = 1 \implies - \left( U_{ii} X_{ii} + \sum_{z=1}^{i-1} (U_{iz} - U_{i,z+1}) X_{iz} \right)
+ 2 \left( U_{jj} X_{jj} + \sum_{z=1}^{j-1} (U_{jz} - U_{j,z+1}) X_{jz} \right)
- \left( U_{kk} X_{kk} + \sum_{z=1}^{k-1} (U_{kz} - U_{k,z+1}) X_{kz} \right) \geq 0,

for i = 3, \ldots, n; \; j = 2, \ldots, i - 1 \text{ and } k = 1, \ldots, j - 1.
The formulation can be extended to support the valley and peak trends. The peak trend
requires constraints (7a - 7c) and

X_{ii} = 1 \text{ and } X_{zz} = 1 \implies M (y_i + y_z) + U_{zz} X_{zz} + \sum_{j=1}^{z-1} (U_{zj} - U_{z,j+1}) X_{zj}
\geq U_{ii} X_{ii} + \sum_{j=1}^{i-1} (U_{ij} - U_{i,j+1}) X_{ij},

M (2 - y_i - y_z) + U_{ii} X_{ii} + \sum_{j=1}^{i-1} (U_{ij} - U_{i,j+1}) X_{ij}
\geq U_{zz} X_{zz} + \sum_{j=1}^{z-1} (U_{zj} - U_{z,j+1}) X_{zj},

for i = 2, \ldots, n; \; z = 1, \ldots, i - 1.
Finally, additional constraints described in Section 2.1.2 can be naturally incorporated
with minor changes.
Note that for this formulation we need aggregated matrices V and D for each class c. It
is important to emphasize that the monotonicity constraints in Section 2.1.1 act as linking
constraints among classes; otherwise, n_C optimal binning problems with a binary target
could be solved separately. Again, additional constraints described in Section 2.1.2 can be
naturally incorporated with minor changes.
Label   Instances   %
A       84          0.20
D       200         0.48
P       76          0.18
V       57          0.14
Total   417
To perform experiments, the dataset is split into train and test subsets in a stratified
manner to treat the unbalanced data. The proportion of the test set is 30%. Three interpretable
multi-class classification algorithms are tested, namely, logistic regression, decision trees
(CART) and Support Vector Machine (SVM), using the Python library Scikit-learn [19].
All three algorithms are trained using the option class_weight="balanced". Throughout the
learning process, we discard 8 features and perform hyperparameter optimization for all
three algorithms. These experiments show that SVM and CART have similar classification
measures, and we choose CART (max depth 5) to ease implementation.
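A sketch of this setup with Scikit-learn (the feature matrix X and trend labels y in {A, D, P, V} are assumed given; the fixed random seed is our own choice for reproducibility):

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stratified 30% test split to treat the unbalanced classes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = DecisionTreeClassifier(max_depth=5, class_weight="balanced")
clf.fit(X_train, y_train)
```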
On the test set, the trained CART has a weighted average accuracy, precision and recall
of 88%. See classification measures and the confusion matrix in Table 2. We observe that
various instances of the minority class (V) are misclassified, indicating that more instances or
new features might be required to improve classification measures. Improving this classifier
is part of ongoing research.
Label          Precision   Recall   F1-score   Support
A              0.85        0.88     0.86       25
D              0.97        0.93     0.95       61
P              0.81        0.88     0.88       23
V              0.71        0.59     0.65       17
weighted avg   0.88        0.88     0.88       126

        A    D    P    V
A      22    0    1    2
D       0   57    2    2
P       1    0   22    0
V       3    2    2   10

Table 2: Classification measures (top) and confusion matrix (bottom; rows: actual, columns:
predicted) for CART on the test set.
• Predictive power: IV rule of thumb [20] in Table 3.
• Statistical significance: bin event rates must be statistically different, therefore large
p-values penalize the quality score.
• Homogeneity: a binning with homogeneous bin sizes, or uniform representativeness,
increases reliability.
IV            predictive power
[0, 0.02)     not useful
[0.02, 0.1)   weak
[0.1, 0.3)    medium
[0.3, 0.5)    strong
[0.5, ∞)      over-prediction

Table 3: IV rule of thumb for predictive power.
To account for all these aspects, we propose a rigorous binning quality score function.
Proposition 3.1 Given a binning with Information Value \nu, p-values between consecutive
bins p_i, i = 1, \ldots, n - 1, and normalized bin sizes s_i, i = 1, \ldots, n, the binning quality score
function is defined as

Q(\nu, p, s) = \frac{\nu}{c} \exp\left(-\nu^2 / (2 c^2) + 1/2\right) \prod_{i=1}^{n-1} (1 - p_i) \left( \frac{1 - \sum_{i=1}^{n} s_i^2}{1 - 1/n} \right), \qquad (21)

where Q(\nu, p, s) \in [0, 1] and c = \frac{1}{5} \sqrt{\frac{2}{\log(5/3)}} is the best a priori IV value in [0.3, 0.5).
Proof: Given the rule of thumb in Table 3, let us consider the set of statistical distributions
with positive skewness, a fat right tail, and support on the semi-infinite interval [0, ∞). The
function should penalize large values of the Information Value \nu, and a fast decay is expected
after a certain threshold indicating over-prediction. This fast decay is a required property that
must be accompanied by the following statement: \lim_{\nu \to 0} f(\nu) = \lim_{\nu \to \infty} f(\nu) = 0. Among
the available distributions satisfying the aforementioned properties, we select the Rayleigh
distribution, whose probability density function is given by

f(\nu; c) = \frac{\nu}{c^2} e^{-\nu^2 / (2 c^2)}, \quad \nu \geq 0.
This is a statistical distribution, not a score function, hence we need a scaling factor so that
\max_{\nu \in [0, \infty)} f(\nu; c) = 1: the unimodal probability distribution attains its maximum at the
mode c, thus

\gamma = f(c; c) = \frac{1}{c \sqrt{e}} \implies \frac{f(\nu; c)}{\gamma} = \frac{\nu}{c} \exp\left(-\nu^2 / (2 c^2) + 1/2\right).
The optimal c such that f(a) = f(b) for b > a can be obtained by solving f(b; c) - f(a; c) = 0
for c, which yields

c^* = \frac{\sqrt{b^2 - a^2}}{\sqrt{2 \log(b/a)}}.
The term \prod_{i=1}^{n-1} (1 - p_i) assesses the statistical significance of the bins. Furthermore, the
term (1 - \sum_{i=1}^{n} s_i^2) / (1 - 1/n) = 1 - HHI^*, where HHI^* is the normalized
Herfindahl-Hirschman Index, assesses the homogeneity/uniformity of the bin sizes.
For example, if we consider that the boundaries of the interval with strong IV predictive
power in Table 3, a = 0.3 and b = 0.5, should produce the same quality score, then
c^*(a, b) = \frac{1}{5} \sqrt{\frac{2}{\log(5/3)}}. Table 4 shows the value of f(\nu, c^*) for various IV values \nu;
note the fast decay of f(\nu, c^*) when \nu > 0.5.
3.4 Implementation
The presented mathematical programming formulations are implemented using Google OR-Tools
[15] with the open-source MILP solver CBC [6], and Google's BOP and CP-SAT
solvers. In addition, the specialized formulation in Section 2.1.3 is implemented using the
commercial solver LocalSolver [1]. The Python library OptBinning has been developed
throughout this work to ease usability and reproducibility.
Much of the implementation effort focuses on the careful implementation of constraints
and the development of fast algorithms for preprocessing and generating the model data.
A key preprocessing algorithm is a pre-binning refinement developed to guarantee that no
bin has zero non-event or event records in the binary target case.
Categorical variables require special treatment: pre-bins are ordered in ascending order
with respect to a given metric, the event rate for a binary target and the target mean for a
continuous target. The original data is replaced by the ordered indexes and is then treated
as a numerical (ordinal) variable. Furthermore, during preprocessing, non-representative
categories may be binned into an “others” bin. Similarly, missing values and special values
are incorporated naturally as additional bins after the optimal binning terminates.
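A sketch of this categorical treatment for a binary target (the function name and the pandas-based encoding are ours, not the library's API):

```python
import pandas as pd

def encode_categorical(x, y):
    """Order categories by event rate and replace each category by its
    ordinal index, so x can be binned as a numerical (ordinal) variable."""
    event_rate = pd.Series(y).groupby(pd.Series(x)).mean()
    order = event_rate.sort_values().index
    mapping = {cat: i for i, cat in enumerate(order)}
    return pd.Series(x).map(mapping)
```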
4 Experiments
The experiments were run on an Intel(R) Core(TM) i5-3317 CPU at 1.70GHz, using a
single core, running Linux. Two binning examples are shown in Tables 5 and 6, using the
Fair Isaac (FICO) credit risk dataset [5] (N = 10459) and the Home Credit Default Risk
Kaggle competition dataset [8] (N = 307511), respectively.
The example in Table 5 uses the variable AverageMInFile (Average Months in File) as a
risk driver. The FICO dataset imposes monotonicity constraints on some variables; in particular,
for this variable, the event rate must be monotonically decreasing. Moreover, the dataset
includes three special values/codes defined as follows:
Bin Count Count (%) Non-event Event Event rate WoE IV JS
(−∞, 30.5) 544 0.052013 99 445 0.818015 -1.41513 0.087337 0.010089
[30.5, 48.5) 1060 0.101348 286 774 0.730189 -0.907752 0.076782 0.009281
[48.5, 54.5) 528 0.050483 184 344 0.651515 -0.537878 0.014101 0.001742
[54.5, 64.5) 1099 0.105077 450 649 0.590537 -0.278357 0.008041 0.001002
[64.5, 70.5) 791 0.075629 369 422 0.533502 -0.046381 0.000162 0.000020
[70.5, 74.5) 536 0.051248 262 274 0.511194 0.0430441 0.000095 0.000012
[74.5, 81.5) 912 0.087198 475 437 0.479167 0.171209 0.002559 0.000320
[81.5, 101.5) 2009 0.192083 1141 868 0.432056 0.361296 0.025000 0.003108
[101.5, 116.5) 848 0.081078 532 316 0.372642 0.608729 0.029532 0.003636
[116.5, ∞) 1084 0.103643 702 382 0.352399 0.696341 0.049039 0.006009
Special 558 0.053351 252 306 0.548387 -0.106328 0.000601 0.000075
Missing 490 0.046850 248 242 0.493878 0.112319 0.000592 0.000074
Table 5: Example optimal binning using variable AverageMInFile from FICO dataset.
non-representative categories, which are excluded from the optimization problem, hence the
monotonicity constraint does not apply. This optimal binning instance is solved in 0.25 seconds.
For categorical variables, most of the time is spent on pre-processing, 71% in this particular
case, whereas the optimization problem itself is generally solved faster.
Table 6: Example optimal binning using categorical variable ORGANIZATION TYPE from
Home Credit Default Risk Kaggle competition dataset.
n    monotonic trend   solver   variables   constraints   time    solution     gap
48   peak              cp       1225        3528          12.7    0.03757878   -
48   peak              ls       193         2352          1       0.03373904   10.2%
48   peak              ls       193         2352          5       0.03386574   9.9%
48   peak              ls       193         2352          10      0.03725560   0.9%
77   peak              cp       3081        9009          140.9   0.03776231   -
77   peak              ls       309         6007          1       0.03078212   18.5%
77   peak              ls       309         6007          5       0.03776231   0.0%
77   descending        cp       3003        6706          0.9     0.03386574   -
77   descending        ls       231         2969          1       0.03386574   0.0%

5 Conclusions
We propose a rigorous and flexible mathematical programming formulation to compute the
optimal binning. This is the first optimal binning algorithm to achieve solutions for nontrivial
constraints, supporting binary, continuous and multi-class targets, and handling several
monotonic trends rigorously. Importantly, the number of decision variables and constraints
used in the presented formulations is independent of the size of the datasets; it is entirely
controlled by the starting solution computed during the pre-binning process. In the
future, we plan to extend our methodology to piecewise-linear binning and multivariate
binning. Lastly, the code is available at https://github.com/guillermo-navas-palencia/
optbinning to ease reproducibility.
References
[1] T. Benoist, B. Estellon, F. Gardi, R. Megel, and K. Nouioua. LocalSolver 1.x: a black-box
local-search solver for 0-1 programming. 4OR-Q J Oper Res, 9(299), 2011.
[2] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees.
Wadsworth, 1984.
[3] U. M. Fayyad and K. B. Irani. Multi-Interval Discretization of Continuous-Valued
Attributes for Classification Learning. International Joint Conferences on Artificial
Intelligence, 13:1022–1027, 1993.
[4] T. Feydy, Z. Somogyi, and P. J. Stuckey. Half reification and flattening. In Principles
and Practice of Constraint Programming – CP 2011, pages 286–301, Berlin, Heidelberg,
2011. Springer Berlin Heidelberg.
[5] FICO, Google, Imperial College London, MIT, University of Oxford, UC Irvine and UC
Berkeley. Explainable Machine Learning Challenge. https://community.fico.com/
s/explainable-machine-learning-challenge, 2018.
[6] J. Forrest, T. Ralphs, S. Vigerske, et al. coin-or/cbc: Version 2.9.9, 2018.
[7] J. Herman. smbinning: Scoring Modeling and Optimal Binning, 2019.
[8] Home Credit Group. Kaggle competition: Home Credit Default Risk. https://www.
kaggle.com/c/home-credit-default-risk/overview, 2018.
[9] T. Hothorn, K. Hornik, and A. Zeileis. Unbiased recursive partitioning: A conditional
inference framework. Journal of Computational and Graphical Statistics, 15(3):651–674,
2006.
[10] H. Jeffreys. An invariant form for the prior probability in estimation problems. Pro-
ceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences,
186(1007):453–461, 1946.
[11] E. Kalvelagen. A difficult MIP construct: counting consecutive 1’s.
https://yetanothermathprogrammingconsultant.blogspot.com/2018/04/
a-difficult-mip-construct-counting.html, 2018.
[12] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T. Liu. LightGBM:
A Highly Efficient Gradient Boosting Decision Tree. In I. Guyon, U. V. Luxburg,
S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances
in Neural Information Processing Systems 30, pages 3146–3154. Curran Associates,
Inc., 2017.
[13] R. Kerber. ChiMerge: Discretization of Numeric Attributes. AAAI-92 Proceedings,
1992.
[14] S. Kullback and R. A. Leibler. On Information and Sufficiency. Ann. Math. Statist.,
22(1):79–86, 1951.
[17] I. Oliveira, M. Chari, and S. Haller. Rigorous Constrained Optimization Binning for
Credit Scoring. SAS Global Forum 2008 - Data Mining and Predictive Modelling, 2008.
[18] P. Bonami, A. Lodi, and G. Zarpellon. Learning a Classification of Mixed-Integer
Quadratic Programming Problems. In W.-J. van Hoeve, editor, Integration of Constraint
Programming, Artificial Intelligence, and Operations Research (CPAIOR 2018),
Lecture Notes in Computer Science, pages 595–604. Springer-Verlag, 2018.
[19] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,
P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau,
M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine Learning in Python.
Journal of Machine Learning Research, 12:2825–2830, 2011.
[20] N. Siddiqi. Credit Risk Scorecards: Developing And Implementing Intelligent Credit
Scoring. Wiley and SAS Business Series. Wiley, 2005.
[21] L. C. Thomas, D. B. Edelman, and J. N. Crook. Credit Scoring and Its Applications.
Society for Industrial and Applied Mathematics, 2002.
[22] W. Liu. Monotonic Optimal Binning (MOB) for Risk Scorecard Development, 2020.
[23] G. Zeng. Metric Divergence Measures and Information Value in Credit Scoring. Journal
of Mathematics, 2013:1–10, 2013.