Optimal binning: mathematical programming formulation

Guillermo Navas-Palencia
[email protected]

Abstract
The optimal binning is the optimal discretization of a variable into bins given a discrete
or continuous numeric target. We present a rigorous and extensible mathematical
programming formulation for solving the optimal binning problem for binary, continuous,
and multi-class targets, incorporating constraints not previously addressed.
For all three target types, we introduce a convex mixed-integer programming formulation.
Several algorithmic enhancements, such as the automatic determination of the most
suitable monotonic trend via a machine-learning-based classifier, and implementation
aspects are thoroughly discussed. The new mathematical programming formulations
are carefully implemented in the open-source Python library OptBinning.
1 Introduction
Binning (grouping or bucketing) is a technique to discretize the values of a continuous
variable into bins (groups or buckets). From a modeling perspective, the binning technique
may address prevalent data issues such as the handling of missing values, the presence
of outliers and statistical noise, and data scaling. Furthermore, the binning process is
a valuable interpretable tool to enhance the understanding of the nonlinear dependence
between a variable and a given target while reducing the model complexity. Ultimately,
resulting bins can be used to perform data transformations.
Binning techniques are extensively used in machine learning applications, exploratory
data analysis and as an algorithm to speed up learning tasks; recently, binning has been
applied to accelerate learning in gradient boosting decision tree [12]. In particular, binning
is widely used in credit risk modeling, being an essential tool for credit scorecard modeling
to maximize differentiation between high-risk and low-risk observations, and in expected
credit loss modeling.
There are several unsupervised and supervised binning techniques. Common unsuper-
vised techniques are equal-width and equal-size or equal-frequency interval binning. On the
other hand, well-known supervised techniques based on merging are Monotone Adjacent
Pooling Algorithm (MAPA), also known as Maximum Likelihood Monotone Coarse Classi-
fier (MLMCC) [21] and ChiMerge [13], whereas other techniques based on decision trees are
CART [2], Minimum Description Length Principle (MDLP) [3] and, more recently, conditional
inference trees (CTREE) [9].
The binning process might be required to satisfy certain constraints, ranging from a
minimum number of records per bin to monotonicity constraints. This variant of the binning
process is known as the optimal binning process. The optimal binning is generally solved
by iteratively merging an initial granular discretization until the imposed constraints are
satisfied. Performing this fine-tuning manually is likely to be unsatisfactory as the number
of constraints increases, leading to suboptimal or even infeasible solutions. However, we
note that this manual adjustment has been encouraged by some authors [20], legitimizing
the existing interplay of “art and science” in the binning process.
There are various commercial software tools for solving the optimal binning problem.
IBM SPSS and the MATLAB Financial Toolbox use MDLP and MAPA as their default
algorithm, respectively. The most advanced tool to solve the optimal binning problem
is available in the SAS Enterprise Miner software. A limited description of the proprietary
algorithm can be found in [17], where two mixed-integer programming (MIP) formulations
are sketched: a mixed-integer linear programming (MILP) formulation to obtain a fast,
possibly suboptimal solution, and a mixed-integer nonlinear programming (MINLP)
formulation to obtain an optimal solution. The suboptimal formulation is the default method
due to computational time limitations (MILP techniques are considerably more mature).
We note that the SAS implementation allows most of the constraints required in credit risk
modeling, making it an industry standard. Besides, there exist a few open-source solutions,
but the gap with the commercial options in terms of capabilities is still significant. Among
the available alternatives, we mention the MATLAB implementation of the monotone
optimal binning in [16], and the R specialized packages smbinning [7], relying on CTREE,
and MOB [22], which merely include basic functionalities.
In this paper, we develop a rigorous and extensible mathematical programming formulation
for solving the optimal binning problem. This general formulation can efficiently
handle binary, continuous, and multi-class target types. The presented formulations
incorporate the constraints generally required to produce a good binning [20], and new
constraints not previously addressed. For all three target types, we introduce a convex
mixed-integer programming formulation, ranging from an integer linear programming (ILP)
formulation for the simplest cases to a mixed-integer quadratic programming (MIQP)
formulation for those cases adding more involved constraints.
The remainder of the paper is organized as follows. Section 2 introduces our general
problem formulation and the corresponding mixed-integer programming formulation for each
supported target. We focus on the formulation for a binary target, investigating several
formulation variants. Then, in Section 3 we discuss in detail several algorithmic aspects
such as the automatic determination of the optimal monotonic trend and the development of
presolving algorithms to efficiently solve large size instances. Section 4 includes experiments
with real-world datasets and compares the performance of supported solvers for large size
instances. Finally, in Section 5, we present our conclusions and discuss possible research
directions.
be erased. Each column must contain exactly one 1.
\sum_{i=1}^{n} X_{ij} = 1, \quad j = 1, \ldots, n. \qquad (1)
• A solution has a last bin interval of the form [s_k, \infty), for k \leq n: the binary decision
variable X_{nn} = 1.
1                    1
0 1                  0 0
0 0 1                0 0 0
0 0 0 1              0 1 1 1
0 0 0 0 1            0 0 0 0 0
0 0 0 0 0 1          0 0 0 0 1 1
0 0 0 0 0 0 1        0 0 0 0 0 0 1

Figure 1: Lower triangular matrix X. Initial solution after pre-binning (left). Optimal
solution with 4 bins after merging pre-bins (right).
The described problem can be seen as a generalized assignment problem. A direct for-
mulation of metrics involving ratios such as the mean or most of the divergence measures on
merged bins leads to a non-convex MINLP formulation, due to the ratio of sums of binary
variables. Solving non-convex MINLP problems to optimality is a challenging task requiring
the use of global MINLP solvers, especially for large size instances.
Investigating the binary lower triangular matrix in Figure 1, it can be observed, by
analyzing the constraints in Equations (1) and (2) imposing continuity by rows, that a
feasible solution is entirely characterized by the position of the first 1 in each row. This
observation permits the pre-computation of the set of possible solutions by rows, obtaining
an aggregated matrix with the shape of X for each involved metric. Consequently, the non-
convex objective function and constraints are linearized, resulting in a convex formulation
by exploiting problem information. Using this reformulation, we shall see that the definition
of constraints for binary, continuous and multi-class targets is almost analogous.
p_i = \frac{r_i^{NE}}{r_T^{NE}}, \qquad q_i = \frac{r_i^{E}}{r_T^{E}},
where r_T^{NE} and r_T^{E} are the total number of non-event records and event records, respectively.
Next we define the Weight of Evidence (WoE) and event rate for each bin,

\mathrm{WoE}_i = \log\left(\frac{r_i^{NE}/r_T^{NE}}{r_i^{E}/r_T^{E}}\right), \qquad D_i = \frac{r_i^{E}}{r_i^{E} + r_i^{NE}}.
The Weight of Evidence WoEi and event rate Di for each bin are related by means of the
functional equations
\mathrm{WoE}_i = \log\left(\frac{1 - D_i}{D_i}\right) + \log\left(\frac{r_T^{E}}{r_T^{NE}}\right) = \log\left(\frac{r_T^{E}}{r_T^{NE}}\right) - \mathrm{logit}(D_i)

D_i = \left(1 + \frac{r_T^{NE}}{r_T^{E}} e^{\mathrm{WoE}_i}\right)^{-1} = \left(1 + e^{\mathrm{WoE}_i - \log(r_T^{E}/r_T^{NE})}\right)^{-1},

where D_i can be characterized as a logistic function of WoE_i, and WoE_i can be expressed
in terms of the logit function of D_i. This shows that WoE is inversely related to the event
rate. The constant term \log(r_T^{E}/r_T^{NE}) is the log ratio of the total number of events to the
total number of non-events.
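As a concrete illustration, the following sketch (not the OptBinning API; the counts are hypothetical) computes p_i, q_i, WoE_i and D_i from per-bin record counts and verifies the functional equation above:

```python
import numpy as np

# Hypothetical per-bin counts of non-event and event records.
r_ne = np.array([99, 286, 184])
r_e = np.array([445, 774, 344])

p = r_ne / r_ne.sum()   # p_i: distribution of non-event records
q = r_e / r_e.sum()     # q_i: distribution of event records
woe = np.log(p / q)     # WoE_i = log(p_i / q_i)
d = r_e / (r_e + r_ne)  # event rate D_i

# Check: WoE_i = log(r_T^E / r_T^NE) - logit(D_i).
logit_d = np.log(d / (1 - d))
assert np.allclose(woe, np.log(r_e.sum() / r_ne.sum()) - logit_d)
```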
Divergence measures serve to assess the discriminant power of a binning solution. The
Jeffreys’ divergence [10], also known as Information Value (IV) within the credit risk in-
dustry, is a symmetric measure expressible in terms of the Kullback-Leibler divergence
DKL (P ||Q) [14] defined by
J(P \,\|\, Q) = \mathrm{IV} = D_{KL}(P \,\|\, Q) + D_{KL}(Q \,\|\, P) = \sum_{i=1}^{n} (p_i - q_i) \log\left(\frac{p_i}{q_i}\right).
The IV statistic is unbounded, but some authors have proposed rules of thumb to set
quality thresholds [20]. Alternatively, the Jensen-Shannon divergence is a bounded symmetric
measure also expressible in terms of the Kullback-Leibler divergence,

\mathrm{JSD}(P \,\|\, Q) = \frac{1}{2}\left(D(P \,\|\, M) + D(Q \,\|\, M)\right), \qquad M = \frac{1}{2}(P + Q),

and bounded by \mathrm{JSD}(P \,\|\, Q) \in [0, \log(2)]. Note that these divergence measures cannot be
computed when r_i^{NE} = 0 and/or r_i^{E} = 0. Other divergence measures without this limitation
are described in [23].
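A minimal sketch of both divergence measures, assuming strictly positive p_i and q_i (the helper names are ours):

```python
import numpy as np

def kl(a, b):
    """Kullback-Leibler divergence D_KL(a || b), for a, b > 0."""
    return np.sum(a * np.log(a / b))

def information_value(p, q):
    """Jeffreys' divergence J(P || Q), i.e., the IV statistic."""
    return np.sum((p - q) * np.log(p / q))

def jensen_shannon(p, q):
    """Jensen-Shannon divergence, bounded by [0, log(2)]."""
    m = 0.5 * (p + q)
    return 0.5 * (kl(p, m) + kl(q, m))
```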
A good binning algorithm for a binary target should be characterized by the following
properties [20]:
Let us define the parameters of the mathematical programming formulation:
The objective function maximizes the discriminant power among bins, therefore we
maximize a divergence measure. The IV can be computed using the described parameters
and the decision variables X_{ij}, yielding
\mathrm{IV} = \sum_{i=1}^{n} \sum_{j=1}^{i} \left( \frac{r_j^{NE}}{r_T^{NE}} - \frac{r_j^{E}}{r_T^{E}} \right) X_{ij} \log\left( \frac{\sum_{z=1}^{i} (r_z^{NE}/r_T^{NE}) X_{iz}}{\sum_{z=1}^{i} (r_z^{E}/r_T^{E}) X_{iz}} \right).
The IV is the sum of the IV contributions per bin, i.e., the sum by rows. As previously
stated, given the constraints in Equations (1) and (2), an aggregated lower triangular matrix
V_{ij} \in \mathbb{R}_{+}, \forall (i, j) \in \{1, \ldots, n : i \geq j\}, with all possible IV values from bin merges can be
pre-computed as follows

V_{ij} = \sum_{z=j}^{i} \left( \frac{r_z^{NE}}{r_T^{NE}} - \frac{r_z^{E}}{r_T^{E}} \right) \log\left( \frac{\sum_{z=j}^{i} r_z^{NE}/r_T^{NE}}{\sum_{z=j}^{i} r_z^{E}/r_T^{E}} \right), \quad i = 1, \ldots, n; \; j = 1, \ldots, i. \qquad (3)
E
The optimal IV for each bin is determined by using the remarked observation that a solu-
tion is characterized by the position of the first 1 for each row, thus, using the continuity
constraint in (2), we obtain
i
X i−1
X
Vi· = Vi1 Xi1 + Vij (Xij − Xij−1 ) ⇐⇒ Vii Xii + (Vij − Vij+1 )Xij . (4)
j=2 j=1
for i = 1, \ldots, n. The latter expression is preferred to reduce the fill-in of the constraint
matrix. Similarly, a lower triangular matrix of event rates D_{ij} \in [0, 1], \forall (i, j) \in
\{1, \ldots, n : i \geq j\}, can be pre-computed as follows

D_{ij} = \frac{\sum_{z=j}^{i} r_z^{E}}{\sum_{z=j}^{i} r_z}, \quad i = 1, \ldots, n; \; j = 1, \ldots, i.
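The pre-computation of both aggregated matrices can be sketched directly from Equation (3) and the definition of D_{ij}; a straightforward version (the function name is ours) follows:

```python
import numpy as np

def aggregated_matrices(r_ne, r_e):
    """Lower triangular matrices V (IV of merging pre-bins j..i) and D
    (event rate of merging pre-bins j..i), per Equation (3)."""
    n = len(r_ne)
    t_ne, t_e = r_ne.sum(), r_e.sum()
    V = np.zeros((n, n))
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1):
            s_ne = r_ne[j:i + 1].sum()
            s_e = r_e[j:i + 1].sum()
            p, q = s_ne / t_ne, s_e / t_e
            V[i, j] = (p - q) * np.log(p / q)  # IV contribution of merged bin
            D[i, j] = s_e / (s_ne + s_e)       # event rate of merged bin
    return V, D
```

Note that this requires strictly positive non-event and event counts per pre-bin, which the pre-binning refinement in Section 3.4 guarantees.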
can be stated as follows:

\max_{X} \quad \sum_{i=1}^{n} \left( V_{ii} X_{ii} + \sum_{j=1}^{i-1} (V_{ij} - V_{i,j+1}) X_{ij} \right) \qquad (5a)

\text{s.t.} \quad \sum_{i=j}^{n} X_{ij} = 1, \quad j = 1, \ldots, n \qquad (5b)
Apart from the constraints (5b) and (5c) already described, constraint (5d) imposes a
lower and upper bound on the number of bins. Other range constraints (5e)-(5g) limit the
number of total, non-event, and event records per bin. Note that, to increase sparsity, the
range constraints are not implemented following the standard formulation, to avoid having
the data twice in the model. For example, constraint (5d) is replaced by

d + \sum_{i=1}^{n} X_{ii} - b_{\max} = 0, \quad 0 \leq d \leq b_{\max} - b_{\min}.
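As an illustrative sketch, the core of formulation (5) can be expressed with Google OR-Tools CP-SAT, one of the solvers later used in Section 3.4. CP-SAT requires integer coefficients, so V is scaled and rounded; the encoding below, including the simple two-sided version of (5d) and the assumed row-continuity form of constraint (2), is our own, not the library's internal model:

```python
from ortools.sat.python import cp_model

def build_model(V, b_min, b_max, scale=10**6):
    n = V.shape[0]
    model = cp_model.CpModel()
    X = {(i, j): model.NewBoolVar(f"x_{i}_{j}")
         for i in range(n) for j in range(i + 1)}

    # (5b): each pre-bin j belongs to exactly one merged bin.
    for j in range(n):
        model.Add(sum(X[i, j] for i in range(j, n)) == 1)

    # Continuity by rows (assumed form of (2)): within a row,
    # ones run contiguously up to the diagonal.
    for i in range(n):
        for j in range(1, i + 1):
            model.Add(X[i, j - 1] <= X[i, j])

    # (5d): bounds on the number of bins (ones on the diagonal).
    diag = sum(X[i, i] for i in range(n))
    model.Add(diag >= b_min)
    model.Add(diag <= b_max)

    # (5a): maximize total IV, using the telescoped coefficients from (4),
    # scaled to integers as required by CP-SAT.
    terms = [int(round(scale * V[i, i])) * X[i, i] for i in range(n)]
    for i in range(n):
        for j in range(i):
            terms.append(int(round(scale * (V[i, j] - V[i, j + 1]))) * X[i, j])
    model.Maximize(sum(terms))
    return model, X
```

Solving is then a call to cp_model.CpSolver().Solve(model).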
Monotonic trend: ascending and descending. The formulation for a monotonic
ascending trend can be stated as follows,

D_{zz} X_{zz} + \sum_{j=1}^{z-1} (D_{zj} - D_{z,j+1}) X_{zj} + \beta (X_{ii} + X_{zz} - 1)
\leq 1 + (D_{ii} - 1) X_{ii} + \sum_{j=1}^{i-1} (D_{ij} - D_{i,j+1}) X_{ij}, \quad i = 2, \ldots, n; \; z = 1, \ldots, i - 1.
The term 1 + (D_{ii} - 1) X_{ii}, or simply 1 - X_{ii}, is used to ensure that event rates are in
[0, 1] and that the ascending constraint is satisfied even if bin i is not selected. Note that this
is a big-M formulation, M + (D_{ii} - M) X_{ii} with M = 1, which suffices given D \in [0, 1];
however, a tighter (non-integer) M = \max(\{D_{ij} : i = 1, \ldots, n; \; i \geq j\}) can be used instead.
The parameter \beta is the minimum event rate difference between consecutive bins. The term
\beta (X_{ii} + X_{zz} - 1) is required to ensure that the difference between two selected bins i and z
is greater than or equal to \beta. Similarly, for the descending constraint,
D_{ii} X_{ii} + \sum_{j=1}^{i-1} (D_{ij} - D_{i,j+1}) X_{ij} + \beta (X_{ii} + X_{zz} - 1)
\leq 1 + (D_{zz} - 1) X_{zz} + \sum_{j=1}^{z-1} (D_{zj} - D_{z,j+1}) X_{zj}, \quad i = 2, \ldots, n; \; z = 1, \ldots, i - 1.
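Continuing the CP-SAT sketch above, the ascending constraints can be added by scaling the event rate coefficients to integers (again, an encoding of our own, not the paper's implementation):

```python
def add_ascending(model, X, D, beta, scale=10**6):
    """Add the ascending event rate constraints with big-M = 1 and
    minimum event rate difference beta, coefficients scaled for CP-SAT."""
    n = D.shape[0]
    s = lambda v: int(round(scale * v))
    for i in range(1, n):
        for z in range(i):
            lhs = s(D[z, z]) * X[z, z] + sum(
                s(D[z, j] - D[z, j + 1]) * X[z, j] for j in range(z))
            rhs = s(1.0) + s(D[i, i] - 1.0) * X[i, i] + sum(
                s(D[i, j] - D[i, j + 1]) * X[i, j] for j in range(i))
            model.Add(lhs + s(beta) * (X[z, z] + X[i, i] - 1) <= rhs)
    return model
```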
Monotonic trend: concave and convex. The concave and convex trends can be achieved
by taking the definition of concavity/convexity on equally spaced points: for consecutive
bins k < j < i, the event rate sequence is concave if -d_i + 2 d_j - d_k \geq 0 and convex if
d_i - 2 d_j + d_k \geq 0. Thus, replacing Equation (6) in this definition of concavity, we obtain
the concave trend constraints,
-\left( D_{ii} X_{ii} + \sum_{z=1}^{i-1} (D_{iz} - D_{i,z+1}) X_{iz} \right) + 2 \left( D_{jj} X_{jj} + \sum_{z=1}^{j-1} (D_{jz} - D_{j,z+1}) X_{jz} \right)
- \left( D_{kk} X_{kk} + \sum_{z=1}^{k-1} (D_{kz} - D_{k,z+1}) X_{kz} \right) \geq X_{ii} + X_{jj} + X_{kk} - 3,

for i = 3, \ldots, n; \; j = 2, \ldots, i - 1; \; k = 1, \ldots, j - 1.
Monotonic trend: peak and valley. The peak and valley trends define an event rate
function exhibiting a single trend change or reversal. The optimal trend change position
is determined by using disjoint constraints, which can be linearized using auxiliary binary
variables. The resulting additional constraints are as follows,
where t is the position of the optimal trend change bin, y_i are auxiliary binary variables,
and n in (7a) is the smallest big-M value for this formulation preserving the redundancy
of the constraints. Furthermore, for the peak trend we incorporate the following constraints,
y_i + y_z + 1 + (D_{zz} - 1) X_{zz} + \sum_{j=1}^{z-1} (D_{zj} - D_{z,j+1}) X_{zj}
\geq D_{ii} X_{ii} + \sum_{j=1}^{i-1} (D_{ij} - D_{i,j+1}) X_{ij}, \quad i = 2, \ldots, n; \; z = 1, \ldots, i - 1,

2 - y_i - y_z + 1 + (D_{ii} - 1) X_{ii} + \sum_{j=1}^{i-1} (D_{ij} - D_{i,j+1}) X_{ij}
\geq D_{zz} X_{zz} + \sum_{j=1}^{z-1} (D_{zj} - D_{z,j+1}) X_{zj}, \quad i = 2, \ldots, n; \; z = 1, \ldots, i - 1.
Note that none of these constraints are necessary if the position of the change bin t is
fixed in advance. For example, given t, the valley trend constraints are replaced by two sets
of constraints: one to guarantee a descending monotonic trend before t and another to
guarantee an ascending monotonic trend after t. Devising an effective heuristic to determine the
optimal t can yield solutions that are likely optimal while significantly reducing the problem size.
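For illustration only (the paper does not prescribe a specific heuristic), one simple choice is to take t as the extremum of a smoothed pre-bin event rate sequence:

```python
import numpy as np

def peak_change_point(event_rates, window=3):
    """Illustrative heuristic: candidate trend change bin t for a peak
    trend, the argmax of a moving-average-smoothed event rate sequence."""
    d = np.asarray(event_rates, dtype=float)
    kernel = np.ones(window) / window
    smoothed = np.convolve(d, kernel, mode="same")
    return int(np.argmax(smoothed))  # use argmin for a valley trend
```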
Taking w = (w_1, \ldots, w_n)^T, the standard deviation t can be incorporated into the formulation
with a different representation. Since \mathrm{std} = (w^T w / (m - 1))^{1/2}, then

\frac{\|w\|_2}{(m - 1)^{1/2}} \leq t \iff \|w\|_2^2 \leq (m - 1) t^2.
The non-convex MINLP formulation, using the parameter \gamma to control the importance of
the term t, is

\max_{X, \mu, w} \quad \sum_{i=1}^{n} \left( V_{ii} X_{ii} + \sum_{j=1}^{i-1} (V_{ij} - V_{i,j+1}) X_{ij} \right) - \gamma t \qquad (10a)
The MILP formulation is given by

\max_{X, p_{\min}, p_{\max}} \quad \sum_{i=1}^{n} \left( V_{ii} X_{ii} + \sum_{j=1}^{i-1} (V_{ij} - V_{i,j+1}) X_{ij} \right) - \gamma (p_{\max} - p_{\min}) \qquad (12a)
Maximum p-value constraint. A necessary constraint to guarantee that event rates
between consecutive bins are statistically different is a maximum p-value constraint, given
a significance level \alpha. Suitable statistical tests are the Z-test, Pearson's Chi-square
test, or Fisher's exact test. To perform these statistical tests we require aggregated
matrices of non-event and event records per bin,

R_{ij}^{NE} = \sum_{z=j}^{i} r_z^{NE}, \qquad R_{ij}^{E} = \sum_{z=j}^{i} r_z^{E}, \quad i = 1, \ldots, n; \; j = 1, \ldots, i.
The preprocessing procedure to detect pairs of pre-bins that do not satisfy the p-value
constraints using the Z-test is shown in Algorithm 1.
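Algorithm 1 is not reproduced here, but a sketch of the underlying two-proportion Z-test is shown below (our assumed form; OptBinning's exact procedure may differ):

```python
import numpy as np
from scipy import stats

def ztest_violation(e1, ne1, e2, ne2, alpha=0.05):
    """True if the event rates of two candidate bins, given their event (e)
    and non-event (ne) counts, are NOT statistically different at level alpha."""
    n1, n2 = e1 + ne1, e2 + ne2
    d1, d2 = e1 / n1, e2 / n2
    d_pooled = (e1 + e2) / (n1 + n2)
    z = (d1 - d2) / np.sqrt(d_pooled * (1 - d_pooled) * (1 / n1 + 1 / n2))
    p_value = 2 * stats.norm.sf(abs(z))  # two-sided p-value
    return p_value > alpha
```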
These constraints are added to the formulation by imposing that, at most, one of the
bins violating the maximum p-value constraint can be selected.
2.1.3 Mixed-integer programming reformulation for local and heuristic search
The number of binary decision variables X is n(n + 1)/2. For large n, the \mathcal{NP}-hardness of
the combinatorial optimization problem might limit the success of tree-search techniques. A
first approach to tackle this limitation is reformulating the problem to reduce the number
of decision variables. First, observe in Figure 1 that a solution is fully characterized by the
diagonal of X. Thus, given the diagonal, we can place the ones in the positions satisfying the
unique assignment (1) and continuity (2) constraints. On the other hand, to return indexed
elements in any aggregated matrix in the original formulation, we require the position of the
first one in each row. We note that this information can be retrieved by counting the number of
consecutive zeros between ones (selected bins) of the diagonal. To perform this operation we
use two auxiliary decision variables: an accumulator of preceding zeros a_i and the preceding
run-length of zeros z_i. A similar approach to counting consecutive ones is introduced in [11].
The described approach is illustrated in Figure 2.
x = (1, 0, 0, 1, 0, 1, 1)
a = (0, 1, 2, 0, 1, 0, 0)
z = (0, 0, 0, 2, 0, 1, 0)

Figure 2: Optimal solution X from Figure 1 (right) with its diagonal x, the accumulator of
preceding zeros a, and the preceding run-lengths of zeros z.
The positions in z are zero-based indexes of the reversed rows of the aggregated matrices.
For example, the aggregated lower triangular matrix R^E is now computed backward:
R_{ij}^{E} = \sum_{z=i}^{j} r_z^{E} for i = 1, \ldots, n; \; j = 1, \ldots, i. The same applies to the aggregated
matrices V, D, R and R^{NE}.
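A sketch of the counting scheme in Figure 2, reproducing the accumulator a_i and run-length z_i that the constraints (13c)-(13d) below enforce:

```python
def count_runs(x):
    """Accumulator a_i and preceding run-length of zeros z_i, per the
    recurrences a_i = (a_{i-1} + 1)(1 - x_i), z_i = a_{i-1}(1 - x_{i-1}) x_i."""
    n = len(x)
    a = [0] * n
    z = [0] * n
    for i in range(n):
        a_prev = a[i - 1] if i > 0 else 0
        x_prev = x[i - 1] if i > 0 else 0
        a[i] = (a_prev + 1) * (1 - x[i])
        z[i] = a_prev * (1 - x_prev) * x[i]
    return a, z

# Diagonal of Figure 2: reproduces a = [0, 1, 2, 0, 1, 0, 0]
# and z = [0, 0, 0, 2, 0, 1, 0].
a, z = count_runs([1, 0, 0, 1, 0, 1, 1])
```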
Let us define the parameters of the mathematical programming formulation:
and the decision variables:

x_i \in \{0, 1\}: binary indicator variable,
a_i \in \mathbb{N}_0: accumulator of preceding zeros,
z_i \in \mathbb{N}_0: preceding run-length of zeros.
The new formulation with 3n decision variables is stated as follows:

\max_{x} \quad \sum_{i=1}^{n} V_{[i, z_i]} x_i \qquad (13a)
\text{s.t.} \quad x_n = 1 \qquad (13b)
a_i = (a_{i-1} + 1)(1 - x_i), \quad i = 1, \ldots, n \qquad (13c)
z_i = a_{i-1} (1 - x_{i-1}) x_i, \quad i = 1, \ldots, n \qquad (13d)
b_{\min} \leq \sum_{i=1}^{n} x_i \leq b_{\max} \qquad (13e)
r_{\min} \leq R_{[i, z_i]} x_i \leq r_{\max}, \quad i = 1, \ldots, n \qquad (13f)
r_{\min}^{NE} \leq R_{[i, z_i]}^{NE} x_i \leq r_{\max}^{NE}, \quad i = 1, \ldots, n \qquad (13g)
r_{\min}^{E} \leq R_{[i, z_i]}^{E} x_i \leq r_{\max}^{E}, \quad i = 1, \ldots, n \qquad (13h)
x_i \in \{0, 1\}, \quad i = 1, \ldots, n \qquad (13i)
a_i \in \mathbb{N}_0, \quad i = 1, \ldots, n \qquad (13j)
z_i \in \mathbb{N}_0, \quad i = 1, \ldots, n \qquad (13k)
This MINLP formulation is particularly suitable for Local Search (LS) and heuristic
techniques, where the decision variable z_i can be used as an index, for example, V_{[i, z_i]}. The
nonlinear constraints (13c) and (13d) are needed to count consecutive zeros. After the
linearization of these constraints via big-M inequalities or indicator constraints [11], the
formulation is adequate for Constraint Programming (CP). Additional constraints such as
monotonicity constraints can be incorporated into (13b)-(13k) in a relatively simple manner:
Monotonic trend ascending

D_{[i, z_i]} x_i + 1 - x_i \geq D_{[j, z_j]} x_j + \beta (x_i + x_j - 1), \quad i = 2, \ldots, n; \; j = 1, \ldots, i - 1.

Monotonic trend descending

D_{[i, z_i]} x_i + \beta (x_i + x_j - 1) \leq 1 - x_j + D_{[j, z_j]} x_j, \quad i = 2, \ldots, n; \; j = 1, \ldots, i - 1.

Monotonic trend concave

-D_{[i, z_i]} x_i + 2 D_{[j, z_j]} x_j - D_{[k, z_k]} x_k \geq x_i + x_j + x_k - 3,

for i = 3, \ldots, n; \; j = 2, \ldots, i - 1; \; k = 1, \ldots, j - 1.

Monotonic trend convex

D_{[i, z_i]} x_i - 2 D_{[j, z_j]} x_j + D_{[k, z_k]} x_k \geq x_i + x_j + x_k - 3,

for i = 3, \ldots, n; \; j = 2, \ldots, i - 1; \; k = 1, \ldots, j - 1.

Monotonic trend peak: constraints (7a - 7c) and

y_i + y_j + 1 + (D_{[j, z_j]} - 1) x_j - D_{[i, z_i]} x_i \geq 0
2 - y_i - y_j + 1 + (D_{[i, z_i]} - 1) x_i - D_{[j, z_j]} x_j \geq 0,

for i = 2, \ldots, n; \; j = 1, \ldots, i - 1.

Monotonic trend valley: constraints (7a - 7c) and

y_i + y_j + 1 + (D_{[i, z_i]} - 1) x_i - D_{[j, z_j]} x_j \geq 0
2 - y_i - y_j + 1 + (D_{[j, z_j]} - 1) x_j - D_{[i, z_i]} x_i \geq 0,

for i = 2, \ldots, n; \; j = 1, \ldots, i - 1.
In Section 4, we compare the initial CP/MIP formulation to the presented LS formulation
for large size instances.
where \mu_i \in \mathbb{R} is the target mean for each pre-bin, U_{ij} \in \mathbb{R} is the aggregated matrix of
mean values, and s_i is the sum of target values for each pre-bin. A more robust approach
might replace the mean by an order statistic but, unfortunately, the aggregated matrix
computed from the pre-binning data would then require approximation methods. Replacing L_{ij}
in the objective function (5a) and changing the optimization sense to minimization, the
resulting formulation is given by
\min_{X} \quad \sum_{i=1}^{n} \left( L_{ii} X_{ii} + \sum_{j=1}^{i-1} (L_{ij} - L_{i,j+1}) X_{ij} \right) \qquad (17a)
The enforced constraint must be satisfied if the two literals X_{ii} and X_{zz} are true;
otherwise, the constraint is ignored. This is a half-reified linear constraint [4]. The parameter \beta
is the minimum mean difference between consecutive bins. Similarly, for the descending constraint,

X_{ii} = 1 \text{ and } X_{zz} = 1 \implies U_{ii} X_{ii} + \sum_{j=1}^{i-1} (U_{ij} - U_{i,j+1}) X_{ij} + \beta
\leq U_{zz} X_{zz} + \sum_{j=1}^{z-1} (U_{zj} - U_{z,j+1}) X_{zj}, \quad i = 2, \ldots, n; \; z = 1, \ldots, i - 1.
Furthermore, the concave and convex trends can be written in triple implication form
using the literals X_{ii}, X_{jj} and X_{kk}. The concave trend constraints are

X_{ii} = 1, X_{jj} = 1 \text{ and } X_{kk} = 1 \implies - \left( U_{ii} X_{ii} + \sum_{z=1}^{i-1} (U_{iz} - U_{i,z+1}) X_{iz} \right)
+ 2 \left( U_{jj} X_{jj} + \sum_{z=1}^{j-1} (U_{jz} - U_{j,z+1}) X_{jz} \right)
- \left( U_{kk} X_{kk} + \sum_{z=1}^{k-1} (U_{kz} - U_{k,z+1}) X_{kz} \right) \geq 0,

for i = 3, \ldots, n; \; j = 2, \ldots, i - 1 \text{ and } k = 1, \ldots, j - 1.
The formulation can be extended to support the valley and peak trends. The peak trend
requires constraints (7a - 7c) and

X_{ii} = 1 \text{ and } X_{zz} = 1 \implies M (y_i + y_z) + U_{zz} X_{zz} + \sum_{j=1}^{z-1} (U_{zj} - U_{z,j+1}) X_{zj}
\geq U_{ii} X_{ii} + \sum_{j=1}^{i-1} (U_{ij} - U_{i,j+1}) X_{ij},

M (2 - y_i - y_z) + U_{ii} X_{ii} + \sum_{j=1}^{i-1} (U_{ij} - U_{i,j+1}) X_{ij}
\geq U_{zz} X_{zz} + \sum_{j=1}^{z-1} (U_{zj} - U_{z,j+1}) X_{zj},

for i = 2, \ldots, n; \; z = 1, \ldots, i - 1.
Finally, additional constraints described in Section 2.1.2 can be naturally incorporated
with minor changes.
Note that for this formulation we need aggregated matrices V and D for each class c. It
is important to emphasize that the monotonicity constraints in Section 2.1.1 act as linking
constraints among classes; otherwise, n_C optimal binning problems with a binary target
could be solved separately. Again, additional constraints described in Section 2.1.2 can be
naturally incorporated with minor changes.
Label   Instances   %
A       84          0.20
D       200         0.48
P       76          0.18
V       57          0.14
Total   417
To perform experiments, the dataset is split into train and test subsets in a stratified
manner to treat the unbalanced data. The proportion of the test set is 30%. Three interpretable
multi-class classification algorithms are tested, namely, logistic regression, decision trees
(CART) and Support Vector Machine (SVM), using the Python library Scikit-learn [19].
All three algorithms are trained using the option class_weight="balanced". Throughout the
learning process, we discard 8 features and perform hyperparameter optimization for all
three algorithms. These experiments show that SVM and CART have similar classification
measures, and we choose CART (max depth 5) to ease implementation.
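A sketch of this setup with Scikit-learn (the feature matrix X and trend labels y in {A, D, P, V} are assumed given; the fixed random seed is our own choice for reproducibility):

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stratified 30% test split to treat the unbalanced classes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = DecisionTreeClassifier(max_depth=5, class_weight="balanced")
clf.fit(X_train, y_train)
```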
On the test set, the trained CART has a weighted average accuracy, precision and recall
of 88%. See classification measures and the confusion matrix in Table 2. We observe that
various instances of the minority class (V) are misclassified, indicating that more instances or
new features might be required to improve classification measures. Improving this classifier
is part of ongoing research.
Label          Precision   Recall   F1-score   Support
A              0.85        0.88     0.86       25
D              0.97        0.93     0.95       61
P              0.81        0.88     0.88       23
V              0.71        0.59     0.65       17
weighted avg   0.88        0.88     0.88       126

        A    D    P    V
A      22    0    1    2
D       0   57    2    2
P       1    0   22    0
V       3    2    2   10

Table 2: Classification measures (top) and confusion matrix (bottom; rows: actual, columns:
predicted) for CART on the test set.
• Predictive power: IV rule of thumb [20] in Table 3.
• Statistical significance: bin event rates must be statistically different, therefore large
p-values penalize the quality score.
• Homogeneity: a binning with homogeneous bin sizes, or uniform representativeness,
increases reliability.
IV            predictive power
[0, 0.02)     not useful
[0.02, 0.1)   weak
[0.1, 0.3)    medium
[0.3, 0.5)    strong
[0.5, ∞)      over-prediction

Table 3: IV rule of thumb for predictive power.
To account for all these aspects, we propose a rigorous binning quality score function.
Proposition 3.1 Given a binning with Information Value \nu, p-values between consecutive
bins p_i, i = 1, \ldots, n - 1, and normalized bin sizes s_i, i = 1, \ldots, n, the binning quality score
function is defined as

Q(\nu, p, s) = \frac{\nu}{c} \exp\left(-\nu^2 / (2 c^2) + 1/2\right) \prod_{i=1}^{n-1} (1 - p_i) \left( \frac{1 - \sum_{i=1}^{n} s_i^2}{1 - 1/n} \right), \qquad (21)

where Q(\nu, p, s) \in [0, 1] and c = \frac{1}{5} \sqrt{\frac{2}{\log(5/3)}} is the best a priori IV value in [0.3, 0.5).
Proof: Given the rule of thumb in Table 3, let us consider the set of statistical distributions
with positive skewness, a fat right tail, and support on the semi-infinite interval [0, ∞). The
function should penalize large values of the Information Value \nu, and a fast decay is expected
after a certain threshold indicating over-prediction. This fast decay is a required property that
must be accompanied by the following statement: \lim_{\nu \to 0} f(\nu) = \lim_{\nu \to \infty} f(\nu) = 0. Among
the available distributions satisfying the aforementioned properties, we select the Rayleigh
distribution, whose probability density function is given by

f(\nu; c) = \frac{\nu}{c^2} e^{-\nu^2 / (2 c^2)}, \quad \nu \geq 0.
This is a statistical distribution, not a score function, hence we need a scaling factor so that
\max_{\nu \in [0, \infty)} f(\nu; c) = 1: the unimodal probability distribution attains its maximum at the
mode c, thus

\gamma = f(c; c) = \frac{1}{c \sqrt{e}} \implies \frac{f(\nu; c)}{\gamma} = \frac{\nu}{c} \exp\left(-\nu^2 / (2 c^2) + 1/2\right).
The optimal c such that f(a) = f(b) for b > a can be obtained by solving f(b; c) - f(a; c) = 0
for c, which yields

c^* = \frac{\sqrt{b^2 - a^2}}{\sqrt{2 \log(b/a)}}.
The term \prod_{i=1}^{n-1} (1 - p_i) assesses the statistical significance of the bins. Furthermore, the
term (1 - \sum_{i=1}^{n} s_i^2) / (1 - 1/n) = 1 - HHI^*, where HHI^* is the normalized
Herfindahl-Hirschman Index, assesses the homogeneity/uniformity of the bin sizes.
For example, if we consider that the boundaries of the interval with strong IV predictive
power in Table 3, a = 0.3 and b = 0.5, should produce the same quality score, then
c^*(a, b) = \frac{1}{5} \sqrt{\frac{2}{\log(5/3)}}. Table 4 shows the value of f(\nu, c^*) for various IV values \nu;
note the fast decay of f(\nu, c^*) when \nu > 0.5.
3.4 Implementation
The presented mathematical programming formulations are implemented using Google OR-Tools
[15] with the open-source MILP solver CBC [6], and Google's BOP and CP-SAT
solvers. In addition, the specialized formulation in Section 2.1.3 is implemented using the
commercial solver LocalSolver [1]. The Python library OptBinning has been developed
throughout this work to ease usability and reproducibility.
Much of the implementation effort focuses on the careful implementation of constraints
and the development of fast algorithms for preprocessing and generating the model data.
A key preprocessing algorithm is a pre-binning refinement developed to guarantee that no
bin has zero non-event or event records in the binary target case.
Categorical variables require special treatment: pre-bins are ordered in ascending order
with respect to a given metric, the event rate for a binary target and the target mean for a
continuous target. The original data is replaced by the ordered indexes and is then treated
as a numerical (ordinal) variable. Furthermore, during preprocessing, non-representative
categories may be binned into an “others” bin. Similarly, missing values and special values
are incorporated naturally as additional bins after the optimal binning terminates.
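A sketch of this categorical treatment for a binary target (the function name and the pandas-based encoding are ours, not the library's API):

```python
import pandas as pd

def encode_categorical(x, y):
    """Order categories by event rate and replace each category by its
    ordinal index, so x can be binned as a numerical (ordinal) variable."""
    event_rate = pd.Series(y).groupby(pd.Series(x)).mean()
    order = event_rate.sort_values().index
    mapping = {cat: i for i, cat in enumerate(order)}
    return pd.Series(x).map(mapping)
```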
4 Experiments
The experiments were run on an Intel(R) Core(TM) i5-3317 CPU at 1.70GHz, using a
single core, running Linux. Two binning examples are shown in Tables 5 and 6, using the
Fair Isaac (FICO) credit risk dataset [5] (N = 10459) and the Home Credit Default Risk
Kaggle competition dataset [8] (N = 307511), respectively.
The example in Table 5 uses the variable AverageMInFile (Average Months in File) as a
risk driver. The FICO dataset imposes monotonicity constraints on some variables; in particular,
for this variable, the event rate must be monotonically decreasing. Moreover, the dataset
includes three special values/codes defined as follows:
Bin Count Count (%) Non-event Event Event rate WoE IV JS
(−∞, 30.5) 544 0.052013 99 445 0.818015 -1.41513 0.087337 0.010089
[30.5, 48.5) 1060 0.101348 286 774 0.730189 -0.907752 0.076782 0.009281
[48.5, 54.5) 528 0.050483 184 344 0.651515 -0.537878 0.014101 0.001742
[54.5, 64.5) 1099 0.105077 450 649 0.590537 -0.278357 0.008041 0.001002
[64.5, 70.5) 791 0.075629 369 422 0.533502 -0.046381 0.000162 0.000020
[70.5, 74.5) 536 0.051248 262 274 0.511194 0.0430441 0.000095 0.000012
[74.5, 81.5) 912 0.087198 475 437 0.479167 0.171209 0.002559 0.000320
[81.5, 101.5) 2009 0.192083 1141 868 0.432056 0.361296 0.025000 0.003108
[101.5, 116.5) 848 0.081078 532 316 0.372642 0.608729 0.029532 0.003636
[116.5, ∞) 1084 0.103643 702 382 0.352399 0.696341 0.049039 0.006009
Special 558 0.053351 252 306 0.548387 -0.106328 0.000601 0.000075
Missing 490 0.046850 248 242 0.493878 0.112319 0.000592 0.000074
Table 5: Example optimal binning using variable AverageMInFile from FICO dataset.
non-representative categories, which are excluded from the optimization problem, hence the
monotonicity constraint does not apply. This optimal binning instance is solved in 0.25 seconds.
For categorical variables, most of the time is spent on pre-processing, 71% in this particular
case, whereas the optimization problem itself is generally solved faster.
Table 6: Example optimal binning using categorical variable ORGANIZATION TYPE from
Home Credit Default Risk Kaggle competition dataset.
n    monotonic trend   solver   variables   constraints   time    solution     gap
48   peak              cp       1225        3528          12.7    0.03757878   -
48   peak              ls       193         2352          1       0.03373904   10.2%
48   peak              ls       193         2352          5       0.03386574   9.9%
48   peak              ls       193         2352          10      0.03725560   0.9%
77   peak              cp       3081        9009          140.9   0.03776231   -
77   peak              ls       309         6007          1       0.03078212   18.5%
77   peak              ls       309         6007          5       0.03776231   0.0%
77   descending        cp       3003        6706          0.9     0.03386574   -
77   descending        ls       231         2969          1       0.03386574   0.0%

5 Conclusions
We propose a rigorous and flexible mathematical programming formulation to compute the
optimal binning. This is the first optimal binning algorithm to achieve solutions for nontrivial
constraints, supporting binary, continuous and multi-class targets, and handling several
monotonic trends rigorously. Importantly, the number of decision variables and constraints
used in the presented formulations is independent of the size of the datasets; it is entirely
controlled by the starting solution computed during the pre-binning process. In the
future, we plan to extend our methodology to piecewise-linear binning and multivariate
binning. Lastly, the code is available at https://github.com/guillermo-navas-palencia/
optbinning to ease reproducibility.
References
[1] T. Benoist, B. Estellon, F. Gardi, R. Megel, and K. Nouioua. LocalSolver 1.x: a black-box
local-search solver for 0-1 programming. 4OR-Q J Oper Res, 9(299), 2011.
[2] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees.
Wadsworth, 1984.
[3] U. M. Fayyad and K. B. Irani. Multi-Interval Discretization of Continuous-Valued
Attributes for Classification Learning. International Joint Conferences on Artificial
Intelligence, 13:1022–1027, 1993.
[4] T. Feydy, Z. Somogyi, and P. J. Stuckey. Half reification and flattening. In Principles
and Practice of Constraint Programming – CP 2011, pages 286–301, Berlin, Heidelberg,
2011. Springer Berlin Heidelberg.
[5] FICO, Google, Imperial College London, MIT, University of Oxford, UC Irvine and UC
Berkeley. Explainable Machine Learning Challenge. https://community.fico.com/
s/explainable-machine-learning-challenge, 2018.
[6] J. Forrest, T. Ralphs, S. Vigerske, et al. coin-or/cbc: Version 2.9.9, 2018.
[7] J. Herman. smbinning: Scoring Modeling and Optimal Binning, 2019.
[8] Home Credit Group. Kaggle competition: Home Credit Default Risk. https://www.
kaggle.com/c/home-credit-default-risk/overview, 2018.
[9] T. Hothorn, K. Hornik, and A. Zeileis. Unbiased recursive partitioning: A conditional
inference framework. Journal of Computational and Graphical Statistics, 15(3):651–674,
2006.
[10] H. Jeffreys. An invariant form for the prior probability in estimation problems. Pro-
ceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences,
186(1007):453–461, 1946.
[11] E. Kalvelagen. A difficult MIP construct: counting consecutive 1’s.
https://yetanothermathprogrammingconsultant.blogspot.com/2018/04/
a-difficult-mip-construct-counting.html, 2018.
[12] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T. Liu. LightGBM:
A Highly Efficient Gradient Boosting Decision Tree. In I. Guyon, U. V. Luxburg,
S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances
in Neural Information Processing Systems 30, pages 3146–3154. Curran Associates,
Inc., 2017.
[13] R. Kerber. ChiMerge: Discretization of Numeric Attributes. AAAI-92 Proceedings,
1992.
[14] S. Kullback and R. A. Leibler. On Information and Sufficiency. Ann. Math. Statist.,
22(1):79–86, 1951.
[17] I. Oliveira, M. Chari, and S. Haller. Rigorous Constrained Optimization Binning for
Credit Scoring. SAS Global Forum 2008 - Data Mining and Predictive Modelling, 2008.
[18] P. Bonami, A. Lodi, and G. Zarpellon. Learning a Classification of Mixed-Integer
Quadratic Programming Problems. In W.-J. van Hoeve, editor, Integration of Constraint
Programming, Artificial Intelligence, and Operations Research (CPAIOR 2018),
Lecture Notes in Computer Science, pages 595–604. Springer-Verlag, 2018.
[19] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,
P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau,
M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine Learning in Python.
Journal of Machine Learning Research, 12:2825–2830, 2011.
[20] N. Siddiqi. Credit Risk Scorecards: Developing And Implementing Intelligent Credit
Scoring. Wiley and SAS Business Series. Wiley, 2005.
[21] L. C. Thomas, D. B. Edelman, and J. N. Crook. Credit Scoring and Its Applications.
Society for Industrial and Applied Mathematics, 2002.
[22] W. Liu. Monotonic Optimal Binning (MOB) for Risk Scorecard Development, 2020.
[23] G. Zeng. Metric Divergence Measures and Information Value in Credit Scoring. Journal
of Mathematics, 2013:1–10, 2013.