MOSEK Portfolio Optimization Cookbook
Release 1.2.1
MOSEK ApS
13 September 2022
Contents

1 Preface
  1.1 Purpose
  1.2 Content
  1.3 Code examples
5 Factor models
  5.1 Explicit factor models
  5.2 Implicit factor models
  5.3 Modeling considerations
  5.4 Example
6 Transaction costs
  6.1 Variable transaction costs
  6.2 Fixed transaction costs
  6.3 Market impact costs
  6.4 Cardinality constraints
  6.5 Buy-in threshold
  6.6 Example
  7.1 Active return
  7.2 Factor model on active returns
  7.3 Optimization
  7.4 Extensions
  7.5 Example
9 Risk budgeting
  9.1 Risk contribution
  9.2 Risk budgeting portfolio
  9.3 Risk budgeting with variance
  9.4 Example
10 Appendix
  10.1 Conic optimization refresher
  10.2 Mixed-integer models
  10.3 Quadratic cones and riskless solution
  10.4 Monte Carlo Optimization Selection (MCOS)
Bibliography
Chapter 1
Preface
1.1 Purpose
This book provides an introduction to the topic of portfolio optimization and discusses
several branches of practical interest from this broad subject.
We intended it to be a practical guide, a cookbook, that not only serves as a reference
but also supports the reader with practical implementation. We do not assume that the
reader is acquainted with portfolio optimization, thus the book can be used as a starting
point, while for the experienced reader it can serve as a review.
First we familiarize the reader with the basic concepts and the most relevant ap-
proaches in portfolio optimization, then we also present computational examples with
code to illustrate these concepts and to provide a basis for implementing more complex
and more specific cases. We aim to keep the discussion concise and self-contained, covering only the main ideas and tools and the most important pitfalls, from both theoretical and technical perspectives. The reader is directed towards further reading on each subject through references.
1.2 Content
Each of the chapters is organized around a specific subject:
• Sec. 3 summarizes concepts and pitfalls related to the preparation of raw data.
Here we discuss how to arrive at security returns data suitable for optimization,
starting from raw data that is commonly available on the internet. This chapter is
recommended as a second read.
• Each of the subsequent chapters focuses on a specific topic: mitigating estimation error, using factor models, modeling transaction costs, and finally optimizing relative to a benchmark. These chapters are independent and thus can be read in any order.
• The book assumes a basic familiarity with conic optimization. Sec. 10.1 can be used as a quick refresher on this topic. It shows how to convert traditional quadratic optimization (QO) and quadratically constrained quadratic optimization (QCQO) problems into equivalent conic quadratic optimization models, and what the advantages of this conversion are. Then it briefly introduces other types of cones as well.
1.3 Code examples

The MOSEK documentation is available from
https://www.mosek.com/documentation/

The code examples appearing in this cookbook as well as other supplementary material are available from
https://github.com/MOSEK/PortfolioOptimization
Chapter 2

Markowitz portfolio optimization
In this section we introduce the Markowitz model in portfolio optimization, and discuss
its different formulations and the most important input parameters.
$$\mu_x = \mathrm{E}(R_x) = x^T \mathrm{E}(R) = x^T \mu.$$
¹ The other component of risk, called systematic risk, can only be reduced by decreasing the expected return.
Here $\mu$ is the vector of expected returns, and $\Sigma$ is the covariance matrix of returns, summarizing the risks associated with the securities. Note that the covariance matrix is symmetric positive semidefinite by definition, and if we assume that none of the securities is redundant², then it is positive definite. This can also be seen if we consider that the portfolio variance must always be a positive number. Because of these input parameters, the problem is also referred to as mean–variance optimization (MVO). The choice of variance as the risk measure results in MVO being a quadratic optimization (QO) problem.³
Using these input parameters, the MVO problem seeks to select a portfolio of securities $x$ in such a way that it finds the optimal tradeoff between expected portfolio return and portfolio risk. In other words, it seeks to maximize the return while limiting the level of risk, or to minimize the risk while demanding at least a given level of return. Thus portfolio optimization can be seen as a bi-criteria optimization problem.
In this formulation it is possible to explicitly constrain the risk measure (the vari-
ance), which makes it more practical. We can also add constraints on multiple
types of risk measures more easily. However, the problem in this form is not a QO
because of the quadratic constraint. We can reformulate it as a conic problem; see
Sec. 2.3.
$$\begin{array}{ll}
\text{maximize} & \mu^T x - \frac{\delta}{2}\, x^T \Sigma x \\
\text{subject to} & \mathbf{1}^T x = 1.
\end{array}\qquad (2.3)$$

A variant of this formulation penalizes the portfolio standard deviation instead of the variance:

$$\begin{array}{ll}
\text{maximize} & \mu^T x - \tilde\delta \sqrt{x^T \Sigma x} \\
\text{subject to} & \mathbf{1}^T x = 1.
\end{array}\qquad (2.4)$$

This form is favorable because the standard deviation penalty term is of the same scale as the portfolio return.
Moreover, if we assume portfolio return to be normally distributed, then $\tilde\delta$ has a more tangible meaning; it is the z-score⁴ of portfolio return. Also, the objective function will be the $(1-\alpha)$-quantile of portfolio return, which is the opposite of the value-at-risk (VaR) at confidence level $\alpha$. The number $\alpha$ comes from $\Phi(-\tilde\delta) = 1 - \alpha$, where $\Phi$ is the cumulative distribution function of the normal distribution.
We can see that for $\tilde\delta = 0$ we maximize expected portfolio return. Then by increasing $\tilde\delta$ we put more and more weight on tail risk, i.e., we maximize a lower and lower quantile of portfolio return. This makes the selection of $\tilde\delta$ more intuitive in practice. Note that computing quantiles is more complicated for other distributions, because in general they are not determined by only the mean and standard deviation.
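As a small numerical illustration (the value of delta_tilde below is hypothetical), the confidence level $\alpha$ implied by a given $\tilde\delta$ can be computed as:

from scipy.stats import norm

# Illustrative sketch: alpha implied by delta-tilde through Phi(-delta) = 1 - alpha,
# assuming normally distributed portfolio returns.
delta_tilde = 1.645
alpha = 1 - norm.cdf(-delta_tilde)   # approx. 0.95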
These two problems will result in the same set of optimal solutions as the other two formulations. They also allow an easy computation of the entire efficient frontier. Maximization problem (2.3) is again a QO, but problem (2.4) is not because of the square root. This latter problem can be solved using a conic reformulation, as we will see in Sec. 2.3.
The optimal portfolio x computed by the Markowitz model is efficient in the sense that there is no other portfolio giving a strictly higher return for the same amount of risk or a strictly lower risk for the same amount of return. In other words, an efficient portfolio is Pareto optimal. The collection of such points forms the efficient frontier in the mean return–variance space.
The simple formulations introduced above are trivial to solve. Through the method of Lagrange multipliers we can get analytical formulas, which involve $\mu$ and $\Sigma^{-1}$ as parameters. Thus the solution simply requires inverting the covariance matrix (see e.g. [CJPT18]). This makes these formulations useful as demonstration examples, but they have only limited investment value.

⁴ The z-score is the distance from the mean measured in units of standard deviation.
We can also create a constraint that limits the total fraction of the $m$ largest investments to at most $p$. Based on Sec. 10.1.1 this is represented by the following set of constraints:

$$m t + \mathbf{1}^T u \le p, \qquad u + t\mathbf{1} \ge x, \qquad u \ge 0,$$

where $u$ and $t$ are new variables.
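As an illustration, a minimal Fusion sketch of this constraint set (assuming an existing model M with an N-dimensional variable x; the names m_largest and p are hypothetical parameters):

# Sketch: bound the total fraction of the m largest positions by p.
u = M.variable("u", N, Domain.greaterThan(0.0))
t = M.variable("t", 1, Domain.unbounded())
# u_i + t >= x_i for all i
M.constraint(Expr.sub(Expr.add(u, Expr.repeat(t, N, 0)), x), Domain.greaterThan(0.0))
# m*t + 1^T u <= p
M.constraint(Expr.add(Expr.mul(m_largest, t), Expr.sum(u)), Domain.lessThan(p))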
2.2.3 Leverage constraints
Componentwise short sale limit
The simplest form of leverage constraint is when we do not allow any short selling. This
is the long-only constraint stated as
x ≥ 0.
We can also allow short selling of security $i$ up to a limit $s_i$:
$$x_i \ge -s_i.$$
Alternatively, we can bound the total of all short positions by $S$: $\sum_i \max(-x_i, 0) \le S$. We can rewrite this as a set of linear constraints by modeling the maximum function based on Sec. 10.1.1 using an auxiliary vector $t^-$:
$$\mathbf{1}^T t^- \le S, \qquad t^- \ge -x, \qquad t^- \ge 0.$$
Collateralization requirement
An interesting variant of this constraint is the collateralization requirement, which limits
the total of short positions to a fraction of the total of long positions. We can model it
by introducing new variables x+ and x− for the long and short part of x, based on Sec.
10.1.1. Then the collateralization requirement will be:
$$\sum_i x_i^- \le c \sum_i x_i^+.$$
Leverage strategy
A leverage strategy is to do short selling, then use the proceeds to buy other securities.
We can express such a strategy in general using two constraints:
• $\mathbf{1}^T x = 1$,
• $\|x\|_1 \le c$,
where $c = 1$ yields the long-only constraint and $c = 1.6$ means the 130/30 strategy. The 1-norm constraint is nonlinear, but can be modeled as a linear constraint based on Sec. 10.1.1: $-z \le x \le z$, $\mathbf{1}^T z = c$.
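A minimal Fusion sketch of this linearization (assuming an existing model M with an N-dimensional variable x; c is a given constant):

# Sketch: model ||x||_1 <= c via -z <= x <= z, 1^T z = c.
z = M.variable("z", N, Domain.unbounded())
M.constraint(Expr.sub(z, x), Domain.greaterThan(0.0))   # z >= x
M.constraint(Expr.add(z, x), Domain.greaterThan(0.0))   # z >= -x
M.constraint(Expr.sum(z), Domain.equalsTo(c))           # 1^T z = c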
2.2.4 Turnover constraints
The turnover constraint limits the total change in the portfolio positions, which can help limit e.g. taxes and transaction costs.
Suppose $x^0$ is the initial holdings vector. Then we can write the turnover constraint as
$$\|x - x^0\|_1 \le c.$$
This nonlinear expression can be modeled as a linear constraint using Sec. 10.1.1.
$$\begin{array}{ll}
\text{maximize} & \mu^T x \\
\text{subject to} & x^T \Sigma x \le \gamma^2, \\
& x \in \mathcal{F}.
\end{array}\qquad (2.6)$$
If security return data (see in Sec. 3) is available, then a centered and normalized data matrix can also serve as the matrix $G$.
Using the decomposition we can write the portfolio variance as $x^T \Sigma x = x^T G G^T x = \|G^T x\|_2^2$. This leads to two different conic forms (see also in Sec. 10.1.1).
This way we can also model problem (2.4). We introduce the variable $s$ to represent the upper bound of the portfolio standard deviation, and model the constraint $\|G^T x\|_2 \le s$ using the quadratic cone as

$$\begin{array}{ll}
\text{maximize} & \mu^T x - \tilde\delta s \\
\text{subject to} & (s, G^T x) \in \mathcal{Q}^{k+1}, \\
& \mathbf{1}^T x = 1.
\end{array}\qquad (2.9)$$

While modeling problem (2.3) using the rotated quadratic cone might seem more natural, it can also be reasonable to model it using the quadratic cone for the benefits of the smaller scale and the intuitive interpretation of $\tilde\delta$. This is the same as the modeling of problem (2.4), which shows that conic optimization can also provide an efficient solution to problems that are inefficient to solve in their original form.
In general, transforming the optimization problem into the conic form has multiple
practical advantages. Solving the problem in this format will result in a more robust and
usually faster and more reliable solution process. Check Sec. 10.1.2 for details.
2.4 Example
We have seen how to transform a portfolio optimization problem into conic form. Now
we will present a detailed example in MOSEK Fusion. Assume that the input data
estimates 𝜇 and Σ are given. The latter is also positive definite. For methods and
examples on how to obtain these see Sec. 3.
Suppose we would like to create a long only portfolio of eight stocks. Assume that we
receive the following variables:
# Expected returns and covariance matrix
m = np.array(
[0.0720, 0.1552, 0.1754, 0.0898, 0.4290, 0.3929, 0.3217, 0.1838]
)
S = np.array([
[0.0946, 0.0374, 0.0349, 0.0348, 0.0542, 0.0368, 0.0321, 0.0327],
[0.0374, 0.0775, 0.0387, 0.0367, 0.0382, 0.0363, 0.0356, 0.0342],
[0.0349, 0.0387, 0.0624, 0.0336, 0.0395, 0.0369, 0.0338, 0.0243],
[0.0348, 0.0367, 0.0336, 0.0682, 0.0402, 0.0335, 0.0436, 0.0371],
[0.0542, 0.0382, 0.0395, 0.0402, 0.1724, 0.0789, 0.0700, 0.0501],
[0.0368, 0.0363, 0.0369, 0.0335, 0.0789, 0.0909, 0.0536, 0.0449],
[0.0321, 0.0356, 0.0338, 0.0436, 0.0700, 0.0536, 0.0965, 0.0442],
[0.0327, 0.0342, 0.0243, 0.0371, 0.0501, 0.0449, 0.0442, 0.0816]
])
The similar magnitude and positivity of all the covariances suggest that the selected securities are closely related. In Sec. 3.4.2 we will see that they are indeed from the same market and that the data was collected from a highly bullish time period, resulting in the large expected returns. Later in Sec. 5.4.1 we will analyze this covariance matrix more thoroughly.
$$\begin{array}{ll}
\text{maximize} & \mu^T x \\
\text{subject to} & x^T \Sigma x \le \gamma^2, \\
& \mathbf{1}^T x = 1, \\
& x \ge 0.
\end{array}\qquad (2.10)$$
Recall that by the Cholesky decomposition we have $\Sigma = GG^T$. Then we can model the quadratic term $x^T \Sigma x \le \gamma^2$ using the rotated quadratic cone as $(\gamma^2, \frac{1}{2}, G^T x) \in \mathcal{Q}_r^{N+2}$, and arrive at

$$\begin{array}{ll}
\text{maximize} & \mu^T x \\
\text{subject to} & \left(\gamma^2, \frac{1}{2}, G^T x\right) \in \mathcal{Q}_r^{N+2}, \\
& \mathbf{1}^T x = 1, \\
& x \ge 0.
\end{array}\qquad (2.11)$$
In the code, G will be an input parameter, so we compute it first. We use the Python
package numpy for this purpose, abbreviated here as np.
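The computation itself is not shown in the extract; a minimal version using the Cholesky factorization could be:

import numpy as np

# Cholesky factorization Sigma = G G^T; requires S to be positive definite.
G = np.linalg.cholesky(S)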
from mosek.fusion import *
import sys

with Model("markowitz") as M:
    # Settings
    M.setLogHandler(sys.stdout)
    # The variable x is the fraction of holdings in each security.
    # x must be positive, this imposes the no short-selling constraint.
    x = M.variable("x", N, Domain.greaterThan(0.0))
    # Budget constraint
    M.constraint('budget', Expr.sum(x), Domain.equalsTo(1))
    # Risk constraint: (gamma^2, 1/2, G^T x) in a rotated quadratic cone,
    # assuming the risk limit gamma2 = gamma^2 is given.
    M.constraint('risk', Expr.vstack(gamma2, 0.5, Expr.mul(G.T, x)),
                 Domain.inRotatedQCone())
    # Objective
    M.objective('obj', ObjectiveSense.Maximize, Expr.dot(m, x))
    # Solve optimization
    M.solve()
    returns = M.primalObjValue()
    portfolio = x.level()
$$\begin{array}{ll}
\text{maximize} & \mu^T x - \tilde\delta s \\
\text{subject to} & \mathbf{1}^T x = 1, \\
& x \ge 0, \\
& (s, G^T x) \in \mathcal{Q}^{N+1}.
\end{array}\qquad (2.13)$$
N = m.shape[0] # Number of securities
# Variables
# The variable x is the fraction of holdings in each security.
# x must be positive, this imposes the no short-selling constraint.
x = M.variable("x", N, Domain.greaterThan(0.0))
# Budget constraint
M.constraint('budget', Expr.sum(x), Domain.equalsTo(1))
Here we use a feature in Fusion that allows us to define model parameters. The line delta = M.parameter() in the code does this. Defining a parameter is ideal for the computation of the efficient frontier, because it allows us to reuse an already existing optimization model by only changing its parameter value.
To do so, we define the range of values for the risk aversion parameter $\delta$ to be 20 numbers between $10^{-1}$ and $10^{1.5}$:
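The defining code is not shown in the extract; one way to produce such a range:

# 20 logarithmically spaced values between 10^-1 and 10^1.5
deltas = np.logspace(start=-1, stop=1.5, num=20)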
Then we will solve the optimization model repeatedly in a loop, setting a new value
for the parameter 𝛿 in each iteration, without having to redefine the whole model each
time.
for d in deltas:
    # Update the parameter value
    delta.setValue(d)
    # Solve optimization
    M.solve()
    # Save results
    portfolio_return = m @ x.level()
    portfolio_risk = s.level()[0]
    row = pd.Series([d, M.primalObjValue(), portfolio_return,
                     portfolio_risk] + list(x.level()), index=columns)
    df_result = pd.concat([df_result, row.to_frame().T], ignore_index=True)
Then we can plot the results. See the risk-return plane on Fig. 2.1. We generate it
by the following code:
# Efficient frontier
df_result.plot(x="risk", y="return", style="-o",
xlabel="portfolio risk (std. dev.)",
ylabel="portfolio return", grid=True)
We can observe the portfolio composition for different risk-aversion levels on Fig. 2.2,
generated by
# Portfolio composition
my_cmap = LinearSegmentedColormap.from_list("non-extreme gray",
["#111111", "#eeeeee"], N=256, gamma=1.0)
df_result.set_index('risk').iloc[:, 3:].plot.area(colormap=my_cmap,
xlabel='portfolio risk (std. dev.)', ylabel="x")
Fig. 2.2: Portfolio composition x with varying level of risk-aversion $\delta$.
Chapter 3

Input data preparation
In Sec. 2.1.1 we mentioned that the expected return vector 𝜇 and the covariance matrix Σ
of the securities needs to be estimated. In this chapter we will discuss this topic through
a more general approach.
Portfolio optimization problems are inherently stochastic because they contain uncer-
tain forward-looking quantities, most often the return of securities 𝑅. In the case of the
basic mean–variance optimization problems (2.1)-(2.3) the simple form of the objective
and constraint functions make it possible to isolate uncertainty from the decision vari-
ables and wrap it into the parameters. This allows us to separate stochastic optimization
into parameter estimation and deterministic optimization steps.
In many cases, for instance those involving different risk measures, however, it is
not possible to estimate the optimization inputs as a separate step. These objectives
or constraints can be a non-trivial function of portfolio returns, and thus we have to
explicitly model the randomness. See examples in Sec. 8. Therefore more generally our
goal is to estimate the distribution of returns over the investment time period. The
simplest way to do that is by using the concept of scenario.
3.1 Scenarios
A scenario is a possible realization z𝑘 of the 𝑁 dimensional random vector 𝑍 of a quantity
corresponding to a given time period [MWOS15]. Thus we can also see a set of scenarios
as a discretization of the distribution of 𝑍.
We denote the number of scenarios by $T$. The probability of scenario $z_k$ occurring will be $p_k$, with $\sum_{k=1}^{T} p_k = 1$. In practice when the scenarios come from historical data or are obtained via computer simulation, all scenarios are assumed to have the same probability $1/T$.
We commonly work with scenarios of the security returns 𝑅. Scenario 𝑘 of security
returns will be r𝑘 . Assuming that 𝑝𝑘 = 1/𝑇 , we can arrange the scenarios as columns of
the 𝑁 × 𝑇 matrix R.
Models involving certain risk measures can most easily be solved as convex opti-
mization problems using scenarios (see Sec. 8). This approach is called sample average
approximation in stochastic optimization.
3.1.1 Scenario generation
Both the quality and the quantity of scenarios is important to get reliable optimal portfo-
lios. Scenarios must be representative and also there must be enough of them to accurately
model a random variable. Too few scenarios could lead to approximation errors, while
too many scenarios could result in excessive computational cost. Next we discuss some
common ways of generating scenarios.
The most straightforward choice is to use historical return observations directly as scenarios. This approach has some drawbacks:
• The underlying assumption that past realizations represent the future might not be realistic, if markets fundamentally change in the observed time period, or unexpected extreme events happen. Such changes can happen as often as every few years, often making the collection of enough representative historical data impossible.
• It can represent the tails of the distribution inaccurately because of the low number of samples corresponding to the low-probability tail events. Also, these extreme values can depend highly on the given historical sample, and can fluctuate a lot when the sample changes.
Later we will see an example of Monte Carlo scenario generation in Sec. 5.4.1.
3.1.4 Performance assessment
We advise to have two separate sets of scenario data.
• The first one is the in-sample data, which we use for running the optimization and
building the portfolio. We assume it known at the time of portfolio construction.
• The other data set is the out-of-sample data, which is for assessing the performance of the optimal portfolio. Out-of-sample data is assumed to become available after the portfolio has been built.
Linear return
Also called simple return, calculated as
$$R^{\mathrm{lin}} = \frac{P_{t_1}}{P_{t_0}} - 1.$$
This return definition is the one we used so far in this book, the return whose distribution
we would like to compute over the time period of investment.
We can aggregate linear returns across securities, meaning that the linear return of a portfolio is the weighted average of the linear returns of its components:

$$R_x^{\mathrm{lin}} = \sum_i x_i R_i^{\mathrm{lin}}. \qquad (3.1)$$

This is a property we need to be able to validly compute portfolio return and portfolio risk. However, it is not easy to scale linear return from one time period to another, for example to compute monthly return from daily return.
Logarithmic return
Also called continuously compounded return, calculated as

$$R^{\log} = \log\left(\frac{P_{t_1}}{P_{t_0}}\right).$$

We can aggregate logarithmic returns across time, meaning that the total logarithmic return over $K$ time periods is the sum of all $K$ single-period logarithmic returns:

$$R_K^{\log} = \log\left(P_{t_K}/P_{t_0}\right) = \sum_{k=1}^{K} \log\left(P_{t_k}/P_{t_{k-1}}\right).$$

This property makes it very easy to scale logarithmic return from one time period to another. However, we cannot average logarithmic returns the same way as linear returns, therefore they are unsuitable for the computation of portfolio return and portfolio risk.
We must not confuse them, especially on longer investment time periods [Meu10a]. Log-
arithmic returns are suitable for projecting from shorter to longer time periods, while
linear returns are suitable for computing portfolio level quantities.
1. We start with logarithmic return data over the time period of estimation (e.g. one day if the data has daily frequency).
2. We estimate the distribution of these logarithmic returns.
3. Using the property of aggregation across time we scale this distribution to the time period of investment (e.g. one year). We can scale the mean and covariance of logarithmic returns by the "square-root rule"¹. This means that if $h$ is the time period of investment and $\tau$ is the time period of estimation, then
$$\mu_h = \frac{h}{\tau}\,\mu_\tau, \qquad \Sigma_h = \frac{h}{\tau}\,\Sigma_\tau.$$
4. Finally, we convert it into a distribution of linear returns over the period of investment. Then we can use the property of cross-sectional aggregation to compute portfolio return and portfolio risk. We show an example working with the assumption of normal distribution for logarithmic returns in Sec. 3.4.
¹ Originally the square-root rule for stocks states that we can annualize the standard deviation $\sigma_\tau$ of $\tau$-day logarithmic returns by $\sigma_{\mathrm{ann}} = \sigma_\tau\sqrt{252/\tau}$, where $\sigma_{\mathrm{ann}}$ is the annualized standard deviation, and 252 is the number of business days in a year.
3.2.3 Data preparation in general
We can generalize the above procedure to other types of securities; see in detail in [Meu05].
The role of the logarithmic return in case of stocks is generalized by the concept of market
invariant. A market invariant is a quantity that determines the security price and is
repeating (approximately) identically across time. Thus we can model it as a sequence of
independent and identically distributed random variables. The steps will be the following:
1. Identify the market invariants. As we have seen, the invariant for the stock market
is the logarithmic return. However, for other markets it is not necessarily the return,
e. g. for the fixed-income market the invariant is the change in yield to maturity.
2. Estimate the joint distribution of the market invariant over the time period of estimation. Apart from historical data on the invariants, any additional information can be used. We can use parametric assumptions, approximate through moments, apply factor models, shrinkage estimators, etc., or simply use the empirical distribution, i.e., a set of scenarios.
3. Project the distribution of the invariants to the time period of investment.
4. Map the distribution of invariants into the distribution of security prices at the
investment horizon through a pricing function. Then compute the distribution of
linear returns from the distribution of prices. If we have a set of scenarios, we can
simply apply the pricing function on each of them.
3.3 Extensions
In this section we discuss some use cases which need a more general definition of the
MVO problem.
Here $P_h$ is the vector of final security prices. We can see that maximizing linear return implicitly assumes that we wish to maximize final wealth.
Based on equation (3.3) there are two components in the aggregation formula that
have to be well defined:
$$R_i^{\mathrm{lin}} = \frac{P_{h,i} - p_{0,i}}{p_{0,i}}. \qquad (3.4)$$
Linear return is not well defined if the price of the security 𝑝0,𝑖 is zero.
This can occur for example with swaps and futures.
$$x_i = \frac{v_i\, p_{0,i}}{w_0}. \qquad (3.5)$$
Portfolio weight is not well defined if the initial wealth 𝑤0 (the total
portfolio value at time 𝑡 = 0) is zero. This can occur for example in the
case of dollar neutral long-short portfolios.
• To extend the definition of linear return, we associate with each security a generalized normalizing quantity or basis $b_i$ that takes the role of the security price $p_{0,i}$ in the denominator of formula (3.4):
$$R_i^{\mathrm{lin}} = \frac{P_{h,i} - p_{0,i}}{b_i}. \qquad (3.6)$$
The basis depends on the nature of the security; for swaps it can be
the notional, for options it can be the strike or the underlying value,
etc. It can be any value that results in an intuitive return definition to
the portfolio manager. In most cases though it will be the price of the
security, giving back the usual linear return definition (3.4).
• Similarly, we extend the definition of the portfolio weight using a portfolio level basis $b_{\mathrm{P}}$:
$$x_i = \frac{v_i\, b_i}{b_{\mathrm{P}}}. \qquad (3.7)$$
In general, the basis quantities 𝑏𝑖 (𝑖 = 1, . . . , 𝑁 ) and 𝑏P have to satisfy the following
properties:
1. They are positive. This guarantees that for a long position, a positive net profit
(P&L) will correspond to a positive return.
2. They are measured in the same money units as the P&L. This ensures that the
return is dimensionless.
3. They are homogeneous, i.e. if $b$ is the basis for one unit of the security, then the basis for $n$ units of the security will be $nb$. This ensures that the return is independent of the position size.
4. They are known at the beginning of the investment period (at time 𝑡 = 0). This
ensures that the return will correspond to the full investment period.
By defining a basis value for each security and a basis value for the portfolio, com-
puting the expected portfolio return and portfolio variance will be possible through the
generalized aggregation formula:
$$R_x^{\mathrm{lin}} = \sum_i \frac{v_i b_i}{b_{\mathrm{P}}}\cdot\frac{P_{h,i} - p_{0,i}}{b_i} = \sum_i x_i R_i^{\mathrm{lin}}, \qquad (3.8)$$
We can observe that in formula (3.8), the security level basis values 𝑏𝑖 cancel out. This
means that the value of the security level basis does not affect the portfolio return. It
only matters in the context of interpreting the security weight and security return.
• $b_{\mathrm{P}} = \sum_i v_i p_{0,i} = w_0$. In this case we normalize the P&L with the total portfolio value. For a long-only portfolio this is the same as the initial capital.
• $b_{\mathrm{P}} = \sum_i |v_i|\, p_{0,i}$, the gross exposure. This provides an intuitive fraction interpretation to $x_i$ for long-short portfolios as well.
• $b_{\mathrm{P}} = 1$. In this case formula (3.8) will become the total net profit (P&L) in dollars, and the portfolio weights will also be measured in dollars.
In this book, we will use $b_{\mathrm{P}} = \sum_i v_i b_i$ by default. Any other choice will be explicitly stated.
3.3.3 Long-short portfolios
When working with a long-short portfolio, we also have to extend the MVO problem
slightly. If the investor can short sell and also use leverage (e. g. margin loan), then
the total value of the investment (the gross exposure) can be greater than the amount
of initial capital. In case of leverage, the gross exposure provides better insight into the
risks taken than the total capital, so being fully invested is no longer meaningful as a sole
constraint.
Therefore in the optimization problem formulation of typical long-short portfolios, we replace the budget constraint (2.5) with a gross exposure limit, i.e., a leverage constraint:
$$\sum_i |v_i|\, p_{0,i} \le L \cdot C_{\mathrm{init}},$$
where $C_{\mathrm{init}}$ is the initial invested capital, and $L$ is the leverage limit. If we normalize here with $b_{\mathrm{P}} = C_{\mathrm{init}}$, then we get
$$\|x\|_1 \le L.$$
For example, Regulation-T states that for long positions the margin requirement is 50%
of the position value, and for short positions the collateral requirement is 150% of the
position value (of which 100% comes from short sale proceeds). This translates into
𝐿 = 2. A special case of 𝐿 = 1 would mean that we can short sell but have no leverage.
A more general version of the leverage constraint is
$$\sum_i m_i |x_i| \le 1,$$
where $m_i$ is the margin requirement of position $i$. In case of Reg-T we had $m_i = \frac{1}{2}$ for all $i$.
We might also impose an enhanced active equity structure on the portfolio, like 120/20
or 130/30. This is typically considered when it is possible to use the short sale proceeds
to purchase more securities (another form of leverage), so there is no need to use margin
loans. Such a portfolio has 100% market exposure, which is expressed by
∑︁
𝑣𝑖 𝑝0,𝑖 = 𝐶init ,
𝑖
or writing the same using x after normalizing again with 𝑏P = 𝐶init , we get the ususal
budget constraint
∑︁
𝑥𝑖 = 1.
𝑖
For example, 130/30 type portfolio would have the constraints ‖x‖1 ≤ 1.6 and 1T x = 1.
We can also use factor neutrality constraints on a long-short portfolio. We can achieve
this simply by adding a linear equality constraint expressing that the portfolio should be
orthogonal to a vector 𝛽 of factor exposures:
$$\beta^T x = 0.$$
A special case of this is when the factor exposures are all ones; then the portfolio will be
dollar neutral :
1T x = 0.
We have to note that in the case of dollar neutrality, the total portfolio value is 0.
Referring back to the discussion about portfolio weights in Sec. 3.3.1, in this case the
portfolio weights cannot be defined as (3.5). Instead, we need to define a 𝑏P that is
different from 𝑤0 . See also in Sec. 3.3.2.
Note also that if we model using the quadratic cone instead of the rotated quadratic
cone and x = 0 is a feasible solution, then there will be no optimal portfolios which
are not fully invested. The solutions will be either x = 0 or some risky portfolio with
‖x‖1 = 1. See a detailed discussion about this in Sec. 10.3.
3.4 Example
In this example we will show a case with real data, that demonstrates the steps in Sec.
3.2.3 and leads to the expected return vector and the covariance matrix we have seen in
Sec. 2.4.
Fig. 3.1: Daily prices of the 8 stocks in the example portfolio.
Market invariants
For stocks, both linear and logarithmic returns are market invariants; however, it is easier to project logarithmic returns to longer time periods. Logarithmic returns also have an approximately symmetrical distribution, which makes them easier to model. Therefore we resample each price time series to weekly frequency, and obtain a series of approximately 250 non-overlapping observations:
df_weekly_prices = df_prices.resample('W').last()
Next we compute weekly logarithmic returns from the weekly prices, followed by basic
handling of missing values:
df_weekly_log_returns = np.log(df_weekly_prices) - np.log(df_weekly_prices.shift(1))
df_weekly_log_returns = df_weekly_log_returns.dropna(how='all')
df_weekly_log_returns = df_weekly_log_returns.fillna(0)
Distribution of invariants
To estimate the distribution of market invariants, in this example we choose a parametric approach and fit the weekly logarithmic returns to a multivariate normal distribution. This requires the estimation of the distribution parameters $\mu_\tau^{\log}$ and $\Sigma_\tau^{\log}$. For simplicity we use the sample mean and sample covariance:
return_array = df_weekly_log_returns.to_numpy()
m_weekly_log = np.mean(return_array, axis=0)
S_weekly_log = np.cov(return_array.transpose())
Projection of invariants
We project the distribution of the weekly logarithmic returns represented by 𝜇log𝜏 and
Σ𝜏 to the one year investment horizon. Because the logarithmic returns are additive
log
across time, the projected distribution will also be normal with parameters 𝜇log ℎ log
ℎ = 𝜏 𝜇𝜏
and Σlog
ℎ = 𝜏 Σ𝜏 .
ℎ log
m_log = 52 * m_weekly_log
S_log = 52 * S_weekly_log
Since the logarithmic returns are normally distributed, the security prices at the investment horizon are lognormally distributed, with moments

$$\mathrm{E}(P_h) = p_0 \circ \exp\left(\mu_h^{\log} + \tfrac{1}{2}\,\mathrm{diag}(\Sigma_h^{\log})\right), \qquad \mathrm{Cov}(P_h) = \mathrm{E}(P_h)\mathrm{E}(P_h)^T \circ \left(\exp(\Sigma_h^{\log}) - 1\right). \qquad (3.9)$$

In code, this will look like
In code, this will look like
p_0 = df_weekly_prices.iloc[0].to_numpy()
m_P = p_0 * np.exp(m_log + 1/2*np.diag(S_log))
S_P = np.outer(m_P, m_P) * (np.exp(S_log) - 1)
Then the estimated moments of the linear returns are easy to get by $\mu = \frac{1}{p_0} \circ \mathrm{E}(P_h) - 1$ and $\Sigma = \frac{1}{p_0 p_0^T} \circ \mathrm{Cov}(P_h)$, where $\circ$ denotes the elementwise product, and division by a vector or a matrix is also done elementwise.
m = 1 / p_0 * m_P - 1
S = 1 / np.outer(p_0, p_0) * S_P
Notice that we could have computed the distribution of linear returns from the distribution of logarithmic returns directly in this case, using the simple relationship (3.2). However, in general for different securities, especially for derivatives, we derive the distribution of linear returns from the distribution of prices. This case study is designed to demonstrate the general procedure. See details in [Meu05], [Meu11].
Also, in particular for stocks we could have started with linear returns as market invariants, and modeled their distribution. Projecting it to the investment horizon, however, would have been much more complicated. There exist no scaling formulas for the linear return distribution or its moments as simple as the ones for logarithmic returns.
Chapter 4

Dealing with estimation error
4.1.1 Black-Litterman model
One way to deal with estimation error is to combine noisy data with prior informa-
tion. We can base prior beliefs on recent regulatory, political and socioeconomic events,
macroeconomy news, financial statements, analyst ratings, asset pricing theories, insights
about market dynamics, econometric forecasting methods, etc. [AZ10]
The Black-Litterman model considers the market equilibrium returns implied by CAPM as a prior estimate, then allows the investor to update this information with their own views on security returns. This way we do not need noisy historical data for estimation [Bra10], [CJPT18].
Thus we assume the distribution of expected return 𝜇 to be 𝒩 (𝜇eq , Q), where 𝜇eq
is the equilibrium return vector, and the covariance matrix Q represents the investor’s
confidence in the equilibrium return vector as a realistic estimate.
The investor’s views are then modeled by the equation B𝜇 = q+𝜀, where 𝜀 ∼ 𝒩 (0, V).
Each row of this equation represents a view in the form of a linear equation, allowing
for both absolute and relative statements on one or more securities. The views are also
uncertain, we can use the diagonal of V to set the confidence in them, small variance
meaning strong confidence in a view.
By combining the prior with the views using Bayes' theorem we arrive at the estimate based on the combined model
$$\begin{bmatrix}\pi\\ q\end{bmatrix} = \begin{bmatrix}I\\ B\end{bmatrix}\mu + \begin{bmatrix}\varepsilon_\pi\\ \varepsilon_q\end{bmatrix},$$
where $\pi$ is the prior, $\varepsilon_\pi \sim \mathcal{N}(0, Q)$, $\varepsilon_q \sim \mathcal{N}(0, V)$, and $Q$ and $V$ are nonsingular. The solution will be
$$\mu_{\mathrm{BL}} = \left(Q^{-1} + B^T V^{-1} B\right)^{-1}\left(Q^{-1}\pi + B^T V^{-1} q\right).$$
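As an illustration, a small numpy sketch of this combined estimate (the inputs pi, Q, B, V, q are assumed to be given arrays of matching shapes; a sketch, not the only way to compute it):

import numpy as np

# Sketch: mu_BL = (Q^-1 + B^T V^-1 B)^-1 (Q^-1 pi + B^T V^-1 q)
iQ = np.linalg.inv(Q)
iV = np.linalg.inv(V)
mu_BL = np.linalg.solve(iQ + B.T @ iV @ B, iQ @ pi + B.T @ iV @ q)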
4.1.2 Shrinkage estimation
Sample estimators are unbiased, but perform well only in the limit case of an infinite
number of observations. For small samples their variance is large. In contrast, constant
estimators have no variance, but display large bias. Shrinkage estimators improve the
sample estimator by adjusting (shrinking) the sample estimator towards a constant esti-
mator, the shrinkage target, basically finding the optimal tradeoff between variance and
bias.
Shrinkage can also be thought of as regularization or a form of empirical Bayes esti-
mation. Also the shrinkage target has to be consistent with other prior information and
optimization constraints.
1. James–Stein estimator:
This is the most widely known shrinkage estimator for the estimation of the mean vector of a normal random variable with known covariance [EM76], [LC98], [Ric99]. The sample mean $\mu_{\mathrm{sam}}$ is normally distributed with mean $\mu$ and covariance $\Sigma/T$, where $\Sigma$ is the covariance matrix of the data. Assuming $\mathbf{b}$ as target, the James–Stein estimator will be
$$\mu_{\mathrm{JS}} = \mathbf{b} + \max(1 - \alpha,\, 0)\,(\mu_{\mathrm{sam}} - \mathbf{b}), \qquad \alpha = \frac{\max(\tilde N - 3,\, 0)}{T\,(\mu_{\mathrm{sam}} - \mathbf{b})^T \Sigma^{-1} (\mu_{\mathrm{sam}} - \mathbf{b})},$$
where $\tilde N$ is the effective dimension of $\Sigma$ (cf. the implementation in Sec. 4.5).
4.2 Estimation of the covariance matrix
In contrast to the estimation of the mean, the standard error of the sample variance
depends only on the sample size, making it possible to estimate from higher frequency
data. Still, if the number of securities is high relative to the number of samples, the
optimal solution can suffer from two issues [MM08]:
• Ill conditioning: We need to have sufficient amount of data to get a well-conditioned
covariance estimate. A rule of thumb is that the number of observations 𝑇 should
be an order of magnitude greater than the number of securities 𝑁 .
• Estimation error: The estimation error of the covariance matrix increases quadrat-
ically (in contrast to linear increase in the case of the mean), so if 𝑁/𝑇 is large it
can quickly become the dominant contributing factor to loss of accuracy.
When $N/T$ is too large, the following regularization methods can offer an improvement over the sample covariance matrix [CM13]. Moreover, we can also apply factor models in this context; we discuss this in more detail in Sec. 5.
Ledoit–Wolf shrinkage combines the sample covariance matrix with a target $B$ as $\Sigma_{\mathrm{shrunk}} = (1-\alpha)\Sigma_{\mathrm{sam}} + \alpha B$. The first target is a multiple of the identity,
$$B_I = \bar{s}^2 I, \qquad (4.5)$$
where $\bar{s}^2$ is the average sample variance. This target is optimal among all targets of the form $cI$ with $c \in \mathbb{R}$. The corresponding optimal $\alpha$ is
$$\alpha = \frac{\frac{1}{T^2}\sum_{k=1}^{T} \mathrm{Tr}\left(\left[\mathbf{z}_k \mathbf{z}_k^T - \Sigma_{\mathrm{sam}}\right]^2\right)}{\mathrm{Tr}\left(\left[\Sigma_{\mathrm{sam}} - B_I\right]^2\right)},$$
where $\mathbf{z}_k$ is column $k$ of the centered data matrix $Z$. See the details in [LW04].
The sample covariance matrix $\Sigma_{\mathrm{sam}}$ can be ill-conditioned, because estimation tends to scatter the sample eigenvalues further away from the mean $\bar\sigma^2$ of the true unknown eigenvalues. This raises the condition number of $\Sigma_{\mathrm{sam}}$ relative to the true matrix $\Sigma$, and this effect is stronger the better conditioned $\Sigma$ is, or the larger $N/T$ grows.
The above shrinkage estimator basically pulls all sample eigenvalues back towards their grand mean $\bar{s}^2$, the estimate of $\bar\sigma^2$, thus stabilizing the matrix. We can also see this as a form of regularization.
The second target is the covariance matrix implied by a single factor (market) model,
$$B_F = s_{\mathrm{mkt}}^2\,\beta\beta^T + \Sigma_\varepsilon,$$
where $s_{\mathrm{mkt}}^2$ is the sample variance of market returns, $\beta$ is the vector of factor exposures, and $\Sigma_\varepsilon$ is the diagonal matrix containing the variances of the residuals. The single factor model is discussed in Sec. 5. In practice a proxy of the market factor could be the equally-weighted or the market capitalization-weighted portfolio of the constituents of a market index, resulting in two different shrinkage targets. The optimal shrinkage intensity is derived in [LW03b].
where ∑︀𝑠𝑖,𝑗 ∑︀
is the sample covariance between security 𝑖 and 𝑗 and 𝑟¯ =
𝑗=𝑖+1 𝑟𝑖,𝑗 is the average sample correlation. The optimal shrinkage
2 𝑛−1 𝑛
𝑛(𝑛−1) 𝑖=1
intensity is derived in [LW03a].
Notice that all three targets are positive definite matrices. Shrinkage towards a positive definite target guarantees that the resulting estimate is also positive definite, even when the sample covariance matrix itself is singular. A further advantage of this estimator is that it works also in the case of singular covariance matrices, when $N > T$.
In [LW20], the authors present a nonlinear version of this estimator, which can apply
different shrinkage intensity to each sample eigenvalue, resulting in even better perfor-
mance.
1. Compute the correlation matrix from the covariance matrix: $C = D^{-1/2}\Sigma D^{-1/2}$, where $D = \mathrm{Diag}(\Sigma)$.
2. Compute the eigenvalues of $C$ and their empirical distribution.
3. Fit the Marčenko–Pastur distribution to the empirical distribution, which gives the
theoretical bounds 𝜆+ and 𝜆− on the eigenvalues associated with noise. This way
we separate the spectrum into a random matrix (noise) component and the signal
component. The eigenvalues of the latter will be above the theoretical upper bound
𝜆+ .
4. Change the eigenvalues below 𝜆+ to their average, arriving at the de-noised corre-
lation matrix C̃.
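A minimal numpy sketch of these de-noising steps (assuming the Marchenko–Pastur bound lambda_plus has been computed beforehand; variable names are illustrative):

import numpy as np

# Sketch: eigenvalue de-noising of a covariance matrix S.
d = np.sqrt(np.diag(S))
C = S / np.outer(d, d)                   # correlation matrix
w, V = np.linalg.eigh(C)                 # eigenvalues in ascending order
noise = w < lambda_plus                  # eigenvalues attributed to noise
w[noise] = w[noise].mean()               # flatten the noise spectrum
C_dn = V @ np.diag(w) @ V.T              # de-noised matrix
c = np.sqrt(np.diag(C_dn))
C_dn = C_dn / np.outer(c, c)             # rescale to unit diagonal
S_dn = C_dn * np.outer(d, d)             # back to covariance scale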
• One-step methods: We perform the robust estimation and the portfolio optimization
in a single step. We substitute the portfolio risk and portfolio return expressions by
their robust counterparts. See for example in [VD09]. We discuss some examples
in section Sec. 8.
• Two-step methods: First we compute the robust estimates of the mean vector and
the covariance matrix of the data. This is followed by solving the usual portfolio
optimization problem with the parameters replaced by their robust estimates. There
are various robust covariance estimators, such as Minimum Covariance Determinant
(MCD), 2D Winsorization, S-estimators. See for example in [[RSTF20], [WZ07]].
We do not cover this topic in detail in this book.
$$\begin{array}{ll}
\text{maximize} & \mu^T x - \frac{\delta}{2}\, x^T \Sigma x \\
\text{subject to} & \mathbf{1}^T x = 1, \\
& \|x\|_p^p \le c.
\end{array}\qquad (4.8)$$
If 𝑝 = 1 and 𝑐 = 1 we get back the constraint x ≥ 0, which prevents short-selling.
In [JM03] the authors found that preventing short-selling or limiting position size is
equivalent to shrinking all covariances of securities for which the respective constraint
is binding. The short-selling constraint also implies shrinkage effect on the mean vector
toward zero. In a conic optimization problem, the 1-norm constraint can be modeled
based on Sec. 10.1.1.
For the case 𝑝 = 1 and 𝑐 > 1 with the constraint 1T x = 1 also present, we get a
“short-sale budget”, meaning that the total weight of short positions will be bounded by
(𝑐 − 1)/2. This allows the possibility of optimizing e. g. 130/30 type portfolios with
𝑐 = 1.6.
If 𝑝 = 2 and the only constraint is 1T x = 1, then the minimum feasible value of 𝑐
is 1/𝑁 , giving the x = 1/𝑁 portfolio as the only possible solution. With larger values
of 𝑐 we get an effect equivalent to the Ledoit–Wolf shrinkage towards multiple of the
identity. For other shrinkage targets B, the corresponding constraint will be xT Bx ≤ 𝑐
or ‖GT x‖22 ≤ 𝑐 where GGT = B. In a conic optimization problem, we can model the
2-norm constraint based on Sec. 10.1.1.
In general,
In this process we can treat the number of simulated return samples in each batch of
resampled data as a parameter. It can be used to measure the level of certainty of the
expected return estimate. A large number of simulated returns is consistent with more
certainty, while a small number is consistent with less certainty.
While this method offers good results in terms of diversification and mitigating the
effects of estimation error, it has to be noted that it does this at the cost of intensive
computational work.
According to [MM08], improving the optimization procedure is more beneficial than using improved estimates (e.g. shrinkage) as input parameters. However, the two approaches do not exclude each other, so they can be used simultaneously.
By splitting the problem in the above way into two independent parts, the error caused
by intra-cluster noise cannot propagate across clusters.
In the paper the author also proposes a method for estimating the error in the optimal portfolio weights. We can use it with any optimization method, and it facilitates the comparison of methods in practice. The steps of this procedure can be found in Sec. 10.4.
4.5 Example
In this part we add shrinkage estimation to the case study introduced in the previous chapters. We start at the point where the expected weekly logarithmic returns and their covariance matrix are available:
return_array = df_weekly_log_returns.to_numpy()
m_weekly_log = np.mean(return_array, axis=0)
S_weekly_log = np.cov(return_array.transpose())
The eigenvalues of the covariance matrix are 1e-3 * [0.2993, 0.3996, 0.4156, 0.5468, 0.6499, 0.9179, 1.0834, 4.7805]. We apply the Ledoit–Wolf shrinkage with target (4.5):
N = S_weekly_log.shape[0]
T = return_array.shape[0]
# Ledoit--Wolf shrinkage
S = S_weekly_log
s2_avg = np.trace(S) / N
B = s2_avg * np.eye(N)
Z = return_array.T - m_weekly_log[:, np.newaxis]
alpha_num = alpha_numerator(Z, S)
alpha_den = np.trace((S - B) @ (S - B))
alpha = alpha_num / alpha_den
S_shrunk = (1 - alpha) * S + alpha * B
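The helper alpha_numerator is not shown in the extract; a minimal implementation consistent with the formula for the optimal $\alpha$ above could be:

def alpha_numerator(Z, S):
    # (1/T^2) * sum_k Tr((z_k z_k^T - S)^2), for columns z_k of Z
    T = Z.shape[1]
    total = 0.0
    for k in range(T):
        D = np.outer(Z[:, k], Z[:, k]) - S
        total += np.trace(D @ D)
    return total / T**2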
In this example, we get $\alpha = 0.0843$, and the eigenvalues will become 1e-3 * [0.3699, 0.4618, 0.4764, 0.5965, 0.6909, 0.9363, 1.0879, 4.4732], closer to their sample mean as expected.
Next we implement the James–Stein estimator for the expected return vector:
# James--Stein estimator
m = m_weekly_log[:, np.newaxis]
o = np.ones(N)[:, np.newaxis]
S = S_shrunk
iS = np.linalg.inv(S)
b = (o.T @ m / N) * o
N_eff = np.trace(S) / np.max(np.linalg.eigvalsh(S))
alpha_num = max(N_eff - 3, 0)
alpha_den = T * (m - b).T @ iS @ (m - b)
alpha = alpha_num / alpha_den
m_shrunk = b + max(1 - alpha, 0) * (m - b)
m_shrunk = m_shrunk[:, 0]
Fig. 4.2: Portfolio composition x with varying level of risk-aversion $\delta$.
Chapter 5
Factor models
The purpose of factor models is to impose a structure on financial variables and their
covariance matrix by explaining them through a small number of common factors. This
can help to overcome estimation error by reducing the number of parameters, i. e., the
dimensionality of the estimation problem, making portfolio optimization more robust
against noise in the data. Factor models also provide a decomposition of financial risk to
systematic and security specific components [CM13].
Let $Z_t$ be an $N$-dimensional random vector representing the analyzed financial variable at time $t$. We can write $Z_t$ as a linear function of components as
$$Z_t = \beta F_t + \theta_t,$$
where $F_t$ is the $K$-dimensional vector of common factors, $\beta$ is the $N \times K$ matrix of factor exposures, and $\theta_t = \alpha + \varepsilon_t$ is the security specific component, with intercept $\alpha$ and white noise¹ residual $\varepsilon_t$.² The mean of $Z_t$ is then
$$\mu_Z = \alpha + \beta\mu_F,$$
and, because the common factors and the residuals are uncorrelated, the covariance matrix of $Z_t$ is
$$\Sigma_Z = \beta\Sigma_F\beta^T + \Sigma_\theta.$$
¹ For a white noise process $\varepsilon_t$ we have $\mathrm{E}(\varepsilon_t) = 0$ and $\mathrm{Cov}(\varepsilon_{t_1}, \varepsilon_{t_2}) = \Omega$ if $t_1 = t_2$, where $\Omega$ does not depend on $t$, otherwise $0$.
² Additional lags of $F_t$ in the $Z_t$ equations can easily be allowed; then we obtain a dynamic factor model.
This way the covariance matrix Σ𝑍 has only 𝑁 𝐾 + 𝑁 + 𝐾(𝐾 + 1)/2 parameters, which
is linear in 𝑁 , instead of having 𝑁 (𝑁 + 1)/2 covariances, which is quadratic in 𝑁 .
Note that because the common factors and the specific components are uncorrelated,
we can separate the risks associated with each. For a portfolio x the portfolio level
factor exposures are given by b = 𝛽 T x and we can write the total portfolio variance as
bT Σ𝐹 b + xT Σ𝜃 x.
While the factor structure may introduce a bias on the covariance estimator, it will
also lower its variance owing to the reduced number of parameters to estimate.
• BARRA method: We determine the estimate $\hat\beta$ of the matrix $\beta$ from the observed security specific attributes (assumed time invariant). Then we do cross-sectional regression for each $t$ to obtain the factor time-series $F$ and its estimated covariance matrix $\Sigma_F$: $F_t = (\hat\beta^T \Sigma_\theta^{-1} \hat\beta)^{-1} \hat\beta^T \Sigma_\theta^{-1} Z_t$, where $\Sigma_\theta$ is the diagonal matrix of estimated residual variances.
• Fama–French method: For each time $t$ we order the securities into quintiles by a given security attribute. Then we go long the top quintile and short the bottom quintile. The factor realization at time $t$ corresponding to the attribute will be the return of this hedged portfolio. After obtaining the factor time-series corresponding to each attribute, we can estimate $\beta$ using time-series regression.
A drawback of explicit factor models is the misspecification risk. The derivation of the
covariance matrix Σ𝑍 strongly relies on the orthogonality between the factors 𝐹𝑡 and the
residuals 𝜃𝑡 , which might not be satisfied if relevant factors are missing from the model.
Implicit factor models derive both the factors and the factor exposures from the data; they do not need any external input. However, the statistical estimation procedures involved are prone to discovering spurious correlations, and they can only work if we assume the factor exposures to be time invariant over the estimation period. Mainly two methods are used to derive the factors.
• For the second condition, note that the variables $\beta$ and $F_t$ can only be determined up to an invertible matrix $H$, because $\beta F_t = \beta H^{-1} H F_t$. Thus by decomposing the factor covariance as $\Sigma_F = QQ^T$ and choosing $H = Q^{-1}$, we arrive at the desired structure.
The covariance matrix will then become $\Sigma_Z = \beta\beta^T + \Sigma_\theta$. The matrix $Q$ is still determined only up to a rotation. This freedom can help in finding an interpretation for the factors.
Assuming that the variables $Z_t$ are independent and identically normally distributed, we can estimate $\alpha$, $\beta$, and $\Sigma_\theta$ using the maximum likelihood method, yielding the estimates $\hat\alpha$, $\hat\beta$, and $\hat\Sigma_\theta$. Then we use cross-sectional regression (accounting for the cross-sectional heteroskedasticity in $\varepsilon_t$) to estimate the factor time-series: $F_t = (\hat\beta^T \hat\Sigma_\theta^{-1} \hat\beta)^{-1} \hat\beta^T \hat\Sigma_\theta^{-1}(z_t - \hat\alpha)$.
The number of factors can be determined e.g. using a likelihood ratio test. The drawback of factor analysis is that it is not efficient for large problems.
Principal component analysis (PCA) is based on the eigenvalue decomposition of the covariance matrix,
$$\Sigma = V\Lambda V^T,$$
where the matrix $\Lambda$ is diagonal with the eigenvalues $\lambda_1, \dots, \lambda_N$ in it, ordered from largest to smallest, and the matrix $V$ contains the corresponding eigenvectors $v_1, \dots, v_N$ as columns.
If we partition the eigenvector matrix as $V = [V_K, V_{N-K}]$, where $V_K$ consists of the first $K$ eigenvectors, we can construct the solution: $\beta = V_K$, $\alpha = \mathrm{E}(Z_t)$, and the principal components $F_t = V_K^T(Z_t - \mathrm{E}(Z_t))$. For the residuals $\varepsilon_t = Z_t - \tilde Z_t = V_{N-K}V_{N-K}^T(Z_t - \mathrm{E}(Z_t))$ we have $\mathrm{E}(\varepsilon_t) = 0$ and $\mathrm{Cov}(F_t, \varepsilon_t) = 0$, as required in the factor model definition. Also the factors will be uncorrelated: $\Sigma_F = \frac{1}{T}\sum_t F_t F_t^T = \Lambda_K$, containing the first $K$ eigenvalues in the diagonal.
The intuition behind PCA is that it does an orthogonal projection of $Z_t$ onto the hyperplane spanned by the $K$ directions corresponding to the $K$ largest eigenvalues. The $K$-dimensional projection $\tilde Z_t$ contains the most randomness of the original variable $Z_t$ that an affine transformation can preserve. Eigenvalue $\lambda_i$ measures the randomness of principal component $F_{t,i}$, the randomness of $Z_t$ along the direction $v_i$. The ratio $\sum_{i=1}^{K}\lambda_i \big/ \sum_{i=1}^{N}\lambda_i$ is the fraction of randomness explained by the factor model approximation.
We can also do PCA on the correlation matrix. We get the correlation matrix $\Omega$ from the covariance matrix $\Sigma$ by $\Omega = D^{-1/2}\Sigma D^{-1/2}$, where $D = \mathrm{Diag}(\Sigma)$. Then we keep only the $K$ largest eigenvalues: $\tilde\Omega = V_K\Lambda_K V_K^T$. To guarantee that this approximation will also be a correlation matrix, we add a diagonal correction term $\mathrm{Diag}(I - \tilde\Omega)$. Finally, we transform it back into a covariance matrix. By this method we can get better performance when the scale of the data variables is very different.
Advantages of PCA based factor models are that they are computationally simple, they need no external data, and they avoid misspecification risk because the residuals and the factors are orthogonal by construction. Also they yield factors ranked according to their fraction of explained variance, which can further help determining the number of factors $K$ to use. The drawback of implicit factors is that they do not have a financial interpretation, and the factor exposures can be unstable. Factor models can be generalized in many ways; see in [BS11].
Number of factors
In practice, the number of factors $K$ is unknown and has to be estimated. Too few factors reduce the explanatory power of the model, too many could lead to retaining insignificant factors. In [BN01] the authors propose an information criterion to estimate $K$:
$$K^* = \operatorname{argmin}_K\ \log\left(\frac{1}{NT}\,\varepsilon_K^T\varepsilon_K\right) + K\,\frac{N+T}{NT}\,\log\left(\frac{NT}{N+T}\right),$$
where $\varepsilon_K$ denotes the residuals of the model with $K$ factors.
Practical considerations
In practice, PCA relies on the assumption that the number of observations 𝑇 is large
relative to the number of securities 𝑁 . If this does not hold, i. e., 𝑁 > 𝑇 , then we can
do PCA in a different way. Suppose that Z is the 𝑁 × 𝑇 data matrix, and 𝜇 is the mean
of the data.
Then we do the eigendecomposition of the $T \times T$ matrix $\frac{1}{N}(Z - \mu\mathbf{1}^T)^T(Z - \mu\mathbf{1}^T) = W\Lambda W^T$, and we get the principal component matrix as $F = \Lambda W^T$.
The most efficient way to compute PCA is to use singular value decomposition (SVD) directly on the data matrix $Z$. Then we get $Z = V\Lambda W^T$. Now it is easy to see the connection between the above two PCA approaches:
$$F = V^T Z = V^T V\Lambda W^T = \Lambda W^T.$$
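As an illustration, a minimal numpy sketch of the SVD route (assuming Z is the N x T data matrix, centered beforehand):

import numpy as np

# Sketch: PCA via SVD of a centered N x T data matrix Z, Z = V diag(lam) Wt.
V, lam, Wt = np.linalg.svd(Z, full_matrices=False)
F = np.diag(lam) @ Wt   # principal component time series, F = V^T Z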
5.3 Modeling considerations
We have arrived at the decomposition of the covariance matrix Σ𝑍 = 𝛽Σ𝐹 𝛽 T + Σ𝜃 ,
where Σ𝐹 is the 𝐾 × 𝐾 sample factor covariance matrix and 𝐾 is the number of factors.
Typically we have only a few factors, so 𝐾 ≪ 𝑁 and thus Σ𝐹 is much lower dimensional
than the 𝑁 × 𝑁 matrix Σ𝑍 . This makes it much cheaper to find the decomposition
$\Sigma_F = FF^T$, and consequently the factorization $\Sigma_Z = GG^T$, where
$$G = \left[\,\beta F,\ \Sigma_\theta^{1/2}\,\right]. \qquad (5.1)$$
This form of G is typically very sparse. In practice sparsity is more important than the
number of variables and constraints in determining the computational efficiency of the
optimization problem. Thus it can be beneficial to focus on the number of nonzeros in the
matrix G and try to reduce it. Indeed, the 𝑁 × (𝑁 + 𝐾) dimensional G is larger than the
𝑁 × 𝑁 dimensional Cholesky factor of Σ𝑍 would be, but in G there are only 𝑁 (𝐾 + 1)
nonzeros in contrast to the 𝑁 (𝑁 + 1)/2 nonzeros in the Cholesky factor. Because in
practice 𝐾 tends to be a small number independent of 𝑁 , we can conclude that the usage
of a factor model reduced the number of nonzeros, and thus the storage requirement by
a factor of 𝑁 . This will in most cases also lead to a significant reduction in the solution
time.
In the risk minimization setting (2.1) of the Markowitz problem, according to Sec. 10.1.1 we can model the factor risk term $x^T\beta\Sigma_F\beta^T x \le t_1$ as $(\frac{1}{2}, t_1, F^T\beta^T x) \in \mathcal{Q}_r^{K+2}$, and the specific risk term $x^T\Sigma_\theta x \le t_2$ as $(\frac{1}{2}, t_2, \Sigma_\theta^{1/2} x) \in \mathcal{Q}_r^{N+2}$. The latter can be written in a computationally more favorable way as $(\frac{1}{2}, t_2, \mathrm{diag}(\Sigma_\theta)^{1/2} \circ x) \in \mathcal{Q}_r^{N+2}$. The resulting risk minimization problem will then look like

$$\begin{array}{ll}
\text{minimize} & t_1 + t_2 \\
\text{subject to} & \left(\frac{1}{2}, t_1, F^T\beta^T x\right) \in \mathcal{Q}_r^{K+2}, \\
& \left(\frac{1}{2}, t_2, \mathrm{diag}(\Sigma_\theta)^{1/2} \circ x\right) \in \mathcal{Q}_r^{N+2}, \\
& \mu^T x \ge r_{\mathrm{min}}, \\
& x \in \mathcal{F}.
\end{array}\qquad (5.2)$$
We can also have prespecified limits $\gamma_1$ for the factor risk and $\gamma_2$ for the specific risk. Then we can use the return maximization setting:

$$\begin{array}{ll}
\text{maximize} & \mu^T x \\
\text{subject to} & \left(\frac{1}{2}, \gamma_1, F^T\beta^T x\right) \in \mathcal{Q}_r^{K+2}, \\
& \left(\frac{1}{2}, \gamma_2, \mathrm{diag}(\Sigma_\theta)^{1/2} \circ x\right) \in \mathcal{Q}_r^{N+2}, \\
& x \in \mathcal{F}.
\end{array}\qquad (5.3)$$
5.4 Example
In this chapter there are two examples. The first shows the application of macroeconomic
factor models on the case study developed in the preceding chapters. The second example
demonstrates the performance gain that factor models can yield on a large scale portfolio
optimization problem.
5.4.1 Single factor model
In the example in Sec. 4.5 we have seen that the effective dimension $\tilde N$ is low, indicating that there are only one or two independent sources of risk for all securities. This suggests that a single factor model might give a good approximation of the covariance matrix. In this example we will use the return of the SPY ETF as the market factor. The SPY tracks the S&P 500 index and thus is a good proxy for the U.S. stock market. Then our model will be:
$$R_t = \alpha + \beta R_{M,t} + \varepsilon_t,$$
where 𝑅M is the return of the market factor. To estimate this model using time-series
regression, we need to obtain scenarios of linear returns on the investment time horizon
(ℎ = 1 year in this example), both for the individual securities and for SPY.
To get this, we add SPY as the ninth security, and proceed exactly the same way as in Sec. 3.4 up to the point where we have the expected yearly logarithmic return vector $\mu_h^{\log}$ and covariance matrix $\Sigma_h^{\log}$. Then we generate Monte Carlo scenarios using these parameters.
scenarios_log = np.random.default_rng().multivariate_normal(m_log, S_log, 100000)
scenarios_lin = np.exp(scenarios_log) - 1
Then we do the linear regression using the OLS class of the statsmodels package. The
independent variable X will contain the scenarios for the market returns and a constant
term. The dependent variable y will be the scenarios for one of the security returns.
import statsmodels.api as sm

params = []
resid = []
X = np.zeros((scenarios_lin.shape[0], 2))
X[:, 0] = scenarios_lin[:, -1]   # market factor (SPY) scenarios
X[:, 1] = 1                      # constant term
for k in range(N):
    y = scenarios_lin[:, k]
    model = sm.OLS(y, X, hasconst=True).fit()
    resid.append(model.resid)
    params.append(model.params)
resid = np.array(resid)
params = np.array(params)
After that we derive the estimates $\hat\alpha$, $\hat\beta$, $s_M^2$, and $\mathrm{diag}(\hat\Sigma_\theta)$ of the parameters $\alpha$, $\beta$, $\sigma_M^2$, and $\mathrm{diag}(\Sigma_\theta)$ respectively.
a = params[:, 1]
B = params[:, 0]
s2_M = np.var(X[:, 0])
S_theta = np.cov(resid)
diag_S_theta = np.diag(S_theta)
At this point, we can compute the decomposition (5.1) of the covariance matrix:
G = np.block([[B[:, np.newaxis] * np.sqrt(s2_M), np.diag(np.sqrt(diag_S_theta))]])
We can see that in this 8 × 9 matrix there are only 16 nonzero elements:
G = np.array([
[0.174, 0.253, 0, 0, 0, 0, 0, 0, 0 ],
[0.189, 0, 0.205, 0, 0, 0, 0, 0, 0 ],
[0.171, 0, 0, 0.181, 0, 0, 0, 0, 0 ],
[0.180, 0, 0, 0, 0.190, 0, 0, 0, 0 ],
[0.274, 0, 0, 0, 0, 0.314, 0, 0, 0 ],
[0.225, 0, 0, 0, 0, 0, 0.201, 0, 0 ],
[0.221, 0, 0, 0, 0, 0, 0, 0.218, 0 ],
[0.191, 0, 0, 0, 0, 0, 0, 0, 0.213 ]
])
This sparsity can considerably speed up the optimization. See the next example in
Sec. 5.4.2 for a large scale problem where this speedup is apparent.
While MOSEK can handle the sparsity efficiently, we can exploit this structure also
at the model creation phase, if we handle the factor risk and specific risk parts separately,
avoiding a large matrix-vector product:
factor_risk = Expr.mul(G_factor.T, x)
specific_risk = Expr.mulElm(g_specific, x)
If we plot the efficient frontier on Fig. 5.1, and the portfolio composition on Fig. 5.2
we can compare the results obtained with and without using a factor model.
Fig. 5.1: The efficient frontier.
# Generate random factor model parameters
B = rng.normal(size=(N, K))
a = rng.normal(loc=1, size=(N, 1))
e = rng.multivariate_normal(np.zeros(N), np.eye(N), T).T
# Residual covariance
S_theta = np.cov(e)
diag_S_theta = np.diag(S_theta)
# Optimization parameters (Z is the N x T matrix of return scenarios
# generated from the factor model in the omitted part of this snippet)
m = np.mean(Z, axis=1)
S = np.cov(Z)
The following code computes the comparison by increasing the number of securities
𝑁 . The factor model in this example will have 𝐾 = 10 common factors, which is kept
constant, because it typically does not depend on 𝑁 .
# Risk limit
gamma2 = 0.1
# Number of factors
K = 10

# Factor structure based factorization: Sigma_Z = B S_F B^T + diag(S_theta)
F = np.linalg.cholesky(S_F)
G_factor = B @ F
g_specific = np.sqrt(diag_S_theta)
# Dense Cholesky factor of the full covariance matrix, for comparison
G_orig = np.linalg.cholesky(S)

# ... solve both model variants for each N and measure the solver runtimes ...
list_runtimes_orig.append((N, runtime_orig))
list_runtimes_factor.append((N, runtime_factor))
The function call Markowitz(N, m, G, gamma2) runs the Fusion model, the same way
as in Sec. 2.4. If instead of the argument G we specify the tuple (G_factor, g_specific),
then the risk constraint in the Fusion model is created such that it exploits the risk factor
structure:
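A minimal sketch of such a risk constraint, built from the factor_risk and specific_risk expressions shown earlier (the risk limit gamma2, the variable x, and the data G_factor and g_specific are assumed to be defined as in the surrounding code):
# Rotated quadratic cone encoding x'Sx <= gamma2, where
# S = G_factor G_factor' + diag(g_specific^2)
M.constraint('risk', Expr.vstack(gamma2, 0.5,
                                 Expr.vstack(Expr.mul(G_factor.T, x),
                                             Expr.mulElm(g_specific, x))),
             Domain.inRotatedQCone())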
To get the runtime of the optimization, we add the following line to the model:
time = M.getSolverDoubleInfo("optimizerTime")
The results can be seen on Fig. 5.3, showing that the factor model based covariance
matrix modeling can result in orders of magnitude faster solution times for large scale
problems.
Chapter 6
Transaction costs
Rebalancing a portfolio generates turnover, i. e., buying and selling of securities to change
the portfolio composition. The basic Markowitz model assumes that there are no costs
associated with trading, but in reality, turnover incurs expenses. In this chapter we extend
the basic model to take this into account in the form of transaction cost constraints. We also show some practical constraints that limit turnover by limiting position sizes.
We can classify transaction costs into two types [WO11]:
• Fixed costs are independent of transaction volume. These include brokerage com-
missions and transfer fees.
• Variable costs depend on the transaction volume. These comprise execution costs
such as market impact, bid/ask spread, or slippage; and opportunity costs of failed
or incomplete execution.
Note that to be able to compare transaction costs with returns and risk, we need to
aggregate them over the length of the investment time period.
In the optimization problem, let x̃ = x − x₀ denote the change in the portfolio with respect to the initial holdings x₀. Then in general we can take transaction costs into account with the function 𝐶, where 𝐶(x̃) is the total transaction cost incurred by the change x̃ in the portfolio. Here we assume that transaction costs are separable, i.e., the total cost is the sum of the costs associated with each security: 𝐶(x̃) = Σ_{𝑖=1}^{𝑁} 𝐶ᵢ(x̃ᵢ), where the function 𝐶ᵢ(x̃ᵢ) specifies the transaction cost incurred for the change in the holdings of security 𝑖. We can then write the MVO model with transaction costs in the following way:

maximize    𝜇ᵀx
subject to  1ᵀx + Σ_{𝑖=1}^{𝑁} 𝐶ᵢ(x̃ᵢ) = 1ᵀx₀,
            xᵀΣx ≤ 𝛾²,                                   (6.1)
            x ∈ ℱ.
6.1 Variable transaction costs
The simplest model that handles variable costs makes the assumption that costs grow
linearly with the trading volume [BBD+17], [LMFB07]. We can use linear costs, for
example, to model the cost related to the bid/ask spread, slippage, borrowing or shorting
cost, or fund management fees. Let the transaction cost function for security 𝑖 be given
by
𝐶ᵢ(x̃ᵢ) = { 𝑣ᵢ⁺x̃ᵢ,   x̃ᵢ ≥ 0,
          { −𝑣ᵢ⁻x̃ᵢ,  x̃ᵢ < 0,

where 𝑣ᵢ⁺ and 𝑣ᵢ⁻ are the cost rates associated with buying and selling security 𝑖. By introducing the positive and negative part variables x̃ᵢ⁺ = max(x̃ᵢ, 0) and x̃ᵢ⁻ = max(−x̃ᵢ, 0) we can linearize this cost to 𝐶ᵢ(x̃ᵢ) = 𝑣ᵢ⁺x̃ᵢ⁺ + 𝑣ᵢ⁻x̃ᵢ⁻. We can handle any piecewise linear convex transaction cost function in a similar way. After modeling the variables x̃ᵢ⁺ and x̃ᵢ⁻ as in Sec. 10.1.1, the optimization problem then becomes

maximize    𝜇ᵀx
subject to  1ᵀx + ⟨v⁺, x̃⁺⟩ + ⟨v⁻, x̃⁻⟩ = 1,
            x̃ = x̃⁺ − x̃⁻,                                 (6.2)
            x̃⁺, x̃⁻ ≥ 0,
            xᵀΣx ≤ 𝛾²,
            x ∈ ℱ.
In this model the budget constraint ensures that the variables x̃+ and x̃− will not both
become positive in any optimal solution.
6.2 Fixed transaction costs

A fixed transaction cost is a fee that has to be paid whenever there is any trading in security 𝑖, independently of the transaction volume. The resulting cost function is not convex, but we can still formulate a mixed-integer optimization problem based on Sec. 10.2.1 by introducing new variables. Let y⁺ and y⁻ be binary vectors indicating buying and selling in each security. Then the optimization problem with fixed and variable transaction costs becomes

maximize    𝜇ᵀx
subject to  1ᵀx + ⟨f⁺, y⁺⟩ + ⟨f⁻, y⁻⟩ + ⟨v⁺, x̃⁺⟩ + ⟨v⁻, x̃⁻⟩ = 1,
            x̃ = x̃⁺ − x̃⁻,
            x̃⁺ ≤ u⁺ ∘ y⁺,
            x̃⁻ ≤ u⁻ ∘ y⁻,                                (6.3)
            x̃⁺, x̃⁻ ≥ 0,
            y⁺ + y⁻ ≤ 1,
            y⁺, y⁻ ∈ {0, 1}ᴺ,
            xᵀΣx ≤ 𝛾²,
            x ∈ ℱ,
where u⁺ and u⁻ are vectors of upper bounds on the amounts of buying and selling in each security and ∘ is the elementwise product. The products 𝑢ᵢ⁺𝑦ᵢ⁺ and 𝑢ᵢ⁻𝑦ᵢ⁻ ensure that if security 𝑖 is traded (𝑦ᵢ⁺ = 1 or 𝑦ᵢ⁻ = 1), then both fixed and variable costs are incurred, otherwise (𝑦ᵢ⁺ = 𝑦ᵢ⁻ = 0) the transaction cost is zero. Finally, the constraint y⁺ + y⁻ ≤ 1 ensures that the transaction for each security is either a buy or a sell, and never both.
6.3 Market impact costs

For large trades, the trading activity itself moves the market price of the security; this effect is called market impact, and for large investors it is a significant source of transaction costs. A common model for the relative price change caused by trading a dollar amount 𝑑̃ᵢ of security 𝑖 is

Δ𝑝ᵢ/𝑝ᵢ = ±𝑐ᵢ𝜎ᵢ (|𝑑̃ᵢ|/𝑞ᵢ)^{𝛽−1},                          (6.4)

where 𝜎ᵢ is the volatility of security 𝑖 for a unit time period, 𝑞ᵢ is the average dollar volume in a unit time period, and the sign depends on the direction of the trade. The number 𝑐ᵢ has to be calibrated, but it is usually around one. Equation (6.4) is called the "square-root" law, because 𝛽 − 1 is empirically shown to be around 1/2 [TLD+11].
The relative price difference (6.4) is the impact cost rate when a dollar amount 𝑑̃ᵢ is traded. After actually trading this amount, we get the total market impact cost

𝐶ᵢ(𝑑̃ᵢ) = (Δ𝑝ᵢ/𝑝ᵢ) 𝑑̃ᵢ = 𝑎ᵢ|𝑑̃ᵢ|^𝛽,                        (6.5)

where 𝑎ᵢ = ±𝑐ᵢ𝜎ᵢ/𝑞ᵢ^{𝛽−1}. Thus if 𝛽 − 1 = 1/2, the market impact cost increases with the 𝛽 = 3/2 power of the traded dollar amount.
We can also express the market impact cost in terms of the portfolio fraction x̃ᵢ instead of 𝑑̃ᵢ by normalizing 𝑞ᵢ with the total portfolio value vᵀp₀.
Using Sec. 10.1.1 we can model 𝑡ᵢ ≥ |x̃ᵢ|^𝛽 with the power cone as (𝑡ᵢ, 1, x̃ᵢ) ∈ 𝒫₃^{1/𝛽,(𝛽−1)/𝛽}. Hence, the total market impact cost term Σ_{𝑖=1}^{𝑁} 𝑎ᵢ|x̃ᵢ|^𝛽 can be modeled by Σ_{𝑖=1}^{𝑁} 𝑎ᵢ𝑡ᵢ under the constraints (𝑡ᵢ, 1, x̃ᵢ) ∈ 𝒫₃^{1/𝛽,(𝛽−1)/𝛽}.
Note however, that in this model nothing forces 𝑡ᵢ to be as small as possible, so 𝑡ᵢ = |x̃ᵢ|^𝛽 might not hold at the optimal solution. This freedom allows the optimizer to reduce portfolio risk by incorrectly treating the cost term 𝑎ᵢ𝑡ᵢ as a risk-free security. It would then allocate more weight to 𝑎ᵢ𝑡ᵢ while reducing the weight allocated to risky securities, basically throwing money away.
There are two solutions that can prevent this unwanted behavior:
• Adding a risk-free security to the model. In this case the optimizer will prefer
to allocate to the risk-free security, which has positive return (the risk-free rate),
instead of allocating to 𝑎𝑖 𝑡𝑖 .
Let us denote the weight of the risk-free security by 𝑥f and the risk-free rate of return
by 𝑟f . Then the portfolio optimization problem accounting for market impact costs will
be
maximize    𝜇ᵀx + 𝑟f𝑥f
subject to  1ᵀx + aᵀt + 𝑥f = 1,
            xᵀΣx ≤ 𝛾²,                                   (6.6)
            (𝑡ᵢ, 1, x̃ᵢ) ∈ 𝒫₃^{1/𝛽,(𝛽−1)/𝛽}, 𝑖 = 1, …, 𝑁,
            x, 𝑥f ∈ ℱ.
Note that if we model using the quadratic cone instead of the rotated quadratic cone
and a risk free security is present, then there will be no optimal portfolios for which
0 < 𝑥f < 1. The solutions will be either 𝑥f = 1 or some risky portfolio with 𝑥f = 0. See
a detailed discussion about this in Sec. 10.3.
6.4 Cardinality constraints

Practical considerations can limit the number of securities we are allowed to trade, for instance to reduce the number of fixed cost payments. We can impose such a cardinality constraint with a binary vector y, where 𝑦ᵢ indicates whether security 𝑖 is traded (see Sec. 10.2.1). The model then gets updated as follows:

maximize    𝜇ᵀx
subject to  1ᵀx = 1,
            x̃ = x − x₀,
            x̃ ≤ u ∘ y,
            x̃ ≥ −u ∘ y,                                  (6.7)
            1ᵀy ≤ 𝐾,
            y ∈ {0, 1}ᴺ,
            xᵀΣx ≤ 𝛾²,
            x ∈ ℱ,
where the vector u is some a priori chosen upper bound on the amount of trading in each
security.
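A sketch of the new parts of model (6.7) in Fusion (assuming the data x0, u and the limit K are given, and the rest of the model is as in Sec. 2.4):
# Binary indicator y_i = 1 if security i is traded
y = M.variable("y", N, Domain.binary())
# Linearized bounds |x - x0| <= u o y switch trading on and off
xt = Expr.sub(x, x0)
M.constraint('card-ub', Expr.sub(Expr.mulElm(u, y), xt), Domain.greaterThan(0.0))
M.constraint('card-lb', Expr.add(Expr.mulElm(u, y), xt), Domain.greaterThan(0.0))
# At most K securities can be traded
M.constraint('card', Expr.sum(y), Domain.lessThan(K))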
6.5 Buy-in threshold

Another practical constraint prevents very small trades: if there is any transaction in security 𝑖 at all, its size must reach at least a lower bound ℓᵢ. With binary variables y⁺, y⁻ indicating buying and selling (see Sec. 10.2.1), the model becomes

maximize    𝜇ᵀx
subject to  1ᵀx = 1,
            x − x₀ = x̃⁺ − x̃⁻,
            x̃⁺, x̃⁻ ≥ 0,
            x̃⁺ ≤ u⁺ ∘ y⁺,
            x̃⁺ ≥ ℓ⁺ ∘ y⁺,
            x̃⁻ ≤ u⁻ ∘ y⁻,                                (6.8)
            x̃⁻ ≥ ℓ⁻ ∘ y⁻,
            y⁺ + y⁻ ≤ 1,
            y⁺, y⁻ ∈ {0, 1}ᴺ,
            xᵀΣx ≤ 𝛾²,
            x ∈ ℱ.
This model is of course compatible with the fixed plus linear transaction cost model
discussed in Sec. 6.2.
6.6 Example
In this chapter we show two examples. The first demonstrates the modeling of market
impact through the use of the power cone, while the second example presents fixed and
variable transaction costs and the buy-in threshold.
6.6.1 Market impact model

According to the data, the average daily dollar volumes are 10⁸ · [3.9883, 4.2416, 6.0054, 4.2584, 30.4647, 34.5619, 5.0077, 8.4950], and the daily volatilities are [0.0164, 0.0154, 0.0146, 0.0155, 0.0191, 0.0173, 0.0186, 0.0169]. Thus in this example
we will choose the size of our portfolio to be 10 billion dollars so that we can see a
significant market impact.
Then we update the Fusion model introduced in Sec. 2.4.2 with new variables and
constraints:
def EfficientFrontier(N, m, G, deltas, a, beta, rf):
# Variables
# The variable x is the fraction of holdings in each security.
# x must be positive, this imposes the no short-selling constraint.
x = M.variable("x", N, Domain.greaterThan(0.0))
(continued from previous page)
# Solve optimization
M.solve()
# Save results
portfolio_return = m @ x.level() + np.array([rf]) @ xf.level()
portfolio_risk = np.sqrt(2 * s.level()[0])
t_resid = t.level() - np.abs(x.level())**beta
row = pd.Series([d, M.primalObjValue(), portfolio_return,
portfolio_risk, sum(t_resid), sum(x.level()),
sum(xf.level()), t.level() @ a]
+ list(x.level()), index=columns)
df_result = df_result.append(row, ignore_index=True)
return df_result
Next, we compute the efficient frontier with and without market impact costs. We
select 𝛽 = 3/2 and 𝑐𝑖 = 1. The following code produces the results:
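The code is not repeated here in full; a sketch of the main steps, in which the data variable names daily_dollar_volumes and daily_volatilities are assumptions:
beta = 1.5
portfolio_value = 1e10
# q_i: average daily dollar volume as a fraction of the portfolio value
rel_volume = daily_dollar_volumes / portfolio_value
# a_i = c_i * sigma_i / q_i^(beta - 1), with c_i = 1
a = daily_volatilities / rel_volume**(beta - 1)
deltas = np.logspace(start=-1, stop=2, num=20)[::-1]  # hypothetical sweep
df_result = EfficientFrontier(N, m, G, deltas, a, beta, rf)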
On Fig. 6.1 we can see the return-reducing effect of market impact costs. The left part of the efficient frontier (up to the so-called tangency portfolio) is linear because a risk-free security was included. However, in this case borrowing is not allowed, so the right part keeps the usual parabola shape.
Fig. 6.1: The efficient frontier with risk-free security included, and market impact cost
taken into account.
6.6.2 Transaction cost models
In this example we show a problem that models fixed and variable transaction costs and
the buy-in threshold. Note that we do not model the market impact here.
We will assume now that x can take negative values too (short-selling is allowed), up to the limit of 30% of the portfolio size. This way we can see how to apply different costs to buy and sell trades. We also assume that x₀ = 0, so x̃ = x.
The following code defines variables used as the positive and negative part variables
of x and the binary variables y+ , y− indicating whether there is buying or selling in a
security:
# Real variables
xp = M.variable("xp", N, Domain.greaterThan(0.0))
xm = M.variable("xm", N, Domain.greaterThan(0.0))
# Binary variables
yp = M.variable("yp", N, Domain.binary())
ym = M.variable("ym", N, Domain.binary())
Next we add two constraints. The first links xp and xm to x, so that they represent
the positive and negative parts. The second ensures that for each coordinate of yp and
ym only one of the values can be 1.
# Constraint assigning xp and xm to the positive and negative part of x.
M.constraint('pos-neg-part', Expr.sub(x, Expr.sub(xp, xm)),
Domain.equalsTo(0.0))
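The second constraint was not shown above; a minimal sketch of it:
# For each security at most one of yp and ym can be 1,
# i.e., a trade is either a buy or a sell, never both.
M.constraint('trade-direction', Expr.add(yp, ym), Domain.lessThan(1.0))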
We update the budget constraint with the variable and fixed transaction cost terms. The fixed costs of buy and sell trades are held by the parameters fp and fm. These are typically given in dollars, and have to be divided by the total portfolio value. The variable cost coefficients are vp and vm. If these are given as percentages, then we do not have to modify them.
# Budget constraint with transaction cost terms
fixcost_terms = Expr.add([Expr.dot(fp, yp), Expr.dot(fm, ym)])
varcost_terms = Expr.add([Expr.dot(vp, xp), Expr.dot(vm, xm)])
budget_terms = Expr.add([Expr.sum(x), varcost_terms, fixcost_terms])
M.constraint('budget', budget_terms, Domain.equalsTo(1.0))
Next, the 130/30 leverage constraint is added. Note that the transaction cost terms
from the budget constraint should also appear here, otherwise the two constraints com-
bined would allow a little more leverage than intended. (The sum of x would not reach
1 because of the cost terms, leaving more space in the leverage constraint for negative
positions.)
# Auxiliary variable for 130/30 leverage constraint
z = M.variable("z", N, Domain.unbounded())
M.constraint('leverage-gt', Expr.sub(z, x), Domain.greaterThan(0.0))
M.constraint('leverage-ls', Expr.add(z, x), Domain.greaterThan(0.0))
M.constraint('leverage-sum',
Expr.add([Expr.sum(z), varcost_terms, fixcost_terms]),
Domain.equalsTo(1.6))
Finally, to be able to differentiate between zero allocation (not incurring fixed cost) and nonzero allocation (incurring fixed cost), and to implement the buy-in threshold, we need bound constraints involving the binary variables:
# Bound constraints
M.constraint('ubound-p', Expr.sub(Expr.mul(up, yp), xp),
Domain.greaterThan(0.0))
M.constraint('ubound-m', Expr.sub(Expr.mul(um, ym), xm),
Domain.greaterThan(0.0))
M.constraint('lbound-p', Expr.sub(xp, Expr.mul(lp, yp)),
Domain.greaterThan(0.0))
M.constraint('lbound-m', Expr.sub(xm, Expr.mul(lm, ym)),
Domain.greaterThan(0.0))
The full updated model will then look like the following:
# Real variables
# The variable x is the fraction of holdings in each security.
x = M.variable("x", N, Domain.unbounded())
xp = M.variable("xp", N, Domain.greaterThan(0.0))
xm = M.variable("xm", N, Domain.greaterThan(0.0))
# Binary variables
yp = M.variable("yp", N, Domain.binary())
ym = M.variable("ym", N, Domain.binary())
# Bound constraints
M.constraint('ubound-p', Expr.sub(Expr.mul(up, yp), xp),
Domain.greaterThan(0.0))
M.constraint('ubound-m', Expr.sub(Expr.mul(um, ym), xm),
Domain.greaterThan(0.0))
M.constraint('lbound-p', Expr.sub(xp, Expr.mul(lp, yp)),
Domain.greaterThan(0.0))
M.constraint('lbound-m', Expr.sub(xm, Expr.mul(lm, ym)),
Domain.greaterThan(0.0))
# Solve optimization
M.solve()
# Save results
portfolio_return = m @ x.level()
portfolio_risk = np.sqrt(2 * s.level()[0])
tcost = (np.dot(vp, xp.level()) + np.dot(vm, xm.level())
         + np.dot(fp, yp.level()) + np.dot(fm, ym.level()))
row = pd.Series([d, M.primalObjValue(), portfolio_return,
portfolio_risk, sum(x.level()), tcost]
+ list(x.level()), index=columns)
df_result = df_result.append(row, ignore_index=True)
return df_result
Here we also used a penalty term in the objective to prevent excess growth of the pos-
itive part and negative part variables. The coefficient of the penalty has to be calibrated
so that we do not overpenalize.
We also have to mention that because of the binary variables, we can only solve this
as a mixed integer optimization (MIO) problem. The solution of such a problem might
not be as efficient as the solution of a problem with only continuous variables. See Sec.
10.2 for details regarding MIO problems.
We compute the efficient frontier with and without transaction costs. The following
code produces the results:
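The code itself is not repeated here; a sketch of how the sweep could look, with entirely hypothetical cost parameters and function signature:
# Hypothetical costs: 0.4%/0.6% variable rates, fixed fees scaled by the
# portfolio value, and a 1% buy-in threshold
vp = 0.004 * np.ones(N)
vm = 0.006 * np.ones(N)
fp = 2000 / 1e8 * np.ones(N)
fm = 2000 / 1e8 * np.ones(N)
up = 2.0 * np.ones(N)
um = 2.0 * np.ones(N)
lp = 0.01 * np.ones(N)
lm = 0.01 * np.ones(N)
deltas = np.logspace(start=-1, stop=2, num=20)[::-1]
df_result = EfficientFrontier(N, m, G, deltas, vp, vm, fp, fm, up, um, lp, lm)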
On Fig. 6.2 we can see the return reducing effect of transaction costs. The overall
return is higher because of the leverage.
Fig. 6.2: The efficient frontier with transaction costs taken into account.
Chapter 7

Benchmark relative portfolio optimization

7.1 Active return

A common goal of active portfolio management is to outperform a given benchmark portfolio x_bm. The difference between the portfolio return and the benchmark return is called the active return, and the standard deviation of the active return,

TE(x) = √((x − x_bm)ᵀΣ(x − x_bm)),

is called the tracking error, where Σ is the covariance matrix of returns. The tracking error measures how close the portfolio return is to the benchmark return. Outperforming the benchmark involves additional risk; if the tracking error is zero then there cannot be any active return at all. In general, we can also measure tracking error differently by constructing a function from any deviation measure; see Sec. 8.
7.2 Factor model on active returns
In the active return there is still a systematic component attributable to the benchmark.
We can account for that using a single factor model. First we create a factor model for
security returns:
𝑅 = 𝛽𝑅_xbm + 𝜃,
Σ = 𝛽𝛽ᵀ𝜎²_xbm + Σ_𝜃,

where 𝜎²_xbm is the benchmark variance and Σ_𝜃 is the residual covariance matrix.
This factor model allows us to decompose the portfolio return into a systematic com-
ponent, which is explained by the benchmark, and a residual component, which is specific
to the portfolio:
𝑅_x = 𝛽_x𝑅_xbm + 𝜃_x,

where 𝛽_x − 1 is the active beta. After computing the variance of the decomposed active return, we can also write the square of the tracking error as

TE²(x) = (𝛽_x − 1)²𝜎²_xbm + xᵀΣ_𝜃x.
7.3 Optimization
An active investment strategy tries to gain higher portfolio alpha 𝛼_x at the cost of accepting a higher tracking error. It follows that portfolio optimization with respect to a benchmark will optimize the tradeoff between portfolio alpha and the square of the tracking error. Additional constraints specific to such problems can be bounds on the portfolio active beta or on the active holdings. We can see examples in the following model:
maximize    𝛼ᵀx
subject to  (x − x_bm)ᵀΣ(x − x_bm) ≤ 𝛾²_TE,
            x − x_bm ≥ l_h,
            x − x_bm ≤ u_h,                              (7.1)
            𝛽_x − 1 ≥ l_𝛽,
            𝛽_x − 1 ≤ u_𝛽,
            x ∈ ℱ,
where 𝛼̂ and 𝛽̂ are estimates of 𝛼 and 𝛽. To compute these estimates, we can do a linear regression. However, this tends to overestimate the betas of stocks with high benchmark exposure and underestimate the betas of stocks with low benchmark exposure. To improve the estimation, shrinkage towards one (the beta of the benchmark) can be helpful, for example 𝛽̄ = 𝑤𝛽̂ + (1 − 𝑤)1 for some weight 𝑤 ∈ (0, 1).
7.4 Extensions
7.4.1 Variance constraint
Optimizing alpha while constraining only the tracking error can increase total portfolio risk. According to [Jor04] it can be helpful to also constrain the total portfolio variance, especially in cases when the benchmark portfolio is relatively inefficient:
maximize    𝛼ᵀx
subject to  (x − x_bm)ᵀΣ(x − x_bm) ≤ 𝛾²_TE,
            xᵀΣx ≤ 𝛾²,                                   (7.2)
            x ∈ ℱ.
After modeling the absolute value (see Sec. 10.1.1), we can formulate the benchmark tracking problem with this measure as an LO model:

maximize_{x,t}  𝛼ᵀx
subject to      (1/𝑇) Σ_{𝑘=1}^{𝑇} 𝑡ₖ ≤ 𝛾,
                t + Rᵀ(x − x_bm) ≥ 0,                    (7.3)
                t − Rᵀ(x − x_bm) ≥ 0,
                x ∈ ℱ,
where R is the return data matrix with one observation r𝑘 , 𝑘 = 1, . . . , 𝑇 in each column.
If the investor perceives risk as portfolio return being below the benchmark return,
we can also use downside deviation measures. One example is the lower partial moment
LPM₁ measure: E(max(−𝑅_x + 𝑅_xbm, 0)). The scenario approximation of this expectation will be (1/𝑇) Σ_{𝑘=1}^{𝑇} max(−rₖᵀ(x − x_bm), 0). After modeling the maximum (see Sec. 10.1.1) by defining a new variable 𝑡ₖ⁻ = max(−rₖᵀ(x − x_bm), 0), we can solve the problem as an LO:

maximize    𝛼ᵀx
subject to  (1/𝑇) Σ_{𝑘=1}^{𝑇} 𝑡ₖ⁻ ≤ 𝛾,
            t⁻ ≥ −Rᵀ(x − x_bm),                          (7.4)
            t⁻ ≥ 0,
            x ∈ ℱ.
Note that for a symmetric portfolio return distribution, this will be equivalent to the
MAD model.
Linear models might also be preferable because of their more intuitive interpretation. When we measure the tracking error with a linear function, the unit of the objective function is percentage instead of squared percentage.
7.5 Example
Here we show an example of a benchmark relative optimization problem. The benchmark
will be the equally weighted portfolio of the eight stocks from the previous examples,
therefore xbm = 1/𝑁 . The benchmark is created by the following code by aggregating
the price data of the eight stocks:
# Create benchmark
df_prices['bm'] = df_prices.iloc[:-2, 0:8].mean(axis=1)
Then we follow the same Monte Carlo procedure as in Sec. 5.4.1, just with the bench-
mark instead of the market factor. This will yield scenarios of linear returns on the
investment time horizon of ℎ = 1 year, so that we can compute estimates 𝛼 and 𝛽 of 𝛼
and 𝛽 using time-series regression.
In the Fusion model, we make the following modifications:
• We define the active holdings variable xa = x − xbm by
# Active holdings
xa = Expr.sub(x, xbm)
• The risk constraint is now imposed on the active holdings:
# Conic constraint for the portfolio variance
M.constraint('risk', Expr.vstack(s, 1, Expr.mul(G.T, xa)),
             Domain.inRotatedQCone())
• We also specify bounds on the active holdings and on the portfolio active beta:
# Constraint on active holdings
M.constraint('ubound-h', Expr.sub(uh, xa), Domain.greaterThan(0.0))
M.constraint('lbound-h', Expr.sub(xa, lh), Domain.greaterThan(0.0))
The complete Fusion model of the optimization problem (7.1) will then be
def EfficientFrontier(N, a, B, G, xbm, deltas, uh, ub, lh, lb):
# Variables
# The variable x is the fraction of holdings in each security.
# x must be positive, imposing the no short-selling constraint.
x = M.variable("x", N, Domain.greaterThan(0.0))
# Active holdings
xa = Expr.sub(x, xbm)
# Budget constraint
M.constraint('budget_x', Expr.sum(x), Domain.equalsTo(1))
(continued from previous page)
# Constraint on portfolio active beta
port_act_beta = Expr.sub(Expr.dot(B, x), 1)
M.constraint('ubound-b', Expr.sub(ub, port_act_beta),
Domain.greaterThan(0.0))
M.constraint('lbound-b', Expr.sub(port_act_beta, lb),
Domain.greaterThan(0.0))
# Solve optimization
M.solve()
# Save results
portfolio_return = a @ x.level()
portfolio_risk = np.sqrt(2 * s.level()[0])
row = pd.Series([d, M.primalObjValue(), portfolio_return,
portfolio_risk]
+ list(x.level()), index=columns)
df_result = df_result.append(row, ignore_index=True)
return df_result
We give the input parameters and compute the efficient frontier using the following
code:
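The parameter block is not shown here; a sketch with assumed values for the bounds:
# Hypothetical bounds: active holdings within +-5%, active beta within +-0.2
uh = 0.05 * np.ones(N)
lh = -0.05 * np.ones(N)
ub = 0.2
lb = -0.2
deltas = np.logspace(start=-1, stop=2, num=20)[::-1]
df_result = EfficientFrontier(N, a, B, G, xbm, deltas, uh, ub, lh, lb)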
# Zero out negligibly small weights for readability
# (mask is defined in code omitted above)
df_result[mask] = 0
df_result
On Fig. 7.1 we plot the efficient frontier and on Fig. 7.2 the portfolio composition.
On the latter we see that as the tracking error decreases, the portfolio gets closer to the
benchmark, i. e., the equal-weighted portfolio.
Fig. 7.2: Portfolio composition x with varying level of risk-aversion 𝛿.
Chapter 8

Risk measures

In the definitions (2.1)-(2.4) of the classical MVO problem the variance (or standard
deviation) of portfolio return is used as the measure of risk, making computation easy
and convenient. If we assume that the portfolio return is normally distributed, then the
variance is in fact the optimal risk measure, because then MVO can take into account all
information about return. Moreover, if 𝑇 is large compared to 𝑁 , the sample mean and
covariance matrix are maximum likelihood estimates (MLE), implying that the optimal
portfolio resulting from estimated inputs will also be a MLE of the true optimal portfolio
[Lau01].
Empirical observations suggest, however, that the distribution of linear return is often skewed, and even if it is elliptically symmetric, it can have fat tails and can exhibit tail dependence¹. As the distribution moves away from normality, the performance of a variance-based portfolio estimator can quickly degrade.
No perfect risk measure exists, though. Depending on the distribution of return,
different measures can be more appropriate in different situations, capturing different
characteristics of risk. Returns of diversified equity portfolios are often approximately symmetric over periods of institutional interest. Options, swaps, hedge funds, and private equity have return distributions that are unlikely to be symmetric. Also less symmetric are the distributions of fixed-income and real estate index returns, and of diversified equity portfolios over long time horizons [MM08].
The measures presented here are expectations, meaning that optimizing them would
lead to stochastic optimization problems in general. As a simplification, we consider only
their sample average approximation by assuming that a set of linear return scenarios are
given.
¹ Tail dependence between two random variables means that their correlation is greater for more extreme events. It is typical in financial markets, where extreme events are likely to happen together for multiple securities.
8.1 Deviation measures
This class of measures quantify the variability around the mean of the return distribution,
similar to variance. If we associate investment risk with the uncertainty of return, then
such measures could be ideal.
Let 𝑋, 𝑌 be random variables of investment returns, and let 𝐷 be a mapping. The
general properties that deviation measures should satisfy are [RUZ06]:
1. Positivity: 𝐷(𝑋) ≥ 0 with 𝐷(𝑋) = 0 only if 𝑋 = 𝑐 ∈ R,
2. Translation invariance: 𝐷(𝑋 + 𝑐) = 𝐷(𝑋), 𝑐 ∈ R,
3. Subadditivity: 𝐷(𝑋 + 𝑌 ) ≤ 𝐷(𝑋) + 𝐷(𝑌 ),
4. Positive homogeneity: 𝐷(𝑐𝑋) = 𝑐𝐷(𝑋), 𝑐 ≥ 0.
Note that positive homogeneity and subadditivity imply the convexity of 𝐷.
Variance
Choosing 𝛿_var(𝑦) = ½𝑦² we get 𝐷_𝛿var, a different expression for the portfolio variance. The optimal 𝑞 in this case is the portfolio mean return 𝜇_x. This gives us a way to perform standard mean–variance optimization directly using scenario data, without the need for a sample covariance matrix:

minimize    (1/2𝑇) Σ_{𝑘=1}^{𝑇} (xᵀrₖ − 𝑞)²               (8.1)
subject to  x ∈ ℱ.
By defining the new variables 𝑡ₖ⁺ = max(xᵀrₖ − 𝑞, 0) and 𝑡ₖ⁻ = max(−xᵀrₖ + 𝑞, 0) and modeling them according to Sec. 10.1.1, we can turn (8.1) into a standard conic problem.
𝛼-shortfall
Choosing 𝛿_sf,𝛼(𝑦) = 𝛼𝑦 − min(𝑦, 0), or equivalently 𝛿_sf,𝛼(𝑦) = 𝛼·max(𝑦, 0) + (1 − 𝛼)·max(−𝑦, 0), for 𝛼 ∈ (0, 1) will give the 𝛼-shortfall 𝐷_𝛿sf,𝛼 studied in [Lau01]. The
optimal 𝑞 will be the 𝛼-quantile of the portfolio return. 𝛼-shortfall is a deviation mea-
sure with favorable robustness properties when the portfolio return has an asymmetric
distribution, fat tailed symmetric distribution, or exhibits multivariate tail-dependence.
The 𝛼 parameter adjusts the level of asymmetry in the 𝛿 function, allowing us to penalize
upside and downside returns differently.
Portfolio optimization using the sample 𝛼-shortfall can be formulated as

minimize    (𝛼/𝑇) Σ_{𝑘=1}^{𝑇} (xᵀrₖ − 𝑞) − (1/𝑇) Σ_{𝑘=1}^{𝑇} min(xᵀrₖ − 𝑞, 0)    (8.3)
subject to  x ∈ ℱ.
By defining the new variables 𝑡ₖ⁻ = max(−xᵀrₖ + 𝑞, 0) and modeling them according to Sec. 10.1.1, we arrive at an LO model:

minimize    𝛼(xᵀ𝜇 − 𝑞) + (1/𝑇) Σ_{𝑘=1}^{𝑇} 𝑡ₖ⁻
subject to  𝑡ₖ⁻ ≥ −xᵀrₖ + 𝑞, 𝑘 = 1, …, 𝑇,                (8.4)
            t⁻ ≥ 0,
            x ∈ ℱ.
A drawback of this model is that the number of constraints depends on the sample size 𝑇, resulting in computational inefficiency for a large number of samples.
For elliptically symmetric return distributions the 𝛼-shortfall is equivalent to the variance in the sense that the corresponding sample portfolio estimators will be estimating the same portfolio. In fact, for the normal distribution the 𝛼-shortfall is proportional to the portfolio standard deviation: 𝐷_𝛿sf,𝛼 = 𝜑(𝑞_𝛼)√(2𝐷_𝛿var), where 𝜑(𝑞_𝛼) is the value of the standard normal density at the 𝛼-quantile 𝑞_𝛼.
MAD
A special case of the 𝛼-shortfall is 𝛼 = 0.5. The function 𝛿_MAD(𝑦) = ½|𝑦| gives us 𝐷_𝛿MAD, which is called the mean absolute deviation (MAD) measure or the 𝐿₁-risk. In this case the optimal 𝑞 will be the median of the portfolio return, and the sample MAD portfolio optimization problem can be formulated as

minimize    (1/2𝑇) Σ_{𝑘=1}^{𝑇} |xᵀrₖ − 𝑞|                (8.5)
subject to  x ∈ ℱ.
After modeling the absolute value based on Sec. 10.1.1 we arrive at the following LO:

minimize    (1/2𝑇) Σ_{𝑘=1}^{𝑇} 𝑡ₖ
subject to  Rᵀx − 𝑞1 ≥ −t,
            Rᵀx − 𝑞1 ≤ t,                                (8.6)
            x ∈ ℱ,

where R is the return data matrix with one observation rₖ, 𝑘 = 1, …, 𝑇 in each column.
Note that the number of constraints in this LO problem again depends on the sample
size 𝑇 .
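As an illustration, the core of model (8.6) could be written in Fusion along these lines (a minimal sketch, assuming the Fusion model M, the portfolio variable x, and the N × T scenario matrix R as above):
# Deviation bounds t_k and the free location parameter q
t = M.variable("t", T, Domain.unbounded())
q = M.variable("q", 1, Domain.unbounded())
# Deviations R'x - q of the scenario returns from q
dev = Expr.sub(Expr.mul(R.T, x), Var.repeat(q, T))
M.constraint('mad-pos', Expr.add(dev, t), Domain.greaterThan(0.0))
M.constraint('mad-neg', Expr.sub(t, dev), Domain.greaterThan(0.0))
# Objective: (1/2T) sum_k t_k
M.objective('obj', ObjectiveSense.Minimize, Expr.mul(1.0 / (2 * T), Expr.sum(t)))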
71
√︁ For normally distributed returns, the MAD is proportional to the variance: 𝐷𝛿𝑀 𝐴𝐷 =
𝐷𝛿var .
2
√︀
𝜋
The 𝐿₁-risk can also be applied without minimizing over 𝑞. We can just let 𝑞 be the sample portfolio mean instead, i.e., 𝑞 = xᵀ𝜇 [KY91].
Huber

Another choice of 𝛿 is the Huber function, yielding the risk measure 𝐷_𝛿H. A different form of the Huber function is 𝛿_H(𝑦) = min_𝑢 (𝑢² + 2𝑐|𝑦 − 𝑢|), which leads to a QO formulation:

minimize    (1/𝑇) Σ_{𝑘=1}^{𝑇} 𝑢ₖ² + (1/𝑇) Σ_{𝑘=1}^{𝑇} 2𝑐|xᵀrₖ − 𝑞 − 𝑢ₖ|    (8.7)
subject to  x ∈ ℱ.

Note that the size of this problem depends on the number of samples 𝑇.
Lower partial moments

The lower partial moment of order 𝑛 is defined as LPM_𝑛(𝑋) = E(max(𝑟bm − 𝑋, 0)^𝑛), where 𝑟bm is some given target return. The discrete version of this measure is Σ_{𝑘=1}^{𝑇} 𝑝ₖ max(𝑟bm − xᵀrₖ, 0)^𝑛, where rₖ is a scenario of the portfolio return 𝑅_x occurring with probability 𝑝ₖ.
We have the following special cases:
• LPM0 is the probability of loss relative to the target return 𝑟bm . LPM0 is an
incomplete measure of risk, because it does not provide any indication of how
severe the shortfall can be, should it occur. Therefore it is best used as a constraint
while optimizing for a different risk measure. This way LPM0 can still provide
information about the risk tolerance of the investor.
• LPM2 is called target semi-variance.
While lower partial moments only consider outcomes below the target 𝑟bm , the op-
timization still uses the entire distribution of 𝑅x . The right tail of the distribution
(representing the outcomes above the target) is captured in the mean 𝜇x of the distribu-
tion.
The LPM_𝑛 optimization problem can be formulated as [Lau01]

minimize    Σ_{𝑘=1}^{𝑇} 𝑝ₖ max(𝑟bm − xᵀrₖ, 0)^𝑛          (8.9)
subject to  x ∈ ℱ.

If we define the new variables 𝑡ₖ⁻ = max(𝑟bm − xᵀrₖ, 0), then for 𝑛 = 1 we arrive at an LO problem, and for 𝑛 = 2 at a conic quadratic problem.
8.2.1 Value-at-Risk
Denote the 𝛼-quantile of a random variable 𝐿 by 𝑞𝛼 (𝐿) = inf{𝑥|P(𝐿 ≤ 𝑥) ≥ 𝛼}. If 𝐿
is the loss over a given time horizon, then the value-at-risk (VaR) of 𝐿 with confidence
level 𝛼 (or risk level 1 − 𝛼) is defined to be

VaR_𝛼(𝐿) = 𝑞_𝛼(𝐿).
This is the amount of loss (a positive number) over the given time horizon that will
not be exceeded with probability 𝛼. However, VaR does not give information about the
magnitude of loss in case of the 1−𝛼 probability event when losses are greater than 𝑞𝛼 (𝐿).
Also it is not a coherent risk measure (it does not respect the convexity property), meaning it can discourage diversification.² In other words, the VaR of a portfolio may exceed the sum of the VaRs of the individual securities.
8.2.2 Conditional Value-at-Risk

The conditional value-at-risk (CVaR) is a linear combination of VaR and the quantity called the mean shortfall. The latter is also not coherent on its own.
If the distribution function P(𝐿 ≤ 𝑞) is continuous at 𝑞𝛼 (𝐿) then 𝜆𝛼 = 0. If there is
discontinuity and we have to account for a probability mass concentrated at 𝑞𝛼 (𝐿), then
𝜆𝛼 is nonzero in general. This is often the case in practice, when the loss distribution is
discrete, for example for scenario based approximations.
where 𝜆_𝛼 = (1/(1−𝛼)) (Σ_{𝑖=1}^{𝑖_𝛼} 𝑝ᵢ − 𝛼). As a special case, if we assume that 𝑝ᵢ = 1/𝑇, i.e., we have a sample average approximation, then the above formula simplifies with 𝑖_𝛼 = ⌈𝛼𝑇⌉.
It can be seen that VaR𝛼 (𝐿) ≤ CVaR𝛼 (𝐿) always holds. CVaR is also consistent with
second-order stochastic dominance (SSD), i. e., the CVaR efficient portfolios are the ones
actually preferred by some wealth-seeking risk-averse investors.
² VaR only becomes coherent when the return is normally distributed, but in this case it is proportional to the standard deviation.
Portfolio optimization with CVaR
If we substitute portfolio loss scenarios into formula (8.11), we can see that the quantile
(−RT x)𝑖𝛼 will depend on x. It follows that the ordering of the scenarios and the index 𝑖𝛼
will also depend on x, making it difficult to optimize. However, note that formula (8.11) is actually a linear combination of the largest elements of the vector q. We can thus apply Sec. 10.1.1 to get the dual form of CVaR_𝛼(𝐿), which is an LO problem:
minimize    𝑡 + (1/(1−𝛼)) pᵀu
subject to  u + 𝑡 ≥ q,                                   (8.12)
            u ≥ 0.
Note that problem (8.12) is equivalent to

min_𝑡  𝑡 + (1/(1−𝛼)) Σᵢ 𝑝ᵢ max(0, 𝑞ᵢ − 𝑡).               (8.13)

The convex function (8.13) is exactly the one found in [RU00], where it is also proven to be valid for continuous probability distributions as well.
Now we can substitute the portfolio return into q, and optimize over x to find the
portfolio that minimizes CVaR:
minimize    𝑡 + (1/(1−𝛼)) pᵀu
subject to  u + 𝑡 ≥ −Rᵀx,                                (8.14)
            u ≥ 0,
            x ∈ ℱ.
Because CVaR is represented as a convex function in formula (8.13), we can also formulate
an LO to maximize expected return, while limiting risk in terms of CVaR:
maximize    𝜇ᵀx
subject to  𝑡 + (1/(1−𝛼)) pᵀu ≤ 𝛾,
            u + 𝑡 ≥ −Rᵀx,                                (8.15)
            u ≥ 0,
            x ∈ ℱ.
The drawback of optimizing CVaR using problems (8.14) or (8.15) is that both the number of variables and the number of constraints depend on the number of scenarios 𝑇. This can make the LO model computationally expensive for a very large number of samples. For example, if the distribution of return is not known, we might need to obtain or simulate a substantial number of samples, depending on the confidence level 𝛼.
EVaR for discrete distribution

Based on the definition (8.16), the discrete version of the EVaR will be

EVaR_𝛼(𝐿) = inf_{𝑠>0} (1/𝑠) log( (Σᵢ 𝑝ᵢ e^{𝑠𝑞ᵢ}) / (1−𝛼) ).    (8.17)
We can transform formula (8.18) into a conic optimization problem by substituting the first term of the objective with a new variable 𝑧 and adding a corresponding new constraint. Then we apply the rule of Sec. 10.1.1.
Because EVaR is represented as a convex function (8.18), we can also formulate a conic
problem to maximize expected return, while limiting risk in terms of EVaR:
maximize    𝜇ᵀx
subject to  𝑧 − 𝑡 log(1 − 𝛼) ≤ 𝛾,
            Σ_{𝑖=1}^{𝑇} 𝑝ᵢ𝑢ᵢ ≤ 𝑡,
            (𝑢ᵢ, 𝑡, −rᵢᵀx − 𝑧) ∈ 𝐾exp, 𝑖 = 1, …, 𝑇,      (8.21)
            𝑡 ≥ 0,
            x ∈ ℱ.
A disadvantage of the EVaR conic model is that it still depends on 𝑇 in the number of exponential cone constraints.
Note that if we assume the return distribution to be a Gaussian mixture, we can find
a different approach to computing EVaR. See in Sec. 8.3.2.
8.2.4 Relationship between risk measures
Suppose that 𝜏 (𝑋) is a coherent tail risk measure that additionally satisfies 𝜏 (𝑋) >
−E(𝑋). Suppose also that 𝐷(𝑋) is a deviation measure that additionally satisfies
𝐷(𝑋) ≤ E(𝑋) for all 𝑋 ≥ 0. Then these subclasses of risk measures can be related
through the following identities:
𝐷(𝑋) = 𝜏(𝑋 − E(𝑋)),
𝜏(𝑋) = 𝐷(𝑋) − E(𝑋).                                     (8.22)

For example, the 𝛼-shortfall and CVaR are related this way. Details can be found in [RUZ06].
77
8.3.1 Optimal portfolio using Gaussian mixture return
In [LB22] an expected utility maximization approach is taken, assuming that the return
distribution is a Gaussian mixture (GM). The benefits of a GM distribution are that it
can approximate any continuous distribution, including skewed and fat-tailed ones. Also
its components can be interpreted as return distributions given a specific market regime.
Moreover, the expected utility maximization using this return model can be formulated as
a convex optimization problem without needing return scenarios, making this approach
as efficient and scalable as MVO.
We denote security returns having Gaussian mixture (GM) distribution with 𝐾 com-
ponents as
𝑅 ∼ 𝒢ℳ({𝜇ᵢ, Σᵢ, 𝜋ᵢ}_{𝑖=1}^{𝐾}),                          (8.24)

where 𝜇ᵢ is the mean, Σᵢ is the covariance matrix, and 𝜋ᵢ is the probability of component 𝑖. As special cases, for 𝐾 = 1 we get the normal distribution 𝒩(𝜇₁, Σ₁), and for Σᵢ = 0, 𝑖 = 1, …, 𝐾 we get a scenario distribution with 𝐾 return scenarios {𝜇₁, …, 𝜇_𝐾}.
Using definition (8.24), the distribution of the portfolio return 𝑅_x will also be GM:

𝑅_x ∼ 𝒢ℳ({𝜇_x,ᵢ, 𝜎²_x,ᵢ, 𝜋ᵢ}_{𝑖=1}^{𝐾}),                 (8.25)

where 𝜇_x,ᵢ = 𝜇ᵢᵀx and 𝜎²_x,ᵢ = xᵀΣᵢx.
To select the optimal portfolio we use the exponential utility function 𝑈_𝛿(𝑥) = 1 − 𝑒^{−𝛿𝑥}, where 𝛿 > 0 is the risk aversion parameter. This choice allows us to write the expected utility of the portfolio return 𝑅_x using the moment generating function of the GM distribution.
The function (8.28) is convex in x because it is a composition of the convex and increasing
log-sum-exp function and a convex quadratic function. Note that for 𝐾 = 1, we get back
the same quadratic utility objective as in version (2.3) of the MVO problem.
Assuming we have the GM distribution parameter estimates 𝜇𝑖 , Σ𝑖 , 𝑖 = 1, . . . , 𝐾,
we can apply Sec. 10.1.1 and Sec. 10.1.1 to arrive at the conic model of the utility
maximization problem (8.27):
minimize    𝑧
subject to  Σ_{𝑖=1}^{𝐾} 𝜋ᵢ𝑢ᵢ ≤ 1,
            (𝑢ᵢ, 1, −𝛿𝜇ᵢᵀx + 𝛿²𝑞ᵢ − 𝑧) ∈ 𝐾exp, 𝑖 = 1, …, 𝐾,    (8.29)
            (𝑞ᵢ, 1, Gᵢᵀx) ∈ 𝒬_r^{𝑁+2}, 𝑖 = 1, …, 𝐾,
            x ∈ ℱ,

where 𝑞ᵢ is an upper bound for the quadratic term ½xᵀΣᵢx, and Gᵢ is such that Σᵢ = GᵢGᵢᵀ.
8.3.2 EVaR of Gaussian mixture return

Under the GM return model, the EVaR optimal portfolio can be found by solving

minimize_{x,𝑠>0}  (1/𝑠) 𝐾_{𝑅x}(−𝑠) − (1/𝑠) log(1 − 𝛼)
subject to        x ∈ ℱ,                                 (8.31)
We can find a connection between the EVaR computation (8.31) and the maximization of expected exponential utility (8.27). Suppose that the pair (x*, 𝑠*) is optimal for problem (8.31). Then x* is also optimal for problem (8.27), with risk aversion parameter 𝛿 = 𝑠*.
By assuming a GM distribution for security returns, we can optimize problem (8.31) without needing a scenario distribution. First, to formulate it as a convex optimization problem, define the new variable 𝑡 = 1/𝑠:

EVaR_𝛼(𝐿) = inf_{𝑡>0} 𝑡𝐾_{𝑅x}(−1/𝑡) − 𝑡 log(1 − 𝛼).      (8.32)
We can observe that the first term in formula (8.32) is the perspective of 𝐾_{𝑅x}(−1). Therefore by substituting the cumulant generating function (8.28) of the GM portfolio return, we get a convex function in the portfolio x:

EVaR_𝛼(𝐿) = inf_{𝑡>0} 𝑡 log( Σ_{𝑖=1}^{𝐾} 𝜋ᵢ exp(−(1/𝑡)𝜇ᵢᵀx + (1/2𝑡²)xᵀΣᵢx) ) − 𝑡 log(1 − 𝛼).    (8.33)
minimize    𝑧 − 𝑡 log(1 − 𝛼)
subject to  Σ_{𝑖=1}^{𝐾} 𝜋ᵢ𝑢ᵢ ≤ 𝑡,
            (𝑢ᵢ, 𝑡, −𝜇ᵢᵀx + 𝑞ᵢ − 𝑧) ∈ 𝐾exp, 𝑖 = 1, …, 𝐾,
            (𝑞ᵢ, 1, Gᵢᵀx) ∈ 𝒬_r^{𝑁+2}, 𝑖 = 1, …, 𝐾,      (8.34)
            𝑡 ≥ 0,
            x ∈ ℱ,

where 𝑞ᵢ is an upper bound for the quadratic term ½xᵀΣᵢx, and Gᵢ is such that Σᵢ = GᵢGᵢᵀ.
The huge benefit of this EVaR formulation is that its size does not depend on the num-
ber of scenarios, because it is derived without using a scenario distribution. It depends
only on the number of GM components 𝐾.
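A sketch of model (8.34) in Fusion, assuming the GM parameter estimates are given as data: pi_ (length K), mu_ (K × N array), and a list G_ of K Cholesky factors; x is the portfolio variable as before:
z = M.variable("z", 1, Domain.unbounded())
t = M.variable("t", 1, Domain.greaterThan(0.0))
u = M.variable("u", K, Domain.unbounded())
q = M.variable("q", K, Domain.unbounded())
# sum_i pi_i u_i <= t
M.constraint('evar-mix', Expr.sub(Expr.dot(pi_, u), t), Domain.lessThan(0.0))
for i in range(K):
    # (u_i, t, -mu_i'x + q_i - z) in the exponential cone
    arg = Expr.sub(Expr.sub(q.index(i), Expr.dot(mu_[i], x)), z)
    M.constraint(Expr.vstack(u.index(i), t, arg), Domain.inPExpCone())
    # (q_i, 1, G_i'x) in the rotated quadratic cone
    M.constraint(Expr.vstack(q.index(i), 1.0, Expr.mul(G_[i].T, x)),
                 Domain.inRotatedQCone())
# minimize z - t*log(1 - alpha)
M.objective(ObjectiveSense.Minimize,
            Expr.add(z, Expr.mul(-np.log(1 - alpha), t)))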
8.4 Example
This example shows how can we compute the CVaR efficient frontier using the dual form
of CVaR in MOSEK Fusion.
Next, we convert the received logarithmic return scenarios to linear return scenarios
using the inverse of formula (3.2).
# Convert logarithmic return scenarios to linear return scenarios
R = np.exp(scenarios_log) - 1
R = R.T
We transpose the resulting matrix just to remain consistent with the notation in this
chapter, namely that each column of R is a scenario. We also set the scenario probabilities
to be 𝑇1 :
# Scenario probabilities
p = np.ones(T) / T
Now we model the maximum function, and arrive at the following LO model of the
mean-CVaR efficient frontier:
maximize    𝜇ᵀx − 𝛿𝑡 − (𝛿/(1−𝛼)) Σᵢ 𝑝ᵢ𝑢ᵢ
subject to  u ≥ −Rᵀx − 𝑡,
            u ≥ 0,                                       (8.37)
            1ᵀx = 1,
            x ≥ 0.
# Variables
# The variable x is the fraction of holdings relative to the initial capital.
x = M.variable("x", N, Domain.greaterThan(0.0))

# Budget constraint
M.constraint('budget', Expr.sum(x), Domain.equalsTo(1.0))

# Auxiliary variables.
t = M.variable("t", 1, Domain.unbounded())
u = M.variable("u", T, Domain.unbounded())

# CVaR constraint: u + R'x + t >= 0
M.constraint('cvar', Expr.add([u, Expr.mul(R.T, x), Var.repeat(t, T)]),
             Domain.greaterThan(0.0))

# Objective
delta = M.parameter()
cvar_term = Expr.add(t, Expr.mul(1/(1-alpha), Expr.dot(p, u)))
M.objective('obj', ObjectiveSense.Maximize,
            Expr.sub(Expr.dot(m, x), Expr.mul(delta, cvar_term)))
(continued from previous page)
df_result = pd.DataFrame(columns=columns)
for d in deltas:
    # Update parameter
    delta.setValue(d)

    # Solve optimization
    M.solve()

    # Save results
    portfolio_return = m @ x.level()
    portfolio_risk = t.level()[0] + 1/(1-alpha) * np.dot(p, u.level())
    row = pd.Series([d, M.primalObjValue(), portfolio_return, portfolio_risk]
                    + list(x.level()), index=columns)
    df_result = df_result.append(row, ignore_index=True)

return df_result
Next, we compute the efficient frontier. We select the confidence level 𝛼 = 0.95. The
following code produces the optimization results:
alpha = 0.95
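The remaining driver code is not shown; a sketch (the range of deltas and the function signature are hypothetical):
deltas = np.logspace(start=-0.5, stop=2, num=20)[::-1]
df_result = EfficientFrontier(N, T, m, R, p, alpha, deltas)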
On Fig. 8.1 we can see the risk-return plane, and on Fig. 8.2 the portfolio composition
for different levels of risk.
Fig. 8.1: The CVaR efficient frontier.
Chapter 9
Risk budgeting
Traditional MVO only considers the risk of the portfolio and ignores how the risk is
distributed among individual securities. This might result in risk being concentrated into
only a few securities, when these gain too much weight in the portfolio.
Risk budgeting techniques were developed to give portfolio managers more control over
the amount of risk each security contributes to the total portfolio risk. They allocate the
risk instead of the capital, thus diversifying risk more directly. Risk budgeting also makes the resulting portfolio less sensitive to estimation errors, because it excludes the use of expected return estimates [CK20], [Pal20].
Working with this form is typically easier. After solving these equations, we can recover
the portfolio x by normalizing again.
where Σ = GGT for G ∈ R𝑁 ×𝑘 .
Note that we have to be careful adding further constraints to problem (9.6), because
these might make the risk budgeting portfolio infeasible.
minimize    ½xᵀΣx − 𝑐bᵀ log(z ∘ x)                       (9.7)
subject to  z ∘ x ≥ 0.
We can use the parameter 𝑐 to scale the sum of the vector b, leading to a different magnitude of the optimal solution. Thus we can directly find a solution that satisfies, for example, Σᵢ 𝑥ᵢ = 1, or in the long-short case Σᵢ |𝑥ᵢ| = 1.
Problem (9.7) can be modeled using the exponential cone in the following way:
minimize    𝑠 − 𝑐bᵀt
subject to  (𝑠, 1, Gᵀx) ∈ 𝒬_r^{𝑘+2},
            (𝑧ᵢ𝑥ᵢ, 1, 𝑡ᵢ) ∈ 𝐾exp, 𝑖 = 1, …, 𝑁,           (9.8)
            z ∘ x ≥ 0.
portfolio, we get:

minimize    ½xᵀΣx − 𝑐bᵀ log(x⁺ + x⁻)
subject to  x⁺, x⁻ ≥ 0,
            x = x⁺ − x⁻,
            x⁺ ≤ 𝑀z⁺,                                    (9.9)
            x⁻ ≤ 𝑀z⁻,
            z⁺ + z⁻ ≤ 1,
            z⁺, z⁻ ∈ {0, 1}ᴺ.
What makes sure that problem (9.9) will find a low risk solution? The first term of the objective function is the sum of risk budgets, thus in any optimal solution we must have √(xᵀΣx) = 𝑐, because the risk budgeting conditions are satisfied. It follows that the second term will decide the optimal objective value. This logarithmic term becomes lowest when the monomial |x|^{𝑐b} = ‖x‖₁^𝑐 |x̂|^{𝑐b} is largest, where x̂ denotes a unit vector. Depending on b, this implies an optimal x with large 1-norm. It follows that after normalization, this x will yield a low value for the total risk xᵀΣx. Note however, that it is not guaranteed to get the lowest risk, because |x̂|^{𝑐b} can also vary.
A further tradeoff with the mixed-integer approach is that we have no performance guarantee: finding the optimal solution can take a long time. Most likely we will have to settle for portfolios that are suboptimal in terms of risk.
9.4 Example
In this example we show how we can find a risk parity portfolio by solving a convex optimization problem in MOSEK Fusion.
The input data is again obtained the same way as detailed in Sec. 3.4.2. Here we
assume that the covariance matrix estimate Σ of yearly returns is available.
The optimization problem we solve here is (9.8), which we repeat here:
minimize    𝑠 − 𝑐bᵀt
subject to  (𝑠, 1, Gᵀx) ∈ 𝒬_r^{𝑘+2},
            (𝑧ᵢ𝑥ᵢ, 1, 𝑡ᵢ) ∈ 𝐾exp, 𝑖 = 1, …, 𝑁,           (9.10)
            z ∘ x ≥ 0.
We search for a solution in the positive orthant, so we set z = 1. We choose the risk
budget vector to be b = 1/𝑁 , so that all securities contribute the same amount of risk.
The Fusion model of (9.10):
# Portfolio weights
x = M.variable("x", N, Domain.unbounded())
(continued from previous page)
M.constraint("orthant", Expr.mulElm(z, x), Domain.greaterThan(0.0))
# Auxiliary variables
t = M.variable("t", N, Domain.unbounded())
s = M.variable("s", 1, Domain.unbounded())
# Solve optimization
M.solve()
# Save results
xv = x.level()
return df_result
The following code defines the parameters, including the matrix G such that Σ = GGᵀ.
# Number of securities
N = 8
# Risk budget
b = np.ones(N) / N
# Orthant selector
z = np.ones(N)
# Global setting for sum of b
c = 1
# Cholesky factor of the covariance matrix S
G = np.linalg.cholesky(S)
df_result = RiskBudgeting(N, G, b, z, c)
On Fig. 9.1 we can see the portfolio composition, and on Fig. 9.2 the risk contributions
of each security.
Fig. 9.2: The risk contributions of each security.
Chapter 10
Appendix
10.1 Conic optimization refresher

Modeling with these two cones covers the class of SOCO problems, which includes all traditional QO and QCQO problems as well. See Sec. 10.1.2 for more details.
• Primal power cone:

𝒫ₙ^{𝛼,1−𝛼} = { 𝑥 ∈ Rⁿ | 𝑥₁^𝛼 𝑥₂^{1−𝛼} ≥ √(𝑥₃² + ⋯ + 𝑥ₙ²), 𝑥₁, 𝑥₂ ≥ 0 },
• Primal exponential cone:

𝐾exp = { 𝑥 ∈ R³ | 𝑥₁ ≥ 𝑥₂ exp(𝑥₃/𝑥₂), 𝑥₁, 𝑥₂ ≥ 0 },

or its dual cone.
• Semidefinite cone: the cone of symmetric positive semidefinite matrices, 𝒮₊ⁿ = { 𝑋 ∈ 𝒮ⁿ | zᵀ𝑋z ≥ 0 for all z ∈ Rⁿ }.
Maximum function
We can model the maximum constraint max(𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) ≤ 𝑐 using linear constraints
by introducing an auxiliary variable 𝑡:
𝑡 ≤ 𝑐,
𝑡 ≥ 𝑥ᵢ, 𝑖 = 1, …, 𝑛.                                     (10.2)
Similarly, the vector constraint max(x, 0) ≤ c can be modeled as

t ≤ c,  t ≥ x,  t ≥ 0.                                   (10.3)
Positive and negative part

To model the positive part 𝑥⁺ = max(𝑥, 0) and the negative part 𝑥⁻ = max(−𝑥, 0) of a variable 𝑥, we can either apply the maximum function explicitly, or use the implicit formulation

𝑥 = 𝑥⁺ − 𝑥⁻,
𝑥⁺, 𝑥⁻ ≥ 0.                                              (10.4)
Note however, that in either case some freedom remains in the magnitudes of 𝑥⁺ and 𝑥⁻. This is because in the explicit case we relaxed the equalities, and in the implicit case only the difference of the variables is constrained. In other words, it is possible for 𝑥⁺ and 𝑥⁻ to be both positive, allowing optimal solutions where 𝑥⁺ = max(𝑥, 0) and 𝑥⁻ = max(−𝑥, 0) do not hold. We could ensure that only either 𝑥⁺ or 𝑥⁻ is positive
and never both by stating the complementarity constraint 𝑥+ 𝑥− = 0 (or in the vector case
⟨x+ , x− ⟩ = 0), but unfortunately such an equality constraint is non-convex and cannot
be modeled.
We can do workarounds to ensure that equalities 𝑥+ = max(𝑥, 0) and 𝑥− = max(−𝑥, 0)
hold in the optimal solution. One approach is to penalize the magnitude of these two
variables, so that if both are positive in any solution, then the solver could always im-
prove the objective by reducing them until either one becomes zero. Another possible
workaround is to formulate a mixed integer problem; see Sec. 10.2.1.
Absolute value
We can model the absolute value constraint |𝑥| ≤ 𝑐 using the maximum function by
observing that |𝑥| = max(𝑥, −𝑥) (see Sec. 10.1.1):
−𝑐 ≤ 𝑥 ≤ 𝑐.                                              (10.5)

Alternatively, we can use the quadratic cone:

(𝑐, 𝑥) ∈ 𝒬².                                             (10.6)
Sum of largest elements

The sum of the 𝑚 largest elements of a vector x is the optimal value of the LO problem

maximize    xᵀz
subject to  1ᵀz = 𝑚,                                     (10.7)
            0 ≤ z ≤ 1.
Here x cannot be a variable, because that would result in a nonlinear objective. Let us
take the dual of problem (10.7):
minimize 𝑚𝑡 + 1T u
subject to u + 𝑡 ≥ x, (10.8)
u ≥ 0.
Problem (10.8) is actually the same as min_𝑡 𝑚𝑡 + Σᵢ max(0, 𝑥ᵢ − 𝑡). In this case x can be a variable, and thus it can also be optimized.
maximize xT z
subject to 1T z = 𝑐sum − 𝑏, (10.9)
0 ≤ z ≤ c.
This has the optimal objective Σ_{𝑖>𝑖_𝑏} 𝑐ᵢ𝑥ᵢ + 𝑐_{𝑖_𝑏}^{frac} 𝑥_{𝑖_𝑏}, where 𝑖_𝑏 is such that Σ_{𝑖=1}^{𝑖_𝑏−1} 𝑐ᵢ < 𝑏 ≤ Σ_{𝑖=1}^{𝑖_𝑏} 𝑐ᵢ, and 𝑐_{𝑖_𝑏}^{frac} = Σ_{𝑖=1}^{𝑖_𝑏} 𝑐ᵢ − 𝑏 < 𝑐_{𝑖_𝑏}.
If we take the dual of problem (10.9), we get a formulation analogous to (10.8).
Manhattan norm
Let x ∈ R𝑛 and observe that ‖x‖1 = |𝑥1 | + · · · + |𝑥𝑛 |. Then we can model the Manhattan
norm or 1-norm constraint ‖x‖1 ≤ 𝑐 by modeling the absolute value for each coordinate:
−z ≤ x ≤ z,  Σ_{𝑖=1}^{𝑛} 𝑧ᵢ = 𝑐.                         (10.11)
Euclidean norm

Let x ∈ Rⁿ and observe that ‖x‖₂ = √(𝑥₁² + ⋯ + 𝑥ₙ²). Then we can model the Euclidean norm constraint ‖x‖₂ ≤ 𝑐 with the quadratic cone as (𝑐, x) ∈ 𝒬^{𝑛+1}, and the squared Euclidean norm constraint ‖x‖₂² ≤ 𝑐 with the rotated quadratic cone as

(𝑐, ½, x) ∈ 𝒬_r^{𝑛+2}.                                   (10.13)
Quadratic form
Let x ∈ Rⁿ and let Q ∈ 𝒮₊ⁿ, i.e., a symmetric positive semidefinite matrix. Then we can model the quadratic form constraint ½xᵀQx ≤ 𝑐 using either the quadratic cone or the rotated quadratic cone. To see this, observe that there exists a matrix G ∈ Rⁿˣᵏ such that Q = GGᵀ. Of course this decomposition can be done in many ways, so the matrix G is not unique. The most interesting cases are when 𝑘 ≪ 𝑛 or G is very sparse, because these make the optimization problem much easier to solve numerically (see Sec. 10.1.2).
The most common ways to compute the matrix G are the following:
• Matrix square root: Q = Q1/2 Q1/2 , where Q1/2 is the symmetric positive semidefi-
nite “square root” matrix of Q. From this decomposition we have G = Q1/2 ∈ R𝑛×𝑛 .
After obtaining the matrix G, we can write the quadratic form constraint as a sum-of-squares ½xᵀGGᵀx ≤ 𝑐, which is a squared Euclidean norm constraint ½‖Gᵀx‖₂² ≤ 𝑐. We can choose to model this using the rotated quadratic cone as

(𝑐, 1, Gᵀx) ∈ 𝒬_r^{𝑘+2},                                 (10.14)

or we can choose to model its square root using the quadratic cone as

(√(2𝑐), Gᵀx) ∈ 𝒬^{𝑘+1}.                                  (10.15)
Whether to use the quadratic cone or the rotated quadratic cone for modeling can be
decided based on which is more natural. Typically the quadratic cone is used to model
2-norm constraints, while the rotated quadratic cone is more natural for modeling of
quadratic functions. There can be exceptions, however; see for example in Sec. 2.3.
Power
Let 𝑥 ∈ R and 𝛼 > 1. Then we can model the power constraint 𝑐 ≥ |𝑥|^𝛼, or equivalently 𝑐^{1/𝛼} ≥ |𝑥|, using the power cone:

(𝑐, 1, 𝑥) ∈ 𝒫₃^{1/𝛼,(𝛼−1)/𝛼}.                            (10.16)
Exponential

Let 𝑥 ∈ R. Then we can model the exponential constraint 𝑡 ≥ 𝑒^𝑥 using the exponential cone:

(𝑡, 1, 𝑥) ∈ 𝐾exp.                                        (10.17)
Log-sum-exp
Let 𝑥₁, …, 𝑥ₙ ∈ R. Then we can model the log-sum-exp constraint 𝑡 ≥ log(Σ_{𝑖=1}^{𝑛} 𝑒^{𝑥ᵢ}) by applying the rule of Sec. 10.1.1 𝑛 times:

Σ_{𝑖=1}^{𝑛} 𝑢ᵢ ≤ 1,                                      (10.18)
(𝑢ᵢ, 1, 𝑥ᵢ − 𝑡) ∈ 𝐾exp, 𝑖 = 1, …, 𝑛.
Perspective of function
The perspective of a function 𝑓 (𝑥) is defined as 𝑠𝑓 (𝑥/𝑠) on 𝑠 > 0. From any conic
representation of 𝑡 ≥ 𝑓 (𝑥) we can reach a representation of 𝑡 ≥ 𝑠𝑓 (𝑥/𝑠) by substituting
all constants 𝑐 with their homogenized counterpart 𝑠𝑐.
Perspective of log-sum-exp
Let 𝑥₁, …, 𝑥ₙ ∈ R. Then we can model the perspective of the log-sum-exp constraint, 𝑡 ≥ 𝑠 log(Σ_{𝑖=1}^{𝑛} 𝑒^{𝑥ᵢ/𝑠}), by applying the rule of Sec. 10.1.1 to constraint (10.18):

Σ_{𝑖=1}^{𝑛} 𝑢ᵢ ≤ 𝑠,                                      (10.19)
(𝑢ᵢ, 𝑠, 𝑥ᵢ − 𝑡) ∈ 𝐾exp, 𝑖 = 1, …, 𝑛.
Quadratic optimization
The standard form of a quadratic optimization (QO) problem is the following:
minimize    ½xᵀQx + cᵀx
subject to  Ax = b,                                      (10.20)
            x ≥ 0.
The matrix Q ∈ R𝑛×𝑛 must be symmetric positive semidefinite, otherwise the objective
function would not be convex.
Assuming the factorization Q = GGT with G ∈ R𝑛×𝑘 , we can reformulate the problem
(10.20) as a conic problem by applying the method described in Sec. 10.1.1:
minimize    𝑡 + cᵀx
subject to  Ax = b,
            x ≥ 0,                                       (10.21)
            (𝑡, 1, Gᵀx) ∈ 𝒬_r^{𝑘+2}.
Assuming the factorization Q𝑖 = G𝑖 GT𝑖 with G𝑖 ∈ R𝑛×𝑘𝑖 , we can reformulate the
problem (10.22) as a conic problem by applying the method described in Sec. 10.1.1 for
both the objective and the constraints:
minimize    𝑡₀ + c₀ᵀx + 𝑎₀
subject to  𝑡ᵢ + cᵢᵀx + 𝑎ᵢ ≤ 0, 𝑖 = 1, …, 𝑚,             (10.23)
            (𝑡ᵢ, 1, Gᵢᵀx) ∈ 𝒬_r^{𝑘ᵢ+2}, 𝑖 = 0, …, 𝑚.
• The storage requirement 𝑛𝑘 of G can be much lower than the storage requirement 𝑛²/2 of Q.
In summary, the conic equivalents are not only as easy to solve as the original QO or QCQO problems, but in most cases they also need less space and solution time.
10.2 Mixed-integer models

The general form of a mixed-integer conic optimization (MIO) problem is

maximize    cᵀx
subject to  Ax + b ∈ 𝒦,                                  (10.24)
            𝑥ᵢ ∈ Z, 𝑖 ∈ ℐ,
where 𝒦 is a cone and ℐ ⊆ {1, . . . , 𝑛} contains the indices of integer variables. We can
model any finite range for the integer variable 𝑥𝑖 by simply adding the extra constraint
𝑎𝑖 ≤ 𝑥 𝑖 ≤ 𝑏 𝑖 .
10.2.1 Selection of integer constraints
In the following we will list the integer constraints appearing in financial context in this
book and show how to model them.
Switch
In some practical cases we might wish to impose conditions on parts of our optimization
model. For example, allowing nonzero value for a variable only in presence of a condition.
We can model such situations using binary variables (or indicator variables), which can
only take 0 or 1 values. The following set of constraints only allow 𝑥 to be positive when
the indicator variable 𝑦 is equal to 1:
𝑥 ≤ 𝑀 𝑦,
(10.25)
𝑦 ∈ {0, 1}.
The number 𝑀 here is not related to the optimization problem, but it is necessary to
form such a switchable constraint. If 𝑦 = 0, then the upper limit of 𝑥 is 0. If 𝑦 = 1, then
the upper limit of 𝑥 is 𝑀 . This modeling technique is called big-M. The choice of 𝑀
can affect the solution performance, but a nice feature of it is that the problem cannot
accidentally become infeasible.
If we have a vector variable x then the switch will look like:
x ≤ M ∘ y,
(10.26)
y ∈ {0, 1}𝑛 ,
where ∘ denotes the elementwise product. We accounted for a possibly different big-M
value for each coordinate.
Semi-continuous variable
A slight extension of Sec. 10.2.1 is to model semi-continuous variables, i. e. when 𝑥 ∈
{0} ∪ [𝑎, 𝑏], where 0 < 𝑎 ≤ 𝑏. We can model this by
Cardinality

We might need to limit the number of nonzeros in a vector x to some number 𝐾 < 𝑛. We can do this with the help of an indicator vector y of length 𝑛, which indicates whether 𝑥ᵢ ≠ 0 (see Sec. 10.2.1). First we add a big-M bound M ∘ y to the absolute value, and model it based on Sec. 10.1.1. Then we limit the cardinality of x by limiting the cardinality of y:
x ≤ M ∘ y,
x ≥ −M ∘ y,
(10.28)
1T y ≤ 𝐾,
y ∈ {0, 1}𝑛 .
Positive and negative part
We introduced the positive part 𝑥+ and negative part 𝑥− of a variable 𝑥 in Sec. 10.1.1.
We noted that we need a way to ensure only 𝑥+ or 𝑥− will be positive in the optimal
solution and not both. One such way is to use binary variables:
𝑥 = 𝑥+ − 𝑥− ,
𝑥+ ≤ 𝑀 𝑦,
𝑥− ≤ 𝑀 (1 − 𝑦), (10.29)
𝑥+ , 𝑥− ≥ 0,
𝑦 ∈ {0, 1}.
Here the binary variable 𝑦 allows the positive (negative) part to become nonzero exactly
when the negative (positive) part is zero.
Sometimes we need to handle separately the case when both the positive and the
negative part is fixed to be zero. Then we need to introduce two binary variables 𝑦 + and
𝑦 − , and include the constraint 𝑦 + + 𝑦 − ≤ 1 to prevent both variables from being 1:
𝑥 = 𝑥⁺ − 𝑥⁻,
𝑥⁺ ≤ 𝑀𝑦⁺,
𝑥⁻ ≤ 𝑀𝑦⁻,                                                (10.30)
𝑥⁺, 𝑥⁻ ≥ 0,
𝑦⁺ + 𝑦⁻ ≤ 1,
𝑦⁺, 𝑦⁻ ∈ {0, 1}.
10.3 Quadratic cones and riskless solution

Consider the mean–variance problem with a risk-free security, written in standard-deviation form:

maximize    𝜇ᵀx + 𝑟f𝑥f − 𝛿̃ √(xᵀΣx)
subject to  1ᵀx + 𝑥f = 1,                                (10.31)
            x ≥ 0, 𝑥f ≥ 0,

where 𝑟f is the risk-free rate and 𝑥f is the allocation to the risk-free asset.
We can transform this problem using 𝑥f = 1 − 1T x, such that we are only left with
the variable x:
√
maximize (𝜇 − 𝑟f 1)T x + 𝑟f − 𝛿˜ xT Σx
subject to 1T x ≤ 1, (10.32)
x ≥ 0.
The feasible region in this form is a probability simplex. The solution x = 0 means that
everything is allocated to the risk-free security, and the hyperplane 1T x = 1 has all the
feasible solutions purely involving risky assets.
Let us denote the objective function value by obj(x). Then the directional derivative of the objective along a direction u > 0, ‖u‖ = 1, will be

𝜕_u obj(x) = (𝜇 − 𝑟f1)ᵀu − 𝛿̃ (xᵀΣu) / √(xᵀΣx).

This does not depend on the norm of x, meaning that 𝜕_u obj(𝑐u) will be constant in 𝑐:

𝜕_u obj(𝑐u) = (𝜇 − 𝑟f1)ᵀu − 𝛿̃ √(uᵀΣu).
Thus the objective is linear along the 1-dimensional slice between x = 0 and any x > 0,
meaning that the optimal solution is either x = 0 or some x > 0 satisfying 1T x = 1.
Furthermore, along every 1-dimensional slice represented by a direction u, there is a threshold

𝛿̃_u = (𝜇 − 𝑟f1)ᵀu / √(uᵀΣu).
This is the Sharpe ratio of all portfolios of the form 𝑐u. If 𝛿̃ > 𝛿̃_u then 𝜕_u obj(u) turns negative.
In general, the larger we set 𝛿̃, the fewer directions u will exist for which 𝛿̃ < 𝛿̃_u, i.e., 𝜕_u obj(u) > 0, and the optimal solution is a combination of risky assets. The largest 𝛿̃ for which we still have such a direction is 𝛿̃* = max_u 𝛿̃_u, the maximal Sharpe ratio. The corresponding optimal portfolio is the one having the smallest risk while consisting purely of risky assets. For 𝛿̃ > 𝛿̃* we have 𝜕_u obj(u) < 0 in all directions, meaning that the optimal solution will always be x = 0, the 100% risk-free portfolio.
10.4 Monte Carlo Optimization Selection (MCOS)

The MCOS procedure of [dP19] estimates how sensitive an optimization method is to estimation errors in its inputs. Its main steps are:

1. Start from an input pair (𝜇, Σ) estimated from data.
2. Compute the optimal asset allocation x for the original input pair (𝜇, Σ).
3. Simulate 𝐾sim perturbed input pairs (𝜇ₖ, Σₖ) from the original one, and compute the corresponding optimal allocations xₖ.
4. To estimate the error for each security in the optimal portfolio, compute the standard deviation of the difference vectors xₖ − x. Then by taking the mean we get a single number estimating the error of the optimization method.
Regarding the final step, we can also measure the average decay in performance by any of the following:
• the mean difference in expected outcomes: (1/𝐾sim) Σₖ (xₖ − x)ᵀ𝜇,
• the mean difference in Sharpe ratio or other metric computed from the above statistics.
Chapter 11
• 𝑝0,𝑖 : The known price of security 𝑖 at the beginning of the investment period.
• p0 : The known vector of security prices at the beginning of the investment period.
• 𝑃ℎ,𝑖 : The random price of security 𝑖 at the end of the investment period.
• 𝑃ℎ : The random vector of security prices at the end of the investment period.
• R: The sample return data matrix consisting of security return samples as columns.
• x: Portfolio vector.
• 𝑥f0 : The fraction of initial funds invested into the risk-free security.
• 𝑅x : Portfolio return computed from portfolio x.
• Var(𝑅): Variance of 𝑅.
• 1: Vector of ones.
• ⟨a, b⟩: Inner product of vectors a and b. Sometimes used instead of notation aT b.
• ℱ: Part of the feasible set of an optimization problem. Indicates that the problem
can be extended with further constraints.
• 𝑒: Euler’s number.
11.3 Abbreviations
• MVO: Mean–variance optimization
Bibliography
[BN01] J. Bai and S. Ng. Determining the number of factors in approximate factor
models. Econometrica, 01 2001. doi:10.1111/1468-0262.00273.
[BS11] J. Bai and S. Shi. Estimating high dimensional covariance matrices and
its applications. Annals of Economics and Finance, 12:199–215, 11 2011.
[CK20] Giorgio Costa and Roy H. Kwon. Generalized risk parity portfolio opti-
mization: an admm approach. J. of Global Optimization, 78(1):207–238,
sep 2020.
[dP19] M. López de Prado. A robust estimator of the efficient frontier. SSRN
Electronic Journal, 01 2019. doi:10.2139/ssrn.3469961.
[EM75] B. Efron and C. Morris. Data analysis using Stein's estimator and its generalizations. Journal of the American Statistical Association, 70:311–319, 06 1975. doi:10.1080/01621459.1975.10479864.
[FS02] Hans Foellmer and Alexander Schied. Convex measures of risk and
trading constraints. Finance and Stochastics, 6:429–447, 09 2002.
doi:10.1007/s007800200072.
[JM03] R. Jagannathan and T. Ma. Risk reduction in large portfolios: why im-
posing the wrong constraint helps. Journal of Finance, 58:1651–1684, 08
2003. doi:10.1111/1540-6261.00580.
[KS13] R. Karels and M. Sun. Active portfolio construction when risk and alpha
factors are misaligned. In C. S. Wehn, C. Hoppe, and G. N. Gregoriou, ed-
itors, Rethinking Valuation and Pricing Models, pages 399–410. Academic
Press, 12 2013. doi:10.1016/B978-0-12-415875-7.00024-5.
[Lau01] G. J. Lauprête. Portfolio risk minimization under departures from normal-
ity. Ph.D. Thesis, Massachusetts Institute of Technology, Sloan School of
Management, Operations Research Center., 2001. URL: http://dspace.
mit.edu/handle/1721.1/7582.
[LW03a] O. Ledoit and M. Wolf. Honey, I shrunk the sample covariance matrix. The Journal of Portfolio Management, 07 2003. doi:10.2139/ssrn.433840.
[LMFB07] M. S. Lobo, M. Fazel, and S. Boyd. Portfolio optimization with linear and
fixed transaction costs. Annals of Operations Research, 152:341–365, 03
2007. doi:10.1007/s10479-006-0145-1.
[Meu10c] A. Meucci. Quant nugget 5: return calculations for leveraged securities and
portfolios. GARP Risk Professional, pages 40–43, 10 2010. URL: https:
//ssrn.com/abstract=1675067.
[Meu11] A. Meucci. 'The Prayer': ten-step checklist for advanced risk and portfolio management. SSRN, 02 2011. URL: https://ssrn.com/abstract=1753788.
[MM08] Richard Michaud and Robert Michaud. Efficient Asset Management: A
Practical Guide to Stock Portfolio Optimization and Asset Allocation 2nd
Edition. Oxford University Press, 2 edition, 01 2008.
[Pal20] D. P. Palomar. Risk parity portfolio. 2020. Lecture notes, The Hong Kong
University of Science and Technology.
[Ric99] J. A. Richards. An introduction to james–stein estimation. 11 1999. M.I.T.
EECS Area Exam Report.
[RU00] R. Rockafellar and S. Uryasev. Optimization of conditional value-at-risk.
Journal of risk, 2:21–42, 01 2000.
[RU02] R. T. Rockafellar and S. P. Uryasev. Conditional value-at-risk for general
loss distributions. Journal of Banking & Finance, 26(7):1443–1471, 2002.
doi:https://doi.org/10.1016/S0378-4266(02)00271-6.
[RUZ06] R. T. Rockafellar, S. P. Uryasev, and M. Zabarankin. Generalized devia-
tions in risk analysis. Finance and Stochastics, 10:51–74, 2006.
[RSTF20] D. Rosadi, E. Setiawan, M. Templ, and P. Filzmoser. Robust covari-
ance estimators for mean-variance portfolio optimization with trans-
action lots. Operations Research Perspectives, 7:100154, 06 2020.
doi:10.1016/j.orp.2020.100154.
[RWZ99] M. Rudolf, H.-J. Wolter, and H. Zimmermann. A linear model for tracking
error minimization. Journal of Banking & Finance, 23(1):85–103, 1999.
doi:https://doi.org/10.1016/S0378-4266(98)00076-4.
[The71] H. Theil. Principles of Econometrics. John Wiley and Sons, 1 edition, 06
1971.
[TLD+11] B. Tóth, Y. Lemperiere, C. Deremble, J. Lataillade, J. Kockelko-
ren, and J.-P. Bouchaud. Anomalous price impact and the critical na-
ture of liquidity in financial markets. Physical Review X, 05 2011.
doi:10.2139/ssrn.1836508.
[VD09] V. DeMiguel and F. J. Nogales. Portfolio selection with robust estimation. Operations Research, 57(3):560–577, 02 2009. doi:10.1287/opre.1080.0566.
[WZ07] R. Welsch and X. Zhou. Application of robust statistics to asset allocation
models. REVSTAT, 5:97–114, 03 2007.
[WO11] M. Woodside-Oriakhi. Portfolio Optimisation with Transaction Cost. PhD
thesis, School of Information Systems, Computing and Mathematics,
Brunel University, 01 2011.
[MOSEKApS21] MOSEK ApS. MOSEK Modeling Cookbook. MOSEK ApS, Fruebjergvej
3, Boks 16, 2100 Copenhagen O, 2021. Last revised June 2021. URL:
https://docs.mosek.com/modeling-cookbook/index.html.