
The Quant Interview Cheat Sheet

Written by the Team at QuantGuide.io

Contents
1 Preface
  1.1 Introductory Remarks
  1.2 How to Use This

2 Probability
  2.1 Key Distributions
  2.2 Formulas and Laws
  2.3 Markov Chains

3 Statistical Learning
  3.1 Linear Regression
  3.2 Classification
  3.3 Tree Methods
  3.4 Neural Networks

4 Linear Algebra
  4.1 Matrix Basics
  4.2 Matrix Decompositions

5 Calculus

6 Finance
  6.1 Options Theory
  6.2 Portfolio Theory

1 Preface
1.1 Introductory Remarks
Quant interviews are hard, and for good reason: at the intersection of computer science, statistics, and finance lie some of the most lucrative jobs available out of university, with graduate roles commanding up to $600,000 per year in compensation from base salary, sign-on bonus, and performance bonus (and internships paying over $150 per hour).
Naturally, every year is more competitive than the last, and sometimes we even wonder whether we would still pass the interviews at the top trading firms today. We are uncompromising believers that you do not have to be a genius to land a role in quant, and that there is a path of persistent preparation and tenacious thoughtfulness that leads to an offer. We thought carefully about which resources would have made our own interview preparation less stressful, and we wrote this sheet in the hope that it makes your studying and interviewing easier.

1.2 How to Use This


This document is meant to assist you in your preparation for interviews. It refreshes the fundamental ideas that show up again and again in Quantitative Trading (QT) and Quantitative Research (QR) interviews. Sections labelled QT and QR indicate the type of interview the content is most relevant for. Note that this document only states important results, without derivations or proofs. It is imperative that you practice questions and review the concepts listed here properly, as this sheet is not a substitute for studying.

− The Team at QuantGuide.io


2 Probability
2.1 Key Distributions
Name: modeling intuition; PMF/PDF; MGF; mean µ; variance σ²

Bernoulli(p): toss a coin that lands heads with probability p, record 1 if heads, else 0.
  f(t; p) = p if t = 1, 1 − p if t = 0;   MGF: pe^θ + (1 − p);   µ = p;   σ² = p(1 − p)

Binomial(n, p): number of heads in n tosses of a coin that lands heads with probability p.
  f(t; n, p) = (n choose t) p^t (1 − p)^{n−t};   MGF: (pe^θ + 1 − p)^n;   µ = np;   σ² = np(1 − p)

Geometric(p): number of tosses until the first head, coin lands heads with probability p.
  f(t; p) = p(1 − p)^{t−1};   MGF: pe^θ / (1 − (1 − p)e^θ);   µ = 1/p;   σ² = (1 − p)/p²

Poisson(λ): number of occurrences within a fixed time interval or region of space; λ is the average number of occurrences.
  f(t; λ) = λ^t e^{−λ}/t!;   MGF: e^{λ(e^θ − 1)};   µ = λ;   σ² = λ

Exponential(λ): time between events in a Poisson process with rate λ.
  f(t; λ) = λe^{−λt} 1_{t ≥ 0};   MGF: λ/(λ − θ);   µ = 1/λ;   σ² = 1/λ²

Uniform(a, b): uniform on the interval [a, b].
  f(t; a, b) = 1/(b − a) 1_{t ∈ [a, b]};   MGF: (e^{bθ} − e^{aθ})/(θ(b − a));   µ = (a + b)/2;   σ² = (b − a)²/12

Normal(µ, σ²): the z-score transform (t − µ)/σ gives a standard univariate normal.
  f(t; µ, σ²) = (1/(σ√(2π))) exp(−(t − µ)²/(2σ²));   MGF: e^{µθ + σ²θ²/2};   mean µ;   variance σ²

2.2 Formulas and Laws

Common Relationships Between Distributions (QT/QR)


1. X_1, …, X_n ~ Bernoulli(p) IID  ⇒  Σ_{i=1}^n X_i ~ Binom(n, p)

2. X_1, …, X_r ~ Geom(p) IID  ⇒  Σ_{i=1}^r X_i ~ NegBinom(r, p)

3. X_i ~ Poisson(λ_i), 1 ≤ i ≤ n, independent  ⇒  Σ_{i=1}^n X_i ~ Poisson(Σ_{i=1}^n λ_i)

4. X_1, …, X_n ~ Exp(λ) IID  ⇒  Σ_{i=1}^n X_i ~ Gamma(n, λ^{−1})  (shape-scale parameterization)

5. X_i ~ N(µ_i, σ_i²), 1 ≤ i ≤ n, independent  ⇒  Σ_{i=1}^n X_i ~ N(Σ_{i=1}^n µ_i, Σ_{i=1}^n σ_i²)

6. If X ~ Unif(0, 1) and F is an invertible CDF, then Y = F^{−1}(X) has CDF F (inverse transform sampling; see the sketch below)
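As a quick illustration of item 6 (inverse transform sampling), here is a minimal sketch, assuming NumPy is available; the rate λ = 2 and the sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0                       # arbitrary Exponential rate for the illustration
u = rng.uniform(size=100_000)   # X ~ Unif(0, 1)

# Exponential CDF is F(x) = 1 - exp(-lam * x), so F^{-1}(u) = -ln(1 - u)/lam
y = -np.log(1 - u) / lam

# Sample mean and variance should be close to 1/lam and 1/lam^2
print(y.mean(), y.var())        # ~0.5 and ~0.25
```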


Conditional Probability, Bayes, and Law of Total Probability (QT/QR)


Consider events A1 , . . . , An which form a partition of the sample space as well as event B. Then,

P(A_1 | B) = P(A_1 ∩ B)/P(B) = P(B | A_1)P(A_1)/P(B) = P(B | A_1)P(A_1) / Σ_{i=1}^n P(B ∩ A_i) = P(B | A_1)P(A_1) / Σ_{i=1}^n P(B | A_i)P(A_i)

Law of Total Expectation and Variance (QT/QR)


For two random variables X, Y defined on the same sample space,
E[X] = E[E[X | Y]] = Σ_{i=1}^∞ P(Y = y_i) E[X | Y = y_i]   (discrete Y)   = ∫_R E[X | Y = y] f_Y(y) dy   (continuous Y)

Var(X) = Var(E[X | Y]) + E[Var(X | Y)],   where Var(X | Y) = E[(X − E[X | Y])² | Y]

Intuitively, the Law of Total Expectation says that if we "average over all averages" of X obtained from information about Y, we recover the true average. Similarly, the Law of Total Variance says that the true variance comes from two sources: between samples (the first term) and within samples (the second term).
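A minimal Monte Carlo check of the decomposition, using a hierarchical model chosen purely for illustration: Y ~ Poisson(10) and X | Y ~ N(Y, 1), so E[X | Y] = Y and Var(X | Y) = 1. Both sides should agree up to simulation error.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
y = rng.poisson(10, size=n)          # Y ~ Poisson(10)
x = rng.normal(loc=y, scale=1.0)     # X | Y = y ~ N(y, 1)

lhs = x.var()                        # Var(X) estimated directly
rhs = y.var() + 1.0                  # Var(E[X|Y]) + E[Var(X|Y)] = Var(Y) + 1
print(lhs, rhs)                      # both ~ 11
```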

Covariance and Correlation (QT/QR)

Cov(X, Y) = E[XY] − E[X]E[Y],   Corr(X, Y) = Cov(X, Y)/(σ_X σ_Y)

Covariance and correlation measure the linear association between X and Y. An example of uncorrelated but not independent random variables is the pair Z and Z², where Z ~ N(0, 1).
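A quick numerical check of the Z, Z² example (sample size arbitrary): the sample correlation is essentially zero even though Z² is a deterministic function of Z.

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.standard_normal(1_000_000)
print(np.corrcoef(z, z**2)[0, 1])    # ~0: uncorrelated, yet clearly dependent
```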

Properties of Expectation, Variance, and Covariance (QT/QR)


Let a, b, c, and d be real constants and X and Y be random variables with finite mean and variance. Then all of the
following hold:
1. E[aX + bY + c] = aE[X] + bE[Y] + c

2. Var(aX + b) = a² Var(X)

3. Cov(aX + b, cY + d) = ac Cov(X, Y)

4. Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)

5. If X and Y are independent and have finite mean, then X and Y are uncorrelated.

6. Corr(aX + b, cY + d) = sign(ac) Corr(X, Y)

7. Cov(X, X) = Var(X)

8. The correlation and covariance matrices are both positive semidefinite.

9. |Corr(X, Y)| ≤ 1

Tail Sum and Integral (QT/QR)


If X is a non-negative random variable, then
E[X] = ∫_0^∞ P(X > t) dt   (continuous X)   = Σ_{t=0}^∞ P(X > t)   (integer-valued X)

Law of the Unconscious Statistician (LOTUS) (QT/QR)


Let X be a random variable and g(x) be a function. Then we have
E[g(X)] = ∫_R g(x) f_X(x) dx   (continuous X)   = Σ_{k ∈ Supp(X)} g(k) P(X = k)   (discrete X)

3 900+ free interview questions at QuantGuide.io


2.3 Markov Chains 2 PROBABILITY

Central Limit Theorem (QT/QR)


Suppose that X_1, X_2, …, X_n are IID random variables with finite mean µ and variance σ², and suppose that n is large. Define S_n = X_1 + ⋯ + X_n and X̄_n = S_n/n. Then we have that

(S_n − nµ)/(σ√n) = (X̄_n − µ)/(σ/√n) ≈ N(0, 1)

This intuitively says that a sum of many IID random variables is approximately normal once it is re-scaled to have mean 0 and variance 1. This is useful for approximating sums whose exact distribution is not known.
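A sketch of the CLT on Exponential(1) summands (any IID distribution with finite variance would do; n and the number of trials are arbitrary): empirical quantiles of the standardized mean are compared against N(0, 1) quantiles.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, trials = 50, 100_000
x = rng.exponential(scale=1.0, size=(trials, n))   # mu = 1, sigma = 1

z = (x.mean(axis=1) - 1.0) / (1.0 / np.sqrt(n))    # (X_bar - mu)/(sigma/sqrt(n))

# Empirical quantiles of z versus standard normal quantiles
for q in (0.05, 0.5, 0.95):
    print(q, np.quantile(z, q), stats.norm.ppf(q))
```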

Law of Large Numbers (QT/QR)


Suppose that X_1, X_2, … are IID random variables with finite mean µ. Define X̄_n = (X_1 + ⋯ + X_n)/n. Then, with probability 1,

lim_{n→∞} X̄_n = µ

This intuitively says that the sample average approaches the true average as the sample size grows large. This theorem is
the basis of Monte Carlo Sampling, as we are attempting to find the true average by averaging the results of simulations.

2.3 Markov Chains


An n × n transition matrix P is defined as follows for a discrete state space with n states X = {x1 , x2 , . . . xn }, where Pij
is the probability of transitioning from state xi to state xj . Each entry in P is within [0, 1], and the sum of entries for
each row must total 1. Note that not all transition matrices are stationary.

Stationary Distribution (QR)


Solve for row vector π = (π1 , . . . , πn ), where πi is the stationary probability of the i-th state.

π = πP,   Σ_{i=1}^n π_i = 1

Expected Time to State (Example)


Suppose we have a 3 × 3 transition matrix P, and we wish to find the expected time to reach x_3 from x_1. We can solve the following system for µ_1, where µ_i denotes the expected time to reach state x_3 starting from x_i (so µ_3 = 0):

µ_1 = 1 + P_{11} µ_1 + P_{12} µ_2
µ_2 = 1 + P_{21} µ_1 + P_{22} µ_2
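A sketch with a hypothetical 3 × 3 transition matrix, assuming NumPy: the stationary distribution is recovered as the left eigenvector of P for eigenvalue 1, and the expected hitting times of x_3 solve the linear system above.

```python
import numpy as np

# Hypothetical transition matrix (each row sums to 1)
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.1, 0.4, 0.5]])

# Stationary distribution: pi = pi P with entries summing to 1
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
pi /= pi.sum()
print(pi, pi @ P)                       # pi and pi P should match

# Expected time to reach x3: solve (I - Q) mu = 1, Q = P restricted to {x1, x2}
mu = np.linalg.solve(np.eye(2) - P[:2, :2], np.ones(2))
print(mu)                               # [mu_1, mu_2]
```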

Gambler’s Ruin/Random Walks (QT/QR)


Suppose that you and a friend start with $a and $b, respectively, where a, b > 0 are integers. Each round, a fair coin is flipped. You receive $1 from your friend if it lands heads, while you must pay $1 to your friend if it lands tails. The game ends once one of the players has no money left. The probability that you are ruined (i.e. have no money left) is

b / (a + b)

If the coin instead has probability p of landing heads, define ϱ = p/(1 − p). The probability that you are ruined is

(1 − ϱ^b) / (1 − ϱ^{a+b})
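A simulation sketch (the stakes a, b and the coin bias p are arbitrary) comparing the empirical ruin frequency to the formula above; with p = 1/2 the same code checks the b/(a + b) result.

```python
import numpy as np

rng = np.random.default_rng(4)
a, b, p = 3, 5, 0.48                  # your stake, friend's stake, P(heads)
rho = p / (1 - p)
theory = (1 - rho**b) / (1 - rho**(a + b))

trials, ruined = 20_000, 0
for _ in range(trials):
    wealth = a
    while 0 < wealth < a + b:         # play until someone is broke
        wealth += 1 if rng.random() < p else -1
    ruined += (wealth == 0)

print(ruined / trials, theory)        # should be close
```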


3 Statistical Learning
3.1 Linear Regression

Simple Linear Regression (QR)


Suppose we have a variable X and another variable Y which we assume has a linear relationship with X, and suppose we have m samples in a data set. Specifically, we assume

Y = β_0 + β_1 X + ϵ,   E[ϵ] = 0,   Var(ϵ) = σ²,   ϵ and X independent

The estimates of β_0, β_1 that minimize the residual sum of squares RSS = Σ_{i=1}^m (y_i − ŷ_i)² are

β̂_1 = Σ_{i=1}^m (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^m (x_i − x̄)²  ≈  Cov(X, Y)/Var(X),   β̂_0 = ȳ − β̂_1 x̄,   and

Var(β̂_1) = σ² / Σ_{i=1}^m (x_i − x̄)²,   Var(β̂_0) = σ² Σ_{i=1}^m x_i² / (m Σ_{i=1}^m (x_i − x̄)²),   σ̂² = (1/(m − 2)) Σ_{i=1}^m (y_i − ŷ_i)²

OLS Assumptions: Independent observations, constant variance (homoscedasticity), and no multicollinearity
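A minimal check of the closed-form estimates on synthetic data, assuming NumPy; the true coefficients and noise level are made up.

```python
import numpy as np

rng = np.random.default_rng(5)
m = 500
x = rng.uniform(-2, 2, size=m)
y = 1.5 + 0.8 * x + rng.normal(scale=0.5, size=m)     # true beta0 = 1.5, beta1 = 0.8

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
beta0 = y.mean() - beta1 * x.mean()

resid = y - (beta0 + beta1 * x)
sigma2_hat = np.sum(resid**2) / (m - 2)               # unbiased estimate of sigma^2
se_beta1 = np.sqrt(sigma2_hat / np.sum((x - x.mean())**2))

print(beta0, beta1, se_beta1)                         # estimates and SE(beta1)
```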

Terms for Model Assessment and Inference (QR)

RSS (Residual Sum of Squares): Σ_{i=1}^m (y_i − ŷ_i)². Measures the amount of variability left unexplained after performing the regression.

SE (Standard Error): σ/√m ≈ σ̂/√m. Used in hypothesis testing on the coefficients; t = β̂/SE(β̂) is distributed as t_{m−p−1}.

RSE (Residual Standard Error): √(RSS/(m − 2)) = √(RSS/(m − p − 1)) for p predictors. An absolute measure of the lack of fit of the model to the data.

TSS (Total Sum of Squares): Σ_{i=1}^m (y_i − ȳ)². Total variance in Y.

R²: 1 − RSS/TSS. Proportion of the variance in Y explained by X.

Adjusted R²: 1 − [RSS/(m − p − 1)] / [TSS/(m − 1)]. Similar to R², adjusted for p predictors and m examples.

Corr(X, Y): Σ_{i=1}^m (x_i − x̄)(y_i − ȳ) / √(Σ_{i=1}^m (x_i − x̄)² Σ_{i=1}^m (y_i − ȳ)²). Sample correlation between X and Y.

Multiple Linear Regression (QR)


Our model assumes a linear relationship between Y and X1 , X2 , . . . , Xp .
Y = β_0 + Σ_{j=1}^p β_j X_j + ϵ,   E[ϵ] = 0,   Var(ϵ) = σ²

We can express the above in matrix form. Denote X as an m × (p + 1) matrix with each row an input vector with 1
in the first position corresponding to β0 . Further, let β = (β0 , β1 , . . . , βp )⊺ , and let y denote the m-vector of outputs
corresponding to X. Then we have that y = Xβ + ϵ, E[ϵi ] = 0, and Var(ϵi ) = σ 2 .

We can compute the unbiased estimator β̂ as follows:


RSS = (y − Xβ)⊺(y − Xβ)   ⇒   ∂RSS/∂β = −2X⊺(y − Xβ)   ⇒   β̂ = (X⊺X)^{−1} X⊺ y

Var(β̂) = (X⊺X)^{−1} σ²,   σ̂² = (1/(m − p − 1)) Σ_{i=1}^m (y_i − ŷ_i)²
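The matrix form translates directly into code. A sketch on synthetic data (dimensions and true β are arbitrary), with np.linalg.lstsq as a numerically safer cross-check of the normal equations.

```python
import numpy as np

rng = np.random.default_rng(6)
m, p = 200, 3
X = np.column_stack([np.ones(m), rng.standard_normal((m, p))])   # intercept column of 1s
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=m)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)          # (X'X)^{-1} X'y
sigma2_hat = np.sum((y - X @ beta_hat)**2) / (m - p - 1)
se = np.sqrt(np.diag(sigma2_hat * np.linalg.inv(X.T @ X)))

print(beta_hat, se)
print(np.linalg.lstsq(X, y, rcond=None)[0])           # should match beta_hat
```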


F Distribution for Subset Selection (QR)

1. Is a subset of X1 , X2 , . . . , Xp useful in predicting Y ?


We may conduct the hypothesis test by computing the F-statistic defined below, where subscript 1 denotes the larger model.

F = [(RSS_0 − RSS_1)/(p_1 − p_0)] / [RSS_1/(m − p_1 − 1)]

The F-statistic follows the F_{p_1 − p_0, m − p_1 − 1} distribution.

2. What subset of X1 , X2 , . . . , Xp helps explain Y ?


The F-ratio is defined as follows, where k denotes the number of predictors currently in the model and RSS_0 and RSS_1 are the errors of the original and new (appropriately fitted) model, respectively.

F = (RSS_0 − RSS_1) / (RSS_0/(m − k − 2))

Forward selection greedily adds the predictor producing the largest value of F, whereas backward selection sequentially deletes the predictor producing the smallest value of F at each stage. The stopping condition is an F-ratio at the (100 − α)th percentile of the F_{1, m−k−2} distribution, for some α, typically 5 or 10.

Coefficient Shrinkage Methods (QR)


Previously, we aimed to minimize RSS = Σ_{i=1}^m (y_i − β_0 − Σ_{j=1}^p β_j x_{ij})². Regularization methods such as ridge and lasso shrink the regression coefficients towards zero, which can significantly reduce the variance of the coefficient estimates.

Ridge Regression
Objective function to be minimized, for hyperparameter λ:

Σ_{i=1}^m (y_i − β_0 − Σ_{j=1}^p β_j x_{ij})² + λ Σ_{j=1}^p β_j² = RSS + λ Σ_{j=1}^p β_j² = (y − Xβ)⊺(y − Xβ) + λβ⊺β

⇒ β̂^{ridge} = (X⊺X + λI)^{−1} X⊺ y

As λ → ∞, the impact of the shrinkage penalty grows, and ridge regression coefficient estimates will approach zero.
Note that we don’t want to shrink the intercept, since it is, conceptually, a measure of the center of the response when
predictors are at zero. Note that ridge regression will shrink all coefficients towards zero but will not set any of them
exactly to zero, reducing model interpretability for large p. Lasso overcomes this disadvantage.
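A sketch of the ridge closed form on synthetic data, assuming NumPy; λ is an arbitrary choice, and, as noted above, the intercept is not shrunk, which is handled here by centering X and y.

```python
import numpy as np

rng = np.random.default_rng(7)
m, p, lam = 100, 5, 10.0
X = rng.standard_normal((m, p))
beta_true = np.array([3.0, 0.0, -2.0, 0.0, 1.0])
y = X @ beta_true + rng.normal(size=m)

# Center so the intercept (estimated as y-bar) is not penalized
Xc, yc = X - X.mean(axis=0), y - y.mean()

beta_ols = np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)
beta_ridge = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

print(beta_ols)
print(beta_ridge)      # shrunk towards zero, but no coefficient is exactly zero
```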

Lasso Regression
Objective function to be minimized, for hyperparameter λ, yielding sparse models:
 2
m
X p
X p
X p
X
yi − β0 − βj xij  + λ |βj | = RSS + λ |βj |
i=1 j=1 j=1 j=1

Bias-Variance Tradeoff (QR)


We have a relationship Y = f(X) + ϵ, where E[ϵ] = 0 and Var(ϵ) = σ². Suppose we have learned a function f̂(·) to approximate f(·) from a training set D. Then the expected prediction error at a test point x can be decomposed as

EPE(x) = E[(Y − f̂(x))² | X = x] = σ² + Bias[f̂(x)]² + Var[f̂(x)],   where Bias[f̂(x)] := E_D[f̂(x)] − f(x)


3.2 Classification

k-Nearest Neighbors (QR)


Unlike the other classification methods here, k-nearest neighbors is a non-parametric method that classifies a new observation using the training points closest to it. We first calculate the distance between the new observation and each training point, then identify the k closest training observations, and finally assign the new observation the most common class among those k neighbors.
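A minimal k-nearest-neighbors classifier in plain NumPy (Euclidean distance, majority vote); the data, k, and query point are all made up for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)      # Euclidean distances
    nearest = np.argsort(dists)[:k]                      # indices of the k closest
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

rng = np.random.default_rng(8)
X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),    # class 0 cluster
               rng.normal([3, 3], 1.0, size=(50, 2))])   # class 1 cluster
y = np.array([0] * 50 + [1] * 50)

print(knn_predict(X, y, np.array([2.5, 2.5]), k=5))      # expected: 1
```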

Logistic Regression (QR)


Suppose we now want to model our Y as a binary output for given X. We can now model the probability of X belonging
in a particular category. Here, we will assign X to group 1 if P(Y = 1 | X = x) > 0.5 and 0 otherwise. This is done via
the logistic function:
f(z) = 1/(1 + e^{−z})
Logistic regression is also known as a linear classification model. This is due to the fact that we use a linear regression as
the input into the logistic function, which then normalizes it into a probability. In other words, z = β0 + β1 x.

To find β, we can use the method of maximum likelihood, where the probability for each x_i is given by f(β̂_0 + β̂_1 x_i). Unlike in linear regression, there is no closed-form solution, so we must use iterative methods to determine β̂. Multiple logistic regression is done in the same manner, except that we now estimate additional components of β.
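Since there is no closed form, here is a sketch of fitting β by gradient ascent on the log-likelihood (learning rate, iteration count, and data are arbitrary; production libraries typically use Newton-type methods instead).

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(9)
m = 500
x = rng.uniform(-3, 3, size=m)
y = rng.binomial(1, sigmoid(-1.0 + 2.0 * x))      # true beta0 = -1, beta1 = 2

X = np.column_stack([np.ones(m), x])
beta = np.zeros(2)
lr = 0.1
for _ in range(5000):
    grad = X.T @ (y - sigmoid(X @ beta)) / m      # gradient of the average log-likelihood
    beta += lr * grad

print(beta)                                       # estimates should be near (-1, 2)
```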

Generative Models (QR)


Logistic regression models P(Y = k | X = x) directly. In generative models, we can use Bayes’ Theorem to obtain an
alternative formula for the probability:

P(Y = k | X = x) = P(X = x | Y = k) P(Y = k) / P(X = x) = π_k f_k(x) / Σ_{l=1}^K π_l f_l(x)

Generative models calculate these quantities individually and then combine them to obtain the desired probability indirectly; because they model how each class generates the data, they are called generative models.

Discriminant Analysis (QR)


Here, we assume that the data is normal. In other words, fk (x) ∼ N (µk , σk2 ). We can obtain estimates of µk and σk2
using standard maximum likelihood estimates. The prior can be estimated by using the proportion of training data that
is in each class.

When we assume that σ_1 = σ_2 = ⋯ = σ_k, we obtain linear discriminant analysis. When we relax this assumption and allow a unique σ_k per class, we obtain quadratic discriminant analysis. Above, we assumed f_k(x) to be univariate, but linear and quadratic discriminant analysis also apply when using multivariate Gaussians.

Naive Bayes (QR)


Naive Bayes makes a strong assumption about the data distribution (different from that of discriminant analysis): within each class, the p predictors are independent of each other. This gives the following statement:

f_k(x) = f_{k1}(x_1) f_{k2}(x_2) ⋯ f_{kp}(x_p)
This greatly simplifies computation as it assumes no dependence between the independent variables. In discriminant
analysis, we assumed that the data follows a normal distribution, which may have some non-diagonal covariance structure.


3.3 Tree Methods

Decision Trees (QR)


Tree methods can be used for both classification and regression. We will start with a single decision tree. A decision tree makes a sequence of if-else splits that eventually leads to a class in the classification setting, or to a value in the regression setting. In general, decision trees are very easy to interpret and handle non-quantitative predictors well. On the other hand, decision trees can overfit the data (low bias) and have high variance, and their accuracy is often not the greatest. These problems can be remedied with bagging or boosting.

Bagging (QR)
Bagging is a method used to reduce the variance of a decision tree. From a single dataset, we repeatedly draw bootstrap samples (sampling with replacement), train a decision tree on each, and then average the trees to obtain a model with lower variance (aggregation).

Due to bootstrapping, not all data points are used when training each tree. On average, each tree uses only about 2/3 of the data. The remaining data points are referred to as OOB (out-of-bag) observations, which can be used to estimate out-of-sample test error.

Random Forests (QR)


Random Forests essentially use bagging, but with a small modification. In bagging, we may have high correlation between
features and thus our trees can all be highly correlated. In random forests, we decorrelate the trees by forcing the trees
to consider only a small subset of the features, m, rather than all of them. When we set m = p, we obtain bagged trees.

Boosting (QR)
In bagging and random forests, we reduced variance by using an ensemble of decision trees. Here, we focus on reducing
bias by sequentially growing decision trees. In addition, we train the model on a modified version of the data, rather than
a bootstrapped dataset. More specifically, we train the model on the residuals.

First, we train a tree on the original dataset. Then, we calculate the residuals and train a new tree on those residuals. We add the models together and repeat. Boosting is additive, focusing on reducing the error with each sequentially grown tree.

3.4 Neural Networks

Neural Networks (QR)


Neural networks are a very powerful way of introducing non-linearity into both regression and classification problems. This is done through activation functions. Denote the activation function of layer i as f_i. A neural network with n layers computes

y = f_n(W_n f_{n−1}(⋯ f_1(W_1 x + b_1) ⋯) + b_n)

where each layer has its own weights W_i and bias b_i, and x is the input, which can range from a scalar to a vector. The subscript on f_i indicates that the activation functions of different layers need not be the same; common choices are ReLU, softmax, and tanh. Notice that without activation functions the result collapses to a linear regression, with the W_i and b_i playing the role of the βs. Similarly, logistic regression is one of the most basic neural networks, where the activation function is the sigmoid function.
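A sketch of a two-layer forward pass in NumPy, making the per-layer weights explicit; the layer sizes and random parameters are illustrative only.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - z.max())              # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(10)
x = rng.standard_normal(4)               # input vector

W1, b1 = rng.standard_normal((8, 4)), np.zeros(8)   # hidden layer: 4 -> 8
W2, b2 = rng.standard_normal((3, 8)), np.zeros(3)   # output layer: 8 -> 3

h = relu(W1 @ x + b1)                    # y = f2(W2 f1(W1 x + b1) + b2)
y = softmax(W2 @ h + b2)
print(y, y.sum())                        # class probabilities summing to 1
```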

One can see that there can be an extremely large number of parameters. These parameters are trained through gradient descent and backpropagation. Many different losses can be used, which partly explains the versatility of neural networks. A downside is that they require a large amount of data to be effective; they are also not necessarily the fastest and they lack interpretability. All three points are reasons why neural networks are not that widely used in finance, and many practitioners turn to simpler models, like regression, for their needs. Still, the strength of neural networks is clear, especially with the sudden popularity of large language models.


4 Linear Algebra
4.1 Matrix Basics

Fundamental Knowledge (QT/QR)


Let A and B be square n × n matrices. Then all of the following hold:
cos(θ) = x⊺y / (∥x∥∥y∥)        (AB)⊺ = B⊺A⊺        (AB)^{−1} = B^{−1}A^{−1}        A^{−1}A = AA^{−1} = I        rank(A) + null(A) = n

Av = λv ⇒ (A − λI)v = 0 ⇒ det(A − λI) = 0        det(A) = 1/det(A^{−1}) = det(A⊺)

det(AB) = det(A)det(B)        det(cA) = c^n det(A)        det(A) = Π_{i=1}^n λ_i        trace(A) = Σ_{i=1}^n A_{ii} = Σ_{i=1}^n λ_i

Nonsingular Matrices (QR)


A nonsingular matrix is invertible. A (n × n) is nonsingular if and only if any (and therefore all) of the following hold:
1. Columns of A span Rn , or equivalently, rank(A) = dim(range(A)) = n
2. A⊺ is nonsingular

3. det(A) ≠ 0

4. Ax = 0 has only the trivial solution x = 0; dim(null(A)) = 0

If A = [[a, b], [c, d]], then A^{−1} = (1/det(A)) [[d, −b], [−c, a]]. Larger inverses may be found via Gauss-Jordan elimination:

[A | I]  →(elementary row operations)→  [I | A^{−1}]

2D Rotation Matrices (QR)

2D rotation matrices by θ radians counter-clockwise about the origin are matrices of the form R_θ = [[cos θ, −sin θ], [sin θ, cos θ]].

Orthogonal Matrices (QR)


Orthogonal matrices (the real counterpart of unitary matrices) are square with orthonormal row and column vectors. They are nonsingular and satisfy Q⊺ = Q^{−1}. Orthogonal matrices can be interpreted as rotations (possibly composed with reflections).

Idempotent Matrices (QR)


Idempotent matrices are square matrices satisfying A² = A. In other words, applying the linear transformation A twice has the same effect as applying it once. Projection matrices are idempotent.

Positive Semi-definite Matrices (QR)


Covariance and correlation matrices are always positive semi-definite and positive definite if there is no perfect linear
dependence among random variables. Each of the following conditions is a necessary and sufficient condition for A to be
positive semi-definite/definite:

Positive semi-definite (each condition is necessary and sufficient):
1. z⊺Az ≥ 0 for all column vectors z
2. All eigenvalues are nonnegative
3. All principal submatrices have nonnegative determinants

Positive definite (each condition is necessary and sufficient):
1. z⊺Az > 0 for all nonzero column vectors z
2. All eigenvalues are positive
3. All leading principal (upper-left) submatrices have positive determinants (Sylvester's criterion)

Note that if A has negative diagonal elements, then A cannot be positive semi-definite.


4.2 Matrix Decompositions

Diagonalizable Matrices (QR)


A is diagonalizable if and only if it has n linearly independent eigenvectors, or equivalently, if the geometric and algebraic multiplicities of every eigenvalue agree. A special case is when A has n distinct eigenvalues. Suppose we have eigenvalues λ_1, …, λ_n with corresponding eigenvectors v_1, …, v_n. Then

A = XDX^{−1},   where X = [v_1 ⋯ v_n] has the eigenvectors as its columns and D = diag(λ_1, …, λ_n)

Intuitively, this says that we can find a basis consisting of the eigenvectors of A. It is useful for computing large powers of A, since A^n = XD^nX^{−1}. An important example: any real symmetric matrix is diagonalizable (with an orthogonal eigenvector basis).
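A sketch of computing a large matrix power via the eigendecomposition (the symmetric matrix is arbitrary; symmetry guarantees diagonalizability).

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])                         # real symmetric, hence diagonalizable

vals, X = np.linalg.eig(A)                         # columns of X are eigenvectors
A10 = X @ np.diag(vals**10) @ np.linalg.inv(X)     # A^10 = X D^10 X^{-1}

print(np.allclose(A10, np.linalg.matrix_power(A, 10)))   # True
```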

Singular Value Decomposition (QR)


SVD is powerful for low-rank approximations of matrices. Unlike the eigenvalue decomposition, SVD uses two different bases (the left and right singular vectors). For orthogonal matrices U (m × m), V (n × n) and a diagonal matrix Σ (m × n) with nonnegative diagonal entries in nonincreasing order, we can write any m × n matrix A as:

A = U ΣV ⊺

Intuitively, this says that we can express A as a diagonal matrix with suitable choices of (orthogonal) bases.
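A sketch of a rank-k approximation via truncated SVD, assuming NumPy; the matrix and k are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(11)
A = rng.standard_normal((6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) V^T
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # keep the k largest singular values

print(np.linalg.matrix_rank(A_k))                  # 2
print(np.linalg.norm(A - A_k))                     # approximation error
```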

QR Decomposition (QR)
For nonsingular A, we can write A = QR, where Q is orthogonal and R is an upper triangular matrix with positive
diagonal elements. QR decomposition assists in increasing the efficiency of solving Ax = b for nonsingular A:

Ax = b ⇒ QRx = b ⇒ Rx = Q−1 b = Q⊺ b

QR decomposition is very useful in efficiently solving large numerical systems and inversion of matrices. Furthermore, it
is also used in least-squares when our data is not full rank.

LU and Cholesky Decompositions (QR)


For nonsingular A, we can write A = LU , where L is a lower and U is an upper triangular matrix. This decomposition
assists in solving Ax = b as well as computing the determinant:
det(A) = det(L) det(U) = (Π_{i=1}^n L_{ii}) (Π_{j=1}^n U_{jj})

If A is symmetric positive definite, then A can be expressed as A = R⊺ R via Cholesky decomposition, where R is an upper
triangular matrix with positive diagonal entries. Cholesky decomposition is essentially LU decomposition with L = U ⊺ .
These decompositions are both useful for solving large linear systems.

Projections (QR)
Fix a vector v ∈ Rn . The projection of x ∈ Rn onto v is given by

proj_v(x) = P_v x = (vv⊺/∥v∥²) x = ((x·v)/∥v∥²) v

More generally, if S = Span{v_1, …, v_k} ⊆ R^n has orthogonal basis {v_1, …, v_k}, then the projection of x ∈ R^n onto S is given by

proj_S(x) = Σ_{i=1}^k ((x·v_i)/∥v_i∥²) v_i

The main property is that proj_S(x) ∈ S and x − proj_S(x) is orthogonal to every s ∈ S. Linear regression can be viewed as projecting the observed response vector y onto the column space (the span of the columns) of the design matrix X.


5 Calculus

Differentiation (QT/QR)
At all points x where the functions and the derivatives are defined,
d/dx (x^n) = n x^{n−1}          d/dx sin(x) = cos(x)          d/dx cos(x) = −sin(x)          d/dx tan(x) = sec²(x)

d/dx sec(x) = sec(x)tan(x)          d/dx csc(x) = −csc(x)cot(x)          d/dx cot(x) = −csc²(x)

d/dx arcsin(x) = 1/√(1 − x²)          d/dx arctan(x) = 1/(1 + x²)          d/dx arcsec(x) = 1/(|x|√(x² − 1))

d/dx (e^x) = e^x          d/dx (ln(x)) = 1/x          d/dx (x^x) = x^x (ln(x) + 1)

d/dx (f(x) ± g(x)) = f′(x) ± g′(x)          d/dx (f(x)g(x)) = f′(x)g(x) + g′(x)f(x)

d/dx (f(x)/g(x)) = (f′(x)g(x) − f(x)g′(x)) / g(x)²          d/dx f(g(x)) = f′(g(x)) g′(x)

d/dx (f(x)^{g(x)}) = f(x)^{g(x)} (g′(x) ln(f(x)) + g(x) f′(x)/f(x))

Integration (QT/QR)
Disregarding the +C on all the integrals,

∫ x^n dx = x^{n+1}/(n + 1), n ≠ −1          ∫ sin(x) dx = −cos(x)          ∫ cos(x) dx = sin(x)          ∫ tan(x) dx = −ln|cos(x)|

∫ sec(x) dx = ln|sec(x) + tan(x)|          ∫ csc(x) dx = ln|csc(x) − cot(x)|          ∫ cot(x) dx = ln|sin(x)|

∫ 1/√(1 − x²) dx = arcsin(x)          ∫ 1/(1 + x²) dx = arctan(x)          ∫ 1/(|x|√(x² − 1)) dx = arcsec(x)

∫ e^x dx = e^x          ∫ (1/x) dx = ln|x|          ∫ (f(x) ± g(x)) dx = ∫ f(x) dx ± ∫ g(x) dx

∫ u(x)v′(x) dx = u(x)v(x) − ∫ v(x)u′(x) dx          ∫ f′(g(x)) g′(x) dx = f(g(x))

Taylor Series (QT/QR)



Select some point x = x_0. If x_0 = 0, we have the Maclaurin series. Generally, f(x) = Σ_{n=0}^∞ [f^{(n)}(x_0)/n!] (x − x_0)^n.

Common Maclaurin series expansions:

e^x = Σ_{n=0}^∞ x^n/n! = 1 + x/1! + x²/2! + ⋯

sin(x) = Σ_{n=0}^∞ (−1)^n x^{2n+1}/(2n + 1)! = x − x³/3! + x⁵/5! − x⁷/7! + ⋯

cos(x) = Σ_{n=0}^∞ (−1)^n x^{2n}/(2n)! = 1 − x²/2! + x⁴/4! − x⁶/6! + ⋯

Common Summation Formulae (QT/QR)

Σ_{k=1}^n k = n(n + 1)/2          Σ_{k=1}^n k² = n(n + 1)(2n + 1)/6          Σ_{k=s}^∞ a·r^k = a·r^s/(1 − r)  (|r| < 1)          Σ_{k=1}^∞ 1/k² = π²/6


6 Finance
6.1 Options Theory

Bonds, Underlying, Forward Contracts (QT)


We can form many assets in options theory using two building blocks: the underlying and the bond. The underlying is simple: it can be a stock, a future, or any non-derivative; we denote it by S. The bond represents the risk-free rate, which is typically modelled as some constant; we denote it by B. For simplicity, we assume the final payout of a bond at time T is always $1. However, we need to discount the value of this bond. This is done with a discount factor, which can be written as e^{−rT}, where r is the risk-free rate and T is the time until expiry.

Using these two assets, we can form an asset with a linear payoff, known as a forward contract. It involves going long 1 unit of the underlying and shorting K units of the bond. The forward has a final payoff of V_T = S_T − K; this is the most basic example of a derivative. In general, K is known as the strike price. Here, we have a positive payoff when S_T > K and a negative payoff when S_T < K.

Vanilla Options (QT)


The simplest options are calls and puts. A call represents the right (but not the obligation) to purchase the underlying at strike K. Similarly, a put represents the right (but not the obligation) to sell the underlying at strike K. By contrast, the forward contract represents the obligation to purchase the underlying at strike K, while a call option does not. Mathematically, a call option has time-T payoff max(S_T − K, 0) and a put option has time-T payoff max(K − S_T, 0).

An important concept in options pricing is the method of replication: if two assets have the same final payoff, they must also have the same price at all earlier times; otherwise there would be an arbitrage. One can immediately see that the final payoff of a forward contract is replicated by going long 1 call and short 1 put. This is known as put-call parity. For a non-dividend-paying underlying we can write it as C − P = S − Ke^{−rT}, where the strike is discounted because it is paid at expiry; at expiry itself this reduces to the forward payoff S_T − K.

Black-Scholes (QT/QR)
Black-Scholes is the most important concept in all of options theory. It provides the basis for pricing options through a differential equation, as well as the Greeks commonly used in the industry. The Black-Scholes differential equation is as follows:

(1/2) σ² S² ∂²C/∂S² + rS ∂C/∂S + ∂C/∂t = rC
When we consider European options (exercise occurs only at expiration), a constant risk-free rate, a non-dividend-paying stock, and a stock price following a Geometric Brownian Motion, we can use Black-Scholes. In practice these assumptions usually do not hold, but the model still gives a solid framework for modern options pricing, including certain numerical methods used in practice. Here, σ is the constant volatility of the GBM, S is the price of the underlying, C is the option price, and r is the constant risk-free rate.

Solving the differential equation gives a closed form for the price of an option, as a function of t, S, r, and σ. Partial
derivatives can be taken to obtain the sensitivity of the option to these values.
Delta: ∂C/∂S
Gamma: ∂²C/∂S²
Theta: ∂C/∂t
Vega: ∂C/∂σ
Rho: ∂C/∂r
One of the biggest empirical downsides of Black-Scholes is the existence of a non-constant volatility. In real life, there
exists a volatility skew. In other words, the volatility is not constant across different strikes. This can lead to huge
inaccuracies when pricing with Black-Scholes.
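For reference, a sketch of the closed-form price and delta of a European call on a non-dividend-paying underlying, assuming NumPy and SciPy; the inputs are arbitrary.

```python
import numpy as np
from scipy.stats import norm

def bs_call(S, K, T, r, sigma):
    """Black-Scholes price and delta of a European call on a non-dividend stock."""
    d1 = (np.log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * np.sqrt(T))
    d2 = d1 - sigma * np.sqrt(T)
    price = S * norm.cdf(d1) - K * np.exp(-r * T) * norm.cdf(d2)
    delta = norm.cdf(d1)
    return price, delta

print(bs_call(S=100, K=105, T=0.5, r=0.02, sigma=0.25))   # arbitrary inputs
```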


6.2 Portfolio Theory

Two Asset Portfolio (QT/QR)


The theme of portfolio optimization is that diversification improves portfolio performance. Consider a two-asset portfolio: asset 1 with mean µ_1 and variance σ_1², and asset 2 with mean µ_2 and variance σ_2², with correlation ρ. We weight these assets with weights w_1 = w and w_2 = 1 − w. The portfolio return and variance are

µ = wµ_1 + (1 − w)µ_2
σ² = w²σ_1² + 2w(1 − w)ρσ_1σ_2 + (1 − w)²σ_2²

If the assets are perfectly correlated (ρ = 1), we have σ = wσ_1 + (1 − w)σ_2: the volatility is just the weighted average, like the return. If ρ < 1, then σ < wσ_1 + (1 − w)σ_2; the volatility of the portfolio decreases while the expected return stays the same. Only ρ < 1 is required for the benefits of diversification to appear. In fact, if ρ = −1, we can obtain a risk-free portfolio by choosing the weight w = σ_2/(σ_1 + σ_2) on asset 1.

We can generalize this to larger portfolios. For an equally weighted portfolio of n assets, the portfolio variance can be written as σ² = (1/n)·E[σ_i²] + ((n − 1)/n)·E[σ_{i,j}], where E[σ_i²] is the average variance and E[σ_{i,j}] the average pairwise covariance. As n → ∞, the individual variances of the securities do not matter: only the covariance between the securities does. Once the portfolio is large enough, the only driver of portfolio variance is the average covariance structure between the assets. Idiosyncratic risk is the variance that can be diversified away by increasing the number of assets, while systematic risk is the risk inherent to the market: it cannot be diversified away.
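A quick numerical check of the two-asset formulas, including the ρ = −1 risk-free weight w = σ_2/(σ_1 + σ_2); the means and volatilities are arbitrary.

```python
import numpy as np

mu1, mu2 = 0.08, 0.12            # arbitrary expected returns
s1, s2 = 0.15, 0.25              # arbitrary volatilities

def portfolio(w, rho):
    mu = w * mu1 + (1 - w) * mu2
    var = w**2 * s1**2 + 2 * w * (1 - w) * rho * s1 * s2 + (1 - w)**2 * s2**2
    return mu, np.sqrt(var)

print(portfolio(0.5, 1.0))       # sigma equals 0.5*s1 + 0.5*s2 when rho = 1
print(portfolio(0.5, 0.3))       # same mean, lower sigma
w_rf = s2 / (s1 + s2)            # rho = -1 weight on asset 1 that removes all risk
print(portfolio(w_rf, -1.0))     # sigma ~ 0
```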

Portfolio Optimization (QT/QR)


Given a large number of assets, we can determine the optimal portfolio via mean-variance optimization. Each candidate portfolio is summarized by its mean return and variance. Denote by w the vector of asset weights, by µ the vector of expected returns, and by Σ the covariance matrix of returns. We solve the following optimization problem: minimize w⊺Σw subject to w⊺µ = µ_p and w⊺1 = 1. In other words, we minimize the variance subject to achieving a specific return with an overall portfolio weight of 1. Solving this optimization gives the optimal portfolio: the one with the lowest variance for the desired level of return. As the desired return increases, the portfolio variance also increases; hence the saying "high risk, high return".
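A sketch of this equality-constrained minimum-variance problem solved via its KKT linear system; the covariance matrix, expected returns, and target return are hypothetical.

```python
import numpy as np

# Hypothetical inputs
Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.16]])     # covariance of returns
mu = np.array([0.06, 0.10, 0.14])          # expected returns
mu_p = 0.09                                # target portfolio return

# KKT system for: min w' Sigma w  s.t.  w'mu = mu_p, w'1 = 1
n = len(mu)
A = np.vstack([mu, np.ones(n)])            # constraint matrix
KKT = np.block([[2 * Sigma, A.T],
                [A, np.zeros((2, 2))]])
rhs = np.concatenate([np.zeros(n), [mu_p, 1.0]])

w = np.linalg.solve(KKT, rhs)[:n]
print(w, w @ mu, np.sqrt(w @ Sigma @ w))   # weights, achieved return, volatility
```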

There are various ways to measure portfolio performance. Assuming investors only care about portfolio return and variance, we can define the Sharpe ratio as µ_p/σ_p (often with the risk-free rate subtracted from µ_p). This represents the return earned per unit of volatility. In general, investors are risk averse and do not care only about portfolio return and variance; for example, they may care more about downside variance, the volatility that loses them money. The Sortino ratio, µ_p/σ_d, where σ_d is the volatility computed only over "bad" periods, is a useful metric here. Another similar metric is the Calmar ratio, which normalizes the return by the maximum drawdown: µ_p/Maximum Drawdown. Portfolio managers use different metrics depending on their own goals and those of their clients. Some may have the strategy of collecting fees from clients and are willing to take more volatility, caring mainly about the Sharpe ratio; others may care about their reputation and give great consideration to their Calmar ratio in order not to lose clients.

Pricing Models (QT/QR)


The Capital Asset Pricing Model (CAPM) is a linear factor model that explains the return of any asset through that of a market portfolio:

E(r_i) − r_f = β_i (E(r_m) − r_f)

The premium of a given asset is a scaled version of the market premium; the scale factor is what is typically referred to as β in the market. Hedge funds and portfolio managers do not want β, as it means their returns are correlated with the market. Note that the equation above has no intercept term α; α, returns that are not correlated with the market, is what many portfolio managers seek. In practice, regression tests show that CAPM does not hold, although it is a good theoretical starting point: we would expect no intercept term, but empirical evidence shows otherwise.

The CAPM seeks to explain returns through the market. However, there may be other factors that explain asset returns better. The Fama-French three-factor model expands on the market factor by adding a size and a value factor. The
size factor is designed by going long small stocks and shorting large stocks while the value factor longs value stocks and
shorts growth stocks. Nowadays, there are many factor models, with factors like momentum, profitability, and investment.
Many times, portfolio managers compare their returns to these factors to see if they have exposure. A portfolio manager
may believe that they have found α, but in reality, they have created a portfolio that is highly correlated to one of the
Fama-French factors.
