Quant Interview Cheat Sheet
Contents
1 Preface
  1.1 Introductory Remarks
  1.2 How to Use This
2 Probability
  2.1 Key Distributions
  2.2 Formulas and Laws
  2.3 Markov Chains
3 Statistical Learning
  3.1 Linear Regression
  3.2 Classification
  3.3 Tree Methods
  3.4 Neural Networks
4 Linear Algebra
  4.1 Matrix Basics
  4.2 Matrix Decompositions
5 Calculus
6 Finance
  6.1 Options Theory
  6.2 Portfolio Theory
1 Preface
1.1 Introductory Remarks
Quant interviews are hard, and for good reason: at the intersection of computer science, statistics, and finance lies arguably the most lucrative job straight out of university, with graduate roles commanding up to $600,000 per year in compensation from base salary, sign-on bonus, and performance bonus (and internships paying over $150 per hour).
Naturally, every year is more competitive than the last, and sometimes we even wonder whether we would still pass the interviews at the top trading firms today. We are uncompromising believers that you do not have to be a genius to land a role in quant, and that there is a path of persistent preparation and tenacious thoughtfulness that leads to an offer. We thought carefully about which resources would have made our own interview preparation less stressful, and we wrote this sheet in the hope that it makes your studying and interviewing easier.
2 Probability
2.1 Key Distributions
Name | Modeling Intuition | PMF/PDF | MGF | µ | σ²
Bernoulli | Toss a coin that lands heads with probability p; X = 1 if heads, else 0 | f(t; p) = p if t = 1, 1 − p if t = 0 | pe^θ + (1 − p) | p | p(1 − p)
2. X_1, …, X_r ∼ Geom(p) IID ⟹ Σ_{i=1}^r X_i ∼ NegBinom(r, p)
3. X_i ∼ Poisson(λ_i), 1 ≤ i ≤ n independent ⟹ Σ_{i=1}^n X_i ∼ Poisson(Σ_{i=1}^n λ_i)
4. X_1, …, X_n ∼ Exp(λ) IID ⟹ Σ_{i=1}^n X_i ∼ Gamma(n, λ^{-1}) (shape-scale parameterization)
5. X_i ∼ N(µ_i, σ_i²), 1 ≤ i ≤ n independent ⟹ Σ_{i=1}^n X_i ∼ N(Σ_{i=1}^n µ_i, Σ_{i=1}^n σ_i²)
6. If X ∼ Unif(0, 1) and F(x) is an invertible CDF, then Y = F^{-1}(X) has CDF F(x)
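Fact 6 is the basis of inverse transform sampling. Below is a minimal sketch that draws exponential samples from uniforms; the rate and sample size are arbitrary choices for the example.

```python
import numpy as np

# Inverse transform sampling: if U ~ Unif(0, 1) and F is an invertible CDF,
# then F^{-1}(U) has CDF F. Here F(x) = 1 - exp(-lam * x) (Exponential),
# so F^{-1}(u) = -ln(1 - u) / lam.
rng = np.random.default_rng(0)
lam = 2.0                      # arbitrary rate for the example
u = rng.uniform(size=100_000)
x = -np.log(1.0 - u) / lam     # exponential samples via the inverse CDF

print(x.mean())   # should be close to 1/lam = 0.5
print(x.var())    # should be close to 1/lam^2 = 0.25
```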
Intuitively, the Law of Total Expectation says that if we "average over all the averages" of X obtained from some information about Y, we recover the true average. Similarly, the Law of Total Variance says that the true variance comes from two sources: between samples (the first term) and within samples (the second term).
Cov(X, Y) = E[XY] − E[X]E[Y],   Corr(X, Y) = Cov(X, Y) / (σ_X σ_Y)
Covariance and correlation are measurements of the linear association of X and Y. An example of uncorrelated but not independent random variables is the pair Z and Z², where Z ∼ N(0, 1).
2. Var(aX + b) = a²Var(X)
3. Cov(aX + b, cY + d) = ac·Cov(X, Y)
4. Var(X + Y ) = Var(X) + Var(Y ) + 2Cov(X, Y )
5. If X and Y are independent and have finite mean, then X and Y are uncorrelated.
6. Corr(aX + b, cY + d) = sign(ac)Corr(X, Y )
7. Cov(X, X) = Var(X)
8. The correlation and covariance matrices are both positive semidefinite.
9. |Corr(X, Y )| ≤ 1
(S_n − nµ) / (σ√n) = (X̄_n − µ) / (σ/√n) ≈ N(0, 1)
This intuitively says that sums of large numbers of IID random variables are approximately standard normal once re-scaled to have mean 0 and variance 1. This is useful for approximating sums whose exact distribution is not known.
lim_{n→∞} X̄_n = µ
This intuitively says that the sample average approaches the true average as the sample size grows large. This theorem is
the basis of Monte Carlo Sampling, as we are attempting to find the true average by averaging the results of simulations.
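As a minimal illustration, the sketch below estimates an expectation by simulation; the payoff (the maximum of two fair dice, with exact mean 161/36 ≈ 4.472) is an arbitrary example.

```python
import numpy as np

# Law of Large Numbers via Monte Carlo: the sample average of simulated
# payoffs converges to the true expectation as n grows.
# Example payoff: the maximum of two fair dice; E[max] = 161/36.
rng = np.random.default_rng(1)
n = 1_000_000
rolls = rng.integers(1, 7, size=(n, 2))   # two dice per simulation
payoff = rolls.max(axis=1)

print(payoff.mean())   # close to 4.472 for large n
print(161 / 36)        # exact answer for comparison
```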
π = πP,   Σ_i π_i = 1
µ_1 = 1 + P_11 µ_1 + P_12 µ_2 + P_13 · 0
µ_2 = 1 + P_21 µ_1 + P_22 µ_2 + P_23 · 0
(1 − ϱ^b) / (1 − ϱ^{a+b})
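A numerical sketch of the above, assuming a made-up 3-state chain and interpreting µ_1, µ_2 as expected hitting times of state 3 (so µ_3 = 0); these assumptions are only for the example.

```python
import numpy as np

# Example transition matrix (made up for illustration); rows sum to 1.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# Stationary distribution: solve pi P = pi with the entries summing to 1.
# Rewrite as (P^T - I) pi = 0 and replace one equation by the normalization.
A = P.T - np.eye(3)
A[-1, :] = 1.0
b = np.array([0.0, 0.0, 1.0])
pi = np.linalg.solve(A, b)
print(pi)          # stationary probabilities
print(pi @ P)      # equals pi up to rounding

# Expected hitting times of state 3 from states 1 and 2:
# mu_i = 1 + sum_j P[i, j] * mu_j, with mu_3 = 0.
Q = P[:2, :2]                                   # transitions among states 1 and 2
mu = np.linalg.solve(np.eye(2) - Q, np.ones(2))
print(mu)          # [mu_1, mu_2]
```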
3 Statistical Learning
3.1 Linear Regression
We can express the above in matrix form. Denote X as an m × (p + 1) matrix with each row an input vector with 1
in the first position corresponding to β0 . Further, let β = (β0 , β1 , . . . , βp )⊺ , and let y denote the m-vector of outputs
corresponding to X. Then we have that y = Xβ + ϵ, with E[ϵ_i] = 0 and Var(ϵ_i) = σ².
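In this matrix form the least-squares estimate is β̂ = (X⊺X)^{-1}X⊺y. A minimal numpy sketch on synthetic data (all numbers below are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(2)
m, p = 200, 3
X_raw = rng.normal(size=(m, p))
beta_true = np.array([1.0, 2.0, -1.0, 0.5])        # [beta_0, beta_1, ..., beta_p]
X = np.column_stack([np.ones(m), X_raw])           # prepend the column of 1s for beta_0
y = X @ beta_true + rng.normal(scale=0.1, size=m)  # y = X beta + eps

# Least-squares estimate: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)        # close to beta_true

# Equivalent and numerically preferred:
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_lstsq)
```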
F = (RSS_0 − RSS_1) / (RSS_1 / (m − k − 2))
where RSS_0 is the residual sum of squares of the model with k predictors and RSS_1 that of the model with one additional predictor.
Forward selection greedily adds the predictor that produces the largest value of F at each stage, whereas backward selection sequentially deletes the predictor producing the smallest value of F. The stopping condition is an F-ratio at or above the (100 − α)th percentile of the F_{1, m−k−2} distribution, for some α, typically 5 or 10.
Ridge Regression
Objective function to be minimized, for hyperparameter λ:
Σ_{i=1}^m (y_i − β_0 − Σ_{j=1}^p β_j x_{ij})² + λ Σ_{j=1}^p β_j² = RSS + λ Σ_{j=1}^p β_j² = (y − Xβ)⊺(y − Xβ) + λβ⊺β
As λ → ∞, the impact of the shrinkage penalty grows, and ridge regression coefficient estimates will approach zero.
Note that we don’t want to shrink the intercept, since it is, conceptually, a measure of the center of the response when
predictors are at zero. Note that ridge regression will shrink all coefficients towards zero but will not set any of them
exactly to zero, reducing model interpretability for large p. Lasso overcomes this disadvantage.
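Unlike the lasso below, ridge regression has a closed form, β̂ = (X⊺X + λI)^{-1}X⊺y. A rough sketch with the intercept left unpenalized (the data are synthetic, made up for the example):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge estimate with an unpenalized intercept.
    X: (m, p) predictors without a column of 1s; y: (m,) response."""
    m, p = X.shape
    Xc = np.column_stack([np.ones(m), X])
    penalty = lam * np.eye(p + 1)
    penalty[0, 0] = 0.0          # do not shrink the intercept
    return np.linalg.solve(Xc.T @ Xc + penalty, Xc.T @ y)

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(size=100)

for lam in (0.0, 1.0, 100.0, 1e6):
    print(lam, ridge_fit(X, y, lam)[1:])   # coefficients shrink toward 0 as lam grows
```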
Lasso Regression
Objective function to be minimized, for hyperparameter λ, yielding sparse models:
Σ_{i=1}^m (y_i − β_0 − Σ_{j=1}^p β_j x_{ij})² + λ Σ_{j=1}^p |β_j| = RSS + λ Σ_{j=1}^p |β_j|
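There is no closed form for the lasso; it is typically fit by coordinate descent. A hedged sketch using scikit-learn's Lasso (its alpha penalty is rescaled by the sample size, so it is not numerically identical to λ above; the data are synthetic):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 10))
beta_true = np.zeros(10)
beta_true[[0, 3]] = [2.0, -1.5]            # sparse truth: only two active predictors
y = X @ beta_true + rng.normal(scale=0.5, size=200)

model = Lasso(alpha=0.1)                   # alpha plays the role of the penalty strength
model.fit(X, y)
print(model.coef_)                         # most coefficients are exactly 0
print(model.intercept_)
```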
3.2 Classification
To find β, we can use the method of maximum likelihood, where the probability for each x_i is denoted f(β̂_0 + β̂_1 x_i). Unlike in linear regression, there is no closed-form solution, so we must use iterative methods to determine β̂. Multiple logistic regression proceeds in the same manner, but with additional coefficients β to estimate.
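A minimal sketch of one such iterative method, plain gradient ascent on the log-likelihood (statistical packages typically use Newton's method / IRLS instead); the data and learning rate below are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(5)
m = 500
x = rng.normal(size=m)
X = np.column_stack([np.ones(m), x])              # intercept + one predictor
beta_true = np.array([-0.5, 2.0])
y = (rng.uniform(size=m) < sigmoid(X @ beta_true)).astype(float)

# Gradient ascent on the log-likelihood: gradient = X^T (y - p).
beta = np.zeros(2)
lr = 0.1
for _ in range(5000):
    p = sigmoid(X @ beta)
    beta += lr * X.T @ (y - p) / m

print(beta)    # approximately recovers beta_true
```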
P(Y = k | X = x) = P(X = x | Y = k)P(Y = k) / P(X = x) = π_k f_k(x) / Σ_{l=1}^K π_l f_l(x)
Generative models calculate these quantities individually and then combine them to obtain the desired probability indirectly, hence the name generative models.
When we assume that σ_1 = σ_2 = · · · = σ_K, we obtain linear discriminant analysis. When we relax this assumption and allow a unique σ_k for each class, we obtain quadratic discriminant analysis. Above, we assumed f_k(x) to be univariate, but linear and quadratic discriminant analysis also apply with multivariate Gaussians.
Bagging (QR)
Bagging is a method used to reduce the variance of a decision tree. From a single dataset, we repeatedly resample with replacement (bootstrapping), train a decision tree on each resample, and then average all the trees to obtain a model with lower variance (aggregation).
Due to bootstrapping, not all data points are used when training a given tree. On average, each tree only uses about 2/3 of the data. The remaining data points are referred to as OOB (out-of-bag) observations, and they can be used to estimate out-of-sample test error, as sketched below.
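A short sketch of bagging with out-of-bag scoring via scikit-learn's BaggingRegressor, whose default base learner is a decision tree; the dataset below is synthetic and the hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor

rng = np.random.default_rng(6)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=500)

# Bagging: bootstrap the rows, fit a tree on each resample, average the predictions.
bag = BaggingRegressor(n_estimators=200, oob_score=True, random_state=0)
bag.fit(X, y)
print(bag.oob_score_)   # out-of-bag R^2, an estimate of out-of-sample performance
```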
Boosting (QR)
In bagging and random forests, we reduced variance by using an ensemble of decision trees. Here, we focus on reducing
bias by sequentially growing decision trees. In addition, we train the model on a modified version of the data, rather than
a bootstrapped dataset. More specifically, we train the model on the residuals.
First, we train a tree on the original dataset. Then, we calculate the residuals and fit a new tree to the residuals. We add the models together and repeat. Boosting is additive, focusing on reducing the error with each sequentially grown tree; a bare-bones version of this loop is sketched below.
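This residual-fitting loop is essentially least-squares gradient boosting with shallow trees; the shrinkage rate, tree depth, and synthetic data below are arbitrary choices for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(7)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=500)

trees, lr = [], 0.1
pred = np.zeros_like(y)
for _ in range(200):
    resid = y - pred                           # current residuals
    tree = DecisionTreeRegressor(max_depth=2)  # weak learner
    tree.fit(X, resid)
    pred += lr * tree.predict(X)               # additive update
    trees.append(tree)

print(np.mean((y - pred) ** 2))                # training MSE shrinks as trees are added
```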
One can see that a neural network can have an extremely large number of parameters. These parameters are trained through gradient descent and backpropagation. Many different losses can be used, which helps explain the versatility of neural networks. A downside is that they require a large amount of data to be effective; they are also not necessarily the fastest, and they lack interpretability. All three points are reasons why neural networks are not that widely used in finance, and many practitioners turn to simpler models, like regression, for their needs. Still, the strength of neural networks is clear, especially with the sudden popularity of large language models.
4 Linear Algebra
4.1 Matrix Basics
3. det(A) ̸= 0
4. Ax = 0 has only the trivial solution x = 0; dim(nul(A)) = 0
Note that if A = [a b; c d], then A⁻¹ = (1/det(A)) · [d −b; −c a]. Larger inverses may be found via Gauss-Jordan Elimination: row-reduce the augmented matrix [A | I] until it becomes [I | A⁻¹].
Note that if A has a negative diagonal element, then A cannot be positive semi-definite.
Intuitively, this says that we can find a basis consisting of the eigenvectors of A. This is useful for computing large powers of A, since Aⁿ = XDⁿX⁻¹. An important special case: if A is real and symmetric, then A is diagonalizable, and the eigenvectors can be chosen orthonormal.
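A small numpy check of Aⁿ = XDⁿX⁻¹ on a made-up symmetric matrix, where the eigenvectors can be taken orthonormal so that X⁻¹ = X⊺:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])          # real symmetric, hence diagonalizable

eigvals, X = np.linalg.eigh(A)      # columns of X are orthonormal eigenvectors
n = 10
A_power = X @ np.diag(eigvals ** n) @ X.T          # X D^n X^{-1}, with X^{-1} = X^T here
print(np.allclose(A_power, np.linalg.matrix_power(A, n)))   # True
```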
A = U ΣV ⊺
Intuitively, this says that we can express A as a diagonal matrix with suitable choices of (orthogonal) bases.
QR Decomposition (QR)
For nonsingular A, we can write A = QR, where Q is orthogonal and R is an upper triangular matrix with positive
diagonal elements. QR decomposition assists in increasing the efficiency of solving Ax = b for nonsingular A:
Ax = b ⇒ QRx = b ⇒ Rx = Q⁻¹b = Q⊺b
QR decomposition is very useful in efficiently solving large numerical systems and inversion of matrices. Furthermore, it
is also used in least-squares when our data is not full rank.
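A quick sketch of solving Ax = b through QR; the system is made up, and scipy's triangular solver is one way to do the back-substitution.

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(8)
A = rng.normal(size=(5, 5))
b = rng.normal(size=5)

Q, R = np.linalg.qr(A)              # A = QR, Q orthogonal, R upper triangular
x = solve_triangular(R, Q.T @ b)    # back-substitution on Rx = Q^T b
print(np.allclose(A @ x, b))        # True
```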
If A is symmetric positive definite, then A can be expressed as A = R⊺ R via Cholesky decomposition, where R is an upper
triangular matrix with positive diagonal entries. Cholesky decomposition is essentially LU decomposition with L = U ⊺ .
These decompositions are both useful for solving large linear systems.
Projections (QR)
Fix a vector v ∈ Rn . The projection of x ∈ Rn onto v is given by
proj_v(x) = P_v x = (vv⊺ / ||v||²) x = (x · v / ||v||²) v
More generally, if S = Span{v1 , . . . , vk } ⊆ Rn has orthogonal basis {v1 , . . . , vk }, then the projection of x ∈ Rn onto S is
given by
proj_S(x) = Σ_{i=1}^k (x · v_i / ||v_i||²) v_i
The main property is that proj_S(x) ∈ S and x − proj_S(x) is orthogonal to every s ∈ S. Linear regression can be viewed as the projection of the observed response vector onto the subspace spanned by the columns of the design matrix.
5 Calculus
Differentiation (QT/QR)
At all points x where the functions and the derivatives are defined,
d/dx (xⁿ) = nxⁿ⁻¹        d/dx sin(x) = cos(x)        d/dx cos(x) = −sin(x)        d/dx tan(x) = sec²(x)
d/dx sec(x) = sec(x)tan(x)        d/dx csc(x) = −csc(x)cot(x)        d/dx cot(x) = −csc²(x)
d/dx arcsin(x) = 1/√(1 − x²)        d/dx arctan(x) = 1/(1 + x²)        d/dx arcsec(x) = 1/(|x|√(x² − 1))
d/dx (eˣ) = eˣ        d/dx (f(x) ± g(x)) = f′(x) ± g′(x)        d/dx (f(x)g(x)) = f′(x)g(x) + g′(x)f(x)
d/dx (ln(x)) = 1/x        d/dx f(g(x)) = f′(g(x))g′(x)        d/dx (f(x)/g(x)) = (f′(x)g(x) − f(x)g′(x)) / (g(x))²
d/dx f(x)^{g(x)} = f(x)^{g(x)} [g′(x)ln(f(x)) + g(x)f′(x)/f(x)]        d/dx (xˣ) = xˣ(ln(x) + 1)
Integration (QT/QR)
Disregarding the +C on all the integrals,
∫ xⁿ dx = xⁿ⁺¹/(n + 1), n ≠ −1        ∫ sin(x) dx = −cos(x)        ∫ cos(x) dx = sin(x)        ∫ tan(x) dx = −ln|cos(x)|
∫ sec(x) dx = ln|sec(x) + tan(x)|        ∫ csc(x) dx = ln|csc(x) − cot(x)|        ∫ cot(x) dx = ln|sin(x)|
∫ 1/√(1 − x²) dx = arcsin(x)        ∫ 1/(1 + x²) dx = arctan(x)        ∫ 1/(|x|√(x² − 1)) dx = arcsec(x)
∫ eˣ dx = eˣ        ∫ (1/x) dx = ln|x|        ∫ (f(x) ± g(x)) dx = ∫ f(x) dx ± ∫ g(x) dx
∫ u(x)v′(x) dx = u(x)v(x) − ∫ v(x)u′(x) dx        ∫ f′(g(x))g′(x) dx = f(g(x))
Σ_{k=1}^n k = n(n + 1)/2        Σ_{k=1}^n k² = n(n + 1)(2n + 1)/6        Σ_{k=s}^∞ a·rᵏ = a·rˢ/(1 − r) for |r| < 1        Σ_{k=1}^∞ 1/k² = π²/6
6 Finance
6.1 Options Theory
Using these two assets, we can form an asset that has a linear payout. This is known as a forward contract. It involves going long 1 unit of the underlying and shorting K units of the bond. The forward has a final payoff of V_T = S_T − K. This is the most basic example of a derivative. In general, this K is known as the strike price. Here, we have a positive payout when S_T > K and a negative payout when S_T < K.
A central technique in derivatives pricing is the method of replication. If two assets have the same final payoff, they should also have the same intermediate prices; otherwise, there would be arbitrage. One can immediately see that we can replicate a forward contract with a put option and a call option. In other words, the final payoff of a forward contract is the same as 1 unit of a call and −1 unit of a put. This is known as put-call parity. We can mathematically write this as C − P = S − Ke^{−rT}, which reduces to C − P = S − K when rates are zero.
Black-Scholes (QT/QR)
Black-Scholes is the most important concept in all of options theory. It provides the basis for pricing options through a differential equation, as well as the Greeks that are commonly used in the industry. The Black-Scholes differential equation is as follows:
(1/2)σ²S² ∂²C/∂S² + rS ∂C/∂S + ∂C/∂t = rC
When we consider European options (exercise only occurs at expiration), a constant risk-free rate, a non-dividend-paying stock, and a stock price that follows a geometric Brownian motion, we can use Black-Scholes. In practice these assumptions usually do not hold, but the model still gives a solid framework for modern options pricing, including certain numerical methods used in practice. Here, σ is the constant volatility of the GBM, S is the price of the underlying, C is the option price, and r is the constant risk-free rate.
Solving the differential equation gives a closed form for the price of an option, as a function of t, S, r, and σ. Partial
derivatives can be taken to obtain the sensitivity of the option to these values.
Delta: ∂C/∂S
Gamma: ∂²C/∂S²
Theta: ∂C/∂t
Vega: ∂C/∂σ
Rho: ∂C/∂r
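For reference, a hedged sketch of the closed form for a European call, a few of the Greeks above, and a put-call parity check; the inputs are made up for the example.

```python
import numpy as np
from scipy.stats import norm

def black_scholes_call(S, K, T, r, sigma):
    """European call price and a few Greeks under Black-Scholes
    (non-dividend-paying stock, constant r and sigma)."""
    d1 = (np.log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * np.sqrt(T))
    d2 = d1 - sigma * np.sqrt(T)
    price = S * norm.cdf(d1) - K * np.exp(-r * T) * norm.cdf(d2)
    delta = norm.cdf(d1)                              # dC/dS
    gamma = norm.pdf(d1) / (S * sigma * np.sqrt(T))   # d^2C/dS^2
    vega = S * norm.pdf(d1) * np.sqrt(T)              # dC/dsigma
    return price, delta, gamma, vega

S, K, T, r, sigma = 100.0, 105.0, 1.0, 0.02, 0.25     # example inputs
call, delta, gamma, vega = black_scholes_call(S, K, T, r, sigma)
put = call - S + K * np.exp(-r * T)                   # put-call parity: C - P = S - K e^{-rT}
print(call, put, delta, gamma, vega)
```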
One of the biggest empirical downsides of Black-Scholes is the existence of a non-constant volatility. In real life, there
exists a volatility skew. In other words, the volatility is not constant across different strikes. This can lead to huge
inaccuracies when pricing with Black-Scholes.
µ = wµ_1 + (1 − w)µ_2
σ² = w²σ_1² + 2w(1 − w)ρσ_1σ_2 + (1 − w)²σ_2²
If the assets are perfectly correlated (ρ = 1), we have σ = wσ_1 + (1 − w)σ_2: the portfolio volatility is the weighted average of the individual volatilities. If ρ < 1, then σ < wσ_1 + (1 − w)σ_2; the volatility of the portfolio decreases while the expected return stays the same. We see that only ρ < 1 is required for the benefits of diversification to appear. In fact, if ρ = −1, then we can obtain a risk-free portfolio with the weight w = σ_2/(σ_1 + σ_2).
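A tiny numerical check of the two-asset formulas, including the risk-free weight when ρ = −1; the volatilities and expected returns below are made up.

```python
import numpy as np

def portfolio_stats(w, mu1, mu2, s1, s2, rho):
    """Expected return and volatility of a two-asset portfolio with weight w on asset 1."""
    mu = w * mu1 + (1 - w) * mu2
    var = w**2 * s1**2 + 2 * w * (1 - w) * rho * s1 * s2 + (1 - w)**2 * s2**2
    return mu, np.sqrt(var)

mu1, mu2, s1, s2 = 0.08, 0.05, 0.20, 0.10      # example assets

print(portfolio_stats(0.5, mu1, mu2, s1, s2, rho=1.0))    # vol = 0.5*s1 + 0.5*s2
print(portfolio_stats(0.5, mu1, mu2, s1, s2, rho=0.3))    # vol below the weighted average

w_rf = s2 / (s1 + s2)                          # risk-free weight when rho = -1
print(portfolio_stats(w_rf, mu1, mu2, s1, s2, rho=-1.0))  # vol = 0
```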
We can generalize this to larger portfolios. For an equally weighted portfolio of n assets, the portfolio variance can be written as σ² = (1/n)·(average variance) + ((n − 1)/n)·(average covariance). As we take n → ∞, the individual variances of the securities stop mattering: only the covariances between the securities do. Once the portfolio is large enough, the only driver of portfolio variance is the average covariance structure between all the assets. Idiosyncratic risk is the variance that can be diversified away by increasing the number of assets, while systematic risk is the risk inherent to the market; it cannot be diversified away.
There are various ways to measure portfolio performance. Assuming investors only care about portfolio return and variance, we can define the Sharpe ratio as µ_p/σ_p. This represents the level of return per unit of volatility. In general, investors are risk averse and do not just care about portfolio return and variance. For example, investors may care more about downside variance, the volatility that makes them lose money. The Sortino ratio is a useful metric here: µ_p/σ_d, where σ_d is the volatility measured only on "bad" days. Another similar metric is the Calmar ratio, which normalizes the level of return by the maximum drawdown: µ_p/(Maximum Drawdown). Portfolio managers use different metrics depending on their own goals and those of their clients. Some may have the strategy of collecting fees from clients and are willing to take on more volatility, caring mostly about the Sharpe ratio. Others may care about their reputation and instead give great consideration to their Calmar ratio, in order not to lose clients.
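A rough sketch of these three ratios on a daily return series (simulated here; annualization conventions differ across desks, so treat the √252 scaling as one common choice):

```python
import numpy as np

rng = np.random.default_rng(9)
returns = rng.normal(0.0005, 0.01, size=252)           # one year of simulated daily returns

ann_mean = returns.mean() * 252
sharpe = ann_mean / (returns.std() * np.sqrt(252))

downside = returns[returns < 0]                        # "bad" days only
sortino = ann_mean / (downside.std() * np.sqrt(252))

equity = np.cumprod(1 + returns)                       # cumulative equity curve
drawdown = 1 - equity / np.maximum.accumulate(equity)  # running drawdown
calmar = ann_mean / drawdown.max()                     # return per unit of max drawdown

print(sharpe, sortino, calmar)
```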
The CAPM seeks to explain returns through the market. However, there may be other factors that explain returns better. The Fama-French three-factor model expands on the market factor by adding a size and a value factor. The size factor is constructed by going long small stocks and shorting large stocks, while the value factor goes long value stocks and shorts growth stocks. Nowadays, there are many factor models, with factors like momentum, profitability, and investment. Portfolio managers often compare their returns to these factors to see if they have exposure. A portfolio manager may believe that they have found α, but in reality, they have created a portfolio that is highly correlated with one of the Fama-French factors.
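A hedged sketch of this exposure check: regress a return series on factor returns and inspect the intercept. The factor series below are simulated placeholders, not real Fama-French data.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 252
mkt = rng.normal(0.0004, 0.010, size=n)   # simulated market factor
smb = rng.normal(0.0001, 0.006, size=n)   # simulated size factor
hml = rng.normal(0.0001, 0.006, size=n)   # simulated value factor

# A portfolio that secretly loads on the value factor plus a little true alpha.
port = 0.0001 + 0.9 * mkt + 0.6 * hml + rng.normal(0, 0.004, size=n)

X = np.column_stack([np.ones(n), mkt, smb, hml])
coef, *_ = np.linalg.lstsq(X, port, rcond=None)
alpha, betas = coef[0], coef[1:]
print(alpha)    # leftover "alpha" after controlling for the factors
print(betas)    # estimated loadings on [MKT, SMB, HML]
```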