
Mathematics for Machine Learning: Essential Equations (V5)

1. Linear Algebra
• Addition of Vectors:
  $\mathbf{u} + \mathbf{v} = \begin{bmatrix} u_1 + v_1 \\ u_2 + v_2 \\ \vdots \\ u_n + v_n \end{bmatrix}$

• Scaling a Vector:
  $c \cdot \mathbf{v} = \begin{bmatrix} c v_1 \\ c v_2 \\ \vdots \\ c v_n \end{bmatrix}$

• Matrix Scalar Multiplication:
  $c \cdot A = \begin{bmatrix} c a_{11} & c a_{12} & \cdots & c a_{1n} \\ c a_{21} & c a_{22} & \cdots & c a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ c a_{m1} & c a_{m2} & \cdots & c a_{mn} \end{bmatrix}$

• Matrix-Vector Product:
  $A\mathbf{v} = \begin{bmatrix} \sum_{j=1}^{n} a_{1j} v_j \\ \sum_{j=1}^{n} a_{2j} v_j \\ \vdots \\ \sum_{j=1}^{n} a_{mj} v_j \end{bmatrix}$

• Matrix Trace:
  $\operatorname{tr}(A) = \sum_{i=1}^{n} a_{ii}$

• Matrix Determinant (2x2 Matrix):
  $\det(A) = a_{11} a_{22} - a_{12} a_{21}$

• Eigenvector Equation:
  $A\mathbf{v} = \lambda \mathbf{v}$

• Vector Projection:
  $\operatorname{proj}_{\mathbf{b}}(\mathbf{a}) = \dfrac{\mathbf{a} \cdot \mathbf{b}}{\mathbf{b} \cdot \mathbf{b}}\, \mathbf{b}$

• Inverse of a 2x2 Matrix:
  $A^{-1} = \dfrac{1}{\det(A)} \begin{bmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{bmatrix}$

• Orthogonality Condition:
  $\mathbf{u} \cdot \mathbf{v} = 0$ if $\mathbf{u}$ and $\mathbf{v}$ are orthogonal.
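As a quick numerical check of the identities above, here is a minimal NumPy sketch; the vectors u, v, the matrix A, and the projection pair a, b are arbitrary example values, not taken from the text.

# Minimal NumPy sketch of the linear-algebra identities above (toy values).
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

print(u + v)                     # element-wise vector addition
print(2.0 * v)                   # scaling a vector
print(A @ np.array([1.0, 1.0]))  # matrix-vector product
print(np.trace(A))               # trace = sum of diagonal entries
print(np.linalg.det(A))          # 2x2 determinant: a11*a22 - a12*a21

# Eigenvector equation: A v = lambda v
eigvals, eigvecs = np.linalg.eig(A)
print(np.allclose(A @ eigvecs[:, 0], eigvals[0] * eigvecs[:, 0]))

# Projection of a onto b: (a.b / b.b) b
a, b = np.array([1.0, 2.0]), np.array([3.0, 0.0])
print((a @ b) / (b @ b) * b)

# Inverse of a 2x2 matrix via the adjugate formula
inv = np.array([[A[1, 1], -A[0, 1]], [-A[1, 0], A[0, 0]]]) / np.linalg.det(A)
print(np.allclose(inv, np.linalg.inv(A)))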

2. Probability and Statistics
• Joint Probability:
  $P(A \cap B) = P(A \mid B)\, P(B)$

• Bayes' Theorem (Alternative Form):
  $P(B \mid A) = \dfrac{P(A \mid B)\, P(B)}{P(A)}$

• Variance (Alternative):
  $\operatorname{Var}(X) = E[X^2] - (E[X])^2$

• Cumulative Distribution Function (CDF):
  $F_X(x) = P(X \le x)$

• Covariance (Alternative):
  $\operatorname{Cov}(X, Y) = E[XY] - E[X]\, E[Y]$

• Entropy:
  $H(X) = -\sum_{x} P(x) \log P(x)$

• KL Divergence:
  $D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \dfrac{P(x)}{Q(x)}$

• Conditional Expectation:
  $E[Y \mid X] = \int_{y} y\, f_{Y \mid X}(y \mid x)\, dy$

• Law of Iterated Expectations:
  $E[Y] = E[E[Y \mid X]]$

• Central Limit Theorem:
  $\dfrac{\bar{X} - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} N(0, 1)$
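The variance, covariance, entropy, and KL formulas above are easy to verify numerically; the sketch below uses made-up samples and two invented discrete distributions P and Q purely for illustration.

# Numerical check of the variance, covariance, entropy, and KL formulas above.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=100_000)
y = 0.5 * x + rng.normal(size=100_000)

var_x = np.mean(x**2) - np.mean(x)**2               # Var(X) = E[X^2] - (E[X])^2
cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)   # Cov(X,Y) = E[XY] - E[X]E[Y]

P = np.array([0.2, 0.5, 0.3])                       # example distributions
Q = np.array([0.3, 0.4, 0.3])
entropy = -np.sum(P * np.log(P))                    # H(X) = -sum P(x) log P(x)
kl = np.sum(P * np.log(P / Q))                      # D_KL(P || Q)

print(var_x, cov_xy, entropy, kl)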

3. Calculus
• Power Rule:
  $\dfrac{d}{dx}[x^n] = n x^{n-1}$

• Product Rule:
  $\dfrac{d}{dx}[uv] = u \dfrac{dv}{dx} + v \dfrac{du}{dx}$

• Quotient Rule:
  $\dfrac{d}{dx}\left[\dfrac{u}{v}\right] = \dfrac{v \dfrac{du}{dx} - u \dfrac{dv}{dx}}{v^2}$

• Exponential Derivative:
  $\dfrac{d}{dx}[e^x] = e^x$

• Logarithmic Derivative:
  $\dfrac{d}{dx}[\ln x] = \dfrac{1}{x}$

• Integral of a Power Function:
  $\int x^n\, dx = \dfrac{x^{n+1}}{n+1} + C \quad \text{for } n \neq -1$

• Fundamental Theorem of Calculus:
  $\int_a^b f'(x)\, dx = f(b) - f(a)$

• Chain Rule (Alternative Form):
  $\dfrac{dy}{dx} = \dfrac{dy}{du} \cdot \dfrac{du}{dx}$

• Taylor Expansion (Simplified):
  $f(x) \approx f(a) + f'(a)(x - a) + \dfrac{f''(a)}{2}(x - a)^2$

• Jacobian Matrix:
  $J_{ij} = \dfrac{\partial f_i}{\partial x_j}$
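A finite-difference sketch is one way to sanity-check a few of the rules above; the test functions, the point x0, and the interval [a, b] below are arbitrary choices for illustration.

# Finite-difference checks of the power rule, chain rule, and the
# fundamental theorem of calculus (toy functions and points).
import numpy as np

def numerical_derivative(f, x, h=1e-6):
    # central-difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 1.7
# Power rule: d/dx x^3 = 3 x^2
print(numerical_derivative(lambda x: x**3, x0), 3 * x0**2)
# Chain rule: d/dx sin(x^2) = cos(x^2) * 2x
print(numerical_derivative(lambda x: np.sin(x**2), x0), np.cos(x0**2) * 2 * x0)
# Fundamental theorem: integral of f'(x) over [a, b] equals f(b) - f(a)
a, b = 0.0, 2.0
grid = np.linspace(a, b, 10_001)
print(np.trapz(3 * grid**2, grid), b**3 - a**3)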

4. Optimization
• Stochastic Gradient Descent (SGD):
  $w \leftarrow w - \eta\, \nabla J(w; x_i, y_i)$

• Momentum Gradient Descent:
  $v_t = \beta v_{t-1} + (1 - \beta)\, \nabla J(w), \qquad w \leftarrow w - \eta\, v_t$

• RMSProp Update Rule:
  $w \leftarrow w - \dfrac{\eta}{\sqrt{\nabla^2 J(w) + \epsilon}}\, \nabla J(w)$

• Nesterov Accelerated Gradient:
  $w_{t+1} = w_t - \eta\, \nabla J\big(w_t + \beta (w_t - w_{t-1})\big)$

• Adam Optimization:
  $m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, \nabla J(w), \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, (\nabla J(w))^2$

• Gradient Clipping:
  $\nabla J(w) \leftarrow \dfrac{\nabla J(w)}{\max(1, \|\nabla J(w)\| / c)}$

• Projected Gradient Descent:
  $w_{t+1} = \Pi_C\big(w_t - \eta\, \nabla J(w_t)\big)$

• Newton's Method:
  $w_{t+1} = w_t - \eta\, H^{-1} \nabla J(w_t)$

• Proximal Gradient Method:
  $w_{t+1} = \operatorname{prox}_g\big(w_t - \eta\, \nabla f(w_t)\big)$

• Learning Rate Decay:
  $\eta_t = \dfrac{\eta_0}{1 + \lambda t}$
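A toy sketch of the momentum and Adam updates above, applied to the quadratic J(w) = ||w||^2; the objective, hyperparameters, and iteration counts are illustrative assumptions, and Adam's bias correction is omitted for brevity.

# Momentum and (simplified) Adam updates on a toy quadratic objective.
import numpy as np

def grad_J(w):
    return 2 * w                                 # gradient of J(w) = w^T w

w = np.array([5.0, -3.0])
v = np.zeros_like(w)
eta, beta = 0.1, 0.9
for _ in range(100):
    v = beta * v + (1 - beta) * grad_J(w)        # momentum accumulator
    w = w - eta * v
print(w)                                         # approaches the minimum at 0

# Adam-style first/second moment updates (bias correction omitted)
w = np.array([5.0, -3.0])
m, s = np.zeros_like(w), np.zeros_like(w)
beta1, beta2, eps = 0.9, 0.999, 1e-8
for _ in range(500):
    g = grad_J(w)
    m = beta1 * m + (1 - beta1) * g              # first moment
    s = beta2 * s + (1 - beta2) * g**2           # second moment
    w = w - eta * m / (np.sqrt(s) + eps)
print(w)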

5. Regression Models
• Linear Regression Hypothesis:
  $\hat{y} = Xw + b$

• Ordinary Least Squares (OLS):
  $w = (X^T X)^{-1} X^T y$

• Ridge Regression Objective:
  $J(w) = \|y - Xw\|^2 + \lambda \|w\|^2$

• Lasso Regression Objective:
  $J(w) = \|y - Xw\|^2 + \lambda \|w\|_1$

• Logistic Regression Hypothesis:
  $\hat{y} = \sigma(Xw + b), \qquad \sigma(z) = \dfrac{1}{1 + e^{-z}}$

• Cross-Entropy Loss:
  $J(w) = -\dfrac{1}{m} \sum_{i=1}^{m} \big[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \big]$

• Mean Absolute Error (MAE):
  $\text{MAE} = \dfrac{1}{m} \sum_{i=1}^{m} |y_i - \hat{y}_i|$

• Mean Squared Error (MSE):
  $\text{MSE} = \dfrac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2$

• Coefficient of Determination (R-squared):
  $R^2 = 1 - \dfrac{\sum_{i=1}^{m} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{m} (y_i - \bar{y})^2}$

• Adjusted R-squared:
  $\bar{R}^2 = 1 - \dfrac{(1 - R^2)(n - 1)}{n - p - 1}$

• Gradient of MSE Loss:
  $\nabla J(w) = \dfrac{1}{m} X^T (Xw - y)$

• Hinge Loss for SVM:
  $J(w) = \dfrac{1}{m} \sum_{i=1}^{m} \max\big(0, 1 - y_i (w^T x_i + b)\big)$

• Huber Loss:
  $L_\delta(a) = \begin{cases} \frac{1}{2} a^2 & \text{if } |a| \le \delta, \\ \delta \left(|a| - \frac{1}{2}\delta\right) & \text{if } |a| > \delta \end{cases}$
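A NumPy sketch of the OLS solution, a ridge fit (solved via its standard closed form (X^T X + lambda I)^{-1} X^T y), and the error metrics above on synthetic data; the true weights, noise level, and lambda are invented for illustration.

# OLS, ridge, and regression metrics on synthetic data (toy values).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)

w_ols = np.linalg.solve(X.T @ X, X.T @ y)            # (X^T X)^{-1} X^T y
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

y_hat = X @ w_ols
mse = np.mean((y - y_hat)**2)                        # mean squared error
mae = np.mean(np.abs(y - y_hat))                     # mean absolute error
r2 = 1 - np.sum((y - y_hat)**2) / np.sum((y - y.mean())**2)
grad_mse = X.T @ (X @ w_ols - y) / len(y)            # gradient of MSE loss
print(w_ols, w_ridge, mse, mae, r2, grad_mse)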

6. Neural Networks
• Perceptron Update Rule:
  $w \leftarrow w + \eta\, (y - \hat{y})\, x$

• Sigmoid Activation Function:
  $\sigma(z) = \dfrac{1}{1 + e^{-z}}$

• ReLU Activation Function:
  $f(x) = \max(0, x)$

• Softmax Function:
  $\operatorname{Softmax}(z_i) = \dfrac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$

• Loss Function for Multi-Class Classification:
  $J(w) = -\dfrac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} y_{ik} \log(\hat{y}_{ik})$

• Forward Propagation (Single Layer):
  $a = \sigma(w^T x + b)$

• Backward Propagation (Gradient for Weights):
  $\dfrac{\partial J}{\partial w} = x\, (\hat{y} - y)$

• Gradient Descent for Neural Networks:
  $w \leftarrow w - \eta\, \dfrac{\partial J}{\partial w}$

• Dropout Regularization:
  $h_i^{(l)} = r_i\, h_i^{(l)}, \qquad r_i \sim \text{Bernoulli}(p)$

• Batch Normalization:
  $\hat{x}_i = \dfrac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta$
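A single-neuron sketch of the sigmoid, softmax, forward pass, weight gradient, and gradient step above; the weights, bias, input, target, and learning rate are toy values chosen for this example.

# Forward and backward pass for one sigmoid unit, plus softmax (toy values).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - np.max(z)                 # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

w = np.array([0.2, -0.1])
b = 0.05
x = np.array([1.0, 2.0])
y = 1.0

a = sigmoid(w @ x + b)                # forward propagation (single layer)
grad_w = x * (a - y)                  # gradient of the loss w.r.t. w
w = w - 0.1 * grad_w                  # one gradient-descent step
print(a, grad_w, softmax(np.array([1.0, 2.0, 3.0])))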

7. Clustering
• k-Means Objective Function:
  $J = \sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - \mu_k\|^2$

• Centroid Update Rule:
  $\mu_k = \dfrac{1}{|C_k|} \sum_{x \in C_k} x$

• Distance Metric (Euclidean Distance):
  $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

• Silhouette Score:
  $s(i) = \dfrac{b(i) - a(i)}{\max(a(i), b(i))}$

• DBSCAN Core Point Condition:
  $|N_\epsilon(x)| \ge \text{MinPts}$, where $N_\epsilon(x) = \{ y : d(x, y) \le \epsilon \}$

• Hierarchical Clustering Dendrogram Objective:
  Minimize the linkage criterion $L(A, B)$.

• Gaussian Mixture Model (GMM):
  $p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$

• Expectation-Maximization (E-step):
  $\gamma_{ik} = \dfrac{\pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$

• Expectation-Maximization (M-step):
  $\mu_k = \dfrac{\sum_{i=1}^{N} \gamma_{ik}\, x_i}{\sum_{i=1}^{N} \gamma_{ik}} \quad \text{and} \quad \Sigma_k = \dfrac{\sum_{i=1}^{N} \gamma_{ik}\, (x_i - \mu_k)(x_i - \mu_k)^T}{\sum_{i=1}^{N} \gamma_{ik}}$

• Elbow Method for Optimal k:
  Choose $k$ where $J(k)$ has the largest drop.
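A bare-bones k-means loop implementing the assignment and centroid-update rules above, ending with the objective J; the two-blob data set and the choice k = 2 are synthetic assumptions for illustration.

# Minimal k-means: assign to nearest centroid, then recompute centroids.
import numpy as np

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])
k = 2
centroids = X[rng.choice(len(X), k, replace=False)]   # random initial centroids

for _ in range(20):
    # assign each point to its nearest centroid (Euclidean distance)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # centroid update: mean of the points in each cluster
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

J = sum(np.sum((X[labels == j] - centroids[j])**2) for j in range(k))
print(centroids, J)   # J should be small for well-separated clusters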

8. Dimensionality Reduction
• Principal Component Analysis (PCA) Objective:
  Maximize $\|Xw\|^2$ subject to $\|w\| = 1$

• Covariance Matrix for PCA:
  $C = \dfrac{1}{m} X^T X$

• Eigen Decomposition for PCA:
  $Cw = \lambda w$

• t-SNE Objective:
  $C = \sum_{i \neq j} p_{ij} \log \dfrac{p_{ij}}{q_{ij}}$

• Singular Value Decomposition (SVD):
  $X = U \Sigma V^T$

• LDA Objective (Fisher's Criterion):
  $J(w) = \dfrac{w^T S_b w}{w^T S_w w}$

• Reconstruction Error for PCA:
  $\text{Error} = \|X - \hat{X}\|_F$

• Kernel PCA Transformation:
  $\phi(x) \rightarrow$ principal components in feature space

• Autoencoder Reconstruction:
  $X \approx g(f(X))$

• Explained Variance Ratio:
  $\text{Ratio} = \dfrac{\lambda_i}{\sum_j \lambda_j}$
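A PCA sketch built from the covariance eigen-decomposition, explained variance ratio, reconstruction error, and SVD above; the correlated 2-D data set and the choice to keep one component are illustrative assumptions.

# PCA via the covariance eigen-decomposition, plus an SVD of the same data.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])
X = X - X.mean(axis=0)                      # center the data before PCA

C = X.T @ X / len(X)                        # covariance matrix C = X^T X / m
eigvals, eigvecs = np.linalg.eigh(C)        # eigen decomposition: C w = lambda w
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained_ratio = eigvals / eigvals.sum()   # explained variance ratio
Z = X @ eigvecs[:, :1]                      # project onto first component
X_hat = Z @ eigvecs[:, :1].T                # reconstruct from that component
recon_error = np.linalg.norm(X - X_hat)     # Frobenius reconstruction error

U, S, Vt = np.linalg.svd(X, full_matrices=False)   # X = U Sigma V^T
print(explained_ratio, recon_error, S)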

9. Probability Distributions
• Bernoulli Distribution:
  $P(X = x) = p^x (1 - p)^{1 - x}, \qquad x \in \{0, 1\}$

• Binomial Distribution:
  $P(X = k) = \dbinom{n}{k} p^k (1 - p)^{n - k}, \qquad k \in \{0, 1, \ldots, n\}$

• Poisson Distribution:
  $P(X = k) = \dfrac{\lambda^k e^{-\lambda}}{k!}, \qquad k \ge 0$

• Uniform Distribution:
  $f(x) = \begin{cases} \dfrac{1}{b - a}, & a \le x \le b \\ 0, & \text{otherwise} \end{cases}$

• Normal Distribution:
  $f(x) = \dfrac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$

• Exponential Distribution:
  $f(x) = \begin{cases} \lambda e^{-\lambda x}, & x \ge 0 \\ 0, & x < 0 \end{cases}$

• Beta Distribution:
  $f(x; \alpha, \beta) = \dfrac{x^{\alpha - 1} (1 - x)^{\beta - 1}}{B(\alpha, \beta)}, \qquad x \in [0, 1]$

• Gamma Distribution:
  $f(x; \alpha, \beta) = \dfrac{\beta^\alpha x^{\alpha - 1} e^{-\beta x}}{\Gamma(\alpha)}, \qquad x \ge 0$

• Multinomial Distribution:
  $P(X_1 = x_1, \ldots, X_k = x_k) = \dfrac{n!}{x_1!\, x_2! \cdots x_k!}\, p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k}$

• Chi-Square Distribution:
  $f(x; k) = \dfrac{x^{k/2 - 1} e^{-x/2}}{2^{k/2}\, \Gamma(k/2)}, \qquad x \ge 0$
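The pmf/pdf formulas above translate directly into code; the sketch below implements a few of them and checks numerically that they sum or integrate to 1. The parameter values p, n, lambda, mu, and sigma are arbitrary.

# Direct implementations of a few pmf/pdf formulas, with normalization checks.
import numpy as np
from math import comb, factorial

p, n, lam, mu, sigma = 0.3, 10, 2.0, 0.0, 1.0

binom_pmf = lambda k: comb(n, k) * p**k * (1 - p)**(n - k)
poisson_pmf = lambda k: lam**k * np.exp(-lam) / factorial(k)
normal_pdf = lambda x: np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

print(sum(binom_pmf(k) for k in range(n + 1)))     # binomial pmf sums to 1
print(sum(poisson_pmf(k) for k in range(50)))      # Poisson pmf sums to ~1
grid = np.linspace(-10, 10, 100_001)
print(np.trapz(normal_pdf(grid), grid))            # normal density integrates to ~1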

10. Reinforcement Learning
• Bellman Equation for State-Value Function:
  $V(s) = E\big[ R_t + \gamma V(S_{t+1}) \mid S_t = s \big]$

• Bellman Equation for Action-Value Function:
  $Q(s, a) = E\big[ R_t + \gamma Q(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \big]$

• Policy Improvement:
  $\pi'(s) = \arg\max_a Q(s, a)$

• Temporal Difference Update Rule:
  $V(S_t) \leftarrow V(S_t) + \alpha \big[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \big]$

• Q-Learning Update Rule:
  $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \big]$

• SARSA Update Rule:
  $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \big]$

• Reward Function:
  $R(s, a) = E[R_t \mid S_t = s, A_t = a]$

• Value Iteration Update Rule:
  $V(s) \leftarrow \max_a \big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \big]$

• Actor-Critic Policy Update:
  $\theta \leftarrow \theta + \alpha\, \nabla_\theta \log \pi_\theta(a \mid s)\, \delta$

• Discounted Return:
  $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
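A tabular Q-learning sketch applying the update rule above to a tiny deterministic chain environment invented for this example; the reward placement, epsilon-greedy exploration, and hyperparameters are assumptions, not part of the formula list.

# Tabular Q-learning on a 5-state chain; moving right eventually earns reward.
import numpy as np

n_states, n_actions = 5, 2           # actions: 0 = left, 1 = right
alpha, gamma, eps = 0.1, 0.9, 0.1
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(4)

def step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0   # reward at the right end
    return s_next, reward

for _ in range(2000):                 # episodes
    s = 0
    for _ in range(20):               # steps per episode
        a = rng.integers(n_actions) if rng.random() < eps else Q[s].argmax()
        s_next, r = step(s, a)
        # Q-learning update: Q(s,a) += alpha [r + gamma max_a' Q(s',a') - Q(s,a)]
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.argmax(axis=1))   # greedy policy should prefer moving right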
