3 Math Basics
Special Matrices
Eigendecomposition
Numeric Calculation
In the preceding equation, 𝑤𝑖 is the total reward/penalty amount in month 𝑖, 𝑠𝑖 is the total mileage, 𝑑𝑖 is the monthly target, ℎ𝑖 is the difference between the actual distance and the monthly target, and 𝑥𝑖 is the reward/penalty amount per kilometer in month 𝑖. This activity received good feedback and was later adopted by the Cloud Department. The following tables list the differences between the actual distances and the monthly targets for some participants in the first quarter:
First table (h₁–h₃: monthly differences for months 1–3; w: total reward/penalty):

Name | h₁ | h₂ | h₃ | w
A    | 10 |  8 | 12 | 20
B    |  4 |  4 |  2 |  8
C    |  2 | −4 | −2 | −5

Second table:

Name | h₁ | h₂ | h₃ | w
A    |  2 |  4 |  5 | 10
B    |  4 |  2 |  2 |  6
C    | −2 |  2 |  2 |  3
In this way, the solutions of these equations give the answer to the question.
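As an illustration, here is a minimal NumPy sketch that solves the system for the first table, assuming each participant contributes one equation h₁x₁ + h₂x₂ + h₃x₃ = w with w as the quarterly total:

```python
import numpy as np

# Solve H x = w for the per-kilometer amounts x1, x2, x3, where each row
# of H holds one participant's monthly differences (h1, h2, h3) from the
# first table and w holds their totals.
H = np.array([[10.0,  8.0, 12.0],   # participant A
              [ 4.0,  4.0,  2.0],   # participant B
              [ 2.0, -4.0, -2.0]])  # participant C
w = np.array([20.0, 8.0, -5.0])

x = np.linalg.solve(H, w)
print(x)   # per-kilometer reward/penalty for months 1-3
```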
Significance:
The determinant is equal to the product of all the eigenvalues of the matrix.
The absolute value of the determinant can be thought of as a measure of how much
multiplication by the matrix expands or contracts space. If the determinant is 0, then
space is contracted completely along at least one dimension, causing it to lose all its
volume. If the determinant is 1, then the transformation preserves volume.
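A quick numerical illustration of both facts, using an arbitrary example matrix:

```python
import numpy as np

# Check that det(A) equals the product of A's eigenvalues, using an
# arbitrary example matrix.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
print(np.linalg.det(A))                 # 5.0 (how much A scales area)
print(np.prod(np.linalg.eigvals(A)))    # 5.0, up to floating-point error
```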
Note: Two matrices can be added only when they have the same number of rows and columns.
Scalar and matrix multiplication: Suppose 𝐴 = (𝑎𝑖𝑗)ₛₓₙ and 𝑘 ∈ 𝐾. The product of 𝑘 and matrix 𝐴 is 𝑘𝐴 = (𝑘𝑎𝑖𝑗)ₛₓₙ. The addition of a scalar and a matrix follows the same rule.
Note: In order for 𝐴𝐵 to be defined, 𝐴 must have the same number of columns as 𝐵 has rows.
Example:
$$\mathrm{Tr}(A) = \sum_i A_{i,i}.$$
• $\mathrm{Tr}(A) = \mathrm{Tr}(A^T)$
• $\mathrm{Tr}(a) = a$ for a scalar $a$
• $\mathrm{Tr}(ABC) = \mathrm{Tr}(CAB) = \mathrm{Tr}(BCA)$
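These identities are easy to check numerically; a minimal sketch with arbitrary random matrices, with shapes chosen so that the three cyclic products are all square:

```python
import numpy as np

# Numerical check of the trace identities with arbitrary random matrices.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
C = rng.standard_normal((5, 3))

print(np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B)))   # True
print(np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A)))   # True

M = rng.standard_normal((4, 4))
print(np.isclose(np.trace(M), np.trace(M.T)))                 # True
```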
According to the matrix multiplication rule, the equations (1.1) can be represented in matrix form as 𝐴𝑥 = 𝑏, where 𝐴 is the coefficient matrix, 𝑥 the vector of unknowns, and 𝑏 the right-hand side.
Matrix inverse: The matrix inverse of 𝐴 is denoted as 𝐴⁻¹, and it is defined as the matrix such that 𝐴⁻¹𝐴 = 𝐼ₙ.
A matrix that equals its own transpose (𝐴 = 𝐴ᵀ) is called a symmetric matrix.
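A minimal sketch of both definitions, using arbitrary example matrices:

```python
import numpy as np

# Verify the inverse definition A^{-1} A = I_n on an arbitrary
# invertible matrix, and the symmetric-matrix property A = A^T.
A = np.array([[4.0, 7.0],
              [2.0, 6.0]])
A_inv = np.linalg.inv(A)
print(np.allclose(A_inv @ A, np.eye(2)))   # True

S = np.array([[1.0, 2.0],
              [2.0, 5.0]])
print(np.array_equal(S, S.T))              # True: S is symmetric
```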
Suppose that 𝐴 is an 𝑛×𝑛 matrix over a number field 𝐾. If there is a non-zero column vector 𝛼 in 𝐾ⁿ such that
$$A\alpha = \lambda\alpha, \quad \lambda \in K,$$
then 𝜆 is called an eigenvalue of 𝐴 and 𝛼 an eigenvector of 𝐴 corresponding to 𝜆.
Example: consider the matrix $A = \begin{pmatrix} 3 & -1 \\ -1 & 3 \end{pmatrix}$, whose eigenvalues are $\lambda_1 = 2$ and $\lambda_2 = 4$.
Taking $\lambda_1 = 2$, the corresponding eigenvector satisfies
$$\begin{pmatrix} 3-2 & -1 \\ -1 & 3-2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix},$$
and we find that $x_1 = x_2$. Therefore, the corresponding eigenvector is $p_1 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$, and for $\lambda_1 = 2$ every eigenvector has the form $kp_1$ ($k \neq 0$).
Taking $\lambda_2 = 4$, we find that $x_1 = -x_2$. The eigenvector is $p_2 = \begin{pmatrix} -1 \\ 1 \end{pmatrix}$, and for $\lambda_2 = 4$ every eigenvector has the form $kp_2$ ($k \neq 0$).
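The worked example can be verified with NumPy's eigendecomposition routine (the order and scaling of the returned eigenpairs may differ from the hand calculation):

```python
import numpy as np

# Verify the worked example: [[3, -1], [-1, 3]] has eigenvalues 2 and 4,
# with eigenvectors along (1, 1) and (-1, 1).
A = np.array([[ 3.0, -1.0],
              [-1.0,  3.0]])
eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)    # 2 and 4 (order may vary)
print(eigvecs)    # unit-length columns proportional to p1 and p2
print(np.allclose(A @ eigvecs, eigvecs * eigvals))   # A p = lambda p
```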
$$A = P\,\mathrm{diag}(\lambda)\,P^{-1},$$
where $P = [\alpha_1, \alpha_2, \cdots, \alpha_n]$ is the matrix whose columns are the eigenvectors, and $\lambda = [\lambda_1, \lambda_2, \cdots, \lambda_n]$ is the vector of eigenvalues.
Matrix definiteness:
A matrix whose eigenvalues are all positive is called positive definite; a matrix whose eigenvalues are all positive or zero is positive semidefinite.
If all eigenvalues are negative, the matrix is negative definite; if they are all negative or zero, it is negative semidefinite.
$$A = UDV^T,$$
where 𝑈 = (𝑏𝑖𝑗)ₘₓₘ, 𝐷 = (𝑐𝑖𝑗)ₘₓₙ, and 𝑉ᵀ = (𝑑𝑖𝑗)ₙₓₙ. The matrices 𝑈 and 𝑉 are both orthogonal. The columns of 𝑈 are known as the left-singular vectors; the columns of 𝑉 are known as the right-singular vectors. The matrix 𝐷 is diagonal, though not necessarily square, and the elements on its diagonal are the singular values of the matrix.
Practical algorithms for calculating the pseudoinverse are based on the formula
$$A^+ = VD^+U^T,$$
where 𝑈, 𝐷, and 𝑉 come from the singular value decomposition of 𝐴, and the pseudoinverse 𝐷⁺ of the diagonal matrix 𝐷 is obtained by taking the reciprocal of its non-zero elements and then transposing the resulting matrix.
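A sketch of both formulas on an arbitrary non-square matrix, comparing the hand-built pseudoinverse against NumPy's np.linalg.pinv:

```python
import numpy as np

# SVD of an arbitrary non-square matrix, then the pseudoinverse built
# exactly as described: reciprocal of the non-zero singular values,
# then transpose. Checked against NumPy's pinv.
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
U, s, Vt = np.linalg.svd(A)            # s holds the singular values of A

D_plus = np.zeros(A.T.shape)           # D+ has the transposed shape (n x m)
D_plus[:len(s), :len(s)] = np.diag(1.0 / s)

A_plus = Vt.T @ D_plus @ U.T           # A+ = V D+ U^T
print(np.allclose(A_plus, np.linalg.pinv(A)))   # True
```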
Basic idea: Assume that there are 𝑛 objects, each described by 𝑝 factors {𝑥₁, . . . , 𝑥ₚ}. The following table lists the factor data corresponding to each object:

Object | 𝑥₁ | 𝑥₂ | … | 𝑥ⱼ | … | 𝑥ₚ
The coefficients $l_{ij}$ meet the following rules: $z_i$ is uncorrelated with $z_j$ ($i \neq j$; $i, j = 1, 2, \ldots, m$); $z_1$ has the largest variance among all linear combinations of $x_1, \ldots, x_p$; $z_2$ has the largest variance among all linear combinations of $x_1, \ldots, x_p$ that are uncorrelated with $z_1$; and in general $z_m$ has the largest variance among all linear combinations of $x_1, \ldots, x_p$ that are uncorrelated with $z_1, z_2, \ldots, z_{m-1}$.
According to these rules, the coefficient vectors $l_i$ are the eigenvectors corresponding to the $m$ largest eigenvalues of the correlation coefficient matrix of $x_1, \ldots, x_p$.
If the cumulative contribution rate of the first 𝑚 principal components reaches 85% to 90%, those components are used as the new variables.
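A minimal PCA sketch following these rules, with hypothetical random data standing in for the factor table and an assumed 85% threshold:

```python
import numpy as np

# PCA sketch: eigendecompose the correlation matrix of the factors,
# sort by eigenvalue, and keep enough components to reach the assumed
# 85% cumulative contribution rate. The data are hypothetical.
rng = np.random.default_rng(0)
base = rng.standard_normal((100, 1))
X = base + 0.4 * rng.standard_normal((100, 5))   # n=100 objects, p=5 factors

R = np.corrcoef(X, rowvar=False)                 # p x p correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)             # ascending for symmetric R
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

contribution = np.cumsum(eigvals) / eigvals.sum()
m = int(np.searchsorted(contribution, 0.85)) + 1  # components to keep

Xs = (X - X.mean(axis=0)) / X.std(axis=0)        # standardize the factors
Z = Xs @ eigvecs[:, :m]                          # principal components z_1..z_m
print(m, contribution[:m])
```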
A random test satisfies the following: there may be more than one result of each test, and all possible results of the test can be specified in advance.
Example:
𝐸₁: Toss two coins and check the outcome (heads or tails).
Sample space: the collection of all possible results of a random test, represented by 𝑆 = {𝑒₁, 𝑒₂, … , 𝑒ₙ}.
Random event: any subset of the sample space 𝑆. If a sample point of event 𝐴 occurs, event 𝐴 occurs. In particular, a random event containing only one sample point is called a basic event.
Example:
Sample points for tossing a die: 𝑒ᵢ = 𝑖, 𝑖 = 1, 2, 3, 4, 5, 6.
Frequency: the ratio of the number of times event 𝐴 occurs to the total number 𝑛 of trials, recorded as 𝑓ₙ(𝐴).
Probability: Suppose that 𝐸 is a random test and 𝑆 is its sample space. Assign a real number 𝑃(𝐴) (the probability of event 𝐴) to each event 𝐴 of 𝐸. The set function 𝑃(·) must meet the following conditions:
Non-negativity: for each event 𝐴, 0 ≤ 𝑃(𝐴) ≤ 1.
Normalization: 𝑃(𝑆) = 1.
Countable additivity: if 𝐴₁, 𝐴₂, ⋯ are mutually incompatible events, that is, 𝐴ᵢ𝐴ⱼ = ∅ for 𝑖 ≠ 𝑗 (𝑖, 𝑗 = 1, 2, ⋯), then 𝑃(𝐴₁ ∪ 𝐴₂ ∪ ⋯) = 𝑃(𝐴₁) + 𝑃(𝐴₂) + ⋯.
Example 1: Random test 𝐸₄: Toss two dice and check the sum of the results. The sample space of the test is 𝑆 = {𝑒 = (𝑖, 𝑗) | 𝑖, 𝑗 = 1, 2, 3, 4, 5, 6}, where 𝑖 indicates the first outcome and 𝑗 the second. 𝑋, the sum of the two outcomes, is a random variable:
$$X = X(e) = X(i, j) = i + j, \quad i, j = 1, 2, \cdots, 6.$$
Example 2: Random test 𝐸₁: Toss two coins and check the outcome (heads H or tails T). The sample space for the test is 𝑆 = {HH, HT, TH, TT}. 𝑌, the total number of tails T, is a random variable.
Distribution law: If all the possible values of a discrete random variable 𝑋 are 𝑥ₖ (𝑘 = 1, 2, ⋯), the probability that 𝑋 takes the value 𝑥ₖ, i.e. {𝑋 = 𝑥ₖ}, is
$$P\{X = x_k\} = p_k, \quad k = 1, 2, \cdots,$$
where the 𝑝ₖ satisfy:
(1) $p_k \geq 0,\ k = 1, 2, \cdots$;
(2) $\sum_{k=1}^{\infty} p_k = 1$.
The distribution law can also be expressed as a table:

𝑿  | 𝑥₁ | 𝑥₂ | ⋯ | 𝑥ₙ | ⋯
𝒑ₖ | 𝑝₁ | 𝑝₂ | ⋯ | 𝑝ₙ | ⋯
For example, the 0–1 distribution:

𝑿  | 0   | 1
𝒑ₖ | 1−𝑝 | 𝑝

𝐸(𝑋) = 𝑝, 𝑉𝑎𝑟(𝑋) = 𝑝(1−𝑝).
The experiments that meet the following conditions are called 𝑛 Bernoulli experiments:
Each experiment is repeated under the same conditions.
There are only two possible results per experiment, 𝐴 and its complement 𝐴̄, with 𝑃(𝐴) = 𝑝.
Let 𝑋 be the number of times 𝐴 occurs in the 𝑛 experiments. Then 𝑋 obeys the binomial distribution with parameters 𝑛 and 𝑝. This is expressed as 𝑋 ~ 𝐵(𝑛, 𝑝), where 𝐸(𝑋) = 𝑛𝑝 and 𝑉𝑎𝑟(𝑋) = 𝑛𝑝(1−𝑝).
When 𝑛 is large and 𝑝 is small, with 𝜆 = 𝑛𝑝 held constant, the binomial probabilities approach a limit:
$$\lim_{n \to \infty} C_n^k\, p^k (1-p)^{n-k} = \frac{\lambda^k e^{-\lambda}}{k!}.$$
Poisson distribution: If all possible values of random variables are 0, 1, 2, ..., the
probability of taking each value is:
$$P\{X = k\} = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \cdots
Then 𝑋 obeys the Poisson distribution with parameter 𝜆. It is expressed as 𝑋 ~ 𝑃(𝜆), where 𝐸(𝑋) = 𝜆 and 𝐷(𝑋) = 𝜆.
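A numerical check of the Poisson limit above, with assumed illustrative values n = 1000 and p = 0.003 (so 𝜆 = np = 3):

```python
from math import comb, exp, factorial

# With n large and p small, the binomial and Poisson probabilities
# nearly coincide for each k.
n, p = 1000, 0.003
lam = n * p
for k in range(5):
    binom = comb(n, k) * p**k * (1 - p)**(n - k)
    poisson = lam**k * exp(-lam) / factorial(k)
    print(k, round(binom, 6), round(poisson, 6))
```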
Normal distribution: If the probability density of the random variable 𝑋 is
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty,$$
where 𝜇 and 𝜎 (𝜎 > 0) are constants, then 𝑋 obeys the normal distribution, or Gaussian distribution, with parameters 𝜇 and 𝜎, expressed as 𝑋 ~ 𝑁(𝜇, 𝜎²). In particular, when 𝜇 = 0 and 𝜎 = 1, the random variable 𝑋 obeys the standard normal distribution, expressed as 𝑋 ~ 𝑁(0, 1).
Exponential distribution: If the probability density of 𝑋 is
$$f(x) = \lambda e^{-\lambda x}\ (x > 0), \qquad f(x) = 0\ \text{otherwise},$$
where 𝜆 > 0 is a constant (1/𝜆 is the average time until the random event occurs once), then 𝑋 obeys the exponential distribution with parameter 𝜆. This distribution is expressed as 𝑋 ~ 𝐸(𝜆), with 𝐸(𝑋) = 1/𝜆 and 𝑉𝑎𝑟(𝑋) = 1/𝜆².
Laplace distribution:
$$\mathrm{Laplace}(x; \mu, b) = \frac{1}{2b}\, e^{-\frac{|x-\mu|}{b}}
[Figure: map of common distributions. With 𝑛 experiments one obtains the binomial distribution; as 𝑛 → ∞ (roughly 𝑛𝑝 > 20 and 𝑝 < 0.05) the binomial distribution approaches the Poisson distribution. Also shown: Multinoulli distribution, Dirac distribution, Laplace distribution, mixture distribution.]
$$F(x, y) = P(\{X \leq x\} \cap \{Y \leq y\}) = P\{X \leq x, Y \leq y\}
It is called a distribution function for a two-dimensional random variable (𝑋,𝑌), or a joint distribution function
for random variables 𝑋 and 𝑌.
𝒀 \ 𝑿 | 𝒙₁  | 𝒙₂  | ⋯ | 𝒙ᵢ  | ⋯
𝒚₁    | 𝑝₁₁ | 𝑝₁₂ | ⋯ | 𝑝₁ᵢ | ⋯
𝒚₂    | 𝑝₂₁ | 𝑝₂₂ | ⋯ | 𝑝₂ᵢ | ⋯
⋮     | ⋮   | ⋮   |   | ⋮   |
𝒚ⱼ    | 𝑝ⱼ₁ | 𝑝ⱼ₂ | ⋯ | 𝑝ⱼᵢ | ⋯
⋮     | ⋮   | ⋮   |   | ⋮   |
$$F(x, y) = \int_{-\infty}^{y} \int_{-\infty}^{x} f(u, v)\, \mathrm{d}u\, \mathrm{d}v
$$P(Y|X) = \frac{P(XY)}{P(X)}, \qquad P(X|Y) = \frac{P(XY)}{P(Y)} = \frac{P(Y|X)\,P(X)}{P(Y)}
Assuming that the sample space is partitioned into a complete group of mutually exclusive events {𝑋₁, 𝑋₂, ..., 𝑋ₙ}, 𝑃(𝑌) can be expanded with the total probability formula: 𝑃(𝑌) = 𝑃(𝑌|𝑋₁)𝑃(𝑋₁) + 𝑃(𝑌|𝑋₂)𝑃(𝑋₂) + ... + 𝑃(𝑌|𝑋ₙ)𝑃(𝑋ₙ). Then, the Bayes formula can be expressed as:
$$P(X_i \mid Y) = \frac{P(Y \mid X_i)\,P(X_i)}{\sum_{i=1}^{n} P(Y \mid X_i)\,P(X_i)}
If 𝑃(𝑋=1) = 0.001, 𝑃(𝑋=1, 𝑌=1) = 0.09.
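A sketch of the total probability expansion and the Bayes formula, with assumed illustrative priors and likelihoods (not taken from the text):

```python
# Total probability and Bayes formula on a hypothetical partition
# X_1, X_2, X_3 with priors P(X_i) and likelihoods P(Y | X_i).
priors = [0.5, 0.3, 0.2]          # P(X_i); must sum to 1
likelihoods = [0.1, 0.4, 0.7]     # P(Y | X_i)

p_y = sum(l * q for l, q in zip(likelihoods, priors))           # P(Y)
posteriors = [l * q / p_y for l, q in zip(likelihoods, priors)]
print(p_y, posteriors)            # the posteriors P(X_i | Y) sum to 1
```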
Variance: a measure of the degree of dispersion, used in probability theory and statistics for a random variable or a set of data. In probability theory, variance measures the deviation between a random variable and its mathematical expectation:
$$D(X) = \mathrm{Var}(X) = E\{[X - E(X)]^2\}
$$X^* = \frac{X - E(X)}{\sigma(X)}$$
is called the standardized variable of 𝑋.
Correlation coefficient:
$$\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{D(X)}\,\sqrt{D(Y)}}.
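A sketch computing the correlation coefficient from assumed sample data, checked against NumPy's corrcoef:

```python
import numpy as np

# Correlation coefficient from the definition above, on hypothetical data.
rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
y = 0.8 * x + 0.2 * rng.standard_normal(1000)

cov = np.cov(x, y)[0, 1]                              # Cov(X, Y)
rho = cov / (np.std(x, ddof=1) * np.std(y, ddof=1))   # rho_XY
print(rho, np.corrcoef(x, y)[0, 1])                   # the two values agree
```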
Example: If you toss a coin, the information quantity of the coin showing heads or tails is 𝐼(heads) = 𝐼(tails) = 1 bit.
where $p_i$ is the non-zero probability that an arbitrary tuple in 𝐷 belongs to class 𝐶ᵢ, $p_i = \frac{|C_{i,D}|}{|D|}$.
$$\mathrm{Info}(D) = -\left(\frac{1}{2}\log_2\frac{1}{2} + \frac{1}{2}\log_2\frac{1}{2}\right) = 1\ \text{bit}.
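The same two-class entropy calculation, reproduced numerically:

```python
import numpy as np

# Entropy of a two-class split with p = (1/2, 1/2): exactly 1 bit.
p = np.array([0.5, 0.5])
info = -np.sum(p * np.log2(p))
print(info)   # 1.0
```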
Numerical Calculation
Basic Concepts
[Flowchart: Problem analysis → Mathematical model → Numerical calculation method → Program design → Computer calculation result]
Overflow: Overflow occurs when a large number is approximated to ∞ or −∞. Further operations
usually cause these infinite values to become non-numeric.
The large number "swallows" the small number: when 𝑎 ≫ 𝑏, floating-point arithmetic gives 𝑎 + 𝑏 = 𝑎, a numerical abnormality.
$$\mathrm{softmax}(x)_i = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)}.
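A common way to guard softmax against the overflow and swallowing problems described above is to shift the inputs by their maximum before exponentiating; a minimal sketch (the shift leaves the softmax value mathematically unchanged):

```python
import numpy as np

# Numerically stable softmax: subtracting max(x) keeps exp from
# overflowing while giving the same result.
def softmax(x):
    z = x - np.max(x)        # the largest exponent becomes 0
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([1000.0, 1001.0, 1002.0])))   # no overflow
```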
Consider the function 𝑓(𝑥) = 𝐴⁻¹𝑥. When 𝐴 ∈ ℝⁿˣⁿ has an eigendecomposition, its condition number is
$$\max_{i,j} \left|\frac{\lambda_i}{\lambda_j}\right|,$$
the ratio of the moduli of the largest and smallest eigenvalues. When this ratio is large, matrix inversion is particularly sensitive to error in the input.
Optimization problems take the general form
$$\min\,(\max)\; f(x).$$
Unconstrained optimization:
$$\min f(x)$$
The common method is Fermat's theorem: solving 𝑓′(𝑥) = 0 yields the critical points; then verify whether an extreme value is attained at each critical point.
Equality-constrained optimization:
$$\min f(x) \quad \text{s.t.}\ h_i(x) = 0,\ i = 1, 2, \cdots, n.$$
The common method is the Lagrange multiplier method: introduce 𝑛 Lagrange multipliers 𝜆ᵢ to construct the Lagrange function
$$L(x, \lambda) = f(x) + \sum_{i=1}^{n} \lambda_i h_i(x),$$
then set the partial derivative with respect to each variable to zero. This yields the collection of candidate points, and the optimal value is obtained through verification, as in the sketch below.
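A minimal worked sketch of the method on an assumed toy problem, minimize f(x, y) = x² + y² subject to x + y = 1, where setting the partial derivatives of L to zero gives a linear system:

```python
import numpy as np

# Lagrange multipliers on a toy problem: minimize x^2 + y^2 subject to
# x + y - 1 = 0. The stationarity conditions of L = f + lambda*h are
#   2x + lambda = 0,  2y + lambda = 0,  x + y = 1.
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
b = np.array([0.0, 0.0, 1.0])
x, y, lam = np.linalg.solve(A, b)
print(x, y, lam)   # 0.5 0.5 -1.0 -> candidate (and minimum) at (0.5, 0.5)
```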
Optimization with both equality and inequality constraints:
$$\min f(x) \quad \text{s.t.}\ h_i(x) = 0,\ i = 1, 2, \cdots, n; \qquad g_j(x) \leq 0,\ j = 1, 2, \cdots, m.$$
A common method is to introduce new variables 𝜆ᵢ and 𝛼ⱼ and construct the generalized Lagrangian function from 𝑓(𝑥) and all the equality and inequality constraints:
$$L(x, \lambda, \alpha) = f(x) + \sum_i \lambda_i h_i(x) + \sum_j \alpha_j g_j(x).$$
We can use a set of simple properties to describe the optimal points of constrained optimization problems, called the KKT (Karush–Kuhn–Tucker) conditions:
The gradient of the generalized Lagrangian is 0.
All constraints on 𝑥 and on the KKT multipliers are met.
The inequality constraints exhibit "complementary slackness": 𝛼 ⊙ 𝑔(𝑥) = 0.
If for any 𝑥₁, 𝑥₂ in the domain and any 𝜆 ∈ [0, 1], 𝑓(𝜆𝑥₁ + (1−𝜆)𝑥₂) ≤ 𝜆𝑓(𝑥₁) + (1−𝜆)𝑓(𝑥₂), then 𝑓(𝑥) is called a convex function. The extreme points of a convex function occur at its stationary points.
Gradient: the derivative with respect to the vector 𝑥, expressed as ∇ₓ𝑓(𝑥). The derivative of 𝑓(𝑥) in the direction of a unit vector 𝑢 is 𝑢ᵀ∇ₓ𝑓(𝑥).
For a task to minimize 𝑓(𝑥), we want to find the direction with the fastest downward change. Since 𝑢ᵀ∇ₓ𝑓(𝑥) = ‖∇ₓ𝑓(𝑥)‖₂ cos 𝜃, where 𝜃 is the angle between 𝑢 and the gradient ∇ₓ𝑓(𝑥), this directional derivative is minimized when cos 𝜃 = −1.
You can see that the direction in which the value of 𝑓(𝑥) decreases fastest is the negative direction of the gradient.
Under the gradient descent method, the update point is proposed as:
𝑥 ′ = 𝑥 − 𝜀𝛻𝑥 𝑓(𝑥)
where 𝜀 is the learning rate, a positive scalar that determines the step size.
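A minimal gradient descent sketch on an assumed quadratic objective f(x) = ½xᵀAx − bᵀx, whose gradient is Ax − b:

```python
import numpy as np

# Gradient descent on a quadratic: the minimum solves A x = b, so the
# iterates should converge to np.linalg.solve(A, b).
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

x = np.zeros(2)
eps = 0.1                          # the learning rate
for _ in range(100):
    x = x - eps * (A @ x - b)      # x' = x - eps * grad f(x)

print(x, np.linalg.solve(A, b))    # both approximately [0.2, 0.4]
```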