
Math Basics



 Linear Algebra
 Concept and Calculation of Matrices
 Special Matrices
 Eigendecomposition
 Probability Theory and Information Theory
 Numerical Calculation



 To avoid obesity and improve employees' health, the Big Data Department organized a monthly running activity at the beginning of 2018. The rules were as follows: the department set a monthly target for each participant at the beginning of the month. Participants who fulfilled their targets were rewarded, while those who failed were penalized. The reward or penalty amount was calculated as follows:

$$w_i = (s_i - d_i)x_i = h_i x_i$$

 In the preceding equation, $w_i$ is the total reward/penalty amount in month $i$, $s_i$ is the total mileage, $d_i$ is the monthly target, $h_i$ is the difference between the actual distance and the monthly target, and $x_i$ is the reward/penalty amount per kilometer in that month. This activity received good feedback and was later adopted by the Cloud Department. The following tables list the difference between the actual distance and the monthly target for some participants in the first quarter:

Table 1 Big Data Department          Table 2 Cloud Department

Name | h1 | h2 | h3 | w              Name | h1 | h2 | h3 | w
A    | 10 |  8 | 12 | 20             A    |  2 |  4 |  5 | 10
B    |  4 |  4 |  2 |  8             B    |  4 |  2 |  2 |  6
C    |  2 | -4 | -2 | -5             C    | -2 |  2 |  2 |  3



 In the preceding case, what is the reward/penalty amount set by the Big Data Department for each kilometer in each month? Using the given data, the equations are as follows:

$$10x_1 + 8x_2 + 12x_3 = 20$$
$$4x_1 + 4x_2 + 2x_3 = 8 \qquad (1.1)$$
$$2x_1 - 4x_2 - 2x_3 = -5$$

In this way, the solution of the equations is the answer to the question.
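As an illustration, the system (1.1) can also be solved numerically; a minimal sketch with NumPy (assuming NumPy is available):

```python
import numpy as np

# Coefficient matrix and right-hand side of system (1.1)
A = np.array([[10.0, 8.0, 12.0],
              [4.0, 4.0, 2.0],
              [2.0, -4.0, -2.0]])
b = np.array([20.0, 8.0, -5.0])

# Solve Ax = b for the per-kilometer reward/penalty amounts
x = np.linalg.solve(A, b)
print(x)  # expected: [0.5    1.3125 0.375 ]
```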



 Vector: $x = (x_1, x_2, \ldots, x_n)^T$

 Matrix:

$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}$$

$$A = A_{m \times n} = (a_{ij})_{m \times n} = (a_{ij})$$



 det(A): for a square matrix $A = (a_{ij})_{n \times n}$,

$$\det(A) = \begin{vmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{vmatrix}$$

 Significance:

 The determinant is equal to the product of all the eigenvalues of the matrix.

 The absolute value of the determinant can be thought of as a measure of how much multiplication by the matrix expands or contracts space. If the determinant is 0, then space is contracted completely along at least one dimension, causing it to lose all its volume. If the determinant is 1, then the transformation preserves volume.



 Matrix addition: Suppose that $A = (a_{ij})_{s \times n}$ and $B = (b_{ij})_{s \times n}$ are $s \times n$ matrices. The sum of the two matrices is $C = A + B = (a_{ij} + b_{ij})_{s \times n}$.

Note: Two matrices can be added only when they have the same number of rows and the same number of columns.

 Scalar and matrix multiplication: Suppose $A = (a_{ij})_{s \times n}$ and $k \in K$. The product of $k$ and matrix $A$ is $kA = (ka_{ij})_{s \times n}$. Adding a scalar to a matrix follows the same rule: the scalar is applied to each entry.

 Matrix multiplication: Suppose $A = (a_{ij})_{s \times n}$ and $B = (b_{ij})_{n \times p}$. Then $C = AB = (c_{ij})_{s \times p}$, where

$$C_{ij} = \sum_{k} A_{ik} B_{kj}.$$

Note: In order for $AB$ to be defined, $A$ must have the same number of columns as $B$ has rows.
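A quick sketch of these operations with NumPy (an assumed dependency; any linear-algebra library would do):

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print(A + B)   # elementwise matrix addition
print(2 * A)   # scalar multiplication scales every entry
print(A @ B)   # matrix product: (A @ B)[i, j] = sum_k A[i, k] * B[k, j]
```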



 Transposed matrix: The transpose of a matrix is an operator that flips a matrix over its diagonal; that is, it switches the row and column indices of the matrix, producing another matrix denoted as $A^T$ (also written as $A'$).

Example: $\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}^T = \begin{pmatrix} 1 & 3 \\ 2 & 4 \end{pmatrix}$

 Properties of the transpose:

 $(A^T)^T = A$
 $(\lambda A)^T = \lambda A^T$
 $(A + B)^T = A^T + B^T$
 $(AB)^T = B^T A^T$
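The product rule for transposes can be spot-checked numerically; a minimal sketch:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [2, 3]])

# (AB)^T should equal B^T A^T
print(np.array_equal((A @ B).T, B.T @ A.T))  # True
```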



 Trace operator:

$$\mathrm{Tr}(A) = \sum_{i} A_{ii}.$$

 Properties of the trace operator:

• $\mathrm{Tr}(A) = \mathrm{Tr}(A^T)$
• $\mathrm{Tr}(a) = a$ for a scalar $a$
• $\mathrm{Tr}(ABC) = \mathrm{Tr}(CAB) = \mathrm{Tr}(BCA)$
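A numerical spot-check of the cyclic property, as a sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, C = rng.random((3, 3)), rng.random((3, 3)), rng.random((3, 3))

# The trace is invariant under cyclic permutation of a product
print(np.allclose(np.trace(A @ B @ C), np.trace(C @ A @ B)))  # True
print(np.allclose(np.trace(A @ B @ C), np.trace(B @ C @ A)))  # True
```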



In the preceding case, the calculation is as follows:

 The combined first-quarter result of the Big Data Department and the Cloud Department can be calculated by adding the two tables as matrices:

$$\begin{pmatrix} 10 & 8 & 12 & 20 \\ 4 & 4 & 2 & 8 \\ 2 & -4 & -2 & -5 \end{pmatrix} + \begin{pmatrix} 2 & 4 & 5 & 10 \\ 4 & 2 & 2 & 6 \\ -2 & 2 & 2 & 3 \end{pmatrix} = \begin{pmatrix} 12 & 12 & 17 & 30 \\ 8 & 6 & 4 & 14 \\ 0 & -2 & 0 & -2 \end{pmatrix}$$

 According to the matrix multiplication rule, the equations (1.1) can be represented in matrix form as follows:

$$\begin{pmatrix} 10 & 8 & 12 \\ 4 & 4 & 2 \\ 2 & -4 & -2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} 20 \\ 8 \\ -5 \end{pmatrix}$$



 Linear Algebra
 Concept and Calculation of Matrices
 Special Matrices
 Eigendecomposition
 Probability Theory and Information Theory
 Numerical Calculation



 Identity matrix: All the entries along the main diagonal are 1, while all the other entries are 0. An identity matrix does not change any vector when we multiply that vector by it.

 Matrix inverse: The matrix inverse of $A$ is denoted as $A^{-1}$, and it is defined as the matrix such that $A^{-1}A = I_n$.



 Diagonal matrix: consists mostly of zeros and has non-zero entries only along the main diagonal. It is often written as $\mathrm{diag}(\lambda_1, \lambda_2, \cdots, \lambda_n)$.

 Properties of a diagonal matrix:

 The sum, difference, product, and powers of diagonal matrices are the diagonal matrices formed by the sum, difference, product, and powers of the corresponding elements along the main diagonal.

 The inverse (when all $\lambda_i \neq 0$) is $\mathrm{diag}(\lambda_1, \lambda_2, \cdots, \lambda_n)^{-1} = \mathrm{diag}(1/\lambda_1, 1/\lambda_2, \cdots, 1/\lambda_n)$.



 Symmetric matrix: If $A^T = A$ (that is, $a_{ij} = a_{ji}$) for square matrix $A = (a_{ij})_{n \times n}$, then $A$ is a symmetric matrix.

 Orthogonal matrix: If $AA^T = A^T A = I_n$ for square matrix $A = (a_{ij})_{n \times n}$, then $A$ is an orthogonal matrix. That is, $A^{-1} = A^T$.
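For instance, a rotation matrix is orthogonal; a minimal numerical check:

```python
import numpy as np

theta = np.pi / 6
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # 2-D rotation matrix

# For an orthogonal matrix, Q^T Q = I, so its inverse is its transpose
print(np.allclose(Q.T @ Q, np.eye(2)))      # True
print(np.allclose(np.linalg.inv(Q), Q.T))   # True
```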



 Linear Algebra
 Concept and Calculation of Matrices
 Special Matrices
 Eigendecomposition
 Probability Theory and Information Theory
 Numerical Calculation



 One of the most widely used kinds of matrix decomposition is called eigendecomposition, in which we decompose a matrix into a set of eigenvectors and eigenvalues. We can decompose matrices in ways that show us information about their functional properties that is not obvious from the representation of the matrix as an array of elements.

 Suppose that $A$ is an $n \times n$ matrix over a field $K$. If there is a non-zero column vector $\alpha$ in $K^n$ that satisfies

$$A\alpha = \lambda\alpha, \quad \lambda \in K,$$

then $\lambda$ is called an eigenvalue of $A$, and $\alpha$ is an eigenvector of $A$ corresponding to the eigenvalue $\lambda$.

Example: for $A = \begin{pmatrix} 2 & 1 \\ 0 & 1 \end{pmatrix}$ and $\alpha = (1, 0)^T$, we have $A\alpha = (2, 0)^T = 2\alpha$. Therefore, 2 is an eigenvalue of $A$, and $\alpha$ is an eigenvector of $A$ corresponding to eigenvalue 2.



 Obtaining the eigenvalues and eigenvectors of matrix $A$: solve $|A - \lambda I| = 0$, the characteristic equation of matrix $A$; each solution $\lambda$ (characteristic root) of the characteristic equation is an eigenvalue. To obtain the corresponding eigenvector $\alpha$, substitute the characteristic root $\lambda$ into $A\alpha = \lambda\alpha$ and solve for $\alpha$.



Example: Find the eigenvalues and eigenvectors of the matrix $A = \begin{pmatrix} 3 & -1 \\ -1 & 3 \end{pmatrix}$.

Solution: The characteristic polynomial of $A$ is

$$\begin{vmatrix} 3-\lambda & -1 \\ -1 & 3-\lambda \end{vmatrix} = (3-\lambda)^2 - 1 = (4-\lambda)(2-\lambda).$$

Therefore, the eigenvalues of $A$ are $\lambda_1 = 2$ and $\lambda_2 = 4$.

Taking $\lambda_1 = 2$, the corresponding eigenvector satisfies $\begin{pmatrix} 3-2 & -1 \\ -1 & 3-2 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$, which gives $x_1 = x_2$. Therefore, the corresponding eigenvector is $p_1 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$, and the eigenvectors for $\lambda_1 = 2$ are $kp_1$ $(k \neq 0)$.

Taking $\lambda_2 = 4$, we find that $x_1 = -x_2$. The eigenvector is $p_2 = \begin{pmatrix} -1 \\ 1 \end{pmatrix}$, and the eigenvectors for $\lambda_2 = 4$ are $kp_2$ $(k \neq 0)$.
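This result can be reproduced numerically; a minimal sketch with NumPy:

```python
import numpy as np

A = np.array([[3.0, -1.0],
              [-1.0, 3.0]])

# Columns of `vecs` are unit-norm eigenvectors of A
vals, vecs = np.linalg.eig(A)
print(vals)  # expected (in some order): [4. 2.]
print(vecs)  # columns proportional to (-1, 1)^T and (1, 1)^T
```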



 Suppose that a matrix $A$ has $n$ linearly independent eigenvectors $\{\alpha_1, \ldots, \alpha_n\}$ with corresponding eigenvalues $\{\lambda_1, \ldots, \lambda_n\}$. The eigendecomposition of $A$ is then given by

$$A = P\,\mathrm{diag}(\lambda)\,P^{-1},$$

where $P$ is the matrix whose columns are the eigenvectors $\alpha_1, \alpha_2, \cdots, \alpha_n$, and $\lambda = (\lambda_1, \lambda_2, \cdots, \lambda_n)$.

 Definiteness of a matrix:

 A matrix whose eigenvalues are all positive is called positive definite.
 A matrix whose eigenvalues are all positive or zero valued is called positive semidefinite.
 If all eigenvalues are negative, the matrix is negative definite.
 If all eigenvalues are negative or zero valued, it is negative semidefinite.



Singular Value Decomposition (SVD): The matrix is decomposed into singular vectors and singular values. The matrix $A = (a_{ij})_{m \times n}$ can be decomposed into a product of three matrices:

$$A = UDV^T,$$

where $U = (b_{ij})_{m \times m}$, $D = (c_{ij})_{m \times n}$, and $V = (d_{ij})_{n \times n}$. The matrices $U$ and $V$ are both defined to be orthogonal matrices. The columns of $U$ are known as the left-singular vectors. The columns of $V$ are known as the right-singular vectors. The matrix $D$ is defined to be a diagonal matrix; note that $D$ is not necessarily square. The elements along the diagonal of $D$ are referred to as the singular values of the matrix.
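A minimal SVD sketch with NumPy (np.linalg.svd returns the singular values as a vector rather than the full matrix D):

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])  # a 2 x 3 matrix, so D is not square

U, s, Vt = np.linalg.svd(A)       # s holds the singular values
D = np.zeros_like(A)
D[:len(s), :len(s)] = np.diag(s)  # embed them in a 2 x 3 diagonal matrix

print(np.allclose(A, U @ D @ Vt))  # True: A = U D V^T
```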



 The Moore-Penrose pseudoinverse enables us to make some headway in finding the solution of $Ax = y$ when $A = (a_{ij})_{m \times n}$, $m \neq n$. The pseudoinverse of $A$ is defined as the matrix

$$A^+ = \lim_{\alpha \to 0^+}(A^T A + \alpha I)^{-1} A^T.$$

Practical algorithms for calculating the pseudoinverse are based on the formula

$$A^+ = VD^+U^T,$$

where $U$, $D$, and $V$ come from the singular value decomposition of $A$, and the pseudoinverse $D^+$ of the diagonal matrix $D$ is obtained by taking the reciprocal of its non-zero elements and then taking the transpose of the resulting matrix.
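A minimal sketch using NumPy's built-in pseudoinverse for a least-squares style solution of Ax = y:

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])   # 3 x 2: more equations than unknowns
y = np.array([1.0, 2.0, 2.9])

x = np.linalg.pinv(A) @ y    # pseudoinverse gives the least-squares solution
print(x)
print(np.allclose(x, np.linalg.lstsq(A, y, rcond=None)[0]))  # True
```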



 Principal Component Analysis (PCA): a statistical method. Through an orthogonal transform, a set of possibly correlated variables is converted into a set of linearly uncorrelated variables; the converted variables are called principal components.

 Basic idea: Assume that there are $n$ objects, and each object is described by $p$ factors $\{x_1, \ldots, x_p\}$. The following table lists the factor data corresponding to each object.

Object | x1  | x2  | ... | xj  | ... | xp
1      | x11 | x12 | ... | x1j | ... | x1p
2      | x21 | x22 | ... | x2j | ... | x2p
...    | ... | ... | ... | ... | ... | ...
i      | xi1 | xi2 | ... | xij | ... | xip
...    | ... | ... | ... | ... | ... | ...
n      | xn1 | xn2 | ... | xnj | ... | xnp



 The original variables are $x_1, \ldots, x_p$. After dimension reduction, the new variables are $z_1, \ldots, z_m$ $(m \leq p)$; $z_1, \ldots, z_m$ are called the first, the second, ..., the $m$th principal components of $x_1, \ldots, x_p$. Each principal component is a linear combination of the original variables:

$$z_i = l_{i1}x_1 + l_{i2}x_2 + \cdots + l_{ip}x_p, \quad i = 1, 2, \ldots, m.$$

 To obtain the $m$ principal components, the coefficients are chosen as follows (see the sketch after this list):

 The coefficients $l_{ij}$ meet the following rules: $z_i$ is uncorrelated with $z_j$ $(i \neq j;\ i, j = 1, 2, \ldots, m)$. $z_1$ has the largest variance among all linear combinations of $x_1, \ldots, x_p$; $z_2$ has the largest variance among all linear combinations of $x_1, \ldots, x_p$ that are uncorrelated with $z_1$; and $z_m$ has the largest variance among all linear combinations of $x_1, \ldots, x_p$ that are uncorrelated with $z_1, z_2, \ldots, z_{m-1}$.

 According to the above rules, the coefficient vectors $l_i$ are the eigenvectors corresponding to the $m$ largest eigenvalues of the correlation coefficient matrix of $x_1, \ldots, x_p$.

 If the cumulative contribution rate of the first $i$ principal components reaches 85% to 90%, those components are used as the new variables.
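A minimal PCA sketch via eigendecomposition of the correlation matrix, assuming a small synthetic data matrix X with n rows (objects) and p columns (factors):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 50 objects described by p = 3 correlated factors
X = rng.random((50, 3)) @ np.array([[1.0, 0.5, 0.2],
                                    [0.0, 1.0, 0.3],
                                    [0.0, 0.0, 1.0]])

R = np.corrcoef(X, rowvar=False)           # p x p correlation matrix
vals, vecs = np.linalg.eigh(R)             # eigenvalues in ascending order
vals, vecs = vals[::-1], vecs[:, ::-1]     # sort in descending order

contrib = vals / vals.sum()                # contribution rate of each component
print(np.cumsum(contrib))                  # cumulative contribution rate

m = 2                                      # keep the first m principal components
Xs = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize the factors
Z = Xs @ vecs[:, :m]                       # principal component scores z_1, ..., z_m
```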



 Correlation coefficient matrix and correlation coefficient: $R = (r_{ij})_{p \times p}$, where $r_{ij} = \dfrac{\mathrm{Cov}(x_i, x_j)}{\sqrt{\mathrm{Var}(x_i)}\sqrt{\mathrm{Var}(x_j)}}$.

 Contribution rate of the principal components and cumulative contribution rate: the contribution rate of the $i$th principal component is $\lambda_i \big/ \sum_{k=1}^{p} \lambda_k$, and the cumulative contribution rate of the first $i$ components is $\sum_{k=1}^{i} \lambda_k \big/ \sum_{k=1}^{p} \lambda_k$.



 Linear Algebra
 Probability Theory and Information Theory
 Basic Concepts of Probability Theory
 Random Variables and Their Distribution Functions
 Numerical Characteristics of Random Variables
 Information Theory
 Numerical Calculation



 While probability theory allows us to make uncertain statements and to reason in the presence of uncertainty, information theory enables us to quantify the amount of uncertainty in a probability distribution.

 There are three possible sources of uncertainty:

 Inherent stochasticity in the system being modeled
 Incomplete observability
 Incomplete modeling



 An experiment that meets the following three characteristics is called a random experiment:

 It can be repeated under the same conditions.
 There may be more than one result of each experiment, and all possible results of the experiment can be specified in advance.
 Before an experiment, we cannot determine which result will appear.

 Example:

 $E_1$: Toss two coins and check the outcome (heads or tails).
 $E_2$: Roll a die and check the number of points that appears.



 Sample point: each possible result of a random experiment, represented by $e$.

 Sample space: the collection of all possible results of a random experiment, represented by $S = \{e_1, e_2, \ldots, e_n\}$.

 Random event: any subset of the sample space $S$. If a sample point of event $A$ occurs, event $A$ occurs. In particular, a random event containing only one sample point is called a basic event.

 Example:

Random experiment: Roll a die and check the outcome.

 Sample space: $S = \{1, 2, 3, 4, 5, 6\}$
 Sample points: $e_i = 1, 2, 3, 4, 5, 6$
 Random event $A_1$: "The outcome is 5", that is, $A_1 = \{x \mid x = 5\}$.



 Frequency: Under the same conditions, perform an experiment $n$ times. The number of times $n_A$ that event $A$ occurs is called the frequency of event $A$. The ratio $n_A / n$ is called the relative frequency of event $A$ and is recorded as $f_n(A)$.

 Probability: Suppose that $E$ is a random experiment and $S$ is its sample space. Assign a real number $P(A)$ (the probability of the event) to each event $A$ of $E$. The set function $P(\cdot)$ must meet the following conditions:

 Non-negativity: For each event $A$, $0 \leq P(A) \leq 1$.
 Normalization: For the inevitable event $S$, $P(S) = 1$.
 Countable additivity: If $A_1, A_2, \ldots$ are mutually incompatible events, that is, $A_i A_j = \varnothing$ $(i \neq j;\ i, j = 1, 2, \cdots)$, then $P(A_1 \cup A_2 \cup \cdots) = P(A_1) + P(A_2) + \cdots$.



 Random variable: a single-valued, real-valued function defined on the results of a random experiment.

 Example 1: Random experiment $E_4$: Roll two dice and check the sum of the results. The sample space of the experiment is $S = \{e = (i, j) \mid i, j = 1, 2, 3, 4, 5, 6\}$, where $i$ indicates the first outcome and $j$ indicates the second outcome. $X$, the sum of the two outcomes, is a random variable:

$$X = X(e) = X(i, j) = i + j, \quad i, j = 1, 2, \cdots, 6.$$

 Example 2: Random experiment $E_1$: Toss two coins and check the outcome (heads H or tails T). The sample space of the experiment is $S = \{HH, HT, TH, TT\}$. $Y$, the total number of tails T, is a random variable.



 Linear Algebra
 Probability Theory and Information Theory
 Basic Concepts of Probability Theory
 Random Variables and Their Distribution Functions
 Numerical Characteristics of Random Variables
 Information Theory
 Numerical Calculation



 Discrete random variables: All possible values of the random variable are finite or countably infinite. A typical example is the number of vehicles passing through a monitoring gate within one minute.

 Distribution law: If all the possible values of discrete random variable $X$ are $x_k$ $(k = 1, 2, \cdots)$, the probability of $X$ taking the possible value $\{X = x_k\}$ is

$$P\{X = x_k\} = p_k, \quad k = 1, 2, \cdots.$$

As defined for probability, $p_k$ should meet the following conditions:

(1) $p_k \geq 0$, $k = 1, 2, \cdots$.
(2) $\sum_{k=1}^{\infty} p_k = 1$.

The distribution law can also be expressed in a table:

X  | x1 | x2 | ... | xn | ...
pk | p1 | p2 | ... | pn | ...



 Bernoulli distribution (0-1 distribution, two-point distribution): If random variable $X$ can be either 0 or 1, its distribution law is

$$P\{X = k\} = p^k(1 - p)^{1-k}, \quad k = 0, 1 \quad (0 < p < 1).$$

That is, $X$ obeys the Bernoulli distribution with parameter $p$.

 The distribution law of the Bernoulli distribution can also be written as below:

X  | 0     | 1
pk | 1 - p | p

$E(X) = p$, $\mathrm{Var}(X) = p(1 - p)$.



 $n$ independent repetitive experiments: The experiment $E$ is repeated $n$ times. If the results of the individual experiments do not affect each other, the $n$ experiments are said to be independent of each other.

 The experiments that meet the following conditions are called $n$ Bernoulli experiments:

 Each experiment is repeated under the same conditions.
 There are only two possible results per experiment, $A$ and $\bar{A}$, with $P(A) = p$.
 The results of the experiments are independent of each other.

 If $X$ denotes the number of times event $A$ occurs in $n$ Bernoulli experiments, the probability of event $A$ occurring exactly $k$ times in the $n$ experiments is

$$P\{X = k\} = C_n^k p^k (1 - p)^{n-k}, \quad k = 0, 1, 2, \cdots, n.$$

$X$ then obeys the binomial distribution with parameters $n$ and $p$. This is expressed as $X \sim B(n, p)$, where $E(X) = np$ and $\mathrm{Var}(X) = np(1 - p)$.



 Poisson theorem: If $\lambda > 0$ is a constant, $n$ is any positive integer, and $np_n = \lambda$, then for any fixed non-negative integer $k$:

$$\lim_{n \to \infty} C_n^k p_n^k (1 - p_n)^{n-k} = \frac{\lambda^k e^{-\lambda}}{k!}.$$

 Poisson distribution: If all possible values of the random variable are 0, 1, 2, ..., and the probability of taking each value is

$$P\{X = k\} = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \cdots,$$

then $X$ obeys the Poisson distribution with parameter $\lambda$. It is expressed as $X \sim P(\lambda)$, where $E(X) = \lambda$ and $\mathrm{Var}(X) = \lambda$.



The mathematical models of the Poisson distribution and the binomial distribution are both built on Bernoulli experiments. When $n$ is very large and $p$ is very small, the Poisson distribution is approximately equal to the binomial distribution, as the following sketch illustrates.
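A minimal numerical comparison with SciPy (an assumed dependency):

```python
from scipy import stats

n, p = 1000, 0.003   # large n, small p
lam = n * p          # Poisson parameter λ = np

for k in range(6):
    binom_pmf = stats.binom.pmf(k, n, p)
    pois_pmf = stats.poisson.pmf(k, lam)
    print(k, round(binom_pmf, 6), round(pois_pmf, 6))  # nearly identical
```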



 Distribution function: If $X$ is a random variable and $x$ is an arbitrary real number, the function

$$F(x) = P\{X \leq x\}, \quad -\infty < x < \infty,$$

is called the distribution function of $X$.

 Distribution function $F(x)$ has the following basic properties:

 $F(x)$ is a non-decreasing function.
 $0 \leq F(x) \leq 1$, and $F(-\infty) = \lim_{x \to -\infty} F(x) = 0$, $F(\infty) = \lim_{x \to \infty} F(x) = 1$.
 $F(x + 0) = F(x)$; that is, $F(x)$ is right-continuous.

 Significance of distribution function $F(x)$: If $X$ is regarded as the coordinate of a random point on the number axis, the value of the distribution function $F(x)$ at $x$ indicates the probability that $X$ falls in the interval $(-\infty, x]$.



 If the distribution function $F(x)$ of random variable $X$ has a non-negative function $f(x)$ such that, for any real number $x$,

$$F(x) = \int_{-\infty}^{x} f(t)\,dt,$$

then $X$ is called a continuous random variable, and the function $f(x)$ is called the probability density function of $X$, or probability density.

 Probability density $f(x)$ has the following properties:

 $f(x) \geq 0$.
 $\int_{-\infty}^{+\infty} f(x)\,dx = 1$; that is, the total area under the curve $y = f(x)$ is 1.
 For arbitrary real numbers $x_1, x_2$ $(x_1 < x_2)$, $P\{x_1 < X \leq x_2\} = F(x_2) - F(x_1) = \int_{x_1}^{x_2} f(x)\,dx$; graphically, this is the area under $y = f(x)$ between $x_1$ and $x_2$.
 If $f(x)$ is continuous at $x$, then $F'(x) = f(x)$.
 The probability of random variable $X$ taking any single real number is 0; that is, $P(X = a) = 0$.



 If the probability density function of continuous random variable $X$ is

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty,$$

where $\mu$ and $\sigma$ $(\sigma > 0)$ are constants, $X$ obeys the normal distribution or Gaussian distribution with parameters $\mu$ and $\sigma$, expressed as $X \sim N(\mu, \sigma^2)$. In particular, when $\mu = 0$ and $\sigma = 1$, random variable $X$ obeys the standard normal distribution, expressed as $X \sim N(0, 1)$.



 If the probability density of continuous random variable $X$ is

$$f(x) = \begin{cases} \lambda e^{-\lambda x}, & x > 0, \\ 0, & x \leq 0, \end{cases}$$

where $\lambda > 0$ is a constant (the average rate at which the random event occurs per unit time), $X$ obeys the exponential distribution with parameter $\lambda$. This distribution is expressed as $X \sim E(\lambda)$, with $E(X) = 1/\lambda$ and $\mathrm{Var}(X) = 1/\lambda^2$.



 If the probability density of continuous random variable $X$ is

$$\mathrm{Laplace}(x; \mu, b) = \frac{1}{2b}\, e^{-\frac{|x-\mu|}{b}},$$

where $\mu$ is the position parameter and $b$ is the scale parameter, $X$ obeys the Laplace distribution. This distribution is expressed as $X \sim \mathrm{Laplace}(x; \mu, b)$, with $E(X) = \mu$ and $\mathrm{Var}(X) = 2b^2$.



(Diagram: relationships among common distributions. The Bernoulli distribution, repeated over n experiments, gives the binomial distribution; as n → ∞ the binomial tends to the Gaussian distribution, and with np > 20 and p < 0.05 it is approximated by the Poisson distribution. The chart also relates the Dirac distribution to the empirical distribution over an interval, places the exponential and Laplace ("back-to-back" exponential) distributions alongside the Gaussian, and includes the multinoulli distribution and mixture distributions.)



 Two-dimensional random variable: $E$ is a random experiment with sample space $S = \{e\}$. If $X = X(e)$ and $Y = Y(e)$ are random variables defined on $S$, they make a vector $(X, Y)$ called a two-dimensional random variable.

 Distribution function of a two-dimensional random variable: If $(X, Y)$ is a two-dimensional random variable, then for any real numbers $x, y$, the binary function

$$F(x, y) = P\{(X \leq x) \cap (Y \leq y)\} = P\{X \leq x, Y \leq y\}$$

is called the distribution function of the two-dimensional random variable $(X, Y)$, or the joint distribution function of random variables $X$ and $Y$.

 Significance of the joint distribution function: If $(X, Y)$ is considered as the coordinates of a random point on the plane, the value of the distribution function $F(x, y)$ at $(x, y)$ is the probability that the random point $(X, Y)$ falls in the infinite rectangle with vertex $(x, y)$ extending to the lower left of that point.



 Two-dimensional discrete random variable: All possible values of the discrete random variable $(X, Y)$ form a finite or countably infinite set of pairs.

 Joint distribution law of $X$ and $Y$:

Y \ X | x1  | x2  | ... | xi  | ...
y1    | p11 | p12 | ... | p1i | ...
y2    | p21 | p22 | ... | p2i | ...
...   | ... | ... | ... | ... | ...
yj    | pj1 | pj2 | ... | pji | ...
...   | ... | ... | ... | ... | ...



If the distribution function $F(x, y)$ of two-dimensional random variable $(X, Y)$ has a non-negative function $f(x, y)$ such that, for arbitrary $x, y$,

$$F(x, y) = \int_{-\infty}^{y}\int_{-\infty}^{x} f(u, v)\,du\,dv,$$

then $(X, Y)$ is a continuous two-dimensional random variable, and the function $f(x, y)$ is the joint probability density of the two-dimensional random variable $(X, Y)$.



 Marginal distribution function: Two-dimensional random variable $(X, Y)$ as a whole has distribution function $F(x, y)$. $X$ and $Y$ are random variables, and they also have their own distribution functions, expressed as $F_X(x)$ and $F_Y(y)$ and called the marginal distribution functions of the two-dimensional random variable $(X, Y)$ with respect to $X$ and $Y$, respectively. For example, $F_X(x) = P\{X \leq x\} = P\{X \leq x, Y \leq \infty\} = F(x, \infty)$.

 For a discrete random variable:

 Marginal distribution function: $F_X(x) = \sum_{x_i \leq x} \sum_{j=1}^{\infty} p_{ij}$.
 Marginal distribution law: $p_{i\cdot} = \sum_{j=1}^{\infty} p_{ij}$, $i = 1, 2, \cdots$.

 For a continuous random variable:

 Marginal distribution function: $F_X(x) = F(x, \infty) = \int_{-\infty}^{x} \left[\int_{-\infty}^{+\infty} f(x, y)\,dy\right] dx$.
 Marginal density function: $f_X(x) = \int_{-\infty}^{+\infty} f(x, y)\,dy$.



 Conditional probability and Bayes formula:

$$P(Y|X) = \frac{P(XY)}{P(X)}, \qquad P(X|Y) = \frac{P(XY)}{P(Y)} = \frac{P(Y|X)P(X)}{P(Y)}$$

 Assuming that $\{X_1, X_2, \ldots, X_n\}$ is a partition of the probability space into mutually exclusive events, $P(Y)$ can be expanded with the full probability formula: $P(Y) = P(Y|X_1)P(X_1) + P(Y|X_2)P(X_2) + \cdots + P(Y|X_n)P(X_n)$. Then, the Bayes formula can be expressed as:

$$P(X_i|Y) = \frac{P(Y|X_i)P(X_i)}{\sum_{i=1}^{n} P(Y|X_i)P(X_i)}$$

 The chain rule of conditional probability:

$$P(X_1, X_2, \ldots, X_n) = P(X_1)\prod_{i=2}^{n} P(X_i \mid X_1, \ldots, X_{i-1})$$



 For two random variables $X$ and $Y$, if for all $x, y$ the following applies:

$$P(X = x, Y = y) = P(X = x)P(Y = y),$$

then random variables $X$ and $Y$ are mutually independent, which is expressed as $X \perp Y$.

 If for each value $z$ of $Z$ the conditional probabilities of $X$ and $Y$ satisfy

$$P(X = x, Y = y \mid Z = z) = P(X = x \mid Z = z)\,P(Y = y \mid Z = z),$$

then random variables $X$ and $Y$ are conditionally independent given random variable $Z$, which is expressed as $X \perp Y \mid Z$.



 Wang went to hospital for a blood test and got a positive result, indicating that he may have contracted disease $X$. According to data on the Internet, among people who have the disease, 99% test positive (true positive) and 1% test negative (false negative); among people who do not have the disease, 1% test positive (false positive) and 99% test negative (true negative). As a result, Wang thought that, with only a 1% false positive rate and a 99% true positive rate, the probability of him being infected with disease $X$ should be 99%. However, the doctor told him that the probability of his infection was only about 0.09.

Let $X = 1$ (infected), $X = 0$ (not infected), $y = 1$ (tested positive), $y = 0$ (tested negative). Then

$$P(X{=}1 \mid y{=}1) = \frac{P(X{=}1)\,P(y{=}1|X{=}1)}{P(y{=}1|X{=}1)\,P(X{=}1) + P(y{=}1|X{=}0)\,P(X{=}0)} = \frac{P(X{=}1) \times 0.99}{0.99 \times P(X{=}1) + 0.01 \times (1 - P(X{=}1))}$$

If $P(X{=}1) = 0.001$, then $P(X{=}1 \mid y{=}1) \approx 0.09$.
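A minimal sketch of this Bayes computation:

```python
prior = 0.001          # P(X = 1): prevalence of the disease
sensitivity = 0.99     # P(y = 1 | X = 1)
false_positive = 0.01  # P(y = 1 | X = 0)

# Bayes formula: P(X = 1 | y = 1)
evidence = sensitivity * prior + false_positive * (1 - prior)
posterior = sensitivity * prior / evidence
print(round(posterior, 4))  # ~0.0902
```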



 Linear Algebra
 Probability Theory and Information Theory
 Basic Concepts of Probability Theory
 Random Variables and Their Distribution Functions
 Numerical Characteristics of Random Variables
 Information Theory
 Numerical Calculation



 Mathematical expectation (or mean, also referred to as expectation):

 For a discrete random variable: $E(X) = \sum_{k=1}^{\infty} x_k p_k$.
 For a continuous random variable: $E(X) = \int_{-\infty}^{+\infty} x f(x)\,dx$.

 Variance: a measure of the degree of dispersion of a random variable (or a set of data). According to probability theory, variance measures the deviation between the random variable and its mathematical expectation:

$$D(X) = \mathrm{Var}(X) = E\{[X - E(X)]^2\}.$$

In addition, $\sqrt{D(X)}$, expressed as $\sigma(X)$, is called the standard deviation.

$$X^* = \frac{X - E(X)}{\sigma(X)}$$

is called the standardized variable of $X$.



 Covariance:

$$\mathrm{Cov}(X, Y) = E\big((X - E(X))(Y - E(Y))\big)$$

 Correlation coefficient:

$$\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{D(X)}\sqrt{D(Y)}}.$$

 Covariance matrix for random variable $(X_1, X_2)$:

$$C = \begin{pmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{pmatrix}$$

where $c_{ij} = \mathrm{Cov}(X_i, X_j) = E\big[(X_i - E(X_i))(X_j - E(X_j))\big]$, $i, j = 1, 2$.
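A minimal sketch computing these quantities with NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(scale=0.5, size=1000)  # correlated with x

C = np.cov(x, y)          # 2 x 2 covariance matrix
rho = np.corrcoef(x, y)   # correlation coefficient matrix
print(C)
print(rho[0, 1])          # close to Cov(x, y) / (σ_x σ_y)
```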



 Linear Algebra
 Probability Theory and Information Theory
 Basic Concepts of Probability Theory
 Random Variables and Their Distribution Functions
 Numerical Characteristics of Random Variables
 Information Theory
 Numerical Calculation



As a branch of applied mathematics, information theory mainly studies how to measure the information contained in a signal. The founding of information theory was marked by the publication of Shannon's paper "A Mathematical Theory of Communication" in 1948. In this paper, Shannon creatively used probability theory to study communication problems, gave a scientific and quantitative description of information, and for the first time proposed the concept of information entropy.



 The definition of self-information $I(x)$ for event $X = x$, written as a function $f(p)$ of the event's probability $p$, should meet the following conditions:

 $f(p)$ should be a strictly monotonically decreasing function of the probability; that is, if $p_1 > p_2$, then $f(p_1) < f(p_2)$.
 When $p = 1$, $f(p) = 0$.
 When $p = 0$, $f(p) = \infty$.
 The joint information content of two independent events should be equal to the sum of their respective information contents.

Therefore, if the probability of a message is $p$, the information content of this message is:

$$I(x) = -\log_2 p.$$

 Example: If you toss a fair coin, the information content of the coin showing heads or tails is $I(\text{heads}) = I(\text{tails}) = 1$ bit.



 The information contained in a source is the average uncertainty of all possible messages transmitted by the source. Shannon, the founder of information theory, refers to the amount of information that the source contains as information entropy, which is the statistical average of the amount of information in data partition $D$. The information entropy for the classification of $m$ classes of tuples in $D$ is calculated as follows:

$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i),$$

where $p_i$ is the non-zero probability that an arbitrary tuple in $D$ belongs to class $C_i$, estimated as $p_i = \frac{|C_{i,D}|}{|D|}$.

 For example, what is the entropy of tossing a fair coin?

$$\mathrm{Info}(D) = -\left(\frac{1}{2}\log_2\frac{1}{2} + \frac{1}{2}\log_2\frac{1}{2}\right) = 1 \text{ bit}.$$
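A minimal entropy sketch in Python:

```python
import math

def entropy(probs):
    """Information entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))  # biased coin: ~0.469 bits, less uncertainty
```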



 Linear Algebra
 Probability Theory and Information Theory
 Numerical Calculation
 Basic Concepts
 Classification of and Solutions to the Optimization Problem



 Numerical calculation: refers to the methods and processes of effectively using a digital computer to find approximate solutions to mathematical problems, as well as the discipline formed by the related theories. The process of solving practical problems with computers is as follows:

Problem analysis → Mathematical model → Numerical calculation method → Programming design → Computer calculation result



 Underflow: An underflow occurs when a number close to 0 is rounded to zero. Many functions behave qualitatively differently when their argument is zero rather than a small positive number.

 Overflow: An overflow occurs when a number of large magnitude is approximated as $\infty$ or $-\infty$. Further operations usually cause these infinite values to become not-a-number.

 The large number "swallows" the small number: when $a \gg b$, floating-point arithmetic yields $a + b = a$, a numerical abnormality.

 The softmax function

$$\mathrm{softmax}(x)_i = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)}$$

can be stabilized against overflow and underflow by evaluating it at $z = x - \max_i x_i$, which does not change the result.
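A minimal sketch of the numerically stable softmax:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax: shifting by max(x) avoids overflow
    in exp() without changing the result."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

x = np.array([1000.0, 1001.0, 1002.0])  # naive exp(1000) would overflow
print(softmax(x))  # [0.09003057 0.24472847 0.66524096]
```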



 Conditioning: refers to how rapidly a function changes with small changes in its input; a function that changes rapidly when its inputs are perturbed slightly is ill-conditioned.

 Considering the function $f(x) = A^{-1}x$, when $A \in \mathbb{R}^{n \times n}$ has an eigendecomposition, its condition number is

$$\max_{i,j}\left|\frac{\lambda_i}{\lambda_j}\right|,$$

the ratio of the magnitudes of the largest and smallest eigenvalues. When this ratio is large, matrix inversion is particularly sensitive to input errors.
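A minimal sketch comparing a well-conditioned and an ill-conditioned matrix:

```python
import numpy as np

well = np.array([[2.0, 0.0],
                 [0.0, 1.0]])    # eigenvalues 2 and 1
ill = np.array([[1.0, 0.0],
                [0.0, 1e-8]])    # eigenvalues 1 and 1e-8

for A in (well, ill):
    lam = np.linalg.eigvals(A)
    print(np.max(np.abs(lam)) / np.min(np.abs(lam)))  # eigenvalue ratio
    print(np.linalg.cond(A))  # library condition number (same for these matrices)
```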



 Linear Algebra
 Probability Theory and Information Theory
 Numerical Calculation
 Basic Concepts
 Classification of and Solutions to the Optimization Problem



 Optimization problem: it can be expressed as

$$\min\ (\max)\ f(x)$$
$$\text{s.t. } g_i(x) \geq 0,\ i = 1, 2, \cdots, m \quad \text{(inequality constraints)}$$
$$\qquad h_j(x) = 0,\ j = 1, 2, \cdots, p \quad \text{(equality constraints)}$$

where $x = (x_1, x_2, \ldots, x_n)^T \in \mathbb{R}^n$. We refer to $f(x)$ as the objective function or criterion, or, when minimizing it, as the cost function, loss function, or error function.



 Constrained optimization: a branch of optimization problems. Sometimes we do not want to maximize or minimize $f(x)$ over all possible values of $x$. Instead, we want to find the maximum or minimum value of $f(x)$ when $x$ lies in a certain set $S$. The points within the set $S$ are called feasible points.

 With no constraints, the problem can be expressed as:

$$\min f(x)$$

The common method is Fermat's theorem: solve $f'(x) = 0$ to obtain the critical points, and then verify whether the extreme value is attained at a critical point.

 With equality constraints, the problem can be expressed as:

$$\min f(x)$$
$$\text{s.t. } h_i(x) = 0,\ i = 1, 2, \cdots, n.$$

The common method is the Lagrange multiplier method: introduce $n$ Lagrange multipliers $\lambda_i$ to construct the Lagrange function $L(x, \lambda) = f(x) + \sum_{i=1}^{n} \lambda_i h_i(x)$, and then set the partial derivative with respect to each variable to zero. This gives the collection of candidate points, and the optimal value is obtained through verification, as in the sketch below.
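An illustrative Lagrange-multiplier computation with SymPy (an assumed dependency), maximizing the hypothetical objective f(x, y) = xy subject to x + y = 1:

```python
import sympy as sp

x, y, lam = sp.symbols('x y lambda')
f = x * y        # objective
h = x + y - 1    # equality constraint h(x, y) = 0

L = f + lam * h  # Lagrange function L = f + λh
eqs = [sp.diff(L, v) for v in (x, y, lam)]  # set all partials to zero
print(sp.solve(eqs, (x, y, lam)))  # {x: 1/2, y: 1/2, lambda: -1/2}
```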



 With inequality constraints, the problem can be expressed as:

$$\min f(x)$$
$$\text{s.t. } h_i(x) = 0,\ i = 1, 2, \cdots, n,$$
$$\qquad g_j(x) \leq 0,\ j = 1, 2, \cdots, m.$$

A common method is to introduce new variables $\lambda_i$ and $\alpha_j$ to construct the generalized Lagrange function from all equality constraints, inequality constraints, and $f(x)$:

$$L(x, \lambda, \alpha) = f(x) + \sum_i \lambda_i h_i(x) + \sum_j \alpha_j g_j(x).$$

We can use a set of simple properties to describe the optimal points of constrained optimization problems; these are called the Karush-Kuhn-Tucker (KKT) conditions:

 The gradient of the generalized Lagrangian is 0.
 All constraints on $x$ and the KKT multipliers are satisfied.
 The inequality constraints exhibit "complementary slackness": $\alpha \odot g(x) = 0$.



 Gradient descent: The derivative indicates how to change $x$ to slightly improve $y$. For example, we know that $f(x - \epsilon\,\mathrm{sign}(f'(x)))$ is smaller than $f(x)$ for small enough $\epsilon$. So we can move $x$ in the opposite direction of the derivative by a small step to reduce $f(x)$. This technique is called gradient descent.

 The extremum problem of a one-dimensional function:

 At a local extremum point of the function, $f(x)$ cannot be reduced or increased by moving $x$ infinitesimally.
 A point where $f'(x) = 0$ is called a critical point or a stationary point.
 An extremum point of a differentiable function must be a stationary point, but a stationary point may not be an extremum point.



 Convex function: if for all $\lambda \in (0, 1)$ and arbitrary $x_1, x_2 \in \mathbb{R}$ the following applies:

$$f(\lambda x_1 + (1 - \lambda)x_2) \leq \lambda f(x_1) + (1 - \lambda)f(x_2),$$

then $f(x)$ is called a convex function. The extremum of a convex function is attained at a stationary point.



 For multidimensional functions, partial derivatives describe how the function varies with respect to each individual variable.

 Gradient: the derivative with respect to vector $x$, expressed as $\nabla_x f(x)$. The directional derivative of $f(x)$ in the direction of a unit vector $u$ is $u^T \nabla_x f(x)$.

 For a task of minimizing $f(x)$, we want to find the direction in which the function decreases fastest. Since $u^T \nabla_x f(x) = \|\nabla_x f(x)\| \cos\theta$, where $\theta$ is the angle between $u$ and the gradient $\nabla_x f(x)$, the directional derivative is minimized when $\cos\theta = -1$.

 Thus the direction in which the value of $f(x)$ decreases fastest is the negative direction of the gradient.



 A positive gradient vector points uphill, and a negative gradient vector points downhill. Moving in the negative gradient direction reduces $f(x)$; this is called the method of steepest descent or gradient descent.

 Under the gradient descent method, the updated point is proposed as:

$$x' = x - \varepsilon \nabla_x f(x),$$

where $\varepsilon$ is the learning rate, a positive scalar determining the step size.

 The iteration converges when the gradient is zero or close to zero.
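A minimal gradient descent sketch on the convex function f(x) = (x - 3)^2:

```python
def grad(x):
    """Gradient of f(x) = (x - 3)^2."""
    return 2 * (x - 3)

x = 0.0   # initial position
lr = 0.1  # learning rate ε
for _ in range(100):
    x = x - lr * grad(x)  # update rule x' = x - ε ∇f(x)

print(x)  # converges to the minimizer x = 3
```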



(Figure: gradient descent in two-dimensional and three-dimensional space. From the initial position, where the slope of the tangent corresponding to the initial value is greater than 0, the iterates descend toward the position where the function J(W) attains its minimum value.)




