Artificial Intelligence Mathematics
(Artificial Intelligence-related
Mathematical Concepts)
Some Math & Probability
• Expected value
• Covariance
• Differentiation
• Information Theory
2
Random Variable
• A random variable X takes on a defined set of
values with different probabilities
• For example, if you roll a die, the outcome is random (not
fixed) and there are 6 possible outcomes, each of which
occurs with probability one-sixth.
• For example, if you poll people about their voting
preferences, the percentage of the sample that responds
“Yes on Proposition A” is also a random variable (the
percentage will be slightly different every time you poll).
• Roughly, probability is how frequently we
expect different outcomes to occur if we
repeat the experiment over and over
3
Kristin L. Sainani, HRP 259: Intro to Probability and Statistics
Random variables can be discrete or
continuous
• Discrete random variables have a countable
number of outcomes
– Examples: Dead/alive, treatment/placebo, dice,
counts, etc.
• Continuous random variables have an
infinite continuum of possible values.
– Examples: blood pressure, weight, the speed of a
car, the real numbers from 1 to 6.
4
Kristin L. Sainani, HRP 259: Intro to Probability and Statistics
Expected value
- Weighted average of possible values of a random variable
Discrete case:
E(X) = Σ_{all xᵢ} xᵢ p(xᵢ)
Continuous case:
E(X) = ∫_{all x} x p(x) dx
5
Kristin L. Sainani, HRP 259: Intro to Probability and Statistics
Expected value
• If X is a random integer between 1 and 10,
what’s the expected value of X?
6
Kristin L. Sainani, HRP 259: Intro to Probability and Statistics
Expected value
• If X is a random integer between 1 and 10,
what’s the expected value of X?
µ = E(X) = Σ_{i=1}^{10} i·(1/10) = (1/10)·Σ_{i=1}^{10} i = (1/10)·(10(10 + 1)/2) = 55(.1) = 5.5
7
Kristin L. Sainani, HRP 259: Intro to Probability and Statistics
Average vs. Expected Value
• E.g. Random integer between 1 and 10
– Average (a.k.a. arithmetic mean)
Given a list of (4, 6, 9, 1, 10)
Average: (4+6+9+1+10) / 5
– Expected Value : 5.5
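A minimal sketch (plain Python, standard library only) contrasting the arithmetic mean of the observed list above with the expected value computed from the uniform distribution on 1–10:

```python
# Arithmetic mean of an observed sample vs. expected value of the distribution
sample = [4, 6, 9, 1, 10]                    # the observed list from the slide
average = sum(sample) / len(sample)          # (4+6+9+1+10)/5 = 6.0

values = range(1, 11)                        # random integer between 1 and 10
p = 1 / 10                                   # uniform probability of each value
expected_value = sum(x * p for x in values)  # E(X) = sum of x * p(x) = 5.5

print(average, expected_value)               # 6.0 5.5
```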
8
Variance/standard deviation
“The expected squared distance (or deviation)
from the mean”
i.e. spread of a data set around its mean value
σ² = Var(X) = E[(X − µ)²] = Σ_{all xᵢ} (xᵢ − µ)² p(xᵢ)
9
Kristin L. Sainani, HRP 259: Intro to Probability and Statistics
Variance
Discrete case:
Var(X) = σ² = Σ_{all xᵢ} (xᵢ − µ)² p(xᵢ)
Continuous case:
Var(X) = σ² = ∫_{−∞}^{∞} (x − µ)² p(x) dx
10
Kristin L. Sainani, HRP 259: Intro to Probability and Statistics
Var(X) = Σ_{all xᵢ} (xᵢ − µ)² p(xᵢ) = Σ_{all xᵢ} xᵢ² p(xᵢ) − µ² = E(X²) − [E(X)]²
Proof (optional):
E[(X − µ)²] = E(X² − 2µX + µ²)
            = E(X²) − E(2µX) + E(µ²)    (expected value: E(X + Y) = E(X) + E(Y))
            = E(X²) − 2µE(X) + µ²       (E(c) = c)
            = E(X²) − 2µ·µ + µ²         (E(X) = µ)
            = E(X²) − µ²
            = E(X²) − [E(X)]²
11
Kristin L. Sainani, HRP 259: Intro to Probability and Statistics
Variance
Find the variance and standard deviation for
the number of ships to arrive at the harbor
x 10 11 12 13 14
P(x) .4 .2 .2 .1 .1
(the mean is 11.3).
12
Kristin L. Sainani, HRP 259: Intro to Probability and Statistics
Variance and std dev
x²   100  121  144  169  196
P(x)  .4   .2   .2   .1   .1
E(X²) = Σ_{i=1}^{5} xᵢ² p(xᵢ) = (100)(.4) + (121)(.2) + (144)(.2) + (169)(.1) + (196)(.1) = 129.5
Var(X) = E(X²) − [E(X)]² = 129.5 − 11.3² = 1.81
stddev(X) = √1.81 = 1.35
Interpretation: On an average day, we expect 11.3 ships to
arrive in the harbor, plus or minus 1.35. This gives you a feel for
what would be considered a usual day.
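A quick numeric check of this example (a minimal sketch assuming only the Python standard library):

```python
import math

# Ship-arrival distribution from the slide: value -> probability
pmf = {10: 0.4, 11: 0.2, 12: 0.2, 13: 0.1, 14: 0.1}

mean = sum(x * p for x, p in pmf.items())               # E(X) = 11.3
var = sum((x - mean) ** 2 * p for x, p in pmf.items())  # Var(X) = 1.81
std = math.sqrt(var)                                    # ~1.35

print(mean, var, round(std, 2))                         # 11.3 1.81 1.35
```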
13
Kristin L. Sainani, HRP 259: Intro to Probability and Statistics
Gaussian (Normal)
• If I look at the height of women in country xx, it will look
approximately Gaussian
• Small random noise/errors tend to look Gaussian (normal)
• Distribution: f(x) = (1 / (σ√(2π))) exp(−(x − µ)² / (2σ²))
• Parameters: mean µ and variance σ²
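An illustrative sketch (assuming NumPy is available; the height parameters are made up) that samples from a normal distribution and checks that the sample mean and standard deviation match the parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 160.0, 7.0                       # hypothetical height parameters (cm)
heights = rng.normal(mu, sigma, size=100_000)

print(heights.mean(), heights.std())         # close to 160 and 7
```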
14
Kristin L. Sainani, HRP 259: Intro to Probability and Statistics
Coin tosses
• Number of heads in 100 tosses
– Flip coins virtually
– Flip a virtual coin 100 times; count # of heads
– Repeat this over and over again a large number of
times (e.g. 30,000)
– Plot the results
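A minimal simulation sketch of this experiment (assuming NumPy; Matplotlib would be needed for the actual plot):

```python
import numpy as np

rng = np.random.default_rng(42)
trials = 30_000                                    # number of repetitions
heads = rng.binomial(n=100, p=0.5, size=trials)    # heads out of 100 tosses, per trial

print(heads.mean(), heads.std())                   # ~50 and ~5, as on the next slide
# To plot: import matplotlib.pyplot as plt; plt.hist(heads, bins=range(30, 71)); plt.show()
```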
15
Coin tosses
• Number of heads in 100 tosses
– 30,000 trials
Mean = 50, std. dev = 5
The number of heads follows a normal distribution
∴ 95% of the time, we get between 40 and 60 heads…
16
Kristin L. Sainani, HRP 259: Intro to Probability and Statistics
Covariance: joint probability
• The covariance measures the strength of the
linear relationship between two variables
e.g. A positive covariance means both investments' returns tend
to move upward or downward in value at the same time.
σ_xy = E[(X − µ_x)(Y − µ_y)] = Σ_{i=1}^{N} (xᵢ − µ_x)(yᵢ − µ_y) P(xᵢ, yᵢ)
17
Kristin L. Sainani, HRP 259: Intro to Probability and Statistics
Interpreting Covariance
• Covariance between two random variables:
cov(X,Y) = E[(X − µ_x)(Y − µ_y)]
cov(X,Y) > 0   X and Y are positively correlated
cov(X,Y) < 0   X and Y are negatively (inversely) correlated
cov(X,Y) = 0   X and Y are uncorrelated (zero covariance does not by itself
               imply independence, though independence does imply zero covariance)
18
Calculation of covariance
19
Calculation of covariance
20
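The covariance calculation on these two slides is an image in the original deck; as a stand-in, here is a minimal NumPy sketch (the data values are made up for illustration) that computes a covariance both by the definition and with np.cov:

```python
import numpy as np

# Hypothetical paired observations, e.g. returns of two investments
x = np.array([0.01, 0.03, -0.02, 0.04, 0.00])
y = np.array([0.02, 0.025, -0.01, 0.05, -0.005])

cov_manual = ((x - x.mean()) * (y - y.mean())).mean()  # population covariance by definition
cov_matrix = np.cov(x, y, bias=True)                   # 2x2 covariance matrix, same convention

print(cov_manual, cov_matrix[0, 1])                    # the two values agree
```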
Determinant of a Matrix
e.g. for a 2×2 matrix [[a, b], [c, d]]: det = ad − bc
e.g. for a 3×3 matrix [[a, b, c], [d, e, f], [g, h, i]], expanding along the first row:
det = a(ei − fh) − b(di − fg) + c(dh − eg)
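A minimal sketch (assuming NumPy; the matrix values are just an example) checking the 2×2 formula against numpy.linalg.det:

```python
import numpy as np

A = np.array([[3.0, 8.0],
              [4.0, 6.0]])                            # example 2x2 matrix

det_formula = A[0, 0] * A[1, 1] - A[0, 1] * A[1, 0]   # ad - bc = 3*6 - 8*4 = -14
det_numpy = np.linalg.det(A)

print(det_formula, det_numpy)                         # -14.0 and ~-14.0
```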
21
https://www.mathsisfun.com/algebra/matrix-determinant.html
Eigenvalue and Eigenvector
• A v = λ v (A must be a square matrix)
• (A − λI) v = 0, where the eigenvector v is a non-zero vector
• To find λ (the eigenvalue), solve the determinant equation det(A − λI) = 0
A=
22
https://jeongchul.tistory.com/603
Eigenvalue and Eigenvector
A=
When the eigenvalue is 3
When the eigenvalue is 1
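The matrix A on these slides is an image in the original; as an illustrative stand-in, here is a sketch (assuming NumPy) using a hypothetical matrix whose eigenvalues are 3 and 1, matching the values mentioned above:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                        # hypothetical matrix with eigenvalues 3 and 1

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)                                # 3. and 1. (order may vary)

# Check A v = lambda v for the first eigenpair
v = eigenvectors[:, 0]
print(np.allclose(A @ v, eigenvalues[0] * v))     # True
```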
23
https://jeongchul.tistory.com/603
Singular Value Decomposition
• Problem:
– #1: Find concepts in data
– #2: Reduce dimensionality
24
Recommender Systems, Lior Rokach
SVD - Definition
A[n×m] = U[n×r] Λ[r×r] (V[m×r])ᵀ
• A: n x m matrix (e.g., n documents, m terms)
• U: n x r matrix (n documents, r concepts)
• Λ: r x r diagonal matrix (strength of each
‘concept’) (r: rank of the matrix)
• V: m x r matrix (m terms, r concepts)
25
Manning and Raghavan, 2004
SVD - Example
• A = U Λ Vᵀ - example:
        data  inf.  retrieval  brain  lung
 CS   [  1     1     1          0      0  ]     [ 0.18  0    ]
      [  2     2     2          0      0  ]     [ 0.36  0    ]
      [  1     1     1          0      0  ]     [ 0.18  0    ]     [ 9.64  0    ]     [ 0.58  0.58  0.58  0     0    ]
      [  5     5     5          0      0  ]  =  [ 0.90  0    ]  x  [ 0     5.29 ]  x  [ 0     0     0     0.71  0.71 ]
 MD   [  0     0     0          2      2  ]     [ 0     0.53 ]
      [  0     0     0          3      3  ]     [ 0     0.80 ]
      [  0     0     0          1      1  ]     [ 0     0.27 ]
26
SVD - Example
• A = U Λ Vᵀ - example (same decomposition as on the previous slide):
U is the doc-to-concept similarity matrix: its first column corresponds to the
CS-concept and its second column to the MD-concept
27
SVD - Example
• A = U Λ Vᵀ - example (same decomposition as on the previous slide):
The diagonal entries of Λ give the ‘strength’ of each concept:
9.64 for the CS-concept and 5.29 for the MD-concept
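A minimal sketch (assuming NumPy) that runs SVD on this document-term matrix; numpy.linalg.svd returns singular vectors only up to sign, so the first two columns/rows correspond to the factors shown above:

```python
import numpy as np

# Document-term matrix from the example (rows: 4 CS docs, 3 MD docs)
A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(np.round(s[:2], 2))        # [9.64 5.29] -- strengths of the two concepts
print(np.round(U[:, :2], 2))     # doc-to-concept matrix (up to sign)
print(np.round(Vt[:2, :], 2))    # term-to-concept matrix (up to sign)
```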
28
Bayes Rule
• P(A|B) = P(B|A) P(A) / P(B)
• Which tells us:
– how often A happens given that B happens, written P(A|B),
• When we know:
– how often B happens given that A happens, written P(B|A)
– how likely A is on its own, written P(A)
– how likely B is on its own, written P(B)
• E.g.
– P(Fire|Smoke) means how often there is fire when we can see smoke
– P(Smoke|Fire) means how often we can see smoke when there is fire
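A small numeric sketch of the rule in plain Python (the probabilities below are made up for illustration):

```python
# Hypothetical numbers: P(Fire), P(Smoke), and P(Smoke | Fire)
p_fire = 0.01               # how likely fire is on its own
p_smoke = 0.10              # how likely smoke is on its own
p_smoke_given_fire = 0.90   # how often we see smoke when there is fire

# Bayes' rule: P(Fire | Smoke) = P(Smoke | Fire) * P(Fire) / P(Smoke)
p_fire_given_smoke = p_smoke_given_fire * p_fire / p_smoke
print(p_fire_given_smoke)   # 0.09
```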
29
Bayes’ Rule
30
https://hyeongminlee.github.io/post/bnn001_bayes_rule/
Differentiation
Differentiation is all about measuring change.
E.g. Measuring change in a linear function:
y = a + bx
a = intercept
b = constant slope i.e. the impact of a unit
change in x on the level of y
b = ∆y/∆x = (y₂ − y₁)/(x₂ − x₁)
31
Example: A firm's cost function is
Y = X²
X    ∆X    Y     ∆Y
0          0
1    +1    1     +1
2    +1    4     +3
3    +1    9     +5
4    +1    16    +7
(plot of y = x² for x from 0 to 6)
Y = X²
Y + ∆Y = (X + ∆X)²
Y + ∆Y = X² + 2X·∆X + ∆X²
∆Y = X² + 2X·∆X + ∆X² − Y
since Y = X²  ⇒  ∆Y = 2X·∆X + ∆X²
∆Y/∆X = 2X + ∆X
34
The slope depends on X and ∆X
The slope of the graph of a function is called
the derivative of the function
f′(x) = dy/dx = lim_{∆x→0} ∆y/∆x
• The process of differentiation involves letting the
change in x become arbitrarily small, i.e. letting
∆x→0
e.g. if ∆Y/∆X = 2X + ∆X and ∆X → 0
⇒ dY/dX = 2X in the limit as ∆X → 0
35
Differentiation
• Examples:
f(x) = 3    →  f′(x) = 0
f(x) = 4x   →  f′(x) = 4
f(x) = 4x²  →  f′(x) = 8x
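A quick check of these examples (a sketch assuming SymPy is installed):

```python
import sympy as sp

x = sp.symbols('x')
for f in (sp.Integer(3), 4 * x, 4 * x**2):
    print(f, '->', sp.diff(f, x))   # 3 -> 0, 4*x -> 4, 4*x**2 -> 8*x
```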
36
Differentiation
• Product Rule
If y = u.v where u and v are functions of x,
(u = f(x) and v = g(x) ) Then
dy/dx = u·(dv/dx) + v·(du/dx)
• Examples
i) y = (x + 2)(ax² + bx)
   dy/dx = (x + 2)(2ax + b) + (ax² + bx)
ii) y = (4x³ − 3x + 2)(2x² + 4x)
   dy/dx = (4x³ − 3x + 2)(4x + 4) + (2x² + 4x)(12x² − 3)
37
Differentiation
• The Chain Rule
If y is a function of v, and v is a function of x,
then y is a function of x and
dy/dx = (dy/dv)·(dv/dx)
• Example:
i) y = (ax² + bx)^½
   let v = ax² + bx, so y = v^½
   dy/dx = ½·(ax² + bx)^(−½)·(2ax + b)
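A sketch verifying the product-rule and chain-rule examples symbolically (assuming SymPy):

```python
import sympy as sp

x, a, b = sp.symbols('x a b')

# Product rule example i): y = (x + 2)(a*x**2 + b*x)
y1 = (x + 2) * (a * x**2 + b * x)
rule1 = (x + 2) * (2 * a * x + b) + (a * x**2 + b * x)
print(sp.expand(sp.diff(y1, x) - rule1))          # 0

# Chain rule example: y = (a*x**2 + b*x)**(1/2)
y2 = sp.sqrt(a * x**2 + b * x)
rule2 = (2 * a * x + b) / (2 * sp.sqrt(a * x**2 + b * x))
print(sp.simplify(sp.diff(y2, x) - rule2))        # 0
```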
38
Sigmoid Function
• Sigmoid Function: σ(x) = 1 / (1 + e⁻ˣ)
• It is built from an exponential function whose base is the natural constant e.
• Natural constant e:
  • Called 'the base of the natural logarithm' and 'Euler's number'
  • An irrational number that is important in mathematics, like pi (π)
  • Its value is approximately 2.718281828..., and it can be defined as the
    limit e = lim_{n→∞} (1 + 1/n)ⁿ
• When the exponential function with base e appears in the denominator as
  above, the result is the sigmoid function
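A minimal sketch of the function (assuming NumPy):

```python
import numpy as np

def sigmoid(x):
    """Sigmoid: 1 / (1 + e^(-x)); squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))   # ~[0.0067 0.5 0.9933]
```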
39
모두의 딥러닝
Sigmoid Function
40
모두의 딥러닝
Information Theory
• Key idea: unlikely events are more informative than
frequent events
41
https://ratsgo.github.io/statistics/2017/09/22/information/
Shannon’s Entropy
• The amount of information in an event x of the random variable X is
  I(x) = −log₂ P(x)
  – Example: A coin toss that results in heads: −log₂ 0.5 = 1
  – Example: Rolling a die and getting a 1: −log₂ (1/6) ≈ 2.585
  – If the base is 2, the unit of information is called the shannon or bit.
• Shannon’s Entropy: the expected value of the information over all events,
  H(X) = −Σ_x P(x) log₂ P(x)
• For a coin with the same chance of heads and tails (a fair coin), the
  entropy is 1 bit
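A short sketch (assuming NumPy) that reproduces these numbers:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits: H = -sum p * log2(p), ignoring zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(-np.log2(0.5))          # 1.0 bit   (coin lands heads)
print(-np.log2(1 / 6))        # ~2.585 bits (die shows a 1)
print(entropy([0.5, 0.5]))    # 1.0 bit   (fair coin)
print(entropy([1 / 6] * 6))   # ~2.585 bits (fair die)
```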
42
https://ratsgo.github.io/statistics/2017/09/22/information/
Entropy
• The x-axis is the fairness of the coin (i.e. the probability of getting
heads); the entropy is maximized at 1 bit when that probability is 0.5
43
https://ratsgo.github.io/statistics/2017/09/22/information/
Cross Entropy
• The cross entropy H(P,Q) = −Σ_x P(x) log Q(x) is similar to the entropy
H(P), except that the probability multiplied outside the logarithm is P(x)
while the probability inside the logarithm is Q(x); it is an entropy in
which the two probability distributions are crossed
44
https://ratsgo.github.io/statistics/2017/09/22/information/
KL Divergence
• Measures the difference between two probability distributions:
  D_KL(P‖Q) = Σ_x P(x) log (P(x) / Q(x)) = H(P,Q) − H(P)
• The difference between the distribution P(x) of the actual data and the
distribution Q(x) of the data estimated by the model can be obtained using KLD
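A minimal sketch (assuming NumPy) computing entropy, cross entropy, and KL divergence for two made-up distributions, and checking the identity D_KL(P‖Q) = H(P,Q) − H(P):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])     # "true" distribution (made up)
q = np.array([0.5, 0.3, 0.2])     # model's estimate (made up)

h_p = -np.sum(p * np.log2(p))      # entropy H(P)
h_pq = -np.sum(p * np.log2(q))     # cross entropy H(P, Q)
kl = np.sum(p * np.log2(p / q))    # KL divergence D_KL(P || Q)

print(h_p, h_pq, kl)
print(np.isclose(kl, h_pq - h_p))  # True
```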
45
https://ratsgo.github.io/statistics/2017/09/22/information/
Softmax Function
• For a k-dimensional vector whose i-th element is zᵢ, the softmax function
defines pᵢ, the probability that the i-th class is the correct one, as
  pᵢ = exp(zᵢ) / Σ_{j=1}^{k} exp(z_j)
• For a three-dimensional vector this gives three probabilities that sum to 1
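A minimal sketch (assuming NumPy), using the usual max-subtraction trick for numerical stability:

```python
import numpy as np

def softmax(z):
    """Softmax: exp(z_i) / sum_j exp(z_j); shift by max(z) for numerical stability."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

p = softmax([2.0, 1.0, 0.1])      # three-dimensional example
print(p, p.sum())                 # ~[0.659 0.242 0.099], sums to 1.0
```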
46
Maximum Likelihood Estimation
Maximum Likelihood Estimation (MLE)
• Maximum Likelihood Estimation (MLE) is a method of
estimating the parameters of a model.
This estimation method is one of the most widely used.
• The method of maximum likelihood selects the set of
values of the model parameters that maximizes the
likelihood function, i.e. maximizes the "agreement" of the
selected model with the observed data
– parameter values for which the observed sample is
most likely to have been generated
Example: Flip a Thumbtack
• When tossed, it can land in one of two positions:
Head(H) or Tail (T)
• We denote by θ the (unknown) probability P(H).
• Estimation task:
– Given a sequence of toss samples x[1], x[2], …, x[M]
we want to estimate the probabilities
P(H)= θ and P(T) = 1 – θ
The task is to find a vector of parameters (θ in this case)
that have generated the given data. This vector
parameter can be used to predict future data.
Likelihood Function
• How good is a particular θ?
It depends on how likely it is to generate the
observed data
L_D(θ) = P(D | θ) = ∏_m P(x[m] | θ)
• The likelihood for the sequence H, T, T, H, H is
L_D(θ) = θ · (1 − θ) · (1 − θ) · θ · θ
(plot of L(θ) for θ between 0 and 1)
MLE
• To compute the likelihood in the thumbtack
example we only require NH and NT
(the number of heads and the number of tails)
L_D(θ) = θ^{N_H} · (1 − θ)^{N_T}
MLE
MLE Principle: Choose parameters that maximize
the likelihood function
• This is one of the most commonly used estimators
• Intuitively appealing
• One usually maximizes the log-likelihood function,
defined as l_D(θ) = ln L_D(θ)
L_D(θ) = θ^{N_H} · (1 − θ)^{N_T}
l_D(θ) = N_H log θ + N_T log (1 − θ)
Given that logarithm is an increasing function so it is equivalent to
maximize the log likelihood
Example: MLE in Binomial Data
l_D(θ) = N_H log θ + N_T log (1 − θ)
Taking the derivative and setting it to 0, we get
N_H / θ = N_T / (1 − θ)   ⇒   θ̂ = N_H / (N_H + N_T)
Example:
(N_H, N_T) = (3, 2)
L_D(θ) = θ^{N_H} · (1 − θ)^{N_T}
MLE estimate is 3/5 = 0.6
(plot of L(θ) for θ between 0 and 1, peaking at 0.6)
53
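A small numeric sketch (assuming NumPy) that evaluates the likelihood on a grid and confirms the closed-form estimate θ̂ = N_H / (N_H + N_T):

```python
import numpy as np

n_heads, n_tails = 3, 2
theta = np.linspace(0.001, 0.999, 999)                  # grid over (0, 1)
likelihood = theta**n_heads * (1 - theta)**n_tails      # L_D(theta)

theta_hat_grid = theta[np.argmax(likelihood)]           # ~0.6 (grid maximizer)
theta_hat_closed = n_heads / (n_heads + n_tails)        # 0.6  (closed form)

print(theta_hat_grid, theta_hat_closed)
```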
MLE
• Given observed values X₁ = x₁, X₂ = x₂, . . . , Xₙ = xₙ,
the likelihood of θ is the function
L(θ) = f(x₁, x₂, . . . , xₙ | θ)
i.e. the probability of observing the given data as a function of θ
• If the Xᵢ are iid (independent and identically distributed),
L(θ) = ∏_{i=1}^{n} f(xᵢ | θ)
• It is then equivalent to maximize the log likelihood:
l(θ) = Σ_{i=1}^{n} log f(xᵢ | θ)
MLE for Multinomial
• Now suppose X can have the values 1,2,…,K
(For example a die has K=6 sides)
• We want to learn the parameters θ1, θ2. …, θk
N1, N2, …, NK - number of times each outcome is
observed
• Likelihood function: L_D(θ) = ∏_{k=1}^{K} θ_k^{N_k}
• MLE: θ̂_k = N_k / Σ_k N_k
  Note: for K = 2 this reduces to L_D(θ) = θ^{N_H} · (1 − θ)^{N_T}
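A final sketch (plain Python, with made-up counts) of the multinomial MLE, estimating die-face probabilities from observed counts:

```python
# Hypothetical counts of each die face after 60 rolls
counts = [9, 11, 10, 8, 12, 10]

total = sum(counts)
theta_hat = [n_k / total for n_k in counts]   # MLE: theta_k = N_k / sum of all N

print(theta_hat)                              # each value near 1/6
```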