2 - Artificial Intelligence Mathematics

The document covers fundamental mathematical concepts related to artificial intelligence, including random variables, expected value, variance, covariance, and differentiation. It explains the differences between discrete and continuous random variables, how to calculate expected value and variance, and introduces concepts like Gaussian distribution and Bayes' Rule. Additionally, it touches on matrix operations such as eigenvalues, eigenvectors, and singular value decomposition.

Artificial Intelligence Mathematics

(Artificial Intelligence-related
Mathematical Concepts)
Some Math & Probability

• Expected value
• Covariance
• Differentiation
• Information Theory

2
Random Variable

• A random variable x takes on a defined set of values with different probabilities.
• For example, if you roll a die, the outcome is random (not fixed) and there are 6 possible outcomes, each of which occurs with probability one-sixth.
• For example, if you poll people about their voting preferences, the percentage of the sample that responds “Yes on Proposition A” is also a random variable (the percentage will be slightly different every time you poll).

• Roughly, probability is how frequently we expect different outcomes to occur if we repeat the experiment over and over.
3
Kristin L. Sainani, HRP 259: Intro to Probability and Statistics
Random variables can be discrete or continuous

• Discrete random variables have a countable number of outcomes
  – Examples: Dead/alive, treatment/placebo, dice, counts, etc.
• Continuous random variables have an infinite continuum of possible values.
  – Examples: blood pressure, weight, the speed of a car, the real numbers from 1 to 6.

4
Kristin L. Sainani, HRP 259: Intro to Probability and Statistics
Expected value

- Weighted average of the possible values of a random variable

Discrete case:

E(X) = Σ_{all x_i} x_i · p(x_i)

Continuous case:

E(X) = ∫_{all x} x · p(x) dx
5
Kristin L. Sainani, HRP 259: Intro to Probability and Statistics
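To make the discrete formula concrete, here is a minimal Python sketch (not from the slides) that computes E(X) for a fair six-sided die; the values and probabilities are illustration choices.

```python
# A minimal sketch of the discrete expected-value formula E(X) = sum_i x_i * p(x_i),
# using a fair six-sided die as an illustrative example.
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

expected_value = sum(x * p for x, p in zip(values, probs))
print(expected_value)  # 3.5
```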
Expected value

• If X is a random integer between 1 and 10, what’s the expected value of X?

6
Kristin L. Sainani, HRP 259: Intro to Probability and Statistics
Expected value

• If X is a random integer between 1 and 10, what’s the expected value of X?

µ = E(X) = Σ_{i=1}^{10} i · (1/10) = (0.1) · Σ_{i=1}^{10} i = (0.1) · 10(10 + 1)/2 = 55(0.1) = 5.5

7
Kristin L. Sainani, HRP 259: Intro to Probability and Statistics
Average vs. Expected Value

• E.g. a random integer between 1 and 10
  – Average (a.k.a. arithmetic mean)
    Given a list of samples (4, 6, 9, 1, 10):
    Average = (4 + 6 + 9 + 1 + 10) / 5 = 6
  – Expected value: 5.5

8
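The distinction above can be checked empirically: the sample average of many draws approaches the expected value. A hedged Python sketch (the sample size and seed are arbitrary choices, not from the slides):

```python
import random

# The arithmetic mean of many random integers from 1 to 10 approaches
# the expected value 5.5 as the number of draws grows.
random.seed(0)
samples = [random.randint(1, 10) for _ in range(100_000)]
average = sum(samples) / len(samples)
print(average)  # close to 5.5
```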
Variance/standard deviation

“The expected squared distance (or deviation) from the mean”
i.e. the spread of a data set around its mean value

σ² = Var(X) = E[(X − µ)²] = Σ_{all x_i} (x_i − µ)² · p(x_i)

9
Kristin L. Sainani, HRP 259: Intro to Probability and Statistics
Variance
Discrete case:

Var(X) = σ² = Σ_{all x_i} (x_i − µ)² · p(x_i)

Continuous case:

Var(X) = σ² = ∫_{−∞}^{∞} (x − µ)² · p(x) dx
10
Kristin L. Sainani, HRP 259: Intro to Probability and Statistics
Var(X) = Σ_{all x_i} (x_i − µ)² · p(x_i) = Σ_{all x_i} x_i² · p(x_i) − µ² = E(x²) − [E(x)]²

Proofs (optional):
E[(x − µ)²] = E(x² − 2µx + µ²)
            = E(x²) − E(2µx) + E(µ²)      (using E(X + Y) = E(X) + E(Y))
            = E(x²) − 2µE(x) + µ²          (using E(c) = c)
            = E(x²) − 2µ·µ + µ²            (using E(x) = µ)
            = E(x²) − µ²
            = E(x²) − [E(x)]²

11
Kristin L. Sainani, HRP 259: Intro to Probability and Statistics
Variance

Find the variance and standard deviation for the number of ships to arrive at the harbor:

x     10  11  12  13  14
P(x)  .4  .2  .2  .1  .1

(the mean is 11.3)

12
Kristin L. Sainani, HRP 259: Intro to Probability and Statistics
Variance and std dev

x²    100  121  144  169  196
P(x)  .4   .2   .2   .1   .1

E(x²) = Σ_{i=1}^{5} x_i² · p(x_i) = (100)(.4) + (121)(.2) + (144)(.2) + (169)(.1) + (196)(.1) = 129.5

Var(x) = E(x²) − [E(x)]² = 129.5 − 11.3² = 1.81

stddev(x) = √1.81 = 1.35

Interpretation: On an average day, we expect 11.3 ships to arrive in the harbor, plus or minus 1.35. This gives you a feel for what would be considered a usual day.
13
Kristin L. Sainani, HRP 259: Intro to Probability and Statistics
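The harbor example can be reproduced directly from the probability table; a short Python sketch mirroring the numbers on this slide (rounding only for display):

```python
# Mean, E(x^2), variance, and standard deviation for the ship-arrival table.
x = [10, 11, 12, 13, 14]
p = [0.4, 0.2, 0.2, 0.1, 0.1]

mean = sum(xi * pi for xi, pi in zip(x, p))        # 11.3
e_x2 = sum(xi**2 * pi for xi, pi in zip(x, p))     # 129.5
variance = e_x2 - mean**2                          # 1.81
std_dev = variance ** 0.5                          # ~1.35

print(round(mean, 2), round(e_x2, 2), round(variance, 2), round(std_dev, 2))
```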
Gaussian (Normal)
• If I look at the height of women in country xx, it will look approximately Gaussian.
• Small random noise errors look Gaussian/Normal.

• Distribution: f(x) = (1 / (σ√(2π))) · exp(−(x − µ)² / (2σ²))

• Mean/variance: E(X) = µ, Var(X) = σ²
14
Kristin L. Sainani, HRP 259: Intro to Probability and Statistics
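A quick way to see the mean and variance parameters at work is to sample from a normal distribution and check the empirical statistics; a sketch assuming NumPy is available (µ = 0 and σ = 1 are arbitrary illustration values):

```python
import numpy as np

# Draw samples from N(mu=0, sigma=1) and recover the parameters empirically.
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)
print(samples.mean())  # close to 0 (the mean mu)
print(samples.std())   # close to 1 (the standard deviation sigma)
```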
Coin tosses

• Number of heads in 100 tosses
  – Flip coins virtually
  – Flip a virtual coin 100 times; count the # of heads
  – Repeat this over and over again a large number of times (e.g. 30,000)
  – Plot the results

15
Coin tosses
• Number of heads in 100 tosses
– 30,000 trials

Mean = 50
Std. dev = 5
Follows a normal distribution
∴ 95% of the time, we get between 40 and 60 heads…

16
Kristin L. Sainani, HRP 259: Intro to Probability and Statistics
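The virtual coin-flip experiment on this slide is easy to simulate; a sketch assuming NumPy is available (30,000 trials of 100 fair flips, as in the slide):

```python
import numpy as np

# Simulate 30,000 trials of 100 fair coin flips and summarize the head counts.
rng = np.random.default_rng(0)
heads = rng.binomial(n=100, p=0.5, size=30_000)

print(heads.mean())                            # ~50
print(heads.std())                             # ~5
print(((heads >= 40) & (heads <= 60)).mean())  # ~0.95, i.e. about 95% within 40-60
```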
Covariance: joint probability

• The covariance measures the strength of the linear relationship between two variables.
  e.g. a positive covariance means both investments' returns tend to move upward or downward in value at the same time.

cov(X, Y) = E[(x − µ_x)(y − µ_y)]

σ_xy = Σ_{i=1}^{N} (x_i − µ_x)(y_i − µ_y) · P(x_i, y_i)

17
Kristin L. Sainani, HRP 259: Intro to Probability and Statistics
Interpreting Covariance

• Covariance between two random variables: cov(X, Y) = E[(x − µ_x)(y − µ_y)]

cov(X, Y) > 0   X and Y are positively correlated

cov(X, Y) < 0   X and Y are inversely correlated

cov(X, Y) = 0   X and Y are uncorrelated (independent variables have zero covariance, but zero covariance does not by itself imply independence)

18
Calculation of covariance

19
Calculation of covariance

20
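As a worked calculation, a hedged Python sketch that computes the covariance of two short "investment return" series both by the definition and with NumPy; the numbers are made up for illustration:

```python
import numpy as np

# Sample covariance of two toy return series; a positive value means the two
# series tend to move up and down together.
x = np.array([0.010, 0.030, -0.020, 0.040, 0.000])
y = np.array([0.020, 0.025, -0.010, 0.050, -0.005])

cov_manual = ((x - x.mean()) * (y - y.mean())).mean()  # population covariance
cov_numpy = np.cov(x, y, bias=True)[0, 1]              # same value from NumPy
print(cov_manual, cov_numpy)
```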
Determinant of a Matrix

e.g. for a 2×2 matrix: det([[a, b], [c, d]]) = ad − bc

e.g. for a 3×3 matrix: det([[a, b, c], [d, e, f], [g, h, i]]) = a(ei − fh) − b(di − fg) + c(dh − eg)
21
https://www.mathsisfun.com/algebra/matrix-determinant.html
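A small sketch of the 2×2 case, assuming NumPy is available (the matrix entries are arbitrary illustration values, not necessarily the linked page's example):

```python
import numpy as np

# Determinant of a 2x2 matrix by the ad - bc formula and with NumPy.
A = np.array([[3.0, 8.0],
              [4.0, 6.0]])

det_manual = A[0, 0] * A[1, 1] - A[0, 1] * A[1, 0]  # 3*6 - 8*4 = -14
det_numpy = np.linalg.det(A)
print(det_manual, det_numpy)  # both -14 (up to floating-point error)
```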
Eigenvalue and Eigenvector

• A v = λ v (A must be a square matrix)
• (A − λI) v = 0, where the eigenvector v is a non-zero vector
• Finding λ (eigenvalue): solve det(A − λI) = 0

A =

22
https://jeongchul.tistory.com/603
Eigenvalue and Eigenvector

A =

When the eigenvalue is 3

When the eigenvalue is 1

23
https://jeongchul.tistory.com/603
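A numerical check of A v = λ v, assuming NumPy is available. The matrix [[2, 1], [1, 2]] is an assumed example (not necessarily the matrix from the slide), chosen because its eigenvalues are 3 and 1, matching the values mentioned above:

```python
import numpy as np

# Eigenvalues and eigenvectors of a small symmetric matrix.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)   # 3 and 1 (order may vary)
print(eigenvectors)  # columns are the corresponding eigenvectors

# Verify A v = lambda v for the first eigenpair.
v = eigenvectors[:, 0]
print(np.allclose(A @ v, eigenvalues[0] * v))  # True
```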
Singular Value Decomposition

• Problem:
– #1: Find concepts in data
– #2: Reduce dimensionality

24
Recommender Systems, Lior Rokach
SVD - Definition

A[n × m] = U[n × r] Λ[r × r] (V[m × r])ᵀ

• A: n × m matrix (e.g., n documents, m terms)
• U: n × r matrix (n documents, r concepts)
• Λ: r × r diagonal matrix (strength of each ‘concept’) (r: rank of the matrix)
• V: m × r matrix (m terms, r concepts)

25
Manning and Raghavan, 2004
SVD - Example

• A = U Λ Vᵀ - example
  (columns of A: data, inf., retrieval, brain, lung; rows 1 to 4 are CS documents, rows 5 to 7 are MD documents)

  A =
  1 1 1 0 0
  2 2 2 0 0
  1 1 1 0 0
  5 5 5 0 0
  0 0 0 2 2
  0 0 0 3 3
  0 0 0 1 1

  U =
  0.18 0
  0.36 0
  0.18 0
  0.90 0
  0    0.53
  0    0.80
  0    0.27

  Λ =
  9.64 0
  0    5.29

  Vᵀ =
  0.58 0.58 0.58 0    0
  0    0    0    0.71 0.71

26
SVD - Example

• A = U Λ Vᵀ - example (same matrices as on the previous slide):
  U is the doc-to-concept similarity matrix; its first column corresponds to the CS-concept and its second column to the MD-concept.

27
SVD - Example

• A = U Λ Vᵀ - example (same matrices as on the previous slide):
  The diagonal entries of Λ give the ‘strength’ of each concept; 9.64 is the ‘strength’ of the CS-concept.

28
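The decomposition on these slides can be verified numerically, assuming NumPy is available; the document-term matrix below is the one from the example:

```python
import numpy as np

# SVD of the 7x5 document-term matrix; the top two singular values are the
# 'strengths' of the CS and MD concepts, and a rank-2 reconstruction recovers A.
A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.round(s[:2], 2))  # [9.64 5.29]

A_rank2 = U[:, :2] @ np.diag(s[:2]) @ Vt[:2, :]
print(np.allclose(A, A_rank2))  # True: two concepts describe the data exactly
```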
Bayes Rule

• Which tells us:
  – how often A happens given that B happens, written P(A|B)
• When we know:
  – how often B happens given that A happens, written P(B|A)
  – how likely A is on its own, written P(A)
  – how likely B is on its own, written P(B)
• E.g.
  – P(Fire|Smoke) means how often there is fire when we can see smoke
  – P(Smoke|Fire) means how often we can see smoke when there is fire

29
Bayes’ Rule

P(A|B) = P(B|A) · P(A) / P(B)

30
https://hyeongminlee.github.io/post/bnn001_bayes_rule/
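A minimal numeric sketch of the fire/smoke example. The three input probabilities are made-up illustration values (they do not come from the slides); only the formula itself is Bayes' Rule:

```python
# P(Fire|Smoke) = P(Smoke|Fire) * P(Fire) / P(Smoke), with assumed inputs.
p_fire = 0.01              # P(Fire): how likely fire is on its own
p_smoke = 0.10             # P(Smoke): how likely smoke is on its own
p_smoke_given_fire = 0.90  # P(Smoke|Fire): how often we see smoke when there is fire

p_fire_given_smoke = p_smoke_given_fire * p_fire / p_smoke
print(p_fire_given_smoke)  # 0.09
```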
Differentiation

Differentiation is all about measuring change.

E.g. measuring change in a linear function:

y = a + bx
a = intercept
b = constant slope, i.e. the impact of a unit change in x on the level of y

b = ∆y/∆x = (y₂ − y₁)/(x₂ − x₁)
31
Example: A firm’s cost function is Y = X²

[Plot of y = x² for X from 0 to 6]

X   ∆X   Y    ∆Y
0        0
1   +1   1    +1
2   +1   4    +3
3   +1   9    +5
4   +1   16   +7

Y = X²
Y + ∆Y = (X + ∆X)²
Y + ∆Y = X² + 2X·∆X + ∆X²
∆Y = X² + 2X·∆X + ∆X² − Y
since Y = X²  ⇒  ∆Y = 2X·∆X + ∆X²
∆Y/∆X = 2X + ∆X
34
The slope depends on X and ∆X
The slope of the graph of a function is called
the derivative of the function

f′(x) = dy/dx = lim_{∆x→0} ∆y/∆x

• The process of differentiation involves letting the change in x become arbitrarily small, i.e. letting ∆x → 0
  e.g. if ∆Y/∆X = 2X + ∆X and ∆X → 0
  ⇒ dY/dX = 2X in the limit as ∆X → 0
35
Differentiation

• Examples:
  f(x) = 3   ⇒ f′(x) = 0
  f(x) = 4x  ⇒ f′(x) = 4
  f(x) = 4x² ⇒ f′(x) = 8x

36
Differentiation

• Product Rule
  If y = u·v where u and v are functions of x (u = f(x) and v = g(x)), then

  dy/dx = u·(dv/dx) + v·(du/dx)

• Examples
  i) y = (x + 2)(ax² + bx)
     dy/dx = (x + 2)(2ax + b) + (ax² + bx)
  ii) y = (4x³ − 3x + 2)(2x² + 4x)
     dy/dx = (4x³ − 3x + 2)(4x + 4) + (2x² + 4x)(12x² − 3)
37
Differentiation
• The Chain Rule
  If y is a function of v, and v is a function of x, then y is a function of x and

  dy/dx = (dy/dv) · (dv/dx)

• Example:
  i) y = (ax² + bx)^½
     let v = ax² + bx, so y = v^½
     dy/dx = ½ (ax² + bx)^(−½) · (2ax + b)
38
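Both worked examples can be checked symbolically; a sketch assuming the SymPy library is available (a and b are treated as constant symbols):

```python
import sympy as sp

# Symbolic check of the product-rule and chain-rule examples above.
x, a, b = sp.symbols('x a b')

y1 = (x + 2) * (a * x**2 + b * x)   # product rule example
print(sp.expand(sp.diff(y1, x)))    # matches (x + 2)(2ax + b) + (ax^2 + bx), expanded

y2 = sp.sqrt(a * x**2 + b * x)      # chain rule example
print(sp.simplify(sp.diff(y2, x)))  # equivalent to (2ax + b) / (2*sqrt(ax^2 + bx))
```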
Sigmoid Function
• Sigmoid Function: σ(x) = 1 / (1 + e^(−x))
• It is built from an exponential function whose base is the natural constant e.
• Natural constant e:
  • Called ‘the base of the natural logarithm’ and ‘Euler’s number’
  • An irrational number that is important in mathematics, like pi (π)
  • Its value is approximately 2.718281828…, defined as the limit e = lim_{n→∞} (1 + 1/n)ⁿ
• When this exponential function of e is placed in the denominator as above, the result is the sigmoid function.

39
모두의 딥러닝
Sigmoid Function

40
모두의 딥러닝
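A direct implementation of the function described above; a minimal sketch in plain Python:

```python
import math

# Sigmoid function: sigma(x) = 1 / (1 + e^(-x)).
def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0.0))   # 0.5
print(sigmoid(6.0))   # close to 1
print(sigmoid(-6.0))  # close to 0
```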
Information Theory

• Key idea: unlikely events are more informative than frequent events.

41
https://ratsgo.github.io/statistics/2017/09/22/information/
Shannon’s Entropy

• The amount of information (self-information) of an event in which the random variable X takes the value x: I(x) = −log₂ P(x)

  – Example: a coin toss that results in heads: −log₂ 0.5 = 1
  – Example: rolling a die and getting a 1: −log₂ (1/6) ≈ 2.585
  – If the base is 2, the unit of information is called the shannon or bit.

• Shannon’s Entropy: the expected value of the information over all events, H(X) = −Σ_x P(x) log₂ P(x)

• If you toss a coin that has the same chance of getting heads and tails (a fair coin), the entropy is 1 bit.

42
https://ratsgo.github.io/statistics/2017/09/22/information/
Entropy

• The x-axis is the fairness of the coin, i.e. the probability of getting heads; the entropy on the y-axis peaks at 1 bit when the coin is fair.
43
https://ratsgo.github.io/statistics/2017/09/22/information/
Cross Entropy

• H(P, Q) = −Σ_x P(x) log Q(x)
• H(P, Q) is similar to H(P), the entropy of P, but the probability multiplied outside the logarithm is P(x) while the distribution inside the logarithm is Q(x). It is still an entropy, but one in which the two probability distributions are crossed, hence the name cross entropy.

44
https://ratsgo.github.io/statistics/2017/09/22/information/
KL Divergence
• Measures the difference between two probability distributions: D_KL(P‖Q) = Σ_x P(x) log (P(x) / Q(x)) = H(P, Q) − H(P)
• The difference between the distribution P(x) of the actual data and the distribution Q(x) estimated by the model can be obtained using KLD.

45
https://ratsgo.github.io/statistics/2017/09/22/information/
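Entropy, cross entropy, and KL divergence can all be computed in a few lines; a sketch with two made-up three-outcome distributions, where P plays the role of the data distribution and Q the model's estimate:

```python
import math

# H(P), H(P, Q), and D_KL(P || Q) for two toy distributions (base-2 logarithms).
P = [0.50, 0.25, 0.25]
Q = [0.40, 0.40, 0.20]

entropy = -sum(p * math.log2(p) for p in P)                    # H(P)
cross_entropy = -sum(p * math.log2(q) for p, q in zip(P, Q))   # H(P, Q)
kl_divergence = sum(p * math.log2(p / q) for p, q in zip(P, Q))

print(entropy, cross_entropy, kl_divergence)
print(math.isclose(kl_divergence, cross_entropy - entropy))    # True: KL = H(P,Q) - H(P)
```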
Softmax Function
• Given a k-dimensional vector whose i-th element is z_i, let p_i be the probability that the i-th class is correct. The softmax function defines p_i as

  p_i = exp(z_i) / Σ_{j=1}^{k} exp(z_j)

• If it is three-dimensional: softmax(z₁, z₂, z₃) = (e^{z₁}, e^{z₂}, e^{z₃}) / (e^{z₁} + e^{z₂} + e^{z₃})

46
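A plain-Python sketch of the formula above; the maximum is subtracted before exponentiating, a common trick for numerical stability (the input values are arbitrary):

```python
import math

# softmax: p_i = exp(z_i) / sum_j exp(z_j)
def softmax(z):
    m = max(z)                              # subtract the max for numerical stability
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([1.0, 2.0, 3.0]))  # three probabilities that sum to 1, largest for z = 3
```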
Maximum Likelihood Estimation
Maximum Likelihood Estimation (MLE)

• Maximum Likelihood Estimation (MLE) is a method of estimating the parameters of a model. It is one of the most widely used estimation methods.
• The method of maximum likelihood selects the set of values of the model parameters that maximizes the likelihood function, i.e. maximizes the "agreement" of the selected model with the observed data:
  – the parameter values for which the observed sample is most likely to have been generated
Example: Flip a Thumbtack

• When tossed, it can land in one of two positions: Head (H) or Tail (T)
• We denote by θ the (unknown) probability P(H).
• Estimation task:
  – Given a sequence of toss samples x[1], x[2], …, x[M], we want to estimate the probabilities P(H) = θ and P(T) = 1 − θ
The task is to find the vector of parameters (θ in this case) that has generated the given data. This parameter vector can then be used to predict future data.
Likelihood Function

• How good is a particular θ?
  It depends on how likely it is to generate the observed data:

  L_D(θ) = P(D | θ) = ∏_m P(x[m] | θ)

• The likelihood for the sequence H, T, T, H, H is

  L_D(θ) = θ · (1 − θ) · (1 − θ) · θ · θ

[Plot of L(θ) for θ from 0 to 1]
MLE

• To compute the likelihood in the thumbtack example we only require N_H and N_T (the number of heads and the number of tails):

  L_D(θ) = θ^{N_H} · (1 − θ)^{N_T}
MLE

MLE Principle: Choose the parameters that maximize the likelihood function
• This is one of the most commonly used estimators
• Intuitively appealing
• One usually maximizes the log-likelihood function, defined as l_D(θ) = ln L_D(θ)

  L_D(θ) = θ^{N_H} · (1 − θ)^{N_T}
  l_D(θ) = N_H log θ + N_T log (1 − θ)

Since the logarithm is an increasing function, maximizing the log-likelihood is equivalent to maximizing the likelihood.
Example: MLE in Binomial Data
l_D(θ) = N_H log θ + N_T log (1 − θ)

Taking the derivative and equating it to 0, we get

  N_H / θ = N_T / (1 − θ)  ⇒  θ̂ = N_H / (N_H + N_T)

Example:
  (N_H, N_T) = (3, 2)
  MLE estimate is 3/5 = 0.6

  L_D(θ) = θ^{N_H} · (1 − θ)^{N_T}

[Plot of L(θ) for θ from 0 to 1, peaking at θ = 0.6]
53
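The closed-form answer can be confirmed numerically by scanning the log-likelihood over a grid of θ values; a minimal sketch for (N_H, N_T) = (3, 2):

```python
import math

# Scan l(theta) = N_H*log(theta) + N_T*log(1 - theta) and find its maximizer.
N_H, N_T = 3, 2

def log_likelihood(theta):
    return N_H * math.log(theta) + N_T * math.log(1 - theta)

thetas = [i / 1000 for i in range(1, 1000)]   # grid over (0, 1)
best = max(thetas, key=log_likelihood)

print(best)               # 0.6, matching the closed form
print(N_H / (N_H + N_T))  # 0.6
```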
MLE

• Given observed values X₁ = x₁, X₂ = x₂, …, Xₙ = xₙ, the likelihood of θ is the function

  L(θ) = f(x₁, x₂, …, xₙ | θ)

  i.e. the probability of observing the given data as a function of θ.

• If the Xᵢ are iid (independent and identically distributed):

  L(θ) = ∏_{i=1}^{n} f(xᵢ | θ)

• It is equivalent to maximize the log likelihood:

  l(θ) = Σ_{i=1}^{n} log f(xᵢ | θ)
MLE for Multinomial

• Now suppose X can take the values 1, 2, …, K (for example, a die has K = 6 sides)
• We want to learn the parameters θ₁, θ₂, …, θ_K
  N₁, N₂, …, N_K - the number of times each outcome is observed

• Likelihood function: L_D(θ) = ∏_{k=1}^{K} θ_k^{N_k}

• MLE: θ̂_k = N_k / Σ_k N_k

  Note: for K = 2 this reduces to the binomial case, L_D(θ) = θ^{N_H} · (1 − θ)^{N_T}
