
Computational Thinking &

Artificial Intelligence
(5th Week, Probability)

Kyoungwon Seo
Dept. of Applied Artificial Intelligence
[email protected]
Lecture

• Understanding Probability Theory


• Understanding Information Theory
Lecture program
• 1st Lecture: Introduction
• 2nd Lecture: Matrix
• 3rd Lecture: Vector and Matrix
• 4th Lecture: Perceptron
• 5th Lecture: Probability
• 6th Lecture: Learning Techniques
• 7th Lecture: Deep Learning
• 8th Lecture: Deep Learning Library & Mid-term Exam
• 9th Lecture: Cat & Dog Classification
• 10th Lecture: Image Generation Model
• 11th Lecture: Practice: Image Generation Model
• 12th Lecture: Language Generation Model
• 13th Lecture: Practice: Language Generation Model
• 14th Lecture: AI and Ethics
• 15th Lecture: Make-Up Lecture
• 16th Lecture: Final Term Exam
Probability Theory
Probability theory
• What is the role of probability theory in AI?
▪ A tool for making better decisions under uncertainty by estimating parameters

(Figure: AI systems that must handle uncertainty: navigation, person identification, AI speakers)
Two perspectives in probability theory
• Frequentist probability vs. Bayesian probability
▪ Frequentist: estimates probabilities from observed frequencies, typically via maximum likelihood estimation (MLE)

▪ Law of large numbers: as the same experiment is performed a large number of times, the average of
the results converges to the expected value (see the sketch below)
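A minimal sketch (not from the slides) of the law of large numbers for a simulated coin: as the number of tosses grows, the observed frequency of heads approaches the true probability. The function name and the checkpoints are illustrative choices.

```python
import random

def running_head_frequency(p_head=0.5, n_tosses=100_000, seed=0):
    """Toss a simulated coin and print how the observed frequency of heads
    converges to the true probability as the sample size grows."""
    rng = random.Random(seed)
    heads = 0
    for i in range(1, n_tosses + 1):
        heads += rng.random() < p_head
        if i in (10, 100, 1_000, 10_000, 100_000):
            print(f"{i:>7} tosses: observed frequency = {heads / i:.4f}")

running_head_frequency()
# The printed frequencies drift toward 0.5, the frequentist estimate of p(Head).
```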
Two perspectives in probability theory
• Frequentist probability vs. Bayesian probability
▪ Joseph Jagger (born 1830) discovered a bias (9 numbers came up unusually often) in a roulette wheel
(numbers 0–36) at the Beaux-Arts Casino in Monte Carlo, Monaco, where a $1 stake could win $35

▪ Original expected value = −$1 × 36/37 + $35 × 1/37 ≈ −2.7 cents on average

▪ Biased expected value = −$1 × 35.8/37 + $35 × 1.2/37 ≈ +16.8 cents on average (bias factor = 1.2; checked in the sketch below)

Ref.: Biography of Joseph Jagger
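A quick sketch re-computing the two expected values from the slide's numbers; the function and its arguments are illustrative, with the bias expressed as an effective weight on the winning number.

```python
def expected_value(win_weight, total=37, payout=35, stake=1):
    """Expected profit per $1 bet on a single number; win_weight / total is the
    effective probability of hitting that number."""
    p_win = win_weight / total
    return -stake * (1 - p_win) + payout * p_win

print(f"fair wheel:   {expected_value(1.0) * 100:+.1f} cents per $1")   # ≈ -2.7
print(f"biased wheel: {expected_value(1.2) * 100:+.1f} cents per $1")   # ≈ +16.8
```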


Two perspectives in probability theory
• Frequentist probability vs. Bayesian probability
▪ Frequentist probability is difficult to apply to events that cannot be repeated many times
(e.g., what does it mean when a doctor says a patient has a 40% chance of developing dementia?)

▪ Bayesian: calculate posterior probabilities from prior probabilities

Event A: people who show certain symptoms
Event B: people who actually have dementia
(Venn diagram: a = the part of A outside B, b = the overlap A ∩ B)

▪ Conditional probability: P(B|A) = P(A ∩ B) / P(A) = b / (a + b), where P(A) > 0
Two perspectives in probability theory
• Frequentist probability vs. Bayesian probability
▪ Conditional probability question: 70% of all used cars have air conditioning and 40% have a
CD player. If 90% of all used cars have at least one of the two, what is the probability that a used
car without air conditioning also has no CD player?

▪ Answer:

P(A) = probability of no air conditioning = 1 − 0.7 = 0.3

P(B) = probability of no CD player = 1 − 0.4 = 0.6

P(A ∩ B) = probability that neither air conditioning nor a CD player is present = 1 − 0.9 = 0.1

P(B|A) = P(B ∩ A) / P(A) = 0.1 / 0.3 = 1/3
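The same answer checked in a few lines of code; the variable names are only illustrative.

```python
# Quantities given in the question.
p_ac, p_cd, p_at_least_one = 0.7, 0.4, 0.9

p_no_ac = 1 - p_ac                          # P(A) = 0.3
p_no_cd = 1 - p_cd                          # P(B) = 0.6
p_neither = 1 - p_at_least_one              # P(A ∩ B) = 0.1, complement of "at least one"

p_no_cd_given_no_ac = p_neither / p_no_ac   # P(B | A) by the conditional probability formula
print(round(p_no_cd_given_no_ac, 4))        # 0.3333 ≈ 1/3
```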
Bayesian probability
• Bayes’ theorem
▪ A theorem that expresses the relationship between the prior and posterior probabilities

P(A|B) = P(B|A) P(A) / P(B)

Since P(B|A) = P(A ∩ B) / P(A), we have P(B|A) P(A) = P(A ∩ B) = P(A|B) P(B)

▪ P(A), prior probability:

the initial probability, determined from the information currently available

▪ P(A|B), posterior probability:

the prior probability revised in light of the added information
Bayesian probability
• Example
▪ Mr. O wants to be screened for cancer using a test.
People in their 70s and 80s have a 0.6% chance of having cancer.
If a person has cancer, the probability that the test comes out positive is 90%.
Even if a person does not have cancer, the probability that the test is positive is 5%.
When the test is positive, how should Mr. O's result be judged?

▪ Answer:
P(c) = 0.006, P(h) = 0.994

P(p|c) = P(p ∩ c) / P(c) = 0.9, P(p|h) = P(p ∩ h) / P(h) = 0.05

P(c|p) = P(c ∩ p) / P(p) = P(p|c) P(c) / (P(p|c) P(c) + P(p|h) P(h))
       = (0.9 × 0.006) / (0.9 × 0.006 + 0.05 × 0.994) ≈ 0.098 (9.8%)
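The same Bayes' theorem calculation as a sketch in code; the function name and keyword arguments are illustrative.

```python
def posterior_cancer(prior, sensitivity, false_positive_rate):
    """P(cancer | positive test) via Bayes' theorem with two hypotheses (cancer / healthy)."""
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)  # total probability P(p)
    return sensitivity * prior / p_positive

print(round(posterior_cancer(prior=0.006, sensitivity=0.9, false_positive_rate=0.05), 3))  # 0.098
```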
Bayesian probability
• Unfair conditions
▪ h: hidden variable (what we want to predict)

▪ x: observed data (training data)

P(h|x) = P(x|h) P(h) / P(x)

• Coin toss (fair or unfair coin)

▪ For a fair coin, p(Head) = 0.5

▪ If the previous n − 1 outcomes were all heads and you do not know whether the coin is fair,

▪ then the posterior gives p(Head | observations) > 0.5 (see the sketch below)
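A hedged sketch of this argument with just two hypotheses, a fair coin and a biased one; the biased coin's 0.9 head probability and the 50/50 prior over the two hypotheses are assumptions made purely for illustration.

```python
def p_head_after_heads(n_heads, p_biased=0.9, prior_biased=0.5):
    """Posterior predictive P(next toss is Head) after observing n_heads heads in a row,
    when the coin is either fair (0.5) or biased (p_biased) with the given prior."""
    like_fair = 0.5 ** n_heads          # P(observations | fair)
    like_biased = p_biased ** n_heads   # P(observations | biased)
    # Bayes' theorem: posterior probability that the coin is the biased one.
    post_biased = (like_biased * prior_biased) / (
        like_biased * prior_biased + like_fair * (1 - prior_biased))
    return post_biased * p_biased + (1 - post_biased) * 0.5

for n in (1, 5, 10, 20):
    print(n, round(p_head_after_heads(n), 3))
# Each additional run of heads pushes P(Head | observations) further above 0.5.
```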


Bayesian deep learning
• Example
▪ Whereas frequentist (standard) deep learning trains a model with fixed point-estimate parameters,
Bayesian deep learning reflects uncertainty by treating parameters as distributions.

(Left) Parameters are represented by single, fixed values. (Right) Parameters are represented by distributions.
Information Theory
Information theory
• What is the role of information theory in AI?
▪ A branch of applied mathematics for quantifying the amount of information

(Photo: Claude E. Shannon)
Information theory
• What is the role of information theory in AI?
▪ It describes the amount of information in terms of the probability of an event,
which represents the degree of uncertainty

▪ Uncertainty is highest when a coin toss has a 50% chance of heads and a 50% chance of tails

▪ Uncertainty decreases when heads has a 20% chance and tails an 80% chance

▪ Uncertainty disappears when tails comes up 100% of the time

▪ How can we quantify this?


Information theory – self-information
• Claude Shannon's definition of self-information:
▪ An event with probability 100% is perfectly unsurprising and yields no information
▪ The less probable an event is, the more surprising it is and the more information it yields
▪ If two independent events are measured separately, the total amount of information is the sum
of the information of the individual events

• For an event x with probability p(x), the self-information is defined by

I(x) = −log p(x)

If p(x) = 1 → I(x) = −log 1 = 0
If p(x) = 1/n → I(x) = −log(1/n) = −log n⁻¹ = log n
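A minimal sketch of self-information in code; the slides' numbers use the natural logarithm (information in nats), which is assumed here.

```python
import math

def self_information(p):
    """I(x) = -log p(x), in nats (natural logarithm)."""
    return -math.log(p)

print(round(self_information(1.0), 3))   # 0.0   : a certain event carries no information
print(round(self_information(0.5), 3))   # 0.693 : -log(1/2) = log 2
# Independent events: information adds, because the log turns products into sums.
print(round(self_information(0.5 * 0.5), 3), round(2 * self_information(0.5), 3))  # 1.386 1.386
```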
Shannon entropy
• Expected value of the information over all events

H(X) = −Σᵢ P(xᵢ) log P(xᵢ)

• Coin with 50% heads and 50% tails

H(X) = −Σ P(x) log P(x)
     = −(0.5 × log 0.5 + 0.5 × log 0.5)
     = −log 0.5
     = log 2 ≈ 0.693 (natural logarithm)
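A sketch of the entropy formula in code, again using natural logarithms so that the fair-coin value matches the 0.693 above.

```python
import math

def entropy(probs):
    """H(X) = -sum_i P(x_i) log P(x_i) = sum_i P(x_i) log(1 / P(x_i)), in nats;
    terms with P(x) = 0 contribute nothing."""
    return sum(p * math.log(1 / p) for p in probs if p > 0)

print(round(entropy([0.5, 0.5]), 3))   # 0.693 : fair coin, maximum uncertainty
print(round(entropy([0.2, 0.8]), 3))   # 0.500 : less uncertainty
print(round(entropy([0.0, 1.0]), 3))   # 0.0   : no uncertainty at all
```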
Shannon entropy
• Expected value of the information over all events
▪ As the uncertainty of the information disappears, the entropy decreases

Ref.: Pattern Recognition and Machine Learning, C. M. Bishop

Ref.: Wikipedia, Entropy (information theory)


Shannon entropy
• Example
▪ In a match between soccer team A and soccer team B, there is a 99% chance that A will win

A wins: −log P(x) = −log 0.99 ≈ 0.01

B wins: −log P(x) = −log 0.01 ≈ 4.6

▪ The event that B wins is far more surprising (about 460 times more information) than the event that A wins

H(X) = −Σ P(x) log P(x)
     = −(0.99 × log 0.99 + 0.01 × log 0.01)
     ≈ 0.056
Shannon entropy
• Average amount of information in a match between Team A and Team B
(99% chance that A will win)

H(X) = −Σ P(x) log P(x)
     = −(0.99 × log 0.99 + 0.01 × log 0.01)
     ≈ 0.056

• When Team A and Team C play a match, it is impossible to predict who will win
(50% chance that A will win)

H(X) = −Σ P(x) log P(x)
     = −(0.5 × log 0.5 + 0.5 × log 0.5)
     = log 2 ≈ 0.693

• Average degree of surprise = degree of uncertainty


Others (entropy)
• KL divergence (Kullback-Leibler divergence), also called relative entropy
▪ Measures how different my predicted distribution is from the actual distribution

▪ Q(x) is the prediction, P(x) is the actual probability

KL divergence = E_P[−log Q(x)] − E_P[−log P(x)] = cross-entropy(P, Q) − entropy(P)
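A hedged sketch computing KL divergence directly from the definition above, with expectations taken under the actual distribution P; the two example distributions are arbitrary.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) = E_P[-log Q(x)] - E_P[-log P(x)] = sum_x P(x) log(P(x) / Q(x))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

actual = [0.99, 0.01]    # P(x): Team A almost always wins
predicted = [0.5, 0.5]   # Q(x): a model that treats the match as a coin flip
print(round(kl_divergence(actual, predicted), 3))  # > 0: the prediction differs from reality
print(round(kl_divergence(actual, actual), 3))     # 0.0: identical distributions
```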


Others (entropy)
• Cross entropy (example)

E_P[−log Q(x)] = −Σ P(x) log Q(x)

= −(P(Team A wins) × log Q(Team A will win) + P(Team B wins) × log Q(Team B will win))

▪ Read images and classify them into three classes: dog / cat / fish

▪ Q(x), the predicted probabilities, are 0.2 for dog, 0.3 for cat, and 0.5 for fish

▪ P(x), the actual distribution, is 0 for dog, 0 for cat, and 1 for fish

E_P[−log Q(x)] = −Σ P(x) log Q(x)
= −(P(actual dog) × log Q(predicted dog) + P(actual cat) × log Q(predicted cat) + P(actual fish) × log Q(predicted fish))
= −(0 × log 0.2 + 0 × log 0.3 + 1 × log 0.5)
= −log 0.5 ≈ 0.693
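The dog/cat/fish calculation above as a sketch in code; with a one-hot actual distribution, cross entropy reduces to −log of the probability assigned to the true class.

```python
import math

def cross_entropy(p_actual, q_predicted):
    """E_P[-log Q(x)] = -sum_x P(x) log Q(x), in nats."""
    return -sum(p * math.log(q) for p, q in zip(p_actual, q_predicted) if p > 0)

p = [0.0, 0.0, 1.0]   # actual distribution: the image is a fish
q = [0.2, 0.3, 0.5]   # predicted probabilities for dog / cat / fish
print(round(cross_entropy(p, q), 3))   # 0.693 = -log 0.5
```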
Entropy down = uncertainty down
• Cross entropy can be used as a loss function

• Reducing the entropy of an AI model

▪ means reducing the uncertainty of the AI model

Situation                                                 Cross Entropy
Correct prediction + high confidence                      Low (good)
Correct prediction + low confidence                       Slightly high
Wrong prediction + low confidence                         Moderate
Wrong prediction + high confidence (on the wrong class)   Very high (worst)
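A sketch reproducing the table's pattern for a 3-class problem whose true class is the first one; the predicted distributions are made up for illustration, and only the ordering of the losses matters.

```python
import math

def cross_entropy(p_actual, q_predicted):
    return -sum(p * math.log(q) for p, q in zip(p_actual, q_predicted) if p > 0)

true_dist = [1.0, 0.0, 0.0]   # the correct class is class 0
cases = {
    "correct + high confidence": [0.90, 0.05, 0.05],
    "correct + low confidence":  [0.40, 0.30, 0.30],
    "wrong + low confidence":    [0.30, 0.40, 0.30],
    "wrong + high confidence":   [0.05, 0.90, 0.05],
}
for name, q in cases.items():
    print(f"{name:27s} loss = {cross_entropy(true_dist, q):.2f}")
# The losses increase from top to bottom, matching the table above.
```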
Entropy Coding
• Suppose sentences are generated from an alphabet of 4 symbols, A, B, C, and D, according to the
probability distribution given by

          A     B     C      D
P(x)     0.5   0.25  0.125  0.125

• We want to compress sentences using a code of 0s and 1s; if so, how many bits are
needed per symbol?

          A     B     C      D
Uniform   00    01    10     11
Entropy   0     10    110    111

▪ Entropy coding requires 1.75 bits per symbol on average, while traditional (uniform) coding requires 2 bits


Entropy Coding

            A     B     C      D
P(x)       0.5   0.25  0.125  0.125
codewords  0     10    110    111

• Entropy coding requires 1.75 bits per symbol on average, while traditional (uniform) coding requires 2 bits
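A sketch checking the 1.75-bit figure: the average codeword length of this prefix code equals the base-2 entropy of the distribution (bits rather than nats this time).

```python
import math

p = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
codewords = {"A": "0", "B": "10", "C": "110", "D": "111"}

avg_length = sum(p[s] * len(codewords[s]) for s in p)          # expected bits per symbol
entropy_bits = -sum(px * math.log2(px) for px in p.values())   # H(X) in bits

print(avg_length, entropy_bits)   # 1.75 1.75 -> beats the 2-bit uniform code
```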


Entropy coding: alphabet

▪ 26 letters of the alphabet: 5 bits vs. 12 bits

▪ Communication data is reduced to about 1/3

Computational Thinking &
Artificial Intelligence
Thanks for your attention.

Kyoungwon Seo
Dept. of Applied Artificial Intelligence
[email protected]
