
EE2211 Introduction to Machine Learning
Lecture 3
Semester 1, 2020/2021

Li Haizhou ([email protected])
Electrical and Computer Engineering Department
National University of Singapore

Acknowledgement:
EE2211 development team
(Thomas, Kar-Ann, Chen Khong, Helen, Robby and Haizhou)
© Copyright EE, NUS. All Rights Reserved.


Course Contents
• Introduction and Preliminaries (Haizhou)
– Introduction
– Data Engineering
– Introduction to Probability and Statistics
• Fundamental Machine Learning Algorithms I (Kar-Ann / Helen)
– Systems of linear equations
– Least squares, Linear regression
– Ridge regression, Polynomial regression
• Fundamental Machine Learning Algorithms II (Thomas)
– Over-fitting, bias/variance trade-off
– Optimization, Gradient descent
– Decision Trees, Random Forest
• Performance and More Algorithms (Haizhou)
– Performance Issues
– K-means Clustering
– Neural Networks

Introduction to Probability and Statistics
Module I Contents
• What is Machine Learning and Types of Learning
• How Supervised Learning works
• Regression and Classification Tasks
• Induction versus Deduction Reasoning
• Types of data
• Data wrangling and cleaning
• Data integrity and visualization
• Causality and Simpson’s paradox
• Random variable, Bayes’ rule
• Parameter estimation
• Parametric vs Non-Parametric machine learning

Causality
What is statistical causality or causation?
• In statistics, causation means that one thing will cause
the other, which is why it is also referred to as cause and
effect.
• The gold standard for causal data analysis is to combine
specific experimental designs such as randomized
studies with standard statistical analysis techniques.

• Causality creep is the idea that causal language is often used to describe inferential or predictive analyses.

Correlation
• In statistics, correlation is any statistical relationship,
whether causal or not, between two random variables.
• Correlations are useful because they can indicate a
predictive relationship that can be exploited in practice.

https://www.geo.fu-berlin.de/en/v/soga/Basics-of-statistics/Descriptive-Statistics/Measures-of-Relation-Between-Variables/Correlation/index.html

Correlation does not imply causation
• Most data analyses involve inference or prediction.
• Unless a randomized study is performed, it is difficult to infer why
there is a relationship between two variables.
• Some great examples of correlations that can be calculated but are
clearly not causally related appear at http://tylervigen.com/
(See figure below).

Example
– Decades of data show a clear causal relationship between smoking
and cancer.
– If you smoke, it is a sure thing that your risk of cancer will increase.
– But it is not a sure thing that you will get cancer.
– The causal effect is real, but it is an effect on your average risk.

[Figure: cartoon on causation, from https://larspsyll.files.wordpress.com/2013/07/causation.jpg]

Caution
• Particular caution should be used when applying words
such as “cause” and “effect” when performing inferential
analysis.
• Causal language applied to even clearly labeled
inferential analyses may lead to misinterpretation - a
phenomenon called causation creep.

Simpson’s paradox
Simpson's paradox is a phenomenon in probability and statistics, in
which a trend appears in several different groups of data but
disappears or reverses when these groups are combined.

https://en.wikipedia.org/wiki/Simpson%27s_paradox
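To make the reversal concrete, here is a minimal Python sketch with illustrative numbers (modelled on the classic two-treatment example; the figures are chosen for illustration, not taken from the slides):

```python
# Illustrative numbers: within each severity group treatment A has the higher
# success rate, yet treatment B has the higher rate when groups are combined.
groups = {
    # group: (A successes, A trials, B successes, B trials)
    "mild":   (81, 87, 234, 270),
    "severe": (192, 263, 55, 80),
}

totals = {"A": [0, 0], "B": [0, 0]}
for name, (sa, na, sb, nb) in groups.items():
    print(f"{name:7s}  A: {sa/na:.2f}   B: {sb/nb:.2f}")   # A wins in both groups
    totals["A"][0] += sa; totals["A"][1] += na
    totals["B"][0] += sb; totals["B"][1] += nb

for t, (s, n) in totals.items():
    print(f"overall  {t}: {s/n:.2f}")                       # but B wins overall
```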

Example

Ref: Gardner, Martin (March 1976). "Mathematical Games: On the fabric of inductive logic, and some probability paradoxes" (PDF). Scientific American, 234(3).

Probability

• We describe a random experiment by describing its procedure and the observations of its outcomes.

• Outcomes are mutually exclusive in the sense that only one outcome occurs in a specific trial of the random experiment. This also means an outcome is not decomposable. All unique outcomes form a sample space.

• A subset of the sample space S, denoted as A, is an event in the random experiment, A ⊆ S, that is meaningful in the context of the experiment.
Axioms of Probability

Assuming events A ⊆ S and B ⊆ S, the probabilities of events related to A and B must satisfy:

1. Pr(A) ≥ 0
2. Pr(S) = 1
3. If A ∩ B = ∅, then Pr(A ∪ B) = Pr(A) + Pr(B)
   *otherwise, Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B)
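As a quick sanity check of axiom 3 and its general form, a minimal Python sketch using a fair six-sided die (an illustrative choice of sample space, not from the slides):

```python
from fractions import Fraction

# Fair six-sided die: each outcome has probability 1/6
S = {1, 2, 3, 4, 5, 6}

def Pr(event):
    return Fraction(len(event), len(S))

A = {1, 2}        # "roll is 1 or 2"
B = {2, 4, 6}     # "roll is even"
C = {5}           # disjoint from A

# General form: Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B)
assert Pr(A | B) == Pr(A) + Pr(B) - Pr(A & B)

# Axiom 3: if A ∩ C = ∅, then Pr(A ∪ C) = Pr(A) + Pr(C)
assert A & C == set() and Pr(A | C) == Pr(A) + Pr(C)

print(Pr(A | B))  # 2/3
```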

Random Variable
• A random variable, usually written as an italic capital
letter, like X, is a variable whose possible values are
numerical outcomes of a random phenomenon.
– Examples of random phenomena with a numerical outcome include
a toss of a coin (0 for heads and 1 for tails), a roll of a die, or the
height of the first stranger you meet outside.
• There are two types of random variables:
– discrete and continuous.

[Figure: a random variable X maps each outcome s in the sample space to a real number X(s) on the real line R]
Notations
• Some books use P(·) and p(·) to distinguish between the probability of a discrete random variable and the probability density of a continuous random variable, respectively.

• We shall use Pr(·) for both the above cases.

Discrete random variable
• A discrete random variable (DRV) takes on only a countable number of distinct values such as red, yellow, blue, or 1, 2, 3, ….
• The probability distribution of a discrete random variable is described by a list of probabilities associated with each of its possible values.
  - This list of probabilities is called a probability mass function (pmf). (Like a histogram, except that here the probabilities sum to 1.)

[Figure: a probability mass function]


• Let a discrete random variable X have k possible values {x_i}, i = 1, …, k.
• The expectation of X, denoted E[X], is given by

  E[X] ≝ Σ_{i=1}^{k} x_i · Pr(X = x_i)
       = x_1 · Pr(X = x_1) + x_2 · Pr(X = x_2) + ··· + x_k · Pr(X = x_k),

  where Pr(X = x_i) is the probability that X has the value x_i according to the pmf.
• The expectation of a random variable is also called the mean, average or expected value and is frequently denoted with the letter μ.
• The expectation is one of the most important statistics of a
random variable.

• Another important statistic is the standard deviation, defined as
  σ ≝ √( E[(X − μ)²] ).
• The variance, denoted σ² or var(X), is defined as
  σ² = E[(X − μ)²].
• For a discrete random variable, the standard deviation is given by
  σ = √( Pr(X = x_1)(x_1 − μ)² + ⋯ + Pr(X = x_k)(x_k − μ)² ),
  where μ = E[X].
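A minimal numpy sketch of these definitions, using a made-up pmf for illustration:

```python
import numpy as np

# A hypothetical pmf for illustration: values of X and their probabilities
x = np.array([1.0, 2.0, 3.0, 4.0])
p = np.array([0.1, 0.2, 0.3, 0.4])    # must sum to 1

mu = np.sum(x * p)                    # E[X] = sum_i x_i * Pr(X = x_i)
var = np.sum(p * (x - mu) ** 2)       # sigma^2 = E[(X - mu)^2]
sigma = np.sqrt(var)                  # standard deviation

print(mu, var, sigma)                 # 3.0, 1.0, 1.0
```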

Continuous random variable
• A continuous random variable (CRV) takes an infinite
number of possible values in some interval.
– Examples include height, weight, and time.
– Because the number of values of a continuous random variable X
is infinite, the probability Pr(X = c) for any c is 0.
– Therefore, instead of the list of probabilities, the probability
distribution of a CRV (a continuous probability distribution) is
described by a probability density function (pdf).
– The pdf is a function whose codomain
is nonnegative and the area under the
curve is equal to 1.

[Figure: a probability density function]

• The expectation of a continuous random variable X is given by
  E[X] ≝ ∫_R x f_X(x) dx,
  where f_X is the pdf of the variable X and ∫_R is the integral of the function x f_X over its domain R.
• The variance of a continuous random variable X is given by
  σ² ≝ ∫_R (x − μ)² f_X(x) dx.

• The integral is the equivalent of a summation over all values of the function when the function has a continuous domain.
• It equals the area under the curve of the function.
• The property of the pdf that the area under its curve is 1 mathematically means that ∫_R f_X(x) dx = 1.
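A minimal numerical sketch of these integrals, assuming a Gaussian pdf with made-up parameters and simple trapezoidal integration:

```python
import numpy as np

# Assume X ~ N(mu=2, sigma=0.5) purely for illustration
mu_true, sigma_true = 2.0, 0.5
x = np.linspace(mu_true - 8 * sigma_true, mu_true + 8 * sigma_true, 100001)
pdf = np.exp(-(x - mu_true) ** 2 / (2 * sigma_true ** 2)) / np.sqrt(2 * np.pi * sigma_true ** 2)

area = np.trapz(pdf, x)                    # integral of f_X(x) dx, close to 1
mean = np.trapz(x * pdf, x)                # E[X], close to 2.0
var = np.trapz((x - mean) ** 2 * pdf, x)   # variance, close to 0.25

print(area, mean, var)
```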

Mean and Standard Deviation of a Gaussian distribution

[Figure: a Gaussian pdf centred at the mean μ, with markers x1 and x2 indicating intervals that contain 90% and 95% of the probability mass]
Example 1
Independent random variables
• Consider tossing a coin twice. What is the probability of having (H, H)?
Pr(x=H, y=H) = Pr(x=H)Pr(y=H)
= (1/2)(1/2) = 1/4

Slides courtesy: Professor Robby Tan

Example 2
Dependent random variables
• Given 2 balls with different colors (Red and Black), what
is the probability of having B and then R?
The space of outcomes of taking two balls sequentially without replacement:
B–R
R–B
Thus the probability of having B–R is 1/2.
Mathematically:
Pr(x=B, y=R) = Pr(y=R | x=B) Pr(x=B) = 1 × (1/2) = 1/2
Since we are given that the first pick was B, we know that the probability of the remaining ball being R is 1.

Slides courtesy: Professor Robby Tan

Example 3
Dependent random variables
• Given 3 balls with different colors (R,G,B), what is the
probability of having B and then G?
The space of outcomes of taking two balls sequentially without replacement:
R–G | G–B | B–R
R–B | G–R | B–G
Thus Pr(y=G, x=B) = 1/6.
Mathematically:
Pr(y=G, x=B) = Pr(y=G | x=B) Pr(x=B) = (1/2) × (1/3) = 1/6
Given that the first pick is B, the remaining balls are G and R, and thus the chance of picking G is 1/2.

Slides courtesy: Professor Robby Tan
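A small simulation sketch of Example 3 (the trial count is an arbitrary choice) that estimates Pr(first = B, second = G) by repeated draws without replacement:

```python
import random

# Draw two balls from {R, G, B} without replacement and estimate
# Pr(first = B, second = G); the expected value is 1/6.
random.seed(0)
trials = 100_000
hits = 0
for _ in range(trials):
    balls = ["R", "G", "B"]
    random.shuffle(balls)
    if balls[0] == "B" and balls[1] == "G":
        hits += 1

print(hits / trials)   # close to 1/6 ≈ 0.1667
```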

Bayes’ Rule
Thomas Bayes (1701 – 1761)

• The conditional probability Pr(Y = y | X = x) is the probability of the random variable Y having a specific value y given that another random variable X has a specific value x.
• Bayes’ rule (also known as Bayes’ theorem) stipulates that:

  Pr(Y = y | X = x) = Pr(X = x | Y = y) Pr(Y = y) / Pr(X = x)        (1)

  posterior = (likelihood × prior) / evidence
Bayes’ Rule

  Pr(y|x) = Pr(y) Pr(x|y) / Pr(x) = Pr(y) Pr(x|y) / Σ_y Pr(y) Pr(x|y)

Prior Pr(y) – what we know about y BEFORE seeing x.
Likelihood Pr(x|y) – the propensity for observing a certain value of x given a certain value of y.
Posterior Pr(y|x) – what we know about y AFTER seeing x.
Evidence Pr(x) – a constant to ensure that the left-hand side is a valid distribution.

Adapted from S. Prince
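A small numeric sketch of Bayes’ rule for a discrete y, with made-up prior and likelihood values (a hypothetical disease/test setting, not from the slides):

```python
import numpy as np

# y is a disease indicator, x is a positive test result (illustrative numbers)
prior      = np.array([0.99, 0.01])        # Pr(y=healthy), Pr(y=sick)
likelihood = np.array([0.05, 0.90])        # Pr(test+ | healthy), Pr(test+ | sick)

evidence  = np.sum(prior * likelihood)     # Pr(x) = sum_y Pr(y) Pr(x|y)
posterior = prior * likelihood / evidence  # Pr(y | x), sums to 1

print(posterior)   # approximately [0.846, 0.154]
```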


Parameter Estimation
• Bayes’ rule comes in handy when we have a model of X’s distribution, e.g., to model the term Pr(X = x | θ = θ̂) with a normal distribution N(μ, σ²):

  f_θ̂(x) = Pr(X = x | θ = θ̂) = (1/√(2πσ²)) e^(−(x − μ)²/(2σ²))        (2)

  where θ̂ ≝ [μ, σ] is the parameter vector and π is the constant (3.14159…).

Parameter Estimation
• With Pr(X = x | θ = θ̂) ≝ f_θ̂(x) of equation (2), we can update the values of the parameters in the vector θ from the data using Bayes’ rule:

  Pr(θ = θ̂ | X = x) ← Pr(X = x | θ = θ̂) Pr(θ = θ̂) / Pr(X = x)        (3)
                    = Pr(X = x | θ = θ̂) Pr(θ = θ̂) / Σ_θ̂ Pr(X = x | θ = θ̂) Pr(θ = θ̂)

  (posterior probability; the sum in the denominator runs over all candidate values of θ̂)
Parameter Estimation
• The best value of the parameters θ* given some examples is obtained using the principle of maximum a posteriori (or MAP):

  θ* = argmax_θ Pr(θ = θ̂ | X = x)
     = argmax_θ log Pr(θ = θ̂ | X = x)
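A minimal grid-search sketch of the MAP principle, assuming a Gaussian likelihood with known σ = 1 and a made-up Gaussian prior over the mean μ (all choices here are illustrative assumptions):

```python
import numpy as np

# Estimate the mean mu of a Gaussian with known sigma = 1, using a prior
# mu ~ N(0, 2^2) and a grid of candidate parameter values.
rng = np.random.default_rng(0)
data = rng.normal(loc=1.5, scale=1.0, size=20)        # observed samples

mu_grid = np.linspace(-5, 5, 1001)                    # candidate values of mu
log_prior = -0.5 * (mu_grid / 2.0) ** 2               # log N(0, 2^2), up to a constant
log_lik = np.array([np.sum(-0.5 * (data - m) ** 2) for m in mu_grid])  # sigma = 1

log_post = log_lik + log_prior                        # log posterior, up to a constant
mu_map = mu_grid[np.argmax(log_post)]                 # argmax_theta log Pr(theta | X)
print(mu_map)    # close to the sample mean, shrunk slightly toward the prior mean 0
```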

Parameter Estimation
• Suppose Pr(X = x | θ = θ̂) ≝ f_θ̂(x).

  Pr(θ = θ̂ | X = x) ← Pr(X = x | θ = θ̂) Pr(θ = θ̂) / Pr(X = x)

• The estimation is called maximum likelihood estimation when only the likelihood term Pr(X = x | θ = θ̂) is used for estimating the non-random parameter θ while assuming X random.
• In this way, maximum a posteriori (MAP) estimation becomes maximum likelihood estimation.

Maximum Likelihood Estimation
• Suppose that the underlying distribution is a normal N(μ, σ²) and we want to estimate the mean μ and the variance σ² from sample data X = {x_1, x_2, …, x_m} using the maximum likelihood estimator.

• Log-likelihood function: L(X|θ) = Σ_{i=1}^{m} log f_θ(x_i), so

  L(X|θ) = Σ_{i=1}^{m} log [ (1/√(2πσ²)) e^(−(x_i − μ)²/(2σ²)) ]
         = −m log σ − (m/2) log 2π − (1/(2σ²)) Σ_{i=1}^{m} (x_i − μ)²

Maximum Likelihood Estimation
Setting
  ∂L(X|θ)/∂μ = (1/σ²) Σ_{i=1}^{m} (x_i − μ) = 0,
we have
  μ̂ = (1/m) Σ_{i=1}^{m} x_i ≔ x̄,

and setting
  ∂L(X|θ)/∂σ = −m/σ + (1/σ³) Σ_{i=1}^{m} (x_i − μ)² = 0,
we have
  σ̂ = √( (1/m) Σ_{i=1}^{m} (x_i − x̄)² ).
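A minimal numpy check of these closed-form estimates on synthetic data (the true parameters are made up for illustration):

```python
import numpy as np

# Samples assumed to come from N(2, 0.5^2), purely for illustration
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.5, size=10_000)

mu_hat = np.mean(x)                               # (1/m) * sum_i x_i
sigma_hat = np.sqrt(np.mean((x - mu_hat) ** 2))   # ML estimate, divides by m (not m - 1)

print(mu_hat, sigma_hat)   # approximately 2.0 and 0.5
# Note: np.std(x) uses the same 1/m normalisation (ddof=0 by default)
```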

Parametric Machine Learning
• A learning model that summarizes data with a set of
parameters of fixed size is called a parametric model. No
matter how much data you throw at a parametric model, it
won’t change its mind about how many parameters it
needs. (Artificial Intelligence: A Modern Approach, p. 737)
• The algorithms involve two steps,
1. Select a form for the function, e.g. normal distribution,
2. Learn the coefficients for the function from the training
data.

Pros: simplicity, speed, less data required
Cons: strong assumptions, limited complexity, potentially poor fit

Non-parametric Machine Learning
• Algorithms that do not make strong assumptions about the
form of the mapping function are called nonparametric
machine learning algorithms. They are good when you have
a lot of data and no prior knowledge, and when you don’t
want to worry too much about choosing just the right
features. (Artificial Intelligence: A Modern Approach, p. 757)

Pros: flexibility, performance
Cons: more data required, slower, prone to overfitting
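A toy sketch contrasting the two families on the same sample: a parametric Gaussian fit summarised by two numbers versus a non-parametric histogram density whose size grows with the data (all numbers here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=5_000)   # illustrative sample

# Parametric: the model is fully summarised by (mu, sigma), whatever the data size
mu, sigma = np.mean(data), np.std(data)

# Non-parametric: a histogram density estimate; its size scales with the bin count
counts, edges = np.histogram(data, bins=50, density=True)

def parametric_pdf(x):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

def nonparametric_pdf(x):
    idx = np.clip(np.searchsorted(edges, x) - 1, 0, len(counts) - 1)
    return counts[idx]

print(parametric_pdf(0.0), nonparametric_pdf(0.0))  # both close to 0.40 for N(0, 1)
```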

The End

