Lec 4
Yi Zhang
Yi Zhang ([email protected]) The Hong Kong University of Science and Technology (Guangzhou)
January 23, 2024
Introduction
In the rest of the lecture, we introduce basic models for the
probabilistic modeling of discrete data.
Probability models are sometimes called generative models in the
machine learning literature.
Compared to LSA and word2vec, probability models have several
advantages:
1. They specify an explicit data-generating process for text.
2. They make clear the statistical foundations for dimensionality
reduction, which allows for well-defined inference procedures.
3. The latent components onto which the data are projected are
easier to interpret.
4. They are relatively straightforward to extend to incorporate
additional dependencies of interest.
Probability Models
As a stepping stone for building dimensionality reduction algorithms,
we will introduce basic models for the probabilistic modeling of
discrete data in the rest of the lecture.
Probability models are sometimes called generative models in the
machine learning literature.
1. In supervised learning, generative models allow us to model the
full joint distribution p(y_d, x_d), which we revisit in the final
lecture.
2. In unsupervised learning, generative models allow us to give a
statistical interpretation to the hidden structure in a corpus.
Note: For now, we ignore document heterogeneity, and instead
introduce models that will form the building blocks for probabilistic
unsupervised learning for a given document.
Simple Probability Model
Consider the list of terms w = (w_1, w_2, ..., w_N) where
w_n ∈ {1, 2, ..., V}.
Suppose that each term is iid, and that
β_v = Prob(w_n = v) ∈ [0, 1].
Let β = (β_1, β_2, ..., β_V) ∈ ∆^{V−1} be the parameter vector we want
to estimate.
The probability of the data given the parameters is
$$\mathrm{Prob}(w \mid \beta) = \prod_n \sum_v \mathbb{1}(w_n = v)\,\beta_v = \prod_v \beta_v^{x_v},$$
where x_v = Σ_n 1(w_n = v) is the count of term v in the document.
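For reference, the MLE maximizes the log-likelihood Σ_v x_v log β_v under the constraint Σ_v β_v = 1; a Lagrange-multiplier argument gives
$$\frac{\partial}{\partial \beta_v}\Big(\sum_{v'} x_{v'} \log \beta_{v'} - \lambda\Big(\sum_{v'} \beta_{v'} - 1\Big)\Big) = \frac{x_v}{\beta_v} - \lambda = 0
\quad\Rightarrow\quad
\hat{\beta}_v^{\mathrm{MLE}} = \frac{x_v}{\sum_{v'} x_{v'}} = \frac{x_v}{N}.$$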
Implications of MLE I
Suppose you do not speak Portuguese, but someone lists for you
10,000 possible words the spoken language might contain.
You are then shown a single snippet of text, ‘eles bebem’. The
parameters that best explain these data put probability 1/2 each on
‘eles’ and ‘bebem’, and 0 on every other possible word.
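A minimal sketch of this behaviour in Python, using a toy five-word vocabulary in place of the full 10,000-word list:

```python
import numpy as np

# Toy vocabulary standing in for the 10,000 candidate words.
vocab = ["eles", "bebem", "agua", "livro", "casa"]
snippet = ["eles", "bebem"]  # the single observed snippet

# MLE for the term-probability model: beta_hat_v = x_v / N.
counts = np.array([snippet.count(w) for w in vocab], dtype=float)
beta_hat = counts / counts.sum()

for word, p in zip(vocab, beta_hat):
    print(f"{word:>6}: {p:.2f}")
# Observed words get probability 1/2 each; every unseen word gets
# exactly 0, so the fitted model rules out any text containing them.
```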
Implications of MLE II
Bayesian Inference
Bayesian inference treats β as a random variable drawn from a prior
distribution, which can encode any knowledge we might have.
On the other hand, we treat the data as a fixed quantity that
provides information about β.
The likelihood principle states that all relevant information about an
unknown quantity θ is contained in the likelihood function of θ for
the given data (Berger and Wolpert 1988).
Bayesian inference is consistent with the likelihood principle;
frequentist reasoning need not be.
"Many Bayesians became Bayesians only because the LP left them
little choice" (Berger and Wolpert 1988).
Bayes’ Rule
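Applied to the term-probability model above, Bayes' rule combines the likelihood and the prior into a posterior over β:
$$p(\beta \mid w) = \frac{p(w \mid \beta)\, p(\beta)}{p(w)} \;\propto\; p(w \mid \beta)\, p(\beta), \qquad p(w) = \int p(w \mid \beta)\, p(\beta)\, d\beta.$$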
What is a Bayesian Estimate?
There are several ways of reporting the Bayesian estimate of β: for
example, the posterior mean E[β|w], the posterior mode (the MAP
estimate), or the full posterior itself, summarized by credible intervals.
Penalized Regression Revisited
Consider the parametric linear regression model in which
y_i ∼ N(x_i^T β, σ²), where σ is known.
Suppose we draw each regression coefficient β_j from a normal prior
N(0, τ²).
The posterior distribution over β is then proportional to
p(y|β, X) p(β).
We know that
$$p(y \mid \beta, X)\, p(\beta) \;\propto\; \prod_i \exp\Big\{-\frac{(y_i - x_i^T \beta)^2}{2\sigma^2}\Big\} \prod_j \exp\Big\{-\frac{\beta_j^2}{2\tau^2}\Big\}$$
The MAP estimate can be obtained by minimizing
$$\sum_i (y_i - x_i^T \beta)^2 + \frac{\sigma^2}{\tau^2} \sum_j \beta_j^2,$$
which is exactly the ridge regression objective with penalty λ = σ²/τ².
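A minimal numerical sketch of this equivalence (the data and dimensions below are hypothetical choices, not from the lecture): the closed-form MAP estimate under the normal prior is the ridge solution with λ = σ²/τ².

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: n observations, p regressors.
n, p = 200, 5
sigma, tau = 1.0, 0.5
X = rng.normal(size=(n, p))
beta_true = rng.normal(scale=tau, size=p)
y = X @ beta_true + sigma * rng.normal(size=n)

# MAP under the N(0, tau^2) prior = ridge with lambda = sigma^2 / tau^2:
#   beta_hat = (X'X + lambda * I)^{-1} X'y
lam = sigma**2 / tau**2
beta_map = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(np.round(beta_map, 3))
```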
LASSO
$$\mathrm{RSS}(\beta) + \lambda \sum_j |\beta_j|$$
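By the same argument as for ridge, this objective can be read as a MAP estimate: replacing the normal prior on each coefficient with a Laplace (double-exponential) prior p(β_j) ∝ exp{−|β_j|/b}, with scale parameter b, gives
$$\hat{\beta}^{\mathrm{MAP}} = \arg\min_\beta \; \mathrm{RSS}(\beta) + \frac{2\sigma^2}{b} \sum_j |\beta_j|,$$
i.e. the LASSO with λ = 2σ²/b.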
Choosing Priors I
Choosing Priors II
Dirichlet Prior
The Dirichlet distribution is parametrized by
$$\eta = (\eta_1, \eta_2, \ldots, \eta_V),$$
is defined on the (V − 1)-simplex, and has probability density function
$$\mathrm{Dir}(\beta \mid \eta) \;\propto\; \prod_v \beta_v^{\eta_v - 1}.$$
The normalization constant is
$$B(\eta) = \prod_v \Gamma(\eta_v) \Big/ \Gamma\Big(\sum_v \eta_v\Big).$$
The marginal distribution of each component is β_v ∼ Beta(η_v, Σ_{v'} η_{v'} − η_v). Its mean and
variance are
$$E[\beta_v] = \frac{\eta_v}{\sum_{v'} \eta_{v'}}, \qquad \mathrm{Var}(\beta_v) = \frac{\eta_v\big(\sum_{v'} \eta_{v'} - \eta_v\big)}{\big(\sum_{v'} \eta_{v'}\big)^2\big(\sum_{v'} \eta_{v'} + 1\big)}.$$
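A small sketch using NumPy to draw from symmetric Dirichlets with different concentration parameters (the vocabulary size and number of draws are arbitrary choices), matching the η = 0.5, 1, 5 figures that follow and checking the mean formula empirically:

```python
import numpy as np

rng = np.random.default_rng(0)
V = 10  # vocabulary size for illustration

for eta in (0.5, 1.0, 5.0):
    # Symmetric Dirichlet: every eta_v equals eta.
    draws = rng.dirichlet(np.full(V, eta), size=5000)
    mean_first = draws.mean(axis=0)[0]    # theory: eta_v / sum eta_v = 1/V
    near_zero = (draws < 1e-3).mean()     # share of near-zero components
    print(f"eta = {eta:>3}: empirical mean = {mean_first:.3f} "
          f"(theory {1 / V:.3f}), near-zero share = {near_zero:.3f}")
# Small eta (< 1) produces sparse draws; large eta produces dense,
# near-uniform draws.
```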
Interpreting the Dirichlet
Beta with Different Parameters
Draws from Dirichlet with η = 1
Draws from Dirichlet with η = 0.5
Draws from Dirichlet with η = 5
Example
Sparsity-inducing prior (η = 2)
Uniform prior (η = 1)
Density-inducing prior (η = 5)
Data Overwhelming the Prior
$$\mathrm{Pr}(\beta \mid w) \;\propto\; \mathrm{Pr}(w \mid \beta)\, p(\beta) \;\propto\; \prod_v \beta_v^{x_v} \prod_v \beta_v^{\eta_v - 1} \;=\; \prod_v \beta_v^{\eta_v + x_v - 1},$$
so the posterior is again a Dirichlet, with parameters η_v + x_v.
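A short sketch of the data overwhelming the prior: under the Dirichlet posterior above, the posterior mean (η_v + x_v)/(Σ_{v'} η_{v'} + N) approaches the MLE x_v/N as the number of observed tokens grows (the vocabulary size, prior, and true probabilities below are toy choices):

```python
import numpy as np

rng = np.random.default_rng(1)
V = 5
eta = np.full(V, 2.0)                          # symmetric Dirichlet prior
beta_true = np.array([0.5, 0.2, 0.15, 0.1, 0.05])

for N in (10, 100, 10_000):
    w = rng.choice(V, size=N, p=beta_true)     # N iid tokens
    x = np.bincount(w, minlength=V)            # term counts x_v
    post_mean = (eta + x) / (eta.sum() + N)    # posterior mean under Dir(eta + x)
    mle = x / N                                # maximum likelihood estimate
    print(f"N = {N:>6}: max |posterior mean - MLE| = "
          f"{np.abs(post_mean - mle).max():.4f}")
# The gap shrinks toward zero: with enough data the prior's pseudo-counts
# eta_v become negligible next to the observed counts x_v.
```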
Application: Feature Selection
Application: Which Information Matters?
Results: Top Topic
Results: Top Topic
1-Year Spot Rate:
Results: Top Topic
5-Year, 5-Year Forward Rate
Application: Mixed Membership Model
We now turn to latent topics in the mixed membership model.
The framework builds on the probabilistic, Bayesian view developed
above.
In practice, we might imagine that documents cover more than one
topic.
Examples: State-of-the-Union Addresses discuss domestic and foreign
policy; monetary policy speeches discuss inflation and growth.
Models that associate observations with more than one latent
variable are called mixed-membership models. They are also relevant
outside of text mining: in models of group formation, agents can be
associated with different latent communities (sports team, workplace,
church, etc.).
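As a preview, a minimal sketch of one mixed-membership generative story for a single document (an LDA-style setup; the number of topics, the priors, and all numerical values here are illustrative assumptions rather than the model developed later):

```python
import numpy as np

rng = np.random.default_rng(2)
V, K, N = 8, 2, 20          # vocabulary size, topics, tokens in the document

# Each topic is a distribution over terms; the document mixes the topics.
topics = rng.dirichlet(np.full(V, 0.5), size=K)   # K rows, each on the simplex
theta = rng.dirichlet(np.full(K, 1.0))            # document's topic proportions

doc = []
for _ in range(N):
    z = rng.choice(K, p=theta)                    # pick a topic for this token
    w = rng.choice(V, p=topics[z])                # draw the term from that topic
    doc.append(w)

print("topic proportions:", np.round(theta, 2))
print("document (term ids):", doc)
```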
Discriminative vs Generative Model
Discriminative classifiers estimate models of the form p(y|x), which
can be applied directly to text data.
1. JASA, "The Efficiency of Logistic Regression Compared to Normal
Discriminant Analysis" (Efron, 1975).
Discriminative vs Generative Model
2. NIPS, "On Discriminative vs. Generative Classifiers: A Comparison
of Logistic Regression and Naive Bayes" (Ng and Jordan, 2001).
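A small illustrative comparison in the spirit of these references, assuming scikit-learn is available; the toy corpus and labels below are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus: label 1 = monetary policy, 0 = foreign policy.
docs = ["inflation and interest rates rose", "the central bank cut rates",
        "troops deployed overseas", "a new treaty with foreign allies",
        "growth and inflation outlook", "diplomatic talks on the treaty"]
y = [1, 1, 0, 0, 1, 0]

X = CountVectorizer().fit_transform(docs)   # document-term count matrix

# Discriminative: model p(y | x) directly.
logit = LogisticRegression().fit(X, y)
# Generative: model p(x | y) p(y), then classify via Bayes' rule.
nb = MultinomialNB().fit(X, y)

print("logistic regression in-sample accuracy:", logit.score(X, y))
print("naive Bayes in-sample accuracy:", nb.score(X, y))
```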
Conclusion