
Text Mining in Finance and Economics

Introduction to Probabilistic Modeling

Yi Zhang

The Hong Kong University of Science and Technology (Guangzhou)

January 23, 2024

Introduction
We now introduce basic models for the probabilistic modeling of
discrete data in the rest of the lecture.
Probability models are sometimes called generative models in the
machine learning literature.
Compared to LSA and word2vec, probability models have several advantages:
1. They specify an explicit data-generating process for text.
2. They make clear the statistical foundations for dimensionality reduction, allowing for well-defined inference procedures.
3. The latent components onto which the data are projected are easier to interpret.
4. They are relatively straightforward to extend to incorporate additional dependencies of interest.
Probability Models
As a stepping stone for building dimensionality reduction algorithms,
we will introduce basic models for the probabilistic modeling of
discrete data in the rest of the lecture.
Probability models are sometimes called generative models in the
machine learning literature.
1. In supervised learning, generative models allow us to model the full joint distribution p(y_d, x_d), which we revisit in the final lecture.
2. In unsupervised learning, generative models allow us to give a statistical interpretation to the hidden structure in a corpus.
Note: For now, we ignore document heterogeneity, and instead
introduce models that will form the building blocks for probabilistic
unsupervised learning for a given document.
Simple Probability Model
Consider the list of terms w = (w_1, w_2, ..., w_N), where w_n ∈ {1, 2, ..., V}.
Suppose that each term is iid, and that β_v = Prob(w_n = v) ∈ [0, 1].
Let β = (β_1, β_2, ..., β_V) ∈ Δ^{V−1} be the parameter vector we want to estimate.
The probability of the data given the parameters is
$$\mathrm{Prob}(w \mid \beta) = \prod_n \sum_v \mathbf{1}(w_n = v)\,\beta_v = \prod_v \beta_v^{x_v},$$
where x_v is the count of term v in w.


Note that term counts are a sufficient statistic for w in the
estimation of β. The independence assumption provides statistical
foundations for the bag-of-words model.
Maximum Likelihood Inference

We can estimate β_v with maximum likelihood. The Lagrangian is
$$\mathcal{L} = \underbrace{\sum_v x_v \log(\beta_v)}_{\text{log-likelihood}} + \underbrace{\lambda\Big(1 - \sum_v \beta_v\Big)}_{\text{constraint on } \beta}.$$
The first-order condition is
$$\frac{x_v}{\beta_v} - \lambda = 0 \;\Rightarrow\; \beta_v = \frac{x_v}{\lambda}.$$
The constraint gives
$$\sum_v \frac{x_v}{\lambda} = 1 \;\Rightarrow\; \lambda = \sum_v x_v = N.$$
So the MLE estimate is β̂_v = x_v / N, the frequency of term v in the list of terms.
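As a concrete illustration (a minimal Python sketch with a hypothetical token list and vocabulary, not code from the lecture), the MLE reduces to counting terms and normalizing:

```python
from collections import Counter

# Hypothetical tokenized document and vocabulary (purely illustrative).
tokens = ["rates", "inflation", "rates", "growth", "inflation", "rates"]
vocab = ["rates", "inflation", "growth", "unemployment"]

# Term counts x_v are a sufficient statistic for w.
counts = Counter(tokens)
x = [counts.get(v, 0) for v in vocab]
N = sum(x)

# MLE: beta_hat_v = x_v / N, i.e. the observed frequency of each term.
beta_hat = [x_v / N for x_v in x]
print(dict(zip(vocab, beta_hat)))  # 'unemployment' gets probability exactly 0
```

Note that any term that does not appear in w receives an estimated probability of exactly zero, which is the issue taken up on the next slides.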

Implications of MLE I

Suppose you do not speak Portuguese, but someone lists for you
10,000 possible words the spoken language might contain.
You are then shown a single snippet of text ‘eles bebem’. The
parameters that best explain this data put 1/2 probability each on
‘eles’ and on ‘bebem’ and 0 on every other possible word.

Implications of MLE II

Is this a reasonable model? We ‘know’ that working languages contain hundreds of regularly spoken words; we ‘know’ that the distribution of word frequencies is highly skewed; we ‘know’ that the language is similar to Spanish, and should inherit a similar frequency distribution; and so on.
The MLE estimate relies solely on the data we observe.
A more subtle problem is to take V to be the number of unique observations, which may be misleading even with large samples (the black swan paradox).

Bayesian Inference
Bayesian inference treats β as a random variable drawn from a prior
distribution, which can encode any knowledge we might have.
On the other hand, we treat the data as a fixed quantity that
provides information about β.
The likelihood principle states that all relevant information about an
unknown quantity θ is contained in the likelihood function of θ for
the given data (Berger and Wolpert 1988).
Bayesian inference is consistent with the likelihood principle,
frequentist reasoning need not be.
"Many Bayesians became Bayesians only because the LP left them little choice" (Berger and Wolpert 1988).

Bayes’ Rule

Bayesian inference is operationalized via the application of Bayes' rule:
$$p(\beta \mid w) = \frac{p(w \mid \beta)\, p(\beta)}{p(w)}$$
where:
p(β|w) is the posterior distribution.
p(w|β) is the likelihood function.
p(β) is the prior distribution.
p(w) is a normalizing constant, sometimes called the evidence.
The prior is often parametrized by hyperparameters.

What is a Bayesian Estimate?
There are several ways of reporting the Bayesian estimate of β:

1. The MAP estimate: the value at which the posterior distribution is highest, i.e. its mode.
2. The expected value of β under the posterior.
3. A point estimate chosen to minimize some expected loss function.
4. A credible interval Prob(β ∈ A | w) for some set A.
We are also sometimes interested in Prob(w_{N+1} | w) for some unseen data w_{N+1}. This is called the predictive distribution.
All of these depend fundamentally on the posterior distribution.
If we can compute the posterior, we can do Bayesian inference.
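To make these reporting options concrete, here is a minimal sketch for the special case of a two-term vocabulary, where (anticipating the Dirichlet results later in the lecture) the posterior is a Beta distribution; the prior and counts are hypothetical:

```python
from scipy import stats

# Hypothetical two-term vocabulary: Beta(2, 2) prior and observed counts x = (7, 3),
# so the posterior over beta_1 is Beta(2 + 7, 2 + 3).
a, b = 2 + 7, 2 + 3
posterior = stats.beta(a, b)

map_estimate = (a - 1) / (a + b - 2)     # 1. posterior mode (MAP)
posterior_mean = posterior.mean()        # 2. expected value under the posterior
credible_90 = posterior.interval(0.90)   # 4. central 90% credible interval

print(map_estimate, posterior_mean, credible_90)
```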

Penalized Regression Revisited
Consider the parametric linear regression model in which y_i ∼ N(x_i^⊤ β, σ²), where σ is known.
Suppose we draw each regression coefficient β_j from a normal prior N(0, τ²).
The posterior distribution over β is then proportional to p(y|β, X) p(β).
We know that
$$p(y \mid \beta, X)\, p(\beta) \propto \prod_i \exp\Big\{-\frac{(y_i - x_i^\top \beta)^2}{2\sigma^2}\Big\} \prod_j \exp\Big\{-\frac{\beta_j^2}{2\tau^2}\Big\}.$$
The MAP estimate can therefore be obtained by minimizing
$$\sum_i (y_i - x_i^\top \beta)^2 + \frac{\sigma^2}{\tau^2} \sum_j \beta_j^2,$$
which is exactly the ridge regression objective.
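A minimal numerical sketch of this equivalence (synthetic data, assumed values of σ and τ): the MAP estimate under the Gaussian prior is the ridge solution with penalty λ = σ²/τ².

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data, purely illustrative.
n, p = 200, 5
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.0, 0.5, 0.0])
sigma, tau = 1.0, 0.5
y = X @ beta_true + sigma * rng.normal(size=n)

# MAP under a N(0, tau^2) prior = ridge with penalty lambda = sigma^2 / tau^2.
lam = sigma**2 / tau**2
beta_map = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta_map)
```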
LASSO

Now suppose we draw each regression coefficient β_j from a Laplace prior, so that Pr(β_j | λ) ∝ exp(−λ|β_j|).
The Laplace distribution has a spike at 0, which promotes sparsity.
The objective function for MAP estimation can be written
$$\mathrm{RSS}(\beta) + \lambda \sum_j |\beta_j|.$$
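Correspondingly, a hedged sketch of MAP estimation under the Laplace prior via scikit-learn's Lasso (note that sklearn scales the residual sum of squares by 1/(2n), so its alpha corresponds to the λ above only up to that rescaling); the data are synthetic:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.0, 0.5, 0.0])
y = X @ beta_true + rng.normal(size=n)

# MAP under a Laplace prior corresponds to the lasso objective
# (1/(2n)) * RSS(beta) + alpha * sum_j |beta_j| in sklearn's parametrization.
lasso = Lasso(alpha=0.1, fit_intercept=False)
lasso.fit(X, y)
print(lasso.coef_)  # coefficients on irrelevant regressors are typically shrunk to exactly 0
```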

Choosing Priors I

A popular choice for the prior distribution is that it be conjugate, i.e. the posterior distribution belongs to the same parametric family as the prior. This facilitates analytic computation of the posterior.
All distributions in the exponential family have conjugate prior
distributions...but are they meaningful?
The Dirichlet distribution is conjugate to the categorical/multinomial
distributions (as we shall see).

Choosing Priors II

When the conjugate prior is not sufficiently expressive, we can adopt another prior and simulate the posterior distribution.
For example, the log-normal distribution more naturally embeds dependence on covariates and correlation in text.
Once we choose a prior, we still need to choose hyperparameters: we can set them at some value consistent with our domain-specific knowledge, or select them to maximize the evidence (empirical Bayes).

Dirichlet Prior
The Dirichlet distribution is parametrized by η = (η_1, η_2, ..., η_V); it is defined on the (V−1)-simplex and has probability density function
$$\mathrm{Dir}(\beta \mid \eta) \propto \prod_v \beta_v^{\eta_v - 1}.$$
The normalization constant is
$$B(\eta) = \prod_v \Gamma(\eta_v) \Big/ \Gamma\Big(\sum_v \eta_v\Big).$$
The marginal distribution of β_v is Beta(η_v, Σ_{v'} η_{v'} − η_v). The mean and variance are
$$\mathbb{E}[\beta_v] = \frac{\eta_v}{\sum_{v'} \eta_{v'}}, \qquad \mathrm{Var}(\beta_v) = \frac{\eta_v\big(\sum_{v'} \eta_{v'} - \eta_v\big)}{\big(\sum_{v'} \eta_{v'}\big)^2\big(\sum_{v'} \eta_{v'} + 1\big)}.$$
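A small sketch of drawing from symmetric Dirichlet distributions and checking the mean formula empirically (the vocabulary size and concentration values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
V = 25

# Symmetric Dirichlet draws for different concentration parameters eta.
for eta in [0.5, 1.0, 5.0]:
    draws = rng.dirichlet(np.full(V, eta), size=5000)   # each row sums to 1
    print(eta, draws.mean(axis=0)[:3])                  # empirical mean ~ eta / (V * eta) = 1/V
    # eta < 1: most mass near the corners of the simplex; eta > 1: mass near the center.
```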
Interpreting the Dirichlet

Consider a symmetric Dirichlet in which η_v ≡ η. This is agnostic about favoring one component over another.
Here the η parameter measures the concentration of the distribution on the center of the simplex, where the mass on each term is more evenly spread:
1. η = 1 is a uniform distribution over the simplex.
2. η > 1 puts relatively more weight in the center of the simplex.
3. η < 1 puts relatively more weight on the corners of the simplex.
When V = 2, the Dirichlet becomes the beta distribution.

Beta with Different Parameters

Draws from Dirichlet with η = 1

Draws from Dirichlet with η = 0.5

Draws from Dirichlet with η = 5

Example

Suppose we begin with a possible vocabulary of size V = 25.
We observe N = 50 total words, and term v appears 5 times.
The MLE point estimate of β_v is 0.1.
The Bayesian estimate of β_v depends on the prior.
Consider a symmetric Dirichlet with hyperparameter η.

Sparsity-inducing prior (η = 2)

Uniform prior (η = 1)

Density-inducing prior (η = 5)

Data Overwhelming the Prior

$$\Pr(\beta \mid w) \propto \Pr(w \mid \beta)\, p(\beta) \propto \prod_v \beta_v^{x_v} \prod_v \beta_v^{\eta_v - 1} = \prod_v \beta_v^{\eta_v + x_v - 1}$$
The posterior is a Dirichlet with parameters (η′_1, η′_2, ..., η′_V), where η′_v ≡ η_v + x_v: we add the term counts to the prior distribution's parameters to form the posterior distribution. The Dirichlet hyperparameters can thus be viewed as pseudo-counts, i.e. observations made before observing w.
Therefore, we obtain the estimator for β
$$\mathbb{E}[\beta_v \mid w] = \frac{\eta_v + x_v}{\sum_{v'} \eta_{v'} + N},$$
which also corresponds to the predictive distribution Prob(w_{N+1} = v | w). The MAP estimator of β_v is
$$\frac{\eta_v + x_v - 1}{\sum_{v'} (\eta_{v'} + x_{v'}) - V}.$$
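A minimal sketch of this conjugate update, using the earlier example (V = 25, N = 50, x_v = 5) and a symmetric Dirichlet(η) prior:

```python
# Posterior summaries for a single term under a symmetric Dirichlet(eta) prior.
V, N, x_v = 25, 50, 5

for eta in [0.5, 1.0, 2.0, 5.0]:
    post_mean = (eta + x_v) / (V * eta + N)          # E[beta_v | w]
    post_map = (eta + x_v - 1) / (V * eta + N - V)   # posterior mode (valid here since eta + x_v > 1)
    print(eta, round(post_mean, 4), round(post_map, 4))

# For reference, the MLE is x_v / N = 0.1 and the prior mean is 1/V = 0.04.
```

With η = 1 the MAP estimate coincides with the MLE of 0.1, while the posterior mean is pulled toward the prior mean 1/V = 0.04; larger η strengthens that pull.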
Example: Unigram Model

Suppose that β follows the Dirichlet distribution and the observed counts x_v follow the multinomial distribution, with hyperparameter η_v = 1 for every v.
The posterior predictive distribution is
$$\Pr(w_{N+1} = v \mid w) = \Big(\tfrac{3}{27}, \tfrac{5}{27}, \tfrac{5}{27}, \tfrac{1}{27}, \tfrac{2}{27}, \tfrac{2}{27}, \tfrac{1}{27}, \tfrac{2}{27}, \tfrac{1}{27}, \tfrac{5}{27}\Big)$$
for v = 1, 2, ..., 10 respectively, where v is the token number (this corresponds to observed counts x = (2, 4, 4, 0, 1, 1, 0, 1, 0, 4), so that N = 17 and Σ_v η_v + N = 27).
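A sketch verifying this predictive distribution, assuming the term counts implied by the fractions above (x_v = numerator − 1, since η_v = 1):

```python
import numpy as np

eta = np.ones(10)                                  # symmetric Dirichlet prior with eta_v = 1
x = np.array([2, 4, 4, 0, 1, 1, 0, 1, 0, 4])       # counts implied by the example (N = 17)

# Posterior predictive: Prob(w_{N+1} = v | w) = (eta_v + x_v) / (sum(eta) + N)
pred = (eta + x) / (eta.sum() + x.sum())
print(pred * 27)   # -> [3. 5. 5. 1. 2. 2. 1. 2. 1. 5.], i.e. the fractions over 27
```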

Application: Feature Selection

In recent work, Hansen, McMahon and Tong (2019, JME) study the impact of the release of the Bank of England's Inflation Report (IR) on bond price changes at different maturities.
The IR contains forecast variables we use as controls: (i) the mode, variance, and skewness of the inflation and GDP forecasts; (ii) their difference from the previous forecast.
To represent the text, we estimate a 30-topic model and represent each IR in terms of (i) topic shares and (ii) the evolution of topic shares from the previous IR.
The first step in the analysis is to partial out the forecast variables from bond price moves and topic shares by constructing residuals.

Application: Which Information Matters?

LASSO selects dozens of features at all maturities: the standard over-selection problem.
How can we identify the key topics?
We apply a non-parametric bootstrap to simulate the "inclusion probabilities" of topic features at different maturities:
Draw with replacement from our 69 observations to obtain a new sample, perform LASSO, and record whether each feature is included.
Repeat 500 times, and rank topics according to the fraction of bootstrap draws in which they appear.
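A hedged sketch of this bootstrap procedure; the data arrays, feature count, and LASSO penalty below are placeholders rather than the paper's actual specification:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Placeholder residualized data: 69 observations, 60 topic features (shares + changes).
n_obs, n_feat = 69, 60
X = rng.normal(size=(n_obs, n_feat))
y = 0.8 * X[:, 0] - 0.5 * X[:, 3] + rng.normal(size=n_obs)

n_boot = 500
included = np.zeros(n_feat)
for _ in range(n_boot):
    idx = rng.integers(0, n_obs, size=n_obs)                    # resample observations with replacement
    model = Lasso(alpha=0.1, fit_intercept=False).fit(X[idx], y[idx])
    included += (model.coef_ != 0)                              # record which features survive

inclusion_prob = included / n_boot
top_features = np.argsort(inclusion_prob)[::-1][:5]             # rank features by inclusion frequency
print(top_features, inclusion_prob[top_features])
```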

Results: Top Topic

Comments: the general procedure is first to apply unsupervised learning (e.g. LSA, probabilistic LSA, LDA), and then to implement the LASSO for feature selection.

Results: Top Topic
1-Year Spot Rate:

Results: Top Topic
5-Year, 5-Year Forward Rate

Application: Mixed Membership Model
We will next discuss latent topics in mixed-membership models. The framework builds on the probabilistic framework with a Bayesian view.
In practice, we might imagine that documents cover more than one topic.
Examples: State-of-the-Union Addresses discuss domestic and foreign policy; monetary policy speeches discuss inflation and growth.
Models that associate observations with more than one latent variable are called mixed-membership models. They are also relevant outside of text mining: in models of group formation, agents can be associated with different latent communities (sports team, workplace, church, etc.).

Discriminative vs Generative Model
Discriminative classifiers refer to estimated models of the form p(y|x), which can be applied directly to text data.
Typical discriminative models include generalized linear models such as the logit (logistic regression) model.
Recall that a generative classifier estimates the full joint distribution p(y_i, x_i).
Efron (1975)[1] shows that discriminative classifiers obtain a lower asymptotic error than generative ones.

[1] JASA, "The efficiency of logistic regression compared to Normal Discriminant Analysis".
Discriminative vs Generative Model

Why then study generative classifiers?
1. Ng and Jordan (2001)[2] show that generative classifiers can approach their (higher) asymptotic error faster.
2. They can reveal interesting structure.
Applying a generative classifier requires a probability model for x_i, which we have developed in previous lectures.
In economics/finance, structural models with Bayesian inference are widely used in macro-finance to capture beliefs quantitatively.

[2] NIPS, "On Discriminative vs Generative Classifiers: A comparison of logistic regression and naive Bayes".
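As an illustrative sketch (not from the lecture), the two approaches can be compared on a hypothetical document-term count matrix: multinomial naive Bayes is a generative classifier (it models p(x|y)p(y)), while logistic regression is discriminative (it models p(y|x)):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical document-term counts: 200 docs, 50 terms, two classes with
# different multinomial term distributions.
V, n = 50, 200
y = rng.integers(0, 2, size=n)
beta0 = rng.dirichlet(np.ones(V))
beta1 = rng.dirichlet(np.ones(V) * 0.3)
X = np.array([rng.multinomial(80, beta1 if label else beta0) for label in y])

gen = MultinomialNB().fit(X, y)                       # generative: models p(x | y) p(y)
disc = LogisticRegression(max_iter=1000).fit(X, y)    # discriminative: models p(y | x)
print(gen.score(X, y), disc.score(X, y))              # in-sample accuracy, for illustration only
```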
Conclusion

In MLE we treat parameters as constants and choose them to maximize the likelihood function. In Bayesian estimation, we treat them as random variables and compute a posterior distribution given observed data.
In models with a large number of parameters, Bayesian inference can
be more robust and avoids over-sensitivity to sparse data.
Outside of special cases, obtaining closed-form solutions for the
posterior is impossible; this held back Bayesian methods for decades.
Computation is a large component of Bayesian machine learning (we
will see a simple example later).
