
Text Mining in Finance and Economics

Introduction to Probabilistic Modeling

Yi Zhang

The Hong Kong University of Science and Technology (Guangzhou)

January 23, 2024

Introduction
We now introduce basic models for the probabilistic modeling of
discrete data in the rest of the lecture.
Probability models are sometimes called generative models in the
machine learning literature.
Compared to LSA and word2vec, probability models have several advantages:
1. They specify an explicit data-generating process for text.
2. They make clear the statistical foundations for dimensionality reduction, allowing for well-defined inference procedures.
3. The latent components onto which the data are projected are easier to interpret.
4. They are relatively straightforward to extend to incorporate additional dependencies of interest.
Probability Models
As a stepping stone for building dimensionality reduction algorithms,
we will introduce basic models for the probabilistic modeling of
discrete data in the rest of the lecture.
Probability models are sometimes called generative models in the
machine learning literature.
1. In supervised learning, generative models allow us to model the full joint distribution p(y_d, x_d), which we revisit in the final lecture.
2. In unsupervised learning, generative models allow us to give a statistical interpretation to the hidden structure in a corpus.
Note: For now, we ignore document heterogeneity, and instead
introduce models that will form the building blocks for probabilistic
unsupervised learning for a given document.
Simple Probability Model
Consider the list of terms w = (w_1, w_2, ..., w_N), where w_n ∈ {1, 2, ..., V}.
Suppose that each term is iid, and that β_v = Prob(w_n = v) ∈ [0, 1].
Let β = (β_1, β_2, ..., β_V) ∈ Δ^{V−1} be the parameter vector we want to estimate.
The probability of the data given the parameters is
$$\mathrm{Prob}(w \mid \beta) = \prod_n \sum_v \mathbf{1}(w_n = v)\,\beta_v = \prod_v \beta_v^{x_v},$$
where x_v is the count of term v in w.


Note that term counts are a sufficient statistic for w in the
estimation of β. The independence assumption provides statistical
foundations for the bag-of-words model.
Maximum Likelihood Inference

We can estimate β_v with maximum likelihood. The Lagrangian is
$$\mathcal{L} = \underbrace{\sum_v x_v \log(\beta_v)}_{\text{log-likelihood}} + \underbrace{\lambda\Big(1 - \sum_v \beta_v\Big)}_{\text{constraint on } \beta}.$$
The first-order condition is
$$\frac{x_v}{\beta_v} - \lambda = 0 \;\Rightarrow\; \beta_v = \frac{x_v}{\lambda}.$$
The constraint gives
$$\sum_v \frac{x_v}{\lambda} = 1 \;\Rightarrow\; \lambda = \sum_v x_v = N.$$
So the MLE estimate is β̂_v = x_v / N, the frequency of term v in the list of terms.
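As a concrete illustration (a minimal Python sketch with a hypothetical token list and vocabulary, not code from the lecture), the MLE reduces to counting terms and normalizing:

```python
from collections import Counter

# Hypothetical tokenized document and vocabulary (purely illustrative).
tokens = ["rates", "inflation", "rates", "growth", "inflation", "rates"]
vocab = ["rates", "inflation", "growth", "unemployment"]

# Term counts x_v are a sufficient statistic for w.
counts = Counter(tokens)
x = [counts.get(v, 0) for v in vocab]
N = sum(x)

# MLE: beta_hat_v = x_v / N, i.e. the observed frequency of each term.
beta_hat = [x_v / N for x_v in x]
print(dict(zip(vocab, beta_hat)))  # 'unemployment' gets probability exactly 0
```

Note that any term that does not appear in w receives an estimated probability of exactly zero, which is the issue taken up on the next slides.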

Implications of MLE I

Suppose you do not speak Portuguese, but someone lists for you
10,000 possible words the spoken language might contain.
You are then shown a single snippet of text ‘eles bebem’. The
parameters that best explain this data put 1/2 probability each on
‘eles’ and on ‘bebem’ and 0 on every other possible word.

Implications of MLE II

Is this a reasonable model? We ‘know’ that working languages contain hundreds of regularly spoken words; we ‘know’ that the distribution of word frequencies is highly skewed; we ‘know’ that the language is similar to Spanish, and should inherit a similar frequency distribution; and so on.
The MLE estimate relies solely on the data we observe.
A more subtle problem is to take V to be the number of unique observations, which may be misleading even with large samples (the black swan paradox).

Bayesian Inference
Bayesian inference treats β as a random variable drawn from a prior
distribution, which can encode any knowledge we might have.
On the other hand, we treat the data as a fixed quantity that
provides information about β.
The likelihood principle states that all relevant information about an
unknown quantity θ is contained in the likelihood function of θ for
the given data (Berger and Wolpert 1988).
Bayesian inference is consistent with the likelihood principle,
frequentist reasoning need not be.
"Many Bayesians became Bayesians only because the LP left them little choice" (Berger and Wolpert 1988).

Bayes’ Rule

Bayesian inference is operationalized via the application of Bayes' rule:
$$p(\beta \mid w) = \frac{p(w \mid \beta)\, p(\beta)}{p(w)}$$
where:
p(β|w) is the posterior distribution.
p(w|β) is the likelihood function.
p(β) is the prior distribution.
p(w) is a normalizing constant, sometimes called the evidence.
The prior is often parametrized by hyperparameters.

What is a Bayesian Estimate?
There are several ways of reporting the Bayesian estimate of β:

1. The MAP estimate: the value at which the posterior distribution is highest, i.e. its mode.
2. The expected value of β under the posterior.
3. A point estimate chosen to minimize some expected loss function.
4. A credible interval Prob(β ∈ A | w) for some set A.
We are also sometimes interested in Prob(w_{N+1} | w) for some unseen data w_{N+1}. This is called the predictive distribution.
All of these depend fundamentally on the posterior distribution.
If we can compute the posterior, we can do Bayesian inference.
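To make these reporting options concrete, here is a minimal sketch for the special case of a two-term vocabulary, where (anticipating the Dirichlet results later in the lecture) the posterior is a Beta distribution; the prior and counts are hypothetical:

```python
from scipy import stats

# Hypothetical two-term vocabulary: Beta(2, 2) prior and observed counts x = (7, 3),
# so the posterior over beta_1 is Beta(2 + 7, 2 + 3).
a, b = 2 + 7, 2 + 3
posterior = stats.beta(a, b)

map_estimate = (a - 1) / (a + b - 2)     # 1. posterior mode (MAP)
posterior_mean = posterior.mean()        # 2. expected value under the posterior
credible_90 = posterior.interval(0.90)   # 4. central 90% credible interval

print(map_estimate, posterior_mean, credible_90)
```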

Penalized Regression Revisited
Consider the parametric linear regression model in which y_i ∼ N(x_i^⊤ β, σ²), where σ is known.
Suppose we draw each regression coefficient β_j from a normal prior N(0, τ²).
The posterior distribution over β is then proportional to p(y|β, X) p(β).
We know that
$$p(y \mid \beta, X)\, p(\beta) \propto \prod_i \exp\Big\{-\frac{(y_i - x_i^\top \beta)^2}{2\sigma^2}\Big\} \prod_j \exp\Big\{-\frac{\beta_j^2}{2\tau^2}\Big\}.$$
The MAP estimate can therefore be obtained by minimizing
$$\sum_i (y_i - x_i^\top \beta)^2 + \frac{\sigma^2}{\tau^2} \sum_j \beta_j^2,$$
which is exactly the ridge regression objective.
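A minimal numerical sketch of this equivalence (synthetic data, assumed values of σ and τ): the MAP estimate under the Gaussian prior is the ridge solution with penalty λ = σ²/τ².

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data, purely illustrative.
n, p = 200, 5
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.0, 0.5, 0.0])
sigma, tau = 1.0, 0.5
y = X @ beta_true + sigma * rng.normal(size=n)

# MAP under a N(0, tau^2) prior = ridge with penalty lambda = sigma^2 / tau^2.
lam = sigma**2 / tau**2
beta_map = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta_map)
```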
LASSO

Now suppose we draw each regression coefficient β_j from a Laplace prior, so that Pr(β_j | λ) ∝ exp(−λ|β_j|).
The Laplace distribution has a spike at 0, which promotes sparsity.
The objective function for MAP estimation can be written
$$\mathrm{RSS}(\beta) + \lambda \sum_j |\beta_j|.$$
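Correspondingly, a hedged sketch of MAP estimation under the Laplace prior via scikit-learn's Lasso (note that sklearn scales the residual sum of squares by 1/(2n), so its alpha corresponds to the λ above only up to that rescaling); the data are synthetic:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.0, 0.5, 0.0])
y = X @ beta_true + rng.normal(size=n)

# MAP under a Laplace prior corresponds to the lasso objective
# (1/(2n)) * RSS(beta) + alpha * sum_j |beta_j| in sklearn's parametrization.
lasso = Lasso(alpha=0.1, fit_intercept=False)
lasso.fit(X, y)
print(lasso.coef_)  # coefficients on irrelevant regressors are typically shrunk to exactly 0
```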

Choosing Priors I

A popular choice for the prior distribution is that it be conjugate, i.e. the posterior distribution belongs to the same parametric family as the prior. This facilitates analytic computation of the posterior.
All distributions in the exponential family have conjugate prior
distributions...but are they meaningful?
The Dirichlet distribution is conjugate to the categorical/multinomial
distributions (as we shall see).

Choosing Priors II

When the conjugate prior is not sufficiently expressive, we can adopt another prior and simulate the posterior distribution.
For example, the log-normal distribution more naturally embeds dependence on covariates and correlation in text.
Once we choose a prior, we still need to choose hyperparameters: we can set them at some value consistent with our domain-specific knowledge, or select them to maximize the evidence (empirical Bayes).

Dirichlet Prior
The Dirichlet distribution is parametrized by η = (η_1, η_2, ..., η_V); it is defined on the (V−1)-simplex and has probability density function
$$\mathrm{Dir}(\beta \mid \eta) \propto \prod_v \beta_v^{\eta_v - 1}.$$
The normalization constant is
$$B(\eta) = \prod_v \Gamma(\eta_v) \Big/ \Gamma\Big(\sum_v \eta_v\Big).$$
The marginal distribution of β_v is Beta(η_v, Σ_{v'} η_{v'} − η_v). The mean and variance are
$$\mathbb{E}[\beta_v] = \frac{\eta_v}{\sum_{v'} \eta_{v'}}, \qquad \mathrm{Var}(\beta_v) = \frac{\eta_v\big(\sum_{v'} \eta_{v'} - \eta_v\big)}{\big(\sum_{v'} \eta_{v'}\big)^2\big(\sum_{v'} \eta_{v'} + 1\big)}.$$
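A small sketch of drawing from symmetric Dirichlet distributions and checking the mean formula empirically (the vocabulary size and concentration values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
V = 25

# Symmetric Dirichlet draws for different concentration parameters eta.
for eta in [0.5, 1.0, 5.0]:
    draws = rng.dirichlet(np.full(V, eta), size=5000)   # each row sums to 1
    print(eta, draws.mean(axis=0)[:3])                  # empirical mean ~ eta / (V * eta) = 1/V
    # eta < 1: most mass near the corners of the simplex; eta > 1: mass near the center.
```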
Interpreting the Dirichlet

Consider a symmetric Dirichlet in which η_v ≡ η. This is agnostic about favoring one component over another.
Here the η parameter measures the concentration of the distribution on the center of the simplex, where the mass on each term is more evenly spread:
1. η = 1 is a uniform distribution over the simplex.
2. η > 1 puts relatively more weight in the center of the simplex.
3. η < 1 puts relatively more weight on the corners of the simplex.
When V = 2, the Dirichlet becomes the beta distribution.

Beta with Different Parameters

Draws from Dirichlet with η = 1

Draws from Dirichlet with η = 0.5

Draws from Dirichlet with η = 5

Example

Suppose we begin with a possible vocabulary of size V = 25.
We observe N = 50 total words, and term v appears 5 times.
The MLE point estimate of β_v is 0.1.
The Bayesian estimate of β_v depends on the prior.
Consider a symmetric Dirichlet with hyperparameter η.

Sparsity-inducing prior (η = 2)

Uniform prior (η = 1)

Density-inducing prior (η = 5)

Data Overwhelming the Prior

$$\Pr(\beta \mid w) \propto \Pr(w \mid \beta)\, p(\beta) \propto \prod_v \beta_v^{x_v} \prod_v \beta_v^{\eta_v - 1} = \prod_v \beta_v^{\eta_v + x_v - 1}$$
The posterior is a Dirichlet with parameters (η′_1, η′_2, ..., η′_V), where η′_v ≡ η_v + x_v: we add the term counts to the prior distribution's parameters to form the posterior distribution. The Dirichlet hyperparameters can thus be viewed as pseudo-counts, i.e. observations made before observing w.
Therefore, we obtain the estimator for β
$$\mathbb{E}[\beta_v \mid w] = \frac{\eta_v + x_v}{\sum_{v'} \eta_{v'} + N},$$
which also corresponds to the predictive distribution Prob(w_{N+1} = v | w). The MAP estimator of β_v is
$$\frac{\eta_v + x_v - 1}{\sum_{v'} (\eta_{v'} + x_{v'}) - V}.$$
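A minimal sketch of this conjugate update, using the earlier example (V = 25, N = 50, x_v = 5) and a symmetric Dirichlet(η) prior:

```python
# Posterior summaries for a single term under a symmetric Dirichlet(eta) prior.
V, N, x_v = 25, 50, 5

for eta in [0.5, 1.0, 2.0, 5.0]:
    post_mean = (eta + x_v) / (V * eta + N)          # E[beta_v | w]
    post_map = (eta + x_v - 1) / (V * eta + N - V)   # posterior mode (valid here since eta + x_v > 1)
    print(eta, round(post_mean, 4), round(post_map, 4))

# For reference, the MLE is x_v / N = 0.1 and the prior mean is 1/V = 0.04.
```

With η = 1 the MAP estimate coincides with the MLE of 0.1, while the posterior mean is pulled toward the prior mean 1/V = 0.04; larger η strengthens that pull.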
Example: Unigram Model

Suppose that β follows the Dirichlet distribution and the observed counts x_v follow the multinomial distribution, with hyperparameter η_v = 1 for every v.
The posterior predictive distribution is
$$\Pr(w_{N+1} = v \mid w) = \Big(\tfrac{3}{27}, \tfrac{5}{27}, \tfrac{5}{27}, \tfrac{1}{27}, \tfrac{2}{27}, \tfrac{2}{27}, \tfrac{1}{27}, \tfrac{2}{27}, \tfrac{1}{27}, \tfrac{5}{27}\Big)$$
for v = 1, 2, ..., 10 respectively, where v is the token number (this corresponds to observed counts x = (2, 4, 4, 0, 1, 1, 0, 1, 0, 4), so that N = 17 and Σ_v η_v + N = 27).
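A sketch verifying this predictive distribution, assuming the term counts implied by the fractions above (x_v = numerator − 1, since η_v = 1):

```python
import numpy as np

eta = np.ones(10)                                  # symmetric Dirichlet prior with eta_v = 1
x = np.array([2, 4, 4, 0, 1, 1, 0, 1, 0, 4])       # counts implied by the example (N = 17)

# Posterior predictive: Prob(w_{N+1} = v | w) = (eta_v + x_v) / (sum(eta) + N)
pred = (eta + x) / (eta.sum() + x.sum())
print(pred * 27)   # -> [3. 5. 5. 1. 2. 2. 1. 2. 1. 5.], i.e. the fractions over 27
```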

Application: Feature Selection

In recent work, Hansen, McMahon and Tong (2019, JME) study the impact of the release of the Bank of England's Inflation Report (IR) on bond price changes at different maturities.
The IR contains forecast variables we use as controls: (i) the mode, variance, and skewness of the inflation and GDP forecasts; (ii) their difference from the previous forecast.
To represent the text, we estimate a 30-topic model and represent each IR in terms of (i) topic shares and (ii) the evolution of topic shares from the previous IR.
The first step in the analysis is to partial out the forecast variables from bond price moves and topic shares by constructing residuals.

Application: Which Information Matters?

LASSO selects dozens of features at all maturities: the standard over-selection problem.
How can we identify the key topics?
We apply a non-parametric bootstrap to simulate the "inclusion probabilities" of topic features at different maturities:
Draw with replacement from our 69 observations to obtain a new sample, perform LASSO, and record whether each feature is included.
Repeat 500 times, and rank topics according to the fraction of bootstrap draws in which they appear.
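A hedged sketch of this bootstrap procedure; the data arrays, feature count, and LASSO penalty below are placeholders rather than the paper's actual specification:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Placeholder residualized data: 69 observations, 60 topic features (shares + changes).
n_obs, n_feat = 69, 60
X = rng.normal(size=(n_obs, n_feat))
y = 0.8 * X[:, 0] - 0.5 * X[:, 3] + rng.normal(size=n_obs)

n_boot = 500
included = np.zeros(n_feat)
for _ in range(n_boot):
    idx = rng.integers(0, n_obs, size=n_obs)                    # resample observations with replacement
    model = Lasso(alpha=0.1, fit_intercept=False).fit(X[idx], y[idx])
    included += (model.coef_ != 0)                              # record which features survive

inclusion_prob = included / n_boot
top_features = np.argsort(inclusion_prob)[::-1][:5]             # rank features by inclusion frequency
print(top_features, inclusion_prob[top_features])
```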

Results: Top Topic

Comments: the general procedure is first to apply unsupervised learning (e.g. LSA, probabilistic LSA, LDA), and then to implement the LASSO for feature selection.

Results: Top Topic
1-Year Spot Rate:

Results: Top Topic
5-Year, 5-Year Forward Rate

Application: Mixed Membership Model
We will next discuss latent topics in mixed-membership models. The framework builds on the probabilistic framework with a Bayesian view.
In practice, we might imagine that documents cover more than one topic.
Examples: State-of-the-Union Addresses discuss domestic and foreign policy; monetary policy speeches discuss inflation and growth.
Models that associate observations with more than one latent variable are called mixed-membership models. They are also relevant outside of text mining: in models of group formation, agents can be associated with different latent communities (sports team, workplace, church, etc.).

Discriminative vs Generative Model
Discriminative classifiers refer to estimated models of the form p(y|x), which can be applied directly to text data.
Typical discriminative models include generalized linear models such as the logit (logistic regression) model.
Recall that a generative classifier estimates the full joint distribution p(y_i, x_i).
Efron (1975)[1] shows that discriminative classifiers obtain a lower asymptotic error than generative ones.

[1] JASA, "The efficiency of logistic regression compared to Normal Discriminant Analysis".
Discriminative vs Generative Model

Why then study generative classifiers?
1. Ng and Jordan (2001)[2] show that generative classifiers can approach their (higher) asymptotic error faster.
2. They can reveal interesting structure.
Applying a generative classifier requires a probability model for x_i, which we have developed in previous lectures.
In economics/finance, structural models with Bayesian inference are widely used in macro-finance to capture beliefs quantitatively.

[2] NIPS, "On Discriminative vs Generative Classifiers: A comparison of logistic regression and naive Bayes".
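As an illustrative sketch (not from the lecture), the two approaches can be compared on a hypothetical document-term count matrix: multinomial naive Bayes is a generative classifier (it models p(x|y)p(y)), while logistic regression is discriminative (it models p(y|x)):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical document-term counts: 200 docs, 50 terms, two classes with
# different multinomial term distributions.
V, n = 50, 200
y = rng.integers(0, 2, size=n)
beta0 = rng.dirichlet(np.ones(V))
beta1 = rng.dirichlet(np.ones(V) * 0.3)
X = np.array([rng.multinomial(80, beta1 if label else beta0) for label in y])

gen = MultinomialNB().fit(X, y)                       # generative: models p(x | y) p(y)
disc = LogisticRegression(max_iter=1000).fit(X, y)    # discriminative: models p(y | x)
print(gen.score(X, y), disc.score(X, y))              # in-sample accuracy, for illustration only
```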
Conclusion

In MLE we treat parameters as constants and choose them to maximize the likelihood function. In Bayesian estimation, we treat them as random variables and compute a posterior distribution given observed data.
In models with a large number of parameters, Bayesian inference can
be more robust and avoids over-sensitivity to sparse data.
Outside of special cases, obtaining closed-form solutions for the
posterior is impossible; this held back Bayesian methods for decades.
Computation is a large component of Bayesian machine learning (we
will see a simple example later).
