
Text Mining in Finance and Economics

Lecture 3: Vector Space Models

Dr. Yi Zhang

The Hong Kong University of Science and Technology (Guangzhou)

February 23, 2024

Introduction

In the previous lecture we studied dictionary methods for reducing the dimensionality of the document-term matrix.

We now focus on methods that exploit the full V-dimensional feature space of X.

The starting point is to view the rows of X as lying in a V-dimensional vector space.

The basis for the vector space is e_1, e_2, ..., e_V.

Three Documents

Distance in the Vector Space

An initial question of interest is how similar any two documents are in the vector space.

An initial instinct might be to use Euclidean distance:

    √( Σ_v (x_{i,v} − x_{j,v})² )

What is the problem with Euclidean distance? How can we correct for it?

Cosine Similarity

Define the cosine similarity between documents i and j as:

    CS(i, j) = (x_i · x_j) / (∥x_i∥ ∥x_j∥)

1. Since document vectors have no negative elements, CS(i, j) ∈ [0, 1].
2. x_i / ∥x_i∥ has unit length, which corrects for differences in document length.
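A minimal Python sketch of this computation (the vocabulary and counts below are hypothetical):

```python
import numpy as np

def cosine_similarity(x_i, x_j):
    """Cosine similarity between two document count vectors."""
    x_i, x_j = np.asarray(x_i, dtype=float), np.asarray(x_j, dtype=float)
    return x_i @ x_j / (np.linalg.norm(x_i) * np.linalg.norm(x_j))

# Toy document-term counts over a 5-word vocabulary.
doc_a = [2, 0, 1, 0, 3]
doc_b = [4, 0, 2, 0, 6]   # same proportions as doc_a, but twice as long
doc_c = [0, 3, 0, 2, 0]

print(cosine_similarity(doc_a, doc_b))  # 1.0: same direction despite different lengths
print(cosine_similarity(doc_a, doc_c))  # 0.0: no shared terms
```

Note that Euclidean distance would treat doc_a and doc_b as far apart even though their term proportions are identical; the length normalization removes that effect.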

Application: Product Similarity
An important theoretical concept in industrial organization is a firm's location in product space.

Industry classification measures are quite crude proxies for this.

Hoberg and Phillips (2010) take product descriptions from 49,408 10-K filings and use the vector space model (with bit vectors defined by dictionaries) to compute similarity between firms.

General question: how do firms conduct mergers and acquisitions through product differentiation and market competition?

Data source available at http://alex2.umd.edu/industrydata/

Distorted Distance in Vector Space
Latent variable representations can more accurately identify document similarity.

The problem of synonymy is that several different words can be associated with the same topic. What is the cosine similarity between the following documents?

The problem of polysemy is that the same word can have multiple meanings. What is the cosine similarity between the following documents?
Word Embedding

A word embedding is a low-dimensional vector representation of a word.
Ideally, in this low-dimensional vector space, words with similar meanings will lie close together.
The construction of word embeddings has been a major topic in NLP in the past few years.
Embedding vectors can be of interest in their own right, or else form part of the representation of a document for other tasks.
A closely related idea is a document embedding: a low-dimensional representation of a document.

Latent Semantic Analysis

One of the first NLP models for finding low-dimensional structure in a corpus is the Latent Semantic Analysis/Indexing model of Deerwester et al. (1990).
A linear algebra approach that applies a singular value decomposition to the document-term matrix.
Closely related to classical principal components analysis.
Examples in economics: Boukus and Rosenberg (2006); Hendry and Madeley (2010); Acosta (2014); Waldinger et al. (2018).

Singular Value Decomposition

The term-document matrix X (the transpose of the document-term matrix), of size m × n with rank(X) = r, is not square, but we can decompose it using a generalization of the eigenvector decomposition called the singular value decomposition.

Proposition
The term-document matrix can be written as X = TΣD′, where T is an m × r semi-orthogonal matrix, D is an n × r semi-orthogonal matrix, and Σ is an r × r diagonal matrix with Σ_{i,i} = σ_i, σ_{i+1} ≤ σ_i, and Σ_{i,j} = 0 for all i ≠ j.

Approximating the Term-Document Matrix
We can obtain a rank-k approximation of the term-document matrix, X_k, by constructing X_k = TΣ_kD′, where Σ_k is the diagonal matrix formed by setting Σ_{i,i} = 0 for i > k.
The idea is to keep the "content" dimensions that explain common variation across terms and documents and drop "noise" dimensions that represent idiosyncratic variation.
Often k is selected to explain a fixed proportion p of the variance in the data. In this case k is the smallest value that satisfies:

    Σ_{i=1}^{k} σ_i² / Σ_i σ_i² ≥ p

We can then perform the same operations on X_k as on X, e.g. cosine similarity.
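A minimal numpy sketch of this procedure, using a randomly generated count matrix in place of a real corpus (the dimensions and the threshold p are illustrative assumptions):

```python
import numpy as np

# Hypothetical term-document count matrix X (m terms x n documents).
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(20, 8)).astype(float)

# Full SVD: X = T diag(sigma) D'.
T, sigma, Dt = np.linalg.svd(X, full_matrices=False)

# Smallest k explaining at least a proportion p of the variance.
p = 0.7
explained = np.cumsum(sigma**2) / np.sum(sigma**2)
k = int(np.searchsorted(explained, p) + 1)

# Rank-k approximation X_k = T_k diag(sigma_k) D_k'.
X_k = T[:, :k] @ np.diag(sigma[:k]) @ Dt[:k, :]

# Cosine similarity between documents 0 and 1 in the smoothed space.
d0, d1 = X_k[:, 0], X_k[:, 1]
print(k, d0 @ d1 / (np.linalg.norm(d0) * np.linalg.norm(d1)))
```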
Visualization

Intuition:
(Terms × Docs) ≈ (Terms × Topics) × (Topics × Docs)

Example

[Deerwester et al., 1990]

Example

The singular values are

Σ = diag(3.34, 2.54, 2.35, 1.64, 1.50, 1.31, 0.85, 0.56, 0.36)

Example

Example

Example: Cosine Similarity

1. Comparing two terms: X_kX_k′ = TΣ_k²T′, so terms i and j can be compared using rows i and j of TΣ_k.
2. Comparing two documents: X_k′X_k = DΣ_k²D′, so documents i and j can be compared using rows i and j of DΣ_k.

Application

Concept detection: [Boukus and Rosenberg, 2006] apply LSA to central bank communication documents and relate document representations to market responses.

Distance between documents:

1. Iaria, Schwarz and Waldinger (2018, AER) apply LSA to scientific documents to measure overlap in research agendas across countries.
2. Ellen, Larsen and Thorsrud (2021) apply LSA to financial newspapers to derive narrative monetary policy shocks.

Statistical Models of Dimensionality Reduction

LSA has statistical foundations, but is not itself a statistical model.

Advantages of statistical models:
They make the statistical foundations for dimensionality reduction explicit, which allows for well-defined inference procedures.
The latent components onto which the data is projected are easier to interpret.
They are relatively straightforward to extend to incorporate additional dependencies of interest.

Disadvantage: they require more elaborate inference algorithms.

Word2Vec

The bag-of-words model (and LSA) entirely ignores the context in which a word appears.
More recently, algorithms have been developed to incorporate this information in the construction of word embeddings.
Let the context of w_{d,n} be the L words that precede it and the L words that follow it (2L words in total). Denote the context by C(w_{d,n}).

1. Continuous bag-of-words (CBOW) model: predict a word given its context.
2. Skip-gram model: predict the context given a word.

John R Firth

"You shall know a word by the company it keeps"


CBOW Model Description

We model the probability of observing w_{d,n} as:

    Pr[w_{d,n} = v | C(w_{d,n})] = exp(ᾱ_{d,n}ᵀ ρ_v) / Σ_{v′∈V} exp(ᾱ_{d,n}ᵀ ρ_{v′}),

where ᾱ_{d,n} = (1/2L) Σ_{w∈C(w_{d,n})} α_w, ρ_v ∈ R^K is the embedding vector for the vth vocabulary term, and α_v ∈ R^K is the context vector for the vth vocabulary term.

1. This is an example of self-supervised learning.
2. The skip-gram model instead predicts context words given a center word.
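A toy numpy sketch of this softmax probability with random parameters and hypothetical dimensions; in practice ρ and α are estimated by maximizing this likelihood over a large corpus.

```python
import numpy as np

V, K, L = 10, 4, 2                      # vocabulary size, embedding dim, half-window
rng = np.random.default_rng(0)
rho = rng.normal(size=(V, K))           # embedding vector for each vocabulary term
alpha = rng.normal(size=(V, K))         # context vector for each vocabulary term

context_ids = [1, 3, 5, 7]              # the 2L words surrounding w_{d,n}
alpha_bar = alpha[context_ids].mean(axis=0)   # average context vector

scores = rho @ alpha_bar                # alpha_bar' rho_v for every v
probs = np.exp(scores - scores.max())   # softmax with max-subtraction for stability
probs /= probs.sum()
print(probs.argmax(), probs.max())      # most likely center word under this model
```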

Terms Close to Uncertainty in FOMC Transcripts

Terms Close to Risk

GloVe

The GloVe model [Pennington, Socher and Manning, 2014] begins with a V × V matrix W of local word co-occurrence counts:
W_{i,j} is the number of times term j appears within the context of term i.
Assign to each term v an embedding vector ρ_v ∈ R^K by minimizing the weighted least-squares objective:

    min_{ρ} Σ_{i,j} f(W_{i,j}) [ρ_iᵀ ρ_j − log(W_{i,j})]²

Terms that co-occur frequently will have more highly correlated embedding vectors. f(W) is a weighting scheme that prevents rare or extremely frequent co-occurrences from being overweighted.
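A sketch of this objective for a tiny hypothetical co-occurrence matrix, following the slide's simplified form with a single embedding matrix and no bias terms; the weighting function uses the functional form and constants reported by Pennington et al. (2014).

```python
import numpy as np

def glove_loss(rho, W, x_max=100.0, a=0.75):
    """Weighted least-squares GloVe objective for co-occurrence matrix W.

    rho: V x K embedding matrix; bias terms are omitted in this sketch.
    f caps the weight of very frequent co-occurrences and downweights rare ones.
    """
    f = np.minimum((W / x_max) ** a, 1.0)
    mask = W > 0                               # log(W) is only defined for positive counts
    resid = rho @ rho.T - np.log(np.where(mask, W, 1.0))
    return np.sum(f * mask * resid**2)

# Hypothetical tiny co-occurrence matrix and random starting embeddings.
rng = np.random.default_rng(0)
W = rng.integers(0, 5, size=(6, 6)).astype(float)
rho = rng.normal(scale=0.1, size=(6, 3))
print(glove_loss(rho, W))   # in practice minimized by (stochastic) gradient descent
```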

Interpretability: Clustering in the Vector Space

Since words close together in the vector space are semantically related, we may want to group them together.
This can be cast as a clustering problem: grouping unlabeled observations into distinct groups.
There are many approaches to clustering in statistics and machine learning. One of the oldest and simplest is the k-means algorithm.
Clustering word embeddings can be viewed as a non-parametric topic model; the approach appears to produce interpretable output.

K-Means
Let V_k be the set of all embeddings that are in cluster k. The centroid of cluster k is

    u_k = (1/|V_k|) Σ_{v∈V_k} x_v,

where x_v is the embedding vector for the vth term.
In k-means we choose cluster assignments {V_1, V_2, ..., V_K} to minimize the sum of squared distances between each term and its cluster centroid:

    Σ_k Σ_{v∈V_k} ∥x_v − u_k∥²

The solution groups similar embeddings together, and the centroids represent prototype embeddings within each cluster.
Embeddings are normalized to have unit length, as before.
Solution Algorithm

First initialize the centroids u_k for k = 1, 2, ..., K.

Repeat the following steps until convergence:

1. Assign each embedding to its closest centroid, i.e. choose the assignment k for v that minimizes ∥x_v − u_k∥.
2. Recompute the cluster centroids as u_k = (1/|V_k|) Σ_{v∈V_k} x_v given the updated assignments from the previous step.

The objective function is guaranteed to decrease at each step → convergence to a local minimum.
Note: step 1 is obviously optimal; for step 2, choose the vector y that minimizes Σ_{v∈V_k} ∥x_v − y∥². The solution is y = u_k.
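A minimal numpy implementation of these two steps on hypothetical embeddings (initialization by random sampling is an assumption here; k-means++ is a common alternative):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal k-means on unit-normalized embedding vectors X (V x d)."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # normalize as on the slide
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Step 1: assign each embedding to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its assigned embeddings.
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical embeddings: 100 "terms" in a 20-dimensional space.
emb = np.random.default_rng(1).normal(size=(100, 20))
labels, centroids = kmeans(emb, K=5)
```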

Directions Encode Meaning

Importance of Training Corpus
Relationships among words can vary depending on the training corpus.
Example: word embeddings trained on Wiki/newswire text versus on the Harvard Business Review.

Application: Embedding Dictionaries

Dictionaries provide a coarse representation of concepts: some relevant terms might be missing altogether, and the strength of association with the concept isn't accounted for.
One strategy is to measure the association between documents and word lists in an embedding space rather than in the bag-of-words space.
A recent example is [Gennaro and Ash, 2022], which studies emotional language in politics using the Congressional Record corpus.

Application: Embedding Dictionaries
Set A of words represents emotion, and set C of words represents cognition (both from LIWC). The emotionality of speech i is:

    Y_i = (sim(d_i, A) + b) / (sim(d_i, C) + b)

Y_i is the emotional-to-cognitive signal measure. The vector A representing emotion is the average of the vectors w for the words in the emotion word list, w ∈ A. The vector C for cognition is defined analogously.
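A sketch of this measure, assuming an embedding lookup table and treating the word lists and the constant b as illustrative placeholders:

```python
import numpy as np

def emotionality(doc_words, emotion_words, cognition_words, emb, b=1.0):
    """Emotion-vs-cognition measure; b is a smoothing constant (value assumed here)."""
    def centroid(words):
        vecs = np.array([emb[w] for w in words if w in emb])
        v = vecs.mean(axis=0)
        return v / np.linalg.norm(v)

    # With unit-length vectors, the dot product is the cosine similarity sim(d_i, ·).
    d, A, C = centroid(doc_words), centroid(emotion_words), centroid(cognition_words)
    return (d @ A + b) / (d @ C + b)

# Hypothetical unit-normalized embeddings and word lists.
rng = np.random.default_rng(0)
emb = {}
for w in ["love", "fear", "think", "reason", "budget", "deficit"]:
    v = rng.normal(size=25)
    emb[w] = v / np.linalg.norm(v)
print(emotionality(["budget", "deficit", "fear"], ["love", "fear"], ["think", "reason"], emb))
```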
Validation:
Rationality of semantic word/sentence selection, sentiment difference, embedding clustering
Manual validation with subsamples
Application: Embedding Dictionaries
[Gennaro and Ash, 2022, EJ]

Word Cloud: Cognitive

Word Cloud: Emotional

Word Embeddings and Cultural Attitudes
Because word embeddings appear to capture semantically meaningful relationships among words, there is interest in using them to measure cultural attitudes.
In psychology there is a long-standing Implicit Association Test (IAT) that measures how quickly participants correctly classify images depending on word combinations.
The hypothesis is that reaction times are shorter when word combinations more naturally belong together, which allows a measure of bias.
Caliskan et al. (2017) use word embeddings to ask whether similar biases exist in natural language.

Implicit Association Test

Implicit Association Test: China

Word-Embedding Association Test

The Word-Embedding Association Test (WEAT) measures whether two sets of target words X, Y (e.g. male and female words) differ in their relative similarity to two sets of attribute words A, B (e.g. career and family words).
Let cos(x, y) be the cosine similarity between vectors x and y.
Let s(x, A, B) = mean_{a∈A} cos(x, a) − mean_{b∈B} cos(x, b). Then

    WEAT ≡ [mean_{x∈X} s(x, A, B) − mean_{y∈Y} s(y, A, B)] / std_{w∈X∪Y} s(w, A, B)
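A direct numpy translation of these definitions, using random vectors in place of real word embeddings:

```python
import numpy as np

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def s(w, A, B):
    """Relative association of word vector w with attribute sets A and B."""
    return np.mean([cos(w, a) for a in A]) - np.mean([cos(w, b) for b in B])

def weat(X, Y, A, B):
    """WEAT effect size for target sets X, Y and attribute sets A, B (lists of vectors)."""
    sX = [s(x, A, B) for x in X]
    sY = [s(y, A, B) for y in Y]
    return (np.mean(sX) - np.mean(sY)) / np.std(sX + sY, ddof=1)

# Hypothetical random "embeddings" standing in for the four word lists.
rng = np.random.default_rng(0)
X, Y, A, B = (list(rng.normal(size=(5, 50))) for _ in range(4))
print(weat(X, Y, A, B))
```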

IAT vs WEAT

Language and Culture (Kozlowski et. al. 2018)

Language and Culture (Kozlowski et. al. 2018)

Application: Does Language affect Decisions?

[Ash, Chen and Ornaghi (2022), AEJ] use a measure similar to WEAT to measure linguistic gender bias among judges from their written opinions.
They then match judge-specific bias scores with individual judge decisions to see whether the two are related.
The data are the universe of US appellate court decisions from 1890-2013.

WEAT and Judge Characteristics I

WEAT and Judge Characteristics II

Application: Expanding Dictionaries
One application of word embeddings is to augment human judgment in the construction of dictionaries.
The motivation is that economists are experts in which concepts might be most important in a particular setting, but not in which words relate to those concepts.
One can specify a set of seed words and then find the nearest neighbors of those words to populate a dictionary (a sketch follows the list below).
Strategy adopted by several recent papers:

1. Li, Mai, Shen and Yan (2021, RFS): corporate culture.
2. Giglio, Kuchler, Stroebel and Zeng (2023, NBER WP): biodiversity risk.
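A sketch of this seed-expansion step using gensim's pretrained GloVe vectors (the pretrained model name is one of gensim's hosted downloads; the seed words are purely illustrative):

```python
import gensim.downloader as api

# Load pretrained word vectors (downloads on first use).
kv = api.load("glove-wiki-gigaword-100")

# Illustrative seed words for a hypothetical "innovation" dictionary.
seeds = ["innovation", "creative", "technology"]
candidates = kv.most_similar(positive=seeds, topn=20)   # nearest neighbours in embedding space
for word, score in candidates:
    print(f"{word}\t{score:.3f}")
# A researcher would screen these candidates by hand before adding them to the dictionary.
```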
Embedding-Based Document Similarity

Several papers use the distance between documents as captured by average embedding vectors.
[Kogan et al., 2019] measure the distance between patents and occupation descriptions to proxy the exposure of jobs to technical change.
[Hansen et al., 2021] measure the distance between O*NET occupation descriptions and job postings to proxy skill demand.
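A minimal sketch of this approach: represent each document by the average of its word embeddings and compare documents with cosine similarity (the embedding lookup and tokens below are hypothetical):

```python
import numpy as np

def doc_vector(tokens, emb):
    """Average the embeddings of a document's tokens (a common, simple choice)."""
    vecs = np.array([emb[t] for t in tokens if t in emb])
    v = vecs.mean(axis=0)
    return v / np.linalg.norm(v)

def doc_similarity(tokens_a, tokens_b, emb):
    # Both document vectors are unit length, so the dot product is cosine similarity.
    return doc_vector(tokens_a, emb) @ doc_vector(tokens_b, emb)

# Hypothetical embedding lookup and documents.
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ["robot", "assembly", "welding", "paint", "audit"]}
print(doc_similarity(["robot", "welding"], ["robot", "assembly", "paint"], emb))
```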

Choosing Among Algorithms

By now we have seen multiple algorithms for document similarity, but have provided no means to assess which one to choose.
Does the choice of algorithm matter?
We evaluate document similarity in the context of 10-K risk factors using randomly sampled pairs from the universe of 2019 filing firms.
We keep the data constant and vary the algorithm used for the similarity comparison.

How to Proceed?

There is a clear need for some validation task against which to compare these algorithms.
It is unclear whether existing ideas from computer science are relevant in economics.
There is an inevitable need for some human input to judge a baseline truth.
Ideally one would find validation tasks that are relevant across research questions, so that each author doesn't have to start from scratch.

Transfer Learning

In the examples above, Word2Vec is fit directly to the data of interest.
In many use cases, one instead uses a model estimated on an auxiliary corpus, either for the word embeddings themselves or as starting values in embedding estimation.
This is known as transfer learning, and it becomes particularly important for large-scale language models.
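A sketch of the two options with gensim (assuming gensim ≥ 4): fitting Word2Vec to one's own toy corpus versus loading embeddings pretrained on a large auxiliary corpus.

```python
from gensim.models import Word2Vec
import gensim.downloader as api

# Option 1: fit Word2Vec directly to the corpus of interest (toy corpus shown here).
corpus = [["rates", "were", "raised"], ["inflation", "expectations", "anchored"]]
own_model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)

# Option 2 (transfer learning): reuse embeddings estimated on a large auxiliary corpus.
pretrained = api.load("glove-wiki-gigaword-100")
print(pretrained.most_similar("inflation", topn=5))
```

Whether the auxiliary corpus is appropriate depends on the application: as the earlier slide on training corpora noted, word relationships learned on newswire text may differ from those in domain-specific documents.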

Conclusion

We have taken first steps in using a rich feature space in a corpus to quantify text.
The key idea is the need to work in a low-dimensional space to obtain useful representations of high-dimensional objects.
While word embeddings have proved useful for a range of tasks, why they uncover meaning is still unclear.
Fully probabilistic models have a more transparent structure, which we begin to discuss next.

