
Text Mining in Finance and Economics

Lecture 3: Vector Space Models

Dr. Yi Zhang

The Hong Kong University of Science and Technology (Guangzhou)

February 23, 2024

Introduction

In the previous lecture we studied dictionary methods for reducing the dimensionality of the document-term matrix.

We now focus on methods that exploit the full V-dimensional feature space of X.

The starting point is to view the rows of X as lying in a V-dimensional vector space.

The basis for the vector space is e_1, e_2, ..., e_V.

Three Documents

Distance in the Vector Space

An initial question of interest is how similar any two documents are in the vector space.

An initial instinct might be to use Euclidean distance:

    √( Σ_v (x_{i,v} − x_{j,v})² )

What is the problem with Euclidean distance? How can we correct for it?

Cosine Similarity

Define the cosine similarity between documents i and j as:

    CS(i, j) = (x_i · x_j) / (∥x_i∥ ∥x_j∥)

1. Since document vectors have no negative elements, CS(i, j) ∈ [0, 1].
2. x_i / ∥x_i∥ has unit length, which corrects for differences in document length.
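A minimal Python sketch of this computation (the vocabulary and counts below are hypothetical):

```python
import numpy as np

def cosine_similarity(x_i, x_j):
    """Cosine similarity between two document count vectors."""
    x_i, x_j = np.asarray(x_i, dtype=float), np.asarray(x_j, dtype=float)
    return x_i @ x_j / (np.linalg.norm(x_i) * np.linalg.norm(x_j))

# Toy document-term counts over a 5-word vocabulary.
doc_a = [2, 0, 1, 0, 3]
doc_b = [4, 0, 2, 0, 6]   # same proportions as doc_a, but twice as long
doc_c = [0, 3, 0, 2, 0]

print(cosine_similarity(doc_a, doc_b))  # 1.0: same direction despite different lengths
print(cosine_similarity(doc_a, doc_c))  # 0.0: no shared terms
```

Note that Euclidean distance would treat doc_a and doc_b as far apart even though their term proportions are identical; the length normalization removes that effect.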

Application: Product Similarity
An important theoretical concept in industrial organization is a firm's location in product space.

Industry classification measures are quite crude proxies for this.

Hoberg and Phillips (2010) take product descriptions from 49,408 10-K filings and use the vector space model (with bit vectors defined by dictionaries) to compute similarity between firms.

General question: how do firms conduct mergers and acquisitions through product differentiation and market competition?

Data source available at http://alex2.umd.edu/industrydata/

Distorted Distance in Vector Space
Latent variable representations can more accurately identify document similarity.

The problem of synonymy is that several different words can be associated with the same topic. What is the cosine similarity between the following documents?

The problem of polysemy is that the same word can have multiple meanings. What is the cosine similarity between the following documents?
Word Embedding

A word embedding is a low-dimensional vector representation of a word.
Ideally, in this low-dimensional vector space, words with similar meanings will lie close together.
The construction of word embeddings has been a major topic in NLP in the past few years.
Embedding vectors can be of interest in their own right, or else form part of the representation of a document for other tasks.
A closely related idea is a document embedding: a low-dimensional representation of a document.

Latent Semantic Analysis

One of the first NLP models for finding low-dimensional structure in a corpus is the Latent Semantic Analysis/Indexing model of Deerwester et al. (1990).
A linear algebra approach that applies a singular value decomposition to the document-term matrix.
Closely related to classical principal components analysis.
Examples in economics: Boukus and Rosenberg (2006); Hendry and Madeley (2010); Acosta (2014); Waldinger et al. (2018).

Singular Value Decomposition

The term-document matrix X (the transpose of the document-term matrix), of size m × n with rank(X) = r, is not square, but we can decompose it using a generalization of the eigenvector decomposition called the singular value decomposition.

Proposition
The term-document matrix can be written as X = TΣD′, where T is an m × r semi-orthogonal matrix, D is an n × r semi-orthogonal matrix, and Σ is an r × r diagonal matrix with Σ_{i,i} = σ_i, σ_{i+1} ≤ σ_i, and Σ_{i,j} = 0 for all i ≠ j.

Approximating the Term-Document Matrix
We can obtain a rank-k approximation of the term-document matrix, X_k, by constructing X_k = TΣ_kD′, where Σ_k is the diagonal matrix formed by setting Σ_{i,i} = 0 for i > k.
The idea is to keep the "content" dimensions that explain common variation across terms and documents and drop "noise" dimensions that represent idiosyncratic variation.
Often k is selected to explain a fixed proportion p of the variance in the data. In this case k is the smallest value that satisfies:

    Σ_{i=1}^{k} σ_i² / Σ_i σ_i² ≥ p

We can then perform the same operations on X_k as on X, e.g. cosine similarity.
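A minimal numpy sketch of this procedure, using a randomly generated count matrix in place of a real corpus (the dimensions and the threshold p are illustrative assumptions):

```python
import numpy as np

# Hypothetical term-document count matrix X (m terms x n documents).
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(20, 8)).astype(float)

# Full SVD: X = T diag(sigma) D'.
T, sigma, Dt = np.linalg.svd(X, full_matrices=False)

# Smallest k explaining at least a proportion p of the variance.
p = 0.7
explained = np.cumsum(sigma**2) / np.sum(sigma**2)
k = int(np.searchsorted(explained, p) + 1)

# Rank-k approximation X_k = T_k diag(sigma_k) D_k'.
X_k = T[:, :k] @ np.diag(sigma[:k]) @ Dt[:k, :]

# Cosine similarity between documents 0 and 1 in the smoothed space.
d0, d1 = X_k[:, 0], X_k[:, 1]
print(k, d0 @ d1 / (np.linalg.norm(d0) * np.linalg.norm(d1)))
```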
Visualization

Intuition:
(Terms × Docs) ≈ (Terms × Topics) × (Topics × Docs)

Example

[Deerwester et al., 1990]

Example

The singular values are

Σ = diag(3.34, 2.54, 2.35, 1.64, 1.50, 1.31, 0.85, 0.56, 0.36)

Example

Example

Example: Cosine Similarity

1. Comparing two terms: X_kX_k′ = TΣ_k²T′, so terms i and j can be compared using rows i and j of TΣ_k.
2. Comparing two documents: X_k′X_k = DΣ_k²D′, so documents i and j can be compared using rows i and j of DΣ_k.

Application

Concept detection: [Boukus and Rosenberg, 2006] apply LSA to central bank communication documents and relate document representations to market responses.

Distance between documents:

1. Iaria, Schwarz and Waldinger (2018, AER) apply LSA to scientific documents to measure overlap in research agendas across countries.
2. Ellen, Larsen and Thorsrud (2021) apply LSA to financial newspapers to derive narrative monetary policy shocks.

Statistical Models of Dimensionality Reduction

LSA has statistical foundations, but is not itself a statistical model.

Advantages of statistical models:
They make the statistical foundations for dimensionality reduction explicit, which allows for well-defined inference procedures.
The latent components onto which the data is projected are easier to interpret.
They are relatively straightforward to extend to incorporate additional dependencies of interest.

Disadvantage: they require more elaborate inference algorithms.

Word2Vec

The bag-of-words model (and LSA) entirely ignores the context in which a word appears.
More recently, algorithms have been developed to incorporate this information in the construction of word embeddings.
Let the context of w_{d,n} be the L words that precede it and the L words that follow it (2L words in total). Denote the context by C(w_{d,n}).

1. Continuous bag-of-words (CBOW) model: predict a word given its context.
2. Skip-gram model: predict the context given a word.

John R Firth

"You shall know a word by the company it keeps"


CBOW Model Description

We model the probability of observing w_{d,n} as:

    Pr[w_{d,n} = v | C(w_{d,n})] = exp(ᾱ_{d,n}ᵀ ρ_v) / Σ_{v′∈V} exp(ᾱ_{d,n}ᵀ ρ_{v′}),

where ᾱ_{d,n} = (1/2L) Σ_{w∈C(w_{d,n})} α_w, ρ_v ∈ R^K is the embedding vector for the vth vocabulary term, and α_v ∈ R^K is the context vector for the vth vocabulary term.

1. This is an example of self-supervised learning.
2. The skip-gram model instead predicts context words given a center word.
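A toy numpy sketch of this softmax probability with random parameters and hypothetical dimensions; in practice ρ and α are estimated by maximizing this likelihood over a large corpus.

```python
import numpy as np

V, K, L = 10, 4, 2                      # vocabulary size, embedding dim, half-window
rng = np.random.default_rng(0)
rho = rng.normal(size=(V, K))           # embedding vector for each vocabulary term
alpha = rng.normal(size=(V, K))         # context vector for each vocabulary term

context_ids = [1, 3, 5, 7]              # the 2L words surrounding w_{d,n}
alpha_bar = alpha[context_ids].mean(axis=0)   # average context vector

scores = rho @ alpha_bar                # alpha_bar' rho_v for every v
probs = np.exp(scores - scores.max())   # softmax with max-subtraction for stability
probs /= probs.sum()
print(probs.argmax(), probs.max())      # most likely center word under this model
```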

Terms Close to Uncertainty in FOMC Transcripts

Terms Close to Risk

GloVe

The GloVe model [Pennington, Socher and Manning, 2014] begins with a V × V matrix W of local word co-occurrence counts:
W_{i,j} is the number of times term j appears within the context of term i.
Assign to each term v an embedding vector ρ_v ∈ R^K by minimizing the weighted least-squares objective:

    min_{ρ} Σ_{i,j} f(W_{i,j}) [ρ_iᵀ ρ_j − log(W_{i,j})]²

Terms that co-occur frequently will have more highly correlated embedding vectors. f(W) is a weighting scheme that prevents rare or extremely frequent co-occurrences from being overweighted.
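A sketch of this objective for a tiny hypothetical co-occurrence matrix, following the slide's simplified form with a single embedding matrix and no bias terms; the weighting function uses the functional form and constants reported by Pennington et al. (2014).

```python
import numpy as np

def glove_loss(rho, W, x_max=100.0, a=0.75):
    """Weighted least-squares GloVe objective for co-occurrence matrix W.

    rho: V x K embedding matrix; bias terms are omitted in this sketch.
    f caps the weight of very frequent co-occurrences and downweights rare ones.
    """
    f = np.minimum((W / x_max) ** a, 1.0)
    mask = W > 0                               # log(W) is only defined for positive counts
    resid = rho @ rho.T - np.log(np.where(mask, W, 1.0))
    return np.sum(f * mask * resid**2)

# Hypothetical tiny co-occurrence matrix and random starting embeddings.
rng = np.random.default_rng(0)
W = rng.integers(0, 5, size=(6, 6)).astype(float)
rho = rng.normal(scale=0.1, size=(6, 3))
print(glove_loss(rho, W))   # in practice minimized by (stochastic) gradient descent
```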

Interpretability: Clustering in the Vector Space

Since words close together in the vector space are semantically related, we may want to group them together.
This can be cast as a clustering problem: grouping unlabeled observations into distinct groups.
There are many approaches to clustering in statistics and machine learning. One of the oldest and simplest is the k-means algorithm.
Clustering word embeddings can be viewed as a non-parametric topic model; the approach appears to produce interpretable output.

K-Means
Let V_k be the set of all embeddings that are in cluster k. The centroid of cluster k is

    u_k = (1/|V_k|) Σ_{v∈V_k} x_v,

where x_v is the embedding vector for the vth term.
In k-means we choose cluster assignments {V_1, V_2, ..., V_K} to minimize the sum of squared distances between each term and its cluster centroid:

    Σ_k Σ_{v∈V_k} ∥x_v − u_k∥²

The solution groups similar embeddings together, and the centroids represent prototype embeddings within each cluster.
Embeddings are normalized to have unit length, as before.
Solution Algorithm

First initialize the centroids u_k for k = 1, 2, ..., K.

Repeat the following steps until convergence:

1. Assign each embedding to its closest centroid, i.e. choose the assignment k for v that minimizes ∥x_v − u_k∥.
2. Recompute the cluster centroids as u_k = (1/|V_k|) Σ_{v∈V_k} x_v given the updated assignments from the previous step.

The objective function is guaranteed to decrease at each step → convergence to a local minimum.
Note: step 1 is obviously optimal; for step 2, choose the vector y that minimizes Σ_{v∈V_k} ∥x_v − y∥². The solution is y = u_k.
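A minimal numpy implementation of these two steps on hypothetical embeddings (initialization by random sampling is an assumption here; k-means++ is a common alternative):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal k-means on unit-normalized embedding vectors X (V x d)."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # normalize as on the slide
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Step 1: assign each embedding to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its assigned embeddings.
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical embeddings: 100 "terms" in a 20-dimensional space.
emb = np.random.default_rng(1).normal(size=(100, 20))
labels, centroids = kmeans(emb, K=5)
```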

Directions Encode Meaning

Importance of Training Corpus
Relationships among words can vary depending on the training corpus.
Example: word embeddings trained on Wiki/newswire text versus on the Harvard Business Review.

Application: Embedding Dictionaries

Dictionaries provide a coarse representation of concepts: some relevant terms might be missing altogether, and the strength of association with the concept isn't accounted for.
One strategy is to measure the association between documents and word lists in an embedding space rather than in the bag-of-words space.
A recent example is [Gennaro and Ash, 2022], which studies emotional language in politics using the Congressional Record corpus.

Application: Embedding Dictionaries
Set A of words represents emotion, and set C of words represents cognition (both from LIWC). The emotionality of speech i is:

    Y_i = (sim(d_i, A) + b) / (sim(d_i, C) + b)

Y_i is the emotional-to-cognitive signal measure. The vector A representing emotion is the average of the vectors w for the words in the emotion word list, w ∈ A. The vector C for cognition is defined analogously.
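A sketch of this measure, assuming an embedding lookup table and treating the word lists and the constant b as illustrative placeholders:

```python
import numpy as np

def emotionality(doc_words, emotion_words, cognition_words, emb, b=1.0):
    """Emotion-vs-cognition measure; b is a smoothing constant (value assumed here)."""
    def centroid(words):
        vecs = np.array([emb[w] for w in words if w in emb])
        v = vecs.mean(axis=0)
        return v / np.linalg.norm(v)

    # With unit-length vectors, the dot product is the cosine similarity sim(d_i, ·).
    d, A, C = centroid(doc_words), centroid(emotion_words), centroid(cognition_words)
    return (d @ A + b) / (d @ C + b)

# Hypothetical unit-normalized embeddings and word lists.
rng = np.random.default_rng(0)
emb = {}
for w in ["love", "fear", "think", "reason", "budget", "deficit"]:
    v = rng.normal(size=25)
    emb[w] = v / np.linalg.norm(v)
print(emotionality(["budget", "deficit", "fear"], ["love", "fear"], ["think", "reason"], emb))
```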
Validation:
Rationality of semantic word/sentence selection, sentiment difference, embedding clustering
Manual validation with subsamples
Application: Embedding Dictionaries
[Gennaro and Ash, 2022, EJ]

Word Cloud: Cognitive

Word Cloud: Emotional

Word Embeddings and Cultural Attitudes
Because word embeddings appear to capture semantically meaningful relationships among words, there is interest in using them to measure cultural attitudes.
In psychology there is a long-standing Implicit Association Test (IAT) that measures how quickly participants correctly classify images depending on word combinations.
The hypothesis is that reaction times are shorter when word combinations more naturally belong together, which allows a measure of bias.
Caliskan et al. (2017) use word embeddings to ask whether similar biases exist in natural language.

Implicit Association Test

Implicit Association Test: China

Word-Embedding Association Test

The Word-Embedding Association Test (WEAT) measures whether two sets of target words X, Y (e.g. male and female words) differ in their relative similarity to two sets of attribute words A, B (e.g. career and family words).
Let cos(x, y) be the cosine similarity between vectors x and y.
Let s(x, A, B) = mean_{a∈A} cos(x, a) − mean_{b∈B} cos(x, b). Then

    WEAT ≡ [mean_{x∈X} s(x, A, B) − mean_{y∈Y} s(y, A, B)] / std_{w∈X∪Y} s(w, A, B)
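A direct numpy translation of these definitions, using random vectors in place of real word embeddings:

```python
import numpy as np

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def s(w, A, B):
    """Relative association of word vector w with attribute sets A and B."""
    return np.mean([cos(w, a) for a in A]) - np.mean([cos(w, b) for b in B])

def weat(X, Y, A, B):
    """WEAT effect size for target sets X, Y and attribute sets A, B (lists of vectors)."""
    sX = [s(x, A, B) for x in X]
    sY = [s(y, A, B) for y in Y]
    return (np.mean(sX) - np.mean(sY)) / np.std(sX + sY, ddof=1)

# Hypothetical random "embeddings" standing in for the four word lists.
rng = np.random.default_rng(0)
X, Y, A, B = (list(rng.normal(size=(5, 50))) for _ in range(4))
print(weat(X, Y, A, B))
```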

IAT vs WEAT

Language and Culture (Kozlowski et. al. 2018)

Language and Culture (Kozlowski et. al. 2018)

Application: Does Language affect Decisions?

[Ash, Chen and Ornaghi (2022), AEJ] use a measure similar to WEAT to measure linguistic gender bias among judges from their written opinions.
They then match judge-specific bias scores with individual judge decisions to see whether the two are related.
The data are the universe of US appellate court decisions from 1890-2013.

WEAT and Judge Characteristics I

WEAT and Judge Characteristics II

Application: Expanding Dictionaries
One application of word embeddings is to augment human judgment in the construction of dictionaries.
The motivation is that economists are experts in which concepts might be most important in a particular setting, but not in which words relate to those concepts.
One can specify a set of seed words and then find the nearest neighbors of those words to populate a dictionary (a sketch follows the list below).
Strategy adopted by several recent papers:

1. Li, Mai, Shen and Yan (2021, RFS): corporate culture.
2. Giglio, Kuchler, Stroebel and Zeng (2023, NBER WP): biodiversity risk.
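A sketch of this seed-expansion step using gensim's pretrained GloVe vectors (the pretrained model name is one of gensim's hosted downloads; the seed words are purely illustrative):

```python
import gensim.downloader as api

# Load pretrained word vectors (downloads on first use).
kv = api.load("glove-wiki-gigaword-100")

# Illustrative seed words for a hypothetical "innovation" dictionary.
seeds = ["innovation", "creative", "technology"]
candidates = kv.most_similar(positive=seeds, topn=20)   # nearest neighbours in embedding space
for word, score in candidates:
    print(f"{word}\t{score:.3f}")
# A researcher would screen these candidates by hand before adding them to the dictionary.
```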
Embedding-Based Document Similarity

Several papers use the distance between documents as captured by average embedding vectors.
[Kogan et al., 2019] measure the distance between patents and occupation descriptions to proxy the exposure of jobs to technical change.
[Hansen et al., 2021] measure the distance between O*NET occupation descriptions and job postings to proxy skill demand.
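A minimal sketch of this approach: represent each document by the average of its word embeddings and compare documents with cosine similarity (the embedding lookup and tokens below are hypothetical):

```python
import numpy as np

def doc_vector(tokens, emb):
    """Average the embeddings of a document's tokens (a common, simple choice)."""
    vecs = np.array([emb[t] for t in tokens if t in emb])
    v = vecs.mean(axis=0)
    return v / np.linalg.norm(v)

def doc_similarity(tokens_a, tokens_b, emb):
    # Both document vectors are unit length, so the dot product is cosine similarity.
    return doc_vector(tokens_a, emb) @ doc_vector(tokens_b, emb)

# Hypothetical embedding lookup and documents.
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ["robot", "assembly", "welding", "paint", "audit"]}
print(doc_similarity(["robot", "welding"], ["robot", "assembly", "paint"], emb))
```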

Choosing Among Algorithms

By now we have seen multiple algorithms for document similarity, but have provided no means to assess which one to choose.
Does the choice of algorithm matter?
We evaluate document similarity in the context of 10-K risk factors using randomly sampled pairs from the universe of 2019 filing firms.
We keep the data constant and vary the algorithm used for the similarity comparison.

How to Proceed?

There is a clear need for some validation task against which to compare these algorithms.
It is unclear whether existing ideas from computer science are relevant in economics.
There is an inevitable need for some human input to judge a baseline truth.
Ideally one would find validation tasks that are relevant across research questions, so that each author doesn't have to start from scratch.

Transfer Learning

In the examples above, Word2Vec is fit directly to the data of interest.
In many use cases, one instead uses a model estimated on an auxiliary corpus, either for the word embeddings themselves or as starting values in embedding estimation.
This is known as transfer learning, and it becomes particularly important for large-scale language models.
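A sketch of the two options with gensim (assuming gensim ≥ 4): fitting Word2Vec to one's own toy corpus versus loading embeddings pretrained on a large auxiliary corpus.

```python
from gensim.models import Word2Vec
import gensim.downloader as api

# Option 1: fit Word2Vec directly to the corpus of interest (toy corpus shown here).
corpus = [["rates", "were", "raised"], ["inflation", "expectations", "anchored"]]
own_model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)

# Option 2 (transfer learning): reuse embeddings estimated on a large auxiliary corpus.
pretrained = api.load("glove-wiki-gigaword-100")
print(pretrained.most_similar("inflation", topn=5))
```

Whether the auxiliary corpus is appropriate depends on the application: as the earlier slide on training corpora noted, word relationships learned on newswire text may differ from those in domain-specific documents.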

Conclusion

We have taken first steps in using a rich feature space in a corpus to quantify text.
The key idea is the need to work in a low-dimensional space to obtain useful representations of high-dimensional objects.
While word embeddings have proved useful for a range of tasks, why they uncover meaning is still unclear.
Fully probabilistic models have a more transparent structure, which we begin to discuss next.

