
Application of Graph Neural Networks to Semantic Search

Ahmet Korkmaz¹, Amos Dinh¹, Henrik Rathai¹, and Matthias Fast¹

¹Cooperative State University Baden-Wuerttemberg Mannheim

Abstract.
We present a novel approach for semantic search utilizing Graph Neural Networks (GNNs). The proposed framework first constructs a graph linking papers, words, and authors, where edges are weighted by TF-IDF and word co-occurrence statistics. The Heterogeneous Graph Transformer is employed in combination with the Knowledge Graph Embedding (KGE) method TransE to find informative representations for the entities in the graph. Applied in unison with a vector database, the approach enables content search beyond simple keywords. The representations are learned in a joint metric space, enabling users to find related papers, explore topics through keywords, or discover authors with a similar research focus, enhancing the quality of academic literature exploration. The feasibility of training and applying the proposed method on large corpora is demonstrated on the arXiv dataset.

1 Introduction

Most commonly, semantic search on text corpora is enabled on the basis of word or sentence embeddings, learned with methods such as Word2Vec [1], GloVe [2] or Sentence-BERT [3]. The present paper builds on the intuition introduced by these methods that co-occurrence provides a notion of similarity. The proposed method combines co-occurrence statistics between words with occurrence statistics of words in documents, namely term frequency-inverse document frequency (TF-IDF). Further, this information is gathered and represented together with entity attributes in a graph structure.

In detail, the graph is constructed from the arXiv paper dataset [4], whereby different entities, such as authors, papers, journals and words, are linked through different relationships, such as "paper-written_by-author". The Heterogeneous Graph Transformer can then combine the information of entity relationships with the occurrence and co-occurrence statistics to find representations for entities in the context of the arXiv dataset. The method is closely related to [5], which applies the combination of text-based statistics and graph structure to text classification.

2 Related Work

[5] construct a graph containing documents and the words which occur in the documents as nodes. Documents are linked to words by the respective TF-IDF weights, and words share an edge weighted by their pointwise mutual information (PMI) statistic. For the task of document classification, the authors demonstrate that a Graph Convolutional Network [6] trained on this graph outperforms a logistic regression model trained on TF-IDF-weighted bags-of-words.

[7] introduce the Heterogeneous Graph Transformer (HGT). Similar to [8], the HGT learns different weight matrices W per entity type for the neighborhood aggregation step. The intuition is that different entity types, such as authors and papers, possess different feature distributions, so distinct weight matrices W allow richer information extraction. The HGT further employs the attention mechanism [9] on the source-relationship-target triple representing an edge between nodes. To compute a target node's representation, the HGT learns the attention matrices K, W and Q for source node, relationship and target node; these matrices are learned independently for each relationship type and node type. Attention allows the model to determine a neighbor's importance relative to the target node. To handle web-scale data, the authors introduce a novel sampling method: a subgraph around a minibatch of target nodes is sampled, and neighbors are sampled probabilistically depending on their connection count towards the target nodes, normalized by the neighbors' absolute degree count. This effectively yields denser subgraphs, such that each target node in the minibatch is represented by more neighbors than if neighbors were sampled uniformly at random.

[10] introduce the KGE method TransE, which learns embeddings for the nodes of a knowledge graph. Specifically, the distance d(h_s + h_r, h_t) over all realizations (s, r, t) is minimized using a lookup-table-based approach, where h_s is the source node's embedding, h_r a relationship embedding of type ϕ(r), h_t the target node's embedding and d the Minkowski distance. In contrast to methods such as [11], TransE embeds all entity types in the same metric space, thereby allowing all embeddings to be compared with each other. Suppose two relationship types ϕ(e1) = R1 and ϕ(e2) = R2 exist in a graph, where e1 connects node s to t, e1 = (s, t), and e2 connects node t to u, e2 = (t, u). Then the embedding of node s can be compared with the embedding of node u via the distance d(h_s + h_R1 + h_R2, h_u). Although no explicit relationship type R3 between the node types ψ(s) and ψ(u) exists, TransE still enables their comparison.
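To make the scoring concrete, the following PyTorch sketch computes the TransE distance and the two-hop composition just described; embedding sizes and identifiers are illustrative assumptions, not details of the authors' implementation.

```python
import torch

# Illustrative TransE scoring sketch (not the authors' code).
# Entities and relations share one 256-dimensional metric space.
num_entities, num_relations, dim = 1000, 5, 256
entity_emb = torch.nn.Embedding(num_entities, dim)
relation_emb = torch.nn.Embedding(num_relations, dim)

def transe_distance(src, rel, dst, p=2):
    """Minkowski distance d(h_s + h_r, h_t); smaller means more plausible."""
    h_s, h_r, h_t = entity_emb(src), relation_emb(rel), entity_emb(dst)
    return torch.norm(h_s + h_r - h_t, p=p, dim=-1)

# Composition across two hops: compare s with u via d(h_s + h_R1 + h_R2, h_u),
# even though no explicit relation between the types of s and u exists.
s, u = torch.tensor([0]), torch.tensor([42])
r1, r2 = torch.tensor([1]), torch.tensor([2])
two_hop = torch.norm(
    entity_emb(s) + relation_emb(r1) + relation_emb(r2) - entity_emb(u), p=2, dim=-1
)
```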
3 Methodology

This section describes the proposed methodology, which distributes the information obtained from text-based metrics across links to other entities related to the main body of documents.

3.1 Graph Modeling

The basis of the method is the modeling of documents and their corresponding relational data in a graph structure. Taking the arXiv dataset [4] as an example, most information is contained in the provided abstract text of each research paper. This data is augmented by information about authorship, categories, and journals, which can be expanded into entities in the graph, linking concepts together. Graph Neural Networks then allow us to learn representations that take the knowledge about all entity types into consideration. In contrast to methods like TF-IDF, authors are not merely represented by the average of their documents' embeddings; the Graph Neural Network may also consider the representations of co-authors or of the journals in which the papers are published.

3.2 Assigning edge weights

In this paper the term frequency-inverse document frequency (TF-IDF) metric is used for document-word relations and pointwise mutual information (PMI) for word-word co-occurrences, as in [5].

TF is the count of a word in a document. IDF measures the rareness of a word across all documents and is calculated as follows:

\[ \mathrm{IDF}(t) = \ln\left(\frac{1 + n}{1 + \mathrm{df}_k}\right) \]

where n is the number of documents in the corpus and df_k the number of documents containing term k. The TF-IDF score is the product of TF and IDF [12, p. 2][13].

PMI with a fixed-size sliding window can be used to analyze word relationships across a corpus. In this paper the normalized PMI metric (NPMI) is used, so the weight of the connection between two word nodes x and y is defined as:

\[ \mathrm{NPMI}(x, y) = \ln\left(\frac{p(x, y)}{p(x)\,p(y)}\right) \Big/ \bigl(-\ln p(x, y)\bigr) \]

One side effect is the potential reduction of low-frequency bias [14, p. 5]. The NPMI value between two words of the corpus is based on their co-occurrence: a value of 1 indicates that the words always appear together, a value of 0 signifies that they occur independently, and a value of -1 means that they never appear together [14, p. 6].
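As an illustration of these two weighting schemes, the sketch below computes IDF values and windowed NPMI edge weights from tokenized documents; the counting scheme, function names and window handling are assumptions for exposition rather than the authors' implementation.

```python
import math
from collections import Counter
from itertools import combinations

def idf(term, docs):
    """IDF(t) = ln((1 + n) / (1 + df_t)); docs are lists of tokens."""
    n = len(docs)
    df = sum(1 for doc in docs if term in doc)
    return math.log((1 + n) / (1 + df))

def npmi_weights(docs, window=10):
    """Word-word NPMI over a fixed-size sliding window (window=10 as in Sec. 4.2)."""
    single, pair, total = Counter(), Counter(), 0
    for doc in docs:
        for i in range(max(1, len(doc) - window + 1)):
            win = sorted(set(doc[i:i + window]))
            total += 1
            single.update(win)
            pair.update(combinations(win, 2))
    weights = {}
    for (x, y), c_xy in pair.items():
        p_xy, p_x, p_y = c_xy / total, single[x] / total, single[y] / total
        if p_xy < 1.0:  # NPMI is undefined when p(x, y) = 1
            weights[(x, y)] = math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)
    return weights

# Example: docs = [["graph", "neural", "network"], ...]; an edge is kept if its weight > 0.
```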
3.3 Training

Training uses a two-layer HGT [7]. Each layer has a size of 256 and includes 8 attention heads. To allow nodes to attend to themselves, the input graph is augmented with self-edges. First, the node attributes are fed through a linear layer that reduces them to 256 dimensions, then through the two HGT layers and a further linear layer of the same dimensionality. Finally, the node representations are learned with an additional TransE [10] head using the Minkowski distance with p = 2 in combination with the pairwise margin loss [15] proposed in [10], with γ = 0.5:

\[ L = \sum_{(s,r,t) \in S} \; \sum_{(s',r,t') \in S'} \left[ \gamma + d(h_s + h_r, h_t) - d(h_{s'} + h_r, h_{t'}) \right]_+ \]

\[ [a]_+ = \begin{cases} a, & \text{if } a \ge 0 \\ 0, & \text{otherwise} \end{cases} \]

In detail, existing edges and their corresponding node pairs are sampled to create the set of positive examples S, and the source nodes are connected to random target nodes to obtain the negative examples in S'.
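A condensed PyTorch Geometric sketch of this architecture and loss is given below. It follows the stated hyperparameters (two HGT layers, hidden size 256, 8 heads, p = 2, γ = 0.5); the wiring details, names and relation-embedding handling are assumptions rather than the authors' code.

```python
import torch
from torch_geometric.nn import HGTConv, Linear

class HGTTransE(torch.nn.Module):
    """Sketch of Sec. 3.3: per-type linear projection to 256 dims, two HGT layers
    with 8 heads, a final linear layer, and a TransE head (illustrative names)."""

    def __init__(self, metadata, relation_names, hidden=256, heads=8):
        super().__init__()
        node_types, _ = metadata
        self.lin_in = torch.nn.ModuleDict({t: Linear(-1, hidden) for t in node_types})
        self.convs = torch.nn.ModuleList(
            HGTConv(hidden, hidden, metadata, heads) for _ in range(2)
        )
        self.lin_out = Linear(hidden, hidden)
        # One TransE relation vector per relation, living in the same 256-dim space.
        self.rel_emb = torch.nn.ParameterDict(
            {r: torch.nn.Parameter(0.01 * torch.randn(hidden)) for r in relation_names}
        )

    def forward(self, x_dict, edge_index_dict):
        h = {t: self.lin_in[t](x).relu() for t, x in x_dict.items()}
        for conv in self.convs:  # note: HGTConv ignores edge weights
            h = conv(h, edge_index_dict)
        return {t: self.lin_out(v) for t, v in h.items()}

def transe_margin_loss(h_s, h_r, h_t, h_s_neg, h_t_neg, gamma=0.5):
    """Pairwise margin loss [gamma + d(h_s + h_r, h_t) - d(h_s' + h_r, h_t')]_+
    with Minkowski distance p = 2, expressed via MarginRankingLoss [15]."""
    d_pos = torch.norm(h_s + h_r - h_t, p=2, dim=-1)
    d_neg = torch.norm(h_s_neg + h_r - h_t_neg, p=2, dim=-1)
    # target = -1 tells the loss that d_neg should exceed d_pos by the margin.
    return torch.nn.functional.margin_ranking_loss(
        d_pos, d_neg, -torch.ones_like(d_pos), margin=gamma
    )
```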
count in abstracts opposed to entire texts. Because of prac- tation, of which the lack of working memory proves to
ticality, for the word-word edges, only approximately the be particularly challenging. A technique called Product
50 edges with the highest PMI score are used. Similarly, Quantization (PQ) is being used to reduce the required
each paper is only represented by the 50 words with the memory space. This method is being utilized for the com-
highest TF-IDF score. To provide initial node features for pression of vectors. It is exhibiting high efficiency when
the GNN to learn on, Sentence-BERT [3] embeddings en- being applied to the compression of high dimensional vec-
code the identifying text attribute of each node type. For tors, specifically for the purpose of nearest neighbor search
authors, papers, journals, categories and words, it is their [24].
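A minimal preprocessing sketch along these lines is shown below, assuming the small English spaCy pipeline and the NLTK stopword list; the exact model and filtering rules used by the authors are not specified in the paper.

```python
import spacy
from nltk.corpus import stopwords  # requires nltk.download("stopwords") once

# Lemmatize abstracts with spaCy and drop NLTK stopwords.
# The model name "en_core_web_sm" is an assumption; any English pipeline
# with a lemmatizer component works.
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
stop = set(stopwords.words("english"))

def preprocess(text: str) -> list[str]:
    """Return lowercased lemmas of alphabetic, non-stopword tokens."""
    return [
        tok.lemma_.lower()
        for tok in nlp(text)
        if tok.is_alpha and tok.lemma_.lower() not in stop
    ]
```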
4.2 Implementation Details

The dataset is represented as a heterogeneous graph object from the PyTorch Geometric library [18], consisting of five node types: paper, author, category, journal, and word. The paper node type carries attributes such as license, DOI, pages, journal, date, id, and title, and forms edges with author, category, word, titleword and journal nodes. To create the nodes, OrderedSet [19] is used: it represents every word within a list exactly once while keeping the order of the entries. The nodes are assigned numerical identifiers based on the created dictionaries. After creating the nodes in this way, edges between them are generated by iteratively mapping the lists. Edges between two word nodes are established if their PMI score is greater than 0, signifying a positive correlation. Additionally, the weights of edges connecting paper nodes to word nodes are given by their TF-IDF scores, reflecting the importance of the words in the papers. The NPMI calculation in this study uses a window size of 10, as opposed to the window size of 20 used by Han et al. [20], reflecting the smaller word count of abstracts compared with entire texts. For practicality, only approximately the 50 word-word edges with the highest PMI score are used, and each paper is likewise represented by only the 50 words with the highest TF-IDF score. To provide initial node features for the GNN to learn on, Sentence-BERT [3] embeddings encode the identifying text attribute of each node type; for authors, papers, journals, categories and words this is their respective name.
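The sketch below builds a toy version of such a graph with PyTorch Geometric's HeteroData and Sentence-BERT features; the edge-type names, the Sentence-BERT checkpoint and the toy values are illustrative assumptions.

```python
import torch
from torch_geometric.data import HeteroData
from sentence_transformers import SentenceTransformer

# Toy construction of the heterogeneous graph described above.
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

paper_titles = ["Graph neural networks for search", "Attention is all you need"]
words = ["graph", "neural", "attention"]
authors = ["A. Korkmaz", "A. Dinh"]

data = HeteroData()
# Initial node features: Sentence-BERT embeddings of the identifying text.
data["paper"].x = torch.tensor(encoder.encode(paper_titles))
data["word"].x = torch.tensor(encoder.encode(words))
data["author"].x = torch.tensor(encoder.encode(authors))

# paper-word edges weighted by TF-IDF (kept for the top-50 words per paper).
data["paper", "contains", "word"].edge_index = torch.tensor([[0, 0, 1], [0, 1, 2]])
data["paper", "contains", "word"].edge_attr = torch.tensor([[0.4], [0.2], [0.7]])

# word-word edges kept only if NPMI > 0 (top-50 per word).
data["word", "cooccurs_with", "word"].edge_index = torch.tensor([[0, 1], [1, 0]])
data["word", "cooccurs_with", "word"].edge_attr = torch.tensor([[0.3], [0.3]])

# authorship edges.
data["author", "writes", "paper"].edge_index = torch.tensor([[0, 1], [0, 1]])

node_types, edge_types = data.metadata()  # consumed by HGTConv
```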
Data pre-processing is conducted on a 124 GB, 22-core machine and takes 5 hours, although code optimizations would allow pre-processing with arbitrarily small resources. The model is trained on a single Nvidia P100 GPU. The GNN learns for 36 hours from 900,000 target edges plus the corresponding negative examples. The Adam optimizer [21] with a learning rate of 2e-4 is used. Sampling is conducted with a minibatch size of 32, amounting to two separate heterogeneous graph sampling operations of 32 target nodes each when the target edge type connects two different entity types; otherwise a subgraph around 64 nodes is sampled. Compared to simple random neighborhood sampling, the employed graph sampling approach limits training iteration speed, although the obtained subgraphs may hold more information. Inference with the final model takes 12 hours and produces approximately 7 million entity embeddings. Ideally, inference should be conducted multiple times and the obtained embeddings averaged, as the representations vary with the sampled subgraph. Pre-processing, training and deployment code can be found in the repository [22].
4.3 Results

Query results from pure TF-IDF embeddings, extracted from the papers' abstracts and reduced to 256 dimensions with principal component analysis, are compared against the results of the proposed method. For a query paper title, the relevant query results are annotated, and the score is the simple average of the ranks at which those results, which should be placed at the top, appear for each method. TF-IDF achieves an average rank of 189,037 out of 2,381,173, whereas the proposed approach achieves an average rank of 81,752 out of 2,381,173 (lower is better).
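For clarity, the scoring described above can be read as the mean 1-based position of the annotated relevant results within a ranking, as in this small sketch (one possible interpretation of the metric, not the authors' evaluation code):

```python
import numpy as np

def average_rank(ranked_ids, relevant_ids):
    """Mean 1-based position of the annotated relevant results in a ranking."""
    positions = [ranked_ids.index(r) + 1 for r in relevant_ids if r in ranked_ids]
    return float(np.mean(positions)) if positions else float("nan")

# Example: a relevant paper ranked 3rd and another ranked 7th give a score of 5.0.
print(average_rank(["p9", "p4", "p1", "p8", "p5", "p3", "p2"], ["p1", "p2"]))
```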
5 Implementation

The vectors learned by the GNN are stored in a Weaviate [23] vector database. Along with the vectors, the original attributes of the data are also stored; for papers, for instance, their titles are saved. The frontend application is realized with the Python library Streamlit. The user can make an entry via a text input field; below this is a drop-down menu where the user selects which class to search in. The user is then shown the 10 best hits for their search and can select one of them, whereupon similar entries of the selected class are returned. The similarity search looks up similar vectors in the database using the Euclidean distance and returns the best hits.
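A hedged sketch of such a nearest-neighbor query is shown below. It assumes the v3-style weaviate-client API, a class named Paper, and a vector index configured for Euclidean (l2-squared) distance; none of these identifiers are taken from the authors' code.

```python
import weaviate  # assumes the v3-style weaviate-client API

client = weaviate.Client("http://localhost:8080")

def similar_papers(query_vector, k=10):
    """Return the k nearest 'Paper' objects to the query embedding.
    The class name and the l2-squared distance setting of the vector index
    are assumptions matching the setup described above."""
    return (
        client.query
        .get("Paper", ["title"])
        .with_near_vector({"vector": query_vector})
        .with_limit(k)
        .do()
    )
```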
Some difficulties are encountered during implementation, of which the lack of working memory proves to be particularly challenging. Product Quantization (PQ) is therefore used to compress the stored vectors and reduce the required memory; the technique is highly efficient for compressing high-dimensional vectors for the purpose of nearest neighbor search [24].

PQ speeds up nearest neighbor search in high-dimensional spaces by dividing the original vector space into a Cartesian product of low-dimensional subspaces. Each subspace is quantized separately, leading to a set of quantization indices for each vector. This results in a compact code per vector, which is used for storage and comparison [24].

The advantages of product quantization are twofold: it reduces memory usage significantly, as each vector is represented by a short code rather than a high-dimensional vector, and it speeds up the nearest neighbor search, as the distance between two vectors can be computed efficiently using precomputed lookup tables.
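Independently of the PQ built into the vector database, the idea can be sketched in a few lines of NumPy and scikit-learn: split each 256-dimensional vector into m subvectors, learn a small codebook per subspace, store only the codebook indices, and answer queries with per-subspace lookup tables. The values of m and k and all names are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

m, k = 8, 256  # subspaces and centroids per subspace (8-bit codes); assumed values

def pq_train(vectors):
    """Learn one k-means codebook per subspace (needs at least k training vectors)."""
    subdim = vectors.shape[1] // m
    return [KMeans(n_clusters=k).fit(vectors[:, i * subdim:(i + 1) * subdim])
            for i in range(m)]

def pq_encode(codebooks, vectors):
    """Replace each subvector by the index of its nearest centroid -> [n, m] uint8 codes."""
    subdim = vectors.shape[1] // m
    codes = [cb.predict(vectors[:, i * subdim:(i + 1) * subdim])
             for i, cb in enumerate(codebooks)]
    return np.stack(codes, axis=1).astype(np.uint8)

def pq_distances(codebooks, codes, query):
    """Approximate squared Euclidean distances to all codes via precomputed lookup tables."""
    subdim = query.shape[0] // m
    tables = [((cb.cluster_centers_ - query[i * subdim:(i + 1) * subdim]) ** 2).sum(axis=1)
              for i, cb in enumerate(codebooks)]
    return sum(tables[i][codes[:, i]] for i in range(m))
```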

6 Conclusion

This paper presents the application of Graph Neural Networks in combination with text-based statistics. The Graph Neural Network enables the approach to utilize the text-based data to inform the representations of entities linked to the main body of documents. Thereby, a holistic view on the relations naturally found in the data becomes possible, empowering advanced semantic search capabilities. The setup of semantic search is demonstrated by example on the large corpus of the arXiv dataset. While the results show the feasibility of the proposed approach, further work is needed to validate and consequently improve the method. Different GNN models and training approaches could be experimented with, and the margin loss could be replaced with unsupervised objectives such as attribute prediction or explicit link prediction. Future work could also explore GNN models and training methodologies that do not focus solely on the ranking loss but additionally incorporate link prediction.

References
[1] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781 (2013)
[2] J. Pennington, R. Socher, C.D. Manning, GloVe: Global vectors for word representation, in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014), pp. 1532-1543
[3] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, arXiv preprint arXiv:1908.10084 (2019)
[4] Cornell University, arXiv dataset, https://www.kaggle.com/datasets/Cornell-University/arxiv (2023), accessed: January 25, 2024
[5] L. Yao, C. Mao, Y. Luo, Graph convolutional networks for text classification, in Proceedings of the AAAI Conference on Artificial Intelligence (2019), Vol. 33, pp. 7370-7377
[6] T.N. Kipf, M. Welling, Semi-supervised classification with graph convolutional networks, arXiv preprint arXiv:1609.02907 (2016)
[7] Z. Hu, Y. Dong, K. Wang, Y. Sun, Heterogeneous graph transformer, in Proceedings of The Web Conference 2020 (2020), pp. 2704-2710
[8] M. Schlichtkrull, T.N. Kipf, P. Bloem, R. van den Berg, I. Titov, M. Welling, Modeling relational data with graph convolutional networks, in The Semantic Web: 15th International Conference, ESWC 2018, Heraklion, Crete, Greece, June 3-7, 2018, Proceedings 15 (Springer, 2018), pp. 593-607
[9] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473 (2014)
[10] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, O. Yakhnenko, Translating embeddings for modeling multi-relational data, Advances in Neural Information Processing Systems 26 (2013)
[11] B. Yang, W.t. Yih, X. He, J. Gao, L. Deng, Embedding entities and relations for learning and inference in knowledge bases, arXiv preprint arXiv:1412.6575 (2014)
[12] J. Ramos et al., Using TF-IDF to determine word relevance in document queries, in Proceedings of the First Instructional Conference on Machine Learning (New Jersey, USA, 2003), Vol. 242, pp. 29-48
[13] scikit-learn TfidfVectorizer documentation (2024), accessed: 2024-01-21, https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
[14] G. Bouma, Normalized (pointwise) mutual information in collocation extraction, Proceedings of GSCL 30, 31 (2009)
[15] PyTorch, MarginRankingLoss documentation (2023), accessed: 2024-01-22, https://pytorch.org/docs/stable/generated/torch.nn.MarginRankingLoss.html
[16] spaCy Lemmatizer documentation (2024), accessed: 2024-01-20, https://spacy.io/api/lemmatizer
[17] S. Bird, E. Loper, E. Klein, Natural Language Processing with Python (O'Reilly Media Inc., 2009)
[18] PyTorch Geometric, HeteroData documentation (2024), accessed: 2024-01-19, https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.data.HeteroData.html
[19] ordered-set (2024), accessed: 2024-01-19, https://pypi.org/project/ordered-set/
[20] S.C. Han, Z. Yuan, K. Wang, S. Long, J. Poon, arXiv preprint arXiv:2203.16060 (2022)
[21] D.P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014)
[22] A. Korkmaz, A. Dinh, H. Rathai, M. Fast, GNNPapersSearch: Graph neural networks for semantic search, https://github.com/AmosDinh/GNNpapersearch (2024)
[23] Weaviate (2018), accessed: 2024-01-22, https://weaviate.io/
[24] H. Jegou, M. Douze, C. Schmid, Product quantization for nearest neighbor search, IEEE Transactions on Pattern Analysis and Machine Intelligence 33, 117 (2010)
Division of work in the GNNPaperSearch project
1. Preprocessing and graph modeling: Ahmet Korkmaz and Henrik Rathai
2. Training: Amos Dinh
3. Application and evaluation: Matthias Fast
