Description of Approach
Abstract.
We present a novel approach to semantic search based on Graph Neural Networks (GNNs). The proposed
framework first constructs a graph linking papers, words, and authors, with edges weighted by TF-IDF
and word co-occurrence statistics. A Heterogeneous Graph Transformer is employed in combination with
the Knowledge Graph Embedding (KGE) method TransE to learn informative representations for the entities in
the graph. Used in unison with a vector database, the approach enables content search beyond simple
keywords. The representations are learned in a joint metric space, enabling users to find related papers, explore
topics through keywords, or discover authors with a similar research focus, enhancing the quality of academic
literature exploration. The feasibility of training and applying the proposed method on large corpora is
demonstrated on the arXiv dataset.
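The abstract combines an HGT encoder with the TransE objective, typically trained with a margin ranking loss. As a hedged illustration only (a plain-Python sketch with toy 3-dimensional embeddings and invented values, not the authors' training code), TransE scores a triple (head, relation, tail) by the distance ||head + relation − tail|| and pushes true triples to score lower than corrupted ones by a margin:

```python
def transe_score(h, r, t):
    # TransE models a relation as a translation: for a plausible triple
    # (head, relation, tail) we expect head + relation to be close to tail,
    # so the score is the L2 distance ||h + r - t|| (lower = more plausible).
    return sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)) ** 0.5

def margin_ranking_loss(pos_score, neg_score, margin=1.0):
    # Hinge-style ranking loss: the true (positive) triple should score
    # at least `margin` lower than a corrupted (negative) triple.
    return max(0.0, margin + pos_score - neg_score)

# Toy embeddings (illustrative values only, not learned parameters).
paper       = [0.9, 0.1, 0.0]    # head entity, e.g. a paper node
written_by  = [-0.5, 0.4, 0.1]   # relation embedding
author      = [0.4, 0.5, 0.1]    # true tail entity
random_node = [2.0, -1.0, 0.3]   # corrupted tail for the negative triple

pos = transe_score(paper, written_by, author)       # 0.0: translation fits exactly
neg = transe_score(paper, written_by, random_node)  # large distance
loss = margin_ranking_loss(pos, neg)                # 0.0: ranking already satisfied
```

In practice this corresponds to `torch.nn.MarginRankingLoss` [15] applied to scores of true and corrupted triples, with the embeddings produced by the GNN encoder rather than stored as free parameters.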
5 Implementation
The vectors that the GNN learns are stored in a Weaviate [23] vector database. Along with the vectors, the orig-

References
[1] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781 (2013)
[2] J. Pennington, R. Socher, C.D. Manning, GloVe: Global vectors for word representation, in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014), pp. 1532–1543
[3] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using siamese BERT-networks, arXiv preprint arXiv:1908.10084 (2019)
[4] Cornell University, arXiv dataset, https://www.kaggle.com/datasets/Cornell-University/arxiv (2023), accessed: 2024-01-25
[5] L. Yao, C. Mao, Y. Luo, Graph convolutional networks for text classification, in Proceedings of the AAAI Conference on Artificial Intelligence (2019), Vol. 33, pp. 7370–7377
[6] T.N. Kipf, M. Welling, Semi-supervised classification with graph convolutional networks, arXiv preprint arXiv:1609.02907 (2016)
[7] Z. Hu, Y. Dong, K. Wang, Y. Sun, Heterogeneous graph transformer, in Proceedings of The Web Conference 2020 (2020), pp. 2704–2710
[8] M. Schlichtkrull, T.N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, M. Welling, Modeling relational data with graph convolutional networks, in The Semantic Web: 15th International Conference, ESWC 2018, Heraklion, Crete, Greece, June 3–7, 2018, Proceedings 15 (Springer, 2018), pp. 593–607
[9] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473 (2014)
[10] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, O. Yakhnenko, Translating embeddings for modeling multi-relational data, Advances in Neural Information Processing Systems 26 (2013)
[11] B. Yang, W.t. Yih, X. He, J. Gao, L. Deng, Embedding entities and relations for learning and inference in knowledge bases, arXiv preprint arXiv:1412.6575 (2014)
[12] J. Ramos et al., Using TF-IDF to determine word relevance in document queries, in Proceedings of the First Instructional Conference on Machine Learning (New Jersey, USA, 2003), Vol. 242, pp. 29–48
[13] scikit-learn TfidfVectorizer documentation (2024), accessed: 2024-01-21, https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
[14] G. Bouma, Normalized (pointwise) mutual information in collocation extraction, Proceedings of GSCL 30, 31 (2009)
[15] PyTorch MarginRankingLoss documentation (2023), accessed: 2024-01-22, https://pytorch.org/docs/stable/generated/torch.nn.MarginRankingLoss.html
[16] spaCy Lemmatizer documentation (2024), accessed: 2024-01-20, https://spacy.io/api/lemmatizer
[17] S. Bird, E. Loper, E. Klein, Natural Language Processing with Python (O'Reilly Media Inc., 2009)
[18] PyTorch Geometric HeteroData documentation (2024), accessed: 2024-01-19, https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.data.HeteroData.html
[19] ordered-set (2024), accessed: 2024-01-19, https://pypi.org/project/ordered-set/
[20] S.C. Han, Z. Yuan, K. Wang, S. Long, J. Poon, Understanding graph convolutional networks for text classification, arXiv preprint arXiv:2203.16060 (2022)
[21] D.P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014)
[22] A. Korkmaz, H. Rathai, A. Dinh, M. Fast, GNNPapersSearch: Graph neural networks for semantic search, https://github.com/AmosDinh/GNNpapersearch (2024)
[23] Weaviate (2018), accessed: 2024-01-22, https://weaviate.io/
[24] H. Jegou, M. Douze, C. Schmid, Product quantization for nearest neighbor search, IEEE Transactions on Pattern Analysis and Machine Intelligence 33, 117 (2010)
Division of Work in the GNNPaperSearch Project
1. Preprocessing and Graph Modeling: Ahmet Korkmaz and Henrik Rathai
2. Training: Amos Dinh
3. Application and Evaluation: Matthias Fast