Semantic Analysis (LSA) and Rocchio algorithms with large-scale approximate nearest neighbor search based on ball trees. PURE7 is another content-based PubMed article recommender; it was developed using a finite mixture model for soft clustering with the Expectation-Maximization (EM) algorithm and achieved 78.2% precision at 10% recall with 200 training articles. Lin and Wilbur developed pmra8, a probabilistic topic-based content similarity model for PubMed articles. Their method achieved a slight but statistically significant improvement in precision@5 compared to BM25.
With the popularity of NLP models such as Google's doc2vec, USE, and most recently BERT, there have been efforts to incorporate these embedding methods into research-paper recommenders. Collins and Beel9 experimented with doc2vec, TF-IDF, and key phrases for providing related-article recommendations to both the digital library Sowiport13 and the open-source reference manager JabRef14. A. Mohamed Hassan et al.10 evaluated USE, InferSent, ELMo, BERT, and SciBERT for re-ranking results from BM25 for research-paper recommendations.
Materials
We used data series from GEO and MEDLINE articles from PubMed. For GEO series, metadata such as the title, summary, date of publication, and names of authors were collected using a web crawler. We also collected the PMIDs of the articles associated with each series. From these PMIDs, metadata of the corresponding articles, such as the title, abstract, authors, affiliations, MeSH terms, and publisher name, were also collected. Figure 1 shows an example of a GEO data series, and Figure 2 shows an example of a PubMed publication.
In total, we collected 72,971 GEO series with 50,159 associated unique publications. Multiple series can reference the same paper(s); 96% of the series have only 1 related publication and the rest have between 2 and 10.
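The same kind of metadata can also be retrieved programmatically. Below is a minimal sketch using NCBI E-utilities (esearch/esummary against the 'gds' database) rather than the web crawler we used; the helper function and the JSON field names ('title', 'summary', 'pdat', 'pubmedids') are assumptions for illustration and may differ from the live response.

# Minimal sketch: fetch GEO series metadata and linked PMIDs via NCBI E-utilities.
# NOTE: illustrative only; the field names below are assumptions and may differ.
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def fetch_geo_series(accession="GSE11663"):
    # Resolve the GEO accession to an internal UID in the 'gds' database.
    search = requests.get(f"{EUTILS}/esearch.fcgi",
                          params={"db": "gds", "term": f"{accession}[Accession]",
                                  "retmode": "json"}).json()
    uids = search.get("esearchresult", {}).get("idlist", [])
    if not uids:
        return None
    # Retrieve the summary record for the first matching UID.
    summary = requests.get(f"{EUTILS}/esummary.fcgi",
                           params={"db": "gds", "id": uids[0],
                                   "retmode": "json"}).json()
    record = summary.get("result", {}).get(uids[0], {})
    return {
        "title": record.get("title"),
        "summary": record.get("summary"),
        "date": record.get("pdat"),
        "pmids": record.get("pubmedids", []),  # articles associated with the series
    }

print(fetch_geo_series("GSE11663"))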
Methods
We adopted an information retrieval strategy in which the data series are treated as queries and the list of recommended publications as retrieved documents. In our experiments, series were represented by their titles and summaries, while publications were represented by their titles and abstracts. Further, we removed stop words, punctuation, and URLs from the series summaries before transforming them into vectors.
We used cosine similarity as the ranking score; it is a popular measure of feature similarity in query-document analysis15 due to its low complexity and intuitive definition. In our case, we returned only the top 10 recommendations based on cosine similarity, a realistic scenario since few people would check the end of a long recommendation list. Figure 3 shows our recommender's architecture.
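As an illustration of this retrieval setup, the following is a minimal sketch of the ranking step using TF-IDF, one of the term-frequency representations we compared; the preprocessing helper and the sample texts are illustrative rather than our exact implementation.

# Minimal sketch of the query-document ranking step (illustrative, not the exact code).
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def preprocess(text):
    # Drop URLs and punctuation; TfidfVectorizer's stop_words handles stop words.
    text = re.sub(r"https?://\S+", " ", text.lower())
    return re.sub(r"[^a-z0-9\s]", " ", text)

# Publications are represented by title + abstract; series by title + summary.
publications = ["title and abstract of publication 1 ...",
                "title and abstract of publication 2 ..."]
series_query = "title and processed summary of a GEO series ..."

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform([preprocess(p) for p in publications])
query_vec = vectorizer.transform([preprocess(series_query)])

scores = cosine_similarity(query_vec, doc_matrix).ravel()
top10 = np.argsort(scores)[::-1][:10]   # return only the top 10 recommendations
print(top10, scores[top10])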
ELMo22: a deep, contextualized bi-directional Long Short-Term Memory (LSTM) model that was pre-trained on the 1B Word Benchmark23. We used the latest TensorFlow Hub implementation24 of ELMo to obtain embeddings of 1024 dimensions.
InferSent25: a bi-directional LSTM encoder with max-pooling that was pre-trained on the supervised data of the Stanford Natural Language Inference (SNLI) corpus26. There are two versions of the InferSent model; we used the one with fastText word embeddings from Facebook's GitHub27, with a resulting embedding dimension of 4096.
USE28: the Universal Sentence Encoder, developed by Google, has two model variants: one is transformer-based and the other is Deep Averaging Network (DAN)-based. Both were pre-trained on unsupervised data such as Wikipedia, web news, web question-answer pages, and discussion forums, and further trained on the supervised SNLI data. We used the TensorFlow Hub implementation of the transformer USE to obtain embeddings of 512 dimensions.
BERT29: Bidirectional Encoder Representations from Transformers, developed by Google, which has achieved state-of-the-art performance in many classical natural language processing tasks. It was pre-trained on the 800M-word BooksCorpus and 2,500M-word English Wikipedia using masked language modeling (MLM) and next sentence prediction (NSP) as the pre-training objectives. We used the Sentence-BERT30 package to obtain vectors optimized for the Semantic Textual Similarity (STS) task, which have 768 dimensions.
SciBERT31: a BERT model that was further pre-trained on a 1.14M full-paper corpus from semanticscholar.org32. Similarly, we used Sentence-BERT to obtain vectors of 768 dimensions.
BioBERT33: a BERT model that was further pre-trained on large-scale biomedical corpora, i.e. 4.5B-word PubMed abstracts and 13.5B-word PubMed Central full-text articles. Similar to BERT, vectors of 768 dimensions were obtained using Sentence-BERT.
RoBERTa34: a robustly optimized version of BERT that was further pre-trained on the CC-News35 corpus, with enhanced hyperparameter choices including batch sizes, number of epochs, and dynamic masking patterns in the pre-training process. We used Sentence-BERT to obtain vectors of 768 dimensions.
DistilBERT36: a distilled version of BERT that is 40% smaller and 60% faster while retaining 97% of the original performance. We used Sentence-BERT to obtain vectors of 768 dimensions.
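As a usage illustration for the Sentence-BERT based models above, the following is a minimal sketch; the checkpoint name 'bert-base-nli-stsb-mean-tokens' is an assumed STS-tuned 768-dimensional model, and the texts are placeholders.

# Minimal sketch: sentence embeddings via the Sentence-BERT (sentence-transformers) package.
# The model name below is illustrative; any STS-tuned 768-dimensional checkpoint would do.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("bert-base-nli-stsb-mean-tokens")  # assumed checkpoint name

publication_texts = ["title and abstract of publication 1 ...",
                     "title and abstract of publication 2 ..."]
series_text = "title and processed summary of a GEO series ..."

doc_embeddings = model.encode(publication_texts)   # shape: (n_docs, 768)
query_embedding = model.encode([series_text])      # shape: (1, 768)

scores = cosine_similarity(query_embedding, doc_embeddings).ravel()
top10 = np.argsort(scores)[::-1][:10]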
For all term-frequency based methods, the experiments were performed on 8 Intel(R) Xeon(R) Gold 6140 CPUs @ 2.30GHz. For embedding-based methods, the experiments were performed on 1 Tesla V100-PCIE-16GB GPU. The implementation of our experiments is available at https://github.com/chocolocked/RecommendersOfScholarlyPapers
Evaluation metrics
The following metrics were used to evaluate our system:
Mean reciprocal rank (MRR)@k: The Reciprocal Rank (RR) is the reciprocal of the rank at which the first relevant document was retrieved: RR is 1 if the relevant document was retrieved at rank 1, 0.5 if it was retrieved at rank 2, and so on. When RR computed over the top k retrieved items is averaged across queries, the measure is called the Mean Reciprocal Rank@k37. In our case, we chose k=10.
Recall@k: At the k-th retrieved item, this metric measures the proportion of relevant items that are retrieved. We
evaluated both recall@1 and recall@10.
Precision@k: At the k-th retrieved item, this metric measures the proportion of the retrieved items that are relevant. In our case, we are interested in precision@1, since most of our data series have only 1 corresponding publication and therefore only 1 relevant item.
Mean average precision (MAP)@k: Average Precision is the average of the precision values obtained over the top k items after each relevant document is retrieved. When average precision is averaged again over all retrievals, this value becomes the mean average precision.
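To make the definitions concrete, the following is a small per-query sketch of these metrics; the helper functions and document IDs are illustrative, not our exact evaluation code. Averaging the per-query values over all series gives MRR@k and MAP@k.

# Illustrative per-query metrics.
def reciprocal_rank(ranked, relevant, k=10):
    for i, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / i          # 1 if first hit at rank 1, 0.5 at rank 2, ...
    return 0.0

def recall_at_k(ranked, relevant, k):
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)

def precision_at_k(ranked, relevant, k):
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k

def average_precision(ranked, relevant, k=10):
    hits, total = 0, 0.0
    for i, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            hits += 1
            total += hits / i       # precision at each relevant retrieval
    return total / len(relevant) if relevant else 0.0

ranked = ["d1", "d2", "d3"]            # top of a recommendation list
relevant = {"d1", "d9"}                # cited publications for the query series
print(reciprocal_rank(ranked, relevant),        # 1.0
      recall_at_k(ranked, relevant, 10),        # 0.5
      precision_at_k(ranked, relevant, 1),      # 1.0
      average_precision(ranked, relevant))      # 0.5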
Detailed procedure example
Below, we demonstrate the detailed procedure using BM25 and data series ‘GSE11663’ as an example:
• For each of the 50,159 publications, we concatenated the processed title with the abstract. We then created a BM25 object, along with its dictionary and corpus, from the resulting list.
• For ‘GSE11663’, we concatenated the title ('human cleavage stage embryos chromosomally unstable') and the
processed summary ('embryonic chromosome aberrations cause birth defects reduce human fertility however
neither nature incidence known develop assess genome-wide copy number variation loss heterozygosity single
cells apply screen blastomeres vitro fertilized preimplantation embryos complex patterns chromosome-arm
imbalances segmental deletions duplications amplifications reciprocal sister blastomeres detected large
proportion embryos addition aneuploidies uniparental isodisomies frequently observed since embryos derived
young fertile couples data indicate chromosomal instability common human embryogenesis comparative genomic
hybridisation') and got its vector representation using dictionary: [ (27, 1), (32, 1), (44, 1), (46, 1), (80, 1), (116,
1), (141, 1), (175, 1), (182, 1), (190, 1), (360, 2), (390, 1), (407, 1), (530, 1), (649, 1), (663, 1), (725, 1), (842, 1),
(844, 1), (999, 1), (1034, 1), (1186, 1), (1235, 1), (1370, 1), (1634, 1), (1635, 1), (1636, 1), (1761, 1), (1862, 1),
(2023, 1), (2174, 1), (2224, 1), (2292, 1), (2675, 1), (2677, 1), (3023, 1), (3082, 1), (3113, 1), (3144, 2), (3145,
2), (3153, 1), (3697, 1), (4265, 1), (4935, 1), (5021, 1), (5105, 1), (5775, 1), (6665, 1), (6772, 1), (6828, 1), (7298,
1), (7372, 1), (7684, 1), (7808, 1), (7949, 1), (8211, 1), (8344, 1), (8569, 2), (8974, 1), (9009, 1), (9302, 1), (9705,
1), (10480, 1), (11360, 1), (17139, 1), (24769, 1), (28560, 1), (38594, 1), (54855, 1), (228500, 1), (250370, 1) ].
Then we used scikit-learn's cosine_similarity to obtain the similarity scores of all 50,159 publications with this series.
• Since ‘GSE11663’ has the citations ['19396175', '21854607'] (unordered) and our top 10 recommendations were ['19396175', '23526301', '16698960', '25475586', '29040498', '23054067', '27197242', '23136300', '24035391', '18713793'], our top recommendation was a hit. In this case, we calculated
MRR@10 = 1/1 = 1, recall@1 = 1/3 = 0.33, recall@10 = 1/3 = 0.33, precision@1 = 1/1 = 1, and MAP@10 = (1/2) × (1 + 0) = 0.5.
• We repeated the above two steps and computed the averages over all 72,971 series (a code sketch of this procedure is shown below).
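The procedure above can be sketched as follows. This is an illustrative version, not our exact implementation: it uses the rank_bm25 package for BM25 scoring, a gensim Dictionary for the (token id, count) representation listed above, simplified tokenization, and it ranks directly by BM25 score rather than by the cosine-similarity step described in the example.

# Illustrative sketch of the BM25 ranking procedure described above.
import numpy as np
from gensim.corpora import Dictionary
from rank_bm25 import BM25Okapi

# Step 1: concatenated, processed title + abstract for each publication.
publication_texts = ["processed title and abstract of publication 1 ...",
                     "processed title and abstract of publication 2 ..."]
tokenized_corpus = [text.split() for text in publication_texts]

dictionary = Dictionary(tokenized_corpus)     # token -> integer id mapping
bm25 = BM25Okapi(tokenized_corpus)

# Step 2: concatenated title + processed summary of the series, e.g. 'GSE11663'.
series_text = "human cleavage stage embryos chromosomally unstable ..."
series_tokens = series_text.split()
series_bow = dictionary.doc2bow(series_tokens)   # list of (token id, count) pairs,
print(series_bow)                                # the representation listed above

# Step 3: score every publication against the series and keep the top 10.
scores = bm25.get_scores(series_tokens)
top10 = np.argsort(scores)[::-1][:10]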
Results
Table 1 shows the results of our experiments with different vector representations. BM25 outperformed all other methods on all evaluation metrics, with MRR@10, recall@1, recall@10, precision@1, and MAP@10 of 0.785, 0.742, 0.833, 0.756, and 0.783, respectively, followed closely by TF-IDF. None of the embedding methods alone was able to outperform BM25. Furthermore, word2vec, doc2vec, and BioBERT were among the top embedding methods, outperforming ELMo, USE, and the rest.
Our findings show that traditional term-frequency based methods (BM25, TF-IDF) were more effective for recommendation than embedding methods. This contrasts with the previous belief that embeddings can conquer it all, given their performance in standardized general NLP tasks such as sentiment analysis, question answering (Q&A), and named entity recognition (NER); here they failed to show an advantage in the simple scenario of capturing semantic similarity as measured by cosine similarity. Even though the contexts were not exactly the same, Collins and Beel9 found in their studies that doc2vec failed to beat TF-IDF or key phrases in the two experimental setups of publication recommendation for the digital library Sowiport and the reference manager JabRef. Moreover, A. Mohamed Hassan et al.10 also concluded in their study that none of the sentence embeddings (USE, InferSent, ELMo, BERT, and SciBERT) that they employed were able to outperform BM25 alone for their research-paper recommendations.
One possible reason could be that traditional statistical methods produce better features when the queries are relatively homogeneous: Ogilvie and Callan38 showed that single-database (homogeneous) queries with TF-IDF performed consistently better than multi-database (heterogeneous) queries when no additional IR techniques, such as query expansion, were involved. Currently, we use only GEO datasets as queries, all of which relate to gene expression; as we introduce more diverse datasets into our platform in the future, e.g. immunology and infectious disease datasets, the heterogeneity might require more advanced embedding methods.

Table 1. MRR@10, Recall@1, Recall@10, Precision@1, and MAP@10 for recommenders using different vector representations.

Further, as we observe an
approximately 8% improvement from regular BERT to BioBERT, we think it may be important for NLP models to be further trained on domain-specific corpora to obtain better feature representations for cosine similarity. Another possible reason could be that, because these embeddings were pre-trained on standardized tasks, they might be specialized towards those tasks rather than representing simple semantic information. This could explain the observation that general text embeddings, e.g. word2vec and doc2vec, performed better than more specialized NLP models, e.g. ELMo and BERT, which were pre-trained to perform tasks such as Q&A and sequence classification. Therefore, we might be able to take full advantage of their potential by reformulating our problem from simple cosine similarity between query and documents to, for example, a matching classification, a format closer to what these models were designed for in the first place. That is also the direction we are heading in future experiments.
Even though we do not currently have user feedback for manual evaluation, we did manually inspect the recommendation results for the completeness of our experiments, particularly those cases where the cited articles did not appear within the top 5 recommendations. We randomly sampled 20 such data series and examined the recommended papers by thoroughly reading through their abstracts, introductions, and methods. We made some interesting observations regarding those cases. For example, for the 'GSE96174' data series, even though our top 5 recommendations did not include the existing related article, three of them actually cited and used the data series as relevant research material. Another example is 'GSE27139', where our top recommendations were from the same author who
submitted the data series, and those articles were extensions of their previous research work. Due to time limitations, we could not check all 13,013 such cases, but we found at least 10 cases ('GSE96174', 'GSE836', 'GSE92231', 'GSE78579', 'GSE96211', 'GSE27139', 'GSE10903', 'GSE105628', 'GSE44657', 'GSE81888') with situations similar to those mentioned above, where the top 3 recommendations were, to the best of our judgement, associated with the data series of concern even though they did not appear among its citations at the time of our experiments. Therefore, we believe that our recommendation system might do even better in a real setting than the evaluations presented here suggest.
We also experimented with re-ranking, where the final ranking score was defined as the previous cosine similarity plus a re-ranking score, with the re-ranking score computed as the cosine similarity of the queried dataset's title and the articles' titles only. We did not find statistically significant improvements and therefore do not report those results in this paper.
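As a sketch of this score combination (function and variable names are illustrative):

# Sketch of the re-ranking variant: the base cosine similarity (series title + summary
# vs. publication title + abstract) plus a title-only cosine similarity term.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def rerank(query_full, docs_full, query_title, docs_title, k=10):
    base = cosine_similarity(query_full, docs_full).ravel()
    title_boost = cosine_similarity(query_title, docs_title).ravel()
    final = base + title_boost            # final ranking score used for re-ranking
    return np.argsort(final)[::-1][:k]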
Discussion
In this work, we developed a scholarly recommendation system to identify and recommend research papers relevant to public datasets. The sources of papers and datasets are PubMed and Gene Expression Omnibus (GEO) series, respectively. Different techniques for representing textual data, ranging from traditional term-frequency based methods and topic modeling to embeddings, were employed and compared in this work. Our results show that embedding models that perform well in standardized NLP tasks failed to outperform term-frequency based probabilistic methods such as BM25. General embeddings (word2vec and doc2vec) performed better than more specialized embeddings (ELMo and BERT), and domain-specific embeddings (BioBERT) performed better than non-domain-specific embeddings (BERT). In future experiments, we plan to develop a hybrid method combining the strengths of term-frequency approaches and embeddings to maximize their potential in different (heterogeneous vs. homogeneous) problem scenarios. In addition, we plan to engage users in rating our recommendations, use an inter-rater agreement approach to further evaluate the results, and incorporate the feedback to further improve our system. We hope to utilize content-based and collaborative filtering for better recommendations.
Given their usefulness, extending the application of recommender systems to aid scholars in finding relevant information and resources will significantly enhance research productivity and will ultimately promote the reusability of data and resources.
References
1. Ali M, Johnson CC, Tang AK. Parallel collaborative filtering for streaming data. University of Texas Austin,
Tech. Rep. 2011 Dec 8:5-7.
2. Bell RM, Koren Y. Lessons from the Netflix prize challenge. Acm Sigkdd Explorations Newsletter. 2007 Dec
1;9(2):75-9.
3. Vaz PC, Martins de Matos D, Martins B, Calado P. Improving a hybrid literary book recommendation system through author ranking. In Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries; 2012 Jun 10 (pp. 387-388).
4. Li L, Chu W, Langford J, Schapire RE. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web; 2010 Apr 26 (pp. 661-670).
5. Bollacker KD, Lawrence S, Giles CL. CiteSeer: an autonomous web agent for automatic retrieval and identification of interesting publications. In Proceedings of the Second International Conference on Autonomous Agents; 1998 May 1 (pp. 116-123).
6. Achakulvisut T, Acuna DE, Ruangrong T, Kording K. Science Concierge: A fast content-based recommendation
system for scientific publications. PloS one. 2016 Jul 6;11(7):e0158423.
7. Yoneya T, Mamitsuka H. PURE: a PubMed article recommendation system based on content-based filtering.
Genome informatics. 2007;18:267-76.
8. Lin J, Wilbur WJ. PubMed related articles: a probabilistic topic-based model for content similarity. BMC
bioinformatics. 2007 Dec 1;8(1):423.
9. Collins A, Beel J. Document Embeddings vs. Keyphrases vs. Terms: An Online Evaluation in Digital Library
Recommender Systems. arXiv preprint arXiv:1905.11244. 2019 May 27.
10. Hassan HA, Sansonetti G, Gasparetti F, Micarelli A, Beel J. BERT, ELMo, USE and InferSent sentence encoders: the panacea for research-paper recommendation? In RecSys (Late-Breaking Results); 2019 (pp. 6-10).
11. About GEO datasets [Internet]. GEO. 2020 [cited 18 August 2020]. Available from:
https://www.ncbi.nlm.nih.gov/geo/info/datasets.html.
12. NIH strategic plan for data science [Internet]. National Institutes of Health. 2020 [cited 18 August 2020].
Available from: https://datascience.nih.gov/strategicplan.
13. Hienert D, Sawitzki F, Mayr P. Digital library research in action - supporting information retrieval in Sowiport. D-Lib Magazine. 2015 Mar 4;21(3):4.
14. Kopp O, Breitenbücher U, Müller T. CloudRef - towards collaborative reference management in the cloud. In ZEUS; 2018 (pp. 63-68).
15. Han J, Kamber M, Pei J. Getting to know your data. Data mining: concepts and techniques. 2011;3(744):39-81.
16. Rajaraman A, Ullman JD. Data mining. In: Mining of massive datasets. Cambridge: Cambridge University Press; 2011. p. 1-17.
17. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in Python. The Journal of Machine Learning Research. 2011 Nov 1;12:2825-30.
18. Robertson SE, Walker S, Jones S, Hancock-Beaulieu MM, Gatford M. Okapi at TREC-3. NIST Special Publication SP 109. 1995:109.
19. Rehurek R, Sojka P. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks; 2010.
20. Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv
preprint arXiv:1301.3781. 2013 Jan 16.
21. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems; 2013 (pp. 3111-3119).
22. Peters ME, Neumann M, Iyyer M, et al. Deep contextualized word representations. arXiv preprint
arXiv:1802.05365. 2018 Feb 15.
23. Chelba C, Mikolov T, Schuster M, et al. One billion word benchmark for measuring progress in statistical
language modeling. arXiv preprint arXiv:1312.3005. 2013 Dec 11.
24. Abadi M, Agarwal A, Barham P, et al. TensorFlow: Large-scale machine learning on heterogeneous systems.
25. Conneau A, Kiela D, Schwenk H, Barrault L, Bordes A. Supervised learning of universal sentence representations
from natural language inference data. arXiv preprint arXiv:1705.02364. 2017 May 5.
26. Bowman SR, Angeli G, Potts C, Manning CD. A large annotated corpus for learning natural language inference.
arXiv preprint arXiv:1508.05326. 2015 Aug 21.
27. Facebookresearch / InferSent [Internet]. GitHub repository. 2020 [cited 18 August 2020]. Available from:
https://github.com/facebookresearch/InferSent.
28. Cer D, Yang Y, Kong SY, et al. Universal sentence encoder. arXiv preprint arXiv:1803.11175. 2018 Mar 29.
29. Devlin J, Chang MW, Lee K, Toutanova K. Bert: Pre-training of deep bidirectional transformers for language
understanding. arXiv preprint arXiv:1810.04805. 2018 Oct 11.
30. Reimers N, Gurevych I. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint
arXiv:1908.10084. 2019 Aug 27.
31. Beltagy I, Lo K, Cohan A. SciBERT: A pretrained language model for scientific text. arXiv preprint
arXiv:1903.10676. 2019 Mar 26.
32. Semantic scholar | AI-Powered Research Tool [Internet]. Semanticscholar.org. 2020 [cited 18 August 2020].
Available from: https://www.semanticscholar.org/
33. Lee J, Yoon W, Kim S, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. arXiv preprint arXiv:1901.08746. 2019.
34. Liu Y, Ott M, Goyal N, et al. Roberta: A robustly optimized bert pretraining approach. arXiv preprint
arXiv:1907.11692. 2019 Jul 26.
35. Mackenzie J, Benham R, Petri M, Trippas JR, Culpepper JS, Moffat A. CC-News-En: A Large English News
Corpus.
36. Sanh V, Debut L, Chaumond J, Wolf T. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and
lighter. arXiv preprint arXiv:1910.01108. 2019 Oct 2.
37. Craswell N. Mean reciprocal rank. Encyclopedia of database systems. 2009;1703.
38. Ogilvie P, Callan J. The effectiveness of query expansion for distributed information retrieval. In Proceedings of the Tenth International Conference on Information and Knowledge Management; 2001 (pp. 183-190).