This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JBHI.2020.3032479, IEEE Journal of Biomedical and Health Informatics
Abstract— Today, the World Wide Web is overwhelmed by an unprecedented quantity of information on versatile topics and of varied quality. The quality of information disseminated in the field of medicine is of particular concern, as the negative health consequences of health misinformation can be life-threatening. There is currently no generic automated tool for evaluating the quality of online health information across a broad range of topics. To address this gap, in this paper we apply a data mining approach to automatically assess the quality of online health articles against 10 quality criteria. We have prepared a labelled dataset with 53012 features and applied different feature selection methods to identify the best feature subset, with which our trained classifiers achieved accuracies of 84%-90% across the 10 criteria. Our semantic analysis of features shows the underpinning associations between the selected features and the assessment criteria, and further rationalizes our assessment approach. Our findings will help identify high quality health articles, aiding users in shaping their opinions and making the right choice when seeking health related help online.

Index Terms— Health articles, misinformation, quality assessment, data mining.

I. INTRODUCTION

THE tremendous advancement of digital technology and the widespread usage of the Internet have made information accessible worldwide. Consequently, the majority of people are turning to the Internet to search for a diverse range of health related information. According to a study by the Australian Institute of Health and Welfare, 78% of Australian adults searched for health-related information in 2015 [1]. However, the reliability of information from web sources is questionable due to the unregulated nature of the Internet.

In this era of the Internet, misinformation (dubious, low quality, fabricated information) spreads like wildfire, much faster than the truth. A plethora of information from online health articles (OHA) and other sources (blogs, Facebook, Twitter, YouTube, etc.) is available to health information seekers. But not all of this information is reliable, as it stems from various individuals and organizations [2]-[4]. Hence, the task of distinguishing unreliable health information from reliable information poses substantial challenges for individuals [5]-[7].

The extensive spread of unreliable information can negatively affect public health. Wrong decisions based on misinformation force people to uphold erroneous beliefs and opinions instead of irrefutable evidence [8]. Sometimes such articles fail to convey the intended information to the reader. This may result in misinterpretation of a concept and eventually trigger fear or incite one to change regular habits overnight. However, the online network is not going anywhere: seeking and sharing health information online will not stop, and misinformation will prevail as well [9]. For this reason, assessing and assuring the quality of health information on the World Wide Web has become a fundamental issue for users [10]. The better the quality of health information, the more reliable and accessible it is, and the more effective it will be in moulding users' behaviour towards health care.

To curb this situation, several approaches have been proposed to assess the quality of health related information. Among these, some approaches conducted the assessment manually and relied on users' perception to qualify a health news item. A number of studies [11], [12] estimated the quality of overall web sources rather than evaluating each article published in them. A few others [13]-[15] evaluated the quality of articles published in a specific disease domain, which narrowed the scope of their work. Some studies (e.g., [16]) proposed evaluation criteria frameworks, and some assessed quality based on such a proposed framework [9]. But in the case of criteria selection, a question always remains about its specific application to the medical domain, as criteria selection for health specific articles necessitates the involvement of health professionals. Moreover, given the ever changing landscape of the Internet, no universal framework for automatically assessing the quality of OHA has been proposed to date.

With this context in mind, this study attempts to automate the quality assessment process of OHA based on the ideas and effort of HealthNewsReview.org1. This organization manually evaluates health-related articles with a team of 50 experts from various disciplines, including journalism, medicine, health services research, public health and patient perspectives. The performance of this organization is excellent but not scalable in comparison to the speed of the information explosion worldwide. In this paper, we apply a data mining based approach to assess the quality of online health articles automatically. Our main contributions can be summarized as follows:

F. Afsana, M. A. Kabir and M. Paul are with the School of Computing and Mathematics, Charles Sturt University, NSW, Australia (e-mail: [email protected]; [email protected]; [email protected])
N. Hassan is with the Philip Merrill College of Journalism, University of Maryland, USA (e-mail: [email protected])
Corresponding author: Muhammad Ashad Kabir
1 https://www.healthnewsreview.org
2168-2194 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: Middlesex University. Downloaded on November 02,2020 at 14:52:59 UTC from IEEE Xplore. Restrictions apply.
2 IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS
• We have developed a labelled dataset of health related news articles finely annotated by health experts from HealthNewsReview.org. So far, no generic health related dataset is available that is suitable for assessing the quality of OHA. Our dataset, once released, will be a valuable resource for the health and research communities for conducting future studies on the topic of misinformation in the field of medicine.
• We have explored multifaceted feature spaces through systematic content analysis to identify appropriate features for automating the quality assessment process. We have also identified criteria-wise discriminating features by analyzing feature importance.
• We have examined the applicability of various data mining techniques in assessing the quality of OHA automatically and achieved state-of-the-art performance.
• We have also provided an explanation of the feature subset corresponding to each criterion to justify the value of the assessment.

II. RELATED WORK

The quality of online health related information has been a major concern since the dawn of the World Wide Web (WWW) era [17], [18]. Numerous tools have been developed to facilitate the quality measurement of health related information, most of which are based on a particular disease (e.g., cancer, diabetes) and lack robust validity and reliability testing. In [19], Keselman et al. conducted an exploratory study with a view to developing a methodological approach to analyze health related web pages and applying it to a set of relevant web pages. This qualitative study analysed webpages about natural treatment of diabetes to accentuate the challenges faced by consumers in seeking health information. It also underscored the importance of developing support tools so that this formative study could help users seek, evaluate, and analyze information on the World Wide Web. We have summarized the relevant research along three categories.

A. Statistical Analysis Based Quality Assessment Approach

DISCERN [20], a short instrument, was developed for judging the quality of written consumer health information about treatment choices by producers, health professionals and patients, and for facilitating the production of high quality evidence-based patient information. The DISCERN approach combined qualitative methods with a statistical measure of inter-rater agreement among an expert panel representing a range of expertise, including the production and use of consumer health information [21]. To establish face and content validity and inter-rater reliability, this approach administered a questionnaire to information providers and self-help organizations. Later, the authors of [20] developed an explicit scheme for calculating a 5-star quality rating for consumer health information based on DISCERN [22].

The Ensuring Quality Information for Patients (EQIP) tool [23] assesses the presentation quality of all types of written health care information in a more rigorous way, and prescribes the action required following the evaluation. The EQIP tool was demonstrated through several processes of item generation, testing for concurrent validity, inter-rater reliability and utility, using large diverse samples of written health care information.

The Quality Index for health related Media Reports (QIMR) was developed as an evaluation tool to monitor the quality of health research reporting in the lay media, more specifically for Canadian media. Themes from interviews with health journalists and researchers were used to develop QIMR [21]. However, the QIMR approach is limited in sample size and scope, and fails to evaluate the quality of news sources with content of varying quality.

However, the specific focus on treatment information or particular media has narrowed the scope of these approaches and calls into question their applicability to online content about other aspects of health and illness. In contrast, our approach is applicable to all health related information domains. Moreover, the existing approaches were conducted through manual labour, whereas ours is a fully automated system that assesses the quality of health articles in the shortest possible time.

B. Criteria Based Quality Assessment Approach

To date, there is no clear universal standard to assess the quality of web based health information [24]. Kim et al. conducted an extensive review to identify criteria that had already been proposed or employed specifically for evaluating health related information worldwide [25]. Eysenbach et al. conducted a systematic review to compile criteria actually used to measure the quality of health information on the Web, and synthesized evaluation results from studies containing quantitative data on structure and process [10]. Comparing the methodological frameworks of existing approaches, the authors concluded with the need for defining operational criteria for quality assessment. [2] is another systematic review in which the authors reviewed empirical studies on trust and credibility in the use of web-based health information (WHI), with an aim to identify factors that impact judgments of trustworthiness and credibility, and to explore the role of demographic factors affecting trust formation.

The Code of Conduct for medical websites (HONcode), initiated by the Health On the Net Foundation, was the first attempt to propose guidelines for information providers to raise the quality of medical and health information available on the World Wide Web [26]. Adopting a set of eight criteria to certify websites containing health information, its creators also developed a Health Website Evaluation Tool, which offered users an indication of the providers' commitment to quality.

There are several criteria-based assessment tools, and few of them have proper validation [27]. The Quality Evaluation Scoring Tool (QUEST) is the first quantitative tool that supports a broad range of health information and has undergone a validation process [16]. Based on a review of existing tools [13], [28], QUEST quantitatively measures six criteria: authorship, attribution, conflicts of interest, currency, complementarity
AFSANA et al.: QUALITY ASSESSMENT OF HEALTH ARTICLES 3
and tone, which can be used by health care professionals and researchers alike. QUEST's reliability and validity were demonstrated by evaluating online articles on Alzheimer's disease. In a Fuzzy VIKOR based approach, Afful-Dadzie et al. [9] proposed a new criteria framework for measuring the quality of information provided by each site. The authors demonstrated a decision making model to find out how online health information providers could be assessed and ranked based on their quality.

C. Machine Learning Based Analysis and Miscellaneous

Apart from the aforementioned approaches, there are a few more studies which are not directly aligned with our research but provide us with valuable insights.

In [29], the authors developed a new labelled dataset of misinformative and non-misinformative comments from a medical health forum, MedHelp, with a view to creating a resource for medical research communities to study the spread of medical misinformation. A preliminary feature analysis of the dataset was also presented, towards a real-time automated system for identifying and classifying medical misinformation in online forums.

An applied machine learning based approach is proposed in [30], where the authors addressed the veracity of online health information by automating systematic approaches in conjunction with Evidence-Based Medicine (EBM). Based on EBM and trusted medical information sources, the authors proposed an algorithm, MedFact, which recommends trusted medical information within health related social media and empowers online users to determine the veracity of health information using machine learning techniques. Their aim was to address the factual accuracy of online health information from social media discourse based on keyword extraction, whereas our objective is to evaluate the quality of online health related articles from a data mining perspective. We have focused on identifying the discriminating features of health related articles for assessing their quality in an automatic manner.

Ghenai et al. [31] proposed a tool for tracking misinformation around health concerns on Twitter, based on a case study about Zika. The tool discovered health related rumours in social media by incorporating professional health experts through crowdsourcing for annotating the dataset, and machine learning for rumour classification. Our aim is different from this study: rather than focusing on health related rumours, we focus on all types of health related articles available online and evaluate their quality, so that people can identify which articles to read and which to avoid for decision making.

A recent study by Dhoju et al. [11] identified structural, topical and semantic differences between health related news articles from reliable and unreliable media by conducting a systematic content analysis. By leveraging a large-scale dataset, the authors successfully identified some discriminating features which separate reliable health news from unreliable news.

However, our study is quite different from these existing methodologies. Our aim is to automate the quality assessment process of health related articles using data mining, which, to the best of our knowledge, has not been examined so far. Thus, in this paper, we focus on using health articles finely annotated by a group of experts to examine the performance of an automated quality assessment approach using data mining.

III. DATASET DESCRIPTION

There is currently no single dataset for assessing the quality of online health articles (OHA). For this study, we have prepared a dataset based on 1720 health-related articles from HealthNewsReview.org. The mission of this website is to introduce a significant step towards meaningful health care reform by evaluating the accuracy of medical news and examining the quality of the evidence they provide. Since its foundation in 2006, HealthNewsReview.org has provided reviews of health news reporting from major U.S. news organizations, conducted by a multi-disciplinary team of reviewers from the journalism, medicine, health services research and public health domains.

According to the editorial team of HealthNewsReview.org, all stories and press releases about public health interventions should be evaluated against ten different criteria to ensure the quality of information in terms of accuracy, balance and completeness. The organization proposed these ten criteria, based on an analysis of previous studies combined with viewpoints from health care journalism2, as a standard for judging the quality of health articles. These criteria address the basic issues that a consumer should know in order to develop an opinion on health related interventions and how/whether they matter in their lives. Below we provide a list of those criteria3.

• Criterion 1: Does the story adequately discuss the costs of the intervention?
• Criterion 2: Does the story adequately quantify the benefits of the intervention?
• Criterion 3: Does the story adequately explain/quantify the harms of the intervention?
• Criterion 4: Does the story seem to grasp the quality of the evidence?
• Criterion 5: Does the story commit disease-mongering?
• Criterion 6: Does the story use independent sources and identify conflicts of interest?
• Criterion 7: Does the story compare the new approach with existing alternatives?
• Criterion 8: Does the story establish the availability of the treatment/test/product/procedure?
• Criterion 9: Does the story establish the true novelty of the approach?
• Criterion 10: Does the story appear to rely solely or largely on a news release?

On HealthNewsReview.org, for each published health news article, a group of experts reviews and justifies each of the above criteria with a 'Satisfactory' or 'Not Satisfactory' score based on quality. In some cases, a criterion is rated as 'Not Applicable' when it is impossible or unreasonable for an article to address it.

2 https://healthjournalism.org/secondarypage-details.php?id=56
3 https://www.healthnewsreview.org/about-us/review-criteria/
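For illustration, the per-article expert ratings described above can be held in a simple mapping from criterion to rating. The field names and record layout here are our own sketch, not the format of the paper's dataset:

```python
# Valid ratings assigned by HealthNewsReview.org reviewers
RATINGS = {"Satisfactory", "Not Satisfactory", "Not Applicable"}

# Hypothetical record for one reviewed article; criterion numbers follow
# the list above, but field names and the URL are illustrative only.
article = {
    "url": "https://example.org/story",
    "labels": {f"criterion_{i}": "Satisfactory" for i in range(1, 11)},
}
article["labels"]["criterion_5"] = "Not Satisfactory"  # disease-mongering found
article["labels"]["criterion_8"] = "Not Applicable"    # availability not relevant

# Every criterion carries exactly one of the three ratings
assert set(article["labels"].values()) <= RATINGS
```

Framed this way, each of the ten criteria becomes an independent binary classification target, with 'Not Applicable' articles set aside for that criterion.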
[Figure: bar chart of the number of WHA per rating ('Not Satisfactory', 'Not Applicable'); y-axis 'Number of WHA', ticks 400-1400.]

In this section, we will explain the data pre-processing, feature extraction and feature selection process used to establish the baseline performance of our approach. All our data pre-processing and feature extraction have been conducted using Python and some other useful libraries, e.g., scikit-learn6 and NLTK7.
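As a minimal sketch of the kind of pre-processing step this section refers to (the paper names Python, scikit-learn and NLTK but does not enumerate its exact steps, so the tokenization rule and the tiny stopword list below are stand-ins):

```python
import re

# Tiny illustrative stopword list; a real pipeline would use a full list
# such as NLTK's stopwords corpus.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are"}

def preprocess(text):
    # Lowercase, tokenize on alphabetic runs, then drop stopwords.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("The benefits of the screening test are overstated."))
# → ['benefits', 'screening', 'test', 'overstated']
```

The token lists produced at this stage feed the feature extractors described next (TF-IDF, POS tagging, and so on).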
• Function words (15 features; e.g., pronouns, articles, conjunctions)
• Perceptual process (4 features; e.g., see, hear)
• Biological process (5 features; e.g., body, health)
• Drives (6 features; e.g., reward, risk, power)
• Other grammar (6 features; e.g., interrogatives, numbers)
• Time orientation (3 features; e.g., past, present, future)
• Relativity (4 features; e.g., motion, time)
• Affect (6 features; positive emotion, negative emotion (e.g., anger))
• Personal concerns (6 features; e.g., work, leisure, money)
• Social (5 features; e.g., family, friend)
• Informal language (6 features; e.g., filler, swear)
• Cognitive process (7 features; e.g., differ, insight)

2) Term Frequency and Inverse Document Frequency: We have used this weighting metric to measure the importance of a term in a document within the entire dataset [32]. Term Frequency (TF) quantifies the frequency of a word in a particular document, whereas Inverse Document Frequency (IDF) measures the importance of a term within the corpus. Let D symbolize the whole corpus with N documents. If n(t)_d denotes the number of times term t appears in a document d, then TF, denoted by TF(t)_d, can be calculated by equation (1):

    TF(t)_d = n(t)_d / Σ_{t'∈d} n(t')_d ,    (1)

And IDF, denoted by IDF(t)_D, can be calculated by equation (2):

    IDF(t)_D = 1 + log( N / |{d ∈ D : t ∈ d}| ),    (2)

For a particular term, the product of TF and IDF represents the TF-IDF weight of that term. The higher the TF-IDF score, the more distinctive the term: a high score indicates a term that is frequent in the document but rare across the corpus. We applied a unigram tokenizer and removed terms with extremely low as well as extremely high frequency to achieve a better accuracy [34], [35]. Since too frequent or too rare words are not influential in characterizing an article, we ignored all words that appeared in more than 90% of the documents or in fewer than 3 documents. Again, to keep the dimensionality of our feature set at a manageable size, we set the maximum feature count to the top 4000 terms by frequency.

3) Part Of Speech Tagging: Part Of Speech Tagging (POST), also known as word-category disambiguation, is used to annotate a word with the appropriate part of speech based on both its definition and its context, in order to resolve lexical ambiguity [36]. To recognize POS tags, we applied the Stanford POS tagger8. We found a tagset of 35 part-of-speech tags (e.g., CC, CD, NP, RBR) in the corpus, from which we derived two sets of features: POS tag count and POSWord count. For POS tag count, we measured the document-wise count of words belonging to a particular POS tag and thus obtained 35 individual features. For POSWord count, we measured the count of the tag associated with each individual word within a document and obtained 47,451 non-overlapping features. POSWord features are capable of performing rudimentary word sense disambiguation in situations where a word can represent several meanings.

4) Citation and Ranking: We analysed the presence of hyperlinks to determine the credibility of an article. We extracted three features from the link attribute: internal link, external link and rank. We counted the number of internal links to inspect the amount of self-citation occurring in a document, which may indicate bias. Conversely, the number of external links was counted to capture the citation network of an article. We derived the rank attribute to gauge the quality of an article by measuring the standing of the webpages that the article cites. We considered Alexa Global Ranking9 as an indicator of a webpage's standing, as it gives an estimate of a website's popularity. We counted the outgoing links from all documents within the corpus and found 1428 distinct domains. We replaced each domain with its associated Alexa rank value and thus obtained 1428 distinct rank features for the overall corpus.

8 https://nlp.stanford.edu/software/tagger.html
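Equations (1) and (2) can be written directly in Python; the toy corpus here is our own illustration, not data from the paper:

```python
import math
from collections import Counter

def tf(term, doc):
    # Equation (1): occurrences of the term divided by total tokens in the document
    return Counter(doc)[term] / len(doc)

def idf(term, corpus):
    # Equation (2): 1 + log(N / number of documents containing the term)
    df = sum(1 for doc in corpus if term in doc)
    return 1 + math.log(len(corpus) / df)

def tf_idf(term, doc, corpus):
    # TF-IDF weight: product of the two quantities above
    return tf(term, doc) * idf(term, corpus)

# Toy corpus of tokenized articles; the real features were unigrams with
# document frequency between 3 documents and 90% of the corpus.
corpus = [["screening", "cuts", "risk"],
          ["screening", "benefits", "overstated"],
          ["drug", "price", "risk"]]
```

Note that scikit-learn's TfidfVectorizer computes a slightly different variant by default (raw counts for TF, smoothed IDF), so a length-normalized TF as in equation (1) has to be applied explicitly if exact agreement is wanted.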
5) Similarity Measure: An ambiguous or misleading headline can degrade the quality of an article. So, to measure the relevance between the headline-body pair of each article, we used the TF-IDF cosine similarity metric to extract a similarity feature [37]. It quantifies the similarity between the headline and the body of a document irrespective of their size, by measuring the cosine of the angle between the two vectors projected in a multi-dimensional space.

6) Miscellaneous: We quantified the normalized distinct word count as a feature to capture how much rare words contribute to the classification problem, as health related articles comprise many different medical terms. We also counted the number of organizations and persons mentioned in an article as indicators of bias. We used the Stanford Named Entity Recognizer (NER)10 to extract these features.

C. Feature Selection

We aim to predict ten different criteria using numerous features (53012 in total), some of which might be redundant or irrelevant to the prediction. A dataset containing irrelevant features can result in over-fitting and can mislead the modelling power of a method. Thus, it is critically important to select the most relevant features from the feature set. To select the features that contribute most to our classification task, we employed three different automatic feature selection techniques. First, correlation-based attribute evaluation (CoAE-PC), which evaluates the worth of a feature by measuring Pearson's correlation between it and the class. Second, classifier-based attribute evaluation (ClAE-LR), which evaluates the worth of a feature using a Logistic Regression classifier. Third, classifier-based attribute evaluation (ClAE-RF), which evaluates the worth of a feature using a Random Forest classifier. For each of the above three attribute evaluators, a rank search method was performed, which ranks features by their individual evaluations to find the most correlated feature set.

V. EXPERIMENTAL EVALUATION

The core contribution of our work is to assess the quality of online health articles automatically by applying various data mining techniques. In this section, we quantify and evaluate the performance of a number of classification techniques, for different feature selection methods and variable feature sizes, to achieve the best result. We have used the WEKA tool [38] in our experimental evaluation.

We experimented with four classification methods. First, the Support Vector Machine (SVM) classifier, for which we used PolyKernel as the kernel to control the projection and the amount of flexibility in separating classes in our dataset. Second, the Naive Bayes classification algorithm, which calculates the posterior probability for each class using a simple implementation of Bayes' theorem and predicts the class with the highest probability; for each numerical attribute, a Gaussian distribution is assumed by default [40]. Third, the Random Forest classifier, which constructs a multitude of decision trees at training time and merges them to obtain a more accurate and stable prediction [41]; we considered 100 trees for our Random Forest implementation. Fourth, the EnsembleVoteClassifier, a meta-classifier that combines similar or conceptually different machine learning classifiers for classification via majority voting. We combined the three aforementioned classifiers to build our ensemble estimator and examined its performance on our dataset. All methods were evaluated by 10-fold cross-validation, where in each fold 90% of the dataset was used for training and 10% for testing. Various combinations of the extracted features were tested to evaluate how accurately our approach can automatically classify each criterion.

B. Identify Feature Selection Method and Feature Size

To identify the feature selection method and the feature size that yield the best classification accuracy for our dataset, we experimented with the impact of different feature selection methods and varied feature sizes on classification accuracy.

1) Identify Feature Selection Method: We ran the three feature selection methods on our feature space, with the goal of determining which method performs best by selecting the feature subset that yields the best classification performance. Table II presents the outcomes of the comparative study of the three feature selection methods over four different classifiers (SVM, Naive Bayes, Random Forest and EnsembleVote), carried out against a feature subset of size 4000. Here, we present the weighted Precision (WP), weighted Recall (WR) and weighted F-Measure (WF) from the Weka output, as a better estimate of overall classification performance [42]. Weka [38] calculates the weighted average by averaging the metric over each class, weighted by the proportion of elements in each class. So, for our binary class problem, WP, WR and WF are calculated from equations (3), (4) and (5), respectively.

    WP = (P_CS × |CS| + P_CNS × |CNS|) / (|CS| + |CNS|),    (3)

where P_CS and P_CNS are the precision over the 'Satisfactory' (CS) and 'Not Satisfactory' (CNS) classes, and |CS| and |CNS| are the number of articles in each class.
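The class-size weighting of equation (3), which applies in the same way to recall and F-measure, amounts to:

```python
def weighted_average(metric_sat, metric_not_sat, n_sat, n_not_sat):
    # Equation (3): per-class metric values weighted by the class sizes
    # |CS| and |CNS| ('Satisfactory' / 'Not Satisfactory' counts)
    return (metric_sat * n_sat + metric_not_sat * n_not_sat) / (n_sat + n_not_sat)

# e.g., precision 0.90 on 80 'Satisfactory' articles and 0.70 on 20
# 'Not Satisfactory' articles gives a weighted precision of 0.86
```

This is the same weighted averaging that Weka reports in its per-class evaluation output.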
TABLE II: Comparison of three feature selection methods over four classifiers (feature size: 4000); the highest weighted F-measure for each criterion is shown in bold.

Criterion  FS Method   SVM                  Random Forest        Naive Bayes          Ensemble
                       WP    WR    WF       WP    WR    WF       WP    WR    WF       WP    WR    WF
1          CoAE-PC     0.901 0.903 0.899    0.794 0.786 0.718    0.861 0.756 0.774    0.868 0.803 0.816
           ClAE-LR     0.887 0.888 0.881    0.826 0.786 0.711    0.813 0.684 0.708    0.859 0.827 0.836
           ClAE-RF     0.877 0.881 0.875    0.831 0.792 0.724    0.792 0.637 0.666    0.848 0.813 0.823
2          CoAE-PC     0.855 0.857 0.854    0.739 0.707 0.621    0.793 0.721 0.730    0.799 0.738 0.747
           ClAE-LR     0.826 0.828 0.821    0.765 0.707 0.615    0.766 0.686 0.696    0.789 0.739 0.747
           ClAE-RF     0.795 0.801 0.794    0.751 0.703 0.610    0.691 0.567 0.587    0.729 0.674 0.685
3          CoAE-PC     0.835 0.837 0.832    0.752 0.720 0.655    0.790 0.712 0.720    0.793 0.727 0.735
           ClAE-LR     0.851 0.849 0.841    0.779 0.722 0.651    0.764 0.698 0.707    0.773 0.727 0.735
           ClAE-RF     0.804 0.808 0.803    0.780 0.721 0.648    0.704 0.612 0.624    0.729 0.674 0.685
4          CoAE-PC     0.847 0.848 0.846    0.707 0.695 0.635    0.783 0.713 0.718    0.790 0.728 0.733
           ClAE-LR     0.824 0.824 0.818    0.746 0.693 0.615    0.758 0.689 0.694    0.769 0.720 0.726
           ClAE-RF     0.779 0.783 0.778    0.741 0.692 0.615    0.680 0.583 0.592    0.704 0.643 0.650
5          CoAE-PC     0.894 0.887 0.843    0.769 0.877 0.819    0.864 0.722 0.767    0.873 0.890 0.863
           ClAE-LR     0.888 0.901 0.888    0.769 0.877 0.819    0.829 0.757 0.786    0.869 0.887 0.873
           ClAE-RF     0.856 0.880 0.862    0.892 0.877 0.820    0.805 0.707 0.748    0.845 0.873 0.852
6          CoAE-PC     0.835 0.835 0.835    0.688 0.688 0.688    0.754 0.744 0.742    0.747 0.737 0.735
           ClAE-LR     0.733 0.733 0.732    0.678 0.676 0.674    0.689 0.644 0.623    0.693 0.648 0.628
           ClAE-RF     0.719 0.719 0.719    0.687 0.686 0.685    0.665 0.636 0.621    0.667 0.638 0.624
7          CoAE-PC     0.855 0.854 0.854    0.669 0.666 0.663    0.737 0.733 0.732    0.747 0.737 0.735
           ClAE-LR     0.716 0.715 0.715    0.678 0.666 0.659    0.669 0.630 0.611    0.669 0.630 0.611
           ClAE-RF     0.690 0.689 0.688    0.644 0.638 0.632    0.627 0.607 0.595    0.627 0.607 0.595
8          CoAE-PC     0.875 0.876 0.870    0.768 0.737 0.638    0.813 0.777 0.786    0.813 0.781 0.789
           ClAE-LR     0.814 0.821 0.807    0.780 0.737 0.638    0.710 0.642 0.661    0.737 0.712 0.721
           ClAE-RF     0.762 0.775 0.765    0.750 0.736 0.639    0.657 0.523 0.563    0.718 0.699 0.706
9          CoAE-PC     0.867 0.869 0.867    0.695 0.698 0.607    0.782 0.765 0.770    0.782 0.765 0.770
           ClAE-LR     0.827 0.826 0.816    0.754 0.710 0.621    0.689 0.608 0.621    0.727 0.689 0.699
           ClAE-RF     0.769 0.777 0.769    0.752 0.704 0.608    0.620 0.477 0.507    0.661 0.610 0.623
10         CoAE-PC     0.880 0.886 0.878    0.815 0.805 0.722    0.848 0.765 0.787    0.854 0.794 0.811
           ClAE-LR     0.867 0.875 0.865    0.777 0.804 0.721    0.779 0.689 0.717    0.822 0.814 0.818
           ClAE-RF     0.828 0.842 0.830    0.783 0.806 0.730    0.757 0.632 0.679    0.796 0.793 0.795
Legend: FS – Feature Selection; WP – Weighted Precision; WR – Weighted Recall; WF – Weighted F-Measure.
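A reader reproducing this comparison can rank the methods mechanically. A minimal sketch, with the weighted F-measures hard-coded from the criterion 1 SVM column of Table II (method names abbreviated as in the FS column):

```python
# Weighted F-measures for criterion 1 under the SVM classifier (Table II).
wf_criterion1_svm = {"CoAE-PC": 0.899, "ClAE-LR": 0.881, "ClAE-RF": 0.875}

# Pick the feature selection method with the highest weighted F-measure.
best_method = max(wf_criterion1_svm, key=wf_criterion1_svm.get)
print(best_method)  # CoAE-PC
```

The same one-liner applied per criterion recovers the bold entries of the table.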
TABLE III: Most discriminating features [Criteria 1-10]

Cr   Most Correlated Features (Top 16)
1    Money?, cost NN‡, costs VBZ‡, insurance NN‡, costs NNS‡, But CC‡, price NN‡, negate?, verb?, not RB‡, dollars NNS‡, covered VBN‡, thousand dollar∗, pay VB‡, cost VB‡, tag NN‡
2    Percent NN‡, quant?, were VBD‡, compared VBN‡, england journal∗, group NN‡, differ?, new england∗, year percent∗, Reuters NNP‡, standardsth thomson∗, trust principl∗, compar percent∗, reuter trust∗, percent percent∗, journal medicin∗
3    not RB‡, should MD‡, WC?, negate?, differ?, effects NNS‡, some DT‡, cause VB‡, risks NNS‡, tentat?, have VB‡, nausea NN‡, external link⊗, common effect∗, effect includ∗, side JJ‡, high dos∗, died VBD‡
4    study NN‡, normalizeddistinctwordcount⊗, not RB‡, randomly RB‡, differ?, assigned VBN‡, studies NNS‡, The DT‡, were VBD‡, placebo NN‡, One CD‡, editorial NN‡, evidence NN‡, group NN‡, randomized VBN‡, random assign∗, placebo group∗, email NN‡
5    famili histori∗, Anesthesiologists NNPS‡, dri eye∗, revealed VBN‡, history NN‡, Hed NNP‡, moist JJ‡, transit NN‡, american suffer∗, anesthesiology NN‡, need new∗, excessive JJ‡, inform patient∗, labbased JJ‡, histori breast∗, air NN‡
6    per ner count⊗, professor NN‡, study NN‡, University NNP‡, involved VBN‡, The DT‡, normalizeddistinctwordcount⊗, WC?, But CC‡, that IN‡, said VBD‡, National NNP‡, School NNP‡, not RB‡, funded VBN‡, about IN‡
7    Not RB‡, But CC‡, WC?, differ?, There EX‡, The DT‡, Than IN‡, Are VBP‡, normalizeddistinctwordcount⊗, Many JJ‡, For IN‡, Year NN‡, That DT‡, University NNP‡, Often RB‡, Better JJR‡
8    Not RB‡, differ?, are VBP‡, negate?, radiotherapy NN‡, Alessandro NNP‡, magnet reson∗, CITATION NNP‡, twice week∗, reson imag∗, temperature NN‡, outcom studi∗, resonance NN‡, cognit impair∗, welltolerated VBN‡, axis NN‡
9    Have VBP‡, Studies NNS‡, FoxNewscom NNP‡, Moisturizers NNS‡, news releas∗, result promis∗, help woman∗, control blood∗, Consumption NN‡, Healing VBG‡, Leadership NN‡, Molecular JJ‡, Melbourne NNP‡, Educated VBN‡, Obesity NNP‡, Penetrate VB‡
10   differ?, negate?, social?, tentat?, Sixltr?, detect diseas∗, news releas∗, develop research∗, lead investig∗, Collaborate VBP‡, Discovery NN‡, Tumour NN‡, media contact∗, Innovator NN‡, Resume VB‡, Exceptional JJ‡

Legend: Cr – Criterion; ? – LIWC feature; ∗ – TF-IDF feature; ‡ – POS word count; ⊗ – miscellaneous feature. Features common to all three feature sets are shown in bold.

VII. DISCUSSION

In this study, we have examined the application of a machine learning approach to automate the quality assessment process for web-based health-related information. We found that it is feasible to apply machine learning classifiers to estimate the quality of health-related articles if the classifiers are trained properly. This work is not directly comparable to existing studies, because most of them examined the quality of health information from a single-domain perspective (e.g., vaccination [43], [44], [45]; diabetic neuropathy [13]; reproductive health information [14]; nutrition coverage [15]) through a manual process and statistical analysis. We have examined articles across the entire health domain, ensuring applicability to all possible health-related categories. In this context, our work will make the manual reviewing process scalable and save manual labour and time. Our dataset will help researchers contribute to the growing field of health care research. Overall, this automated quality assessment approach may help search engines promote high-quality health information and discourage low-quality articles.

However, there are some limitations in our study. Experts from HealthNewsReview.org used three labels - 'Satisfactory', 'Not Satisfactory' and 'Not Applicable' - to characterize the 10 criteria. Cases where a criterion was impossible or unreasonable to apply to a story were rated 'Not Applicable' by the review experts. In our study, we excluded stories labelled 'Not Applicable' from our training set, as they constituted a small part of the whole corpus, and trained our classifiers on two class labels - 'Satisfactory' and 'Not Satisfactory'. As a result, we could not use all 1720 articles for each of the 10 criteria, and the dataset size varied from criterion to criterion (e.g., our dataset for criterion 1 comprised 1426 articles after removing 'Not Applicable' instances). We plan to address this shortcoming in a future study.

Another limitation is that our dataset is not large enough for a deep learning framework. We nonetheless trained a deep learning classifier on our dataset and obtained approximately 50% accuracy across all criteria. In future work, we plan to enrich our dataset to examine its feasibility from a deep learning perspective.

VIII. CONCLUSION AND FUTURE WORK

In this paper, we have applied a data mining approach to automatically assess the quality of online health articles. We have prepared a dataset comprising 1720 health-related articles extensively reviewed by a group of experts. Through a pipeline of data pre-processing steps, we refined our data and extracted 53012 features to train classifiers. We identified the best feature selection technique to select the most relevant feature subset from our feature space, and applied four different classifiers - SVM, Naive Bayes, Random Forest and EnsembleVote. For our dataset, we found SVM to be the best performer, achieving accuracy of 84% to 90% across the ten criteria. We also analyzed the 16 most correlated features for each of the ten criteria to justify the feasibility of our assessment, and found that our selected features characterize the criteria successfully. From our experimental results and analysis, it can be concluded that it is feasible to apply data mining techniques to automate the quality assessment process for online health articles. Given the richness of the dataset and the topic-independent nature of the analysis, the proposed model may serve as a universal standard for appraising the quality of online health articles (OHA) and, to some extent, counter the negative impact of misinformation dissemination.

As future work, we will further investigate this study with a deep learning approach and explore a multinomial classification formulation to evaluate health-related articles that cannot address some of the specific criteria. We also plan to conduct case studies and to develop an article recommendation system based on our model.
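The per-criterion exclusion of 'Not Applicable' stories described in the Discussion can be sketched as follows. The data layout is hypothetical (one expert label per criterion per story); only the filtering logic reflects the paper's procedure.

```python
# Each article carries one expert label per criterion (hypothetical layout).
articles = [
    {"text": "story A", "labels": {1: "Satisfactory", 2: "Not Applicable"}},
    {"text": "story B", "labels": {1: "Not Satisfactory", 2: "Satisfactory"}},
    {"text": "story C", "labels": {1: "Not Applicable", 2: "Satisfactory"}},
]

def training_set(articles, criterion):
    """Build the binary training set for one criterion, dropping articles
    whose label for that criterion is 'Not Applicable' (or missing)."""
    return [(a["text"], a["labels"][criterion])
            for a in articles
            if a["labels"].get(criterion) in ("Satisfactory", "Not Satisfactory")]

train_c1 = training_set(articles, 1)  # keeps story A and story B only
```

Because the filter is applied per criterion, each of the ten classifiers sees a different subset of the 1720 articles, which is why the dataset sizes in the paper vary across criteria.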