Named Entity Recognition Using Ensemble

This document presents an ensemble learning model for named entity recognition (NER) in the biomedical domain. The proposed model contains a dictionary-based entity identifier and a self-learning classifier. The model is evaluated on clinical text data and is shown to achieve high accuracy for NER tasks such as identifying diseases, drugs, treatments, and other medical entities. Ensemble learning combines multiple machine learning models to improve performance over single models.


www.rspsciencehub.com Volume 02 Issue 07 July 2020

Named Entity Recognition using Ensemble Learning


R. Ramachandran1, Dr. K. Arutchelvan2, R. Senthamizh Selvan3
1,2,3 Assistant Professor, Department of Computer and Information Science, Annamalai University, Tamil Nadu, India.
Abstract

Upgrading from Industry 4.0 to 5.0 provides numerous research opportunities for industrialists and researchers. This industrial revolution crosses the peak of automation in the life science domain. In this digitalized world, big data plays a key role in providing valuable insights through various analytical methods. In life science, the huge volume of available textual data contains a wide spread of valuable information. To extract the hidden information from big data, natural language processing (NLP) plays a major role. In NLP, named entity recognition (NER) is one of the key tasks and one of the biggest challenges for the research community. This paper presents the high-level architecture of NER using an ensemble learning (EL) method. The EL model contains a dictionary-based entity identifier and a self-learning classifier. The proposed model outperforms the existing LingPipe and ABNER models, reaching an F-score of 86.43.

Keywords: Natural Language Processing, Named Entity Recognition, Conditional Random Field,
Lexicon Based Approach, Ensemble Learning
1. Introduction

Life science is one of the prominent and growing domains in the business industry. The revolution of Artificial Intelligence (AI) in the pharmacy industry has provided many research and job opportunities. Plenty of applications used in the life science industry have migrated to automation. Most of the data are in textual format, which creates the biggest challenge for researchers and industrialists. For working with textual data, NLP is one of the key techniques for extracting valuable insights.

Various industries such as healthcare, education, finance and social media hold abundant information that is difficult to handle. NLP is significant for handling those sources. This paper stresses the role of NLP strategies in the biomedical field. According to Statista, it was projected in 2019 that the healthcare industry would hold 2134 exabytes of data in 2020. The data deluge in the healthcare industry, which is commonly generated by electronic healthcare records, is stored in regional languages. The stored data are in an unorganized structure, which makes it more difficult to retrieve the hidden information from the huge amount of text data. The digitalized information in clinical records is frequently stored in formal language. NLP helps researchers and industrialists communicate events and clinical concepts; however, the information remains hard to search because of the lack of technologies and tools. To overcome these difficulties, the data must be properly processed with NLP techniques. Named Entity Recognition (NER) is a key NLP task for extracting the entities of interest (e.g., disease names, medication names and lab tests) from clinical

International Research Journal on Advanced Science Hub (IRJASH) 44


narratives, thereby supporting clinical and translational research.

The paper is organized as follows: the background study of NER in the biomedical domain is presented in Section 2. The proposed architectural flow of the Lex-NER model is described in Section 3 with workflow figures. Section 4 presents the results and discussion. Section 5 concludes with the limitations of the proposed work.

2. Background Study

Named Entity Recognition (NER) is a powerful technique in NLP [1]. It is a sub-field of information retrieval. It is the task of recognizing expressions that should be classified as expressions denoting entities. Example entity tags in the clinical arena are diseases, drugs, treatments, genes, cancer, proteins and RNA [2, 3, 4]. A large part of the research in life science informatics has focused on NER. According to [5], the majority of the techniques are rule-based, although some hybrid approaches that combine machine learning with these rules have been implemented.

The authors of [6] mention Conditional Random Fields (CRF), Support Vector Machines (SVM) and Hidden Markov Models (HMM) as the usual machine learning methods currently applied to NER tasks in the clinical domain. The latest papers focus on deep learning approaches based on recurrent neural networks (RNNs), such as Long Short-Term Memory (LSTM) [7] and Gated Recurrent Units (GRU) [8]. A common pattern is to combine the RNN with a statistical method on top of the recurrent layers, which ensures that the optimal sequence of tags over the whole sentence is obtained [9]. CRF is the most commonly used statistical method in this hybrid approach. The authors of [9] combined an RNN with a CRF. Because of the challenges listed below, clinical NER efforts obtain lower performance measures: the best F1 score obtained by [9] is 91.32%, compared with similar experiments on corpora in non-specialized fields, where the authors of [10] recently obtained an F1 score of 93% on the CoNLL 2003 corpus.

Firstly, the data available to researchers in the biomedical field is limited, mostly because of patient privacy and confidentiality requirements. The available annotated databases are usually insufficient for training a model for the named entity recognition task [6]. Also, clinical texts are written in a particular style, different from ordinary language. They contain many incomplete sentences and informal grammar, and are full of misspellings and non-standard shorthand, abbreviations and acronyms. Also, medicine is a rapidly expanding field, with a huge number of studies adding to a continually growing number of clinical concepts. This makes it extremely hard to stay up to date. Moreover, concepts in medicine often carry meaning that depends on their context, which requires NER models to retain word-context information during training. Another typical feature is that clinical language is characterized by long phrases containing special characters and dashes.

Most of the studies are performed on English corpora. The second most popular language is Chinese. Research in other languages is clearly lacking. The term natural language is used to describe any language used by humans, to distinguish it from programming languages and data representation languages, which are used by computers and described as artificial [11, 12]. The term natural language processing (NLP) describes computational techniques that process spoken and written human language [13]. Natural language processing includes data preprocessing techniques such as data cleaning, tokenization and normalization (stemming, lemmatization or other forms of normalization). Preparing the text requires choosing the right tools; however, it helps improve the accuracy of subsequent NLP tasks. Other NLP tasks focus on extracting statistical features such as term frequency and inverse document frequency, or linguistic features including Part of Speech (POS) tagging. NLP techniques are tools for accomplishing a higher-level task. Information Extraction (IE), which involves searching for relevant information in documents, is among the most commonly applied tasks.

NER is a stage of Information Extraction. It is one of the key NLP tasks that help convert unstructured text into computer-readable structured data [13]. NER refers to the task of recognizing expressions

denoting entities (named entities such as diseases, drugs or people's names) in free-text documents [14]. NER can be tackled with many methods, which can be divided into several groups [15]: the dictionary-based approach, the rule-based approach, the statistical approach, the deep learning approach and the hybrid approach. The author performed the NER task on the GENIA corpus. GENIA is commonly used by researchers both as a dictionary and as a base corpus for the NER task. The NER model is available in various versions and formats. The authors used a corpus that comprises 1001 abstracts from the MEDLINE database, with a taxonomy of 30 biologically relevant classes.

3. Materials and Methods

Lex-NER is a hybrid NER framework that is trained and tested on PubMed abstracts. The proposed work aims to build a hybrid model that combines a string matcher and a Machine Learning (ML) model to produce better accuracy. The ML model incorporates a phase known as human-in-the-loop, which is used to increase the accuracy of the ML model: domain experts evaluate the results of the ML model and update the training dataset.

[Figure 1: block diagram with the boxes UMLS Dictionary, Web Resources, Lex-NER, NLP Component, Data Preparation and ML Model]
Fig. 1. High-level Architecture of Modules in NER Processing

Figure 1 presents the high-level architecture of the modules in NER processing. The proposed framework is composed of the following four modules.

(1) Identifying the abstracts based on the keywords
(2) NLP module, which includes the text processing
(3) Data preparation for training the machine learning model
(4) Machine Learning module

The first module identifies the abstracts from the PubMed articles. For this experiment, the search has been limited to certain keywords, such as drug names, diseases and symptoms. Based on the keywords, the documents are classified. Around 531 documents are collected for the proposed work.

The second module is the engine of the proposed work. It contains the Lex-NER component, the dictionary component and the NLP component. The dictionary component is built from the UMLS dictionary. UMLS is a powerful database that contains around 130 attributes of different tokens. The texts in the documents are processed using the basic NLP component. Figure 2 presents the workflow of the Lex-NER component, including the common techniques of the NLP component. The important keywords are extracted and matched using the string matcher. The matched word from the UMLS dictionary is then passed into the Lex-NER component.

[Figure 2: NLP Component (Stemming, Tokenization, Lemmatization, POS Tagging); Lex-NER (Abbreviation Extractor, Entity Tagger, Generating Training Dataset); CRF Model; Annotated Texts]
Fig. 2. Workflow of Lex-NER

The third module is data preparation for training the ML model. The training dataset is generated and stored in JSON format, a light-weight format for training the model with huge datasets. An example of the training dataset is given below.
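The string-matching and data-preparation steps described above might be sketched as follows. This is a minimal illustration, not the authors' implementation: the three-entry `TOY_DICTIONARY` is a hypothetical stand-in for the UMLS dictionary, the function names are invented for the example, and matching is assumed to be a plain case-insensitive whole-word search. The two sentences are taken from the paper's own training example, and each record pairs a text with (start, end, label) character offsets before being serialized as JSON.

```python
import json
import re

# Hypothetical stand-in for the UMLS dictionary: surface form -> entity label.
TOY_DICTIONARY = {
    "acyclovir": "DRUG",
    "herpes labialis": "DISEASE",
    "hyperhidrosis": "DISEASE",
}


def match_entities(text, dictionary):
    """Return sorted (start, end, label) spans for whole-word dictionary hits."""
    spans = []
    for term, label in dictionary.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for hit in re.finditer(pattern, text, flags=re.IGNORECASE):
            spans.append((hit.start(), hit.end(), label))
    return sorted(spans)


def build_training_records(sentences, dictionary):
    """Pair each sentence with its spans in the (text, {'entities': [...]}) shape."""
    return [(s, {"entities": match_entities(s, dictionary)}) for s in sentences]


sentences = [
    "Kreyden, O.P. Iontophoresis for palmoplantar hyperhidrosis. \n",
    "acyclovir for the episodic treatment of herpes labialis: a randomized, double-blind, \n",
]
records = build_training_records(sentences, TOY_DICTIONARY)

# JSON has no tuple type, so the spans are written out as lists.
training_json = json.dumps(records, indent=2)
print(training_json)
```

With this toy dictionary the second sentence also receives a DISEASE span for "herpes labialis", whereas the paper's example lists only the DRUG span for that sentence; what a real matcher returns depends on the dictionary contents.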
[('caineiontophoresis system for topical anesthesia in adults and children: a randomized, \n', {'entities': [(20, 26, 'DRUG'), (31, 38, 'ROUTE')]}),
 ('caineiontophoresis system for topical anesthesia in adults and children: a randomized, \n', {'entities': [(20, 26, 'DRUG'), (31, 38, 'ROUTE')]}),
 ('Kreyden, O.P. Iontophoresis for palmoplantar hyperhidrosis. \n', {'entities': [(45, 58, 'DISEASE')]}),
 (' Dermal, subdermal, and systemic concentrations of granisetron \n', {'entities': [(51, 62, 'DRUG')]}),
 ('Morrel, E.M., Spruance, S.L. & Goldberg, D.I. Topical iontophoretic administration of \n', {'entities': [(46, 53, 'ROUTE')]}),
 ('acyclovir for the episodic treatment of herpes labialis: a randomized, double-blind, \n', {'entities': [(0, 9, 'DRUG')]})]

In the proposed work, the CRF (Conditional Random Field) based ML model is trained on the annotated texts. Equation (1) states the formula of the CRF model. $y$ is the hidden state and $x$ is the entities observed by the string matcher. $f_k(y_t, y_{t-1}, X_t)$ denotes the features of the dataset and $\theta_k$ the weight of feature $f_k$; $Z(x)$ is the normalization term. The weight of each feature is calculated by maximum likelihood estimation. The features are set by the developer.

$$p(y \mid x) = \frac{1}{Z(x)} \exp\left\{ \sum_{t=1}^{T} \sum_{k=1}^{K} \theta_k f_k(y_t, y_{t-1}, X_t) \right\} \quad (1)$$

Table 1 depicts the results of the proposed work.

Table 1. Comparative Results of Lex-NER

Techniques   Precision   Recall   F-Score
LingPipe     72.95       88.49    79.97
ABNER        86.93       51.49    64.88
Lex-NER      88.66       84.32    86.43

[Figure 3: bar chart "Comparative Study" of Precision, Recall and F-Score (axis 0 to 100) for LingPipe, ABNER and Lex-NER]
Fig. 3. Comparative Study of NER methods

The comparative study of the proposed Lex-NER model with LingPipe and ABNER is presented in Table 1. From the table it is evident that the Lex-NER model outperforms the existing models on F-score (86.43 against 79.97 for LingPipe and 64.88 for ABNER), although LingPipe attains higher recall (88.49 against 84.32). Figure 3 depicts the comparative study graphically.

Conclusions

The proposed hybrid NER model outperforms the existing LingPipe and ABNER models. The state of the art examined and presented in the background study provides solid knowledge for NER. The model comprises the string matcher and the CRF model. The UMLS database is used to identify the entities with the phrase matcher. The matched words are annotated with the entity and passed into the model. The CRF model, which works on the forward parsing method, produced good accuracy. The model is trained on the 531 abstracts extracted from the PubMed database. In future, the work will be extended by increasing the number of abstracts and the number of features.

References
[1] S. M. Meystre, G. K. Savova, K. C. Kipper-Schuler and J. F. Hurdle, "Extracting information from textual documents in the electronic health record: A review of recent research", pp. 128–144, ISSN 0943-4747.
[2] A. B. Abacha and P. Zweigenbaum, "Medical Entity Recognition: A Comparison of Semantic and Statistical Methods", Proceedings of BioNLP 2011 Workshop, BioNLP'11, pp. 56–64, 2011.
[3] V. Hatzivassiloglou, P. A. Duboué and A. Rzhetsky, "Disambiguating proteins, genes,

and RNA in text: A machine learning approach", Suppl 1:S97–106, ISSN 1367-4803.
[4] J. Song, B. Jo, C. Y. Park, J.-D. Kim and Y.-S. Kim, "Comparison of named entity recognition methodologies in biomedical documents", BioMedical Engineering OnLine, Vol. 17(2):158, 2018.
[5] S. Pradhan, N. Elhadad, B. R. South, D. Martinez, L. Christensen, A. Vogel, H. Suominen, W. W. Chapman and G. Savova, "Evaluating the state of the art in disorder recognition and normalization of the clinical narrative", Vol. 22(1), pp. 143–154, ISSN 1527-974X.
[6] J. Zhang, J. Li, S. Wang, Y. Zhang, Y. Cao, L. Hou and X. Li, "Category Multi-representation: A Unified Solution for Named Entity Recognition in Clinical Texts", Advances in Knowledge Discovery and Data Mining, Springer, pp. 275–287, 2018.
[7] Y. Qin and Y. Zeng, "Research of Clinical Named Entity Recognition Based on Bi-LSTM-CRF", Journal of Shanghai Jiaotong University (Science), Vol. 23(3), pp. 392–397, 2018.
[8] A. P. Quimbaya, A. S. Múnera, R. A. González Rivera, J. C. Daza Rodríguez, O. M. Muñoz Velandia, A. A. Garcia Peña and C. Labbé, "Named Entity Recognition Over Electronic Health Records Through a Combined Dictionary-based Approach", Procedia Computer Science, Vol. 100, pp. 55–61, 2016.
[9] J. Qiu, Q. Wang, Y. Zhou, T. Ruan and J. Gao, "Fast and Accurate Recognition of Chinese Clinical Named Entities with Residual Dilated Convolutions", Proceedings of IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 935–942, 2018.
[10] A. Baevski, S. Edunov, Y. Liu, L. Zettlemoyer and M. Auli, "Cloze-driven Pretraining of Self-attention Networks", http://arxiv.org/abs/1903.07785.
[11] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami and C. Dyer, "Neural Architectures for Named Entity Recognition", http://arxiv.org/abs/1603.01360.
[12] D. Jurafsky and J. H. Martin, "Speech and Language Processing", 2nd Edition, Prentice Hall, ISBN 978-0-13-187321-6.
[13] Y. Sasaki, Y. Tsuruoka, J. McNaught and S. Ananiadou, "How to make the most of NE dictionaries in statistical NER", Vol. 9(11):S5, ISSN 1471-2105.
[14] M. A. Hearst, "Untangling Text Data Mining", Proceedings of 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, ACL'99, pp. 3–10, 1999.
[15] M. Allahyari, S. Pouriyeh, M. Assefi, S. Safaei, E. Trippe, J. B. Gutierrez and K. Kochut, "A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques", 2017.
