CEP talk-IITP

Uploaded by

ram Dindu

CEP course on Deep Learning for Natural Language Processing

Text Mining in
Biomedical and Healthcare Domain

Dr. Shweta Yadav


Postdoctoral Research Fellow
Wright State University, USA
http://shwetanlp.github.io/

1
Overview

1 Patient Data De-Identification (Entity Extraction from EMR)
2 Medical Sentiment Analysis
3 Pharmacovigilance Mining from Social Media


4
Introduction: Types of Biomedical Text

Types of Biomedical Text and Its Sources

01 Biomedical Literature (MEDLINE, PubMed, PMC)

02 Electronic Medical Record

03 Social Media (Twitter, Medical Discussion Forums, Blogs)


4
BIOMEDICAL LITERATURE

5 Image credit: Wikipedia


ELECTRONIC MEDICAL RECORDS

6
Image credit: Wikipedia
SOCIAL MEDIA

7
APPLICATION
Biomedical Literature

Gene-disease associations Gene cluster identification

Protein-disease associations Protein interactions


10
Image credit: [30]
11
Image credit: [30]
12
Image credit: [30]
How do we maintain unstructured biomedical and clinical information?

13
DOMAIN KNOWLEDGE BASE

14
Image credit: [31]
15
16
Image credit: [32]
17
Image credit: [32]
Structured Data

Text Mining
Exponential
Unstructured Data

18
2 Entity Extraction
Patient Data De-Identification
(Electronic Medical Records)
19
Problem Statement

01 Raw Electronic Medical Record
Mr. John is a 60 year old white male with a history of diabetes mellitus who underwent a surgery on November 15. He was transferred to Valwatnal Community Hospital for endoscopy.

02 De-Identified Medical Record
Mr. <XXX_Patient> is a <XXX_AGE> old white male with a history of diabetes mellitus who underwent a surgery on <XXX_DATE>. He was transferred to <XXX_Hospital> for endoscopy.
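The mapping from a raw record to its de-identified form can be sketched as a span-replacement step. This is only an illustration of the output format shown on the slide, not the talk's system: the `(start, end, category)` annotation format and the `deidentify` function are assumptions, and in practice the spans would come from an entity extraction model rather than being written by hand.

```python
from typing import List, Tuple

# Hypothetical annotation format: (start_offset, end_offset, PHI_category).
Span = Tuple[int, int, str]


def deidentify(text: str, spans: List[Span]) -> str:
    """Replace each annotated PHI span with an <XXX_CATEGORY> placeholder."""
    pieces = []
    cursor = 0
    for start, end, category in sorted(spans):
        pieces.append(text[cursor:start])    # copy the non-PHI text before the span
        pieces.append(f"<XXX_{category}>")   # mask the PHI mention itself
        cursor = end
    pieces.append(text[cursor:])             # copy whatever follows the last span
    return "".join(pieces)


raw = "Mr. John underwent a surgery on November 15."
name_start = raw.index("John")
date_start = raw.index("November 15")
spans = [
    (name_start, name_start + len("John"), "Patient"),
    (date_start, date_start + len("November 15"), "DATE"),
]
masked = deidentify(raw, spans)
```

Keeping extraction (finding the spans) separate from masking (replacing them) means the same masking step can be reused whichever tagger produces the spans.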
Annotated PHI categories: Date, Patient Name, Hospital Name, Physician Name

Admission Date :
06/07/1999
Report Status :
Signed
Discharge Date :
06/13/1999
HISTORY OF PRESENT ILLNESS :
Essentially , Mr. Cornea is a 60 year old male who noted the onset of dark urine during early January .
He underwent CT and ERCP at the Lisonatemi Faylandsburgnic, Community Hospital with a stent placement and resolution of jaundice .
He underwent an ECHO and endoscopy at Ingree and Ot of Weamanshy Medical Center on April 28 .
He was found to have a large , bulging , extrinsic mass in the lesser curvature of his stomach .
Fine needle aspiration showed atypical cells , positively reactive mesothelial cells .
MEDICATIONS PRIOR TO ADMISSION :
Hydrochlorothiazide 25 mg q.d. , Clonidine 0.1 mg p.o. q.d. , baclofen 5 mg p.o. t.i.d.
HOSPITAL COURSE :
Basically , patient underwent a subtotal gastrectomy on the 7th of June by Dr. Kotefooksshuff .
Motivation

▪ Automatically augmenting patient databases
▪ Clinical records cannot be made available for research (even research on de-identification itself) unless they have first been de-identified

22
Challenges

▪ Inter PHI ambiguity: PHI terms overlap with non-PHI terms.
Brown (Doctor name) vs. brown (non-PHI)
▪ Intra PHI ambiguity: One candidate word can belong to two or more different PHI categories.
August (Patient name) vs. August (Date)
▪ Lexical variation: Entities vary in surface form, for example ‘50 yo m’, ‘50 yo M’, ‘55 YO MALE’.
▪ Terminological variation and irregularities: For example, ‘3041023MARY’ combines two different PHI categories: ‘3041023’, which represents the MEDICALRECORD, and ‘MARY’, which belongs to another PHI category.
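Two of these challenges can be made concrete with a small pre-processing sketch. Both regular expressions and both function names below are assumptions chosen to cover only the examples on the slide (‘3041023MARY’ and the ‘50 yo m’ variants). Real clinical text would need much broader patterns.

```python
import re

# Assumed pattern: a run of digits immediately followed by a run of letters.
FUSED = re.compile(r"^(\d+)([A-Za-z]+)$")

# Assumed pattern for age/sex shorthand such as '50 yo m' or '55 YO MALE'.
AGE_SEX = re.compile(r"^(\d{1,3})\s*yo\s*(m|male|f|female)$", re.IGNORECASE)


def split_fused_token(token):
    """Split a token such as '3041023MARY' into its numeric and alphabetic parts."""
    match = FUSED.match(token)
    if match is None:
        return [token]                       # nothing to split
    return [match.group(1), match.group(2)]


def normalize_age_sex(mention):
    """Map lexical variants of an age/sex mention to one (age, sex) tuple."""
    match = AGE_SEX.match(mention.strip())
    if match is None:
        return None                          # not an age/sex shorthand
    age = int(match.group(1))
    sex = match.group(2)[0].upper()          # 'm', 'M', 'MALE' all become 'M'
    return age, sex


parts = split_fused_token("3041023MARY")
variants = [normalize_age_sex(m) for m in ["50 yo m", "50 yo M", "55 YO MALE"]]
```

Hand-written rules like these do not address the ambiguity cases (Brown vs. brown, August as a name vs. a date), which depend on context. Those are the cases the sequence models later in the talk are meant to handle.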
System | Algorithm | Lexical | Syntactic | Semantics
Guo et al. [7] | SVM | Word, Capitalization, Prefixes/Suffixes, Word Length, Numbers, Regular Expression | POS (Word) | Entities extracted by ANNIE (Doc, Hosp, Loc)
Szarvas et al. [8] | Decision Tree | Word Length, Capitalization, Numbers, Regular Expression, Token Frequency | None | Dictionary Terms (Names, US Loc, Counties, Cities, Diseases, Non-PHI), Section Headings
Uzuner et al. [9] | SVM | Word, Lexical Bigrams, Capitalization, Punctuation, Numbers, Word Length | POS (Word + 2 surrounding) and word | MeSH ID, Dictionary Terms (Names, US locations, hospital name)
Wellner et al. [10] | CRF | Word Unigram/Bigram, Surrounding words, Prefixes/Suffixes, Capitalization, Numbers, Regular Expression | None | Dictionary Terms (US states, months, General English Terms)
Aramaki et al. [11] | CRF | Word, Surrounding words, Capitalization, Word Length, Regular Expression, Sentence Position & Length | POS (Word + 2 surrounding words) | Dictionary Terms (Names and Loc)
Feature Engineering

Bag-of-words
Part-of-speech (POS) tags
POS tag of current and surrounding token
Contextual features
Sentence information
Affixes
Orthographic features
Word shapes
Section information
Task specific features
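A few of the listed feature types (affixes, orthographic features, word shapes, contextual features) can be sketched as a per-token feature dictionary of the kind typically fed to a CRF or SVM. The function names, the feature keys, and the affix length of 3 are illustrative assumptions, not the feature set used in the talk.

```python
def word_shape(token):
    """Map characters to X (upper), x (lower), d (digit); keep other characters."""
    shape = []
    for ch in token:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)
    return "".join(shape)


def token_features(tokens, i, affix_len=3):
    """Build a feature dictionary for tokens[i] using the token and its neighbours."""
    token = tokens[i]
    return {
        "word": token.lower(),
        "prefix": token[:affix_len],                      # affix features
        "suffix": token[-affix_len:],
        "is_capitalized": token[:1].isupper(),            # orthographic features
        "is_all_caps": token.isupper(),
        "has_digit": any(ch.isdigit() for ch in token),
        "shape": word_shape(token),                       # word shape
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",               # context
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


tokens = ["Admission", "Date", ":", "06/07/1999"]
features = token_features(tokens, 3)
```

POS tags, section information, and task-specific features would be added to the same dictionary, but they need an external tagger or document structure and are left out here.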

25
Dataset (i2b2 2014)

26
PROPOSED APPROACH (Elman RNN)

27
PROPOSED APPROACH (Jordan RNN)
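The slides give only the names of the two architectures, so the following is a sketch of the textbook difference between them, not the exact model from the talk. An Elman network feeds the previous hidden state back into the hidden layer, while a Jordan network feeds back the previous output. Biases are omitted, and all weights, dimensions, and function names are toy assumptions.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]


def sigmoid(vector):
    return [1.0 / (1.0 + math.exp(-v)) for v in vector]


def softmax(vector):
    peak = max(vector)                       # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in vector]
    total = sum(exps)
    return [e / total for e in exps]


def elman_step(x, h_prev, W_xh, W_hh, W_hy):
    """Elman: the previous hidden state feeds back into the hidden layer."""
    h = sigmoid(add(matvec(W_xh, x), matvec(W_hh, h_prev)))
    return h, softmax(matvec(W_hy, h))


def jordan_step(x, y_prev, W_xh, W_yh, W_hy):
    """Jordan: the previous output feeds back into the hidden layer."""
    h = sigmoid(add(matvec(W_xh, x), matvec(W_yh, y_prev)))
    return h, softmax(matvec(W_hy, h))


# Toy sizes (assumed): 2-dimensional inputs, 2 hidden units, 3 output labels.
W_xh = [[0.1, -0.2], [0.3, 0.4]]
W_hh = [[0.5, 0.0], [0.0, 0.5]]
W_yh = [[0.2, -0.1, 0.0], [0.0, 0.1, 0.3]]
W_hy = [[0.3, -0.3], [0.1, 0.2], [-0.2, 0.4]]
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

h = [0.0, 0.0]
elman_outputs = []
for x in sequence:
    h, y = elman_step(x, h, W_xh, W_hh, W_hy)
    elman_outputs.append(y)

y_prev = [0.0, 0.0, 0.0]
jordan_outputs = []
for x in sequence:
    _, y_prev = jordan_step(x, y_prev, W_xh, W_yh, W_hy)
    jordan_outputs.append(y_prev)
```

With zero initial feedback the two networks agree on the first token and diverge afterwards, because from the second step on they condition on different summaries of the past.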

28
29
Results with PSO

PHI Category | CRF | CRF+PSO | Elman | Jordan
PATIENT | 58.95 | 59.26 | 88.89 | 91.30
DOCTOR | 79.08 | 81.02 | 83.26 | 85.84
HOSPITAL | 60.39 | 62.51 | 78.03 | 76.41
LOCATION | 55.56 | 55.13 | 47.83 | 61.90
PHONE | 78.26 | 78.89 | 88.00 | 80.00
ID | 74.44 | 75.41 | 90.31 | 91.66
DATE | 94.69 | 95.14 | 96.74 | 96.83
OVERALL | 81.39 | 82.58 | 89.22 | 90.18

30
3 Medical Sentiment Analysis
Social-media Texts (Medical Blogs)

31
32
Sample Medical Blog-post

33
Who is Talking?

34
Why Not Sentiment Analysis?

35
Problem Statement

▷ To prioritize user blog posts along two medical sentiment aspects:

1. Status of health condition
2. Outcome of treatment

36
Medical Condition


37
Medication

38
Proposed Approach (Single Task Learning)

39
Results

Task 1: Medical Condition, Task 2: Medication

Models | Task 1 Precision | Task 1 Recall | Task 1 F-Score | Task 2 Precision | Task 2 Recall | Task 2 F-Score
Baseline 1: SVM | 0.42 | 0.49 | 0.43 | 0.74 | 0.76 | 0.75
Baseline 2: Random Forest | 0.45 | 0.48 | 0.46 | 0.72 | 0.73 | 0.73
Baseline 3: MLP | 0.41 | 0.43 | 0.46 | 0.74 | 0.75 | 0.74
Proposed Approach (CNN) | 0.68 | 0.60 | 0.63 | 0.86 | 0.77 | 0.82

40
Method 2: Multi-task Learning

41
Multi-tasking in NLP

Image Credit:[35]
MTL methods for Deep Learning
Hard parameter sharing Soft parameter sharing

Image Credit:[36]
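Hard parameter sharing can be sketched as one shared layer followed by a separate output head per task. Soft parameter sharing instead gives each task its own model and regularizes the parameters to stay close to each other. The class name, the weights, and the label counts (3 for medical condition, 2 for medication) below are toy assumptions, not the configuration used in the talk.

```python
import math


def linear(weights, vector):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def tanh_layer(weights, vector):
    return [math.tanh(v) for v in linear(weights, vector)]


class HardSharingModel:
    """One shared hidden layer, one task-specific output layer per task."""

    def __init__(self, shared_weights, head_weights):
        self.shared_weights = shared_weights   # used (and updated) by every task
        self.head_weights = head_weights       # dict: task name -> weight matrix

    def forward(self, task, features):
        shared = tanh_layer(self.shared_weights, features)   # shared representation
        return linear(self.head_weights[task], shared)       # task-specific scores


model = HardSharingModel(
    shared_weights=[[0.5, -0.5], [0.25, 0.75]],
    head_weights={
        "medical_condition": [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]],  # 3 labels (assumed)
        "medication": [[0.5, 0.5], [-0.5, 0.5]],                     # 2 labels (assumed)
    },
)

x = [1.0, 2.0]
condition_scores = model.forward("medical_condition", x)
medication_scores = model.forward("medication", x)
```

In training, gradients from both tasks would update `shared_weights`, while each head is updated only by its own task's loss.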
Benefits of MTL

▷ Regularization: MTL reduces the risk of overfitting as well as the Rademacher complexity of the model.

▷ Representation bias: the model is biased to prefer representations that other tasks also prefer. This also helps the model generalize to new tasks in the future.

▷ Attention focusing: the model focuses its attention on the features that actually matter, because other tasks provide additional evidence for the relevance or irrelevance of those features.
Feature Space

Shared Private Model Goal

45
Adversarial Learning

Image Credit: [37]


Adversarial Net Framework

Image Credit: [37]
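For reference, the two-player objective behind the adversarial net framework, as given in the original GAN formulation, can be written as below. The symbols are the standard notation rather than anything taken from the slides: $D$ is the discriminator, $G$ the generator, $p_{\text{data}}$ the data distribution, and $p_z$ the noise prior.

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)} \big[ \log D(x) \big]
  + \mathbb{E}_{z \sim p_{z}(z)} \big[ \log \big( 1 - D(G(z)) \big) \big]
```

The discriminator is trained to tell real samples from generated ones, while the generator is trained to make that distinction harder.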


Proposed Approach (Multi Task Learning)

48
Results

Task 1: Medical Condition, Task 2: Medication

Models | Task 1 Precision | Task 1 Recall | Task 1 F-Score | Task 2 Precision | Task 2 Recall | Task 2 F-Score
Baseline 1: MT-LSTM | 63.40 | 61.38 | 62.37 | 88.23 | 77.38 | 82.45
Baseline 2: ST-LSTM | 63.19 | 62.47 | 62.83 | 85.94 | 77.46 | 81.48
Proposed Approach | 66.82 | 63.61 | 65.18 | 85.83 | 81.79 | 83.76
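The F-scores in this table can be checked against the usual balanced F-score, the harmonic mean of precision and recall, F = 2PR / (P + R). The slides do not say which F-measure or averaging scheme was used, so treating it as this formula is an assumption. The numbers are copied from the table, and the function name is illustrative.

```python
def f_score(precision, recall):
    """Balanced F-score (F1): the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-score) per model: Task 1 first, then Task 2.
rows = {
    "MT-LSTM": [(63.40, 61.38, 62.37), (88.23, 77.38, 82.45)],
    "ST-LSTM": [(63.19, 62.47, 62.83), (85.94, 77.46, 81.48)],
    "Proposed Approach": [(66.82, 63.61, 65.18), (85.83, 81.79, 83.76)],
}

# Largest absolute gap between the recomputed and the reported F-score.
max_gap = max(
    abs(f_score(p, r) - reported)
    for tasks in rows.values()
    for p, r, reported in tasks
)
```

Under this formula the recomputed values agree with the reported ones to within rounding, which suggests the scores in this table are harmonic means of the listed precision and recall.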

49
Error Analysis

50
Approach-3

51
Result

52
Disease wise Analysis (Medical Condition)

F1: ‘CNN+Emotion (coarse)’, F2: ‘CNN+Emotion (fine)’, F3: ‘CNN+Sentiment word feature’, F4: ‘CNN+Textual Content Feature’, F5: ‘Personality’, F6: ‘Sarcasm’
Disease wise Analysis (Medication)

F1: ‘CNN+Emotion (coarse)’, F2: ‘CNN+Emotion (fine)’, F3: ‘CNN+Sentiment word feature’, F4: ‘CNN+Textual Content Feature’, F5: ‘Personality’, F6: ‘Sarcasm’
4 Pharmacovigilance Mining
Social-media Texts (Medical Blogs)

55
Slide credit: [33] 56
Slide credit: [33] 57
Slide credit: [33] 58
How Did It Begin?

“Secrets of Seroxat”

BBC documentary: Panorama, broadcast in 2002

50-minute programme about paroxetine


Crowd Opinion

The programme attracted a record response, including some:

65,000 telephone calls

124,000 website hits

1,374 emails
FIRST STUDY EXPLORING CROWD INTELLIGENCE

Paroxetine, Panorama and user reporting of ADRs: Consumer intelligence matters in clinical practice and post-marketing drug surveillance

Medawar, C., Herxheimer, A., Bell, A., & Jofre, S. (2002). Paroxetine, Panorama and user reporting of ADRs:
Consumer intelligence matters in clinical practice and post‐marketing drug surveillance. International Journal of Risk &
Safety in Medicine, 15(3, 4), 161-169.
Crowd Intelligence Matters !!

● “Dr Healy confirmed what I already knew. My husband shot himself after 4 days on Seroxat
never having been suicidal in his life. . .”

● “I took Seroxat 2 years ago because I have a breathing condition called ‘chronic
hyperventilation syndrome’ which is exacerbated by stress and anxiety. I have never been
depressed or had suicidal feelings. However I was prescribed Seroxat to reduce stress &
anxiety. A day or two after taking the pills I (went) into a severe state of mental turmoil. I
felt really suicidal. It was so severe that all I did was stay in bed for two or three days.
Fortunately I recognised Seroxat and stopped taking it immediately.”
Image credit: [33] 63
Problem Statement

64
Proposed Approach

65
Results

Model | Twitter | CADEC | MEDLINE
ST-BLSTM | 57.3 | 51.1 | 71.91
ST-CNN | 67.1 | 42.0 | 70.17
CRNN (Huynh et al., 2016) | 64.9 | 48.2 | 75.5
RCNN (Huynh et al., 2016) | 63.6 | 43.6 | 74.0
MT-BLSTM (Chowdhury et al., 2018) | 63.19 | 57.62 | 74.0
MT-BLSTM-Attention (Chowdhury et al., 2018) | 65.73 | 58.27 | 77.95
Proposed Approach | 69.69 | 65.58 | 82.18


66
Conclusion

● Explored the various unstructured forms of biomedical text and their applications in solving real-world problems.

● Explored a deep learning solution based on Elman and Jordan recurrent neural networks for the patient data de-identification task.

● Explored sentiment analysis in the medical domain and neural network approaches to address the task.

● Explored a unified multi-task learning framework for pharmacovigilance mining that is generic and easily adaptable to extracting pharmacovigilance information from any form of text.

67
References

1. G. Zhou, J. Zhang, J. Su, D. Shen, and C. Tan, “Recognizing names in biomedical texts: a
machine learning approach,” Bioinformatics, vol. 20, no. 7, pp. 1178–1190, 2004.
2. J.-D. Kim, T. Ohta, Y. Tsuruoka, Y. Tateisi, and N. Collier, “Introduction to the bio-entity
recognition task at jnlpba,” in Proceedings of the international joint workshop on natural
language processing in biomedicine and its applications, pp. 70–75, Association for
Computational Linguistics, 2004.
3. Park K-M, Kim S-H, Rim H-C, Hwang Y-S (2006) ME-based biomedical named entity
recognition using lexical knowledge. ACM Trans Asian Lang Inf Process (TALIP) 5(1):4–21
4. B. Settles, “Abner: an open source tool for automatically tagging genes, proteins and other
entity names in text,” Bioinformatics, vol. 21, no. 14, pp. 3191–3192, 2005.
5. Song Y, Kim E, Lee GG, Yi B-k (2004) POSBIOTM-NER in the shared task of BioNLP/NLPBA
2004. In: Proceedings of the international joint workshop on natural language processing in
biomedicine and its applications, pp 100–103

7. Yikun Guo, Robert Gaizauskas, Ian Roberts, George Demetriou, and Mark Hepple. 2006.
Identifying personal health information using support vector machines. In i2b2 workshop on
challenges in natural language processing for clinical data, pages 10–11.
8. György Szarvas, Richárd Farkas, and András Kocsor. 2006. A multilingual named entity
recognition system using boosting and C4.5 decision tree learning algorithms. In International
Conference on Discovery Science, pages 267–278. Springer.
9. Özlem Uzuner, Yuan Luo, and Peter Szolovits. 2007. Evaluating the state-of-the-art in automatic
de-identification. Journal of the American Medical Informatics Association, 14(5):550–563.
10. Ben Wellner, Matt Huyck, Scott Mardis, John Aberdeen, Alex Morgan, Leonid Peshkin, Alex
Yeh, Janet Hitzeman, and Lynette Hirschman. 2007. Rapidly retargetable approaches to
de-identification in medical records. Journal of the American Medical Informatics Association,
14(5):564–573.
11. Eiji Aramaki, Takeshi Imai, Kengo Miyo, and Kazuhiko Ohe. 2006. Automatic deidentification by
using sentence features and label consistency. In i2b2 Workshop on Challenges in Natural
Language Processing for Clinical Data, pages 10–11.
12. M. Krallinger and A. Valencia. Evaluating the detection and ranking of protein interaction
relevant articles: the biocreative challenge interaction article sub-task (ias). In Proceedings of
the Second Biocreative Challenge Evaluation Workshop, 2007.
13. C. Grover, B. Haddow, E. Klein, M. Matthews, L. A. Nielsen, R. Tobin, and X. Wang. Adapting a
relation extraction pipeline for the biocreative ii task. In Proceedings of the BioCreAtIvE II
Workshop, volume 2, 2007.
14. M. Lan, C. L. Tan, and J. Su. Feature generation and representations for protein–protein
interaction classification. Journal of biomedical informatics, 42(5):866–872, 2009.
15. A. Cohen. Automatically expanded dictionaries with exclusion rules and support vector machine
text classifiers: approaches to the biocreative 2 gn and ppi-ias tasks. In Proceedings of the
Second BioCreative Challenge Evaluation Workshop, pages 169–174, 2007.
16. G. E. Hinton, N. Srivastava, A. Krizhevsky. Improving neural networks by preventing
co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
17. W. A. Baumgartner Jr, Z. Lu. An integrated approach to concept recognition in biomedical text.
In Proceedings of the Second BioCreative Challenge Evaluation Workshop, volume 23, pages
257–71. Centro Nacional de Investigaciones Oncológicas (CNIO) Madrid, Spain, 2007
18. Hua L., Quan C. A shortest dependency path based convolutional neural network for
protein-protein relation extraction. BioMed Res. Int., 2016 (2016)
19. Choi S.-P. Extraction of protein–protein interactions (ppis) from the literature by deep
convolutional neural networks with various feature embeddings. J. Inf. Sci. (2016)
20. Y. Peng, Z. Lu. Deep learning for extracting protein-protein interactions from biomedical literature.
arXiv preprint arXiv:1706.01556.
21. Qian L., Zhou G. Tree kernel-based protein–protein interaction extraction from biomedical
literature. J. Biomed. Inform., 45 (3) (2012), pp. 535-543
22. Li L., Guo R., Jiang Z., Huang D. An approach to improve kernel-based protein–protein
interaction extraction by learning from large-scale network data. Methods, 83 (2015), pp.
44-50
23. Choi S.-P., Myaeng S.-H. Simplicity is better: revisiting single kernel ppi extraction.
Proceedings of the 23rd International Conference on Computational Linguistics, Association for
Computational Linguistics (2010), pp. 206-214

24. Nikfarjam, A., & Gonzalez, G. H. (2011). Pattern mining for extraction of mentions of adverse drug
reactions from user comments. In AMIA Annual Symposium Proceedings (Vol. 2011, p. 1019).
American Medical Informatics Association.
25. Yates, Andrew, and Nazli Goharian. "ADRTrace: detecting expected and unexpected adverse drug
reactions from user reviews on social media sites." In European Conference on Information
Retrieval, pp. 816-819. Springer, Berlin, Heidelberg, 2013.
26. Sarker, A., & Gonzalez, G. (2015). Portable automatic text classification for adverse drug reaction
detection via multi-corpus training. Journal of biomedical informatics, 53, 196-207.
27. O’Connor, K., Pimpalkhute, P., Nikfarjam, A., Ginn, R., Smith, K. L., & Gonzalez, G. (2014).
Pharmacovigilance on twitter? Mining tweets for adverse drug reactions. In AMIA annual
symposium proceedings (Vol. 2014, p. 924). American Medical Informatics Association.
28. Trung Huynh, Yulan He, Alistair Willis, and Stefan Rüger. 2016. Adverse drug reaction classification
with deep neural networks. Coling.
29. Shaika Chowdhury, Chenwei Zhang, and Philip S. Yu. 2018. Multi-task pharmacovigilance mining
from social media posts. In Proceedings of the 2018 World Wide Web Conference, WWW ’18,
pages 117–126, Republic and Canton of Geneva, Switzerland. International World Wide Web
Conferences Steering Committee.
30. Simmons, M., Singhal, A., & Lu, Z. (2016). Text mining for precision medicine: bringing structure to EHRs and
biomedical literature to understand genes and health. In Translational Biomedical Informatics (pp. 139-166).
Springer, Singapore.
31. Olivier Bodenreider. (2010, October 21). The Unified Medical Language System: What is it and how to use it?
Retrieved from
https://www.slideshare.net/roger961/the-unified-medical-language-system-what-is-it-and-how-to-use-it.
32. Kerrie Holley. (2013). Technological Trend of the Future: Next Era of Computing. Retrieved from
https://www.slideshare.net/IBMSMMA/ibm-summit-2013-16-9next-era
33. DigiMind. (2016). Pharmacovigilance & Social Media. Retrieved from
https://www.slideshare.net/digimind/pharmacovigilance-social-media
34. Liu, P., Qiu, X., & Huang, X. (2017). Adversarial multi-task learning for text classification. arXiv preprint
arXiv:1704.05742.
35. Alexandr Honchar. (2018). Multitask learning: teach your AI more to make it better. Retrieved from
https://towardsdatascience.com/multitask-learning-teach-your-ai-more-to-make-it-better-dde116c2cd40
36. Sebastian Ruder. (2017). An Overview of Multi-Task Learning In Deep Neural Networks. Retrieved from
https://ruder.io/multi-task/
37. Ian Goodfellow. (2017). Introduction to GANs, NIPS 2016. Retrieved from
https://www.youtube.com/watch?v=9JpdAg6uMXs&frags=pl%2Cwn
73
THANK YOU!

74