CEP talk-IITP
Text Mining in
Biomedical and Healthcare Domain
Overview
BIOMEDICAL LITERATURE
Image credit: Wikipedia
SOCIAL MEDIA
APPLICATION
Biomedical Literature
DOMAIN KNOWLEDGE BASE
Image credit: [31]
Image credit: [32]
Image credit: [32]
Structured Data
Text Mining
Exponential
Unstructured Data
2 Entity Extraction
Patient Data De-Identification
(Electronic Medical Records)
Problem Statement
Presentation
Basically, patient underwent a subtotal gastrectomy on the 7th of June by Dr. Kotefooksshuff.
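A minimal sketch of what de-identification does to the example sentence above: every annotated PHI span is replaced by a category tag. The character spans are looked up by hand here, and the category names DATE and DOCTOR as well as the helper names deidentify and span_of are assumptions for illustration; a real system has to predict the spans.

```python
from typing import List, Tuple

# (start, end, category) character spans; assumed to be given by an annotator.
Span = Tuple[int, int, str]

def deidentify(text: str, spans: List[Span]) -> str:
    """Replace each annotated PHI span with a bracketed category tag."""
    pieces = []
    cursor = 0
    for start, end, category in sorted(spans):
        pieces.append(text[cursor:start])   # keep the non-PHI text
        pieces.append(f"[{category}]")      # mask the PHI span
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)

def span_of(text: str, phrase: str, category: str) -> Span:
    """Hypothetical helper: locate a phrase and label it with a category."""
    start = text.index(phrase)
    return (start, start + len(phrase), category)

sentence = ("Basically, patient underwent a subtotal gastrectomy "
            "on the 7th of June by Dr. Kotefooksshuff.")
spans = [
    span_of(sentence, "7th of June", "DATE"),
    span_of(sentence, "Kotefooksshuff", "DOCTOR"),
]
masked = deidentify(sentence, spans)
print(masked)
```

The clinical content (the procedure) survives, while the date and the doctor's name no longer appear in the text.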
Motivation
Challenges
▪ Inter-PHI ambiguity: PHI terms overlap with non-PHI terms.
Brown (doctor name) vs. brown (non-PHI)
▪ Intra-PHI ambiguity: one candidate word can belong to two or more
different PHI categories.
August (patient name) vs. August (date)
▪ Lexical variation: the same kind of entity appears in several surface forms,
for example ‘50 yo m’, ‘50 yo M’, ‘55 YO MALE’
▪ Terminological variation and irregularities: for example, ‘3041023MARY’
combines two different PHI categories: ‘3041023’, which
represents the MEDICALRECORD, and ‘MARY’, which belongs to another PHI
category
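The last two challenges can be illustrated with a small rule-based sketch. Both regular expressions and the function names split_composite and normalize_age_sex are assumptions built only around the examples on this slide; they are not the patterns used by any of the systems discussed in the talk.

```python
import re

# Assumed pattern: a run of digits glued to a run of capital letters,
# as in the '3041023MARY' example.
COMPOSITE = re.compile(r"^(\d+)([A-Z]+)$")

def split_composite(token: str):
    """Split a digits+letters token into its two candidate PHI parts."""
    match = COMPOSITE.match(token)
    return [match.group(1), match.group(2)] if match else [token]

# Assumed pattern for age/sex shorthand such as '50 yo m' or '55 YO MALE'.
AGE_SEX = re.compile(r"^(\d+)\s*yo\s*(m|male|f|female)$", re.IGNORECASE)

def normalize_age_sex(text: str):
    """Map lexical variants of the age/sex shorthand to one canonical form."""
    match = AGE_SEX.match(text.strip())
    if match is None:
        return None
    sex = match.group(2)[0].upper()
    return f"{match.group(1)} yo {sex}"

print(split_composite("3041023MARY"))
print(normalize_age_sex("55 YO MALE"))
```

Hand-written rules like these do not resolve the inter-PHI and intra-PHI ambiguities (Brown vs. brown, August as name vs. date), which depend on context rather than on the token's form.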
System | Algorithm | Lexical | Syntactic | Semantic
Guo et al. [7] | SVM | Word, Capitalization, Prefixes/Suffixes, Word Length, Numbers, Regular Expression | POS (Word) | Entities extracted by ANNIE (Doc, Hosp, Loc)
Szarvas et al. [8] | Decision Tree | Word Length, Capitalization, Numbers, Regular Expression, Token Frequency | None | Dictionary Terms (Names, US Loc, Counties, Cities, Diseases, Non-PHI), Section Headings
Uzuner et al. [9] | SVM | Word, Lexical Bigrams, Capitalization, Punctuation, Numbers, Word Length | POS (Word + 2 surrounding) | MeSH ID, Dictionary Terms (Names, US and world locations, hospital name)
Bag-of-words
Part-of-speech (POS) tags
POS tag of current and surrounding token
Contextual features
Sentence information
Affixes
Orthographic features
Word shapes
Section information
Task specific features
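Several of the feature types listed above can be computed per token with a few lines of code. This is a sketch under assumptions: the feature names, the three-character affix length, and the one-token context window are illustrative choices, and the POS tags are assumed to come from an external tagger.

```python
import re
from typing import Dict, List

def word_shape(token: str) -> str:
    """Map characters to X/x/d and collapse repeats, e.g. 'Brown' -> 'Xx'."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"\d", "d", shape)
    return re.sub(r"(.)\1+", r"\1", shape)

def token_features(tokens: List[str], pos_tags: List[str], i: int) -> Dict[str, object]:
    """Build a feature dictionary for the token at position i."""
    token = tokens[i]
    last = len(tokens) - 1
    return {
        "word": token.lower(),                       # bag-of-words feature
        "prefix3": token[:3],                        # affixes
        "suffix3": token[-3:],
        "is_capitalized": token[:1].isupper(),       # orthographic features
        "has_digit": any(c.isdigit() for c in token),
        "shape": word_shape(token),                  # word shape
        "pos": pos_tags[i],                          # POS of current token
        "prev_pos": pos_tags[i - 1] if i > 0 else "<BOS>",
        "next_pos": pos_tags[i + 1] if i < last else "<EOS>",
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",   # context
        "next_word": tokens[i + 1].lower() if i < last else "<EOS>",
    }

tokens = ["by", "Dr.", "Brown"]
pos_tags = ["IN", "NNP", "NNP"]   # assumed output of a POS tagger
features = token_features(tokens, pos_tags, 2)
print(features)
```

Dictionaries of this form are what feature-based classifiers such as an SVM or a decision tree would consume after vectorization; section and sentence information would be added as further keys.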
Dataset (i2b2 2014)
PROPOSED APPROACH (Elman RNN)
PROPOSED APPROACH (Jordan RNN)
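The slides name the two recurrent variants without showing their equations. In the textbook formulation, an Elman network feeds the previous hidden state back into the hidden layer, while a Jordan network feeds back the previous output. The sketch below implements that difference in plain Python; the layer sizes, random weights, and the function name run_rnn are assumptions, and nothing here reproduces the trained models from the talk.

```python
import math
import random

def matvec(M, v):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(z):
    top = max(z)
    exps = [math.exp(value - top) for value in z]
    total = sum(exps)
    return [e / total for e in exps]

def run_rnn(xs, W_in, W_rec, W_out, variant="elman"):
    """Tag a sequence with a simple RNN.

    elman:  h_t = tanh(W_in x_t + W_rec h_{t-1})
    jordan: h_t = tanh(W_in x_t + W_rec y_{t-1})
    y_t = softmax(W_out h_t) in both cases.
    """
    h_prev = [0.0] * len(W_in)    # previous hidden state
    y_prev = [0.0] * len(W_out)   # previous output distribution
    outputs = []
    for x in xs:
        feedback = h_prev if variant == "elman" else y_prev
        pre = [a + b for a, b in zip(matvec(W_in, x), matvec(W_rec, feedback))]
        h = [math.tanh(p) for p in pre]
        y = softmax(matvec(W_out, h))
        outputs.append(y)
        h_prev, y_prev = h, y
    return outputs

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_in, d_hid, n_labels, seq_len = 4, 5, 3, 6     # illustrative sizes
xs = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(seq_len)]
W_in = rand_matrix(d_hid, d_in, rng)
W_out = rand_matrix(n_labels, d_hid, rng)
# The recurrent matrix maps hidden->hidden (Elman) or output->hidden (Jordan).
elman_probs = run_rnn(xs, W_in, rand_matrix(d_hid, d_hid, rng), W_out, "elman")
jordan_probs = run_rnn(xs, W_in, rand_matrix(d_hid, n_labels, rng), W_out, "jordan")
```

Each element of elman_probs and jordan_probs is a probability distribution over labels for one token, which for de-identification would be the PHI categories plus a non-PHI label.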
Results with PSO
3 Medical Sentiment Analysis
Social-media Texts (Medical Blogs)
Sample Medical Blog-post
Who is Talking?
Why Not??? Sentiment Analysis
Problem Statement
Medical Condition
Medication
Proposed Approach (Single Task Learning)
Results
Method 2: Multi-task Learning
Multi-tasking in NLP
Image credit: [35]
MTL methods for Deep Learning
Hard parameter sharing Soft parameter sharing
Image credit: [36]
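Hard parameter sharing, the first of the two schemes named above, can be sketched as one shared hidden layer feeding a separate output head per task. In soft parameter sharing, by contrast, each task keeps its own parameters and the distance between them is regularized. The class name HardSharedClassifier, the layer sizes, the class counts, and the task names (taken from the two sentiment tasks in this section) are assumptions for illustration; only the forward pass is shown.

```python
import math
import random

def dense(W, b, v):
    """Affine map W v + b for list-based matrices and vectors."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

class HardSharedClassifier:
    """One shared hidden layer feeding a separate output head per task."""

    def __init__(self, d_in, d_shared, task_classes, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        # Shared parameters: would be updated by every task during training.
        self.W_shared = init(d_shared, d_in)
        self.b_shared = [0.0] * d_shared
        # Task-specific heads: each would be updated only by its own task.
        self.heads = {task: (init(n, d_shared), [0.0] * n)
                      for task, n in task_classes.items()}

    def forward(self, x, task):
        hidden = [math.tanh(z) for z in dense(self.W_shared, self.b_shared, x)]
        W, b = self.heads[task]
        return dense(W, b, hidden)   # unnormalized class scores

model = HardSharedClassifier(
    d_in=8, d_shared=4,
    task_classes={"medical_condition": 3, "medication": 4},  # assumed class counts
)
x = [0.5] * 8
condition_scores = model.forward(x, "medical_condition")
medication_scores = model.forward(x, "medication")
```

Both calls pass through the same W_shared, so a gradient step for either task would change the representation used by the other.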
Benefits of MTL
Adversarial Learning
Results
Error Analysis
Approach-3
Result
Disease-wise Analysis (Medical Condition)
F1: ‘CNN+Emotion (coarse)’, F2: ‘CNN+Emotion (fine)’, F3: ‘CNN+Sentiment word feature’, F4: ‘CNN+Textual Content Feature’, F5: ‘Personality’, F6: ‘Sarcasm’
Disease-wise Analysis (Medication)
F1: ‘CNN+Emotion (coarse)’, F2: ‘CNN+Emotion (fine)’, F3: ‘CNN+Sentiment word feature’, F4: ‘CNN+Textual Content Feature’, F5: ‘Personality’, F6: ‘Sarcasm’
4 Pharmacovigilance Mining
Social-media Texts (Medical Blogs)
Slide credit: [33]
How Did It Begin?
“Secrets of Seroxat”
1,374 emails
FIRST STUDY EXPLORING CROWD INTELLIGENCE
Medawar, C., Herxheimer, A., Bell, A., & Jofre, S. (2002). Paroxetine, Panorama and user reporting of ADRs:
Consumer intelligence matters in clinical practice and post-marketing drug surveillance. International Journal of Risk &
Safety in Medicine, 15(3-4), 161-169.
Crowd Intelligence Matters!
● “Dr Healy confirmed what I already knew. My husband shot himself after 4 days on Seroxat
never having been suicidal in his life. . .”
● “I took Seroxat 2 years ago because I have a breathing condition called ‘chronic
hyperventilation syndrome’ which is exacerbated by stress and anxiety. I have never been
depressed or had suicidal feelings. However I was prescribed Seroxat to reduce stress &
anxiety. A day or two after taking the pills I (went) into a severe state of mental turmoil. I
felt really suicidal. It was so severe that all I did was stay in bed for two or three days.
Fortunately I recognised Seroxat and stopped taking it immediately.”
Image credit: [33]
Problem Statement
Proposed Approach
Results
Model | Twitter | CADEC | MEDLINE
ST-BLSTM | 57.3 | 51.1 | 71.91
CRNN (Huynh et al., 2016) | 64.9 | 48.2 | 75.5
RCNN (Huynh et al., 2016) | 63.6 | 43.6 | 74.0
MT-BLSTM (Chowdhury et al., 2018) | 63.19 | 57.62 | 74.0
MT-BLSTM-Attention (Chowdhury et al., 2018) | 65.73 | 58.27 | 77.95
● Explored the various unstructured forms of biomedical text and their application in solving
real-world problems.
● Explored deep learning solutions based on the Elman and Jordan recurrent neural network
architectures for the patient data de-identification task.
● Explored sentiment analysis in the medical domain and neural network approaches to
address the task.
● Explored a unified multi-task learning framework for pharmacovigilance mining that is
generic and easily adaptable to extracting pharmacovigilance information from any form of
text.
References
1. G. Zhou, J. Zhang, J. Su, D. Shen, and C. Tan, “Recognizing names in biomedical texts: a
machine learning approach,” Bioinformatics, vol. 20, no. 7, pp. 1178–1190, 2004.
2. J.-D. Kim, T. Ohta, Y. Tsuruoka, Y. Tateisi, and N. Collier, “Introduction to the bio-entity
recognition task at JNLPBA,” in Proceedings of the international joint workshop on natural
language processing in biomedicine and its applications, pp. 70–75, Association for
Computational Linguistics, 2004.
3. K.-M. Park, S.-H. Kim, H.-C. Rim, and Y.-S. Hwang, “ME-based biomedical named entity
recognition using lexical knowledge,” ACM Transactions on Asian Language Information
Processing (TALIP), vol. 5, no. 1, pp. 4–21, 2006.
4. B. Settles, “ABNER: an open source tool for automatically tagging genes, proteins and other
entity names in text,” Bioinformatics, vol. 21, no. 14, pp. 3191–3192, 2005.
5. Y. Song, E. Kim, G. G. Lee, and B.-K. Yi, “POSBIOTM-NER in the shared task of BioNLP/NLPBA
2004,” in Proceedings of the international joint workshop on natural language processing in
biomedicine and its applications, pp. 100–103, 2004.
7. Yikun Guo, Robert Gaizauskas, Ian Roberts, George Demetriou, and Mark Hepple. 2006.
Identifying personal health information using support vector machines. In i2b2 workshop on
challenges in natural language processing for clinical data, pages 10–11.
8. György Szarvas, Richárd Farkas, and András Kocsor. 2006. A multilingual named entity
recognition system using boosting and C4.5 decision tree learning algorithms. In International
Conference on Discovery Science, pages 267–278. Springer.
9. Özlem Uzuner, Yuan Luo, and Peter Szolovits. 2007. Evaluating the state-of-the-art in automatic
de-identification. Journal of the American Medical Informatics Association, 14(5):550–563.
10. Ben Wellner, Matt Huyck, Scott Mardis, John Aberdeen, Alex Morgan, Leonid Peshkin, Alex
Yeh, Janet Hitzeman, and Lynette Hirschman. 2007. Rapidly retargetable approaches to
de-identification in medical records. Journal of the American Medical Informatics Association,
14(5):564–573.
11. Eiji Aramaki, Takeshi Imai, Kengo Miyo, and Kazuhiko Ohe. 2006. Automatic deidentification by
using sentence features and label consistency. In i2b2 Workshop on Challenges in Natural
Language Processing for Clinical Data, pages 10–11.
12. M. Krallinger and A. Valencia. Evaluating the detection and ranking of protein interaction
relevant articles: the biocreative challenge interaction article sub-task (ias). In Proceedings of
the Second Biocreative Challenge Evaluation Workshop, 2007.
13. C. Grover, B. Haddow, E. Klein, M. Matthews, L. A. Nielsen, R. Tobin, and X. Wang. Adapting a
relation extraction pipeline for the biocreative ii task. In Proceedings of the BioCreAtIvE II
Workshop, volume 2, 2007.
14. M. Lan, C. L. Tan, and J. Su. Feature generation and representations for protein–protein
interaction classification. Journal of biomedical informatics, 42(5):866–872, 2009.
15. A. Cohen. Automatically expanded dictionaries with exclusion rules and support vector machine
text classifiers: approaches to the biocreative 2 gn and ppi-ias tasks. In Proceedings of the
Second BioCreative Challenge Evaluation Workshop, pages 169–174, 2007.
16. G. E. Hinton, N. Srivastava, A. Krizhevsky. Improving neural networks by preventing
co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
17. W. A. Baumgartner Jr, Z. Lu, et al. An integrated approach to concept recognition in biomedical text.
In Proceedings of the Second BioCreative Challenge Evaluation Workshop, volume 23, pages
257–71. Centro Nacional de Investigaciones Oncológicas (CNIO) Madrid, Spain, 2007.
18. Hua L., Quan C. A shortest dependency path based convolutional neural network for
protein-protein relation extraction. BioMed. Res. Int., 2016 (2016)
19. Choi S.-P. Extraction of protein–protein interactions (PPIs) from the literature by deep
convolutional neural networks with various feature embeddings. J. Inf. Sci. (2016)
20. Y. Peng, Z. Lu. Deep learning for extracting protein-protein interactions from biomedical literature.
arXiv preprint arXiv:1706.01556.
21. Qian L., Zhou G. Tree kernel-based protein–protein interaction extraction from biomedical
literature. J. Biomed. Inform., 45 (3) (2012), pp. 535-543
22. Li L., Guo R., Jiang Z., Huang D. An approach to improve kernel-based protein–protein
interaction extraction by learning from large-scale network data. Methods, 83 (2015), pp.
44-50
23. Choi S.-P., Myaeng S.-H. Simplicity is better: revisiting single kernel PPI extraction.
Proceedings of the 23rd International Conference on Computational Linguistics, Association for
Computational Linguistics (2010), pp. 206-214
24. Nikfarjam, A., & Gonzalez, G. H. (2011). Pattern mining for extraction of mentions of adverse drug
reactions from user comments. In AMIA Annual Symposium Proceedings (Vol. 2011, p. 1019).
American Medical Informatics Association.
25. Yates, Andrew, and Nazli Goharian. "ADRTrace: detecting expected and unexpected adverse drug
reactions from user reviews on social media sites." In European Conference on Information
Retrieval, pp. 816-819. Springer, Berlin, Heidelberg, 2013.
26. Sarker, A., & Gonzalez, G. (2015). Portable automatic text classification for adverse drug reaction
detection via multi-corpus training. Journal of biomedical informatics, 53, 196-207.
27. O’Connor, K., Pimpalkhute, P., Nikfarjam, A., Ginn, R., Smith, K. L., & Gonzalez, G. (2014).
Pharmacovigilance on twitter? Mining tweets for adverse drug reactions. In AMIA annual
symposium proceedings (Vol. 2014, p. 924). American Medical Informatics Association.
28. Trung Huynh, Yulan He, Alistair Willis, and Stefan Rüger. 2016. Adverse drug reaction classification
with deep neural networks. COLING.
29. Shaika Chowdhury, Chenwei Zhang, and Philip S. Yu. 2018. Multi-task pharmacovigilance mining
from social media posts. In Proceedings of the 2018 World Wide Web Conference, WWW ’18,
pages 117–126, Republic and Canton of Geneva, Switzerland. International World Wide Web
Conferences Steering Committee.
30. Simmons, M., Singhal, A., & Lu, Z. (2016). Text mining for precision medicine: bringing structure to EHRs and
biomedical literature to understand genes and health. In Translational Biomedical Informatics (pp. 139-166).
Springer, Singapore.
31. Olivier Bodenreider. (2010, October 21). The Unified Medical Language System: What is it and how to use it?
Retrieved from
https://www.slideshare.net/roger961/the-unified-medical-language-system-what-is-it-and-how-to-use-it.
32. Kerrie Holley. (2013). Technological Trend of the Future: Next Era of Computing. Retrieved from
https://www.slideshare.net/IBMSMMA/ibm-summit-2013-16-9next-era
33. DigiMind. (2016). Pharmacovigilance & Social Media. Retrieved from
https://www.slideshare.net/digimind/pharmacovigilance-social-media
34. Liu, P., Qiu, X., & Huang, X. (2017). Adversarial multi-task learning for text classification. arXiv preprint
arXiv:1704.05742.
35. Alexandr Honchar. (2018). Multitask learning: teach your AI more to make it better. Retrieved from
https://towardsdatascience.com/multitask-learning-teach-your-ai-more-to-make-it-better-dde116c2cd40
36. Sebastian Ruder. (2017). An Overview of Multi-Task Learning in Deep Neural Networks. Retrieved from
https://ruder.io/multi-task/
37. Ian Goodfellow. (2017). Introduction to GANs, NIPS 2016. Retrieved from
https://www.youtube.com/watch?v=9JpdAg6uMXs&frags=pl%2Cwn
THANK YOU!