Development of National Health Data Warehouse For Data Mining
All content following this page was uploaded by Shahidul Islam Khan on 06 March 2016.
Health informatics is currently one of the top focuses of computer science researchers.
Availability of timely and accurate data is essential for medical decision making. Health care
organizations face a common problem with the large amount of data they have in numerous
systems. Researchers, health care providers and patients will not be able to utilize the knowledge
stored in different repositories unless the information from disparate sources is amalgamated.
This problem can be solved by data warehousing. Data warehousing techniques share a
common set of tasks, including requirements analysis, data design, architectural design,
implementation and deployment. Developing a health data warehouse is complex and time
consuming but is also essential to deliver quality health services. This paper depicts the prospects
and complexities of health data warehousing and mining and illustrates a data-warehousing
model suitable for integrating data from different health care sources to discover effective
knowledge.
Keywords: Data Mining, Data Warehouse, Health Informatics, Clinical Database, Data
Preprocessing
Archiving and Communications System), RIS (Radiology Information System) in various
hospitals, departments and diagnostic laboratories. Data required to make informed medical
decisions are trapped within fragmented, disparate, and heterogeneous clinical and
administrative systems that are not properly integrated. As a result, health care suffers because
medical practitioners and health care providers are unable to access this information to perform
activities such as diagnostics and treatment optimization to improve patient care [1, 6, 7].
Successful healthcare data management is an important factor in developing support systems
for the clinical decision-making process. A traditional operational database system does not
satisfy the requirements of the critical data analysis tasks of clinical decision-making users. It
contains detailed data but does not include important historical data, and since it is highly
normalized, it performs poorly for complex queries that need to join many relational tables or
to aggregate large volumes of data in order to generate various clinical reports. A health data
warehouse is a data store that is different from the hospital's operational databases. It can be
used for the analysis of consolidated historical data [7, 8].
According to Inmon [9], a data warehouse (DW) is a subject-oriented, integrated, non-volatile,
and time-variant collection of data in support of management's decisions.
Subject-oriented: the warehouse is organized around the major subjects of the enterprise
(such as customers, products, and sales).
Integrated: the DW is usually constructed by integrating multiple heterogeneous sources, such
as relational databases, flat files etc.
Time-variant: data in the warehouse is only accurate and valid at some point in time or over
some time interval.
Non-volatile: the data is not updated in real time but is refreshed on a regular basis from
different data sources.
The advantages and disadvantages of DW are given below [9, 10, 11]:
Advantages of DW:
1. Standardize data across the organization
2. Improve turnaround time for analysis and reporting
3. Easy sharing of data
4. Remove informational processing load from operational database
5. Enhance data quality and consistency
6. Provide historical intelligence and reduce cost to access historical data
7. Integrate data from multiple sources into a single repository
8. Improve data quality by fixing noisy data
9. Restructure the data so that it delivers excellent query performance
10. Make decision-support queries easier to write
Disadvantages of DW:
1. Long initial development time and associated high cost
2. Data owners lose control over data, raising ownership and privacy issues
Implementing a Health DW is a complex task containing two major phases. Firstly, in the
configuration phase, a conceptual view of the warehouse is specified according to user
requirements (DW design). Secondly, the related data sources and the Extraction-Transform-Load
(ETL) process (data acquisition) are determined. After the initial load, during the operation
phase, warehouse data must be regularly refreshed; that is, modifications of operational data
since the last DW refreshment must be propagated into the warehouse such that data stored in
the data warehouse reflect the state of the underlying operational systems [5, 8, 12].
The main aim of this research is to identify the obstacles to healthcare data integration and to
propose a data-warehousing model suitable for integrating fragmented data with respect to
Bangladesh as well as anywhere else. The result will contribute to the advancement of
knowledge in the field of medical informatics. In this paper “Health”, “Clinical”,
“Pathological” and “Medical”
Database Systems Journal vol. VI, no. 1/2015 5
these terms are used with similar meaning.
The rest of this paper is organized as follows. In Section 2 we present selected literature
reviews on DW, Health DW and KDD techniques. Section 3 briefly describes some design issues
of the National Health DW. In Section 4 we show the calculation of the approximate size of our
DW. Some preprocessing techniques that we have used are illustrated in Section 5. Section 6
gives readers ideas about how our DW will be used for knowledge discovery and mining.
Finally, Section 7 concludes the paper.

2. Literature Review
DW unifies the data scattered throughout an organization into a single centralized data
structure. It is a repository of integrated information available for querying and analysis. DW
may be considered a proactive approach to information integration, as compared to the more
traditional query-driven approaches where processing and integration start when a query
arrives [5, 6]. A health data warehouse is a repository where healthcare providers can gain
access to medical data gathered in the patient care process. Extracting medical domain
information to a data warehouse can facilitate efficient storage, enhance timely analysis and
increase the quality of real-time decision-making processes. Today's healthcare organizations
require not only quality and effectiveness of their treatment, but also reduction of waste and
unnecessary costs. In order to construct an operational and effective DW it is essential to
combine process work, domain expertise and high-quality database design [7, 8].
Electronic Health Records (EHR) describe the diseases and treatments of patients and are
normally stored in the hospitals or clinics where they are created. Patients may be treated in
different hospitals and clinics and, therefore, there is a need for integrating health records from
different hospitals to enable any hospital to obtain a total overview of a patient's health history.
Different heterogeneity problems have to be solved in order to integrate EHR systems from
different hospitals and health service providers in a consistent way. The first problem is that
different hospitals normally do not use the same DBMS and, therefore, the traditional ACID
properties of databases are missing across the different hospital locations. This may cause
performance, autonomy, and consistency problems. Another heterogeneity problem is that
there are several incompatible standards for EHR entries [12].
The trend of adopting data warehouses for health systems is presented in [13], where the design
experience in the University of Virginia Health System is reported. Here the data warehouse is
used to provide clinicians and researchers with direct, rapid access to desired patients' data. In
addition, they use the DW for educational and research aims, as it serves to face informatics
issues, such as data capture, and to perform exploratory analyses of healthcare problems.
The medical domain has certain unique data requirements such as high volumes of unstructured
data and data confidentiality. There are huge constraints and issues that limit the way data
mining is performed on medical datasets. Some of these issues are the way the data is collected,
the accuracy of the data, and the ethical, privacy and social issues that come with patients'
records [2]. Research has also been done to find out the impact of missing values and of noise,
and how these can influence the output. Zhu et al. classified noise into class noise and attribute
noise. Attribute noise includes incorrect attribute values, missing or don't-know attribute
values, and incomplete attributes or don't-care values [14].
Several studies have focused on techniques that have a built-in mechanism to handle noise and
missing values and which are more appropriate to use for medical applications. A few techniques
that have been applied and are more suited to medical data sets are studied in [15, 16]. For example
decision trees, logic programs, K-nearest neighbour, and Bayesian classifiers. Lee et al.
recommended that Bayesian networks and decision trees are the primary techniques applied in
medical information systems [17]. Obenshain claimed that neural networks performed better
than logistic regression, but the decision tree did better in identifying active compounds most
likely to have biological activity [18]. Wang and Wang discussed that most process models do
not focus on gaining new knowledge. Medical data mining applications should follow a
five-stage data mining development cycle: planning tasks, developing data mining hypotheses,
preparing data, selecting data mining tools, and evaluating data mining results [19].
Handling missing data in pathology databases using the Multiple Imputation technique is
discussed in [20]. Optimizing public health data collection for KDD using feature selection is
studied in [21]. Cubillas et al. proposed a model for improvement in appointment scheduling in
health care centers [22]. Hoque et al. discussed the present structure of pathological data, the
requirements to formulate efficient models and the necessity to reform the present structure for
predictive data mining in [23]. Kumari and Singh used a Neural Network for the diagnosis of
diabetes [24]. Yilmaz et al. proposed a modified K-means algorithm based data preparation
method for diagnosis of heart and diabetes diseases [25]. Herland et al. present recent research
using Big Data tools and approaches for the analysis of Health Informatics [26].

3. Design Issues of National Health DW
The architecture of the national health DW model is illustrated in Fig. 1. Health data from
different govt. and private sources such as hospitals, clinics, diagnostic centers and research
centers will be collected. Using the ETL process, data will be integrated into a temporary data
repository [27].
summarization of national health data. Partial materialization is used rather than full
materialization of cuboids to reduce huge space requirements [9, 10].
Logical design of DW involves the definition of structures that enable efficient access to
information. There are many logical models, such as Flat schema, Star schema, Star Cluster
schema, Snowflake schema and Fact Constellation schema. Among them, star schema,
snowflake schema and fact constellation schema are the most widely used commercially.
Efficiency is the most important factor in DW modeling because many queries access large
amounts of data that may involve multiple join operations. The most suitable logical data
warehousing model is the Star Schema [9, 12, 13]. We have used Star Schema in our design,
illustrated in Fig. 3.
Using the building blocks of the fact table and the various dimension tables, one has thousands
of ways to aggregate the data. For clinical analysis purposes, frequently needed aggregated
datasets should be created in advance for the users. Having data readily and easily available is
a major tenet of data warehousing. For our DW, some aggregated datasets could be:
• Patient count by Diagnosis, Gender, Age, Date
• Count of Procedures by Provider and Date
• Billing and discount information
• Count of retesting
Minimum 0.1
Standard Dev. 5.3070
Table 4 presents the normalization technique for nominal data, where the metadata of Table 5
are used to replace result data for Urine colour diagnosis.
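Replacing nominal result values through a metadata lookup could look roughly like the sketch below. The colour names and integer codes in `URINE_COLOUR_CODES` are purely hypothetical placeholders and do not reproduce the metadata of Table 5; the helper `normalise_nominal` is likewise an assumed name.

```python
# Hypothetical metadata: these codes are illustrative, not the paper's Table 5.
URINE_COLOUR_CODES = {"straw": 1, "yellow": 2, "amber": 3, "reddish": 4}


def normalise_nominal(raw, codes, unknown=None):
    """Map a free-text nominal value to its metadata code.

    Values are trimmed and lower-cased before lookup; anything that is
    empty or not present in the metadata is returned as `unknown` so it
    can be reviewed instead of being silently guessed.
    """
    key = raw.strip().lower() if raw else ""
    return codes.get(key, unknown)


raw_results = ["Straw", " yellow", "AMBER", "", "Redish"]
coded = [normalise_nominal(value, URINE_COLOUR_CODES) for value in raw_results]
print(coded)
```

The misspelt value "Redish" is deliberately left unmapped in this sketch; whether such values should be corrected automatically or sent for manual review is a design decision the sketch does not settle.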
[7] Sahama TR, Croll PR (2007) A Data Warehouse Architecture for Clinical Data Warehousing.
Australasian Workshop on Health Knowledge Management and Discovery (HKMD 2007)
[8] Lyman JA, Scully K, Harrison JH (2008) The development of health care data warehouses
to support data mining. Clin Lab Med. 28(1): 55-71
[9] Inmon W (2005) Building the Data Warehouse, 4th Edition, Wiley, New York
[10] Jiawei H, Micheline K, Jian P (2012) Data Mining Concepts and Techniques, 3rd Edition,
Elsevier
[11] Kimball R, Ross M (2013) The Data Warehouse Toolkit: The Definitive Guide to
Dimensional Modeling, 3rd Edition, Wiley
[12] Nugawela S (2013) Data Warehousing Model For Integrating Fragmented Electronic
Health Records From Disparate And Heterogeneous Clinical Data Stores. M.Sc. Thesis,
Queensland University of Technology
[13] Mullins M, Siadaty MS, Lyman J et al (2006) Data mining and clinical data repositories:
Insights from a 667,000 patient data set. Comput. Biol. Med. 36: 1351-1377
[14] Zhu X, Khoshgoftaar T, Davidson I, Zhang S (2007) Special issue on mining low-quality
data. Knowledge and Information Systems 11: 131-136
[15] Brown ML, Kros JF (2003) Data mining and the impact of missing data. Industrial
Management & Data Systems 103: 611-621
[16] Lavrač N (1999) Selected techniques for data mining in medicine. Artificial intelligence
in medicine 16(1): 3-23
[17] Lee IN, Liao SC, Embrechts M (2000) Data mining techniques applied to medical
information. Medical Informatics & the Internet in Medicine 25(2): 81-102
[18] Obenshain MK (2004) Application of Data Mining Techniques to Healthcare Data.
Infection Control and Hospital Epidemiology 25(8): 690-695
[19] Wang H, Wang S (2008) Medical knowledge acquisition through data mining. IEEE
International Symposium ITME
[20] Fu S (2011) Missing Data in Pathology Databases. MSc Thesis, Australian National
University
[21] Partington SN, Papakroni V, Menzies T (2014) Optimizing data collection for public
health decisions: a data mining approach. BMC Public Health 14: 593-598
[22] Cubillas JJ, Ramos MI, Feito FR, Ureña T (2014) An improvement in the appointment
scheduling in primary health care centers using data mining. J. Med. Syst., Springer 38: 89
[23] Hoque ASML, Galib S, Tasnim M (2013) Mining Pathological Data to Support Medical
Diagnostics. Workshop on Advances on Data Management: Applications and Algorithms,
Department of Computer Science and Engineering, BUET, Dhaka, 71-74
[24] Kumari S, Singh A (2013) A data mining approach for the diagnosis of diabetes mellitus.
IEEE 7th International Conference on Intelligent Systems and Control
[25] Yilmaz N, Inan O, Uzer MS (2014) A New Data Preparation Method Based on Clustering
Algorithms for Diagnosis Systems of Heart and Diabetes Diseases. J. Med. Syst., Springer, 38
[26] Herland M, Khoshgoftaar TM, Wald R (2014) A review of data mining using big data in
health informatics. J. Big Data, Springer 1: 2
[27] Khan SI, Hoque ASML (2015) Towards Development of Health Data Warehouse:
Bangladesh Perspective. Accepted in 2nd International Conference on Electrical Engineering
Shahidul Islam Khan obtained his B.Sc. and M.Sc. Engineering degrees in
Computer Science and Engineering (CSE) from Ahsanullah University of
Science & Technology (AUST) and Bangladesh University of Engineering
& Technology (BUET), Dhaka, Bangladesh in 2003 and 2011 respectively.
He is now a doctoral student in the Department of CSE, BUET, which is the
highest ranked technical university of Bangladesh. His current field of
research is data mining and health informatics. He has more than 10
published papers in international conferences and journals. He is also an Assistant Professor
(currently on study leave) in the Dept. of CSE, International Islamic University Chittagong
(IIUC), Bangladesh.
Abu Sayed Md. Latiful Hoque graduated from the Dept. of Electrical and
Electronic Engineering (EEE), Bangladesh University of Engineering &
Technology (BUET), Dhaka, Bangladesh in 1986. He obtained a Post
Graduate Diploma in 1992 from the Asian Institute of Technology (AIT),
Thailand and a Ph.D. in CS from the University of Strathclyde, Glasgow, UK in
2003. He is a professor in the Dept. of CSE, BUET and a prominent
international researcher in the fields of Database, Data Mining and E-Learning.
He has about 50 published papers in reputed international journals and conferences. He is
also the author of a book on Database Systems which is taught in universities.