Seah 2014
Abstract—Various business organizations and government bodies are enhancing their decision-making capabilities using data warehouses. For government bodies, a data warehouse makes policy formulation much easier by basing it on available data such as survey-based services data. In this paper we present a survey-based service data analysis with the design and implementation of a Data Warehouse framework for data mining and business intelligence reporting. In the design of the data warehouse, we developed a multi-dimensional Data Model for the creation of multiple data marts and designed an ETL process for populating the data marts from the data source. The development of multiple data marts enables easier report generation by identifying common dimensions amongst the data marts. The cross-join capabilities of the data marts through common dimensions demonstrate the ability to easily drill across the data marts for cross data analysis and reporting. In addition, we have incorporated data quality checking on the data source as well as data detection rules to filter out data with an unmatched schema or out-of-range values before it is stored in the data warehouse for analysis.

Keywords—data model; data warehouse; Extract Transform Load (ETL); OLAP; business intelligence; data mining; star schema; data marts

I. INTRODUCTION
To provide a better environment for government, organisations and the business community in planning and decision making, survey-based services data are collected from various industries to help form policy decisions. To achieve this, however, there is a need to implement a business intelligence dashboard and data mining, both of which rely heavily on the formulation of the data warehouse and the data model that enable such activities. In this paper we present a survey-based service analysis with the design and implementation of a Data Warehouse framework for data mining and business intelligence reporting. In the design of the data warehouse, we developed a multi-dimensional Data Model for the creation of multiple data marts for the data analysis. With multiple data marts, it is easier to cater for the needs of each report by identifying common dimensions amongst the data marts. The cross-join capabilities of the data marts through common dimensions demonstrate the ability to easily drill across the data marts. In addition, we designed an ETL process for populating the data warehouse from the data source. The construction of the system had two stages: the first stage was the data model design and data loading, and the second stage was application development based on the data warehouse. At present, applications based on the data warehouse are mainly of two types, pre-defined reports and OLAP analysis, supplemented by data mining to meet the requirements of higher management.

In this paper, Section II gives the background and principles of the data warehouse. Section III gives an overview of the data model and data warehouse design, while Section IV presents the data model design considerations and Section V details the data model implemented. In Section VI, we describe the ETL design process implemented, from extraction of the data source to loading into the data warehouse. In Section VII, we present the results of the data model and data warehouse, and finally in Section VIII we present the conclusion of this paper.

II. DATA WAREHOUSE BACKGROUND
Ralph Kimball defines a data warehouse as "a copy of transaction data specifically structured for query and analysis" [9]. The data warehouse of an enterprise consolidates data from heterogeneous sources to support enterprise-wide decision-making, reporting, and analysis. Data warehouses often use star and snowflake schemas [14] to provide the fastest possible response times to complex queries. Bill Inmon [6] proposed the snowflake schema, which is a variant of the star schema model in which some dimension tables are normalised, thereby further splitting the data into additional tables. Ralph Kimball [3, 4, 5] proposed the star schema for representing multidimensional data, where the schema graph resembles a starburst, with the dimension tables displayed in a radial pattern around the central fact table. In general, populating any data warehouse consists of the following steps:

A. Data Loading
Data consolidated from heterogeneous data sources may have problems, and needs to be transformed and cleaned before loading into the data warehouse (DW). The data may contain incorrect values such as null values, inconsistent reference codes, and others. Hence, data cleansing is an essential task in the data warehousing process to get correct, high-quality data into the DW. This process basically consists of the following tasks [10]:
C. Data Warehouse Architecture Design
The system was built according to the principle of a three-tier structure consisting of a data acquisition tier, a data storage tier and a data display tier.

In the data acquisition tier, the data sources are provided as MS Access MDB files. These data sources are extracted, cleansed (detection), transformed and finally loaded into the data warehouse [16]. In this PoC, we have developed data detection rules for data exception handling.

In the data storage tier, the physical representations of the data marts are developed. These data marts, which are based on subject areas, are developed with a star schema and common dimensions in order to facilitate cross-join drill-down of the data used in the OLAP cube. The data stored in the data marts have measures which are pre-aggregated. Further details of the data model design are given in Section IV.

In the data display tier, data is analyzed and processed according to demand, and the analysis results can be shown by business intelligence and data mining tools. In this tier, the accurate information required by business users can be made available for decision-making. Figure 2.0 illustrates the data warehouse architecture.

As shown in Figure 2.0, the data is stored in both the main data warehouse and the data marts. The detailed data is stored in the data warehouse in third normal form (3NF). Data oriented to specific subject areas are stored in the respective data marts.

IV. DATA MODEL DESIGN CONSIDERATION
A. Dimensional Modeling
Dimensional modeling is a logical design technique for organizing dimensions and data marts in a star model. Every dimensional model is composed of a data mart table and a set of dimension tables (Bill Inmon/Ralph Kimball, 1997) [3, 4, 5, 6].

B. Dimension Tables
The dimension tables contain the textual descriptors of the business. In a well-designed dimensional model, dimension tables have many columns or attributes. These attributes describe the rows in the dimension table [3, 4, 5]. They are very useful in describing the different entities in the business, provide meaningful textual data, and help to make the business understandable.

C. Data Marts Tables
A data mart table is the primary table in a dimensional model, where the numerical performance measurements of the business are stored [3, 4, 5]. It holds all the numerical measures in one place instead of duplicating them in different places in the data warehouse. These values act as measures for analysis along the various dimensions.

D. Attributes
A dimension table contains a collection of attributes that are useful for performing aggregations and analyzing the business facts stored in the fact table.
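The relationship between dimension attributes and measures can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table names (dim_state, fact_workforce), the region attribute and all sample values are hypothetical and chosen only for illustration; they are not the schema implemented in this paper.

```python
import sqlite3


def build_star_schema(conn):
    """Create one dimension table and one fact (data mart) table.

    All table names, columns and rows are illustrative assumptions.
    """
    conn.executescript(
        """
        CREATE TABLE dim_state (
            state_id   INTEGER PRIMARY KEY,
            state_name TEXT,
            region     TEXT              -- descriptive attribute used for grouping
        );
        CREATE TABLE fact_workforce (
            state_id INTEGER REFERENCES dim_state(state_id),
            year     INTEGER,
            workers  INTEGER             -- numerical measure
        );
        """
    )
    conn.executemany(
        "INSERT INTO dim_state VALUES (?, ?, ?)",
        [(1, "State A", "North"), (2, "State B", "North"), (3, "State C", "South")],
    )
    conn.executemany(
        "INSERT INTO fact_workforce VALUES (?, ?, ?)",
        [(1, 2012, 100), (2, 2012, 150), (3, 2012, 80), (1, 2013, 120)],
    )


def workers_by_region(conn):
    """Aggregate the measure along a dimension attribute."""
    rows = conn.execute(
        """
        SELECT d.region, SUM(f.workers)
        FROM fact_workforce AS f
        JOIN dim_state AS d ON d.state_id = f.state_id
        GROUP BY d.region
        ORDER BY d.region
        """
    ).fetchall()
    return dict(rows)


conn = sqlite3.connect(":memory:")
build_star_schema(conn)
totals = workers_by_region(conn)
print(totals)
```

The fact table carries only keys and measures, while the grouping is driven entirely by an attribute of the dimension table. Adding further attributes to the dimension would allow new aggregations without changing the fact table.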
Fig. 3. Data Marts schema relationship model

A. Common Dimensions
Dimension    Description
State        It contains attributes describing each state where the industries are located. Hence drill-down can be done at state level.
B. Data Marts
Data Marts Description
C. Dimension Tables Relating to Respective Data Marts
Dimension Description
· Total workforce by telecommunication
· Total workforce by cargo services
· Total workforce by gender based on education, telecommunication and cargo services
· Year-to-year comparison of workforce growth based on education, telecommunication and cargo services
· Top 10 industries with highest workforce employed in education service
· Top 10 industries with highest workforce employed in telecommunication service
· Top 10 industries with highest workforce employed in cargo service
· Total workforce by Worker Category based on education, telecommunication and cargo services
· Total workforce by Worker Category by gender based on education, telecommunication and cargo services
· Total workforce by Certificate Level such as PhD, Master, Degree, Diploma, etc. based on education, telecommunication and cargo services
· Total workforce by Worker Category by Certificate Level
· Total Asset based on education, telecommunication and cargo services
· Top 10 industries with highest asset value in education service
· Top 10 industries with highest asset value in telecommunication service
· Top 10 industries with highest asset value in cargo service

Fig. 8. Ad-hoc Reports

Fig. 9. Dashboard Charts Comparison

In addition, because the data model uses common dimensions, it enables multi-dimensional queries to be made, which can further broaden the scope of analysis.
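A drill-across query over a common dimension can be sketched as follows, again using Python's built-in sqlite3 module. This is a minimal illustration under stated assumptions: the table names (dim_state, mart_education, mart_telecom), the column names and the sample rows are hypothetical and are not the data marts implemented in this paper. Each data mart is first aggregated to the grain of the shared State dimension, and the two result sets are then joined on that dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical common dimension shared by both data marts.
    CREATE TABLE dim_state (state_id INTEGER PRIMARY KEY, state_name TEXT);
    -- Hypothetical data marts, one per service area.
    CREATE TABLE mart_education (state_id INTEGER, workforce INTEGER);
    CREATE TABLE mart_telecom   (state_id INTEGER, workforce INTEGER);

    INSERT INTO dim_state VALUES (1, 'State A'), (2, 'State B');
    INSERT INTO mart_education VALUES (1, 40), (1, 60), (2, 30);
    INSERT INTO mart_telecom   VALUES (1, 25), (2, 10), (2, 15);
    """
)


def drill_across(conn):
    """Aggregate each mart separately, then join on the common dimension."""
    return conn.execute(
        """
        WITH edu AS (
            SELECT state_id, SUM(workforce) AS total
            FROM mart_education GROUP BY state_id
        ),
        tel AS (
            SELECT state_id, SUM(workforce) AS total
            FROM mart_telecom GROUP BY state_id
        )
        SELECT d.state_name,
               COALESCE(edu.total, 0) AS education_workforce,
               COALESCE(tel.total, 0) AS telecom_workforce
        FROM dim_state AS d
        LEFT JOIN edu ON edu.state_id = d.state_id
        LEFT JOIN tel ON tel.state_id = d.state_id
        ORDER BY d.state_name
        """
    ).fetchall()


result = drill_across(conn)
for state_name, education, telecom in result:
    print(state_name, education, telecom)
```

Aggregating each mart before joining avoids the row multiplication that would occur if the two fact tables were joined directly on state_id, which is why the sketch uses two sub-queries rather than a single three-way join.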
as described in Sections III, IV and V. The data models incorporate four data marts with multiple common dimensions to enable drilling across the data marts for cross data analysis. Hence, multidimensional analysis can be performed, such as employment by education level, gender, state distribution and others amongst the education, cargo and telecommunication services. The authorities can then develop appropriate education and working policies to help address the needs of the industries. The data models are designed with the ability to cater for more service-based data in other areas besides the three services given in this PoC. In this data warehouse implementation, we have also designed the ETL process, which incorporates data quality checks to filter out data that is inconsistent with respect to data schema and data value. Hence, this data quality process ensures that correct, high-quality data enters the data warehouse.
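The kind of schema and value filtering summarised above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (state_code, gender, workforce), the set of valid gender codes and the non-negative range check are hypothetical and do not reproduce the detection rules implemented in the PoC.

```python
# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"state_code": str, "gender": str, "workforce": int}
# Hypothetical reference codes accepted for the gender field.
VALID_GENDERS = {"M", "F"}


def detect_problems(record):
    """Return a list of rule violations for one source record."""
    problems = []
    # Schema rules: every required field must be present, non-null and typed.
    for field, expected_type in REQUIRED_FIELDS.items():
        value = record.get(field)
        if value is None:
            problems.append(f"{field}: missing or null")
        elif not isinstance(value, expected_type):
            problems.append(f"{field}: expected {expected_type.__name__}")
    # Value rules are only meaningful once the schema rules have passed.
    if not problems:
        if record["gender"] not in VALID_GENDERS:
            problems.append("gender: unknown reference code")
        if record["workforce"] < 0:
            problems.append("workforce: out of range")
    return problems


def split_records(records):
    """Separate records that may be loaded from those that are filtered out."""
    accepted, rejected = [], []
    for record in records:
        problems = detect_problems(record)
        if problems:
            rejected.append((record, problems))
        else:
            accepted.append(record)
    return accepted, rejected


source_rows = [
    {"state_code": "S01", "gender": "F", "workforce": 12},
    {"state_code": "S02", "gender": "X", "workforce": 5},
    {"state_code": None, "gender": "M", "workforce": 7},
    {"state_code": "S03", "gender": "M", "workforce": -4},
]
accepted, rejected = split_records(source_rows)
print(len(accepted), "accepted;", len(rejected), "rejected")
```

Keeping the rejected records together with the reasons for rejection, rather than silently dropping them, makes it possible to report exceptions back to the data provider.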
REFERENCES
[1] R. Arora, P. Pahwa, S. Bansal, "Alliance Rules of Data Warehouse
Cleansing", IEEE International Conference on Signal Processing
Systems, Singapore, May 2009, pp. 743-747.
[2] S. Chaudhuri, K. Ganjam, V. Ganti, “Data Cleaning in Microsoft SQL
Server 2005”, In Proceedings of the ACM SIGMOD Conference,
Baltimore, MD, 2005.
[3] R. Kimball, L. Reeves, M. Ross and W. Thornthwaite. "The Data
Warehouse Lifecycle Toolkit: Expert Methods for Designing,
Developing, and Deploying Data Warehouses", John Wiley & Sons,
1998.
[4] R. Kimball and M. Ross. "The Data Warehouse Toolkit: The Complete
Guide to Dimensional Modeling", 2nd Edition, John Wiley & Sons,
2002.
[5] R. Kimball and J. Caserta. "The Data Warehouse ETL Toolkit: Practical
Techniques for Extracting, Cleaning, Conforming, and Delivering Data",
John Wiley & Sons, 2004.
[6] W. H. Inmon, Building the data warehouse (2nd ed.), John Wiley &
Sons, Inc., New York, NY, 1996.
[7] Pentaho Data Integration, www.pentaho.com/
[8] Apache Tomcat, http://tomcat.apache.org/
[9] T. Manjunath, S. Ravindra, and G. Ravikumar, “Analysis of data quality
aspects in data warehouse systems,” International Journal of Computer
Science and Information Technologies, Vol. 2, No. 1, 2010, pp. 477-
485.
[10] B. Pinar, A Comparison of Data Warehouse Design Models, Master
Thesis, Atilim University, Jan. 2005.
[11] W. Eckerson and C. White, “Evaluating ETL and Data Integration
Platforms”, TDWI REPORT SERIES, 101communications LLC, 2003.
[12] J. Trujillo and S. Lujan-Mora. “A UML Based Approach for Modelling
ETL Processes in Data Warehouses”. In I.-Y. Song, S. W. Liddle, T. W.
Ling, and P. Scheuermann, editors, ER, volume 2813 of Lecture Notes in
Computer Science, Springer, 2003.
[13] Wang, Richard Y., and Strong, Diane M. Beyond accuracy: What data
quality means to data consumers, Journal of Management Information
Systems; Armonk; Spring 1996, 12 (4), pp. 5-34.
[14] M. Jarke, M. Lenzerini, Y. Vassiliou, P. Vassiliadis, "Fundamentals of
Data Warehouses," 2000.
[15] You, Y. L. and Zhang, X. M., “A reliable strategy and design of
architecture of ETL in data warehouse”, Computer Engineering and
Applications, Vol. 10, pp. 172-174, 2005.
[16] B. K. Seah, "An application of a healthcare data warehouse system",
Innovative Computing Technology (INTECH), 2013, pp. 269 - 273.