DCAP603
Data Warehousing and Data Mining
Copyright © 2011 Ritendra Goyal
All rights reserved
S. No. Description
1 Data Warehouse Practice: Data warehouse components, Designing the Data Warehouse, Getting Heterogeneous Data into the
Warehouse, Getting Multidimensional Data out of the Warehouse.
2 Data Warehouse Research-Issues and Research: Data Extraction and Reconciliation, Data Aggregation and Customization, Query
Optimization, Update Propagation, Modelling and Measuring Data Warehouse Quality, Some Major Research Projects in Data
Warehousing, Three Perspectives of Data Warehouse Metadata.
3 Source Integration: The Practice of Source Integration, Research in Source Integration, Towards Systematic Methodologies for
Source Integration.
4 Data Warehouse Refreshment: Data Warehouse Refreshment, Incremental Data Extraction, Data Cleaning,
5 Data Warehouse Refreshment: Update Propagation into Materialized Views, Towards a Quality-Oriented Refreshment Process,
Implementation of the Approach
6 Multidimensional Data Models and Aggregation: Multidimensional View of Information, ROLAP Data Model, MOLAP Data
Model, Logical Models for Multidimensional Information, Conceptual Models for Multidimensional Information
7 Query Processing and Optimization: Description and Requirements for Data Warehouse Queries, Query Processing Techniques.
8 Metadata and Warehouse Quality: Metadata Management in Data Warehouse Practice, A Repository Model for the DWQ
Framework, Defining Data Warehouse Quality.
9 Metadata and Data Warehouse Quality: Representing and Analyzing Data Warehouse Quality, Quality Analysis in Data
Staging.
10 Quality-Driven Data Warehouse Design: Interactions between Quality Factors and DW Tasks, The DWQ Data Warehouse
Design Methodology, Optimizing the Materialization of DW Views
Contents
Objectives
Introduction
1.1 What is a Data Warehouse?
1.1.1 Use of Data Warehouses in Organisations
1.1.2 Query Driven Approach versus Update Driven Approach for Heterogeneous
Database Integration
1.1.3 Differences between Operational Database Systems and Data Warehouses
1.1.4 Need to Build a Data Warehouse
1.2 Data Warehousing
1.3 Characteristics of Data Warehouse
1.4 Data Warehouse Components
1.5 Designing the Data Warehouse
1.6 Data Warehouse Architecture
1.6.1 Why do Business Analysts need Data Warehouse?
1.6.2 Process of Data Warehouse Design
1.6.3 A Three-tier Data Warehouse Architecture
1.6.4 OLAP Server Architectures
1.7 Getting Heterogeneous Data into the Warehouse
1.8 Getting Multidimensional Data out of the Warehouse
1.9 Summary
1.10 Keywords
1.11 Self Assessment
1.12 Review Questions
1.13 Further Readings
Objectives
After studying this unit, you will be able to:
• Know the data warehouse concept
• Explain data warehouse components
• Describe data warehouse architecture
Introduction
Remember using Lotus 1-2-3? This was your first taste of “What if?” processing on the desktop.
This is what a data warehouse is all about: using information your business has gathered to help
it react better, smarter, quicker and more efficiently.
To expand upon this definition, a data warehouse is a collection of corporate information,
derived directly from operational systems and some external data sources. Its specific purpose is
to support business decisions, not business operations. This is what a data warehouse is all about,
helping your business ask “What if?” questions. The answers to these questions will ensure your
business is proactive, instead of reactive, a necessity in today’s information age.
The industry trend today is moving towards more powerful hardware and software configurations.
With these more powerful configurations, we now have the ability to process vast volumes
of information analytically, which would have been unheard of ten or even five years ago.
A business today must be able to use this emerging technology or run the risk of being information
under-loaded. You read that correctly: under-loaded, the opposite of overloaded. Overloaded
means you are so overwhelmed by the enormous glut of information that it is hard to wade through
it to determine what is important. If you are under-loaded, you are information deficient. You
cannot cope with decision-making exceptions because you do not know where you stand. You
are missing critical pieces of information required to make informed decisions.
In today’s world, you do not want to be the country mouse. In today’s world, full of vast amounts
of unfiltered information, a business that does not effectively use technology to sift through that
information will not survive the information age. Access to and the understanding of information
is power. This power equates to a competitive advantage and survival.
3. Time-variant: Data are stored to provide information from a historical perspective (e.g.,
the past 5-10 years). Every key structure in the data warehouse contains, either implicitly
or explicitly, an element of time.
4. Non-volatile: A data warehouse is always a physically separate store of data transformed
from the application data found in the operational environment. Due to this separation, a
data warehouse does not require transaction processing, recovery, and concurrency control
mechanisms. It usually requires only two operations in data accessing: initial loading of
data and access of data.
Example: A typical data warehouse is organised around major subjects, such as customer,
vendor, product, and sales rather than concentrating on the day-to-day operations and transaction
processing of an organization.
Many organisations are creating data warehouses to support business decision-making activities
for the following reasons:
1. To increase customer focus, which includes the analysis of customer buying patterns
(such as buying preference, buying time, budget cycles, and appetites for spending),
2. To reposition products and manage product portfolios by comparing the performance
of sales by quarter, by year, and by geographic regions, in order to fine-tune production
strategies,
3. To analyze operations and look for sources of profit,
4. To manage customer relationships, make environmental corrections, and manage
the cost of corporate assets, and
5. Data warehousing is also very useful from the point of view of heterogeneous database
integration. Many organisations typically collect diverse kinds of data and maintain
large databases from multiple, heterogeneous, autonomous, and distributed information
sources.
The first major stepping stone in understanding Data Warehousing is to grasp the concepts and
differences between the two overall database categories. The type most of us are used to dealing
with is the On Line Transactional Processing (OLTP) category. The other major category is On
Line Analytical Processing (OLAP).
OLTP is what we characterise as the ongoing day-to-day functional copy of the database. It is
where data is added and updated but never overwritten or deleted. The main needs of an OLTP
operational database are easily controlled insertion and updating of data, with efficient access
to data manipulation and viewing mechanisms. Typically only a single record or small record-sets
should be manipulated in a single operation in an OLTP-designed database. The main thrust
here is to avoid having the same data in different tables. This basic tenet of Relational Database
modeling is known as “normalising” data.
OLAP is a broad term that also encompasses data warehousing. In this model, data is stored
in a format which enables the efficient creation of data mining reports. OLAP design should
accommodate reporting on very large record sets with little degradation in operational efficiency.
The overall term used to describe taking data structures in an OLTP format and holding the same
data in an OLAP format is “Dimensional Modeling”. It is the primary building block of Data
Warehousing.
The major distinguishing features between OLTP and OLAP are summarised as follows.
You know that data warehouse queries are often complex. They involve the computation of large
groups of data at summarised levels and may require the use of special data organisation, access,
and implementation methods based on multidimensional views. Processing OLAP queries
in operational databases would substantially degrade the performance of operational tasks.
Moreover, an operational database supports the concurrent processing of several transactions as
well as recovery mechanisms such as locking and logging to ensure the consistency and robustness
of transactions. An OLAP query often needs read-only access of data records for summarisation
and aggregation. Concurrency control and recovery mechanisms, if applied for such OLAP
operations, may jeopardise the execution of concurrent transactions and thus substantially
reduce the throughput of an OLTP system.
2. Integrated: When data resides in many separate applications in the operational environment,
encoding of data is often inconsistent. For instance, in one application gender might be
coded as “m” and “f”, and in another by 0 and 1. When data are moved from the operational
environment into the data warehouse, they assume a consistent coding convention, e.g.
gender data is transformed to “m” and “f” (a minimal sketch of this step appears after this list).
3. Time variant: The data warehouse contains a place for storing data that are five to 10
years old, or older, to be used for comparisons, trends, and forecasting. These data are not
updated.
4. Non volatile: Data are not updated or changed in any way once they enter the data
warehouse, but are only loaded and accessed.
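A minimal Python sketch of the integration step in item 2 above, with invented source records and the assumption that the second source system encodes male as 0 and female as 1:

    # Hypothetical mapping from source-specific gender codes to the
    # warehouse convention ("m"/"f"); 0 = male and 1 = female is an assumption.
    GENDER_MAP = {"m": "m", "M": "m", "0": "m", "f": "f", "F": "f", "1": "f"}

    def harmonise_gender(raw_value):
        """Return the consistent warehouse code, or None for unknown codes."""
        return GENDER_MAP.get(str(raw_value).strip())

    records = [{"id": 1, "gender": "M"}, {"id": 2, "gender": 0}, {"id": 3, "gender": "f"}]
    cleaned = [dict(r, gender=harmonise_gender(r["gender"])) for r in records]
    print(cleaned)   # every gender value now follows the "m"/"f" convention

In a real load this mapping would be driven by metadata rather than hard-coded, and unmapped codes would be routed to a data-quality report.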
Task You know about a store room and a warehouse. What exactly is the difference
between a warehouse and a data warehouse? Explain with the help of a suitable
example.
Data Sources
Data sources refer to any electronic repository of information that contains data of interest for
management use or analytics. This definition covers mainframe databases (e.g. IBM DB2, ISAM,
Adabas, Teradata, etc.), client-server databases (e.g. Teradata, IBM DB2, Oracle database, Informix,
Microsoft SQL Server, etc.), PC databases (e.g. Microsoft Access, Alpha Five), spreadsheets (e.g.
Microsoft Excel) and any other electronic store of data. Data needs to be passed from these
systems to the data warehouse either on a transaction-by-transaction basis for real-time data
warehouses or on a regular cycle (e.g., daily or weekly) for offline data warehouses.
The data warehouse is normally (but does not have to be) a relational database. It must be
organized to hold information in a structure that best supports not only query and reporting, but
also advanced analysis techniques, like data mining. Most data warehouses hold information for
at least one year and can sometimes reach half a century, depending on the business/operations data
retention requirement. As a result, these databases can become very large.
Reporting
The data in the data warehouse must be available to the organisation’s staff if the data warehouse
is to be useful. There are a very large number of software applications that perform this function,
or reporting can be custom-developed. Examples of types of reporting tools include:
1. Business intelligence tools: These are software applications that simplify the process of
development and production of business reports based on data warehouse data.
2. Executive information systems (known more widely as business dashboards): These are
software applications that are used to display complex business metrics and information
in a graphical way to allow rapid understanding.
3. OLAP Tools: OLAP tools form data into logical multi-dimensional structures and allow
users to select which dimensions to view data by.
4. Data Mining: Data mining tools are software that allow users to perform detailed
mathematical and statistical calculations on detailed data warehouse data to detect trends,
identify patterns and analyze data.
Metadata
Metadata, or “data about data”, is used not only to inform operators and users of the data
warehouse about its status and the information held within the data warehouse, but also as
a means of integration of incoming data and a tool to update and refine the underlying DW
model.
Example: Data warehouse metadata include table and column names, their detailed
descriptions, their connection to business-meaningful names, the most recent data load date, the
business meaning of a data item, and the number of users currently logged in.
Operations
Optional Components
1. Data Marts: A data mart is a physical subset of the warehouse data, held either within the
main warehouse or as a separate copy on its own server (the latter allows extra analysis
without slowing down the main system). In either case, however, it is not the
organization’s official repository, the way a data warehouse is.
2. Logical Data Marts: A logical data mart is a filtered view of the main data warehouse but
does not physically exist as a separate data copy. This approach to data marts delivers the
same benefits but has the additional advantages of not requiring additional (costly) disk
space and of always being as current as the main data warehouse. The downside is
that logical data marts can have slower response times than physicalized ones.
3. Operational Data Store: An ODS is an integrated database of operational data. Its sources
include legacy systems, and it contains current or near-term data. An ODS may contain
30 to 60 days of information, while a data warehouse typically contains years of data.
ODSs are used in some data warehouse architectures to provide near-real-time reporting
capability in the event that the Data Warehouse’s loading time or architecture prevents it
from being able to provide near-real-time reporting capability.
The job of designing and implementing a data warehouse is a very challenging and difficult one,
even though, at the same time, there is a lot of focus and importance attached to it. The designer of
a data warehouse may be asked by the top management: “Take all the enterprise data and build a data
warehouse such that the management can get answers to all their questions.” This is a daunting
task, with the responsibility being visible and exciting. But how to get started? Where to start? Which
data should be put in first? Where is that data available? Which queries should be answered? How
would one bring down the scope of the project to something smaller and manageable, yet scalable
enough to gradually build it up into a comprehensive data warehouse environment?
The recent trend is to build data marts before a real large data warehouse is built. People want
something smaller, so as to get manageable results before proceeding to a real data warehouse.
Ralph Kimball identified a nine-step method as follows:
Step 1: Choose the subject matter (one at a time)
Step 2: Decide what the fact table represents
Step 3: Identify and conform the dimensions
Step 4: Choose the facts
Step 5: Store pre-calculations in the fact table
Step 6: Define the dimensions and tables
Step 7: Decide the duration of the database and periodicity of updation
Step 8: Track slowly the changing dimensions
Step 9: Decide the query priorities and the query modes.
All the above steps are required before the data warehouse is implemented. The final step,
step 10, is to implement a simple data warehouse or a data mart. The approach should be ‘from
simpler to complex’.
First, only a few data marts are identified, designed and implemented; a data warehouse
will then emerge gradually.
Let us discuss the above mentioned steps in detail. Interaction with the users is essential for
obtaining answers to many of the above questions. The users to be interviewed include top
management, middle management and executives, as well as operational users, in addition to the
sales force and marketing teams. A clear picture emerges from the entire project on data warehousing
as to what their problems are and how they can possibly be solved with the help of data
warehousing.
The priorities of the business issues can also be found. Similarly, interviewing the DBAs in the
organization will also give a clear picture as to which data sources hold clean, valid
and consistent data with an assured flow for several years.
Task Discuss the various factors that play a vital role in designing a good data warehouse.
2. A data warehouse can enhance business productivity since it is able to quickly and
efficiently gather information, which accurately describes the organization.
3. A data warehouse facilitates customer relationship marketing since it provides a consistent
view of customers and items across all lines of business, all departments, and all markets.
4. A data warehouse may bring about cost reduction by tracking trends, patterns, and
exceptions over long periods of time in a consistent and reliable manner.
5. A data warehouse provides a common data model for all data of interest, regardless of
the data’s source. This makes it easier to report and analyze information than it would be
if multiple data models from disparate sources were used to retrieve information such as
sales invoices, order receipts, general ledger charges, etc.
6. Because they are separate from operational systems, data warehouses provide retrieval of
data without slowing down operational systems.
Day-to-day warehouse operations include data refreshment, data source synchronisation,
planning for disaster recovery, managing access control and security, managing data growth,
managing database performance, and data warehouse enhancement and extension.
We describe here the physical implementation of an OLAP server in a Data Warehouse. There are
three different possible designs:
1. Relational OLAP (ROLAP)
2. Multidimensional OLAP (MOLAP)
3. Hybrid OLAP (HOLAP)
ROLAP
ROLAP stores the data based on the already familiar relational DBMS technology. In this case,
data and the related aggregations are stored in RDBMS, and OLAP middleware is used to
implement handling and exploration of data cubes. This architecture focuses on the optimisation
of the RDBMS back end and provides additional tools and services such as data cube navigation
logic. Due to the use of the RDBMS back end, the main advantage of ROLAP is scalability in
handling large data volumes.
Example: ROLAP engines include the commercial IBM Informix Metacube (www.ibm.
com) and the MicroStrategy DSS server (www.microstrategy.com), as well as the open-source
product Mondrian (mondrian.sourceforge.net).
MOLAP
In contrast to ROLAP, which uses tuples as the data storage unit, MOLAP uses a dedicated
n-dimensional array storage engine and OLAP middleware to manage data. Therefore, OLAP
queries are realised through a direct addressing to the related multidimensional views (data
cubes). Additionally, this architecture focuses on pre-calculation of the transactional data into
the aggregations, which results in fast query execution performance. More specifically, MOLAP
precalculates and stores aggregated measures at every hierarchy level at load time, and stores
and indexes these values for immediate retrieval. The full precalculation requires a substantial
amount of overhead, both in processing time and in storage space. For sparse data, MOLAP
uses sparse matrix compression algorithms to improve storage utilisation, and thus in general is
characterised by smaller on-disk size of data in comparison with data stored in RDBMS.
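As a rough illustration of this architecture, the Python sketch below (hypothetical sales figures, with NumPy standing in for a dedicated array storage engine) stores measures in a dense n-dimensional array and precomputes aggregates at load time, so that a query becomes direct array addressing:

    import numpy as np

    # Hypothetical sales measures indexed by (product, region, month):
    # the dense 3-dimensional array is the MOLAP storage unit.
    sales = np.random.default_rng(0).integers(0, 100, size=(4, 3, 12))

    # Precompute aggregations at load time, as a MOLAP engine would.
    by_product = sales.sum(axis=(1, 2))    # total sales per product
    by_region_month = sales.sum(axis=0)    # region x month summary

    # "Query": total sales of product 2, answered by direct addressing
    # rather than by scanning and grouping tuples as in ROLAP.
    print(by_product[2])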
HOLAP
To achieve a tradeoff between ROLAP’s scalability and MOLAP’s query performance, many
commercial OLAP servers are based on the HOLAP approach. In this case, the user decides which
portion of the data to store in the MOLAP and which in the ROLAP. For instance, often the low-
level data are stored using a relational database, while higher-level data, such as aggregations,
are stored in a separate MOLAP. An example product that supports all three architectures is
Microsoft’s OLAP Services (www.microsoft.com/), which is part of the company’s SQL Server.
A defining feature of data warehouses is that data is stored at its most elemental level for use in reporting and
information analysis.
Within this generic intent, there are two primary approaches to organising the data in a data
warehouse.
The first is using a “dimensional” approach. In this style, information is stored as “facts”
which are numeric or text data that capture specific data about a single transaction or event,
and “dimensions” which contain reference information that allows each transaction or event
to be classified in various ways. As an example, a sales transaction would be broken up into
facts such as the number of products ordered, and the price paid, and dimensions such as date,
customer, product, geographical location and salesperson. The main advantage of a dimensional
approach is that the Data Warehouse is easy for business staff with limited information
technology experience to understand and use. Also, because the data is pre-processed into the
dimensional form, the Data Warehouse tends to operate very quickly. The main disadvantage of
the dimensional approach is that it is quite difficult to add or change later if the company changes
the way in which it does business.
The second approach uses database normalization. In this style, the data in the data warehouse is
stored in third normal form. The main advantage of this approach is that it is quite straightforward
to add new information into the database. The primary disadvantage of this approach is that it can
be rather slow to produce information and reports.
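The dimensional approach described above can be sketched with Python’s built-in sqlite3 module; the table and column names here are invented for illustration, and a real schema would carry many more dimensions and attributes:

    import sqlite3

    con = sqlite3.connect(":memory:")
    cur = con.cursor()

    # Dimension tables hold the reference information; the fact table
    # holds the numeric measures of each sales transaction.
    cur.executescript("""
        CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INT, month INT);
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE fact_sales (
            date_id INT REFERENCES dim_date,
            product_id INT REFERENCES dim_product,
            quantity INT, price_paid REAL);
    """)
    cur.execute("INSERT INTO dim_date VALUES (1, 2011, 3)")
    cur.execute("INSERT INTO dim_product VALUES (10, 'Widget')")
    cur.execute("INSERT INTO fact_sales VALUES (1, 10, 5, 49.95)")

    # A typical warehouse query: aggregate the facts, classified by dimensions.
    cur.execute("""
        SELECT d.year, p.name, SUM(f.quantity), SUM(f.price_paid)
        FROM fact_sales f
        JOIN dim_date d ON f.date_id = d.date_id
        JOIN dim_product p ON f.product_id = p.product_id
        GROUP BY d.year, p.name""")
    print(cur.fetchall())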
Task A database keeps records in a particular format. What about a data
warehouse?
Case Study Fast Food
The Fast Food industry is highly competitive, one where a very small change in
operations can have a significant impact on the bottom line. For this reason, quick
access to comprehensive information for both standard and on-demand reporting
is essential.
Exclusive Ore designed and implemented a data warehouse and reporting structure to
address this requirement for Summerwood Corporation, a fast food franchisee operating
approximately 80 Taco Bell and Kentucky Fried Chicken restaurants in and around
Philadelphia. The Summerwood Data Warehouse now provides strategic and tactical
decision support to all levels of management within Summerwood.
The data warehouse is implemented in Microsoft SQL Server 2000, and incorporates data
from two principal sources:
1. Daily sales information automatically polled by the TACO system DePol utility.
2. Period based accounting information from the Dynamics (Microsoft Great Plains)
accounting database.
This data is automatically refreshed periodically (or on-demand if required) and is
maintained historically over several years for comparative purposes.
For reporting and analysis purposes, the data in the warehouse is processed into OLAP
Cubes. The cubes are accessed through Excel by using BusinessQuery MD. Data can be
analyzed (sliced and diced) by store, by company, by zone and area, by accounting year,
quarter and period (as far back as 1996), and by brand and concept. The available cubes
and some example analyses are shown below. While each represents an area of analytical
focus, cross cube analysis is also possible.
1. PL Cube. Contains Profit & Loss, Cash Flow and EBITDA statements for Summerwood.
Amounts can be viewed for any period as a period, quarter-to-date, year-to-date, or
rolling 13 period amount, and can be compared to either of two budgets, compared
to the corresponding period from the prior year, or as a percent of sales.
2. BS Cube. Contains the Balance Sheet for Summerwood. Balances can be viewed as
of any period, and can be compared to the preceding period or the corresponding
period in the prior year.
3. SalesMix Cube. Contains daily sales of all menu items in all stores. In addition to
the standard analysis parameters, this data can also be sliced and diced by brand, by
item category or by menu item, by calendar year, month and week, and by pricing
tier. This cube can be used to compute sales amounts and counts, costs and variance
from list price.
4. SalesDayPart Cube. Contains sales amounts and counts at 15-minute intervals.
In addition to the standard analysis parameters, the data in this cube can also be
analyzed by calendar year, month and week, and by eight-hour, four-hour, two-hour,
one-hour and 15-minute intervals, or by specific meal (e.g., lunch, dinner,
breakfast, between-meals, etc.).
5. SalesOps Cube. Contains daily sales summary for each store. In addition to the
standard analysis parameters, this data can also be sliced and diced by a comparable
indicator, by calendar year, month and week, and by pricing tier. Gross sales, taxable
sales, non-tax sales, manual over/under, deletions, labor, cash over/short, deposits
and average check are available.
Many amounts can be viewed optionally as variances, as a percent of sales, or summarized
as week-to-date, period-to-date, year-to-date, or rolling 52-week amounts.
6. ReportCard Cube. Contains the daily report card amounts. Some of these are also in the
SalesOps cube. In addition, the Report Card contains speed-of-service and peak-hour information.
The data structure implemented for Summerwood allows them to maintain several distinct
organizational structures in order to properly represent each store in (1) the corporate
structure, i.e. the subsidiary to which they belong, (2) the operations structure, i.e. the zone/
area and (3) the concept structure, i.e., KFC, TB-PH (a Taco Bell – Pizza Hut combination
restaurant), etc.
The Summerwood data warehouse and the resulting OLAP cubes permit investigation
along any of these corporate hierarchies – i.e., by operating company, by zone or area, or by
brand or concept. This permits comparisons between concepts, say, or of all stores within
a concept. Similarly, it is easy to do area-to-area comparisons, or zone-to-zone comparisons,
or to view the performance of all stores within an area.
The data warehouse also supports a time dimension based on the 13 period calendar under
which Summerwood operates. This calendar has been built into the warehouse, permitting
easy comparison of any period to the prior period or to the same period in a prior year.
Besides comparing at the period level, comparisons and trends can be done at quarterly
or annual levels. Lower level examination is also possible, e.g., comparing week-to-week
or even day-to-day.
The PL and BS cubes contain the full Profit and Loss and Balance Sheet statements for each
period during the last five years (65 periods in all), down to the account level. This makes
it easy for Summerwood to evaluate trends in any expense category, comparing store-to-
store, period-to-period, zone-to-zone, or concept-to-concept.
The SalesOps and SalesMix cubes are updated overnight and contain up-to-the-minute
(through yesterday) information captured by the cash registers (POS) in each store. This
enables managers to evaluate and compare trends in speed of service, labor usage, over/
under rings, employee food purchases, etc., by store, zone, area, concept, subsidiary, etc.
Because sales and counts are recorded in 15-minute intervals, called day parts, managers
can use this to find strange sales patterns, possibly suggestive of employee theft, during
the midnight hours.
1.9 Summary
• Data warehousing is the consolidation of data from disparate data sources into a single
target database to be utilized for analysis and reporting purposes.
• The primary goal of data warehousing is to analyze the data for business intelligence
purposes. For example, an insurance company might create a data warehouse to capture
policy data for catastrophe exposure.
• The data is sourced from front-end systems that capture the policy information into the
data warehouse.
• The data might then be analyzed to identify windstorm exposures in coastal areas prone to
hurricanes and determine whether the insurance company is overexposed.
• The goal is to utilize the existing information to make accurate business decisions.
1.10 Keywords
Data Sources: Data sources refer to any electronic repository of information that contains data of
interest for management use or analytics.
Data Warehouse Architecture: It is a description of the elements and services of the warehouse,
with details showing how the components will fit together and how the system will grow over
time.
Data Warehouse: It is a relational database that is designed for query and analysis rather than
for transaction processing.
Job Control: This includes job definition, job scheduling (time and event), monitoring, logging,
exception handling, error handling, and notification.
Metadata: Metadata, or “data about data”, is used not only to inform operators and users of the
data warehouse about its status and the information held within it, but also as a means of
integration of incoming data.
1. (a) 2. (b)
3. (d) 4. (c)
5. virtual 6. data mart
7. ROLAP 8. business area
9. decision makers 10. relational database
11. dependent data mart
Books
A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.
Alex Berson, Data Warehousing, Data Mining and OLAP, Tata McGraw Hill, 1997.
Alex Berson, Stephen J. Smith, Data Warehousing, Data Mining & OLAP, Tata
McGraw Hill Publications, 2004.
Alex Freitas and Simon Lavington, Mining Very Large Databases with Parallel
Processing, Kluwer Academic Publishers, 1998.
J. Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers,
1993.
Jiawei Han, Micheline Kamber, Data Mining – Concepts and Techniques, Morgan
Kaufmann Publishers, First Edition, 2003.
Matthias Jarke, Maurizio Lenzerini, Yannis Vassiliou, Panos Vassiliadis,
Fundamentals of Data Warehouses, Springer.
Michael Berry and Gordon Linoff, Data Mining Techniques (For Marketing, Sales,
and Customer Support), John Wiley & Sons, 1997.
Michael J. A. Berry, Gordon S Linoff, Data Mining Techniques, Wiley Publishing
Inc, Second Edition, 2004.
Sam Anahory, Dennis Murray, Data Warehousing in the Real World, Addison
Wesley, First Edition, 2000.
Sholom M. Weiss and Nitin Indurkhya, Predictive Data Mining: A Practical Guide,
Morgan Kaufmann Publishers, 1998.
Sushmita Mitra, Tinku Acharya, Data Mining – Multimedia, Soft Computing and
Bioinformatics, John Wiley & Sons, 2003.
Usama Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy
Uthurasamy, Advances in Knowledge Discovery and Data Mining, AAAI Press/ The
MIT Press, 1996.
V. Cherkassky and F. Mulier, Learning From Data, John Wiley & Sons, 1998.
Contents
Objectives
Introduction
2.1 Motivation for Data Mining: Why is it Important?
2.2 What is Data Mining?
2.3 Definition of Data Mining
2.4 Architecture of Data Mining
2.5 How Data Mining Works?
2.6 Data Mining – On What Kind of Data?
2.6.1 Flat Files
2.6.2 Relational Databases
2.6.3 Data Warehouses
2.6.4 Data Cube
2.6.5 Transaction Database
2.6.6 Advanced Database Systems and Advanced Database Applications
2.7 Data Mining Functionalities — What Kinds of Patterns can be Mined?
2.8 A Classification of Data Mining Systems
2.9 Advantages of Data Mining
2.10 Disadvantages of Data Mining
2.11 Ethical Issues of Data Mining
2.11.1 Consumers’ Point of View
2.11.2 Organizations’ Point of View
2.11.3 Government Point of View
2.11.4 Society’s Point of View
2.12 Analysis of Ethical Issues
2.13 Global Issues of Data Mining
2.14 Summary
2.15 Keywords
2.16 Self Assessment
2.17 Review Questions
2.18 Further Readings
Objectives
Introduction
This unit provides an introduction to the multidisciplinary field of data mining. It discusses
the evolutionary path of database technology, which led up to the need for data mining, and
the importance of its application potential. The basic architecture of data mining systems is
described, and a brief introduction to the concepts of database systems and data warehouses is
given. A detailed classification of data mining tasks is presented, based on the different kinds
of knowledge to be mined. A classification of data mining systems is presented, and major
challenges in the field are discussed.
With the increased and widespread use of technologies, interest in data mining has increased
rapidly. Companies now utilize data mining techniques to examine their databases looking
for trends, relationships, and outcomes to enhance their overall operations and discover new
patterns that may allow them to better serve their customers. Data mining provides numerous
benefits to businesses, government, society as well as individual persons. However, like many
technologies, there are negative effects caused by data mining, such as invasion of privacy
rights. In addition, the ethical and global issues regarding the use of data mining will also be
discussed.
Such knowledge can be applied to decision-making, process control, information management, and
query processing. Therefore, data mining is considered one of the most important frontiers in
database and information systems and one of the most promising interdisciplinary developments
in information technology.
For many years, statistics have been used to analyze data in an effort to find correlations, patterns,
and dependencies. However, with advances in technology, more and more data have become available,
greatly exceeding the human capacity to manually analyze them. Before the 1990s, data
collected by bankers, credit card companies, department stores and so on saw little use. But
in recent years, as computational power has increased, the idea of data mining has emerged. Data
mining is a term used to describe the “process of discovering patterns and trends in large data
sets in order to find useful decision-making information.” With data mining, the information
obtained from the bankers, credit card companies, and department stores can be put to good
use.
6. User interface: This module communicates between users and the data mining system,
allowing the user to interact with the system by specifying a data mining query or task,
providing information to help focus the search, and performing exploratory data mining
based on the intermediate data mining results. In addition, this component allows the
user to browse database and data warehouse schemas or data structures, evaluate mined
patterns, and visualise the patterns in different forms.
Note Data mining involves an integration of techniques from multiple disciplines such as
database and data warehouse technology, statistics, machine learning, high-performance
computing, pattern recognition, neural networks, data visualisation, information retrieval,
image and signal processing, and spatial or temporal data analysis. In this book the
emphasis is placed on the database perspective, focusing on efficient and scalable data
mining techniques.
For an algorithm to be scalable, its running time should grow approximately linearly in proportion
to the size of the data, given the available system resources such as main memory and disk
space.
Task Data mining is a term used to describe the “process of discovering patterns and
trends in large data sets in order to find useful decision-making information.” Discuss.
Flat files are actually the most common data source for data mining algorithms, especially at the
research level. Flat files are simple data files in text or binary format with a structure known by
the data mining algorithm to be applied. The data in these files can be transactions, time-series
data, scientific measurements, etc.
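Since the file structure is known to the mining algorithm rather than described by the file itself, reading a flat file is typically only a few lines of code. A Python sketch, with a made-up comma-separated layout of transactions:

    import csv, io

    # A made-up flat file: one transaction per line (date, customer id, amount).
    raw = io.StringIO("2006-10-12,C12345,19.99\n2006-10-13,C99999,5.50\n")

    # The mining algorithm supplies the structure: column order and types.
    transactions = [(date, cust, float(amount))
                    for date, cust, amount in csv.reader(raw)]
    print(transactions)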
Example: Data mining systems can analyse customer data for a company to predict the
credit risk of new customers based on their income, age, and previous credit information. Data
mining systems may also detect deviations, such as items whose sales are far from those expected
in comparison with the previous year.
A data warehouse is a repository of information collected from multiple sources, stored under a
unified schema, and that usually resides at a single site. Data warehouses are constructed via a
process of data cleaning, data integration, data transformation, data loading, and periodic data
refreshing. Figure 2.3 shows the typical framework for construction and use of a data warehouse
for a manufacturing company.
To facilitate decision making, the data in a data warehouse are organised around major subjects,
such as customer, item, supplier, and activity. The data are stored to provide information from
a historical perspective (such as from the past 5-10 years) and are typically summarised. For
example, rather than storing the details of each sales transaction, the data warehouse may store
a summary of the transactions per item type for each store or, summarised to a higher level, for
each sales region.
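A toy Python sketch of this summarisation (transactions invented), collapsing detailed sales into the per-store, per-item-type totals a warehouse would store:

    from collections import defaultdict

    # Hypothetical detailed sales transactions: (store, item_type, amount).
    detail = [("S1", "snacks", 3.0), ("S1", "snacks", 4.5), ("S2", "drinks", 2.0)]

    # The warehouse keeps the summary rather than every transaction:
    # total sales per item type for each store.
    summary = defaultdict(float)
    for store, item_type, amount in detail:
        summary[(store, item_type)] += amount
    print(dict(summary))   # {('S1', 'snacks'): 7.5, ('S2', 'drinks'): 2.0}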
The data cube has a few alternative names or a few variants, such as, “multidimensional
databases,” “materialised views,” and “OLAP (On-Line Analytical Processing).” The general
idea of the approach is to materialise certain expensive computations that are frequently
queried, especially those involving aggregate functions, such as count, sum, average, max, etc.,
and to store such materialised views in a multi-dimensional database (called a “data cube”) for
decision support, knowledge discovery, and many other applications. Aggregate functions can
be precomputed according to the grouping by different sets or subsets of attributes. Values in
each attribute may also be grouped into a hierarchy or a lattice structure.
Example: “Date” can be grouped into “day”, “month”, “quarter”, “year” or “week”,
which forms a lattice structure.
Generalisation and specialisation can be performed on a multiple dimensional data cube by
“roll-up” or “drill-down” operations, where a roll-up operation reduces the number of dimensions
in a data cube or generalises attribute values to high-level concepts, whereas a drill-down
operation does the reverse. Since many aggregate functions may often need to be computed
repeatedly in data analysis, the storage of precomputed results in a multiple dimensional data
cube may ensure fast response time and flexible views of data from different angles and at
different abstraction levels.
Figure 2.4: Eight Views of Data Cubes for Sales Information
For example, a relation with the schema “sales(part; supplier; customer; sale price)” can be
materialised into a set of eight views as shown in Figure 2.4, where psc indicates a view consisting
of aggregate function values (such as total sales) computed by grouping three attributes part,
supplier, and customer, p indicates a view consisting of the corresponding aggregate function
values computed by grouping part alone, etc.
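These eight views form a lattice of group-bys that can be enumerated mechanically. A small Python sketch over invented sales rows:

    from itertools import combinations

    # Invented rows of sales(part, supplier, customer, sale_price).
    rows = [("p1", "s1", "c1", 10.0), ("p1", "s2", "c1", 20.0), ("p2", "s1", "c2", 5.0)]
    dims = ("part", "supplier", "customer")

    # Materialise all 2^3 = 8 views, one per subset of the grouping attributes.
    views = {}
    for k in range(len(dims) + 1):
        for subset in combinations(range(len(dims)), k):
            view = {}
            for row in rows:
                key = tuple(row[i] for i in subset)
                view[key] = view.get(key, 0.0) + row[3]   # total sales
            views[tuple(dims[i] for i in subset)] = view

    print(views[("part",)])   # the "p" view: totals grouped by part alone
    print(views[()])          # the apex view: the grand total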
A transaction database is a set of records representing transactions, each with a time stamp,
an identifier and a set of items. Associated with the transaction files could also be descriptive
data for the items. For example, in the case of the video store, the rentals table such as shown in
Figure 2.5, represents the transaction database. Each record is a rental contract with a customer
identifier, a date, and the list of items rented (i.e. video tapes, games, VCR, etc.). Since relational
databases do not allow nested tables (i.e. a set as attribute value), transactions are usually stored
in flat files or stored in two normalised transaction tables, one for the transactions and one for
the transaction items. One typical data mining analysis on such data is the so-called market
basket analysis or association rules in which associations between items occurring together or in
sequence are studied.
Figure 2.5: Fragment of a Transaction Database for the Rentals at Our Video Store

Rental
Transaction ID    Date        Time     Customer ID    Item ID
T12345            10/12/06    10:40    C12345         11000
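The two-table layout mentioned above can be sketched in Python with invented rentals, splitting the nested item lists off into a separate transaction-items table:

    # Nested rentals: each transaction carries a list of rented items.
    rentals = [("T12345", "10/12/06", "C12345", ["11000", "11021"]),
               ("T12346", "10/12/06", "C54321", ["11000"])]

    # Relational databases do not allow a set-valued attribute, so the
    # data is normalised into a transactions table and an items table.
    transactions = [(tid, date, cust) for tid, date, cust, _items in rentals]
    transaction_items = [(tid, item)
                         for tid, _date, _cust, items in rentals
                         for item in items]
    print(transactions)
    print(transaction_items)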
With the advances of database technology, various kinds of advanced database systems
have emerged to address the requirements of new database applications. The new database
applications include handling multimedia data, spatial data, World-Wide Web data and the
engineering design data. These applications require efficient data structures and scalable methods
for handling complex object structures, variable length records, semi-structured or unstructured
data, text and multimedia data, and database schemas with complex structures and dynamic
changes. Such database systems may raise many challenging research and implementation issues
for data mining, and hence are discussed briefly as follows:
Multimedia Databases
Multimedia databases include video, images, audio and text media. They can be stored on
extended object-relational or object-oriented databases, or simply on a file system. Multimedia is
characterised by its high dimensionality, which makes data mining even more challenging. Data
mining from multimedia repositories may require computer vision, computer graphics, image
interpretation, and natural language processing methodologies.
Spatial Databases
Spatial databases are databases that, in addition to usual data, store geographical information
like maps, and global or regional positioning. Such spatial databases present new challenges to
data mining algorithms.
Time-series Databases
Time-series databases contain time-related data such as stock market data or logged activities.
These databases usually have a continuous flow of new data coming in, which sometimes causes
the need for a challenging real time analysis. Data mining in such databases commonly includes
the study of trends and correlations between evolutions of different variables, as well as the
prediction of trends and movements of the variables in time.
Worldwide Web
The Worldwide Web is the most heterogeneous and dynamic repository available. A very large
number of authors and publishers are continuously contributing to its growth and metamorphosis,
and a massive number of users are accessing its resources daily. Data in the Worldwide Web
is organised in inter-connected documents. These documents can be text, audio, video, raw
data, and even applications. Conceptually, the Worldwide Web is comprised of three major
components: The content of the Web, which encompasses documents available; the structure of
the Web, which covers the hyperlinks and the relationships between documents; and the usage of
the web, describing how and when the resources are accessed. A fourth dimension can be added
relating the dynamic nature or evolution of the documents. Data mining in the Worldwide Web,
or web mining, tries to address all these issues and is often divided into web content mining, web
structure mining and web usage mining.
Database technology has evolved in parallel with the software that supports business applications,
in which relatively simple operations are performed on large volumes of data with
uniform structure. The engineering world, on the other hand, is full of computationally intensive,
logically complex applications requiring sophisticated representations. Recent developments in
database technology emphasise the need to provide general-purpose support for the type of
functions involved in the engineering process such as the design of buildings, system components,
or integrated circuits etc.
Characterisation
Example: One may want to characterise the OurVideoStore customers who regularly
rent more than 30 movies a year. With concept hierarchies on the attributes describing the
target class, the attribute-oriented induction method can be used, for example, to carry out data
summarisation. Note that with a data cube containing summarisation of data, simple OLAP
operations fit the purpose of data characterisation.
Discrimination
Data discrimination produces what are called discriminant rules and is basically the comparison
of the general features of objects between two classes referred to as the target class and the
contrasting class.
Example: One may want to compare the general characteristics of the customers who
rented more than 30 movies in the last year with those whose rental account is lower than 5.
The techniques used for data discrimination are very similar to the techniques used for data
characterisation with the exception that data discrimination results include comparative
measures.
Association Analysis
Association analysis is based on the association rules. It studies the frequency of items occurring
together in transactional databases, and based on a threshold called support, identifies the
frequent item sets. Another threshold, confidence, which is the conditional probability that an
item appears in a transaction when another item appears, is used to pinpoint association rules.
Association analysis is commonly used for market basket analysis.
Example: It could be useful for the OurVideoStore manager to know what movies are
often rented together or if there is a relationship between renting a certain type of movies and
buying popcorn or pop. The discovered association rules are of the form: P→Q [s, c], where P
and Q are conjunctions of attribute value-pairs, and s (for support) is the probability that P and
Q appear together in a transaction and c (for confidence) is the conditional probability that Q
appears in a transaction when P is present. For example, the hypothetic association rule
Rent Type(X, “game”) ^ Age(X, “13-19”) → Buys(X, “pop”) [s = 2%, c = 55%]
would indicate that 2% of the transactions considered are of customers aged between 13 and
19 who are renting a game and buying a pop, and that there is a certainty of 55% that teenage
customers, who rent a game, also buy pop.
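The s and c values can be computed directly from the transaction list. A toy Python sketch with invented transactions from the video store:

    # Invented transactions: the set of items in each customer visit.
    transactions = [{"game", "pop"}, {"game", "pop", "chips"},
                    {"movie"}, {"game"}, {"movie", "pop"}]

    def support(itemset):
        """Fraction of transactions containing every item in the itemset."""
        return sum(itemset <= t for t in transactions) / len(transactions)

    # Rule game -> pop: s = P(game and pop), c = P(pop | game).
    s = support({"game", "pop"})
    c = s / support({"game"})
    print(f"s = {s:.0%}, c = {c:.0%}")   # s = 40%, c = 67%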
Classification
Classification is the process of finding a set of models (or functions) that describe and
distinguish data classes or concepts, for the purposes of being able to use the model to predict
the class of objects whose class label is unknown. The derived model is based on the analysis of
a set of training data (i.e., data objects whose class label is known). The derived model may be
represented in various forms, such as classification (IF-THEN) rules, decision trees, mathematical
formulae, or neural networks. For example, after starting a credit policy, the Our Video Store
managers could analyse the customers’ behaviors vis-à-vis their credit, and label accordingly the
customers who received credits with three possible labels “safe”, “risky” and “very risky”. The
classification analysis would generate a model that could be used to either accept or reject credit
requests in the future.
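The derived model could be as simple as a handful of IF-THEN rules. The toy Python sketch below (the attributes and thresholds are invented) shows the kind of rules such an analysis might produce and how they predict the label of a customer whose class is unknown:

    def credit_label(income, late_payments):
        """Toy IF-THEN rules of the kind a classification analysis derives."""
        if late_payments == 0 and income > 50000:
            return "safe"
        if late_payments <= 2:
            return "risky"
        return "very risky"

    # Predict the class of a new credit applicant.
    print(credit_label(income=62000, late_payments=0))   # safe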
Prediction
Classification can be used for predicting the class label of data objects. There are two major types
of predictions: one can either try to predict (1) some unavailable data values or pending trends
and (2) a class label for some data. The latter is tied to classification. Once a classification model
is built based on a training set, the class label of an object can be foreseen based on the attribute
values of the object and the attribute values of the classes. Prediction, however, more often
refers to the forecast of missing numerical values, or increase/decrease trends in time-related
data. The major idea is to use a large number of past values to estimate probable future values.
Clustering
Example: For a data set with two attributes: AGE and HEIGHT, the following rule
represents most of the data assigned to cluster 10:
If AGE >= 25 and AGE <= 40 and HEIGHT >= 5.0ft and HEIGHT <= 5.5ft then CLUSTER = 10
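Clustering groups records by similarity without any predefined class labels. A minimal k-means loop over invented (AGE, HEIGHT) records, sketched with NumPy:

    import numpy as np

    # Invented (AGE, HEIGHT) records; no class labels are given.
    X = np.array([[27.0, 5.2], [33.0, 5.4], [38.0, 5.1], [61.0, 6.0], [65.0, 5.9]])

    # k-means with k = 2: assign each point to the nearest centroid,
    # then move each centroid to the mean of its assigned points.
    centroids = X[:2].copy()
    for _ in range(10):
        dists = ((X[:, None, :] - centroids) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(2)])
    print(labels)   # e.g. [0 0 0 1 1]: younger vs. older customers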
Evolution and deviation analysis pertain to the study of time-related data that changes over time.
Evolution analysis models evolutionary trends in data, which allows for characterising,
comparing, classifying or clustering of time-related data. For example, suppose that you have
the major stock market (time-series) data of the last several years available from the New York
Stock Exchange and you would like to invest in shares of high-tech industrial companies. A data
mining study of stock exchange data may identify stock evolution regularities for overall stocks
and for the stocks of particular companies. Such regularities may help predict future trends in
stock market prices, contributing to your decision-making regarding stock investment.
Deviation analysis, on the other hand, considers differences between measured values and
expected values, and attempts to find the cause of the deviations from the anticipated values.
For example, a decrease in total demand of CDs for rent at Video library for the last month, in
comparison to that of the same month of the last year, is a deviation pattern. Having detected a
significant deviation, a data mining system may go further and attempt to explain the detected
pattern (e.g., were fewer new comedy movies released this year in comparison to the same
period last year?).
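A sketch of this comparison in Python, with invented monthly rental counts, flagging measured values that deviate from the expected (prior-year) values beyond a threshold:

    # Expected values: rentals in the same months of the prior year.
    last_year = {"Jan": 1200, "Feb": 1150, "Mar": 1300}
    this_year = {"Jan": 1180, "Feb": 790, "Mar": 1320}

    for month, expected in last_year.items():
        measured = this_year[month]
        deviation = (measured - expected) / expected
        if abs(deviation) > 0.15:            # the threshold is an assumption
            print(f"{month}: {deviation:+.0%} deviation worth explaining")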
Marketing/Retailing
Data mining can aid direct marketers by providing them with useful and accurate trends about
their customers’ purchasing behavior. Based on these trends, marketers can direct their marketing
attention to their customers with more precision. For example, marketers of a software company
may advertise their new software to consumers who have a lot of software purchasing
history. In addition, data mining may also help marketers in predicting which products their
customers may be interested in buying. Through this prediction, marketers can surprise their
customers and make the customer’s shopping experience a pleasant one.
Retail stores can also benefit from data mining in similar ways. For example, through the trends
provided by data mining, the store managers can arrange shelves, stock certain items, or provide
a certain discount that will attract their customers.
Banking/Crediting
Data mining can assist financial institutions in areas such as credit reporting and loan information.
For example, by examining previous customers with similar attributes, a bank can estimate
the level of risk associated with each given loan. In addition, data mining can also assist credit
card issuers in detecting potentially fraudulent credit card transactions. Although the data mining
technique is not 100% accurate in its prediction about fraudulent charges, it does help the credit
card issuers reduce their losses.
Law Enforcement
Data mining can aid law enforcers in identifying criminal suspects as well as apprehending these
criminals by examining trends in location, crime type, habit, and other patterns of behaviors.
Researchers
Data mining can assist researchers by speeding up their data analysis process, thus allowing
them more time to work on other projects.
Privacy Issues
Personal privacy has always been a major concern in this country. In recent years, with the
widespread use of the Internet, concerns about privacy have increased tremendously. Because
of the privacy issues, some people do not shop on the Internet. They are afraid that somebody may
have access to their personal information and then use that information in an unethical way, thus
causing them harm.
Although it is against the law to sell or trade personal information between different organizations,
the selling of personal information has occurred. For example, according to the Washington Post, in 1998,
CVS had sold their patients’ prescription purchases to a different company. In addition, American
Express also sold their customers’ credit card purchases to another company. What CVS and
American Express did clearly violated privacy law because they were selling personal information
without the consent of their customers. The selling of personal information may also bring harm
to these customers because you do not know what the other companies are planning to do with
the personal information that they have purchased.
Security Issues
Although companies have a lot of personal information about us available online, they do not
have sufficient security systems in place to protect that information. For example, recently the
Ford Motor credit company had to inform 13,000 consumers that their personal information,
including Social Security number, address, account number and payment history, was accessed
by hackers who broke into a database belonging to the Experian credit reporting agency. This
incident illustrates that companies are willing to disclose and share your personal information,
but they are not taking care of the information properly. With so much personal information
available, identity theft could become a real problem.
Trends obtained through data mining, intended to be used for marketing or some other
ethical purposes, may be misused. Unethical businesses or people may use the information
obtained through data mining to take advantage of vulnerable people or to discriminate against
a certain group of people. In addition, the data mining technique is not 100 percent accurate; thus
mistakes do happen, and they can have serious consequences.
According to consumers, data mining benefits businesses more than it benefits them.
Consumers may benefit from data mining by having companies customize their products and
services to fit the consumers’ individual needs. However, the consumers’ privacy may be lost as
a result of data mining.
Data mining is a major way that companies can invade the consumers’ privacy. Consumers are
surprised at how much companies know about their personal lives. For example, companies may
know your name, address, birthday, and personal information about your family such as how
many children you have. They may also know what medications you take, what kind of music
you listen to, and what your favorite books or movies are. The list goes on and on. Consumers
are afraid that these companies may misuse their information, or not have enough security
to protect their personal information from unauthorized access. For example, the incident
involving the hackers in the Ford Motor company case illustrates how poor companies are
at protecting their customers’ personal information. Companies are making profits from their
customers’ personal data, but they do not want to spend a large amount of money to design a
sophisticated security system to protect that data. At least half of the Internet users interviewed by
Statistical Research, Inc. claimed that they were very concerned about the misuse of credit card
information given online, the selling or sharing of personal information by different web sites,
and cookies that track consumers’ Internet activity.
Data mining that allows companies to identify their best customers could just as easily be used
by unscrupulous businesses to attack vulnerable customers such as the elderly, the poor, the
sick, or unsophisticated people. These unscrupulous businesses could use the information
unethically by offering these vulnerable people inferior deals. For example, suppose Mrs. Smith’s husband
was diagnosed with colon cancer, and the doctor predicted that he is going to die soon. Mrs.
Smith was worried and depressed. Suppose through Mrs. Smith’s participation in a chat room
or mailing list, someone predicts that either she or someone close to her has a terminal illness.
Through this prediction, Mrs. Smith might start receiving email from strangers stating
that they know a cure for colon cancer, but that it will cost her a lot of money. Mrs. Smith, who
desperately wants to save her husband, may fall into their trap. This hypothetical example
illustrates how unethical it is for somebody to use data obtained through data mining to
target vulnerable persons who are desperately hoping for a miracle.
Data mining can also be used to discriminate against a certain group of people in the population.
For example, if through data mining a certain group of people were determined to carry a high risk
for a deadly disease (e.g., HIV, cancer), then an insurance company may refuse to sell insurance
policies to them based on this information. The insurance company’s action is not only unethical,
but may also have a severe impact on our health care system as well as the individuals involved.
If these high-risk people cannot buy insurance, they may die sooner than expected because they
cannot afford to go to the doctor as often as they should. In addition, the government may have
to step in and provide insurance coverage for those people, which would drive up health care
costs.
Data mining is not a flawless process, thus mistakes are bound to happen. For example, one
person’s file may get mismatched with another person’s file. In today’s world, where we rely heavily
on computers for information, a mistake generated by a computer could have serious
consequences. One may ask: is it ethical for someone with a good credit history to get rejected for a
loan application because his/her credit history got mismatched with that of someone else bearing the
same name and a bankruptcy profile? The answer is “NO”, because this individual did not do
anything wrong. However, it may take a while for this person to get his file straightened out. In
the meantime, he or she just has to live with the mistake generated by the computer. Companies
might say that this is an unfortunate mistake and move on, but to this individual this mistake
can ruin his/her life.
Data mining is a dream come true for businesses, because it helps enhance their overall
operations and discover new patterns that may allow companies to better serve their customers.
Through data mining, financial and insurance companies are able to detect patterns of fraudulent
credit card usage, identify the behavior patterns of high-risk customers, and analyze claims. Data
mining helps these companies minimize their risk and increase their profits. Since companies are
able to minimize their risk, they may be able to charge customers lower interest rates or lower
premiums. Companies argue that data mining is beneficial to everyone because some of the
benefits they obtain through data mining will be passed on to consumers.
Data mining also allows marketing companies to target their customers more effectively, thus
reducing their need for mass advertising. As a result, the companies can pass on their savings
to the consumers. According to Michael Turner, an executive director of the Direct Marketing
Association, "Detailed consumer information lets apparel retailers market their products to
consumers with more precision. But if privacy rules impose restrictions and barriers to data
collection, those limitations could increase the prices consumers pay when they buy from catalog
or online apparel retailers by 3.5% to 11%".
When it comes to privacy issues, organizations say that they are doing everything they
can to protect their customers' personal information, and that they only use consumer data
for ethical purposes such as marketing and detecting credit card fraud. To ensure that
personal information is used in an ethical way, CIO (Chief Information Officer) Magazine
has put together a list of what it calls the Six Commandments of Ethical Data Management. The
six commandments include:
1. Data is a valuable corporate asset and should be managed as such, like cash, facilities or
any other corporate asset;
2. The CIO is the steward of corporate data and is responsible for managing it over its life cycle
(from its generation to its appropriate destruction);
3. The CIO is responsible for controlling access to and use of data, as determined by
governmental regulation and corporate policy;
4. The CIO is responsible for preventing inappropriate destruction of data;
5. The CIO is responsible for bringing technological knowledge to the development of data
management practices and policies;
6. The CIO should partner with executive peers to develop and execute the organization’s
data management policies.
Since data mining is not a perfect process, mistakes such as mismatched information do occur.
Companies and organizations are aware of this issue and try to deal with it. According to Agrawal,
an IBM researcher, data obtained through mining is only associated with a 5 to 10 percent loss
in accuracy. With continuous improvement in data mining techniques, however, this inaccuracy
should decrease significantly.
The government is in a dilemma when it comes to data mining practices. On one hand, the
government wants access to people's personal data so that it can tighten the security
system and protect the public from terrorists; on the other hand, the government wants to
protect people's privacy rights. The government recognizes the value of data mining to
society, and thus wants businesses to use consumers' personal information in an ethical way.
According to the government, it is against the law for companies and organizations to trade, for
money, data they have collected or data collected by another organization. In order to protect
people's privacy rights, the government wants to create laws to monitor data mining practices.
However, it is extremely difficult to monitor such disparate resources as servers, databases, and
web sites. In addition, the Internet is global, which makes it tremendously difficult for governments
to enforce such laws.
Data mining can aid law enforcers in identifying and apprehending criminal suspects.
Data mining can help reduce the amount of time and effort that law enforcers
have to spend on any one particular case, thus allowing them to deal with more problems.
Hopefully, this would make the country a safer place. In addition, data mining may
also help reduce terrorist acts by allowing government officers to identify and locate potential
terrorists early, thus preventing another incident like the World Trade Center tragedy from
occurring on American soil.
Data mining can also benefit society by allowing researchers to collect and analyze data more
efficiently. For example, it took researchers more than a decade to complete the Human Genome
Project; with data mining, similar projects could be completed in a shorter amount of time.
Data mining may be an important tool that aids researchers in their search for new medications,
biological agents, or gene therapies that could cure deadly diseases such as cancer or AIDS.
Task Do you think data mining provides help in marketing and research? Discuss
with an example.
According to the FBI, the majority of the hackers are from Russia and Ukraine. It is therefore
not surprising to find that the increase in fraudulent credit card usage in Russia and Ukraine
corresponds to the increase in domestic credit card theft.
After the hackers gain access to the consumer data, they usually notify the victim companies
of the intrusion or theft, and either directly demand money or offer to patch the system for
a certain amount of money. In some cases, when the companies refuse to make the payments or to
hire them to fix the system, the hackers have released the credit card information they previously
obtained onto the Web. For example, a group of hackers in Russia attacked and stole about
55,000 credit card numbers from the merchant card processor CreditCards.com. The hackers
blackmailed the company for $100,000, but the company refused to pay. As a result, the hackers
posted almost half of the stolen credit card numbers onto the Web. The consumers whose card
numbers were stolen incurred unauthorized charges from a Russian-based site. A similar problem
happened to CDUniverse.com in December 1999. In this case, a Russian teenager hacked into
the CDUniverse.com site and stole about 300,000 credit card numbers. This teenager, like the group
mentioned above, also demanded $100,000 from CDUniverse.com. CDUniverse.com refused to
pay, and again the customers' credit card numbers were released onto the Web, on a site called
the Maxus credit card pipeline.
Besides hacking into e-commerce companies to steal their data, some hackers hack into a
company just for fun or just to show off their skills.
Example: A group of hackers called "BugBear" hacked into a security consulting
company's website. The hackers did not steal any data from this site; instead, they left a message
saying, "It was fun and easy to break into your box."
Besides the above cases, the FBI says that in 2001 alone, about 40 companies located in 20 different
states had their computer systems accessed by hackers. Since hackers can hack
into US e-commerce companies, they can hack into any company worldwide. Hackers
can have a tremendous impact on online businesses because they scare consumers away from
purchasing online. Major hacking incidents like the two mentioned above illustrate that
companies do not have sufficient security systems to protect customer data. More effort is
needed from the companies as well as from governments to tighten security against these hackers.
Since the Internet is global, efforts from different governments worldwide are needed. Different
countries need to join hands and work together to protect the privacy of their people.
As a management information systems department, our charge, simply stated, is to
answer questions. We answer questions reactively when needed and proactively
when possible. Being able to turn data into information, information into action
and action into value is our reason for being. As technology changes, industry competition
tightens, and our clients become increasingly computer savvy, our department must
rapidly seek new technology and methods to meet the needs of our wide and varied list
of clients.
Beyond the online mainframe systems, printouts, and electronic file attachments, we
desire to provide internet based, intelligent, integrated systems that give our end users the
best management information systems in the world. By using formal data warehousing
practices, tools and methodologies, state of the art data extraction, transformation and
summarization tools and thin client application deployment, we want to move beyond
“data reporting” to “data mining.”
According to the authors of Data Mining Techniques for Marketing, Sales and Customer
Support, “to really achieve its promise, data mining needs to become an essential business
process, incorporated into other processes, including marketing, sales, customer support,
product design and inventory control. The ‘virtuous cycle’ incorporates data mining into
the larger context of other business processes. It focuses on action based discovery and not
the discovery mechanism itself.”
To this end, MIS is developing a customized process to re-engineer existing MIS applications
into a data warehousing environment where significant improvements and benefits for
end users and the corporation can be realized. The process is founded in accepted data
warehousing principles using an iterative rapid application development methodology,
which is reusable across systems, functions and business solutions.
Data Warehousing
To successfully engage data mining in our processes, the first step is to know who our
customers are. We are able to list them by name, job title, function, and business unit, and
communicate with them regularly.
Next we must be able to identify the appropriate business opportunities. In MIS, our
priorities are based on business needs as articulated to us by our clients through ad hoc
requests and project management meetings and processes. Constant communication,
integration and feedback are required to ensure we are investing our resources in proper
ways.
Once having identified our customer base and business cases, we must be able to transform
data into useful information. Transforming and presenting data as information is our
primary function in the corporation. We are constantly looking for new and improved
ways to accomplish this directive. The latest evolution in efficiently transforming and
presenting data is formal data warehousing practices with browser based front ends.
Source data is crucial to data quality and mining efforts. As each new online transactional
system and database platform is introduced, the complexity of our tasks increases. "Using
operational data presents many challenges to integrators and analysts such as bad data
formats, confusing data fields, lack of functionality, legal ramifications, organizational
factors, reluctance to change, and conflicting timelines (Berry, 25).” Also, the more disparate
the input data sources, the more complicated the integration.
A clear definition of the business need is also required to ensure the accuracy of the end
results. Defining a logical view of the data needed to supply the correct information,
independent of source data restraints, is necessary. Here clients and analysts get the
opportunity to discuss their business needs and solutions proactively.
Next, a mapping from the physical source data to the logical view is required and usually
involves some compromises from the logical view due to physical data constraints. Then
questions about presentation can begin to be answered. Who needs it? How often? In what
format? What technology is available?
The first iteration of our SAS Data Warehousing solution accesses five operational systems
existing on six data platforms. In addition to printed reports the users expect, the data
warehouse is also accessible through MDDB OLAP technology over the intranet. Users
can now ask and answer their own questions, enabling the creativity needed for successful
data mining. With help from the SAS System, we are busily integrating additional data,
accessing more data platforms and streamlining our processes.
zz The information and knowledge gained can be used for applications ranging from business
management, production control, and market analysis, to engineering design and science
exploration.
zz Data mining can be viewed as a result of the natural evolution of information technology.
zz An evolutionary path has been witnessed in the database industry in the development of
data collection and database creation, data management and data analysis functionalities.
zz Data mining refers to extracting or “mining” knowledge from large amounts of data.
Some other terms like knowledge mining from data, knowledge extraction, data/pattern
analysis, data archaeology, and data dredging are also used for data mining.
zz Knowledge discovery is a process consisting of an iterative sequence of data cleaning,
data integration, data selection, data transformation, data mining, pattern evaluation and
knowledge presentation.
2.15 Keywords
Data cleaning: To remove noise and inconsistent data.
Data integration: Multiple data sources may be combined.
Data mining: It refers to extracting or “mining” knowledge from large amounts of data.
Data selection: Data relevant to the analysis task are retrieved from the database.
Data transformation: Where data are transformed or consolidated into forms appropriate for
mining by performing summary or aggregation operations.
KDD: Knowledge Discovery in Databases; many people treat data mining as a synonym for this
popularly used term.
Knowledge presentation: Visualisation and knowledge representation techniques are used to
present the mined knowledge to the user.
Pattern evaluation: To identify the truly interesting patterns representing knowledge based on
some interestingness measures.
1. (b) 2. (c)
3. relational database 4. periodic data refreshing
5. transaction database 6. Multimedia
7. stock market data 8. characterization
9. Association analysis 10. unsupervised
11. Data mining
Books A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.
Alex Berson, Data Warehousing Data Mining and OLAP, Tata McGraw Hill, 1997
Alex Berson, Stephen J. Smith, Data Warehousing, Data Mining & OLAP, Tata
McGraw Hill Publications, 2004.
Alex Freitas and Simon Lavington, Mining Very Large Databases with Parallel
Processing, Kluwer Academic Publishers, 1998.
J. Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers,
1993.
Jiawei Han, Micheline Kamber, Data Mining – Concepts and Techniques, Morgan
Kaufmann Publishers, First Edition, 2003.
Matthias Jarke, Maurizio Lenzerini, Yannis Vassiliou, Panos Vassiliadis,
Fundamentals of Data Warehouses, Springer.
Michael Berry and Gordon Linoff, Data Mining Techniques (For Marketing, Sales,
and Customer Support), John Wiley & Sons, 1997.
Michael J. A. Berry, Gordon S Linoff, Data Mining Techniques, Wiley Publishing
Inc, Second Edition, 2004.
Sam Anohory, Dennis Murray, Data Warehousing in the Real World, Addison
Wesley, First Edition, 2000.
Sholom M. Weiss and Nitin Indurkhya, “Predictive Data Mining: A Practical Guide”,
Morgan Kaufmann Publishers, 1998.
Sushmita Mitra, Tinku Acharya, Data Mining – Multimedia, Soft Computing and
Bioinformatics, John Wiley & Sons, 2003.
Usama Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy
Uthurasamy, Advances in Knowledge Discovery and Data Mining, AAAI Press/ The
MIT Press, 1996.
V. Cherkassky and F. Mulier, Learning From Data, John Wiley & Sons, 1998.
Contents
Objectives
Introduction
3.1 Statistical Perspective on Data Mining
3.2 What is Statistics and why is Statistics needed?
3.3 Similarity Measures
3.3.1 Introduction
3.3.2 Motivation
3.3.3 Classic Similarity Measures
3.3.4 Dice
3.3.5 Overlap
3.4 Decision Trees
3.5 Neural Networks
3.6 Genetic Algorithms
3.7 Application of Genetic Algorithms in Data Mining
3.8 Summary
3.9 Keywords
3.10 Self Assessment
3.11 Review Questions
3.12 Further Readings
Objectives
After studying this unit, you will be able to:
zz Know data mining techniques
zz Describe statistical perspectives on data mining
zz Explain decision trees
Introduction
Data mining, the extraction of hidden predictive information from large databases, is a powerful
new technology with great potential to help companies focus on the most important information
in their data warehouses. Data mining tools predict future trends and behaviors, allowing
businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses
offered by data mining move beyond the analyses of past events provided by retrospective
tools typical of decision support systems. Data mining tools can answer business questions that
traditionally were too time consuming to resolve. They scour databases for hidden patterns,
finding predictive information that experts may miss because it lies outside their expectations.
Most companies already collect and refine massive quantities of data. Data mining techniques
can be implemented rapidly on existing software and hardware platforms to enhance the value of
existing information resources, and can be integrated with new products and systems as they are
brought on-line. When implemented on high performance client/server or parallel processing
computers, data mining tools can analyze massive databases to deliver answers to questions such
as, “Which clients are most likely to respond to my next promotional mailing, and why?”
Figure 3.1: The Statistical Thinking Process based on Data in Constructing Statistical Models
for Decision-making under Uncertainties
3.3 Similarity Measures
Similarity measures provide the framework on which many data mining decisions are based.
Tasks such as classification and clustering usually assume the existence of some similarity
measure, while fields with poor methods to compute similarity often find that searching data
is a cumbersome task. Several classic similarity measures are discussed below, and the application
of similarity measures to other fields is addressed.
3.3.1 Introduction
The goal of information retrieval (IR) systems is to meet users' needs. In practical terms, a need
is usually manifested in the form of a short textual query entered in the text box of some online
search engine. IR systems typically do not directly answer a query; instead, they present a
ranked list of documents that are judged relevant to that query by some similarity measure. Since
similarity measures have the effect of clustering and classifying information with respect to a
query, users will commonly find new interpretations of their information need that may or may
not be useful to them when reformulating their query. In the case when the query is a document
from the initial collection, similarity measures can be used to cluster and classify documents
within a collection. In short, similarity measures can add a rudimentary structure to a previously
unstructured collection.
3.3.2 Motivation
Similarity measures used in IR systems can distort one’s perception of the entire data set. For
example, if a user types a query into a search engine and does not find a satisfactory answer in
the top ten returned web pages, then he/she will usually try to reformulate his/her query once
or twice. If a satisfactory answer is still not returned, then the user will often assume that one
does not exist. Rarely does a user understand or care what ranking scheme a particular search
engine employs.
An understanding of similarity measures, however, is crucial in today's business world. Many
business decisions are based on answers to questions that are posed in a way similar to how
queries are given to search engines. Data miners do not have the luxury of assuming that the
answers given to them by a database or IR system are correct or all-inclusive; they must know
the drawbacks of any similarity measure used and adjust their business decisions accordingly.
3.3.3 Classic Similarity Measures
A similarity measure is defined as a mapping from a pair of tuples of size k to a scalar number.
By convention, all similarity measures should map to the range [-1, 1] or [0, 1], where a similarity
score of 1 indicates maximum similarity. A similarity measure should exhibit the property that
its value increases as the number of common properties in the two items being compared
increases.
A popular model in many IR applications is the vector-space model, where documents are
represented by a vector of size n, where n is the size of the dictionary. Thus, document i is
represented by a vector di = (w1i, …, wni), where wki denotes the weight associated with term k
in document i. In the simplest case, wki is the frequency of occurrence of term k in document i.
Queries are formed by creating a pseudo-document vector q of size n, where wkq is assumed to
be non-zero if and only if term k occurs in the query.
Given two similarity scores sim(q, di) = s1 and sim(q, dj) = s2, s1 > s2 means that document i is
judged more relevant than document j to query q. Since similarity measures are pairwise measures,
the values of s1 and s2 do not imply a relationship between documents i and j themselves.
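As a rough illustration of the vector-space model (a hypothetical sketch; the documents, the
dictionary and the raw term-frequency weights are invented for the example), a query and two
documents can be compared with a simple inner-product similarity:

    # Minimal vector-space sketch: w_ki is the raw frequency of term k in document i.
    from collections import Counter

    def tf_vector(text, dictionary):
        counts = Counter(text.lower().split())
        return [counts[term] for term in dictionary]

    def sim(q, d):
        return sum(wq * wd for wq, wd in zip(q, d))  # simple inner product

    dictionary = ["data", "mining", "warehouse", "query"]
    d1 = tf_vector("data mining finds patterns in data", dictionary)
    d2 = tf_vector("the warehouse stores query results", dictionary)
    q = tf_vector("data mining", dictionary)

    print(sim(q, d1) > sim(q, d2))  # True: document 1 is judged more relevant to q

Here s1 = sim(q, d1) exceeds s2 = sim(q, d2), so document 1 would be ranked above document 2
for this query; nothing is implied about the similarity of the two documents to each other.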
From a set-theoretic standpoint, assume that a universe Ω exists from which subsets A, B are
generated. From the IR perspective, Ω is the dictionary while A and B are documents, with
A usually representing the query. Some similarity measures are more easily visualized via
set-theoretic notation.
As a simple measure, |A∩B| denotes the number of shared index terms. However, this Simple
coefficient takes no information about the sizes of A and B into account. The Simple coefficient is
analogous to the binary weighting scheme in IR that can be thought of as the frequency of term
co-occurrence with respect to two documents. Although the Simple coefficient is technically a
similarity measure, it is rarely used because it ignores the relative sizes of A and B.
Most similarity measures are themselves evaluated by precision and recall. Let A denote the set
of retrieved documents and B denote the set of relevant documents. Define precision and recall as

P(A, B) = |A ∩ B| / |A|

and

R(A, B) = |A ∩ B| / |B|

respectively. Informally, precision is the ratio of returned relevant documents to the total
number of documents returned, while recall is the ratio of returned relevant documents to the
total number of relevant documents. Precision is often evaluated at varying levels of recall
(namely, i = 1, …, |B|) to produce a precision-recall graph. Ideally, IR systems generate high
precision at all levels of recall. In practice, however, most systems exhibit lower precision values
at higher levels of recall.
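As a small worked sketch (the document identifiers are hypothetical, for illustration only),
precision and recall for a single query follow directly from these set definitions:

    # A = retrieved documents, B = relevant documents
    retrieved = {1, 2, 3, 4, 5}
    relevant = {2, 3, 5, 8, 9, 10}

    hits = retrieved & relevant              # |A intersection B|
    precision = len(hits) / len(retrieved)   # 3/5 = 0.6
    recall = len(hits) / len(relevant)       # 3/6 = 0.5

A system could trivially achieve perfect recall by returning every document, which is why
precision is evaluated alongside it.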
While the different notation styles may not yield exactly the same numeric values for each pair of
items, the ordering of the items within a set is preserved.
3.3.4 Dice
The Dice coefficient is a generalization of the harmonic mean of the precision and recall measures.
A system with a high harmonic mean should theoretically be closer to an ideal retrieval system in
that it can achieve high precision values at high levels of recall. The harmonic mean for precision
and recall is given by

E = 2 / (1/P + 1/R)

The Dice coefficient generalizes this by weighting the two set sizes:

sim(q, dj) = D(A, B) = |A ∩ B| / (α|A| + (1 − α)|B|)
           ≅ Σ_{k=1}^{n} w_kq w_kj / (α Σ_{k=1}^{n} w_kq² + (1 − α) Σ_{k=1}^{n} w_kj²)

with α ∈ [0, 1]. To show that the Dice coefficient is a weighted harmonic mean, let α = ½.
3.3.5 Overlap
As its name implies, the Overlap coefficient attempts to determine the degree to which two sets
overlap. The Overlap coefficient is computed as

sim(di, dj) = O(A, B) = |A ∩ B| / min(|A|, |B|)
            ≅ Σ_{k=1}^{n} w_kq w_kj / min(Σ_{k=1}^{n} w_kq², Σ_{k=1}^{n} w_kj²)
The Overlap coefficient is sometimes calculated using the max operator in place of the min.
Note The denominator does not necessarily normalize the similarity values produced by
this measure. As a result, the Overlap values are typically higher in magnitude than other
similarity measures.
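A compact sketch of the set forms of the Dice and Overlap coefficients (the index-term sets here
are hypothetical, chosen only to show the difference in magnitude mentioned in the Note):

    def dice(a, b):
        return 2 * len(a & b) / (len(a) + len(b))

    def overlap(a, b):
        return len(a & b) / min(len(a), len(b))

    A = {"data", "mining", "olap"}
    B = {"data", "mining", "warehouse", "etl"}

    print(dice(A, B))     # 2*2/(3+4) = 4/7, about 0.571
    print(overlap(A, B))  # 2/min(3,4) = 2/3, about 0.667 (higher, as the Note says)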
Task “Statistics is mathematics but it’s very useful in data mining.” Discuss
3.5 Neural Networks
The first step is to design a specific network architecture (that includes a specific number of
"layers", each consisting of a certain number of "neurons"). The size and structure of the network
need to match the nature (e.g., the formal complexity) of the investigated phenomenon. Because
the latter is obviously not known very well at this early stage, this task is not easy and often
involves multiple "trials and errors."
The new network is then subjected to the process of “training.” In that phase, neurons apply
an iterative process to the number of inputs (variables) to adjust the weights of the network in
order to optimally predict (in traditional terms one could say, find a “fit” to) the sample data on
which the “training” is performed. After the phase of learning from an existing data set, the new
network is ready and it can then be used to generate predictions.
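The training phase can be illustrated with a deliberately tiny sketch: a single logistic neuron
whose weights are adjusted iteratively to fit a sample data set (here the logical OR function; the
learning rate and epoch count are arbitrary choices, and a real network would contain layers of
such neurons):

    import math, random

    data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]  # sample data (OR)
    w, b, rate = [random.uniform(-1, 1), random.uniform(-1, 1)], 0.0, 0.5

    def output(x):
        return 1 / (1 + math.exp(-(w[0] * x[0] + w[1] * x[1] + b)))

    for _ in range(1000):                # iterative "training" process
        for x, target in data:
            err = target - output(x)     # prediction error on this sample
            w = [wi + rate * err * xi for wi, xi in zip(w, x)]
            b += rate * err              # adjust weights toward a better fit

    print([round(output(x)) for x, _ in data])  # [0, 1, 1, 1] after training

After learning from the existing data set, the adjusted weights are what the network uses to
generate predictions on new inputs.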
Neural networks have seen an explosion of interest over the last few years, and are being
successfully applied across an extraordinary range of problem domains, in areas as diverse as
finance, medicine, engineering, geology and physics. Indeed, anywhere that there are problems
of prediction, classification or control, neural networks are being introduced. This sweeping
success can be attributed to a few key factors:
1. Power: Neural networks are very sophisticated modeling techniques capable of modeling
extremely complex functions. In particular, neural networks are nonlinear (a term which
is discussed in more detail later in this section). For many years linear modeling has been
the commonly used technique in most modeling domains since linear models have well-
known optimization strategies. Where the linear approximation was not valid (which was
frequently the case) the models suffered accordingly. Neural networks also keep in check
the curse of dimensionality problem that bedevils attempts to model nonlinear functions
with large numbers of variables.
2. Ease of use: Neural networks learn by example. The neural network user gathers
representative data, and then invokes training algorithms to automatically learn the
structure of the data. Although the user does need to have some heuristic knowledge of
how to select and prepare data, how to select an appropriate neural network, and how
to interpret the results, the level of user knowledge needed to successfully apply neural
networks is much lower than would be the case using (for example) some more traditional
nonlinear statistical methods.
Neural networks are also intuitively appealing, based as they are on a crude low-level model of
biological neural systems. In the future, the development of this neurobiological modeling may
lead to genuinely intelligent computers.
Neural networks are applicable in virtually every situation in which a relationship between the
predictor variables (independents, inputs) and predicted variables (dependents, outputs) exists,
even when that relationship is very complex and not easy to articulate in the usual terms of
“correlations” or “differences between groups.” A few representative examples of problems to
which neural network analysis has been applied successfully are:
1. Detection of medical phenomena: A variety of health-related indices (e.g., a combination
of heart rate, levels of various substances in the blood, respiration rate) can be monitored.
The onset of a particular medical condition could be associated with a very complex (e.g.,
nonlinear and interactive) combination of changes on a subset of the variables being
monitored. Neural networks have been used to recognize this predictive pattern so that
the appropriate treatment can be prescribed.
2. Stock market prediction: Fluctuations of stock prices and stock indices are another
example of a complex, multidimensional, but in some circumstances at least partially-
deterministic phenomenon. Neural networks are being used by many technical analysts
to make predictions about stock prices based upon a large number of factors such as past
performance of other stocks and various economic indicators.
3. Credit assignment: A variety of pieces of information are usually known about an applicant
for a loan. For instance, the applicant’s age, education, occupation, and many other facts
may be available. After training a neural network on historical data, neural network
analysis can identify the most relevant characteristics and use those to classify applicants
as good or bad credit risks.
4. Monitoring the condition of machinery: Neural networks can be instrumental in cutting costs
by bringing additional expertise to scheduling the preventive maintenance of machines.
A neural network can be trained to distinguish between the sounds a machine makes when
it is running normally (“false alarms”) versus when it is on the verge of a problem. After
this training period, the expertise of the network can be used to warn a technician of an
upcoming breakdown, before it occurs and causes costly unforeseen “downtime.”
5. Engine management: Neural networks have been used to analyze the input of sensors from
an engine. The neural network controls the various parameters within which the engine
functions, in order to achieve a particular goal, such as minimizing fuel consumption.
Genetic algorithms are very easy to develop and to validate, which makes them highly attractive
where they apply. The algorithm is parallel, meaning that it can be applied to large populations
efficiently. The algorithm is also efficient in that if it begins with a poor original solution, it can
rapidly progress to good solutions. The use of mutation makes the method capable of identifying
global optima even in very nonlinear problem domains. The method does not require knowledge
about the distribution of the data.
Genetic algorithms require mapping data sets to a form where attributes have discrete values for
the genetic algorithm to work with. This is usually possible, but can lose a great deal of detail
information when dealing with continuous variables. Coding the data into categorical form can
unintentionally lead to biases in the data.
There are also limits to the size of data set that can be analyzed with genetic algorithms. For very
large data sets, sampling will be necessary, which leads to different results across different runs
over the same data set.
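A minimal genetic algorithm sketch (the toy problem and the parameter values are invented
purely for illustration): a population of bit strings evolves, via selection, one-point crossover and
occasional mutation, toward strings with the largest number of 1-bits:

    import random

    def fitness(s):
        return sum(s)                          # toy objective: count of 1-bits

    pop = [[random.randint(0, 1) for _ in range(20)] for _ in range(30)]
    for _ in range(50):                        # generations
        pop.sort(key=fitness, reverse=True)
        parents = pop[:10]                     # keep the fittest individuals
        children = []
        while len(children) < 20:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, 20)
            child = a[:cut] + b[cut:]          # one-point crossover
            if random.random() < 0.1:          # mutation helps escape local optima
                i = random.randrange(20)
                child[i] = 1 - child[i]
            children.append(child)
        pop = parents + children

    print(fitness(max(pop, key=fitness)))      # typically 20, the global optimum

Because each generation involves random sampling, repeated runs over the same data can give
different results, which is the behaviour described above for very large data sets.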
Notes
To scope, define & design a structured & systemized architecture for capturing &
reporting Business & Financial performance metrics for APJCC:
reporting Business & Financial performance metrics for APJCC:
1. Evaluate current processes & methodology for capturing & reporting Business &
financial performance metrics
2. Scope framework for systemized data capture & reporting of Business & financial
performance metrics
3. Key Areas identified: (1) ABC (2) OPEX (3) Revenue (4) Profitability Analysis
4. Define processes, systems & tools for systemized data capture & reporting of Business
& financial performance metrics
5. Design systems & tools for systemized data capture & reporting of Business &
financial performance metrics
To scope, define & design framework for building a Customer Information Datamart
Architecture for APJCC
1. Evaluate current processes & systems for capturing all customer contact information
for APJCC
2. Scope framework for systemized data capture of customer intelligence data
3. Scope for pilot: KL Hub (for actual data capture, UAT & roll-out), but framework
must incorporate APJCC perspective
4. Define data definitions, DWH structure, data capture processes, business logics &
system rules, applications & tools for Datamart.
5. Design & implement
3.8 Summary
zz In this unit, you learnt about the data mining technique. Data Mining is an analytic process
designed to explore data (usually large amounts of data - typically business or market
related) in search of consistent patterns and/or systematic relationships between variables,
and then to validate the findings by applying the detected patterns to new subsets of
data.
zz The ultimate goal of data mining is prediction - and predictive data mining is the most
common type of data mining and one that has the most direct business applications.
zz The process of data mining consists of three stages: (1) the initial exploration, (2) model
building or pattern identification with validation/verification, and (3) deployment (i.e., the
application of the model to new data in order to generate predictions).
zz In this unit you also learnt about a statistical perspective on data mining, similarity measures,
decision trees, neural networks and genetic algorithms.
3.9 Keywords
Decision Tree: A decision tree is a structure that can be used to divide up a large collection
of records into successively smaller sets of records by applying a sequence of simple decision
rules.
Dice: The dice coefficient is a generalization of the harmonic mean of the precision and recall
measures.
Genetic Algorithms: Genetic algorithms are mathematical procedures utilizing the process of
genetic inheritance.
Similarity Measures: Similarity measures provide the framework on which many data mining
decisions are based.
1. Statistics 2. Data
3. Similarity measures 4. information retrieval (IR)
5. Overlap coefficient 6. Neural networks
7. Genetic algorithms 8. True
9. False 10. True
11. True
Books A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.
Alex Berson, Data Warehousing Data Mining and OLAP, Tata McGraw Hill, 1997
Alex Berson, Stephen J. Smith, Data Warehousing, Data Mining & OLAP, Tata
McGraw Hill Publications, 2004.
Alex Freitas and Simon Lavington, Mining Very Large Databases with Parallel
Processing, Kluwer Academic Publishers, 1998.
J. Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers,
1993.
Jiawei Han, Micheline Kamber, Data Mining – Concepts and Techniques, Morgan
Kaufmann Publishers, First Edition, 2003.
Matthias Jarke, Maurizio Lenzerini, Yannis Vassiliou, Panos Vassiliadis,
Fundamentals of Data Warehouses, Springer.
Michael Berry and Gordon Linoff, Data Mining Techniques (For Marketing, Sales,
and Customer Support), John Wiley & Sons, 1997.
Michael J. A. Berry, Gordon S Linoff, Data Mining Techniques, Wiley Publishing
Inc, Second Edition, 2004.
Sam Anohory, Dennis Murray, Data Warehousing in the Real World, Addison
Wesley, First Edition, 2000.
Sholom M. Weiss and Nitin Indurkhya, “Predictive Data Mining: A Practical Guide”,
Morgan Kaufmann Publishers, 1998.
Sushmita Mitra, Tinku Acharya, Data Mining – Multimedia, Soft Computing and
Bioinformatics, John Wiley & Sons, 2003.
Usama Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy
Uthurasamy, Advances in Knowledge Discovery and Data Mining, AAAI Press/ The
MIT Press, 1996.
V. Cherkassky and F. Mulier, Learning From Data, John Wiley & Sons, 1998.
Contents
Objectives
Introduction
4.1 What is Classification and Prediction?
4.1.1 Classification
4.1.2 Prediction
4.2 Issues regarding Classification and Prediction
4.3 Statistical based Algorithms
4.4 Naive Bayesian Classification
4.5 Distance-based Algorithms
4.6 Distance Functions
4.7 Classification by Decision Tree
4.7.1 Basic Algorithm for Learning Decision Trees
4.7.2 Decision Tree Induction
4.7.3 Tree Pruning
4.7.4 Extracting Classification Rules from Decision Trees
4.8 Neural Network based Algorithms
4.9 Rule-based Algorithms
4.10 Combining Techniques
4.11 Summary
4.12 Keywords
4.13 Self Assessment
4.14 Review Questions
4.15 Further Readings
Objectives
After studying this unit, you will be able to:
zz Describe the concept of data mining classification
zz Discuss basic knowledge of different classification techniques
zz Explain rule based algorithms
Introduction
Classification is a data mining (machine learning) technique used to predict group membership
for data instances.
Example: You may wish to use classification to predict whether the weather on a particular
day will be “sunny”, “rainy” or “cloudy”. Popular classification techniques include decision trees
and neural networks.
Data classification is a two step process. In the first step, a model is built describing a predetermined
set of data classes or concepts. The model is constructed by analyzing database tuples described
by attributes. Each tuple is assumed to belong to a predefined class, as determined by one of
the attributes, called the class label attribute. In the context of classification, data tuples are
also referred to as samples, examples, or objects. The data tuples analyzed to build the model
collectively form the training data set. The individual tuples making up the training set are
referred to as training samples and are randomly selected from the sample population.
Since the class label of each training sample is provided, this step is also known as supervised
learning (i.e., the learning of the model is ‘supervised’ in that it is told to which class each training
sample belongs). It contrasts with unsupervised learning (or clustering), in which the class labels
of the training samples are not known, and the number or set of classes to be learned may not be
known in advance.
Typically, the learned model is represented in the form of classification rules, decision trees,
or mathematical formulae. For example, given a database of customer credit information,
classification rules can be learned to identify customers as having either excellent or fair credit
ratings (Figure 4.1a). The rules can be used to categorize future data samples, as well as provide a
better understanding of the database contents. In the second step (Figure 4.1b), the model is used
for classification. First, the predictive accuracy of the model (or classifier) is estimated.
The holdout method is a simple technique which uses a test set of class-labeled samples. These
samples are randomly selected and are independent of the training samples. The accuracy of a
model on a given test set is the percentage of test set samples that are correctly classified by the
model. For each test sample, the known class label is compared with the learned model’s class
prediction for that sample. Note that if the accuracy of the model were estimated based on the
training data set, this estimate could be optimistic, since the learned model tends to overfit the
data (that is, it may have incorporated some particular anomalies of the training data which are
not present in the overall sample population). Therefore, a test set is used.
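A sketch of the holdout accuracy computation (the labels and predictions below are
hypothetical): accuracy is simply the fraction of independent test samples whose known class
label matches the model's prediction:

    test_labels = ["excellent", "fair", "fair", "excellent", "fair"]
    predictions = ["excellent", "fair", "excellent", "excellent", "fair"]

    correct = sum(t == p for t, p in zip(test_labels, predictions))
    accuracy = correct / len(test_labels)   # 4/5 = 0.8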
Figure 4.1: (a) Learning: Training data are analyzed by a classification algorithm. Here, the class
label attribute is credit rating, and the learned model or classifier is represented in the form of
classification rules. (b) Classification: Test data are used to estimate the accuracy of the
classification rules. If the accuracy is considered acceptable, the rules can be applied to the
classification of new data tuples.
4.1.1 Classification
Classification is a data mining technique used to predict group membership for data instances.
Example: You may wish to use classification to predict whether the weather on a
particular day will be “sunny”, “rainy” or “cloudy”. Popular classification techniques include
decision trees and neural networks.
In classification, there is a target categorical variable, such as income bracket, which, for example,
could be partitioned into three classes or categories: high income, middle income, and low income.
The data mining model examines a large set of records, each record containing information on
the target variable as well as a set of input or predictor variables. For example, consider the
excerpt from a data set shown in Table 4.1.
Table 4.1: Excerpt from Data Set for Classifying Income
Suppose that the researcher would like to be able to classify the income brackets of persons not
currently in the database, based on other characteristics associated with that person, such as
age, gender, and occupation. This task is a classification task, very nicely suited to data mining
methods and techniques. The algorithm would proceed roughly as follows. First, examine the
data set containing both the predictor variables and the (already classified) target variable, income
bracket. In this way, the algorithm (software) “learns about” which combinations of variables are
associated with which income brackets. For example, older females may be associated with the
high-income bracket. This data set is called the training set. Then the algorithm would look at new
records, for which no information about income bracket is available. Based on the classifications
in the training set, the algorithm would assign classifications to the new records. For example, a
63-year-old female professor might be classified in the high-income bracket.
Examples of classification tasks in business and research include:
1. Determining whether a particular credit card transaction is fraudulent
2. Placing a new student into a particular track with regard to special needs
3. Assessing whether a mortgage application is a good or bad credit risk
4. Diagnosing whether a particular disease is present
5. Determining whether a will was written by the actual deceased, or fraudulently by someone
else
6. Identifying whether or not certain financial or personal behavior indicates a possible
terrorist threat
The learning of the model is 'supervised' in that it is told to which class each training sample
belongs. This contrasts with unsupervised learning (or clustering), in which the class labels of the
training samples are not known, and the number or set of classes to be learned may not be known
in advance.
Typically, the learned model is represented in the form of classification rules, decision trees, or
mathematical formulae.
4.1.2 Prediction
Prediction is similar to classification, except that for prediction, the results lie in the future.
Examples of prediction tasks in business and research include:
1. Predicting the price of a stock three months into the future
2. Predicting the percentage increase in traffic deaths next year if the speed limit is increased
3. Predicting the winner of this fall’s baseball World Series, based on a comparison of team
statistics
4. Predicting whether a particular molecule in drug discovery will lead to a profitable new
drug for a pharmaceutical company.
Any of the methods and techniques used for classification may also be used, under appropriate
circumstances, for prediction. These include the traditional statistical methods of point estimation
and confidence interval estimations, simple linear regression and correlation, and multiple
regression.
4.4 Naive Bayesian Classification
Naive Bayesian classifiers assume that the effect of an attribute value on a given class
is independent of the values of the other attributes. This assumption is called class conditional
independence. It is made to simplify the computations involved and, in this sense, is considered
"naive". Bayesian belief networks are graphical models which, unlike naive Bayesian classifiers,
allow the representation of dependencies among subsets of attributes.
Applying Bayes' rule, with c the class and {v} the observed attribute values:

P(c | {v}) = P({v} | c) P(c) / P({v})

and, for each class ck,

P(ck | {v}) = P(ck) P({v} | ck) / P({v})

where the evidence is P({v}) = Σ_k P(ck) P({v} | ck), so that Σ_k P(ck | {v}) = 1.

MAP vs. ML

Rather than computing the full posterior, the computation can be simplified if one is only
interested in classification:

1. ML (Maximum Likelihood) Hypothesis: assume all hypotheses are equiprobable a priori and
simply maximize the data likelihood:

c_ML = arg max_{c ∈ C} P({v} | c)
Bayes' Theorem
Bayes' theorem relates the conditional and marginal probabilities of events A and B, where B has
a non-vanishing probability:

P(A | B) = P(B | A) P(A) / P(B)
Use training examples to estimate the class-conditional probability density functions for
white-blood cell count (W). [Figure: the estimated densities P(W | Di), plotted for W values
ranging from 20 to 60.]
In other words, Naïve Bayes classifiers assume that the effect of a variable's value on a given
class is independent of the values of the other variables. This assumption is called class conditional
independence. It is made to simplify the computation and, in this sense, is considered to be
"Naïve".
This is a fairly strong assumption and is often not applicable. However, bias
in estimating probabilities often may not make a difference in practice: it is the order of the
probabilities, not their exact values, that determines the classifications.
The Naive Bayes Classifier technique is based on the so-called Bayesian theorem and is particularly
suited when the dimensionality of the inputs is high. Despite its simplicity, Naive Bayes can often
outperform more sophisticated classification methods.
To demonstrate the concept of Naïve Bayes Classification, consider the example displayed in the
illustration above. As indicated, the objects can be classified as either GREEN or RED. Our task
is to classify new cases as they arrive, i.e., to decide to which class label they belong, based on the
currently existing objects.
Since there are twice as many GREEN objects as RED, it is reasonable to believe that a new case
(which hasn’t been observed yet) is twice as likely to have membership GREEN rather than RED.
In the Bayesian analysis, this belief is known as the prior probability. Prior probabilities are based
on previous experience, in this case the percentage of GREEN and RED objects, and often used to
predict outcomes before they actually happen.
Thus, we can write:
Since there is a total of 60 objects, 40 of which are GREEN and 20 RED, our prior probabilities for
class membership are:

Prior probability of GREEN ∝ 40/60

Prior probability of RED ∝ 20/60
Having formulated our prior probability, we are now ready to classify a new object (WHITE
circle). Since the objects are well clustered, it is reasonable to assume that the more GREEN (or
RED) objects in the vicinity of X, the more likely that the new cases belong to that particular
color. To measure this likelihood, we draw a circle around X which encompasses a number (to be
chosen a priori) of points irrespective of their class labels. Then we calculate the number of points
in the circle belonging to each class label. From this we calculate the likelihood:
From the illustration above, it is clear that Likelihood of X given GREEN is smaller than Likelihood
of X given RED, since the circle encompasses 1 GREEN object and 3 RED ones. Thus:
Probability of X given GREEN ∝ 1/40

Probability of X given RED ∝ 3/20
Although the prior probabilities indicate that X may belong to GREEN (given that there are twice
as many GREEN compared to RED) the likelihood indicates otherwise; that the class membership
of X is RED (given that there are more RED objects in the vicinity of X than GREEN). In the
Bayesian analysis, the final classification is produced by combining both sources of information,
i.e., the prior and the likelihood, to form a posterior probability using the so-called Bayes’ rule
(named after Rev. Thomas Bayes 1702-1761).
Posterior probability of X being GREEN ∝
Prior probability of GREEN × Likelihood of X given GREEN = 4/6 × 1/40 = 1/60

Posterior probability of X being RED ∝
Prior probability of RED × Likelihood of X given RED = 2/6 × 3/20 = 1/20
Finally, we classify X as RED since its class membership achieves the largest posterior
probability.
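The whole GREEN/RED calculation can be written out in a few lines (the counts are those of the
worked example above; this is an illustration of the arithmetic, not a general-purpose classifier):

    total, n_green, n_red = 60, 40, 20
    near_green, near_red = 1, 3                 # neighbours of X inside the circle

    prior_green, prior_red = n_green / total, n_red / total       # 4/6 and 2/6
    like_green, like_red = near_green / n_green, near_red / n_red # 1/40 and 3/20

    post_green = prior_green * like_green       # 1/60, about 0.017
    post_red = prior_red * like_red             # 1/20 = 0.05

    print("RED" if post_red > post_green else "GREEN")   # RED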
In theory, Bayesian classifiers have the minimum error rate in comparison to all other classifiers.
However, in practice this is not always the case owing to inaccuracies in the assumptions made
for its use, such as class conditional independence, and the lack of available probability data.
However, various empirical studies of this classifier in comparison to decision tree and neural
network classifiers have found it to be comparable in some domains.
Bayesian classifiers are also useful in that they provide a theoretical justification for other
classifiers which do not explicitly use Bayes theorem. For example, under certain assumptions,
it can be shown that many neural network and curve-fitting algorithms output the maximum
a posteriori hypothesis, as does the naive Bayesian classifier.
Euclidean Distance
Euclidean distance is the most common distance used as the dissimilarity measure. It is defined
as

d(x, y) = √( Σ_{k=1}^{n} (x_k − y_k)² )

Figure 4.3 illustrates the effects of rotation and of scaling on Euclidean distance in a 2D space. It
is obvious from Figure 4.3 that dissimilarity is preserved after rotation, but after scaling the
x-axis, the dissimilarity between objects is changed. So Euclidean distance is invariant to rotation,
but not to scaling. If rotation is the only acceptable operation for an image database, Euclidean
distance would be a good choice.
Other Distances
There are also many other distances that can be used for different data. Edit distance fits sequence
and text data. The Tanimoto distance is suitable for data with binary-valued features.
Data normalization is one way to overcome this limitation of the distance
functions. For example, normalizing the data to the same scale can overcome the scaling
problem of Euclidean distance; however, normalization may lead to information loss and lower
classification accuracy.
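A short sketch of both ideas (the sample points are hypothetical): Euclidean distance, and
min-max normalization, which rescales every attribute to [0, 1] so that no single axis dominates
the distance:

    import math

    def euclidean(x, y):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    def minmax(column):
        lo, hi = min(column), max(column)
        return [(v - lo) / (hi - lo) for v in column]   # rescale to [0, 1]

    xs = [1.0, 2.0, 100.0]            # attributes on wildly different scales
    ys = [1000.0, 4000.0, 2000.0]
    points = list(zip(minmax(xs), minmax(ys)))

    print(euclidean(points[0], points[1]))   # distance on the normalized scale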
For example, to decide whether to play golf or not, let us consider the following decision tree
(see Figure).
In order to determine the decision (classification) for a given set of weather conditions from the
decision tree, first look at the value of Outlook. There are three possibilities (a direct encoding of
these rules follows the list):
1. If the value of Outlook is sunny, next consider the value of Humidity. If the value is less
than or equal to 75 the decision is play. Otherwise the decision is don’t play.
2. If the value of Outlook is overcast, the decision is play.
3. If the value of Outlook is rain, next consider the value of Windy. If the value is true the
decision is don’t play, otherwise the decision is play.
Decision Trees are useful for predicting exact outcomes. Applying the decision trees algorithm
to a training dataset results in the formation of a tree that allows the user to map a path to
a successful outcome. At every node along the tree, the user answers a question (or makes a
“decision”), such as “play” or “don’t play”.
The decision trees algorithm would be useful for a bank that wants to ascertain the characteristics
of good customers. In this case, the predicted outcome is whether or not the applicant represents
a bad credit risk. The outcome of a decision tree may be a Yes/No result (applicant is/is not a bad
credit risk) or a list of numeric values, with each value assigned a probability.
The training dataset consists of the historical data collected from past loans. Attributes that affect
credit risk might include the customer’s educational level, the number of kids the customer has,
or the total household income. Each split on the tree represents a decision that influences the final
predicted variable. For example, a customer who graduated from high school may be more likely
to pay back the loan. The variable used in the first split is considered the most significant factor.
So if educational level is in the first split, it is the factor that most influences credit risk.
Decision trees have been used in many application areas ranging from medicine to game theory
and business. Decision trees are the basis of several commercial rule induction systems.
Method
1. Create a node N;
2. If samples are all of the same class, C then
3. Return N as a leaf node labeled with the class C
4. If attribute-list is empty then
5. Return N as a leaf node labeled with the most common class in samples; // majority
voting
6. Select test-attribute, the attribute among attribute-list with the highest information gain
7. Label node N with test-attribute
8. For each known value ai of test-attribute // partition the samples
9. Grow a branch from node N for the condition test-attribute=ai
10. Let si be the set of samples in samples for which test-attribute=ai; // a partition
11. If si is empty then
12. Attach a leaf labeled with the most common class in samples
13. Else attach the node returned by Generate decision tree (si, attribute-list - test-attribute).
The automatic generation of decision rules from examples is known as rule induction or automatic
rule induction.
Generating decision rules in the implicit form of a decision tree are also often called rule induction,
but the terms tree induction or decision tree inductions are sometimes preferred.
The basic algorithm for decision tree induction is a greedy algorithm, which constructs decision
trees in a top-down recursive divide-and-conquer manner. The basic algorithm for learning
decision trees, is a version of ID3, a well-known decision tree induction algorithm.
The basic strategy is as follows:
1. The tree starts as a single node representing the training samples (step 1).
2. If the samples are all of the same class, then the node becomes a leaf and is labeled with that
class (steps 2 and 3).
3. Otherwise, the algorithm uses an entropy-based measure known as information gain as a
heuristic for selecting the attribute that will best separate the samples into individual classes
(step 6). This attribute becomes the “test” or “decision” attribute at the node (step 7). In this
version of the algorithm, all attributes are categorical, i.e., discrete-valued. Continuous-
valued attributes must be discretized.
4. A branch is created for each known value of the test attribute, and the samples are
partitioned accordingly (steps 8-10).
5. The algorithm uses the same process recursively to form a decision tree for the samples at
each partition. Once an attribute has occurred at a node, it need not be considered in any of
the node’s descendents (step 13).
6. The recursive partitioning stops only when any one of the following conditions is true:
(a) All samples for a given node belong to the same class (steps 2 and 3), or
(b) There are no remaining attributes on which the samples may be further partitioned
(step 4). In this case, majority voting is employed (step 5). This involves converting
the given node into a leaf and labeling it with the class in majority among samples.
Alternatively, the class distribution of the node samples may be stored; or
(c) There are no samples for the branch test-attribute = ai (step 11). In this case, a leaf is
created with the majority class in samples (step 12).
Decision tree induction algorithms have been used for classification in a wide range of application
domains. Such systems do not use domain knowledge. The learning and classification steps of
decision tree induction are generally fast. Classification accuracy is typically high for data where
the mapping of classes consists of long and thin regions in concept space.
Attribute Selection Measure
The information gain measure is used to select the test attribute at each node in the tree. Such
a measure is also referred to as an attribute selection measure. The attribute with the highest
information gain is chosen as the test attribute for the current node. This attribute minimizes
the information needed to classify the samples in the resulting partitions and reflects the least
randomness or “impurity” in these partitions. Such an information-theoretic approach minimizes
the expected number of tests needed to classify an object and guarantees that a simple (but not
necessarily the simplest) tree is found.
Let S be a set consisting of s data samples. Suppose the class label attribute has m distinct values
defining m distinct classes Ci (for i = 1, …, m). Let si be the number of samples of S in class Ci. The
expected information needed to classify a given sample is given by:

I(s1, s2, …, sm) = − Σ_{i=1}^{m} pi log2(pi)

where pi is the probability that an arbitrary sample belongs to class Ci and is estimated by si/s.
Note that a log function to the base 2 is used, since the information is encoded in bits.
Attribute A can be used to partition S into v subsets, {S1, S2, ..., Sv}, where Sj contains those
samples in S that have value aj of A. If A were selected as the test attribute (i.e., best attribute for
splitting), then these subsets would correspond to the branches grown from the node containing
the set S. Let sij be the number of samples of class Ci in a subset Sj. The entropy, or expected
information based on the partitioning into subsets by A, is given by:

E(A) = Σ (j = 1 to v) [(s1j + ... + smj)/s] × I(s1j, ..., smj)

The term (s1j + ... + smj)/s acts as the weight of the jth subset and is the number of samples in
the subset (i.e., having value aj of A) divided by the total number of samples in S. The smaller the
entropy value is, the greater the purity of the subset partitions. The encoding information that
would be gained by branching on A is:

Gain(A) = I(s1, s2, ..., sm) – E(A)
In other words, Gain(A) is the expected reduction in entropy caused by knowing the value of
attribute A.
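To make the computation concrete, here is a minimal sketch in Python of these formulas, applied to a small hypothetical training set (the toy data and attribute names are our own illustration, not part of the algorithm itself):

import math
from collections import Counter

def info(labels):
    # I(s1, ..., sm): expected information (entropy) of a list of class labels
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gain(samples, attribute, class_attr="class"):
    # Gain(A) = I(s1, ..., sm) - E(A) for one candidate attribute A
    labels = [s[class_attr] for s in samples]
    e_a = 0.0
    for value in set(s[attribute] for s in samples):
        subset = [s[class_attr] for s in samples if s[attribute] == value]
        e_a += (len(subset) / len(samples)) * info(subset)   # weighted entropy
    return info(labels) - e_a

samples = [
    {"outlook": "sunny", "windy": "no",  "class": "play"},
    {"outlook": "sunny", "windy": "yes", "class": "stay"},
    {"outlook": "rain",  "windy": "yes", "class": "stay"},
    {"outlook": "rain",  "windy": "no",  "class": "play"},
]
for a in ("outlook", "windy"):
    print(a, round(gain(samples, a), 3))   # outlook 0.0, windy 1.0

Here "windy" has the highest information gain and would become the test attribute at the node.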
The algorithm computes the information gain of each attribute. The attribute with the highest
information gain is chosen as the test attribute for the given set S. A node is created and labeled
with the attribute, branches are created for each value of the attribute, and the samples are
partitioned accordingly.
After building a decision tree, a tree-pruning step can be performed to reduce the size of the
tree. Decision trees that are too large are susceptible to a phenomenon called overfitting.
Pruning helps by trimming the branches that reflect anomalies in the training data due to noise
or outliers, adjusting the initial tree in a way that improves the generalization capability of the
tree. Such methods typically use statistical measures to remove the least reliable branches,
generally resulting in faster classification and an improvement in the ability of the tree to
correctly classify independent test data.
There are two common approaches to tree pruning.
Pre-pruning Approach
In the pre-pruning approach, a tree is “pruned” by halting its construction early (e.g., by deciding
not to further split or partition the subset of training samples at a given node). Upon halting, the
node becomes a leaf. The leaf may hold the most frequent class among the subset samples, or the
probability distribution of those samples.
When constructing a tree, measures such as statistical significance, χ2 (chi-square), information gain, etc., can
be used to assess the goodness of a split. If partitioning the samples at a node would result in a split
that falls below a pre-specified threshold, then further partitioning of the given subset is halted.
There are difficulties, however, in choosing an appropriate threshold. High thresholds could
result in oversimplified trees, while low thresholds could result in very little simplification.
Post-pruning Approach
The post-pruning approach removes branches from a “fully grown” tree. A tree node is pruned
by removing its branches.
The cost complexity pruning algorithm is an example of the post-pruning approach. The pruned
node becomes a leaf and is labeled by the most frequent class among its former branches. For
each non-leaf node in the tree, the algorithm calculates the expected error rate that would occur if
the subtree at that node were pruned. Next, the expected error rate occurring if the node were not
pruned is calculated using the error rates for each branch, combined by weighting according to
the proportion of observations along each branch. If pruning the node leads to a greater expected
error rate, then the subtree is kept. Otherwise, it is pruned. After generating a set of progressively
pruned trees, an independent test set is used to estimate the accuracy of each tree. The decision
tree that minimizes the expected error rate is preferred.
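The pruning decision itself can be sketched as follows. This is a simplified, reduced-error-style illustration rather than the full cost complexity algorithm; it assumes each node already carries validation-set error counts and its majority class (the field names "errors", "leaf_errors" and "majority_class" are hypothetical choices of our own):

def subtree_errors(node):
    # Total misclassified validation samples under this subtree
    if "branches" not in node:                # already a leaf
        return node["errors"]
    return sum(subtree_errors(c) for c in node["branches"].values())

def prune(node):
    # Bottom-up post-pruning: collapse a subtree into a leaf labeled with
    # its majority class whenever that does not increase the expected error
    if "branches" not in node:
        return node
    for child in node["branches"].values():
        prune(child)
    if node["leaf_errors"] <= subtree_errors(node):
        node.pop("branches")                  # the pruned node becomes a leaf
        node["errors"] = node["leaf_errors"]
        node["label"] = node["majority_class"]
    return node

# Tiny illustration: 3 validation errors in the subtree vs. 2 as a leaf
tree = {"majority_class": "play", "leaf_errors": 2, "branches": {
    "sunny": {"errors": 1},
    "rain":  {"errors": 2},
}}
prune(tree)
print("label" in tree)   # True -- the subtree was replaced by a leaf 'play'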
Example: A company is trying to decide whether to bid for a certain contract or not. They
estimate that merely preparing the bid will cost £10,000. If the company bids, then they estimate
that there is a 50% chance that their bid will be put on the "short-list"; otherwise their bid will be
rejected.
Once “short-listed” the company will have to supply further detailed information (entailing costs
estimated at £5,000). After this stage their bid will either be accepted or rejected.
The company estimate that the labour and material costs associated with the contract are £127,000.
They are considering three possible bid prices, namely £155,000, £170,000 and £190,000. They
estimate that the probability of these bids being accepted (once they have been short-listed) is
0.90, 0.75 and 0.35 respectively.
What should the company do and what is the expected monetary value of your suggested course
of action?
Solution
The decision tree for the problem is shown in Figure 4.5:
Below we carry out step 1 of the decision tree solution procedure which (for this example)
involves working out the total profit for each of the paths from the initial node to the terminal
node (all figures in £’000).
1. Path to terminal node 7 – the company do nothing
Total profit = 0
2. Path to terminal node 8 – the company prepare the bid but fail to make the short-list
Total cost = 10 Total profit = –10
3. Path to terminal node 9 – the company prepare the bid, make the short-list and their bid of
£155K is accepted
Total cost = 10 + 5 + 127 Total revenue = 155 Total profit = 13
4. Path to terminal node 10 – the company prepare the bid, make the short-list but their bid of
£155K is unsuccessful
Total cost = 10 + 5 Total profit = –15
5. Path to terminal node 11 – the company prepare the bid, make the short-list and their bid
of £170K is accepted
Total cost = 10 + 5 + 127 Total revenue = 170 Total profit = 28
6. Path to terminal node 12 – the company prepare the bid, make the short-list but their bid of
£170K is unsuccessful
Total cost = 10 + 5 Total profit = –15
7. Path to terminal node 13 - the company prepare the bid, make the short-list and their bid
of £190K is accepted
Total cost = 10 + 5 + 127 Total revenue = 190 Total profit = 48
8. Path to terminal node 14 - the company prepare the bid, make the short-list but their bid of
£190K is unsuccessful
Total cost = 10 + 5 Total profit = –15
9. Path to terminal node 15 - the company prepare the bid and make the short-list and then
decide to abandon bidding (an implicit option available to the company)
Total cost = 10 + 5 Total profit = –15
Hence we can arrive at the following table indicating, for each branch, the total profit involved
in that branch from the initial node to the terminal node:

Terminal node    Total profit (£'000)
7                0
8                –10
9                13
10               –15
11               28
12               –15
13               48
14               –15
15               –15
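The remaining steps of the solution procedure fold these profits back through the tree, taking expected values at chance nodes and the maximum at decision nodes. A short Python sketch of that fold-back, using only the probabilities and profits given above (all figures in £'000):

# Expected monetary value (EMV) fold-back for the bidding example
p_shortlist = 0.5
loss_if_unsuccessful = -15    # bid preparation (10) plus detailed information (5)
bids = {155: (0.90, 13), 170: (0.75, 28), 190: (0.35, 48)}  # price: (P(accept), profit)

# EMV of each bid price, given that the company has been short-listed
emv_bid = {price: round(p * profit + (1 - p) * loss_if_unsuccessful, 3)
           for price, (p, profit) in bids.items()}    # {155: 10.2, 170: 17.25, 190: 7.05}

# Decision node after short-listing: choose the best bid or abandon (-15)
emv_shortlisted = max(max(emv_bid.values()), loss_if_unsuccessful)

# Chance node for making the short-list; failing costs the 10 already spent
emv_bidding = p_shortlist * emv_shortlisted + (1 - p_shortlist) * (-10)

print(max(emv_bid, key=emv_bid.get), emv_bidding)    # 170 3.625

Since £3,625 is greater than the £0 of doing nothing, the suggested course of action is to prepare the bid and, if short-listed, to bid £170,000; the expected monetary value of this course of action is £3,625.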
Task “Different distance functions have different characteristics, which fit various
types of data.” Explain
Even though the pruned trees are more compact than the originals, they can still be very
complex. Large decision trees are difficult to understand because each node has a specific context
established by the outcomes of tests at antecedent nodes. To make a decision-tree model more
readable, a path to each leaf can be transformed into an IF-THEN production rule. The IF part
consists of all tests on a path, and the THEN part is a final classification. Rules in this form are
called decision rules, and a collection of decision rules for all leaf nodes would classify samples
exactly as the tree does. As a consequence of their tree origin, the IF parts of the rules would be
mutually exclusive and exhaustive, so the order of the rules would not matter. An example of
the transformation of a decision tree into a set of decision rules is given in Figure 4.6, where the
two given attributes, A and B, may have two possible values, 1 and 2, and the final classification
is into one of two classes.
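As a sketch of this transformation, the following Python fragment walks a decision tree stored as nested dictionaries (a representation of our own choosing) and emits one IF-THEN rule per leaf, in the spirit of the two-attribute example just described:

def tree_to_rules(node, conditions=()):
    # Turn each root-to-leaf path into an IF-THEN decision rule
    if "label" in node:                      # leaf: emit one rule
        antecedent = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {antecedent} THEN class = {node['label']}"]
    rules = []
    for value, child in node["branches"].items():
        rules += tree_to_rules(child, conditions + ((node["attribute"], value),))
    return rules

# Hypothetical tree over two attributes A and B, each taking the values 1 and 2
tree = {"attribute": "A", "branches": {
    1: {"label": "Class1"},
    2: {"attribute": "B", "branches": {1: {"label": "Class2"},
                                       2: {"label": "Class1"}}},
}}
for rule in tree_to_rules(tree):
    print(rule)
# IF A = 1 THEN class = Class1
# IF A = 2 AND B = 1 THEN class = Class2
# IF A = 2 AND B = 2 THEN class = Class1

Because each rule corresponds to exactly one leaf, the generated rules are mutually exclusive and exhaustive, as the text notes.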
A rule can be “pruned” by removing any condition in its antecedent that does not improve the
estimated accuracy of the rule. For each class, rules within a class may then be ranked according
to their estimated accuracy. Since it is possible that a given test sample will not satisfy any rule
antecedent, a default rule assigning the majority class is typically added to the resulting rule
set.
An artificial neural network is a system based on the operation of biological neural networks;
in other words, it is an emulation of a biological neural system. Why is the implementation of
artificial neural networks necessary? Although computing these days is truly advanced, there
are certain tasks that a program written for a common microprocessor is unable to perform well;
a software implementation of a neural network can address such tasks, with its own advantages
and disadvantages.
Another aspect of artificial neural networks is that there are different architectures, which
consequently require different types of algorithms; but despite being an apparently complex
system, a neural network is relatively simple.
Artificial Neural Networks (ANN) are among the newest signal-processing technologies in the
engineer's toolbox. The field is highly interdisciplinary, but our approach will restrict the view
to the engineering perspective. In engineering, neural networks serve two important functions:
as pattern classifiers and as nonlinear adaptive filters. We will provide a brief overview of the
theory, learning rules, and applications of the most important neural network models.

Definitions and Style of Computation

An Artificial Neural Network is an adaptive, most often nonlinear system that learns to perform
a function (an input/output map) from data. Adaptive means that the system parameters are
changed during operation, normally called the training phase. After the training phase the
Artificial Neural Network parameters are fixed and the system is deployed to solve the problem
at hand (the testing phase). The Artificial Neural Network is built with a systematic step-by-step
procedure to optimize a performance criterion or to follow some implicit internal constraint,
which is commonly referred to as the learning rule. The input/output training data are
fundamental in neural network technology, because they convey the necessary information to
"discover" the optimal operating point. The nonlinear nature of the neural network processing
elements (PEs) provides the system with lots of flexibility to achieve practically any desired
input/output map, i.e., some Artificial Neural Networks are universal mappers. There is a style
in neural computation that is worth describing.
An input is presented to the neural network and a corresponding desired or target response is set
at the output (when this is the case the training is called supervised). An error is composed from
the difference between the desired response and the system output. This error information is fed
back to the system and adjusts the system parameters in a systematic fashion (the learning rule).
The process is repeated until the performance is acceptable. It is clear from this description that
the performance hinges heavily on the data. If one does not have data that cover a significant
portion of the operating conditions or if they are noisy, then neural network technology is
probably not the right solution. On the other hand, if there is plenty of data but the problem is
too poorly understood to derive an approximate model, then neural network technology is a good
choice. This operating procedure should be contrasted with the traditional engineering design,
made of exhaustive subsystem specifications and intercommunication protocols. In artificial
neural networks, the designer chooses the network topology, the performance function, the
learning rule, and the criterion to stop the training phase, but the system automatically adjusts
the parameters. So, it is difficult to bring a priori information into the design, and when the
system does not work properly it is also hard to incrementally refine the solution. But ANN-
based solutions are extremely efficient in terms of development time and resources, and in many
difficult problems artificial neural networks provide performance that is difficult to match with
other technologies. Denker said 10 years ago that "artificial neural networks are the second best
way to implement a solution", motivated by the simplicity of their design and because of their
universality, only shadowed by the traditional design obtained by studying the physics of the
problem. At present, artificial neural networks are emerging as the technology of choice for many
applications, such as pattern recognition, prediction, system identification, and control.
Artificial neural networks emerged after the introduction of simplified neurons by McCulloch
and Pitts in 1943 (McCulloch & Pitts, 1943). These neurons were presented as models of biological
neurons and as conceptual components for circuits that could perform computational tasks. The
basic model of the neuron is founded upon the functionality of a biological neuron. “Neurons are
the basic signaling units of the nervous system” and “each neuron is a discrete cell whose several
processes arise from its cell body”.
The neuron has four main regions to its structure. The cell body, or soma, has two offshoots
from it, the dendrites, and the axon, which end in presynaptic terminals. The cell body is the
heart of the cell, containing the nucleus and maintaining protein synthesis. A neuron may have
many dendrites, which branch out in a treelike structure, and receive signals from other neurons.
A neuron usually only has one axon which grows out from a part of the cell body called the
axon hillock. The axon conducts electric signals generated at the axon hillock down its length.
These electric signals are called action potentials. The other end of the axon may split into several
branches, which end in a presynaptic terminal. Action potentials are the electric signals that
neurons use to convey information to the brain. All these signals are identical. Therefore, the
brain determines what type of information is being received based on the path that the signal
took. The brain analyzes the patterns of signals being sent and from that information it can
interpret the type of information being received. Myelin is the fatty tissue that surrounds and
insulates the axon. Often short axons do not need this insulation. There are uninsulated parts
of the axon. These areas are called Nodes of Ranvier. At these nodes, the signal traveling down
the axon is regenerated. This ensures that the signal traveling down the axon travels fast and
remains constant (i.e. very short propagation delay and no weakening of the signal). The synapse
is the area of contact between two neurons. The neurons do not actually physically touch. They
are separated by the synaptic cleft, and electric signals are sent through chemical interaction.
The neuron sending the signal is called the presynaptic cell and the neuron receiving the signal
is called the postsynaptic cell. The signals are generated by the membrane potential, which is
based on the differences in concentration of sodium and potassium ions inside and outside the
cell membrane. Neurons can be classified by their number of processes (or appendages), or by
their function. If they are classified by the number of processes, they fall into three categories.
Unipolar neurons have a single process (dendrites and axon are located on the same stem), and
are most common in invertebrates. In bipolar neurons, the dendrite and axon are the neuron’s
two separate processes. Bipolar neurons have a subclass called pseudo-bipolar neurons, which
are used to send sensory information to the spinal cord. Finally, multipolar neurons are most
common in mammals. Examples of these neurons are spinal motor neurons, pyramidal cells and
Purkinje cells (in the cerebellum). If classified by function, neurons again fall into three separate
categories. The first group is sensory, or afferent, neurons, which provide information for
perception and motor coordination. The second group provides information (or instructions) to
muscles and glands and is therefore called motor neurons. The last group, interneuronal, contains
all other neurons and has two subclasses. One group called relay or projection interneurons have
long axons and connect different parts of the brain. The other group called local interneurons are
only used in local circuits.
When creating a functional model of the biological neuron, there are three basic components
of importance. First, the synapses of the neuron are modeled as weights. The strength of the
connection between an input and a neuron is noted by the value of the weight. Negative weight
values reflect inhibitory connections, while positive values designate excitatory connections
[Haykin]. The next two components model the actual activity within the neuron cell. An adder
sums up all the inputs modified by their respective weights. This activity is referred to as linear
combination. Finally, an activation function controls the amplitude of the output of the neuron.
An acceptable range of output is usually between 0 and 1, or -1 and 1.
Mathematically, this process can be described as follows. From this model, the internal activity
of the neuron can be shown to be:

vk = Σ (j = 1 to p) wkj xj

where x1, x2, ..., xp are the input signals and wk1, wk2, ..., wkp are the synaptic weights of
neuron k. The output of the neuron, yk, would therefore be the outcome of some activation
function on the value of vk:

yk = Φ(vk)
As mentioned previously, the activation function acts as a squashing function, such that the
output of a neuron in a neural network is between certain values (usually 0 and 1, or -1 and 1). In
general, there are three types of activation functions, denoted by Φ(.) . First, there is the Threshold
Function which takes on a value of 0 if the summed input is less than a certain threshold value
(v), and the value 1 if the summed input is greater than or equal to the threshold value.
Secondly, there is the Piecewise-Linear function. This function again can take on the values of 0
or 1, but can also take on values between that depending on the amplification factor in a certain
region of linear operation.
Thirdly, there is the sigmoid function. This function can range between 0 and 1, but it is also
sometimes useful to use the -1 to 1 range. An example of the sigmoid function is the hyperbolic
tangent function.
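The three activation functions, together with the weighted-sum model of the neuron described above, can be sketched as follows (the threshold and amplification values are illustrative choices of our own):

import math

def threshold(v, theta=0.0):
    # Threshold function: 0 below the threshold value, 1 at or above it
    return 1.0 if v >= theta else 0.0

def piecewise_linear(v, amplification=1.0):
    # Piecewise-linear: values between 0 and 1 in a linear region, clipped outside
    return min(1.0, max(0.0, 0.5 + amplification * v))

def sigmoid(v):
    # Logistic sigmoid, ranging between 0 and 1
    return 1.0 / (1.0 + math.exp(-v))

def neuron(inputs, weights, activation=sigmoid):
    # vk = weighted sum of the inputs (linear combination); yk = activation(vk)
    vk = sum(w * x for w, x in zip(weights, inputs))
    return activation(vk)

print(neuron([1.0, 0.5], [0.8, -0.2]))             # ~0.668 (sigmoid output)
print(neuron([1.0, 0.5], [0.8, -0.2], threshold))  # 1.0 (hard-limited output)
# math.tanh(v) gives the hyperbolic tangent, the -1 to 1 variant of the sigmoid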
The artificial neural networks which we describe are all variations on the parallel distributed
processing (PDP) idea. The architecture of each neural network is based on very similar building
blocks which perform the processing. In this unit we first discuss these processing units and the
different neural network topologies, and then consider learning strategies as a basis for an
adaptive system.
1R Algorithm
One of the simple approaches used to find classification rules is called 1R, as it generates a one-
level decision tree. This algorithm examines the "rules that classify an object on the basis of a
single attribute".
The basic idea is that rules are constructed to test a single attribute and branch for every value
of that attribute. For each branch, the class with the best classification is the one occurring most
often in the training data. The error rate of the rules is then determined by counting the number
of instances that do not have the majority class in the training data. Finally, the error rate for each
attribute’s rule set is evaluated, and the rule set with the minimum error rate is chosen.
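A compact sketch of 1R on a small hypothetical dataset (the toy data are our own illustration):

from collections import Counter

def one_r(samples, attributes, class_attr="class"):
    # 1R: build one rule set per attribute, keep the set with the fewest errors
    best = None
    for attr in attributes:
        rules, errors = {}, 0
        for value in set(s[attr] for s in samples):
            classes = Counter(s[class_attr] for s in samples if s[attr] == value)
            majority, count = classes.most_common(1)[0]
            rules[value] = majority                  # branch -> majority class
            errors += sum(classes.values()) - count  # non-majority instances
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

samples = [
    {"outlook": "sunny", "windy": "no",  "class": "play"},
    {"outlook": "sunny", "windy": "yes", "class": "stay"},
    {"outlook": "rain",  "windy": "yes", "class": "stay"},
    {"outlook": "rain",  "windy": "no",  "class": "play"},
]
print(one_r(samples, ["outlook", "windy"]))
# e.g. ('windy', {'no': 'play', 'yes': 'stay'}, 0) -- a one-level decision tree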
A comprehensive comparative evaluation of the performance of 1R and other methods on 16
datasets (many of which were most commonly used in machine learning research) was performed.
Despite its simplicity, 1R produced surprisingly accurate rules, just a few percentage points lower
in accuracy than the decision trees produced by the state-of-the-art algorithm (C4). The decision
trees produced by C4 were in most cases considerably larger than 1R's rules, and the rules generated
by 1R were much easier to interpret. 1R therefore provides a baseline performance, using a
rudimentary technique, to be tried before progressing to more sophisticated algorithms.
Other Algorithms
Basic covering algorithms construct rules that classify training data perfectly; that is, they tend to
overfit the training set, causing insufficient generalization and difficulty in processing new data.
However, for applications in real-world domains, methods for handling noisy data, mechanisms
for avoiding overfitting even on training data, and relaxation of the constraint requirements are
needed. Pruning is one way of dealing with these problems; it approaches the problem
of overfitting by learning a general concept from the training set "to improve the prediction of
unseen instances". In Reduced Error Pruning (REP), some of the training examples are withheld
as a test set and the performance of the rule is measured on them. Incremental Reduced Error
Pruning (IREP) has also proven efficient in handling over-fitting, and it forms the basis of
RIPPER. SLIPPER (Simple Learner with Iterative Pruning to Produce Error Reduction) uses
"confidence-rated boosting to learn an ensemble of rules."
Rule-based algorithms are widely used for deriving classification rules applied in the medical
sciences for diagnosing illnesses, in business planning, banking, government and different
disciplines of science. In particular, covering algorithms have deep roots in machine learning.
Within data mining, covering algorithms including SWAP-1, RIPPER, and DAIRY are used in
text classification and have been adapted in gene expression programming for discovering
classification rules.
Task Explain how will you remove the training data covered by rule R.
We begin with what is perhaps the best-known data type in traditional data analysis, namely,
d-dimensional vectors x of measurements on N objects or individuals, i.e., N objects for each
of which we have d measurements or attributes. Such data is often referred to as multivariate
data and can be thought of as an N × d data matrix. Classical problems in data analysis involving
multivariate data include classification (learning a functional mapping from a vector x to y, where
y is a categorical, or scalar, target variable of interest), regression (the same as classification, except
that y takes real values), clustering (learning a function that maps x into a set of categories, where
the categories are unknown a priori), and density estimation (estimating the probability density
function, or PDF, for x, p(x)).
The dimensionality d of the vectors x plays a significant role in multivariate modeling. In
problems like text classification and clustering of gene expression data, d can be as large as 10^3
to 10^4 dimensions. Density estimation theory shows that the amount of data needed to reliably
estimate a density function scales exponentially in d (the so-called "curse of dimensionality").
Fortunately, many predictive problems, including classification and regression, do not need a full
d-dimensional estimate of the PDF p(x), relying instead on the simpler problem of determining
a conditional probability density function p(y|x), where y is the variable whose value the data
miner wants to predict.
Recent research has shown that combining different models can be effective in reducing the
instability that results from predictions using a single model fit to a single set of data. A variety of
model-combining techniques (with exotic names like bagging, boosting, and stacking) combine
massive computational search methods with variance-reduction ideas from statistics; the result
is relatively powerful automated schemes for building multivariate predictive models. As the
data miner's multivariate toolbox expands, a significant part of the data miner's art becomes
practical intuition about the tools themselves.
Case Study Hideaway Warehouse Management System (WMS)
The Company
Hideaway Beds – Wall Bed Company offers the latest designs in wall beds. Wall beds have
been around since 1918 in America and Europe. The company ships their products
to approximately 100 retailers in Australia as well as taking online orders directly from
individual consumers.
Key Benefits
1. Order accuracy increased from 80% to 99.9%
2. Order picking times reduced by one third
Contd...
order for assembling. Once the final product is ready, it is shipped to the customer, and all
phases of the assembly process are thereby recorded into the warehouse management
system. There was no longer a need for paper recording or manual data entry. Using RF
handhelds and barcode technology, Naxtor WMS allows Hideaway to track and trace every
item as it is received, put away, picked and shipped. Each member of the warehouse staff
became an expert on stock and assembly location and warehouse processes.
Hideaway Beds at times also required generating their own barcodes; with Naxtor WMS,
a click of a button generates EAN32 and many generic barcodes readable by most barcode
readers. The printed labels are then attached to the incoming assembly items ready to be
put away to their respective locations.
“Having the warehouse system in place means that instead of product location and
warehouse layout sitting in someone’s head, it’s now in the system, and that makes it much
more transferable so that every staff member is an expert,” says Von
After only a three-week installation and training process, the warehouse was able to leverage
the full capabilities of Naxtor WMS. The company was able to equip warehouse staff with
the PSC Falcon line of wireless data collection and printing technologies.
4.11 Summary
zz Bayesian classifiers are statistical classifiers. They can predict class membership probabilities,
such as the probability that a given sample belongs to a particular class.
zz Bayesian classification is based on Bayes theorem. Bayesian classifiers have exhibited high
accuracy and speed when applied to large databases.
zz Bayesian belief networks are graphical models, which unlike naive Bayesian classifiers,
allow the representation of dependencies among subsets of attributes.
zz Bayes’ theorem relates the conditional and marginal probabilities of events A and B, where
B has a non-vanishing probability.
zz Naïve Bayes classifiers assume that the effect of a variable value on a given class is
independent of the values of the other variables. This assumption is called class conditional
independence. It is made to simplify the computation and in this sense is considered to be
"naïve".
zz In theory, Bayesian classifiers have the minimum error rate in comparison to all other
classifiers.
zz Classification and prediction are two forms of data analysis, which can be used to extract
models describing important data classes or to predict future data trends. Classification
predicts categorical labels (or discrete values), whereas prediction models continuous-
valued functions.
zz The learning of the model is 'supervised' if it is told to which class each training sample
belongs. This contrasts with unsupervised learning (or clustering), in which the class labels
of the training samples are not known, and the number or set of classes to be learned may
not be known in advance.
zz Prediction is similar to classification, except that for prediction, the results lie in the
future.
zz Any of the methods and techniques used for classification may also be used, under
appropriate circumstances, for prediction.
zz Data cleaning, relevance analysis and data transformation are the preprocessing steps that
may be applied to the data in order to help improve the accuracy, efficiency, and scalability
of the classification or prediction process.
zz Classification and prediction methods can be compared and evaluated according to the
criteria of Predictive accuracy, Speed, Robustness, Scalability and Interpretability.
zz A decision tree is a flow-chart-like tree structure, where each internal node denotes a test on
an attribute, each branch represents an outcome of the test, and leaf nodes represent classes
or class distributions. The topmost node in a tree is the root node.
4.12 Keywords
Bayes theorem: Bayesian classification is based on Bayes theorem.
Bayesian belief networks: These are graphical models, which unlike naive Bayesian classifiers,
allow the representation of dependencies among subsets of attributes
Bayesian classification: Bayesian classifiers are statistical classifiers.
Classification: Classification is a data mining technique used to predict group membership for
data instances.
Data cleaning: Data cleaning refers to the preprocessing of data in order to remove or reduce
noise (by applying smoothing techniques, for example), and the treatment of missing values (e.g.,
by replacing a missing value with the most commonly occurring value for that attribute, or with
the most probable value based on statistics).
Decision Tree: A decision tree is a flow-chart-like tree structure, where each internal node denotes
a test on an attribute, each branch represents an outcome of the test, and leaf nodes represent
classes or class distributions. The topmost node in a tree is the root node.
Decision tree induction: The automatic construction of a decision tree from a set of training
examples. The automatic generation of decision rules from examples is known as rule induction
or automatic rule induction.
Interpretability: This refers to the level of understanding and insight that is provided by the
learned model.
Naive Bayesian classifiers: They assume that the effect of an attribute value on a given class is
independent of the values of the other attributes.
Overfitting: A phenomenon to which overly large decision trees are susceptible, in which the
tree fits noise or outliers in the training data and generalizes poorly.
Prediction: Prediction is similar to classification, except that for prediction, the results lie in the
future.
Predictive accuracy: This refers to the ability of the model to correctly predict the class label of
new or previously unseen data.
Scalability: This refers to the ability of the learned model to perform efficiently on large amounts
of data.
Supervised learning: The learning of the model is ‘supervised’ if it is told to which class each
training sample belongs.
Tree pruning: After building a decision tree a tree pruning step can be performed to reduce the
size of the decision tree.
Unsupervised learning: In unsupervised learning, the class labels of the training samples are not
known, and the number or set of classes to be learned may not be known in advance.
Answers: Self Assessment

1. Statistical 2. Bayes
3. Naïve Bayesian 4. Graphical
5. Bayesian 6. Classification
7. Supervised 8. Predictive accuracy
9. Decision tree 10. Information gain
Books A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.
Alex Berson, Data Warehousing, Data Mining and OLAP, Tata McGraw Hill, 1997.
Alex Berson, Stephen J. Smith, Data warehousing, Data Mining & OLAP, Tata
McGraw Hill, Publications, 2004.
Alex Freitas and Simon Lavington, Mining Very Large Databases with Parallel
Processing, Kluwer Academic Publishers, 1998.
J. Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers,
1993.
Jiawei Han, Micheline Kamber, Data Mining – Concepts and Techniques, Morgan
Kaufmann Publishers, First Edition, 2003.
Matthias Jarke, Maurizio Lenzerini, Yannis Vassiliou, Panos Vassiliadis,
Fundamentals of Data Warehouses, Publisher: Springer
Michael Berry and Gordon Linoff, Data Mining Techniques (For Marketing, Sales,
and Customer Support), John Wiley & Sons, 1997.
Michael J. A. Berry, Gordon S Linoff, Data Mining Techniques, Wiley Publishing
Inc, Second Edition, 2004.
Sam Anohory, Dennis Murray, Data Warehousing in the Real World, Addison
Wesley, First Edition, 2000.
Sholom M. Weiss and Nitin Indurkhya, “Predictive Data Mining: A Practical Guide”,
Morgan Kaufmann Publishers, 1998.
Sushmita Mitra, Tinku Acharya, Data Mining – Multimedia, Soft Computing and
Bioinformatics, John Wiley & Sons, 2003.
Usama Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy
Uthurasamy, Advances in Knowledge Discovery and Data Mining, AAAI Press/ The
MIT Press, 1996.
V. Cherkassky and F. Mulier, Learning From Data, John Wiley & Sons, 1998.
Contents
Objectives
Introduction
5.1 Data Extraction
5.2 Data Reconciliation
5.2.1 Model
5.2.2 Modelling Aspects
5.3 Data Aggregation and Customization
5.4 Query Optimization
5.5 Update Propagation
5.6 Modelling and Measuring Data Warehouse Quality
5.6.1 Quality Definition
5.6.2 Data Quality Research
5.6.3 Data Quality
5.7 Major Research Project in Data Warehousing
5.8 Three Perspectives of Data Warehouse Metadata
5.9 Summary
5.10 Keywords
5.11 Self Assessment
5.12 Review Questions
5.13 Further Readings
Objectives
After studying this unit, you will be able to:
zz Describe data extraction and reconciliation
zz Know data aggregation and customization
zz Explain query optimization
zz Know modelling and measuring data warehouse quality
zz Explain three perspectives of data warehouse metadata
Introduction
An enterprise data warehouse often fetches records from several disparate systems and stores
them centrally in an enterprise-wide warehouse. But what is the guarantee that the quality of
data will not degrade in the process of centralization?
Many data warehouses are built on an n-tier architecture with multiple data extraction and
data insertion jobs between two consecutive tiers. As it happens, the nature of the data changes
as it passes from one tier to the next. Data reconciliation is the method of reconciling, or tying up,
the data between any two consecutive tiers (layers).
Example: One of the source systems for a sales analysis data warehouse might be an
order entry system that records all of the current order activities.
Designing and creating the extraction process is often one of the most time-consuming tasks
in the ETL process and, indeed, in the entire data warehousing process. The source systems
might be very complex and poorly documented, and thus determining which data needs to be
extracted can be difficult. The data has to be extracted normally not only once, but several times
in a periodic manner to supply all changed data to the data warehouse and keep it up-to-date.
Moreover, the source system typically cannot be modified, nor can its performance or availability
be adjusted, to accommodate the needs of the data warehouse extraction process.
These are important considerations for extraction and ETL in general. This unit, however, focuses
on the technical considerations of having different kinds of sources and extraction methods. It
assumes that the data warehouse team has already identified the data that will be extracted, and
discusses common techniques used for extracting data from source databases.
The extraction method you should choose is highly dependent on the source system and also on
the business needs in the target data warehouse environment. Very often, there is no possibility
of adding additional logic to the source systems to support an incremental extraction of data, due
to the performance impact or the increased workload on these systems. Sometimes the customer
is not even allowed to add anything to an out-of-the-box application system.
The estimated amount of the data to be extracted and the stage in the ETL process (initial load or
maintenance of data) may also impact the decision of how to extract, from a logical and a physical
perspective. Basically, you have to decide how to extract data logically and physically.
necessary on the source site. An example of a full extraction may be an export file of a
distinct table or a remote SQL statement scanning the complete source table.
2. Incremental Extraction: At a specific point in time, only the data that has changed since
a well-defined event back in history will be extracted. This event may be the last time of
extraction or a more complex business event like the last booking day of a fiscal period. To
identify this delta change there must be a possibility to identify all the changed information
since this specific time event. This information can be either provided by the source data
itself such as an application column, reflecting the last-changed timestamp or a change
table where an appropriate additional mechanism keeps track of the changes besides the
originating transactions. In most cases, using the latter method means adding extraction
logic to the source system.
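As a minimal sketch of timestamp-based incremental extraction, assuming a source table named orders with a last_changed column (the table, the column names and the SQLite back end are illustrative assumptions, not a prescribed design):

import sqlite3

def extract_increment(conn, last_extraction_time):
    # Pull only the rows changed since the last well-defined extraction event
    cur = conn.execute(
        "SELECT order_id, amount, last_changed FROM orders "
        "WHERE last_changed > ?",            # the delta-change condition
        (last_extraction_time,))
    return cur.fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INT, amount REAL, last_changed TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 10.0, "2011-01-01"), (2, 25.0, "2011-03-15")])

# Only order 2 changed after the last extraction run
print(extract_increment(conn, "2011-02-01"))   # [(2, 25.0, '2011-03-15')]
# The highest timestamp returned is stored as the event for the next extraction
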
Many data warehouses do not use any change-capture techniques as part of the extraction
process. Instead, entire tables from the source systems are extracted to the data warehouse or
staging area, and these tables are compared with a previous extract from the source system to
identify the changed data. This approach may not have significant impact on the source systems,
but it clearly can place a considerable burden on the data warehouse processes, particularly if the
data volumes are large.
Depending on the chosen logical extraction method and the capabilities and restrictions on the
source side, the extracted data can be physically extracted by two mechanisms. The data can
either be extracted online from the source system or from an offline structure. Such an offline
structure might already exist or it might be generated by an extraction routine.
There are the following methods of physical extraction:
1. Online Extraction: The data is extracted directly from the source system itself. The
extraction process can connect directly to the source system to access the source tables
themselves or to an intermediate system that stores the data in a preconfigured manner (for
example, snapshot logs or change tables).
Note The intermediate system is not necessarily physically different from the source
system.
With online extractions, you need to consider whether the distributed transactions are
using original source objects or prepared source objects.
2. Offline Extraction: The data is not extracted directly from the source system but is staged
explicitly outside the original source system. The data already has an existing structure
(for example, redo logs, archive logs or transportable tablespaces) or was created by an
extraction routine.
data. Data reconciliation is based on a comparison of the data loaded into business intelligence
and the application data in the source system. You can access the data in the source system
directly to perform this comparison.
The term productive DataSource is used for DataSources that are used for data transfer in the
productive operation of business intelligence. The term data reconciliation DataSource is used for
DataSources that are used as a reference for accessing the application data in the source directly
and therefore allow you to draw comparisons to the source data.
5.2.1 Model
Figure below shows the data model for reconciling application data and loaded data in the
data flow with transformation. The data model can also be based on 3.x objects (data flow with
transfer rules).
The productive DataSource uses data transfer to deliver the data that is to be validated to BI.
The transformation connects the DataSource fields with the InfoObject of a DataStore object that
has been created for data reconciliation, by means of a direct assignment. The data reconciliation
DataSource allows a VirtualProvider direct access to the application data. In a MultiProvider,
the data from the DataStore object is combined with the data that has been read directly. In a
query that is defined on the basis of a MultiProvider, the loaded data can be compared with the
application data in the source system.
In order to automate data reconciliation, we recommend that you define exceptions in the
query that proactively signal that differences exist between the productive data in BI and the
reconciliation data in the source. You can use information broadcasting to distribute the results
of data reconciliation by e-mail.
Data reconciliation for DataSources allows you to check the integrity of the loaded data by, for
example, comparing the totals of a key figure in the DataStore object with the corresponding
totals that the VirtualProvider accesses directly in the source system.
In addition, you can use the extractor or extractor error interpretation to identify potential errors
in data processing. This function is available if the data reconciliation DataSource uses a
different extraction module from the productive DataSource.
We recommend that you keep the volume of data transferred as small as possible because the data
reconciliation DataSource accesses the data in the source system directly. This is best performed
using a data reconciliation DataSource delivered by business intelligence Content or a generic
DataSource using function modules because this allows you to implement an aggregation logic.
For mass data, you generally need to aggregate the data or make appropriate selections during
extraction.
The data reconciliation DataSource has to provide selection fields that allow the same set of data
to be extracted as the productive DataSource.
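Stripped of the BI-specific objects, the underlying check is simple. A schematic sketch in generic Python (the table and column names are hypothetical, and this is not the toolset described above):

import sqlite3

def reconcile(dw, src):
    # Compare the loaded key-figure total with the total read directly from
    # the source; a difference signals a reconciliation exception
    dw_total = dw.execute("SELECT COALESCE(SUM(amount),0) FROM dw_orders").fetchone()[0]
    src_total = src.execute("SELECT COALESCE(SUM(amount),0) FROM src_orders").fetchone()[0]
    if dw_total != src_total:
        print(f"Exception: warehouse total {dw_total} != source total {src_total}")
    return dw_total == src_total

dw, src = sqlite3.connect(":memory:"), sqlite3.connect(":memory:")
dw.execute("CREATE TABLE dw_orders (amount REAL)")
src.execute("CREATE TABLE src_orders (amount REAL)")
dw.executemany("INSERT INTO dw_orders VALUES (?)", [(10,), (20,)])
src.executemany("INSERT INTO src_orders VALUES (?)", [(10,), (20,), (5,)])
print(reconcile(dw, src))   # False -- a loading failure or error is indicated
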
Task “The source systems for a data warehouse are typically transaction processing
applications.” Discuss
Example: A site that sells music CDs might advertise certain CDs based on the age of the
user and the data aggregate for their age group. Online analytic processing (OLAP) is a simple
type of data aggregation in which the marketer uses an online reporting mechanism to process
the information.
Data aggregation can be user-based: personal data aggregation services offer the user a single
point for collection of their personal information from other Web sites. The customer uses a
single master personal identification number (PIN) to give them access to their various accounts
(such as those for financial institutions, airlines, book and music clubs, and so on). Performing
this type of data aggregation is sometimes referred to as “screen scraping.”
As a data aggregator, data is your business, not a byproduct of your business. You buy data,
transform it, scrub it, cleanse it, standardize it, match it, validate it, analyze it, statistically
project it, and sell it. You need a rock-solid data aggregation solution as the foundation of your
operations.
3. Store intermediate results: Sometimes logic for a query can be quite complex. Often, it
is possible to achieve the desired result through the use of subqueries, inline views, and
UNION-type statements. For those cases, the intermediate results are not stored in the
database, but are immediately used within the query. This can lead to performance issues,
especially when the intermediate results have a large number of rows.
The way to increase query performance in those cases is to store the intermediate results in a
temporary table, and break up the initial SQL statement into several SQL statements. In many
cases, you can even build an index on the temporary table to speed up the query performance
even more. Granted, this adds a little complexity in query management (i.e., the need to manage
temporary tables), but the speedup in query performance is often worth the trouble.
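A minimal sketch of this technique (using SQLite purely for illustration; the tables and the query itself are hypothetical):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("east", "cd", 10.0), ("east", "dvd", 5.0), ("west", "cd", 7.0)])

# Step 1: store the intermediate result instead of repeating a subquery
conn.execute("""CREATE TEMP TABLE region_totals AS
                SELECT region, SUM(amount) AS total
                FROM sales GROUP BY region""")

# Step 2: index the temporary table to speed up subsequent lookups and joins
conn.execute("CREATE INDEX temp.idx_region ON region_totals(region)")

# Step 3: the original complex statement is broken up into simpler statements
for row in conn.execute("SELECT * FROM region_totals WHERE total > 8"):
    print(row)   # ('east', 15.0)
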
Specific query optimization strategies:
1. Use Index: Using an index is the first strategy one should use to speed up a query. In fact,
this strategy is so important that index optimization is also discussed.
2. Aggregate Table: Pre-populate tables at higher levels of summarization so that less data
needs to be parsed.
3. Vertical Partitioning: Partition the table by columns. This strategy decreases the amount of
data a SQL query needs to process.
4. Horizontal Partitioning: Partition the table by data value, most often time. This strategy
decreases the amount of data a SQL query needs to process.
5. De-normalization: The process of de-normalization combines multiple tables into a single
table. This speeds up query performance because fewer table joins are needed.
6. Server Tuning: Each server has its own parameters and often tuning server parameters so
that it can fully take advantage of the hardware resources can significantly speed up query
performance.
critical, real-time decision support systems. Below are some of the most common technological
methods developed to address the problems related to data sharing through data propagation.
Bulk Extract: In this method of data propagation, copy management tools or unload utilities are
used to extract all or a subset of the operational relational database. Typically, the extracted
data is then transported to the target database using file transfer protocol (FTP) or other
similar methods. The extracted data may be transformed to the format used by the target on
the host or target server.
The database management system load products are then used to refresh the target database.
This process is most efficient for use with small source files or files that have a high
percentage of changes, because this approach does not distinguish changed from unchanged
records. Conversely, it is least efficient for large files where only a few records have changed.
File Compare: This method is a variation of the bulk move approach. This process compares the
newly extracted operational data to the previous version. After that, a set of incremental change
records is created. The processing of incremental changes is similar to the techniques used in bulk
extract except that the incremental changes are applied as updates to the target server within the
scheduled process. This approach is recommended for smaller files where there only few record
changes.
Change Data Propagation: This method captures and records the changes to the file as part of the
application change process. There are many techniques that can be used to implement Change
Data Propagation such as triggers, log exits, log post processing or DBMS extensions. A file of
incremental changes is created to contain the captured changes. After completion of the source
transaction, the change records may already be transformed and moved to the target database.
This type of data propagation is sometimes called near-real-time or continuous propagation and
is used to keep the target database synchronized with the source system within a very brief period.
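As an illustration of one of those techniques, trigger-based change capture can be sketched as follows (SQLite syntax; the tables and the change-file layout are illustrative assumptions):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INT PRIMARY KEY, amount REAL);
-- File of incremental changes, populated as part of the application change process
CREATE TABLE order_changes (order_id INT, amount REAL, op TEXT,
                            changed_at TEXT DEFAULT CURRENT_TIMESTAMP);
CREATE TRIGGER trg_orders_ins AFTER INSERT ON orders
BEGIN
  INSERT INTO order_changes (order_id, amount, op) VALUES (NEW.order_id, NEW.amount, 'I');
END;
CREATE TRIGGER trg_orders_upd AFTER UPDATE ON orders
BEGIN
  INSERT INTO order_changes (order_id, amount, op) VALUES (NEW.order_id, NEW.amount, 'U');
END;
""")
conn.execute("INSERT INTO orders VALUES (1, 10.0)")
conn.execute("UPDATE orders SET amount = 12.5 WHERE order_id = 1")

# The change file now holds the captured changes, ready to move to the target
print(conn.execute("SELECT order_id, amount, op FROM order_changes").fetchall())
# [(1, 10.0, 'I'), (1, 12.5, 'U')]
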
can avoid mistakes related to schema information. Based on this analysis, we can safely argue
that different roles imply a different collection of quality aspects, which should be ideally treated
in a consistent and meaningful way.
From the previous it follows that, on one hand, the quality of data is of highly subjective nature
and should ideally be treated differently for each user. At the same time, the quality goals of the
involved stakeholders are highly diverse in nature. They can be neither assessed nor achieved
directly but require complex measurement, prediction, and design techniques, often in the form
of an interactive process. On the other hand, the reasons for data deficiencies, non-availability
or reachability problems are definitely objective, and depend mostly on the information system
definition and implementation. Furthermore, the prediction of data quality for each user must be
based on objective quality factors that are computed and compared to users’ expectations. The
question that arises, then, is how to organize the design, administration and evolution of the data
warehouse in such a way that all the different, and sometimes opposing, quality requirements
of the users can be simultaneously satisfied. As the number of users and the complexity of data
warehouse systems do not permit reaching total quality for every user, another question is how
to prioritize these requirements in order to satisfy them with respect to their importance. This
problem is typically illustrated by the physical design of the data warehouse where the problem
is to find a set of materialized views that optimize user requests response time and the global
data warehouse maintenance cost at the same time.
Before answering these questions, though, it is useful to make a clear-cut definition of the
major concepts in these data warehouse quality management problems. To give an idea of the
complexity of the situation, let us present a verbal description of it.

The interpretability of the data and the processes of the data warehouse is heavily dependent
on the design process (the level of the description of the data and the processes of the warehouse)
and the expressive power of the models and languages which are used. Both the data and the
systems architecture (i.e., where each piece of information resides and what the architecture of
the system is) are part of the interpretability dimension. The integration process is related to the
interpretability dimension by trying to produce minimal schemata. Furthermore, processes like
query optimization (possibly using semantic information about the kind of the queried data,
e.g., temporal, aggregate, etc.) and multidimensional aggregation (e.g., containment of views,
which can guide the choice of the appropriate relations to answer a query) are dependent on the
interpretability of the data and the processes of the warehouse.

The accessibility dimension of quality is dependent on the kind of data sources and the design
of the data and the processes of the warehouse. The kind of views stored in the warehouse, the
update policy and the querying processes all influence the accessibility of the information. Query
optimization is related to the accessibility dimension, since the sooner the queries are answered,
the higher the transaction availability is. The extraction of data from the sources also influences
(actually determines) the availability of the data warehouse. Consequently, one of the primary
goals of the update propagation policy should be to achieve high availability of the data
warehouse (and the sources).

The update policies, the evolution of the warehouse (amount of purged information) and the
kind of data sources all influence the timeliness and consequently the usefulness of data.
Furthermore, the timeliness dimension influences the data warehouse design and the querying
of the information stored in the warehouse (e.g., query optimization could possibly take
advantage of temporal relationships in the data warehouse).

The believability of the data in the warehouse is obviously influenced by the believability of
the data in the sources. Furthermore, the level of desired believability influences the design of
the views and processes of the warehouse. Consequently, the source integration should take into
account the believability of the data, whereas the data warehouse design process should also take
into account the believability of the processes. The validation of all the processes of the data
warehouse is another issue, related to every task in the data warehouse environment and
especially to the design process. Redundant information in the warehouse can be used by the
aggregation, customization and query optimization processes in order to obtain information
faster. Also, replication issues are related to these tasks.

Finally, quality aspects influence several factors of data warehouse design. For instance, the
required storage space can be influenced by the amount and volume of the quality indicators
needed (time, believability indicators, etc.). Furthermore, problems like the improvement of
query optimization through the use of quality indicators (e.g., to ameliorate caching), the
modeling of incomplete information of the data sources in the data warehouse, the reduction of
the negative effects that schema evolution has on data quality, and the extension of data
warehouse models and languages so as to make good use of quality information all have to be
dealt with.
Models and tools for the management of data warehouse quality can build on substantial previous
work in the fields of data quality.
A definition and quantification of quality is given, as the fraction of Performance over Expectance.
Taguchi defined quality as the loss imparted to society from the time a product is shipped. The
total loss of society can be viewed as the sum of the producer’s loss and the customer’s loss. It is
well known that there is a tradeoff between the quality of a product or service and a production
cost and that an organization must find an equilibrium between these two parameters. If the
equilibrium is lost, then the organization loses anyway (either by paying too much money to
achieve a certain standard of quality, called the "target", or by shipping low-quality products or
services, which result in bad reputation and loss of market share).
Quite a lot of research has been done in the field of data quality. Both researchers and
practitioners have faced the problem of enhancing the quality of decision support systems,
mainly by ameliorating the quality of their data. In this section we will present the related
work on this field, which more or less influenced our approach for data warehouse quality. A
detailed presentation can be found in Wang et al., presents a framework of data analysis, based
on the ISO 9000 standard. The framework consists of seven elements adapted from the ISO 9000
standard: management responsibilities, operation and assurance cost, research and development,
production, distribution, personnel management and legal function. This framework reviews a
significant part of the literature on data quality, yet only the research and development aspects
of data quality seem to be relevant to the cause of data warehouse quality design. The three
main issues involved in this field are: analysis and design of the data quality aspects of data
products, design of data manufacturing systems (DMS’s) that incorporate data quality aspects
and definition of data quality dimensions. We should note, however, that data are treated as
products within the proposed framework. The general terminology of the framework regarding
quality is as follows: Data quality policy is the overall intention and direction of an organization
with respect to issues concerning the quality of data products. Data quality management is the
management function that determines and implements the data quality policy. A data quality
system encompasses the organizational structure, responsibilities, procedures, processes and
resources for implementing data quality management. Data quality control is a set of operational
techniques and activities that are used to attain the quality required for a data product. Data
quality assurance includes all the planned and systematic actions necessary to provide adequate
confidence that a data product will satisfy a given set of quality requirements.
The quality of the data that are stored in the warehouse is obviously not a process by itself; yet
it is influenced by all the processes which take place in the warehouse environment. As already
mentioned, there has been quite a lot of research in the field of data quality in the past. We define
data quality as a small subset of the factors proposed in other models. For example, our notion of
data quality, in its greater part, is treated as a second-level factor, namely believability. Yet, in our
model, the rest of the factors proposed elsewhere are treated as process quality factors.
The completeness factor describes the percentage of the interesting real-world information entered
in the sources and/or the warehouse. For example, completeness could rate the extent to which a
string describing an address did actually fit in the size of the attribute which represents the address.
The credibility factor describes the credibility of the source that provided the information. The
accuracy factor describes the accuracy of the data entry process which happened at the sources.
The consistency factor describes the logical coherence of the information with respect to logical
rules and constraints. The data interpretability factor is concerned with data description (i.e. data
layout for legacy systems and external data, table description for relational databases, primary
and foreign keys, aliases, defaults, domains, explanation of coded values, etc.)
counterpart which provides specific details over the actual components that execute the activity
and the conceptual perspective which abstractly represents the basic interrelationships between
data warehouse stakeholders and processes in a formal way.
Figure 5.2: The Reasoning behind the 3 Perspectives of the Process Metamodel
Typically, the information about how a process is executed concerns stakeholders who are
involved in the everyday use of the process. The information about the structure of the process
concerns stakeholders who manage it, and the information relevant to the reasons behind this
structure concerns process engineers who are involved in the monitoring or evolution of the
process environment. In the case of data warehouses it is expected that all these roles are covered
by the data warehouse administration team, although one could also envision different schemes.
Another important issue shown in Figure 5.2 is that there is also a data flow at each of the
perspectives: a type description of the incoming and outgoing data at the logical level, where
the process acts as an input/output function, a physical description of the details of the physical
execution for the data involved in the activity and a relationship to the conceptual entities related
to these data, connected through a corresponding role.
Once again, we have implemented the process metamodel in the Telos language, and specifically in the ConceptBase metadata repository. The implementation of the process metamodel in ConceptBase is straightforward, so for reasons of presentation and space we follow an informal, bird's-eye view of the model. Wherever ConceptBase definitions, constraints, or queries are used in the chapter, they are also explained in natural language.
In the sequel, when we present the entities of the metamodel, this will be done in the context of a
specific perspective. In the Telos implementation of the metamodel, this is captured by specializing
the generic classes ConceptualObject, LogicalObject and PhysicalObject with ISA relationships,
accordingly. We will start the presentation of the metamodel from the logical perspective. First,
we will show how it deals with the requirements of structure complexity and capturing of data
semantics in the next two sections. In the former, the requirement of trace logging will be fulfilled
too. The full metamodel is presented in Figure 5.3.
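As a rough illustration of this specialization (not the actual Telos source, which space precludes), the sketch below mirrors the idea in plain Python: perspective-specific entities inherit from the three generic perspective classes, so every metamodel object is classified under exactly one perspective. All class names other than the three generic ones are hypothetical.

# Sketch of the ISA specialization of the three generic perspective classes.
# ConceptualObject/LogicalObject/PhysicalObject come from the metamodel;
# the subclasses are hypothetical examples, not the actual Telos frames.
class ConceptualObject: ...
class LogicalObject: ...
class PhysicalObject: ...

class ProcessRole(ConceptualObject):      # why a process exists (conceptual)
    pass

class ActivityType(LogicalObject):        # input/output structure (logical)
    pass

class ExecutingAgent(PhysicalObject):     # component that runs it (physical)
    pass

for cls in (ProcessRole, ActivityType, ExecutingAgent):
    print(cls.__name__, "ISA", cls.__bases__[0].__name__)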
Case Study Data Warehousing System for a Logistics Regulatory
Body
Business Challenge
The customer, the sole regulatory body for ports and maritime affairs of a leading
republic in the Asia Pacific, protects the maritime and ports interests of the republic in
the international arena by regulating and licensing port and marine services and facilities.
It also manages vessel traffic in the port and ensures navigational safety, port/maritime security and a clean marine environment.
In the wake of increasing vessel traffic through the port and a rapidly growing number of users, the customer required a system that would provide an integrated framework in which end-users could perform online queries, prepare ad hoc reports and analyze data with user-friendly tools.
Mahindra Satyam Role
Mahindra Satyam identified Data Warehouse (DW) as a potential and essential tool for
high-level analysis and planning, and for faster and easier information management.
The role involved developing a data warehouse (DW) application with capabilities such as comparative/variance analysis, trend analysis, slice-and-dice, ranking, drill-down, tabular visualization, alerts and job scheduling.
Business Benefits
1. Helped the customer promote the nation as an important hub port and an international maritime centre
2. Facilitated easier monitoring of the increased vessel movement through the port
5.9 Summary
zz In the process of extracting data from one source and then transforming the data and
loading it to the next layer, the whole nature of the data can change considerably.
zz It might also happen that some information is lost while transforming the data. A
reconciliation process helps to identify such loss of information.
zz One of the major reasons for information loss is loading failures or errors during loading.
zz Data reconciliation is often confused with the process of data quality testing. Even worse, the data reconciliation process is sometimes used to investigate and pinpoint data issues.
zz While data reconciliation may be a part of data quality assurance, the two are not necessarily the same.
zz The scope of data reconciliation should be limited to identifying whether there is any issue in the data at all.
zz The scope should not be extended to automate the process of investigating the data and pinpointing the issues.
5.10 Keywords
Data Aggregation: Data aggregation is any process in which information is gathered and
expressed in a summary form, for purposes such as statistical analysis.
Data Extraction: Extraction is the operation of extracting data from a source system for further
use in a data warehouse environment.
Data Quality: Data quality has been defined as the ratio of performance to expectancy, or (following Taguchi) as the loss imparted to society from the time a product is shipped.
Update Propagation: Update propagation is the distribution of data from one or more source data warehouses to one or more local access databases, according to propagation rules.
Answers: Self Assessment
1. (a) 2. (c)
3. data warehouse 4. Full extraction
5. Data reconciliation 6. integrity
7. interpretability 8. Query optimization
9. Taguchi 10. Conceptual
Books
A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.
Alex Berson, Data Warehousing Data Mining and OLAP, Tata McGraw Hill, 1997.
Alex Berson, Stephen J. Smith, Data warehousing, Data Mining & OLAP, Tata
McGraw Hill, Publications, 2004.
Alex Freitas and Simon Lavington, Mining Very Large Databases with Parallel
Processing, Kluwer Academic Publishers, 1998.
J. Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers,
1993.
Jiawei Han, Micheline Kamber, Data Mining – Concepts and Techniques, Morgan
Kaufmann Publishers, First Edition, 2003.
Matthias Jarke, Maurizio Lenzerini, Yannis Vassiliou, Panos Vassiliadis, Fundamentals of Data Warehouses, Springer.
Michael Berry and Gordon Linoff, Data Mining Techniques (For Marketing, Sales,
and Customer Support), John Wiley & Sons, 1997.
Michael J. A. Berry, Gordon S Linoff, Data Mining Techniques, Wiley Publishing
Inc, Second Edition, 2004.
Sam Anahory, Dennis Murray, Data Warehousing in the Real World, Addison Wesley, First Edition, 2000.
Sholom M. Weiss and Nitin Indurkhya, “Predictive Data Mining: A Practical Guide”,
Morgan Kaufmann Publishers, 1998.
Sushmita Mitra, Tinku Acharya, Data Mining – Multimedia, Soft Computing and
Bioinformatics, John Wiley & Sons, 2003.
Usama Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy
Uthurasamy, Advances in Knowledge Discovery and Data Mining, AAAI Press/ The
MIT Press, 1996.
V. Cherkassky and F. Mulier, Learning From Data, John Wiley & Sons, 1998.
Contents
Objectives
Introduction
6.1 Enterprise Application Integration
6.2 Reasons for Emergence of EAI
6.3 Advantages of Implementing EAI
6.4 EAI Functioning
6.5 Source Integration
6.6 The Practice of Source Integration
6.7 Service-oriented Architecture
6.8 Portal-oriented Application Integration
6.9 Portal Architecture
6.10 Portals and Application Integration
6.11 Portal-oriented B2B Application Integration
6.12 Research in Source Integration
6.13 Towards Systematic Methodologies for Source Integration
6.13.1 EAI Technology Trends
6.13.2 EAI Tools/Products
6.13.3 EAI Solution Types
6.13.4 OSI Model for EAI
6.14 EAI Architecture
6.14.1 Layers of EAI
6.14.2 EAI Software Topology
6.15 EAI Solution
6.15.1 Requirements for Effective EAI Solution
6.15.2 EAI Software Flexibility
6.16 EAI Solution Evaluation
6.16.1 Criteria for Selection
6.16.2 EAI Solution Evaluation Methodology
6.17 EAI Software Checklist
6.18 EAI Market Segmentation
6.19 EAI Implementation
6.20 Types of EAI
6.21 Summary
6.22 Keywords
6.23 Self Assessment
6.24 Review Questions
6.25 Further Readings
Objectives
After studying this unit, you will be able to:
zz Know enterprise application integration
zz Describe practice of source integration
zz Explain research in source integration
zz Describe methodologies of source integration
Introduction
Enterprise Application Integration deals with approaches and mechanisms for integrating multiple applications and delivering end-to-end business processes; each business process supports a specific business use case. Each business process, in turn, includes multiple sub-processes, which perform related tasks towards the overall goal of the business process. In addition, because each business process works across multiple functional areas (typically realized using multiple applications), there is also a need to perform specific business activities that support the overall business process in the context of each functional area.
e-business
e-business requires connecting customers, suppliers and partners across the world, so as to form an integrated value and supply chain over the Internet. Opening up business processes to share information and allow market access requires information to flow transparently and seamlessly, both externally and internally.
Business process automation requires new products and services to be integrated with existing applications, so as to improve efficiency, reduce operating costs and enhance customer service across an organization.
ERP vendors, having realized that ERP solutions must be integrated with back-end legacy applications to be effective, now offer product lines complete with interfaces/adapters that help integrate the ERP solution with other applications.
There is a movement towards the virtual enterprise, linking application systems from various companies in the supply chain. Significant developments in peer-to-peer networking and distributed processing have made it possible for businesses to better integrate their own functional departments, as well as to integrate with their partners and suppliers for better SCM and CRM. Re-engineering of business processes by organizations for greater customer focus requires close cooperation between standalone applications.
Zero latency enterprise refers to an organization that can change its business rules in real time to
act on new market opportunities and customer demands. An enterprise application integration
solution accelerates responses and facilitates business changes in the zero latency enterprise.
In today's competitive business environment, the need to align business systems with business goals is more pressing than ever. Business processes evolve continuously, requiring new methods and data, which in turn require integration with the existing ones. These new applications must become operational quickly, pushing IT management towards shorter application lifecycles. This is made possible by EAI solutions, which help integrate different applications and also assist in changing business rules as required in a minimal amount of time.
Intranet/Internet Explosion
The Intranet/Internet explosion is leading to a surge in demand for a new class of interactive applications that require integration with back-end legacy applications. This, again, is enabled by EAI solutions, which can integrate the front-end and back-end applications.
3. Assists in the formation of a Zero Latency Enterprise: when all functions within the organization work with the same up-to-date information, latency between applications is eliminated or reduced.
4. Updating and integration of applications is possible whenever required. New applications can be created by integrating real-time data from different parts of the enterprise.
5. Assists in rapid business process change.
6. Enables the creation of virtual corporations, with virtual supply chains and operations, through sharing of data beyond the organization.
7. Makes it possible for legacy or proprietary systems to function on the web.
8. Enhancements to standard applications can be made rapidly.
Service-based application integration is not a new approach. We have been looking for mechanisms to bind applications together at the service level for years, including frameworks, transactions, and distributed objects, all in wide use today. However, the new notion of Web services, such as Microsoft's .NET strategy, is picking up steam as we attempt to identify a new mechanism that is better able to leverage the power of the Internet to provide access to remote application services through a well-defined interface and directory service: Universal Description, Discovery and Integration (UDDI).
The uses for this type of integration are endless, including the creation of composite applications,
or applications that aggregate the processes and information of many applications. For example,
using this paradigm, application developers simply need to create the interface and add the
application services by binding the interface to as many Internet-connected application services
as are required.
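As a rough illustration of a composite application, the sketch below aggregates two hypothetical HTTP services behind one function. The URLs and field names are invented for the example, and the requests library stands in for whatever service-invocation mechanism (SOAP, .NET, etc.) is actually in play.

# Sketch of a composite application: one interface bound to two
# Internet-connected services. URLs and fields are hypothetical.
import requests

ORDER_SVC = "https://example.com/api/orders"      # hypothetical endpoint
STOCK_SVC = "https://example.com/api/inventory"   # hypothetical endpoint

def order_summary(order_id: str) -> dict:
    """Aggregate an order record with live stock levels for its items."""
    order = requests.get(f"{ORDER_SVC}/{order_id}", timeout=5).json()
    stock = {item["sku"]:
             requests.get(f"{STOCK_SVC}/{item['sku']}", timeout=5).json()["on_hand"]
             for item in order["items"]}
    return {"order": order, "stock": stock}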
The downside, at least with service-based integration, is that this makes it necessary to change
the source and target applications or, worse in a number of instances, to create a new application
(a composite application). This has the effect of adding cost to the application integration project
and is the reason many choose to stay at the information level.
Still, the upside of this approach is that it is consistent with the “baby step” approach most
enterprises find comfortable when implementing solutions to integration problems. Service-based
solutions tend to be created in a series of small, lower-risk steps. This type of implementation can
be successful from the department to the enterprise to the trading community, but never the other way around, from the trading community to the department.
Service-oriented integration (SOI) is defined as integrating computing entities using only service
interactions in a service-oriented architecture. Service-oriented integration addresses problems
with integrating legacy and inflexible heterogeneous systems by enabling IT organizations to
offer the functionality locked in existing applications as reusable services.
In contrast to traditional enterprise application integration (EAI), the significant characteristics of service-oriented integration are:
1. Well-defined, standardized interfaces: Consumers are provided with easily understood
and consistent access to the underlying service.
2. Opaqueness: The technology and location of the application providing the functionality is
hidden behind the service interface. In fact, there is no need for a fixed services provider.
3. Flexibility: Both the providers of services and the consumers of services can change; the service description is the only constant. Provided both the provider and the consumer continue to adhere to the service description, the applications will continue to work (see the sketch below).
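The sketch below illustrates the "service description is the only constant" point in plain Python: the consumer depends only on an abstract interface, so the concrete provider can be swapped without touching the consumer. The interface and providers are hypothetical.

# Sketch: the service description (interface) is the only constant;
# providers behind it can change freely. Names are hypothetical.
from abc import ABC, abstractmethod

class QuoteService(ABC):                     # the stable service description
    @abstractmethod
    def quote(self, sku: str) -> float: ...

class LegacyMainframeQuotes(QuoteService):   # one provider
    def quote(self, sku: str) -> float:
        return 10.0                          # pretend screen-scraped price

class NewWebServiceQuotes(QuoteService):     # a replacement provider
    def quote(self, sku: str) -> float:
        return 9.5                           # pretend SOAP/REST call

def consumer(svc: QuoteService) -> None:     # consumer sees only the interface
    print("price:", svc.quote("SKU-1"))

consumer(LegacyMainframeQuotes())            # provider swapped,
consumer(NewWebServiceQuotes())              # consumer unchanged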
Service-oriented Application Integration (SOAI) allows applications to share common business logic or methods. This is accomplished either by defining methods that can be shared, and therefore integrated, or by providing the infrastructure for such method sharing, such as Web services (Figure 6.1). Methods may be shared by being hosted on a central server, by accessing them across applications (e.g., via distributed objects), or through standard Web services mechanisms, such as .NET.
Sounds great, doesn't it? The downside might give you pause, however. This "great-sounding" application integration solution also confronts us with the most invasive level of application integration, and thus the most costly. This is no small matter if you're considering Web services, distributed objects, or transactional frameworks.
While IOAI generally does not require changes to either the source or target applications, SOAI
requires that most, if not all, enterprise applications be changed in order to take advantage of the
paradigm. Clearly, this downside makes SOAI a tough sell. However, it is applicable in many
problem domains. You just need to make sure you leverage SOAI only when you need it.
Changing applications is a very expensive proposition. In addition to changing application logic, there is the need to test, integrate, and redeploy the application within the enterprise, a process that often causes costs to spiral upward. This seems to be the case no matter whether you are approaching SOAI with older technologies such as the Common Object Request Broker Architecture (CORBA) or newer technologies such as .NET, the latest service-based architecture to come down the road.
Before embracing the invasiveness and expense of SOAI, enterprises must clearly understand both its opportunities and its risks. Only then can its value be evaluated objectively. The opportunity to share application services that are common to many applications, and therefore to integrate those applications, represents a tremendous benefit. However, that benefit comes with the very real risk that the expense of implementing SOAI will outpace its value.
Task “The EAI solution works at both data level and business process level and
assists in sharing data of different applications.” Discuss.
Business Needs
Importance of EAI
Typically, a business process involves interactions among various organizational units, which means that automating a business process requires interactions with the various applications in an organization. The major challenges the IT organization faces when integrating these applications relate to the integration of different domains, architectures and technologies. These challenges necessitate a well-planned EAI strategy and architecture. There are two main forms of EAI. The first integrates applications within a company (intra-EAI) and serves the first business need. The second form (inter-EAI) relates to B2B integration and serves the second business need.
There are several strategies available for EAI. They are listed below.
1. Data Level Integration: Data are shared or moved among the applications.
2. Applications Interface Integration: One application can share some functionality residing
in other applications. It allows sharing application components.
3. Business Method Integration: One application can share business services provided by the other applications.
4. Presentation Integration: Provides unified view of data to the end user.
5. B2B Integration: Provides integration of applications residing in two different
organizations.
Role of SOA
The best strategy for EAI is Business Method Integration, which allows one application to use the business services provided by other applications. It makes B2B integration easier, which boils down to the choice of technologies for protocol and transport: a protocol defines the 'language' for the communication, and the transport carries the messages, as per the protocol, from one application to another. Service-oriented Architecture (SOA) acts as an enabler of the Business Method Integration strategy. SOA is a proponent of business-driven rather than technology-driven application architecture, in which a business service can be readily mapped to a technology component in an application. Embracing SOA, along with support for multiple protocols (native, SOAP, XML) and multiple transports (TCP/IP, MQ, HTTP) for service invocations, truly allows us to implement a flexible and extensible EAI architecture.
Portals have become so common, and so much has been written about them, that we will cover just the basic concepts here. The important point to remember in the context of application integration is that portals have become the primary mechanism by which we accomplish application integration. Whether that is good, bad, or indifferent doesn't really matter; it is simply the way it is. Trading partners have extended the reach of internal enterprise systems by utilizing the familiar web browser interface.
POAI by Example
An example of POAI is an automobile parts supplier that would like to begin selling parts to retail stores (B2B) using a portal. This portal would allow the retail stores to access catalog information, place orders, and track orders over the web. Currently, the parts supplier leverages SAP as its preferred inventory control system, while a custom-built mainframe application written in COBOL/DB2 serves as its sales order system. Information from each system is required for the B2B portal, and the portal allows users to update those back-end systems as well.
In order to create a portal, the parts supplier must design the portal application, including the
user interface and application behavior, as well as determine which information contained within
the back-end systems (SAP and the mainframe) needs to be shared with the portal application.
The portal application requires a traditional analysis and design life cycle and a local database.
This portal application must be able to control user interaction, capturing and processing errors
and controlling the transaction from the user interface all the way to the back-end systems.
Although you can employ many types of enabling technologies when creating portals, most portals are built using application servers. Application servers provide the interface development environments for designing the user interface, a programming environment to define application behavior, and back-end connectors to move information in and out of back-end systems, including SAP and mainframe systems. Although it does not integrate the applications directly, the portal externalizes the information to the trading partner (in this case, the owner of a retail auto parts store) and also updates the back-end systems (in this case with orders placed by the store owner, or perhaps with the status of existing orders).
Other examples of portals include entire enterprises that are integrated through a single portal application. As many as a dozen companies may use that portal, B2B, to purchase goods and services from many companies at the same time. The same type of architecture and enabling technology applies in this case; however, the number of systems integrated with the portal application greatly increases.
Portal Power
The use of portals to integrate enterprises has many advantages. The primary one is that there is no need to integrate back-end systems directly between companies or within enterprises, which eliminates the associated cost and risk. What's more, you usually don't have to worry about circumventing firewalls or application-to-application security, because portals typically do nothing more than web-enable existing systems from a single enterprise. With portals, you simply connect to each back-end system through a point of integration and externalize the information into a common user interface. Of course, portals themselves are applications and must be designed, built, and tested like any other enterprise application.
Portal-oriented B2B application integration also provides a good facility for Web-enabling
existing enterprise systems for any purpose, including B2B and business-to-consumer (B2C)
selling over the Web. If you need to move information to a user interface for any reason, this is
the best approach.
In many B2B application integration problem domains, the users prefer to interact with the back-end systems through a user interface rather than have the systems automatically exchange information behind the scenes (as in data-oriented B2B application integration). Today more B2B information flows through user interfaces (portal-oriented B2B application integration) than automatically through back-end integration. However, the trend is moving from portals to real-time information exchange. We will eventually remove from the equation the end user, who is the most obvious point of latency when considering portal-oriented B2B application integration.
The advantages of portal-oriented integration are clear:
1. It supports a true noninvasive approach, allowing other organizations to interact with a
company’s internal systems through a controlled interface accessible over the Web.
2. It is typically much faster to implement than real-time information exchange with
back-end systems, such as the data-, method-, and application interface–oriented
approaches.
3. Its enabling technology is mature, and many examples of portal-oriented B2B application
integration exist to learn from.
However, there are also disadvantages to portal-level B2B application integration:
1. Information does not flow in real time and thus requires human interaction. As a result, systems do not automatically react to business events within a trading community (such as the depletion of inventory).
2. Information must be abstracted, most typically through another application logic layer (such as an application server). As a result, some portal-oriented solutions actually add complexity to the solution.
3. Security is a significant concern when enterprise and trading community data is being extended to users over the Web.
The notion of POAI has gone through many generations, including single-system portals, multiple-enterprise-system portals, and now, enterprise portals.
Single-system Portals
Single-system portals, as you might expect, are single enterprise systems that have their user interfaces extended to the web (Figure 6.3).
A number of approaches exist to create a portal for a single enterprise system, including
application servers, page servers, and technology for translating simple screens to HTML.
Trading Community
When the multiple-enterprise-system portal is extended to include systems that exist within
many companies, the result is an enterprise portal (Figure 6.4).
Web Clients
The Web client is a PC or any device that runs a web browser and is capable of displaying HTML and graphics. The web browser makes requests to the web server and processes the files the web server returns. Rather than exchanging messages, the web client exchanges entire files. Unfortunately, this process is inefficient and resource intensive. Still, with the web as our preferred common application platform, these drawbacks are inevitable.
Today, web browsers need not run on PCs. They can also run on wireless devices such as personal digital assistants (PDAs) and cellular phones.
Web Servers
Web servers, at their core, are file servers. Like traditional file servers, they respond to requests from web clients and then send the requested file. Web servers are required with portals because the information coming from the application server must be converted into HTML and pumped down to the web browser using HTTP. HTML, graphics, and multimedia files (audio, video and animation) have traditionally been stored on web servers.
Today's web servers pull double duty. Not only do they serve up file content to hordes of web clients, but they perform rudimentary application processing as well. With enabling technologies such as the Common Gateway Interface (CGI), the Netscape Application Programming Interface (NSAPI), and the Internet Server Application Programming Interface (ISAPI), web servers can query the web client for information and then, using web server APIs, pass that information to an external process that runs on the web server (Figure 6.6). In many cases, this means users can access information on a database server or on application servers.
Figure 6.6: Using the Web Server API to Customize Information sent to the Client Level
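As a toy illustration of a web server doing rudimentary application processing (not CGI/NSAPI/ISAPI themselves, but the same query-the-client, call-a-back-end, return-HTML pattern), the sketch below uses Python's standard http.server; the lookup function and port are hypothetical.

# Toy sketch: a web server reads a query parameter from the client and
# passes it to a back-end function, then returns HTML. The port and the
# lookup function are hypothetical stand-ins for a real back-end process.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

def lookup_order(order_id: str) -> str:          # stand-in back-end process
    return f"status of {order_id}: SHIPPED"

class PortalHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        qs = parse_qs(urlparse(self.path).query)  # query the client's request
        body = lookup_order(qs.get("order", ["?"])[0]).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<html><body>" + body + b"</body></html>")

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), PortalHandler).serve_forever()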
Database Servers
Database servers, when leveraged with portals, work just as they do in more traditional client/server architectures: they respond to requests and return information. Sometimes the requests come from web servers that communicate with the database server through a process existing on the web server. Sometimes they come directly from web clients communicating with the database server via a Call-Level Interface (CLI), such as JDBC.
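For a concrete (if minimal) picture of a call-level interface, the sketch below uses Python's built-in sqlite3 DB-API in the role JDBC plays for Java clients; the table and rows are invented for the example.

# Minimal call-level-interface sketch using Python's DB-API (sqlite3),
# standing in for JDBC. Table name and rows are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parts (sku TEXT, on_hand INTEGER)")
conn.executemany("INSERT INTO parts VALUES (?, ?)",
                 [("SKU-1", 40), ("SKU-2", 0)])

# The client issues a parameterized request; the server returns rows.
for sku, on_hand in conn.execute(
        "SELECT sku, on_hand FROM parts WHERE on_hand > ?", (0,)):
    print(sku, on_hand)
conn.close()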
Back-end Applications
Back-end applications are enterprise applications existing either within a single enterprise or
across many enterprises. These are typically a mix of ERP systems, such as SAP R/3 or PeopleSoft,
custom applications existing on mainframes, and newer client/server systems. Portals gather the
appropriate information from these back-end systems and externalize this information through
the user interface.
Although the mechanism employed to gather back-end information varies from technology to
technology, typically, portal development environments provide connectors or adaptors to link
to various back-end systems, or they provide APIs to allow developers to bind the back-end
systems to the portal technology.
Application Servers
Application servers work with portal applications by providing a middle layer between the back-end applications, databases, and the web server. Application servers communicate with both the web server and the resource server using transaction-oriented application development. As with three-tier client/server systems, application servers bring load balancing, recovery services, and fail-over capabilities to the portal.
Task Look around you, choose a small business, and discuss how POAI could be applied to that business.
Banking
The basis of competition in banking and financial services is customer retention. Customers with multiple accounts are less likely to change, but most financial institutions have stovepipe systems for credit cards, checking, savings, mortgage, brokerage, and other services. An EAI implementation would integrate the systems so that a data warehouse can aggregate account data, provide a single view of the customer, and recommend what additional products the customer should be offered. In EAI systems instituted at Bank of America (one of the largest commercial banks in the United States) and the Royal Bank of Canada (one of Canada's largest commercial banks), a transaction in one account triggers an event in another process for a marketing contact.
Manufacturing
Manufacturing's key competitive measure is cost reduction through inventory control and just-in-time processes. Allowing outside suppliers to view inventory data is possible with enterprise resource planning applications (software systems designed to support and automate the business processes of medium and large businesses), but EAI provides a way for the manufacturer's systems to communicate with external systems to track sales, forecast demand, and maintain a detailed view of pricing and availability among many suppliers. Covisint, co-founded by General Motors, is a custom vehicle order management system that interfaces with inventory, planning, and logistics applications in the supply chain.
Life Sciences
In the life sciences industries, being first to market is critical, since patents on new compounds
expire within 17 years from the time the first documents are submitted to the Food and Drug
Administration. EAI applications in pharmaceuticals, biotechnology companies, and research
institutions integrate data from diverse laboratory systems with clinical data and other core
systems so that it can be available to analytical applications. Several such projects are quietly
underway within U.S. companies working on human and animal genome projects.
The initial focus of EAI was at the data level, i.e., moving or replicating data among databases, but it is evolving into business process automation. Present EAI technology differs from earlier EAI solutions in that its focus is on integrating enterprise applications, not data or an assortment of different application types. Also, an EAI solution can be reused for many other needs, not just on the same platform but also across heterogeneous platforms and networks and between multiple suppliers' packaged applications. The other differences between past and present EAI solutions are that integration is now at the business process and practices level, not at the application or database level, and that the middleware is transparent to the user, so specific expertise in particular application-infrastructure technologies is not required.
The Enterprise Application Integration trends are as follows:
1. Point-to-point Interfaces
2. Integration with Packaged integration brokers
Disadvantages/Constraints
1. If many applications are connected, this leads to inter-application spaghetti.
2. The approach is labor intensive and involves high cost and risk. It also does not help if applications need to be changed or added.
3. The maintenance costs are also huge.
1. Scalability
(a) For content based and subject based routing
(b) For incrementing applications
2. Advanced team development and management capability: version control, source code management, etc.
3. Handle batch as well as near real time integrations
4. Handle integration of mainframe as well as client/server capability
5. Low to moderate learning curve
6. Strong service and support capabilities to assist with project management
7. Vendor reputation
There are many types of products that have one or more functionalities of EAI. These include MOM (message-oriented middleware) systems, publish/subscribe systems, transaction processing monitors, application servers, data warehouse and data mart systems, and logical integration systems. On the basis of the level of integration the tools perform, EAI solutions can be broadly categorized into data-level products and business-model-level products.
The various products, which support the movement of data between applications, are:
1. File transfer tools
2. Copy management
3. Data propagation
4. Schema-specific data synchronization
5. Database replication
6. Extraction/Transformation
Only extraction/transformation products are capable of getting data directly into and/or out of an application's data store; they can also change the format of the source data so that it fits the target. They form an important product group of EAI solutions.
Extraction/transformation products are of three types:
1. Code Generators
2. Transformation Engines
3. Data Warehouse and Data mart Loaders
Code Generator
The code generator assists in the manual coding of programs by extracting data from an application and transforming it for loading into another application. This is useful for a simple application network.
Disadvantages
1. The resulting program is not independent of the source or target system, so integrating with more than one system requires extra programming/processing.
2. The desired level of data movement may not be achieved, so modifications have to be made to the generated code.
3. The language used for the generated program may differ from system to system.
4. Scalability is a major concern, as the integration is point-to-point.
5. Modifying an application can require major regenerations of, and modifications to, existing interfaces.
Transformation Engines
Transformation engines use application metadata to create export-transform-load programs, like code generators. The difference is that all code is executed at a central location, independent of the source and target. This works by getting the source data and moving it to a separate location where the transformation takes place.
Advantages
1. The centralized approach assists in scalability.
2. Rapid interface development.
3. Data staging.
4. For large volumes of data, some tools provide a transient data store where excess data is processed.
5. The same development environment and tools can be used for all application interfaces, so there is minimal impact on the source and target systems.
6. It is very useful for large data volumes.
Disadvantage
As all transformation is done at a centralized location, the central engine can become a bottleneck, limiting the scalability of these tools.
Data Warehouse and Data Mart Loaders
Data warehouse and data mart loaders can be found in either code generator or engine/hub forms. The focus is on transforming operational data into a form that can be loaded into a very specific type of data store. Data aggregation is required so as to transform data in an application network.
Disadvantages
Warehouse loaders do not meet the fault-tolerance or performance requirements that would make them viable for linking together a host of operational systems.
Message Brokers
With a message broker, neither the source nor the target applications need to know in advance which application is involved. This makes the message broker the most scalable and best EAI option, as the source and/or the target applications and/or the business process logic can be changed without interrupting the whole system. The message broker is a set of software components that allows applications to communicate with each other through non-invasive, bi-directional exchange of messages.
At a high level, there are primarily two types of EAI solution: data-level integration and message-based integration. Data-level integration basically assists applications in the exchange and sharing of data across a common data store; for inter-enterprise application integration at the data level, Extensible Markup Language (XML) is very useful. Message-based application integration is based on messaging software that is network-aware, and is nearer to the complete EAI solution. Message-oriented middleware products are thus becoming increasingly popular. Most EAI software offers tools to model business processes and link the applications with middleware so that each application can communicate via data messages.
The Open System Interconnection Model for EAI contains 12 layers as against the seven-layered
structure for network applications. The various layers are as follows (Table 6.1):
The EAI solutions can be categorized as a three-layer solution on the basis of the level of
integration and functionality. The three specific layers to EAI solution are:
1. Communications
2. Routing and brokering
3. Business Intelligence
Communications
The communications layer comprises tools that assist in accessing data sources, inter-process communication, network transports, and the representation of messages that pass between applications. It includes facilities for distributing processing over a network and covers technologies such as TCP/IP, publish and subscribe, database server protocols and middleware, multicast IP, asynchronous messaging, and remote procedure calls. The communications layer essentially views the world as a set of data sources.
Routing and Brokering
In this layer, some amount of decision-making and processing capability can be found. The primary job of this layer is to aggregate, broker, transform, filter, and format data so it can be understood by the other systems that are connected by the EAI solution.
Business Intelligence
The Business Intelligence layer plays a critical role in achieving the virtual application. This layer provides an environment that responds to messages from the routing and brokering layer. It then uses a set of declarative rules to make intelligent business decisions based on company goals. This layer connects to rules analyzers and On-Line Analytical Processing (OLAP) services to assist in the decision-making process. It is essential for companies to build this layer for a more proactive and competitive approach to conducting business.
The integration topology is a major consideration when building an EAI architecture to meet the diverse set of integration requirements in an organization. Selecting the right topology improves integration performance and event management and reduces maintenance costs.
Types of software topology are:
1. Hub/star topology
2. Bus topology
3. Point-to-point topology
4. Pipeline topology
5. Network topology
Hub/Star Topology
Hub topology is useful for creating a central point of control. Messages are sent from the source applications to a central hub, which often runs on a single machine. Hub topology works well if business events are independent and if the Message-Oriented Middleware (MOM) on which the topology is based is from a single vendor. Here the source application sends a single message in one format, and the hub reformats the message as necessary and relays it to the various spokes connected to the hub (see the sketch below).
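To make the hub-and-spoke flow concrete, here is a minimal sketch of a hub that reformats one inbound message and relays it to the registered spokes; the message shape and handlers are invented for the illustration.

# Minimal hub-and-spoke sketch: the hub reformats each source message
# and relays it to every registered spoke. All names are hypothetical.
from typing import Callable

class Hub:
    def __init__(self) -> None:
        self.spokes: list[Callable[[dict], None]] = []

    def register(self, spoke: Callable[[dict], None]) -> None:
        self.spokes.append(spoke)

    def send(self, raw: str) -> None:
        # Reformat: parse "sku=SKU-1;qty=5" into the dict the spokes expect.
        msg = dict(kv.split("=") for kv in raw.split(";"))
        for spoke in self.spokes:
            spoke(msg)

hub = Hub()
hub.register(lambda m: print("inventory spoke got", m))
hub.register(lambda m: print("billing spoke got", m))
hub.send("sku=SKU-1;qty=5")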
Disadvantages
1. Most available hubs cannot handle incoming transactions from any source other than the middleware on which they operate.
2. They cannot manage integration events involving multiple sources and destinations.
3. If a database is required, it can become a source of processing or routing bottlenecks as volumes grow and integration rules become complex.
Bus Topology
Bus topology is useful for distributing information to many destinations. Source applications put messages onto a system-wide logical software bus that is accessible to other applications. One or more applications can then selectively subscribe to the messages broadcast on the bus. Traffic does not need to flow through a central switching point; this is possible only with publish-and-subscribe middleware. Bus topology thus circumvents the bottleneck problem.
Point-to-point Topology
Point-to-point topology enables applications to communicate directly with one another. This is useful when synchronous communication and persistence are required. Applications with pre-built integration for ERP applications use this topology. Too much point-to-point integration in an organization's IT structure leads to inter-application spaghetti. The benefit of this topology is its ability to take full advantage of the context and semantics of the original data as it is transformed into one or more target structures. The major constraint is that if there is any change in either of the applications, such as an upgrade, the whole integration has to be changed.
Pipeline Topology
Pipeline topology is useful if dynamic configuration is not required and multiple pipelines are independent of each other. Information flows follow a First-In-First-Out (FIFO) approach. This is a very simple level of integration.
Network Topology
Network topology is the best option available if there is a lot of asynchronous activity and
independent transactions must coexist with one another. For this topology to work well, the
interfaces must be well defined and robust. If there is a snag at the interface level then the entire
network communication can fail.
Task E-business is business done on the Internet. Discuss B2B and B2C business.
1. IT strategy needs to be mapped out according to the business strategy and the objectives
2. Complete understanding of the business processes, data models and the supporting systems and applications currently in place
3. Planning for the whole process, right from need identification and vendor selection through to implementation and future requirements
4. The EAI architecture, viz., process models and integration requirements, has to be
formulated from the IT strategy and architecture
5. Evaluate the EAI tools and the vendors
6. Accountability and ownership have to be established
7. Evaluate the solutions and the scope of integration covered by the technology
EAI software must be implemented with five layers of technology for flexibility. The different
layers are as follows:
1. Business Process Support
2. Transportation
3. Services
4. Interfaces
5. Transformation
Business Process Support
The EAI solution set has tools that let users visually diagram business processes and declare rules for each message. This is useful for visualizing business processes and thereby controlling different activities and easing the flow of information. Intelligent routing capability, which can look at a message and figure out the next course of action, is required in an EAI solution.
Transportation
Data can be routed point-to-point or with an architecture called publish/subscribe, in which applications send messages to other applications that have registered interest with the message broker. The application sending information is the publisher, and the one receiving information is the subscriber. Depending on the network and the platforms the applications reside on, this can be done with middleware such as database drivers, component object models, or messaging middleware.
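The sketch below shows the publish/subscribe pattern in miniature: subscribers register interest in a topic with a broker, and the publisher never addresses them directly. Topic names and payloads are invented for the example.

# Minimal publish/subscribe sketch: subscribers register interest with
# the broker; publishers only name a topic. All names are hypothetical.
from collections import defaultdict
from typing import Callable

class Broker:
    def __init__(self) -> None:
        self.topics: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self.topics[topic].append(handler)

    def publish(self, topic: str, message: dict) -> None:
        for handler in self.topics[topic]:
            handler(message)

broker = Broker()
broker.subscribe("orders.created", lambda m: print("billing saw", m))
broker.subscribe("orders.created", lambda m: print("shipping saw", m))
broker.publish("orders.created", {"order_id": 42})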
Services
These are the facilities messages need to carry out their missions successfully. The services that should be present include:
1. Queuing, to store messages if the receiving application is slower than the sending one
2. Transactional integrity, to confirm that a transaction has completed before a message is sent or acknowledged as received
3. Message priority, error handling, and hooks that let network management tools control the network traffic
Interfaces
Access to an application is through its interfaces. Interfaces interact with the application either via the descriptions the application provides to its platform's component model or by taking advantage of the program's Application Programming Interface. The interfaces thus play an important role in selecting an EAI tool: they should be such that no (or minimal) coding is required while integrating.
Transformation
As the data format is not the same for all applications, tools are required that let users visually map one application's data format onto another application's data format and transform the information as required.
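A toy version of such a mapping is sketched below: a declarative field map drives the transformation of a source record into the target format. The field names and the unit conversion are invented for the example.

# Toy transformation sketch: a declarative field map converts one
# application's record format into another's. Names are hypothetical.
SOURCE = {"cust_nm": "Acme Pte Ltd", "amt_cents": 129900}

FIELD_MAP = {                 # target field -> (source field, converter)
    "customerName": ("cust_nm", str),
    "amountDollars": ("amt_cents", lambda c: c / 100),
}

def transform(record: dict) -> dict:
    return {tgt: conv(record[src]) for tgt, (src, conv) in FIELD_MAP.items()}

print(transform(SOURCE))  # {'customerName': 'Acme Pte Ltd', 'amountDollars': 1299.0}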
Information Group has come out with an evaluation methodology for EAI solutions based on seven criteria, which can be used to compare different solutions. The point to note here is that these criteria are customer-specific, i.e., depending on the customer's requirements, the importance of each criterion varies. The criteria to be checked are:
1. Adapter/ Connector Fit
2. Tools Productivity/ Quality
3. Runtime Quality/ Scalability
4. Runtime Fit to Purpose
5. Business Process Support
6. Integrator Resources
7. Purchase and Ownership Cost
Adapter/Connector Fit
This measures the extent to which pre-built or pre-configured adapters or connectors are provided in the solution. The rating depends on which packaged applications and legacy environments need to be integrated, and on a quantitative and qualitative assessment of the adapters/connectors. The important point to consider in the assessment is the impact of the available adapters/connectors on time to market: a high rating means the amount of pre-built connectivity will accelerate the integration project.
Tools Productivity/Quality
The productivity of the integration development work depends on the quality of the tools provided, so this criterion matters more where adapters/connectors are not available. The more custom integration work there is to do, the more vital this criterion becomes. It also determines the flexibility and the maintenance cost of the system.
Runtime Quality/Scalability
Scalability is important as it determines the speed of the system. Quality of service includes the level of reliability or guarantee of delivery and the level of transaction integrity. Throughput, latency and efficiency may also be considered when assessing quality of service.
Runtime Fit to Purpose
There are four main capabilities, required in different combinations:
1. Transactional real-time component integration
2. Queued messaging model
3. Publish and subscribe messaging
4. Bulk data movement
Business Process Support
All integration scenarios require business process support. There are two ways this can be taken care of:
1. Facilities to model business processes and to generate or execute an automated version of the process are included in the integration solution
2. Specific business processes come pre-configured as part of the solution
Purchase and Ownership Cost
Price sensitivity is high in this category, as there is very little differentiation.
Topology Independence
The architecture to select for connecting an integrated process depends on various factors, such as performance, timing requirements and event coordination. Therefore an open EAI topology has to be chosen, not one restricted to the hub, the bus or any other single approach. Flexibility is the key word.
Business processes often are required to be platform independent. So the EAI software should be
flexible enough to execute the process on any platform.
The EAI software should focus on the business process and not on the underlying technology used to transfer the data. Good EAI software provides pre-built adaptability for all middleware categories, such as MOM, publish/subscribe middleware and ORBs.
The EAI software should not only support message routing but also provide direct access to databases, files, email systems, etc. without separate steps, i.e., such access should be a part of the integrated process.
The EAI software should not only create and maintain the adapters from application metadata, but also provide descriptions with semantics and syntax, eliminating the need for coding.
The EAI software should provide a graphical environment for describing the processes, and should also have provision for acknowledging events, triggering execution, intelligently routing data and ensuring transactional integrity across the entire integration scenario.
Real-time events triggering business processes have to be monitored and managed to ensure that they achieve a coordinated result. The software should also include a run-time environment which supports active listening, event coordination and multi-threaded processing.
EAI software should handle the complexities of business process integration by itself, without resorting to hand coding.
High Performance
As business processes involve high transaction volumes or complex rules, the EAI software should prevent bottlenecks and should offer features like multi-threading and multi-processing, along with performance-monitoring tools.
Proven Implementation
The EAI software should be proven and in use by other customers, so as to minimize risk, as business process integration is a mission-critical task.
Platform Integration
This provides connectivity among heterogeneous hardware, operating systems and application platforms. The technologies providing platform integration are:
1. Messaging: for asynchronous connectivity
2. Remote Procedure Calls: for synchronous connectivity
3. Object Request Brokers: for both types of connectivity
The logic for connecting each application must be defined either through code or through pre-coded application adapters. Additional functionality is required to reconcile the differences in data representation between the systems; this can be provided by hand coding or by data translation and transformation products. Logic is also required for message routing, provided either through hand coding or by a message broker.
Monitoring and management of the end-to-end business process has to be done through hand coding or automated process-management tools.
Data Integration
Component Integration
Hub-and-spoke integration: the hub provides some of the integration. Application servers are used to provide data access to a variety of relational database sources. Adapters to packaged applications and middleware services such as messaging are also available.
Application Integration
Application integration provides a framework of technologies for near-real-time processing. The framework includes:
1. Underlying platform integration technology
2. Event integration through message brokers that provide data translation
3. Transformation and rules-based routing
4. Application interface integration provided through application adapters to packaged applications
5. Custom applications
Integration frameworks assist in reducing the complexity of creating, managing and changing an integration solution. The advantage is faster time to market through pre-built adapters and a reusable integration infrastructure.
Process Integration
This provides the highest level of abstraction and adaptability in an EAI solution. It enables managers to define, monitor and change business processes through a graphical modeling interface.
Business process modeling helps business users and analysts define how information flows across systems and organizational boundaries through a graphical model and a declarative language instead of programming. The integration solution is generated from the model. When changes are required, they can be made in the model and the solution regenerated. Simulation can also be done before the solution is implemented.
Database Linking
This basically means linking two or more databases so that information is shared between them at some point. Information can be exchanged (with duplicate information maintained) or shared. This is the simplest and earliest form of EAI.
Application Linking
This is more complex than database linking. Application linking means that both processes and data are integrated between two or more applications. The advantage is that redundant business processes are not created, as processes are shared between applications.
Data Warehousing
This is similar to database linking. Data warehousing is the collection of meaningful data from several data sources to support decision-making efforts within an organization. Data from different data stores is extracted, aggregated and migrated into a data mart or data warehouse. EAI assists in real-time data warehousing.
Virtual System
A virtual system means that for any transaction, the information required will be available irrespective of where it exists. EAI helps to integrate diverse systems so that they appear as one monolithic, unified application.
Data Level
Data-level EAI is the process/technology for moving data between data stores, i.e., extracting information from one database, processing it if needed, and updating the information in another database. The extracted data may also be transformed, with business logic applied before it is loaded.
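A minimal data-level integration sketch follows, using two in-memory SQLite databases as stand-ins for the source and target stores; the tables and the currency-style transformation are invented for the example.

# Data-level EAI sketch: extract from a source store, transform,
# and load into a target store. Tables and values are hypothetical.
import sqlite3

src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE sales (sku TEXT, amt_cents INTEGER)")
src.executemany("INSERT INTO sales VALUES (?, ?)",
                [("SKU-1", 500), ("SKU-2", 1250)])

tgt = sqlite3.connect(":memory:")
tgt.execute("CREATE TABLE sales_dollars (sku TEXT, amount REAL)")

# Extract -> transform (cents to dollars) -> load.
rows = [(sku, cents / 100) for sku, cents in
        src.execute("SELECT sku, amt_cents FROM sales")]
tgt.executemany("INSERT INTO sales_dollars VALUES (?, ?)", rows)

print(list(tgt.execute("SELECT * FROM sales_dollars")))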
The business model level is divided into three categories:
1. Application program interface level
2. Method level
3. User interface level
Application Program Interface Level
Here the interfaces of custom or packaged applications are used for integration. Developers use the interfaces to access business processes and information, so as to integrate various applications and share business logic and information.
Method Level
The business logic is shared between different applications within the enterprise. The methods
of various applications can be accessed without rewriting each method within the respective
applications.
User Level
User Interfaces are used to tie different applications together. This process uses windows and
menus to get the relevant data that needs to be extracted and moved to other applications and
data stores.
Case Study Healthcare Middleware Solution
Middleware/data management solution to significantly streamline laboratory workflow.
The Client
Our client is a software company that develops value-added solutions in data management
for clinical laboratories. Our client is experienced in the most advanced Microsoft and
Internet technologies.
6.21 Summary
zz Enterprise Application Integration is an integration framework composed of a collection of
technologies and services which form a middleware to enable integration of systems and
applications across the enterprise.
zz Enterprise application integration (EAI) is the process of linking applications within a single organization together in order to simplify and automate business processes to the greatest extent possible, while at the same time avoiding having to make sweeping changes to the existing applications or data structures. In the words of the Gartner Group, EAI is the "unrestricted sharing of data and business processes among any connected application or data sources in the enterprise."
6.22 Keywords
CRM: CRM (Customer Relationship Management) is a comprehensive strategy and process of
acquiring, retaining and partnering with selective customers to create superior value for the
company and the customer.
e-business: e-business is the use of the Internet and other networks and information technologies
to support electronic commerce, enterprise communications and collaboration, and web-enabled
business processes both within an internetworked enterprise, and with its customers and business
partners.
e-commerce: It is a general concept covering any form of business transaction or information
exchange executed using information and communication technologies.
Enterprise Application Integration: Enterprise Application Integration (EAI) is defined as the
use of software and computer systems architectural principles to integrate a set of enterprise
computer applications.
Answers: Self Assessment
1. (a) 2. (c)
3. Business Process 4. Zero latency
5. Service-Oriented Application Integration (SOAI)
6. Business Method Integration strategy 7. Back-end applications
Books
A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.
Alex Berson, Data Warehousing Data Mining and OLAP, Tata McGraw Hill, 1997.
Alex Berson, Stephen J. Smith, Data warehousing, Data Mining & OLAP, Tata
McGraw Hill, Publications, 2004.
Alex Freitas and Simon Lavington, Mining Very Large Databases with Parallel
Processing, Kluwer Academic Publishers, 1998.
J. Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers,
1993.
Jiawei Han, Micheline Kamber, Data Mining – Concepts and Techniques, Morgan
Kaufmann Publishers, First Edition, 2003.
Matthias Jarke, Maurizio Lenzerini, Yannis Vassiliou, Panos Vassiliadis, Fundamentals of Data Warehouses, Springer.
Michael Berry and Gordon Linoff, Data Mining Techniques (For Marketing, Sales,
and Customer Support), John Wiley & Sons, 1997.
Michael J. A. Berry, Gordon S Linoff, Data Mining Techniques, Wiley Publishing
Inc, Second Edition, 2004.
Sam Anahory, Dennis Murray, Data Warehousing in the Real World, Addison Wesley, First Edition, 2000.
Sholom M. Weiss and Nitin Indurkhya, “Predictive Data Mining: A Practical Guide”,
Morgan Kaufmann Publishers, 1998.
Sushmita Mitra, Tinku Acharya, Data Mining – Multimedia, Soft Computing and
Bioinformatics, John Wiley & Sons, 2003.
Usama Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy
Uthurasamy, Advances in Knowledge Discovery and Data Mining, AAAI Press/ The
MIT Press, 1996.
V. Cherkassky and F. Mulier, Learning From Data, John Wiley & Sons, 1998.
Contents
Objectives
Introduction
7.1 Introduction to EAI
7.2 Purpose of EAI
7.3 Summary
7.4 Keywords
7.5 Review Questions
7.6 Further Readings
Objectives
After studying this unit, you will be able to:
zz Know about EAI
zz Describe EAI industry specific implementations
zz Describe case studies
Introduction
Enterprise Application Integration (EAI) is a process of data and application integration
technologies which focuses on linking transactional applications together, typically in real
time. This differentiates it from EII (Enterprise Information Integration), which is focused on
presenting a unified view of data. It is common knowledge that EII is viewed as a bi-product of
EAI, because, since the application processes are interlinked, the flow of information needs to be
channeled for a unified view. EAI tools are also generally focused on point-to-point application
communications.
Technologies that can make up an EAI solution include web services, transaction monitors,
message brokers, and message queues. Most commonly, EAI vendors discuss messaging and
web services.
In today’s dynamic environment, applications like Supply Chain Management, Customer
Relationship Management, Business Intelligence and Integrated Collaboration environments
have become indispensable for any organization that depends upon continuous, uninterrupted
production in a competitive commercial environment, and EAI methodologies provide a
consistent solution to this need. Each of these applications is in itself a mammoth effort.
Enterprise Application Integration (EAI) is a step forward in that it links these
applications and others in order to realize financial and operational competitive advantages.
Historically, companies developed their applications to perform isolated, independent, and
automated functions. The result of this diversity was a collection of stovepipe
applications rather than a unified network of linked systems. But now, companies are realizing
the utmost need to integrate these independent data silos to leverage the information stored in
them across the various vertical and horizontal domains, as well as to surmount the ever-increasing
costs of building new applications.
And this is where an EAI solution comes into the picture.
EAI is a collection of processes, software and hardware tools, methodologies, and technologies.
When implemented together, they have the aim of consolidating, connecting, and organizing
all of the business’s computer applications, data, and business processes (both legacy and new)
into a seamlessly interfaced framework of system components that allows real-time exchange,
management, and easy reformulation of the company’s mission-critical information and
knowledge. It is an unrestricted sharing of data throughout the networked applications or data
sources in an enterprise.
When designing an Enterprise Application Integration (EAI) solution, it is important to recognize
that there are different levels of integration, each with its own requirements and considerations.
Successful implementation of consistent, scalable, reliable, incremental, cost-effective EAI
solutions depends on the standards and methodologies that we define for these levels. It must be
determined how we need to share information:
1. Within an application
2. Between applications within an enterprise
3. Between enterprises
4. Directly with customers
Moreover, the solution should be based upon a multi-tiered, non-redundant, coarse-grained
architecture that is enforced across the enterprise. This requires that the total set of functions
involved in the end-to-end integrated solution be provided by distinct layers, each providing
unique and non-overlapping services, as shown in Figure 7.1.
This architecture provides sufficient high-level consistency for interoperability and a certain
degree of local freedom.
Case Study Case 1: Business Case Introduction
Royal Wallace, a UK-based transportation company, is a global leader in the rail
equipment and servicing industry. Its wide-range of products includes passenger
rail vehicles and total transit systems. It also manufactures locomotives, freight cars,
bogies, and provides rail-control solutions.
Because of the structure of its business, the Royal Wallace company has a presence in several
countries across Europe and in North America, including the USA and Canada. And, each of
these country-specific sites has its own dedicated legacy system such as SAP-based systems
in UK and Germany, Oracle financials in Canada, and so on (see Figure 1).
As can be seen in Figure 1, in the as-is scenario, these legacy systems are acting as
independent data silos with absolutely no kind of data-sharing happening between them.
If the German SAP system wants to procure something from any of its vendors, it does
so by using some kind of manual process at its end. The same scenario is replicated for
all the other systems, too. And, this is leading to a lot of data duplication taking place.
For example, if two different systems are using the same vendor (say, Vendor A) for
procuring white board markers, this vendor (Vendor A) information is getting stored at
both the legacy systems. Also, this kind of scenario makes it impossible to leverage existing
information across multiple systems.
Royal Wallace understands that, to survive the competition in today’s fast-paced economy
and to reduce its time-to-market, it needs to put an integration solution in place as soon as
possible. The integration model should integrate all the existing legacy systems irrespective of
the platforms and operating systems that these systems are based upon. The model should take
care of the various data formats that would pass through this integration scenario and apply
the necessary transformation, translation, and data validation rules upon them. And last, but
definitely not least, this integration solution should be scalable enough that any new
system could easily be plugged into it in the future with minimal changes.
The integration model should provide:
1. Integration between the various legacy systems
2. Automated process for procurement of goods
3. Data transformation and translation logic according to the prescribed rules
4. Persistent storage mechanism for the data
5. Validation of business data to a certain extent
6. Connectivity to various drop zones and databases
As seen in Figure 2, the EAI solution proposed would make use of the integration
capabilities of the SeeBeyond eGate software to seamlessly integrate the various individual
legacy systems of the Royal Wallace company with the SAP-based purchasing order (PO)
system.
SeeBeyond eGate Integrator is a fully J2EE-certified and Web services-based distributed
integration platform. It provides the core integration platform, comprehensive systems
connectivity, guaranteed messaging, and robust transformation capabilities.
In this scenario, the SAP-based PO system acts as a data hub for all the legacy systems,
such that all the procurement orders at Royal Wallace would be routed through this SAP-based
PO system. Also, SeeBeyond acts as the integration hub between the SAP-based
PO system on one end and the legacy systems at the Royal Wallace company on the other,
thereby enabling a bi-directional flow of data.
Process Flow
Whenever a Royal Wallace employee needs to procure something, she logs on to an intranet
application. The workflow in this application is managed by the SAP-based purchasing
order (PO) system. The PO system, in turn, places the order with the vendor on behalf of the
employee. The vendor acknowledges the receipt of the order to the PO system and delivers
the goods to the concerned legacy system. Once the goods are delivered, the vendor sends
an invoice for the same to the SAP PO system. The PO system, in turn, sends this invoice
to the appropriate legacy system which then makes the necessary payment to the vendor.
The legacy system also sends a ‘payment done’ acknowledgement back to the PO
system. This scenario is replicated for all the other legacy systems, too.
The EAI solution (SeeBeyond eGate) is responsible for all the communication between the
SAP PO system and the legacy systems. Various interfaces were developed as part of this
solution to enable a bi-directional flow of data. All the information pertaining to the various
legacy systems and the SAP-based PO system (source and target) was captured as part of
the functional specifications. These specifications covered the following topics:
1. The platforms, databases, and operating systems for the source and target
2. The various applications constituting these systems and the way they interacted with
each other
3. The input and output data formats for the source and target (EBCDIC, ASCII, Binary,
and so forth)
The various systems at the Royal Wallace company that were to be incorporated into the
integration model were based on different platforms. And the applications that were
running on these systems exchanged data in various different formats with the
outside world. Most of these were proprietary formats that each of these applications
had created as per their own requirements over the years; no single standard
approach had been followed. Some of them had fixed-length records, others were
delimited; some followed ASCII character encoding, others were EBCDIC; and so on. And
above all, as most of these formats had been in existence for several years, the system
owners were very reluctant to replace them with a common standard format/template.
So one of the major challenges was to come up with a solution that could somehow transform
and translate these multiple data formats passing through the integration model, thereby
enabling the various systems to communicate with the SAP-based PO system. Moreover,
the solution had to be scalable enough to incorporate any new data formats in the future
with minimal changes to the existing scenario.
The solution was to make use of the Common Data Format (CDF), as shown in Figure 4.
The Common Data Format (CDF) was an event structure that was generated within
the SeeBeyond eGate environment. Because the various applications deployed on the
legacy systems used the integration model to procure goods from vendors, the incoming
data to the EAI environment from the various applications had some common data
elements irrespective of the data formats that these applications used. For example, all
the procurement requests had information about type and quantity of the goods to be
procured. All these common elements were captured under multiple nodes in the CDF
event structure in SeeBeyond. This proved useful in several ways:
1. The entire data structure for the whole integration model was defined using a single
event definition in SeeBeyond. All the incoming data into EAI could now be parsed and
then mapped to the CDF event structure.
2. Once the data was available in the CDF, it was very easy to create the event structure as
per the target system requirement and then simply map the CDF data into that structure.
Project Build, Test and Deployment
After an elaborate design and architecture phase, the project entered the build phase. It was
during this period that the various integration components were developed by making use
of the SeeBeyond eGate software.
As seen in Figure 5, three different environments were used for this purpose.
Case Summary
The EAI solution implemented at Royal Wallace addressed all the integration requirements
of the company. It provided for an automated procurement process in place of a manual
one and seamlessly integrated all the legacy applications at the company with minimum
fuss. The changes required in the existing scenario were kept to a minimum. Moreover,
because the Common Data Format (CDF) provides a lot of flexibility to the model, any new
application can be easily incorporated with a plug-and-play kind of adaptability.
Question
Summarize the case in your own words.
Case Study Case 2: EAI Implementation
We appreciate Softima’s contribution as an SI (System Integration) partner to
webMethods and also as a reliable vendor for Parkway Group. We commend
their participation and cooperation in the overall implementation of Phase-I and
Phase-II. Our experience working with them has been very smooth and we will be more
than happy to recommend Softima as a valuable partner to other clients.
The Client
The Client is one of the largest health care organizations in the world. They own a network
of hospitals across the globe with an emphasis in Asia. The IT organization prides itself on
being on the cutting edge of all technologies including ERP.
The Challenge
The Client has three major hospitals in Singapore, with different applications in each
department. Their IT team had developed several systems and the remaining ones were
purchased from different vendors. A VB application (the Interface Engine) was designed
to send data from all systems to the SAP Health Care Module, which was utilized by all
departments as a central system. The need was to integrate various systems like IRIS,
Cerner and Merlin.
The Solution
Replace the Interface Engine with the webMethods Integration Platform and provide
a common integration framework: re-architect the Interface Engine using the webMethods
Integration Platform, applying the HL7 (Health Level 7) standard, and integrate the systems
into the SAP HCM system.
The first phase of the project involved integrating the Laboratory System (Cerner) and the
Radiology System (IRIS) with the SAP Health Care Module. The data formats for communication
were HCM (for the SAP system) and HL7 (for the IRIS and Cerner systems). After receiving the
data from a system, webMethods parses, validates, logs and delivers the messages to the
corresponding receiver. Protocols used for this communication were TCP/IP, Directory and
SAP RFCs.
The second phase of the project involved integrating the Pharmacy System (Merlin) with the
SAP Health Care Module by using webMethods as the Interface Engine. The data formats for
communication were HCM (for SAP), HL7 and custom messages (for the Merlin system).
Protocols used for this integration were FTP, Directory, SAP RFC and SAP BAPI.
Question
What would you suggest to ensure that the EAI implementation in this system works properly?
Case Study Case 3: Application Integration for MTNL
Mahanagar Telephone Nigam Ltd. had developed a portal for online presentation
of bills to its customers. The application allows MTNL customers to view their
bills on the screen; the customer can in turn take a printout of the bills and
present it at any MTNL payment counter. The bills are printed with bar codes, and hence the
existing application of MTNL can read the bar codes and fetch the bill details online from
the servers for accepting the payment. Besides bill viewing and printing, the customer can
get all the other details like unpaid bills listing, paid bills listing, STD/ISD details, and so on.
The service is extremely fast and free; the customer just needs to have an e-mail account to
get the bills e-mailed to them on a regular basis.
MTNL New Delhi needed an EAI tool to seamlessly integrate its existing remote billing
servers. There are currently 11 remote billing servers (CSMS), and depending upon the
telephone location, the customer database along with the bills is stored in the respective
remote server. EAI was needed to integrate the portal application with these remote billing
servers; under this, all the bill location, fetching, and updating of bill status was to be done
by the EAI tool. MTNL, New Delhi chose Microsoft BizTalk Server 2002 as the EAI tool
and Microsoft SQL Server 2000 Enterprise Edition as the database for the portal application.
A tender was floated by MTNL and COMM-IT India Pvt. Ltd. was given the order for the
same. The scope covered in the job was:
1. Supply, installation, demonstration and commissioning of EAI.
2. Training on the EAI application, which also covered BizTalk Server 2002 training.
Question
What are the benefits of application integration at MTNL?
Source: http://www.comm-it.in/form/Case_EnterpriseApplicationIntegration.aspx
Case Study Case 4: Integrating and Automating Composite
Applications
Enterprises are increasingly seeking new competitive advantages, and are demanding
that IT provide new applications in record time to meet those demands and deliver
strategic value.
Enterprise architects are therefore turning to a new breed of applications that use and
reuse components developed with multiple technologies - including Web Services, J2EE,
.NET, and messaging - across a wide range of platforms and systems. These composite
applications are more efficient to develop, and deliver faster results to the business, but
entail many new challenges such as where to focus development time, how to integrate
legacy applications, and how to maximize limited compute resources.
Questions
1. Take any organization and examine whether an integration tool would work in it or not.
2. Would such integration really be helpful for the organization? Give your suggestions.
Source: BMC Software
7.3 Summary
zz There are many types of EAI software on the market (such as Sun Microsystems’ SeeBeyond),
each approaching the problem of integration from a different angle and presenting a
different solution. However, there are four overarching purposes for which EAI software
can be used to improve efficiency.
zz The integration problems many enterprises face today are due to the fact that until relatively
recently there was no expectation that applications should be able to ‘talk’ to each other.
zz Until the advent of networks, computer applications were designed to perform a specific
purpose, and were often written in a range of different programming languages and used
different data structures from each other, with no thought to integration.
7.4 Keywords
Auditing: Auditing is an evaluation of a person, organization, system, process, enterprise, project
or product. Audits are performed to ascertain the validity and reliability of information; also to
provide an assessment of a system’s internal control.
Database: A database is an organized collection of data, together with the set of computer
programs that controls its creation, maintenance, and use on a computer platform by an
organization and its end users.
EAI: Enterprise Application Integration is the term used to describe the integration of the
computer applications of an enterprise so as to maximise their utility throughout the enterprise.
Operating System: An operating system (OS) is an interface between hardware and user which is
responsible for the management and coordination of activities and the sharing of the resources of
the computer that acts as a host for computing applications run on the machine.
7.6 Further Readings

Books A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.
Alex Berson, Data Warehousing, Data Mining and OLAP, Tata McGraw Hill, 1997.
Alex Berson, Stephen J. Smith, Data Warehousing, Data Mining & OLAP, Tata McGraw Hill Publications, 2004.
Alex Freitas and Simon Lavington, Mining Very Large Databases with Parallel Processing, Kluwer Academic Publishers, 1998.
J. Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1993.
Jiawei Han, Micheline Kamber, Data Mining – Concepts and Techniques, Morgan Kaufmann Publishers, First Edition, 2003.
Contents
Objectives
Introduction
8.1 Data Warehouse Refreshment
8.2 Incremental Data Extraction
8.3 Data Cleaning
8.3.1 Data Cleaning for Missing Values
8.3.2 Noisy Data
8.4 Summary
8.5 Keywords
8.6 Self Assessment
8.7 Review Questions
8.8 Further Readings
Objectives
After studying this unit, you will be able to:
zz Know data warehouse refreshment
zz Explain incremental data extraction
zz Describe data cleaning
Introduction
A distinguishing characteristic of data warehouses is the temporal character of warehouse data,
i.e., the management of histories over an extended period of time. Historical data is necessary for
business trend analysis which can be expressed in terms of analysing the temporal development
of real-time data. For the refreshment process, maintaining histories in the DWH means that
either periodical snapshots of the corresponding operational data or relevant operational updates
are propagated and stored in the warehouse, without overriding previous warehouse states.
Extraction is the operation of extracting data from a source system for further use in a data
warehouse environment. This is the first step of the ETL process. After the extraction, this data
can be transformed and loaded into the data warehouse.
The source systems for a data warehouse are typically transaction processing applications. For
example, one of the source systems for a sales analysis data warehouse might be an order entry
system that records all of the current order activities.
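To make this first step concrete, the sketch below shows timestamp-based incremental extraction in SQL. It is only an illustration, not a prescribed design: the source table orders with its last_updated column and the bookmark table etl_control are hypothetical names assumed here.

-- Pull only the rows that changed since the previous extraction run
-- (orders.last_updated and etl_control are assumed, illustrative names).
SELECT o.*
  FROM orders o
 WHERE o.last_updated > (SELECT c.last_extract_ts
                           FROM etl_control c
                          WHERE c.source_name = 'orders');

-- After a successful load, advance the bookmark for the next run.
UPDATE etl_control
   SET last_extract_ts = CURRENT_TIMESTAMP
 WHERE source_name = 'orders';

This style of extraction only works when the source reliably maintains a change timestamp; otherwise, change monitoring techniques such as log analysis or snapshot differencing are needed.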
The data warehouse can be defined as a hierarchy of data stores which goes from source data
to highly aggregated data. Between these two extremes can be other data stores, depending on
the requirements of OLAP applications. One of these stores is the Corporate Data Warehouse
(CDW) store, which groups all aggregated views used for the generation of the data marts. The
corporate data store can be complemented by an Operational Data Store (ODS) which groups the
base data collected and integrated from the sources. Data extracted from each source can also be
stored in different data structures. This hierarchy of data stores is a logical way to represent the
data flow between the sources and the data marts. In practice, all the intermediate states between
the source and the data marts can be represented in the same database.
We can distinguish four levels in the construction of the hierarchy of stores. The first level includes
three major steps:
1. The extraction of data from the operational data sources
2. Their cleaning with respect to the common rules defined for the data warehouse store
3. Their possible archiving in the case when integration needs some synchronization between
extraction and integration
Note However, this decomposition is only logical. The extraction step and part
of the cleaning step can be grouped into the same software component, such as a wrapper or
a data migration tool.
When the extraction and cleaning steps are separated, data need to be stored in between. This can
be done using one storage medium per source or one shared medium for all sources.
The second level is the integration step. This phase is often coupled with rich data transformation
capabilities in the same software component, which usually performs the loading into the ODS
when it exists, or into the CDW otherwise. The third level concerns the data aggregation for the
purpose of cube construction. Finally, the fourth level is a step of cube customization. All these
steps can also be grouped into the same software, such as a multi-database system.
In order to understand which kinds of tools the refreshment process needs, it is important to
locate it within the global data warehouse lifecycle, which is defined by the three following phases:
Design Phase
The design phase consists of the definition of user views, auxiliary views, source extractors,
data cleaners, data integrators and all other features that guarantee an explicit specification of
the data warehouse application. These specifications can be done with respect to abstraction
levels (conceptual, logical and physical) and user perspectives (source view, enterprise view, client
views). The result of the design is a set of formal or semiformal specifications which constitute
the metadata used by the data warehouse system and applications.
Loading Phase
The loading phase consists of the initial data warehouse instantiation which is the initial
computation of the data warehouse content. This initial loading is globally a sequential process
of four steps:
1. Preparation
2. Integration
3. High level aggregation
4. Customization
The first step is done for each source and consists of data extraction, data cleaning and possibly
data archiving before or after cleaning. The second step consists of data integration, which is the
reconciliation of data originating from heterogeneous sources and the derivation of the base
relations of the ODS. The third step consists of the computation of aggregated views from base
views. In all three steps, not just the loading of data but also the loading of indexes is of crucial
importance for query and update performance. While the data extracted from the sources and
integrated in the ODS are considered as ground data with very low-level aggregation, the data
in aggregated views are generally highly summarized using aggregation functions. These
aggregated views constitute what is sometimes called the CDW, i.e. the set of materialized views
from which data marts are derived. The fourth step consists of the derivation and customization
of the user views which define the data marts. Customization refers to the various presentations
needed by the users for multidimensional data.
Refreshment Phase
The refreshment phase has a data flow similar to the loading phase but, while the loading process
is a massive feeding of the data warehouse, the refreshment process captures the differential
changes that occurred in the sources and propagates them through the hierarchy of data stores.
The preparation step extracts from each source the data that characterize the changes that have
occurred in this source since the last extraction. As in the loading phase, these data are cleaned
and possibly archived before their integration. The integration step reconciles the source changes
coming from multiple sources and adds them to the ODS. The aggregation step incrementally
computes the hierarchy of aggregated views using these changes. The customization step
propagates the summarized data to the data marts.
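As a small illustration of the aggregation step, the SQL sketch below folds a batch of extracted changes into an aggregate table instead of recomputing it from scratch. The tables sum_sales and delta_sales are hypothetical names assumed for the example.

-- Fold a batch of new detail rows (delta_sales) into the aggregate table.
MERGE INTO sum_sales s
USING (SELECT region, product, SUM(amount) AS amt
         FROM delta_sales
        GROUP BY region, product) d
   ON (s.region = d.region AND s.product = d.product)
 WHEN MATCHED THEN
   UPDATE SET s.sum_amount = s.sum_amount + d.amt
 WHEN NOT MATCHED THEN
   INSERT (region, product, sum_amount)
   VALUES (d.region, d.product, d.amt);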
The refreshment of a data warehouse is an important process which determines the effective
usability of the data collected and aggregated from the sources. Indeed, the quality of the data
provided to the decision makers depends on the capability of the data warehouse system to
propagate the changes made at the data sources in reasonable time. Most of the design decisions
are thus influenced by the choice of data structures and updating techniques that optimize the
refreshment of the data warehouse.
Building an efficient refreshment strategy depends on various parameters related to the
following:
1. Application requirements: e.g., data freshness, computation time of queries and views,
data accuracy
2. Source Constraints: e.g., availability windows, frequency of change.
3. Data Warehouse System limits: e.g., storage space limit, functional limits.
Most of these parameters may evolve during the data warehouse lifetime, hence leading to
frequent reconfiguration of the data warehouse architecture and changes in the refreshment
strategies. Consequently data warehouse administrators must be provided with powerful tools
that enable them to efficiently redesign data warehouse applications.
For those corporations in which an ODS makes sense, Inmon proposes to distinguish among
three classes of ODSs, depending on the speed of refreshment demanded.
1. The first class of ODSs is refreshed within a few seconds after the operational data sources
are updated. Very little transformation is performed as the data passes from the
operational environment into the ODS. A typical example of such an ODS is given by a
banking environment where data sources keep individual accounts of a large multinational
customer, and the ODS stores the total balance for this customer.
2. With the second class of ODSs, integrated and transformed data are first accumulated
and stored into an intermediate data store and then periodically forwarded to the ODS,
on, say, an hourly basis. This class usually involves more integration and transformation
processing. To illustrate this, consider now a bank that stores in the ODS an integrated
individual bank account on a weekly basis, including the number of transactions during
the week, the starting and ending balances, the largest and smallest transactions, etc. The
daily transactions processed at the operational level are stored and forwarded on an hourly
basis. Each change received by the ODS triggers the updating of a composite record of the
current week.
3. Finally, the third class of ODSs is strongly asynchronous. Data are extracted from the
sources and used to refresh the ODS on a day-or-more basis. As an example of this class,
consider an ODS that stores composite customer records computed from different sources.
As customer data change very slowly, it is reasonable to refresh the ODS in a more infrequent
fashion.
Quite similar distinctions also apply to the refreshment of a global data warehouse, except
that there is usually no counterpart to ODSs of the first class. The period for refreshment is
considered to be larger for global data warehouses. Nevertheless, different data warehouses
demand different speeds of refreshment. Besides the speed of the refreshment, which can be
determined statically after analyzing the requirements of the information processing application,
other dynamic parameters may influence the refreshment strategy of the data warehouse. For
instance, one may consider the volume of changes in the data sources as given by the number of
update transactions. Coming back to the previous example of an ODS of the second class, such
a parameter may determine dynamically the moment at which the changes accumulated in an
intermediate data store should be forwarded to the ODS. Another parameter can be determined
by the profile of queries that execute on the data warehouse. Some strategic queries that require
fresh data may entail the refreshment of the data warehouse, for instance using the changes
that have been previously logged between the sources and the ODS or the sources and the
global data warehouse.
In any case, the refreshment of a data warehouse is considered to be a difficult and critical
problem for three main reasons:
1. First, the volume of data stored in a warehouse is usually large and is predicted to grow in
the near future. Recent inquiries show that 100 GB warehouses are becoming commonplace.
Also, a study from META Group published in January 1996 reported that 52% of the
warehouses surveyed would be 20 GB to 1 TB or larger in 12-18 months. In particular, the
level of detail required by the business leads to fundamentally new volumes of warehoused
data. Further, the refreshment process must be propagated along the various levels of data
(ODS, CDW and data marts), which enlarges the volume of data that must be refreshed.
2. Second, the refreshment of a warehouse requires the execution of transactional workloads of
varying complexity. In fact, the refreshment of warehouses yields different performance
challenges depending on its level in the architecture. The refreshment of an ODS involves
many transactions that need to access and update a few records. Thus, the performance
requirements for refreshment are those of general-purpose record-level update processing.
The refreshment of a global data warehouse involves heavy load and access transactions.
Possibly large volumes of data are periodically loaded in the data warehouse, and once
loaded, these data are accessed either for informational processing or for refreshing
the local warehouses. Power for loading is now measured in GB per hour, and several
companies are moving to parallel architectures when possible to increase their processing
power for loading and refreshment. The network interconnecting the data sources to the
warehouse can also be a bottleneck during refreshment and calls for compression techniques
for data transmission. Finally, the refreshment of local warehouses involves transactions
that access many data items, perform complex calculations to produce highly summarized
and aggregated data, and update a few records in the local warehouses.
This is particularly true for the local data warehouses that usually contain the data cubes
manipulated by OLAP applications. Thus, a considerable processing time may be needed
to refresh the warehouses. This is a problem because there is always a limited time frame
during which the refreshment is expected to happen. Even if this time frame goes up to
several hours and does not occur at peak periods, it may be challenging to guarantee that
the data warehouse will be refreshed within it.
3. Third, the refreshment of a warehouse may be run concurrently with the processing of
queries. This may happen because the time frame during which the data warehouse is not
queried is either too short or nonexistent.
Task The weather data is stored for different locations in a warehouse. The weather
data consists of ‘temperature’, ‘pressure’, ‘humidity’, and ‘wind velocity’. The location is
defined in terms of ‘latitude’, ‘longitude’, ‘altitude’ and ‘time’. Assume that nation() is a
function that returns the name of the country for a given latitude and longitude. Propose a
warehousing model for this case.
It is convenient to associate a wrapper with the data source in order to provide a uniform
description of the capabilities of the data sources. Moreover, the role of the wrapper in a data
warehouse context is enlarged. Its first functionality is to give a description of the data stored by
each data source in a common data model. We assume that this common model is a relational data
model. This is the typical functionality of a wrapper in a classical wrapper/mediator architecture;
therefore, we will call it wrapper functionality. The second functionality is to detect the changes of
interest that have happened in the underlying data source. This is a specific functionality required
by data warehouse architectures in order to support the refreshment of the data warehouse in
an incremental way. For this reason we reserve the term change monitoring to refer to this kind of
functionality.
Wrapper Functionality
The principal function of the wrapper relative to this functionality is to make the underlying data
source appear as having the same data format and model that are used in the data warehouse
system. For instance, if the data source is a set of XML documents and the data model used in the
data warehouse is the relational model, then the wrapper must be defined in such a way that it
presents data sources of this type as if they were relational.
The development of wrapper generators has received attention from the research community,
especially in the case of sources that contain semi-structured data such as HTML or SGML
documents. These tools, for instance, make it possible to query the documents using an OQL-based
interface.
Another important function that should be implemented by the wrapper is to establish the
communication with the underlying data source and allow the transfer of information between
the data source and the change monitor component. If the data warehouse system and the data
source share the same data model, then the function of the wrapper would be just to translate
the data format and to support the communication with the data source. For data sources that
are relational systems, and supposing that the data model used in the data warehouse is also
relational, it is possible to use wrappers that have been developed by software companies such as
database vendors or database-independent companies. These wrappers, also called “middleware”,
“gateways” or “brokers”, have varying capabilities in terms of application programming interface,
performance and extensibility.
In the client-server database environment, several kinds of middleware have already been
developed to enable the exchange of queries and their associated answers between a client
application and a database server, or between database servers, in a transparent way. The term
“transparent” usually means that the middleware hides the underlying network protocol, the
database systems and the database query languages supported by these database systems from
the application.
The usual sequence of steps during the interaction of a client application and a database server
through a middleware agent is as follows. First, the middleware enables the application to connect
and disconnect to the database server. Then, it allows the preparation and execution of requests.
A request preparation specifies the request with formal parameters which generally entails its
compilation in the server. A prepared request can then be executed by invoking its name and
passing its actual parameters. Requests are generally expressed in SQL. Another functionality
offered by middleware is the fetching of results which enables a client application to get back all
or part of the result of a request. When the results are large, they can be cached on the server. The
transfer of requests and results is often built on a protocol supporting remote procedure calls.
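As an illustration of request preparation and execution, the following uses SQL-level PREPARE/EXECUTE statements (PostgreSQL syntax) as a stand-in for the corresponding CLI calls; the table accounts and the request name get_balance are hypothetical examples.

-- Prepare once: the server compiles the request with a formal parameter.
PREPARE get_balance (INTEGER) AS
  SELECT balance FROM accounts WHERE cust_id = $1;

-- Execute many times, passing actual parameter values.
EXECUTE get_balance (42);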
There has been an important effort to standardize the programming interface offered by
middleware and the underlying communication protocol. Call Level Interface (CLI) is a
standardized API developed by the X/Open standardization committee. It enables a client
application to extract data from a relational database server through a standard SQL-based
interface. This API is currently supported by several middleware products such as ODBC and
IDAPI. The RDA standard communication protocol specifies the messages to be exchanged
between clients and servers. Its specialization to SQL requests enables the transport of requests
generated by a CLI interface.
Despite these efforts, existing middleware products do not actually offer a standard interface for
client-server developments. Some products such as DAL/DAM or SequeLink offer their own API,
although some compatibility is sometimes offered with other tools, such as ODBC. Furthermore,
database vendors have developed their own middleware. For instance, Oracle proposes several
levels of interface, such as the Oracle Call Interface (OCI), on top of its client-server protocol
named SQL*Net. The OCI offers a set of functions close to the ones of CLI, and enables any client
having SQL*Net to connect to an Oracle server using any kind of communication protocol.
Finally, an alternative way to provide transparent access to database servers is to use Internet
protocols. In fact, it must be noted that the World Wide Web is simply a standards-based client-
server architecture.
8.3.1 Data Cleaning for Missing Values
The following methods can be used to clean data for missing values in a particular attribute:
1. Ignore the tuple: This is usually done when the class label is missing (assuming the mining
task involves classification or description). This method is not very effective, unless
the tuple contains several attributes with missing values. It is especially poor when the
percentage of missing values per attribute varies considerably.
2. Fill in the missing value manually: In general, this approach is time-consuming and may
not be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values by
the same constant, such as a label like “Unknown”. If missing values are replaced by, say,
“Unknown”, then the mining program may mistakenly think that they form an interesting
concept, since they all have a value in common — that of “Unknown”. Hence, although this
method is simple, it is not recommended.
4. Use the attribute mean to fill in the missing value: You can fill the missing values with the
average value in that attribute.
5. Use the attribute mean for all samples belonging to the same class as the given tuple.
6. Use the most probable value to fill in the missing value: This may be determined with
inference-based tools using a Bayesian formalism or decision tree induction. For example,
using the other customer attributes in your data set, you may construct a decision tree to
predict the missing values for income.
Note Methods 3 to 6 bias the data. The filled-in value may not be correct. Method 6, however,
is a popular strategy. In comparison to the other methods, it uses the most information from the
present data to predict missing values.
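For illustration, methods 3 to 5 can be expressed directly in SQL; the table customer with columns cust_id, income and risk_class is an assumed example, not part of the text above.

-- Method 3: replace missing values with a global constant at query time.
SELECT cust_id, COALESCE(income, 0) AS income
  FROM customer;

-- Method 4: fill missing values with the attribute mean.
UPDATE customer
   SET income = (SELECT AVG(income) FROM customer)
 WHERE income IS NULL;

-- Method 5: fill with the mean of the tuple's own class.
UPDATE customer c
   SET income = (SELECT AVG(c2.income)
                   FROM customer c2
                  WHERE c2.risk_class = c.risk_class)
 WHERE income IS NULL;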
8.3.2 Noisy Data
Noise is a random error or variance in a measured variable. Given a numeric attribute such as,
say, price, we can “smooth” out the data by using the following techniques:
Binning Methods
Binning methods smooth a sorted data value by consulting the “neighborhood”, or values
around it. The sorted values are distributed into a number of ‘buckets’ or bins. Because binning
methods consult the neighborhood of values, they perform local smoothing. The following
example illustrates some binning techniques. In this example, the data for price are first sorted
and partitioned into equi-depth bins (of depth 3).
1. In smoothing by bin means, each value in a bin is replaced by the mean value of the bin.
For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original value
in this bin is replaced by the value 9.
2. Similarly, smoothing by bin medians can be employed, in which each bin value is replaced
by the bin median.
3. In smoothing by bin boundaries, the minimum and maximum values in a given bin are
identified as the bin boundaries. Each bin value is then replaced by the closest boundary
value. In general, the larger the width, the greater the effect of the smoothing. Alternatively,
bins may be equi-width, where the interval range of values in each bin is constant.
Example
1. Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
2. Partition into (equi-depth) bins:
(a) Bin 1: 4, 8, 15
(b) Bin 2: 21, 21, 24
(c) Bin 3: 25, 28, 34
3. Smoothing by bin means:
(a) Bin 1: 9, 9, 9
(b) Bin 2: 22, 22, 22
(c) Bin 3: 29, 29, 29
4. Smoothing by bin boundaries:
(a) Bin 1: 4, 4, 15
(b) Bin 2: 21, 21, 24
(c) Bin 3: 25, 25, 34
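The same example can be reproduced in SQL: NTILE builds the equi-depth bins, and window aggregates give the bin means and boundaries. The table prices(price) is an assumed name for the sorted data above.

-- Distribute the prices into three equi-depth bins, then compute, for each
-- value, the bin mean and the bin boundaries used for smoothing.
SELECT price,
       bin,
       AVG(price) OVER (PARTITION BY bin) AS bin_mean,
       MIN(price) OVER (PARTITION BY bin) AS lower_boundary,
       MAX(price) OVER (PARTITION BY bin) AS upper_boundary
  FROM (SELECT price,
               NTILE(3) OVER (ORDER BY price) AS bin
          FROM prices);

Smoothing by bin means replaces each price with bin_mean; smoothing by bin boundaries replaces it with whichever of lower_boundary or upper_boundary is closer.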
Clustering
Outliers may be detected by clustering, where similar values are organized into groups or
“clusters”. Intuitively, values which fall outside of the set of clusters may be considered outliers
(Figure 8.1).
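Full clustering requires a mining tool, but a simple per-group deviation test conveys the same intuition in SQL: values far from the centre of their group are outlier candidates. The table measurements(grp, val) is an assumed example.

-- Flag values more than three standard deviations from their group mean.
SELECT grp, val
  FROM (SELECT grp, val,
               AVG(val)    OVER (PARTITION BY grp) AS grp_mean,
               STDDEV(val) OVER (PARTITION BY grp) AS grp_sd
          FROM measurements)
 WHERE grp_sd > 0
   AND ABS(val - grp_mean) > 3 * grp_sd;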
Outliers may be identified through a combination of computer and human inspection. In one
application, for example, an information-theoretic measure was used to help identify outlier
patterns in a handwritten character database for classification. The measure’s value reflected
the “surprise” content of the predicted character label with respect to the known label. Outlier
patterns may be informative (e.g., identifying useful data exceptions, such as different versions
of the characters “0” or “7”), or “garbage” (e.g., mislabeled characters). Patterns whose surprise
content is above a threshold are output to a list. A human can then sort through the patterns in
the list to identify the actual garbage ones.
This is much faster than having to manually search through the entire database. The garbage
patterns can then be removed from the (training) database.
Regression
Data can be smoothed by fitting the data to a function, such as with regression. Linear regression
involves finding the “best” line to fit two variables, so that one variable can be used to predict
the other. Multiple linear regression is an extension of linear regression, where more than two
variables are involved and the data are fit to a multidimensional surface. Using regression to find
a mathematical equation to fit the data helps smooth out the noise.
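Standard SQL aggregate functions can fit such a line directly. As a sketch, assuming a hypothetical table points(x, y):

-- Fit y = slope * x + intercept over all points, then smooth each y
-- by replacing it with its fitted value.
SELECT p.x,
       p.y,
       f.slope * p.x + f.intercept AS y_smoothed
  FROM points p
 CROSS JOIN (SELECT REGR_SLOPE(y, x)     AS slope,
                    REGR_INTERCEPT(y, x) AS intercept
               FROM points) f;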
Inconsistent Data
There may be inconsistencies in the data recorded for some transactions. Some data inconsistencies
may be corrected manually using external references. For example, errors made at data entry may
be corrected by performing a paper trace. This may be coupled with routines designed to help
correct the inconsistent use of codes. Knowledge engineering tools may also be used to detect
the violation of known data constraints. For example, known functional dependencies between
attributes can be used to find values contradicting the functional constraints.
There may also be inconsistencies due to data integration, where a given attribute can have
different names in different databases. Redundancies may also result.
Case Study Banco Popular
The Company
Banco Popular, the largest bank in Puerto Rico, with more than 2 million customers,
recognizes that effective service requires understanding individual needs and responding
to them promptly. That’s why, for the past 17 years, it has used a customer information
system to keep all its employees abreast of its customer relationships.
The bank has achieved steady growth with this approach, but until recently, success had
taken a toll on data system performance. As the bank added more and more customer
and account data to the system, sluggish response times and inconsistent, duplicated,
and nonstandardized data values made it increasingly difficult for employees to access
information.
Open Database is Key to Customer Data System Success
By replacing the old system with a new DB2-based system, the bank has built a solid
foundation for staying in tune with its clientele. The customer data system is integrated
with all other applications within the bank, and, working in conjunction with data cleansing
and reengineering software from Trillium Software, provides a complete, unified view of
each customer. A program analyst at Banco Popular said, “Opening a new account once
required several operations. Now it is completed in just one menu function. The DB2-based
system has improved both customer satisfaction and productivity.”
Residing on an IBM platform, the system is the hub of activity at the bank. As many as 50
percent of the 6,000 employees in more than 200 branches depend on it to manage more
than 5.7 million personal and business accounts. Every transaction that requires accessing,
updating, or collecting information on customers is handled by the customer data system.
For example, before opening a new account, customer service associates access the system
to research the customer’s existing relationships with the bank.
Any new customer and account information is then fed into the system, where it can be
used by other divisions of the bank.
Because of the key role the customer data system plays in growing the bank’s business,
Banco Popular insisted that the database at the heart of its new system be both robust and
open. The bank understood it needed a relational database. The main advantage of DB2
among relational databases is that companies can implement any third-party application on
top of it and it will meet expectations. Scalability was equally important to Banco Popular.
In the past, it had been constrained by a system that was unable to keep up with growing
data volumes and number of users. DB2 also gave the bank virtually limitless scalability as
well as high performance.
The bank is confident that its needs will be fulfilled for many years.
The Trillium Software System is used to cleanse and standardize each customer record and
then to identify and match those records against the database.
Banco Popular took advantage of Trillium’s highly customizable business rules to
eliminate duplicate records and develop an accurate view of its current customers and their
relationships with the bank. This has helped the bank establish meaningful and profitable
relationships with its customers.
The data cleansing process is also helping reduce marketing costs. Banco Popular will use
the Trillium Software System to enforce standardization and cleanse data. This will be
the key to an accurate “householding” process, which is a way of identifying how many
account holders live at the same address.
By doing this, the bank can eliminate duplicate mailings to the same household, which
makes the bank look much more efficient in its customers’ eyes, and saves at least $70,000
in mailing expenses every month. Banco Popular’s home-grown address standardization
system will soon be replaced by Trillium Software’s geocoding solution. This will save the
cost of changing and recertifying the system each time the US Postal Service changes its
standardization requirements.
DB2 can easily handle customer information systems containing millions of records in
multiple languages that are initially cleansed with the Trillium Software System. Not only
is Banco Popular expanding, its customers may be represented within complex financial
records on the database in either English or Spanish. Trillium Software scales in step with
the growing DB2 database and works in numerous languages to provide a global solution
for this multinational bank.
8.4 Summary
zz DWH refreshment so far has been investigated in the research community mainly in
relation to techniques for maintaining materialized views.
zz In these approaches, the DWH is considered as a set of materialized views defined over
operational data. Thus, the topic of warehouse refreshment is defined as a problem of
updating a set of views (the DWH) as a result of modifications of base relations (residing in
operational systems). Several issues have been investigated in this context.
zz The extraction method you should choose is highly dependent on the source system and
also on the business needs in the target data warehouse environment.
zz Very often, there is no possibility of adding additional logic to the source systems to
support incremental extraction of data, due to the performance impact or the increased
workload on these systems.
zz Sometimes even the customer is not allowed to add anything to an out-of-the-box
application system.
8.5 Keywords
Corporate Data Store: The corporate data store can be complemented by an Operational Data
Store (ODS) which groups the base data collected and integrated from the sources.
Data Cleaning: Data cleaning can be applied to remove noise and correct inconsistencies in the
data.
Incremental Data Extraction: How incremental data extraction can be implemented depends on
the characteristics of the data sources and also on the desired functionality of the data warehouse
system.
The Design Phase: The design phase consists of the definition of user views, auxiliary views,
source extractors, data cleaners, data integrators.
8.8 Further Readings

Books A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.
Alex Berson, Data Warehousing, Data Mining and OLAP, Tata McGraw Hill, 1997.
Alex Berson, Stephen J. Smith, Data Warehousing, Data Mining & OLAP, Tata McGraw Hill Publications, 2004.
Alex Freitas and Simon Lavington, Mining Very Large Databases with Parallel Processing, Kluwer Academic Publishers, 1998.
J. Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1993.
Jiawei Han, Micheline Kamber, Data Mining – Concepts and Techniques, Morgan Kaufmann Publishers, First Edition, 2003.
Matthias Jarke, Maurizio Lenzerini, Yannis Vassiliou, Panos Vassiliadis, Fundamentals of Data Warehouses, Springer.
Michael Berry and Gordon Linoff, Data Mining Techniques (For Marketing, Sales, and Customer Support), John Wiley & Sons, 1997.
Michael J. A. Berry, Gordon S. Linoff, Data Mining Techniques, Wiley Publishing Inc., Second Edition, 2004.
Sam Anahory, Dennis Murray, Data Warehousing in the Real World, Addison Wesley, First Edition, 2000.
Sholom M. Weiss and Nitin Indurkhya, Predictive Data Mining: A Practical Guide, Morgan Kaufmann Publishers, 1998.
Sushmita Mitra, Tinku Acharya, Data Mining – Multimedia, Soft Computing and Bioinformatics, John Wiley & Sons, 2003.
Usama Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy, Advances in Knowledge Discovery and Data Mining, AAAI Press/The MIT Press, 1996.
V. Cherkassky and F. Mulier, Learning From Data, John Wiley & Sons, 1998.
Contents
Objectives
Introduction
9.1 Update Propagation into Materialized Views
9.2 Types of Materialized Views
9.2.1 Materialized Views with Aggregates
9.2.2 Materialized Views Containing Only Joins
9.2.3 Nested Materialized Views
9.3 Towards a Quality-oriented Refreshment Process
9.3.1 View Maintenance, Data Loading and Data Refreshment
9.3.2 The Refreshment Process is a Workflow
9.4 Implementation of the Approach
9.4.1 The Workflow of the Refreshment Process
9.4.2 Defining Refreshment Scenarios
9.5 Implementation Issues
9.6 Summary
9.7 Keywords
9.8 Self Assessment
9.9 Review Questions
9.10 Further Readings
Objectives
After studying this unit, you will be able to:
zz Explain update propagation into materialized views
zz Know towards a quality-oriented refreshment process
zz Describe implementation of the approach
Introduction
Your organization has decided to build a data warehouse. You have defined the business
requirements and agreed upon the scope of your application, and created a conceptual design.
Now you need to translate your requirements into a system deliverable. To do so, you create the
logical and physical design for the data warehouse.
The logical design is more conceptual and abstract than the physical design. In the logical design,
you look at the logical relationships among the objects. In the physical design, you look at the most
effective way of storing and retrieving the objects as well as handling them from a transportation
and backup/recovery perspective.
Typically, data flows from one or more online transaction processing (OLTP) databases into a data
warehouse on a monthly, weekly, or daily basis. The data is normally processed in a staging file
before being added to the data warehouse. Data warehouses commonly range in size from tens
of gigabytes to a few terabytes. Usually, the vast majority of the data is stored in a few very large
fact tables.
One technique employed in data warehouses to improve performance is the creation of
summaries. Summaries are special types of aggregate views that improve query execution times
by pre-calculating expensive joins and aggregation operations prior to execution and storing the
results in a table in the database.
Example: You can create a table to contain the sums of sales by region and by product.
The summaries or aggregates that are referred to in this book and in literature on data warehousing
are created in Oracle Database using a schema object called a materialized view. Materialized views
can perform a number of roles, such as improving query performance or providing replicated
data.
In the past, organizations using summaries spent a significant amount of time and effort creating
summaries manually, identifying which summaries to create, indexing the summaries, updating
them, and advising their users on which ones to use. The introduction of summary management
eased the workload of the database administrator and meant the user no longer needed to be
aware of the summaries that had been defined. The database administrator creates one or more
materialized views, which are the equivalent of a summary. The end user queries the tables and
views at the detail data level. The query rewrite mechanism in the Oracle server automatically
rewrites the SQL query to use the summary tables. This mechanism reduces response time for
returning results from the query. Materialized views within the data warehouse are transparent
to the end user or to the database application.
Although materialized views are usually accessed through the query rewrite mechanism, an end
user or database application can construct queries that directly access the materialized views.
However, serious consideration should be given to whether users should be allowed to do this
because any change to the materialized views will affect the queries that reference them.
In data warehouses, you can use materialized views to pre-compute and store aggregated data
such as the sum of sales. Materialized views in these environments are often referred to as
summaries, because they store summarized data. They can also be used to pre-compute joins
with or without aggregations. A materialized view eliminates the overhead associated with
expensive joins and aggregations for a large or important class of queries.
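For instance, the sum-of-sales summary mentioned above could be declared as follows; the detail table sales(region, product, amount) is an assumed example, not a schema defined in this unit.

-- A summary that pre-computes sales totals and is available to query rewrite.
CREATE MATERIALIZED VIEW sum_sales_mv
  BUILD IMMEDIATE
  REFRESH ON DEMAND
  ENABLE QUERY REWRITE
AS
SELECT region, product, SUM(amount) AS sum_amount
  FROM sales
 GROUP BY region, product;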
In distributed environments, you can use materialized views to replicate data at distributed sites
and to synchronize updates done at those sites with conflict resolution methods. The materialized
views as replicas provide local access to data that otherwise would have to be accessed from
remote sites. Materialized views are also useful in remote data marts.
You can also use materialized views to download a subset of data from central servers to mobile
clients, with periodic refreshes and updates between clients and the central servers.
You can use materialized views to increase the speed of queries on very large databases. Queries
to large databases often involve joins between tables, aggregations such as SUM, or both. These
operations are expensive in terms of time and processing power. The type of materialized view
you create determines how the materialized view is refreshed and used by query rewrite.
Materialized views improve query performance by pre-calculating expensive join and aggregation
operations on the database prior to execution and storing the results in the database. The
query optimizer automatically recognizes when an existing materialized view can and should
be used to satisfy a request. It then transparently rewrites the request to use the materialized
view. Queries go directly to the materialized view and not to the underlying detail tables. In
general, rewriting queries to use materialized views rather than detail tables improves response time.
Figure 9.1 illustrates how query rewrite works.
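For instance, continuing the hypothetical summary sketched earlier, an end user keeps writing queries against the detail table; with query rewrite enabled, the optimizer may answer them from the summary instead:
-- The user queries the detail table as usual ...
SELECT region, SUM(amount_sold)
FROM sales
GROUP BY region;
-- ... and the optimizer can transparently rewrite this query to scan
-- sales_by_region_product_mv, rolling its stored sums up by region.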
When using query rewrite, create materialized views that satisfy the largest number of queries.
For example, if you identify 20 queries that are commonly applied to the detail or fact tables, then
you might be able to satisfy them with five or six well-written materialized views. A materialized
view definition can include any number of aggregations (SUM, COUNT(x), COUNT(*),
COUNT(DISTINCT x), AVG, VARIANCE, STDDEV, MIN, and MAX). It can also include any
number of joins. If you are unsure of which materialized views to create, Oracle provides the
SQL Access Advisor, which is a set of advisory procedures in the DBMS_ADVISOR package to
help in designing and evaluating materialized views for query rewrite.
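As a hedged sketch of how the advisor might be invoked (the task name and the CREATE statement below are illustrative, not from this unit), DBMS_ADVISOR.TUNE_MVIEW takes a proposed materialized view definition and recommends how to make it fast refreshable and eligible for query rewrite:
DECLARE
  task_name VARCHAR2(30) := 'MV_TUNE_TASK'; -- hypothetical task name
BEGIN
  DBMS_ADVISOR.TUNE_MVIEW(task_name,
    'CREATE MATERIALIZED VIEW sales_by_region_product_mv
       ENABLE QUERY REWRITE AS
     SELECT region, prod_id, SUM(amount_sold)
     FROM sales GROUP BY region, prod_id');
END;
/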
If a materialized view is to be used by query rewrite, it must be stored in the same database as
the detail tables on which it relies. A materialized view can be partitioned, and you can define
a materialized view on a partitioned table. You can also define one or more indexes on the
materialized view.
Unlike indexes, materialized views can be accessed directly using a SELECT statement. However,
it is recommended that you try to avoid writing SQL statements that directly reference the
materialized view, because then it is difficult to change them without affecting the application.
Instead, let query rewrite transparently rewrite your query to use the materialized view.
Note The techniques shown in this unit illustrate how to use materialized views in
data warehouses. Materialized views can also be used by Oracle Replication.
In data warehouses, materialized views normally contain aggregates as shown in Example 9.1.
For fast refresh to be possible, the SELECT list must contain all of the GROUP BY columns (if
present), and there must be a COUNT(*) and a COUNT(column) on any aggregated columns.
Also, materialized view logs must be present on all tables referenced in the query that defines
the materialized view. The valid aggregate functions are: SUM, COUNT(x), COUNT(*), AVG,
VARIANCE, STDDEV, MIN, and MAX, and the expression to be aggregated can be any SQL
value expression.
Fast refresh for a materialized view containing joins and aggregates is possible after any type of
DML to the base tables (direct load or conventional INSERT, UPDATE, or DELETE). It can be
defined to be refreshed ON COMMIT or ON DEMAND. A REFRESH ON COMMIT materialized
view will be refreshed automatically when a transaction that does DML to one of the materialized
view’s detail tables commits. The time taken to complete the commit may be slightly longer than
usual when this method is chosen. This is because the refresh operation is performed as part of
the commit process. Therefore, this method may not be suitable if many users are concurrently
changing the tables upon which the materialized view is based.
Here is an example of a materialized view with aggregates; note that the materialized view
log is created only because this materialized view will be fast refreshed. A sketch is given after
the note below.
Note COUNT(*) must always be present to guarantee all types of fast refresh.
Otherwise, you may be limited to fast refresh after inserts only. Oracle recommends that
you include the optional COUNT(column) aggregates in the materialized view in order to
obtain the most efficient and accurate fast refresh of the aggregates.
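A sketch in the spirit of Example 9.1, assuming a sales detail table with columns prod_id, time_id and amount_sold: every GROUP BY column appears in the SELECT list, COUNT(*) is present, a COUNT(column) accompanies the SUM, and a materialized view log exists on the detail table.
-- Materialized view log, required only because the view is fast refreshed.
CREATE MATERIALIZED VIEW LOG ON sales
WITH SEQUENCE, ROWID (prod_id, time_id, amount_sold)
INCLUDING NEW VALUES;

CREATE MATERIALIZED VIEW sum_sales_mv
BUILD IMMEDIATE
REFRESH FAST ON COMMIT AS
SELECT prod_id, time_id,
       COUNT(*) AS cnt_all,             -- guarantees all types of fast refresh
       SUM(amount_sold) AS sum_sales,
       COUNT(amount_sold) AS cnt_sales  -- COUNT(column) for the aggregated column
FROM sales
GROUP BY prod_id, time_id;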
Task Suppose that a data warehouse for a university consists of the following
four dimensions: Student, Course, Semester and Instructor, and two measures: count and
avg_grade. At the lowest conceptual level (e.g., for a given student, course, semester
and instructor combination), the avg_grade measure stores the actual course grade of the
student; at higher conceptual levels, avg_grade stores the average grade for the given
combination.
1. Draw a snowflake schema diagram for the data warehouse.
2. Starting with the base cuboid, what specific OLAP operations should one perform in
order to list the average grade of CS courses for each student?
3. If each dimension has five levels (including all), such as student < major < status <
university < all, how many cuboids will this cube contain (including the base and
apex cuboids)?
Some materialized views contain only joins and no aggregates, such as the detail_sales_mv
example later in this section, where a materialized view is created that joins the sales table to the
times and customers tables.
The advantage of creating this type of materialized view is that expensive joins will be
precalculated.
Fast refresh for a materialized view containing only joins is possible after any type of DML to the
base tables (direct-path or conventional INSERT, UPDATE, or DELETE).
A materialized view containing only joins can be defined to be refreshed ON COMMIT or ON
DEMAND. If it is ON COMMIT, the refresh is performed at commit time of the transaction that
does DML on the materialized view’s detail table.
Notes If you specify REFRESH FAST, Oracle performs further verification of the query definition to
ensure that fast refresh can be performed if any of the detail tables change. These additional
checks are:
1. A materialized view log must be present for each detail table and the ROWID column must
be present in each materialized view log.
2. The rowids of all the detail tables must appear in the SELECT list of the materialized view
query definition.
3. If there are no outer joins, you may have arbitrary selections and joins in the WHERE
clause. However, if there are outer joins, the WHERE clause cannot have any selections.
Further, if there are outer joins, all the joins must be connected by ANDs and must use the
equality (=) operator.
4. If there are outer joins, unique constraints must exist on the join columns of the inner table.
For example, if you are joining the fact table and a dimension table and the join is an outer
join with the fact table being the outer table, there must exist unique constraints on the join
columns of the dimension table.
If some of these restrictions are not met, you can create the materialized view as REFRESH
FORCE to take advantage of fast refresh when it is possible. If one of the tables did not meet all
of the criteria, but the other tables did, the materialized view would still be fast refreshable with
respect to the other tables for which all the criteria are met.
If the materialized view contains only joins, the ROWID columns for each table (and each instance
of a table that occurs multiple times in the FROM list) must be present in the SELECT list of the
materialized view.
If the materialized view has remote tables in the FROM clause, all tables in the FROM clause must
be located on that same site. Further, ON COMMIT refresh is not supported for materialized
views with remote tables. Materialized view logs must be present on the remote site for each detail
table of the materialized view and ROWID columns must be present in the SELECT list of the
materialized view.
To improve refresh performance, you should create indexes on the materialized view’s columns
that store the rowids of the fact table.
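The join-only materialized view discussed in the following paragraphs can be sketched as below; this is a reconstruction that assumes Oracle's sample sales history schema. Note the rowid of every detail table in the SELECT list, and the materialized view logs created WITH ROWID:
CREATE MATERIALIZED VIEW LOG ON sales WITH ROWID;
CREATE MATERIALIZED VIEW LOG ON times WITH ROWID;
CREATE MATERIALIZED VIEW LOG ON customers WITH ROWID;

-- Join-only materialized view: the rowid of each detail table appears
-- in the SELECT list, so the view is fast refreshable.
CREATE MATERIALIZED VIEW detail_sales_mv
PARALLEL
BUILD IMMEDIATE
REFRESH FAST AS
SELECT s.rowid "sales_rid", t.rowid "times_rid", c.rowid "customers_rid",
       c.cust_id, c.cust_last_name, s.amount_sold, s.quantity_sold, s.time_id
FROM sales s, times t, customers c
WHERE s.cust_id = c.cust_id(+) AND s.time_id = t.time_id(+);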
In this example, to perform a fast refresh, UNIQUE constraints should exist on c.cust_id and
t.time_id. You should also create indexes on the columns sales_rid, times_rid, and customers_rid,
as illustrated in the following. This will improve the refresh performance.
CREATE INDEX mv_ix_salesrid ON detail_sales_mv("sales_rid");
-- analogous indexes would be created on "times_rid" and "customers_rid"
Alternatively, if the previous example did not include the columns times_rid and customers_rid,
and if the refresh method was REFRESH FORCE, then this materialized view would be fast
refreshable only if the sales table was updated but not if the tables times or customers were
updated.
CREATE MATERIALIZED VIEW detail_sales_mv
PARALLEL
BUILD IMMEDIATE
REFRESH FORCE AS
SELECT s.rowid "sales_rid", c.cust_id, c.cust_last_name, s.amount_sold,
       s.quantity_sold, s.time_id
FROM sales s, times t, customers c
WHERE s.cust_id = c.cust_id(+) AND s.time_id = t.time_id(+);
In a data warehouse, you typically create many aggregate views on a single join (for example,
rollups along different dimensions). Incrementally maintaining these distinct materialized
aggregate views can take a long time, because the underlying join has to be performed many
times.
Using nested materialized views, you can create multiple single-table materialized views based
on a joins-only materialized view and the join is performed just once. In addition, optimizations
can be performed for this class of single-table aggregate materialized view and thus refresh is
very efficient.
Some types of nested materialized views cannot be fast refreshed. Use DBMS_MVIEW.EXPLAIN_MVIEW
to identify those types of materialized views. You can refresh a tree of nested materialized views in
the appropriate dependency order by specifying the nested => TRUE parameter of the
DBMS_MVIEW.REFRESH procedure. For example, if you call DBMS_MVIEW.REFRESH('SUM_SALES_
CUST_TIME', nested => TRUE), the REFRESH procedure will first refresh the join_sales_cust_time
materialized view, and then refresh the sum_sales_cust_time materialized view. A sketch of such a
nested pair follows.
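A hedged sketch of this nesting, reusing the WITH ROWID materialized view logs created earlier and the join_sales_cust_time/sum_sales_cust_time names from the example above (the column names are assumptions):
-- The join is materialized once ...
CREATE MATERIALIZED VIEW join_sales_cust_time
REFRESH FAST AS
SELECT s.rowid "sales_rid", c.rowid "cust_rid", t.rowid "time_rid",
       c.cust_last_name, t.calendar_year, s.amount_sold
FROM sales s, customers c, times t
WHERE s.cust_id = c.cust_id AND s.time_id = t.time_id;

-- ... and a single-table aggregate materialized view is nested on top of it.
CREATE MATERIALIZED VIEW LOG ON join_sales_cust_time
WITH ROWID (cust_last_name, calendar_year, amount_sold)
INCLUDING NEW VALUES;

CREATE MATERIALIZED VIEW sum_sales_cust_time
REFRESH FAST AS
SELECT cust_last_name, calendar_year,
       COUNT(*) AS cnt_all,
       SUM(amount_sold) AS sum_sales,
       COUNT(amount_sold) AS cnt_sales
FROM join_sales_cust_time
GROUP BY cust_last_name, calendar_year;

-- Refresh the whole tree in dependency order:
-- EXECUTE DBMS_MVIEW.REFRESH('SUM_SALES_CUST_TIME', nested => TRUE);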
Data marts typically contain aggregated data and can be organized into a multidimensional
structure. Data extracted from each source can also be stored in intermediate data recipients.
Obviously, this hierarchy of data stores is a logical way to represent the data flows which go from
the sources to the data marts. All these stores are not necessarily materialized, and if they are,
they can just constitute different layers of the same database.
Figure 9.2 shows a typical data warehouse architecture. This is a logical view whose operational
implementation receives many different answers in the data warehousing products. Depending
on each data source, extraction and cleaning can be done by the same wrapper or by distinct
tools. Similarly data reconciliation (also called multi-source cleaning) can be separated from or
merged with data integration (multi-sources operations). High level aggregation can be seen as a
set of computation techniques ranging from simple statistical functions to advanced data mining
algorithms. Customisation techniques may vary from one data mart to another, depending on
the way decision makers want to see the elaborated data.
The refreshment of a data warehouse is an important process which determines the effective
usability of the data collected and aggregated from the sources. Indeed, the quality of data
provided to the decision makers depends on the capability of the data warehouse system to
convey in a reasonable time, from the sources to the data marts, the changes made at the data
sources. Most of the design decisions are then concerned with the choice of data structures and
update techniques that optimise the refreshment of the data warehouse.
There is considerable confusion in the literature concerning data warehouse refreshment. Indeed,
this process is often either reduced to a view maintenance problem or confused with the data
loading phase. Our purpose here is to show that data warehouse refreshment is more complex
than the view maintenance problem, and different from the loading process. We define the
refreshment process as a workflow whose activities depend on the available products for data
extraction, cleaning and integration, and whose triggering events depend on the application
domain and on the required quality in terms of data freshness.
Data refreshment in data warehouses is generally confused either with data loading as done during
the initial phase or with update propagation through a set of materialized views. Both analogies are
wrong. The following paragraphs set out the differences between data loading and data
refreshment, and between view maintenance and data refreshment.
The data warehouse loading phase consists of the initial data warehouse instantiation, that is, the
initial computation of the data warehouse content. This initial loading is globally a sequential
process of four steps (Figure 9.3): (i) preparation, (ii) integration, (iii) high level aggregation and
(iv) customisation. The first step is done for each source and consists of data extraction, data
cleaning and possibly data archiving before or after cleaning. Archiving data in a history can
be used both for synchronisation purposes between sources having different access frequencies
and for some specific temporal queries. The second step consists of data reconciliation and
integration, that is, multi-source cleaning of data originating from heterogeneous sources,
and derivation of the base relations (or base views) of the operational data store (ODS). The third
step consists of the computation of aggregated views from base views. While the data extracted
from the sources and integrated in the ODS is considered as ground data with a very low level of
aggregation, the data in the corporate data warehouse (CDW) is generally highly summarised
using aggregation functions. The fourth step consists of the derivation and customisation of the
user views which define the data marts. Customisation refers to the various presentations needed by
the users for multidimensional data.
The main feature of the loading phase is that it constitutes the latest stage of the data warehouse
design project. Before the end of the data loading, the data warehouse does not yet exist for the
users.
Consequently, there is no constraint on response time. In contrast, with respect to the
data sources, the loading phase requires a long period of availability.
The data flow which describes the loading phase can serve as a basis to define the refreshment
process, but the corresponding workflows are different. The workflow of the refreshment process
is dynamic and can evolve with users’ needs and with source evolution, while the workflow of
the initial loading process is static and defined with respect to current user requirements and
current sources.
The difference between the refreshment process and the loading process is mainly the
following. First, the refreshment process may have a complete asynchronism between its
different activities (preparation, integration, aggregation and customisation). Second, there may
be a high level parallelism within the preparation activity itself, each data source having its
own availability window and its own strategy of extraction. The synchronization is done by the
integration activity. Another difference lies in the source availability. While the loading phase
requires a long period of availability, the refreshment phase should not overload the operational
applications which use the data sources. Then, each source provides a specific access frequency
and a restricted availability duration. Finally, there are more constraints on response time for
the refreshment process than for the loading process. Indeed, with respect to the users, the data
warehouse does not exist before the initial loading, so the computation time is included within
the design project duration. After the initial loading, the data becomes visible and should satisfy
user requirements in terms of data availability, accessibility and freshness.
The propagation of changes during the refreshment process is done through a set of independent
activities among which we find the maintenance of the views stored in the ODS and CDW levels.
The view maintenance phase consists in propagating a certain change raised in a given source
over a set of views stored at the ODS or CDW level. Such a phase is a classical materialized
view maintenance problem except that, in data warehouses, the changes to propagate into the
aggregated views are not exactly those occurred in the sources, but the result of pre-treatments
performed by other refreshment activities such as data cleaning and multi-source data
reconciliation.
The view maintenance problem has been intensively studied in the database research community.
Most of the references focus on the problems raised by the maintenance of a set of materialized
(also called concrete) views derived from a set of base relations when the current state of the base
relations is modified. The main results concern:
1. Self-maintainability: Results concerning self-maintainability are generalized for a set
of views: a set of views V is self-maintainable with respect to changes to the underlying
base relations if the changes may be propagated to every view in V without querying the
base relations (i.e. the information stored in the concrete views plus the instance of the
changes is sufficient to maintain the views).
2. Coherent and efficient update propagation: Various algorithms are provided to schedule
update propagation through each individual view, taking care of interdependencies
between views, which may lead to possible inconsistencies. For this purpose, auxiliary views
are often introduced to facilitate update propagation and to enforce self-maintainability.
Results on the self-maintainability of a set of views are of great interest in the data warehouse
context, and it is commonly admitted that the set of views stored in a data warehouse has
to be globally self-maintainable. The rationale behind this recommendation is that self-
maintainability is a strong requirement imposed by the operational sources so as not to
overload their regular activity.
Research on data warehouse refreshment has mainly focused on update propagation through
materialized views. Many papers have been published on this topic, but very few are devoted to
the whole refreshment process as defined before. We consider view maintenance as just one step
of the complete refreshment process. Other steps concern data cleaning, data reconciliation, data
customisation, and if needed data archiving. On the other hand, extraction and cleaning strategies
may vary from one source to another, as may update propagation, which may vary from
one user view to another, depending for example on the desired freshness of data. So the data
warehouse refreshment process cannot be reduced to a view maintenance process.
To summarize the previous discussion, we can say that a refreshment process is a complex system
which may be composed of asynchronous and parallel activities that need a certain amount of
monitoring. The refreshment process is an event-driven system which evolves frequently, following
the evolution of data sources and user requirements. Users, data warehouse administrators and data
source administrators may impose specific constraints such as, respectively, freshness of data, space
limitations of the ODS or CDW, and access frequency to sources. There is no simple and unique
refreshment strategy which is suitable for all data warehouse applications, for all data warehouse
users, or for the whole data warehouse lifetime.
Task Suppose that the data for analysis include the attribute age. The age values for
the data tuples are (in increasing order):
13, 15, 16, 19, 20, 21, 22, 22, 25, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 58
1. Use smoothing by bin means to smooth the above data, using a bin depth of 3.
Illustrate your steps. Comment on the effect of this technique for the given data.
2. How might you determine outliers in the data?
3. Use min-max transformation to transform the value 35 for age onto the range
[0.0, 1.0].
The refreshment process aims to propagate changes raised in the data sources to the data
warehouse stores. This propagation is done through a set of independent activities (extraction,
cleaning, integration, ...) that can be organized in different ways, depending on the semantics one
wants to assign to the refreshment process and on the quality one wants to achieve. The ordering
of these activities and the context in which they are executed define this semantics and influence
this quality. Ordering and context result from the analysis of view definitions, data source
constraints and user requirements in terms of quality factors. In the following subsections, we will
describe the refreshment activities and their organization as a workflow. Then we give examples
of different workflow scenarios to show how refreshment may be a dynamic and evolving
process. Finally, we summarize the different perspectives through which a given refreshment
scenario should be considered.
The refreshment process is similar to the loading process in its data flow but, while the loading
process is a massive feeding of the data warehouse, the refreshment process captures the
differential changes recorded in the sources and propagates them through the hierarchy of data stores
in the data warehouse. The preparation step extracts from each source the data that characterises
the changes that have occurred in this source since the last extraction. As for loading, this data
is cleaned and possibly archived before its integration. The integration step reconciles the
source changes coming from multiple sources and adds them to the ODS. The aggregation
step recomputes incrementally the hierarchy of aggregated views using these changes. The
customisation step propagates the summarized data to the data marts. As well as for the loading
phase, this is a logical decomposition whose operational implementation receives many different
answers in the data warehouse products. This logical view allows a certain traceability of the
refreshment process. Figure 9.4 shows the activities of the refreshment process as well as a sample
of the coordinating events.
In workflow systems, activities are coordinated by control flows which may be notification of
process commitment, emails issued by agents, temporal events, or any other trigger events. In the
refreshment process, coordination is done through a wide range of event types.
You can distinguish several event types which may trigger and synchronize the refreshment
activities. They might be temporal events, termination events or any other user-defined event.
Depending on the refreshment scenario, one can choose an appropriate set of event types which
achieves the correct level of synchronization.
Activities of the refreshment workflow are not executed as soon as they are triggered; they may
depend on the current state of the input data stores. For example, if the extraction is triggered
periodically, it is actually executed only when there are effective changes in the source log file. If
the cleaning process is triggered immediately after the extraction process, it is actually executed
only if the extraction process has gathered some source changes. Consequently, the state of the
input data store of each activity may be considered as a condition to effectively execute this
activity.
Within the workflow which represents the refreshment process, activities may be of different
origins and different semantics; the refreshment strategy is logically considered as independent
of what the activities actually do. However, at the operational level, some activities can be merged
(e.g., extraction and cleaning), and others decomposed (e.g., integration). The flexibility
claimed for workflow systems should make it possible to tailor the refreshment activities and
the coordinating events dynamically.
There may be another way to represent the workflow and its triggering strategies. Indeed,
instead of considering external events such as temporal events or termination events of the
different activities, we can consider data changes as events. Hence, each input data store of the
refreshment workflow is considered as an event queue that triggers the corresponding activity.
However, to be able to represent different refreshment strategies, this approach needs a parametric
synchronization mechanism which allows the activities to be triggered at the right moment. This can
be done by introducing composite events which combine, for example, data change events and
temporal events. Another alternative is to put locks on data stores and remove them after an
activity or a set of activities decide to commit. In the case of a long term synchronization policy,
as it may sometimes happen in some data warehouses, this latter approach is not sufficient.
Two main agent types are involved in the refreshment workflow: human agents which define
requirements, constraints and strategies, and computer agents which process activities. Among
human agents we can distinguish users, the data warehouse administrator and source administrators.
Among computer agents, we can mention source management systems, database systems
used for the data warehouse and data marts, wrappers and mediators. For simplicity, agents
are not represented in the refreshment workflow which concentrates on the activities and their
coordination.
To illustrate different workflow scenarios, we consider the following example, which concerns
three national Telecom billing sources represented by three relations S1, S2, and S3. Each relation
has the same (simplified) schema: (#PC, date, duration, cost). An aggregated view V with schema
(avg_duration, avg_cost, country) is defined in a data warehouse from these sources as the average
duration and cost of a phone call, during the last 6 months, in each of the three countries associated
with the sources. We assume that the construction of the view follows the steps explained
before. During the preparation step, the data of the last six months contained in each source is
cleaned (e.g., all cost units are translated into Euros). Then, during the integration phase, a base
relation R with schema (date, duration, cost, country) is constructed by unioning the data coming
from each source and generating an extra attribute (country). Finally, the view is computed using
aggregates (Figure 9.5).
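Under the simplifying assumption that the cleaned sources are available as relational tables S1, S2 and S3 (and with hypothetical country names, since the example does not fix them), R and V could be sketched as:
-- Base relation R: union of the cleaned sources, with a generated
-- country attribute; call_date stands for the date attribute.
CREATE VIEW r AS
SELECT call_date, duration, cost, 'France'  AS country FROM s1
UNION ALL
SELECT call_date, duration, cost, 'Germany' AS country FROM s2
UNION ALL
SELECT call_date, duration, cost, 'Italy'   AS country FROM s3;

-- Aggregated view V: average duration and cost per country
-- over the last six months.
CREATE VIEW v AS
SELECT country, AVG(duration) AS avg_duration, AVG(cost) AS avg_cost
FROM r
WHERE call_date >= ADD_MONTHS(SYSDATE, -6)
GROUP BY country;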
We can define another refreshment scenario with the same sources and similar views. This
scenario mirrors the average duration and cost for each day instead of for the last six months.
This requires changing the frequency of extraction, cleaning, integration and propagation. Figure
9.6 gives one such possible scenario. The frequencies of source extraction are those allowed
by the source administrators.
When the refreshment activities are long term activities or when the DWA wants to apply
validation procedures between activities, temporal events or activity terminations can be used
to synchronize the whole refreshment process. In general, the quality requirements may impose a
certain synchronization strategy. For example, if users desire high freshness for data, this means
that each update in a source should be mirrored as soon as possible to the views. Consequently,
this determines the strategy of synchronization: trigger the extraction after each change in a
source, trigger the integration, when semantically relevant, after the commit of each data source,
propagate changes through views immediately after integration, and customize the user views
in data marts.
Refreshment Scheduling
Several scheduling strategies can be distinguished:
1. Client-driven refreshment, in which refreshment is triggered on demand by the users or
the data marts (a pull strategy).
2. Source-driven refreshment, in which refreshment is triggered by the detection of changes
in the data sources and follows the flow of these changes. The triggering of the extraction
may also be different from one source to another. Different events can be defined, such as
temporal events (periodic or fixed absolute time), an event after each change detected on
the source, or a demand from the integration process.
3. ODS-driven refreshment, which defines the part of the process that is automatically
monitored by the data warehouse system. This part concerns the integration phase. It may
be triggered at a synchronization point defined with respect to the end of the preparation
phase. Integration can be considered as a whole and concern all the source changes at the
same time. In this case, it can be triggered by an external event which might be a temporal
event or the end of the preparation phase of the last source. The integration can also
be sequenced with respect to the termination of the preparation phase of each source, that
is, each extraction is integrated as soon as its cleaning is finished. The ODS can also monitor
the preparation phase and the aggregation phase by generating the relevant events that
trigger the activities of these phases.
In the simplest case, one of the first two approaches is used as the single strategy. In a more
complex case, there may be as many strategies as there are sources or high level aggregated
views. In between, there may be, for example, four different strategies corresponding to the
four phases described earlier. For some user views, one can apply the client-driven strategy (pull
strategy), while for other views one can apply the ODS-driven strategy (push strategy). Similarly,
some sources may be solicited through a pull strategy while others apply a push strategy.
The strategy to choose depends on the semantic parameters but also on the tools available to
perform the refreshment activities (extraction, cleaning, integration). Some extraction tools also
do the cleaning on the fly, while some integrators immediately propagate changes up to the high
level views. The generic workflow in Figure 9.4 is thus a logical view of the refreshment process.
It shows the main identified activities and the potential event types which can trigger them.
Case Study
BT Group
A global communications company runs the Trillium Software System® at the heart of its
customer-focused operations.
The Company
BT Group is one of Europe’s leading providers of telecommunications services. Its principal
activities include local, national, and international telecommunications services, higher-
value broadband and internet products and services, and IT solutions.
In the UK, BT serves over 20 million business and residential customers with more than 29
million exchange lines, and provides network services to other licensed operators.
Reducing customer dissatisfaction by 25% a year is a key target in BT Group’s drive to
deliver the highest levels of customer satisfaction. In the 2003 financial year, the Group as
a whole achieved a 37% reduction in customer dissatisfaction.
The Challenge
BT’s well-established strategy puts customers first and includes ensuring that customers
are recognized and properly managed by appropriate account teams who can access all
relevant information rapidly. This is made possible behind the scenes by a well-developed
and strategic approach to customer information management and data quality.
“BT recognizes that there is a link between the quality and completeness of customer
information and good account management, customer service, and effective operations.
Customer data is a key business asset. We must manage it with strategic intent,” said Nigel
Turner, Manager of Information & Knowledge Management Consultancy, BT Exact (BT’s
research, technology, and IT operations business), which is assisting BT in its information
management strategy.
Three initiatives take center stage in BT’s customer information strategy: the Name and
Address System (NAD), a Customer Relationship Management (CRM) system from Siebel,
and the Customer Service System (CSS).
NAD is being built from multiple legacy data sources to provide what is to become a single
definitive customer name and address repository. The multimillion-dollar Siebel-based
CRM solution commenced rollout in what is destined to become one of the largest Siebel
implementations anywhere in the world. CSS provides a complete and accurate view
of each customer from 28 million customer records across 29 disparate repositories and
11 mainframes, where many customers have multiple records. The system, now central
to customer services, determines the availability of services, the location of a prospect,
account management, billing, and fault repair.
A significant challenge in each of these enormous and mission-critical business investments
is ensuring the quality of source data in order to deliver output that is accurate, complete,
and valuable enough to deliver a strong return on investment (ROI). Garbage in, garbage
out applies! BT has needed a process capable of taking tens of millions of customer records
from multiple disparate sources, then cleansing and standardizing address formats, then
accurately identifying, grouping and linking records that are common to one customer
account, household or business. Furthermore, the strategy has had to be capable of building
and maintaining customer data quality over time, encompassing both existing data offline
and new data entered by call center operatives and others in real time.
9.6 Summary
zz This unit has presented an analysis of the refreshment process in data warehouse
applications.
zz You have seen that the refreshment process can be reduced neither to a view maintenance
process nor to a loading process.
zz You have seen, through a simple example, that the refreshment of a data warehouse can
be conceptually viewed as a workflow process.
zz You have identified the different tasks of the workflow and seen how they can be
organized in different refreshment scenarios, leading to different refreshment semantics.
zz You have highlighted design decisions impacting the refreshment semantics and seen
how these decisions relate to quality factors such as data freshness and to constraints
such as source availability and accessibility.
9.7 Keywords
Data Refreshment: The process of propagating the changes raised in the data sources to the data
stores of the warehouse; it is distinct from both initial data loading and materialized view
maintenance.
Data Warehousing: Data warehousing is a new technology which provides software infrastructure
for decision support systems and OLAP applications.
Materialized View: A schema object that stores pre-computed query results, eliminating the
overhead associated with expensive joins and aggregations for a large or important class of
queries.
Nested Materialized View: A nested materialized view is a materialized view whose definition is
based on another materialized view.
Answers: Self Assessment
1. (b) 2. (a)
3. (d) 4. materialized views
5. workflow 6. refreshment process
7. integration step 8. improving query performance
9. database administrator 10. remote
Books A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.
Alex Berson, Data Warehousing Data Mining and OLAP, Tata McGraw Hill, 1997.
Alex Berson, Stephen J. Smith, Data warehousing, Data Mining & OLAP, Tata
McGraw Hill, Publications, 2004.
Alex Freitas and Simon Lavington, Mining Very Large Databases with Parallel
Processing, Kluwer Academic Publishers, 1998.
J. Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers,
1993.
Jiawei Han, Micheline Kamber, Data Mining – Concepts and Techniques, Morgan
Kaufmann Publishers, First Edition, 2003.
Matthias Jarke, Maurizio Lenzerini, Yannis Vassiliou, Panos Vassiliadis,
Fundamentals of Data Warehouses, Springer.
Michael Berry and Gordon Linoff, Data Mining Techniques (For Marketing, Sales,
and Customer Support), John Wiley & Sons, 1997.
Michael J. A. Berry, Gordon S Linoff, Data Mining Techniques, Wiley Publishing
Inc, Second Edition, 2004.
Sam Anohory, Dennis Murray, Data Warehousing in the Real World, Addison
Wesley, First Edition, 2000.
Sholom M. Weiss and Nitin Indurkhya, “Predictive Data Mining: A Practical Guide”,
Morgan Kaufmann Publishers, 1998.
Sushmita Mitra, Tinku Acharya, Data Mining – Multimedia, Soft Computing and
Bioinformatics, John Wiley & Sons, 2003.
Usama Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy
Uthurasamy, Advances in Knowledge Discovery and Data Mining, AAAI Press/ The
MIT Press, 1996.
V. Cherkassky and F. Mulier, Learning From Data, John Wiley & Sons, 1998.
Contents
Objectives
Introduction
10.1 Multi-dimensional View of Information
10.2 The Logical Model for Multi-dimensional Information
10.3 Relational Implementation of the Model
10.3.1 Star Schema
10.3.2 Snowflake Schema
10.4 ROLAP Model
10.5 MOLAP Model
10.6 Conceptual Models for Multi-dimensional Information
10.7 Summary
10.8 Keywords
10.9 Self Assessment
10.10 Review Questions
10.11 Further Readings
Objectives
After studying this unit, you will be able to:
zz Explain multi-dimensional view of information
zz Describe ROLAP data model
zz Describe MOLAP data model
zz Describe logical and conceptual model for multi-dimensional information
Introduction
Metadata is used throughout Oracle OLAP to define a logical multidimensional model:
1. To describe the source data as multidimensional objects for use by the analytic workspace
build tools.
2. To identify the components of logical objects in an analytic workspace for use by the
refresh, aggregation, and enablement tools.
3. To describe relational views of analytic workspaces as multidimensional objects for use by
OLAP applications.
Logical Cubes: Logical cubes provide a means of organising measures that have the same
shape, that is, they have the exact same dimensions. Measures in the same cube have the same
relationships to other logical objects and can easily be analysed and displayed together.
Logical Measures: Measures populate the cells of a logical cube with the facts collected about
business operations. Measures are organised by dimensions, which typically include a Time
dimension.
Measures are static and consistent while analysts are using them to inform their decisions. They
are updated in a batch window at regular intervals: weekly, daily, or periodically throughout the
day. Many applications refresh their data by adding periods to the time dimension of a measure,
and may also roll off an equal number of the oldest time periods. Each update provides a fixed
historical record of a particular business activity for that interval. Other applications do a full
rebuild of their data rather than performing incremental updates.
Logical Dimensions: Dimensions contain a set of unique values that identify and categorise data.
They form the edges of a logical cube, and thus of the measures within the cube. Because measures
are typically multidimensional, a single value in a measure must be qualified by a member of
each dimension to be meaningful. For example, the Sales measure has four dimensions: Time,
Customer, Product, and Channel. A particular Sales value (43,613.50) only has meaning when it
is qualified by a specific time period (Feb-01), a customer (Warren Systems), a product (Portable
PCs), and a channel (Catalog).
Logical Hierarchies and Levels: A hierarchy is a way to organise data at different levels of
aggregation. In viewing data, analysts use dimension hierarchies to recognise trends at one level,
drill down to lower levels to identify reasons for these trends, and roll up to higher levels to see
what effect these trends have on a larger sector of the business.
Each level represents a position in the hierarchy. Each level above the base (or most detailed)
level contains aggregate values for the levels below it. The members at different levels have a
one-to-many parent-child relation. For example, Q1-2002 and Q2-2002 are the children of 2002,
thus 2002 is the parent of Q1-2002 and Q2-2002.
Suppose a data warehouse contains snapshots of data taken three times a day, that is, every
8 hours. Analysts might normally prefer to view the data that has been aggregated into days,
weeks, quarters, or years. Thus, the Time dimension needs a hierarchy with at least five levels.
Similarly, a sales manager with a particular target for the upcoming year might want to allocate
that target amount among the sales representatives in his territory; the allocation requires a
dimension hierarchy in which individual sales representatives are the child values of a particular
territory.
Hierarchies and levels have a many-to-many relationship. A hierarchy typically contains several
levels, and a single level can be included in more than one hierarchy.
Logical Attributes: An attribute provides additional information about the data. Some attributes
are used for display. For example, you might have a product dimension that uses Stock Keeping
Units (SKUs) for dimension members. The SKUs are an excellent way of uniquely identifying
thousands of products, but are meaningless to most people if they are used to label the data in a
report or graph. You would define attributes for the descriptive labels.
In the logical multidimensional model, a cube represents all measures with the same shape, that
is, the exact same dimensions. In a cube shape, each edge represents a dimension. The dimension
members are aligned on the edges and divide the cube shape into cells in which data values are
stored.
In an analytic workspace, the cube shape also represents the physical storage of multidimensional
measures, in contrast with two-dimensional relational tables. An advantage of the cube shape is
that it can be rotated: there is no one right way to manipulate or view the data. This is an important
part of multidimensional data storage, calculation, and display, because different analysts need
to view the data in different ways. For example, if you are the Sales Manager, then you need to
look at the data differently from a product manager or a financial analyst.
Assume that a company collects data on sales. The company maintains records that quantify how
many of each product was sold in a particular sales region during a specific time period. You can
visualise the sales measure as the cube shown in Figure 10.2.
Figure 10.2 compares the sales of various products in different cities for January 2001 (shown)
and February 2001 (not shown). This view of the data might be used to identify products that
are performing poorly in certain markets. Figure 10.3 shows sales of various products during a
four-month period in Rome (shown) and Tokyo (not shown). This view of the data is the basis
for trend analysis.
A cube shape is three dimensional. Of course, measures can have many more than three
dimensions, but three dimensions are the maximum number that can be represented pictorially.
Additional dimensions are pictured with additional cube shapes.
A star schema is a convention for organising the data into dimension tables, fact tables, and
materialised views. Ultimately, all of the data is stored in columns, and metadata is required to
identify the columns that function as multidimensional objects.
Dimension Tables
A star schema stores all of the information about a dimension in a single table. Each level of a
hierarchy is represented by a column or column set in the dimension table. A dimension object
can be used to define the hierarchical relationship between two columns (or column sets) that
represent two levels of a hierarchy; without a dimension object, the hierarchical relationships are
defined only in metadata. Attributes are stored in columns of the dimension tables.
A snowflake schema normalises the dimension members by storing each level in a separate
table.
Fact Tables
Measures are stored in fact tables. Fact tables contain a composite primary key, which is composed
of several foreign keys (one for each dimension table) and a column for each measure that uses
these dimensions.
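A minimal star schema sketch (the table and column names are illustrative, and product_dim and location_dim are assumed to be defined analogously to time_dim):
-- Dimension table: every level of the Time hierarchy as a column.
CREATE TABLE time_dim (
  time_id   NUMBER PRIMARY KEY,
  day_date  DATE,
  month_nm  VARCHAR2(7),
  quarter   VARCHAR2(6),
  year_no   NUMBER
);

-- Fact table: a composite primary key made of the dimension foreign
-- keys, plus one column per measure.
CREATE TABLE sales_fact (
  time_id       NUMBER REFERENCES time_dim,
  product_id    NUMBER REFERENCES product_dim,
  location_id   NUMBER REFERENCES location_dim,
  sales_dollar  NUMBER,
  quantity_sold NUMBER,
  PRIMARY KEY (time_id, product_id, location_id)
);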
Aggregate data is calculated on the basis of the hierarchical relationships defined in the dimension
tables. These aggregates are stored in separate tables, called summary tables or materialised
views. Oracle provides extensive support for materialised views, including automatic refresh
and query rewrite.
Queries can be written either against a fact table or against a materialised view. If a query is
written against the fact table that requires aggregate data for its result set, the query is either
redirected by query rewrite to an existing materialised view, or the data is aggregated on the
fly.
Each materialised view is specific to a particular combination of levels; in Figure 10.5, only two
materialised views are shown of a possible 27 (3 dimensions with 3 levels each give 3³ = 27
possible level combinations).
Example: Suppose an organisation sells products throughout the world. The four major
dimensions are product, location, time and organisation.
In the example in Figure 10.5, the sales fact table is connected to the dimensions location, product,
time and organisation. It shows that data can be sliced across all dimensions and again it is possible
for the data to be aggregated across multiple dimensions. “Sales Dollar” in sales fact table can
be calculated across all dimensions independently or in a combined manner, which is explained
below:
1. Sales Dollar value for a particular product
2. Sales Dollar value for a product in a location
3. Sales Dollar value for a product in a year within a location
4. Sales Dollar value for a product in a year within a location sold or serviced by an
employee
The snowflake schema is a variant of the star schema model, where some dimension tables are
normalised, thereby further splitting the data into additional tables. The resulting schema graph
forms a shape similar to a snowflake.
The major difference between the snowflake and star schema models is that the dimension tables
of the snowflake model may be kept in normalised form. Such tables are easy to maintain and
save storage space, because a dimension table can become extremely large when the dimensional
structure is included as columns; since much of this space holds redundant data, creating a normalised
structure reduces the overall space requirement. However, the snowflake structure can reduce
the effectiveness of browsing since more joins will be needed to execute a query. Consequently,
the system performance may be adversely impacted. Performance benchmarking can be used to
determine what is best for your design.
Example: The Snowflake schema diagram shown in Figure 10.6 has 4
dimension tables, 4 lookup tables and 1 fact table. The reason is that the hierarchies (category, branch,
state, and month) are broken out of the dimension tables (PRODUCT, ORGANISATION,
LOCATION, and TIME) respectively and shown separately. In OLAP, this snowflake
approach increases the number of joins, which can result in poorer performance when retrieving
data. A sketch of such a normalised dimension follows.
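A hedged sketch of snowflaking the PRODUCT dimension (names are illustrative, and sales_fact is the fact table assumed earlier; this is a variant of the star schema sketch, not an addition to it): the category level moves into its own lookup table, and reaching it now costs one extra join.
-- The category level is normalised out into a separate lookup table.
CREATE TABLE category_lookup (
  category_id   NUMBER PRIMARY KEY,
  category_name VARCHAR2(50)
);

CREATE TABLE product_dim (
  product_id   NUMBER PRIMARY KEY,
  product_name VARCHAR2(50),
  category_id  NUMBER REFERENCES category_lookup
);

-- Aggregating by category now requires joining through the lookup table.
SELECT c.category_name, SUM(f.sales_dollar)
FROM sales_fact f, product_dim p, category_lookup c
WHERE f.product_id = p.product_id
AND p.category_id = c.category_id
GROUP BY c.category_name;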
A compromise between the star schema and the snowflake schema is to adopt a mixed schema
where only the very large dimension tables are normalised. Normalising large dimension tables
saves storage space, while keeping small dimension tables unnormalised may reduce the cost and
performance degradation due to joins on multiple dimension tables. Doing both may lead to an
overall performance gain. However, careful performance tuning could be required to determine
which dimension tables should be normalised and split into multiple tables.
Sophisticated applications may require multiple fact tables to share dimension tables. This kind
of schema can be viewed as a collection of stars, and hence is called a galaxy schema or a fact
constellation.
Example: A fact constellation schema of a data warehouse for sales and shipping is
shown in the following Figure 10.7.
Figure 10.7: Fact Constellation Schema of a Data Warehouse for Sales and Shipping
Task Discuss the pros and cons of top down and bottom up approaches.
In data warehousing, there is a distinction between a data warehouse and a data mart. A data
warehouse collects information about subjects that span the entire organisation, such as customers,
items, sales, assets, and personnel, and thus its scope is enterprise-wide. For data warehouses,
the fact constellation schema is commonly used since it can model multiple, interrelated subjects.
A data mart, on the other hand, is a departmental subset of the data warehouse that focuses on
selected subjects, and thus its scope is department-wide. For data marts, the star or snowflake
schema is popular since each is geared towards modelling single subjects.
A concept hierarchy defines a sequence of mappings from a set of low level concepts to higher
level, more general concepts. Consider a concept hierarchy for the dimension location. City
values for location include Lucknow, Mumbai, New York, and Chicago. Each city, however, can
be mapped to the province or state to which it belongs. For example, Lucknow can be mapped
to Uttar Pradesh, and Chicago to Illinois. The provinces and states can in turn be mapped to
the country to which they belong, such as India or the USA. These mappings form a concept
hierarchy for the dimension location, mapping a set of low level concepts (i.e., cities) to higher
level, more general concepts (i.e., countries). The concept hierarchy described above is illustrated
in Figure 10.8.
Many concept hierarchies are implicit within the database schema. For example, suppose that the
dimension location is described by the attributes number, street, city, province_or_state, zipcode,
and country. These attributes are related by a total order, forming a concept hierarchy such as
"street < city < province_or_state < country". This hierarchy is shown in Figure 10.9(a).
Alternatively, the attributes of a dimension may be organised in a partial order, forming a lattice.
An example of a partial order for the time dimension based on the attributes day, week, month,
quarter, and year is "day < {month < quarter; week} < year". This lattice structure is shown in
Figure 10.9(b).
A concept hierarchy that is a total or partial order among attributes in a database schema is
called a schema hierarchy. Concept hierarchies that are common to many applications may
be predefined in the data mining system, such as the concept hierarchy for time. Data mining
systems should provide users with the flexibility to tailor predefined hierarchies according to
their particular needs. For example, one may like to define a fiscal year starting on April 1, or an
academic year starting on September 1.
Data Warehouses use On-line Analytical Processing (OLAP) to formulate and execute user
queries. OLAP is an SQL-based methodology that provides aggregate data (measurements)
along a set of dimensions, in which: each dimension table includes a set of attributes; each
measure depends on a set of dimensions that provide the context for the measure (e.g., for a
reseller company, the measure is the number of sold units, which are described by the
corresponding location, time, and item type); and all dimensions are assumed to uniquely
determine the measure (e.g., for the reseller company, the location, time, producer, and item
type provide all the information necessary to determine the context of a particular number of
sold units).
There are five basic OLAP commands that are used to perform data retrieval from a data
warehouse.
1. ROLL UP, which is used to navigate to higher levels of aggregation (lower levels of detail)
for a given data cube. This command takes the current data cube (object) and performs a
GROUP BY on one of the dimensions, e.g., given the total number of sold units by month,
it can provide sales summarised by quarter.
2. DRILL DOWN, which is used to navigate to higher levels of detail. This command is the
opposite of ROLL UP, e.g., given the total number of units sold for an entire continent, it
can provide sales in the U.S.A.
3. SLICE, which provides a cut through a given data cube. This command enables users to
focus on some specific slice of data inside the cube, e.g., the user may want to look at the
data concerning unit sales only in Mumbai.
4. DICE, which provides just one cell from the cube (the smallest slice), e.g. it can provide
data concerning the number of sold Canon printers in May 2002 in Lucknow.
5. PIVOT, which rotates the cube to change the perspective, e.g., the “time item” perspective
may be changed into “time location.”
These commands, in terms of their specification and execution, are usually carried out using
a point-and-click interface, and therefore we do not describe their syntax. Instead, we give
examples for each of the above OLAP commands; a rough SQL rendering of the commands is
also sketched below.
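As a rough SQL rendering, assuming a hypothetical flattened table sales(item, city, month_nm, quarter, units), the commands correspond approximately to:
-- ROLL UP: summarise monthly figures into quarters.
SELECT item, quarter, SUM(units) FROM sales GROUP BY item, quarter;

-- DRILL DOWN: break continent-level figures down to cities.
SELECT item, city, SUM(units) FROM sales GROUP BY item, city;

-- SLICE: fix one dimension.
SELECT item, month_nm, SUM(units) FROM sales
WHERE city = 'Los Angeles' GROUP BY item, month_nm;

-- DICE: fix all dimensions, down to a single cell.
SELECT SUM(units) FROM sales
WHERE item = 'Canon printer' AND month_nm = 'May-2002' AND city = 'Lucknow';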
ROLL UP Command
The ROLL UP command allows the user to summarise data at a more general level of the
hierarchy. For instance, if the user currently analyses the number of sold CPU units for each
month in the first half of 2002, this command allows him/her to aggregate this information into
the first two quarters: using ROLL UP, the monthly view is transformed into a quarterly view.
From the perspective of a three-dimensional cuboid, the time (y) axis is transformed from
months to quarters; see the shaded cells in Figure 10.10.
DRILL DOWN Command
The DRILL DOWN command provides a more detailed breakdown of information from lower
in the hierarchy. For instance, if the user currently analyses the number of sold CPU and Printer
units in Europe and the U.S.A., it allows him/her to find details of sales in specific cities in the
U.S.A.: the continent-level view is transformed into a city-level view. Again, using a data cube
representation, the location (z) axis is transformed from summarisation by continents to sales
for individual cities; see the shaded cells in Figure 10.11.
SLICE and DICE Commands
These commands perform selection and projection of the data cube onto one or more user-
specified dimensions.
SLICE allows the user to focus the analysis of the data on a particular perspective from one or
more dimensions. For instance, if the user analyses the number of sold CPU and Printer units in
all combined locations in the first two quarters of 2002, he/she can ask to see the units in the same
time frame in a particular city, say Los Angeles.
The DICE command, in contrast to SLICE, requires the user to impose restrictions on all
dimensions in a given data cube. An example is a SLICE command which provides data about
sales only in L.A., and a DICE command which provides data about sales of Canon printers in
May 2002 in L.A.
PIVOT Command
PIVOT is used to rotate a given data cube to select a different view. Given that the user currently
analyses the sales for particular products in the first quarter of 2002, he/she can shift the focus to
see sales in the same quarter, but for different continents instead of for products.
Many conceptual data models exist with different features and expressive powers, mainly
depending on the application domain for which they are conceived. As we have said in the
Introduction, in the context of data warehousing it was soon realized that traditional conceptual
models for database modelling, such as the Entity-Relationship model, do not provide a suitable
means to describe the fundamental aspects of such applications. The crucial point is that in designing
a data warehouse, there is the need to represent explicitly certain important characteristics of the
information contained therein, which are related not to the abstract representation of real world
concepts, but rather to the final goal of the data warehouse: supporting data analysis oriented
to decision making. More specifically, it is widely recognized that there are at least two specific
notions that any conceptual data model for data warehousing should include in some form:
the fact (or its usual representation, the data cube) and the dimension. A fact is an entity of an
application that is the subject of decision-oriented analysis and is usually represented graphically
by means of a data cube. A dimension corresponds to a perspective under which facts can be
fruitfully analyzed. Thus, for instance, in a retail business, a fact is a sale and possible dimensions
are the location of the sale, the type of product sold, and the time of the sale.
Practitioners usually tend to model these notions using structures that refer to the practical
implementation of the application. Indeed, a widespread notation used in this context is the
“star schema” (and variants thereof) in which facts and dimensions are simply relational tables
connected in a specific way. An example is given in Figure 10.14. Clearly, this low level point of
view barely captures the essential aspects of the application. Conversely, in a conceptual model
these concepts would be represented in abstract terms which is fundamental for concentration on
the basic, multidimensional aspects that can be employed in data analysis, as opposed to getting
distracted by the implementation details.
Before tackling in more detail the characteristics of conceptual models for multidimensional
applications, it is worth making two general observations. First, we note that in contrast to
other application domains, in this context not only at the physical (and logical) but also at
the conceptual level, data representation is largely influenced by the way in which final users
need to view the information. Second, we recall that conceptual data models are usually used
in the preliminary phase of the design process to analyze the application in the best possible
way, without implementation “contaminations”. There are however further possible uses of
multidimensional conceptual representations. First of all, they can be used for documentation
purposes, as they are easily understood by non-specialists. They can also be used to describe
in abstract terms the content of a data warehousing application already in existence. Finally, a
conceptual scheme provides a description of the contents of the data warehouse which, leaving
aside the implementation aspects, is useful as a reference for devising complex analytical
queries.
Case Study The Carphone Warehouse
The Carphone Warehouse Calls Trillium Software® in CRM Initiative
The Carphone Warehouse Group plc, known as The Phone House in some countries
of operation, was founded in 1989. One of the largest independent retailers of mobile
communications in Europe, the group sells mobile phones, phone insurance, network and
fixed-line connections through its 1,300 stores, Web site, and direct marketing operations
and employs approximately 11,000 people.
A Better Mobile Life
The Carphone Warehouse mission is not just to sell products and services profitably but
to offer “A Better Mobile Life” by providing all touch points within the business enough
information to give educated advice to customers and deliver customer delight. This
customer relationship management (CRM) strategy requires that customer-facing staff
know their customers well enough to support a “consultative sell” and that marketing
campaigns use segmentation techniques to target only relevant customers.
Single Views
Previously product-centric, The Carphone Warehouse's move to a customer-centric sales and marketing model presented a challenge. It needed to take fragmented customer
information stored across several product-oriented sales and marketing databases and
create a new database of “single customer views.” These views would then need to be
made available to stores, call centers, and marketing in formats suited to their respective
process requirements.
A significant component of the Single Customer View project would be the migration of
data from three Oracle source databases to a single customer view database (also on Oracle).
Informatica’s Extract, Transform, and Load (ETL) tool would be used in this process. The
Single Customer View database would then support "One," the Customer Dashboard, an at-a-glance resource for call centers and stores, and the E.piphany tool used by marketing for campaigns.
The CRM program manager at The Carphone Warehouse knew that the source data, fields, formats, and content needed to be understood before designing migration mapping rules. He also knew that there could be duplicates and other
data quality issues.
“We knew that we had to identify and resolve any significant data quality issues before
attempting data migration. We knew that trying to discover and address them manually
across some 11 million records would be time-consuming and error-prone. We needed an
automated data quality solution,” he said.
The Carphone Warehouse decided to evaluate data discovery, cleansing, matching, and enhancement solutions. An invitation to tender was issued to six vendors, and the field was quickly narrowed to two finalists. Each was offered a 2 million-record database of
the company’s customer records for discovery and cleansing.
Data Quality out of the Box
The Carphone Warehouse chose Trillium Software primarily because in tests it proved
to be the most effective in data discovery and cleansing, matching, and enhancement. It
supports Informatica and could also easily scale up to their 11 million-record customer
database.
The CRM program manager continued, “We especially liked the Trillium Software toolset
for out-of-the-box functionality. With tight timescales, we needed to deliver big wins
quickly. We were also impressed by sophisticated configuration capabilities that would
later enable us to handle the more complex and less obvious data quality issues.”
The Carphone Warehouse quickly proceeded to profile source data using the Trillium
Software System®. As expected, it discovered a range of issues that needed to be
addressed.
Using “One,” sales consultants in stores and call centers are generating additional revenue from closing more sales and cross-sell opportunities, for example, for insurance products. They are also winning more upsell opportunities such as phone upgrades.
“Beyond the revenue benefits to The Carphone Warehouse, customers experience a
higher level of service from us now,” said the CRM program manager. “With better data
quality, we have a more complete picture of our customers and can target campaigns more
effectively. And when a customer makes direct contact, we know a great deal more about
them. Overall, with the help of Trillium Software, we are fulfilling our mission to deliver
customer delight and a better mobile life.”
The company originally estimated that its investment in the Trillium Software System
would be recouped in 12 months, but the company’s CRM program manager stated it
actually paid for itself in less than 6 months.
10.7 Summary
zz Data warehouses are supposed to provide storage, functionality and responsiveness to queries beyond the capabilities of today’s transaction-oriented databases. Data warehouses are also expected to improve the data access performance of databases.
zz Traditional databases balance the requirement of data access with the need to ensure
integrity of data.
zz In present day organizations, users of data are often completely removed from the data
sources.
zz Many people only need read-access to data, but still need very rapid access to a larger volume of data than can conveniently be downloaded to the desktop.
zz Often such data comes from multiple databases. Because many of the analyses performed
are recurrent and predictable, software vendors and systems support staff have begun to
design systems to support these functions.
zz There is now a clear need to provide decision makers from middle management upward with information at the correct level of detail to support decision-making.
zz Data warehousing, online analytical processing (OLAP) and data mining provide this
functionality.
Books A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.
Alex Berson, Data Warehousing, Data Mining and OLAP, Tata McGraw Hill, 1997.
Alex Berson, Stephen J. Smith, Data Warehousing, Data Mining & OLAP, Tata McGraw Hill Publications, 2004.
Alex Freitas and Simon Lavington, Mining Very Large Databases with Parallel Processing, Kluwer Academic Publishers, 1998.
J. Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1993.
Jiawei Han, Micheline Kamber, Data Mining – Concepts and Techniques, Morgan Kaufmann Publishers, First Edition, 2003.
Matthias Jarke, Maurizio Lenzerini, Yannis Vassiliou, Panos Vassiliadis, Fundamentals of Data Warehouses, Springer.
Michael Berry and Gordon Linoff, Data Mining Techniques (For Marketing, Sales, and Customer Support), John Wiley & Sons, 1997.
Michael J. A. Berry, Gordon S. Linoff, Data Mining Techniques, Wiley Publishing Inc., Second Edition, 2004.
Sam Anahory, Dennis Murray, Data Warehousing in the Real World, Addison Wesley, First Edition, 2000.
Sholom M. Weiss and Nitin Indurkhya, Predictive Data Mining: A Practical Guide, Morgan Kaufmann Publishers, 1998.
Sushmita Mitra, Tinku Acharya, Data Mining – Multimedia, Soft Computing and Bioinformatics, John Wiley & Sons, 2003.
Contents
Objectives
Introduction
11.1 Query Processing: An Overview
11.2 Description and Requirements for Data Warehouse Queries
11.2.1 Understanding Bitmap Filtering
11.2.2 Optimized Bitmap Filtering Requirements
11.3 Writing your own Queries
11.4 Query Processing Techniques
11.4.1 Relational Model
11.4.2 Cost Model
11.4.3 Full Scan Technique
11.4.4 Scanning with Index Techniques
11.4.5 RID Index Technique
11.4.6 BitMap Index Technique
11.4.7 Inverted Partitioned Index Technique
11.5 Comparison of the Query Processing Strategies
11.6 Summary
11.7 Keywords
11.8 Self Assessment
11.9 Review Questions
11.10 Further Readings
Objectives
After studying this unit, you will be able to:
zz Explain query processing
zz Know description and requirements for data warehouse queries
zz Describe query processing techniques
Introduction
The Query Language SQL is one of the main reasons of success of RDBMS. A user just needs
to specify the query in SQL that is close to the English language and does not need to say how
such query is to be evaluated. However, a query needs to be evaluated efficiently by the DBMS.
But how is a query-evaluated efficiently? This unit attempts to answer this question. The unit
Notes covers the basic principles of query evaluation, the cost of query evaluation, the evaluation of
join queries, etc. in detail. It also provides information about query evaluation plans and the role
of storage in query evaluation and optimisation.
In the first step Scanning, Parsing, and Validating is done to translate the query into its internal
form. This is then further translated into relational algebra (an intermediate query form). Parser
checks syntax and verifies relations. The query then is optimised with a query plan, which then
is compiled into a code that can be executed by the database runtime processor.
You can define query evaluation as the query-execution engine taking a query-evaluation plan, executing that plan, and returning the answers to the query. The study of query processing involves the following concepts:
How to measure query costs?
Algorithms for evaluating relational algebraic operations.
How to evaluate a complete expression using algorithms on individual operations?
Optimisation
Example: σsalary<5000 (πsalary (EMP)) is equivalent to πsalary (σsalary<5000 (EMP)).
Each relational algebraic operation can be evaluated using one of the several different algorithms.
Correspondingly, a relational-algebraic expression can be evaluated in many ways.
An expression that specifies a detailed evaluation strategy is known as an evaluation plan. For example, we can use an index on salary to find employees with salary < 5000, or we can perform a complete relation scan and discard employees with salary ≥ 5000. The basis for selecting either scheme is its cost.
Query Optimisation: Amongst all equivalent plans choose the one with the lowest cost. Cost
is estimated using statistical information from the database catalogue, for example, number of
tuples in each relation, size of tuples, etc.
Thus, in query optimisation we find an evaluation plan with the lowest cost. The cost estimation
is made on the basis of heuristic rules.
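The Python sketch below illustrates this cost-based choice for the salary < 5000 selection. The catalogue statistics and the two cost formulas are invented placeholders; a real optimizer uses much richer statistics and cost models.

# Invented catalogue statistics for the EMP relation.
catalog = {"EMP": {"num_tuples": 10_000, "pages": 500,
                   "salary_index_height": 3, "selectivity_lt_5000": 0.02}}

def full_scan_cost(stats):
    # Read every page of the relation once.
    return stats["pages"]

def index_scan_cost(stats):
    # Traverse the index, then fetch one page per matching tuple.
    matching = stats["num_tuples"] * stats["selectivity_lt_5000"]
    return stats["salary_index_height"] + matching

stats = catalog["EMP"]
plans = {"full scan": full_scan_cost(stats), "index scan": index_scan_cost(stats)}
print(plans, "->", min(plans, key=plans.get))   # the cheapest plan is kept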
The bitmap filter compares favorably to the bitmap index. A bitmap index is an alternative form for representing row ID (RID) lists in a value-list index, using one or more bit vectors to indicate which rows in a table contain a certain column value. Both can be very effective in removing unnecessary rows from result processing; however, there are important differences between a bitmap filter and a bitmap index. First, bitmap filters are in-memory structures, thus eliminating any index maintenance overhead due to data manipulation language (DML) operations made to the underlying table. In addition, bitmap filters are very small and, unlike existing on-disk indexes, whose size typically depends on the size of the table on which they are built, they can be created dynamically with minimal impact on query processing time.
Bitmap filtering and optimized bitmap filtering are implemented in the query plan by means of the Bitmap showplan operator. Bitmap filtering is applied only in parallel query plans in which
hash or merge joins are used. Optimized bitmap filtering is applicable only to parallel query
plans in which hash joins are used. In both cases, the bitmap filter is created on the build input
(the dimension table) side of a hash join; however, the actual filtering is typically done within the
Parallelism operator, which is on the probe input (the fact table) side of the hash join. When the
join is based on an integer column, the filter can be applied directly to the initial table or index scan
operation rather than the Parallelism operator. This technique is called in-row optimization.
When bitmap filtering is introduced in the query plan after optimization, query compilation time
is reduced; however, the query plans that the optimizer can consider are limited, and cardinality
and cost estimates are not taken into account.
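The sketch below simulates the essence of bitmap filtering in Python, under the simplifying assumption that the join key is a small integer: the build (dimension) side sets one bit per key, and probe (fact) rows whose bit is unset are discarded before they ever reach the join. All names and data are illustrative; a real engine packs the bits and builds the filter inside the hash join operator.

from array import array

def build_bitmap(dim_keys, size):
    # Set one entry per join key seen on the build (dimension) side.
    bits = array("B", [0] * size)   # one byte per key here; a real filter packs bits
    for k in dim_keys:
        bits[k] = 1
    return bits

def probe(fact_rows, bits, key_col):
    # Drop fact rows whose join key cannot match, before the hash join runs.
    return [r for r in fact_rows if bits[r[key_col]]]

dim_keys  = [2, 5, 7]                                  # keys present in the dimension
fact_rows = [{"pk": i, "dim_fk": i % 10} for i in range(20)]
bits = build_bitmap(dim_keys, size=10)
survivors = probe(fact_rows, bits, "dim_fk")
print(len(fact_rows), "->", len(survivors))            # 20 -> 6 rows reach the join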
Optimized bitmap filters have the following advantages:
1. Filtering from several dimension tables is supported.
2. Multiple filters can be applied to a single operator.
3. Optimized bitmap filters can be applied to more operator types. These include exchange
operators such as the Distribute Streams and Repartition Streams operators, table or index
scan operators, and filter operators.
4. Filtering is applicable to SELECT statements and the read-only operators used in INSERT,
UPDATE, DELETE, and MERGE statements.
5. Filtering is applicable to the creation of indexed views in the operators used to populate the
index.
6. The optimizer uses cardinality and cost estimates to determine if optimized bitmap filtering
is appropriate.
7. The optimizer can consider more plans.
A bitmap filter is useful only if it is selective. The query optimizer determines when an optimized bitmap filter is selective enough to be useful and to which operators the filter is applied. The
optimizer places the optimized bitmap filters on all branches of a star join and uses costing rules to
determine whether the plan provides the smallest estimated execution cost. When the optimized
bitmap filter is nonselective, the cost estimate is usually too high and the plan is discarded. When
considering where to place optimized bitmap filters in the plan, the optimizer looks for hash join
variants such as a right-deep stack of hash joins. Joins with dimension tables are implemented to
execute the likely most selective join first.
The operator in which the optimized bitmap filter is applied contains a bitmap predicate in the
form of PROBE([Opt_Bitmap1001], {[column_name]} [, ‘IN ROW’]). The bitmap predicate reports
on the following information:
1. The bitmap name that corresponds to the name introduced in the Bitmap operator. The
prefix ‘Opt_’ indicates an optimized bitmap filter is used.
2. The column probed against. This is the point from which the filtered data flows through the tree.
3. Whether the bitmap probe uses in-row optimization. When it is, the bitmap probe is
invoked with the IN ROW parameter. Otherwise, this parameter is missing.
Task Describe the main activities associated with the various design steps of a data warehouse.
The performance of these query processing techniques is compared using disk I/O. Finally, we propose a recommendation for Database Management Systems (DBMSs) to select the most cost-effective query processing technique based on the cost model.
Before we can analyze and interpret customers’ buying behaviour, we have to store all the historical information about those customers. This voluminous amount of data is most suitably stored in a data warehouse. Current DBMSs mainly deal with the daily operation of the business, such as Transaction Processing Systems (TPS) involving both reading and writing of data; a data warehouse handles large volumes of read-only information. The volume of data in data warehouses is increasing rapidly. If a query were made over this large volume of data, the response time would, as expected, be poor. The next question that comes to mind is: how can we achieve a reasonable response time? Response time relies heavily on the query processing technique used, which motivates experimenting with new techniques. A new, reliable and faster query processing technique will greatly enhance the performance of data warehousing processes.
Suppose we are working with a relation R that has m tuples {t1, t2, …, tm} and O attributes {a1, a2, …, aO}, where the sizes of the O attributes are {s1, s2, …, sO} and the tuple instances are {i11, i12, …, i1O, …, imO}. The terms relation and table are used interchangeably. Table 11.1 shows the relational model R.

Table 11.1: The Relational Model R

a1     a2     ....   a4     ....
i11    i12    ....   i14    ....
:      :      :      :      :
im1    im2    ....   im4    ....
Assumptions
Let us assume that:
The access time per tuple with 1000 bytes is 0.1 sec.
There are 10,000 tuples in the R relation.
The disk I/O fetches 100 bytes in 0.01 sec.
In a data warehousing environment, the information sources are made up of combined data from
different sources. These data are fed into a central repository from the dispersed client locations
and form the R relation. In such a situation, usually the respective clients are only interested in
their own data sets. Suppose, the particular client wants to display a1 and a4. Given this situation,
how can we achieve a reasonable response time?
Here, we introduce a cost model to measure the performance of query processing techniques. In general, three components make up the response time of a query. Scanning the data at the table level only excludes the access time of the index level, as in the case of the full scan technique; scanning the data at the index level only excludes the access time of the table level, as in the case of the inverted partitioned index. Table 11.2 shows the cost model for the respective query processing techniques. To measure the response time of a query, we supply the query processing technique as a parameter, such as Full Scan, an index technique or the Inverted Partitioned Index; the DBMS can then pick the components required.
Example:
CostModel (Inverted Partitioned Index) = Access Time of Index Level + Instruction
Time (Inverted Partitioned Index)
From Table 11.3, we observe that the instruction time of the BitMap index is the highest because it has more steps than the others. To simplify our calculation, given a query, we assume that the found set of the relation R ranges over 10%, 20%, …, 90%, 100%, and that the number of selected attributes ranges over 1, 2, …, O.
A full scan technique will scan a relation R from the top until the end of the table. The total time
that is required to process a query is the total time taken to read the entire table.
Start Scanning
a1     a2     ....   a4     ....
i11    i12    ....   i14    ....
:      :      :      :      :
im1    im2    ....   im4    ....
End Scanning
The response time of a query using the Full Scan technique:
CostModel(m) = (m * att) + (m * itime(fs))
Where,
m = Total number of tuples in the relation
att = Access time per tuple
itime = Instruction time of an algorithm
fs = Full scan algorithm
Since the full scan technique is unaffected by the size of the found set of the relation and by the number of selected attributes, the response time of a query is constant:
Response time of a query = (10,000 * 0.1) + (10,000 * 0.05) = 1,500 sec. (taking itime(fs) = 0.05 sec. per tuple)
The average response time for 10% to 100% found sets is therefore 1,500 sec.
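A quick check of this figure in Python (the value itime(fs) = 0.05 sec. per tuple is implied by the arithmetic above):

def full_scan_cost(m, att=0.1, itime_fs=0.05):
    # CostModel(m) = (m * att) + (m * itime(fs))
    return m * att + m * itime_fs

print(full_scan_cost(10_000))   # 1500.0 sec., independent of the found set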
Accessing all the tuples in a relation is very costly when the found set is small. Since fetching data at the index level normally costs about one tenth of fetching at the table level, an indexing technique is introduced. For example, fetching data from the table level takes 0.1 sec., while fetching data from the index level takes 0.01 sec. Query processing first processes data at the index level and only then fetches data from the table level. In the following subsections, we discuss the efficiency of the RID index technique and the BitMap index technique in query processing.
Task A database has four transactions. Let min_sup = 60% and min_conf = 80%.

TID    Date        Items bought
T100   10/15/99    {K, A, D, B}
T200   10/15/99    {D, A, C, E, B}
T300   10/19/99    {C, A, B, E}
T400   10/22/99    {B, A, D}

1. Find all frequent itemsets using the FP-growth and Apriori techniques. Compare the efficiency of the two mining processes.
2. List all the strong association rules (with support s and confidence c) matching the following metarule, where X is a variable representing customers and itemi denotes variables representing items (e.g. “A”, “B”):
∀X ∈ transaction, buys(X, item1) ∧ buys(X, item2) ⇒ buys(X, item3) [s, c]
The RID index technique is one of the traditional index techniques used in Transaction Processing Systems (TPS). The RID index creates a list of record identifiers (RIDs) that act as pointers to records in the table. The total time required to process a query is the access time of the index level plus the access time of the selected tuples at the table level.
Example:
|sc| = found set * total number of tuples = 10% * 10,000 = 1000
Let att = 0.1
Let ati = 0.01
Let itime(rid) = 0.07
The response time of a query
= |sc| * (att + ati + itime(rid))
= 1,000 * (0.1 + 0.01 + 0.07) = 180 sec.
If the found set is 20%, it will be 20% * 10,000 * 0.18 = 360 sec. and the response times are as
depicted in Table 11.5.
Table 11.5: The Response Time of the RID Index in Different Found Sets
Based on Table 11.5, the average response time from 10% to 100% found sets is:
(180 + 360 + 540 + 720 + 900 + 1080 + 1260 + 1440 + 1620 + 1800) / 10 = 990 sec.
From Table 11.5, we know that for found sets between 80% and 90%, the response times lie between 1,440 sec. and 1,620 sec. Moreover, the response time of a full scan without an index is 1,500 sec. We can therefore use the cost model to derive the exact break-even percentage of the found set.
The response time of a query = |sc| * (att + ati + itime(rid))
Given: the response time of a query = 1,500 sec., att = 0.1, ati = 0.01 and itime(rid) = 0.07, we get
1,500 = 10,000 * found set * (0.1 + 0.01 + 0.07) = 1,800 * found set
found set = 1,500 / 1,800 = 83.3%
The full scan technique therefore outperforms the RID index technique when the found set is 83.3% or more of the relation.
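The RID cost model and the break-even derivation can be reproduced with a few lines of Python:

def rid_cost(found_set, m=10_000, att=0.1, ati=0.01, itime_rid=0.07):
    # CostModel(sc) = |sc| * (att + ati + itime(rid)), with |sc| = found_set * m
    sc = found_set * m
    return sc * (att + ati + itime_rid)

print(rid_cost(0.10))                    # 180.0 sec.
print(rid_cost(0.20))                    # 360.0 sec.
break_even = 1500 / (10_000 * 0.18)      # full scan time / cost of a 100% found set
print(round(break_even * 100, 1), "%")   # 83.3% found set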
The BitMap index technique saves more space than the RID index technique because it stores bits instead of record identifiers. In this cost model, the total time required to process a query is one eighth of the sum of the index-level access time, the table-level access time for the selected tuples, and the instruction time.
The response time of a query in BitMap is
CostModel(sc) = 1/8 * (|sc| * (att + ati + itime(bm)))
Where,
sc = Selective conditions
|sc| = The total number of found sets in selective conditions
m = Total number of tuples in the relation
att = Access time per tuple
itime = Instruction time of an algorithm
ati = Access time per index
Example:
Let |sc| = found set * total number of tuples = 10% * 10,000 = 1,000
Let att = 0.1
Let ati = 0.01
Let itime(bm) = 1.2
The response time of a query:
= 1/8 * (|sc| * (att + ati + itime(bm))) = 1/8 * (1,000 * (0.1 + 0.01 + 1.2)) ≈ 164 sec.
Based on Table 11.6, the average response time from 10% to 100% found sets is:
(164 + 328 + 492 + 656 + 820 + 984 + 1148 + 1312 + 1476 + 1640) / 10 = 902 sec.
From Table 11.6, we find that for found sets between 90% and 100%, the response times lie between 1,476 sec. and 1,640 sec. Note that the response time of a full scan without an index is 1,500 sec. We can now use the cost model to derive the actual break-even percentage of the found set.
The response time of a query = 1/8 * (|sc| * (att + ati + itime(bm)))
Given: The response time of a query = 1500 sec.
att = 0.1
ati = 0.01
itime = 1.2
total number of tuples = 10,000
Step 1: 1,500 = 1/8 * (10,000 * found set * (0.1 + 0.01 + 1.2))
Step 2: 12,000 = 13,100 * found set
Step 3: found set = 12,000 / 13,100 = 91.6%
From this result, we conclude that the full scan technique outperforms the BitMap index technique when the found set is 91.6% or more of the relation. In that case, DBMSs should use the full scan technique instead of the BitMap index technique.
The Inverted Partitioned Index technique processes data at the index level only. It therefore avoids the time needed to access data at two levels.
The response time of a query in Inverted Partitioned Index is
CostModel(sc) = (|sc| * ati) + (|sc| * itime(ipi))
Where,
sc = Selective conditions
|sc| = The total number of found sets in selective conditions
ati = Access time per index
itime = Instruction time of an algorithm
Example:
Let |sc| = found set * total number of tuples = 10% * 10,000 = 1,000
Let ati = 0.01
Let itime(ipi) = 0.04
The response time of a query = |sc| * (ati + itime(ipi))
= 1,000 * (0.01 + 0.04) = 50 sec.
If the found set is 20%, it will be 2,000 * (0.01 + 0.04) = 100 sec., and the response times are summarised in Table 11.7.
Table 11.7: The Response Time of the Inverted Partitioned Index in Different Found Sets
Based on Table 11.7, the average response time from 10% to 100% found sets is (50 + 100 + 150 + 200 + 250 + 300 + 350 + 400 + 450 + 500) / 10 = 275 sec.
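The following Python sketch puts all four cost models together and recomputes the average response times over the 10% to 100% found sets, reproducing the 1,500, 990, roughly 902 and 275 sec. figures quoted above (the small difference for BitMap arises because the tables round each entry to whole seconds):

M, ATT, ATI = 10_000, 0.1, 0.01   # tuples, access time per tuple, per index entry

def full_scan(fs): return M * ATT + M * 0.05        # independent of the found set fs
def rid(fs):       return fs * M * (ATT + ATI + 0.07)
def bitmap(fs):    return (fs * M * (ATT + ATI + 1.2)) / 8
def ipi(fs):       return fs * M * (ATI + 0.04)     # index level only

found_sets = [i / 10 for i in range(1, 11)]         # 10%, 20%, ..., 100%
for name, cost in [("full scan", full_scan), ("RID", rid),
                   ("BitMap", bitmap), ("IPI", ipi)]:
    avg = sum(cost(fs) for fs in found_sets) / len(found_sets)
    print(f"{name:10s} average = {avg:7.1f} sec.")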
Based on Table 11.8, we observe that the full scan technique and the scanning-with-index techniques are not good when only a small number of attributes is selected.
Case Study Oki Data Americas, Inc.
Better Data Extends Value
How can you extend the value of a new software implementation? Follow the example
of Oki Data Americas, Inc. A year after the company implemented SAP R/3 business
software, it realized low data quality was limiting the value of its new data solution. Oki
Data used the Trillium Software System® to consolidate data from legacy systems within
its SAP database, increasing both the value of its data solution and the resolution of its
customer view.
Oki Data needed to create a single, comprehensive data source by:
1. Identifying and deduplicating customer records in its SAP and legacy systems.
2. Identifying unique customer records in the legacy systems and loading them into the
SAP database.
3. Consolidating legacy systems and SAP data to ensure completeness of customer
records.
Manually consolidating the data was an unappealing prospect. Oki Data would have
had to match records and make data conversions manually. In addition to being time-
consuming, manual consolidation would provide virtually no improvement in the quality
of name and address data.
It became clear to Oki Data that it needed a data quality solution that could instill higher
quality in diverse data from across the enterprise.
Three Months from Concept to Remedy
Oki Data quickly identified Trillium Software as its data quality solution provider. The
Trillium Software System met more of Oki Data’s 30 solution criteria than other solution
candidates.
Oki Data needed a data quality solution that functioned within a heterogeneous
environment. The platform-independent architecture of the Trillium Software System
ensured compatibility across all past and present systems.
The Trillium Software solution cleansed, standardized, deduplicated and matched Oki
Data’s US and Canadian customer data. Oki Data particularly liked the software’s ability
to correct (geocode) addresses according to US postal directories. During the evaluation
process, Trillium Software was the only vendor to provide Oki Data with a knowledge
transfer that demonstrated the process and results of its data quality process using Oki
Data’s unique information.
Breadth of functionality was another area in which the Trillium Software System excelled.
The solution let Oki Data standardize and correct postal records for US and Canadian
customers. Moreover, it could process business data—such as marketing codes—included
in customer records.
The systems analyst at Oki Data said, “It is clear that data quality is an important issue,
and the Trillium Software System is a good tool for helping us ensure quality data. On-site
training and consulting that came with the software solution resulted in faster solution
implementation.”
Oki Data’s color and monochrome LED page printers are the hallmark of its technology
leadership. Providing results comparable to a laser printer’s, Oki Data’s LED printers are
faster and have fewer moving parts. Over 29 years customers have relied on Oki Data’s
proven, reliable products. Oki Data’s five-year warranty on its LED printhead is a testament
to its ability to meet these customer expectations. Oki Data has created a unified customer
view.
Rapid ROI
Oki Data has realized substantial value across the business from its investment in the Trillium Software System. ROI indicators include:
1. Fewer shipping and mailing errors due to verified and corrected addresses
2. Less time spent by users correcting data errors
3. Improved customer service based on more complete customer views
A Bright Future
Oki Data’s process has underscored the importance of data quality and impelled the
company to consider data quality as an enterprise-wide issue. The Trillium Software
System’s enterprise-wide compatibility and real-time online processing capabilities make it
a candidate for further initiatives within Oki Data. One such initiative is online data quality
processing in conjunction with Oki Data’s e-commerce website.
The data analyst noted, “We keep finding areas beyond the scope of the original project
where a Trillium Software application will solve a new and evolving business need. Going
forward, we see other potential uses of the Trillium Software System to help us better our data quality.”
11.6 Summary
zz In this unit you have studied query processing and evaluation.
zz Query processing in a DBMS is a very important operation and needs to be efficient.
zz Query processing involves query parsing, representing query in alternative forms, finding
the best plan of evaluation of a query and then actually evaluating it.
zz The major query evaluation cost is the disk access time.
11.7 Keywords
Index Scan: Search algorithms that use an index are restricted because the selection condition
must be on the search-key of the index.
Indexing: A database index is a data structure that improves the speed of operations on a database
table.
Join: The join operation is considered the real power behind relational database implementations.
Query Cost: Cost is generally measured as total elapsed time for answering the query.
8. Distinguish between the full scan technique and the RID index technique.
9. How will you calculate cost for simple hash-join? Explain.
10. Explain “A bitmap filter is useful only if it is selective”.
1. (b)
2. (a)
3. query
4. search criteria
5. Binary search
6. Primary index-scan for equality
7. Query optimizers
8. Bitmap filtering
9. read only
10. Transaction Processing System (TPS)
Books A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.
Alex Berson, Data Warehousing, Data Mining and OLAP, Tata McGraw Hill, 1997.
Alex Berson, Stephen J. Smith, Data Warehousing, Data Mining & OLAP, Tata McGraw Hill Publications, 2004.
Alex Freitas and Simon Lavington, Mining Very Large Databases with Parallel Processing, Kluwer Academic Publishers, 1998.
J. Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1993.
Jiawei Han, Micheline Kamber, Data Mining – Concepts and Techniques, Morgan Kaufmann Publishers, First Edition, 2003.
Matthias Jarke, Maurizio Lenzerini, Yannis Vassiliou, Panos Vassiliadis, Fundamentals of Data Warehouses, Springer.
Michael Berry and Gordon Linoff, Data Mining Techniques (For Marketing, Sales, and Customer Support), John Wiley & Sons, 1997.
Michael J. A. Berry, Gordon S. Linoff, Data Mining Techniques, Wiley Publishing Inc., Second Edition, 2004.
Sam Anahory, Dennis Murray, Data Warehousing in the Real World, Addison Wesley, First Edition, 2000.
Sholom M. Weiss and Nitin Indurkhya, Predictive Data Mining: A Practical Guide, Morgan Kaufmann Publishers, 1998.
Sushmita Mitra, Tinku Acharya, Data Mining – Multimedia, Soft Computing and Bioinformatics, John Wiley & Sons, 2003.
Usama Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurasamy, Advances in Knowledge Discovery and Data Mining, AAAI Press/The MIT Press, 1996.
V. Cherkassky and F. Mulier, Learning From Data, John Wiley & Sons, 1998.
Contents
Objectives
Introduction
12.1 Metadata and Warehouse Quality
12.1.1 Business Quality
12.1.2 Information Quality
12.1.3 Technical Quality
12.2 Metadata Management in Data Warehousing
12.2.1 Benefits of Integrated Meta Data
12.2.2 Improved Productivity
12.2.3 Meta Data Sharing and Interchange
12.2.4 Why two Tools for Meta Data Management?
12.2.5 The Meta Data Hub
12.3 A Repository Model for the DWQ Framework
12.3.1 Quality Meta Model
12.3.2 A Quality-oriented Data Warehouse Process Model
12.4 Defining Data Warehouse Quality
12.5 Summary
12.6 Keywords
12.7 Self Assessment
12.8 Review Questions
12.9 Further Readings
Objectives
After studying this unit, you will be able to:
zz Describe metadata and warehouse quality
zz Know metadata management in data warehouse practice
zz Explain repository model for the DWQ framework
zz Define data warehouse quality
Introduction
Many researchers and practitioners share the understanding that a data warehouse (DW)
architecture can be formally understood as layers of materialized views on top of each other.
A data warehouse architecture exhibits various layers of data in which data from one layer are
derived from data of the lower layer. Data sources, also called operational databases, form the lowest layer. They may consist of structured data stored in open database systems and legacy systems, or of unstructured or semi-structured data stored in files.
Directly related to economic success, business quality is the ability of the data warehouse to
provide information to those who need it, in order to have a positive impact on the business.
Business quality is made up of business drivers, or concepts that point out a company’s strategic
plans. So organizations should be concerned with how well the data warehouse helps accomplish
these drivers, such as changing economic factors, environmental concerns, and government
regulation.
Does the data warehouse align with business strategy, and how well does it support the process
of strengthening core competencies and improving competitive position? What about the
enablement of business tactics? Does the data warehouse play a tactical role, so that it makes a
positive day-to-day difference?
Information doesn’t have value if it’s not used. Therefore to have information quality, the focus
should be on the integration of information into the fabric of business processes, not on data
quality itself.
Information quality is the key to political success, which was described above as people actually
using the data warehouse. “Some companies don’t tell users it’s there,” Thomann says, “so they
may not understand the warehouse or know how to use it.” Success in this area means providing
awareness, access tools, and the knowledge and skills to use what they’re given. For example,
could your users easily make the shift from using greenbar reports to a multidimensional data
model? Then, assuming they understand the warehouse, can they get to the data easily? Who
gets the data? How frequently? How and when is the data used? You may be able to provide 24x7
access, but what if users are at home?
Wells and Thomann believe that information quality also encompasses data quality and
performance. This can be a problem area, because everyone is happy if they can get their data
overnight. Then they want it in half a day. Then immediately. So these expectations must be
managed.
Technical quality is the ability of the data warehouse to satisfy users’ dynamic information needs.
Wells describes four important technical quality factors. The first is “reach,” or whether the data
warehouse can be used by those who are best served by its existence. In today’s information-
dependent business climate, organizations need to reach beyond the narrow and typical customer
base of suppliers, customers, and a few managers.
“Range” is also important. As its name implies, this defines the range of services provided by the data warehouse. In general, these answer “What data do I have?” and “Can I get the data?” For example, Web enablement, such as Hotmail, is a service that allows users to get information from wherever they are.
“Maneuverability” is the ability of the data warehouse to respond to changes in the business environment. The data warehouse doesn’t remain stable, so maneuverability becomes particularly important. It is also the single most important factor not given attention in data warehousing today, according to Wells. Maneuverability sub-factors include managing:
1. Users and their expectations,
2. Upper management,
3. The overall business,
4. Technology,
5. Data sources, and
6. Technical platform.
Finally, “capability” is an organization’s technical capability to build, operate, maintain, and use
a data warehouse.
To date in data warehousing most organizations have avoided the issue of meta data management
and integration. Many companies, however, are now beginning to realize the importance of meta
data in decision processing and to understand that the meta data integration problem cannot be
ignored. There are two reasons for this:
1. The use of data warehousing and decision processing often involves a wide range of
different products, and creating and maintaining the meta data for these products is time-consuming and error-prone. The same piece of meta data (a relational table definition, for example) may have to be defined in several products. This is not only cumbersome, but
also makes the job of keeping this meta data consistent and up to date difficult. Also, it is
often important in a data warehousing system to track how this meta data changes over
time. Automating the meta data management process and enabling the sharing of this so-
called technical meta data between products can reduce both costs and errors.
2. Business users need to have a good understanding of what information exists in a data
warehouse. They need to understand what the information means from a business
viewpoint, how it was derived, from what source systems it comes, when it was created,
what pre-built reports and analyses exist for manipulating the information, and so forth.
They also may want to subscribe to reports and analyses and have them run, and the
results delivered to them, on a regular basis. Easy access to this business meta data enables
business users to exploit the value of the information in a data warehouse. Certain types of business meta data can also aid the technical staff; examples include the use of a common business model for discussing information requirements with business users, and access to existing business intelligence tool business views for analyzing the impact of warehouse design changes.
The benefits of managing data warehouse technical meta data are similar to those obtained by managing meta data in a transaction processing environment: improved developer productivity.
Integrated and consistent technical meta data creates a more efficient development environment
for the technical staff who are responsible for building and maintaining decision processing
systems. One additional benefit in the data warehousing environment is the ability to track how
meta data changes over time. The benefits obtained by managing business meta data, on the
other hand, are unique to a decision processing environment and are key to exploiting the value
of a data warehouse once it has been put into production.
Figure 12.1 shows the flow of meta data through a decision processing system as it moves from source systems, through extract, transform and load (ETL) tools, to the data warehouse, where it is used by business intelligence (BI) tools and analytic applications. This flow can be thought of as
a meta data value chain. The further along the chain you go, the more business value there is in
the meta data. However, business value depends on the integrity of the meta data in the value
chain. As meta data is distributed across multiple sources in the value chain, integrity can only
be maintained if this distributed meta data is based on a common set of source meta data that
is current, complete and accurate. This common set of source meta data is often called the meta
data system of record. Another important aspect of the value chain is that business users need to
be able to follow the chain backward from the results of decision processing to the initial source
of the data on which the results are based.
There are two items in Figure 12.1(a) that have not been discussed so far. The decision processing
operations box in the diagram represents the meta data used in managing the operation of a
decision processing system. This includes meta data for tracking extract jobs, business user access
to the system, and so forth. The common business model box represents the business information
requirements of the organization. This model provides a high-level view, by business subject
area, of the information in the warehousing system. It is used to provide common understanding
and naming of information across warehouse projects.
Past attempts by vendors at providing tools for the sharing and interchange of meta data have
involved placing the meta data in a central meta data store or repository, providing import/
export utilities and programmatic APIs to this store and creating a common set of meta-models for
describing the meta data in the store. In the transaction processing environment, this centralized
approach has had mixed success, and there have been many failures. In the decision processing
marketplace, vendors are employing a variety of centralized and distributed approaches for meta
data management. The techniques used fall into one of three categories:
1. Meta data repositories for meta data sharing and interchange
2. Meta data interchange “standards” defined by vendor coalitions
3. Vendor specific “open” product APIs for meta data interchange.
Given that multiple decision processing meta data approaches and “standards” are likely to
prevail, we will, for the foreseeable future, be faced with managing multiple meta data stores, even
if those stores are likely to become more open. The industry trend toward building distributed
environments involving so-called federated data warehouses, consisting of an enterprise
warehouse and/or multiple data marts, will also encourage the creation of multiple meta data
stores. The only real solution to meta data management is to provide a facility for managing the
flow of meta data between different meta data stores and decision processing products. This
capability is provided using a meta data hub for the technical staff and a business information directory for business users (see Figure 12.1(b)).
Figure 12.1(b)
Task Discuss the role of the meta data repository in a data warehouse. How does it differ from the catalog of a relational DBMS?
The proposed metamodel (i.e. the topmost layer in Figure 12.2) provides a notation for data
warehouse generic entities, such as schema or agent, including the business perspective. Each
box shown in Figure 12.2 is decomposed into more detailed data warehouse objects in the
metamodel. This metamodel is instantiated with the metadata of the data warehouse (i.e. the second layer in Figure 12.2), e.g. relational schema definitions or the description of the conceptual data warehouse model. The lowest layer in Figure 12.2 represents the real world where the actual
data reside: in this level the metadata are instantiated with data instances, e.g. the tuples of a
relation or the objects of the real world which are represented by the entities of the conceptual
model.
Each object in the three levels and perspectives of the architectural framework can be subject to
quality measurement. Since quality management plays an important role in data warehouses,
we have incorporated it into our metamodeling approach. Thus, the quality model is part of the
metadata repository, and quality information is explicitly linked with architectural objects. This
way, stakeholders can represent their quality goals explicitly in the metadata repository, while,
at the same time, the relationship between the measurable architecture objects and the quality
values is retained.
The DWQ quality metamodel is based on the Goal-Question-Metric (GQM) approach, originally developed for software quality management. In GQM, high-level user requirements are modeled as goals. Quality metrics are values which express some measured property of an object. The relationship between goals and metrics is established through quality questions.
The main differences in our approach reside in the following points:
1. A clear distinction between subjective quality goals requested by stakeholders and objective quality factors attached to data warehouse objects.
2. Quality goal resolution is based on the evaluation of the composing quality factors, each
corresponding to a given quality question.
3. Quality questions are implemented and executed as quality queries on the semantically
rich metadata repository.
Figure 12.3 shows the DWQ Quality Model. The class “ObjectType” refers to any meta-object of the DWQ framework depicted in the first layer of Figure 12.3. A quality goal is an abstract requirement, defined on an object type, and documented by a purpose and the stakeholder interested in it. A quality goal roughly expresses natural-language requirements like “improve the availability of source s1 until the end of the month in the viewpoint of the DW administrator”.
Quality dimensions (e.g. “availability”) are used to classify quality goals and factors into different
categories. Furthermore, quality dimensions are used as a vocabulary to define quality factors
and goals; yet each stakeholder might have a different vocabulary and different preferences in
the quality dimensions. Moreover, a quality goal is operationally defined by a set of questions to
which quality factor values are provided as possible answers. As a result of the goal evaluation
process, a set of improvements (e.g. design decisions) can be proposed, in order to achieve the
expected quality. A quality factor represents an actual measurement of a quality value, i.e. it
relates quality values to measurable objects. A quality factor is a special property or characteristic
of the related object with respect to a quality dimension. It also represents the expected range of
the quality value, which may be any subset of a quality domain. Dependencies between quality
factors are also stored in the repository. Finally, the method of measurement is attached to the
quality factor through a measuring agent.
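To fix the terminology, the sketch below renders the goal/factor distinction as two Python data classes. The class and field names are invented for illustration; in DWQ itself these objects live in a ConceptBase/Telos metadata repository, not in Python.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class QualityFactor:
    # Objective: an actual measurement attached to a data warehouse object.
    dimension: str                    # e.g. "availability"
    measured_object: str              # e.g. "source s1"
    achieved: float                   # the measured value
    expected: Tuple[float, float]     # expected range within the quality domain

@dataclass
class QualityGoal:
    # Subjective: a stakeholder requirement, classified by a quality dimension.
    purpose: str
    stakeholder: str
    dimension: str
    questions: List[str] = field(default_factory=list)

goal = QualityGoal("improve the availability of source s1",
                   "DW administrator", "availability",
                   ["what availability has s1 achieved this month?"])
factor = QualityFactor("availability", "source s1", 0.92, (0.95, 1.0))

# Goal evaluation: the question is answered by comparing the factor's
# achieved value against its expected range.
print(factor.expected[0] <= factor.achieved <= factor.expected[1])   # False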
The quality meta-model is not instantiated directly with concrete quality factors and goals; it is instantiated with patterns for quality factors and goals. The use of this intermediate instantiation level enables data warehouse stakeholders to define templates of quality goals and factors. For
example, suppose that the analysis phase of a data warehouse project has detected that the
availability of the source database is critical to ensure that the daily online transaction processing
is not affected by the loading process of the data warehouse. A source administrator might later
instantiate this template of a quality goal with the expected availability of his specific source
database. Thus, the programmers of the data warehouse loading programs know the time
window of the update process.
Based on the meta-model for data warehouse architectures, we have developed a set of quality factor templates which can be used as an initial set for data warehouse quality management. The methodology is an adaptation of the Total Quality Management approach and consists of the
following steps:
1. Design of object types, quality factors and goals,
2. Evaluation of the quality factors,
3. Analysis of the quality goals and factors and their possible improvements
4. Re-evaluation of a quality goal due to the evolution of data warehouse.
As described in the previous section, it is important that all relevant aspects of a data warehouse are represented in the repository. Yet the architecture and quality model described does not represent the workflow which is necessary to build and run a data warehouse, e.g. to integrate data sources or to refresh the data warehouse incrementally. Therefore, we have added a data warehouse process model to our meta modeling framework. Our goal is to have a simple process model which captures the most important issues of data warehouses, rather than building a huge construction which is difficult to understand and not very useful due to its complexity.
Figure 12.4 shows the meta model for data warehouse processes. A data warehouse process is composed of several processes or process steps which may be further decomposed. Process steps and the processes themselves are executed in a specific order, which is described by the “next” relation between processes. A process works on an object type, e.g. data loading works on a source data
store and a data warehouse data store. The process itself must be executed by some object type,
usually an agent which is represented in the physical perspective of the architecture model. The
result of a process is some value of a domain; the execution of further processes may depend on
this value. For example, the data loading process returns as a result a boolean value representing
the completion value of the process, i.e. if it was successful or not. Further process steps like data
cleaning are only executed if the previous loading process was successful. The process is linked
to a stakeholder which controls or has initiated the process. Moreover, the result of a process is
the data which is produced as an outcome of the process, e.g. the tuples of a relation.
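A minimal sketch of this process meta model in Python, with invented names, shows a chain of process steps in which each step works on an object type, is executed by an agent, and gates the execution of the next step on its result:

from dataclasses import dataclass
from typing import Optional

@dataclass
class ProcessStep:
    name: str
    works_on: str                     # the object type, e.g. a data store
    executed_by: str                  # an agent from the physical perspective
    next_step: Optional["ProcessStep"] = None

def run(step: Optional[ProcessStep]) -> None:
    # Each step returns a result; further steps run only if it succeeded.
    while step is not None:
        result = True                 # stands in for the real outcome of the step
        print(f"{step.executed_by} runs '{step.name}' on {step.works_on}")
        if not result:
            break
        step = step.next_step

cleaning = ProcessStep("data cleaning", "DW data store", "cleaning agent")
loading = ProcessStep("data loading", "source data store", "loading agent",
                      next_step=cleaning)
run(loading)   # cleaning executes only after a successful load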
Processes affect a quality factor of an object type, e.g. the availability of data source or the
accuracy of a data store. It might be useful to store also the expected effect on the quality factor,
i.e. if the process improves or decreases the quality factor. However, the achieved effect on the
quality factor can only be determined by a new measurement of this factor. A query on the
metadata repository can then search for the processes which have improved the quality of a
certain object.
The processes can be subject to quality measurement, too. Yet, the quality of a process is usually
determined by the quality of its output. Therefore, we do not go into detail about process quality, but quality factors can be attached to processes as well.
As an example for a data warehouse process we have partially modeled the data warehouse
loading process in Figure 12.5. The loading process is composed of several steps, of which one
in our example is data cleaning. The data cleaning process step works on a data store, where
the data which have to be cleaned reside. It is executed by some data cleaning agent. It affects
among others the quality factors accuracy and availability, in the sense that accuracy is hopefully
improved and availability is decreased because of locks due to read-write operations on the data
store. The data cleaning process may also store some results of its execution in the metadata
repository, for example, a boolean value to represent the successful completion of the process
and the number of changed tuples in the data store.
The information stored in the repository may be used to find deficiencies in data warehouse.
To show the usefulness of this information we use the following query. It returns all data
cleaning processes which have decreased the availability of a data store according to the stored
measurements. The significance of the query is that it can show that the implementation of data
cleaning process has become inefficient.
GenericQueryClass DecreasedAvailability
isA DWCleaningProcess with
  parameter
    ds : DataStore
  constraint
    c : $ exists qf1,qf2/DataStoreAvailability
        t1,t2,t3/TransactionTime v1,v2/Integer
        (qf1 onObject ds) and (qf2 onObject ds) and
        (this worksOn ds) and (this executedOn t3) and
        (qf1 when t1) and (qf2 when t2) and (t1 < t2) and
        (t1 < t3) and (t3 < t2) and (qf1 achieved v1) and
        (qf2 achieved v2) and (v1 > v2) $
end
The query has a data store as parameter, i.e. the query will return only cleaning processes which
are related to the specified data store. The query returns the processes which have worked on the
specified data store and which were executed between the measurements of quality factors qf1
and qf2, and the measured value of the newer quality factor is lower than the value of the older
quality factor. The query can be formulated in a more generic way to deal with all types of data
warehouse processes but for reasons of simplicity and understandability, we have shown this
more special variant.
Finally, Figure 12.6 shows the trace of a process at the instance level. The process pattern for
DWLoading has been instantiated with a real process, which has been executed on the specified
date “April 15, 1999”. An instantiation of the links to the quality factors is not necessary, because
the information that “data cleaning” affects the accuracy and the availability of a data store is
already recorded in the process pattern shown in Figure 12.5.
Task Discuss the architecture of a data warehouse with a neat diagram and explain each component’s functionality in detail.
1. Completeness: Is all the requisite information available? Are some data values missing, or in an unusable state?
2. Consistency: Do distinct occurrences of the same data instances agree with each other, or do they provide conflicting information? Are values consistent across data sets?
3. Validity: Refers to the correctness and reasonableness of data.
4. Conformity: Are there expectations that data values conform to specified formats? If so, do all the values conform to those formats? Maintaining conformance to specific formats is important.
5. Accuracy: Do data objects accurately represent the “real world” values they are expected
to model? Incorrect spellings of product or person names, addresses, and even untimely or
not current data can impact operational and analytical applications.
6. Integrity: What data is missing important relationship linkages? The inability to link
related records together may actually introduce duplication across your systems.
Data Warehousing
Data warehouses are one of the foundations of the Decision Support Systems of many IS
operations. As defined by the “father of the data warehouse”, William H. Inmon, a data warehouse
is “a collection of Integrated, Subject-Oriented, Non Volatile and Time Variant databases where
each unit of data is specific to some period of time. Data Warehouses can contain detailed data,
lightly summarized data and highly summarized data, all formatted for analysis and decision
support” (Inmon, 1996). In the “Data Warehouse Toolkit”, Ralph Kimball gives a more concise
definition: “a copy of transaction data specifically structured for query and analysis” (Kimball,
1998). Both definitions stress the data warehouse’s analysis focus, and highlight the historical
nature of the data found in a data warehouse.
The purpose here is to formulate a descriptive taxonomy of the issues that arise at all stages of data warehousing. The phases are:
1. Data Source
2. Data Integration and Data Profiling
3. Data Staging and ETL
4. Database Scheme (Modeling)
Quality of data can be compromised depending upon how data is received, entered, integrated,
maintained, processed (Extracted, Transformed and Cleansed) and loaded. Data is impacted by
numerous processes that bring data into your data environment, most of which affect its quality
to some extent. All these phases of data warehousing are responsible for data quality in the data
warehouse. Despite all the efforts, there still exists a certain percentage of dirty data. This residual
dirty data should be reported, stating the reasons for the failure in data cleansing for the same.
Data quality problems can occur in many different ways. The most common include:
1. Poor data handling procedures and processes.
2. Failure to adhere to data entry and maintenance procedures.
3. Errors in the migration process from one system to another.
4. External and third-party data that may not fit with your company data standards or may otherwise be of uncertain quality.
The assumption is that data quality issues can arise at any stage of data warehousing, viz. in data sources, in data integration and profiling, in data staging, in ETL and in database modeling. The following model depicts the stages which are vulnerable to data quality problems.
Case Study Trans Union
The Company
Trans Union is a leading information and service provider for direct response marketers.
With a national name and address list of over 160 million consumers, Trans Union has one
of the most comprehensive data files in the business.
Trans Union understands that organizations need a broad range of information about
customers in order to develop one-to-one marketing relationships with prospects. Direct
response marketers from a variety of professions rely on the accuracy, quantity and
effectiveness of Trans Union products and services.
Competition in the field of consumer information resources is very intense and drives
marketing data providers to develop consumer lists faster and with better, deeper and
more accurate data. Trans Union recently made a push to reengineer its existing data to
find ways of creating new, salient products for its customers.
In addition, Trans Union was able to identify vehicles by type, classify each vehicle in
a standardized format, and create and append new vehicle classifications to the original
records.
“Training and learning the Trillium Software System occurred much more easily and
quickly than we had imagined. We were able to set up the tables and learn how to tune
them within one week, and that allowed us to go live immediately,” said the team leader.
Because the implementation was so quick and easy, Trans Union is now considering
adding more categories to the mix of products being mined from its consumer database.
The company has demonstrated once again that clean data provides a more accurate view
of consumers and delivers a more valuable product for its clients.
12.5 Summary
zz DWQ provides the framework and modeling formalisms for DW design, development
and maintenance tools that can be tuned to particular application domains and levels of
quality.
zz The potential of such an approach has been demonstrated already by successful commercial
usage of the ConceptBase metamodeling tool developed in the COMPULOG project.
zz DW development time for a given level of quality will be significantly reduced and
adaptation to changing user demands will be facilitated.
zz There is a high demand for design tools for distributed databases which is not satisfied by
current products.
zz DWQ has the potential to satisfy this demand for an increasingly important market
segment.
zz In addition, a host of quality-enhanced query and update services can be derived from
DWQ results.
zz Prototypes of specific models and tools developed in the project will be experimented
with in various industry and public administration settings, in order to gain reference
experiences for industrial uptake of project results.
12.6 Keywords
Data Warehouse Quality: In the DWQ (Data Warehouse Quality) project we have advocated
the need for enriched metadata facilities for the exploitation of the knowledge collected in a data
warehouse.
Information Quality: Information quality is the key to political success.
Technical Quality: Technical quality is the ability of the data warehouse to satisfy users’ dynamic
information needs.
Metadata Hub: The metadata hub is used for managing the interchange and sharing of
technical metadata between decision processing products.
7. “The quality meta-model is not instantiated directly with concrete quality factors and
goals, it is instantiated with patterns for quality factors and goals.” Explain.
8. Explain data warehouse quality in detail.
9. “A quality factor is a special property or characteristic of the related object with respect to
a quality dimension.” Discuss.
10. Does the data warehouse play a tactical role, so that it makes a positive day-to-day
difference? Explain.
Answers: Self Assessment
1. (a)
2. (b)
3. (c)
4. metadata hub
5. Quality metrics
6. natural language requirements
7. quality factor
8. data cleaning
9. loading process
10. query
Books A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.
Alex Berson, Data Warehousing, Data Mining and OLAP, Tata McGraw Hill, 1997.
Alex Berson, Stephen J. Smith, Data Warehousing, Data Mining & OLAP, Tata
McGraw Hill Publications, 2004.
Alex Freitas and Simon Lavington, Mining Very Large Databases with Parallel
Processing, Kluwer Academic Publishers, 1998.
J. Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers,
1993.
Jiawei Han, Micheline Kamber, Data Mining – Concepts and Techniques, Morgan
Kaufmann Publishers, First Edition, 2003.
Matthias Jarke, Maurizio Lenzerini, Yannis Vassiliou, Panos Vassiliadis,
Fundamentals of Data Warehouses, Springer.
Michael Berry and Gordon Linoff, Data Mining Techniques (For Marketing, Sales,
and Customer Support), John Wiley & Sons, 1997.
Michael J. A. Berry, Gordon S Linoff, Data Mining Techniques, Wiley Publishing
Inc, Second Edition, 2004.
Sam Anohory, Dennis Murray, Data Warehousing in the Real World, Addison
Wesley, First Edition, 2000.
Sholom M. Weiss and Nitin Indurkhya, “Predictive Data Mining: A Practical Guide”,
Morgan Kaufmann Publishers, 1998.
Sushmita Mitra, Tinku Acharya, Data Mining – Multimedia, Soft Computing and
Bioinformatics, John Wiley & Sons, 2003.
Contents
Objectives
Introduction
13.1 Representing and Analyzing Data Warehouse Quality
13.1.1 Data Warehouse Structure
13.1.2 Importing Data to the Data Warehouse
13.1.3 Preparing Data for Analysis with OLAP Server
13.1.4 Analyzing your Data
13.2 Quality Analysis in Data Staging
13.2.1 The Data Staging Process
13.2.2 Pros and Cons of Data Staging
13.3 Summary
13.4 Keywords
13.5 Self Assessment
13.6 Review Questions
13.7 Further Readings
Objectives
After studying this unit, you will be able to:
zz Represent and analyse data warehouse quality
zz Know quality analysis in data staging
Introduction
Data warehouses are complex systems consisting of many components which store highly
aggregated data for decision support. Due to the role of the data warehouses in the daily business
work of an enterprise, the requirements for the design and the implementation are dynamic
and subjective. Therefore, data warehouse design is a continuous process which has to reflect the
changing environment of a data warehouse, i.e. the data warehouse must evolve in reaction to the
enterprise’s evolution. Based on existing meta models for the architecture and quality of a data
warehouse, we propose a data warehouse process model to capture the dynamics of
a data warehouse. The evolution of a data warehouse is represented as a special process and the
evolution operators are linked to the corresponding architecture components and quality factors
they affect. We show the application of our model on schema evolution in data warehouses and
its consequences on data warehouse views. The models have been implemented in the metadata
repository ConceptBase, which can be used to analyze the results of evolution operations and to
monitor the quality of a data warehouse.
Data quality (DQ) is an extremely important issue since quality determines the data’s usefulness
as well as the quality of the decisions based on the data. It has the following dimensions: accuracy,
accessibility, relevance, timeliness, and completeness. Data are frequently found to be inaccurate,
incomplete, or ambiguous, particularly in large, centralized databases. The economic and social
damage from poor-quality data has been calculated to cost organizations billions of
dollars; data quality is the cornerstone of effective business intelligence.
Interest in data quality goes back generations. For example, according to Hasan (2002),
treatment of numerical data for quality can be traced to the year 1881. An example of typical data
problems, their causes, and possible solutions is provided in Table 13.1.
Strong et al., (1997) conducted extensive research on data quality problems. Some of the problems
identified are technical ones such as capacity, while others relate to potential computer crimes.
The researchers divided these problems into the following four categories and dimensions.
1. Intrinsic DQ: Accuracy, objectivity, believability, and reputation
2. Accessibility DQ: Accessibility and access security
3. Contextual DQ: Relevancy, value added, timeliness, completeness and amount of data.
4. Representation DQ: Interpretability, ease of understanding, concise representation and
consistent representation.
Although business executives recognize the importance of having high-quality data, they discover
that numerous organizational and technical issues make it difficult to reach this objective.
For example, data ownership issues arise from the lack of policies defining responsibility and
accountability in managing data. Inconsistent data-quality requirements of various standalone
applications create an additional set of problems as organizations try to combine individual
applications into integrated enterprise systems. Interorganizational information systems add
a new level of complexity to managing data quality. Companies must resolve the issues of
administrative authority to ensure that each partner complies with the data-quality standards.
The tendency to delegate data-quality responsibilities to the technical teams, as opposed to
business users, is another common pitfall that stands in the way of high-quality data.
Different categories of data quality are proposed by Brauer (2001). They are: standardization (for
consistency), matching (of data if stored in different places), verification (against the source),
and enhancement (adding of data to increase its usefulness). Whichever system is used, once the
major variables and relationships in each category are identified, an attempt can be made to find
out how to better manage the data.
An area of increasing importance is the quality of data that are processed very fast in real time.
Many decisions are being made today in such an environment.
Physical Store
The physical store for the Data Warehouse includes one database that you can query using SQL
queries. The physical store contains all the data that you have imported from different sources.
Commerce Server automatically builds the physical store for the Data Warehouse in both the SQL
Server database and in the OLAP database. The Data Warehouse provides the data necessary for
all the Commerce Server reports available in the Analysis modules in Business Desk.
There is no need for you to directly modify the physical store for the Data Warehouse. If you need
to extend the Data Warehouse, for example, to encompass third-party data, a site developer can
programmatically add the fields you need through the logical schema.
Logical Schema
The logical schema provides an understandable view of the data in the Data Warehouse, and
supports an efficient import process. For example, a site developer uses the logical schema to
modify the location of data stored in the underlying physical tables. When a site developer writes
code to add, update, or delete data in the Data Warehouse, the developer interacts with the
logical schema. When Commerce Server accesses data in the Data Warehouse, it accesses the
data through the logical schema. Only the site developer needs detailed knowledge of the logical
schema.
A logical schema includes the following:
1. Class: A logical collection of data members. For example, the RegisteredUser class contains
data members describing a registered user.
2. Data member: A structure that stores a piece of data. For example, the E-mail data member
of the RegisteredUser class stores the e-mail address for a registered user.
3. Relation: A connection between two classes in a parent-child relationship. This relationship
defines the number of instances of each class, and it provides the mechanism for sharing
data members between classes. For example, RegisteredUser is a parent to the child class
Request. There can be many requests for one registered user.
The logical schema uses classes, data members, relations, and other data structures to map data
in the physical store.
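As a rough illustration, the parent-child relation between the RegisteredUser and Request classes
might map to physical tables along the following lines. This is a hedged sketch for illustration
only; the table and column definitions here are assumptions, not Commerce Server's actual
physical schema.
-- Parent class: one row per registered user.
CREATE TABLE RegisteredUser (
    UserId INT PRIMARY KEY,
    Email  VARCHAR(255)    -- the E-mail data member
);
-- Child class: many requests can belong to one registered user.
CREATE TABLE Request (
    RequestId    INT PRIMARY KEY,
    UserId       INT NOT NULL REFERENCES RegisteredUser (UserId),  -- the relation
    RequestedUrl VARCHAR(2000)
);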
The data that populates the Data Warehouse typically comes from multiple data sources: Web
server logs, Commerce Server databases, and other data sources that you specify. The following
figure shows the sources for operational data, and how the data might be used to support tasks
run from Business Desk.
Because the Data Warehouse is not part of your run-time environment, a system administrator
must determine how frequently to import the operational data into the Data Warehouse. For
example, you can set up the Data Warehouse so that it automatically imports new data every day
or every week. The frequency with which you will need to import data depends on the amount
of new data collected every day in your operational data sources. Commerce Server includes
custom Data Transformation Services (DTS) tasks that simplify the importing of data into the Data
Warehouse. These DTS tasks import data that is used with the reports available from Business
Desk.
Even though the operational data can be imported from different types of databases, or from
storage media that are not databases, all of the data is structured in a consistent manner after it
is gathered into the Data Warehouse. For example, you might have one data source in which the
first and last name of a user are stored in the same field, and another in which the first and last
names are stored in separate fields. When this data is imported into the Data Warehouse, it is
automatically structured to be consistent, thus enabling your analysis activities.
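A minimal sketch of such a consistency transformation, assuming a source table src_user with a
single full_name field and a warehouse table dw_user with separate columns (all names here are
hypothetical); rows without a space are skipped for brevity. The string functions shown are
T-SQL style.
-- Split "First Last" into separate columns during import.
INSERT INTO dw_user (first_name, last_name)
SELECT SUBSTRING(full_name, 1, CHARINDEX(' ', full_name) - 1),
       SUBSTRING(full_name, CHARINDEX(' ', full_name) + 1, LEN(full_name))
FROM src_user
WHERE CHARINDEX(' ', full_name) > 0;   -- guard: needs at least one space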
After data is imported into the Data Warehouse SQL Server database, it must be prepared for
analysis so business managers can run reports against it. To prepare data for reporting, the system
administrator runs a DTS task that exports a selected subset of data from the SQL Server database
to the OLAP database. In the OLAP database, the data is stored in multidimensional cubes.
By storing data in OLAP cubes, instead of in relational tables in SQL Server, the Data Warehouse
can retrieve data for reporting purposes more quickly. The data can be retrieved from the cubes
faster because it is aggregated. That is, data that belongs together is already associated, so it is
easier to retrieve than by searching an entire relational database for the individual parts. For example,
using the OLAP server you can run a report that lists users who visit your site based on the time of
their visit and on the ASP page that they access first. It would be extremely difficult to run such
a report against a large SQL Server database.
In multidimensional cubes, data is grouped in two kinds of structures:
1. Measures: The numeric values that are analyzed.
2. Dimensions: A business entity, such as color, size, product, or time. For example, you
would use the color dimension to contrast how many red products and blue products were
sold, and the size dimension to contrast how many large and small products were sold.
It is the relationship between the dimension (for example, color) and measure (for example,
number of products sold) structures that provides the basis for your reports about user activity.
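In relational terms, a dimension typically becomes a grouping column and a measure becomes an
aggregated numeric column. A hedged sketch of this relationship, with table and column names
assumed for illustration:
-- Measure: units sold; dimensions: color and size.
SELECT p.color, p.size, SUM(f.quantity) AS units_sold
FROM sales_fact f
JOIN product_dim p ON p.product_id = f.product_id
GROUP BY p.color, p.size;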
Task Discuss the various factors behind the measures of data warehouse quality.
Why is it so important?
To analyze data about user activity on your site, you use the Analysis modules in Business Desk.
You can use the Analysis modules to run reports against the Data Warehouse, or to view and
analyze Segment models, which identify segments of the user population visiting your site.
Reports
Commerce Server provides two types of reports that you can use to analyze user data:
1. Dynamic reports are created at runtime: Each time the report is run, it gathers the most
recent data in the Data Warehouse. Only the report definition, which remains the same
over time, is stored. Commerce Server does not save completed dynamic reports; however,
you can export dynamic report results to Microsoft® Excel and store them there.
2. Static reports are run immediately upon request, and then stored, with the data, in
Completed Reports: The reports appear in a browser window in HTML format. You can
export static report results to the List Manager module, and then use the list in a direct mail
campaign. You can send these reports to others using e-mail, post them on your Web site,
and edit them in other applications. For example, using Microsoft Internet Explorer, you
can export the report into a Word document, and then edit it.
The data staging process imports data either as streams or files, transforms it, produces integrated,
cleaned data and stages it for loading into data warehouses, data marts, or Operational Data
Stores.
Kimball et al. distinguish two data staging scenarios:
In the first scenario, a data staging tool is available and the data is already in a database. The data
flow is set up so that it comes out of the source system, moves through the transformation engine, and into a
staging database. The flow is illustrated in Figure 13.1.
In the second scenario, you begin with a mainframe legacy system: you extract the sought-after data
into a flat file, move the file to a staging server, transform its contents, and load the transformed data
into the staging database. Figure 13.2 illustrates this scenario.
The Data Warehouse Staging Area is a temporary location where data from source systems
is copied. A staging area is mainly required in a Data Warehousing Architecture for timing
reasons. In short, all required data must be available before data can be integrated into the Data
Warehouse.
Due to varying business cycles, data processing cycles, hardware and network resource limitations
and geographical factors, it is not feasible to extract all the data from all Operational databases
at exactly the same time.
Example: It might be reasonable to extract sales data on a daily basis, however, daily
extracts might not be suitable for financial data that requires a month-end reconciliation process.
Similarly, it might be feasible to extract “customer” data from a database in Singapore at noon
Eastern Standard Time, but this would not be feasible for “customer” data in a Chicago database.
Data in the Data Warehouse can be either persistent (i.e. remains around for a long period) or
transient (i.e. only remains around temporarily).
Not all businesses require a Data Warehouse Staging Area. For many businesses it is feasible to use
ETL to copy data directly from operational databases into the Data Warehouse.
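A minimal sketch of the staged extract-transform-load flow described above, assuming a staging
table stg_sales between an operational table ops_sales and a warehouse table dw_sales (all names
are illustrative assumptions):
-- 1. Extract: land the raw operational rows in the staging area.
INSERT INTO stg_sales (sale_id, sale_date, amount, region)
SELECT sale_id, sale_date, amount, region
FROM ops_sales;
-- 2. Transform in place: standardize the region code.
UPDATE stg_sales
SET region = UPPER(LTRIM(RTRIM(region)));
-- 3. Load: move the cleaned rows into the warehouse once all
--    required sources have arrived.
INSERT INTO dw_sales (sale_id, sale_date, amount, region)
SELECT sale_id, sale_date, amount, region
FROM stg_sales;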
Task Discuss metadata and its importance for the source system and the Data Staging Area.
Case Study AT&T
The Company
AT&T, a premier voice, video, and data communications company, was gearing up for
the next phase of telecommunications industry deregulation. It recognized that the path to
growth was paved with synergistic products and services, such as credit cards, e-commerce
offerings and integrated customer service. Moreover, it would need to offer a single support
number for all product and service inquiries.
These realizations drove AT&T to regard customers in a new light: not as individual
accounts, but as people who buy a family of products or services. To capitalize on the
opportunities this presented, AT&T needed to centralize its customer data and consistently
identify customers across touch points and accounts. The Integrated Customer View
project, with its customer data warehouse, answered these needs.
Creating a unified customer view within the warehouse demanded high-quality, consistent
data. The stakes were high: with a potential telecommunications market exceeding 100 million
customers, even 99% accuracy in customer data would still leave more than a
million faulty records.
The Challenge
Data quality wasn’t a new idea at the company. Until recently, there wasn’t a convenient
way for AT&T to reconcile different versions of the same consumer from one product
or department to the next. Although the problem was hardly unique to the carrier,
the ramifications of duplication and inaccuracy were significant given the size of its
marketplace.
Integrating these systems to present a unified customer view was far from straightforward.
In the telecommunications industry, the problem is even more challenging. Unlike other
businesses, such as consumer banks, AT&T might serve the same customer at multiple
addresses and with multiple phone numbers.
Another problem was data volatility: anyone could become a customer at any time.
Moreover, if the customer moved to another locale, the source and the content of the data
for that particular customer could change, too.
As for data sources, there are more than 1,500 local phone service providers across the U.S.
The content, format, and quality of name and address data can vary sharply between one
local exchange provider and the next.
The manager of the AT&T Integrated Customer View project explained, “We needed a data
cleansing system that could reach for the ‘knowledge’ in our address base, and perform
parsing and matching according to a standardized process.”
The Solution
The AT&T consumer division’s Integrated Customer View project included a team with
roughly a dozen core members, each of whom had years of experience in customer
identification and systems development. That core group was supplemented by a larger
group of business analysts from across the enterprise.
The team ruled out custom development because of stiff maintenance requirements
and rejected out-of-the-box direct mail software packages because they weren’t precise
enough. Ultimately, the team chose the Trillium Software System® for data identification,
standardization, and postal correction. Trillium Software’s solution was the only
package that could deliver the necessary functionality in both UNIX and mainframe
environments.
Multiplatform support was critical because, although the company was committed to
client/server migration, legacy systems would likely coexist through the foreseeable
future. Initially, the system would consist of two parts: a terabyte-sized master repository
residing on an IBM mainframe, and an Oracle-based data warehouse maintained on an AT&T
UNIX server.
The AT&T project team spent nine months creating and testing data cleansing rules for an
initial loading of data into the data warehouse. Because of its unique data requirements,
the team developed an automated name and address matching process that resulted in a
large number of permutations. According to the AT&T project manager, “The problem was
so complex that it was beyond the ability of a single individual to handle.”
13.3 Summary
zz We have extended our meta modeling framework for data warehouse architecture and
quality by a model for data warehouse processes and have specialized this model to the
case of data warehouse evolution. In detail, we have addressed the problem of evolution
of data warehouse views.
zz The metadata is managed in our repository system.
zz ConceptBase allows us to query and analyze the stored metadata for errors and
deficiencies.
zz In addition, features like client notification and active rules of ConceptBase support the
maintenance of the data warehouse components and keep data warehouse users up-to-
date on the status of the data warehouse.
13.4 Keywords
Data Quality: Data quality (DQ) is an extremely important issue since quality determines the
data’s usefulness as well as the quality of the decisions based on the data.
Data Warehouse Staging Area: The Data Warehouse Staging Area is temporary location where
data from source systems is copied.
Logical Schema: The logical schema provides an understandable view of the data in the Data
Warehouse, and supports an efficient import process.
Physical Store: The physical store for the Data Warehouse includes one database that you can
query using SQL queries.
Answers: Self Assessment
1. (b)
2. (a)
3. (d)
4. Class
5. physical store
6. different sources
7. logical schema
8. Commerce Server
9. staging area
10. transient
Books A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.
Alex Berson, Data Warehousing, Data Mining and OLAP, Tata McGraw Hill, 1997.
Alex Berson, Stephen J. Smith, Data Warehousing, Data Mining & OLAP, Tata
McGraw Hill Publications, 2004.
Alex Freitas and Simon Lavington, Mining Very Large Databases with Parallel
Processing, Kluwer Academic Publishers, 1998.
J. Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers,
1993.
Jiawei Han, Micheline Kamber, Data Mining – Concepts and Techniques, Morgan
Kaufmann Publishers, First Edition, 2003.
Matthias Jarke, Maurizio Lenzerini, Yannis Vassiliou, Panos Vassiliadis,
Fundamentals of Data Warehouses, Springer.
Michael Berry and Gordon Linoff, Data Mining Techniques (For Marketing, Sales,
and Customer Support), John Wiley & Sons, 1997.
Michael J. A. Berry, Gordon S Linoff, Data Mining Techniques, Wiley Publishing
Inc, Second Edition, 2004.
Sam Anohory, Dennis Murray, Data Warehousing in the Real World, Addison
Wesley, First Edition, 2000.
Sholom M. Weiss and Nitin Indurkhya, “Predictive Data Mining: A Practical Guide”,
Morgan Kaufmann Publishers, 1998.
Sushmita Mitra, Tinku Acharya, Data Mining – Multimedia, Soft Computing and
Bioinformatics, John Wiley & Sons, 2003.
Usama Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy
Uthurasamy, Advances in Knowledge Discovery and Data Mining, AAAI Press/ The
MIT Press, 1996.
V. Cherkassky and F. Mulier, Learning From Data, John Wiley & Sons, 1998.
Contents
Objectives
Introduction
14.1 Quality Driven Data Warehouse Design
14.2 Interaction between Quality Factors and DW Tasks
14.2.1 Expected Results and Innovations
14.2.2 Quality Factors and Properties
14.3 The DWQ Data Warehouse Design Methodology
14.3.1 Data Warehouses, OLTP, OLAP and Data Mining
14.3.2 A Data Warehouse Supports OLTP
14.3.3 OLAP is a Data Warehouse Tool
14.3.4 Data Mining is a Data Warehouse Tool
14.3.5 Designing a Data Warehouse: Prerequisites
14.3.6 Data Warehouse Users
14.4 Optimizing the Materialization of DW Views
14.5 Summary
14.6 Keywords
14.7 Self Assessment
14.8 Review Questions
14.9 Further Readings
Objectives
After studying this unit, you will be able to:
zz Describe quality driven data warehouse design
zz Know interaction between quality factors and data warehouse tasks
zz Explain DWQ data warehouse design methodology
zz Describe optimizing the materialization of DW views
Introduction
A data warehouse (DW) can be seen as a set of materialized views defined over remote base
relations. When a query is posed, it is evaluated locally, using the materialized views, without
accessing the original information sources. The DWs are dynamic entities that evolve continuously
over time. As time passes, new queries need to be answered by them. Some of these queries can
be answered using the materialized views exclusively. In general, though, new views need to be
added to the DW.
The resulting standardized and integrated data are stored as materialized views in the data
warehouse. These base views (often called ODS or operational data store) are usually just slightly
aggregated. To customize them for different analyst users, data marts with more aggregated data
about specific domains of interest are constructed as second-level caches which are then accessed
by data analysis tools ranging from query facilities through spreadsheet tools to full-fledged data
mining systems.
Almost all current research and practice understand a data warehouse architecture as a stepwise
information flow from information sources through materialized views towards analyst clients,
as shown in Figure 14.1.
The key argument pursued in this paper is that the architecture in Figure 14.1 covers only partially
the tasks faced in data warehousing and is therefore unable to even express, let alone support,
a large number of important quality problems and management strategies followed in data
warehouses. To illustrate this point, consider Figure 14.2 which shows the host of knowledge
sources used in the central data warehouse a large German bank has developed for financial
controlling.
Figure 14.2: Multiple Knowledge Sources Influencing the Design of a Real-World Data Warehouse
Despite the fact that this data warehouse talks “only” about financial figures, there is a host of
semantic coherency questions to be solved between the different accounting definitions required
by tax laws, stock exchanges, different financial products, and the like. At the same time, there
are massive physical data integration problems to be solved by re-calculating tens of thousands of
multi-dimensional data cubes on a daily basis to have close to zero-latency information for top
management. In light of such problems, many architectures discussed in the literature appear
somewhat naive.
The key to solving these enormous problems in a flexible and evolvable manner is enriched
metadata management, used by different kinds of interacting software components. In the
following section, we shall present our approach to organizing this.
Vendors agree that data warehouses cannot be off-the-shelf products but must be designed and
optimized with great attention to the customer situation. Traditional database design techniques
do not apply since they cannot deal with DW-specific issues such as data source selection, temporal
and aggregated data, and controlled redundancy management. Since the wide variety of product
and vendor strategies prevents a low-level solution to these design problems at acceptable costs,
only an enrichment of metadata services linking heterogeneous implementations is a promising
solution. But this requires research in the foundations of data warehouse quality.
The goal of the DWQ project is to develop a semantic foundation that will allow the designers
of data warehouses to link the choice of deeper models, richer data structures and rigorous
implementation techniques to quality-of-service factors in a systematic manner, thus improving
the design, the operation, and most importantly the evolution of data warehouse applications.
DWQ’s research objectives address three critical domains where quality factors are of central
importance:
1. Enrich the semantics of meta databases with formal models of information quality to enable
adaptive and quantitative design optimization of data warehouses;
2. Enrich the semantics of information resource models to enable more incremental change
propagation and conflict resolution;
3. Enrich the semantics of data warehouse schema models to enable designers and query
optimizers to take explicit advantage of the temporal, spatial and aggregate nature of DW
data.
The results will be delivered in the form of publications and supported by a suite of prototype
modules to achieve the following practical objectives:
1. Validating their individual usefulness by linking them with related methods and tools
of Software AG, a leading European vendor of DW solutions. The research goal is to
demonstrate progress over the commercial state of the art, and to give members of the
industrial steering committee a competitive advantage through early access to results.
2. Demonstrating the interaction of the different contributions in the context of case studies in
telecommunications and environmental protection.
Linking Data Warehousing and Data Quality. DWQ provides assistance to DW designers by
linking the main components of a DW reference architecture to a formal model of data quality,
as shown in Figure 14.4.
Figure 14.4: Data Quality Issues and DWQ Research Results
A closer examination of the quality factor hierarchy reveals several relationships between quality
parameters and design/operational aspects of DW’s. The DWQ project will investigate these
relationships in a systematic manner:
1. The simple DW concept itself alleviates the problem of accessibility, by saving its users the
effort of searching in a large, poorly structured information space, and avoiding interference
of data analysis with operational data processing. However, the issue of delivering the
information efficiently is an important open problem in the light of its differences from
traditional query processing. In a DW environment, there is an increased need for fast and
dynamic aggregate query processing, indexing of aggregate results, as well as fast update
of the DW content after changes are performed to the underlying information sources.
2. It remains difficult for DW customers to interpret the data because the semantics of data
description languages for data warehouse schemata is weak, does not take into account
domain-specific aspects, and is usually not formally defined and therefore hardly
computer-supported. The DWQ project will ensure interpretability by investigating the
syntax, semantics, and reasoning efficiency for rich schema languages which (a) give more
structure to schemas, and (b) allow the integration of concrete domains (e.g., numerical
reasoning, temporal and spatial domains) and aggregate data. This work builds in part on
results obtained in the CLN (Computational Logic) II ESPRIT Project.
3. The usefulness of data is hampered because it is hard to adapt the contents of the DW to
changing customer needs, and to offer a range of different policies for ensuring adequate
timeliness of data at acceptable costs. The DWQ project will develop policies to extend
active database concepts such that data caching is optimized for a given transaction load
on the DW, and that distributed execution of triggers with user-defined cache refreshment
policies becomes possible. This work builds on earlier active database research, e.g. in
ACTNET and the STRETCH and IDEA ESPRIT projects.
4. The believability of data is hampered because the DW customer often does not know the
credibility of the source and the accuracy of the data. Moreover, schema languages are too
weak to ensure completeness and consistency testing. To ensure the quality of individual
DW contents, the DWQ project will link rich schema languages to techniques for efficient
integrity checking for relational, deductive, and object-oriented databases. Moreover,
recent techniques from meta modeling and distributed software engineering will help to
identify and maintain inter-view relationships. This requires deep integration of AI and
database techniques, building on experiences in view integration and maintenance, and
meta-level integration of heterogeneous databases.
Task Discuss the differences between no coupling, loose coupling, semi-tight
coupling and tight coupling architectures for the integration of a data mining system with
a database or data warehouse system.
The DWQ project will produce semantic foundations for DW design and evolution linked
explicitly to formal quality models, as indicated in the middle of Figure 14.5. These semantic
foundations will be made accessible by embedding them in methodological advice and prototype
tools. Their usefulness will be validated in the context of Software AG’s methodology and tool
suite and a number of sample applications.
After developing an initial reference model jointly with the industrial committee, the results will
be delivered in two stages to enable effective project control and coherence. The first group of
results will develop enriched formal meta models for describing the static architecture of a DW
and demonstrate how these enriched foundations are used in DW operation. The corresponding
tools include architecture modeling facilities including features for addressing DW-specific
issues such as resolution of multiple sources and management of partially aggregated multi-
dimensional data, as well as semantics-based methods for query optimization and incremental
update propagation.
The second group of results focuses on enhancing these enriched models by tools that support the
evolution and optimization of DW applications under changing quality goals. The corresponding
tools include: evolution operators which document the link between design decisions and quality
factors, reasoning methods which analyze and optimize view definitions with multi-dimensional
aggregated data, and allow efficient quality control in bulk data reconciliation from new sources;
and quantitative techniques which optimize data source selection, integration strategies, and
redundant view materialization with respect to given quality criteria, esp. performance criteria.
To carry out data quality evaluation, we first need to identify which quality factors to evaluate. The
choice of the most appropriate quality factors for a given DIS depends on the user applications
and the way the DIS is implemented. Several works study which quality factors are most
relevant for different types of systems. The selection of the appropriate quality factors implies
the selection of metrics and the implementation of evaluation algorithms that measure, estimate
or bound such quality factors.
In order to calculate quality values corresponding to those factors, the algorithms need input
information describing system properties such as, for example, the time an activity needs to
execute or a descriptor stating if an activity materializes data or not. These properties can be of
two types: (i) descriptions, indicating some feature of the system (costs, delays, policies, strategies,
constraints, etc.), or (ii) measures, indicating a quality value corresponding to a quality factor,
which can be an actual value acquired from a source, a calculated value obtained executing an
evaluation algorithm or an expected value indicating the user desired value for the quality factor.
The selection of the adequate properties depends on the quality factors that are relevant for the
system and on the calculation processes.
Example: Consider a system where users are interested in the evaluation of response time
and freshness. To calculate the response time, it is necessary to know which activities materialize
data and the execution cost of the activities that do not materialize data. To calculate the data
freshness it is also necessary to know the refreshment frequencies and costs as well as the actual
freshness of the data in the sources. Other examples of properties can include execution policies,
source constraints and communication delays.
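For instance, if the metadata repository records when each source was last refreshed, a simple
freshness bound can be computed directly from it. A sketch under that assumption, with table
and column names hypothetical and T-SQL date arithmetic:
-- Freshness, estimated as hours elapsed since each source's last
-- successful refresh; sources older than a day are flagged.
SELECT source_name,
       DATEDIFF(hour, last_refresh_time, GETDATE()) AS freshness_hours
FROM dw_source_metadata
WHERE DATEDIFF(hour, last_refresh_time, GETDATE()) > 24;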
A relational database is designed for a specific purpose. Because the purpose of a data warehouse
differs from that of an OLTP, the design characteristics of a relational database that supports a
data warehouse differ from the design characteristics of an OLTP database.
A data warehouse supports an OLTP system by providing a place for the OLTP database to
offload data as it accumulates, and by providing services that would complicate and degrade
OLTP operations if they were performed in the OLTP database.
Without a data warehouse to hold historical information, data is archived to static media such as
magnetic tape, or allowed to accumulate in the OLTP database.
If data is simply archived for preservation, it is not available or organized for use by analysts
and decision makers. If data is allowed to accumulate in the OLTP so it can be used for analysis,
the OLTP database continues to grow in size and requires more indexes to service analytical
and report queries. These queries access and process large portions of the continually growing
historical data and add a substantial load to the database. The large indexes needed to support
these queries also tax the OLTP transactions with additional index maintenance. These queries
can also be complicated to develop due to the typically complex OLTP database schema.
A data warehouse offloads the historical data from the OLTP, allowing the OLTP to operate at
peak transaction efficiency. High volume analytical and reporting queries are handled by the
data warehouse and do not load the OLTP, which does not need additional indexes for their
support. As data is moved to the data warehouse, it is also reorganized and consolidated so that
analytical queries are simpler and more efficient.
Data mining is a technology that applies sophisticated and complex algorithms to analyze data
and expose interesting information for analysis by decision makers. Whereas OLAP organizes
data in a model suited for exploration by analysts, data mining performs analysis on data and
provides the results to decision makers. Thus, OLAP supports model-driven analysis and data
mining supports data-driven analysis.
Data mining has traditionally operated only on raw data in the data warehouse database or,
more commonly, text files of data extracted from the data warehouse database. In SQL Server
2000, Analysis Services provides data mining technology that can analyze data in OLAP cubes,
as well as data in the relational data warehouse database. In addition, data mining results can
be incorporated into OLAP cubes to further enhance model-driven analysis by providing an
additional dimensional viewpoint into the OLAP model. For example, data mining can be used
to analyze sales data against customer attributes and create a new cube dimension to assist the
analyst in the discovery of the information embedded in the cube data.
Before embarking on the design of a data warehouse, it is imperative that the architectural goals
of the data warehouse be clear and well understood. Because the purpose of a data warehouse
is to serve users, it is also critical to understand the various types of users, their needs, and the
characteristics of their interactions with the data warehouse.
A data warehouse exists to serve its users: analysts and decision makers. A data warehouse must
be designed to satisfy the following requirements:
1. Deliver a great user experience; user acceptance is the measure of success
2. Function without interfering with OLTP systems
3. Provide a central repository of consistent data
4. Answer complex queries quickly
5. Provide a variety of powerful analytical tools, such as OLAP and data mining
Most successful data warehouses that meet these requirements have these common
characteristics:
1. Are based on a dimensional model (see the sketch after this list)
2. Contain historical data
3. Include both detailed and summarized data
4. Consolidate disparate data from multiple sources while retaining consistency
5. Focus on a single subject, such as sales, inventory, or finance
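A minimal star-schema sketch of such a dimensional model for a sales subject area; all table and
column names are illustrative assumptions.
-- Dimension tables: descriptive business entities.
CREATE TABLE date_dim (
    date_id       INT PRIMARY KEY,
    calendar_date DATE,
    calendar_year INT
);
CREATE TABLE product_dim (
    product_id INT PRIMARY KEY,
    prod_name  VARCHAR(100),
    color      VARCHAR(30)
);
-- Fact table: numeric measures keyed by the dimensions.
CREATE TABLE sales_fact (
    date_id     INT NOT NULL REFERENCES date_dim (date_id),
    product_id  INT NOT NULL REFERENCES product_dim (product_id),
    quantity    INT,             -- measure
    amount_sold DECIMAL(12, 2)   -- measure
);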
Data warehouses are often quite large. However, size is not an architectural goal; it is a characteristic
driven by the amount of data needed to serve the users.
The success of a data warehouse is measured solely by its acceptance by users. Without users,
historical data might as well be archived to magnetic tape and stored in the basement. Successful
data warehouse design starts with understanding the users and their needs.
Data warehouse users can be divided into four categories: Statisticians, Knowledge Workers,
Information Consumers, and Executives. Each type makes up a portion of the user population as
illustrated in Figure 14.6.
1. Statisticians: There are typically only a handful of sophisticated analysts (statisticians and
operations research types) in any organization. Though few in number, they are some of the
best users of the data warehouse; those whose work can contribute to closed loop systems
that deeply influence the operations and profitability of the company. It is vital that these
users come to love the data warehouse. Usually that is not difficult; these people are often
very self-sufficient and need only to be pointed to the database and given some simple
instructions about how to get to the data and what times of the day are best for performing
large queries to retrieve data to analyze using their own sophisticated tools. They can take
it from there.
2. Knowledge Workers: A relatively small number of analysts perform the bulk of new queries
and analyses against the data warehouse. These are the users who get the “Designer” or
“Analyst” versions of user access tools. They will figure out how to quantify a subject
area. After a few iterations, their queries and reports typically get published for the benefit
of the Information Consumers. Knowledge Workers are often deeply engaged with the
data warehouse design and place the greatest demands on the ongoing data warehouse
operations team for training and support.
3. Information Consumers: Most users of the data warehouse are Information Consumers;
they will probably never compose a true ad hoc query. They use static or simple interactive
reports that others have developed. It is easy to forget about these users, because they
usually interact with the data warehouse only through the work product of others. Do not
neglect these users! This group includes a large number of people, and published reports
are highly visible. Set up a great communication infrastructure for distributing information
widely, and gather feedback from these users to improve the information sites over time.
4. Executives: Executives are a special case of the Information Consumers group. Few
executives actually issue their own queries, but an executive’s slightest musing can generate
a flurry of activity among the other types of users. A wise data warehouse designer/
implementer/owner will develop a very cool digital dashboard for executives, assuming
it is easy and economical to do so. Usually this should follow other data warehouse work,
but it never hurts to impress the bosses.
Models, which provide array-based computations in SQL, can be used in materialized views.
Because the MODEL clause calculations can be expensive, you may want to use two separate
materialized views: one for the model calculations and one for the SELECT ... GROUP BY query.
For example, instead of using one, long materialized view, you could create the following
materialized views:
CREATE MATERIALIZED VIEW my_groupby_mv
REFRESH FAST
ENABLE QUERY REWRITE AS
SELECT country_name country, prod_name prod, calendar_year year,
       SUM(amount_sold) sale, COUNT(amount_sold) cnt, COUNT(*) cntstr
FROM sales, times, customers, countries, products
WHERE sales.time_id = times.time_id AND
      sales.prod_id = products.prod_id AND
      sales.cust_id = customers.cust_id AND
      customers.country_id = countries.country_id
GROUP BY country_name, prod_name, calendar_year;

CREATE MATERIALIZED VIEW my_model_mv
ENABLE QUERY REWRITE AS
SELECT country, prod, year, sale, cnt
FROM my_groupby_mv
MODEL PARTITION BY (country) DIMENSION BY (prod, year)
      MEASURES (sale s) IGNORE NAV
      (s['Shorts', 2000] = 0.2 * AVG(s)[CURRENTV(), year BETWEEN 1996 AND 1999],
       s['Kids Pajama', 2000] = 0.5 * AVG(s)[CURRENTV(), year BETWEEN 1995 AND 1999],
       s['Boys Pajama', 2000] = 0.6 * AVG(s)[CURRENTV(), year BETWEEN 1994 AND 1999],
       ...
       <hundreds of other update rules>);
By using two materialized views, you can incrementally maintain the materialized view
my_groupby_mv. The materialized view my_model_mv is on a much smaller data set because it is
built on my_groupby_mv and can be maintained by a complete refresh.
Materialized views with models can use complete refresh or PCT refresh only, and are available
for partial text query rewrite only.
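As a usage sketch, the two views could then be refreshed with Oracle's DBMS_MVIEW package,
using fast (incremental) refresh for the base view and complete refresh for the model view; the
exact refresh options that apply depend on your environment.
-- Fast (incremental) refresh of the base aggregate view,
-- then a complete refresh of the model-based view built on it.
EXECUTE DBMS_MVIEW.REFRESH('MY_GROUPBY_MV', method => 'F');
EXECUTE DBMS_MVIEW.REFRESH('MY_MODEL_MV',   method => 'C');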
Case Study FedEx
Cost Savings All Day Long
How does FedEx make the case for IT spending? Cost savings is a large component. In
particular, an innovative Web-based customer service portal, called FedEx InSight, has
aligned with significant increases in the use of FedEx services by some of the company’s
most valuable customers.
With FedEx InSight, business customers view all outgoing, incoming, and third-party
shipments online and they prefer interacting with FedEx Insight over other channels. In
fact, they like it so much that they forgo lower rates from competitors in order to have
access to FedEx online tracking.
Cutting costs while increasing customer loyalty, InSight is considered by FedEx to be a
milestone technology. The innovative Web service lets business customers instantly
access all their current FedEx cargo information, tailor views, and drill down into freight
information including shipping date, weight, contents, expected delivery date, and related
shipments. Customers can even opt for email notifications of in-transit events, such as
attempted deliveries, delays at customs, etc.
The Perfect Match
InSight works because FedEx can link shipper and receiver data on shipping bills with
entries in a database of registered InSight customers. The linking software FedEx chose to
support InSight had to be superior in terms of its ability to recognize, interpret, and match
customer name and address information. Fast processing speed and flexibility were also
top criteria. After a broad and thorough evaluation of vendors in the data quality market,
the delivery giant chose Trillium Software®.
The real matching challenge was not with the records for outgoing shippers, who could
be easily identified by their account numbers. Linking shipment recipients to customers
in the InSight database was far more difficult. It relied on name and address information,
which is notoriously fraught with errors, omissions, and other anomalies—especially when
entered by individual shippers around the globe. The point of pain was being able to match
on addresses, because what FedEx receives on the airbills is not very standardized.
FedEx airbills had another problem: too much information. “For the purpose of matching
customers to shipments, the airbills contain a lot of garbage,” said FedEx’s senior technical
analyst. “Information such as parts numbers, stock-keeping units, signature requirements,
shipping contents, delivery instructions, country of manufacture, and more obscures the
name and address data, making it difficult to interpret that free-form text and correctly
identify name and address information.”
A Deeper Look at Data
As Trillium Software® demonstrated to FedEx during the sales cycle, no matching software
would be successful for FedEx airbills without some intelligent interpretation of free-form
text and standardization. Matching is more accurate when it acts on more complete and
standardized data.
The Trillium Software System® first investigates data entries word by word—not just line
by line—in order to understand maximum data content. Valid content is often “hidden”
when it’s entered in the wrong field or free-form text fields. Trillium technology reveals
this hidden data by identifying individual data elements in each shipping bill, interpreting
the real meaning of each element, and ensuring that all valid data elements are part of the
matching equation. It then standardizes content into a consistent format.
Beyond Address Correction
In the logistics industry, accuracy is everything; FedEx needed to identify customers
reliably based on a variety of data elements, including business names, office suite
numbers, and other address elements. Its chosen data quality solution had to identify and
distinguish between companies on different floors of an office tower or divisions within
a geographically dispersed corporate campus. It had to link customers based on detailed
analyses of abbreviations, nicknames, synonyms, personal names, street addresses, and
other information.
FedEx knew they needed more than address verification software that only confirmed
that an address was internally consistent and correct according to postal authorities. They
needed precision matching capabilities that would lie at the heart of InSight. The Trillium
Software System had all these capabilities, in addition to usability features that allowed
FedEx to tune and test quickly and iteratively the matching process until the match results
met the company’s stringent requirements.
Split-Second Processing
Speed was another requirement. To efficiently handle the volume of FedEx’s daily
transactions, the software had to identify and resolve matches at sub-second rates. Only
the Trillium Software System could demonstrate this capability and process millions of
records per day—as many as 500,000 records per hour.
Surgical Precision
“That we could customize [business] rules and surgically make changes was a big, big
winning point,” said the senior technical analyst. The Trillium Software System lets FedEx
target specific data issues and quickly modify rules to resolve them. The business rules,
written in plain text, were understandable, traceable, and repeatable. Because the analyst
team could see what the rules were and how they worked, they were more confident about
the matching process. “There’s a certain amount of confidence you have to have in the
product. You have to trust in what it’s doing.”
Rapid Rollout
FedEx took only about four months to implement its solution fully. Trillium Software
professional services helped FedEx get started. After just three days, the senior technical
analyst was ready to work on his own: “Once I understood it, it was just a matter of
applying that knowledge,” he stated.
He also gives Trillium Software Customer Support a lot of credit: “The Trillium Software
tech support is just terrific. Most of my support is done through email and someone always
gets back to me quickly. If I call, there’s no kind of triage. I tell them what language I speak,
and then I get to talk to someone.”
Award-Winning InSight
FedEx won several e-commerce awards for its innovation, and customers raved about their
InSight experiences. In fact, FedEx customers communicated that they would forgo lower
shipping rates from competitors, because they prized the ability to track their incoming
and outgoing shipments so easily with InSight.
FedEx also realized concrete gains from its investment. Repeatedly, implementation of
InSight was shown to align with significant increases in use of FedEx services by some of
the company’s largest customers.
14.5 Summary
zz We have presented a comprehensive approach to improving the quality of data
warehouses through the enrichment of metadata, based on explicit modeling of the
relationships between enterprise models, source models, and client interest models.
zz Our algorithms, prototypical implementations, and partial validations in a number of
real-world applications demonstrate that an almost seamless handling of conceptual,
logical, and physical perspectives on data warehousing is feasible, considering a broad
range of quality criteria ranging from business-oriented accuracy and actuality to technical
systems performance and scalability.
14.6 Keywords
Data Mining: Data mining is a technology that applies sophisticated and complex algorithms to
analyze data and expose interesting information for analysis by decision makers.
Data Warehouse: Data warehouses support business decisions by collecting, consolidating, and
organizing data for reporting and analysis with tools such as online analytical processing (OLAP)
and data mining.
DWQ: DWQ provides assistance to DW designers by linking the main components of the data
warehouse reference architecture to a formal model of data quality.
OLAP: Online analytical processing (OLAP) is a technology designed to provide superior
performance for ad hoc business intelligence queries.
Answers: Self Assessment
1. (c)
2. (a)
3. text
4. Executives
5. Materialized views
6. Data warehousing
7. customer
8. relational database
9. OLTP
10. intuitive model
Books A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.
Alex Berson, Data Warehousing, Data Mining and OLAP, Tata McGraw Hill, 1997.
Alex Berson, Stephen J. Smith, Data Warehousing, Data Mining & OLAP, Tata
McGraw Hill Publications, 2004.
Alex Freitas and Simon Lavington, Mining Very Large Databases with Parallel
Processing, Kluwer Academic Publishers, 1998.
J. Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers,
1993.
Jiawei Han, Micheline Kamber, Data Mining – Concepts and Techniques, Morgan
Kaufmann Publishers, First Edition, 2003.
Matthias Jarke, Maurizio Lenzerini, Yannis Vassiliou, Panos Vassiliadis,
Fundamentals of Data Warehouses, Springer.
Michael Berry and Gordon Linoff, Data Mining Techniques (For Marketing, Sales,
and Customer Support), John Wiley & Sons, 1997.
Michael J. A. Berry, Gordon S Linoff, Data Mining Techniques, Wiley Publishing
Inc, Second Edition, 2004.
Sam Anohory, Dennis Murray, Data Warehousing in the Real World, Addison
Wesley, First Edition, 2000.
Sholom M. Weiss and Nitin Indurkhya, “Predictive Data Mining: A Practical Guide”,
Morgan Kaufmann Publishers, 1998.
Sushmita Mitra, Tinku Acharya, Data Mining – Multimedia, Soft Computing and
Bioinformatics, John Wiley & Sons, 2003.
Usama Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy
Uthurasamy, Advances in Knowledge Discovery and Data Mining, AAAI Press/ The
MIT Press, 1996.
V. Cherkassky and F. Mulier, Learning From Data, John Wiley & Sons, 1998.