DWDM Lecture Notes U-1

Knowledge Discovery in Databases (KDD)

Some people treat data mining as a synonym for knowledge discovery, while others view data
mining as an essential step in the process of knowledge discovery. The steps involved
in the knowledge discovery process are:

Data Cleaning - In this step, noise and inconsistent data are removed.
Data Integration - In this step, multiple data sources are combined.
Data Selection - In this step, data relevant to the analysis task are retrieved from the database.
Data Transformation - In this step, data are transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations.
Data Mining - In this step, intelligent methods are applied in order to extract data patterns.
Pattern Evaluation - In this step, the extracted data patterns are evaluated.
Knowledge Presentation - In this step, the discovered knowledge is presented to the user.
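
The steps above can be sketched as a tiny pipeline. This is only an illustration under stated assumptions: the two sources, the field names, and the toy "pattern" (items whose total quantity reaches a threshold) are hypothetical and stand in for real mining methods.

```python
from collections import Counter

# Hypothetical raw records from two sources (names and fields are assumptions).
store_a = [{"cust": "c1", "item": "milk", "qty": 2},
           {"cust": "c2", "item": "bread", "qty": None},   # noisy: missing quantity
           {"cust": "c3", "item": "milk", "qty": 1}]
store_b = [{"cust": "c4", "item": "Milk ", "qty": 3},      # inconsistent spelling
           {"cust": "c5", "item": "eggs", "qty": 1}]

def clean(rows):
    """Data cleaning: drop rows with missing values, normalize item names."""
    return [{**r, "item": r["item"].strip().lower()}
            for r in rows if r["qty"] is not None]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for src in sources for r in src]

def select(rows, fields):
    """Data selection: keep only the fields relevant to the task."""
    return [{f: r[f] for f in fields} for r in rows]

def transform(rows):
    """Data transformation: aggregate quantity per item."""
    totals = Counter()
    for r in rows:
        totals[r["item"]] += r["qty"]
    return totals

def mine(totals, min_qty):
    """Data mining (toy): items whose total quantity reaches a threshold."""
    return {item: q for item, q in totals.items() if q >= min_qty}

def evaluate(patterns, top_n=1):
    """Pattern evaluation: rank patterns and keep the strongest ones."""
    return sorted(patterns.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

rows = integrate(clean(store_a), clean(store_b))
rows = select(rows, ["item", "qty"])
totals = transform(rows)
patterns = mine(totals, min_qty=2)
best = evaluate(patterns)

# Knowledge presentation: report the result to the user.
for item, qty in best:
    print(f"{item}: {qty} units sold")
```

Each function corresponds to one step in the list, so a step can be replaced (for example, by a real mining algorithm) without touching the others.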

The following diagram shows the knowledge discovery process:

[Figure: Architecture of KDD]

Data Warehouse:

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of
data in support of management's decision making process.

Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For
example, "sales" can be a particular subject.

Integrated: A data warehouse integrates data from multiple data sources. For example, source A
and source B may have different ways of identifying a product, but in a data warehouse, there
will be only a single way of identifying a product.

Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data
from 3 months, 6 months, or 12 months ago, or even older data, from a data warehouse. This
contrasts with a transaction system, where often only the most recent data is kept. For example,
a transaction system may hold only the most recent address of a customer, whereas a data
warehouse can hold all addresses associated with a customer.

Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data
warehouse should never be altered.
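
The customer address example can be sketched as follows. This is a minimal illustration, assuming a hypothetical in-memory "transaction system" that overwrites values and a hypothetical warehouse table that is only ever appended to; the names and dates are made up.

```python
from datetime import date

# Transaction system: keeps only the current address per customer.
operational = {}

# Warehouse: append-only history of (customer, address, valid_from) rows.
warehouse_history = []

def update_address(customer_id, address, as_of):
    operational[customer_id] = address                        # overwrite in place
    warehouse_history.append((customer_id, address, as_of))   # append, never edit

def addresses_of(customer_id):
    """All addresses ever recorded for a customer, oldest first."""
    return [addr for cid, addr, _ in warehouse_history if cid == customer_id]

update_address("c1", "12 Oak St", date(2023, 1, 5))
update_address("c1", "7 Elm Ave", date(2024, 3, 9))

print(operational["c1"])     # current address only
print(addresses_of("c1"))    # full history
```

The warehouse side never updates or deletes a row, which is what makes it both time-variant (history is kept) and non-volatile (history is not altered).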

Data Warehouse Design Process:

A data warehouse can be built using a top-down approach, a bottom-up approach, or a
combination of both.

The top-down approach starts with the overall design and planning. It is useful in cases
where the technology is mature and well known, and where the business problems that must
be solved are clear and well understood.

The bottom-up approach starts with experiments and prototypes. This is useful in the early
stage of business modeling and technology development. It allows an organization to move
forward at considerably less expense and to evaluate the benefits of the technology before
making significant commitments.

In the combined approach, an organization can exploit the planned and strategic nature of
the top-down approach while retaining the rapid implementation and opportunistic
application of the bottom-up approach.

The warehouse design process consists of the following steps:

1. Choose a business process to model, for example, orders, invoices, shipments, inventory,
account administration, sales, or the general ledger. If the business process is organizational
and involves multiple complex object collections, a data warehouse model should be
followed. However, if the process is departmental and focuses on the analysis of one kind of
business process, a data mart model should be chosen.
2. Choose the grain of the business process. The grain is the fundamental, atomic level of data
to be represented in the fact table for this process, for example, individual transactions,
individual daily snapshots, and so on.
3. Choose the dimensions that will apply to each fact table record. Typical dimensions are
time, item, customer, supplier, warehouse, transaction type, and status.
4. Choose the measures that will populate each fact table record. Typical measures are numeric
additive quantities like dollars sold and units sold.
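
The result of these choices can be sketched as a small star schema. The example below is a sketch under assumptions: it uses Python's built-in sqlite3 module with an in-memory database, and the table and column names (dim_time, dim_item, fact_sales) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimensions: descriptive context that applies to each fact record.
CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_item (item_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
-- Fact table for the "sales" business process.
-- Grain: one row per item per day.
-- Measures: numeric, additive quantities.
CREATE TABLE fact_sales (
    time_id INTEGER REFERENCES dim_time(time_id),
    item_id INTEGER REFERENCES dim_item(item_id),
    dollars_sold REAL,
    units_sold INTEGER
);
""")
conn.executemany("INSERT INTO dim_time VALUES (?, ?, ?, ?)",
                 [(1, "2024-01-01", "2024-01", 2024), (2, "2024-01-02", "2024-01", 2024)])
conn.executemany("INSERT INTO dim_item VALUES (?, ?, ?)",
                 [(1, "laptop", "computers"), (2, "phone", "phones")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                 [(1, 1, 1000.0, 1), (1, 2, 500.0, 1), (2, 1, 2000.0, 2)])

# Additive measures can be summed along any dimension.
totals_by_item = conn.execute("""
    SELECT i.name, SUM(f.units_sold), SUM(f.dollars_sold)
    FROM fact_sales f JOIN dim_item i ON i.item_id = f.item_id
    GROUP BY i.name ORDER BY i.name
""").fetchall()
print(totals_by_item)
```

Choosing a finer grain (for example, one row per transaction) would keep more detail at the cost of a larger fact table.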

A Three Tier Data Warehouse Architecture:


The bottom tier is a warehouse database server that is almost always a relational database
system. Back-end tools and utilities are used to feed data into the bottom tier from
operational databases or other external sources (such as customer profile information
provided by external consultants). These tools and utilities perform data extraction,
cleaning, and transformation (e.g., to merge similar data from different sources into a
unified format), as well as load and refresh functions to update the data warehouse. The
data are extracted using application program interfaces known as gateways. A gateway is
supported by the underlying DBMS and allows client programs to generate SQL code to
be executed at a server.

Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object
Linking and Embedding, Database) by Microsoft and JDBC (Java Database Connectivity).
This tier also contains a metadata repository, which stores information about the data
warehouse and its contents.
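
The extract, clean, transform, and load functions of the bottom tier can be sketched as below. This is a minimal sketch, not a real back-end tool: Python's sqlite3 connections stand in for gateways such as ODBC or JDBC, and the two source databases, their table names, and the metadata table are hypothetical.

```python
import sqlite3

# Two hypothetical operational sources with different formats for the same data.
src_a = sqlite3.connect(":memory:")
src_a.execute("CREATE TABLE orders (sku TEXT, amount_cents INTEGER)")
src_a.executemany("INSERT INTO orders VALUES (?, ?)", [("A-1", 1250), ("A-2", None)])

src_b = sqlite3.connect(":memory:")
src_b.execute("CREATE TABLE sales (product_code TEXT, amount_dollars REAL)")
src_b.executemany("INSERT INTO sales VALUES (?, ?)", [("a-1", 7.5)])

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales_fact (product TEXT, dollars REAL, source TEXT)")
warehouse.execute("CREATE TABLE metadata (source TEXT, rows_loaded INTEGER)")

def extract(conn, query):
    """Extraction: the client sends SQL through the connection to the source."""
    return conn.execute(query).fetchall()

def clean_and_transform(rows, to_dollars):
    """Cleaning drops rows with missing amounts; transformation unifies the format."""
    return [(code.upper(), to_dollars(amount))
            for code, amount in rows if amount is not None]

def load(source_name, rows):
    """Load the unified rows and record simple metadata about the load."""
    warehouse.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)",
                          [(p, d, source_name) for p, d in rows])
    warehouse.execute("INSERT INTO metadata VALUES (?, ?)", (source_name, len(rows)))

load("src_a", clean_and_transform(
    extract(src_a, "SELECT sku, amount_cents FROM orders"),
    lambda cents: cents / 100))
load("src_b", clean_and_transform(
    extract(src_b, "SELECT product_code, amount_dollars FROM sales"),
    lambda dollars: float(dollars)))

total = warehouse.execute(
    "SELECT product, SUM(dollars) FROM sales_fact GROUP BY product").fetchall()
loads = warehouse.execute(
    "SELECT source, rows_loaded FROM metadata ORDER BY source").fetchall()
print(total)
print(loads)
```

Both sources end up in one unified format (upper-case product code, amount in dollars), which is the "merge similar data from different sources" step described above.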


The middle tier is an OLAP server that is typically implemented using either a relational
OLAP (ROLAP) model or a multidimensional OLAP (MOLAP) model.

A relational OLAP (ROLAP) model is an extended relational DBMS that maps operations on
multidimensional data to standard relational operations.
A multidimensional OLAP (MOLAP) model is a special-purpose server
that directly implements multidimensional data and operations.


The top tier is a front-end client layer, which contains query and reporting
tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).

Data Warehouse Models:

There are three data warehouse models.

1. Enterprise warehouse:
An enterprise warehouse collects all of the information about subjects spanning the entire
organization. It provides corporate-wide data integration, usually from one or more operational
systems or external information providers, and is cross-functional in scope.
It typically contains detailed data as well as summarized data, and can range in size from a
few gigabytes to hundreds of gigabytes, terabytes, or beyond.
An enterprise data warehouse may be implemented on traditional mainframes, computer
superservers, or parallel architecture platforms. It requires extensive business modeling
and may take years to design and build.

2. Data mart:

A data mart contains a subset of corporate-wide data that is of value to a specific group of
users. The scope is confined to specific selected subjects. For example, a marketing data
mart may confine its subjects to customer, item, and sales. The data contained in data
marts tend to be summarized.

Data marts are usually implemented on low-cost departmental servers that
are UNIX/LINUX- or Windows-based. The implementation cycle of a data mart is more
likely to be measured in weeks rather than months or years. However, it may involve
complex integration in the long run if its design and planning were not enterprise-wide.

Depending on the source of data, data marts can be categorized as independent
or dependent. Independent data marts are sourced from data captured from one or
more operational systems or external information providers, or from data generated
locally within a particular department or geographic area. Dependent data marts are
sourced directly from enterprise data warehouses.

3. Virtual warehouse:

A virtual warehouse is a set of views over operational databases. For efficient
query processing, only some of the possible summary views may be materialized.
A virtual warehouse is easy to build but requires excess capacity on operational
database servers.
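
The difference between a plain view and a materialized summary can be sketched with Python's built-in sqlite3 module. This is an illustration under assumptions: the orders table is hypothetical, and because SQLite has no materialized views, a table filled from the view stands in for one.

```python
import sqlite3

# Hypothetical operational database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (region TEXT, item TEXT, amount REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?, ?)",
               [("north", "laptop", 1000.0), ("north", "phone", 500.0),
                ("south", "laptop", 900.0)])

# Virtual warehouse: a view, computed on the operational server at query time.
db.execute("""CREATE VIEW sales_by_region AS
              SELECT region, SUM(amount) AS total FROM orders GROUP BY region""")

# "Materialized" summary: a table filled from the view (a stand-in, since
# SQLite itself does not support materialized views).
db.execute("CREATE TABLE sales_by_region_mat AS SELECT * FROM sales_by_region")

# A new operational order arrives after the summary was stored.
db.execute("INSERT INTO orders VALUES ('south', 'phone', 100.0)")

live = db.execute("SELECT * FROM sales_by_region ORDER BY region").fetchall()
stored = db.execute("SELECT * FROM sales_by_region_mat ORDER BY region").fetchall()
print(live)    # reflects the new order, but recomputes on every query
print(stored)  # cheap to read, but stale until refreshed
```

Every query against the plain view runs on the operational server, which is why a virtual warehouse needs excess capacity there.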

Meta Data Repository:

Metadata are data about data. When used in a data warehouse, metadata are the data that define
warehouse objects. Metadata are created for the data names and definitions of the given
warehouse. Additional metadata are created and captured for timestamping any extracted data,
the source of the extracted data, and missing fields that have been added by data cleaning or
integration processes.

A metadata repository should contain the following:

A description of the structure of the data warehouse, which includes the warehouse
schema, view, dimensions, hierarchies, and derived data definitions, as well as data mart
locations and contents.

Operational metadata, which include data lineage (history of migrated data and the
sequence of transformations applied to it), currency of data (active, archived, or purged),
and monitoring information (warehouse usage statistics, error reports, and audit trails).

The algorithms used for summarization, which include measure and dimension
definition algorithms, data on granularity, partitions, subject areas, aggregation,
summarization, and predefined queries and reports.

The mapping from the operational environment to the data warehouse, which
includes source databases and their contents, gateway descriptions, data partitions, data
extraction, cleaning, transformation rules and defaults, data refresh and purging rules,
and security (user authorization and access control).

Data related to system performance, which include indices and profiles that improve data
access and retrieval performance, in addition to rules for the timing and scheduling of
refresh, update, and replication cycles.

Business metadata, which include business terms and definitions, data
ownership information, and charging policies.
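
A few of these categories can be sketched as one record per warehouse table. This is only an illustration: the class name, its fields, and the sample values are hypothetical, and a real repository would cover far more than is shown here.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class TableMetadata:
    # Structure of the warehouse object.
    name: str
    columns: dict                       # column name -> type
    # Mapping from the operational environment.
    source: str
    # Operational metadata: data lineage, currency, extraction timestamp.
    transformations: list = field(default_factory=list)
    currency: str = "active"            # active, archived, or purged
    extracted_at: datetime = None
    # Business metadata.
    owner: str = ""

repository = {}

def register(meta):
    """Store the metadata record under the name of the object it describes."""
    repository[meta.name] = meta

register(TableMetadata(
    name="fact_sales",
    columns={"item_id": "INTEGER", "dollars_sold": "REAL"},
    source="orders_db.orders",
    transformations=["drop rows with missing amount", "convert cents to dollars"],
    extracted_at=datetime(2024, 1, 2, 3, 0),
    owner="sales department",
))

meta = repository["fact_sales"]
print(meta.source, meta.currency, meta.transformations)
```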

OLAP (Online Analytical Processing):

OLAP is an approach to answering multi-dimensional analytical (MDA) queries swiftly.

OLAP is part of the broader category of business intelligence, which also
encompasses relational databases, report writing, and data mining.
OLAP tools enable users to analyze multidimensional data interactively
from multiple perspectives.

OLAP consists of three basic analytical operations:

1. Consolidation (Roll-Up)
2. Drill-Down
3. Slicing and Dicing

Consolidation involves the aggregation of data that can be accumulated and computed in
one or more dimensions. For example, all sales offices are rolled up to the sales
department or sales division to anticipate sales trends.

Drill-down is a technique that allows users to navigate through the details. For
instance, users can view the sales by individual products that make up a region's sales.

Slicing and dicing is a feature whereby users can take out (slicing) a specific set of
data of the OLAP cube and view (dicing) the slices from different viewpoints.
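
These operations can be sketched on a very small cube held in a Python dictionary. The sketch rests on assumptions: the dimensions (region, office, product) and the sales figures are made up, and "dice" is implemented in the common textbook sense of selecting a sub-cube on two or more dimensions, which goes slightly beyond the wording above.

```python
from collections import defaultdict

# Hypothetical cube cells: (region, office, product) -> sales.
cells = {
    ("east", "boston", "laptop"): 10,
    ("east", "boston", "phone"): 5,
    ("east", "nyc", "laptop"): 20,
    ("west", "seattle", "laptop"): 8,
    ("west", "seattle", "phone"): 12,
}
DIMS = ("region", "office", "product")

def aggregate(cells, keep):
    """Sum the measure over every dimension not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    out = defaultdict(int)
    for key, value in cells.items():
        out[tuple(key[i] for i in idx)] += value
    return dict(out)

def slice_(cells, dim, value):
    """Slice: fix one dimension to a single value."""
    i = DIMS.index(dim)
    return {k: v for k, v in cells.items() if k[i] == value}

def dice(cells, conditions):
    """Dice: restrict several dimensions to sets of allowed values."""
    return {k: v for k, v in cells.items()
            if all(k[DIMS.index(d)] in allowed for d, allowed in conditions.items())}

# Roll-up: offices are consolidated into their regions.
rolled_up = aggregate(cells, ["region"])
# Drill-down: regional sales are broken out by product.
drilled_down = aggregate(cells, ["region", "product"])
laptops = slice_(cells, "product", "laptop")
east_phones = dice(cells, {"region": {"east"}, "product": {"phone"}})

print(rolled_up)
print(drilled_down)
```

Roll-up and drill-down are the same aggregation run at a coarser or finer level, which is why a single aggregate function serves both.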

Types of OLAP:

1. Relational OLAP (ROLAP):

ROLAP works directly with relational databases. The base data and the dimension
tables are stored as relational tables and new tables are created to hold the aggregated
information. It depends on a specialized schema design.
This methodology relies on manipulating the data stored in the relational database to
give the appearance of traditional OLAP's slicing and dicing functionality. In essence,
each action of slicing and dicing is equivalent to adding a "WHERE" clause in the
SQL statement.
ROLAP tools do not use pre-calculated data cubes but instead pose the query to the
standard relational database and its tables in order to bring back the data required to
answer the question.
ROLAP tools can answer any question because the methodology is not limited to the
contents of a cube. ROLAP also has the ability to drill down to the
lowest level of detail in the database.
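
The idea that slicing and dicing amount to adding a "WHERE" clause can be sketched as follows. This is a minimal sketch using Python's built-in sqlite3 module; the sales table, its columns, and the rolap_query helper are hypothetical and not part of any ROLAP product.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (region TEXT, product TEXT, year INTEGER, amount REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("east", "laptop", 2023, 100.0), ("east", "phone", 2023, 40.0),
    ("east", "laptop", 2024, 120.0), ("west", "laptop", 2024, 90.0),
    ("west", "phone", 2024, 60.0),
])

def rolap_query(group_by, filters=None):
    """Translate a multidimensional request into plain SQL.

    group_by: dimension columns to keep; filters: {column: value} for slice/dice.
    Column names are checked against a whitelist because they are interpolated
    into the SQL text; filter values are passed as bound parameters.
    """
    allowed = {"region", "product", "year"}
    filters = filters or {}
    assert set(group_by) <= allowed and set(filters) <= allowed
    cols = ", ".join(group_by)
    sql = f"SELECT {cols}, SUM(amount) FROM sales"
    if filters:
        # Each slice or dice condition becomes one term of the WHERE clause.
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in filters)
    sql += f" GROUP BY {cols} ORDER BY {cols}"
    return db.execute(sql, list(filters.values())).fetchall()

by_region = rolap_query(["region"])
slice_2024 = rolap_query(["region"], {"year": 2024})
dice_2024_laptop = rolap_query(["region"], {"year": 2024, "product": "laptop"})
print(by_region, slice_2024, dice_2024_laptop)
```

No cube is precomputed here: every request goes back to the relational table, which is why any question over the stored columns can be asked.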

2. Multidimensional OLAP (MOLAP):

MOLAP is the 'classic' form of OLAP and is sometimes referred to as just OLAP.

MOLAP stores data in optimized multi-dimensional array storage, rather than
in a relational database. Therefore it requires the pre-computation and storage of
information in the cube, an operation known as processing.

MOLAP tools generally utilize a pre-calculated data set referred to as a data cube.
The data cube contains all the possible answers to a given range of questions.

MOLAP tools have a very fast response time and the ability to quickly write
back data into the data set.
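
The "processing" step can be sketched as precomputing the aggregate for every combination of dimensions. This is an illustration under assumptions: the fact rows are made up, and Python dictionaries stand in for the optimized array storage a real MOLAP server would use.

```python
from itertools import combinations
from collections import defaultdict

DIMS = ("region", "product", "year")
# Hypothetical fact rows: (region, product, year, amount).
facts = [
    ("east", "laptop", 2023, 100), ("east", "phone", 2023, 40),
    ("east", "laptop", 2024, 120), ("west", "laptop", 2024, 90),
]

def process_cube(facts):
    """'Processing': precompute the aggregate for every subset of dimensions."""
    cube = {}
    for r in range(len(DIMS) + 1):
        for dims in combinations(range(len(DIMS)), r):
            table = defaultdict(int)
            for row in facts:
                table[tuple(row[i] for i in dims)] += row[-1]
            cube[tuple(DIMS[i] for i in dims)] = dict(table)
    return cube

cube = process_cube(facts)

def lookup(dims, key):
    """Answer a query with dictionary lookups only; the facts are not rescanned.

    `dims` must list dimensions in the same order as DIMS.
    """
    return cube[tuple(dims)][tuple(key)]

print(len(cube))                                      # number of precomputed group-bys
print(lookup(["region"], ["east"]))                   # total for one region
print(lookup(["product", "year"], ["laptop", 2024]))  # total for one product and year
print(lookup([], []))                                 # grand total
```

With n dimensions there are 2^n group-bys to precompute, which is the price paid for the fast response time.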

3. Hybrid OLAP (HOLAP):

There is no clear agreement across the industry as to what constitutes Hybrid OLAP,
except that a database will divide data between relational and specialized storage.
For example, for some vendors, a HOLAP database will use relational tables to hold
the larger quantities of detailed data, and use specialized storage for at least some
aspects of the smaller quantities of more-aggregate or less-detailed data.
HOLAP addresses the shortcomings of MOLAP and ROLAP by combining the
capabilities of both approaches.
HOLAP tools can utilize both pre-calculated cubes and relational data sources.
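
One possible division of labor can be sketched as below. Since the notes point out that vendors disagree on what HOLAP means, this is only one assumed arrangement: detail rows stay in a relational table (Python's sqlite3 here), a pre-aggregated summary is held in a dictionary, and all names are hypothetical.

```python
import sqlite3

# Relational storage: the larger quantity of detailed data.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("east", "laptop", 100.0), ("east", "phone", 40.0), ("west", "laptop", 90.0)])

# Specialized storage: a small pre-aggregated summary held in memory.
summary_by_region = dict(db.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))

def total_for_region(region):
    """Aggregate question: answered from the pre-calculated summary."""
    return summary_by_region[region]

def detail_for_region(region):
    """Detail question: answered from the relational table."""
    return db.execute(
        "SELECT product, amount FROM sales WHERE region = ? ORDER BY product",
        (region,)).fetchall()

print(total_for_region("east"))
print(detail_for_region("east"))
```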


