DWDM Unit-1
Data warehousing is an architectural construct of information systems that provides users with current and historical decision-support information that is hard to access or present in traditional operational data stores.
Data Mart: A departmental subset of the data warehouse; data marts are individual data warehouses which are usually smaller than the corporate data warehouse.
Decision Support System (DSS): Information technology that helps the knowledge worker (executive, manager, analyst) make faster and better decisions.
Drill-down: Traversing the summarization levels from highly summarized data down to the underlying current or historical detail.
Metadata: Data about data. It contains the location and description of warehouse system components: names, definitions, structure, and so on.
Informational Data:
Focused on providing answers to problems posed by decision makers
Summarized
Non-updateable
The source data for a data warehouse comes from operational applications. Data entered into the data warehouse is transformed into an integrated structure and format. The transformation process involves conversion, summarization, filtering, and condensation. The data warehouse must be capable of holding and managing large volumes of data, as well as data structures that vary over time.
1. Data warehouse database
This is the central part of the data warehousing environment. It is item number 2 in the architecture diagram above, and is implemented using RDBMS technology.
2. Sourcing, Acquisition, Clean-up, and Transformation Tools
These are item number 1 in the architecture diagram above. They perform conversions, summarization, key changes, structural changes, and condensation. The data transformation is required so that the information can be used by decision support tools. The transformation produces programs, control statements, JCL code, COBOL code, UNIX scripts, SQL DDL code, etc., to move the data into the data warehouse from multiple operational systems.
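The transformation step can be pictured with a small sketch. The following Python fragment is illustrative only: the record layout, field names, and rules are invented, and real warehouse loads are produced as the generated programs and scripts listed above.

```python
from collections import defaultdict

# Hypothetical operational records; field names are illustrative only.
operational_rows = [
    {"cust_id": "C1", "region": "south", "amount": "120.50", "status": "ok"},
    {"cust_id": "C2", "region": "north", "amount": "80.00",  "status": "bad"},
    {"cust_id": "C1", "region": "south", "amount": "35.25",  "status": "ok"},
]

def transform(rows):
    """Convert types, filter out bad records, and summarize by customer."""
    summary = defaultdict(float)
    for row in rows:
        if row["status"] != "ok":           # filtering / clean-up
            continue
        amount = float(row["amount"])       # conversion to warehouse types
        summary[row["cust_id"]] += amount   # summarization (condensation)
    return summary

print(transform(operational_rows))          # {'C1': 155.75}
```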
3. Meta data
It is data about data, used for maintaining, managing, and using the data warehouse. It is classified into two types:
1. Technical Meta data: It contains information about warehouse data used by the warehouse designer and administrator to carry out development and management tasks. It includes:
Info about data stores
Transformation descriptions, i.e., the mapping methods from operational databases to the warehouse database
Warehouse object and data structure definitions for target data
The rules used to perform clean-up and data enhancement
Data mapping operations
Access authorization, backup history, archive history, info delivery history, data acquisition history, data access, etc.
2. Business Meta data: It contains information that helps users understand the data stored in the data warehouse. It includes:
Subject areas and info object types, including queries, reports, images, video, and audio clips, etc.
Internet home pages
Info related to the info delivery system
Data warehouse operational info such as ownerships, audit trails, etc.
Meta data helps users understand the content and find the data. Meta data is stored in a separate data store known as the information directory or Meta data repository, which helps to integrate, maintain, and view the contents of the data warehouse.
The following lists the characteristics of the info directory / Meta data:
It is the gateway to the data warehouse environment
It supports easy distribution and replication of content for high performance and availability
It should be searchable by business-oriented key words
It should act as a launch platform for end users to access data and analysis tools
It should support the sharing of info
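As a rough illustration of what the repository holds, the sketch below models one technical-metadata mapping entry and a keyword search over the information directory. All names and fields are hypothetical.

```python
# One hypothetical technical-metadata entry describing a source-to-target
# mapping; real repositories store many such entries plus audit history.
mapping_entry = {
    "target_table": "warehouse.sales_fact",
    "target_column": "sales_amount",
    "source_system": "orders_oltp",
    "source_column": "order_line.net_amt",
    "transformation": "SUM(net_amt) grouped by day and product",
    "cleanup_rule": "drop rows where net_amt IS NULL",
}

def find_by_keyword(repository, keyword):
    """Business-oriented keyword search over the information directory."""
    return [entry for entry in repository
            if any(keyword in str(value) for value in entry.values())]

print(find_by_keyword([mapping_entry], "sales"))
```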
Once the EDW is implemented, we start building subject-area-specific data marts, which contain data in a denormalized form, also called a star schema. The data in the marts is usually summarized based on the end users' analytical requirements. The reason to denormalize the data in the mart is to provide faster access for end-user analytics. If we were to query a normalized schema for the same analytics, we would end up with complex multi-level joins that would be much slower than the equivalent query on the denormalized schema.
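A minimal star-schema sketch, using Python's built-in sqlite3 module: one fact table joined directly to denormalized dimension tables, so an analytical query needs only one join per dimension. Table and column names are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, year INT, quarter INT);
CREATE TABLE dim_item (item_id INTEGER PRIMARY KEY, name TEXT, brand TEXT);
CREATE TABLE sales_fact (
    time_id INT REFERENCES dim_time(time_id),
    item_id INT REFERENCES dim_item(item_id),
    units_sold INT,
    dollars_sold REAL
);
INSERT INTO dim_time VALUES (1, 2024, 1), (2, 2024, 2);
INSERT INTO dim_item VALUES (10, 'Widget', 'Acme');
INSERT INTO sales_fact VALUES (1, 10, 5, 50.0), (2, 10, 3, 30.0);
""")

-- = None  # (ignore) --

# One join per dimension reaches the fact data: no multi-level joins.
for row in con.execute("""
    SELECT t.year, i.brand, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN dim_time t ON f.time_id = t.time_id
    JOIN dim_item i ON f.item_id = i.item_id
    GROUP BY t.year, i.brand
"""):
    print(row)   # (2024, 'Acme', 80.0)
```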
We should implement the top-down approach when:
1. The business has complete clarity on the data warehouse requirements for all or multiple subject areas.
2. The business is ready to invest considerable time and money.
The advantage of using the Top Down approach is that we build a centralized repository that caters for one version of the truth for business data. This is very important for the data to be reliable and consistent across subject areas, and for reconciliation in case of data-related contention between subject areas.
The disadvantage of using the Top Down approach is that it requires more time and initial investment. The business has to wait for the EDW to be implemented, followed by the building of the data marts, before they can access their reports.
Bottom Up Approach
The bottom-up approach, suggested by Ralph Kimball, is an incremental approach to building a data warehouse. Here we build the data marts separately at different points in time, as and when the specific subject-area requirements are clear. The data marts are then integrated or combined to form a data warehouse. Separate data marts are combined through the use of conformed dimensions and conformed facts. A conformed dimension or conformed fact is one that can be shared across data marts.
A conformed dimension has consistent dimension keys, consistent attribute names, and consistent values across separate data marts. A conformed dimension means exactly the same thing with every fact table it is joined to.
A conformed fact has the same definition of measures, the same dimensions joined to it, and the same granularity across data marts.
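A toy sketch of a conformed dimension shared by two marts; the tables and keys below are invented. Because both fact tables reference the same product keys with the same meaning, results from the two marts can be combined at the warehouse level.

```python
# A conformed dimension is defined once and shared: each mart's fact
# table uses the same keys and attribute meanings.
dim_product = {101: {"name": "Widget", "category": "Hardware"}}

sales_fact    = [{"product_id": 101, "units": 5}]    # sales mart
shipping_fact = [{"product_id": 101, "shipped": 4}]  # shipping mart

# Both marts join on the same product_id with the same meaning, so their
# results line up when combined.
for fact in sales_fact + shipping_fact:
    print(dim_product[fact["product_id"]]["name"], fact)
```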
The bottom-up approach helps us build the warehouse incrementally, by developing and integrating data marts as and when their requirements become clear. We do not have to wait to know the overall requirements of the warehouse.
We should implement the bottom-up approach when:
1. We have initial cost and time constraints.
2. The complete warehouse requirements are not clear, and we have clarity on only one data mart.
The advantage of using the Bottom Up approach is that it does not require high initial costs and has a faster implementation time; hence the business can start using the marts much earlier compared to the top-down approach.
The disadvantages of using the Bottom Up approach are that it stores data in the denormalized format, so there is high space usage for detailed data, and there is a tendency not to keep detailed data in this approach, thereby losing the advantage of having detail data, i.e., the flexibility to easily cater to future requirements. The bottom-up approach is more realistic, but the complexity of integrating the separate data marts may become a serious obstacle.
Design considerations
To be successful, a data warehouse designer must adopt a holistic approach: consider all data warehouse components as parts of a single complex system, and take into account all possible data sources and all known usage requirements.
Most successful data warehouses that meet these requirements have these common characteristics:
Are based on a dimensional model
Contain historical and current data
Include both detailed and summarized data
Consolidate disparate data from multiple sources while retaining consistency
A data warehouse is difficult to build due to the following reasons:
Heterogeneity of data sources
Use of historical data
The growing nature of the database
The data warehouse design approach must be business-driven, continuous, and iterative. In addition to these general considerations, the following specific points are relevant to data warehouse design:
Data content
The content and structure of the data warehouse are reflected in its data model. The data model is the template that describes how information will be organized within the integrated warehouse framework. The data warehouse data must be detailed data; it must be formatted, cleaned up, and transformed to fit the warehouse data model.
Meta data
It defines the location and contents of data in the warehouse. Meta data is searchable by users to find definitions or subject areas. In other words, it must provide decision-support-oriented pointers to warehouse data, and thus provide a logical link between warehouse data and decision support applications.
Data distribution
One of the biggest challenges when designing a data warehouse is the data placement and distribution strategy. Data volumes continue to grow. Therefore, it becomes necessary to know how the data should be divided across multiple servers, and which users should get access to which types of data. The data can be distributed based on the subject area, location (geographical region), or time (current, month, year).
Tools
A number of tools are available that are specifically designed to help in the implementation of
the data warehouse. All selected tools must be compatible with the given data warehouse environment
and with each other. All tools must be able to use a common Meta data repository.
Design steps
The following nine-step method is followed in the design of a data warehouse:
1. Choosing the subject matter
2. Deciding what a fact table represents
3. Identifying and conforming the dimensions
4. Choosing the facts
5. Storing pre-calculations in the fact table
6. Rounding out the dimension tables
7. Choosing the duration of the database
8. The need to track slowly changing dimensions
9. Deciding the query priorities and query models
Technical considerations
A number of technical issues are to be considered when designing a data warehouse
environment. These issues include:
The hardware platform that would house the data warehouse
The DBMS that supports the warehouse data
The communication infrastructure that connects data marts, operational systems, and end users
The hardware and software to support the meta data repository
The systems management framework that enables administration of the entire environment
Implementation considerations
The following logical steps are needed to implement a data warehouse:
Collect and analyze business requirements
Create a data model and a physical design
Define data sources
Choose the DB tech and platform
Extract the data from operational DB, transform it, clean it up and load it into the
warehouse
Choose DB access and reporting tools
Choose DB connectivity software
Choose data analysis and presentation s/w
Update the data warehouse
Access tools
Data warehouse implementation relies on selecting suitable data access tools. The best way to choose these is based on the type of data that can be selected using the tool and the kind of access it permits for a particular user. The following lists the various types of data that can be accessed:
Simple tabular form data
Ranking data
Multivariable data
Time series data
User levels
The users of data warehouse data can be classified on the basis of their skill level in accessing the
warehouse.
There are three classes of users:
Casual users: are most comfortable retrieving info from the warehouse in predefined formats and running pre-existing queries and reports. These users do not need tools that allow for building standard and ad hoc reports.
Power users: can use predefined as well as user-defined queries to create simple and ad hoc reports. These users can engage in drill-down operations. These users may have experience of using reporting and query tools.
Horizontal parallelism: the database is partitioned across multiple disks, and parallel processing occurs within a specific task that is performed concurrently on different processors against different sets of data.
Vertical parallelism: occurs among different tasks. All query components, such as scan, join, sort, etc., are executed in parallel in a pipelined fashion; in other words, the output from one task becomes an input to another task.
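The sketch below models vertical parallelism with Python generators: the output of the scan task streams into the filter task, which streams into the aggregation. This is only an analogy for the pipelining; a real engine runs each stage on a different processor.

```python
# Vertical (pipelined) parallelism: each stage consumes the previous
# stage's output as a stream of rows.
def scan(table):
    for row in table:              # stage 1: table scan
        yield row

def select(rows, predicate):
    for row in rows:               # stage 2: filter
        if predicate(row):
            yield row

table = [{"amount": a} for a in (10, 25, 40)]
pipeline = select(scan(table), lambda r: r["amount"] > 15)
print(sum(r["amount"] for r in pipeline))   # stage 3: aggregate -> 65
```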
Data partitioning:
Data partitioning is the key component for effective parallel execution of database operations. Partitioning can be done randomly or intelligently.
Random partitioning includes random data striping across multiple disks on a single server. Another option for random partitioning is round-robin partitioning, in which each record is placed on the next disk assigned to the database.
Intelligent partitioning assumes that the DBMS knows where a specific record is located and does not waste time searching for it across all disks.
The various intelligent partitioning options include:
Hash partitioning: a hash algorithm is used to calculate the partition number based on the value of the partitioning key for each row.
Key range partitioning: rows are placed and located in the partitions according to the value of the partitioning key; that is, all rows with key values from A to K are in partition 1, L to T in partition 2, and so on.
Schema partitioning: an entire table is placed on one disk, another table is placed on a different disk, etc. This is useful for small reference tables.
User-defined partitioning: allows a table to be partitioned on the basis of a user-defined expression.
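The sketch below illustrates, in Python, how the hash, key-range, and round-robin schemes assign a row's partition from its key. The partition count and key ranges are arbitrary choices for the example.

```python
import itertools

NUM_PARTITIONS = 4

def hash_partition(key):
    """Hash partitioning: partition number computed from the key.
    (Python's str hash is randomized per process, so values vary.)"""
    return hash(key) % NUM_PARTITIONS

def key_range_partition(key):
    """Key-range partitioning: A-K -> 0, L-T -> 1, U-Z -> 2."""
    first = key[0].upper()
    if first <= "K":
        return 0
    return 1 if first <= "T" else 2

# Random (round-robin) partitioning: each record goes to the next disk.
round_robin = itertools.cycle(range(NUM_PARTITIONS))

for name in ("Adams", "Lopez", "Young"):
    print(name, hash_partition(name), key_range_partition(name),
          next(round_robin))
```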
The cluster illustrated in the figure is composed of multiple loosely coupled nodes sharing disks; a Distributed Lock Manager (DLM) is required. Examples of loosely coupled systems are VAX clusters or Sun clusters.
Since the memory is not shared among the nodes, each node has its own data cache. Cache
consistency must be maintained across the nodes and a lock manager is needed to maintain the
consistency. Additionally, instance locks using the DLM on the Oracle level must be maintained to
ensure that all nodes in the cluster see identical data.
There is additional overhead in maintaining the locks and ensuring that the data caches are consistent. The performance impact depends on the hardware and software components, such as the bandwidth of the high-speed bus through which the nodes communicate, and on DLM performance.
Parallel processing advantages of shared disk systems are as follows:
Shared disk systems permit high availability. All data is accessible even if one node dies.
These systems have the concept of one database, which is an advantage over shared nothing
systems.
Shared disk systems provide for incremental growth.
Parallel processing disadvantages of shared disk systems are these:
Inter-node synchronization is required, involving DLM overhead and greater
dependency on high-speed interconnect.
If the workload is not partitioned well there may be high synchronization overhead.
There is operating system overhead of running shared disk software.
Shared nothing systems are concerned with access to disks, not access to memory. Nonetheless,
adding more PUs and disks can improve scale up. Oracle Parallel Server can access the disks on a shared
nothing system as long as the operating system provides transparent disk access, but this access is
expensive in terms of latency.
Shared nothing systems have advantages and disadvantages for parallel processing:
Advantages
Shared nothing systems provide for incremental growth.
System growth is practically unlimited.
MPPs are good for read-only databases and decision support applications.
Failure is local: if one node fails, the others stay up.
Disadvantages
More coordination is required.
More overhead is required for a process working on a disk belonging to another node.
If there is a heavy workload of updates or inserts, as in an online transaction processing system, it may
be worthwhile to consider data-dependent routing to alleviate contention.
Parallel operations: Informix Online 7 executes queries, INSERT, UPDATE, and DELETE operations in parallel.
3. IBM: IBM's parallel client/server database product is DB2 Parallel Edition (DB2 PE), a database based on the DB2/6000 server architecture.
Architecture: DB2 PE is a shared-nothing model, in which all data is partitioned across processor nodes.
All database operations and utilities are fully parallelized.
DB2 PE can run on LAN-based clusters.
Data partition: DB2 PE supports hash partitioning and node groups that allow a table to span multiple nodes. The DBA can choose to partition a table on a table-by-table basis.
Parallel operations: all database operations (query processing, INSERT, UPDATE, DELETE, load, recovery, index creation, backup, table reorganization) are fully parallelized.
4. SYBASE:
SYBASE has implemented its parallel DBMS functionality in a product called SYBASE MPP, jointly developed by Sybase and NCR.
Architecture: SYBASE MPP is designed to make multiple distributed SQL Servers look like a single server to the user. It is a shared-nothing system that partitions data across multiple SQL Servers and supports both function shipping and data repartitioning. SYBASE MPP is an Open Server application that operates on top of existing SQL Servers.
SYBASE MPP consists of specialized servers: Data Server, DBA Server, and Administrative Server.
Data partition: it supports hash, key range, and schema partitioning.
Parallel operations: all SQL statements and utilities are executed in parallel across the SQL Servers. SYBASE MPP supports both horizontal and vertical parallelism.
Entropy-Based Discretization
Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the information after partitioning is

$$I(S,T) = \frac{|S_1|}{|S|}\,\mathrm{Entropy}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Entropy}(S_2)$$

Entropy is calculated based on the class distribution of the samples in the set. Given m classes, the entropy of S1 is

$$\mathrm{Entropy}(S_1) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

where p_i is the probability of class i in S1.
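A small Python implementation of the two formulas above, choosing the boundary T that minimizes I(S, T) over a set of hypothetical (value, class) samples:

```python
import math
from collections import Counter

def entropy(labels):
    """Class-distribution entropy: -sum(p_i * log2(p_i))."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_after_split(samples, boundary):
    """I(S, T): weighted entropy of the two intervals induced by T."""
    s1 = [label for value, label in samples if value < boundary]
    s2 = [label for value, label in samples if value >= boundary]
    n = len(samples)
    return len(s1) / n * entropy(s1) + len(s2) / n * entropy(s2)

# Hypothetical (value, class) samples; the boundary minimizing I(S, T)
# is chosen as the discretization split point.
samples = [(1, "a"), (2, "a"), (3, "b"), (8, "b"), (9, "b")]
best = min({v for v, _ in samples}, key=lambda t: info_after_split(samples, t))
print(best, info_after_split(samples, best))   # 3 0.0 (a perfect split)
```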
Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct
values
E.g., for a set of attributes: {street, city, state, country}
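A sketch of this heuristic in Python: the attribute with the fewest distinct values is placed at the top of the hierarchy. The sample rows are invented for illustration.

```python
rows = [
    {"street": "1 Elm", "city": "Vancouver", "state": "BC", "country": "Canada"},
    {"street": "2 Oak", "city": "Victoria",  "state": "BC", "country": "Canada"},
    {"street": "3 Ash", "city": "Toronto",   "state": "ON", "country": "Canada"},
    {"street": "4 Fir", "city": "Vancouver", "state": "BC", "country": "Canada"},
]

# Order attributes by ascending number of distinct values: fewest
# distinct values -> highest hierarchy level.
attrs = ["street", "city", "state", "country"]
hierarchy = sorted(attrs, key=lambda a: len({r[a] for r in rows}))
print(hierarchy)   # ['country', 'state', 'city', 'street']
```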
The figure below represents a 2-D view of sales details for the city Vancouver with respect to the dimensions time and item.
The figure below represents a 3-D view of sales details with respect to the dimensions time, item, and location. The 3-D data is represented as a series of 2-D tables.
The figure below represents a 3-D data cube view of sales details with respect to the dimensions time, item, and location.
The figure below represents a 4-D data cube view of sales details with respect to the dimensions time, item, location, and supplier.
In the data warehouse literature, each of the cubes shown above is referred to as a cuboid. A data cube consists of a lattice of cuboids, each showing the data at a different level of summarization; the lattice of cuboids is thus referred to as a data cube.
The figure shows a lattice of cuboids forming a data cube for the dimensions time, item, location, and supplier. The cuboid which holds the lowest level of summarization is called the base cuboid. For example, the 4-D cuboid in the figure is the base cuboid for the given time, item, location, and supplier dimensions; a 3-D (non-base) cuboid holds the data for time, item, and location, summarized for all suppliers. The 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The apex cuboid is typically denoted by all.
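Since every subset of the dimensions defines one cuboid, an n-dimensional cube (ignoring concept hierarchies) has 2^n cuboids. The sketch below enumerates the lattice for the four dimensions above:

```python
from itertools import combinations

dimensions = ("time", "item", "location", "supplier")

# Every subset of dimensions is one cuboid in the lattice.
cuboids = [combo for k in range(len(dimensions) + 1)
           for combo in combinations(dimensions, k)]

print(len(cuboids))   # 16 cuboids for 4 dimensions (2**4)
print(cuboids[0])     # () -> the apex cuboid ("all")
print(cuboids[-1])    # the 4-D base cuboid (lowest summarization)
```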
The major difference between the snowflake and star schema models is that the dimension tables of the snowflake model may be kept in normalized form. Such tables are easy to maintain and save storage space. However, the snowflake structure can reduce the effectiveness of browsing, since more joins will be needed to execute a query; this may affect system performance.
Fig:- Snowflake schema of a data warehouse for sales.
A compromise between the star schema and the snowflake schema is to adopt
a mixed schema where only the very large dimension tables are normalized.
Normalizing large dimension tables saves storage space, while keeping small
dimension tables unnormalized may reduce the cost and performance degradation
due to joins on multiple dimension tables. Doing both may lead to an overall
performance gain.
Fact constellation: Sophisticated applications may require multiple fact tables to
share dimension tables. This kind of schema can be viewed as a collection of stars,
and hence is called a galaxy schema or a fact constellation.
Fig:- Fact constellation schema of a data warehouse for sales and shipping.
Other operations
drill across: involving (across) more than one fact table
drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general, we can assume that OLTP systems provide source data to data warehouses, whereas OLAP systems help to analyze it.
- OLTP (On-line Transaction Processing) is characterized by a large number of short on-line transactions (INSERT, UPDATE, DELETE). The main emphasis for OLTP systems is very fast query processing and maintaining data integrity in multi-access environments; effectiveness is measured by the number of transactions per second. An OLTP database contains detailed, current data, and the schema used to store transactional databases is the entity model (usually 3NF).
- OLAP (On-line Analytical Processing) is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is the effectiveness measure. OLAP applications are widely used in data mining. An OLAP database contains aggregated, historical data, stored in multi-dimensional schemas (usually the star schema).
The following table summarizes the major differences between OLTP and OLAP system design, drawn from the points above:

                 OLTP                                  OLAP
Purpose          Run day-to-day transactions           Analysis and decision support
Transactions     Many short INSERT/UPDATE/DELETEs      Low volume; complex queries with aggregations
Data             Detailed, current                     Aggregated, historical
Schema           Entity model (usually 3NF)            Multi-dimensional (usually star schema)
Effectiveness    Transactions per second               Query response time
For example, a marketing data mart may confine its subjects to customer, item, and sales. The data contained in data marts tends to be summarized. Data marts are usually implemented on low-cost UNIX or Windows/NT servers, and the implementation of a data mart takes weeks rather than months or years.
Depending on the source of data, data marts can be categorized into the following two classes: Independent data marts are sourced from data captured from one or more operational systems or external information providers, or from data generated locally within a particular department or geographic area. Dependent data marts are sourced directly from enterprise data warehouses.
Virtual warehouse: A virtual warehouse is a set of views over operational
databases. For efficient query processing, only some of the possible summary
views may be materialized. A virtual warehouse is easy to build but requires
excess capacity on operational database servers.
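A minimal sketch of a virtual warehouse using Python's sqlite3 module: a summary view defined directly over an operational table, so queries consume capacity on the operational server. Table and column names are illustrative.

```python
import sqlite3

# The "operational database" with a summary view defined over it; the
# view is the virtual warehouse (nothing is copied or materialized).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (region TEXT, amount REAL);
INSERT INTO orders VALUES ('east', 100.0), ('east', 50.0), ('west', 75.0);
CREATE VIEW sales_by_region AS
    SELECT region, SUM(amount) AS total FROM orders GROUP BY region;
""")

print(con.execute("SELECT * FROM sales_by_region").fetchall())
# [('east', 150.0), ('west', 75.0)]
```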