DWDM
Digital Notes
Course : B.TECH
Branch : CS-AIML
Semester : 7th
UNIT-I: DATA WAREHOUSING
The term Data Warehouse was coined by Bill Inmon in 1990. According to Bill Inmon:
"A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection
of data in support of management's decision making process".
Note: A data warehouse does not require transaction processing, recovery, and concurrency
controls, because it is physically stored separately from the operational database.
Consumer goods
Banking services
Financial services
Manufacturing
Retail sectors
Data Warehouse Components:
1. Data warehouse database: This is the central part of the data warehousing
environment. The database is implemented on RDBMS technology.
2. Data Source: Source data can be grouped into 4 categories:
Production Data: comes from the operational systems of the enterprise
Internal Data: private datasheets, documents, customer profiles etc.
Archived Data: old data that has been archived
External Data.
3. Data Staging: After data has been extracted from the various operational systems and
external sources, the files must be prepared for storage in the data warehouse. The
extracted data, coming from several different sources, needs to be changed, converted,
and made ready in a format that is suitable for querying and analysis.
Extraction: Data coming from different data sources is extracted in different ways.
The extraction technique differs for each data source.
transformation function ends, we have a collection of integrated data that is cleaned,
standardised, and summarised.
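The staging steps described above (extract from several sources, convert to a common format, then summarise) can be sketched roughly as follows. This is only an illustrative sketch: the two source layouts, the field names and the helper functions are all hypothetical and are not taken from any particular ETL tool.

```python
from datetime import datetime

# Hypothetical raw extracts from two sources that use different conventions.
production_rows = [
    {"cust": " Alice ", "amount": "120.50", "date": "2023-01-15"},
    {"cust": "BOB", "amount": "80", "date": "2023-01-16"},
]
external_rows = [
    {"customer_name": "alice", "amt_cents": 3000, "day": "17/01/2023"},
]


def transform_production(row):
    """Clean and standardise one row of the (assumed) production extract."""
    return {
        "customer": row["cust"].strip().title(),
        "amount": float(row["amount"]),
        "date": datetime.strptime(row["date"], "%Y-%m-%d").date(),
    }


def transform_external(row):
    """Convert one row of the (assumed) external extract to the same format."""
    return {
        "customer": row["customer_name"].strip().title(),
        "amount": row["amt_cents"] / 100,
        "date": datetime.strptime(row["day"], "%d/%m/%Y").date(),
    }


def stage(production, external):
    """Integrate both extracts, then summarise the amount per customer."""
    cleaned = [transform_production(r) for r in production]
    cleaned += [transform_external(r) for r in external]
    totals = {}
    for r in cleaned:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return cleaned, totals


cleaned, totals = stage(production_rows, external_rows)
```

After staging, every row has the same three fields regardless of its source, and `totals` holds one summarised figure per customer.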
Metadata:
It is data about data. It stores the location and types of all the data in the data warehouse.
It is used for maintaining, managing and using the data warehouse. It is classified into two
types:
1. Technical metadata: It contains information about warehouse data that is used
by warehouse designers and administrators to carry out development and
management tasks. It includes:
Information about data stores
Transaction descriptions.
2. Business metadata: It contains information that helps users understand the
information stored in the data warehouse.
4. Access tools
Their purpose is to provide information to business users for decision making. There are
five categories:
Data query and reporting tools
Application development tools
Executive information system (EIS) tools
Online Analytical Processing (OLAP) tools
Data mining tools
5. Data Marts: A data mart is a subset of the large amount of data present in a system. Each
subset is specific to a particular group of users and is related to a particular subject.
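As a rough illustration of a data mart as a subject-specific subset, the sketch below exposes only the finance-related rows of a warehouse table through a view. The table, the view and the column names are hypothetical, and a real data mart is often a physically separate store rather than a view; the view is used here only to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE warehouse_sales (region TEXT, department TEXT, amount REAL);
INSERT INTO warehouse_sales VALUES
    ('North', 'Marketing', 100.0),
    ('South', 'Finance',   250.0),
    ('North', 'Finance',    75.0);

-- Hypothetical data mart: only the rows the finance group needs.
CREATE VIEW finance_mart AS
    SELECT region, amount FROM warehouse_sales WHERE department = 'Finance';
""")

finance_rows = conn.execute(
    "SELECT region, amount FROM finance_mart ORDER BY region"
).fetchall()
```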
6. Information delivery system
It enables users to subscribe to data warehouse information and have it delivered to
one or more destinations according to a specified scheduling algorithm.
1. Business considerations:-
Organizations interested in development of a data warehouse can choose one of the
following two approaches:
2. Design considerations:- There are several points related to data warehouse design:
a. Data Content: The data warehouse should not contain as much detail-level data as
the operational systems from which the data is sourced.
b. Metadata: It is data about data, that is, a description of the data and its context. It
helps to organize, find and understand data.
c. Data distribution: It becomes necessary to know how the data should be divided
across multiple servers and which users should get access to which type of data.
d. Tools: The tools provide the facilities for defining the transformation and cleanup
1. Access tools: Ranking, statistical analysis, time series analysis, artificial
intelligence, and information mapping are some examples of access tool types.
2. Data extraction, cleanup, transformation and migration.
3. Metadata: It is data about data.
Disadvantage:-
Each node has access to the same disks and other resources.
Bandwidth of the high-speed bus limits the number of nodes (scalability) of the
system.
Distributed Lock Manager (DLM) is required.
The cluster illustrated in the figure is composed of multiple tightly coupled nodes. The
Distributed Lock Manager (DLM) is required. Examples of loosely coupled systems are VAX
clusters or Sun clusters.
Since the memory is not shared among the nodes, each node has its own data cache. Cache
consistency must be maintained across the nodes and a lock manager is needed to maintain
the consistency. Additionally, instance locks using the DLM on the Oracle level must be
maintained to ensure that all nodes in the cluster see identical data.
There is additional overhead in maintaining the locks and ensuring that the data caches are
consistent. The performance impact is dependent on the hardware and software components,
such as the bandwidth of the high-speed bus through which the nodes communicate, and
DLM performance.
Advantages
Shared disk systems permit high availability. All data is accessible even if one node
dies. These systems have the concept of one database, which is an advantage over
shared nothing systems.
Shared nothing systems are typically loosely coupled. In shared nothing systems only
one CPU is connected to a given disk. If a table or database is located on that disk,
access depends entirely on the processing unit (PU) that owns it. Shared nothing systems
can be represented as follows:
Figure: Shared Nothing Architecture
Shared nothing systems are concerned with access to disks, not access to memory.
Nonetheless, adding more PUs and disks can improve scale up. Oracle Parallel Server can
access the disks on a shared nothing system as long as the operating system provides
transparent disk access, but this access is expensive in terms of latency.
Shared nothing systems have advantages and disadvantages for parallel processing:
Advantages
Shared nothing systems provide for incremental growth.
System growth is practically unlimited.
MPPs are good for read-only databases and decision support applications.
Failure is local: if one node fails, the others stay up.
Disadvantages
More coordination is required.
More overhead is required for a process working on a disk belonging to another node.
If there is a heavy workload of updates or inserts, as in an online transaction
processing system, it may be worthwhile to consider data-dependent routing to
alleviate contention.
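The idea of data-dependent routing in a shared nothing system can be sketched as follows: each request is sent to the node that owns the partition for its key, so no node has to work on another node's disk. This is a simplified sketch under stated assumptions; the `Cluster` class, the `owner_node` function and the choice of a CRC32 hash are hypothetical and do not describe how any specific database implements routing.

```python
import zlib


def owner_node(key: str, num_nodes: int) -> int:
    """Map a partitioning key to the node that owns its partition.

    A stable hash (CRC32) is used so the same key always routes to the same node.
    """
    return zlib.crc32(key.encode("utf-8")) % num_nodes


class Cluster:
    """Toy shared nothing cluster: each node has its own private 'disk' (a dict)."""

    def __init__(self, num_nodes: int):
        self.disks = [dict() for _ in range(num_nodes)]

    def insert(self, key: str, row: dict) -> int:
        """Route the insert to the owning node and return that node's index."""
        node = owner_node(key, len(self.disks))
        self.disks[node][key] = row
        return node

    def lookup(self, key: str):
        """Only the owning node's disk is consulted; other nodes are not touched."""
        node = owner_node(key, len(self.disks))
        return self.disks[node].get(key)


cluster = Cluster(num_nodes=3)
for customer_id in ("C001", "C002", "C003", "C004"):
    cluster.insert(customer_id, {"customer_id": customer_id})
```

Because a key always maps to the same node, updates for that key never contend with work on other nodes, which is the contention relief the notes refer to.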
Difference between Database System and Data Warehouse
S.no. Database Data Warehouse
Multidimensional data model stores data in the form of data cube. A data cube allows
data to be viewed in multiple dimensions.
Dimensions are entities with respect to which an organization wants to keep records.
A multidimensional database helps to provide data-related answers to complex
business queries quickly and accurately.
Data warehouses and Online Analytical Processing (OLAP) tools are based on a
multidimensional data model. OLAP in data warehousing enables users to view data
from different angles and dimensions.
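To make the idea of viewing data along different dimensions concrete, here is a minimal sketch of a three-dimensional cube (time, item, location) with one measure (units sold). The sample records and the `aggregate` helper are hypothetical; real OLAP tools provide such aggregation as built-in operations.

```python
from collections import defaultdict

# Hypothetical records: (time, item, location, units_sold)
sales = [
    ("Q1", "phone", "Delhi", 10),
    ("Q1", "laptop", "Delhi", 4),
    ("Q1", "phone", "Mumbai", 7),
    ("Q2", "phone", "Delhi", 12),
]
DIMENSIONS = ("time", "item", "location")


def aggregate(records, keep):
    """Sum the measure over every dimension that is not listed in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    out = defaultdict(int)
    for rec in records:
        out[tuple(rec[i] for i in idx)] += rec[-1]
    return dict(out)


# The same data viewed from two different angles.
by_time = aggregate(sales, ["time"])
by_item_location = aggregate(sales, ["item", "location"])
```

Here `by_time` collapses the item and location dimensions, while `by_item_location` collapses only the time dimension.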
There are three types of multidimensional data model:
1. Star schema model
2. Snowflake schema model
3. Fact constellations
Define data cube.
A data cube allows data to be modeled and viewed in multiple dimensions. Dimensions are
the entities with respect to which an organization wants to keep records. Each dimension
may have a table associated with it, called a dimension table.
Define facts.
A multidimensional data model is typically organized around a central theme and the theme
is represented by a fact table. Facts are numerical measures. Fact table contains the names of
The most common modeling paradigm is the star schema, in which the data warehouse
contains:
1. Data in the form of facts and dimensions. The fact table is at the center of the
star, and the points of the star are the dimension tables.
2. A large central table (fact table) containing the bulk of the data, with no redundancy.
3. A set of smaller attendant tables (dimension tables), one for each dimension. The
schema graph resembles a starburst, with the dimension tables displayed in a radial
pattern around the central fact table. It may have any number of dimension tables, with
a many-to-one relationship between the fact table and each dimension table.
Example: A star schema may be composed of a fact table, SALES, and a number of
dimension tables connected to it for time, product and geographic location.
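A star schema of this shape can be sketched with SQLite from Python as shown below. Only the SALES fact table and the three dimensions (time, product, location) come from the example; the column names and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one per dimension, each with its own key.
CREATE TABLE dim_time     (time_id     INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY, city TEXT, country TEXT);

-- Central fact table: keys to every dimension plus the numerical measures.
CREATE TABLE sales (
    time_id     INTEGER REFERENCES dim_time(time_id),
    product_id  INTEGER REFERENCES dim_product(product_id),
    location_id INTEGER REFERENCES dim_location(location_id),
    units_sold  INTEGER,
    revenue     REAL
);

INSERT INTO dim_time     VALUES (1, 2023, 'Q1'), (2, 2023, 'Q2');
INSERT INTO dim_product  VALUES (1, 'Phone', 'Electronics'), (2, 'Desk', 'Furniture');
INSERT INTO dim_location VALUES (1, 'Delhi', 'India'), (2, 'Mumbai', 'India');
INSERT INTO sales VALUES (1, 1, 1, 10, 5000.0), (1, 2, 2, 3, 900.0), (2, 1, 2, 6, 3000.0);
""")

# One join from the fact table to a dimension table answers a typical query.
revenue_by_category = conn.execute("""
    SELECT p.category, SUM(s.revenue)
    FROM sales s
    JOIN dim_product p ON s.product_id = p.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
```

Many fact rows point to the same dimension row, which is the many-to-one relationship between the fact table and each dimension table.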
A snowflake schema further splits the dimension tables of a star schema into one or more
normalized tables, thereby reducing redundancy. A snowflake schema can have
any number of dimensions and each dimension can have any number of levels.
Example:
Give the advantages and disadvantages of snowflake schema.
Advantage: Dimension tables are kept in a normalized form, so they are easy to maintain
and save storage space.
Disadvantage: It reduces the effectiveness of browsing, since more joins are needed to
execute a query.
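The trade-off can be seen in a small sketch: the product dimension is normalized into a separate category table, so each category name is stored once, but a query by category now needs two joins instead of one. The table names, columns and rows below are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The product dimension is split: category details live in their own table.
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product (
    product_id  INTEGER PRIMARY KEY,
    name        TEXT,
    category_id INTEGER REFERENCES dim_category(category_id)
);
CREATE TABLE sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    revenue    REAL
);

INSERT INTO dim_category VALUES (1, 'Electronics'), (2, 'Furniture');
INSERT INTO dim_product  VALUES (1, 'Phone', 1), (2, 'Laptop', 1), (3, 'Desk', 2);
INSERT INTO sales        VALUES (1, 5000.0), (2, 7000.0), (3, 900.0);
""")

# Two joins are needed to reach the category name from the fact table.
snowflake_revenue = conn.execute("""
    SELECT c.category_name, SUM(s.revenue)
    FROM sales s
    JOIN dim_product  p ON s.product_id  = p.product_id
    JOIN dim_category c ON p.category_id = c.category_id
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()
```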
Fact Constellations:
A fact constellation can have multiple fact tables that share many dimension tables.
This type of schema can be viewed as a collection of star schemas and hence is
called a galaxy schema.
Fact Constellation Schema describes a logical structure of a data warehouse or data
mart. A Fact Constellation Schema can be designed with a collection of de-normalized
Fact, Shared, and Conformed Dimension tables.
The main disadvantage of fact constellation schemas is their more complicated design.
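A minimal sketch of a fact constellation is given below: two fact tables (sales and shipping) share one product dimension. The table names, columns and rows are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One dimension table shared by both fact tables.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);

CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    units_sold INTEGER
);
CREATE TABLE fact_shipping (
    product_id    INTEGER REFERENCES dim_product(product_id),
    units_shipped INTEGER
);

INSERT INTO dim_product   VALUES (1, 'Phone'), (2, 'Desk');
INSERT INTO fact_sales    VALUES (1, 10), (1, 5), (2, 3);
INSERT INTO fact_shipping VALUES (1, 12), (2, 3);
""")

# The shared dimension lets measures from both fact tables be compared per product.
sold_vs_shipped = conn.execute("""
    SELECT p.name,
           (SELECT SUM(units_sold)    FROM fact_sales s    WHERE s.product_id = p.product_id),
           (SELECT SUM(units_shipped) FROM fact_shipping h WHERE h.product_id = p.product_id)
    FROM dim_product p
    ORDER BY p.name
""").fetchall()
```

Each fact table is aggregated separately before being compared; joining the two fact tables directly on the product key would multiply rows and inflate the sums.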
4    It saves space due to its single fact table.    It does not save space due to its multiple fact tables.