DWDM UNIT-1 Lecture Notes
Introduction to Data warehouse:
A data warehouse is a repository of information collected from multiple sources, stored under a
unified schema, and that usually resides at a single site. Data warehouses are constructed via a
process of data cleaning, data integration, data transformation, data loading, and periodic data
refreshing. The figure below shows the typical framework for the construction and use of a data
warehouse, using an AllElectronics sales warehouse as the example.
Time-variant: Data are stored to provide information from a historical perspective (e.g., the past
5–10 years). Every key structure in the data warehouse contains, either implicitly or explicitly,
an element of time.
Non-volatile: A data warehouse is always a physically separate store of data, transformed from
the application data found in the operational environment, and is permanent in nature.
Operational Database Systems: The major task of on-line operational database systems is to
perform on-line transaction and query processing. These systems are called on-line transaction
processing (OLTP) systems. They cover most of the day-to-day operations of an organization.
Data warehouse systems: The major task of Data warehouse systems is to serve users or
knowledge workers in the role of data analysis and decision making. These systems are known as
on-line analytical processing (OLAP) systems.
The major distinguishing features between OLTP and OLAP are summarized as follows:
Feature           OLTP                                  OLAP
Characteristic    operational processing                informational processing
Orientation       transaction                           analysis
User              clerk, DBA, database professional     knowledge worker (e.g., manager, executive, analyst)
Function          day-to-day operations                 long-term informational requirements, decision support
DB design         ER based, application-oriented        star/snowflake, subject-oriented
Data              current; guaranteed up-to-date        historical; accuracy maintained over time
Summarization     primitive, highly detailed            summarized, consolidated
View              detailed, flat relational             summarized, multidimensional
Unit of work      short, simple transaction             complex query
Access            read/write                            mostly read
Focus             data in                               information out
Operations        index/hash on primary key             lots of scans
Records accessed  tens                                  millions
Number of users   thousands                             hundreds
DB size           100 MB to GB                          100 GB to TB
Priority          high performance, high availability   high flexibility, end-user autonomy
Metric            transaction throughput                query throughput, response time
Bottom Tier - The bottom tier of the architecture is the data warehouse database server. It is the
relational database system. Back end tools and utilities are used to feed data into the bottom tier
from operational databases or other external sources. These back-end tools and utilities perform
the extraction, cleaning, and transformation functions, as well as the load and refresh functions used to update the
data warehouse. This tier also contains a metadata repository, which stores information about the
data warehouse and its contents.
Middle Tier: The middle tier is an OLAP server, typically implemented using either a relational
OLAP (ROLAP) model or a multidimensional OLAP (MOLAP) model.
Top-Tier - This tier is the front-end client layer. This layer holds the query tools and reporting
tools, analysis tools and data mining tools.
Enterprise warehouse: An enterprise warehouse collects all of the information about subjects
spanning the entire organization. It provides corporate-wide data integration, usually from one or
more operational systems or external information providers, and is cross-functional in scope.
Data mart: A data mart contains a subset of corporate-wide data that is of value to a specific
group of users. The scope is confined to specific selected subjects. For example, a marketing data
mart may confine its subjects to customer, item, and sales. The data contained in data marts tend
to be summarized.
Virtual warehouse: A virtual warehouse is a set of views over operational databases. For
efficient query processing, only some of the possible summary views may be materialized. A
virtual warehouse is easy to build but requires excess capacity on operational database servers.
Data Extraction: Typically gathers data from multiple, heterogeneous, and external sources.
The main objective of the extract step is to retrieve all the required data from the source system
using as few resources as possible.
Data cleaning: The cleaning step is one of the most important, as it ensures the quality of the
data in the data warehouse: it detects errors in the data and rectifies them.
Data transformation: The transform step applies a set of rules to transform the data from the
source to the target. This converts data from legacy or host format to warehouse format.
Load: The load step sorts, summarizes, consolidates, computes views, checks integrity, and
builds indices and partitions.
Refresh: The refresh step propagates updates from the data sources to the warehouse.
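As a concrete illustration, the extraction, cleaning, transformation, and load steps above can be sketched in Python. This is a minimal sketch; the source records, field names, and cleaning rules are invented for the example:

```python
# Minimal ETL sketch: extract from two hypothetical operational sources,
# clean, transform to a common warehouse format, and load.

def extract(sources):
    """Gather raw records from multiple heterogeneous sources."""
    return [row for source in sources for row in source]

def clean(rows):
    """Detect and rectify errors: drop records with missing keys,
    normalize inconsistent country codes."""
    fixes = {"USA": "US", "U.S.": "US"}
    cleaned = []
    for row in rows:
        if row.get("item_id") is None:
            continue  # reject records that cannot be keyed
        row["country"] = fixes.get(row["country"], row["country"])
        cleaned.append(row)
    return cleaned

def transform(rows):
    """Convert source (legacy) format to warehouse format:
    cents -> dollars, rename fields."""
    return [
        {"item_id": r["item_id"],
         "country": r["country"],
         "dollars_sold": r["amount_cents"] / 100}
        for r in rows
    ]

def load(rows):
    """Summarize and consolidate while loading: total sales per
    (item_id, country) plays the role of a precomputed view."""
    summary = {}
    for r in rows:
        key = (r["item_id"], r["country"])
        summary[key] = summary.get(key, 0) + r["dollars_sold"]
    return summary

# Two operational sources with slightly inconsistent data.
source_a = [{"item_id": 1, "country": "USA", "amount_cents": 250000}]
source_b = [{"item_id": 1, "country": "US", "amount_cents": 150000},
            {"item_id": None, "country": "US", "amount_cents": 99}]

warehouse = load(transform(clean(extract([source_a, source_b]))))
print(warehouse)  # {(1, 'US'): 4000.0}
```

Note how the inconsistent country codes from the two sources are consolidated into a single warehouse key during cleaning, so the load step can correctly sum across them.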
Metadata Repository:
Metadata are data about data. Metadata are the data that define warehouse objects.
A metadata repository should contain the following:
A description of the structure of the data warehouse, which includes the warehouse
schema, view, dimensions, hierarchies, and derived data definitions, as well as data mart
locations and contents.
Operational metadata, which include data lineage (history of migrated data and the
sequence of transformations applied to it), currency of data (active, archived, or purged),
and monitoring information (warehouse usage statistics, error reports, and audit trails).
The algorithms used for summarization, which include measure and dimension definition
algorithms, data on granularity, partitions, subject areas, aggregation, summarization, and
predefined queries and reports.
The mapping from the operational environment to the data warehouse, which includes
source databases and their contents, gateway descriptions, data partitions, data extraction,
cleaning, transformation rules and defaults, data refresh and purging rules, and security.
Data related to system performance, which include indices and profiles that improve data
access and retrieval performance, in addition to rules for the timing and scheduling of
refresh, update, and replication cycles.
Business metadata, which include business terms and definitions, data ownership
information, and charging policies.
Facts are numerical measures. A multidimensional data model is typically organized around a
central theme, this theme is represented by a fact table. Examples of facts for a sales data
warehouse include dollars sold (sales amount in dollars), and units sold (number of units sold).
The fact table contains the names of the facts, or measures, as well as keys to each of the related
dimension tables.
Figure: A 3-D data cube representation of the data in Table 3.3, according to the dimensions
time, item, and location. The measure displayed is dollars sold (in thousands).
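A fact table of this kind can be sketched as follows; the rows and figures are illustrative, and `cube_cell` shows how a single cell of the 3-D cube over (time, item, location) is derived from the facts:

```python
# Sketch of a sales fact table: each row holds keys into the time, item,
# and location dimensions, plus the numeric measures dollars_sold and
# units_sold. (The specific rows and values are invented.)

fact_sales = [
    # (time_key, item_key,             location_key, dollars_sold, units_sold)
    ("Q1", "home entertainment", "Vancouver", 605, 825),
    ("Q1", "computer",           "Vancouver", 825, 14),
    ("Q2", "home entertainment", "Vancouver", 680, 952),
    ("Q1", "computer",           "Toronto",   968, 38),
]

def cube_cell(facts, time, item, location):
    """Return total dollars_sold for one cell of the 3-D cube."""
    return sum(d for t, i, loc, d, u in facts
               if t == time and i == item and loc == location)

print(cube_cell(fact_sales, "Q1", "computer", "Vancouver"))  # 825
```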
Star schema: The most common modeling paradigm is the star schema, in which the data
warehouse contains (1) a large central table (fact table) containing the bulk of the data, with no
redundancy, and (2) a set of smaller attendant tables (dimension tables), one for each dimension.
The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern
around the central fact table.
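A minimal sketch of a star schema, assuming hypothetical surrogate keys and table contents: one central fact table whose rows reference a small dimension table per dimension, recombined by a "star join":

```python
# Star schema sketch: one dimension table per dimension, keyed by
# surrogate keys, around a central fact table. (Data is hypothetical.)

dim_time = {1: {"quarter": "Q1", "year": 2024}}
dim_item = {10: {"name": "computer", "brand": "Acme"}}
dim_location = {100: {"city": "Vancouver", "country": "Canada"}}

fact_sales = [
    {"time_key": 1, "item_key": 10, "location_key": 100,
     "dollars_sold": 825, "units_sold": 14},
]

def denormalize(fact):
    """Join one fact row with its three dimensions (a 'star join')."""
    return {**dim_time[fact["time_key"]],
            **dim_item[fact["item_key"]],
            **dim_location[fact["location_key"]],
            "dollars_sold": fact["dollars_sold"],
            "units_sold": fact["units_sold"]}

row = denormalize(fact_sales[0])
print(row["city"], row["dollars_sold"])  # Vancouver 825
```

The fact table stays narrow (keys plus measures) while descriptive attributes live once in the dimension tables, which is the redundancy-free layout the schema describes.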
Snowflake schema: The snowflake schema is a variant of the star schema model, where some
dimension tables are normalized, thereby further splitting the data into additional tables. The
resulting schema graph forms a shape similar to a snowflake.
The major difference between the snowflake and star schema models is that the
dimension tables of the snowflake model may be kept in normalized form to reduce
redundancies.
Example: Snowflake schema of a data warehouse for sales.
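The normalization can be sketched as follows (hypothetical data): supplier attributes that a star-style item dimension would repeat per item are split into their own table, removing the redundancy at the cost of an extra join at query time:

```python
# Snowflake sketch: the item dimension is normalized by splitting
# supplier attributes into a separate table. (Data is hypothetical.)

# Star form: supplier attributes are repeated in every item row.
dim_item_star = {
    10: {"name": "computer", "supplier": "SupCo", "supplier_type": "wholesale"},
    11: {"name": "printer",  "supplier": "SupCo", "supplier_type": "wholesale"},
}

# Snowflake form: item rows keep only a supplier_key.
dim_supplier = {7: {"supplier": "SupCo", "supplier_type": "wholesale"}}
dim_item_snow = {
    10: {"name": "computer", "supplier_key": 7},
    11: {"name": "printer",  "supplier_key": 7},
}

def item_with_supplier(item_key):
    """The extra join needed in the snowflake form to recover all
    attributes of an item."""
    item = dim_item_snow[item_key]
    return {**{k: v for k, v in item.items() if k != "supplier_key"},
            **dim_supplier[item["supplier_key"]]}

# The join reconstructs exactly the denormalized star-form row.
print(item_with_supplier(10) == dim_item_star[10])  # True
```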
Fact constellation: Sophisticated applications may require multiple fact tables to share
dimension tables. This kind of schema can be viewed as a collection of stars, and hence is called
a galaxy schema or a fact constellation.
Factless Fact Table
A fact table is a collection of facts and measures, with multiple keys joined to one or more
dimension tables; its facts are typically numeric and additive. A factless fact table is
different: it is a fact table that contains no facts, only dimensional keys. It captures events
that happen at the information level but are not included in any calculation. A factless fact
table thus records the many-to-many relationships between dimensions while containing no numeric
or textual facts; it is often used to record events or coverage information, or to track a
process and collect statistics. It is called "factless" because the table has no aggregatable
numeric values.
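A sketch of a factless fact table, with invented keys: the table only records that an event occurred (here, a promotion shown to a customer on a date), so the only available "measure" is a row count over the dimensional keys:

```python
# Factless fact table sketch: rows hold only dimension keys, no numeric
# measure. (The keys and event data are hypothetical.)

factless_promotion = [
    # (date_key, customer_key, promotion_key)
    ("2024-01-05", "C1", "P1"),
    ("2024-01-05", "C2", "P1"),
    ("2024-01-06", "C1", "P2"),
]

def event_count(rows, **filters):
    """Stats come from counting rows - effectively COUNT(*) with a
    WHERE clause over the dimensional keys."""
    names = ("date_key", "customer_key", "promotion_key")
    return sum(1 for row in rows
               if all(row[names.index(k)] == v for k, v in filters.items()))

print(event_count(factless_promotion, promotion_key="P1"))  # 2
```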
Drill-down: Drill-down is the reverse of roll-up (which aggregates data by climbing up a concept
hierarchy for a dimension or by dimension reduction). It navigates from less detailed data to
more detailed data, either by stepping down a concept hierarchy for a dimension or by introducing
additional dimensions.
Slice: The slice operation performs a selection on one dimension of the given cube, resulting in
a subcube.
Dice: The dice operation defines a subcube by performing a selection on two or more dimensions.
Ex: Dice for (location = “Toronto” or “Vancouver”) and (time = “Q1” or “Q2”) and (item =
“home entertainment” or “computer”).
Pivot (rotate): Pivot (also called rotate) is a visualization operation that rotates the data axes in
view in order to provide an alternative presentation of the data.
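The slice and dice operations can be sketched as selections over the cells of a cube keyed by (time, item, location). The cell values below are illustrative, and the dice reproduces the Toronto/Vancouver example from the text:

```python
# Slice and dice as selections over cube cells keyed by
# (time, item, location); values are dollars_sold (illustrative).

cube = {
    ("Q1", "computer",           "Toronto"):   968,
    ("Q1", "home entertainment", "Toronto"):   605,
    ("Q2", "computer",           "Toronto"):   746,
    ("Q1", "computer",           "Vancouver"): 825,
    ("Q3", "phone",              "New York"):  400,
}

def slice_op(cube, dim_index, value):
    """Slice: select on ONE dimension, yielding a subcube."""
    return {k: v for k, v in cube.items() if k[dim_index] == value}

def dice_op(cube, time_set, item_set, location_set):
    """Dice: select on two or more dimensions at once."""
    return {(t, i, loc): v for (t, i, loc), v in cube.items()
            if t in time_set and i in item_set and loc in location_set}

print(len(slice_op(cube, 0, "Q1")))  # 3 cells where time = Q1

# The text's example: (location = Toronto or Vancouver) and
# (time = Q1 or Q2) and (item = home entertainment or computer).
sub = dice_op(cube, {"Q1", "Q2"},
              {"home entertainment", "computer"},
              {"Toronto", "Vancouver"})
print(len(sub))  # 4 cells survive the dice
```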
Other OLAP operations: Some OLAP systems offer additional drilling operations. For
example, Drill-Across executes queries involving (i.e., across) more than one fact table. The
Drill-Through operation uses relational SQL facilities to drill through the bottom level of a data
cube down to its back-end relational tables.
Relational OLAP (ROLAP) Servers: These are the intermediate servers that stand in between a
relational back-end server and client front-end tools. They use a relational or extended-relational
DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces.
ROLAP servers include optimization for each DBMS back end, implementation of aggregation
navigation logic, and additional tools and services. ROLAP technology tends to have greater
scalability than MOLAP technology.
Multidimensional OLAP (MOLAP) servers: These servers support multidimensional views of data
through array-based multidimensional storage engines, mapping multidimensional views directly to
data cube array structures for fast access to precomputed summarized data.
Hybrid OLAP (HOLAP) servers: The hybrid OLAP approach combines ROLAP and MOLAP
technology, benefiting from the greater scalability of ROLAP and the faster computation of
MOLAP. For example, a HOLAP server may allow large volumes of detail data to be stored in a
relational database, while aggregations are kept in a separate MOLAP store.
Specialized SQL servers: To meet the growing demand of OLAP processing in relational
databases, some database system vendors implement specialized SQL servers that provide
advanced query language and query processing support for SQL queries over star and snowflake
schemas in a read-only environment.