DWDM UNIT-1 Lecture Notes

Uploaded by

tkranthika.cse
Copyright: © All Rights Reserved

UNIT-I

DATA WAREHOUSING AND DATA MINING


LECTURE NOTES UNIT-I

UNIT-I: DATA WAREHOUSE SYLLABUS


Data warehouse: Introduction to Data Warehouse, Difference between Operational Database
Systems and Data Warehouses, Data Warehouse Characteristics, Data Warehouse Architecture
and its Components, Extraction-Transformation-Loading, Logical (Multi-Dimensional), Data
Modeling, Schema Design, Star and Snowflake Schema, Fact Constellation, Fact Table, Fully
Additive, Non-Additive Measures; Factless Facts, Dimension Table Characteristics; OLAP Cube,
OLAP Operations, OLAP Server Architecture: ROLAP, MOLAP and HOLAP.

LECTURE NOTES
Introduction to Data warehouse:
A data warehouse is a repository of information collected from multiple sources, stored under a
unified schema, and that usually resides at a single site. Data warehouses are constructed via a
process of data cleaning, data integration, data transformation, data loading, and periodic data
refreshing. The figure below shows the typical framework for the construction and use of a data
warehouse, using the AllElectronics sales warehouse as an example.

AIML- |DATA WAREHOUSING AND DATA MINING 1



Data warehouse Definition:


According to William H. Inmon, a leading architect in the construction of data warehouse
systems, “A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile
collection of data in support of management’s decision making process”.

Key Features of Data Warehouse:


Subject-oriented: A data warehouse is organized around major subjects, such as customer,
supplier, product, and sales. Rather than concentrating on the day-to-day operations and
transaction processing of an organization, a data warehouse focuses on the modelling and
analysis of data for decision makers around subjects.
Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous
sources, such as relational databases, flat files, and on-line transaction records.

Time-variant: Data are stored to provide information from a historical perspective (e.g., the past
5–10 years). Every key structure in the data warehouse contains, either implicitly or explicitly,
an element of time.

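Non-volatile: A data warehouse is always a physically separate store of data, transformed from
the application data found in the operational environment, and it is permanent in nature. It
usually requires only two data access operations: the initial loading of data and access of data.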

Differences between Operational Database Systems and Data Warehouses:

Operational Database Systems: The major task of on-line operational database systems is to
perform on-line transaction and query processing. These systems are called on-line transaction
processing (OLTP) systems. They cover most of the day-to-day operations of an organization.


Data warehouse systems: The major task of Data warehouse systems is to serve users or
knowledge workers in the role of data analysis and decision making. These systems are known as
on-line analytical processing (OLAP) systems.
The major distinguishing features between OLTP and OLAP are summarized as follows:
Feature                     OLTP                                  OLAP
Characteristic              operational processing                informational processing
Orientation                 transaction                           analysis
User                        clerk, DBA, database professional     knowledge worker (e.g., manager, executive, analyst)
Function                    day-to-day operations                 long-term informational requirements, decision support
DB design                   ER based, application-oriented        star/snowflake, subject-oriented
Data                        current; guaranteed up-to-date        historical; accuracy maintained over time
Summarization               primitive, highly detailed            summarized, consolidated
View                        detailed, flat relational             summarized, multidimensional
Unit of work                short, simple transaction             complex query
Access                      read/write                            mostly read
Focus                       data in                               information out
Operations                  index/hash on primary key             lots of scans
Number of records accessed  tens                                  millions
Number of users             thousands                             hundreds
DB size                     100 MB to GB                          100 GB to TB
Priority                    high performance, high availability   high flexibility, end-user autonomy
Metric                      transaction throughput                query throughput, response time


Data warehouse Architecture and its components

Data warehouses often adopt a three-tier architecture:

Bottom Tier: The bottom tier of the architecture is the data warehouse database server, which is
almost always a relational database system. Back-end tools and utilities feed data into the
bottom tier from operational databases or other external sources. They perform extraction,
cleaning, and transformation, as well as load and refresh functions, to update the data
warehouse. This tier also contains a metadata repository, which stores information about the
data warehouse and its contents.

Middle Tier: The middle tier is an OLAP server, which can be implemented in either of the
following ways.

• Relational OLAP (ROLAP): an extended relational database management system that maps
operations on multidimensional data to standard relational operations.

• Multidimensional OLAP (MOLAP): a special-purpose server that directly implements
multidimensional data and operations.

Top Tier: This tier is the front-end client layer. It holds query and reporting tools, analysis
tools, and data mining tools.

Data Warehouse Models


From the architecture point of view, there are three data warehouse models: the enterprise
warehouse, the data mart, and the virtual warehouse.

Enterprise warehouse: An enterprise warehouse collects all of the information about subjects
spanning the entire organization. It provides corporate-wide data integration, usually from one or
more operational systems or external information providers, and is cross-functional in scope.

Data mart: A data mart contains a subset of corporate-wide data that is of value to a specific
group of users. The scope is confined to specific selected subjects. For example, a marketing data
mart may confine its subjects to customer, item, and sales. The data contained in data marts tend
to be summarized.
Virtual warehouse: A virtual warehouse is a set of views over operational databases. For
efficient query processing, only some of the possible summary views may be materialized. A
virtual warehouse is easy to build but requires excess capacity on operational database servers.


Data Warehouse Back-End Tools and Utilities


Data warehouse systems use back-end tools and utilities to populate and refresh their data. This
is done through ETL (Extract-Transform-Load), the process by which data are loaded from the
source systems into the data warehouse. ETL now usually includes cleaning as a separate step,
so the sequence becomes Extract-Clean-Transform-Load.

Data extraction: The extract step typically gathers data from multiple, heterogeneous, and
external sources. Its main objective is to retrieve all the required data from the source systems
using as few resources as possible.

Data cleaning: The cleaning step detects errors in the data and rectifies them where possible. It
is one of the most important steps because it ensures the quality of the data in the data
warehouse.

Data transformation: The transform step applies a set of rules to transform the data from the
source to the target. This converts data from legacy or host format to warehouse format.

Load: The load step sorts, summarizes, consolidates, computes views, checks integrity, and
builds indices and partitions.

Refresh: The refresh step propagates the updates from the data sources to the warehouse.
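
The Extract-Clean-Transform-Load sequence can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the source rows, the field names, and the in-memory list that stands in for the warehouse. It is not the interface of any real ETL tool.

```python
# Hypothetical rows from two heterogeneous source systems (assumed data).
source_a = [{"item": "TV", "city": "Toronto", "amount": "605.0"},
            {"item": "tv", "city": "toronto ", "amount": "400.0"}]
source_b = [{"item": "Phone", "city": "Vancouver", "amount": None}]

def extract(*sources):
    """Extract: gather rows from several sources into one list."""
    return [row for src in sources for row in src]

def clean(rows):
    """Clean: detect bad rows and normalise inconsistent text fields."""
    cleaned = []
    for row in rows:
        if row["amount"] is None:
            continue  # detected error: missing measure; this sketch simply drops the row
        cleaned.append({"item": row["item"].strip().upper(),
                        "city": row["city"].strip().title(),
                        "amount": row["amount"]})
    return cleaned

def transform(rows):
    """Transform: convert the source format (text amounts) to the warehouse format (floats)."""
    return [{**row, "amount": float(row["amount"])} for row in rows]

def load(rows, warehouse):
    """Load: summarise by (item, city), sort, and store in the warehouse."""
    totals = {}
    for row in rows:
        key = (row["item"], row["city"])
        totals[key] = totals.get(key, 0.0) + row["amount"]
    for key in sorted(totals):
        warehouse.append((key[0], key[1], totals[key]))
    return warehouse

warehouse = load(transform(clean(extract(source_a, source_b))), [])
print(warehouse)
```

Keeping each step as a separate function mirrors the separation of the steps in the text; a refresh would rerun the same pipeline on newly arrived source rows.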

Metadata Repository:
Metadata are data about data. Metadata are the data that define warehouse objects.
A metadata repository should contain the following:
• A description of the structure of the data warehouse, which includes the warehouse
schema, views, dimensions, hierarchies, and derived data definitions, as well as data mart
locations and contents.
• Operational metadata, which include data lineage (history of migrated data and the
sequence of transformations applied to it), currency of data (active, archived, or purged),
and monitoring information (warehouse usage statistics, error reports, and audit trails).
• The algorithms used for summarization, which include measure and dimension definition
algorithms, data on granularity, partitions, subject areas, aggregation, summarization, and
predefined queries and reports.
• The mapping from the operational environment to the data warehouse, which includes
source databases and their contents, gateway descriptions, data partitions, data extraction,
cleaning, transformation rules and defaults, data refresh and purging rules, and security.
• Data related to system performance, which include indices and profiles that improve data
access and retrieval performance, in addition to rules for the timing and scheduling of
refresh, update, and replication cycles.
• Business metadata, which include business terms and definitions, data ownership
information, and charging policies.

A Multidimensional Data Model:


Data warehouses are modelled on a multidimensional data model. This model views data in
the form of a data cube. A data cube allows data to be modelled and viewed in multiple
dimensions. It is defined by dimensions and facts.
Dimensions are the perspectives or entities with respect to which an organization wants to keep
records. For example, AllElectronics may create a sales data warehouse in order to keep records
of the store’s sales with respect to the dimensions time, item, branch, and location.
Each dimension may have a table associated with it, called a dimension table, which further
describes the dimension.

Facts are numerical measures. A multidimensional data model is typically organized around a
central theme, such as sales, and this theme is represented by a fact table. Examples of facts for
a sales data warehouse include dollars sold (sales amount in dollars) and units sold (number of
units sold). The fact table contains the names of the facts, or measures, as well as keys to each
of the related dimension tables.
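
As a rough illustration of how dimension tables, a fact table, and a data cube relate, the following Python sketch builds cube cells from a tiny fact table. The keys, attribute names, and sales figures are all hypothetical and chosen only for this example.

```python
# Dimension tables: surrogate key -> descriptive attributes (hypothetical values).
time_dim = {1: {"quarter": "Q1", "year": 2023}, 2: {"quarter": "Q2", "year": 2023}}
item_dim = {10: {"item_name": "TV", "type": "home entertainment"},
            11: {"item_name": "Laptop", "type": "computer"}}
location_dim = {100: {"city": "Toronto"}, 101: {"city": "Vancouver"}}

# Fact table: one row per sale, holding dimension keys plus the two measures
# (time_key, item_key, location_key, dollars_sold, units_sold).
sales_fact = [
    (1, 10, 100, 605.0, 5),
    (1, 11, 100, 900.0, 1),
    (2, 10, 101, 400.0, 2),
    (1, 10, 100, 100.0, 1),
]

# Data cube view: one cell per (quarter, item, city), measures summed per cell.
cube = {}
for time_key, item_key, loc_key, dollars, units in sales_fact:
    cell = (time_dim[time_key]["quarter"],
            item_dim[item_key]["item_name"],
            location_dim[loc_key]["city"])
    old_dollars, old_units = cube.get(cell, (0.0, 0))
    cube[cell] = (old_dollars + dollars, old_units + units)

print(cube[("Q1", "TV", "Toronto")])
```

The fact table stores only keys and measures; the descriptive text lives in the dimension tables and is looked up when the cube is viewed.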


A 3-D data cube representation of the data in Table 3.3, according to the dimensions time,
item, and location. The measure displayed is dollars sold (in thousands).

Measures of Data Cube Facts: Three Categories


• Distributive: a measure is distributive if the result derived by applying the function to n
aggregate values (one per partition of the data) is the same as that derived by applying the
function to all the data without partitioning. For count(), the partial counts are summed.
E.g., count(), sum(), min(), max()
• Algebraic: a measure is algebraic if it can be computed by an algebraic function with M
arguments (where M is a bounded integer), each of which is obtained by applying a
distributive aggregate function.
E.g., avg(), min_N(), standard_deviation()
• Holistic: a measure is holistic if there is no constant bound on the storage size needed to
describe a subaggregate.
E.g., median(), mode(), rank()
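
The difference between the three categories can be checked on a small example. The sketch below uses made-up numbers split into three partitions; it shows that sum() combines cleanly across partitions, that avg() can be derived from two distributive results, and that median() cannot be recovered from per-partition medians.

```python
from statistics import median

# Hypothetical data, already split into partitions.
partitions = [[4, 8, 15], [16, 23], [42]]
all_values = [v for part in partitions for v in part]

# Distributive: the sum of the partial sums equals the sum over all data.
partial_sums = [sum(p) for p in partitions]
sum_from_parts = sum(partial_sums)

# count() is also distributive: partial counts are combined by summing them.
partial_counts = [len(p) for p in partitions]
count_from_parts = sum(partial_counts)

# Algebraic: avg() is computed from two distributive results, sum() and count().
avg_from_parts = sum_from_parts / count_from_parts

# Holistic: combining per-partition medians does not give the overall median.
median_of_medians = median(median(p) for p in partitions)
overall_median = median(all_values)

print(sum_from_parts, avg_from_parts, median_of_medians, overall_median)
```

For this data the median of the partition medians differs from the true median, which is why a holistic measure generally needs access to the underlying values rather than to fixed-size summaries.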


Multidimensional Data Model Conceptual Design Schemas:


A multidimensional data model can be designed conceptually with a star schema, a snowflake
schema, or a fact constellation schema.

Star schema: The most common modeling paradigm is the star schema, in which the data
warehouse contains (1) a large central table (fact table) containing the bulk of the data, with no
redundancy, and (2) a set of smaller attendant tables (dimension tables), one for each dimension.
The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern
around the central fact table.

Example: Star schema of a data warehouse for sales.
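
A cut-down star schema for sales can be sketched with Python's built-in sqlite3 module. The table names, column names, and rows are assumptions made for illustration, and only three dimensions are included; the point is the shape, with one central fact table whose keys point at each dimension table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one per dimension, each with a single-column key.
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);

-- Central fact table: foreign keys to every dimension plus the measures.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL,
    units_sold INTEGER
);

INSERT INTO time_dim VALUES (1, 'Q1', 2023), (2, 'Q2', 2023);
INSERT INTO item_dim VALUES (10, 'TV', 'home entertainment'), (11, 'Laptop', 'computer');
INSERT INTO location_dim VALUES (100, 'Toronto', 'Canada'), (101, 'Chicago', 'USA');
INSERT INTO sales_fact VALUES
    (1, 10, 100, 600.0, 3), (1, 11, 100, 900.0, 1), (2, 10, 101, 400.0, 2);
""")

# A typical star query: join the fact table to one dimension and aggregate.
rows = conn.execute("""
    SELECT l.city, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN location_dim AS l ON f.location_key = l.location_key
    GROUP BY l.city
    ORDER BY l.city
""").fetchall()
print(rows)
```

Every query against the star needs at most one join per dimension, which is the main practical appeal of this layout.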

Snowflake schema: The snowflake schema is a variant of the star schema model, where some
dimension tables are normalized, thereby further splitting the data into additional tables. The
resulting schema graph forms a shape similar to a snowflake.
The major difference between the snowflake and star schema models is that the
dimension tables of the snowflake model may be kept in normalized form to reduce
redundancies.
Example: Snowflake schema of a data warehouse for sales.
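
To make the normalization step concrete, the sketch below snowflakes a location dimension by moving city and country into a separate table. It again uses Python's sqlite3 module, and all table names, columns, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized dimension: city attributes are stored once in their own table.
CREATE TABLE city_dim (city_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE location_dim (
    location_key INTEGER PRIMARY KEY,
    street TEXT,
    city_key INTEGER REFERENCES city_dim(city_key)
);
CREATE TABLE sales_fact (
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL
);

INSERT INTO city_dim VALUES (1, 'Toronto', 'Canada'), (2, 'Chicago', 'USA');
INSERT INTO location_dim VALUES
    (100, '12 King St', 1), (101, '9 Queen St', 1), (102, '5 Lake Ave', 2);
INSERT INTO sales_fact VALUES (100, 600.0), (101, 300.0), (102, 400.0);
""")

# Reaching the country now takes two joins instead of one.
rows = conn.execute("""
    SELECT c.country, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN location_dim AS l ON f.location_key = l.location_key
    JOIN city_dim AS c ON l.city_key = c.city_key
    GROUP BY c.country
    ORDER BY c.country
""").fetchall()
print(rows)
```

The country name 'Canada' is stored once rather than on every Toronto location row, which is the redundancy saving; the price is the extra join in each query.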


Fact constellation: Sophisticated applications may require multiple fact tables to share
dimension tables. This kind of schema can be viewed as a collection of stars, and hence is called
a galaxy schema or a fact constellation.

Fact constellation schema of a data warehouse for sales and shipping.


Fact-Less-Fact Table
A fact table normally holds the facts (measures) of a business process together with keys that
join it to one or more dimension tables. The facts are usually numeric and additive.
A factless fact table is different: it is a fact table that contains no facts, only dimension
keys.
It captures events that matter at the information level but carry no measure to be used in
calculations.
A factless fact table captures the many-to-many relationships between dimensions, but contains
no numeric or textual facts. Such tables are often used to record events or coverage information.

Factless fact tables are used for tracking a process or collecting statistics. They are so called
because the fact table does not have aggregatable numeric values.

Two types of factless fact tables


1. Factless fact tables that describe events
2. Factless fact tables that describe conditions
Factless fact tables for Events
The first type of factless fact table records an event. Many event-tracking tables in
dimensional data warehouses turn out to be factless.
E.g., capturing the leave taken by employees.
Factless fact tables for Conditions
Factless fact tables are also used to model conditions or other important relationships among
dimensions. In these cases, there are no clear transactions or events. Such tables support
negative analysis reports.
For example, finding a store that did not sell a product during a given period.
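
Both kinds of factless fact table can be illustrated with plain Python collections. In this sketch the employee, store, and product names are invented; the event table is analysed by counting rows, and the condition (coverage) table is compared with the sales keys to answer a negative question.

```python
# Event factless fact table: one row per leave day, dimension keys only.
leave_fact = [
    ("2023-01-02", "emp1"),
    ("2023-01-03", "emp1"),
    ("2023-01-02", "emp2"),
]

# With no measure to sum, analysis is done by counting rows.
leaves_per_employee = {}
for _date_key, employee_key in leave_fact:
    leaves_per_employee[employee_key] = leaves_per_employee.get(employee_key, 0) + 1

# Condition (coverage) factless fact table: which products each store
# offered during the period, again dimension keys only.
coverage_fact = {("store1", "TV"), ("store1", "Laptop"), ("store2", "TV")}

# Keys that actually appear in the sales fact table for the same period.
sales_keys = {("store1", "TV")}

# Negative analysis: offered but not sold.
not_sold = sorted(coverage_fact - sales_keys)

print(leaves_per_employee, not_sold)
```

The sales fact table alone cannot answer the "did not sell" question, because it has no rows for things that did not happen; the coverage table supplies the missing side of the comparison.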

Fact Table Measures:


The numeric measures in a fact table fall into three categories.
1) Fully Additive: measures that can be summed across any of the dimensions associated with
the fact table. E.g., sales amount.
2) Semi-Additive: measures that can be summed across some dimensions, but not all.
E.g., checking or savings account balances, which can be summed across accounts but not
across time.
3) Non-Additive: measures that cannot be summed across any dimension or its hierarchy.
E.g., ratios and percentages.
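
A small numeric sketch may help separate the categories. The account balances and revenue/profit figures below are made up; the code shows a semi-additive measure (balance) that sums sensibly across accounts but not across time, and a non-additive measure (profit margin) that has to be recomputed from its additive components.

```python
# Semi-additive: month-end balances as (date, account, balance), hypothetical data.
balances = [
    ("2023-01-31", "A", 100.0), ("2023-01-31", "B", 50.0),
    ("2023-02-28", "A", 120.0), ("2023-02-28", "B", 70.0),
]

# Summing across the account dimension for one date is meaningful.
total_jan = sum(bal for date, _acct, bal in balances if date == "2023-01-31")

# Summing across the time dimension for one account is not meaningful;
# a closing (latest) balance or an average is used instead.
a_balances = [bal for _date, acct, bal in balances if acct == "A"]
misleading_sum = sum(a_balances)
closing_balance = a_balances[-1]

# Non-additive: profit margin is a ratio. It is derived from the fully
# additive measures revenue and profit, never by adding or averaging ratios.
sales = [(100.0, 20.0), (300.0, 30.0)]  # (revenue, profit) per sale
overall_margin = sum(p for _r, p in sales) / sum(r for r, _p in sales)
average_of_margins = sum(p / r for r, p in sales) / len(sales)

print(total_jan, misleading_sum, closing_balance, overall_margin, average_of_margins)
```

Account A never held the summed amount, and the average of the per-sale margins differs from the true overall margin, so both shortcuts would give misleading reports.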

OLAP Operations in the Multidimensional Data Model


Roll-up: The roll-up operation (also called the drill-up operation by some vendors) performs
aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by
dimension reduction.
Ex: Roll-up on Location (From cities to countries)

Drill-down: Drill-down is the reverse of roll-up. It navigates from less detailed data to more
detailed data. Drill-down can be realized by either stepping down a concept hierarchy for a
dimension or introducing additional dimensions.

Ex: Drill-down on Time (From quarters to months)
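
Roll-up can be sketched on a cube held as a Python dictionary. The sales figures and the city-to-country concept hierarchy below are assumptions for illustration; the two functions show the two forms of roll-up named above, climbing a hierarchy and removing a dimension.

```python
# Concept hierarchy for the location dimension (assumed mapping).
city_to_country = {"Toronto": "Canada", "Vancouver": "Canada", "Chicago": "USA"}

# Base cube: (quarter, city) -> dollars_sold, hypothetical figures.
cube = {
    ("Q1", "Toronto"): 605.0,
    ("Q1", "Vancouver"): 825.0,
    ("Q1", "Chicago"): 440.0,
    ("Q2", "Toronto"): 680.0,
}

def roll_up_location(cube):
    """Roll up by climbing the location hierarchy from city to country."""
    rolled = {}
    for (quarter, city), dollars in cube.items():
        key = (quarter, city_to_country[city])
        rolled[key] = rolled.get(key, 0.0) + dollars
    return rolled

def roll_up_remove_location(cube):
    """Roll up by dimension reduction: drop the location dimension entirely."""
    rolled = {}
    for (quarter, _city), dollars in cube.items():
        rolled[quarter] = rolled.get(quarter, 0.0) + dollars
    return rolled

by_country = roll_up_location(cube)
by_quarter = roll_up_remove_location(cube)
print(by_country, by_quarter)
```

Drill-down is the reverse direction. It cannot be computed from `by_country` alone, because the city-level detail has been aggregated away; a system answers it by going back to the more detailed cube, here `cube`.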


Slice: The slice operation performs a selection on one dimension of the given cube, resulting in a
sub cube.

Ex: Slice for Time= “Q1”


Dice: The dice operation defines a sub cube by performing a selection on two or more
dimensions.
Ex: Dice for (location = “Toronto” or “Vancouver”) and (time = “Q1” or “Q2”) and (item =
“home entertainment” or “computer”).
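
Slice and dice are both selections, and a short Python sketch can show the difference. The cube contents are hypothetical; the slice fixes one value on the time dimension, while the dice applies the three-dimension selection from the example above.

```python
# Cube cells: (quarter, city, item_type) -> dollars_sold, hypothetical figures.
cube = {
    ("Q1", "Toronto", "computer"): 968.0,
    ("Q1", "Vancouver", "home entertainment"): 825.0,
    ("Q2", "Toronto", "phone"): 38.0,
    ("Q3", "Chicago", "computer"): 1023.0,
}

def slice_cube(cube, quarter):
    """Slice: select one value on the time dimension; the result has one dimension fewer."""
    return {(city, item): v for (q, city, item), v in cube.items() if q == quarter}

def dice_cube(cube, quarters, cities, items):
    """Dice: select on several dimensions at once; the result keeps all dimensions."""
    return {key: v for key, v in cube.items()
            if key[0] in quarters and key[1] in cities and key[2] in items}

q1_slice = slice_cube(cube, "Q1")
diced = dice_cube(cube,
                  quarters={"Q1", "Q2"},
                  cities={"Toronto", "Vancouver"},
                  items={"home entertainment", "computer"})
print(q1_slice)
print(diced)
```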

Pivot (rotate): Pivot (also called rotate) is a visualization operation that rotates the data axes in
view in order to provide an alternative presentation of the data.
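
For a two-dimensional view, a pivot amounts to swapping the row and column axes. The following sketch, with invented figures, rotates a quarter-by-city table into a city-by-quarter table without changing any value.

```python
# 2-D view: rows are quarters, columns are cities (hypothetical figures).
table = {
    "Q1": {"Toronto": 605.0, "Vancouver": 825.0},
    "Q2": {"Toronto": 680.0, "Vancouver": 952.0},
}

def pivot(table):
    """Rotate the axes: former columns become rows and former rows become columns."""
    rotated = {}
    for row_label, columns in table.items():
        for col_label, value in columns.items():
            rotated.setdefault(col_label, {})[row_label] = value
    return rotated

by_city = pivot(table)
print(by_city)
```

Applying `pivot` twice returns the original layout, which reflects that the operation only changes the presentation and not the data.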

Other OLAP operations: Some OLAP systems offer additional drilling operations. For
example, Drill-Across executes queries involving (i.e., across) more than one fact table. The
Drill-Through operation uses relational SQL facilities to drill through the bottom level of a data
cube down to its back-end relational tables.


OLAP Server Architectures


Logically, OLAP servers present business users with multidimensional data from data
warehouses or data marts. The physical architecture and implementation of OLAP servers
include the following.

Relational OLAP (ROLAP) Servers: These are the intermediate servers that stand in between a
relational back-end server and client front-end tools. They use a relational or extended-relational
DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces.
ROLAP servers include optimization for each DBMS back end, implementation of aggregation
navigation logic, and additional tools and services. ROLAP technology tends to have greater
scalability than MOLAP technology.
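
The idea that a ROLAP server maps multidimensional operations onto relational ones can be sketched as a function that turns a roll-up request into a GROUP BY query. This is only an illustration using Python's sqlite3 module: the `sales` table, its columns, and the helper `roll_up_sql` are hypothetical and do not represent any vendor's middleware.

```python
import sqlite3

def roll_up_sql(group_by_columns):
    """Translate a roll-up to the given dimension levels into a SQL GROUP BY query.

    Column names are interpolated directly, so this sketch assumes they come
    from trusted, predefined dimension metadata.
    """
    cols = ", ".join(group_by_columns)
    return (f"SELECT {cols}, SUM(dollars_sold) FROM sales "
            f"GROUP BY {cols} ORDER BY {cols}")

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (quarter TEXT, city TEXT, country TEXT, dollars_sold REAL);
INSERT INTO sales VALUES
    ('Q1', 'Toronto', 'Canada', 605.0),
    ('Q1', 'Vancouver', 'Canada', 825.0),
    ('Q1', 'Chicago', 'USA', 440.0),
    ('Q2', 'Toronto', 'Canada', 680.0);
""")

# Roll-up on location from city to country, expressed as a relational query.
by_country = conn.execute(roll_up_sql(["quarter", "country"])).fetchall()
print(by_country)
```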

Multidimensional OLAP (MOLAP) servers: These servers support multidimensional views of
data through array-based multidimensional storage engines. They map multidimensional views
directly to data cube array structures. The advantage of using a data cube is that it allows fast
indexing to precomputed summarized data.
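
Array-based storage with precomputed summaries can be sketched with nested Python lists. The dimension values and sales figures are hypothetical, and the layout (one extra slot per axis holding the "all" total) is an assumption chosen for illustration rather than the storage format of any particular MOLAP product.

```python
quarters = ["Q1", "Q2"]
cities = ["Toronto", "Vancouver"]

# Map each dimension value to an array index.
q_index = {q: i for i, q in enumerate(quarters)}
c_index = {c: j for j, c in enumerate(cities)}

ALL = -1  # the last slot on each axis holds the precomputed total for that axis

# Dense 2-D array with one extra row and one extra column for the totals.
cells = [[0.0] * (len(cities) + 1) for _ in range(len(quarters) + 1)]

def add_sale(quarter, city, dollars):
    """Update the base cell and every precomputed aggregate that covers it."""
    i, j = q_index[quarter], c_index[city]
    cells[i][j] += dollars      # base cell
    cells[i][ALL] += dollars    # total for the quarter over all cities
    cells[ALL][j] += dollars    # total for the city over all quarters
    cells[ALL][ALL] += dollars  # grand total

for sale in [("Q1", "Toronto", 605.0), ("Q1", "Vancouver", 825.0), ("Q2", "Toronto", 680.0)]:
    add_sale(*sale)

# Summaries are read by direct indexing, with no scan or join at query time.
q1_total = cells[q_index["Q1"]][ALL]
toronto_total = cells[ALL][c_index["Toronto"]]
grand_total = cells[ALL][ALL]
print(q1_total, toronto_total, grand_total)
```

The trade-off visible even at this scale is that every cell is allocated whether or not it holds data, which is one reason dense arrays become costly for large, sparse cubes.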

Hybrid OLAP (HOLAP) servers: The hybrid OLAP approach combines ROLAP and MOLAP
technology, benefiting from the greater scalability of ROLAP and the faster computation of
MOLAP. For example, a HOLAP server may allow large volumes of detail data to be stored in a
relational database, while aggregations are kept in a separate MOLAP store.

Specialized SQL servers: To meet the growing demand of OLAP processing in relational
databases, some database system vendors implement specialized SQL servers that provide
advanced query language and query processing support for SQL queries over star and snowflake
schemas in a read-only environment.
