Chapter 2. Introduction to Data Warehouse

Uploaded by

Madeed haji

Introduction to Data Warehouses

• Data Warehouse and DBMS, Architecture of Data Warehouse, Multidimensional data model
• Concepts of OLAP and Data Cube
• OLAP operations
• Dimensional Data Modelling: Star and Snowflake schemas
What is a Data Warehouse

• A decision support database that is maintained separately from the organization’s operational database.
• Supports information processing by providing a solid platform of consolidated, historical data for analysis.
• A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process.
Subject-oriented –
• Organized around major subjects, such as customer, supplier, product, sales, etc.
• Provides a simple and concise view around particular subject issues by excluding data that is not useful for the decision process.
• Modeled in accordance with decision makers’ needs and not transaction processing needs.
Integrated –
• Constructed by integrating multiple, heterogeneous data sources: relational databases, flat files, on-line transaction records.
• Data cleaning and data integration techniques are applied.
• Ensures consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources.
Time-variant –
• The time horizon for the data warehouse is significantly longer than that of operational systems.
• Operational database: current value data.
• Data warehouse data: provides information from a historical perspective (e.g., past 5-10 years).
• Every key structure in the data warehouse contains an element of time, explicitly or implicitly.
• The key of operational data may or may not contain a “time element”.
Non-volatile –
• A physically separate store of data transformed from the operational environment.
• Operational update of data does not occur in the data warehouse environment.
• Does not require transaction processing, recovery, and concurrency control mechanisms.
• Requires only two operations in data accessing: initial loading of data and access of data.
What is Data Warehousing
• The process of constructing and using data warehouses.

[Figure: Example of a data warehouse]
OLTP
• The major task of online operational database systems is to perform online transaction and query processing.
• These systems are called online transaction processing (OLTP) systems.
• They cover most of the day-to-day operations of an organization such as purchasing, inventory, manufacturing, banking, payroll, registration, and accounting.
OLAP
• Data warehouse systems, on the other hand, serve users or knowledge workers in the role of data analysis and decision making.
• Such systems can organize and present data in various formats in order to accommodate the diverse needs of different users.
• These systems are known as online analytical processing (OLAP) systems.
Distinct features (OLTP vs. OLAP):
– User and system orientation: customer vs. market
– Data contents: current, detailed vs. historical, consolidated
– Database design: ER + application vs. star + subject
– View: current, local vs. evolutionary, integrated
– Access patterns: update vs. read-only but complex queries
                     OLTP                            OLAP
users                clerk, IT professional          knowledge worker
function             day-to-day operations           decision support
DB design            application-oriented            subject-oriented
data                 current, up-to-date,            historical,
                     detailed, flat relational,      summarized, multidimensional,
                     isolated                        integrated, consolidated
usage                repetitive                      ad hoc
access               read/write,                     lots of scans
                     index/hash on primary key
unit of work         short, simple transaction       complex query
# records accessed   tens                            millions
# users              thousands                       hundreds
DB size              100 MB to GB                    100 GB to TB
metric               transaction throughput          query throughput, response
Data Warehouse: Data Model (4.2.1)
• A data warehouse views data in the form of a data cube.
• A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts.
• For example, AllElectronics may create a sales data warehouse in order to keep records of the store’s sales with respect to the dimensions time, item, branch, and location.
• Facts are numeric measures. Think of them as the quantities by which we want to analyze relationships between dimensions.
• Examples of facts for a sales data warehouse include dollars sold (sales amount in dollars), units sold (number of units sold), and amount budgeted.
• The fact table contains the names of the facts, or measures, as well as keys to each of the related dimension tables.
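
The relationship between a fact table and its dimension tables can be sketched in a few lines of Python. This is only an illustration: the table names follow the text, but the key values, item names, and amounts are invented, and the helper `dollars_by` is a hypothetical name.

```python
# Hypothetical miniature of a sales data warehouse (all values invented).
time_dim = {1: {"quarter": "Q1", "year": 2016},
            2: {"quarter": "Q2", "year": 2016}}
item_dim = {10: {"item_name": "TV", "type": "home entertainment"},
            11: {"item_name": "laptop", "type": "computer"}}

# Each fact row holds keys into the dimension tables plus numeric measures.
sales_fact = [
    {"time_key": 1, "item_key": 10, "dollars_sold": 600.0, "units_sold": 3},
    {"time_key": 1, "item_key": 11, "dollars_sold": 900.0, "units_sold": 1},
    {"time_key": 2, "item_key": 10, "dollars_sold": 400.0, "units_sold": 2},
]

def dollars_by(attribute, dim, key_name):
    """Sum dollars_sold grouped by one attribute of one dimension table."""
    totals = {}
    for row in sales_fact:
        label = dim[row[key_name]][attribute]   # follow the key into the dimension
        totals[label] = totals.get(label, 0.0) + row["dollars_sold"]
    return totals

by_quarter = dollars_by("quarter", time_dim, "time_key")
by_type = dollars_by("type", item_dim, "item_key")
print(by_quarter)
print(by_type)
```

The same measure (dollars sold) is analyzed along two different dimensions simply by following a different key out of the fact table.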
Data representation
[Figure: 2-D View of Sales Data for AllElectronics According to time and item]
[Figure: 3-D View of Sales Data for AllElectronics According to time, item, and location]
Cube: a lattice of cuboids
• Given a set of dimensions, we can generate a cuboid for each of the possible subsets of the given dimensions.
• The result would form a lattice of cuboids, each showing the data at a different level of summarization, or group-by.
• The lattice of cuboids is then referred to as a data cube.
• An n-D base cube is called a base cuboid.
• The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid.
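
As a small sketch of this idea, the cuboids for the four AllElectronics dimensions can be enumerated as the subsets of the dimension set. The function name `cuboid_lattice` is an assumption made for illustration; with n dimensions there are 2^n cuboids.

```python
from itertools import combinations

dimensions = ("time", "item", "branch", "location")

def cuboid_lattice(dims):
    """Return every subset of dims, grouped by its number of dimensions."""
    return {k: list(combinations(dims, k)) for k in range(len(dims) + 1)}

lattice = cuboid_lattice(dimensions)
apex = lattice[0][0]                  # the 0-D apex cuboid: no group-by at all
base = lattice[len(dimensions)][0]    # the 4-D base cuboid: group by everything
total = sum(len(level) for level in lattice.values())

for k, cuboids in lattice.items():
    print(f"{k}-D cuboids:", cuboids)
```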
Typical OLAP operations
• The multidimensional data model provides users with the flexibility to view data from different perspectives.
• OLAP operations on data cubes result in different views of the data, allowing interactive querying and analysis of the data at hand.
OLAP Operations : Roll up/Drill up
• Summarizes data by climbing up a concept hierarchy or by dimension reduction.
• Ex. Consider a location hierarchy defined as the total order “street < city < province or state < country.”
• The roll-up operation aggregates the data by ascending the location hierarchy from the level of city to the level of country.
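
A minimal sketch of roll-up from city to country, assuming a hypothetical city-to-country mapping and invented sales figures. The function name `roll_up` is illustrative only.

```python
from collections import defaultdict

# Hypothetical fragment of the location hierarchy: city -> country.
city_to_country = {"Toronto": "Canada", "Vancouver": "Canada",
                   "Chicago": "USA", "New York": "USA"}

# (city, quarter) -> dollars sold; values invented for illustration.
sales_by_city = {("Toronto", "Q1"): 395, ("Vancouver", "Q1"): 605,
                 ("Chicago", "Q1"): 440, ("New York", "Q1"): 1560,
                 ("Toronto", "Q2"): 400}

def roll_up(cells, hierarchy):
    """Aggregate city-level cells to the parent level of the hierarchy."""
    rolled = defaultdict(int)
    for (city, quarter), dollars in cells.items():
        rolled[(hierarchy[city], quarter)] += dollars
    return dict(rolled)

sales_by_country = roll_up(sales_by_city, city_to_country)
print(sales_by_country)
```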
OLAP Operations : Drill Down
• Drill-down can be realized by either stepping down a concept hierarchy for a dimension or introducing additional dimensions.
• Ex. Consider a concept hierarchy for time defined as the total order “day < month < quarter < year.”
• Drill-down occurs by descending the time hierarchy from the level of quarter to the more detailed level of month.
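
Drill-down is the reverse of roll-up, so it needs the more detailed data to be available. The sketch below assumes month-level sales are stored (values invented) and shows a quarter view being expanded back into its months. The names `by_quarter` and `drill_down` are hypothetical.

```python
from collections import defaultdict

# Hypothetical fragment of the time hierarchy: month -> quarter.
month_to_quarter = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}

# Detailed data kept at the month level; values invented for illustration.
sales_by_month = {"Jan": 100, "Feb": 150, "Mar": 200, "Apr": 120}

def by_quarter(monthly):
    """Summarized view: one cell per quarter."""
    totals = defaultdict(int)
    for month, dollars in monthly.items():
        totals[month_to_quarter[month]] += dollars
    return dict(totals)

def drill_down(quarter, monthly):
    """Return the month-level cells that make up one quarter."""
    return {m: d for m, d in monthly.items() if month_to_quarter[m] == quarter}

quarter_view = by_quarter(sales_by_month)
q1_months = drill_down("Q1", sales_by_month)
print(quarter_view)
print(q1_months)
```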
OLAP Operations : Slice
• The slice operation performs a selection on one dimension of the given cube, resulting in a subcube.
• Ex. A slice operation where the sales data are selected from the central cube for the dimension time using the criterion time = “Q1.”
OLAP Operations : Dice
• The dice operation defines a subcube by performing a selection on two or more dimensions.
• Ex. A dice operation on the central cube based on the following selection criteria that involve three dimensions: (location = “Toronto” or “Vancouver”) and (time = “Q1” or “Q2”) and (item = “home entertainment” or “computer”).
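
Slice and dice are both selections on the cube, differing only in how many dimensions are constrained. The sketch below assumes a tiny cube stored as a dictionary keyed by (location, time, item) with invented values. The helper `select` is a hypothetical name.

```python
# Cells keyed by (location, time, item) -> dollars sold; values invented.
cube = {
    ("Toronto", "Q1", "computer"): 100,
    ("Toronto", "Q3", "computer"): 110,
    ("Vancouver", "Q2", "home entertainment"): 120,
    ("Chicago", "Q1", "phone"): 130,
    ("Vancouver", "Q1", "security"): 140,
}
AXES = ("location", "time", "item")

def select(cells, **criteria):
    """Keep cells whose coordinate on each named axis is in the allowed set."""
    def keep(coords):
        return all(coords[AXES.index(axis)] in allowed
                   for axis, allowed in criteria.items())
    return {coords: value for coords, value in cells.items() if keep(coords)}

# Slice: selection on one dimension (time = "Q1").
sliced = select(cube, time={"Q1"})

# Dice: selection on three dimensions, using the criteria from the text.
diced = select(cube,
               location={"Toronto", "Vancouver"},
               time={"Q1", "Q2"},
               item={"home entertainment", "computer"})
print(sliced)
print(diced)
```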
OLAP Operations : Pivot (rotate)
• Pivot (also called rotate) is a visualization operation that rotates the data axes in view to provide an alternative data presentation.
• Ex. A pivot operation where the item and location axes in a 2-D slice are rotated.
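
For a 2-D slice, pivoting amounts to swapping the row and column axes. A minimal sketch, assuming the slice is held as a nested dictionary (item, then location) with invented values:

```python
# 2-D slice: rows are items, columns are locations; values invented.
slice_2d = {"computer": {"Toronto": 100, "Vancouver": 90},
            "phone": {"Toronto": 40, "Vancouver": 60}}

def pivot(table):
    """Swap the two axes of a nested-dict table (rows become columns)."""
    rotated = {}
    for row_label, row in table.items():
        for col_label, value in row.items():
            rotated.setdefault(col_label, {})[row_label] = value
    return rotated

by_location = pivot(slice_2d)
print(by_location)
```

The data are unchanged; only the presentation differs, and pivoting twice returns the original view.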
OLAP Operations : Drill through
• Drill-through goes through the bottom level of the cube to its back-end relational tables (using SQL).
OLAP Operations : Drilling across
• Executes queries involving (across) more than one fact table. The operation is often referred to as an "OLAP join".
• Drilling across simply means making separate queries against two or more fact tables where the row headers of each query consist of identical conformed attributes. The answer sets from the two queries are aligned by performing a sort-merge operation on the common dimension attribute row headers.
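
The drill-across procedure described above can be sketched as two separate aggregations followed by an alignment on the shared row header. This is a simplified illustration, not a full sort-merge implementation: the two fact tables, their rows, and the function names are assumptions, and the conformed attribute is taken to be the item name.

```python
from collections import defaultdict

# Two hypothetical fact tables sharing the conformed attribute item_name.
sales_fact = [("tv", 3), ("laptop", 1), ("tv", 2)]    # (item_name, units_sold)
shipping_fact = [("tv", 4), ("phone", 6)]             # (item_name, units_shipped)

def total_by_item(rows):
    """Separate query against one fact table: total units per item."""
    totals = defaultdict(int)
    for item, units in rows:
        totals[item] += units
    return totals

def drill_across(sales, shipping):
    """Align the two answer sets on their common, sorted row headers."""
    sold, shipped = total_by_item(sales), total_by_item(shipping)
    merged = []
    for item in sorted(set(sold) | set(shipped)):
        # .get() leaves None where an item is missing from one fact table
        merged.append((item, sold.get(item), shipped.get(item)))
    return merged

result = drill_across(sales_fact, shipping_fact)
for row in result:
    print(row)
```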
Conceptual Modeling of Relational Data Model

Example:
Suppose we want to create an exam database. Students appear for exams: a student can attempt one or more exams, and one exam can be taken by many students.

To store this information we require a database. To design the database for this scenario, we have to find the entities and the relationships among them.
Entities: Student and Exam
Relationship: Student appears for Exam
We can represent this structure using an ER model as

Student (1) ---- Appear ---- (m) Exam

From this diagram it is clear that the database will contain data for Student and Exam.

Student
RollNo  Student_Name  Contact  Address
1       Punam         325345   Pune
2       Nilima        435443   Mumbai
3       Divya         435443   Pune
4       Kalyani       325345   Mumbai
5

Exam
RollNo  Exam_Id  Date        Marks
1       101      12/02/2016  65
3       102      12/02/2016  44
4       103      12/02/2015  56
2       104      12/02/2016  67
1       103      12/02/2015  87
Conceptual Modelling of Data Warehouses
Design Decisions
• Choosing the process: selecting the subjects for the first set of logical structures to be designed.
  Ex. Student, University, Sports, etc.
• Choosing the grain: determining the level of detail for the data in the data structures.
• Identifying and conforming the dimensions: choosing the dimensions to be included.
  Ex. Dimensions: time, location, item, branch
• Choosing the facts: selecting the metrics or units of measurement to be included.
  Ex. Facts: quantity of sales, sales amount
• Choosing the duration of the database: determining how far back in time one should go for historical data.
Conceptual Modelling of Data Warehouses
• The model should provide the best data access.
• The whole model should be query-centric.
• It must be optimized for queries and analysis.
• It must be structured in such a way that every dimension can interact equally with the fact table.
• The model should allow drilling down or rolling up along dimension hierarchies.
Schemas for multidimensional data models (4.2.2)
• The most popular data model for a data warehouse is a multidimensional model, which can exist in the form of
  – a star schema
  – a snowflake schema
  – a fact constellation schema
A star schema
• In a star schema, the data warehouse contains
1. a large central table (fact table) containing the bulk of the data, with no redundancy
2. a set of smaller attendant tables (dimension tables), one for each dimension.
[Figure: a central Facts table surrounded by Dimension 1 to Dimension 5]
Example of Star Schema
Sales Fact Table: time_key, item_key, branch_key, location_key; Measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, state_or_province, country
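
A reduced version of this star schema can be tried out with Python's built-in sqlite3 module. The sketch below is an assumption-laden miniature: the table names carry a `_dim` or `_fact` suffix chosen here for clarity, the branch dimension and several columns are omitted for brevity, and all rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT,
                       brand TEXT, type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
-- The fact table holds one foreign key per dimension plus the measures.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    units_sold   INTEGER,
    dollars_sold REAL
);
INSERT INTO time_dim VALUES (1, 'Q1', 2016), (2, 'Q2', 2016);
INSERT INTO item_dim VALUES (1, 'TV', 'BrandA', 'home entertainment');
INSERT INTO location_dim VALUES (1, 'Toronto', 'Canada'), (2, 'Chicago', 'USA');
INSERT INTO sales_fact VALUES (1, 1, 1, 3, 600.0), (1, 1, 2, 1, 200.0),
                              (2, 1, 1, 2, 400.0);
""")

# A typical star join: each dimension is one join away from the fact table.
rows = conn.execute("""
    SELECT t.quarter, l.country, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN time_dim t     ON t.time_key = f.time_key
    JOIN location_dim l ON l.location_key = f.location_key
    GROUP BY t.quarter, l.country
    ORDER BY t.quarter, l.country
""").fetchall()
print(rows)
```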
STAR SCHEMA KEYS
• Avoid built-in meanings in the primary key of the dimension table.
• Do not use operational system keys as primary keys.
  • Operational system keys contain built-in meaning.
  • The keys may be reassigned, thus giving wrong aggregate values.
• Use keys which are system-generated sequence numbers.
• The operational system keys can be stored as attributes in the dimension table.
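
One way to picture these rules is a loader that hands out sequence numbers as primary keys and keeps the operational key only as an ordinary attribute. The class name `DimensionTable`, the `source_key` attribute, and the SKU strings below are all hypothetical.

```python
from itertools import count

class DimensionTable:
    """Assigns system-generated sequence numbers as primary keys."""

    def __init__(self):
        self._next_key = count(1)
        self.rows = {}          # surrogate key -> attributes
        self._by_source = {}    # operational key -> surrogate key

    def load(self, operational_key, **attributes):
        """Insert or update a row; return its meaning-free surrogate key."""
        if operational_key not in self._by_source:
            self._by_source[operational_key] = next(self._next_key)
        key = self._by_source[operational_key]
        # The operational key is stored as a plain attribute, not as the key.
        self.rows[key] = {"source_key": operational_key, **attributes}
        return key

item_dim = DimensionTable()
tv_key = item_dim.load("SKU-TV-42", item_name="TV")
laptop_key = item_dim.load("SKU-LT-07", item_name="laptop")
tv_again = item_dim.load("SKU-TV-42", item_name="TV", brand="BrandA")
print(tv_key, laptop_key, tv_again)
```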
Advantages of Star Schema
• The star schema reflects exactly the way decision makers think in terms of business metrics.
• Users understand the structures very easily.
• It optimizes navigation through the database.
• It is most suitable for query processing.
• It allows query processor software to use better execution plans.

Disadvantages of Star Schema
• Dimension tables are not normalized, leading to redundancy and inconsistency.
The snowflake schema (4.6)
• The snowflake schema is a variant of the star schema model.
• Some dimension tables are normalized, thereby further splitting the data into additional tables.
• The resulting schema graph forms a shape similar to a snowflake.
• Ex. The item and location dimensions have been normalized to give rise to two more tables, supplier and city.
Example of Snowflake Schema
Sales Fact Table: time_key, item_key, branch_key, location_key; Measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_key
supplier: supplier_key, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city_key
city: city_key, city, state_or_province, country
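
The effect of normalizing the location dimension can be seen in a small sqlite3 sketch: reaching the country now takes two joins (fact to location, location to city) instead of one. Only the location and city tables are modelled here, and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE city (city_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
-- location no longer stores city and country, only a key into city.
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT,
                       city_key INTEGER REFERENCES city(city_key));
CREATE TABLE sales_fact (location_key INTEGER REFERENCES location(location_key),
                         dollars_sold REAL);
INSERT INTO city VALUES (1, 'Toronto', 'Canada'), (2, 'Chicago', 'USA');
INSERT INTO location VALUES (10, '1 King St', 1), (11, '2 Queen St', 1),
                            (12, '3 Lake St', 2);
INSERT INTO sales_fact VALUES (10, 100.0), (11, 250.0), (12, 75.0);
""")

rows = conn.execute("""
    SELECT c.country, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN location l ON l.location_key = f.location_key
    JOIN city c     ON c.city_key = l.city_key      -- the extra snowflake join
    GROUP BY c.country
    ORDER BY c.country
""").fetchall()
print(rows)
```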
Advantages of Snowflake Schema
• Small savings in storage space.
• Normalized structures are easier to update and maintain

Disadvantages of Snowflake Schema


• Schema is less intuitive and end-users are affected by the complexity
• Difficult to browse through contents
• Degraded query performance because of additional joins
Fact constellation
• A fact constellation schema allows dimension tables to be shared between fact tables.
• For example, the dimension tables for time, item, and location are shared between the sales and shipping fact tables.
Sales Fact Table: time_key, item_key, branch_key, location_key; Measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location; Measures: dollars_cost, units_shipped
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, state_or_province, country
shipper: shipper_key, shipper_name, location_key, shipper_type
Inside a dimension table
• Dimension table key: primary key that uniquely identifies rows in the table.
• Table is wide: it has many columns or attributes.
• Fewer records: fewer rows than the fact table (dimension tables in hundreds, fact tables in millions).
• Textual attributes: attributes are in textual format and represent textual descriptions of a business dimension.
• Attributes not directly related: some attributes are not related to one another, such as brand and supplier_type in the item dimension table, but both are attributes of item.
• Not normalized: for efficient query performance.
• Multiple hierarchies: dimension tables often provide for multiple hierarchies, so that drilling down may be performed along any of them.
Inside a Fact table
• Fully additive measures: the values can be summed up by simple addition.
• Semi-additive measures: derived attributes such as dollars earned per quantity are not additive. Distinguish semi-additive measures from fully additive measures when performing aggregation in queries.
• Table deep, not wide: fewer attributes than a dimension table, but a large number of rows. The fact table is spread vertically while a dimension table is spread horizontally.
• Sparse data: there could be combinations of dimension table attributes for which the fact table entry is null.
• Degenerate dimensions: attributes such as sales_order_no that appear in the fact table although they are not measures; useful for analysis such as the average number of items per order.
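
The warning about aggregation can be made concrete with two invented fact rows. Dollars sold and units sold add up directly, while a derived ratio such as dollars per unit (the kind of measure the text classes as semi-additive) has to be recomputed from the additive totals rather than summed.

```python
# Two hypothetical fact rows; values invented for illustration.
fact_rows = [
    {"dollars_sold": 600.0, "units_sold": 3},   # 200 dollars per unit
    {"dollars_sold": 100.0, "units_sold": 1},   # 100 dollars per unit
]

# Fully additive measures: simple addition gives a meaningful total.
total_dollars = sum(r["dollars_sold"] for r in fact_rows)
total_units = sum(r["units_sold"] for r in fact_rows)

# Summing the derived ratio row by row gives a number with no business meaning.
naive_sum = sum(r["dollars_sold"] / r["units_sold"] for r in fact_rows)

# Correct aggregate: recompute the ratio from the additive components.
dollars_per_unit = total_dollars / total_units

print(total_dollars, total_units, naive_sum, dollars_per_unit)
```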
The Factless Fact Table
• Apart from the concatenated primary key, a factless fact table contains no facts or measures.
• Ex. The fact table for analyzing student attendance, with possible dimensions student, course, professor, room, and date.
  Fact table keys: Date Key, Course Key, Student Key, Professor Key, Room Key
  Dimensions: Date, Course, Student, Professor, Room
• The presence of a fact table entry itself indicates attendance. The fact table represents events.
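
Because a factless fact table has no measures, analysis consists of counting rows. A minimal sketch of the attendance example, with invented key values:

```python
from collections import Counter

# Each row is only a combination of dimension keys:
# (date_key, course_key, student_key, professor_key, room_key)
attendance_fact = [
    (20160212, "DW101", "S1", "P1", "R1"),
    (20160212, "DW101", "S2", "P1", "R1"),
    (20160213, "DW101", "S1", "P1", "R2"),
]

# The existence of a row is the event, so counting rows answers the questions.
attendance_per_course_day = Counter((d, c) for d, c, *_ in attendance_fact)
classes_attended = Counter(s for _, _, s, _, _ in attendance_fact)

print(attendance_per_course_day)
print(classes_attended)
```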
Revision questions
1. Define data warehouse.
2. Explain OLTP and OLAP.
3. Compare OLAP and OLTP.
4. State and explain OLAP operations.
5. Explain the star model.
6. Explain the snowflake model.
7. Explain the fact constellation model.
8. Briefly describe the fact table and the dimension table.
