Data Warehouse Development Approach
Fundamental Questions
Before deciding to build a data warehouse for your organization, you need to ask the following basic and fundamental questions and address the relevant issues:
Top-down or bottom-up approach?
Enterprise-wide or departmental?
Which first: data warehouse or data mart?
Build a pilot or go with a full-fledged implementation?
Dependent or independent data marts?
There is no one-size-fits-all strategy for data warehousing.
One alternative is the hosted warehouse.
[Figure labels: Iterative, Build, Production]
ISQS 6339, Data Mgmt & BI, Zhangxi Lin
Top-Down Approach
Analyze requirements at the enterprise level
Develop conceptual information model
Identify and prioritize subject areas
Complete a model of selected subject area
Map to available data
Perform a source system analysis
Implement base technical architecture
Establish metadata, extraction, and load processes for the initial subject area
Top-Down
The disadvantages of this approach are:
Takes longer to build, even with an iterative method
High exposure to risk of failure
Bottom-Up Approach
Define the scope and coverage of the data warehouse and analyze the source systems within this scope
Bottom-Up
The disadvantages of this approach are:
Each data mart has its own narrow view of data
Redundant data spreads into every data mart
The concept of business dimensions is fundamental to the requirements definition for a data warehouse.
Information package
Your primary goal in the requirements definition phase is to compile information packages.
Once you have firmed up the information packages, you'll be able to proceed to the other phases. Essentially, information packages enable you to:
Define the common subject areas
Design key business metrics
Decide how data must be presented
Determine how users will aggregate or roll up
Decide the data quantity for user analysis or query
Decide how data will be accessed
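As a rough illustration, an information package can be thought of as a subject area together with its business dimensions (each with a hierarchy) and its key metrics. The sketch below is only one possible representation; the class name, field names, and the sample "Sales Analysis" package are assumptions made for illustration and do not come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class InformationPackage:
    """One information package: a subject, its dimensions, and its metrics."""
    subject: str
    # Maps each business dimension to its hierarchy, highest level first.
    dimensions: dict[str, list[str]] = field(default_factory=dict)
    # Key business metrics (facts) that users want to analyze.
    metrics: list[str] = field(default_factory=list)

    def rollup_path(self, dimension: str) -> list[str]:
        """Return the levels from most detailed to most summarized."""
        return list(reversed(self.dimensions[dimension]))


# Hypothetical example package; every name below is illustrative.
sales = InformationPackage(
    subject="Sales Analysis",
    dimensions={
        "Time": ["Year", "Quarter", "Month", "Day"],
        "Location": ["State", "County", "City"],
        "Product": ["Category", "Product"],
    },
    metrics=["units_sold", "revenue"],
)

print(sales.rollup_path("Location"))  # ['City', 'County', 'State']
```

Writing the hierarchy down explicitly makes the roll-up path, and therefore the levels at which users can aggregate, visible before any schema is designed.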
Direct use by some tools
More flexible to change
Provides for speedier data loading
Can become large and unmanageable
Degrades query performance
More complex metadata
[Figure labels: State, County, City]
Degenerate Dimensions
order_number and order_line in the fact table
For example, you may be looking for the average number of products per order. To calculate the average, you have to relate the products to the order number. Attributes such as order_number and order_line in this example are called degenerate dimensions, and they are kept as attributes of the fact table.
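The calculation described above can be sketched as follows. This is a minimal example under stated assumptions: the fact rows, the product_key and quantity columns, and the function name are hypothetical; only order_number and order_line come from the slide.

```python
from collections import defaultdict

# Hypothetical fact rows: (order_number, order_line, product_key, quantity).
# order_number and order_line are degenerate dimensions: they are stored in
# the fact table and have no dimension table of their own.
fact_rows = [
    (1001, 1, "P-10", 2),
    (1001, 2, "P-20", 1),
    (1001, 3, "P-30", 5),
    (1002, 1, "P-10", 1),
    (1003, 1, "P-20", 4),
    (1003, 2, "P-40", 1),
]


def average_products_per_order(rows):
    """Group fact rows by order_number and average the distinct product counts."""
    products_by_order = defaultdict(set)
    for order_number, _order_line, product_key, _quantity in rows:
        products_by_order[order_number].add(product_key)
    if not products_by_order:
        return 0.0
    total_products = sum(len(products) for products in products_by_order.values())
    return total_products / len(products_by_order)


print(average_products_per_order(fact_rows))  # 2.0
```

Without order_number in the fact table there would be no way to tell which fact rows belong to the same order, so the grouping step could not be done.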
Analyze a representative sample of the data chosen using proven statistical methods.
Ensure that the sample reflects:
Test loads for different periods
Day-to-day operations
Seasonal data and worst-case scenarios
Indexes and summaries
Data Partitioning
Breaking up of data into separate physical units that can be handled independently
Types of data partitioning:
Horizontal partitioning
Vertical partitioning
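The two types can be illustrated with a small in-memory sketch. It assumes the usual meaning of the terms (horizontal partitioning splits a table by rows, vertical partitioning splits it by columns); the sample rows, column names, and function names are hypothetical.

```python
# Hypothetical sales rows; column names are illustrative.
rows = [
    {"sale_id": 1, "year": 2022, "amount": 120.0, "notes": "promo"},
    {"sale_id": 2, "year": 2023, "amount": 75.5, "notes": ""},
    {"sale_id": 3, "year": 2023, "amount": 310.0, "notes": "bulk"},
]


def horizontal_partition(rows, key):
    """Split rows into groups by the value of one column (here: by year)."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[key], []).append(row)
    return partitions


def vertical_partition(rows, primary_key, column_groups):
    """Split columns into groups; each group keeps the primary key so the
    pieces can be joined back together."""
    return [
        [{primary_key: row[primary_key], **{c: row[c] for c in cols}} for row in rows]
        for cols in column_groups
    ]


by_year = horizontal_partition(rows, "year")
core, extra = vertical_partition(rows, "sale_id", [["year", "amount"], ["notes"]])

print(sorted(by_year))  # [2022, 2023]
print(core[0])          # {'sale_id': 1, 'year': 2022, 'amount': 120.0}
```

Each horizontal partition holds complete rows for one year and can be loaded, backed up, or dropped on its own; each vertical partition holds all rows but only some of the columns.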
Indexing
Indexing is used for the following reasons:
It can save substantial cost by greatly improving performance and scalability.
It can replace a full table scan with a quick read of the index, followed by a read of only those disk blocks that contain the rows needed.
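The switch from a full table scan to an index lookup can be observed with SQLite's EXPLAIN QUERY PLAN, using Python's built-in sqlite3 module. This is a sketch only: the sales table, its columns, and the index name are hypothetical, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (customer_id, amount) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)

QUERY = "SELECT amount FROM sales WHERE customer_id = 42"


def query_plan(connection, sql):
    """Return the planner's description of how it would run the query."""
    plan_rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail is the fourth column of each plan row.
    return " | ".join(row[3] for row in plan_rows)


plan_before = query_plan(conn, QUERY)  # no index yet: a scan of the table
conn.execute("CREATE INDEX idx_sales_customer ON sales (customer_id)")
plan_after = query_plan(conn, QUERY)   # with the index: a search using it

print(plan_before)
print(plan_after)
```

The first plan should describe a scan of the whole table, and the second a search that uses idx_sales_customer, which is the replacement of a full table scan by an index read described above.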
Parallelism
[Figure: Sales table and Customers table, each divided into partitions P1, P2, P3]
Designing summary tables offers the following benefits:
Provides fast access to precomputed data
Reduces use of I/O, CPU, and memory
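A small sketch of the idea: totals are computed once at load time and stored, so later queries read a single precomputed value instead of re-scanning the detail rows. The detail rows, the grain (month and region), and the names used are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical detail-level fact rows: (month, region, revenue).
detail_rows = [
    ("2023-01", "East", 100.0),
    ("2023-01", "West", 250.0),
    ("2023-01", "East", 50.0),
    ("2023-02", "East", 75.0),
    ("2023-02", "West", 125.0),
]


def build_summary(rows):
    """Precompute total revenue per (month, region) once, at load time."""
    summary = defaultdict(float)
    for month, region, revenue in rows:
        summary[(month, region)] += revenue
    return dict(summary)


monthly_region_summary = build_summary(detail_rows)

# A query now reads one precomputed value instead of scanning detail rows.
print(monthly_region_summary[("2023-01", "East")])  # 150.0
```

The trade-off is that the summary must be rebuilt or updated whenever the detail data changes.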