Introduction To Data Warehousing

Download as pptx, pdf, or txt
Download as pptx, pdf, or txt
You are on page 1of 46

Introduction to Data Warehousing

Data Warehouse

Maintain historic data Analysis to get better understanding of business Better Decision making Definition: A data warehouse is a subject-oriented integrated

time-varying
non-volatile collection of data that is used primarily in organizational decision making. -- Bill Inmon, Building the Data Warehouse 1996

Subject Oriented

Data warehouse is organized around subjects such as sales, product, customer. It focuses on modeling and analysis of data for decision makers. Excludes data not useful in decision support process.

Integrated

Data Warehouse is constructed by integrating multiple heterogeneous sources. Data Preprocessing are applied to ensure consistency.

RDBMS
Data Processing Data Transformation
Legacy System Data Warehouse Data Processing Data Transformation

Flat File

Non-volatile

Mostly, data once recorded will not be updated. Data warehouse requires two operations in data accessing - Incremental loading of data - Access of data

load

access

Time Variant

Provides information from historical perspective e.g. past 510 years Every key structure contains either implicitly or explicitly an element of time

Why Data Warehouse?


Problem Statement: ABC Pvt Ltd is a company with branches at Mumbai, Delhi, Chennai and Bangalore. The Sales Manager wants quarterly sales report across the branches. Each branch has a separate operational system where sales transactions are recorded.

Why Data Warehouse?


Mumbai

Delhi

Get quarterly sales figure for each branch and manually calculate sales figure across branches.

Sales Manager

Chennai

Banglore

What if he need daily sales report across the branches?

Why Data Warehouse?


Solution: Extract sales information from each database. Store the information in a common repository at a single site.

Why Data Warehouse?


Mumbai

Delhi
Data Warehouse Chennai

Query & Analysis tools

Sales Manager

Banglore

Characteristics of Data Warehouse


Relational / Multidimensional database Query and Analysis rather than transaction Historical data from transactions Consolidates Multiple data sources Separates query load from transactions Mostly non volatile Large amount of data in order of TBs

When we say large - we mean it!


Terabytes -- 10^12 bytes: Yahoo! 300 Terabytes and growing Geographic Information Systems Exabytes -- 10^18 bytes: National Medical Records Zettabytes -- 10^21 bytes: Weather images Zottabytes -- 10^24 bytes:

Petabytes -- 10^15 bytes:

Intelligence Agency Videos

OLTP Vs Data Warehouse (OLAP)


OLTP
Indexes Data Joins Derived data and aggregates Few Normalized Many Rare

Data Warehouse (OLAP)


Many Generally De-normalized Some Common

Data Warehouse Architecture


Operational System Sales Data Mart Analysis Operational System ETL (Extract Transform and Load) Data Warehouse Generic Data Mart

Flat Files

Data Mining

Flat Files

Inventory Data Mart

Reporting

ETL
ETL stands for Extract, Transform and Load

Data is distributed across different sources Flat files, Streaming Data, DB Systems, XML, JSON Data can be in different format CSV, Key Value Pairs Different units and representation Country: IN or India Date: 20 Nov 2010 or 20101020

ETL Functions

Extract Collect data from different sources Parse data Remove unwanted data Transform Project Generate Surrogate keys Encode data Join data from different sources Aggregate Load

ETL Steps

The first step in ETL process is mapping the data between source systems and target database. The second step is cleansing of source data in staging area. The third step is transforming cleansed source data. Fourth step is loading into the target system.

Data before ETL Processing:

Data after ETL Processing:

ETL Glossary
Mapping: Defining relationship between source and target objects. Cleansing: The process of resolving inconsistencies in source data. Transformation: The process of manipulating data. Any manipulation beyond copying is a transformation. Examples include aggregating, and integrating data from multiple sources. Staging Area: A place where data is processed before entering the warehouse.

Dimension

Categorizes the data. For example - time, location, etc. A dimension can have one or more attributes. For example - day, week and month are attributes of time dimension. Role of dimensions in data warehousing. - Slice and dice - Filter by dimensions

Types of dimensions

Conformed Dimension - A dimension that is shared across fact tables. Junk Dimension - A junk dimension is a convenient grouping of flags and indicators. For example, payment method, shipping method. De-generated Dimension - A dimension key, that has no attributes and hence does not have its own dimension table. For example, transaction number, invoice number. Value of these dimension is mostly unique within a fact table. Role Playing Dimensions - Role Playing dimension refers to a dimension that play different roles in fact tables depending on the context. For example, the Date dimension can be used for the ordered date, shipment date, and invoice date. Slowly Changing Dimensions - Dimensions that have data

Types of Slowly Changing Dimension

Type1 - The Type 1 methodology overwrites old data with new data, and therefore does not track historical data at all. Type 2 - The Type 2 method tracks historical data by creating multiple records for a given value in dimension table with separate surrogate keys. Type 3 - The Type 3 method tracks changes using separate columns. Whereas Type 2 had unlimited history preservation, Type 3 has limited history preservation, as it's limited to the number of columns we designate for storing historical data. Type 4 - The Type 4 method is usually referred to as using "history tables", where one table keeps the current data, and an additional table is used to keep a record of all changes. Type 1, 2 and 3 are commonly used.

Facts

Facts are values that can be examined and analyzed. For Example - Page Views, Unique Users, Pieces Sold, Profit. Fact and measure are synonymous. Types of facts: Additive - Measures that can be added across all dimensions. Non Additive - Measures that cannot be added across all dimensions. Semi Additive - Measures that can be added across few dimensions and not with others.

How to store data?


Facts and Dimensions: 1. Select the business process to model 2. Declare the grain of the business process 3. Choose the dimensions that apply to each fact table row 4. Identify the numeric facts that will populate each fact table row

Dimension Table

Contains attributes of dimensions e.g. Month is an attribute of Time dimension. Can also have foreign keys to another dimension table Usually identified by a unique integer primary key called surrogate key

Fact Table

Contains Facts Foreign keys to dimension tables Primary Key: usually composite key of all FKs

Types of schema used in data warehouse Star Schema


Snowflake Schema Fact Constellation Schema

Star Schema

Multi-dimensional Data Dimension and Fact Tables A fact table with pointers to Dimension tables

Star Schema

Snowflake Schema

An extension of star schema in which the dimension tables are partly or fully normalized. Dimension table hierarchies broken down into simpler tables.

Snowflake Schema

Fact Constellation Schema


A fact constellation schema allows dimension tables to be shared between fact tables. This Schema is used mainly for the aggregate fact tables, OR where we want to split a fact table for better comprehension. For example, a separate fact table for daily, weekly and monthly reporting requirement.

Fact Constellation Schema

In this example, the dimensions tables for time, item, and location are shared between both the sales and shipping fact tables.

Operations on Data Warehouse


Drill Down Roll up Slice & Dice Pivoting

Drill Down
Product Category e.g Home Appliances Sub Category e.g Kitchen Appliances Product e.g Toaster

Time

Roll Up
Year Fiscal Year

Quarter

Fiscal Quarter

Month

Fiscal Month

Fiscal Week

Day

Slice & Dice


Product = Toaster Product

Time Time

Pivoting
Product Product

Time

Region

Also called rotation Rotate on an axis Interchange Rows and Columns

Advantages of Data Warehouse

One consistent data store for reporting, forecasting, and analysis Easier and timely access to data Scalability Trend analysis and detection Drill down analysis

Disadvantages of Data Warehouse


Preparation may be time consuming. High associated cost

Case Study: Data Warehouse Design


ABC Pvt Ltd is a new company which produces bakery products with production unit located at Mumbai. They have sales units at Mumbai, Pune. The President of the company wants sales information.

Sales Information
Report: The number of units sold.
113

Report: The number of units sold over time


January 14 February 41 March 33 April 25

Report : The number of items sold for each product with time Jan Feb Mar Apr Wheat Bread Cheese Swiss Rolls 6 8 16 25 6 6 21
Product

17 8

Sales Information
Report: The number of items sold in each City for each product with time Jan Mumbai Wheat Bread Cheese Swiss Rolls Pune Wheat Bread Cheese Swiss Rolls 3 4 9 15 3 4 16 16 Feb Mar 3 6 6 3 7 8
Time

Apr 10

Product

Identify sales Facts & Dimensions

Facts Units sold


Dimensions Product, Time, Region. Fact Table City_ID Prod_ID 1 1 589 1218 Time_Id Units 1 1 3 4

2
2

589
1218

1
1

3
4

589

16

Time dimension table Time_Id 1 2 Month January 2011 Februry 2011

Identify sales Facts & Dimensions


Region Dimension Table

City_ID 1 2

City Mumbai Pune

Region West NorthWest

Country India India

Product Dimension Tables

Prod_ID 589 590 288

Product_Name Wheat Bread White Bread Coconut Cookies

Product_Category_ID 1 1 2

Product_Category_Id Product_Category 1 2 Bread Cookies

Data warehouse model


Time Product

Product Category

Sales Fact

Region

Thank You

You might also like