Data Warehousing: 5 April 2013 TCS Public

Data Warehousing

Contents
Data Warehouse Concepts Data Warehouse Architectures Data Modeling Approaches Data Modeling Development Cycle

Data Warehouse Concepts

Data Warehouse Concepts Agenda


A. What is a Data Warehouse (DW)?
B. What are the components of a DW?
C. What are the various architectures/formats of a DW?
D. Examples of Data Warehousing tools in use

Need for Data Warehousing Business View


Customer Centricity
- Single view of each customer and his/her activities
- Integrated information from heterogeneous sources
Adaptability to Rapidly Changing Business Needs
- Multiple ways to view business performance
- Low cycle time, faster analytics
Increased Global Competition
- Crunch more and more data, faster and faster
Mergers and Acquisitions
- With each acquisition comes another set of disparate IT systems, affecting consistency and performance

Need for Data Warehousing Systemic View


Performance Optimization
- OLTP systems get overloaded with large analytical queries
- Data models for OLTP and OLAP are very different
Reduced Reliance on IT to Produce Reports
- Report building on OLTP systems is very technical
- OLTP systems are not built to hold historical data
Data Security
- Prevent unauthorized access to sensitive data

OLTP vs. DSS : A comparison


OLTP Environment
- get data IN
- large volumes of simple transaction queries
- continuous data changes
- low processing time
- mode of processing: transaction details
- data inconsistency
- mostly current data
- high concurrent usage
- highly normalized data structure
- static applications
- automates routines

DSS Environment
- get information OUT
- small number of diverse queries
- periodic updates only
- high processing time
- mode of discovery: subject-oriented summaries
- data consistency
- historical data is relevant
- low concurrent usage
- fewer tables, but more columns per table
- dynamic (ad-hoc) applications
- facilitates creativity
Data Warehouse Defined

A Data Warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data that enables management decision making.

Subject Orientation
Process-oriented order-entry data in transactional storage (Sales Rep, Quantity Sold, Part Number, Date, Customer Name, Product Description, Unit Price, Mail Address) is reorganized in data warehouse storage around subjects: Sales, Customers, Products.

Data Volatility
Volatile (transactional storage): record-by-record data manipulation - insert, change, delete, access.
Non-volatile (data warehouse storage): mass load and access of data; records are loaded in bulk and then only read.
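The contrast above can be sketched in code: the operational store is updated record by record, while the warehouse takes dated mass loads and is then read-only. The tables, field names, and values below are illustrative, not from the deck.

```python
# Volatile transactional store: record-by-record manipulation.
oltp = {101: {"balance": 50.0}}
oltp[101]["balance"] -= 20.0          # in-place change
oltp[102] = {"balance": 10.0}         # single-record insert

# Non-volatile warehouse store: append-only, dated snapshots.
warehouse = []

def mass_load(snapshot_date, rows):
    """Bulk-load a dated snapshot; existing warehouse rows are never updated."""
    warehouse.extend({"as_of": snapshot_date, **row} for row in rows)

mass_load("2013-04-05", [{"account": k, **v} for k, v in oltp.items()])
print(warehouse[0])  # {'as_of': '2013-04-05', 'account': 101, 'balance': 30.0}
```

Later loads add new snapshots rather than overwriting old ones, which is what makes the warehouse non-volatile and time-variant.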

Time Variance
Transactional storage holds current data; data warehouse storage holds historical data - for example, a chart of Sales (in lakhs) by region (East, West, North) for January-March of Year 97.

Data Warehouse Characteristics


- Stores large volumes of data used frequently by DSS
- Is maintained separately from operational databases
- Is relatively static, with infrequent updates
- Contains data integrated from several, possibly heterogeneous, operational databases
- Supports queries that process large data volumes



Data Warehouse Components


Feeder systems (FS1, FS2, ..., FSn - e.g. legacy systems) transmit data over the network into a staging area. There the data undergoes extraction, cleansing, transformation, aggregation, and summarization before being loaded into the ODS and the data warehouse (DW). Data mart population then builds DM1, DM2, ..., DMn, which serve OLAP analysis and knowledge discovery. A metadata layer spans the entire flow.


Data Warehouse Build Lifecycle


- Data extraction
- Data cleansing and transformation
- Data load and refresh
- Build derived data and views
- Service queries
- Administer the warehouse
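The first four lifecycle steps can be sketched as a toy pipeline. All source records, field names, and cleansing rules here are hypothetical illustrations, not part of the original deck.

```python
def extract(source_rows):
    """Extract: pull raw records from a (simulated) operational source."""
    return list(source_rows)

def cleanse_and_transform(rows):
    """Cleanse and transform: drop incomplete records, normalize fields."""
    cleaned = []
    for row in rows:
        if row.get("amount") is None:          # reject incomplete records
            continue
        cleaned.append({
            "region": row["region"].strip().title(),
            "amount": float(row["amount"]),
        })
    return cleaned

def load(warehouse, rows):
    """Load/refresh: append-only load into the (simulated) warehouse table."""
    warehouse.extend(rows)

def build_summary(warehouse):
    """Derived data: total sales by region."""
    summary = {}
    for row in warehouse:
        summary[row["region"]] = summary.get(row["region"], 0.0) + row["amount"]
    return summary

warehouse = []
source = [
    {"region": " east ", "amount": "120.5"},
    {"region": "West", "amount": None},        # dropped during cleansing
    {"region": "East", "amount": "79.5"},
]
load(warehouse, cleanse_and_transform(extract(source)))
print(build_summary(warehouse))               # {'East': 200.0}
```

In a real build, each stage would be a separate tool or job (see the ETL tools listed later in the deck); the structure of the flow is the same.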


Data Warehouse Architectures


- Virtual Data Warehouse
- Enterprise Data Warehouse
- Distributed Data Marts
- Multi-tiered Warehouse


Virtual Data Warehouse

A reporting tool queries the operational systems data (legacy, client/server, OLTP applications, and external sources) directly on behalf of users; no separate warehouse store is built.


Enterprise Data Warehouse


Operational systems data (legacy, client/server, OLTP, and external sources) passes through a data preparation layer - select, extract, transform, integrate, maintain - into a single central data warehouse with a metadata repository. Users reach the warehouse through a reporting tool via an API.


Distributed Data Marts

Operational systems data (legacy, client/server, OLTP, and external sources) passes through the same data preparation steps - select, extract, transform, integrate, maintain - but is loaded into several independent data marts, each accessed by users through a reporting tool via an API; there is no central warehouse.


Multi-tiered Data Warehouse: Option 1


Enterprise-wide operational systems data (legacy, client/server, OLTP, and external sources) is prepared - select, extract, transform, integrate, maintain - and loaded into data marts, from which the central data warehouse, with its metadata repository, is populated. Users query through a reporting tool via an API.


Multi-tiered Data Warehouse: Option 2

Operational systems data (legacy, client/server, OLTP, and external sources) is prepared - select, extract, transform, integrate, maintain - and loaded into the central data warehouse with its metadata repository; dependent data marts are then populated from the warehouse. Users query through a reporting tool via an API.


Relative Data sizes in a Data Warehouse


From smallest to largest volume:
- Highly summarized data
- Lightly summarized data
- Current detail data
- Older detail data
Metadata describes data at every level.
Data Warehouse - Example


- Highly summarized: monthly sales by region for 1991-94; monthly sales by product for 1991-94
- Lightly summarized: weekly sales by region for 1991-94; weekly sales by product/sub-product for 1991-94
- Current detail: sales detail for 1991-94
- Older detail: sales detail for 1985-90
- Metadata describing all of the above
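The summary levels above are roll-ups of the same detail data. A minimal sketch, with a hypothetical sales-detail table (the rows and column names are illustrative, not from the deck):

```python
from collections import defaultdict

# Sales detail: (year, month, region, product, amount)
detail = [
    (1991, 1, "East", "Widget", 100.0),
    (1991, 1, "West", "Widget", 50.0),
    (1991, 2, "East", "Gadget", 75.0),
]

def monthly_by_region(rows):
    """Lightly summarized level: monthly sales by region."""
    out = defaultdict(float)
    for year, month, region, _product, amount in rows:
        out[(year, month, region)] += amount
    return dict(out)

def yearly_total(rows):
    """Highly summarized level: total sales per year."""
    out = defaultdict(float)
    for year, *_rest, amount in rows:
        out[year] += amount
    return dict(out)

print(monthly_by_region(detail))
print(yearly_total(detail))       # {1991: 225.0}
```

Each level trades detail for volume: the higher the summary, the smaller the table, which is exactly the pyramid the slide describes.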


Building a Data Warehouse - Steps

- Identify key business drivers, sponsorship, risks, and ROI
- Survey information needs; identify desired functionality and define functional requirements for the initial subject area
- Architect the long-term data warehousing architecture
- Evaluate and finalize the DW tools and technology
- Conduct a proof of concept
- Design the target database schema
- Build data mapping, extract, transformation, cleansing, and aggregation/summarization rules
- Build the initial data mart as an exact subset of the enterprise data warehousing architecture, and expand to the enterprise architecture over subsequent phases
- Maintain and administer the data warehouse


Representative DW Tools
- ETL Tools: ETI Extract, Informatica, IBM Visual Warehouse, Oracle Warehouse Builder
- OLAP Server Products: Oracle Express Server, Hyperion Essbase, IBM DB2 OLAP Server, Microsoft SQL Server OLAP Services, Seagate HOLOS, SAS/MDDB
- OLAP Tools: Oracle Express Suite, Business Objects, Web Intelligence, SAS, Cognos PowerPlay/Impromptu, KALIDO, MicroStrategy, Brio Query, MetaCube
- Data Warehouse (DBMS): Oracle, Informix, Teradata, DB2/UDB, Sybase, Microsoft SQL Server, Red Brick
- Data Mining & Analysis: SAS Enterprise Miner, IBM Intelligent Miner, SPSS/Clementine, TCS Tools


Top-Down Approach
Using the top-down approach, you discover and draft a description of the business process. That description supplies the concepts used as a starting point. This is a functional, or process-driven, analysis: as the name implies, you start at the top and drill downward, increasing the level of detail in an iterative fashion. It typically needs more development time. Without it you may miss the following:
- Assumptions everyone expects you to know
- Future developments that could change your direction
- Opportunities to increase the quality, usability, and accessibility of enterprise data

With it you gain the following:
- An understanding of the way things fit together, from high to low levels of detail
- A sense of the political environment that may surround the data
- An enhanced understanding of data importance
- A guide to the level of detail you need for different audiences

In top-down analysis, people are the best source of information. A top-down implementation also tends to imply an enterprise-wide or corporate-wide data warehouse, with a higher degree of cross-workgroup, cross-department, or cross-line-of-business access to the data. It can yield more consistent data definitions and the enforcement of business rules across the organization from the beginning. However, the cost of the initial planning and design can be significant: it is a time-consuming process and can delay actual implementation, benefits, and return on investment.

Bottom-Up Approach
The bottom-up approach focuses instead on the inventory of things in a process. It implies an in-depth understanding of as much of the process as can be known at this point. Using this approach you discover and draft a list of potential elements without regard to how they are used. The list usually consists of a mixed set of very low-level, detailed notions and high-level concepts; the trick is to aggregate them to the same level of detail. This is a data-driven analysis: you concentrate on what things are - on the parts rather than the process. As the name implies, you start at the bottom and aggregate upward, increasing your level of aggregation, again in an iterative fashion. Without it you may miss a real-world understanding of the data and how it fits together, as well as the following:
- Data areas that everyone expects you to know

- Relationships
- Fuzzy, currently undefined areas that need extra work to bring them to the same level of understanding
With it you gain the following:
- An understanding of the things involved
- A sense of the quality levels that may be inherent to the data
- An enhanced understanding of data definitions

In bottom-up analysis, the current environment is the best source of information. The bottom-up implementation approach has become the choice of many organizations, especially business management, because of the faster payback. It delivers results sooner because data marts have a less complex design than a global data warehouse, and the initial implementation is usually less expensive in hardware and other resources. Typically, however, a bottom-up approach is confined to a limited set of requirements and focuses on a short-term solution that delivers the reporting needs quickly.

Data Modeling Development Cycle

- Conceptual Data Modeling: includes all major things that need to be tracked, along with constraints; usually specified in terms of business requirements, forms, reports, etc.
- Logical Data Modeling: the implementation of the conceptual model in a logical data model; usually expressed in terms of entities, attributes, relationships, and keys.
- Physical Data Modeling: a complete model that includes all required tables, columns, relationships, database properties, and referential integrity constraints for the physical implementation.
- Database Creation: DBAs instruct the data modeling tool to generate SQL code from the physical data model; the SQL code is then executed on the server to create the database.

Development Cycle - Conceptual Data Modeling


A Conceptual Data Model (CDM) is the first step in constructing a data model in the top-down approach: a clear, accurate visual representation of the business of an organization. In many ways it represents the user's view of the business, providing high-level information about the organization's subject areas. CDM discussion starts with the main subject areas of an organization and relies on specs, reports, forms, views, requirements, application demos, and user interactions to form a conceptual view of the business.

Development Cycle - Logical Data Modeling


This is the next step of development after the conceptual data model.
A Logical Data Model (LDM) is the version of a data model that represents the business requirements (in whole or in part) of an organization and is developed before the physical data model. It includes all required entities, attributes, key groups, relationships, and functional constraints that represent business information and define business rules. Much clarification of definitions and calculations is accomplished throughout the organization in this phase, especially between the data modelers and the business users/analysts. Once the logical data model is complete, it is forwarded to the business users for review and verification.

Logical Model Focus Areas


The logical model focuses on the following:
- Entities
- Attributes
- Candidate and, eventually, primary keys
- Logical data types and domains
- Relationships, cardinality, and nullability
You'll perform another entity analysis, similar to the exhaustive process of building the conceptual model: you review each conceptual-model entity and its relationships one by one to discover the more detailed logical entities and relationships hidden in the generality. Note that you won't always do the modeling tasks in this serial fashion. Many times you'll know the default pattern of logical entities that maps to an enterprise concept, and your tools will let you source them into new logical models, so that entities such as Geography, Calendar Date, Employee, and Address are always modeled, named, defined, and related the same way from project to project. You can then tweak the default slightly for business specifics.

Development Cycle - Physical Data Modeling


Physical data models are used to design the internal schema of a database: the tables (derived from the logical entities), the columns of those tables (derived from the entity attributes), and the relationships between the tables (derived from the entity relationships). Database performance, indexing strategy, physical storage, and denormalization are important parameters of a physical model.
The transformation from logical model to physical model includes imposing database rules, implementing referential integrity, handling supertypes and subtypes, etc. Once the physical data model is complete, it is forwarded to the technical teams (developer, group lead, DBA) for review.

Development Cycle - CDM, LDM, PDM Comparison

Conceptual Data Model
- Provides high-level information about the subject areas and the user's view of an organization
- Subject areas; things to track
- No keys identified
- No rules or constraints
- Relationships; no definitions or comments

Logical Data Model
- Represents business information and defines business rules
- Entity; Attribute
- Primary Key; Alternate Key
- Rules; functional dependencies
- Relationship; Definition

Physical Data Model
- Represents the physical implementation of the model in a database
- Table; Column
- Primary Key Constraint; Unique Constraint or Unique Index
- Check constraints, default values, user-defined constraints, referential constraints
- Foreign Key; Comment

Development Cycle - Database Creation / Development


A physical database definition (DDL for DB2, a schema for Sybase or Oracle, and so on) can be generated by entering the gathered information into a physical design tool.
The output must be reviewed carefully and, in all likelihood, modified to some degree, since no physical design tool generates a 100-percent-perfect database definition. The script can then be run against the database management system to define the physical environment.
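A toy version of that generation step, turning a physical-model description into DDL text that would then be reviewed and run against the DBMS. The table, columns, and data types are hypothetical:

```python
# Hypothetical physical model: table name -> list of (column, type) pairs.
physical_model = {
    "Sales_Fact": [
        ("TimeKey", "INTEGER NOT NULL"),
        ("ProductKey", "INTEGER NOT NULL"),
        ("SalesAmount", "DECIMAL(12,2)"),
    ],
}

def generate_ddl(model):
    """Emit one CREATE TABLE statement per table in the physical model."""
    statements = []
    for table, columns in model.items():
        cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
        statements.append(f"CREATE TABLE {table} (\n  {cols}\n);")
    return "\n\n".join(statements)

print(generate_ddl(physical_model))
```

Real modeling tools do much more (indexes, constraints, vendor dialects), but the flow is the same: model in, reviewed DDL script out.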

Data Modeling for a Data Warehouse


Commonly followed data modeling techniques are:
- Dimensional Modeling:
  a) Star schema (denormalized data)
  b) Snowflake schema (partially normalized data)
- ER (Relational) Modeling (normalized data: 1NF, 2NF, 3NF)
Each technique has its own pros and cons.

Star Schema
Fact Table
This table is the core of the Star Schema Structure and contains the Facts or Measures available through the Data Warehouse.

These Facts answer the questions of What, How Much, or How Many.
Some Examples:
Sales Dollars, Units Sold, Gross Profit, Expense Amount, Net Income, Unit Cost, Number of Employees, Turnover, Salary, Tenure, etc.

Star Schema
Dimension Tables
These tables describe the Facts or Measures. These tables contain the Attributes and may also be Hierarchical. These Dimensions answer the questions of Who, What, When, or Where. Some Examples:
Day, Week, Month, Quarter, Year Sales Person, Sales Manager, VP of Sales Product, Product Category, Product Line Cost Center, Unit, Segment, Business, Company

Star Schema
Sales_Fact (center): TimeKey, EmployeeKey, ProductKey, CustomerKey, ShipperKey, plus the required business metrics (measures).
Surrounding dimension tables: Time_Dim (TimeKey, TheDate, ...), Employee_Dim (EmployeeKey, EmployeeID, ...), Product_Dim (ProductKey, ProductID, ...), Customer_Dim (CustomerKey, CustomerID, ...), Shipper_Dim (ShipperKey, ShipperID, ...).
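A sketch of how such a star is queried: each fact row's surrogate keys are resolved against the dimension tables to attach the who/what/when context. The key values and rows below are hypothetical.

```python
# Two of the dimension tables, keyed by surrogate key.
time_dim = {1: {"TheDate": "1997-01-15"}}
product_dim = {10: {"ProductID": "P-100"}}

# Fact table: foreign keys into the dimensions plus the measures.
sales_fact = [
    {"TimeKey": 1, "ProductKey": 10, "UnitsSold": 4, "SalesDollars": 99.80},
]

def resolve(fact_rows):
    """Join each fact row to its Time and Product dimension attributes."""
    for row in fact_rows:
        yield {
            "date": time_dim[row["TimeKey"]]["TheDate"],
            "product": product_dim[row["ProductKey"]]["ProductID"],
            "units": row["UnitsSold"],
            "dollars": row["SalesDollars"],
        }

print(list(resolve(sales_fact)))
```

In SQL terms this is the classic star join: one pass over the fact table with an equi-join to each needed dimension.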

Star Schema

A star schema is a particular form of dimensional model: a central fact table containing measures, surrounded by a single perimeter of descriptors - the dimensions.

Snow Flake Schema


- Complex dimensions are re-normalized
- Different levels or hierarchies of a dimension are kept in separate tables
- A given dimension table has relationships to the tables for other levels of the same dimension
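Snowflaking can be sketched in code: a denormalized product dimension is split so the Category level lives in its own table, referenced by key. All rows and names are hypothetical.

```python
# Star-schema (denormalized) dimension rows: category repeated per product.
product_dim = [
    {"ProductKey": 1, "Product": "Widget", "Category": "Hardware"},
    {"ProductKey": 2, "Product": "Gadget", "Category": "Hardware"},
    {"ProductKey": 3, "Product": "Manual", "Category": "Books"},
]

# Split the Category hierarchy level out into its own keyed table.
categories = sorted({row["Category"] for row in product_dim})
category_dim = {name: key for key, name in enumerate(categories, start=1)}

# The product table now references the category table by key.
snowflaked_products = [
    {"ProductKey": r["ProductKey"], "Product": r["Product"],
     "CategoryKey": category_dim[r["Category"]]}
    for r in product_dim
]

print(category_dim)            # {'Books': 1, 'Hardware': 2}
print(snowflaked_products[0])
```

The repeated "Hardware" string is stored once, at the cost of an extra join when querying, which is exactly the star-vs-snowflake trade-off.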

Modeling ER Model
Definition: a logical and graphical representation of an organization's information needs.

Process:
- Classifying: entities
- Characterizing: attributes
- Inter-relating: relationships

Modeling ER Model
In ER modeling, naming entities well matters for clear understanding and communication. An entity name is usually expressed grammatically as a noun rather than a verb, and the criterion for selecting it is how well the name represents the characteristics and scope of the entity. In the detailed ER model, defining a unique identifier for each entity is the most critical task. These unique identifiers are called candidate keys; from them we select the key most commonly used to identify the entity, called the primary key.
Another important concept in ER modeling is normalization: a process for assigning attributes to entities in a way that reduces data redundancy, avoids data anomalies, provides a solid architecture for updating data, and reinforces the long-term integrity of the data model. Third normal form is usually adequate. Resolving many-to-many relationships is one example of normalization.
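The many-to-many resolution mentioned above can be sketched with a classic example: students enroll in many courses and courses have many students, so an associative entity holds one row per pair. The entity names and rows are hypothetical.

```python
# The two entities on either side of the many-to-many relationship.
students = {1: "Asha", 2: "Ravi"}
courses = {"DW101": "Data Warehousing", "DB201": "Databases"}

# Associative entity resolving the relationship: one row per (student, course).
enrollment = [(1, "DW101"), (1, "DB201"), (2, "DW101")]

def courses_for(student_id):
    """Navigate the resolved relationship from the student side."""
    return [courses[c] for s, c in enrollment if s == student_id]

print(courses_for(1))  # ['Data Warehousing', 'Databases']
```

In a relational schema the `enrollment` table would carry foreign keys to both parents, turning one many-to-many relationship into two one-to-many relationships.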

Modeling Example of ER model
