Data Warehousing: 5 April 2013 TCS Public
Contents
- Data Warehouse Concepts
- Data Warehouse Architectures
- Data Modeling Approaches
- Data Modeling Development Cycle
DSS Environment
- get information OUT
- small number of diverse queries
- periodic updates only
- high processing time
- mode of discovery
- subject oriented: summaries
- data consistency
- historical data is relevant
- low concurrent usage
- fewer tables, but more columns per table
- dynamic (ad-hoc) applications
- facilitates creativity
A Data Warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data that enables management decision making.
Subject Orientation
[Diagram: the same order-entry data (Sales Rep, Quantity Sold, Part Number, Date, Customer Name, Product Description, Unit Price, Mail Address) reorganized from process-oriented transactional storage into the subject areas Sales, Customers, and Products.]
Data Volatility
[Diagram: transactional storage is volatile (records are inserted and changed in place), whereas the warehouse is non-volatile (data is loaded and then only accessed).]
Time Variance
[Diagram: transactional storage holds only current data; the warehouse also retains historical data.]
[Diagram: data from legacy/feeder systems (FSn) is transmitted over the network into a staging area/ODS, where it is cleansed, transformed, aggregated, and summarized before being loaded into the data warehouse (DW) and on to the data marts (DM1, DM2, . . ., DMn).]
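The staging-area steps named above (cleansing, transformation, aggregation/summarization) can be sketched in a few lines of plain Python. The record layout and field names here are invented for illustration, not taken from the deck:

```python
# A minimal staging-area sketch: cleanse raw feeder-system records, then
# aggregate them to the grain loaded into the warehouse. Hypothetical layout.
from collections import defaultdict

feeder_rows = [  # raw extracts from a feeder system (FSn)
    {"cust": " Alice ", "region": "N", "amount": "120.50"},
    {"cust": "alice",   "region": "N", "amount": "80.00"},
    {"cust": "Bob",     "region": "S", "amount": "bad"},   # dirty record
]

def cleanse(row):
    """Drop records with unparseable measures; standardize text fields."""
    try:
        amount = float(row["amount"])
    except ValueError:
        return None
    return {"cust": row["cust"].strip().title(),
            "region": row["region"], "amount": amount}

def summarize(rows):
    """Aggregate cleansed detail rows by (customer, region)."""
    totals = defaultdict(float)
    for r in rows:
        totals[(r["cust"], r["region"])] += r["amount"]
    return totals

clean = [r for r in (cleanse(r) for r in feeder_rows) if r is not None]
dw_load = summarize(clean)
print(dict(dw_load))  # {('Alice', 'N'): 200.5}
```

In a real pipeline each step would be driven by metadata rather than hard-coded rules, but the shape — filter, standardize, then roll up — is the same.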
[Diagram: users query external source data directly through a reporting tool.]
[Diagram: data from legacy, client/server, OLTP, and external sources is selected, extracted, transformed, and integrated into the data warehouse, which is maintained together with a metadata repository; users access it through an API via a reporting tool.]
[Diagram: data from legacy, client/server, OLTP, and external sources is selected, extracted, transformed, and integrated directly into independent data marts, which users access through an API via a reporting tool.]
[Diagram: data from client/server, OLTP, and external sources is selected, extracted, transformed, and integrated into the data warehouse, described by a metadata repository; data marts are maintained from the warehouse, and users access them through an API via a reporting tool.]
[Diagram: data is selected, extracted, transformed, and integrated into data marts first; the data marts, described by a metadata repository, then feed the data warehouse, which users access through an API via a reporting tool.]
Metadata
Representative DW Tools
Tool Category: Products
- ETL Tools: ETI Extract, Informatica, IBM Visual Warehouse, Oracle Warehouse Builder
- OLAP Server Products: Oracle Express Server, Hyperion Essbase, IBM DB2 OLAP Server, Microsoft SQL Server OLAP Services, Seagate HOLOS, SAS/MDDB
- OLAP Tools: Oracle Express Suite, Business Objects, Web Intelligence, SAS, Cognos PowerPlay/Impromptu, KALIDO, MicroStrategy, Brio Query, MetaCube
- Databases: Oracle, Informix, Teradata, DB2/UDB, Sybase, Microsoft SQL Server, Red Brick
- Data Mining Tools: SAS Enterprise Miner, IBM Intelligent Miner, SPSS/Clementine, TCS Tools
Top-Down Approach
Using the top-down approach, you discover and draft a description of the business process. That description supplies the concepts used as a starting point. This is a functional, or process-driven, analysis. As the name implies, you start at the top and drill downward, increasing the level of detail in an iterative fashion. This approach typically needs more development time. Without it you may miss the following:
- Assumptions everyone expects you to know
- Future developments that could change your direction
- Opportunities to increase the quality, usability, and accessibility of enterprise data
Top-Down Approach
With it you gain the following:
- An understanding of the way things fit together, from high to low levels of detail
- A sense of the political environment that may surround the data
- An enhancement of your understanding of data importance
- A guide to the level of detail you need for different audiences
Top-Down Approach
In top-down analysis, people are the best source of your information. A top-down implementation also tends to imply an enterprise-wide or corporate-wide data warehouse, with a higher degree of cross-workgroup, cross-department, or cross-line-of-business access to the data. It can yield more consistent data definitions and enforcement of business rules across the organization from the beginning. However, the cost of the initial planning and design can be significant: it is a time-consuming process and can delay actual implementation, benefits, and return on investment.
Bottom-Up Approach
The bottom-up approach focuses instead on the inventory of things in a process. It implies an in-depth understanding of as much of the process as can be known at this point. Using this approach, you discover and draft a list of potential elements without regard to how they are used. The list usually consists of a mixed set of very low-level, detailed notions and high-level concepts; the trick is to aggregate them to the same level of detail. This is a data-driven analysis: you concentrate on what things are — on the parts rather than the process. As the name implies, you start at the bottom and aggregate upward, again in an iterative fashion. Without it you may miss a real-world understanding of the data and how it fits together, as well as the following:
- Data areas that everyone expects you to know
Bottom-Up Approach
- Relationships
- Fuzzy, currently undefined areas that need extra work to bring them to the same level of understanding

With it you gain the following:
- An understanding of the things involved
- A sense of the quality levels that may be inherent in the data
- An enhancement of your understanding of data definitions
Bottom-Up Approach
In bottom-up analysis, the current environment is the best source of your information. The bottom-up implementation approach has become the choice of many organizations, especially business management, because of its faster payback. It delivers results sooner because data marts have a less complex design than a global data warehouse, and the initial implementation is usually less expensive in hardware and other resources. Typically, though, a bottom-up approach is confined to a limited set of requirements and focuses on a short-term solution that delivers reporting needs quickly.
Conceptual Data Modeling
This model includes all major things that need to be tracked, along with constraints. It is usually specified in terms of business requirements, forms, reports, etc.

Logical Data Modeling
This is the implementation of the conceptual model as a logical data model, usually expressed in terms of entities, attributes, relationships, and keys.

Physical Data Modeling
This is a complete model that includes all required tables, columns, relationships, database properties, and referential-integrity constraints for the physical implementation.

Database Creation
DBAs instruct the data modeling tool to generate SQL code from the physical data model; the SQL code is then executed on the server to create the databases.
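The final step — generating DDL from the physical model and executing it — can be sketched with SQLite. The toy model below (its table and column names are invented, not from the deck) stands in for what a modeling tool would hold:

```python
# A hedged sketch: turn a (toy) physical data model into CREATE TABLE
# statements and execute them, as a modeling tool would against a server.
import sqlite3

physical_model = {
    "customer_dim": ["customer_key INTEGER PRIMARY KEY", "customer_name TEXT"],
    "sales_fact":   ["customer_key INTEGER REFERENCES customer_dim(customer_key)",
                     "sales_amount REAL"],
}

def to_ddl(model):
    """Emit one CREATE TABLE statement per table in the physical model."""
    return [f"CREATE TABLE {name} ({', '.join(cols)});"
            for name, cols in model.items()]

conn = sqlite3.connect(":memory:")
for stmt in to_ddl(physical_model):
    conn.execute(stmt)

# Confirm the schema now exists in the database catalog.
tables = {r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")}
print(sorted(tables))  # ['customer_dim', 'sales_fact']
```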
Star Schema
Fact Table
This table is the core of the Star Schema Structure and contains the Facts or Measures available through the Data Warehouse.
These Facts answer the questions of What, How Much, or How Many.
Some Examples:
Sales Dollars, Units Sold, Gross Profit, Expense Amount, Net Income, Unit Cost, Number of Employees, Turnover, Salary, Tenure, etc.
Star Schema
Dimension Tables
These tables describe the Facts or Measures. These tables contain the Attributes and may also be Hierarchical. These Dimensions answer the questions of Who, What, When, or Where. Some Examples:
- Day, Week, Month, Quarter, Year
- Sales Person, Sales Manager, VP of Sales
- Product, Product Category, Product Line
- Cost Center, Unit, Segment, Business, Company
Star Schema
[Diagram: the Sales_Fact table carries the foreign keys TimeKey, EmployeeKey, ProductKey, CustomerKey, and ShipperKey plus the required business metrics (measures), surrounded by dimension tables such as Employee_Dim (EmployeeKey, EmployeeID, . . .).]
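The Sales_Fact / Employee_Dim fragment can be made concrete in SQLite. The sample rows and the SalesAmount measure column are invented for illustration:

```python
# A runnable sketch of a star-schema fragment: one fact table joined to one
# dimension table. Data values and the measure column are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employee_Dim (
    EmployeeKey INTEGER PRIMARY KEY,   -- surrogate key
    EmployeeID  TEXT                   -- natural/business key
);
CREATE TABLE Sales_Fact (
    TimeKey     INTEGER,
    EmployeeKey INTEGER REFERENCES Employee_Dim(EmployeeKey),
    SalesAmount REAL                   -- one of the business measures
);
INSERT INTO Employee_Dim VALUES (1, 'E-100'), (2, 'E-200');
INSERT INTO Sales_Fact VALUES (20130405, 1, 500.0), (20130405, 2, 300.0),
                              (20130406, 1, 200.0);
""")

# A typical star query: join the fact to a dimension and aggregate a measure.
rows = conn.execute("""
    SELECT d.EmployeeID, SUM(f.SalesAmount)
    FROM Sales_Fact f JOIN Employee_Dim d ON f.EmployeeKey = d.EmployeeKey
    GROUP BY d.EmployeeID ORDER BY d.EmployeeID
""").fetchall()
print(rows)  # [('E-100', 700.0), ('E-200', 300.0)]
```

Note how the query answers a "how much, by whom" question with a single join per dimension — the property that makes the star shape attractive for reporting tools.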
Modeling: ER Model
Definition
Logical & Graphical representation of the information needs
Process
- Classifying: Entities
- Characterizing: Attributes
- Inter-relating: Relationships
Modeling: ER Model
In ER modeling, naming entities is important for easy, clear understanding and communication. Usually, the entity name is expressed grammatically as a noun rather than a verb. The criterion for selecting an entity name is how well the name represents the characteristics and scope of the entity. In the detailed ER model, defining a unique identifier for an entity is the most critical task. These unique identifiers are called candidate keys; from them we select the key most commonly used to identify the entity, called the primary key.
Another important concept in ER modeling is normalization. Normalization is a process for assigning attributes to entities in a way that reduces data redundancy, avoids data anomalies, provides a solid architecture for updating data, and reinforces the long-term integrity of the data model. The third normal form is usually adequate. A process for resolving the many-to-many relationships is an example of normalization.
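The many-to-many resolution mentioned above can be shown concretely: an associative table turns one M:N relationship into two 1:N relationships. The students/courses example is invented for illustration:

```python
# Normalization example: resolve a many-to-many relationship (students <->
# courses) with an associative entity holding one row per pairing.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE course  (course_id  INTEGER PRIMARY KEY, title TEXT);
-- The associative entity turns one M:N relationship into two 1:N ones.
CREATE TABLE enrollment (
    student_id INTEGER REFERENCES student(student_id),
    course_id  INTEGER REFERENCES course(course_id),
    PRIMARY KEY (student_id, course_id)
);
INSERT INTO student VALUES (1, 'Ann'), (2, 'Raj');
INSERT INTO course  VALUES (10, 'DW Concepts'), (11, 'ER Modeling');
INSERT INTO enrollment VALUES (1, 10), (1, 11), (2, 10);
""")

# Each pairing is stored once, with no repeated student or course attributes.
pairs = conn.execute("""
    SELECT s.name, c.title
    FROM enrollment e
    JOIN student s ON s.student_id = e.student_id
    JOIN course  c ON c.course_id  = e.course_id
    ORDER BY s.name, c.title
""").fetchall()
print(pairs)  # [('Ann', 'DW Concepts'), ('Ann', 'ER Modeling'), ('Raj', 'DW Concepts')]
```

Because the composite primary key on enrollment forbids duplicate pairings, the design also avoids the update anomalies that motivate third normal form.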