Unit 1
• Data Source Layer
• Data Extraction Layer
• Staging Area
• ETL Layer
• Data Storage Layer
• Data Logic Layer
• Data Presentation Layer
• Metadata Layer
• System Operations Layer
The picture below shows the relationships among the different components of the data warehouse architecture:
[Figure: components of the data warehouse architecture]
• Data Source Layer
• This represents the different data sources that feed data into the data warehouse. A data source can be in any format: plain text files, relational databases, other types of databases, Excel files, and so on can all act as data sources.
• Types of source data include:
  • Operations data, such as sales data, HR data, product data, inventory data, marketing data, and systems data.
  • Web server logs with user browsing data.
  • Internal market research data.
  • Third-party data, such as census data, demographics data, or survey data.
• All these data sources together form the Data Source Layer.
• Data Extraction Layer
• Data gets pulled from the data source into the data warehouse system. Some minimal data cleansing is likely at this point, but major data transformation is unlikely.
• Staging Area
• This is where data sits prior to being scrubbed and transformed into a data warehouse / data mart. Having one common area makes it easier for subsequent data processing / integration.
• ETL Layer
• This is where data gains its "intelligence", as logic is applied to transform the data from a transactional nature to an analytical nature. This layer is also where data cleansing happens. The ETL design phase is often the most time-consuming
phase in a data warehousing project, and an ETL tool is often used in this layer.
• Data Storage Layer
• This is where the transformed and cleansed data sit. Based on scope and functionality, three types of entities can be found here: data warehouse, data mart, and operational data store (ODS). Any given system may have just one of the three, two of the three, or all three types.
• Data Logic Layer
• This is where business rules are stored. Business rules stored here do not affect the underlying data transformation rules, but do affect what the report looks like.
• Data Presentation Layer
• This refers to the information that reaches the users. It can take the form of a tabular or graphical report in a browser, an emailed report that is automatically generated and sent every day, or an alert that warns users of exceptions, among others. Usually an OLAP tool and/or a reporting tool is used in this layer.
• Metadata Layer
• This is where information about the data stored in the data warehouse system is stored. A logical data model would be an example of something that's in the metadata layer. A metadata tool is often used to manage metadata.
• System Operations Layer
• This layer includes information on how the data warehouse system operates, such as ETL job status, system performance, and user access history.
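The flow of data through the layers above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample rows, the field names (`order_id`, `product`, `amount`, `date`), and the functions `extract`, `transform`, and `present` are all hypothetical and stand in for a real source system and a real ETL tool.

```python
from collections import defaultdict

# Hypothetical rows, as they might arrive from an operational sales system.
SOURCE_ROWS = [
    {"order_id": 1, "product": " Widget ", "amount": "10.50", "date": "2024-01-05"},
    {"order_id": 2, "product": "widget", "amount": "4.50", "date": "2024-01-20"},
    {"order_id": 3, "product": "Gadget", "amount": "7.00", "date": "2024-02-02"},
]


def extract(rows):
    """Data Extraction Layer: copy rows out of the source without transforming them."""
    return [dict(row) for row in rows]


def transform(staged):
    """ETL Layer: cleanse values and turn transactions into monthly totals."""
    totals = defaultdict(float)
    for row in staged:
        product = row["product"].strip().lower()  # cleansing step
        month = row["date"][:7]                   # keep "YYYY-MM"
        totals[(month, product)] += float(row["amount"])
    return dict(totals)


def present(storage):
    """Data Presentation Layer: format the stored totals as a tabular report."""
    return [
        f"{month} {product} {total:.2f}"
        for (month, product), total in sorted(storage.items())
    ]


staging_area = extract(SOURCE_ROWS)     # Staging Area
data_storage = transform(staging_area)  # Data Storage Layer
report = present(data_storage)
for line in report:
    print(line)
# 2024-01 widget 15.00
# 2024-02 gadget 7.00
```

Keeping extraction separate from transformation mirrors the point that the staging area holds data before it is scrubbed, so the ETL logic can be rerun without going back to the source.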
Data Warehouse:
It is an optimized form of an operational database that contains only relevant information and provides fast access to data.
Subject oriented
Eg: Data related to all the departments of an organization
Integrated:
[Figure: sources A, B, and C, each with a different view of data, feed the warehouse, which presents a single unified view.]
Time-variant
Nonvolatile
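The time-variant and nonvolatile properties can be illustrated with a small sketch. It assumes a hypothetical `Warehouse` class and a made-up customer balance; the point is only that the operational system overwrites the current value, while the warehouse appends a time-stamped row on each load and never updates it in place.

```python
class Warehouse:
    """Append-only store: every load adds a row keyed by time period."""

    def __init__(self):
        self._rows = []

    def load(self, period, customer, balance):
        # Nonvolatile: rows are only appended, never updated or deleted.
        self._rows.append((period, customer, balance))

    def history(self, customer):
        # Time-variant: each row carries the period it describes.
        return [(p, b) for p, c, b in self._rows if c == customer]


operational = {"alice": 100}  # operational database keeps only the current value
warehouse = Warehouse()

warehouse.load("Jan-97", "alice", operational["alice"])
operational["alice"] = 250    # the update overwrites the old value
warehouse.load("Feb-97", "alice", operational["alice"])

print(operational["alice"])        # 250
print(warehouse.history("alice"))  # [('Jan-97', 100), ('Feb-97', 250)]
```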
Data Warehouse Properties
[Figure: Subject oriented and integrated. OLTP applications (savings, current accounts, loans) are integrated in the data warehouse around the subject Customer.]
[Figure: Time-variant. Data is keyed by time: Jan-97 (January), Feb-97 (February), Mar-97 (March).]
[Figure: Nonvolatile. Data is loaded from the operational system into the warehouse.]
Data Warehouse Versus Operational Database
[Slide: No. of records accessed. Operational database: tens; data warehouse: millions.]
[Slide: Access pattern.]
[Slide: View. Operational database: detailed, flat relational; data warehouse: summarized and consolidated.]
Why Separate Data Warehouse?
• High performance for both systems
• DBMS: tuned for OLTP (access methods, indexing, concurrency control, recovery)
• Warehouse: tuned for OLAP (complex OLAP queries, multidimensional view, consolidation)
• Different functions and different data:
• missing data: Decision support requires historical data which
operational DBs do not typically maintain
• data consolidation: decision support requires consolidation (aggregation,
summarization) of data from heterogeneous sources
• data quality: different sources typically use inconsistent data
representations, codes and formats which have to be
reconciled
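The consolidation and data quality points can be made concrete with a short sketch. Everything specific here is an assumption for illustration: two hypothetical sources, "A" and "B", encode account types differently and report amounts in different units, so each record is reconciled to one representation before it is aggregated.

```python
from collections import defaultdict

# Hypothetical per-source encodings of the same attribute.
ACCOUNT_CODES = {
    "A": {"SAV": "savings", "CUR": "current"},
    "B": {"1": "savings", "2": "current"},
}
# Assumption: source A reports dollars, source B reports cents.
AMOUNT_DIVISOR = {"A": 1, "B": 100}

records = [
    ("A", "SAV", 120.0),
    ("B", "1", 3000),
    ("A", "CUR", 50.0),
    ("B", "2", 2500),
]


def reconcile(source, code, amount):
    """Map a source-specific code and unit to the warehouse representation."""
    return ACCOUNT_CODES[source][code], amount / AMOUNT_DIVISOR[source]


# Consolidation: aggregate the reconciled records from both sources.
totals = defaultdict(float)
for source, code, amount in records:
    account_type, dollars = reconcile(source, code, amount)
    totals[account_type] += dollars

print(dict(totals))  # {'savings': 150.0, 'current': 75.0}
```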
3-Tier Data Warehouse Architecture
A data warehouse adopts a three-tier architecture.
Data Extraction
Data Cleaning
Data Transformation
Load
Refresh
Bottom Tier Contains:
Data warehouse
Metadata Repository
Data Marts
Monitoring and Administration
Metadata repository:
Data Marts:
Dependent: sourced directly from the data warehouse
Independent: sourced from one or more data sources
Monitoring & Administration:
Data Refreshment
Data source synchronization
Disaster recovery
Managing access control and security
Manage data growth, database performance
Controlling the number & range of queries
Limiting the size of data warehouse
[Figure: Bottom Tier: Data Warehouse Server. Data sources A, B, and C feed the data warehouse, alongside the metadata repository, data marts, and monitoring and administration.]
Middle Tier: OLAP Server
It presents users with a multidimensional view of data from the data
warehouse or data marts.
Typically implemented using two models:
Multidimensional Data Models
• Multidimensional data models are defined by facts and dimensions.
• Facts are numerical values, such as total sales in dollars.
• Dimensions are the entities or tables with respect to which records are organized, such as time,
item, location, and supplier.
• The three schema models are:
1. Star schema
2. Snowflake schema
3. Fact constellations
Star schema
Example of Star Schema
Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_street, country
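A star schema like the one in this example can be tried out with Python's built-in `sqlite3` module. The sketch below is a cut-down version under stated assumptions: it keeps only the time and item dimensions, the table names `time_dim`, `item_dim`, and `sales_fact` are chosen for this illustration, and all rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER, day_of_the_week TEXT,
                       month TEXT, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                       type TEXT, supplier_type TEXT);
-- The fact table holds one foreign key per dimension plus the measures.
CREATE TABLE sales_fact (time_key INTEGER REFERENCES time_dim(time_key),
                         item_key INTEGER REFERENCES item_dim(item_key),
                         units_sold INTEGER, dollars_sold REAL);

INSERT INTO time_dim VALUES (1, 5, 'Sunday', 'January', 'Q1', 1997),
                            (2, 10, 'Thursday', 'April', 'Q2', 1997);
INSERT INTO item_dim VALUES (1, 'TV', 'BrandX', 'electronics', 'wholesale'),
                            (2, 'Radio', 'BrandY', 'electronics', 'retail');
INSERT INTO sales_fact VALUES (1, 1, 2, 800.0), (1, 2, 5, 250.0), (2, 1, 1, 400.0);
""")

# One join from the fact table to a dimension is enough to group by quarter.
rows = conn.execute("""
    SELECT t.quarter, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN time_dim AS t ON f.time_key = t.time_key
    GROUP BY t.quarter
    ORDER BY t.quarter
""").fetchall()
print(rows)  # [('Q1', 1050.0), ('Q2', 400.0)]
```

Each dimension sits exactly one join away from the fact table, which is what gives the schema its star shape.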
Snowflake schema
Example of Snowflake Schema
Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_key
supplier: supplier_key, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city_key
city: city_key, city, province_or_street, country
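In the snowflake example, the location dimension is normalized into a separate city table. The following `sqlite3` sketch shows the practical consequence, using only the location branch of the schema with made-up rows; the table name `sales_fact` is an assumption for this illustration. Reaching `country` now takes two joins instead of one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE city (city_key INTEGER PRIMARY KEY, city TEXT,
                   province_or_street TEXT, country TEXT);
-- location no longer stores city details, only a key into the city table.
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT,
                       city_key INTEGER REFERENCES city(city_key));
CREATE TABLE sales_fact (location_key INTEGER REFERENCES location(location_key),
                         dollars_sold REAL);

INSERT INTO city VALUES (1, 'Vancouver', 'BC', 'Canada'), (2, 'Chicago', 'IL', 'USA');
INSERT INTO location VALUES (10, '1 Main St', 1), (11, '2 Oak Ave', 1), (12, '3 Lake Dr', 2);
INSERT INTO sales_fact VALUES (10, 100.0), (11, 50.0), (12, 70.0);
""")

rows = conn.execute("""
    SELECT c.country, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN location AS l ON f.location_key = l.location_key
    JOIN city AS c ON l.city_key = c.city_key
    GROUP BY c.country
    ORDER BY c.country
""").fetchall()
print(rows)  # [('Canada', 150.0), ('USA', 70.0)]
```

The city details are stored once per city rather than once per location, at the cost of the extra join.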
Fact constellations
Example of Fact Constellation
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
Sales Fact Table: time_key, item_key, branch_key
Shipping Fact Table: time_key, item_key, shipper_key, from_location
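A fact constellation has several fact tables that share dimension tables. The sketch below uses `sqlite3` with a sales and a shipping fact table that share one item dimension. The table names and the measure `units_shipped` are assumptions for this illustration, as are all the rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT);
-- Two fact tables reference the same dimension table.
CREATE TABLE sales_fact (time_key INTEGER,
                         item_key INTEGER REFERENCES item_dim(item_key),
                         units_sold INTEGER);
CREATE TABLE shipping_fact (time_key INTEGER,
                            item_key INTEGER REFERENCES item_dim(item_key),
                            shipper_key INTEGER, units_shipped INTEGER);

INSERT INTO item_dim VALUES (1, 'TV'), (2, 'Radio');
INSERT INTO sales_fact VALUES (1, 1, 3), (2, 1, 2), (1, 2, 4);
INSERT INTO shipping_fact VALUES (1, 1, 7, 4), (1, 2, 7, 1);
""")

# Aggregate each fact table separately so rows are not multiplied by a join
# between the two fact tables.
rows = conn.execute("""
    SELECT i.item_name,
           (SELECT COALESCE(SUM(s.units_sold), 0)
              FROM sales_fact AS s WHERE s.item_key = i.item_key),
           (SELECT COALESCE(SUM(h.units_shipped), 0)
              FROM shipping_fact AS h WHERE h.item_key = i.item_key)
    FROM item_dim AS i
    ORDER BY i.item_name
""").fetchall()
print(rows)  # [('Radio', 4, 1), ('TV', 5, 4)]
```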
Data Cubes
A 3-D data cube representation of the data according to the
dimensions time, item, and location. The measure displayed is
dollars sold (in thousands).
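A 3-D cube of this kind can be modeled as a mapping from (time, item, location) to the measure. The sketch below is one simple way to do that in plain Python; the sample values and the helper `aggregate` are hypothetical and are not taken from the figure.

```python
from collections import defaultdict

# Made-up facts: (time, item, location, dollars sold in thousands).
facts = [
    ("Q1", "TV", "Vancouver", 10),
    ("Q1", "TV", "Chicago", 20),
    ("Q1", "phone", "Vancouver", 5),
    ("Q2", "TV", "Vancouver", 15),
]

# The base cube: one cell per (time, item, location) combination.
cube = defaultdict(int)
for time, item, location, dollars in facts:
    cube[(time, item, location)] += dollars


def aggregate(cells, keep):
    """Sum the measure over every dimension whose index is not in `keep`."""
    result = defaultdict(int)
    for key, value in cells.items():
        result[tuple(key[i] for i in keep)] += value
    return dict(result)


by_time_item = aggregate(cube, (0, 1))  # drop the location dimension
grand_total = aggregate(cube, ())       # drop every dimension

print(by_time_item)  # {('Q1', 'TV'): 30, ('Q1', 'phone'): 5, ('Q2', 'TV'): 15}
print(grand_total)   # {(): 50}
```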
A 4-D data cube representation of sales data, according to the
dimensions time, item, location, and supplier. The measure
displayed is dollars sold (in thousands). For improved
readability, only some of the cube values are shown.
Cuboids Corresponding to the Cube
all: 0-D (apex) cuboid
product; date; country: 1-D cuboids
product, date, country: 3-D (base) cuboid
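The full set of cuboids for the dimensions product, date, and country can be enumerated as every subset of the dimensions, from the empty set (the apex cuboid) up to all three (the base cuboid). A minimal sketch using `itertools.combinations` follows; the function name `cuboids` is chosen for this illustration.

```python
from itertools import combinations

dimensions = ("product", "date", "country")


def cuboids(dims):
    """Group every subset of the dimensions by how many dimensions it keeps."""
    return {k: list(combinations(dims, k)) for k in range(len(dims) + 1)}


lattice = cuboids(dimensions)
for level, groups in lattice.items():
    print(f"{level}-D:", groups)
# 0-D: [()]
# 1-D: [('product',), ('date',), ('country',)]
# 2-D: [('product', 'date'), ('product', 'country'), ('date', 'country')]
# 3-D: [('product', 'date', 'country')]

total = sum(len(groups) for groups in lattice.values())
print(total)  # 8
```

With n dimensions and no concept hierarchies there are 2^n cuboids, so three dimensions give eight.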
4-D data cube
Cube Operation