DMDW 7
School of Computer Engineering
A special thanks to J. Han and M. Kamber, and to Tan, Steinbach, and Kumar,
for their slides and books, which I have used in preparing these slides.
Chapter Contents
q ETL (Extraction, Transformation, Loading)
q ETL Tools
q Metadata (data catalog)
q Schema Design
Ø Star Schema
Ø Snowflake Schema
Ø Fact constellation schema
q Dimension Table
q Fact Table
q OLAP Cube
q Operations on Datacube
ü Drill Down
ü Roll Up
ü Dice
ü Slice
ü Pivot
Extraction, Transformation, and Loading (ETL)
q Data extraction
Ø get data from multiple, heterogeneous, and external sources
q Data cleaning
Ø detect errors in the data and rectify them when possible
q Data transformation
Ø convert data from legacy or host format to warehouse format
q Load
Ø sort, summarize, consolidate, compute views, check integrity, and build
indices and partitions
q Refresh
Ø propagate the updates from the data sources to the warehouse
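The steps above can be sketched as a small pipeline. This is only an illustrative sketch using Python's standard library: the two sources, the `sales` and `sales_by_city` tables, and the cleaning rule (reject rows with a non-numeric amount) are assumptions made for the example, not part of any particular ETL tool. The refresh step is not shown.

```python
import csv
import io
import sqlite3

# Hypothetical source 1: a CSV export from a legacy system.
LEGACY_CSV = "id,city,amount\n1, delhi ,100\n2,Mumbai,abc\n3,Pune,250\n"
# Hypothetical source 2: records from an external application.
EXTERNAL_ROWS = [{"id": 4, "city": "DELHI", "amount": "75.5"}]


def extract():
    """Data extraction: gather raw rows from multiple, heterogeneous sources."""
    rows = list(csv.DictReader(io.StringIO(LEGACY_CSV)))
    rows.extend(EXTERNAL_ROWS)
    return rows


def clean(rows):
    """Data cleaning: detect rows with a non-numeric amount and set them aside."""
    good, rejected = [], []
    for row in rows:
        try:
            float(row["amount"])
            good.append(row)
        except ValueError:
            rejected.append(row)
    return good, rejected


def transform(rows):
    """Data transformation: convert source formats to one warehouse format."""
    return [
        (int(r["id"]), r["city"].strip().title(), float(r["amount"]))
        for r in rows
    ]


def load(conn, rows):
    """Load: sort, insert, build an index, and compute a summary table."""
    conn.execute(
        "CREATE TABLE sales (id INTEGER PRIMARY KEY, city TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", sorted(rows))
    conn.execute("CREATE INDEX idx_sales_city ON sales (city)")
    conn.execute(
        "CREATE TABLE sales_by_city AS "
        "SELECT city, SUM(amount) AS total FROM sales GROUP BY city"
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
good, rejected = clean(extract())
load(conn, transform(good))
totals = dict(conn.execute("SELECT city, total FROM sales_by_city"))
```

Here `rejected` holds the row that failed cleaning, and `totals` maps each city to its summarized amount.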
ETL Tools
q ETL tools are the equivalent of schema mappings in virtual integration, but are
more powerful
q Arbitrary pieces of code that take data from a source and convert it into data
for the warehouse:
Ø import filters – read and convert from data sources
Ø data transformations – join, aggregate, filter, convert data
Ø de-duplication – finds multiple records referring to the same entity,
merges them
Ø profiling – builds tables, histograms, etc. to summarize data
Ø quality management – test against master values, known business rules,
constraints, etc.
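Two of the components listed above, de-duplication and profiling, can be sketched in a few lines. The records, the matching rule (same e-mail address, ignoring case and surrounding spaces), and the function names are assumptions chosen for illustration; real ETL tools use far richer matching rules.

```python
from collections import Counter

# Hypothetical customer records; the first two refer to the same entity.
records = [
    {"name": "Asha Rao", "email": "asha@example.com", "city": None},
    {"name": "ASHA RAO ", "email": "Asha@Example.com ", "city": "Pune"},
    {"name": "Ravi Kumar", "email": "ravi@example.com", "city": "Delhi"},
]


def dedup_key(record):
    """Assumed matching rule: same e-mail, ignoring case and spaces."""
    return record["email"].strip().lower()


def deduplicate(rows):
    """Merge records sharing a key, keeping the first non-empty value per field."""
    merged = {}
    for row in rows:
        target = merged.setdefault(dedup_key(row), dict(row))
        for field, value in row.items():
            if target[field] in (None, "") and value not in (None, ""):
                target[field] = value
    return list(merged.values())


def profile(rows, field):
    """Profiling: build a histogram of the values of one field."""
    return Counter(row[field] for row in rows)


unique = deduplicate(records)
city_histogram = profile(unique, "city")
```

After the merge, the first entity keeps its original name and e-mail and gains the city that was missing from its first record.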
Metadata (data catalog)
q Metadata repository is an integral part of a data warehouse system.
Ø Identify subjects of the data mart
Ø Identify dimensions and facts
Ø Indicate how data is derived from enterprise data warehouses, including
derivation rules
Ø Indicate how data is derived from operational data store, including
derivation rules
Ø Identify available reports and predefined queries
Ø Identify data analysis techniques (e.g. drill-down)
Ø Identify responsible people
Metadata (cont.)
q Normally, the repository contains the following metadata:
Ø Business metadata - It contains the data ownership information,
business definition, and changing policies.
Ø Operational metadata - It includes currency of data and data lineage.
Currency of data refers to the data being active, archived, or purged.
Lineage of data means the history of data migration and the transformations
applied to it.
Ø Data for mapping from operational environment to data
warehouse - It includes source databases and their contents,
data extraction, data partitioning, cleaning, transformation rules, and data
refresh and purging rules.
Ø The algorithms for summarization - It includes dimension
algorithms, data on granularity, aggregation, summarizing, etc.
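One way to picture these four kinds of metadata is as fields of a single catalog entry. The sketch below is hypothetical: the `TableMetadata` class, its field names, and the sample values are assumptions for illustration, not the layout of any real metadata repository.

```python
from dataclasses import dataclass, field


@dataclass
class TableMetadata:
    """One hypothetical catalog entry for a warehouse table."""

    name: str
    # Business metadata
    owner: str
    business_definition: str
    # Operational metadata
    currency: str = "active"  # active, archived, or purged
    lineage: list = field(default_factory=list)  # migration/transformation history
    # Mapping from the operational environment to the warehouse
    source_databases: list = field(default_factory=list)
    refresh_rule: str = ""
    # Summarization
    granularity: str = ""

    def record_step(self, step):
        """Append one migration or transformation step to the lineage."""
        self.lineage.append(step)


catalog = {}
entry = TableMetadata(
    name="sales_fact",
    owner="Sales department",
    business_definition="One row per product sold per store per day",
    source_databases=["orders_oltp"],
    refresh_rule="nightly",
    granularity="day",
)
entry.record_step("extracted from orders_oltp")
entry.record_step("currency converted to USD")
catalog[entry.name] = entry
```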
Schema Design
q Schema refers to the structure or organization of a database.
q It contains a logical description of the entire database, which includes names and
descriptions of tables, records, views, and indexes.
q While a relational model is used to describe a database, data warehouse schemas
are more specialized because the structure is optimized for reporting and
analysis.
q Database organization
Ø Must look like the business
Ø Must be recognizable by the business user
Ø Must be approachable by the business user
Ø Must be simple
q Schema Types
Ø Star Schema
Ø Snowflake schema
Ø Fact Constellation Schema
Dimension Tables
q Every dimension table contains a group of attributes that together describe
one dimension.
q They are essentially a collection of information that can be referenced to answer
meaningful business questions when used together with fact tables.
q Hold descriptive information about a particular business perspective
q Define business in terms already familiar to users
Ø Wide rows with lots of descriptive text
Ø Small tables (about a million rows)
Ø Joined to fact table by a foreign key
Ø Heavily indexed
Ø Typical dimensions
ü time periods, geographic region (markets, cities), products,
customers, salesperson, etc.
Fact Table
q Central table: multiple dimension tables are linked to one fact table, which contains
‘keys’ and ‘measures’. ‘Keys’ are the foreign keys of every associated dimension.
q Keys are used to perform joins with dimension tables to run queries. ‘Measures’
are numeric data, such as price and quantity, that represent business
events or transactions; they add detail to dimension data so that effective
reports can be generated.
q The key of a fact table is a composite key made up of the primary keys of the
dimensions.
q Joined to dimension tables through foreign keys that reference primary
keys in the dimension tables
Ø Typical example: individual sales records
Ø Mostly raw numeric items
Ø Narrow rows, a few columns at most
Ø Large number of rows (millions to a billion)
Ø Accessed via dimensions
Star Schema
q In the star schema, the center of the star has one fact table, surrounded by a
number of associated dimension tables. It is known as a star schema because its
structure resembles a star.
Ø Each dimension table contains a set of attributes.
Ø Each dimension table is joined to the fact table using a foreign key.
Ø The dimension tables are not joined to each other.
Ø The fact table contains keys and measures.
Ø The star schema is easy to understand; its denormalized dimension tables
trade some redundant storage for fewer joins.
Ø The dimension tables are not normalized. For instance, in the above figure,
Country_ID does not have a Country lookup table as an OLTP design would have.
Ø The schema is widely supported by BI tools.
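A star schema can be sketched with an in-memory SQLite database. The table and column names (`dim_product`, `dim_store`, `fact_sales`) and the sample rows are assumptions for illustration. Note that the country is stored directly in the store dimension rather than in a separate lookup table, so the query needs only one join per dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes, not normalized.
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY, product_name TEXT, category TEXT
);
CREATE TABLE dim_store (
    store_id INTEGER PRIMARY KEY, city TEXT, country TEXT
);
-- Fact table: foreign keys to each dimension plus numeric measures.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product (product_id),
    store_id INTEGER REFERENCES dim_store (store_id),
    quantity INTEGER,
    price REAL,
    PRIMARY KEY (product_id, store_id)
);
INSERT INTO dim_product VALUES (1, 'Radio', 'Audio'), (2, 'TV', 'Video');
INSERT INTO dim_store VALUES (10, 'Pune', 'India'), (20, 'Austin', 'USA');
INSERT INTO fact_sales VALUES
    (1, 10, 3, 20.0), (2, 10, 1, 300.0), (2, 20, 2, 310.0);
""")

# A single join links the fact table to the store dimension.
revenue_by_country = dict(conn.execute("""
    SELECT s.country, SUM(f.quantity * f.price)
    FROM fact_sales AS f
    JOIN dim_store AS s ON s.store_id = f.store_id
    GROUP BY s.country
"""))
```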
Snowflake Schema
q In the snowflake schema, dimensions are stored in multiple dimension
tables instead of a single table per dimension.
q A snowflake schema is an extension of a star schema, and it adds additional
dimension tables. The dimension tables are normalized, which splits data
into additional tables.
Ø The main benefit of the snowflake schema is that it uses less disk space.
Ø It is easier to implement when a dimension is added to the schema.
Ø Query performance is reduced because multiple tables must be joined.
Ø The primary challenge of the snowflake schema is the extra maintenance
effort caused by the larger number of lookup tables.
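The effect of normalizing a dimension can be sketched with SQLite. The table names and rows are assumptions for illustration: the store dimension now points to a separate `dim_country` lookup table, so each country name is stored once, but reaching it from the fact table takes two joins instead of one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized dimension: the country moves to its own lookup table.
CREATE TABLE dim_country (country_id INTEGER PRIMARY KEY, country_name TEXT);
CREATE TABLE dim_store (
    store_id INTEGER PRIMARY KEY,
    city TEXT,
    country_id INTEGER REFERENCES dim_country (country_id)
);
CREATE TABLE fact_sales (
    store_id INTEGER REFERENCES dim_store (store_id),
    amount REAL
);
INSERT INTO dim_country VALUES (1, 'India'), (2, 'USA');
INSERT INTO dim_store VALUES (10, 'Pune', 1), (11, 'Delhi', 1), (20, 'Austin', 2);
INSERT INTO fact_sales VALUES (10, 100.0), (11, 50.0), (20, 70.0);
""")

# Two joins are needed to reach the country name from the fact table.
sales_by_country = dict(conn.execute("""
    SELECT c.country_name, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_store AS s ON s.store_id = f.store_id
    JOIN dim_country AS c ON c.country_id = s.country_id
    GROUP BY c.country_name
"""))
```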
Fact Constellation Schema
q Multiple fact tables that share many dimension tables.
q The schema is viewed as a collection of stars, hence the name
Galaxy Schema.
q For example, in the hotel industry, the Booking and Checkout fact tables may
share many dimension tables.
[Figure: fact constellation with fact tables Booking and Checkout and dimension
tables Promotions, Hotels, Travel Agents, Room Type, and Customer]
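The hotel example can be sketched with two fact tables that share the same dimension tables. The column names, measures, and rows below are assumptions for illustration; only the table roles (Booking and Checkout as facts, Hotels and Customer as dimensions) come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Shared dimension tables.
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_hotel (hotel_id INTEGER PRIMARY KEY, hotel_name TEXT);
-- Two fact tables referencing the same dimensions.
CREATE TABLE fact_booking (
    customer_id INTEGER REFERENCES dim_customer (customer_id),
    hotel_id INTEGER REFERENCES dim_hotel (hotel_id),
    nights_booked INTEGER
);
CREATE TABLE fact_checkout (
    customer_id INTEGER REFERENCES dim_customer (customer_id),
    hotel_id INTEGER REFERENCES dim_hotel (hotel_id),
    amount_paid REAL
);
INSERT INTO dim_customer VALUES (1, 'Asha'), (2, 'Ravi');
INSERT INTO dim_hotel VALUES (100, 'Sea View'), (200, 'Hill Top');
INSERT INTO fact_booking VALUES (1, 100, 2), (2, 100, 3), (1, 200, 1);
INSERT INTO fact_checkout VALUES (1, 100, 180.0), (2, 100, 270.0);
""")

# Both fact tables are summarized against the shared hotel dimension.
per_hotel = conn.execute("""
    SELECT h.hotel_name,
           (SELECT SUM(b.nights_booked) FROM fact_booking AS b
             WHERE b.hotel_id = h.hotel_id),
           (SELECT SUM(c.amount_paid) FROM fact_checkout AS c
             WHERE c.hotel_id = h.hotel_id)
    FROM dim_hotel AS h
    ORDER BY h.hotel_id
""").fetchall()
```

A hotel with bookings but no checkouts yet shows `None` for the paid amount, because the two facts are recorded independently.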
Star Vs Snowflake Schema: Key Differences
Star Schema | Snowflake Schema
Hierarchies for the dimensions are stored in the dimensional table. | Hierarchies are divided into separate tables.
It contains a fact table surrounded by dimension tables. | One fact table surrounded by dimension tables, which are in turn surrounded by dimension tables.
Only a single join creates the relationship between the fact table and any dimension table. | A snowflake schema requires many joins to fetch the data.
Simple DB design. | Very complex DB design.
Denormalized data structure; queries also run faster. | Normalized data structure.
High level of data redundancy. | Very low level of data redundancy.
A single dimension table contains aggregated data. | Data is split into different dimension tables.
Cube processing is faster. | Cube processing might be slow because of the complex joins.
Offers higher performing queries using Star Join Query Optimization. Tables may be connected with multiple dimensions. | Represented by a centralized fact table, which is unlikely to be connected with multiple dimensions.
Multi-Dimensional Data
q A data warehouse is based on a multidimensional data model which views data
in the form of a data cube.
q OLAP databases are divided into one or more cubes. The cubes are designed in
such a way that creating and viewing reports becomes easy.
Ø Measures - numerical data being tracked
Ø Dimensions - business parameters that define a transaction
Ø Example: An analyst may want to view sales data (measure) by
geography, by time, and by product (dimensions)
Ø Dimensional modeling is a technique for structuring data around
business concepts
Ø ER models describe “entities” and “relationships”
Ø Dimensional models describe “measures” and “dimensions”
Multi-Dimensional Data (cont.)
[Figure: 3-dimensional data]
A Sample Data Cube
[Figure: sample data cube; surviving labels: Country, Mexico, sum]
Cuboids Corresponding to the Cube
[Figure: lattice of cuboids]
Ø all: 0-D (apex) cuboid
Ø product, date, country: 1-D cuboids
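The lattice of cuboids can be sketched by aggregating the measure over every subset of the dimensions. With three dimensions (product, date, country) there are 2^3 = 8 cuboids, from the 0-D apex cuboid to the 3-D base cuboid. The sample facts and the `cuboid` helper are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("product", "date", "country")

# Hypothetical base facts: (product, date, country) -> sales amount.
facts = {
    ("TV", "Q1", "Mexico"): 10,
    ("TV", "Q2", "Mexico"): 20,
    ("PC", "Q1", "Canada"): 5,
}


def cuboid(group_by):
    """Sum the measure over every dimension not listed in group_by."""
    positions = [DIMENSIONS.index(d) for d in group_by]
    totals = defaultdict(int)
    for key, amount in facts.items():
        totals[tuple(key[p] for p in positions)] += amount
    return dict(totals)


# One cuboid per subset of the dimensions.
lattice = {
    group_by: cuboid(group_by)
    for n in range(len(DIMENSIONS) + 1)
    for group_by in combinations(DIMENSIONS, n)
}

apex = lattice[()]  # 0-D cuboid: the grand total
by_country = lattice[("country",)]  # one of the 1-D cuboids
```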
Operations on Datacube
q Drill Down
q Roll Up
q Dice
q Slice
q Pivot
Roll-up and Drill Down
[Figure: concept hierarchy, from higher level of aggregation to low-level details:
Sales Channel, Region, Country, State, Location Address, Sales Representative]
Roll up
q Roll-up is also known as "consolidation" or "aggregation." The roll-up
operation can be performed in 2 ways:
Ø Reducing dimensions
Ø Climbing up a concept hierarchy. A concept hierarchy is a system of grouping
things based on their order or level.
q Applying the roll-up operation on LOCATION, we can roll up to COUNTRY, so
that only USA and INDIA remain.
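Both ways of rolling up can be sketched on a small cube stored as a dictionary. The city names, the city-to-country mapping, and the sales figures are assumptions for illustration; only the target level (COUNTRY with values USA and INDIA) comes from the slide.

```python
from collections import defaultdict

# Hypothetical concept hierarchy for LOCATION: city -> country.
CITY_TO_COUNTRY = {
    "Chicago": "USA", "New York": "USA", "Delhi": "INDIA", "Mumbai": "INDIA",
}

# Hypothetical cube cells: (city, quarter) -> sales.
sales = {
    ("Chicago", "Q1"): 40, ("New York", "Q1"): 60,
    ("Delhi", "Q1"): 30, ("Mumbai", "Q1"): 20, ("Delhi", "Q2"): 25,
}


def roll_up_location(cells):
    """Climb the concept hierarchy: aggregate cities into countries."""
    out = defaultdict(int)
    for (city, quarter), amount in cells.items():
        out[(CITY_TO_COUNTRY[city], quarter)] += amount
    return dict(out)


def roll_up_drop_time(cells):
    """Reduce dimensions: aggregate the time dimension away."""
    out = defaultdict(int)
    for (location, _quarter), amount in cells.items():
        out[location] += amount
    return dict(out)


by_country = roll_up_location(sales)
by_country_all_time = roll_up_drop_time(by_country)
```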
Drill down
q Drill-down is the reverse of roll-up: it retrieves more detailed information.
q Let us consider that the time axis has to be drilled down to get more information,
either by moving down the concept hierarchy or by adding a new dimension.
q To get more detail for all quarters, the months from Jan to Dec are shown.
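Drilling down from quarters to months can be sketched as follows. The monthly figures are assumptions for illustration. The point of the sketch is that drill-down needs the more detailed data to be available, since detail cannot be recovered from the quarterly totals alone.

```python
from collections import defaultdict

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
# Concept hierarchy for time: month -> quarter.
MONTH_TO_QUARTER = {m: f"Q{i // 3 + 1}" for i, m in enumerate(MONTHS)}

# Hypothetical detailed data at month level.
monthly_sales = dict(zip(MONTHS, [10, 12, 8, 15, 5, 9, 7, 11, 6, 14, 13, 4]))

# The summarized view an analyst starts from.
quarterly_sales = defaultdict(int)
for month, amount in monthly_sales.items():
    quarterly_sales[MONTH_TO_QUARTER[month]] += amount
quarterly_sales = dict(quarterly_sales)


def drill_down(quarter):
    """Move down the time hierarchy: show the months inside one quarter."""
    return {
        month: amount
        for month, amount in monthly_sales.items()
        if MONTH_TO_QUARTER[month] == quarter
    }


q1_detail = drill_down("Q1")
```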
Slicing and Dicing
[Figure: cube with product labels Household, Telecomm, Video, Audio and region
labels Europe, Far East, India]
Slice
q It performs a selection on a single dimension of the data cube, which results in a
new sub-cube. For example, a slice is performed on the dimension Time = “Q1”.
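The slice Time = “Q1” can be sketched on a cube stored as a dictionary. The axis names, cell values, and the `slice_cube` helper are assumptions for illustration. Fixing one dimension to a single value leaves a sub-cube with one dimension fewer.

```python
AXES = ("time", "location", "item")

# Hypothetical cube cells: (time, location, item) -> sales.
cube = {
    ("Q1", "Delhi", "Mobile"): 10,
    ("Q1", "Mumbai", "Modem"): 7,
    ("Q2", "Delhi", "Mobile"): 12,
}


def slice_cube(cells, axis, value):
    """Keep cells where axis == value and drop that axis from the keys."""
    i = AXES.index(axis)
    return {
        key[:i] + key[i + 1:]: amount
        for key, amount in cells.items()
        if key[i] == value
    }


q1_slice = slice_cube(cube, "time", "Q1")
```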
Pivot
q It is also known as the rotation operation, as it rotates the current view to get
a new view of the representation. Performing the pivot operation on the sub-cube
obtained after the slice operation gives a new view of it.
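For a 2-D view, pivoting amounts to swapping the row and column axes. The view below (locations as rows, items as columns) and its values are assumptions for illustration. The data is unchanged and only its presentation rotates.

```python
# Hypothetical 2-D view after a slice: rows = location, columns = item.
view = {
    "Delhi": {"Mobile": 10, "Modem": 3},
    "Mumbai": {"Mobile": 4, "Modem": 7},
}


def pivot(table):
    """Rotate the view: former columns become rows and vice versa."""
    rotated = {}
    for row_label, row in table.items():
        for col_label, value in row.items():
            rotated.setdefault(col_label, {})[row_label] = value
    return rotated


rotated_view = pivot(view)
```

Applying `pivot` twice returns the original view.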
Recommended Text and Reference Books
q Text Book:
Ø J. Han and M. Kamber. Data Mining: Concepts and Techniques. 3rd ed. Morgan
Kaufmann. 2011.
q Reference Books:
Ø H. Dunham. Data Mining: Introductory and Advanced Topics. Pearson
Education. 2006.
Ø I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools
and Techniques. Morgan Kaufmann. 2000.
Ø D. Hand, H. Mannila and P. Smyth. Principles of Data Mining. Prentice-Hall.
2001.