DW Mod 4
VISVESVARAYA TECHNOLOGICAL UNIVERSITY
JNANA SANGAMA, BELGAVI-590018, KARNATAKA
DATA WAREHOUSING
(As per CBCS Scheme 2022)
PREPARED BY:
INDHUMATHI R (ASST.PROF DEPT OF DS (CSE), KNSIT)
MODULE: 4
CHAPTER 10: PRINCIPLES OF DIMENSIONAL MODELING
Before designing a data warehouse, you gather requirements to understand what data is
needed for analysis. This includes identifying business metrics (facts) and how users
want to analyze these metrics (dimensions).
E-R modeling is designed to manage operational data in OLTP systems, where the primary
focus is capturing and updating transaction-level detail efficiently while maintaining data
integrity.
Characteristics:
Advantages:
Data integrity and consistency: Strict normalization rules ensure consistent data.
Optimized for updates and short transactions: Quick response times for individual
transaction queries.
Dimensional modeling is tailored for data warehouses that focus on analyzing large datasets
for decision-making, often across multiple dimensions (e.g., customer, time, product).
Characteristics:
Advantages:
Easy to understand: Simple schema structure (fact and dimension tables) makes it
intuitive for business users.
Efficient for querying and reporting: Optimized for read-heavy operations involving
aggregations.
Drill-down and roll-up analysis: Enables users to explore data at different
granularities.
The STAR schema consists of a central fact table surrounded by multiple dimension tables.
It’s ideal for answering complex business questions by joining facts with dimensions.
Example Scenario:
In a manufacturing company:
Fact Table: Orders (contains measures like quantity sold, revenue, profit).
Dimension Tables: Customer, Product, Salesperson, Date.
Analysis Queries:
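The specific analysis queries are shown in a figure that is not reproduced here. As a rough illustration of how such questions are answered against a STAR schema, the sketch below (Python with the standard sqlite3 module) joins an assumed Orders fact table with assumed Product and Date dimension tables; all table and column names are illustrative assumptions, not the textbook's exact schema.

import sqlite3

# Illustrative STAR schema: one fact table joined to its dimension tables.
# Table and column names are assumptions for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY, customer_name TEXT, state TEXT);
CREATE TABLE product_dim  (product_key  INTEGER PRIMARY KEY, product_name  TEXT, category TEXT);
CREATE TABLE date_dim     (date_key     INTEGER PRIMARY KEY, full_date TEXT, month TEXT, year INTEGER);
CREATE TABLE orders_fact  (
    customer_key INTEGER, product_key INTEGER, date_key INTEGER,
    quantity_sold INTEGER, revenue REAL, profit REAL
);
""")

# A typical analysis query: revenue by product category and month,
# produced by joining the fact table with two dimension tables.
query = """
SELECT p.category, d.month, SUM(f.revenue) AS total_revenue
FROM   orders_fact f
JOIN   product_dim p ON f.product_key = p.product_key
JOIN   date_dim    d ON f.date_key    = d.date_key
GROUP BY p.category, d.month
ORDER BY p.category, d.month;
"""
for row in conn.execute(query):
    print(row)

The pattern is always the same: filter and group by dimension attributes, then aggregate the measures stored in the fact table.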
Dimension Table Characteristics
1. Primary Key:
o Each row in a dimension table is uniquely identified by a primary key. This
key links the dimension table to the fact table.
2. Wide Table:
o Dimension tables typically have many columns or attributes (sometimes more
than 50), making them wide. Each attribute provides descriptive details.
3. Textual Attributes:
o Dimension tables mostly contain textual data rather than numerical data used
in calculations. These attributes describe business entities, e.g., product names,
customer locations, etc.
4. Attributes Not Directly Related:
o Some attributes in a dimension table might not be directly related to one another.
For example, "product brand" and "package size" could both be part of the
product dimension, though they are unrelated.
5. Denormalized Structure:
o Dimension tables are not normalized to avoid complex joins and improve
query performance. This means redundant data is allowed to ensure faster
access.
6. Drilling Down and Rolling Up:
o Attributes allow for hierarchical analysis. For instance:
Drill down: State → City → Zip Code
Roll up: Zip Code → City → State
7. Multiple Hierarchies:
o Dimension tables can have multiple hierarchies depending on different
business needs. For example, in a product dimension:
Marketing hierarchy: Category → Department → Product
Finance hierarchy: Product → Finance Category → Finance Department
8. Fewer Records:
o Dimension tables generally contain fewer rows compared to fact tables. For
example, a product dimension table might have 500 rows, whereas the fact table
may have millions.
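As a concrete illustration of these characteristics, the following sketch defines a small, deliberately denormalized product dimension (Python with sqlite3); the attribute names are assumptions chosen for the example, not a prescribed design.

import sqlite3

conn = sqlite3.connect(":memory:")
# A deliberately denormalized, "wide" dimension table: one surrogate primary key,
# many textual attributes, and hierarchy levels (department > category > product)
# stored redundantly in the same row to avoid joins at query time.
conn.execute("""
CREATE TABLE product_dim (
    product_key   INTEGER PRIMARY KEY,   -- surrogate key linking to the fact table
    product_name  TEXT,
    brand         TEXT,
    package_size  TEXT,                  -- unrelated to brand, yet in the same table
    category      TEXT,                  -- hierarchy level used for roll-up
    department    TEXT                   -- higher hierarchy level
)
""")
conn.execute(
    "INSERT INTO product_dim VALUES (?, ?, ?, ?, ?, ?)",
    (1, "Widget-1", "Acme", "Box of 10", "Widgets", "Hardware"),
)
# Rolling up: browsing the dimension at a higher hierarchy level.
print(conn.execute("SELECT department, COUNT(*) FROM product_dim GROUP BY department").fetchall())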
Fact Table Characteristics
1. Concatenated Key
Explanation: A fact table uses a combination of primary keys from related dimension
tables to uniquely identify each row.
Example: Imagine an e-commerce platform where each transaction involves a product,
customer, and date. Each transaction in the fact table will have a unique combination
of product ID, customer ID, and date.
2. Data Grain
Explanation: It refers to the level of detail stored in the fact table. The lower the grain,
the more detailed the data.
Example: In a retail chain, grain could be at the daily transaction level (low grain) or
at a monthly sales summary level (high grain). Storing data at the daily level allows for
more detailed analysis like specific-day promotions.
4. Semi-Additive Measures
Explanation: Measures that can be summed along some dimensions but not along others.
Example: An account balance can be added across customers for a single day, but summing
the same balance across days is meaningless; such measures are usually averaged over time
instead.
5. Sparse Data
Explanation: Fact tables often have many possible combinations of dimension values,
but not all combinations will have data, leading to gaps.
Example: In a hotel booking system, there won’t be data for rooms not booked on
certain days, leading to sparse rows.
6. Table Deep, Not Wide
Explanation: Fact tables have many rows (large datasets) but fewer columns
(attributes).
Example: In a ticket booking system, each row represents a booking with details like
date, ticket ID, and seat, but there are fewer columns compared to dimension tables.
7. Degenerate Dimensions
Explanation: Attributes that are neither measures nor dimension-related but are
important for analysis, such as reference numbers.
Example: In an online order system, order_number and invoice_number are stored in
the fact table to track transactions without needing a separate dimension table.
8. Factless Fact Table
Explanation: Fact tables without measurable data, often used to track events or
activities.
Example: In a university, tracking student attendance doesn’t need a numeric fact. Each
row simply records that a student attended a specific class on a given day.
Data Granularity
Explanation: Fact tables at the lowest grain allow easy addition of new dimensions or
attributes without changing existing rows.
Example: In a sales system, adding a new region dimension won’t affect the fact table
because it already tracks sales at the individual salesperson level.
Explanation: Keeping data at a low grain increases detail but also requires more
storage.
Example: In a streaming platform, storing every view per user per video gives detailed
insights but consumes significant storage compared to just storing daily summary
views.
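The sketch below ties several of the characteristics above together: a concatenated primary key, a degenerate dimension (order_number), daily grain, and a roll-up to a coarser grain. All names and values are assumptions for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")
# Fact table at daily-transaction grain. The primary key is the concatenation of
# the dimension keys; order_number is a degenerate dimension kept in the fact row.
conn.execute("""
CREATE TABLE sales_fact (
    date_key      INTEGER,
    product_key   INTEGER,
    customer_key  INTEGER,
    order_number  TEXT,      -- degenerate dimension: no separate dimension table
    quantity_sold INTEGER,
    revenue       REAL,
    PRIMARY KEY (date_key, product_key, customer_key, order_number)
)
""")
conn.execute("INSERT INTO sales_fact VALUES (20240105, 1, 42, 'ORD-001', 3, 29.97)")
conn.execute("INSERT INTO sales_fact VALUES (20240106, 1, 42, 'ORD-002', 1,  9.99)")

# Rolling the low-grain rows up to a monthly summary shows why a coarser grain
# stores fewer rows but also loses the daily detail.
monthly = conn.execute(
    "SELECT date_key / 100 AS month_key, SUM(revenue) FROM sales_fact GROUP BY month_key"
).fetchall()
print(monthly)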
Surrogate Keys
Explanation: Using operational system keys directly can lead to inconsistencies when
changes occur.
Example: In a warehouse system, if product codes are changed due to storage location
updates, using the operational key may lead to data mismatches. Instead, a generated
surrogate key avoids this issue.
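A minimal sketch of the surrogate-key idea in plain Python is given below; the key-generation scheme (a simple incrementing counter) is an assumption for illustration, and real warehouses typically assign surrogate keys during the ETL load.

# Map volatile operational (natural) keys to stable, system-generated surrogate keys.
# The warehouse stores only the surrogate key, so later changes to the operational
# product code do not break historical fact rows.
surrogate_map: dict[str, int] = {}
next_key = 1

def get_surrogate_key(operational_key: str) -> int:
    """Return the surrogate key for an operational key, assigning a new one if needed."""
    global next_key
    if operational_key not in surrogate_map:
        surrogate_map[operational_key] = next_key
        next_key += 1
    return surrogate_map[operational_key]

print(get_surrogate_key("PROD-A12"))   # 1
print(get_surrogate_key("PROD-B07"))   # 2
print(get_surrogate_key("PROD-A12"))   # 1 again: same product, same surrogate key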
In a STAR schema, each dimension table has a primary key that links to the fact
table through foreign keys.
This ensures a one-to-many relationship between dimension tables and the fact
table, enabling efficient data querying.
Advantages of the STAR Schema
1. User-Friendliness
o Unlike OLTP systems where users rely on predefined templates, data
warehouse users need to formulate their own queries.
o The STAR schema aligns with how users think:
Fact tables contain metrics (e.g., sales).
Dimension tables hold attributes (e.g., product, customer, time).
o This makes it intuitive for users to navigate and understand.
2. Optimized Navigation
o Simple Join Paths:
The STAR schema minimizes complex joins, making it easier and faster to
navigate.
o Example:
Analyzing defective parts in a GM automobile:
Filter by time (January 2009).
1. STARjoin
o A high-speed, parallelizable join that can process multiple tables in one
pass, improving query performance significantly.
2. STARindex
o A specialized index created on one or more foreign key columns of the fact table,
which speeds up the joins between the fact table and its dimension tables.
SLOWLY CHANGING DIMENSIONS
Type 1 Changes (Correction of Errors)
Scenario:
A customer's name is misspelled in the system as "Jonh Doe" instead of "John Doe."
Solution:
The incorrect name is simply overwritten with the correct one. No historical record of the
previous (incorrect) value is kept.
Type 2 Changes (Preservation of History)
Scenario:
The customer moves from New York to California, and the business needs to track orders by
state.
Solution:
A new row is added to the dimension table with the updated state, and each row is assigned a
unique surrogate key. This allows the platform to track which orders were placed from New
York and which were placed from California.
Before: the dimension row shows State: New York.
After: a new row with a new surrogate key shows State: California, while the original row is
retained.
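To make the type 2 mechanics concrete, here is a small illustrative Python sketch: the existing row is kept (marked as no longer current) and a new row with a fresh surrogate key is appended. Column names and the current-flag convention are assumptions for this example.

from datetime import date

# Customer dimension rows keyed by surrogate key; the natural key (customer_id)
# repeats across rows so that history is preserved (type 2 change).
customer_dim = [
    {"surrogate_key": 1, "customer_id": "C123", "name": "John Doe",
     "state": "New York", "effective_from": date(2020, 1, 1), "current": True},
]

def apply_type2_change(dim, customer_id, new_state, change_date):
    """Close the current row and append a copy with a new surrogate key and the new state."""
    current = next(r for r in dim if r["customer_id"] == customer_id and r["current"])
    current["current"] = False                      # keep the old row as history
    new_row = dict(current)
    new_row.update(
        surrogate_key=max(r["surrogate_key"] for r in dim) + 1,
        state=new_state, effective_from=change_date, current=True,
    )
    dim.append(new_row)

apply_type2_change(customer_dim, "C123", "California", date(2024, 6, 1))
# Old fact rows keep pointing at surrogate key 1 (New York); new fact rows use key 2.
for row in customer_dim:
    print(row["surrogate_key"], row["state"], row["current"])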
Type 3 Changes (Tentative Soft Revisions)
Scenario:
The customer’s loyalty tier changes from "Silver" to "Gold," and the business only needs to
track the current tier and the immediately previous tier.
Solution:
A new column is added for the previous loyalty tier. Both the current and previous tiers are
stored in the same row.
Before: the row stores only the current loyalty tier (Silver).
After: the row stores the current tier (Gold) together with the previous tier (Silver).
MISCELLANEOUS DIMENSIONS
1. Large Dimensions
Example: In a telecom company, the customer dimension can reach millions of rows,
containing attributes like customer ID, subscription type, location, and service preferences.
Searching across these attributes can be slow if not handled properly.
2. Multiple Hierarchies
Large dimensions often have multiple hierarchies, meaning different departments use different
sets of attributes to organize and analyze data.
Example: In a retail company, the marketing team might drill down products by category
(Electronics > Smartphones > Brand), while the finance team might group them by profitability
(High > Medium > Low).
3. Rapidly Changing Dimensions
These dimensions have attributes that change frequently, making it inefficient to keep creating
new rows using the type 2 approach.
Solution: move the rapidly changing attributes out of the main dimension table into one or
more separate mini-dimension tables.
Example: In e-commerce, customer behavior data such as purchase frequency or credit rating
changes rapidly. Instead of updating the main customer table, these attributes can be stored in
a behavior mini-dimension.
4. Junk Dimensions
Junk dimensions consolidate unrelated, often miscellaneous flags and textual attributes into a
single table instead of bloating the fact table or discarding them.
Example: A logistics company might track package statuses with various flags (e.g.,
"Delayed", "Fragile", "Priority"). Instead of spreading these across different tables, they can be
stored in a junk dimension for easier querying.
Snowflake Schema:
Definition: A more normalized version of the star schema, where dimension tables are
further broken down into related sub-dimensions.
Example: In e-commerce, instead of a single product table, there could be separate
tables for product categories, brands, and packages.
Advantage: Saves storage space and reduces data redundancy.
Disadvantage: Increases query complexity due to multiple table joins.
Partial Normalization: Only some dimension tables are normalized (e.g., splitting
product into brand and category).
Full Normalization: Every dimension is completely broken down (e.g., customer
dimension split into country, region, and city).
Example: In a telecom company, customer data might be broken into separate tables for
country, region, and city to avoid storing redundant geographic data.
Options to Normalize:
1. Partially Normalize: Example: Keep customer table intact but split product into brand
and category.
2. Fully Normalize a Few Tables: Example: Normalize only high-cardinality tables like
customer addresses.
3. Fully Normalize All Tables: Example: Turn all dimensions into multiple related
tables.
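The following sketch (Python with sqlite3, assumed names) shows what partial normalization of the product dimension looks like in practice: brand and category move into sub-dimension tables, and queries pick up extra joins.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflaked (partially normalized) product dimension: low-cardinality attributes
-- are moved into their own sub-dimension tables and referenced by key.
CREATE TABLE brand_dim    (brand_key    INTEGER PRIMARY KEY, brand_name    TEXT);
CREATE TABLE category_dim (category_key INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE product_dim  (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    brand_key    INTEGER REFERENCES brand_dim(brand_key),
    category_key INTEGER REFERENCES category_dim(category_key)
);
""")
# The trade-off: less redundancy in product_dim, but queries now need extra joins.
query = """
SELECT p.product_name, b.brand_name, c.category_name
FROM   product_dim p
JOIN   brand_dim    b ON p.brand_key    = b.brand_key
JOIN   category_dim c ON p.category_key = c.category_key;
"""
print(conn.execute(query).fetchall())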
When to Snowflake:
Reasons to Snowflake:
1. Storage Optimization:
For large datasets (millions of rows), snowflaking reduces redundancy, thus saving
significant storage space.
o Example: In a telecom company, demographic data like city classifications
can be split into a separate table from customer information to avoid
duplicating this data across millions of rows.
2. Granularity Differences:
Different attributes have different granularities and update cycles. Separating these
attributes allows better maintenance and reduces the impact of updates.
o Example: E-commerce platform updates product details frequently but
geographic data remains static. Separating these ensures efficient updates.
3. Frequent Attribute Browsing:
If certain attributes (e.g., demographics) are queried more frequently, placing them in
a subdimension improves access speed.
o Example: Marketing team in a retail company may often query demographic
data for customer segmentation.
Aggregate Fact Tables
An aggregate fact table is a precalculated summary table created by summarizing data from
the most granular (detailed) fact table. This helps improve query performance by reducing the
number of rows scanned during large queries.
Fact tables store metrics (e.g., sales amounts, quantities) at different levels of detail
(granularity). Queries manipulate these metrics to produce aggregated results.
1. Query 1:
"Total sales for customer number 12345678 during the first week of December 2008
for product Widget-1."
o Requires filtering rows by customer key, product key, and time key.
2. Query 2:
"Total sales for customer number 12345678 during the first three months of 2009 for
product Widget-1."
o Similar to Query 1 but involves a broader time range.
3. Query 3:
"Total sales for all customers in the south-central territory for the first two quarters
of 2009 for product category Bigtools."
o Requires summarization by territory and product category, while filtering by
time.
Running queries on large datasets can be time-consuming due to the need to scan numerous
fact table rows. Using aggregate tables improves query performance by precomputing
summary data.
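As an illustration of the idea, the sketch below builds a monthly aggregate table from a detailed fact table so that a query like Query 3 reads the small summary instead of scanning every detailed row. The schema is simplified and assumed (territory and category are kept directly in the fact table only to keep the example short; in a real design they would come from dimension tables).

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_fact (          -- most granular level: one row per sale line
    date_key INTEGER, product_key INTEGER, customer_key INTEGER,
    territory TEXT, category TEXT, sales_amount REAL
);
-- Precomputed aggregate: sales summarized by territory, category and month.
CREATE TABLE sales_agg_month AS
SELECT date_key / 100 AS month_key,    -- e.g. 20090115 -> 200901
       territory, category,
       SUM(sales_amount) AS total_sales
FROM   sales_fact
GROUP BY month_key, territory, category;
""")
# Query 3 can now read the small aggregate table instead of scanning
# millions of detailed fact rows.
q3 = """
SELECT SUM(total_sales)
FROM   sales_agg_month
WHERE  territory = 'south-central' AND category = 'Bigtools'
AND    month_key BETWEEN 200901 AND 200906;
"""
print(conn.execute(q3).fetchone())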
DATA EXTRACTION, TRANSFORMATION, AND LOADING (ETL)
The ETL process refers to the extraction of data from various source systems, transformation
of that data into a format suitable for analysis, and loading the transformed data into a data
warehouse for storage and querying. ETL is crucial because the data from operational
systems is typically not structured for analytical purposes and needs to be processed and
optimized before storage in a data warehouse.
Key Challenges:
Time-consuming: The ETL process can take a long time, especially with large
volumes of data.
Arduous: Many data quality issues need to be resolved, and substantial computational
resources may be needed to handle the tasks.
This section covers the necessary steps and factors for successful ETL operations.
Key Factors:
Data quality: The data must be clean and accurate for meaningful analysis.
Scalability: ETL processes should be scalable to handle growing volumes of data.
Error handling: Proper mechanisms for identifying and resolving errors during ETL.
Source Identification: Identifying where the data will come from, whether from
operational systems or external sources.
Data Extraction: The process of retrieving data from various sources.
Data Transformation: This is where data is cleaned, formatted, and made consistent
for analysis.
Data Loading: The final step of loading the transformed data into the data
warehouse.
Data Extraction
Data extraction refers to the process of collecting data from various sources (like operational
systems, databases, or flat files) to load it into a data warehouse. The main challenge of data
extraction is to ensure that the data pulled is accurate, complete, and in the correct format for
further processing.
Source Identification:
Before extracting data, it’s essential to identify the correct source. This could be
transactional systems (like ERP or CRM), external sources (web data, APIs), or internal
databases.
The source systems may vary in terms of structure (relational databases, flat files,
NoSQL systems), and the data may be in different formats (XML, JSON, CSV, etc.).
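A tiny illustrative sketch of this point: two sources deliver customer records in different formats (CSV and JSON), and the extraction layer normalizes both into one in-memory representation. The field names are assumptions.

import csv, json, io

# Source systems deliver the same entity in different formats; the extraction layer
# normalizes each into a common record structure before transformation.
csv_source  = io.StringIO("customer_id,name\nC1,John Doe\n")
json_source = '[{"customer_id": "C2", "name": "Jane Roe"}]'

records = []
records.extend(dict(row) for row in csv.DictReader(csv_source))
records.extend(json.loads(json_source))
print(records)   # both sources now share one in-memory representation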
Data extraction involves pulling data from various source systems to populate a data
warehouse. This process is essential for consolidating data for analytics, but it also presents
significant challenges due to the diversity and complexity of source systems.
Static Data Capture: This technique captures the data at a fixed point in time,
essentially taking a snapshot of data. It is used primarily for the initial load or full
refresh of the data warehouse, such as when new products are added or a complete
refresh of a dimension table is necessary.
Incremental Data Capture: Unlike static capture, this involves extracting only the
changes (insertions, updates, deletions) since the last extraction. This method is crucial
for keeping the data warehouse up to date with operational systems without reloading
the entire dataset.
Transaction Log-Based Extraction: This technique captures changes to the database
by reading the transaction logs, enabling real-time data capture. It is typically used in
environments where up-to-the-minute data is required.
Database Trigger-Based Extraction: Database triggers are set to capture changes at
the point of data modification in the source system, often used when database-level
changes need to be monitored closely.
Capture Based on Date and Time Stamps: This approach relies on records having
time stamps that indicate when a record was created or updated. The data extraction
occurs after the fact, selecting records based on these time stamps. It works well when
data changes infrequently.
File Comparison: In cases where time stamps are unavailable or impractical, the data
is captured by comparing snapshots of files at different times. This technique compares
the current state of a dataset with its previous snapshot to identify changes. While
effective, it can be computationally expensive and inefficient, especially with large
datasets.
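The sketch below illustrates capture based on date and time stamps, assuming the source table carries a last_updated column: only rows changed after the previous extraction run are selected.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, last_updated TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 100.0, "2024-06-01T10:00:00"),
    (2, 250.0, "2024-06-03T09:30:00"),
])

def extract_incremental(conn, last_run: str):
    """Capture only rows created or updated after the previous extraction."""
    return conn.execute(
        "SELECT order_id, amount, last_updated FROM orders WHERE last_updated > ?",
        (last_run,),
    ).fetchall()

# Only order 2 changed since the last run on 2024-06-02.
print(extract_incremental(conn, "2024-06-02T00:00:00"))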
Different techniques for data extraction come with their advantages and disadvantages.
Deferred Extraction: This technique extracts data at scheduled times, usually during off-peak
hours, so that extraction does not interfere with the normal operation of the operational
systems. It generally has less impact on those systems and incurs less system overhead, but it
may miss intermediary changes that occur between extraction runs. It is especially useful when
dealing with large datasets.
Choosing the Right Method: The best data extraction technique depends on the operational
environment, the volume of data, and the business needs. Real-time methods are ideal for
environments that require immediate updates, while deferred methods can be more cost-
effective and less intrusive for less critical updates.
Data Transformation
Data transformation is the process of converting the extracted data into a format suitable for
analysis in the data warehouse. This step may involve cleaning, mapping, consolidating, or
reformatting the data.
Data Cleaning: Involves correcting errors, handling missing data, and resolving
inconsistencies.
Data Mapping: Ensures that data from different source systems is mapped correctly
into the warehouse schema.
Data Aggregation: Summarizing data, like calculating totals or averages, to reduce
volume and enhance query performance.
Data Integration: Merging data from multiple sources into a unified format.
Consolidation: Grouping data into a smaller set or changing its structure to be more
suitable for analysis (e.g., transforming transactional data into aggregated data).
Data Cleansing: Correcting issues like duplicate entries, data in the wrong format, or
outliers.
Implementation of Transformation:
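The implementation details are not reproduced here; as a rough, hedged sketch, the following Python fragment shows three of the transformation tasks listed above (cleansing of duplicates and missing values, mapping of inconsistent codes, and aggregation). Field names and rules are assumptions for this example.

from collections import defaultdict

# Raw extracted records with typical quality problems: inconsistent codes,
# missing values and duplicates.
raw = [
    {"cust": "C1", "state": "NY",        "amount": "100.50"},
    {"cust": "C1", "state": "NY",        "amount": "100.50"},   # duplicate
    {"cust": "C2", "state": "new york",  "amount": None},       # missing amount
]

STATE_MAP = {"NY": "New York", "new york": "New York"}          # mapping/standardization

def transform(records):
    cleaned, seen = [], set()
    for r in records:
        key = (r["cust"], r["state"], r["amount"])
        if key in seen:                       # cleansing: drop exact duplicates
            continue
        seen.add(key)
        cleaned.append({
            "cust": r["cust"],
            "state": STATE_MAP.get(r["state"], r["state"]),      # mapping
            "amount": float(r["amount"] or 0.0),                 # handle missing data
        })
    return cleaned

# Aggregation: summarize amounts per state after cleaning.
totals = defaultdict(float)
for row in transform(raw):
    totals[row["state"]] += row["amount"]
print(dict(totals))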
Data Loading
Data loading is the process of inserting the transformed data into the data warehouse. There are
different methods to perform data loading, depending on the type of load being executed
(initial, full refresh, or incremental).
1. Initial Load: The first load of data into the data warehouse, typically involving all the
data from source systems.
2. Incremental Loads: Regular updates that load only new or modified records since the
last extraction, making the process more efficient.
3. Full Refresh: The entire dataset is refreshed, which can be useful when the structure
of a table has changed or when old data needs to be completely replaced.
Load: Adds new data to the data warehouse without modifying existing records.
Append: Appends new records to existing data.
Destructive Merge: Replaces existing data with new records.
Constructive Merge: Combines new and existing records based on certain rules,
ensuring that both are preserved.
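A small illustrative sketch of applying incoming records in different modes (assumed schema): append-style insertion of new rows, followed by a destructive-merge-style upsert in which incoming records overwrite existing rows with the same key.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dw_sales (sale_id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("INSERT INTO dw_sales VALUES (1, 100.0)")

incoming = [(1, 120.0), (2, 250.0)]   # one changed record, one brand-new record

# Append: add records that are new to the warehouse, leaving existing rows alone.
conn.executemany("INSERT OR IGNORE INTO dw_sales VALUES (?, ?)", incoming)

# Destructive merge: incoming records overwrite existing rows with the same key.
conn.executemany(
    "INSERT INTO dw_sales VALUES (?, ?) "
    "ON CONFLICT(sale_id) DO UPDATE SET amount = excluded.amount",
    incoming,
)
print(conn.execute("SELECT * FROM dw_sales ORDER BY sale_id").fetchall())
# -> [(1, 120.0), (2, 250.0)]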
Data Integration: This process combines data from different sources and formats into a
cohesive and usable format for decision-making. It often involves handling different data
structures, cleaning the data, and transforming it into a format that suits the warehouse’s needs.
Two approaches for data integration are:
The data extraction, transformation, and loading (ETL) process can be significantly aided by
specialized tools that help streamline and automate the process. These tools typically fall into
three broad categories:
Metadata is crucial for ETL because it documents all details about the data extraction,
transformation, and loading processes.
Metadata helps ensure that the data is accurately transformed, loaded, and queried. It
includes information about source and target systems, transformations applied to data,
and historical changes to data over time.
This section emphasizes the importance of a structured, well-planned approach to data
extraction, ensuring that data is consistently and accurately transferred from operational
systems to the data warehouse.
It also highlights the tools available to automate these processes and the role of
metadata in ensuring data integrity throughout the ETL process.