Data Modelling
Data modelling is the process of defining and structuring data to be stored, managed, and used
efficiently within a database or data system. It involves creating conceptual, logical, and
physical models that represent data relationships, rules, and constraints.
Types of Data Models
1. Conceptual Data Model: High-level view that defines business entities and
relationships. Used for business stakeholders.
2. Logical Data Model: More detailed, specifying attributes, keys, and relationships
without focusing on physical storage.
3. Physical Data Model: Defines how data is stored in a database, including tables,
columns, indexes, partitions, and data types.
Key Elements of Data Modelling
• Entities (e.g., Customer, Order)
• Attributes (e.g., Name, Order Date)
• Relationships (e.g., One-to-Many, Many-to-Many)
• Primary & Foreign Keys (for unique identification and referential integrity)
• Normalization & Denormalization (to optimize storage and performance)
Importance of Data Modelling
• Ensures data consistency, accuracy, and integrity
• Improves database performance and scalability
• Helps in better decision-making and reporting
• Facilitates data governance and compliance
Key Elements of a Fact Table
• Numeric Values (e.g., Sales Amount, Order Count)
• Foreign Keys linking to dimension tables
• Granularity Level (defines the detail level, e.g., daily sales vs. monthly sales)
1. Additive Facts
2. Semi-Additive Facts
• Cannot be summed across time (e.g., summing account balances across multiple days
is incorrect).
Example Query (AWS Redshift / SQL):
SELECT Store_ID, AVG(Account_Balance) AS Avg_Balance
FROM Bank_Fact
GROUP BY Store_ID;
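The semi-additive rule can be checked on a tiny dataset. The sketch below is a minimal illustration using Python's built-in sqlite3 module rather than Redshift; the Date and Account_ID columns and the sample balances are assumptions made for the example.

```python
import sqlite3

# Hypothetical daily balance snapshots; Date and Account_ID are assumed columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Bank_Fact (Date TEXT, Store_ID INTEGER, "
    "Account_ID INTEGER, Account_Balance REAL)"
)
conn.executemany(
    "INSERT INTO Bank_Fact VALUES (?, ?, ?, ?)",
    [
        ("2024-02-01", 1, 100, 500.0),
        ("2024-02-01", 1, 101, 300.0),
        ("2024-02-02", 1, 100, 700.0),
        ("2024-02-02", 1, 101, 300.0),
    ],
)

# Valid: balances can be summed across accounts within a single day.
daily_totals = conn.execute(
    "SELECT Date, SUM(Account_Balance) FROM Bank_Fact GROUP BY Date ORDER BY Date"
).fetchall()

# Across time, average the daily totals instead of summing them.
avg_daily_total = conn.execute(
    "SELECT AVG(day_total) FROM "
    "(SELECT SUM(Account_Balance) AS day_total FROM Bank_Fact GROUP BY Date)"
).fetchone()[0]

# Misleading: a plain SUM counts the same accounts once per day.
naive_sum = conn.execute("SELECT SUM(Account_Balance) FROM Bank_Fact").fetchone()[0]
```

Here the naive sum is double the largest real balance position, which is why balances are usually averaged or taken at period end when rolling up over time.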
3. Non-Additive Facts
Fact Type       Examples                           Summable Across Time   Summable Across Other Dimensions
Additive        Sales_Amount, Quantity_Sold        Yes                    Yes
Semi-Additive   Account_Balance, Inventory_Level   No                     Yes
Ensures correct aggregations in reporting.
Helps in designing fact tables for BI tools and data warehouses such as Power BI, AWS Redshift, and Snowflake.
Improves query performance and accuracy in data warehouses.
1. Delayed Transactions: A sale is recorded, but revenue data is not yet available.
2. Data Processing Issues: Data ingestion failure or missing values from the source system.
3. Factless Fact Tables: Some tables track events without numeric facts.
Substituting with Defaults
Forward Filling or Backward Filling
Excluding NULLs from Aggregations
Using Conditional Aggregation
Factless Fact Tables (If No Numeric Facts Exist)
• If a fact table is just tracking events (e.g., student course enrollment), NULLs are normal.
Define Business Rules for NULLs (Decide when NULLs are expected and how to handle
them).
Ensure Data Cleaning in ETL Pipelines (Use AWS Glue, Apache Airflow, etc.).
Avoid NULLs in Aggregated Queries (Use COALESCE() or IFNULL()).
Use Default Values if NULLs Are Not Valid (e.g., 0 for missing sales).
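The effect of these choices on aggregates can be seen side by side. This is a minimal sketch using Python's sqlite3 module with made-up rows; whether a missing Sales_Amount should count as 0 is a business decision, so both variants are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales_Fact (Store_ID INTEGER, Sales_Amount REAL)")
# Illustrative rows: store 1 has one missing amount, store 2 has only a missing amount.
conn.executemany(
    "INSERT INTO Sales_Fact VALUES (?, ?)",
    [(1, 100.0), (1, None), (2, None)],
)

rows = conn.execute(
    """
    SELECT Store_ID,
           SUM(Sales_Amount)                AS raw_sum,            -- NULL if every value is NULL
           COALESCE(SUM(Sales_Amount), 0)   AS safe_sum,           -- default applied after aggregation
           AVG(Sales_Amount)                AS avg_ignoring_nulls, -- NULL rows are skipped
           AVG(COALESCE(Sales_Amount, 0))   AS avg_nulls_as_zero,  -- NULL rows count as 0
           SUM(CASE WHEN Sales_Amount IS NULL THEN 1 ELSE 0 END) AS missing_rows
    FROM Sales_Fact
    GROUP BY Store_ID
    ORDER BY Store_ID
    """
).fetchall()
```

Note how the two averages for store 1 differ (100.0 versus 50.0): substituting a default changes the result, so the rule should be agreed before it is coded into reports.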
Year-To-Date (YTD) Facts in Dimensional Modelling
Year-To-Date (YTD) Facts represent the cumulative total of a fact (e.g., sales, revenue,
expenses) from the beginning of the year up to a specific date. YTD calculations help in trend
analysis, performance tracking, and business forecasting.
SELECT
Date,
Store_ID,
SUM(Sales_Amount) OVER (PARTITION BY Store_ID
ORDER BY Date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS YTD_Sales
FROM Sales_Fact;
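This computes a running total of sales per store. It equals YTD sales only when Sales_Fact
holds a single year of data, because the window never resets at a year boundary.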
SELECT
Store_ID,
DATE_TRUNC('year', Date) AS Year_Start,
Date,
SUM(Sales_Amount) OVER (PARTITION BY Store_ID, DATE_TRUNC('year', Date)
ORDER BY Date) AS YTD_Sales
FROM Sales_Fact;
Ensures YTD calculations reset every new year.
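The yearly reset can be verified on a few rows that straddle a year boundary. This is a sketch on SQLite (3.25 or later for window functions) through Python's sqlite3 module; SQLite has no DATE_TRUNC, so strftime('%Y', Date) stands in for it here, and the sample amounts are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales_Fact (Date TEXT, Store_ID INTEGER, Sales_Amount REAL)")
conn.executemany(
    "INSERT INTO Sales_Fact VALUES (?, ?, ?)",
    [
        ("2023-12-30", 1, 100.0),
        ("2023-12-31", 1, 50.0),
        ("2024-01-01", 1, 20.0),
        ("2024-01-02", 1, 30.0),
    ],
)

# Partitioning by store AND year restarts the running total each January.
ytd = conn.execute(
    """
    SELECT Date,
           SUM(Sales_Amount) OVER (
               PARTITION BY Store_ID, strftime('%Y', Date)
               ORDER BY Date
           ) AS YTD_Sales
    FROM Sales_Fact
    ORDER BY Date
    """
).fetchall()
```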
SELECT
Store_ID,
SUM(Sales_Amount) AS YTD_Sales
FROM Sales_Fact
WHERE Date BETWEEN DATE_TRUNC('year', CURRENT_DATE) AND CURRENT_DATE
GROUP BY Store_ID;
Transactional Fact Table
Captures individual transactions at the most detailed level.
Each row in the table represents a single event, such as an order, sale, or payment.
The data is insert-only, meaning that records are not updated after being written.
Typically includes foreign keys to dimension tables like time, product, store, and
customer, along with measurable facts like Sales_Amount and Quantity_Sold.
Periodic Snapshot Fact Table
Captures aggregated data at regular time intervals (daily, weekly, monthly, etc.).
Provides a historical view of performance and helps track trends over time.
Instead of tracking individual transactions, it aggregates them into periodic summaries.
Existing snapshots are typically not updated; new snapshots are added periodically.
2024-02-01 1 10,000 50 40
2024-02-02 1 12,500 60 50
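A periodic snapshot of this kind can be derived from a transactional fact table by aggregating one period at a time. The sketch below uses Python's sqlite3 module; the Daily_Sales_Snapshot table, its column names, the load_snapshot helper, and the sample transactions are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Sales_Fact (
        Date TEXT, Store_ID INTEGER, Sales_Amount REAL, Quantity_Sold INTEGER
    );
    -- Hypothetical snapshot table: one row per store per day.
    CREATE TABLE Daily_Sales_Snapshot (
        Date TEXT, Store_ID INTEGER, Total_Sales REAL, Total_Quantity INTEGER,
        PRIMARY KEY (Date, Store_ID)
    );
    INSERT INTO Sales_Fact VALUES
        ('2024-02-01', 1, 6000.0, 30),
        ('2024-02-01', 1, 4000.0, 20),
        ('2024-02-02', 1, 12500.0, 60);
    """
)


def load_snapshot(conn, day):
    """Append the snapshot rows for one day; earlier snapshots are left untouched."""
    conn.execute(
        """
        INSERT INTO Daily_Sales_Snapshot
        SELECT Date, Store_ID, SUM(Sales_Amount), SUM(Quantity_Sold)
        FROM Sales_Fact
        WHERE Date = ?
        GROUP BY Date, Store_ID
        """,
        (day,),
    )


load_snapshot(conn, "2024-02-01")
load_snapshot(conn, "2024-02-02")
snapshots = conn.execute(
    "SELECT * FROM Daily_Sales_Snapshot ORDER BY Date"
).fetchall()
```

The primary key on (Date, Store_ID) makes an accidental second load of the same day fail instead of silently duplicating the snapshot.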
Accumulating Snapshot Fact Table
Tracks the full lifecycle of a business process, with multiple status updates over time.
Unlike transactional fact tables (insert-only), accumulating snapshots update records as
processes move through different stages.
The table contains timestamps for different milestones (e.g., Order Date, Payment Date,
Shipping Date, Delivery Date).
If an order is completed, its corresponding record in the fact table is updated.
Advantage: Useful for tracking the progress of a business workflow and identifying
delays.
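The update-in-place behaviour can be sketched as follows, using Python's sqlite3 module. The Order_Fact table, the record_milestone helper, and the dates are hypothetical; the milestone columns follow the examples named above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Order_Fact (
        Order_ID INTEGER PRIMARY KEY,
        Order_Date TEXT, Payment_Date TEXT, Shipping_Date TEXT, Delivery_Date TEXT
    )
    """
)
# One row is created when the order is placed; later milestones start as NULL.
conn.execute("INSERT INTO Order_Fact (Order_ID, Order_Date) VALUES (1, '2024-02-01')")

MILESTONES = {"Payment_Date", "Shipping_Date", "Delivery_Date"}


def record_milestone(conn, order_id, milestone, date):
    """Update the existing row for the order instead of inserting a new one."""
    if milestone not in MILESTONES:  # whitelist guards the column name
        raise ValueError(f"unknown milestone: {milestone}")
    conn.execute(
        f"UPDATE Order_Fact SET {milestone} = ? WHERE Order_ID = ?",
        (date, order_id),
    )


record_milestone(conn, 1, "Payment_Date", "2024-02-02")
record_milestone(conn, 1, "Shipping_Date", "2024-02-05")

order_row = conn.execute("SELECT * FROM Order_Fact").fetchone()
# Lag between milestones is the typical measure derived from this table type.
days_to_ship = conn.execute(
    "SELECT julianday(Shipping_Date) - julianday(Order_Date) "
    "FROM Order_Fact WHERE Order_ID = 1"
).fetchone()[0]
```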
Factless Fact Table
Contains only foreign keys to dimensions, without any measurable numeric facts.
Used for tracking events or conditions that do not have associated numerical metrics.
Helps in identifying patterns, participation, and event tracking.
Often used in many-to-many relationships, such as tracking student-course enrolments.
Advantage: Even without numerical values, this type of fact table helps in analytical
reporting (e.g., how many students enrolled in a course).
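Because a factless fact table has no measures, analysis comes down to counting rows. A minimal sketch with Python's sqlite3 module, assuming a hypothetical Enrolment_Fact table with three foreign-key columns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Only foreign keys, no numeric measures (table and column names are assumed).
conn.execute(
    "CREATE TABLE Enrolment_Fact (Date_ID INTEGER, Student_ID INTEGER, Course_ID INTEGER)"
)
conn.executemany(
    "INSERT INTO Enrolment_Fact VALUES (?, ?, ?)",
    [(20240201, 1, 10), (20240201, 2, 10), (20240202, 3, 10), (20240202, 1, 11)],
)

# "How many students enrolled in each course?" is answered with COUNT(*).
enrolments = conn.execute(
    "SELECT Course_ID, COUNT(*) FROM Enrolment_Fact "
    "GROUP BY Course_ID ORDER BY Course_ID"
).fetchall()
```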
Comparison of Fact Table Types
Fact Table Type                Granularity                   Data Update Type              Use Case Example
Periodic Snapshot Fact Table   Aggregated (daily, monthly)   Insert-only (new snapshots)   Daily sales, monthly stock levels
Factless Fact Table            Event tracking                Insert-only                   Student-course enrolment, employee attendance
✔ Choose the right fact table type based on the business need.
✔ Use correct granularity to ensure optimized data storage and querying.
✔ Partition large fact tables (e.g., by date) to improve performance in AWS Redshift,
Snowflake, or BigQuery.
✔ Index foreign keys to improve query efficiency.
✔ Use aggregations in snapshot tables to optimize report performance.
1. Identify the Business Process to Analyse
Understand the business requirements and determine what process needs analysis.
Decide the level of detail for each record in the fact table.
Common granularity options:
• Transaction-level grain (each row represents an individual sale, order, or event).
• Daily snapshot grain (each row represents total sales per store per day).
• Monthly snapshot grain (aggregated sales per store per month).
Choosing the right granularity ensures optimal performance and flexibility in analysis.
3. Identify the Measures (Facts)
Select the numeric metrics that will be analysed and reported.
Example facts in a Sales Fact Table: Sales_Amount, Quantity_Sold.
Ensure that all required measures are included while avoiding unnecessary metrics.
4. Identify the Dimensions
Determine the descriptive attributes (dimensions) that will provide context to the
facts.
Examples of common dimensions: Date, Store, Product, Customer.
Dimension tables help in slicing and dicing the fact data for detailed analysis.
Each fact record should reference foreign keys to related dimension tables.
Example Fact Table Schema:
Date_ID Store_ID Product_ID Customer_ID Sales_Amount Quantity_Sold
The Date_ID, Store_ID, Product_ID, and Customer_ID are foreign keys linking to their
respective dimension tables.
Foreign key relationships enable powerful analytical queries and efficient joins.
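The fact table schema above can be sketched as DDL with foreign key constraints. The example uses Python's sqlite3 module; only the Store and Product dimensions are created to keep it short, and the dimension attributes (Store_Name, Category) and sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript(
    """
    CREATE TABLE Store_Dim   (Store_ID INTEGER PRIMARY KEY, Store_Name TEXT);
    CREATE TABLE Product_Dim (Product_ID INTEGER PRIMARY KEY, Category TEXT);
    CREATE TABLE Sales_Fact (
        Date_ID INTEGER,
        Store_ID INTEGER REFERENCES Store_Dim(Store_ID),
        Product_ID INTEGER REFERENCES Product_Dim(Product_ID),
        Customer_ID INTEGER,
        Sales_Amount REAL,
        Quantity_Sold INTEGER
    );
    INSERT INTO Store_Dim VALUES (1, 'Downtown');
    INSERT INTO Product_Dim VALUES (1, 'Laptops'), (2, 'Phones');
    INSERT INTO Sales_Fact VALUES
        (20240201, 1, 1, 7, 1000.0, 1),
        (20240201, 1, 2, 8, 500.0, 2),
        (20240202, 1, 1, 7, 2000.0, 2);
    """
)

# Joining on the foreign key lets the fact be sliced by a dimension attribute.
sales_by_category = conn.execute(
    """
    SELECT p.Category, SUM(f.Sales_Amount)
    FROM Sales_Fact f
    JOIN Product_Dim p ON f.Product_ID = p.Product_ID
    GROUP BY p.Category
    ORDER BY p.Category
    """
).fetchall()

# A fact row pointing at a non-existent product is rejected by the constraint.
try:
    conn.execute("INSERT INTO Sales_Fact VALUES (20240203, 1, 99, 7, 10.0, 1)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
```

Many cloud warehouses treat declared foreign keys as informational only, so the same check is often repeated in the ETL or testing step.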
Decide how the measures should behave in summarized reports: additive, semi-additive, or
non-additive.
Understanding additivity helps in designing meaningful aggregations.
Determine how to handle missing values or nulls in the fact table.
Common strategies:
• Replace NULL values with default values (e.g., 0 for missing sales data).
• Use factless fact tables if no numerical data is available (e.g., tracking student course
enrolments).
Proper handling of missing data ensures accurate reporting.
Optimize the fact table for fast query performance using techniques such as partitioning
and indexing.
Efficient storage techniques reduce query response time and improve performance.
9. Implement ETL (Extract, Transform, Load) Process
Design an ETL pipeline to:
• Extract data from source systems (e.g., SQL Server, APIs, mainframe).
• Clean and transform the data (e.g., handling missing values, formatting timestamps).
• Load the data into the fact table in a data warehouse (AWS Redshift, Snowflake,
BigQuery).
A well-designed ETL pipeline ensures data consistency and accuracy.
Run sample queries to verify correctness:
• Ensure no orphaned foreign keys (fact table rows should reference valid dimension
records).
• Confirm data accuracy against source systems.
Thorough testing prevents incorrect reporting and analytics.
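The orphaned-key check can be written as an anti-join: keep the fact rows for which no dimension row matches. A minimal sketch with Python's sqlite3 module and made-up data (table and column names are assumed):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Customer_Dim (Customer_ID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Sales_Fact (
        Sale_ID INTEGER PRIMARY KEY, Customer_ID INTEGER, Sales_Amount REAL
    );
    INSERT INTO Customer_Dim VALUES (1, 'Alice'), (2, 'Bob');
    -- Sale 12 points at customer 3, which does not exist in the dimension.
    INSERT INTO Sales_Fact VALUES (10, 1, 100.0), (11, 2, 50.0), (12, 3, 75.0);
    """
)

# LEFT JOIN keeps every fact row; unmatched rows have NULL on the dimension side.
orphans = conn.execute(
    """
    SELECT sf.Sale_ID, sf.Customer_ID
    FROM Sales_Fact sf
    LEFT JOIN Customer_Dim cd ON sf.Customer_ID = cd.Customer_ID
    WHERE cd.Customer_ID IS NULL
    """
).fetchall()
```

An empty result means every fact row references a valid dimension record.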
Step                               Description
3. Identify the Measures (Facts)   Choose numeric values to store (e.g., Sales_Amount).
4. Identify the Dimensions         Select attributes for analysis (e.g., Product, Date).
Uniqueness & Integrity:
• When business data changes (e.g., a customer moves to a new city), a new surrogate key
allows tracking history while keeping the original data.
Prevents Dependence on Source Systems:
• Source keys (e.g., Customer ID) can be reused, change format, or be null—causing
issues in a data warehouse.
• Surrogate keys ensure consistency even when source systems change.
Optimized Joins & Performance:
• Integer-based surrogate keys are faster to join than long or complex natural keys (e.g.,
alphanumeric customer codes).
• Reduces query time in large fact tables.
• When data is collected from multiple sources, different systems may use different
natural keys.
• A surrogate key provides a common identifier.
Issue:
• If Customer_ID changes in the source system (e.g., CUST_123 → CUST_999), the reference
is broken.
• Different source systems may have different ID formats (123, CUST_123, CUST-XYZ).
Customer Dimension Table with Surrogate Key
The surrogate key (an integer) is used for joining with fact tables, making queries more efficient.
• AWS Glue, Apache Spark, or Python scripts can assign sequential surrogate keys
during ETL.
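One way a Python ETL script might assign sequential surrogate keys is sketched below. The function name, the record layout, and the in-memory key map are assumptions for illustration; a real pipeline would persist the mapping (for example in the dimension table itself).

```python
from itertools import count


def assign_surrogate_keys(records, key_map, next_key):
    """Attach an integer surrogate key to each (source system, natural key) pair.

    key_map remembers keys already issued, so the same source record always
    receives the same surrogate key across batches.
    """
    keyed = []
    for rec in records:
        natural_key = (rec["source"], rec["customer_id"])
        if natural_key not in key_map:
            key_map[natural_key] = next(next_key)
        keyed.append({**rec, "customer_sk": key_map[natural_key]})
    return keyed


key_map = {}
next_key = count(start=1)

# Different source systems use different ID formats for their customers.
batch = [
    {"source": "crm", "customer_id": "CUST_123"},
    {"source": "billing", "customer_id": "123"},
    {"source": "crm", "customer_id": "CUST_123"},  # repeat of the first record
]
keyed = assign_surrogate_keys(batch, key_map, next_key)
surrogate_keys = [rec["customer_sk"] for rec in keyed]
```

Deciding that two natural keys from different systems refer to the same customer is a separate matching step that this sketch does not attempt.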
Surrogate Key vs. Natural Key: When to Use Each?
Conclusion
• Surrogate keys are essential for data warehousing due to their performance,
uniqueness, and flexibility.
• They solve natural key issues, such as data changes and source system dependencies.
• Used heavily in fact-dimension models for efficient joins, SCD handling, and data
integration.
Natural Keys in Data Warehousing & Dimensional Modelling
What is a Natural Key?
A natural key is a business-defined attribute (or set of attributes) that uniquely identifies a
record in a database. Unlike surrogate keys (which are system-generated integers), natural keys
come from the source system and have real-world meaning.
Business Meaningful
• Since natural keys already exist in source data, no additional column is needed for
identification.
• Example: Instead of adding an artificial Surrogate_Key, we can use Employee_ID directly.
Maintains Business Integrity
May Change Over Time
• Long string-based natural keys increase storage size and slow down joins.
• Example: Using Customer_Email as a key is inefficient in large datasets.
Complicated Composite Keys
• Some tables require multiple columns to form a natural key, making queries complex.
• Example: A transaction table might need (Store_ID, Transaction_Date, Register_Number)
as a key.
Feature   Natural Key   Surrogate Key
When to use a natural key:
✔ If the key is stable and never changes (e.g., VIN, Passport Number).
✔ If it is a widely accepted identifier (e.g., ISBN for books).
✔ If query performance is not a major concern (OLTP systems).
When to use a surrogate key instead:
✘ If the natural key changes frequently (e.g., Customer ID format updates).
✘ If it is long and complex, affecting performance.
✘ If it is not unique across multiple source systems.
Conclusion
• Natural keys are business-defined unique identifiers, but they may change, be
reused, or impact performance.
• Surrogate keys are preferred in data warehousing to ensure consistency, stability,
and faster joins.
• A hybrid approach can be used: Keep the natural key for reference and use a
surrogate key for primary key relationships in fact and dimension tables.
Dimension Tables in Dimensional Modelling
What is a Dimension Table?
A dimension table is a table in a data warehouse that stores descriptive (textual) attributes
related to business entities. It provides context to the data stored in fact tables, enabling
analysis and reporting.
Example Entities Stored in Dimension Tables:
Customers
Products
Locations
Time (Date)
Employees
Each record in a dimension table represents a single business entity with descriptive details.
• Example: A Customer dimension may contain Customer Name, Address, Gender, and
Segment.
Contains a Primary Key (Surrogate Key)
Denormalized for Performance
Supports Hierarchies
Slowly Changing Dimensions (SCDs) Are Managed
• Historical changes (e.g., customer moves to a new city) are tracked using SCD Type 2
(new record with a different surrogate key).
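SCD Type 2 handling can be sketched as "expire the current row, insert a new one". The example below uses Python's sqlite3 module; the Valid_From, Valid_To, and Is_Current columns and the apply_scd2 helper are assumptions, since the notes do not specify how versions are flagged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Customer_Dim (
        Customer_SK INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        Customer_ID TEXT,                               -- natural key from the source
        City TEXT,
        Valid_From TEXT, Valid_To TEXT, Is_Current INTEGER
    )
    """
)


def apply_scd2(conn, customer_id, city, change_date):
    """Return the surrogate key of the current version, creating one if needed."""
    current = conn.execute(
        "SELECT Customer_SK, City FROM Customer_Dim "
        "WHERE Customer_ID = ? AND Is_Current = 1",
        (customer_id,),
    ).fetchone()
    if current and current[1] == city:
        return current[0]  # nothing changed, keep the existing version
    if current:
        # Close the old version instead of overwriting it.
        conn.execute(
            "UPDATE Customer_Dim SET Valid_To = ?, Is_Current = 0 WHERE Customer_SK = ?",
            (change_date, current[0]),
        )
    cur = conn.execute(
        "INSERT INTO Customer_Dim (Customer_ID, City, Valid_From, Valid_To, Is_Current) "
        "VALUES (?, ?, ?, NULL, 1)",
        (customer_id, city, change_date),
    )
    return cur.lastrowid


sk_old = apply_scd2(conn, "CUST_123", "Boston", "2023-01-01")
sk_new = apply_scd2(conn, "CUST_123", "Chicago", "2024-03-15")  # customer moves
history = conn.execute(
    "SELECT Customer_SK, City, Is_Current FROM Customer_Dim ORDER BY Customer_SK"
).fetchall()
```

Fact rows loaded before the move keep pointing at the old surrogate key, which is what preserves history in reports.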
Example: Product Dimension Table
Conformed Dimension
• A shared dimension used across multiple fact tables and business processes.
• Ensures data consistency across different subject areas.
• Example: A Customer Dimension is used in both Sales Fact and Support Fact tables,
ensuring that customer information remains consistent across different departments.
• Helps in creating an enterprise-wide data warehouse with standardized dimensions.
Role-Playing Dimension
• A single dimension used in different roles within the same fact table.
• The most common example is the Date Dimension, which can represent Order Date,
Shipment Date, and Payment Date in a Sales Fact Table.
• Instead of duplicating the Date Dimension, we create multiple aliases while joining it to
the fact table.
• Ensures data consistency while reducing redundancy in the model.
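Joining one Date Dimension under several aliases looks roughly like this. The sketch uses Python's sqlite3 module; the key column names (Order_Date_ID, Shipment_Date_ID, Payment_Date_ID) and the sample dates are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Date_Dim (Date_ID INTEGER PRIMARY KEY, Full_Date TEXT);
    CREATE TABLE Sales_Fact (
        Order_ID INTEGER PRIMARY KEY,
        Order_Date_ID INTEGER, Shipment_Date_ID INTEGER, Payment_Date_ID INTEGER
    );
    INSERT INTO Date_Dim VALUES (1, '2024-02-01'), (2, '2024-02-03'), (3, '2024-02-05');
    INSERT INTO Sales_Fact VALUES (100, 1, 3, 2);
    """
)

# One physical Date_Dim table, three aliases: one per role.
order_dates = conn.execute(
    """
    SELECT sf.Order_ID,
           od.Full_Date AS Order_Date,
           sd.Full_Date AS Shipment_Date,
           pd.Full_Date AS Payment_Date
    FROM Sales_Fact sf
    JOIN Date_Dim od ON sf.Order_Date_ID    = od.Date_ID
    JOIN Date_Dim sd ON sf.Shipment_Date_ID = sd.Date_ID
    JOIN Date_Dim pd ON sf.Payment_Date_ID  = pd.Date_ID
    """
).fetchone()
```

BI tools often expose each alias as a separate view so that users see "Order Date" and "Shipment Date" as distinct dimensions.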
Junk Dimension
• A single table that stores miscellaneous, low-cardinality attributes (e.g., Flags, Yes/No
values) that don’t fit in other dimensions.
• Example: A Customer Preference Dimension storing fields like "Subscribed to
Newsletter", "Loyalty Member", "Has Credit Card".
• Helps in reducing the number of columns in the fact table and avoids having too many
small dimensions.
• Optimizes storage and improves query performance by grouping multiple
binary/text attributes into a single dimension.
Degenerate Dimension
• A dimension stored within the fact table itself instead of having a separate table.
• Example: Order Number, Invoice Number, or Transaction ID in a Sales Fact Table.
• These attributes are business keys but lack descriptive attributes, so they don’t need
a separate dimension table.
• Useful when querying at a transaction level without needing additional joins.
Conclusion
• Conformed Dimensions ensure consistency across multiple fact tables.
• Role-Playing Dimensions allow reuse of the same dimension in different contexts.
• Slowly Changing Dimensions (SCDs) help in tracking historical changes in
attributes.
• Junk Dimensions group miscellaneous attributes to optimize data storage.
• Degenerate Dimensions store transactional keys directly in fact tables.
• Date Dimension is a critical table that enables efficient time-based reporting.
Issues in Joins with Fact Tables – NULL values in foreign keys can lead to missing data
when joining with fact tables.
Incorrect Aggregations in Reports – NULLs can cause incorrect grouping in BI reports.
Complications in Slowly Changing Dimensions (SCDs) – NULL values may create issues
when tracking historical changes.
Strategies for Handling NULLs in Dimension Tables
Instead of storing NULLs, replace them with default values to maintain referential integrity.
If a dimension key is missing, insert a special record with a default Surrogate Key (SK).
Example: Special Row in Customer Dimension (Customer_SK = -1, "Unknown")
The fact table can reference this "Unknown" customer instead of having NULL foreign keys.
For late arriving dimensions, use a dummy placeholder record in the dimension table and
update it when the actual data arrives.
Step 1: Insert a placeholder record (Customer_SK = -1).
Step 2: When the real customer data arrives, update it with correct details.
Instead of allowing NULL values in foreign key columns, always ensure a valid surrogate key
is assigned.
• Instead of:
SELECT * FROM Sales_Fact sf
JOIN Customer_Dim cd ON sf.Customer_SK = cd.Customer_SK;
• Use:
SELECT * FROM Sales_Fact sf
JOIN Customer_Dim cd ON COALESCE(sf.Customer_SK, -1) = cd.Customer_SK;
COALESCE is applied to the fact-side key, because that is the column that can be NULL.
This ensures fact rows with a missing customer map to a valid record (-1).
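The difference an "Unknown" member makes can be checked on two fact rows, one of which has no customer key. This is a sketch with Python's sqlite3 module; the Customer_Name column and the sample values are assumptions. COALESCE is applied to the fact-side key here, since the dimension's primary key is never NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Customer_Dim (Customer_SK INTEGER PRIMARY KEY, Customer_Name TEXT);
    CREATE TABLE Sales_Fact (
        Sale_ID INTEGER PRIMARY KEY, Customer_SK INTEGER, Sales_Amount REAL
    );
    INSERT INTO Customer_Dim VALUES (-1, 'Unknown'), (1, 'Alice');
    INSERT INTO Sales_Fact VALUES (10, 1, 100.0), (11, NULL, 40.0);
    """
)

# A plain inner join silently drops the sale whose customer key is NULL.
plain_join_total = conn.execute(
    """
    SELECT SUM(sf.Sales_Amount)
    FROM Sales_Fact sf
    JOIN Customer_Dim cd ON sf.Customer_SK = cd.Customer_SK
    """
).fetchone()[0]

# Mapping a NULL key to -1 keeps the sale and reports it under 'Unknown'.
sales_by_customer = conn.execute(
    """
    SELECT cd.Customer_Name, SUM(sf.Sales_Amount)
    FROM Sales_Fact sf
    JOIN Customer_Dim cd ON COALESCE(sf.Customer_SK, -1) = cd.Customer_SK
    GROUP BY cd.Customer_Name
    ORDER BY cd.Customer_Name
    """
).fetchall()
```

Writing -1 into the fact table at load time achieves the same result without a function call in every join.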
Conclusion
• NULLs in dimension tables can lead to join issues and incorrect reports.
• Use default values or an "Unknown" surrogate key (-1) to handle missing data.
• Implement late arriving dimension handling to update missing records.
• Always ensure fact tables reference valid dimension keys to maintain data integrity.
Types of Hierarchies in Dimensions
Balanced Hierarchy
• A well-defined and structured hierarchy where each level has a clear relationship
with its parent level.
• All branches of the hierarchy maintain the same number of levels.
• Example: Geographical Hierarchy
Continent → Country → State → City
Table Example:
1 CEO NULL 1
2 VP Sales 1 2
3 VP Finance 1 2
4 Sales Manager 2 3
5 Finance Analyst 3 3
o The "MacBook Pro" does not have a Subcategory but still belongs to the
hierarchy.
o Allows flexible analysis while handling missing intermediate levels.
1 CEO NULL
2 VP Sales 1
3 VP Finance 1
4 Sales Manager 2
5 Sales Rep 4
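A parent-child table like the one above is usually traversed with a recursive query. The sketch below loads those five rows into SQLite through Python's sqlite3 module and derives each employee's level; the table name and the column names Employee_ID, Title, and Manager_ID are assumptions, as the original column headers are not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Employee_Dim (
        Employee_ID INTEGER PRIMARY KEY, Title TEXT, Manager_ID INTEGER
    );
    INSERT INTO Employee_Dim VALUES
        (1, 'CEO', NULL),
        (2, 'VP Sales', 1),
        (3, 'VP Finance', 1),
        (4, 'Sales Manager', 2),
        (5, 'Sales Rep', 4);
    """
)

# Start at the root (no manager) and repeatedly join children onto found rows.
levels = conn.execute(
    """
    WITH RECURSIVE org (Employee_ID, Title, Level) AS (
        SELECT Employee_ID, Title, 1
        FROM Employee_Dim
        WHERE Manager_ID IS NULL
        UNION ALL
        SELECT e.Employee_ID, e.Title, org.Level + 1
        FROM Employee_Dim e
        JOIN org ON e.Manager_ID = org.Employee_ID
    )
    SELECT Employee_ID, Title, Level FROM org ORDER BY Employee_ID
    """
).fetchall()
```

Warehouses that lack recursive queries often store the derived level, or a flattened path, as extra columns in the dimension instead.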
Benefits of Hierarchies in Dimensional Modelling
Efficient Aggregation – Enables data summarization at multiple levels (e.g., sales at city,
state, country levels).
Drill-Down & Roll-Up – Users can drill down for details or roll up for summaries in BI
reports.
Improved Query Performance – Reduces the need for complex joins when querying data at
different levels.
Better Data Organization – Helps in structuring data for business intelligence and
analytics.
Conclusion
Hierarchies are essential in dimensional modelling to support multi-level reporting and
analytics. Depending on the business use case, different hierarchy types (balanced,
unbalanced, ragged, recursive) are used to model relationships effectively.
1. Star Schema
2. Snowflake Schema
3. Galaxy Schema (Fact Constellation Schema)
1. Star Schema
Definition
The Star Schema is the simplest and most widely used dimensional model, where a central fact
table is directly connected to multiple dimension tables. The structure looks like a star, with
the fact table at the center and dimension tables radiating outward.
Diagram Representation
                 +--------------+
                 |   Date Dim   |
                 +--------------+
                        |
+--------------+        |        +--------------+     +--------------+
| Customer Dim |--------+--------| Product Dim  |-----| Supplier Dim |
+--------------+        |        +--------------+     +--------------+
                        |
                 +--------------+
                 |  Fact Table  |
                 +--------------+
Characteristics
Fact table contains measurable business data (facts) with foreign keys referencing
dimension tables.
Dimension tables are denormalized (i.e., they store redundant data to improve query
performance).
Queries are fast because fewer joins are needed.
Example: Sales Data
Fact Table (Sales_Fact)
Advantages
Simple and easy to understand
Fast query performance (since dimension tables are denormalized)
Optimized for OLAP & BI reporting
Disadvantages
Denormalization leads to redundancy in dimension tables
Larger storage space is needed compared to normalized schemas
2. Snowflake Schema
Definition
The Snowflake Schema is an extension of the Star Schema, where dimension tables are
further normalized into multiple related tables, reducing redundancy. The structure looks like
a snowflake because dimension tables branch out.
Diagram Representation
                 +--------------+
                 |   Date Dim   |
                 +--------------+
                        |
+--------------+        |        +--------------+     +--------------+
| Customer Dim |--------+--------| Product Dim  |-----| Supplier Dim |
+--------------+        |        +--------------+     +--------------+
                        |
                 +--------------+
                 |  Fact Table  |
                 +--------------+
Characteristics
301 NY USA
302 LA USA
Advantages
Less storage space required due to normalized dimension tables
Eliminates data redundancy
Better data integrity
Disadvantages
Query performance is slower due to multiple joins
More complex structure compared to Star Schema
3. Galaxy Schema (Fact Constellation Schema)
Definition
The Galaxy Schema is a combination of multiple fact tables sharing common dimension
tables. It is useful for complex business scenarios where multiple business processes are
analysed together.
Diagram Representation
                 +--------------+
                 |   Date Dim   |
                 +--------------+
                        |
+--------------+        |        +--------------+     +--------------+
| Customer Dim |--------+--------| Product Dim  |-----| Supplier Dim |
+--------------+        |        +--------------+     +--------------+
                        |
             +----------+-----------+
             |                      |
       +------------+       +----------------+
       | Sales Fact |       | Inventory Fact |
       +------------+       +----------------+
Characteristics
Advantages
Supports complex business use cases (e.g., sales + inventory analysis)
Flexible and scalable for large enterprises
Common dimensions help avoid data duplication
Disadvantages
Choosing the Right Schema
Feature         Star Schema               Snowflake Schema                        Galaxy Schema
Storage Usage   High (denormalized)       Low (normalized)                        High (multiple fact tables)
Use Case        Simple & Fast Reporting   Data Integrity & Storage Optimization   Large-scale DWs with multiple processes
Conclusion
What are the business objectives? (e.g., revenue tracking, customer retention analysis,
fraud detection)
Who are the primary users of the data? (Finance, Operations, Customer Support,
Marketing)
What reports and KPIs need to be generated? (e.g., Monthly Revenue, Churn Rate, Average
Revenue per User - ARPU)
What is the volume of data? (e.g., daily call records, millions of customer transactions)
What are the data sources? (e.g., CRM, call logs, billing systems, support tickets)
• Captures details of customer usage such as calls, messages, and data consumption
• Attributes: Usage_ID, Customer_ID, Subscription_ID, Call_Duration, Data_Used,
SMS_Count, Timestamp
4. Billing Entity
Fact Tables (Transactional Data)
Fact tables store measurable business events (e.g., usage, billing, support interactions).
Fact Table   Description              Measures
Usage_Fact   Tracks call/data usage   Call Duration, Data Used, SMS Count

Dimension Table    Attributes                                    Related Fact Tables
Customer_Dim       Customer_ID, Name, Contact_Info               Usage_Fact, Billing_Fact, Support_Fact
Subscription_Dim   Subscription_ID, Plan_Name, Activation_Date   Usage_Fact
Star Schema Example
                 +--------------+
                 |   Date Dim   |
                 +--------------+
                        |
+--------------+        |        +------------------+     +--------------+
| Customer Dim |--------+--------| Subscription Dim |-----|   Plan Dim   |
+--------------+        |        +------------------+     +--------------+
                        |
             +----------+-----------+
             |                      |
       +------------+        +--------------+
       | Usage Fact |        | Billing Fact |
       +------------+        +--------------+
Surrogate Keys
Partitioning & Indexing
Handling Null Values
Examples of Reports: