Data Modeling Concept Latest
Interview: Basic Concepts in Data Modeling
Ankur Bhattacharya
What is a Data Warehouse
A data warehouse is a centralized repository that consolidates structured data from multiple source systems and organizes it for analytical querying rather than day-to-day transactions. The data stored in a data warehouse is typically used for business analysis, trend analysis, data mining, decision support, and generating reports. By providing a centralized and structured view of data, data warehouses help organizations make informed business decisions based on historical and current data insights.
Difference Between Database / Data Lake / Data Warehouse
Database: Optimized for day-to-day transactional (OLTP) workloads on current, structured data, using a schema-on-write model. Examples: MySQL, PostgreSQL, Oracle.
Data Warehouse: Optimized for analytical (OLAP) queries over large volumes of historical, structured data consolidated from many sources. Examples: Amazon Redshift, Snowflake, Google BigQuery.
Data Lake:
Purpose: A data lake is a storage system that holds a vast amount of raw,
unstructured, and structured data in its native format. It acts as a central
repository for diverse data sources, providing a single location for data
storage and exploration.
Data Structure: Data lakes are schema-on-read, meaning that the data's
structure is only imposed when it is read or queried, allowing for flexibility
and the inclusion of various data types.
Data Use: Data lakes are used for exploratory data analysis, data science,
and big data processing. They support data transformation and
preparation for analysis.
Data Ingestion: Data lakes can ingest both structured data (e.g.,
databases) and unstructured data (e.g., log files, social media data, etc.).
Examples: Amazon S3, Hadoop Distributed File System (HDFS), Azure
Data Lake Storage, Google Cloud Storage, etc.
Introduction to Data Mart
| Aspect | Data Warehouse | Data Mart |
| --- | --- | --- |
| Purpose | Serves as a centralized repository that consolidates data from various sources across the organization. | Focuses on meeting the specific needs of a particular business unit, department, or user group. |
| Data Size | Generally stores large volumes of historical and current data over extended periods. | Contains a smaller dataset, limited to the requirements of the targeted users or department. |
| Schema Design | Uses a global, enterprise-wide schema, often denormalized to optimize query performance for complex analysis. | Utilizes a local schema designed to cater to the specific analytical needs of the data mart's users. |
Schema on Read: ELT allows the data lake to accept data in its original schema, deferring structure to read time and providing greater flexibility to analyze diverse data types without upfront transformation.
Use in Big Data and Data Lakes: ELT is well-suited for big data environments
and data lakes, where vast amounts of raw and unstructured data can be
ingested and processed, allowing for dynamic and agile data analysis.
Two Types of ETL: Initial and Incremental
Within the realm of Extract, Transform, Load (ETL) processes, there are two
primary approaches employed to manage data integration and
synchronization: Initial ETL and Incremental ETL. These methodologies
cater to specific data management needs and play pivotal roles in ensuring
data accuracy and timeliness in various systems.
Initial ETL:
Initial ETL, also known as Full ETL, is the first step in populating a data warehouse
or data storage system with data from diverse source systems. In this approach,
the entire dataset is extracted from the source systems, without considering
previous data loading history. All relevant data is pulled from the sources,
transformed into a consistent format, and loaded into the target destination.
Comprehensive Data Load: During an initial ETL process, all data required for
analysis or reporting is extracted from the source systems. This approach ensures
that the data warehouse starts with a complete and up-to-date dataset.
Time-Consuming: As it extracts and loads all data from scratch, the initial ETL
process can be time-consuming, especially when dealing with large volumes of
data.
Overwriting Existing Data: With Initial ETL, the existing data in the target
destination is usually truncated or overwritten, ensuring a fresh start with the
latest data.
Incremental ETL:
Incremental ETL, also known as Delta ETL or Change Data Capture (CDC), focuses
on extracting and loading only the changes or updates that have occurred in the
source data since the last ETL process. Rather than processing the entire dataset,
Incremental ETL identifies new or modified data and applies only these changes
to the target destination.
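An incremental load is typically driven by a "watermark" — the high-water mark of the last run. A minimal sketch in Python with the built-in sqlite3 module; all table and column names here are illustrative assumptions for the sketch, not from any specific tool:

```python
import sqlite3

# Illustrative source and target tables; the schema is an assumption for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_orders (order_id INTEGER, amount REAL, updated_at TEXT)")
conn.execute("CREATE TABLE target_orders (order_id INTEGER, amount REAL, updated_at TEXT)")
conn.execute("CREATE TABLE etl_watermark (last_loaded_at TEXT)")
conn.execute("INSERT INTO etl_watermark VALUES ('1970-01-01')")

def incremental_load(conn):
    """Extract only rows changed since the last run (the watermark), then advance it."""
    (watermark,) = conn.execute("SELECT last_loaded_at FROM etl_watermark").fetchone()
    changed = conn.execute(
        "SELECT order_id, amount, updated_at FROM source_orders WHERE updated_at > ?",
        (watermark,),
    ).fetchall()
    conn.executemany("INSERT INTO target_orders VALUES (?, ?, ?)", changed)
    if changed:
        # ISO-8601 date strings compare correctly as text.
        conn.execute("UPDATE etl_watermark SET last_loaded_at = ?",
                     (max(row[2] for row in changed),))
    return len(changed)

conn.execute("INSERT INTO source_orders VALUES (1, 100.0, '2023-07-01')")
conn.execute("INSERT INTO source_orders VALUES (2, 50.0, '2023-07-02')")
print(incremental_load(conn))  # → 2 (first run loads everything newer than the initial watermark)
print(incremental_load(conn))  # → 0 (nothing has changed since)
```

Real CDC systems read database logs or change tables rather than a timestamp column, but the watermark idea is the same.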
1. Data Source Layer: This is the first layer of the architecture, where data is
collected from various source systems. These sources can include databases, flat
files, external data feeds, and more. The data is extracted from these sources and
transformed into a format suitable for storage and analysis in the data
warehouse. The transformation process can involve data cleansing, integration,
and the application of business rules.
Source data can be production data, internal data, archived data, or external data. For example, data that arrives in your data lake can serve as source data for your warehouse system.
2. Data Storage Layer: This layer holds the consolidated data and typically comprises the following components:
A.Data Mart: Data marts are subsets of the data warehouse specifically
tailored to meet the data delivery requirements of specific user groups or
business units. They serve as focused, smaller databases within the broader
data warehouse, providing granular access to data for specific analytical
needs.
B.Data Warehouse DBMS and Metadata: The data warehouse DBMS is the core
storage system where data from various sources is consolidated, transformed, and
organized for efficient querying and reporting. This DBMS often includes an
integrated metadata repository, which stores essential information about the data,
its structure, and its lineage, aiding in data management and governance.
(Diagram 1)
3. Data Access Layer: The data access layer is where end-users, typically business
analysts, data scientists, and other stakeholders, access the data stored in the data
warehouse. This layer includes tools and interfaces for querying, reporting, and
analyzing the data. Users can interact with the data using various BI (Business
Intelligence) tools, SQL queries, or custom applications. The goal is to make it as
easy as possible for users to retrieve and analyze the data for decision-making and
business intelligence purposes.
OLTP vs OLAP
| Aspect | OLTP (Online Transaction Processing) | OLAP (Online Analytical Processing) |
| --- | --- | --- |
| Data Type | Manages current, operational data. Primarily supports write-heavy operations. | Handles historical and aggregated data. Typically used for read-heavy operations. |
| Users | Accessed by operational staff, including clerks, customer service, and administrators. | Primarily used by business analysts, data scientists, and decision-makers. |
For example, in a sales data scenario, a fact table might contain measures like
sales amount, quantity sold, and profit. The foreign keys in the fact table would
link to dimension tables such as date, product, and store, providing additional
information about the sales events, such as the date of the sale, the product sold,
and the store where the sale occurred.
Grain of Fact Table: the grain is the level of detail at which a single row in the fact table is uniquely identified.
Sales_Fact Table:
Example:
Here we are capturing Monthly_Sales as part of our Periodic Snapshot Fact Table.
| Order_ID | Product_ID | Customer_ID | Order_Date | Initial_Review_Date | Final_Review_Date | Approved_Date | Shipped_Date | Delivered_Date | Status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1003 | 203 | 303 | 2023-07-03 | 2023-07-04 | | | | | In Progress |
| 1004 | 204 | 304 | 2023-07-04 | | | | | | Pending |
Here's a breakdown of the status for each order:
1. Order with Order_ID 1001: This order has gone through all stages and is marked
as "Delivered."
2. Order with Order_ID 1002: This order has gone through the initial and final
reviews, approved for fulfillment, and has been "Shipped" to the customer.
3. Order with Order_ID 1003: This order is still "In Progress," as it has gone through
the initial review, but the final review and approval are pending.
4. Order with Order_ID 1004: This order is still "Pending," as it has been placed, but the initial review has not yet happened.
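An order pipeline like this is usually modeled as an accumulating snapshot fact table: one row per order is inserted when the order is placed, and that same row is updated as each milestone completes. A minimal sketch using Python's built-in sqlite3; table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE order_fulfillment_fact (
        order_id INTEGER PRIMARY KEY,
        order_date TEXT,
        initial_review_date TEXT,
        shipped_date TEXT,
        delivered_date TEXT,
        status TEXT
    )
""")

# One row per order is inserted when the order is placed...
conn.execute("INSERT INTO order_fulfillment_fact (order_id, order_date, status) "
             "VALUES (1004, '2023-07-04', 'Pending')")

# ...and the SAME row is updated as each milestone completes.
def record_milestone(conn, order_id, column, milestone_date, status):
    # `column` must be one of the milestone date columns above (trusted input).
    conn.execute(
        f"UPDATE order_fulfillment_fact SET {column} = ?, status = ? WHERE order_id = ?",
        (milestone_date, status, order_id),
    )

record_milestone(conn, 1004, "initial_review_date", "2023-07-05", "In Progress")
row = conn.execute("SELECT status, initial_review_date FROM order_fulfillment_fact "
                   "WHERE order_id = 1004").fetchone()
print(row)  # → ('In Progress', '2023-07-05')
```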
A factless fact table contains foreign keys referencing various dimension tables
but lacks any numeric measures. It captures events or occurrences without any
quantitative data associated with them. Factless fact tables are useful for
representing many-to-many relationships between dimensions or recording
events where no measures are relevant.
In this example, the Enrollment_Fact table is a factless fact table that captures
enrollment events for students in courses. It stores the foreign keys referencing
the Student_Dimension and Course_Dimension tables, indicating which student
has enrolled in which course and when the enrollment occurred.
What is Dimension Table
**Primary Key:** Each dimension table has a primary key that uniquely identifies
each record or row within the table. The primary key is used to establish
relationships with fact tables.
**No Numeric Data:** Unlike fact tables, dimension tables do not contain numeric
measures or metrics. Instead, they store textual or categorical data, such as
names, descriptions, codes, or labels.
Date Dimension:
A date dimension is a common example of a conformed dimension. In most
organizations, dates are used in various reports and analyses across different
departments and systems. The date dimension contains attributes such as:
1. Date
2. Month
3. Quarter
4. Year
5. Week
6. Day of the week
7. Public Holidays
8. Fiscal period, etc.
In a data warehouse or data mart, different parts of the organization can utilize this
date dimension without inconsistencies. For instance:
1. Finance Department: In the finance data mart, the date dimension is used to
analyze financial data by various time periods (e.g., monthly financial
statements).
2. Sales Department: In the sales data mart, the date dimension helps track sales
performance over time (e.g., daily, weekly, monthly, and yearly sales figures).
3. Human Resources Department: In the HR data mart, the date dimension can
be used for tracking employee records by hiring date, anniversary dates, or
other HR-related timeframes.
4. Marketing Department: In the marketing data mart, the date dimension assists
in analyzing campaign performance and customer behavior over time.
The key to a conformed dimension like the date dimension is that it's maintained
consistently throughout the organization. Any updates or changes to the
dimension are made in a centralized location, ensuring that everyone uses the
same date attributes and definitions. This uniformity is vital for producing accurate
and consistent reporting and analysis across the organization.
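A date dimension like this is usually generated once, programmatically, rather than loaded from a source system. A minimal sketch in Python's standard library; the attribute names are illustrative:

```python
from datetime import date, timedelta

def build_date_dimension(start, end):
    """Generate one row per calendar day with shared date attributes."""
    rows, d = [], start
    while d <= end:
        rows.append({
            "date": d.isoformat(),
            "year": d.year,
            "month": d.month,
            "quarter": (d.month - 1) // 3 + 1,
            "week": d.isocalendar()[1],          # ISO week number
            "day_of_week": d.strftime("%A"),
        })
        d += timedelta(days=1)
    return rows

dim = build_date_dimension(date(2023, 7, 1), date(2023, 7, 3))
print(len(dim), dim[0]["quarter"], dim[0]["day_of_week"])  # → 3 3 Saturday
```

Fiscal periods and public holidays would be added as extra columns from organization-specific calendars, which is exactly why centralizing this dimension matters.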
2. Junk Dimension: A junk dimension is a dimension table that consolidates
multiple low-cardinality flags or attributes into a single table. This helps reduce
the complexity of the schema and improve query performance
Suppose you are designing a data warehouse for a retail business, and you have
several low-cardinality attributes that are not directly related to any specific
dimension but are still useful for analysis and reporting. These attributes might
include:
1. Promotion Type: Describes the type of promotion used (e.g., discount, free
gift, buy-one-get-one).
2. Payment Method: Indicates how customers paid for their purchases (e.g.,
cash, credit card, debit card).
3. Purchase Channel: Specifies where the purchase was made (e.g., in-store,
online, mobile app).
4. Coupon Used: Indicates whether a coupon was applied to the purchase (yes
or no).
Instead of creating separate dimension tables for each of these attributes, you
can create a junk dimension called "Junk_Dimension" or
"Miscellaneous_Dimension." This dimension table might look like this:
Junk Dimension - Miscellaneous_Dimension:
JunkKey: A unique identifier for each combination of attributes.
Promotion Type: Attribute indicating the promotion type.
Payment Method: Attribute indicating the payment method.
Purchase Channel: Attribute specifying the purchase channel.
Coupon Used: Attribute indicating whether a coupon was used.
The "JunkKey" serves as a unique key for each combination of attribute values.
This allows you to reduce the number of dimension tables and keep these low-
cardinality attributes in one place, making it easier to manage and query the
data.
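A junk dimension is often pre-populated with the full cross product of the attribute values, so every fact row can find a matching JunkKey. A sketch with Python's built-in itertools and sqlite3; the attribute values are illustrative:

```python
import itertools
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE miscellaneous_dimension (
        junk_key INTEGER PRIMARY KEY,
        promotion_type TEXT, payment_method TEXT,
        purchase_channel TEXT, coupon_used TEXT
    )
""")

# Every combination of the low-cardinality attribute values gets one row.
promotions = ["discount", "free gift", "buy-one-get-one"]
payments   = ["cash", "credit card", "debit card"]
channels   = ["in-store", "online", "mobile app"]
coupons    = ["yes", "no"]

rows = list(itertools.product(promotions, payments, channels, coupons))
conn.executemany(
    "INSERT INTO miscellaneous_dimension "
    "(promotion_type, payment_method, purchase_channel, coupon_used) VALUES (?, ?, ?, ?)",
    rows,
)
(count,) = conn.execute("SELECT COUNT(*) FROM miscellaneous_dimension").fetchone()
print(count)  # → 54 rows (3 × 3 × 3 × 2 combinations)
```

Because the cardinality is the product of a few small sets, the table stays tiny even though it replaces four separate dimensions.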
Suppose you are designing a data warehouse for a retail business, and you have
a fact table that records sales transactions. Within this fact table, you may find a
transaction ID that is unique for each sale. This transaction ID can serve as a
degenerate dimension.
Sales Fact Table:
Transaction ID (Degenerate Dimension): A unique identifier for each sales
transaction.
Product Key: Foreign key linking to the Product dimension table.
Date Key: Foreign key linking to the Date dimension table.
Sales Amount: The amount of the sale.
Quantity Sold: The number of items sold.
In this example, the "Transaction ID" is a degenerate dimension because it's an
attribute associated with each sale, but it doesn't have a corresponding
dimension table. Instead, it's derived directly from the fact table.
4.Role-Playing Dimension:In some cases, a single dimension table can play
multiple roles in a data model. For example, a date dimension can be used to
represent both the order date and the ship date, each serving as a different role
in the fact table.
Suppose you're working on a data warehouse for a retail company, and you
have a sales fact table that captures daily sales. You want to analyze sales from
different date-related perspectives.
Sales Fact Table:
DateKey (Role-Playing Date Dimension): A foreign key linked to the "Date"
dimension table.
ProductKey: A foreign key linked to the "Product" dimension table.
StoreKey: A foreign key linked to the "Store" dimension table.
SalesAmount: The amount of sales for each transaction.
In this scenario, the "Date" dimension table is used in multiple roles:
1. Order Date Role: The "DateKey" represents the order date when a customer
made a purchase.
2. Ship Date Role: The same "DateKey" can also represent the date when the
order was shipped.
3. Delivery Date Role: It can be used to represent the date of actual delivery.
Each role-playing dimension has a distinct meaning and serves a specific
analytical purpose, allowing you to answer questions like:
"What were the sales on each product for their order date?"
"How many products were delivered on time, based on their delivery date?"
"How many products were shipped and delivered on the same day, based
on their ship date and delivery date?"
In this example, the "Date" dimension table plays multiple roles within the
same fact table, offering different date-related perspectives for analysis and
reporting without the need for creating separate dimension tables for each
date role.
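Role-playing is typically implemented by joining the single physical date dimension multiple times under different aliases, one per role. A sketch using Python's built-in sqlite3; the key and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY, full_date TEXT);
    CREATE TABLE sales_fact (
        order_date_key INTEGER REFERENCES date_dim(date_key),
        ship_date_key  INTEGER REFERENCES date_dim(date_key),
        sales_amount   REAL
    );
    INSERT INTO date_dim VALUES (1, '2023-07-01'), (2, '2023-07-03');
    INSERT INTO sales_fact VALUES (1, 2, 99.0);
""")

# One physical date_dim, joined twice under different aliases (roles).
row = conn.execute("""
    SELECT od.full_date AS order_date, sd.full_date AS ship_date, f.sales_amount
    FROM sales_fact f
    JOIN date_dim od ON f.order_date_key = od.date_key
    JOIN date_dim sd ON f.ship_date_key  = sd.date_key
""").fetchone()
print(row)  # → ('2023-07-01', '2023-07-03', 99.0)
```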
SCDs (Slowly Changing Dimensions) are dimension tables that track changes to descriptive attributes over time. There are several types of SCDs, including Type 1 (overwrite existing data), Type 2 (add new records with historical data), Type 3 (maintain limited history), Type 4 (separate history table), and Type 6 (a hybrid of Types 1, 2, and 3).
In a Type 1 SCD, when a change occurs, the existing record is simply updated with
the new data, overwriting the old data. Historical data is not preserved. This approach
is suitable when historical changes are not significant or when tracking historical
changes is not required.
In a Type 2 SCD, new rows are added to the dimension table to represent each
change. Each row has its own unique identifier (e.g., surrogate key) and a valid time
range during which it is applicable. This approach retains a full history of changes.
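A Type 2 change is usually handled in two steps: close the current row (set its end date), then insert a new current row with a fresh surrogate key. A sketch using Python's built-in sqlite3; the column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_dim (
        surrogate_key INTEGER PRIMARY KEY AUTOINCREMENT,
        customer_id   INTEGER,      -- natural/business key
        city          TEXT,
        valid_from    TEXT,
        valid_to      TEXT,         -- NULL marks the current row
        is_current    INTEGER
    )
""")
conn.execute("INSERT INTO customer_dim (customer_id, city, valid_from, valid_to, is_current) "
             "VALUES (42, 'London', '2023-01-01', NULL, 1)")

def scd2_update(conn, customer_id, new_city, change_date):
    """Close the current row, then insert a new current row (Type 2)."""
    conn.execute(
        "UPDATE customer_dim SET valid_to = ?, is_current = 0 "
        "WHERE customer_id = ? AND is_current = 1",
        (change_date, customer_id),
    )
    conn.execute(
        "INSERT INTO customer_dim (customer_id, city, valid_from, valid_to, is_current) "
        "VALUES (?, ?, ?, NULL, 1)",
        (customer_id, new_city, change_date),
    )

scd2_update(conn, 42, "Paris", "2023-06-01")
rows = conn.execute("SELECT city, is_current FROM customer_dim ORDER BY surrogate_key").fetchall()
print(rows)  # → [('London', 0), ('Paris', 1)]
```

Fact tables reference the surrogate key, so old facts keep pointing at the London row while new facts pick up the Paris row.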
In a Type 3 SCD, additional columns are added to the dimension table to store limited
historical changes. Typically, only a small subset of important attributes are tracked as
historical columns. This approach is useful when a limited amount of historical data is
sufficient.
**Cons**: Limited historical data tracking, not suitable for all scenarios.
In a Type 4 SCD, a separate historical table is created to store historical data. The main
dimension table holds only the current data, while the historical table stores the
previous versions of the records. This approach separates the current and historical
data, maintaining data integrity.
**Pros**: Clear separation of current and historical data, no changes to the main table
structure.
In a Type 6 SCD, a hybrid approach combines Types 1, 2, and 3: new rows preserve the full history (Type 2), while additional columns keep the current value readily available on every row (Types 1 and 3).
**Pros**: Balances between history and current data, flexible for different scenarios.
The choice of which SCD type to use depends on the specific requirements of
your data warehousing project, including the importance of historical data,
storage considerations, and query performance. Each SCD type has its
advantages and trade-offs, and the decision should be based on your
organization's data management needs.
The Need for Schema Design: Schema design is crucial because it defines the
structure of a database, impacting how data is stored, accessed, and analyzed. A
well-designed schema ensures data accuracy, consistency, and efficiency. It
enables businesses to make informed decisions, generate reports, and gain insights
from their data. Whether using a Snowflake or Star schema, the choice depends on
the specific requirements of the application and the balance between data
integrity and query performance.
The Star Schema: In contrast to a normalized Snowflake schema, the Star schema simplifies database design by denormalizing data, creating a centralized fact table surrounded by
dimension tables. This design minimizes the number of joins required to retrieve
data, which can lead to faster query performance. Star schemas are particularly
well-suited for data warehousing and business intelligence applications where
query speed is a priority. The fact table typically contains foreign keys referencing
dimension tables, allowing for easy aggregation and analysis of data.
Dimension Tables:
product_vendor_dim:
Stores information about products and vendors.
Linked with the fact table through the "Product_id" foreign key.
vendor_dim:
Contains vendor details.
Connected to the product_vendor_dim table via the "Vendor_id" foreign key.
City_dim:
Holds data about cities, including their names, states, zip codes, and countries.
Linked to the customer_dim table through the "CityID" foreign key.
Customer_dim:
Stores customer information.
Connected to the City_dim table using the "CityID" foreign key.
Time_dim:
Focuses on time-related data, such as the time of sale and the hour of sale.
Joined with the fact table via the "Time_of_sale" and linked to the date_dim
table through the "Date_id" foreign key.
Date_Dim:
Contains date-related information, including days, months, and years.
Connected to the Time_dim table through the "Date_id" foreign key.
Fact Table:
Transaction_Fact_Table:
Serves as the central fact table where transactional data is stored.
Contains transaction-related details such as Transaction ID, Time of Sale, Product
ID, Employee ID, Customer ID, Amount, and Quantity.
Links to various dimension tables using their respective foreign keys, creating
relationships with the product, vendor, city, customer, time, and date
dimensions.
ER Diagram of the above example
Example: Star Schema
In the below example, a star schema is implemented with a central fact table and
multiple dimension tables. Here's how each component contributes to the schema:
**Dimension Tables:**
- `product_vendor`: Stores details about products and vendors. It's linked to the
fact table via the `product_id` foreign key.
- `Employee`: Contains employee information. Connected to the fact table through
the `employee_id` foreign key.
- `customers`: Holds customer data, including personal and contact details. Linked
to the fact table by the `customer_id` foreign key.
- `Time_dim`: Focuses on time-related data. It's joined with the fact table through
the `time_id` foreign key.
**Fact Table:**
- `transaction_table`: Serves as the central fact table, storing transactional data
such as Transaction ID, Time ID, Product ID, Vendor ID, Employee ID, Customer ID,
Amount, and Quantity. This table links to various dimension tables using their
respective foreign keys, establishing relationships with product, vendor, employee,
customer, and time dimensions.
Dimension tables in a star schema are denormalized, meaning they may contain
redundant data to improve query performance. For instance, product_vendor
contains both VendorId and Vendorname, which could be in a separate table in a
normalized schema. Denormalization reduces the number of joins needed during
queries, thus enhancing performance.
ER Diagram of the above example
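The star schema described above can be sketched end to end: denormalized dimension tables, a central fact table holding foreign keys and measures, and an aggregate query that needs only one join per dimension. A minimal illustration with Python's built-in sqlite3, using a reduced subset of the columns named above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Denormalized dimension: vendor attributes live on the product row.
    CREATE TABLE product_vendor (product_id INTEGER PRIMARY KEY,
                                 product_name TEXT, vendor_name TEXT);
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    -- Central fact table: foreign keys plus numeric measures.
    CREATE TABLE transaction_table (
        transaction_id INTEGER PRIMARY KEY,
        product_id  INTEGER REFERENCES product_vendor(product_id),
        customer_id INTEGER REFERENCES customers(customer_id),
        amount REAL, quantity INTEGER
    );
    INSERT INTO product_vendor VALUES (1, 'Laptop', 'Acme'), (2, 'Mouse', 'Acme');
    INSERT INTO customers VALUES (10, 'Ada');
    INSERT INTO transaction_table VALUES (100, 1, 10, 1200.0, 1), (101, 2, 10, 25.0, 2);
""")

# A typical star-schema query: aggregate the fact table, grouped by a
# dimension attribute, with a single join to reach that attribute.
result = conn.execute("""
    SELECT p.vendor_name, SUM(t.amount) AS total_sales
    FROM transaction_table t
    JOIN product_vendor p ON t.product_id = p.product_id
    GROUP BY p.vendor_name
""").fetchall()
print(result)  # → [('Acme', 1225.0)]
```

In a snowflake design, vendor_name would sit in a separate vendor table, adding a second join to the same query; the denormalized star trades that storage redundancy for fewer joins.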