
How to get started with data modelling?

Mentoring Sessions Recording:
Agenda:

1. Sample problem statement to build a data model

2. Get started with data modelling

3. Understand the concept of dimensional data modelling

4. Dive into dimensions and facts

5. Understand the slowly changing dimensions

6. Final data model view

7. Design aspects of a data pipeline


Problem Statement:

→ The digital marketing team at Udemy launches various campaigns to inform its users about new course launches, top-selling courses, courses matching user interests, newly added features, course price drops, etc.

→ They engage directly with the users through emails, newsletters, SMS, push notifications, etc. Here we will mainly focus on the EMAIL marketing campaigns launched.

→ The data generated from user engagement with the emails sent by Udemy is our main point of interest.

→ Some of these engagement events are email_sent, email_bounce, email_open, email_click and email_unsubscribe.

→ Third-party tools are used to capture the above metrics and send us the user engagement data on a daily basis.

→ The data received is made ready for analysts by making use of big data technologies.

Task: Build an end-to-end data pipeline to handle the above use case.
Have you ever been asked to design a data model?
Data Modelling:

→ A data model is a prototype that helps us understand the main system better.
→ A data model does not actually contain any data.
→ We consider entities, the relationships between entities, and the business logic when building data models.
→ The objective of data modelling is to store data in a way that makes it easy and fast to query and analyze.
→ We will consider dimensional modelling of data warehouses with respect to big data needs.
Data Modelling Fundamentals:

Data models consist of the below:

1. Data subjects → Commonly referred to as entities; very similar to "database tables"
2. Attributes of data subjects → Analogous to database columns
3. Relationships between DB tables → Describe how tables are related to each other
→ The kinds of relationships we can have in place are:
   Single level - one table related to only one other table
   Multi-level - one table related to many tables
   Hierarchy - one table divided into different sub-tables which are related to the main table

Data modelling lifecycle:

→ Conceptual modelling: An overview of what the model will be built upon. We decide on the tables and attributes to be included in the data model.
→ Logical modelling: Builds on the conceptual model by creating the relationships among entities and the key constraints to be included.
→ Physical modelling: The final data model that will be used in building the data warehouse, including granular details like indexes, partition columns, etc.

Conceptual Modelling → Logical Modelling → Physical Modelling


Transactional VS Analytical Data Modelling:

Transactional:
1. To design transactional systems like databases
2. Follows data normalization rules

Analytical:
1. To design analytical systems, especially DWH used for reporting purposes
2. Follows dimensional modelling
What is a data warehouse?
→ It is a data store where data from various sources is integrated into a common place, the history of the data is maintained, and everything is organized well, which helps in analytics and reporting.

Features:
1. Integration from different data sources - Data from various sources can come to a common place where it can be stored in an organized manner
2. Subject oriented - The data subject (kind of data) from a data source decides the way the data has to be stored
3. Time variant - Data in the DWH contains historical data, not just current data
4. Non volatile - Data doesn't change dynamically as in transactional systems
WHY DWH:
1. Make data-driven decisions - Based on past, present and future data, and try to find the unknown metrics used for analysis
2. One-stop shopping - A common place where we can find data from various transactional DBs and operational sources, all in one place
What is dimensional data modelling?
→ Dimensional data modelling is a technique used to store data in data warehouses in the form of facts and dimensions for a fast and efficient query retrieval process.

Data = context (dimension) + measurement (fact)

Components of dimensional data modelling:

1. Dimension: Provides a business context for a data item. Ex. product_id, product_sale_date, model_code
2. Fact: Contains data from the collection and aggregation of measurements. Ex. sale_amount, product_count
3. Attribute: The elements of a dimension/fact table
4. Dimension table: Table which stores one or more business attributes of the data
5. Fact table: Table which stores the quantitative measures of the data used for analysis
Process of dimensional data modelling:

1. Identify the business process:

→ Helps us understand the final outcomes of the data model being built and their usage throughout.
→ Understand the data sources we ingest and form facts and dimensions from them
→ Decide on the schema to be implemented

2. Identify the grain:
→ The grain signifies the level of sub-categorization to be considered while building the data model.
Ex. College → Degree → Department

3. Identify dimensions:
→ Dimensions form the base on which the data model is built, as they represent the business attributes. They consist of detailed information around the business case, like stock_code, customer_id, invoice_date, etc.

4. Identify facts:
→ Facts give us a detailed overview of the business aggregates or metrics used for data analysis.
Ex. In our case: amount_spent, review_score, review_count

5. Decide on the schema and on updates to data:

→ We mainly have two kinds of schemas to choose from, star and snowflake, along with the variations discussed below.
Design a dimensional data model for Udemy analytics
Database keys for DWH:

Primary Key VS Foreign Key

Natural Key VS Surrogate Key
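A minimal SQL sketch of how these four kinds of keys typically appear together (the table and column names here are illustrative assumptions, not from the slides):

-- Dimension: surrogate key as primary key, natural key carried from the source
CREATE TABLE dim_course (
    course_key  BIGINT PRIMARY KEY,       -- surrogate key: system-generated, no business meaning
    course_id   VARCHAR(20) NOT NULL,     -- natural key: identifier from the source system
    course_name VARCHAR(200)
);

-- Fact: references the dimension via a foreign key on the surrogate key
CREATE TABLE fact_enrollment (
    course_key      BIGINT REFERENCES dim_course (course_key),  -- foreign key
    enrollment_date DATE,
    amount_paid     DECIMAL(10, 2)
);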


Data modelling schemas:

1. Star Schema:
→ Consists of a single level of hierarchy of dimensions
→ Tables are in a denormalized state
→ SQL query performance is improved, as fewer joins are involved
→ Data redundancy is high
Data modelling schemas:

2. Snowflake schema:
→ It is an extension of the star schema where the dimensions are connected internally to one or more other dimensions
→ The tables are partially normalized
→ SQL query performance is affected due to the larger number of joins involved
→ Data redundancy is low
Data modelling schemas:

3. Galaxy schema:
→ Data schema with more than one fact table
→ Multiple fact tables share the common dimensions
→ Used to handle more complex fact table requirements
Data modelling schemas:

4. Star Cluster schema:

→ Combines features from both the star schema and the snowflake schema
→ A few of the dimensions can be normalized and broken down to a further granular level
→ Achieves a balance between query latency and data redundancy
Which schema do you prefer to use and why ?
Designing Dimensions:
1. While creating dimension tables, always prefix the table name with "dim". Ex. dim_customer_details

Star Schema:
1. Flat dimensions
2. Has one surrogate key and multiple natural keys

Snowflake Schema:
1. Hierarchical dimensions
2. Has one surrogate key and one natural key in each of the dimension tables
Designing facts:

1. Primary key: A combination of the surrogate keys from the dimension tables, or a separate surrogate key created for the fact table itself.
2. One or more facts can be stored in a fact table, based on business rules
3. Prefix fact table names with "fact", and you can have a surrogate key generated for each row in the fact table.
4. There are mainly 3 types of facts, namely:
   4.1 Additive facts
   4.2 Semi-additive facts
   4.3 Non-additive facts
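For our Udemy email use case, a hedged sketch of a fact table whose primary key combines the surrogate keys of its dimensions (all table and column names are assumptions for illustration):

CREATE TABLE fact_email_engagement (
    user_key       BIGINT NOT NULL,   -- surrogate key from dim_user
    campaign_key   BIGINT NOT NULL,   -- surrogate key from dim_campaign
    date_key       INT    NOT NULL,   -- surrogate key from dim_date (yyyymmdd)
    emails_sent    INT,               -- facts: aggregated engagement measures
    emails_opened  INT,
    emails_clicked INT,
    PRIMARY KEY (user_key, campaign_key, date_key)
);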
Types of facts:

1. Additive facts:
→ Facts that can be summed up across all the dimensions in the fact table.
Ex. Calculate the number of units sold in a retail store in the month of July 2022.
The number of units sold is the fact, which can be calculated using the orders, customers and calendar dimensions.
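As a sketch, an additive fact can be summed across any combination of its dimensions (fact_sales, dim_calendar and their columns are hypothetical names):

-- Units sold in July 2022: SUM is valid across all dimensions
SELECT SUM(f.units_sold) AS units_sold
FROM fact_sales f
INNER JOIN dim_calendar c ON c.date_key = f.date_key
WHERE c.year = 2022 AND c.month = 7;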
Types of facts:

2. Semi-additive facts:
→ Facts that can be summed up across some, but not all, of the dimensions in the fact table.
Ex. Calculate the total amount spent by a customer in the retail store till now.
Here we need dimensions like customer and product, but not necessarily the date and store dimensions.
Types of facts:

3. Non-additive facts:

→ Facts that cannot be summed up across any of the dimensions in the table.
Ex. Calculate the percentage of customers spread across different states in India.
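A sketch of the percentage example: a non-additive fact has to be recomputed from the underlying counts rather than summed (dim_customer and its columns are assumed names):

-- Percentage of customers per state; summing stored percentages would be meaningless
SELECT state,
       100.0 * COUNT(*) / SUM(COUNT(*)) OVER () AS pct_customers
FROM dim_customer
GROUP BY state;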
Can you give examples of the different fact types that could be used in our use case?
Types of fact tables:

1. Transaction fact tables:

→ Fact tables with granular row data, where each row denotes a single transaction that has taken place. This is the smallest grain, which can't be broken down further. Each transaction can capture a lot of context, which provides rich dimensional-level fields.
→ Generally the transactions are not updated frequently, so the data remains fairly static.
→ Ex. A customer performs a cash withdrawal at a nearby ATM. Details such as the customer, bank, account and card details can be captured.

2. Periodic snapshot tables:

→ Fact tables which contain cumulative performance measurements of the business at predefined snapshot periods like daily, weekly, monthly, etc.
→ These can be of more use for data analysis purposes.
→ The source for this fact table is the transaction facts, from which the date period can be chosen.
→ Ex. The average purchases made by a customer over the last 3 months
Types of fact tables:

3. Accumulating snapshots:
→ Fact tables that describe business activities which have a clear beginning and end.
→ These tables have a list of date or datetime columns to depict the milestones.
Ex. A user starts to use Zomato for the first time. Their count of ad clicks is tracked over weeks: week 1, week 2, week 3, week 4.

4. Factless fact tables:

→ Tables that have no facts stored in them. Such fact tables have no metric/measurement stored in them; the metrics are calculated from the dimensions or raw sources.

→ There are 2 kinds of such tables:

4.1 Factless facts tracking an event or activity

→ These fact tables store transaction-level information through the dimensions connected to them. From this, the required metrics can be calculated later, which can be useful business information.
Let's look at an example. The model below gives us details about an employee leave tracker.
How to calculate the total count of leaves taken by an employee?

SELECT d.employee_name AS name, COUNT(f.leave_type_id) AS leave_count
FROM fact_leave f
INNER JOIN dim_employee d ON d.employee_id = f.employee_id
WHERE f.employee_id = 'TT100'
GROUP BY d.employee_name
4.2 Factless facts describing a condition or eligibility
→ These fact tables can be used to calculate an eligibility criterion based on the fields in the fact table, though they don't store the criterion directly. For example, checking whether an employee qualifies by having taken fewer than 10 leaves:

SELECT d.employee_name AS name, COUNT(f.leave_type_id) AS leave_count
FROM fact_leave f
INNER JOIN dim_employee d ON d.employee_id = f.employee_id
WHERE f.employee_id = 'TT100'
GROUP BY d.employee_name
HAVING COUNT(f.leave_type_id) < 10
Slowly changing dimensions:

1. Dimensions that change slowly over time
2. Dimensions that store and manage both current and historical data over time in a DWH
3. Techniques to manage history within the data warehouse
4. The historical change of data over time becomes important in a DWH, as it is used for analytical purposes

DWH SCD TYPES:

TYPE 0 - Fixed dimension
TYPE 1 - Overwrite old data with no history retention
TYPE 2 - Maintain unlimited history, i.e. all history versions are available
TYPE 3 - Maintain limited history
TYPE 4 - Split current and historical data into separate tables
TYPE 6 - Hybrid type (combination of 1, 2 & 3)
TYPE – 0 SCD:

1. The data in these dimensions is fixed and never changes
2. Once the data is loaded into these tables, it remains fixed

Cons:
1. There is no history of data to be tracked
2. Data remains static and is not suited for analytics

Country      Timezone
India        UTC+5:30
Australia    UTC+8:00
Denmark      UTC+1:00
TYPE – 1 SCD:

1. The row value which needs to be changed is updated with the new value
2. The old value is permanently deleted
3. Mainly useful for correcting errors
4. The data is overwritten

Cons:
1. History won't be retained
2. Auditing of data won't be possible

Student_key   Student_id   Student_name   Student_college   Student_place
8394873920    22CS120      Ram Vihar      JNTUA             Ananthpur   (before)
8394873920    22CS120      Ram Vihar      JNTUA             Hyderabad   (after overwrite)
TYPE – 2 SCD:

1. The updated column value is stored as a new row, while the old value remains in the existing row
2. A new surrogate key is generated for the updated data row
3. Reports and analytics on data from before and after a Type 2 change will give accurate results
4. Analytics done across old and new data changes can be captured accurately
5. Historical analysis can be done on the historical data

Cons:
1. Huge storage in the dimension tables, as all history versions of the data are retained
2. Additional column fields need to be included to identify the old and new versions of the data
3. Natural keys need to be included for better identification of rows which differ only in their surrogate keys while all other details remain the same

Student_key   Student_id   Student_name   Student_college   Student_place
8394873920    22CS120      Ram Vihar      JNTUA             Bangalore
6489274890    22CS120      Ram Vihar      JNTUA             Hyderabad
WAYS TO HANDLE TYPE 2 - SCD:

1. Include a new column, e.g. "flag_change", whose value is changed whenever any value changes in the data row.
2. Include new columns like "start_date" and "end_date" (effective_time and expiry_time below) to signify changes in the data rows.

Student_key   Student_id   Student_name   Student_college   Student_place   Flag_change   Effective_time        Expiry_time
783657278     22ME800      Sumanth        JNTUA             CHN             Y             2022-05-30 10:00:00   3022-12-12 00:00:00
8297483920    20EE250      Ram Vihar      JNTUA             BLR             N             2020-05-10 10:00:00   2020-10-18 10:00:00
8297483920    20EE250      Ram Vihar      JNTUA             HYD             N             2020-10-18 10:00:00   2022-01-20 20:00:00
8297483920    20EE250      Ram Vihar      JNTUA             BLR             Y             2022-01-11 10:00:00   3022-12-12 00:00:00
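A minimal SQL sketch of the flow behind the table above: expire the current version, then insert the new version (the new surrogate key would normally come from a sequence; the far-future timestamp mirrors the sentinel used in the table):

-- Step 1: close out the current version of the row
UPDATE dim_student
SET expiry_time = CURRENT_TIMESTAMP,
    flag_change = 'N'
WHERE student_id = '20EE250'
  AND expiry_time = TIMESTAMP '3022-12-12 00:00:00';

-- Step 2: insert the changed data as the new current row
INSERT INTO dim_student
    (student_key, student_id, student_name, student_college, student_place,
     flag_change, effective_time, expiry_time)
VALUES
    (6489274891, '20EE250', 'Ram Vihar', 'JNTUA', 'HYD',
     'Y', CURRENT_TIMESTAMP, TIMESTAMP '3022-12-12 00:00:00');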
TYPE – 3 SCD:

1. Add a new column, rather than a new row, to reflect the changes made
2. One column for the "old value" and one for the "new value"
3. Supports switching back and forth between values for effective reporting

Cons:
1. It is not suitable for a DWH where many columns change, like place, country, address, pincode
2. It is suitable only for use cases where the changes are limited

Student_key   Student_id   Student_name   Old_place   New_place
8394873920    22CS120      Ram Vihar      ATP         HYD
TYPE – 4 SCD:

1. Used for fast-growing dimensions, where storing the current and all the historical data in a single dimension would make it inefficient to store and query
2. The dimension data is stored in 2 different tables, comprising the current and the historical changes

Cons:
1. More storage and maintenance is required to store them as two separate tables

Historical data table:

Student_key   Student_id   Student_name   Student_college   Student_place
8394873920    22CS120      Ram Vihar      JNTUA             Ananthpur
8394873920    22ME783      Anjali M       JNTUA             Hyderabad

Current data table:

Student_key   Student_id   Student_name   Student_college   Student_place
8394873925    22CS120      Ram Vihar      JNTUA             Chennai
8394873929    22CS120      Ram Vihar      JNTUA             Hyderabad
TYPE – 6 SCD:

1. It is a combination of types 1, 2 & 3 and is known as the hybrid type.
2. Stores the current value in a "current" column on all the historical rows.

Cons:
1. More complex to implement and stores a lot of redundant data.

Student_key   Student_id   Student_name   Student_current_place   Student_place   Flag_change   Effective_time        Expiry_time
783657278     22ME800      Sumanth        CHN                     CHN             Y             2022-05-30 10:00:00   3022-12-12 00:00:00
8297483920    20EE250      Anjali         DEL                     BLR             N             2020-05-10 10:00:00   2020-10-18 10:00:00
8297483920    20EE250      Anjali         DEL                     HYD             N             2020-10-19 10:00:00   2022-01-20 20:00:00
8297483920    20EE250      Anjali         DEL                     BLR             Y             2022-01-11 10:00:00   3022-12-12 00:00:00
Data Modelling:

Tips to keep in mind:

1. Decide on the dimension and fact tables to be built

2. Explain the tables in the DWH

3. Talk about the main fields in the dims & facts

4. Choose the variants of facts or fact tables to be considered

5. Choose the schema to be implemented

6. Combine all the above into a data model to present your final design
Data Model:
Final Finish:

1. Present the entire design of the data pipeline worked on, along with the architecture, components and data flow.

2. Put forward the assumptions you have made in designing the data pipeline

3. Highlight any edge cases you have considered and how you included them in your design

4. Ask for feedback and repeat the design life cycle to include new changes.

5. Have an open discussion on the design you have proposed and how it can be improved.

6. End with the questions you have for the interviewer and their comments on your system design solution.
Design considerations for data pipelines:

1. Check for idempotence of pipelines:

→ Idempotence: A given data pipeline, for the same set of inputs, should give the same output when run multiple times.

1.1 When are pipelines rerun?

→ To backfill data, to handle pipeline failures, and to test pipelines

1.2 Why is it important?

→ When a data pipeline is rerun multiple times, there are chances of adding duplicate data or keeping stale data, which produces bad data and leads to wrong results.

1.3 How to keep pipelines idempotent? (see the sketch below)

→ Perform a complete refresh of the data so that the entire data gets re-written.
→ Include duplicate checks so that only distinct data items remain in the tables.
→ Make sure the dependent tables are in an updated state.
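One common way to keep a daily load idempotent is to delete and reload only the slice being processed, so a rerun rewrites instead of duplicating. A sketch reusing the illustrative fact table from earlier; stg_email_events is an assumed staging table:

-- Rerunning this block for the same run date always leaves the table in the same state
DELETE FROM fact_email_engagement
WHERE date_key = 20220701;

INSERT INTO fact_email_engagement
    (user_key, campaign_key, date_key, emails_sent, emails_opened, emails_clicked)
SELECT user_key, campaign_key, 20220701 AS date_key,
       SUM(CASE WHEN event_type = 'email_sent'  THEN 1 ELSE 0 END),
       SUM(CASE WHEN event_type = 'email_open'  THEN 1 ELSE 0 END),
       SUM(CASE WHEN event_type = 'email_click' THEN 1 ELSE 0 END)
FROM stg_email_events
WHERE event_date = DATE '2022-07-01'
GROUP BY user_key, campaign_key;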
Design considerations for data pipelines:

2. Monitoring and Alerting:

2.1 Why is it important?

→ Pipelines that are not monitored or maintained well lead to pipeline failures, broken cross-team dependent runs, and stale data dumps going unnoticed. The pipeline lifecycle wouldn't be complete without strong maintenance in place.

2.2 How can it be done?

→ Scheduling tools like Airflow and AutoSys enable us to monitor the job runs and raise alerts for unusual behavior like failures, long-running jobs, or pipelines hanging. Services like CloudWatch and Datadog can be enabled to monitor the pipeline runs and alert on them.

2.3 Other ways of alerting?

→ We can integrate the above services with communication channels like Outlook, Slack and Teams, which can send real-time alerts so immediate action can be taken.
Design considerations for data pipelines:

3. Data quality checks:

3.1 Why is it important?

→ Data quality checks make sure that the data is up to date and qualified for analytics.

3.2 How to get it done?

→ We can write manual test cases to perform data quality checks like duplicates, value ranges and schema types; a couple of minimal checks are sketched below.
We can also make use of frameworks designed for this purpose, like dbt and Great Expectations.

3.3 How to handle failures?

→ We can write a separate pipeline to handle all the quality checks for a set of pipelines, which can be monitored and maintained alongside them.
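Two of these checks as plain SQL, assuming the illustrative tables from earlier; the pipeline is failed if either query returns rows (frameworks like dbt or Great Expectations wrap the same idea):

-- Duplicate check: the fact grain should be unique
SELECT user_key, campaign_key, date_key, COUNT(*) AS dup_count
FROM fact_email_engagement
GROUP BY user_key, campaign_key, date_key
HAVING COUNT(*) > 1;

-- Value-range check: engagement counts can never be negative
SELECT *
FROM fact_email_engagement
WHERE emails_sent < 0 OR emails_opened < 0 OR emails_clicked < 0;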
Design considerations for data pipelines:

4. Dynamic resource allocation:

4.1 Why is it important?

→ Pipeline resources don't get wasted when not in use, which saves costs for the company
→ The pipeline doesn't fail due to a lack of resources.
→ Set Spark properties such as spark.sql.files.maxPartitionBytes to a lower value, or increase the number of partitions.

4.2 When can it be used?

→ During times of high volatility in the data being received.
→ The data can be either less than expected or more than expected.

4.3 How to get it done?

→ Cloud services like AWS, Azure and GCP provide the ability to dynamically scale the resources allocated to the cluster up or down based on the amount of data being dealt with.
→ It can be enabled in Spark using spark.dynamicAllocation.enabled, which lets Spark decide on the min & max number of executors to be brought up based on the workload.
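As a sketch, the properties typically set together when enabling dynamic allocation on a Spark job (the executor bounds and the job file name are example values, not recommendations):

spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=50 \
  --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
  email_engagement_pipeline.py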
Design considerations for data pipelines:

5. Watermarking Tables:

5.1 What is it all about?

→ In order to load only the incremental data into a table, and to keep track of the pipeline runs, we introduce the concept of watermarking.

5.2 Why is it important?

→ To prevent overwriting huge amounts of data daily
→ To prevent reading from large tables daily
→ To optimize the resources used to run the pipelines

5.3 How can it be implemented?

→ Filter on date columns to select the delta data to be processed by the pipelines
→ Create a watermark table to store the from_date and to_date (see the sketch below)
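A minimal sketch of the watermark pattern, reusing the illustrative tables from earlier (the watermark table layout and the date_key arithmetic are assumptions):

-- One row per pipeline, tracking the window last processed
CREATE TABLE watermark (
    pipeline_name VARCHAR(100) PRIMARY KEY,
    from_date     DATE,
    to_date       DATE
);

-- Load only the delta since the last successful run
INSERT INTO fact_email_engagement
    (user_key, campaign_key, date_key, emails_sent, emails_opened, emails_clicked)
SELECT user_key, campaign_key,
       CAST(EXTRACT(YEAR FROM event_date) * 10000
            + EXTRACT(MONTH FROM event_date) * 100
            + EXTRACT(DAY FROM event_date) AS INT) AS date_key,
       SUM(CASE WHEN event_type = 'email_sent'  THEN 1 ELSE 0 END),
       SUM(CASE WHEN event_type = 'email_open'  THEN 1 ELSE 0 END),
       SUM(CASE WHEN event_type = 'email_click' THEN 1 ELSE 0 END)
FROM stg_email_events
WHERE event_date > (SELECT to_date FROM watermark
                    WHERE pipeline_name = 'email_engagement')
GROUP BY user_key, campaign_key, event_date;

-- Advance the watermark only after the load succeeds
UPDATE watermark
SET from_date = to_date,
    to_date   = CURRENT_DATE
WHERE pipeline_name = 'email_engagement';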
Sources:

Data Engineer Specific:


https://www.striim.com/blog/guide-to-data-pipelines/
https://medium.com/the-data-experience/building-a-data-pipeline-from-scratch-32b712cfb1db
https://towardsdatascience.com/big-data-modeling-25c64d456308
https://www.startdataengineering.com/post/data-engineering-project-for-beginners-batch-edition/#6-design-considerations

Generic to system design:


https://www.geeksforgeeks.org/top-10-system-design-interview-questions-and-answers/
https://www.freecodecamp.org/news/systems-design-for-interviews/
https://blog.tryexponent.com/how-to-nail-the-system-design-interview/

Udemy:
https://www.udemy.com/course/mastering-data-modeling-fundamentals/
https://www.udemy.com/course/data-warehouse-fundamentals-for-beginners/

Books:
https://www.oreilly.com/library/view/designing-data-intensive-applications/9781491903063/
https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/books/data-warehouse-dw-toolkit/
Sources:

Data Warehousing concepts:


https://www.tutorialspoint.com/dwh/dwh_data_warehousing.htm
https://www.javatpoint.com/data-warehouse
https://www.guru99.com/data-warehouse-architecture.html
https://www.analyticsvidhya.com/blog/2021/07/a-brief-introduction-to-data-warehouse/
https://www.1keydata.com/datawarehousing/concepts.html

Data Modelling concepts:


https://www.guru99.com/data-modelling-conceptual-logical.html
https://www.ibm.com/cloud/learn/data-modeling
