Data Modeling - Presentation PDF
→The digital marketing team at Udemy launches various campaigns to its users about new course launches,
top-selling courses, courses matching user interests, newly added features, course price drops, etc.
→ They engage directly with users through emails, newsletters, SMS, push notifications, etc. Here we
will mainly focus on EMAIL marketing campaigns.
→ Data generated from user engagement with the emails sent by Udemy becomes our main point of
interest.
→ Third-party tools are used to capture the above metrics and send us the user engagement
data on a daily basis.
→ The data received is made ready for analysts using big data technologies.
Task: Build an end-to-end data pipeline to handle the above use case.
Have you ever been asked to design a data model?
Data Modelling:
→Data models are prototypes that help us understand the main system better.
→Data models don't actually contain any data.
→We consider entities, the relationships between entities, and business logic when building data models.
→The objective of data modelling is to store data in a way that makes it easy and fast to query and analyze
the data.
→We will consider dimensional modelling of data warehouses with respect to big data needs.
Data Modelling Fundamentals:
→Conceptual modelling: An overview of how the model can be built. We decide on the tables and
attributes to include when creating the data model
→Logical modelling : Builds on the conceptual model by adding relationships among entities and key constraints
→Physical modelling : The final data model used to build the data warehouse, which includes
granular details like indexes, partition columns, etc
Features:
1. Integration from different data sources - Data from various sources can come to a common place where
it can be stored in an organized manner
2. Subject Oriented - The subject (kind) of data coming from a data source decides the way this data
has to be stored
3. Time variant - Data in the DWH contains historical data, not just current data
4. Non volatile - Data doesn't change dynamically as it does in transactional systems.
WHY DWH :
1. Make data-driven decisions - Based on past, present and future data. Try to find the unknown metrics
that are used for analysis
2. One-stop shopping - A common place where we can find data from various transactional DBs and
operational sources, all in one place
What is dimensional data modelling ?
→Dimensional data modelling is a technique used to store data in data warehouses in the form of facts and
dimensions for fast and efficient query retrieval process.
2. Identify grain
→The grain defines the level of sub-categorization to be considered while building the
data model.
Ex. College →Degree →Department
3. Identify dimensions
→Dimensions form the base on which the data model is built, as they represent the business
attributes. They consist of detailed information around the business case, like stock_code, customer_id,
invoice_date, etc
4. Identify facts:
→Facts give us the business aggregates or metrics used for data analysis.
Ex. In our case, it is amount_spent, review_score, review_count
1. Star Schema:
→Consists of a single level of hierarchy of dimensions.
→Tables are in a denormalized state
→SQL query performance is improved as fewer joins are involved
→Data redundancy is high
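The points above can be sketched in code. This is a minimal, illustrative star schema with assumed table and column names (not from the slides): a single fact table whose rows reference denormalized dimension tables directly, so reporting needs only one key lookup per dimension.

```python
# Illustrative star schema: a fact table joined directly to denormalized
# dimension tables (single level of hierarchy, no sub-dimensions).
dim_customer = {101: {"name": "Asha", "city": "BLR", "country": "India"}}
dim_course = {501: {"title": "Data Modeling 101", "category": "Data"}}

fact_sales = [
    {"customer_key": 101, "course_key": 501, "amount_spent": 499.0},
    {"customer_key": 101, "course_key": 501, "amount_spent": 299.0},
]

# One lookup per dimension: resolve surrogate keys to attributes for reporting.
report = [
    {
        "customer": dim_customer[f["customer_key"]]["name"],
        "course": dim_course[f["course_key"]]["title"],
        "amount_spent": f["amount_spent"],
    }
    for f in fact_sales
]
print(report)
```

Because each dimension is a single denormalized table, the redundancy (e.g. `country` repeated per customer row) is the price paid for fewer joins at query time.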
Data modelling schema:
Snowflake schema:
→It is an extension of the star schema where dimensions are internally connected to one or more
sub-dimensions.
→The tables are partially normalized
→SQL query performance is affected due to the larger number of joins involved
→Data redundancy is low
Galaxy schema:
→Data schema with two or more fact tables
→Multiple fact tables share the common dimensions
→Used to handle more complex fact table requirements
1. Primary Key : A combination of the surrogate keys from the dimension tables, or a separate surrogate key
created for the fact table itself.
2. One or more facts can be stored in the fact table, based on business rules
3. Suffix fact table names with “fact”, and a surrogate key can be generated for each row in the fact table.
4. There are mainly 3 types of facts namely:
4.1 Additive facts
4.2 Semi-Additive facts
4.3 Non-additive facts
Types of facts:
1. Additive facts :
→Facts that can be aggregated across all the dimensions in the fact table.
Ex. Calculate the number of units sold in a retail store in the month of July 2022.
Number of units sold is the fact, which can be calculated using the orders, customers and
calendar dimensions.
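A minimal sketch of the example above, with assumed row and column names: because units_sold is additive, it can be summed along any dimension slice, here the calendar dimension.

```python
# An additive fact (units_sold) can be summed across every dimension:
# by month, by customer, by product, or all of them at once.
fact_orders = [
    {"order_date": "2022-07-01", "customer_id": "C1", "units_sold": 3},
    {"order_date": "2022-07-15", "customer_id": "C2", "units_sold": 5},
    {"order_date": "2022-08-02", "customer_id": "C1", "units_sold": 2},
]

# Slice by the calendar dimension: units sold in July 2022.
july_units = sum(
    f["units_sold"] for f in fact_orders if f["order_date"].startswith("2022-07")
)
print(july_units)  # 8
```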
2. Semi-Additive facts :
→Facts that can be aggregated across only some of the dimensions in the fact table.
Ex. Calculate the total amount spent by a customer in the retail store till now.
Here we need dimensions like customer and product, but not necessarily the date or store dimensions
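A small sketch of that example, with assumed names: the fact is aggregated per customer, while the date and store dimensions are deliberately ignored in the grouping.

```python
# Semi-additive sketch: total amount spent per customer "till now".
# We group by the customer dimension only; date and store are not used.
fact_purchases = [
    {"customer_id": "C1", "store_id": "S1", "date": "2022-01-05", "amount": 100.0},
    {"customer_id": "C1", "store_id": "S2", "date": "2022-03-09", "amount": 50.0},
    {"customer_id": "C2", "store_id": "S1", "date": "2022-02-11", "amount": 70.0},
]

totals = {}
for f in fact_purchases:
    totals[f["customer_id"]] = totals.get(f["customer_id"], 0.0) + f["amount"]
print(totals)  # {'C1': 150.0, 'C2': 70.0}
```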
3. Accumulating snapshots:
→Fact tables that describe business activities which have a clear beginning and end.
→These tables have a list of date or datetime columns to depict the milestones
Ex. A user starts to use Zomato for the first time. Their count of ad clicks is tracked over weeks:
week 1, week 2, week 3, week 4.
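The Zomato example can be sketched as one snapshot row per user, with one milestone column per week (column names assumed). The defining trait is that the same row is updated in place as each milestone completes, rather than a new row being appended.

```python
# Accumulating-snapshot sketch: one row per user, one column per milestone.
snapshot = {
    "user_id": "U42",
    "first_use_date": "2022-06-01",
    "week1_ad_clicks": 4,
    "week2_ad_clicks": 7,
    "week3_ad_clicks": None,  # milestone not reached yet
    "week4_ad_clicks": None,
}

# When week 3 ends, the SAME row is updated; no new row is inserted.
snapshot["week3_ad_clicks"] = 2
print(snapshot)
```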
Cons:
1. No history of data is tracked
2. Data remains static and is not suited for analytics
Country Timezone
India UTC +5:30
Australia UTC +8:00
Denmark UTC +1:00
TYPE – 1 SCD:
1. The row value which needs to be changed is updated with the new value
2. The old value is permanently deleted
3. Mainly useful for correcting errors
4. The data is overwritten
Cons:
1. History won’t be retained
2. Auditing of data won’t be possible
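A minimal sketch of the Type 1 behavior described above (table and column names assumed): the attribute is overwritten in place, so the old value is simply gone afterwards.

```python
# Type 1 SCD sketch: overwrite in place; no history row is kept.
dim_customer = {101: {"name": "Asha", "city": "BLR"}}

def scd_type1_update(dim, key, column, new_value):
    """Overwrite the attribute; the old value is permanently lost."""
    dim[key][column] = new_value

scd_type1_update(dim_customer, 101, "city", "HYD")
print(dim_customer[101]["city"])  # HYD
```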
TYPE – 2 SCD:
1. The updated column value is stored as a new row, and the old value also remains in a different
row
2. A new surrogate key is generated for the updated data row
3. Reports and analytics on data from before and after a Type 2 SCD change will give accurate results
4. Analytics done on old and new data changes can be captured accurately
5. Historical analysis can be done on historical data
Cons:
1. Huge storage in dimension tables, as all historical versions of the data are retained
2. Additional column fields must be included in the dimension table to identify the old and new versions of the data.
3. Include natural keys from the dimension tables for better identification of rows which differ only in surrogate
keys while all other details remain the same
1. Include a new column called “flag_change” whose value is changed when any value in the
data row changes.
2. Include new columns like “start_date” and “end_date” to signify changes in the data rows.
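The two techniques above can be sketched together in a Type 2 update (names and the far-future sentinel date are assumptions): the current row is expired by setting its end date and flag, and a new row is appended with a fresh surrogate key.

```python
FAR_FUTURE = "9999-12-31 00:00:00"  # assumed sentinel meaning "still current"

dim_student = [
    {"surrogate_key": 1, "student_id": "20EE250", "place": "BLR",
     "start_date": "2020-05-10 10:00:00", "end_date": FAR_FUTURE, "flag_change": "Y"},
]

def scd_type2_update(dim, student_id, new_place, change_ts):
    """Expire the current row, then append a new row with a new surrogate key."""
    for row in dim:
        if row["student_id"] == student_id and row["flag_change"] == "Y":
            row["end_date"] = change_ts
            row["flag_change"] = "N"
    dim.append({
        "surrogate_key": max(r["surrogate_key"] for r in dim) + 1,
        "student_id": student_id, "place": new_place,
        "start_date": change_ts, "end_date": FAR_FUTURE, "flag_change": "Y",
    })

scd_type2_update(dim_student, "20EE250", "HYD", "2020-10-19 10:00:00")
print(dim_student)
```

Note how the natural key (`student_id`) ties the two version rows together while the surrogate keys differ.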
TYPE – 3 SCD:
1. Add a new column rather than a new row to reflect the changes made
2. Columns for the "old value" and "new value"
3. Supports switching back and forth between versions for effective reporting
Cons:
1. It is not suitable for a DWH where many columns change, like place, country, address, pincode
2. It is suitable only for use cases where changes are limited
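A minimal sketch of the Type 3 pattern described above (column names assumed): the change is captured by shifting the current value into a "previous" column on the same row, so only one level of history survives.

```python
# Type 3 SCD sketch: limited history kept in extra columns on the same row.
dim_customer = {
    101: {"name": "Asha", "current_city": "BLR", "previous_city": None},
}

def scd_type3_update(dim, key, new_city):
    """Shift the current value into the 'previous' column, then overwrite."""
    row = dim[key]
    row["previous_city"] = row["current_city"]
    row["current_city"] = new_city

scd_type3_update(dim_customer, 101, "HYD")
print(dim_customer[101])
```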
TYPE – 4 SCD:
1. Known as a fast-growing dimension, where storing the current and all the historical data in a single
dimension would make it inefficient to store and query
2. Dimension tables with updates are stored in 2 different tables comprising the current and the historical changes
Cons:
1. More storage and maintenance is required to store them as two separate tables
Historical data table:
Cons:
1. More complex to implement and stores a lot of redundant data.
Student_key | Student_id | Student_name | Student_current_place | Student_place | Flag_change | Effective_time | Expiry_time
783657278   | 22ME800    | Sumanth      | CHN                   | CHN           | Y           | 2022-05-30 10:00:00 | 3022-12-12 00:00:00
8297483920  | 20EE250    | Anjali       | DEL                   | BLR           | N           | 2020-05-10 10:00:00 | 2020-10-18 10:00:00
8297483920  | 20EE250    | Anjali       | DEL                   | HYD           | N           | 2020-10-19 10:00:00 | 2022-01-20 20:00:00
8297483920  | 20EE250    | Anjali       | DEL                   | BLR           | Y           | 2022-01-11 10:00:00 | 3022-12-12 00:00:00
Data Modelling:
6. Combine all the above into a data model to present your final design
Data Model:
Final Finish :
1. Present the entire design of the data pipeline you worked on, along with the architecture, components and data
flow.
2. Put forward the assumptions you considered in designing the data pipeline
3. Highlight any edge cases you considered and how you included them in your design
4. Ask for feedback and repeat the design life cycle to include new changes.
5. Have an open discussion on the design you have proposed and how it can be improved.
6. End with questions you have for the interviewer and their comments on your system design solution.
Design considerations for data pipelines:
5. Watermarking Tables:
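One way to read the "watermarking tables" consideration, sketched below under assumed names: the pipeline stores the highest event timestamp it has processed per table, and each incremental run loads only rows newer than that watermark, then advances it.

```python
# Hedged sketch of watermark-driven incremental loading; table, column and
# timestamp values are illustrative assumptions, not from the slides.
watermarks = {"email_engagement": "2022-07-01 00:00:00"}

source_rows = [
    {"event_ts": "2022-06-30 23:00:00", "user_id": "U1"},  # already loaded
    {"event_ts": "2022-07-01 08:15:00", "user_id": "U2"},  # new
    {"event_ts": "2022-07-01 09:30:00", "user_id": "U3"},  # new
]

def incremental_load(table, rows):
    """Return only rows newer than the watermark, then advance the watermark."""
    wm = watermarks[table]
    new_rows = [r for r in rows if r["event_ts"] > wm]
    if new_rows:
        watermarks[table] = max(r["event_ts"] for r in new_rows)
    return new_rows

loaded = incremental_load("email_engagement", source_rows)
print(len(loaded))  # 2
```

String comparison works here only because the timestamps are fixed-width `YYYY-MM-DD HH:MM:SS`; a real pipeline would typically use proper datetime or epoch columns.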
Sources:
Udemy:
https://www.udemy.com/course/mastering-data-modeling-fundamentals/
https://www.udemy.com/course/data-warehouse-fundamentals-for-beginners/
Books:
https://www.oreilly.com/library/view/designing-data-intensive-applications/9781491903063/
https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/books/data-warehouse-dw-toolkit/