
How to get started with data modelling?

Mentoring Sessions Recording:
Agenda:

1. Sample problem statement to build a data model

2. Get started with data modelling

3. Understand the concept of dimensional data modelling

4. Dive into dimensions and facts

5. Understand the slowly changing dimensions

6. Final data model view

7. Design aspects of a data pipeline


Problem Statement:

→ The digital marketing team at Udemy launches various campaigns to inform its users about new course launches, top-selling courses, courses matching user interests, newly added features, course price drops, etc.

→ They engage directly with the users through emails, newsletters, SMS, push notifications, etc. Here we will mainly focus on the EMAIL marketing campaigns launched.

→ The data generated from user engagement with the emails sent by Udemy is our main point of interest.

→ Some of these engagement events are email_sent, email_bounce, email_open, email_click and email_unsubscribe.

→ Third-party tools are used to capture the above metrics and send us the user engagement data on a daily basis.

→ The data received is made ready for analysts by making use of big data technologies.

Task: Build an end-to-end data pipeline to handle the above use case.
Have you ever been asked to design a data model?
Data Modelling:

→ A data model is a prototype that helps us understand the main system better.
→ A data model does not actually contain any data.
→ We consider entities, the relationships between entities, and the business logic when building data models.
→ The objective of data modelling is to store data in a way that makes it easy and fast to query and analyze.
→ We will consider dimensional modelling of data warehouses with respect to big data needs.
Data Modelling Fundamentals:

Data models consist of the below:

1. Data subjects → Commonly referred to as entities; very similar to "database tables"
2. Attributes of data subjects → Analogous to database columns
3. Relationships between DB tables → Describe how tables are related to each other
→ The kinds of relationships we can have in place are:
   Single level - one table related to only one other table
   Multi-level - one table related to many tables
   Hierarchy - one table divided into different sub-tables which are related to the main table

Data modelling lifecycle:

→ Conceptual modelling: An overview of what the model will be built upon. We decide on the tables and attributes to be included in the data model.
→ Logical modelling: Builds on the conceptual model by creating the relationships among entities and the key constraints to be included.
→ Physical modelling: The final data model that will be used in building the data warehouse, including granular details like indexes, partition columns, etc.

Conceptual Modelling → Logical Modelling → Physical Modelling


Transactional VS Analytical Data Modelling:

Transactional:
1. To design transactional systems like databases
2. Follows data normalization rules

Analytical:
1. To design analytical systems, especially DWH used for reporting purposes
2. Follows dimensional modelling
What is a data warehouse?
→ It is a data store where data from various sources is integrated into a common place, the history of the data is maintained, and everything is organized well, which helps in analytics and reporting.

Features:
1. Integration from different data sources - Data from various sources can come to a common place where it can be stored in an organized manner
2. Subject oriented - The data subject (kind of data) from a data source decides the way the data has to be stored
3. Time variant - Data in the DWH contains historical data, not just current data
4. Non volatile - Data doesn't change dynamically as in transactional systems
WHY DWH:
1. Make data-driven decisions - Based on past, present and future data, and try to find the unknown metrics used for analysis
2. One-stop shopping - A common place where we can find data from various transactional DBs and operational sources, all in one place
What is dimensional data modelling?
→ Dimensional data modelling is a technique used to store data in data warehouses in the form of facts and dimensions for a fast and efficient query retrieval process.

Data = context (dimension) + measurement (fact)

Components of dimensional data modelling:

1. Dimension: Provides a business context for a data item. Ex. product_id, product_sale_date, model_code
2. Fact: Contains data from the collection and aggregation of measurements. Ex. sale_amount, product_count
3. Attribute: The elements of a dimension/fact table
4. Dimension table: Table which stores one or more business attributes of the data
5. Fact table: Table which stores the quantitative measures of the data used for analysis
Process of dimensional data modelling:

1. Identify the business process:

→ Helps us understand the final outcomes of the data model being built and their usage throughout.
→ Understand the data sources we ingest and form facts and dimensions from them
→ Decide on the schema to be implemented

2. Identify the grain:
→ The grain signifies the level of sub-categorization to be considered while building the data model.
Ex. College → Degree → Department

3. Identify dimensions:
→ Dimensions form the base on which the data model is built, as they represent the business attributes. They consist of detailed information around the business case, like stock_code, customer_id, invoice_date, etc.

4. Identify facts:
→ Facts give us a detailed overview of the business aggregates or metrics used for data analysis.
Ex. In our case: amount_spent, review_score, review_count

5. Decide on the schema and on updates to data:

→ We mainly have two kinds of schemas to choose from, star and snowflake, along with the variations discussed below.
Design a dimensional data model for Udemy analytics
Database keys for DWH:

Primary Key VS Foreign Key

Natural Key VS Surrogate Key
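A minimal SQL sketch of how these four kinds of keys typically appear together (the table and column names here are illustrative assumptions, not from the slides):

-- Dimension: surrogate key as primary key, natural key carried from the source
CREATE TABLE dim_course (
    course_key  BIGINT PRIMARY KEY,       -- surrogate key: system-generated, no business meaning
    course_id   VARCHAR(20) NOT NULL,     -- natural key: identifier from the source system
    course_name VARCHAR(200)
);

-- Fact: references the dimension via a foreign key on the surrogate key
CREATE TABLE fact_enrollment (
    course_key      BIGINT REFERENCES dim_course (course_key),  -- foreign key
    enrollment_date DATE,
    amount_paid     DECIMAL(10, 2)
);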


Data modelling schemas:

1. Star Schema:
→ Consists of a single level of hierarchy of dimensions
→ Tables are in a denormalized state
→ SQL query performance is improved, as fewer joins are involved
→ Data redundancy is high
Data modelling schemas:

2. Snowflake schema:
→ It is an extension of the star schema where the dimensions are connected internally to one or more other dimensions
→ The tables are partially normalized
→ SQL query performance is affected due to the larger number of joins involved
→ Data redundancy is low
Data modelling schemas:

3. Galaxy schema:
→ Data schema with more than one fact table
→ Multiple fact tables share the common dimensions
→ Used to handle more complex fact table requirements
Data modelling schemas:

4. Star Cluster schema:

→ Combines features from both the star schema and the snowflake schema
→ A few of the dimensions can be normalized and broken down to a further granular level
→ Achieves a balance between query latency and data redundancy
Which schema do you prefer to use and why ?
Designing Dimensions:
1. While creating dimension tables, always prefix the table name with "dim". Ex. dim_customer_details

Star Schema:
1. Flat dimensions
2. Has one surrogate key and multiple natural keys

Snowflake Schema:
1. Hierarchical dimensions
2. Has one surrogate key and one natural key in each of the dimension tables
Designing facts:

1. Primary key: A combination of the surrogate keys from the dimension tables, or a separate surrogate key created for the fact table itself.
2. One or more facts can be stored in a fact table, based on business rules
3. Prefix fact table names with "fact", and you can have a surrogate key generated for each row in the fact table.
4. There are mainly 3 types of facts, namely:
   4.1 Additive facts
   4.2 Semi-additive facts
   4.3 Non-additive facts
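For our Udemy email use case, a hedged sketch of a fact table whose primary key combines the surrogate keys of its dimensions (all table and column names are assumptions for illustration):

CREATE TABLE fact_email_engagement (
    user_key       BIGINT NOT NULL,   -- surrogate key from dim_user
    campaign_key   BIGINT NOT NULL,   -- surrogate key from dim_campaign
    date_key       INT    NOT NULL,   -- surrogate key from dim_date (yyyymmdd)
    emails_sent    INT,               -- facts: aggregated engagement measures
    emails_opened  INT,
    emails_clicked INT,
    PRIMARY KEY (user_key, campaign_key, date_key)
);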
Types of facts:

1. Additive facts:
→ Facts that can be summed up across all the dimensions in the fact table.
Ex. Calculate the number of units sold in a retail store in the month of July 2022.
The number of units sold is the fact, which can be calculated using the orders, customers and calendar dimensions.
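As a sketch, an additive fact can be summed across any combination of its dimensions (fact_sales, dim_calendar and their columns are hypothetical names):

-- Units sold in July 2022: SUM is valid across all dimensions
SELECT SUM(f.units_sold) AS units_sold
FROM fact_sales f
INNER JOIN dim_calendar c ON c.date_key = f.date_key
WHERE c.year = 2022 AND c.month = 7;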
Types of facts:

2. Semi-additive facts:
→ Facts that can be summed up across some, but not all, of the dimensions in the fact table.
Ex. Calculate the total amount spent by a customer in the retail store till now.
Here we need dimensions like customer and product, but not necessarily the date and store dimensions.
Types of facts:

3. Non-additive facts:

→ Facts that cannot be summed up across any of the dimensions in the table.
Ex. Calculate the percentage of customers spread across different states in India.
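A sketch of the percentage example: a non-additive fact has to be recomputed from the underlying counts rather than summed (dim_customer and its columns are assumed names):

-- Percentage of customers per state; summing stored percentages would be meaningless
SELECT state,
       100.0 * COUNT(*) / SUM(COUNT(*)) OVER () AS pct_customers
FROM dim_customer
GROUP BY state;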
Can you give examples of the different fact types that could be used in our use case?
Types of fact tables:

1. Transaction fact tables:

→ Fact tables with granular row data, where each row denotes a single transaction that has taken place. This is the smallest grain, which can't be broken down further. Each transaction can capture a lot of context, which provides rich dimensional-level fields.
→ Generally the transactions are not updated frequently, so the data remains fairly static.
→ Ex. A customer performs a cash withdrawal at a nearby ATM. Details such as the customer, bank, account and card details can be captured.

2. Periodic snapshot tables:

→ Fact tables which contain cumulative performance measurements of the business at predefined snapshot periods like daily, weekly, monthly, etc.
→ These can be of more use for data analysis purposes.
→ The source for this fact table is the transaction facts, from which the date period can be chosen.
→ Ex. The average purchases made by a customer over the last 3 months
Types of fact tables:

3. Accumulating snapshots:
→ Fact tables that describe business activities which have a clear beginning and end.
→ These tables have a list of date or datetime columns to depict the milestones.
Ex. A user starts to use Zomato for the first time. Their count of ad clicks is tracked over weeks: week 1, week 2, week 3, week 4.

4. Factless fact tables:

→ Tables that have no facts stored in them. Such fact tables have no metric/measurement stored in them; the metrics are calculated from the dimensions or raw sources.

→ There are 2 kinds of such tables:

4.1 Factless facts tracking an event or activity

→ These fact tables store transaction-level information through the dimensions connected to them. From this, the required metrics can be calculated later, which can be useful business information.
Let's look at an example. The model below gives us details about an employee leave tracker.
How to calculate the total count of leaves taken by an employee?

SELECT d.employee_name AS name, COUNT(f.leave_type_id) AS leave_count
FROM fact_leave f
INNER JOIN dim_employee d ON d.employee_id = f.employee_id
WHERE f.employee_id = 'TT100'
GROUP BY d.employee_name
4.2 Factless facts describing a condition or eligibility
→ These fact tables can be used to calculate an eligibility criterion based on the fields in the fact table, though they don't store the criterion directly. For example, checking whether an employee qualifies by having taken fewer than 10 leaves:

SELECT d.employee_name AS name, COUNT(f.leave_type_id) AS leave_count
FROM fact_leave f
INNER JOIN dim_employee d ON d.employee_id = f.employee_id
WHERE f.employee_id = 'TT100'
GROUP BY d.employee_name
HAVING COUNT(f.leave_type_id) < 10
Slowly changing dimensions:

1. Dimensions that change slowly over time
2. Dimensions that store and manage both current and historical data over time in a DWH
3. Techniques to manage history within the data warehouse
4. The historical change of data over time becomes important in a DWH, as it is used for analytical purposes

DWH SCD TYPES:

TYPE 0 - Fixed dimension
TYPE 1 - Overwrite old data with no history retention
TYPE 2 - Maintain unlimited history, i.e. all history versions are available
TYPE 3 - Maintain limited history
TYPE 4 - Split current and historical data into separate tables
TYPE 6 - Hybrid type (combination of 1, 2 & 3)
TYPE – 0 SCD:

1. The data in these dimensions is fixed and never changes
2. Once the data is loaded into these tables, it remains fixed

Cons:
1. There is no history of data to be tracked
2. Data remains static and is not suited for analytics

Country      Timezone
India        UTC+5:30
Australia    UTC+8:00
Denmark      UTC+1:00
TYPE – 1 SCD:

1. The row value which needs to be changed is updated with the new value
2. The old value is permanently deleted
3. Mainly useful for correcting errors
4. The data is overwritten

Cons:
1. History won't be retained
2. Auditing of data won't be possible

Student_key   Student_id   Student_name   Student_college   Student_place
8394873920    22CS120      Ram Vihar      JNTUA             Ananthpur   (before)
8394873920    22CS120      Ram Vihar      JNTUA             Hyderabad   (after overwrite)
TYPE – 2 SCD:

1. The updated column value is stored as a new row, while the old value remains in the existing row
2. A new surrogate key is generated for the updated data row
3. Reports and analytics on data from before and after a Type 2 change will give accurate results
4. Analytics done across old and new data changes can be captured accurately
5. Historical analysis can be done on the historical data

Cons:
1. Huge storage in the dimension tables, as all history versions of the data are retained
2. Additional column fields need to be included to identify the old and new versions of the data
3. Natural keys need to be included for better identification of rows which differ only in their surrogate keys while all other details remain the same

Student_key   Student_id   Student_name   Student_college   Student_place
8394873920    22CS120      Ram Vihar      JNTUA             Bangalore
6489274890    22CS120      Ram Vihar      JNTUA             Hyderabad
WAYS TO HANDLE TYPE 2 - SCD:

1. Include a new column, e.g. "flag_change", whose value is changed whenever any value changes in the data row.
2. Include new columns like "start_date" and "end_date" (effective_time and expiry_time below) to signify changes in the data rows.

Student_key   Student_id   Student_name   Student_college   Student_place   Flag_change   Effective_time        Expiry_time
783657278     22ME800      Sumanth        JNTUA             CHN             Y             2022-05-30 10:00:00   3022-12-12 00:00:00
8297483920    20EE250      Ram Vihar      JNTUA             BLR             N             2020-05-10 10:00:00   2020-10-18 10:00:00
8297483920    20EE250      Ram Vihar      JNTUA             HYD             N             2020-10-18 10:00:00   2022-01-20 20:00:00
8297483920    20EE250      Ram Vihar      JNTUA             BLR             Y             2022-01-11 10:00:00   3022-12-12 00:00:00
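A minimal SQL sketch of the flow behind the table above: expire the current version, then insert the new version (the new surrogate key would normally come from a sequence; the far-future timestamp mirrors the sentinel used in the table):

-- Step 1: close out the current version of the row
UPDATE dim_student
SET expiry_time = CURRENT_TIMESTAMP,
    flag_change = 'N'
WHERE student_id = '20EE250'
  AND expiry_time = TIMESTAMP '3022-12-12 00:00:00';

-- Step 2: insert the changed data as the new current row
INSERT INTO dim_student
    (student_key, student_id, student_name, student_college, student_place,
     flag_change, effective_time, expiry_time)
VALUES
    (6489274891, '20EE250', 'Ram Vihar', 'JNTUA', 'HYD',
     'Y', CURRENT_TIMESTAMP, TIMESTAMP '3022-12-12 00:00:00');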
TYPE – 3 SCD:

1. Add a new column, rather than a new row, to reflect the changes made
2. One column for the "old value" and one for the "new value"
3. Supports switching back and forth between values for effective reporting

Cons:
1. It is not suitable for a DWH where many columns change, like place, country, address, pincode
2. It is suitable only for use cases where the changes are limited

Student_key   Student_id   Student_name   Old_place   New_place
8394873920    22CS120      Ram Vihar      ATP         HYD
TYPE – 4 SCD:

1. Used for fast-growing dimensions, where storing the current and all the historical data in a single dimension would make it inefficient to store and query
2. The dimension data is stored in 2 different tables, comprising the current and the historical changes

Cons:
1. More storage and maintenance is required to store them as two separate tables

Historical data table:

Student_key   Student_id   Student_name   Student_college   Student_place
8394873920    22CS120      Ram Vihar      JNTUA             Ananthpur
8394873920    22ME783      Anjali M       JNTUA             Hyderabad

Current data table:

Student_key   Student_id   Student_name   Student_college   Student_place
8394873925    22CS120      Ram Vihar      JNTUA             Chennai
8394873929    22CS120      Ram Vihar      JNTUA             Hyderabad
TYPE – 6 SCD:

1. It is a combination of types 1, 2 & 3 and is known as the hybrid type.
2. Stores the current value in a "current" column on all the historical rows.

Cons:
1. More complex to implement and stores a lot of redundant data.

Student_key   Student_id   Student_name   Student_current_place   Student_place   Flag_change   Effective_time        Expiry_time
783657278     22ME800      Sumanth        CHN                     CHN             Y             2022-05-30 10:00:00   3022-12-12 00:00:00
8297483920    20EE250      Anjali         DEL                     BLR             N             2020-05-10 10:00:00   2020-10-18 10:00:00
8297483920    20EE250      Anjali         DEL                     HYD             N             2020-10-19 10:00:00   2022-01-20 20:00:00
8297483920    20EE250      Anjali         DEL                     BLR             Y             2022-01-11 10:00:00   3022-12-12 00:00:00
Data Modelling:

Tips to keep in mind:

1. Decide on the dimension and fact tables to be built

2. Explain the tables in the DWH

3. Talk about the main fields in the dims & facts

4. Choose the variants of facts or fact tables to be considered

5. Choose the schema to be implemented

6. Combine all the above into a data model to present your final design
Data Model:
Final Finish:

1. Present the entire design of the data pipeline worked on, along with the architecture, components and data flow.

2. Put forward the assumptions you have made in designing the data pipeline

3. Highlight any edge cases you have considered and how you included them in your design

4. Ask for feedback and repeat the design life cycle to include new changes.

5. Have an open discussion on the design you have proposed and how it can be improved.

6. End with the questions you have for the interviewer and their comments on your system design solution.
Design considerations for data pipelines:

1. Check for idempotence of pipelines:

→ Idempotence: A given data pipeline, for the same set of inputs, should give the same output when run multiple times.

1.1 When are pipelines rerun?

→ To backfill data, to handle pipeline failures, and to test pipelines

1.2 Why is it important?

→ When a data pipeline is rerun multiple times, there are chances of adding duplicate data or keeping stale data, which produces bad data and leads to wrong results.

1.3 How to keep pipelines idempotent? (see the sketch below)

→ Perform a complete refresh of the data so that the entire data gets re-written.
→ Include duplicate checks so that only distinct data items remain in the tables.
→ Make sure the dependent tables are in an updated state.
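One common way to keep a daily load idempotent is to delete and reload only the slice being processed, so a rerun rewrites instead of duplicating. A sketch reusing the illustrative fact table from earlier; stg_email_events is an assumed staging table:

-- Rerunning this block for the same run date always leaves the table in the same state
DELETE FROM fact_email_engagement
WHERE date_key = 20220701;

INSERT INTO fact_email_engagement
    (user_key, campaign_key, date_key, emails_sent, emails_opened, emails_clicked)
SELECT user_key, campaign_key, 20220701 AS date_key,
       SUM(CASE WHEN event_type = 'email_sent'  THEN 1 ELSE 0 END),
       SUM(CASE WHEN event_type = 'email_open'  THEN 1 ELSE 0 END),
       SUM(CASE WHEN event_type = 'email_click' THEN 1 ELSE 0 END)
FROM stg_email_events
WHERE event_date = DATE '2022-07-01'
GROUP BY user_key, campaign_key;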
Design considerations for data pipelines:

2. Monitoring and Alerting:

2.1 Why is it important?

→ Pipelines that are not monitored or maintained well lead to pipeline failures, broken cross-team dependent runs, and stale data dumps going unnoticed. The pipeline lifecycle wouldn't be complete without strong maintenance in place.

2.2 How can it be done?

→ Scheduling tools like Airflow and AutoSys enable us to monitor the job runs and raise alerts for unusual behavior like failures, long-running jobs, or pipelines hanging. Services like CloudWatch and Datadog can be enabled to monitor the pipeline runs and alert on them.

2.3 Other ways of alerting?

→ We can integrate the above services with communication channels like Outlook, Slack and Teams, which can send real-time alerts so immediate action can be taken.
Design considerations for data pipelines:

3. Data quality checks:

3.1 Why is it important?

→ Data quality checks make sure that the data is up to date and qualified for analytics.

3.2 How to get it done?

→ We can write manual test cases to perform data quality checks like duplicates, value ranges and schema types; a couple of minimal checks are sketched below.
We can also make use of frameworks designed for this purpose, like dbt and Great Expectations.

3.3 How to handle failures?

→ We can write a separate pipeline to handle all the quality checks for a set of pipelines, which can be monitored and maintained alongside them.
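Two of these checks as plain SQL, assuming the illustrative tables from earlier; the pipeline is failed if either query returns rows (frameworks like dbt or Great Expectations wrap the same idea):

-- Duplicate check: the fact grain should be unique
SELECT user_key, campaign_key, date_key, COUNT(*) AS dup_count
FROM fact_email_engagement
GROUP BY user_key, campaign_key, date_key
HAVING COUNT(*) > 1;

-- Value-range check: engagement counts can never be negative
SELECT *
FROM fact_email_engagement
WHERE emails_sent < 0 OR emails_opened < 0 OR emails_clicked < 0;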
Design considerations for data pipelines:

4. Dynamic resource allocation:

4.1 Why is it important?

→ Pipeline resources don't get wasted when not in use, which saves costs for the company
→ The pipeline doesn't fail due to a lack of resources.
→ Set Spark properties such as spark.sql.files.maxPartitionBytes to a lower value, or increase the number of partitions.

4.2 When can it be used?

→ During times of high volatility in the data being received.
→ The data can be either less than expected or more than expected.

4.3 How to get it done?

→ Cloud services like AWS, Azure and GCP provide the ability to dynamically scale the resources allocated to the cluster up or down based on the amount of data being dealt with.
→ It can be enabled in Spark using spark.dynamicAllocation.enabled, which lets Spark decide on the min & max number of executors to be brought up based on the workload.
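As a sketch, the properties typically set together when enabling dynamic allocation on a Spark job (the executor bounds and the job file name are example values, not recommendations):

spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=50 \
  --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
  email_engagement_pipeline.py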
Design considerations for data pipelines:

5. Watermarking Tables:

5.1 What is it all about?

→ In order to load only the incremental data into a table, and to keep track of the pipeline runs, we introduce the concept of watermarking.

5.2 Why is it important?

→ To prevent overwriting huge amounts of data daily
→ To prevent reading from large tables daily
→ To optimize the resources used to run the pipelines

5.3 How can it be implemented?

→ Filter on date columns to select the delta data to be processed by the pipelines
→ Create a watermark table to store the from_date and to_date (see the sketch below)
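A minimal sketch of the watermark pattern, reusing the illustrative tables from earlier (the watermark table layout and the date_key arithmetic are assumptions):

-- One row per pipeline, tracking the window last processed
CREATE TABLE watermark (
    pipeline_name VARCHAR(100) PRIMARY KEY,
    from_date     DATE,
    to_date       DATE
);

-- Load only the delta since the last successful run
INSERT INTO fact_email_engagement
    (user_key, campaign_key, date_key, emails_sent, emails_opened, emails_clicked)
SELECT user_key, campaign_key,
       CAST(EXTRACT(YEAR FROM event_date) * 10000
            + EXTRACT(MONTH FROM event_date) * 100
            + EXTRACT(DAY FROM event_date) AS INT) AS date_key,
       SUM(CASE WHEN event_type = 'email_sent'  THEN 1 ELSE 0 END),
       SUM(CASE WHEN event_type = 'email_open'  THEN 1 ELSE 0 END),
       SUM(CASE WHEN event_type = 'email_click' THEN 1 ELSE 0 END)
FROM stg_email_events
WHERE event_date > (SELECT to_date FROM watermark
                    WHERE pipeline_name = 'email_engagement')
GROUP BY user_key, campaign_key, event_date;

-- Advance the watermark only after the load succeeds
UPDATE watermark
SET from_date = to_date,
    to_date   = CURRENT_DATE
WHERE pipeline_name = 'email_engagement';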
Sources:

Data Engineer Specific:


https://www.striim.com/blog/guide-to-data-pipelines/
https://medium.com/the-data-experience/building-a-data-pipeline-from-scratch-32b712cfb1db
https://towardsdatascience.com/big-data-modeling-25c64d456308
https://www.startdataengineering.com/post/data-engineering-project-for-beginners-batch-edition/#6-design-considerations

Generic to system design:


https://www.geeksforgeeks.org/top-10-system-design-interview-questions-and-answers/
https://www.freecodecamp.org/news/systems-design-for-interviews/
https://blog.tryexponent.com/how-to-nail-the-system-design-interview/

Udemy:
https://www.udemy.com/course/mastering-data-modeling-fundamentals/
https://www.udemy.com/course/data-warehouse-fundamentals-for-beginners/

Books:
https://www.oreilly.com/library/view/designing-data-intensive-applications/9781491903063/
https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/books/data-warehouse-dw-toolkit/
Sources:

Data Warehousing concepts:


https://www.tutorialspoint.com/dwh/dwh_data_warehousing.htm
https://www.javatpoint.com/data-warehouse
https://www.guru99.com/data-warehouse-architecture.html
https://www.analyticsvidhya.com/blog/2021/07/a-brief-introduction-to-data-warehouse/
https://www.1keydata.com/datawarehousing/concepts.html

Data Modelling concepts:


https://www.guru99.com/data-modelling-conceptual-logical.html
https://www.ibm.com/cloud/learn/data-modeling
