Data Visualization - Data Model (Power Pivot)
Source: Analyzing Data with Microsoft Power BI and Power Pivot for Excel

Intro - Data - Modeling
Learning objectives
• Explain what a data model is and why relationships are an important part of a data model.
• Introduction to the concepts of normalization, denormalization, and star schemas.
17-02-2023 Data Vizualization using Power BI. (c) Dr. Achint Nigam

Introducing the data model
• Both the Sales table and the Product table have a ProductKey column.
• In Product, this is a primary key, meaning it has a different value in each row and can be used to uniquely identify a product.
• The primary key is nothing special.
• From a technical point of view, it is just the column that you consider as the one that uniquely identifies a row.
• When you have a unique identifier in a table, and a column in another table that references it, you can create a relationship between the two tables.
• If you have a model where the desired key for the relationship is not a unique identifier in one of the two tables, you must massage the model with one of the many techniques you learn in this book.

• The Sales table is called the source table.
• The Product table is known as the target of the relationship.
• The source table is also called the many side of the relationship.
• The target table is known as the one side of the relationship.
• We will use one side and many side terminology.
• The ProductKey column exists in both the Sales table (foreign key) and the Product table (primary key).
• A foreign key is a column that points to a primary key in another table.

• In the diagram view, a relationship is drawn identifying the one and the many side with a number (one) and an asterisk (many).
• Note that there is also an arrow in the middle, but it does not represent the direction of the relationship.
• Rather, it is the direction of filter propagation.
• When the relationship is in place, you can sum the values from the Sales table, slicing them by columns in the Product table.
• In Excel, insert a pivot table; put Color from the Product table in Rows and Quantity from the Sales table in Values.
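
The one-to-many relationship between Product and Sales described above can be imitated outside Excel to see what the pivot table is doing. The following is only a rough sketch in plain Python, not how Power Pivot works internally. The sample rows and the ProductName column are invented for illustration; only ProductKey, Color, and Quantity come from the slides.

```python
from collections import defaultdict

# One side: Product, keyed by ProductKey (primary key, unique per row).
# The rows below are made-up sample data.
product = {
    1: {"ProductName": "Bike", "Color": "Red"},
    2: {"ProductName": "Helmet", "Color": "Blue"},
    3: {"ProductName": "Gloves", "Color": "Red"},
}

# Many side: Sales, where ProductKey is a foreign key and may repeat.
sales = [
    {"ProductKey": 1, "Quantity": 2},
    {"ProductKey": 2, "Quantity": 1},
    {"ProductKey": 1, "Quantity": 3},
    {"ProductKey": 3, "Quantity": 4},
]


def quantity_by_color(sales_rows, product_table):
    """Sum Sales.Quantity sliced by Product.Color, following ProductKey."""
    totals = defaultdict(int)
    for row in sales_rows:
        # Follow the relationship from the many side to the one side.
        color = product_table[row["ProductKey"]]["Color"]
        totals[color] += row["Quantity"]
    return dict(totals)


print(quantity_by_color(sales, product))  # {'Red': 9, 'Blue': 1}
```

Each Sales row finds exactly one Product row because ProductKey is unique in Product. That is why the slicing is unambiguous.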
F01 xx 10.xlsx

Granularity in multiple tables
• Once you store the product key in the Sales table, you rely on the relationship to retrieve all the attributes of the product, including the product category, the color, and all the other product information.
• Thus, the problem of granularity becomes much less of an issue.
• If you look carefully at the Product table, you will notice that the product category and subcategory are missing.
• Instead, there is a ProductSubcategoryKey column, which is a reference (that is, a foreign key) to the key in another table (where it is a primary key) that contains the product subcategories.
• In fact, in the database, there are two tables containing a product category and product subcategory.

Figure: Chain of relationships, starting from Product, reaching Product Subcategory, and finally Product Category.

The reason for this design technique?
• By storing the product category in a separate table, you have a data model where the category name, although referenced from many products, is stored in a single row of the Product Category table.
• This is a good method of storing information for two reasons.
• First, it reduces the size on disk of the model by avoiding repetitions of the same name.
• Second, if at some point you must update the category name, you only need to do it once on the single row that stores it.
• All the products will automatically use the new name through the relationship.
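
To make the chain of relationships concrete, here is a small sketch in plain Python that follows Product to Product Subcategory to Product Category and then renames a category in one place. The sample rows, the ProductCategoryKey column, and the name columns are assumptions made for illustration; only ProductSubcategoryKey is named in the slides.

```python
# Made-up sample data. ProductCategoryKey and the *Name columns are assumed.
product_category = {10: {"CategoryName": "Accessories"}}
product_subcategory = {
    100: {"SubcategoryName": "Helmets", "ProductCategoryKey": 10},
    101: {"SubcategoryName": "Gloves", "ProductCategoryKey": 10},
}
product = {
    1: {"ProductName": "Road Helmet", "ProductSubcategoryKey": 100},
    2: {"ProductName": "Kids Helmet", "ProductSubcategoryKey": 100},
    3: {"ProductName": "Winter Gloves", "ProductSubcategoryKey": 101},
}


def category_of(product_key):
    """Follow Product -> Product Subcategory -> Product Category."""
    sub_key = product[product_key]["ProductSubcategoryKey"]
    cat_key = product_subcategory[sub_key]["ProductCategoryKey"]
    return product_category[cat_key]["CategoryName"]


before = [category_of(key) for key in product]

# Renaming the category touches exactly one row in Product Category.
product_category[10]["CategoryName"] = "Gear"

# Every product picks up the new name through the relationships.
after = [category_of(key) for key in product]

print(before)  # ['Accessories', 'Accessories', 'Accessories']
print(after)   # ['Gear', 'Gear', 'Gear']
```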

Design technique
• There is a name for this design technique: normalization.
• An attribute such as the product category is said to be normalized when it is stored in a separate table and replaced with a key that points to that table.
• The opposite technique (that is, storing attributes in the table to which they belong) is called denormalization.
• When a model is denormalized, the same attribute appears multiple times, and if you need to update it, you will have to update all the rows containing it.

Online transactional processing (OLTP) systems
• Highly normalized structures are typical of online transactional processing (OLTP) systems.
• OLTP systems are databases that are designed to handle your everyday jobs. That includes operations like preparing invoices, placing orders, shipping goods, and solving claims.
• These databases are very normalized because they are designed to use the least amount of space (which typically means they run faster) with a lot of insert and update operations.
• In fact, during the everyday work of a company, you typically update information (for example, about a customer) and want it to be automatically updated on all the data that reference this customer.
• This happens in a smooth way if the customer information is correctly normalized.
• If the customer information were denormalized, updating the address of a customer would result in hundreds of update statements executed by the server, causing poor performance.

Design technique - denormalize
• When building a data model to do reporting, you must reach a reasonable level of denormalization no matter how the original data is stored.
• If you denormalize too much, you face the problem of granularity.
• Intuitively, you denormalize up to the point where a table is a self-contained structure that completely describes the entity it stores.
• If the model is designed the right way, with the right level of denormalization, then granularity comes out in a very natural way.
• On the other hand, if the model is over-denormalized, then you must worry about granularity, and you start facing issues.
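
The trade-off described above can be sketched in a few lines of plain Python. The example flattens a normalized Product, Product Subcategory, and Product Category chain into one self-contained product table, then counts how many rows a category rename has to touch. All sample rows, the ProductCategoryKey column, and the helper names `denormalize_products` and `rename_category` are hypothetical. This illustrates the idea and is not a recipe taken from the source.

```python
# Made-up normalized sample data (column names partly assumed).
product_category = {10: {"CategoryName": "Accessories"}}
product_subcategory = {
    100: {"SubcategoryName": "Helmets", "ProductCategoryKey": 10},
    101: {"SubcategoryName": "Gloves", "ProductCategoryKey": 10},
}
product = {
    1: {"ProductName": "Road Helmet", "ProductSubcategoryKey": 100},
    2: {"ProductName": "Kids Helmet", "ProductSubcategoryKey": 100},
    3: {"ProductName": "Winter Gloves", "ProductSubcategoryKey": 101},
}


def denormalize_products(products, subcategories, categories):
    """Copy subcategory and category names into each product row."""
    flat = []
    for key, row in products.items():
        sub = subcategories[row["ProductSubcategoryKey"]]
        cat = categories[sub["ProductCategoryKey"]]
        flat.append({
            "ProductKey": key,
            "ProductName": row["ProductName"],
            "Subcategory": sub["SubcategoryName"],
            "Category": cat["CategoryName"],
        })
    return flat


def rename_category(flat_rows, old_name, new_name):
    """Rename a category in the flat table; return the number of rows touched."""
    touched = 0
    for row in flat_rows:
        if row["Category"] == old_name:
            row["Category"] = new_name
            touched += 1
    return touched


flat_product = denormalize_products(product, product_subcategory, product_category)
rows_touched = rename_category(flat_product, "Accessories", "Gear")
print(rows_touched)  # 3 rows updated, versus 1 row in the normalized model
```

The flat table is convenient for reporting because each row fully describes a product. The cost is that the same category name is repeated, so an update must visit every row that carries it.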
Deactivated
store tens, if not hundreds of millions, of rows.
• Fact tables are related to dimensions, but dimensions should not have relationships among them.

• System might follow the relationship between Geography and Store
• Both

• Neither Excel nor Power BI lets you build such a model.

• How would you resolve ambiguity in this scenario?
• The answer is very simple. You must denormalize the relevant columns of the Geography table, both in Store and in Customer, removing the Geography table from the model.
• You could include the ContinentName columns in both Store and in Customer

Snowflake
• A snowflake is a variation of a star schema where a dimension is not linked directly to the fact table.
• Rather, it is linked through another dimension.
F01 xx 15.xlsx

• The difference between this example and the previous one is that this relationship is the only one between Product Subcategory and the other dimensions linked to the fact table or to Product.
• Thus, you can think of Product Subcategory as a dimension that groups different products together, but it does not group together any other dimension or fact.
• The same, obviously, is true for Product Category.
• Thus, even if snowflakes violate the aforementioned rule, they do not introduce any kind of ambiguity, and a data model with snowflakes is absolutely fine.
• Still, whenever you work with a data model, representing it with a star schema is the right thing to do.
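
The Geography fix mentioned earlier (copying ContinentName into both Store and Customer and dropping the shared Geography table) could look roughly like this in plain Python. The key columns GeographyKey, StoreKey, and CustomerKey, the sample rows, and the helper `absorb_geography` are assumptions for illustration; the slides name only the tables and the ContinentName column.

```python
# Made-up sample data; key column names are assumed.
geography = {
    1: {"ContinentName": "Europe"},
    2: {"ContinentName": "Asia"},
}
store = [
    {"StoreKey": 1, "GeographyKey": 1},
    {"StoreKey": 2, "GeographyKey": 2},
]
customer = [
    {"CustomerKey": 1, "GeographyKey": 2},
    {"CustomerKey": 2, "GeographyKey": 1},
]


def absorb_geography(rows, geography_table):
    """Copy ContinentName into each row and drop the GeographyKey reference."""
    result = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != "GeographyKey"}
        continent = geography_table[row["GeographyKey"]]["ContinentName"]
        new_row["ContinentName"] = continent
        result.append(new_row)
    return result


# Store and Customer each get their own ContinentName column, so the
# Geography table (and the ambiguous path through it) is no longer needed.
store_flat = absorb_geography(store, geography)
customer_flat = absorb_geography(customer, geography)

print(store_flat)
print(customer_flat)
```

After this step a filter on the store's continent and a filter on the customer's continent are two separate columns, so there is no shared dimension through which a filter could travel in two directions.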