
Intro - Data - Modeling

Uploaded by

Abhyudya Singh
Copyright
© © All Rights Reserved

17-02-2023

Data Visualization - Data Model (Power Pivot)

Dr. Achint Nigam
ASTP- DoM
BITS Pilani

Learning objectives

• Explain what a data model is and why relationships are an important part of a data model.
• Introduction to the concepts of normalization, denormalization, and star schemas.

Introduction to data modeling

Source: Analyzing Data with Microsoft Power BI and Power Pivot for Excel

17-02-2023 Data Visualization using Power BI. (c) Dr. Achint Nigam

• Power Pivot for Excel and Power BI share a lot of features, namely the VertiPaq database engine and the DAX language, inherited from SQL Server Analysis Services.
• Power Pivot is a COM add-in in Excel.

Why should you learn anything about data modeling?

• You must learn data modeling if you want to improve your analytical capabilities and if you prefer to focus on making the right decision rather than on finding the right complex DAX formula.
• Shape your brain in such a way that you see the model in your mind when you are thinking of the scenario.
• Data modeling is complex, challenging, and mind-stretching.
• Being a good data modeler basically means being able to match your specific model with one of the many different patterns that have already been studied and solved by others.

Running example

• Contoso database.
• Contoso is a fictitious company that sells electronics all over the world, through different sales channels.
• The table containing the sales is made of 12,000,000 rows.


Working with a single table

• An Excel table cannot exceed about 1,000,000 rows (1,048,576 per worksheet), and Contoso has 12 million rows.
• You can choose not to load the sales of each product.
• You can group data by category and subcategory, significantly reducing the number of rows.
• 12 million to 63,948 rows. (F01 xx 01.xlsx)
• Doing this, you implicitly limit your analytical power.
  • For example, if you want to perform an analysis slicing by color, then the table is no longer a good source because you don't have the product color among the columns.

Working with a single table

• Adding one column to the query is not a big issue.
• The problem is that the more columns you add, the larger the table becomes, not only in width (the number of columns) but also in length (the number of rows).
• If you do not want to decide in advance which columns you will use to slice the data, you will end up having to load the full 12,000,000 rows, meaning an Excel table is no longer an option.
• This is where Power Pivot (Data Model) comes into play.
• Using Power Pivot, you no longer face the limitation of 1,000,000 rows. Indeed, there is virtually no limit on the number of rows you can load in a Power Pivot table. (F01 xx 03.xlsx)

Granularity

• Size matters because it relates to granularity.
• You can think of granularity as the level of detail in your tables.
• The higher the granularity, the more detailed your information.
• Having more details means being able to perform more detailed (granular) analyses.
• Your ability to slice and dice depends on the number of columns in the table and thus on its granularity.
• In a grouped table, increasing the number of columns increases the number of rows.
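The link between grouping columns and row count can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the six sales rows, their column order, and the helper `group_rows` are hypothetical and are not taken from the Contoso files.

```python
from collections import Counter

# Hypothetical sample of individual sales:
# (category, subcategory, color, quantity). Not real Contoso data.
sales = [
    ("Audio", "Headphones", "Black", 2),
    ("Audio", "Headphones", "Red", 1),
    ("Audio", "Speakers", "Black", 3),
    ("Audio", "Headphones", "Black", 4),
    ("TV", "LCD", "Black", 1),
    ("TV", "LCD", "Silver", 2),
]


def group_rows(rows, key_columns):
    """Sum quantity for each distinct combination of the chosen columns."""
    totals = Counter()
    for row in rows:
        key = tuple(row[i] for i in key_columns)
        totals[key] += row[3]  # position 3 holds the quantity
    return dict(totals)


# Grouping by category and subcategory only.
by_subcategory = group_rows(sales, (0, 1))
# Adding color as a third grouping column.
by_subcategory_color = group_rows(sales, (0, 1, 2))

print(len(sales), len(by_subcategory), len(by_subcategory_color))
```

The grouped table without color has fewer rows but can no longer be sliced by color; adding the color column restores that ability at the price of more rows.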


Granularity

• If you have data at the wrong granularity, formulas are nearly impossible to write.
• It is not correct to say that a higher granularity is always a good option.
• You must have data at the right granularity, where right means the best level of granularity to meet your needs.
• Refer to F01 xx 04.xlsx.
• On every row of the Sales table, there is an additional column reporting the yearly income of the customer who bought that product.
• AverageYearlyIncome := AVERAGE (Sales[YearlyIncome])

Granularity

• Unfortunately, the computed number is incorrect: it is highly exaggerated, because each customer's income is counted once per sale rather than once per customer.
• You must fix the granularity at the customer level by either reloading the table or relying on a more complex DAX formula.
• You cannot use a value that has a meaning at the customer level with the same meaning at the individual sale level.
• If your model has a single table, then you must choose the granularity of the table, taking into account all the possible measures and analyses that you might want to perform.
• No matter how hard you work, the granularity will never be perfect for all your measures.

Introducing the data model

• What is a data model? A data model is just a set of tables linked by relationships (a single-table model is already a data model).
• F01 xx 07.xlsx
• Power Pivot tab > Manage > Home tab of the Power Pivot window > Diagram View in the View group.
• Two disconnected tables are not yet a true data model.
• For a more meaningful model, create a relationship between the two tables.
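The yearly-income problem from the Granularity slides can be reproduced with a tiny Python sketch. The customers and incomes below are hypothetical; the point is only that averaging a customer-level value over sale-level rows weights each customer by the number of purchases.

```python
from statistics import mean

# Hypothetical Sales rows: (customer, yearly_income).
# The income is repeated on every sale made by that customer.
sales = [
    ("Alice", 100_000),
    ("Alice", 100_000),
    ("Alice", 100_000),
    ("Bob", 40_000),
]

# Sale-level average: the Python analogue of AVERAGE(Sales[YearlyIncome]).
average_per_sale = mean(income for _, income in sales)

# Customer-level average: keep one income per customer, then average.
income_by_customer = dict(sales)
average_per_customer = mean(income_by_customer.values())

print(average_per_sale, average_per_customer)
```

Alice buys three times, so her income dominates the sale-level average. Collapsing to one row per customer first gives the figure that the measure name actually promises.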


Introducing the data model

• Both the Sales table and the Product table have a ProductKey column.
• In Product, this is a primary key, meaning it has a different value in each row and can be used to uniquely identify a product.
• The primary key is nothing special.
  • From a technical point of view, it is just the column that you consider as the one that uniquely identifies a row.
• When you have a unique identifier in a table, and a column in another table that references it, you can create a relationship between the two tables.
• If you have a model where the desired key for the relationship is not a unique identifier in one of the two tables, you must massage the model with one of the many techniques you learn in this book.

Introducing the data model

• The Sales table is called the source table.
• The Product table is known as the target of the relationship.
• The source table is also called the many side of the relationship.
• The target table is known as the one side of the relationship.
• We will use the one side and many side terminology.
• The ProductKey column exists in both the Sales table (foreign key) and the Product table (primary key).
• A foreign key is a column that points to a primary key in another table.

Introducing the data model

• In the diagram view, a relationship is drawn identifying the one and the many side with a number (one) and an asterisk (many).
• Note that there is also an arrow in the middle, but it does not represent the direction of the relationship.
• Rather, it is the direction of filter propagation.
• When the relationship is in place, you can sum the values from the Sales table, slicing them by columns in the Product table.
• In Excel, insert a pivot table, put Color from the Product table in Rows, and put Quantity from the Sales table in Values.
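What the pivot table does through the relationship can be sketched in plain Python: each Sales row follows its ProductKey to the one matching Product row, and the quantity is summed by that product's color. The rows below are hypothetical, and `quantity_by_color` is an illustrative helper, not a Power Pivot function.

```python
from collections import defaultdict

# Hypothetical one side: Product, keyed by ProductKey (the primary key).
product = {
    1: {"Name": "Headphones A", "Color": "Black"},
    2: {"Name": "Headphones B", "Color": "Red"},
    3: {"Name": "Speaker C", "Color": "Black"},
}

# Hypothetical many side: Sales rows carrying ProductKey as a foreign key.
sales = [
    {"ProductKey": 1, "Quantity": 2},
    {"ProductKey": 2, "Quantity": 1},
    {"ProductKey": 1, "Quantity": 4},
    {"ProductKey": 3, "Quantity": 3},
]


def quantity_by_color(sales_rows, product_table):
    """Sum Sales[Quantity], sliced by Product[Color] via ProductKey."""
    totals = defaultdict(int)
    for row in sales_rows:
        # Follow the relationship from the many side to the one side.
        color = product_table[row["ProductKey"]]["Color"]
        totals[color] += row["Quantity"]
    return dict(totals)


result = quantity_by_color(sales, product)
print(result)
```

Using a dictionary keyed by ProductKey mirrors the requirement that the one side has a unique identifier: a duplicate key could not exist in `product`.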


Granularity in multiple tables

• What about granularity in the new data model, which now contains two tables (Product and Sales)?
• Now you have two different granularities.
• Sales has a granularity at the individual sale level, whereas Product has a granularity at the product level.
• Granularity is a concept that is applied to a table, not to the model as a whole.
• You must adjust the granularity level for each table in the model.

Granularity in multiple tables

• In F01 xx 01.xlsx, you had a single table containing sales at the granularity of the product category and subcategory.
• This was because the product category and product subcategory were stored in the Sales table.
• In other words, you had to make a decision about granularity, mainly because you stored information in the wrong place.
• Once each piece of information finds its right place, granularity becomes much less of a problem.
• In fact, the product category is an attribute of a product, not of an individual sale.


Granularity in multiple tables

• Once you store the product key in the Sales table, you rely on the relationship to retrieve all the attributes of the product, including the product category, the color, and all the other product information.
• Thus, the problem of granularity becomes much less of an issue.
• If you look carefully at the Product table, you will notice that the product category and subcategory are missing.
• Instead, there is a ProductSubcategoryKey column. It is a reference (that is, a foreign key) to the key in another table (where it is a primary key) that contains the product subcategories.
• In fact, in the database, there are two separate tables, one containing the product categories and one containing the product subcategories.

F01 xx 10.xlsx

Chain of relationships, starting from Product, reaching Product Subcategory, and finally Product Category.

The reason for this design technique?

• By storing the product category in a separate table, you have a data model where the category name, although referenced from many products, is stored in a single row of the Product Category table.
• This is a good method of storing information for two reasons.
  • First, it reduces the size on disk of the model by avoiding repetitions of the same name.
  • Second, if at some point you must update the category name, you only need to do it once on the single row that stores it.
• All the products will automatically use the new name through the relationship.
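The chain Product, Product Subcategory, Product Category can be sketched as three small lookup tables. The rows are hypothetical, and the column name `ProductCategoryKey` is an assumption (the slides name only `ProductSubcategoryKey`).

```python
# Hypothetical normalized tables; each dict is keyed by its primary key.
category = {1: {"Category": "Audio"}}
subcategory = {
    # ProductCategoryKey is an assumed foreign-key column name.
    10: {"Subcategory": "Headphones", "ProductCategoryKey": 1},
}
product = {
    100: {"Name": "Headphones A", "ProductSubcategoryKey": 10},
    101: {"Name": "Headphones B", "ProductSubcategoryKey": 10},
}


def category_of(product_key):
    """Walk Product -> Product Subcategory -> Product Category."""
    sub = subcategory[product[product_key]["ProductSubcategoryKey"]]
    return category[sub["ProductCategoryKey"]]["Category"]


before = [category_of(key) for key in product]

# Rename the category once, on the single row that stores it.
category[1]["Category"] = "Audio & Hi-Fi"

after = [category_of(key) for key in product]
print(before, after)
```

One assignment on the category row is enough: every product picks up the new name because it only stores a key, never the name itself.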


Design technique

• There is a name for this design technique: normalization.
• An attribute such as the product category is said to be normalized when it is stored in a separate table and replaced with a key that points to that table.
• The opposite technique, that is, storing attributes in the table to which they belong, is called denormalization.
• When a model is denormalized, the same attribute appears multiple times, and if you need to update it, you will have to update all the rows containing it.

Online transactional processing (OLTP) systems

• Highly normalized structures are typical of online transactional processing (OLTP) systems.
• OLTP systems are databases that are designed to handle your everyday jobs. That includes operations like preparing invoices, placing orders, shipping goods, and solving claims.
• These databases are very normalized because they are designed to use the least amount of space (which typically means they run faster) with a lot of insert and update operations.
• In fact, during the everyday work of a company, you typically update information (for example, about a customer) and want it to be automatically updated on all the data that reference this customer.
• This happens in a smooth way if the customer information is correctly normalized.
• If the customer information were denormalized, updating the address of a customer would result in hundreds of update statements executed by the server, causing poor performance.

Design technique - denormalize

• When building a data model to do reporting, you must reach a reasonable level of denormalization no matter how the original data is stored.
• If you denormalize too much, you face the problem of granularity.
• Intuitively, you denormalize up to the point where a table is a self-contained structure that completely describes the entity it stores.
• If the model is designed the right way, with the right level of denormalization, then granularity comes out in a very natural way.
• On the other hand, if the model is over-denormalized, then you must worry about granularity, and you start facing issues.
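The trade-off between the two techniques can be made concrete with a short Python sketch. It flattens hypothetical normalized tables into one self-contained Product table and then counts how many rows a category rename has to touch. The tables, the `ProductCategoryKey` column name, and both helper functions are assumptions made for illustration.

```python
# Hypothetical normalized tables (names and values are illustrative only).
category = {1: {"Category": "Audio"}}
subcategory = {
    10: {"Subcategory": "Headphones", "ProductCategoryKey": 1},
    11: {"Subcategory": "Speakers", "ProductCategoryKey": 1},
}
product = {
    100: {"Name": "Headphones A", "ProductSubcategoryKey": 10},
    101: {"Name": "Headphones B", "ProductSubcategoryKey": 10},
    102: {"Name": "Speaker C", "ProductSubcategoryKey": 11},
}


def denormalize(product, subcategory, category):
    """Copy subcategory and category names onto every product row."""
    flat = []
    for key, row in product.items():
        sub = subcategory[row["ProductSubcategoryKey"]]
        cat = category[sub["ProductCategoryKey"]]
        flat.append({
            "ProductKey": key,
            "Name": row["Name"],
            "Subcategory": sub["Subcategory"],
            "Category": cat["Category"],
        })
    return flat


def rename_category(flat_rows, old, new):
    """Rename a category in the flat table; return how many rows changed."""
    touched = 0
    for row in flat_rows:
        if row["Category"] == old:
            row["Category"] = new
            touched += 1
    return touched


flat_product = denormalize(product, subcategory, category)
rows_touched = rename_category(flat_product, "Audio", "Audio & Hi-Fi")
print(rows_touched)
```

In the normalized tables the rename is a single-row change. In the flat table it touches one row per product, which is the cost an OLTP system avoids and a reporting model accepts in exchange for a self-contained Product table.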


Introducing star schemas

• In a typical company like Contoso, there are several informational assets: products, stores, employees, customers, and time.
• These assets interact with each other, and they generate events.
• For example, a product is sold by an employee, who is working in a store, to a particular customer, and on a given date.
• There is almost always a clear separation between assets and events.
• In a medical environment, assets might include patients, diseases, and medications, whereas an event is a patient being diagnosed with a specific disease and obtaining a medication to resolve it.

Star schemas

• This separation between assets and events leads to a data-modeling technique known as a star schema. F01 xx 13.xlsx
• In a star schema, you divide your entities (tables) into two categories:
• Dimensions: A dimension is an informational asset, like a product, a customer, an employee, or a patient.
  • Dimensions have attributes.
  • For example, a product has attributes like its color, its category and subcategory, its manufacturer, and its cost.
  • A patient has attributes such as a name, address, and date of birth.
• Facts: A fact is an event involving some dimensions.
  • In Contoso, a fact is the sale of a product.
  • A sale involves a product, a customer, a date, and other dimensions. Facts have metrics, which are numbers that you can aggregate to obtain insights from your business.

Star schemas

• If you design this schema, putting all dimensions around a single fact table, you obtain the typical figure of a star schema.
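A minimal star schema can be sketched in Python as one fact table whose rows hold only dimension keys and metrics, surrounded by dimension tables that hold the attributes. All rows, the `Amount` metric, and the helper `slice_fact` are hypothetical.

```python
from collections import defaultdict

# Hypothetical dimensions: informational assets with attributes.
product = {1: {"Color": "Black"}, 2: {"Color": "Red"}}
customer = {1: {"Country": "India"}, 2: {"Country": "Italy"}}

# Hypothetical fact table: one row per sale, with keys and a metric.
sales = [
    {"ProductKey": 1, "CustomerKey": 1, "Amount": 100.0},
    {"ProductKey": 2, "CustomerKey": 1, "Amount": 50.0},
    {"ProductKey": 1, "CustomerKey": 2, "Amount": 70.0},
]


def slice_fact(fact_rows, dimension, key_column, attribute, metric):
    """Aggregate a fact metric, sliced by one attribute of one dimension."""
    totals = defaultdict(float)
    for row in fact_rows:
        value = dimension[row[key_column]][attribute]
        totals[value] += row[metric]
    return dict(totals)


by_color = slice_fact(sales, product, "ProductKey", "Color", "Amount")
by_country = slice_fact(sales, customer, "CustomerKey", "Country", "Amount")
print(by_color, by_country)
```

The same fact rows answer both questions. Only the dimension used for slicing changes, which is the division of labor the star schema is built around.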

Star schemas

• You use dimensions to slice and dice the data, whereas you use fact tables to aggregate numbers.
• Dimensions tend to be small tables, with fewer than 1,000,000 rows, generally in the order of magnitude of a few hundred or thousand.
• Fact tables, on the other hand, are much larger. They are expected to store tens, if not hundreds, of millions of rows.
• Fact tables are related to dimensions, but dimensions should not have relationships among them.

Star schemas

• Suppose we add a new dimension, Geography, that contains details about geographical places, like the city, state, and country/region of a place.
• Both the Store and Customer dimensions can be related to Geography.
• Why is this a bad model? Because it introduces ambiguity.
• Suppose you slice by city and want to compute the amount sold. The system might:
  • Follow the relationship between Geography and Customer
  • Follow the relationship between Geography and Store
  • Follow both
• Neither Excel nor Power BI lets you build such a model.

[Figure: diagram label "Deactivated"]
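The ambiguity introduced by a shared Geography dimension can be demonstrated with a small Python sketch. The cities, keys, and amounts are hypothetical, and `GeographyKey` is an assumed column name. The same sales are totaled by city twice, once following the path through Store and once through Customer.

```python
from collections import defaultdict

# Hypothetical shared dimension related to both Store and Customer.
geography = {1: {"City": "Pilani"}, 2: {"City": "Delhi"}}
store = {1: {"GeographyKey": 1}}            # the store is in Pilani
customer = {
    1: {"GeographyKey": 2},                 # this customer lives in Delhi
    2: {"GeographyKey": 1},                 # this customer lives in Pilani
}
sales = [
    {"StoreKey": 1, "CustomerKey": 1, "Amount": 100.0},
    {"StoreKey": 1, "CustomerKey": 2, "Amount": 40.0},
]


def amount_by_city(dimension, key_column):
    """Total Amount by Geography[City], reached through one dimension."""
    totals = defaultdict(float)
    for row in sales:
        geo_key = dimension[row[key_column]]["GeographyKey"]
        totals[geography[geo_key]["City"]] += row["Amount"]
    return dict(totals)


via_store = amount_by_city(store, "StoreKey")
via_customer = amount_by_city(customer, "CustomerKey")
print(via_store, via_customer)
```

Both results are legitimate answers to "amount sold by city", yet they differ. With two paths from Geography to Sales, the model alone cannot tell which one the question means.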


• How would you resolve ambiguity in this scenario?
• The answer is very simple. You must denormalize the relevant columns of the Geography table, both in Store and in Customer, removing the Geography table from the model.
• You could include the ContinentName column in both Store and Customer.

Snowflake

• A snowflake is a variation of a star schema where a dimension is not linked directly to the fact table.
• Rather, it is linked through another dimension. F01 xx 15.xlsx

Snowflake

• The difference between this example and the previous one is that this relationship is the only one between Product Subcategory and the other dimensions linked to the fact table or to Product.
• Thus, you can think of Product Subcategory as a dimension that groups different products together, but it does not group together any other dimension or fact.
• The same, obviously, is true for Product Category.
• Thus, even if snowflakes violate the aforementioned rule, they do not introduce any kind of ambiguity, and a data model with snowflakes is absolutely fine.
• Still, whenever you work with a data model, representing it with a star schema is the right thing to do.
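The fix for the Geography ambiguity, copying the relevant columns into Store and Customer and dropping the shared table, can be sketched as follows. The rows, the `GeographyKey` column name, and the helper `absorb_geography` are hypothetical; only `ContinentName` comes from the slides.

```python
# Hypothetical tables before the fix.
geography = {
    1: {"ContinentName": "Asia"},
    2: {"ContinentName": "Europe"},
}
store = {1: {"StoreName": "Store A", "GeographyKey": 1}}
customer = {1: {"Name": "Customer X", "GeographyKey": 2}}


def absorb_geography(dimension, geography_table):
    """Return a copy of the dimension with ContinentName denormalized in
    and the GeographyKey foreign key removed."""
    result = {}
    for key, row in dimension.items():
        new_row = {k: v for k, v in row.items() if k != "GeographyKey"}
        geo = geography_table[row["GeographyKey"]]
        new_row["ContinentName"] = geo["ContinentName"]
        result[key] = new_row
    return result


store_flat = absorb_geography(store, geography)
customer_flat = absorb_geography(customer, geography)
print(store_flat, customer_flat)
```

After this step each dimension carries its own ContinentName, so a report has to say whether it slices by the store's continent or the customer's, and Geography is no longer needed in the model.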


Naming objects

• Table names for dimensions should consist of only the business asset name, in singular or plural form (singular is preferable).
• If the business asset contains multiple words, use casing to separate the words.
  • ProductCategory, CountryShipment
• Table names for facts should consist of the business name for the fact, which is always plural.
• Avoid names that are too long/short.
• The key to a dimension is the dimension name followed by Key.

Summary

• A single table is already a data model, although in its simplest form.
• With a single table, you must define the granularity of your data. Choosing the right granularity makes calculations much easier to author.
• The difference between working with a single table and multiple ones is that when you have multiple tables, they are joined by relationships.
• In a relationship, there is a one side and a many side, indicating how many rows you are likely to find if you follow the relationship. Because one product has many sales, the Product table will be the one side, and the Sales table will be the many side.
• If a table is the target of a relationship, it needs to have a primary key, which is a column with unique values that can be used to identify a single row. If a key is not available, then the relationship cannot be defined.

Summary

• A normalized model is a data model where the data is stored in a compact way, avoiding repetitions of the same value in different rows. This structure typically increases the number of tables.
• A denormalized model has a lot of repetitions (for example, the name Red is repeated multiple times, once for each red product), but has fewer tables.
• Normalized models are used for OLTP, whereas denormalized models are used in analytical data models.
• A typical analytical model differentiates between informational assets (dimensions) and events (facts). By classifying each entity in the model as either a fact or a dimension, the model is built in the form of a star schema.
• Star schemas are the most widely used architecture for analytical models, and for a good reason: They work fine nearly always.
