Data Warehouse Unit 4 CS3551
For example, the conceptual data model for an auto dealership might show the
data entities like this:
1. A Showrooms entity that represents information about the different outlets the
dealership has
2. A Cars entity that represents the several cars the dealership currently stocks
3. A Customers entity that represents all the customers who have made a purchase
in the dealership
4. A Sales entity that represents the information about the actual sale
5. A Salesperson entity that represents the information about all the salespeople
who work for the dealership
This conceptual model would also include the dealership's business requirements.
Logical data models map the conceptual data classes to technical data
structures. They give more details about the data concepts and the complex data
relationships that were identified in the conceptual data model.
In our auto dealership example, the logical data model would expand the
conceptual model and take a deeper look at the data classes as follows:
• The Showrooms entity has fields such as name and location as text data and a
phone number as numerical data.
• The Customers entity has an email address field that must follow a standard
email format (for example, name@example.com). The field name can be no more
than 100 characters long.
• The Sales entity has a customer’s name and a salesperson’s name as fields,
along with the date of sale as a date data type and the amount as a decimal data
type.
Logical models thus act as a bridge between the conceptual data model and the
underlying technology and database language that developers use to create the
database. However, they are technology agnostic, and you can implement them
in any database language. Data engineers and stakeholders typically make
technology decisions after they have created a logical data model.
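As a sketch of how such logical-model rules might be checked in code before any technology decision is made, the following class is an illustrative assumption (the `Customer` class, field names, and email pattern are not part of any particular database product):

```python
import re
from dataclasses import dataclass

# Hypothetical sketch of the logical-model rules described above:
# an email field with a standard format and a name of at most
# 100 characters. Names and the regex are illustrative assumptions.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

@dataclass
class Customer:
    name: str
    email: str

    def validate(self) -> list:
        """Return a list of rule violations (empty list means valid)."""
        errors = []
        if len(self.name) > 100:
            errors.append("name exceeds 100 characters")
        if not EMAIL_RE.match(self.email):
            errors.append("email does not match the expected format")
        return errors

print(Customer("Jane Doe", "jane@example.com").validate())  # []
print(Customer("X" * 101, "not-an-email").validate())
```

Such checks belong to the logical level because they are technology agnostic: the same rules could later be enforced as column constraints in whatever database is chosen.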
Suppose that the auto dealership decided to create a data archive in Amazon S3
Glacier Flexible Retrieval. Their physical data model describes the following
specifications:
• In Sales, the sale amount is a float data type, and the date of sale is a timestamp
data type.
• In Customers, the customer name is a string data type.
• In S3 Glacier Flexible Retrieval terminology, a vault is a container for
storing your archived data.
Your physical data model also includes additional details such as which AWS
Region you will create your vault in. The physical data model thus acts as a
bridge between the logical data model and the final technology implementation.
What are the types of data modeling techniques?
Data modeling techniques are the different methods that you can use to create
different data models. The approaches have evolved over time as the result of
innovations in database concepts and data governance. The following are the
main types of data modeling:
In hierarchical data modeling, you can represent the relationships between the
various data elements in a tree-like format. Hierarchical data models represent
one-to-many relationships, with parents or root data classes mapping to several
children.
In the auto dealership example, the parent class Showrooms would have both
entities Cars and Salespeople as children because one showroom has several
cars and salespeople working in it.
Hierarchical data modeling has evolved over time into graph data modeling.
Graph data models represent data relationships that treat entities equally.
Entities can link to each other in one-to-many or many-to-many relationships
without any concept of parent or child.
For example, one showroom can have several salespeople, and one salesperson
can also work at several showrooms if their shifts vary by location.
Salesperson table:
Salesperson ID | Name
1              | Jane
2              | John

Car table:
Car ID | Car Brand
C1     | XYZ
C2     | ABC
Salesperson ID and Car ID are primary keys that uniquely identify individual
real-world entities. In the showroom table, these primary keys act as foreign
keys that link the data segments.
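The primary-key/foreign-key linkage described above can be sketched with an in-memory SQLite database (the `showroom_sale` table and its columns are illustrative assumptions, not part of the original example):

```python
import sqlite3

# Sketch: Salesperson ID and Car ID are primary keys in their own
# tables and reappear as foreign keys in a linking sales table.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE salesperson (salesperson_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE car (car_id TEXT PRIMARY KEY, brand TEXT);
CREATE TABLE showroom_sale (
    sale_id        INTEGER PRIMARY KEY,
    salesperson_id INTEGER REFERENCES salesperson(salesperson_id),
    car_id         TEXT    REFERENCES car(car_id)
);
INSERT INTO salesperson VALUES (1, 'Jane'), (2, 'John');
INSERT INTO car VALUES ('C1', 'XYZ'), ('C2', 'ABC');
INSERT INTO showroom_sale VALUES (10, 1, 'C2');
""")
# The foreign keys let us recover which salesperson sold which car.
row = conn.execute("""
    SELECT s.name, c.brand
    FROM showroom_sale sh
    JOIN salesperson s ON s.salesperson_id = sh.salesperson_id
    JOIN car c         ON c.car_id = sh.car_id
""").fetchone()
print(row)  # ('Jane', 'ABC')
```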
The relationships between the different entities are at the heart of data
modeling. Business rules initially define these relationships at a conceptual
level. You can think of relationships as the verbs in your data model. For
instance, the salesperson sells many cars, or the showroom employs many
salespeople.
After you conceptually understand your entities and their relationships, you can
determine the data modeling technique that best suits your use case. For
example, you might use relational data modeling for transactional data but
dimensional data modeling for analytical workloads.
You can optimize your data model further to suit your technology and
performance requirements. For example, if you plan to use Amazon Aurora and
a structured query language (SQL), you will put your entities directly into tables
and specify relationships by using foreign keys. By contrast, if you choose to
use Amazon DynamoDB, you will need to think about access patterns before
you model your table. Because DynamoDB prioritizes speed, you first
determine how you will access your data and then model your data in the form
it will be accessed.
You will typically revisit these steps repeatedly as your technology and
requirements change over time.
The key concepts in dimensional modeling are facts, dimensions, and measures.
Considering the relational context, there are two basic models used in
dimensional modeling:
o Star Model
o Snowflake Model
The star model is the underlying structure for a dimensional model. It has one
broad central table (the fact table) and a set of smaller tables (dimensions)
arranged in a radial design around the central table. The snowflake model is the
result of decomposing one or more of the dimensions.
Fact Table
Fact tables are used to record the facts, or measures, of the business. Facts are
the numeric data elements that are of interest to the company.
The fact table includes numerical values of what we measure. For example, a
fact value of 20 might mean that 20 widgets have been sold.
Each fact table includes the keys to associated dimension tables. These are
known as foreign keys in the fact table.
Dimension Table
Dimension tables establish the context of the facts: they store the fields that
describe the facts. This detail enables business analysts, for example, to
understand the data and their reports better.
The dimension tables include descriptive data about the numerical values in the
fact table. That is, they contain the attributes of the facts. For example, the
dimension tables for a marketing analysis function might include attributes such
as time, marketing region, and product type.
The attributes in a dimension table are used as row and column headings in a
document or query results display.
Example: A store summary in a fact table can be viewed by city and state, an
item summary can be viewed by brand, color, and so on, and customer
information can be viewed by name and address.
Fact Table
Time ID | Product ID | Customer ID | Units Sold
4       | 17         | 2           | 1
8       | 21         | 3           | 2
8       | 4          | 1           | 1
In this example, the Customer ID column in the fact table is a foreign key that
joins with the dimension table. By following the link, we can see that row 2 of
the fact table records the fact that customer 3, Gaurav, bought two items on day
8.
Dimension Tables
Customer ID | Name    | Gender | Income | Education | Region
1           | Rohan   | Male   | 2      | 3         | 4
2           | Sandeep | Male   | 3      | 5         | 1
3           | Gaurav  | Male   | 1      | 7         | 3
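The fact/dimension join described above can be reproduced with an in-memory SQLite sketch; the table and column names mirror the example tables, and this is an illustration rather than a production design:

```python
import sqlite3

# Load the example fact and dimension tables, then follow the
# customer_id foreign key to answer: who bought what on day 8?
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sales (time_id INT, product_id INT,
                         customer_id INT, units_sold INT);
INSERT INTO fact_sales VALUES (4, 17, 2, 1), (8, 21, 3, 2), (8, 4, 1, 1);
CREATE TABLE dim_customer (customer_id INT PRIMARY KEY, name TEXT);
INSERT INTO dim_customer VALUES (1, 'Rohan'), (2, 'Sandeep'), (3, 'Gaurav');
""")
rows = conn.execute("""
    SELECT d.name, f.time_id, f.units_sold
    FROM fact_sales f
    JOIN dim_customer d ON d.customer_id = f.customer_id
    WHERE f.time_id = 8
    ORDER BY d.name
""").fetchall()
print(rows)  # [('Gaurav', 8, 2), ('Rohan', 8, 1)]
```

The join confirms the reading of the fact table: customer 3, Gaurav, bought two items on day 8.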
Hierarchy
A hierarchy is a directed tree whose nodes are dimension attributes and whose
arcs model many-to-one associations between dimension attributes. It contains a
dimension, positioned at the tree's root, and all of the dimension attributes
that describe it.
Facts
Facts are the measurable data elements that represent the business metrics of
interest. For example, in a sales data warehouse, the facts might include sales
revenue, units sold, and profit margins. Each fact is associated with one or
more dimensions, creating a relationship between the fact and the descriptive
data.
Dimension
Dimensions are the descriptive data elements that are used to categorize or
classify the data. For example, in a sales data warehouse, the dimensions
might include product, customer, time, and location. Each dimension is made
up of a set of attributes that describe the dimension. For example, the product
dimension might include attributes such as product name, product category,
and product price.
Attributes
Attributes are the individual descriptive fields that make up a dimension, such
as a product's name or category.
Fact Table
In a dimensional data model, the fact table is the central table that contains the
measures or metrics of interest, surrounded by the dimension tables that
describe the attributes of the measures. The dimension tables are related to the
fact table through foreign key relationships
Dimension Table
The dimensions of a fact are specified by dimension tables, which are joined to
the fact table by foreign keys. Dimension tables are simply denormalized
tables, and a dimension can take part in one or more relationships.
Steps to Create Dimensional Data Modeling
Step-1: Identifying the business objective: The first step is to identify the
business objective. Sales, HR, and Marketing are some examples of an
organization's needs. Because this is the most important step of data
modeling, the selection of the business objective also depends on the quality
of the data available for the process.
Step-4: Identifying the Fact: The measurable data is held by the fact table.
Most fact table columns hold numerical values, such as price or cost per
unit.
Step-5: Building the Schema: We implement the dimensional model in this
step. A schema is a database structure. There are two popular schemas: the
Star Schema and the Snowflake Schema.
Consider the data of a shop for items sold per quarter in the city of Delhi. The
data is shown in the table. In this 2D representation, the sales for Delhi are
shown for the time dimension (organized in quarters) and the item dimension
(classified according to the types of items sold). The fact or measure
displayed is rupees_sold (in thousands).
Now, suppose we want to view the sales data with a third dimension. For
example, suppose the data is considered according to time and item, as well as
location, for the cities Chennai, Kolkata, Mumbai, and Delhi. These 3D data
are shown in the table, where the 3D data are represented as a series of 2D
tables.
2. Let us take the example of the data of a factory that sells products per
quarter in Bangalore. The data is represented in the table given below:
2D factory data
In the above representation, the factory's sales for Bangalore are shown for the
time dimension, which is organized into quarters, and the item dimension,
which is sorted according to the kind of item sold. The facts here are
represented in rupees (in thousands).
Now, if we wish to view the sales data in three dimensions, it is represented as
a series of two-dimensional tables, as in the diagram given below. Let us
consider the data according to item, time, and location (for example, Kolkata,
Delhi, and Mumbai). Here is the table:
3D data representation as 2D
3D data representation
Features of multidimensional data models:
Measures: Measures are numerical data that can be analyzed and compared,
such as sales or revenue. They are typically stored in fact tables in a
multidimensional data model.
Dimensions: Dimensions are attributes that describe the measures, such as
time, location, or product. They are typically stored in dimension tables in a
multidimensional data model.
Cubes: Cubes are structures that represent the multidimensional relationships
between measures and dimensions in a data model. They provide a fast and
efficient way to retrieve and analyze data.
Aggregation: Aggregation is the process of summarizing data across
dimensions and levels of detail. This is a key feature of multidimensional data
models, as it enables users to quickly analyze data at different levels of
granularity.
Drill-down and roll-up: Drill-down is the process of moving from a higher-
level summary of data to a lower level of detail, while roll-up is the opposite
process of moving from a lower-level detail to a higher-level summary. These
features enable users to explore data in greater detail and gain insights into the
underlying patterns.
Hierarchies: Hierarchies are a way of organizing dimensions into levels of
detail. For example, a time dimension might be organized into years, quarters,
months, and days. Hierarchies provide a way to navigate the data and perform
drill-down and roll-up operations.
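Roll-up along such a time hierarchy can be sketched with plain GROUP BY queries; the table below, with day, month, and quarter columns, is an illustrative assumption:

```python
import sqlite3

# Sketch of roll-up along a time hierarchy: daily sales rows are
# summarized to the month level, then rolled up to the quarter level.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (day TEXT, month TEXT, quarter TEXT, amount INT);
INSERT INTO sales VALUES
  ('2024-01-05', '2024-01', 'Q1', 100),
  ('2024-01-20', '2024-01', 'Q1', 150),
  ('2024-02-03', '2024-02', 'Q1', 200),
  ('2024-04-10', '2024-04', 'Q2', 300);
""")
monthly = conn.execute(
    "SELECT month, SUM(amount) FROM sales GROUP BY month ORDER BY month"
).fetchall()
quarterly = conn.execute(
    "SELECT quarter, SUM(amount) FROM sales GROUP BY quarter ORDER BY quarter"
).fetchall()
print(monthly)    # [('2024-01', 250), ('2024-02', 200), ('2024-04', 300)]
print(quarterly)  # [('Q1', 450), ('Q2', 300)]
```

Drill-down is simply the reverse direction: starting from the quarterly result and re-querying at the month or day level.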
OLAP (Online Analytical Processing): OLAP is a type of multidimensional
data model that supports fast and efficient querying of large datasets. OLAP
systems are designed to handle complex queries and provide fast response
times.
For example, a relation with the schema sales(part, supplier, customer,
sale-price) can be materialized into a set of eight views as shown in the figure,
where psc indicates a view consisting of aggregate function values (such as
total-sales) computed by grouping the three attributes part, supplier, and
customer; p indicates a view composed of the corresponding aggregate function
values calculated by grouping part alone; and so on.
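The eight views correspond to all subsets of the three grouping attributes, which a short sketch can enumerate:

```python
from itertools import combinations

# Each cuboid of the sales(part, supplier, customer) cube is a
# subset of the grouping attributes, so there are 2^3 = 8 of them.
dims = ("part", "supplier", "customer")
cuboids = [combo for r in range(len(dims) + 1)
           for combo in combinations(dims, r)]
for c in cuboids:
    print(c if c else "() - the apex cuboid (no grouping)")
print(len(cuboids))  # 8
```

In general, a cube with n dimensions (ignoring concept hierarchies) has 2^n cuboids in its lattice.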
A data cube is created from a subset of attributes in the database. Specific
attributes are chosen to be measure attributes, i.e., the attributes whose values
are of interest. Other attributes are selected as dimensions, or functional
attributes. The measure attributes are aggregated according to the dimensions.
For example, XYZ may create a sales data warehouse to keep records of the
store's sales for the dimensions time, item, branch, and location. These
dimensions enable the store to keep track of things like monthly sales of items,
and the branches and locations at which the items were sold. Each dimension
may have a table associated with it, known as a dimension table, which describes
the dimension. For example, a dimension table for items may contain the
attributes item_name, brand, and type.
If a query contains constants at even lower levels than those provided in a data
cube, it is not clear how to make the best use of the precomputed results stored
in the data cube.
The multidimensional model views data in the form of a data cube. OLAP tools
are based on the multidimensional data model. Data cubes usually model
n-dimensional data.
Example: In the 2-D representation, we will look at the All Electronics sales
data for items sold per quarter in the city of Vancouver. The measure displayed
is dollars_sold (in thousands).
3-Dimensional Cuboids
Suppose we would like to view the sales data with a third dimension. For
example, suppose we would like to view the data according to time and item, as
well as location, for the cities Chicago, New York, Toronto, and Vancouver. The
measure displayed is dollars_sold (in thousands). These 3-D data are shown in
the table, represented as a series of 2-D tables.
Conceptually, we may represent the same data in the form of 3-D data cubes, as
shown in fig:
Let us suppose that we would like to view our sales data with an additional
fourth dimension, such as a supplier.
In data warehousing, the data cubes are n-dimensional. The cuboid which holds
the lowest level of summarization is called a base cuboid.
For example, the 4-D cuboid in the figure is the base cuboid for the given time,
item, location, and supplier dimensions.
The figure shows a 4-D data cube representation of sales data according to the
dimensions time, item, location, and supplier. The measure displayed is
dollars_sold (in thousands).
The topmost 0-D cuboid, which holds the highest level of summarization, is
known as the apex cuboid. In this example, this is the total sales, or dollars sold,
summarized over all four dimensions.
The lattice of cuboids forms a data cube. The figure shows the lattice of
cuboids creating a 4-D data cube for the dimensions time, item, location, and
supplier. Each cuboid represents a different degree of summarization.
Data cube classification:
The data cube can be classified into two categories:
• Multidimensional data cube: It stores large amounts of data by making use
of a multidimensional array. It increases its efficiency by keeping an index
for each dimension and can therefore retrieve data quickly.
• Relational data cube: It basically helps in storing large amounts of data by
making use of relational tables. Each relational table displays the
dimensions of the data cube. It is slower compared to a Multidimensional
Data Cube.
Data cube operations:
Data cube operations are used to manipulate data to meet the needs of users.
These operations help to select particular data for analysis. The main
operations include the following:
• Roll-up: This operation aggregates similar data attributes along a
dimension. For example, if the data cube displays the daily income of a
customer, we can use a roll-up operation to find his monthly income.
• Dicing: This operation performs a multidimensional cut: it does not restrict
only one dimension, but can also move to another dimension and cut a
certain range of it. The result looks like a subcube of the whole cube (as
depicted in the figure). For example, the user wants to see the annual
salary of Jharkhand state employees.
• Pivot: This operation transforms the data cube in terms of view; it does not
change the data present in the cube. For example, if the user is comparing
year versus branch, using the pivot operation the user can change the
viewpoint and instead compare branch versus item type.
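The dicing operation above can be sketched as a query that restricts several dimensions at once; the `salary_cube` table and its values are illustrative assumptions:

```python
import sqlite3

# Dice sketch: cut the cube on both the state and year dimensions
# at once, producing a subcube rather than a single slice.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE salary_cube (state TEXT, year INT, branch TEXT, salary INT);
INSERT INTO salary_cube VALUES
  ('Jharkhand', 2023, 'Ranchi', 50),
  ('Jharkhand', 2024, 'Ranchi', 55),
  ('Bihar',     2024, 'Patna',  60);
""")
subcube = conn.execute(
    "SELECT state, year, branch, salary FROM salary_cube "
    "WHERE state = 'Jharkhand' AND year = 2024"
).fetchall()
print(subcube)  # [('Jharkhand', 2024, 'Ranchi', 55)]
```

A slice would constrain only one dimension (for example, just `year = 2024`); the dice constrains two or more.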
A star schema is the elementary form of a dimensional model, in which data are
organized into facts and dimensions. A fact is an event that is counted or
measured, such as a sale or a login. A dimension includes reference data about
the fact, such as date, item, or customer.
Fact Tables
A fact table may contain either detail-level facts or facts that have been
aggregated (fact tables that contain aggregated facts are often instead called
summary tables). A fact table generally contains facts at the same level of
aggregation.
Dimension Tables
Fact tables store data about sales, while dimension tables store data about
geographic regions (markets, cities), clients, products, times, and channels.
The star schema is particularly suitable for data warehouse database design
because of the following features:
Star schemas are easy for end users and applications to understand and
navigate. With a well-designed schema, users can instantly analyze large,
multidimensional data sets.
Because a star schema database has a limited number of tables and clear join
paths, queries run faster than they do against OLTP systems. Small single-table
queries, frequently against a dimension table, are almost instantaneous. Large
join queries that involve multiple tables take only seconds or minutes to run.
In a star schema database design, the dimensions are connected only through the
central fact table. When two dimension tables are used in a query, only one
join path, intersecting the fact table, exists between those two tables. This
design feature enforces accurate and consistent query results.
Structural simplicity also decreases the time required to load large batches of
records into a star schema database. By defining facts and dimensions and
separating them into different tables, the impact of a load operation is
reduced. Dimension tables can be populated once and occasionally refreshed.
New facts can be added regularly and selectively by appending records to the
fact table.
Star schemas also offer built-in referential integrity and are easily
understood.
There are some conditions that cannot be met by star schemas. For example, the
relationship between a user and a bank account cannot be described in a star
schema, because the relationship between them is many-to-many.
The TIME table has columns for day, month, quarter, and year. The ITEM
table has columns for item_key, item_name, brand, type, and supplier_type.
The BRANCH table has columns for branch_key, branch_name, and
branch_type. The LOCATION table has columns of geographic data, including
street, city, state, and country.
In this scenario, the SALES table contains only four columns with IDs from the
dimension tables, TIME, ITEM, BRANCH, and LOCATION, instead of four
columns for time data, four columns for ITEM data, three columns for
BRANCH data, and four columns for LOCATION data. Thus, the size of the
fact table is significantly reduced. When we need to change an item, we need
only make a single change in the dimension table, instead of making many
changes in the fact table.
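The star schema described above might be declared as follows; this is a minimal sketch in SQLite, and the exact column types are assumptions:

```python
import sqlite3

# Star-schema sketch matching the SALES example: a central fact table
# with four foreign keys into TIME, ITEM, BRANCH, and LOCATION.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY,
                           day INT, month INT, quarter TEXT, year INT);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT,
                           brand TEXT, type TEXT, supplier_type TEXT);
CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY,
                           branch_name TEXT, branch_type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY,
                           street TEXT, city TEXT, state TEXT, country TEXT);
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    branch_key   INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);
""")
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
```

Note how the fact table itself carries only foreign keys plus the measures; all descriptive attributes live in the four dimension tables.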
The snowflake schema is an expansion of the star schema where each point of
the star explodes into more points. It is called snowflake schema because the
diagram of snowflake schema resembles a snowflake. Snowflaking is a method
of normalizing the dimension tables in a star schema. When we normalize
all the dimension tables entirely, the resultant structure resembles a snowflake
with the fact table in the middle.
The snowflake schema consists of one fact table which is linked to many
dimension tables, which can be linked to other dimension tables through a
many-to-one relationship. Tables in a snowflake schema are generally
normalized to the third normal form. Each dimension table represents exactly
one level in a hierarchy.
The following diagram shows a snowflake schema with two dimensions, each
having three levels. A snowflake schema can have any number of dimensions,
and each dimension can have any number of levels.
Example: Figure shows a snowflake schema with a Sales fact table, with Store,
Location, Time, Product, Line, and Family dimension tables. The Market
dimension has two dimension tables with Store as the primary dimension table,
and Location as the outrigger dimension table. The Product dimension has three
dimension tables, with Product as the primary dimension table and the Line and
Family tables as the outrigger dimension tables.
A star schema stores all attributes for a dimension in one denormalized table.
This requires more disk space than a more normalized snowflake schema.
Snowflaking normalizes the dimension by moving attributes with low
cardinality into separate dimension tables that relate to the core dimension table
by using foreign keys. Snowflaking for the sole purpose of minimizing disk
space is not recommended, because it can adversely impact query performance.
The STAR schema for sales, as shown above, contains only five tables, whereas
the normalized version now extends to eleven tables. Notice that in the
snowflake schema, the attributes with low cardinality in each original
dimension table are removed to form separate tables. These new tables are
connected back to the original dimension table through artificial keys.
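Snowflaking of a dimension can be sketched as follows, using the Product, Line, and Family chain from the earlier example (the keys and names are illustrative):

```python
import sqlite3

# Snowflaking sketch: the Product dimension is normalized into
# Product -> Line -> Family outrigger tables, each joined to the
# next through an artificial (surrogate) key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE family  (family_key INTEGER PRIMARY KEY, family_name TEXT);
CREATE TABLE line    (line_key INTEGER PRIMARY KEY, line_name TEXT,
                      family_key INTEGER REFERENCES family(family_key));
CREATE TABLE product (product_key INTEGER PRIMARY KEY, product_name TEXT,
                      line_key INTEGER REFERENCES line(line_key));
INSERT INTO family  VALUES (1, 'Appliances');
INSERT INTO line    VALUES (10, 'Kitchen', 1);
INSERT INTO product VALUES (100, 'Blender', 10);
""")
# Resolving a product now takes two extra joins compared to a star schema.
row = conn.execute("""
    SELECT p.product_name, l.line_name, f.family_name
    FROM product p
    JOIN line l   ON l.line_key = p.line_key
    JOIN family f ON f.family_key = l.family_key
""").fetchone()
print(row)  # ('Blender', 'Kitchen', 'Appliances')
```

The extra joins illustrate the trade-off noted above: less redundancy, but more work at query time.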
A snowflake schema is designed for flexible querying across more complex
dimensions and relationships. It is suitable for many-to-many and one-to-many
relationships between dimension levels.
This particular kind of data warehouse schema is shaped like a snowflake. The
snowflake schema aims to normalize the star schema's denormalized data.
When the star schema's dimensions are intricate, highly structured, and have
numerous degrees of connection, and the child tables have several parent
tables, the snowflake structure emerges. Some of the star schema's common
issues are resolved by the snowflake schema.
The star schema is the most straightforward method for arranging data in the
data warehouse. One or more fact tables that index a number of dimension
tables may be present at the star schema's center. Dimension keys, values, and
attributes are found in dimension tables, which are used to define dimensions.
While the Dimension Data is contained inside the Dimension Tables, the Fact
Data is arranged within the Fact Tables. In a star schema, the Fact Tables are
the integrating points at the core of a star.
The star schema is characterized by a denormalized data structure, with all data
related to a particular subject stored in a single large table and connected to
smaller, dimensional tables through a single join.
Star Schema vs. Snowflake Schema

Understanding Complexity:
• Star schema: simpler to understand compared to the snowflake schema.
• Snowflake schema: more complex to understand compared to the star schema.

Disadvantages:
• Star schema: limited ability to depict complex relationships between data;
can suffer from data redundancy and decreased data integrity; may not be
suitable for smaller volumes of data.
• Snowflake schema: the more complex data structure can be harder to
understand and work with; multiple joins between tables can result in slower
query performance; requires more storage and processing resources due to the
larger number of tables.
A Fact constellation means two or more fact tables sharing one or more
dimensions. It is also called Galaxy schema.
Information Processing
It deals with querying, statistical analysis, and reporting via tables, charts,
or graphs. Nowadays, information processing in a data warehouse means
constructing low-cost, web-based accessing tools, typically integrated with web
browsers.
Analytical Processing
It supports basic OLAP operations on the warehouse data, including
slice-and-dice, drill-down, roll-up, and pivoting.
Data Mining
It helps in the analysis of hidden patterns and associations, constructing
analytical models, performing classification and prediction, and presenting
the mining results using visualization tools.
The process architecture defines an architecture in which the data from the data
warehouse is processed for a particular computation.
Centralized Architecture
In this architecture, the data is collected into a single centralized storage
location and processed by a single machine with huge capacity in terms of
memory, processing power, and storage. It is very successful when the
collection and consumption of data occur at the same location.
Distributed Architecture
In this architecture, data and its processing are distributed across data
centers; processing is localized, and the results are grouped into centralized
storage. Distributed architectures are used to overcome the limitations of
centralized process architectures, where all the information must be collected
at one central location and results are available only at that central location.
There are several architectures of the distributed process:
Client-Server
In this architecture, the client does all the information collection and
presentation, while the server does the processing and management of data.
Three-tier Architecture
N-tier Architecture
Cluster Architecture
Peer-to-Peer Architecture
This is a type of architecture where there are no dedicated servers and clients.
Instead, all the processing responsibilities are allocated among all machines,
called peers. Each machine can perform the function of a client or server or just
process data.
Intraquery Parallelism
Interquery parallelism does not help here, since each individual query still
runs sequentially. This application of parallelism decomposes the serial SQL
query into lower-level operations such as scan, join, sort, and aggregation.
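A minimal sketch of this decomposition runs a summing scan in parallel over horizontal partitions; the partitioning scheme here is an illustrative stand-in for what a parallel RDBMS does internally:

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of intraquery parallelism: a single aggregation query
# (SUM over a table) is decomposed into parallel partial scans,
# whose partial results are then combined.
table = list(range(1, 1001))                  # stand-in for a table of values
partitions = [table[i::4] for i in range(4)]  # 4 horizontal partitions

def scan_and_sum(rows):
    """One low-level scan operator running over a single partition."""
    return sum(rows)

with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(scan_and_sum, partitions))

print(sum(partial_sums))  # 500500, same as a serial scan
```

The final combine step (summing the partial sums) mirrors how a parallel aggregation merges results from its worker operators.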
Each RDBMS server can read, write, update, and delete information from the
same shared database, which would need the system to implement a form of a
distributed lock manager (DLM).
DLM components can be found in hardware, in the operating system, or in a
separate software layer, depending on the system vendor.
It is relatively simple to implement and has been very successful up to the point
where it runs into the scalability limitations of the shared-everything
architecture.
The key point of this technique is that a single RDBMS server can potentially
use all processors, access all memory, and access the entire database, thus
providing the client with a consistent single system image.
In shared-memory SMP systems, the DBMS considers that the multiple
database components executing SQL statements communicate with each other
by exchanging messages and information via the shared memory.
All processors have access to all data, which is partitioned across local disks.
Shared-Nothing Architecture
Each processor has its memory and disk and communicates with other
processors by exchanging messages and data over the interconnection network.
This architecture is optimized specifically for the MPP and cluster systems.
The tools that allow accurate sourcing of data contents and formats from
operational and external data stores into the data warehouse have to perform
several essential tasks, which include the following:
1. The ability to identify the data in the data source environment that can be
read by the tool is necessary.
2. Support for flat files, indexed files, and legacy DBMSs is critical.
3. The capability to merge records from multiple data stores is required in
many installations.
4. The specification interface to indicate the information to be extracted
and converted is essential.
5. The ability to read information from repository products or data
dictionaries is desired.
6. The code developed by the tool should be completely maintainable.
7. Selective data extraction of both data items and records enables users to
extract only the required data.
8. A field-level data examination for the transformation of data into
information is needed.
9. The ability to perform data type and the character-set translation is a
requirement when moving data between incompatible systems.
10. The ability to create aggregation, summarization, and derivation fields
and records is necessary.
11. Vendor stability and support for the products are components that must
be evaluated carefully.
The warehouse team needs tools that can extract, transform, integrate, clean,
and load information from a source system into one or more data warehouse
databases. Middleware and gateway products may be needed for warehouses
that extract records from host-based source systems.
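A toy extract-transform-load pass might look like the following sketch; the CSV layout, column names, and cleaning rule are illustrative assumptions:

```python
import csv
import io
import sqlite3

# Minimal ETL sketch: extract records from a flat file (an in-memory
# CSV here), transform/clean them, and load them into a warehouse table.
source = io.StringIO("sale_id,amount\n1, 100 \n2,250\n3,\n")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (sale_id INTEGER, amount REAL)")

loaded = 0
for rec in csv.DictReader(source):            # extract
    raw = rec["amount"].strip()
    if not raw:                               # transform: drop bad records
        continue
    conn.execute("INSERT INTO sales_fact VALUES (?, ?)",
                 (int(rec["sale_id"]), float(raw)))  # load
    loaded += 1

print(loaded)  # 2 clean records loaded
total = conn.execute("SELECT SUM(amount) FROM sales_fact").fetchone()[0]
print(total)   # 350.0
```

Real extraction tools add the capabilities listed above (format detection, merging across stores, type and character-set translation), but the extract, transform, load structure is the same.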
Warehouse Storage
Software products are also needed to store warehouse data and their
accompanying metadata. Relational database management systems are well
suited to large and growing warehouses.
Different types of software are needed to access, retrieve, distribute, and present
warehouse data to its end-clients.