Unit 4 - Notes-1
PARTITIONING
Partitioning is done to enhance performance and to make data easier to manage.
Partitioning also helps in balancing the various requirements of the system. It optimizes
hardware performance and simplifies the management of the data warehouse by splitting each
fact table into multiple separate partitions. In this chapter, we will discuss different
partitioning strategies.
TYPES OF PARTITIONING
HORIZONTAL PARTITIONING
There are various ways in which a fact table can be partitioned. In horizontal partitioning, we
have to keep in mind the requirements for manageability of the data warehouse.
1. Partitioning by Time into Equal Segments
In this partitioning strategy, the fact table is partitioned on the basis of time period. Here each
time period represents a significant retention period within the business. For example, if
users query month-to-date data, then it is appropriate to partition the data into monthly
segments. The partitioned tables can be reused by removing the data in them.
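The monthly strategy can be sketched as follows. This is a minimal illustration, assuming fact rows are plain dictionaries with a `sales_date` field; the `sales_YYYY_MM` naming scheme and the function names are hypothetical and not taken from any particular database product.

```python
from collections import defaultdict
from datetime import date


def monthly_partition_name(sales_date: date) -> str:
    """Return the (hypothetical) partition name for a given date."""
    return f"sales_{sales_date.year:04d}_{sales_date.month:02d}"


def partition_by_month(rows):
    """Group fact rows into one partition per calendar month."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[monthly_partition_name(row["sales_date"])].append(row)
    return dict(partitions)


rows = [
    {"product_id": 30, "qty": 5, "sales_date": date(2013, 8, 3)},
    {"product_id": 35, "qty": 4, "sales_date": date(2013, 9, 3)},
    {"product_id": 40, "qty": 5, "sales_date": date(2013, 9, 3)},
]
partitions = partition_by_month(rows)
```

A month-to-date query then only needs to read the single partition for the current month.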
2. Partition by Time into Different-sized Segments
This kind of partitioning is used where aged data is accessed infrequently. It is implemented
as a set of small partitions for relatively current data and larger partitions for inactive data.
Points to Note
∙ The detailed information remains available online.
∙ The number of physical tables is kept relatively small, which reduces the
operating cost.
∙ This technique is suitable where a mix of data dipping into recent history and data
mining through the entire history is required.
∙ This technique is not useful where the partitioning profile changes on a regular
basis, because repartitioning will increase the operating cost of the data
warehouse.
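A possible way to picture different-sized segments is a routing function that sends recent rows to small monthly partitions and older rows to larger yearly ones. The three-month threshold, the partition names, and the function name `partition_for` are assumptions made for this sketch.

```python
from datetime import date


def partition_for(sales_date: date, today: date, recent_months: int = 3) -> str:
    """Monthly partitions for recent data, yearly partitions for older data."""
    # Age of the row in whole calendar months relative to 'today'.
    months_old = (today.year - sales_date.year) * 12 + (today.month - sales_date.month)
    if months_old < recent_months:
        return f"sales_{sales_date.year:04d}_{sales_date.month:02d}"
    return f"sales_{sales_date.year:04d}"


today = date(2013, 9, 30)
recent = partition_for(date(2013, 9, 3), today)
old = partition_for(date(2011, 5, 1), today)
```

As time passes, rows that were "recent" must be moved into the larger partitions, which is the repartitioning cost mentioned in the points above.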
5. Partitioning Dimensions
If a dimension contains a large number of entries, then it may be necessary to partition the
dimension. Here we have to check the size of the dimension.
Consider a large design that changes over time. If we need to store all the variations in order
to apply comparisons, that dimension may be very large. This would definitely affect the
response time.
6. Round Robin Partitions
In the round robin technique, when a new partition is needed, the oldest one is archived. It uses
metadata to allow the user access tool to refer to the correct table partition.
This technique makes it easy to automate table management facilities within the data
warehouse.
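The round robin idea can be sketched with a small class that keeps a fixed number of partitions online and a metadata dictionary that maps each period to its partition. The class name, the `sales_` prefix, and the choice of two online partitions are hypothetical.

```python
from collections import deque


class RoundRobinPartitions:
    """Keep a fixed number of online partitions; archive the oldest when full."""

    def __init__(self, max_online: int):
        self.max_online = max_online
        self.online = deque()   # partition names, oldest first
        self.archived = []      # names of partitions moved to the archive
        self.metadata = {}      # period label -> online partition name

    def add_partition(self, period: str) -> str:
        if len(self.online) == self.max_online:
            oldest = self.online.popleft()
            self.archived.append(oldest)
            # Drop the archived partition from the lookup metadata.
            self.metadata = {p: n for p, n in self.metadata.items() if n != oldest}
        name = f"sales_{period}"
        self.online.append(name)
        self.metadata[period] = name
        return name

    def lookup(self, period: str):
        """Resolve a period to its online partition, or None if it was archived."""
        return self.metadata.get(period)


rr = RoundRobinPartitions(max_online=2)
for period in ["2013_07", "2013_08", "2013_09"]:
    rr.add_partition(period)
```

Because the access tool always goes through `lookup`, partitions can be rotated without changing user queries.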
VERTICAL PARTITION
Vertical partitioning splits the data vertically, that is, by columns rather than by rows. It
can be done in the two ways described below.
1. Normalization
Normalization is the standard relational method of database organization. In this method,
duplicate rows are collapsed into a single row, which reduces space. Take a look at the
following tables that show how normalization is performed.
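Table before Normalization
Product_id  Qty  Value  sales_date  Store_id  Store_name  Location   Region
30          5    3.67   3-Aug-13    16        sunny       Bangalore  W
35          4    5.33   3-Sep-13    16        sunny       Bangalore  W
40          5    2.50   3-Sep-13    64        san         Mumbai     S
45          7    5.66   3-Sep-13    16        sunny       Bangalore  W
Tables after Normalization
Store_id  Store_name  Location   Region
16        sunny       Bangalore  W
64        san         Mumbai     S
Product_id  Qty  Value  sales_date  Store_id
30          5    3.67   3-Aug-13    16
35          4    5.33   3-Sep-13    16
40          5    2.50   3-Sep-13    64
45          7    5.66   3-Sep-13    16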
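The normalization step can be illustrated with a short sketch that uses the sample values from the tables above. The function name `normalize` and the dictionary layout are assumptions made for illustration; a real warehouse would do this with table definitions rather than in application code.

```python
def normalize(rows):
    """Split denormalized sales rows into a store table and a sales table."""
    stores = {}   # store_id -> store attributes, stored once per store
    sales = []    # one row per sale, referencing the store by key only
    for r in rows:
        stores[r["store_id"]] = {
            "store_name": r["store_name"],
            "location": r["location"],
            "region": r["region"],
        }
        sales.append({
            "product_id": r["product_id"],
            "qty": r["qty"],
            "value": r["value"],
            "sales_date": r["sales_date"],
            "store_id": r["store_id"],
        })
    return stores, sales


denormalized = [
    {"product_id": 30, "qty": 5, "value": 3.67, "sales_date": "3-Aug-13",
     "store_id": 16, "store_name": "sunny", "location": "Bangalore", "region": "W"},
    {"product_id": 35, "qty": 4, "value": 5.33, "sales_date": "3-Sep-13",
     "store_id": 16, "store_name": "sunny", "location": "Bangalore", "region": "W"},
    {"product_id": 40, "qty": 5, "value": 2.50, "sales_date": "3-Sep-13",
     "store_id": 64, "store_name": "san", "location": "Mumbai", "region": "S"},
    {"product_id": 45, "qty": 7, "value": 5.66, "sales_date": "3-Sep-13",
     "store_id": 16, "store_name": "sunny", "location": "Bangalore", "region": "W"},
]
stores, sales = normalize(denormalized)
```

The store attributes, which were repeated on every sales row, are now held once per store.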
2. Row Splitting
Row splitting tends to leave a one-to-one map between partitions. The motive of row splitting
is to speed up access to a large table by reducing its size.
Note − While using vertical partitioning, make sure that there is no requirement to perform a
major join operation between two partitions.
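Row splitting can be sketched as dividing each row into a frequently used part and a rarely used part that share the same key. The customer columns and the names `split_rows`, `hot`, and `cold` are hypothetical and only serve the illustration.

```python
def split_rows(rows, key, hot_columns):
    """Split each row into a 'hot' part and a 'cold' part sharing the same key."""
    hot, cold = {}, {}
    for row in rows:
        k = row[key]
        hot[k] = {c: row[c] for c in hot_columns}
        cold[k] = {c: v for c, v in row.items() if c != key and c not in hot_columns}
    return hot, cold


customers = [
    {"customer_id": 1, "name": "Asha", "region": "S", "notes": "long free text"},
    {"customer_id": 2, "name": "Ravi", "region": "W", "notes": "more free text"},
]
hot, cold = split_rows(customers, "customer_id", ["name", "region"])
```

Every key appears exactly once in each part, which is the one-to-one map between partitions. Queries that need columns from both parts have to join them again, which is why the note above warns against designs that require a major join.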
What is a Fact Table?
A fact table is a primary table in a dimensional model. It contains:
1. Measurements/facts
2. Foreign keys to dimension tables
Parameter       Fact table                          Dimension table
Characteristic  Located at the center of a star     Connected to the fact table and
                or snowflake schema and             located at the edges of the star
                surrounded by dimensions.           or snowflake schema.
Key             Primary key in the fact table is    Has a primary key column that
                mapped as foreign keys to the       uniquely identifies each
                dimensions.                         dimension row.
Storage         Loads detailed atomic data into     Helps to store report labels and
                dimensional structures.             filter domain values.
The fact table is the central table in a dimensional schema. It is found at the centre of a star
schema or snowflake schema and is surrounded by dimension tables. It contains the facts of a
particular business process, such as sales revenue by month. Facts are also known as
measurements or metrics. It is an essential concept for data warehousing and BI
certification.
The fact table stores quantitative information for analysis, namely measurements and metrics,
and is often denormalized.
Types of Facts
1. Additive (summative) facts: Additive facts can be used with aggregation functions such as
sum(), average(), etc.
2. Semi-additive (semi-summative) facts: Only a small number of aggregation functions apply
to semi-additive facts. For example, consider bank account details. We cannot apply sum()
to a bank balance across time periods, because the result is not useful, but the minimum()
and maximum() functions return useful information.
3. Non-additive facts: We cannot use numerical aggregation functions such as sum() or
average() on non-additive facts. Ratios and percentages are typical non-additive facts.
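The three kinds of facts can be compared on a small example. All figures below are made-up sample data, and the variable names are hypothetical; the sketch only shows which aggregations give meaningful results.

```python
# Additive fact: sales revenue and cost can be summed across any dimension.
sales = [
    {"revenue": 200.0, "cost": 150.0},
    {"revenue": 100.0, "cost": 50.0},
]
total_revenue = sum(s["revenue"] for s in sales)
total_cost = sum(s["cost"] for s in sales)

# Semi-additive fact: account balances as (day, account, balance).
balances = [
    ("Mon", "A", 100.0), ("Tue", "A", 150.0),
    ("Mon", "B", 200.0), ("Tue", "B", 50.0),
]
# Summing across accounts for a single day is meaningful ...
total_on_tue = sum(b for day, _, b in balances if day == "Tue")
# ... but summing one account across days is not, so use min/max instead.
a_balances = [b for _, acct, b in balances if acct == "A"]
min_a, max_a = min(a_balances), max(a_balances)

# Non-additive fact: margin percentage. Adding or averaging the per-row
# percentages gives a misleading number; recompute it from additive parts.
row_margins = [(s["revenue"] - s["cost"]) / s["revenue"] * 100 for s in sales]
naive_average = sum(row_margins) / len(row_margins)
margin_pct = (total_revenue - total_cost) / total_revenue * 100
```

Here `naive_average` and `margin_pct` differ, which is why ratios are usually stored as their additive components and computed at query time.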
A transaction fact table gives the most basic view of a business's operations. Each row
represents an event that occurred at a specific point in time. A row exists in the fact table
for a customer or product only when a transaction occurs.
Many rows in a fact table may refer to the same customer or product, because each can be
involved in multiple transactions. Transaction data often fits easily into a dimensional
framework. This lowest-level data is the most naturally dimensional data and supports
analyses that cannot be done on summarized data.
The snapshot fact table describes the state of things at a particular time and contains many
semi-additive and non-additive facts.
Example: The daily balance fact can be summed across the customer dimension but not across
the time dimension.
Periodic snapshots show the performance of the business at regular, predictable time
intervals. Unlike a transaction fact table, where we load a row for each event, with periodic
snapshots we take a picture of the activity at the end of the day, week, or month, and then
another picture at the end of the next period.
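The difference between the two table types can be sketched by deriving daily snapshot rows from transaction rows. The account transactions and the function name `daily_snapshots` are hypothetical sample material for this illustration.

```python
from collections import defaultdict

# Transaction fact rows: one row per event (day, account, amount).
transactions = [
    ("2013-09-01", "A", 100.0),
    ("2013-09-01", "A", -30.0),
    ("2013-09-02", "B", 50.0),
    ("2013-09-03", "A", 20.0),
]


def daily_snapshots(transactions, days):
    """Build one snapshot row per known account per day with its closing balance."""
    by_day = defaultdict(list)
    for day, account, amount in transactions:
        by_day[day].append((account, amount))

    balance = defaultdict(float)
    snapshots = []
    for day in days:
        for account, amount in by_day[day]:
            balance[account] += amount
        # Take the end-of-day picture for every account seen so far.
        for account in sorted(balance):
            snapshots.append((day, account, balance[account]))
    return snapshots


snapshots = daily_snapshots(transactions, ["2013-09-01", "2013-09-02", "2013-09-03"])
```

Account A gets a snapshot row on 2013-09-02 even though it had no transaction that day, which a transaction fact table would not record.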
The accumulated fact table (also called an accumulating snapshot) is used to show the activity
of a process that has a beginning and an end.
For example, consider order processing. An order remains in process until it is complete. As
each step towards completing the order is finished, the corresponding row in the fact table is
updated.
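The order example can be sketched as a single row per order with one date column per milestone, updated in place as the order progresses. The milestone names and the helper functions are assumptions chosen for this sketch.

```python
# Hypothetical milestones of an order fulfilment process.
MILESTONES = ("ordered", "packed", "shipped", "delivered")


def new_order_row(order_id, ordered_on):
    """Create the fact row when the process begins; later dates are unknown."""
    row = {"order_id": order_id}
    row.update({f"{m}_date": None for m in MILESTONES})
    row["ordered_date"] = ordered_on
    return row


def record_milestone(row, milestone, on_date):
    """Update the existing row in place when a step is completed."""
    if milestone not in MILESTONES:
        raise ValueError(f"unknown milestone: {milestone}")
    row[f"{milestone}_date"] = on_date
    return row


def is_complete(row):
    """The process has ended once every milestone date is filled in."""
    return all(row[f"{m}_date"] is not None for m in MILESTONES)


order = new_order_row(1001, "2013-09-01")
record_milestone(order, "packed", "2013-09-02")
record_milestone(order, "shipped", "2013-09-03")
in_progress = is_complete(order)
record_milestone(order, "delivered", "2013-09-05")
finished = is_complete(order)
```

Unlike transaction and periodic snapshot rows, which are only inserted, this row is revisited several times during its life.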
There are also fact tables that contain no measures. These are called factless fact tables.
They are used to capture the occurrence of a business event. For example, a criminal case is
a simple fact with no measures, but it can have many dimensional attributes associated with
it.
Summary tables store data that is aggregated and/or summarized for performance reasons
(i.e., to improve the performance of business queries). Most business queries (i.e.,
approximately 80%) will run against summary tables.
Data is aggregated by combining multiple concepts together and/or combining large amounts
of detailed data together. Most business queries analyze a summarization or aggregation of
data (i.e., facts) across one or more dimensions. Therefore, a summary table may use
multiple dimensions. For example, a table that analyzes accounts by region by customer by
service by month uses four dimensions.
Design Considerations
The main objective when designing summary tables is to minimize the amount of data being
accessed and the number of tables being joined. This is done by storing intermediate query
results.
Examine highly used business queries and problem business queries (i.e., queries that are
slow or consume a lot of resources) and identify aggregation requirements. Define:
∙ Fact data to be summarized. For example, a fact table has the following three
dimensions: Service, Geographical Location, and Time. There are multiple queries that
aggregate the facts by month. The summary table created to meet this requirement has the
following dimensions: Service, Geographical Location, and Month. The summary table time
dimension contains a month (i.e., MM) instead of a date (i.e., YYMMDD). This summary
table reduces the number of rows to be read and the length of the rows.
∙ Attributes (i.e., metrics) in the fact tables that should be aggregated. Examine problem
queries and identify the attributes that are aggregated by the queries. The aggregation will
most probably be across multiple dimensions, and multiple attributes from the same fact table
will also probably have to be aggregated.
∙ Related facts to be aggregated into the same summary table. Examine problem business
queries and for each query identify the facts (i.e., from different fact tables) that are
aggregated by the query. Aggregating multiple facts into the same summary table improves
performance by replacing multiple queries or a complex query containing multiple unions
with a simple query.
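The Service, Geographical Location, and Month example can be sketched as follows. The sample rows, the `amount` metric, and the function name are hypothetical; the sketch also keeps the year in the month key (YYMM rather than MM), which is an assumption made here so that the same month of different years does not collide.

```python
from collections import defaultdict

# Detailed fact rows keyed by service, location, and date (YYMMDD).
facts = [
    {"service": "voice", "location": "Bangalore", "date": "130803", "amount": 10.0},
    {"service": "voice", "location": "Bangalore", "date": "130815", "amount": 5.0},
    {"service": "voice", "location": "Mumbai", "date": "130803", "amount": 7.0},
    {"service": "data", "location": "Bangalore", "date": "130903", "amount": 2.5},
]


def build_monthly_summary(facts):
    """Aggregate daily facts to one row per (service, location, month)."""
    summary = defaultdict(float)
    for f in facts:
        month = f["date"][:4]  # YYMM portion of YYMMDD
        summary[(f["service"], f["location"], month)] += f["amount"]
    return dict(summary)


monthly_summary = build_monthly_summary(facts)
```

Four detailed rows collapse into three summary rows, and each summary key is shorter than the original date key.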
For each summary table, select the degree of aggregation required. Once facts are aggregated
to a certain level of detail, more detail is not available within that summary table.
To ensure greater flexibility, a rule of thumb is to aggregate to one level of detail greater than
what is required and aggregate up a level when the queries run. However, this approach
should not be used if the number of rows to be aggregated (when the queries are run) will be
large.
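The rule of thumb can be illustrated by storing a summary at month level when queries need quarters, and aggregating up one level when the query runs. The monthly figures and the names `quarter_of` and `rollup_to_quarter` are hypothetical.

```python
from collections import defaultdict

# Summary table stored at month level: (service, YYMM) -> amount.
monthly = {
    ("voice", "1307"): 10.0,
    ("voice", "1308"): 15.0,
    ("voice", "1309"): 5.0,
    ("voice", "1310"): 8.0,
}


def quarter_of(yymm: str) -> str:
    """Map a YYMM month key to a quarter label such as '13Q3'."""
    year, month = yymm[:2], int(yymm[2:])
    return f"{year}Q{(month - 1) // 3 + 1}"


def rollup_to_quarter(monthly):
    """Aggregate the monthly summary up one level at query time."""
    quarterly = defaultdict(float)
    for (service, yymm), amount in monthly.items():
        quarterly[(service, quarter_of(yymm))] += amount
    return dict(quarterly)


quarterly = rollup_to_quarter(monthly)
```

The monthly table still answers monthly questions, while the quarterly result costs only a small aggregation over a few rows. With many rows per quarter this query-time step becomes expensive, which is the caveat stated above.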
Summary tables are recreated on a regular basis, therefore including dimension data in
summary tables is not an issue. To limit the number of table joins (i.e., summary table to
dimension tables), a rule of thumb is to use real world keys in summary tables and not use
generated keys. This approach should not be used if the summarization level of the summary
table is low (i.e., the summary table contains a lot of rows). Another rule of thumb that
minimizes joins is to always store physical dates in summary tables.
Design Indexes
To maximize the performance (and use) of summary tables, the rule of thumb is to index all
access paths to a summary table. Summary tables (i.e., aggregations) will change regularly.
The design, creation, and maintenance of summary tables should be assigned to someone
who is end-user-oriented and doesn't mind change.
It is very easy to create a lot of summary tables. Too many summary tables will increase costs
(e.g., disk space, resources to create the tables, resources to maintain the tables).
1. Store multiple occurrences within a level in one summary table instead of separate
summary tables for each level (e.g., store monthly transactions in one summary table instead
of creating a separate table for each month).
2. Aggregating multiple facts into one summary table considerably increases the row size of
the summary table; therefore only use this approach for highly used queries.
3. Aggregate summary tables to one level of detail greater than the level required.