Unit 4 - Notes-1

Partitioning is essential for managing large fact tables in data warehouses, enhancing performance, and facilitating backup/recovery. Various partitioning strategies include horizontal partitioning by time, dimension-based partitioning, and vertical partitioning through normalization and row splitting. The document also discusses fact tables, dimension tables, types of facts, and design considerations for summary tables to optimize data access and query performance.

Uploaded by

smartbroad26
Copyright
© © All Rights Reserved

UNIT 4

PARTITIONING
Partitioning is done to enhance performance and facilitate easy management of data.
Partitioning also helps in balancing the various requirements of the system. It optimizes
hardware performance and simplifies the management of the data warehouse by splitting each
fact table into multiple separate partitions. In this chapter, we will discuss different
partitioning strategies.

Why is it Necessary to Partition?

Partitioning is important for the following reasons −

∙ For easy management,
∙ To assist backup/recovery,
∙ To enhance performance.
For Easy Management
The fact table in a data warehouse can grow up to hundreds of gigabytes in size. This huge
size of fact table is very hard to manage as a single entity. Therefore it needs partitioning.
To Assist Backup/Recovery
If we do not partition the fact table, then we have to load the complete fact table with all the
data. Partitioning allows us to load only as much data as is required on a regular basis. It
reduces the time to load and also enhances the performance of the system.
Note − To cut down on the backup size, all partitions other than the current partition can be
marked as read-only, so that they cannot be modified. These partitions need to be backed up
only once; after that, only the current partition has to be backed up regularly.
To Enhance Performance
By partitioning the fact table into sets of data, the query procedures can be enhanced. Query
performance is enhanced because now the query scans only those partitions that are relevant.
It does not have to scan the whole data.

TYPES OF PARTITIONING

HORIZONTAL PARTITIONING

There are various ways in which a fact table can be partitioned. In horizontal partitioning, we
have to keep in mind the requirements for manageability of the data warehouse.
1. Partitioning by Time into Equal Segments
In this partitioning strategy, the fact table is partitioned on the basis of time period. Here each
time period represents a significant retention period within the business. For example, if
users query for month-to-date data, then it is appropriate to partition the data into monthly
segments. We can reuse the partitioned tables by removing the data in them.
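A minimal Python sketch of partitioning by time into monthly segments, under stated assumptions: the in-memory `partitions` dictionary, the helper names `partition_key`, `load`, and `month_total`, and the sample rows are all hypothetical and stand in for what a real warehouse would do with the database's own partitioning features. It shows that a month-level query only has to scan the one partition that is relevant.

```python
from collections import defaultdict
from datetime import date

# Hypothetical in-memory "fact table": one list of rows per monthly partition.
partitions = defaultdict(list)

def partition_key(sales_date):
    """Map a date to its monthly partition, e.g. 2013-09-03 -> '2013-09'."""
    return f"{sales_date.year:04d}-{sales_date.month:02d}"

def load(row):
    """Route a fact row to the partition for its month."""
    partitions[partition_key(row["sales_date"])].append(row)

def month_total(year, month):
    """Sum 'value' for one month by scanning only that month's partition."""
    key = f"{year:04d}-{month:02d}"
    return sum(r["value"] for r in partitions.get(key, []))

for row in [
    {"product_id": 30, "value": 3.67, "sales_date": date(2013, 8, 3)},
    {"product_id": 35, "value": 5.33, "sales_date": date(2013, 9, 3)},
    {"product_id": 40, "value": 2.50, "sales_date": date(2013, 9, 3)},
]:
    load(row)

# Only the '2013-09' partition is read to answer this query.
september = month_total(2013, 9)
```

Reusing a partition, as described above, would correspond to clearing one list and assigning it to a new month.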
2. Partition by Time into Different-sized Segments
This kind of partitioning is used where aged data is accessed infrequently. It is implemented
as a set of small partitions for relatively current data and larger partitions for inactive data.
Points to Note
∙ The detailed information remains available online.
∙ The number of physical tables is kept relatively small, which reduces the
operating cost.
∙ This technique is suitable where a mix of data dipping into recent history and data
mining through the entire history is required.
∙ This technique is not useful where the partitioning profile changes on a regular
basis, because repartitioning will increase the operating cost of the data
warehouse.

3. Partition on a Different Dimension


The fact table can also be partitioned on the basis of dimensions other than time such as
product group, region, supplier, or any other dimension. Let's have an example.
Suppose a marketing function has been structured into distinct regional departments, for
example on a state-by-state basis. If each region wants to query information captured within
its own region, it is more effective to partition the fact table into regional partitions. This
speeds up the queries because they do not need to scan information that is not
relevant.
Points to Note
∙ The query does not have to scan irrelevant data which speeds up the query
process.
∙ This technique is not appropriate where the dimension is likely to change in
future. So, it is worth confirming that the dimension will not change in
future.
∙ If the dimension changes, then the entire fact table would have to be
repartitioned.
Note − We recommend partitioning only on the basis of the time dimension, unless
you are certain that the suggested dimension grouping will not change within the life of the
data warehouse.
4. Partition by Size of Table
When there is no clear basis for partitioning the fact table on any dimension, we should
partition the fact table on the basis of its size. We can set a predetermined size as the
critical point. When the table exceeds the predetermined size, a new table partition is created.
Points to Note
∙ This partitioning is complex to manage.
∙ It requires metadata to identify what data is stored in each partition.
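The two points above can be sketched in Python. This is only an illustration: the threshold `MAX_ROWS`, the `metadata` layout, and the helper names `insert` and `find_partition` are assumptions, and the sketch further assumes that rows arrive with ascending row ids. It shows a new partition being opened when the current one reaches the predetermined size, and metadata being used to find which partition holds a given row.

```python
MAX_ROWS = 2  # hypothetical size threshold per partition

partitions = [[]]   # list of partitions; the last one is the current partition
metadata = []       # one entry per partition: lowest and highest row id stored

def insert(row_id, row):
    """Append a row, opening a new partition when the current one is full."""
    if len(partitions[-1]) >= MAX_ROWS:
        partitions.append([])
    current = len(partitions) - 1
    partitions[current].append((row_id, row))
    if current == len(metadata):
        # First row of a new partition: create its metadata entry.
        metadata.append({"partition": current, "min_id": row_id, "max_id": row_id})
    else:
        metadata[current]["max_id"] = row_id

def find_partition(row_id):
    """Use the metadata to locate the partition that holds a given row id."""
    for entry in metadata:
        if entry["min_id"] <= row_id <= entry["max_id"]:
            return entry["partition"]
    return None

for i in range(1, 6):
    insert(i, {"value": i * 10})
```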

5. Partitioning Dimensions
If a dimension contains a large number of entries, then it may be necessary to partition the
dimension. Here we have to check the size of the dimension.
Consider a large dimension that changes over time. If we need to store all the variations in
order to make comparisons, that dimension may become very large. This would definitely affect
the response time.
6. Round Robin Partitions
In the round robin technique, when a new partition is needed, the oldest one is archived. It
uses metadata to allow the user access tool to refer to the correct table partition.
This technique makes it easy to automate table management facilities within the data
warehouse.
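A minimal sketch of the round robin idea in Python, assuming a fixed number of online partitions. The limit `MAX_ACTIVE`, the partition names, and the helpers `add_partition` and `current_partition` are hypothetical; the `active` queue plays the role of the metadata that tells an access tool which partition to use.

```python
from collections import deque

MAX_ACTIVE = 3  # hypothetical number of partitions kept online

active = deque()   # names of online partitions, oldest first
archived = []      # names of partitions that have been archived

def add_partition(name):
    """Bring a new partition online, archiving the oldest if the ring is full."""
    if len(active) == MAX_ACTIVE:
        archived.append(active.popleft())
    active.append(name)

def current_partition():
    """Metadata lookup an access tool could use to find the newest partition."""
    return active[-1]

for month in ["2013-06", "2013-07", "2013-08", "2013-09"]:
    add_partition(month)
```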

VERTICAL PARTITION

Vertical partitioning splits the data vertically, that is, by columns rather than by rows.
[Figure: how vertical partitioning is done]

Vertical partitioning can be performed in the following two ways −


∙ Normalization
∙ Row Splitting

1. Normalization
Normalization is the standard relational method of database organization. In this method,
duplicate rows are collapsed into a single row, which reduces space. Take a look at the
following tables that show how normalization is performed.
Table before Normalization

Product_id  Qty  Value  sales_date  Store_id  Store_name  Location   Region
30          5    3.67   3-Aug-13    16        sunny       Bangalore  S
35          4    5.33   3-Sep-13    16        sunny       Bangalore  S
40          5    2.50   3-Sep-13    64        san         Mumbai     W
45          7    5.66   3-Sep-13    16        sunny       Bangalore  S

Table after Normalization

Store_id  Store_name  Location   Region
16        sunny       Bangalore  S
64        san         Mumbai     W

Product_id  Qty  Value  sales_date  Store_id
30          5    3.67   3-Aug-13    16
35          4    5.33   3-Sep-13    16
40          5    2.50   3-Sep-13    64
45          7    5.66   3-Sep-13    16
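The normalization step can be sketched in a few lines of Python. The row values are taken from the table before normalization; the variable names `sales`, `stores`, and `sales_norm` are chosen here only for illustration. Repeated store details collapse into one row per store, and each sales row keeps only the Store_id as a reference.

```python
# Rows of the table before normalization:
# (product_id, qty, value, sales_date, store_id, store_name, location, region)
sales = [
    (30, 5, 3.67, "3-Aug-13", 16, "sunny", "Bangalore", "S"),
    (35, 4, 5.33, "3-Sep-13", 16, "sunny", "Bangalore", "S"),
    (40, 5, 2.50, "3-Sep-13", 64, "san", "Mumbai", "W"),
    (45, 7, 5.66, "3-Sep-13", 16, "sunny", "Bangalore", "S"),
]

stores = {}      # store_id -> (store_name, location, region), one row per store
sales_norm = []  # sales rows that keep only the store_id as a reference

for product_id, qty, value, sales_date, store_id, name, location, region in sales:
    stores[store_id] = (name, location, region)   # repeated store rows collapse
    sales_norm.append((product_id, qty, value, sales_date, store_id))
```

Store 16 appears three times in the original rows but only once in `stores`, which is where the space saving comes from.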

2. Row Splitting
Row splitting tends to leave a one-to-one map between partitions. The motive of row splitting
is to speed up access to a large table by reducing its size.
Note − While using vertical partitioning, make sure that there is no requirement to perform a
major join operation between two partitions.
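A minimal sketch of row splitting in Python, assuming a wide table in which a few columns are read often and the rest rarely. The sample rows, the `HOT_COLUMNS` choice, and the helper `full_row` are hypothetical. Each original row ends up as exactly one row in each partition (the one-to-one map), linked by the shared key.

```python
# Hypothetical wide rows: a few frequently read columns plus a bulky, rarely read one.
customers = [
    {"customer_id": 1, "name": "Asha", "city": "Mumbai", "notes": "long free text ..."},
    {"customer_id": 2, "name": "Ravi", "city": "Bangalore", "notes": "more free text ..."},
]

HOT_COLUMNS = ("name", "city")  # assumed to be the frequently accessed columns

hot = {}   # customer_id -> frequently accessed columns
cold = {}  # customer_id -> remaining columns

for row in customers:
    key = row["customer_id"]
    hot[key] = {c: row[c] for c in HOT_COLUMNS}
    cold[key] = {c: v for c, v in row.items()
                 if c != "customer_id" and c not in HOT_COLUMNS}

def full_row(key):
    """Rejoin both halves; this is the join to avoid on frequent access paths."""
    return {"customer_id": key, **hot[key], **cold[key]}
```

Queries that only need the name and city read the smaller `hot` partition; `full_row` illustrates the join that the note above warns against needing too often.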
What is a Fact Table?
A fact table is a primary table in a dimensional model.

A Fact Table contains

1. Measurements/facts
2. Foreign keys to dimension tables

What is a Dimension Table?

∙ A dimension table contains dimensions of a fact.
∙ They are joined to the fact table via a foreign key.
∙ Dimension tables are de-normalized tables.
∙ The dimension attributes are the various columns in a dimension table.
∙ Dimensions offer descriptive characteristics of the facts with the help of their
attributes.
∙ There is no set limit on the number of dimensions.
∙ A dimension can also contain one or more hierarchical relationships.
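The relationship between a fact table and a dimension table can be sketched with Python's built-in `sqlite3` module. The table names `fact_sales` and `dim_store`, their columns, and the sample rows are assumptions made for this example. The fact table holds the measures (`qty`, `value`) and a foreign key; the dimension table holds the descriptive attributes that queries group and filter by.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_store (
        store_id   INTEGER PRIMARY KEY,
        store_name TEXT,
        location   TEXT,
        region     TEXT
    );
    CREATE TABLE fact_sales (
        product_id INTEGER,
        store_id   INTEGER REFERENCES dim_store(store_id),
        qty        INTEGER,
        value      REAL
    );
    INSERT INTO dim_store VALUES (16, 'sunny', 'Bangalore', 'S'),
                                 (64, 'san',   'Mumbai',    'W');
    INSERT INTO fact_sales VALUES (30, 16, 5, 3.67), (35, 16, 4, 5.33),
                                  (40, 64, 5, 2.50), (45, 16, 7, 5.66);
""")

# Aggregate the measure (qty) grouped by a descriptive dimension attribute.
qty_by_location = dict(conn.execute("""
    SELECT d.location, SUM(f.qty)
    FROM fact_sales AS f
    JOIN dim_store AS d ON d.store_id = f.store_id
    GROUP BY d.location
"""))
```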
Fact Table vs Dimension Table

Parameter | Fact Table | Dimension Table
Definition | Measurements, metrics, or facts about a business process. | Companion table to the fact table; contains descriptive attributes used to constrain queries.
Characteristic | Located at the center of a star or snowflake schema and surrounded by dimensions. | Connected to the fact table and located at the edges of the star or snowflake schema.
Design | Defined by its grain, that is, its most atomic level. | Should be wordy, descriptive, complete, and quality assured.
Task | A measurable event for which dimension table data is collected; used for analysis and reporting. | Collection of reference information about a business.
Type of Data | Contains information such as sales against a set of dimensions such as Product and Date. | Every dimension table contains attributes that describe the details of the dimension. E.g., a Product dimension can contain Product ID, Product Category, etc.
Key | The primary key in the fact table is made up of foreign keys to the dimensions. | Has a primary key column that uniquely identifies each dimension row.
Storage | Stores detailed atomic data. | Stores report labels and filter domain values.
Hierarchy | Does not contain hierarchies. | Contains hierarchies. For example, Location could contain country, state, city, pin code, etc.

Types of Facts Table

The fact table is the central table in a dimensional schema. It is found at the centre of a
star schema or snowflake schema and is surrounded by dimension tables. It contains the facts
of a particular business process, such as sales revenue by month. Facts are also known as
measurements or metrics; each fact captures a measurement or a metric. The fact table is an
essential concept in data warehousing and BI certification.

The fact table stores the quantitative information used for analysis, often in denormalized
form. It is the primary table in the dimensional model and holds measurements, metrics, and
other quantitative information.
Types of Facts

There are three types of facts:

1. Additive (summative) facts: Additive facts can be aggregated across all dimensions with
functions such as sum(), average(), etc.
2. Semi-additive (semi-summative) facts: Only some aggregation functions apply to
semi-additive facts. For example, consider bank account details. Applying sum() to a bank
balance over time does not give useful results, but the minimum() and maximum() functions
return useful information.
3. Non-additive facts: We cannot use numerical aggregation functions such as sum() or
average() on non-additive facts. A ratio or a percentage is a typical non-additive fact.
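The three kinds of facts can be illustrated with a short Python sketch. The sample data (`balances`, `sales`) and all variable names are hypothetical. The sketch shows which aggregations are meaningful for each kind of fact, not a definitive way of storing them.

```python
# Hypothetical daily snapshot of account balances: (day, account, balance).
balances = [
    ("Mon", "A", 100.0), ("Mon", "B", 50.0),
    ("Tue", "A", 120.0), ("Tue", "B", 30.0),
]
# Hypothetical sales rows: (units_sold, revenue).
sales = [(2, 20.0), (3, 45.0)]

# Additive: revenue can be summed across every dimension.
total_revenue = sum(revenue for _, revenue in sales)

# Semi-additive: balances can be summed across accounts for a single day ...
monday_total = sum(b for day, _, b in balances if day == "Mon")
# ... but across days only functions such as min() and max() are meaningful.
max_balance_a = max(b for _, acct, b in balances if acct == "A")

# Non-additive: a ratio such as unit price is recomputed from additive totals
# instead of being summed or averaged row by row.
total_units = sum(units for units, _ in sales)
average_unit_price = total_revenue / total_units
```

Summing account A's balance over Monday and Tuesday would give 220.0, a number that corresponds to no real amount of money, which is why the balance is only semi-additive.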

Types of Fact Table

1. Transaction Fact Table

The transaction fact table is the most basic way of recording the operations of a business.
These fact tables represent an event that occurs at a point in time. A row exists in the fact
table for the customer or product only when a transaction occurs.

Many rows in a fact table can connect to the same customer or product because they are
involved in multiple transactions. Transaction data often fits easily into a dimensional
framework. This lowest-level data is the most naturally dimensional data and supports
analyses that cannot be done on summarized data.

2. Snapshot Fact Table

The snapshot fact table describes the state of things at a particular time and contains many
semi-additive and non-additive facts.

Example: A daily balance fact can be summed across the customer dimension but not across the
time dimension.

Periodic snapshots capture the performance of the business at regular, predictable time
intervals. Unlike a transaction fact table, where we load a row for each event, with periodic
snapshots we take a picture of the activity at the end of the day, week, or month, and then
another picture at the end of the next period.

Example: Performance summary of a salesman during the previous month.


3. Accumulating Fact Table

The accumulating fact table is used to show the activity of a process that has a beginning
and an end.

For example, consider the processing of an order. An order remains in process until it has
been completed. As each step towards completing the order is finished, the corresponding row
in the fact table is updated.
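The order example can be sketched in Python as follows. The milestone names (`ordered`, `shipped`, `delivered`) and the helper functions are hypothetical; the point is only that each order has a single row whose milestone columns are updated in place as the process moves forward.

```python
from datetime import date

# One row per order; milestone dates start empty and are filled in over time.
orders = {}

def open_order(order_id, ordered_on):
    """Insert the single row for a new order."""
    orders[order_id] = {"ordered": ordered_on, "shipped": None, "delivered": None}

def record_milestone(order_id, milestone, when):
    """Update the existing row in place instead of inserting a new one."""
    orders[order_id][milestone] = when

def is_complete(order_id):
    """An order is complete once every milestone date has been filled in."""
    return all(v is not None for v in orders[order_id].values())

open_order(1001, date(2013, 9, 3))
record_milestone(1001, "shipped", date(2013, 9, 5))
in_progress = not is_complete(1001)
record_milestone(1001, "delivered", date(2013, 9, 8))
```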

Factless Fact Tables

There are also transaction fact tables that contain no measures. We call them factless fact
tables. These tables are used to capture the occurrence of an event in the business process.
For example, a criminal case is a simple fact with no measures, but it can have many
dimensional attributes associated with it.
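A minimal sketch of a factless fact table in Python, loosely following the criminal case example. The key names and values are hypothetical. Each row holds only dimension keys, so analysis consists of counting rows.

```python
from collections import Counter

# Each row records that an event happened and holds only dimension keys;
# there is no numeric measure column.
case_facts = [
    {"date_key": 20130903, "location_key": 1, "case_type_key": 10},
    {"date_key": 20130903, "location_key": 2, "case_type_key": 10},
    {"date_key": 20130904, "location_key": 1, "case_type_key": 20},
]

# Analysis is done by counting rows per dimension value.
cases_per_location = Counter(row["location_key"] for row in case_facts)
cases_per_day = Counter(row["date_key"] for row in case_facts)
```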

DESIGN SUMMARY TABLE

Summary tables store data that is aggregated and/or summarized for performance reasons
(i.e., to improve the performance of business queries). Most business queries (i.e.,
approximately 80%) will run against summary tables.
Data is aggregated by combining multiple concepts together and/or combining large amounts
of detailed data together. Most business queries analyze a summarization or aggregation of
data (i.e., facts) across one or more dimensions. Therefore, a summary table may use
multiple dimensions. For example, a table that analyzes accounts by region by customer by
service by month uses four dimensions.

Design Considerations

The main objective when designing summary tables is to minimize the amount of data being
accessed and the number of tables being joined. This is done by storing intermediate query
results, such as:

1. Summaries of large amounts of data (e.g., summing product inventory by quarter),
2. Combinations of multiple concepts (e.g., sales by customer by market),
3. Reference data (e.g., product description).

Identify What to Aggregate

Examine highly used business queries and problem business queries (i.e., queries that are
slow or consume a lot of resources) and identify aggregation requirements. Define:

Fact data to be summarized. For example, a fact table has the following three
dimensions: Service, Geographical Location, and Time. There are multiple queries that
aggregate the facts by month. The summary table created to meet this requirement has the
following dimensions: Service, Geographical Location, and Month. The summary table time
dimension contains a month (i.e., MM) instead of a date (i.e., YYMMDD). This summary
table reduces the number of rows to be read and the length of the rows.

Attributes (i.e., metrics) in the fact tables that should be aggregated. Examine problem
queries and identify the attributes that are aggregated by the queries. The aggregation will
most probably be across multiple dimensions, and multiple attributes from the same fact table
will also probably have to be aggregated.

Related facts to be aggregated into the same summary table. Examine problem business
queries and for each query identify the facts (i.e., from different fact tables) that are
aggregated by the query. Aggregating multiple facts into the same summary table improves
performance by replacing multiple queries or a complex query containing multiple unions
with a simple query.
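The first point above, summarizing fact data by month, can be sketched in Python. The fact rows and the variable names are hypothetical; the date format follows the YYMMDD convention mentioned in the text. The summary keeps Service and Geographical Location but replaces the date with the month, so it has fewer and shorter rows than the fact table.

```python
from collections import defaultdict

# Hypothetical fact rows: (service, location, date as YYMMDD, amount).
fact_rows = [
    ("broadband", "south", "130803", 10.0),
    ("broadband", "south", "130817", 15.0),
    ("broadband", "south", "130903", 20.0),
    ("mobile",    "west",  "130903", 5.0),
]

# Summary table keyed by (service, location, month): the day is dropped.
summary = defaultdict(float)
for service, location, yymmdd, amount in fact_rows:
    month = yymmdd[2:4]          # keep only MM
    summary[(service, location, month)] += amount
```

Four fact rows become three summary rows here; with real volumes (many transactions per day) the reduction is far larger. Note that keeping only MM merges the same month of different years, so a summary spanning more than one year would need the year in the key as well.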

Identify How Much to Aggregate

For each summary table, select the degree of aggregation required. Once facts are aggregated
to a certain level of detail, any finer detail is no longer available within that summary table.

To ensure greater flexibility, a rule of thumb is to aggregate to one level of detail greater than
what is required and aggregate up a level when the queries run. However, this approach
should not be used if the number of rows to be aggregated (when the queries are run) will be
large.
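The rule of thumb above can be sketched in Python, assuming that queries need quarterly totals and that the summary table is therefore stored at month level, one level more detailed. The data and the helper `quarterly_totals` are hypothetical.

```python
from collections import defaultdict

# Hypothetical summary table stored at month level, one level more detailed
# than the quarterly level the queries need: (year, month) -> total.
monthly_summary = {
    (2013, 7): 100.0, (2013, 8): 120.0, (2013, 9): 90.0,
    (2013, 10): 110.0,
}

def quarterly_totals(monthly):
    """Aggregate up one level (month -> quarter) when the query runs."""
    totals = defaultdict(float)
    for (year, month), amount in monthly.items():
        quarter = (month - 1) // 3 + 1
        totals[(year, quarter)] += amount
    return dict(totals)

by_quarter = quarterly_totals(monthly_summary)
```

The same monthly table can still answer month-level questions, which is the flexibility the rule of thumb is after. The roll-up touches at most three rows per quarter, so the query-time cost stays small; with many rows per group this approach would no longer pay off, as the text warns.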

Select the Level of Denormalization

Summary tables are recreated on a regular basis, therefore including dimension data in
summary tables is not an issue. To limit the number of table joins (i.e., summary table to
dimension tables), a rule of thumb is to use real world keys in summary tables and not use
generated keys. This approach should not be used if the summarization level of the summary
table is low (i.e., the summary table contains a lot of rows). Another rule of thumb that
minimizes joins is to always store physical dates in summary tables.

Design Indexes

To maximize the performance (and use) of summary tables, the rule of thumb is to index all
access paths to a summary table. Summary tables (i.e., aggregations) will change regularly.
The design, creation, and maintenance of summary tables should be assigned to someone
who is end-user-oriented and doesn't mind change.
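Indexing the access paths of a summary table can be sketched with Python's built-in `sqlite3` module. The table `summary_sales`, its columns, the index names, and the assumption that queries filter by region or by month are all made up for this example; a real design would derive the access paths from the actual queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE summary_sales (region TEXT, month TEXT, total REAL);
    INSERT INTO summary_sales VALUES ('S', '08', 25.0), ('S', '09', 20.0),
                                     ('W', '09', 5.0);
    -- One index per access path: queries filter by region or by month.
    CREATE INDEX idx_summary_region ON summary_sales (region);
    CREATE INDEX idx_summary_month  ON summary_sales (month);
""")

# Ask SQLite how it would run a query that filters on region.
plan = [row[-1] for row in conn.execute(
    "EXPLAIN QUERY PLAN SELECT total FROM summary_sales WHERE region = ?", ("S",)
)]
rows = conn.execute(
    "SELECT month, total FROM summary_sales WHERE region = ? ORDER BY month", ("S",)
).fetchall()
```

Printing `plan` shows whether the query uses `idx_summary_region` or falls back to a full table scan; the exact wording of the plan text depends on the SQLite version.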

It is very easy to create a lot of summary tables. Too many summary tables will increase costs
(e.g., disk space, resources to create the tables, resources to maintain the tables).

Consider the following when deciding to create summary tables:

1. Store multiple occurrences within a level in one summary table instead of separate
summary tables for each level (e.g., store monthly transactions in one summary table instead
of creating a separate table for each month).

2. Aggregating multiple facts into one summary table considerably increases the row size of
the summary table; therefore only use this approach for highly used queries.

3. Aggregate summary tables to one level of detail greater than the level required.
