
CCS341 - DW

UNIT 3 - METADATA, DATA MART AND PARTITION STRATEGY

Metadata

Metadata is simply defined as data about data. The data that is used to
represent other data is known as metadata. For example, the index of a book
serves as metadata for the contents of the book. In other words, we can say
that metadata is the summarized data that leads us to the detailed data. In terms
of a data warehouse, we can define metadata as follows.

● Metadata is the road-map to a data warehouse.


● Metadata in a data warehouse defines the warehouse objects.
● Metadata acts as a directory. This directory helps the decision support
system to locate the contents of a data warehouse.

Note − In a data warehouse, we create metadata for the data names and
definitions of a given data warehouse. Along with this metadata, additional
metadata is also created for time-stamping any extracted data and for recording
the source of the extracted data.

Categories of Metadata
Metadata can be broadly categorized into three categories −

● Business Metadata − It contains the data ownership information, business
definitions, and change policies.
● Technical Metadata − It includes database system names, table and
column names and sizes, data types and allowed values. Technical
metadata also includes structural information such as primary and
foreign key attributes and indices.
● Operational Metadata − It includes the currency of data and data lineage.
Currency of data means whether the data is active, archived, or purged.
Lineage of data means the history of the data migrated and the transformations
applied to it. (A small sketch of these three categories follows below.)
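As a rough illustration only (not part of the standard definitions above), the three categories could be modeled as simple Python records; the field names used here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class BusinessMetadata:
    owner: str            # data ownership information
    definition: str       # business definition of the element
    change_policy: str    # policy governing changes

@dataclass
class TechnicalMetadata:
    database: str         # database system name
    table: str            # table name
    column: str           # column name
    data_type: str        # data type, size, allowed values
    is_primary_key: bool  # structural information
    indexed: bool

@dataclass
class OperationalMetadata:
    status: str           # currency: "active", "archived" or "purged"
    lineage: list         # history of migrations/transformations applied

# Describing one warehouse column from all three viewpoints.
sales_amount = (
    BusinessMetadata("Sales dept", "Total value of a sale transaction", "review quarterly"),
    TechnicalMetadata("DW_PROD", "FACT_SALES", "sales_amount", "DECIMAL(10,2)", False, True),
    OperationalMetadata("active", ["extracted from OLTP", "converted to INR"]),
)
```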
Role of Metadata
Metadata has a very important role in a data warehouse. Although metadata is
different from the warehouse data itself, it plays an important role. The various
roles of metadata are explained below.

● Metadata acts as a directory. This directory helps the decision support
system to locate the contents of the data warehouse.
● Metadata helps in decision support system for mapping of data when
data is transformed from operational environment to data warehouse
environment.
● Metadata helps in summarization between current detailed data and
highly summarized data.
● Metadata also helps in summarization between lightly summarized data and
highly summarized data.
● Metadata is used for query tools.
● Metadata is used in extraction and cleansing tools.
● Metadata is used in reporting tools.
● Metadata is used in transformation tools.
● Metadata plays an important role in loading functions.
The following diagram shows the roles of metadata.

Metadata Repository
Metadata repository is an integral part of a data warehouse system. It has the
following metadata −

● Definition of data warehouse − It includes the description of the structure
of the data warehouse. The description is defined by the schema, views,
hierarchies, derived data definitions, and data mart locations and
contents.
● Business metadata − It contains the data ownership information,
business definitions, and change policies.
● Operational Metadata − It includes the currency of data and data lineage.
Currency of data means whether the data is active, archived, or purged.
Lineage of data means the history of the data migrated and the transformations
applied to it.
● Data for mapping from the operational environment to the data
warehouse − It includes the source databases and their contents, data
extraction, data partitioning, data cleaning, transformation rules, and data
refresh and purging rules.
● Algorithms for summarization − It includes dimension algorithms,
data on granularity, aggregation, summarizing, etc.

Challenges for Metadata Management


The importance of metadata cannot be overstated. Metadata helps in driving
the accuracy of reports, validates data transformations, and ensures the
accuracy of calculations. Metadata also enforces the definitions of business
terms for business end-users. With all these uses of metadata, it also has its
challenges. Some of the challenges are discussed below.

● Metadata in a big organization is scattered across the organization. This


metadata is spread in spreadsheets, databases, and applications.
● Metadata could be present in text files or multimedia files. To use this
data for information management solutions, it has to be correctly
defined.
● There are no industry-wide accepted standards. Data management
solution vendors have a narrow focus.
● There are no easy and accepted methods of passing metadata.

Why Do We Need a Data Mart?


Listed below are the reasons to create a data mart −
● To partition data in order to impose access control strategies.
● To speed up the queries by reducing the volume of data to be scanned.
● To segment data into different hardware platforms.
● To structure data in a form suitable for a user access tool.

Note − Do not create a data mart for any other reason, since the operational cost
of data marting could be very high. Before data marting, make sure that the data
marting strategy is appropriate for your particular solution.

Cost-effective Data Marting


Follow the steps given below to make data marting cost-effective −
● Identify the Functional Splits
● Identify User Access Tool Requirements
● Identify Access Control Issues
Identify the Functional Splits
In this step, we determine whether the organization has natural functional splits. We
look for departmental splits, and we determine whether the way in which
departments use information tends to be in isolation from the rest of the
organization. Let's look at an example.
Consider a retail organization, where each merchant is accountable for
maximizing the sales of a group of products. For this, the following information
is valuable −
● sales transaction on a daily basis
● sales forecast on a weekly basis
● stock position on a daily basis
● stock movements on a daily basis
As the merchant is not interested in the products they are not dealing with, the
data mart is a subset of the data dealing with the product group of
interest. The following diagram shows data marting for different users.
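A minimal Python sketch of this idea, using made-up rows and field names: the merchant's data mart is simply the subset of fact rows for the product group of interest.

```python
# Hypothetical warehouse fact rows; only the fields needed for the example.
warehouse_sales = [
    {"product_group": "beverages", "product_id": 30, "qty": 5, "sales_date": "2013-08-03"},
    {"product_group": "snacks",    "product_id": 35, "qty": 4, "sales_date": "2013-09-03"},
    {"product_group": "beverages", "product_id": 45, "qty": 7, "sales_date": "2013-09-03"},
]

def build_data_mart(fact_rows, product_group):
    """Return only the rows the merchant is responsible for."""
    return [row for row in fact_rows if row["product_group"] == product_group]

beverages_mart = build_data_mart(warehouse_sales, "beverages")
print(beverages_mart)   # two rows: the merchant never scans the other groups
```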
Given below are the issues to be taken into account while determining the
functional split −
● The structure of the department may change.
● The products might switch from one department to other.
● The merchant could query the sales trend of other products to analyze
what is happening to the sales.

Note − We need to determine the business benefits and technical feasibility of


using a data mart.

Identify User Access Tool Requirements

We need data marts to support user access tools that require internal data
structures. The data in such structures is outside the control of the data
warehouse but needs to be populated and updated on a regular basis.

There are some tools that can populate directly from the source system, but some
cannot. Therefore, additional requirements outside the scope of the tool need to
be identified for the future.

Note − In order to ensure consistency of data across all access tools, the data
should not be directly populated from the data warehouse, rather each tool
must have its own data mart.

Identify Access Control Issues


There should be privacy rules to ensure the data is accessed by authorized
users only. For example, a data warehouse for a retail banking institution
ensures that all the accounts belong to the same legal entity. Privacy laws can
force you to totally prevent access to information that is not owned by the
specific bank.
Data marts allow us to build a complete wall by physically separating data
segments within the data warehouse. To avoid possible privacy problems, the
detailed data can be removed from the data warehouse. We can create a data
mart for each legal entity and load it via the data warehouse, with detailed
account data.

Designing Data Marts


Data marts should be designed as a smaller version of the starflake schema
within the data warehouse and should match the database design of the
data warehouse. This helps in maintaining control over database instances.

The summaries are data marted in the same way as they would have been
designed within the data warehouse. Summary tables help to utilize all
dimension data in the starflake schema.

Cost of Data Marting


The cost measures for data marting are as follows −
● Hardware and Software Cost
● Network Access
● Time Window Constraints
Hardware and Software Cost
Although data marts are created on the same hardware, they require some
additional hardware and software. To handle user queries, additional
processing power and disk storage are required. If detailed data and the data
mart exist within the data warehouse, then we face an additional cost to
store and manage the replicated data.

Note − Data marting is more expensive than aggregations, therefore it should


be used as an additional strategy and not as an alternative strategy.

Network Access

A data mart could be at a different location from the data warehouse, so we
should ensure that the LAN or WAN has the capacity to handle the data
volumes being transferred within the data mart load process.

Time Window Constraints


The extent to which a data mart loading process will eat into the available time
window depends on the complexity of the transformations and the data
volumes being shipped. The determination of how many data marts are
possible depends on −
● Network capacity.
● Time window available
● Volume of data being transferred
● Mechanisms being used to insert data into a data mart

Partitioning Strategy

Partitioning is done to enhance performance and facilitate easy management
of data. Partitioning also helps in balancing the various requirements of the
system. It optimizes the hardware performance and simplifies the
management of the data warehouse by partitioning each fact table into multiple
separate partitions. In this unit, we will discuss different partitioning
strategies.

Why is it Necessary to Partition?


Partitioning is important for the following reasons −
● For easy management,
● To assist backup/recovery,
● To enhance performance.
For Easy Management
The fact table in a data warehouse can grow up to hundreds of gigabytes in
size. A fact table of this size is very hard to manage as a single entity;
therefore, it needs partitioning.
To Assist Backup/Recovery
If we do not partition the fact table, then we have to load the complete fact
table with all the data. Partitioning allows us to load only as much data as is
required on a regular basis. It reduces the time to load and also enhances the
performance of the system.

Note − To cut down on the backup size, all partitions other than the current
partition can be marked as read-only. We can then put these partitions into a
state where they cannot be modified, and then they can be backed up once. This
means that only the current partition has to be backed up regularly.

To Enhance Performance
By partitioning the fact table into sets of data, the query procedures can be
enhanced. Query performance is enhanced because now the query scans
only those partitions that are relevant. It does not have to scan the whole data.

Horizontal Partitioning
There are various ways in which a fact table can be partitioned. In horizontal
partitioning, we have to keep in mind the requirements for manageability of
the data warehouse.
Partitioning by Time into Equal Segments

In this partitioning strategy, the fact table is partitioned on the basis of time
period. Here each time period represents a significant retention period within
the business. For example, if the user queries for month-to-date data, then it
is appropriate to partition the data into monthly segments. We can reuse the
partitioned tables by removing the data in them.
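A small illustrative sketch, assuming a monthly retention period: each fact date is mapped to the name of the monthly partition that should hold it (the naming convention is hypothetical).

```python
from datetime import date

def monthly_partition(fact_date: date) -> str:
    """Map a fact row's date to the name of its monthly partition."""
    return f"sales_{fact_date.year}_{fact_date.month:02d}"

print(monthly_partition(date(2013, 8, 3)))   # sales_2013_08
print(monthly_partition(date(2013, 9, 3)))   # sales_2013_09

def partitions_to_reuse(existing, keep_months):
    """Partitions older than the retention period can be emptied and reused."""
    return sorted(existing)[:-keep_months] if len(existing) > keep_months else []
```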

Partition by Time into Different-sized Segments


This kind of partitioning is done where aged data is accessed infrequently. It
is implemented as a set of small partitions for relatively current data and a larger
partition for inactive data.
Points to Note
● The detailed information remains available online.
● The number of physical tables is kept relatively small, which reduces the
operating cost.
● This technique is suitable where a mix of data dipping into recent history and
data mining through the entire history is required.
● This technique is not useful where the partitioning profile changes on a
regular basis, because repartitioning will increase the operation cost of
the data warehouse.
Partition on a Different Dimension
The fact table can also be partitioned on the basis of dimensions other than
time such as product group, region, supplier, or any other dimension. Let's
have an example.

Suppose a marketing function has been structured into distinct regional
departments, for example on a state-by-state basis. If each region wants to query only
the information captured within its region, it would prove more effective to
partition the fact table into regional partitions. This will speed up the queries
because they do not need to scan information that is not relevant.

Points to Note
● The query does not have to scan irrelevant data which speeds up the
query process.
● This technique is not appropriate where the dimensions are likely to
change in the future. So, it is worth determining that the dimension does not
change in the future.
● If the dimension changes, then the entire fact table would have to be
repartitioned.
Note − We recommend performing the partitioning only on the basis of the time
dimension, unless you are certain that the suggested dimension grouping will
not change within the life of the data warehouse.

Partition by Size of Table

When there is no clear basis for partitioning the fact table on any dimension,
we should partition the fact table on the basis of its size. We can
set a predetermined size as a critical point. When the table exceeds the
predetermined size, a new table partition is created.
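A rough Python sketch of this size-based strategy, with a made-up row threshold: once the current partition crosses the critical point, a new partition is created, and the partition list itself acts as the metadata recording what is stored where.

```python
MAX_ROWS_PER_PARTITION = 1_000_000   # hypothetical critical point

# Metadata: one entry per partition, identifying what it holds.
partitions = [{"name": "fact_part_1", "rows": 0}]

def insert_rows(row_count):
    """Route rows to the current partition, creating a new one past the threshold."""
    current = partitions[-1]
    if current["rows"] + row_count > MAX_ROWS_PER_PARTITION:
        current = {"name": f"fact_part_{len(partitions) + 1}", "rows": 0}
        partitions.append(current)
    current["rows"] += row_count
    return current["name"]

print(insert_rows(800_000))   # fact_part_1
print(insert_rows(500_000))   # fact_part_2 (threshold exceeded)
```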

Points to Note
● This partitioning is complex to manage.
● It requires metadata to identify what data is stored in each partition.
Partitioning Dimensions
If a dimension contains a large number of entries, then it is required to partition
the dimension. Here we have to check the size of the dimension.
Consider a large design that changes over time. If we need to store all the
variations in order to apply comparisons, that dimension may be very large.
This would definitely affect the response time.
Round Robin Partitions
In the round robin technique, when a new partition is needed, the old one is
archived. It uses metadata to allow the user access tool to refer to the correct
table partition.
This technique makes it easy to automate table management facilities within
the data warehouse.
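The following Python sketch illustrates one possible round-robin scheme (table names and the archive step are hypothetical): a fixed set of physical tables is reused in rotation, and the metadata mapping lets a user access tool find the correct table for a logical period.

```python
physical_tables = ["part_a", "part_b", "part_c"]
metadata = {}        # logical period -> physical table
next_slot = 0

def start_new_period(period, archive=print):
    """Reuse physical tables in rotation; archive whatever a table held before."""
    global next_slot
    table = physical_tables[next_slot % len(physical_tables)]
    for old_period, old_table in list(metadata.items()):
        if old_table == table:          # this table is being recycled
            archive(old_period, old_table)
            del metadata[old_period]
    metadata[period] = table
    next_slot += 1
    return table

for month in ["2013-07", "2013-08", "2013-09", "2013-10"]:
    start_new_period(month)
print(metadata)   # access tools consult this mapping to find the right table
```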

Vertical Partition
Vertical partitioning splits the data vertically. The following image depicts
how vertical partitioning is done.
Vertical partitioning can be performed in the following two ways −
● Normalization
● Row Splitting
Normalization
Normalization is the standard relational method of database organization. In
this method, repeated rows (such as the store details below) are collapsed into a
single row in a separate table; hence, it reduces space.
Take a look at the following tables, which show how normalization is performed.
Table before Normalization

Product_id Qty Value sales_date Store_id Store_name Location Region

30 5 3.67 3-Aug-13 16 sunny Bangalore S

35 4 5.33 3-Sep-13 16 sunny Bangalore S

40 5 2.50 3-Sep-13 64 san Mumbai W

45 7 5.66 3-Sep-13 16 sunny Bangalore S

Table after Normalization


Store_id Store_name Location Region

16 sunny Bangalore S

64 san Mumbai W

Product_id Quantity Value sales_date Store_id

30 5 3.67 3-Aug-13 16

35 4 5.33 3-Sep-13 16

40 5 2.50 3-Sep-13 64

45 7 5.66 3-Sep-13 16
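A small Python sketch of the normalization shown in the tables above: the repeated store attributes are moved into a separate store table keyed by Store_id, while the sales rows keep only the key.

```python
flat_rows = [
    (30, 5, 3.67, "3-Aug-13", 16, "sunny", "Bangalore", "S"),
    (35, 4, 5.33, "3-Sep-13", 16, "sunny", "Bangalore", "S"),
    (40, 5, 2.50, "3-Sep-13", 64, "san",   "Mumbai",    "W"),
    (45, 7, 5.66, "3-Sep-13", 16, "sunny", "Bangalore", "S"),
]

stores, sales = {}, []
for pid, qty, value, sales_date, store_id, name, location, region in flat_rows:
    stores[store_id] = (name, location, region)            # one row per store
    sales.append((pid, qty, value, sales_date, store_id))  # fact keeps only the key

print(stores)   # {16: ('sunny', 'Bangalore', 'S'), 64: ('san', 'Mumbai', 'W')}
print(sales)    # store details no longer repeated on every sales row
```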

Row Splitting
Row splitting tends to leave a one-to-one map between partitions. The motive
of row splitting is to speed up access to a large table by reducing its size.

Note − While using vertical partitioning, make sure that there is no


requirement to perform a major join operation between two partitions.

Identify Key to Partition


It is very crucial to choose the right partition key. Choosing a wrong partition
key will lead to reorganizing the fact table. Let's have an example. Suppose
we want to partition the following table.
Account_Txn_Table
transaction_id
account_id
transaction_type
value
transaction_date
region
branch_name
We can choose to partition on any key. The two possible keys could be
● region
● transaction_date
Suppose the business is organized into 30 geographical regions and each
region has a different number of branches. That will give us 30 partitions, which
is reasonable. This partitioning is good enough because our requirements
capture has shown that the vast majority of queries are restricted to the user's
own business region.
If we partition by transaction_date instead of region, then the latest transactions
from every region will be in one partition. Now a user who wants to look at
data within their own region has to query across multiple partitions.
Hence it is worth determining the right partitioning key.
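The following illustrative sketch (with made-up transactions) contrasts the two candidate keys: routing by region keeps a regional user's query inside a single partition, while routing by transaction_date spreads that user's data across many partitions.

```python
txns = [
    {"transaction_id": 1, "region": "North", "transaction_date": "2024-03-01"},
    {"transaction_id": 2, "region": "South", "transaction_date": "2024-03-01"},
    {"transaction_id": 3, "region": "North", "transaction_date": "2024-03-02"},
]

def partition_of(txn, key):
    """Name of the partition a transaction lands in for the chosen key."""
    return f"txn_{txn[key]}"

# Partitioning by region: a North-region query touches a single partition.
print({t["transaction_id"]: partition_of(t, "region") for t in txns})

# Partitioning by date: the same North-region query now spans every date partition.
print({t["transaction_id"]: partition_of(t, "transaction_date") for t in txns})
```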
UNIT IV - DW
DIMENSIONAL MODELING AND SCHEMA
Dimensional modeling represents data with a cube operation, making the logical
data representation more suitable for OLAP data management. The concept of
dimensional modeling was developed by Ralph Kimball and consists
of "fact" and "dimension" tables.
In dimensional modeling, the transaction record is divided into either "facts," which are
frequently numerical transaction data, or "dimensions," which are the reference
information that gives context to the facts. For example, a sale transaction can be
broken down into facts such as the number of products ordered and the price paid for the
products, and into dimensions such as order date, customer name, product number, order
ship-to and bill-to locations, and the salesperson responsible for receiving the order.

Objectives of Dimensional Modeling


The purposes of dimensional modeling are:
1. To produce a database architecture that is easy for end-clients to understand and write
queries against.
2. To maximize the efficiency of queries. It achieves these goals by minimizing the number
of tables and the relationships between them.

Advantages of Dimensional Modeling


Following are the benefits of dimensional modeling:
Dimensional modeling is simple: Dimensional modeling methods make it possible for
warehouse designers to create database schemas that business customers can easily
grasp and comprehend. There is no need for extensive training on how to read diagrams, and
there are no complicated relationships between different data elements.
Dimensional modeling promotes data quality: The star schema enables warehouse
administrators to enforce referential integrity checks on the data warehouse. Since the
fact table key is a concatenation of the keys of its associated dimensions, a
fact record is only loaded if the corresponding dimension records are duly
defined and also exist in the database.
By enforcing foreign key constraints as a form of referential integrity check, data
warehouse DBAs add a line of defense against corrupted warehouse data.
Performance optimization is possible through aggregates: As the size of the data
warehouse increases, performance optimization becomes a pressing concern.
Customers who have to wait for hours to get a response to a query will quickly become
discouraged with the warehouse. Aggregates are one of the easiest methods by which
query performance can be optimized.

Disadvantages of Dimensional Modeling


1. To maintain the integrity of facts and dimensions, loading the data warehouse with
records from various operational systems is complicated.
2. It is difficult to modify the data warehouse operation if the organization adopting the
dimensional technique changes the method in which it does business.

Elements of Dimensional Modeling


Fact
It is a collection of associated data items, consisting of measures and context data. It
typically represents business items or business transactions.

Dimensions
It is a collection of data which describe one business dimension. Dimensions decide the
contextual background for the facts, and they are the framework over which OLAP is
performed.

Measure
It is a numeric attribute of a fact, representing the performance or behavior of the
business relative to the dimensions.
Considering the relational context, there are two basic models which are used in
dimensional modeling:
o Star Model
o Snowflake Model

The star model is the underlying structure for a dimensional model. It has one broad
central table (the fact table) and a set of smaller tables (dimensions) arranged in a radial
design around the central table. The snowflake model is the result of
decomposing one or more of the dimensions.

Fact Table
Fact tables are used to store facts or measures of the business. Facts are the numeric
data elements that are of interest to the company.
Characteristics of the Fact table

The fact table includes numerical values of what we measure. For example, a fact value
of 20 might mean that 20 widgets have been sold.
Each fact table includes the keys to associated dimension tables. These are known as
foreign keys in the fact table.
Fact tables typically include a small number of columns.
When it is compared to dimension tables, fact tables have a large number of rows.

Dimension Table
Dimension tables establish the context of the facts. Dimensional tables store fields that
describe the facts.
Characteristics of the Dimension table

Dimension tables contain the details about the facts. That, as an example, enables the
business analysts to understand the data and their reports better.
The dimension tables include descriptive data about the numerical values in the fact
table. That is, they contain the attributes of the facts. For example, the dimension tables
for a marketing analysis function might include attributes such as time, marketing
region, and product type.
Since the record in a dimension table is denormalized, it usually has a large number of
columns. The dimension tables include significantly fewer rows of information than the
fact table.
The attributes in a dimension table are used as row and column headings in a
document or query results display.
Example: A store summary in a fact table can be viewed by city and state. An item summary can
be viewed by brand, color, etc. Customer information can be viewed by name and
address.
Fact Table

Time ID Product ID Customer ID Unit Sold

4 17 2 1

8 21 3 2

8 4 1 1

In this example, the Customer ID column in the fact table is a foreign key that joins with
the dimension table. By following the links, we can see that row 2 of the fact table
records the fact that customer 3, Gaurav, bought two items on day 8.
Dimension Tables
Customer ID Name Gender Income Education Region

1 Rohan Male 2 3 4

2 Sandeep Male 3 5 1

3 Gaurav Male 1 7 3
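A minimal Python sketch of following the foreign key from the fact table to the customer dimension, using the sample values above.

```python
fact_rows = [
    {"time_id": 4, "product_id": 17, "customer_id": 2, "units_sold": 1},
    {"time_id": 8, "product_id": 21, "customer_id": 3, "units_sold": 2},
    {"time_id": 8, "product_id": 4,  "customer_id": 1, "units_sold": 1},
]
customers = {1: "Rohan", 2: "Sandeep", 3: "Gaurav"}   # dimension lookup by primary key

for row in fact_rows:
    name = customers[row["customer_id"]]   # foreign key resolved against the dimension
    print(f"{name} bought {row['units_sold']} item(s) on day {row['time_id']}")
# One of the lines printed: Gaurav bought 2 item(s) on day 8
```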

Hierarchy
A hierarchy is a directed tree whose nodes are dimensional attributes and whose arcs
model many-to-one associations between dimensional attributes. It contains a
dimension, positioned at the tree's root, and all of the dimensional attributes that describe
it.

What is Multi-Dimensional Data Model?


A multidimensional model views data in the form of a data-cube. A data cube enables
data to be modeled and viewed in multiple dimensions. It is defined by dimensions and
facts.
The dimensions are the perspectives or entities concerning which an organization
keeps records. For example, a shop may create a sales data warehouse to keep
records of the store's sales for the dimensions time, item, and location. These
dimensions allow the store to keep track of things, for example, monthly sales of items
and the locations at which the items were sold. Each dimension has a table related to it,
called a dimensional table, which describes the dimension further. For example, a
dimensional table for an item may contain the attributes item_name, brand, and type.
A multidimensional data model is organized around a central theme, for example, sales.
This theme is represented by a fact table. Facts are numerical measures. The fact table
contains the names of the facts or measures of the related dimensional tables.
Consider the data of a shop for items sold per quarter in the city of Delhi. The data is
shown in the table. In this 2D representation, the sales for Delhi are shown for the time
dimension (organized in quarters) and the item dimension (classified according to the
types of items sold). The fact or measure displayed is rupee_sold (in thousands).

Now, suppose we want to view the sales data with a third dimension. For example, suppose the
data is viewed according to time and item, as well as location, for the cities
Chennai, Kolkata, Mumbai, and Delhi. These 3D data are shown in the table. The 3D
data of the table are represented as a series of 2D tables.
Conceptually, it may also be represented by the same data in the form of a 3D data
cube, as shown in fig:

What is Data Cube?


When data is grouped or combined into multidimensional matrices, these are called data cubes.
The data cube method has a few alternative names or a few variants, such as
"multidimensional databases," "materialized views," and "OLAP (On-Line Analytical
Processing)."
The general idea of this approach is to materialize certain expensive computations that
are frequently queried.
For example, a relation with the schema sales (part, supplier, customer, and sale-price)
can be materialized into a set of eight views as shown in fig, where psc indicates a view
consisting of aggregate function value (such as total-sales) computed by grouping three
attributes part, supplier, and customer, p indicates a view composed of the
corresponding aggregate function values calculated by grouping part alone, etc.
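As a rough sketch of materializing such views (the sample data is made up), the grouped aggregates psc and p can be computed as follows.

```python
from collections import defaultdict

# sales(part, supplier, customer, sale_price)
sales = [
    ("p1", "s1", "c1", 100),
    ("p1", "s2", "c1", 150),
    ("p2", "s1", "c2", 200),
]

def group_total(rows, positions):
    """Total sale_price grouped by the chosen attribute positions."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[i] for i in positions)
        totals[key] += row[3]
    return dict(totals)

psc = group_total(sales, (0, 1, 2))   # grouped by part, supplier, customer
p   = group_total(sales, (0,))        # grouped by part alone
print(p)                              # {('p1',): 250.0, ('p2',): 200.0}
```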

A data cube is created from a subset of attributes in the database. Specific attributes
are chosen to be measure attributes, i.e., the attributes whose values are of interest.
Other attributes are selected as dimensions or functional attributes. The measure
attributes are aggregated according to the dimensions.
For example, XYZ may create a sales data warehouse to keep records of the store's
sales for the dimensions time, item, branch, and location. These dimensions enable the
store to keep track of things like monthly sales of items, and the branches and locations
at which the items were sold. Each dimension may have a table identified with it, known
as a dimensional table, which describes the dimension. For example, a dimension
table for items may contain the attributes item_name, brand, and type.
Data cube method is an interesting technique with many applications. Data cubes could
be sparse in many cases because not every cell in each dimension may have
corresponding data in the database.
Techniques should be developed to handle sparse cubes efficiently.
If a query contains constants at even lower levels than those provided in a data cube, it
is not clear how to make the best use of the precomputed results stored in the data
cube.
The model views data in the form of a data cube. OLAP tools are based on the
multidimensional data model. Data cubes usually model n-dimensional data.
A data cube enables data to be modeled and viewed in multiple dimensions. A
multidimensional data model is organized around a central theme, like sales and
transactions. A fact table represents this theme. Facts are numerical measures. Thus,
the fact table contains measure (such as Rs_sold) and keys to each of the related
dimensional tables.
Dimensions are the entities that define a data cube. Facts are generally quantities, which
are used for analyzing the relationships between dimensions.

Example: In the 2-D representation, we will look at the All Electronics sales data
for items sold per quarter in the city of Vancouver. The measure displayed is dollars
sold (in thousands).
3-Dimensional Cuboids
Let us suppose we would like to view the sales data with a third dimension. For example,
suppose we would like to view the data according to time and item as well as location,
for the cities Chicago, New York, Toronto, and Vancouver. The measure displayed is
dollars sold (in thousands). These 3-D data are shown in the table. The 3-D data of the
table are represented as a series of 2-D tables.

Conceptually, we may represent the same data in the form of 3-D data cubes, as shown
in fig:
Let us suppose that we would like to view our sales data with an additional fourth
dimension, such as a supplier.
In data warehousing, the data cubes are n-dimensional. The cuboid which holds the
lowest level of summarization is called a base cuboid.
For example, the 4-D cuboid in the figure is the base cuboid for the given time, item,
location, and supplier dimensions.
The figure shows a 4-D data cube representation of sales data, according to the
dimensions time, item, location, and supplier. The measure displayed is dollars sold (in
thousands).
The topmost 0-D cuboid, which holds the highest level of summarization, is known as
the apex cuboid. In this example, this is the total sales, or dollars sold, summarized over
all four dimensions.
The lattice of cuboids forms a data cube. The figure shows the lattice of cuboids creating
a 4-D data cube for the dimensions time, item, location, and supplier. Each cuboid
represents a different degree of summarization.
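A small sketch of enumerating that lattice for the four dimensions named above: with n dimensions there are 2**n cuboids, from the 4-D base cuboid down to the 0-D apex cuboid.

```python
from itertools import combinations

dimensions = ("time", "item", "location", "supplier")

# Every subset of the dimensions is one cuboid in the lattice.
cuboids = [combo for r in range(len(dimensions), -1, -1)
           for combo in combinations(dimensions, r)]

print(len(cuboids))   # 16 cuboids for 4 dimensions
print(cuboids[0])     # ('time', 'item', 'location', 'supplier') -> base cuboid
print(cuboids[-1])    # () -> apex cuboid: total sales over all four dimensions
```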
What is Star Schema?
A star schema is the elementary form of a dimensional model, in which data are
organized into facts and dimensions. A fact is an event that is counted or measured,
such as a sale or log in. A dimension includes reference data about the fact, such as
date, item, or customer.

A star schema is a relational schema whose design represents a multidimensional data
model. The star schema is the simplest data warehouse schema. It is known as a star
schema because the entity-relationship diagram of this schema resembles a star, with
points diverging from a central table. The center of the schema consists of a large fact
table, and the points of the star are the dimension tables.
Fact Tables
A fact table is a table in a star schema that contains facts and is connected to dimensions. A fact table
has two types of columns: those that contain facts and those that are foreign keys to the
dimension tables. The primary key of the fact table is generally a composite key that is
made up of all of its foreign keys.

A fact table might contain either detail-level facts or facts that have been aggregated (fact
tables that contain aggregated facts are often instead called summary tables). A fact
table generally contains facts with the same level of aggregation.

Dimension Tables
A dimension is a structure usually composed of one or more hierarchies that
categorize data. If a dimension does not have hierarchies and levels, it is called a flat
dimension or list. The primary keys of each of the dimension tables are part of the
composite primary key of the fact table. Dimensional attributes help to describe the
dimensional value. They are generally descriptive, textual values. Dimension tables
are usually smaller in size than the fact table.

Fact tables store data about sales, while dimension tables store data about the geographic
region (markets, cities), clients, products, times, and channels.

Characteristics of Star Schema


The star schema is highly suitable for data warehouse database design because of
the following features:

○ It creates a denormalized database that can quickly provide query responses.

○ It provides a flexible design that can be changed easily or added to throughout


the development cycle, and as the database grows.

○ Its design parallels how end-users typically think of and use the data.

○ It reduces the complexity of metadata for both developers and end-users.

Advantages of Star Schema


Star schemas are easy for end-users and applications to understand and navigate. With
a well-designed schema, the customer can instantly analyze large, multidimensional
data sets.

The main advantages of star schemas in a decision-support environment are:


Query Performance

Because a star schema database has a limited number of tables and clear join paths, queries
run faster than they do against OLTP systems. Small single-table queries, frequently against
a dimension table, are almost instantaneous. Large join queries that contain multiple
tables take only seconds or minutes to run.

In a star schema database design, the dimensions are connected only through the central
fact table. When two dimension tables are used in a query, only one join path,
intersecting the fact table, exists between those two tables. This design feature enforces
accurate and consistent query results.

Load performance and administration

Structural simplicity also decreases the time required to load large batches of records
into a star schema database. By defining facts and dimensions and separating them
into various tables, the impact of a load operation is reduced. Dimension tables can be
populated once and occasionally refreshed. We can add new facts regularly and
selectively by appending records to a fact table.
Built-in referential integrity

A star schema has referential integrity built in when information is loaded. Referential
integrity is enforced because each record in a dimension table has a unique primary key,
and all keys in the fact table are legitimate foreign keys drawn from the dimension tables.
A record in the fact table that is not related correctly to a dimension cannot be given
the correct key value to be retrieved.

Easily Understood

A star schema is simple to understand and navigate, with dimensions joined only
through the fact table. These joins are more significant to the end-user because they
represent the fundamental relationship between parts of the underlying business.
Customers can also browse dimension table attributes before constructing a query.

Disadvantage of Star Schema


There are some conditions that cannot be met by star schemas. For example, the relationship
between a user and a bank account cannot be described as a star schema because the relationship
between them is many-to-many.

Example: Suppose a star schema is composed of a fact table, SALES, and several
dimension tables connected to it for time, branch, item, and geographic locations.

The TIME table has a column for each day, month, quarter, and year. The ITEM table
has columns for each item_Key, item_name, brand, type, supplier_type. The BRANCH
table has columns for each branch_key, branch_name, branch_type. The LOCATION
table has columns of geographic data, including street, city, state, and country.
In this scenario, the SALES table contains only four columns with IDs from the
dimension tables, TIME, ITEM, BRANCH, and LOCATION, instead of four columns for
time data, four columns for ITEM data, three columns for BRANCH data, and four
columns for LOCATION data. Thus, the size of the fact table is significantly reduced.
When we need to change an item, we need only make a single change in the dimension
table, instead of making many changes in the fact table.

We can create even more complex star schemas by normalizing a dimension table into
several tables. The normalized dimension table is called a Snowflake.

What is Snowflake Schema?


A snowflake schema is a variant of the star schema. "A schema is known as a
snowflake if one or more dimension tables do not connect directly to the fact table but
must join through other dimension tables."

The snowflake schema is an expansion of the star schema where each point of the star
explodes into more points. It is called snowflake schema because the diagram of
snowflake schema resembles a snowflake. Snowflaking is a method of normalizing the
dimension tables in a STAR schema. When we normalize all the dimension tables
entirely, the resultant structure resembles a snowflake with the fact table in the middle.
Snowflaking is used to improve the performance of specific queries. The schema is
diagrammed with each fact surrounded by its associated dimensions, and those
dimensions are related to other dimensions, branching out into a snowflake pattern.

The snowflake schema consists of one fact table which is linked to many dimension
tables, which can be linked to other dimension tables through a many-to-one
relationship. Tables in a snowflake schema are generally normalized to the third normal
form. Each dimension table represents exactly one level in a hierarchy.

The following diagram shows a snowflake schema with two dimensions, each having
three levels. A snowflake schema can have any number of dimensions, and each
dimension can have any number of levels.

Example: Figure shows a snowflake schema with a Sales fact table, with Store,
Location, Time, Product, Line, and Family dimension tables. The Market dimension has
two dimension tables with Store as the primary dimension table, and Location as the
outrigger dimension table. The product dimension has three dimension tables with
Product as the primary dimension table, and the Line and Family table are the outrigger
dimension tables.
A star schema stores all attributes for a dimension in one denormalized table. This
needs more disk space than a more normalized snowflake schema. Snowflaking
normalizes the dimension by moving attributes with low cardinality into separate
dimension tables that relate to the core dimension table by using foreign keys.
Snowflaking for the sole purpose of minimizing disk space is not recommended,
because it can adversely impact query performance.

In a snowflake schema, tables are normalized to remove redundancy. In a snowflake
schema, dimension tables are decomposed into multiple dimension tables.
The figure shows a simple STAR schema for sales in a manufacturing company. The sales
fact table includes quantity, price, and other relevant metrics. SALESREP, CUSTOMER,
PRODUCT, and TIME are the dimension tables.

The STAR schema for sales, as shown above, contains only five tables, whereas the
normalized version now extends to eleven tables. We will notice that in the snowflake
schema, the attributes with low cardinality in each original dimension table are
removed to form separate tables. These new tables are connected back to the original
dimension table through artificial keys.
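A hypothetical sketch of snowflaking a product dimension (the attribute names are made up): the low-cardinality brand attributes move into a separate table that the dimension reaches through an artificial key.

```python
product_star = [
    {"product_key": 1, "product_name": "Widget", "brand": "Acme", "brand_country": "IN"},
    {"product_key": 2, "product_name": "Gadget", "brand": "Acme", "brand_country": "IN"},
]

brands, product_snowflake = {}, []
for row in product_star:
    # Assign (or reuse) an artificial key for each distinct brand.
    brand_key = brands.setdefault((row["brand"], row["brand_country"]), len(brands) + 1)
    product_snowflake.append({"product_key": row["product_key"],
                              "product_name": row["product_name"],
                              "brand_key": brand_key})

print(brands)             # {('Acme', 'IN'): 1} -- the new outrigger table
print(product_snowflake)  # both products now reference brand_key 1
```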
A snowflake schema is designed for flexible querying across more complex dimensions
and relationships. It is suitable for many-to-many and one-to-many relationships between
dimension levels.

Advantage of Snowflake Schema

1. The primary advantage of the snowflake schema is the improvement in query
performance due to minimized disk storage requirements and joining smaller
lookup tables.

2. It provides greater scalability in the interrelationship between dimension levels


and components.

3. No redundancy, so it is easier to maintain.


Disadvantage of Snowflake Schema

1. The primary disadvantage of the snowflake schema is the additional


maintenance efforts required due to the increasing number of lookup tables. It is
also known as a multi fact star schema.

2. Queries are more complex and hence more difficult to understand.

3. More tables mean more joins and therefore longer query execution time.

Difference between Star and Snowflake


Schemas

Star Schema

○ In a star schema, the fact table will be at the center and is connected to the
dimension tables.

○ The tables are completely in a denormalized structure.

○ SQL query performance is good as fewer joins are involved.

○ Data redundancy is high and occupies more disk space.


Snowflake Schema

○ A snowflake schema is an extension of the star schema where the dimension tables
are connected to one or more other dimension tables.

○ The tables are partially denormalized in structure.

○ The performance of SQL queries is a bit lower when compared to the star schema,
as more joins are involved.

○ Data redundancy is low and occupies less disk space when compared to star
schema.
Let us see the differences between the Star and Snowflake schemas.

○ Ease of Maintenance/Change − Star schema: it has redundant data and hence is less
easy to maintain and change. Snowflake schema: no redundancy, therefore easier to
maintain and change.

○ Ease of Use − Star schema: less complex queries and simple to understand.
Snowflake schema: more complex queries and therefore less easy to understand.

○ Parent Table − Star schema: a dimension table will not have any parent table.
Snowflake schema: a dimension table will have one or more parent tables.

○ Query Performance − Star schema: fewer foreign keys and hence shorter query
execution time. Snowflake schema: more foreign keys and thus longer query
execution time.

○ Normalization − Star schema: it has denormalized tables. Snowflake schema: it has
normalized tables.

○ Type of Data Warehouse − Star schema: good for data marts with simple
relationships (one-to-one or one-to-many). Snowflake schema: good for a data
warehouse core, to simplify complex relationships (many-to-many).

○ Joins − Star schema: fewer joins. Snowflake schema: a higher number of joins.

○ Dimension Table − Star schema: it contains only a single dimension table for each
dimension. Snowflake schema: it may have more than one dimension table for each
dimension.

○ Hierarchies − Star schema: hierarchies for a dimension are stored in the dimension
table itself. Snowflake schema: hierarchies are broken into separate tables; these
help to drill down the information from the topmost to the lowermost level.

○ When to Use − Star schema: when the dimension table contains a smaller number
of rows. Snowflake schema: when the dimension table stores a huge number of rows
with redundant information and space is an issue, the snowflake schema can be
chosen to save space.

○ Data Warehouse System − Star schema: works best in any data warehouse or data
mart. Snowflake schema: better for a small data warehouse or data mart.

What is Fact Constellation Schema?


A fact constellation means two or more fact tables sharing one or more dimensions. It
is also called a galaxy schema.
A fact constellation schema describes the logical structure of a data warehouse or data
mart. A fact constellation schema can be designed with a collection of de-normalized fact
tables and shared, conformed dimension tables.

A fact constellation schema is a sophisticated database design in which it is difficult to
summarize information. A fact constellation schema can be implemented between aggregate
fact tables or by decomposing a complex fact table into independent, simpler fact tables.

Example: A fact constellation schema is shown in the figure below.


This schema defines two fact tables, sales, and shipping. Sales are treated along four
dimensions, namely, time, item, branch, and location. The schema contains a fact table
for sales that includes keys to each of the four dimensions, along with two measures:
Rupee_sold and units_sold. The shipping table has five dimensions, or keys: item_key,
time_key, shipper_key, from_location, and to_location, and two measures: Rupee_cost
and units_shipped.

The primary disadvantage of the fact constellation schema is that it is a more


challenging design because many variants for specific kinds of aggregation must be
considered and selected.

Data Warehouse Applications


The application areas of the data warehouse are:
Information Processing

It deals with querying, statistical analysis, and reporting via tables, charts, or graphs.
Nowadays, information processing in a data warehouse focuses on constructing low-cost,
web-based access tools that are typically integrated with web browsers.

Analytical Processing

It supports various online analytical processing such as drill-down, roll-up, and pivoting.
The historical data is being processed in both summarized and detailed format.

OLAP is implemented on data warehouses or data marts. The primary objective of


OLAP is to support ad hoc querying needed to support DSS. The multidimensional
view of data is fundamental to the OLAP application. OLAP is an operational view, not a
data structure or schema. The complex nature of OLAP applications requires a
multidimensional view of the data.

Data Mining

It helps in the analysis of hidden patterns and associations, constructing analytical models,
performing classification and prediction, and presenting the mining results using
visualization tools.
Data mining is the technique of discovering meaningful new correlations, patterns, and
trends by sifting through large amounts of data stored in repositories, using pattern
recognition technologies as well as statistical and mathematical techniques.

It is the process of selection, exploration, and modeling of huge quantities of data
to determine regularities or relations that are at first unknown, in order to obtain precise and
useful results for the owner of the database.

It is the process of inspection and analysis, by automatic or semi-automatic means, of


large quantities of records to discover meaningful patterns and rules.

Data Warehouse Process Architecture


The process architecture defines an architecture in which the data from the data
warehouse is processed for a particular computation.

Following are the two fundamental process architectures:

Centralized Process Architecture

In this architecture, the data is collected into a single centralized storage and processed
by a single machine with a large capacity in terms of memory,
processors, and storage.
Centralized process architecture evolved with transaction processing and is well suited
for small organizations with one location of service.

It requires minimal resources both from people and system perspectives.

It is very successful when the collection and consumption of data occur at the same
location.

Distributed Process Architecture

In this architecture, information and its processing are distributed across data centers;
the processing of data is localized, and the results are grouped into centralized storage.
Distributed architectures are used to overcome the limitations of centralized process
architectures, where all the information needs to be collected at one central location and
results are available at one central location.

There are several architectures of the distributed process:

Client-Server
In this architecture, the client does all the information collection and presentation, while
the server does the processing and management of data.

Three-tier Architecture

With client-server architecture, the client machines need to be connected to a server


machine, thus mandating finite states and introducing latencies and overhead in terms
of records to be carried between clients and servers.

N-tier Architecture

The n-tier or multi-tier architecture is where clients, middleware, applications, and


servers are isolated into tiers.

Cluster Architecture

In this architecture, machines are connected in a network architecture (software or
hardware) to work together cooperatively to process information or compute
requirements in parallel. Each device in a cluster is associated with a function that is
processed locally, and the result sets are collected at a master server that returns them to
the user.

Peer-to-Peer Architecture

This is a type of architecture where there are no dedicated servers and clients. Instead,
all the processing responsibilities are allocated among all machines, called peers. Each
machine can perform the function of a client or server or just process data.

Types of Database Parallelism


Parallelism is used to support speedup, where queries are executed faster because
more resources, such as processors and disks, are provided. Parallelism is also used to
provide scale-up, where increasing workloads are managed without increased
response time, via an increase in the degree of parallelism.

Different architectures for parallel database systems are shared-memory, shared-disk,


shared-nothing, and hierarchical structures.

(a) Horizontal Parallelism: It means that the database is partitioned across multiple
disks, and parallel processing occurs within a specific task (e.g., a table scan) that is
performed concurrently on different processors against different sets of data.
(b) Vertical Parallelism: It occurs among different tasks. All component query operations
(e.g., scan, join, and sort) are executed in parallel in a pipelined fashion. In other words,
the output of one operation (e.g., a join) is consumed by the next as soon as records
become available.
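An illustrative sketch of vertical (pipelined) parallelism, using Python generators to model the pipeline; a real engine would run the stages on separate processors, but the data-flow idea is the same: each operation consumes the previous one's output as soon as records become available. The data and field names are made up.

```python
def scan(rows):
    for row in rows:                  # table scan streams rows out one at a time
        yield row

def join(rows, store_lookup):
    for row in rows:                  # join each record as it arrives
        yield {**row, **store_lookup.get(row["store_id"], {})}

def select_region(rows, region):
    for row in rows:                  # filter without waiting for the full input
        if row.get("region") == region:
            yield row

stores = {16: {"region": "S"}, 64: {"region": "W"}}
facts = [{"store_id": 16, "qty": 5}, {"store_id": 64, "qty": 3}]
print(list(select_region(join(scan(facts), stores), "S")))
```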

Intraquery Parallelism
Intraquery parallelism defines the execution of a single query in parallel on multiple
processors and disks. Using intraquery parallelism is essential for speeding up
long-running queries.

Interquery parallelism does not help in this function since each query is run sequentially.

To improve the situation, many DBMS vendors developed versions of their products that
utilized intraquery parallelism.

This application of parallelism decomposes the serial SQL query into lower-level
operations such as scan, join, sort, and aggregation.

These lower-level operations are executed concurrently, in parallel.
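A rough sketch of this decomposition using Python's standard concurrent.futures module (the data and partitioning are made up): one query's scan is split so that each partition is aggregated by a separate worker, and the partial results are combined at the end.

```python
from concurrent.futures import ProcessPoolExecutor

partitions = [
    [("p1", 100), ("p2", 50)],   # partition 1 of the fact table
    [("p1", 70), ("p3", 30)],    # partition 2
]

def scan_partition(rows):
    """Lower-level operation: aggregate one partition independently."""
    return sum(value for _, value in rows)

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        partial_sums = list(pool.map(scan_partition, partitions))
    print(sum(partial_sums))     # 250 -- same answer as a serial scan
```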

Interquery Parallelism
In interquery parallelism, different queries or transactions execute in parallel with one
another.
This form of parallelism can increase transaction throughput. The response times of
individual transactions are not faster than they would be if the transactions were run in
isolation.

Thus, the primary use of interquery parallelism is to scale up a transaction processing


system to support a more significant number of transactions per second.

Database vendors started to take advantage of parallel hardware architectures by


implementing multiserver and multithreaded systems designed to handle a large
number of client requests efficiently.

This approach naturally resulted in interquery parallelism, in which different server


threads (or processes) handle multiple requests at the same time.

Interquery parallelism has been successfully implemented on SMP systems, where it


increased the throughput and allowed the support of more concurrent users.

Shared Disk Architecture


Shared-disk architecture implements a concept of shared ownership of the entire
database between RDBMS servers, each of which is running on a node of a distributed
memory system.

Each RDBMS server can read, write, update, and delete information from the same
shared database, which would need the system to implement a form of a distributed
lock manager (DLM).

DLM components can be found in hardware, the operating system, and separate
software layer, all depending on the system vendor.

On the positive side, shared-disk architectures can reduce performance bottlenecks


resulting from data skew (uneven distribution of data), and can significantly increase
system availability.

The shared-disk distributed memory design eliminates the memory access bottleneck
typical of large SMP systems and helps reduce DBMS dependency on data
partitioning.
Shared-Memory Architecture
Shared-memory or shared-everything style is the traditional approach of implementing
an RDBMS on SMP hardware.

It is relatively simple to implement and has been very successful up to the point where it
runs into the scalability limitations of the shared-everything architecture.

The key point of this technique is that a single RDBMS server can potentially use all
processors, access all memory, and access the entire database, thus providing the
client with a consistent single system image.
In shared-memory SMP systems, the DBMS considers that the multiple database
components executing SQL statements communicate with each other by exchanging
messages and information via the shared memory.

All processors have access to all data, which is partitioned across local disks.

Shared-Nothing Architecture
In a shared-nothing distributed memory environment, the data is partitioned across all
disks, and the DBMS is "partitioned" across multiple co-servers, each of which resides
on individual nodes of the parallel system and has an ownership of its disk and thus its
database partition.

A shared-nothing RDBMS parallelizes the execution of a SQL query across multiple


processing nodes.

Each processor has its memory and disk and communicates with other processors by
exchanging messages and data over the interconnection network.

This architecture is optimized specifically for the MPP and cluster systems.
The shared-nothing architectures offer near-linear scalability. The number of processor
nodes is limited only by the hardware platform limitations (and budgetary constraints),
and each node itself can be a powerful SMP system.

Data Warehouse Tools


The tools that allow sourcing of data contents and formats from operational and external data
stores into the data warehouse have to perform several essential tasks, which include:

○ Data consolidation and integration.

○ Data transformation from one form to another form.

○ Data transformation and calculation based on the application of business rules that
drive the transformation.

○ Metadata synchronization and management, which includes storing or updating


metadata about source files, transformation actions, loading formats, and events.

There are several selection criteria which should be considered while implementing a
data warehouse:
1. The ability to identify the data in the data source environment that can be read by
the tool is necessary.

2. Support for flat files, indexed files, and legacy DBMSs is critical.

3. The capability to merge records from multiple data stores is required in many
installations.

4. A specification interface to indicate the data to be extracted and the
conversion required is essential.

5. The ability to read information from repository products or data dictionaries is


desired.

6. The code developed by the tool should be completely maintainable.

7. Selective data extraction of both data items and records enables users to extract
only the required data.

8. A field-level data examination for the transformation of data into information is


needed.

9. The ability to perform data type and the character-set translation is a requirement
when moving data between incompatible systems.

10. The ability to create aggregation, summarization, and derivation fields and
records is necessary.

11. Vendor stability and support for the products are components that must be
evaluated carefully.

Data Warehouse Software Components


A warehousing team will require different types of tools during a warehouse project.
These software products usually fall into one or more of the categories illustrated, as
shown in the figure.
Extraction and Transformation

The warehouse team needs tools that can extract, transform, integrate, clean, and load
information from a source system into one or more data warehouse databases.
Middleware and gateway products may be needed for warehouses that extract a record
from a host-based source system.

Warehouse Storage

Software products are also needed to store warehouse data and their accompanying
metadata. Relational database management systems are well suited to large and
growing warehouses.

Data access and retrieval

Different types of software are needed to access, retrieve, distribute, and present
warehouse data to its end-clients.
UNIT V - DW
SYSTEM & PROCESS MANAGERS

Data Warehousing - System Managers

System management is mandatory for the successful implementation of a


data warehouse. The most important system managers are −
● System configuration manager
● System scheduling manager
● System event manager
● System database manager
● System backup recovery manager
System Configuration Manager
● The system configuration manager is responsible for the management
of the setup and configuration of data warehouse.
● The structure of configuration manager varies from one operating
system to another.
● In Unix, the structure of the configuration manager varies from vendor to
vendor.
● Configuration managers have a single user interface.
● The interface of configuration manager allows us to control all aspects
of the system.

Note − The most important configuration tool is the I/O manager.

System Scheduling Manager


System Scheduling Manager is responsible for the successful implementation
of the data warehouse. Its purpose is to schedule ad hoc queries. Every
operating system has its own scheduler with some form of batch control
mechanism. The list of features a system scheduling manager must have is
as follows −
● Work across cluster or MPP boundaries
● Deal with international time differences
● Handle job failure
● Handle multiple queries
● Support job priorities
● Restart or re-queue the failed jobs
● Notify the user or a process when job is completed
● Maintain the job schedules across system outages
● Re-queue jobs to other queues
● Support the stopping and starting of queues
● Log Queued jobs
● Deal with inter-queue processing

Note − The above list can be used as evaluation parameters for a good scheduler.

Some important jobs that a scheduler must be able to handle are as follows −
● Daily and ad hoc query scheduling
● Execution of regular report requirements
● Data load
● Data processing
● Index creation
● Backup
● Aggregation creation
● Data transformation
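As a rough sketch of the behaviour such a scheduler must provide (job priorities, re-queueing of failed jobs, and completion notification), the following Python is purely illustrative and not a real scheduling product; the job names and retry limit are assumptions.

```python
import heapq

class Job:
    def __init__(self, name, priority, action, max_retries=2):
        self.name, self.priority, self.action = name, priority, action
        self.max_retries = max_retries

def run_schedule(jobs):
    # Lower number = higher priority; heapq keeps the queue in priority order.
    queue = [(j.priority, i, j, 0) for i, j in enumerate(jobs)]
    heapq.heapify(queue)
    while queue:
        priority, order, job, attempts = heapq.heappop(queue)
        try:
            job.action()
            print(f"completed: {job.name}")            # notify on completion
        except Exception as exc:
            if attempts < job.max_retries:
                # Handle job failure by re-queueing the failed job.
                heapq.heappush(queue, (priority, order, job, attempts + 1))
            else:
                print(f"failed permanently: {job.name} ({exc})")

run_schedule([
    Job("data load", priority=1, action=lambda: None),
    Job("index creation", priority=2, action=lambda: None),
    Job("aggregation creation", priority=3, action=lambda: None),
])
```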

Note − If the data warehouse is running on a cluster or MPP architecture, then the system scheduling manager must be capable of running across the architecture.

System Event Manager


The event manager is a kind of software that manages the events defined on the data warehouse system. We cannot manage
the data warehouse manually because the structure of data warehouse is very
complex. Therefore we need a tool that automatically handles all the events
without any intervention of the user.

Note − The event manager monitors event occurrences and deals with them. It also tracks the myriad of things that can go wrong on a complex data warehouse system.
Events
Events are the actions that are generated by the user or the system itself. It may be noted that an event is a measurable, observable occurrence of a defined action.
Given below is a list of common events that are required to be tracked.
● Hardware failure
● Running out of space on certain key disks
● A process dying
● A process returning an error
● CPU usage exceeding an 80% threshold
● Internal contention on database serialization points
● Buffer cache hit ratios exceeding or falling below the threshold
● A table reaching the maximum of its size
● Excessive memory swapping
● A table failing to extend due to lack of space
● Disk exhibiting I/O bottlenecks
● Usage of temporary or sort area reaching a certain threshold
● Any other database shared memory usage
The most important thing about events is that they should be capable of
executing on their own. Event packages define the procedures for the
predefined events. The code associated with each event is known as event
handler. This code is executed whenever an event occurs.
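The relationship between event packages and event handlers can be sketched as follows; the event names and the threshold values are hypothetical, and a real event manager would observe the system itself rather than receive values as arguments.

```python
# Registry mapping predefined events to their event handlers.
event_handlers = {}

def on_event(name):
    """Decorator that registers a handler for a predefined event."""
    def register(handler):
        event_handlers[name] = handler
        return handler
    return register

@on_event("cpu_usage_exceeded")
def handle_cpu(usage):
    print(f"CPU usage at {usage}% - notifying operations staff")

@on_event("disk_space_low")
def handle_disk(free_mb):
    print(f"Only {free_mb} MB left on a key disk - archiving old partitions")

def raise_event(name, *args):
    # The code associated with the event executes automatically,
    # with no intervention from the user.
    handler = event_handlers.get(name)
    if handler:
        handler(*args)

raise_event("cpu_usage_exceeded", 85)   # fires once the 80% threshold is crossed
raise_event("disk_space_low", 120)
```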

System and Database Manager


System and database manager may be two separate pieces of software, but
they do the same job. The objective of these tools is to automate certain
processes and to simplify the execution of others. The criteria for choosing a
system and the database manager are as follows −
● increase user's quota.
● assign and de-assign roles to the users
● assign and de-assign the profiles to the users
● perform database space management
● monitor and report on space usage
● tidy up fragmented and unused space
● add and expand the space
● add and remove users
● manage user password
● manage summary or temporary tables
● assign or de-assign temporary space to and from the user
● reclaim the space from old or out-of-date temporary tables
● manage error and trace logs
● browse log and trace files
● redirect error or trace information
● switch on and off error and trace logging
● perform system space management
● monitor and report on space usage
● clean up old and unused file directories
● add or expand space.
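As a small sketch of just one item on this list, system space monitoring, the following uses Python's standard library; the warning threshold and the path being monitored are assumptions.

```python
import shutil

def report_space(path="/", warn_pct=90):
    """Monitor and report on space usage; warn when a threshold is crossed."""
    usage = shutil.disk_usage(path)
    used_pct = usage.used / usage.total * 100
    print(f"{path}: {used_pct:.1f}% used "
          f"({usage.free // 2**30} GB free of {usage.total // 2**30} GB)")
    if used_pct > warn_pct:
        print(f"warning: {path} has exceeded {warn_pct}% usage - "
              "consider adding or expanding space")

report_space("/")
```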
System Backup Recovery Manager
The backup and recovery tool makes it easy for operations and management
staff to back up the data. Note that the system backup manager must be
integrated with the schedule manager software being used. The important
features that are required for the management of backups are as follows −
● Scheduling
● Backup data tracking
● Database awareness
Backups are taken only to protect against data loss. Following are the
important points to remember −
● The backup software will keep some form of database of where and
when the piece of data was backed up.
● The backup recovery manager must have a good front-end to that
database.
● The backup recovery software should be database aware.
● Being aware of the database, the software can then be addressed in database terms and will not perform backups that would not be viable.
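The "form of database" that records where and when each piece of data was backed up can be as simple as a catalog table; a minimal sketch follows, with hypothetical table and column names.

```python
import sqlite3
from datetime import datetime, timezone

catalog = sqlite3.connect("backup_catalog.db")
catalog.execute("""CREATE TABLE IF NOT EXISTS backup_history
                   (object_name TEXT, backup_type TEXT,
                    media_label TEXT, taken_at TEXT)""")

def record_backup(object_name, backup_type, media_label):
    """Track where and when each piece of data was backed up."""
    catalog.execute("INSERT INTO backup_history VALUES (?, ?, ?, ?)",
                    (object_name, backup_type, media_label,
                     datetime.now(timezone.utc).isoformat()))
    catalog.commit()

def last_backup(object_name):
    """Front-end query the recovery manager can use before a restore."""
    return catalog.execute(
        "SELECT media_label, taken_at FROM backup_history "
        "WHERE object_name = ? ORDER BY taken_at DESC LIMIT 1",
        (object_name,)).fetchone()

record_backup("sales_fact", "full", "TAPE-0042")
print(last_backup("sales_fact"))
```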

Data Warehousing - Process Managers

Process managers are responsible for maintaining the flow of data both into
and out of the data warehouse. There are three different types of process
managers −
● Load manager
● Warehouse manager
● Query manager
Data Warehouse Load Manager
The load manager performs the operations required to extract and load the data into the database. The size and complexity of a load manager vary between specific solutions, from one data warehouse to another.
Load Manager Architecture
The load manager performs the following functions −
● Extract data from the source system.
● Fast-load the extracted data into a temporary data store.
● Perform simple transformations into a structure similar to the one in the data warehouse.

Extract Data from Source


The data is extracted from the operational databases or the external information providers. Gateways are the application programs that are used to extract data. A gateway is supported by the underlying DBMS and allows the client program to generate SQL to be executed at a server. Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) are examples of gateways.
Fast Load
● In order to minimize the total load window, the data needs to be loaded
into the warehouse in the fastest possible time.
● Transformations affect the speed of data processing.
● It is more effective to load the data into a relational database prior to
applying transformations and checks.
● Gateway technology is not suitable, since it is inefficient when large data volumes are involved.
Simple Transformations
While loading, it may be required to perform simple transformations. After completing the simple transformations, we can do complex checks. Suppose we are loading the EPOS sales transactions; we need to perform the following checks −
● Strip out all the columns that are not required within the warehouse.
● Convert all the values to required data types.
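For the EPOS example above, the two checks might look like the following sketch; the column layout of the incoming transaction is hypothetical.

```python
from datetime import datetime

# Hypothetical EPOS record layout: only some columns are wanted in the warehouse.
WANTED = ("store_id", "sale_time", "product_code", "quantity", "value")

def transform_epos(record):
    # Strip out all the columns that are not required within the warehouse.
    trimmed = {k: record[k] for k in WANTED}
    # Convert all the values to the required data types.
    trimmed["store_id"] = int(trimmed["store_id"])
    trimmed["sale_time"] = datetime.strptime(trimmed["sale_time"],
                                             "%Y-%m-%d %H:%M:%S")
    trimmed["product_code"] = trimmed["product_code"].strip()
    trimmed["quantity"] = int(trimmed["quantity"])
    trimmed["value"] = float(trimmed["value"])
    return trimmed

raw = {"store_id": "017", "sale_time": "2024-03-01 09:15:00",
       "product_code": " AB123 ", "quantity": "2", "value": "9.98",
       "till_operator": "J SMITH", "loyalty_flag": "Y"}   # extra columns dropped
print(transform_epos(raw))
```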
Warehouse Manager
The warehouse manager is responsible for the warehouse management process. It consists of third-party system software, C programs, and shell scripts. The size and complexity of a warehouse manager vary between specific solutions.
Warehouse Manager Architecture
A warehouse manager includes the following −
● The controlling process
● Stored procedures or C with SQL
● Backup/Recovery tool
● SQL scripts

Functions of Warehouse Manager


A warehouse manager performs the following functions −
● Analyzes the data to perform consistency and referential integrity
checks.
● Creates indexes, business views, partition views against the base data.
● Generates new aggregations and updates the existing aggregations.
● Generates normalizations.
● Transforms and merges the source data of the temporary store into the
published data warehouse.
● Backs up the data in the data warehouse.
● Archives the data that has reached the end of its captured life.

Note − A warehouse manager analyzes query profiles to determine whether the indexes and aggregations are appropriate.
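Generating or refreshing an aggregation from the base data is typically plain SQL driven by the controlling process; a rough sketch follows, using sqlite3 and hypothetical table names.

```python
import sqlite3

warehouse = sqlite3.connect("warehouse.db")
warehouse.executescript("""
CREATE TABLE IF NOT EXISTS sales_fact
    (sale_date TEXT, store_id INTEGER, amount REAL);

-- Generate (or refresh) a monthly aggregation from the base fact table.
DROP TABLE IF EXISTS sales_by_store_month;
CREATE TABLE sales_by_store_month AS
SELECT substr(sale_date, 1, 7) AS sale_month,
       store_id,
       SUM(amount) AS total_amount,
       COUNT(*)    AS txn_count
FROM sales_fact
GROUP BY substr(sale_date, 1, 7), store_id;

-- Index the aggregation so the query manager can use it efficiently.
CREATE INDEX IF NOT EXISTS ix_agg_month ON sales_by_store_month (sale_month);
""")
warehouse.commit()
```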

Query Manager
The query manager is responsible for directing the queries to suitable tables.
By directing the queries to appropriate tables, it speeds up the query request
and response process. In addition, the query manager is responsible for
scheduling the execution of the queries posted by the user.
Query Manager Architecture
A query manager includes the following components −
● Query redirection via C tool or RDBMS
● Stored procedures
● Query management tool
● Query scheduling via C tool or RDBMS
● Query scheduling via third-party software
Functions of Query Manager
● It presents the data to the user in a form they understand.
● It schedules the execution of the queries posted by the end-user.
● It stores query profiles to allow the warehouse manager to determine
which indexes and aggregations are appropriate.
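Query redirection can be as simple as checking whether a suitable aggregation exists before falling back to the base fact table; the sketch below is illustrative only, and the mapping of query grains to aggregation tables is hypothetical.

```python
# Hypothetical mapping from the grouping columns requested to the
# aggregation table that can serve them.
AGGREGATIONS = {
    ("sale_month", "store_id"): "sales_by_store_month",
}

query_profiles = []   # retained so the warehouse manager can review them later

def redirect(select_columns, group_by):
    """Direct the query to the smallest suitable table to speed up the response."""
    target = AGGREGATIONS.get(tuple(group_by), "sales_fact")
    sql = (f"SELECT {', '.join(select_columns)} FROM {target} "
           f"GROUP BY {', '.join(group_by)}")
    query_profiles.append((target, sql))
    return sql

print(redirect(["sale_month", "store_id", "SUM(total_amount)"],
               ["sale_month", "store_id"]))
```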

Data Warehousing - Tuning

A data warehouse keeps evolving and it is unpredictable what query the user
is going to post in the future. Therefore it becomes more difficult to tune a data
warehouse system. In this chapter, we will discuss how to tune the different
aspects of a data warehouse such as performance, data load, queries, etc.

Difficulties in Data Warehouse Tuning


Tuning a data warehouse is a difficult procedure due to the following reasons −
● Data warehouse is dynamic; it never remains constant.
● It is very difficult to predict what query the user is going to post in the
future.
● Business requirements change with time.
● Users and their profiles keep changing.
● The user can switch from one group to another.
● The data load on the warehouse also changes with time.

Note − It is very important to have a complete knowledge of the data warehouse.

Performance Assessment
Here is a list of objective measures of performance −
● Average query response time
● Scan rates
● Time used per query
● Memory usage per process
● I/O throughput rates
Following are the points to remember.
● It is necessary to specify the measures in service level agreement
(SLA).
● It is of no use trying to tune response times if they are already better than those required.
● It is essential to have realistic expectations while making performance
assessment.
● It is also essential that the users have feasible expectations.
● To hide the complexity of the system from the user, aggregations and
views should be used.
● It is also possible that the user can write a query you had not tuned for.
Data Load Tuning
Data load is a critical part of overnight processing. Nothing else can run until
data load is complete. This is the entry point into the system.

Note − If there is a delay in transferring the data, or in the arrival of data, then the entire system is affected badly. Therefore it is very important to tune the data load first.

There are various approaches to tuning the data load, as discussed below −
● The very common approach is to insert data using the SQL Layer. In
this approach, normal checks and constraints need to be performed.
When the data is inserted into the table, the code will run to check for
enough space to insert the data. If sufficient space is not available, then
more space may have to be allocated to these tables. These checks
take time to perform and are costly to CPU.
● The second approach is to bypass all these checks and constraints and
place the data directly into the preformatted blocks. These blocks are
later written to the database. It is faster than the first approach, but it
can work only with whole blocks of data. This can lead to some space
wastage.
● The third approach is that while loading the data into a table that already contains data, we can maintain the indexes.
● The fourth approach is to drop the indexes before loading data into tables that already contain data, and to recreate them when the data load is complete (see the sketch after this list). The choice between the third and the fourth approach depends on how much data is already loaded and how many indexes need to be rebuilt.
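A minimal sketch of the fourth approach, assuming a sqlite3 database and hypothetical table and index names; a production system would normally use the database's native bulk loader rather than executemany.

```python
import sqlite3

db = sqlite3.connect("warehouse.db")
db.execute("CREATE TABLE IF NOT EXISTS sales_fact "
           "(sale_date TEXT, product_id INTEGER, amount REAL)")

def bulk_load(rows):
    # Fourth approach: drop the indexes, load, then recreate the indexes.
    db.execute("DROP INDEX IF EXISTS ix_sales_date")
    db.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", rows)
    db.execute("CREATE INDEX ix_sales_date ON sales_fact (sale_date)")
    db.commit()

bulk_load([("2024-03-01", 1, 9.99), ("2024-03-01", 2, 4.50)])
```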
Integrity Checks
Integrity checking highly affects the performance of the load. Following are the
points to remember −
● Integrity checks need to be limited because they require heavy processing power.
● Integrity checks should be applied on the source system to avoid performance degradation of the data load.
Tuning Queries
We have two kinds of queries in data warehouse −
● Fixed queries
● Ad hoc queries
Fixed Queries
Fixed queries are well defined. Following are the examples of fixed queries −
● Regular reports
● Canned queries
● Common aggregations
Tuning the fixed queries in a data warehouse is the same as in a relational database system. The only difference is that the amount of data to be queried may be different. It is good to store the most successful execution plan while testing fixed queries. Storing these execution plans will allow us to spot changes in data size and data skew, as these will cause the execution plan to change.
Note − We cannot do much more on the fact table, but while dealing with dimension tables or the aggregations, the usual collection of SQL tweaks, storage mechanisms, and access methods can be used to tune these queries.
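Storing execution plans while testing fixed queries can be automated with the database's EXPLAIN facility; the sketch below uses SQLite's EXPLAIN QUERY PLAN with a hypothetical canned query and plan-history table.

```python
import sqlite3, json, datetime

db = sqlite3.connect("warehouse.db")
db.execute("CREATE TABLE IF NOT EXISTS sales_fact "
           "(sale_date TEXT, product_id INTEGER, amount REAL)")
db.execute("CREATE TABLE IF NOT EXISTS plan_history "
           "(captured_at TEXT, query TEXT, plan TEXT)")

CANNED_QUERY = ("SELECT sale_date, SUM(amount) FROM sales_fact "
                "GROUP BY sale_date")

# Capture the current execution plan so that changes caused by growing data
# volumes or data skew can be spotted later.
plan = db.execute("EXPLAIN QUERY PLAN " + CANNED_QUERY).fetchall()
db.execute("INSERT INTO plan_history VALUES (?, ?, ?)",
           (datetime.datetime.now().isoformat(), CANNED_QUERY,
            json.dumps(plan)))
db.commit()
```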

Ad hoc Queries
To understand ad hoc queries, it is important to know the ad hoc users of the
data warehouse. For each user or group of users, you need to know the
following −
● The number of users in the group
● Whether they use ad hoc queries at regular intervals of time
● Whether they use ad hoc queries frequently
● Whether they use ad hoc queries occasionally at unknown intervals.
● The maximum size of query they tend to run
● The average size of query they tend to run
● Whether they require drill-down access to the base data
● The elapsed login time per day
● The peak time of daily usage
● The number of queries they run per peak hour

Points to Note

● It is important to track the user's profiles and identify the queries that are
run on a regular basis.
● It is also important that the tuning performed does not adversely affect overall performance.
● Identify similar and ad hoc queries that are frequently run.
● If these queries are identified, then the database will change and new
indexes can be added for those queries.
● If these queries are identified, then new aggregations can be created
specifically for those queries that would result in their efficient execution.
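Identifying the frequently run ad hoc queries usually starts from the query profiles the query manager stores; a minimal sketch that counts normalized query text follows, with an entirely hypothetical log format.

```python
from collections import Counter

# Hypothetical query profile log: (user, elapsed_seconds, sql_text) tuples.
query_log = [
    ("analyst1", 42.0, "select region, sum(amount) from sales_fact group by region"),
    ("analyst2", 39.5, "SELECT region, SUM(amount) FROM sales_fact GROUP BY region"),
    ("analyst1", 3.1,  "select * from product_dim where product_code = 'AB123'"),
]

def normalise(sql):
    # Fold case and whitespace so equivalent queries count as one.
    return " ".join(sql.lower().split())

frequency = Counter(normalise(sql) for _, _, sql in query_log)
for sql, count in frequency.most_common(5):
    if count > 1:
        # Candidates for a new index or a purpose-built aggregation.
        print(f"{count}x  {sql}")
```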

Data Warehousing - Testing

Testing is very important for data warehouse systems to make them work
correctly and efficiently. There are three basic levels of testing performed on a
data warehouse −
● Unit testing
● Integration testing
● System testing
Unit Testing
● In unit testing, each component is separately tested.
● Each module, i.e., procedure, program, SQL script, or Unix shell script, is tested.
● This test is performed by the developer.
Integration Testing
● In integration testing, the various modules of the application are brought together and then tested against a number of inputs.
● It is performed to test whether the various components do well after
integration.
System Testing
● In system testing, the whole data warehouse application is tested
together.
● The purpose of system testing is to check whether the entire system
works correctly together or not.
● System testing is performed by the testing team.
● Since the size of the whole data warehouse is very large, it is usually only possible to perform minimal system testing before the test plan can be enacted.
Test Schedule
First of all, the test schedule is created in the process of developing the test
plan. In this schedule, we predict the estimated time required for the testing of
the entire data warehouse system.
There are different methodologies available to create a test schedule, but
none of them are perfect because the data warehouse is very complex and
large. Also the data warehouse system is evolving in nature. One may face
the following issues while creating a test schedule −
● A simple problem may involve a very large query that can take a day or more to complete, i.e., the query does not complete in the desired time scale.
● There may be hardware failures such as losing a disk or human errors
such as accidentally deleting a table or overwriting a large table.

Note − Due to the above-mentioned difficulties, it is recommended to always double the amount of time you would normally allow for testing.

Testing Backup Recovery


Testing the backup recovery strategy is extremely important. Here is the list of
scenarios for which this testing is needed −
● Media failure
● Loss or damage of table space or data file
● Loss or damage of redo log file
● Loss or damage of control file
● Instance failure
● Loss or damage of archive file
● Loss or damage of table
● Failure during data movement
Testing Operational Environment
There are a number of aspects that need to be tested. These aspects are
listed below.
● Security − A separate security document is required for security testing. This document contains a list of disallowed operations, and tests are devised for each of them.
● Scheduler − Scheduling software is required to control the daily
operations of a data warehouse. It needs to be tested during system
testing. The scheduling software requires an interface with the data
warehouse, which will need the scheduler to control overnight
processing and the management of aggregations.
● Disk Configuration − Disk configuration also needs to be tested to identify I/O bottlenecks. The test should be performed multiple times with different settings.
● Management Tools − It is required to test all the management tools during system testing. Here is the list of tools that need to be tested.
o Event manager
o System manager
o Database manager
o Configuration manager
o Backup recovery manager
Testing the Database
The database is tested in the following three ways −
● Testing the database manager and monitoring tools − To test the database manager and the monitoring tools, they should be used in the creation, running, and management of a test database.
● Testing database features − Here is the list of features that we have to
test −
o Querying in parallel
o Creating indexes in parallel
o Loading data in parallel
● Testing database performance − Query execution plays a very
important role in data warehouse performance measures. There are
sets of fixed queries that need to be run regularly and they should be
tested. To test ad hoc queries, one should go through the user
requirement document and understand the business completely. Take
time to test the most awkward queries that the business is likely to ask
against different index and aggregation strategies.
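Timing the fixed query set against different index or aggregation strategies can be automated with a small harness; the sketch below uses sqlite3 with hypothetical queries and synthetic data.

```python
import sqlite3, time

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales_fact "
           "(sale_date TEXT, product_id INTEGER, amount REAL)")
db.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)",
               [(f"2024-01-{d:02d}", d % 50, d * 1.5)
                for d in range(1, 29)] * 1000)

FIXED_QUERIES = {
    "daily totals": "SELECT sale_date, SUM(amount) FROM sales_fact "
                    "GROUP BY sale_date",
    "top products": "SELECT product_id, SUM(amount) FROM sales_fact "
                    "GROUP BY product_id ORDER BY 2 DESC LIMIT 10",
}

def time_queries(label):
    # Run each fixed query and report the elapsed time for this strategy.
    for name, sql in FIXED_QUERIES.items():
        start = time.perf_counter()
        db.execute(sql).fetchall()
        print(f"{label:>12} | {name}: {time.perf_counter() - start:.4f}s")

time_queries("no index")
db.execute("CREATE INDEX ix_date ON sales_fact (sale_date)")
time_queries("with index")
```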
Testing the Application
● All the managers should be integrated correctly and work together in order to ensure that the end-to-end load, index, aggregation, and queries work as per expectations.
● Each function of each manager should work correctly.
● It is also necessary to test the application over a period of time.
● Week-end and month-end tasks should also be tested.
Logistics of the Test
The aim of system test is to test all of the following areas −
● Scheduling software
● Day-to-day operational procedures
● Backup recovery strategy
● Management and scheduling tools
● Overnight processing
● Query performance

Note − The most important point is to test the scalability. Failure to do so will leave us with a system design that does not work when the system grows.
