Unit 345 DW
Metadata
Metadata is simply defined as data about data: data that is used to describe or represent other data. For example, the index of a book serves as metadata for the contents of the book. In other words, metadata is the summarized data that leads us to the detailed data. In the context of a data warehouse, metadata can be defined as follows.
Note − In a data warehouse, we create metadata for the data names and definitions of a given data warehouse. Along with this, additional metadata is also created for time-stamping any extracted data and for recording the source of the extracted data.
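For illustration only (the record and field names here are assumptions, not part of any standard), such metadata might be captured as small records like the following Python sketch:

from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ColumnMetadata:
    """Data name and business definition for one warehouse column."""
    name: str
    definition: str
    data_type: str

@dataclass
class ExtractMetadata:
    """Metadata recorded for one extraction run: source system and time-stamp."""
    source_system: str
    extracted_at: datetime
    columns: list = field(default_factory=list)

# Example entry: metadata describing a hypothetical 'sales' extract.
sales_extract = ExtractMetadata(
    source_system="orders_oltp",
    extracted_at=datetime(2013, 8, 3, 2, 0),
    columns=[
        ColumnMetadata("product_id", "Surrogate key of the product sold", "INTEGER"),
        ColumnMetadata("quantity", "Units sold in the transaction", "INTEGER"),
    ],
)
print(sales_extract.source_system, sales_extract.extracted_at.isoformat())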
Categories of Metadata
Metadata can be broadly categorized into three categories − business metadata, technical metadata, and operational metadata.
Metadata Repository
A metadata repository is an integral part of a data warehouse system. It typically holds the definition of the data warehouse, business metadata, operational metadata, the data for mapping from the operational environment to the data warehouse, and the algorithms used for summarization.
Data Marting
Note − Do not data mart for any other reason, since the operating cost of data marting could be very high. Before data marting, make sure that the data marting strategy is appropriate for your particular solution.
We need data marts to support user access tools that require internal data structures. The data in such structures is outside the control of the data warehouse, but it needs to be populated and updated on a regular basis.
Some tools can populate their structures directly from the source system, but others cannot. Therefore, additional requirements outside the scope of the tool need to be identified for the future.
Note − In order to ensure consistency of data across all access tools, the data should not be populated directly from the data warehouse; rather, each tool should have its own data mart.
The summaries are data marted in the same way as they would have been
designed within the data warehouse. Summary tables help to utilize all
dimension data in the starflake schema.
Partitioning Strategy
Note − To cut down on the backup size, all partitions other than the current partition can be marked as read-only and put into a state where they cannot be modified. They then need to be backed up only once, which means only the current partition has to be backed up each time.
To Enhance Performance
By partitioning the fact table into sets of data, query performance can be enhanced, because a query then scans only those partitions that are relevant instead of the whole data set.
Horizontal Partitioning
There are various ways in which a fact table can be partitioned. In horizontal
partitioning, we have to keep in mind the requirements for manageability of
the data warehouse.
Partitioning by Time into Equal Segments
In this partitioning strategy, the fact table is partitioned on the basis of time period, where each time period represents a significant retention period within the business. For example, if users mostly query month-to-date data, then it is appropriate to partition the data into monthly segments. We can reuse the partitioned tables by removing the data in them.
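As a rough sketch of monthly segmentation (all names and data are illustrative), each fact row can be routed to a partition keyed by year and month, so that a month-to-date query touches only the current segment and an old segment can be emptied and reused:

from collections import defaultdict
from datetime import date

def month_key(sale_date: date) -> str:
    """Map a fact row's date to a monthly segment key such as '2013-08'."""
    return f"{sale_date.year:04d}-{sale_date.month:02d}"

partitions = defaultdict(list)   # partition key -> list of fact rows

facts = [
    {"product_id": 30, "qty": 5, "sale_date": date(2013, 8, 3)},
    {"product_id": 35, "qty": 4, "sale_date": date(2013, 9, 3)},
]
for row in facts:
    partitions[month_key(row["sale_date"])].append(row)

# A month-to-date query only scans the current partition.
current = month_key(date(2013, 9, 30))
print(len(partitions[current]), "row(s) scanned instead of", len(facts))

# Reuse: empty the oldest partition instead of dropping the table.
oldest = min(partitions)
partitions[oldest].clear()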
Points to Note
● The query does not have to scan irrelevant data which speeds up the
query process.
The fact table can also be partitioned on the basis of a dimension other than time. In that case −
● This technique is not appropriate where the chosen dimension is likely to change in the future, so it is worth confirming that the dimension will not change.
● If the dimension does change, the entire fact table would have to be repartitioned.
Note − We recommend performing the partitioning only on the basis of the time dimension, unless you are certain that the suggested dimension grouping will not change within the life of the data warehouse.
Partition by Size
When there is no clear basis for partitioning the fact table on any dimension, we can partition the fact table on the basis of its size. We set a predetermined size as a critical point; when the table exceeds that size, a new table partition is created.
Points to Note
● This partitioning is complex to manage.
● It requires metadata to identify what data is stored in each partition, as in the sketch below.
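A minimal sketch of size-based partitioning, assuming an artificial row-count threshold; the metadata dictionary is what records which data landed in which partition:

MAX_ROWS = 2                # assumed critical size per partition (illustrative only)
partitions = [[]]           # each partition is represented here as a list of rows
partition_metadata = {0: "rows from insert 0 onwards"}   # what each partition holds

def insert_fact(row, load_tag):
    """Create a new partition when the current one exceeds the critical size."""
    if len(partitions[-1]) >= MAX_ROWS:
        partitions.append([])
        partition_metadata[len(partitions) - 1] = load_tag
    partitions[-1].append(row)

for i, qty in enumerate([5, 4, 5, 7, 3]):
    insert_fact({"qty": qty}, load_tag=f"rows from insert {i} onwards")

print(len(partitions), "partitions")
print(partition_metadata)   # metadata identifies what data sits in each partition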
Partitioning Dimensions
If a dimension contains a large number of entries, it may need to be partitioned itself; to decide, we have to check the size of the dimension.
Consider a large dimension whose design changes over time. If we need to store all the variations in order to make comparisons, that dimension may become very large. This would definitely affect the response time.
Round Robin Partitions
In the round-robin technique, when a new partition is needed, the old one is archived. Metadata is used to allow the user access tools to refer to the correct table partition.
This technique makes it easy to automate table management facilities within
the data warehouse.
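The idea can be sketched as follows (a toy illustration, not a description of any particular product): a fixed set of physical partitions is cycled, the oldest is archived when a new period begins, and a metadata table tells the access tools which physical table holds which period:

from collections import deque

SLOTS = 3                                   # assumed number of physical partitions
slots = deque(f"fact_part_{i}" for i in range(SLOTS))
metadata = {}                               # period -> physical table holding it
archive = []                                # (period, table) pairs moved to archive

def start_new_period(period):
    table = slots[0]                        # the oldest slot is reused next
    slots.rotate(-1)                        # move it to the back of the rotation
    for old_period, t in list(metadata.items()):
        if t == table:                      # archive whatever that slot held before
            archive.append((old_period, t))
            del metadata[old_period]
    metadata[period] = table

for period in ["2013-07", "2013-08", "2013-09", "2013-10"]:
    start_new_period(period)

print(metadata)   # user access tools consult this to find the correct table partition
print(archive)    # 2013-07 has been archived to make room for 2013-10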
Vertical Partition
Vertical partitioning splits the data vertically, that is, by columns.
Vertical partitioning can be performed in the following two ways −
● Normalization
● Row Splitting
Normalization
Normalization is the standard relational method of database organization. In this method, duplicate rows are collapsed into a single row in a separate table, which reduces the space required. Take a look at the following tables, which show how normalization is performed.
Table before Normalization

Product_id  Quantity  Value  Sales_date  Store_id  Store_name  Location   Region
30          5         3.67   3-Aug-13    16        sunny       Bangalore  W
35          4         5.33   3-Sep-13    16        sunny       Bangalore  W
40          5         2.50   3-Sep-13    64        san         Mumbai     S
45          7         5.66   3-Sep-13    16        sunny       Bangalore  W

Tables after Normalization

Store table
Store_id  Store_name  Location   Region
16        sunny       Bangalore  W
64        san         Mumbai     S

Sales table
Product_id  Quantity  Value  Sales_date  Store_id
30          5         3.67   3-Aug-13    16
35          4         5.33   3-Sep-13    16
40          5         2.50   3-Sep-13    64
45          7         5.66   3-Sep-13    16
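The same split can be sketched in a few lines of Python (column names as in the tables above): duplicate store details collapse into a single store row, and the sales rows keep only the Store_id foreign key:

before = [
    # Product_id, Quantity, Value, Sales_date, Store_id, Store_name, Location, Region
    (30, 5, 3.67, "3-Aug-13", 16, "sunny", "Bangalore", "W"),
    (35, 4, 5.33, "3-Sep-13", 16, "sunny", "Bangalore", "W"),
    (40, 5, 2.50, "3-Sep-13", 64, "san", "Mumbai", "S"),
    (45, 7, 5.66, "3-Sep-13", 16, "sunny", "Bangalore", "W"),
]

store_table = {}   # Store_id -> (Store_name, Location, Region), one row per store
sales_table = []   # keeps only the Store_id foreign key plus the sales columns

for pid, qty, value, sale_date, sid, sname, loc, region in before:
    store_table[sid] = (sname, loc, region)
    sales_table.append((pid, qty, value, sale_date, sid))

print(len(store_table), "store rows;", len(sales_table), "sales rows")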
Row Splitting
Row splitting leaves a one-to-one map between the resulting partitions. The motive of row splitting is to speed up access to a large table by reducing its size.
Dimensions
A dimension is a collection of data that describes one business dimension. Dimensions provide the contextual background for the facts, and they are the framework over which OLAP is performed.
Measure
It is a numeric attribute of a fact, representing the performance or behavior of the
business relative to the dimensions.
Considering the relational context, there are two basic models which are used in
dimensional modeling:
o Star Model
o Snowflake Model
The star model is the underlying structure for a dimensional model. It has one large central table (the fact table) and a set of smaller tables (the dimension tables) arranged in a radial design around the central table. The snowflake model is the result of decomposing one or more of the dimensions.
Fact Table
Fact tables are used to store the facts or measures of the business. Facts are the numeric data elements that are of interest to the company.
Characteristics of the Fact table
The fact table includes numerical values of what we measure. For example, a fact value of 20 might mean that 20 widgets have been sold.
Each fact table includes the keys to associated dimension tables. These are known as
foreign keys in the fact table.
Fact tables typically include a small number of columns.
Compared to dimension tables, fact tables have a large number of rows.
Dimension Table
Dimension tables establish the context of the facts. Dimensional tables store fields that
describe the facts.
Characteristics of the Dimension table
Dimension tables contain the details about the facts. This, for example, enables business analysts to understand the data and their reports better.
The dimension tables include descriptive data about the numerical values in the fact
table. That is, they contain the attributes of the facts. For example, the dimension tables
for a marketing analysis function might include attributes such as time, marketing
region, and product type.
Since the records in a dimension table are denormalized, a dimension table usually has a large number of columns. The dimension tables contain significantly fewer rows of information than the fact table.
The attributes in a dimension table are used as row and column headings in a
document or query results display.
Example: A store summary in a fact table can be viewed by city and state. An item summary can be viewed by brand, color, etc. Customer information can be viewed by name and address.
Fact Table

Time ID  Product ID  Customer ID  Units Sold
4        17          2            1
8        21          3            2
8        4           1            1
In this example, the Customer ID column in the fact table is the foreign key that joins with the dimension table. By following the link, we can see that row 2 of the fact table records the fact that customer 3, Gaurav, bought two items on day 8.
Dimension Table (Customer)

Customer ID  Name     Gender  Income  Education  Region
1            Rohan    Male    2       3          4
2            Sandeep  Male    3       5          1
3            Gaurav   Male    1       7          3
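To make the foreign-key relationship concrete, the following sketch (column names inferred from the tables above) resolves row 2 of the fact table against the customer dimension:

# Fact table rows: (time_id, product_id, customer_id, units_sold)
fact_table = [(4, 17, 2, 1), (8, 21, 3, 2), (8, 4, 1, 1)]

# Customer dimension: customer_id -> (name, gender, income, education, region)
customer_dim = {
    1: ("Rohan", "Male", 2, 3, 4),
    2: ("Sandeep", "Male", 3, 5, 1),
    3: ("Gaurav", "Male", 1, 7, 3),
}

time_id, product_id, customer_id, units = fact_table[1]   # row 2 of the fact table
name = customer_dim[customer_id][0]
print(f"Customer {customer_id} ({name}) bought {units} item(s) on day {time_id}")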
Hierarchy
A hierarchy is a directed tree whose nodes are dimension attributes and whose arcs model many-to-one associations between dimension attributes. It contains a dimension, positioned at the tree's root, and all of the dimension attributes that describe it.
Now suppose we want to view the sales data with a third dimension: for example, according to time and item as well as location, for the cities Chennai, Kolkata, Mumbai, and Delhi. Such 3-D data can be represented as a series of 2-D tables or, conceptually, as a 3-D data cube.
A data cube is created from a subset of attributes in the database. Specific attributes are chosen to be measure attributes, i.e., the attributes whose values are of interest. Other attributes are selected as dimensions or functional attributes. The measure attributes are aggregated according to the dimensions.
For example, XYZ may create a sales data warehouse to keep records of the store's sales for the dimensions time, item, branch, and location. These dimensions enable the store to keep track of things like monthly sales of items, and the branches and locations at which the items were sold. Each dimension may have a table associated with it, known as a dimension table, which describes the dimension. For example, a dimension table for items may contain the attributes item_name, brand, and type.
Data cube method is an interesting technique with many applications. Data cubes could
be sparse in many cases because not every cell in each dimension may have
corresponding data in the database.
Techniques should be developed to handle sparse cubes efficiently.
If a query contains constants at even lower levels than those provided in a data cube, it
is not clear how to make the best use of the precomputed results stored in the data
cube.
The multidimensional data model views data in the form of a data cube. OLAP tools are based on the multidimensional data model. Data cubes usually model n-dimensional data.
A data cube enables data to be modeled and viewed in multiple dimensions. A
multidimensional data model is organized around a central theme, like sales and
transactions. A fact table represents this theme. Facts are numerical measures. Thus,
the fact table contains measure (such as Rs_sold) and keys to each of the related
dimensional tables.
Dimensions are the perspectives or entities with respect to which a data cube is defined. Facts are generally numeric quantities, which are used for analyzing the relationships between dimensions.
Example: In the 2-D representation, we look at the All Electronics sales data for items sold per quarter in the city of Vancouver. The measure displayed is dollars_sold (in thousands).
3-Dimensional Cuboids
Suppose we would like to view the sales data with a third dimension. For example, suppose we would like to view the data according to time and item, as well as location, for the cities Chicago, New York, Toronto, and Vancouver. The measure displayed is dollars_sold (in thousands). Such 3-D data can be shown as a series of 2-D tables or, conceptually, represented in the form of a 3-D data cube.
Let us suppose that we would like to view our sales data with an additional fourth
dimension, such as a supplier.
In data warehousing, the data cubes are n-dimensional. The cuboid which holds the
lowest level of summarization is called a base cuboid.
For example, the 4-D cuboid for the dimensions time, item, location, and supplier is the base cuboid for those dimensions. It is a 4-D data cube representation of the sales data according to time, item, location, and supplier, with the measure dollars_sold (in thousands).
The topmost 0-D cuboid, which holds the highest level of summarization, is known as
the apex cuboid. In this example, this is the total sales, or dollars sold, summarized over
all four dimensions.
The lattice of cuboids forms the data cube. For the dimensions time, item, location, and supplier, the lattice of cuboids makes up the 4-D data cube, and each cuboid represents a different degree of summarization.
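The lattice can also be enumerated mechanically. The sketch below is illustrative only (it uses plain group-by aggregation over a few made-up base cells); it builds every cuboid from the base cuboid for time, item, location, and supplier, with the apex cuboid being the grand total of dollars_sold:

from itertools import combinations
from collections import defaultdict

dims = ("time", "item", "location", "supplier")

# Base cuboid: one cell per full dimension combination -> dollars_sold.
base = [
    {"time": "Q1", "item": "phone", "location": "Vancouver", "supplier": "S1", "dollars_sold": 605},
    {"time": "Q1", "item": "phone", "location": "Chicago",   "supplier": "S2", "dollars_sold": 825},
    {"time": "Q2", "item": "tv",    "location": "Vancouver", "supplier": "S1", "dollars_sold": 680},
]

def cuboid(group_dims):
    """Aggregate dollars_sold over all dimensions NOT in group_dims."""
    cells = defaultdict(int)
    for row in base:
        key = tuple(row[d] for d in group_dims)
        cells[key] += row["dollars_sold"]
    return dict(cells)

lattice = {}
for k in range(len(dims) + 1):               # 0-D (apex) up to 4-D (base)
    for group_dims in combinations(dims, k):
        lattice[group_dims] = cuboid(group_dims)

print(len(lattice), "cuboids")               # 2**4 = 16 cuboids in the lattice
print("apex cuboid:", lattice[()])           # total dollars_sold over all dimensions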
What is Star Schema?
A star schema is the elementary form of a dimensional model, in which data are
organized into facts and dimensions. A fact is an event that is counted or measured,
such as a sale or log in. A dimension includes reference data about the fact, such as
date, item, or customer.
A fact table might contain either detail-level facts or facts that have been aggregated (fact tables that contain aggregated facts are often called summary tables instead). A fact table generally contains facts at the same level of aggregation.
Dimension Tables
A dimension is an architecture usually composed of one or more hierarchies that categorize data. If a dimension does not have hierarchies and levels, it is called a flat dimension or list. The primary key of each dimension table is part of the composite primary key of the fact table. Dimensional attributes help to define the dimensional values; they are generally descriptive, textual values. Dimension tables are usually smaller in size than the fact table.
For example, fact tables store data about sales, while dimension tables store data about geographic regions (markets, cities), clients, products, times, and channels.
The star schema provides a parallel in design to how end-users typically think of and use the data.
Because a star schema database has a limited number of tables and clear join paths, queries run faster than they do against OLTP systems. Small single-table queries, frequently against a dimension table, are almost instantaneous. Large join queries that involve multiple tables take only seconds or minutes to run.
In a star schema design, the dimensions are connected only through the central fact table. When two dimension tables are used in a query, only one join path, passing through the fact table, exists between those two tables. This design feature enforces accurate and consistent query results.
Structural simplicity also decreases the time required to load large batches of records into a star schema database. By defining facts and dimensions and separating them into different tables, the impact of a load operation is reduced. Dimension tables can be populated once and occasionally refreshed. We can add new facts regularly and selectively by appending records to the fact table.
Built-in referential integrity
A star schema has referential integrity built in when information is loaded. Referential integrity is enforced because each record in a dimension table has a unique primary key, and all keys in the fact table are legitimate foreign keys drawn from the dimension tables. A record in the fact table that is not related correctly to a dimension cannot be given the correct key value and therefore cannot be retrieved.
Easily Understood
A star schema is simple to understand and navigate, with dimensions joined only
through the fact table. These joins are more significant to the end-user because they
represent the fundamental relationship between parts of the underlying business.
Users can also browse dimension table attributes before constructing a query.
Example: Suppose a star schema is composed of a fact table, SALES, and several
dimension tables connected to it for time, branch, item, and geographic locations.
The TIME table has columns for day, month, quarter, and year. The ITEM table has columns for item_key, item_name, brand, type, and supplier_type. The BRANCH table has columns for branch_key, branch_name, and branch_type. The LOCATION table has columns of geographic data, including street, city, state, and country.
In this scenario, the SALES table contains only four columns with IDs from the
dimension tables, TIME, ITEM, BRANCH, and LOCATION, instead of four columns for
time data, four columns for ITEM data, three columns for BRANCH data, and four
columns for LOCATION data. Thus, the size of the fact table is significantly reduced.
When we need to change an item, we need only make a single change in the dimension
table, instead of making many changes in the fact table.
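A hedged sketch of this schema as DDL, run here against an in-memory SQLite database; the table and column names follow the example, while the data types and the measure columns (dollars_sold, units_sold) are assumptions added for illustration:

import sqlite3

# In-memory sketch of the SALES star schema described above (types assumed).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time     (time_key     INTEGER PRIMARY KEY, day TEXT, month TEXT, quarter TEXT, year INTEGER);
CREATE TABLE item     (item_key     INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT, supplier_type TEXT);
CREATE TABLE branch   (branch_key   INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT, state TEXT, country TEXT);

-- Fact table: foreign keys to each dimension plus the measures.
CREATE TABLE sales (
    time_key     INTEGER REFERENCES time(time_key),
    item_key     INTEGER REFERENCES item(item_key),
    branch_key   INTEGER REFERENCES branch(branch_key),
    location_key INTEGER REFERENCES location(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);
""")
print("star schema created:",
      [r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")])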
We can create even more complex star schemas by normalizing a dimension table into
several tables. The normalized dimension table is called a Snowflake.
The snowflake schema is an expansion of the star schema where each point of the star
explodes into more points. It is called snowflake schema because the diagram of
snowflake schema resembles a snowflake. Snowflaking is a method of normalizing the dimension tables in a star schema. When we normalize all the dimension tables entirely, the resulting structure resembles a snowflake with the fact table in the middle.
Snowflaking is used to improve the performance of specific queries. The schema is diagrammed with each fact surrounded by its associated dimensions, and those dimensions are related to other dimensions, branching out into a snowflake pattern.
The snowflake schema consists of one fact table that is linked to many dimension tables, which can in turn be linked to other dimension tables through many-to-one relationships. Tables in a snowflake schema are generally normalized to third normal form. Each dimension table represents exactly one level in a hierarchy.
A snowflake schema can have any number of dimensions, and each dimension can have any number of levels; for example, a schema might have two dimensions, each with three levels.
Example: Consider a snowflake schema with a Sales fact table and Store, Location, Time, Product, Line, and Family dimension tables. The Market dimension has two dimension tables, with Store as the primary dimension table and Location as the outrigger dimension table. The Product dimension has three dimension tables, with Product as the primary dimension table and the Line and Family tables as the outrigger dimension tables.
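Continuing the same kind of sketch, snowflaking the Product dimension splits it into a core Product table plus Line and Family outrigger tables linked by foreign keys (column names are assumed for illustration); note the extra joins needed to resolve a product:

import sqlite3

# Illustrative snowflaked Product dimension: low-cardinality attributes are
# moved out of the core dimension table into outrigger tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE family  (family_key INTEGER PRIMARY KEY, family_name TEXT);
CREATE TABLE line    (line_key   INTEGER PRIMARY KEY, line_name TEXT,
                      family_key INTEGER REFERENCES family(family_key));
CREATE TABLE product (product_key INTEGER PRIMARY KEY, product_name TEXT,
                      line_key    INTEGER REFERENCES line(line_key));
""")
conn.executemany("INSERT INTO family VALUES (?, ?)", [(1, "Electronics")])
conn.executemany("INSERT INTO line VALUES (?, ?, ?)", [(10, "Phones", 1)])
conn.executemany("INSERT INTO product VALUES (?, ?, ?)", [(100, "Model X", 10)])

# Resolving a product now needs the extra joins that snowflaking introduces.
row = conn.execute("""
    SELECT p.product_name, l.line_name, f.family_name
    FROM product p JOIN line l ON p.line_key = l.line_key
                   JOIN family f ON l.family_key = f.family_key
""").fetchone()
print(row)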
A star schema stores all attributes for a dimension in one denormalized table. This needs more disk space than a more normalized snowflake schema. Snowflaking normalizes the dimension by moving attributes with low cardinality into separate dimension tables that relate to the core dimension table through foreign keys. Snowflaking for the sole purpose of minimizing disk space is not recommended, because it can adversely impact query performance.
The star schema for sales described above contains only five tables, whereas the normalized version extends to eleven tables. We will notice that in the snowflake schema, the attributes with low cardinality in each original dimension table are split out to form separate tables. These new tables are connected back to the original dimension table through artificial keys.
A snowflake schema is designed for flexible querying across more complex dimensions and relationships. It is suitable for many-to-many and one-to-many relationships between dimension levels.
Star Schema
○ In a star schema, the fact table is at the center and is connected directly to the dimension tables.
Snowflake Schema
○ The performance of SQL queries is a bit lower when compared to the star schema, as more joins are involved.
○ Data redundancy is low and less disk space is occupied when compared to the star schema.
Let us see the difference between the star and snowflake schemas.
Basis for Comparison    Star Schema                           Snowflake Schema
Ease of Use             Less complex queries and simple       More complex queries and therefore
                        to understand                         less easy to understand
Query Performance       Fewer foreign keys and hence          More foreign keys and thus more
                        less query execution time             query execution time
Type of Data Warehouse  Good for data marts with simple       Good for a data warehouse core,
                        relationships (one-to-one or          to simplify complex relationships
                        one-to-many)                          (many-to-many)
Data Warehouse System   Works best in any data                Better for a small data
                        warehouse/data mart                   warehouse/data mart
Information Processing
It deals with querying, statistical analysis, and reporting via tables, charts, or graphs. Nowadays, information processing in a data warehouse is used to build low-cost, web-based accessing tools that are typically integrated with web browsers.
Analytical Processing
It supports various online analytical processing operations, such as drill-down, roll-up, and pivoting. The historical data is processed in both summarized and detailed form.
Data Mining
It helps in the discovery of hidden patterns and associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools.
Data mining is the technique of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.
Centralized Process Architecture
In this architecture, the data is collected into a single centralized storage and processed by a single machine with huge capacity in terms of memory, processor, and storage.
Centralized process architecture evolved with transaction processing and is well suited for small organizations with one location of service.
It is very successful when the collection and consumption of data occur at the same location.
Distributed Process Architecture
In this architecture, information and its processing are distributed across data centers; processing is localized, and the results are consolidated into centralized storage. Distributed architectures are used to overcome the limitations of the centralized process architecture, where all the information needs to be collected at one central location and the results are available at one central location.
Client-Server
In this architecture, the client handles information collection and presentation, while the server does the processing and management of the data.
Three-tier Architecture
N-tier Architecture
Cluster Architecture
Peer-to-Peer Architecture
This is a type of architecture where there are no dedicated servers and clients. Instead,
all the processing responsibilities are allocated among all machines, called peers. Each
machine can perform the function of a client or server or just process data.
Intraquery Parallelism
Intraquery parallelism defines the execution of a single query in parallel on multiple
processors and disks. Using intraquery parallelism is essential for speeding up
long-running queries.
Interquery parallelism does not help in this function since each query is run sequentially.
To improve the situation, many DBMS vendors developed versions of their products that
utilized intraquery parallelism.
This application of parallelism decomposes the serial SQL query into lower-level operations such as scan, join, sort, and aggregation.
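A rough sketch of the decomposition idea, assuming the data is already split across partitions: each partition is scanned and partially aggregated in parallel, and the partial results are then merged (a thread pool stands in for multiple processors here):

from concurrent.futures import ThreadPoolExecutor

# One list of (region, amount) rows per partition/disk.
partitions = [
    [("W", 3.67), ("W", 5.33)],
    [("S", 2.50), ("W", 5.66)],
]

def scan_and_aggregate(rows):
    """Low-level operation run on one partition: scan plus partial SUM by region."""
    partial = {}
    for region, amount in rows:
        partial[region] = partial.get(region, 0.0) + amount
    return partial

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(scan_and_aggregate, partitions))

# Final merge step, equivalent to SELECT region, SUM(amount) ... GROUP BY region.
total = {}
for partial in partials:
    for region, amount in partial.items():
        total[region] = total.get(region, 0.0) + amount
print(total)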
Interquery Parallelism
In interquery parallelism, different queries or transactions execute in parallel with one another.
This form of parallelism can increase transaction throughput. However, the response times of individual transactions are no faster than they would be if the transactions were run in isolation.
Shared-Disk Architecture
In a shared-disk architecture, each RDBMS server can read, write, update, and delete information from the same shared database, which requires the system to implement a form of distributed lock manager (DLM).
DLM components can be found in hardware, in the operating system, or in a separate software layer, depending on the system vendor.
The shared-disk distributed memory design eliminates the memory access bottleneck typical of large SMP systems and helps reduce DBMS dependency on data partitioning.
Shared-Memory Architecture
Shared-memory or shared-everything style is the traditional approach of implementing
an RDBMS on SMP hardware.
It is relatively simple to implement and has been very successful up to the point where it
runs into the scalability limitations of the shared-everything architecture.
The key point of this technique is that a single RDBMS server can potentially use all processors, access all memory, and access the entire database, thus providing the client with a consistent single system image.
In shared-memory SMP systems, the DBMS assumes that the multiple database components executing SQL statements communicate with one another by exchanging messages and information via the shared memory.
All processors have access to all data, which is partitioned across local disks.
Shared-Nothing Architecture
In a shared-nothing distributed memory environment, the data is partitioned across all disks, and the DBMS is "partitioned" across multiple co-servers, each of which resides on an individual node of the parallel system and owns its own disk and thus its own database partition.
Each processor has its memory and disk and communicates with other processors by
exchanging messages and data over the interconnection network.
This architecture is optimized specifically for the MPP and cluster systems.
The shared-nothing architecture offers near-linear scalability. The number of processor nodes is limited only by the hardware platform limitations (and budgetary constraints), and each node itself can be a powerful SMP system.
○ Data transformation and calculation based on the application of business rules that drive the transformation.
There are several selection criteria that should be considered while implementing a data warehouse:
1. The ability to identify the data in the data source environment that can be read by
the tool is necessary.
2. Support for flat files, indexed files, and legacy DBMSs is critical.
3. The capability to merge records from multiple data stores is required in many
installations.
7. Selective data extraction of both data items and records enables users to extract
only the required data.
9. The ability to perform data type and the character-set translation is a requirement
when moving data between incompatible systems.
10. The ability to create aggregation, summarization, and derivation fields and records is necessary.
11. Vendor stability and support for the products are components that must be
evaluated carefully.
The warehouse team needs tools that can extract, transform, integrate, clean, and load information from source systems into one or more data warehouse databases. Middleware and gateway products may be needed for warehouses that extract records from host-based source systems.
Warehouse Storage
Software products are also needed to store warehouse data and their accompanying
metadata. Relational database management systems are well suited to large and
growing warehouses.
Different types of software are needed to access, retrieve, distribute, and present
warehouse data to its end-clients.
UNIT V - DW
SYSTEM & PROCESS MANAGERS
Note − The list of jobs given below can also be used as evaluation parameters for a good scheduler.
Some important jobs that a scheduler must be able to handle are as follows −
● Daily and ad hoc query scheduling
● Execution of regular report requirements
● Data load
● Data processing
● Index creation
● Backup
● Aggregation creation
● Data transformation
Note − The event manager monitors event occurrences and deals with them. The event manager also tracks the myriad things that can go wrong on a complex data warehouse system.
Events
Events are actions that are generated by the user or the system itself. It may be noted that an event is a measurable, observable occurrence of a defined action.
Given below is a list of common events that are required to be tracked.
● Hardware failure
● Running out of space on certain key disks
● A process dying
● A process returning an error
● CPU usage exceeding an 80% threshold
● Internal contention on database serialization points
● Buffer cache hit ratios exceeding or falling below a threshold
● A table reaching its maximum size
● Excessive memory swapping
● A table failing to extend due to lack of space
● Disk exhibiting I/O bottlenecks
● Usage of the temporary or sort area reaching a certain threshold
● Any other database shared memory usage
The most important thing about events is that they should be capable of triggering action on their own. Event packages define the procedures for the predefined events. The code associated with each event is known as the event handler; this code is executed whenever the corresponding event occurs.
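As a small illustration (not tied to any specific product), an event manager can be sketched as a registry that maps predefined events to their event handlers, which then run automatically whenever the event is raised:

# Registry of predefined events -> event handlers (the code run when they occur).
handlers = {}

def on_event(name):
    def register(func):
        handlers[name] = func
        return func
    return register

@on_event("disk_space_low")
def handle_disk_space_low(details):
    print("ALERT: key disk running out of space:", details)

@on_event("cpu_usage_high")
def handle_cpu_usage_high(details):
    print("ALERT: CPU usage exceeded threshold:", details)

def raise_event(name, **details):
    """Called by monitoring code; executes the matching handler on its own."""
    if name in handlers:
        handlers[name](details)

raise_event("cpu_usage_high", usage=0.85, threshold=0.80)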
Process managers are responsible for maintaining the flow of data both into
and out of the data warehouse. There are three different types of process
managers −
● Load manager
● Warehouse manager
● Query manager
Data Warehouse Load Manager
Load manager performs the operations required to extract and load the data
into the database. The size and complexity of a load manager varies between
specific solutions from one data warehouse to another.
Load Manager Architecture
The load manager performs the following functions −
● Extract data from the source system.
● Fast load the extracted data into temporary data store.
● Perform simple transformations into structure similar to the one in the
data warehouse.
Query Manager
The query manager is responsible for directing the queries to suitable tables.
By directing the queries to appropriate tables, it speeds up the query request
and response process. In addition, the query manager is responsible for
scheduling the execution of the queries posted by the user.
Query Manager Architecture
A query manager includes the following components −
● Query redirection via C tool or RDBMS
● Stored procedures
● Query management tool
● Query scheduling via C tool or RDBMS
● Query scheduling via third-party software
Functions of Query Manager
● It presents the data to the user in a form they understand.
● It schedules the execution of the queries posted by the end-user.
● It stores query profiles to allow the warehouse manager to determine
which indexes and aggregations are appropriate.
A data warehouse keeps evolving and it is unpredictable what query the user
is going to post in the future. Therefore it becomes more difficult to tune a data
warehouse system. In this chapter, we will discuss how to tune the different
aspects of a data warehouse such as performance, data load, queries, etc.
Performance Assessment
Here is a list of objective measures of performance −
● Average query response time
● Scan rates
● Time used per query
● Memory usage per process
● I/O throughput rates
Following are the points to remember.
● It is necessary to specify the measures in service level agreement
(SLA).
● It is of no use trying to tune response time if it is already better than what is required.
● It is essential to have realistic expectations while making performance
assessment.
● It is also essential that the users have feasible expectations.
● To hide the complexity of the system from the user, aggregations and
views should be used.
● It is also possible that the user can write a query you had not tuned for.
Data Load Tuning
Data load is a critical part of overnight processing. Nothing else can run until
data load is complete. This is the entry point into the system.
Note − If there is a delay in transferring the data, or in arrival of data then the
entire system is affected badly. Therefore it is very important to tune the data
load first.
There are various approaches of tuning data load that are discussed below −
● The very common approach is to insert data using the SQL Layer. In
this approach, normal checks and constraints need to be performed.
When the data is inserted into the table, the code will run to check for
enough space to insert the data. If sufficient space is not available, then
more space may have to be allocated to these tables. These checks
take time to perform and are costly to CPU.
● The second approach is to bypass all these checks and constraints and
place the data directly into the preformatted blocks. These blocks are
later written to the database. It is faster than the first approach, but it
can work only with whole blocks of data. This can lead to some space
wastage.
● The third approach is that, while loading data into a table that already contains data, we can maintain the indexes.
● The fourth approach says that, to load data into tables that already contain data, drop the indexes and recreate them when the data load is complete (see the sketch below). The choice between the third and the fourth approach depends on how much data is already loaded and how many indexes need to be rebuilt.
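A hedged sketch of the fourth approach using SQLite syntax (real warehouse loaders and bulk-load utilities differ): drop the index, load the batch, then recreate the index once the load completes:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product_id INTEGER, qty INTEGER, sale_date TEXT)")
conn.execute("CREATE INDEX idx_sales_date ON sales(sale_date)")

batch = [(30, 5, "2013-08-03"), (35, 4, "2013-09-03"), (40, 5, "2013-09-03")]

# Fourth approach: drop the index, bulk-insert, then recreate the index once.
conn.execute("DROP INDEX idx_sales_date")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", batch)
conn.execute("CREATE INDEX idx_sales_date ON sales(sale_date)")
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0], "rows loaded")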
Integrity Checks
Integrity checking highly affects the performance of the load. Following are the
points to remember −
● Integrity checks need to be limited because they require heavy
processing power.
● Integrity checks should be applied on the source system to avoid performance degradation of the data load.
Tuning Queries
We have two kinds of queries in data warehouse −
● Fixed queries
● Ad hoc queries
Fixed Queries
Fixed queries are well defined. Following are the examples of fixed queries −
● Regular reports
● Canned queries
● Common aggregations
Tuning the fixed queries in a data warehouse is the same as in a relational database system. The only difference is that the amount of data to be queried may be different. It is good to store the most successful execution plan while testing fixed queries. Storing these execution plans will allow us to spot changing data sizes and data skew, as these will cause the execution plan to change.
Note − We cannot do much more with the fact table, but while dealing with dimension tables or the aggregations, the usual collection of SQL tweaking, storage mechanisms, and access methods can be used to tune these queries.
Ad hoc Queries
To understand ad hoc queries, it is important to know the ad hoc users of the
data warehouse. For each user or group of users, you need to know the
following −
● The number of users in the group
● Whether they use ad hoc queries at regular intervals of time
● Whether they use ad hoc queries frequently
● Whether they use ad hoc queries occasionally at unknown intervals.
● The maximum size of query they tend to run
● The average size of query they tend to run
● Whether they require drill-down access to the base data
● The elapsed login time per day
● The peak time of daily usage
● The number of queries they run per peak hour
Points to Note
● It is important to track the user's profiles and identify the queries that are
run on a regular basis.
● It is also important that the tuning performed does not adversely affect overall performance.
● Identify similar ad hoc queries that are frequently run.
● If these queries are identified, then the database can be changed and new indexes added for those queries.
● If these queries are identified, then new aggregations can be created specifically for those queries, which would result in their efficient execution (see the sketch after this list).
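The idea of using stored query profiles to drive new aggregations might be sketched like this (the threshold and column names are assumptions):

from collections import Counter

# Query profiles captured by the query manager: the column sets users group by.
query_log = [
    ("region", "month"),
    ("region", "month"),
    ("product", "month"),
    ("region", "month"),
]

FREQUENT = 3      # assumed threshold for "run on a regular basis"
counts = Counter(query_log)

# Frequently repeated ad hoc groupings become candidate aggregations/indexes.
candidates = [dims for dims, n in counts.items() if n >= FREQUENT]
print("build aggregations for:", candidates)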
Testing is very important for data warehouse systems to make them work
correctly and efficiently. There are three basic levels of testing performed on a
data warehouse −
● Unit testing
● Integration testing
● System testing
Unit Testing
● In unit testing, each component is separately tested.
● Each module, i.e., procedure, program, SQL script, or Unix shell script, is tested.
● This test is performed by the developer.
Integration Testing
● In integration testing, the various modules of the application are brought together and then tested against a number of inputs.
● It is performed to test whether the various components work well together after integration.
System Testing
● In system testing, the whole data warehouse application is tested
together.
● The purpose of system testing is to check whether the entire system
works correctly together or not.
● System testing is performed by the testing team.
● Since the size of the whole data warehouse is very large, it is usually only possible to perform minimal system testing before the test plan is enacted.
Test Schedule
First of all, the test schedule is created in the process of developing the test
plan. In this schedule, we predict the estimated time required for the testing of
the entire data warehouse system.
There are different methodologies available to create a test schedule, but
none of them are perfect because the data warehouse is very complex and
large. Also the data warehouse system is evolving in nature. One may face
the following issues while creating a test schedule −
● A seemingly simple problem may involve a very large query that takes a day or more to complete, i.e., the query does not complete in the desired time scale.
● There may be hardware failures such as losing a disk or human errors
such as accidentally deleting a table or overwriting a large table.
Note − The most important point is to test scalability. Failure to do so will leave us with a system design that does not work when the system grows.