
Chapter 2

Uploaded by

bbby6dfdrg
Copyright
© All Rights Reserved

Data Warehousing – Terminologies, Delivery Process, System Processes

Chapter-2
Outline
Metadata
Metadata Repository
Data Cube
Virtual Warehouse
Data Warehousing - Delivery Process
Delivery Method
Data Warehousing - System Processes
Process Flow in Data Warehouse
Metadata

Metadata is simply defined as data about data. Data that is used
to represent other data is known as metadata. For
example, the index of a book serves as metadata for the
contents of the book. In other words, we can say that metadata
is the summarized data that leads us to the detailed data.
In terms of a data warehouse, we can define metadata as
follows −
Metadata is a road-map to a data warehouse.
Metadata in a data warehouse defines the warehouse objects.
Metadata acts as a directory. This directory helps the decision
support system to locate the contents of a data warehouse.
Metadata Repository

Metadata repository is an integral part of a data warehouse system. It
contains the following metadata −
• Business metadata − It contains the data ownership information,
business definition, and changing policies.
• Operational metadata − It includes currency of data and data lineage.
Currency of data refers to the data being active, archived, or purged.
Lineage of data means the history of the data migrated and the
transformations applied to it.
• Data for mapping from operational environment to data warehouse − This
metadata includes source databases and their contents, data extraction,
data partition, cleaning, transformation rules, data refresh and purging
rules.
• The algorithms for summarization − It includes dimension algorithms,
data on granularity, aggregation, summarizing, etc.
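One way to picture the repository is as a catalog with one entry per warehouse object, where each entry groups the four kinds of metadata listed above. The following is a minimal sketch only; the class name, field names, and the sample entry are hypothetical and do not come from any particular product.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class WarehouseObjectMetadata:
    # Business metadata
    owner: str
    business_definition: str
    # Operational metadata
    currency: str = "active"  # one of: active, archived, purged
    lineage: List[str] = field(default_factory=list)
    # Mapping from the operational environment to the warehouse
    source_tables: List[str] = field(default_factory=list)
    refresh_rule: str = ""
    # Summarization metadata
    granularity: str = ""


# The repository acts as a directory keyed by warehouse object name.
repository: Dict[str, WarehouseObjectMetadata] = {}

# Hypothetical entry, for illustration only.
repository["sales_fact"] = WarehouseObjectMetadata(
    owner="Sales department",
    business_definition="One row per item sold per branch per day",
    lineage=["extracted from orders", "amounts converted to one currency"],
    source_tables=["orders"],
    refresh_rule="nightly",
    granularity="daily",
)


def objects_with_currency(currency: str) -> List[str]:
    """Return the names of warehouse objects in the given currency state."""
    return sorted(
        name for name, meta in repository.items() if meta.currency == currency
    )


active_objects = objects_with_currency("active")
```

A decision support tool could use such a lookup to find which objects are still active before querying them.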
Data Cube

• A data cube helps us represent data in multiple dimensions. It is
defined by dimensions and facts. The dimensions are the entities
with respect to which an enterprise preserves the records.
Illustration of Data Cube
• Suppose a company wants to keep track of sales records with the
help of a sales data warehouse with respect to time, item, branch, and
location. These dimensions allow us to keep track of monthly sales and
of the branch at which the items were sold. There is a table associated with
each dimension. This table is known as a dimension table. For
example, the "item" dimension table may have attributes such as
item_name, item_type, and item_brand.
• The following table represents the 2-D view of sales data for a
company with respect to time, item, and location dimensions.
Data Cube(cont…)

But here in this 2-D table, we have records with
respect to time and item only. The sales for
New Delhi are shown with respect to time, and
item dimensions according to type of items
sold. If we want to view the sales data with one
more dimension, say, the location dimension,
then the 3-D view would be useful. The 3-D
view of the sales data with respect to time,
item, and location is shown in the table below −
Data Cube(cont…)

The previous 3-D table can be
represented as a 3-D data cube, as shown in the
following figure −
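As a rough complement to the figure, the same idea can be sketched in code: a cube is a mapping from a (time, item, location) combination to a fact, and the 2-D view is what remains after fixing one dimension. The sales figures and the second location below are made up for illustration, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (time, item, location, units_sold).
facts = [
    ("Q1", "Mobile", "New Delhi", 100),
    ("Q1", "Modem", "New Delhi", 20),
    ("Q2", "Mobile", "New Delhi", 120),
    ("Q1", "Mobile", "Mumbai", 80),
    ("Q1", "Mobile", "New Delhi", 5),  # a second record for the same cell
]


def build_cube(rows):
    """Sum the fact for every (time, item, location) cell."""
    cube = defaultdict(int)
    for time, item, location, units in rows:
        cube[(time, item, location)] += units
    return dict(cube)


def slice_by_location(cube, location):
    """2-D view: fix the location and keep the time and item dimensions."""
    return {
        (time, item): units
        for (time, item, loc), units in cube.items()
        if loc == location
    }


cube = build_cube(facts)
new_delhi_view = slice_by_location(cube, "New Delhi")
```

Here `new_delhi_view` corresponds to the 2-D table for one location, while `cube` holds the full 3-D view.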
Data Mart

Data marts contain a subset of organization-wide
data that is valuable to specific groups of
people in an organization. In other words, a
data mart contains only the data that is
specific to a particular group. For example, the
marketing data mart may contain only data
related to items, customers, and sales. Data
marts are confined to subjects.
Data Mart(cont….)

Points to Remember About Data Marts
• Windows-based or Unix/Linux-based servers are used to implement
data marts. They are implemented on low-cost servers.
• The implementation cycle of a data mart is measured in short
periods of time, i.e., in weeks rather than months or years.
• The life cycle of data marts may be complex in the long run, if their
planning and design are not organization-wide.
• Data marts are small in size.
• Data marts are customized by department.
• The source of a data mart is a departmentally structured data
warehouse.
• Data marts are flexible.
Data Mart(cont….)

The following figure shows a graphical
representation of data marts:
Virtual Warehouse

A view over an operational data warehouse is
known as a virtual warehouse. It is easy to build
a virtual warehouse. Building a virtual
warehouse requires excess capacity on
operational database servers.
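The idea can be sketched with a database view. The example below uses Python's built-in sqlite3 module with an in-memory database standing in for an operational server; the table, view, and column names are hypothetical. Because the view is recomputed from the operational table on every query, the work lands on the operational server, which is why spare capacity is needed there.

```python
import sqlite3

# In-memory database standing in for an operational database server.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        branch   TEXT,
        item     TEXT,
        amount   REAL
    );
    INSERT INTO orders (branch, item, amount) VALUES
        ('North', 'Mobile', 120.0),
        ('North', 'Modem',   30.0),
        ('South', 'Mobile', 200.0);

    -- The "virtual warehouse": no data is copied, only a view is defined.
    CREATE VIEW sales_by_branch AS
        SELECT branch, SUM(amount) AS total_sales
        FROM orders
        GROUP BY branch;
    """
)

rows = conn.execute(
    "SELECT branch, total_sales FROM sales_by_branch ORDER BY branch"
).fetchall()
conn.close()
```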
Data Warehousing - Delivery Process

A data warehouse is never static; it evolves as the business
expands. As the business evolves, its requirements keep changing
and therefore a data warehouse must be designed to accommodate
these changes. Hence a data warehouse system needs to be
flexible.
Ideally there should be a delivery process to deliver a data
warehouse. However, data warehouse projects normally suffer
from various issues that make it difficult to complete tasks and
deliverables in the strict and ordered fashion demanded by the
waterfall method. Most of the time, the requirements are not
understood completely. The architectures, designs, and build
components can be completed only after gathering and studying
all the requirements.
Data Warehousing - Delivery Process(cont…)

Delivery Method
The delivery method is a variant of the joint
application development approach adopted for
the delivery of a data warehouse. We have staged
the data warehouse delivery process to minimize
risks. The approach that we will discuss here does
not reduce the overall delivery time-scales but
ensures the business benefits are delivered
incrementally through the development process.
Data Warehousing - Delivery
Process(cont…)
Note − The delivery process is broken into phases to reduce the project and
delivery risk.
The following diagram explains the stages in the delivery process −
Data Warehousing - Delivery
Process(cont…)
IT Strategy
Data warehouses are strategic investments that require a business
process to generate benefits. IT Strategy is required to procure and
retain funding for the project.
Business Case
The objective of the business case is to estimate the business benefits that
should be derived from using a data warehouse. These benefits
may not be quantifiable but the projected benefits need to be
clearly stated. If a data warehouse does not have a clear business
case, then the project tends to suffer from credibility problems at
some stage during the delivery process. Therefore, in data
warehouse projects, we need to understand the business case for
investment.
Data Warehousing - Delivery
Process(cont…)
Education and Prototyping
Organizations experiment with the concept of data analysis and educate
themselves on the value of having a data warehouse before settling
for a solution. This is addressed by prototyping. It helps in
understanding the feasibility and benefits of a data warehouse. The
prototyping activity on a small scale can promote the educational process
as long as −
• The prototype addresses a defined technical objective.
• The prototype can be thrown away after the feasibility concept has
been shown.
• The activity addresses a small subset of eventual data content of the
data warehouse.
• The activity timescale is non-critical.
Data Warehousing - Delivery
Process(cont…)
The following points are to be kept in mind to produce
an early release and deliver business benefits.
• Identify the architecture that is capable of evolving.
• Focus on business requirements and technical
blueprint phases.
• Limit the scope of the first build phase to the
minimum that delivers business benefits.
• Understand the short-term and medium-term
requirements of the data warehouse.
Data Warehousing - Delivery
Process(cont…)
Business Requirements
To provide quality deliverables, we should make sure the overall
requirements are understood. If we understand the business
requirements for both short-term and medium-term, then we
can design a solution to fulfill short-term requirements. The
short-term solution can then be grown to a full solution.
The following aspects are determined in this stage −
• The business rules to be applied to the data.
• The logical model for information within the data warehouse.
• The query profiles for the immediate requirement.
• The source systems that provide this data.
Data Warehousing - Delivery
Process(cont…)
Technical Blueprint
This phase needs to deliver an overall architecture satisfying the long-term
requirements. This phase also delivers the components that
must be implemented in the short term to derive any business
benefit. The blueprint needs to identify the following −
• The overall system architecture.
• The data retention policy.
• The backup and recovery strategy.
• The server and data mart architecture.
• The capacity plan for hardware and infrastructure.
• The components of database design.
Data Warehousing - Delivery
Process(cont…)
Building the Version
• In this stage, the first production deliverable is produced. This production
deliverable is the smallest component of a data warehouse. This smallest
component adds business benefit.
History Load
• This is the phase where the remainder of the required history is loaded into the
data warehouse. In this phase, we do not add new entities, but additional physical
tables would probably be created to store increased data volumes.
• Let us take an example. Suppose the build version phase has delivered a retail
sales analysis data warehouse with 2 months’ worth of history. This information
will allow the user to analyze only the recent trends and address the short-term
issues. The user in this case cannot identify annual and seasonal trends. To help
the user do so, the last 2 years’ sales history could be loaded from the archive. Now the
40GB data is extended to 400GB.
Note − The backup and recovery procedures may become complex, therefore it is
recommended to perform this activity within a separate phase.
Data Warehousing - Delivery
Process(cont…)
Ad hoc Query
• In this phase, we configure an ad hoc query tool that is used to operate a data
warehouse. These tools can generate the database query.
• Note − It is recommended not to use these access tools when the database is
being substantially modified.
Automation
• In this phase, operational management processes are fully automated. These
would include −
• Transforming the data into a form suitable for analysis.
• Monitoring query profiles and determining appropriate aggregations to maintain
system performance.
• Extracting and loading data from different source systems.
• Generating aggregations from predefined definitions within the data warehouse.
• Backing up, restoring, and archiving the data.
Data Warehousing - Delivery
Process(cont…)
Extending Scope
• In this phase, the data warehouse is extended to address a new set of business requirements.
The scope can be extended in two ways −
• By loading additional data into the data warehouse.
• By introducing new data marts using the existing information.
• Note − This phase should be performed separately, since it involves substantial effort and
complexity.
Requirements Evolution
• From the perspective of delivery process, the requirements are always changeable. They are not
static. The delivery process must support this and allow these changes to be reflected within the
system.
• This issue is addressed by designing the data warehouse around the use of data within business
processes, as opposed to the data requirements of existing queries.
• The architecture is designed to change and grow to match the business needs. The process
operates as a pseudo-application development process, where the new requirements are
continually fed into the development activities and the partial deliverables are produced. These
partial deliverables are fed back to the users and then reworked, ensuring that the overall system
is continually updated to meet the business needs.
Data Warehousing - System Processes

We have a fixed number of operations to be applied on the
operational databases and we have well-defined
techniques such as using normalized data, keeping tables
small, etc. These techniques are suitable for delivering a
solution. But in the case of decision-support systems, we do
not know which queries and operations will need to be executed
in the future. Therefore, techniques applied on operational
databases are not suitable for data warehouses.
• In this section, we will discuss how to build data
warehousing solutions on top of open-system technologies
like Unix and relational databases.
Data Warehousing - System
Processes(cont..)

Process Flow in Data Warehouse


There are four major processes that contribute
to a data warehouse −
• Extracting and loading the data.
• Cleaning and transforming the data.
• Backing up and archiving the data.
• Managing queries and directing them to the
appropriate data sources.
Data Warehousing - System Processes(cont..)

Extract and Load Process


• Data extraction takes data from the source systems. Data load
takes the extracted data and loads it into the data warehouse.
Note − Before loading the data into the data warehouse, the
information extracted from the external sources must be
reconstructed.
Controlling the Process
• Controlling the process involves determining when to start
data extraction and the consistency check on data. The controlling
process ensures that the tools, the logic modules, and the
programs are executed in the correct sequence and at the correct
time.
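A controlling process can be pictured as a small driver that runs named steps strictly in order and stops as soon as one of them fails. This is only a sketch under simple assumptions: the step names, the `staging` list, and the rule that each step returns True on success are all hypothetical.

```python
from typing import Callable, List, Tuple


def run_in_sequence(steps: List[Tuple[str, Callable[[], bool]]]) -> List[str]:
    """Run each step in order; stop at the first step that reports failure."""
    completed = []
    for name, step in steps:
        if not step():
            break
        completed.append(name)
    return completed


# Hypothetical steps; each returns True on success.
staging = []


def extract() -> bool:
    staging.extend([{"customer": "c1"}, {"customer": "c2"}])
    return True


def consistency_check() -> bool:
    # Every staged row must carry a customer identifier.
    return all("customer" in row for row in staging)


def load() -> bool:
    return len(staging) > 0


completed = run_in_sequence(
    [
        ("extract", extract),
        ("consistency_check", consistency_check),
        ("load", load),
    ]
)
```

In a real system the "correct time" part would come from a scheduler; the sketch covers only the ordering.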
Data Warehousing - System Processes(cont..)

When to Initiate Extract


• Data needs to be in a consistent state when it is extracted, i.e., the data
warehouse should represent a single, consistent version of the information
to the user.
• For example, in a customer profiling data warehouse in the telecommunication
sector, it is illogical to merge the list of customers at 8 pm on Wednesday
from a customer database with the customer subscription events up to 8
pm on Tuesday. This would mean that we are finding the customers for
whom there are no associated subscriptions.
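The Wednesday/Tuesday example can be expressed as a simple guard: before merging, check that every source snapshot was taken as of the same point in time. The sketch below assumes each source reports an "as-of" timestamp; the source names and dates are hypothetical.

```python
from datetime import datetime


def snapshots_consistent(as_of_times):
    """Return True when every source snapshot shares the same as-of time."""
    return len(set(as_of_times.values())) == 1


# Hypothetical snapshot times matching the example in the text.
snapshots = {
    "customers": datetime(2024, 1, 10, 20, 0),     # Wednesday, 8 pm
    "subscriptions": datetime(2024, 1, 9, 20, 0),  # Tuesday, 8 pm
}
mismatched = snapshots_consistent(snapshots)

# Once the subscription events are extracted up to the same time,
# the two sources can be merged safely.
snapshots["subscriptions"] = datetime(2024, 1, 10, 20, 0)
aligned = snapshots_consistent(snapshots)
```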
Loading the Data
• After extracting the data, it is loaded into a temporary data store where it is
cleaned up and made consistent.
• Note − Consistency checks are executed only when all the data sources have
been loaded into the temporary data store.
Data Warehousing - System Processes(cont..)

Clean and Transform Process


Once the data is extracted and loaded into the temporary data store, it is time to perform
Cleaning and Transforming. Here is the list of steps involved in Cleaning and Transforming −
• Clean and transform the loaded data into a structure
• Partition the data
• Aggregation
Clean and Transform the Loaded Data into a Structure
• Cleaning and transforming the loaded data helps speed up the queries. It can be done by
making the data consistent −
• within itself.
• with other data within the same data source.
• with the data in other source systems.
• with the existing data present in the warehouse.
Transforming involves converting the source data into a structure. Structuring the data increases
the query performance and decreases the operational cost. The data contained in a data
warehouse must be transformed to support performance requirements and control the
ongoing operational costs.
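As an illustration of the cleaning step, the sketch below normalizes raw records (consistency within the data itself) and then checks each one against reference values already held in the warehouse (consistency with existing data). The branch codes, field names, and sample rows are hypothetical.

```python
# Hypothetical branch codes already present in the warehouse.
KNOWN_BRANCHES = {"DEL", "MUM"}


def clean_record(raw):
    """Trim whitespace, normalize case, and convert the amount to a float."""
    return {
        "item": raw["item"].strip().title(),
        "branch": raw["branch"].strip().upper(),
        "amount": float(raw["amount"]),
    }


def split_consistent(records):
    """Clean every record, then separate those whose branch is unknown."""
    accepted, rejected = [], []
    for raw in records:
        record = clean_record(raw)
        if record["branch"] in KNOWN_BRANCHES:
            accepted.append(record)
        else:
            rejected.append(record)
    return accepted, rejected


raw_rows = [
    {"item": " mobile ", "branch": "del", "amount": "120.50"},
    {"item": "MODEM", "branch": " mum", "amount": "30"},
    {"item": "modem", "branch": "xyz", "amount": "10"},
]
accepted, rejected = split_consistent(raw_rows)
```

Rejected rows would typically be held back for correction rather than silently dropped.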
Data Warehousing - System Processes(cont..)

Partition the Data


• Partitioning optimizes the hardware performance and
simplifies the management of the data warehouse. Here
we partition each fact table into multiple separate
partitions.
Aggregation
• Aggregation is required to speed up common
queries. Aggregation relies on the fact that most
common queries will analyze a subset or an
aggregation of the detailed data.
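Both steps can be sketched on a tiny in-memory fact table. The example assumes partitioning by calendar month, which is one common choice but not the only one; the rows and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows with ISO-formatted dates.
fact_rows = [
    {"date": "2024-01-15", "item": "Mobile", "amount": 100.0},
    {"date": "2024-01-20", "item": "Modem", "amount": 40.0},
    {"date": "2024-02-03", "item": "Mobile", "amount": 60.0},
]


def partition_by_month(rows):
    """Split the fact table into one partition per YYYY-MM month."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["date"][:7]].append(row)
    return dict(partitions)


def aggregate_monthly_sales(partitions):
    """Precompute total sales per month so common queries skip the detail rows."""
    return {
        month: sum(row["amount"] for row in rows)
        for month, rows in partitions.items()
    }


partitions = partition_by_month(fact_rows)
monthly_sales = aggregate_monthly_sales(partitions)
```

A query for January's total can now read one precomputed value instead of scanning every detail row.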
Data Warehousing - System Processes(cont..)

Backup and Archive the Data


• In order to recover the data in the event of data loss, software failure, or hardware failure, it
is necessary to keep regular backups. Archiving involves removing the old data from the
system in a format that allows it to be quickly restored whenever required.
• For example, in a retail sales analysis data warehouse, it may be required to keep data for 3
years with the latest 6 months of data being kept online. In such a scenario, there is often a
requirement to be able to do month-on-month comparisons for this year and last year. In this
case, we require some data to be restored from the archive.
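The retention example can be worked through in code. The sketch below assumes months are given as (year, month) pairs, that "latest 6 months" includes the current month, and that data older than 3 years is purged; the function names are hypothetical.

```python
def months_between(current, other):
    """Whole months from `other` to `current`, both given as (year, month)."""
    return (current[0] - other[0]) * 12 + (current[1] - other[1])


def storage_tier(current, month, online_months=6, retention_months=36):
    """Classify a month as online, archive, or purged under the retention rule."""
    age = months_between(current, month)
    if age < 0:
        raise ValueError("month lies in the future")
    if age < online_months:
        return "online"
    if age < retention_months:
        return "archive"
    return "purged"


def months_to_restore(current, wanted):
    """Months needed by a query that must first be restored from the archive."""
    return [m for m in wanted if storage_tier(current, m) == "archive"]


# Comparing June of this year with June of last year (hypothetical dates).
current_month = (2024, 6)
to_restore = months_to_restore(current_month, [(2024, 6), (2023, 6)])
```

Last year's June is older than 6 months but within 3 years, so it is the one month that has to come back from the archive.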
Query Management Process
• This process performs the following functions −
• manages the queries.
• helps speed up the execution time of queries.
• directs the queries to their most effective data sources.
• ensures that all the system sources are used in the most effective way.
• monitors actual query profiles.
The information generated in this process is used by the warehouse management process to
determine which aggregations to generate. This process does not generally operate during
the regular load of information into the data warehouse.
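A very small query manager might look like the sketch below: it records a profile of each query, sends it to a matching aggregation table when one exists, and otherwise falls back to the detail table. Frequently seen groupings with no aggregation become candidates for the warehouse management process. All table names and the matching rule (exact match on the grouping columns) are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical aggregation tables, keyed by the columns they are grouped on.
AGGREGATIONS = {
    frozenset({"month"}): "sales_by_month",
    frozenset({"month", "branch"}): "sales_by_month_branch",
}
DETAIL_TABLE = "sales_detail"

# Observed query profiles: how often each grouping was requested.
query_profile = Counter()


def route_query(group_by):
    """Record the query profile and pick the most effective data source."""
    key = frozenset(group_by)
    query_profile[key] += 1
    return AGGREGATIONS.get(key, DETAIL_TABLE)


def suggest_aggregations(min_hits=2):
    """Groupings requested often enough to justify a new aggregation."""
    return [
        sorted(key)
        for key, hits in query_profile.items()
        if hits >= min_hits and key not in AGGREGATIONS
    ]


first = route_query(["month"])
second = route_query(["item"])
third = route_query(["item"])
suggestions = suggest_aggregations()
```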
END
