UNIT III
3.1 What Is a Data Warehouse?
The term "data warehouse" was coined by Bill Inmon in 1990, who defined it in the following way: "A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process." He defined the terms in this sentence as follows:
Subject-oriented: Data is organized around the major subjects of the enterprise (such as customers, products and sales) rather than around individual applications.
Integrated: Data is gathered into the data warehouse from a variety of sources and merged into a coherent whole.
Time-variant: All data in the data warehouse is identified with a particular time
period.
Non-volatile: Data is stable in a data warehouse. More data is added but data is never
removed.
Data Mart: A departmental subset of the warehouse that focuses on selected subjects. A data mart is a segment of a data warehouse that provides data for reporting and analysis on a section, unit, department or operation of the company, e.g. sales, payroll or production. Data marts are sometimes complete, individual data warehouses, usually smaller than the corporate data warehouse.
• Operational Data:
Focuses on transactional functions such as bank card withdrawals and deposits
Detailed
Updateable
Reflects current data
• Informational Data:
Focuses on providing answers to problems posed by decision makers
Summarized
Non-updateable
Data Warehouse Characteristics
The major distinguishing features between OLTP and OLAP systems are summarized as follows.
2. Data contents: An OLTP system manages current data that, typically, are too detailed to be easily used for decision making. An OLAP system manages large amounts of historical data, provides facilities for summarization and aggregation, and stores and manages information at different levels of granularity. These features make the data easier to use for informed decision making.
4. View: An OLTP system focuses mainly on the current data within an enterprise or
department, without referring to historical data or data in different organizations. In
contrast, an OLAP system often spans multiple versions of a database schema. OLAP
systems also deal with information that originates from different organizations,
integrating information from many data stores. Because of their huge volume, OLAP
data are stored on multiple storage media.
5. Access patterns: The access patterns of an OLTP system consist mainly of short,
atomic transactions. Such a system requires concurrency control and recovery
mechanisms. Accesses to OLAP systems, in contrast, are mostly read-only operations, although many of them may be complex queries.
Star schema: The star schema is a modeling paradigm in which the data
warehouse contains (1) a large central table (fact table), and (2) a set of
smaller attendant tables (dimension tables), one for each dimension. The
schema graph resembles a starburst, with the dimension tables displayed in a
radial pattern around the central fact table.
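For illustration, a minimal star schema for a sales warehouse could be declared as follows. This is a sketch: the table and column names (sales_fact, time_dim, item_dim, location_dim and their columns) are assumed here, not taken from a specific figure in these notes.

    -- Dimension tables: one per dimension, each with its own surrogate key.
    CREATE TABLE time_dim     (time_key     INT PRIMARY KEY, day DATE, quarter VARCHAR(2), year INT);
    CREATE TABLE item_dim     (item_key     INT PRIMARY KEY, item_name VARCHAR(50), brand VARCHAR(30), type VARCHAR(30));
    CREATE TABLE location_dim (location_key INT PRIMARY KEY, city VARCHAR(40), country VARCHAR(40));

    -- Central fact table: foreign keys to every dimension plus the numeric measures.
    CREATE TABLE sales_fact (
        time_key      INT REFERENCES time_dim(time_key),
        item_key      INT REFERENCES item_dim(item_key),
        location_key  INT REFERENCES location_dim(location_key),
        dollars_sold  DECIMAL(12,2),
        units_sold    INT
    );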
3. Slice and dice: The slice operation performs a selection on one dimension of the given cube, resulting in a subcube. The figure shows a slice operation in which the sales data are selected from the central cube for the dimension time using the criterion time = "Q2". The dice operation defines a subcube by performing a selection on two or more dimensions (see the SQL sketch after this list).
4. Pivot (rotate): Pivot is a visualization operation that rotates the data axes in view in order to provide an alternative presentation of the data. The figure shows a pivot operation in which the item and location axes in a 2-D slice are rotated.
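Against a star schema such as the one sketched earlier, slice and dice reduce to adding selection predicates on the dimension tables. A sketch, assuming the same illustrative table and column names (the city values are likewise only examples):

    -- Slice: fix one dimension (time = 'Q2') and aggregate over the rest.
    SELECT l.city, i.item_name, SUM(f.dollars_sold) AS sales
    FROM sales_fact f
    JOIN time_dim t     ON f.time_key = t.time_key
    JOIN item_dim i     ON f.item_key = i.item_key
    JOIN location_dim l ON f.location_key = l.location_key
    WHERE t.quarter = 'Q2'
    GROUP BY l.city, i.item_name;

    -- Dice: select on two or more dimensions at once.
    SELECT t.quarter, SUM(f.dollars_sold) AS sales
    FROM sales_fact f
    JOIN time_dim t     ON f.time_key = t.time_key
    JOIN location_dim l ON f.location_key = l.location_key
    WHERE t.quarter IN ('Q1', 'Q2') AND l.city IN ('Chennai', 'Mumbai')
    GROUP BY t.quarter;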
From the architecture point of view, there are three data warehouse models: the
enterprise warehouse, the data mart, and the virtual warehouse.
(i) Independent data marts are sourced from data captured from one or
more operational systems or external information providers, or from data
generated locally within a particular department or geographic area.
EXTRACT
Some of the data elements in the operational database can reasonably be expected to be useful in decision making, but others are of less value for that purpose. For this reason it is necessary to extract the relevant data from the operational database before bringing it into the data warehouse. Many commercial tools are available to help with the extraction process; Data Junction is one such product. The user of one of these tools typically has an easy-to-use windowed interface with which to specify the following (an example extraction query is sketched in SQL after this list):
(i) Which files and tables are to be accessed in the source database.
(ii) Which fields are to be extracted from them. This is often done internally by an SQL SELECT statement.
(iii) What the extracted fields are to be called in the resulting database.
(iv) The target machine and database format of the output.
(v) The schedule on which the extraction process should be repeated.
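A minimal sketch of such an extraction step, assuming a hypothetical operational table customer_orders (the table, column and status names, the date cut-off and the target names are all illustrative; scheduling is left to the extraction tool itself):

    -- (i)-(iii): pick the source table, the fields of interest, and their target names.
    SELECT o.order_id     AS sale_id,
           o.customer_id  AS cust_key,
           o.order_date   AS sale_date,
           o.total_amount AS sale_amount
    FROM customer_orders o
    -- Extract only the data relevant to decision making, e.g. completed orders.
    WHERE o.status = 'COMPLETED'
      AND o.order_date >= DATE '2024-01-01';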
TRANSFORM
The operational databases may have been developed based on differing sets of priorities, which keep changing with the requirements. Those who develop a data warehouse based on these databases are therefore typically faced with inconsistencies among their data sources. The transformation process deals with rectifying such inconsistencies, if any.
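As a sketch of such a transformation (the staging table staging_customers, its columns and the code values are assumed for illustration), inconsistent codes and units can be mapped to a single warehouse-wide representation during the SELECT:

    -- Standardize inconsistent gender codes ('0'/'1' in one source, 'F'/'M' in another)
    -- and inconsistent units (cents vs. dollars) into one warehouse convention.
    SELECT cust_key,
           CASE gender
                WHEN '0' THEN 'F'
                WHEN '1' THEN 'M'
                ELSE gender          -- already stored as 'F'/'M'
           END                       AS gender,
           amount_in_cents / 100.0   AS amount_in_dollars
    FROM staging_customers;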
CLEANSING
Information quality is the key consideration in determining the value of the information. The developer of the data warehouse is not usually in a position to change the quality of the underlying historic data, though a data warehousing project can put a spotlight on data quality issues and lead to improvements for the future. It is therefore usually necessary to go through the data entered into the data warehouse and make it as error-free as possible. This process is known as data cleansing.
Data cleansing must deal with many types of possible errors. These include missing data and incorrect data at a single source, and inconsistent or conflicting data when two or more sources are involved. Several algorithms are used to clean the data; they are discussed in the coming lecture notes.
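A simple sketch of the kinds of checks a cleansing step performs, assuming a hypothetical staging table stg_customer with columns name, bdate, age and city (all names here are illustrative):

    -- Missing data: rows with no usable name or birth date.
    SELECT * FROM stg_customer
    WHERE name IS NULL OR bdate IS NULL;

    -- Incorrect data: values outside the permissible range.
    SELECT * FROM stg_customer
    WHERE bdate > CURRENT_DATE OR age NOT BETWEEN 0 AND 120;

    -- Conflicting data across rows that claim to describe the same person.
    SELECT name, COUNT(DISTINCT city) AS distinct_cities
    FROM stg_customer
    GROUP BY name
    HAVING COUNT(DISTINCT city) > 1;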
LOADING
Loading often implies physical movement of the data from the computer(s) storing
the source database(s) to that which will store the data warehouse database, assuming
it is different. This takes place immediately after the extraction phase. The most
common channel for data movement is a high-speed communication link. For example, Oracle Warehouse Builder is an Oracle tool that provides features for performing the ETL tasks that populate an Oracle data warehouse.
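In SQL terms, the load step often boils down to a bulk INSERT ... SELECT from the cleansed staging area into the warehouse tables. A sketch under the star schema assumed earlier (stg_sales_clean is an assumed staging table name):

    -- Move cleansed, transformed rows from staging into the warehouse fact table.
    INSERT INTO sales_fact (time_key, item_key, location_key, dollars_sold, units_sold)
    SELECT s.time_key, s.item_key, s.location_key, s.dollars_sold, s.units_sold
    FROM stg_sales_clean s;

    -- For large volumes, bulk-load utilities (e.g. SQL*Loader, COPY) or ETL tools
    -- are typically used instead of row-by-row INSERTs.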
The data quality of a source largely depends on the degree to which it is governed by
schema and integrity constraints controlling permissible data values. For sources
without schema, such as files, there are few restrictions on what data can be entered
and stored, giving rise to a high probability of errors and inconsistencies. Database
systems, on the other hand, enforce restrictions of a specific data model (e.g., the
relational approach requires simple attribute values, referential integrity, etc.) as well
as application-specific integrity constraints. Schema-related data quality problems
thus occur because of the lack of appropriate model-specific or application-specific
integrity constraints, e.g., due to data model limitations or poor schema design, or
because only a few integrity constraints were defined to limit the overhead for
integrity control. Instance-specific problems relate to errors and inconsistencies that
cannot be prevented at the schema level (e.g., misspellings).
For both schema- and instance-level problems we can differentiate different problem
scopes: attribute (field), record, record type and source; examples for the various
cases are shown in Tables 1 and 2. Note that uniqueness constraints specified at the
schema level do not prevent duplicated instances, e.g., if information on the same real
world entity is entered twice with different attribute values (see example in Table 2).
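The point that a uniqueness constraint on the key does not catch a real-world entity entered twice can be illustrated with a query such as the following (stg_customer and its columns are assumed names):

    -- cid is unique, so the DBMS accepts both rows; the duplicate only shows up
    -- when we group on descriptive attributes such as name and birth date.
    SELECT name, bdate,
           COUNT(*)  AS occurrences,
           MIN(cid)  AS first_cid,
           MAX(cid)  AS other_cid
    FROM stg_customer
    GROUP BY name, bdate
    HAVING COUNT(*) > 1;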
Multi-source problems
The problems present in single sources are aggravated when multiple sources need to
be integrated. Each source may contain dirty data and the data in the sources may be
represented differently, overlap or contradict. This is because the sources are
typically developed, deployed and maintained independently to serve specific needs.
This results in a large degree of heterogeneity w.r.t. data management systems, data
models, schema designs and the actual data.
At the schema level, data model and schema design differences are to be
addressed by the steps of schema translation and schema integration, respectively.
The main problems w.r.t. schema design are naming and structural conflicts. Naming
conflicts arise when the same name is used for different objects (homonyms) or
different names are used for the same object (synonyms). Structural conflicts occur in
many variations and refer to different representations of the same object in different
sources, e.g., attribute vs. table representation, different component structure,
different data types, different integrity constraints, etc. In addition to schema-level
conflicts, many conflicts appear only at the instance level (data conflicts). All
problems from the single-source case can occur with different representations in
different sources (e.g., duplicated records, contradicting records,…). Furthermore,
even when there are the same attribute names and data types, there may be different
value representations (e.g., for marital status) or different interpretation of the values
(e.g., measurement units Dollar vs. Euro) across sources. Moreover, information in
the sources may be provided at different aggregation levels (e.g., sales per product vs.
sales per product group) or refer to different points in time (e.g. current sales as of
yesterday for source 1 vs. as of last week for source 2).
The two sources in the example of Fig. 3 are both in relational format but exhibit
schema and data conflicts. At the schema level, there are name conflicts (synonyms
Customer/Client, Cid/Cno, Sex/Gender) and structural conflicts (different
representations for names and addresses). At the instance level, we note that there are
different gender representations (“0”/”1” vs. “F”/”M”) and presumably a duplicate
record (Kristen Smith). The latter observation also reveals that while Cid/Cno are
both source-specific identifiers, their contents are not comparable between the
sources; different numbers (11/493) may refer to the same person while different
persons can have the same number (24). Solving these problems requires both
schema integration and data cleaning; the third table shows a possible solution. Note
that the schema conflicts should be resolved first to allow data cleaning, in particular
detection of duplicates based on a uniform representation of names and addresses,
and matching of the Gender/Sex values.
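A sketch of how the two sources of Fig. 3 could be brought to a uniform representation before duplicate detection. The table names Customer/Client and the columns Cid/Cno and Sex/Gender come from the text; the remaining column names are assumed for illustration, and splitting the combined name field is shown only schematically, since robust name and address parsing normally needs a dedicated cleaning routine:

    -- Map both sources into one uniform customer representation.
    SELECT Cid          AS source_id,
           'S1'         AS source,
           Name         AS full_name,   -- still needs splitting into first/last name
           CASE Sex WHEN '0' THEN 'F' WHEN '1' THEN 'M' END AS gender,
           Street       AS address_part1,
           City         AS address_part2
    FROM Customer
    UNION ALL
    SELECT Cno, 'S2',
           FirstName || ' ' || LastName,
           Gender,
           Address,
           CAST(NULL AS VARCHAR(100))
    FROM Client;
    -- Duplicate detection (e.g. the two Kristen Smith records) then runs on this
    -- uniform representation of names, addresses and gender values.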
Data analysis
Metadata reflected in schemas is typically insufficient to assess the data quality
of a source, especially if only a few integrity constraints are enforced. It is thus
important to analyse the actual instances to obtain real (reengineered)
metadata on data characteristics or unusual value patterns. This metadata helps to find data quality problems. Moreover, it can effectively contribute to identifying attribute correspondences between source schemas (schema matching), from which automatic data transformations can be derived.
There are two related approaches for data analysis, data profiling and
data mining. Data profiling focuses on the instance analysis of individual
attributes. It derives information such as the data type, length, value range,
discrete values and their frequency, variance, uniqueness, occurrence of null
values, typical string pattern (e.g., for phone numbers), etc., providing an exact
view of various quality aspects of the attribute.
Table 3 shows examples of how this metadata can help detect data quality problems.
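In SQL, the per-attribute metadata that data profiling derives can be gathered with simple aggregate queries. A sketch for one attribute of an assumed stg_customer table (the table and column names are illustrative):

    -- Profile a single attribute: cardinality, null rate, value range and length.
    SELECT COUNT(*)                                        AS row_count,
           COUNT(DISTINCT phone)                           AS distinct_values,
           SUM(CASE WHEN phone IS NULL THEN 1 ELSE 0 END)  AS null_count,
           MIN(LENGTH(phone))                              AS min_length,
           MAX(LENGTH(phone))                              AS max_length
    FROM stg_customer;

    -- Value frequencies help spot outliers, typos and unexpected string patterns.
    SELECT phone, COUNT(*) AS freq
    FROM stg_customer
    GROUP BY phone
    ORDER BY freq DESC;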
Metadata are data about data. When used in a data warehouse, metadata are the data
that define warehouse objects. Metadata are created for the data names and
definitions of the given warehouse. Additional metadata are created and captured for
time stamping any extracted data, the source of the extracted data, and missing fields
that have been added by data cleaning or integration processes. A metadata repository should contain:
a description of the structure of the data warehouse (schema, dimensions, hierarchies, data mart locations)
operational metadata (data lineage, currency of data, monitoring information)
the algorithms used for summarization
the mapping from the operational environment to the data warehouse
data related to system performance
business metadata (business terms and definitions, data ownership)
ROLAP servers generally offer greater scalability than MOLAP servers.
Cube operation: a data cube over a SALES table is specified with an SQL-style aggregate query; computing the full cube requires computing every group-by of the dimensions, down to the empty group-by ().
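A sketch of such a cube query. The SALES table comes from the notes' example; the column names item, city, year and amount are assumed for illustration:

    -- GROUP BY CUBE (SQL:1999) computes all 2^3 group-bys of (item, city, year)
    -- in one statement, including the empty group-by (), i.e. the grand total.
    SELECT item, city, year, SUM(amount) AS total_sales
    FROM SALES
    GROUP BY CUBE (item, city, year);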
3.4.1 Cube Computation
The multiway array aggregation method (the array-based approach used by MOLAP servers) computes the cube as follows:
Partition the arrays into chunks (a chunk is a small subcube that fits in memory).
Use compressed sparse array addressing: (chunk_id, offset).
Compute the aggregates in "multiway" fashion by visiting cube cells in an order that minimizes the number of times each cell is visited, reducing memory access and storage cost.
3.4.2 Indexing OLAP data
The bitmap indexing method is popular in OLAP products because it allows quick
searching in data cubes.
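In products that support it (Oracle, for example), a bitmap index on a low-cardinality dimension attribute is created with a dedicated statement. A sketch against the star schema assumed earlier (index names are illustrative):

    -- One bit vector per distinct value of the indexed column; AND/OR-ing the
    -- vectors answers multi-attribute predicates very quickly.
    CREATE BITMAP INDEX item_type_bix     ON item_dim(type);
    CREATE BITMAP INDEX location_city_bix ON location_dim(city);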
The join indexing method gained popularity from its use in relational
database query processing. Traditional indexing maps the value in a given column to
a list of rows having that value. In contrast, join indexing registers the joinable rows
of two relations from a relational database. For example, if two relations R(RID, A) and S(B, SID) join on the attributes A and B, then the join index record contains the pair (RID, SID), where RID and SID are record identifiers from the R and S relations, respectively.
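Using the text's notation, a join index for R(RID, A) and S(B, SID) can be materialized as a small table of (RID, SID) pairs; a minimal sketch (the index table name is assumed):

    -- Precompute the joinable row pairs once, so later OLAP queries can use the
    -- index instead of recomputing the join of R and S.
    CREATE TABLE r_s_join_index AS
    SELECT R.RID, S.SID
    FROM R JOIN S ON R.A = S.B;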
Data warehouses support three kinds of data processing:
1. Information processing: querying, reporting and basic statistical analysis.
2. Analytical processing: multidimensional analysis of data warehouse data using basic OLAP operations.
3. Data mining: knowledge discovery from the warehouse data, as discussed in the note below.
Note:
Most data mining tools need to work on integrated, consistent, and cleaned
data, which requires costly data cleaning, data transformation and data integration as
preprocessing steps. A data warehouse constructed by such preprocessing serves as a
valuable source of high quality data for OLAP as well as for data mining.
Effective data mining needs exploratory data analysis. A user will often want
to traverse through a database, select portions of relevant data, analyze them at
different granularities, and present knowledge/results in different forms. On-line
analytical mining provides facilities for data mining on different subsets of data and
at different levels of abstraction, by drilling, pivoting, filtering, dicing and slicing on
a data cube and on some intermediate data mining results.
A metadata directory is used to guide the access of the data cube. The data cube
can be constructed by accessing and/or integrating multiple databases and/or by filtering a
data warehouse via a Database API which may support OLEDB or ODBC connections.
Since an OLAM engine may perform multiple data mining tasks, such as concept
description, association, classification, prediction, clustering, time-series analysis, etc., it
usually consists of multiple, integrated data mining modules and is more sophisticated than
an OLAP engine.