Chapter-2 DATA WAREHOUSE PDF
Chapter-2
DATA WAREHOUSING
Data warehousing is the process of combining enterprise-wide mutual data into a single
storage area from which end-users can easily run queries, make reports and perform analysis. Data
warehousing is the data management and analysis technology adopting an update driven principle.
Data warehouse systems are valuable tools in today's competitive and fast evolving world. The data
warehouse is a new approach to enterprise-wide computing at the architectural level. A data
warehouse can provide a central repository for large amounts of diverse and valuable information.
Data warehouse supports business analysis and decision making by creating an enterprise-wide
integrated database of summarized, historical information. It integrates data from multiple,
incompatible sources. By transforming data into meaningful information, a data warehouse allows
the business manager to perform more substantive, accurate and consistent analysis. Data
warehousing improves the productivity of corporate decision-makers through consolidation,
conversion, transformation and integration of operational data and provides a consistent view of the
enterprise.
DEFINITION:
A formal definition of a data warehouse by W H Inmon (in 1993)
“A data warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of
data in support of the management's decision-making process”.
SUBJECT-ORIENTED:
A data warehouse is organized around major subjects such as customer, products, sales, etc.
Data are organized according to subject instead of application.
For example, an insurance company using a data warehouse would organize their data by
customer, premium, and claim instead of by different products (auto, life, etc.).
The data organized by subject contains only the information necessary for decision-support
processing.
Information is presented according to specific subjects or areas of interest. Data is
manipulated to provide information about a particular subject.
A data warehouse focuses on the modeling and analysis of data for decision makers, not on daily
operations or transaction processing.
It provides a simple and concise view around particular subject issues by excluding data that
is not useful in the decision support process.
In operational systems, data is stored by individual applications or business processes, such
as data about an individual order or customer.
In data warehouse data is stored by real world business objects or events not by the
applications.
NON-VOLATILE:
Data in the warehouse is not updated or deleted in real time; once loaded it is retained, so
the warehouse accumulates a historical record that is only periodically refreshed from the
operational systems.
INTEGRATED:
A single source of information for and about understanding multiple areas of interest.
The data warehouse provides one-stop shopping and contains information about a variety
of subjects.
For Example: The University data warehouse has information on faculty, students
and staff, instructional workload, student outcomes, etc.
Data warehouse integrates the data so that inconsistencies are removed.
A data warehouse is usually constructed by integrating multiple, heterogeneous sources
such as relational databases, flat files, and OLTP (On-Line Transaction Processing) files.
When data resides in many separate applications in the operational environment, the
encoding of data is often inconsistent.
In one application gender may be coded as m/f, while in another it might be 0 and 1.
When data are moved from operational environment into the data warehouse, they assume
a consistent coding convention i.e. gender data transformed into m and f.
DATA WAREHOUSING CONCEPTS:
The core concept of the data warehouse is the multidimensional data model.
At the core of the design of the data warehouse lies a multidimensional view of the data
model.
The multidimensional data model stores data in the form of a data cube. A data cube allows
data to be viewed in multiple dimensions; although called a cube, it is not limited to three
dimensions and can be n-dimensional. Dimensions are the entities with respect to which an
organization wants to keep records.
The dimension model is developed for implementing data warehouses and data marts.
MDDM provides both a mechanism to store data and a way to perform business analysis.
The multidimensional data model is an integral part of OLAP because it provides answers
quickly and is designed to solve complex queries in real time.
The multi-dimensional data model is composed of logical cubes, measures, dimensions,
hierarchies, levels and attributes.
For example, in a store's sales records, dimensions allow the store to keep track of things
like monthly sales of items and the branches and locations at which the items were sold.
A multidimensional database helps to provide answers to complex business queries quickly
and accurately.
Data warehouses and Online Analytical Processing (OLAP) tools are based on a
multidimensional data model.
OLAP in data warehousing enables users to view data from different angles and dimensions.
There are two components of MDDM:
I. Dimensions : textual attributes by which the data is analyzed.
II. Facts : numeric measures used to analyze the business.
Data grouped or combined together in multidimensional matrices is called a Data
Cube.
Changing from one dimensional hierarchy to another is easily accomplished in a data
cube by a technique called pivoting (also known as rotation).
A popular conceptual model that influences data warehouse architecture is a
multidimensional view of data.
Figure demonstrates a multidimensional view of the information corresponding to
above Table. This model views data in the form of a data cube (or, more precisely, a
hypercube). It has three dimensions, namely gender, profession and year. Each
dimension can be divided into sub dimensions.
Multidimensional representation of data:
In a multidimensional data model, there is a set of numeric measures that are the main theme or
subject of the analysis. In this example, the numeric measure is employment. We can have more
than one numeric measure. Examples of numeric measures are sales, budget, revenue, inventory,
population, etc. Each numeric measure depends on a set of dimensions, which provide the context
for the measure. All the dimensions together are assumed to uniquely determine the measure. Thus,
the multidimensional data views a measure as a value placed in a cell in the multidimensional space.
Each dimension, in turn, is described by a set of attributes. In general, dimensions are the
perspectives or entities with respect to which an organization wants to keep records.
The formal definition of a data cube: an n-dimensional data cube, C[A1, A2, …, An], is a
database with n dimensions A1, A2, …, An, each of which represents a theme and contains
|Ai| distinct elements in the dimension Ai. Each distinct element of Ai corresponds
to a data row of C. A data cell in the cube, C[a1, a2, …, an], stores the numeric measure of the
data for Ai = ai, for all i. Thus, a data cell corresponds to an instantiation of all dimensions.
In the above example, C[gender, profession, year] is the data cube, and the data cell C[male, civil
engineer, 1992] stores 2780 as its associated measure. As |gender| = 2, |profession| = 6 and
|year| = 5, we have three dimensions with 2, 6 and 5 distinct elements, respectively.
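The cell notation above can be sketched in a few lines of Python: a cube is a mapping from an instantiation of all dimensions to a numeric measure. Only the 2780 value comes from the text; the other cells are invented for illustration.

```python
# Cells of C[gender, profession, year]: each key is an instantiation of all
# dimensions, each value a numeric measure (employment).
cube = {
    ("male", "civil engineer", 1992): 2780,   # the cell quoted in the text
    ("female", "civil engineer", 1992): 1520, # hypothetical values
    ("male", "teacher", 1992): 3100,
}

def measure(cells, gender, profession, year):
    """Look up the measure stored at one cell; empty cells read as 0."""
    return cells.get((gender, profession, year), 0)

print(measure(cube, "male", "civil engineer", 1992))  # -> 2780
```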
o DIMENSION MODELLING:
The concept of a dimension provides a lot of semantic information, especially about the
hierarchical relationship between its elements.
Dimension modelling is a special technique for structuring data around business
concepts.
Unlike ER modelling, which describes entities and relationships, dimension modelling
structures the numeric measures and the dimensions.
The dimension schema can represent the details of the dimensional modelling. The
following figures show the dimension modelling for our example.
Figure: Dimension modelling
o LATTICE OF CUBOIDS:
The dimension hierarchy helps us view the multidimensional data in several different
data cube representations.
Multidimensional data can be viewed as a lattice of cuboids.
The bottom-most cuboid is the base cuboid; it consists of all the data cells.
The top-most cuboid is the apex cuboid; it contains only one cell, with the numeric
measure aggregated over all n dimensions.
In lattice of cuboids, all other cuboids lie between the base cuboid and the apex cuboid.
The C[A1, A2, ... An] at the finest level of granularity is called the base cuboid and it
consists of all the data cells.
The (n-1) - D cubes are obtained by grouping the cells and computing the combined
numeric measure of a given dimension.
Finally, the coarsest level consists of one cell with numeric measures of all n
dimensions.
This is called an apex cuboid.
For example, for a one-dimensional cube the lattice of cuboids is a trivial one: it
contains just two cuboids, the base cuboid and the apex cuboid.
Consider an example: a store called Deccan Electronics may create a sales data
warehouse in order to keep records of the store's sales with respect to time, product
and location. Thus, the dimensions are time, product and location. These dimensions
allow the store to keep track of things like monthly sales of items and the locations at
which the items were sold. The dimension table for product contains the attributes item
name, brand, and type. The attributes shop, manager, city, region, state, and country
describe the dimension location.
These attributes are related by a total order forming a hierarchy, such as shop < city < state <
country.
This hierarchy is shown in Figure. An example of a partial order for the time dimension is the
attributes week, month, quarter and year.
The sales data warehouse includes the sales amount in rupees and the total number of units
sold. Note that we can have more than one numeric measure.
Figure shows the multidimensional model for such situations. The dimension hierarchies
considered for the data cube are time:(month <quarter < year); location: (city < province <
country); and product.
Figure shows the cuboid C[quarter, city, product]. In the given dimension hierarchies, the
base cuboid of the lattice is C[month, city, product] and the apex cuboid is
C[year, country, product]. Other intermediate cuboids in the lattice are C[quarter, province,
product], C[quarter, country, product], C[month, province, product], C[month, country,
product], C[year, city, product] and C[year, province, product].
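The lattice above can be computed mechanically: one cuboid per subset of the dimensions, each obtained by aggregating the base cells. The sketch below uses hypothetical sales figures, not data from the figure.

```python
from itertools import combinations

# Base cuboid: hypothetical sales cells keyed by (month, city, product).
base = {
    ("Jan", "Pune", "TV"): 10,
    ("Feb", "Pune", "TV"): 5,
    ("Jan", "Mumbai", "TV"): 7,
}
dims = ("month", "city", "product")

def cuboid(keep):
    """Aggregate the base cuboid onto the subset of dimensions in `keep`."""
    idx = [dims.index(d) for d in keep]
    out = {}
    for key, v in base.items():
        k = tuple(key[i] for i in idx)
        out[k] = out.get(k, 0) + v
    return out

# The full lattice: one cuboid per subset of dimensions (2**3 = 8 of them).
lattice = {keep: cuboid(keep)
           for r in range(len(dims) + 1)
           for keep in combinations(dims, r)}

print(len(lattice))                   # 8 cuboids in the lattice
print(lattice[()][()])                # apex cuboid: 22, all sales summed
print(lattice[("city",)][("Pune",)])  # 15, Pune sales over all months
```

The empty subset `()` yields the apex cuboid with its single cell; the full subset yields the base cuboid unchanged.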
o Dimension schema:
The multidimensional data model has the following conceptual basic components:
1. Summary measure: e.g., employment, sales, etc.
2. Summary function: e.g., sum
3. Dimension: e.g., gender, year, profession, state
4. Dimension hierarchy: e.g., professional class , profession.
SUMMARY MEASURES: Summary measure is essentially the main theme of the analysis in a
multidimensional model. A measure value is computed for a given cell by aggregating the data
corresponding to the respective dimension-value sets defining the cell. The measures can be
categorized into three groups based on the kind of aggregate function used.
Distributive: A numeric measure is distributive if it can be computed in a distributed
manner as follows. Suppose the data is partitioned into a few subsets; the measure for the
whole data can then be obtained by aggregating the measures of all partitions. For example,
count, sum, min and max are distributive measures.
Algebraic: An aggregate function is algebraic if it can be computed by an algebraic
function with some set of arguments, each of which may be obtained by a distributive
measure. For example, average is obtained by sum/count. Other examples of algebraic
functions are standard deviation and center-of-mass.
Holistic: An aggregate function is holistic if there is no constant bound on the storage
size needed to describe a sub aggregate. That is, there does not exist an algebraic function
that can be used to compute this function. Examples of such functions are median, mode,
most-frequent, etc.
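The three categories above can be demonstrated on a toy data set: a distributive measure (sum) combines across partitions directly, an algebraic one (average) needs only the per-partition (sum, count) pairs, while a holistic one (median) cannot be recovered from per-partition summaries.

```python
import statistics

data = [3, 1, 4, 1, 5, 9, 2, 6]
partitions = [data[:4], data[4:]]  # any partitioning of the data

# Distributive: the sum of the whole equals the sum of per-partition sums.
assert sum(data) == sum(sum(p) for p in partitions)

# Algebraic: the average is computable from two distributive measures,
# (sum, count), carried per partition.
s = sum(sum(p) for p in partitions)
n = sum(len(p) for p in partitions)
assert s / n == sum(data) / len(data)

# Holistic: the median of the whole cannot in general be derived from the
# per-partition medians; it needs the full data.
print(statistics.median(data))                     # 3.5
print([statistics.median(p) for p in partitions])  # [2.0, 5.5] -- not enough
```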
OLAP OPERATIONS:
Once we model our data warehouse in the form of a multidimensional data cube, it is
necessary to explore the different analytical tools with which to perform the complex
analysis of data.
These data analysis tools are called OLAP (On-Line Analytical Processing). OLAP is mainly
used to access the live data online and to analyze it.
OLAP tools are designed in order to achieve such analyses on very large databases. OLAP
provides a user-friendly environment for interactive data analysis.
In the multidimensional model, the data are organized into multiple dimensions and each
dimension contains multiple levels of abstraction. Such an organization provides the users
with the flexibility to view data from different perspectives. There exist a number of OLAP
operations on data cubes which allow interactive querying and analysis of the data.
OLAP operations exist to materialize different views of data, allowing interactive querying
and analysis of data.
The four basic operations of OLAP are:
i. Roll-up or Drill-up
ii. Roll-down or Drill-down
iii. Slicing and Dicing
iv. Pivot
Roll-Up/Drill-Up/Consolidation Operation:
o Deals with switching from a detailed to an aggregated level within the same classification
hierarchy.
o The roll-up operation performs dimension reduction.
o Roll-up is performed by climbing up a concept hierarchy for a dimension.
o Example: week > month > quarter
Consider the following cube illustrating temperature of certain days recorded weekly:
Fig_7: Example.
Assume we want to set up levels (hot (80-85), mild (70-75), cold (64-69)) in temperature from the
above cube. To do this we have to group columns and add up the values according to the concept
hierarchy. This operation is called roll-up. By doing this we obtain the following cube:
Fig_8: Rollup.
Here the roll-up operation groups the detailed temperature values into the levels hot, mild
and cold.
Roll-Up Operation
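The roll-up above can be sketched in Python: detailed readings (hypothetical values, not taken from the figure) are climbed up the concept hierarchy into the hot/mild/cold levels and aggregated.

```python
# Roll-up: map detailed temperature readings onto the coarser concept
# hierarchy levels hot (80-85), mild (70-75), cold (64-69), then aggregate.
def level(t):
    if 80 <= t <= 85:
        return "hot"
    if 70 <= t <= 75:
        return "mild"
    if 64 <= t <= 69:
        return "cold"
    return "other"

readings = [65, 82, 71, 68, 84, 73]  # hypothetical weekly readings
rollup = {}
for t in readings:
    rollup[level(t)] = rollup.get(level(t), 0) + 1  # aggregate by count

print(rollup)  # {'cold': 2, 'hot': 2, 'mild': 2}
```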
Roll-Down/Drill-Down Operation:
o Switching from an aggregated to a more detailed level within the same classification
hierarchy.
o It navigates from less detailed data to more detailed data.
o It can be realized in either of two ways:
By stepping down a concept hierarchy for a dimension.
By introducing a new dimension.
Performing the drill-down operation on the same cube mentioned above:
Fig: Rolldown
The result of a drill-down operation performed on the central cube by stepping down the
concept hierarchy for time (day < week) is shown. Drill-down occurs by descending the time
hierarchy from the level of week to the more detailed level of day. New dimensions can also
be added to the cube, since drill-down adds more detail to the given data.
Roll-Down Operation
Slicing and Dicing Operations:
o The slice operation selects one particular dimension from a given cube and provides a new
sub-cube.
o This operation is used for reducing the data cube by one or more dimensions. The slice
operation performs a selection on one dimension of the given cube, resulting in a subcube.
o For example, in the cube example above, if we make the selection temperature = cool we
will obtain the following cube:
Fig 0: Slicing.
o Figure 1 shows a slice operation where the sales data are selected from the central cube for
the dimension time, using the criteria time= 'Q2'.
o The dice operation selects two or more dimensions from a given cube and provides a new
sub cube.
o This operation is also used for reducing the data cube by one or more dimensions.
o This operation is for selecting a smaller data cube and analyzing it from different
perspectives. The dice operation defines a subcube by performing a selection on two or more
dimensions.
o For example, applying the selection (time = day 3 OR time = day 4) AND (temperature =
cool OR temperature = hot) to the original cube, we get the following subcube (still
two-dimensional):
o Figure 2 shows a dice operation on the central cube based on the following selection criteria,
which involves three dimensions: (location= "Mumbai" or "Pune ") and (time = "Q1" or
"Q2").
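Both selections above can be sketched directly on a set of cube cells. The cell values below are hypothetical; the dice predicate is the one from the text's example.

```python
# Cells of a small cube keyed by (day, temperature-level); counts are invented.
cells = {
    (1, "cool"): 4, (2, "cool"): 6, (3, "cool"): 3,
    (3, "hot"): 5, (4, "cool"): 2, (4, "hot"): 7,
}

# Slice: fix ONE dimension (temperature = cool) -> a lower-dimensional subcube.
slice_cool = {day: v for (day, temp), v in cells.items() if temp == "cool"}

# Dice: select on TWO OR MORE dimensions, as in the example above:
# (day = 3 or 4) and (temperature = cool or hot).
dice = {k: v for k, v in cells.items()
        if k[0] in (3, 4) and k[1] in ("cool", "hot")}

print(slice_cool)  # {1: 4, 2: 6, 3: 3, 4: 2}
print(len(dice))   # 4 cells survive the dice
```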
Pivot Operation:
o Pivot (also called "rotate") is a visualization operation which rotates the data axes in view
in order to provide an alternative presentation of the same data.
o Examples: rotating the axes of a 3-D cube, or transforming a 3-D cube into a series of
2-D planes.
o Pivot groups data with different dimensions.
o The cubes below show a 2-D representation of pivot.
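A minimal sketch of pivoting a 2-D view: the axes are swapped, but every cell keeps its value. The (quarter, product) view and its values are invented for illustration.

```python
# A 2-D view keyed by (quarter, product); values are hypothetical sales.
view = {("Q1", "TV"): 10, ("Q1", "PC"): 4, ("Q2", "TV"): 8}

# Pivot (rotate): swap the axes so that products index the rows instead.
pivoted = {(product, quarter): v for (quarter, product), v in view.items()}

print(pivoted[("TV", "Q2")])  # 8 -- same data, alternative presentation
```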
WAREHOUSE SCHEMA:
STAR SCHEMA:
Star schema consists of a single fact table and a single denormalized dimension table for each
dimension of the multidimensional data model.
SNOWFLAKE SCHEMA:
To support attribute hierarchies, the dimension tables can be normalized to create snowflake
schemas.
A snowflake schema consists of a single fact table and multiple dimension tables. Like the
Star Schema, each tuple of the fact table consists of a (foreign) key pointing to each of the
dimension tables that provide its multidimensional coordinates.
It also stores numerical values (non-dimensional attributes, and results of statistical
functions) for those coordinates. Dimension Tables in a star schema are denormalized, while
those in a snowflake schema are normalized.
The advantage of the snowflake schema is as follows:
A Normalized table is easier to maintain.
Normalizing also saves storage space, since an un-normalized Dimension Table tends
to be large and may contain redundant information:
However, the snowflake structure may reduce the effectiveness of navigating across the
tables due to the larger number of join operations.
An example of a snowflake schema for a company Deccan Electronics is given in Figure. It can be
seen that the dimension table for the items is normalized resulting in two tables namely, the item and
supplier tables.
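A minimal star-schema sketch for the Deccan Electronics example, using Python's built-in sqlite3; the table and column names are illustrative assumptions, not taken from the figure.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Star schema: one denormalized dimension table per dimension.
    CREATE TABLE dim_item(item_id INTEGER PRIMARY KEY, name TEXT,
                          brand TEXT, supplier TEXT);
    -- Fact table: foreign keys to dimensions plus numeric measures.
    CREATE TABLE fact_sales(item_id INTEGER REFERENCES dim_item(item_id),
                            amount REAL, units INTEGER);
""")
con.execute("INSERT INTO dim_item VALUES (1, 'TV', 'Onida', 'S1')")
con.execute("INSERT INTO fact_sales VALUES (1, 15000.0, 3)")

# In a snowflake schema, the supplier column would instead be normalized
# out into its own dim_supplier table, joined via a supplier_id key.
row = con.execute("""
    SELECT d.name, SUM(f.amount)
    FROM fact_sales f JOIN dim_item d ON f.item_id = d.item_id
    GROUP BY d.name
""").fetchone()
print(row)  # ('TV', 15000.0)
```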
FACT CONSTELLATION SCHEMA:
A Fact Constellation is a kind of schema where we have more than one Fact Table
sharing some Dimension Tables among them. It is also called a Galaxy Schema.
For example, let us assume that Deccan Electronics would like to have another Fact Table for
supply and delivery.
Figure: Fact Constellation Schema
DATA WAREHOUSING ARCHITECTURE:
The data warehouse structure can be visualized as a 3-tier architecture.
Tier 1 is essentially the warehouse server, Tier 2 is the OLAP-engine for analytical
processing, and Tier 3 is a client containing reporting tools, visualization tools, data
mining tools, querying tools, etc.
There is also the backend process which is concerned with extracting data from multiple
operational databases and from external sources; with cleaning, transforming and integrating
this data for loading into the data warehouse server; and of course, with periodically
refreshing the warehouse.
Tier 1 contains the main data warehouse. It can follow one of three models or some
combination of these. It can be single enterprise warehouse, or may contain several
departmental marts. The third model is to have a virtual warehouse.
Tier 2 follows three different ways of designing the OLAP engine, namely ROLAP, MOLAP
and extended SQL OLAP.
Below Figure shows a typical data warehousing architecture.
OR
DataWarehouse Architecture
Data Sources:
All the data related to any business organization is stored in operational databases, external
files and flat files.
These sources are application oriented.
E.g: Complete data of the organization such as training details, customer details, sales,
departments, transactions, employee details, etc.
Data present here is in different formats and is often not well documented.
Bottom Tier: Data Warehouse server
o The Data Warehouse server fetches only relevant information based on the data mining
request (mining knowledge from large amounts of data).
o E.g: Customer profile information provided by external consultants.
o Data is fed into the bottom tier by backend tools and utilities.
o Functions performed by the backend tools and utilities are:
Data Extraction
Data Cleaning
Data Transformation
Load
Refresh
o Bottom tier contains:
Data Warehouse
Meta data repository
Data Marts
Monitoring and Administration
Data Warehouse:
It is an optimized form of the operational database, containing only relevant information and
providing fast access to data.
Data Marts:
A subset of the data warehouse containing only small slices of its data.
E.g: Data pertaining to a single department.
Two types of data marts: dependent and independent (stand-alone).
Top Tier: Front-End Tools:
o Query Tools: Point and click creation of SQL used in customer mailing list.
o Reporting Tools: Production reporting tools, report writers.
o Analysis Tools: Prepare charts based on analysis.
o Data mining Tools: Mining knowledge, discover hidden piece of information, new
correlations, useful pattern.
WAREHOUSE SERVER:
o A data warehouse server is the physical storage used by a data warehouse system.
o Various processed data and other relevant information that comes from several applications
and sources are stored in a data warehouse server where it is organized for future business
analysis and user query purposes.
o There are three data warehouse models:
Enterprise Warehouse
Data Marts
Virtual Data Warehouse
ENTERPRISE WAREHOUSE:
This model collects all the information about the subjects, spanning the entire
organization.
It provides corporate wide data integration, usually from one or more operational
systems or external information providers.
An enterprise data warehouse is a unified database that holds all the business
information of an organization and makes it accessible all across the company.
DATA MARTS:
Data Marts are partitions of the overall data warehouse. If we visualize the data
warehouse as covering every aspect of a company's business (sales, purchasing, payroll,
and so forth), then a data mart is a subset of that huge data warehouse built specifically
for a department.
Data marts may contain some overlapping data. A store sales data mart, for example,
would also need some data from inventory and payroll.
There are several ways to partition the data, such as by business function or geographic
region. There are many alternatives to design a data warehouse.
One feasible option is to start with a set of data marts for each of the component
departments. One can have a stand-alone data mart or a dependent data mart.
The current trend is to define the data warehouse as a conceptual environment.
The industry is moving away from a single, physical data warehouse toward a set of
smaller, more manageable, databases called data marts.
The physical data marts together serve as the conceptual data warehouse. These marts
must provide the easiest possible access to information required by its user community.
o Stand-alone (Independent) Data Mart:
Independent data marts, in contrast, are standalone systems built by drawing
data directly from operational or external sources of data or both.
In a bottom-up approach a data mart development is “Independent” of
enterprise data warehouse.
This approach enables a department or work-group to implement a data mart
with minimal or no impact on the enterprise's operational database.
Data Marts:
o It defines a single subject.
o It stores department specific information.
o Design for middle management users.
VIRTUAL DATA WAREHOUSE:
When end-users access the “system of record” (the OLTP system) directly and generate
“summarized data” reports and thereby given the feel of a “data warehouse”, such a data
warehouse is known as a “Virtual data warehouse”.
This model creates a virtual view of databases, allowing the creation of a "virtual warehouse"
as opposed to a physical warehouse.
In a virtual warehouse, we have a logical description of all the databases and their structures,
and individuals who want to get information from those databases do not have to know
anything about them.
This approach creates a single "virtual database" from all the data resources. The data
resources can be local or remote.
In this type of a data warehouse, the data is not moved from the sources. Instead, the users are
given direct access to the data, sometimes through simple SQL queries or view definitions.
The virtual data warehouse scheme lets a client application access data distributed across
multiple data sources through a single SQL statement, a single interface.
All data sources are accessed as though they are local; users and their applications do not
even need to know the physical location of the data.
A virtual database is easy and fast, but it is not without problems. Since the queries must
compete with the production data transactions, its performance can be considerably degraded.
Since there is no metadata, no summary data or history, all the queries must be repeated,
creating an additional burden on the system.
METADATA:
Metadata is data that describes other data.
Metadata summarizes basic information about data, which can make finding and working
with particular instances of data easier.
Metadata can be created manually, or by automated information processing.
Manual creation tends to be more accurate, allowing the user to input any information
they feel is relevant or needed to help describe the file.
Automated metadata creation can be much more elementary, usually only displaying
information such as file size, file extension, when the file was created and who created
the file.
Metadata serves to identify the contents and location of data in the warehouse.
Metadata is a bridge between the data warehouse and the decision support application.
In addition to providing a logical linkage between data and application, metadata can
isolate access to information across the entire data warehouse, and can enable the
development of applications which automatically update themselves to reflect data
warehouse content changes.
Metadata is needed to provide a definite interpretation. Metadata provides a catalogue of
data in the data warehouse and the pointers to this data.
Metadata may also contain: data extraction/ transformation history, column aliases, data
warehouse table sizes, data communication/modelling algorithms and data usage
statistics.
Metadata is also used to describe many aspects of the applications, including hierarchical
relationships, stored formulae, whether calculations have to be performed before or after
consolidation, currency conversion information, time series information, item description
and notes for reporting, security and access controls, data update status, formatting
information, data sources, availability of pre-calculated summary tables, and data storage
parameters.
In the absence of this information, the actual data cannot be interpreted correctly. In simple
terms, metadata is the "data about data".
Business metadata includes business terms and definitions, data ownership
information, and changing policies.
OLAP ENGINE:
o The main function of the OLAP engine is to present the user a multidimensional view of the
data warehouse and to provide tools for OLAP operations.
o If the warehouse server organizes the data warehouse in the form of multidimensional arrays,
then the implementational considerations of the OLAP engine are different from those when
the server keeps the warehouse in a relational form.
o There are three options of the OLAP engine:
Specialized SQL Server:
This model assumes that the warehouse organizes data in a relational structure and
the engine provides an SQL-like environment for OLAP tools.
The main idea is to exploit the capabilities of SQL. We shall see that the standard
SQL is not suitable for OLAP operations.
However, some researchers, (and some vendors) are attempting to extend the
abilities of SQL to provide OLAP operations.
ROLAP( Relational OLAP):
ROLAP works with data that resides in a relational database, where the base data
and dimension tables are stored as relational tables.
This model permits multidimensional analysis of data as this enables users to
perform a function equivalent to that of the traditional OLAP slicing and dicing
features.
This is achieved through the use of an SQL reporting tool to extract or ‘query’ data
directly from the data warehouse.
Each action of slicing and dicing is equivalent to adding a "WHERE" clause in the
SQL statement.
The ROLAP approach begins with the premise that data does not need to be
stored multidimensionally to be viewed multidimensionally.
Normally, the ROLAP engine formulates optimized SQL statements that it sends
to the RDBMS server.
It then takes the data back from the server, reintegrates it, and performs further
analysis and computation before delivering the finished results to the user.
Two important features of ROLAP are
The data warehouse and the relational database are inseparable.
In a multidimensional store, any change in the dimensional structure requires a
physical reorganization of the database, which is too time consuming; ROLAP
avoids this, so a ROLAP tool is then the appropriate choice.
Advantages of ROLAP :
Can handle large amounts of data: The data size limitation of ROLAP
technology is the limitation on data size of the underlying relational database.
ROLAP itself places no limitation on data amount.
Disadvantages of ROLAP :
Performance can be slow: The query time can be long if the underlying data
size is large.
Limited by SQL functionalities: It is difficult to perform complex calculations
using SQL
Figure:ROLAP
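The "each slice or dice adds a WHERE clause" idea above can be sketched as simple SQL string building; table and column names are assumptions for illustration, and a real ROLAP engine would generate parameterized, optimized SQL rather than concatenated strings.

```python
# Each slicing/dicing action maps to an extra predicate in the WHERE clause,
# which is how a ROLAP engine serves OLAP operations from relational tables.
def rolap_query(selections):
    """Build an aggregate query from a list of (column, value) selections."""
    where = " AND ".join(f"{col} = '{val}'" for col, val in selections)
    sql = "SELECT SUM(amount) FROM fact_sales"
    return sql + (" WHERE " + where if where else "")

print(rolap_query([]))                                   # base query
print(rolap_query([("time", "Q2")]))                     # a slice
print(rolap_query([("time", "Q2"), ("city", "Pune")]))   # a dice
```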
MOLAP(Multi-Dimensional OLAP):
The third option is to have a special purpose Multidimensional Data Model for the
data warehouse, with a Multidimensional OLAP (MOLAP) server for analysis.
MOLAP servers support multidimensional views of data through array-based data
warehouse servers.
They map multidimensional views of a data cube to array structures.
Note: The advantage of using a data cube is that it allows fast indexing to
precomputed summarized data. Where a multidimensional data store achieves
acceptable storage utilization, MOLAP is recommended.
Advantages of MOLAP:
Excellent performance: MOLAP cubes are built for fast data retrieval,
and are optimal for slicing and dicing operations.
Can perform complex calculations: All calculations have been pre-
generated when the cube is created. Hence, complex calculations are easier
to generate and return the result quickly.
Disadvantages of MOLAP:
Limited in the amount of data it can handle: Because all calculations are
performed when the cube is built, it is not possible to include a large
amount of data in the cube itself.
Requires additional investment: Cube technology is often proprietary
and may not already exist in the organization. Therefore, to adopt MOLAP
technology, chances are that additional investments in human and capital
resources will be needed.
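The array-based storage that MOLAP servers use can be sketched in plain Python: each dimension value maps to an integer position, and a cell is reached by direct indexing rather than by searching relational rows. Dimension values and measures below are invented.

```python
# MOLAP maps the multidimensional view onto array structures: each dimension
# value gets an integer position, and a cell is found by direct indexing.
genders = ["male", "female"]
years = [1991, 1992]

# Dense 2-D array of measures (hypothetical employment counts).
array = [[100, 120],   # male:   1991, 1992
         [ 90, 110]]   # female: 1991, 1992

def cell(gender, year):
    """Positional lookup: dimension value -> array index -> measure."""
    return array[genders.index(gender)][years.index(year)]

print(cell("female", 1992))  # 110 -- fast positional lookup, no search
```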
ROLAP vs MOLAP:
The following arguments can be given in favour of MOLAP:
Relational tables are unnatural for multidimensional data.
Multidimensional arrays provide efficiency in storage and operations.
There is a mismatch between multidimensional operations and SQL.
For ROLAP to achieve efficiency, it has to perform outside current relational systems,
which is the same as what MOLAP does.
The following arguments can be given in favour of ROLAP:
ROLAP integrates naturally with existing technology and standards.
MOLAP does not support ad hoc queries effectively, because it is optimized for
multidimensional operations.
Since data has to be downloaded into MOLAP systems, updating is difficult.
The efficiency of ROLAP can be achieved by using techniques such as encoding and
compression.
ROLAP can readily take advantage of parallel relational technology.
DATA EXTRACTION:
Data extraction is the process of extracting data for the warehouse from various sources. The
data may come from a variety of sources, such as production data, legacy data, internal
office systems, external systems, metadata.
DATA CLEANING:
Data cleaning is essential in the construction of quality data warehouses.
The data cleaning techniques include
1. Using transformation rules, e.g., translating attribute names like 'age' to 'DOB';
2. Using domain-specific knowledge;
3. Performing parsing and fuzzy matching, e.g., for multiple data sources, one can
designate a preferred source as a matching standard, and
4. Auditing, i.e., discovering facts that flag unusual patterns.
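A transformation rule in the spirit of points 1 and 4 above: unify the m/f and 0/1 gender encodings mentioned earlier into one convention, flagging unrecognized codes for auditing. The mapping of 0 to m and 1 to f is an assumption chosen for illustration.

```python
# Cleaning rule: unify gender encodings from two source applications.
# ASSUMPTION: source B codes 0 as male and 1 as female.
RULES = {"m": "m", "f": "f", "0": "m", "1": "f", "male": "m", "female": "f"}

def clean_gender(value):
    """Map a raw gender code to the warehouse convention (m/f)."""
    v = str(value).strip().lower()
    if v not in RULES:
        # Auditing: flag unusual values instead of silently passing them.
        raise ValueError(f"unrecognized gender code: {value!r}")
    return RULES[v]

print([clean_gender(x) for x in ["M", 0, "f", "1"]])  # ['m', 'm', 'f', 'f']
```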
DATA TRANSFORMATION:
The sources of data for data warehouse are usually heterogeneous. Data transformation is
concerned with transforming heterogeneous data to a uniform structure so that the data can
be combined and integrated.
Convert data from legacy or host format to warehouse format.
The transformation of data into a desired state include functions such as data formatting,
splitting data, joining data, creating rows and columns, using lookup tables or creating
combinations within the data.
LOADING:
A loading system should also allow system administrators to monitor the status, cancel,
suspend, resume loading or change the loading rate, and restart loading after failures without
any loss of data integrity.
There are different data loading strategies: Batch loading, Sequential loading, Incremental
loading.
REFRESH:
When the source data is updated, we need to update the warehouse. This process is called the
refresh function.
Refresh policy is set by the data administrator, based on user needs and data traffic.